
digitalmars.D - utf-32 text

reply "Carlos Santander B." <carlos8294 msn.com> writes:
Somebody enlighten (sp?) me, please.
AFAIU, this code:

//////////////////////////////////
import std.file;
import std.utf;

void main ()
{
    // the program text, held as UTF-8 in a char[]
    char [] u32 = `import std.stdio; void main() { writefln("adiós"); }`;
    // the UTF-32LE byte-order mark: U+FEFF encoded as FF FE 00 00
    void [] txt = cast (void[]) "\xFF\xFE\x00\x00";
    // convert the text to UTF-32 and write BOM + text
    write("test32.d", txt ~ cast(void[]) toUTF32(u32));
}

//////////////////////////////////

Should produce a valid D program:
import std.stdio; void main() { writefln("adiós"); }

In fact, DMD accepts it. Now my questions:
1. How do I edit the created file (test32.d)? I tried a number of different
editors and not even one of them could display the text correctly. Notepad shows
something like " i m p o r t ..." and that's the general case (NULL before every
letter). Why is that? SciTe thought it was UTF16-LE, and probably the rest of
them too.
2. UTF32 is always 4 bytes per character, right? Then why did the resulting
program output this? "adi??s" (2 bytes for "ó", 1 for the rest). Further testing
showed it was the output as if it was UTF8. Did I miss something in the process?
(FWIW, the original file was saved as UTF8 and UTF16-BE).
3. I tried to use the other BOM (00 00 FE FF) for testing and the results were
exactly the same. Do BE or LE matter at all? However I could do this:
"\u0000\uFEFF", but not this: "\uFFFE\u0000" ("invalid UTF character
\U0000fffe"). Why is that? Is that the correct way to use \u?
4. If I save a file as, say, UTF8 and then assign a string literal to a dchar
[], does DMD convert it automatically or does it produce an invalid string?

Take it for what it is: just ignorance.

-----------------------
Carlos Santander Bernal
Sep 06 2004
Arcane Jill <Arcane_member pathlink.com> writes:
In article <chja8o$20d3$1 digitaldaemon.com>, Carlos Santander B. says...
Somebody enlighten (sp?) me, please.
AFAIU, this code:

//////////////////////////////////
import std.file;
import std.utf;

void main ()
{
    char [] u32 = `import std.stdio; void main() { writefln("adiós"); }`;
    void [] txt = cast (void[]) "\xFF\xFE\x00\x00";
    write("test32.d", txt ~ cast(void[]) toUTF32(u32));
}

//////////////////////////////////

Should produce a valid D program:
It does.
In fact, DMD accepts it.
As it should.
Now my questions:
1. How do I edit the created file (test32.d)? I tried a number of different
editors and not even one of them could display the text correctly.
That's because most Windows text editors don't grok UTF-32. You can blame this
on Microsoft. Microsoft incorrectly lists the following encodings:

* "ANSI"                 SHOULD BE: WINDOWS-1252 (NOT an ANSI standard)
* "Unicode"              SHOULD BE: UTF-16LE
* "Unicode (big endian)" SHOULD BE: UTF-16BE

and most Windows text editors follow suit.
Notepad shows
something like " i m p o r t ..." and that's the general case (NULL before every
letter). Why is that? SciTe thought it was UTF16-LE, and probably the rest of
them too.
You will have to ask individual text editor vendors that. One editor which gets it /right/ is SC Unipad (www.unipad.org). Unfortunately this is hideously expensive.
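For what it's worth, part of the problem is that the UTF-32LE BOM
(FF FE 00 00) /begins with/ the UTF-16LE BOM (FF FE), so any sniffer that
tests the two-byte patterns first will call a UTF-32LE file UTF-16LE. Here is
a minimal sketch of a check that tests in the right order (sniffBOM is a
hypothetical helper, not a Phobos function):

//////////////////////////////////
import std.file;
import std.stdio;

// Hypothetical helper: identify an encoding by its BOM. The four-byte
// UTF-32 patterns must be tested before the two-byte UTF-16 ones,
// because FF FE 00 00 begins with FF FE.
char[] sniffBOM(ubyte[] b)
{
    if (b.length >= 4 && b[0] == 0x00 && b[1] == 0x00 && b[2] == 0xFE && b[3] == 0xFF)
        return "UTF-32BE";
    if (b.length >= 4 && b[0] == 0xFF && b[1] == 0xFE && b[2] == 0x00 && b[3] == 0x00)
        return "UTF-32LE";
    if (b.length >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF)
        return "UTF-8";
    if (b.length >= 2 && b[0] == 0xFE && b[1] == 0xFF)
        return "UTF-16BE";
    if (b.length >= 2 && b[0] == 0xFF && b[1] == 0xFE)
        return "UTF-16LE";
    return "unknown";
}

void main()
{
    ubyte[] raw = cast(ubyte[]) std.file.read("test32.d");
    writefln(sniffBOM(raw));   // should print: UTF-32LE
}
//////////////////////////////////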
2. UTF32 is always 4 bytes per character, right? Then why did the resulting
program output this? "adi??s" (2 bytes for "ó", 1 for the rest). Further testing
showed it was the output as if it was UTF8. Did I miss something in the process?
(FWIW, the original file was saved as UTF8 and UTF16-BE).
You didn't miss anything. Blame it on the text editor.
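If you want to convince yourself that the /file/ is fine and the editor is at
fault, dump its first few bytes (a sketch using std.file.read):

//////////////////////////////////
import std.file;
import std.stdio;

void main()
{
    // A correct UTF-32LE file starts with the BOM FF FE 00 00,
    // then 69 00 00 00 ('i'), 6D 00 00 00 ('m'), and so on.
    ubyte[] raw = cast(ubyte[]) std.file.read("test32.d");
    foreach (ubyte b; raw[0 .. 16])
        writef("%02X ", b);
    writefln("");
}
//////////////////////////////////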
3. I tried to use the other BOM (00 00 FE FF) for testing and the results were
exactly the same. Do BE or LE matter at all?
To Unicode, yes. To an application which doesn't understand it, no.
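A tiny sketch to make the byte order visible (the array cast just repaints
the same memory as bytes):

//////////////////////////////////
import std.stdio;

void main()
{
    dchar[] bom = "\uFEFF";           // one UTF-32 code unit
    ubyte[] raw = cast(ubyte[]) bom;  // view the same memory as 4 bytes
    // A little-endian machine prints FF FE 00 00;
    // a big-endian machine prints 00 00 FE FF.
    foreach (ubyte b; raw)
        writef("%02X ", b);
    writefln("");
}
//////////////////////////////////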
However I could do this:
"\u0000\uFEFF", but not this: "\uFFFE\u0000" ("invalid UTF character
\U0000fffe"). Why is that? Is that the correct way to use \u?
\u is used to denote a Unicode codepoint, and nothing else. It should /not/ be
used to inject bytes into a byte array. The actual bytes inserted will depend
on the encoding of the character literal -- normally UTF-8 in D, although
there are arguments that D should be more flexible in this regard.

The phrase "invalid UTF character" is meaningless, since there is no such
thing as a "UTF character". However, U+FFFE is a noncharacter codepoint, and
it is indeed invalid to find such a codepoint in a conformant Unicode string
(which of course is precisely why U+FEFF was chosen as the byte-order mark).
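To make that concrete, here is a small sketch (it relies on the compile-time
literal conversion described in the next answer; the lengths assume the usual
UTF-8/UTF-16/UTF-32 storage of D's three string types):

//////////////////////////////////
import std.stdio;

void main()
{
    // \uFEFF names the codepoint U+FEFF; the bytes actually stored
    // depend on the type of the string it ends up in.
    char []  a = "\uFEFF";   // UTF-8:  three bytes, EF BB BF
    wchar [] b = "\uFEFF";   // UTF-16: one code unit, FEFF
    dchar [] c = "\uFEFF";   // UTF-32: one code unit, 0000FEFF
    writefln("%d %d %d", a.length, b.length, c.length);   // prints: 3 1 1
}
//////////////////////////////////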
4. If I save a file as, say, UTF8 and then assign a string literal to a dchar
[], does DMD convert it automatically or does it produce an invalid string?
Current DMD behavior is:

*) COMPILE-TIME constants are converted.
*) Values known only at RUN-TIME are not.

Again, plenty of us believe that this is not the best way for DMD to behave,
and that implicit conversion should happen always, just as it does from short
to int, because such conversions generate zero loss of information.
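Concretely (a sketch; toUTF32 from std.utf is the explicit route for run-time
values):

//////////////////////////////////
import std.utf;

void main()
{
    dchar [] a = "hello";       // literal: compile-time constant, converted
    char [] s = "hello";
    // dchar [] b = s;          // run-time value: no implicit conversion
    dchar [] b = toUTF32(s);    // must be converted explicitly
}
//////////////////////////////////

Arcane Jill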
Sep 07 2004
"Carlos Santander B." <carlos8294 msn.com> writes:
"Arcane Jill" <Arcane_member pathlink.com> escribió en el mensaje
news:chjpcj$284s$1 digitaldaemon.com
|
| ...
|
| Arcane Jill

Thanks

-----------------------
Carlos Santander Bernal
Sep 07 2004
James McComb <alan jamesmccomb.id.au> writes:
Arcane Jill wrote:

 Again, plenty of us believe that this is not the best way for DMD to behave,
 and that implicit conversion should happen always, just as it does from short
 to int, because such conversions generate zero loss of information.
That sounds like a beautiful knockdown argument:

short-->int does not lose information, so it happens implicitly.
dchar-->char does not lose information, so it should happen implicitly.

+1 for implicit conversions between char, wchar and dchar.

James McComb
Sep 08 2004
Arcane Jill <Arcane_member pathlink.com> writes:
In article <chognu$1hnj$1 digitaldaemon.com>, James McComb says...

That sounds like a beautiful knockdown argument:

short-->int does not lose information, so it happens implicitly.
Yes. That's what happens now, and it's perfectly sensible.
dchar-->char does not lose information, so it should happen implicitly.
Huh? I think you may be a little confused there. dchar-->char is not lossless.
(And for that matter, char-->dchar, IMO, should either require an explicit
cast or throw a UTF exception if the char value is >= 0x80.)
+1 for implicit conversions between char, wchar and dchar.
Lossless conversion is possible between char[], wchar[] and dchar[] - but
/not/ between char, wchar and dchar. Please be aware of the difference.
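A sketch of why the array forms are the lossless ones (note the two-code-unit
'ó'):

//////////////////////////////////
import std.utf;

void main()
{
    char [] s = "adiós";        // UTF-8: the 'ó' takes two code units
    dchar [] d = toUTF32(s);    // whole-array conversion is lossless
    assert(s.length == 6 && d.length == 5);
    // A single char can hold only one of those two code units, so a
    // scalar char-->dchar conversion cannot be lossless in general.
}
//////////////////////////////////

Arcane Jill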
Sep 08 2004