
digitalmars.D - Encoding and doFormat

Sean Kelly <sean f4.ca> writes:
Is there any instance where we might want to output UTF-16 or UTF-32 encoded
strings from doFormat?  Alternately, is there any instance where we might want
to read UTF-16 or UTF-32 encoded streams?  I've been kicking around how best to
handle this aspect of unFormat and was suddenly struck by the UTF-8 limitation
in the current implementation of doFormat.


Sean
Jul 20 2004
"Walter" <newshound digitalmars.com> writes:
"Sean Kelly" <sean f4.ca> wrote in message
news:cdjpll$1pks$1 digitaldaemon.com...
 Is there any instance where we might want to output UTF-16 or UTF-32 encoded
 strings from doFormat?  Alternately, is there any instance where we might want
 to read UTF-16 or UTF-32 encoded streams?  I've been kicking around how best to
 handle this aspect of unFormat and was suddenly struck by the UTF-8 limitation
 in the current implementation of doFormat.
doFormat() isn't limited by UTF-8. In fact, its output is in dchars, which are UTF-32. Look at std.stdio.writef(); it writes its output based on what the stream's format is set to.
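For reference, a minimal writef call (illustrative only; doFormat does the formatting work underneath):

    import std.stdio;

    void main()
    {
        // writef hands its arguments, along with their TypeInfo, to doFormat
        writef("count = %d, name = %s\n", 42, "test");
    }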
Jul 20 2004
Sean Kelly <sean f4.ca> writes:
In article <cdk2sj$1tvi$1 digitaldaemon.com>, Walter says...
doFormat() isn't limited by UTF-8. In fact, its output is in dchars, which
are UTF-32. Look at std.stdio.writef(); it writes its output based on what
the stream's format is set to.
Oops. I misread part of the documentation on std.format where it was talking about printing portions of the format string. So input and output are in dchars? That makes life quite easy. I guess unFormat is pretty much finished then.

Sean
Jul 20 2004
parent reply "Walter" <newshound digitalmars.com> writes:
"Sean Kelly" <sean f4.ca> wrote in message
news:cdk425$1uid$1 digitaldaemon.com...
 In article <cdk2sj$1tvi$1 digitaldaemon.com>, Walter says...
doFormat() isn't limited by UTF-8. In fact, its output is in dchars, which
are UTF-32. Look at std.stdio.writef(); it writes its output based on what
the stream's format is set to.
 Oops. I misread part of the documentation on std.format where it was talking
 about printing portions of the format string.  So input and output are in dchars?
 That makes life quite easy.  I guess unFormat is pretty much finished then.

Input is chars, wchars, or dchars.
Jul 20 2004
Sean Kelly <sean f4.ca> writes:
Walter wrote:
 
 Input is chars, wchars, or dchars.
Right, because all char types can be implicitly cast to dchar, correct?

Sean
Jul 20 2004
"Walter" <newshound digitalmars.com> writes:
"Sean Kelly" <sean f4.ca> wrote in message
news:cdkm15$25m7$1 digitaldaemon.com...
 Walter wrote:
 Input is chars, wchars, or dchars.
Right, because all char types can be implicitly cast to dchar, correct?
Not exactly. doFormat() examines the type of each argument and performs any conversions as necessary. Implicit conversions are not involved.
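The shape of that, as an illustrative sketch (the names here are made up; this is not the actual Phobos code): each argument arrives with its TypeInfo, and the formatter branches on it, converting to dchars explicitly for the output delegate.

    // Illustrative per-argument dispatch on TypeInfo; no implicit casts involved.
    void putArg(TypeInfo ti, void* arg, void delegate(dchar) putc)
    {
        if (ti == typeid(char))
            putc(cast(dchar) *cast(char*) arg);   // safe only for ASCII-range values
        else if (ti == typeid(wchar))
            putc(cast(dchar) *cast(wchar*) arg);  // safe only outside the surrogate range
        else if (ti == typeid(dchar))
            putc(*cast(dchar*) arg);
    }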
Jul 20 2004
Arcane Jill <Arcane_member pathlink.com> writes:
In article <cdkm15$25m7$1 digitaldaemon.com>, Sean Kelly says...
Walter wrote:
 
 Input is chars, wchars, or dchars.
Right, because all char types can be implicitly cast to dchar, correct?
They can be implicitly cast, but they cannot be /correctly/ cast. I have mentioned this before (and suggested that it be considered a bug) but Walter was adamant that the runtime overhead involved in checking would be undesirable.

The problem can be demonstrated by example. Suppose you cast a char containing the UTF-8 fragment 0xC0 (which would ordinarily be the first byte of a two-byte UTF-8 sequence) to a dchar: it will be erroneously converted to U+00C0, instead of (as I would prefer) throwing a UTF conversion exception. In general, char values >0x7F should not be cast to wchars or dchars, because these values are /not characters/.
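A minimal demonstration (the decode signature here is from memory and may differ in the Phobos of the day):

    import std.utf;

    void main()
    {
        char c = 0xC0;          // a UTF-8 fragment, not a complete character
        dchar d = c;            // compiles silently; d is now U+00C0, which is wrong

        // A correct conversion decodes the whole multibyte sequence and validates it:
        char[] s = "\u00C0";    // U+00C0 is encoded in UTF-8 as the bytes 0xC3 0x80
        uint i = 0;
        dchar ok = decode(s, i);  // throws on malformed UTF-8 instead of guessing
    }

Arcane Jill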
Jul 21 2004
Derek Parnell <derek psych.ward> writes:
On Wed, 21 Jul 2004 07:24:42 +0000 (UTC), Arcane Jill wrote:

 In article <cdkm15$25m7$1 digitaldaemon.com>, Sean Kelly says...
Walter wrote:
 
 Input is chars, wchars, or dchars.
Right, because all char types can be implicitly cast to dchar, correct?
They can be implicitly cast, but they cannot be /correctly/ cast. I have mentioned this before (and suggested that it be considered a bug) but Walter was adamant that the runtime overhead involved in checking would be undesirable. The problem can be demonstrated by example. Suppose you cast a char containing the UTF-8 fragment 0xC0 (which would ordinarily be the first byte of a two-byte UTF-8 sequence) to a dchar: it will be erroneously converted to U+00C0, instead of (as I would prefer) throwing a UTF conversion exception. In general, char values >0x7F should not be cast to wchars or dchars, because these values are /not characters/.
 Arcane Jill
(Jill, I'm not criticizing, disputing, arguing, being ornery, etc. I'm just trying to understand UTF better, and I think you'd be one of the best sources at the moment.)

If D char variables are supposed to be UTF-8 characters, then why does D allow a char to contain non-UTF-8 bit patterns (e.g. a UTF-8 fragment)? I can see that a byte could, but a char? Or is D a bit simple here?

-- 
Derek
Melbourne, Australia
21/Jul/04 5:42:59 PM
Jul 21 2004
Arcane Jill <Arcane_member pathlink.com> writes:
In article <cdl74j$2ea7$1 digitaldaemon.com>, Derek Parnell says...

If D char variables are supposed to be UTF-8 characters, then why does D
allow a char to contain non-UTF-8 bit patterns (eg. a UTF-8-fragment)? I
can see that a byte could, but a char? Or is D a bit simple here?
Oh, that's easy to answer. Okay, first off, there is no such thing as a "UTF-8 character". UTF-8 is an encoding of Unicode, so there is only a "Unicode character", which may be encoded in UTF-8 as a multi-byte sequence. So a char, in fact, can /only/ contain a UTF-8 fragment.

Fortunately, there are some UTF-8 multibyte sequences which are, in fact, exactly one byte long. The Unicode characters represented by such one-byte sequences are the characters U+0000 to U+007F inclusive - in other words, ASCII. UTF-8 was designed that way on purpose, to maintain compatibility with ASCII. Thus, if a char contains a value in the range 0x00 to 0x7F inclusive, it may be interpreted either as an ASCII character or as a "one-byte UTF-8 fragment which happens to represent a complete Unicode character". Both interpretations are equally valid and interchangeable.

On the other hand, if a char contains a value in the range 0x80 to 0xF8, it can /only/ be a UTF-8 fragment, since these bytes form part of multibyte sequences which are /at least/ two bytes long, and so cannot be equated with a single character. (Values in the range 0xF9 to 0xFF are completely meaningless. That's one reason why char.init is 0xFF.)

To answer your question, "Why does D allow a char to contain non-UTF-8 bit patterns?" - for the same reason that it allows a dchar to contain non-UTF-32 bit patterns: it's simply a platform-native integer type. Constraining char to contain values in the range 0x00 to 0xF8 (or constraining dchar to values in the range 0x00000000 to 0x0010FFFF, excluding 0x0000D800 to 0x0000DFFF) would add run-time overhead that is simply not necessary.

If I have misunderstood your question, and you were actually intending to ask "Why does D allow a char to contain UTF-8 fragments which cannot be interpreted in isolation?", then the answer has to be that char exists so that char[] can exist. Only in a /string/ does UTF-8 make any real sense. A string needs an array, and an array has to be an array of /something/. char is that something.
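To put those ranges in code form (illustrative helpers only, nothing from Phobos):

    // Classify a single char according to the ranges described above.
    bool isCompleteSequence(char c) { return c <= 0x7F; }       // ASCII: a one-byte sequence
    bool isFragment(char c) { return c >= 0x80 && c <= 0xF8; }  // part of a multibyte sequence
    bool isMeaningless(char c) { return c > 0xF8; }             // 0xF9 to 0xFF, e.g. char.init

Any help?

Arcane Jill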
Jul 21 2004
Sean Kelly <sean f4.ca> writes:
Arcane Jill wrote:

 In article <cdkm15$25m7$1 digitaldaemon.com>, Sean Kelly says...
 
Walter wrote:

Input is chars, wchars, or dchars.
Right, because all char types can be implicitly cast to dchar, correct?
They can be implicitly cast, but they cannot be /correctly/ cast. I have mentioned this before (and suggested that it be considered a bug) but Walter was adamant that the runtime overhead involved in checking would be undesirable. The problem can be demonstrated by example. Suppose you cast a char containing the UTF-8 fragment 0xC0 (which would ordinarily be the first byte of a two-byte UTF-8 sequence) to a dchar: it will be erroneously converted to U+00C0, instead of (as I would prefer) throwing a UTF conversion exception. In general, char values >0x7F should not be cast to wchars or dchars, because these values are /not characters/.
Oops, right. What unFormat does is read everything into dchars, then convert to UTF-8 before writing to a char array or to UTF-16 before writing to a wchar array. The missing piece is converting from UTF-8 or UTF-16 when reading, which should be done in a day or two--I decided to rewrite the utf routines to allow a put/get delegate.
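For what it's worth, the get-delegate shape looks something like this (an illustrative UTF-8 decoder, not the rewritten Phobos routines; continuation-byte validation is omitted for brevity):

    // Decode one code point, pulling raw bytes through a get delegate so the
    // same logic can be fed from a string, a stream, or an unget buffer.
    dchar decodeUTF8(ubyte delegate() get)
    {
        ubyte b = get();
        if (b <= 0x7F)
            return b;                              // one-byte sequence (ASCII)
        int extra;
        uint d;
        if ((b & 0xE0) == 0xC0)      { extra = 1; d = b & 0x1F; }
        else if ((b & 0xF0) == 0xE0) { extra = 2; d = b & 0x0F; }
        else if ((b & 0xF8) == 0xF0) { extra = 3; d = b & 0x07; }
        else throw new Exception("invalid UTF-8 lead byte");
        while (extra--)
            d = (d << 6) | (get() & 0x3F);         // trusts continuation bytes
        return cast(dchar) d;
    }

Sean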
Jul 21 2004
Arcane Jill <Arcane_member pathlink.com> writes:
In article <cdjpll$1pks$1 digitaldaemon.com>, Sean Kelly says...
Is there any instance where we might want to output UTF-16 or UTF-32 encoded
strings from doFormat?  Alternately, is there any instance where we might want
to read UTF-16 or UTF-32 encoded streams?  I've been kicking around how best to
handle this aspect of unFormat and was suddenly struck by the UTF-8 limitation
in the current implementation of doFormat.

Sean
Conventionally, streams are considered to be a sequence of octets (bytes). So, by that reasoning, you would never want to read or write UTF-16 or UTF-32 to/from a stream, because the units of those formats are not eight bits wide.

However, the formats UTF-16LE, UTF-16BE, UTF-32LE and UTF-32BE are eight-bit-wide standards, merely consisting of UTF-16 and UTF-32 in little-endian and big-endian representation respectively. You would certainly expect to encounter these. Text files can use them, for example.

Going beyond streams, there are also "wide" streams, called Readers and Writers in Java, and maybe filters more generically. They exist for such purposes as transcoding, uppercasing, etc., but don't, in general, send their data to consoles, files, sockets, etc. (because those devices expect 8-bit-wide input). Transcoders, of course, are capable of transforming a "wide" stream to a normal stream. A UTF-16LE encoder, for example, is utterly trivial.

I don't know what doFormat does, so I can't answer your specific question. I can say, though, that if you're processing bytes or ubytes, you don't need to bother with UTF-32 or UTF-16 (or even UTF-8). If you're processing characters, however, you may well be better off keeping everything in dchars throughout.

I don't know if that's helpful or not. What does doFormat() do? What's it for? (I should probably know that, but I'm not that on-the-case right now.)
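As for the trivial encoder: going from UTF-16 to UTF-16LE is just emitting each code unit low byte first (a sketch; the function name is made up):

    // Trivial UTF-16 to UTF-16LE encoder: each wchar becomes two bytes,
    // least significant byte first. Assumes the input is already valid UTF-16.
    ubyte[] toUTF16LE(wchar[] s)
    {
        ubyte[] result = new ubyte[s.length * 2];
        foreach (uint i, wchar w; s)
        {
            result[i * 2]     = cast(ubyte) (w & 0xFF);
            result[i * 2 + 1] = cast(ubyte) (w >> 8);
        }
        return result;
    }

Arcane Jill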
Jul 20 2004
Sean Kelly <sean f4.ca> writes:
In article <cdk49v$1ump$1 digitaldaemon.com>, Arcane Jill says...
In article <cdjpll$1pks$1 digitaldaemon.com>, Sean Kelly says...
Is there any instance where we might want to output UTF-16 or UTF-32 encoded
strings from doFormat?  Alternately, is there any instance where we might want
to read UTF-16 or UTF-32 encoded streams?  I've been kicking around how best to
handle this aspect of unFormat and was suddenly struck by the UTF-8 limitation
in the current implementation of doFormat.

Sean
Conventionally, streams are considered to be a sequence of octets (bytes). So, by that reasoning, you would never want to read or write UTF-16 or UTF-32 to/from a stream, because the units of those formats are not eight bits wide. However, the formats UTF-16LE, UTF-16BE, UTF-32LE and UTF-32BE are eight-bit-wide standards, merely consisting of UTF-16 and UTF-32 in little-endian and big-endian representation respectively. You would certainly expect to encounter these. Text files can use them, for example.

Going beyond streams, there are also "wide" streams, called Readers and Writers in Java, and maybe filters more generically. They exist for such purposes as transcoding, uppercasing, etc., but don't, in general, send their data to consoles, files, sockets, etc. (because those devices expect 8-bit-wide input). Transcoders, of course, are capable of transforming a "wide" stream to a normal stream. A UTF-16LE encoder, for example, is utterly trivial.

I don't know what doFormat does, so I can't answer your specific question. I can say, though, that if you're processing bytes or ubytes, you don't need to bother with UTF-32 or UTF-16 (or even UTF-8). If you're processing characters, however, you may well be better off keeping everything in dchars throughout. I don't know if that's helpful or not. What does doFormat() do? What's it for? (I should probably know that, but I'm not that on-the-case right now.)
doFormat handles all the work for writef. And while writef is really for console and file output, doFormat can do in-memory string formatting as well.

It turns out I miswrote my original question, as writef uses the wide char output routines but doFormat deals entirely in dchars. So I think it would be safe for me to have readf use wide char input routines and leave unFormat as-is. This leaves aside the issue of configurable encoding, but from what you've said perhaps that isn't much of a problem.

Sean
Jul 20 2004
Sean Kelly <sean f4.ca> writes:
In article <cdk74m$1vqj$1 digitaldaemon.com>, Sean Kelly says...
It turns out I miswrote my original question, as writef uses the wide char
output routines but doFormat deals entirely in dchars.  So I think it would be
safe for me to have readf use wide char input routines and leave unFormat as-is.
That's what I get for posting hastily. writef actually seems to switch between UTF-8 and (possibly) UTF-16 based on information gleaned from the file pointer. Looks like I'm going to have to play with readf a little more, though I'm not looking forward to handling unget. Stinkin' multibyte encoding schemes.
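The unget headache: after probing a multibyte sequence, more than one raw byte may need to be pushed back, so a one-byte ungetc is not enough. Something like a small pushback buffer in front of the real source would do (sketch only, hypothetical names):

    // A byte source with multi-byte pushback, for speculative decoding.
    class PushbackSource
    {
        private ubyte[] buf;              // pushed-back bytes, most recent last
        private ubyte delegate() source;

        this(ubyte delegate() source) { this.source = source; }

        ubyte get()
        {
            if (buf.length > 0)
            {
                ubyte b = buf[buf.length - 1];
                buf.length = buf.length - 1;
                return b;
            }
            return source();
        }

        void unget(ubyte b) { buf ~= b; }
    }

Sean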
Jul 20 2004