
digitalmars.D - Improving D's support of code-pages

reply Kirk McDonald <kirklin.mcdonald gmail.com> writes:
D's support for Unicode is a wonderful thing. The ability to comprehend 
UTF-8, -16, and -32 strings in a straightforward, native fashion is 
invaluable. However, the outside world consists of more than encodings 
of Unicode. The ability to deal with code pages in a straightforward 
manner should be considered absolutely vital.

I will be describing what I think is the optimal way of dealing with 
code-pages. This includes some changes and additions to Phobos (or 
Tango, if that is your preferred platform; to be truthful I am unsure of 
the state of code pages in that library), as well as describing a new D 
idiom.

(I am aware that Mango has some bindings to the ICU code-page conversion 
libraries, but this sort of functionality /really/ belongs in the 
standard library.)

The idiom is this: A string not known to be encoded in UTF-8, -16, or 
-32 should be stored as a ubyte[]. All internal string manipulation 
should be done in one of the Unicode encoding types (char[], wchar[], or 
dchar[]), and all input and output should be done with the ubyte[] type. 
There are some exceptions to this, of course. If you're reading input 
which you know to be in one of D's Unicode encoding types, or writing 
something out in one of those formats, naturally there's no reason you 
shouldn't just read into or write from the D type directly.
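For concreteness, here is a minimal sketch of the idiom in use. (This is 
only an illustration: it assumes the decode() and encode() functions 
proposed below, and the file names are placeholders.)

import std.file;

void example()
{
    // Raw input of unknown encoding stays in a ubyte[].
    ubyte[] raw = cast(ubyte[]) std.file.read("input.txt");
    // Once the encoding is known, decode into UTF-8 for internal work.
    char[] text = decode(raw, "latin-1");
    // ... all string manipulation happens on 'text' ...
    // On the way out, encode back into raw bytes.
    std.file.write("output.txt", encode(text, "latin-1"));
}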

However, in many real-world situations, you are not reading something in 
a Unicode encoding, nor do you always want to write one out. This is 
particularly the case when writing something to the console. Not all 
Windows command lines or Linux shells are set up to handle UTF-8, though 
this is very common on Linux. My Windows command-line, however, uses the 
default of the ancient CP437, and this is not uncommon. The point is 
that, on many systems, outputting raw UTF-8 results in garbage.

----
Additions to Phobos
----

The first things Phobos needs are the following functions. (Their basic 
interface has been cribbed from Python.)

char[] decode(ubyte[] str, string encoding, string error="strict");
wchar[] wdecode(ubyte[] str, string encoding, string error="strict");
dchar[] ddecode(ubyte[] str, string encoding, string error="strict");

ubyte[] encode(char[] str, string encoding, string error="strict");
ubyte[] encode(wchar[] str, string encoding, string error="strict");
ubyte[] encode(dchar[] str, string encoding, string error="strict");

What follows is a description of these functions. For the sake of 
simplicity, I will only be referring to the char[] versions of these 
functions. The wchar[] and dchar[] versions should operate in an 
identical fashion.

Let's say you've read in a file and stored it in a ubyte[]:

ubyte[] file = something();

You're already in a bit of a situation, here, if you don't know the 
encoding of the file. If you've gotten this far without knowing it, or 
knowing how to get it, you probably need to re-think your design.

Let's say you know the file is in Latin-1. Since all of D's 
string-processing facilities expect to deal with a Unicode encoding, you 
want to convert this to UTF-8. You should just be able to decode it:

char[] str = decode(file, "latin-1");

And, ta-da! Your string is now converted to UTF-8, and all of D's string 
processing abilities can be brought to bear.

Now let's say that, after you've done whatever you were going to with 
the string, you want to write it back out in Latin-1. This is just a 
simple call to encode:

ubyte[] new_file = encode(str, "latin-1");

But wait! What if the UTF-8 string contains characters which are not 
valid Latin-1 characters? This is where the 'error' parameter comes into 
play. (Note that the error parameter is present in both encode and 
decode.) This parameter has three valid values:

  * "strict" causes an exception to be thrown. This is the default.
  * "ignore" causes the invalid characters to simply be ignored, and 
elided from the returned string.
  * "replace" causes the invalid characters to be replaced with a 
suitable replacement character. When calling decode, this should be the 
official U+FFFD REPLACEMENT CHARACTER. When calling encode, something 
specific to the code-page would have to be chosen; a '?' would be 
appropriate in the various ASCII-based code pages.
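To make the three modes concrete, here is a rough but runnable sketch of 
what encode() might look like for the Latin-1 case alone. (The dispatch 
on the encoding name, and every encoding other than Latin-1, is elided; 
this is not meant as the actual implementation.)

ubyte[] encode(char[] str, string encoding, string error = "strict")
{
    assert(encoding == "latin-1"); // sketch handles only this encoding
    ubyte[] result;
    foreach (dchar c; str) // foreach decodes the UTF-8 into code points
    {
        if (c <= 0xFF)
            result ~= cast(ubyte) c;   // fits in Latin-1
        else if (error == "ignore")
            continue;                  // elide the character
        else if (error == "replace")
            result ~= cast(ubyte) '?'; // code-page replacement character
        else // "strict"
            throw new Exception("cannot encode character in latin-1");
    }
    return result;
}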

Using strings rather than an enum means this functionality could be 
extended by the user in the future (as Python allows).

Latin-1 is not a very interesting encoding. The situation gets more 
interesting if we are talking about a multi-byte encoding, such as 
UTF-16. So let's say we're reading a file encoded in UTF-16:

ubyte[] utf16_file = whatever();
char[] str = decode(utf16_file, "utf-16");

While you /could/ simply cast the ubyte[] to a wchar[], this code has 
the advantage of totally separating the encoding of your program's 
input from the type with which you represent the data internally.

Using UTF-16 also means you might have errors during decoding, if there 
are invalid UTF-16 code units in the input string.

These functions might fit into std.string, although a new module such as 
std.codepages would work, as well.

----
Improvements to Phobos
----

The behavior of writef (and perhaps of D's formatting in general) must 
be altered.

Currently, printing a char[] causes D to output the raw bytes in the 
string. As I previously mentioned, this is not a good thing. On many 
platforms, this can easily result in garbage being printed to the screen.

I propose changing writef to check the console's encoding, and to 
attempt to encode the output in that encoding. Then it can simply output 
the resulting raw bytes. Checking this encoding is a platform-specific 
operation, but essentially every platform (particularly Linux, Windows, 
and OS X) has a way to do it. If the string cannot be encoded in that 
encoding, the exception thrown by encode() should be allowed to 
propagate and terminate the program (or be caught by the user). If the 
user wishes to avoid that exception, they should call encode() 
explicitly themselves. For this reason, Phobos will also need to make a 
function for retrieving the console's default encoding available to the user.
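In other words, the proposal amounts to something like the following 
sketch. (get_console_encoding() is the hypothetical accessor just 
mentioned; the function name is only illustrative.)

import std.cstream;

void writef_to_console(char[] s)
{
    // May throw if 's' cannot be represented in the console's encoding.
    ubyte[] raw = encode(s, get_console_encoding());
    std.cstream.dout.write(raw); // emit the raw bytes verbatim
}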

This implies something else: Printing a ubyte[] should cause those 
actual bytes to be printed directly. While it is currently possible to 
do this with e.g. std.cstream.dout.write(), it would be very convenient 
to do this with writef, especially combined with encode().

-- 
Kirk McDonald
http://kirkmcdonald.blogspot.com
Pyd: Connecting D and Python
http://pyd.dsource.org
Aug 18 2007
next sibling parent reply Walter Bright <newshound1 digitalmars.com> writes:
Kirk McDonald wrote:
 ----
 Additions to Phobos
 ----
 
 The first things Phobos needs are the following functions. (Their basic 
 interface has been cribbed from Python.)
 
 char[] decode(ubyte[] str, string encoding, string error="strict");
 wchar[] wdecode(ubyte[] str, string encoding, string error="strict");
 dchar[] ddecode(ubyte[] str, string encoding, string error="strict");
 
 ubyte[] encode(char[] str, string encoding, string error="strict");
 ubyte[] encode(wchar[] str, string encoding, string error="strict");
 ubyte[] encode(dchar[] str, string encoding, string error="strict");
If you (or someone else) wants to write these, I'll put them in.
 ----
 Improvements to Phobos
 ----
 
 The behavior of writef (and perhaps of D's formatting in general) must 
 be altered.
 
 Currently, printing a char[] causes D to output the raw bytes in the 
 string. As I previously mentioned, this is not a good thing. On many 
 platforms, this can easily result in garbage being printed to the screen.
 
 I propose changing writef to check the console's encoding, and to 
 attempt to encode the output in that encoding. Then it can simply output 
 the resulting raw bytes. Checking this encoding is a platform-specific 
 operation, but essentially every platform (particularly Linux, Windows, 
 and OS X) has a way to do it. If the string cannot be encoded in that 
 encoding, the exception thrown by encode() should be allowed to 
 propagate and terminate the program (or be caught by the user). If the 
 user wishes to avoid that exception, they should call encode() 
 explicitly themselves. For this reason, Phobos will also need to make a 
 function for retrieving the console's default encoding available to the user.
There's a big problem with this - what if the output is being sent to a file?
Aug 18 2007
parent reply Kirk McDonald <kirklin.mcdonald gmail.com> writes:
Walter Bright wrote:
 Kirk McDonald wrote:
 
 ----
 Additions to Phobos
 ----

 The first things Phobos needs are the following functions. (Their basic 
 interface has been cribbed from Python.)

 char[] decode(ubyte[] str, string encoding, string error="strict");
 wchar[] wdecode(ubyte[] str, string encoding, string error="strict");
 dchar[] ddecode(ubyte[] str, string encoding, string error="strict");

 ubyte[] encode(char[] str, string encoding, string error="strict");
 ubyte[] encode(wchar[] str, string encoding, string error="strict");
 ubyte[] encode(dchar[] str, string encoding, string error="strict");
If you (or someone else) wants to write these, I'll put them in.
It is not a small amount of work. Perhaps I will take a look at how big of a problem it is (after the conference).
 ----
 Improvements to Phobos
 ----

 The behavior of writef (and perhaps of D's formatting in general) must 
 be altered.

 Currently, printing a char[] causes D to output the raw bytes in the 
 string. As I previously mentioned, this is not a good thing. On many 
 platforms, this can easily result in garbage being printed to the screen.

 I propose changing writef to check the console's encoding, and to 
 attempt to encode the output in that encoding. Then it can simply 
 output the resulting raw bytes. Checking this encoding is a 
 platform-specific operation, but essentially every platform 
 (particularly Linux, Windows, and OS X) has a way to do it. If the 
 string cannot be encoded in that encoding, the exception thrown by 
 encode() should be allowed to propagate and terminate the program (or 
 be caught by the user). If the user wishes to avoid that exception, 
 they should call encode() explicitly themselves. For this reason, 
 Phobos will also need to make a function for retrieving the console's 
 default encoding available to the user.
There's a big problem with this - what if the output is being sent to a file?
Files have no inherent encoding, only the console does. In this way, writing to a file is different than writing to the console. The user must explicitly provide an encoding when writing to a file; or, if they are writing a char[], wchar[], or dchar[], the encoding will be UTF-8, -16, or -32. (Writing a char[] implies an encoding, while writing a ubyte[] does not.)

-- 
Kirk McDonald
http://kirkmcdonald.blogspot.com
Pyd: Connecting D and Python
http://pyd.dsource.org
Aug 18 2007
next sibling parent reply Kirk McDonald <kirklin.mcdonald gmail.com> writes:
Kirk McDonald wrote:
 Walter Bright wrote:
 
 Kirk McDonald wrote:

 ----
 Additions to Phobos
 ----

 The first things Phobos needs are the following functions. (Their 
 basic interface has been cribbed from Python.)

 char[] decode(ubyte[] str, string encoding, string error="strict");
 wchar[] wdecode(ubyte[] str, string encoding, string error="strict");
 dchar[] ddecode(ubyte[] str, string encoding, string error="strict");

 ubyte[] encode(char[] str, string encoding, string error="strict");
 ubyte[] encode(wchar[] str, string encoding, string error="strict");
 ubyte[] encode(dchar[] str, string encoding, string error="strict");
If you (or someone else) wants to write these, I'll put them in.
It is not a small amount of work. Perhaps I will take a look at how big of a problem it is (after the conference).
 ----
 Improvements to Phobos
 ----

 The behavior of writef (and perhaps of D's formatting in general) 
 must be altered.

 Currently, printing a char[] causes D to output the raw bytes in the 
 string. As I previously mentioned, this is not a good thing. On many 
 platforms, this can easily result in garbage being printed to the 
 screen.

 I propose changing writef to check the console's encoding, and to 
 attempt to encode the output in that encoding. Then it can simply 
 output the resulting raw bytes. Checking this encoding is a 
 platform-specific operation, but essentially every platform 
 (particularly Linux, Windows, and OS X) has a way to do it. If the 
 string cannot be encoded in that encoding, the exception thrown by 
 encode() should be allowed to propagate and terminate the program (or 
 be caught by the user). If the user wishes to avoid that exception, 
 they should call encode() explicitly themselves. For this reason, 
 Phobos will also need a function for retrieving the console's default 
 encoding made available to the user.
There's a big problem with this - what if the output is being sent to a file?
Files have no inherent encoding, only the console does. In this way, writing to a file is different than writing to the console. The user must explicitly provide an encoding when writing to a file; or, if they are writing a char[], wchar[], or dchar[], the encoding will be UTF-8, -16, or -32. (Writing a char[] implies an encoding, while writing a ubyte[] does not.)
I should clarify this: When treating stdout like a file, it should be like any other file: writing to it writes raw bytes. But when calling writef, which is not treating it like a file, it should attempt to encode the output into the console's default encoding.

-- 
Kirk McDonald
http://kirkmcdonald.blogspot.com
Pyd: Connecting D and Python
http://pyd.dsource.org
Aug 18 2007
parent reply BCS <ao pathlink.com> writes:
Reply to Kirk,

 Kirk McDonald wrote:
 
 Walter Bright wrote:
 
 Kirk McDonald wrote:
 
 ----
 Additions to Phobos
 ----
 The first things Phobos needs are the following functions. (Their
 basic interface has been cribbed from Python.)
 
 char[] decode(ubyte[] str, string encoding, string error="strict");
 wchar[] wdecode(ubyte[] str, string encoding, string
 error="strict"); dchar[] ddecode(ubyte[] str, string encoding,
 string error="strict");
 
 ubyte[] encode(char[] str, string encoding, string error="strict");
 ubyte[] encode(wchar[] str, string encoding, string
 error="strict"); ubyte[] encode(dchar[] str, string encoding,
 string error="strict");
 
If you (or someone else) wants to write these, I'll put them in.
It is not a small amount of work. Perhaps I will take a look at how big of a problem it is (after the conference).
 ----
 Improvements to Phobos
 ----
 The behavior of writef (and perhaps of D's formatting in general)
 must be altered.
 
 Currently, printing a char[] causes D to output the raw bytes in
 the string. As I previously mentioned, this is not a good thing. On
 many platforms, this can easily result in garbage being printed to
 the screen.
 
 I propose changing writef to check the console's encoding, and to
 attempt to encode the output in that encoding. Then it can simply
 output the resulting raw bytes. Checking this encoding is a
 platform-specific operation, but essentially every platform
 (particularly Linux, Windows, and OS X) has a way to do it. If the
 string cannot be encoded in that encoding, the exception thrown by
 encode() should be allowed to propagate and terminate the program
 (or be caught by the user). If the user wishes to avoid that
 exception, they should call encode() explicitly themselves. For
 this reason, Phobos will also need to make a function for retrieving
 the console's default encoding available to the user.
 
There's a big problem with this - what if the output is being sent to a file?
Files have no inherent encoding, only the console does. In this way, writing to a file is different than writing to the console. The user must explicitly provide an encoding when writing to a file; or, if they are writing a char[], wchar[], or dchar[], the encoding will be UTF-8, -16, or -32. (Writing a char[] implies an encoding, while writing a ubyte[] does not.)
I should clarify this: When treating stdout like a file, it should be like any other file: writing to it writes raw bytes. But when calling writef, which is not treating it like a file, it should attempt to encode the output into the console's default encoding.
"Stream" has a writef, so you can call writef for a file.
Aug 18 2007
parent reply Kirk McDonald <kirklin.mcdonald gmail.com> writes:
BCS wrote:
 Reply to Kirk,
 Kirk McDonald wrote:
 Files have no inherent encoding, only the console does. In this way,
 writing to a file is different than writing to the console. The user
 must explicitly provide an encoding when writing to a file; or, if
 they are writing a char[], wchar[], or dchar[], the encoding will be
 UTF-8, -16, or -32. (Writing a char[] implies an encoding, while
 writing a ubyte[] does not.)
I should clarify this: When treating stdout like a file, it should be like any other file: writing to it writes raw bytes. But when calling writef, which is not treating it like a file, it should attempt to encode the output into the console's default encoding.
"Stream" has a writef, so you can call writef for a file.
But, again, files have no inherent encoding. (And if you are treating stdout as a file, it shouldn't have one, either.) This business about implicitly encoding things should be limited to the std.stdio.writef (and writefln, &c) function. Treating stdout as a file should be considered a way to get 'raw' access to stdout.

-- 
Kirk McDonald
http://kirkmcdonald.blogspot.com
Pyd: Connecting D and Python
http://pyd.dsource.org
Aug 18 2007
parent reply BCS <ao pathlink.com> writes:
Reply to Kirk,

 BCS wrote:
 
 Reply to Kirk,
 
 Kirk McDonald wrote:
 
 Files have no inherent encoding, only the console does. In this
 way, writing to a file is different than writing to the console.
 The user must explicitly provide an encoding when writing to a file;
 or, if they are writing a char[], wchar[], or dchar[], the encoding
 will be UTF-8, -16, or -32. (Writing a char[] implies an encoding,
 while writing a ubyte[] does not.)
 
I should clarify this: When treating stdout like a file, it should be like any other file: writing to it writes raw bytes. But when calling writef, which is not treating it like a file, it should attempt to encode the output into the console's default encoding.
"Stream" has a writef, so you can call writef for a file.
But, again, files have no inherent encoding. (And if you are treating stdout as a file, it shouldn't have one, either.) This business about implicitly encoding things should be limited to the std.stdio.writef (and writefln, &c) function. Treating stdout as a file should be considered a way to get 'raw' access to stdout.
I was looking the other way. So you are saying that only the console functions should have the code page stuff? What about dout? It goes to the console and also has a writef. I'm not putting down your idea; I'm just looking for (and hoping not to find) problems.
Aug 18 2007
parent Kirk McDonald <kirklin.mcdonald gmail.com> writes:
BCS wrote:
 Reply to Kirk,
 
 BCS wrote:

 Reply to Kirk,

 Kirk McDonald wrote:

 Files have no inherent encoding, only the console does. In this
 way, writing to a file is different than writing to the console.
 The user must explicitly provide an encoding when writing to a file;
 or, if they are writing a char[], wchar[], or dchar[], the encoding
 will be UTF-8, -16, or -32. (Writing a char[] implies an encoding,
 while writing a ubyte[] does not.)
I should clarify this: When treating stdout like a file, it should be like any other file: writing to it writes raw bytes. But when calling writef, which is not treating it like a file, it should attempt to encode the output into the console's default encoding.
"Stream" has a writef, so you can call writef for a file.
But, again, files have no inherent encoding. (And if you are treating stdout as a file, it shouldn't have one, either.) This business about implicitly encoding things should be limited to the std.stdio.writef (and writefln, &c) function. Treating stdout as a file should be considered a way to get 'raw' access to stdout.
I was looking the other way. So you are saying that only the console functions should have the code page stuff? What about dout? It goes to the console and also has a writef. I'm not putting down your idea; I'm just looking for (and hoping not to find) problems.
The functions encode() and decode() are available on their own. If you want to explicitly encode something you're writing to a file, you can simply say e.g.:

somefile.write(encode(some_utf8_string, "cp437"));

Only the console functions would call this /implicitly/, since only they /have/ an implicit encoding (which is the console's encoding as reported by the OS). Since you might not always want to encode the stuff you print out, you should be able to use std.cstream.dout.writef() to get around this.

-- 
Kirk McDonald
http://kirkmcdonald.blogspot.com
Pyd: Connecting D and Python
http://pyd.dsource.org
Aug 18 2007
prev sibling parent reply Walter Bright <newshound1 digitalmars.com> writes:
Kirk McDonald wrote:
 Walter Bright wrote:
 There's a big problem with this - what if the output is being sent to 
 a file?
Files have no inherent encoding, only the console does. In this way, writing to a file is different than writing to the console. The user must explicitly provide an encoding when writing to a file; or, if they are writing a char[], wchar[], or dchar[], the encoding will be UTF-8, -16, or -32. (Writing a char[] implies an encoding, while writing a ubyte[] does not.)
The problem is that whatever is sent to a file should be the same as what is sent to the screen. Consider if stdout is piped to another application - what should happen?
Aug 18 2007
parent reply Kirk McDonald <kirklin.mcdonald gmail.com> writes:
Walter Bright wrote:
 Kirk McDonald wrote:
 
 Walter Bright wrote:

 There's a big problem with this - what if the output is being sent to 
 a file?
Files have no inherent encoding, only the console does. In this way, writing to a file is different than writing to the console. The user must explicitly provide an encoding when writing to a file; or, if they are writing a char[], wchar[], or dchar[], the encoding will be UTF-8, -16, or -32. (Writing a char[] implies an encoding, while writing a ubyte[] does not.)
The problem is that whatever is sent to a file should be the same as what is sent to the screen. Consider if stdout is piped to another application - what should happen?
Let's say there's a function get_console_encoding() which returns the console's current encoding. I'm simply proposing that this:

char[] str = something();
std.stdio.writefln(str);

Should end up being equivalent to this:

std.cstream.dout.writefln(encode(str, get_console_encoding()));

So, to answer your question, if you use std.stdio.writefln, you will send a string encoded in the console's default encoding to the other application's stdin. This is the encoding the other application should be expecting, anyway (unless it isn't; code pages are annoying like that).

In any event, if you explicitly want to output something in a particular encoding, this /will/ work:

std.stdio.writefln(encode(str, whatever));

This is because encode() returns a ubyte[], and writefln should print the data in ubyte[]s directly, as I suggested in my original post for this precise reason.

-- 
Kirk McDonald
http://kirkmcdonald.blogspot.com
Pyd: Connecting D and Python
http://pyd.dsource.org
Aug 18 2007
parent reply Walter Bright <newshound1 digitalmars.com> writes:
Kirk McDonald wrote:
 Walter Bright wrote:
 The problem is that whatever is sent to a file should be the same as 
 what is sent to the screen. Consider if stdout is piped to another 
 application - what should happen?
Let's say there's a function get_console_encoding() which returns the console's current encoding. I'm simply proposing that this:

char[] str = something();
std.stdio.writefln(str);

Should end up being equivalent to this:

std.cstream.dout.writefln(encode(str, get_console_encoding()));

So, to answer your question, if you use std.stdio.writefln, you will send a string encoded in the console's default encoding to the other application's stdin. This is the encoding the other application should be expecting, anyway (unless it isn't; code pages are annoying like that).

In any event, if you explicitly want to output something in a particular encoding, this /will/ work:

std.stdio.writefln(encode(str, whatever));

This is because encode() returns a ubyte[], and writefln should print the data in ubyte[]s directly, as I suggested in my original post for this precise reason.
stdio can detect whether it is being written to the console or not. That's fine. The problem is:

    foo

will generate one kind of output.

    foo | more

will do something else. This will result in a nice cascade of bug reports. There's also:

    foo >output
    cat output | more

which will do something else, too.
Aug 18 2007
parent reply Kirk McDonald <kirklin.mcdonald gmail.com> writes:
Walter Bright wrote:
 Kirk McDonald wrote:
 
 Walter Bright wrote:

 The problem is that whatever is sent to a file should be the same as 
 what is sent to the screen. Consider if stdout is piped to another 
 application - what should happen?
Let's say there's a function get_console_encoding() which returns the console's current encoding. I'm simply proposing that this:

char[] str = something();
std.stdio.writefln(str);

Should end up being equivalent to this:

std.cstream.dout.writefln(encode(str, get_console_encoding()));

So, to answer your question, if you use std.stdio.writefln, you will send a string encoded in the console's default encoding to the other application's stdin. This is the encoding the other application should be expecting, anyway (unless it isn't; code pages are annoying like that).

In any event, if you explicitly want to output something in a particular encoding, this /will/ work:

std.stdio.writefln(encode(str, whatever));

This is because encode() returns a ubyte[], and writefln should print the data in ubyte[]s directly, as I suggested in my original post for this precise reason.
stdio can detect whether it is being written to the console or not. That's fine. The problem is:

    foo

will generate one kind of output.

    foo | more

will do something else. This will result in a nice cascade of bug reports. There's also:

    foo >output
    cat output | more

which will do something else, too.
Pardon? I haven't said anything about stdio behaving differently whether it's printing to the console or not. writefln() would /always/ attempt to encode in the console's encoding.

-- 
Kirk McDonald
http://kirkmcdonald.blogspot.com
Pyd: Connecting D and Python
http://pyd.dsource.org
Aug 18 2007
parent reply Walter Bright <newshound1 digitalmars.com> writes:
Kirk McDonald wrote:
 Pardon? I haven't said anything about stdio behaving differently whether 
 it's printing to the console or not. writefln() would /always/ attempt 
 to encode in the console's encoding.
Ok, I misunderstood. Now, what if stdout is reopened to be a file?
Aug 18 2007
parent reply Kirk McDonald <kirklin.mcdonald gmail.com> writes:
Walter Bright wrote:
 Kirk McDonald wrote:
 
 Pardon? I haven't said anything about stdio behaving differently 
 whether it's printing to the console or not. writefln() would /always/ 
 attempt to encode in the console's encoding.
Ok, I misunderstood. Now, what if stdout is reopened to be a file?
I've been thinking about these issues more carefully. It is harder than I initially thought. :-)

Ignoring my ideas of implicitly encoding writefln's output, I regard the encode/decode functions as vital. These alone would improve the current situation immensely.

Printing ubyte[] arrays as the "raw bytes" therein when using writef() is basically nonsense, thanks to the fact that doFormat itself is Unicode aware. I should have realized this sooner. However, you can still write them with dout.write(). This should be adequate.

Here is another proposal regarding implicit encoding, slightly modified from my first one:

The Stream class should be modified to have an encoding attribute. This should usually be null. If it is present, output should be encoded into that encoding. (To facilitate this, the encoding module should provide a doEncode function, analogous to the doFormat function, which has a void delegate(ubyte) or possibly a void delegate(ubyte[]) callback.)

Next, std.stdio.writef should be modified to write to the object referenced by std.cstream.dout, instead of the FILE* stdout. The next step is obvious: std.cstream.dout's encoding attribute should be set to the console's encoding. Finally, though dout should obviously remain a CFile instance, it should be stored in a Stream reference.

If another Stream object is substituted for dout, then the behavior of writefln (and anything else relying on dout) would be redirected. Whether the output is still implicitly encoded would depend entirely on this new object's encoding attribute.

It occurs to me that this could be somewhat slow. Examination of the source reveals that every printed character from dout is the result of a virtual method call. However, I do wonder how important the performance of printing to the console really is.

Thoughts? Is this a thoroughly stupid idea?

-- 
Kirk McDonald
http://kirkmcdonald.blogspot.com
Pyd: Connecting D and Python
http://pyd.dsource.org
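To sketch what that encoding attribute might look like in code (a rough illustration only; the class and method names here are placeholders, not actual Phobos API):

class EncodingStream
{
    string encoding; // null means "pass bytes through unencoded"

    void writeString(char[] s)
    {
        if (encoding is null)
            writeRaw(cast(ubyte[]) s);     // raw UTF-8 bytes
        else
            writeRaw(encode(s, encoding)); // implicit conversion
    }

    void writeRaw(ubyte[] bytes)
    {
        // ... actual output of the raw bytes ...
    }
}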
Aug 18 2007
next sibling parent kris <foo bar.com> writes:
Kirk -

It's not a stupid idea, but you may not have all the necessary pieces? 
For example, this kind of processing should probably not be bound to an 
application by default (bloat?) and thus you'd perhaps need some 
mechanism to (dynamically) attach custom processing onto a stream?

Tango supports this via stream filters, and deewiant (for example) has 
an output filter for doing specific code-page conversion. Tango also has 
UnicodeFile as a template for converting between internal utf8/16/32 and 
an external UTF representation (all 8 varieties) along with BOM support; 
much as you were describing earlier.

The console is a PITA when it comes to encodings, especially when 
redirection is involved. Thus, we decided long ago that Tango would be 
utf8 only for console IO, and for all variations thereof ... gives it a 
known state. From there, either a filter or a replacement console-device 
can be injected into the IO framework for customization purposes.

Unix has a good lib for code-page support, called iconv. The IBM ICU 
project also has extensive code-page support, along with  a bus, 
helicopter, cruise-liner, and a kitchen-sink, all wrapped up in a very 
powerful (UTF16) API. But the latter is too heavyweight to be embedded 
in a core library, which is why those wrappers still reside in Mango 
rather than Tango. On the other hand, Tango does have a codepage API 
much like what you suggest, as a free-function lightweight converter

- Kris




Kirk McDonald wrote:
 Walter Bright wrote:
 Kirk McDonald wrote:

 Pardon? I haven't said anything about stdio behaving differently 
 whether it's printing to the console or not. writefln() would 
 /always/ attempt to encode in the console's encoding.
Ok, I misunderstood. Now, what if stdout is reopened to be a file?
I've been thinking about these issues more carefully. It is harder than I initially thought. :-)

Ignoring my ideas of implicitly encoding writefln's output, I regard the encode/decode functions as vital. These alone would improve the current situation immensely.

Printing ubyte[] arrays as the "raw bytes" therein when using writef() is basically nonsense, thanks to the fact that doFormat itself is Unicode aware. I should have realized this sooner. However, you can still write them with dout.write(). This should be adequate.

Here is another proposal regarding implicit encoding, slightly modified from my first one:

The Stream class should be modified to have an encoding attribute. This should usually be null. If it is present, output should be encoded into that encoding. (To facilitate this, the encoding module should provide a doEncode function, analogous to the doFormat function, which has a void delegate(ubyte) or possibly a void delegate(ubyte[]) callback.)

Next, std.stdio.writef should be modified to write to the object referenced by std.cstream.dout, instead of the FILE* stdout. The next step is obvious: std.cstream.dout's encoding attribute should be set to the console's encoding. Finally, though dout should obviously remain a CFile instance, it should be stored in a Stream reference.

If another Stream object is substituted for dout, then the behavior of writefln (and anything else relying on dout) would be redirected. Whether the output is still implicitly encoded would depend entirely on this new object's encoding attribute.

It occurs to me that this could be somewhat slow. Examination of the source reveals that every printed character from dout is the result of a virtual method call. However, I do wonder how important the performance of printing to the console really is.

Thoughts? Is this a thoroughly stupid idea?
Aug 19 2007
prev sibling next sibling parent Walter Bright <newshound1 digitalmars.com> writes:
Kirk McDonald wrote:
 I've been thinking about these issues more carefully. It is harder than 
 I initially thought. :-)
<g>
 Ignoring my ideas of implicitly encoding writefln's output, I regard the 
 encode/decode functions as vital. These alone would improve the current 
 situation immensely.
Sure.
 The Stream class should be modified to have an encoding attribute. This 
 should usually be null. If it is present, output should be encoded into 
 that encoding. (To facilitate this, the encoding module should provide a 
 doEncode function, analogous to the doFormat function, which has a void 
 delegate(ubyte) or possibly a void delegate(ubyte[]) callback.)
 
 Next, std.stdio.writef should be modified to write to the object 
 referenced by std.cstream.dout, instead of the FILE* stdout. The next 
 step is obvious: std.cstream.dout's encoding attribute should be set to 
 the console's encoding. Finally, though dout should obviously remain a 
 CFile instance, it should be stored in a Stream reference.
 
 If another Stream object is substituted for dout, then the behavior of 
 writefln (and anything else relying on dout) would be redirected. 
 Whether the output is still implicitly encoded would depend entirely on 
 this new object's encoding attribute.
 
 It occurs to me that this could be somewhat slow. Examination of the 
 source reveals that every printed character from dout is the result of a 
 virtual method call. However, I do wonder how important the performance 
 of printing to the console really is.
 
 Thoughts? Is this a thoroughly stupid idea?
I generally wish to avoid merging writef with streams, for performance reasons. Currently, stdout is marked as being "char" or "wchar", and writef does the conversions. It could possibly also be marked as "UTF8" or "whatever", too.
Aug 19 2007
prev sibling parent Roald Ribe <rr.nospam nospam.teikom.no> writes:
Hi,

Since D supports Win32 only and not older Windows versions, have you
considered setting the console in a D compatible mode, rather than
making D output in console compatible ways?

I am not sure if this can be done to a console that is already
created, and it may just work on NT platforms, but I seem to
remember that there is a console function that changes the console
operation into UTF16 mode.

Anyway, it's just a thought that someone may want to investigate.

Roald
Aug 19 2007
prev sibling next sibling parent Deewiant <deewiant.doesnotlike.spam gmail.com> writes:
Kirk McDonald wrote:
 The idiom is this: A string not known to be encoded in UTF-8, -16, or -32 
 should be stored as a ubyte[]. All internal string manipulation should be 
 done in one of the Unicode encoding types (char[], wchar[], or dchar[]), and 
 all input and output should be done with the ubyte[] type.
I asked about this when Tango was first announced, and was dismayed that this wasn't the case. Good that somebody else has the same thought.

I tried doing this in an application manually, but it resulted in so many casts (ubyte[] to char[] for the standard library functions, the other way for their return values) that I gave up. It's the same way for both Phobos and Tango.
 This implies something else: Printing a ubyte[] should cause those actual 
 bytes to be printed directly. While it is currently possible to do this with 
 e.g. std.cstream.dout.write(), it would be very convenient to do this with 
 writef, especially combined with encode().
Tango still doesn't have out-of-the-box support for just sending bytes to output, although I'm doing my best to get what I've coded to do it to be added.

One problem is, as you said in another post, that std.format.doFormat / tango.text.convert.Format are Unicode aware. Dealing with non-Unicode in a D app is very difficult without conversion to UTF-(8|16|32), which is potentially expensive.

Another problem is that essentially every C binding out there uses 'char' when they really mean 'ubyte'. Without implicit casts from char to ubyte and vice versa, this really doesn't work in practice, and with it, the theory breaks down.

All in all it's a very complicated problem, as you've noted. If you can find a good and actually working solution, great. But I don't think it's here yet.

-- 
Remove ".doesnotlike.spam" from the mail address.
Aug 19 2007
prev sibling next sibling parent reply =?UTF-8?B?QW5kZXJzIEYgQmrDtnJrbHVuZA==?= <afb algonet.se> writes:
Kirk McDonald wrote:

 However, in many real-world situations, you are not reading something in 
 a Unicode encoding, nor do you always want to write one out. This is 
 particularly the case when writing something to the console. Not all 
 Windows command lines or Linux shells are set up to handle UTF-8, though 
 this is very common on Linux. My Windows command-line, however, uses the 
 default of the ancient CP437, and this is not uncommon. The point is 
 that, on many systems, outputting raw UTF-8 results in garbage.
It was my understanding that D by design only supports UTF environments, and the behaviour on legacy systems (CP437/ISO-8859-1) is "undefined"... It's not only output, if you run on such a system and try to read the args (char[][]) you can get a UTF exception due to it being malformed. I.e. the current behaviour is just reading the raw bytes and pretending that it is UTF, whether that's true or not (exceptions and/or garbage).
 ----
 Improvements to Phobos
 ----
 
 The behavior of writef (and perhaps of D's formatting in general) must 
 be altered.
 
 Currently, printing a char[] causes D to output the raw bytes in the 
 string. As I previously mentioned, this is not a good thing. On many 
 platforms, this can easily result in garbage being printed to the screen.
By design, I thought. As usual everything "works" for ASCII characters. Not that bad for a trade-off between the whatever-the-system-uses of C and lets-include-every-weird-encoding-ever-in-the-core-library of Java ?
 I propose changing writef to check the console's encoding, and to 
 attempt to encode the output in that encoding. Then it can simply output 
 the resulting raw bytes. Checking this encoding is a platform-specific 
 operation, but essentially every platform (particularly Linux, Windows, 
 and OS X) has a way to do it. If the string cannot be encoded in that 
 encoding, the exception thrown by encode() should be allowed to 
 propagate and terminate the program (or be caught by the user). If the 
 user wishes to avoid that exception, they should call encode() 
 explicitly themselves. For this reason, Phobos will also need to make a 
 function for retrieving the console's default encoding available to the user.
Probably not a bad idea (Java does something like this), but it would bloat the standard library. Adding support for common legacy encodings like cp437/cp1252/iso88591/roman wouldn't be unthinkable in principle, but it's hard to "draw the line" and much easier to only support UTF-8 ?

If you want some code for doing such conversions, I have old "mapping" and "libiconv" modules on my home page at http://www.algonet.se/~afb/d/

/// converts a 8-bit charset encoding string into unicode
char[] decode_string(ubyte[] string, wchar[256] mapping);

/// converts a unicode string into 8-bit charset encoding
ubyte[] encode_string(char[] string, wchar[256] mapping);

(http://www.digitalmars.com/d/archives/digitalmars/D/12967.html)

/// allocate a converter between charsets fromcode and tocode
extern (C) iconv_t iconv_open (char *tocode, char *fromcode);

/// convert inbuf to outbuf and set inbytesleft to unused input and
/// outbuf to unused output and return number of non-reversable
/// conversions or -1 on error.
extern (C) size_t iconv (iconv_t cd, void **inbuf, size_t *inbytesleft,
                         void **outbuf, size_t *outbytesleft);

Mapping ISO-8859-1 (Latin-1) to UTF-8 is by far the easiest, see:
http://www.prowiki.org/wiki4d/wiki.cgi?CharsAndStrs (under 8-bit)
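For what it's worth, a usage sketch of those iconv declarations might look like the following (error handling elided; iconv_close is assumed to be declared too, since it is not in the snippet above):

extern (C) int iconv_close (iconv_t cd); // assumed, alongside iconv_open

ubyte[] latin1_to_utf8(ubyte[] input)
{
    iconv_t cd = iconv_open("UTF-8", "ISO-8859-1");
    // UTF-8 needs at most two bytes per Latin-1 character.
    ubyte[] output = new ubyte[input.length * 2];
    void* inp = input.ptr;
    void* outp = output.ptr;
    size_t inleft = input.length;
    size_t outleft = output.length;
    iconv(cd, &inp, &inleft, &outp, &outleft);
    iconv_close(cd);
    return output[0 .. output.length - outleft];
}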
 This implies something else: Printing a ubyte[] should cause those 
 actual bytes to be printed directly. While it is currently possible to 
 do this with e.g. std.cstream.dout.write(), it would be very convenient 
 to do this with writef, especially combined with encode().
Printing ubytes would be nice; currently that's easiest with printf... But adding codepages to D feels a little like adding 16-bit support :-)

--anders
Aug 19 2007
parent reply Sean Kelly <sean f4.ca> writes:
Anders F Björklund wrote:
 
 It was my understanding that D by design only supports UTF environments,
 and the behaviour on legacy systems (CP437/ISO-8859-1) is "undefined"...
 It's not only output, if you run on such a system and try to read the
 args (char[][]) you can get a UTF exception due to it being malformed.
Tango converts the input args to UTF-8 on Win32 rather than just accepting them as they are. The args are left alone on Unix however, because most Unix consoles seem to use Unicode anyway.

Sean
Aug 19 2007
next sibling parent =?UTF-8?B?QW5kZXJzIEYgQmrDtnJrbHVuZA==?= <afb algonet.se> writes:
Sean Kelly wrote:

 It was my understanding that D by design only supports UTF environments,
 and the behaviour on legacy systems (CP437/ISO-8859-1) is "undefined"...
 It's not only output, if you run on such a system and try to read the
 args (char[][]) you can get a UTF exception due to it being malformed.
Tango converts the input args to UTF-8 on Win32 rather than just accepting them as they are.
Sorry, I was talking about Phobos. Another library difference, I guess.
 The args are left alone on Unix however, 
 because most Unix consoles seem to use Unicode anyway.
On Mac OS X it defaults to MacRoman, but you can change it to ISO-8859-1 or UTF-8 with the flick of a menu... (Display > Character Set Encoding)

I even heard rumors of a Windows command to do the same... (chcp 65001) But I also heard it could lead to problems with some DOS batch files ?

--anders
Aug 19 2007
prev sibling next sibling parent reply =?UTF-8?B?QW5kZXJzIEYgQmrDtnJrbHVuZA==?= <afb algonet.se> writes:
Sean Kelly wrote:

 Tango converts the input args to UTF-8 on Win32 rather than just 
 accepting them as they are.  The args are left alone on Unix however, 
 because most Unix consoles seem to use Unicode anyway.
From my limited understanding, this automatic conversion seems to only be happening with DMD on Windows and not when running GDC on Windows ?

--anders
Aug 20 2007
parent reply Sean Kelly <sean f4.ca> writes:
Anders F Björklund wrote:
 Sean Kelly wrote:
 
 Tango converts the input args to UTF-8 on Win32 rather than just 
 accepting them as they are.  The args are left alone on Unix however, 
 because most Unix consoles seem to use Unicode anyway.
From my limited understanding, this automatic conversion seems to only be happening with DMD on Windows and not when running GDC on Windows ?
Yes. As far as I know, GDC works on Windows with cygwin but not with mingw or just plain old Win32, is this correct? The routines Tango currently uses to perform the conversion are Win32 library calls, and therefore, I assume, not available to GDC. However, I suppose I could use POSIX calls for GDC--I hadn't considered that case.

Sean
Aug 20 2007
parent reply =?UTF-8?B?QW5kZXJzIEYgQmrDtnJrbHVuZA==?= <afb algonet.se> writes:
Sean Kelly wrote:

  From my limited understanding, this automatic conversion seems to only
 be happening with DMD on Windows and not when running GDC on Windows ?
Yes. As far as I know, GDC works on Windows with cygwin but not with mingw or just plain old Win32, is this correct?
No, this is not correct. The "gdcwin" binaries are all about providing the regular Windows/MinGW with GDC just as the "gdcmac" binaries are about providing MacOSX/Xcode with GDC *without* extra requirements... You can build GDC for Cygwin and Darwin too, including the rest of the FSF/GCC toolchain, but it's not a strict requirement as it also builds OK using the patched versions of GCC that MinGW or Xcode are providing.
 The routines Tango 
 currently uses to perform the conversion are Win32 library calls, and 
 therefore, I assume, not available to GDC.  However, I suppose I could 
 use POSIX calls for GDC--I hadn't considered that case.
You can use Win32 calls, as long as you wrap them in version(Win32) ? --anders
Aug 20 2007
parent reply Sean Kelly <sean f4.ca> writes:
Anders F Björklund wrote:
 Sean Kelly wrote:
 
 The routines Tango currently uses to perform the conversion are Win32 
 library calls, and therefore, I assume, not available to GDC.  
 However, I suppose I could use POSIX calls for GDC--I hadn't 
 considered that case.
You can use Win32 calls, as long as you wrap them in version(Win32) ?
Yup. They require an additional library to be linked as well. I take care of this with "pragma(lib)" on DMD, but don't know if GDC supports this. Aside from that, it's simply a matter of GDC users having the .lib file available (it's included with DMC).

Sean
Aug 20 2007
parent =?UTF-8?B?QW5kZXJzIEYgQmrDtnJrbHVuZA==?= <afb algonet.se> writes:
Sean Kelly wrote:

 You can use Win32 calls, as long as you wrap them in version(Win32) ?
Yup. They require an additional library to be linked as well. I take care of this with "pragma(lib)" on DMD, but don't know if GDC supports this.
Not unless the build tool does (e.g. it being listed in Makefile)
 Aside from that, it's simply a matter of GDC users having the 
 .lib file available (it's included with DMC).
As long as it is available in MinGW, it shouldn't be any problem.

--anders
Aug 20 2007
prev sibling parent Lars Noschinski <lars-2006-1 usenet.noschinski.de> writes:
* Sean Kelly <sean f4.ca> [07-08-20 02:40]:
Anders F Björklund wrote:
It was my understanding that D by design only supports UTF environments,
and the behaviour on legacy systems (CP437/ISO-8859-1) is "undefined"...
It's not only output, if you run on such a system and try to read the
args (char[][]) you can get a UTF exception due to it being malformed.
Tango converts the input args to UTF-8 on Win32 rather than just accepting them as they are. The args are left alone on Unix however, because most Unix consoles seem to use Unicode anyway.
Probably args should be (u)byte[][] anyway. Converting command line arguments could have pretty annoying effects. For example, Unix filenames may contain any 8-bit value except '/' and '\0', and arguments may contain every char except '\0'. They are also charset agnostic; the only place where the charset matters is the terminal emulator, all other parts of the system treat it as binary data.

Also, an automatic charset conversion on console output would probably be annoying, as stdin and stdout are often used to read and write binary data, as in tar -c foo | gzip -9 | split - targzipped-foo. So at least, one should use isatty to decide if the in/output is an interactive terminal.
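That isatty test could look roughly like this (the extern declaration simply mirrors the POSIX prototype):

extern (C) int isatty(int fd);

bool stdout_is_terminal()
{
    return isatty(1) != 0; // file descriptor 1 is stdout
}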
Aug 20 2007
prev sibling parent reply Leandro Lucarella <llucax gmail.com> writes:
Kirk McDonald, el 18 de agosto a las 14:33 me escribiste:
 char[] decode(ubyte[] str, string encoding, string error="strict");
 wchar[] wdecode(ubyte[] str, string encoding, string error="strict");
 dchar[] ddecode(ubyte[] str, string encoding, string error="strict");
Why isn't error an enum instead of a string?

-- 
Leandro Lucarella (luca) | Blog colectivo: http://www.mazziblog.com.ar/blog/
GPG: 5F5A8D05 // F8CD F9A7 BF00 5431 4145 104C 949E BFB6 5F5A 8D05

- Look, don Inodoro! A pigeon with a ring on its leg! It must be a
  messenger pigeon and it fell here!
- Well... if it's not a messenger, it's a flirt... or married.
        -- Mendieta and Inodoro Pereyra
Aug 20 2007
parent reply Kirk McDonald <kirklin.mcdonald gmail.com> writes:
Leandro Lucarella wrote:
 Kirk McDonald, el 18 de agosto a las 14:33 me escribiste:
 
char[] decode(ubyte[] str, string encoding, string error="strict");
wchar[] wdecode(ubyte[] str, string encoding, string error="strict");
dchar[] ddecode(ubyte[] str, string encoding, string error="strict");
Why isn't error an enum instead of a string?
Perhaps it would be useful to allow the user to define new error-handlers somehow, and provide a callback for them. (Python allows something like this.) This would allow you to, for instance, provide a different replacement character than the one provided by "replace".

-- 
Kirk McDonald
http://kirkmcdonald.blogspot.com
Pyd: Connecting D and Python
http://pyd.dsource.org
Aug 20 2007
parent reply Regan Heath <regan netmail.co.nz> writes:
Kirk McDonald wrote:
 Leandro Lucarella wrote:
 Kirk McDonald, el 18 de agosto a las 14:33 me escribiste:

 char[] decode(ubyte[] str, string encoding, string error="strict");
 wchar[] wdecode(ubyte[] str, string encoding, string error="strict");
 dchar[] ddecode(ubyte[] str, string encoding, string error="strict");
Why isn't error an enum instead of a string?
Perhaps it would be useful to allow the user to define new error-handlers somehow, and provide a callback for them. (Python allows something like this.) This would allow you to, for instance, provide a different replacement character than the one provided by "replace".
Not a bad idea. I would like to suggest alternate function signatures:

//The error code for the callback
enum DecodeMode { ..no idea what goes here.. }

//The callback types
typedef char function(DecodeMode,char) DecodeCHandler;
typedef wchar function(DecodeMode,wchar) DecodeWHandler;
typedef dchar function(DecodeMode,dchar) DecodeDHandler;

//The decode functions
uint decode(byte[] str, char[] dst, string encoding, DecodeCHandler handler);
uint decode(byte[] str, wchar[] dst, string encoding, DecodeWHandler handler);
uint decode(byte[] str, dchar[] dst, string encoding, DecodeDHandler handler);

Technically 'char' in C is a signed byte, not an unsigned one, therefore byte[] is more accurate.

I think you still want to use an enum to represent the cases the callback needs to handle (assuming there is more than one); the same handler function could be used for both encode and decode then.

I think you want to pass the destination buffers, allowing re-use/preallocation for efficiency.

I think you either return the resulting length of the destination data, or perhaps pass "dst" as 'ref' and change the length internally*. Not sure what you would return if you did that.

(* changing length should never cause deallocation of buffer)

Regan
Aug 20 2007
parent reply Kirk McDonald <kirklin.mcdonald gmail.com> writes:
Regan Heath wrote:
 Kirk McDonald wrote:
 
 Leandro Lucarella wrote:

 Kirk McDonald, el 18 de agosto a las 14:33 me escribiste:

 char[] decode(ubyte[] str, string encoding, string error="strict");
 wchar[] wdecode(ubyte[] str, string encoding, string error="strict");
 dchar[] ddecode(ubyte[] str, string encoding, string error="strict");
Why isn't error an enum instead of a string?
Perhaps it would be useful to allow the user to define new error-handlers somehow, and provide a callback for them. (Python allows something like this.) This would allow you to, for instance, provide a different replacement character than the one provided by "replace".
Not a bad idea. I would like to suggest alternate function signatures:

//The error code for the callback
enum DecodeMode { ..no idea what goes here.. }

//The callback types
typedef char function(DecodeMode,char) DecodeCHandler;
typedef wchar function(DecodeMode,wchar) DecodeWHandler;
typedef dchar function(DecodeMode,dchar) DecodeDHandler;

//The decode functions
uint decode(byte[] str, char[] dst, string encoding, DecodeCHandler handler);
uint decode(byte[] str, wchar[] dst, string encoding, DecodeWHandler handler);
uint decode(byte[] str, dchar[] dst, string encoding, DecodeDHandler handler);

Technically 'char' in C is a signed byte, not an unsigned one, therefore byte[] is more accurate.
I don't agree with this last part. For starters, I had thought the signedness of 'char' in C was not defined. In any case, we're talking about chunks of arbitrary, homogenous binary data, so I think ubyte[] is most appropriate.

Here's another approach to the error handler thing:

typedef int error_t;

alias void delegate(string encoding, dchar, ref ubyte[]) encode_error_handler;
alias void delegate(string encoding, ubyte[], size_t, ref dchar) decode_error_handler;

error_t register_error(encode_error_handler dg1, decode_error_handler dg2);

error_t Strict, Ignore, Replace;

The register_error function would return a new, unique ID for a given error handler. A handler only wanting to handle encoding or decoding could simply pass null for the one it doesn't want to handle.

The encode_error_handler receives the encoding and the Unicode character that could not be encoded. It also has a 'ref ubyte[]' argument, which should be set to whatever the replacement character is. (It could be passed in as a slice over an internal buffer. Reducing its length should never cause an allocation.)

The decode_error_handler receives the encoding, the ubyte[] buffer, and the index of the character in it which could not be decoded. It also has a 'ref dchar' argument, which should be set to whatever the replacement character is.

Strict, Ignore, and Replace could be implemented like this:

static this()
{
    Strict = register_error(
        delegate void(string encoding, dchar c, ref ubyte[] dest)
        {
            throw new EncodeError(format(
                "Could not encode character \\u%x in encoding '%s'.",
                c, encoding));
        },
        delegate void(string encoding, ubyte[] buf, size_t idx, ref dchar dest)
        {
            throw new DecodeError(format(
                "Could not decode \\x%x from encoding '%s'.",
                buf[idx], encoding));
        }
    );

    Ignore = register_error(
        delegate void(string encoding, dchar c, ref ubyte[] dest)
        {
            dest = null;
        },
        delegate void(string encoding, ubyte[] buf, size_t idx, ref dchar dest)
        {
            dest = 0; // This would probably have to be special-cased.
        }
    );

    Replace = register_error(
        delegate void(string encoding, dchar c, ref ubyte[] dest)
        {
            dest.length = 1;
            dest[0] = '?';
        },
        delegate void(string encoding, ubyte[] buf, size_t idx, ref dchar dest)
        {
            dest = '\uFFFD'; // The Unicode REPLACEMENT CHARACTER
        }
    );
}
 I think you still want to use an enum to represent the cases the 
 callback needs to handle (assuming there is more than one) the same 
 handler function could be used for both encode and decode then.
 
 I think you want to pass the destination buffers, allowing 
 re-use/preallocation for efficiency.
 
The implementation could use doEncode and doDecode functions, analogous to doFormat, for efficiency.

void doEncode(void delegate(ubyte[]) dg, char[], string encoding, error_t handler);
void doEncode(void delegate(ubyte[]) dg, wchar[], string encoding, error_t handler);
void doEncode(void delegate(ubyte[]) dg, dchar[], string encoding, error_t handler);

void doDecode(void delegate(dchar str) dg, ubyte[], string encoding, error_t handler);

The ubyte[] arguments in the callbacks could be slices over an internal buffer. No allocation is necessary.

-- 
Kirk McDonald
http://kirkmcdonald.blogspot.com
Pyd: Connecting D and Python
http://pyd.dsource.org
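As a quick usage sketch of this scheme (purely illustrative; it presumes encode() grows an overload taking an error_t, as doEncode above does), a caller could register a handler that substitutes '*' rather than '?':

error_t Star;

static this()
{
    Star = register_error(
        delegate void(string encoding, dchar c, ref ubyte[] dest)
        {
            dest.length = 1;
            dest[0] = '*'; // replace unencodable characters with '*'
        },
        null // no custom decode handling
    );
}

void usage_example()
{
    ubyte[] bytes = encode("naïve", "ascii", Star); // the 'ï' comes out as '*'
}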
Aug 20 2007
parent reply "Rioshin an'Harthen" <rharth75 hotmail.com> writes:
"Kirk McDonald" <kirklin.mcdonald gmail.com> kirjoitti viestissä 
news:facpkj$13ml$1 digitalmars.com...
 Regan Heath wrote:
 Kirk McDonald wrote:
 Technically 'char' in C is a signed byte, not an unsigned one therefore 
 byte[] is more accurate.
I don't agree with this last part. For starters, I had thought the signed-ness of 'char' in C was not defined. In any case, we're talking about chunks of arbitrary, homogenous binary data, so I think ubyte[] is most appropriate.
<ramble>
True. The C standard does not define the signedness of the char type. What it does require of the char type is a guarantee that the characters of the basic execution character set

A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
a b c d e f g h i j k l m n o p q r s t u v w x y z
0 1 2 3 4 5 6 7 8 9

fit into a char in such a way that they are non-negative.

Quote from the standard (ISO/IEC 9899:TC2 Committee Draft May 6, 2005):

6.2.5 Types

3 An object declared as type char is large enough to store any member of the basic execution character set. If a member of the basic execution character set is stored in a char object, its value is guaranteed to be nonnegative. If any other character is stored in a char object, the resulting value is implementation-defined but shall be within the range of values that can be represented in that type.

5.2.4.2.1 Size of integer types <limits.h>

- number of bits for smallest object that is not a bit-field (byte)
    CHAR_BIT 8
- minimum value for an object of type signed char
    SCHAR_MIN -127 // -(2^7 - 1)
- maximum value for an object of type signed char
    SCHAR_MAX 127 // 2^7 - 1
- maximum value for an object of type unsigned char
    UCHAR_MAX 255 // 2^8 - 1
- minimum value for an object of type char
    CHAR_MIN see below
- maximum value for an object of type char
    CHAR_MAX see below

2 If the value of an object of type char is treated as a signed integer when used in an expression, the value of CHAR_MIN shall be the same as that of SCHAR_MIN and the value of CHAR_MAX shall be the same as that of SCHAR_MAX. Otherwise, the value of CHAR_MIN shall be 0 and the value of CHAR_MAX shall be the same as that of UCHAR_MAX. The value of UCHAR_MAX shall equal 2^(CHAR_BIT) - 1.
</ramble>

So, applying this to the discussion would suggest that either byte[] or ubyte[] would be appropriate. However, the most natural would be to handle data as raw data without signs, thus ubyte[] feels more natural to use as the standard type for any data whatsoever.
Aug 20 2007
parent Kirk McDonald <kirklin.mcdonald gmail.com> writes:
Rioshin an'Harthen wrote:
 "Kirk McDonald" <kirklin.mcdonald gmail.com> kirjoitti viestissä 
 news:facpkj$13ml$1 digitalmars.com...
 
 Regan Heath wrote:

 Kirk McDonald wrote:
 Technically 'char' in C is a signed byte, not an unsigned one 
 therefore byte[] is more accurate.
I don't agree with this last part. For starters, I had thought the signed-ness of 'char' in C was not defined. In any case, we're talking about chunks of arbitrary, homogenous binary data, so I think ubyte[] is most appropriate.
<ramble>
True. The C standard does not define the signedness of the char type. What it does require of the char type is a guarantee that the characters of the basic execution character set

A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
a b c d e f g h i j k l m n o p q r s t u v w x y z
0 1 2 3 4 5 6 7 8 9

fit into a char in such a way that they are non-negative.

Quote from the standard (ISO/IEC 9899:TC2 Committee Draft May 6, 2005):

6.2.5 Types

3 An object declared as type char is large enough to store any member of the basic execution character set. If a member of the basic execution character set is stored in a char object, its value is guaranteed to be nonnegative. If any other character is stored in a char object, the resulting value is implementation-defined but shall be within the range of values that can be represented in that type.

5.2.4.2.1 Size of integer types <limits.h>

- number of bits for smallest object that is not a bit-field (byte)
    CHAR_BIT 8
- minimum value for an object of type signed char
    SCHAR_MIN -127 // -(2^7 - 1)
- maximum value for an object of type signed char
    SCHAR_MAX 127 // 2^7 - 1
- maximum value for an object of type unsigned char
    UCHAR_MAX 255 // 2^8 - 1
- minimum value for an object of type char
    CHAR_MIN see below
- maximum value for an object of type char
    CHAR_MAX see below

2 If the value of an object of type char is treated as a signed integer when used in an expression, the value of CHAR_MIN shall be the same as that of SCHAR_MIN and the value of CHAR_MAX shall be the same as that of SCHAR_MAX. Otherwise, the value of CHAR_MIN shall be 0 and the value of CHAR_MAX shall be the same as that of UCHAR_MAX. The value of UCHAR_MAX shall equal 2^(CHAR_BIT) - 1.
</ramble>

So, applying this to the discussion would suggest that either byte[] or ubyte[] would be appropriate. However, the most natural would be to handle data as raw data without signs, thus ubyte[] feels more natural to use as the standard type for any data whatsoever.
Although this is interesting, and it does agree with what I was saying, it is basically irrelevant. When passing a string to decode(), the bytes therein could be in any encoding, even one which has nothing to do with the above. (It could be in a multi-byte encoding!) None of those guarantees which the C standard requires apply to these raw bytes. Therefore ubyte[] is /definitely/ more appropriate.

-- 
Kirk McDonald
http://kirkmcdonald.blogspot.com
Pyd: Connecting D and Python
http://pyd.dsource.org
Aug 20 2007