
digitalmars.D - Improving D's support of code-pages

reply Kirk McDonald <kirklin.mcdonald gmail.com> writes:
D's support for Unicode is a wonderful thing. The ability to comprehend 
UTF-8, -16, and -32 strings in a straightforward, native fashion is 
invaluable. However, the outside world consists of more than encodings 
of Unicode. The ability to deal with code pages in a straightforward 
manner should be considered absolutely vital.

I will be describing what I think is the optimal way of dealing with 
code-pages. This includes some changes and additions to Phobos (or 
Tango, if that is your preferred platform; to be truthful I am unsure of 
the state of code pages in that library), as well as describing a new D 
idiom.

(I am aware that Mango has some bindings to the ICU code-page conversion 
libraries, but this sort of functionality /really/ belongs in the 
standard library.)

The idiom is this: A string not known to be encoded in UTF-8, -16, or 
-32 should be stored as a ubyte[]. All internal string manipulation 
should be done in one of the Unicode encoding types (char[], wchar[], or 
dchar[]), and all input and output should be done with the ubyte[] type. 
There are some exceptions to this, of course. If you're reading input 
which you know to be in one of D's Unicode encoding types, or writing 
something out in one of those formats, naturally there's no reason you 
shouldn't just read into or write from the D type directly.
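For concreteness, here is a minimal sketch of the idiom in use. (This is 
only an illustration: it assumes the decode() and encode() functions 
proposed below, and the file names are placeholders.)

import std.file;

void example()
{
    // Raw input of unknown encoding stays in a ubyte[].
    ubyte[] raw = cast(ubyte[]) std.file.read("input.txt");
    // Once the encoding is known, decode into UTF-8 for internal work.
    char[] text = decode(raw, "latin-1");
    // ... all string manipulation happens on 'text' ...
    // On the way out, encode back into raw bytes.
    std.file.write("output.txt", encode(text, "latin-1"));
}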

However, in many real-world situations, you are not reading something in 
a Unicode encoding, nor do you always want to write one out. This is 
particularly the case when writing something to the console. Not all 
Windows command lines or Linux shells are set up to handle UTF-8, though 
this is very common on Linux. My Windows command-line, however, uses the 
default of the ancient CP437, and this is not uncommon. The point is 
that, on many systems, outputting raw UTF-8 results in garbage.

----
Additions to Phobos
----

The first things Phobos needs are the following functions. (Their basic 
interface has been cribbed from Python.)

char[] decode(ubyte[] str, string encoding, string error="strict");
wchar[] wdecode(ubyte[] str, string encoding, string error="strict");
dchar[] ddecode(ubyte[] str, string encoding, string error="strict");

ubyte[] encode(char[] str, string encoding, string error="strict");
ubyte[] encode(wchar[] str, string encoding, string error="strict");
ubyte[] encode(dchar[] str, string encoding, string error="strict");

What follows is a description of these functions. For the sake of 
simplicity, I will only be referring to the char[] versions of these 
functions. The wchar[] and dchar[] versions should operate in an 
identical fashion.

Let's say you've read in a file and stored it in a ubyte[]:

ubyte[] file = something();

You're already in a bit of a situation, here, if you don't know the 
encoding of the file. If you've gotten this far without knowing it, or 
knowing how to get it, you probably need to re-think your design.

Let's say you know the file is in Latin-1. Since all of D's 
string-processing facilities expect to deal with a Unicode encoding, you 
want to convert this to UTF-8. You should just be able to decode it:

char[] str = decode(file, "latin-1");

And, ta-da! Your string is now converted to UTF-8, and all of D's string 
processing abilities can be brought to bear.

Now let's say that, after you've done whatever you were going to with 
the string, you want to write it back out in Latin-1. This is just a 
simple call to encode:

ubyte[] new_file = encode(str, "latin-1");

But wait! What if the UTF-8 string contains characters which are not 
valid Latin-1 characters? This is where the 'error' parameter comes into 
play. (Note that the error parameter is present in both encode and 
decode.) This parameter has three valid values:

  * "strict" causes an exception to be thrown. This is the default.
  * "ignore" causes the invalid characters to simply be ignored, and 
elided from the returned string.
  * "replace" causes the invalid characters to be replaced with a 
suitable replacement character. When calling decode, this should be the 
official U+FFFD REPLACEMENT CHARACTER. When calling encode, something 
specific to the code-page would have to be chosen; a '?' would be 
appropriate in the various ASCII-based code pages.
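To make the three modes concrete, here is a rough but runnable sketch of 
what encode() might look like for the Latin-1 case alone. (The dispatch 
on the encoding name, and every encoding other than Latin-1, is elided; 
this is not meant as the actual implementation.)

ubyte[] encode(char[] str, string encoding, string error = "strict")
{
    assert(encoding == "latin-1"); // sketch handles only this encoding
    ubyte[] result;
    foreach (dchar c; str) // foreach decodes the UTF-8 into code points
    {
        if (c <= 0xFF)
            result ~= cast(ubyte) c;   // fits in Latin-1
        else if (error == "ignore")
            continue;                  // elide the character
        else if (error == "replace")
            result ~= cast(ubyte) '?'; // code-page replacement character
        else // "strict"
            throw new Exception("cannot encode character in latin-1");
    }
    return result;
}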

Using strings rather than an enum means this functionality could be 
extended by the user in the future (as Python allows).

Latin-1 is not a very interesting encoding. The situation gets more 
interesting if we are talking about a multi-byte encoding, such as 
UTF-16. So let's say we're reading a file encoded in UTF-16:

ubyte[] utf16_file = whatever();
char[] str = decode(utf16_file, "utf-16");

While you /could/ simply cast the ubyte[] to a wchar[], this code has 
the advantage of totally separating the encoding of your program's 
input from the type with which you represent the data internally.

Using UTF-16 also means you might have errors during decoding, if there 
are invalid UTF-16 code units in the input string.

These functions might fit into std.string, although a new module such as 
std.codepages would work, as well.

----
Improvements to Phobos
----

The behavior of writef (and perhaps of D's formatting in general) must 
be altered.

Currently, printing a char[] causes D to output the raw bytes in the 
string. As I previously mentioned, this is not a good thing. On many 
platforms, this can easily result in garbage being printed to the screen.

I propose changing writef to check the console's encoding, and to 
attempt to encode the output in that encoding. Then it can simply output 
the resulting raw bytes. Checking this encoding is a platform-specific 
operation, but essentially every platform (particularly Linux, Windows, 
and OS X) has a way to do it. If the string cannot be encoded in that 
encoding, the exception thrown by encode() should be allowed to 
propagate and terminate the program (or be caught by the user). If the 
user wishes to avoid that exception, they should call encode() 
explicitly themselves. For this reason, Phobos will also need to make a 
function for retrieving the console's default encoding available to the user.
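In other words, the proposal amounts to something like the following 
sketch. (get_console_encoding() is the hypothetical accessor just 
mentioned; the function name is only illustrative.)

import std.cstream;

void writef_to_console(char[] s)
{
    // May throw if 's' cannot be represented in the console's encoding.
    ubyte[] raw = encode(s, get_console_encoding());
    std.cstream.dout.write(raw); // emit the raw bytes verbatim
}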

This implies something else: Printing a ubyte[] should cause those 
actual bytes to be printed directly. While it is currently possible to 
do this with e.g. std.cstream.dout.write(), it would be very convenient 
to do this with writef, especially combined with encode().

-- 
Kirk McDonald
http://kirkmcdonald.blogspot.com
Pyd: Connecting D and Python
http://pyd.dsource.org
Aug 18 2007
next sibling parent reply Walter Bright <newshound1 digitalmars.com> writes:
Kirk McDonald wrote:
 ----
 Additions to Phobos
 ----
 
 The first things Phobos needs are the following functions. (Their basic 
 interface has been cribbed from Python.)
 
 char[] decode(ubyte[] str, string encoding, string error="strict");
 wchar[] wdecode(ubyte[] str, string encoding, string error="strict");
 dchar[] ddecode(ubyte[] str, string encoding, string error="strict");
 
 ubyte[] encode(char[] str, string encoding, string error="strict");
 ubyte[] encode(wchar[] str, string encoding, string error="strict");
 ubyte[] encode(dchar[] str, string encoding, string error="strict");
If you (or someone else) wants to write these, I'll put them in.
 ----
 Improvements to Phobos
 ----
 
 The behavior of writef (and perhaps of D's formatting in general) must 
 be altered.
 
 Currently, printing a char[] causes D to output the raw bytes in the 
 string. As I previously mentioned, this is not a good thing. On many 
 platforms, this can easily result in garbage being printed to the screen.
 
 I propose changing writef to check the console's encoding, and to 
 attempt to encode the output in that encoding. Then it can simply output 
 the resulting raw bytes. Checking this encoding is a platform-specific 
 operation, but essentially every platform (particularly Linux, Windows, 
 and OS X) has a way to do it. If the string cannot be encoded in that 
 encoding, the exception thrown by encode() should be allowed to 
 propagate and terminate the program (or be caught by the user). If the 
 user wishes to avoid that exception, they should call encode() 
 explicitly themselves. For this reason, Phobos will also need to make a 
 function for retrieving the console's default encoding available to the user.
There's a big problem with this - what if the output is being sent to a file?
Aug 18 2007
parent reply Kirk McDonald <kirklin.mcdonald gmail.com> writes:
Walter Bright wrote:
 Kirk McDonald wrote:
 
 ----
 Additions to Phobos
 ----

 The first things Phobos needs are the following functions. (Their basic 
 interface has been cribbed from Python.)

 char[] decode(ubyte[] str, string encoding, string error="strict");
 wchar[] wdecode(ubyte[] str, string encoding, string error="strict");
 dchar[] ddecode(ubyte[] str, string encoding, string error="strict");

 ubyte[] encode(char[] str, string encoding, string error="strict");
 ubyte[] encode(wchar[] str, string encoding, string error="strict");
 ubyte[] encode(dchar[] str, string encoding, string error="strict");
If you (or someone else) wants to write these, I'll put them in.
It is not a small amount of work. Perhaps I will take a look at how big of a problem it is (after the conference).
 ----
 Improvements to Phobos
 ----

 The behavior of writef (and perhaps of D's formatting in general) must 
 be altered.

 Currently, printing a char[] causes D to output the raw bytes in the 
 string. As I previously mentioned, this is not a good thing. On many 
 platforms, this can easily result in garbage being printed to the screen.

 I propose changing writef to check the console's encoding, and to 
 attempt to encode the output in that encoding. Then it can simply 
 output the resulting raw bytes. Checking this encoding is a 
 platform-specific operation, but essentially every platform 
 (particularly Linux, Windows, and OS X) has a way to do it. If the 
 string cannot be encoded in that encoding, the exception thrown by 
 encode() should be allowed to propagate and terminate the program (or 
 be caught by the user). If the user wishes to avoid that exception, 
 they should call encode() explicitly themselves. For this reason, 
 Phobos will also need to make a function for retrieving the console's 
 default encoding available to the user.
There's a big problem with this - what if the output is being sent to a file?
Files have no inherent encoding, only the console does. In this way, writing to a file is different than writing to the console. The user must explicitly provide an encoding when writing to a file; or, if they are writing a char[], wchar[], or dchar[], the encoding will be UTF-8, -16, or -32. (Writing a char[] implies an encoding, while writing a ubyte[] does not.)

-- 
Kirk McDonald
http://kirkmcdonald.blogspot.com
Pyd: Connecting D and Python
http://pyd.dsource.org
Aug 18 2007
next sibling parent reply Kirk McDonald <kirklin.mcdonald gmail.com> writes:
Kirk McDonald wrote:
 Walter Bright wrote:
 
 Kirk McDonald wrote:

 ----
 Additions to Phobos
 ----

 The first things Phobos needs are the following functions. (Their 
 basic interface has been cribbed from Python.)

 char[] decode(ubyte[] str, string encoding, string error="strict");
 wchar[] wdecode(ubyte[] str, string encoding, string error="strict");
 dchar[] ddecode(ubyte[] str, string encoding, string error="strict");

 ubyte[] encode(char[] str, string encoding, string error="strict");
 ubyte[] encode(wchar[] str, string encoding, string error="strict");
 ubyte[] encode(dchar[] str, string encoding, string error="strict");
If you (or someone else) wants to write these, I'll put them in.
It is not a small amount of work. Perhaps I will take a look at how big of a problem it is (after the conference).
 ----
 Improvements to Phobos
 ----

 The behavior of writef (and perhaps of D's formatting in general) 
 must be altered.

 Currently, printing a char[] causes D to output the raw bytes in the 
 string. As I previously mentioned, this is not a good thing. On many 
 platforms, this can easily result in garbage being printed to the 
 screen.

 I propose changing writef to check the console's encoding, and to 
 attempt to encode the output in that encoding. Then it can simply 
 output the resulting raw bytes. Checking this encoding is a 
 platform-specific operation, but essentially every platform 
 (particularly Linux, Windows, and OS X) has a way to do it. If the 
 string cannot be encoded in that encoding, the exception thrown by 
 encode() should be allowed to propagate and terminate the program (or 
 be caught by the user). If the user wishes to avoid that exception, 
 they should call encode() explicitly themselves. For this reason, 
 Phobos will also need a function for retrieving the console's default 
 encoding made available to the user.
There's a big problem with this - what if the output is being sent to a file?
Files have no inherent encoding, only the console does. In this way, writing to a file is different than writing to the console. The user must explicitly provide an encoding when writing to a file; or, if they are writing a char[], wchar[], or dchar[], the encoding will be UTF-8, -16, or -32. (Writing a char[] implies an encoding, while writing a ubyte[] does not.)
I should clarify this: When treating stdout like a file, it should be like any other file: writing to it writes raw bytes. But when calling writef, which is not treating it like a file, it should attempt to encode the output into the console's default encoding.

-- 
Kirk McDonald
http://kirkmcdonald.blogspot.com
Pyd: Connecting D and Python
http://pyd.dsource.org
Aug 18 2007
parent reply BCS <ao pathlink.com> writes:
Reply to Kirk,

 Kirk McDonald wrote:
 
 Walter Bright wrote:
 
 Kirk McDonald wrote:
 
 ----
 Additions to Phobos
 ----
 The first things Phobos needs are the following functions. (Their
 basic interface has been cribbed from Python.)
 
 char[] decode(ubyte[] str, string encoding, string error="strict");
 wchar[] wdecode(ubyte[] str, string encoding, string
 error="strict"); dchar[] ddecode(ubyte[] str, string encoding,
 string error="strict");
 
 ubyte[] encode(char[] str, string encoding, string error="strict");
 ubyte[] encode(wchar[] str, string encoding, string
 error="strict"); ubyte[] encode(dchar[] str, string encoding,
 string error="strict");
 
If you (or someone else) wants to write these, I'll put them in.
It is not a small amount of work. Perhaps I will take a look at how big of a problem it is (after the conference).
 ----
 Improvements to Phobos
 ----
 The behavior of writef (and perhaps of D's formatting in general)
 must be altered.
 
 Currently, printing a char[] causes D to output the raw bytes in
 the string. As I previously mentioned, this is not a good thing. On
 many platforms, this can easily result in garbage being printed to
 the screen.
 
 I propose changing writef to check the console's encoding, and to
 attempt to encode the output in that encoding. Then it can simply
 output the resulting raw bytes. Checking this encoding is a
 platform-specific operation, but essentially every platform
 (particularly Linux, Windows, and OS X) has a way to do it. If the
 string cannot be encoded in that encoding, the exception thrown by
 encode() should be allowed to propagate and terminate the program
 (or be caught by the user). If the user wishes to avoid that
 exception, they should call encode() explicitly themselves. For
 this reason, Phobos will also need to make a function for retrieving
 the console's default encoding available to the user.
 
There's a big problem with this - what if the output is being sent to a file?
Files have no inherent encoding, only the console does. In this way, writing to a file is different than writing to the console. The user must explicitly provide an encoding when writing to a file; or, if they are writing a char[], wchar[], or dchar[], the encoding will be UTF-8, -16, or -32. (Writing a char[] implies an encoding, while writing a ubyte[] does not.)
I should clarify this: When treating stdout like a file, it should be like any other file: writing to it writes raw bytes. But when calling writef, which is not treating it like a file, it should attempt to encode the output into the console's default encoding.
"Stream" has a writef, so you can call writef for a file.
Aug 18 2007
parent reply Kirk McDonald <kirklin.mcdonald gmail.com> writes:
BCS wrote:
 Reply to Kirk,
 Kirk McDonald wrote:
 Files have no inherent encoding, only the console does. In this way,
 writing to a file is different than writing to the console. The user
 must explicitly provide an encoding when writing to a file; or, if
 they are writing a char[], wchar[], or dchar[], the encoding will be
 UTF-8, -16, or -32. (Writing a char[] implies an encoding, while
 writing a ubyte[] does not.)
I should clarify this: When treating stdout like a file, it should be like any other file: writing to it writes raw bytes. But when calling writef, which is not treating it like a file, it should attempt to encode the output into the console's default encoding.
"Stream" has a writef, so you can call writef for a file.
But, again, files have no inherent encoding. (And if you are treating stdout as a file, it shouldn't have one, either.) This business about implicitly encoding things should be limited to the std.stdio.writef (and writefln, &c) function. Treating stdout as a file should be considered a way to get 'raw' access to stdout.

-- 
Kirk McDonald
http://kirkmcdonald.blogspot.com
Pyd: Connecting D and Python
http://pyd.dsource.org
Aug 18 2007
parent reply BCS <ao pathlink.com> writes:
Reply to Kirk,

 BCS wrote:
 
 Reply to Kirk,
 
 Kirk McDonald wrote:
 
 Files have no inherent encoding, only the console does. In this
 way, writing to a file is different than writing to the console.
 The user must explicitly provide an encoding when writing to a file;
 or, if they are writing a char[], wchar[], or dchar[], the encoding
 will be UTF-8, -16, or -32. (Writing a char[] implies an encoding,
 while writing a ubyte[] does not.)
 
I should clarify this: When treating stdout like a file, it should be like any other file: writing to it writes raw bytes. But when calling writef, which is not treating it like a file, it should attempt to encode the output into the console's default encoding.
"Stream" has a writef, so you can call writef for a file.
But, again, files have no inherent encoding. (And if you are treating stdout as a file, it shouldn't have one, either.) This business about implicitly encoding things should be limited to the std.stdio.writef (and writefln, &c) function. Treating stdout as a file should be considered a way to get 'raw' access to stdout.
I was looking the other way. So you are saying that only the console functions should have the code page stuff? What about dout? It goes to the console and also has a writef. I'm not putting down your idea; I'm just looking for (and hoping not to find) problems.
Aug 18 2007
parent Kirk McDonald <kirklin.mcdonald gmail.com> writes:
BCS wrote:
 Reply to Kirk,
 
 BCS wrote:

 Reply to Kirk,

 Kirk McDonald wrote:

 Files have no inherent encoding, only the console does. In this
 way, writing to a file is different than writing to the console.
 The user must explicitly provide an encoding when writing to a file;
 or, if they are writing a char[], wchar[], or dchar[], the encoding
 will be UTF-8, -16, or -32. (Writing a char[] implies an encoding,
 while writing a ubyte[] does not.)
I should clarify this: When treating stdout like a file, it should be like any other file: writing to it writes raw bytes. But when calling writef, which is not treating it like a file, it should attempt to encode the output into the console's default encoding.
"Stream" has a writef, so you can call writef for a file.
But, again, files have no inherent encoding. (And if you are treating stdout as a file, it shouldn't have one, either.) This business about implicitly encoding things should be limited to the std.stdio.writef (and writefln, &c) function. Treating stdout as a file should be considered a way to get 'raw' access to stdout.
I was looking the other way. So you are saying that only the console functions should have the code page stuff? What about dout? It goes to the console and also has a writef. I'm not putting down your idea; I'm just looking for (and hoping not to find) problems.
The functions encode() and decode() are available on their own. If you want to explicitly encode something you're writing to a file, you can simply say e.g.:

somefile.write(encode(some_utf8_string, "cp437"));

Only the console functions would call this /implicitly/, since only they /have/ an implicit encoding (which is the console's encoding as reported by the OS). Since you might not always want to encode the stuff you print out, you should be able to use std.cstream.dout.writef() to get around this.

-- 
Kirk McDonald
http://kirkmcdonald.blogspot.com
Pyd: Connecting D and Python
http://pyd.dsource.org
Aug 18 2007
prev sibling parent reply Walter Bright <newshound1 digitalmars.com> writes:
Kirk McDonald wrote:
 Walter Bright wrote:
 There's a big problem with this - what if the output is being sent to 
 a file?
Files have no inherent encoding, only the console does. In this way, writing to a file is different than writing to the console. The user must explicitly provide an encoding when writing to a file; or, if they are writing a char[], wchar[], or dchar[], the encoding will be UTF-8, -16, or -32. (Writing a char[] implies an encoding, while writing a ubyte[] does not.)
The problem is that whatever is sent to a file should be the same as what is sent to the screen. Consider if stdout is piped to another application - what should happen?
Aug 18 2007
parent reply Kirk McDonald <kirklin.mcdonald gmail.com> writes:
Walter Bright wrote:
 Kirk McDonald wrote:
 
 Walter Bright wrote:

 There's a big problem with this - what if the output is being sent to 
 a file?
Files have no inherent encoding, only the console does. In this way, writing to a file is different than writing to the console. The user must explicitly provide an encoding when writing to a file; or, if they are writing a char[], wchar[], or dchar[], the encoding will be UTF-8, -16, or -32. (Writing a char[] implies an encoding, while writing a ubyte[] does not.)
The problem is that whatever is sent to a file should be the same as what is sent to the screen. Consider if stdout is piped to another application - what should happen?
Let's say there's a function get_console_encoding() which returns the console's current encoding. I'm simply proposing that this:

char[] str = something();
std.stdio.writefln(str);

Should end up being equivalent to this:

std.cstream.dout.writefln(encode(str, get_console_encoding()));

So, to answer your question, if you use std.stdio.writefln, you will send a string encoded in the console's default encoding to the other application's stdin. This is the encoding the other application should be expecting, anyway (unless it isn't; code pages are annoying like that).

In any event, if you explicitly want to output something in a particular encoding, this /will/ work:

std.stdio.writefln(encode(str, whatever));

This is because encode() returns a ubyte[], and writefln should print the data in ubyte[]s directly, as I suggested in my original post for this precise reason.

-- 
Kirk McDonald
http://kirkmcdonald.blogspot.com
Pyd: Connecting D and Python
http://pyd.dsource.org
Aug 18 2007
parent reply Walter Bright <newshound1 digitalmars.com> writes:
Kirk McDonald wrote:
 Walter Bright wrote:
 The problem is that whatever is sent to a file should be the same as 
 what is sent to the screen. Consider if stdout is piped to another 
 application - what should happen?
Let's say there's a function get_console_encoding() which returns the console's current encoding. I'm simply proposing that this:

char[] str = something();
std.stdio.writefln(str);

Should end up being equivalent to this:

std.cstream.dout.writefln(encode(str, get_console_encoding()));

So, to answer your question, if you use std.stdio.writefln, you will send a string encoded in the console's default encoding to the other application's stdin. This is the encoding the other application should be expecting, anyway (unless it isn't; code pages are annoying like that).

In any event, if you explicitly want to output something in a particular encoding, this /will/ work:

std.stdio.writefln(encode(str, whatever));

This is because encode() returns a ubyte[], and writefln should print the data in ubyte[]s directly, as I suggested in my original post for this precise reason.
stdio can detect whether it is being written to the console or not. That's fine. The problem is:

    foo

will generate one kind of output.

    foo | more

will do something else. This will result in a nice cascade of bug reports. There's also:

    foo >output
    cat output | more

which will do something else, too.
Aug 18 2007
parent reply Kirk McDonald <kirklin.mcdonald gmail.com> writes:
Walter Bright wrote:
 Kirk McDonald wrote:
 
 Walter Bright wrote:

 The problem is that whatever is sent to a file should be the same as 
 what is sent to the screen. Consider if stdout is piped to another 
 application - what should happen?
Let's say there's a function get_console_encoding() which returns the console's current encoding. I'm simply proposing that this:

char[] str = something();
std.stdio.writefln(str);

Should end up being equivalent to this:

std.cstream.dout.writefln(encode(str, get_console_encoding()));

So, to answer your question, if you use std.stdio.writefln, you will send a string encoded in the console's default encoding to the other application's stdin. This is the encoding the other application should be expecting, anyway (unless it isn't; code pages are annoying like that).

In any event, if you explicitly want to output something in a particular encoding, this /will/ work:

std.stdio.writefln(encode(str, whatever));

This is because encode() returns a ubyte[], and writefln should print the data in ubyte[]s directly, as I suggested in my original post for this precise reason.
stdio can detect whether it is being written to the console or not. That's fine. The problem is:

    foo

will generate one kind of output.

    foo | more

will do something else. This will result in a nice cascade of bug reports. There's also:

    foo >output
    cat output | more

which will do something else, too.
Pardon? I haven't said anything about stdio behaving differently whether it's printing to the console or not. writefln() would /always/ attempt to encode in the console's encoding.

-- 
Kirk McDonald
http://kirkmcdonald.blogspot.com
Pyd: Connecting D and Python
http://pyd.dsource.org
Aug 18 2007
parent reply Walter Bright <newshound1 digitalmars.com> writes:
Kirk McDonald wrote:
 Pardon? I haven't said anything about stdio behaving differently whether 
 it's printing to the console or not. writefln() would /always/ attempt 
 to encode in the console's encoding.
Ok, I misunderstood. Now, what if stdout is reopened to be a file?
Aug 18 2007
parent reply Kirk McDonald <kirklin.mcdonald gmail.com> writes:
Walter Bright wrote:
 Kirk McDonald wrote:
 
 Pardon? I haven't said anything about stdio behaving differently 
 whether it's printing to the console or not. writefln() would /always/ 
 attempt to encode in the console's encoding.
Ok, I misunderstood. Now, what if stdout is reopened to be a file?
I've been thinking about these issues more carefully. It is harder than I initially thought. :-)

Ignoring my ideas of implicitly encoding writefln's output, I regard the encode/decode functions as vital. These alone would improve the current situation immensely.

Printing ubyte[] arrays as the "raw bytes" therein when using writef() is basically nonsense, thanks to the fact that doFormat itself is Unicode aware. I should have realized this sooner. However, you can still write them with dout.write(). This should be adequate.

Here is another proposal regarding implicit encoding, slightly modified from my first one:

The Stream class should be modified to have an encoding attribute. This should usually be null. If it is present, output should be encoded into that encoding. (To facilitate this, the encoding module should provide a doEncode function, analogous to the doFormat function, which has a void delegate(ubyte) or possibly a void delegate(ubyte[]) callback.)

Next, std.stdio.writef should be modified to write to the object referenced by std.cstream.dout, instead of the FILE* stdout. The next step is obvious: std.cstream.dout's encoding attribute should be set to the console's encoding. Finally, though dout should obviously remain a CFile instance, it should be stored in a Stream reference.

If another Stream object is substituted for dout, then the behavior of writefln (and anything else relying on dout) would be redirected. Whether the output is still implicitly encoded would depend entirely on this new object's encoding attribute.

It occurs to me that this could be somewhat slow. Examination of the source reveals that every printed character from dout is the result of a virtual method call. However, I do wonder how important the performance of printing to the console really is.

Thoughts? Is this a thoroughly stupid idea?

-- 
Kirk McDonald
http://kirkmcdonald.blogspot.com
Pyd: Connecting D and Python
http://pyd.dsource.org
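To sketch what that encoding attribute might look like in code (a rough illustration only; the class and method names here are placeholders, not actual Phobos API):

class EncodingStream
{
    string encoding; // null means "pass bytes through unencoded"

    void writeString(char[] s)
    {
        if (encoding is null)
            writeRaw(cast(ubyte[]) s);     // raw UTF-8 bytes
        else
            writeRaw(encode(s, encoding)); // implicit conversion
    }

    void writeRaw(ubyte[] bytes)
    {
        // ... actual output of the raw bytes ...
    }
}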
Aug 18 2007
next sibling parent kris <foo bar.com> writes:
Kirk -

It's not a stupid idea, but you may not have all the necessary pieces? 
For example, this kind of processing should probably not be bound to an 
application by default (bloat?) and thus you'd perhaps need some 
mechanism to (dynamically) attach custom processing onto a stream?

Tango supports this via stream filters, and deewiant (for example) has 
an output filter for doing specific code-page conversion. Tango also has 
UnicodeFile as a template for converting between internal utf8/16/32 and 
an external UTF representation (all 8 varieties) along with BOM support; 
much as you were describing earlier.

The console is a PITA when it comes to encodings, especially when 
redirection is involved. Thus, we decided long ago that Tango would be 
utf8 only for console IO, and for all variations thereof ... gives it a 
known state. From there, either a filter or a replacement console-device 
can be injected into the IO framework for customization purposes.

Unix has a good lib for code-page support, called iconv. The IBM ICU 
project also has extensive code-page support, along with  a bus, 
helicopter, cruise-liner, and a kitchen-sink, all wrapped up in a very 
powerful (UTF16) API. But the latter is too heavyweight to be embedded 
in a core library, which is why those wrappers still reside in Mango 
rather than Tango. On the other hand, Tango does have a codepage API 
much like what you suggest, as a free-function lightweight converter

- Kris




Kirk McDonald wrote:
 Walter Bright wrote:
 Kirk McDonald wrote:

 Pardon? I haven't said anything about stdio behaving differently 
 whether it's printing to the console or not. writefln() would 
 /always/ attempt to encode in the console's encoding.
Ok, I misunderstood. Now, what if stdout is reopened to be a file?
I've been thinking about these issues more carefully. It is harder than I initially thought. :-)

Ignoring my ideas of implicitly encoding writefln's output, I regard the encode/decode functions as vital. These alone would improve the current situation immensely.

Printing ubyte[] arrays as the "raw bytes" therein when using writef() is basically nonsense, thanks to the fact that doFormat itself is Unicode aware. I should have realized this sooner. However, you can still write them with dout.write(). This should be adequate.

Here is another proposal regarding implicit encoding, slightly modified from my first one:

The Stream class should be modified to have an encoding attribute. This should usually be null. If it is present, output should be encoded into that encoding. (To facilitate this, the encoding module should provide a doEncode function, analogous to the doFormat function, which has a void delegate(ubyte) or possibly a void delegate(ubyte[]) callback.)

Next, std.stdio.writef should be modified to write to the object referenced by std.cstream.dout, instead of the FILE* stdout. The next step is obvious: std.cstream.dout's encoding attribute should be set to the console's encoding. Finally, though dout should obviously remain a CFile instance, it should be stored in a Stream reference.

If another Stream object is substituted for dout, then the behavior of writefln (and anything else relying on dout) would be redirected. Whether the output is still implicitly encoded would depend entirely on this new object's encoding attribute.

It occurs to me that this could be somewhat slow. Examination of the source reveals that every printed character from dout is the result of a virtual method call. However, I do wonder how important the performance of printing to the console really is.

Thoughts? Is this a thoroughly stupid idea?
Aug 19 2007
prev sibling next sibling parent Walter Bright <newshound1 digitalmars.com> writes:
Kirk McDonald wrote:
 I've been thinking about these issues more carefully. It is harder than 
 I initially thought. :-)
<g>
 Ignoring my ideas of implicitly encoding writefln's output, I regard the 
 encode/decode functions as vital. These alone would improve the current 
 situation immensely.
Sure.
 The Stream class should be modified to have an encoding attribute. This 
 should usually be null. If it is present, output should be encoded into 
 that encoding. (To facilitate this, the encoding module should provide a 
 doEncode function, analogous to the doFormat function, which has a void 
 delegate(ubyte) or possibly a void delegate(ubyte[]) callback.)
 
 Next, std.stdio.writef should be modified to write to the object 
 referenced by std.cstream.dout, instead of the FILE* stdout. The next 
 step is obvious: std.cstream.dout's encoding attribute should be set to 
 the console's encoding. Finally, though dout should obviously remain a 
 CFile instance, it should be stored in a Stream reference.
 
 If another Stream object is substituted for dout, then the behavior of 
 writefln (and anything else relying on dout) would be redirected. 
 Whether the output is still implicitly encoded would depend entirely on 
 this new object's encoding attribute.
 
 It occurs to me that this could be somewhat slow. Examination of the 
 source reveals that every printed character from dout is the result of a 
 virtual method call. However, I do wonder how important the performance 
 of printing to the console really is.
 
 Thoughts? Is this a thoroughly stupid idea?
I generally wish to avoid merging writef with streams, for performance reasons. Currently, stdout is marked as being "char" or "wchar", and writef does the conversions. It could possibly also be marked as "UTF8" or "whatever", too.
Aug 19 2007
prev sibling parent Roald Ribe <rr.nospam nospam.teikom.no> writes:
Hi,

Since D supports Win32 only and not older Windows versions, have you
considered setting the console in a D compatible mode, rather than
making D output in console compatible ways?

I am not sure if this can be done to a console that is already
created, and it may just work on NT platforms, but I seem to
remember that there is a console function that changes the console
operation into UTF16 mode.

Anyway, it's just a thought that someone may want to investigate.

Roald
Aug 19 2007
prev sibling next sibling parent Deewiant <deewiant.doesnotlike.spam gmail.com> writes:
Kirk McDonald wrote:
 The idiom is this: A string not known to be encoded in UTF-8, -16, or -32 
 should be stored as a ubyte[]. All internal string manipulation should be 
 done in one of the Unicode encoding types (char[], wchar[], or dchar[]), and 
 all input and output should be done with the ubyte[] type.
I asked about this when Tango was first announced, and was dismayed that this wasn't the case. Good that somebody else has the same thought.

I tried doing this in an application manually, but it resulted in so many casts (ubyte[] to char[] for the standard library functions, the other way for their return values) that I gave up. It's the same way for both Phobos and Tango.
 This implies something else: Printing a ubyte[] should cause those actual 
 bytes to be printed directly. While it is currently possible to do this with 
 e.g. std.cstream.dout.write(), it would be very convenient to do this with 
 writef, especially combined with encode().
Tango still doesn't have out-of-the-box support for just sending bytes to output, although I'm doing my best to get what I've coded to do it to be added.

One problem is, as you said in another post, that std.format.doFormat / tango.text.convert.Format are Unicode aware. Dealing with non-Unicode in a D app is very difficult without conversion to UTF-(8|16|32), which is potentially expensive.

Another problem is that essentially every C binding out there uses 'char' when they really mean 'ubyte'. Without implicit casts from char to ubyte and vice versa, this really doesn't work in practice, and with it, the theory breaks down.

All in all it's a very complicated problem, as you've noted. If you can find a good and actually working solution, great. But I don't think it's here yet.

-- 
Remove ".doesnotlike.spam" from the mail address.
Aug 19 2007
prev sibling next sibling parent reply =?UTF-8?B?QW5kZXJzIEYgQmrDtnJrbHVuZA==?= <afb algonet.se> writes:
Kirk McDonald wrote:

 However, in many real-world situations, you are not reading something in 
 a Unicode encoding, nor do you always want to write one out. This is 
 particularly the case when writing something to the console. Not all 
 Windows command lines or Linux shells are set up to handle UTF-8, though 
 this is very common on Linux. My Windows command-line, however, uses the 
 default of the ancient CP437, and this is not uncommon. The point is 
 that, on many systems, outputting raw UTF-8 results in garbage.
It was my understanding that D by design only supports UTF environments, and the behaviour on legacy systems (CP437/ISO-8859-1) is "undefined"... It's not only output, if you run on such a system and try to read the args (char[][]) you can get a UTF exception due to it being malformed. I.e. the current behaviour is just reading the raw bytes and pretending that it is UTF, whether that's true or not (exceptions and/or garbage).
 ----
 Improvements to Phobos
 ----
 
 The behavior of writef (and perhaps of D's formatting in general) must 
 be altered.
 
 Currently, printing a char[] causes D to output the raw bytes in the 
 string. As I previously mentioned, this is not a good thing. On many 
 platforms, this can easily result in garbage being printed to the screen.
By design, I thought. As usual everything "works" for ASCII characters. Not that bad for a trade-off between the whatever-the-system-uses of C and lets-include-every-weird-encoding-ever-in-the-core-library of Java ?
 I propose changing writef to check the console's encoding, and to 
 attempt to encode the output in that encoding. Then it can simply output 
 the resulting raw bytes. Checking this encoding is a platform-specific 
 operation, but essentially every platform (particularly Linux, Windows, 
 and OS X) has a way to do it. If the string cannot be encoded in that 
 encoding, the exception thrown by encode() should be allowed to 
 propagate and terminate the program (or be caught by the user). If the 
 user wishes to avoid that exception, they should call encode() 
 explicitly themselves. For this reason, Phobos will also need to make a 
 function for retrieving the console's default encoding available to the user.
Probably not a bad idea (Java does something like this), but it would bloat the standard library. Adding support for common legacy encodings like cp437/cp1252/iso88591/roman wouldn't be unthinkable in principle, but it's hard to "draw the line" and much easier to only support UTF-8 ?

If you want some code for doing such conversions, I have old "mapping" and "libiconv" modules on my home page at http://www.algonet.se/~afb/d/

/// converts a 8-bit charset encoding string into unicode
char[] decode_string(ubyte[] string, wchar[256] mapping);

/// converts a unicode string into 8-bit charset encoding
ubyte[] encode_string(char[] string, wchar[256] mapping);

(http://www.digitalmars.com/d/archives/digitalmars/D/12967.html)

/// allocate a converter between charsets fromcode and tocode
extern (C) iconv_t iconv_open (char *tocode, char *fromcode);

/// convert inbuf to outbuf and set inbytesleft to unused input and
/// outbuf to unused output and return number of non-reversable
/// conversions or -1 on error.
extern (C) size_t iconv (iconv_t cd, void **inbuf, size_t *inbytesleft,
                         void **outbuf, size_t *outbytesleft);

Mapping ISO-8859-1 (Latin-1) to UTF-8 is by far the easiest, see:
http://www.prowiki.org/wiki4d/wiki.cgi?CharsAndStrs (under 8-bit)
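For what it's worth, a usage sketch of those iconv declarations might look like the following (error handling elided; iconv_close is assumed to be declared too, since it is not in the snippet above):

extern (C) int iconv_close (iconv_t cd); // assumed, alongside iconv_open

ubyte[] latin1_to_utf8(ubyte[] input)
{
    iconv_t cd = iconv_open("UTF-8", "ISO-8859-1");
    // UTF-8 needs at most two bytes per Latin-1 character.
    ubyte[] output = new ubyte[input.length * 2];
    void* inp = input.ptr;
    void* outp = output.ptr;
    size_t inleft = input.length;
    size_t outleft = output.length;
    iconv(cd, &inp, &inleft, &outp, &outleft);
    iconv_close(cd);
    return output[0 .. output.length - outleft];
}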
 This implies something else: Printing a ubyte[] should cause those 
 actual bytes to be printed directly. While it is currently possible to 
 do this with e.g. std.cstream.dout.write(), it would be very convenient 
 to do this with writef, especially combined with encode().
Printing ubytes would be nice; currently that's easiest with printf... But adding codepages to D feels a little like adding 16-bit support :-)

--anders
Aug 19 2007
parent reply Sean Kelly <sean f4.ca> writes:
Anders F Björklund wrote:
 
 It was my understanding that D by design only supports UTF environments,
 and the behaviour on legacy systems (CP437/ISO-8859-1) is "undefined"...
 It's not only output, if you run on such a system and try to read the
 args (char[][]) you can get a UTF exception due to it being malformed.
Tango converts the input args to UTF-8 on Win32 rather than just accepting them as they are. The args are left alone on Unix however, because most Unix consoles seem to use Unicode anyway.

Sean
Aug 19 2007
next sibling parent =?UTF-8?B?QW5kZXJzIEYgQmrDtnJrbHVuZA==?= <afb algonet.se> writes:
Sean Kelly wrote:

 It was my understanding that D by design only supports UTF environments,
 and the behaviour on legacy systems (CP437/ISO-8859-1) is "undefined"...
 It's not only output, if you run on such a system and try to read the
 args (char[][]) you can get a UTF exception due to it being malformed.
Tango converts the input args to UTF-8 on Win32 rather than just accepting them as they are.
Sorry, I was talking about Phobos. Another library difference, I guess.
 The args are left alone on Unix however, 
 because most Unix consoles seem to use Unicode anyway.
On Mac OS X it defaults to MacRoman, but you can change it to ISO-8859-1 or UTF-8 with the flick of a menu... (Display > Character Set Encoding)

I even heard rumors of a Windows command to do the same... (chcp 65001) But I also heard it could lead to problems with some DOS batch files ?

--anders
Aug 19 2007
prev sibling next sibling parent reply =?UTF-8?B?QW5kZXJzIEYgQmrDtnJrbHVuZA==?= <afb algonet.se> writes:
Sean Kelly wrote:

 Tango converts the input args to UTF-8 on Win32 rather than just 
 accepting them as they are.  The args are left alone on Unix however, 
 because most Unix consoles seem to use Unicode anyway.
From my limited understanding, this automatic conversion seems to only be happening with DMD on Windows and not when running GDC on Windows ?

--anders
Aug 20 2007
parent reply Sean Kelly <sean f4.ca> writes:
Anders F Björklund wrote:
 Sean Kelly wrote:
 
 Tango converts the input args to UTF-8 on Win32 rather than just 
 accepting them as they are.  The args are left alone on Unix however, 
 because most Unix consoles seem to use Unicode anyway.
From my limited understanding, this automatic conversion seems to only be happening with DMD on Windows and not when running GDC on Windows ?
Yes. As far as I know, GDC works on Windows with cygwin but not with mingw or just plain old Win32, is this correct? The routines Tango currently uses to perform the conversion are Win32 library calls, and therefore, I assume, not available to GDC. However, I suppose I could use POSIX calls for GDC--I hadn't considered that case.

Sean
Aug 20 2007
parent reply =?UTF-8?B?QW5kZXJzIEYgQmrDtnJrbHVuZA==?= <afb algonet.se> writes:
Sean Kelly wrote:

  From my limited understanding, this automatic conversion seems to only
 be happening with DMD on Windows and not when running GDC on Windows ?
Yes. As far as I know, GDC works on Windows with cygwin but not with mingw or just plain old Win32, is this correct?
No, this is not correct. The "gdcwin" binaries are all about providing the regular Windows/MinGW with GDC just as the "gdcmac" binaries are about providing MacOSX/Xcode with GDC *without* extra requirements... You can build GDC for Cygwin and Darwin too, including the rest of the FSF/GCC toolchain, but it's not a strict requirement as it also builds OK using the patched versions of GCC that MinGW or Xcode are providing.
 The routines Tango 
 currently uses to perform the conversion are Win32 library calls, and 
 therefore, I assume, not available to GDC.  However, I suppose I could 
 use POSIX calls for GDC--I hadn't considered that case.
You can use Win32 calls, as long as you wrap them in version(Win32) ? --anders
Aug 20 2007
parent reply Sean Kelly <sean f4.ca> writes:
Anders F Björklund wrote:
 Sean Kelly wrote:
 
 The routines Tango currently uses to perform the conversion are Win32 
 library calls, and therefore, I assume, not available to GDC.  
 However, I suppose I could use POSIX calls for GDC--I hadn't 
 considered that case.
You can use Win32 calls, as long as you wrap them in version(Win32) ?
Yup. They require an additional library to be linked as well. I take care of this with "pragma(lib)" on DMD, but don't know if GDC supports this. Aside from that, it's simply a matter of GDC users having the .lib file available (it's included with DMC).

Sean
Aug 20 2007
parent =?UTF-8?B?QW5kZXJzIEYgQmrDtnJrbHVuZA==?= <afb algonet.se> writes:
Sean Kelly wrote:

 You can use Win32 calls, as long as you wrap them in version(Win32) ?
Yup. They require an additional library to be linked as well. I take care of this with "pragma(lib)" on DMD, but don't know if GDC supports this.
Not unless the build tool does (e.g. it being listed in Makefile)
 Aside from that, it's simply a matter of GDC users having the 
 .lib file available (it's included with DMC).
As long as it is available in MinGW, it shouldn't be any problem.

--anders
Aug 20 2007
prev sibling parent Lars Noschinski <lars-2006-1 usenet.noschinski.de> writes:
* Sean Kelly <sean f4.ca> [07-08-20 02:40]:
Anders F Björklund wrote:
It was my understanding that D by design only supports UTF environments,
and the behaviour on legacy systems (CP437/ISO-8859-1) is "undefined"...
It's not only output, if you run on such a system and try to read the
args (char[][]) you can get a UTF exception due to it being malformed.
Tango converts the input args to UTF-8 on Win32 rather than just accepting them as they are. The args are left alone on Unix however, because most Unix consoles seem to use Unicode anyway.
Probably args should be (u)byte[][] anyway. Converting command line arguments could have pretty annoying effects. For example, Unix filenames may contain any 8-bit value except '/' and '\0', and arguments may contain every char except '\0'. They are also charset agnostic; the only place where the charset matters is the terminal emulator, all other parts of the system treat it as binary data.

Also, an automatic charset conversion on console output would probably be annoying, as stdin and stdout are often used to read and write binary data, as in tar -c foo | gzip -9 | split - targzipped-foo. So at least, one should use isatty to decide if the in/output is an interactive terminal.
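That isatty test could look roughly like this (the extern declaration simply mirrors the POSIX prototype):

extern (C) int isatty(int fd);

bool stdout_is_terminal()
{
    return isatty(1) != 0; // file descriptor 1 is stdout
}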
Aug 20 2007
prev sibling parent reply Leandro Lucarella <llucax gmail.com> writes:
Kirk McDonald, el 18 de agosto a las 14:33 me escribiste:
 char[] decode(ubyte[] str, string encoding, string error="strict");
 wchar[] wdecode(ubyte[] str, string encoding, string error="strict");
 dchar[] ddecode(ubyte[] str, string encoding, string error="strict");
Why isn't error an enum instead of a string?

-- 
Leandro Lucarella (luca) | Blog colectivo: http://www.mazziblog.com.ar/blog/
GPG: 5F5A8D05 // F8CD F9A7 BF00 5431 4145 104C 949E BFB6 5F5A 8D05

- Look, don Inodoro! A pigeon with a ring on its leg! It must be a
  messenger pigeon and it fell here!
- Well... if it's not a messenger, it's a flirt... or married.
        -- Mendieta and Inodoro Pereyra
Aug 20 2007
parent reply Kirk McDonald <kirklin.mcdonald gmail.com> writes:
Leandro Lucarella wrote:
 Kirk McDonald, el 18 de agosto a las 14:33 me escribiste:
 
char[] decode(ubyte[] str, string encoding, string error="strict");
wchar[] wdecode(ubyte[] str, string encoding, string error="strict");
dchar[] ddecode(ubyte[] str, string encoding, string error="strict");
Why isn't error an enum instead of a string?
Perhaps it would be useful to allow the user to define new error-handlers somehow, and provide a callback for them. (Python allows something like this.) This would allow you to, for instance, provide a different replacement character than the one provided by "replace".

-- 
Kirk McDonald
http://kirkmcdonald.blogspot.com
Pyd: Connecting D and Python
http://pyd.dsource.org
Aug 20 2007
parent reply Regan Heath <regan netmail.co.nz> writes:
Kirk McDonald wrote:
 Leandro Lucarella wrote:
 Kirk McDonald, el 18 de agosto a las 14:33 me escribiste:

 char[] decode(ubyte[] str, string encoding, string error="strict");
 wchar[] wdecode(ubyte[] str, string encoding, string error="strict");
 dchar[] ddecode(ubyte[] str, string encoding, string error="strict");
Why isn't error an enum instead of a string?
Perhaps it would be useful to allow the user to define new error-handlers somehow, and provide a callback for them. (Python allows something like this.) This would allow you to, for instance, provide a different replacement character than the one provided by "replace".
Not a bad idea. I would like to suggest alternate function signatures:

//The error code for the callback
enum DecodeMode { ..no idea what goes here.. }

//The callback types
typedef char function(DecodeMode,char) DecodeCHandler;
typedef wchar function(DecodeMode,wchar) DecodeWHandler;
typedef dchar function(DecodeMode,dchar) DecodeDHandler;

//The decode functions
uint decode(byte[] str, char[] dst, string encoding, DecodeCHandler handler);
uint decode(byte[] str, wchar[] dst, string encoding, DecodeWHandler handler);
uint decode(byte[] str, dchar[] dst, string encoding, DecodeDHandler handler);

Technically 'char' in C is a signed byte, not an unsigned one, therefore byte[] is more accurate.

I think you still want to use an enum to represent the cases the callback needs to handle (assuming there is more than one); the same handler function could be used for both encode and decode then.

I think you want to pass the destination buffers, allowing re-use/preallocation for efficiency.

I think you either return the resulting length of the destination data, or perhaps pass "dst" as 'ref' and change the length internally*. Not sure what you would return if you did that.

(* changing length should never cause deallocation of buffer)

Regan
Aug 20 2007
parent reply Kirk McDonald <kirklin.mcdonald gmail.com> writes:
Regan Heath wrote:
 Kirk McDonald wrote:
 
 Leandro Lucarella wrote:

 Kirk McDonald, el 18 de agosto a las 14:33 me escribiste:

 char[] decode(ubyte[] str, string encoding, string error="strict");
 wchar[] wdecode(ubyte[] str, string encoding, string error="strict");
 dchar[] ddecode(ubyte[] str, string encoding, string error="strict");
Why isn't error an enum instead of a string?
Perhaps it would be useful to allow the user to define new error-handlers somehow, and provide a callback for them. (Python allows something like this.) This would allow you to, for instance, provide a different replacement character than the one provided by "replace".
Not a bad idea. I would like to suggest alternate function signatures:

//The error code for the callback
enum DecodeMode { ..no idea what goes here.. }

//The callback types
typedef char function(DecodeMode,char) DecodeCHandler;
typedef wchar function(DecodeMode,wchar) DecodeWHandler;
typedef dchar function(DecodeMode,dchar) DecodeDHandler;

//The decode functions
uint decode(byte[] str, char[] dst, string encoding, DecodeCHandler handler);
uint decode(byte[] str, wchar[] dst, string encoding, DecodeWHandler handler);
uint decode(byte[] str, dchar[] dst, string encoding, DecodeDHandler handler);

Technically 'char' in C is a signed byte, not an unsigned one, therefore byte[] is more accurate.
I don't agree with this last part. For starters, I had thought the signedness of 'char' in C was not defined. In any case, we're talking about chunks of arbitrary, homogenous binary data, so I think ubyte[] is most appropriate.

Here's another approach to the error handler thing:

typedef int error_t;

alias void delegate(string encoding, dchar, ref ubyte[]) encode_error_handler;
alias void delegate(string encoding, ubyte[], size_t, ref dchar) decode_error_handler;

error_t register_error(encode_error_handler dg1, decode_error_handler dg2);

error_t Strict, Ignore, Replace;

The register_error function would return a new, unique ID for a given error handler. A handler only wanting to handle encoding or decoding could simply pass null for the one it doesn't want to handle.

The encode_error_handler receives the encoding and the Unicode character that could not be encoded. It also has a 'ref ubyte[]' argument, which should be set to whatever the replacement character is. (It could be passed in as a slice over an internal buffer. Reducing its length should never cause an allocation.)

The decode_error_handler receives the encoding, the ubyte[] buffer, and the index of the character in it which could not be decoded. It also has a 'ref dchar' argument, which should be set to whatever the replacement character is.

Strict, Ignore, and Replace could be implemented like this:

static this()
{
    Strict = register_error(
        delegate void(string encoding, dchar c, ref ubyte[] dest)
        {
            throw new EncodeError(format(
                "Could not encode character \\u%x in encoding '%s'.",
                c, encoding));
        },
        delegate void(string encoding, ubyte[] buf, size_t idx, ref dchar dest)
        {
            throw new DecodeError(format(
                "Could not decode \\x%x from encoding '%s'.",
                buf[idx], encoding));
        }
    );

    Ignore = register_error(
        delegate void(string encoding, dchar c, ref ubyte[] dest)
        {
            dest = null;
        },
        delegate void(string encoding, ubyte[] buf, size_t idx, ref dchar dest)
        {
            dest = 0; // This would probably have to be special-cased.
        }
    );

    Replace = register_error(
        delegate void(string encoding, dchar c, ref ubyte[] dest)
        {
            dest.length = 1;
            dest[0] = '?';
        },
        delegate void(string encoding, ubyte[] buf, size_t idx, ref dchar dest)
        {
            dest = '\uFFFD'; // The Unicode REPLACEMENT CHARACTER
        }
    );
}
 I think you still want to use an enum to represent the cases the 
 callback needs to handle (assuming there is more than one) the same 
 handler function could be used for both encode and decode then.
 
 I think you want to pass the destination buffers, allowing 
 re-use/preallocation for efficiency.
 
The implementation could use doEncode and doDecode functions, analogous to doFormat, for efficiency.

void doEncode(void delegate(ubyte[]) dg, char[], string encoding, error_t handler);
void doEncode(void delegate(ubyte[]) dg, wchar[], string encoding, error_t handler);
void doEncode(void delegate(ubyte[]) dg, dchar[], string encoding, error_t handler);

void doDecode(void delegate(dchar str) dg, ubyte[], string encoding, error_t handler);

The ubyte[] arguments in the callbacks could be slices over an internal buffer. No allocation is necessary.

-- 
Kirk McDonald
http://kirkmcdonald.blogspot.com
Pyd: Connecting D and Python
http://pyd.dsource.org
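As a quick usage sketch of this scheme (purely illustrative; it presumes encode() grows an overload taking an error_t, as doEncode above does), a caller could register a handler that substitutes '*' rather than '?':

error_t Star;

static this()
{
    Star = register_error(
        delegate void(string encoding, dchar c, ref ubyte[] dest)
        {
            dest.length = 1;
            dest[0] = '*'; // replace unencodable characters with '*'
        },
        null // no custom decode handling
    );
}

void usage_example()
{
    ubyte[] bytes = encode("naïve", "ascii", Star); // the 'ï' comes out as '*'
}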
Aug 20 2007
parent reply "Rioshin an'Harthen" <rharth75 hotmail.com> writes:
"Kirk McDonald" <kirklin.mcdonald gmail.com> kirjoitti viestissä 
news:facpkj$13ml$1 digitalmars.com...
 Regan Heath wrote:
 Kirk McDonald wrote:
 Technically 'char' in C is a signed byte, not an unsigned one therefore 
 byte[] is more accurate.
I don't agree with this last part. For starters, I had thought the signed-ness of 'char' in C was not defined. In any case, we're talking about chunks of arbitrary, homogenous binary data, so I think ubyte[] is most appropriate.
<ramble>
True. The C standard does not define the signedness of the char type. What it does require of the char type is a guarantee that the characters of the basic execution character set

A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
a b c d e f g h i j k l m n o p q r s t u v w x y z
0 1 2 3 4 5 6 7 8 9

fit into a char in such a way that they are non-negative.

Quote from the standard (ISO/IEC 9899:TC2 Committee Draft May 6, 2005):

6.2.5 Types

3 An object declared as type char is large enough to store any member of the basic execution character set. If a member of the basic execution character set is stored in a char object, its value is guaranteed to be nonnegative. If any other character is stored in a char object, the resulting value is implementation-defined but shall be within the range of values that can be represented in that type.

5.2.4.2.1 Size of integer types <limits.h>

- number of bits for smallest object that is not a bit-field (byte)
    CHAR_BIT 8
- minimum value for an object of type signed char
    SCHAR_MIN -127 // -(2^7 - 1)
- maximum value for an object of type signed char
    SCHAR_MAX 127 // 2^7 - 1
- maximum value for an object of type unsigned char
    UCHAR_MAX 255 // 2^8 - 1
- minimum value for an object of type char
    CHAR_MIN see below
- maximum value for an object of type char
    CHAR_MAX see below

2 If the value of an object of type char is treated as a signed integer when used in an expression, the value of CHAR_MIN shall be the same as that of SCHAR_MIN and the value of CHAR_MAX shall be the same as that of SCHAR_MAX. Otherwise, the value of CHAR_MIN shall be 0 and the value of CHAR_MAX shall be the same as that of UCHAR_MAX. The value of UCHAR_MAX shall equal 2^(CHAR_BIT) - 1.
</ramble>

So, applying this to the discussion would suggest that either byte[] or ubyte[] would be appropriate. However, the most natural would be to handle data as raw data without signs, thus ubyte[] feels more natural to use as the standard type for any data whatsoever.
Aug 20 2007
parent Kirk McDonald <kirklin.mcdonald gmail.com> writes:
Rioshin an'Harthen wrote:
 "Kirk McDonald" <kirklin.mcdonald gmail.com> kirjoitti viestissä 
 news:facpkj$13ml$1 digitalmars.com...
 
 Regan Heath wrote:

 Kirk McDonald wrote:
 Technically 'char' in C is a signed byte, not an unsigned one 
 therefore byte[] is more accurate.
I don't agree with this last part. For starters, I had thought the signed-ness of 'char' in C was not defined. In any case, we're talking about chunks of arbitrary, homogenous binary data, so I think ubyte[] is most appropriate.
<ramble>
True. The C standard does not define the signedness of the char type. What it does require of the char type is a guarantee that the characters of the basic execution character set

A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
a b c d e f g h i j k l m n o p q r s t u v w x y z
0 1 2 3 4 5 6 7 8 9

fit into a char in such a way that they are non-negative.

Quote from the standard (ISO/IEC 9899:TC2 Committee Draft May 6, 2005):

6.2.5 Types

3 An object declared as type char is large enough to store any member of the basic execution character set. If a member of the basic execution character set is stored in a char object, its value is guaranteed to be nonnegative. If any other character is stored in a char object, the resulting value is implementation-defined but shall be within the range of values that can be represented in that type.

5.2.4.2.1 Size of integer types <limits.h>

- number of bits for smallest object that is not a bit-field (byte)
    CHAR_BIT 8
- minimum value for an object of type signed char
    SCHAR_MIN -127 // -(2^7 - 1)
- maximum value for an object of type signed char
    SCHAR_MAX 127 // 2^7 - 1
- maximum value for an object of type unsigned char
    UCHAR_MAX 255 // 2^8 - 1
- minimum value for an object of type char
    CHAR_MIN see below
- maximum value for an object of type char
    CHAR_MAX see below

2 If the value of an object of type char is treated as a signed integer when used in an expression, the value of CHAR_MIN shall be the same as that of SCHAR_MIN and the value of CHAR_MAX shall be the same as that of SCHAR_MAX. Otherwise, the value of CHAR_MIN shall be 0 and the value of CHAR_MAX shall be the same as that of UCHAR_MAX. The value of UCHAR_MAX shall equal 2^(CHAR_BIT) - 1.
</ramble>

So, applying this to the discussion would suggest that either byte[] or ubyte[] would be appropriate. However, the most natural would be to handle data as raw data without signs, thus ubyte[] feels more natural to use as the standard type for any data whatsoever.
Although this is interesting, and it does agree with what I was saying, it is basically irrelevant. When passing a string to decode(), the bytes therein could be in any encoding, even one which has nothing to do with the above. (It could be in a multi-byte encoding!) None of those guarantees which the C standard requires apply to these raw bytes. Therefore ubyte[] is /definitely/ more appropriate.

-- 
Kirk McDonald
http://kirkmcdonald.blogspot.com
Pyd: Connecting D and Python
http://pyd.dsource.org
Aug 20 2007