digitalmars.D - Improving D's support of code-pages
- Kirk McDonald (111/111) Aug 18 2007 D's support for unicode is a wonderful thing. The ability to comprehend
- Walter Bright (4/39) Aug 18 2007 There's a big problem with this - what if the output is being sent to a
- Kirk McDonald (14/60) Aug 18 2007 It is not a small amount of work. Perhaps I will take a look at how big
- Kirk McDonald (10/74) Aug 18 2007 I should clarify this: When treating stdout like a file, it should be
- BCS (2/67) Aug 18 2007 "Stream" has a writef, so you can call writef for a file.
- Kirk McDonald (11/29) Aug 18 2007 But, again, files have no inherent encoding. (And if you are treating
- BCS (6/32) Aug 18 2007 I was looking the other way. So you are saying that only the console fun...
- Kirk McDonald (15/53) Aug 18 2007 The functions encode() and decode() are available on their own. If you
- Walter Bright (4/14) Aug 18 2007 The problem is that whatever is sent to a file should be the same as
- Kirk McDonald (22/42) Aug 18 2007 Let's say there's a function get_console_encoding() which returns the
- Walter Bright (11/39) Aug 18 2007 stdio can detect whether it is being written to the console or not.
- Kirk McDonald (9/60) Aug 18 2007 Pardon? I haven't said anything about stdio behaving differently whether...
- Walter Bright (3/6) Aug 18 2007 Ok, I misunderstood.
- Kirk McDonald (36/46) Aug 18 2007 I've been thinking about these issues more carefully. It is harder than
- kris (24/75) Aug 19 2007 Kirk -
- Walter Bright (8/36) Aug 19 2007 Sure.
- Roald Ribe (10/10) Aug 19 2007 Hi,
- Deewiant (19/27) Aug 19 2007 I asked about this when Tango was first announced, and was dismayed that...
- Anders F Björklund (35/66) Aug 19 2007 It was my understanding that D by design only supports UTF environments,
- Sean Kelly (5/10) Aug 19 2007 Tango converts the input args to UTF-8 on Win32 rather than just
- Anders F Björklund (7/16) Aug 19 2007 On Mac OS X it defaults to MacRoman, but you can change it to ISO-8859-1
- Anders F Björklund (4/7) Aug 20 2007 From my limited understanding, this automatic conversion seems to only
- Sean Kelly (7/15) Aug 20 2007 Yes. As far as I know, GDC works on Windows with cygwin but not with
- Anders F Björklund (9/18) Aug 20 2007 No, this is not correct. The "gdcwin" binaries are all about providing
- Sean Kelly (6/14) Aug 20 2007 Yup. They require an additional library to be linked as well. I take
- Anders F Björklund (4/11) Aug 20 2007 As long as it is available in MinGW, it shouldn't be any problem.
- Lars Noschinski (13/21) Aug 20 2007 Probably args should by (u)byte[][] anyway. Converting command line
- Leandro Lucarella (11/14) Aug 20 2007 Why isn't error an enum instead of a string?
- Kirk McDonald (10/19) Aug 20 2007 Perhaps it would be useful to allow the user to define new
- Regan Heath (28/43) Aug 20 2007 Not a bad idea.
- Kirk McDonald (75/127) Aug 20 2007 I don't agree with this last part. For starters, I had thought the
- Rioshin an'Harthen (50/58) Aug 20 2007
- Kirk McDonald (12/84) Aug 20 2007 Although this is interesting, and it does agree with what I was saying,
D's support for Unicode is a wonderful thing. The ability to comprehend UTF-8, -16, and -32 strings in a straightforward, native fashion is invaluable. However, the outside world consists of more than encodings of Unicode. The ability to deal with code pages in a straightforward manner should be considered absolutely vital. I will be describing what I think is the optimal way of dealing with code-pages. This includes some changes and additions to Phobos (or Tango, if that is your preferred platform; to be truthful I am unsure of the state of code pages in that library), as well as describing a new D idiom. (I am aware that Mango has some bindings to the ICU code-page conversion libraries, but this sort of functionality /really/ belongs in the standard library.) The idiom is this: A string not known to be encoded in UTF-8, -16, or -32 should be stored as a ubyte[]. All internal string manipulation should be done in one of the Unicode encoding types (char[], wchar[], or dchar[]), and all input and output should be done with the ubyte[] type. There are some exceptions to this, of course. If you're reading input which you know to be in one of D's Unicode encoding types, or writing something out in one of those formats, naturally there's no reason you shouldn't just read into or write from the D type directly. However, in many real-world situations, you are not reading something in a Unicode encoding, nor do you always want to write one out. This is particularly the case when writing something to the console. Not all Windows command lines or Linux shells are set up to handle UTF-8, though this is very common on Linux. My Windows command-line, however, uses the default of the ancient CP437, and this is not uncommon. The point is that, on many systems, outputting raw UTF-8 results in garbage. ---- Additions to Phobos ---- The first thing Phobos needs are the following functions. (Their basic interface has been cribbed from Python.) 
char[] decode(ubyte[] str, string encoding, string error="strict"); wchar[] wdecode(ubyte[] str, string encoding, string error="strict"); dchar[] ddecode(ubyte[] str, string encoding, string error="strict"); ubyte[] encode(char[] str, string encoding, string error="strict"); ubyte[] encode(wchar[] str, string encoding, string error="strict"); ubyte[] encode(dchar[] str, string encoding, string error="strict"); What follows is a description of these functions. For the sake of simplicity, I will only be referring to the char[] versions of these functions. The wchar[] and dchar[] versions should operate in an identical fashion. Let's say you've read in a file and stored it in a ubyte[]: ubyte[] file = something(); You're already in a bit of a situation, here, if you don't know the encoding of the file. If you've gotten this far without knowing it, or knowing how to get it, you probably need to re-think your design. Let's say you know the file is in Latin-1. Since all of D's string-processing facilities expect to deal with a Unicode encoding, you want to convert this to UTF-8. You should just be able to decode it: char[] str = decode(file, "latin-1"); And, ta-da! Your string is now converted to UTF-8, and all of D's string processing abilities can be brought to bear. Now let's say that, after you've done whatever you were going to with the string, you want to write it back out in Latin-1. This is just a simple call to encode: ubyte[] new_file = encode(str, "latin-1"); But wait! What if the UTF-8 string contains characters which are not valid Latin-1 characters? This is where the 'error' parameter comes into play. (Note that the error parameter is present in both encode and decode.) This parameter has three valid values: * "strict" causes an exception to be thrown. This is the default. * "ignore" causes the invalid characters to simply be ignored, and elided from the returned string. 
* "replace" causes the invalid characters to be replaced with a suitable replacement character. When calling decode, this should be the official U+FFFD REPLACEMENT CHARACTER. When calling encode, something specific to the code-page would have to be chosen; a '?' would be appropriate in the various ASCII-based code pages. Using strings rather than an enum means this functionality could be extended by the user in the future (as Python allows). Latin-1 is not a very interesting encoding. The situation gets more interesting if we are talking about a multi-byte encoding, such as UTF-16. So let's say we're reading a file encoded in UTF-16: ubyte[] utf16_file = whatever(); char[] str = decode(utf16_file, "utf-16"); While you /could/ simply cast the ubyte[] to a wchar[], this code has the advantage of totally separating the encoding which your program's input is in from the type with which you represent the data internally. Using UTF-16 also means you might have errors during decoding, if there are invalid UTF-16 code units in the input string. These functions might fit into std.string, although a new module such as std.codepages would work, as well. ---- Improvements to Phobos ---- The behavior of writef (and perhaps of D's formatting in general) must be altered. Currently, printing a char[] causes D to output the raw bytes in the string. As I previously mentioned, this is not a good thing. On many platforms, this can easily result in garbage being printed to the screen. I propose changing writef to check the console's encoding, and to attempt to encode the output in that encoding. Then it can simply output the resulting raw bytes. Checking this encoding is a platform-specific operation, but essentially every platform (particularly Linux, Windows, and OS X) has a way to do it. If the string cannot be encoded in that encoding, the exception thrown by encode() should be allowed to propagate and terminate the program (or be caught by the user). 
If the user wishes to avoid that exception, they should call encode() explicitly themselves. For this reason, Phobos will also need a function for retrieving the console's default encoding made available to the user. This implies something else: Printing a ubyte[] should cause those actual bytes to be printed directly. While it is currently possible to do this with e.g. std.cstream.dout.write(), it would be very convenient to do this with writef, especially combined with encode(). -- Kirk McDonald http://kirkmcdonald.blogspot.com Pyd: Connecting D and Python http://pyd.dsource.org
Aug 18 2007
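[Since the proposed interface is explicitly cribbed from Python, the intended semantics of the error parameter can be demonstrated with Python's actual codec API. This is a runnable sketch of the model, not the proposed D functions themselves; the D decode()/encode() calls would behave analogously.]

```python
# Python's bytes.decode / str.encode are the model for the proposed
# decode()/encode() functions, including the error-handling parameter.
raw = bytes([0x68, 0xE9, 0x6C, 0x6C, 0x6F])   # "héllo" encoded in Latin-1

s = raw.decode("latin-1")                     # now a proper Unicode string
assert s == "h\u00e9llo"

# Encoding to a code page that cannot represent every character:
text = "h\u00e9llo\u2603"                     # U+2603 SNOWMAN is not in Latin-1

try:
    text.encode("latin-1")                    # "strict" is the default: raises
except UnicodeEncodeError:
    pass

ignored = text.encode("latin-1", "ignore")    # invalid character simply elided
replaced = text.encode("latin-1", "replace")  # replaced with a '?'
assert ignored == b"h\xe9llo"
assert replaced == b"h\xe9llo?"
```

Note how "ignore" and "replace" match the behaviors described in the post: elision versus substitution with an encoding-appropriate replacement character.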
Kirk McDonald wrote:---- Additions to Phobos ---- The first thing Phobos needs are the following functions. (Their basic interface has been cribbed from Python.) char[] decode(ubyte[] str, string encoding, string error="strict"); wchar[] wdecode(ubyte[] str, string encoding, string error="strict"); dchar[] ddecode(ubyte[] str, string encoding, string error="strict"); ubyte[] encode(char[] str, string encoding, string error="strict"); ubyte[] encode(wchar[] str, string encoding, string error="strict"); ubyte[] encode(dchar[] str, string encoding, string error="strict");If you (or someone else) wants to write these, I'll put them in.---- Improvements to Phobos ---- The behavior of writef (and perhaps of D's formatting in general) must be altered. Currently, printing a char[] causes D to output the raw bytes in the string. As I previously mentioned, this is not a good thing. On many platforms, this can easily result in garbage being printed to the screen. I propose changing writef to check the console's encoding, and to attempt to encode the output in that encoding. Then it can simply output the resulting raw bytes. Checking this encoding is a platform-specific operation, but essentially every platform (particularly Linux, Windows, and OS X) has a way to do it. If the string cannot be encoded in that encoding, the exception thrown by encode() should be allowed to propagate and terminate the program (or be caught by the user). If the user wishes to avoid that exception, they should call encode() explicitly themselves. For this reason, Phobos will also need a function for retrieving the console's default encoding made available to the user.There's a big problem with this - what if the output is being sent to a file?
Aug 18 2007
Walter Bright wrote:Kirk McDonald wrote:It is not a small amount of work. Perhaps I will take a look at how big of a problem it is (after the conference).---- Additions to Phobos ---- The first thing Phobos needs are the following functions. (Their basic interface has been cribbed from Python.) char[] decode(ubyte[] str, string encoding, string error="strict"); wchar[] wdecode(ubyte[] str, string encoding, string error="strict"); dchar[] ddecode(ubyte[] str, string encoding, string error="strict"); ubyte[] encode(char[] str, string encoding, string error="strict"); ubyte[] encode(wchar[] str, string encoding, string error="strict"); ubyte[] encode(dchar[] str, string encoding, string error="strict");If you (or someone else) wants to write these, I'll put them in.Files have no inherent encoding, only the console does. In this way, writing to a file is different than writing to the console. The user must explicitly provide an encoding when writing to a file; or, if they are writing a char[], wchar[], or dchar[], the encoding will be UTF-8, -16, or -32. (Writing a char[] implies an encoding, while writing a ubyte[] does not.) -- Kirk McDonald http://kirkmcdonald.blogspot.com Pyd: Connecting D and Python http://pyd.dsource.org---- Improvements to Phobos ---- The behavior of writef (and perhaps of D's formatting in general) must be altered. Currently, printing a char[] causes D to output the raw bytes in the string. As I previously mentioned, this is not a good thing. On many platforms, this can easily result in garbage being printed to the screen. I propose changing writef to check the console's encoding, and to attempt to encode the output in that encoding. Then it can simply output the resulting raw bytes. Checking this encoding is a platform-specific operation, but essentially every platform (particularly Linux, Windows, and OS X) has a way to do it. 
If the string cannot be encoded in that encoding, the exception thrown by encode() should be allowed to propagate and terminate the program (or be caught by the user). If the user wishes to avoid that exception, they should call encode() explicitly themselves. For this reason, Phobos will also need a function for retrieving the console's default encoding made available to the user.There's a big problem with this - what if the output is being sent to a file?
Aug 18 2007
Kirk McDonald wrote:Walter Bright wrote:I should clarify this: When treating stdout like a file, it should be like any other file: writing to it writes raw bytes. But when calling writef, which is not treating it like a file, it should attempt to encode the output into the console's default encoding. -- Kirk McDonald http://kirkmcdonald.blogspot.com Pyd: Connecting D and Python http://pyd.dsource.orgKirk McDonald wrote:It is not a small amount of work. Perhaps I will take a look at how big of a problem it is (after the conference).---- Additions to Phobos ---- The first thing Phobos needs are the following functions. (Their basic interface has been cribbed from Python.) char[] decode(ubyte[] str, string encoding, string error="strict"); wchar[] wdecode(ubyte[] str, string encoding, string error="strict"); dchar[] ddecode(ubyte[] str, string encoding, string error="strict"); ubyte[] encode(char[] str, string encoding, string error="strict"); ubyte[] encode(wchar[] str, string encoding, string error="strict"); ubyte[] encode(dchar[] str, string encoding, string error="strict");If you (or someone else) wants to write these, I'll put them in.Files have no inherent encoding, only the console does. In this way, writing to a file is different than writing to the console. The user must explcitly provide an encoding when writing to a file; or, if they are writing a char[], wchar[], or dchar[], the encoding will be UTF-8, -16, or -32. (Writing a char[] implies an encoding, while writing a ubyte[] does not.)---- Improvements to Phobos ---- The behavior of writef (and perhaps of D's formatting in general) must be altered. Currently, printing a char[] causes D to output the raw bytes in the string. As I previously mentioned, this is not a good thing. On many platforms, this can easily result in garbage being printed to the screen. I propose changing writef to check the console's encoding, and to attempt to encode the output in that encoding. 
Then it can simply output the resulting raw bytes. Checking this encoding is a platform-specific operation, but essentially every platform (particularly Linux, Windows, and OS X) has a way to do it. If the string cannot be encoded in that encoding, the exception thrown by encode() should be allowed to propagate and terminate the program (or be caught by the user). If the user wishes to avoid that exception, they should call encode() explicitly themselves. For this reason, Phobos will also need a function for retrieving the console's default encoding made available to the user.There's a big problem with this - what if the output is being sent to a file?
Aug 18 2007
Reply to Kirk,Kirk McDonald wrote:"Stream" has a writef, so you can call writef for a file.Walter Bright wrote:I should clarify this: When treating stdout like a file, it should be like any other file: writing to it writes raw bytes. But when calling writef, which is not treating it like a file, it should attempt to encode the output into the console's default encoding.Kirk McDonald wrote:It is not a small amount of work. Perhaps I will take a look at how big of a problem it is (after the conference).---- Additions to Phobos ---- The first thing Phobos needs are the following functions. (Their basic interface has been cribbed from Python.) char[] decode(ubyte[] str, string encoding, string error="strict"); wchar[] wdecode(ubyte[] str, string encoding, string error="strict"); dchar[] ddecode(ubyte[] str, string encoding, string error="strict"); ubyte[] encode(char[] str, string encoding, string error="strict"); ubyte[] encode(wchar[] str, string encoding, string error="strict"); ubyte[] encode(dchar[] str, string encoding, string error="strict");If you (or someone else) wants to write these, I'll put them in.Files have no inherent encoding, only the console does. In this way, writing to a file is different than writing to the console. The user must explcitly provide an encoding when writing to a file; or, if they are writing a char[], wchar[], or dchar[], the encoding will be UTF-8, -16, or -32. (Writing a char[] implies an encoding, while writing a ubyte[] does not.)---- Improvements to Phobos ---- The behavior of writef (and perhaps of D's formatting in general) must be altered. Currently, printing a char[] causes D to output the raw bytes in the string. As I previously mentioned, this is not a good thing. On many platforms, this can easily result in garbage being printed to the screen. I propose changing writef to check the console's encoding, and to attempt to encode the output in that encoding. Then it can simply output the resulting raw bytes. 
Checking this encoding is a platform-specific operation, but essentially every platform (particularly Linux, Windows, and OS X) has a way to do it. If the string cannot be encoded in that encoding, the exception thrown by encode() should be allowed to propagate and terminate the program (or be caught by the user). If the user wishes to avoid that exception, they should call encode() explicitly themselves. For this reason, Phobos will also need a function for retrieving the console's default encoding made available to the user.There's a big problem with this - what if the output is being sent to a file?
Aug 18 2007
BCS wrote:Reply to Kirk,But, again, files have no inherent encoding. (And if you are treating stdout as a file, it shouldn't have one, either.) This business about implicitly encoding things should be limited to the std.stdio.writef (and writefln, &c) function. Treating stdout as a file should be considered a way to get 'raw' access to stdout. -- Kirk McDonald http://kirkmcdonald.blogspot.com Pyd: Connecting D and Python http://pyd.dsource.orgKirk McDonald wrote:"Stream" has a writef, so you can call writef for a file.Files have no inherent encoding, only the console does. In this way, writing to a file is different than writing to the console. The user must explcitly provide an encoding when writing to a file; or, if they are writing a char[], wchar[], or dchar[], the encoding will be UTF-8, -16, or -32. (Writing a char[] implies an encoding, while writing a ubyte[] does not.)I should clarify this: When treating stdout like a file, it should be like any other file: writing to it writes raw bytes. But when calling writef, which is not treating it like a file, it should attempt to encode the output into the console's default encoding.
Aug 18 2007
Reply to Kirk,BCS wrote:I was looking the other way. So you are saying that only the console functions should have the code page stuff? what about a dout? it goes to the console and also has a writef. I'm not putting down your idea, I'm just looking for (and hoping not to find) problems.Reply to Kirk,But, again, files have no inherent encoding. (And if you are treating stdout as a file, it shouldn't have one, either.) This business about implicitly encoding things should be limited to the std.stdio.writef (and writefln, &c) function. Treating stdout as a file should be considered a way to get 'raw' access to stdout.Kirk McDonald wrote:"Stream" has a writef, so you can call writef for a file.Files have no inherent encoding, only the console does. In this way, writing to a file is different than writing to the console. The user must explcitly provide an encoding when writing to a file; or, if they are writing a char[], wchar[], or dchar[], the encoding will be UTF-8, -16, or -32. (Writing a char[] implies an encoding, while writing a ubyte[] does not.)I should clarify this: When treating stdout like a file, it should be like any other file: writing to it writes raw bytes. But when calling writef, which is not treating it like a file, it should attempt to encode the output into the console's default encoding.
Aug 18 2007
BCS wrote:Reply to Kirk,The functions encode() and decode() are available on their own. If you want to explicitly encode something you're writing to a file, you can simply say e.g.: somefile.write(encode(some_utf8_string, "cp437")); Only the console functions would call this /implicitly/, since only they /have/ an implicit encoding (which is the console's encoding as reported by the OS). Since you might not always want to encode the stuff you print out, you should be able to use std.cstream.dout.writef() to get around this. -- Kirk McDonald http://kirkmcdonald.blogspot.com Pyd: Connecting D and Python http://pyd.dsource.orgBCS wrote:I was looking the other way. So you are saying that only the console functions should have the code page stuff? what about a dout? it goes to the console and also has a writef. I'm not putting down your idea, I'm just looking for (and hoping not to find) problems.Reply to Kirk,But, again, files have no inherent encoding. (And if you are treating stdout as a file, it shouldn't have one, either.) This business about implicitly encoding things should be limited to the std.stdio.writef (and writefln, &c) function. Treating stdout as a file should be considered a way to get 'raw' access to stdout.Kirk McDonald wrote:"Stream" has a writef, so you can call writef for a file.Files have no inherent encoding, only the console does. In this way, writing to a file is different than writing to the console. The user must explcitly provide an encoding when writing to a file; or, if they are writing a char[], wchar[], or dchar[], the encoding will be UTF-8, -16, or -32. (Writing a char[] implies an encoding, while writing a ubyte[] does not.)I should clarify this: When treating stdout like a file, it should be like any other file: writing to it writes raw bytes. But when calling writef, which is not treating it like a file, it should attempt to encode the output into the console's default encoding.
Aug 18 2007
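[The explicit-encoding file write sketched above — somefile.write(encode(some_utf8_string, "cp437")) — has a direct equivalent in Python, the language the proposed API is modeled on. A hedged, runnable illustration; the file path and contents are invented for the example:]

```python
# Writing a Unicode string to a file in an explicit code page (CP437),
# mirroring somefile.write(encode(some_utf8_string, "cp437")).
import os
import tempfile

some_string = "Box: \u2500\u2502"         # CP437 contains box-drawing characters

path = os.path.join(tempfile.mkdtemp(), "out.txt")
with open(path, "wb") as f:               # binary mode: the file receives raw bytes
    f.write(some_string.encode("cp437"))  # explicit encode, as in the proposal

with open(path, "rb") as f:               # reading back gives bytes, not text;
    raw = f.read()                        # the file itself has no inherent encoding
assert raw == b"Box: \xc4\xb3"
```

The file is opened in binary mode on both ends, which is the point being made in the thread: a file is a byte sink, and the encoding step is the caller's explicit choice.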
Kirk McDonald wrote:Walter Bright wrote:The problem is that whatever is sent to a file should be the same as what is sent to the screen. Consider if stdout is piped to another application - what should happen?There's a big problem with this - what if the output is being sent to a file?Files have no inherent encoding, only the console does. In this way, writing to a file is different than writing to the console. The user must explcitly provide an encoding when writing to a file; or, if they are writing a char[], wchar[], or dchar[], the encoding will be UTF-8, -16, or -32. (Writing a char[] implies an encoding, while writing a ubyte[] does not.)
Aug 18 2007
Walter Bright wrote:Kirk McDonald wrote:Let's say there's a function get_console_encoding() which returns the console's current encoding. I'm simply proposing that this: char[] str = something(); std.stdio.writefln(str); Should end up being equivalent to this: std.cstream.dout.writefln(encode(str, get_console_encoding())); So, to answer your question, if you use std.stdio.writefln, you will send a string encoded in the console's default encoding to the other application's stdin. This is the encoding the other application should be expecting, anyway (unless it isn't; code pages are annoying like that). In any event, if you explicitly want to output something in a particular encoding, this /will/ work: std.stdio.writefln(encode(str, whatever)); This is because encode() returns a ubyte[], and writefln should print the data in ubyte[]s directly, as I suggested in my original post for this precise reason. -- Kirk McDonald http://kirkmcdonald.blogspot.com Pyd: Connecting D and Python http://pyd.dsource.orgWalter Bright wrote:The problem is that whatever is sent to a file should be the same as what is sent to the screen. Consider if stdout is piped to another application - what should happen?There's a big problem with this - what if the output is being sent to a file?Files have no inherent encoding, only the console does. In this way, writing to a file is different than writing to the console. The user must explcitly provide an encoding when writing to a file; or, if they are writing a char[], wchar[], or dchar[], the encoding will be UTF-8, -16, or -32. (Writing a char[] implies an encoding, while writing a ubyte[] does not.)
Aug 18 2007
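[get_console_encoding() is only proposed in this thread, but Python — the stated model for the API — already performs the equivalent lookup at startup and uses it when print() writes to stdout. A runnable sketch of that existing behavior:]

```python
# How Python discovers the console encoding that the proposed
# get_console_encoding() would return on each platform.
import locale
import sys

# sys.stdout.encoding is set from the console/locale; fall back to the
# locale's preferred encoding when stdout has been replaced or redirected.
console_encoding = getattr(sys.stdout, "encoding", None) \
    or locale.getpreferredencoding(False)
assert isinstance(console_encoding, str) and len(console_encoding) > 0

# print(s) is then roughly equivalent to writing
# s.encode(console_encoding) to the underlying byte stream.
data = "hello".encode(console_encoding, "replace")
assert isinstance(data, bytes)
```

On Linux this typically yields the locale's charset (often UTF-8); on Windows, the console code page — exactly the per-platform lookup the post says every major OS provides.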
Kirk McDonald wrote:Walter Bright wrote:stdio can detect whether it is being written to the console or not. That's fine. The problem is: foo will generate one kind of output. foo | more will do something else. This will result in a nice cascade of bug reports. There's also: foo >output cat output | more which will do something else, too.The problem is that whatever is sent to a file should be the same as what is sent to the screen. Consider if stdout is piped to another application - what should happen?Let's say there's a function get_console_encoding() which returns the console's current encoding. I'm simply proposing that this: char[] str = something(); std.stdio.writefln(str); Should end up being equivalent to this: std.cstream.dout.writefln(encode(str, get_console_encoding())); So, to answer your question, if you use std.stdio.writefln, you will send a string encoded in the console's default encoding to the other application's stdin. This is the encoding the other application should be expecting, anyway (unless it isn't; code pages are annoying like that). In any event, if you explicitly want to output something in a particular encoding, this /will/ work: std.stdio.writefln(encode(str, whatever)); This is because encode() returns a ubyte[], and writefln should print the data in ubyte[]s directly, as I suggested in my original post for this precise reason.
Aug 18 2007
Walter Bright wrote:Kirk McDonald wrote:Pardon? I haven't said anything about stdio behaving differently whether it's printing to the console or not. writefln() would /always/ attempt to encode in the console's encoding. -- Kirk McDonald http://kirkmcdonald.blogspot.com Pyd: Connecting D and Python http://pyd.dsource.orgWalter Bright wrote:stdio can detect whether it is being written to the console or not. That's fine. The problem is: foo will generate one kind of output. foo | more will do something else. This will result in a nice cascade of bug reports. There's also: foo >output cat output | more which will do something else, too.The problem is that whatever is sent to a file should be the same as what is sent to the screen. Consider if stdout is piped to another application - what should happen?Let's say there's a function get_console_encoding() which returns the console's current encoding. I'm simply proposing that this: char[] str = something(); std.stdio.writefln(str); Should end up being equivalent to this: std.cstream.dout.writefln(encode(str, get_console_encoding())); So, to answer your question, if you use std.stdio.writefln, you will send a string encoded in the console's default encoding to the other application's stdin. This is the encoding the other application should be expecting, anyway (unless it isn't; code pages are annoying like that). In any event, if you explicitly want to output something in a particular encoding, this /will/ work: std.stdio.writefln(encode(str, whatever)); This is because encode() returns a ubyte[], and writefln should print the data in ubyte[]s directly, as I suggested in my original post for this precise reason.
Aug 18 2007
Kirk McDonald wrote:Pardon? I haven't said anything about stdio behaving differently whether it's printing to the console or not. writefln() would /always/ attempt to encode in the console's encoding.Ok, I misunderstood. Now, what if stdout is reopened to be a file?
Aug 18 2007
Walter Bright wrote:Kirk McDonald wrote:I've been thinking about these issues more carefully. It is harder than I initially thought. :-) Ignoring my ideas of implicitly encoding writefln's output, I regard the encode/decode functions as vital. These alone would improve the current situation immensely. Printing ubyte[] arrays as the "raw bytes" therein when using writef() is basically nonsense, thanks to the fact that doFormat itself is Unicode aware. I should have realized this sooner. However, you can still write them with dout.write(). This should be adequate. Here is another proposal regarding implicit encoding, slightly modified from my first one: The Stream class should be modified to have an encoding attribute. This should usually be null. If it is present, output should be encoded into that encoding. (To facilitate this, the encoding module should provide a doEncode function, analogous to the doFormat function, which has a void delegate(ubyte) or possibly a void delegate(ubyte[]) callback.) Next, std.stdio.writef should be modified to write to the object referenced by std.cstream.dout, instead of the FILE* stdout. The next step is obvious: std.cstream.dout's encoding attribute should be set to the console's encoding. Finally, though dout should obviously remain a CFile instance, it should be stored in a Stream reference. If another Stream object is substituted for dout, then the behavior of writefln (and anything else relying on dout) would be redirected. Whether the output is still implicitly encoded would depend entirely on this new object's encoding attribute. It occurs to me that this could be somewhat slow. Examination of the source reveals that every printed character from dout is the result of a virtual method call. However, I do wonder how important the performance of printing to the console really is. Thoughts? Is this a thoroughly stupid idea? 
-- Kirk McDonald http://kirkmcdonald.blogspot.com Pyd: Connecting D and Python http://pyd.dsource.orgPardon? I haven't said anything about stdio behaving differently whether it's printing to the console or not. writefln() would /always/ attempt to encode in the console's encoding.Ok, I misunderstood. Now, what if stdout is reopened to be a file?
Aug 18 2007
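[A stream that carries an encoding attribute and transcodes on write — the design Kirk sketches above — is essentially what Python's io.TextIOWrapper provides. As a runnable analogue of the proposed Stream.encoding (BytesIO stands in for the console or file underneath):]

```python
# A byte stream wrapped with an encoding attribute; writes are transcoded
# before reaching the underlying raw stream -- the same shape as the
# proposed Stream.encoding / doEncode design.
import io

raw = io.BytesIO()                    # stands in for the console or a file
stream = io.TextIOWrapper(raw, encoding="cp437", errors="strict")

stream.write("caf\u00e9")             # é exists in CP437 (byte 0x82)
stream.flush()
assert raw.getvalue() == b"caf\x82"   # what actually hit the byte stream

# Substituting a different wrapper (or encoding) redirects and re-encodes
# all subsequent output, just as swapping another Stream object into dout
# would in the proposal.
```

Python pays the same per-write dispatch cost Kirk worries about, and mitigates it with buffering in the wrapper rather than making console output a fast path.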
Kirk - It's not a stupid idea, but you may not have all the necessary pieces? For example, this kind of processing should probably not be bound to an application by default (bloat?) and thus you'd perhaps need some mechanism to (dynamically) attach custom processing onto a stream? Tango supports this via stream filters, and deewiant (for example) has an output filter for doing specific code-page conversion. Tango also has UnicodeFile as a template for converting between internal utf8/16/32 and an external UTF representation (all 8 varieties) along with BOM support; much as you were describing earlier. The console is a PITA when it comes to encodings, especially when redirection is involved. Thus, we decided long ago that Tango would be utf8 only for console IO, and for all variations thereof ... gives it a known state. From there, either a filter or a replacement console-device can be injected into the IO framework for customization purposes. Unix has a good lib for code-page support, called iconv. The IBM ICU project also has extensive code-page support, along with a bus, helicopter, cruise-liner, and a kitchen-sink, all wrapped up in a very powerful (UTF16) API. But the latter is too heavyweight to be embedded in a core library, which is why those wrappers still reside in Mango rather than Tango. On the other hand, Tango does have a codepage API much like what you suggest, as a free-function lightweight converter - Kris Kirk McDonald wrote:Walter Bright wrote:Kirk McDonald wrote:I've been thinking about these issues more carefully. It is harder than I initially thought. :-) Ignoring my ideas of implicitly encoding writefln's output, I regard the encode/decode functions as vital. These alone would improve the current situation immensely. Printing ubyte[] arrays as the "raw bytes" therein when using writef() is basically nonsense, thanks to the fact that doFormat itself is Unicode aware. I should have realized this sooner. 
However, you can still write them with dout.write(). This should be adequate. Here is another proposal regarding implicit encoding, slightly modified from my first one: The Stream class should be modified to have an encoding attribute. This should usually be null. If it is present, output should be encoded into that encoding. (To facilitate this, the encoding module should provide a doEncode function, analogous to the doFormat function, which has a void delegate(ubyte) or possibly a void delegate(ubyte[]) callback.) Next, std.stdio.writef should be modified to write to the object referenced by std.cstream.dout, instead of the FILE* stdout. The next step is obvious: std.cstream.dout's encoding attribute should be set to the console's encoding. Finally, though dout should obviously remain a CFile instance, it should be stored in a Stream reference. If another Stream object is substituted for dout, then the behavior of writefln (and anything else relying on dout) would be redirected. Whether the output is still implicitly encoded would depend entirely on this new object's encoding attribute. It occurs to me that this could be somewhat slow. Examination of the source reveals that every printed character from dout is the result of a virtual method call. However, I do wonder how important the performance of printing to the console really is. Thoughts? Is this a thoroughly stupid idea?Pardon? I haven't said anything about stdio behaving differently whether it's printing to the console or not. writefln() would /always/ attempt to encode in the console's encoding.Ok, I misunderstood. Now, what if stdout is reopened to be a file?
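The encoding-attribute proposal above can be sketched in a few lines. Python is used here purely for illustration; EncodedStream and its methods are hypothetical stand-ins for the proposed Stream changes, not any real Phobos API:

```python
import io

class EncodedStream:
    """Hypothetical sketch of the proposed Stream 'encoding' attribute:
    formatted text is encoded on the way out; raw bytes pass through."""

    def __init__(self, sink, encoding=None):
        self.sink = sink          # anything with a write(bytes) method
        self.encoding = encoding  # None means "no implicit encoding"

    def writef(self, text):
        # Encode into the stream's encoding; a failure here propagates,
        # just as the proposal suggests for writefln().
        self.sink.write(text.encode(self.encoding))

    def write(self, raw):
        self.sink.write(raw)      # raw bytes go out verbatim

# Simulate a console whose code page is CP437:
console = io.BytesIO()
out = EncodedStream(console, encoding="cp437")
out.writef("résumé")              # arrives as b'r\x82sum\x82'
```

Substituting a different sink (a file, a pipe) changes where the bytes go, while the encoding attribute alone decides whether implicit conversion happens — which is the crux of the proposal.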
Aug 19 2007
Kirk McDonald wrote:I've been thinking about these issues more carefully. It is harder than I initially thought. :-)<g>Ignoring my ideas of implicitly encoding writefln's output, I regard the encode/decode functions as vital. These alone would improve the current situation immensely.Sure.The Stream class should be modified to have an encoding attribute. This should usually be null. If it is present, output should be encoded into that encoding. (To facilitate this, the encoding module should provide a doEncode function, analogous to the doFormat function, which has a void delegate(ubyte) or possibly a void delegate(ubyte[]) callback.) Next, std.stdio.writef should be modified to write to the object referenced by std.cstream.dout, instead of the FILE* stdout. The next step is obvious: std.cstream.dout's encoding attibute should be set to the console's encoding. Finally, though dout should obviously remain a CFile instance, it should be stored in a Stream reference. If another Stream object is substituted for dout, then the behavior of writefln (and anything else relying on dout) would be redirected. Whether the output is still implicitly encoded would depend entirely on this new object's encoding attribute. It occurs to me that this could be somewhat slow. Examination of the source reveals that every printed character from dout is the result of a virtual method call. However, I do wonder how important the performance of printing to the console really is. Thoughts? Is this a thoroughly stupid idea?I generally wish to avoid merging writef with streams, for performance reasons. Currently, stdout is marked as being "char" or "wchar", and writef does the conversions. It could possibly also be marked as "UTF8" or "whatever", too.
Aug 19 2007
Hi, Since D supports Win32 only and not older Windows versions, have you considered setting the console in a D compatible mode, rather than making D output in console compatible ways? I am not sure if this can be done to a console that is already created, and it may just work on NT platforms, but I seem to remember that there is a console function that changes the console operation into UTF16 mode. Anyway, it's just a thought that someone may want to investigate. Roald
Aug 19 2007
Kirk McDonald wrote:The idiom is this: A string not known to be encoded in UTF-8, -16, or -32 should be stored as a ubyte[]. All internal string manipulation should be done in one of the Unicode encoding types (char[], wchar[], or dchar[]), and all input and output should be done with the ubyte[] type.I asked about this when Tango was first announced, and was dismayed that this wasn't the case. Good that somebody else has the same thought. I tried doing this in an application manually, but it resulted in so many casts (ubyte[] to char[] for the standard library functions, the other way for their return values) that I gave up. It's the same way for both Phobos and Tango.This implies something else: Printing a ubyte[] should cause those actual bytes to be printed directly. While it is currently possible to do this with e.g. std.cstream.dout.write(), it would be very convenient to do this with writef, especially combined with encode().Tango still doesn't have out-of-the-box support for just sending bytes to output, although I'm doing my best to get what I've coded to do it to be added. One problem is, as you said in another post, that std.format.doFormat / tango.text.convert.Format are Unicode aware. Dealing with non-Unicode in a D app is very difficult without conversion to UTF-(8|16|32), which is potentially expensive. Another problem is that essentially every C binding out there uses 'char' when they really mean 'ubyte'. Without implicit casts from char to ubyte and vice versa, this really doesn't work in practice, and with it, the theory breaks down. All in all it's a very complicated problem, as you've noted. If you can find a good and actually working solution, great. But I don't think it's here yet. -- Remove ".doesnotlike.spam" from the mail address.
Aug 19 2007
Kirk McDonald wrote:However, in many real-world situations, you are not reading something in a Unicode encoding, nor do you always want to write one out. This is particularly the case when writing something to the console. Not all Windows command lines or Linux shells are set up to handle UTF-8, though this is very common on Linux. My Windows command-line, however, uses the default of the ancient CP437, and this is not uncommon. The point is that, on many systems, outputting raw UTF-8 results in garbage.It was my understanding that D by design only supports UTF environments, and the behaviour on legacy systems (CP437/ISO-8859-1) is "undefined"... It's not only output, if you run on such a system and try to read the args (char[][]) you can get an UTF exception due to it being malformed. i.e. the current behaviour is just reading the raw bytes and pretending that it is UTF, whether that's true or not (exceptions and/or garbage)---- Improvements to Phobos ---- The behavior of writef (and perhaps of D's formatting in general) must be altered. Currently, printing a char[] causes D to output the raw bytes in the string. As I previously mentioned, this is not a good thing. On many platforms, this can easily result in garbage being printed to the screen.By design, I thought. As usual everything "works" for ASCII characters. Not that bad for a trade-off between the whatever-the-system-uses of C and lets-include-every-weird-encoding-ever-in-the-core-library of Java ?I propose changing writef to check the console's encoding, and to attempt to encode the output in that encoding. Then it can simply output the resulting raw bytes. Checking this encoding is a platform-specific operation, but essentially every platform (particularly Linux, Windows, and OS X) has a way to do it. If the string cannot be encoded in that encoding, the exception thrown by encode() should be allowed to propagate and terminate the program (or be caught by the user). 
If the user wishes to avoid that exception, they should call encode() explicitly themselves. For this reason, Phobos will also need a function for retrieving the console's default encoding made available to the user.Probably not a bad idea (Java does something like this), but it would bloat the standard library. Adding support for common legacy encodings like cp437/cp1252/iso88591/roman wouldn't be unthinkable in principle, but it's hard to "draw the line" and much easier to only support UTF-8 ? If you want some code for doing such conversions, I have old "mapping" and "libiconv" modules on my home page at http://www.algonet.se/~afb/d/ /// converts a 8-bit charset encoding string into unicode char[] decode_string(ubyte[] string, wchar[256] mapping); /// converts a unicode string into 8-bit charset encoding ubyte[] encode_string(char[] string, wchar[256] mapping); (http://www.digitalmars.com/d/archives/digitalmars/D/12967.html) /// allocate a converter between charsets fromcode and tocode extern (C) iconv_t iconv_open (char *tocode, char *fromcode); /// convert inbuf to outbuf and set inbytesleft to unused input and /// outbuf to unused output and return number of non-reversable /// conversions or -1 on error. extern (C) size_t iconv (iconv_t cd, void **inbuf, size_t *inbytesleft, void **outbuf, size_t *outbytesleft); Mapping ISO-8859-1 (Latin-1) to UTF-8 is by far the easiest, see: http://www.prowiki.org/wiki4d/wiki.cgi?CharsAndStrs (under 8-bit)This implies something else: Printing a ubyte[] should cause those actual bytes to be printed directly. While it is currently possible to do this with e.g. std.cstream.dout.write(), it would be very convenient to do this with writef, especially combined with encode().Printing ubytes would be nice, currently that's easiest with printf... But adding codepages to D feels a little like adding 16-bit support :-) --anders
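The wchar[256] mapping-table approach mentioned above is simple enough to sketch. This Python illustration hand-fills only two CP437 entries for demonstration; a real table would be generated from the published charset mapping files:

```python
# Identity for ASCII; two CP437 entries filled in by hand (0x82 and
# 0xE1 are the standard CP437 mappings for é and ß). Everything else
# falls back to U+FFFD for this sketch.
mapping = [chr(i) for i in range(128)] + ["\ufffd"] * 128
mapping[0x82] = "é"
mapping[0xE1] = "ß"

def decode_string(data: bytes) -> str:
    """8-bit charset -> Unicode via a 256-entry lookup table."""
    return "".join(mapping[b] for b in data)

def encode_string(text: str) -> bytes:
    """Unicode -> 8-bit charset via the reversed table."""
    reverse = {c: i for i, c in enumerate(mapping)}
    return bytes(reverse[c] for c in text)

print(decode_string(b"r\x82sum\x82"))  # résumé
```

The appeal of this scheme is exactly what the post suggests: a single 256-entry array per charset covers every single-byte encoding, leaving iconv or ICU only for the multi-byte cases.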
Aug 19 2007
Anders F Björklund wrote:It was my understanding that D by design only supports UTF environments, and the behaviour on legacy systems (CP437/ISO-8859-1) is "undefined"... It's not only output, if you run on a such a system and try to read the args (char[][]) you can get an UTF exception due to it being malformed.Tango converts the input args to UTF-8 on Win32 rather than just accepting them as they are. The args are left alone on Unix however, because most Unix consoles seem to use Unicode anyway. Sean
Aug 19 2007
Sean Kelly wrote:Sorry, I was talking about Phobos. Another library difference, I guess.It was my understanding that D by design only supports UTF environments, and the behaviour on legacy systems (CP437/ISO-8859-1) is "undefined"... It's not only output, if you run on a such a system and try to read the args (char[][]) you can get an UTF exception due to it being malformed.Tango converts the input args to UTF-8 on Win32 rather than just accepting them as they are.The args are left alone on Unix however, because most Unix consoles seem to use Unicode anyway.On Mac OS X it defaults to MacRoman, but you can change it to ISO-8859-1 or UTF-8 with the flick of a menu... (Display > Character Set Encoding) I even heard rumors of a Windows command to do the same... (chcp 65001) But I also heard it could lead to problems with some DOS batch files ? --anders
Aug 19 2007
Sean Kelly wrote:Tango converts the input args to UTF-8 on Win32 rather than just accepting them as they are. The args are left alone on Unix however, because most Unix consoles seem to use Unicode anyway.From my limited understanding, this automatic conversion seems to only be happening with DMD on Windows and not when running GDC on Windows ? --anders
Aug 20 2007
Anders F Björklund wrote:Sean Kelly wrote:Yes. As far as I know, GDC works on Windows with cygwin but not with mingw or just plain old Win32, is this correct? The routines Tango currently uses to perform the conversion are Win32 library calls, and therefore, I assume, not available to GDC. However, I suppose I could use POSIX calls for GDC--I hadn't considered that case. SeanTango converts the input args to UTF-8 on Win32 rather than just accepting them as they are. The args are left alone on Unix however, because most Unix consoles seem to use Unicode anyway.From my limited understanding, this automatic conversion seems to only be happening with DMD on Windows and not when running GDC on Windows ?
Aug 20 2007
Sean Kelly wrote:No, this is not correct. The "gdcwin" binaries are all about providing the regular Windows/MinGW with GDC just as the "gdcmac" binaries are about providing MacOSX/Xcode with GDC *without* extra requirements... You can build GDC for Cygwin and Darwin too, including the rest of the FSF/GCC toolchain, but it's not a strict requirement as it also builds OK using the patched versions of GCC that MinGW or Xcode are providing.From my limited understanding, this automatic conversion seems to only be happening with DMD on Windows and not when running GDC on Windows ?Yes. As far as I know, GDC works on Windows with cygwin but not with mingw or just plain old Win32, is this correct?The routines Tango currently uses to perform the conversion are Win32 library calls, and therefore, I assume, not available to GDC. However, I suppose I could use POSIX calls for GDC--I hadn't considered that case.You can use Win32 calls, as long as you wrap them in version(Win32) ? --anders
Aug 20 2007
Anders F Björklund wrote:Sean Kelly wrote:Yup. They require an additional library to be linked as well. I take care of this with "pragma(lib)" on DMD, but don't know if GDC supports this. Aside from that, it's simply a matter of GDC users having the .lib file available (it's included with DMC). SeanThe routines Tango currently uses to perform the conversion are Win32 library calls, and therefore, I assume, not available to GDC. However, I suppose I could use POSIX calls for GDC--I hadn't considered that case.You can use Win32 calls, as long as you wrap them in version(Win32) ?
Aug 20 2007
Sean Kelly wrote:Not unless the build tool does (e.g. it being listed in Makefile)You can use Win32 calls, as long as you wrap them in version(Win32) ?Yup. They require an additional library to be linked as well. I take care of this with "pragma(lib)" on DMD, but don't know if GDC supports this.Aside from that, it's simply a matter of GDC users having the .lib file available (it's included with DMC).As long as it is available in MinGW, it shouldn't be any problem. --anders
Aug 20 2007
* Sean Kelly <sean f4.ca> [07-08-20 02:40]:Anders F Björklund wrote:Probably args should be (u)byte[][] anyway. Converting command line arguments could have pretty annoying effects. For example, unix filenames may contain any 8-bit value except '/' and '\0', arguments may contain every char except '\0'. They are also charset agnostic, the only place where the charset matters is the terminal emulator, all other parts of the system treat it as binary data. Also, an automatic charset conversion on console output would probably be annoying, as stdin and stdout are often used to read and write binary data, as in tar -c foo | gzip -9 | split targzipped-foo. So at least, one should use isatty to decide if the in/output is an interactive terminal.It was my understanding that D by design only supports UTF environments, and the behaviour on legacy systems (CP437/ISO-8859-1) is "undefined"... It's not only output, if you run on a such a system and try to read the args (char[][]) you can get an UTF exception due to it being malformed.Tango converts the input args to UTF-8 on Win32 rather than just accepting them as they are. The args are left alone on Unix however, because most Unix consoles seem to use Unicode anyway.
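The isatty-based decision suggested above could look like the following; output_encoding is a hypothetical helper (Python shown for illustration), with None meaning raw byte passthrough:

```python
import io

def output_encoding(stream):
    """Hypothetical helper: pick an output encoding only for
    interactive terminals; pipes and redirected files get None,
    i.e. bytes are passed through untouched."""
    if stream.isatty():
        return getattr(stream, "encoding", None) or "utf-8"
    return None

# A pipe or redirected file is not a tty, so no conversion happens,
# keeping pipelines like `tar -c foo | gzip -9` binary-safe:
print(output_encoding(io.StringIO()))  # None
```

This mirrors what many Unix tools already do (e.g. deciding whether to colorize output), applied here to the encode-or-not question.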
Aug 20 2007
Kirk McDonald, on August 18 at 14:33 you wrote:char[] decode(ubyte[] str, string encoding, string error="strict"); wchar[] wdecode(ubyte[] str, string encoding, string error="strict"); dchar[] ddecode(ubyte[] str, string encoding, string error="strict");Why isn't error an enum instead of a string? -- Leandro Lucarella (luca) | Collective blog: http://www.mazziblog.com.ar/blog/ .------------------------------------------------------------------------, \ GPG: 5F5A8D05 // F8CD F9A7 BF00 5431 4145 104C 949E BFB6 5F5A 8D05 / '--------------------------------------------------------------------' - Look, Don Inodoro! A pigeon with a ring on its leg! It must be a messenger and it fell here! - Well... if it's not a messenger, it's a flirt... or married. -- Mendieta and Inodoro Pereyra
Aug 20 2007
Leandro Lucarella wrote:Kirk McDonald, el 18 de agosto a las 14:33 me escribiste:Perhaps it would be useful to allow the user to define new error-handlers somehow, and provide a callback for them. (Python allows something like this.) This would allow you to, for instance, provide a different replacement character than the one provided by "replace". -- Kirk McDonald http://kirkmcdonald.blogspot.com Pyd: Connecting D and Python http://pyd.dsource.orgchar[] decode(ubyte[] str, string encoding, string error="strict"); wchar[] wdecode(ubyte[] str, string encoding, string error="strict"); dchar[] ddecode(ubyte[] str, string encoding, string error="strict");Why isn't error an enum instead of a string?
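Python's codecs module, referenced parenthetically above, implements exactly this kind of pluggable error handler, so it serves as a concrete reference point: a handler is registered under a name and chooses its own replacement text. The handler name 'dash' here is arbitrary:

```python
import codecs

def dash_replace(err):
    # err is a UnicodeEncodeError or UnicodeDecodeError; the handler
    # returns the replacement text and the position to resume at.
    return ("-", err.end)

codecs.register_error("dash", dash_replace)

print("café".encode("ascii", errors="dash"))      # b'caf-'
print(b"caf\xe9".decode("ascii", errors="dash"))  # caf-
```

The built-in 'strict', 'ignore', and 'replace' handlers correspond directly to the string values in the decode() signatures quoted above, which is presumably why they were proposed as strings rather than an enum.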
Aug 20 2007
Kirk McDonald wrote:Leandro Lucarella wrote:Not a bad idea. I would like to suggest alternate function signatures: //The error code for the callback enum DecodeMode { ..no idea what goes here.. } //The callback types typedef char function(DecodeMode,char) DecodeCHandler; typedef wchar function(DecodeMode,wchar) DecodeWHandler; typedef dchar function(DecodeMode,dchar) DecodeDHandler; //The decode functions uint decode(byte[] str, char[] dst, string encoding, DecodeCHandler handler); uint decode(byte[] str, wchar[] dst, string encoding, DecodeWHandler handler); uint decode(byte[] str, dchar[] dst, string encoding, DecodeDHandler handler); Technically 'char' in C is a signed byte, not an unsigned one therefore byte[] is more accurate. I think you still want to use an enum to represent the cases the callback needs to handle (assuming there is more than one) the same handler function could be used for both encode and decode then. I think you want to pass the destination buffers, allowing re-use/preallocation for efficiency. I think you either return the resulting length of the destination data, or perhaps pass "dst" as 'ref' and change the length internally*. Not sure what you would return if you did that. (* changing length should never cause deallocation of buffer) ReganKirk McDonald, el 18 de agosto a las 14:33 me escribiste:Perhaps it would be useful to allow the user to define new error-handlers somehow, and provide a callback for them. (Python allows something like this.) This would allow you to, for instance, provide a different replacement character than the one provided by "replace".char[] decode(ubyte[] str, string encoding, string error="strict"); wchar[] wdecode(ubyte[] str, string encoding, string error="strict"); dchar[] ddecode(ubyte[] str, string encoding, string error="strict");Why isn't error an enum instead of a string?
Aug 20 2007
Regan Heath wrote:Kirk McDonald wrote:I don't agree with this last part. For starters, I had thought the signed-ness of 'char' in C was not defined. In any case, we're talking about chunks of arbitrary, homogeneous binary data, so I think ubyte[] is most appropriate. Here's another approach to the error handler thing: typedef int error_t; alias void delegate(string encoding, dchar, ref ubyte[]) encode_error_handler; alias void delegate(string encoding, ubyte[], size_t, ref dchar) decode_error_handler; error_t register_error(encode_error_handler dg1, decode_error_handler dg2); error_t Strict, Ignore, Replace; The register_error function would return a new, unique ID for a given error handler. A handler only wanting to handle encoding or decoding could simply pass null for the one it doesn't want to handle. The encode_error_handler receives the encoding and the Unicode character that could not be encoded. It also has a 'ref ubyte[]' argument, which should be set to whatever the replacement character is. (It could be passed in as a slice over an internal buffer. Reducing its length should never cause an allocation.) The decode_error_handler receives the encoding, the ubyte[] buffer, and the index of the character in it which could not be decoded. It also has a 'ref dchar' argument, which should be set to whatever the replacement character is. 
Strict, Ignore, and Replace could be implemented like this: static this() { Strict = register_error( delegate void(string encoding, dchar c, ref ubyte[] dest) { throw new EncodeError(format("Could not encode character \\u%x in encoding '%s'.", c, encoding)); }, delegate void(string encoding, ubyte[] buf, size_t idx, ref dchar dest) { throw new DecodeError(format("Could not decode \\x%x from encoding '%s'.", buf[idx], encoding)); } ); Ignore = register_error( delegate void(string encoding, dchar c, ref ubyte[] dest) { dest = null; }, delegate void(string encoding, ubyte[] buf, size_t idx, ref dchar dest) { dest = 0; // This would probably have to be special-cased. } ); Replace = register_error( delegate void(string encoding, dchar c, ref ubyte[] dest) { dest.length = 1; dest[0] = '?'; }, delegate void(string encoding, ubyte[] buf, size_t idx, ref dchar dest) { dest = '\uFFFD'; // The Unicode REPLACEMENT CHARACTER } ); }Leandro Lucarella wrote:Not a bad idea. I would like to suggest alternate function signatures: //The error code for the callback enum DecodeMode { ..no idea what goes here.. } //The callback types typedef char function(DecodeMode,char) DecodeCHandler; typedef wchar function(DecodeMode,wchar) DecodeWHandler; typedef dchar function(DecodeMode,dchar) DecodeDHandler; //The decode functions uint decode(byte[] str, char[] dst, string encoding, DecodeCHandler handler); uint decode(byte[] str, wchar[] dst, string encoding, DecodeWHandler handler); uint decode(byte[] str, dchar[] dst, string encoding, DecodeDHandler handler); Technically 'char' in C is a signed byte, not an unsigned one therefore byte[] is more accurate.Kirk McDonald, el 18 de agosto a las 14:33 me escribiste:Perhaps it would be useful to allow the user to define new error-handlers somehow, and provide a callback for them. (Python allows something like this.) 
This would allow you to, for instance, provide a different replacement character than the one provided by "replace".char[] decode(ubyte[] str, string encoding, string error="strict"); wchar[] wdecode(ubyte[] str, string encoding, string error="strict"); dchar[] ddecode(ubyte[] str, string encoding, string error="strict");Why isn't error an enum instead of a string?I think you still want to use an enum to represent the cases the callback needs to handle (assuming there is more than one) the same handler function could be used for both encode and decode then. I think you want to pass the destination buffers, allowing re-use/preallocation for efficiency.The implementation could use doEncode and doDecode functions, analogous to doFormat, for efficiency. void doEncode(void delegate(ubyte[]) dg, char[], string encoding, error_t handler); void doEncode(void delegate(ubyte[]) dg, wchar[], string encoding, error_t handler); void doEncode(void delegate(ubyte[]) dg, dchar[], string encoding, error_t handler); void doDecode(void delegate(dchar str) dg, ubyte[], string encoding, error_t handler); The ubyte[] arguments in the callbacks could be slices over an internal buffer. No allocation is necessary. -- Kirk McDonald http://kirkmcdonald.blogspot.com Pyd: Connecting D and Python http://pyd.dsource.org
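The doEncode idea above, where encoded chunks are handed to a delegate rather than accumulated, can be approximated with an incremental encoder. A Python sketch, where do_encode is a hypothetical name mirroring the proposed signature:

```python
import codecs

def do_encode(emit, text, encoding):
    """Hypothetical chunked encoder mirroring the proposed doEncode:
    encoded bytes are handed to the 'emit' delegate as produced,
    so no full output buffer is ever built."""
    enc = codecs.getincrementalencoder(encoding)("strict")
    for ch in text:
        chunk = enc.encode(ch)
        if chunk:
            emit(chunk)
    tail = enc.encode("", final=True)  # flush any pending state
    if tail:
        emit(tail)

parts = []
do_encode(parts.append, "résumé", "cp437")
print(b"".join(parts))  # b'r\x82sum\x82'
```

A stream's write method would slot in as the emit delegate, which is how the doFormat-style design avoids intermediate allocation.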
Aug 20 2007
"Kirk McDonald" <kirklin.mcdonald gmail.com> kirjoitti viestissä news:facpkj$13ml$1 digitalmars.com...Regan Heath wrote:<ramble> True. The C standard does not define the signedness of the char type. What it does require of the char type is that it guarantees that any character in the basic execution character set A B C D E F G H I J K L M N O P Q R S T U V W X Y Z a b c d e f g h i j k l m n o p q r s t u v w x y z 0 1 2 3 4 5 6 7 8 9 fit into a char in such a way that they are non-negative. Quote from standard (ISO/IEC 9899:TC2 Committee Draft May 6, 2005): 6.2.5 Types 3 An object declared as type char is large enough to store any member of the basic execution character set. If a member of the basic execution character set is stored in a char object, its value is guaranteed to be nonnegative. If any other character is stored in a char object, the resulting value is implementation-defined but shall be within the range of vlaues that can be represented in that type. 5.2.4.2.1 Size of integer types <limits.h> - number of bits for smallest object that is not a bit-field (byte) CHAR_BIT 8 - minimum value for an object of type signed char SCHAR_MIN -127 // -(2^7 - 1) - maximum value for an object of type signed char SCHAR_MAX 127 // 2^7 - 1 - maximum value for an object of type unsigned char UCHAR_MAX 255 // 2^8 - 1 - minimum value for an object of type char CHAR_MIN see below - maximum value for an object of type char CHAR_MAX see below 2 If the value of an object of type char is treated as a signed integer when used in an expression, the value of CHAR_MIN shall be the same as that of SCHAR_MIN and the value of CHAR_MAX shall be the same as that of SCHAR_MAX. Otherwise, the value of CHAR_MIN shall be 0 and the value of CHAR_MAX shall be the same as that of UCHAR_MAX. The value of UCHAR_MAX shall equal 2^(CHAR_BIT) - 1. </ramble> So, applying this to the discussion would suggest that either byte[] or ubyte[] would be appropriate. 
However, the most natural would be to handle data as raw data without signs, thus ubyte[] feels more natural to use as the standard type for any data whatsoever.Kirk McDonald wrote: Technically 'char' in C is a signed byte, not an unsigned one therefore byte[] is more accurate.I don't agree with this last part. For starters, I had thought the signed-ness of 'char' in C was not defined. In any case, we're talking about chunks of arbitrary, homogenous binary data, so I think ubyte[] is most appropriate.
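The practical upshot of the signedness question is that the same byte pattern means different numbers depending on interpretation; a quick illustration (Python shown for convenience):

```python
# The byte 0xFF reads as -1 when treated as signed (two's complement)
# and as 255 when treated as unsigned, which is the entire char
# signedness issue in miniature:
raw = b"\xff"
print(int.from_bytes(raw, "big", signed=True))   # -1
print(int.from_bytes(raw, "big", signed=False))  # 255
```

For opaque encoded data the numeric value is irrelevant but indexing and comparison are not, which is why an unsigned view (ubyte[]) is the less surprising default.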
Aug 20 2007
Rioshin an'Harthen wrote:"Kirk McDonald" <kirklin.mcdonald gmail.com> kirjoitti viestissä news:facpkj$13ml$1 digitalmars.com...Although this is interesting, and it does agree with what I was saying, it is basically irrelevant. When passing a string to decode(), the bytes therein could be in any encoding, even one which has nothing to do with the above. (It could be in a multi-byte encoding!) None of those guarantees which the C standard requires apply to these raw bytes. Therefore ubyte[] is /definitely/ more appropriate. -- Kirk McDonald http://kirkmcdonald.blogspot.com Pyd: Connecting D and Python http://pyd.dsource.orgRegan Heath wrote:<ramble> True. The C standard does not define the signedness of the char type. What it does require of the char type is that it guarantees that any character in the basic execution character set A B C D E F G H I J K L M N O P Q R S T U V W X Y Z a b c d e f g h i j k l m n o p q r s t u v w x y z 0 1 2 3 4 5 6 7 8 9 fit into a char in such a way that they are non-negative. Quote from standard (ISO/IEC 9899:TC2 Committee Draft May 6, 2005): 6.2.5 Types 3 An object declared as type char is large enough to store any member of the basic execution character set. If a member of the basic execution character set is stored in a char object, its value is guaranteed to be nonnegative. If any other character is stored in a char object, the resulting value is implementation-defined but shall be within the range of values that can be represented in that type. 
5.2.4.2.1 Size of integer types <limits.h> - number of bits for smallest object that is not a bit-field (byte) CHAR_BIT 8 - minimum value for an object of type signed char SCHAR_MIN -127 // -(2^7 - 1) - maximum value for an object of type signed char SCHAR_MAX 127 // 2^7 - 1 - maximum value for an object of type unsigned char UCHAR_MAX 255 // 2^8 - 1 - minimum value for an object of type char CHAR_MIN see below - maximum value for an object of type char CHAR_MAX see below 2 If the value of an object of type char is treated as a signed integer when used in an expression, the value of CHAR_MIN shall be the same as that of SCHAR_MIN and the value of CHAR_MAX shall be the same as that of SCHAR_MAX. Otherwise, the value of CHAR_MIN shall be 0 and the value of CHAR_MAX shall be the same as that of UCHAR_MAX. The value of UCHAR_MAX shall equal 2^(CHAR_BIT) - 1. </ramble> So, applying this to the discussion would suggest that either byte[] or ubyte[] would be appropriate. However, the most natural would be to handle data as raw data without signs, thus ubyte[] feels more natural to use as the standard type for any data whatsoever.Kirk McDonald wrote: Technically 'char' in C is a signed byte, not an unsigned one therefore byte[] is more accurate.I don't agree with this last part. For starters, I had thought the signed-ness of 'char' in C was not defined. In any case, we're talking about chunks of arbitrary, homogenous binary data, so I think ubyte[] is most appropriate.
Aug 20 2007