digitalmars.D - char, wchar and dchar should be supported equally
- James McComb (48/48) Jun 03 2005 I like D having char, wchar and dchar. And I like the way that they will...
- Trevor Parscal (11/32) Jun 03 2005 well.. wtoString is a bad naming convention.. I think toWString or
- Hasan Aljudy (22/54) Jun 03 2005 I think that toString or any std function that takes a string and
- Trevor Parscal (9/13) Jun 03 2005 The best idea for this I have heard thus far.. Especially since, anytime...
- Regan Heath (43/50) Jun 03 2005 If you're using char[] then it gets converted to dchar[], processed, the...
- James McComb (5/17) Jun 04 2005 Thinks: so that's how you do it! :)
- Regan Heath (15/21) Jun 03 2005 Yes and No. In many cases, yes, especially where ASCII is used. However ...
- Hasan Aljudy (4/11) Jun 03 2005 What then is the point of having all of these different types?
- Regan Heath (24/35) Jun 03 2005 They're each better or worse depending on the data you're operating on.
- Anders F Björklund (12/14) Jun 04 2005 That's like saying that booleans should always be represented
- Hasan Aljudy (15/29) Jun 04 2005 No, it's not like representing booleans with ints .. it's actually like
- Kris (9/38) Jun 04 2005 It would be great to resolve this ongoing concern. However, you might
- Vathix (2/6) Jun 04 2005 Maybe there should be isascii(char) somewhere :)
- Anders F Björklund (5/13) Jun 05 2005 I suggested that enhancement last year, but it wasn't popular...
- Derek Parnell (32/48) Jun 05 2005 You mean like this ...
- Anders F Björklund (10/27) Jun 05 2005 Is that the "Natural Docs" format ?
- Derek Parnell (15/45) Jun 05 2005 Good on ya.
- Anders F Björklund (7/14) Jun 05 2005 http://www.naturaldocs.org/
- Derek Parnell (19/23) Jun 04 2005 Yes please. I've had to write dchar[] versions of a lot of things in
- Anders F Björklund (5/8) Jun 04 2005 Not that anyone cares, but templates also have severe problems
I like D having char, wchar and dchar. And I like the way that they will (soon?) implicitly convert between each other. But I don't like the way that D is biased towards char. I think that char, dchar and wchar should be supported equally.

For example, modern Windows systems support UTF-16 (via the W functions). So you might decide to use wchar, because that is also UTF-16. The Windows API expects zero-terminated strings, and you can clearly indicate this in your code by calling toStringz. But toStringz takes char, so your wchar will be implicitly converted to char and then implicitly converted back to wchar. So there is no point using wchar!

But what if every function in std.string had wchar and dchar versions? Then you could use wchar and call wtoStringz. (At the end of this email, there is some working code showing how this could be implemented using templates and aliases. There are other ways that std.string could support wchar and dchar, such as function overloading or function templates.)

Also, in order for char, wchar and dchar to be supported equally, Object should have wtoString and dtoString methods. (Because toString cannot be overloaded based on its return type.)

Does anyone else out there feel the same? Or should I get over it and JUC (Just Use Char) like I already JUB (Just Use Bit)?

James McComb

<code>
import std.stdio;

template TStringFunctions(T)
{
    T[] toStringz(T[] str)
    {
        if (!str) return "";
        T[] copy = str.dup;
        return copy ~= '\0';
    }

    // Other string functions...
}

alias TStringFunctions!(char) stringFunctions;
alias TStringFunctions!(wchar) wstringFunctions;
alias TStringFunctions!(dchar) dstringFunctions;

alias stringFunctions.toStringz toStringz;
alias wstringFunctions.toStringz wtoStringz;
alias dstringFunctions.toStringz dtoStringz;

// Other string function aliases...

// Example usage
void main()
{
    char[] str = "utf-8 string";
    wchar[] wstr = "utf-16 string";

    str = toStringz(str);
    wstr = wtoStringz(wstr);
}
</code>
Jun 03 2005
James McComb wrote:
> I like D having char, wchar and dchar. And I like the way that they will (soon?) implicitly convert between each other. But I don't like the way that D is biased towards char. I think that char, dchar and wchar should be supported equally.
>
> For example, modern Windows systems support UTF-16 (via the W functions). So you might decide to use wchar, because that is also UTF-16. The Windows API expects zero-terminated strings, and you can clearly indicate this in your code by calling toStringz. But toStringz takes char, so your wchar will be implicitly converted to char and then implicitly converted back to wchar. So there is no point using wchar!
>
> But what if every function in std.string had wchar and dchar versions? Then you could use wchar and call wtoStringz. (At the end of this email, there is some working code showing how this could be implemented using templates and aliases. There are other ways that std.string could support wchar and dchar, such as function overloading or function templates.)
>
> *snip*
>
> Object should have wtoString and dtoString methods.

well.. wtoString is a bad naming convention.. I think toWString or toDString makes a little more sense, but to be honest, I think it should work like read and write, and return char[], wchar[], or dchar[] based on what you cast.

That's my two cents anyhoo, as an avid dchar[] user.

--
Thanks,
Trevor Parscal
www.trevorparscal.com
trevorparscal hotmail.com
Jun 03 2005
Trevor Parscal wrote:
> James McComb wrote:
>> I like D having char, wchar and dchar. And I like the way that they will (soon?) implicitly convert between each other. But I don't like the way that D is biased towards char. I think that char, dchar and wchar should be supported equally.
>>
>> For example, modern Windows systems support UTF-16 (via the W functions). So you might decide to use wchar, because that is also UTF-16. The Windows API expects zero-terminated strings, and you can clearly indicate this in your code by calling toStringz. But toStringz takes char, so your wchar will be implicitly converted to char and then implicitly converted back to wchar. So there is no point using wchar!
>>
>> But what if every function in std.string had wchar and dchar versions? Then you could use wchar and call wtoStringz. (At the end of this email, there is some working code showing how this could be implemented using templates and aliases. There are other ways that std.string could support wchar and dchar, such as function overloading or function templates.)
>>
>> *snip*
>>
>> Object should have wtoString and dtoString methods.
>
> well.. wtoString is a bad naming convention.. I think toWString or toDString makes a little more sense, but to be honest, I think it should work like read and write, and return char[], wchar[], or dchar[] based on what you cast.
>
> That's my two cents anyhoo, as an avid dchar[] user.

I think that toString or any std function that takes a string and processes it, should always take dchar and return dchar.

Assuming that dchar is implicitly convertible to char and wchar, there can be no loss of information when doing something like:

<code>
dchar[] someFunction(dchar[])
...
...
wchar[] wtest = ...
wtest = someFunction(wtest); //no loss
...
char[] test = ..
test = someFunction(test); //no loss
</code>

Of course I may be wrong, but I'm assuming that converting a char to wchar is like converting an int to double .. where any extra space is just filled with zeros (speaking in the bit level), and you can convert an int to double, process it, and convert it back to int, and assume that no information will be lost because of the conversion to double.

Of course information can be lost if "int" is not enough to store the value returned from the function, but this has nothing to do with converting back and forth to double then to int.
Jun 03 2005
Hasan Aljudy wrote:
> I think that toString or any std function that takes a string and processes it, should always take dchar and return dchar.

The best idea for this I have heard thus far.. Especially since, anytime you are doing a toString you aren't going to be worried about the additional overhead of a dchar[] (or so I believe)

--
Thanks,
Trevor Parscal
www.trevorparscal.com
trevorparscal hotmail.com
Jun 03 2005
On Fri, 03 Jun 2005 20:42:25 -0700, Trevor Parscal <trevorparscal hotmail.com> wrote:
> Hasan Aljudy wrote:
>> I think that toString or any std function that takes a string and processes it, should always take dchar and return dchar.
>
> The best idea for this I have heard thus far.. Especially since, anytime you are doing a toString you aren't going to be worried about the addtional overhead of a dchar[] (or so I believe)

If you're using char[] then it gets converted to dchar[], processed, then converted back. That's not ideal IMO.

Ideally we only want conversion to happen in 1, or at most 2 places.

1. Data is converted on input from <input format> to <internal format>.
2. Data is converted on output from <internal format> to <output format>.

[...] they will do both (for one reason or another).

Each application will have a different <internal format> chosen for some specific reason, perhaps even a different <internal format> for each group of data.

So, ideally we require 3 variants of every single string function. But of course, we don't want to be repeating ourselves all the time, in fact we want only one 'function' we just want to re-use it for all 3 string types. So, might I suggest using templates eg.

<code>
import std.stdio;
import std.ctype;

template toLowerT(Type)
{
    Type[] toLowerT(Type[] input)
    {
        Type[] res = input.dup;
        foreach(inout Type c; res)
            c = tolower(c);
        return res;
    }
}

alias toLowerT!(char) toLower;
alias toLowerT!(wchar) toLower;
alias toLowerT!(dchar) toLower;

void main()
{
    char[] a = "REGAN";
    wchar[] b = "WAS";
    dchar[] c = "HERE";

    //we can even use the x.fn() form as opposed to fn(x) if we wish.
    writefln("%s=%s",a,a.toLower());
    writefln("%s=%s",b,b.toLower());
    writefln("%s=%s",c,c.toLower());
}
</code>

NOTE: I realise using ctype's tolower function will only work with ASCII, not the full complement of unicode characters. This is a semi-functional example only.

Regan
Jun 03 2005
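As a rough follow-up to Regan's NOTE, the same one-function-for-three-types idea can be sketched for a current D compiler. This is only an illustration under assumptions: `lowerCopy` is a made-up name, not a Phobos function, and the sketch relies on `std.uni.toLower` for the Unicode-aware case mapping and on D's built-in `foreach` decoding and `~=` re-encoding.

```d
import std.uni : toLower;

// Hypothetical helper (not part of Phobos): returns a lower-cased copy
// in the same encoding as the input, for char, wchar or dchar arrays.
C[] lowerCopy(C)(const(C)[] input)
if (is(C == char) || is(C == wchar) || is(C == dchar))
{
    C[] result;
    foreach (dchar c; input)   // decodes the input into whole code points
        result ~= toLower(c);  // appending a dchar re-encodes it as C units
    return result;
}

void main()
{
    assert(lowerCopy("REGAN") == "regan");
    assert(lowerCopy("WAS"w) == "was"w);
    // U+00C9 (E with acute) is outside ASCII but is still lower-cased.
    assert(lowerCopy("H\u00C9RE"d) == "h\u00E9re"d);
}
```

The decode and re-encode inside the loop happens per character within one pass, so no separate full-string conversion to dchar[] is needed for the char and wchar instantiations.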
Regan Heath wrote:
> template toLowerT(Type)
> {
>     Type[] toLowerT(Type[] input)
>     {
>         Type[] res = input.dup;
>         foreach(inout Type c; res)
>             c = tolower(c);
>         return res;
>     }
> }
>
> alias toLowerT!(char) toLower;
> alias toLowerT!(wchar) toLower;
> alias toLowerT!(dchar) toLower;

Thinks: so that's how you do it! :)

This is the kind of thing I had in mind. Is there any chance that std.string actually *will* be implemented like this?

James McComb
Jun 04 2005
On Fri, 03 Jun 2005 21:37:23 -0600, Hasan Aljudy <hasan.aljudy gmail.com> wrote:
> of course I maybe wrong, but I'm assuming that converting a char to wchar is like converting an int to double .. where any extra space is just filled with zeros (speaking in the bit level)

Yes and No. In many cases, yes, especially where ASCII is used. However some UTF-8 'characters'/'glyphs' (not sure what the correct term is exactly) take 2 or more chars (UTF-8 codepoints) to represent, so when converting them you might go from 3 chars to 1 wchar (1 UTF-16 codepoint), which is a decrease in byte space required, and often a change in the value of the codepoint.

> , and you can convert an int to double, process it, and convert it back to int, and assume that no information will be lost because of the conversion to double.

Converting to/from char[], wchar[] and dchar[] causes no loss of data, ever. All existing glyphs can be represented in UTF-8 (char[]), UTF-16 (wchar[]) and UTF-32 (dchar[]), thus all existing strings can be represented in all types. Of course that representation uses a different number of bytes and may in fact use different bit patterns (codepoints) as well.

Regan
Jun 03 2005
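Regan's two claims above (the number of array elements changes between encodings, but nothing is lost) can be checked with a small sketch for a current D compiler. The helper names `unitCounts` and `roundTrips` are invented for this illustration; `toUTF8`, `toUTF16` and `toUTF32` are the conversion functions in `std.utf`.

```d
import std.utf : toUTF8, toUTF16, toUTF32;

// Hypothetical helper: element counts of the same text stored as
// UTF-8 (char), UTF-16 (wchar) and UTF-32 (dchar).
size_t[3] unitCounts(string s)
{
    return [s.length, toUTF16(s).length, toUTF32(s).length];
}

// Hypothetical helper: true if the text survives the trip
// UTF-8 -> UTF-16 -> UTF-32 -> UTF-8 unchanged.
bool roundTrips(string s)
{
    return toUTF8(toUTF32(toUTF16(s))) == s;
}

void main()
{
    // U+20AC (euro sign): three chars in UTF-8, one wchar in UTF-16.
    assert(unitCounts("\u20AC") == [3, 1, 1]);
    // U+1D11E (musical G clef) is outside the BMP: two wchars in UTF-16.
    assert(unitCounts("\U0001D11E") == [4, 2, 1]);
    assert(roundTrips("\u20AC and \U0001D11E"));
}
```

The euro sign is an instance of the "3 chars to 1 wchar" case mentioned in the post.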
Regan Heath wrote:
> Converting to/from char[], wchar[] and dchar[] causes no loss of data, ever. All existing glyphs can be represented in UTF-8(char[]), UTF-16(wchar[]) and UTF-32(dchar[]), thus all existing strings can be represented in all types. Of course that representation uses a different number of bytes and may in fact use different bit patterns(codepoints) as well.
>
> Regan

What then is the point of having all of these different types?
How does UTF-8 work? when you only have 256 possible values?
Jun 03 2005
On Sat, 04 Jun 2005 00:05:46 -0600, Hasan Aljudy <hasan.aljudy gmail.com> wrote:
> Regan Heath wrote:
>> Converting to/from char[], wchar[] and dchar[] causes no loss of data, ever. All existing glyphs can be represented in UTF-8(char[]), UTF-16(wchar[]) and UTF-32(dchar[]), thus all existing strings can be represented in all types. Of course that representation uses a different number of bytes and may in fact use different bit patterns(codepoints) as well.
>>
>> Regan
>
> What then is the point of having all of these different types?

They're each better or worse depending on the data you're operating on.

Terminology: (I think this is correct)
Codepoint == one char, wchar, or dchar.
Character == a symbol, made up of 1 or more codepoints.

UTF-8 is perfect if most/all of your data is ASCII, as UTF-8 characters have the same values as they do in ASCII, ASCII is a sub-set of UTF-8 (which can represent characters that do not exist in ASCII).

UTF-16 is better than UTF-8 in cases where most/all of your data would take 2 or more UTF-8 codepoints to represent. Essentially UTF-16 can store some characters in less space than UTF-8 can.

UTF-32 is better than UTF-16 in cases where most/all of your data would take 2 or more UTF-16 codepoints to represent.

Some people choose to use UTF-32 as you can guarantee a codepoint == a character, meaning the dchar's length property is the 'string' length (this is not always the case with wchar, or char, due to some characters taking more than 1 codepoint).

> How does UTF-8 work? when you only have 256 possible values?

In essence it uses between 1 and 4 codepoints to represent a single character. Someone probably has a better reference than this:
http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&item_id=IWS-AppendixA
I just quickly googled that up.

Regan
Jun 03 2005
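To make the "between 1 and 4" answer concrete, here is a small sketch for a current D compiler that prints nothing but asserts the bytes UTF-8 uses for a few characters. `utf8Bytes` is a hypothetical helper written for this note; `encode` is the `std.utf` function that writes one character into a `char[4]` buffer. One terminology caveat: the Unicode standard calls a single char, wchar or dchar element a "code unit" and reserves "code point" for the character's number, so the comments below use those terms.

```d
import std.utf : encode;

// Hypothetical helper: the UTF-8 code units (bytes) that encode one code point.
ubyte[] utf8Bytes(dchar c)
{
    char[4] buf;
    immutable n = encode(buf, c);          // number of code units written
    return cast(ubyte[]) buf[0 .. n].dup;  // copy out only the used part
}

void main()
{
    assert(utf8Bytes('A') == [0x41]);                            // ASCII: 1 byte
    assert(utf8Bytes('\u00E9') == [0xC3, 0xA9]);                 // U+00E9: 2 bytes
    assert(utf8Bytes('\u20AC') == [0xE2, 0x82, 0xAC]);           // U+20AC: 3 bytes
    assert(utf8Bytes('\U0001F600') == [0xF0, 0x9F, 0x98, 0x80]); // U+1F600: 4 bytes
}
```

Each byte still has only 256 possible values; the high bits of the first byte say how many bytes follow, which is how a sequence of them can cover the whole Unicode range.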
Hasan Aljudy wrote:
> I think that toString or any std function that takes a string and processes it, should always take dchar and return dchar.

That's like saying that booleans should always be represented with "int", and I'm afraid it won't fly around here since we're obsessed with the size of variables more than processing time :-)

Conversion is a real problem, but at least you can do:

char[] str;
foreach(dchar c; str)
{
    ...
}

Plus some ASCII shortcuts, when the high bit isn't set.

Much more on http://prowiki.org/wiki4d/wiki.cgi?CharsAndStrs (and several other pages on the Wiki4D, like Derek's RFE: "FeatureRequestList/ImplicitConversionBetweenUTF")

--anders

PS. You probably meant to say "dchar[]", and not dchar ?
Jun 04 2005
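The two techniques Anders mentions can be sketched as follows for a current D compiler. Both function names are made up for this illustration; the only language feature relied on is that `foreach` with a `dchar` loop variable decodes a char array one character at a time.

```d
// Hypothetical helper: counts characters (code points) in UTF-8 text
// by letting foreach do the decoding.
size_t countCodePoints(const(char)[] s)
{
    size_t n;
    foreach (dchar c; s)  // each iteration yields one decoded character
        ++n;
    return n;
}

// The "ASCII shortcut": a char with the high bit clear is a complete
// ASCII character and needs no decoding at all.
bool highBitClear(char c)
{
    return (c & 0x80) == 0;
}

void main()
{
    string s = "na\u00EFve";            // U+00EF takes two chars in UTF-8
    assert(s.length == 6);              // array length counts chars
    assert(countCodePoints(s) == 5);    // foreach sees five characters
    assert(highBitClear(s[0]));         // 'n' is plain ASCII
    assert(!highBitClear(s[2]));        // first byte of the two-byte sequence
}
```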
Anders F Björklund wrote:
> Hasan Aljudy wrote:
>> I think that toString or any std function that takes a string and processes it, should always take dchar and return dchar.
>
> That's like saying that booleans should always be represented with "int", and I'm afraid it won't fly around here since we're obsessed with the size of variables more than processing time :-)

No, it's not like representing booleans with ints .. it's actually like saying ints should always be represented by doubles.

booleans are not numbers, there is no reason to represent them as numbers, and no one should ever store numbers in booleans. But char, wchar, and dchar are all characters, just with different storage space.

I don't really think anybody cares about size, most people who care would care most about performance (processing time).

imagine if all std functions used short instead of int ;) that could be a serious problem.

> Conversion is a real problem, but at least you can do:
>
> char[] str;
> foreach(dchar c; str)
> {
>     ...
> }
>
> Plus some ASCII shortcuts, when the high bit isn't set.

I don't like having to read the unicode specs to be able to deal with simple things like char. Your "ASCII shortcuts" would be low-level stuff dealing with how char and dchar are represented in memory. C'mon people, D is a high level language.
Jun 04 2005
It would be great to resolve this ongoing concern. However, you might consider trying the ICU project for all your unicode needs ~ it's what Java uses under the covers: http://www-306.ibm.com/software/globalization/icu/index.jsp

There's a D interface available over here, along with a well-rounded String class: http://dsource.org/forums/viewtopic.php?t=148

- Kris

"Hasan Aljudy" <hasan.aljudy gmail.com> wrote in message news:d7t8tc$b40$1 digitaldaemon.com...
> Anders F Björklund wrote:
>> Hasan Aljudy wrote:
>>> I think that toString or any std function that takes a string and processes it, should always take dchar and return dchar.
>>
>> That's like saying that booleans should always be represented with "int", and I'm afraid it won't fly around here since we're obsessed with the size of variables more than processing time :-)
>
> No, it's not like representing booleans with ints .. it's actually like saying ints should always be represented by doubles.
>
> booleans are not numbers, there is no reason to represent them as numbers, and no one should ever store numbers in booleans. But char, wchar, and dchar are all characters, just with different storage space.
>
> I don't really think anybody cares about size, most people who care would care most about performance (processing time).
>
> imagine if all std functions used short instead of int ;) that could be a serious problem.
>
>> Conversion is a real problem, but at least you can do:
>>
>> char[] str;
>> foreach(dchar c; str)
>> {
>>     ...
>> }
>>
>> Plus some ASCII shortcuts, when the high bit isn't set.
>
> I don't like having to read the unicode specs to be able to deal with simple things like char. Your "ASCII shortcuts" would be low-level stuff dealing with how char and dchar are represented in memory. C'mon people, D is a high level language.
Jun 04 2005
> I don't like having to read the unicode specs to be able to deal with simple things like char. Your "ASCII shortcuts" would be low-level stuff dealing with how char and dchar are represented in memory. C'mon people, D is a high level language.

Maybe there should be isascii(char) somewhere :)
Would be inlined and self documenting.
Jun 04 2005
Vathix wrote:
>> I don't like having to read the unicode specs to be able to deal with simple things like char. Your "ASCII shortcuts" would be low-level stuff dealing with how char and dchar are represented in memory. C'mon people, D is a high level language.
>
> Maybe there should be isascii(char) somewhere :)
> Would be inlined and self documenting.

I suggested that enhancement last year, but it wasn't popular...
http://www.digitalmars.com/drn-bin/wwwnews?digitalmars.D.bugs/2154

Or maybe it just got lost in this crippled "bug reporting system" ?

--anders
Jun 05 2005
On Sun, 05 Jun 2005 09:25:09 +0200, Anders F Björklund wrote:
> Vathix wrote:
>>> I don't like having to read the unicode specs to be able to deal with simple things like char. Your "ASCII shortcuts" would be low-level stuff dealing with how char and dchar are represented in memory. C'mon people, D is a high level language.
>>
>> Maybe there should be isascii(char) somewhere :)
>> Would be inlined and self documenting.
>
> I suggested that enhancement last year, but it wasn't popular...
> http://www.digitalmars.com/drn-bin/wwwnews?digitalmars.D.bugs/2154
>
> Or maybe it just got lost in this crippled "bug reporting system" ?

You mean like this ...

<code>
//---------------------------
// --- isASCII --
// Returns true if the supplied argument is an ASCII character.
//
// Parameters:
// (1) -- char -- The character to test.
// (return) -- bool -- 'true' if the character is ASCII otherwise false.
//---------------------------
bool isASCII(char c)
out(result) {
    assert(result == (UTF8stride[c] == 1));
}
body{
    return (cast(uint)c <= 127U ? true : false);
}

unittest {
    assert(isASCII('a') == true);
    assert(isASCII('~') == true);
    assert(isASCII('\xFF') == false);
    assert(isASCII('\x80') == false);
    assert(isASCII('\x00') == true);
    assert(isASCII(cast(char) -1) == false);
}
//---------------------------
</code>

--
Derek Parnell
Melbourne, Australia
5/06/2005 7:13:16 PM
Jun 05 2005
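For readers trying Derek's function on a current D compiler, a restated version might look like the sketch below. Two things are assumptions of this note rather than part of the post: current compilers spell the `body` keyword as `do`, and the contract here checks against `std.ascii.isASCII` instead of the `UTF8stride` table used in the post. The function is renamed `isAsciiUnit` (a made-up name) so it does not clash with the library function.

```d
import std.ascii : isASCII;

// Sketch of the posted function with an out contract and `do` body.
// isAsciiUnit is a hypothetical name chosen for this example.
bool isAsciiUnit(char c)
out (result)
{
    // Cross-check against the library's own test.
    assert(result == isASCII(c));
}
do
{
    return c <= 0x7F;
}

void main()
{
    assert(isAsciiUnit('a'));
    assert(isAsciiUnit('~'));
    assert(isAsciiUnit('\x00'));
    assert(!isAsciiUnit('\x80'));
    assert(!isAsciiUnit('\xFF'));
    assert(!isAsciiUnit(cast(char) -1));
}
```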
Derek Parnell wrote:
> You mean like this ...
>
> //---------------------------
> // --- isASCII --
> // Returns true if the supplied argument is an ASCII character.
> //
> // Parameters:
> // (1) -- char -- The character to test.
> // (return) -- bool -- 'true' if the character is ASCII otherwise false.
> //---------------------------

Is that the "Natural Docs" format ?

I think I prefer Doxygen, myself:

/// Is the supplied code unit an ASCII character ?
/// param c The UTF-8 code unit to test.
/// return 'true' if the character is ASCII

> bool isASCII(char c)
> out(result) {
>     assert(result == (UTF8stride[c] == 1));
> }
> body{
>     return (cast(uint)c <= 127U ? true : false);
> }

But surely this workaround shouldn't be needed ? If a "bool" function can't return a comparison, then there's something severely broken somewhere...

--anders
Jun 05 2005
On Sun, 05 Jun 2005 12:09:47 +0200, Anders F Björklund wrote:
> Derek Parnell wrote:
>> You mean like this ...
>>
>> //---------------------------
>> // --- isASCII --
>> // Returns true if the supplied argument is an ASCII character.
>> //
>> // Parameters:
>> // (1) -- char -- The character to test.
>> // (return) -- bool -- 'true' if the character is ASCII otherwise false.
>> //---------------------------
>
> Is that the "Natural Docs" format ?

Dunno. What's that ? I just made this up on the spot.

> I think I prefer Doxygen, myself:
>
> /// Is the supplied code unit an ASCII character ?
> /// param c The UTF-8 code unit to test.
> /// return 'true' if the character is ASCII

Good on ya.

>> bool isASCII(char c)
>> out(result) {
>>     assert(result == (UTF8stride[c] == 1));
>> }
>> body{
>>     return (cast(uint)c <= 127U ? true : false);
>> }
>
> But surely this workaround shouldn't be needed ? If a "bool" function can't return a comparison, then there's something severly broken somewhere...

I make a distinction between the machine code that is generated by a compiler and the source code that is read by a human. Yes, the compiler is able to work out that a bool is returned from a comparison, but by writing it out explicitly, we also get a clear and unambiguous statement of intent by the coder. We get the same machine code generated and now it's also human readable too.

In other words, it is self-documenting and does not rely on the sophistication of the compiler.

--
Derek Parnell
Melbourne, Australia
5/06/2005 8:39:19 PM
Jun 05 2005
Derek Parnell wrote:
>> Is that the "Natural Docs" format ?
>
> Dunno. What's that ? I just made this up on the spot.

http://www.naturaldocs.org/

Whatever style is used, it should be parsable ?

> Yes, the compiler is able to work out that a bool is returned from a comparison, but by writing it out explicitly, we also get a clear and unambiguous statement of intent by the coder. We get the same machine code generated and now its also human readable too.

Ah, OK, then it wasn't a compiler bug <phew>. Just a matter of opinion on readability... :-)

Like: "a < b" versus "(a < b) ? true : false"

--anders
Jun 05 2005
On Sat, 04 Jun 2005 11:20:47 +1000, James McComb wrote:
> I like D having char, wchar and dchar. And I like the way that they will (soon?) implicitly convert between each other. But I don't like the way that D is biased towards char. I think that char, dchar and wchar should be supported equally.

Yes please. I've had to write dchar[] versions of a lot of things in std.string and others. I tend to use char[] only when reading to and from files/streams, and use dchar[] for internal routines.

The application I'm working on now does a lot of text processing and it is too slow to convert char[] -> dchar[], process it, convert dchar[] -> char[].

The simplicity of dchar[] is that the array index always points to the start of a character, whereas with char[] and wchar[] the index can point to somewhere inside a character. (Remembering that each character in a dchar[] string is the same size - a dchar - but characters in wchar[] and char[] have variable sizes.)

The current Phobos routines are heavily biased to char[].

Also, the use of templates is not always the best solution because there are some optimizations available, depending on the UTF encoding format used.

--
Derek Parnell
Melbourne, Australia
4/06/2005 6:08:29 PM
Jun 04 2005
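Derek's point about array indices can be seen in a few lines. This is a minimal sketch for a current D compiler; `isContinuation` is a hypothetical helper written for this note, and it relies only on the fact that UTF-8 continuation bytes have the bit pattern 10xxxxxx.

```d
import std.utf : toUTF32;

// Hypothetical helper: true for a UTF-8 continuation unit (10xxxxxx),
// i.e. an index that points inside a character rather than at its start.
bool isContinuation(char c)
{
    return (c & 0xC0) == 0x80;
}

void main()
{
    // e-acute, 't', e-acute: 5 UTF-8 code units but 3 characters.
    string s = "\u00E9t\u00E9";
    dstring d = toUTF32(s);

    assert(s.length == 5 && d.length == 3);
    assert(isContinuation(s[1]));   // s[1] is the second byte of the first e-acute
    assert(!isContinuation(s[2]));  // s[2] is 't', the start of a character
    assert(d[1] == 't');            // in the dchar array, index 1 is the second character
}
```

With the char array, finding the second character means skipping past the two bytes of the first one; with the dchar array it is a plain index, at the cost of four bytes per character.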
Derek Parnell wrote:
> The current Phobos routines are heavily biased to char[]. Also, the use of templates is not always the best solution because there are some optimizations available, depending on the UTF encoding format used.

Not that anyone cares, but templates also have severe problems on other D platforms such as with the GDC compiler on Mac OS X... It's getting better, but it's like "the early days of C++" or so.

--anders
Jun 04 2005