digitalmars.D - Evolution (Hello World)
- Anders F Björklund (13/56) Feb 09 2005 Taking a quick look at "Hello World"
- Sebastian Beschke (7/8) Feb 10 2005 This is an obvious question, but what would you propose for wchar[] and
- Anders F Björklund (24/31) Feb 10 2005 My proposition was "ustr" for wchar[] (since it rhymes with "uint",
- Anders F Björklund (9/19) Feb 10 2005 Another possibility is wstr for wchar[] ("wide string")
- Matthew (5/23) Feb 10 2005 I don't think char[] should have an alias. Strings in D are slices, for
- Anders F Björklund (13/16) Feb 10 2005 I thought that strings in D were sliceable codepoint arrays,
- Matthew (4/7) Feb 10 2005 Now you've got something of a point there. But, still, I'd prefer to
- Anders F Björklund (10/14) Feb 10 2005 Ehrm, nooooo ? "The line must be drawn here". :-)
- Matthew (5/14) Feb 10 2005 I know. And I like your sentiment. It's just that I think that the
- Derek (11/52) Feb 10 2005 One cannot easily address individual code points using utf8. For example...
- Anders F Björklund (15/43) Feb 10 2005 This is not that much of a problem, since you should not address
- Derek Parnell (16/64) Feb 10 2005 I obviously do a lot of different sort of programing to you. I often nee...
- Anders F Björklund (24/42) Feb 10 2005 No, but it's mine ;-) (the ignorant westerner that I am)
- James McComb (4/9) Feb 10 2005 Does anyone here know if Japanese and Chinese use a lot of ASCII
- Walter (5/9) Feb 11 2005 Take a look at the functions std.utf.stride, std.utf.toUCSindex, and
Taking a quick look at "Hello World" shows a remarkable language evolution...

From C:

    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        puts("Hello, World!");
        return EXIT_SUCCESS;
    }

    int main(int argc, char *argv[])
    {
        int i;
        for (i = 0; i < argc; i++)
            printf("%d %s\n", i, argv[i]);
        return EXIT_SUCCESS;
    }

To "old D":

    import std.c.stdio;
    import std.c.stdlib;

    int main()
    {
        puts("Hello, World!");
        return EXIT_SUCCESS;
    }

    int main(char[][] args)
    {
        for (int i = 0; i < args.length; i++)
            printf("%d %.*s\n", i, args[i]);
        return EXIT_SUCCESS;
    }

To "new D":

    import std.stdio;

    void main()
    {
        writeln("Hello, World!");
    }

    void main(str[] args)
    {
        foreach (int i, str a; args)
            writefln("%d %s", i, a);
    }

Where I took the liberty of adding a few of my own RFEs:

1) "void main" should return 0 back to the operating system
2) new std.stdio.writeln, a formatless version of writefln
3) the "str" alias for the char[] type, like "bool" for bit

Not too bad for the first five years, if I say so myself ?

Then again, I couldn't even use it here until GDC arrived... And it was not even a year ago, that David Friedman did that.

--anders
Feb 09 2005
Anders F Björklund wrote:

> 3) the "str" alias for the char[] type, like "bool" for bit

This is an obvious question, but what would you propose for wchar[] and dchar[]?

I think, considering that UTF-8 is not the optimal encoding in a huge number of cases, promoting it as "The One and Only String Type" would not be wise.

-Sebastian
Feb 10 2005
Sebastian Beschke wrote:

>> 3) the "str" alias for the char[] type, like "bool" for bit
>
> This is an obvious question, but what would you propose for wchar[] and dchar[]?

My proposition was "ustr" for wchar[] (since it rhymes with "uint", and is to be pronounced as Unicode string, as in Unicode-optimized)

There will be no alias for dchar[], as it is a silly type anyway. dchar's are cool, but UTF-32 strings are just too much lossage...

> I think, considering that UTF-8 is not the optimal encoding in a huge number of cases, promoting it as "The One and Only String Type" would not be wise.

There is no "one and only one string type" in D, just as there is no "one and only one boolean type". There are three of each, and char[] is the preferred string type (what does "main" use ?) so it gets to be the str. And bit is the type of "true" and "false" so it gets to be the default bool type.

If you want to speed up or optimize your code, you can change to using wchar[] or wbool[]... And when it is needed in a few places, you have dchar[] and dbool

The contents are exactly the same, just encoded differently - all strings are in Unicode and all booleans are in Zero-is-False.

I just find this shortform to be easier on the eyes:

    void main(str[] args);
    str[str] dictionary;

UTF-8 has two major advantages:

1) it's optimized for ASCII and does not require a BOM mark, making it compatible for files too
2) it is Endian agnostic, no more X86 vs PPC gruffs like the others

If you do a lot of Unicode, or non-Western languages, switch to ustr instead? It's equally well supported in all D std libraries. (the only downside of using str is that it's a little bigger/slower)

--anders
Feb 10 2005
I wrote:

>>> 3) the "str" alias for the char[] type, like "bool" for bit
>>
>> This is an obvious question, but what would you propose for wchar[] and dchar[]?
>
> My proposition was "ustr" for wchar[] (since it rhymes with "uint", and is to be pronounced as Unicode string, as in Unicode-optimized)
>
> There will be no alias for dchar[], as it is a silly type anyway. dchar's are cool, but UTF-32 strings are just too much lossage...

Another possibility is wstr for wchar[] ("wide string") and ustr for dchar[] ("Unicode string"), which might perhaps work better and be a tad more logical too...

I like "str" better than "string", because it:

1) rhymes with int, char, bool and the others
2) is shorter to type, easily 50% saved
3) doesn't confuse anyone with C++ std::string

--anders
Feb 10 2005
"Anders F Björklund" <afb algonet.se> wrote in message news:cugd6f$2ptf$1 digitaldaemon.com...

> I wrote:
>
>>>> 3) the "str" alias for the char[] type, like "bool" for bit
>>>
>>> This is an obvious question, but what would you propose for wchar[] and dchar[]?
>>
>> My proposition was "ustr" for wchar[] (since it rhymes with "uint", and is to be pronounced as Unicode string, as in Unicode-optimized)
>>
>> There will be no alias for dchar[], as it is a silly type anyway. dchar's are cool, but UTF-32 strings are just too much lossage...
>
> Another possibility is wstr for wchar[] ("wide string") and ustr for dchar[] ("Unicode string"), which might perhaps work better and be a tad more logical too...
>
> I like "str" better than "string", because it:
>
> 1) rhymes with int, char, bool and the others
> 2) is shorter to type, easily 50% saved
> 3) doesn't confuse anyone with C++ std::string

I don't think char[] should have an alias. Strings in D are slices, for very good reason, and it's good for that to be foremost in peoples' minds.
Feb 10 2005
Matthew wrote:

> I don't think char[] should have an alias. Strings in D are slices, for very good reason, and it's good for that to be foremost in peoples' minds.

I thought that strings in D were sliceable codepoint arrays, but not necessarily slices always ? It's just an alias, the type is still char[] ? (and wchar[] and dchar[], but anyway)

But I rethought and found "ustr" to be silly altogether...

    alias char[] str;   // ASCII-optimized
    alias wchar[] wstr; // Unicode-optimized
    alias dchar[] dstr; // codepoint-optimized

More orthogonal that way ? (with "char[]" = "str", always)

char[] by itself is actually not that bad, but this is:

    int main(char[][] args);
    char[][char[]] dictionary;

--anders
Feb 10 2005
> char[] by itself is actually not that bad, but this is:
>
>     int main(char[][] args);
>     char[][char[]] dictionary;

Now you've got something of a point there. But, still, I'd prefer to leave it as char[]. The example you give is only 1-dim string / 2-dim char. What about higher dimensionality (of anything)? We could end up in the cow-dung of LPPPCSTR, etc.
Feb 10 2005
Matthew wrote:

> Now you've got something of a point there. But, still, I'd prefer to leave it as char[]. The example you give is only 1-dim string / 2-dim char. What about higher dimensionality (of anything)? We could end up in the cow-dung of LPPPCSTR, etc.

Ehrm, nooooo ? "The line must be drawn here". :-)

I just wanted some easier basics, for beginners ? For the higher levels, you still need to learn about bit and char[] and other behind-the-scenes.

It's just similar to the "alias foo* fooPtr;", that seems to always enter the picture after one has seen one too many stars fly by...

I'll (re)post my Grand Scheme of Std Aliases.

--anders
Feb 10 2005
"Anders F Björklund" <afb algonet.se> wrote in message news:cugfi7$2sll$2 digitaldaemon.com...

> Matthew wrote:
>
>> Now you've got something of a point there. But, still, I'd prefer to leave it as char[]. The example you give is only 1-dim string / 2-dim char. What about higher dimensionality (of anything)? We could end up in the cow-dung of LPPPCSTR, etc.
>
> Ehrm, nooooo ? "The line must be drawn here". :-)
>
> I just wanted some easier basics, for beginners ? For the higher levels, you still need to learn about bit and char[] and other behind-the-scenes.

I know. And I like your sentiment. It's just that I think that the string-is-a-slice concept is so important and fundamental to D that it's more likely to be a disservice in the medium/long term.
Feb 10 2005
On Thu, 10 Feb 2005 20:01:50 +0100, Anders F Björklund wrote:

> Sebastian Beschke wrote:
>
>>> 3) the "str" alias for the char[] type, like "bool" for bit
>>
>> This is an obvious question, but what would you propose for wchar[] and dchar[]?
>
> My proposition was "ustr" for wchar[] (since it rhymes with "uint", and is to be pronounced as Unicode string, as in Unicode-optimized)
>
> There will be no alias for dchar[], as it is a silly type anyway. dchar's are cool, but UTF-32 strings are just too much lossage...
>
>> I think, considering that UTF-8 is not the optimal encoding in a huge number of cases, promoting it as "The One and Only String Type" would not be wise.
>
> There is no "one and only one string type" in D, just as there is no "one and only one boolean type". There are three of each, and char[] is the preferred string type (what does "main" use ?) so it gets to be the str. And bit is the type of "true" and "false" so it gets to be the default bool type.
>
> If you want to speed up or optimize your code, you can change to using wchar[] or wbool[]... And when it is needed in a few places, you have dchar[] and dbool
>
> The contents are exactly the same, just encoded differently - all strings are in Unicode and all booleans are in Zero-is-False.
>
> I just find this shortform to be easier on the eyes:
>
>     void main(str[] args);
>     str[str] dictionary;
>
> UTF-8 has two major advantages:
>
> 1) it's optimized for ASCII and does not require a BOM mark, making it compatible for files too
> 2) it is Endian agnostic, no more X86 vs PPC gruffs like the others
>
> If you do a lot of Unicode, or non-Western languages, switch to ustr instead? It's equally well supported in all D std libraries. (the only downside of using str is that it's a little bigger/slower)

One cannot easily address individual code points using utf8. For example...

    char[] SomeText;

You cannot be sure whether SomeText[5] addresses the beginning of a code point or not. Remember that code points in utf8 are variable length, but are fixed length in utf32.

So if using utf8, and one is doing some form of character manipulation, one should first convert to utf32, do the work, then convert back to utf8.

-- 
Derek
Melbourne, Australia
Feb 10 2005
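Derek's point about SomeText[5] can be illustrated with a small sketch. It assumes a current D2 compiler, which this thread predates, and startsCodePoint is a hypothetical helper written for this illustration, not a std.utf function. It only checks whether a code unit is a UTF-8 continuation byte (bit pattern 10xxxxxx):

```d
import std.stdio : writeln;

/// Hypothetical helper: true when s[i] is the first code unit of a
/// UTF-8 sequence, i.e. not a continuation byte of the form 10xxxxxx.
bool startsCodePoint(const(char)[] s, size_t i)
{
    return (s[i] & 0xC0) != 0x80;
}

void main()
{
    string text = "Bj\u00F6rklund";    // the 'ö' occupies two UTF-8 code units
    writeln(text.length);              // length counts code units, not characters
    writeln(startsCodePoint(text, 2)); // index 2 is the first byte of 'ö'
    writeln(startsCodePoint(text, 3)); // index 3 is the continuation byte of 'ö'
}
```

A plain index such as text[3] therefore can land in the middle of a character, which is the hazard described above.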
Derek wrote:

>> UTF-8 has two major advantages:
>>
>> 1) it's optimized for ASCII and does not require a BOM mark, making it compatible for files too
>> 2) it is Endian agnostic, no more X86 vs PPC gruffs like the others
>>
>> If you do a lot of Unicode, or non-Western languages, switch to ustr instead? It's equally well supported in all D std libraries. (the only downside of using str is that it's a little bigger/slower)
>
> One cannot easily address individual code points using utf8. For example... char[] SomeText; You cannot be sure that SomeText[5] address the beginning of a code point or not. Remembering that code points in utf8 are variable length, but are fixed length in utf32.

This is not that much of a problem, since you should not address individual code points anyway but treat the code units as a string.

See http://oss.software.ibm.com/icu/docs/papers/forms_of_unicode:

"Code-point boundaries, iteration, and indexing are very fast with UTF-32. Code-point boundaries, accessing code points at a given offset, and iteration involve a few extra machine instructions for UTF-16; UTF-8 is a bit more cumbersome. Indexing is slow for both of them, but in practice indexing by different code units is done very rarely, except when communicating with specifications that use UTF-32 code units, such as XSL. This point about indexing is true unless an API for strings allows access only by code point offsets. This is a very inefficient design: strings should always allow indexing with code unit offsets."

But char[] works fine for ASCII and wchar[] works fine for Unicode, *as long* as you watch out for any surrogates in the code units... Which means you can have a fast standard route, and extra code to handle the exceptional characters if and when they occur ?

> So if using utf8, and one is doing some form of character manipulation, one should first convert to utf32, do the work, then convert back to utf8.

Yes, and this is easily done with a foreach(dchar c; SomeText) loop, as D can transparently handle the transition between char[] and dchar...

There are also readily available functions in the std.utf module: "encode" and "decode", and the toUTF8 / toUTF16 / toUTF32 wrappers.

If you do a lot of loops like that, you can use a dchar[] (dstr alias) as an intermediate storage. But char[] and wchar[] are better for long term.

--anders
Feb 10 2005
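The decode-on-iterate loop and the "convert to UTF-32, do the work, convert back" route from this post might look as follows. This is a sketch under assumptions: it targets a current D2 compiler and Phobos rather than the 2005 library, it spells the by-reference loop variable as ref where the thread writes inout, and countCodePoints and replaceCodePoint are hypothetical names chosen for the illustration:

```d
import std.stdio : writeln;
import std.utf : toUTF8, toUTF32;

/// Counts code points by letting foreach decode the UTF-8 on the fly.
size_t countCodePoints(const(char)[] s)
{
    size_t n;
    foreach (dchar c; s) // each iteration yields one decoded code point
        n++;
    return n;
}

/// Replaces every `from` with `to` by working on a UTF-32 copy.
string replaceCodePoint(string s, dchar from, dchar to)
{
    dchar[] work = toUTF32(s).dup; // fixed-width, mutable intermediate storage
    foreach (ref dchar c; work)    // in-place edits are fine on a dchar[]
        if (c == from)
            c = to;
    return toUTF8(work);           // encode back to UTF-8 for storage
}

void main()
{
    writeln(countCodePoints("Bj\u00F6rklund"));
    writeln(replaceCodePoint("Bj\u00F6rklund", '\u00F6', 'o'));
}
```

The dchar[] exists only for the duration of the manipulation; the value that is kept around stays in UTF-8, in line with the "better for long term" remark.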
On Thu, 10 Feb 2005 22:47:18 +0100, Anders F Björklund wrote:

> Derek wrote:
>
>>> UTF-8 has two major advantages:
>>>
>>> 1) it's optimized for ASCII and does not require a BOM mark, making it compatible for files too
>>> 2) it is Endian agnostic, no more X86 vs PPC gruffs like the others
>>>
>>> If you do a lot of Unicode, or non-Western languages, switch to ustr instead? It's equally well supported in all D std libraries. (the only downside of using str is that it's a little bigger/slower)
>>
>> One cannot easily address individual code points using utf8. For example... char[] SomeText; You cannot be sure that SomeText[5] address the beginning of a code point or not. Remembering that code points in utf8 are variable length, but are fixed length in utf32.
>
> This is not that much of a problem, since you should not address individual code points anyway but treat the code units as a string.

I obviously do a lot of different sort of programming to you. I often need to look at individual code points (ie. characters) in a string.

> See http://oss.software.ibm.com/icu/docs/papers/forms_of_unicode:
>
> "Code-point boundaries, iteration, and indexing are very fast with UTF-32. Code-point boundaries, accessing code points at a given offset, and iteration involve a few extra machine instructions for UTF-16; UTF-8 is a bit more cumbersome. Indexing is slow for both of them, but in practice indexing by different code units is done very rarely, except when communicating with specifications that use UTF-32 code units, such as XSL. This point about indexing is true unless an API for strings allows access only by code point offsets. This is a very inefficient design: strings should always allow indexing with code unit offsets."

Yes, and a simple index into a char[] doesn't do this for you.

> But char[] works fine for ASCII and wchar[] works fine for Unicode, *as long* as you watch out for any surrogates in the code units... Which means you can have a fast standard route, and extra code to handle the exceptional characters if and when they occur ?

'exceptional' to whom? To latin-based alphabet users maybe, but not the great majority of the world's population.

>> So if using utf8, and one is doing some form of character manipulation, one should first convert to utf32, do the work, then convert back to utf8.
>
> Yes, and this is easily done with a foreach(dchar c; SomeText) loop, as D can transparently handle the transition between char[] and dchar...

Except for "character manipulation" as 'foreach(inout dchar c; SomeText)' is not permitted.

> There are also readily available functions in the std.utf module: "encode" and "decode", and the toUTF8 / toUTF16 / toUTF32 wrappers.

Exactly my point. One needs to use these if *manipulating* characters in a utf8 or utf16 string.

> If you do a lot of loops like that, you can use a dchar[] (dstr alias) as an intermediate storage. But char[] and wchar[] are better for long term.

'long term' meaning ??? Disk storage? RAM storage? Or until we finally get rid of all those silly 'alphabets' out there ;-)

-- 
Derek
Melbourne, Australia
11/02/2005 9:38:27 AM
Feb 10 2005
Derek Parnell wrote:

>> But char[] works fine for ASCII and wchar[] works fine for Unicode, *as long* as you watch out for any surrogates in the code units... Which means you can have a fast standard route, and extra code to handle the exceptional characters if and when they occur ?
>
> 'exceptional' to whom? To latin-based alphabet users maybe, but not the great majority of the world's population.

No, but it's mine ;-) (the ignorant westerner that I am)

Seriously, in my own language - Swedish, about 10% of the text is non ASCII, which means that Walter's optimized US ASCII parts run for 90% of the time. I assume this is the same for the rest of the previously ISO-8859-X using Western world languages...

Had I been using another alphabet, like Japanese or Chinese, then UTF-16 had been a nice bet. Surrogate characters are not occurring very often, in fact they were just now introduced in Java 1.5 since the original 16 bits of Unicode "overflowed". So I think there's a 90-10 rule here too, with non-Surrogates.

So I do think talking about "exceptions" is warranted ?

>> Yes, and this is easily done with a foreach(dchar c; SomeText) loop, as D can transparently handle the transition between char[] and dchar...
>
> Except for "character manipulation" as 'foreach(inout dchar c; SomeText)' is not permitted.

We are talking Copy-on-Write here, yes ? As in reading from readonly and writing to readwrite ? Otherwise you could use dchar[] instead, and do a simple indexing. (or a foreach(inout dchar c; SomeText) on it)

And convert from UTF-8/UTF-16 on the way in, do all the processing on the UTF-32 internal array, and convert back to UTF-8/UTF-16 on the way out. (most routines now do include a dchar[] interface too, you can even use dchar[] in switch/case statements - if you like)

>> If you do a lot of loops like that, you can use a dchar[] (dstr alias) as an intermediate storage. But char[] and wchar[] are better for long term.
>
> 'long term' meaning ??? Disk storage? RAM storage? Or until we finally get rid of all those silly 'alphabets' out there ;-)

Storage. Even with all "silly alphabets" utilized, there are still 11 dead bits in each UTF-32 character. UTF-16 is bound to be more efficient. Unless you are doing extinct languages research or something? :-)

It's not just me... See http://www.unicode.org/faq/utf_bom.html#UTF32

--anders
Feb 10 2005
Anders F Björklund wrote:

> Had I been using another alphabet, like Japanese or Chinese, then UTF-16 had been a nice bet. Surrogate characters are not occurring very often, in fact they were just now introduced in Java 1.5 since the original 16 bits of Unicode "overflowed". So I think there's a 90-10 rule here too, with non-Surrogates.

Does anyone here know if Japanese and Chinese use a lot of ASCII punctuation? If they do, then maybe UTF-8 is reasonable.

James McComb
Feb 10 2005
"Derek Parnell" <derek psych.ward> wrote in message news:uk7573l4ag4s.fkp4buj0rl0e.dlg 40tude.net...

>> This is not that much of a problem, since you should not address individual code points anyway but treat the code units as a string.
>
> I obviously do a lot of different sort of programming to you. I often need to look at individual code points (ie. characters) in a string.

Take a look at the functions std.utf.stride, std.utf.toUCSindex, and std.utf.toUTFindex. They provide the basic building blocks to manipulate UTF-8 strings as if they were an array of UCS characters.
Feb 11 2005
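As a rough sketch of how the three functions Walter names fit together, the following picks out the n-th character of a UTF-8 string without converting the whole string to UTF-32. It assumes the std.utf signatures of a current D2 Phobos, which may differ from the 2005 library, and nthCodePoint is a hypothetical helper written for this illustration:

```d
import std.stdio : writeln;
import std.utf : stride, toUCSindex, toUTFindex;

/// Hypothetical helper: returns the UTF-8 slice holding the n-th
/// code point of s, where n counts characters rather than code units.
string nthCodePoint(string s, size_t n)
{
    size_t start = toUTFindex(s, n); // character index -> code-unit index
    size_t len = stride(s, start);   // code units used by that sequence
    return s[start .. start + len];
}

void main()
{
    string text = "Bj\u00F6rklund";
    writeln(nthCodePoint(text, 2)); // the third character, two code units wide
    writeln(toUCSindex(text, 4));   // code-unit index 4 -> character index
}
```

Each lookup still walks the string from the start, so this trades the memory of a dchar[] copy for a linear scan per access.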