digitalmars.D - First Impressions
- Geoff Carlton (63/63) Sep 28 2006 Hi,
- Jarrett Billingsley (42/105) Sep 28 2006 They're more just syntactic sugar than member functions. You can, in fa...
- Geoff Carlton (32/50) Sep 28 2006 I'm a fan of utf-8 so it would seem natural to have string, wstring, and...
- Lutger (10/39) Sep 28 2006 Yes, I was too. But although it looks not very nice at first sight, D's
- Derek Parnell (25/47) Sep 28 2006 Yes. It isn't very 'nice' for a modern language. Though as you note belo...
- Walter Bright (7/14) Sep 29 2006 On the other hand, the reasons other languages have strings as classes
- Anders F Björklund (3/5) Sep 29 2006 A string alias might still be, just as the bool alias was.
- Derek Parnell (11/27) Sep 29 2006 And is it there yet? I mean, given that a string is just a lump of text,...
- Georg Wrede (5/10) Sep 29 2006 The string you're talking about is not just a lump of text.
- David Medlock (4/35) Sep 29 2006 I just quickly want to interject my wish for aliases for the basic
- Walter Bright (4/10) Sep 29 2006 I believe it's there. I don't think std::string or java.lang.String have...
- Matthias Spycher (5/20) Sep 29 2006 Immutability and some guarantees about the validity of the state of an
- Derek Parnell (16/28) Sep 29 2006 I'm pretty sure that the phobos routines for search and replace only wor...
- Georg Wrede (20/25) Sep 29 2006 I take it that you mean that the bit pattern, or byte, 'a' (as in 0x61)
- Walter Bright (25/37) Sep 29 2006 That cannot happen, because multibyte sequences *always* have the high
- Derek Parnell (26/66) Sep 30 2006 Thanks. That has cleared up some misconceptions and pre-conceptions tha...
- Walter Bright (21/52) Sep 30 2006 I certainly hope this thread doesn't degenerate into that like some of
- Derek Parnell (6/9) Oct 01 2006 Oh, I threw that away ages ago ;-)
- Lars Ivar Igesund (6/12) Oct 01 2006 Nope, it just looks correct.
- Lionello Lunesu (14/23) Oct 02 2006 I don't think renaming toString to toUTF gets rid of any confusion.
- Georg Wrede (3/5) Oct 01 2006 Let's just say it would be a first step in lessening the confusion _we_
- Kevin Bealer (15/22) Oct 02 2006 I would kind of agree with this, but I think it's a two-edged knife.
- Georg Wrede (24/54) Oct 03 2006 Well, with string, folks would at least be inclined to search for the
- Anders F Björklund (17/32) Oct 03 2006 Which could be a *good* thing, since it would stop users from hurting
- Bruno Medeiros (9/26) Oct 01 2006 Precisely! And even if such conceptual difference didn't exist, or is
- Geoff Carlton (7/13) Oct 01 2006 There are also many cases where char arrays are not strings:
- Thomas Kuehne (31/44) Sep 30 2006 -----BEGIN PGP SIGNED MESSAGE-----
- Thomas Kuehne (11/30) Sep 30 2006 -----BEGIN PGP SIGNED MESSAGE-----
- Sean Kelly (9/12) Sep 30 2006 The wording could be more explicit, but I think the current
- Geoff Carlton (20/35) Sep 29 2006 Hi,
- David Medlock (12/30) Sep 29 2006 The reason *I* want it is _alias_ does not respect the private:
- Anders F Björklund (25/40) Sep 29 2006 Problem of "char[]" is both that it hides the fact that "char" is UTF-8
- Anders F Björklund (7/14) Sep 29 2006 Except the other way around, of course!
- Lionello Lunesu (10/10) Sep 29 2006 I also ALWAYS create aliases for char[], wchar[], dchar[]... I DO wish
- Anders F Björklund (3/7) Sep 29 2006 And probably only for ASCII string constants, at that...
- Lionello Lunesu (15/23) Sep 29 2006 Right, that too!
- Georg Wrede (21/32) Sep 29 2006 Using char[] as long as you don't know about UTF seems to work pretty
- Chad J (24/64) Sep 29 2006 haha too true.
- Johan Granberg (8/23) Sep 29 2006 I completely agree, char should hold a character independently of
- BCS (15/23) Sep 29 2006 Why isn't performance a problem?
- Chad J (40/71) Sep 29 2006 I will go ahead and say that the current state of char[] is incorrect.
- Anders F Björklund (9/29) Sep 29 2006 But D already uses Unicode for all strings, encoded as UTF ?
- Chad J (38/78) Sep 29 2006 Probably 7-bit. Anything where the size of one character is ALWAYS one
- Anders F Björklund (24/38) Sep 29 2006 It's mostly about looking out for the UTF "control" characters, which is...
- Georg Wrede (4/8) Sep 29 2006 Problem is, using 16-bit you sort-of get away with _almost_ all of it.
- Chad J (35/75) Sep 30 2006 So it seems to me the problem is that those 2 bytes are both 2
- Anders F Björklund (30/57) Oct 01 2006 Code point is the closest thing to a "character", although it might take...
- Anders F Björklund (29/40) Sep 29 2006 This code probably does not work as you think it does...
- Chad J (15/67) Sep 29 2006 ah. And yep the i++ was a typo (oops).
- Georg Wrede (26/31) Sep 29 2006 Wrong.
- Chad J (13/19) Sep 29 2006 But this is what I'm talking about... you can't slice them or index
- Georg Wrede (92/116) Sep 29 2006 Yes. That's why I talked about you falling down once you realise Daddy's...
- Walter Bright (16/31) Sep 29 2006 Yes, you do have to be aware of it being UTF, just like in C you have to...
- Sean Kelly (9/20) Sep 30 2006 As long as you're aware that you are working in UTF-8 I think
- Walter Bright (10/21) Sep 30 2006 It's so broken that there are proposals to reengineer core C++ to add
- Sean Kelly (12/36) Oct 01 2006 True. And I hinted at this above.
- Walter Bright (6/22) Oct 01 2006 That's why the proposals to fix it are rewriting some of the *core* C++
- Johan Granberg (7/18) Sep 29 2006 But is this not a needless source of confusion, that could be eliminated...
- Georg Wrede (26/48) Sep 29 2006 You might begin with pasting this and compiling it:
- Derek Parnell (10/12) Sep 29 2006 The Build program does lots of 'tampering'. I had to rewrite many standa...
- Georg Wrede (7/17) Sep 29 2006 Yes, case insensitive compares are difficult if you want to cater for
- Geoff Carlton (21/29) Sep 29 2006 I agree, but I disagree that there is a problem, or that utf-8 is a bad
- Georg Wrede (2/38) Sep 29 2006 Yes.
- Johan Granberg (6/14) Sep 29 2006 How should we chop strings on character boundaries?
- Walter Bright (2/3) Sep 30 2006 std.utf.toUTFindex() should do the trick.
- Johan Granberg (4/23) Sep 29 2006 I don't think any performance hit will be so big that it causes problems...
- BCS (14/25) Oct 01 2006 If you will note, I said nothing about the size of the hit. While some
- Anders F Björklund (8/15) Oct 01 2006 We have that already:
- BCS (19/44) Oct 01 2006 ubyte is an 8 bit unsigned number not a character encoding.
- Georg Wrede (23/27) Oct 01 2006 Then all Americans would use that instead of UTF-8.
- Anders F Björklund (19/23) Oct 01 2006 Right, I actually meant ubyte[] but void[] might have been
- BCS (18/22) Oct 02 2006 The more I think about it the worse this get.
Hi, I'm a C++ user who's just tried D and I wanted to give my first impressions. I can't really justify moving any of my codebase over to D, so I wrote a quick tool to parse a dictionary file and make a histogram - a bit like the wc demo in the dmd package.

1.) I was a bit underwhelmed by the syntax of char[]. I've used lua which also has strings, functions and maps as basic primitives, so going back to array notation seems a bit low level. Also, char[][] is not the best start in the main() declaration. Is it a 2D array, an array of arrays? Then there is the char[][char[]]. What a mouthful for a simple map! Well, now I need to find elements.. I'd use std::string's find() here, but the wc example has all array operations. Even isalpha is done as 'a', 'z' comparisons on an indexed array. Back to low level C stuff. A simple alias of char[] to string would simplify the first glance code.

string x; // yep, a string
main (string[]) // an array of strings
string[string] m; // map of string to string

I believe single functions get pulled in as member functions? e.g. find(string) can be used as string.find()? If so, it means that all the string functionality can be added and then used naturally as member functions on this "string" (which is really just the plain old char[] in disguise). This is a small thing, but I think it would help in terms of the mindset of strings being a first class primitive, and clear up simple "hello world" examples at the same time. Put simply, every modern language has a first class string primitive type, except D - at least in terms of nomenclature.

2.) I liked the more powerful for loop. I'm curious is there any ability to use delegates in the same way as lua does? I was blown away the first time I realised how simple it was for custom iteration in lua. In short, you write a function that returns a delegate (a closure?) that itself returns arguments, terminating in nil. e.g.
for r in rooms_in_level(lvl) // custom function

As lua can handle multiple return arguments, it can also do a key,value sort of thing that D can do. What a wonderful way of allowing any sort of iteration. It beats pages of code in C++ to write an iterator that can go forwards, or one that can go backwards (wow, the power of C++!). C++09 still isn't much of an improvement here, it only sugars the awful iterator syntax.

3.) From the newsgroups, it seems like 'auto' as local raii and 'auto' as automatic type deduction are still linked to the one keyword. Well in lua, 'local' is pretty intuitive for locally scoped variables. Also 'auto' will soon mean automatic type deduction in C++. So those make sense to me personally. Looks like this has been discussed to death, but that's my 2c.

4.) The D version of Scintilla and d-build was nice, very easy to use. Personally I would have preferred the default behaviour of dbuild to put object files in an /obj subdirectory and the final exe in the original directory dbuild is run from. This way, it could be run from a root directory, operate on a /src subdirectory, and not clutter up the source with object files. There is a switch for that, of course, but I can't imagine when you would want object files sitting in the same directory as the source.

Well, as first impressions go, I was pleased by D, and am interested to see how well it fares as time goes on. It's just a shame that all the tools/library/IDE is all in C++!

Thanks,
Geoff
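For what it's worth, the alias and the array-call sugar discussed in this post can be sketched in a few lines of D (the `firstSpace` helper below is hypothetical, purely to illustrate the call syntax; `std.string.find` is the Phobos search routine of the time):

```d
import std.string;

alias char[] string;  // the proposed alias - purely cosmetic, no new type

// a free function whose first parameter is an array...
int firstSpace(string s)
{
    return std.string.find(s, ' ');  // index of the first space, or -1
}

void main()
{
    string greeting = "hello world";
    // ...may also be called with member syntax on that array:
    assert(greeting.firstSpace() == 5);

    string[string] m;  // map of string to string
    m["hello"] = "world";
    assert(m["hello"] == "world");
}
```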
Sep 28 2006
"Geoff Carlton" <gcarlton iinet.net.au> wrote in message news:efhp1r$1r9s$1 digitaldaemon.com...Hi, I'm a C++ user who's just tried D and I wanted to give my first impressions. I can't really justify moving any of my codebase over to D, so I wrote a quick tool to parse a dictionary file and make a histogram - a bit like the wc demo in the dmd package. 1.) I was a bit underwhelmed by the syntax of char[]. I've used lua which also has strings,functions and maps as basic primitives, so going back to array notation seems a bit low level. Also, char[][] is not the best start in the main() declaration. Is it a 2D array, an array of arrays? Then there is the char[][char[]]. What a mouthful for a simple map! Well, now I need to find elements.. I'd use std::string's find() here, but the wc example has all array operations. Even isalpha is done as 'a', 'z' comparisons on an indexed array. Back to low level C stuff. A simple alias of char[] to string would simplify the first glance code. string x; // yep, a string main (string[]) // an array of strings string[string] m; // map of string to string I believe single functions get pulled in as member functions? e.g. find(string) can be used as string.find()? If so, it means that all the string functionality can be added and then used naturally as member functions on this "string" (which is really just the plain old char[] in disguise).They're more just syntactic sugar than member functions. You can, in fact do this with any array type, e.g void foo(int[] arr) { ... } int[] x; x = [4, 5, 6, 7]; // bug in the new array literals ;) x.foo();This is a small thing, but I think it would help in terms of the mindset of strings being a first class primitive, and clear up simple "hello world" examples at the same time. Put simply, every modern language has a first class string primitive type, except D - at least in terms of nomenclature.It does look nicer. 
I suppose the counterargument would be that having an alias char[] string might not be portable -- what about wchar[] and dchar[]? Would they be wstring and dstring? Or would we choose wchar[] or dchar[] to already use UTF-16 as the default string type)? I've never been too incredibly put off by char[], but of course other people have other opinions.2.) I liked the more powerful for loop. I'm curious is there any ability to use delegates in the same way as lua does? I was blown away the first time I realised how simple it was for custom iteration in lua. In short, you write a function that returns a delegate (a closure?) that itself returns arguments, terminating in nil. e.g. for r in rooms_in_level(lvl) // custom function As lua can handle multiple return arguments, it can also do a key,value sort of thing that D can do. What a wonderful way of allowing any sort of iteration.

Unfortunately the way Lua does "foreach" iteration is exactly the inverse of how D does it. Lua gets an iterator and keeps calling it in the loop; D gives the loop (the entire body!) to the iterator function, which runs the loop. So it's something like a "true" iterator as described in the Lua book:

level.each(function(r) print("Room: " .. r) end)

D does it this way I guess to make it easier to write iterators. Since you're limited to one return value, it's simpler to make the iterator a callback and pass the indices into the foreach body than it is to make the iterator return multiple parameters through "out" parameters. That, and it's easier to keep track of state with a callback iterator. (I'm going through which to use in a Lua-like language that I'm designing too!)It beats pages of code in C++ to write an iterator that can go forwards, or one that can go backwards (wow, the power of C++!). C++09 still isn't much of an improvement here, it only sugars the awful iterator syntax.

Weeeeeeeee! C++

3.) From the newsgroups, it seems like 'auto' as local raii and 'auto' as automatic type deduction are still linked to the one keyword. Well in lua, 'local' is pretty intuitive for locally scoped variables. Also 'auto' will soon mean automatic type deduction in C++. So those make sense to me personally. Looks like this has been discussed to death, but that's my 2c.

I don't even wanna get into it ;) _Technically_ speaking, auto isn't really "used" in type deduction; instead, the syntax is just <storage class> <identifier>, skipping the type. Since the default storage class is auto, it looks like auto is being used to determine the type, but it also works like e.g. static x = 5; I think a better way to do it would be to have a special "stand-in" type, such as

var x = 5;
static var y = 20;
auto var f = new Foo(); // this will be RAII and automatically type-determined

4.) The D version of Scintilla and d-build was nice, very easy to use. Personally I would have preferred the default behaviour of dbuild to put object files in an /obj subdirectory and the final exe in the original directory dbuild is run from. This way, it could be run from a root directory, operate on a /src subdirectory, and not clutter up the source with object files. There is a switch for that, of course, but I can't imagine when you would want object files sitting in the same directory as the source. Well, as first impressions go, I was pleased by D, and am interested to see how well it fares as time goes on. It's just a shame that all the tools/library/IDE is all in C++! Thanks, Geoff
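The "D hands the loop body to the iterator" model described above is spelled opApply in D. A minimal sketch, assuming a hypothetical Level class holding room names:

```d
class Level
{
    char[][] rooms;

    // foreach compiles its body into a delegate and passes it here;
    // a non-zero return from the delegate means "break out of the loop"
    int opApply(int delegate(inout char[] room) dg)
    {
        int result = 0;
        for (size_t i = 0; i < rooms.length; i++)
        {
            result = dg(rooms[i]);
            if (result)
                break;
        }
        return result;
    }
}

void listRooms(Level lvl)
{
    foreach (char[] r; lvl)  // calls lvl.opApply with the body as the delegate
    {
        // ... use r ...
    }
}
```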
Sep 28 2006
Jarrett Billingsley wrote:It does look nicer. I suppose the counterargument would be that having an alias char[] string might not be portable -- what about wchar[] and dchar[]? Would they be wstring and dstring? Or would we choose wchar[] or dchar[] to already use UTF-16 as the default string type)?I'm a fan of utf-8 so it would seem natural to have string, wstring, and dstring. IMO utf-16 is backward thinking, and has the dubious property of being mostly fixed width, except when its not. And even utf-32 isn't one-to-one in terms of glyphs rendered on screen. Anyway, as a low level programmer, I appreciate that its all based on very powerful and flexible arrays. But as a high level programmer, I don't want to be reminded of that fact every time I need a to use a string.Unfortunately the way Lua does "foreach" iteration is exactly the inverse of how D does it. Lua gets an iterator and keeps calling it in the loop; D gives the loop (the entire body!) to the iterator function, which runs the loop. So it's something like a "true" iterator as described in the Lua book:Ok, although the advantage of the first method is that you write the iterator once, and then its easy to use for all clients. Wrapping up the loop in a function is just backward, although it is much more palatable in the inline format than a clunky out of line functor or using _1, _2 hackery magic. As an example, I love the fact that I can do this in lua: for r1 in rooms_in_level(lvl) do for r2 in rooms_in_level(lvl) do for c in connections(r1, r2) do print("got connection " .. c) end end end I wrote Floyd's algorithm in lua in the time it would take me in C++ to not even finish thinking about what structures, classes, vectors I would use. I imagine D would be as easy, although not as nice as the above style.D does it this way I guess to make it easier to write iterators. 
Since you're limited to one return value, it's simpler to make the iterator a callback and pass the indices into the foreach body than it is to make the iterator return multiple parameters through "out" parameters. That, and it's easier to keep track of state with a callback iterator. (I'm going through which to use in a Lua-like language that I'm designing too!)

Multiple returns would be tricky. C++ looks like it's getting there with std::tuple and std::tie, but as always the downside is the sheer clunkiness. As heterogeneous arrays aren't in the core language for either C++ or D, it's tricky to come up with a clean solution. Designing a language would be great fun, and I think lua has done a great many things right. Not sure about the typeless state though, it gets messy with large projects. Still, no templates (or rather, every function is like a template).
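The key,value pairing mentioned above already falls out of D's foreach over associative arrays, with two induction variables - a small sketch (the map contents are invented for illustration):

```d
import std.stdio;

void main()
{
    char[][char[]] exits;  // map of room name to direction
    exits["kitchen"] = "north";
    exits["cellar"]  = "down";

    // two loop variables give the key,value iteration Lua users expect
    foreach (char[] room, char[] dir; exits)
        writefln("%s -> %s", room, dir);
}
```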
Sep 28 2006
Geoff Carlton wrote:Hi, I'm a C++ user who's just tried D and I wanted to give my first impressions. I can't really justify moving any of my codebase over to D, so I wrote a quick tool to parse a dictionary file and make a histogram - a bit like the wc demo in the dmd package.

You'll sure be pleased with D coming from C++.

1.) I was a bit underwhelmed by the syntax of char[]...

Yes, I was too. But although it looks not very nice at first sight, D's arrays are nothing like C++ arrays. Strings are first class, array notation is consistent and getting used to them together with concatenation and slicing operators, I found they are quite powerful yet simple to use.

2.) I liked the more powerful for loop. I'm curious is there any ability to use delegates in the same way as lua does? I was blown away the first time I realised how simple it was for custom iteration in lua. In short, you write a function that returns a delegate (a closure?) that itself returns arguments, terminating in nil.

You can enable a class to use the foreach statement. http://www.digitalmars.com/d/statement.html#foreach

4.) The D version of Scintilla and d-build was nice, very easy to use. Personally I would have preferred the default behaviour of dbuild to put object files in an /obj subdirectory and the final exe in the original directory dbuild is run from. This way, it could be run from a root directory, operate on a /src subdirectory, and not clutter up the source with object files. There is a switch for that, of course, but I can't imagine when you would want object files sitting in the same directory as the source.

Check out build: http://www.dsource.org/projects/build

Well, as first impressions go, I was pleased by D, and am interested to see how well it fares as time goes on. It's just a shame that all the tools/library/IDE is all in C++! Thanks, Geoff
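A small illustration of the slicing and concatenation operators mentioned above, as a sketch:

```d
void main()
{
    char[] s = "first impressions";
    char[] head = s[0 .. 5];        // slice: a view into s, no copy made
    char[] msg  = head ~ " class";  // ~ concatenates into a new array
    assert(head == "first");
    assert(msg == "first class");
}
```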
Sep 28 2006
On Fri, 29 Sep 2006 10:23:32 +1000, Geoff Carlton wrote:Hi, I'm a C++ user who's just tried D

I was a bit underwhelmed by the syntax of char[].

Yes. It isn't very 'nice' for a modern language. Though as you note below a simple alias can help a lot.

alias char[] string;

I believe single functions get pulled in as member functions? e.g. find(string) can be used as string.find()?

This syntax sugar works for all arrays.

func(T[] x, a)
x.func(a)

are equivalent.

2.) I liked the more powerful for loop. I'm curious is there any ability to use delegates in the same way as lua does?

Yes it can use anonymous delegates. You can also overload it in classes.

3.) From the newsgroups, it seems like 'auto' as local raii and 'auto' as automatic type deduction are still linked to the one keyword.

There are lots of D users hoping that this wart will be repaired before too long.

4.) The D version of Scintilla and d-build was nice, very easy to use. Personally I would have preferred the default behaviour of dbuild to put object files in an /obj subdirectory and the final exe in the original directory dbuild is run from. This way, it could be run from a root directory, operate on a /src subdirectory, and not clutter up the source with object files. There is a switch for that, of course, but I can't imagine when you would want object files sitting in the same directory as the source.

Thanks for the Build comments. One unfortunate thing I find is that one person's defaults are another's exceptions. That is why you can tailor Build to your 'default' behaviour requirements. In this case, create a text file in the same directory that Build.exe is installed in, called 'build.cfg' and place in it the line ...

CMDLINE=-od./obj

Then when you run the tool, the command line switch is applied every time you run it. -- Derek (skype: derek.j.parnell) Melbourne, Australia "Down with mediocrity!" 29/09/2006 4:44:52 PM
Sep 28 2006
Derek Parnell wrote:On Fri, 29 Sep 2006 10:23:32 +1000, Geoff Carlton wrote:On the other hand, the reasons other languages have strings as classes is because they just don't support arrays very well. C++'s std::string combines the worst of core functionality and libraries, and has the advantages of neither. An early design goal for D was to upgrade arrays to the point where string classes weren't necessary.I was a bit underwhelmed by the syntax of char[].Yes. It isn't very 'nice' for a modern language. Though as you note below a simple alias can help a lot. alias char[] string;
Sep 29 2006
Walter Bright wrote:An early design goal for D was to upgrade arrays to the point where string classes weren't necessary.A string alias might still be, just as the bool alias was. --anders
Sep 29 2006
On Fri, 29 Sep 2006 01:24:50 -0700, Walter Bright wrote:Derek Parnell wrote:And is it there yet? I mean, given that a string is just a lump of text, is there any text processing operation that cannot be simply done to a char[] item? I can't think of any but maybe somebody else can. And if a char[] is just as capable as a std::string, then why not have an official alias in Phobos? Will 'alias char[] string' cause anyone any problems? -- Derek Parnell Melbourne, Australia "Down with mediocrity!"On Fri, 29 Sep 2006 10:23:32 +1000, Geoff Carlton wrote:On the other hand, the reasons other languages have strings as classes is because they just don't support arrays very well. C++'s std::string combines the worst of core functionality and libraries, and has the advantages of neither. An early design goal for D was to upgrade arrays to the point where string classes weren't necessary.I was a bit underwhelmed by the syntax of char[].Yes. It isn't very 'nice' for a modern language. Though as you note below a simple alias can help a lot. alias char[] string;
Sep 29 2006
Derek Parnell wrote:On Fri, 29 Sep 2006 01:24:50 -0700, Walter Bright wrote:The string you're talking about is not just a lump of text. More specifically it's a lump of text, irregularly interspersed with short non-ascii ubyte sequences. The latter being of course the tails of UTF-8 "characters".An early design goal for D was to upgrade arrays to the point where string classes weren't necessary.And is it there yet? I mean, given that a string is just a lump of text
Sep 29 2006
Derek Parnell wrote:On Fri, 29 Sep 2006 01:24:50 -0700, Walter Bright wrote:I just quickly want to interject my wish for aliases for the basic string array types. -DavidMDerek Parnell wrote:And is it there yet? I mean, given that a string is just a lump of text, is there any text processing operation that cannot be simply done to a char[] item? I can't think of any but maybe somebody else can. And if a char[] is just as capable as a std::string, then why not have an official alias in Phobos? Will 'alias char[] string' cause anyone any problems?On Fri, 29 Sep 2006 10:23:32 +1000, Geoff Carlton wrote:On the other hand, the reasons other languages have strings as classes is because they just don't support arrays very well. C++'s std::string combines the worst of core functionality and libraries, and has the advantages of neither. An early design goal for D was to upgrade arrays to the point where string classes weren't necessary.I was a bit underwhelmed by the syntax of char[].Yes. It isn't very 'nice' for a modern language. Though as you note below a simple alias can help a lot. alias char[] string;
Sep 29 2006
Derek Parnell wrote:And is it there yet? I mean, given that a string is just a lump of text, is there any text processing operation that cannot be simply done to a char[] item? I can't think of any but maybe somebody else can.I believe it's there. I don't think std::string or java.lang.String have anything over it.And if a char[] is just as capable as a std::string, then why not have an official alias in Phobos? Will 'alias char[] string' cause anyone any problems?I don't think it'll cause problems, it just seems pointless.
Sep 29 2006
Immutability and some guarantees about the validity of the state of an immutable string in a concurrent setting are what set Java strings apart. Garbage collection without immutable strings in the standard library is quite out of the ordinary. Walter Bright wrote:Derek Parnell wrote:And is it there yet? I mean, given that a string is just a lump of text, is there any text processing operation that cannot be simply done to a char[] item? I can't think of any but maybe somebody else can.I believe it's there. I don't think std::string or java.lang.String have anything over it.And if a char[] is just as capable as a std::string, then why not have an official alias in Phobos? Will 'alias char[] string' cause anyone any problems?I don't think it'll cause problems, it just seems pointless.
Sep 29 2006
On Fri, 29 Sep 2006 10:04:57 -0700, Walter Bright wrote:Derek Parnell wrote:I'm pretty sure that the phobos routines for search and replace only work for ASCII text. For example, std.string.find(japanesetext, "a") will nearly always fail to deliver the correct result. It finds the first occurance of the byte value for the letter 'a' which may well be inside a Japanese character. It looks for byte-subsets rather than character sub-sets.And is it there yet? I mean, given that a string is just a lump of text, is there any text processing operation that cannot be simply done to a char[] item? I can't think of any but maybe somebody else can.I believe it's there. I don't think std::string or java.lang.String have anything over it.It may very well be pointless for your way of thinking, but your language is also for people who may not necessarily think in the same manner as yourself. I, for example, think there is a point to having my code read like its dealing with strings rather than arrays of characters. I suspect I'm not alone. We could all write the alias in all our code, but you could also be helpful and do it for us - like you did with bit/bool. -- Derek Parnell Melbourne, Australia "Down with mediocrity!"And if a char[] is just as capable as a std::string, then why not have an official alias in Phobos? Will 'alias char[] string' cause anyone any problems?I don't think it'll cause problems, it just seems pointless.
Sep 29 2006
Derek Parnell wrote:I'm pretty sure that the phobos routines for search and replace only work for ASCII text. For example, std.string.find(japanesetext, "a") will nearly always fail to deliver the correct result. It finds the first occurance of the byte value for the letter 'a' which may well be inside a Japanese character. It looks for byte-subsets rather than character sub-sets.I take it that you mean that the bit pattern, or byte, 'a' (as in 0x61) may be found within a Japanese multibyte glyph? Or even a very long Japanese text. That is not correct. The designers of UTF-8 knew that this would be dangerous, and created UTF-8 so that such _will_not_happen_. Ever. Therefore, something like std.string.find() doesn't even have to know about it. Basically, std.string.find() and comparable functions, only have to receive two octet sequences, and see where one of them first occurs in the other. No need to be aware of UTF or ASCII. For all we know, the strings may even be in EBCDIC. Still works. If the strings themselves are valid (in whichever encoding you have chosen to use), then the result will also be valid. ((For the sake of completeness, here I've restricted the discussion to the version of such functions that accept ubyte[] compatible input (obviously including char[]). Those taking 16 or 32 bits, and especially if we deliberately feed input of wrong width to any of these, then of course the results will be more complicated.))
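Georg's point can be demonstrated directly: a byte-level search for an ASCII character in UTF-8 text can never land inside a multibyte sequence. A sketch (the Japanese text is just an arbitrary example; `std.string.find` is the Phobos routine under discussion):

```d
import std.string;

void main()
{
    // "日本語" is three characters, each three bytes in UTF-8
    char[] s = "日本語abc";
    assert(s.length == 12);  // 9 bytes of Japanese plus "abc"

    // byte-level search is still correct: 'a' (0x61) cannot occur
    // inside a multibyte sequence, whose bytes all have the high bit set
    int i = std.string.find(s, 'a');
    assert(i == 9);
    assert(s[i .. s.length] == "abc");
}
```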
Sep 29 2006
Derek Parnell wrote:I'm pretty sure that the phobos routines for search and replace only work for ASCII text. For example, std.string.find(japanesetext, "a") will nearly always fail to deliver the correct result. It finds the first occurance of the byte value for the letter 'a' which may well be inside a Japanese character.That cannot happen, because multibyte sequences *always* have the high bit set, and 'a' does not. That's one of the things that sets UTF-8 apart from other multibyte formats. You might be thinking of the older Shift-JIS multibyte encoding, which did suffer from such problems.It looks for byte-subsets rather than character sub-sets.I don't think it's broken, but if it is, those are bugs, not fundamental problems with char[], and should be filed in bugzilla.It may very well be pointless for your way of thinking, but your language is also for people who may not necessarily think in the same manner as yourself. I, for example, think there is a point to having my code read like its dealing with strings rather than arrays of characters. I suspect I'm not alone. We could all write the alias in all our code, but you could also be helpful and do it for us - like you did with bit/bool.I'm concerned about just adding more names that don't add real value. As I wrote in a private email discussion about C++ typedefs, they should only be used when: 1) they provide an abstraction against the presumption that the underlying type may change 2) they provide a self-documentation purpose (1) certainly doesn't apply to string. (2) may, but char[] has no use other than that of being a string, as a char[] is always a string and a string is always a char[]. So I don't think string fits (2). And lastly, there's the inevitable confusion. People learning the language will see char[] and string, and wonder which should be used when. I can't think of any consistent understandable rule for that. So it just winds up being wishy-washy. 
Adding more names into the global space (which is what names in object.d are) should be done extremely conservatively. If someone wants to use the string alias as their personal or company style, I have no issue with that, as other people *do* think differently than me (which is abundantly clear here!).
Sep 29 2006
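[Editor's note: the high-bit property Walter describes is a fact of the UTF-8 encoding itself, so it can be checked mechanically in any UTF-8-aware language. A minimal Python sketch; the Japanese sample text is arbitrary:]

```python
text = "日本語のテキスト"          # arbitrary Japanese sample text
data = text.encode("utf-8")

# every byte of a UTF-8 multibyte sequence has the high bit (0x80) set
assert all(b & 0x80 for b in data)

# so a plain byte-level search for ASCII 'a' (0x61) can never
# false-match inside a multibyte character
assert data.find(b"a") == -1
```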
On Fri, 29 Sep 2006 23:11:37 -0700, Walter Bright wrote:Derek Parnell wrote:Thanks. That has cleared up some misconceptions and preconceptions that I had with UTF encoding. I can reduce some of my home-grown routines now and reduce the number of times that I (think I) need dchar[] ;-)I'm pretty sure that the phobos routines for search and replace only work for ASCII text. For example, std.string.find(japanesetext, "a") will nearly always fail to deliver the correct result. It finds the first occurrence of the byte value for the letter 'a' which may well be inside a Japanese character.That cannot happen, because multibyte sequences *always* have the high bit set, and 'a' does not. That's one of the things that sets UTF-8 apart from other multibyte formats. You might be thinking of the older Shift-JIS multibyte encoding, which did suffer from such problems.No argument there.It may very well be pointless for your way of thinking, but your language is also for people who may not necessarily think in the same manner as yourself. I, for example, think there is a point to having my code read like it's dealing with strings rather than arrays of characters. I suspect I'm not alone. We could all write the alias in all our code, but you could also be helpful and do it for us - like you did with bit/bool.I'm concerned about just adding more names that don't add real value. As I wrote in a private email discussion about C++ typedefs, they should only be used when: 1) they provide an abstraction against the presumption that the underlying type may change 2) they provide a self-documentation purpose (1) certainly doesn't apply to string.(2) may, but char[] has no use other than that of being a string, as a char[] is always a string and a string is always a char[]. So I don't think string fits (2).This is a little more debatable, but not worth generating hostility. 
A string of text contains characters whose position in the string is significant - there are semantics to be applied to the entire text. It is quite possible to conceive of an application in which the characters in the char[] array have no importance attached to their relative position within the array *when compared to neighboring characters*. The order of characters in text is significant but not necessarily so in an arbitrary character array. Conceptually a string is different from a char[], even though they are implemented using the same technology.And lastly, there's the inevitable confusion. People learning the language will see char[] and string, and wonder which should be used when. I can't think of any consistent understandable rule for that. So it just winds up being wishy-washy. Adding more names into the global space (which is what names in object.d are) should be done extremely conservatively.And yet we have "toString" and not "toCharArray" or "toUTF"! And we still have the "printf" in object.d too!If someone wants to use the string alias as their personal or company style, I have no issue with that, as other people *do* think differently than me (which is abundantly clear here!).I'll revert Build to string again as it is a lot easier to read. It started out that way but I converted it to char[] to appease you (why I thought you needed appeasing is lost though). :-) -- Derek Parnell Melbourne, Australia "Down with mediocrity!"
Sep 30 2006
Derek Parnell wrote:I certainly hope this thread doesn't degenerate into that like some of the others.(2) may, but char[] has no use other than that of being a string, as a char[] is always a string and a string is always a char[]. So I don't think string fits (2).This is a lttle more debatable, but not worth generating hostility.A string of text contains characters whose position in the string is significant - there are semantics to be applied to the entire text. It is quite possible to conceive of an application in which the characters in the char[] array have no importance attached to their relative position within the array *where compared to neighboring characters*. The order of characters in text is significant but not necessarily so in a arbitary character array. Conceptually a string is different from a char[], even though they are implemented using the same technology.You do have a point there.True, and some have called for renaming char to utf8. While that would be technically more correct (as toUTF would be, too), it just looks awful. I suppose that since I grew up with char* meaning string, using char[] seems perfectly natural. I tried typedef'ing char* to string now and then, but always wound up going back to just using char*.And lastly, there's the inevitable confusion. People learning the language will see char[] and string, and wonder which should be used when. I can't think of any consistent understandable rule for that. So it just winds up being wishy-washy. Adding more names into the global space (which is what names in object.d are) should be done extremely conservatively.And yet we have "toString" and not "toCharArray" or "toUTF"!And we still have the "printf" in object.d too!I know many feel that printf doesn't belong there. It certainly isn't there for purity or consistency. It's there purely (!) for the convenience of writing short quickie programs. 
I tend to use it for quick debugging test cases, because it doesn't rely on the rest of D working.No, you certainly don't need to appease me! I do care about maintaining a reasonably consistent style in Phobos, but I don't believe a language should enforce a particular style beyond the standard library. Viva la difference. P.S. I did say to not 'enforce', but that doesn't mean I am above recommending a particular style, as in http://www.digitalmars.com/d/dstyle.htmlIf someone wants to use the string alias as their personal or company style, I have no issue with that, as other people *do* think differently than me (which is abundantly clear here!).I'll revert Build to string again as it is a lot easier to read. It started out that way but I converted it to char[] to appease you (why I thought you need appeasing is lost though). :-)
Sep 30 2006
On Sat, 30 Sep 2006 21:18:02 -0700, Walter Bright wrote:P.S. I did say to not 'enforce', but that doesn't mean I am above recommending a particular style, as in http://www.digitalmars.com/d/dstyle.htmlOh, I threw that away ages ago ;-) -- Derek Parnell Melbourne, Australia "Down with mediocrity!"
Oct 01 2006
Walter Bright wrote:Nope, it just looks correct. -- Lars Ivar Igesund blog at http://larsivi.net DSource & #D: larsiviAnd yet we have "toString" and not "toCharArray" or "toUTF"!True, and some have called for renaming char to utf8. While that would be technically more correct (as toUTF would be, too), it just looks awful.
Oct 01 2006
Lars Ivar Igesund wrote:Walter Bright wrote:I don't think renaming toString to toUTF gets rid of any confusion. AFAIK, toString is meant for debugging and char[] should be enough, and yet flexible enough for unicode strings. In fact, "string toString()" would be a good solution too. --- My 4 reasons for the "string" aliases:
* readability: fewer [] pairs;
* safety: char[] is not zero-terminated, so let's not pretend there's a relation with C's char*. In fact: let's hide any relation;
* clarity: a char[] should not be iterated 1 char at a time, which makes it different from an int[];
* consistency: "string toString()".
L.Nope, it just looks correct.And yet we have "toString" and not "toCharArray" or "toUTF"!True, and some have called for renaming char to utf8. While that would be technically more correct (as toUTF would be, too), it just looks awful.
Oct 02 2006
Walter Bright wrote:True, and some have called for renaming char to utf8. While that would be technically more correct (as toUTF would be, too), it just looks awful.Let's just say it would be a first step in lessening the confusion _we_ create in newcomers' heads.
Oct 01 2006
Georg Wrede wrote:Walter Bright wrote:I would kind of agree with this, but I think it's a two-edged knife. If we say 'char[]' then users don't know it's a string until they read the 'why D arrays are great' page (which they should read, but...) If we say 'string' then we hide the fact that [] can be applied and that other array-like operations can work. For instance, from a Java perspective: char[] : Users don't know that it's "String"; users see it as low-level. Some will try to write things like 'find()' by hand since they will figure arrays are low level and not expect this to exist. string : Users will think it's immutable, special; they will ask "how do I get one of the characters out of a string", "how do I convert string to char[]?", and other things that would be obvious without the alias. KevinTrue, and some have called for renaming char to utf8. While that would be technically more correct (as toUTF would be, too), it just looks awful.Let's just say it would be a first step in lessening the confusion _we_ create in newcomers' heads.
Oct 02 2006
Kevin Bealer wrote:Georg Wrede wrote:Yes.Walter Bright wrote:I would kind of agree with this, but I think it's a two-edged knife. If we say 'char[]' then users don't know it's a string until they read the 'why D arrays are great' page (which they should read, but...) If we say 'string' then we hide the fact that [] can be applied and that other array-like operations can work. For instance, from a Java perspective: char[] : Users don't know that it's "String"; users see it as low-level. Some will try to write things like 'find()' by hand since they will figure arrays are low level and not expect this to exist.True, and some have called for renaming char to utf8. While that would be technically more correct (as toUTF would be, too), it just looks awful.Let's just say it would be a first step in lessening the confusion _we_ create in newcomers' heads.string : Users will think it's immutable, special; they will ask "how do I get one of the characters out of a string", "how do I convert string to char[]?", and other things that would be obvious without the alias.Well, with string, folks would at least be inclined to search for the library function to do it. --- Overall, having string instead of char[] should result in folks learning and doing more with D _before_ they get tangled with UTF issues. (I guess, getting tangled with UTF is unavoidable.) But the more later folks stumble on this, the better they can handle it. If it happens too soon, then they will just run away from D. But substituting string for char[] in D is not enough. More than half the issue is the wording in the docs. --- Another thing intimately connected with this is whether we should have char[] or utf8[] (string or no string, this is an important thing anyway). I understand that "char" is one of the words that a seasoned programmer's fingers know by heart. So it would feel simply disgusting to have to learn (and bother) to write "utf8" which I admit is a lot more work to type. (Seriously.) 
Now, "string" is easy for the fingers, and then you get to skip "[]", which makes it all a little more palatable. Having string would let us have the underlying type be utf8[], which really emphasizes and calls your attention to the fact that it's not byte-by-byte stuff we have there.
Oct 03 2006
Kevin Bealer wrote:If we say 'char[]' then users don't know it's a string until they read the 'why D arrays are great' page (which they should read, but...) If we say 'string' then we hide the fact that [] can be applied and that other array-like operations can work.Which could be a *good* thing, since it would stop users from hurting themselves by pretending that the D strings are arrays of characters ? And when they have read up that they are "arrays of Unicode code units", they should be OK with interpreting the "string" alias as char[] arrays.For instance, from a Java perspective: char[] : Users don't know that it's "String"; users see it as low-level. Some will try to write things like 'find()' by hand since they will figure arrays are low level and not expect this to exist. string : Users will think it's immutable, special; they will ask "how do I get one of the characters out of a string", "how do I convert string to char[]?", and other things that would be obvious without the alias.I think the best answer would be: "to get a char[] from the string, use the std.utf.toUTF8 function", since this also works even if you redeclare the "string" alias to be something else - like wchar_t[] ? Earlier* I suggested adding the alias utf8_t for "char", just like we have int8_t for "byte", but I wouldn't rename the actual D types. Just a little std.stdutf module with some aliases, if ever needed... string std.string.toString( ) utf8_t[] std.utf.toUTF8( ) utf16_t[] std.utf.toUTF16( ) utf32_t[] std.utf.toUTF32( ) --anders * digitalmars.D/11821, 2004-10-15
Oct 03 2006
Derek Parnell wrote:Precisely! And even if such conceptual difference didn't exist, or is very rare, 'string' is nonetheless more readable than 'char[]', a fact I am constantly reminded of when I see 'int main(char[][] args)' instead of 'int main(string[] args)', which translates much more quickly into the brain as 'array of strings' than its current counterpart. -- Bruno Medeiros - MSc in CS/E student http://www.prowiki.org/wiki4d/wiki.cgi?BrunoMedeiros#D(2) may, but char[] has no use other than that of being a string, as a char[] is always a string and a string is always a char[]. So I don't think string fits (2).This is a lttle more debatable, but not worth generating hostility. A string of text contains characters whose position in the string is significant - there are semantics to be applied to the entire text. It is quite possible to conceive of an application in which the characters in the char[] array have no importance attached to their relative position within the array *where compared to neighboring characters*. The order of characters in text is significant but not necessarily so in a arbitary character array. Conceptually a string is different from a char[], even though they are implemented using the same technology.
Oct 01 2006
Bruno Medeiros wrote:Precisely! And even if such conceptual difference didn't exist, or is very rare, 'string' is nonetheless more readable than 'char[]', a fact I am constantly reminded of when I see 'int main(char[][] args)' instead of 'int main(string[] args)', which translates much more quickly into the brain as 'array of strings' than its current counterpart.There are also many cases where char arrays are not strings:

Single array of characters, not strings:
    char GAME_10PT_LETTERS[] = { 'x', 'z' };

Two-dimensional array of characters, not string arrays:
    char GAME_LETTERS[][] = { GAME_0PT_LETTERS, GAME_1PT_LETTERS, .. };
    char m_scrabbleBoard[20][20];
Oct 01 2006
Derek Parnell schrieb am 2006-09-30:On Fri, 29 Sep 2006 10:04:57 -0700, Walter Bright wrote:~wow~ Have a look at std.string.find's source and try to stop giggling *g* The correct implementation would be: The same applies to ifind and the like. Thomas
Derek Parnell wrote:I'm pretty sure that the phobos routines for search and replace only work for ASCII text. For example, std.string.find(japanesetext, "a") will nearly always fail to deliver the correct result. It finds the first occurance of the byte value for the letter 'a' which may well be inside a Japanese character. It looks for byte-subsets rather than character sub-sets.And is it there yet? I mean, given that a string is just a lump of text, is there any text processing operation that cannot be simply done to a char[] item? I can't think of any but maybe somebody else can.I believe it's there. I don't think std::string or java.lang.String have anything over it.
Sep 30 2006
Thomas Kuehne schrieb am 2006-09-30:Derek Parnell schrieb am 2006-09-30:As it seems, the original code depends on the undocumented index behavior with regards to silent transcoding in foreach. Thomas
On Fri, 29 Sep 2006 10:04:57 -0700, Walter Bright wrote:~wow~ Have a look at std.string.find's source and try to stop giggling *g* The correct implementation would be:Derek Parnell wrote:I'm pretty sure that the phobos routines for search and replace only work for ASCII text. For example, std.string.find(japanesetext, "a") will nearly always fail to deliver the correct result. It finds the first occurance of the byte value for the letter 'a' which may well be inside a Japanese character. It looks for byte-subsets rather than character sub-sets.And is it there yet? I mean, given that a string is just a lump of text, is there any text processing operation that cannot be simply done to a char[] item? I can't think of any but maybe somebody else can.I believe it's there. I don't think std::string or java.lang.String have anything over it.
Sep 30 2006
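[Editor's note: the subtlety behind the find/foreach exchange above is that a UTF-8 search can report either a byte (code unit) offset or a character (code point) index, and the two disagree as soon as multibyte characters precede the match. A small Python sketch of the distinction:]

```python
s = "日本a"                       # two 3-byte characters, then ASCII 'a'
data = s.encode("utf-8")

# a byte-level search reports the code-unit offset...
assert data.find(b"a") == 6

# ...while a character-level search reports the code-point index
assert s.find("a") == 2
```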
Thomas Kuehne wrote:As it seems, the original code depends on the undocumented index behavior with regards to silent transcoding in foreach.The wording could be more explicit, but I think the current documentation implies the actual behavior: "The index must be of int or uint type, it cannot be inout, and it is set to be the index of the array element." The docs should probably also be revised to allow for 64-bit indices, where the index would be long or ulong. Something along the lines of: "The index must be an integer type of size equal to size_t.sizeof. . ." Sean
Sep 30 2006
Walter Bright wrote:Derek Parnell wrote:Hi, The main reasons I think are these: It simplifies the initial examples, particularly main(string[]), and maps such as string[string]. More complex examples are a map of words to text lines, string[][string], rather than char[][][char[]]. It clarifies the actual use of the entity. It is a text string, not just a jumbled array of characters. Arrays of char can be used for other things, such as the set of player letters in a scrabble game. A string has the additional usage that we know it as is text string. The alias reflects that intent. Given a user wants to use a string, there is no need to expose the implementation detail of how strings are done in D. Perhaps in perl, strings are a linked list of shorts, but it doesn't mean that you'd have list<short> all over the place. Use of char[] and char[][] looks like low level C. It has also been noted that it encourages char based indexing, which is not a good thing for utf8. Anyway, hope one of those points grabbed you! GeoffAnd is it there yet? I mean, given that a string is just a lump of text, is there any text processing operation that cannot be simply done to a char[] item? I can't think of any but maybe somebody else can.I believe it's there. I don't think std::string or java.lang.String have anything over it.And if a char[] is just as capable as a std::string, then why not have an official alias in Phobos? Will 'alias char[] string' cause anyone any problems?I don't think it'll cause problems, it just seems pointless.
Sep 29 2006
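[Editor's note: Geoff's string[][string] example - a map from words to the text lines containing them - can be sketched in Python terms; the names and sample lines are illustrative only:]

```python
lines = ["the quick fox", "jumps over the dog"]

# build the word -> lines-containing-it map (string[][string] in D terms)
index: dict[str, list[str]] = {}
for line in lines:
    for word in line.split():
        index.setdefault(word, []).append(line)

assert index["the"] == ["the quick fox", "jumps over the dog"]
assert index["fox"] == ["the quick fox"]
```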
Walter Bright wrote:Derek Parnell wrote:The reason *I* want it is _alias_ does not respect the private: visibility modifier. So when I pull out an old piece of code which says alias char[] string and import it in my newer module I get conflicts when I compile. Then I must do this silly hack where I include the newer file from the old or vice versa. If you don't add this into phobos, at least adopt a method to discriminate between more than one alias with the same name to resolve the issue. -DavidMAnd is it there yet? I mean, given that a string is just a lump of text, is there any text processing operation that cannot be simply done to a char[] item? I can't think of any but maybe somebody else can.I believe it's there. I don't think std::string or java.lang.String have anything over it.And if a char[] is just as capable as a std::string, then why not have an official alias in Phobos? Will 'alias char[] string' cause anyone any problems?I don't think it'll cause problems, it just seems pointless.
Sep 29 2006
Geoff Carlton wrote:A simple alias of char[] to string would simplify the first glance code. string x; // yep, a string main (string[]) // an array of strings string[string] m; // map of string to string I believe single functions get pulled in as member functions? e.g. find(string) can be used as string.find()? If so, it means that all the string functionality can be added and then used naturally as member functions on this "string" (which is really just the plain old char[] in disguise).Problem of "char[]" is both that it hides the fact that "char" is UTF-8 while at the same time it exposes the fact that it's stored as an array. You can "improve" upon that readability with aliases, like declaring say utf8_t -> char and string -> utf8_t[], but you still need to understand Unicode and Arrays in order to use it outside of the provided methods... I think "hides the implementation" was the biggest argument against it ? http://www.prowiki.org/wiki4d/wiki.cgi?UnicodeIssuesThis is a small thing, but I think it would help in terms of the mindset of strings being a first class primitive, and clear up simple "hello world" examples at the same time. Put simply, every modern language has a first class string primitive type, except D - at least in terms of nomenclature.I did the big mistake of thinking it would be a good thing to be able to switch between "ANSI" and "UNICODE" builds (of wxD), and so did it like: version(UNICODE) alias char[] string; else // version(ANSI) alias wchar_t[] string; // wchar[] on Windows, dchar[] on Unix Still trying to sort out all the code problems with that idea, as there is a ton of toUTF8 and other conversions to make strings work together. In retrospect it would have been much easier to have stuck with char[], and do the conversion from UTF-8 to the local encoding on the C++ side. 
(since there were no guarantees that the "char" and "wchar_t" types in C++ used UTF encodings, even if they did so in Unix/GTK+ for instance) Any (minor) performance issues of having to do the UTF-8 <-> UTF-32 conversions were not worth the hassle of doing it on the D side, IMHO. So I agree with the "alias char[] string;" and the string[string] args. It's going to be used as wx.common.string for instance, in wxD library. --anders
Sep 29 2006
I did the big mistake of thinking it would be a good thing to be able to switch between "ANSI" and "UNICODE" builds (of wxD), and so did it like: version(UNICODE) alias char[] string; else // version(ANSI) alias wchar_t[] string; // wchar[] on Windows, dchar[] on UnixExcept the other way around, of course!

version(UNICODE)
    alias wchar_t[] string;
else // version(ANSI)
    alias char[] string;

Now, to get me some more coffee... :-P --anders
Sep 29 2006
I also ALWAYS create aliases for char[], wchar[], dchar[]... I DO wish they would be included by default in Phobos.

alias char[]  string;
alias wchar[] wstring;
alias dchar[] dstring;

Perhaps, using string instead of char[], it's more obvious that it's not zero-terminated. I've seen D examples online that just cast a char[] to char* for use in MessageBox and the like (which worked since they were string constants.) L.
Sep 29 2006
Lionello Lunesu wrote:Perhaps, using string instead of char[], it's more obvious that it's not zero-terminated. I've seen D examples online that just cast a char[] to char* for use in MessageBox and the like (which worked since it were string constants.)And probably only for ASCII string constants, at that... --anders
Sep 29 2006
Anders F Björklund wrote:Lionello Lunesu wrote:Right, that too!

char[] somestring = "....";
func( somestring[0] );  // WRONG: somestring[x] is not 1 character!

Using "string" would make it less obvious:

string somestring = ".....";
func( somestring[0] );  // [0] means what?

This goes for iteration as well. DMD will still deduce 'char' as the type, but at least one's less likely to type foreach(char c;str). If you want to iterate the UNICODE characters in a string, you'll specify "dchar" as the type and you won't worry about "how come I can use dchar when it's a char[]":

foreach(dchar c; somestring) func(c);  // correct

L.Perhaps, using string instead of char[], it's more obvious that it's not zero-terminated. I've seen D examples online that just cast a char[] to char* for use in MessageBox and the like (which worked since it were string constants.)And probably only for ASCII string constants, at that...
Sep 29 2006
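[Editor's note: the code-unit vs. code-point difference that foreach(dchar c; somestring) papers over can be sketched in Python, with bytes standing in for char[] and str iteration for the dchar foreach:]

```python
s = "héllo"
data = s.encode("utf-8")          # the char[] view: UTF-8 code units

assert len(data) == 6             # six code units (é takes two bytes)
assert len(list(s)) == 5          # five code points when iterated as characters

# indexing a single code unit does not yield a whole character
assert data[1:2] != "é".encode("utf-8")
```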
Lionello Lunesu wrote:I also ALWAYS create aliases for char[], wchar[], dchar[]... I DO wish they would be included by default in Phobos. alias char[] string; alias wchar[] wstring; alias dchar[] dstring; Perhaps, using string instead of char[], it's more obvious that it's not zero-terminated. I've seen D examples online that just cast a char[] to char* for use in MessageBox and the like (which worked since it were string constants.)Using char[] as long as you don't know about UTF seems to work pretty well in D. But the moment you realise that we're having potential multibyte characters in what essentially is a ubyte[], you get scared to death, and start to wonder how on earth you haven't yet blown up your hard disk. You start having nightmares about slicing char arrays at the wrong place, extracting single chars that might not be storable in a char, and all of a sudden you decide to stick with your old language "till things calm down". The only medicine to this is simply to shut your eyes and keep coding on like you never did realise anything. It's a little like when you first realised Daddy isn't holding your bike: you instantly fall hurting yourself, instead of realizing that he's probably let go ages ago, and you still haven't fallen, so simply keep going. --- This doesn't mean I'm happy with this either, but I don't have the energy to conjure up a significantly better solution _and_ fight for it till it gets accepted. (Some things are just too hard to fix, like "bit=bool" was, and now "auto/auto".)
Sep 29 2006
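[Editor's note: Georg's slicing nightmare is concrete - cutting a char[] at an arbitrary byte offset can split a multibyte character in half. A Python sketch of exactly that failure; the Swedish sample word is arbitrary:]

```python
s = "smörgås"
data = s.encode("utf-8")

# byte offset 3 falls in the middle of 'ö' (a two-byte sequence)
try:
    data[:3].decode("utf-8")
    split_cleanly = True
except UnicodeDecodeError:
    split_cleanly = False
assert not split_cleanly

# slicing on a character boundary (offset 2) is fine
assert data[:2].decode("utf-8") == "sm"
```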
Georg Wrede wrote:Lionello Lunesu wrote:haha too true. I experienced this too as I read this ng. It hasn't been THAT traumatic for me though, since everything seems to work as long as you stick to English. I don't have the resources to even begin thinking about non-English text (ex: paying people to translate stuff), so I don't lose any sleep about it, at least not yet. Perhaps there should be a string struct/class that has an undefined underlying type (it could be UTF-8, 16, 32, you dunno really), and you could index it to get the *complete* character at any position in the string. Basically, it is like char[], but it /just works/ in all cases. I'd almost rather have the size of a char be undefined, and just have char[] be the said magic string type. If you want something with a .size of 1, then there is byte/ubyte. There would probably have to be some stuff in the phobos internals to handle such a string in a correct manner. Going even further... if you could make char[] be such a magic string type, then wchar[] and dchar[] could probably be deprecated - use ushort and uint instead. Then add the following aliases to phobos:

alias ubyte  utf8;
alias ushort utf16;
alias uint   utf32;

Just a thought. I'm no expert on UTF, but maybe this can start a discussion that will result in the nightmares ending :)I also ALWAYS create aliases for char[], wchar[], dchar[]... I DO wish they would be included by default in Phobos. alias char[] string; alias wchar[] wstring; alias dchar[] dstring; Perhaps, using string instead of char[], it's more obvious that it's not zero-terminated. I've seen D examples online that just cast a char[] to char* for use in MessageBox and the like (which worked since it were string constants.)Using char[] as long as you don't know about UTF seems to work pretty well in D. 
But the moment you realise that we're having potential multibyte characters in what essentially is a ubyte[], you get scared to death, and start to wonder how on earth you haven't yet blown up your hard disk. You start having nightmares about slicing char arrays at the wrong place, extracting single chars that might not be storable in a char, and all of a sudden you decide to stick with your old language "till things calm down". The only medicine to this is simply to shut your eyes and keep coding on like you never did realise anything. It's a little like when you first realised Daddy isn't holding your bike: you instantly fall hurting yourself, instead of realizing that he's probably let go ages ago, and you still haven't fallen, so simply keep going. --- This doesn't mean I'm happy with this either, but I don't have the energy to conjure up a significantly better solution _and_ fight for it till it gets accepted. (Some things are just too hard to fix, like "bit=bool" was, and now "auto/auto".)
Sep 29 2006
Chad J > wrote:Perhaps there should be a string struct/class that has an undefined underlying type (it could be UTF-8, 16, 32, you dunno really), and you could index it to get the *complete* character at any position in the string. Basically, it is like char[], but it /just works/ in all cases. I'd almost rather have the size of a char be undefined, and just have char[] be the said magic string type. If you want something with a .size of 1, then there is byte/ubyte. There would probably have to be some stuff in the phobos internals to handle such a string in a correct manner.I have thought about this too.Going even further... if you could make char[] be such a magic string type, then wchar[] and dchar[] could probably be deprecated - use ushort and uint instead. Then add the following aliases to phobos: alias ubyte utf8; alias ushort utf16; alias uint utf32;I completely agree, char should hold a character independently of encoding and NOT a code unit or something else. I think it would be beneficial to D in the long term if chars were done right (meaning that they can store any character); how it is implemented is not important, and I believe performance is not a problem here, so ease of use and correctness would be appreciated.
Sep 29 2006
Johan Granberg wrote:I completely agree, char should hold a character independently of encoding and NOT a code unit or something else. I think it would be beneficial to D in the long term if chars where done right (meaning that they can store any character) how it is implemented is not important and i believe performance is not a problem here, so ease of use and correctness would be appreciated.Why isn't performance a problem? If you are saying that this won't cause performance hits in run times or memory space, I might be able to buy it, but I'm not yet convinced. If you are saying that causing a performance hit in run times or memory space is not a problem... in that case I think you are dead wrong and you will not convince me otherwise. In my opinion, any compiled language should allow fairly direct access to the most efficient practical means of doing something*. If I didn't care about speed and memory I wound use some sort of scripting language. A good set of libs should make most of this moot. Leave the char as is and define a typedef struct or whatever that provides the added functionality that you want. * OTOH a language should not mandate code to be efficient at the expense of ease of coding.
Sep 29 2006
BCS wrote:Johan Granberg wrote:I will go ahead and say that the current state of char[] is incorrect. That is, if you write a program manipulating char[] strings, then run it in China, you will be disappointed with the results. It won't matter how fast the program runs, because bad stuff will happen like entire strings becoming unreadable to the user. Technically if you follow UTF and do your char[] manipulations very carefully, it is correct, but realistically few if any people will do such things (I won't). Also, if you do this, your program will probably run as slow as one with the proposed char/string solution, maybe slower (since language/stdlib level support can be heavily optimized). What I'd like then, is a program that is correct and as fast as possible while still being correct. Sure you can get some speed gains by just using ASCII and saying to hell with UTF, but you should probably only do that when profiling has shown that such speed gains are actually useful/needed in your program. Ultimately we have to decide whether we want D to default to UTF code which might run slightly slower but allow better localization and international friendliness, or if we want it to default to ASCII or some such encoding that runs slightly faster but is mostly limited to English. I'd like the default to be UTF. Then we can have a base of code to correctly manipulate UTF strings (in phobos and language supported). Writing correct ASCII manipulation routines without good library/language support is a lot easier than writing good UTF manipulation routines without good library/language support, and UTF will probably be used much more than ASCII. Also, if we move over to full blown UTF, we won't have to give up ASCII. It seems to me like the phobos std.string functions are pretty much ASCII string manipulating functions (no multibyte string support). So just copy those out to a separate library, call it "ASCII lib", and there's your library support for ASCII. 
That leaves string literals, which is a slight problem, but I suppose easily fixed: ubyte[] hi = "hello!"a; Just add a postfix 'a' for strings which makes the string an ASCII literal, of type ubyte[]. D arrays don't seem powerful enough to do UTF manipulations without special attention, but they are powerful enough to do ASCII manipulations without special attention, so using ubyte[] as an ASCII string should give full language support for these. Given that and ASCIILIB you pretty much have the current D string manipulation capabilities afaik, and it will be fast.
Sep 29 2006
Chad J > wrote:I'd like the default to be UTF. Then we can have a base of code to correctly manipulate UTF strings (in phobos and language supported). Writing correct ASCII manipulation routine without good library/language support is a lot easier than writing good UTF manipulation routines without good library/language support, and UTF will probably be used much more than ASCII.But D already uses Unicode for all strings, encoded as UTF ? When you say "ASCII", do you mean 8-bit encodings perhaps ? (since all proper 7-bit ASCII are already valid UTF-8 too)Also, if we move over to full blown UTF, we won't have to give up ASCII. It seems to me like the phobos std.string functions are pretty much ASCII string manipulating functions (no multibyte string support). So just copy those out to a seperate library, call it "ASCII lib", and there's your library support for ASCII. That leaves string literals, which is a slight problem, but I suppose easily fixed: ubyte[] hi = "hello!"a;I don't understand this, why can't you use UTF-8 for this ? char[] hi = "hello!";Just add a postfix 'a' for strings which makes the string an ASCII literal, of type ubyte[]. D arrays don't seem powerful enough to do UTF manipulations without special attention, but they are powerful enough to do ASCII manipulations without special attention, so using ubyte[] as an ASCII string should give full language support for these. Given that and ASCIILIB you pretty much have the current D string manipulation capabilities afaik, and it will be fast.What is not powerful enough about the foreach(dchar c; str) ? It will step through that UTF-8 array one codepoint at a time. --anders
Sep 29 2006
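A minimal sketch of the foreach(dchar c; str) decoding Anders describes above, assuming a D1-era compiler and Phobos; the byte counts follow directly from the UTF-8 encoding rules:

```d
import std.stdio;

void main()
{
    char[] str = "héllo";   // 'é' takes two UTF-8 code units

    // str.length counts code units (bytes), not characters
    writefln("code units:  %d", str.length);

    // a dchar loop variable makes foreach decode one code point per step
    int codepoints = 0;
    foreach (dchar c; str)
        codepoints++;
    writefln("code points: %d", codepoints);
}
```

Here str.length reports 6 while the code point count is 5, which is exactly the unit mismatch the thread keeps returning to.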
Anders F Björklund wrote:Chad J > wrote:Probably 7-bit. Anything where the size of one character is ALWAYS one byte. I am already assuming that ASCII is a subset or at least is mostly a subset of UTF8. However, I talk about it in an exclusive manner because if you handle UTF8 strings properly then the code will probably run at least slightly slower than with ASCII-only strings.I'd like the default to be UTF. Then we can have a base of code to correctly manipulate UTF strings (in phobos and language supported). Writing correct ASCII manipulation routine without good library/language support is a lot easier than writing good UTF manipulation routines without good library/language support, and UTF will probably be used much more than ASCII.But D already uses Unicode for all strings, encoded as UTF ? When you say "ASCII", do you mean 8-bit encodings perhaps ? (since all proper 7-bit ASCII are already valid UTF-8 too)I was talking about IF we made char[] into a datatype that handles all of those odd corner cases correctly (slices into multibyte strings, for instance) then it will no longer be the same fast ASCII-only routines. So for those who want the fast ASCII-only stuff, it would nice to specify a way to make string literals such that each character in the literal takes only one byte, without ugly casting. To get an ASCII monobyte string from a string literal in D I would have to do the following: ubyte[] hi = cast(ubyte[])"hello!"; hmmm, yuck.Also, if we move over to full blown UTF, we won't have to give up ASCII. It seems to me like the phobos std.string functions are pretty much ASCII string manipulating functions (no multibyte string support). So just copy those out to a seperate library, call it "ASCII lib", and there's your library support for ASCII. That leaves string literals, which is a slight problem, but I suppose easily fixed: ubyte[] hi = "hello!"a;I don't understand this, why can't you use UTF-8 for this ? 
char[] hi = "hello!";I'm assuming 'str' is a char[], which would make that very nice. But it doesn't solve correctly slicing or indexing into a char[]. If nothing was done about this and I absolutely needed UTF support, I'd probably make a class like so: class String { char[] data; ... dchar opIndex( int index ) { foreach( int i, dchar c; data ) { if ( i == index ) return c; i++; } } // similar thing for opSlice down here ... } Which is probably slower than could be done. All in all it is a drag that we should have to learn all of this UTF stuff. I want char[] to just work!Just add a postfix 'a' for strings which makes the string an ASCII literal, of type ubyte[]. D arrays don't seem powerful enough to do UTF manipulations without special attention, but they are powerful enough to do ASCII manipulations without special attention, so using ubyte[] as an ASCII string should give full language support for these. Given that and ASCIILIB you pretty much have the current D string manipulation capabilities afaik, and it will be fast.What is not powerful enough about the foreach(dchar c; str) ? It will step through that UTF-8 array one codepoint at a time.
Sep 29 2006
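Chad's worry about byte-index slicing is easy to demonstrate; a sketch (assuming std.utf.validate and UtfException from D1-era Phobos) where a slice boundary lands in the middle of 'ö':

```d
import std.utf;

void main()
{
    char[] str = "Björklund";
    // 'ö' occupies bytes 2 and 3, so slicing at byte 3 splits it
    char[] bad = str[0 .. 3];

    bool threw = false;
    try
        validate(bad);          // throws on malformed UTF-8
    catch (UtfException e)
        threw = true;
    assert(threw);              // the naive slice is not valid UTF-8 on its own
}
```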
Chad J > wrote:Probably 7-bit. Anything where the size of one character is ALWAYS one byte. I am already assuming that ASCII is a subset or at least is mostly a subset of UTF8. However, I talk about it in an exclusive manner because if you handle UTF8 strings properly then the code will probably run at least slightly slower than with ASCII-only strings.It's mostly about looking out for the UTF "control" characters, which is not more than a simple assertion in your ASCII-only functions really... I don't think handling UTF-8 properly is a burden for string functions, when you compare it with the enormous gain that it has over ASCII-only.Well, it's also a lot "trickier" than that... For instance, my last name can be written in Unicode as Björklund or Bj¨orklund, both of which are valid - only that in one of them, the 'ö' occupies two full code points! It's still a single character, which is why Unicode avoids that term... As you know, if you need to access your strings by codepoint (something that the Unicode group explicitly recommends against, in their FAQ) then char[] isn't a very nice format - because of the conversion overhead... But it's still possible to translate, transform, and translate back ?What is not powerful enough about the foreach(dchar c; str) ? It will step through that UTF-8 array one codepoint at a time.I'm assuming 'str' is a char[], which would make that very nice. But it doesn't solve correctly slicing or indexing into a char[].If nothing was done about this and I absolutely needed UTF support, I'd probably make a class like so: [...]In my own mock String class, I cached the dchar[] codepoints on demand. (viewable at http://www.algonet.se/~afb/d/dcaf/html/class_string.html)All in all it is a drag that we should have to learn all of this UTF stuff. I want char[] to just work!Using Unicode strings and characters does require a little learning... 
(where http://www.unicode.org/faq/utf_bom.html is a very good page) And D does force you to think about string implementation, no question. This has both pros and cons, but it is a deliberate language decision. If you're willing to handle the "surrogates", then UTF-16 is a rather good trade-off between the default UTF-8 and wasteful UTF-32 formats ? A downside is that it is not "ascii-compatible" (has embedded NUL chars) and that it is endian-dependent, unlike the more universal UTF-8 format. --anders
Sep 29 2006
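To make the surrogate trade-off concrete, here is a sketch of how a single supplementary-plane character splits across the three encodings (U+1D11E, the musical G clef, is an arbitrary example from outside the BMP):

```d
void main()
{
    // U+1D11E lies outside the BMP, so UTF-16 needs a surrogate pair
    wchar[] w = "\U0001D11E";
    assert(w.length == 2);      // two code units for one code point

    dchar[] d = "\U0001D11E";
    assert(d.length == 1);      // UTF-32: always one unit per code point

    char[] c = "\U0001D11E";
    assert(c.length == 4);      // UTF-8: four bytes for this code point
}
```

This is the "16 bits don't suffice" case Georg warns about: code that assumes one wchar per character breaks exactly here.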
Anders F Björklund wrote:If you're willing to handle the "surrogates", then UTF-16 is a rather good trade-off between the default UTF-8 and wasteful UTF-32 formats ? A downside is that it is not "ascii-compatible" (has embedded NUL chars) and that it is endian-dependant unlike the more universal UTF-8 format.Problem is, using 16-bit you sort-of get away with _almost_ all of it. But as a pay-back, the day your 16 bits don't suffice, you're in deep crap. And that day _will_ come.
Sep 29 2006
Anders F Björklund wrote:So it seems to me the problem is that those 2 bytes are both 2 characters and 1 character at the same time. In this case, I'd prefer being able to index to a safe default (like the ö, instead of the umlauts next to the o), or not being able to index at all.Well, it's also a lot "trickier" than that... For instance, my last name can be written in Unicode as Björklund or Bj¨orklund, both of which are valid - only that in one of them, the 'ö' occupies two full code points! It's still a single character, which is why Unicode avoids that term...What is not powerful enough about the foreach(dchar c; str) ? It will step through that UTF-8 array one codepoint at a time.I'm assuming 'str' is a char[], which would make that very nice. But it doesn't solve correctly slicing or indexing into a char[].As you know, if you need to access your strings by codepoint (something that the Unicode group explicitly recommends against, in their FAQ) then char[] isn't a very nice format - because of the conversion overhead... But it's still possible to translate, transform, and translate back ?I read that FAQ at the bottom of this post, and didn't see anything about accessing strings by codepoint. Maybe you mean a different FAQ here, in which case, could I have a link please? I've been to the unicode site before and all I remember was being confused and having a hard time finding the info I wanted :( Also I still am not sure exactly what a code point is. And that FAQ at the bottom used the word "surrogate" a lot; I'm not sure about that one either. When you say char[] isn't a nice format, I wasn't thinking about having the string class I mentioned earlier store the data ONLY as char[]. It might be wchar[]. Or dchar[]. Then it would be automatically converted between the two either at compile time (when possible) or dynamically at runtime (hopefully only when needed). 
So if someone throws a Chinese character literal at it, there is a very big clue there to use UTF32 or something that can store all of the characters in a uniform width sort of way, to speed indexing. Algorithms could be used so that a program 'learns' at runtime what kind of strings are dominating the program, and uses algorithms optimized for those. Maybe this is a bit too complex, but I can dream, hehe.My impression has gone from being quite scared of UTF to being not so worried, but only for myself. D seems to be good at handling UTF, but only if someone tells you to never handle strings as arrays of characters. Unfortunately, the first thing you see in a lot of D programs is "int main( char[][] args )" and there are some arrays of characters being used as strings. This also means that some array capabilities like indexing and the braggable slicing are more dangerous than useful for string handling. It's a newbie trap. Like I said earlier, I either want to be able to index/slice strings safely, or not at all (or better yet, not by any intuitive means).If nothing was done about this and I absolutely needed UTF support, I'd probably make a class like so: [...]In my own mock String class, I cached the dchar[] codepoints on demand. (viewable at http://www.algonet.se/~afb/d/dcaf/html/class_string.html)All in all it is a drag that we should have to learn all of this UTF stuff. I want char[] to just work!Using Unicode strings and characters does require a little learning... (where http://www.unicode.org/faq/utf_bom.html is a very good page) And D does force you to think about string implementation, no question. This has both pros and cons, but it is a deliberate language decision. If you're willing to handle the "surrogates", then UTF-16 is a rather good trade-off between the default UTF-8 and wasteful UTF-32 formats ? A downside is that it is not "ascii-compatible" (has embedded NUL chars) and that it is endian-dependant unlike the more universal UTF-8 format. 
--anders
Sep 30 2006
Chad J > wrote:I read that FAQ at the bottom of this post, and didn't see anything about accessing strings by codepoint. Maybe you mean a different FAQ here, in which case, could I have a link please? I've been to the unicode site before and all I remember was being confused and having a hard time finding the info I wanted :(Also I still am not sure exactly what a code point is. And that FAQ at the bottom used the word "surrogate" a lot; I'm not sure about that one either.Code point is the closest thing to a "character", although it might take more than one Unicode code point to represent a single Unicode grapheme. Surrogates are used with UTF-16, to represent "too large" code points... i.e. they always occur in "surrogate pairs", which combine to a single code point.When you say char[] isn't a nice format, I wasn't thinking about having the string class I mentioned earlier store the data ONLY as char[]. It might be wchar[]. Or dchar[]. Then it would be automatically converted between the two either at compile time (when possible) or dynamically at runtime (hopefully only when needed). So if someone throws a Chinese character literal at it, there is a very big clue there to use UTF32 or something that can store all of the characters in a uniform width sort of way, to speed indexing. Algorithms could be used so that a program 'learns' at runtime what kind of strings are dominating the program, and uses algorithms optimized for those. Maybe this is a bit too complex, but I can dream, hehe.Actually I said that dchar[] (i.e. UTF-32) wasn't ideal, but anyway... (UTF-8 or UTF-16 is preferable, for the reasons in the UTF FAQ above) We already have char[] as the string default in D, but most models for a String class use wchar[] (i.e.
UTF-16), for instance Mango or Java: * http://mango.dsource.org/classUString.html (uses the ICU lib) * http://java.sun.com/j2se/1.5.0/docs/api/java/lang/String.html All formats do use Unicode, so converting from one UTF to another is mostly a question of memory/performance and not about any data loss. However, it is not converted at compile time (without using templates) so mixing and matching different representations is somewhat of a pain. I think that char[] for string and wchar[] for String are good defaults.My impression has gone from being quite scared of UTF to being not so worried, but only for myself. D seems to be good at handling UTF, but only if someone tells you to never handle strings as arrays of characters. Unfortunately, the first thing you see in a lot of D programs is "int main( char[][] args )" and there are some arrays of characters being used as strings. This also means that some array capabilities like indexing and the braggable slicing are more dangerous than useful for string handling. It's a newbie trap.It is, since it isn't really "arrays of characters" but "arrays of code units". What muddies the waters further is that sometimes they're equal. That is, with ASCII characters each character fits into a a D char unit. Without surrogates, each character (from BMP) fits into one wchar unit. However, all code that handles the shorter formats should be prepared to handle non-ASCII (for UTF-8) and surrogates (for UTF-16), or use UTF-32: bool isAscii(char c) { return (c <= 0x7f); } bool isSurrogate(wchar c) { return (c >= 0xD800 && c <= 0xDFFF); } But a warning that D uses multi-byte strings might be in order, yes... Another warning that it only supports UTF-8 platforms* might also be ? --anders * "main(char[][] args)" does not work for any non-UTF consoles, as you will get invalid UTF sequences for the non-ASCII chars.
Oct 01 2006
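The two helper predicates from the message above can be exercised like this (a sketch; the byte values follow directly from the ASCII and UTF-16 definitions being discussed):

```d
// restated from the message above so the sketch is self-contained
bool isAscii(char c) { return (c <= 0x7f); }
bool isSurrogate(wchar c) { return (c >= 0xD800 && c <= 0xDFFF); }

void main()
{
    char[] s = "Bö";
    assert(isAscii(s[0]));       // 'B' is a single ASCII byte
    assert(!isAscii(s[1]));      // first byte of the two-byte 'ö'

    wchar[] w = "\U0001D11E";    // needs a surrogate pair in UTF-16
    assert(isSurrogate(w[0]));
    assert(isSurrogate(w[1]));
}
```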
Chad J > wrote:
char[] data;

dchar opIndex( int index ) {
    foreach( int i, dchar c; data ) {
        if ( i == index ) return c;
        i++;
    }
}
This code probably does not work as you think it does... If you loop through a char[] using dchars (with a foreach), then the int will get the codeunit index - *not* codepoint. (the ++ in your code above looks more like a typo though, since it needs to *either* foreach i, or do it "manually")

import std.stdio;

void main()
{
    char[] str = "Björklund";
    foreach(int i, dchar c; str)
    {
        writefln("%4d \\U%08X '%s'", i, c, ""d ~ c);
    }
}

Will print the following sequence:

   0 \U00000042 'B'
   1 \U0000006A 'j'
   2 \U000000F6 'ö'
   4 \U00000072 'r'
   5 \U0000006B 'k'
   6 \U0000006C 'l'
   7 \U00000075 'u'
   8 \U0000006E 'n'
   9 \U00000064 'd'

Notice how the non-ASCII character takes *two* code units ? (if you expect indexing to use characters, that'd be wrong) More at http://prowiki.org/wiki4d/wiki.cgi?CharsAndStrs --anders
Sep 29 2006
Anders F Björklund wrote:Chad J > wrote:Ah. And yep, the i++ was a typo (oops). So maybe something like:

dchar opIndex( int index ) {
    int i;
    foreach( dchar c; data ) {
        if ( i == index ) return c;
        i++;
    }
}

The i is no longer the foreach's index, so the i++ isn't a typo anymore. Thanks for the info. I'll check out that FAQ a little later, gotta go.
char[] data;

dchar opIndex( int index ) {
    foreach( int i, dchar c; data ) {
        if ( i == index ) return c;
        i++;
    }
}
This code probably does not work as you think it does... If you loop through a char[] using dchars (with a foreach), then the int will get the codeunit index - *not* codepoint. (the ++ in your code above looks more like a typo though, since it needs to *either* foreach i, or do it "manually")

import std.stdio;

void main()
{
    char[] str = "Björklund";
    foreach(int i, dchar c; str)
    {
        writefln("%4d \\U%08X '%s'", i, c, ""d ~ c);
    }
}

Will print the following sequence:

   0 \U00000042 'B'
   1 \U0000006A 'j'
   2 \U000000F6 'ö'
   4 \U00000072 'r'
   5 \U0000006B 'k'
   6 \U0000006C 'l'
   7 \U00000075 'u'
   8 \U0000006E 'n'
   9 \U00000064 'd'

Notice how the non-ASCII character takes *two* code units ? (if you expect indexing to use characters, that'd be wrong) More at http://prowiki.org/wiki4d/wiki.cgi?CharsAndStrs --anders
Sep 29 2006
Chad J > wrote:I will go ahead and say that the current state of char[] is incorrect. That is, if you write a program manipulating char[] strings, then run it in China, you will be disappointed with the results. It won't matter how fast the program runs, because bad stuff will happen like entire strings becoming unreadable to the user.Wrong. And that's precisely what I meant about the Daddy holding bike allegory a few messages back. The current system seems to work "by magic". So, if you do go to China, it'll "just work". At this point you _should_ not believe me. :-) But it still works. --- The secret is, there actually is a delicate balance between UTF-8 and the library string operations. As long as you use library functions to extract substrings, join or manipulate them, everything is OK. And very few of us actually either need to, or see the effort of, bit-twiddling individual octets in these "char" arrays. So things just keep on working. --- Not convinced yet? Well, a lot of folks here are from Europe, and our languages contain "non-ASCII" characters. Our text manipulating programs still work all right. And, actually, D is pretty popular in Japan. Every once in a while some Japanese guys pop on-and-off here, and some of them don't even speak English, so they use a machine translator(!) to talk with us. Just guess if they use ASCII in their programs. And you know what, most of these guys even use their own characters for variable names in D! And not one of them has complained about "disappointing results". --- That's why I continued with: keep your eyes shut and keep on coding.
Sep 29 2006
Georg Wrede wrote:The secret is, there actually is a delicate balance between UTF-8 and the library string operations. As long as you use library functions to extract substrings, join or manipulate them, everything is OK. And very few of us actually either need to, or see the effort of bit-twiddling individual octets in these "char" arrays.But this is what I'm talking about... you can't slice them or index them. I might actually index a character out of an array from time to time. If I don't know about UTF, and I do just keep on coding, and I do something like this:

char[] str = "some string in nonenglish text";
for ( int i = 0; i < str.length; i++ )
{
    str[i] = doSomething( str[i] );
}

and this will fail right? If it does fail, then everything is not alright. You do have to worry about UTF. Someone has to tell you to use a foreach there.
Sep 29 2006
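One way to repair the loop above without a String class is to decode, transform, and re-encode; a sketch assuming std.utf.encode from D1-era Phobos, with doSomething as a hypothetical per-character transform standing in for whatever the loop body does:

```d
import std.utf;

// hypothetical transform, standing in for the doSomething in the message
dchar doSomething(dchar c)
{
    return c;
}

void main()
{
    char[] str = "some string in nönenglish text";

    // decode to code points, transform, and encode back, instead of
    // assigning through str[i] one byte at a time
    char[] result;
    foreach (dchar c; str)
        encode(result, doSomething(c));

    assert(result == str);   // the identity transform round-trips intact
}
```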
Chad J > wrote:Georg Wrede wrote:Yes. That's why I talked about you falling down once you realise Daddy's not holding the bike. Part of UTF-8's magic lies in that it is amazingly easy to get working smoothly with truly minor tweaks to "formerly ASCII-only" libraries -- so that even the most exotic languages have no problem. Your concerns about the for loop are valid, and expected. Now, IMHO, the standard library should take care of "all" the situations where you would ever need to split, join, examine, or otherwise use strings, "non-ASCII" or not. (And I really have no complaint (Walter!) about this.) Therefore, in no normal circumstances should you have to twiddle them yourself -- unless. And this "unless" is exactly why I'm unhappy with the situation, too. Problem is, _technology_wise_ the existing setup may actually be the best, both considering ease of writing the library, ease of using it, robustness of both the library and users' code, and the headaches saved from programmers who either haven't heard of the issue (whether they're American or Chinese!), or who simply trust their lives with the machinery. So, where's the actual problem??? At this point I'm inclined to say: the documentation, and the stage props! The latter meaning: exposing the fact that our "strings" are just arrays is psychologically wrong, and even more so is the fact that we're shamelessly storing entities of variable length in arrays which have no notion of such -- even worse, while we brag with slices! If this had been a university course assignment, we'd have been thrown out of class, for both half baked work, and for arrogance towards our client, victimizing the coder. The former meaning: we should not be like "we're bad enough to overtly use plain arrays for variable-length data, now if you have a problem with it, then go home and learn stuff, or then just trust us".
Both "documentation" and "stage props" ultimately meaning that the largest problem here is psychology, pedagogy, and education. --- A lot would already be won by: merely aliasing char[] to string, and discouraging other than guru-level folks from screwing with their internals. This alone would save a lot of Fear, Uncertainty and D-phobia. The documentation should take pains in explaining up front that if you _really_ want to do Character-by-Character ops _and_ you live outside of America, then the Right way to do it (ehh, actually the Canonical Way), is to first convert the string to dchar[]. Period. Then, if somebody else knows enough of UTF-8 and knows he can handle bit twiddling more efficiently than using the Canonical Way, with plain char[] and "foreignish", then let him. But let that be undocumented and Un-Discussed in the docs. Precisely like a lot of other things are. (And should be.) And will be. He's on his own, and he ought to know it. --- In other words, the normal programmer should believe he's working with black-box Strings, and he will be happy with it. That way he'll survive whether he's in Urduland or Boise, Idaho -- without neither ever needing to have heard about UTF nor other crap. Not until in Appendix Z of the manual should we ever admit that the Emperor's Clothes are just plain arrays, and we apologize for the breach of manners of storing variable length data in simple naked arrays. And here would be the right place to explain how come this hasn't blown up in our faces already. And, exactly how you'll avoid it too. (This _needs_ to contain an adequate explanation about the actual format of UTF-8.) --- TO RECAP The _single_ biggest strings-related disservice to our pilgrims is to lead them to believe, that D stores strings in something like utf8[] internally. Now that's an oxymoron, if I ever saw one. (If utf8[] was _actually_ implemented, it would probably have to be an alias of char[][]. Right? Right? 
What we have instead is ubyte[], which is _not_ the same as utf8[].) (Oh, and if it ever becomes obvious that not _everybody_ understood this, then that in itself simply proves my point here.) (*1) And the fault lies in the documentation, not the implementation! This results in braincell-hours wasted, precisely as much as everybody has to waste them, before they realise that the acronym RAII is a filthy lie. Akin only to the former "German _Democratic_ Republic". Only a politician should be capable of this kind of deception. Ok, nobody is doing it on purpose. Things being too clear to oneself often result in difficulties to find ways to express them to new people. (Happens every day at the Math department! :-( ) And since all in-the-know are unable to see it, and all not-in-the-know are too, then both groups might think it's the thing itself that is "the problem", and not merely the chosen _presentation_ of it. Sorry for sounding Righteous, arrogant and whatever. But this really is a 5 minute thing for one person to fix for good, while it wastes entire days or months _per_person_, from _every_ non-defoiled victim who approaches the issue. Originally I was one of them: hence the aggression. ------------------------------------------- (*1) Even I am not simultaneously both literally and theoretically right here. Those who saw it right away, probably won't mind, since it's the point that is the issue here. Now, having to write this disclaimer, IMHO simply again underlines the very point attempted here.The secret is, there actually is a delicate balance between UTF-8 and the library string operations. As long as you use library functions to extract substrings, join or manipulate them, everything is OK. And very few of us actually either need to, or see the effort of bit-twiddling individual octets in these "char" arrays.But this is what I'm talking about... you can't slice them or index them. I might actually index a character out of an array from time to time.
If I don't know about UTF, and I do just keep on coding, and I do something like this:

char[] str = "some string in nonenglish text";
for ( int i = 0; i < str.length; i++ )
{
    str[i] = doSomething( str[i] );
}

and this will fail right? If it does fail, then everything is not alright. You do have to worry about UTF. Someone has to tell you to use a foreach there.
Sep 29 2006
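Georg's "Canonical Way" of converting to dchar[] before character-level work, then converting back, can be sketched with std.utf's toUTF32/toUTF8 (assuming D1-era Phobos):

```d
import std.utf;

void main()
{
    char[] str = "Björklund";

    // one array element per code point, so indexing and slicing are safe
    dchar[] s32 = toUTF32(str);
    assert(s32[2] == 'ö');
    assert(s32.length == 9);

    // convert back once the character-level work is done
    char[] back = toUTF8(s32);
    assert(back == str);
    assert(back.length == 10);   // 'ö' is two bytes again in UTF-8
}
```

The round trip costs an allocation and a pass over the data, which is exactly the overhead the thread weighs against char[]'s compactness.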
Chad J > wrote:But this is what I'm talking about... you can't slice them or index them. I might actually index a character out of an array from time to time. If I don't know about UTF, and I do just keep on coding, and I do something like this: char[] str = "some string in nonenglish text"; for ( int i = 0; i < str.length; i++ ) { str[i] = doSomething( str[i] ); } and this will fail right? If it does fail, then everything is not alright. You do have to worry about UTF. Someone has to tell you to use a foreach there.Yes, you do have to be aware of it being UTF, just like in C you have to be aware that strings are 0 terminated. But once aware of it, there is plenty of support for it in the core language and in std.utf. You can also simply use dchar[], which has a one to one mapping between characters and indices, if you prefer. Contrast that with C++, which has no usable or portable support for UTF-8, UTF-16, or any Unicode. All your carefully coded use of std::string needs to be totally scrapped and redone with your own custom classes, should you decide your app needs to support unicode. You can also wrap char[] inside a class that provides a view of the data as if it were dchar's. But I don't think the performance of such a class would be competitive. Interestingly, it turns out that most string operations do not need to be concerned with the number of char's in a character (like "find this substring"), and forcing them to care just makes for inefficiency.
Sep 29 2006
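Walter's point that substring search needs no code-point awareness follows from UTF-8 being self-synchronizing: a multi-byte sequence can never match starting in the middle of another character. A sketch with std.string.find (D1-era Phobos assumed):

```d
import std.string;

void main()
{
    char[] s = "Björklund";

    // find() compares raw bytes, yet multi-byte needles still match correctly
    assert(find(s, "rk") == 4);   // byte index: 'ö' counted as two bytes
    assert(find(s, "ö") == 2);
    assert(find(s, "xyz") == -1); // not found
}
```

Note the result is a code unit (byte) index, which is fine for feeding back into slices of the same char[].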
Walter Bright wrote:Contrast that with C++, which has no usable or portable support for UTF-8, UTF-16, or any Unicode. All your carefully coded use of std::string needs to be totally scrapped and redone with your own custom classes, should you decide your app needs to support unicode.As long as you're aware that you are working in UTF-8 I think std::string could still be used. It just may be strange to use substring searches to find multibyte characters with no built-in support for dchar-type searching.You can also wrap char[] inside a class that provides a view of the data as if it were dchar's. But I don't think the performance of such a class would be competitive. Interestingly, it turns out that most string operations do not need to be concerned with the number of char's in a character (like "find this substring"), and forcing them to care just makes for inefficiency.Yup. I realized this while working on array operations and it came as a surprise--when I began I figured I would have to provide overloads for char strings, but in most cases it simply isn't necessary. Sean
Sep 30 2006
Sean Kelly wrote:Walter Bright wrote:It's so broken that there are proposals to reengineer core C++ to add support for UTF types.

1) implementation-defined whether a char is signed or unsigned, so you've got to cast the result of any string[i]
2) none of the iteration, insertion, appending, etc., operations can handle multibyte
3) no UTF conversion or transliteration
4) C++ source text encoding is implementation-defined, so no using UTF characters in source code (have to use \u or \U notation)

Contrast that with C++, which has no usable or portable support for UTF-8, UTF-16, or any Unicode. All your carefully coded use of std::string needs to be totally scrapped and redone with your own custom classes, should you decide your app needs to support unicode.As long as you're aware that you are working in UTF-8 I think std::string could still be used. It just may be strange to use substring searches to find multibyte characters with no built-in support for dchar-type searching.
Sep 30 2006
Walter Bright wrote:Sean Kelly wrote:Oops, forgot about this.Walter Bright wrote:It's so broken that there are proposals to reengineer core C++ to add support for UTF types. 1) implementation-defined whether a char is signed or unsigned, so you've got to cast the result of any string[i]Contrast that with C++, which has no usable or portable support for UTF-8, UTF-16, or any Unicode. All your carefully coded use of std::string needs to be totally scrapped and redone with your own custom classes, should you decide your app needs to support unicode.As long as you're aware that you are working in UTF-8 I think std::string could still be used. It just may be strange to use substring searches to find multibyte characters with no built-in support for dchar-type searching.2) none of the iteration, insertion, appending, etc., operations can handle multibyteTrue. And I hinted at this above.3) no UTF conversion or transliteration 4) C++ source text encoding is implementation-defined, so no using UTF characters in source code (have to use \u or \U notation)Personally, I see this as a language deficiency more than a deficiency in std::string. std::string is really just a vector with some search capabilities thrown in. It's not that great for a string class, but it works well enough as a general sequence container. And it will work a tad better once they impose the same data contiguity guarantee that vector has (I believe that's one of the issues set to be resolved for 0x). Overall, I do agree with you. Though I suppose that's obvious as I'm a former C++ advocate who now uses D quite a bit :-) Sean
Oct 01 2006
Sean Kelly wrote:
>> 3) no UTF conversion or transliteration 4) C++ source text encoding is implementation-defined, so no using UTF characters in source code (have to use \u or \U notation)
> Personally, I see this as a language deficiency more than a deficiency in std::string.

That's why the proposals to fix it are rewriting some of the *core* C++ language.

> std::string is really just a vector with some search capabilities thrown in.

Another difficulty with it is that it doesn't have a connection with std::vector<char>.

> It's not that great for a string class, but it works well enough as a general sequence container. And it will work a tad better once they impose the same data contiguity guarantee that vector has (I believe that's one of the issues set to be resolved for 0x). Overall, I do agree with you. Though I suppose that's obvious as I'm a former C++ advocate who now uses D quite a bit :-)

:-)
Oct 01 2006
Georg Wrede wrote:
> Wrong. And that's precisely what I meant about the Daddy-holding-the-bike allegory a few messages back. The current system seems to work "by magic". So, if you do go to China, it'll "just work". At this point you _should_ not believe me. :-) But it still works.

But is this not a needless source of confusion, one that could be eliminated by defining char as "big enough to hold a Unicode code point", or something else that eliminates the possibility of incorrectly dividing UTF tokens? I will have to try using char[] with non-ASCII characters, though; I have been using dchar for that up till now.
Sep 29 2006
Johan Granberg wrote:
> Georg Wrede wrote:
>> Wrong. And that's precisely what I meant about the Daddy-holding-the-bike allegory a few messages back. The current system seems to work "by magic". So, if you do go to China, it'll "just work". At this point you _should_ not believe me. :-) But it still works.
> But is this not a needless source of confusion [...] I will have to try using char[] with non-ASCII characters, though; I have been using dchar for that up till now.

You might begin with pasting this and compiling it:

    import std.stdio;
    void main()
    {
        int öylätti;
        int ШеФФ;
        öylätti = 37;
        ШеФФ = 19;
        writefln("Köyhyys 1 on %d ja nöyrä 2 on %d, että näin.", öylätti, ШеФФ);
    }

It will compile, and run just fine. (The source file having been read into DMD as a single big string, and then having gone through comment removal, tokenizing, parsing, lexing, compiling, optimizing, and finally the variable names having found their way into the executable. Even though the front end has been written in D itself, with simply char[] all over the place.)

(Then you might see that the Windows "command prompt" window renders the output wrong, but that's only because Windows itself doesn't handle UTF-8 right in the command window.)

The next thing you might do is write a grep program (one that takes a file as input and writes the matching lines as output). Write the program as if you had never heard this discussion. Then feed it the Kalevala in Finnish, or Mao's Red Book in Chinese. It should still work.

As long as you don't start tampering with the individual octets in strings, you should be just fine. Don't think about UTF and you'll prosper.
Sep 29 2006
On Sat, 30 Sep 2006 03:03:02 +0300, Georg Wrede wrote:
> As long as you don't start tampering with the individual octets in strings, you should be just fine. Don't think about UTF and you'll prosper.

The Build program does lots of 'tampering'. I had to rewrite many standard routines and create some new ones to deal with Unicode characters, because the standard ones just don't work. And Build still fails to do some things correctly (e.g. case-insensitive compares), but that's on the TODO list. I have to think about UTF, because it doesn't work unless I do.

-- 
Derek Parnell
Melbourne, Australia
"Down with mediocrity!"
Sep 29 2006
Derek Parnell wrote:
> On Sat, 30 Sep 2006 03:03:02 +0300, Georg Wrede wrote:
>> As long as you don't start tampering with the individual octets in strings, you should be just fine. Don't think about UTF and you'll prosper.
> The Build program does lots of 'tampering'. I had to rewrite many standard routines and create some new ones to deal with Unicode characters because the standard ones just don't work.

Do you still remember which they were?

> And Build still fails to do some things correctly (e.g. case insensitive compares) but that's on the TODO list.

Yes, case-insensitive compares are difficult if you want to cater for non-ASCII strings. While it may not be unreasonably difficult to get American, European and Russian strings right, there will always be languages and character sets where even the Unicode guys aren't sure what is right. Unfortunately.
Sep 29 2006
Georg Wrede wrote:
> The secret is, there actually is a delicate balance between UTF-8 and the library string operations. As long as you use library functions to extract substrings, join or manipulate them, everything is OK. And very few of us actually either need to, or see the effort of, bit-twiddling individual octets in these "char" arrays. So things just keep on working.

I agree, but I disagree that there is a problem, or that UTF-8 is a bad choice, or that char[] or string should perhaps be called utf8 instead. As a note here, I actually had a page of text localised into Chinese last week - it came back as a UTF-8 text file.

The only thing with UTF-8 is that glyphs aren't represented by a single char. But UTF-16 is no better! And even UTF-32 code points can be combined into a single rendered glyph. So truncating a string at an arbitrary index is not going to slice on a glyph boundary in any encoding.

However, that doesn't mean UTF-8 is ASCII mixed with "garbage" bytes. That "garbage" is a unique series of bytes that represents a code point. This is a property not found in any other encoding. As such, everything works - strstr, strchr, strcat, printf, scanf - for ASCII, normal Unicode, and the "astral planes". It all just works.

The only thing that breaks is if you try to index or truncate the data by hand. But even that mostly works: you can iterate through looking for ASCII sequences, chop out ASCII, and string together more stuff; it all works because you can just ignore the higher-order bytes. Pretty much the only thing that fails is if you say "I don't know what's in the string, but chop it off at index 12".
Sep 29 2006
Geoff Carlton wrote:
> I agree, but I disagree that there is a problem, or that utf-8 is a bad choice, or that perhaps char[] or string should be called utf8 instead. [...] Pretty much the only thing that fails is if you said "I don't know whats in the string, but chop it off at index 12".

Yes.
Sep 29 2006
Georg Wrede wrote:
> Geoff Carlton wrote:
>> But even that mostly works, you can iterate through, looking for ASCII sequences, chop out ASCII and string together more stuff, it all works because you can just ignore the higher order bytes. Pretty much the only thing that fails is if you said "I don't know whats in the string, but chop it off at index 12".
> Yes.

How should we chop strings on character boundaries, then? I have a text-rendering function that uses FreeType, and I want to restrict the width of the rendered string by truncating it (I have to use some sort of search here, binary or linear). Right now I use dchar, but if char is sufficient it would save me conversions all over the place.
Sep 29 2006
Johan Granberg wrote:
> How should we chop strings on character boundaries?

std.utf.toUTFindex() should do the trick.
Sep 30 2006
BCS wrote:
> Why isn't performance a problem? If you are saying that this won't cause performance hits in run times or memory space, I might be able to buy it, but I'm not yet convinced. If you are saying that causing a performance hit in run times or memory space is not a problem... in that case I think you are dead wrong and you will not convince me otherwise. In my opinion, any compiled language should allow fairly direct access to the most efficient practical means of doing something*. If I didn't care about speed and memory I would use some sort of scripting language. A good set of libs should make most of this moot. Leave char as it is and define a typedef, struct or whatever that provides the added functionality that you want.
> * OTOH a language should not mandate code to be efficient at the expense of ease of coding.

I don't think any performance hit will be so big that it causes problems (at most 4x memory and negligible computation overhead). Hope that made clear what I meant.
Sep 29 2006
Johan Granberg wrote:
> BCS wrote:
>> Why isn't performance a problem? [...] If you are saying that causing a performance hit in run times or memory space is not a problem... in that case I think you are dead wrong and you will not convince me otherwise.
> I don't think any performance hit will be so big that it causes problems (max 4x memory and negligible computation overhead). Hope that made clear what I meant.

If you will note, I said nothing about the size of the hit. While some may disagree, I think that any unneeded hit is a problem. One alternative that I could live with would use 4 character types:

    char   one code unit in whatever encoding the runtime uses
    schar  one 8-bit code unit (ASCII or UTF-8)
    wchar  one 16-bit code unit (same as before)
    dchar  one 32-bit code unit (same as before)

(Using the same type for ASCII and UTF-8 may be a problem, but this isn't my field.) The point is that char, wchar and dchar do not represent numbers and should be their own types. This also preserves direct access to 8-, 16- and 32-bit types.
Oct 01 2006
BCS wrote:
> One alternative that I could live with would use 4 character types:
>    char   one code unit in whatever encoding the runtime uses
>    schar  one 8-bit code unit (ASCII or UTF-8)
>    wchar  one 16-bit code unit (same as before)
>    dchar  one 32-bit code unit (same as before)

We have that already:

    ubyte  one code unit in whatever encoding the runtime uses
    char   one 8-bit code unit (ASCII or UTF-8)

There is no support in Phobos for runtime/native encodings, but you can use the "iconv" library to do such conversions.

> (using the same thing for ASCII and UTF-8 may be a problem, but this isn't my field)

All ASCII characters are valid UTF-8 code units, so it's OK.

--anders
Oct 01 2006
Anders F Björklund wrote:
> BCS wrote:
>> One alternative that I could live with would use 4 character types: char, schar, wchar, dchar [...]
> We have that already:
>    ubyte  one code unit in whatever encoding the runtime uses
>    char   one 8-bit code unit (ASCII or UTF-8)

ubyte is an 8-bit unsigned number, not a character encoding.

[after some more reading] I may be just rambling, but... how about having the type of a value denote the encoding? One type for ASCII would only ever store ASCII (UTF-8 is invalid in it); the same for UTF-8, UTF-16 and UTF-32. Direct assignment would be illegal (as with, say, int[] -> Object) or implicitly converted (as with int -> real). Casts would be provided. Indexing would be by code point. Non-array variables would be big enough to store any code point (ASCII -> 8-bit, non-ASCII -> 32-bit). Some sort of "whatever the system uses" data type (a la C's int) could be used for actual output, maybe even escaping anything that won't get displayed correctly.

This all sort of follows the idea of "call it what it is and don't hide the overhead". 1) Characters are a different type of data than numbers (see the threads on bool), and as such that should be reflected in the type system. 2) I have no problem with high-overhead operations as long as I can avoid using them when I don't want to.

> All ASCII characters are valid UTF-8 code units, so it's OK.

But UTF-8 is not ASCII.
Oct 01 2006
BCS wrote:
> I may be just rambling but... how about have the type of the value denote the encoding. One for ASCII would only ever store ASCII (UTF-8 is invalid)

Then all Americans would use that instead of UTF-8. This is natural, since first you code for yourself, later maybe for your boss, etc. And you'd only become aware of any problems when a Latino tries to use his own name José, talk about Motörhead, or Anaïs the fragrance. And the mail and newsreader you wrote in D simply would not work. Guess whether anybody would heed the warning "Only use this new ASCII encoding when you are absolutely positive the program will never encounter a single foreign sentence or letter".

So, better not.

---

D's current setup and documentation encourage this kind of suggestion, and I don't blame you. Things being like they are, a programmer who wants to write a crossword puzzle generator would of course begin with:

    char[20][20] theGrid;

It's a shame that an otherwise so excellent language (plus the wording in its docs) downright leads you to do this. The guy naturally assumes that, D being a "UTF-8" language, this would work even in Chinese. (Hey,

    char[] foo = "José Motörhead from the band Anaïs is on stage!";

works, so why wouldn't theGrid?) Poor guy. I can't blame anyone for then wanting to stay within ASCII for the rest of D's life.
Oct 01 2006
BCS wrote:
> ubyte is an 8 bit unsigned number not a character encoding.

Right, I actually meant ubyte[], but void[] might have been more accurate for representing any (even non-UTF) encoding. (I used ubyte[] in my mapping functions, since they only used legacy 8-bit encodings like "cp1252" or "macroman".)

Re-reading your post, it seems to me that you were more talking about doing an alias for the UTF type most suitable for the OS? I guess UTF-8 would be a good choice if the operating system doesn't use Unicode, since then it'll have to do lookups anyway. Otherwise the existing "wchar_t" isn't bad for such a UTF type; it will be UTF-16 on Windows and UTF-32 on Unix (Linux, Darwin, ...).

> But UTF-8 is not ASCII.

So you would like a char "type" that would only take ASCII? I guess that is *one* way of dealing with it; you could also have a wchar type that wouldn't accept surrogates (BMP only). Then it would be OK to index them by code unit / character... (since each allowed character would fit into one code unit). Sounds a little like signed vs. unsigned integers, actually? Then again, 5 character types is even worse than the 3 we have now.

--anders
Oct 01 2006
Anders F Björklund wrote:
[...]
> Then again, 5 character types is even worse than the 3 now.

The more I think about it, the worse this gets. What I really would like is a system that allows O(1) operations on strings (slice out chars 7 to 27), allows a somewhat compact encoding (8-bit), and allows safe operations on UTF (if I do something dumb, it complains). All at the same time would be nice, but is not needed.

Come to think about it, a lib that does good, FAST conversion between buffers:

    // note: "in" is intentional, it won't allocate anything
    UTF8to16(in char[], in wchar[]);
    UTF8to32(in char[], in dchar[]);
    UTF16to32(in wchar[], in dchar[]);
    ...

would get most of what I want.

<sarcasm> And while I'm at it, I'd like a million bucks please. </sarcasm>
Oct 02 2006