digitalmars.D - strings in D
- Andrew Fedoniouk (20/20) Feb 18 2005 Is there any string class for the D?
- Kris (7/27) Feb 18 2005 You're walking upon graves with that one, Andrew! I'm afraid there's bee...
- John Reimer (14/40) Feb 18 2005 This question has been asked many times in the D groups. If there ever
- Charlie Patterson (4/10) Feb 19 2005 The D newsgroup could probably use a FAQ. I also don't know where the l...
- John Reimer (4/17) Feb 19 2005 Yep, Navigating this newsgroup can be quite a chore.
- Anders F Björklund (12/47) Feb 20 2005 There are several FAQ.
- Unknown W. Brackets (17/20) Feb 18 2005 Forgive me, but isn't UCS-2 *essentially* 16 bit Unicode without the bom...
- Andrew Fedoniouk (67/67) Feb 18 2005 Ok. Seems like I did not explain this clearly. Let's try again then from...
- Unknown W. Brackets (33/33) Feb 19 2005 Yes, all true. I know. UCS-2 and UTF-16 are not exactly the same, but
- Derek (57/59) Feb 19 2005 I submit this sample code ...
- Andrew Fedoniouk (44/44) Feb 19 2005 "you're ignoring ISO-8859-2, Shift_JIS, and similar encodings."
- Thomas Kühne (15/15) Feb 19 2005 Andrew Fedoniouk wrote:
- Andrew Fedoniouk (31/31) Feb 19 2005 According to
- Thomas Kühne (37/37) Feb 20 2005 -----BEGIN PGP SIGNED MESSAGE-----
- Anders F Björklund (31/52) Feb 20 2005 Right, if you want to get all technical about it at once. :-)
- Thomas Kühne (19/19) Feb 20 2005 -----BEGIN PGP SIGNED MESSAGE-----
- Anders F Björklund (12/23) Feb 20 2005 Yes, that's what I said :-) (not my fault char[] sounds a lot like char)
- Thomas Kühne (23/23) Feb 19 2005 -----BEGIN PGP SIGNED MESSAGE-----
- Anders F Björklund (28/63) Feb 20 2005 Length in D counts code units. Always. (but yes, an array insert
- Ben Hinkle (6/11) Feb 19 2005 foreach already iterates over code points. Try something like
- Anders F Björklund (22/32) Feb 20 2005 There is no built-in (Phobos) class, as reasoned in:
- Ben Hinkle (19/27) Feb 20 2005 A while ago I posted some tiny helper functions to do on-the-fly charact...
Is there any string class for D? Or are there any plans to create a string type for D?

char[], wchar[] and dchar[] cannot serve string purposes, as they use UTF encodings, which are "transport" encodings and cannot be used in most cases as strings.

A string as an entity is a sequence of "code points" - ASCII, UCS-2 (Basic Multilingual Plane) or UCS-4 - so operator[] always returns a character in full (for the given supported plane). The same should apply to foreach().

I personally would like to see something similar to Java strings (UCS-2) with methods like fromByteArray(encoding), fromUtf8(), etc. Such strings should probably use a copy-on-write implementation.

I think that UCS-2 (unsigned word) as a string character would be enough for all active languages.

Any other ideas, gentlemen?

Andrew Fedoniouk.
http://terrainformatica.com
Feb 18 2005
In article <cv6d5q$19al$1 digitaldaemon.com>, Andrew Fedoniouk says...

> Is there any string class for the D? ...

You're walking upon graves with that one, Andrew! I'm afraid there's been a lot of conflicting opinion around that particular subject. Best bet is to get hold of a 'non-standard' library for such things, and go from there. The mango.icu package is a wrapper around the extensive ICU project, and may suit your needs ~ you can find that over at dsource.org:

http://dsource.org/forums/viewtopic.php?t=420
Feb 18 2005
On Fri, 18 Feb 2005 19:52:24 -0800, Andrew Fedoniouk wrote:

> Is there any string class for the D? ...

This question has been asked many times in the D groups. If there ever were a "big three" in the D debates department, I think this one would rank as one of them.

From what I gather, the opinions have settled into three groups:

1) Those that want a String class in D and think it is a critical addition to the language.
2) Those that consider a String class contrary to the D methodology; they think char[], wchar[] and dchar[] are sufficient.
3) Those that think a String class could be a useful addition, but that it should be added to D for optional use.

If you do a search of this newsgroup and the old D newsgroup, I think you'll find how big the discussion has been!

- John R.
Feb 18 2005
"John Reimer" <brk_6502 yahoo.com> wrote in message news:pan.2005.02.19.05.02.08.170345 yahoo.com...

> On Fri, 18 Feb 2005 19:52:24 -0800, Andrew Fedoniouk wrote:
>> Is there any string class for the D? ...
> This question has been asked many times in the D groups. If there ever were a "big three" in the D debates department, I think this one would rank as one of them.

The D newsgroup could probably use a FAQ. I also don't know where the land mines are buried!
Feb 19 2005
On Sat, 19 Feb 2005 10:54:44 -0500, Charlie Patterson wrote:

> "John Reimer" <brk_6502 yahoo.com> wrote in message news:pan.2005.02.19.05.02.08.170345 yahoo.com...
>> On Fri, 18 Feb 2005 19:52:24 -0800, Andrew Fedoniouk wrote:
>>> Is there any string class for the D? ...
>> This question has been asked many times in the D groups. If there ever were a "big three" in the D debates department, I think this one would rank as one of them.
> The D newsgroup could probably use a FAQ. I also don't know where the land mines are buried!

Yep, navigating this newsgroup can be quite a chore. I'm not sure, but I thought the D wiki site had some references to these topics. Justin Calvarese would probably know.
Feb 19 2005
John Reimer wrote:

>> The D newsgroup could probably use a FAQ. I also don't know where the land mines are buried!
> Yep, Navigating this newsgroup can be quite a chore. I'm not sure, but I thought the D wiki site has some references to these topics. Justin Calvarese would probably know.

There are several FAQs:

http://www.digitalmars.com/d/faq.html (Official FAQ)
http://int19h.tamb.ru/faq.html (Unofficial FAQ)
http://www.prowiki.org/wiki4d/wiki.cgi?FaqRoadmap

But you might be looking for simple things like:
http://www.prowiki.org/wiki4d/wiki.cgi?ShortFrequentAnswers

> Strings are not null-terminated but hold explicit length information. Therefore you need to use %.*s not %s in printf, or just use writef!

> Comparing an object reference like: "if (object == null)" will crash. You must use "if (object is null)"

> Checking for a key in an AA like: "if(array[key])" will create it if it's missing. You must use "if(key in array)"

Or just a quick summary, like I posted earlier:
http://www.digitalmars.com/drn-bin/wwwnews?digitalmars.D/12609

> Q: What's the default boolean type in D ?
> A: bit. (bool is an "alias")
> Q: Is that really type-safe ?
> A: No. (just as in C99/C++)
> Q: What's the default string type in D ?
> A: char[]. (since main() uses it)
> Q: Is that a single class ?
> A: No. (it's a primitive type)
> Q: Was this done by accident or by choice ?
> A: choice. (by Walter Bright)
> Q: Will this change before D version 1.0 ?
> A: No. (at least unlikely)

At least the String Wars and the Boolean Wars are *over*... And it was char[]/wchar[]/dchar[] and bit/wbit/dbit that won.

--anders
Feb 20 2005
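The three short answers quoted in the post above can be sketched roughly as follows. This is written against a present-day D compiler, which is an assumption: the thread dates from 2005, and as far as I know recent compilers reject "object == null" at compile time and raise a range error for a missing AA key instead of creating it. The names Widget and hasKey are made up for the illustration.

```d
import core.stdc.stdio : printf;
import std.stdio : writefln;

class Widget {}   // hypothetical class, only here to have a reference type

// True if the associative array holds the key; the lookup never adds it.
bool hasKey(int[string] aa, string key)
{
    return (key in aa) !is null;   // `in` yields a pointer to the value, or null
}

void main()
{
    // D strings carry an explicit length and need not be NUL-terminated,
    // so C's printf wants "%.*s" with the length and the pointer.
    string s = "hello";
    printf("%.*s\n", cast(int) s.length, s.ptr);
    writefln("%s", s);             // writef/writefln take D strings directly

    // Compare references with `is`, not `==`.
    Widget w = null;
    assert(w is null);

    int[string] counts;
    assert(!hasKey(counts, "missing"));
    assert(counts.length == 0);    // testing for the key did not create it
    counts["present"] = 1;
    assert(hasKey(counts, "present"));
}
```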
> I think that ucs-2 (unsigned word) as a string character would be enough for all active languages.

Forgive me, but isn't UCS-2 *essentially* 16 bit Unicode without the BOM and maybe a few other things? I may be wrong, but I would think that, if you want that, you can just use dchar[] or even wchar[]...

I'm not saying that strings are or aren't necessary, but if I do this (let's see if I can post unicode on this newsgroup...):

    wchar[] test = "ウェブ全体から検索";
    foreach (wchar c; test)
        writef("%s ", c);

You'll get one iteration for each character (there are nine). Yes, this uses twice the memory, but it gives you the "character in full" you're asking for.

It's no replacement for a string class, and I'm not arguing either way on that, but foreach and [] (called opIndex, I believe, in D) work fine.

As for byte conversions, you can at least do that with unicode (simple casting between byte[] and char[], etc.) and I'm sure iconv could be useful if you need charset conversion.

-[Unknown]
Feb 18 2005
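A minimal sketch of why the loop in the post above yields nine iterations, and where it stops matching characters. It assumes a current D compiler; countUnits and countPoints are hypothetical helper names. Every character in the Japanese sample lies in the Basic Multilingual Plane, so one wchar holds each of them; a character outside the BMP takes two.

```d
import std.stdio : writeln;

// Number of iterations when walking a UTF-16 string by code unit.
size_t countUnits(const(wchar)[] s)
{
    size_t n;
    foreach (wchar c; s)
        ++n;
    return n;
}

// Number of iterations when foreach decodes to whole code points.
size_t countPoints(const(wchar)[] s)
{
    size_t n;
    foreach (dchar c; s)
        ++n;
    return n;
}

void main()
{
    wstring bmp = "ウェブ全体から検索"w;   // nine characters, all in the BMP
    wstring astral = "A\U00010000"w;     // U+10000 needs a surrogate pair

    writeln(countUnits(bmp), " ", countPoints(bmp));       // 9 9
    writeln(countUnits(astral), " ", countPoints(astral)); // 3 2
}
```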
Ok. Seems like I did not explain this clearly. Let's try again from a different point of view (this time more technical).

A UTF-16 sequence cannot be treated as a UCS-2 sequence (especially in D with its built-in conversion). This is just technically wrong. See:

    word utf16string[] = {
        0x0041, // 'A' - Latin-1
        0x0020, // ' ' - Latin-1
        0xD800, // high-half zone part
        0xDC00, // low-half zone part - value
        0xD800, // high-half zone part
        0xDC01  // value
    };

This example text contains 4 coded characters. The first two are BMP (Basic Multilingual Plane) characters coded with a single UCS-2 (BMP) code value; the last two are non-BMP characters coded with two words each, a high-half code and a low-half code. Translating this to UCS-4 code values would produce the following:

    dword ucs4string[] = {
        0x00000041, // 'A' - Latin-1
        0x00000020, // ' ' - Latin-1
        0x00010000, // hieroglyph foo
        0x00010001  // hieroglyph bar
    };

What is the meaning of strlen() in the utf16string case? 4 or 6?

D thinks that utf16string is a sequence of wchars. I wouldn't say so. These are not characters in the common sense but just parts of a sequence of 16-bit units. You cannot treat them as characters, e.g. you cannot insert a new wchar at position 3 of utf16string.

Only dchar could be considered a real UNICODE character (UCS-4). But modern computers are not ready yet for UCS-4. Too much memory needed.

The practical solution is to use UCS-2 - two-byte characters. (Again, UCS-2 is the BMP, http://www.unicode.org/roadmaps/bmp/, and includes all active languages that civilization is using now for writing.)

    typedef wchar char2;     // new type, ucs-2 codes
    typedef char2[] string2; // brand new type, strict ucs-2 string

Conversion from utf16 wchar[] -> char2[] *must* interpret utf16 pairs (0xD800,0xDC00) and produce *one* char2 code with value '?' (or any other with the meaning "unsupported character"). Thus codes in the range D800 - DBFF *must* not appear in a char2[] string.

As soon as D has built-in conversion routines, the list of character types should look like this:

    char    - element of utf8 sequence.
    char[]  - utf8 encoded unicode sequence.
    wchar   - element of utf16 sequence.
    wchar[] - utf16 encoded unicode sequence.
    dchar   - ucs-4 character. full unicode character.
    dchar[] - ucs-4 string.
    char2   - ucs-2 (BMP) character. codes D800 - DBFF do not represent
              start of UTF16 sequence - do not expand into ucs-4 by system.
    char2[] - ucs-2 string - sequence of characters. Could be manipulated
              arbitrarily, e.g. characters (char2) could be inserted or
              deleted at any given position.

Let me highlight again:

/////
///// elements of utf sequence *are not* characters.
/////

So such functions as strchr(string,char) must be declared either as

    int strchr(char1[], char1 c) // latin-1 string
    --or--
    int strchr(char2[], char2 c) // ucs-2 string and char
    --or--
    int strchr(char4[], char4 c) // ucs-4 string or 'dchar'

This message has one sole reason: to make D close to perfect.

Andrew Fedoniouk.
http://terrainformatica.com
Feb 18 2005
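The "4 or 6?" question from the post above can be checked with Phobos, as a rough sketch assuming a current D compiler and the std.utf functions count, toUTF32 and validate. The string literal encodes the same six UTF-16 code units as the utf16string array.

```d
import std.exception : assertThrown;
import std.utf : UTFException, count, toUTF32, validate;

void main()
{
    // Six UTF-16 code units: 0041 0020 D800 DC00 D800 DC01
    wchar[] utf16 = "A \U00010000\U00010001"w.dup;

    assert(utf16.length == 6);           // .length counts code units
    assert(count(utf16) == 4);           // std.utf.count counts code points
    assert(toUTF32(utf16).length == 4);  // one dchar per code point

    // Inserting a unit at index 3 lands between D800 and DC00 and
    // leaves an ill-formed sequence behind.
    wchar[] broken = utf16[0 .. 3] ~ cast(wchar) 'x' ~ utf16[3 .. $];
    assertThrown!UTFException(validate(broken));
}
```

So both answers are available: 6 is cheap and always at hand, 4 costs a pass over the data.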
Yes, all true. I know. UCS-2 and UTF-16 are not exactly the same, but they are quite similar for many intents and purposes.

Again, you can get the conversion you want (Latin1 -> UCS-4, etc.) using iconv or similar. Even if this was built in, it would have to be done using such a tool or a custom written one - it's not like it's an interrupt call or something :P.

And making a strlen that counts unique characters in a UTF-8/UTF-16/etc. string would be expensive performance-wise. Instead of just giving the array's length, which is lightning quick and very possibly imho why D performs better with string usage, you'd end up traversing the entire string again looking for characters.

Yes, this length could be (I would hope!) cached by the class to improve the speed of sequential strpos's, substr's, etc. But if it had to traverse like that, it would be so much better to use wchar, at least just for textual strings that might contain such characters, because then you could use the speedy method instead of searching the whole string like C did.

Several other languages have these same problems: C, PHP, Perl, SQL, etc. I'm quite sure most people who understand UTF-8 are aware that the number of bits divided by eight may or may not have anything to do with the actual length in characters of the string, though. It's essential - and sometimes, you just have to know it. Not everything can be abstracted to the point where you just type "do my homework" and hit compile...

Still, I don't think, personally, that using a whole bunch of char types would solve this. That's several times uglier than a string class, and since a char array is just an array, there isn't really any clean way to override the .length of it... it'd have to be a class.

And anyway, you're ignoring ISO-8859-2, Shift_JIS, and similar encodings. Why should ISO-8859-1 (Latin1) be special?

Anyway, I can just see a "i18n_length(char[] x)" function.... because sometimes, you really just want the number of bytes, not characters.

-[Unknown]
Feb 19 2005
On Sat, 19 Feb 2005 01:14:46 -0800, Unknown W. Brackets wrote:

[snip]

> Anyway, I can just see a "i18n_length(char[] x)" function.... because sometimes, you really just want the number of bytes, not characters.

I submit this sample code ...

<code>
module i18n;
private import std.utf;
debug(1) private import std.stdio;

// Number of code points in a string, whatever its encoding.
size_t i18n_length( char[] x)  { return toUTF32(x).length; }
size_t i18n_length( wchar[] x) { return toUTF32(x).length; }
size_t i18n_length( dchar[] x) { return x.length; }

unittest
{
    char[]  tchar;
    wchar[] twchar;
    dchar[] tdchar;

    tdchar ~= 0x00000041; // 'A' - Latin-1
    tdchar ~= 0x00000020; // ' ' - Latin-1
    tdchar ~= 0x00010000; // hieroglyph foo
    tdchar ~= 0x00010001; // hieroglyph bar

    twchar = toUTF16(tdchar);
    tchar  = toUTF8(tdchar);

    debug(1) { writefln("dchar.length = %d (%d)", i18n_length(tdchar), tdchar.length); }
    assert( i18n_length(tdchar) == 4);

    debug(1) { writefln("wchar.length = %d (%d)", i18n_length(twchar), twchar.length); }
    assert( i18n_length(twchar) == 4);

    debug(1) { writefln(" char.length = %d (%d)", i18n_length(tchar), tchar.length); }
    assert( i18n_length(tchar) == 4);
}

debug(2)
{
    void main() { }
}
</code>

This can be compiled using "build i18n -debug=2" to generate the unittests, and then run i18n to run the unittests.

Of course, if you want to, you can create a doctored version of toUTFxx that just counts codepoints rather than doing an actual conversion.

--
Derek
Melbourne, Australia
Feb 19 2005
"you're ignoring ISO-8859-2, Shift_JIS, and similar encodings."

Where am I ignoring them?

"Still, I don't think, personally, using a whole bunch of char types ...."

In fact I am not proposing new top level character types. My point is simple: 'string' as an entity (or class) is different from wchar[] - a sequence of UTF-16 units - in terms of the following:

    class string // string which supports only ucs-2 code points
    {
        typedef wchar char2; // ucs-2 code points only.
        char2[] chars;

        this( wchar[] utf16 )
        {
            // thanks to Ben Hinkle
            foreach(dchar cp; utf16)
            {
                if( cp > 0xFFFF )
                    chars ~= cast(char2) '?'; // ignorabimus et ignorabus
                else
                    chars ~= cast(char2) cp;
            }
        }

        int length() { return chars.length; } // as chars ALWAYS contains code points.

        void set(int pos, wchar wc)
        {
            if( wc >= 0xD800 && wc <= 0xDFFF )
                throw new Exception("invalid ucs-2 code point");
            else
                chars[pos] = cast(char2) wc;
        }
    }

AFAIK this approach is used in java.lang.String.

I think that the existing names of entities in D are misleading.

'char' in fact is not a character but an element of a UTF-8 sequence - ubyte.
'wchar' in fact is not a "wide" character but an element of a UTF-16 sequence - ushort.
Only 'dchar' has the meaning of a character.

Keeping this in mind, a declaration like

    wchar a;

is technical nonsense. The way it is implemented now and treated by D, wchar (and char) can be used *ONLY* as members of arrays (in a sequence).
Feb 19 2005
Andrew Fedoniouk wrote:

| I think that existing names of entities in D are misleading.
|
| 'char' in fact is not a character but element of UTF-8 sequence - ubyte.
| 'wchar' in fact is not a "wide" character but element of UTF-16 sequence - ushort.
| and only 'dchar' has meaning of character.

'dchar' is no _character_, it represents a _codepoint_.

While codepoints are interesting for some cases, you are much more likely to
a) treat strings as void[]/byte[]/ubyte[] (most cases)
b) or be interested in graphemes (display/text editing)

http://www.unicode.org/faq/char_combmark.html

Hint: search the digitalmars.D newsgroup archive before posting any more about strings/*chars.

Thomas
Feb 19 2005
According to http://www.prowiki.org/wiki4d/wiki.cgi?UnicodeIssues, which is based on the "digitalmars.D newsgroup archive" I believe, D's 'char' and 'wchar' are not 'characters' as their names state but rather "code units". Right?

And about "code point": in terms of UNICODE, a code point is a number between 0 and 0x10FFFF. To represent these codes (or unicode character indexes) you may use either uint8 (for Latin-1 code points), uint16 (Basic Multilingual Plane codes) or uint21 (full UNICODE range). Some code values in the range from 0 to 0x10FFFF are illegal. E.g. cast(dchar)0xD800 should raise an error in an ideal 'D' world.

If D wants to treat its strings in UTF-8 or UTF-16 form, it should provide the methods recommended by the W3C:
http://www.w3.org/TR/DOM-Level-2-Core/i18n.html

I think that ideally D.char, D.wchar and D.dchar should be treated as code point value storage types and not as code units. This would give some meaning to these type names at least.

String literals should have a 'utf8' type like this:

    typedef ubyte[] utf8;

Intrinsic conversion routines like:

    wchar[] str = "?????? ???"; // utf8 ("Hello World" in Russian)

should create str as a sequence of codepoints, with substitution of values unsupported by wchar with, let's say, 0xFFFF.

The same rule should apply to

    char[] str = "?????? ???"; // utf8

(in this case str will contain ten 0xFF as these are not Latin-1 codes)

Andrew Fedoniouk.
http://terrainformatica.com
Feb 19 2005
Andrew Fedoniouk wrote:

| According to http://www.prowiki.org/wiki4d/wiki.cgi?UnicodeIssues
| which is based on "digitalmars.D newsgroup archive" I believe, D's
| 'char' and 'wchar' are not 'characters' as their names state but
| rather "code units". Right?
|
| And about "code point": in terms of UNICODE code point is a number
| between 0 and 0x10FFFF. To represent this codes (or unicode character
| indexes) from this range you may use either uint8 (for Latin-1 code
| points)

UTF-8 supports all code points. Depending on the value of the codepoint, 1 - 4 chars are required.

| or uint16 (Basic Multilang Plane codes) or uint21 (full UNICODE
| range).

UTF-16 supports all code points. Depending on the value of the codepoint, 1 - 2 wchars are required.

| Some code values from 0 and 0x10FFFF range are illeagal. E.g.
| cast(dchar)0xD800 should rise an error in ideal 'D' world.

The codepoint 0xD800 isn't illegal, it's unassigned and is very likely to remain unassigned in all future Unicode versions. The uint16 0xD800 on its own is illegal, as it is part of a UTF-16 surrogate pair.

| If D wants to treat its strings in UTF8 or UTF16 form it should
| provide methods recommended by W3C:
| http://www.w3.org/TR/DOM-Level-2-Core/i18n.html

findOffset8/16/32 are very simple functions. I'm sure that there is at least one project at dsource.org providing this functionality.

Thomas
Feb 20 2005
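The "1 - 4 chars" and "1 - 2 wchars" figures from the post above can be read off std.utf.codeLength, as a small sketch assuming a current Phobos. The wrappers utf8Units and utf16Units are hypothetical names added for the illustration.

```d
import std.utf : codeLength;

// Code units needed to encode one code point in UTF-8 and in UTF-16.
size_t utf8Units(dchar c)  { return codeLength!char(c); }
size_t utf16Units(dchar c) { return codeLength!wchar(c); }

void main()
{
    assert(utf8Units('A') == 1          && utf16Units('A') == 1);          // ASCII
    assert(utf8Units('\u00E9') == 2     && utf16Units('\u00E9') == 1);     // Latin-1 range
    assert(utf8Units('\u20AC') == 3     && utf16Units('\u20AC') == 1);     // rest of the BMP
    assert(utf8Units('\U00010000') == 4 && utf16Units('\U00010000') == 2); // beyond the BMP
}
```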
Andrew Fedoniouk wrote:

> According to http://www.prowiki.org/wiki4d/wiki.cgi?UnicodeIssues which is based on "digitalmars.D newsgroup archive" I believe, D's 'char' and 'wchar' are not 'characters' as their names state but rather "code units". Right?

Right, if you want to get all technical about it at once. :-)

However, "char" is still a perfectly good *ASCII* character. It's just that the-high-bit-set is now defined, unlike in C... And "wchar" is also *usually* a character (BMP), just like "char" was in Java for a number of years... (they're now using int instead: http://java.sun.com/j2se/1.5.0/docs/api/java/lang/Character.html, which means that D wchar = Java char, D dchar = Java int nowadays)

So they are still "characters"? Just that there are "exceptions" (being the surrogate code units, referring to the next unit in the array). And as long as you watch out for these, it's perfectly OK to use them as good-old-fashioned characters (and it could be faster, too).

> And about "code point": in terms of UNICODE code point is a number between 0 and 0x10FFFF. To represent this codes (or unicode character indexes) from this range you may use either uint8 (for Latin-1 code points) or uint16 (Basic Multilang Plane codes) or uint21 (full UNICODE range). Some code values from 0 and 0x10FFFF range are illegal. E.g. cast(dchar)0xD800 should raise an error in ideal 'D' world.

For reasons of efficiency, D does not check all values upon assignment. You must instead call the Phobos helper function: std.utf.isValidDchar

Note that "char" only holds ASCII in D, wchar must be used for Latin-1.

I suggested adding the new functions isAscii and isSurrogate too, but it was ignored. (They're all copied and pasted at the moment)
http://www.digitalmars.com/d/archives/digitalmars/D/bugs/2154.html

> Intrinsic conversion routines like:
> wchar[] str = "?????? ???"; // utf8 ("Hello World" in Russian)
> should create str as sequence of codepoints with substitution of unsupported values for wchar with lets say 0xFFFF.

Substituting all surrogates with invalid characters will *lose data*. That is clearly not good, and using UTF-8 sounds like a better idea? If you want single-codeunit strings, you can search/replace yourself.

In the example above, the string literal will be converted to UTF-16. (as in: the actual literal data, it will also be '\0'-escaped for C)

> The same rule should apply to
> char[] str = "?????? ???"; // utf8
> (in this case str will contain ten 0xFF as these are not Latin-1 codes)

You can use ubyte[] for storing 8-bit encodings (such as Latin-1, etc.). Using char[] will give "invalid UTF sequence" when encountering high bytes, although the first 0x100 characters are the same in both "sets", that is ISO-8859-1 and Unicode. But only 0x80 of them will fit in a single "char".

Note that (char*) is still used for NUL-terminated 8-bit strings too! This is mostly for making it much simpler to use external C functions, which is the same reason why all D string literals are NUL-terminated.

--anders
Feb 20 2005
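Two points from the post above, sketched against a current Phobos (an assumption): checking a value with std.utf.isValidDchar, and keeping Latin-1 data in ubyte[] until it is re-encoded. The function latin1ToUtf8 is a hypothetical helper, not something from Phobos; it relies on each ISO-8859-1 byte value being equal to its Unicode code point.

```d
import std.utf : encode, isValidDchar;

// Re-encode ISO-8859-1 bytes as UTF-8.
char[] latin1ToUtf8(const(ubyte)[] latin1)
{
    char[] utf8;
    foreach (b; latin1)
        encode(utf8, cast(dchar) b);   // appends one or two code units
    return utf8;
}

void main()
{
    // Assignment does not validate; the check has to be asked for.
    assert(isValidDchar('A'));
    assert(!isValidDchar(cast(dchar) 0xD800));    // surrogate value
    assert(!isValidDchar(cast(dchar) 0x110000));  // beyond U+10FFFF

    ubyte[] latin1 = [0x63, 0x61, 0x66, 0xE9];    // "café" in ISO-8859-1
    char[] utf8 = latin1ToUtf8(latin1);
    assert(utf8 == "caf\u00E9");
    assert(utf8.length == 5);                     // 0xE9 became two code units
}
```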
Anders F Björklund wrote:

| For reasons of efficiency, D does not check all values upon
| assignment. You must instead call the Phobos helper function:
| std.utf.isValidDchar
|
| Note that "char" only holds ASCII in D, wchar must be used for Latin-1.

clarification

char: can only hold 0x00 -> 0x80, otherwise it's an illegal UTF-8 fragment
char[]/char*: can hold any Unicode codepoint/codepoint sequence

Thomas
Feb 20 2005
Thomas Kühne wrote:

> | Note that "char" only holds ASCII in D, wchar must be used for Latin-1.
>
> clarification
> char: can only hold 0x00 -> 0x80, otherwise it's an illegal UTF-8 fragment

Yes, that's what I said :-) (not my fault char[] sounds a lot like char)

And that should probably be 0x00-0x7F, or 0x00..0x80 in exclusive style? We mean the same thing, the 7-bit ASCII subset of ISO-8859-1 and UTF-8.
(as in the table: http://www.algonet.se/~afb/d/latin1/iso-8859-1.html)

    TYPE    ALIAS     // RANGE
    char    utf8_t    // \x00-\x7F (ASCII)
    wchar   utf16_t   // \u0000-\uD7FF, \uE000-\uFFFF
    dchar   utf32_t   // \U00000000-\U0010FFFF (Unicode)

66 codepoints are invalid "noncharacters", but that's beside the point.

The code unit arrays char[]/wchar[]/dchar[] can all hold any UTF string, but only "dchar" is fully standalone for all different codepoint values.

This does not stop "char" and "wchar" from being useful for loops and other special uses, just as long as the limitations are being accounted for?

--anders
Feb 20 2005
Andrew Fedoniouk wrote:

| I think that existing names of entities in D are misleading.
|
| 'char' in fact is not a character but element of UTF-8 sequence - ubyte.
| 'wchar' in fact is not a "wide" character but element of UTF-16 sequence - ushort.
| and only 'dchar' has meaning of character.

'dchar' is no _character_, it represents a _codepoint_.

While codepoints are interesting for some cases, you are much more likely to
a) treat strings as void[]/byte[]/ubyte[] (most cases)
b) or be interested in graphemes (display/text editing)

http://www.unicode.org/faq/char_combmark.html

Hint: search the digitalmars.D newsgroup archive before posting any more about strings/*chars.

Thomas
Feb 19 2005
Andrew Fedoniouk wrote:

> Ok. Seems like I did not explain this clearly. Let's try again then from different point of view (this time more technical).

[...]

> What is the meaning of strlen() in utf16string case? 4 or 6?
> D thinks that utf16string is sequence of wchars. I wouldn't say so. These are not characters in common sense but just parts of the sequence of 16bit units. You cannot treat them as characters e.g. you cannot insert new wchar at position 3 of utf16string.

Length in D counts code units. Always. (but yes, an array insert operation only gives useful results when there are no surrogates) As has been said, counting code units is a lot faster than counting codepoints.

> Only dchar could be considered as a real UNICODE character (UCS-4). But modern computers are not ready yet for UCS-4. Too much memory needed.

dchar is quite alright for use in parameters and such, since the registers are 32 bits wide anyway. For string storage, I agree... UTF-32 wastes too much space, and UTF-16 or even UTF-8 is better.

> As soon as D has built-in conversion routines then list of character types should look like as:
>
> char - element of utf8 sequence.
> char[] - utf8 encoded unicode sequence.
> wchar - element of utf16 sequence.
> wchar[] - utf16 encoded unicode sequence.
> dchar - ucs-4 character. full unicode character.
> dchar[] - ucs-4 string.
> char2 - ucs-2 (BMP) character. codes D800 - DBFF do not represent start of UTF16 sequence - do not expand into ucs-4 by system.
> char2[] - ucs-2 string - sequence of characters. Could be manipulated arbitrarily, e.g. characters (char2) could be inserted or deleted at any given position.

As you've discovered, D "only" concerns itself with UTF code units... (dchar is of the UTF-32 subset instead of the full ucs-4, but anyway)

This means that if you want to handle arrays of Latin-1 characters or arrays of BMP characters, you cannot use the "character" types. However, you are free to use the ubyte and ushort types to represent those types of strings (that are still Unicode, encoded differently).

But there is really not much use in introducing two new types just to represent those two special cases of the more general UTF ones?

For ASCII (only), char[] and ubyte[] with ISO-8859-1 would be the same. Just as for non-surrogates (only), wchar[] and ushort[] are identical. But the latter two types would be unable to handle higher code points.

Converting between the two is trivial, but there could be a loss of data when going from char[] -> ubyte[], or from wchar[] -> ushort[] (e.g. if replacing any surrogates with something like \xFF or \uFFFF).

And I think it's better to go with the lossless format than to support the rare operation of indexing individual codepoints... (and in case you need to do this often, there's still dchar[])

> Let me highlight again:
> /////
> ///// elements of utf sequence *are not* characters.
> /////
> So such functions as strchr(string,char) must be declared either as
> int strchr(char1[], char1 c) // latin-1 string
> --or--
> int strchr(char2[], char2 c) // ucs-2 string and char
> --or--
> int strchr(char4[], char4 c) // ucs-4 string or 'dchar'
>
> This message has one sole reason: to make D close to perfect.

int strchr(char[], dchar c) would also work...
(it would return the *start* of the 1-4 code units)

--anders
Feb 20 2005
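The strchr(char[], dchar) variant mentioned at the end of the post above could look roughly like this. It is only a sketch assuming a current D compiler: findChar is a hypothetical name, and Phobos has its own search functions that may be preferable in real code.

```d
// Search a UTF-8 string for one code point and return the index of the
// first code unit of its encoding, or -1 if it does not occur.
ptrdiff_t findChar(const(char)[] s, dchar c)
{
    foreach (i, dchar cp; s)       // cp is decoded; i is the offset where it starts
        if (cp == c)
            return cast(ptrdiff_t) i;
    return -1;
}

void main()
{
    auto s = "a\u00E9z";                 // 'a', e-acute (two code units), 'z'
    assert(findChar(s, '\u00E9') == 1);
    assert(findChar(s, 'z') == 3);       // a code-unit offset, not a character count
    assert(findChar(s, 'x') == -1);
}
```

Returning a code-unit offset keeps the result directly usable for slicing the original char[], which a character count would not be.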
> String as an entity is a sequence of "code points" - ascii, ucs-2(basic multilang plane) and ucs-4 so operator[] always returns character in full (for the given supported plane). The same should apply to foreach().

foreach already iterates over code points. Try something like

    char[] str = ...some non-ascii string...
    foreach(int n, dchar cp; str) {
        .. cp is the codepoint that starts at index n of str ...
    }

-Ben
Feb 19 2005
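A runnable variant of the loop from the post above, as a sketch assuming current D semantics, where the index variable of a decoding foreach is the code-unit offset at which each code point starts. The helper name startOffsets is made up for the illustration.

```d
import std.stdio : writeln;

// Collect the code-unit offset at which each code point of s starts.
size_t[] startOffsets(const(char)[] s)
{
    size_t[] offsets;
    foreach (i, dchar cp; s)   // foreach decodes UTF-8 into whole code points
        offsets ~= i;
    return offsets;
}

void main()
{
    // 'a' (1 unit), U+00E9 (2 units), U+10000 (4 units), 'z' (1 unit)
    auto s = "a\u00E9\U00010000z";
    writeln(startOffsets(s));  // [0, 1, 3, 7]
}
```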
Andrew Fedoniouk wrote:

> Is there any string class for the D?

There is no built-in (Phobos) class, as reasoned in:
http://www.digitalmars.com/d/cppstrings.html

However, there are at least two 3rd-party ones:
http://dool.sourceforge.net/dool_String_String.html
http://svn.dsource.org/svn/projects/mango/trunk/doc/html/classUString.html

I'm not sure having a default *class* in a hybrid language is such a great idea in the first place? (then again, Exceptions are classes and default...)

> Or are there any plans to create string for D?

As a built-in value type? No, that will not happen. Although, there are three good alternatives already... (the famous: str, wstr, dstr as I prefer to call them)

> char[], wchar[] and dchar[] cannot serve string purposes as they use utf encodings which are "transport" encodings and cannot be used in most cases as strings.

This is not true. All of UTF-8, UTF-16 and UTF-32 can be used for storing an array of Unicode code points... Just that some code points require more than one code unit, just as one "grapheme" might require more than one "code point" anyway when using Unicode.

> String as an entity is a sequence of "code points" - ascii, ucs-2(basic multilang plane) and ucs-4 so operator[] always returns character in full (for the given supported plane). The same should apply to foreach().

You can "foreach dchar", over all three string types. If you want to index by code point, you will need to convert the two smaller code units to UTF-32 first...

--anders
Feb 20 2005
>> String as an entity is a sequence of "code points" - ascii, ucs-2(basic multilang plane) and ucs-4 so operator[] always returns character in full (for the given supported plane). The same should apply to foreach().
> You can "foreach dchar", over all three string types. If you want to index by code point, you will need to convert the two smaller code units to UTF-32 first...

A while ago I posted some tiny helper functions to do on-the-fly character indexing, but I can't find them so I'll just post them again in case the OP finds them useful:

-Ben
Feb 20 2005
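The helper functions mentioned in the post above did not survive in this archive. The following is not Ben's code, only a guess at the general idea of on-the-fly indexing, assuming the current std.utf functions stride and decode. The name codePointAt is hypothetical.

```d
import std.utf : decode, stride;

// Return the n-th code point (counting from 0) of a UTF-8 string without
// converting the whole string to UTF-32 first.
// Precondition: s is valid UTF-8 and holds more than n code points.
dchar codePointAt(const(char)[] s, size_t n)
{
    size_t i = 0;
    while (n--)
        i += stride(s, i);   // step over one encoded code point
    return decode(s, i);     // decode the code point starting at offset i
}

void main()
{
    auto s = "a\u00E9\U00010000z";
    assert(codePointAt(s, 0) == 'a');
    assert(codePointAt(s, 2) == '\U00010000');
    assert(codePointAt(s, 3) == 'z');
}
```

Each call walks the string from the start, so this suits occasional lookups; for heavy random access, converting to dchar[] once is likely the better trade.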