D - String type.
- Jakob Kemi (7/7) Mar 11 2002 As I understand from the docs, D is supposed to use wchars
- Pavel Minayev (10/15) Mar 11 2002 ...while UNICODE is already a standard on at least Windows and BeOS
- Jakob Kemi (7/24) Mar 11 2002 Linux supports UTF-8 very good. You can use all your standard programs
- Jakob Kemi (22/39) Mar 11 2002 I forgot to add. UNICODE is a very loose notion as it includes
- Walter (3/42) Mar 11 2002 Supporting utf8 would just be by using char[] arrays!
- Walter (5/11) Mar 11 2002 Actually, string literals are uncommitted by default. They then get
- Pavel Minayev (11/13) Mar 12 2002 So how is the context determined?
- Walter (6/20) Mar 12 2002 That's a bug, it should give an ambiguity error.
- Juan Carlos Arevalo Baeza (16/32) Mar 15 2002 Hmmm... I'm thinking that flagging an ambiguity here would still be b...
- Walter (7/16) Mar 26 2002 or
- J. Daniel Smith (9/16) Mar 11 2002 UTF-8 is fine for strings that are mostly ASCII with some UNICODE (sourc...
- Jakob Kemi (15/23) Mar 11 2002 True.
- Serge K (4/8) Mar 11 2002 Actually, UTF-8 can represent all Unicode 3.2 characters with 1..4 bytes...
- Walter (8/13) Mar 11 2002 At one time I had written a lexer that handled utf-8 source. It turned o...
- Jakob Kemi (8/23) Mar 11 2002 You already have this problem in windows with linebreaks being two
- Pavel Minayev (3/5) Mar 12 2002 There are no iterators in D, nor there is a string class.
- Jakob Kemi (10/17) Mar 12 2002 I'm not talking about some STL iterators here, what I mean
- Walter (6/28) Mar 12 2002 That's true, but I was never comfortable using such things, and hiding t...
As I understand from the docs, D is supposed to use wchars (2 to 4 bytes) for representing non-ASCII strings. I think it would be better to let all string functions handle only UTF-8 (which is fully backwards compatible with ASCII). UTF-8 is slowly becoming the standard in UNIX (just look at X and Gtk+ 2.0).

Just a thought.

Jakob Kemi
Mar 11 2002
"Jakob Kemi" <jakob.kemi telia.com> wrote in message news:a6j4v5$1koq$1 digitaldaemon.com...
> As I understand from the docs, D is supposed to use wchars (2 to 4
> bytes) for representing non-ASCII strings. I think it would be better
> to let all string functions only handle UTF-8 (which is fully backwards
> compatible with ASCII). UTF-8 is slowly becoming standard in UNIX.
> (just look at X and Gtk+ 2.0)

...while UNICODE is already a standard on at least Windows and BeOS (these, I know for sure; Linux?). I'd prefer to have both char and wchar flavors of each and every string-manipulation function, probably overloaded (so you don't really see the difference).

BTW, Walter, a question: string literals seem to be char[] by default. I guess they are wchar[] if the program is written in UNICODE, though? Also, are UNICODE literals allowed in ASCII programs?
Mar 11 2002
On Mon, 11 Mar 2002 21:48:45 +0100, Pavel Minayev wrote:
> "Jakob Kemi" <jakob.kemi telia.com> wrote in message news:a6j4v5$1koq$1 digitaldaemon.com...
>> I think it would be better to let all string functions only handle
>> UTF-8 (which is fully backwards compatible with ASCII). UTF-8 is
>> slowly becoming standard in UNIX. (just look at X and Gtk+ 2.0)
> ...while UNICODE is already a standard on at least Windows and BeOS
> (these, I know for sure; Linux?).

Linux supports UTF-8 very well. You can use all your standard programs with UTF-8 encoding (cat, less, etc.). The best part is that if all string functions are written to handle UTF-8, they'll also work for ordinary (legacy) ASCII strings. There's no need to change the char type or anything else to deal with UTF-8. wchar would still be useful for interacting with older C libs.
Mar 11 2002
On Mon, 11 Mar 2002 21:48:45 +0100, Pavel Minayev wrote:
> ...while UNICODE is already a standard on at least Windows and BeOS
> (these, I know for sure; Linux?). I'd prefer to have both char and
> wchar flavor for each and every string-manipulation function, probably
> overloaded (so you don't really see the difference).

I forgot to add: UNICODE is a very loose notion, as it includes UCS-2 (2 byte), UCS-4 (4 byte) and UTF-8 (variable width), among others. My first reaction was that variable-width characters are kinda gross and inelegant. However, one would waste memory (yeah, I know, it's cheap) with 4-byte characters in order to be UNICODE compliant. Also, the fact that UTF-8 works so well with its ASCII inheritance, and its fast acceptance in the UNIX world, makes me feel all warm and fuzzy about it, despite its variable character size.

By sticking with UCS-2 and UCS-4 we'll still be in the present-day situation: all internationalization will be clumsy add-ons to the standard ASCII strings, and every program will have to decide which routines to use, etc. (a hell when you're developing big applications and/or exchanging data between different countries and different systems). The best thing would of course be if the whole world, including all legacy databases, every file ever written and every old and unsupported application, just magically and instantly were rewritten or converted to UCS-4. But there is _no_ way that's ever going to happen. UTF-8 gives the best compromise IMO.

Jakob Kemi
Mar 11 2002
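[The width trade-off discussed above can be sketched in C. This is an illustrative helper, not code from the thread: for Unicode proper (code points up to U+10FFFF), UTF-8 needs 1 to 4 bytes per code point, versus a fixed 4 bytes in UCS-4.]

    #include <stdio.h>

    /* Bytes needed to encode one Unicode code point in UTF-8. */
    static int utf8_len(unsigned int cp)
    {
        if (cp < 0x80)    return 1;  /* ASCII, unchanged */
        if (cp < 0x800)   return 2;  /* Latin supplements, Cyrillic, ... */
        if (cp < 0x10000) return 3;  /* most CJK */
        return 4;                    /* supplementary planes */
    }

    int main(void)
    {
        printf("U+0041 'A' -> %d byte(s)\n", utf8_len(0x41));    /* 1 */
        printf("U+00E9 'e'' -> %d byte(s)\n", utf8_len(0xE9));   /* 2 */
        printf("U+4E2D     -> %d byte(s)\n", utf8_len(0x4E2D));  /* 3 */
        printf("U+10400    -> %d byte(s)\n", utf8_len(0x10400)); /* 4 */
        return 0;
    }

[So plain ASCII text costs exactly what it always did, while a UCS-4 representation quadruples it; the price is variable width.]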
Supporting UTF-8 would just be a matter of using char[] arrays!

"Jakob Kemi" <jakob.kemi telia.com> wrote in message news:a6j8hj$1m6n$1 digitaldaemon.com...
> I forgot to add. UNICODE is a very loose notion as it includes UCS-2
> (2 byte), UCS-4 (4 byte) and UTF-8 (variable width) among others. [...]
> By sticking with UCS-2 and UCS-4 we'll still be in present day's
> situation, all internationalization will be clumsy addons to the
> standard ASCII strings and every program will have to decide which
> routines to use etc. [...] UTF-8 gives the best compromise IMO.
Mar 11 2002
"Pavel Minayev" <evilone omen.ru> wrote in message news:a6j584$1kuq$1 digitaldaemon.com...
> BTW, Walter, a question. String literals seem to be char[] by default,
> I guess they are wchar[] if the program is written in UNICODE, though?
> Also, are UNICODE literals allowed in ASCII programs?

Actually, string literals are uncommitted by default. They then get converted to char[], wchar[], char, or wchar depending on the context. You can insert unicode literals into strings with the \uUUUU syntax.
Mar 11 2002
"Walter" <walter digitalmars.com> wrote in message news:a6jfpg$2vd$1 digitaldaemon.com...
> Actually, string literals are uncommitted by default. They then get
> converted to char[], wchar[], char, or wchar depending on the context.

So how is the context determined?

    void foo(char[] s) { ... }
    void foo(wchar[] s) { ... }

    foo("Hello, world!");

My tests show that in the above snippet "Hello, world!" is passed to the function that takes the char[] argument. If the whole program text were in UNICODE, would the string be UNICODE as well? And what if I insert some UNICODE chars into the literal? Will the compiler complain about "invalid characters"?
Mar 12 2002
"Pavel Minayev" <evilone omen.ru> wrote in message news:a6l1sf$l7f$1 digitaldaemon.com...
> So how is the context determined?
>
>     void foo(char[] s) { ... }
>     void foo(wchar[] s) { ... }
>
>     foo("Hello, world!");
>
> My tests show that in the above snippet "Hello, world!" is passed to
> the function that takes char[] argument.

That's a bug, it should give an ambiguity error.

> If the whole program text would be in UNICODE, would the string be
> UNICODE as well?

Yes, but if the string doesn't contain any characters with the high bits set, it can be implicitly converted to ascii.

> And what if I insert some UNICODE chars into the literal? Will the
> compiler complain about "invalid characters"?

It won't implicitly convert it to char[], then.
Mar 12 2002
"Walter" <walter digitalmars.com> wrote in message news:a6lccr$psj$2 digitaldaemon.com...
>> My tests show that in the above snippet "Hello, world!" is passed to
>> the function that takes char[] argument.
>
> That's a bug, it should give an ambiguity error.

Hmmm... I'm thinking that flagging an ambiguity here would still be bad. How about using attributes to add the ability to resolve ambiguities in a user-defined manner? For example:

    priority(9) void foo(char[] s) { ... }
    priority(5) void foo(wchar[] s) { ... }

    foo("Hello, world!"); // Calls using char[], as it's higher priority.

This way, ambiguities will only be flagged if multiple possibilities exist that have the same priority. The default priority could be 5 for all functions, and the range could be 0 to 9, so you can always define higher or lower ones as needed. I admit that this might open a whole new can of worms, but I'd definitely be willing to explore this if the language supported it.

Salutaciones,
  JCAB
Mar 15 2002
"Juan Carlos Arevalo Baeza" <jcab roningames.com> wrote in message news:a6u823$bip$1 digitaldaemon.com...
>     priority(9) void foo(char[] s) { ... }
>     priority(5) void foo(wchar[] s) { ... }
>
>     foo("Hello, world!"); // Calls using char[], as it's higher priority.
>
> This way, ambiguities will only be flagged if multiple possibilities
> exist that have the same priority. The default priority could be 5 for
> all functions, and the range could be 0 to 9, so you can always define
> higher or lower ones as needed. I admit that this might open a whole
> new can of worms, but I'd definitely be willing to explore this if the
> language supported it.

D was trying to migrate to a simpler overloading scheme <g>. It can be less convenient at times, but I think it's more than made up for by having simple and obvious rules.
Mar 26 2002
UTF-8 is fine for strings that are mostly ASCII with some UNICODE (source code, Western European languages). But if the string is entirely UNICODE (something in Chinese, for example), the UTF-8 encoding can consume MORE memory, since a UTF-8 transformation can be as many as six bytes long. UTF-8 solves a lot of problems, but I'm not sure you want to wire it into the language as the only option.

   Dan

"Jakob Kemi" <jakob.kemi telia.com> wrote in message news:a6j4v5$1koq$1 digitaldaemon.com...
> As I understand from the docs, D is supposed to use wchars (2 to 4
> bytes) for representing non-ASCII strings. I think it would be better
> to let all string functions only handle UTF-8 (which is fully backwards
> compatible with ASCII). UTF-8 is slowly becoming standard in UNIX.
> (just look at X and Gtk+ 2.0)
Mar 11 2002
On Mon, 11 Mar 2002 22:52:46 +0100, J. Daniel Smith wrote:
> UTF-8 is fine for strings that are mostly ASCII with some UNICODE
> (source code, Western European languages). But if the string is
> entirely UNICODE (something in Chinese for example), the UTF-8 encoding
> can consume MORE memory since the UTF-8 transformation can be as many
> as six bytes long.

True. Globally, however, UTF-8 will save memory compared to UCS-4 (no, UCS-2 isn't enough), since six-byte-wide characters are rare. But I think the memory issue doesn't really matter, and it will matter even less as prices fall. Also, if someone is storing _huge_ amounts of text, they will just compress it and remove most of the redundancy in the codeset.

> UTF-8 solves a lot of problems, but I'm not sure you want to wire it
> into the language as the only option.

It sure does solve lots of problems, and the best part is that you don't have to opt out of anything else. Just design all string functions to handle UTF-8 and you'll have the best of both worlds (ordinary ASCII char strings and UTF-8, that is). If there's a need, you can still have special UCS-4 functions and ucs4_char (or whatever).

Jakob
Mar 11 2002
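[To put rough numbers on the memory trade-off the two posts above debate, here is a C sketch; utf8_bytes is a hypothetical helper, not code from the thread. A typical CJK code point costs 3 bytes in UTF-8 but 2 in UCS-2, while ASCII costs 1 versus 2 (or 4 in UCS-4).]

    #include <stdio.h>

    /* UTF-8 size in bytes of one Unicode code point. */
    static int utf8_bytes(unsigned int cp)
    {
        return cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
    }

    int main(void)
    {
        /* Two CJK code points versus five ASCII ones. */
        unsigned int cjk[]   = { 0x4E2D, 0x6587 };          /* "Chinese" */
        unsigned int ascii[] = { 'h', 'e', 'l', 'l', 'o' };
        int u8_cjk = 0, u8_ascii = 0, i;
        for (i = 0; i < 2; i++) u8_cjk   += utf8_bytes(cjk[i]);
        for (i = 0; i < 5; i++) u8_ascii += utf8_bytes(ascii[i]);

        printf("CJK:   UTF-8 %d, UCS-2 %d, UCS-4 %d\n", u8_cjk, 2 * 2, 2 * 4);
        printf("ASCII: UTF-8 %d, UCS-2 %d, UCS-4 %d\n", u8_ascii, 5 * 2, 5 * 4);
        return 0;
    }

[For the CJK string UTF-8 loses to UCS-2 (6 bytes vs 4) but still beats UCS-4 (8); for the ASCII string UTF-8 wins outright (5 vs 10 vs 20), which is both Dan's point and Jakob's.]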
"J. Daniel Smith" <j_daniel_smith HoTMaiL.com> wrote in message news:a6j90i$1me0$1 digitaldaemon.com...
> But if the string is entirely UNICODE (something in Chinese for
> example), the UTF-8 encoding can consume MORE memory since the UTF-8
> transformation can be as many as six bytes long.

Actually, UTF-8 can represent all Unicode 3.2 characters with 1..4 bytes. Which means it simply cannot consume more memory than UTF-32. (ISO/IEC 10646 may require up to 6 bytes in UTF-8, but it is a superset of Unicode.)
Mar 11 2002
"Jakob Kemi" <jakob.kemi telia.com> wrote in message news:a6j4v5$1koq$1 digitaldaemon.com...
> I think it would be better to let all string functions only handle
> UTF-8 (which is fully backwards compatible with ASCII). UTF-8 is slowly
> becoming standard in UNIX. (just look at X and Gtk+ 2.0)

At one time I had written a lexer that handled UTF-8 source. It turned out to cause a lot of problems, because strings could no longer be simply indexed by character position, nor could pointers be arbitrarily incremented and decremented. It turned out to be a lot of trouble :-( and I finally converted it to wchars.
Mar 11 2002
On Tue, 12 Mar 2002 01:00:49 +0100, Walter wrote:
> At one time I had written a lexer that handled utf-8 source. It turned
> out to cause a lot of problems because strings could no longer be
> simply indexed by character position, nor could pointers be arbitrarily
> incremented and decremented. It turned out to be a lot of trouble :-(
> and I finally converted it to wchar's.

You already have this problem in Windows, with linebreaks being two bytes. Just use custom iterators for your string class implementation, and if you need to set/get positions in streams, use tell and seek (you're not supposed to assume that 1 character == 1 byte anyway, according to standards). There should be no real _need_ to index characters in strings with pointers.

Jakob Kemi
Mar 11 2002
"Jakob Kemi" <jakob.kemi telia.com> wrote in message news:a6jitg$118$1 digitaldaemon.com...
> You already have this problem in windows with linebreaks being two
> bytes. Just use custom iterators for your string class

There are no iterators in D, nor is there a string class.
Mar 12 2002
On Tue, 12 Mar 2002 20:06:34 +0100, Pavel Minayev wrote:
> There are no iterators in D, nor is there a string class.

I'm not talking about STL-style iterators here; what I mean is that you just design your loops like this:

    for (char* s = string; get_char(s) != '\0'; s = next_char(s)) {
        ...
    }

Loops operating on strings are rare anyway; most string functions should be optimized library functions. get_char() and next_char() should be inlined, and can use whatever syntactic sugar is applicable.

Jakob
Mar 12 2002
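[The loop pattern in the post above can be fleshed out in C. get_char and next_char are hypothetical helpers in the spirit of Jakob's sketch, not a real API; for brevity they do no validation and assume well-formed UTF-8.]

    #include <stdio.h>

    /* Decode the code point starting at s (assumes a valid lead byte). */
    static unsigned int get_char(const char *s)
    {
        const unsigned char *p = (const unsigned char *)s;
        if (p[0] < 0x80) return p[0];
        if (p[0] < 0xE0) return ((p[0] & 0x1F) << 6) | (p[1] & 0x3F);
        if (p[0] < 0xF0) return ((p[0] & 0x0F) << 12) | ((p[1] & 0x3F) << 6)
                                | (p[2] & 0x3F);
        return ((p[0] & 0x07) << 18) | ((p[1] & 0x3F) << 12)
               | ((p[2] & 0x3F) << 6) | (p[3] & 0x3F);
    }

    /* Advance past one code point by inspecting the lead byte. */
    static const char *next_char(const char *s)
    {
        unsigned char b = (unsigned char)*s;
        return s + (b < 0x80 ? 1 : b < 0xE0 ? 2 : b < 0xF0 ? 3 : 4);
    }

    int main(void)
    {
        const char *s = "A\xC3\xA9\xE4\xB8\xAD";  /* 3 code points, 6 bytes */
        int n = 0;
        const char *p;
        for (p = s; get_char(p) != '\0'; p = next_char(p))
            n++;
        printf("%d code points\n", n);  /* counts characters, not bytes */
        return 0;
    }

[This also illustrates Walter's objection: stepping forward is cheap, but random access by character index or stepping backward requires a scan.]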
"Jakob Kemi" <jakob.kemi telia.com> wrote in message news:a6jitg$118$1 digitaldaemon.com...
> You already have this problem in windows with linebreaks being two
> bytes. Just use custom iterators for your string class implementation
> and if you need to set/get positions in streams you use tell and seek
> (you're not supposed to assume that 1 character == 1 byte anyway
> according to standards.) There should be no real _need_ to index
> characters in strings with pointers.

That's true, but I was never comfortable using such things, and hiding the performance hit behind syntactic sugar doesn't make the hit go away. When you're trying to compile 100,000 lines of code, every cycle in the lexer matters.
Mar 12 2002