D - String type.
- Jakob Kemi (7/7) Mar 11 2002 As I understand from the docs, D is supposed to use wchars
- Pavel Minayev (10/15) Mar 11 2002 ...while UNICODE is already a standard on at least Windows and BeOS
- Jakob Kemi (7/24) Mar 11 2002 Linux supports UTF-8 very good. You can use all your standard programs
- Jakob Kemi (22/39) Mar 11 2002 I forgot to add. UNICODE is a very loose notion as it includes
- Walter (3/42) Mar 11 2002 Supporting utf8 would just be by using char[] arrays!
- Walter (5/11) Mar 11 2002 Actually, string literals are uncommitted by default. They then get
- Pavel Minayev (11/13) Mar 12 2002 So how is the context determined?
- Walter (6/20) Mar 12 2002 That's a bug, it should give an ambiguity error.
- Juan Carlos Arevalo Baeza (16/32) Mar 15 2002 Hmmm... I'm thinking that flagging an ambiguity here would still be b...
- Walter (7/16) Mar 26 2002 or
- J. Daniel Smith (9/16) Mar 11 2002 UTF-8 is fine for strings that are mostly ASCII with some UNICODE (sourc...
- Jakob Kemi (15/23) Mar 11 2002 True.
- Serge K (4/8) Mar 11 2002 Actually, UTF-8 can represent all Unicode 3.2 characters with 1..4 bytes...
- Walter (8/13) Mar 11 2002 At one time I had written a lexer that handled utf-8 source. It turned o...
- Jakob Kemi (8/23) Mar 11 2002 You already have this problem in windows with linebreaks being two
- Pavel Minayev (3/5) Mar 12 2002 There are no iterators in D, nor there is a string class.
- Jakob Kemi (10/17) Mar 12 2002 I'm not talking about some STL iterators here, what I mean
- Walter (6/28) Mar 12 2002 That's true, but I was never comfortable using such things, and hiding t...
As I understand from the docs, D is supposed to use wchars (2 to 4 bytes) for representing non-ASCII strings. I think it would be better to let all string functions handle only UTF-8 (which is fully backwards compatible with ASCII). UTF-8 is slowly becoming the standard in UNIX (just look at X and Gtk+ 2.0).

Just a thought.

Jakob Kemi
Mar 11 2002
"Jakob Kemi" <jakob.kemi telia.com> wrote in message news:a6j4v5$1koq$1 digitaldaemon.com...
> As I understand from the docs, D is supposed to use wchars (2 to 4
> bytes) for representing non-ASCII strings. I think it would be better
> to let all string functions only handle UTF-8 (which is fully backwards
> compatible with ASCII). UTF-8 is slowly becoming standard in UNIX.
> (just look at X and Gtk+ 2.0)

...while UNICODE is already a standard on at least Windows and BeOS (these, I know for sure; Linux?). I'd prefer to have both char and wchar flavors of each and every string-manipulation function, probably overloaded (so you don't really see the difference).

BTW, Walter, a question: string literals seem to be char[] by default. I guess they are wchar[] if the program is written in UNICODE, though? Also, are UNICODE literals allowed in ASCII programs?
Mar 11 2002
On Mon, 11 Mar 2002 21:48:45 +0100, Pavel Minayev wrote:
> "Jakob Kemi" <jakob.kemi telia.com> wrote in message news:a6j4v5$1koq$1 digitaldaemon.com...
>> I think it would be better to let all string functions only handle
>> UTF-8 (which is fully backwards compatible with ASCII). UTF-8 is
>> slowly becoming standard in UNIX. (just look at X and Gtk+ 2.0)
> ...while UNICODE is already a standard on at least Windows and BeOS
> (these, I know for sure; Linux?).

Linux supports UTF-8 very well. You can use all your standard programs with UTF-8 encoding (cat, less, etc.). The best part is that if all string functions are written to handle UTF-8, they'll also work for ordinary (legacy) ASCII strings. There's no need to change the char type or anything else to deal with UTF-8. wchar would still be useful for interacting with older C libs.
Mar 11 2002
On Mon, 11 Mar 2002 21:48:45 +0100, Pavel Minayev wrote:
> ...while UNICODE is already a standard on at least Windows and BeOS
> (these, I know for sure; Linux?). I'd prefer to have both char and
> wchar flavor for each and every string-manipulation function, probably
> overloaded (so you don't really see the difference).

I forgot to add: UNICODE is a very loose notion, as it includes UCS-2 (2 byte), UCS-4 (4 byte) and UTF-8 (variable width), among others. My first reaction was that variable-width characters are kinda gross and inelegant. However, one would waste memory (yeah, I know, it's cheap) with 4-byte characters in order to be UNICODE compliant. Also, the fact that UTF-8 works so well with its ASCII inheritance, and its fast acceptance in the UNIX world, makes me feel all warm and fuzzy about it, despite its variable character size.

By sticking with UCS-2 and UCS-4 we'll still be in the present-day situation: all internationalization will be clumsy add-ons to the standard ASCII strings, and every program will have to decide which routines to use, etc. (a hell when you're developing big applications and/or exchanging data between different countries and different systems). The best thing would of course be if the whole world, including all legacy databases, every file ever written and every old and unsupported application, just magically and instantly were rewritten or converted to UCS-4. But there is _no_ way that's ever going to happen. UTF-8 gives the best compromise IMO.

Jakob Kemi
Mar 11 2002
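[The width trade-off discussed above can be sketched in C. This is an illustrative helper, not code from the thread: for Unicode proper (code points up to U+10FFFF), UTF-8 needs 1 to 4 bytes per code point, versus a fixed 4 bytes in UCS-4.]

    #include <stdio.h>

    /* Bytes needed to encode one Unicode code point in UTF-8. */
    static int utf8_len(unsigned int cp)
    {
        if (cp < 0x80)    return 1;  /* ASCII, unchanged */
        if (cp < 0x800)   return 2;  /* Latin supplements, Cyrillic, ... */
        if (cp < 0x10000) return 3;  /* most CJK */
        return 4;                    /* supplementary planes */
    }

    int main(void)
    {
        printf("U+0041 'A' -> %d byte(s)\n", utf8_len(0x41));    /* 1 */
        printf("U+00E9 'e'' -> %d byte(s)\n", utf8_len(0xE9));   /* 2 */
        printf("U+4E2D     -> %d byte(s)\n", utf8_len(0x4E2D));  /* 3 */
        printf("U+10400    -> %d byte(s)\n", utf8_len(0x10400)); /* 4 */
        return 0;
    }

[So plain ASCII text costs exactly what it always did, while a UCS-4 representation quadruples it; the price is variable width.]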
Supporting UTF-8 would just be a matter of using char[] arrays!

"Jakob Kemi" <jakob.kemi telia.com> wrote in message news:a6j8hj$1m6n$1 digitaldaemon.com...
> I forgot to add. UNICODE is a very loose notion as it includes UCS-2
> (2 byte), UCS-4 (4 byte) and UTF-8 (variable width) among others. [...]
> By sticking with UCS-2 and UCS-4 we'll still be in present day's
> situation, all internationalization will be clumsy addons to the
> standard ASCII strings and every program will have to decide which
> routines to use etc. [...] UTF-8 gives the best compromise IMO.
Mar 11 2002
"Pavel Minayev" <evilone omen.ru> wrote in message news:a6j584$1kuq$1 digitaldaemon.com...
> BTW, Walter, a question. String literals seem to be char[] by default,
> I guess they are wchar[] if the program is written in UNICODE, though?
> Also, are UNICODE literals allowed in ASCII programs?

Actually, string literals are uncommitted by default. They then get converted to char[], wchar[], char, or wchar depending on the context. You can insert unicode literals into strings with the \uUUUU syntax.
Mar 11 2002
"Walter" <walter digitalmars.com> wrote in message news:a6jfpg$2vd$1 digitaldaemon.com...
> Actually, string literals are uncommitted by default. They then get
> converted to char[], wchar[], char, or wchar depending on the context.

So how is the context determined?

    void foo(char[] s) { ... }
    void foo(wchar[] s) { ... }

    foo("Hello, world!");

My tests show that in the above snippet "Hello, world!" is passed to the function that takes the char[] argument. If the whole program text were in UNICODE, would the string be UNICODE as well? And what if I insert some UNICODE chars into the literal? Will the compiler complain about "invalid characters"?
Mar 12 2002
"Pavel Minayev" <evilone omen.ru> wrote in message news:a6l1sf$l7f$1 digitaldaemon.com...
> So how is the context determined?
>
>     void foo(char[] s) { ... }
>     void foo(wchar[] s) { ... }
>
>     foo("Hello, world!");
>
> My tests show that in the above snippet "Hello, world!" is passed to
> the function that takes char[] argument.

That's a bug, it should give an ambiguity error.

> If the whole program text would be in UNICODE, would the string be
> UNICODE as well?

Yes, but if the string doesn't contain any characters with the high bits set, it can be implicitly converted to ascii.

> And what if I insert some UNICODE chars into the literal? Will the
> compiler complain about "invalid characters"?

It won't implicitly convert it to char[], then.
Mar 12 2002
"Walter" <walter digitalmars.com> wrote in message news:a6lccr$psj$2 digitaldaemon.com...
>> My tests show that in the above snippet "Hello, world!" is passed to
>> the function that takes char[] argument.
>
> That's a bug, it should give an ambiguity error.

Hmmm... I'm thinking that flagging an ambiguity here would still be bad. How about using attributes to add the ability to resolve ambiguities in a user-defined manner? For example:

    priority(9) void foo(char[] s) { ... }
    priority(5) void foo(wchar[] s) { ... }

    foo("Hello, world!"); // Calls using char[], as it's higher priority.

This way, ambiguities will only be flagged if multiple possibilities exist that have the same priority. The default priority could be 5 for all functions, and the range could be 0 to 9, so you can always define higher or lower ones as needed. I admit that this might open a whole new can of worms, but I'd definitely be willing to explore this if the language supported it.

Salutaciones,
  JCAB
Mar 15 2002
"Juan Carlos Arevalo Baeza" <jcab roningames.com> wrote in message news:a6u823$bip$1 digitaldaemon.com...
>     priority(9) void foo(char[] s) { ... }
>     priority(5) void foo(wchar[] s) { ... }
>
>     foo("Hello, world!"); // Calls using char[], as it's higher priority.
>
> This way, ambiguities will only be flagged if multiple possibilities
> exist that have the same priority. The default priority could be 5 for
> all functions, and the range could be 0 to 9, so you can always define
> higher or lower ones as needed. I admit that this might open a whole
> new can of worms, but I'd definitely be willing to explore this if the
> language supported it.

D was trying to migrate to a simpler overloading scheme <g>. It can be less convenient at times, but I think it's more than made up for by having simple and obvious rules.
Mar 26 2002
UTF-8 is fine for strings that are mostly ASCII with some UNICODE (source code, Western European languages). But if the string is entirely UNICODE (something in Chinese, for example), the UTF-8 encoding can consume MORE memory, since a UTF-8 transformation can be as many as six bytes long. UTF-8 solves a lot of problems, but I'm not sure you want to wire it into the language as the only option.

   Dan

"Jakob Kemi" <jakob.kemi telia.com> wrote in message news:a6j4v5$1koq$1 digitaldaemon.com...
> As I understand from the docs, D is supposed to use wchars (2 to 4
> bytes) for representing non-ASCII strings. I think it would be better
> to let all string functions only handle UTF-8 (which is fully backwards
> compatible with ASCII). UTF-8 is slowly becoming standard in UNIX.
> (just look at X and Gtk+ 2.0)
Mar 11 2002
On Mon, 11 Mar 2002 22:52:46 +0100, J. Daniel Smith wrote:
> UTF-8 is fine for strings that are mostly ASCII with some UNICODE
> (source code, Western European languages). But if the string is
> entirely UNICODE (something in Chinese for example), the UTF-8 encoding
> can consume MORE memory since the UTF-8 transformation can be as many
> as six bytes long.

True. Globally, however, UTF-8 will save memory compared to UCS-4 (no, UCS-2 isn't enough), since six-byte-wide characters are rare. But I think the memory issue doesn't really matter, and it will matter even less as prices fall. Also, if someone is storing _huge_ amounts of text, they will just compress it and remove most of the redundancy in the codeset.

> UTF-8 solves a lot of problems, but I'm not sure you want to wire it
> into the language as the only option.

It sure does solve lots of problems, and the best part is that you don't have to opt out of anything else. Just design all string functions to handle UTF-8 and you'll have the best of both worlds (ordinary ASCII char strings and UTF-8, that is). If there's a need, you can still have special UCS-4 functions and ucs4_char (or whatever).

Jakob
Mar 11 2002
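[To put rough numbers on the memory trade-off the two posts above debate, here is a C sketch; utf8_bytes is a hypothetical helper, not code from the thread. A typical CJK code point costs 3 bytes in UTF-8 but 2 in UCS-2, while ASCII costs 1 versus 2 (or 4 in UCS-4).]

    #include <stdio.h>

    /* UTF-8 size in bytes of one Unicode code point. */
    static int utf8_bytes(unsigned int cp)
    {
        return cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
    }

    int main(void)
    {
        /* Two CJK code points versus five ASCII ones. */
        unsigned int cjk[]   = { 0x4E2D, 0x6587 };          /* "Chinese" */
        unsigned int ascii[] = { 'h', 'e', 'l', 'l', 'o' };
        int u8_cjk = 0, u8_ascii = 0, i;
        for (i = 0; i < 2; i++) u8_cjk   += utf8_bytes(cjk[i]);
        for (i = 0; i < 5; i++) u8_ascii += utf8_bytes(ascii[i]);

        printf("CJK:   UTF-8 %d, UCS-2 %d, UCS-4 %d\n", u8_cjk, 2 * 2, 2 * 4);
        printf("ASCII: UTF-8 %d, UCS-2 %d, UCS-4 %d\n", u8_ascii, 5 * 2, 5 * 4);
        return 0;
    }

[For the CJK string UTF-8 loses to UCS-2 (6 bytes vs 4) but still beats UCS-4 (8); for the ASCII string UTF-8 wins outright (5 vs 10 vs 20), which is both Dan's point and Jakob's.]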
"J. Daniel Smith" <j_daniel_smith HoTMaiL.com> wrote in message news:a6j90i$1me0$1 digitaldaemon.com...
> But if the string is entirely UNICODE (something in Chinese for
> example), the UTF-8 encoding can consume MORE memory since the UTF-8
> transformation can be as many as six bytes long.

Actually, UTF-8 can represent all Unicode 3.2 characters with 1..4 bytes. Which means it simply cannot consume more memory than UTF-32. (ISO/IEC 10646 may require up to 6 bytes in UTF-8, but it is a superset of Unicode.)
Mar 11 2002
"Jakob Kemi" <jakob.kemi telia.com> wrote in message news:a6j4v5$1koq$1 digitaldaemon.com...
> I think it would be better to let all string functions only handle
> UTF-8 (which is fully backwards compatible with ASCII). UTF-8 is slowly
> becoming standard in UNIX. (just look at X and Gtk+ 2.0)

At one time I had written a lexer that handled UTF-8 source. It turned out to cause a lot of problems, because strings could no longer be simply indexed by character position, nor could pointers be arbitrarily incremented and decremented. It turned out to be a lot of trouble :-( and I finally converted it to wchars.
Mar 11 2002
On Tue, 12 Mar 2002 01:00:49 +0100, Walter wrote:
> At one time I had written a lexer that handled utf-8 source. It turned
> out to cause a lot of problems because strings could no longer be
> simply indexed by character position, nor could pointers be arbitrarily
> incremented and decremented. It turned out to be a lot of trouble :-(
> and I finally converted it to wchar's.

You already have this problem in Windows, with linebreaks being two bytes. Just use custom iterators for your string class implementation, and if you need to set/get positions in streams, use tell and seek (you're not supposed to assume that 1 character == 1 byte anyway, according to standards). There should be no real _need_ to index characters in strings with pointers.

Jakob Kemi
Mar 11 2002
"Jakob Kemi" <jakob.kemi telia.com> wrote in message news:a6jitg$118$1 digitaldaemon.com...
> You already have this problem in windows with linebreaks being two
> bytes. Just use custom iterators for your string class

There are no iterators in D, nor is there a string class.
Mar 12 2002
On Tue, 12 Mar 2002 20:06:34 +0100, Pavel Minayev wrote:
> There are no iterators in D, nor is there a string class.

I'm not talking about STL-style iterators here; what I mean is that you just design your loops like this:

    for (char* s = string; get_char(s) != '\0'; s = next_char(s)) {
        ...
    }

Loops operating on strings are rare anyway; most string functions should be optimized library functions. get_char() and next_char() should be inlined, and can use whatever syntactic sugar is applicable.

Jakob
Mar 12 2002
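[The loop pattern in the post above can be fleshed out in C. get_char and next_char are hypothetical helpers in the spirit of Jakob's sketch, not a real API; for brevity they do no validation and assume well-formed UTF-8.]

    #include <stdio.h>

    /* Decode the code point starting at s (assumes a valid lead byte). */
    static unsigned int get_char(const char *s)
    {
        const unsigned char *p = (const unsigned char *)s;
        if (p[0] < 0x80) return p[0];
        if (p[0] < 0xE0) return ((p[0] & 0x1F) << 6) | (p[1] & 0x3F);
        if (p[0] < 0xF0) return ((p[0] & 0x0F) << 12) | ((p[1] & 0x3F) << 6)
                                | (p[2] & 0x3F);
        return ((p[0] & 0x07) << 18) | ((p[1] & 0x3F) << 12)
               | ((p[2] & 0x3F) << 6) | (p[3] & 0x3F);
    }

    /* Advance past one code point by inspecting the lead byte. */
    static const char *next_char(const char *s)
    {
        unsigned char b = (unsigned char)*s;
        return s + (b < 0x80 ? 1 : b < 0xE0 ? 2 : b < 0xF0 ? 3 : 4);
    }

    int main(void)
    {
        const char *s = "A\xC3\xA9\xE4\xB8\xAD";  /* 3 code points, 6 bytes */
        int n = 0;
        const char *p;
        for (p = s; get_char(p) != '\0'; p = next_char(p))
            n++;
        printf("%d code points\n", n);  /* counts characters, not bytes */
        return 0;
    }

[This also illustrates Walter's objection: stepping forward is cheap, but random access by character index or stepping backward requires a scan.]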
"Jakob Kemi" <jakob.kemi telia.com> wrote in message news:a6jitg$118$1 digitaldaemon.com...
> You already have this problem in windows with linebreaks being two
> bytes. Just use custom iterators for your string class implementation
> and if you need to set/get positions in streams you use tell and seek
> (you're not supposed to assume that 1 character == 1 byte anyway
> according to standards.) There should be no real _need_ to index
> characters in strings with pointers.

That's true, but I was never comfortable using such things, and hiding the performance hit behind syntactic sugar doesn't make the hit go away. When you're trying to compile 100,000 lines of code, every cycle in the lexer matters.
Mar 12 2002