digitalmars.D - strings in D

"Andrew Fedoniouk" <news terrainformatica.com> writes:

Is there any string class for D?
Or are there any plans to create a string type for D?

char[], wchar[] and dchar[] cannot serve string purposes, as they
use UTF encodings, which are "transport" encodings and cannot be used
as strings in most cases.

A string as an entity is a sequence of "code points" - ASCII, UCS-2 (Basic
Multilingual Plane) and UCS-4 - so operator[] always returns a character in
full (for the given supported plane).
The same should apply to foreach().

I personally would like to see something similar to Java strings (UCS-2),
with methods like fromByteArray(encoding), fromUtf8(), etc.

Such strings should probably use a copy-on-write implementation.
I think that UCS-2 (unsigned word) as a string character would be enough
for all active languages.

Any other ideas, gentlemen?

Andrew Fedoniouk.
http://terrainformatica.com

Feb 18 2005

Kris <Kris_member pathlink.com> writes:

In article <cv6d5q$19al$1 digitaldaemon.com>, Andrew Fedoniouk says...
 Is there any string class for D?
 Or are there any plans to create a string type for D?
 
 char[], wchar[] and dchar[] cannot serve string purposes, as they
 use UTF encodings, which are "transport" encodings and cannot be used
 as strings in most cases.
 
 A string as an entity is a sequence of "code points" - ASCII, UCS-2 (Basic
 Multilingual Plane) and UCS-4 - so operator[] always returns a character in
 full (for the given supported plane).
 The same should apply to foreach().
 
 I personally would like to see something similar to Java strings (UCS-2),
 with methods like fromByteArray(encoding), fromUtf8(), etc.
 
 Such strings should probably use a copy-on-write implementation.
 I think that UCS-2 (unsigned word) as a string character would be enough
 for all active languages.
 
 Any other ideas, gentlemen?
 
 Andrew Fedoniouk.
 http://terrainformatica.com

You're walking upon graves with that one, Andrew! I'm afraid there's been a lot
of conflicting opinion around that particular subject.

Best bet is to get hold of a 'non-standard' library for such things, and go from
there. The mango.icu package is a wrapper around the extensive ICU project, and
may suit your needs ~ you can find that over at dsource.org:
http://dsource.org/forums/viewtopic.php?t=420

Feb 18 2005

John Reimer <brk_6502 yahoo.com> writes:

On Fri, 18 Feb 2005 19:52:24 -0800, Andrew Fedoniouk wrote:

 Is there any string class for the D?
 Or are there any plans to create string for D?
 
 char[], dchar[] and qchar[] cannot serve string purposes as they
 use utf encodings which are "transport" encodings and cannot be used
 in most cases as strings.
 
 String as an entity is a sequence of "code points" - ascii, ucs-2(basic 
 multilang plane)
 and ucs-4 so operator[] always returns character in full (for the given 
 supported plane).
 The same should apply to foreach().
 
 I personally would like to see something similar to Java strings (ucs-2) 
 with
 methods like fromByteArray(encoding),  fromUtf8() , etc.
 
 Probably such strings should use copy-on-write implementation.
 I think that ucs-2 (unsigned word) as a string character whould be enough 
 for all active
 languages.
 
 Any other ideas, gentlemen?
 
 Andrew Fedoniouk.
 http://terrainformatica.com

This question has been asked many times in the D groups.  If there ever
were a "big three" in the D debates department, I think this one would
rank as one of them. 

From what I gather, the opinions have settled into three groups:

1) Those that want a String class in D and think it is a critical addition
to the language.
2) Those that consider a String class contrary to the D methodology; they
think char[], wchar[] and dchar[] are sufficient.
3) Those that think a String class could be a useful addition; but it
should be added to D for optional use.

If you do a search of this newsgroup and the old D newsgroup, I think
you'll find how big the discussion has been!

- John R.

Feb 18 2005

"Charlie Patterson" <charliep1 SPAMIDDYSPAMexcite.com> writes:

"John Reimer" <brk_6502 yahoo.com> wrote in message
news:pan.2005.02.19.05.02.08.170345 yahoo.com...
 On Fri, 18 Feb 2005 19:52:24 -0800, Andrew Fedoniouk wrote:

 Is there any string class for the D?
 ...


 This question has been asked many times in the D groups.  If there ever
 were a "big three" in the D debates department, I think this one would
 rank as one of them.

The D newsgroup could probably use a FAQ.  I also don't know where the
land mines are buried!

Feb 19 2005

John Reimer <brk_6502 yahoo.com> writes:

On Sat, 19 Feb 2005 10:54:44 -0500, Charlie Patterson wrote:

 "John Reimer" <brk_6502 yahoo.com> wrote in message
 news:pan.2005.02.19.05.02.08.170345 yahoo.com...
 On Fri, 18 Feb 2005 19:52:24 -0800, Andrew Fedoniouk wrote:

 Is there any string class for the D?
 ...


 
 This question has been asked many times in the D groups.  If there ever
 were a "big three" in the D debates department, I think this one would
 rank as one of them.

 
 The D newsgroup could probably use a FAQ.  I also don't know where the land
mines are
 buried!

Yep, navigating this newsgroup can be quite a chore.

I'm not sure, but I thought the D wiki site has some references to these
topics.  Justin Calvarese would probably know.

Feb 19 2005

Anders F Björklund <afb algonet.se> writes:

John Reimer wrote:

The D newsgroup could probably use a FAQ.  I also don't know where the land
mines are
buried!

 
 Yep, Navigating this newsgroup can be quite a chore.
 
 I'm not sure, but I thought the D wiki site has some references to these
 topics.  Justin Calvarese would probably know.

There are several FAQs:
http://www.digitalmars.com/d/faq.html (Official FAQ)
http://int19h.tamb.ru/faq.html (Unofficial FAQ)
http://www.prowiki.org/wiki4d/wiki.cgi?FaqRoadmap


But you might be looking for simple things like:
http://www.prowiki.org/wiki4d/wiki.cgi?ShortFrequentAnswers

 Strings are not null-terminated but hold explicit length information.
 Therefore you need to use %.*s not %s in printf, or just use writef!

 Comparing an object reference like: "if (object == null)" will crash.
 You must use "if (object is null)"

 Checking for a key in an AA like: "if(array[key])" will create it
 if it's missing. You must use "if(key in array)"
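
The three short answers above can be sketched as follows. This is a minimal sketch assuming a current D2 compiler and druntime (the thread predates D 1.0, so details may differ from the 2005 toolchain); hasKey is a hypothetical helper name used only for illustration.

```d
import core.stdc.stdio : printf;

// True if `key` is present; `in` yields a pointer to the value, or null.
bool hasKey(const int[string] table, string key)
{
    return (key in table) !is null;
}

void main()
{
    string text = "hello, world";
    string word = text[0 .. 5];   // a slice: no '\0' follows "hello"

    // %.*s expects the length as an int, followed by the pointer.
    printf("%.*s\n", cast(int) word.length, word.ptr);

    int[string] counts;
    counts["present"] = 1;
    assert(hasKey(counts, "present"));
    assert(!hasKey(counts, "absent"));   // testing with `in` does not insert
    assert(counts.length == 1);

    Object obj;                          // class references default to null
    assert(obj is null);                 // identity test, no method call
}
```

Passing the length explicitly is what makes printf safe for slices, which carry no terminator of their own.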


Or just a quick summary, like I posted earlier:
http://www.digitalmars.com/drn-bin/wwwnews?digitalmars.D/12609

 Q: What's the default boolean type in D ?
 A: bit.
 (bool is an "alias")
 
 Q: Is that really type-safe ?
 A: No.
 (just as in C99/C++)
 
 Q: What's the default string type in D ?
 A: char[].
 (since main() uses it)
 
 Q: Is that a single class ?
 A: No.
 (it's a primitive type)
 
 Q: Was this done by accident or by choice ?
 A: choice.
 (by Walter Bright)

 Q: Will this change before D version 1.0 ?
 A: No.
 (at least unlikely)


At least the String Wars and the Boolean Wars are *over*...
And it was char[]/wchar[]/dchar[] and bit/wbit/dbit that won.

--anders

Feb 20 2005

"Unknown W. Brackets" <unknown simplemachines.org> writes:

Forgive me, but isn't UCS-2 *essentially* 16-bit Unicode without the BOM 
and maybe a few other things?  I may be wrong, but I would think that, 
if you want that, you can just use dchar[] or even wchar[]...

I'm not saying that strings are or aren't necessary, but if I do this: 
(let's see if I can post unicode on this newsgroup...)

wchar[] test = "ウェブ全体から検索";

foreach (wchar c; test)
    writef("%s ", c);

You'll get one iteration for each character (there are nine, all of them 
in the BMP.)  Yes, this uses twice the memory, but it gives you the 
"character in full" you're asking for.  No replacement for a string 
class, and I'm not arguing either way on that, but foreach and [] 
(called opIndex, I believe, in D) work fine.
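
The one-iteration-per-character observation holds as long as every character is in the BMP. A minimal sketch, assuming a current D2 compiler, of how the iteration count changes once a non-BMP code point appears; unitsVisited and pointsVisited are hypothetical names for illustration:

```d
import std.stdio : writeln;

// Number of foreach iterations when the loop variable is a wchar.
size_t unitsVisited(const(wchar)[] s)
{
    size_t n;
    foreach (wchar c; s)   // one iteration per UTF-16 code unit
        ++n;
    return n;
}

// Number of foreach iterations when the loop variable is a dchar.
size_t pointsVisited(const(wchar)[] s)
{
    size_t n;
    foreach (dchar c; s)   // surrogate pairs are decoded into one code point
        ++n;
    return n;
}

void main()
{
    wstring bmp = "ウェブ全体から検索"w;   // nine BMP characters
    wstring astral = "a\U00010000"w;      // 'a' plus one non-BMP code point

    writeln(unitsVisited(bmp), " ", pointsVisited(bmp));       // prints "9 9"
    writeln(unitsVisited(astral), " ", pointsVisited(astral)); // prints "3 2"
}
```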

As for byte conversions, you can at least do that with Unicode (simple 
casting between byte[] and char[], etc.), and I'm sure iconv could be 
useful if you need charset conversion.

-[Unknown]


 I think that ucs-2 (unsigned word) as a string character whould be enough 
 for all active
 languages.

Feb 18 2005

"Andrew Fedoniouk" <news terrainformatica.com> writes:

Ok. Seems like I did not explain this clearly. Let's try again, then, from
a different point of view (this time more technical).

A UTF-16 sequence cannot be treated as a UCS-2 sequence (especially in D
with its built-in conversion). This is just technically wrong.

See:
ushort[] utf16string =
[
  0x0041,       // 'A' - Latin-1
  0x0020,       // ' ' - Latin-1
  0xD800,       // high-half zone part
  0xDC00,       // low-half zone part - value
  0xD800,       // high-half zone part
  0xDC01        // low-half zone part - value
];

This example text contains 4 coded characters. The first two are BMP (Basic
Multilingual Plane) characters, each coded with a single UCS-2 (BMP) code
value; the last two are non-BMP characters coded with two words each, a
high-half code and a low-half code. Translating this to UCS-4 code values
would produce the following:
uint[] ucs4string =
[
  0x00000041,   // 'A' // Latin-1
  0x00000020,   // ' ' // Latin-1
  0x00010000,   // hieroglyph foo
  0x00010001    // hieroglyph bar
];

What is the meaning of strlen() in the utf16string case? 4 or 6? D thinks
that utf16string is a sequence of wchars. I wouldn't say so. These are not
characters in the common sense but just parts of a sequence of 16-bit units.
You cannot treat them as characters, e.g. you cannot insert a new wchar at
position 3 of utf16string.
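
The "4 or 6" question can be checked directly. A minimal sketch, assuming a current D2 compiler and Phobos; combineSurrogates is a hypothetical helper that spells out the arithmetic behind the translation shown above:

```d
import std.utf : count;

// Combine a UTF-16 surrogate pair into the code point it encodes.
dchar combineSurrogates(wchar high, wchar low)
{
    assert(high >= 0xD800 && high <= 0xDBFF, "not a high surrogate");
    assert(low >= 0xDC00 && low <= 0xDFFF, "not a low surrogate");
    return cast(dchar)(0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00));
}

void main()
{
    // Same text as the example: U+0041, U+0020, U+10000, U+10001.
    wstring text = "A \U00010000\U00010001"w;

    assert(text.length == 6);     // .length counts UTF-16 code units
    assert(count(text) == 4);     // std.utf.count counts code points
    assert(text[2] == 0xD800 && text[3] == 0xDC00);
    assert(combineSurrogates(text[2], text[3]) == 0x10000);
    assert(combineSurrogates(text[4], text[5]) == 0x10001);
}
```

Both answers are therefore available; the question in the thread is which of them the built-in array length should report.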

Only dchar could be considered a real Unicode character (UCS-4).
But modern computers are not ready yet for UCS-4. Too much memory needed.

The practical solution is to use UCS-2 - two-byte UCS-2 characters.
(Again, UCS-2 is the BMP, http://www.unicode.org/roadmaps/bmp/ , and includes
all active languages that civilization now uses for writing.)

typedef wchar char2; // new type, ucs-2 codes
typedef char2[] string2; // brand new type, strict ucs-2 string

Conversion from UTF-16 wchar[] -> char2[] *must* interpret UTF-16 pairs
(0xD800,0xDC00) and produce *one* char2 code with the value '?' (or any
other code meaning "unsupported character"). Thus codes in the surrogate
range D800 - DFFF *must* not appear in a char2[] string.
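
The conversion rule described above could look roughly like this. It is only a sketch under stated assumptions: a current D2 compiler, plain wchar[] standing in for the proposed char2[] type, and toUcs2Lossy as a hypothetical name.

```d
// Replace every code point outside the BMP with '?', so that the result
// holds exactly one 16-bit value per code point and no surrogates.
wchar[] toUcs2Lossy(const(wchar)[] utf16)
{
    wchar[] result;
    foreach (dchar cp; utf16)        // decodes surrogate pairs
    {
        wchar unit = '?';            // placeholder for unsupported characters
        if (cp <= 0xFFFF)
            unit = cast(wchar) cp;   // BMP code point: fits in one unit
        result ~= unit;
    }
    return result;
}

void main()
{
    assert(toUcs2Lossy("a\U00010000b"w) == "a?b"w);
    assert(toUcs2Lossy("abc"w) == "abc"w);
}
```

The conversion cannot be reversed, and malformed input (for example a lone surrogate) makes the decoding foreach throw rather than produce '?'.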

Since D has built-in conversion routines, the list of character types
should look like this:

char    - element of a UTF-8 sequence. char[] - UTF-8 encoded Unicode sequence.
wchar   - element of a UTF-16 sequence. wchar[] - UTF-16 encoded Unicode sequence.
dchar   - UCS-4 character. Full Unicode character. dchar[] - UCS-4 string.
char2   - UCS-2 (BMP) character. Codes D800 - DBFF do not represent the start
          of a UTF-16 sequence - not expanded into UCS-4 by the system.
char2[] - UCS-2 string - sequence of characters.
          Could be manipulated arbitrarily, e.g. characters (char2) could
          be inserted or deleted at any given position.

Let me highlight again:

/////
/////   elements of utf sequence *are not* characters.
/////

So such functions as strchr(string,char) must be declared either as

int strchr(char1[], char1 c) // latin-1 string
--or--
int strchr(char2[], char2 c) // ucs-2 string and char
--or--
int strchr(char4[], char4 c) // ucs-4 string or 'dchar'

This message has one sole reason: to make D close to perfect.

Andrew Fedoniouk.
http://terrainformatica.com

Feb 18 2005

"Unknown W. Brackets" <unknown simplemachines.org> writes:

Yes, all true.  I know.  UCS-2 and UTF-16 are not exactly the same, but 
they are quite similar for many intents and purposes.

Again, you can get the conversion you want (Latin1 -> UCS-4, etc.) using 
iconv or similar.  Even if this was built in, it would have to be done 
using such a tool or a custom written one - it's not like it's an 
interrupt call or something :P.

And making a strlen that counted unique characters in a 
UTF-8/UTF-16/etc. string would be expensive performance-wise.  Instead 
of just giving the array's length, which is lightning quick and very 
possibly, imho, why D performs better with string usage, you'd end up 
traversing the entire string again looking for characters.  Yes, this 
length could be (I would hope!) cached by the class to improve the speed 
of sequential strpos's, substr's, etc.

But, if it had to traverse like that it would be so much better to use 
wchar, at least just for textual strings that might contain such 
characters, because then you could use the speedy method, instead of 
searching the whole string like C did.

Several other languages have these same problems: C, PHP, Perl, SQL, 
etc.  I'm quite sure most people who understand UTF-8 are aware that the 
number of bits divided by eight may or may not have anything to do with 
the actual length in characters of the string, though.  It's essential - 
and sometimes, you just have to know it.  Not everything can be 
abstracted to the point where you just type "do my homework" and hit 
compile...

Still, I don't think, personally, that using a whole bunch of char types 
would solve this.  That's several times uglier than a string class, 
and since a char array is just an array, there isn't really any clean 
way to override the .length of it... it'd have to be a class.  And 
anyway, you're ignoring ISO-8859-2, Shift_JIS, and similar encodings. 
Why should ISO-8859-1 (Latin1) be special?

Anyway, I can just see a "i18n_length(char[] x)" function.... because 
sometimes, you really just want the number of bytes, not characters.

-[Unknown]

Feb 19 2005

Derek <derek psych.ward> writes:

On Sat, 19 Feb 2005 01:14:46 -0800, Unknown W. Brackets wrote:


[snip]

 Anyway, I can just see a "i18n_length(char[] x)" function.... because 
 sometimes, you really just want the number of bytes, not characters.

I submit this sample code ...
<code>
module i18n;
private import std.utf;
debug(1) private import std.stdio;

uint i18n_length( char[] x)
{
    return toUTF32(x).length;
}

uint i18n_length( wchar[] x)
{
    return toUTF32(x).length;
}

uint i18n_length( dchar[] x)
{
    return x.length;
}

unittest
{
    char[] tchar;
    wchar[] twchar;
    dchar[] tdchar;
    
    
    tdchar ~= 0x00000041;   // 'A' // Latin-1
    tdchar ~= 0x00000020;   // ' ' // Latin-1
    tdchar ~= 0x00010000;   // hieroglyph foo
    tdchar ~= 0x00010001;   // hieroglyph bar
    
    twchar = toUTF16(tdchar);
    tchar  = toUTF8(tdchar);

    debug(1) {writefln("dchar.length = %d (%d)", i18n_length(tdchar),
tdchar.length); }    
    assert( i18n_length(tdchar) == 4);
    debug(1) {writefln("wchar.length = %d (%d)", i18n_length(twchar),
twchar.length); }    
    assert( i18n_length(twchar) == 4);
    debug(1) {writefln(" char.length = %d (%d)", i18n_length(tchar),
tchar.length); }    
    assert( i18n_length(tchar)  == 4);
}   

debug(2)
{
    void main()
    {
    }
} 

</code>

This can be compiled using "build i18n -debug=2" to generate the unittests
and then run i18n to run the unittests.

Of course, if you want to, you can create a doctored version of toUTFxx that
just counts code points rather than doing an actual conversion.
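
Counting without converting could be sketched like this, assuming a current D2 compiler and Phobos. codePointCount is a hypothetical name; std.utf.stride reports how many code units the sequence starting at a given index occupies, and Phobos also ships std.utf.count for the same job.

```d
import std.utf : stride;

// Count code points by hopping from one sequence start to the next,
// without allocating a converted copy of the string.
size_t codePointCount(C)(const(C)[] s)
{
    size_t n;
    for (size_t i = 0; i < s.length; i += stride(s, i))
        ++n;
    return n;
}

void main()
{
    assert(codePointCount("A \U00010000\U00010001") == 4);    // 10 UTF-8 units
    assert(codePointCount("A \U00010000\U00010001"w) == 4);   // 6 UTF-16 units
    assert(codePointCount("A \U00010000\U00010001"d) == 4);   // 4 UTF-32 units
}
```

The walk is still linear in the length of the string; it only avoids the temporary dchar[] that the toUTF32 version allocates.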

-- 
Derek
Melbourne, Australia

Feb 19 2005

"Andrew Fedoniouk" <news terrainformatica.com> writes:

"you're ignoring ISO-8859-2, Shift_JIS, and similar encodings."

Where I am ignoring them?

"Still, I don't think, personally, using a whole bunch of char types ...."

In fact I am not proposing new top level character types.

My point is simple:

'string' as an entity (or class) is different from wchar[] - a sequence of
UTF-16 code units - in the following terms:

class string  // string which supports only ucs-2 code points
{
    typedef wchar char2; // ucs-2 code points only.
    char2[] chars;

    this( wchar[] utf16 )
    {
        // thanks to Ben Hinkle
        foreach(dchar cp; utf16)
        {
            if( cp > 0xFFFF )
                chars ~= cast(char2) '?'; // ignorabimus et ignorabus
            else
                chars ~= cast(char2) cp;
        }
    }

    int length() {  return chars.length;  }
            // as chars ALWAYS contains code points.

    void set(int pos, wchar wc)
    {
        if( wc >= 0xD800 && wc <= 0xDFFF )
            throw new Exception("invalid ucs-2 code point");
        else
            chars[pos] = cast(char2) wc;
    }

}

AFAIK this approach is used in java.lang.String.

I think that existing names of entities in D are
misleading.

'char' in fact is not a character but element of UTF-8 sequence - ubyte.
'wchar' in fact is not a "wide" character but element of UTF-16 sequence - 
ushort.
and only 'dchar' has meaning of character.

Keeping this in mind declaration like

wchar a;

is technical nonsense. The way it is implemented now and treated by D,
wchar (and char) can be used *ONLY* as members of arrays (in a sequence).

Feb 19 2005

Thomas Kühne writes:

Andrew Fedoniouk wrote:
| I think that existing names of entities in D are misleading.
|
| 'char' in fact is not a character but element of UTF-8 sequence -
| ubyte.
| 'wchar' in fact is not a "wide" character but element of
| UTF-16 sequence - ushort. and only 'dchar' has meaning of character.

'dchar' is no _character_, it represents a _codepoint_.

While codepoints are interesting for some cases you are much more likely to
a) treat strings as void[]/byte[]/ubyte[] (most cases)
b) or are interested in graphemes (display/text editing)

http://www.unicode.org/faq/char_combmark.html
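
The difference between the three levels mentioned here (code units, code points, graphemes) can be sketched as follows, assuming a current D2 compiler and Phobos, where std.uni.byGrapheme is available; graphemeCount is a hypothetical name:

```d
import std.range : walkLength;
import std.uni : byGrapheme;
import std.utf : count;

// Number of grapheme clusters, i.e. user-perceived characters.
size_t graphemeCount(string s)
{
    return s.byGrapheme.walkLength;
}

void main()
{
    string s = "e\u0301";   // 'e' followed by U+0301 COMBINING ACUTE ACCENT

    assert(s.length == 3);            // UTF-8 code units
    assert(count(s) == 2);            // code points
    assert(graphemeCount(s) == 1);    // what a reader sees as one character
}
```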

Hint: search the digitalmars.D newsgroup archive before posting any more
about strings/*chars.

Thomas

Feb 19 2005

"Andrew Fedoniouk" <news terrainformatica.com> writes:

According to
http://www.prowiki.org/wiki4d/wiki.cgi?UnicodeIssues
which is based on "digitalmars.D newsgroup archive" I believe,
D's 'char' and 'wchar' are not 'characters' as their names state but
rather "code units". Right?

And about "code point": in Unicode terms, a code point is a number between 0
and 0x10FFFF.
To represent these codes (or Unicode character indexes) from this range you
may use either uint8 (for Latin-1 code points), uint16 (Basic Multilingual
Plane codes) or uint21 (full Unicode range).

Some code values in the range from 0 to 0x10FFFF are illegal.
E.g. cast(dchar)0xD800 should raise an error in an ideal 'D' world.

If D wants to treat its strings in UTF8 or UTF16 form it should
provide methods recommended by W3C:
http://www.w3.org/TR/DOM-Level-2-Core/i18n.html

I think that ideally
D.char, D.wchar and D.dchar should be treated as code point value storage 
types and not as code units.
This will give some meaning to these type names at least.

String literals should have the type 'utf8', like this:
typedef ubyte[] utf8;

Intrinsic conversion routines like:
wchar[] str = "Привет мир"; // utf8  ("Hello World" in Russian)
should create str as a sequence of code points, substituting values that
wchar does not support with, let's say, 0xFFFF.

The same rule should apply to
char[] str = "Привет мир"; // utf8
(in this case str will contain ten 0xFF as these are not Latin-1 codes)

Andrew Fedoniouk.
http://terrainformatica.com

Feb 19 2005

Thomas Kühne writes:

Andrew Fedoniouk wrote:

| According to http://www.prowiki.org/wiki4d/wiki.cgi?UnicodeIssues
| which is based on "digitalmars.D newsgroup archive" I believe, D's
| 'char' and 'wchar' are not 'characters' as their names state but
| rather "code units". Right?
|
| And about "code point": in terms of UNICODE code point is a number
| between 0 and 0x10FFFF. To represent this codes (or unicode character
| indexes) from this range you may use either uint8 (for Latin-1 code
| points)
UTF-8 supports all code points;
depending on the value of the code point, 1 - 4 chars are required.

| or uint16 (Basic Multilang Plane  codes)  or uint21 (full UNICODE
| range).
UTF-16 supports all code points;
depending on the value of the code point, 1 - 2 wchars are required.

| Some code values from 0 and 0x10FFFF range are illeagal. E.g.
| cast(dchar)0xD800 should rise an error in ideal 'D' world.
The code point 0xD800 isn't simply unassigned: it is reserved as a surrogate
and will never be assigned a character in any future Unicode version.
The uint16 0xD800 on its own is illegal, as it is only one half of a UTF-16
surrogate pair.

| If D wants to treat its strings in UTF8 or UTF16 form it should
| provide methods recommended by W3C:
| http://www.w3.org/TR/DOM-Level-2-Core/i18n.html
findOffset8/16/32 are very simple functions.

I'm sure that there is at least one project at dsource.org providing
this functionality.

Thomas

Feb 20 2005

Anders F Björklund <afb algonet.se> writes:

Andrew Fedoniouk wrote:

 According to
 http://www.prowiki.org/wiki4d/wiki.cgi?UnicodeIssues
 which is based on "digitalmars.D newsgroup archive" I believe,
 D's 'char' and 'wchar' are not 'characters' as their names state but
 rather "code units". Right?

Right, if you want to get all technical about it at once. :-)

However, "char" is still a perfectly good *ASCII* character.
It's just that the-high-bit-set is now defined, unlike in C...

And "wchar" is also *usually* a character (BMP), just like "char"
was in Java for a number of years... (they're now using int instead:
http://java.sun.com/j2se/1.5.0/docs/api/java/lang/Character.html,
which means that D wchar = Java char, D dchar = Java int nowadays)

So they are still "characters" ? Just that there are "exceptions"
(being the surrogate code units, referring to next unit in array)
And as long as you watch out for these, it's perfectly OK to use
them as good-old-fashioned characters (and it could be faster, too)

 And about "code point": in terms of UNICODE code point is a number between 0 
 and 0x10FFFF.
 To represent this codes (or unicode character indexes) from this range you 
 may use
 either uint8 (for Latin-1 code points) or uint16 (Basic Multilang Plane 
 codes)  or uint21 (full UNICODE range).
 
 Some code values from 0 and 0x10FFFF range are illeagal.
 E.g. cast(dchar)0xD800 should rise an error in ideal 'D' world.

For reasons of efficiency, D does not check all values upon assignment.
You must instead call the Phobos helper function: std.utf.isValidDchar
Note that "char" only holds ASCII in D, wchar must be used for Latin-1.
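
A minimal sketch of that check, assuming a current D2 compiler and Phobos, where std.utf.isValidDchar is still available:

```d
import std.utf : isValidDchar;

void main()
{
    // The cast compiles and runs: values are not checked on assignment.
    dchar surrogate = cast(dchar) 0xD800;

    assert(isValidDchar('A'));
    assert(isValidDchar(cast(dchar) 0x10FFFF));    // last Unicode code point
    assert(!isValidDchar(surrogate));              // inside the surrogate range
    assert(!isValidDchar(cast(dchar) 0x110000));   // beyond the Unicode range
}
```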

I suggested adding the new functions isAscii and isSurrogate too,
but it was ignored. (They're all copied and pasted at the moment)
http://www.digitalmars.com/d/archives/digitalmars/D/bugs/2154.html

 Intrinsic conversion routines like:
 wchar[] str = "?????? ???"; // utf8  ("Hello World" in Russian)
 should create str as sequence of codepoints with substitution of unsupported 
 values for wchar with lets say 0xFFFF.

Substituting all surrogates with invalid characters will *lose data*.
That is clearly not good, and using UTF-8 sounds like a better idea ?
If you want single-codeunit strings, you can search/replace yourself.

In the example above, the string literal will be converted to UTF-16.
(as in: the actual literal data, it will also be '\0'-escaped for C)

 The same rule should apply to
 char[] str = "?????? ???"; // utf8
 (in this case str will contain ten 0xFF as these are not Latin-1 codes)

You can use ubyte[] for storing 8-bit encodings (such as Latin-1, etc.)

Using char[] will give "invalid UTF sequence" when encountering high 
bytes, although the first 0x100 characters are the same in both "sets",
that is ISO-8859-1 and UTF-8. But only 0x80 will fit in a single "char".

Note that (char*) is still used for NUL-terminated 8-bit strings too!
This is mostly for making it much simpler to use external C functions,
which is the same reason why all D string literals are NUL-terminated.

--anders

Feb 20 2005

Thomas Kühne <thomas-dloop kuehne.THISISSPAM.cn> writes:

Anders F Björklund wrote:

| For reasons of efficiency, D does not check
| all values upon assignment. You must instead call the Phobos helper
| function: std.utf.isValidDchar Note that "char" only holds ASCII in
| D, wchar must be used for Latin-1.

clarification

char:
can only hold 0x00 -> 0x80, otherwise it's an illegal UTF-8 fragment

char[]/char*
can hold any Unicode codepoint/codepoint sequence

Thomas

Feb 20 2005

Anders F Björklund <afb algonet.se> writes:

Thomas Kühne wrote:

 | Note that "char" only holds ASCII in
 | D, wchar must be used for Latin-1.
 
 clarification
 
 char:
 can only hold 0x00 -> 0x80, otherwise it's an illegal UTF-8 fragment

Yes, that's what I said :-) (not my fault char[] sounds a lot like char)
And that should probably be 0x00-0x7F, or 0x00..0x80 in exclusive style?

We mean the same thing, the 7-bit ASCII subset of ISO-8859-1 and UTF-8.
(as in the table: http://www.algonet.se/~afb/d/latin1/iso-8859-1.html)

    TYPE        ALIAS     // RANGE
    char        utf8_t    // \x00-\x7F (ASCII)
   wchar       utf16_t    // \u0000-\uD7FF, \uE000-\uFFFF
   dchar       utf32_t    // \U00000000-\U0010FFFF (Unicode)

66 codepoints are invalid "noncharacters", but that's beside the point.


The code unit arrays, char[]/wchar[]/dchar[] can all hold any UTF string
But only "dchar" is fully standalone for all different codepoint values.
This does not stop "char" and "wchar" from being useful for loops and 
other special uses, as long as the limitations are accounted for.

--anders

Feb 20 2005

Anders F Björklund <afb algonet.se> writes:

Andrew Fedoniouk wrote:

 Ok. Seems like I did not explain this clearly. Let's try again then from 
 different point of view (this time more technical).

[...]
 What is the meaning of strlen() in the utf16string case? 4 or 6? D thinks
 that utf16string is a sequence of wchars. I wouldn't say so. These are not
 characters in the common sense but just parts of a sequence of 16-bit units.
 You cannot treat them as characters, e.g. you cannot insert a new wchar at
 position 3 of utf16string.

Length in D counts code units. Always. (but yes, an array insert
operation only gives useful results when there's no surrogates)

As has been said, counting code units is a lot faster than counting code points.

 Only dchar could be considered as a real UNICODE character (UCS-4).
 But modern computers are not ready yet for UCS-4. Too much memory needed.

dchar is quite alright for use in parameters and such, since the
registers are 32-bit wide anyway. For string storage, I agree...

UTF-32 wastes too much space, and UTF-16 or even UTF-8 is better.

 As soon as D has built-in conversion routines then list of character types
 should look like as:
 
 char    - element of utf8 sequence. char[] - utf8 encoded unicode sequence.
 wchar - element of utf16 sequence. wchar[] - utf16 encoded unicode sequence.
 dchar  - ucs-4 character. full unicode character. dchar[] - ucs-4 string.
 char2  - ucs-2 (BMP) character. codes D800 - DBFF do not represent start of
              UTF16 sequence - do not expand into ucs-4 by system.
 char2[] - ucs-2 string - sequence of characters.
         Could be manipulated arbitrarye.g. characters (char2) could
         be inserted or deleted at any given position.

As you've discovered, D "only" concerns itself with UTF code units...
(dchar is of the UTF-32 subset instead of the full ucs-4, but anyway)
This means that if you want to handle arrays of Latin-1 characters
or arrays of BMP characters, you can not use the "character" types.

However, you are free to use the ubyte and ushort types to represent
those types of strings (that are still Unicode, encoded differently)
But there is really not much use of introducing two new types just
to represent those two special cases of the more general UTF ones ?

For ASCII (only), char[] and ubyte[] with ISO-8859-1 would be the same.
Just as for non-surrogates (only), wchar[] and ushort[] are identical.
But the latter two types would be unable to handle higher code points.

Converting between the two is trivial, but there could be a loss of
data when going from char[] -> ubyte[], or from wchar[] -> ushort[]
(e.g. if replacing any surrogates with something like \xFF or \uFFFF)

And I think it's better to go with the lossless format, than to
support the rare operation of indexing individual codepoints...
(and in case you need to do this often, there's still dchar[])

 Let me highlight again:
 
 /////
 /////   elements of utf sequence *are not* characters.
 /////
 
 So such functions as strchr(string,char) must be declared either as
 
 int strchr(char1[], char1 c) // latin-1 string
 --or--
 int strchr(char2[], char2 c) // ucs-2 string and char
 --or--
 int strchr(char4[], char4 c) // ucs-4 string or 'dchar'
 
 This message has one sole reason: to make D close to perfect.

int strchr(char[], dchar c) would also work...
(would return the *start* of 1-4 code units)
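
Such a search could be sketched like this, assuming a current D2 compiler; findChar is a hypothetical name chosen to avoid clashing with C's strchr. When foreach decodes a char[] into dchar, the index variable is the code-unit offset at which each code point starts.

```d
// Index of the first code unit of `c` in `s`, or -1 if `c` does not occur.
ptrdiff_t findChar(const(char)[] s, dchar c)
{
    foreach (size_t i, dchar cp; s)   // i: code unit index where cp starts
        if (cp == c)
            return cast(ptrdiff_t) i;
    return -1;
}

void main()
{
    // 'a' (1 unit), U+00E9 (2 units), U+10000 (4 units), 'z' (1 unit)
    string s = "a\u00E9\U00010000z";

    assert(findChar(s, '\U00010000') == 3);
    assert(findChar(s, 'z') == 7);
    assert(findChar(s, 'x') == -1);
}
```

The returned offset can be used directly for slicing, since it always falls on the start of a complete sequence.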

--anders

Feb 20 2005

Ben Hinkle <Ben_member pathlink.com> writes:

String as an entity is a sequence of "code points" - ascii, ucs-2(basic 
multilang plane)
and ucs-4 so operator[] always returns character in full (for the given 
supported plane).
The same should apply to foreach().

foreach already iterates over code points. Try something like
char[] str = ...;  // some non-ASCII string
foreach (int n, dchar cp; str) {
    // cp is the code point that starts at code unit index n of str
}

-Ben

Feb 19 2005

Anders F Björklund <afb algonet.se> writes:

Andrew Fedoniouk wrote:

 Is there any string class for the D?

There is no built-in (Phobos) class, as reasoned in:
http://www.digitalmars.com/d/cppstrings.html

However, there are at least two 3rd-party ones:
http://dool.sourceforge.net/dool_String_String.html
http://svn.dsource.org/svn/projects/mango/trunk/doc/html/classUString.html

I'm not sure having a default *class* in a hybrid
language is such a great idea in the first place ?
(then again, Exceptions are classes and default...)

 Or are there any plans to create string for D?

As a built-in value type ? No, that will not happen.

Although, there are three good alternatives already...
(the famous: str, wstr, dstr as I prefer to call them)

 char[], dchar[] and qchar[] cannot serve string purposes as they
 use utf encodings which are "transport" encodings and cannot be used
 in most cases as strings.

This is not true. All of UTF-8, UTF-16 and UTF-32 can
be used for storing an array of Unicode code points...

Just that some code points require more than just one
code unit, just as one "grapheme" might require more
than just one "code point" anyway when using Unicode.


 String as an entity is a sequence of "code points" - ascii, ucs-2(basic 
 multilang plane)
 and ucs-4 so operator[] always returns character in full (for the given 
 supported plane).
 The same should apply to foreach().

You can "foreach dchar", over all three string types.
If you want to index by code point, you will need to
convert the two smaller code units to UTF-32 first...
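
Indexing by code point after a one-time conversion could be sketched as follows, assuming a current D2 compiler and Phobos (std.utf.toUTF32):

```d
import std.utf : toUTF32;

void main()
{
    // 'a' (1 unit), U+00E9 (2 units), U+10000 (4 units), 'z' (1 unit)
    string s = "a\u00E9\U00010000z";
    dstring d = toUTF32(s);      // one element per code point

    assert(s.length == 8);       // UTF-8 code units
    assert(d.length == 4);       // code points
    assert(d[2] == 0x10000);     // constant-time indexing by code point
    assert(d[3] == 'z');
}
```

The price is a full copy at four bytes per code point, which is the trade-off discussed earlier in the thread.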

--anders

Feb 20 2005

Ben Hinkle <Ben_member pathlink.com> writes:

 String as an entity is a sequence of "code points" - ascii, ucs-2(basic 
 multilang plane)
 and ucs-4 so operator[] always returns character in full (for the given 
 supported plane).
 The same should apply to foreach().

You can "foreach dchar", over all three string types.
If you want to index by code point, you will need to
convert the two smaller code units to UTF-32 first...

A while ago I posted some tiny helper functions to do on-the-fly character
indexing, but I can't find them so I'll just post them again in case the OP
finds them useful:
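
The helper functions themselves do not appear in this copy of the post. As an independent illustration of on-the-fly character indexing, and not a reconstruction of Ben's code, here is a minimal sketch assuming a current D2 compiler and Phobos; codePointAt is a hypothetical name.

```d
import std.utf : decode, stride;

// Return the n-th code point of `s` (counting from 0) by walking from
// the start. No conversion or allocation, but O(n) per lookup.
// `n` must be smaller than the number of code points in `s`.
dchar codePointAt(const(char)[] s, size_t n)
{
    size_t i = 0;
    for (; n > 0; --n)
        i += stride(s, i);    // skip one whole code point
    return decode(s, i);      // decode the sequence starting at index i
}

void main()
{
    string s = "a\u00E9\U00010000z";

    assert(codePointAt(s, 0) == 'a');
    assert(codePointAt(s, 2) == 0x10000);
    assert(codePointAt(s, 3) == 'z');
}
```

For repeated random access the UTF-32 conversion is the better fit; a walk like this suits the occasional lookup.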

-Ben

Feb 20 2005
