
digitalmars.D - Unified String Theory..

reply "Regan Heath" <regan netwin.co.nz> writes:
With the recent Physics slant on some posts here I couldn't resist that  
subject, in actual fact this is an idea for string handling in D which I  
have cooked up recently.

I am going to paste the text here and attach my original document, the  
document may be easier to read than the NG.

I like this idea, it may however be too much of a change for D, I'm hoping  
the advantages outweigh this fact but I'm not going to hold my breath.

It is possible I have missed something obvious and/or am talking out of a  
hole in my head, if that is the case I would appreciate being told so,  
politely ;)

Enough rambling, here it is, be nice!

-----

Proposal: A single unified string type.
Author  : Regan Heath
Version : 1.0a
Date    : 24 Nov 2005 +1300 (New Zealand DST)

[Preamble/Introduction]
After the recent discussion on Unicode, UTF encodings and the current D  
situation it occurred to me that many of the issues D has with strings  
could be side-stepped if there was a single string type.

In the past we have assumed that to obtain this we have to choose one of  
the 3 available types and encodings. This wasn't an attractive option  
because each type has different pros/cons and each application may prefer  
one type over another. Another suggested solution was a string class which  
hides the details; this solution suffers from being a class, with the  
limitations that imposes, and from not being tied directly into the language.

My proposal is a single "string" type built into the language, which can  
represent its string data in any given UTF encoding; which allows slicing  
of "characters" as opposed to what is essentially bytes, shorts, and ints;  
whose default encoding can be selected at compile time, or specified at  
runtime; and which will implicitly or explicitly transcode where required.

There are some requirements for this to be possible, namely knowledge of  
the UTF encodings being built into D. These requirements may make the  
proposal less favourable as they increase the knowledge required to write  
a D compiler. However it occurs to me that DMD, and thus D, already  
requires a fair bit of UTF knowledge.


[Key]
First, let's start with some terminology. These are the terms I am going to  
be using and what they mean; if these are incorrect please correct me, but  
take them to have the stated meanings for this document. (A short code  
example follows the definitions.)

code point      := the unicode value for a single and complete character.
code unit       := part of, or a complete, character in one of the 3 UTF  
encodings (UTF-8, UTF-16, UTF-32).
code value      := AKA code unit.
transcoding     := the process of converting from one encoding to another.
source          := a file, the keyboard, a tcp socket, a com port, an OS/C  
function call, a 3rd party library.
sink            := a file, the screen, a tcp socket, a com port, an OS/C  
function call, a 3rd party library.
native encoding := the application-specific "preferred" encoding (more on  
this later).
string          := a sequence of code points.

Anything I am unsure about will be suffixed with (x) where x is a letter  
of the alphabet, and my thoughts will be detailed in the [Questions]  
section.
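
To make the code point / code unit distinction concrete, here is a small  
example using the char[], wchar[] and dchar[] types we have today (not the  
proposed "string" type): the single code point U+00E9 ('é') is one  
character, but a different number of code units in each encoding.

void main()
{
    char[]  u8  = "\u00E9";   // UTF-8 : two code units (0xC3, 0xA9)
    wchar[] u16 = "\u00E9";   // UTF-16: one code unit
    dchar[] u32 = "\u00E9";   // UTF-32: one code unit, equal to the code point
    assert(u8.length == 2 && u16.length == 1 && u32.length == 1);
}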


[Assumptions]
These are what I base my argument/suggestion on; if you disagree with any  
of these you will likely disagree with the proposal. If that is the case  
please post your concerns with any given assumption in its own post (I  
would like to discuss each issue in its own thread and avoid mixing  
several issues).

* any code point, and thus any string, can be transcoded to/from any UTF  
encoding with no loss of data/meaning.

* transcoding is not free; I will try to mention the possible runtime  
penalty wherever appropriate.

* every application performs input and output. Input is the process of  
obtaining data from a source. Output is the process of sending data to a  
sink. In either case the source or sink will have a fixed encoding, and if  
that encoding does not match the native encoding the application will need  
to transcode. (see definitions above for what classifies as a source or  
sink)


[Details]
Many of the details are flexible, i.e. the names of the types etc; the  
important/inflexible details are how it all fits together and achieves  
its results. I've chosen a bullet point format and tried to make each  
change/point as succinct and clear as possible. Feel free to ask for  
clarification on any point or points. Or to ask general questions. Or to  
pose general problems. I will do my best to answer all questions.

* remove char[], wchar[] and dchar[].

* add a new type "string". "string" will store code points in the  
application specific native encoding and be implicitly or explicitly  
transcoded as required (more below).

* the application-specific native encoding will default to UTF-8. An  
application can choose another with a compile option or pragma; this  
choice will have no effect on the behaviour of the program (as we only  
have 1 type and all transcoding is handled where required), it will only  
affect performance.

The performance cost cannot be avoided, presuming it is only being done at  
input and output (which is part of what this proposal aims to achieve).  
This cost is application specific and will depend on the tasks and data  
the application is designed to perform and use.

Given that, letting the programmer choose a native encoding will allow  
them to test different encodings for speed and/or provide different builds  
based on the target language, eg an application destined to be used with  
the Japanese language would likely benefit from using UTF-32  
internally/natively.

* keep char, wchar, and dchar but rename them utf8, utf16, utf32. These  
types represent code points (always, not code units/values) in each  
encoding. Only code points that fit in utf8 will ever be represented by  
utf8, and so on. Thus some code points will always be utf32 values and  
never utf8 or 16. (much like byte/short/int)

* add promotion/comparison rules for utf8, 16 and 32:

- any given code point represented as utf8 will compare equal to the same  
code point represented as a utf16 or utf32, and vice versa(a)

- any given code point represented as utf8 will be implicitly  
converted/promoted to the same code point represented as utf16 or utf32 as  
required, and vice versa(a). If promotion from utf32 to utf16 or 8 causes  
loss of data it should be handled just like int to short or byte.

* add a new type/alias "utf"; this would alias utf8, 16 or 32. It  
represents the application specific native encoding. This allows efficient  
code, like:

string s = "test";
foreach(utf c; s) {
}

regardless of the application's selected native encoding.
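
A minimal sketch of how that selection might be wired up: the version  
identifiers here are hypothetical (set by whatever compile option picks the  
native encoding), and today's char/wchar/dchar stand in for the proposed  
utf8/utf16/utf32 types.

version (NativeUTF16)      alias wchar utf;   // native encoding is UTF-16
else version (NativeUTF32) alias dchar utf;   // native encoding is UTF-32
else                       alias char  utf;   // default: UTF-8

With something like that in place the foreach above compiles unchanged  
whichever native encoding the build selects.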

* slicing string gives another string

* indexing a string gives a utf8, 16, or 32 code point.

* string literals would be of type "string" encoded in the native  
encoding, or if another encoding can be determined at compile time, in  
that encoding (see ASCII example below).

* character literals would default to the native encoding, failing that  
the smallest possible type, and promoted/converted as required.

* there are occasions where you may want to use a specific encoding for a  
part of your application; perhaps you're loading a UTF-16 file and parsing  
it. If all the work is done in a small section of code and it doesn't  
interact with the bulk of your application data, which is all in UTF-8,  
then your native encoding is likely to be UTF-8.

In this case, for performance reasons, you want to be able to specify the  
encoding to use for your "string" types at runtime; they are exceptions to  
the native encoding. To do this we specify the encoding at  
construction/declaration time, eg.

string s(UTF16);
s.utf16 = ..data read from UTF-16 source..

(or similar, the exact syntax is not important at this stage)

thus...

* the type of encoding used by "string" should be selectable at runtime,  
so some sort of encoding type flag must exist for each string at runtime.  
This is starting to head into "implementation details" which I want to  
avoid at this point, however it is important to note the requirement.
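
A rough sketch of what that per-string tag might look like (the names here  
are illustrative only, not a proposed implementation):

enum Encoding { utf8, utf16, utf32 }

struct String
{
    Encoding enc;    // which encoding 'data' currently uses, set at runtime
    void[]   data;   // the code units, interpreted according to 'enc'
}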


[Output]
* the type "char" will still exist, it will now _only_ represent a C  
string, thus when a string is passed as a char it can be implicitly  
transcoded into ASCII(b) with a null terminator, eg.

int strcmp(const char *src, const char *dst);

string test = "this is a test";
if (strcmp(test,"this is a test")==0) { }

the above will implicitly transcode 'test' into ASCII and ensure there is  
a null terminator. The literal "this is a test" will likely be stored in  
the binary as ASCII with a null terminator.
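
For comparison, here is roughly what that call requires in today's D, with  
the conversion written out explicitly (strcmp is declared by hand for the  
example); under the proposal the string-to-char* step would be implicit.

import std.string;                        // toStringz
extern (C) int strcmp(char* s1, char* s2);

void main()
{
    char[] test = "this is a test";
    if (strcmp(toStringz(test), "this is a test") == 0) { }
}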

* Native OS functions requiring "char" will use the rule above. eg.

CreateFileA(char *filename...

* Native OS functions requiring unicode will be defined as:

CreateFileW(utf16 *filename...

and "string" will be implicitly transcoded to utf16, with a null  
terminator added..

* When the required encoding is not apparent, eg.

void CreateFile(char *data) { }
void CreateFile(utf16 *data) { }

string test = "this is a test";
CreateFile(test);

an explicit property should be used, eg.

CreateFile(test.char);
CreateFile(test.utf16);

NOTE: this problem still exists! It should however now be relegated to  
interaction with C APIs as opposed to happening for native D methods.


[Input]
* Old encodings, Latin-1 etc. would be loaded into ubyte[] or byte[] and  
could be cast (painted) to char*, utf8*, 16 or 32, or converted to "string"  
using a routine, e.g. string toStringFromXXX(ubyte[] raw).

* A stream class would have a selectable encoding and hide these details  
from us, handling the data and giving us a natively encoded "string"  
instead. Meaning transcoding will naturally occur on input or output where  
required.
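
Phobos can already do the transcoding step itself; a sketch of the boundary  
conversion using std.utf, assuming UTF-8 is the native encoding (toNative  
is just an illustrative helper name):

import std.utf;   // toUTF8 / toUTF16 / toUTF32

// data read from a UTF-16 source is converted once, at the input
// boundary, into the application's native encoding
char[] toNative(wchar[] utf16Data)
{
    return toUTF8(utf16Data);
}

Under the proposal the stream class would hide exactly this call.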


[Example application types and the effect of this change]

* the quick and dirty console app which handles ASCII only. Its native  
encoding will be UTF-8, and no transcoding will ever need to occur  
(assuming none of its input or output is in another encoding)

* an app which loads files in different encodings and needs to process  
them efficiently. In this case the code can select the encoding of  
"string" at runtime and avoid transcoding the data until such time as it  
needs to interface with another part of the application in another  
encoding or it needs to output to a sink, also in another encoding.

* an international app which will handle many languages. This app can be  
custom built with the native string type selected to match each language.


[Advantages]
As I see it, this change would have the following advantages:

* "string" requires no knowledge of UTF encodings (and the associated  
problems) to use making it easy for begginners and for a quick and dirty  
program.

* "string" can be sliced/indexed by character regardless of the encoding  
used for the data.

* overload resolution has only 1 type, not 3 to choose from.

* code written in D would all use the same type "string". No more "this  
library uses char[], that one wchar[], and my app dchar[]" problems.


[Disadvantages]
* requirements listed below

* libraries built for a different native type will likely cause  
transcoding. This problem already exists; at least with this suggestion  
the library can be built 3 times, once for each native encoding, and the  
correct one linked to your app.

* possibility of implicit and silent transcoding. This can occur between  
libraries built with different native encodings and between "string" and  
char*, utf8*, utf16* and utf32*, the compiler _could_ identify all such  
locations if desired.


[Requirements]
In order to implement all this "string" requires knowledge of all code  
points, how they are encoded in the 3 encodings and how to compare and  
convert between them. So, D and thus any D compiler eg DMD, requires this  
knowledge. I am not entirely sure just how big an "ask" this is. I believe  
DMD and thus D already has much of this capability built in.


[Questions]
(a) Is UTF-8 a subset of UTF-16 and so on? Does the code point for 'A' have  
the numerical value 65 decimal in UTF-8, UTF-16 _and_ UTF-32, in other  
words is it the same numerical value in all encodings? If so then  
comparing utf8, 16 and 32 is no different to comparing byte, short and int  
and all the same promotion and comparison rules can apply.

(b) Is this really ASCII or is it system dependent? i.e. Latin-1 or  
similar. Is it ASCII values 127 or less perhaps? To be honest I'm not sure.
Nov 23 2005
next sibling parent reply Derek Parnell <derek psych.ward> writes:
On Thu, 24 Nov 2005 16:09:13 +1300, Regan Heath wrote:

 With the recent Physics slant on some posts here I couldn't resist that  
 subject
LOL
 Enough rambling, here is it, be nice!
Just some quick thoughts are recorded here. More will come later I suspect. [snip]
 [Key]
 First, lets start with some terminology, these are the terms I am going to  
 be using and what they mean, if these are incorrect please correct me, but  
 take them to have the stated meanings for this document.
 
 code point      := the unicode value for a single and complete character.
 code unit       := part of, or a complete character in one of the 3 UTF  
 encodings UTF-8,16,32.
 code value      := AKA code unit.
The Unicode Consortium defines code value as the smallest (in terms of bits) value that will hold a character in the various encoding formats. Thus for UTF8 it is 1 byte, UTF16 = 2 bytes, and UTF32 = 4 bytes. [snip]
 * remove char[], wchar[] and dchar[].
Do we still have to cater for strings that were formatted in specific encodings outside of our D applications? For example, a C library routine might insist that a pointer to a UTF16 string be supplied, thus we would have to force a specific encoding somehow.
 * add a new type "string". "string" will store code points in the  
 application specific native encoding and be implicitly or explicitly  
 transcoded as required (more below).
 
 * the application specific native encoding will default to UTF-8. An  
 application can choose another with a compile option or pragma, this  
 choice will have no effect on the behaviour of the program (as we only  
 have 1 type and all transcoding is handled where required) it will only  
 affect performance.
 
 The performance cost cannot be avoided, presuming it is only being done at  
 input and output (which is part of what this proposal aims to achieve).  
 This cost is application specific and will depend on the tasks and data  
 the application is designed to perform and use.
 
 Given that, letting the programmer choose a native encoding will allow  
 them to test different encodings for speed and/or provide different builds  
 based on the target language, eg an application destined to be used with  
 the Japanese language would likely benefit from using UTF-32  
 internally/natively.
 
 * keep char, wchar, and dchar but rename them utf8, utf16, utf32. These  
 types represent code points (always, not a code units/values) in each  
 encoding. Only code points that fit in utf8 will ever be represented by  
 utf8, and so on. Thus some code points will always be utf32 values and  
 never utf8 or 16. (much like byte/short/int)
I think you've lost track of your 'code point' definition. A 'code point' is a character. All encodings can hold all characters, every character will fit into UTF8. Sure some might take 1, 2 or 4 'code values', but they are still all code points. There are no exclusive code points in utf32. Every UTF32 code point can also be expressed in UTF8.
 * add promotion/comparrison rules for utf8, 16 and 32:
 
 - any given code point represented as utf8 will compare equal to the same  
 code point represented as a utf16 or utf32 and vice versa(a)
 
 - any given code point represented as utf8 will be implicitly  
 converted/promoted to the same code point represented as utf16 or utf32 as  
 required and vice versa(a). If promotion from utf32 to utf16 or 8 causes  
 loss in data it should be handled just like int to short or byte.
I assume by 'promotion' you really mean 'transcoding'. There is never any assumption.
 * add a new type/alias "utf", this would alias utf8, 16 or 32. It  
 represents the application specific native encoding. This allows efficient  
 code, like:
 
 string s = "test";
 foreach(utf c; s) {
 }
But utf8, utf16, and utf32 are *strings* not characters, so 'utf' could not be an *alias* for these in your example. I guess you mean it to be a term for a character (code point) in a utf string.
 regardless of the applications selected native encoding.
 
 * slicing string gives another string
 
 * indexing a string gives a utf8, 16, or 32 code point.
 
 * string literals would be of type "string" encoded in the native  
 encoding, or if another encoding can be determined at compile time, in  
 that encoding (see ASCII example below).
 
 * character literals would default to the native encoding, failing that  
 the smallest possible type, and promoted/converted as required.
By 'smallest possible type' do you mean the smallest memory usage?
 * there are occasions where you may want to use a specific encoding for a  
 part of your application, perhaps you're loading a UTF-16 file and parsing  
 it. If all the work is done in a small section of code and it doesn't  
 interact with the bulk of your application data which is all in UTF-8 your  
 native encoding it likely to be UTF-8.
 
 In this case, for performance reasons, you want to be able to specify the  
 encoding to use for your "string" types at runtime, they are exceptions to  
 the native encoding. To do this we specify the encoding at  
 construction/declaration time, eg.
 
 string s(UTF16);
 s.utf16 = ..data read from UTF-16 source..
 
 (or similar, the exact syntax is not important at this stage)
But the idea is that a string has the property of 'utf8', and 'utf16' and 'utf32' encoding at runtime?

--
Derek
(skype: derek.j.parnell) Melbourne, Australia
24/11/2005 2:34:13 PM
Nov 23 2005
parent "Regan Heath" <regan netwin.co.nz> writes:
On Thu, 24 Nov 2005 15:04:08 +1100, Derek Parnell <derek psych.ward> wrote:
 [Key]
 First, lets start with some terminology, these are the terms I am going  
 to
 be using and what they mean, if these are incorrect please correct me,  
 but
 take them to have the stated meanings for this document.

 code point      := the unicode value for a single and complete  
 character.
 code unit       := part of, or a complete character in one of the 3 UTF
 encodings UTF-8,16,32.
 code value      := AKA code unit.
The Unicode Consortium defines code value as the smallest (in terms of bits) value that will hold a character in the various encoding formats. Thus for UTF8 it is 1 byte, UTF16 = 2 bytes, and UTF32 = 4 bytes.
Thanks for the detailed description. That is what I meant above.
 * remove char[], wchar[] and dchar[].
Do we still have to cater for strings that were formatted in specific encodings outside of our D applications? For example, a C library routine might insist that a pointer to a UTF16 string be supplied, thus we would have to force a specific encoding somehow.
Yes, that is the purpose of char*, utf16*, etc. eg.

int strlen(const char *string) {}
int CreateFileW(utf16 *filename, ...
 * keep char, wchar, and dchar but rename them utf8, utf16, utf32. These
 types represent code points (always, not a code units/values) in each
 encoding. Only code points that fit in utf8 will ever be represented by
 utf8, and so on. Thus some code points will always be utf32 values and
 never utf8 or 16. (much like byte/short/int)
I think you've lost track of your 'code point' definition.
Not so. I've just failed to explain what I mean here, let me try some more...
 A 'code point' is a character.
Correct.
 All encodings can hold all characters, every character will
 fit into UTF8. Sure some might take 1, 2 or 4 'code values', but there  
 are still all code points. There are no exclusive code points in utf32.  
 Every
 UTF32 code point can also be expressed in UTF8.
I realise all this. It is not what I meant above. Think of the type "utf8"  
as being identical to "byte", except that the values it stores are always  
complete code points, never fragments or code units/values. The type  
"utf8" will never have part of a complete character in it; it'll either  
have the whole character or it will be an error.

"utf8" can represent the range of code points which are between 0 and 255  
(or perhaps it's 127, not sure). Perhaps the name "utf8" is misleading:  
it's not in fact a UTF-8 code unit/value, it is a code point that fits in  
a byte. The reason it's not called "byte" is because the separate type is  
used to trigger transcoding, see my original utf16* example.
 * add promotion/comparrison rules for utf8, 16 and 32:

 - any given code point represented as utf8 will compare equal to the  
 same
 code point represented as a utf16 or utf32 and vice versa(a)

 - any given code point represented as utf8 will be implicitly
 converted/promoted to the same code point represented as utf16 or utf32  
 as
 required and vice versa(a). If promotion from utf32 to utf16 or 8 causes
 loss in data it should be handled just like int to short or byte.
I assume by 'promotion' you really mean 'transcoding'.
No, I think I mean promotion. This is one of the things I am not 100% sure  
of, bear with me.

The character 'A' has ASCII value 65 (decimal). Assuming its code point is  
65 (decimal), then this code point will fit in my "utf8" type. Thus "utf8"  
can represent the code point 'A'. If you assign that "utf8" to a "utf16",  
eg.

utf8 a = 'A';
utf16 b = a;

the utf8 value will be promoted to a utf16 value. The value itself doesn't  
change (it's not transcoded). It happens in exactly the same way a byte is  
promoted to a short. Is promoted the right word?

That is, provided the value doesn't _need_ to change when going from utf8  
to utf16; I am not 100% sure of this. I don't think it does. I believe all  
the code points that fit in the 1 byte type have the same numerical value  
in the 2 byte type (UTF-16), and also the 4 byte type (UTF-32).
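
For what it's worth, today's character types already behave this way for  
values that fit; a minimal check (using char/wchar/dchar rather than the  
proposed utf8/utf16/utf32):

void main()
{
    char  a = 'A';   // code point U+0041 fits in 8 bits
    wchar b = a;     // implicit widening, the value is unchanged
    dchar c = b;     // likewise
    assert(a == 65 && b == 65 && c == 65);
    assert(a == b && b == c);
}
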
 * add a new type/alias "utf", this would alias utf8, 16 or 32. It
 represents the application specific native encoding. This allows  
 efficient
 code, like:

 string s = "test";
 foreach(utf c; s) {
 }
But utf8, utf16, and utf32 are *strings* not characters
No, they're not, not in my proposal. I think I picked bad names.
 , so 'utf' could not be an *alias* for these in your example. I guess  
 you mean it to be a term
 for a character (code point) in a utf string.
utf, utf8, utf16, and utf32 are all types that store complete code points, never code units/values/fragments. Think of them as being identical to byte, short, and int.
 regardless of the applications selected native encoding.

 * slicing string gives another string

 * indexing a string gives a utf8, 16, or 32 code point.

 * string literals would be of type "string" encoded in the native
 encoding, or if another encoding can be determined at compile time, in
 that encoding (see ASCII example below).

 * character literals would default to the native encoding, failing that
 the smallest possible type, and promoted/converted as required.
By 'smallest possible type' do you mean the smallest memory usage?
Yes. utf8 is smaller than utf16 is smaller than utf32.
 * there are occasions where you may want to use a specific encoding for  
 a
 part of your application, perhaps you're loading a UTF-16 file and  
 parsing
 it. If all the work is done in a small section of code and it doesn't
 interact with the bulk of your application data which is all in UTF-8  
 your
 native encoding it likely to be UTF-8.

 In this case, for performance reasons, you want to be able to specify  
 the
 encoding to use for your "string" types at runtime, they are exceptions  
 to
 the native encoding. To do this we specify the encoding at
 construction/declaration time, eg.

 string s(UTF16);
 s.utf16 = ..data read from UTF-16 source..

 (or similar, the exact syntax is not important at this stage)
But the idea is that a string has the property of 'utf8', and 'utf16' and 'utf32' encoding at runtime?
Yes. But you will only need to use these properties when performing input or output (see my definitions of source and sink) and only when the type cannot be inferred by the context, i.e. it's not required here:

int CreateFile(utf16* filename) {}

string test = "test";
CreateFile(test);

Regan
Nov 23 2005
prev sibling next sibling parent reply "Lionello Lunesu" <lio remove.lunesu.com> writes:
Hi Regan,

Two small remarks:

* "wchar" might still be useful for those applications / libraries that 
support 16-bit unicode without aggregates like in Windows NT if I'm correct. 
It's not utf16 since it can't contain a big, >2-byte code point, ie. it's 
ushort.

* I don't see the point of the utf8, utf16 and utf32 types. They can all 
contain any code point, so they should all be just as big? Or do you mean 
that utf8 is like a ubyte[4], utf16 like ushort[2] and utf32 like uint? 
Actually pieces from the respective strings.

L. 
Nov 23 2005
next sibling parent "Lionello Lunesu" <lio remove.lunesu.com> writes:
By the way, I like the proposal! I prefer different compiled libraries to 
many runtime checks or version blocks. It's like the #define UNICODE in 
Windows.

L. 
Nov 23 2005
prev sibling parent reply "Regan Heath" <regan netwin.co.nz> writes:
On Thu, 24 Nov 2005 09:56:51 +0200, Lionello Lunesu  
<lio remove.lunesu.com> wrote:
 Two small remarks:

 * "wchar" might still be useful for those applications / libraries that
 support 16-bit unicode without aggregates like in Windows NT if I'm  
 correct.
 It's not utf16 since it can't contain a big, >2-byte code point, ie. it's
 ushort.

 * I don't see the point of the utf8, utf16 and utf32 types. They can all
 contain any code point, so they should all be just as big? Or do you mean
 that utf8 is like a ubyte[4], utf16 like ushort[2] and utf32 like uint?
 Actually pieces from the respective strings.
No. I seem to have done a bad job of explaining it _and_ picked terrible names. The "utf8", "utf16" and "utf32" types I refer to are essentially byte, short and int. They cannot contain any code point, only those that fit (I thought I said that?) We don't need wchar because utf16 replaces it. Perhaps if I had kept the original names... doh! Regan
Nov 24 2005
parent reply "Lionello Lunesu" <lio remove.lunesu.com> writes:
 The "utf8", "utf16" and "utf32" types I refer to are essentially byte, 
 short and int. They cannot contain any code point, only those that fit (I 
 thought I said that?)
In that case I don't like your idea : )

It makes far more sense to have only 1 _character_ type, that holds any UNICODE character. Whether it comes from an utf8, utf16 or utf32 string shouldn't matter:

string s="Whatever";    //imagine it with a small circle on the a, comma under the t
foreach(uchar u; s) {}

Read "uchar" as "unicode char", essentially dchar, could in fact still be named dchar, I just didn't want to mix old/new terminology. The underlying type of "string" would be determined at compile time, but still convertible using properties (that part I liked very much).

D's "char" should go back to C's char, signed even. Many decisions in D were made to ease the porting of C code, so why this "char" got overridden beats me. char[] should then behave no differently from byte[] (except maybe the element being signed).

L.
Nov 24 2005
parent reply "Regan Heath" <regan netwin.co.nz> writes:
On Thu, 24 Nov 2005 17:35:24 +0200, Lionello Lunesu  
<lio remove.lunesu.com> wrote:
 The "utf8", "utf16" and "utf32" types I refer to are essentially byte,
 short and int. They cannot contain any code point, only those that fit  
 (I
 thought I said that?)
In that case I don't like your idea : ) It makes far more sense to have only 1 _character_ type, that holds any UNICODE character. Whether it comes from an utf8, utf16 or utf32 string shouldn't matter:
Yeah, I'm starting to think that is the only way it works. The 3 types were an attempt to avoid that for programs which do not need an int-sized type for a char i.e. the quick and dirty ASCII program for example. Interestingly it seems std.stdio and std.format are already involved in a conspiracy to convert all our char[] output to dchar and back again one character at a time before it eventually makes it to the screen.
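
Incidentally, if I'm not mistaken foreach can already do this decoding for  
us today when the loop variable is dchar, whatever encoding the array  
holds; a small sketch with the existing types:

void main()
{
    char[] s = "Whatever";
    foreach (dchar u; s)
    {
        // u is a whole code point here, however many UTF-8 code
        // units it occupied in s
    }
}
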
 string s="Whatever";    //imagine it with a small circle on the a, comma
 under the t
 foreach(uchar u; s) {}

 Read "uchar" as "unicode char", essentially dchar, could in fact still be
 named dchar, I just didn't want to mix old/new terminology. The  
 underlying
 type of "string" would be determined at compile time, but still  
 convertable
 using properties (that part I liked very much).

 D's "char" should go back to C's char, signed even. Many decissions in D
 where made to ease the porting of C code, so why this "char" got  
 overriden
 beats me. char[] should then behave no differently from byte[] (except  
 maybe
 the element being signed).
I like "uchar". I agree "char" should go back to being C's char type. I don't think we need a char[] all the C functions expect a null terminated char*. Regan
Nov 24 2005
parent reply Georg Wrede <georg.wrede nospam.org> writes:
Regan Heath wrote:
 On Thu, 24 Nov 2005 17:35:24 +0200, Lionello Lunesu 
 
 It makes far more sense to have only 1 _character_ type, that holds
 any UNICODE character. Whether it comes from an utf8, utf16 or
 utf32 string shouldn't matter:
True!
 Yeah, I'm starting to think that is the only way it works. The 3
 types were an attempt to avoid that for programs which do not need an
 int-sized  type for a char i.e. the quick and dirty ASCII program
 for example.
 
 Interestingly it seems std.stdio and std.format are already involved
 in a  conspiracy to convert all our char[] output to dchar and back
 again one  character at a time before it eventually makes it to the
 screen.
Must've been the specters in the night again. :-)
 I like "uchar". I agree "char" should go back to being C's char type.
 I don't think we need a char[] all the C functions expect a null 
 terminated  char*.
That would be nice! What if we even decided that char[] is null-terminated? That'd massively reduce all kinds of bugs when (under pressure) converting code from C(++)!
Nov 24 2005
parent reply Derek Parnell <derek psych.ward> writes:
On Fri, 25 Nov 2005 02:21:52 +0200, Georg Wrede wrote:

 Regan Heath wrote:
 On Thu, 24 Nov 2005 17:35:24 +0200, Lionello Lunesu 
 
 It makes far more sense to have only 1 _character_ type, that holds
 any UNICODE character. Whether it comes from an utf8, utf16 or
 utf32 string shouldn't matter:
True!
 Yeah, I'm starting to think that is the only way it works. The 3
 types were an attempt to avoid that for programs which do not need an
 int-sized  type for a char i.e. the quick and dirty ASCII program
 for example.
 
 Interestingly it seems std.stdio and std.format are already involved
 in a  conspiracy to convert all our char[] output to dchar and back
 again one  character at a time before it eventually makes it to the
 screen.
Must've been the specters in the night again. :-)
 I like "uchar". I agree "char" should go back to being C's char type.
 I don't think we need a char[] all the C functions expect a null 
 terminated  char*.
That would be nice! What if we even decided that char[] is null-terminated? That'd massively reduce all kinds of bugs when (under pressure) converting code from C(++)!
I think that would interfere with the slice concept.

char[] a = "some text";
char[] b = a[4 .. 7]; // Making 'b' a reference into 'a'.

--
Derek
(skype: derek.j.parnell) Melbourne, Australia
25/11/2005 11:37:08 AM
Nov 24 2005
parent reply Georg Wrede <georg.wrede nospam.org> writes:
Derek Parnell wrote:
 On Fri, 25 Nov 2005 02:21:52 +0200, Georg Wrede wrote:
 I like "uchar". I agree "char" should go back to being C's char
 type. I don't think we need a char[] all the C functions expect a
 null terminated char*.
That would be nice! What if we even decided that char[] is null-terminated? That'd massively reduce all kinds of bugs when (under pressure) converting code from C(++)!
I think that would interfere with the slice concept. char[] a = "some text"; char[] b = a[4 .. 7]; // Making 'b' a reference into 'a'.
Slicing C's char[] implies byte-wide, and non-UTF.
Nov 24 2005
parent reply Derek Parnell <derek psych.ward> writes:
On Fri, 25 Nov 2005 05:16:15 +0200, Georg Wrede wrote:

 Derek Parnell wrote:
 On Fri, 25 Nov 2005 02:21:52 +0200, Georg Wrede wrote:
 I like "uchar". I agree "char" should go back to being C's char
 type. I don't think we need a char[] all the C functions expect a
 null terminated char*.
That would be nice! What if we even decided that char[] is null-terminated? That'd massively reduce all kinds of bugs when (under pressure) converting code from C(++)!
I think that would interfere with the slice concept. char[] a = "some text"; char[] b = a[4 .. 7]; // Making 'b' a reference into 'a'.
Slicing C's char[] implies byte-wide, and non-UTF.
Exactly, and that's why I'm worried by the suggestion that char[] be automatically zero-terminated, because slices are usually not zero-terminated.

--
Derek
(skype: derek.j.parnell) Melbourne, Australia
25/11/2005 3:12:51 PM
Nov 24 2005
parent Georg Wrede <georg.wrede nospam.org> writes:
Derek Parnell wrote:
 On Fri, 25 Nov 2005 05:16:15 +0200, Georg Wrede wrote:
Derek Parnell wrote:
On Fri, 25 Nov 2005 02:21:52 +0200, Georg Wrede wrote:
I like "uchar". I agree "char" should go back to being C's char
type. I don't think we need a char[] all the C functions expect a
null terminated char*.
That would be nice! What if we even decided that char[] is null-terminated? That'd massively reduce all kinds of bugs when (under pressure) converting code from C(++)!
I think that would interfere with the slice concept. char[] a = "some text"; char[] b = a[4 .. 7]; // Making 'b' a reference into 'a'.
Slicing C's char[] implies byte-wide, and non-UTF.
Exactly, and that why I'm worried by the suggestion that char[] be automatically zero-terminated, because slices are usually not zero-terminated.
With what we're doing with the utf, it would be a small additional job to have the "C char" arrays take care of the null byte at the end. So the programmer would not have to think about it. (I admit this takes some further thinking first! So you are right in your concerns!)
Nov 25 2005
prev sibling next sibling parent reply Oskar Linde <oskar.lindeREM OVEgmail.com> writes:
Regan Heath wrote:
[snip]
 [Questions]
 (a) Is UTF-8 a subset of UTF-16 and so on? does the codepoint for 'A' have the
numerical value 65 decimal in UTF-8, UTF-16 _and_ UTF-32, in other words is it
the same numerical value in all encodings? If so then comparing utf8, 16 and 32
is no different to comparing byte, short and int and all the same promotion and
comparrison rules can apply.
I think you are making this more complicated than it is by using the name UTF when you actually mean something like:

ascii_char   (not utf8)  (code point < 128)
ucs2_char    (not utf16) (code point < 65536)
unicode_char (not utf32)

And yes: ascii is a subset of ucs2 is a subset of unicode.
 (b) Is this really ASCII or is it system dependant? i.e. Latin-1 or similar.
Is it ASCII values 127 or less perhaps? To be honest I'm not sure.
ASCII is equal to the first 128 code points in Unicode. Latin-1 is equal to the first 256 code points in Unicode. Regards, /Oskar
Nov 24 2005
parent reply "Regan Heath" <regan netwin.co.nz> writes:
On Thu, 24 Nov 2005 09:23:21 +0100, Oskar Linde  
<oskar.lindeREM OVEgmail.com> wrote:
 Regan Heath wrote:
 [snip]
 [Questions]
 (a) Is UTF-8 a subset of UTF-16 and so on? does the codepoint for 'A'  
 have the numerical value 65 decimal in UTF-8, UTF-16 _and_ UTF-32, in  
 other words is it the same numerical value in all encodings? If so then  
 comparing utf8, 16 and 32 is no different to comparing byte, short and  
 int and all the same promotion and comparrison rules can apply.
I think you are making this more complicated than it is by using the name UTF when you actually mean something like: ascii_char (not utf8) (code point < 128) ucs2_char (not utf16) (code point < 65536) unicode_char (not utf32)
I agree, it appears my choice of type names was really confusing. I have posted a change, but perhaps I should repost all over again, perhaps I should have bounced this off one person before posting.
 And yes: ascii is a subset of ucs2 is a subset of unicode.
Excellent. Thanks.
 (b) Is this really ASCII or is it system dependant? i.e. Latin-1 or  
 similar. Is it ASCII values 127 or less perhaps? To be honest I'm not  
 sure.
ASCII is equal to the first 128 code points in Unicode. Latin-1 is equal to the first 256 code points in Unicode.
And which does a C function expect? Or is that defined by the C function? Does strcmp care? Does strlen, strchr, ...? Regan
Nov 24 2005
parent Oskar Linde <oskar.lindeREM OVEgmail.com> writes:
Regan Heath wrote:
 On Thu, 24 Nov 2005 09:23:21 +0100, Oskar Linde  
 <oskar.lindeREM OVEgmail.com> wrote:
 
 Regan Heath wrote:
 (b) Is this really ASCII or is it system dependant? i.e. Latin-1 or  
 similar. Is it ASCII values 127 or less perhaps? To be honest I'm 
 not  sure.
ASCII is equal to the first 128 code points in Unicode. Latin-1 is equal to the first 256 code points in Unicode.
And which does a C function expect? Or is that defined by the C function? Does strcmp care? Does strlen, strchr, ...?
This is not defined. strcmp doesn't care. strlen etc. only counts bytes until '\0'. You can use Latin-1, UTF-8 or any 8-bit encoding. This is why UTF-8 is so popular: you can just plug it in and almost everything that used to assume Latin-1 or any 8-bit encoding will just work without any changes.

Not even the OS cares very much. To the OS, things like a file name, file contents, usernames, etc. are just a bunch of bytes. Different file systems may then define different encodings the file names should be interpreted in. This is just how the file name is presented to the user. (Transcoding to/from the terminal)

/Oskar
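
A small check of the byte-counting point, with strlen declared by hand for  
the example:

import std.string;   // toStringz
extern (C) size_t strlen(char* s);

void main()
{
    char[] s = "smörgåsbord";            // 11 characters, 13 bytes in UTF-8
    assert(s.length == 13);              // char[] length counts code units
    assert(strlen(toStringz(s)) == 13);  // strlen agrees: bytes, not characters
}
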
Nov 24 2005
prev sibling next sibling parent reply "Regan Heath" <regan netwin.co.nz> writes:
Ok, it appears I picked some really bad type names in my proposal and it  
is causing some confusion.

The types "utf8" "utf16" and "utf32" do not in fact have anything to do  
with UTF. (Bad Regan).

They are in fact essentially byte, short and int with different names.  
Having different names is important because it triggers the transcoding of  
"string" to the required C, OS, or UTF type.

I could have left them called "char", "wchar" and "dchar", except that I  
wanted a 4th type to represent C's char as well. That type was called  
"char" in the proposal.

So, for the sake of our sanity can we all please assume I have used these  
type names instead:

"utf8"  == "cp1"
"utf16" == "cp2"
"utf32" == "cp4"
"utf"   == "cpn"

(the actual type names are unimportant at this stage, we can pick the best  
possible names later)

The idea behind these types is that they represent code points/characters  
_never_ code units/values/fragments. Which means cp1 can only represent a  
small subset of unicode code points, cp2 slightly more and cp4 all of them  
(IIRC).

It means assigning anything outside their range to them is an error.

It means that you can assign a cp1 to a cp2 and it simply promotes it  
(like it would from byte to short).

"cpn" is simply and alias for the type that is best suited for the chosen  
native encoding. If the native encoding is UTF-8, cpn is an alias for cp1,  
if the native encoding is UTF-16, cpn is an alias for cp2, and so on.

Sorry for all the confusion.

Regan
Nov 24 2005
next sibling parent reply "Regan Heath" <regan netwin.co.nz> writes:
Replying to myself now, in addition to bolloxing the initial proposal up  
with bad type names, I'm on a roll!

Here is version 1.1 of the proposal, with different type names and some  
changes to the other content. Hopefully this one will make more sense,  
fingers crossed.

Regan
Nov 24 2005
next sibling parent reply Georg Wrede <georg.wrede nospam.org> writes:
Congrats, Regan! Great job!

And the thread subject is simply a Killer!



If I understand you correctly, then the following would work:

string st = "aaa\u0041bbb\u00C4ccc\u0107ddd";   // aaaAbbbÄcccćddd
cp1 s3 = st[3];   // A
cp1 s7 = st[7];   // Ä
cp1 s11 = st[11]; // error, too narrow
cp2 s11 = st[11]; // ć

assert( s3 == 0x41 && s7 == 0xC4 && s11 == 0x107 );

So, s3 would contain "A", which the old system would store as utf8 with 
no problem. s3 is 8 bits.

s7 would contain "Ä", which the old system shouldn't have stored in 
8-bit (char) because it is too big, but with your proposal it would be 
ok, since the _code_point_ (i.e. the "value" of the character in 
Unicode) does fit in 8 bits. And _we_are_storing_ the codepoint, not the 
UTF character here, right?

s11 would error, since even the Unicode value is too big for 8 bits.

The second s11 assignment would be ok, since the Unicode value of ć fits 
in 16 bits.

And, st itself would be "regular" UTF-8 on a Linux, and (probably) 
UTF-16 on Windows.

Yes?
Nov 24 2005
parent reply "Regan Heath" <regan netwin.co.nz> writes:
On Thu, 24 Nov 2005 13:29:10 +0200, Georg Wrede <georg.wrede nospam.org>  
wrote:
 Congrats, Regan! Great job!
 And the thread subject is simply a Killer!

 If I understand you correctly, then the following would work:

 string st = "aaa\u0041bbb\u00C4ccc\u0107ddd";   // aaaAbbbÄcccćddd
 cp1 s3 = st[3];   // A
 cp1 s7 = st[7];   // Ä
 cp1 s11 = st[11]; // error, too narrow
 cp2 s11 = st[11]; // ć

 assert( s3 == 0x41 && s7 == 0xC4 && s11 == 0x107 );

 So, s3 would contain "A", which the old system would store as utf8 with  
 no problem. s3 is 8 bits.

 s7 would contain "Ä", which the old system shouldn't have stored in  
 8-bit (char) because it is too big, but with your proposal it would be  
 ok, since the _code_point_ (i.e. the "value" of the character in  
 Unicode) does fit in 8 bits. And _we_are_storing_ the codepoint, not the  
 UTF character here, right?
Yes. That's exactly what I was thinking.

However it appears that the idea doesn't hold together too well when it comes to "cpn" the alias, eg:

string s = "smörgåsbord";
foreach(cpn c; s) {
}

"cpn" would need to change size for each character. It would be more than a simple alias. If it cannot change size, then it would need to be the largest size required. If that was also too weird/difficult then it would need to be 32 bits in size all the time. I was trying to avoid this but it seems it may be required?
 s11 would error, since even the Unicode value is too big for 8 bits.

 The second s11 assignment would be ok, since the Unicode value of ć fits  
 in 16 bits.

 And, st itself would be "regular" UTF-8 on a Linux, and (probably)  
 UTF-16 on Windows.

 Yes?
My proposal didn't suggest different encodings based on the system. It was UTF-8 by default (all systems) and application specific otherwise. There is nothing stopping us making the Windows default to UTF-16 if that makes sense. Which it seems to.

Regan
Nov 24 2005
parent Georg Wrede <georg.wrede nospam.org> writes:
Regan Heath wrote:
 On Thu, 24 Nov 2005 13:29:10 +0200, Georg Wrede 
 <georg.wrede nospam.org>  wrote:
...
 "cpn" would need to change size for each character. It would be more 
 than  a simple alias.
 
 If it cannot change size, then it would need to be the largest size  
 required.
 
 If that was also too weird/difficult then it would need to be 32 bits 
 in  size all the time. I was trying to avoid this but it seems it may 
 be  required?
Yes. I see no way to avoid "cpn" being 32 bit only.
 My proposal didn't suggest different encodings based on the system. It 
 was  UTF-8 by default (all systems) and application specific otherwise. 
 There  is nothing stopping us making the windows default to UTF-16 if 
 that makes  sense. Which it seems to.
Windows, <sigh>. Looks like it. They seem to have a habit of choosing what seems easiest at the outset, without ever learning to dig into issues first. Had they done it, they'd have chosen UTF-8, like everybody else. :-(
Nov 24 2005
prev sibling parent reply Georg Wrede <georg.wrede nospam.org> writes:
Regan Heath wrote:
 
 * add a new type/alias "cpn", this alias will be cp1, cp2 or cp4
 depending on the native encoding chosen. This allows efficient
 code, like:
 
 string s = "test";
 foreach(cpn c; s) {
 }
 
 * slicing string gives another string
 
 * indexing a string gives a cp1, cp2 or cp4
I hope you are not implying that indexing would choose between cp1..4 based on content? And if not, then the cpX would be either some "default", or programmer chosen? Now, that leads to Americans choosing cp1 all over the place, right?

(Ah, upon proofreading before posting, I only now noticed the cpn sentence at the top. I'll remark on it at the very end.)

---

While we are now intricately submerged in UTF and char width issues, one day, when D is a household word, programmers wouldn't have to even know about UTF and stuff. Just like last summer, when none of us European D folk knew anything about UTF, and just wrote stuff like
parent reply "Regan Heath" <regan netwin.co.nz> writes:
On Thu, 24 Nov 2005 14:33:53 +0200, Georg Wrede <georg.wrede nospam.org>  
wrote:
 Regan Heath wrote:
  * add a new type/alias "cpn", this alias will be cp1, cp2 or cp4
 depending on the native encoding chosen. This allows efficient
 code, like:
  string s = "test";
 foreach(cpn c; s) {
 }
  * slicing string gives another string
  * indexing a string gives a cp1, cp2 or cp4
I hope you are not implying that indexing would choose between cp1..4 based on content? And if not, then the cpX would be either some "default", or programmer chosen? Now, that leads to Americans choosing cp1 all over the place, right?
I didn't think this part thru enough and Oskar gave me an example which broke my original idea. It seems for this to work cpn would need to be a type which changed size for each character, or always 32 bits large (as you suggest below). I was trying to avoid it being 32 bits large all the time, but it seems to be the only way it works.
 If this is true, then we might consider blatantly skipping cp1 and cp2,  
 and only having cp4 (possibly also renaming it utfchar).

 Then it would be a lot simpler for the programmer, right? He'd have even  
 less need to start researching in this UTF swamp. And everything would  
 "just work".

 This would make it possible for us to fully automate the extraction and  
 insertion of single "characters" into our new strings.

      string foo = "gagaga";
      utfchar bar = '\UFE9D'; // you don't want to know the name :-)
      utfchar baf = 'a';
      foo ~= bar ~ baf;

 (I admit the last line doesn't probably work currently, but it should,  
 IMHO.) Anyhow, the point being that if the utfchar type is 32 bits, then  
 it doesn't hurt anybody, and also doesn't lead to gratuituous  
 incompatibility with foreign characters -- which is the D aim all along.
It seems this may be the best solution. Oskar had a good name for it "uchar". It means quick and dirty ASCII apps will have to use a 32 bit sized char type. I can hear people complain already.. but it's odd that no-one is complaining about writef doing this exact same thing!
 For completeness, we could have the painting casts (as opposed to  
 converting casts). They'd be for the (seldom) situations where the  
 programmer _does_ want to do serious tinkering on our strings.

      ubyte[] myarr1 = cast(ubyte[])foo;
      ushort[] myarr2 = cast(ushort[]) foo;
      uint[] myarr3 = cast(uint[]) foo;

 These give raw arrays, like exact images of the string. The burden of  
 COW would lie on the programmer.
I was thinking of using properties (Sean's idea) to access the data as a certain type, eg.

ubyte[] b = foo.utf8;
ushort[] s = foo.utf16;
uint[] i = foo.utf32;

these properties would return the string in the specified encoding using those array types.
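
A rough sketch of how such properties could sit on top of today's std.utf  
(the struct name, the UTF-32 storage and the casts are illustrative only,  
just one way to do it):

import std.utf;

struct UtfString
{
    dchar[] data;    // stored as UTF-32 in this sketch

    ubyte[]  utf8()  { return cast(ubyte[])  toUTF8(data);  }
    ushort[] utf16() { return cast(ushort[]) toUTF16(data); }
    uint[]   utf32() { return cast(uint[])   data;          }
}
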
 ---

 The cpn remark: I think D programs should be (as much as possible) UTF  
 clean, even if the programmer didn't come to think about it. This has  
 the advantage that his programs won't break embarrassingly when a guy in  
 China suddenly uses them.

 It would also be quite nice if the programmer didn't have to think about  
 such issues at all. Just code his stuff.

 Having cpn as something else than 32 bits, will prevent this dream.
 (Heh, and only having single chars as 32 bits would make writing the  
 libraries so much easier, too, I think.)
Sad but probably true. I was hoping to avoid using 32bits everywhere :( Regan
Nov 24 2005
next sibling parent reply Derek Parnell <derek psych.ward> writes:
On Fri, 25 Nov 2005 09:34:30 +1300, Regan Heath wrote:

 
 Sad but probably true. I was hoping to avoid using 32bits everywhere :(
I also use the Euphoria programming language and this uses 32-bit characters exclusively. You do not notice any performance hit because of that. The only complaint that some people have is that strings use too much RAM (but these people also use Windows 95).

--
Derek Parnell
Melbourne, Australia
25/11/2005 7:39:28 AM
Nov 24 2005
parent "Regan Heath" <regan netwin.co.nz> writes:
On Fri, 25 Nov 2005 07:41:15 +1100, Derek Parnell <derek psych.ward> wrote:
 On Fri, 25 Nov 2005 09:34:30 +1300, Regan Heath wrote:


 Sad but probably true. I was hoping to avoid using 32bits everywhere :(
I also use the Euphoria programming language and this uses 32-bit characters exclusively. You do not notice any performance hit because of that. The only complaint that some people have is that string use too much RAM (but these people also use Windows 95).
Interesting. In that case I think my "string" type has an advantage. The data could actually be stored in either UTF-8, UTF-16 or UTF-32 internally and only converted to/from the 32 bit char when required.

Regan.
Nov 24 2005
prev sibling parent Georg Wrede <georg.wrede nospam.org> writes:
Regan Heath wrote:
 On Thu, 24 Nov 2005 14:33:53 +0200, Georg Wrede wrote:
 
 I was trying to avoid it being 32 bits large 
 all the  time, but it seems to be the only way it works.
I agree. And I share the feeling. :-)
 If this is true, then we might consider blatantly skipping cp1 and 
 cp2,  and only having cp4 (possibly also renaming it utfchar).

 This would make it possible for us to fully automate the extraction 
 and  insertion of single "characters" into our new strings.

      string foo = "gagaga";
      utfchar bar = '\UFE9D'; // you don't want to know the name :-)
      utfchar baf = 'a';
      foo ~= bar ~ baf;
It seems this may be the best solution. Oskar had a good name for it "uchar". It means quick and dirty ASCII apps will have to use a 32 bit sized char type. I can hear people complain already.. but it's odd that no-one is complaining about writef doing this exact same thing!
Not too many have dissected writef. Or else we'd have heard some complaints already. ;-)

I actually thought about "uchar" for a while, but then I remembered that a lot of this utf disaster originates from unfortunate names. And C has a uchar type. So, I'd suggest "utfchar" or "unicode" or something to-the-point and unambiguous that's not in C.
 For completeness, we could have the painting casts (as opposed to  
 converting casts). They'd be for the (seldom) situations where the  
 programmer _does_ want to do serious tinkering on our strings.

      ubyte[] myarr1 = cast(ubyte[])foo;
      ushort[] myarr2 = cast(ushort[]) foo;
      uint[] myarr3 = cast(uint[]) foo;

 These give raw arrays, like exact images of the string.  
 The burden of COW would lie on the programmer.
I was thinking of using properties (Sean's idea) to access the data as a certain type, eg. ubyte[] b = foo.utf8; ushort[] s = foo.utf16; uint[] i = foo.utf32; these properties would return the string in the specified encoding using those array types.
So it'd be the same thing, except your code looks a lot nicer!
Nov 24 2005
prev sibling parent reply Derek Parnell <derek psych.ward> writes:
On Thu, 24 Nov 2005 21:46:50 +1300, Regan Heath wrote:

 Ok, it appears I picked some really bad type names in my proposal and it  
 is causing some confusion.
Regan, the idea stinks. Sorry, but that *is* the nice response. It is far more complicated than it needs to be. Maybe it's the name confusion, but I don't think so.

When dealing with strings, almost nobody needs to deal with partial-characters. We really only need to deal with characters except for some obscure functionality (maybe interfacing with an external system?). So we don't need to deal with the individual bytes that make up the characters in the various UTF encodings.

Sure, we will need to know how big a character is from time to time. For example, given a string (regardless of encoding format), we might need to know how many bytes the third character uses. The answer will depend on the UTF encoding *and* the code point value.

Mostly we won't even need to know the encoding format. We might, if that is an interfacing requirement, and we might in some circumstances to improve performance. But generally, we shouldn't care.

So how about we just have a string datatype called 'string'. The default encoding format in RAM is compiler dependent but we can, on a declaration basis, define a specific internal encoding format for a string. Furthermore, we can access any of the three UTF encoding formats for a string as a property of the string. The compiler would generate the call to transcode if required to. The string could also have array properties such that each element addressed an entire character.

If one ever really needed to get down to the byte level of a character they could assign it to a new datatype called a 'unicode' (for example) and that would have properties such as the encoding format and byte size, and the bytes in a unicode could be accessed using array syntax too.

string Foo = "Some string";
unicode C;

C = Foo[4];
if (C.encoding == unicode.utf8)
{
    foreach (ubyte b; C)
    {
        . . .
    }
}

We can then dismiss wchar, wchar[], dchar, dchar[] entirely. And make char and char[] array have the C/C++ semantics.

If some function absolutely insisted on a utf16 string for example, ...

SomeFunc(Foo.utf16);

would pass the utf16 version of the string to the function.

As for declarations ...

utf16 {  // force RAM encoding to be utf16
    string Foo;
    string Bar;
}
string Qwerty;  // RAM encoding is compiler choice.

--
Derek Parnell
Melbourne, Australia
24/11/2005 7:54:01 PM
Nov 24 2005
parent reply "Regan Heath" <regan netwin.co.nz> writes:
Derek, I must have done a terrible job explaining this, because you've  
completely misunderstood me; in fact your counter proposal is essentially  
what my proposal was intended to be.

More inline...

On Thu, 24 Nov 2005 20:15:56 +1100, Derek Parnell <derek psych.ward> wrote:
 On Thu, 24 Nov 2005 21:46:50 +1300, Regan Heath wrote:

 Ok, it appears I picked some really bad type names in my proposal and it
 is causing some confusion.
Regan, the idea stinks. Sorry, but that *is* the nice response. It is far more complicated than it needs to be. Maybe it's the name confusion, but I don't think so. When dealing with strings, almost nobody needs to deal with partial-characters.
I think you're confused. My proposal removes the need for dealing with partial characters completely; if you think otherwise then I've done a bad job explaining it.
 So we don't need to deal with the individual bytes that make up the
 characters in the various UTF encodings. Sure, we will need to know how  
 big a character is from time to time. For example, given a string  
 (regardless
 of encoding format), we might need to know how many bytes the third
 character uses. The answer will depend on the UTF encoding *and* the code
 point value.
Exactly my point, and the reason for the "cpn" alias.
 Mostly we won't even need to know the encoding format. We might, if that  
 is an interfacing requirement, and we might in some circumstances to  
 improve
 performance. But generally, we shouldn't care.
Yes, exactly.
 So how about we just have a string datatype called 'string'. The default
 encoding format in RAM is compiler dependant but we can on a declaration
 basis, define specific internal encoding format for a string.  
 Furthermore, we can access any of the three UTF encoding formats for a  
 string as a
 property of the string. The compiler would generate the call to transcode
 if required to. The string could also have array properties such that  
 each element addressed an entire character.
That, is exactly what I proposed.
 We can then dismiss wchar, wchar[], dchar, dchar[] entirely. And make  
 char and char[] array have the C/C++ semantics.
I proposed exactly that, except char[] should not exist either. char and char* are all that are required.

Regan
Nov 24 2005
parent reply Derek Parnell <derek psych.ward> writes:
On Thu, 24 Nov 2005 22:49:36 +1300, Regan Heath wrote:

 Derek, I must have done a terrible job explaining this, because you've  
 completely missunderstood me, in fact your counter proposal is essentially  
 what my proposal was intended to be.
You seemed to be wanting to have data types that could only hold character fragments. I can't see the point of that.

If strings must be arrays, then let there be an atomic data type that represents a character, and then strings can be arrays of characters. The UTF encoding of the string is just an implementation detail then. All indexing would be done on a character basis regardless of the underlying encoding.

In other words, if 'uchar' is the data type that holds a character then

  alias uchar[] string;

could be used. The main 'new thinking' is that uchar could actually have variable size in terms of bytes. It could be 1, 2 or 4 bytes depending on the character it holds and the encoding used at the time.

However I still prefer my earlier suggestion.
 On Thu, 24 Nov 2005 20:15:56 +1100, Derek Parnell <derek psych.ward> wrote:
 On Thu, 24 Nov 2005 21:46:50 +1300, Regan Heath wrote:

 Ok, it appears I picked some really bad type names in my proposal and it
 is causing some confusion.
Regan, the idea stinks. Sorry, but that *is* the nice response. It is far more complicated than it needs to be. Maybe it's the name confusion, but I don't think so. When dealing with strings, almost nobody needs to deal with partial-characters.
I think you're confused. My proposal removes the need for dealing with partial characters completely, if you think otherwise then I've done a bad job explaining it.
Apparently, or I'm a bit thicker than suspected ;-)
 So we don't need to deal with the individual bytes that make up the
 characters in the various UTF encodings. Sure, we will need to know how  
 big a character is from time to time. For example, given a string  
 (regardless
 of encoding format), we might need to know how many bytes the third
 character uses. The answer will depend on the UTF encoding *and* the code
 point value.
Exactly my point, and the reason for the "cpn" alias.
But why the need for cp1, cp2, cp4?
 Mostly we won't even need to know the encoding format. We might, if that  
 is an interfacing requirement, and we might in some circumstances to  
 improve
 performance. But generally, we shouldn't care.
Yes, exactly.
 So how about we just have a string datatype called 'string'. The default
 encoding format in RAM is compiler dependant but we can on a declaration
 basis, define specific internal encoding format for a string.  
 Furthermore, we can access any of the three UTF encoding formats for a  
 string as a
 property of the string. The compiler would generate the call to transcode
 if required to. The string could also have array properties such that  
 each element addressed an entire character.
That, is exactly what I proposed.
 We can then dismiss wchar, wchar[], dchar, dchar[] entirely. And make  
 char and char[] array have the C/C++ semantics.
I proposed exactly that, except char[] should not exist either. char and char* are all that are required.
Are you saying that we can have arrays of everything except char? I don't think that'll fly. And char* is a pointer to a single char.

-- 
Derek Parnell
Melbourne, Australia
25/11/2005 7:27:14 AM
Nov 24 2005
next sibling parent "Regan Heath" <regan netwin.co.nz> writes:
On Fri, 25 Nov 2005 07:37:18 +1100, Derek Parnell <derek psych.ward> wrote:
 On Thu, 24 Nov 2005 22:49:36 +1300, Regan Heath wrote:

 Derek, I must have done a terrible job explaining this, because you've
 completely missunderstood me, in fact your counter proposal is  
 essentially
 what my proposal was intended to be.
You seemed to be wanting to have data types that could only hold characters-fragments. I can't see the point of that.
No, never fragments, always complete code points. I tried to stress this point. The 8-bit type would hold all the code points with values that fit in 8 bits and never anything else; its value would always be a code point, not a fragment.
 If strings must be arrays, then let there be an atomic data type that
 represents a character and then strings can be arrays of characters. The
 UTF encoding of the string is just an implementation detail then. All
 indexing would be done on a character basis regardless of the underlying
 encoding.

 In other words, if 'uchar' is the data type that holds a character then

   alias uchar[] string;

 could be used. The main 'new thinking' is that uchar could actually have
 variable size in terms of bytes. It could be 1, 2 or 4 bytes depending on
 the character it holds and the encoding used at the time.

 However I still prefer my earlier suggestion.
I suspect now that all individual characters will have to be represented by a 32-bit type; uchar is a good name for it. If you take my proposal, throw away all the garbage about cp1, cp2, cp4, and cpn, replace them with a new type "uchar" which is 32 bits large, and always use this to represent individual characters, then it starts to work, I believe.
 Apparently, or I'm a bit thicker than suspected ;-)
I've just used confusing terms and done a bad job explaining I think.
 But why the need for cp1, cp2, cp4?
This was intended to avoid ASCII programs having to use a 32 bit type for all their characters, and so on.
 I proposed exactly that, except char[] should not exist either.
 char and char* are all that are required.
Are you saying that we can have arrays of everything except char?
Yes. Because we don't need an array of char. It's simply there for interfacing to C.
 I don't think that'll fly.  And char* is a pointer to a single char.
Technically true, but when you're talking about a C function it's a pointer to the start of a string which is null-terminated. That's all we need it for in D.

Regan
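(For reference, that C boundary in today's D is just std.string.toStringz; the example string here is arbitrary.)

import core.stdc.stdio : puts;
import std.string : toStringz;

void main()
{
    string s = "hello from D";
    puts(s.toStringz);    // toStringz guarantees the trailing '\0' the C side expects
}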
Nov 24 2005
prev sibling parent reply Bruno Medeiros <daiphoenixNO SPAMlycos.com> writes:
Derek Parnell wrote:
 On Thu, 24 Nov 2005 22:49:36 +1300, Regan Heath wrote:
 
 
Derek, I must have done a terrible job explaining this, because you've  
completely missunderstood me, in fact your counter proposal is essentially  
what my proposal was intended to be.
You seemed to be wanting to have data types that could only hold characters-fragments. I can't see the point of that. If strings must be arrays, then let there be an atomic data type that represents a character and then strings can be arrays of characters. The UTF encoding of the string is just an implementation detail then. All indexing would be done on a character basis regardless of the underlying encoding. In other words, if 'uchar' is the data type that holds a character then alias uchar[] string; could be used. The main 'new thinking' is that uchar could actually have variable size in terms of bytes. It could be 1, 2 or 4 bytes depending on the character it holds and the encoding used at the time. However I still prefer my earlier suggestion.
Whoa, did you ever stop to think about the implications of having a primitive type with *variable size*? It's plain nuts to implement; no wait, it's actually downright impossible. If you have a uchar variable (not an array), how much space do you allocate for it, if it has variable size? The only way to implement this would be with a fixed size equal to the maximum possible size (4 bytes). That would be a dchar then...

-- 
Bruno Medeiros - CS/E student
"Certain aspects of D are a pathway to many abilities some consider to be... unnatural."
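(A quick illustration of that point: dchar is already the fixed-width, one-code-point-per-element type, and a dchar[] indexes per character. The sample word is arbitrary.)

import std.stdio;

void main()
{
    static assert(dchar.sizeof == 4);   // fixed 4 bytes per character

    dstring s = "smörgåsbord"d;         // one dchar per character
    writeln(s.length);                  // 11
    writeln(s[3]);                      // 'r': plain array indexing is per character
}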
Nov 25 2005
parent reply Derek Parnell <derek psych.ward> writes:
On Fri, 25 Nov 2005 14:06:34 +0000, Bruno Medeiros wrote:

[snip]

 could be used. The main 'new thinking' is that uchar could actually have
 variable size in terms of bytes. It could be 1, 2 or 4 bytes depending on
 the character it holds and the encoding used at the time.
 
 However I still prefer my earlier suggestion.
  
 
Whoa, did you ever stop to think on the implications of having a primitive type with *variable size* ? It's plain nuts to implement, no wait, it's actually downright impossible. If you have a uchar variable (not an array), how much space do you allocate for it, if it has variable-size? The only way to implement this would be with a fixed-size equal to the max possible size (4 bytes). That would be a dchar then...
Well, not *actually* impossible, but certainly something you'd only do if you didn't care about performance. However, I was really talking on a conceptual level rather than an implementation level. As you and others have said, it would most likely be implemented as a 32-bit unsigned integer; however, certain bits are redundant and are thus (conceptually) not significant.

And as I have said earlier, I already work in such a world. The Euphoria programming language only has 32-bit characters.

-- 
Derek Parnell
Melbourne, Australia
26/11/2005 8:40:39 AM
Nov 25 2005
parent reply Bruno Medeiros <daiphoenixNO SPAMlycos.com> writes:
Derek Parnell wrote:
 On Fri, 25 Nov 2005 14:06:34 +0000, Bruno Medeiros wrote:
 
 [snip]
 
 
could be used. The main 'new thinking' is that uchar could actually have
variable size in terms of bytes. It could be 1, 2 or 4 bytes depending on
the character it holds and the encoding used at the time.

However I still prefer my earlier suggestion.
 
Whoa, did you ever stop to think on the implications of having a primitive type with *variable size* ? It's plain nuts to implement, no wait, it's actually downright impossible. If you have a uchar variable (not an array), how much space do you allocate for it, if it has variable-size? The only way to implement this would be with a fixed-size equal to the max possible size (4 bytes). That would be a dchar then...
Well not *actually* impossible but certainly something you'd only do if you didn't care about performance.
Another alternative would be to use a reference type, but that would use even more space. I honestly don't see how it would be possible while using less than 4 bytes and maintaining all other D features/properties (performance not considered).
 However, I was really talking on a
 conceptual level rather than an implementation level. As you and others
 have said, it would most likely be implemented as a 32-bit unsigned integer
 however certain bits are redundant and are thus (conceptually) not
 significant.
 
Thus one would have the dchar type, and this new "Unified String" would simply be a dchar[].

I've only skimmed through this discussion, but people (Regan & others?) wanted a string type that was space-efficient, allowing itself to be encoded in UTF-8, UTF-16, etc., thus dchar[]/uchar[] would not be acceptable. Unless you wanted this uchar[] to be a basic type by itself, and not an array of basic uchars (which would work, but would be a horrible design).

In fact, and I'm gonna go a bit into rant mode here (not directed at you in particular, Derek), but I've skimmed through this whole series of threads about Unicode and strings, and I'm getting a bit pissed with all of those meaningless posts based on wrong assumptions, wrong terminology, and crazy or unfeasible language changes, all of this for a problem I've yet to grasp why it cannot be fully solved with a dchar array or with a custom-made String class (custom-made, that is, *user-coded*, not part of the language). I admit I have no Unicode coding experience, so indeed *I may be* missing something, but in every new thread all I see is progressively more crazy, ridiculous ideas about a problem I do not see.

-- 
Bruno Medeiros - CS/E student
"Certain aspects of D are a pathway to many abilities some consider to be... unnatural."
Nov 26 2005
parent Derek Parnell <derek psych.ward> writes:
On Sat, 26 Nov 2005 13:30:28 +0000, Bruno Medeiros wrote:


[snip]
 Thus one would have the dchar type, and this new "Unified String" would
 be simply be a dchar[] .
[snip]
 I've skimmed through this whole series of
 threads about Unicode and strings, and 
[snip]
 I've yet failed to grasp why it cannot be fully solved with a
 dchar array or with a custom made String class (custom-made, that is,
 *user-coded*, not part of the language). I admit I have no Unicode
 coding experience, so indeed *I may be* missing something, but on every
 new thread made all I see is progressively more crazy, ridiculous ideas
 about a problem I do not see.
I agree that it is much better to identify the problem *before* one tries to fix it. I'm sure Walter has been having a nice little chuckle at our meandering ways.

I see 'the problem' as ...

** We have choice about the representation of strings in D, and thus at times we introduce a degree of ambiguity in our code that the compiler has trouble resolving.

** The 'char' data type is performing multiple roles. In one aspect, it is a fragment of a character in a UTF-8 encoded string, and in other aspects it is a byte-sized character for ASCII and C/C++ compatibility purposes. This can be confusing to coders not used to thinking internationally.

** Indexing strings that are based on 'char' and 'wchar' can cause bugs, because it is possible to access character fragments rather than complete characters.

There are some other issues which are not language related, and have to do with string manipulation that assumes ASCII strings only - such as the 'strip()' function, which doesn't recognize all the Unicode white-space characters, just the ASCII ones.

-- 
Derek Parnell
Melbourne, Australia
27/11/2005 7:40:57 AM
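(To make the third point concrete, a small sketch against current D and Phobos; the sample word is arbitrary.)

import std.stdio;
import std.utf : decode;

void main()
{
    string s = "smörgåsbord";

    // s[2] is not 'ö': it is the first of the two code units that encode 'ö'.
    writefln("s[2] = 0x%x", cast(ubyte) s[2]);

    // Getting the actual third character means decoding:
    size_t i = 2;
    dchar c = decode(s, i);    // c == 'ö'; i advances past both code units
    writeln(c);
}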
Nov 26 2005
prev sibling next sibling parent reply Oskar Linde <oskar.lindeREM OVEgmail.com> writes:
Regan Heath wrote:

 * add a new type/alias "utf", this would alias utf8, 16 or 32. It  
 represents the application specific native encoding. This allows 
 efficient  code, like:
 
 string s = "test";
 foreach(utf c; s) {
 }
 
 regardless of the applications selected native encoding.
I will rewrite this with your changed names (cp*):
 * add a new type/alias "cpn", this would alias cp1, cp2 or cp4. It
 represents the application specific native encoding. This allows
 efficient  code, like:

 string s = "test";
 foreach(cpn c; s) {
 }

 regardless of the applications selected native encoding.
Say you instead have:

string s = "smörgåsbord";
foreach(cpn c; s) {
}

This code would then work on Win32 (with UTF-16 being the native encoding), but not on Linux (with UTF-8). You have introduced platform dependence where there previously was none. What do you gain by this?

As I see it, there are only two views you need on a unicode string:
a) The code units
b) The unicode characters

By your suggestion, there would be a third view:
c) The unicode characters that are encoded by a single code unit.

Why is this useful? Should the "smörgåsbord" example above throw an error? Isn't what you want instead:

assert_only_contains_single_code_unit_characters_in_native_encoding(string)

/Oskar
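(For concreteness, the two views as they already exist in D's foreach; the sample word is the one above.)

import std.stdio;

void main()
{
    string s = "smörgåsbord";

    // view (a): the code units the string is stored as
    foreach (char u; s)
        writef("%02x ", cast(ubyte) u);    // 13 UTF-8 code units
    writeln();

    // view (b): the unicode characters
    foreach (dchar c; s)
        writef("%s ", c);                  // 11 characters, decoded on the fly
    writeln();
}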
Nov 24 2005
parent reply "Regan Heath" <regan netwin.co.nz> writes:
On Thu, 24 Nov 2005 10:18:20 +0100, Oskar Linde  
<oskar.lindeREM OVEgmail.com> wrote:
 Regan Heath wrote:

 * add a new type/alias "utf", this would alias utf8, 16 or 32. It   
 represents the application specific native encoding. This allows  
 efficient  code, like:
  string s = "test";
 foreach(utf c; s) {
 }
  regardless of the applications selected native encoding.
I will rewrite this with your changed names (cp*): > * add a new type/alias "cpn", this would alias cp1, cp2 or cp4. It > represents the application specific native encoding. This allows > efficient code, like: > > string s = "test"; > foreach(cpn c; s) { > } > > regardless of the applications selected native encoding. Say you instead have: string s = "smörgåsbord"; foreach(cpn c; s) { } This code would then work on Win32 (with UTF-16 being the native encoding), but not on Linux (with UTF-8).
No. "string" would be UTF-8 encoded internally on both platforms. My proposal stated that "cpn" would thus be an alias for "cp1" but clearly that idea isn't going to work in this case as (I'm assuming) it's impossible to represent some of those characters using a single byte. Java uses an int, maybe we should just do the same?
 You have introduced platform dependence where there previously was none.
 What do you gain by this?
No, there is no platform dependence. The choice of encoding is entirely up to the programmer: they choose a default encoding for each program they write, and it defaults to UTF-8.
 As I see it, there are only two views you need on a unicode string:
 a) The code units
 b) The unicode characters
(a) is seldom required. (b) is the common, and thus the goal, view IMO.
 By your suggestion, there would be a third view:
 c) The unicode characters that are encoded by a single code unit.
(c) was intended to be equal to (b). It was intended that we have 3 types so that ASCII programs would not be forced to use an int sized variable for single character values. It seems we're stuck doing that.
 Why is this useful?
It's not, it's not what I intended.
 Should the "smörgåsbord"-example above throw an error?
No, certainly not.
 Isn't what you want instead:
 assert_only_contains_single_code_unit_characters_in_native_encoding(string)
I have no idea what you mean here.

Regan
Nov 24 2005
parent reply Oskar Linde <oskar.lindeREM OVEgmail.com> writes:
Regan Heath wrote:
 On Thu, 24 Nov 2005 10:18:20 +0100, Oskar Linde  
 <oskar.lindeREM OVEgmail.com> wrote:
 
 Regan Heath wrote:

 * add a new type/alias "utf", this would alias utf8, 16 or 32. It   
 represents the application specific native encoding. This allows  
 efficient  code, like:
  string s = "test";
 foreach(utf c; s) {
 }
  regardless of the applications selected native encoding.
I will rewrite this with your changed names (cp*): > * add a new type/alias "cpn", this would alias cp1, cp2 or cp4. It > represents the application specific native encoding. This allows > efficient code, like: > > string s = "test"; > foreach(cpn c; s) { > } > > regardless of the applications selected native encoding. Say you instead have: string s = "smörgåsbord"; foreach(cpn c; s) { } This code would then work on Win32 (with UTF-16 being the native encoding), but not on Linux (with UTF-8).
No. "string" would be UTF-8 encoded internally on both platforms.
 My proposal stated that "cpn" would thus be an alias for "cp1" but 
Ok. I assumed cpn would be the platform native (preferred) encoding.
 clearly  that idea isn't going to work in this case as (I'm assuming) 
 it's  impossible to represent some of those characters using a single 
 byte. Java  uses an int, maybe we should just do the same?
D uses dchar. Better would maybe be to rename it to char (or maybe character), giving:

utf8  (today's char)
utf16 (today's wchar)
char  (today's dchar)
 As I see it, there are only two views you need on a unicode string:
 a) The code units
 b) The unicode characters
(a) is seldom required. (b) is the common and thus goal view IMO.
Actually, I think it is the other way around. (b) is seldom required. You can search, split, trim, parse, etc. D's char[] without any regard to encoding. This is the beauty of UTF-8, and the reason D strings all work on code units rather than characters.

When would you actually need character-based indexing? I believe the answer is less often than you think.

/Oskar
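(A small illustration of that point with current Phobos; the sample text is made up.)

import std.stdio;
import std.string : indexOf;

void main()
{
    string s = "smörgåsbord är gott";

    // indexOf works on code units, and any index it returns is a valid
    // slice boundary, so this kind of work needs no decoding at all.
    auto i = s.indexOf(' ');
    writeln(s[0 .. i]);        // smörgåsbord
    writeln(s[i + 1 .. $]);    // är gott
}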
Nov 24 2005
parent reply "Regan Heath" <regan netwin.co.nz> writes:
On Thu, 24 Nov 2005 12:01:04 +0100, Oskar Linde  
<oskar.lindeREM OVEgmail.com> wrote:
 Regan Heath wrote:
 On Thu, 24 Nov 2005 10:18:20 +0100, Oskar Linde   
 <oskar.lindeREM OVEgmail.com> wrote:

 Regan Heath wrote:

 * add a new type/alias "utf", this would alias utf8, 16 or 32. It    
 represents the application specific native encoding. This allows   
 efficient  code, like:
  string s = "test";
 foreach(utf c; s) {
 }
  regardless of the applications selected native encoding.
I will rewrite this with your changed names (cp*): > * add a new type/alias "cpn", this would alias cp1, cp2 or cp4. It > represents the application specific native encoding. This allows > efficient code, like: > > string s = "test"; > foreach(cpn c; s) { > } > > regardless of the applications selected native encoding. Say you instead have: string s = "smörgåsbord"; foreach(cpn c; s) { } This code would then work on Win32 (with UTF-16 being the native encoding), but not on Linux (with UTF-8).
No. "string" would be UTF-8 encoded internally on both platforms.
 My proposal stated that "cpn" would thus be an alias for "cp1" but
Ok. I assumed cpn would be the platform native (preferred) encoding.
Not platform native, application native. But it's not going to work anyway. It seems an int-sized char type is required; I was trying to avoid that.
 clearly  that idea isn't going to work in this case as (I'm assuming)  
 it's  impossible to represent some of those characters using a single  
 byte. Java  uses an int, maybe we should just do the same?
D uses dchar. Better would maybe be to rename it to char (or maybe character), giving: utf8 (todays char) utf16 (todays wchar) char (todays dchar)
 As I see it, there are only two views you need on a unicode string:
 a) The code units
 b) The unicode characters
(a) is seldom required. (b) is the common and thus goal view IMO.
Actually, I think it is the other way around. (b) is seldom required. You can search, split, trim, parse, etc.. D:s char[], without any regard of encoding.
If you split it without regard for encoding you can get 1/2 a character, which is then an illegal UTF-8 sequence.
 This is the beauty of UTF-8 and the reason D strings all work on code  
 units rather than characters.
But people don't care about code units; they care about characters. When do you want to inspect or modify a single code unit? I would say just about never. On the other hand, you might want to change the 4th character of "smörgåsbord", which may be the 4th, 5th, 6th, or 7th index in a char[] array. Ick.
 When would you actually need character based indexing?
 I believe the answer is less often than you think.
Really? How often do you care what UTF-8 fragments are used to represent your characters? The answer _should_ be never; however, D forces you to know: you cannot replace the 4th letter of "smörgåsbord" without knowing. This is the problem, IMO.

Regan
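(For reference, this is roughly what that replacement looks like with today's char[] and std.utf; in other words, exactly the encoding awareness being objected to.)

import std.stdio;
import std.utf : stride, toUTFindex;

void main()
{
    string s = "smörgåsbord";

    // Replacing the 4th character (index 3) takes two UTF-aware steps:
    // find where the character starts, and how wide it is.
    size_t start = toUTFindex(s, 3);    // code-unit offset of character 3
    size_t width = stride(s, start);    // its size in code units
    string t = s[0 .. start] ~ "R" ~ s[start + width .. $];

    writeln(t);    // smöRgåsbord
}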
Nov 24 2005
next sibling parent reply =?ISO-8859-15?Q?Jari-Matti_M=E4kel=E4?= <jmjmak invalid_utu.fi> writes:
Regan Heath wrote:
 But people don't care about code units, they care about characters. 
 When  do you want to inspect or modify a single code unit? I would say, 
 just  about never. On the other hand you might want to change the 4th 
 character  of "smörgåsbord" which may be the 4th, 5th, 6th, or 7th index 
 in a char[]  array. Ick.
True. BTW, is there a bug in std.string.insert? I tried to do:

char[] a = "blaahblaah";
std.string.insert(a, 5, "öö");
std.stdio.writefln(a);

Outputs:

blaahblaah
 When would you actually need character based indexing?
 I believe the answer is less often than you think.
Really? How often do you care what UTF-8 fragments are used to represent your characters? The answer _should_ be never, however D forces you to know, you cannot replace the 4th letter of "smörgåsbord" without knowing. This is the problem, IMO.
I agree. You don't need it very often, but when you do, there's currently no way to do it. I think char[] slicing and indexing should be a bit better (work at the Unicode character level), since you _never_ want to change code units. (And in case you do, just cast it to void[].)

Jari-Matti
Nov 24 2005
parent Derek Parnell <derek psych.ward> writes:
On Fri, 25 Nov 2005 00:31:32 +0200, Jari-Matti Mäkelä wrote:


 True. BTW, is there a bug in std.string.insert? I tried to do:
 
 char[] a = "blaahblaah";
 std.string.insert(a, 5, "öö");
 std.stdio.writefln(a);
 
 Outputs:
 
 blaahblaah
No bug. The function is not designed to update the same string passed to the function. It returns an updated string.

char[] a = "blaahblaah";
a = std.string.insert(a, 5, "öö");
std.stdio.writefln(a);

-- 
Derek (skype: derek.j.parnell)
Melbourne, Australia
25/11/2005 10:04:41 AM
Nov 24 2005
prev sibling parent reply Oskar Linde <oskar.lindeREM OVEgmail.com> writes:
In article <ops0rhghyl23k2f5 nrage.netwin.co.nz>, Regan Heath says...
On Thu, 24 Nov 2005 12:01:04 +0100, Oskar Linde  
<oskar.lindeREM OVEgmail.com> wrote:
 As I see it, there are only two views you need on a unicode string:
 a) The code units
 b) The unicode characters
(a) is seldom required. (b) is the common and thus goal view IMO.
Actually, I think it is the other way around. (b) is seldom required. You can search, split, trim, parse, etc.. D:s char[], without any regard of encoding.
If you split it without regard for encoding you can get 1/2 a character, which is then an illegal UTF-8 sequence.
By split, I meant this:

char[][] words = "abc def ghi åäö jkl".split(" ");
 This is the beauty of UTF-8 and the reason D strings all work on code  
 units rather than characters.
But people don't care about code units, they care about characters. [...]
Most of the time people care about string contents, neither code units nor characters.

Naturally, I'm biased by my own experience. I have written a few applications in D dealing with UTF-8 data, including parsing grammar definition files and communicating with web servers, but not once have I needed character-based indexing. One reason may be that all delimiters used are ASCII, and therefore only occupy a single code unit, but I would assume that this is typical for most data.

When dealing with UTF-8 streams, you want searching and parsing to work on indices (positions) within this stream, not on the character count up to this position. A code unit index gives you the direct byte position in the stream, whereas a character index would require iterating the entire stream up to the indexed position. The performance difference is hardly negligible.
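(The difference in cost shows up in how the two kinds of index are used; the sample text is borrowed from the split example above, and the character index 16 is just where "jkl" happens to start.)

import std.stdio;
import std.string : indexOf;
import std.utf : toUTFindex;

void main()
{
    string s = "abc def ghi åäö jkl";

    // A code-unit index is a direct byte position: O(1) to slice with.
    auto unitIdx = s.indexOf("jkl");
    writeln(s[unitIdx .. $]);                  // jkl

    // A character index first has to be walked to from the start: O(n).
    size_t charIdx = 16;
    writeln(s[s.toUTFindex(charIdx) .. $]);    // jkl, after scanning 16 characters
}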
[...] When  
do you want to inspect or modify a single code unit? I would say, just  
about never. On the other hand you might want to change the 4th character  
of "smörgåsbord" which may be the 4th, 5th, 6th, or 7th index in a char[]  
array. Ick.
How often do you need to change the 4th character of a string? I think that scenario is just as unlikely.

Of course, there are cases where you need character-based access, and that is what dchar[] is ideal for*. If you instead want to sacrifice performance for better memory footprint, use a wrapper class. What I don't agree with is making this sacrifice in performance the default, when its gains are so seldom needed.

*) In many cases, such as word processors and similar, you need more efficient data structures than flat arrays. A basic character-based string would not be of much help.
 When would you actually need character based indexing?
 I believe the answer is less often than you think.
Really? How often do you care what UTF-8 fragments are used to represent your characters? The answer _should_ be never, however D forces you to know, you cannot replace the 4th letter of "smörgåsbord" without knowing. This is the problem, IMO.
I very seldom, if ever, care what UTF-8 fragments are used to represent the data, as long as I know that ASCII characters (those whose character literals are assignable to a char) are represented by a single code unit.

You say that users of char[] need to know some things about UTF-8, and I can't argue with that. Maybe the docs should recommend dchar[] for users that want to remain UTF-ignorant. :)

Regards,
Oskar
Nov 25 2005
parent "Kris" <fu bar.com> writes:
"Oskar Linde" <oskar.lindeREM OVEgmail.com> wrote
 Most of the time people care about string contents, neither code units nor
 characters.
 Naturally, I'm biased by my own experience. I have written a few 
 applications in
 D dealing with UTF-8 data, including parsing grammar definition files and
 communicating with web servers, but not once have I needed character based
 indexing. One reason may be that all delimeters used are ASCII, and 
 therefore
 only occupy a single code unit, but I would assume that this is typical 
 for most
 data.
Absolutely right. This is why, for example, URI classes will remain char[] based. IRI extensions are applied simply by assuming the content is utf8.
[...] When
do you want to inspect or modify a single code unit? I would say, just
about never. On the other hand you might want to change the 4th character
of "smörgåsbord" which may be the 4th, 5th, 6th, or 7th index in a char[]
array. Ick.
How often do you need to change the 4th character of a string? I think that scenario is just as unlikely. Of course, there are cases where you need character based access, and that is what dchar[] is ideal for*. If you instead want to sacrifice performance for better memory footprint, use a wrapper class. What I don't agree with is making this sacrifice in performance the default, when its gains are so seldom needed. *) In many cases, such as word processors and similar, you need more efficient data structures than flat arrays. A basic character based string would not be of much help.
Right on. Such things are very much application specific. That, IMO, is where much of the general confusion stems from.
Nov 25 2005
prev sibling next sibling parent reply =?ISO-8859-15?Q?Jari-Matti_M=E4kel=E4?= <jmjmak invalid_utu.fi> writes:
Regan, your proposal is absolutely too complex. I don't get it and I 
really don't like it. D is supposed to be a _simple_ language. Here's an 
alternative proposal:

-Allowed text string types:

char, char[] (we don't need silly aliases nor wchar/dchar)

-Text string implementation:

char - Unicode code unit (UTF-8; it's up to the compiler vendor to 
decide between 1-4 bytes and an int)

char[] - array of char-types, thus a valid Unicode string encoded in 
UTF-8, no BOM is needed because all char[]s are UTF-8.

-Text string operations:

char a = 'ä', b = 'å';
char[] s = "åäö", t;

t ~= a;		// t == [ 'ä' ]
t ~= b;		// t == [ 'ä', 'å' ] == "äå"

s[1..3] == "äö"

foreach(char c; s) writefln(c);	// outputs: å \n ä \n ö \n

-I/O:

writef/writefln - does implicit conversion (utf-8 -> terminal encoding)
puts/gets -

File I/O - through UnicodeStream() (handles encoding issues)

-Conversion:

std.utf - two functions needed:

byte[] encode(char[] string, EncodingType et)
char[] decode(byte[] stream, EncodingType et)

-Compatibility:

This new char[] is fully compatible with C-language char*, when 0-127 
ASCII-values and a trailing zero-value are used.

Access to Windows/Unix-API available (std.utf.[en/de]code)
Access to Unicode files available (std.stream.UnicodeStream)

-Advantages:

OS/compiler vendor independent
Easy to use

-Disadvantages:

Hard to implement (or is it? Walter seems to have problems with UTF-8; 
OTOH this proposal doesn't require you to implement strings using UTF-8, 
you can also use "fixed-width" UTF-16/32)
It's not super high performance (a lot of conversion is needed on Windows 
& legacy systems)
Indexing problem (as UTF-8 streams are variable length, it's hard to 
tell the exact position of a single character. This affects all string 
operations except concatenation.)

---

Please stop whining about the slowness of utf-conversions. If it's 
really so slow, I would certainly want to see some real world benchmarks.
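(A rough library-level sketch of the proposed encode/decode pair, built on today's std.utf. ubyte[] and string stand in for the proposal's byte[] and char[], the enum name is illustrative, and BOM/byte-order handling is ignored for brevity.)

import std.utf : toUTF8, toUTF16, toUTF32;

enum EncodingType { utf8, utf16, utf32 }

ubyte[] encode(string s, EncodingType et)
{
    final switch (et)
    {
        case EncodingType.utf8:  return cast(ubyte[]) s.dup;
        case EncodingType.utf16: return cast(ubyte[]) toUTF16(s).dup;
        case EncodingType.utf32: return cast(ubyte[]) toUTF32(s).dup;
    }
    assert(0);   // unreachable
}

string decode(ubyte[] stream, EncodingType et)
{
    final switch (et)
    {
        case EncodingType.utf8:  return toUTF8(cast(char[]) stream);
        case EncodingType.utf16: return toUTF8(cast(wchar[]) stream);
        case EncodingType.utf32: return toUTF8(cast(dchar[]) stream);
    }
    assert(0);   // unreachable
}

unittest
{
    auto bytes = encode("åäö", EncodingType.utf16);
    assert(decode(bytes, EncodingType.utf16) == "åäö");
}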
Nov 24 2005
next sibling parent Dawid =?UTF-8?B?Q2nEmcW8YXJraWV3aWN6?= <dawid.ciezarkiewicz gmail.com> writes:
Jari-Matti Mäkelä wrote:

 Regan, your proposal is absolutely too complex. I don't get it and I
 really don't like it. D is supposed to be a _simple_ language. Here's an
 alternative proposal:
 [CUT]
+1
Nov 24 2005
prev sibling parent reply "Regan Heath" <regan netwin.co.nz> writes:
On Thu, 24 Nov 2005 13:34:51 +0200, Jari-Matti Mäkelä  
<jmjmak invalid_utu.fi> wrote:
 Regan, your proposal is absolutely too complex. I don't get it and I  
 really don't like it. D is supposed to be a _simple_ language. Here's an  
 alternative proposal:
<snip>

Thanks for your opinion. It appears some parts of my idea were badly thought out. I was trying to end up with something simple; it seems a few of my choices were bad ones and they simply complicated the idea.

I was trying to avoid picking any 1 type over the others (as you have suggested here). It appears now that I should replace all my talk about cp1, cp2, cp4 and cpn with "all characters are stored in a 32-bit type called uchar". If anyone has a problem with that, I'd direct them to take a look at std.format.doFormat and std.stdio.writef, which convert all char[] data into individual dchars before converting it back to UTF-8 for output to the screen.
 Please stop whining about the slowness of utf-conversions. If it's  
 really so slow, I would certainly want to see some real world benchmarks.
I mention performance only because people have been concerned with it in the past. I too have no idea how much time it takes and would like to see a benchmark. The fact that D is already doing it with writef and no-one has complained...

Regan.
Nov 24 2005
parent =?ISO-8859-15?Q?Jari-Matti_M=E4kel=E4?= <jmjmak invalid_utu.fi> writes:
Regan Heath wrote:
 On Thu, 24 Nov 2005 13:34:51 +0200, Jari-Matti Mäkelä  
 <jmjmak invalid_utu.fi> wrote:
 
 Regan, your proposal is absolutely too complex. I don't get it and I  
 really don't like it. D is supposed to be a _simple_ language. Here's 
 an  alternative proposal:
Sorry for being a bit impolite. I just wanted to show that it's completely possible to write Unicode-compliant programs without the need for several string keywords. I believe a careful design & implementation removes most of the performance drawbacks.
 Thanks for your opinion. It appears some parts of my idea were badly  
 thought out. I was trying to end up with something simple, it seems a 
 few  of my choices were bad ones and they simply complicated the idea.
 
Thanks for bringing up some conversation. As you can see, neither of us is perfect => designing a modern programming language isn't as easy as it might have seemed.
 I was trying to avoid picking any 1 type over the others (as you have  
 suggested here).
Actually, I have to change my opinion. I think it would be good if the compiler were allowed to choose the correct encoding. I don't think there will be any serious problems, since nowadays most Win32 things use UTF-16 and *nix systems UTF-8.
 It appears now that I should replace all my talk about cp1, cp2, cp4 
 and  cpn with "all characters are stored in a 32 bit type called uchar". 
 If  anyone has a problem with that, I'd direct them to take a look at  
 std.format.doFormat and std.stdio.writef which convert all char[] data  
 into individual dchar's before converting it back to UTF-8 for output 
 to  the screen.
That is one solution. Although I might let the compiler decide the encoding.
 Please stop whining about the slowness of utf-conversions. If it's  
 really so slow, I would certainly want to see some real world benchmarks.
I mention performance only becase people have been concerned with it in the past. I too have no idea how much time it takes and would like to see a benchmark. The fact that D is already doing it with writef and no-one has complained...
I can't say anything about the overall complexity class for programs that do Unicode, but at least my simple experiments [1] show that unoptimized use of writefln is 'only' 50% slower than optimized use of printf in C (both using the same gcc backend). Though I'm not 100% sure this program of mine actually did any transcoding. In addition, I think most 'static' conversions can be precalculated.

[1] http://www.digitalmars.com/drn-bin/wwwnews?digitalmars.D.learn/1983

Jari-Matti
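(For anyone who wants numbers for the transcoding step itself, a trivial and unscientific sketch using today's Phobos; the input text and repeat count are arbitrary, and this says nothing about writefln.)

import std.array : join;
import std.datetime.stopwatch : AutoStart, StopWatch;
import std.range : repeat;
import std.stdio;
import std.utf : toUTF16;

void main()
{
    // Times only the pure transcoding step, on made-up data.
    string s = "smörgåsbord ".repeat(100_000).join;

    auto sw = StopWatch(AutoStart.yes);
    auto w = toUTF16(s);
    sw.stop();

    writefln("transcoded %s UTF-8 code units into %s UTF-16 code units in %s ms",
             s.length, w.length, sw.peek.total!"msecs");
}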
Nov 24 2005
prev sibling parent "Regan Heath" <regan netwin.co.nz> writes:
I want to thank everyone for reading and posting opinions on my proposal.

It appears I have done a bad job explaining some of it, and some of it  
simply doesn't work. I have a modified idea in mind which I think might  
work a whole bunch better and should also be much simpler too.

Thanks everyone.

Regan
Nov 24 2005