digitalmars.D - Unified String Theory..
- Regan Heath (221/221) Nov 23 2005 With the recent Physics slant on some posts here I couldn't resist that ...
- Derek Parnell (34/116) Nov 23 2005 LOL
- Regan Heath (46/137) Nov 23 2005 Yes, that is the purpose of char*, utf16*, etc. eg.
- Lionello Lunesu (11/11) Nov 23 2005 Hi Regan,
- Lionello Lunesu (4/4) Nov 23 2005 By the way, I like the proposal! I prefer different compiled libraries t...
- Regan Heath (10/20) Nov 24 2005 No. I seem to have done a bad job of explaining it _and_ picked terrible...
- Lionello Lunesu (16/19) Nov 24 2005 In that case I don't like your idea : )
- Regan Heath (12/35) Nov 24 2005 Yeah, I'm starting to think that is the only way it works. The 3 types
- Georg Wrede (6/23) Nov 24 2005 Must've been the specters in the night again. :-)
- Derek Parnell (10/38) Nov 24 2005 I think that would interfere with the slice concept.
- Georg Wrede (2/15) Nov 24 2005 Slicing C's char[] implies byte-wide, and non-UTF.
- Derek Parnell (9/26) Nov 24 2005 Exactly, and that why I'm worried by the suggestion that char[] be
- Georg Wrede (6/29) Nov 25 2005 With what we're doing with the utf, it would be a small additional job
- Oskar Linde (12/15) Nov 24 2005 I think you are making this more complicated than it is by using the
- Regan Heath (9/28) Nov 24 2005 I agree, it appears my choice of type names was really confusing. I have...
- Oskar Linde (12/28) Nov 24 2005 This is not defined. strcmp doesn't care. strlen etc only counts bytes
- Regan Heath (30/30) Nov 24 2005 Ok, it appears I picked some really bad type names in my proposal and it...
- Regan Heath (6/6) Nov 24 2005 Replying to myself now, in addition to bolloxing the initial proposal up...
- Georg Wrede (22/22) Nov 24 2005 Congrats, Regan! Great job!
- Regan Heath (19/41) Nov 24 2005 Yes. That's exactly what I was thinking. However it appears that the ide...
- Georg Wrede (7/22) Nov 24 2005 Yes. I see no way to avoid "cpn" being 32 bit only.
- Georg Wrede (59/71) Nov 24 2005 I hope you are not implying that indexing would choose between cp1..4
- Regan Heath (20/66) Nov 24 2005 I didn't think this part thru enough and Oskar gave me an example which ...
- Derek Parnell (10/11) Nov 24 2005 I also use the Euphoria programming language and this uses 32-bit
- Regan Heath (5/11) Nov 24 2005 Interesting. In that case I think my "string" type has an advantage. The...
- Georg Wrede (9/48) Nov 24 2005 Not too many have dissected writef. Or else we'd have heard some
- Derek Parnell (53/55) Nov 24 2005 Regan,
- Regan Heath (14/44) Nov 24 2005 Derek, I must have done a terrible job explaining this, because you've
- Derek Parnell (23/76) Nov 24 2005 You seemed to be wanting to have data types that could only hold
- Regan Heath (19/43) Nov 24 2005 No, never fragments, always complete code points. I tried to stress this...
- Bruno Medeiros (12/40) Nov 25 2005 Whoa, did you ever stop to think on the implications of having a
- Derek Parnell (14/28) Nov 25 2005 Well not *actually* impossible but certainly something you'd only do if ...
- Bruno Medeiros (28/58) Nov 26 2005 Another alternative would be to use a reference type, but that would use
- Derek Parnell (26/36) Nov 26 2005 [snip]
- Oskar Linde (20/36) Nov 24 2005 Say you instead have:
- Regan Heath (18/53) Nov 24 2005 No. "string" would be UTF-8 encoded internally on both platforms.
- Oskar Linde (14/60) Nov 24 2005 D uses dchar. Better would maybe be to rename it to char (or maybe
- Regan Heath (17/74) Nov 24 2005 Not platform native, application native. But it's not going to work
- Jari-Matti Mäkelä (13/25) Nov 24 2005 True. BTW, is there a bug in std.string.insert? I tried to do:
- Derek Parnell (12/21) Nov 24 2005 No bug. The function is not designed to update the same string passed to
- Oskar Linde (33/59) Nov 25 2005 By split, I meant this:
- Kris (5/37) Nov 25 2005 Absolutely right. This is why, for example, URI classes will remain char...
- Jari-Matti Mäkelä (45/45) Nov 24 2005 Regan, your proposal is absolutely too complex. I don't get it and I
- Dawid Ciężarkiewicz (2/6) Nov 24 2005 +1
- Regan Heath (19/24) Nov 24 2005
- Jari-Matti Mäkelä (21/46) Nov 24 2005 Sorry for being a bit impolite. I just wanted to show that it's
- Regan Heath (6/6) Nov 24 2005 I want to thank everyone for reading and posting opinions on my proposal...
With the recent Physics slant on some posts here I couldn't resist that subject. In actual fact this is an idea for string handling in D which I have cooked up recently.

I am going to paste the text here and attach my original document; the document may be easier to read than the NG.

I like this idea. It may however be too much of a change for D; I'm hoping the advantages outweigh this fact but I'm not going to hold my breath. It is possible I have missed something obvious and/or am talking out of a hole in my head; if that is the case I would appreciate being told so, politely ;)

Enough rambling, here it is, be nice!

-----
Proposal: A single unified string type.
Author  : Regan Heath
Version : 1.0a
Date    : 24 Nov 2005 +1300 (New Zealand DST)

[Preamble/Introduction]
After the recent discussion on Unicode, UTF encodings and the current D situation it occurred to me that many of the issues D has with strings could be side-stepped if there was a single string type.

In the past we have assumed that to obtain this we have to choose one of the 3 available types and encodings. This wasn't an attractive option because each type has different pros/cons and each application may prefer one type over another. Another suggested solution was a string class which hides the details; this solution suffers from being a class, with the limitations that imposes, and from not being tied directly into the language.

My proposal is a single "string" type built into the language, which:
* can represent its string data in any given UTF encoding;
* allows slicing of "characters" as opposed to what are essentially bytes, shorts, and ints;
* has a default encoding that can be selected at compile time, or specified at runtime;
* will implicitly or explicitly transcode where required.
There are some requirements for this to be possible, namely knowledge of the UTF encodings being built into D. These requirements may count against the proposal, as they increase the knowledge required to write a D compiler. However it occurs to me that DMD (and thus D?) already requires a fair bit of UTF knowledge.

[Key]
First, let's start with some terminology. These are the terms I am going to be using and what they mean. If these are incorrect please correct me, but take them to have the stated meanings for this document.

code point      := the Unicode value for a single and complete character.
code unit       := part of, or a complete character in one of the 3 UTF encodings UTF-8,16,32.
code value      := AKA code unit.
transcoding     := the process of converting from one encoding to another.
source          := a file, the keyboard, a tcp socket, a com port, an OS/C function call, a 3rd party library.
sink            := a file, the screen, a tcp socket, a com port, an OS/C function call, a 3rd party library.
native encoding := application specific "preferred" encoding (more on this later)
string          := a sequence of code points.

Anything I am unsure about will be suffixed with (x) where x is a letter of the alphabet, and my thoughts will be detailed in the [Questions] section.

[Assumptions]
These are what I base my argument/suggestion on. If you disagree with any of these you will likely disagree with the proposal. If that is the case please post your concerns with any given assumption in its own post (I would like to discuss each issue in its own thread and avoid mixing several issues).

transcoded to/from any UTF encoding with no loss of data/meaning.
mention the possible runtime penalty wherever appropriate.
output. Input is the process of obtaining data from a source. Output is the process of sending data to a sink. In either case the source or sink will have a fixed encoding and if that encoding does not match the native encoding the application will need to transcode.
(see definitions above for what classifies as a source or sink)

[Details]
Many of the details are flexible, i.e. the names of the types etc. The important/inflexible details are how it all fits together and achieves its results.

I've chosen a bullet point format and tried to make each change/point as succinct and clear as possible. Feel free to ask for clarification on any point or points, to ask general questions, or to pose general problems. I will do my best to answer all questions.

* remove char[], wchar[] and dchar[].
* add a new type "string". "string" will store code points in the application specific native encoding and be implicitly or explicitly transcoded as required (more below).
* the application specific native encoding will default to UTF-8. An application can choose another with a compile option or pragma. This choice will have no effect on the behaviour of the program (as we only have 1 type and all transcoding is handled where required); it will only affect performance. The performance cost cannot be avoided, presuming it is only being done at input and output (which is part of what this proposal aims to achieve). This cost is application specific and will depend on the tasks and data the application is designed to perform and use. Given that, letting the programmer choose a native encoding will allow them to test different encodings for speed and/or provide different builds based on the target language, eg an application destined to be used with the Japanese language would likely benefit from using UTF-32 internally/natively.
* keep char, wchar, and dchar but rename them utf8, utf16, utf32. These types represent code points (always, not code units/values) in each encoding. Only code points that fit in utf8 will ever be represented by utf8, and so on. Thus some code points will always be utf32 values and never utf8 or 16.
  (much like byte/short/int)
* add promotion/comparison rules for utf8, 16 and 32:
  - any given code point represented as utf8 will compare equal to the same code point represented as a utf16 or utf32 and vice versa(a)
  - any given code point represented as utf8 will be implicitly converted/promoted to the same code point represented as utf16 or utf32 as required and vice versa(a). If promotion from utf32 to utf16 or 8 causes loss in data it should be handled just like int to short or byte.
* add a new type/alias "utf", this would alias utf8, 16 or 32. It represents the application specific native encoding. This allows efficient code, like:

    string s = "test";
    foreach(utf c; s) { }

  regardless of the application's selected native encoding.
* slicing string gives another string
* indexing a string gives a utf8, 16, or 32 code point.
* string literals would be of type "string" encoded in the native encoding, or if another encoding can be determined at compile time, in that encoding (see ASCII example below).
* character literals would default to the native encoding, failing that the smallest possible type, and promoted/converted as required.
* there are occasions where you may want to use a specific encoding for a part of your application, perhaps you're loading a UTF-16 file and parsing it. If all the work is done in a small section of code and it doesn't interact with the bulk of your application data, which is all in UTF-8, your native encoding is likely to be UTF-8. In this case, for performance reasons, you want to be able to specify the encoding to use for your "string" types at runtime; they are exceptions to the native encoding. To do this we specify the encoding at construction/declaration time, eg.

    string s(UTF16);
    s.utf16 = ..data read from UTF-16 source..

  (or similar, the exact syntax is not important at this stage) thus...
* the type of encoding used by "string" should be selectable at runtime. Some sort of encoding type flag must exist for each string at runtime. This is starting to head into "implementation details" which I want to avoid at this point, however it is important to note the requirement.

[Output]
* the type "char" will still exist, it will now _only_ represent a C string, thus when a string is passed as a char it can be implicitly transcoded into ASCII(b) with a null terminator, eg.

    int strcmp(const char *src, const char *dst);
    string test = "this is a test";
    if (strcmp(test,"this is a test")==0) { }

  the above will implicitly transcode 'test' into ASCII and ensure there is a null terminator. The literal "this is a test" will likely be stored in the binary as ASCII with a null terminator.
* Native OS functions requiring "char" will use the rule above. eg.

    CreateFileA(char *filename...

* Native OS functions requiring Unicode will be defined as:

    CreateFileW(utf16 *filename...

  and "string" will be implicitly transcoded to utf16, with a null terminator added.
* When the required encoding is not apparent, eg.

    void CreateFile(char *data) { }
    void CreateFile(utf16 *data) { }
    string test = "this is a test";
    CreateFile(test);

  an explicit property should be used, eg.

    CreateFile(test.char);
    CreateFile(test.utf16);

  NOTE: this problem still exists! It should however now be relegated to interaction with C APIs as opposed to happening for native D methods.

[Input]
* Old encodings, Latin-1 etc would be loaded into ubyte[] or byte[] and could be cast (painted) to char*, utf8*, 16 or 32 or converted to "string" using a routine i.e. string toStringFromXXX(ubyte[] raw).
* A stream class would have a selectable encoding and hide these details from us, handling the data and giving a natively encoded "string" instead. Meaning, transcoding will naturally occur on input or output where required.
[Example application types and the effect of this change]
* the quick and dirty console app which handles ASCII only. Its native encoding will be UTF-8, and no transcoding will ever need to occur (assuming none of its input or output is in another encoding)
* an app which loads files in different encodings and needs to process them efficiently. In this case the code can select the encoding of "string" at runtime and avoid transcoding the data until such time as it needs to interface with another part of the application in another encoding or it needs to output to a sink, also in another encoding.
* an international app which will handle many languages. This app can be custom built with the native string type selected to match each language.

[Advantages]
As I see it, this change would have the following advantages:
* "string" requires no knowledge of UTF encodings (and the associated problems) to use, making it easy for beginners and for a quick and dirty program.
* "string" can be sliced/indexed by character regardless of the encoding used for the data.
* overload resolution has only 1 type, not 3 to choose from.
* code written in D would all use the same type "string". No more "this library uses char[], this one wchar[] and my app dchar[]" problems.

[Disadvantages]
* requirements listed below
* libraries built for a different native type will likely cause transcoding. This problem already exists; at least with this suggestion the library can be built 3 times, once for each native encoding, and the correct one linked to your app.
* possibility of implicit and silent transcoding. This can occur between libraries built with different native encodings and between "string" and char*, utf8*, utf16* and utf32*. The compiler _could_ identify all such locations if desired.

[Requirements]
In order to implement all this "string" requires knowledge of all code points, how they are encoded in the 3 encodings and how to compare and convert between them.
So, D and thus any D compiler eg DMD, requires this knowledge. I am not entirely sure just how big an "ask" this is. I believe DMD and thus D already has much of this capability built in.

[Questions]
(a) Is UTF-8 a subset of UTF-16 and so on? Does the code point for 'A' have the numerical value 65 decimal in UTF-8, UTF-16 _and_ UTF-32, in other words is it the same numerical value in all encodings? If so then comparing utf8, 16 and 32 is no different to comparing byte, short and int and all the same promotion and comparison rules can apply.
(b) Is this really ASCII or is it system dependent? i.e. Latin-1 or similar. Is it ASCII values 127 or less perhaps? To be honest I'm not sure.
Nov 23 2005
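Question (a) and the "no loss of data" assumption can be checked against D as it stands today. The following is only a sketch, assuming a present-day Phobos in which `std.utf` provides `toUTF8`, `toUTF16` and `toUTF32`; the helper name `roundTrip` is made up for this illustration and is not part of the proposal.

```d
import std.utf : toUTF8, toUTF16, toUTF32;

/// Hypothetical helper: transcode UTF-8 -> UTF-16 -> UTF-32 -> UTF-8.
string roundTrip(string s)
{
    return toUTF8(toUTF32(toUTF16(s)));
}

void main()
{
    // Question (a): 'A' is code point 65 whichever character type holds it.
    char  a8  = 'A';
    wchar a16 = 'A';
    dchar a32 = 'A';
    assert(a8 == 65 && a16 == 65 && a32 == 65);

    // Nine code points: n, a, i-diaeresis, v, e, space, three CJK characters.
    string s8 = "na\u00EFve \u65E5\u672C\u8A9E";

    // Transcoding through all three encodings gives back the same text.
    assert(roundTrip(s8) == s8);

    // Same text, different number of code units per encoding.
    assert(s8.length == 16);          // UTF-8 code units (bytes)
    assert(toUTF16(s8).length == 9);  // UTF-16 code units
    assert(toUTF32(s8).length == 9);  // UTF-32 code units, one per code point
}
```

The code point values agree across encodings; what differs is how many code units each encoding needs to store them.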
On Thu, 24 Nov 2005 16:09:13 +1300, Regan Heath wrote:
> With the recent Physics slant on some posts here I couldn't resist that subject

LOL

> Enough rambling, here it is, be nice!

Just some quick thoughts are recorded here. More will come later I suspect.

[snip]

> [Key]
> First, let's start with some terminology. These are the terms I am going to be using and what they mean. If these are incorrect please correct me, but take them to have the stated meanings for this document.
> code point := the Unicode value for a single and complete character.
> code unit := part of, or a complete character in one of the 3 UTF encodings UTF-8,16,32.
> code value := AKA code unit.

The Unicode Consortium defines code value as the smallest (in terms of bits) value that will hold a character in the various encoding formats. Thus for UTF8 it is 1 byte, UTF16 = 2 bytes, and UTF32 = 4 bytes.

[snip]

> * remove char[], wchar[] and dchar[].

Do we still have to cater for strings that were formatted in specific encodings outside of our D applications? For example, a C library routine might insist that a pointer to a UTF16 string be supplied, thus we would have to force a specific encoding somehow.

> * add a new type "string". "string" will store code points in the application specific native encoding and be implicitly or explicitly transcoded as required (more below).
> * the application specific native encoding will default to UTF-8. An application can choose another with a compile option or pragma. This choice will have no effect on the behaviour of the program (as we only have 1 type and all transcoding is handled where required); it will only affect performance. The performance cost cannot be avoided, presuming it is only being done at input and output (which is part of what this proposal aims to achieve). This cost is application specific and will depend on the tasks and data the application is designed to perform and use. Given that, letting the programmer choose a native encoding will allow them to test different encodings for speed and/or provide different builds based on the target language, eg an application destined to be used with the Japanese language would likely benefit from using UTF-32 internally/natively.
> * keep char, wchar, and dchar but rename them utf8, utf16, utf32. These types represent code points (always, not code units/values) in each encoding. Only code points that fit in utf8 will ever be represented by utf8, and so on. Thus some code points will always be utf32 values and never utf8 or 16. (much like byte/short/int)

I think you've lost track of your 'code point' definition. A 'code point' is a character. All encodings can hold all characters, every character will fit into UTF8. Sure some might take 1, 2 or 4 'code values', but they are still all code points. There are no exclusive code points in utf32. Every UTF32 code point can also be expressed in UTF8.

> * add promotion/comparison rules for utf8, 16 and 32:
>   - any given code point represented as utf8 will compare equal to the same code point represented as a utf16 or utf32 and vice versa(a)
>   - any given code point represented as utf8 will be implicitly converted/promoted to the same code point represented as utf16 or utf32 as required and vice versa(a). If promotion from utf32 to utf16 or 8 causes loss in data it should be handled just like int to short or byte.

I assume by 'promotion' you really mean 'transcoding'. There is never any assumption.

> * add a new type/alias "utf", this would alias utf8, 16 or 32. It represents the application specific native encoding. This allows efficient code, like:
>     string s = "test";
>     foreach(utf c; s) { }

But utf8, utf16, and utf32 are *strings* not characters, so 'utf' could not be an *alias* for these in your example. I guess you mean it to be a term for a character (code point) in a utf string.

> regardless of the application's selected native encoding.
> * slicing string gives another string
> * indexing a string gives a utf8, 16, or 32 code point.
> * string literals would be of type "string" encoded in the native encoding, or if another encoding can be determined at compile time, in that encoding (see ASCII example below).
> * character literals would default to the native encoding, failing that the smallest possible type, and promoted/converted as required.

By 'smallest possible type' do you mean the smallest memory usage?

> * there are occasions where you may want to use a specific encoding for a part of your application, perhaps you're loading a UTF-16 file and parsing it. If all the work is done in a small section of code and it doesn't interact with the bulk of your application data, which is all in UTF-8, your native encoding is likely to be UTF-8. In this case, for performance reasons, you want to be able to specify the encoding to use for your "string" types at runtime; they are exceptions to the native encoding. To do this we specify the encoding at construction/declaration time, eg.
>     string s(UTF16);
>     s.utf16 = ..data read from UTF-16 source..
> (or similar, the exact syntax is not important at this stage)

But the idea is that a string has the property of 'utf8', and 'utf16' and 'utf32' encoding at runtime?

--
Derek
(skype: derek.j.parnell)
Melbourne, Australia
24/11/2005 2:34:13 PM
Nov 23 2005
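Derek's point that every code point fits in every encoding, just with a different number of code values, can be seen directly. A small sketch, assuming a present-day Phobos where `std.utf.codeLength` reports how many code units a code point needs in a given character type; the sample code points are chosen only for illustration.

```d
import std.utf : codeLength;

void main()
{
    // U+0041 'A': one code unit in every encoding.
    assert(codeLength!char('A') == 1);
    assert(codeLength!wchar('A') == 1);
    assert(codeLength!dchar('A') == 1);

    // U+20AC (euro sign): three UTF-8 units, one UTF-16 unit, one UTF-32 unit.
    assert(codeLength!char('\u20AC') == 3);
    assert(codeLength!wchar('\u20AC') == 1);
    assert(codeLength!dchar('\u20AC') == 1);

    // U+1D11E (musical G clef): four UTF-8 units, two UTF-16 units, one UTF-32 unit.
    assert(codeLength!char('\U0001D11E') == 4);
    assert(codeLength!wchar('\U0001D11E') == 2);
    assert(codeLength!dchar('\U0001D11E') == 1);
}
```

No code point is exclusive to UTF-32; the narrower encodings simply spend more code units on it.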
On Thu, 24 Nov 2005 15:04:08 +1100, Derek Parnell <derek psych.ward> wrote:
>> [Key]
>> First, let's start with some terminology. These are the terms I am going to be using and what they mean. If these are incorrect please correct me, but take them to have the stated meanings for this document.
>> code point := the Unicode value for a single and complete character.
>> code unit := part of, or a complete character in one of the 3 UTF encodings UTF-8,16,32.
>> code value := AKA code unit.
>
> The Unicode Consortium defines code value as the smallest (in terms of bits) value that will hold a character in the various encoding formats. Thus for UTF8 it is 1 byte, UTF16 = 2 bytes, and UTF32 = 4 bytes.

Thanks for the detailed description. That is what I meant above.

>> * remove char[], wchar[] and dchar[].
>
> Do we still have to cater for strings that were formatted in specific encodings outside of our D applications? For example, a C library routine might insist that a pointer to a UTF16 string be supplied, thus we would have to force a specific encoding somehow.

Yes, that is the purpose of char*, utf16*, etc. eg.

    int strlen(const char *string) {}
    int CreateFileW(utf16 *filename, ...

>> * keep char, wchar, and dchar but rename them utf8, utf16, utf32. These types represent code points (always, not code units/values) in each encoding. Only code points that fit in utf8 will ever be represented by utf8, and so on. Thus some code points will always be utf32 values and never utf8 or 16. (much like byte/short/int)
>
> I think you've lost track of your 'code point' definition.

Not so. I've just failed to explain what I mean here, let me try some more...

> A 'code point' is a character.

Correct.

> All encodings can hold all characters, every character will fit into UTF8. Sure some might take 1, 2 or 4 'code values', but they are still all code points. There are no exclusive code points in utf32. Every UTF32 code point can also be expressed in UTF8.

I realise all this. It is not what I meant above.

Think of the type "utf8" as being identical to "byte", except that the values it stores are always complete code points, never fragments or code units/values. The type "utf8" will never have part of a complete character in it, it'll either have the whole character or it will be an error. "utf8" can represent the range of code points which are between 0 and 255 (or perhaps it's 127, not sure).

Perhaps the name "utf8" is misleading. It's not in fact a UTF-8 code unit/value, it is a code point that fits in a byte. The reason it's not called "byte" is because the separate type is used to trigger transcoding, see my original utf16* example.

>> * add promotion/comparison rules for utf8, 16 and 32:
>>   - any given code point represented as utf8 will compare equal to the same code point represented as a utf16 or utf32 and vice versa(a)
>>   - any given code point represented as utf8 will be implicitly converted/promoted to the same code point represented as utf16 or utf32 as required and vice versa(a). If promotion from utf32 to utf16 or 8 causes loss in data it should be handled just like int to short or byte.
>
> I assume by 'promotion' you really mean 'transcoding'.

No, I think I mean promotion. This is one of the things I am not 100% sure of, bear with me.

The character 'A' has ASCII value 65 (decimal). Assuming its code point is 65 (decimal), then this code point will fit in my "utf8" type. Thus "utf8" can represent the code point 'A'. If you assign that "utf8" to a "utf16", eg.

    utf8 a = 'A';
    utf16 b = a;

The utf8 value will be promoted to a utf16 value. The value itself doesn't change (it's not transcoded). It happens in exactly the same way a byte is promoted to a short. Is promoted the right word?

That is, provided the value doesn't _need_ to change when going from utf8 to utf16. I am not 100% sure of this. I don't think it does. I believe all the code points that fit in the 1 byte type have the same numerical value in the 2 byte type (UTF-16), and also the 4 byte type (UTF-32).

>> * add a new type/alias "utf", this would alias utf8, 16 or 32. It represents the application specific native encoding. This allows efficient code, like:
>>     string s = "test";
>>     foreach(utf c; s) { }
>
> But utf8, utf16, and utf32 are *strings* not characters, so 'utf' could not be an *alias* for these in your example. I guess you mean it to be a term for a character (code point) in a utf string.

No, they're not, not in my proposal. I think I picked bad names.

utf, utf8, utf16, and utf32 are all types that store complete code points, never code units/values/fragments. Think of them as being identical to byte, short, and int.

>> regardless of the application's selected native encoding.
>> * slicing string gives another string
>> * indexing a string gives a utf8, 16, or 32 code point.
>> * string literals would be of type "string" encoded in the native encoding, or if another encoding can be determined at compile time, in that encoding (see ASCII example below).
>> * character literals would default to the native encoding, failing that the smallest possible type, and promoted/converted as required.
>
> By 'smallest possible type' do you mean the smallest memory usage?

Yes. utf8 is smaller than utf16 is smaller than utf32.

>> * there are occasions where you may want to use a specific encoding for a part of your application, perhaps you're loading a UTF-16 file and parsing it. If all the work is done in a small section of code and it doesn't interact with the bulk of your application data, which is all in UTF-8, your native encoding is likely to be UTF-8. In this case, for performance reasons, you want to be able to specify the encoding to use for your "string" types at runtime; they are exceptions to the native encoding. To do this we specify the encoding at construction/declaration time, eg.
>>     string s(UTF16);
>>     s.utf16 = ..data read from UTF-16 source..
>> (or similar, the exact syntax is not important at this stage)
>
> But the idea is that a string has the property of 'utf8', and 'utf16' and 'utf32' encoding at runtime?

Yes. But you will only need to use these properties when performing input or output (see my definitions of source and sink) and only when the type cannot be inferred by the context, i.e. it's not required here:

    int CreateFile(utf16* filename) {}
    string test = "test";
    CreateFile(test);

Regan
Nov 23 2005
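The promotion versus transcoding question can be tried out with the character types D has now. This is a sketch under stated assumptions: it uses today's `char`/`wchar`/`dchar` rather than the proposed utf8/utf16/utf32 types, relies on `std.utf.toUTF16` from a present-day Phobos, and the helper name `widen` is invented for the example.

```d
import std.utf : toUTF16;

/// Hypothetical helper: widen one UTF-8 code unit the way a byte widens to a short.
wchar widen(char c)
{
    return c; // implicit char -> wchar conversion, numeric value unchanged
}

void main()
{
    // For ASCII the code unit *is* the code point, so plain widening is correct.
    assert(widen('A') == 65);

    // U+00E9 (e-acute) occupies two UTF-8 code units: 0xC3 0xA9.
    string e = "\u00E9";
    assert(e.length == 2);
    assert(e[0] == 0xC3 && e[1] == 0xA9);

    // Widening the first unit compiles, but yields U+00C3, not U+00E9.
    assert(widen(e[0]) == 0x00C3);

    // Transcoding the whole string gives the intended single UTF-16 code unit.
    assert(toUTF16(e) == "\u00E9"w);
}
```

So value-preserving promotion holds for a one-byte type only in the range 0 to 127, where one UTF-8 code unit is a whole code point; above that range a single byte of UTF-8 is a fragment and real transcoding is needed.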
Hi Regan,

Two small remarks:

* "wchar" might still be useful for those applications / libraries that support 16-bit unicode without aggregates like in Windows NT if I'm correct. It's not utf16 since it can't contain a big, >2-byte code point, ie. it's ushort.
* I don't see the point of the utf8, utf16 and utf32 types. They can all contain any code point, so they should all be just as big? Or do you mean that utf8 is like a ubyte[4], utf16 like ushort[2] and utf32 like uint? Actually pieces from the respective strings.

L.
Nov 23 2005
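The "aggregates" Lionello mentions are UTF-16 surrogate pairs. A rough illustration, assuming the present-day `std.utf.encode` overload that writes into a `wchar[2]` buffer; `utf16Units` is a made-up helper name.

```d
import std.utf : encode;

/// Hypothetical helper: the UTF-16 code units for a single code point.
wchar[] utf16Units(dchar c)
{
    wchar[2] buf;
    immutable n = encode(buf, c); // number of code units written (1 or 2)
    return buf[0 .. n].dup;
}

void main()
{
    // A code point up to U+FFFF fits in a single 16-bit unit.
    auto small = utf16Units('\u00E5');
    assert(small.length == 1 && small[0] == 0x00E5);

    // U+1D11E lies above U+FFFF and needs a surrogate pair.
    auto big = utf16Units('\U0001D11E');
    assert(big.length == 2);
    assert(big[0] == 0xD834 && big[1] == 0xDD1E);
}
```

A single 16-bit value can therefore hold any code point up to U+FFFF but nothing beyond it, which is the distinction being drawn between a lone ushort and a UTF-16 string.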
By the way, I like the proposal! I prefer different compiled libraries to many runtime checks or version blocks. It's like the #define UNICODE in Windows. L.
Nov 23 2005
On Thu, 24 Nov 2005 09:56:51 +0200, Lionello Lunesu <lio remove.lunesu.com> wrote:
> Two small remarks:
> * "wchar" might still be useful for those applications / libraries that support 16-bit unicode without aggregates like in Windows NT if I'm correct. It's not utf16 since it can't contain a big, >2-byte code point, ie. it's ushort.
> * I don't see the point of the utf8, utf16 and utf32 types. They can all contain any code point, so they should all be just as big? Or do you mean that utf8 is like a ubyte[4], utf16 like ushort[2] and utf32 like uint? Actually pieces from the respective strings.

No. I seem to have done a bad job of explaining it _and_ picked terrible names.

The "utf8", "utf16" and "utf32" types I refer to are essentially byte, short and int. They cannot contain any code point, only those that fit (I thought I said that?)

We don't need wchar because utf16 replaces it.

Perhaps if I had kept the original names... doh!

Regan
Nov 24 2005
> The "utf8", "utf16" and "utf32" types I refer to are essentially byte, short and int. They cannot contain any code point, only those that fit (I thought I said that?)

In that case I don't like your idea : )

It makes far more sense to have only 1 _character_ type, that holds any UNICODE character. Whether it comes from an utf8, utf16 or utf32 string shouldn't matter:

    string s="Whatever"; //imagine it with a small circle on the a, comma under the t
    foreach(uchar u; s) {}

Read "uchar" as "unicode char", essentially dchar. It could in fact still be named dchar, I just didn't want to mix old/new terminology. The underlying type of "string" would be determined at compile time, but still convertible using properties (that part I liked very much).

D's "char" should go back to C's char, signed even. Many decisions in D were made to ease the porting of C code, so why this "char" got overridden beats me. char[] should then behave no differently from byte[] (except maybe the element being signed).

L.
Nov 24 2005
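Iterating by one character type regardless of the string's storage is close to what `foreach` with a `dchar` loop variable does in D today. A minimal sketch under that assumption; `countCodePoints` is an invented name, and the literal spells "Whatever" with U+00E5 and U+021B standing in for the accented letters described above.

```d
/// Hypothetical helper: count code points by letting foreach decode the UTF-8.
size_t countCodePoints(string s)
{
    size_t n;
    foreach (dchar u; s) // each iteration yields one whole code point
        ++n;
    return n;
}

void main()
{
    string s = "Wh\u00E5\u021Bever";

    assert(s.length == 10);          // UTF-8 code units: the two accented letters take 2 each
    assert(countCodePoints(s) == 8); // characters as a reader would count them
}
```

The loop body never sees a fragment of a character, which is the property the single character type is meant to guarantee.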
On Thu, 24 Nov 2005 17:35:24 +0200, Lionello Lunesu <lio remove.lunesu.com> wrote:
>> The "utf8", "utf16" and "utf32" types I refer to are essentially byte, short and int. They cannot contain any code point, only those that fit (I thought I said that?)
>
> In that case I don't like your idea : )
> It makes far more sense to have only 1 _character_ type, that holds any UNICODE character. Whether it comes from an utf8, utf16 or utf32 string shouldn't matter:

Yeah, I'm starting to think that is the only way it works. The 3 types were an attempt to avoid that for programs which do not need an int-sized type for a char, i.e. the quick and dirty ASCII program for example.

Interestingly it seems std.stdio and std.format are already involved in a conspiracy to convert all our char[] output to dchar and back again one character at a time before it eventually makes it to the screen.

>     string s="Whatever"; //imagine it with a small circle on the a, comma under the t
>     foreach(uchar u; s) {}
>
> Read "uchar" as "unicode char", essentially dchar. It could in fact still be named dchar, I just didn't want to mix old/new terminology. The underlying type of "string" would be determined at compile time, but still convertible using properties (that part I liked very much).
>
> D's "char" should go back to C's char, signed even. Many decisions in D were made to ease the porting of C code, so why this "char" got overridden beats me. char[] should then behave no differently from byte[] (except maybe the element being signed).

I like "uchar". I agree "char" should go back to being C's char type. I don't think we need a char[]; all the C functions expect a null terminated char*.

Regan
Nov 24 2005
Regan Heath wrote:
> On Thu, 24 Nov 2005 17:35:24 +0200, Lionello Lunesu
>> It makes far more sense to have only 1 _character_ type, that holds any UNICODE character. Whether it comes from an utf8, utf16 or utf32 string shouldn't matter:

True!

> Yeah, I'm starting to think that is the only way it works. The 3 types were an attempt to avoid that for programs which do not need an int-sized type for a char, e.g. the quick and dirty ASCII program. Interestingly, it seems std.stdio and std.format are already involved in a conspiracy to convert all our char[] output to dchar and back again, one character at a time, before it eventually makes it to the screen.

Must've been the specters in the night again. :-)

> I like "uchar". I agree "char" should go back to being C's char type. I don't think we need a char[]; all the C functions expect a null-terminated char*.

That would be nice! What if we even decided that char[] is null-terminated? That'd massively reduce all kinds of bugs when (under pressure) converting code from C(++)!
Nov 24 2005
On Fri, 25 Nov 2005 02:21:52 +0200, Georg Wrede wrote:
> Regan Heath wrote:
>> On Thu, 24 Nov 2005 17:35:24 +0200, Lionello Lunesu
>>> It makes far more sense to have only 1 _character_ type, that holds any UNICODE character. Whether it comes from an utf8, utf16 or utf32 string shouldn't matter:
> True!
>> Yeah, I'm starting to think that is the only way it works. The 3 types were an attempt to avoid that for programs which do not need an int-sized type for a char, e.g. the quick and dirty ASCII program. Interestingly, it seems std.stdio and std.format are already involved in a conspiracy to convert all our char[] output to dchar and back again, one character at a time, before it eventually makes it to the screen.
> Must've been the specters in the night again. :-)
>> I like "uchar". I agree "char" should go back to being C's char type. I don't think we need a char[]; all the C functions expect a null-terminated char*.
> That would be nice! What if we even decided that char[] is null-terminated? That'd massively reduce all kinds of bugs when (under pressure) converting code from C(++)!

I think that would interfere with the slice concept.

    char[] a = "some text";
    char[] b = a[4 .. 7]; // Making 'b' a reference into 'a'.

--
Derek
(skype: derek.j.parnell)
Melbourne, Australia
25/11/2005 11:37:08 AM
Nov 24 2005
Derek Parnell wrote:
> On Fri, 25 Nov 2005 02:21:52 +0200, Georg Wrede wrote:
>>> I like "uchar". I agree "char" should go back to being C's char type. I don't think we need a char[]; all the C functions expect a null-terminated char*.
>> That would be nice! What if we even decided that char[] is null-terminated? That'd massively reduce all kinds of bugs when (under pressure) converting code from C(++)!
> I think that would interfere with the slice concept.
>     char[] a = "some text";
>     char[] b = a[4 .. 7]; // Making 'b' a reference into 'a'.

Slicing C's char[] implies byte-wide, and non-UTF.
Nov 24 2005
On Fri, 25 Nov 2005 05:16:15 +0200, Georg Wrede wrote:
> Derek Parnell wrote:
>> On Fri, 25 Nov 2005 02:21:52 +0200, Georg Wrede wrote:
>>>> I like "uchar". I agree "char" should go back to being C's char type. I don't think we need a char[]; all the C functions expect a null-terminated char*.
>>> That would be nice! What if we even decided that char[] is null-terminated? That'd massively reduce all kinds of bugs when (under pressure) converting code from C(++)!
>> I think that would interfere with the slice concept.
>>     char[] a = "some text";
>>     char[] b = a[4 .. 7]; // Making 'b' a reference into 'a'.
> Slicing C's char[] implies byte-wide, and non-UTF.

Exactly, and that's why I'm worried by the suggestion that char[] be automatically zero-terminated, because slices are usually not zero-terminated.

--
Derek
(skype: derek.j.parnell)
Melbourne, Australia
25/11/2005 3:12:51 PM
Nov 24 2005
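Editor's sketch of the slicing concern raised above, in present-day D. It uses Derek's own `a[4 .. 7]` example; `cLength` is a hypothetical helper name. It assumes only `std.string.toStringz` and C's `strlen`, and shows why a slice cannot simply be handed to a C function: the slice shares memory with its parent, and the byte after it is whatever comes next in the parent, not a terminator.

```d
import core.stdc.string : strlen;
import std.string : toStringz;

/// Length as a C function would see it, measured on a
/// zero-terminated copy made by toStringz.
size_t cLength(const(char)[] s)
{
    return strlen(toStringz(s));
}

void main()
{
    char[] a = "some text".dup;
    char[] b = a[4 .. 7];        // " te", a view into a's memory

    assert(b.ptr == a.ptr + 4);  // no copy was made
    assert(b.length == 3);

    // The element just past the slice is 'x', not '\0', so b.ptr on
    // its own is not a valid C string.
    assert(a[7] == 'x');

    // Converting explicitly yields a terminated copy of just the slice.
    assert(cLength(b) == 3);
}
```

The explicit conversion is the design choice D eventually favoured: slices stay cheap, and the cost of termination is paid only at the C boundary.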
Derek Parnell wrote:
> On Fri, 25 Nov 2005 05:16:15 +0200, Georg Wrede wrote:
>> Derek Parnell wrote:
>>> On Fri, 25 Nov 2005 02:21:52 +0200, Georg Wrede wrote:
>>>>> I like "uchar". I agree "char" should go back to being C's char type. I don't think we need a char[]; all the C functions expect a null-terminated char*.
>>>> That would be nice! What if we even decided that char[] is null-terminated? That'd massively reduce all kinds of bugs when (under pressure) converting code from C(++)!
>>> I think that would interfere with the slice concept.
>>>     char[] a = "some text";
>>>     char[] b = a[4 .. 7]; // Making 'b' a reference into 'a'.
>> Slicing C's char[] implies byte-wide, and non-UTF.
> Exactly, and that's why I'm worried by the suggestion that char[] be automatically zero-terminated, because slices are usually not zero-terminated.

With what we're doing with the UTF, it would be a small additional job to have the "C char" arrays take care of the null byte at the end, so the programmer would not have to think about it.

(I admit this takes some further thinking first! So you are right in your concerns!)
Nov 25 2005
Regan Heath wrote:
[snip]
> [Questions]
> (a) Is UTF-8 a subset of UTF-16 and so on? Does the code point for 'A' have the numerical value 65 decimal in UTF-8, UTF-16 _and_ UTF-32; in other words, is it the same numerical value in all encodings? If so, then comparing utf8, 16 and 32 is no different from comparing byte, short and int, and all the same promotion and comparison rules can apply.

I think you are making this more complicated than it is by using the name UTF when you actually mean something like:

    ascii_char   (not utf8)   (code point < 128)
    ucs2_char    (not utf16)  (code point < 65536)
    unicode_char (not utf32)

And yes: ASCII is a subset of UCS-2, which is a subset of Unicode.

> (b) Is this really ASCII or is it system dependent, i.e. Latin-1 or similar? Is it ASCII values 127 or less perhaps? To be honest I'm not sure.

ASCII is equal to the first 128 code points in Unicode. Latin-1 is equal to the first 256 code points in Unicode.

Regards,
/Oskar
Nov 24 2005
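Editor's sketch for question (a), in present-day D. It relies on `std.utf.codeLength`, which reports how many code units a code point needs in a given encoding. The code point value is the same in every encoding; what changes is how many code units carry it. That is why "utf8 is a subset of utf16" only holds for the ASCII range.

```d
import std.utf : codeLength;

void main()
{
    // 'A' is code point 65 and a single code unit in all three encodings.
    assert("A"[0] == 65 && "A"w[0] == 65 && "A"d[0] == 65);

    // U+00C4 (A with diaeresis) is in Latin-1 but outside ASCII.
    string  u8  = "\u00C4";
    wstring u16 = "\u00C4"w;
    dstring u32 = "\u00C4"d;

    // In UTF-8 it becomes two code units, neither of which is 0xC4.
    assert(u8.length == 2 && u8[0] == 0xC3 && u8[1] == 0x84);
    // In UTF-16 and UTF-32 the single code unit equals the code point.
    assert(u16.length == 1 && u16[0] == 0xC4);
    assert(u32.length == 1 && u32[0] == 0xC4);

    // The same facts, asked of the library.
    assert(codeLength!char('A') == 1);
    assert(codeLength!char('\u00C4') == 2);
    assert(codeLength!wchar('\u00C4') == 1);
    assert(codeLength!wchar('\U0001F600') == 2); // surrogate pair
}
```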
On Thu, 24 Nov 2005 09:23:21 +0100, Oskar Linde <oskar.lindeREM OVEgmail.com> wrote:Regan Heath wrote: [snip]I agree, it appears my choice of type names was really confusing. I have posted a change, but perhaps I should repost all over again, perhaps I should have bounced this off one person before posting.[Questions] (a) Is UTF-8 a subset of UTF-16 and so on? does the codepoint for 'A' have the numerical value 65 decimal in UTF-8, UTF-16 _and_ UTF-32, in other words is it the same numerical value in all encodings? If so then comparing utf8, 16 and 32 is no different to comparing byte, short and int and all the same promotion and comparrison rules can apply.I think you are making this more complicated than it is by using the name UTF when you actually mean something like: ascii_char (not utf8) (code point < 128) ucs2_char (not utf16) (code point < 65536) unicode_char (not utf32)And yes: ascii is a subset of ucs2 is a subset of unicode.Excellent. Thanks.And which does a C function expect? Or is that defined by the C function? Does strcmp care? Does strlen, strchr, ...? Regan(b) Is this really ASCII or is it system dependant? i.e. Latin-1 or similar. Is it ASCII values 127 or less perhaps? To be honest I'm not sure.ASCII is equal to the first 128 code points in Unicode. Latin-1 is equal to the first 256 code points in Unicode.
Nov 24 2005
Regan Heath wrote:
> On Thu, 24 Nov 2005 09:23:21 +0100, Oskar Linde <oskar.lindeREM OVEgmail.com> wrote:
>> Regan Heath wrote:
>>> (b) Is this really ASCII or is it system dependent, i.e. Latin-1 or similar? Is it ASCII values 127 or less perhaps? To be honest I'm not sure.
>> ASCII is equal to the first 128 code points in Unicode. Latin-1 is equal to the first 256 code points in Unicode.
> And which does a C function expect? Or is that defined by the C function? Does strcmp care? Does strlen, strchr, ...?

This is not defined. strcmp doesn't care. strlen etc. only count bytes until '\0'. You can use Latin-1, UTF-8 or any 8-bit encoding. This is why UTF-8 is so popular: you can just plug it in, and almost everything that used to assume Latin-1 or any 8-bit encoding will just work without any changes.

Not even the OS cares very much. To the OS, things like a file name, file contents, usernames, etc. are just a bunch of bytes. Different file systems may then define different encodings in which the file names should be interpreted. This only affects how the file name is presented to the user (transcoding to/from the terminal).

/Oskar
Nov 24 2005
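Editor's sketch of Oskar's point that the C string functions are encoding-agnostic, in present-day D. `byteLength` is a hypothetical helper name; `strlen`, `strcmp`, `toStringz` and `std.utf.count` are existing library functions. `strlen` reports bytes up to the terminator, which only matches the character count for pure ASCII.

```d
import core.stdc.string : strcmp, strlen;
import std.string : toStringz;
import std.utf : count;

/// Bytes before the terminator, exactly what C's strlen reports.
size_t byteLength(string s)
{
    return strlen(toStringz(s));
}

void main()
{
    string s = "sm\u00F6rg\u00E5sbord"; // 11 characters, two of them non-ASCII

    assert(byteLength(s) == 13);  // the two non-ASCII letters take 2 bytes each
    assert(count(s) == 11);       // code points, as a user would count them

    // strcmp compares bytes and neither knows nor cares that this is UTF-8.
    assert(strcmp(toStringz(s), toStringz(s)) == 0);
    assert(strcmp(toStringz("abc"), toStringz("abd")) < 0);
}
```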
Ok, it appears I picked some really bad type names in my proposal and it is causing some confusion.

The types "utf8", "utf16" and "utf32" do not in fact have anything to do with UTF. (Bad Regan.) They are essentially byte, short and int with different names. Having different names is important because it triggers the transcoding of "string" to the required C, OS, or UTF type.

I could have left them called "char", "wchar" and "dchar", except that I wanted a 4th type to represent C's char as well. That type was called "char" in the proposal.

So, for the sake of our sanity, can we all please assume I have used these type names instead:

    "utf8"  == "cp1"
    "utf16" == "cp2"
    "utf32" == "cp4"
    "utf"   == "cpn"

(The actual type names are unimportant at this stage; we can pick the best possible names later.)

The idea behind these types is that they represent code points/characters, _never_ code units/values/fragments. This means cp1 can only represent a small subset of Unicode code points, cp2 slightly more, and cp4 all of them (IIRC). It means assigning anything outside their range to them is an error. It means that you can assign a cp1 to a cp2 and it is simply promoted (as it would be from byte to short).

"cpn" is simply an alias for the type that is best suited to the chosen native encoding. If the native encoding is UTF-8, cpn is an alias for cp1; if the native encoding is UTF-16, cpn is an alias for cp2; and so on.

Sorry for all the confusion.

Regan
Nov 24 2005
Replying to myself now, in addition to bolloxing the initial proposal up with bad type names, I'm on a roll! Here is version 1.1 of the proposal, with different type names and some changes to the other content. Hopefully this one will make more sense, fingers crossed. Regan
Nov 24 2005
Congrats, Regan! Great job! And the thread subject is simply a Killer!

If I understand you correctly, then the following would work:

    string st = "aaa\u0041bbb\u00C4ccc\u0107ddd"; // aaaAbbbÄcccćddd
    cp1 s3 = st[3];   // A
    cp1 s7 = st[7];   // Ä
    cp1 s11 = st[11]; // error, too narrow
    cp2 s11 = st[11]; // ć
    assert( s3 == 0x41 && s7 == 0xC4 && s11 == 0x107 );

So, s3 would contain "A", which the old system would store as utf8 with no problem. s3 is 8 bits.

s7 would contain "Ä", which the old system shouldn't have stored in 8-bit (char) because it is too big, but with your proposal it would be ok, since the _code_point_ (i.e. the "value" of the character in Unicode) does fit in 8 bits. And _we_are_storing_ the codepoint, not the UTF character here, right?

s11 would error, since even the Unicode value is too big for 8 bits. The second s11 assignment would be ok, since the Unicode value of ć fits in 16 bits.

And, st itself would be "regular" UTF-8 on a Linux, and (probably) UTF-16 on Windows. Yes?
Nov 24 2005
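Editor's sketch of the cp1/cp2 range rule Georg describes, in present-day D. The types cp1, cp2 and cp4 never existed, so this models them with a width check on `dchar`. `codePointAt` and `fitsIn` are hypothetical helper names, and indexing by code point is done with a linear scan because real D strings index by code unit.

```d
/// Returns the n-th code point (not code unit) of a UTF-8 string.
dchar codePointAt(string s, size_t n)
{
    size_t i = 0;
    foreach (dchar c; s)
    {
        if (i == n)
            return c;
        ++i;
    }
    assert(false, "code point index out of range");
}

/// True if the code point's numeric value fits in the given number of
/// bits, i.e. whether a hypothetical cp1 (8) or cp2 (16) could hold it.
bool fitsIn(dchar c, uint bits)
{
    return bits >= 32 || c < (1u << bits);
}

void main()
{
    string st = "aaa\u0041bbb\u00C4ccc\u0107ddd";

    dchar s3  = codePointAt(st, 3);
    dchar s7  = codePointAt(st, 7);
    dchar s11 = codePointAt(st, 11);
    assert(s3 == 0x41 && s7 == 0xC4 && s11 == 0x107);

    assert(fitsIn(s3, 8));    // 'A' would fit a cp1
    assert(fitsIn(s7, 8));    // U+00C4 would too, although UTF-8 needs 2 bytes for it
    assert(!fitsIn(s11, 8));  // U+0107 is too wide for a cp1...
    assert(fitsIn(s11, 16));  // ...but would fit a cp2
}
```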
On Thu, 24 Nov 2005 13:29:10 +0200, Georg Wrede <georg.wrede nospam.org> wrote:
> Congrats, Regan! Great job! And the thread subject is simply a Killer!
> If I understand you correctly, then the following would work:
>     string st = "aaa\u0041bbb\u00C4ccc\u0107ddd"; // aaaAbbbÄcccćddd
>     cp1 s3 = st[3];   // A
>     cp1 s7 = st[7];   // Ä
>     cp1 s11 = st[11]; // error, too narrow
>     cp2 s11 = st[11]; // ć
>     assert( s3 == 0x41 && s7 == 0xC4 && s11 == 0x107 );
> So, s3 would contain "A", which the old system would store as utf8 with no problem. s3 is 8 bits. s7 would contain "Ä", which the old system shouldn't have stored in 8-bit (char) because it is too big, but with your proposal it would be ok, since the _code_point_ (i.e. the "value" of the character in Unicode) does fit in 8 bits. And _we_are_storing_ the codepoint, not the UTF character here, right?

Yes. That's exactly what I was thinking. However, it appears that the idea does not hold together too well when it comes to "cpn" the alias, e.g.:

    string s = "smörgåsbord";
    foreach (cpn c; s) { }

"cpn" would need to change size for each character. It would be more than a simple alias. If it cannot change size, then it would need to be the largest size required. If that was also too weird/difficult, then it would need to be 32 bits in size all the time. I was trying to avoid this, but it seems it may be required?

> s11 would error, since even the Unicode value is too big for 8 bits. The second s11 assignment would be ok, since the Unicode value of ć fits in 16 bits. And, st itself would be "regular" UTF-8 on a Linux, and (probably) UTF-16 on Windows. Yes?

My proposal didn't suggest different encodings based on the system. It was UTF-8 by default (all systems) and application specific otherwise. There is nothing stopping us making the Windows default UTF-16 if that makes sense. Which it seems to.

Regan
Nov 24 2005
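Editor's sketch of why "cpn" cannot be a fixed narrow alias, in present-day D. It walks Regan's "smörgåsbord" example with `std.utf.decode`; `codeUnitWidths` is a hypothetical helper name. The stored width varies from character to character, while every decoded value lands in the same 32-bit `dchar`.

```d
import std.utf : decode;

/// For each code point in a UTF-8 string, records how many code units
/// (bytes) it occupies in storage.
size_t[] codeUnitWidths(string s)
{
    size_t[] widths;
    size_t i = 0;
    while (i < s.length)
    {
        immutable start = i;
        decode(s, i);          // returns a dchar and advances i past it
        widths ~= i - start;
    }
    return widths;
}

void main()
{
    string s = "sm\u00F6rg\u00E5sbord";

    // Storage width changes mid-string: the two non-ASCII letters take 2 bytes.
    size_t[] expected = [1, 1, 2, 1, 1, 2, 1, 1, 1, 1, 1];
    assert(codeUnitWidths(s) == expected);

    // The loop variable that receives each character is always 32 bits.
    static assert(dchar.sizeof == 4);
}
```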
Regan Heath wrote:
> On Thu, 24 Nov 2005 13:29:10 +0200, Georg Wrede <georg.wrede nospam.org> wrote:
> ...
> "cpn" would need to change size for each character. It would be more than a simple alias. If it cannot change size, then it would need to be the largest size required. If that was also too weird/difficult, then it would need to be 32 bits in size all the time. I was trying to avoid this, but it seems it may be required?

Yes. I see no way to avoid "cpn" being 32 bit only.

> My proposal didn't suggest different encodings based on the system. It was UTF-8 by default (all systems) and application specific otherwise. There is nothing stopping us making the Windows default UTF-16 if that makes sense. Which it seems to.

Windows, <sigh>. Looks like it. They seem to have a habit of choosing what seems easiest at the outset, without ever learning to dig into issues first. Had they done so, they'd have chosen UTF-8, like everybody else. :-(
Nov 24 2005
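Editor's sketch of the "native encoding differs per platform" point, in present-day D. It assumes the existing `std.utf` transcoding functions stand in for whatever conversion the proposed "string" type would perform, and shows that the same text can be held as UTF-8, UTF-16 or UTF-32 and converted without loss.

```d
import std.utf : toUTF16, toUTF32, toUTF8;

void main()
{
    string  s = "sm\u00F6rg\u00E5sbord"; // UTF-8 storage
    wstring w = toUTF16(s);              // what a UTF-16 API would expect
    dstring d = toUTF32(s);              // one code unit per code point

    // Same 11 characters, different numbers of code units.
    assert(s.length == 13);
    assert(w.length == 11);
    assert(d.length == 11);

    // Transcoding round-trips without loss.
    assert(toUTF8(w) == s);
    assert(toUTF8(d) == s);
}
```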
Regan Heath wrote:* add a new type/alias "cpn", this alias will be cp1, cp2 or cp4 depending on the native encoding chosen. This allows efficient code, like: string s = "test"; foreach(cpn c; s) { } * slicing string gives another string * indexing a string gives a cp1, cp2 or cp4I hope you are not implying that indexing would choose between cp1..4 based on content? And if not, then the cpX would be either some "default", or programmer chosen? Now, that leads to Americans choosing cp1 all over the place, right? (Ah, upon proofreading before posting, I only now noticed the cpn sentence at the top. I'll remark on it at the very end.) --- While we are now intricately submerged in UTF and char width issues, one day, when D is a household word, programmers wouldn't have to even know about UTF and stuff. Just like last summer, when none of us European D folk knew anything about UTF, and just wrote stuff likeOn Thu, 24 Nov 2005 14:33:53 +0200, Georg Wrede <georg.wrede nospam.org> wrote:Regan Heath wrote:I didn't think this part thru enough and Oskar gave me an example which broke my original idea. It seems for this to work cpn would need to be a type which changed size for each character, or always 32 bits large (as you suggest below). I was trying to avoid it being 32 bits large all the time, but it seems to be the only way it works.* add a new type/alias "cpn", this alias will be cp1, cp2 or cp4 depending on the native encoding chosen. This allows efficient code, like: string s = "test"; foreach(cpn c; s) { } * slicing string gives another string * indexing a string gives a cp1, cp2 or cp4I hope you are not implying that indexing would choose between cp1..4 based on content? And if not, then the cpX would be either some "default", or programmer chosen? Now, that leads to Americans choosing cp1 all over the place, right?If this is true, then we might consider blatantly skipping cp1 and cp2, and only having cp4 (possibly also renaming it utfchar). 
Then it would be a lot simpler for the programmer, right? He'd have even less need to start researching in this UTF swamp. And everything would "just work". This would make it possible for us to fully automate the extraction and insertion of single "characters" into our new strings. string foo = "gagaga"; utfchar bar = '\UFE9D'; // you don't want to know the name :-) utfchar baf = 'a'; foo ~= bar ~ baf; (I admit the last line doesn't probably work currently, but it should, IMHO.) Anyhow, the point being that if the utfchar type is 32 bits, then it doesn't hurt anybody, and also doesn't lead to gratuituous incompatibility with foreign characters -- which is the D aim all along.It seems this may be the best solution. Oskar had a good name for it "uchar". It means quick and dirty ASCII apps will have to use a 32 bit sized char type. I can hear people complain already.. but it's odd that no-one is complaining about writef doing this exact same thing!For completeness, we could have the painting casts (as opposed to converting casts). They'd be for the (seldom) situations where the programmer _does_ want to do serious tinkering on our strings. ubyte[] myarr1 = cast(ubyte[])foo; ushort[] myarr2 = cast(ushort[]) foo; uint[] myarr3 = cast(uint[]) foo; These give raw arrays, like exact images of the string. The burden of COW would lie on the programmer.I was thinking of using properties (Sean's idea) to access the data as a certain type, eg. ubyte[] b = foo.utf8; ushort[] s = foo.utf16; uint[] i = foo.utf32; these properties would return the string in the specified encoding using those array types.--- The cpn remark: I think D programs should be (as much as possible) UTF clean, even if the programmer didn't come to think about it. This has the advantage that his programs won't break embarrassingly when a guy in China suddenly uses them. It would also be quite nice if the programmer didn't have to think about such issues at all. Just code his stuff. 
Having cpn as something else than 32 bits, will prevent this dream. (Heh, and only having single chars as 32 bits would make writing the libraries so much easier, too, I think.)Sad but probably true. I was hoping to avoid using 32bits everywhere :( ReganNov 24 2005On Fri, 25 Nov 2005 09:34:30 +1300, Regan Heath wrote:Sad but probably true. I was hoping to avoid using 32bits everywhere :(I also use the Euphoria programming language and this uses 32-bit characters exclusively. You do not notice any performance hit because of that. The only complaint that some people have is that string use too much RAM (but these people also use Windows 95). -- Derek Parnell Melbourne, Australia 25/11/2005 7:39:28 AMNov 24 2005On Fri, 25 Nov 2005 07:41:15 +1100, Derek Parnell <derek psych.ward> wrote:On Fri, 25 Nov 2005 09:34:30 +1300, Regan Heath wrote:Interesting. In that case I think my "string" type has an advantage. The data could actually be stored in either UTF-8, UTF-16 or UTF-32 internally and only convertedd to/from the 32 bit char when required. Regan.Sad but probably true. I was hoping to avoid using 32bits everywhere :(I also use the Euphoria programming language and this uses 32-bit characters exclusively. You do not notice any performance hit because of that. The only complaint that some people have is that string use too much RAM (but these people also use Windows 95).Nov 24 2005Regan Heath wrote:On Thu, 24 Nov 2005 14:33:53 +0200, Georg Wrede wrote: I was trying to avoid it being 32 bits large all the time, but it seems to be the only way it works.I agree. And I share the feeling. :-)Not too many have dissected writef. Or else we'd have heard some complaints already. ;-) I actually thought about "uchar" for a while, but then I remembered that a lot of this utf disaster originates from unfortunate names. And C has a uchar type. 
So, I'd suggest "utfchar" or "unicode" or something to-the-point and unambiguous that's not in C.If this is true, then we might consider blatantly skipping cp1 and cp2, and only having cp4 (possibly also renaming it utfchar). This would make it possible for us to fully automate the extraction and insertion of single "characters" into our new strings. string foo = "gagaga"; utfchar bar = '\UFE9D'; // you don't want to know the name :-) utfchar baf = 'a'; foo ~= bar ~ baf;It seems this may be the best solution. Oskar had a good name for it "uchar". It means quick and dirty ASCII apps will have to use a 32 bit sized char type. I can hear people complain already.. but it's odd that no-one is complaining about writef doing this exact same thing!So it'd be the same thing, except your code looks a lot nicer!For completeness, we could have the painting casts (as opposed to converting casts). They'd be for the (seldom) situations where the programmer _does_ want to do serious tinkering on our strings. ubyte[] myarr1 = cast(ubyte[])foo; ushort[] myarr2 = cast(ushort[]) foo; uint[] myarr3 = cast(uint[]) foo; These give raw arrays, like exact images of the string. The burden of COW would lie on the programmer.I was thinking of using properties (Sean's idea) to access the data as a certain type, eg. ubyte[] b = foo.utf8; ushort[] s = foo.utf16; uint[] i = foo.utf32; these properties would return the string in the specified encoding using those array types.Nov 24 2005On Thu, 24 Nov 2005 21:46:50 +1300, Regan Heath wrote:Ok, it appears I picked some really bad type names in my proposal and it is causing some confusion.Regan, the idea stinks. Sorry, but that *is* the nice response. It is far more complicated than it needs to be. Maybe it's the name confusion, but I don't think so. When dealing with strings, almost nobody needs to deal with partial-characters. 
We really only need to deal with characters, except for some obscure functionality (maybe interfacing with an external system?).

So we don't need to deal with the individual bytes that make up the characters in the various UTF encodings. Sure, we will need to know how big a character is from time to time. For example, given a string (regardless of encoding format), we might need to know how many bytes the third character uses. The answer will depend on the UTF encoding *and* the code point value.

Mostly we won't even need to know the encoding format. We might, if that is an interfacing requirement, and we might in some circumstances to improve performance. But generally, we shouldn't care.

So how about we just have a string datatype called 'string'. The default encoding format in RAM is compiler dependent, but we can, on a declaration basis, define a specific internal encoding format for a string. Furthermore, we can access any of the three UTF encoding formats for a string as a property of the string. The compiler would generate the call to transcode if required. The string could also have array properties such that each element addresses an entire character.

If one ever really needed to get down to the byte level of a character, they could assign it to a new datatype called a 'unicode' (for example), which would have properties such as the encoding format and byte size, and the bytes in a unicode could be accessed using array syntax too.

    string Foo = "Some string";
    unicode C;
    C = Foo[4];
    if (C.encoding == unicode.utf8)
    {
        foreach (ubyte b; C)
        {
            . . .
        }
    }

We can then dismiss wchar, wchar[], dchar, dchar[] entirely, and make char and char[] have the C/C++ semantics.

If some function absolutely insisted on a utf16 string, for example, ...

    SomeFunc(Foo.utf16);

would pass the utf16 version of the string to the function.

As for declarations ...

    utf16 {  // force RAM encoding to be utf16
        string Foo;
        string Bar;
    }

    string Qwerty; // RAM encoding is compiler choice.
-- Derek Parnell Melbourne, Australia 24/11/2005 7:54:01 PMNov 24 2005Derek, I must have done a terrible job explaining this, because you've completely missunderstood me, in fact your counter proposal is essentially what my proposal was intended to be. More inline... On Thu, 24 Nov 2005 20:15:56 +1100, Derek Parnell <derek psych.ward> wrote:On Thu, 24 Nov 2005 21:46:50 +1300, Regan Heath wrote:I think you're confused. My proposal removes the need for dealing with partial characters completely, if you think otherwise then I've done a bad job explaining it.Ok, it appears I picked some really bad type names in my proposal and it is causing some confusion.Regan, the idea stinks. Sorry, but that *is* the nice response. It is far more complicated than it needs to be. Maybe it's the name confusion, but I don't think so. When dealing with strings, almost nobody needs to deal with partial-characters.So we don't need to deal with the individual bytes that make up the characters in the various UTF encodings. Sure, we will need to know how big a character is from time to time. For example, given a string (regardless of encoding format), we might need to know how many bytes the third character uses. The answer will depend on the UTF encoding *and* the code point value.Exactly my point, and the reason for the "cpn" alias.Mostly we won't even need to know the encoding format. We might, if that is an interfacing requirement, and we might in some circumstances to improve performance. But generally, we shouldn't care.Yes, exactly.So how about we just have a string datatype called 'string'. The default encoding format in RAM is compiler dependant but we can on a declaration basis, define specific internal encoding format for a string. Furthermore, we can access any of the three UTF encoding formats for a string as a property of the string. The compiler would generate the call to transcode if required to. 
The string could also have array properties such that each element addressed an entire character.That, is exactly what I proposed.We can then dismiss wchar, wchar[], dchar, dchar[] entirely. And make char and char[] array have the C/C++ semantics.I proposed exactly that, except char[] should not exist either. char and char* are all that are required. ReganNov 24 2005On Thu, 24 Nov 2005 22:49:36 +1300, Regan Heath wrote:Derek, I must have done a terrible job explaining this, because you've completely missunderstood me, in fact your counter proposal is essentially what my proposal was intended to be.You seemed to be wanting to have data types that could only hold characters-fragments. I can't see the point of that. If strings must be arrays, then let there be an atomic data type that represents a character and then strings can be arrays of characters. The UTF encoding of the string is just an implementation detail then. All indexing would be done on a character basis regardless of the underlying encoding. In other words, if 'uchar' is the data type that holds a character then alias uchar[] string; could be used. The main 'new thinking' is that uchar could actually have variable size in terms of bytes. It could be 1, 2 or 4 bytes depending on the character it holds and the encoding used at the time. However I still prefer my earlier suggestion.On Thu, 24 Nov 2005 20:15:56 +1100, Derek Parnell <derek psych.ward> wrote:Apparently, or I'm a bit thicker than suspected ;-)On Thu, 24 Nov 2005 21:46:50 +1300, Regan Heath wrote:I think you're confused. My proposal removes the need for dealing with partial characters completely, if you think otherwise then I've done a bad job explaining it.Ok, it appears I picked some really bad type names in my proposal and it is causing some confusion.Regan, the idea stinks. Sorry, but that *is* the nice response. It is far more complicated than it needs to be. Maybe it's the name confusion, but I don't think so. 
When dealing with strings, almost nobody needs to deal with partial-characters.But why the need for cp1, cp2, cp4?So we don't need to deal with the individual bytes that make up the characters in the various UTF encodings. Sure, we will need to know how big a character is from time to time. For example, given a string (regardless of encoding format), we might need to know how many bytes the third character uses. The answer will depend on the UTF encoding *and* the code point value.Exactly my point, and the reason for the "cpn" alias.Are you saying that we can have arrays of everything except char? I don't think that'll fly. And char* is a pointer to a single char. -- Derek Parnell Melbourne, Australia 25/11/2005 7:27:14 AMMostly we won't even need to know the encoding format. We might, if that is an interfacing requirement, and we might in some circumstances to improve performance. But generally, we shouldn't care.Yes, exactly.So how about we just have a string datatype called 'string'. The default encoding format in RAM is compiler dependant but we can on a declaration basis, define specific internal encoding format for a string. Furthermore, we can access any of the three UTF encoding formats for a string as a property of the string. The compiler would generate the call to transcode if required to. The string could also have array properties such that each element addressed an entire character.That, is exactly what I proposed.We can then dismiss wchar, wchar[], dchar, dchar[] entirely. And make char and char[] array have the C/C++ semantics.I proposed exactly that, except char[] should not exist either. char and char* are all that are required.Nov 24 2005On Fri, 25 Nov 2005 07:37:18 +1100, Derek Parnell <derek psych.ward> wrote:On Thu, 24 Nov 2005 22:49:36 +1300, Regan Heath wrote:No, never fragments, always complete code points. I tried to stress this point. 
The 8 bit type would hold all the code points with values that fit in 8 bits and never anything else, it's value would always be a code point, not a fragment.Derek, I must have done a terrible job explaining this, because you've completely missunderstood me, in fact your counter proposal is essentially what my proposal was intended to be.You seemed to be wanting to have data types that could only hold characters-fragments. I can't see the point of that.If strings must be arrays, then let there be an atomic data type that represents a character and then strings can be arrays of characters. The UTF encoding of the string is just an implementation detail then. All indexing would be done on a character basis regardless of the underlying encoding. In other words, if 'uchar' is the data type that holds a character then alias uchar[] string; could be used. The main 'new thinking' is that uchar could actually have variable size in terms of bytes. It could be 1, 2 or 4 bytes depending on the character it holds and the encoding used at the time. However I still prefer my earlier suggestion.I suspect now that all individual characters will have to be represented by a 32 bit type, uchar is a good name for it. If you take my proposal, throw away all the garbage about cp1, cp2, cp4, and cpn then replace them with a new type "uchar" which is 32 bits large, and always use this to represent individual characters then it starts to work, I believe.Apparently, or I'm a bit thicker than suspected ;-)I've just used confusing terms and done a bad job explaining I think.But why the need for cp1, cp2, cp4?This was intended to avoid ASCII programs having to use a 32 bit type for all their characters, and so on.Yes. Because we don't need an array of char[]. It's simply there for interfacing to C.I proposed exactly that, except char[] should not exist either. char and char* are all that are required.Are you saying that we can have arrays of everything except char?I don't think that'll fly. 
And char* is a pointer to a single char.Technically true, but when you're talking about a C function it's a pointer to the start of a string which is null terminated. That's all we need it for in D. ReganNov 24 2005Derek Parnell wrote:On Thu, 24 Nov 2005 22:49:36 +1300, Regan Heath wrote:Whoa, did you ever stop to think on the implications of having a primitive type with *variable size* ? It's plain nuts to implement, no wait, it's actually downright impossible. If you have a uchar variable (not an array), how much space do you allocate for it, if it has variable-size? The only way to implement this would be with a fixed-size equal to the max possible size (4 bytes). That would be a dchar then... -- Bruno Medeiros - CS/E student "Certain aspects of D are a pathway to many abilities some consider to be... unnatural."Derek, I must have done a terrible job explaining this, because you've completely missunderstood me, in fact your counter proposal is essentially what my proposal was intended to be.You seemed to be wanting to have data types that could only hold characters-fragments. I can't see the point of that. If strings must be arrays, then let there be an atomic data type that represents a character and then strings can be arrays of characters. The UTF encoding of the string is just an implementation detail then. All indexing would be done on a character basis regardless of the underlying encoding. In other words, if 'uchar' is the data type that holds a character then alias uchar[] string; could be used. The main 'new thinking' is that uchar could actually have variable size in terms of bytes. It could be 1, 2 or 4 bytes depending on the character it holds and the encoding used at the time. However I still prefer my earlier suggestion.Nov 25 2005On Fri, 25 Nov 2005 14:06:34 +0000, Bruno Medeiros wrote: [snip]Well not *actually* impossible but certainly something you'd only do if you didn't care about performance. 
However, I was really talking on a conceptual level rather than an implementation level. As you and others have said, it would most likely be implemented as a 32-bit unsigned integer however certain bits are redundant and are thus (conceptually) not significant. And as I have said earlier, I already work in such a world. The Euphoria programming language only has 32-bit characters. -- Derek Parnell Melbourne, Australia 26/11/2005 8:40:39 AMcould be used. The main 'new thinking' is that uchar could actually have variable size in terms of bytes. It could be 1, 2 or 4 bytes depending on the character it holds and the encoding used at the time. However I still prefer my earlier suggestion.Whoa, did you ever stop to think on the implications of having a primitive type with *variable size* ? It's plain nuts to implement, no wait, it's actually downright impossible. If you have a uchar variable (not an array), how much space do you allocate for it, if it has variable-size? The only way to implement this would be with a fixed-size equal to the max possible size (4 bytes). That would be a dchar then...Nov 25 2005Derek Parnell wrote:On Fri, 25 Nov 2005 14:06:34 +0000, Bruno Medeiros wrote: [snip]Another alternative would be to use a reference type, but that would use even more space. I honestly don't see how it would be possible, while using less than 4 bytes and maintaining all other D features/properties (performance not considerated).Well not *actually* impossible but certainly something you'd only do if you didn't care about performance.could be used. The main 'new thinking' is that uchar could actually have variable size in terms of bytes. It could be 1, 2 or 4 bytes depending on the character it holds and the encoding used at the time. However I still prefer my earlier suggestion.Whoa, did you ever stop to think on the implications of having a primitive type with *variable size* ? It's plain nuts to implement, no wait, it's actually downright impossible. 
If you have a uchar variable (not an array), how much space do you allocate for it, if it has variable-size? The only way to implement this would be with a fixed-size equal to the max possible size (4 bytes). That would be a dchar then...However, I was really talking on a conceptual level rather than an implementation level. As you and others have said, it would most likely be implemented as a 32-bit unsigned integer however certain bits are redundant and are thus (conceptually) not significant.Thus one would have the dchar type, and this new "Unified String" would be simply be a dchar[] . I've only skimmed trough this discussion but people (Regan & others?) wanted a string type that was space-eficient, allowing itself to be enconded in UTF-8, UTF-16, etc, thus dchar[]/uchar[] would not be acceptable. Unless you wanted this uchar[] to be a basic type by itself, and not an array of basic uchars (which would work, but would be a horrible design) -- In fact, and I'm gonna go a bit on rant mode here (not directed at you in particular, Derek), but I've skimmed through this whole series of threads about Unicode and strings, and I'm getting a bit pissed with all of those meaningless posts based on wrong assumptions, wrong terminology, crazy or unfeasable language changes, and all of this for a problem I've yet failed to grasp why it cannot be fully solved with a dchar array or with a custom made String class (custom-made, that is, *user-coded*, not part of the language). I admit I have no Unicode coding experience, so indeed *I may be* missing something, but on every new thread made all I see is progressively more crazy, ridiculous ideas about a problem I do not see. -- Bruno Medeiros - CS/E student "Certain aspects of D are a pathway to many abilities some consider to be... 
unnatural."

Nov 26 2005

On Sat, 26 Nov 2005 13:30:28 +0000, Bruno Medeiros wrote:

[snip]
> Thus one would have the dchar type, and this new "Unified String" would be simply be a dchar[] .
[snip]
> I've skimmed through this whole series of threads about Unicode and strings, and
[snip]
> I've yet failed to grasp why it cannot be fully solved with a dchar array or with a custom made String class (custom-made, that is, *user-coded*, not part of the language). I admit I have no Unicode coding experience, so indeed *I may be* missing something, but on every new thread made all I see is progressively more crazy, ridiculous ideas about a problem I do not see.

I agree that it is much better to identify the problem *before* one tries to fix it. I'm sure Walter has been having a nice little chuckle at our meandering ways.

I see 'the problem' as ...

** We have a choice of representations for strings in D, and thus at times we introduce a degree of ambiguity in our code that the compiler has trouble resolving.

** The 'char' data type is performing multiple roles. In one aspect, it is a fragment of a character in a UTF-8 encoded string, and in other aspects it is a byte-sized character for ASCII and C/C++ compatibility purposes. This can be confusing to coders not used to thinking internationally.

** Indexing strings that are based on 'char' and 'wchar' can cause bugs because it is possible to access character fragments rather than complete characters.

There are some other issues which are not language related and have to do with string manipulation functions that assume ASCII strings only, such as the 'strip()' function, which doesn't recognize all the Unicode white-space characters, just the ASCII ones.

--
Derek Parnell
Melbourne, Australia
27/11/2005 7:40:57 AM

Nov 26 2005

Regan Heath wrote:
* add a new type/alias "utf", this would alias utf8, 16 or 32. It represents the application specific native encoding.
This allows efficient code, like: string s = "test"; foreach(utf c; s) { } regardless of the applications selected native encoding.I will rewrite this with your changed names (cp*):* add a new type/alias "cpn", this would alias cp1, cp2 or cp4. It represents the application specific native encoding. This allows efficient code, like: string s = "test"; foreach(cpn c; s) { } regardless of the applications selected native encoding.Say you instead have: string s = "smörgåsbord"; foreach(cpn c; s) { } This code would then work on Win32 (with UTF-16 being the native encoding), but not on Linux (with UTF-8). You have introduced platform dependence where there previously was none. What do you gain by this? As I see it, there are only two views you need on a unicode string: a) The code units b) The unicode characters By your suggestion, there would be a third view: c) The unicode characters that are encoded by a single code unit. Why is this useful? Should the "smörgåsbord"-example above throw an error? Isn't what you want instead: assert_only_contains_single_code_unit_characters_in_native_encoding(string) /OskarNov 24 2005On Thu, 24 Nov 2005 10:18:20 +0100, Oskar Linde <oskar.lindeREM OVEgmail.com> wrote:Regan Heath wrote:No. "string" would be UTF-8 encoded internally on both platforms. My proposal stated that "cpn" would thus be an alias for "cp1" but clearly that idea isn't going to work in this case as (I'm assuming) it's impossible to represent some of those characters using a single byte. Java uses an int, maybe we should just do the same?* add a new type/alias "utf", this would alias utf8, 16 or 32. It represents the application specific native encoding. This allows efficient code, like: string s = "test"; foreach(utf c; s) { } regardless of the applications selected native encoding.I will rewrite this with your changed names (cp*): > * add a new type/alias "cpn", this would alias cp1, cp2 or cp4. It > represents the application specific native encoding. 
This allows > efficient code, like: > > string s = "test"; > foreach(cpn c; s) { > } > > regardless of the applications selected native encoding. Say you instead have: string s = "smörgåsbord"; foreach(cpn c; s) { } This code would then work on Win32 (with UTF-16 being the native encoding), but not on Linux (with UTF-8).You have introduced platform dependence where there previously was none. What do you gain by this?No, there is no platform dependance. The choice of encoding is entirely up to the programmer, they choose a default encoding for each program they write, it defaults to UTF-8.As I see it, there are only two views you need on a unicode string: a) The code units b) The unicode characters(a) is seldom required. (b) is the common and thus goal view IMO.By your suggestion, there would be a third view: c) The unicode characters that are encoded by a single code unit.(c) was intended to be equal to (b). It was intended that we have 3 types so that ASCII programs would not be forced to use an int sized variable for single character values. It seems we're stuck doing that.Why is this useful?It's not, it's not what I intended.Should the "smörgåsbord"-example above throw an error?No, certainly not.Isn't what you want instead: assert_only_contains_single_code_unit_characters_in_native_encoding(string)I have no idea what you mean here. ReganNov 24 2005Regan Heath wrote:On Thu, 24 Nov 2005 10:18:20 +0100, Oskar Linde <oskar.lindeREM OVEgmail.com> wrote:Regan Heath wrote:No. "string" would be UTF-8 encoded internally on both platforms.* add a new type/alias "utf", this would alias utf8, 16 or 32. It represents the application specific native encoding. This allows efficient code, like: string s = "test"; foreach(utf c; s) { } regardless of the applications selected native encoding.I will rewrite this with your changed names (cp*): > * add a new type/alias "cpn", this would alias cp1, cp2 or cp4. It > represents the application specific native encoding. 
This allows > efficient code, like: > > string s = "test"; > foreach(cpn c; s) { > } > > regardless of the applications selected native encoding. Say you instead have: string s = "smörgåsbord"; foreach(cpn c; s) { } This code would then work on Win32 (with UTF-16 being the native encoding), but not on Linux (with UTF-8).My proposal stated that "cpn" would thus be an alias for "cp1" butOk. I assumed cpn would be the platform native (preferred) encoding.clearly that idea isn't going to work in this case as (I'm assuming) it's impossible to represent some of those characters using a single byte. Java uses an int, maybe we should just do the same?D uses dchar. Better would maybe be to rename it to char (or maybe character), giving: utf8 (todays char) utf16 (todays wchar) char (todays dchar)Actually, I think it is the other way around. (b) is seldom required. You can search, split, trim, parse, etc.. D:s char[], without any regard of encoding. This is the beauty of UTF-8 and the reason D strings all work on code units rather than characters. When would you actually need character based indexing? I believe the answer is less often than you think. /OskarAs I see it, there are only two views you need on a unicode string: a) The code units b) The unicode characters(a) is seldom required. (b) is the common and thus goal view IMO.Nov 24 2005On Thu, 24 Nov 2005 12:01:04 +0100, Oskar Linde <oskar.lindeREM OVEgmail.com> wrote:Regan Heath wrote:Not platform native, application native. But it's not going to work anyway. It seems an int sized char type is required, I was trying to avoid that.On Thu, 24 Nov 2005 10:18:20 +0100, Oskar Linde <oskar.lindeREM OVEgmail.com> wrote:Regan Heath wrote:No. "string" would be UTF-8 encoded internally on both platforms.* add a new type/alias "utf", this would alias utf8, 16 or 32. It represents the application specific native encoding. 
This allows efficient code, like: string s = "test"; foreach(utf c; s) { } regardless of the applications selected native encoding.I will rewrite this with your changed names (cp*): > * add a new type/alias "cpn", this would alias cp1, cp2 or cp4. It > represents the application specific native encoding. This allows > efficient code, like: > > string s = "test"; > foreach(cpn c; s) { > } > > regardless of the applications selected native encoding. Say you instead have: string s = "smörgåsbord"; foreach(cpn c; s) { } This code would then work on Win32 (with UTF-16 being the native encoding), but not on Linux (with UTF-8).My proposal stated that "cpn" would thus be an alias for "cp1" butOk. I assumed cpn would be the platform native (preferred) encoding.If you split it without regard for encoding you can get 1/2 a character, which is then an illegal UTF-8 sequence.clearly that idea isn't going to work in this case as (I'm assuming) it's impossible to represent some of those characters using a single byte. Java uses an int, maybe we should just do the same?D uses dchar. Better would maybe be to rename it to char (or maybe character), giving: utf8 (todays char) utf16 (todays wchar) char (todays dchar)Actually, I think it is the other way around. (b) is seldom required. You can search, split, trim, parse, etc.. D:s char[], without any regard of encoding.As I see it, there are only two views you need on a unicode string: a) The code units b) The unicode characters(a) is seldom required. (b) is the common and thus goal view IMO.This is the beauty of UTF-8 and the reason D strings all work on code units rather than characters.But people don't care about code units, they care about characters. When do you want to inspect or modify a single code unit? I would say, just about never. On the other hand you might want to change the 4th character of "smörgåsbord" which may be the 4th, 5th, 6th, or 7th index in a char[] array. 
Ick.

> When would you actually need character based indexing? I believe the answer is less often than you think.

Really? How often do you care what UTF-8 fragments are used to represent your characters? The answer _should_ be never, however D forces you to know, you cannot replace the 4th letter of "smörgåsbord" without knowing. This is the problem, IMO.

Regan

Nov 24 2005

Regan Heath wrote:
> But people don't care about code units, they care about characters. When do you want to inspect or modify a single code unit? I would say, just about never. On the other hand you might want to change the 4th character of "smörgåsbord" which may be the 4th, 5th, 6th, or 7th index in a char[] array. Ick.

True. BTW, is there a bug in std.string.insert? I tried to do:

    char[] a = "blaahblaah";
    std.string.insert(a, 5, "öö");
    std.stdio.writefln(a);

Outputs:

    blaahblaah

>> When would you actually need character based indexing? I believe the answer is less often than you think.
> Really? How often do you care what UTF-8 fragments are used to represent your characters? The answer _should_ be never, however D forces you to know, you cannot replace the 4th letter of "smörgåsbord" without knowing. This is the problem, IMO.

I agree. You don't need it very often, but when you do, there's currently no way to do it. I think char[]-slicing and indexing should be a bit better (work at the Unicode character level) since you _never_ want to change code units. (And in case you do, just cast it to void[])

Jari-Matti

Nov 24 2005

On Fri, 25 Nov 2005 00:31:32 +0200, Jari-Matti Mäkelä wrote:
> True. BTW, is there a bug in std.string.insert? I tried to do:
>
>     char[] a = "blaahblaah";
>     std.string.insert(a, 5, "öö");
>     std.stdio.writefln(a);
>
> Outputs: blaahblaah

No bug. The function is not designed to update the same string passed to the function. It returns an updated string.
    char[] a = "blaahblaah";
    a = std.string.insert(a, 5, "öö");
    std.stdio.writefln(a);

--
Derek
(skype: derek.j.parnell)
Melbourne, Australia
25/11/2005 10:04:41 AM

Nov 24 2005

In article <ops0rhghyl23k2f5 nrage.netwin.co.nz>, Regan Heath says...
> On Thu, 24 Nov 2005 12:01:04 +0100, Oskar Linde <oskar.lindeREM OVEgmail.com> wrote:
>>>> As I see it, there are only two views you need on a unicode string:
>>>> a) The code units
>>>> b) The unicode characters
>>> (a) is seldom required. (b) is the common and thus goal view IMO.
>> Actually, I think it is the other way around. (b) is seldom required. You can search, split, trim, parse, etc.. D:s char[], without any regard of encoding.
> If you split it without regard for encoding you can get 1/2 a character, which is then an illegal UTF-8 sequence.

By split, I meant this:

    char[][] words = "abc def ghi åäö jkl".split(" ");

>> This is the beauty of UTF-8 and the reason D strings all work on code units rather than characters.
> But people don't care about code units, they care about characters. [...]

Most of the time people care about string contents, neither code units nor characters. Naturally, I'm biased by my own experience. I have written a few applications in D dealing with UTF-8 data, including parsing grammar definition files and communicating with web servers, but not once have I needed character based indexing. One reason may be that all delimiters used are ASCII, and therefore only occupy a single code unit, but I would assume that this is typical for most data.

When dealing with UTF-8 streams, you want searching and parsing to work on indices (positions) within this stream, not on the character count up to this position. A code unit index gives you the direct byte position of the stream, whereas a character index would require iterating the entire stream up to the indexed position. The performance difference is hardly negligible.

> [...] When do you want to inspect or modify a single code unit? I would say, just about never.
> On the other hand you might want to change the 4th character of "smörgåsbord" which may be the 4th, 5th, 6th, or 7th index in a char[] array. Ick.

How often do you need to change the 4th character of a string? I think that scenario is just as unlikely. Of course, there are cases where you need character based access, and that is what dchar[] is ideal for*. If you instead want to sacrifice performance for better memory footprint, use a wrapper class. What I don't agree with is making this sacrifice in performance the default, when its gains are so seldom needed.

*) In many cases, such as word processors and similar, you need more efficient data structures than flat arrays. A basic character based string would not be of much help.

>> When would you actually need character based indexing? I believe the answer is less often than you think.
> Really? How often do you care what UTF-8 fragments are used to represent your characters? The answer _should_ be never, however D forces you to know, you cannot replace the 4th letter of "smörgåsbord" without knowing. This is the problem, IMO.

I very seldom, if ever, care what UTF-8 fragments are used to represent the data as long as I know that ASCII characters (those whose character literals are assignable to a char) are represented by a single code unit. You say that users of char[] need to know some things about UTF-8, and I can't argue with that. Maybe the docs should recommend dchar[] for users that want to remain UTF ignorant. :)

Regards,
Oskar

Nov 25 2005

"Oskar Linde" <oskar.lindeREM OVEgmail.com> wrote
> Most of the time people care about string contents, neither code units nor characters. Naturally, I'm biased by my own experience. I have written a few applications in D dealing with UTF-8 data, including parsing grammar definition files and communicating with web servers, but not once have I needed character based indexing.
> One reason may be that all delimiters used are ASCII, and therefore only occupy a single code unit, but I would assume that this is typical for most data.

Absolutely right. This is why, for example, URI classes will remain char[] based. IRI extensions are applied simply by assuming the content is utf8.

>> [...] When do you want to inspect or modify a single code unit? I would say, just about never. On the other hand you might want to change the 4th character of "smörgåsbord" which may be the 4th, 5th, 6th, or 7th index in a char[] array. Ick.
> How often do you need to change the 4th character of a string? I think that scenario is just as unlikely. Of course, there are cases where you need character based access, and that is what dchar[] is ideal for*. If you instead want to sacrifice performance for better memory footprint, use a wrapper class. What I don't agree with is making this sacrifice in performance the default, when its gains are so seldom needed.
>
> *) In many cases, such as word processors and similar, you need more efficient data structures than flat arrays. A basic character based string would not be of much help.

Right on. Such things are very much application specific. That, IMO, is where much of the general confusion stems from.

Nov 25 2005

Regan, your proposal is absolutely too complex. I don't get it and I really don't like it. D is supposed to be a _simple_ language. Here's an alternative proposal:

-Allowed text string types:
    char, char[] (we don't need silly aliases nor wchar/dchar)

-Text string implementation:
    char - Unicode code unit (UTF-8, it's up to the compiler vendor to decide between 1-4xbytes and an int)
    char[] - array of char-types, thus a valid Unicode string encoded in UTF-8, no BOM is needed because all char[]s are UTF-8.
-Text string operations:
    char a = 'ä', b = 'å';
    char[] s = "åäö", t;
    t ~= a; // t == [ 'ä' ]
    t ~= b; // t == [ 'ä', 'å' ] == "äå"
    s[1..3] == "äö" // slice bounds count characters, end exclusive
    foreach(char c; s) writefln(c); // outputs: å \n ä \n ö \n

-I/O:
    writef/writefln - does implicit conversion (utf-8 -> terminal encoding)
    puts/gets - File I/O - through UnicodeStream() (handles encoding issues)

-Conversion:
    std.utf - two functions needed:
        byte[] encode(char[] string, EncodingType et)
        char[] decode(byte[] stream, EncodingType et)

-Compatibility:
    This new char[] is fully compatible with C-language char*, when 0-127 ASCII-values and a trailing zero-value are used.
    Access to Windows/Unix-API available (std.utf.[en/de]code)
    Access to Unicode files available (std.stream.UnicodeStream)

-Advantages:
    OS/compiler vendor independent
    Easy to use

-Disadvantages:
    Hard to implement (or is it? Walter seems to have problems with UTF-8. OTOH this proposal doesn't require you to implement strings using UTF-8, you can also use "fixed-width" UTF16/32)
    It's not super high performance (need to convert a lot on Windows & legacy systems)
    Indexing problem (as UTF-8 streams are variable length, it's hard to tell the exact position of a single character. This affects all string operations except concatenating.)

---

Please stop whining about the slowness of utf-conversions. If it's really so slow, I would certainly want to see some real world benchmarks.

Nov 24 2005

Jari-Matti Mäkelä wrote:
> Regan, your proposal is absolutely too complex. I don't get it and I really don't like it. D is supposed to be a _simple_ language. Here's an alternative proposal:
> [CUT]

+1

Nov 24 2005

On Thu, 24 Nov 2005 13:34:51 +0200, Jari-Matti Mäkelä <jmjmak invalid_utu.fi> wrote:
> Regan, your proposal is absolutely too complex. I don't get it and I really don't like it. D is supposed to be a _simple_ language. Here's an alternative proposal:
<snip>

Thanks for your opinion. It appears some parts of my idea were badly thought out.
I was trying to end up with something simple; it seems a few of my choices were bad ones and they simply complicated the idea. I was trying to avoid picking any 1 type over the others (as you have suggested here). It appears now that I should replace all my talk about cp1, cp2, cp4 and cpn with "all characters are stored in a 32 bit type called uchar". If anyone has a problem with that, I'd direct them to take a look at std.format.doFormat and std.stdio.writef which convert all char[] data into individual dchar's before converting it back to UTF-8 for output to the screen.

> Please stop whining about the slowness of utf-conversions. If it's really so slow, I would certainly want to see some real world benchmarks.

I mention performance only because people have been concerned with it in the past. I too have no idea how much time it takes and would like to see a benchmark. The fact that D is already doing it with writef and no-one has complained...

Regan.

Nov 24 2005

Regan Heath wrote:
> On Thu, 24 Nov 2005 13:34:51 +0200, Jari-Matti Mäkelä <jmjmak invalid_utu.fi> wrote:
>> Regan, your proposal is absolutely too complex. I don't get it and I really don't like it. D is supposed to be a _simple_ language. Here's an alternative proposal:

Sorry for being a bit impolite. I just wanted to show that it's completely possible to write Unicode-compliant programs without the need for several string-keywords. I believe a careful design & implementation removes most of the performance drawbacks.

> Thanks for your opinion. It appears some parts of my idea were badly thought out. I was trying to end up with something simple, it seems a few of my choices were bad ones and they simply complicated the idea.

Thanks for bringing up some conversation. As you can see, neither of us is perfect => designing a modern programming language isn't as easy as it might have seemed.

> I was trying to avoid picking any 1 type over the others (as you have suggested here).

Actually I have to change my opinion.
I think it would be good if the compiler were allowed to choose the correct encoding. I don't think there will be any serious problems since nowadays most Win32-things use UTF-16 and *nix-systems UTF-8.

> It appears now that I should replace all my talk about cp1, cp2, cp4 and cpn with "all characters are stored in a 32 bit type called uchar". If anyone has a problem with that, I'd direct them to take a look at std.format.doFormat and std.stdio.writef which convert all char[] data into individual dchar's before converting it back to UTF-8 for output to the screen.

That is one solution. Although I might let the compiler decide the encoding.

>> Please stop whining about the slowness of utf-conversions. If it's really so slow, I would certainly want to see some real world benchmarks.
> I mention performance only because people have been concerned with it in the past. I too have no idea how much time it takes and would like to see a benchmark. The fact that D is already doing it with writef and no-one has complained...

I can't say anything about the overall complexity class for programs that do Unicode, but at least my simple experiments [1] show that unoptimized use of writefln is 'only' 50% slower than optimized use of printf in C (both using the same gcc-backend). Though I'm not 100% sure this program of mine actually did any transcoding. In addition, I think most 'static' conversions can be precalculated.

[1] http://www.digitalmars.com/drn-bin/wwwnews?digitalmars.D.learn/1983

Jari-Matti

Nov 24 2005

I want to thank everyone for reading and posting opinions on my proposal. It appears I have done a bad job explaining some of it, and some of it simply doesn't work. I have a modified idea in mind which I think might work a whole bunch better and should also be much simpler too.

Thanks everyone.

Regan

Nov 24 2005
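
As a footnote to the thread, the code-unit versus character distinction argued over above can be sketched in a few lines. This is only an illustration, and it assumes a present-day D compiler and Phobos (std.utf.stride, toUTF32 and toUTF8) rather than the 2005 library the posters were using. codeUnitOffset and replaceChar are hypothetical helper names, not anything proposed in the thread.

```d
import std.stdio : writeln;
import std.utf : stride, toUTF32, toUTF8;

/// Hypothetical helper: code-unit offset of the charIndex-th code point
/// (0-based). Note the linear scan from the start of the string.
size_t codeUnitOffset(string s, size_t charIndex)
{
    size_t offset = 0;
    foreach (i; 0 .. charIndex)
        offset += stride(s, offset); // code units used by the sequence at offset
    return offset;
}

/// Hypothetical helper: replace one code point by decoding the whole
/// string to dchar[] (one element per code point) and re-encoding.
string replaceChar(string s, size_t charIndex, dchar c)
{
    dchar[] chars = toUTF32(s).dup;
    chars[charIndex] = c;
    return toUTF8(chars);
}

void main()
{
    // "smörgåsbord", written with escapes so the source encoding cannot matter.
    string s = "sm\u00F6rg\u00E5sbord";

    assert(s.length == 13);          // UTF-8 code units: ö and å take two each
    assert(toUTF32(s).length == 11); // code points

    size_t n = 0;
    foreach (dchar c; s) ++n;        // foreach decodes when asked for dchar
    assert(n == 11);

    // The 4th character ('r') sits at code-unit index 4, not 3.
    assert(codeUnitOffset(s, 3) == 4);
    assert(s[4] == 'r');

    writeln(replaceChar(s, 3, 'R'));
}
```

The scan inside codeUnitOffset is the cost Oskar points to when he argues against character indexing as the default, while the dchar[] round trip in replaceChar is roughly the memory-for-simplicity trade that a 32-bit character type amounts to.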