
digitalmars.D - Wide characters support in D

reply Ruslan Nikolaev <nruslan_devel yahoo.com> writes:
Note: I posted this already on runtime D list, but I think that list was a
wrong one for this question. Sorry for duplication :-)

Hi. I am new to D. It looks like D supports 3 types of characters: char, wchar,
dchar. This is cool, however, I have some questions about it:

1. When we have 2 methods (one with wchar[] and another with char[]), how D
will determine which one to use if I pass a string "hello world"?
2. Many libraries (e.g. tango or phobos) don't provide functions/methods (or
have incomplete support) for wchar/dchar
e.g. writefln probably assumes char[] for strings like "Number %d..."
3. Even if they do support it, it is kind of annoying to provide methods for all 3
types of chars, especially if we want to use the native mode (e.g. for Windows
wchar is better, for Linux char is better). E.g. Windows has _wopen, _wdirent,
_wreaddir, _wopendir, _wmain(int argc, wchar_t[] argv) and so on, and they
should be native (in a sense that no conversion is necessary when we do, for
instance, _wopen). Linux doesn't have them as UTF-8 is used widely there.

Since the D language is targeted at system programming, why not try to use
whatever works better on a particular system (e.g. char would be 2 bytes on
Windows and 1 byte on Linux; it can be a compiler switch, and all libraries can
be compiled properly on a particular system)? It's still necessary to have all
3 types of char for cooperation with C, but in those cases byte, short and int
will do the work. For this kind of situation, it would be nice to have some
built-in functions for transparent conversion from char to byte/short/int and
vice versa (especially if the conversion only happens when needed on a particular
platform).

In my opinion, separating the notion of character from byte would be nice, and it
makes sense as a particular platform uses either UTF-8 or UTF-16 natively.
Programmers could write universal code (like with TCHAR on Windows). Unfortunately, C
uses 'char' and 'byte' interchangeably, but why does D have to make this mistake again?

Sorry if my suggestion sounds odd. Anyway, it would be great to hear something
from D gurus :-)

Ruslan.


Jun 07 2010
next sibling parent "Simen kjaeraas" <simen.kjaras gmail.com> writes:
Ruslan Nikolaev <nruslan_devel yahoo.com> wrote:

 1. When we have 2 methods (one with wchar[] and another with char[]),  
 how D will determine which one to use if I pass a string "hello world"?
String literals in D(2) are of type immutable(char)[] (char[] in D1) by default, and thus will be handled by the char[]-version of the function. Should you want a string literal of a different type, append a c, w, or d to specify char[], wchar[] or dchar[]. Or use a cast.
 Since D language is targeted on system programming, why not to try to  
 use whatever works better on a particular system (e.g. char will be 2  
 bytes on Windows and 1 byte on Linux; it can be a compiler switch, and  
 all libraries can be compiled properly on a particular system).
Because this leads to unportable code, that fails in unexpected ways when moved from one system to another, thus increasing rather than decreasing the cognitive load on the hapless programmer.
 It's still necessary to have all 3 types of char for cooperation with C.  
 But in those cases byte, short and int will do their work.
Absolutely not. One of the things D tries is doing strings right. For that purpose, all 3 types are needed.
 In my opinion, to separate notion of character from byte would be nice,  
 and it makes sense as a particular platform uses either UTF-8 or UTF-16  
 natively. Programmers may write universal code (like TCHAR on Windows).  
 Unfortunately, C uses 'char' and 'byte' interchangeably but why D has to  
 make this mistake again?
D has not. A char is a character: a possibly incomplete UTF-8 code point. A byte is a byte: a humble number in the range of -128 to +127. Yes, it is possible to abuse char in D, and byte likewise. D aims to allow programmers to program close to the metal if they so wish, and thus does not pretend char is an opaque type about which nothing can be known.

-- 
Simen
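A small sketch of the distinction described above (D2 syntax assumed; the function names are made up for the example):

---
void takesBytes(byte[] data) {}  // raw numbers, no encoding attached
void takesText(char[] text) {}   // UTF-8 code units

void main()
{
    byte[] raw = [104, 105];
    char[] txt = "hi".dup;
    takesBytes(raw);
    takesText(txt);
    //takesBytes(txt); // error: char[] and byte[] are distinct types
}
---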
Jun 07 2010
prev sibling next sibling parent Robert Clipsham <robert octarineparrot.com> writes:
On 07/06/10 22:48, Ruslan Nikolaev wrote:
 Note: I posted this already on runtime D list, but I think that list
 was a wrong one for this question. Sorry for duplication :-)

 Hi. I am new to D. It looks like D supports 3 types of characters:
 char, wchar, dchar. This is cool, however, I have some questions
 about it:

 1. When we have 2 methods (one with wchar[] and another with char[]),
 how D will determine which one to use if I pass a string "hello
 world"?
If you pass "Hello World", this is always a string (char[] in D1, immutable(char)[] in D2). If you want to specify a type with a string literal, you can use "Hello World"w or "Hello World"d for wstring and dstring respectively.
 2. Many libraries (e.g. tango or phobos) don't provide
 functions/methods (or have incomplete support) for wchar/dchar e.g.
 writefln probably assumes char[] for strings like "Number %d..."
In tango most, if not all, string functions are templated, so they work with all string types: char[], wchar[] and dchar[]. I don't know how well phobos supports other string types; I know phobos 1 is extremely limited for types other than char[], and I don't know about phobos 2.
 3.
 Even if they do support, it is kind of annoying to provide methods
 for all 3 types of chars. Especially, if we want to use native mode
 (e.g. for Windows wchar is better, for Linux char is better). E.g.
 Windows has _wopen, _wdirent, _wreaddir, _wopenddir, _wmain(int argc,
 wchar_t[] argv) and so on, and they should be native (in a sense that
 no conversion is necessary when we do, for instance, _wopen). Linux
 doesn't have them as UTF-8 is used widely there.
Enter templates! You can write the function once and have it work with all three string types with little effort involved. All the lower-level functions that interact with the operating system are abstracted away nicely for you in both Tango and Phobos, so you'll never have to deal with this for basic functions. For your own functions, it's a simple matter of templating them in most cases.
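A minimal sketch of such a templated function (D2 syntax; countSpaces is a made-up name for the example):

---
import std.stdio;

// Written once, usable with char[], wchar[] and dchar[] alike.
size_t countSpaces(Char)(const(Char)[] s)
{
    size_t n = 0;
    foreach (Char c; s)
        if (c == ' ') ++n;
    return n;
}

void main()
{
    writeln(countSpaces("one two three"));  // char version
    writeln(countSpaces("one two three"w)); // wchar version
    writeln(countSpaces("one two three"d)); // dchar version
}
---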
 Since D language is targeted on system programming, why not to try to
 use whatever works better on a particular system (e.g. char will be 2
 bytes on Windows and 1 byte on Linux; it can be a compiler switch,
 and all libraries can be compiled properly on a particular system).
 It's still necessary to have all 3 types of char for cooperation with
 C. But in those cases byte, short and int will do their work. For
 this kind of situation, it would be nice to have some built-in
 functions for transparent conversion from char to byte/short/int and
 vice versa (especially, if conversion only happens if needed on a
 particular platform).
This is something C did wrong. If compilers are free to choose their own width for the string type, you end up with the mess C has, where every library introduces its own custom types to make sure they're the expected length, e.g. uint32_t etc. Having things the other way around makes life far easier - int is always 32 bits signed, for example, and the same applies to strings. You can use version blocks if you want to specify a type which changes based on platform; I wouldn't recommend it though, it just makes life harder in the long run.
 In my opinion, to separate notion of character from byte would be
 nice, and it makes sense as a particular platform uses either UTF-8
 or UTF-16 natively. Programmers may write universal code (like TCHAR
 on Windows). Unfortunately, C uses 'char' and 'byte' interchangeably
 but why D has to make this mistake again?
They are different types in D, so I'm not sure what you mean. byte/ubyte have no encoding associated with them; char is always UTF-8, wchar UTF-16, etc.

Robert
Jun 07 2010
prev sibling next sibling parent Ali Çehreli <acehreli yahoo.com> writes:
Ruslan Nikolaev wrote:

 1. When we have 2 methods (one with wchar[] and another with char[]), 
 how D will determine which one to use if I pass a string "hello world"?
I asked the same question on the D.learn group recently. Literals like that don't have a particular encoding. The programmer must specify explicitly to resolve ambiguities: "hello world"c or "hello world"w.
 3. Even if they do support, it is kind of annoying to provide methods 
 for all 3 types of chars. Especially, if we want to use native mode
I think the solution is to take advantage of templates, and to use template constraints if the template parameter is too flexible. Another approach might be to use dchar within the application and other encodings at the interfaces.

Ali
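A minimal sketch of that approach, assuming Phobos 2 (letterCount is a made-up name; isSomeString does the constraining):

---
import std.traits : isSomeString;

// Accept only string types, and normalize to dchar internally
// whenever per-character work is needed.
size_t letterCount(S)(S s) if (isSomeString!S)
{
    size_t n = 0;
    foreach (dchar c; s) // decodes full code points whatever S is
        ++n;
    return n;
}
---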
Jun 07 2010
prev sibling next sibling parent justin <justin economicmodeling.com> writes:
This doesn't answer all your questions and suggestions, but here goes.

By default, string literals are UTF-8 (string). If you want
to use UTF-16 or 32, use "Hello world"w and "Hello world"d respectively.

You can get a function to support string, wstring, and dstring by using
templating and the fact that D can do automatic conversions for you. For
instance:

string blah = "hello world";
foreach (dchar c; blah)   // guaranteed to get a full character
{
    // do something
}
Jun 07 2010
prev sibling next sibling parent reply Ruslan Nikolaev <nruslan_devel yahoo.com> writes:
Ok, ok... that was just a suggestion... Thanks for the reply about the "Hello world"
representation. Were the postfixes "w" and "d" there initially, or added just recently? I did
not know about them. I thought D does automatic conversion for string literals.

Yes, templates may help. However, that unnecessarily makes the code bigger (since we
have to compile it for every char type). The other problem is that it allows the
programmer to choose which one to use. He or she may just prefer char[] as
UTF-8 (or wchar[] as UTF-16). That will be fine on a platform that supports this
encoding natively (e.g. for file system operations, screen output, etc.),
whereas it will cause conversion overhead on the other. Not to say that it's a
big overhead, but an unnecessary one. Having said this, I do agree that there must
be some flexibility (e.g. in Java char[] is always 2 bytes); however, I don't
believe that this flexibility should be available to the application programmer.

I don't think there is any problem with having a different size of char. In fact,
it would make programs better (since application programmers would have to
think in terms of characters as opposed to bytes). System programmers (i.e. OS
programmers) may choose to think as they expect it to be (since a char width
option can be added to the compiler). TCHAR in Windows is a good example of this.
Whenever you need to determine the size of an element (e.g. for allocation), you can
use 'sizeof'. Again, it does not mean that you're deprived of char/wchar/dchar
capability. It can still be supported (e.g. via ubyte/ushort/uint) for the sake
of interoperability or some special cases. Special string constants (e.g. ""b,
""w, ""d) can be supported, too. My only point is that it would be good to have
a universal char type that depends on the platform. That, in turn, allows having a
unified char for all libraries on that platform.

In addition, commonly used constants '\n', '\r', '\t' will be the same
regardless of char width.

Anyway, that was just a suggestion. You may disagree with this if you wish.

Ruslan.


Jun 07 2010
next sibling parent torhu <no spam.invalid> writes:
On 08.06.2010 01:16, Ruslan Nikolaev wrote:
 Ok, ok... that was just a suggestion... Thanks, for reply about "Hello world"
representation. Was postfix "w" and "d" added initially or just recently? I did
not know about it. I thought D does automatic conversion for string literals.
There is automatic conversion, try this example:

---
//void f(char[] s) { writefln("char"); }
void f(wchar[] s) { writefln("wchar"); }

void main()
{
    f("hello");
}
---

As long as there's just one possible match, a string literal with no postfix will be interpreted as char[], wchar[], or dchar[] depending on context. But if you uncomment the first f(), the compiler will complain about there being two matching overloads. Then you'll have to add the 'c' or 'w' postfixes to the string literal to disambiguate.

For templates and type inference, string literals default to char[]. This example prints 'char':

---
void f(T)(T[] s)
{
    writefln(T.stringof);
}

void main()
{
    f("hello");
}
---
Jun 07 2010
prev sibling parent "Nick Sabalausky" <a a.a> writes:
"Ruslan Nikolaev" <nruslan_devel yahoo.com> wrote in message 
news:mailman.122.1275952601.24349.digitalmars-d puremagic.com...
 Ok, ok... that was just a suggestion... Thanks, for reply about "Hello 
 world" representation. Was postfix "w" and "d" added initially or just 
 recently? I did not know about it. I thought D does automatic conversion 
 for string literals.
The postfix 'c', 'w' and 'd' have been in there a long time. But D does have a little bit of automatic conversion. Let me try to clarify:

"hello"c  // string, UTF-8
"hello"w  // wstring, UTF-16
"hello"d  // dstring, UTF-32
"hello"   // Depends how you use it

Suppose I have a function that takes a UTF-8 string, and I call it:

void cfoo(string a) {}
cfoo("hello"c); // Works
cfoo("hello"w); // Error, wrong type
cfoo("hello"d); // Error, wrong type
cfoo("hello");  // Works, assumed to be UTF-8 string

If I make a different function that takes a UTF-16 wstring instead:

void wfoo(wstring a) {}
wfoo("hello"c); // Error, wrong type
wfoo("hello"w); // Works
wfoo("hello"d); // Error, wrong type
wfoo("hello");  // Works, assumed to be UTF-16 wstring

And then, a UTF-32 dstring version would be similar:

void dfoo(dstring a) {}
dfoo("hello"c); // Error, wrong type
dfoo("hello"w); // Error, wrong type
dfoo("hello"d); // Works
dfoo("hello");  // Works, assumed to be UTF-32 dstring

As you can see, the literals with postfixes are always the exact type you specify. If you have no postfix, then you get whatever the compiler expects it to be.

But, then the question is, what happens if any of those types can be used? Which does the compiler choose?

void Tfoo(T)(T a)
{
    // When compiling, display the type used.
    pragma(msg, T.stringof);
}
Tfoo("hello");

(Normally you'd want to add in a constraint that T must be one of the string types, so that no one tries to pass in an int or float or something. I skipped that in there.)

In that, Tfoo isn't expecting any particular type of string, it can take any type. And "hello" doesn't have a postfix, so the compiler uses the default: UTF-8 string.
 Yes, templates may help. However, that unnecessary make code bigger (since 
 we have to compile it for every char type).
It only generates code for the types that are actually needed. If, for instance, your program never uses anything except UTF-8, then only one version of the function will be made - the UTF-8 version. If you don't use every char type, then it doesn't generate it for every char type - just the ones you choose to use.
The other problem is that it allows programmer to choose which one to use. 
He or she may just prefer char[] as UTF-8 (or wchar[] as UTF-16). That will 
be fine on platform that supports this encoding natively (e.g. for file 
system operations, screen output, etc.), whereas it will cause conversion 
overhead on the other. I don't think there is any problem with having 
different size of char. In fact, that would make programs better (since 
application programmers will have to think in terms of characters as 
opposed to bytes). Not to say that it's a big overhead, but unnecessary 
one. Having said this, I do agree that there must be some flexibility (e.g. 
in Java char[] is always 2 bytes), however, I don't believe that this 
flexibility should be available for application programmer.
That's not good. First of all, UTF-16 is a lousy encoding, it combines the worst of both UTF-8 and UTF-32: it's multibyte and non-word-aligned like UTF-8, but it still wastes a lot of space like UTF-32. So even if your OS uses it natively, it's still best to do most internal processing in either UTF-8 or UTF-32. (And with templated string functions, if the programmer actually does want to use the native type in the *rare* cases where he's making enough OS calls that it would actually matter, he can still do so.)

Secondly, the programmer *should* be able to use whatever type he decides is appropriate. If he wants to stick with native, he can do so, but he shouldn't be forced into choosing between "use the native encoding" and "abuse the type system by pretending that an int is a character". For instance, complex low-level text processing *relies* on knowing exactly what encoding is being used and coding specifically to that encoding. As an example, I'm currently working on a generalized parser library ( http://www.dsource.org/projects/goldie ). Something like that is complex enough already that implementing the internal lexer natively for each possible native text encoding is just not worthwhile, especially since the text hardly ever gets passed to or from any OS calls that expect any particular encoding.

Or maybe you're on a fancy OS that can handle any encoding natively. Or maybe the programmer is in a low-memory (or very-large-data) situation and needs the space savings of UTF-8 regardless of OS and doesn't care about speed. Or maybe they're actually *writing* an OS (most modern languages are completely useless for writing an OS; D isn't). A language or a library should *never* assume it knows the programmer's needs better than the programmer does.

Also, C already tried the approach of multi-sized types (e.g. C's "int"), and it ended up being a big PITA disaster that everyone ended up having to make up hacks to work around.

 System programmers (i.e. OS programmers) may choose to think as they 
 expect it to be (since char width option can be added to compiler).<
See that's the thing, D is intended as a systems language, so a D programmer must be able to easily handle it that way whenever they need to.
TCHAR in Windows is a good example of it. Whenever you need to determine 
size of element (e.g. for allocation), you can use 'sizeof'. Again, it does 
not mean that you're deprived of char/wchar/dchar capability. It still can 
be supported (e.g. via ubyte/ushort/uint) for the sake of interoperability 
or some special cases. Special string constants (e.g. ""b, ""w, ""d) can be 
supported, too. My only point is that it would be good to have universal 
char type that depends on platform.
You can have that easily:

version(Windows)
    alias wstring tstring;
else
    alias string tstring;

Besides, just because you *can* get a job done a certain way doesn't mean languages should never try to allow a better way for those who want a better way.
 That, in turns, allows to have unified char for all libraries on this 
 platform.
With templated text functions, there is very little benefit to be gained from having a unified char. It just wouldn't serve any real purpose. All it would do is cause problems for anyone who needs to work at the low level.

-------------------------------
Not sent from an iPhone.
Jun 07 2010
prev sibling next sibling parent reply Walter Bright <newshound1 digitalmars.com> writes:
Ruslan Nikolaev wrote:
 Note: I posted this already on runtime D list,
Although D is designed to be fairly agnostic about character types, in practice I recommend the following:

1. Use the string type for strings; it's char[] on D1 and immutable(char)[] on D2.

2. Use dchar's to hold individual characters.

The problem with wchar's is that everyone forgets about surrogate pairs. Most UTF-16 programs in the wild, including nearly all Java programs, are broken with regard to surrogate pairs. The problem with dchar's is that strings of them consume memory at a prodigious rate.
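A short illustration of the surrogate-pair point (D2/Phobos assumed; U+1D11E lies outside the BMP):

---
import std.stdio;

void main()
{
    string  u8  = "\U0001D11E";  // MUSICAL SYMBOL G CLEF
    wstring u16 = "\U0001D11E"w;
    dstring u32 = "\U0001D11E"d;

    writeln(u8.length);  // 4 UTF-8 code units
    writeln(u16.length); // 2 UTF-16 code units: a surrogate pair
    writeln(u32.length); // 1 dchar: one code unit per code point
}
---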
Jun 07 2010
next sibling parent Kagamin <spam here.lot> writes:
Walter Bright Wrote:

 The problem with wchar's is that everyone forgets about surrogate pairs. Most
 UTF-16 programs in the wild, including nearly all Java programs, are broken
 with regard to surrogate pairs.
I'm afraid it will be pretty hard to show the bug. I don't know whether Java is particularly nasty here, but for C code it will be hard.
Jun 07 2010
prev sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Walter Bright:
 The problem with dchar's is strings of them consume 
 memory at a prodigious rate.
Warning: lazy musings ahead.

I hope we'll soon have computers with 200+ GB of RAM, where using strings of less than 32-bit chars is in most cases a premature optimization (like today it is often a silly optimization to use arrays of 16-bit ints instead of 32-bit or 64-bit ints; only special situations found with the profiler can justify the use of arrays of shorts in a low-level language).

Even in PCs with 200 GB of RAM the first levels of CPU cache can be very small (like 32 KB), and cache misses are costly, so even if huge amounts of RAM are present, it can be useful to reduce the size of strings to increase performance. A possible solution to this problem is some kind of real-time hardware compression/decompression between the CPU and the RAM. UTF-8 can be a good enough way to compress 32-bit strings, but then we are back to writing low-level programs that have to deal with UTF-8. To avoid this, CPUs and RAM could compress/decompress the text transparently to the programmer. Unfortunately UTF-8 is a variable-length encoding, so maybe it can't be done transparently enough; a smarter and better compression algorithm could be used to keep all this transparent enough (not fully transparent: some low-level situations can require code that deals with the compression).

Bye,
bearophile
Jun 08 2010
next sibling parent Walter Bright <newshound1 digitalmars.com> writes:
bearophile wrote:
 Walter Bright:
 The problem with dchar's is strings of them consume memory at a prodigious
 rate.
Warning: lazy musings ahead. I hope we'll soon have computers with 200+ GB of RAM where using strings that use less than 32-bit chars is in most cases a premature optimization (like today is often a silly optimization to use arrays of 16-bit ints instead of 32-bit or 64-bit ints. Only special situations found with the profiler can justify the use of arrays of shorts in a low level language). Even in PCs with 200 GB of RAM the first levels of CPU caches can be very small (like 32 KB), and cache misses are costly, so even if huge amounts of RAMs are present, to increase performance it can be useful to reduce the size of strings. A possible solution to this problem can be some kind of real-time hardware compression/decompression between the CPU and the RAM. UTF-8 can be a good enough way to compress 32-bit strings. So we are back to writing low-level programs that have to deal with UTF-8. To avoid this, CPUs and RAM can compress/decompress the text transparently to the programmer. Unfortunately UTF-8 is a variable-length encoding, so maybe it can't be done transparently enough. So a smarter and better compression algorithm can be used to keep all this transparent enough (not fully transparent, some low-level situations can require code that deals with the compression).
I strongly suspect that the encode/decode time for UTF-8 is more than compensated for by the 4x reduction in memory usage. I did a large app 10 years ago using dchars throughout, and the effects of the memory consumption were murderous. (As the recent article on memory consumption shows, large data structures can have huge negative speed consequences due to virtual and cache memory, and multiple cores trying to access the same memory.)

https://lwn.net/Articles/250967/

Keep in mind that the overwhelming bulk of UTF-8 text is ascii, and requires only one cycle to "decode".
Jun 08 2010
prev sibling parent reply Rainer Deyke <rainerd eldwood.com> writes:
On 6/8/2010 13:57, bearophile wrote:
 I hope we'll soon have computers with 200+ GB of RAM where using
 strings that use less than 32-bit chars is in most cases a premature
 optimization (like today is often a silly optimization to use arrays
 of 16-bit ints instead of 32-bit or 64-bit ints. Only special
 situations found with the profiler can justify the use of arrays of
 shorts in a low level language).
Off-topic, but I don't need a profiler to tell me that my 1024x1024x1024 arrays should use shorts instead of ints. And even when 200GB becomes common, I'd still rather not waste that memory by using twice as much space as I have to just because I can.

-- 
Rainer Deyke - rainerd eldwood.com
Jun 08 2010
parent reply "Nick Sabalausky" <a a.a> writes:
"Rainer Deyke" <rainerd eldwood.com> wrote in message 
news:humes8$s8$1 digitalmars.com...
 On 6/8/2010 13:57, bearophile wrote:
 I hope we'll soon have computers with 200+ GB of RAM where using
 strings that use less than 32-bit chars is in most cases a premature
 optimization (like today is often a silly optimization to use arrays
 of 16-bit ints instead of 32-bit or 64-bit ints. Only special
 situations found with the profiler can justify the use of arrays of
 shorts in a low level language).
Off-topic, but I don't need a profiler to tell me that my 1024x1024x1024 arrays should use shorts instead of ints. And even when 200GB becomes common, I'd still rather not waste that memory by using twice as much space as I have to just because I can.
I think he was just musing that it would be nice to be able to ignore multiple encodings and multiple-code-units, and get back to something much closer to the blissful simplicity of ASCII. On that particular point, I concur ;)
Jun 08 2010
parent "Nick Sabalausky" <a a.a> writes:
"Nick Sabalausky" <a a.a> wrote in message 
news:humfrk$2gk$1 digitalmars.com...
 "Rainer Deyke" <rainerd eldwood.com> wrote in message 
 news:humes8$s8$1 digitalmars.com...
 On 6/8/2010 13:57, bearophile wrote:
 I hope we'll soon have computers with 200+ GB of RAM where using
 strings that use less than 32-bit chars is in most cases a premature
 optimization (like today is often a silly optimization to use arrays
 of 16-bit ints instead of 32-bit or 64-bit ints. Only special
 situations found with the profiler can justify the use of arrays of
 shorts in a low level language).
Off-topic, but I don't need a profiler to tell me that my 1024x1024x1024 arrays should use shorts instead of ints. And even when 200GB becomes common, I'd still rather not waste that memory by using twice as much space as I have to just because I can.
I think he was just musing that it would be nice to be able to ignore multiple encodings and multiple-code-units, and get back to something much closer to the blissful simplicity of ASCII. On that particular point, I concur ;)
Keep in mind too, that for an English-language app (and there are plenty), even using ASCII still wastes space, since you usually only need the 26 letters, 10 digits, a few whitespace characters, and a handful of punctuation. You could probably fit that in 6 bits per character, less if you're ballsy enough to use huffman encoding internally. Yea, there's twice as many letters if you count uppercase/lowercase, but random-casing is rare so there's tricks you can use to just stick with 26 plus maybe a few special control characters.

But, of course, nobody actually does any of that because with the amount of memory we have, and the amount of memory already used by other parts of a program, the savings wouldn't be worth the bother. But I agree with your point too. Just saying.
Jun 08 2010
prev sibling next sibling parent reply Ruslan Nikolaev <nruslan_devel yahoo.com> writes:
Just one more addition: it is possible to have a built-in function that converts a multibyte (or multiword) char sequence (even though in my proposal it can be of different size) to a dchar (UTF-32) character. Again, my only point is that it would be nice to have something similar to TCHAR so that all libraries can use it if they choose not to provide functions for all 3 types.

2Walter:
Yes, programmers do often ignore surrogate pairs in case of UTF-16. But in case of undetermined char size (1 or 2 bytes) they will have to use special builtin conversion functions to dchar unless they want their code to be completely broken.

Thanks,
Ruslan.

--- On Tue, 6/8/10, Ruslan Nikolaev <nruslan_devel yahoo.com> wrote:
[snip]
Jun 07 2010
parent Walter Bright <newshound1 digitalmars.com> writes:
Ruslan Nikolaev wrote:
 Just one more addition: it is possible to have built-in function that
 converts multibyte (or multiword) char sequence (even though in my proposal
 it can be of different size) to dchar (UTF-32) character. Again, my only
 point is that it would be nice to have something similar to TCHAR so that all
 libraries can use it if they choose not to provide functions for all 3 types.
 
 
 2Walter: Yes, programmers do often ignore surrogate pairs in case of UTF-16.
 But in case of undetermined char size (1 or 2 bytes) they will have to use
 special builtin conversion functions to dchar unless they want their code to
 be completely broken.
The nice thing about char[] is that you'll find out real fast if your multibyte code is broken. With surrogate pairs in wchar[], the bug may lurk undetected for a decade.
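A sketch of the kind of lurking bug meant here: naive indexing works on BMP-only text for years, then quietly splits a surrogate pair.

---
void naive()
{
    wstring w = "a\U0001D11Eb"w;
    // w.length is 4, not 3, and w[1] is NOT the G clef character -
    // it is only the leading half of its surrogate pair.
    wchar unit = w[1];
}
---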
Jun 07 2010
prev sibling next sibling parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Mon, 07 Jun 2010 17:48:09 -0400, Ruslan Nikolaev  
<nruslan_devel yahoo.com> wrote:

 Note: I posted this already on runtime D list, but I think that list was  
 a wrong one for this question. Sorry for duplication :-)

 Hi. I am new to D. It looks like D supports 3 types of characters: char,  
 wchar, dchar. This is cool, however, I have some questions about it:

 1. When we have 2 methods (one with wchar[] and another with char[]),  
 how D will determine which one to use if I pass a string "hello world"?
 2. Many libraries (e.g. tango or phobos) don't provide functions/methods  
 (or have incomplete support) for wchar/dchar
 e.g. writefln probably assumes char[] for strings like "Number %d..."
 3. Even if they do support, it is kind of annoying to provide methods  
 for all 3 types of chars. Especially, if we want to use native mode  
 (e.g. for Windows wchar is better, for Linux char is better). E.g.  
 Windows has _wopen, _wdirent, _wreaddir, _wopenddir, _wmain(int argc,  
 wchar_t[] argv) and so on, and they should be native (in a sense that no  
 conversion is necessary when we do, for instance, _wopen). Linux doesn't  
 have them as UTF-8 is used widely there.

 Since D language is targeted on system programming, why not to try to  
 use whatever works better on a particular system (e.g. char will be 2  
 bytes on Windows and 1 byte on Linux; it can be a compiler switch, and  
 all libraries can be compiled properly on a particular system). It's  
 still necessary to have all 3 types of char for cooperation with C. But  
 in those cases byte, short and int will do their work. For this kind of  
 situation, it would be nice to have some built-in functions for  
 transparent conversion from char to byte/short/int and vice versa  
 (especially, if conversion only happens if needed on a particular  
 platform).

 In my opinion, to separate notion of character from byte would be nice,  
 and it makes sense as a particular platform uses either UTF-8 or UTF-16  
 natively. Programmers may write universal code (like TCHAR on Windows).  
 Unfortunately, C uses 'char' and 'byte' interchangeably but why D has to  
 make this mistake again?
One thing that may not be clear from your interpretation of D's docs: all strings representable by one character type are also representable by all the other character types. This means that a function that takes a char[] can also take a dchar[] if it is sent through a converter (i.e. toUtf8 in Tango, I think). So D's char is decidedly not like byte or ubyte, or C's char.

In general, I use char (utf8) because I am used to C and ASCII (which is exactly represented in utf-8). But because char is utf-8, it could potentially accept any unicode string.

-Steve
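For reference, a minimal sketch of such a conversion in Phobos 2, where std.conv.to transcodes between the string widths (Tango's toUtf8 family plays the same role):

---
import std.conv : to;

void needsUtf8(string s) {}

void main()
{
    wstring w = "hello"w;
    needsUtf8(to!string(w)); // transcodes UTF-16 to UTF-8
}
---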
Jun 07 2010
parent Ali Çehreli <acehreli yahoo.com> writes:
Steven Schveighoffer wrote:
 a function that takes 
 a char[] can also take a dchar[] if it is sent through a converter (i.e. 
 toUtf8 on Tango I think).
In Phobos, there are text, wtext, and dtext in std.conv:

/**
Convenience functions for converting any number and types of
arguments into _text (the three character widths).

Example:
----
assert(text(42, ' ', 1.5, ": xyz") == "42 1.5: xyz");
assert(wtext(42, ' ', 1.5, ": xyz") == "42 1.5: xyz"w);
assert(dtext(42, ' ', 1.5, ": xyz") == "42 1.5: xyz"d);
----
*/

Ali
Jun 07 2010
prev sibling next sibling parent reply Ruslan Nikolaev <nruslan_devel yahoo.com> writes:
 It only generates code for the types that are actually needed. If, for
 instance, your program never uses anything except UTF-8, then only one
 version of the function will be made - the UTF-8 version. If you don't use
 every char type, then it doesn't generate it for every char type - just the
 ones you choose to use.
Not quite right. If we create system dynamic libraries or dynamic libraries commonly used, we will have to compile every instance unless we want to burden the user with this. Otherwise, the same code will be duplicated in users' programs over and over again.
 That's not good. First of all, UTF-16 is a lousy encoding, it combines the
 worst of both UTF-8 and UTF-32: It's multibyte and non-word-aligned like
 UTF-8, but it still wastes a lot of space like UTF-32. So even if your OS
 uses it natively, it's still best to do most internal processing in either
 UTF-8 or UTF-32. (And with templated string functions, if the programmer
 actually does want to use the native type in the *rare* cases where he's
 making enough OS calls that it would actually matter, he can still do so.)
First of all, UTF-16 is not a lousy encoding. It requires for most characters 2 bytes (not so big wastage, especially if you consider other languages). Only for REALLY rare chars do you need 4 bytes. Whereas UTF-8 will require from 1 to 3 bytes for the same common characters. And also 4 chars for REALLY rare ones. In UTF-16 a surrogate is an exception whereas in UTF-8 it is a rule (when something is an exception, it won't affect performance in most cases; when something is a rule - it will affect).

Finally, UTF-16 is used by Windows, Java, .NET, Qt and many others. Developers of these systems chose to use UTF-16 even though some [...]
 Secondly, the programmer *should* be able to use whatever type he decides
 is appropriate. If he wants to stick with native, he can do
Why? He/She can just use conversion to UTF-32 (dchar) whenever better understanding of a character is needed. At least, that's what should be done anyway.
 You can have that easily:

 version(Windows)
     alias wstring tstring;
 else
     alias string tstring;
See, that's my point. Nobody is going to do this unless the above is standardized by the language. Everybody will stick to something particular (either char or wchar).
 With templated text functions, there is very little benefit to be gained
 from having a unified char. Just wouldn't serve any real
see my comment above about templates and dynamic libraries

Ruslan
Jun 07 2010
next sibling parent Jesse Phillips <jessekphillips+D gmail.com> writes:
On Mon, 07 Jun 2010 19:26:02 -0700, Ruslan Nikolaev wrote:

 It only generates code for the types that are actually needed. If, for
 instance, your progam never uses anything except UTF-8, then only one
 version of the function will be made - the UTF-8 version.  If you don't
 use
 every char type, then it doesn't generate it for every char type - just
 the
 ones you choose to use.
Not quite right. If we create system dynamic libraries or dynamic libraries commonly used, we will have to compile every instance unless we want to burden user with this. Otherwise, the same code will be duplicated in users program over and over again.
I think you really need to look more into what templates are and do.

There is also going to be very little performance gain from using the "system type" for strings, considering that most of the work is not likely going to be the system commands you mentioned, but within D itself.
Jun 07 2010
prev sibling parent "Nick Sabalausky" <a a.a> writes:
"Ruslan Nikolaev" <nruslan_devel yahoo.com> wrote in message 
news:mailman.124.1275963971.24349.digitalmars-d puremagic.com...

Nick wrote:
 It only generates code for the types that are actually
 needed. If, for
 instance, your progam never uses anything except UTF-8,
 then only one
 version of the function will be made - the UTF-8
 version. If you don't use
 every char type, then it doesn't generate it for every char
 type - just the
 ones you choose to use.
Not quite right. If we create system dynamic libraries or dynamic libraries 
commonly used, we will have to compile every instance unless we want to 
burden user with this. Otherwise, the same code will be duplicated in users 
program over and over again.<
That's a rather minor issue. I think you're overestimating the amount of bloat that occurs from having one string type versus three string types. Absolute worst case scenario would be a library that contains nothing but text-processing functions. That would triple in size, but what's the biggest such lib you've ever seen anyway? And for most libs, only a fraction is going to be taken up by text processing, so the difference won't be particularly large.

In fact, the difference would likely be dwarfed anyway by the bloat incurred from all the other templated code (i.e. code which would be largely unaffected by the number of string types), and yes, *that* can get to be a problem, but it's an entirely separate one.
 That's not good. First of all, UTF-16 is a lousy encoding,
 it combines the
 worst of both UTF-8 and UTF-32: It's multibyte and
 non-word-aligned like
 UTF-8, but it still wastes a lot of space like UTF-32. So
 even if your OS
 uses it natively, it's still best to do most internal
 processing in either
 UTF-8 or UTF-32. (And with templated string functions, if
 the programmer
 actually does want to use the native type in the *rare*
 cases where he's
 making enough OS calls that it would actually matter, he
 can still do so.)

First of all, UTF-16 is not a lousy encoding. It requires for most 
characters 2 bytes (not so big wastage especially if you consider other 
languages). Only for REALLY rare chars do you need 4 bytes. Whereas UTF-8 
will require from 1 to 3 bytes for the same common characters. And also 4 
chars for REALLY rare ones. In UTF-16 surrogate is an exception whereas in 
UTF-8 it is a rule (when something is an exception, it won't affect 
performance in most cases; when something is a rule - it will affect).<
Maybe "lousy" is too strong a word, but aside from compatibility with other libs/software that use it (which I'll address separately), UTF-16 is not particularly useful compared to UTF-8 and UTF-32:

Non-latin-alphabet language, UTF-8 vs UTF-16: The real-world difference in sizes is minimal. But UTF-8 has some advantages: the nature of the encoding makes backwards-scanning cheaper and easier. Also, as Walter said, bugs in the handling of multi-code-unit characters become fairly obvious. Advantages of UTF-16: none.

Latin-alphabet language, UTF-8 vs UTF-16: All the same UTF-8 advantages for non-latin-alphabet languages still apply, plus there's a space savings: under UTF-8, *most* characters are going to be 1 byte. Yes, there will be the occasional 2+ byte character, but they're so much less common that the overhead compared to ASCII (I'm only using ASCII as a baseline here, for the sake of comparisons) would only be around 0% to 15% depending on the language. UTF-16, however, has a consistent 100% overhead (slightly more when you count surrogate pairs, but I'll just leave it at 100%). So, depending on language, UTF-16 would be around 70%-100% larger than UTF-8. That's not insignificant.

Any language, UTF-32 vs UTF-16: Using UTF-32 takes up extra space, but when that matters, UTF-8 already has the advantage over UTF-16 anyway regardless of whether or not UTF-8 is providing a space savings (see above), so the question of UTF-32 vs UTF-16 becomes useless. The rest of the time, UTF-32 has these advantages: guaranteed one code-unit per character, and the code-unit size is faster on typical CPUs (which generally handle 32 bits faster than they handle 8 or 16 bits). Advantages of UTF-16: none.

So compatibility with certain tools/libs is really the only reason ever to choose UTF-16.

Qt and many others. Developers of these systems chose to use UTF-16 even 

First of all, it's not exactly unheard of for big projects to make a sub-optimal decision.

Secondly, Java and Windows adopted 16-bit encodings back when many people were still under the mistaken impression that it would allow them to hold any character in one code-unit. If that had been true, then it would indeed have had at least certain advantages over UTF-8. But by the time the programming world at large knew better, it was too late for Java or Windows to change. .NET uses UTF-16 because Windows does. I don't know about Qt, but judging by how long Wikipedia says it's been around, I'd say it's probably the same story.

As for choosing to use UTF-16 because of interfacing with other tools and libs that use it: that's certainly a good reason to use UTF-16. But it's about the only reason. And it's a big mistake to just assume that the overhead of converting to/from UTF-16 when crossing those API borders is always going to outweigh all other concerns. For instance, if you're writing an app that does a large amount of text-processing on relatively small amounts of text and only deals a little bit with a UTF-16 API, then the overhead of operating on 16 bits at a time can easily outweigh the overhead from the UTF-16 <-> UTF-32 conversions. Or maybe the app you're writing is more memory-limited than speed-limited. There are perfectly legitimate reasons to want to use an encoding other than the OS-native one. Why force those people to circumvent the type system to do it? Especially in a language that's intended to be usable as a systems language. Just to potentially save a couple megs on some .dll or .so?
 Secondly, the programmer *should* be able to use whatever
 type he decides is
 appropriate. If he wants to stick with native, he can do
Why? He/She can just use conversion to UTF-32 (dchar) whenever better 
understanding of character is needed. At least, that's what should be done 
anyway.<
Weren't you saying that the main point of just having one string type (the OS-native string) was to avoid unnecessary conversions? But now you're arguing that's it's fine to do unnecessary conversions and to have the multiple string types?
 You can have that easily:

 version(Windows)
 alias wstring tstring;
 else
 alias string tstring;

See that's my point. Nobody is going to do this unless the above is 
standardized by the language. Everybody will stick to something particular 
(either char or wchar).<
True enough. I don't have anything against having something like that in the std library as long as the others are still available too. Could be useful in a few cases. I do think having it *instead* of the three types is far too presumptuous, though.
Jun 07 2010
prev sibling next sibling parent reply Ruslan Nikolaev <nruslan_devel yahoo.com> writes:
--- On Tue, 6/8/10, Jesse Phillips <jessekphillips+D gmail.com> wrote:

 I think you really need to look more into what templates
 are and do.
 
Excuse me? Unless templates are something different in D (I can't be 100% sure since I am new to D), it should be the case. At least in C++, that would be the case. As I said, for libraries you need to compile every commonly used instance, so that the user will not be burdened with this overhead.

http://www.digitalmars.com/d/2.0/template.html
 There is also going to be very little performance gain by
 using the 
 "system type" for strings. Considering that most of the
 work is not 
 likely going be to the system commands you mentioned, but
 within D itself.
 
It depends. For instance, if you work with files, write to the console output, use system functions, the Win32 API, or DFL, there can be overhead.
Jun 07 2010
parent BCS <none anon.com> writes:
Hello Ruslan,

 --- On Tue, 6/8/10, Jesse Phillips <jessekphillips+D gmail.com> wrote:
 
 I think you really need to look more into what templates are and do.
 
As I said, for libraries you need to compile every commonly used instance, so that user will not be burdened with this overhead.
You only need to do that where you are shipping closed source and for that, it should be trivial to get the compiler to generate all three versions.
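A sketch of how a library could do that: referencing each instantiation once makes the compiler emit all three into the library's object file (count is a made-up example):

---
size_t count(Char)(const(Char)[] s) { return s.length; }

// Explicit instantiations emitted into the library:
alias count!char  countOfChar;
alias count!wchar countOfWchar;
alias count!dchar countOfDchar;
---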
 There is also going to be very little performance gain by
 using the
 "system type" for strings. Considering that most of the
 work is not
 likely going be to the system commands you mentioned, but
 within D itself.
It depends. For instance, if you work with files, write on the console output, use system functions, use Win32 api, DFL, there can be overhead.
You're right: it depends. In the few cases I can think of where more of the D code will be interacting with non-D code than just processing the text, you could almost use void[] as your type. Where would you care about the encoding but not do much with it?

Also, unless you have large amounts of text, you are going to have to work hard to get perf problems. If you do have large amounts of text, you are going to be I/O bound (cache misses etc.), and at that point the cost of any operation is its I/O. From that, reading in some data, doing a single pass of processing on it and writing it back out would only take 2/3 as long with translations on both sides.

-- 
... <IXOYE><
Jun 07 2010
prev sibling next sibling parent Ruslan Nikolaev <nruslan_devel yahoo.com> writes:
Yes, to clarify what I suggest, I can put it as follows (2 possibilities):

1. Have special standardized types "tchar" and "tstring". Then, system
libraries as well as users can use these types unless they want to do something
special. There can be a compiler switch to change the tchar width (essentially, to
alias tchar to char, wchar or dchar), so that for each platform it can be set
accordingly. In addition, tmain(tstring[] args) can be used as the entry point;
_topen, _treaddir, _tfopen, etc. can be added to the bindings.
Adv: doesn't break existing code.
Disadv: tchar and tstring may look weird to users.

2. Rename the current char to bchar or schar, or something similar. Then 'char' can
be used as the type described above.
Adv: users are likely to use this type.
Disadv: may break existing code, especially bindings.

I think having something (at least option 1) would be a nice feature and addition to
D. Although, I do admit that there can be different opinions about it. However,
TCHAR in Windows DOES work fine. In the case described above it's even better,
since we always work with Unicode (UTF-8/16/32), unlike Windows (which uses ANSI
for its 1-byte char), thus everything should be more or less transparent. It would
be cool to hear something from the D, phobos and tango developers.

P.S. For commonly used characters (e.g. '\n') the size of char will never make
any difference. The problems should not occur in good code, or should occur
really rarely (which can be adjusted for by the programmer).

Thanks,
Ruslan Nikolaev


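A minimal sketch of what option 1 might look like; tchar, tstring and tmain are hypothetical names from the proposal, not actual D or Phobos:

---
version (Windows)
{
    alias wchar   tchar;  // UTF-16 matches the Win32 wide APIs
    alias wstring tstring;
}
else
{
    alias char   tchar;   // UTF-8 matches POSIX conventions
    alias string tstring;
}

// Hypothetical entry point from the proposal:
int tmain(tstring[] args) { return 0; }
---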
Jun 07 2010
prev sibling next sibling parent reply Ruslan Nikolaev <nruslan_devel yahoo.com> writes:
 You only need to do that where you are shipping closed
 source and for that, it should be trivial to get the
 compiler to generate all three versions. 
You will also need to do it in open source projects if you want to include generated template code in a dynamic library as opposed to the user's program (read: unnecessary space "burden" where code is repeated over and over again across user programs). But, yes, closed source programs are a good particular example. True, you can compile all 3 versions. But the whole argument was about additional generated code, which someone claimed will not happen.
 
 Your, right: it depends. In the few cases I can think of
 where more of the D code will be interacting with non D code
 than just processing the text, you could almost use void[]
 as your type. Where would you care about the encoding but
 not do much worth it?
 
 Also unless you have large amounts of text, you are going
 to have to work hard to get perf problems. If you do have
 large amounts of text, you are going to be I/O bound (cache
 misses etc.) and at that point, the cost of any operation,
 is it's I/O. From that, Reading in some date, doing a single
 pass of processing on it and writing it back out would only
 take 2/3 long with translations on both side.
 
True. But even simple string handling is faster for UTF-16. The time required to read 2 bytes from a UTF-16 string is the same as reading 1 byte from UTF-8. Generally, we have to read one code unit after another (not more than this) since the data is guaranteed to be aligned on a 2-byte boundary for wchar and a 1-byte boundary for char. Not to mention that decoding a 2-code-unit sequence takes less time in UTF-16. And why not use this opportunity if the system already supports it natively?

In addition, I want to mention that reading/writing a file in text mode is very transparent. For instance, in Windows, the conversion from multibyte to Unicode will happen automatically for open, fopen, etc. when text mode is specified. In general, it is a good practice since 1-byte char text is not necessarily UTF-8 anyway and can be ANSI as well. Also, some other OSes use 2-byte UTF-16 natively, so it's not just Windows. If I am not wrong, Symbian should be one such example.
Jun 07 2010
parent "Nick Sabalausky" <a a.a> writes:
"Ruslan Nikolaev" <nruslan_devel yahoo.com> wrote in message 
news:mailman.127.1275974825.24349.digitalmars-d puremagic.com...
 True. But even simple string handling is faster for UTF-16. The time 
 required to read 2 bytes from UTF-16 string is the same 1 byte from UTF-8. 
 Generally, we have to read one code point after another (not more than 
 this) since data guaranteed to be aligned by 2 byte boundary for wchar and 
 1 byte for char. Not to mention that converting 2 code points takes less 
 time in UTF-16. And why not use this opportunity if system already 
 natively support this?
Why do you say that UTF-16 is faster than UTF-8?
In general, it is a good practice since 1 byte char text is not necessary 
UTF-8 anyway and can be ANSI as well.
That's what the BOM is for.
Jun 07 2010
prev sibling next sibling parent reply Ruslan Nikolaev <nruslan_devel yahoo.com> writes:
 
 Maybe "lousy" is too strong a word, but aside from
 compatibility with other 
 libs/software that use it (which I'll address separately),
 UTF-16 is not 
 particularly useful compared to UTF-8 and UTF-32:
...
 
I tried to avoid commenting on this because I am afraid we'll stray away from the main point (which is not a discussion about which Unicode encoding is better). But in short I would say: not quite right. UTF-16, as already mentioned, is generally faster for non-Latin letters (as reading 2 bytes of aligned data takes the same time as reading 1 byte). Although I am not familiar with Asian languages, I believe that UTF-16 requires just 2 bytes instead of 3 for most of their symbols. That is one of the reasons they don't like UTF-8. UTF-32 doesn't have any advantage except for being fixed length; it has a lot of unnecessary memory, cache, etc. overhead (the worst case scenario for both UTF-8 and UTF-16) which is not justified for any language.
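The byte counts in question can be checked with a short D2 snippet (.length counts code units, so the wstring lengths are doubled to get bytes):

---
import std.stdio;

void main()
{
    // Cyrillic U+0434: 2 bytes in UTF-8, 2 bytes in UTF-16.
    writeln("\u0434".length, " ", "\u0434"w.length * 2); // 2 2
    // CJK U+4E2D: 3 bytes in UTF-8 but only 2 in UTF-16.
    writeln("\u4E2D".length, " ", "\u4E2D"w.length * 2); // 3 2
}
---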
 
 First of all, it's not exactly unheard of for big projects
 to make a 
 sub-optimal decision.
I would say the decision was quite optimal for many reasons, including that "lousy programming" will not cause as many problems as it does in the case of UTF-8.
 
 Secondly, Java and Windows adapted 16-bit encodings back
 when many people 
 were still under the mistaken impression that would allow
 them to hold any 
 character in one code-unit. If that had been true, then it
I doubt that it was the only reason. UTF-8 was already available before Windows NT was released. It would have been much easier to use UTF-8 instead of ANSI as opposed to creating a parallel API. Nonetheless, UTF-16 was chosen. I doubt that conversion overhead (which is small compared to the VM overhead) was the main reason to preserve UTF-16.

Concerning why I say that it's good to have conversion to UTF-32 (you asked somewhere): I think you did not understand correctly what I meant. It is a very common practice, and in fact required, to convert from both UTF-8 and UTF-16 to UTF-32 when you need to do character analysis (e.g. mbtowc() in C). In fact, it is the only place where UTF-32 is commonly used and useful.
Jun 07 2010
next sibling parent dennis luehring <dl.soluz gmx.net> writes:
please use the "Reply" Button

On 08.06.2010 08:50, Ruslan Nikolaev wrote:
[snip]
Jun 08 2010
prev sibling parent reply "Nick Sabalausky" <a a.a> writes:
"Ruslan Nikolaev" <nruslan_devel yahoo.com> wrote in message 
news:mailman.128.1275979841.24349.digitalmars-d puremagic.com...
 Secondly, Java and Windows adapted 16-bit encodings back
 when many people
 were still under the mistaken impression that would allow
 them to hold any
 character in one code-unit. If that had been true, then it
I doubt that it was the only reason. UTF-8 was already available before Windows NT was released. It would be much easier to use UTF-8 instead of ANSI as opposed to creating parallel API. Nonetheless, UTF-16 has been chosen.
I didn't say that was the only reason. Also, you've misunderstood my point.

Their reasoning at the time:
    8-bit: Multiple code-units for some characters
    16-bit: One code-unit per character
    Therefore, use 16-bit.

Reality:
    8-bit: Multiple code-units for some characters
    16-bit: Multiple code-units for some characters
    Therefore, old reasoning not necessarily still applicable.
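A concrete check of those "Reality" rows (my sketch, not Nick's): one code point outside the BMP needs multiple code units in both encodings:

    unittest
    {
        assert("\U0001F600"c.length == 4);  // UTF-8: four code units
        assert("\U0001F600"w.length == 2);  // UTF-16: two code units
    }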

 length.
standardized on.
I doubt that conversion overhead (which is small compared to VM) was the 
main reason to preserve UTF-16.
I never said anything about conversion overhead being a reason to preserve UTF-16.
 Concerning why I say that it's good to have conversion to UTF-32 (you 
 asked somewhere):

 I think you did not understand correctly what I meant. This a very common 
 practice, and in fact - required, to convert from both UTF-8 and UTF-16 to 
 UTF-32 when you need to do character analysis (e.g. mbtowc() in C). In 
 fact, it is the only place where UTF-32 is commonly used and useful.
I'm well aware why UTF-32 is useful. Earlier, you had started out saying that there should only be one string type, the OS-native type. Now you're changing your tune and saying that we do need multiple types.
Jun 08 2010
parent reply "Nick Sabalausky" <a a.a> writes:
"Nick Sabalausky" <a a.a> wrote in message 
news:huktq1$8tr$1 digitalmars.com...
 "Ruslan Nikolaev" <nruslan_devel yahoo.com> wrote in message 
 news:mailman.128.1275979841.24349.digitalmars-d puremagic.com...

 length.
standardized on.
s/UTF-16/16-bit/ It's getting late and I'm starting to mix terminology...
Jun 08 2010
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 06/08/2010 03:12 AM, Nick Sabalausky wrote:
 "Nick Sabalausky"<a a.a>  wrote in message
 news:huktq1$8tr$1 digitalmars.com...
 "Ruslan Nikolaev"<nruslan_devel yahoo.com>  wrote in message
 news:mailman.128.1275979841.24349.digitalmars-d puremagic.com...

 length.
standardized on.
s/UTF-16/16-bit/ It's getting late and I'm starting to mix terminology...
s/16-bit/UCS-2/

The story is that Windows standardized on UCS-2, which is the uniform 16-bit-per-character encoding that predates UTF-16. When UCS-2 turned out to be insufficient, it was extended to the variable-length UTF-16. As has been discussed, that has been quite unpleasant because a lot of code out there handles strings as if they were UCS-2.

Andrei
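To see that breakage concretely, a minimal sketch (mine, not Andrei's): code that equates .length with character count, as UCS-2 allowed, miscounts any non-BMP code point:

    unittest
    {
        wstring s = "\U0001F600"w;                 // one code point, non-BMP
        assert(s.length == 2);                     // but two UTF-16 code units
        assert(s[0] >= 0xD800 && s[0] <= 0xDBFF);  // the high surrogate
    }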
Jun 08 2010
parent "Nick Sabalausky" <a a.a> writes:
"Andrei Alexandrescu" <SeeWebsiteForEmail erdani.org> wrote in message 
news:hul65q$o98$1 digitalmars.com...
 On 06/08/2010 03:12 AM, Nick Sabalausky wrote:
 "Nick Sabalausky"<a a.a>  wrote in message
 news:huktq1$8tr$1 digitalmars.com...
 "Ruslan Nikolaev"<nruslan_devel yahoo.com>  wrote in message
 news:mailman.128.1275979841.24349.digitalmars-d puremagic.com...

 length.
already standardized on.
s/UTF-16/16-bit/ It's getting late and I'm starting to mix terminology...
s/16-bit/UCS-2/ The story is that Windows standardized on UCS-2, which is the uniform 16-bit-per-character encoding that predates UTF-16. When UCS-2 turned out to be insufficient, it was extended to the variable-length UTF-16. As has been discussed, that has been quite unpleasant because a lot of code out there handles strings as if they were UCS-2.
Ok, that's what I had thought, but then I started second-guessing, so I figured "s/UTF-16/16-bit/" was a safer claim than "s/UTF-16/UCS-2/".
Jun 08 2010
prev sibling next sibling parent reply Ruslan Nikolaev <nruslan_devel yahoo.com> writes:
 
 I'm well aware why UTF-32 is useful. Earlier, you had
 started out saying 
 that there should only be one string type, the OS-native
 type. Now you're 
 changing your tune and saying that we do need multiple
 types.
 
No. From the very beginning I said "it would also be nice to have some builtin function for conversion to dchar". That means it would be nice to have a function that converts from tchar (regardless of its width) to UTF-32. The reason was always clear: you normally don't need UTF-32 chars/strings, but for some character analysis you might need them.
Jun 08 2010
next sibling parent Michel Fortin <michel.fortin michelf.com> writes:
On 2010-06-08 04:15:50 -0400, Ruslan Nikolaev <nruslan_devel yahoo.com> said:

 No. From the very beginning I said "it would also be nice to have some 
 builtin function for conversion to dchar". That means it would be nice 
 to have function that converts from tchar (regardless of its width) to 
 UTF-32. The reason was always clear - you normally don't need UTF-32 
 chars/strings but for some character analysis you might need them.
Is this what you want?

    version (utf16)
        alias wchar tchar;
    else
        alias char tchar;

    alias immutable(tchar)[] tstring;

    import std.utf;

    unittest {
        tstring tstr = "hello";
        dstring dstr = toUTF32(tstr);
    }

-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/
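A usage note (my addition, not Michel's): utf16 above is a custom version identifier, so it has to be enabled per build with dmd's -version switch:

    dmd -version=utf16 app.d

compiles the wchar branch; a plain "dmd app.d" gives the char branch.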
Jun 08 2010
prev sibling parent Walter Bright <newshound1 digitalmars.com> writes:
Ruslan Nikolaev wrote:
 No. From the very beginning I said "it would also be nice to have some
 builtin function for conversion to dchar". That means it would be nice to
 have function that converts from tchar (regardless of its width) to UTF-32.
 The reason was always clear - you normally don't need UTF-32 chars/strings
 but for some character analysis you might need them.
http://www.digitalmars.com/d/2.0/phobos/std_utf.html

Function overloading takes care of selecting the right version.
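To illustrate Walter's point, a minimal sketch using std.utf (the c and w literal suffixes pick the source encoding):

    import std.utf;

    unittest
    {
        auto a = toUTF32("hello"c);  // selects the char[] overload
        auto b = toUTF32("hello"w);  // selects the wchar[] overload
        assert(a == b);
    }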
Jun 08 2010
prev sibling next sibling parent reply Ruslan Nikolaev <nruslan_devel yahoo.com> writes:
 Is this what you want?

     version (utf16)
         alias wchar tchar;
     else
         alias char tchar;

     alias immutable(tchar)[] tstring;

     import std.utf;

     unittest {
         tstring tstr = "hello";
         dstring dstr = toUTF32(tstr);
     }

Yes, I think something like this, but standardized by the language. It would also be nice to have, for interoperability (as I mentioned in the beginning), toUTF16, toUTF8, fromUTF16, fromUTF8 and fromUTF32, since tchar can be anything. If it's UTF-16 and you do toUTF16, no actual conversion is done - the input string is used as-is. Something like this.

The other point of argument is whether to use this kind of type as the main character type. My point was that having this kind of type used in dynamic libraries would be nice, since you don't need to provide instances for every other character type, and at the same time you use the native character encoding available on the system. Of course, that does not mean you should be deprived of the other types. If you need a specific type to do something specific, you can always use it.
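A sketch of that "no conversion when already native" behaviour (my illustration; toNative is a hypothetical name, and it assumes std.utf's toUTF8/toUTF16 return immutable strings, as they do in current Phobos):

    import std.utf;

    version (utf16) alias wchar tchar;
    else            alias char  tchar;
    alias immutable(tchar)[] tstring;

    // Convert any string to the native width; if the input already
    // has the native type, hand it back unchanged instead of copying.
    tstring toNative(S)(S s)
    {
        static if (is(S : tstring))
            return s;                       // already native: no conversion
        else static if (is(tchar == wchar))
            return toUTF16(s);
        else
            return toUTF8(s);
    }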
Jun 08 2010
parent Michel Fortin <michel.fortin michelf.com> writes:
On 2010-06-08 09:22:02 -0400, Ruslan Nikolaev <nruslan_devel yahoo.com> said:

 you don't need to provide instances for every other character type, and 
 at the same time - use native character encoding available on system.
My opinion is that thinking this will work is a fallacy. Here's why...

Generally Linux systems use UTF-8, so I guess the "system encoding" there will be UTF-8. But then if you start to use QT you have to use UTF-16, but you might have to intermix UTF-8 to work with other libraries in the backend (libraries which are not necessarily D libraries, nor system libraries). So you may have a UTF-8 backend (such as the MySQL library), UTF-8 "system encoding" glue code, and UTF-16 GUI code (QT). That might be a good or a bad choice, depending on various factors, such as whether the glue code sends more strings to the backend or the GUI.

Now try to port the thing to Windows, where you define the "system encoding" as UTF-16. Now you still have the same UTF-8 backend, and the same UTF-16 GUI code, but for some reason you're changing the glue code in the middle to UTF-16? Sure, it can be made to work, but all the string conversions will start to happen elsewhere, which may change the performance characteristics and add some potential for bugs, and all this for no real reason.

The problem is that what you call "system encoding" is only the encoding used by the system frameworks. It is relevant when working with the system frameworks, but when you're working with any other API, you'll probably want to use the same character type as that API does, not necessarily the "system encoding". Not all programs are based on extensive use of the system frameworks. In some situations you'll want to use UTF-16 on Linux, or UTF-8 on Windows, because you're dealing with libraries that expect that (QT, MySQL).

A compiler switch is a poor choice there, because you can't mix libraries compiled with different compiler switches when that switch changes the default character type.

In most cases, it's much better in my opinion if the programmer just uses the same character type as one of the libraries he uses, sticks to that, and is aware of what he's doing. If someone really wants to deal with the complexity of supporting both character types depending on the environment it runs on, it's easy to create "tchar" and "tstring" aliases that depend on whether it's Windows or Linux, or on a custom version flag from a compiler switch, but that'll be his choice and his responsibility to make everything work. But I think in this case a better option might be to abstract all those 'strings' under a single type that works with all UTF encodings (something like [mtext]).

[mtext]: http://www.dprogramming.com/mtext.php

-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/
Jun 08 2010
prev sibling next sibling parent reply Ruslan Nikolaev <nruslan_devel yahoo.com> writes:
 
 Generally Linux systems use UTF-8 so I guess the "system
 encoding" there will be UTF-8. But then if you start to use
 QT you have to use UTF-16, but you might have to intermix
 UTF-8 to work with other libraries in the backend (libraries
 which are not necessarily D libraries, nor system
 libraries). So you may have a UTF-8 backend (such as the
 MySQL library), UTF-8 "system encoding" glue code, and
 UTF-16 GUI code (QT). That might be a good or a bad choice,
 depending on various factors, such as whether the glue code
 send more strings to the backend or the GUI.
 
 Now try to port the thing to Windows where you define the
 "system encoding" as UTF-16. Now you still have the same
 UTF-8 backend, and the same UTF-16 GUI code, but for some
 reason you're changing the glue code in the middle to
 UTF-16? Sure, it can be made to work, but all the string
 conversions will start to happen elsewhere, which may change
 the performance characteristics and add some potential for
 bugs, and all this for no real reason.
 
 The problem is that what you call "system encoding" is only
 the encoding used by the system frameworks. It is relevant
 when working with the system frameworks, but when you're
 working with any other API, you'll probably want to use the
 same character type as that API does, not necessarily the
 "system encoding". Not all programs are based on extensive
 use of the system frameworks. In some situations you'll want
 to use UTF-16 on Linux, or UTF-8 on Windows, because you're
 dealing with libraries that expect that (QT, MySQL).
 
Agreed - the system encoding is not always that clear-cut. Yet UTF-8 is usually the norm on Linux (consider also Gtk, wxWidgets, system calls, etc.), while UTF-16 is more common on Windows (consider win32api, DFL, system calls, etc.). Some programs written in C even define their own 'tchar' so that they can be compiled differently depending on the platform.
 A compiler switch is a poor choice there, because you can't
 mix libraries compiled with a different compiler switches
 when that switch changes the default character type.
The compiler switch is only necessary for the system programmer. For instance, gcc has '-fshort-wchar', which changes the width of wchar_t to 16 bits. It DOES break code, because libraries are normally compiled for a 32-bit wchar_t. Again, it's generally not for the application programmer.
 
 In most cases, it's much better in my opinion if the
 programmer just uses the same character type as one of the
 libraries it uses, stick to that, and is aware of what he's
 doing. If someone really want to deal with the complexity of
Generally, the programmer should not need to know what encoding he works with. For both UTF-8 and UTF-16, it's easy to determine the number of bytes (or words) in a multi-unit sequence just by looking at the first code unit; this could also be a built-in function (e.g. numberOfChars(tchar firstChar); see the sketch at the end of this message). The size of each element is easily determined by sizeof, and conversion to UTF-32 and back can be done very transparently. The only problems it might cause are bindings with other libraries (but there you can just use fromUTFxx and toUTFxx; you do this conversion anyway) and transferring data over the network - again, you can just stick to one particular encoding (for network and files, UTF-8 is better since it's byte-order free).
 supporting both character types depending on the environment
 it runs on, it's easy to create a "tchar" and "tstring"
 alias that depends on whether it's Windows or Linux, or on a
 custom version flag from a compiler switch, but that'll be
 his choice and his responsibility to make everything work.
If it's the programmer's choice, then almost all advantages of tchar are lost. It's like the garbage collector: if it is used by everybody, you can expect the advantages of using it; however, if it's optional, everybody will write libraries assuming no GC is available, and thus almost all of its advantages are lost. And after all, one of the goals of D (if I am not wrong) is to be flexible, so that performance gains are available for particular configurations when they can be achieved (it's a fully compiled language). It does not stick to something particular and say 'you must use UTF-8' or 'you must use UTF-16'.
 michel.fortin michelf.com
 http://michelf.com/
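The sketch referenced above (my illustration; numberOfChars follows Ruslan's hypothetical name, and Phobos's std.utf.stride provides the same information for real strings):

    // Length in code units of the sequence introduced by the first
    // code unit: a UTF-8 lead byte encodes it in its high bits; a
    // UTF-16 high surrogate signals a two-unit pair.
    size_t numberOfChars(char first)
    {
        if (first < 0x80)           return 1;   // ASCII
        if ((first & 0xE0) == 0xC0) return 2;
        if ((first & 0xF0) == 0xE0) return 3;
        if ((first & 0xF8) == 0xF0) return 4;
        assert(0, "not a lead code unit");
    }

    size_t numberOfChars(wchar first)
    {
        return (first >= 0xD800 && first <= 0xDBFF) ? 2 : 1;
    }

    unittest
    {
        assert(numberOfChars('a') == 1);
        assert(numberOfChars("я"c[0]) == 2);  // Cyrillic: two UTF-8 units
    }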
 
 
Jun 08 2010
next sibling parent reply dennis luehring <dl.soluz gmx.net> writes:
please stop top-posting - just click on the post you want to reply and 
click then reply - you're flooding the newsgroup root with replies ...

Am 08.06.2010 17:11, schrieb Ruslan Nikolaev:
...
Jun 08 2010
parent "Nick Sabalausky" <a a.a> writes:
"dennis luehring" <dl.soluz gmx.net> wrote in message 
news:hulqni$1ssj$1 digitalmars.com...
 please stop top-posting - just click on the post you want to reply and 
 click then reply - you're flooding the newsgroup root with replies ...

 Am 08.06.2010 17:11, schrieb Ruslan Nikolaev:
  Generally Linux systems use UTF-8 so I guess the "system
  encoding" there will be UTF-8. But then if you start to use
Speaking of top-posting... ;)
Jun 08 2010
prev sibling parent "Yao G." <nospamyao gmail.com> writes:
Every time you reply to somebody, a new message is created. It's kinda 
difficult to follow this discussion when you need to look through more than 
15 separate messages about the same issue. Please check your news client or 
something.

Yao G.

On Tue, 08 Jun 2010 10:11:34 -0500, Ruslan Nikolaev  
<nruslan_devel yahoo.com> wrote:

...
-- Using Opera's revolutionary e-mail client: http://www.opera.com/mail/
Jun 08 2010
prev sibling next sibling parent Ruslan Nikolaev <nruslan_devel yahoo.com> writes:
 Every time you reply to somebody, a
 new message is created. Is kinda difficult to follow this
 discussion when you need to look more than 15 separated
 messages about the same issue. Please check your news client
 or something.
 
 Yao G.
 
Sorry for that, I did not know there was a problem. It looks like there is some issue with the web-based mail I am using, even though I do click "Reply". I need to check and fix it.

Just a last note regarding the topic: I have already explained all my points, and others have good points too. There can be good and bad reasons for tchar. It primarily depends on whether:

1. you view D as a language with a single framework that behaves absolutely the same way on all platforms, or
2. you allow some deviation from that common view for the sake of better interoperability with system libraries.

In addition, tchar would simply be added to the 3 already existing types; I doubt that it would hurt. If library developers prefer to work with the native encoding, they can use it. Otherwise, they can provide templates that can be used for any of those 4 types. Finally, if someone wants to use something particular, s/he can use it.

It would be nice to hear something from Walter. If he says "no, in no way do we need this", I am fine with that. The final decision, as you know, is made by the developer of the language. Thanks.
Jun 08 2010
prev sibling next sibling parent Ruslan Nikolaev <nruslan_devel yahoo.com> writes:
Yes, I know function overloading takes care of it. But my whole point was
totally different: 'tchar' has nothing to do with overloading, and the
rationale is entirely different - to provide a type that depends on the
target platform's preferences.

Ruslan.

--- On Tue, 6/8/10, Walter Bright <newshound1 digitalmars.com> wrote:

 From: Walter Bright <newshound1 digitalmars.com>
 Subject: Re: Wide characters support in D
 To: digitalmars-d puremagic.com
 Date: Tuesday, June 8, 2010, 8:36 PM
 Ruslan Nikolaev wrote:
 No. From the very beginning I said "it would also be
nice to have some
 builtin function for conversion to dchar". That means
it would be nice to
 have function that converts from tchar (regardless of
its width) to UTF-32.
 The reason was always clear - you normally don't need
UTF-32 chars/strings
 but for some character analysis you might need them.
http://www.digitalmars.com/d/2.0/phobos/std_utf.html Function overloading takes care of selecting the right version.
Jun 08 2010
prev sibling next sibling parent reply Ruslan Nikolaev <nruslan_devel yahoo.com> writes:
Yeah... Exactly. I just verified our posts via the web interface. Why did he blame me for top posting (at least, that can be inferred from whom my message was addressed to)? I am simply replying to already existing messages.

Ruslan.

--- On Tue, 6/8/10, Nick Sabalausky <a a.a> wrote:

 From: Nick Sabalausky <a a.a>
 Subject: Re: Wide characters support in D
 To: digitalmars-d puremagic.com
 Date: Tuesday, June 8, 2010, 9:50 PM
 ...
 Speaking of top-posting... ;)
Jun 08 2010
next sibling parent dennis luehring <dl.soluz gmx.net> writes:
Am 08.06.2010 19:55, schrieb Ruslan Nikolaev:
 Yeah... Exactly. I just verified our posts via web interface. Why did he blame
me for top posting (at least it can be inferred from that my message has been
addressed to)? I am simply replying to already existing messages.
sorry, but there are several others using the web interface and you're the only power-top-poster around - maybe you should switch over to Thunderbird or something
 --- On Tue, 6/8/10, Nick Sabalausky<a a.a>  wrote:

  From: Nick Sabalausky<a a.a>
  Subject: Re: Wide characters support in D
  To: digitalmars-d puremagic.com
  Date: Tuesday, June 8, 2010, 9:50 PM
  "dennis luehring"<dl.soluz gmx.net>
  wrote in message
  news:hulqni$1ssj$1 digitalmars.com...
  >  please stop top-posting - just click on the post you
  want to reply and
  >  click then reply - you're flooding the newsgroup root
  with replies ...
  >
  >  Am 08.06.2010 17:11, schrieb Ruslan Nikolaev:
  >>>
  >>>   Generally Linux systems use UTF-8 so I
  guess the "system
  >>>   encoding" there will be UTF-8. But then
  if you start to use

  Speaking of top-posting... ;)
Jun 08 2010
prev sibling parent "Nick Sabalausky" <a a.a> writes:
"Ruslan Nikolaev" <nruslan_devel yahoo.com> wrote in message 
news:mailman.134.1276019725.24349.digitalmars-d puremagic.com...
Yeah... Exactly. I just verified our posts via web interface. Why did he 
blame me for top posting (at least it can be inferred from that my message 
has been addressed to)? I am simply replying to already existing messages.<
Sorry, I think I created some confusion: What I think dennis was talking about (or am I mistaken?) was how all of your replies are being shown in tree-view as replying directly to the original post instead of being shown as a reply to the message that it *really* replies to. That makes the discussion hard to follow. Then I came in and made a smart-ass comment about how he wrote his message above the quoted text instead of below the quoted text (usually we follow the convention here of writing below the quoted text). So, two totally different things.
Jun 08 2010
prev sibling next sibling parent reply Ruslan Nikolaev <nruslan_devel yahoo.com> writes:
No. New messages are definitely not created by me. You can verify it here:
http://blog.gmane.org/gmane.comp.lang.d.general

You can easily see that in none of the top posts (except for the first one) my name appears first. In fact, you have just created another top post. I am only replying to others' comments.

Ruslan.

--- On Tue, 6/8/10, dennis luehring <dl.soluz gmx.net> wrote:

 From: dennis luehring <dl.soluz gmx.net>
 Subject: Re: Wide characters support in D
 To: digitalmars-d puremagic.com
 Date: Tuesday, June 8, 2010, 10:11 PM
 Am 08.06.2010 19:55, schrieb Ruslan Nikolaev:
 ...
 sorry but - there are several others using the web interface and you're
 the only power-top-poster around - maybe you should switch over to
 thunderbird or something
 ...
Jun 08 2010
next sibling parent reply dennis luehring <dl.soluz gmx.net> writes:
Am 08.06.2010 20:20, schrieb Ruslan Nikolaev:
 No. New messages are definitely not created by me. You can verify it here:
 http://blog.gmane.org/gmane.comp.lang.d.general

 You can easily see that in none of the top posts (except for the first one) my
name appears first. In fact, you have just created another top post. I am only
replying to other's comments.
but my newsreaders (Thunderbird and Mail Live) tell a different story:

http://de.tinypic.com/r/mbjndc/6 (click on the image)

as you can see - no top-poster can beat you

you should give Thunderbird a try - it's very good with newsgroups
Jun 08 2010
parent reply "Nick Sabalausky" <a a.a> writes:
"dennis luehring" <dl.soluz gmx.net> wrote in message 
news:hum3fc$2dp6$1 digitalmars.com...
 Am 08.06.2010 20:20, schrieb Ruslan Nikolaev:
 No. New messages are definitely not created by me. You can verify it 
 here:
 http://blog.gmane.org/gmane.comp.lang.d.general

 You can easily see that in none of the top posts (except for the first 
 one) my name appears first. In fact, you have just created another top 
 post. I am only replying to other's comments.
but my newsreader (thunderbird and mail live) telling an different story http://de.tinypic.com/r/mbjndc/6 (click on the image) as you can see - no top-poster can beat you i should give thunderbird a try - very good and nice with newsgroups
That link didn't show the image for me, but this one does: 
http://i50.tinypic.com/mbjndc.jpg

I get the same results as dennis in Outlook Express. Also, that link from Ruslan seems to display in a blog style, which is a really bizarre way to view a newsgroup.
Jun 08 2010
parent "Nick Sabalausky" <a a.a> writes:
"Nick Sabalausky" <a a.a> wrote in message 
news:hum6c8$2j0s$1 digitalmars.com...
 "dennis luehring" <dl.soluz gmx.net> wrote in message 
 news:hum3fc$2dp6$1 digitalmars.com...
 Am 08.06.2010 20:20, schrieb Ruslan Nikolaev:
 No. New messages are definitely not created by me. You can verify it 
 here:
 http://blog.gmane.org/gmane.comp.lang.d.general

 You can easily see that in none of the top posts (except for the first 
 one) my name appears first. In fact, you have just created another top 
 post. I am only replying to other's comments.
but my newsreader (thunderbird and mail live) telling an different story http://de.tinypic.com/r/mbjndc/6 (click on the image) as you can see - no top-poster can beat you i should give thunderbird a try - very good and nice with newsgroups
That link didn't show the image for me, but this one does: http://i50.tinypic.com/mbjndc.jpg I get the same results as dennis in Outlook Express.
Well, more-or-less. This is what I'm getting: 
http://www.semitwist.com/download/wideCharNG.png

A standard newsgroup tree-view, except all of Ruslan's posts are immediate children of the original post. Everyone else's posts show up as proper replies. Yea, I am using Outlook Express, but I've never seen anyone else on this NG for whom every one of their posts is always either first or second level.
 Also, that link from Ruslan seems to display in a blog-style, which is a 
 really bizarre way to view a newsgroup.
 
Jun 08 2010
prev sibling parent reply Pelle <pelle.mansson gmail.com> writes:
On 06/08/2010 08:20 PM, Ruslan Nikolaev wrote:
 No. New messages are definitely not created by me. You can verify it here:
 http://blog.gmane.org/gmane.comp.lang.d.general

 You can easily see that in none of the top posts (except for the first one) my
name appears first. In fact, you have just created another top post. I am only
replying to other's comments.

 Ruslan.
Speaking as someone with a tiny bit of knowledge about NNTP: you are sending your messages without References headers, which is why they all show up as top-level posts. Please fix it - your thread is all over the place.
Jun 08 2010
parent reply "Simen kjaeraas" <simen.kjaras gmail.com> writes:
Pelle <pelle.mansson gmail.com> wrote:

 On 06/08/2010 08:20 PM, Ruslan Nikolaev wrote:
 No. New messages are definitely not created by me. You can verify it  
 here:
 http://blog.gmane.org/gmane.comp.lang.d.general

 You can easily see that in none of the top posts (except for the first  
 one) my name appears first. In fact, you have just created another top  
 post. I am only replying to other's comments.

 Ruslan.
Speaking as someone with a tiny bit of knowledge about nntp, you are sending your messages without references; that would be top posting. Please, fix. Your thread is all over the place.
Weird. I'm getting all his messages in their right places. Using Opera's built-in newsreader. -- Simen
Jun 09 2010
parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Wed, 09 Jun 2010 07:22:17 -0400, Simen kjaeraas  
<simen.kjaras gmail.com> wrote:

 Pelle <pelle.mansson gmail.com> wrote:

 On 06/08/2010 08:20 PM, Ruslan Nikolaev wrote:
 No. New messages are definitely not created by me. You can verify it  
 here:
 http://blog.gmane.org/gmane.comp.lang.d.general

 You can easily see that in none of the top posts (except for the first  
 one) my name appears first. In fact, you have just created another top  
 post. I am only replying to other's comments.

 Ruslan.
Speaking as someone with a tiny bit of knowledge about nntp, you are sending your messages without references; that would be top posting. Please, fix. Your thread is all over the place.
Weird. I'm getting all his messages in their right places. Using Opera's built-in newsreader.
OK, you know what's really *really* weird? I was getting all his replies as root-level messages, with Opera's news reader. Today, I started up opera, and all his posts are now magically threaded properly. WTF!??? Maybe it was a server thing? An Opera thing? No clue, but all seems well now... -Steve
Jun 09 2010
prev sibling parent "Jer" <jersey chicago.com> writes:
Ruslan Nikolaev wrote:
 Note: I posted this already on runtime D list, but I think that list
 was a wrong one for this question. Sorry for duplication :-)

 Hi. I am new to D. It looks like D supports 3 types of characters:
 char, wchar, dchar. This is cool,
It's wrong, actually.
Jun 10 2010