
digitalmars.D - UTF-8 Everywhere

reply Walter Bright <newshound2 digitalmars.com> writes:
http://utf8everywhere.org/

It has a good explanation of the issues and problems, and how these things came 
to be.

This is pretty much in line with my current (!) opinion on Unicode. What it 
means for us is I don't think it is that important anymore for algorithms to 
support strings of UTF-16 or UCS-4.
Jun 19 2016
next sibling parent "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Sun, Jun 19, 2016 at 05:49:40PM -0700, Walter Bright via Digitalmars-d wrote:
 http://utf8everywhere.org/
 
 It has a good explanation of the issues and problems, and how these
 things came to be.
 
 This is pretty much in line with my current (!) opinion on Unicode.
 What it means for us is I don't think it is that important anymore for
 algorithms to support strings of UTF-16 or UCS-4.
And it also confirms that autodecoding to dchar was a wrong design
decision (cf. last paragraph under section 1). We should follow
Walter's proposal to make as much of Phobos as possible independent of
autodecoding, and hopefully at some point in the future deprecate and
remove autodecoding altogether.


T

-- 
Life is unfair. Ask too much from it, and it may decide you don't
deserve what you have now either.
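For readers unfamiliar with the issue: a minimal sketch of what autodecoding means in practice. Most range algorithms see a D string as a range of decoded dchar code points, while std.utf.byCodeUnit suppresses the decoding and exposes the raw UTF-8 code units:

```d
// Sketch of autodecoding: count() over a string decodes to dchar,
// while byCodeUnit iterates the raw UTF-8 code units instead.
import std.algorithm.searching : count;
import std.utf : byCodeUnit;

void main()
{
    string s = "héllo";              // 5 code points, 6 UTF-8 bytes
    assert(s.count == 5);            // autodecoded: counts dchars
    assert(s.byCodeUnit.count == 6); // undecoded: counts code units
    assert(s.length == 6);           // .length is always code units
}
```

The mismatch between `.length` (code units) and what algorithms iterate (code points) is exactly the inconsistency the autodecoding critics point at.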
Jun 19 2016
prev sibling parent reply Charles Hixson via Digitalmars-d <digitalmars-d puremagic.com> writes:
To me it seems that a lot of the time, processing is more efficient with 
UCS-4 (what I call UTF-32).  Storage is clearly more efficient with 
UTF-8, but access is more direct with UCS-4.  I agree that UTF-8 is 
generally to be preferred where it can be used efficiently, but that's 
not everywhere.  The problem is efficient bi-directional 
conversion...which D appears to handle fairly well already with text() 
and dtext().  (I don't see any utility for UTF-16.  To me that seems 
like a first attempt that should have been deprecated.)
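The conversion the post refers to can be sketched like this, using std.conv.text and dtext as named above:

```d
import std.conv : text, dtext;

void main()
{
    string  u8  = "naïve";    // UTF-8: 6 code units, 5 code points
    dstring u32 = dtext(u8);  // one decode pass: UTF-8 -> UTF-32
    assert(u32.length == 5);  // one dchar per code point
    assert(u32[2] == 'ï');    // O(1) indexing by code point
    assert(text(u32) == u8);  // and back: UTF-32 -> UTF-8
}
```

Note that the O(1) indexing UCS-4 buys is by code point, not by user-perceived character (grapheme), so it is still not "direct access" for all purposes.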

On 06/19/2016 05:49 PM, Walter Bright via Digitalmars-d wrote:
 http://utf8everywhere.org/

 It has a good explanation of the issues and problems, and how these 
 things came to be.

 This is pretty much in line with my current (!) opinion on Unicode. 
 What it means for us is I don't think it is that important anymore for 
 algorithms to support strings of UTF-16 or UCS-4.
Jun 19 2016
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 6/19/2016 11:36 PM, Charles Hixson via Digitalmars-d wrote:
 To me it seems that a lot of the time processing is more efficient with UCS-4
 (what I call utf-32).  Storage is clearly more efficient with utf-8, but access
 is more direct with UCS-4.  I agree that utf-8 is generally to be preferred
 where it can be efficiently used, but that's not everywhere.  The problem is
 efficient bi-directional conversion...which D appears to handle fairly well
 already with text() and dtext().  (I don't see any utility for utf-16.  To me
 that seems like a first attempt that should have been deprecated.)
That seemed to me to be true, too, until I wrote a text processing program using UCS-4. It was rather slow. Turns out, 4x memory consumption has a huge performance cost.
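The 4x figure is easy to see directly for ASCII-heavy text; a tiny illustration (the cache-pressure explanation is my gloss, not Walter's wording):

```d
void main()
{
    string  s8  = "The quick brown fox";   // plain ASCII text
    dstring s32 = "The quick brown fox"d;

    // Same 19 characters: 19 bytes as UTF-8, 76 bytes as UTF-32.
    // Four times the bytes stream through the cache hierarchy,
    // which is where the performance cost shows up.
    assert(s8.length  * char.sizeof  == 19);
    assert(s32.length * dchar.sizeof == 76);
}
```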
Jun 19 2016
parent Charles Hixson via Digitalmars-d <digitalmars-d puremagic.com> writes:
On 06/19/2016 11:44 PM, Walter Bright via Digitalmars-d wrote:
 On 6/19/2016 11:36 PM, Charles Hixson via Digitalmars-d wrote:
 To me it seems that a lot of the time processing is more efficient
 with UCS-4 (what I call utf-32).  Storage is clearly more efficient
 with utf-8, but access is more direct with UCS-4.  I agree that utf-8
 is generally to be preferred where it can be efficiently used, but
 that's not everywhere.  The problem is efficient bi-directional
 conversion...which D appears to handle fairly well already with
 text() and dtext().  (I don't see any utility for utf-16.  To me that
 seems like a first attempt that should have been deprecated.)
That seemed to me to be true, too, until I wrote a text processing program using UCS-4. It was rather slow. Turns out, 4x memory consumption has a huge performance cost.
The approach I took (which worked well for my purposes) was to process 
the text a line at a time, and for that the overhead of memory was 
trivial. ... If I'd needed to go back and forth this wouldn't have been 
desirable, but there was one dtext conversion, processing, and then 
several text conversions (of small portions), and it was quite 
efficient. Clearly this can't be the approach taken in all 
circumstances, but for this purpose it was significantly more efficient 
than any other approach I've tried.

It's also true that most of the text I handled was actually ASCII, 
which would have made the most common conversion processes simpler.

To me it appears that both cases need to be handled. The problem is 
documenting the tradeoffs in efficiency. D seems to already work quite 
well with arrays of dchars, so there may well not be any need for 
development in that area. Direct indexing of utf-8 arrays, however, is 
a much more complicated thing, which I doubt can ever be as efficient.

Memory allocation, however, is a separate, though not independent, 
complexity. If you can work in small chunks then it becomes less 
important.
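The line-at-a-time shape described above can be sketched as follows; the flip transform is a hypothetical stand-in for whatever per-line processing the original program did:

```d
import std.algorithm.iteration : map;
import std.conv : dtext, text;
import std.string : lineSplitter;

// Hypothetical per-line transform; stands in for real processing.
dchar flip(dchar c) { return c == 'a' ? 'A' : c; }

void main()
{
    string input = "one line\nand another";
    string[] results;
    foreach (line; input.lineSplitter)
    {
        dstring wide = dtext(line);     // small per-line UTF-32 buffer
        // ... wide[i] can be indexed directly by code point here ...
        results ~= text(wide.map!flip); // back to UTF-8 for output
    }
    assert(results == ["one line", "And Another"]);
}
```

Because each UTF-32 buffer lives only for one line, the 4x expansion never accumulates, which is why the memory overhead stays trivial.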
Jun 20 2016