
digitalmars.D - UTF-8 Everywhere

reply Walter Bright <newshound2 digitalmars.com> writes:
http://utf8everywhere.org/

It has a good explanation of the issues and problems, and how these things came 
to be.

This is pretty much in line with my current (!) opinion on Unicode. What it 
means for us is I don't think it is that important anymore for algorithms to 
support strings of UTF-16 or UCS-4.
Jun 19 2016
next sibling parent "H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:
On Sun, Jun 19, 2016 at 05:49:40PM -0700, Walter Bright via Digitalmars-d wrote:
 http://utf8everywhere.org/
 
 It has a good explanation of the issues and problems, and how these
 things came to be.
 
 This is pretty much in line with my current (!) opinion on Unicode.
 What it means for us is I don't think it is that important anymore for
 algorithms to support strings of UTF-16 or UCS-4.
And it also confirms that autodecoding to dchar was a wrong design
decision (cf. last paragraph under section 1). We should follow
Walter's proposal to make as much of Phobos as possible independent of
autodecoding, and hopefully at some point in the future deprecate and
remove autodecoding altogether.


T

-- 
Life is unfair. Ask too much from it, and it may decide you don't
deserve what you have now either.
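For readers unfamiliar with the issue: a minimal sketch of what autodecoding means in practice. Most range algorithms see a D string as a range of decoded dchar code points, while std.utf.byCodeUnit suppresses the decoding and exposes the raw UTF-8 code units:

```d
// Sketch of autodecoding: count() over a string decodes to dchar,
// while byCodeUnit iterates the raw UTF-8 code units instead.
import std.algorithm.searching : count;
import std.utf : byCodeUnit;

void main()
{
    string s = "héllo";              // 5 code points, 6 UTF-8 bytes
    assert(s.count == 5);            // autodecoded: counts dchars
    assert(s.byCodeUnit.count == 6); // undecoded: counts code units
    assert(s.length == 6);           // .length is always code units
}
```

The mismatch between `.length` (code units) and what algorithms iterate (code points) is exactly the inconsistency the autodecoding critics point at.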
Jun 19 2016
prev sibling parent reply Charles Hixson via Digitalmars-d <digitalmars-d puremagic.com> writes:
To me it seems that a lot of the time, processing is more efficient with 
UCS-4 (what I call UTF-32).  Storage is clearly more efficient with 
UTF-8, but access is more direct with UCS-4.  I agree that UTF-8 is 
generally to be preferred where it can be used efficiently, but that's 
not everywhere.  The problem is efficient bi-directional 
conversion...which D appears to handle fairly well already with text() 
and dtext().  (I don't see any utility for UTF-16.  To me that seems 
like a first attempt that should have been deprecated.)
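The conversion the post refers to can be sketched like this, using std.conv.text and dtext as named above:

```d
import std.conv : text, dtext;

void main()
{
    string  u8  = "naïve";    // UTF-8: 6 code units, 5 code points
    dstring u32 = dtext(u8);  // one decode pass: UTF-8 -> UTF-32
    assert(u32.length == 5);  // one dchar per code point
    assert(u32[2] == 'ï');    // O(1) indexing by code point
    assert(text(u32) == u8);  // and back: UTF-32 -> UTF-8
}
```

Note that the O(1) indexing UCS-4 buys is by code point, not by user-perceived character (grapheme), so it is still not "direct access" for all purposes.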

On 06/19/2016 05:49 PM, Walter Bright via Digitalmars-d wrote:
 http://utf8everywhere.org/

 It has a good explanation of the issues and problems, and how these 
 things came to be.

 This is pretty much in line with my current (!) opinion on Unicode. 
 What it means for us is I don't think it is that important anymore for 
 algorithms to support strings of UTF-16 or UCS-4.
Jun 19 2016
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 6/19/2016 11:36 PM, Charles Hixson via Digitalmars-d wrote:
 To me it seems that a lot of the time processing is more efficient with UCS-4
 (what I call utf-32).  Storage is clearly more efficient with utf-8, but access
 is more direct with UCS-4.  I agree that utf-8 is generally to be preferred
 where it can be efficiently used, but that's not everywhere.  The problem is
 efficient bi-directional conversion...which D appears to handle fairly well
 already with text() and dtext().  (I don't see any utility for utf-16.  To me
 that seems like a first attempt that should have been deprecated.)
That seemed to me to be true, too, until I wrote a text processing program using UCS-4. It was rather slow. Turns out, 4x memory consumption has a huge performance cost.
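The 4x figure is easy to see directly for ASCII-heavy text; a tiny illustration (the cache-pressure explanation is my gloss, not Walter's wording):

```d
void main()
{
    string  s8  = "The quick brown fox";   // plain ASCII text
    dstring s32 = "The quick brown fox"d;

    // Same 19 characters: 19 bytes as UTF-8, 76 bytes as UTF-32.
    // Four times the bytes stream through the cache hierarchy,
    // which is where the performance cost shows up.
    assert(s8.length  * char.sizeof  == 19);
    assert(s32.length * dchar.sizeof == 76);
}
```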
Jun 19 2016
parent Charles Hixson via Digitalmars-d <digitalmars-d puremagic.com> writes:
On 06/19/2016 11:44 PM, Walter Bright via Digitalmars-d wrote:
 On 6/19/2016 11:36 PM, Charles Hixson via Digitalmars-d wrote:
 To me it seems that a lot of the time processing is more efficient
 with UCS-4 (what I call utf-32).  Storage is clearly more efficient
 with utf-8, but access is more direct with UCS-4.  I agree that utf-8
 is generally to be preferred where it can be efficiently used, but
 that's not everywhere.  The problem is efficient bi-directional
 conversion...which D appears to handle fairly well already with
 text() and dtext().  (I don't see any utility for utf-16.  To me that
 seems like a first attempt that should have been deprecated.)
That seemed to me to be true, too, until I wrote a text processing program using UCS-4. It was rather slow. Turns out, 4x memory consumption has a huge performance cost.
The approach I took (which worked well for my purposes) was to process 
the text a line at a time, and for that the overhead of memory was 
trivial. ... If I'd needed to go back and forth this wouldn't have been 
desirable, but there was one dtext conversion, processing, and then 
several text conversions (of small portions), and it was quite 
efficient. Clearly this can't be the approach taken in all 
circumstances, but for this purpose it was significantly more efficient 
than any other approach I've tried.

It's also true that most of the text I handled was actually ASCII, 
which would have made the most common conversion processes simpler.

To me it appears that both cases need to be handled. The problem is 
documenting the tradeoffs in efficiency. D seems to already work quite 
well with arrays of dchars, so there may well not be any need for 
development in that area. Direct indexing of utf-8 arrays, however, is 
a much more complicated thing, which I doubt can ever be as efficient.

Memory allocation, however, is a separate, though not independent, 
complexity. If you can work in small chunks then it becomes less 
important.
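The line-at-a-time shape described above can be sketched as follows; the flip transform is a hypothetical stand-in for whatever per-line processing the original program did:

```d
import std.algorithm.iteration : map;
import std.conv : dtext, text;
import std.string : lineSplitter;

// Hypothetical per-line transform; stands in for real processing.
dchar flip(dchar c) { return c == 'a' ? 'A' : c; }

void main()
{
    string input = "one line\nand another";
    string[] results;
    foreach (line; input.lineSplitter)
    {
        dstring wide = dtext(line);     // small per-line UTF-32 buffer
        // ... wide[i] can be indexed directly by code point here ...
        results ~= text(wide.map!flip); // back to UTF-8 for output
    }
    assert(results == ["one line", "And Another"]);
}
```

Because each UTF-32 buffer lives only for one line, the 4x expansion never accumulates, which is why the memory overhead stays trivial.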
Jun 20 2016