
digitalmars.D.bugs - [Issue 7084] New: Missing writeln Unicode normalization

d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=7084

           Summary: Missing writeln Unicode normalization
           Product: D
           Version: D2
          Platform: x86
        OS/Version: Windows
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Phobos
        AssignedTo: nobody puremagic.com
        ReportedBy: bearophile_hugs eml.cc



In this program the string 'txt1' contains two codepoints: LATIN CAPITAL LETTER
A, and COMBINING DIAERESIS.

I think a good printing function has to perform Unicode normalization and show
a single \U000000C4 (LATIN CAPITAL LETTER A WITH DIAERESIS) glyph. But with DMD
2.057beta it shows two glyphs (on Windows), an 'A' followed by a diaeresis.

writeln(txt2) shows what I think is the correct output for writeln(txt1) too:


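import std.stdio;
void main() {
    // Decomposed: LATIN CAPITAL LETTER A + COMBINING DIAERESIS (two code points).
    dstring txt1 = "\U00000041\U00000308"d;
    writeln(txt1);
    // Precomposed: LATIN CAPITAL LETTER A WITH DIAERESIS (one code point).
    dstring txt2 = "\U000000C4"d;
    writeln(txt2);
}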

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
Dec 09 2011
d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=7084


hsteoh quickfur.ath.cx changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |hsteoh quickfur.ath.cx



IMO this should be an enhancement request. As I understand, Unicode
normalization is non-trivial, so we probably should think over how we want to
do it.

Feb 25 2012
d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=7084


bearophile_hugs eml.cc changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Severity|normal                      |enhancement




> IMO this should be an enhancement request. As I understand, Unicode
> normalization is non-trivial, so we probably should think over how we want to
> do it.

OK, now it's an enhancement.
Feb 26 2012
d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=7084




Here's a link to the relevant part of the Unicode standard for whoever wants to
implement normalization:

http://unicode.org/reports/tr15/

Note that there are several different normalizations, with NFC probably being
the closest to what this bug requires.

After scanning through the standard, it seems to me that rather than putting
this in std.stdio (or the prospective std.io), we really should put it in
std.uni or std.utf, and have different algorithms available for programs to
choose the normalization form. The algorithms involved are not trivial, and
some people may not want std.stdio to automatically normalize to a particular
form when they want specifically to use a different form or a non-normalized
output for whatever reason.
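
For illustration, here is a minimal sketch of the "caller chooses the form" approach. It assumes the normalize function and the NFC/NFD form constants that later Phobos releases provide in std.uni; they are not part of the DMD 2.057 beta this report was filed against. It is only one possible shape for such an API, not a proposal for how writeln itself should behave.

```d
import std.stdio : writeln;
import std.uni : normalize, NFC, NFD;

void main()
{
    // Decomposed form from the report: 'A' followed by COMBINING DIAERESIS.
    dstring txt1 = "\U00000041\U00000308"d;
    // Precomposed form: LATIN CAPITAL LETTER A WITH DIAERESIS.
    dstring txt2 = "\U000000C4"d;

    // The program, not the output function, picks the normalization form.
    auto composed = normalize!NFC(txt1);   // compose to a single code point
    auto decomposed = normalize!NFD(txt2); // decompose to base + combining mark

    assert(composed == txt2);
    assert(decomposed == txt1);

    // Normalize explicitly before printing; writeln is left untouched.
    writeln(composed);
}
```

With this split, a program that wants the precomposed glyph writes writeln(normalize!NFC(txt1)), while a program that needs NFD or unnormalized output simply skips or changes the call.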

Feb 26 2012