digitalmars.D.bugs - [Issue 3455] New: Some Unicode characters not allowed in identifiers


d-bugmail puremagic.com writes:

http://d.puremagic.com/issues/show_bug.cgi?id=3455

           Summary: Some Unicode characters not allowed in identifiers
           Product: D
           Version: unspecified
          Platform: Other
        OS/Version: Linux
            Status: NEW
          Severity: normal
          Priority: P2
         Component: DMD
        AssignedTo: nobody puremagic.com
        ReportedBy: andrei metalanguage.com



09:30:44 PDT ---
Consider:

void main() {
    auto aλ = "９";
    auto a９ = "９";
}

The first identifier is an "a" followed by this:

http://www.fileformat.info/info/unicode/char/03bb/index.htm

The second identifier is an "a" followed by this:

http://www.fileformat.info/info/unicode/char/ff19/index.htm

Both string literals contain the latter.

The second identifier does not compile, although I checked that my editor
inserted the correct three-byte UTF-8 code.

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------

Oct 30 2009

d-bugmail puremagic.com writes:

http://d.puremagic.com/issues/show_bug.cgi?id=3455


Matti Niemenmaa <matti.niemenmaa+dbugzilla iki.fi> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |spec
                 CC|                            |matti.niemenmaa+dbugzilla i
                   |                            |ki.fi
           Platform|Other                       |All
         OS/Version|Linux                       |All
           Severity|normal                      |enhancement



2009-10-30 09:51:09 PDT ---
As http://www.digitalmars.com/d/1.0/lex.html#identifier very clearly states,
the allowed characters in identifiers are those defined in the C99 standard,
ISO/IEC 9899:1999(E) Annex D. Have a look at it:
http://www.open-std.org/JTC1/SC22/wg14/www/docs/n1124.pdf

９, code point 0xff19, is not in that list. The maximum one is 0xd7a3, in fact.
This is not a bug, this is an enhancement.

However, rather than an arbitrary and frozen list, I /would/ prefer basing it
simply on Unicode properties, such as Java's choice: identifiers may start with
letters or numeric letters, and may contain, in addition to those, connecting
punctuation, decimal digits, and combining and non-spacing marks. In other
words:

Identifiers may start with code points from the general categories Ll, Lm, Lo,
Lt, Lu, Nl.

Identifiers may contain code points from the general categories Ll, Lm, Lo, Lt,
Lu, Mc, Mn, Nd, Nl, No, Pc.

Java also allows Cc and Cf, of whose usefulness I'm not so convinced. These are
control characters and things like "soft hyphen", which isn't even supposed to
be displayed unless the word line-wraps. Too much potential for confusion IMHO.
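
A minimal sketch of the category-based rule proposed above, assuming Phobos' std.uni and its `unicode` property lookup are available. The helper names `isIdentStart`, `isIdentPart` and `isIdentifier` are hypothetical; this is not DMD's lexer. It follows the two category lists literally, so it does not special-case `_` as a starting character. U+03BB is in category Ll and U+FF19 is in category Nd, so under this rule both identifiers from the report would be accepted, while an identifier beginning with U+FF19 would not.

```d
import std.stdio : writeln;
import std.uni : unicode;

// Hypothetical helpers for illustration; not DMD's lexer and not a Phobos API.

/// True if `c` is in one of the proposed "start" categories:
/// Ll, Lm, Lo, Lt, Lu, Nl.
bool isIdentStart(dchar c)
{
    // Rebuilt on every call for simplicity; a real lexer would cache the set.
    auto start = unicode.Ll | unicode.Lm | unicode.Lo
               | unicode.Lt | unicode.Lu | unicode.Nl;
    return start[c];
}

/// True if `c` is in one of the proposed "contain" categories:
/// the start categories plus Mc, Mn, Nd, No, Pc.
bool isIdentPart(dchar c)
{
    auto extra = unicode.Mc | unicode.Mn | unicode.Nd
               | unicode.No | unicode.Pc;
    return isIdentStart(c) || extra[c];
}

/// Checks a whole string against the two rules above.
bool isIdentifier(string s)
{
    if (s.length == 0)
        return false;
    bool first = true;
    foreach (dchar c; s) // decodes UTF-8 into code points
    {
        if (first ? !isIdentStart(c) : !isIdentPart(c))
            return false;
        first = false;
    }
    return true;
}

void main()
{
    writeln(isIdentifier("a\u03BB")); // 'a' followed by U+03BB
    writeln(isIdentifier("a\uFF19")); // 'a' followed by U+FF19
    writeln(isIdentifier("\uFF19a")); // U+FF19 in first position
}
```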


Oct 30 2009

d-bugmail puremagic.com writes:

http://d.puremagic.com/issues/show_bug.cgi?id=3455




11:40:05 PDT ---

> As http://www.digitalmars.com/d/1.0/lex.html#identifier very clearly states,
> the allowed characters in identifiers are those defined in the C99 standard,
> ISO/IEC 9899:1999(E) Annex D. Have a look at it:
> http://www.open-std.org/JTC1/SC22/wg14/www/docs/n1124.pdf
>
> ９, code point 0xff19, is not in that list. The maximum one is 0xd7a3, in
> fact.
> This is not a bug, this is an enhancement.
>
> However, rather than an arbitrary and frozen list, I /would/ prefer basing it
> simply on Unicode properties, such as Java's choice: identifiers may start with
> letters or numeric letters, and may contain, in addition to those, connecting
> punctuation, decimal digits, and combining and non-spacing marks. In other
> words:
>
> Identifiers may start with code points from the general categories Ll, Lm, Lo,
> Lt, Lu, Nl.
>
> Identifiers may contain code points from the general categories Ll, Lm, Lo, Lt,
> Lu, Mc, Mn, Nd, Nl, No, Pc.
>
> Java also allows Cc and Cf, of whose usefulness I'm not so convinced. These are
> control characters and things like "soft hyphen", which isn't even supposed to
> be displayed unless the word line-wraps. Too much potential for confusion IMHO.

Oh ok. Thanks Matti. I'm leaving this as an enhancement request. Currently the
error message is:

invalid UTF-8 sequence
unsupported char 0x99

This is factually incorrect because the UTF-8 sequence is correct. I suggest
instead:

Unicode character 0xFF19 not allowed in a symbol
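
A small sketch of where the 0x99 in the current message may come from, assuming Phobos' std.utf is available. The UTF-8 encoding of U+FF19 is the three bytes EF BC 99, so the reported value matches the final byte of a well-formed sequence. That the compiler is in fact reporting that trailing byte is an assumption, not something confirmed in this thread. The helper name `utf8Bytes` is hypothetical.

```d
import std.stdio : writefln;
import std.utf : encode;

/// Hypothetical helper: returns the UTF-8 code units that encode `c`.
ubyte[] utf8Bytes(dchar c)
{
    char[4] buf;
    immutable len = encode(buf, c); // number of code units written
    return cast(ubyte[]) buf[0 .. len].dup;
}

void main()
{
    // U+FF19 FULLWIDTH DIGIT NINE, the character from the report.
    writefln("%(%02X %)", utf8Bytes('\uFF19'));
}
```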


Andrei


Oct 30 2009

d-bugmail puremagic.com writes:

http://d.puremagic.com/issues/show_bug.cgi?id=3455


Walter Bright <bugzilla digitalmars.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |bugzilla digitalmars.com



00:17:37 PST ---
I'm slowly becoming convinced that allowing unicode characters in identifiers
is just a bad idea anyway. While there is plenty of interest in writing code
that manipulates unicode and has unicode strings, there is little interest in
writing the code itself in unicode. There's a growing consensus that code
should be written in ascii, for a long list of reasons.

For C compatibility, D should support the C identifiers, but I don't think
there's an advantage to going beyond that. For instance, the unicode character
used in Andrei's test case won't even display properly in Explorer.

I'll fix the error message, then call it resolved.


Dec 12 2009

d-bugmail puremagic.com writes:

http://d.puremagic.com/issues/show_bug.cgi?id=3455


Kosmonaut <Kosmonaut tempinbox.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |Kosmonaut tempinbox.com



---
Relevant SVN commit:
http://www.dsource.org/projects/dmd/changeset/292


Dec 12 2009

d-bugmail puremagic.com writes:

http://d.puremagic.com/issues/show_bug.cgi?id=3455


Walter Bright <bugzilla digitalmars.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED



11:11:58 PST ---
Fixed in dmd 1.054 and 2.038


Dec 31 2009

d-bugmail puremagic.com writes:

http://d.puremagic.com/issues/show_bug.cgi?id=3455


Ali Cehreli <acehreli yahoo.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |acehreli yahoo.com




> there is little interest in writing the code itself in unicode.
> There's a growing consensus that code should be written in ascii,
> for a long list of reasons.

Thank you very much for allowing us to program in UTF-8. There is a yet-to-grow
Turkish D community out there who have tremendous joy in being able to program
in Turkish.

I may be in the minority here, but UTF-8 identifiers have been the most
important feature for me in considering D.

Ali


Dec 31 2009
