
digitalmars.D.learn - DMD: invalid UTF character `\U0000d800`

reply Per Nordlöw <per.nordlow gmail.com> writes:
I'm writing a parser generator for ANTLR grammars and have come 
across the rule

fragment Letter
     : [a-zA-Z$_] // these are below 0x7F
     | ~[\u0000-\u007F\uD800-\uDBFF] // covers all characters above 0x7F which are not a surrogate
     | [\uD800-\uDBFF] [\uDC00-\uDFFF] // covers UTF-16 surrogate pairs encodings for U+10000 to U+10FFFF
     ;

at

https://github.com/antlr/grammars-v4/blob/master/cto/CtoLexer.g4#L158

This rule is converted into

     Match m__Letter()
     {
         return alt(alt(rng('a', 'z'), rng('A', 'Z'), ch('$'), ch('_')),
                    not(alt(rng('\u0000', '\u007F'), rng('\uD800', '\uDBFF'))),
                    seq(rng('\uD800', '\uDBFF'), rng('\uDC00', '\uDFFF')));
     }

given suitable defs of alt, rng, seq, not.

This errors as

  CtoLexer_parser.d   665  57 error           invalid UTF character \U0000d800
  CtoLexer_parser.d   665  67 error           invalid UTF character \U0000dbff
  CtoLexer_parser.d   666  28 error           invalid UTF character \U0000d800
  CtoLexer_parser.d   666  38 error           invalid UTF character \U0000dbff
  CtoLexer_parser.d   666  53 error           invalid UTF character \U0000dc00
  CtoLexer_parser.d   666  63 error           invalid UTF character \U0000dfff

Doesn't DMD support these Unicodes yet?
Nov 07 2020
parent reply Jacob Carlborg <doob me.com> writes:
On Saturday, 7 November 2020 at 16:12:06 UTC, Per Nordlöw wrote:

  CtoLexer_parser.d   665  57 error           invalid UTF character \U0000d800
  CtoLexer_parser.d   665  67 error           invalid UTF character \U0000dbff
  CtoLexer_parser.d   666  28 error           invalid UTF character \U0000d800
  CtoLexer_parser.d   666  38 error           invalid UTF character \U0000dbff
  CtoLexer_parser.d   666  53 error           invalid UTF character \U0000dc00
  CtoLexer_parser.d   666  63 error           invalid UTF character \U0000dfff

 Doesn't DMD support these Unicodes yet?
They're not valid: "The Unicode standard permanently reserves these code point values for UTF-16 encoding of the high and low surrogates, and they will never be assigned a character, so there should be no reason to encode them. The official Unicode standard says that no UTF forms, including UTF-16, can encode these code points" [1].

"... the standard states that such arrangements should be treated as encoding errors" [1].

Perhaps they need to be combined with other code points to form a valid character.

[1] https://en.wikipedia.org/wiki/UTF-16#U+D800_to_U+DFFF

-- /Jacob Carlborg
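To illustrate how a surrogate pair combines into a single valid code point, here is a minimal sketch in D. The function name `decodeSurrogatePair` is hypothetical and not part of Phobos; the arithmetic follows the UTF-16 definition (each surrogate carries 10 bits of the value above U+FFFF).

```d
// Hypothetical helper (not a Phobos function): combine a UTF-16 high and
// low surrogate into the code point they encode together.
dchar decodeSurrogatePair(wchar hi, wchar lo) pure nothrow @nogc @safe
{
    assert(hi >= 0xD800 && hi <= 0xDBFF, "expected a high surrogate");
    assert(lo >= 0xDC00 && lo <= 0xDFFF, "expected a low surrogate");
    // The high surrogate holds the upper 10 bits, the low surrogate the
    // lower 10 bits of (code point - 0x10000).
    return cast(dchar)(0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00));
}

unittest
{
    // U+1F600 is encoded in UTF-16 as the pair D83D DE00.
    assert(decodeSurrogatePair(cast(wchar) 0xD83D, cast(wchar) 0xDE00) == 0x1F600);
    // The smallest and largest pairs map to U+10000 and U+10FFFF.
    assert(decodeSurrogatePair(cast(wchar) 0xD800, cast(wchar) 0xDC00) == 0x10000);
    assert(decodeSurrogatePair(cast(wchar) 0xDBFF, cast(wchar) 0xDFFF) == 0x10FFFF);
}
```

The surrogate values are written as casts from integers because, as the errors in this thread show, DMD rejects them as `\u` escapes.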
Nov 07 2020
parent reply Per Nordlöw <per.nordlow gmail.com> writes:
On Saturday, 7 November 2020 at 17:49:54 UTC, Jacob Carlborg wrote:
 [1] https://en.wikipedia.org/wiki/UTF-16#U+D800_to_U+DFFF
Thanks! I'm only using these UTF characters to create ranges that source code characters are checked against during parsing. Therefore I would like to just convert these to a `dchar` for now using a `cast`. Can I just do, for instance, cast(dchar)0x0000d800 for `\U0000d800` to accomplish this?
Nov 08 2020
next sibling parent Per Nordlöw <per.nordlow gmail.com> writes:
On Sunday, 8 November 2020 at 10:47:34 UTC, Per Nordlöw wrote:
     cast(dchar)0x0000d800
To clarify,

     enum dch1 = cast(dchar)0xa0a0;
     enum dch2 = '\ua0a0';
     assert(dch1 == dch2);

works. Can I use the first variant if I want to postpone these encoding questions for now?
Nov 08 2020
prev sibling next sibling parent reply Kagamin <spam here.lot> writes:
On Sunday, 8 November 2020 at 10:47:34 UTC, Per Nordlöw wrote:
 dchar
Surrogate pairs are used in the rules because Java strings are UTF-16 encoded; they don't make much sense for other encodings.
Nov 08 2020
parent Jacob Carlborg <doob me.com> writes:
On 2020-11-08 13:39, Kagamin wrote:

 Surrogate pairs are used in rules because java strings are utf-16 
 encoded, it doesn't make much sense for other encodings.
D supports the UTF-16 encoding as well. The compiler doesn't accept the surrogate pairs even for UTF-16 strings.

-- /Jacob Carlborg
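If the generated parser works on decoded code points (`dchar`) rather than UTF-16 code units, the surrogate alternatives of the `Letter` rule are arguably unnecessary. The following is only a sketch under that assumption; `isLetter` is a hypothetical name, and rejecting all surrogates (including lone low surrogates, which the original rule's second alternative would accept) is a choice made here, not something stated in the thread.

```d
// Hypothetical code-point-based version of the ANTLR `Letter` fragment.
// Assumes the input has already been decoded to UTF-32 (dchar).
bool isLetter(dchar c) pure nothrow @nogc @safe
{
    // First alternative: [a-zA-Z$_]
    if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '$' || c == '_')
        return true;
    // Everything else in the ASCII range is not a letter.
    if (c <= 0x7F)
        return false;
    // Surrogates never appear as code points in well-formed UTF-32.
    if (c >= 0xD800 && c <= 0xDFFF)
        return false;
    // Remaining alternatives: any code point above 0x7F, which includes
    // U+10000..U+10FFFF (what the surrogate pairs encode in UTF-16).
    return c <= 0x10FFFF;
}

unittest
{
    assert(isLetter('a') && isLetter('Z') && isLetter('$') && isLetter('_'));
    assert(!isLetter('1') && !isLetter(' '));
    assert(isLetter('\u00E9'));       // é, above 0x7F
    assert(isLetter('\U0001F600'));   // supplementary plane
    assert(!isLetter(cast(dchar) 0xD800));
}
```

This sidesteps the invalid literals entirely, at the cost of no longer being a one-to-one translation of the grammar.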
Nov 08 2020
prev sibling next sibling parent Steven Schveighoffer <schveiguy gmail.com> writes:
On 11/8/20 5:47 AM, Per Nordlöw wrote:
 On Saturday, 7 November 2020 at 17:49:54 UTC, Jacob Carlborg wrote:
 [1] https://en.wikipedia.org/wiki/UTF-16#U+D800_to_U+DFFF
 Thanks! I'm only using these UTF characters to create ranges that source code characters are checked against during parsing. Therefore I would like to just convert these to a `dchar` for now using a `cast`. Can I just do, for instance, cast(dchar)0x0000d800 for `\U0000d800` to accomplish this?
Yes, use the cast. It should work. It's just the D grammar that is stopping you; a dchar is just an integer under the hood, so the cast should be fine.

-Steve
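A minimal sketch of the cast-based workaround, assuming a simple inclusive range type. `CodePointRange`, `contains`, `highSurrogates` and `lowSurrogates` are hypothetical names that stand in for whatever `rng` produces in the generated parser.

```d
// Hypothetical stand-in for the parser's `rng`: an inclusive range of
// code point values.
struct CodePointRange
{
    dchar lo;
    dchar hi;

    bool contains(dchar c) const pure nothrow @nogc @safe
    {
        return lo <= c && c <= hi;
    }
}

// Writing '\uD800' here is rejected by the compiler as an invalid UTF
// character; casting the integer value is accepted because dchar is a
// 32-bit integer type underneath.
enum highSurrogates = CodePointRange(cast(dchar) 0xD800, cast(dchar) 0xDBFF);
enum lowSurrogates  = CodePointRange(cast(dchar) 0xDC00, cast(dchar) 0xDFFF);

unittest
{
    assert(highSurrogates.contains(cast(dchar) 0xD800));
    assert(!highSurrogates.contains('a'));
    assert(lowSurrogates.contains(cast(dchar) 0xDFFF));
    assert(!lowSurrogates.contains(cast(dchar) 0xDBFF));
}
```

Note that the cast only silences the literal check; a `dchar` holding a surrogate value is still not a valid Unicode scalar value, so encoding functions may reject it later.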
Nov 08 2020
prev sibling parent reply Boris Carvajal <boris2.9 gmail.com> writes:
On Sunday, 8 November 2020 at 10:47:34 UTC, Per Nordlöw wrote:
 Can I just do, for instance,

     cast(dchar)0x0000d800

 for

     `\U0000d800`

 to accomplish this?
There's also:

     dchar(0x0000d800)
Nov 09 2020
parent Per Nordlöw <per.nordlow gmail.com> writes:
On Monday, 9 November 2020 at 16:39:49 UTC, Boris Carvajal wrote:
 There's also:

 dchar(0x0000d800)
Thanks
Nov 12 2020