digitalmars.D.learn - DMD: invalid UTF character `\U0000d800`


Per Nordlöw <per.nordlow gmail.com> writes:

I'm writing a parser generator for ANTLR-grammars and have come 
across the rule

fragment Letter
     : [a-zA-Z$_] // these are below 0x7F
     | ~[\u0000-\u007F\uD800-\uDBFF] // covers all characters above 0x7F which are not a surrogate
     | [\uD800-\uDBFF] [\uDC00-\uDFFF] // covers UTF-16 surrogate pairs encodings for U+10000 to U+10FFFF
     ;

at

https://github.com/antlr/grammars-v4/blob/master/cto/CtoLexer.g4#L158

This rule is converted into

     Match m__Letter()
     {
         return alt(alt(rng('a', 'z'), rng('A', 'Z'), ch('$'), ch('_')),
                    not(alt(rng('\u0000', '\u007F'), rng('\uD800', '\uDBFF'))),
                    seq(rng('\uD800', '\uDBFF'), rng('\uDC00', '\uDFFF')));
     }

given suitable defs of alt, rng, seq, not.

This errors as

  CtoLexer_parser.d   665  57 error           invalid UTF character \U0000d800
  CtoLexer_parser.d   665  67 error           invalid UTF character \U0000dbff
  CtoLexer_parser.d   666  28 error           invalid UTF character \U0000d800
  CtoLexer_parser.d   666  38 error           invalid UTF character \U0000dbff
  CtoLexer_parser.d   666  53 error           invalid UTF character \U0000dc00
  CtoLexer_parser.d   666  63 error           invalid UTF character \U0000dfff

Doesn't DMD support these Unicode code points yet?

Nov 07 2020

Jacob Carlborg <doob me.com> writes:

On Saturday, 7 November 2020 at 16:12:06 UTC, Per Nordlöw wrote:

  CtoLexer_parser.d   665  57 error           invalid UTF character \U0000d800
  CtoLexer_parser.d   665  67 error           invalid UTF character \U0000dbff
  CtoLexer_parser.d   666  28 error           invalid UTF character \U0000d800
  CtoLexer_parser.d   666  38 error           invalid UTF character \U0000dbff
  CtoLexer_parser.d   666  53 error           invalid UTF character \U0000dc00
  CtoLexer_parser.d   666  63 error           invalid UTF character \U0000dfff

 Doesn't DMD support these Unicode code points yet?

They're not valid:

"The Unicode standard permanently reserves these code point 
values for UTF-16 encoding of the high and low surrogates, and 
they will never be assigned a character, so there should be no 
reason to encode them. The official Unicode standard says that no 
UTF forms, including UTF-16, can encode these code points" [1].

"... the standard states that such arrangements should be treated 
as encoding errors" [1].

Perhaps they need to be combined with other code points to form a 
valid character.

[1] https://en.wikipedia.org/wiki/UTF-16#U+D800_to_U+DFFF
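
To illustrate the combination mentioned above, here is a minimal sketch of how a code point above U+FFFF is split into a high and a low surrogate in UTF-16. The helper name `toSurrogatePair` is made up for this example; `isValidDchar` is from Phobos' `std.utf` and, as far as I know, rejects lone surrogates.

```d
import std.utf : isValidDchar;

/// Hypothetical helper: splits a code point in U+10000..U+10FFFF
/// into its UTF-16 high and low surrogate code units.
wchar[2] toSurrogatePair(dchar c) pure nothrow @nogc @safe
{
    assert(c >= 0x10000 && c <= 0x10FFFF);
    immutable uint v = c - 0x10000;             // 20 significant bits remain
    wchar[2] pair;
    pair[0] = cast(wchar)(0xD800 + (v >> 10));   // high surrogate: top 10 bits
    pair[1] = cast(wchar)(0xDC00 + (v & 0x3FF)); // low surrogate: bottom 10 bits
    return pair;
}

void main()
{
    // U+1F600 becomes the pair D83D DE00.
    immutable pair = toSurrogatePair('\U0001F600');
    assert(pair[0] == 0xD83D && pair[1] == 0xDE00);

    // A surrogate on its own is not a valid code point.
    assert(!isValidDchar(cast(dchar) 0xD800));
}
```

So the values D800 to DFFF only mean something as UTF-16 code units in pairs, which is presumably why a lone `'\uD800'` literal is rejected.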

--
/Jacob Carlborg

Nov 07 2020

Per Nordlöw <per.nordlow gmail.com> writes:

On Saturday, 7 November 2020 at 17:49:54 UTC, Jacob Carlborg 
wrote:
 [1] https://en.wikipedia.org/wiki/UTF-16#U+D800_to_U+DFFF

Thanks!

I'm only using these UTF characters to create ranges that source 
code characters are checked against during parsing. Therefore I 
would like to just convert these to a `dchar` for now using a 
`cast`. Can I just do, for instance,

     cast(dchar)0x0000d800

for

     `\U0000d800`

to accomplish this?

Nov 08 2020

Per Nordlöw <per.nordlow gmail.com> writes:

On Sunday, 8 November 2020 at 10:47:34 UTC, Per Nordlöw wrote:
     cast(dchar)0x0000d800

To clarify,

     enum dch1 = cast(dchar)0xa0a0;
     enum dch2 = '\ua0a0';
     assert(dch1 == dch2);

works. Can I use the first variant if I want to postpone these 
encoding questions for now?

Nov 08 2020

Kagamin <spam here.lot> writes:

On Sunday, 8 November 2020 at 10:47:34 UTC, Per Nordlöw wrote:
 dchar

Surrogate pairs are used in the rules because Java strings are UTF-16 
encoded; they don't make much sense for other encodings.

Nov 08 2020

Jacob Carlborg <doob me.com> writes:

On 2020-11-08 13:39, Kagamin wrote:

 Surrogate pairs are used in the rules because Java strings are UTF-16 
 encoded; they don't make much sense for other encodings.

D supports the UTF-16 encoding as well. The compiler doesn't accept the 
surrogate pairs even for UTF-16 strings.
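
If the generated parser works on decoded code points (`dchar`) rather than UTF-16 code units, the Letter rule can be written without any surrogate literals. This is only a rough sketch under that assumption. The function name `isLetter` is made up, and unlike the grammar, which only excludes the high surrogates in its second alternative, it rejects every value in D800..DFFF.

```d
/// Hypothetical rewrite of the Letter rule over code points instead of
/// UTF-16 code units. Assumes the input was already decoded to dchar.
bool isLetter(dchar c) pure nothrow @nogc @safe
{
    if (c <= 0x7F) // ASCII part of the rule: [a-zA-Z$_]
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
            || c == '$' || c == '_';

    // Above ASCII: everything except the surrogate block. Code points
    // U+10000..U+10FFFF need no pair here, as one dchar holds them directly.
    immutable bool isSurrogate = c >= 0xD800 && c <= 0xDFFF;
    return !isSurrogate && c <= 0x10FFFF;
}

void main()
{
    assert(isLetter('x') && isLetter('$'));
    assert(isLetter('\u00F6'));             // above 0x7F, not a surrogate
    assert(isLetter('\U0001F600'));         // a surrogate pair in UTF-16
    assert(!isLetter('1'));
    assert(!isLetter(cast(dchar) 0xD800));  // lone surrogate
}
```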

-- 
/Jacob Carlborg

Nov 08 2020

Steven Schveighoffer <schveiguy gmail.com> writes:

On 11/8/20 5:47 AM, Per Nordlöw wrote:
 On Saturday, 7 November 2020 at 17:49:54 UTC, Jacob Carlborg wrote:
 [1] https://en.wikipedia.org/wiki/UTF-16#U+D800_to_U+DFFF

 
 Thanks!
 
 I'm only using these UTF characters to create ranges that source code 
 characters are checked against during parsing. Therefore I would like to 
 just convert these to a `dchar` for now using a `cast`. Can I just do, 
 for instance,
 
      cast(dchar)0x0000d800
 
 for
 
      `\U0000d800`
 
 to accomplish this?

Yes, use the cast. It should work.

It's just the D grammar that is stopping you. A dchar is just an integer 
under the hood, so the cast should be fine.
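
A minimal sketch of what that could look like. The constant names and the `inRange` helper are made up here as a stand-in for your `rng`, whose real definition isn't shown in the thread.

```d
// '\uD800' is rejected as a literal, but casting the integer value compiles.
enum dchar highSurrogateFirst = cast(dchar) 0xD800;
enum dchar highSurrogateLast  = cast(dchar) 0xDBFF;
enum dchar lowSurrogateFirst  = cast(dchar) 0xDC00;
enum dchar lowSurrogateLast   = cast(dchar) 0xDFFF;

/// Hypothetical stand-in for `rng`: true if `c` lies in [lo, hi].
bool inRange(dchar c, dchar lo, dchar hi) pure nothrow @nogc @safe
{
    return lo <= c && c <= hi;
}

// The casts are ordinary compile-time constants.
static assert(highSurrogateFirst == 0xD800);
static assert(inRange(cast(dchar) 0xD9AB, highSurrogateFirst, highSurrogateLast));
static assert(!inRange('a', lowSurrogateFirst, lowSurrogateLast));

void main() {}
```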

-Steve

Nov 08 2020

Boris Carvajal <boris2.9 gmail.com> writes:

On Sunday, 8 November 2020 at 10:47:34 UTC, Per Nordlöw wrote:
 Can I just do, for instance,

     cast(dchar)0x0000d800

 for

     `\U0000d800`

 to accomplish this?

There's also:

dchar(0x0000d800)

Nov 09 2020

Per Nordlöw <per.nordlow gmail.com> writes:

On Monday, 9 November 2020 at 16:39:49 UTC, Boris Carvajal wrote:
 There's also:

 dchar(0x0000d800)

Thanks

Nov 12 2020
