digitalmars.D.learn - DMD: invalid UTF character `\U0000d800`
- Per Nordlöw (36/36) Nov 07 2020 I'm writing a parser generator for ANTLR-grammars and have come
- Jacob Carlborg (14/27) Nov 07 2020 They're not valid:
- Per Nordlöw (11/12) Nov 08 2020 Thanks!
- Per Nordlöw (7/8) Nov 08 2020 To clarify,
- Kagamin (3/4) Nov 08 2020 Surrogate pairs are used in rules because java strings are utf-16
- Jacob Carlborg (5/7) Nov 08 2020 D supports the UTF-16 encoding as well. The compiler doesn't accept the
- Steven Schveighoffer (5/22) Nov 08 2020 Yes, use the cast. It should work.
- Boris Carvajal (3/8) Nov 09 2020 There's also:
- Per Nordlöw (2/4) Nov 12 2020 Thanks
I'm writing a parser generator for ANTLR-grammars and have come across the rule

    fragment Letter
        : [a-zA-Z$_] // these are below 0x7F
        | ~[\u0000-\u007F\uD800-\uDBFF] // covers all characters above 0x7F which are not a surrogate
        | [\uD800-\uDBFF] [\uDC00-\uDFFF] // covers UTF-16 surrogate pairs encodings for U+10000 to U+10FFFF
        ;

at https://github.com/antlr/grammars-v4/blob/master/cto/CtoLexer.g4#L158

This rule is converted into

    Match m__Letter()
    {
        return alt(alt(rng('a', 'z'), rng('A', 'Z'), ch('$'), ch('_')),
                   not(alt(rng('\u0000', '\u007F'), rng('\uD800', '\uDBFF'))),
                   seq(rng('\uD800', '\uDBFF'), rng('\uDC00', '\uDFFF')));
    }

given suitable defs of alt, rng, seq, not. This errors as

    CtoLexer_parser.d 665 57 error invalid UTF character \U0000d800
    CtoLexer_parser.d 665 67 error invalid UTF character \U0000dbff
    CtoLexer_parser.d 666 28 error invalid UTF character \U0000d800
    CtoLexer_parser.d 666 38 error invalid UTF character \U0000dbff
    CtoLexer_parser.d 666 53 error invalid UTF character \U0000dc00
    CtoLexer_parser.d 666 63 error invalid UTF character \U0000dfff

Doesn't DMD support these Unicodes yet?
Nov 07 2020
On Saturday, 7 November 2020 at 16:12:06 UTC, Per Nordlöw wrote:
> CtoLexer_parser.d 665 57 error invalid UTF character \U0000d800
> CtoLexer_parser.d 665 67 error invalid UTF character \U0000dbff
> CtoLexer_parser.d 666 28 error invalid UTF character \U0000d800
> CtoLexer_parser.d 666 38 error invalid UTF character \U0000dbff
> CtoLexer_parser.d 666 53 error invalid UTF character \U0000dc00
> CtoLexer_parser.d 666 63 error invalid UTF character \U0000dfff
>
> Doesn't DMD support these Unicodes yet?

They're not valid:

"The Unicode standard permanently reserves these code point values for UTF-16 encoding of the high and low surrogates, and they will never be assigned a character, so there should be no reason to encode them. The official Unicode standard says that no UTF forms, including UTF-16, can encode these code points" [1].

"... the standard states that such arrangements should be treated as encoding errors" [1].

Perhaps they need to be combined with other code points to form a valid character.

[1] https://en.wikipedia.org/wiki/UTF-16#U+D800_to_U+DFFF

--
/Jacob Carlborg
Nov 07 2020
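A minimal D sketch of the point quoted above, assuming Phobos's `std.utf.isValidDchar` is available. `inSurrogateBlock` is a hypothetical helper that does not appear in the thread, and the surrogate values are built from integers because the compiler rejects the corresponding character literals.

```d
import std.utf : isValidDchar;

/// Hypothetical helper: true if `c` lies in the surrogate block U+D800..U+DFFF.
bool inSurrogateBlock(dchar c) pure nothrow @safe @nogc
{
    return c >= 0xD800 && c <= 0xDFFF;
}

void main()
{
    // Built from integers; '\uD800' as a literal is a compile error.
    immutable dchar first = cast(dchar) 0xD800;
    immutable dchar last  = cast(dchar) 0xDFFF;

    assert(inSurrogateBlock(first) && inSurrogateBlock(last));

    // Surrogates are reserved and are not valid code points on their own.
    assert(!isValidDchar(first));
    assert(!isValidDchar(last));

    // Ordinary characters, inside or outside the BMP, are valid.
    assert(isValidDchar('a'));
    assert(isValidDchar('\U0001F600'));
    assert(!inSurrogateBlock('\U0001F600'));
}
```

The values can still be stored in a `dchar`; they just do not denote characters by themselves.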
On Saturday, 7 November 2020 at 17:49:54 UTC, Jacob Carlborg wrote:
> [1] https://en.wikipedia.org/wiki/UTF-16#U+D800_to_U+DFFF

Thanks! I'm only using these UTF characters to create ranges that source code characters are checked against during parsing. Therefore I would like to just convert these to a `dchar` for now using a `cast`. Can I just do, for instance,

    cast(dchar)0x0000d800

for `\U0000d800` to accomplish this?
Nov 08 2020
On Sunday, 8 November 2020 at 10:47:34 UTC, Per Nordlöw wrote:
> cast(dchar)0x0000d800

To clarify,

    enum dch1 = cast(dchar)0xa0a0;
    enum dch2 = '\ua0a0';
    assert(dch1 == dch2);

works. Can I use the first variant if I want to postpone these encoding questions for now?
Nov 08 2020
On Sunday, 8 November 2020 at 10:47:34 UTC, Per Nordlöw wrote:
> dchar

Surrogate pairs are used in the rules because Java strings are UTF-16 encoded; they don't make much sense for other encodings.
Nov 08 2020
On 2020-11-08 13:39, Kagamin wrote:
> Surrogate pairs are used in rules because java strings are utf-16 encoded, it doesn't make much sense for other encodings.

D supports the UTF-16 encoding as well. The compiler doesn't accept the surrogate pairs even for UTF-16 strings.

--
/Jacob Carlborg
Nov 08 2020
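A rough D sketch of the UTF-16 point, assuming the generated parser matches decoded `dchar` values rather than raw UTF-16 code units. `countCodePoints` is a hypothetical helper that is not from the thread. A supplementary character written as a single `\U` escape is stored as a surrogate pair in a `wstring`, and decoding yields one code point in U+10000 to U+10FFFF again.

```d
/// Hypothetical helper: counts code points by letting foreach decode UTF-16.
size_t countCodePoints(wstring s)
{
    size_t n;
    foreach (dchar c; s)
        ++n;
    return n;
}

void main()
{
    // U+1F600 lies outside the BMP; the compiler encodes it as two code units.
    wstring utf16 = "\U0001F600"w;
    assert(utf16.length == 2);

    // The code units are a high surrogate followed by a low surrogate.
    assert(utf16[0] >= 0xD800 && utf16[0] <= 0xDBFF);
    assert(utf16[1] >= 0xDC00 && utf16[1] <= 0xDFFF);

    // Decoding turns the pair back into a single code point above U+FFFF.
    assert(countCodePoints(utf16) == 1);
    foreach (dchar c; utf16)
        assert(c >= 0x10000 && c <= 0x10FFFF);
}
```

Under that assumption, the surrogate-pair alternative of the grammar rule corresponds to a single range check on the decoded code point.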
On 11/8/20 5:47 AM, Per Nordlöw wrote:
> On Saturday, 7 November 2020 at 17:49:54 UTC, Jacob Carlborg wrote:
>> [1] https://en.wikipedia.org/wiki/UTF-16#U+D800_to_U+DFFF
>
> Thanks! I'm only using these UTF characters to create ranges that source code characters are checked against during parsing. Therefore I would like to just convert these to a `dchar` for now using a `cast`. Can I just do, for instance, cast(dchar)0x0000d800 for `\U0000d800` to accomplish this?

Yes, use the cast. It should work. It's just the D grammar that is stopping you; a dchar is just an integer under the hood, so the cast should be fine.

-Steve
Nov 08 2020
On Sunday, 8 November 2020 at 10:47:34 UTC, Per Nordlöw wrote:
> Can I just do, for instance, cast(dchar)0x0000d800 for `\U0000d800` to accomplish this?

There's also:

    dchar(0x0000d800)
Nov 09 2020
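The two spellings suggested in the thread, `cast(dchar)` and `dchar(...)`, can be sketched side by side as follows. `inRange` is a hypothetical stand-in for the `rng` combinator, whose real definition is not shown in the thread, and the `enum` names are made up for illustration.

```d
/// Hypothetical stand-in for `rng`: inclusive range check on one code unit.
bool inRange(dchar c, dchar lo, dchar hi) pure nothrow @safe @nogc
{
    return lo <= c && c <= hi;
}

// Both forms build the value from an integer, so no character literal
// has to pass the compiler's validity check.
enum dchar hiFirst = cast(dchar) 0xD800;
enum dchar hiLast  = dchar(0xDBFF);
enum dchar loFirst = cast(dchar) 0xDC00;
enum dchar loLast  = dchar(0xDFFF);

// The two spellings produce the same value.
static assert(cast(dchar) 0xD800 == dchar(0xD800));

void main()
{
    // U+10000 is encoded in UTF-16 as the pair D800 DC00.
    wstring s = "\U00010000"w;
    assert(inRange(s[0], hiFirst, hiLast));
    assert(inRange(s[1], loFirst, loLast));

    // An ASCII letter falls outside the whole surrogate block.
    assert(!inRange('a', hiFirst, loLast));
}
```

Note that the literal needs exactly four hex digits after the leading zeros; a value with an extra trailing zero, such as 0xD8000, is a different code point.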
On Monday, 9 November 2020 at 16:39:49 UTC, Boris Carvajal wrote:
> There's also: dchar(0x0000d800)

Thanks
Nov 12 2020