digitalmars.D - String Literal Docs
- Alix Pexton (32/32) Jun 19 2010 I've been sketching some grammar diagrams for D2.0, a little like those
- Ellery Newcomer (8/41) Jun 19 2010 http://d.puremagic.com/issues/show_bug.cgi?id=4351
- div0 (5/37) Jun 19 2010 Hex strings are specifically exempted from the requirement for valid utf...
- Ellery Newcomer (6/44) Jun 19 2010 All I can say is
- div0 (6/11) Jun 19 2010 Then you've found a bug, you know what to do:
- Alix Pexton (15/28) Jun 20 2010 Hmn, that would seem to indicate to me that the postfix is being allowed...
- div0 (12/43) Jun 20 2010 It says multiple of 2, not even number of digits. To me that implies
- Nick Sabalausky (5/11) Jun 20 2010 "multiple of 2" == "even number"
- div0 (10/16) Jun 20 2010 I also said 'To me that implies'. Please don't take what I said out of
- Nick Sabalausky (6/18) Jun 20 2010 That wan't my intent, sorry if it came across that way. It sounded to me...
- Alix Pexton (8/28) Jun 20 2010 From looking at the source, I now know that all string literals can
- div0 (12/32) Jun 22 2010 What I was getting at is that if you use the w suffix, then surely you
- Alix Pexton (6/11) Jun 20 2010 Bug 2734 is the underscores in floats issue.
- Alix Pexton (9/12) Jun 20 2010 I think I will take the plunge and base my diagrams on the source of
- Ellery Newcomer (5/18) Jun 20 2010 Do share. I've always been too lazy to read lexer.c, and from this
- Alix Pexton (3/23) Jun 20 2010 of course ^^
- Alix Pexton (17/42) Jun 21 2010 Well, I think I have got my head around lexer.c now, and its various
- Ellery Newcomer (7/55) Jun 21 2010 to hell with lexer.c. I'm not changing anything.
- Alix Pexton (7/10) Jun 21 2010 So far I have only covered the lexer, but most of it needs redoing in
- Justin Spahr-Summers (6/9) Jun 22 2010 I can't speak for Alix, but I would absolutely be interested. I'm
I've been sketching some grammar diagrams for D2.0, a little like those on JSON.org, and of course I didn't get far before I ran into something odd.

In the section of www.digitalmars.com/d/2.0/lex.html on string literals, the productions imply that the [c|w|d] "postfix" is allowed on Wysiwyg, DoubleQuote and Hex strings and not on either Delimited or Token strings, which didn't make a lot of sense to me, so I tested it with DMD (v2.046, win)...

---
import std.stdio;

void main(){
    auto t1 = "double quote"d;  // OK
    auto t2 = `back tick`d;     // OK
    auto t3 = x"dead beef";     // postfix not allowed on hexstrings!
    auto t4 = q"<delimited/>"d; // OK
    auto t5 = q{if}d;           // OK
    writefln("all literals A-OK!");
}
---

This makes sense to me, HexStrings with wide chars would have made my brain scream ><

So, to correct the documentation, the "postfix" needs to be removed from HexString and added to DelimitedString and TokenString. I tried to see if this was already reported in the bug tracker but couldn't see anything close.

On a slightly quieter note, there is also a spare underscore in the definition of HexidecimalDigit as it "extends" DecimalDigit which already has an underscore.

I also noticed a bug in the tracker related to initial underscores in float literals, if the diagrams start getting too puzzling I might look into that ^^

A...

PS, my copy of tDPL is in the post, yay!
Jun 19 2010
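For anyone who wants to repeat the check without relying on run-time output, the effect of the postfix can be verified at compile time. The sketch below is not from the thread: it assumes a D2 compiler that accepts the postfix on delimited and token strings, as the DMD 2.046 test above suggests, and the variable names are made up for illustration. Hex strings are left out on purpose, since their behaviour is the open question.

```d
// Sketch: the c/w/d postfix selects the code unit type of a string literal.
// Assumption: the compiler accepts the postfix on delimited and token
// strings, as reported for DMD 2.046 in the post above.
void main()
{
    auto narrow    = "double quote"c;   // UTF-8,  immutable(char)[]
    auto wide      = `back tick`w;      // UTF-16, immutable(wchar)[]
    auto delimited = q"<delimited/>"d;  // UTF-32, immutable(dchar)[]
    auto token     = q{if}d;            // UTF-32, immutable(dchar)[]

    static assert(is(typeof(narrow) == string));
    static assert(is(typeof(wide) == wstring));
    static assert(is(typeof(delimited) == dstring));
    static assert(is(typeof(token) == dstring));

    // The delimiters are not part of the literal's value.
    assert(delimited == "delimited/"d);
    assert(token == "if"d);
}
```

If any of the four literal kinds rejected the postfix, this would fail to compile rather than fail at run time, which makes it a convenient regression check for the grammar.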
On 06/19/2010 03:12 PM, Alix Pexton wrote:
> auto t3 = x"dead beef"; // postfix not allowed on hexstrings!
> [...]
> This makes sense to me, HexStrings with wide chars would have made my brain scream ><

http://d.puremagic.com/issues/show_bug.cgi?id=4351

but I'm not so sure about the hex string one. I think you just gave it invalid unicode. E.g., this compiles fine:

auto w = x"1e1d 1e1f"w;

on dmd 2.047, but what it results in is pretty screwy.

> I also noticed a bug in the tracker related to initial underscores in float literals, if the diagrams start getting to puzzling I might look into that ^^

What what?
Jun 19 2010
On 19/06/2010 22:16, Ellery Newcomer wrote:
> but I'm not so sure about the hex string one. I think you just gave it invalid unicode.

Hex strings are specifically exempted from the requirement for valid UTF.

--
My enormous talent is exceeded only by my outrageous laziness.
http://www.ssTk.co.uk
Jun 19 2010
On 06/19/2010 04:26 PM, div0 wrote:
> Hex strings are specifically exempted from the requirement for valid utf.

All I can say is

auto w = x"dead beef"w;

results in

Error: invalid UTF-8 sequence

on dmd 2.047
Jun 19 2010
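The two results reported so far (x"1e1d 1e1f"w compiles, x"dead beef"w does not) are at least consistent with the hex bytes being checked as UTF-8. The sketch below illustrates that reading only. It uses Phobos' `std.utf.validate` rather than anything from DMD itself, and `isValidUtf8` is a hypothetical helper name.

```d
import std.utf : validate, UTFException;

/// Returns true if `bytes` form a well-formed UTF-8 sequence.
/// Hypothetical helper; it models the check, it is not the compiler's code.
bool isValidUtf8(const(ubyte)[] bytes)
{
    try
    {
        validate(cast(const(char)[]) bytes);
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

void main()
{
    ubyte[] accepted = [0x1E, 0x1D, 0x1E, 0x1F]; // bytes of x"1e1d 1e1f"
    ubyte[] rejected = [0xDE, 0xAD, 0xBE, 0xEF]; // bytes of x"dead beef"

    // Four ASCII control characters: valid UTF-8.
    assert(isValidUtf8(accepted));

    // 0xDE 0xAD is a valid two-byte sequence, but 0xBE is a continuation
    // byte and cannot start a new sequence, so the whole run is rejected.
    assert(!isValidUtf8(rejected));
}
```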
On 19/06/2010 23:17, Ellery Newcomer wrote:
> All I can say is
> auto w = x"dead beef"w;
> results in
> Error: invalid UTF-8 sequence
> on dmd 2.047

Then you've found a bug, you know what to do:

http://d.puremagic.com/issues/
Jun 19 2010
On 20/06/2010 01:09, div0 wrote:
> On 19/06/2010 23:17, Ellery Newcomer wrote:
>> All I can say is
>> auto w = x"dead beef"w;
>> results in
>> Error: invalid UTF-8 sequence
>> on dmd 2.047
> Then you've found a bug, you know what to do:
> http://d.puremagic.com/issues/

Hmn, that would seem to indicate to me that the postfix is being allowed when the hex represents a valid UTF sequence, but not otherwise. I didn't do too much testing myself as I know next to zilch about string internals ><

The text that describes hex strings says that they have to have an even number of digits, but this would seem to imply that they have to have a multiple of 4 or 8 for wstrings and dstrings respectively, which makes sense, but I'm not sure that can be verified in the lexing of a string literal without insane lookahead rules ><

But, then I guess that is why the spec says that hex strings are exempt from the valid UTF rule, and in that case hexstrings should really make byte arrays rather than strings, but failing that, always chars and not anything wider.

A...
Jun 20 2010
On 20/06/2010 11:03, Alix Pexton wrote:
> The text that describes hex strings says that they have to have an even number of digits, but this would seem to imply that they have to have a multiple of 4 or 8 for wstrings and dstrings respectively, which makes sense, but I'm not sure that can be verified in the lexing of a string literal without insane lookahead rules ><

It says multiple of 2, not even number of digits. To me that implies it's always 2 and the suffix acceptance is just a bug. It could be made more clear though.

> But, then I guess that is why the spec says that hex strings are exempt from the valid UTF rule, and in that case hexstrings should really make byte arrays rather than strings, but failing that, always chars and not anything wider.

Yeah, hex strings should probably have the type ubyte[]. If you're using them to put arbitrary binary in your program you're almost certainly going to cast the array to something else anyway, so char[], wchar[], dchar[] all seem a bit pointless, and as they allow invalid UTF, making them ?char[] seems wrong.
Jun 20 2010
"div0" <div0 users.sourceforge.net> wrote in message news:hvkrsc$2r5c$1 digitalmars.com...It says multiple of 2, not even number of digits."multiple of 2" == "even number" "Even" as in "even vs odd"Yeah, hex strings should probably have the type ubyte[] If you using them to put arbitrary binary in your program you're almost certainly going to cast the array to something else anyway, so char[], wchar[], dchar[] all seem a bit pointless and as they allow invalid utf, making them ?char[] seems wrong.You have me completely convinced.
Jun 20 2010
On 20/06/2010 18:55, Nick Sabalausky wrote:
> "div0" <div0 users.sourceforge.net> wrote in message news:hvkrsc$2r5c$1 digitalmars.com...
>> It says multiple of 2, not even number of digits.
> "multiple of 2" == "even number"
> "Even" as in "even vs odd"

I also said 'To me that implies'. Please don't take what I said out of context and be a smart arse about it. There's more than enough of that goes on round here.

I read the spec. as specifying that the hex characters should be in groups of 2, I also take it as implying that the suffixes are not applicable. You're more than welcome to your own take on it.
Jun 20 2010
"div0" <div0 users.sourceforge.net> wrote in message news:hvlok6$1rfu$1 digitalmars.com...On 20/06/2010 18:55, Nick Sabalausky wrote:That wan't my intent, sorry if it came across that way. It sounded to me like you were implying there was a difference between "multiple of 2" and "even number". If that wasn't the case, then I guess I'm just not sure what you were really getting at."div0"<div0 users.sourceforge.net> wrote in message news:hvkrsc$2r5c$1 digitalmars.com...I also said 'To me that implies'. Please don't take what I said out of context and be a smart arse about it. There's more than enough of that goes on round here.It says multiple of 2, not even number of digits."multiple of 2" == "even number" "Even" as in "even vs odd"
Jun 20 2010
On 20/06/2010 20:14, Nick Sabalausky wrote:
> If that wasn't the case, then I guess I'm just not sure what you were really getting at.

From looking at the source, I now know that all string literals can have a postfix, and that as far as lexing goes, all strings are in UTF8. I've not tracked down yet where the value of the postfix is applied, but I'm fairly certain that it would be easy enough to turn off the UTF verification for the hexstrings at that end.

As far as making my diagrams, I don't think it matters, for now...

A...
Jun 20 2010
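If every literal is held as UTF-8 during lexing and the postfix is applied afterwards, as described above, then the postfix amounts to a transcoding step. The sketch below illustrates that idea with Phobos' `std.utf.toUTF16` and `std.utf.toUTF32`. It is a model of the behaviour, not DMD's internal code, and the last assertion is only what that model would predict for the x"1e1d 1e1f"w case mentioned earlier in the thread.

```d
import std.utf : toUTF16, toUTF32;

void main()
{
    string  utf8  = "\u00e9";   // U+00E9 takes two UTF-8 code units
    wstring utf16 = "\u00e9"w;  // one UTF-16 code unit
    dstring utf32 = "\u00e9"d;  // one UTF-32 code unit

    assert(utf8.length == 2);
    assert(utf16.length == 1);
    assert(utf32.length == 1);

    // Transcoding the UTF-8 form gives the same value as the postfix.
    assert(toUTF16(utf8) == utf16);
    assert(toUTF32(utf8) == utf32);

    // Under this model, four ASCII bytes become four wchars,
    // not two wchars holding 0x1E1D and 0x1E1F.
    assert(toUTF16("\x1e\x1d\x1e\x1f").length == 4);
}
```

That would be one explanation for a result that looks "pretty screwy" to someone expecting the hex digits to be packed directly into 16-bit units.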
On 20/06/2010 20:14, Nick Sabalausky wrote:
> It sounded to me like you were implying there was a difference between "multiple of 2" and "even number". If that wasn't the case, then I guess I'm just not sure what you were really getting at.

What I was getting at is that if you use the w suffix, then surely you would expect the number of hex digits to be a multiple of 4 not 2. If there are only 6 digits what then? Are the missing ones inferred to be 0, is it a compile error, or something else?

Because of the use of the 2, I inferred from the spec that the suffixes were not supposed to be allowed. If it had said even number of digits, I'd have been more inclined to think that the suffixes are legal.

Either way it just highlights that the spec isn't sufficiently clear.
Jun 22 2010
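The arithmetic behind the "multiple of 2 versus multiple of 4" question can be made concrete. The sketch below is a stand-alone model, not the compiler's lexer: `hexValue`, `decodeHex` and `fillsCodeUnits` are hypothetical helpers, and what the compiler should do with a leftover byte is exactly the point the spec leaves open.

```d
/// Converts one hex digit to its value; asserts on anything else.
ubyte hexValue(char c)
{
    if (c >= '0' && c <= '9') return cast(ubyte)(c - '0');
    if (c >= 'a' && c <= 'f') return cast(ubyte)(c - 'a' + 10);
    if (c >= 'A' && c <= 'F') return cast(ubyte)(c - 'A' + 10);
    assert(0, "not a hex digit");
}

/// Decodes the body of a hex string (hex digits and spaces) into bytes.
/// Two digits make one byte, so the digit count must be even.
ubyte[] decodeHex(string text)
{
    ubyte[] bytes;
    bool haveHigh = false;
    ubyte high;
    foreach (char c; text)
    {
        if (c == ' ')
            continue;
        if (!haveHigh)
        {
            high = hexValue(c);
            haveHigh = true;
        }
        else
        {
            bytes ~= cast(ubyte)((high << 4) | hexValue(c));
            haveHigh = false;
        }
    }
    assert(!haveHigh, "odd number of hex digits");
    return bytes;
}

/// True if byteCount bytes fill a whole number of code units of type Char.
bool fillsCodeUnits(Char)(size_t byteCount)
{
    return byteCount % Char.sizeof == 0;
}

void main()
{
    ubyte[] expected = [0xDE, 0xAD, 0xBE, 0xEF];
    auto eight = decodeHex("dead beef"); // 8 digits, 4 bytes
    auto six   = decodeHex("dead be");   // 6 digits, 3 bytes

    assert(eight == expected);
    assert(fillsCodeUnits!wchar(eight.length)); // 4 digits per wchar
    assert(fillsCodeUnits!dchar(eight.length)); // 8 digits per dchar
    assert(fillsCodeUnits!char(six.length));
    assert(!fillsCodeUnits!wchar(six.length));  // one byte left over
}
```

Six digits satisfy the "multiple of 2" rule but leave half a wchar unfilled, which is the case the question above is asking about.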
On 19/06/2010 22:16, Ellery Newcomer wrote:
> On 06/19/2010 03:12 PM, Alix Pexton wrote:
>> I also noticed a bug in the tracker related to initial underscores in float literals, if the diagrams start getting too puzzling I might look into that ^^
> What what?

Bug 2734 is the underscores in floats issue. Bug 949 also has a shed full of replacement grammar rules that fix escape sequences and some corner cases in floats (and probably some other stuff too!)

A...
Jun 20 2010
On 19/06/2010 21:12, Alix Pexton wrote:
> I've been sketching some grammar diagrams for D2.0, a little like those on JSON.org, and of course I didn't get far before I ran into something odd.

I think I will take the plunge and base my diagrams on the source of DMD. After looking at the code in lexer.c, it does not seem as far beyond my rusty old c++ parsing skills as I had expected! Massive credit to Walter for having a codebase that is as mature as DMD without it turning into a labyrinth of preprocessor macros and cryptic "comefrom"s.

This will mean however that my little project may take a little longer, sigh...

A...
Jun 20 2010
On 06/20/2010 03:01 PM, Alix Pexton wrote:
> I think I will take the plunge and base my diagrams on the source of DMD.
> [...]

Do share. I've always been too lazy to read lexer.c, and from this discussion, it sounds like there are a few spots where my own lexer grammar is incorrect (or at least differs from dmd).
Jun 20 2010
On 20/06/2010 21:37, Ellery Newcomer wrote:
> Do share. I've always been too lazy to read lexer.c, and from this discussion, it sounds like there are a few spots where my own lexer grammar is incorrect (or at least differs from dmd).

of course ^^

A...
Jun 20 2010
On 20/06/2010 22:46, Alix Pexton wrote:
> On 20/06/2010 21:37, Ellery Newcomer wrote:
>> Do share.
>> [...]
> of course ^^

Well, I think I have got my head around lexer.c now, and its various peculiarities, like "000377." being a valid float (although not according to my shiny new, limited edition copy of tDPL (fig2.2 p35)^^).

The weirdness occurs because some of the corner cases are handled not by the neat little state machine that validates reals, but in the scanner at the point where it recognises a number beginning with a zero. The productions in lex.html represent the range of inputs that are accepted by the state machine without taking into account that the scanner rejects the sequence "._" (which makes sense as that is the identifier "_" in the outer scope).

Andrei's analysis in tDPL also points out that 0xp0 is a valid hexfloat, but a strict reading of lex.html would not allow it.

Overall the diagram for hexfloat is much simpler than the one for decimalfloat, which I think will have to be split into 3 ><

A...

PS, octal must die!
Jun 21 2010
On 06/21/2010 02:21 PM, Alix Pexton wrote:
> Well, I think I have got my head around lexer.c now, and its various peculiarities, like "000377." being a valid float (although not according to my shiny new, limited edition copy of tDPL (fig2.2 p35)^^).

Oh wow. That's a sweet little diagram. Those dots are hard to see though.

> The weirdness occurs because some of the corner cases are handled not by the neat little state machine that validates reals, but in the scanner at the point where it recognises a number beginning with a zero. The productions in lex.html represent the range of inputs that are accepted by the state machine without taking into account that the scanner rejects the sequence "._" (which makes sense as that is the identifier "_" in the outer scope).

to hell with lexer.c. I'm not changing anything.

> Andrei's analysis in tDPL also points out that 0xp0 is a valid hexfloat, but a strict reading of lex.html would not allow it.
>
> Overall the diagram for hexfloat is much simpler than the one for decimalfloat, which I think will have to be split into 3 ><
>
> A...
>
> PS, octal must die!

I'll settle for modified syntax 0c123.

But yeah. Are your diagrams solely concerned with the lexer? Because I have a (messy) parser grammar which I'm a bit more confident about if you're interested.
Jun 21 2010
On 21/06/2010 21:20, Ellery Newcomer wrote:
> Are your diagrams solely concerned with the lexer? Because I have a (messy) parser grammar which I'm a bit more confident about if you're interested.

So far I have only covered the lexer, and most of it needs redoing in light of the errors in the DMD docs, but I am hoping to cover the whole spec, eventually...

The more I do the quicker I'm able to make them as my workflow evolves, so it's hard to say how long it will take...

A...
Jun 21 2010
On Mon, 21 Jun 2010 15:20:16 -0500, Ellery Newcomer <ellery- newcomer utulsa.edu> wrote:
> Are your diagrams solely concerned with the lexer? Because I have a (messy) parser grammar which I'm a bit more confident about if you're interested.

I can't speak for Alix, but I would absolutely be interested. I'm working on an "Objective-D" preprocessor and my parsing still has lots of holes, even besides the stuff I have marked to-do. A strict reading of the website has already turned up a few inaccuracies.
Jun 22 2010