digitalmars.D - [dox] Fixing the lexical rule for BinaryInteger
- Andre Artus (23/28) Aug 16 2013 The nonterminal BinaryDigits, does not exist.
- Brian Schott (4/4) Aug 16 2013 I've been doing some work with the language grammar
- Andre Artus (41/45) Aug 16 2013 You have done impressive work on your grammar; I just have some
- Brian Schott (15/61) Aug 16 2013 I'm aware of that. If you're able to get ANTLR to actually
- Andre Artus (6/11) Aug 16 2013 I forked just under an hour ago, am I on old bits?
- H. S. Teoh (40/51) Aug 16 2013 [...]
- Andre Artus (25/92) Aug 16 2013 Yup, that's the issue. Coding the actual behaviour by hand, or
- H. S. Teoh (24/126) Aug 17 2013 I didn't mean to push it up to the parser. I was just using BNF to show
- Andre Artus (36/177) Aug 17 2013 I agree with you, Brian, all three of these constructions go
- H. S. Teoh (23/71) Aug 17 2013 You're right, I think the D specs page on literals using BNF is a bit of
- Andre Artus (41/117) Aug 17 2013 I would not mind doing this, I'll see what Walter says.
- Andre Artus (18/18) Aug 17 2013 [...]
- H. S. Teoh (11/32) Aug 16 2013 Regex equivalent:
- Andre Artus (12/16) Aug 16 2013 I have fixed up a few issues in DGrammar.g4, I will put them up
- Brian Schott (4/22) Aug 16 2013 I must have missed those when I pulled the grammar out of my
The documentation on the lexical rules for BinaryInteger (http://dlang.org/lex.html#BinaryInteger) has a few issues:

    BinaryInteger:
        BinPrefix BinaryDigits

The nonterminal BinaryDigits does not exist.

    BinaryDigitsUS:
        BinaryDigitUS
        BinaryDigitUS BinaryDigitsUS

The construction for BinaryDigitsUS currently allows for the following: _(_)*, e.g. 0b_, 0b__, 0b___ etc., which is clearly not allowed by the compiler.

I have put up a change on GitHub [1], but there is a clear problem. The DMD compiler allows for any of the following (reduced cases):

a. 0b__1
b. 0b_1_
c. 0b1__

Whereas my change disallows the second case (b), but is in line with how the other integers are specified.

This is a specification problem (limitation of BNF), not an implementation problem. In plain English one would just say that the BinaryDigitsUS sequence should contain at least one BinaryDigit character.

I'm busy working on the HexadecimalInteger, which has related issues.

1. https://github.com/andre-artus/dlang.org/blob/LexBinaryDigit/lex.dd
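The plain-English rule above (any mix of digits and underscores after the prefix, with at least one BinaryDigit) is easy to state as a regular expression. The following is a minimal sketch in Python, assuming DMD behaves exactly as described in this post. The names `DMD_BINARY` and `accepted` are made up for illustration, and only the lowercase `0b` prefix used in the examples is handled.

```python
import re

# Assumed behaviour, taken from the post: after "0b" any mix of binary digits
# and underscores is allowed, as long as at least one digit is present.
DMD_BINARY = re.compile(r"0b[01_]*[01][01_]*")


def accepted(literal: str) -> bool:
    """Return True if the whole string is a binary literal under that rule."""
    return DMD_BINARY.fullmatch(literal) is not None


# The three reduced cases from the post, followed by underscore-only inputs.
for text in ["0b__1", "0b_1_", "0b1__", "0b_", "0b__", "0b"]:
    print(f"{text:8} {'accepted' if accepted(text) else 'rejected'}")
```

Under this rule cases a, b and c are all accepted, while 0b_, 0b__ and a bare 0b are rejected.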
Aug 16 2013
I've been doing some work with the language grammar specification. You may find these resources useful: http://d.puremagic.com/issues/show_bug.cgi?id=10233 https://github.com/Hackerpilot/DGrammar/blob/master/D.g4
Aug 16 2013
On Friday, 16 August 2013 at 20:00:35 UTC, Brian Schott wrote:
> I've been doing some work with the language grammar specification. You may find these resources useful:
> http://d.puremagic.com/issues/show_bug.cgi?id=10233
> https://github.com/Hackerpilot/DGrammar/blob/master/D.g4

You have done impressive work on your grammar; I just have some small issues.

1. I run into a number of errors trying to generate the Java code; I'm using ANTLR 4.1.
2. Your BinaryInteger and HexadecimalInteger only allow for one of the following (reduced) cases:
   0b1__ : works
   0b_1_ : fails
   0b__1 : fails
   Same with HexadecimalInteger.
3. The imports don't allow for all cases.
4. How are you handling the scope attribute specifier in the "attribute ':'" case, e.g. "public:"?

There seem to be a few more places where it diverges a bit from what the compiler currently accepts. I'm not arguing for the wisdom of writing code as I am about to show, but the following compiles with the current release build of DMD, yet may not parse with DGrammar; it will quite likely balk in the scanner:

    module main;

    public:
    static:
    import std.stdio;

    int main(string[] argv)
    {
        auto myBin = 0b0011_1101;
        writefln("%1$x\t%1$.8b\t%1$s", myBin);
        auto myBin2 = 0b_______1;
        writefln("%1$x\t%1$.8b\t%1$s", myBin2);
        auto myBin3 = 0b____1___;
        writefln("%1$x\t%1$.8b\t%1$s", myBin3);
        auto myHex1 = 0x1__;
        writefln("%1$x\t%1$.8b\t%1$s", myHex1);
        auto myHex2 = 0x_1_;
        writefln("%1$x\t%1$.8b\t%1$s", myHex2);
        auto myHex3 = 0x__1;
        writefln("%1$x\t%1$.8b\t%1$s", myHex3);
        return 0;
    }
Aug 16 2013
On Friday, 16 August 2013 at 22:43:13 UTC, Andre Artus wrote:
> On Friday, 16 August 2013 at 20:00:35 UTC, Brian Schott wrote:
>> I've been doing some work with the language grammar specification. You may find these resources useful:
>> http://d.puremagic.com/issues/show_bug.cgi?id=10233
>> https://github.com/Hackerpilot/DGrammar/blob/master/D.g4
>
> You have done impressive work on your grammar; I just have some small issues.
> 1. I run into a number of errors trying to generate the Java code, I'm using ANTLR 4.1

I'm aware of that. If you're able to get ANTLR to actually produce a working parser for D I'd be happy to merge your pull request. I haven't been able to get any parser generators to work for D.

> 2. Your BinaryInteger and HexadecimalInteger only allow for one of the following (reduced) cases:
> 0b1__ : works
> 0b_1_ : fails
> 0b__1 : fails

It's my opinion that the compiler should reject all of these because I think of the underscore as a separator between digits, but I'm constantly fighting the "spec, dmd, and idiom all disagree" issue.

> Same with HexadecimalInteger.
> 3. The imports don't allow for all cases.

https://github.com/Hackerpilot/DGrammar/issues

> 4. how are you handling the scope attribute specifier in the "attribute ':'" case, e.g. "public:"?
> There seems to be a few more places where it diverges a bit from what the compiler currently accepts. I'm not arguing for the wisdom of writing code as I am about to show, but the following compiles with the current release build of DMD, but may not parse with DGrammar, quite likely balk in the scanner:
> [...]

I wrote that grammar as part of my work on DCD and DScanner. My lexer, parser, and AST library need some more testing. Please download DScanner and run it with either the --ast or --syntaxCheck options. If you find issues, please report them on Github.
Aug 16 2013
-- SNIP --
> I wrote that grammar as part of my work on DCD and DScanner. My lexer, parser, and AST library need some more testing. Please download DScanner and run it with either the --ast or --syntaxCheck options. If you find issues, please report them on Github.

I forked just under an hour ago, am I on old bits? I have fixed all but one of the build issues; I would like to fix the last one before I commit, as I don't like to leave my repos in a broken state. I'll continue the discussion on GitHub.
Aug 16 2013
On Sat, Aug 17, 2013 at 01:03:35AM +0200, Brian Schott wrote:
> On Friday, 16 August 2013 at 22:43:13 UTC, Andre Artus wrote:
> [...]
>> 2. Your BinaryInteger and HexadecimalInteger only allow for one of the following (reduced) cases:
>> 0b1__ : works
>> 0b_1_ : fails
>> 0b__1 : fails
>
> It's my opinion that the compiler should reject all of these because I think of the underscore as a separator between digits, but I'm constantly fighting the "spec, dmd, and idiom all disagree" issue.
[...]

I remember reading this part of the spec on dlang.org, and I wonder if it was worded the way it is just for simplicity, because to specify something like "_ must appear between digits" involves some complicated BNF rules, which maybe seems like overkill for a single literal.

But sometimes it is good to be precise, if we want to enforce "proper" conventions for underscores:

    <binaryLiteral> ::= "0b" <binaryDigits> <underscoreBinaryDigits>

    <binaryDigits> ::= <binaryDigit> <binaryDigits>
                     | <binaryDigit>

    <underscoreBinaryDigits> ::= ""
                               | "_" <binaryDigits>
                               | "_" <binaryDigits> <underscoreBinaryDigits>

    <binaryDigit> ::= "0"
                    | "1"

This BNF spec forces "_" to only appear between two binary digits, and never more than a single _ in a row.

You can also make your parser only pick up <binaryDigit> when performing semantic on binary literals, so the other stuff is ignored and only serves to enforce syntax.

I'd be surprised if there's any D code out there that doesn't fit this spec, to be honest.

But if you want to accept "strange" literals like 0b__1__, you could do something like:

    <binaryLiteral> ::= "0b" <underscoreBinaryDigits> <binaryDigit> <underscoreBinaryDigits>

    <underscoreBinaryDigits> ::= "_"
                               | "_" <underscoreBinaryDigits>
                               | <binaryDigit>
                               | <binaryDigit> <underscoreBinaryDigits>
                               | ""

    <binaryDigit> ::= "0"
                    | "1"

The odd form of the rule for <binaryLiteral> is to ensure that there's at least one binary digit in the string, whereas <underscoreBinaryDigits> is just a wildcard anything-goes rule that takes any combination of 0, 1, and _, including the empty string.

T

-- 
That's not a bug; that's a feature!
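The first (strict) BNF in this post translates almost mechanically into a small hand-written recognizer, one function per nonterminal. The following is a sketch in Python; the function names are invented here for illustration and do not come from DMD, DScanner or any other D tool.

```python
def parse_binary_digits(s: str, i: int) -> int:
    """<binaryDigits>: one or more of "0" / "1".

    Returns the index just past the digits, or -1 if there is no digit at i.
    """
    start = i
    while i < len(s) and s[i] in "01":
        i += 1
    return i if i > start else -1


def parse_underscore_binary_digits(s: str, i: int) -> int:
    """<underscoreBinaryDigits>: zero or more repetitions of "_" <binaryDigits>.

    Returns the index just past the last group, or -1 if an underscore is not
    followed by at least one digit.
    """
    while i < len(s) and s[i] == "_":
        i = parse_binary_digits(s, i + 1)
        if i < 0:
            return -1
    return i


def is_strict_binary_literal(s: str) -> bool:
    """<binaryLiteral>: "0b" <binaryDigits> <underscoreBinaryDigits>."""
    if not s.startswith("0b"):
        return False
    i = parse_binary_digits(s, 2)
    if i < 0:
        return False
    i = parse_underscore_binary_digits(s, i)
    return i == len(s)  # the whole string must be consumed


for text in ["0b0011_1101", "0b1_0_1", "0b1__", "0b_1_", "0b__1"]:
    print(text, is_strict_binary_literal(text))
```

With this recognizer only literals whose underscores sit singly between two digits pass, so all three reduced cases from earlier in the thread are rejected.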
Aug 16 2013
On Saturday, 17 August 2013 at 00:51:57 UTC, H. S. Teoh wrote:
> On Sat, Aug 17, 2013 at 01:03:35AM +0200, Brian Schott wrote:
>> On Friday, 16 August 2013 at 22:43:13 UTC, Andre Artus wrote:
>> [...]
>>> 2. Your BinaryInteger and HexadecimalInteger only allow for one of the following (reduced) cases:
>>> 0b1__ : works
>>> 0b_1_ : fails
>>> 0b__1 : fails
>>
>> It's my opinion that the compiler should reject all of these because I think of the underscore as a separator between digits, but I'm constantly fighting the "spec, dmd, and idiom all disagree" issue.
> [...]
> I remember reading this part of the spec on dlang.org, and I wonder if it was worded the way it is just for simplicity, because to specify something like "_ must appear between digits" involves some complicated BNF rules, which maybe seems like overkill for a single literal.
>
> But sometimes it is good to be precise, if we want to enforce "proper" conventions for underscores:
>
> <binaryLiteral> ::= "0b" <binaryDigits> <underscoreBinaryDigits>
> <binaryDigits> ::= <binaryDigit> <binaryDigits> | <binaryDigit>
> <underscoreBinaryDigits> ::= "" | "_" <binaryDigits> | "_" <binaryDigits> <underscoreBinaryDigits>
> <binaryDigit> ::= "0" | "1"
>
> This BNF spec forces "_" to only appear between two binary digits, and never more than a single _ in a row.

Yup, that's the issue. Coding the actual behaviour by hand, or doing it with a regular expression, is close to trivial.

> You can also make your parser only pick up <binaryDigit> when performing semantic on binary literals, so the other stuff is ignored and only serves to enforce syntax.

Pushing it up to the parser is an option in implementation, but I don't see that making the specification easier (it's 3:40 in the morning here, so I am very likely not thinking too clearly about this).

> I'd be surprised if there's any D code out there that doesn't fit this spec, to be honest.

It's not what I would call best practice, but the following is possible in the current compiler:

    auto myBin1 = 0b0011_1101; // Sane
    auto myBin2 = 0b_______1;  // Trouble, myBin2 == 1
    auto myBin3 = 0b____1___;  // Trouble, myBin3 == 1

Which means that tools built against the documented spec are going to choke on these weird cases. Personally I would prefer if the more questionable options were not allowed, as they potentially defeat the goal of improving clarity. But that's a breaking change.

> But if you want to accept "strange" literals like 0b__1__, you could do something like:
>
> <binaryLiteral> ::= "0b" <underscoreBinaryDigits> <binaryDigit> <underscoreBinaryDigits>
> <underscoreBinaryDigits> ::= "_" | "_" <underscoreBinaryDigits> | <binaryDigit> | <binaryDigit> <underscoreBinaryDigits> | ""
> <binaryDigit> ::= "0" | "1"
>
> The odd form of the rule for <binaryLiteral> is to ensure that there's at least one binary digit in the string, whereas <underscoreBinaryDigits> is just a wildcard anything-goes rule that takes any combination of 0, 1, and _, including the empty string.

The rule that matches the DMD compiler is actually very easy to do in ANTLR4, i.e.

    BinaryLiteral : '0b' [_01]* [01] [_01]* ;

I'm a bit too tired to fully pay attention, but it seems you are saying that "0b" (no additional numbers) should match, which I believe it should not (although I admit to not testing this). If it does then I would consider that a bug.

It's not a problem implementing the rule; I am more concerned with documenting it in a clear and unambiguous way so that people building tools from it can get it right. BNF isn't always the easiest way to do so, but it's what is being used.
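To illustrate the remark that coding the actual behaviour by hand is close to trivial, here is a sketch of a hand-written scanner step in Python. It assumes the permissive behaviour described in this thread (underscores anywhere after the prefix, at least one digit required); `scan_binary_literal` is a hypothetical name, not a function from DMD or DScanner.

```python
from typing import Optional, Tuple


def scan_binary_literal(src: str, pos: int = 0) -> Optional[Tuple[int, int]]:
    """Scan a binary literal starting at src[pos].

    Returns (value, end_index), or None if no valid literal starts at pos.
    Underscores are skipped; at least one binary digit is required.
    """
    if not src.startswith("0b", pos):
        return None
    i = pos + 2
    value = 0
    seen_digit = False
    while i < len(src) and src[i] in "01_":
        if src[i] != "_":
            value = value * 2 + int(src[i])
            seen_digit = True
        i += 1
    if not seen_digit:
        return None  # "0b", "0b_", "0b__", ... are not literals
    return value, i


for text in ["0b0011_1101", "0b_______1", "0b____1___", "0b_", "0b"]:
    print(text, scan_binary_literal(text))
```

Because underscores never contribute to the value, 0b_______1 and 0b____1___ both scan as 1, which matches the comments in the D snippet above.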
Aug 16 2013
On Sat, Aug 17, 2013 at 04:02:40AM +0200, Andre Artus wrote:
> On Saturday, 17 August 2013 at 00:51:57 UTC, H. S. Teoh wrote:
>> On Sat, Aug 17, 2013 at 01:03:35AM +0200, Brian Schott wrote:
>>> On Friday, 16 August 2013 at 22:43:13 UTC, Andre Artus wrote:
>>> [...]
>>>> 2. Your BinaryInteger and HexadecimalInteger only allow for one of the following (reduced) cases:
>>>> 0b1__ : works
>>>> 0b_1_ : fails
>>>> 0b__1 : fails
>>>
>>> It's my opinion that the compiler should reject all of these because I think of the underscore as a separator between digits, but I'm constantly fighting the "spec, dmd, and idiom all disagree" issue.
>> [...]
>> I remember reading this part of the spec on dlang.org, and I wonder if it was worded the way it is just for simplicity, because to specify something like "_ must appear between digits" involves some complicated BNF rules, which maybe seems like overkill for a single literal.
>>
>> But sometimes it is good to be precise, if we want to enforce "proper" conventions for underscores:
>>
>> <binaryLiteral> ::= "0b" <binaryDigits> <underscoreBinaryDigits>
>> <binaryDigits> ::= <binaryDigit> <binaryDigits> | <binaryDigit>
>> <underscoreBinaryDigits> ::= "" | "_" <binaryDigits> | "_" <binaryDigits> <underscoreBinaryDigits>
>> <binaryDigit> ::= "0" | "1"
>>
>> This BNF spec forces "_" to only appear between two binary digits, and never more than a single _ in a row.
>
> Yup, that's the issue. Coding the actual behaviour by hand, or doing it with a regular expression, is close to trivial.
>
>> You can also make your parser only pick up <binaryDigit> when performing semantic on binary literals, so the other stuff is ignored and only serves to enforce syntax.
>
> Pushing it up to the parser is an option in implementation, but I don't see that making the specification easier (it's 3:40 in the morning here, so I am very likely not thinking too clearly about this).

I didn't mean to push it up to the parser. I was just using BNF to show that it's possible to specify the behaviour precisely. And also that it's rather convoluted just for something as intuitively straightforward as an integer literal. Which is a likely reason why the current specs are a bit blurry about what should/shouldn't be allowed.

>> I'd be surprised if there's any D code out there that doesn't fit this spec, to be honest.
>
> It's not what I would call best practice, but the following is possible in the current compiler:
>
> auto myBin1 = 0b0011_1101; // Sane
> auto myBin2 = 0b_______1; // Trouble, myBin2 == 1
> auto myBin3 = 0b____1___; // Trouble, myBin3 == 1
>
> Which means a tools built against the documented spec are going to choke on these weird cases. Personally I would prefer if the more questionable options were not allowed as they potentially defeat the goal of improving clarity. But, that's a breaking change.

I know that, but I'm saying that hardly *any* code would break if we made DMD reject things like this. I don't think anybody in their right mind would write code like that. (Unless they were competing in the IODCC... :-P)

The issue here is that when specs / DMD / TDPL don't agree, then it's not always clear which among the three are wrong. Perhaps *all* of them are wrong. Just because DMD accepts invalid code doesn't mean it should be part of the specs, for example. It could constitute a DMD bug.

>> But if you want to accept "strange" literals like 0b__1__, you could do something like:
>>
>> <binaryLiteral> ::= "0b" <underscoreBinaryDigits> <binaryDigit> <underscoreBinaryDigits>
>> <underscoreBinaryDigits> ::= "_" | "_" <underscoreBinaryDigits> | <binaryDigit> | <binaryDigit> <underscoreBinaryDigits> | ""
>> <binaryDigit> ::= "0" | "1"
>>
>> The odd form of the rule for <binaryLiteral> is to ensure that there's at least one binary digit in the string, whereas <underscoreBinaryDigits> is just a wildcard anything-goes rule that takes any combination of 0, 1, and _, including the empty string.
>
> The rule that matches the DMD compiler is actually very easy to do in ANTLR4, i.e.
>
> BinaryLiteral : '0b' [_01]* [01] [_01]* ;
>
> I'm a bit too tired to fully pay attention, but it seems you are saying that "0b" (no additional numbers) should match, which I believe it should not (although I admit to not testing this). If it does then I would consider that a bug.

No, the BNF rules I wrote are equivalent to your ANTLR4 spec. Which is equivalent to the regex I posted later.

> It's not a problem implementing the rule, I am more concerned with documenting it in a clear and unambiguous way so that people building tools from it can get it right. BNF isn't always the easiest way to do so, but it's what being used.

Well, you could bug Walter about what *should* be accepted, and if he agrees to restrict it to having _ only between two digits, then you'd file a bug against DMD.

Again, I seriously doubt that such a change would cause any code breakage, because writing 0b1 as 0b____1____ is just so ridiculous that any such code *should* be broken.

T

-- 
Prosperity breeds contempt, and poverty breeds consent. -- Suck.com
Aug 17 2013
Andre Artus wrote:
> 2. Your BinaryInteger and HexadecimalInteger only allow for one of the following (reduced) cases:
> 0b1__ : works
> 0b_1_ : fails
> 0b__1 : fails

I agree with you, Brian, all three of these constructions go contrary to the goal of making the code clearer. I would not be too surprised if a significant number of programmers would see those as different numbers, at least until they paused to take it in.

Brian Schott wrote:
> It's my opinion that the compiler should reject all of these because I think of the underscore as a separator between digits, but I'm constantly fighting the "spec, dmd, and idiom all disagree" issue.

I think that you are right.

[...]

H. S. Teoh wrote:
> I remember reading this part of the spec on dlang.org, and I wonder if it was worded the way it is just for simplicity, because to specify something like "_ must appear between digits" involves some complicated BNF rules, which maybe seems like overkill for a single literal.

H. S. Teoh wrote:
> But sometimes it is good to be precise, if we want to enforce "proper" conventions for underscores:
>
> <binaryLiteral> ::= "0b" <binaryDigits> <underscoreBinaryDigits>
> <binaryDigits> ::= <binaryDigit> <binaryDigits> | <binaryDigit>
> <underscoreBinaryDigits> ::= "" | "_" <binaryDigits> | "_" <binaryDigits> <underscoreBinaryDigits>
> <binaryDigit> ::= "0" | "1"
>
> This BNF spec forces "_" to only appear between two binary digits, and never more than a single _ in a row.

Andre Artus wrote:
> Yup, that's the issue. Coding the actual behaviour by hand, or doing it with a regular expression, is close to trivial.

H. S. Teoh wrote:
> You can also make your parser only pick up <binaryDigit> when performing semantic on binary literals, so the other stuff is ignored and only serves to enforce syntax.

Andre Artus wrote:
> Pushing it up to the parser is an option in implementation, but I don't see that making the specification easier (it's 3:40 in the morning here, so I am very likely not thinking too clearly about this).

H. S. Teoh wrote:
> I didn't mean to push it up to the parser.

Sorry I misunderstood, I had been up for over 21 hours at the time I wrote, so it was getting a bit difficult for me to concentrate. I got the impression you were saying that the parser would be responsible for extracting the binary digits.

H. S. Teoh wrote:
> I was just using BNF to show that it's possible to specify the behaviour precisely. And also that it's rather convoluted just for something as intuitively straightforward as an integer literal. Which is a likely reason why the current specs are a bit blurry about what should/shouldn't be allowed.

I don't think I've seen lexemes defined using (a variant of) BNF before; most often a form of regular expression is used. One could cut down and clarify the page describing the lexical syntax significantly by employing simple regular expressions.

H. S. Teoh wrote:
> I'd be surprised if there's any D code out there that doesn't fit this spec, to be honest.

Andre Artus wrote:
> It's not what I would call best practice, but the following is possible in the current compiler:
>
> auto myBin1 = 0b0011_1101; // Sane
> auto myBin2 = 0b_______1; // Trouble, myBin2 == 1
> auto myBin3 = 0b____1___; // Trouble, myBin3 == 1
>
> Which means a tools built against the documented spec are going to choke on these weird cases. Personally I would prefer if the more questionable options were not allowed as they potentially defeat the goal of improving clarity. But, that's a breaking change.

H. S. Teoh wrote:
> I know that, but I'm saying that hardly *any* code would break if we made DMD reject things like this. I don't think anybody in their right mind would write code like that. (Unless they were competing in the IODCC... :-P)

I agree that the compiler should probably break that code; I believe some breaking changes are good when they help the programmer fix potential bugs. But I am also someone who compiles with "Treat warnings as errors".

H. S. Teoh wrote:
> The issue here is that when specs / DMD / TDPL don't agree, then it's not always clear which among the three are wrong. Perhaps *all* of them are wrong. Just because DMD accepts invalid code doesn't mean it should be part of the specs, for example. It could constitute a DMD bug.

It would be good to get some clarification on this.

H. S. Teoh wrote:
> But if you want to accept "strange" literals like 0b__1__, you could do something like:
>
> <binaryLiteral> ::= "0b" <underscoreBinaryDigits> <binaryDigit> <underscoreBinaryDigits>
> <underscoreBinaryDigits> ::= "_" | "_" <underscoreBinaryDigits> | <binaryDigit> | <binaryDigit> <underscoreBinaryDigits> | ""
> <binaryDigit> ::= "0" | "1"
>
> The odd form of the rule for <binaryLiteral> is to ensure that there's at least one binary digit in the string, whereas <underscoreBinaryDigits> is just a wildcard anything-goes rule that takes any combination of 0, 1, and _, including the empty string.

Andre Artus wrote:
> The rule that matches the DMD compiler is actually very easy to do in ANTLR4, i.e.
>
> BinaryLiteral : '0b' [_01]* [01] [_01]* ;
>
> I'm a bit too tired to fully pay attention, but it seems you are saying that "0b" (no additional numbers) should match, which I believe it should not (although I admit to not testing this). If it does then I would consider that a bug.

H. S. Teoh wrote:
> No, the BNF rules I wrote are equivalent to your ANTLR4 spec. Which is equivalent to the regex I posted later.

I should have paid better attention, as I missed the <binaryDigit> in <binaryLiteral>. To be honest I was having a hard time focusing due to lack of sleep and a pervading stench of paint fumes seeping in from the adjacent building.

Andre Artus wrote:
> It's not a problem implementing the rule, I am more concerned with documenting it in a clear and unambiguous way so that people building tools from it can get it right. BNF isn't always the easiest way to do so, but it's what being used.

H. S. Teoh wrote:
> Well, you could bug Walter about what *should* be accepted,

I'm not sure how to go about that.

H. S. Teoh wrote:
> and if he agrees to restrict it to having _ only between two digits, then you'd file a bug against DMD.

Well, if we could get a ruling on this then we could include HexadecimalInteger in the ruling, as it has similar behaviour in DMD.

The current specification for DecimalInteger also allows a trailing sequence of underscores. It also does not include the sign as part of the token value.

Possible regex alternatives (note I do not include the sign, as per current spec):

    (0|[1-9]([_]*[0-9])*)

or arguably better

    (0|[1-9]([_]?[0-9])*)

H. S. Teoh wrote:
> Again, I seriously doubt that such a change would cause any code breakage, because writing 0b1 as 0b____1____ is just so ridiculous that any such code *should* be broken.

Agreed.
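The practical difference between the two DecimalInteger alternatives proposed in this post shows up on a handful of inputs. The following is a small Python sketch, not part of the original discussion; the names `ANY_RUN`, `SINGLE` and `classify` are made up for illustration.

```python
import re

# The two alternatives from the post, with the redundant [_] written as _.
ANY_RUN = re.compile(r"0|[1-9](_*[0-9])*")  # runs of underscores between digits
SINGLE = re.compile(r"0|[1-9](_?[0-9])*")   # at most one underscore between digits


def classify(text: str) -> tuple:
    """Return (matches first alternative, matches second alternative)."""
    return (
        ANY_RUN.fullmatch(text) is not None,
        SINGLE.fullmatch(text) is not None,
    )


for sample in ["0", "12_34", "12_____34", "123___", "_123", "007"]:
    print(f"{sample:10} {classify(sample)}")
```

Both alternatives reject leading and trailing underscores; they differ only on runs such as 12_____34, which the first accepts and the second rejects.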
Aug 17 2013
On Sat, Aug 17, 2013 at 11:29:03PM +0200, Andre Artus wrote:
[...]
>> H. S. Teoh wrote:
>> I was just using BNF to show that it's possible to specify the behaviour precisely. And also that it's rather convoluted just for something as intuitively straightforward as an integer literal. Which is a likely reason why the current specs are a bit blurry about what should/shouldn't be allowed.
>
> I don't think I've seen lexemes defined using (a variant of) BNF before, most often a form of regular expressions are used. One could cut down and clarify the page describing the lexical syntax significantly employing simple regular expressions.

You're right, I think the D specs page on literals using BNF is a bit of an overkill. Maybe it should be rewritten using regexen. It would be easier to understand, for one thing.

[...]
>> H. S. Teoh wrote:
>> I know that, but I'm saying that hardly *any* code would break if we made DMD reject things like this. I don't think anybody in their right mind would write code like that. (Unless they were competing in the IODCC... :-P)
>
> I agree that the compiler should probably break that code, I believe some breaking changes are good when they help the programmer fix potential bugs. But I am also someone who compiles with "Treat warnings as errors".

Walter is someone who believes that compilers should only have errors, not warnings. :)

[...]
>> Andre Artus wrote:
>> It's not a problem implementing the rule, I am more concerned with documenting it in a clear and unambiguous way so that people building tools from it can get it right. BNF isn't always the easiest way to do so, but it's what being used.
>
>> H. S. Teoh wrote:
>> Well, you could bug Walter about what *should* be accepted,
>
> I'm not sure how to go about that.

Email him and ask? :)

>> H. S. Teoh wrote:
>> and if he agrees to restrict it to having _ only between two digits, then you'd file a bug against DMD.
>
> Well if we could get a ruling on this then we could include HexadecimalInteger in the ruling as it has similar behaviour in DMD. The current specification for DecimalInteger also allows a trailing sequence of underscores. It also does not include the sign as part of the token value.

Yeah that sounds like a bug in the specs.

> Possible regex alternatives (note I do not include the sign, as per current spec).
> (0|[1-9]([_]*[0-9])*)
> or arguably better
> (0|[1-9]([_]?[0-9])*)
[...]

I think it should be:

    (0|[1-9]([0-9]*(_[0-9]+)*)?)

That is, either it's a 0, or a single digit from 1-9, or 1-9 followed by (zero or more digits 0-9 followed by zero or more (underscore followed by one or more digits 0-9)). This enforces only a single underscore between digits, and no preceding/trailing underscores. So it would exclude things like 12_____34, which is just as ridiculous as 123___, and only allow 12_34.

T

-- 
Blunt statements really don't have a point.
Aug 17 2013
[...]

H. S. Teoh wrote:
> I was just using BNF to show that it's possible to specify the behaviour precisely. And also that it's rather convoluted just for something as intuitively straightforward as an integer literal. Which is a likely reason why the current specs are a bit blurry about what should/shouldn't be allowed.

Andre Artus wrote:
> I don't think I've seen lexemes defined using (a variant of) BNF before, most often a form of regular expressions are used. One could cut down and clarify the page describing the lexical syntax significantly employing simple regular expressions.

H. S. Teoh wrote:
> You're right, I think the D specs page on literals using BNF is a bit of an overkill. Maybe it should be rewritten using regexen. It would be easier to understand, for one thing.

I would not mind doing this, I'll see what Walter says. It would also be quite easy to generate syntax diagrams from a reg-expr.

[...]

H. S. Teoh wrote:
> I know that, but I'm saying that hardly *any* code would break if we made DMD reject things like this. I don't think anybody in their right mind would write code like that. (Unless they were competing in the IODCC... :-P)

> I agree that the compiler should probably break that code, I believe some breaking changes are good when they help the programmer fix potential bugs. But I am also someone who compiles with "Treat warnings as errors".

H. S. Teoh wrote:
> Walter is someone who believes that compilers should only have errors, not warnings. :)

That can go both ways, but I suspect you mean that in the good way.

[...]

Andre Artus wrote:
> It's not a problem implementing the rule, I am more concerned with documenting it in a clear and unambiguous way so that people building tools from it can get it right. BNF isn't always the easiest way to do so, but it's what being used.

H. S. Teoh wrote:
> Well, you could bug Walter about what *should* be accepted,

> I'm not sure how to go about that.

H. S. Teoh wrote:
> Email him and ask? :)

I'll try that.

H. S. Teoh wrote:
> and if he agrees to restrict it to having _ only between two digits, then you'd file a bug against DMD.

> Well if we could get a ruling on this then we could include HexadecimalInteger in the ruling as it has similar behaviour in DMD. The current specification for DecimalInteger also allows a trailing sequence of underscores. It also does not include the sign as part of the token value.

H. S. Teoh wrote:
> Yeah that sounds like a bug in the specs.

Yes, I believe so. The same issues are under "Floating Point Literals". Should be easy to fix.

> Possible regex alternatives (note I do not include the sign, as per current spec).
> (0|[1-9]([_]*[0-9])*)
> or arguably better
> (0|[1-9]([_]?[0-9])*)

> [...]
> I think it should be:
> (0|[1-9]([0-9]*(_[0-9]+)*)?)
> That is, either it's a 0, or a single digit from 1-9, or 1-9 followed by (zero or more digits 0-9 followed by zero or more (underscore followed by one or more digits 0-9)). This enforces only a single underscore between digits, and no preceding/trailing underscores. So it would exclude things like 12_____34, which is just as ridiculous as 123___, and only allow 12_34.

I concur with your assessment. I believe my second reg-ex is functionally equivalent to the one you propose (test results below), although I would concede that yours may be easier to grok.

The following match my regex (assuming it's whitespace delimited):

    1
    1_1
    1_2_3_4_5_6_7_8_9_0
    1234_45_15
    1234567_8_90
    123456789_0
    1_234567890
    12_34567890
    123_4567890
    1234_567890
    12345_67890
    123456_7890
    1234567_890
    12345678_90
    123456789_0
    123_45_6_789012345_67890

Whereas these do not:

    _1
    1_
    _1_
    1______1
    -12_34
    -1234
    123_45_6__789012345_67890
    1234567890_
    _1234567890_
    _1234567890
    1234567890_
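The claimed equivalence between the second regex, (0|[1-9]([_]?[0-9])*), and the one H. S. Teoh proposed can be sanity-checked by brute force over short strings. The following Python sketch does that over a reduced alphabet; the reduced alphabet and the length bound are assumptions made to keep the search small, so this is a spot check rather than a proof, and the names `ANDRE`, `TEOH` and `disagreements` are made up here.

```python
import re
from itertools import product

ANDRE = re.compile(r"0|[1-9](_?[0-9])*")          # second alternative above
TEOH = re.compile(r"0|[1-9]([0-9]*(_[0-9]+)*)?")  # H. S. Teoh's proposal


def disagreements(alphabet: str = "019_", max_len: int = 6) -> list:
    """List every string up to max_len on which the two regexes differ."""
    found = []
    for n in range(1, max_len + 1):
        for chars in product(alphabet, repeat=n):
            text = "".join(chars)
            a = ANDRE.fullmatch(text) is not None
            t = TEOH.fullmatch(text) is not None
            if a != t:
                found.append(text)
    return found


print("strings where the regexes disagree:", disagreements())

# A few entries from the lists in the post, checked against both patterns.
for sample in ["1234_45_15", "1______1", "-1234", "1234567890_"]:
    print(sample, ANDRE.fullmatch(sample) is not None,
          TEOH.fullmatch(sample) is not None)
```

An empty disagreement list is what one would expect if the two patterns describe the same language: both amount to a nonzero digit followed by digits in which every underscore is immediately followed by a digit.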
Aug 17 2013
[...]

For fun I made a scanner rule that forces BinaryInteger to conform to a power of 2 grouping of nibbles. I think it loses its clarity after 16 bits. I made the underscore optional between nibbles, but required for groups of 2 bytes and above.

Some passing cases from my test inputs:

    0b00010001
    0b0001_0001
    0b00010001_0001_0001
    0b00010001_00010001
    0b0001_0001_00010001
    0b00010001_00010001
    0b00010001_00010001_00010001_00010001
    0b00010001_00010001_00010001_00010001_00010001_00010001_00010001_00010001
    0b00010001_00010001_00010001_00010001_00010001_0001_0001_00010001_00010001_00010001_00010001_00010001_00010001_00010001_00010001_00010001_00010001

It loses some of the value of arbitrary grouping, specifically the ability to group bits in a bitmask by function.
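The scanner rule itself is not shown in the post, so the following Python sketch is only a guess at something consistent with the description and the shorter passing cases. It assumes a byte is two 4-bit nibbles with an optional underscore between them, that bytes must be separated by an underscore, and that the number of bytes must be a power of two; all three assumptions, and the names `BYTE`, `GROUPED` and `is_grouped_binary`, are mine rather than the author's.

```python
import re

# Assumed shape: nibble, optional underscore, nibble.
BYTE = r"[01]{4}_?[01]{4}"
# Assumed shape: one byte, then any number of "_" + byte.
GROUPED = re.compile(rf"0b{BYTE}(?:_{BYTE})*")


def is_grouped_binary(text: str) -> bool:
    """Check the assumed grouping rule, including the power-of-two byte count."""
    if GROUPED.fullmatch(text) is None:
        return False
    # Count digits after the "0b" prefix; the regex guarantees a multiple of 8.
    n_bytes = sum(ch in "01" for ch in text[2:]) // 8
    return (n_bytes & (n_bytes - 1)) == 0  # true for 1, 2, 4, 8, ...


for text in [
    "0b00010001",
    "0b0001_0001",
    "0b00010001_0001_0001",
    "0b0001_0001_00010001",
    "0b00010001_00010001_00010001_00010001",
    "0b0001000100010001",            # bytes not separated by "_"
    "0b00010001_00010001_00010001",  # three bytes, not a power of two
]:
    print(text, is_grouped_binary(text))
```

Whether the original rule really rejects a three-byte literal is not stated in the post; that case is included only to show the effect of the power-of-two assumption.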
Aug 17 2013
On Fri, Aug 16, 2013 at 05:50:24PM -0700, H. S. Teoh wrote:
[...]
> <binaryLiteral> ::= "0b" <binaryDigits> <underscoreBinaryDigits>
> <binaryDigits> ::= <binaryDigit> <binaryDigits> | <binaryDigit>
> <underscoreBinaryDigits> ::= "" | "_" <binaryDigits> | "_" <binaryDigits> <underscoreBinaryDigits>
> <binaryDigit> ::= "0" | "1"

Regex equivalent:

    0b(0|1)(0|1)*(_(0|1)(0|1)*)*

[...]
> <binaryLiteral> ::= "0b" <underscoreBinaryDigits> <binaryDigit> <underscoreBinaryDigits>
> <underscoreBinaryDigits> ::= "_" | "_" <underscoreBinaryDigits> | <binaryDigit> | <binaryDigit> <underscoreBinaryDigits> | ""
> <binaryDigit> ::= "0" | "1"
[...]

Regex equivalent:

    0b(0|1|_)*(0|1)(0|1|_)*

T

-- 
"How are you doing?" "Doing what?"
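The relationship between the two regexes in this post can be checked by enumerating short literals: everything the strict form accepts should also be accepted by the permissive form, and the difference should be exactly the "strange" literals. The following is a Python sketch of that check; the length bound and the names `STRICT`, `PERMISSIVE` and `compare` are assumptions made for illustration.

```python
import re
from itertools import product

STRICT = re.compile(r"0b(0|1)(0|1)*(_(0|1)(0|1)*)*")  # "_" only between digits
PERMISSIVE = re.compile(r"0b(0|1|_)*(0|1)(0|1|_)*")   # at least one digit


def compare(max_len: int = 5):
    """Return (strict-only, permissive-only) literals with bodies up to max_len."""
    strict_only, permissive_only = [], []
    for n in range(1, max_len + 1):
        for chars in product("01_", repeat=n):
            text = "0b" + "".join(chars)
            s = STRICT.fullmatch(text) is not None
            p = PERMISSIVE.fullmatch(text) is not None
            if s and not p:
                strict_only.append(text)
            if p and not s:
                permissive_only.append(text)
    return strict_only, permissive_only


strict_only, permissive_only = compare()
print("accepted only by the strict form:", strict_only)
print("some literals only the permissive form accepts:", permissive_only[:8])
```

The strict-only list should come back empty, while the permissive-only list contains literals with leading, trailing or repeated underscores such as 0b_1_.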
Aug 16 2013
On Friday, 16 August 2013 at 20:00:35 UTC, Brian Schott wrote:
> I've been doing some work with the language grammar specification. You may find these resources useful:
> http://d.puremagic.com/issues/show_bug.cgi?id=10233
> https://github.com/Hackerpilot/DGrammar/blob/master/D.g4

I have fixed up a few issues in DGrammar.g4; I will put them up on GitHub if you are interested.

According to the Definitive ANTLR Reference the following list of words are reserved in ANTLR grammars: import, fragment, lexer, parser, grammar, returns, locals, throws, *catch*, *finally*, mode, options, tokens. The two I marked above caused problems when generating.

I don't know whether you are in the middle of trying to fix the indirect left recursion issue, but I see that terminals tied to "unaryExpression" are duplicated all over the place. I can fix the recursion issue, and clean up the dups, if that would help you.
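A clash between rule names and the reserved words listed in this post can be found with a rough text scan before running the generator. The following Python sketch uses the word list exactly as given above; the grammar fragment in `sample` is invented for illustration and is not taken from DGrammar, and the regex only looks for `name :` at the start of a line, which is far simpler than real ANTLR syntax.

```python
import re

# Reserved words as listed in the post.
ANTLR_RESERVED = {
    "import", "fragment", "lexer", "parser", "grammar", "returns", "locals",
    "throws", "catch", "finally", "mode", "options", "tokens",
}

# Rough heuristic: an identifier at the start of a line followed by ":".
RULE_DEF = re.compile(r"^\s*([A-Za-z_][A-Za-z0-9_]*)\s*:", re.MULTILINE)


def reserved_rule_names(grammar_text: str) -> list:
    """Return the sorted rule names that collide with a reserved word."""
    names = RULE_DEF.findall(grammar_text)
    return sorted({name for name in names if name in ANTLR_RESERVED})


# Hypothetical grammar fragment, for illustration only.
sample = """
tryStatement : 'try' statement (catch+ finally? | finally) ;
catch : 'catch' '(' type identifier ')' statement ;
finally : 'finally' statement ;
importDeclaration : 'import' importList ';' ;
"""

print(reserved_rule_names(sample))
```

On the invented fragment this reports `catch` and `finally`, the same two words flagged in the post; quoted literals such as 'import' are not rule names and are ignored.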
Aug 16 2013
On Friday, 16 August 2013 at 23:07:38 UTC, Andre Artus wrote:
> On Friday, 16 August 2013 at 20:00:35 UTC, Brian Schott wrote:
>> I've been doing some work with the language grammar specification. You may find these resources useful:
>> http://d.puremagic.com/issues/show_bug.cgi?id=10233
>> https://github.com/Hackerpilot/DGrammar/blob/master/D.g4
>
> I have fixed up a few issues in DGrammar.g4, I will put them up on GitHub if you are interested. According the the Definitive ANTLR Reference the following list of words are reserved in ANTLR grammars: import, fragment, lexer, parser, grammar, returns, locals, throws, *catch*, *finally*, mode, options, tokens. The two I marked above caused problems when generating.

I must have missed those when I pulled the grammar out of my parser's DDOC comments.

> I don't know whether you are in the middle of trying to fix the indirect left recursion issue but I see that terminals tied to "unaryExpression" are duplicated all over the place. I can fix the recursion issue, and clean up the dups if that would help you.

It would. I'm not actively working on that grammar.
Aug 16 2013