digitalmars.D - [dox] Fixing the lexical rule for BinaryInteger

"Andre Artus" <andre.artus gmail.com> writes:

The documentation on the lexical rules for BinaryInteger 
(http://dlang.org/lex.html#BinaryInteger) has a few issues:

 BinaryInteger:
    BinPrefix BinaryDigits

The nonterminal BinaryDigits does not exist.


 BinaryDigitsUS:
    BinaryDigitUS
    BinaryDigitUS BinaryDigitsUS

The construction for BinaryDigitsUS currently allows a sequence 
made up of nothing but underscores:

_(_)*, e.g. 0b_, 0b__, 0b___ etc.


The compiler clearly does not allow these.


I have put up a change on GitHub [1], but there is a clear 
problem. The DMD compiler allows for any of the following 
(reduced cases):

a. 0b__1
b. 0b_1_
c. 0b1__

My change disallows the second case (b), although it is in line 
with how the other integers are specified.

This is a specification problem (limitation of BNF), not an 
implementation problem. In plain English one would just say that 
the BinaryDigitsUS sequence should contain at least one 
BinaryDigit character.
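
To make the difference concrete, here is a minimal Python sketch that compares the rule as documented with the "at least one BinaryDigit" reading. It assumes that BinaryInteger was meant to refer to BinaryDigitsUS; the two regular expressions and the helper name `accepts` are illustrative and are not taken from the spec or from DMD.

```python
import re

# Assumption: BinaryInteger should read "BinPrefix BinaryDigitsUS", where
# BinaryDigitsUS is one or more of '0', '1' or '_'.
SPEC_AS_WRITTEN = re.compile(r"0b[01_]+")

# The plain-English rule: digits and underscores in any order,
# as long as at least one real binary digit is present.
AT_LEAST_ONE_DIGIT = re.compile(r"0b[01_]*[01][01_]*")


def accepts(pattern, text):
    """True when the whole of `text` matches `pattern`."""
    return pattern.fullmatch(text) is not None


for literal in ["0b_", "0b__", "0b__1", "0b_1_", "0b1__", "0b"]:
    print(f"{literal:7} spec={accepts(SPEC_AS_WRITTEN, literal)!s:5} "
          f"one_digit={accepts(AT_LEAST_ONE_DIGIT, literal)}")
```

The two patterns differ only on underscore-only bodies such as 0b_, which is exactly the case the documented rule lets through.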

I'm busy working on the HexadecimalInteger, which has related 
issues.

1. 
https://github.com/andre-artus/dlang.org/blob/LexBinaryDigit/lex.dd

Aug 16 2013

"Brian Schott" <briancschott gmail.com> writes:

I've been doing some work with the language grammar 
specification. You may find these resources useful:

http://d.puremagic.com/issues/show_bug.cgi?id=10233
https://github.com/Hackerpilot/DGrammar/blob/master/D.g4

Aug 16 2013

"Andre Artus" <andre.artus gmail.com> writes:

On Friday, 16 August 2013 at 20:00:35 UTC, Brian Schott wrote:
 I've been doing some work with the language grammar 
 specification. You may find these resources useful:

 http://d.puremagic.com/issues/show_bug.cgi?id=10233
 https://github.com/Hackerpilot/DGrammar/blob/master/D.g4

You have done impressive work on your grammar; I just have some 
small issues.

1. I run into a number of errors when trying to generate the Java 
code; I'm using ANTLR 4.1.

2. Your BinaryInteger and HexadecimalInteger only allow for one 
of the following (reduced) cases:

0b1__ : works
0b_1_ : fails
0b__1 : fails

Same with HexadecimalInteger.

3. The imports don't allow for all cases.

4. How are you handling the scope attribute specifier in the 
"attribute ':'" case, e.g. "public:"?

There seem to be a few more places where it diverges a bit from 
what the compiler currently accepts.

I'm not arguing for the wisdom of writing code like what I am about 
to show, but the following compiles with the current release build 
of DMD and may not parse with DGrammar; it will quite likely balk 
in the scanner:

module main;

public:
static:
import std.stdio;

int main(string[] argv)
{
	auto myBin = 0b0011_1101;

	writefln("%1$x\t%1$.8b\t%1$s", myBin);

	auto myBin2 = 0b_______1;

	writefln("%1$x\t%1$.8b\t%1$s", myBin2);

	auto myBin3 = 0b____1___;

	writefln("%1$x\t%1$.8b\t%1$s", myBin3);

	auto myHex1 = 0x1__;
	writefln("%1$x\t%1$.8b\t%1$s", myHex1);

	auto myHex2 = 0x_1_;
	writefln("%1$x\t%1$.8b\t%1$s", myHex2);

	auto myHex3 = 0x__1;
	writefln("%1$x\t%1$.8b\t%1$s", myHex3);

	
	return 0;
}

Aug 16 2013

"Brian Schott" <briancschott gmail.com> writes:

On Friday, 16 August 2013 at 22:43:13 UTC, Andre Artus wrote:
 On Friday, 16 August 2013 at 20:00:35 UTC, Brian Schott wrote:
 I've been doing some work with the language grammar 
 specification. You may find these resources useful:

 http://d.puremagic.com/issues/show_bug.cgi?id=10233
 https://github.com/Hackerpilot/DGrammar/blob/master/D.g4

 You have done impressive work on your grammar; I just have some 
 small issues.

 1. I run into a number of errors trying to generate the Java 
 code, I'm using ANTLR 4.1

I'm aware of that. If you're able to get ANTLR to actually 
produce a working parser for D I'd be happy to merge your pull 
request. I haven't been able to get any parser generators to work 
for D.

 2. Your BinaryInteger and HexadecimalInteger only allow for one 
 of the following (reduced) cases:

 0b1__ : works
 0b_1_ : fails
 0b__1 : fails

It's my opinion that the compiler should reject all of these 
because I think of the underscore as a separator between digits, 
but I'm constantly fighting the "spec, dmd, and idiom all 
disagree" issue.

 Same with HexadecimalInteger.

 3. The imports don't allow for all cases.

https://github.com/Hackerpilot/DGrammar/issues

 4. how are you handling the scope attribute specifier in the 
 "attribute ':'" case, e.g. "public:"?

 There seems to be a few more places where it diverges a bit 
 from what the compiler currently accepts.

 I'm not arguing for the wisdom of writing code as I am about to 
 show, but the following compiles with the current release build 
 of DMD, but may not parse with DGrammar, quite likely balk in 
 the scanner:

 module main;

 public:
 static:
 import std.stdio;

 int main(string[] argv)
 {
 	auto myBin = 0b0011_1101;

 	writefln("%1$x\t%1$.8b\t%1$s", myBin);

 	auto myBin2 = 0b_______1;

 	writefln("%1$x\t%1$.8b\t%1$s", myBin2);

 	auto myBin3 = 0b____1___;

 	writefln("%1$x\t%1$.8b\t%1$s", myBin3);

 	auto myHex1 = 0x1__;
 	writefln("%1$x\t%1$.8b\t%1$s", myHex1);

 	auto myHex2 = 0x_1_;
 	writefln("%1$x\t%1$.8b\t%1$s", myHex2);

 	auto myHex3 = 0x__1;
 	writefln("%1$x\t%1$.8b\t%1$s", myHex3);

 	
 	return 0;
 }

I wrote that grammar as part of my work on DCD and DScanner. My 
lexer, parser, and AST library need some more testing. Please 
download DScanner and run it with either the --ast or 
--syntaxCheck options. If you find issues, please report them on 
Github.

Aug 16 2013

"Andre Artus" <andre.artus gmail.com> writes:

-- SNIP --

 I wrote that grammar as part of my work on DCD and DScanner. My 
 lexer, parser, and AST library need some more testing. Please 
 download DScanner and run it with either the --ast or 
 --syntaxCheck options. If you find issues, please report them 
 on Github.

I forked just under an hour ago; am I on old bits?
I have fixed all but one of the build issues. I would like to fix 
the last one before I commit, as I don't like to leave my repos 
in a broken state.

I'll continue the discussion on GitHub.

Aug 16 2013

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

On Sat, Aug 17, 2013 at 01:03:35AM +0200, Brian Schott wrote:
 On Friday, 16 August 2013 at 22:43:13 UTC, Andre Artus wrote:

[...]
2. Your BinaryInteger and HexadecimalInteger only allow for one of
the following (reduced) cases:

0b1__ : works
0b_1_ : fails
0b__1 : fails

 
 It's my opinion that the compiler should reject all of these because
 I think of the underscore as a separator between digits, but I'm
 constantly fighting the "spec, dmd, and idiom all disagree" issue.

[...]

I remember reading this part of the spec on dlang.org, and I wonder if
it was worded the way it is just for simplicity, because to specify
something like "_ must appear between digits" involves some complicated
BNF rules, which maybe seems like overkill for a single literal.

But sometimes it is good to be precise, if we want to enforce "proper"
conventions for underscores:

<binaryLiteral> ::= "0b" <binaryDigits> <underscoreBinaryDigits>

<binaryDigits> ::= <binaryDigit> <binaryDigits>
		| <binaryDigit>

<underscoreBinaryDigits> ::= ""
		| "_" <binaryDigits>
		| "_" <binaryDigits> <underscoreBinaryDigits>

<binaryDigit> ::= "0"
		| "1"

This BNF spec forces "_" to only appear between two binary digits, and
never more than a single _ in a row. You can also make your parser only
pick up <binaryDigit> when performing semantic analysis on binary
literals, so the other stuff is ignored and only serves to enforce syntax.

I'd be surprised if there's any D code out there that doesn't fit this
spec, to be honest.
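
As an illustration of the strict rules above, the following Python sketch transcribes them into a small hand-written recognizer and then computes the value from the binary digits alone, ignoring the underscores. The function names are made up for this example, and nothing here reflects how DMD actually lexes literals.

```python
def parse_binary_digits(text, pos):
    """<binaryDigits>: one or more '0'/'1'. Returns the new position, or None."""
    end = pos
    while end < len(text) and text[end] in "01":
        end += 1
    return end if end > pos else None


def binary_literal_value(text):
    """Value of a strictly formed literal, or None when the syntax is rejected."""
    if not text.startswith("0b"):
        return None
    pos = parse_binary_digits(text, 2)
    if pos is None:
        return None
    # <underscoreBinaryDigits>: zero or more repetitions of "_" <binaryDigits>
    while pos < len(text) and text[pos] == "_":
        pos = parse_binary_digits(text, pos + 1)
        if pos is None:
            return None
    if pos != len(text):
        return None
    # Only the binary digits contribute to the value; underscores are dropped.
    digits = "".join(ch for ch in text[2:] if ch in "01")
    return int(digits, 2)


for literal in ["0b0011_1101", "0b1", "0b__1", "0b1__", "0b1__1"]:
    print(literal, binary_literal_value(literal))
```

Under these rules a leading, trailing or doubled underscore makes the whole literal invalid rather than being silently skipped.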

But if you want to accept "strange" literals like 0b__1__, you could do
something like:

<binaryLiteral> ::= "0b" <underscoreBinaryDigits> <binaryDigit>
<underscoreBinaryDigits>

<underscoreBinaryDigits> ::= "_"
		| "_" <underscoreBinaryDigits>
		| <binaryDigit>
		| <binaryDigit> <underscoreBinaryDigits>
		| ""

<binaryDigit> ::= "0"
		| "1"

The odd form of the rule for <binaryLiteral> is to ensure that there's
at least one binary digit in the string, whereas
<underscoreBinaryDigits> is just a wildcard anything-goes rule that
takes any combination of 0, 1, and _, including the empty string.


T

-- 
That's not a bug; that's a feature!

Aug 16 2013

"Andre Artus" <andre.artus gmail.com> writes:

On Saturday, 17 August 2013 at 00:51:57 UTC, H. S. Teoh wrote:
 On Sat, Aug 17, 2013 at 01:03:35AM +0200, Brian Schott wrote:
 On Friday, 16 August 2013 at 22:43:13 UTC, Andre Artus wrote:

 [...]
2. Your BinaryInteger and HexadecimalInteger only allow for 
one of
the following (reduced) cases:

0b1__ : works
0b_1_ : fails
0b__1 : fails

 
 It's my opinion that the compiler should reject all of these 
 because
 I think of the underscore as a separator between digits, but 
 I'm
 constantly fighting the "spec, dmd, and idiom all disagree" 
 issue.

 [...]

 I remember reading this part of the spec on dlang.org, and I 
 wonder if
 it was worded the way it is just for simplicity, because to 
 specify
 something like "_ must appear between digits" involves some 
 complicated
 BNF rules, which maybe seems like overkill for a single literal.

 But sometimes it is good to be precise, if we want to enforce 
 "proper"
 conventions for underscores:

 <binaryLiteral> ::= "0b" <binaryDigits> <underscoreBinaryDigits>

 <binaryDigits> ::= <binaryDigit> <binaryDigits>
 		| <binaryDigit>

 <underscoreBinaryDigits> ::= ""
 		| "_" <binaryDigits>
 		| "_" <binaryDigits> <underscoreBinaryDigits>

 <binaryDigit> ::= "0"
 		| "1"

 This BNF spec forces "_" to only appear between two binary 
 digits, and never more than a single _ in a row.

Yup, that's the issue. Coding the actual behaviour by hand, or 
doing it with a regular expression, is close to trivial.
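
For what it's worth, a hand-coded version of the permissive behaviour might look like the Python sketch below. It is only a guess at the shape of such a scanner: the function name and return convention are invented for illustration, and it ignores suffixes and whatever follows the literal.

```python
def scan_binary_literal(source, start=0):
    """Return the index just past a binary literal at `start`, or None.

    Accepts any mix of '0', '1' and '_' after the "0b" prefix, provided
    at least one real binary digit was seen.
    """
    if source[start:start + 2] != "0b":
        return None
    pos = start + 2
    seen_digit = False
    while pos < len(source) and source[pos] in "01_":
        if source[pos] != "_":
            seen_digit = True
        pos += 1
    return pos if seen_digit else None


line = "auto x = 0b____1___;"
end = scan_binary_literal(line, 9)
print(line[9:end] if end is not None else "no literal")
```

A single flag is enough to express "at least one digit", which is the part that takes several productions to say in BNF.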

 You can also make your parser only
 pick up <binaryDigit> when performing semantic on binary 
 literals, so
 the other stuff is ignored and only serves to enforce syntax.

Pushing it up to the parser is an option in implementation, but I 
don't see that making the specification easier (it's 3:40 in the 
morning here, so I am very likely not thinking too clearly about 
this).

 I'd be surprised if there's any D code out there that doesn't 
 fit this
 spec, to be honest.

It's not what I would call best practice, but the following is 
possible in the current compiler:

 	auto myBin1 = 0b0011_1101; // Sane
 	auto myBin2 = 0b_______1; // Trouble, myBin2 == 1
 	auto myBin3 = 0b____1___; // Trouble, myBin3 == 1

Which means that tools built against the documented spec are going 
to choke on these weird cases. Personally I would prefer it if the 
more questionable options were not allowed, as they potentially 
defeat the goal of improving clarity. But that's a breaking 
change.

 But if you want to accept "strange" literals like 0b__1__, you 
 could do
 something like:

 <binaryLiteral> ::= "0b" <underscoreBinaryDigits> <binaryDigit> 
 <underscoreBinaryDigits>

 <underscoreBinaryDigits> ::= "_"
 		| "_" <underscoreBinaryDigits>
 		| <binaryDigit>
 		| <binaryDigit> <underscoreBinaryDigits>
 		| ""

 <binaryDigit> ::= "0"
 		| "1"

 The odd form of the rule for <binaryLiteral> is to ensure that 
 there's
 at least one binary digit in the string, whereas
 <underscoreBinaryDigits> is just a wildcard anything-goes rule 
 that
 takes any combination of 0, 1, and _, including the empty 
 string.

The rule that matches the DMD compiler is actually very easy to 
do in ANTLR4, i.e.

BinaryLiteral   : '0b' [_01]* [01] [_01]* ;


I'm a bit too tired to fully pay attention, but it seems you are 
saying that "0b" (no additional numbers) should match, which I 
believe it should not (although I admit to not testing this). If 
it does then I would consider that a bug.

Implementing the rule is not a problem; I am more concerned 
with documenting it in a clear and unambiguous way so that people 
building tools from it can get it right. BNF isn't always the 
easiest way to do so, but it's what is being used.

Aug 16 2013

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

On Sat, Aug 17, 2013 at 04:02:40AM +0200, Andre Artus wrote:
 On Saturday, 17 August 2013 at 00:51:57 UTC, H. S. Teoh wrote:
On Sat, Aug 17, 2013 at 01:03:35AM +0200, Brian Schott wrote:
On Friday, 16 August 2013 at 22:43:13 UTC, Andre Artus wrote:

[...]
2. Your BinaryInteger and HexadecimalInteger only allow for
one of
the following (reduced) cases:

0b1__ : works
0b_1_ : fails
0b__1 : fails

It's my opinion that the compiler should reject all of these because
I think of the underscore as a separator between digits, but I'm
constantly fighting the "spec, dmd, and idiom all disagree"
issue.

[...]

I remember reading this part of the spec on dlang.org, and I wonder
if it was worded the way it is just for simplicity, because to
specify something like "_ must appear between digits" involves some
complicated BNF rules, which maybe seems like overkill for a single
literal.

But sometimes it is good to be precise, if we want to enforce
"proper" conventions for underscores:

<binaryLiteral> ::= "0b" <binaryDigits> <underscoreBinaryDigits>

<binaryDigits> ::= <binaryDigit> <binaryDigits>
		| <binaryDigit>

<underscoreBinaryDigits> ::= ""
		| "_" <binaryDigits>
		| "_" <binaryDigits> <underscoreBinaryDigits>

<binaryDigit> ::= "0"
		| "1"

This BNF spec forces "_" to only appear between two binary digits,
and never more than a single _ in a row.

 
 Yup, that's the issue. Coding the actual behaviour by hand, or doing
 it with a regular expression, is close to trivial.
 
You can also make your parser only pick up <binaryDigit> when
performing semantic on binary literals, so the other stuff is ignored
and only serves to enforce syntax.

 
 Pushing it up to the parser is an option in implementation, but I
 don't see that making the specification easier (it's 3:40 in the
 morning here, so I am very likely not thinking too clearly about
 this).

I didn't mean to push it up to the parser. I was just using BNF to show
that it's possible to specify the behaviour precisely. And also that
it's rather convoluted just for something as intuitively straightforward
as an integer literal. Which is a likely reason why the current specs
are a bit blurry about what should/shouldn't be allowed.


I'd be surprised if there's any D code out there that doesn't fit
this spec, to be honest.

 
 It's not what I would call best practice, but the following is
 possible in the current compiler:
 
	auto myBin1 = 0b0011_1101; // Sane
	auto myBin2 = 0b_______1; // Trouble, myBin2 == 1
	auto myBin3 = 0b____1___; // Trouble, myBin3 == 1

 
 Which means a tools built against the documented spec are going to
 choke on these weird cases. Personally I would prefer if the more
 questionable options were not allowed as they potentially defeat the
 goal of improving clarity. But, that's a breaking change.

I know that, but I'm saying that hardly *any* code would break if we
made DMD reject things like this. I don't think anybody in their right
mind would write code like that. (Unless they were competing in the
IODCC... :-P)

The issue here is that when specs / DMD / TDPL don't agree, then it's
not always clear which of the three is wrong. Perhaps *all* of them
are wrong. Just because DMD accepts invalid code doesn't mean it should
be part of the specs, for example. It could constitute a DMD bug.


But if you want to accept "strange" literals like 0b__1__, you could
do something like:

<binaryLiteral> ::= "0b" <underscoreBinaryDigits> <binaryDigit>
<underscoreBinaryDigits>

<underscoreBinaryDigits> ::= "_"
		| "_" <underscoreBinaryDigits>
		| <binaryDigit>
		| <binaryDigit> <underscoreBinaryDigits>
		| ""

<binaryDigit> ::= "0"
		| "1"

The odd form of the rule for <binaryLiteral> is to ensure that
there's at least one binary digit in the string, whereas
<underscoreBinaryDigits> is just a wildcard anything-goes rule that
takes any combination of 0, 1, and _, including the empty string.

 
 The rule that matches the DMD compiler is actually very easy to do
 in ANTLR4, i.e.
 
 BinaryLiteral   : '0b' [_01]* [01] [_01]* ;
 
 
 I'm a bit too tired to fully pay attention, but it seems you are
 saying that "0b" (no additional numbers) should match, which I believe
 it should not (although I admit to not testing this). If it does then
 I would consider that a bug.

No, the BNF rules I wrote are equivalent to your ANTLR4 spec. Which is
equivalent to the regex I posted later.
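
The equivalence claim can be spot-checked by brute force. The Python sketch below transcribes the permissive BNF rules into functions that return every position a nonterminal can end at, and compares the result with a regex form of the ANTLR4 rule for all bodies up to six characters. The function names and the length bound are arbitrary choices for this example, and only the characters '_', '0' and '1' are tried.

```python
import re
from itertools import product

# Regex form of the ANTLR4 rule: '0b' [_01]* [01] [_01]*
ANTLR_STYLE = re.compile(r"0b[_01]*[01][_01]*")


def binary_digit(text, pos):
    """<binaryDigit> ::= "0" | "1"; the set of positions it can end at."""
    return {pos + 1} if pos < len(text) and text[pos] in "01" else set()


def underscore_binary_digits(text, pos):
    """<underscoreBinaryDigits>: "" or one '_'/digit, optionally followed by itself."""
    ends = {pos}  # the empty alternative
    if pos < len(text) and text[pos] in "_01":
        ends.add(pos + 1)                                # "_" or <binaryDigit> alone
        ends |= underscore_binary_digits(text, pos + 1)  # ... followed by the rule again
    return ends


def bnf_accepts(text):
    """<binaryLiteral> ::= "0b" <underscoreBinaryDigits> <binaryDigit> <underscoreBinaryDigits>"""
    if not text.startswith("0b"):
        return False
    for mid in underscore_binary_digits(text, 2):
        for after_digit in binary_digit(text, mid):
            if len(text) in underscore_binary_digits(text, after_digit):
                return True
    return False


def mismatches(max_len=6):
    """Literals on which the BNF transcription and the regex disagree."""
    bad = []
    for n in range(max_len + 1):
        for body in product("_01", repeat=n):
            literal = "0b" + "".join(body)
            if bnf_accepts(literal) != (ANTLR_STYLE.fullmatch(literal) is not None):
                bad.append(literal)
    return bad


print(mismatches())
```

An empty list here is evidence for the bounded cases only, not a proof, but it does confirm that a bare "0b" is rejected by both forms.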


 It's not a problem implementing the rule, I am more concerned with
 documenting it in a clear and unambiguous way so that people
 building tools from it can get it right. BNF isn't always the
 easiest way to do so, but it's what being used.

Well, you could bug Walter about what *should* be accepted, and if he
agrees to restrict it to having _ only between two digits, then you'd
file a bug against DMD. Again, I seriously doubt that such a change
would cause any code breakage, because writing 0b1 as 0b____1____ is
just so ridiculous that any such code *should* be broken.


T

-- 
Prosperity breeds contempt, and poverty breeds consent. -- Suck.com

Aug 17 2013

"Andre Artus" <andre.artus gmail.com> writes:

 Andre Artus wrote:
 2. Your BinaryInteger and HexadecimalInteger only allow for
 one of the following (reduced) cases:
 
 0b1__ : works
 0b_1_ : fails
 0b__1 : fails





 Brian Schott wrote:
 It's my opinion that the compiler should reject all of these 
 because I think of the underscore as a separator between 
 digits,
 but I'm constantly fighting the "spec, dmd, and idiom all 
 disagree" issue.




I agree with you, Brian, all three of these constructions go 
contrary to the goal of making the code clearer. I would not be 
too surprised if a significant number of programmers would see 
those as different numbers, at least until they paused to take it 
in.

 [...]
 H. S. Teoh wrote:
 
 I remember reading this part of the spec on dlang.org, and I 
 wonder if it was worded the way it is just for simplicity, 
 because to specify something like "_ must appear between 
 digits" involves some complicated BNF rules, which maybe 
 seems like overkill for a single literal.



I think that you are right.

 H. S. Teoh wrote:
 
 But sometimes it is good to be precise, if we want to enforce
 "proper" conventions for underscores:
 
 <binaryLiteral>  ::= "0b" <binaryDigits>
                     <underscoreBinaryDigits>
 
 <binaryDigits>   ::= <binaryDigit> <binaryDigits>
                    | <binaryDigit>
 
 <underscoreBinaryDigits>
                  ::= ""
                    | "_" <binaryDigits>
                    | "_" <binaryDigits> 
 <underscoreBinaryDigits>
 
 <binaryDigit>    ::= "0"
                    | "1"
 
 This BNF spec forces "_" to only appear between two binary 
 digits, and never more than a single _ in a row.




 Andre Artus wrote:
 Yup, that's the issue. Coding the actual behaviour by hand, or 
 doing it with a regular expression, is close to trivial.


 H. S. Teoh wrote:
 You can also make your parser only pick up <binaryDigit> when
 performing semantic on binary literals, so the other stuff is 
 ignored and only serves to enforce syntax.



 Andre Artus wrote:
 Pushing it up to the parser is an option in implementation, 
 but I don't see that making the specification easier (it's 
 3:40 in the morning here, so I am very likely not thinking too 
 clearly about this).



 H. S. Teoh wrote:
 I didn't mean to push it up to the parser.

Sorry, I misunderstood. I had been up for over 21 hours at the 
time I wrote that, so it was getting a bit difficult for me to 
concentrate. I got the impression you were saying that the parser 
would be responsible for extracting the binary digits.

 H. S. Teoh wrote:
 I was just using BNF to show that it's possible to specify the 
 behaviour precisely.
 And also that it's rather convoluted just for something as 
 intuitively straightforward as an integer literal. Which is a 
 likely reason why the current specs are a bit blurry about what 
 should/shouldn't be allowed.

I don't think I've seen lexemes defined using (a variant of) BNF 
before; most often a form of regular expression is used. One 
could cut down and clarify the page describing the lexical syntax 
significantly by employing simple regular expressions.


 H. S. Teoh wrote:
 I'd be surprised if there's any D code out there that doesn't 
 fit this spec, to be honest.



 Andre Artus wrote:
 It's not what I would call best practice, but the following is
 possible in the current compiler:
 
 	auto myBin1 = 0b0011_1101; // Sane
 	auto myBin2 = 0b_______1; // Trouble, myBin2 == 1
 	auto myBin3 = 0b____1___; // Trouble, myBin3 == 1

 
 Which means a tools built against the documented spec are 
 going to choke on these weird cases. Personally I would prefer 
 if the more questionable options were not allowed as they 
 potentially defeat the goal of improving clarity. But, that's 
 a breaking change.


 H. S. Teoh wrote:
 I know that, but I'm saying that hardly *any* code would break 
 if we made DMD reject things like this. I don't think anybody 
 in their right mind would write code like that. (Unless they 
 were competing in the IODCC... :-P)

I agree that the compiler should probably break that code; I 
believe some breaking changes are good when they help the 
programmer fix potential bugs. But I am also someone who compiles 
with "Treat warnings as errors".

 H. S. Teoh wrote:
 The issue here is that when specs / DMD / TDPL don't agree, 
 then it's not always clear which among the three are wrong. 
 Perhaps *all* of them are wrong. Just because DMD accepts 
 invalid code doesn't mean it should be part of the specs, for 
 example. It could constitute a DMD bug.

It would be good to get some clarification on this.

 H. S. Teoh wrote:
 But if you want to accept "strange" literals like 0b__1__, 
 you could do something like:
 
 <binaryLiteral>          ::= "0b" <underscoreBinaryDigits>
                              <binaryDigit>
                              <underscoreBinaryDigits>
 
 <underscoreBinaryDigits> ::= "_"
                            | "_" <underscoreBinaryDigits>
                            | <binaryDigit>
                            | <binaryDigit> 
 <underscoreBinaryDigits>
                            | ""
 
 <binaryDigit>            ::= "0"
                            | "1"
 
 The odd form of the rule for <binaryLiteral> is to ensure that
 there's at least one binary digit in the string, whereas
 <underscoreBinaryDigits> is just a wildcard anything-goes 
 rule that takes any combination of 0, 1, and _, including the
 empty string.



 Andre Artus wrote:
 The rule that matches the DMD compiler is actually very easy 
 to do in ANTLR4, i.e.
 
 BinaryLiteral   : '0b' [_01]* [01] [_01]* ;
 
 I'm a bit too tired to fully pay attention, but it seems you 
 are saying that "0b" (no additional numbers) should match, 
 which I believe it should not (although I admit to not testing 
 this). If it does then I would consider that a bug.


 H. S. Teoh wrote:
 No, the BNF rules I wrote are equivalent to your ANTLR4 spec. 
 Which is equivalent to the regex I posted later.

I should have paid better attention, as I missed the 
<binaryDigit> in <binaryLiteral>. To be honest I was having a 
hard time focusing due to lack of sleep and a pervading stench of 
paint fumes seeping in from the adjacent building.

 Andre Artus wrote:
 It's not a problem implementing the rule, I am more concerned 
 with documenting it in a clear and unambiguous way so that 
 people building tools from it can get it right. BNF isn't 
 always the easiest way to do so, but it's what being used.


 H. S. Teoh wrote:
 Well, you could bug Walter about what *should* be accepted,

I'm not sure how to go about that.

 H. S. Teoh wrote:
 and if he agrees to restrict it to having _ only between two 
 digits, then you'd file a bug against DMD.

Well if we could get a ruling on this then we could include 
HexadecimalInteger in the ruling as it has similar behaviour in 
DMD.


The current specification for DecimalInteger also allows a 
trailing sequence of underscores. It also does not include the 
sign as part of the token value.

Possible regex alternatives (note I do not include the sign, as 
per current spec).

(0|[1-9]([_]*[0-9])*)

or arguably better
(0|[1-9]([_]?[0-9])*)

 H. S. Teoh wrote:
 Again, I seriously doubt that such a change would cause any 
 code breakage, because writing 0b1 as 0b____1____ is just so 
 ridiculous that any such code *should* be broken.

Agreed.

Aug 17 2013

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

On Sat, Aug 17, 2013 at 11:29:03PM +0200, Andre Artus wrote:
[...]
H. S. Teoh wrote:
I was just using BNF to show that it's possible to specify the
behaviour precisely.  And also that it's rather convoluted just for
something as intuitively straightforward as an integer literal. Which
is a likely reason why the current specs are a bit blurry about what
should/shouldn't be allowed.

 
 I don't think I've seen lexemes defined using (a variant of) BNF
 before, most often a form of regular expressions are used. One could
 cut down and clarify the page describing the lexical syntax
 significantly employing simple regular expressions.

You're right, I think the D specs page on literals using BNF is a bit of
an overkill. Maybe it should be rewritten using regexen. It would be
easier to understand, for one thing.


[...]
H. S. Teoh wrote:
I know that, but I'm saying that hardly *any* code would break if
we made DMD reject things like this. I don't think anybody in
their right mind would write code like that. (Unless they were
competing in the IODCC... :-P)

 
 I agree that the compiler should probably break that code, I believe
 some breaking changes are good when they help the programmer fix
 potential bugs. But I am also someone who compiles with "Treat
 warnings as errors".

Walter is someone who believes that compilers should only have errors,
not warnings. :)


[...]
Andre Artus wrote:
It's not a problem implementing the rule, I am more concerned
with documenting it in a clear and unambiguous way so that
people building tools from it can get it right. BNF isn't always
the easiest way to do so, but it's what being used.


 
H. S. Teoh wrote:
Well, you could bug Walter about what *should* be accepted,

 
 I'm not sure how to go about that.

Email him and ask? :)


H. S. Teoh wrote:
and if he agrees to restrict it to having _ only between two
digits, then you'd file a bug against DMD.

 
 Well if we could get a ruling on this then we could include
 HexadecimalInteger in the ruling as it has similar behaviour in DMD.
 
 The current specification for DecimalInteger also allows a trailing
 sequence of underscores. It also does not include the sign as part
 of the token value.

Yeah that sounds like a bug in the specs.


 Possible regex alternatives (note I do not include the sign, as per
 current spec).
 
 (0|[1-9]([_]*[0-9])*)
 
 or arguably better
 (0|[1-9]([_]?[0-9])*)

[...]

I think it should be:

	(0|[1-9]([0-9]*(_[0-9]+)*)?)

That is, either it's a 0, or a single digit from 1-9, or 1-9 followed by
(zero or more digits 0-9 followed by zero or more (underscore followed
by one or more digits 0-9)). This enforces only a single underscore
between digits, and no preceding/trailing underscores. So it would
exclude things like 12_____34, which is just as ridiculous as 123___,
and only allow 12_34.
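
The three candidate patterns discussed so far can be compared side by side with a short Python sketch. The dictionary labels and the helper name `verdicts` are invented for this example, and the sample inputs are the ones mentioned in the discussion plus a couple of obvious edge cases.

```python
import re

CANDIDATES = {
    "any run of underscores": re.compile(r"0|[1-9](_*[0-9])*"),
    "optional underscore": re.compile(r"0|[1-9](_?[0-9])*"),
    "single separator": re.compile(r"0|[1-9]([0-9]*(_[0-9]+)*)?"),
}


def verdicts(text):
    """Map each candidate's label to whether it matches the whole of `text`."""
    return {name: rx.fullmatch(text) is not None for name, rx in CANDIDATES.items()}


for sample in ["0", "7", "12_34", "12_____34", "123___", "_123"]:
    print(f"{sample:10}", verdicts(sample))
```

Only the first pattern lets 12_____34 through; none of the three accepts a leading or trailing underscore.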


T

-- 
Blunt statements really don't have a point.

Aug 17 2013

"Andre Artus" <andre.artus gmail.com> writes:

 [...]
 H. S. Teoh wrote:
 I was just using BNF to show that it's possible to specify 
 the behaviour precisely.  And also that it's rather 
 convoluted just for something as intuitively straightforward 
 as an integer literal. Which is a likely reason why the 
 current specs are a bit blurry about what should/shouldn't be 
 allowed.



 Andre Artus wrote:
 I don't think I've seen lexemes defined using (a variant of) 
 BNF before, most often a form of regular expressions are used. 
 One could cut down and clarify the page describing the lexical 
 syntax significantly employing simple regular expressions.


 H. S. Teoh wrote:
 You're right, I think the D specs page on literals using BNF is 
 a bit of an overkill. Maybe it should be rewritten using 
 regexen.
 It would be easier to understand, for one thing.

I would not mind doing this, I'll see what Walter says.

It would also be quite easy to generate syntax diagrams from a 
reg-expr.

 [...]
 H. S. Teoh wrote:
 I know that, but I'm saying that hardly *any* code would 
 break if
 we made DMD reject things like this. I don't think anybody in
 their right mind would write code like that. (Unless they were
 competing in the IODCC... :-P)

 
 I agree that the compiler should probably break that code, I 
 believe some breaking changes are good when they help the 
 programmer fix potential bugs. But I am also someone who 
 compiles with "Treat warnings as errors".


 H. S. Teoh wrote:
 Walter is someone who believes that compilers should only have 
 errors, not warnings. :)

That can go both ways, but I suspect you mean that in the good 
way.


 [...]
 Andre Artus wrote:
 It's not a problem implementing the rule, I am more concerned
 with documenting it in a clear and unambiguous way so that
 people building tools from it can get it right. BNF isn't 
 always the easiest way to do so, but it's what being used.


 
 H. S. Teoh wrote:
 Well, you could bug Walter about what *should* be accepted,

 
 I'm not sure how to go about that.


 H. S. Teoh wrote:
 Email him and ask? :)

I'll try that.

 H. S. Teoh wrote:
 and if he agrees to restrict it to having _ only between two
 digits, then you'd file a bug against DMD.

 
 Well if we could get a ruling on this then we could include
 HexadecimalInteger in the ruling as it has similar behaviour 
 in DMD.
 
 The current specification for DecimalInteger also allows a 
 trailing sequence of underscores. It also does not include the 
 sign as part of the token value.


 H. S. Teoh wrote:
 Yeah that sounds like a bug in the specs.

Yes, I believe so. The same issues are under "Floating Point 
Literals". Should be easy to fix.

 Possible regex alternatives (note I do not include the sign, 
 as per current spec).
 
 (0|[1-9]([_]*[0-9])*)
 
 or arguably better
 (0|[1-9]([_]?[0-9])*)

 [...]

 I think it should be:

 	(0|[1-9]([0-9]*(_[0-9]+)*)?)

 That is, either it's a 0, or a single digit from 1-9, or 1-9 
 followed by (zero or more digits 0-9 followed by zero or more 
 (underscore followed by one or more digits 0-9)). This enforces 
 only a single underscore between digits, and no 
 preceding/trailing underscores. So it would exclude things like 
 12_____34, which is just as ridiculous as 123___, and only 
 allow 12_34.

I concur with your assessment.
I believe my second regex is functionally equivalent to the one 
you propose (test results below). Although I would concede that 
yours may be easier to grok.


The following match my regex (assuming whitespace-delimited input):

1
1_1
1_2_3_4_5_6_7_8_9_0	
1234_45_15		
1234567_8_90		
123456789_0		
1_234567890		
12_34567890		
123_4567890
1234_567890
12345_67890
123456_7890
1234567_890
12345678_90
123456789_0
123_45_6_789012345_67890

Whereas these do not

_1
1_
_1_
1______1
-12_34
-1234
123_45_6__789012345_67890
1234567890_
_1234567890_
_1234567890
1234567890_
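
For anyone who wants to re-run the comparison, here is a minimal Python sketch that checks both regexes against a subset of the two lists above. The names ANDRE, TEOH, accepts and the sample lists are made up for illustration, and fullmatch stands in for the "whitespace delimited" assumption (the whole token has to match). It only shows that the two patterns agree on these samples; it is not a proof of equivalence.

```python
import re

# Second regex from this post: at most one underscore before each digit.
ANDRE = re.compile(r"(0|[1-9]([_]?[0-9])*)")
# Regex proposed in the quoted reply.
TEOH = re.compile(r"(0|[1-9]([0-9]*(_[0-9]+)*)?)")

# Samples copied from the lists in the post.
SHOULD_MATCH = ["1", "1_1", "1_2_3_4_5_6_7_8_9_0", "1234_45_15",
                "123_45_6_789012345_67890"]
SHOULD_FAIL = ["_1", "1_", "_1_", "1______1", "-12_34", "-1234",
               "123_45_6__789012345_67890", "1234567890_", "_1234567890_"]


def accepts(pattern, text):
    """True if the whole token matches, not just a prefix of it."""
    return pattern.fullmatch(text) is not None


if __name__ == "__main__":
    for token in SHOULD_MATCH + SHOULD_FAIL:
        print(f"{token!r}: andre={accepts(ANDRE, token)} "
              f"teoh={accepts(TEOH, token)}")
```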

Aug 17 2013

"Andre Artus" <andre.artus gmail.com> writes:

[...]

For fun I made a scanner rule that forces BinaryInteger to 
conform to a power of 2 grouping of nibbles. I think it loses 
its clarity after 16 bits.

I made the underscore optional between nibbles, but required for 
groups of 2 bytes and above.

Some passing cases from my test inputs.

0b00010001
0b0001_0001
0b00010001_0001_0001
0b00010001_00010001
0b0001_0001_00010001
0b00010001_00010001
0b00010001_00010001_00010001_00010001
0b00010001_00010001_00010001_00010001_00010001_00010001_00010001_00010001
0b00010001_00010001_00010001_00010001_00010001_0001_0001_00010001_00010001_00010001_00010001_00010001_00010001_00010001_00010001_00010001_00010001

It loses some of the value of arbitrary grouping, specifically the 
ability to group bits in a bitmask by function.
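
The scanner rule itself is not shown here, so the following Python sketch is only one possible reading of it, under these assumptions: a byte is two nibbles with an optional underscore between them, bytes must be separated by an underscore, and the number of bytes must be a power of two. Rejecting a lone nibble is also an assumption. BYTE, LITERAL and is_grouped_binary are illustrative names, not part of any grammar.

```python
import re

# Assumed shape of one byte: two nibbles, underscore between them optional.
BYTE = r"[01]{4}_?[01]{4}"
# Assumed shape of the literal: bytes separated by a mandatory underscore.
LITERAL = re.compile(rf"0b({BYTE})(_{BYTE})*")


def is_grouped_binary(text):
    """Accept 0b literals made of 1, 2, 4, 8, ... underscore-separated bytes."""
    if LITERAL.fullmatch(text) is None:
        return False
    # Count binary digits after the "0b" prefix, ignoring underscores.
    bits = sum(ch in "01" for ch in text[2:])
    n_bytes = bits // 8
    # A power of two has exactly one bit set.
    return n_bytes & (n_bytes - 1) == 0


if __name__ == "__main__":
    for literal in ["0b00010001", "0b0001_0001", "0b00010001_0001_0001",
                    "0b0001000100010001", "0b00010001_00010001_00010001"]:
        print(literal, is_grouped_binary(literal))
```

Under this reading the first three literals pass, the fourth fails because the two bytes are not separated, and the fifth fails because three bytes is not a power of two.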

Aug 17 2013

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

On Fri, Aug 16, 2013 at 05:50:24PM -0700, H. S. Teoh wrote:
[...]
 <binaryLiteral> ::= "0b" <binaryDigits> <underscoreBinaryDigits>
 
 <binaryDigits> ::= <binaryDigit> <binaryDigits>
 		| <binaryDigit>
 
 <underscoreBinaryDigits> ::= ""
 		| "_" <binaryDigits>
 		| "_" <binaryDigits> <underscoreBinaryDigits>
 
 <binaryDigit> ::= "0"
 		| "1"

Regex equivalent:

	0b(0|1)(0|1)*(_(0|1)(0|1)*)*


[...]
 <binaryLiteral> ::= "0b" <underscoreBinaryDigits> <binaryDigit>
 		<underscoreBinaryDigits>
 
 <underscoreBinaryDigits> ::= "_"
 		| "_" <underscoreBinaryDigits>
 		| <binaryDigit>
 		| <binaryDigit> <underscoreBinaryDigits>
 		| ""
 
 <binaryDigit> ::= "0"
 		| "1"

[...]

Regex equivalent:

	0b(0|1|_)*(0|1)(0|1|_)*
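
To see how the two regexes above differ in practice, here is a small Python sketch that runs both over a few inputs, including the reduced cases from the start of the thread. STRICT, PERMISSIVE and classify are names chosen for illustration; fullmatch is used so the whole token has to match.

```python
import re

# First regex: underscores only between runs of digits.
STRICT = re.compile(r"0b(0|1)(0|1)*(_(0|1)(0|1)*)*")
# Second regex: underscores anywhere, as long as there is at least one digit.
PERMISSIVE = re.compile(r"0b(0|1|_)*(0|1)(0|1|_)*")


def classify(text):
    """Return (strict_accepts, permissive_accepts) for one token."""
    return (STRICT.fullmatch(text) is not None,
            PERMISSIVE.fullmatch(text) is not None)


if __name__ == "__main__":
    for token in ["0b1_0", "0b__1", "0b_1_", "0b1__", "0b_", "0b"]:
        print(token, classify(token))
```

Both accept 0b1_0 and both reject 0b_ and 0b; only the permissive form accepts 0b__1, 0b_1_ and 0b1__.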


T

-- 
"How are you doing?" "Doing what?"

Aug 16 2013

"Andre Artus" <andre.artus gmail.com> writes:

On Friday, 16 August 2013 at 20:00:35 UTC, Brian Schott wrote:
 I've been doing some work with the language grammar 
 specification. You may find these resources useful:

 http://d.puremagic.com/issues/show_bug.cgi?id=10233
 https://github.com/Hackerpilot/DGrammar/blob/master/D.g4

I have fixed up a few issues in DGrammar.g4, I will put them up 
on GitHub if you are interested.

According to The Definitive ANTLR Reference, the following 
words are reserved in ANTLR grammars:
import, fragment, lexer, parser, grammar, returns, locals, 
throws, *catch*, *finally*, mode, options, tokens.

The two I marked above caused problems when generating.
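
In case it is useful, here is a rough Python sketch of how one might scan a grammar file for rule names that collide with that list. It assumes the collisions are rule names at the start of a line followed by a colon; RESERVED is copied from the list above rather than checked against ANTLR itself, and RULE_HEADER, reserved_rule_names and SAMPLE are made up for illustration (SAMPLE is not taken from D.g4).

```python
import re

# Reserved words exactly as listed in this post (not verified against ANTLR).
RESERVED = {"import", "fragment", "lexer", "parser", "grammar", "returns",
            "locals", "throws", "catch", "finally", "mode", "options",
            "tokens"}

# Assumed rule-header shape: identifier at line start, then a colon.
RULE_HEADER = re.compile(r"^([A-Za-z_][A-Za-z0-9_]*)\s*:", re.MULTILINE)


def reserved_rule_names(grammar_text):
    """Return the sorted rule names that appear in the reserved list."""
    names = RULE_HEADER.findall(grammar_text)
    return sorted({name for name in names if name in RESERVED})


# Hypothetical grammar fragment, for illustration only.
SAMPLE = """\
tryStatement : 'try' statement catch+ finally? ;
catch : 'catch' '(' type ')' statement ;
finally : 'finally' statement ;
"""

if __name__ == "__main__":
    print(reserved_rule_names(SAMPLE))
```

Renaming the flagged rules (for example to catchClause and finallyClause, names picked arbitrarily here) would be one way to sidestep the clash.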

I don't know whether you are in the middle of trying to fix the 
indirect left recursion issue but I see that terminals tied to 
"unaryExpression" are duplicated all over the place. I can fix 
the recursion issue, and clean up the dups if that would help you.

Aug 16 2013

"Brian Schott" <briancschott gmail.com> writes:

On Friday, 16 August 2013 at 23:07:38 UTC, Andre Artus wrote:
 On Friday, 16 August 2013 at 20:00:35 UTC, Brian Schott wrote:
 I've been doing some work with the language grammar 
 specification. You may find these resources useful:

 http://d.puremagic.com/issues/show_bug.cgi?id=10233
 https://github.com/Hackerpilot/DGrammar/blob/master/D.g4

 I have fixed up a few issues in DGrammar.g4, I will put them up 
 on GitHub if you are interested.

 According to The Definitive ANTLR Reference, the following 
 words are reserved in ANTLR grammars:
 import, fragment, lexer, parser, grammar, returns, locals, 
 throws, *catch*, *finally*, mode, options, tokens.

 The two I marked above caused problems when generating.

I must have missed those when I pulled the grammar out of my 
parser's DDOC comments.

 I don't know whether you are in the middle of trying to fix the 
 indirect left recursion issue but I see that terminals tied to 
 "unaryExpression" are duplicated all over the place. I can fix 
 the recursion issue, and clean up the dups if that would help 
 you.

It would. I'm not actively working on that grammar.

Aug 16 2013
