digitalmars.D - Official D Grammar
- Brian Schott (41/41) Apr 01 2013 I've pretty much finished up my work on the std.d.lexer module. I
- Walter Bright (5/37) Apr 01 2013 Yes, this needs to be fixed.
- Bruno Medeiros (18/26) Apr 09 2013 I don't think that kind of grammar issue is too annoying, since it's
- Walter Bright (2/9) Apr 20 2013 I think the special case is worth it.
- Timon Gehr (2/14) Apr 20 2013 Without doubt.
- Jacob Carlborg (4/11) Apr 09 2013 Just define that @safe should be a token?
-
Stewart Gordon
(10/17)
Apr 23 2013
- Diggory (4/4) Apr 23 2013 If at some point custom attributes will be allowed, @ would be a
- Jonathan M Davis (4/9) Apr 23 2013 We already have custom attributes using @. They're called User Defined
- Jonathan M Davis (7/10) Apr 23 2013 That would mean creating more keywords, which would break code. By using...
- Jacob Carlborg (5/10) Apr 23 2013 You can't use the same reasoning anymore since we have UDA. Those will
- Jonathan M Davis (15/24) Apr 25 2013 True. Adding @blah now risks conflicts where it didn't before, but it st...
- Paulo Pinto (12/27) Apr 24 2013 Please don't.
-
Stewart Gordon
(16/23)
Apr 02 2013
- Jacob Carlborg (5/18) Apr 02 2013 I'm wondering if it's possibly to mechanically check that what's in the
- Tobias Pankrath (5/7) Apr 02 2013 Take the grammar and (randomly) generate strings with it and
- Christopher Bergqvist (7/14) Apr 08 2013 An alternative idea for ensuring that documentation and
- Tobias Pankrath (11/28) Apr 08 2013 I know but the parser is currently hand written and I think
- Philippe Sigaud (3/6) Apr 08 2013 Yes, it comes with the project. But, it's still buggy (sometimes due to ...
- Bruno Medeiros (23/57) Apr 06 2013 Interesting thread. I've been working on a hand-written D parser (in
- Bruno Medeiros (9/12) Apr 06 2013 BTW, even in the lexer spec I've found an issue. How does this parse:
- Jonathan M Davis (4/15) Apr 06 2013 It would definitely have to be INTEGER DOT IDENTIFIER due to UFCS, so it...
- Artur Skawina (18/28) Apr 06 2013 "1..2", "1.ident" and a float literal with '_' after the '.' are the
- Bruno Medeiros (12/27) Apr 06 2013 The "1..2" is actually mentioned in the spec:
- Artur Skawina (11/33) Apr 07 2013 What does the "spec" currently say about ".001"?..
- Bruno Medeiros (9/13) Apr 09 2013 Who says its a bug? From my understanding, this exception is there on
- Artur Skawina (17/29) Apr 09 2013 It's a bug, because the grammar does not correctly describe the rules. A
- Brian Schott (9/9) Apr 20 2013 I've moved my work on the grammar to the following location on
- Dmitry Olshansky (5/14) Apr 20 2013 Bookmarked for now ;)
- Brian Schott (5/7) Apr 26 2013 I'm beginning to think that ANTRL is not up to the task either.
- Brian Schott (8/12) Apr 28 2013 I think I'm done with this grammar for now. If there's any
- Andrei Alexandrescu (4/13) Apr 28 2013 Great.
- Brian Schott (6/6) May 12 2013 I wrote a small program to take the grammar and generate input to
- Brian Schott (13/13) Jun 03 2013 As threatened at DConf, I've started filing bugs against the
- Brian Schott (10/10) Jul 19 2013 Status update:
- Mr. Anonymous (3/13) Jul 19 2013 https://rawgithub.com/Hackerpilot/DGrammar/master/grammar.html
- Brian Schott (2/4) Jul 19 2013 I learn something new every day.
- H. S. Teoh (13/25) Jul 19 2013 Really? Those still exist in Phobos? Shouldn't they be rewritten to the
- Brian Schott (3/6) Jul 19 2013 Yes.
- Johannes Pfau (15/28) Jul 19 2013 Nice work!
- Jacob Carlborg (4/14) Jul 20 2013 Now for the million dollar question: does it work during compile time?
- Michael (3/6) Jul 20 2013 +1
- Tobias Pankrath (3/10) May 01 2013 Do you know, which parts of the grammar make the parser so slow?
- Russel Winder (17/27) May 01 2013 Which other parser generators? cf. for example,
- Bruno Medeiros (15/20) May 02 2013 Half hour??? That seems like a massive amount of time, even if the
- Russel Winder (16/24) May 02 2013 =20
- Tobias Pankrath (3/20) May 03 2013 LALR(1) is O(n) last time I looked. Both Ll(k) and LALR(k) should
- Brian Schott (2/4) May 02 2013 That or buy the documentation. Or both...
I've pretty much finished up my work on the std.d.lexer module. I am waiting for the review queue to make some progress on the other (three?) modules being reviewed before starting a thread on it. In the meantime I've started some work on an AST module for Phobos that contains the data types necessary to build up a parser module so that we can have a standard set of code build D dev tools off of. I decided to work directly from the standard on dlang.org for this to make sure that my module is correct and that the standard is actually correct. I've seen several threads on this newsgroup complaining about the state of the standard and unfortunately this will be another one. 1) Grammar defined in terms of things that aren't tokens. Take, for example, PropertyDeclaration. It's defined as an " " token followed by... what? "safe"? It's not a real token. It's an identifier. You can't parse this based on checking the token type. You have to check the type and the value. 2) Grammar references rules that don't exist. UserDefinedAttribute is defined in terms of CallExpression, but CallExpression doesn't exist elsewhere in the grammar. BaseInterfaceList is defined in terms of InterfaceClasses, but that rule is never defined. 3) Unnecessary rules. KeyExpression, ValueExpression, ScopeBlockStatement, DeclarationStatement, ThenStatement, ElseStatement, Test, Increment, Aggregate, LwrExpression, UprExpression, FirstExp, LastExp, StructAllocator, StructDeallocator, EnumTag, EnumBaseType, EmptyEnumBody, ConstraintExpression, MixinIdentifier, etc... are all defined in terms of only one other rule. I think that we need to be able to create a grammar description that: * Fits in to a single file, so that a tool implementer does not need to collect bits of the grammar from the various pages on dlang.org. * Can be verified to be correct by an existing tool such as Bison, Goldie, JavaCC, <your favorite here> with a small number of changes. * Is part of the dmd/dlang repositories on github and gets updated every time the language changes. I'm willing to work on this if there's a good chance it will actually be implemented. Thoughts?
Apr 01 2013
On 4/1/2013 4:18 PM, Brian Schott wrote:I've pretty much finished up my work on the std.d.lexer module. I am waiting for the review queue to make some progress on the other (three?) modules being reviewed before starting a thread on it. In the meantime I've started some work on an AST module for Phobos that contains the data types necessary to build up a parser module so that we can have a standard set of code build D dev tools off of. I decided to work directly from the standard on dlang.org for this to make sure that my module is correct and that the standard is actually correct. I've seen several threads on this newsgroup complaining about the state of the standard and unfortunately this will be another one. 1) Grammar defined in terms of things that aren't tokens. Take, for example, PropertyDeclaration. It's defined as an " " token followed by... what? "safe"? It's not a real token. It's an identifier. You can't parse this based on checking the token type. You have to check the type and the value.True, do you have a suggestion?2) Grammar references rules that don't exist. UserDefinedAttribute is defined in terms of CallExpression, but CallExpression doesn't exist elsewhere in the grammar. BaseInterfaceList is defined in terms of InterfaceClasses, but that rule is never defined.Yes, this needs to be fixed.3) Unnecessary rules. KeyExpression, ValueExpression, ScopeBlockStatement, DeclarationStatement, ThenStatement, ElseStatement, Test, Increment, Aggregate, LwrExpression, UprExpression, FirstExp, LastExp, StructAllocator, StructDeallocator, EnumTag, EnumBaseType, EmptyEnumBody, ConstraintExpression, MixinIdentifier, etc... are all defined in terms of only one other rule.Using these makes documentation easier, and I don't think it harms anything.I think that we need to be able to create a grammar description that: * Fits in to a single file, so that a tool implementer does not need to collect bits of the grammar from the various pages on dlang.org. * Can be verified to be correct by an existing tool such as Bison, Goldie, JavaCC, <your favorite here> with a small number of changes. * Is part of the dmd/dlang repositories on github and gets updated every time the language changes. I'm willing to work on this if there's a good chance it will actually be implemented. Thoughts?I suggest doing this as a sequence of pull requests, not doing just one big one.
Apr 01 2013
On 02/04/2013 03:13, Walter Bright wrote:I don't think that kind of grammar issue is too annoying, since it's easy to understand what the intended behavior is (in this case at least). But to fix it, well, we can have just that: have the grammar say it should parse an identifier after the , and then issue a semantic error of sorts if the value is not one of the expected special values (safe, etc.). Parsing an identifier here is in any case the best error recovery strategy anyways. A bit more annoying is the case with the extern declaration, with the C++ parameter: extern(C++) here you have to look at a special identifier (the C, D, PASCAL part) and see if there is a ++ token ahead, it's a bit more of special-casing in the parser. Here I think it would have been better to change the the language itself and use "CPP" instead of "C++". A minor simplification. -- Bruno Medeiros - Software Engineer1) Grammar defined in terms of things that aren't tokens. Take, for example, PropertyDeclaration. It's defined as an " " token followed by... what? "safe"? It's not a real token. It's an identifier. You can't parse this based on checking the token type. You have to check the type and the value.True, do you have a suggestion?
Apr 09 2013
On 4/9/2013 3:20 AM, Bruno Medeiros wrote:A bit more annoying is the case with the extern declaration, with the C++ parameter: extern(C++) here you have to look at a special identifier (the C, D, PASCAL part) and see if there is a ++ token ahead, it's a bit more of special-casing in the parser. Here I think it would have been better to change the the language itself and use "CPP" instead of "C++". A minor simplification.I think the special case is worth it.
Apr 20 2013
On 04/20/2013 08:36 PM, Walter Bright wrote:On 4/9/2013 3:20 AM, Bruno Medeiros wrote:Without doubt.A bit more annoying is the case with the extern declaration, with the C++ parameter: extern(C++) here you have to look at a special identifier (the C, D, PASCAL part) and see if there is a ++ token ahead, it's a bit more of special-casing in the parser. Here I think it would have been better to change the the language itself and use "CPP" instead of "C++". A minor simplification.I think the special case is worth it.
Apr 20 2013
On 2013-04-02 04:13, Walter Bright wrote:Just define that safe should be a token? -- /Jacob Carlborg1) Grammar defined in terms of things that aren't tokens. Take, for example, PropertyDeclaration. It's defined as an " " token followed by... what? "safe"? It's not a real token. It's an identifier. You can't parse this based on checking the token type. You have to check the type and the value.True, do you have a suggestion?
Apr 09 2013
On 02/04/2013 03:13, Walter Bright wrote:On 4/1/2013 4:18 PM, Brian Schott wrote:<snip><snip> Er, do away with the meaningless ? Since nobody seems to have succeeded in explaining how the -attributes differ from the rest, it seems the right way to go. Stewart. -- My email address is valid but not my primary mailbox and not checked regularly. Please keep replies on the 'group where everybody may benefit.1) Grammar defined in terms of things that aren't tokens. Take, for example, PropertyDeclaration. It's defined as an " " token followed by... what? "safe"? It's not a real token. It's an identifier. You can't parse this based on checking the token type. You have to check the type and the value.True, do you have a suggestion?
Apr 23 2013
If at some point custom attributes will be allowed, would be a nice syntax for that. Things like safe could be attributes implemented by the standard library - would cut down on the number of keywords.
Apr 23 2013
On Wednesday, April 24, 2013 00:34:54 Diggory wrote:If at some point custom attributes will be allowed, would be a nice syntax for that. Things like safe could be attributes implemented by the standard library - would cut down on the number of keywords.We already have custom attributes using . They're called User Defined Attributes and were added a couple releases ago IIRC. - Jonathan M Davis
Apr 23 2013
On Tuesday, April 23, 2013 21:51:54 Stewart Gordon wrote:Er, do away with the meaningless ? Since nobody seems to have succeeded in explaining how the -attributes differ from the rest, it seems the right way to go.That would mean creating more keywords, which would break code. By using , we avoid having to create new keywords, which I believe was the whole point in the first place. Which attributes got on them was fairly arbitrary, but they do definitely serve a purpose. And any new attributes in the future will probably have on them for the same reason. - Jonathan M Davis
Apr 23 2013
On 2013-04-24 03:40, Jonathan M Davis wrote:That would mean creating more keywords, which would break code. By using , we avoid having to create new keywords, which I believe was the whole point in the first place. Which attributes got on them was fairly arbitrary, but they do definitely serve a purpose. And any new attributes in the future will probably have on them for the same reason.You can't use the same reasoning anymore since we have UDA. Those will conflict with new keywords starting with an . -- /Jacob Carlborg
Apr 23 2013
On Wednesday, April 24, 2013 08:39:21 Jacob Carlborg wrote:On 2013-04-24 03:40, Jonathan M Davis wrote:True. Adding blah now risks conflicts where it didn't before, but it still risks fewer conflicts that adding a keyword, since a keyword could be used anywhere, so adding blah as a keyword would conflict with all uses of blah - including blah - whereas adding blah would just conflict with blah. There was also some discussion on how to make it so that blah might not cause conflicts (via namespacing or somesuch), but I don't recall if anything came of that or how much of applies to the actual UDA implementation. So, it makes more sense to go with adding an attribute than a keyword if we have to add a new attribute, because it'll conflict with less, but it's definitely true that we no longer have a way to avoid creating conflicts when adding attributes without altering the grammar in a manner similar to when we made it possible to have attributes starting with . So, it's that much more important that we only add new attributes when we really need them. - Jonathan M DavisThat would mean creating more keywords, which would break code. By using , we avoid having to create new keywords, which I believe was the whole point in the first place. Which attributes got on them was fairly arbitrary, but they do definitely serve a purpose. And any new attributes in the future will probably have on them for the same reason.You can't use the same reasoning anymore since we have UDA. Those will conflict with new keywords starting with an .
Apr 25 2013
On Wednesday, 24 April 2013 at 01:40:52 UTC, Jonathan M Davis wrote:On Tuesday, April 23, 2013 21:51:54 Stewart Gordon wrote:Please don't. This was the approach taken by Java with the overload and friends, which leads to ugly code in my opinion. Once I worked in a project which took a similar approach, we would have annotations as a way to extend the core grammar. Each team eventually grew its own set of annotations, to the point the amount of annotations became bigger than the core grammar itself, with overlapping meaning. -- PauloEr, do away with the meaningless ? Since nobody seems to have succeeded in explaining how the -attributes differ from the rest, it seems the right way to go.That would mean creating more keywords, which would break code. By using , we avoid having to create new keywords, which I believe was the whole point in the first place. Which attributes got on them was fairly arbitrary, but they do definitely serve a purpose. And any new attributes in the future will probably have on them for the same reason. - Jonathan M Davis
Apr 24 2013
On 02/04/2013 00:18, Brian Schott wrote: <snip>I think that we need to be able to create a grammar description that: * Fits in to a single file, so that a tool implementer does not need to collect bits of the grammar from the various pages on dlang.org. * Can be verified to be correct by an existing tool such as Bison, Goldie, JavaCC, <your favorite here> with a small number of changes. * Is part of the dmd/dlang repositories on github and gets updated every time the language changes.<snip> Indeed, the published grammar needs to be thoroughly checked against what DMD is actually doing, and any discrepancies fixed (or filed in Bugzilla to be fixed in due course). And then they need to be kept in sync. Has the idea of using a parser generator to build D's parsing code been rejected in the past, or is hand-coding just the way Walter decided to do it? Is the code any more efficient than what a typical parser generator would generate? And all disambiguation rules (such as "if it's parseable as a DeclarationStatement, it's a DeclarationStatement") need to be made explicit as part of the grammar. I suppose this is where using Bison or similar would help, as it would point out any ambiguities in the grammar that need rules to resolve them. Stewart.
Apr 02 2013
On 2013-04-02 15:21, Stewart Gordon wrote:Indeed, the published grammar needs to be thoroughly checked against what DMD is actually doing, and any discrepancies fixed (or filed in Bugzilla to be fixed in due course). And then they need to be kept in sync. Has the idea of using a parser generator to build D's parsing code been rejected in the past, or is hand-coding just the way Walter decided to do it? Is the code any more efficient than what a typical parser generator would generate? And all disambiguation rules (such as "if it's parseable as a DeclarationStatement, it's a DeclarationStatement") need to be made explicit as part of the grammar. I suppose this is where using Bison or similar would help, as it would point out any ambiguities in the grammar that need rules to resolve them.I'm wondering if it's possibly to mechanically check that what's in the grammar is how DMD behaves. -- /Jacob Carlborg
Apr 02 2013
I'm wondering if it's possibly to mechanically check that what's in the grammar is how DMD behaves.Take the grammar and (randomly) generate strings with it and check if DMD does complain. You'd need a parse only don't check semantics flag, though. This will not check if the strings are parsed correctly by DMD nor if invalid strings are rejected. But it would be a start.
Apr 02 2013
On Tuesday, 2 April 2013 at 19:00:21 UTC, Tobias Pankrath wrote:An alternative idea for ensuring that documentation and implementation are in sync might be to list the full grammar definition as a data structure that can both be used as input for the parser and as input for a tool that generates the documentation. Theoretically possible, :) just look at Philippe Sigaud's Pegged.I'm wondering if it's possibly to mechanically check that what's in the grammar is how DMD behaves.Take the grammar and (randomly) generate strings with it and check if DMD does complain. You'd need a parse only don't check semantics flag, though. This will not check if the strings are parsed correctly by DMD nor if invalid strings are rejected. But it would be a start.
Apr 08 2013
On Monday, 8 April 2013 at 21:50:12 UTC, Christopher Bergqvist wrote:On Tuesday, 2 April 2013 at 19:00:21 UTC, Tobias Pankrath wrote:I know but the parser is currently hand written and I think Walter will only accept an auto generated parser if it is as fast as the current solution. However in an old discussion someone said that the D grammar isn't LALR(1) or LR(1), so I don't think that is possible with current D parser generators. Do we have a pegged grammar for D? Another think is that for documentation purposes you might want to have an more readable but ambiguous grammar than your generator of choice will accept.An alternative idea for ensuring that documentation and implementation are in sync might be to list the full grammar definition as a data structure that can both be used as input for the parser and as input for a tool that generates the documentation. Theoretically possible, :) just look at Philippe Sigaud's Pegged.I'm wondering if it's possibly to mechanically check that what's in the grammar is how DMD behaves.Take the grammar and (randomly) generate strings with it and check if DMD does complain. You'd need a parse only don't check semantics flag, though. This will not check if the strings are parsed correctly by DMD nor if invalid strings are rejected. But it would be a start.
Apr 08 2013
However in an old discussion someone said that the D grammar isn't LALR(1) or LR(1), so I don't think that is possible with current D parser generators. Do we have a pegged grammar for D?Yes, it comes with the project. But, it's still buggy (sometimes due to my own mistakes, sometimes due to plain errors in the online D grammar). And the generated parser is quite slow, halas.
Apr 08 2013
On 02/04/2013 00:18, Brian Schott wrote:I've pretty much finished up my work on the std.d.lexer module. I am waiting for the review queue to make some progress on the other (three?) modules being reviewed before starting a thread on it. In the meantime I've started some work on an AST module for Phobos that contains the data types necessary to build up a parser module so that we can have a standard set of code build D dev tools off of. I decided to work directly from the standard on dlang.org for this to make sure that my module is correct and that the standard is actually correct. I've seen several threads on this newsgroup complaining about the state of the standard and unfortunately this will be another one. 1) Grammar defined in terms of things that aren't tokens. Take, for example, PropertyDeclaration. It's defined as an " " token followed by... what? "safe"? It's not a real token. It's an identifier. You can't parse this based on checking the token type. You have to check the type and the value. 2) Grammar references rules that don't exist. UserDefinedAttribute is defined in terms of CallExpression, but CallExpression doesn't exist elsewhere in the grammar. BaseInterfaceList is defined in terms of InterfaceClasses, but that rule is never defined. 3) Unnecessary rules. KeyExpression, ValueExpression, ScopeBlockStatement, DeclarationStatement, ThenStatement, ElseStatement, Test, Increment, Aggregate, LwrExpression, UprExpression, FirstExp, LastExp, StructAllocator, StructDeallocator, EnumTag, EnumBaseType, EmptyEnumBody, ConstraintExpression, MixinIdentifier, etc... are all defined in terms of only one other rule. I think that we need to be able to create a grammar description that: * Fits in to a single file, so that a tool implementer does not need to collect bits of the grammar from the various pages on dlang.org. * Can be verified to be correct by an existing tool such as Bison, Goldie, JavaCC, <your favorite here> with a small number of changes. * Is part of the dmd/dlang repositories on github and gets updated every time the language changes. I'm willing to work on this if there's a good chance it will actually be implemented. Thoughts?Interesting thread. I've been working on a hand-written D parser (in Java, for the DDT IDE) and I too have found a slew of grammar spec issues. Some of them more serious than the ones you mentioned above. In same cases it's actually not clear, or downright wrong what the grammar spec says. For example, here's one off of my notes: void func(int foo() { } ); The spec says that is parsable (basically a function declaration in the parameter list), which makes no sense, and DMD doesn't accept. Some cases are a bit trickier, since it's not clear if the syntax should be accepted or not (sometimes they might make sense but not be allowed). These issues make things a bit harder for tools development that require D language parsers. But the whole grammar spec is so messy, I've been unsure whether it's worth filling bug reports or not (would they be addressed?). There is also the problem that even if those issues are fixed now, the spec could very easily fall out of date in the future, unless we have some system to test the spec. Like you mentioned, ideally we would have a grammar spec for a grammar/PG tool so that correctness could more easily be verified. (it doesn't guarantee no spec bugs, but it makes it much harder for them to be there) -- Bruno Medeiros - Software Engineer
Apr 06 2013
On 02/04/2013 00:18, Brian Schott wrote:I've pretty much finished up my work on the std.d.lexer module. I am waiting for the review queue to make some progress on the other (three?) modules being reviewed before starting a thread on it.BTW, even in the lexer spec I've found an issue. How does this parse: 5.blah According to the spec (maximal munch technique), it should be FLOAT then IDENTIFIER. But DMD parses it as INTEGER DOT IDENTIFIER. I'm assuming the lastest is the correct behavior, so you can write stuff like 123.init, but that should be clarified. -- Bruno Medeiros - Software Engineer
Apr 06 2013
On Saturday, April 06, 2013 16:21:12 Bruno Medeiros wrote:On 02/04/2013 00:18, Brian Schott wrote:It would definitely have to be INTEGER DOT IDENTIFIER due to UFCS, so it sounds like the spec wasn't updated like it should have been. - Jonathan M DavisI've pretty much finished up my work on the std.d.lexer module. I am waiting for the review queue to make some progress on the other (three?) modules being reviewed before starting a thread on it.BTW, even in the lexer spec I've found an issue. How does this parse: 5.blah According to the spec (maximal munch technique), it should be FLOAT then IDENTIFIER. But DMD parses it as INTEGER DOT IDENTIFIER. I'm assuming the lastest is the correct behavior, so you can write stuff like 123.init, but that should be clarified.
Apr 06 2013
On 04/06/13 17:21, Bruno Medeiros wrote:On 02/04/2013 00:18, Brian Schott wrote:"1..2", "1.ident" and a float literal with '_' after the '.' are the DecimalFloat cases that I immediately ran into when doing a lexer based on the dlang grammar. It's obvious to a human how these should be handled, but code generators aren't that smart... But they are good at catching mistakes like these. Actually, that last case is even more "interesting"; http://dlang.org/lex.html has "1_2_3_4_5_6_._5_6_7_8" as a valid example, which of course it's not ("_5_6_7_8" is a valid identifier), but there is no reason do disallow "1_2_3_4_5_6_.5_6_7_8".I've pretty much finished up my work on the std.d.lexer module. I am waiting for the review queue to make some progress on the other (three?) modules being reviewed before starting a thread on it.BTW, even in the lexer spec I've found an issue. How does this parse: 5.blah According to the spec (maximal munch technique), it should be FLOAT then IDENTIFIER. But DMD parses it as INTEGER DOT IDENTIFIER. I'm assuming the lastest is the correct behavior, so you can write stuff like 123.init, but that should be clarified.that should be clarified.These are just grammar bugs, that could easily be fixed. Then there are some things that can be less obvious, but shouldn't really be controversial like allowing empty HexString literals. Then there's the enhancement category. Looking through my comments, I think the only deliberate change from dlang.org that I have is in DelimitedString -- there is no reason to forbid q"/abc/def/"; there are no back-compat issues, as it couldn't have existed in legacy D code. artur
Apr 06 2013
On 06/04/2013 20:52, Artur Skawina wrote:On 04/06/13 17:21, Bruno Medeiros wrote:The "1..2" is actually mentioned in the spec: "An exception to this rule is that a .. embedded inside what looks like two floating point literals, as in 1..2, is interpreted as if the .. was separated by a space from the first integer." so it's there, even if it can be missed. But unless I missed it, the spec is incorrect for the "1.ident" or "1_2_3_4_5_6_._5_6_7_8" cases as there is no exception mentioned there... and it's not always 100% obvious to a human how these should be handled. Or maybe that's just me :) -- Bruno Medeiros - Software EngineerOn 02/04/2013 00:18, Brian Schott wrote:"1..2", "1.ident" and a float literal with '_' after the '.' are the DecimalFloat cases that I immediately ran into when doing a lexer based on the dlang grammar. It's obvious to a human how these should be handled, but code generators aren't that smart... But they are good at catching mistakes like these.I've pretty much finished up my work on the std.d.lexer module. I am waiting for the review queue to make some progress on the other (three?) modules being reviewed before starting a thread on it.BTW, even in the lexer spec I've found an issue. How does this parse: 5.blah According to the spec (maximal munch technique), it should be FLOAT then IDENTIFIER. But DMD parses it as INTEGER DOT IDENTIFIER. I'm assuming the lastest is the correct behavior, so you can write stuff like 123.init, but that should be clarified.
Apr 06 2013
On 04/07/13 00:35, Bruno Medeiros wrote:On 06/04/2013 20:52, Artur Skawina wrote:I know, but documenting a (grammar) bug does not make it go away.On 04/06/13 17:21, Bruno Medeiros wrote:The "1..2" is actually mentioned in the spec: "An exception to this rule is that a .. embedded inside what looks like two floating point literals, as in 1..2, is interpreted as if the .. was separated by a space from the first integer." so it's there, even if it can be missed.On 02/04/2013 00:18, Brian Schott wrote:"1..2", "1.ident" and a float literal with '_' after the '.' are the DecimalFloat cases that I immediately ran into when doing a lexer based on the dlang grammar. It's obvious to a human how these should be handled, but code generators aren't that smart... But they are good at catching mistakes like these.I've pretty much finished up my work on the std.d.lexer module. I am waiting for the review queue to make some progress on the other (three?) modules being reviewed before starting a thread on it.BTW, even in the lexer spec I've found an issue. How does this parse: 5.blah According to the spec (maximal munch technique), it should be FLOAT then IDENTIFIER. But DMD parses it as INTEGER DOT IDENTIFIER. I'm assuming the lastest is the correct behavior, so you can write stuff like 123.init, but that should be clarified.But unless I missed it, the spec is incorrect for the "1.ident" or "1_2_3_4_5_6_._5_6_7_8" cases as there is no exception mentioned there... and it's not always 100% obvious to a human how these should be handled. Or maybe that's just me :)What does the "spec" currently say about ".001"?.. It's been a while since I did a d lexer, based on the dlang grammar - it (the lexer) was supposed to be dmd compatible. Took a closer look at the actual dlang.org rules today while writing this message... Will try to find some time to clean up and convert a working D lexical grammar to PEG; what i have should be 1:1 translatable, except one rule (DelimitedString) and put it on the wiki. Maybe it will help someone avoid these issues. artur
Apr 07 2013
On 07/04/2013 16:14, Artur Skawina wrote:Who says its a bug? From my understanding, this exception is there on purpose, to make it easier to use the DOT_DOT operator in the slice expresions: foo[1..2] // It would be silly to have to put a space after the 1 At most you could make a case that "1." shouldn't ever parse as float, that the decimal part should be required if the dot is present. -- Bruno Medeiros - Software EngineerThe "1..2" is actually mentioned in the spec:I know, but documenting a (grammar) bug does not make it go away."An exception to this rule is that a .. embedded inside what looks like two floating point literals, as in 1..2, is interpreted as if the .. was separated by a space from the first integer." so it's there, even if it can be missed.
Apr 09 2013
On 04/09/13 12:24, Bruno Medeiros wrote:On 07/04/2013 16:14, Artur Skawina wrote:It's a bug, because the grammar does not correctly describe the rules. A parser/lexer based on the grammar alone will not work. Documenting the exception(s) helps the human, but doesn't make the grammar correct. I've started the PEG conversion of my lexer rules and the relevant one looks like this: DecimalFloat: (LeadingDecimal "." !"." !IdentifierStart DecimalDigitsNoSingleUS DecimalExponent) / (LeadingDecimal "." !"." !IdentifierStart DigitUS*) / ("." LeadingDecimal DecimalExponent?) This works as-is, w/o any extra info - the working lexer is mechanically generated from aot this rule. (It differs from the dlang definition in at least four ways - "1..2", "1.ident", "1.000_1" and ".001") I'll try to finish the conversion and post the whole lexical grammar in a couple days (have almost no D-time right now). arturWho says its a bug? From my understanding, this exception is there on purpose, to make it easier to use the DOT_DOT operator in the slice expresions: foo[1..2] // It would be silly to have to put a space after the 1 At most you could make a case that "1." shouldn't ever parse as float, that the decimal part should be required if the dot is present.The "1..2" is actually mentioned in the spec:I know, but documenting a (grammar) bug does not make it go away."An exception to this rule is that a .. embedded inside what looks like two floating point literals, as in 1..2, is interpreted as if the .. was separated by a space from the first integer." so it's there, even if it can be missed.
Apr 09 2013
I've moved my work on the grammar to the following location on Github: https://github.com/Hackerpilot/DGrammar This uses ANTLR, as the other parser generators can't handle D's grammar. Several rules from the official grammar were removed, and several others were added (Such as an actual rule for a function declaration...) I also tried to fix any inacuracies or omissions I came across in the online documentation. Comments, issues, and pull requests welcome.
Apr 20 2013
20-Apr-2013 12:31, Brian Schott пишет:I've moved my work on the grammar to the following location on Github: https://github.com/Hackerpilot/DGrammar This uses ANTLR, as the other parser generators can't handle D's grammar.Great. IMHO ANTLR is one of the sanest.Several rules from the official grammar were removed, and several others were added (Such as an actual rule for a function declaration...) I also tried to fix any inacuracies or omissions I came across in the online documentation. Comments, issues, and pull requests welcome.Bookmarked for now ;) -- Dmitry Olshansky
Apr 20 2013
On Saturday, 20 April 2013 at 08:31:34 UTC, Brian Schott wrote:This uses ANTLR, as the other parser generators can't handle D's grammar.I'm beginning to think that ANTRL is not up to the task either. I've somehow managed to get the grammar to the point where it correctly parses several phobos modules but takes a half hour to do so.
Apr 26 2013
On Friday, 26 April 2013 at 20:44:12 UTC, Brian Schott wrote:I'm beginning to think that ANTRL is not up to the task either. I've somehow managed to get the grammar to the point where it correctly parses several phobos modules but takes a half hour to do so.I think I'm done with this grammar for now. If there's any interest, I can use it to create syntax diagrams for inclusion on dlang.org. I'll also be filing a series of bug reports over the next few days. (Two recent examples: 0.0L is not valid according to the lexical standard, and deprecated("string") is not documented anywhere)
Apr 28 2013
On 4/28/13 8:16 PM, Brian Schott wrote:On Friday, 26 April 2013 at 20:44:12 UTC, Brian Schott wrote:Yes! This is awesome. Thanks for this work.I'm beginning to think that ANTRL is not up to the task either. I've somehow managed to get the grammar to the point where it correctly parses several phobos modules but takes a half hour to do so.I think I'm done with this grammar for now. If there's any interest, I can use it to create syntax diagrams for inclusion on dlang.org.I'll also be filing a series of bug reports over the next few days. (Two recent examples: 0.0L is not valid according to the lexical standard, and deprecated("string") is not documented anywhere)Great. Andrei
Apr 28 2013
I wrote a small program to take the grammar and generate input to the dot program from Graphviz. You can generate the diagrams with some scripts available on Github or view a prebuilt version here: http://hackerpilot.org/d/dgrammar.zip Please take a look at this and let me know if you find any mistakes.
May 12 2013
As threatened at DConf, I've started filing bugs against the grammar specification. Anyone interested can track bug 10233[1], which I've marked as blocked by the various issues I've been finding. As usual, my best guess at D's actual grammar[2] is located here[3]. Top secret project is hinted at here[4]. [1] http://d.puremagic.com/issues/show_bug.cgi?id=10233 [2] I disclaim all responsibility for injuries sustained while looking at the functionDefinition rule. [3] https://github.com/Hackerpilot/DGrammar/blob/master/D.g4 [4] http://hackerpilot.github.io/experimental/std_lexer/phobos/parser.html
Jun 03 2013
Status update: My parser is at a point where it can parse all of phobos (well, excluding c-style array declarations, which I'm convinced are broken and should be removed from the language). I've been updating the grammar definition as I worked on the parser. An HTML version is available for download here: https://raw.github.com/Hackerpilot/DGrammar/master/grammar.html Lexer, parser, and AST code is available here: https://github.com/Hackerpilot/Dscanner/tree/range-based-lexer/std/d. Bug reports and pull requests are welcome.
Jul 19 2013
On Friday, 19 July 2013 at 20:49:27 UTC, Brian Schott wrote:Status update: My parser is at a point where it can parse all of phobos (well, excluding c-style array declarations, which I'm convinced are broken and should be removed from the language). I've been updating the grammar definition as I worked on the parser. An HTML version is available for download here: https://raw.github.com/Hackerpilot/DGrammar/master/grammar.html Lexer, parser, and AST code is available here: https://github.com/Hackerpilot/Dscanner/tree/range-based-lexer/std/d. Bug reports and pull requests are welcome.https://rawgithub.com/Hackerpilot/DGrammar/master/grammar.html ;)
Jul 19 2013
On Friday, 19 July 2013 at 20:52:33 UTC, Mr. Anonymous wrote:https://rawgithub.com/Hackerpilot/DGrammar/master/grammar.html ;)I learn something new every day.
Jul 19 2013
On Fri, Jul 19, 2013 at 10:49:25PM +0200, Brian Schott wrote:Status update: My parser is at a point where it can parse all of phobos (well, excluding c-style array declarations, which I'm convinced are broken and should be removed from the language).Really? Those still exist in Phobos? Shouldn't they be rewritten to the "native" style?I've been updating the grammar definition as I worked on the parser. An HTML version is available for download here: https://raw.github.com/Hackerpilot/DGrammar/master/grammar.html Lexer, parser, and AST code is available here: https://github.com/Hackerpilot/Dscanner/tree/range-based-lexer/std/d. Bug reports and pull requests are welcome.Wonderful! I'll have to look into this sometime. I've been dreaming about writing a D pretty-printer, and this may be just the thing I need to get it going. :) (Well actually, I *wrote* a D pretty-printer, but I only got as far as lexing, with some token-matching rules for the pretty-printing. Needless to say, the results were rather underwhelming, since I didn't have an AST to work with.) T -- Written on the window of a clothing store: No shirt, no shoes, no service.
Jul 19 2013
On Friday, 19 July 2013 at 20:57:14 UTC, H. S. Teoh wrote:Really? Those still exist in Phobos? Shouldn't they be rewritten to the "native" style?Yes. https://github.com/D-Programming-Language/phobos/pull/1412/files
Jul 19 2013
Am Fri, 19 Jul 2013 22:49:25 +0200 schrieb "Brian Schott" <briancschott gmail.com>:Status update: My parser is at a point where it can parse all of phobos (well, excluding c-style array declarations, which I'm convinced are broken and should be removed from the language). I've been updating the grammar definition as I worked on the parser. An HTML version is available for download here: https://raw.github.com/Hackerpilot/DGrammar/master/grammar.html Lexer, parser, and AST code is available here: https://github.com/Hackerpilot/Dscanner/tree/range-based-lexer/std/d. Bug reports and pull requests are welcome.Nice work! I guess you already tried druntime but did you also run your lexer and or parser against the dmd test suite? The test suite probably has quite some old legacy stuff like c-style array declarations so it might be a pita to run those tests but it might show some real problems as well. If you wanted to do more testing you could also run it on some well-maintained big D projects (GTKd, derelict3, ...) http://wiki.dlang.org/Libraries_and_Frameworks http://wiki.dlang.org/Open_Source_Projects It's also nice that we may finally get a correct and complete grammar. BTW: I think it might be a good idea to start a new thread for this update. This stuff is quite important for D/phobos and usually old threads tend to be ignored or get less attention.
Jul 19 2013
On 2013-07-19 22:49, Brian Schott wrote:Status update: My parser is at a point where it can parse all of phobos (well, excluding c-style array declarations, which I'm convinced are broken and should be removed from the language). I've been updating the grammar definition as I worked on the parser. An HTML version is available for download here: https://raw.github.com/Hackerpilot/DGrammar/master/grammar.html Lexer, parser, and AST code is available here: https://github.com/Hackerpilot/Dscanner/tree/range-based-lexer/std/d. Bug reports and pull requests are welcome.Now for the million dollar question: does it work during compile time? -- /Jacob Carlborg
Jul 20 2013
On Friday, 19 July 2013 at 20:49:27 UTC, Brian Schott wrote:Status update: excluding c-style array declarations, which I'm convinced are broken and should be removed from the language+1 I found for myself that "native" style arrays are more natural.
Jul 20 2013
On Friday, 26 April 2013 at 20:44:12 UTC, Brian Schott wrote:On Saturday, 20 April 2013 at 08:31:34 UTC, Brian Schott wrote:Do you know, which parts of the grammar make the parser so slow? Do you use any suspicious features of ANTLR?This uses ANTLR, as the other parser generators can't handle D's grammar.I'm beginning to think that ANTRL is not up to the task either. I've somehow managed to get the grammar to the point where it correctly parses several phobos modules but takes a half hour to do so.
May 01 2013
On Wed, 2013-05-01 at 17:34 +0200, Tobias Pankrath wrote:On Friday, 26 April 2013 at 20:44:12 UTC, Brian Schott wrote:Which other parser generators? cf. for example, http://en.wikipedia.org/wiki/Comparison_of_parser_generatorsOn Saturday, 20 April 2013 at 08:31:34 UTC, Brian Schott wrote:This uses ANTLR, as the other parser generators can't handle=20 D's grammar.LL(k) can sometimes do this. As can LALR(1) of course.I'm beginning to think that ANTRL is not up to the task either.=20 I've somehow managed to get the grammar to the point where it=20 correctly parses several phobos modules but takes a half hour=20 to do so.Do you know, which parts of the grammar make the parser so slow? Do you use any suspicious features of ANTLR?There isn't just one ANTLR: ANTLR 2, ANTLR 3, and ANTLR 4 are very different beasties. Not many of the parser generators generate D code :-( --=20 Russel. =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D Dr Russel Winder t: +44 20 7585 2200 voip: sip:russel.winder ekiga.n= et 41 Buckmaster Road m: +44 7770 465 077 xmpp: russel winder.org.uk London SW11 1EN, UK w: www.russel.org.uk skype: russel_winder
May 01 2013
On 26/04/2013 21:44, Brian Schott wrote:On Saturday, 20 April 2013 at 08:31:34 UTC, Brian Schott wrote:Half hour??? That seems like a massive amount of time, even if the generated parser is very suboptimal and is doing a lot of unnecessary backtracking. I mean, even if the parser has quadratic performance over typical usage source files, it seems like a lot. To be honest, that's one of the reasons that put me off working with ANLTR. It seems easy to create a parser with ANTLR, but to create an efficient, well-behaved parser it looks quite complicated, in the sense that you can't abstract yourself from what is happening under the hood... you have to read a lot of theory and documention to learn the innards of ANTLR, and understand what kind of code it's actually generating, and how it processes input. (at moments it feels like you have to take a degree to learn how to use it effectively...) -- Bruno Medeiros - Software EngineerThis uses ANTLR, as the other parser generators can't handle D's grammar.I'm beginning to think that ANTRL is not up to the task either. I've somehow managed to get the grammar to the point where it correctly parses several phobos modules but takes a half hour to do so.
May 02 2013
On Thu, 2013-05-02 at 17:44 +0100, Bruno Medeiros wrote: [=E2=80=A6]To be honest, that's one of the reasons that put me off working with=20 ANLTR. It seems easy to create a parser with ANTLR, but to create an=20 efficient, well-behaved parser it looks quite complicated, in the sense==20that you can't abstract yourself from what is happening under the=20 hood... you have to read a lot of theory and documention to learn the=20 innards of ANTLR, and understand what kind of code it's actually=20 generating, and how it processes input. (at moments it feels like you=20 have to take a degree to learn how to use it effectively...)The Groovy parser is an ANTLR 2.7.7 grammar, it works well and quickly. I think the trick is to work with the LL(k) idioms and avoid letting LALR(1) thoughts creep in. --=20 Russel. =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D Dr Russel Winder t: +44 20 7585 2200 voip: sip:russel.winder ekiga.n= et 41 Buckmaster Road m: +44 7770 465 077 xmpp: russel winder.org.uk London SW11 1EN, UK w: www.russel.org.uk skype: russel_winder
May 02 2013
On Thursday, 2 May 2013 at 17:13:46 UTC, Russel Winder wrote:On Thu, 2013-05-02 at 17:44 +0100, Bruno Medeiros wrote: […]LALR(1) is O(n) last time I looked. Both Ll(k) and LALR(k) should never backtrack. Dunno what Antlr is doing.To be honest, that's one of the reasons that put me off working with ANLTR. It seems easy to create a parser with ANTLR, but to create an efficient, well-behaved parser it looks quite complicated, in the sense that you can't abstract yourself from what is happening under the hood... you have to read a lot of theory and documention to learn the innards of ANTLR, and understand what kind of code it's actually generating, and how it processes input. (at moments it feels like you have to take a degree to learn how to use it effectively...)The Groovy parser is an ANTLR 2.7.7 grammar, it works well and quickly. I think the trick is to work with the LL(k) idioms and avoid letting LALR(1) thoughts creep in.
May 03 2013
On Thursday, 2 May 2013 at 16:44:44 UTC, Bruno Medeiros wrote:(at moments it feels like you have to take a degree to learn how to use it effectively...)That or buy the documentation. Or both...
May 02 2013