digitalmars.D - Interpreting the D grammar

Jacob Carlborg (21/21) Aug 02 2015 I'm trying to read the D grammar [1] to enhance the D TextMate bundle.

cym13 (6/26) Aug 02 2015 You can't build a regular expression for any grammar. You can for

Jacob Carlborg (6/10) Aug 02 2015 TextMate grammars support recursion, it's possible to define a grammar

cym13 (12/24) Aug 02 2015 Yes, that will work for this simple example, but what of

deadalnix (8/39) Aug 06 2015 You can for Regular grammar or any subclass:

MakersF (9/29) Aug 02 2015 Of course it's recursive! Do you want the grammar to be able to

Jacob Carlborg (7/15) Aug 02 2015 I don't know how this work, that's why I'm asking. But I read something

Xinok (6/13) Aug 02 2015 There's lots of videos online that show you how to do this. I

"Casper =?UTF-8?B?RsOmcmdlbWFuZCI=?= <shorttail hotmail.com> (2/4) Aug 06 2015 a* seems pretty infinite to me. :P

Timon Gehr (3/7) Aug 06 2015 And any grammar defining that language is recursive.
Tofu Ninja (4/8) Aug 06 2015 (0|1)*

H. S. Teoh via Digitalmars-d (5/15) Aug 06 2015 Finite is not a problem, if the upper limit is Graham's Number...

Xinok (13/33) Aug 02 2015 I guess you're not familiar with the theoretical aspect of

Jacob Carlborg (9/19) Aug 02 2015 TextMate grammars are not _just_ regular expressions. They can define

MakersF (17/43) Aug 06 2015 Then your best shot is to approximate the grammar with the regual

Dmitry Olshansky (21/55) Aug 06 2015 If one limits the depth of nested constructs to some reasonable value
Jacob Carlborg (6/20) Aug 06 2015 I was hoping to enhance the current TextMate grammar for D to get it a

deadalnix (2/22) Aug 06 2015 http://stackoverflow.com/a/1732454/672906

Jacob Carlborg (8/9) Aug 06 2015 I don't have so much of a choice. TextMate supports what it supports.

Jacob Carlborg <doob me.com> writes:

I'm trying to read the D grammar [1] to enhance the D TextMate bundle. 
If we take the add expression as an example. It's defined like this in 
the grammar:

AddExpression:
     MulExpression
     AddExpression + MulExpression
     AddExpression - MulExpression
     CatExpression

And like this in the grammar made by Brian [2]:

addExpression:
       mulExpression
     | addExpression ('+' | '-' | '~') mulExpression
     ;

I'm not so familiar with grammars but this looks like it's recursive. Is 
it possible to translate this piece of grammar to a regular expression? 
TextMate uses regular expressions and a couple of 
enhancements/extensions to define a grammar for a language.

[1] http://dlang.org/grammar.html
[2] https://rawgit.com/Hackerpilot/DGrammar/master/grammar.html

-- 
/Jacob Carlborg

Aug 02 2015

"cym13" <cpicard openmailbox.org> writes:

On Sunday, 2 August 2015 at 14:50:35 UTC, Jacob Carlborg wrote:
 I'm trying to read the D grammar [1] to enhance the D TextMate 
 bundle. If we take the add expression as an example. It's 
 defined like this in the grammar:

 AddExpression:
     MulExpression
     AddExpression + MulExpression
     AddExpression - MulExpression
     CatExpression

 And like this in the grammar made by Brian [2]:

 addExpression:
       mulExpression
     | addExpression ('+' | '-' | '~') mulExpression
     ;

 I'm not so familiar with grammars but this looks like it's 
 recursive. Is it possible to translate this piece of grammar to 
 a regular expression? TextMate uses regular expressions and a 
 couple of enhancements/extensions to define a grammar for a 
 language.

 [1] http://dlang.org/grammar.html
 [2] https://rawgit.com/Hackerpilot/DGrammar/master/grammar.html

You can't build a regular expression for any grammar. You can for 
some grammars but those are only a simple subset. For example, 
checking parens balance is impossible with common (not recursive) 
regular expressions only, and even with recursion it soon reaches 
its limitations.

Aug 02 2015

Jacob Carlborg <doob me.com> writes:

On 02/08/15 18:08, cym13 wrote:

 You can't build a regular expression for any grammar. You can for some
 grammars but those are only a simple subset. For example, checking
 parens balance is impossible with common (not recursive) regular
 expressions only, and even with recursion it soon reaches its limitations.

TextMate grammars support recursion, it's possible to define a grammar 
with balanced parentheses [1].

[1] https://manual.macromates.com/en/language_grammars

-- 
/Jacob Carlborg

Aug 02 2015

"cym13" <cpicard openmailbox.org> writes:

On Sunday, 2 August 2015 at 17:29:57 UTC, Jacob Carlborg wrote:
 On 02/08/15 18:08, cym13 wrote:

 You can't build a regular expression for any grammar. You can 
 for some
 grammars but those are only a simple subset. For example, 
 checking
 parens balance is impossible with common (not recursive) 
 regular
 expressions only, and even with recursion it soon reaches its 
 limitations.

 TextMate grammars support recursion, it's possible to define a 
 grammar with balanced parentheses [1].

 [1] https://manual.macromates.com/en/language_grammars

Yes, that will work for this simple example, but what of 
interleaved parentheses ? Say you want (), [] and "" to match, 
how can you do ?

     [[("]("), "])(", ")"]]

There are constructs that aren't possibly doable using even 
extend regular expressions. That's why grammars were invented 
after all.

Reading your documentation, it seems that you are not expected to 
reduce the grammar to a regular expression, rather it uses many 
regular expressions to describes parts of the language grammar, 
so that should work.

Aug 02 2015

"deadalnix" <deadalnix gmail.com> writes:

On Sunday, 2 August 2015 at 16:08:24 UTC, cym13 wrote:
 On Sunday, 2 August 2015 at 14:50:35 UTC, Jacob Carlborg wrote:
 I'm trying to read the D grammar [1] to enhance the D TextMate 
 bundle. If we take the add expression as an example. It's 
 defined like this in the grammar:

 AddExpression:
     MulExpression
     AddExpression + MulExpression
     AddExpression - MulExpression
     CatExpression

 And like this in the grammar made by Brian [2]:

 addExpression:
       mulExpression
     | addExpression ('+' | '-' | '~') mulExpression
     ;

 I'm not so familiar with grammars but this looks like it's 
 recursive. Is it possible to translate this piece of grammar 
 to a regular expression? TextMate uses regular expressions and 
 a couple of enhancements/extensions to define a grammar for a 
 language.

 [1] http://dlang.org/grammar.html
 [2] https://rawgit.com/Hackerpilot/DGrammar/master/grammar.html

 You can't build a regular expression for any grammar. You can 
 for some grammars but those are only a simple subset. For 
 example, checking parens balance is impossible with common (not 
 recursive) regular expressions only, and even with recursion it 
 soon reaches its limitations.

You can for Regular grammar or any subclass: 
https://en.wikipedia.org/wiki/Regular_grammar

You can't for more complex grammars.

That being said, you can have regex with placeholder and rerun 
the regex on the content of the placeholder is the grammar is 
recursive. I don't think that is an efficient and/or convenient 
way to parse D.

Aug 06 2015

"MakersF" <lfc morz.com> writes:

On Sunday, 2 August 2015 at 14:50:35 UTC, Jacob Carlborg wrote:
 I'm trying to read the D grammar [1] to enhance the D TextMate 
 bundle. If we take the add expression as an example. It's 
 defined like this in the grammar:

 AddExpression:
     MulExpression
     AddExpression + MulExpression
     AddExpression - MulExpression
     CatExpression

 And like this in the grammar made by Brian [2]:

 addExpression:
       mulExpression
     | addExpression ('+' | '-' | '~') mulExpression
     ;

 I'm not so familiar with grammars but this looks like it's 
 recursive. Is it possible to translate this piece of grammar to 
 a regular expression? TextMate uses regular expressions and a 
 couple of enhancements/extensions to define a grammar for a 
 language.

 [1] http://dlang.org/grammar.html
 [2] https://rawgit.com/Hackerpilot/DGrammar/master/grammar.html

Of course it's recursive! Do you want the grammar to be able to 
only define a finite number of programs?

But in this case you could write the original grammar rule as
mul |
cat |
(mul|cat)((+|-) (mul|cat))* (+|-) (mul|cat)

but you lose the precedence of the operation as it is a flat list 
and not a tree

Aug 02 2015

Jacob Carlborg <doob me.com> writes:

On 02/08/15 18:37, MakersF wrote:

 Of course it's recursive! Do you want the grammar to be able to only
 define a finite number of programs?

I don't know how this work, that's why I'm asking. But I read something 
about left recursion needs to be removed to be able to parse a grammar, 
at least for some parsers.

 But in this case you could write the original grammar rule as
 mul |
 cat |
 (mul|cat)((+|-) (mul|cat))* (+|-) (mul|cat)

 but you lose the precedence of the operation as it is a flat list and
 not a tree

I don't think that's important for syntax highlighting.

-- 
/Jacob Carlborg

Aug 02 2015

"Xinok" <xinok live.com> writes:

On Sunday, 2 August 2015 at 17:33:35 UTC, Jacob Carlborg wrote:
 On 02/08/15 18:37, MakersF wrote:

 Of course it's recursive! Do you want the grammar to be able 
 to only
 define a finite number of programs?

 I don't know how this work, that's why I'm asking. But I read 
 something about left recursion needs to be removed to be able 
 to parse a grammar, at least for some parsers.

There's lots of videos online that show you how to do this. I 
suppose some parsers are smart enough to rewrite the grammar to 
remove left recursion. Otherwise, for a simple parser which does 
nothing more than a breadth-first search, it may require 
exponential time to parse a string.

Aug 02 2015

"Casper =?UTF-8?B?RsOmcmdlbWFuZCI=?= <shorttail hotmail.com> writes:

On Sunday, 2 August 2015 at 16:37:06 UTC, MakersF wrote:
 Of course it's recursive! Do you want the grammar to be able to 
 only define a finite number of programs?

a* seems pretty infinite to me. :P

Aug 06 2015

Timon Gehr <timon.gehr gmx.ch> writes:

On 08/07/2015 01:07 AM, "Casper =?UTF-8?B?RsOmcmdlbWFuZCI=?= 
<shorttail hotmail.com>" wrote:
 On Sunday, 2 August 2015 at 16:37:06 UTC, MakersF wrote:
 Of course it's recursive! Do you want the grammar to be able to only
 define a finite number of programs?

 a* seems pretty infinite to me. :P

And any grammar defining that language is recursive.

Aug 06 2015

"Tofu Ninja" <emmons0 purdue.edu> writes:

On Thursday, 6 August 2015 at 23:08:01 UTC, Casper Færgemand 
wrote:
 On Sunday, 2 August 2015 at 16:37:06 UTC, MakersF wrote:
 Of course it's recursive! Do you want the grammar to be able 
 to only define a finite number of programs?

 a* seems pretty infinite to me. :P

(0|1)*

Just define your language in binary, problem solved

Aug 06 2015

"H. S. Teoh via Digitalmars-d" <digitalmars-d puremagic.com> writes:

On Thu, Aug 06, 2015 at 11:15:15PM +0000, Tofu Ninja via Digitalmars-d wrote:
 On Thursday, 6 August 2015 at 23:08:01 UTC, Casper F�rgemand wrote:
On Sunday, 2 August 2015 at 16:37:06 UTC, MakersF wrote:
Of course it's recursive! Do you want the grammar to be able to only
define a finite number of programs?

a* seems pretty infinite to me. :P

 
 (0|1)*
 
 Just define your language in binary, problem solved

Finite is not a problem, if the upper limit is Graham's Number...


T

-- 
Bomb technician: If I'm running, try to keep up.

Aug 06 2015

"Xinok" <xinok live.com> writes:

On Sunday, 2 August 2015 at 14:50:35 UTC, Jacob Carlborg wrote:
 I'm trying to read the D grammar [1] to enhance the D TextMate 
 bundle. If we take the add expression as an example. It's 
 defined like this in the grammar:

 AddExpression:
     MulExpression
     AddExpression + MulExpression
     AddExpression - MulExpression
     CatExpression

 And like this in the grammar made by Brian [2]:

 addExpression:
       mulExpression
     | addExpression ('+' | '-' | '~') mulExpression
     ;

 I'm not so familiar with grammars but this looks like it's 
 recursive. Is it possible to translate this piece of grammar to 
 a regular expression? TextMate uses regular expressions and a 
 couple of enhancements/extensions to define a grammar for a 
 language.

 [1] http://dlang.org/grammar.html
 [2] https://rawgit.com/Hackerpilot/DGrammar/master/grammar.html

I guess you're not familiar with the theoretical aspect of 
"formal languages". The D grammar is a context-free grammar which 
cannot be reduced to a regular expression. As cym13 stated, there 
are some simple context-free grammars which can be rewritten as 
regular expressions, but the D grammar cannot be. Take a look at 
the Chomsky Hierarchy [1] for a better understanding.

The classic example of a context-free language is the set of 
balanced parenthesis, i.e. (()) is balanced and ())))) is not. 
This language is not regular meaning you cannot write a regular 
expression for it, but you can write a context-free grammar for 
it.

[1] https://en.wikipedia.org/wiki/Chomsky_hierarchy#The_hierarchy

Aug 02 2015

Jacob Carlborg <doob me.com> writes:

On 02/08/15 19:15, Xinok wrote:

 I guess you're not familiar with the theoretical aspect of "formal
 languages". The D grammar is a context-free grammar which cannot be
 reduced to a regular expression. As cym13 stated, there are some simple
 context-free grammars which can be rewritten as regular expressions, but
 the D grammar cannot be. Take a look at the Chomsky Hierarchy [1] for a
 better understanding.

 The classic example of a context-free language is the set of balanced
 parenthesis, i.e. (()) is balanced and ())))) is not. This language is
 not regular meaning you cannot write a regular expression for it, but
 you can write a context-free grammar for it.

TextMate grammars are not _just_ regular expressions. They can define 
balanced parentheses [1].

The point of a language grammar in a text editor is not to have a 100% 
correct implementation of the grammar. Rather it should syntax highlight 
the code in a way that is useful for the user.

[1] https://manual.macromates.com/en/language_grammars

-- 
/Jacob Carlborg

Aug 02 2015

"MakersF" <lfc morz.com> writes:

On Sunday, 2 August 2015 at 18:22:01 UTC, Jacob Carlborg wrote:
 On 02/08/15 19:15, Xinok wrote:

 I guess you're not familiar with the theoretical aspect of 
 "formal
 languages". The D grammar is a context-free grammar which 
 cannot be
 reduced to a regular expression. As cym13 stated, there are 
 some simple
 context-free grammars which can be rewritten as regular 
 expressions, but
 the D grammar cannot be. Take a look at the Chomsky Hierarchy 
 [1] for a
 better understanding.

 The classic example of a context-free language is the set of 
 balanced
 parenthesis, i.e. (()) is balanced and ())))) is not. This 
 language is
 not regular meaning you cannot write a regular expression for 
 it, but
 you can write a context-free grammar for it.

 TextMate grammars are not _just_ regular expressions. They can 
 define balanced parentheses [1].

 The point of a language grammar in a text editor is not to have 
 a 100% correct implementation of the grammar. Rather it should 
 syntax highlight the code in a way that is useful for the user.

 [1] https://manual.macromates.com/en/language_grammars

Then your best shot is to approximate the grammar with the regual 
expressions you have access to. You'll get to a point where some 
constructs can not be correctly represented; at that point you 
should probably write a regex which produces what the grammar 
produces and some more.

In the example before of generating paired interleaved 
parentheses, you could generate every possible combination of 
parentheses, like
( (|)|[|]|{|}|" )*
where only the external parentheses are syntax for the regex. 
That regex matches all the productions of the paired parentheses 
grammar, and many more strings.

At the end of the day you want to highlight correct syntax, and 
if an user writes wrong syntax is OK to have wrong highlight, so 
be sure your regex work for the right syntax, and can do random 
stuff for the wrong one

Aug 06 2015

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

On 06-Aug-2015 11:26, MakersF wrote:
 On Sunday, 2 August 2015 at 18:22:01 UTC, Jacob Carlborg wrote:
 On 02/08/15 19:15, Xinok wrote:

 I guess you're not familiar with the theoretical aspect of "formal
 languages". The D grammar is a context-free grammar which cannot be
 reduced to a regular expression. As cym13 stated, there are some simple
 context-free grammars which can be rewritten as regular expressions, but
 the D grammar cannot be. Take a look at the Chomsky Hierarchy [1] for a
 better understanding.

 The classic example of a context-free language is the set of balanced
 parenthesis, i.e. (()) is balanced and ())))) is not. This language is
 not regular meaning you cannot write a regular expression for it, but
 you can write a context-free grammar for it.

 TextMate grammars are not _just_ regular expressions. They can define
 balanced parentheses [1].

 The point of a language grammar in a text editor is not to have a 100%
 correct implementation of the grammar. Rather it should syntax
 highlight the code in a way that is useful for the user.

 [1] https://manual.macromates.com/en/language_grammars

 Then your best shot is to approximate the grammar with the regual
 expressions you have access to. You'll get to a point where some
 constructs can not be correctly represented; at that point you should
 probably write a regex which produces what the grammar produces and some
 more.

If one limits the depth of nested constructs to some reasonable value 
(e.g. 5-6) then any context-free grammar is regular. In big grammars 
combinatorial explosion may get quite high so that limit better be low.

If regular expressions are not hardcoded but rather themsleves are 
generated then it's quite feasible to do. In fact Perl folks have been 
doing this "approximate by regex" for years.

 In the example before of generating paired interleaved parentheses, you
 could generate every possible combination of parentheses, like
 ( (|)|[|]|{|}|" )*
 where only the external parentheses are syntax for the regex. That regex
 matches all the productions of the paired parentheses grammar, and many
 more strings.

Here again a regex constructed to match e.g. 10-level deep expressions 
is more then enough. Like this:

unittest
{
     import std.regex;
     string x = ""; // the first level is going to be ([^()]*\([^()]*\)?)*
     foreach(_; 0..10)
         x = `([^()]*\([^()]*`~ x ~`\)?)*`; // pump an extra level of 
parenthesised expression
     x = "^"~x~"$"; //make sure we match the whole string
     assert(matchFirst("a+(b*c-d*(e+45)+(aaaa-(d))+(c*2+1)", x));
}



-- 
Dmitry Olshansky

Aug 06 2015

Jacob Carlborg <doob me.com> writes:

On 06/08/15 10:26, MakersF wrote:

 Then your best shot is to approximate the grammar with the regual
 expressions you have access to. You'll get to a point where some
 constructs can not be correctly represented; at that point you should
 probably write a regex which produces what the grammar produces and some
 more.

 In the example before of generating paired interleaved parentheses, you
 could generate every possible combination of parentheses, like
 ( (|)|[|]|{|}|" )*
 where only the external parentheses are syntax for the regex. That regex
 matches all the productions of the paired parentheses grammar, and many
 more strings.

 At the end of the day you want to highlight correct syntax, and if an
 user writes wrong syntax is OK to have wrong highlight, so be sure your
 regex work for the right syntax, and can do random stuff for the wrong one

I was hoping to enhance the current TextMate grammar for D to get it a 
bit closer to the D grammar [1]. That's why I started this thread.

[1] http://dlang.org/grammar.html

-- 
/Jacob Carlborg

Aug 06 2015

"deadalnix" <deadalnix gmail.com> writes:

On Sunday, 2 August 2015 at 14:50:35 UTC, Jacob Carlborg wrote:
 I'm trying to read the D grammar [1] to enhance the D TextMate 
 bundle. If we take the add expression as an example. It's 
 defined like this in the grammar:

 AddExpression:
     MulExpression
     AddExpression + MulExpression
     AddExpression - MulExpression
     CatExpression

 And like this in the grammar made by Brian [2]:

 addExpression:
       mulExpression
     | addExpression ('+' | '-' | '~') mulExpression
     ;

 I'm not so familiar with grammars but this looks like it's 
 recursive. Is it possible to translate this piece of grammar to 
 a regular expression? TextMate uses regular expressions and a 
 couple of enhancements/extensions to define a grammar for a 
 language.

 [1] http://dlang.org/grammar.html
 [2] https://rawgit.com/Hackerpilot/DGrammar/master/grammar.html

http://stackoverflow.com/a/1732454/672906

Aug 06 2015

Jacob Carlborg <doob me.com> writes:

On 06/08/15 20:14, deadalnix wrote:

 http://stackoverflow.com/a/1732454/672906

I don't have so much of a choice. TextMate supports what it supports. 
There are TextMate grammars for very many languages, so it's not a 
problem of not being able to represent a language grammar in TextMate. I 
just wanted to get closer to the D grammar to get a more accurate 
grammar in TextMate.

-- 
/Jacob Carlborg

Aug 06 2015

D Programming

C/C++ Programming

Other

digitalmars.D - Interpreting the D grammar