digitalmars.D - Template regexes, version 2

Don Clugston (80/80) Feb 21 2006 I've had another go at doing compile-time regexps. With all the bugfixes...

Georg Wrede (26/76) Feb 21 2006 Nice!!
Craig Black (11/30) Feb 21 2006 You are the man! Keep up the good work!

Georg Wrede (3/11) Feb 21 2006 I'm not even gonna try to imagine what happens when Don gets his hands

pragma (3/14) Feb 21 2006 You and me both. I'm just as excited as everyone else.

Craig Black (217/217) Feb 21 2006 charset="iso-8859-1"

Ivan Senji (3/11) Feb 21 2006

Craig Black (1/2) Feb 21 2006 Be afraid, be very afraid.
Craig Black (9/10) Feb 21 2006 Probably you are joking, but I feel the need to defend my code anyway. ...

Ivan Senji (6/21) Feb 21 2006 I guess i'll believe you.

Craig Black (11/31) Feb 22 2006 No, the regular expression doesn't actually parse C++, but it helps to p...

Walter Bright (5/7) Feb 21 2006 Don't worry about that, they'll be gone again in the next version . B...
Walter Bright (6/32) Feb 21 2006 Great! But can I suggest using the format used by

Don Clugston (16/37) Feb 22 2006 OK. Except ... I think replace() doesn't have to worry about $ colliding...

Don Clugston (40/84) Feb 23 2006 I should mention that the code I posted doesn't do real regexps, because...

Georg Wrede (3/8) Feb 23 2006 LOL!
pragma (27/66) Feb 23 2006 I'm scratching my head as to how to fix the greedy-matching problem. (m...

David Medlock (28/44) Feb 23 2006 One option is something like the following:
Don Clugston (61/140) Feb 24 2006 Thanks! I'm just amazed at how this stuff has progressed, it's so easy n...

Walter Bright (1/1) Feb 23 2006 Thanks!

Don Clugston (9/12) Feb 25 2006 I've now updated it so that it does proper regexps (non-greedy matching

Don Clugston <dac nospam.com.au> writes:

I've had another go at doing compile-time regexps. With all the bugfixes 
in recent releases, it's now possible to do it with mixins, generating 
local functions instead of global ones. This provides _considerably_ 
more flexibility.

The ultimately generated functions can look like:

char [] firstmatch(char [] s)
{
    char [][] results;
    int someOtherResult;

    void func(char [] s) {
       // this parses the string, using the original regexp string.
       // intermediate results (eg expressions in parentheses) are
       // in the local variables: results[][], someOtherResult, etc.
    }
    func(s);
    return results[0];
}

Usage is like:
char [] x = firstmatch!("ab+")(someStr);

It seems to me that there are 3 types of regexes:
* pure static -- where the regex string is a string literal, known at 
compile time
* pure dynamic -- eg, as used in a grep utility.
* pseudo-static. By this I mean regexps where the structure is constant, 
but some strings are replaced with variables.

Perl-ish languages deal with the last case by allowing embedded 
variables in strings. The template regexes Ive been developing can cope 
with them by using additional parameters and extended syntax, eg  1 for 
the string value of the first parameter.
eg

char [] var1 = "here";
char [] input = " a56aherea";
char [] x = firstmatch!("a 1a")(input, var1);
var1 = "56";
char [] y = firstmatch!("a 1a")(input, var1);

would give x == "aherea", y=="aherea".

As far as runtime efficiency goes, it's almost ideal, except that 
consecutive terms result in heavily nested trivial functions:

eg for the trivial "x\dy":
bool func(char [] s)
{
     bool a(char [] s) {
     	bool a(char [] s) {
             return s[0]=='y';
     	}
         return isdigit(s[0]) && a(s[1..$]);
     }
     return s[0]=='x' && a(s[1..$]);
}

and we need to rely on the optimiser to turn this into:
s[0]=='x' && isdigit(s[1]) && s[2]=='y';
How good is DMD at performing this optimisation? A nested function that 
is only called once _should_ always be inlined, no matter how deeply 
nested it is, but reality could be quite different...

Many optimisations are possible with this scheme (for example, the inner 
functions could just use int parameters, so that they don't need to pass 
s[] repeatedly; almost all of the optimisations from the runtime 
std.regexp could be applied). It should also be possible to implement 
Perl 6-style regexps.

...And then I return to the D newsgroups after a week and find the 
goalposts have moved: regexps are now built into the language.

The mixin regexps are only at an early stage of development, but given 
the current discussions I think it's important to know what can be done
with templates (probably more than most people expect). In the case of 
what I've called "pseudo-static" regexps, they are arguably more 
powerful than the built-in regexps of DMD 0.147.

I don't know where to go from here. There are a few possibilities:
1. use template regexps as a demonstration of the power of D templates.
--> Implement reduced syntax, keep the code simple and not well 
optimised; more of a proof-of-concept; go for "maximum coolness".
2. Like 1, but with optimisations; try to be the fastest regexp of any 
language. (cleaner-faster-better ... but we use the built-in regexps 
anyway <g>).
3. Create a standard library. This is what I was aiming for, but I don't
think it makes sense any more.
4. potential integration into the language. Is this even possible?

Probably the most sensible is to go with 1, to wake up the C++/boost 
crowd. Hopefully this should also provide some direction for the 
built-in syntax.
Thoughts?

Feb 21 2006

Georg Wrede <georg.wrede nospam.org> writes:

       Awesome!


Don Clugston wrote:
 I've had another go at doing compile-time regexps. With all the bugfixes 
 in recent releases, it's now possible to do it with mixins, generating 
 local functions instead of global ones. This provides _considerably_ 
 more flexibility.
 
 The ultimately generated functions can look like:
 
 char [] firstmatch(char [] s)
 {
    char [][] results;
    int someOtherResult;
 
    void func(char [] s) {
       // this parses the string, using the original regexp string.
       // intermediate results (eg expressions in parentheses) are
       // in the local variables: results[][], someOtherResult, etc.
    }
    func(s);
    return results[0];
 }
 
 Usage is like:
 char [] x = firstmatch!("ab+")(someStr);

Nice!!

 It seems to me that there are 3 types of regexes:
 * pure static -- where the regex string is a string literal, known at 
 compile time
 * pure dynamic -- eg, as used in a grep utility.
 * pseudo-static. By this I mean regexps where the structure is constant, 
 but some strings are replaced with variables.

Pure static is what I always wanted with regexps in D!

Pure dynamic has less of utility to me, but others may disagree. And 
that is taken care of by Walter now.

Pseudo-static is cool!! And I believe *immensely* useful.

If I do log analyzers, net statistics programs, serious data mining 
frameworks, then pseudo-static regexps is what I do all day long! And, 
at the other end, for even most 'dscript' tasks this is the core!

 As far as runtime efficiency goes, it's almost ideal, 

Amazing!

 ...And then I return to the D newsgroups after a week and find the 
 goalposts have moved: regexps are now built into the language.

Awwww, they're just runtime. Mere syntax sherades to smoke-and-mirror 
away the fact.

 The mixin regexps are only at an early stage of development, but given 
 the current discussions I think it's important to know what can be done
 with templates (probably more than most people expect). 

Most people couldn't.  :-)

But then again, "a language can't be Serious if all its features are 
graspable to VB programmers".

 In the case of 
 what I've called "pseudo-static" regexps, they are arguably more 
 powerful than the built-in regexps of DMD 0.147.

I DEMAND that pseudo-static regexps be in the very next release of DMD.

 I don't know where to go from here. There are a few possibilities:

 1. use template regexps as a demonstration of the power of D templates.
 --> Implement reduced syntax, keep the code simple and not well 
 optimised; more of a proof-of-concept; go for "maximum coolness".

 From a "marketing point of view", that may be wise.

 2. Like 1, but with optimisations; try to be the fastest regexp of any 
 language. (cleaner-faster-better ... but we use the built-in regexps 
 anyway <g>).

I probably would not use the "built-ins" (if that refers to the current 
runtime library-come-syntax stuff) hardly at all.

 3. Create a standard library. This is what I was aiming for,
 but I don't think it makes sense any more.

Of course it does!

 4. potential integration into the language. Is this even possible?

Wanna guess my vote on this?!

 Probably the most sensible is to go with 1, to wake up the C++/boost 
 crowd. 

Why wake 'em up????

Unless... you expect them all to abandon C++ on sight and stampede over 
here?

Heh, I wonder what 20 Dons would achieve together!

Feb 21 2006

"Craig Black" <cblack ara.com> writes:

 The mixin regexps are only at an early stage of development, but given the 
 current discussions I think it's important to know what can be done
 with templates (probably more than most people expect). In the case of 
 what I've called "pseudo-static" regexps, they are arguably more powerful 
 than the built-in regexps of DMD 0.147.

 I don't know where to go from here. There are a few possibilities:
 1. use template regexps as a demonstration of the power of D templates.
 --> Implement reduced syntax, keep the code simple and not well optimised; 
 more of a proof-of-concept; go for "maximum coolness".
 2. Like 1, but with optimisations; try to be the fastest regexp of any 
 language. (cleaner-faster-better ... but we use the built-in regexps 
 anyway <g>).
 3. Create a standard library. This is what I was aiming for, but I don't
 think it makes sense any more.
 4. potential integration into the language. Is this even possible?

 Probably the most sensible is to go with 1, to wake up the C++/boost 
 crowd. Hopefully this should also provide some direction for the built-in 
 syntax.
 Thoughts?

You are the man!  Keep up the good work!

As far as what direction to go, a stardard library makes perfect sense.  If 
Walter's kludge for regex's doesn't fit with this new approach, perhaps he 
can adapt it to make it work.  Even without this, I would much rather have 
faster regex's than language integration.

BTW, I developed a new syntax that is a spinoff of regular expressions 
(using C++).  It includes syntax to manipulate a stack and store values in a 
parse tree.  Using this language, I was able to define a parser for a modern 
object-oriented language in like 230 lines of code.  I could show you this 
stuff if you are interested.

-Craig

Feb 21 2006

Georg Wrede <georg.wrede nospam.org> writes:

Craig Black wrote:
 Don Clugston wrote:

 BTW, I developed a new syntax that is a spinoff of regular
 expressions (using C++).  It includes syntax to manipulate a stack
 and store values in a parse tree.  Using this language, I was able to
 define a parser for a modern object-oriented language in like 230
 lines of code.  I could show you this stuff if you are interested.
 
 -Craig

I'm not even gonna try to imagine what happens when Don gets his hands 
on that stuff. And I used to be pretty good at imagining stuff.

Feb 21 2006

pragma <pragma_member pathlink.com> writes:

In article <43FB4A18.4080904 nospam.org>, Georg Wrede says...
Craig Black wrote:
 Don Clugston wrote:

 BTW, I developed a new syntax that is a spinoff of regular
 expressions (using C++).  It includes syntax to manipulate a stack
 and store values in a parse tree.  Using this language, I was able to
 define a parser for a modern object-oriented language in like 230
 lines of code.  I could show you this stuff if you are interested.
 
 -Craig

I'm not even gonna try to imagine what happens when Don gets his hands 
on that stuff. And I used to be pretty good at imagining stuff.

You and me both.  I'm just as excited as everyone else.

- Eric Anderton at yahoo

Feb 21 2006

"Craig Black" <cblack ara.com> writes:

	charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable

Since it seems that there may be some interest, and to prove that I'm =
not bull-shitting, I thought I would post the code that I mentioned =
earlier.  It describes a parser for an unnamed object-oriented language =
of my design (that rips off a lot of stuff from D).  It's actually less =
than 230 lines if you don't count the comments and whitespace.  It's all =
one big string that gets parsed at run-time, but perhaps D could turn it =
into optimized source code.

const char *languageParser =3D
"Real::(!Delims&(-?(/d+(/.(/d)+)?)|(/d*/.(/d)+)([eE][/-+]?/d{1,3})?) ~;)"=
 // A real number
"Char::(!Delims&(\') ~&((\\)?.) ~\';;)" // char literal
"Text::(!Delims&(\") ~&([^\".]*) ~\";;)" // text literal

// Delimeters
"Comment1::(//(^(/n).)*)"
"Comment2::(/*(^(/*//).)*/*//)"
"Delims::(/s(//(!Comment1|!Comment2)/s)*)"

//
// Expressions
//
"BiOp1:([/./^])" // . ^ (exponentiation)

"BiOp3:([/-+])" // - +
"BiOp4:(=3D|/!=3D|<|>|<=3D|>=3D|in|/~|/!/~)" // =3D !=3D < > <=3D >=3D =
in ~ !~
"BiOp5:(/&|/||/./.)" // & | .. (range operator)
"Parens:(/(!Expr/))" // Expression bound by parenthesis
"MParam:(((`=3D) ~&(/i) ~;;=3D)?!Expr)"
"MParens:(&(/() ~(!MParam(,!MParam)*)?/);)" // 0 or more comma delimited =
expressions bound by parens
"Index:(&(/[) ~!Expr(,!Expr)*/];)" // Array index
"Tuple1:(&(/[) ~!MParam(,!MParam)*/];)" // Array index tuple
"Tuple2:(&(/() ~!MParam(,!MParam)*/);)" // Parens tuple
"Tuple:(!Tuple1|!Tuple2)"
"TBang:(&(/!) ~;)" // template instantiation bang
"DotExpr:(&(/.) ~;&(/i) ~;)" // dots in an id expression
"IdExpr:((`/:idexpr) ~&(/i) ~;(!DotExpr)*(((!TBang)?!MParens)|!Index)*;)"=
 // Id expression
"LExpr:(!Real|!Text|!Char|!TypeofExpr|!IdExpr|!Tuple|!Parens|!MethodExpr|=
!DelegateExpr|!ClosureExpr|!PrefixExpr)"
// Prefix operators
"PrefixOp:([/*/-/!/&]|sizeof|typeid|typeof)"=20
"PrefixExpr:(&(!PrefixOp) ~!LExpr;)"
// Exponentiation
"BiExpr1:((`t1) ~!LExpr&(!BiOp1)$~!BiSub2;)"
// Multiplication, division, and modulus
"BiSub2:(!BiExpr1|!LExpr)"
"BiExpr2:((`t2) ~!BiSub2&(!BiOp2)$~!BiSub3;)"
// Addition and Subtraction
"BiSub3:(!BiExpr2|!BiSub2)"
"BiExpr3:((`t3) ~!BiSub3&(!BiOp3)$~!BiSub4;)"
// Comparisons =3D=3D !=3D < > <=3D >=3D
"BiSub4:(!BiExpr3|!BiSub3)"
"BiExpr4:((`t4) ~!BiSub4&(!BiOp4)$~!BiSub5;)"

// & | ..
"BiSub5:(!BiExpr4|!BiSub4)"
"BiSub6:(!BiExpr5|!BiSub5)"
"BiExpr5:((`t5) ~!BiSub5&(!BiOp5)$~!BiSub6;)"
"Keywords:(module|interface|class|enum|static|const|delegate|closure|meth=
od)"
// Type expression
"TypeofExpr:(&(typeof) ~!Expr(/.&(/i) ~;)*;)"
"TypeDots:(^(!Keywords)&(/i) ~(/.&(/i) ~;)*;)"
"Array:(&(/[) ~(!Expr(,!Expr)*)?/];)" // Array type
"Const:((&(/!) ~;)|((const(`/!)) ~;))"
"MTypeExpr:(((!Const)?(&(/*) ~;|!Array))*(!Const)?(!DelegateType|!Closure=
Type|!TypeofExpr|!TypeDots))"
"TypeExpr:((`/:type) ~!MTypeExpr;)"
// delegate, closure types
//  void(int) myDelegate =3D  myMethod;
//  void(int) myDelegate =3D  myInstance.myMethod;
// #void(int) myClosure =3D #myMethod(5);
// #void(int) myDelegate =3D #myInstance.myMethod(5);
"SID:(&(/i) ~;^(/i))"
"DParam:((`/:var) ~!TypeExpr(!SID(,!SID)*)?(,;!DParam)?)"
"DParens:(/((!DParam;)?/))"
"Delegate:(/ |delegate)"

"DelegateType:((`/ ) ~!Delegate!TypeExpr!DParens;)"

// Anonymous method expression
// %void(int a) myDelegate =3D %void(int a) Write(a);;
"AParam:((`/:var) ~!TypeExpr!SID(,!SID)*(,;!AParam)?)"
"AParens:(/((!AParam;)?/))"
"MethodExpr:((`/%) ~!Method!TypeExpr!AParens!Block;)"
"DelegateExpr:((`/ ) ~!Delegate(!IdExpr/.)?&(/i) ~;;)"
// Closure expression

// New expression
"NewExpr:(&(new) ~!TypeExpr(!MParens)?;)"
// CastExpr
"CastExpr:((`cast) ~!Expr&((!Static)?as)$~!TypeExpr;)"
// An expression
"BiOps:(!BiOp1|!BiOp2|!BiOp3|!BiOp4|!BiOp5)"
"PreLExpr:(!LExpr^(!BiOps))"
"Expr:(^[/]/;/)/}](!PreLExpr|!BiExpr5|!BiExpr4|!BiExpr3|"
"!BiExpr2|!BiExpr1|!NewExpr|!CastExpr))"
//
// Statements
//
// Assignment statement
"AssOp:(=3D|/+=3D|/-=3D|/*=3D|//=3D|/^=3D)"
"AssSt:((`ass) ~!IdExpr(&(!AssOp)$~)!Expr;)"
// Delete statement
"DeleteSt:(&(delete) ~!IdExpr;/;)"
// Return, break, continue statements
"ReturnSt:(&(return) ~(!Expr)?;/;)"=20
"BreakSt:(&(break) ~;/;)"
"ContinueSt:(&(continue) ~;/;)"
// Postfix statement
"PostfixOp:(/+/+|/-/-)"
"PostfixSt:((`pf) ~!IdExpr&(!PostfixOp)$~;)"
// Variable declaration statement
"Var:(&(/i) ~(=3D(!NewExpr|!Expr))?;)" // var =3D expr
"Static:((&(/$) ~;)|(static(`/$) ~;))"
"VarSt:((`/:var) ~(!Static)?!TypeExpr!Var(,!Var)*;/;)"
"SVarSt:((`/:var) ~!TypeExpr!Var(,!Var)*;/;)"
// If statement
"IfSt:(&(if) ~!Expr(/:)?!Block(else!Block)?;)"
// Static if statement
"SIfSt:(&(!Static(if)) ~!Expr(/:)?!Block(else!Block)?;)"
// Switch statement
"CaseSt:((`case) ~(case)?!Expr*(/:)?!Block;)"
"DefaultSt:(&(default) ~(/:)?!Block;)"
"SwitchSt:(&(switch) ~/{(!CaseSt)*(!DefaultSt)?/};)"
// While statement
"WhileSt:(&(while) ~!Expr!Block;)"=20

// Do statement
"DoSt:(&(do) ~!Block(while)!Expr;/;)"

// Method call statement
"MethodSt:((`/(/)) ~!IdExpr;/;)"
// Foreach statement=20
"FEVarDecl:((`/:var) ~!TypeExpr&(/i) ~;;)"
"ForeachHeader:(!FEVarDecl(,!FEVarDecl;)?/;!Expr(&(step) ~!Expr;)?)"
"ForeachSt:(&(foreach) ~/(!ForeachHeader/)!Block(else!Block);)"
// For statement
"FVarDecl:((`/:var) ~!TypeExpr&(/i) ~;(=3D!Expr);)"
"ForInit:(!FVarDecl|!AssSt)"
"FStatement:(!AssSt|!PostfixSt)"
"ForSt:(&(for) ~/((!ForInit(,!ForInit)*)?/;!Expr/;(!FStatement(,!FStateme=
nt)*)?/)!Block;)"
// In and out blocks can include preconditions and postconditions=20
"InBlock:(&(in) ~!Block;)"
"OutBlock:(&(out) ~!Block;)"
// Trace block
"TraceBlock:(&(trace) ~!Block;)"
// Echo, assert
"EchoSt:(&(echo) ~!Expr;)"
"AssertSt:(&(assert) ~!Expr;)"
// Exceptions
"CatchSt:((`catch) ~(catch)?&(/i) ~;*(/:)?!Block;)"
"DefaultSt:(&(default) ~(/:)?!Block;)"
"TryBlock:(&(try) ~/{(!CatchSt)*(!DefaultSt)?/};)"
"ThrowSt:(&(throw) ~;)"
// Enumeration definition
"EnumNode:(&(/i) ~(=3D!Expr);)"
"EnumDef:(&(enum) ~/{!EnumNode(,!EnumNode)*/};)"
// Method definition
"TID:(&(/i) ~^(/i)(=3D!Expr)?;)"
"TParam:((`/:var) ~(&(in|out|inout) ~;)?!TypeExpr!TID(,!TID)*(,;!TParam)?=
)"
"TParens:(/((!TParam;)?/))"
"Method:(/%|method)"
"SMethodDef:((`/%) ~!Method!TypeExpr&(/i)  ~;!TParens!Block;)"
"MMethodDef:((`/%) ~(!Static|!Const)?!Method!TypeExpr&(/i)  ~;!TParens!Bl=
ock;)"
"IMethodDef:((`/%) ~(!Const)?!Method!TypeExpr&(/i)  ~;!TParens/;)"
//"Construct:(&(construct) ~!Block;)"
//"Destruct:(&(destruct) ~!Block;)"
//"Startup:..."
//"Terminate:..."
//"Set:(&(set) ~!TParens!Block;)"
//"Get:(!TypeExprget!Block;)"
// Operator overloading
"OpOverload:(&(operator) ~(!PrefixO|!PostfixO|!AssignO|!BinaryO)!Block;)"=

"BinaryO:(&(^(/.)!BiOps) ~;!TypeExpr/(!TypeExpr&(/i) ~;,!TypeExpr&(/i) ~;=
/))"
"AssignO:(&(^(=3D)!AssOp) ~;void/(!TypeExpr&(/i) ~;/))"
"PrefixO:(&(/-/!) ~;!TypeExpr/(/))"
"PostfixO:(&(/+/+|/-/-) ~;void/(/))"
"BinaryOverload:(&(operator) ~!BinaryO!Block;)"
// Class definition
"ProtSt:(&(public|private|protected) ~;/:)"
"ClassSt:(!ProtSt|!Alias|!TypeDef|!VarSt|!MMethodDef|!ClassDef|!EnumDef|!=
InterfaceDef|!OpOverload)"
"ClassBlock:(&(/{) ~(!ClassSt)*/};)"
"TemplateParam:((&(int) ~;)?(&(/i) ~;))"
"TemplateParams:(!TemplateParam(,!TemplateParam)*)"
"ClassDef:(&(class) ~&(/i) ~(/(!TemplateParams/))?;(/:!IdExpr(,!IdExpr)*)=
?!ClassBlock;)"
// Interface definition
"InterfaceSt:(!ProtSt|!Alias|!TypeDef|!IMethodDef|!ClassDef|!EnumDef|!Int=
erfaceDef)"
"InterfaceBlock:(&(/{) ~(!InterfaceSt)*/};)"
"InterfaceDef:(&(interface) ~&(/i) ~;(/:!IdExpr(,!IdExpr)*)?!InterfaceBlo=
ck;)"
"Alias:(&(alias) ~&(/i) ~;=3D!TypeExpr;)"
"TypeDef:(&(typedef) ~&(/i) ~;=3D!TypeExpr;)"
// import
"ImportNode:(&(/i) ~(/.&(/i)  ~;)*;)"
"ImportSt:(&(import) ~!ImportNode(,!ImportNode)*;/;)"
// error checking, exceptions
// in, out, inout params
// in, out code blocks
"ModuleInit:(&(module) ~&(/i)  ~;/:)"
"ModuleSt:(!ProtSt|!Alias|!TypeDef|!ImportSt|!SVarSt|!SMethodDef|!ClassDe=
f|"
"!EnumDef|!InterfaceDef|!BinaryOverload)"
"ModuleDef:(!ModuleInit(!ModuleSt)*)"
// A statement
"Statement:(!IfSt|!ForSt|!ForeachSt|(!PostfixSt/;)|(!AssSt/;)|!MethodSt|"=

"!VarSt|!InBlock|!OutBlock|!TraceBlock|!TryBlock|!ThrowSt|!EchoSt|"
"!AssertSt|!SwitchSt|!WhileSt|!DoSt|!ReturnSt|!BreakSt|!ContinueSt|"
"!DeleteSt|!SIfSt|!EnumDef|!SMethodDef|!ClassDef)"
// A statement block
"Block:((`/{) ~(/{(!Statement)*/}|!Statement);)";

Feb 21 2006

Ivan Senji <ivan.senji_REMOVE_ _THIS__gmail.com> writes:

Craig Black wrote:
 Since it seems that there may be some interest, and to prove that I'm 
 not bull-shitting, I thought I would post the code that I mentioned 
 earlier.  It describes a parser for an unnamed object-oriented language 
 of my design (that rips off a lot of stuff from D).  It's actually less 
 than 230 lines if you don't count the comments and whitespace.  It's all 
 one big string that gets parsed at run-time, but perhaps D could turn it 
 into optimized source code.
  

<snip>

That looks scary :)

Feb 21 2006

"Craig Black" <cblack ara.com> writes:

 That looks scary :)

Be afraid, be very afraid.

Feb 21 2006

"Craig Black" <cblack ara.com> writes:

 That looks scary :)

Probably you are joking, but I feel the need to defend my code anyway.  To 
someone who doesn't know the syntax it is not going to be very readable. <g> 
However, my code is way more readable than a lot of the regular expressions 
that I've seen.  If you don't believe me, google for regular expressions 
that can be used to parse C++.  One benefit of my language over regular 
expressions is that you can define subroutines.  Thus, it's not all one big 
mess.  It's a bunch of smaller, little messes.  This makes a long regular 
expression much more managable.

-Craig

Feb 21 2006

Ivan Senji <ivan.senji_REMOVE_ _THIS__gmail.com> writes:

Craig Black wrote:
That looks scary :)

 
 
 Probably you are joking

I was joking (but it does look scary)

 , but I feel the need to defend my code anyway.  To 
 someone who doesn't know the syntax it is not going to be very readable. <g> 
 However, my code is way more readable than a lot of the regular expressions 

I guess i'll believe you.

 that I've seen.  If you don't believe me, google for regular expressions 
 that can be used to parse C++.  

Well I would be surprised to find that because C++ grammar isn't 
regular, it isn't even LR, it is something awfull.

 One benefit of my language over regular 
 expressions is that you can define subroutines.  

Interesting? What class of languages can it parse?

 Thus, it's not all one big 
 mess.  It's a bunch of smaller, little messes.  This makes a long regular 
 expression much more managable.

Feb 21 2006

"Craig Black" <cblack ara.com> writes:

"Ivan Senji" <ivan.senji_REMOVE_ _THIS__gmail.com> wrote in message 
news:dtg4am$2si9$1 digitaldaemon.com...
 Craig Black wrote:
That looks scary :)


 Probably you are joking

 I was joking (but it does look scary)

 , but I feel the need to defend my code anyway.  To someone who doesn't 
 know the syntax it is not going to be very readable. <g> However, my code 
 is way more readable than a lot of the regular expressions

 I guess i'll believe you.

 that I've seen.  If you don't believe me, google for regular expressions 
 that can be used to parse C++.

 Well I would be surprised to find that because C++ grammar isn't regular, 
 it isn't even LR, it is something awfull.

No, the regular expression doesn't actually parse C++, but it helps to parse 
it.  A friend of mine was writing a C++ parser and he showed me a regular 
expression he found on the internet (I would show you, but I just looked but 
couldn't find it.)  It was hideously long because it has a lot of repetitous 
stuff.

When you add subroutines to the syntax, the repetition disappears.

 One benefit of my language over regular expressions is that you can 
 define subroutines.

 Interesting? What class of languages can it parse?

Theoretically it can parse anything.  It includes syntax for pushing and 
poping text on a stack and populating text values in a simple parse tree 
format.

 Thus, it's not all one big mess.  It's a bunch of smaller, little messes. 
 This makes a long regular expression much more managable.

Feb 22 2006

"Walter Bright" <newshound digitalmars.com> writes:

"Don Clugston" <dac nospam.com.au> wrote in message 
news:dtev42$1g98$1 digitaldaemon.com...
 ...And then I return to the D newsgroups after a week and find the 
 goalposts have moved: regexps are now built into the language.

Don't worry about that, they'll be gone again in the next version <g>. Based 
on the considerable feedback about this issue, I think I've got a better 
design.

Feb 21 2006

"Walter Bright" <newshound digitalmars.com> writes:

"Don Clugston" <dac nospam.com.au> wrote in message 
news:dtev42$1g98$1 digitaldaemon.com...
 Perl-ish languages deal with the last case by allowing embedded variables 
 in strings. The template regexes Ive been developing can cope with them by 
 using additional parameters and extended syntax, eg  1 for the string 
 value of the first parameter.
 eg

 char [] var1 = "here";
 char [] input = " a56aherea";
 char [] x = firstmatch!("a 1a")(input, var1);
 var1 = "56";
 char [] y = firstmatch!("a 1a")(input, var1);

 would give x == "aherea", y=="aherea".

Great! But can I suggest using the format used by 
std.regexp.RegExp.replace() ? It uses $$, $&, $`, $', and $n.

 How good is DMD at performing this optimisation?

Not good enough. But I wouldn't worry about that for the moment.

 I don't know where to go from here. There are a few possibilities:
 1. use template regexps as a demonstration of the power of D templates.
 --> Implement reduced syntax, keep the code simple and not well optimised; 
 more of a proof-of-concept; go for "maximum coolness".
 2. Like 1, but with optimisations; try to be the fastest regexp of any 
 language. (cleaner-faster-better ... but we use the built-in regexps 
 anyway <g>).
 3. Create a standard library. This is what I was aiming for, but I don't
 think it makes sense any more.
 4. potential integration into the language. Is this even possible?

 Probably the most sensible is to go with 1, to wake up the C++/boost 
 crowd. Hopefully this should also provide some direction for the built-in 
 syntax.
 Thoughts?

Feb 21 2006

Don Clugston <dac nospam.com.au> writes:

Walter Bright wrote:
 "Don Clugston" <dac nospam.com.au> wrote in message 
 news:dtev42$1g98$1 digitaldaemon.com...
 Perl-ish languages deal with the last case by allowing embedded variables 
 in strings. The template regexes Ive been developing can cope with them by 
 using additional parameters and extended syntax, eg  1 for the string 
 value of the first parameter.
 eg

 char [] var1 = "here";
 char [] input = " a56aherea";
 char [] x = firstmatch!("a 1a")(input, var1);
 var1 = "56";
 char [] y = firstmatch!("a 1a")(input, var1);

 would give x == "aherea", y=="aherea".

 
 Great! But can I suggest using the format used by 
 std.regexp.RegExp.replace() ? It uses $$, $&, $`, $', and $n.

OK. Except ... I think replace() doesn't have to worry about $ colliding 
with the end of line marker? Anyway, I've used $n for now.
(Doesn't SQL use   for parameters in stored procedures? Can't quite 
remember.)

 How good is DMD at performing this optimisation?

 
 Not good enough. But I wouldn't worry about that for the moment.

I'm not really surprised -- I imagine the back-end doesn't have much 
that's specific to D.

I've attached what I've done. It's not complete -- doesn't do ranges, or 
any of the escape characters, or beginning and end of line.
But it does do the full set of repetition operators, parentheses,
and unions (arbitrarily nestable). It also does the $n parameters.
Most of the remaining stuff is straightforward, and not very exciting.

I feel as though I've finally worked out what mixins are for. They are a 
perfect match for this application.

In any case, this should give you a good idea for what you'll able to 
use in your presentation.

Feb 22 2006

Don Clugston <dac nospam.com.au> writes:

I should mention that the code I posted doesn't do real regexps, because 
it has no backtracking. It doesn't do *, +, ? properly, they are always 
greedy. It can never match ".*a", for example.
I was too focused on getting access to variables.
However, the mixin technique means that the regexp is actually turned 
into a binary tree of nested functions.

eg (ab.*|c)a
becomes something like:

bool test(char [] searchstr)
{
   int p=0;

   bool seq() {
     bool union() {
        bool seq() {
            bool char_a {}
            bool seq() {
              bool char_b {}
              bool star() {
                  bool dot()
              }
            }
         }
         bool char_c() {}
     }
     bool char_a() {}
   }
   return seq();
}

so that in every case (including unions), the next function that needs 
to be called is always in scope. Intriguingly, it also has access to the 
local variables all the way down the tree. (Since of these levels can 
contain its own stack as an array, I'm pretty sure that a Turing machine 
can be created, it's likely to be more powerful than the state machines 
normally used for regexps).

Obviously this can be used to do regexps properly. I was hoping to use a 
clever naming scheme to avoid the alias issues from pragma's version, 
but it seems to be inevitable.

It seems hard to do this properly without creating a highly optimised 
regexp parser!

Don Clugston wrote:
 Walter Bright wrote:
 "Don Clugston" <dac nospam.com.au> wrote in message 
 news:dtev42$1g98$1 digitaldaemon.com...
 Perl-ish languages deal with the last case by allowing embedded 
 variables in strings. The template regexes Ive been developing can 
 cope with them by using additional parameters and extended syntax, eg 
  1 for the string value of the first parameter.
 eg

 char [] var1 = "here";
 char [] input = " a56aherea";
 char [] x = firstmatch!("a 1a")(input, var1);
 var1 = "56";
 char [] y = firstmatch!("a 1a")(input, var1);

 would give x == "aherea", y=="aherea".

 Great! But can I suggest using the format used by 
 std.regexp.RegExp.replace() ? It uses $$, $&, $`, $', and $n.

 
 OK. Except ... I think replace() doesn't have to worry about $ colliding 
 with the end of line marker? Anyway, I've used $n for now.
 (Doesn't SQL use   for parameters in stored procedures? Can't quite 
 remember.)
 
 How good is DMD at performing this optimisation?

 Not good enough. But I wouldn't worry about that for the moment.

 
 I'm not really surprised -- I imagine the back-end doesn't have much 
 that's specific to D.
 
 I've attached what I've done. It's not complete -- doesn't do ranges, or 
 any of the escape characters, or beginning and end of line.
 But it does do the full set of repetition operators, parentheses,
 and unions (arbitrarily nestable). It also does the $n parameters.
 Most of the remaining stuff is straightforward, and not very exciting.
 
 I feel as though I've finally worked out what mixins are for. They are a 
 perfect match for this application.
 
 In any case, this should give you a good idea for what you'll able to 
 use in your presentation.

Feb 23 2006

Georg Wrede <georg.wrede nospam.org> writes:

Don Clugston wrote:

 I should mention that the code I posted doesn't do real regexps, because 
 it has no backtracking. It doesn't do *, +, ? properly, they are always 
 greedy. It can never match ".*a", for example.

 It seems hard to do this properly without creating a highly optimised 
 regexp parser!

LOL!

(Self-reference.)

Feb 23 2006

pragma <pragma_member pathlink.com> writes:

Don, I love this rendition.

In article <dtjv60$1tc9$1 digitaldaemon.com>, Don Clugston says...
I should mention that the code I posted doesn't do real regexps, because 
it has no backtracking. It doesn't do *, +, ? properly, they are always 
greedy. It can never match ".*a", for example.
I was too focused on getting access to variables.

I'm scratching my head as to how to fix the greedy-matching problem.  (my
apologies if you've already mulled this over)

One way to go is provide a switch for greedy and non-greedy behaviors at the
interface level, such that a programmer can select the behavior.  Another is to
adopt a different token sequence in the expression itself (I think some engines
use "*?", "??", "{m,n}?" and "+?" for non-greedy matching).  

A third option, is to redesign the expression engine itself to work with
multiple matches, such that each predicate is understood to match the left-hand
side against n right-hand sides (as returned from other predicates).  This way,
the expression ".*a" will match "a" in each sucessive substring following each
successive attempt at matching "."; the result would be a "char[][]" containing
zero or more matches of the "*" predicate expression.


However, the mixin technique means that the regexp is actually turned 
into a binary tree of nested functions.

eg (ab.*|c)a
becomes something like:

bool test(char [] searchstr)
{
   int p=0;

   bool seq() {
     bool union() {
        bool seq() {
            bool char_a {}
            bool seq() {
              bool char_b {}
              bool star() {
                  bool dot()
              }
            }
         }
         bool char_c() {}
     }
     bool char_a() {}
   }
   return seq();
}

so that in every case (including unions), the next function that needs 
to be called is always in scope. Intriguingly, it also has access to the 
local variables all the way down the tree. (Since of these levels can 
contain its own stack as an array, I'm pretty sure that a Turing machine 
can be created, it's likely to be more powerful than the state machines 
normally used for regexps).

Also, I'm curious to see what DMD would do to optimize that code.  Would some of
those internals dissapear, or would they have to be anonymous functions in order
for that to happen?  It might be worth the trouble to go the latter, given there
are no function arguments used anywhere, as the resulting code might be pretty
darn close to using raw goto statements (is DMD smart enough to *not* create a
stack frame without use of the 'naked' keyword?).

Obviously this can be used to do regexps properly. I was hoping to use a 
clever naming scheme to avoid the alias issues from pragma's version, 
but it seems to be inevitable.

Heh... finding that problem was the worst part. 

Don't forget how my solution has a tendency to hammer and pollute the symbol
namespace for large expressions.  It looks like you managed to work around some
of that as well.  Again, I'm floored by how much more can be gained via nested
functions rather than a more "traditional" approach.

It seems hard to do this properly without creating a highly optimised 
regexp parser!

Hehe.. so the optimal solution is also the most complete? ;)

- Eric Anderton at yahoo

Feb 23 2006

David Medlock <noone nowhere.com> writes:

pragma wrote:
 Don, I love this rendition.
 
 In article <dtjv60$1tc9$1 digitaldaemon.com>, Don Clugston says...
 
I should mention that the code I posted doesn't do real regexps, because 
it has no backtracking. It doesn't do *, +, ? properly, they are always 
greedy. It can never match ".*a", for example.
I was too focused on getting access to variables.

 
 
 I'm scratching my head as to how to fix the greedy-matching problem.  (my
 apologies if you've already mulled this over)
 <snip> 
 Hehe.. so the optimal solution is also the most complete? ;)
 
 - Eric Anderton at yahoo


One option is something like the following:
(Assuming Regex is a linked list of atomic regular expressions, and it 
knows if it is nullable)

bool match( string txt, Regex reg, bool delegate() backtrack )
{
   string tail ;
   // attempt to match the regex against the text
   bool success = reg.match( txt, tail );

   // this function skips the current regex and tries the txt on the 
successor
   bool inner_skip()
   {
     bool result = match( txt, reg.next, backtrack );
     if ( result==false ) return backtrack();
     else return result;
   }

   if ( success )
   {
     if ( reg.nullable ) return match( tail, reg.next, &inner_skip );
     else return match( tail, reg.next, backtrack );
   }
   else return ( backtrack is null ? false : backtrack() );
}

//call with :

match( text, regex, null )

A nice addition would be a callback delegate for matches.

-DavidM

Feb 23 2006

Don Clugston <dac nospam.com.au> writes:

pragma wrote:
 Don, I love this rendition.

Thanks! I'm just amazed at how this stuff has progressed, it's so easy now.
I'm a little uncomfortable that people seem to be credited me for the 
regexp parser which you wrote. I want to make sure that you receive 
proper acknowledgment.

 In article <dtjv60$1tc9$1 digitaldaemon.com>, Don Clugston says...
 I should mention that the code I posted doesn't do real regexps, because 
 it has no backtracking. It doesn't do *, +, ? properly, they are always 
 greedy. It can never match ".*a", for example.
 I was too focused on getting access to variables.

 
 I'm scratching my head as to how to fix the greedy-matching problem.  (my
 apologies if you've already mulled this over)

 
 One way to go is provide a switch for greedy and non-greedy behaviors at the
 interface level, such that a programmer can select the behavior.  Another is to
 adopt a different token sequence in the expression itself (I think some engines
 use "*?", "??", "{m,n}?" and "+?" for non-greedy matching).  
 
 A third option, is to redesign the expression engine itself to work with
 multiple matches, such that each predicate is understood to match the left-hand
 side against n right-hand sides (as returned from other predicates).  This way,
 the expression ".*a" will match "a" in each sucessive substring following each
 successive attempt at matching "."; the result would be a "char[][]" containing
 zero or more matches of the "*" predicate expression.
 
 
 However, the mixin technique means that the regexp is actually turned 
 into a binary tree of nested functions.

 eg (ab.*|c)a
 becomes something like:

 bool test(char [] searchstr)
 {
   int p=0;

   bool seq() {
     bool union() {
        bool seq() {
            bool char_a {}
            bool seq() {
              bool char_b {}
              bool star() {
                  bool dot()
              }
            }
         }
         bool char_c() {}
     }
     bool char_a() {}
   }
   return seq();
 }

 so that in every case (including unions), the next function that needs 
 to be called is always in scope. Intriguingly, it also has access to the 
 local variables all the way down the tree. (Since of these levels can 
 contain its own stack as an array, I'm pretty sure that a Turing machine 
 can be created, it's likely to be more powerful than the state machines 
 normally used for regexps).

 
 Also, I'm curious to see what DMD would do to optimize that code.  Would some
of
 those internals dissapear, or would they have to be anonymous functions in
order
 for that to happen?  It might be worth the trouble to go the latter, given
there
 are no function arguments used anywhere, as the resulting code might be pretty
 darn close to using raw goto statements (is DMD smart enough to *not* create a
 stack frame without use of the 'naked' keyword?).

I suspect it's not very optimised, since nested functions are not 
relevant for C++. But there's a lot of potential. If it can convert 
CALL; RET; sequences into JMP, the whole thing will collapse into an 
optimal mass of goto spaghetti, something like

     xxx
     call part2
     xxx
part2:
     xxx
     call part3
     xxx
part3:
     xxx
     ret


 
 Obviously this can be used to do regexps properly. I was hoping to use a 
 clever naming scheme to avoid the alias issues from pragma's version, 
 but it seems to be inevitable.

 
 Heh... finding that problem was the worst part. 
 
 Don't forget how my solution has a tendency to hammer and pollute the symbol
 namespace for large expressions.  It looks like you managed to work around some
 of that as well.  Again, I'm floored by how much more can be gained via nested
 functions rather than a more "traditional" approach.

Me too. It's just awesome how many language features are coming together 
in this: real compile time constants, template string parameters, nested 
functions, having ~ concatenation built into the language, mixins, alias 
template parameters, static if, constant folding, the !() notation for 
templates, array slices...
And C++ can't do any of those things. It's incredible.

Update: I got the aliases to work in about 20 minutes, without 
difficulties. Now it does proper non-greedy matches. Amazingly, the code 
is barely any more complicated. It still doesn't need explicit backtracking!

Greedy matching can be implemented by repeatedly attempting non-greedy 
matches, and finally redoing the last one which was successful. Because 
each step in the process has its own local variables, the only 
backtracking is restoring a copy of the current position.
I think that greedy matching is an inherently inefficient operation, it 
would be much nicer to have non-greedy as the default.

Consider the case
(expr1)*(expr2)
I think that what most engines do is keep matching (expr1) until they 
get a failure. Then try (expr2). If it fails, wind back (expr1) and try 
(expr2) again. Keep going until you've wound all the way back to the 
beginning. This gives only one successful match to expr2, but many 
unnecessary matches to expr1.

So it's good for cases like "b*(really_complex_regexp)"
But it's bad for cases like "(really_complex_regexp)*b"
Also, backtracking expr1 is a real pain.

My method is the reverse, it has no unnecessary matches to (expr1), but 
may have many for (expr2). I'm not sure that it's really inferior.

Ideally, at compile time you'd determine which of the expressions was 
more complex (perhaps just by counting the number of *,+, ? in each 
string) and choose between the two methods that way.

if something that matches (expr1) can ever be a partial match for 
(expr2); and if it can't, then you know you can just keep matching 
(expr1) until it fails, and only then evaluate (expr2). But I suspect 
that comparing two regexps is one of those NP-Hard problems from CompSci 
that's just not feasible to solve.

 It seems hard to do this properly without creating a highly optimised 
 regexp parser!

 
 Hehe.. so the optimal solution is also the most complete? ;)

It certainly looks that way. And it seems to be the simplest, too. For 
quantifiers "{m,n}, *,+, ?", I used the (I thought) quick-and-dirty 
method of dealing with all of them at once. But a side-effect is that 
it's more optimal.

Also, I used functions with no parameters because it was quick and 
dirty. But it's optimal, too, and it's really easy to add more features.
It's a bizarre experience.

Feb 24 2006

"Walter Bright" <newshound digitalmars.com> writes:

Thanks!

Feb 23 2006

Don Clugston <dac nospam.com.au> writes:

Walter Bright wrote:
 Thanks! 
 
 

I've now updated it so that it does proper regexps (non-greedy matching 
only). I've also added a couple of the trivial bits like character classes.

http://trac.dsource.org/projects/ddl/browser/trunk/meta/regexp2.d

I think that since it now does non-greedy matching correctly AFAIK even 
in the tricky cases, it should now be clear to everyone that the concept 
works. But it could be greatly improved, and some of the basic stuff is 
still missing (like escape characters).
Let me know what you think is most important for your presentation.

Feb 25 2006

D Programming

C/C++ Programming

Other

digitalmars.D - Template regexes, version 2