digitalmars.D - Unquoted regular expressions

Georg Wrede <georg.wrede nospam.org> writes:

IIRC, we left this issue at "they'd be followed by a left parenthesis."

I think that the lexer should try everything else, and if nothing else is 
ok, then check whether it is an unquoted regular expression, without 
regard to parentheses. Then we could write:

foo = /laksjdf/;
bar = /lskdf/sdf/;

foo and bar would be functions that take a string and an int, write 
their results to an output array, and return bool.

This may not have to be slow. After all, we do already have the regex 
library, and the lexer could use only the "recognizing" part of it, i.e. 
do the parsing without generating any binary code.

If it does not parse as a regular expression, then flag error.

This would make parser writing much cleaner.

intR       = /lksjdf/;
floatR     = /fjfjf/;
stringR    = /yee\"lksjdlfks\"haw/;
hairyClass = /lksjdlfkjsdlfkjsldfklskdfjlskdjlfk/;

and also,

ok = /lksjdflks/ (string, pos, resultArray);

or

if (/lksjd/ (theArray, curPos, res))
{
     // do the doobydoo;
}

Since regular expressions are a natural part of parsing, that area 
should be the main target.

...
     if      (intR      (input, where, res)) wehaveInt();
     else if (floatR    (input, where, res)) wehaveFloat();
     else if (stringR   (input, where, res)) wehaveString();
     else if (hairyClass(input, where, res)) shootTheSheriff();
...
     else giveUp();
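
For comparison, the dispatch chain above can be approximated with library calls alone. This is only a sketch: it assumes the present-day Phobos module std.regex (which postdates this thread), and the function classify and the three patterns are hypothetical stand-ins for intR, floatR and stringR. ctRegex builds the matcher at compile time, and matchFirst reports whether the pattern matches.

```d
import std.regex;
import std.stdio;

// Hypothetical helper: report which kind of token starts at input[where].
// The leading ^ anchors each pattern at the start of the slice we pass in.
string classify(string input, size_t where)
{
    auto floatR  = ctRegex!(`^[0-9]+\.[0-9]+`);
    auto intR    = ctRegex!(`^[0-9]+`);
    auto stringR = ctRegex!(`^"[^"]*"`);

    auto rest = input[where .. $];
    // Try the float pattern first so that "3.14" is not cut short at "3".
    if (!matchFirst(rest, floatR).empty)  return "float";
    if (!matchFirst(rest, intR).empty)    return "int";
    if (!matchFirst(rest, stringR).empty) return "string";
    return "unknown";
}

void main()
{
    writeln(classify("x = 42;", 4));
    writeln(classify("x = 3.14;", 4));
    writeln(classify(`s = "hi";`, 4));
}
```

The patterns are written as wysiwyg (backtick) strings so that backslashes need no doubling. Unlike the proposal, the order of the tests matters here, because the int pattern also matches the start of a float.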

FTR, the whole idea here is that these regular expressions are 
compiled to executable code at compile time.

Regexes first encountered at runtime would be compiled when they are 
encountered, as in:

char[] getInput = askRegexDialog();  // string assignment
regexTypeFunction gi = getInput;     // compile regex

if (gi(ins, p, oa))
{
     // we found user's wishes in the input stream
}
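
The runtime half of the idea can be sketched in the same way, again assuming the present-day std.regex module rather than anything proposed here. The function foundIn and the sample pattern are hypothetical; regex() compiles the pattern when the call is executed and throws a RegexException if the pattern is malformed.

```d
import std.regex;
import std.stdio;

// Hypothetical helper: true if a pattern supplied at runtime
// matches somewhere in input.
bool foundIn(string input, string userPattern)
{
    auto gi = regex(userPattern);   // compiled here, at runtime
    return !matchFirst(input, gi).empty;
}

void main()
{
    string getInput = `[a-z]+[0-9]`;   // stand-in for askRegexDialog()
    if (foundIn("see item ab7 here", getInput))
    {
        writeln("found the user's pattern in the input");
    }
}
```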

------------

This is an outline, so the details are not accurate. (For example, how 
to report the end position of a regex match was not considered here.)

Mar 28 2005

"Walter" <newshound digitalmars.com> writes:

"Georg Wrede" <georg.wrede nospam.org> wrote in message
news:4248919A.1010208 nospam.org...
 IIRC, we left this issue at "they'd be followed by a left parenthesis."

 I think that the lexer should try everything else, and if nothing else is
 ok, then try if it is an unquoted regular expression. Without regard to
 parenthesis. Then we could write:

This is done in the DMDScript lexer. However, it's an ugly hack and requires
the lexer and parser to cooperate with each other. One of the design goals
of D is to have a lexer that is independent of parsing. This pretty much
means that regular expression literals using / is out. But check out some of
the new functions I added to www.digitalmars.com/d/std_regexp.html, I think
the syntax is reasonably clean and useful.

Mar 29 2005

Georg Wrede <georg.wrede nospam.org> writes:

Walter wrote:
 "Georg Wrede" <georg.wrede nospam.org> wrote in message
 news:4248919A.1010208 nospam.org...
 
IIRC, we left this issue at "they'd be followed by a left parenthesis."

I think that the lexer should try everything else, and if nothing else is
ok, then try if it is an unquoted regular expression. Without regard to
parenthesis. Then we could write:

 
 This is done in the DMDScript lexer. However, it's an ugly hack and requires
 the lexer and parser to cooperate with each other. One of the design goals
 of D is to have a lexer that is independent of parsing. This pretty much
 means that regular expression literals using / is out. But check out some of
 the new functions I added to www.digitalmars.com/d/std_regexp.html, I think
 the syntax is reasonably clean and useful.

It is! And I'm not forcing this issue here, it's just that "it would be 
nice", and "not difficult". (IMHO)

So, with the unquoted regexes + separate lexing, we have the following 
problem:

height = volume /width/ length;
halfHeight = volume /width/length/ 2;

(I've emphasized the problem above with stupid spacing.)

/width/length/ can either be a regex that substitutes every width with 
length, or it can be part of a math expression.

From the context it is painfully obvious which, but the lexer (which has 
aphasia because it is context independent) can't know whether we have 
unquoted regular expressions here or not.
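
To make the context dependence concrete, here is a sketch of the kind of rule a lexer would need in order to read a slash correctly. It is an illustration under assumptions, not DMDScript's actual code: the Tok enum and slashStartsRegex are hypothetical names. The point is that the decision needs the previous token, which is information about the surrounding expression that a context-independent lexer is not supposed to use.

```d
import std.stdio;

// Hypothetical token kinds for a toy lexer.
enum Tok { none, identifier, number, closeParen, op, openParen }

// Decide how to read a '/' from the token that came before it.
bool slashStartsRegex(Tok previous)
{
    // After an operand, '/' has to be division: volume /width/ length
    if (previous == Tok.identifier
        || previous == Tok.number
        || previous == Tok.closeParen)
        return false;
    // After an operator, '(' or at the very start, it could open a regex.
    return true;
}

void main()
{
    writeln(slashStartsRegex(Tok.identifier)); // slash after "volume"
    writeln(slashStartsRegex(Tok.op));         // slash after "="
}
```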

Case closed. For good.

What else can we do? Oh, and why would we?

The existing regex library is nice! But as we see from other languages, 
having the regexes "in the code" and "obviously compiled at compile 
time, rather than runtime", does seem to have its merits.

"Obviously"? Well, reading source code of a compiled language, if a 
literal regex appears as such in the source code, then it is taken for 
granted that it is compiled at compile time. It's just something you do. 
OTOH, having regexes in quoted strings and used by library functions 
"feels" like they are runtime compiled.

Another thing, this discussion of Unquoted regexes has been a kind of 
misnomer (and I take most of the blame). A regex that looks like

/blkajsld/

is actually quoted already. (Look at substitute in ed, vi, vim, and the 
like, and you see that other characters can be used for quoting. That is 
usually convenient when the regex itself contains forward slashes.)

So, how would we have the lexer recognize a regex type? And the 
programmer immediately spot it too?

"/lkajsdl/" would seem natural. So "/ and /" would be two new tokens.

Very nice. Except that this precludes having /asdfadsf/ as the contents 
of a normal string. This is getting depressing.

Would reversing the tokens do it? Then we'd have /"lksjdlks"/ and 
/"lkasjdf/lkasjdf"/ for searching and substituting, respectively.

The lexer would not even have to know of the middle slash, since the 
function signatures for both are the same (at least as per the earlier 
thread).

Requiring backslash quoting of double quotes would take care of the 
situation where the regex contains a double quote immediately followed 
by a forward slash, as in /"lksjdlfkasd\"/lkasjdf"/.

Mar 30 2005

xs0 <xs0 xs0.com> writes:

/" isn't that good, it prevents you from having a normal opDiv(char[])...

how about RE"..." and RE`...`? we already have wysiwyg strings, so this 
wouldn't be something completely new (at least syntax-wise).. It doesn't 
use a slash, but who cares, I think it's still obvious that it's 
compile-time?


xs0


Mar 31 2005

Georg Wrede <georg.wrede nospam.org> writes:

xs0 wrote:
 
 /" isn't that good, it prevents you from having a normal opDiv(char[])...

To what is a "normal opDiv(char[])" used?

 how about RE"..." and RE`...`? we already have wysiwyg strings, so this 
 wouldn't be something completely new (at least syntax-wise).. It doesn't 
 use a slash, but who cares, I think it's still obvious that it's 
 compile-time?

I could live with that.

Mar 31 2005

xs0 <xs0 xs0.com> writes:

Georg Wrede wrote:
 xs0 wrote:
 
 /" isn't that good, it prevents you from having a normal opDiv(char[])...

 
 
 To what is a "normal opDiv(char[])" used?

I don't know (to split some string buffer with the parameter, perhaps?), 
  but the point was that /"RE"/ would clash with division by char[], 
which is not good..


xs0

Apr 01 2005

Georg Wrede <georg.wrede nospam.org> writes:

xs0 wrote:
 Georg Wrede wrote:
 xs0 wrote:
 /" isn't that good, it prevents you from having a normal 
 opDiv(char[])...

 To what is a "normal opDiv(char[])" used?

 
 I don't know (to split some string buffer with the parameter, perhaps?), 
  but the point was that /"RE"/ would clash with division by char[], 
 which is not good..

Ehh, which might be more important? ;-)

Apr 01 2005

xs0 <xs0 xs0.com> writes:

Georg Wrede wrote:
 xs0 wrote:
 
 Georg Wrede wrote:

 xs0 wrote:

 /" isn't that good, it prevents you from having a normal 
 opDiv(char[])...

 To what is a "normal opDiv(char[])" used?


 I don't know (to split some string buffer with the parameter, 
 perhaps?),  but the point was that /"RE"/ would clash with division by 
 char[], which is not good..

 
 
 Ehh, which might be more important? ;-)

Well, one can argue both ways, but I think that the language should be 
without random exceptions like this.. It would be about the same s**t as 
requiring spaces between >> in template parameters in C++ ..


xs0

Apr 01 2005

Georg Wrede <georg.wrede nospam.org> writes:

xs0 wrote:
 Georg Wrede wrote:
 
 xs0 wrote:

 Georg Wrede wrote:

 xs0 wrote:

 /" isn't that good, it prevents you from having a normal 
 opDiv(char[])...


 To what is a "normal opDiv(char[])" used?



 I don't know (to split some string buffer with the parameter, 
 perhaps?),  but the point was that /"RE"/ would clash with division 
 by char[], which is not good..



 Ehh, which might be more important? ;-)

 
 
 Well, one can argue both ways, but I think that the language should be 
 without random exceptions like this.. It would be about the same s**t as 
 requiring spaces between >> in template parameters in C++ ..

Good point. And I've personally argued for the same logic on several 
occasions here.

On this particular occasion, however, we'd either have to come up with 
something else as the token, or weigh principle against practice.

Suggestions, ideas?

Apr 01 2005
