digitalmars.D - Re: Is str ~ regex the root of all evil, or the leaf of all good?

bearophile <bearophileHUGS lycos.com> writes:

Andrei Alexandrescu:

I think the "g", "i", and "m" flags are popular enough if you've done any
amount of regex



I think I don't like the "g".

-----------------------

To test an API it's often good to try using it, or to compare it against similar
practical, common operations done with another language or library. So here I
show two examples in Python. You can try translating these two operations to
the std.re of D2 to see how they turn out :-)


The first example shows the usage of a callable for re.sub() (in D it may be
called replace()).

Here replacer() is a user-defined function passed to re.sub() (or to the sub()
method of a compiled pattern object), which calls it on each match.

Note that in Python functions are objects, so I have dynamically added to the
replacer() function an attribute named "counter". In D (and Python) you can do
the same thing by creating a small class with a counter attribute.


import re

def replacer(mobj):
    replacer.counter += 1
    return "REPL%02d" % replacer.counter
replacer.counter = 0

s1 = ".......TAG............TAG................TAG..........TAG....."

result = ".......REPL01............REPL02................REPL03..........REPL04....."

r = re.sub("TAG", replacer, s1)
assert r == result
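
A minimal sketch of the small-class variant mentioned above, using only Python's standard re module. The class name Replacer and the shortened input string are illustrative choices, not part of the original example:

```python
import re

class Replacer:
    """Callable object that numbers each replacement (illustrative name)."""

    def __init__(self):
        self.counter = 0

    def __call__(self, mobj):
        # re.sub() calls this once per match, passing the match object.
        self.counter += 1
        return "REPL%02d" % self.counter

s1 = "..TAG....TAG.."
replacer = Replacer()
r = re.sub("TAG", replacer, s1)
print(r)  # ..REPL01....REPL02..
```

Keeping the counter on an instance rather than on the function means two independent replacements can each start counting from 1.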

----------

This is a little example of managing groups in Python:

 import re
 data = ">hello1 how are5 you?<"
 patt = re.compile(r".*?(hello\d).*?(are\d).*")
 patt.match(data).groups()  # ('hello1', 'are5')








(Note that here all groups are found eagerly. If you want lazy matching in
Python you have to use re.finditer() or the finditer() method of a compiled
pattern.)
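
As a rough illustration of the lazy style, here is a sketch with finditer() on a compiled pattern. The input string with two "hello" records and the simplified pattern are made up for the example:

```python
import re

data = ">hello1 how are5 you?< >hello2 how are7 you?<"
patt = re.compile(r"(hello\d).*?(are\d)")

# finditer() returns an iterator: matches are produced one at a time,
# only when the caller asks for the next one.
matches = patt.finditer(data)
first = next(matches)
second = next(matches)
exhausted = next(matches, None)  # None once there are no more matches

print(first.groups())   # ('hello1', 'are5')
print(second.groups())  # ('hello2', 'are7')
```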

I might like a syntax similar to this, where opIndex() lets you fetch a matched
group:

 patt.match(data)[0]
 patt.match(data)[1]
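
For comparison, Python match objects do support this kind of indexing from version 3.6 onward. The sketch below assumes such a Python version; note that index 0 there is the whole match and the groups start at 1, which is one possible convention and not necessarily the one a D API would pick:

```python
import re

data = ">hello1 how are5 you?<"
patt = re.compile(r".*?(hello\d).*?(are\d).*")
m = patt.match(data)

whole = m[0]    # the entire matched text
first = m[1]    # first capturing group
second = m[2]   # second capturing group

print(first, second)  # hello1 are5
```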







Bye,
bearophile

Feb 19 2009

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

bearophile wrote:
 Andrei Alexandrescu:
 
 I think the "g", "i", and "m" flags are popular enough if you've done any
amount of regex


 
 I think I don't like the "g".


How can anyone think they don't like something? You like it or not, but 
it's not the result of a thought process. I guess.

Anyway: g is from Perl. Let's keep it that way.


Andrei

Feb 19 2009

Derek Parnell <derek psych.ward> writes:

On Thu, 19 Feb 2009 07:51:46 -0800, Andrei Alexandrescu wrote:

 bearophile wrote:
 Andrei Alexandrescu:
 
 I think the "g", "i", and "m" flags are popular enough if you've done any
amount of regex


 
 I think I don't like the "g".


 How can anyone think they don't like something? You like it or not, but 
 it's not the result of a thought process. I guess.


It is not a question of whether one likes or doesn't like; this expression
is attempting to say something about one's level of certainty about liking
something. That is to say, one might not be positive if they *know* if they
like something or not, therefore they *think* (suspect, but not have
definitive evidence) of their stance.

 Anyway: g is from Perl. Let's keep it that way.


Perfect justification ;-)

-- 
Derek Parnell
Melbourne, Australia
skype: derek.j.parnell

Feb 19 2009

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

Derek Parnell wrote:
 On Thu, 19 Feb 2009 07:51:46 -0800, Andrei Alexandrescu wrote:
 
 bearophile wrote:
 Andrei Alexandrescu:

 I think the "g", "i", and "m" flags are popular enough if you've done any
amount of regex



 I think I don't like the "g".


 How can anyone think they don't like something? You like it or not, but
 it's not the result of a thought process. I guess.


 It is not a question of whether one likes or doesn't like; this expression
 is attempting to say something about one's level of certainty about liking
 something. That is to say, one might not be positive if they *know* if they
 like something or not, therefore they *think* (suspect, but not have
 definitive evidence) of their stance.


I see. Me, I always use "think" to evoke an actual thinking process. 
Otherwise I use "feel" or "believe". (This turns out to be important in 
various interpersonal interactions, e.g. do you want to drive the 
conversation towards thoughts or feelings? Guess which is gonna get you 
a date :o).) So by definition I can't think I like something. But I 
understand how some may use "I think" as a synonym for "Without being 
sure, to me it seems".


Andrei

Feb 19 2009

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

bearophile wrote:
 Andrei Alexandrescu:
 
 I think the "g", "i", and "m" flags are popular enough if you've done any
amount of regex


 
 I think I don't like the "g".
 
 -----------------------
 
 To test an API it's often good to try to use it or compare it against similar
practical&common operations done with another language or library. So here I
show two examples in Python. You can try to translate such two operations with
the std.re of D2 to see how they become :-)
 
 
 The first example shows the usage of a callable for re.sub() (in D it may be
called replace()).
 
 Here replacer() is a user-defined function given to re.sub()/matchobj.sub()
that they call on each match.
 
 Note that in Python functions are objects, so I have dynamically added to the
replacer() function an instance attribute named "counter". In D (and Python)
you can do the same thing creating a small class with counter attribute.
 
 
 import re
 
 def replacer(mobj):
     replacer.counter += 1
     return "REPL%02d" % replacer.counter
 replacer.counter = 0
 
 s1 = ".......TAG............TAG................TAG..........TAG....."
 
 result = ".......REPL01............REPL02................REPL03..........REPL04....."
 
 r = re.sub("TAG", replacer, s1)
 assert r == result
 
 ----------


Excellent idea. Let's see:

uint counter;
// Pre-increment so that the first replacement is REPL01, as in the Python version.
string replacer(string) { return format("REPL%02d", ++counter); }
auto s1 = ".......TAG............TAG................TAG..........TAG.....";
auto result =
".......REPL01............REPL02................REPL03..........REPL04.....";
auto r = replace!(replacer)(s1, "TAG");
assert(r == result);

 This is a little example of managing groups in Python:
 
 import re
 data = ">hello1 how are5 you?<"
 patt = re.compile(r".*?(hello\d).*?(are\d).*")
 patt.match(data).groups()








auto data = ">hello1 how are5 you?<";
auto iter = match(data, regex(r".*?(hello\d).*?(are\d).*"));
foreach (i; 0 .. iter.engine.captures)
     writeln(iter.capture[i]);

 (notes that here all groups are found eagerly. If you want a lazy matching in
Python you have to use re.finditer() or matchobj.finditer()).
 
 I may like a syntax similar to this, where opIndex() allows to find the
matched group:
 
 patt.match(data)[0]






 patt.match(data)[1]








No go due to confusions with random-access ranges.


Andrei

Feb 19 2009

bearophile <bearophileHUGS lycos.com> writes:

Andrei Alexandrescu:

>Excellent idea. Let's see:<


Thank you for all your work and for your willingness to answer the posts here.
A usable API is slowly shaping up :-)


 uint counter;
 string replacer(string) { return format("REPL%02d", ++counter); }
 auto s1 = ".......TAG............TAG................TAG..........TAG.....";
 auto result = ".......REPL01............REPL02................REPL03..........REPL04.....";
 auto r = replace!(replacer)(s1, "TAG");
 assert(r == result);


It looks good enough.

With a static variable it may become:

string replacer(string) {
    static int counter;
    return format("REPL%02d", ++counter);
}


With a small struct/class it may become:

struct Replacer {
    int counter;
    string opCall(string s) {
        this.counter++;
        return format("REPL%02d", counter);
    }
}

-------------------

 auto data = ">hello1 how are5 you?<";
 auto iter = match(data, regex(r".*?(hello\d).*?(are\d).*"));
 foreach (i; 0 .. iter.engine.captures)
      writeln(iter.capture[i]);


I don't understand that.

What's the purpose of ".engine"?

"captures" may be better named "ngroups" or "ncaptures", or you may just use
the .len/.length attribute in some way.

foreach (i, group; iter.groups)
    writeln(i, " ", group);

"group" may be a struct that defines toString and can be cast to string, and
also keeps the starting position of the group in the original string.
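
To make the idea concrete, here is a Python sketch of such a "group" value. The names Group and groups_of are hypothetical helpers invented for this illustration; only re.match(), Match.group(), Match.start() and Pattern.groups come from the standard library:

```python
import re
from dataclasses import dataclass

@dataclass(frozen=True)
class Group:
    """Captured text plus its starting offset in the searched string."""
    text: str
    start: int

    def __str__(self):
        # Lets the group be printed or formatted like a plain string.
        return self.text

def groups_of(m):
    # Capturing groups are numbered from 1; group 0 is the whole match.
    # Assumes every group took part in the match (otherwise text is None).
    return [Group(m.group(i), m.start(i)) for i in range(1, m.re.groups + 1)]

data = ">hello1 how are5 you?<"
m = re.match(r".*?(hello\d).*?(are\d).*", data)
groups = groups_of(m)
for i, group in enumerate(groups):
    print(i, group, group.start)
```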

Bye,
bearophile

Feb 19 2009

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

bearophile wrote:
 Andrei Alexandrescu:
 auto data = ">hello1 how are5 you?<";
 auto iter = match(data, regex(r".*?(hello\d).*?(are\d).*"));
 foreach (i; 0 .. iter.engine.captures)
      writeln(iter.capture[i]);


 I don't understand that.
 
 What's the purpose of ".engine"?


It's the regex engine that has generated the match. I coded that wrong 
in two different ways; it should have been:

foreach (i; 0 .. iter.captures)
       writeln(iter.capture(i));

 "captures" may be better named "ngroups" or "ncaptures", or you may just use
the .len/.length attribute in some way.


"Capture" is the traditional term as far as I understand. I can't use 
.length because it messes up with range semantics. "len" would be too 
confusing. "ncaptures" is too cute. Nobody's perfect :o).

 foreach (i, group; iter.groups)
     writeln(i, " ", group);
 
 "group" may be a struct that defines toString and can be cast to string, and
also keeps the starting position of the group into the original string.


That sounds good.


Andrei

Feb 19 2009

bearophile <bearophileHUGS lycos.com> writes:

Andrei Alexandrescu:

 foreach (i; 0 .. iter.captures)
       writeln(iter.capture(i));


 "Capture" is the traditional term as far as I understand. I can't use
 .length because it messes up with range semantics. "len" would be too
 confusing. "ncaptures" is too cute. Nobody's perfect :o).


 "group" may be a struct that defines toString and can be cast to string,
 and also keeps the starting position of the group into the original string.




 That sounds good.


Well, then match() may return just a dynamic array of such groups/captures. So
such array has both .length and opIndex. It looks simple :-)

Bye,
bearophile

Feb 19 2009

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

bearophile wrote:
 Andrei Alexandrescu:
 
 foreach (i; 0 .. iter.captures) writeln(iter.capture(i));


 "Capture" is the traditional term as far as I understand. I can't
 use .length because it messes up with range semantics. "len" would
 be too confusing. "ncaptures" is too cute. Nobody's perfect :o).


 "group" may be a struct that defines toString and can be cast to
 string, and also keeps the starting position of the group into
 the original string.




 That sounds good.


 Well, then match() may return just a dynamic array of such
 groups/captures. So such array has both .length and opIndex. It looks
 simple :-)


Looks simple but it isn't. How do you advance to the next match?

foreach (m; "abracadabra".match("(.)a", "g")) writeln(m.capture[0]);

This should print:

r
c
d
r

There's a need to advance through the matches, not through the captures. How 
do you distinguish between the two?
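
For reference, the same global search can be sketched in Python, where advancing from match to match is the job of the iterator and the capture is read from each match object. In Python the captured character is group 1, while group 0 is the whole two-character match:

```python
import re

text = "abracadabra"

# Iterating advances from one match to the next ...
captured = [m.group(1) for m in re.finditer(r"(.)a", text)]
# ... while group numbers select text within a single match.
whole = [m.group(0) for m in re.finditer(r"(.)a", text)]

print(captured)  # ['r', 'c', 'd', 'r']
print(whole)     # ['ra', 'ca', 'da', 'ra']
```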


Andrei

Feb 19 2009

"jovo" <jovo at.home> writes:

"Andrei Alexandrescu" <SeeWebsiteForEmail erdani.org> wrote in message 
news:gnk8te$cgl$1 digitalmars.com...
 Looks simple but it isn't. How do you advance to the next match?

 foreach (m; "abracadabra".match("(.)a", "g")) writeln(m.capture[0]);

 This should print:

 r
 c
 d
 r

 There's need to make progress in the matching, not in the capture. How do 
 you distinguish among them?


 Andrei


foreach(capture; match(s, r))
  foreach(group; capture)
    writeln(group);

Feb 19 2009

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

jovo wrote:
 "Andrei Alexandrescu" <SeeWebsiteForEmail erdani.org> wrote in message 
 news:gnk8te$cgl$1 digitalmars.com...
 Looks simple but it isn't. How do you advance to the next match?

 foreach (m; "abracadabra".match("(.)a", "g")) writeln(m.capture[0]);

 This should print:

 r
 c
 d
 r

 There's need to make progress in the matching, not in the capture. How do 
 you distinguish among them?


 Andrei


 foreach(capture; match(s, r))
   foreach(group; capture)
     writeln(group);
 
 
 


The consecrated terminology is:

foreach(match; match(s, r))
    foreach(capture; match)
      writeln(capture);

"Group" is a group defined without an intent to capture. A "capture" is 
a group that also binds to the state of the match.

Anyhow... this can be done but things get a tad more confusing for other 
uses. How about this:

foreach(match; match(s, r))
    foreach(capture; match.captures)
      writeln(capture);

?
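
The two-level shape of that loop (matches outside, captures inside) looks like this in Python. The key=value input and the pattern are invented for the sketch, and Match.groups() stands in for the proposed match.captures:

```python
import re

s = "k1=v1;k2=v2"
r = re.compile(r"(\w+)=(\w+)")

rows = []
for match in r.finditer(s):          # outer loop: one iteration per match
    row = []
    for capture in match.groups():   # inner loop: captures of that match
        row.append(capture)
    rows.append(row)

print(rows)  # [['k1', 'v1'], ['k2', 'v2']]
```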


Andrei

Feb 19 2009

"jovo" <jovo at.home> writes:

"Andrei Alexandrescu" <SeeWebsiteForEmail erdani.org> wrote in message 
news:gnkc24$hul$1 digitalmars.com...
 The consecrated terminology is:

 foreach(match; match(s, r))
    foreach(capture; match)
      writeln(capture);

 "Group" is a group defined without an intent to capture. A "capture" is a 
 group that also binds to the state of the match.

 Anyhow... this can be done but things get a tad more confusing for other 
 uses. How about this:

 foreach(match; match(s, r))
    foreach(capture; match.captures)
      writeln(capture);

 ?


 Andrei



I think you must answer this question more generally, the same way for the
whole library. Maybe both?

Feb 19 2009

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

jovo wrote:
 "Andrei Alexandrescu" <SeeWebsiteForEmail erdani.org> wrote in message 
 news:gnkc24$hul$1 digitalmars.com...
 The consecrated terminology is:

 foreach(match; match(s, r))
    foreach(capture; match)
      writeln(capture);

 "Group" is a group defined without an intent to capture. A "capture" is a 
 group that also binds to the state of the match.

 Anyhow... this can be done but things get a tad more confusing for other 
 uses. How about this:

 foreach(match; match(s, r))
    foreach(capture; match.captures)
      writeln(capture);

 ?


 Andrei


 
 I think you must answer this question more generally, same for all library.
 May be both?


I'd hate to fall again into the fallacy of trying to appease everyone's 
taste. Really std.regexp has set a negative record with the incredible 
array of names: find, search, exec, match, test, and probably I forgot a 
couple. Also it has offered a variety of random features in both 
free-function and member-function format, not even always doing the same 
thing. Germans have a saying: "Kurz und gut". Let's make it short and good.

Andrei

Feb 19 2009

KennyTM~ <kennytm gmail.com> writes:

Bill Baxter wrote:
 I don't like the syntax I saw somewhere earlier in the thread of
     0..iter.captures
 
 .captures looks like it should be a set of captures, not a count.
 
 This is a need that comes up again and again -- querying the size, or
 count, or length of some sub-element like this -- so I think it would
 greatly benefit Phobos to choose some less ambiguous convention and
 stick to it.   Like  nCaptures, numCaptures, capturesLength, etc.
 
 ---bb


iter.count

Feb 19 2009

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

Denis Koroskin wrote:
 On Thu, 19 Feb 2009 23:23:13 +0300, Bill Baxter <wbaxter gmail.com> wrote:
 
 I don't like the syntax I saw somewhere earlier in the thread of
     0..iter.captures

 .captures looks like it should be a set of captures, not a count.

 This is a need that comes up again and again -- querying the size, or
 count, or length of some sub-element like this -- so I think it would
 greatly benefit Phobos to choose some less ambiguous convention and
 stick to it.   Like  nCaptures, numCaptures, capturesLength, etc.

 ---bb


 Agree. I thought that iter.captures is a set (range) of captures.
 


I'm done implementing that.

Andrei

Feb 19 2009

Bill Baxter <wbaxter gmail.com> writes:

On Fri, Feb 20, 2009 at 9:47 AM, KennyTM~ <kennytm gmail.com> wrote:
 Bill Baxter wrote:
 I don't like the syntax I saw somewhere earlier in the thread of
    0..iter.captures

 .captures looks like it should be a set of captures, not a count.

 This is a need that comes up again and again -- querying the size, or
 count, or length of some sub-element like this -- so I think it would
 greatly benefit Phobos to choose some less ambiguous convention and
 stick to it.   Like  nCaptures, numCaptures, capturesLength, etc.

 ---bb


 iter.count


Maybe I haven't paid close enough attention here, but I think the
reason he didn't say .count or .length is that it's ambiguous whether
it means the number of captures or the number of matches.
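
A small Python sketch of the two different counts, to show why a bare "count" is ambiguous. The pattern with two capturing groups is made up for the example; Pattern.groups and finditer() are standard library:

```python
import re

patt = re.compile(r"(.)(a)")
text = "abracadabra"

# Number of capturing groups: a property of the pattern alone.
num_captures = patt.groups

# Number of matches: depends on the text being searched.
num_matches = sum(1 for _ in patt.finditer(text))

print(num_captures, num_matches)  # 2 4
```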

--bb

Feb 19 2009

"Denis Koroskin" <2korden gmail.com> writes:

On Thu, 19 Feb 2009 19:00:41 +0300, Andrei Alexandrescu  
<SeeWebsiteForEmail erdani.org> wrote:

[snip]
 This is a little example of managing groups in Python:

 import re
 data = ">hello1 how are5 you?<"
 patt = re.compile(r".*?(hello\d).*?(are\d).*")
 patt.match(data).groups()








 auto data = ">hello1 how are5 you?<";
 auto iter = match(data, regex(r".*?(hello\d).*?(are\d).*"));
 foreach (i; 0 .. iter.engine.captures)
      writeln(iter.capture[i]);


I would expect that to be

foreach (/*Capture */ i; 0 .. iter.engine.captures)
     writeln(i);

 (notes that here all groups are found eagerly. If you want a lazy  
 matching in Python you have to use re.finditer() or  
 matchobj.finditer()).
  I may like a syntax similar to this, where opIndex() allows to find  
 the matched group:

 patt.match(data)[0]






 patt.match(data)[1]








 No go due to confusions with random-access ranges.


Why aren't iter.capture[0] and iter.capture[1] good enough?
How are they different from iter.engine.captures[0] and  
iter.engine.captures[1]?

Why is it a no-go if you access iter.captures as a random-access range?

I'm sorry if these are dumb questions, but the code you've shown is a bit  
confusing (these iter.engine.captures and iter.captures).

 Andrei

Feb 19 2009

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

Denis Koroskin wrote:
 On Thu, 19 Feb 2009 19:00:41 +0300, Andrei Alexandrescu 
 <SeeWebsiteForEmail erdani.org> wrote:
 
 [snip]
 This is a little example of managing groups in Python:

 import re
 data = ">hello1 how are5 you?<"
 patt = re.compile(r".*?(hello\d).*?(are\d).*")
 patt.match(data).groups()








 auto data = ">hello1 how are5 you?<";
 auto iter = match(data, regex(r".*?(hello\d).*?(are\d).*"));
 foreach (i; 0 .. iter.engine.captures)
      writeln(iter.capture[i]);


 I would expect that to be
 
 foreach (/*Capture */ i; 0 .. iter.engine.captures)
     writeln(i);
 
 (notes that here all groups are found eagerly. If you want a lazy 
 matching in Python you have to use re.finditer() or 
 matchobj.finditer()).
  I may like a syntax similar to this, where opIndex() allows to find 
 the matched group:

 patt.match(data)[0]






 patt.match(data)[1]








 No go due to confusions with random-access ranges.


 Why iter.capture[0] and iter.capture[1] aren't good enough?
 How are they different from iter.engine.captures[0] and 
 iter.engine.captures[1]?
 
 Why it is a no go if you access iter.captures as a random-access range?
 
 I'm sorry if these are dumb questions, but the code you've shown is a 
 bit confusing (these iter.engine.captures and iter.captures).


They're good. The code I posted was dumb. The "engine" thing does not 
belong there, and "captures" should indeed be a random-access range.


Andrei

Feb 19 2009

Bill Baxter <wbaxter gmail.com> writes:

I don't like the syntax I saw somewhere earlier in the thread of
    0..iter.captures

.captures looks like it should be a set of captures, not a count.

This is a need that comes up again and again -- querying the size, or
count, or length of some sub-element like this -- so I think it would
greatly benefit Phobos to choose some less ambiguous convention and
stick to it.   Like  nCaptures, numCaptures, capturesLength, etc.

---bb

Feb 19 2009

"Denis Koroskin" <2korden gmail.com> writes:

On Thu, 19 Feb 2009 23:23:13 +0300, Bill Baxter <wbaxter gmail.com> wrote:

 I don't like the syntax I saw somewhere earlier in the thread of
     0..iter.captures

 .captures looks like it should be a set of captures, not a count.

 This is a need that comes up again and again -- querying the size, or
 count, or length of some sub-element like this -- so I think it would
 greatly benefit Phobos to choose some less ambiguous convention and
 stick to it.   Like  nCaptures, numCaptures, capturesLength, etc.

 ---bb


Agree. I thought that iter.captures is a set (range) of captures.

Feb 19 2009
