digitalmars.D - Re: Is str ~ regex the root of all evil, or the leaf of all good?

bearophile <bearophileHUGS lycos.com> writes:
Andrei Alexandrescu:

I think the "g", "i", and "m" flags are popular enough if you've done any
amount of regex

I think I don't like the "g".

-----------------------

To test an API it's often good to try to use it, or to compare it against similar practical and common operations done with another language or library. So here I show two examples in Python. You can try to translate these two operations to the std.re of D2 to see how they turn out :-)

The first example shows the use of a callable for re.sub() (in D it may be called replace()). Here replacer() is a user-defined function given to re.sub()/matchobj.sub(), which call it on each match. Note that in Python functions are objects, so I have dynamically added to the replacer() function an instance attribute named "counter". In D (and Python) you can do the same thing by creating a small class with a counter attribute.

import re

def replacer(mobj):
    replacer.counter += 1
    return "REPL%02d" % replacer.counter
replacer.counter = 0

s1 = ".......TAG............TAG................TAG..........TAG....."
result = ".......REPL01............REPL02................REPL03..........REPL04....."
r = re.sub("TAG", replacer, s1)
assert r == result

----------

This is a little example of managing groups in Python:
 import re
 data = ">hello1 how are5 you?<"
 patt = re.compile(r".*?(hello\d).*?(are\d).*")
 patt.match(data).groups()



(Note that here all groups are found eagerly. If you want lazy matching in Python you have to use re.finditer() or matchobj.finditer().)

I might like a syntax similar to this, where opIndex() returns the matched group:
 patt.match(data)[0]



 patt.match(data)[1]
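
For comparison, here is a minimal Python sketch of index-style access to a match. It assumes Python 3.6 or later, where match objects accept `m[i]`. In that convention index 0 is the whole match and the captures start at 1, which is only one possible numbering and not necessarily what a D design would pick.

```python
import re

data = ">hello1 how are5 you?<"
patt = re.compile(r".*?(hello\d).*?(are\d).*")

m = patt.match(data)
whole = m[0]    # index 0 is the entire matched text
first = m[1]    # index 1 is the first capture
second = m[2]   # index 2 is the second capture
print(first, second)
```

Note that indexing here selects a capture inside one match; it does not step from one match to the next.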



Bye, bearophile
Feb 19 2009
Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
bearophile wrote:
 Andrei Alexandrescu:
 
 I think the "g", "i", and "m" flags are popular enough if you've done any
amount of regex

I think I don't like the "g".

How can anyone think they don't like something? You like it or not, but it's not the result of a thought process. I guess.

Anyway: g is from Perl. Let's keep it that way.

Andrei
Feb 19 2009
Derek Parnell <derek psych.ward> writes:
On Thu, 19 Feb 2009 07:51:46 -0800, Andrei Alexandrescu wrote:

 bearophile wrote:
 Andrei Alexandrescu:
 
 I think the "g", "i", and "m" flags are popular enough if you've done any
amount of regex

I think I don't like the "g".

How can anyone think they don't like something? You like it or not, but it's not the result of a thought process. I guess.

It is not a question of whether one likes or doesn't like; this expression is attempting to say something about one's level of certainty about liking something. That is to say, one might not be positive if they *know* if they like something or not, therefore they *think* (suspect, but not have definitive evidence) of their stance.
 Anyway: g is from Perl. Let's keep it that way.

Perfect justification ;-)

--
Derek Parnell
Melbourne, Australia
skype: derek.j.parnell
Feb 19 2009
Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Derek Parnell wrote:
 On Thu, 19 Feb 2009 07:51:46 -0800, Andrei Alexandrescu wrote:
 
 bearophile wrote:
 Andrei Alexandrescu:

 I think the "g", "i", and "m" flags are popular enough if you've done any
amount of regex

I think I don't like the "g".

it's not the result of a thought process. I guess.

It is not a question of whether one likes or doesn't like; this expression is attempting to say something about one's level of certainty about liking something. That is to say, one might not be positive if they *know* if they like something or not, therefore they *think* (suspect, but not have definitive evidence) of their stance.

I see. Me, I always use "think" to evoke an actual thinking process. Otherwise I use "feel" or "believe". (This turns out to be important in various interpersonal interactions, e.g. do you want to drive the conversation towards thoughts or feelings? Guess which is gonna get you a date :o).) So by definition I can't think I like something.

But I understand how some may use "I think" as a synonym for "Without being sure, to me it seems".

Andrei
Feb 19 2009
Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
bearophile wrote:
 Andrei Alexandrescu:
 
 I think the "g", "i", and "m" flags are popular enough if you've done any
amount of regex

 I think I don't like the "g".

 -----------------------

 To test an API it's often good to try to use it, or to compare it against similar practical and common operations done with another language or library. So here I show two examples in Python. You can try to translate these two operations to the std.re of D2 to see how they turn out :-)

 The first example shows the use of a callable for re.sub() (in D it may be called replace()). Here replacer() is a user-defined function given to re.sub()/matchobj.sub(), which call it on each match. Note that in Python functions are objects, so I have dynamically added to the replacer() function an instance attribute named "counter". In D (and Python) you can do the same thing by creating a small class with a counter attribute.

 import re

 def replacer(mobj):
     replacer.counter += 1
     return "REPL%02d" % replacer.counter
 replacer.counter = 0

 s1 = ".......TAG............TAG................TAG..........TAG....."
 result = ".......REPL01............REPL02................REPL03..........REPL04....."
 r = re.sub("TAG", replacer, s1)
 assert r == result

 ----------

Excellent idea. Let's see:

uint counter;
string replacer(string) { return format("REPL%02d", ++counter); }
auto s1 = ".......TAG............TAG................TAG..........TAG.....";
auto result = ".......REPL01............REPL02................REPL03..........REPL04.....";
auto r = replace!(replacer)(s1, "TAG");
assert(r == result);

 This is a little example of managing groups in Python:
 
 import re
 data = ">hello1 how are5 you?<"
 patt = re.compile(r".*?(hello\d).*?(are\d).*")
 patt.match(data).groups()




auto data = ">hello1 how are5 you?<";
auto iter = match(data, regex(r".*?(hello\d).*?(are\d).*"));
foreach (i; 0 .. iter.engine.captures)
    writeln(iter.capture[i]);

 (notes that here all groups are found eagerly. If you want a lazy matching in
Python you have to use re.finditer() or matchobj.finditer()).
 
 I may like a syntax similar to this, where opIndex() allows to find the
matched group:
 
 patt.match(data)[0]



 patt.match(data)[1]




No go due to confusions with random-access ranges.

Andrei
Feb 19 2009
bearophile <bearophileHUGS lycos.com> writes:
Andrei Alexandrescu:

 Excellent idea. Let's see:

Thank you for all your work and the will to answer the posts here. Some usable API is slowly shaping up :-)
 uint counter;
 string replacer(string) { return format("REPL%02d", ++counter); }
 auto s1 = ".......TAG............TAG................TAG..........TAG.....";
 auto result = ".......REPL01............REPL02................REPL03..........REPL04.....";
 auto r = replace!(replacer)(s1, "TAG");
 assert(r == result);

It looks good enough. With a static variable it may become:

string replacer(string) {
    static int counter;
    return format("REPL%02d", ++counter);
}

With a small struct/class it may become:

struct Replacer {
    int counter;
    string opCall(string s) {
        this.counter++;
        return format("REPL%02d", counter);
    }
}

-------------------
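
The small-class variant can also be sketched on the Python side. This is only an illustration: the class name `Replacer` and the shortened input string are assumptions made for the example, while `re.sub()` accepting any callable is standard Python behaviour.

```python
import re

class Replacer:
    """Callable object that numbers each replacement it produces."""

    def __init__(self):
        self.counter = 0

    def __call__(self, mobj):
        # Called by re.sub() once per match, in order.
        self.counter += 1
        return "REPL%02d" % self.counter

s1 = "..TAG....TAG.."
r = re.sub("TAG", Replacer(), s1)
print(r)
```

Keeping the counter on an instance means each call to re.sub() can start from its own fresh state, unlike a module-level or static counter.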
 auto data = ">hello1 how are5 you?<";
 auto iter = match(data, regex(r".*?(hello\d).*?(are\d).*"));
 foreach (i; 0 .. iter.engine.captures)
      writeln(iter.capture[i]);

I don't understand that. What's the purpose of ".engine"?

"captures" may be better named "ngroups" or "ncaptures", or you may just use the .len/.length attribute in some way.

foreach (i, group; iter.groups)
    writeln(i, " ", group);

"group" may be a struct that defines toString and can be cast to string, and that also keeps the starting position of the group in the original string.

Bye,
bearophile
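
A rough Python sketch of such a "group" value, one that prints as its text and also remembers where it starts in the original string. The names `Capture` and `captures()` are hypothetical and chosen only for this illustration; `group()`, `start()` and `re.groups` are the standard Python match and pattern members.

```python
import re
from dataclasses import dataclass

@dataclass
class Capture:
    text: str    # the captured substring
    start: int   # offset of the capture in the original string

    def __str__(self):
        return self.text

def captures(mobj):
    """Build one Capture per capturing group of a successful match."""
    count = mobj.re.groups  # number of capturing groups in the pattern
    return [Capture(mobj.group(i), mobj.start(i)) for i in range(1, count + 1)]

data = ">hello1 how are5 you?<"
m = re.match(r".*?(hello\d).*?(are\d).*", data)
caps = captures(m)
for i, cap in enumerate(caps):
    print(i, cap, cap.start)
```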
Feb 19 2009
Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
bearophile wrote:
 Andrei Alexandrescu:
 auto data = ">hello1 how are5 you?<";
 auto iter = match(data, regex(r".*?(hello\d).*?(are\d).*"));
 foreach (i; 0 .. iter.engine.captures)
      writeln(iter.capture[i]);

I don't understand that. What's the purpose of ".engine"?

It's the regex engine that has generated the match. I coded that wrong in two different ways; it should have been:

foreach (i; 0 .. iter.captures)
    writeln(iter.capture(i));

 "captures" may be better named "ngroups" or "ncaptures", or you may just use
the .len/.length attribute in some way.

"Capture" is the traditional term as far as I understand. I can't use .length because it messes up with range semantics. "len" would be too confusing. "ncaptures" is too cute. Nobody's perfect :o).
 foreach (i, group; iter.groups)
     writeln(i, " ", group);
 
 "group" may be a struct that defines toString and can be cast to string, and
also keeps the starting position of the group into the original string.

That sounds good.

Andrei
Feb 19 2009
bearophile <bearophileHUGS lycos.com> writes:
Andrei Alexandrescu:

 foreach (i; 0 .. iter.captures)
       writeln(iter.capture(i));

 "Capture" is the traditional term as far as I understand. I can't use
 .length because it messes up with range semantics. "len" would be too
 confusing. "ncaptures" is too cute. Nobody's perfect :o).

 "group" may be a struct that defines toString and can be cast to string,
 and also keeps the starting position of the group into the original string.


 That sounds good.

Well, then match() may return just a dynamic array of such groups/captures. So such an array has both .length and opIndex. It looks simple :-)

Bye,
bearophile
Feb 19 2009
Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
bearophile wrote:
 Andrei Alexandrescu:
 
 foreach (i; 0 .. iter.captures) writeln(iter.capture(i));

 "Capture" is the traditional term as far as I understand. I can't
 use .length because it messes up with range semantics. "len" would
 be too confusing. "ncaptures" is too cute. Nobody's perfect :o).

 "group" may be a struct that defines toString and can be cast to
 string, and also keeps the starting position of the group into
 the original string.


 That sounds good.

Well, then match() may return just a dynamic array of such groups/captures. So such array has both .length and opIndex. It looks simple :-)

Looks simple but it isn't. How do you advance to the next match?

foreach (m; "abracadabra".match("(.)a", "g"))
    writeln(m.capture[0]);

This should print:

r
c
d
r

There's a need to make progress in the matching, not in the capture. How do you distinguish between them?

Andrei
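
The same "advance through matches" behaviour can be sketched in Python with re.finditer(), which yields one match object per non-overlapping match. Note that Python numbers the first capture as group 1, whereas the D snippet above writes capture[0].

```python
import re

# One match object per non-overlapping occurrence of "<any char>a".
firsts = [m.group(1) for m in re.finditer(r"(.)a", "abracadabra")]
print(firsts)
```

Here the iteration makes progress over matches; picking a capture is a separate step performed on each match.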
Feb 19 2009
"jovo" <jovo at.home> writes:
"Andrei Alexandrescu" <SeeWebsiteForEmail erdani.org> wrote in message 
news:gnk8te$cgl$1 digitalmars.com...
 Looks simple but it isn't. How do you advance to the next match?

 foreach (m; "abracadabra".match("(.)a", "g")) writeln(m.capture[0]);

 This should print:

 r
 c
 d
 r

 There's need to make progress in the matching, not in the capture. How do 
 you distinguish among them?


 Andrei

foreach(capture; match(s, r))
    foreach(group; capture)
        writeln(group);
Feb 19 2009
Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
jovo wrote:
 "Andrei Alexandrescu" <SeeWebsiteForEmail erdani.org> wrote in message 
 news:gnk8te$cgl$1 digitalmars.com...
 Looks simple but it isn't. How do you advance to the next match?

 foreach (m; "abracadabra".match("(.)a", "g")) writeln(m.capture[0]);

 This should print:

 r
 c
 d
 r

 There's need to make progress in the matching, not in the capture. How do 
 you distinguish among them?


 Andrei

 foreach(capture; match(s, r))
     foreach(group; capture)
         writeln(group);

The consecrated terminology is:

foreach(match; match(s, r))
    foreach(capture; match)
        writeln(capture);

"Group" is a group defined without an intent to capture. A "capture" is a group that also binds to the state of the match.

Anyhow... this can be done but things get a tad more confusing for other uses. How about this:

foreach(match; match(s, r))
    foreach(capture; match.captures)
        writeln(capture);

?

Andrei
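
The two-level shape (outer loop over matches, inner loop over the captures of each match) can be sketched in Python as follows. The helper name `all_captures` and the sample pattern are assumptions made for the illustration; `finditer()` and `groups()` are standard.

```python
import re

def all_captures(pattern, text):
    """Yield the tuple of captures for each successive match."""
    for m in re.finditer(pattern, text):
        yield m.groups()

rows = list(all_captures(r"(\w)(\d)", "a1 b2 c3"))
for caps in rows:        # outer loop: one step per match
    for cap in caps:     # inner loop: the captures of that match
        print(cap)
```

Exposing the captures through an explicit member, as with match.captures above, keeps it unambiguous which of the two levels a loop is walking.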
Feb 19 2009
"jovo" <jovo at.home> writes:
"Andrei Alexandrescu" <SeeWebsiteForEmail erdani.org> wrote in message 
news:gnkc24$hul$1 digitalmars.com...
 The consecrated terminology is:

 foreach(match; match(s, r))
    foreach(capture; match)
      writeln(capture);

 "Group" is a group defined without an intent to capture. A "capture" is a 
 group that also binds to the state of the match.

 Anyhow... this can be done but things get a tad more confusing for other 
 uses. How about this:

 foreach(match; match(s, r))
    foreach(capture; match.captures)
      writeln(capture);

 ?


 Andrei

I think you must answer this question more generally, the same way for the whole library. Maybe both?
Feb 19 2009
Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
jovo wrote:
 "Andrei Alexandrescu" <SeeWebsiteForEmail erdani.org> wrote in message 
 news:gnkc24$hul$1 digitalmars.com...
 The consecrated terminology is:

 foreach(match; match(s, r))
    foreach(capture; match)
      writeln(capture);

 "Group" is a group defined without an intent to capture. A "capture" is a 
 group that also binds to the state of the match.

 Anyhow... this can be done but things get a tad more confusing for other 
 uses. How about this:

 foreach(match; match(s, r))
    foreach(capture; match.captures)
      writeln(capture);

 ?


 Andrei

 I think you must answer this question more generally, the same way for the whole library. Maybe both?

I'd hate to fall again into the fallacy of trying to appease everyone's taste. Really, std.regexp has set a negative record with its incredible array of names: find, search, exec, match, test, and I have probably forgotten a couple. It has also offered a variety of random features in both free-function and member-function format, not even always doing the same thing.

Germans have a saying: "Kurz und gut". Let's make it short and good.

Andrei
Feb 19 2009
KennyTM~ <kennytm gmail.com> writes:
Bill Baxter wrote:
 I don't like the syntax I saw somewhere earlier in the thread of
     0..iter.captures
 
 .captures looks like it should be a set of captures, not a count.
 
 This is a need that comes up again and again -- querying the size, or
 count, or length of some sub-element like this -- so I think it would
 greatly benefit Phobos to choose some less ambiguous convention and
 stick to it.   Like  nCaptures, numCaptures, capturesLength, etc.
 
 ---bb

iter.count
Feb 19 2009
Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Denis Koroskin wrote:
 On Thu, 19 Feb 2009 23:23:13 +0300, Bill Baxter <wbaxter gmail.com> wrote:
 
 I don't like the syntax I saw somewhere earlier in the thread of
     0..iter.captures

 .captures looks like it should be a set of captures, not a count.

 This is a need that comes up again and again -- querying the size, or
 count, or length of some sub-element like this -- so I think it would
 greatly benefit Phobos to choose some less ambiguous convention and
 stick to it.   Like  nCaptures, numCaptures, capturesLength, etc.

 ---bb

Agree. I thought that iter.captures is a set (range) of captures.

I'm done implementing that.

Andrei
Feb 19 2009
Bill Baxter <wbaxter gmail.com> writes:
On Fri, Feb 20, 2009 at 9:47 AM, KennyTM~ <kennytm gmail.com> wrote:
 Bill Baxter wrote:
 I don't like the syntax I saw somewhere earlier in the thread of
    0..iter.captures

 .captures looks like it should be a set of captures, not a count.

 This is a need that comes up again and again -- querying the size, or
 count, or length of some sub-element like this -- so I think it would
 greatly benefit Phobos to choose some less ambiguous convention and
 stick to it.   Like  nCaptures, numCaptures, capturesLength, etc.

 ---bb

iter.count

Maybe I haven't paid close enough attention here, but I think the reason he didn't say .count or .length is that it's ambiguous whether it means the number of captures or the number of matches.

--bb
Feb 19 2009
"Denis Koroskin" <2korden gmail.com> writes:
On Thu, 19 Feb 2009 19:00:41 +0300, Andrei Alexandrescu  
<SeeWebsiteForEmail erdani.org> wrote:

[snip]
 This is a little example of managing groups in Python:

 import re
 data = ">hello1 how are5 you?<"
 patt = re.compile(r".*?(hello\d).*?(are\d).*")
 patt.match(data).groups()




 auto data = ">hello1 how are5 you?<";
 auto iter = match(data, regex(r".*?(hello\d).*?(are\d).*"));
 foreach (i; 0 .. iter.engine.captures)
     writeln(iter.capture[i]);

I would expect that to be

foreach (/*Capture */ i; 0 .. iter.engine.captures)
    writeln(i);

 (notes that here all groups are found eagerly. If you want a lazy  
 matching in Python you have to use re.finditer() or  
 matchobj.finditer()).
  I may like a syntax similar to this, where opIndex() allows to find  
 the matched group:

 patt.match(data)[0]



 patt.match(data)[1]




No go due to confusions with random-access ranges.

Why aren't iter.capture[0] and iter.capture[1] good enough? How are they different from iter.engine.captures[0] and iter.engine.captures[1]? Why is it a no-go to access iter.captures as a random-access range?

I'm sorry if these are dumb questions, but the code you've shown is a bit confusing (these iter.engine.captures and iter.captures).
 Andrei

Feb 19 2009
Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Denis Koroskin wrote:
 On Thu, 19 Feb 2009 19:00:41 +0300, Andrei Alexandrescu 
 <SeeWebsiteForEmail erdani.org> wrote:
 
 [snip]
 This is a little example of managing groups in Python:

 import re
 data = ">hello1 how are5 you?<"
 patt = re.compile(r".*?(hello\d).*?(are\d).*")
 patt.match(data).groups()




 auto data = ">hello1 how are5 you?<";
 auto iter = match(data, regex(r".*?(hello\d).*?(are\d).*"));
 foreach (i; 0 .. iter.engine.captures)
     writeln(iter.capture[i]);

 I would expect that to be

 foreach (/*Capture */ i; 0 .. iter.engine.captures)
     writeln(i);

 (notes that here all groups are found eagerly. If you want a lazy 
 matching in Python you have to use re.finditer() or 
 matchobj.finditer()).
  I may like a syntax similar to this, where opIndex() allows to find 
 the matched group:

 patt.match(data)[0]



 patt.match(data)[1]




No go due to confusions with random-access ranges.

 Why aren't iter.capture[0] and iter.capture[1] good enough? How are they different from iter.engine.captures[0] and iter.engine.captures[1]? Why is it a no-go to access iter.captures as a random-access range?

 I'm sorry if these are dumb questions, but the code you've shown is a bit confusing (these iter.engine.captures and iter.captures).

They're good. The code I posted was dumb. The "engine" thing does not belong there, and "captures" should indeed be a random-access range.

Andrei
Feb 19 2009
Bill Baxter <wbaxter gmail.com> writes:
I don't like the syntax I saw somewhere earlier in the thread of
    0..iter.captures

.captures looks like it should be a set of captures, not a count.

This is a need that comes up again and again -- querying the size, or
count, or length of some sub-element like this -- so I think it would
greatly benefit Phobos to choose some less ambiguous convention and
stick to it.   Like  nCaptures, numCaptures, capturesLength, etc.

---bb
Feb 19 2009
"Denis Koroskin" <2korden gmail.com> writes:
On Thu, 19 Feb 2009 23:23:13 +0300, Bill Baxter <wbaxter gmail.com> wrote:

 I don't like the syntax I saw somewhere earlier in the thread of
     0..iter.captures

 .captures looks like it should be a set of captures, not a count.

 This is a need that comes up again and again -- querying the size, or
 count, or length of some sub-element like this -- so I think it would
 greatly benefit Phobos to choose some less ambiguous convention and
 stick to it.   Like  nCaptures, numCaptures, capturesLength, etc.

 ---bb

Agree. I thought that iter.captures is a set (range) of captures.
Feb 19 2009