
digitalmars.D.learn - Question about using regex

reply James Oliphant <jollie.roger gmail.com> writes:
While following the regex discussion, I have been compiling the examples 
to help with my understanding of how it works.

From Dmitry's example page:
	http://blackwhale.github.com/regular-expression.html
and from the dlang.org website:
	http://dlang.org/phobos/std_regex.html

std.regex.replace calls a delegate
	auto delegate(Captures!string)
which does not compile.  The definition in Phobos for Captures is
	struct Captures(R,DIndex)
and for the purposes of these examples changing the delegate to
	auto delegate(Captures!(string,uint))
seems to work.  Is this correct?


In another example on Dmitry's page that starts:
	auto m = match("Ranges are hot!", r"(\w)\w*(\w)"); //at least 3 "word" symbols
The output from the example is "Ranges, R, s", but I don't quite 
understand why those were the matches in this case.  Also, does the 
regular expression imply a match of at least 2 "word" symbols, given that \w* means 
match 0 or more "word" symbols?

These newsgroups are a great resource, keep up the great work!
Mar 21 2012
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 21.03.2012 20:05, James Oliphant wrote:
 While following the regex discussion, I have been compiling the examples
 to help with my understanding of how it works.

  From Dmitry's example page:
 	http://blackwhale.github.com/regular-expression.html
 and from the dlang.org website:
 	http://dlang.org/phobos/std_regex.html

 std.regex.replace calls a delegate
 	auto delegate(Captures!string)
 which does not compile.  The definition in Phobos for Captures is
 	struct Captures(R,DIndex)
 and for the purposes of these examples changing the delegate to
 	auto delegate(Captures!(string,uint))
 seems to work.  Is this correct?
Mm-hm, it means the fix to use size_t as the default index type is upstream but, I think, not in 2.058. Users should not need to specify the index type; that parameter is a hook for future extension.
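For anyone trying this on a current compiler, here is a minimal sketch of a replace callback that takes Captures!string with no index type spelled out. It assumes a Phobos release that provides replaceAll, which postdates this thread (at the time the call was replace with the "g" flag). The names shout and shoutLetters are made up for illustration.

```d
import std.regex;
import std.stdio;
import std.uni : toUpper;

// Callback invoked once per match; m.hit is the text of the whole match.
string shout(Captures!string m)
{
    return toUpper(m.hit);
}

// Hypothetical helper: upper-cases every 'a' or 'r' in the input.
string shoutLetters(string text)
{
    // replaceAll is assumed to be available (newer Phobos).
    return replaceAll!(shout)(text, regex("[ar]"));
}

void main()
{
    // prints: StRAp A Rocket engine on A chicken.
    writeln(shoutLetters("Strap a rocket engine on a chicken."));
}
```

If Captures!string does not compile on an older release such as 2.058, the workaround from the question, Captures!(string,uint), is the thing to try.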
 In another example on Dmitry's page that starts:
 	auto m = match("Ranges are hot!", r"(\w)\w*(\w)"); //at least 3 "word" symbols
 The output from the example is "Ranges, R, s", but I don't quite
 understand why those were the matches in this case.
Ok, \w matches any single word character, that is, alpha, numeric or one of a few other oddities*. Now the first (\w) captures 1 character into the 1st _submatch_ ('R'). \w* then consumes the rest of the word and backtracks by one character so that the next (\w) can match. The second (\w) thus captures the last char ('s') into the 2nd _submatch_. captures lists the submatches captured during one match; [0] is the whole match.

I get it that people tend to think that I was about to show multiple _matches_ here, but that belongs to the next chapter. Here I was just showing how to work with submatches; that needs to be stressed somehow.

*This is an enormously useful tool for getting info on Unicode stuff, and regex in particular:
http://unicode.org/cldr/utility/index.jsp

 Also does the
 regular expression imply match at least 2 "word" symbols where \w* means
 match 0 or more "word" symbols?
Yup, that's right: at least 2. I should correct the wording.
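A small sketch of the submatch behaviour discussed above, assuming a Phobos release that provides matchFirst (newer than the match call used in the thread). The helper name submatches is hypothetical.

```d
import std.regex;
import std.stdio;

// Hypothetical helper: returns [whole match, 1st submatch, 2nd submatch],
// or null when the pattern does not match at all.
string[] submatches(string text)
{
    auto m = matchFirst(text, regex(r"(\w)\w*(\w)"));
    if (m.empty)
        return null;
    // m[0] is the whole match; m[1] and m[2] are the two (\w) groups.
    return [m[0], m[1], m[2]];
}

void main()
{
    writeln(submatches("Ranges are hot!")); // ["Ranges", "R", "s"]
    writeln(submatches("ab"));              // ["ab", "a", "b"]
    writeln(submatches("a").length);        // 0: a single word character is not enough
}
```

The "ab" case shows \w* matching zero characters, so two word characters are the minimum.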
 These newsgroups are a great resource, keep up the great work!
You are welcome.
-- Dmitry Olshansky
Mar 21 2012
parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 21.03.2012 21:13, Dmitry Olshansky wrote:
Ok, \w matches any single word character, that is, alpha, numeric or one of a few other oddities*. Now the first (\w) captures 1 character into the 1st _submatch_ ('R'). \w* then consumes the rest of the word and backtracks by one character so that the next (\w) can match. The second (\w) thus captures the last char ('s') into the 2nd _submatch_. captures lists the submatches captured during one match; [0] is the whole match. I get it that people tend to think that I was about to show multiple _matches_ here, but that belongs to the next chapter. Here I was just showing how to work with submatches; that needs to be stressed somehow.
Oh wait, it's in this chapter :) I should probably make more noise about the "g" flag, and separate submatches from the range of matches more cleanly.
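To illustrate the distinction between the range of matches and the submatches of one match, here is a hedged sketch. It assumes a Phobos release that provides matchAll, which plays the role of match with the "g" flag mentioned above. The helper name allWords is hypothetical.

```d
import std.regex;
import std.stdio;

// Hypothetical helper: collects the whole-match text of every match.
string[] allWords(string text)
{
    string[] hits;
    // matchAll yields one Captures object per match, left to right.
    foreach (m; matchAll(text, regex(r"(\w)\w*(\w)")))
        hits ~= m.hit; // m.hit is the whole match; m[1], m[2] are its submatches
    return hits;
}

void main()
{
    writeln(allWords("Ranges are hot!")); // ["Ranges", "are", "hot"]
}
```

Each element of the range is a separate match with its own pair of submatches, whereas the earlier "Ranges, R, s" output is a single match listed together with its submatches.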
-- Dmitry Olshansky
Mar 21 2012