
digitalmars.D.learn - Very Stupid Regex question

"seany" <seany uni-bonn.de> writes:
Please consider the following:

string s1 = "PREabcdPOST";
string s2 = "PREabPOST";


string[] srar = ["ab", "abcd"];
// this can not be constructed with a particular order

foreach(sr; srar)
{

   auto r = regex(sr, "g");
   auto m = matchFirst(s1, r);
   break;
   // this one matches ab
   // but I want this to match abcd
   // and for s2 I want to match ab

}

Obviously there are ways, like recording each match length and then 
using the longest match, instead of breaking as soon as a match 
is found.

Are there any better ways?
Aug 07 2014
"Marc Schütz" <schuetzm gmx.net> writes:
It's not clear to me what exactly you want, but: Are the regexes in `srar` related? That is, does one regex always include the previous one as a prefix? Then you can use optional matches:

    /ab(cd)?/

This will match "abcd" if it is there, but will also match "ab" otherwise.
Aug 07 2014
Justin Whear <justin economicmodeling.com> writes:
On Thu, 07 Aug 2014 16:05:16 +0000, seany wrote:

 obviously there are ways like counting the match length, and then using
 the maximum length, instead of breaking as soon as a match is found.
 
 Are there any other better ways?
You're not really using regexes properly. You want to greedily match as much as possible in this case, e.g.:

void main()
{
    import std.regex;
    auto re = regex("ab(cd)?");
    assert("PREabcdPOST".matchFirst(re).hit == "abcd");
    assert("PREabPOST".matchFirst(re).hit == "ab");
}
Aug 07 2014
"seany" <seany uni-bonn.de> writes:
The thing is, abcd is read from a file, and at compile time I don't know whether cd will be there at all, or whether it should be ab(ef) instead.
Aug 07 2014
"H. S. Teoh via Digitalmars-d-learn" <digitalmars-d-learn puremagic.com> writes:
So basically you have a file containing regex patterns, and you want to find the longest match among them? One way to do this is to combine them at runtime:

	string[] patterns = ... /* read from file, etc. */;

	// Longer patterns match first
	patterns.sort!((a,b) => a.length > b.length);

	// Build regex
	string regexStr = "%((%(%c%))%||%)".format(patterns);
	auto re = regex(regexStr);

	...

	// Run matches against input
	char[] input = ...;
	auto m = input.match(re);
	auto matchedString = m.captures[0];


T

-- 
When solving a problem, take care that you do not become part of the problem.
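A self-contained sketch of the combine-at-runtime idea, assuming the entries read from the file are plain literals rather than full regexes. The helper name `literalAlternation` is made up for this example, and it uses `std.regex.escaper` (available in reasonably recent Phobos releases) so that characters such as `+` or `(` in a literal are not interpreted as regex syntax:

```d
import std.algorithm : map, sort;
import std.array : array, join;
import std.conv : to;
import std.regex;

/// Hypothetical helper: builds one alternation regex from literal strings,
/// with longer literals placed first so that they are tried first.
auto literalAlternation(string[] literals)
{
    auto sorted = literals.dup;          // leave the caller's array untouched
    sorted.sort!((a, b) => a.length > b.length);

    // Escape each literal and wrap it in a non-capturing group.
    auto src = sorted
        .map!(s => "(?:" ~ s.escaper.array.to!string ~ ")")
        .join("|");
    return regex(src);
}

void main()
{
    // The input order of the literals does not matter.
    auto re = literalAlternation(["ab", "abcd"]);
    assert("PREabcdPOST".matchFirst(re).hit == "abcd");
    assert("PREabPOST".matchFirst(re).hit == "ab");
}
```

Sorting by length is only a meaningful proxy for "longest match" when every alternative is a literal, since a literal always matches exactly its own length.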
Aug 07 2014
Justin Whear <justin economicmodeling.com> writes:
On Thu, 07 Aug 2014 10:22:37 -0700, H. S. Teoh via Digitalmars-d-learn
wrote:

 
 So basically you have a file containing regex patterns, and you want to
 find the longest match among them?
 	// Longer patterns match first
 	patterns.sort!((a,b) => a.length > b.length);
 
 	// Build regex
 	string regexStr = "%((%(%c%))%||%)".format(patterns);
 	auto re = regex(regexStr);
This only works if the patterns are simple literals. E.g. the pattern 'a+' might match a longer sequence than 'aaa', even though the pattern text itself is shorter. If you're out for the longest possible match, iteratively testing each pattern is probably the way to go.
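The iterative approach could look roughly like the sketch below. The function name `longestMatch` is hypothetical; it compiles every pattern separately, takes the first hit of each in the input, and keeps whichever hit is longest:

```d
import std.regex;

/// Hypothetical helper: returns the longest hit that any of `patterns`
/// produces in `input`, or null if none of them matches.
string longestMatch(string input, string[] patterns)
{
    string best = null;
    foreach (p; patterns)
    {
        auto m = input.matchFirst(regex(p));
        // Keep this hit only if it is strictly longer than the best so far.
        if (!m.empty && m.hit.length > best.length)
            best = m.hit;
    }
    return best;
}

void main()
{
    // The order of the patterns does not matter.
    auto patterns = ["ab", "abcd"];
    assert(longestMatch("PREabcdPOST", patterns) == "abcd");
    assert(longestMatch("PREabPOST", patterns) == "ab");

    // A short pattern can still produce the longest hit.
    assert(longestMatch("PREaaaaPOST", ["aaa", "a+"]) == "aaaa");

    assert(longestMatch("nothing here", ["xyz"]) is null);
}
```

Note that this compares only the first (leftmost) hit of each pattern; if a pattern can match in several places, `matchAll` would be needed to consider the later hits as well.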
Aug 07 2014
"H. S. Teoh via Digitalmars-d-learn" <digitalmars-d-learn puremagic.com> writes:
Hmm, you're right. I was a bit disappointed to find out that the | operator in std.regex (and also in Perl's regex) doesn't do longest-match but first-match. :-( I had always thought it did longest-match, like in lex/flex.

I wish we could extend std.regex to allow longest-match for alternations... but there may be performance consequences.


T

-- 
There's light at the end of the tunnel. It's the oncoming train.
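A small illustration of the first-match behaviour described above, using the strings from the original question. This is a sketch of what the thread reports for std.regex, not a statement about every regex engine:

```d
import std.regex;

void main()
{
    // "ab" is listed first, so it wins even though "abcd" would be longer.
    assert("PREabcdPOST".matchFirst(regex("ab|abcd")).hit == "ab");

    // Listing the longer alternative first yields the longer hit.
    assert("PREabcdPOST".matchFirst(regex("abcd|ab")).hit == "abcd");
}
```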
Aug 07 2014
"H. S. Teoh via Digitalmars-d-learn" <digitalmars-d-learn puremagic.com> writes:
https://issues.dlang.org/show_bug.cgi?id=13268


T

-- 
Valentine's Day: an occasion for florists to reach into the wallets of nominal lovers in dire need of being reminded to profess their hypothetical love for their long-forgotten.
Aug 07 2014
"seany" <seany uni-bonn.de> writes:
On Thursday, 7 August 2014 at 18:16:11 UTC, H. S. Teoh via 
Digitalmars-d-learn wrote:

 https://issues.dlang.org/show_bug.cgi?id=13268


 T
Thank you soooooooooo much!!
Aug 07 2014