
digitalmars.D.learn - regex - match/matchAll and bmatch - different output

Ivan Kazmenko <gassa mail.ru> writes:
Hi,

While solving Advent of Code problems for fun (already discussed
in the forum:
http://forum.dlang.org/post/cwdkmblukzptsrsrvdkr@forum.dlang.org), I ran into
an issue.  I wanted to test for the pattern "two consecutive characters,
arbitrary sequence, the same two consecutive characters".  Sadly, my solution
using regular expressions gave a wrong result, but a hand-written one was
accepted.
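
For reference, a hand-written check of this kind could look roughly like the sketch below. It is not the exact code that was accepted; the helper name hasRepeatedPair is made up for illustration, and the code assumes ASCII input so that slicing by index is safe.

```d
import std.algorithm.searching : canFind;
import std.stdio : writeln;

// Hypothetical helper, not taken from the original solution.
// Returns true if some two-character slice occurs again later in the
// string without overlapping itself, i.e. what (..).*\1 is meant to find.
// Assumes ASCII input, so byte indices coincide with characters.
bool hasRepeatedPair(string s)
{
    // The pair s[i .. i + 2] needs at least two more characters after it.
    for (size_t i = 0; i + 4 <= s.length; ++i)
    {
        if (s[i + 2 .. $].canFind(s[i .. i + 2]))
            return true;
    }
    return false;
}

void main()
{
    writeln(hasRepeatedPair("abab"));   // true
    writeln(hasRepeatedPair("xabab"));  // true
    writeln(hasRepeatedPair("abac"));   // false
    writeln(hasRepeatedPair("aaa"));    // false: the two pairs overlap
}
```

The scan is quadratic in the worst case, which should be fine for short puzzle inputs.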

The problem reduced to the following:

import std.regex, std.stdio;
void main ()
{
	writeln (bmatch   ("abab",  r"(..).*\1"));  // [["abab", "ab"]]
	writeln (match    ("abab",  r"(..).*\1"));  // [["abab", "ab"]]
	writeln (matchAll ("abab",  r"(..).*\1"));  // [["abab", "ab"]]
	writeln (bmatch   ("xabab", r"(..).*\1"));  // [["abab", "ab"]]
	writeln (match    ("xabab", r"(..).*\1"));  // []
	writeln (matchAll ("xabab", r"(..).*\1"));  // []
}

As you can see, bmatch (usage discouraged in the docs) gives me 
the result I want, but match (also discouraged) and matchAll (way 
to go) don't.

Am I misusing matchAll, or is this a bug?

Ivan Kazmenko.
Dec 30 2015
Ivan Kazmenko <gassa mail.ru> writes:
On Wednesday, 30 December 2015 at 11:06:55 UTC, Ivan Kazmenko wrote:
> ...
>
> As you can see, bmatch (usage discouraged in the docs) gives me
> the result I want, but match (also discouraged) and matchAll
> (way to go) don't.
>
> Am I misusing matchAll, or is this a bug?

Reported as https://issues.dlang.org/show_bug.cgi?id=15489.
Dec 31 2015
anonymous <anonymous example.com> writes:
On 30.12.2015 12:06, Ivan Kazmenko wrote:
> import std.regex, std.stdio;
> void main ()
> {
>     writeln (bmatch   ("abab",  r"(..).*\1"));  // [["abab", "ab"]]
>     writeln (match    ("abab",  r"(..).*\1"));  // [["abab", "ab"]]
>     writeln (matchAll ("abab",  r"(..).*\1"));  // [["abab", "ab"]]
>     writeln (bmatch   ("xabab", r"(..).*\1"));  // [["abab", "ab"]]
>     writeln (match    ("xabab", r"(..).*\1"));  // []
>     writeln (matchAll ("xabab", r"(..).*\1"));  // []
> }
>
> As you can see, bmatch (usage discouraged in the docs) gives me the
> result I want, but match (also discouraged) and matchAll (way to go)
> don't.
>
> Am I misusing matchAll, or is this a bug?

The `\1` there is a backreference. Backreferences are not part of
regular expressions, in the sense that they allow you to describe more
than regular languages. [1]

As far as I know, bmatch uses a widespread matching mechanism, while
match/matchAll use a different, less common one. It wouldn't surprise me
if match/matchAll simply didn't support backreferences.

Backreferences are not documented, as far as I can see, but they're
working in other patterns. So, yeah, this is possibly a bug.

[1] https://en.wikipedia.org/wiki/Regular_expression#Patterns_for_non-regular_languages
Jan 01 2016
Ivan Kazmenko <gassa mail.ru> writes:
On Friday, 1 January 2016 at 12:29:01 UTC, anonymous wrote:
> On 30.12.2015 12:06, Ivan Kazmenko wrote:
>> As you can see, bmatch (usage discouraged in the docs) gives me the
>> result I want, but match (also discouraged) and matchAll (way to go)
>> don't.
>>
>> Am I misusing matchAll, or is this a bug?
>
> The `\1` there is a backreference. Backreferences are not part of
> regular expressions, in the sense that they allow you to describe more
> than regular languages. [1]
>
> As far as I know, bmatch uses a widespread matching mechanism, while
> match/matchAll use a different, less common one. It wouldn't surprise me
> if match/matchAll simply didn't support backreferences.
>
> Backreferences are not documented, as far as I can see, but they're
> working in other patterns. So, yeah, this is possibly a bug.
>
> [1] https://en.wikipedia.org/wiki/Regular_expression#Patterns_for_non-regular_languages

The overview by the module author
(http://dlang.org/regular-expression.html) does mention in the last
paragraph that backreferences are supported. Looks like it is a common
feature in other programming languages, too.

The "\1" part works correctly on inputs such as "abab", "abxab" and
"ababx", and does not match "abac". This means it is probably intended
to work, and handling "xabab" incorrectly is a bug.

Also, as I understand it from the docs, matchAll/matchFirst use the most
appropriate of match/bmatch internally, so if match does not properly
support the particular backreference but bmatch does, the bug is in
choosing the wrong one to handle the pattern.

At any rate, a wrong result with an 8-character pattern produces a
"regex don't work" impression, and I hope something can be done about
it.
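
To compare the two code paths on the inputs mentioned above, a small harness along these lines may help. It is only a sketch: the helper names backtrackingMatches and defaultMatches are made up, and the assumption that matchFirst picks the engine on its own is just my reading of the docs. I am deliberately not listing expected output for "xabab", since that depends on the Phobos version in use.

```d
import std.regex : bmatch, matchFirst;
import std.stdio : writeln;

enum pattern = r"(..).*\1";

// Hypothetical helper: forces the backtracking engine via bmatch.
bool backtrackingMatches(string input)
{
    return !bmatch(input, pattern).empty;
}

// Hypothetical helper: uses matchFirst, which (as I read the docs)
// selects the matching engine by itself.
bool defaultMatches(string input)
{
    return !matchFirst(input, pattern).empty;
}

void main()
{
    foreach (s; ["abab", "abxab", "ababx", "abac", "xabab"])
    {
        writeln(s, ": bmatch=", backtrackingMatches(s),
                   " matchFirst=", defaultMatches(s));
    }
}
```

Any input where the two columns disagree is a candidate for the bug report.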
Jan 02 2016