digitalmars.D - Empty subexpressions captures in std.regex

PC <petevik38 yahoo.com.au> writes:
Hi, I've been lurking in this group for a few months, have read
through TDPL (which is great, Andrei) and have started using D for
some small programs. So far it's been a joy to use (you may have a
C++ convert on your hands), and with the convenience of rdmd I've
been using it where I'd normally use a scripting language.

It's been pretty good for this, especially as Phobos has covered
almost everything I've wanted to do. I have run into some issues with
std.regex matching empty subexpressions though (dmd 2.047, win32):

    auto r1 = regex( "(a*)b" );
    auto m = match( "b", r1 );
    writefln( "captures = %s, empty = %s", m.captures.length, m.empty );

=> captures = 0, empty = true

If I disable the call to optimize, it gives the expected results:

=> captures = 2, empty = false

Also, with optimize disabled:

    auto r = regex("([^,]*),([^,]*),([^,]*)");
    m = match( ",,", r );
    writefln( "captures = %s, empty = %s", m.captures.length, m.empty );

=> captures = 3, empty = false

I noticed in Captures:

        @property size_t length()
        {
            foreach (i; 0 .. matches.length)
            {
                if (matches[i].startIdx >= input.length) return i;
            }
            return matches.length;
        }

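In this case matches[3].startIdx = 2 and matches[3].endIdx = 2, and
input.length is also 2 (the input is ",,"), so the loop returns 3
before counting the last, empty capture. Should this line be: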

     if (matches[i].startIdx > input.length) return i;
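
To make the off-by-one concrete, here is a small sketch of the check in isolation. The Slot struct and the countCaptures function are made up for illustration and are not Phobos code. Only the values for matches[3] come from the observation above; the other three slots are what I would assume for this pattern and input.

```d
import std.stdio;

// Hypothetical stand-in for one capture slot: [startIdx, endIdx) into the input.
struct Slot
{
    size_t startIdx;
    size_t endIdx;
}

// Counts slots the way the quoted Captures.length does (strict == false),
// or with the proposed `>` comparison (strict == true).
size_t countCaptures(const Slot[] matches, size_t inputLength, bool strict)
{
    foreach (i; 0 .. matches.length)
    {
        immutable past = strict
            ? matches[i].startIdx > inputLength
            : matches[i].startIdx >= inputLength;
        if (past) return i;
    }
    return matches.length;
}

void main()
{
    // Assumed slots for "([^,]*),([^,]*),([^,]*)" against ",,":
    // the whole match plus three empty groups at offsets 0, 1 and 2.
    Slot[] matches = [Slot(0, 2), Slot(0, 0), Slot(1, 1), Slot(2, 2)];
    immutable size_t inputLength = 2;

    writeln(countCaptures(matches, inputLength, false)); // 3
    writeln(countCaptures(matches, inputLength, true));  // 4
}
```

With `>=` the empty capture that starts exactly at the end of the input is dropped; with `>` it is counted. Whether `>` is safe for groups that did not participate in the match depends on how regex.d marks those slots, which I have not checked.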


Anyway, kudos to everyone involved with D. I'm certainly going to be
using it a lot in the future.
Jul 11 2010
Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Hi PC,


Thanks for your kind words.

Regarding regex, we need to get a report into bugzilla so we keep track 
of the problem. When you say "disable the call to optimize" are you 
referring to the -O compiler flag? In that case it's a compiler problem 
(otherwise it might be a library issue). Could you please clarify?


Thanks,

Andrei

Jul 11 2010
PC <petevik38 yahoo.com.au> writes:
Sorry about the lack of clarity in the last post. I actually
commented out the call to Regex.optimize in Regex.compile.

    auto r1 = regex( "(a*)b" );
    r1.printProgram();

Prints out:

printProgram()
  0: 	REtestbit 98, 13
 18: 	REparen len=15 n=0, pc=>42
 27: 	REnm  len=2, n=0, m=4294967295, pc=>42
 40: 	REchar 'a'
 42: 	REchar 'b'
 44: 	REend

With optimize(buf); commented out I get:

printProgram()
  0: 	REparen len=15 n=0, pc=>24
  9: 	REnm  len=2, n=0, m=4294967295, pc=>24
 22: 	REchar 'a'
 24: 	REchar 'b'
 26: 	REend

I don't understand why REtestbit is inserted at the start of the
program by the optimize routine, but with it in place the program
will not match when the input does not start with "a" (e.g. "b").
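
If REtestbit is a first-character filter (that is a guess on my part; I have not confirmed it in regex.d), then for "(a*)b" it has to accept both 'a' and 'b', because a* can match the empty string. The sketch below shows how such a set might be built for a flat sequence of single characters. Atom and firstChars are hypothetical names of my own, not anything from std.regex.

```d
import std.stdio;

// Hypothetical pattern element: one literal character, optionally starred.
struct Atom
{
    char c;
    bool optional; // true if the character is followed by '*'
}

// Collects every character that could begin a match: each leading
// optional atom contributes, up to and including the first required one.
bool[256] firstChars(const Atom[] atoms)
{
    bool[256] set;
    foreach (a; atoms)
    {
        set[a.c] = true;
        if (!a.optional) break;
    }
    return set;
}

void main()
{
    // "(a*)b" viewed as: optional 'a', then required 'b'.
    auto set = firstChars([Atom('a', true), Atom('b', false)]);
    writeln(set['a']); // true
    writeln(set['b']); // true
    writeln(set['c']); // false
}
```

A filter that recorded only 'a' here would reject the input "b" before the rest of the program ever ran, which would be consistent with the behaviour I am seeing.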

I think I need to spend some more time looking through the
regex.d source to understand it better.

- Pete


Jul 12 2010