
digitalmars.D - Bad performance of simple regular expression - why??

reply MarcL <lohse mpimp-golm.mpg.de> writes:
hi everyone,

First of all I want to say that I'm not a professional programmer, so the
problem I have might be caused by my own lack of experience. Nevertheless I
want to describe it, hoping that someone in this group might be able to tell
me what's wrong. I am a molecular biologist and I often have to deal with
larger amounts of DNA and protein sequence data (which is in principle text).
I mainly use Perl to process these DNA files, and Perl generally performs
very well (regular expressions are actually the killer tool for working with
DNA sequences). Unfortunately, not everything in Perl is as fast as the
regular expressions, so I started trying to learn a language that compiles to
fast executables: C++. I went crazy because everything is so complicated; all
the ugly details that Perl takes care of for the user have to be organized
manually, and that really gave me the creeps. Then I learned about D, and it
sounded like the solution to my problem: a compiled language that supports
associative arrays, garbage collection and (most importantly for me) regular
expressions! Great! I experimented a bit and actually managed to write a
small working program right away. I was delighted!

But now comes the reason why I write all this. Being enthusiastic about this
nice new language, I started to write a module implementing basic functions
for working with the most common DNA sequence file formats. To parse these
files I planned to use regular expressions. So far so good. When testing my
module with a small DNA file everything seemed OK. Then I tried to use it to
parse a more real-world-sized DNA file (~155,000 characters of DNA sequence
plus about the same amount of textual information) and found that a single
std.regexp.split call took about 59 seconds!!! I could not believe it, so I
wrote a little Perl script doing the same thing, and it took less than 1s!!
What's wrong here? This can't really be true, can it? Is the implementation
of regular expressions in the Phobos library so preliminary that it performs
so much worse than the Perl regex engine? As it stands it's not usable for me
(which is a sad thing, because I really like the other features of D and
would like to use them). Am I making mistakes, or do I simply have to wait
for a better version of Phobos?

Any comments or suggestions would be great.

cheers
Feb 05 2007
next sibling parent reply "Bill Lear" <rael see.sig.com> writes:
MarcL <lohse mpimp-golm.mpg.de> writes:

 .... Am i making mistakes or do i simply have to wait for a better
 version of phobos?
 
 Any comments or suggestions would be great. 
Please post a representative sample so we can help you, and do please
try to post lines < 80 characters long, if possible.

Bill
Feb 05 2007
parent MarcL <lohse mpimp-golm.mpg.de> writes:
Bill Lear Wrote:

 MarcL <lohse mpimp-golm.mpg.de> writes:
 
 .... Am i making mistakes or do i simply have to wait for a better
 version of phobos?
 
 Any comments or suggestions would be great. 
 Please post a representative sample so we can help you, and do please
 try to post lines < 80 characters long, if possible.

 Bill
First of all, thanks a lot to everyone who answered my posting; I did not
expect to receive feedback so quickly. This is the piece of code that runs so
slowly:

<code>
char[] raw_sequence, stripped_sequence, header_segment, feature_segment;
char[][] gb_segments, seq_segments;

raw_sequence = cast(char[])std.file.read("/Users/marc/Desktop/sequences.gb");
gb_segments = std.regexp.split(raw_sequence, "FEATURES", "");
writefln("split into ", gb_segments.length, " segments");
seq_segments = std.regexp.split(gb_segments[1], "ORIGIN", "");
header_segment = gb_segments[0];
feature_segment = std.regexp.sub(seq_segments[0],
    "^.*location/qualifiers\n", "", "i");
stripped_sequence = std.string.toupper(
    std.regexp.sub(seq_segments[1], "[0-9\t \n/]", "", "g"));
</code>

I did not precompile the regular expressions, so maybe that is one of the
reasons it's slow, but the code doesn't contain any loops, so each expression
is only used once. If anyone wants to try this on his own machine, I attached
the "sequences.gb" file. I have tested the code on a PowerBook (G4, 1.5 GHz)
using gdc and also on a Linux machine (1.8 GHz Pentium M) using the original
dmd compiler, but the results were the same.
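For comparison, here is the same pipeline in Python (just a sketch on a
made-up miniature record, since the real file is far too large to inline):

```python
import re

# A made-up miniature GenBank-style record; the real file is ~300 KB.
raw_sequence = (
    "LOCUS demo\n"
    "FEATURES             Location/Qualifiers\n"
    "     gene            1..12\n"
    "ORIGIN\n"
    "        1 atgcat gcatgc\n"
    "//\n"
)

# Mirror the D calls above: split on the section markers...
gb_segments = re.split("FEATURES", raw_sequence)
header_segment = gb_segments[0]
seq_segments = re.split("ORIGIN", gb_segments[1])

# ...drop everything up to the "Location/Qualifiers" line (case-insensitive)...
feature_segment = re.sub("(?i)^.*location/qualifiers\n", "", seq_segments[0])

# ...then strip digits, whitespace and slashes and uppercase the bare sequence.
stripped_sequence = re.sub("[0-9\t \n/]", "", seq_segments[1]).upper()

print(stripped_sequence)  # ATGCATGCATGC
```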
Feb 05 2007
prev sibling parent reply Bill Baxter <dnewsgroup billbaxter.com> writes:
MarcL wrote:
 [...] had to find out that a simple std.regexp.split call took about
 59 seconds!!! [...] Am i making mistakes or do i simply have to wait
 for a better version of phobos?
Like Bill said, a sample would be helpful.

The most common mistake is not precompiling the regexp. For instance, calling
std.regexp.search repeatedly with the same pattern string is not optimal. If
you're going to be using the same pattern multiple times, you should create a
RegExp object once, and then apply it multiple times using its search, match,
etc. methods.

Regardless, Perl was basically created as a convenient way to use regular
expressions, so its implementation could very likely be more efficient than
D's.

Do Perl regexps handle unicode/utf8 properly these days? (Actually, do D's,
for that matter?)

--bb
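The same idea in Python terms (a sketch only; note that Python's re module
caches compiled patterns internally, so the effect there is much smaller than
with a regex implementation that recompiles on every call):

```python
import re

lines = ["atg gca tta", "not a codon line", "ccg gat"] * 1000

# Passing the pattern string on every call means it may be recompiled
# each time (modulo the module's internal cache):
hits_slow = [ln for ln in lines if re.search("[atgc]{3} [atgc]{3}", ln)]

# Compiling once and reusing the object, the analogue of creating a
# RegExp instance and calling its search/match methods:
codon_pair = re.compile("[atgc]{3} [atgc]{3}")
hits_fast = [ln for ln in lines if codon_pair.search(ln)]

assert hits_slow == hits_fast
print(len(hits_fast))  # 2000
```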
Feb 05 2007
next sibling parent reply Tomas Lindquist Olsen <tomas famolsen.dk> writes:
On Tue, 06 Feb 2007 08:47:53 +0900, Bill Baxter wrote:
 Regardless, perl was basically created as a convenient way to use regular
 expressions, so it's implementation could very likely be more efficient
 than D's.
I stumbled across this link last night. It says something slightly different...
Feb 05 2007
parent reply Bill Baxter <dnewsgroup billbaxter.com> writes:
Tomas Lindquist Olsen wrote:
 On Tue, 06 Feb 2007 08:47:53 +0900, Bill Baxter wrote:
 Regardless, perl was basically created as a convenient way to use regular
 expressions, so it's implementation could very likely be more efficient
 than D's.
I stumbled across this link last night. It says something slightly different...
I'm all ears...
Feb 05 2007
parent reply Tomas Lindquist Olsen <tomas famolsen.dk> writes:
On Tue, 06 Feb 2007 10:35:06 +0900, Bill Baxter wrote:

 Tomas Lindquist Olsen wrote:
 On Tue, 06 Feb 2007 08:47:53 +0900, Bill Baxter wrote:
 Regardless, perl was basically created as a convenient way to use
 regular expressions, so it's implementation could very likely be more
 efficient than D's.
I stumbled across this link last night. It says something slightly different...
I'm all ears...
Oops... Here you go: http://swtch.com/~rsc/regexp/regexp1.html
Feb 05 2007
next sibling parent reply Walter Bright <newshound digitalmars.com> writes:
Tomas Lindquist Olsen wrote:
 http://swtch.com/~rsc/regexp/regexp1.html
That looks pretty cool. Anyone care to take a stab at implementing this in
std.regexp? It could be done by examining the generated bytecode and
attempting to convert it to an NFA. If that succeeds, great: use the NFA. If
it fails, fall back to the original method.
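A rough sketch of the dispatch idea in Python. For simplicity this inspects
the pattern text rather than compiled bytecode, and the feature test is
illustrative, deliberately incomplete:

```python
import re

# Features a plain NFA cannot express: backreferences and lookaround.
# (Illustrative only; a real implementation would inspect the compiled
# program, not the pattern source.)
BACKTRACK_ONLY = re.compile(r"\\[1-9]|\(\?<?[=!]")

def choose_engine(pattern):
    """Pick the NFA engine unless the pattern needs backtracking."""
    return "backtracking" if BACKTRACK_ONLY.search(pattern) else "nfa"

assert choose_engine(r"(cat|car)") == "nfa"
assert choose_engine(r"(\w+) \1") == "backtracking"    # backreference
assert choose_engine(r"foo(?=bar)") == "backtracking"  # lookahead
```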
Feb 05 2007
next sibling parent reply Manfred Nowak <svv1999 hotmail.com> writes:
Walter Bright <newshound digitalmars.com> wrote

 If it fails, then fallback to the original method.
It is a ridiculous business to incorporate additional algorithms because a
current implementation is broken.

The well-known chain of converting RE -> NFA -> DFA -> minimal DFA should
handle that problem fast, because the minimal DFA for an RE of the form
((a?)^n a^n) for fixed n is a chain of 2n+1 nodes, which corresponds to the
longest acceptable input plus the final state. In this chain the first n
nodes are non-final; all others are final. The first node is the start node.
All nodes have transitions to the next node under "a".

That is, recognition time is linear in the length of the input: no
exponential behaviour.

-manfred
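As a sketch, that chain DFA can be written out in a few lines of Python (with
a small n for illustration):

```python
# Minimal DFA for (a?)^n a^n with n = 3: states 0..2n, where state i
# means "i a's seen so far"; states n..2n are the final (accepting) ones.
n = 3

def matches(s):
    state = 0
    for ch in s:
        if ch != "a" or state == 2 * n:  # no outgoing transition: reject
            return False
        state += 1                       # the single transition under "a"
    return state >= n                    # accept in states n..2n

# One pass over the input, so recognition time is linear in len(s).
assert not matches("aa")        # fewer than n a's
assert matches("aaa")           # exactly n
assert matches("a" * 6)         # up to 2n
assert not matches("a" * 7)     # more than 2n
```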
Feb 06 2007
parent reply Walter Bright <newshound digitalmars.com> writes:
Manfred Nowak wrote:
 Walter Bright <newshound digitalmars.com> wrote
 
 If it fails, then fallback to the original method.
My understanding is that backtracking is difficult to do with the DFA.
Feb 06 2007
parent reply Manfred Nowak <svv1999 hotmail.com> writes:
Walter Bright <newshound digitalmars.com> wrote

 My understanding is that backtracking is difficult to do with the
 DFA.
On converting _one_ regular expression to a DFA there is never a need to
backtrack. The example considered in the article is roughly equivalent to
a?^n a^n. No backtracking is needed, and the normal tool chain handles this.

Also for maximal munching with several regular expressions there is no need
for backtracking.

-manfred
Feb 06 2007
parent reply Jascha Wetzel <"[firstname]" mainia.de> writes:
Manfred Nowak wrote:
 Also for maximal munching several regular expression there is no need 
 for backtracking. 
Maximal munch, non-greedy matching and negative matches would usually use
backtracking. Doing it without backtracking would require using several DFAs,
or is there a simpler method?

AFAIK, the "school-book way" to implement backtracking is deterministic
pushdown automata (PDAs), which can also be interpreted efficiently or
compiled to bytecode. So I don't see why there shouldn't be backtracking in
the regexp implementation.
Feb 08 2007
parent reply Benji Smith <dlanguage benjismith.net> writes:
Jascha Wetzel wrote:
 Manfred Nowak wrote:
 Also for maximal munching several regular expression there is no need 
 for backtracking. 
maximal munch, non-greedy matching and negative matches would usually use backtracking. doing it without backtracking would require to use several DFAs - or is there a simpler method? afaik, the "school-book-way" for backtracking are deterministic pushdown automatons (PDAs) that can also be interpreted efficiently or compiled to bytecode. so i don't see why there shouldn't be backtracking in the regexp implementation.
The problem is that most regex engines (in fact, all of them, as far as I
know) use backtracking even when a DFA implementation could be used to
determine (much earlier in the search process) that a match is impossible.

A backtracking approach might be necessary with positive and negative
lookaround assertions (though I still suspect that a DFA could be used even
for those cases), but every mainstream regex engine uses a backtracking
approach even for simple choice situations like (cat|car) where a DFA would
be much more appropriate.

And, by the way, it's possible to create a DFA implementation that degrades
into a backtracking matcher for certain subexpressions.

--benji
Feb 09 2007
parent "Joel C. Salomon" <JoelCSalomon Gmail.com> writes:
Benji Smith wrote:
 The problem is that most regex engines (in fact, all of them, as far as 
 I know) use backtracking even when a DFA implementation could be used to 
 determine (much earlier in the search process) that a match is impossible.
This has been linked to in this thread, but Ken Thompson’s algorithm does /not/ use backtracking, and there are open-source implementations around; see <http://www.swtch.com/~rsc/regexp/regexp1.html>. --Joel
Feb 09 2007
prev sibling parent Benji Smith <dlanguage benjismith.net> writes:
Walter Bright wrote:
 Tomas Lindquist Olsen wrote:
 http://swtch.com/~rsc/regexp/regexp1.html
That looks pretty cool. Anyone care to take a stab at implementing this in std.regexp? It could be done by examining the generated bytecode and attempting to convert it to an NFA. If that succeeds, then great, do the NFA. If it fails, then fallback to the original method.
A little while back, at my job, I implemented a regex parser (in Java) that
builds an NFA and then transforms it into a DFA, re-serializing it into a
regex for consumption by any old regex engine.

For example, this regex assumes NFA performance characteristics when compiled
by the (Java, Perl, Python, etc.) regex engine:

(?:cat|car|cut|cute|cumbersome)

This regex is semantically identical (meaning that it will produce exactly
the same set of matches as the simpler expression), but it assumes DFA
performance characteristics in the regex engine:

c(?:a[tr]|u(?:te?|mbersome))

Since I needed my regular expressions to execute in an ordinary regex engine,
I couldn't tweak the engine internals, but I could trick the regex engine
into behaving like a DFA.

The major caveat to this approach is that parenthetical groups get jumbled.
If you use parentheses to capture text (rather than just to match it), the
numbering of your backreferences is going to get all screwed up. When I use
this approach, I almost always use noncapturing parentheses (?: ... ) rather
than ordinary parens. However, if you had control of the regex engine itself
(rather than just feeding tweaked expressions into an ordinary engine), you
could easily implement DFA behavior characteristics without changing the
capturing semantics.

I'd implement this myself for D, but I'm afraid my NDA with my employer would
probably prevent me from doing so. The theory is straightforward, though, so
it should be do-able by anyone with the appropriate RE experience.

--Benji Smith
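The prefix-factoring step can be sketched in Python with a throwaway trie
(trie_regex is a made-up name for illustration; the actual Java
implementation described above is not shown here):

```python
import re

def trie_regex(words):
    """Build a trie of the alternatives, then serialize it back into a
    prefix-factored (noncapturing) regex."""
    trie = {}
    for w in words:
        node = trie
        for ch in w:
            node = node.setdefault(ch, {})
        node[""] = {}  # end-of-word marker

    def ser(node):
        parts, opt = [], False
        for ch, child in sorted(node.items()):
            if ch == "":
                opt = True  # a word may end at this node
            else:
                parts.append(re.escape(ch) + ser(child))
        if not parts:
            return ""
        if len(parts) == 1 and not opt:
            return parts[0]
        return "(?:" + "|".join(parts) + ")" + ("?" if opt else "")

    return ser(trie)

pattern = trie_regex(["cat", "car", "cut", "cute", "cumbersome"])
# The result shares the common prefixes, starting with a single "c".
assert re.fullmatch(pattern, "cute")
assert re.fullmatch(pattern, "cumbersome")
assert not re.fullmatch(pattern, "cab")
```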
Feb 07 2007
prev sibling parent "Joel C. Salomon" <JoelCSalomon Gmail.com> writes:
Tomas Lindquist Olsen wrote:
 I stumbled across this link last night. It says something slightly different...
 
 http://swtch.com/~rsc/regexp/regexp1.html
The regexp library isn’t the only really useful bit of software packed into Plan 9. I’m implementing a compiler for the first time (C-ish, not D) under Plan 9, and it’s a constant temptation to add “just one more feature” because there’s a library to support it. --Joel
Feb 05 2007
prev sibling parent MarcL <lohse mpimp-golm.mpg.de> writes:
 
 
 Like Bill said, a sample would be helpful. [...]

 --bb
Here's a link to the sequences.gb file (attaching it didn't work):

http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=nucleotide&val=76559634
Feb 05 2007