digitalmars.D - std.regexp.split very slow - a bug?
- Marc Lohse (128/128) Feb 07 2007 hi again,
- Marc Lohse (42/200) Feb 07 2007 The same funny thing happens when using
- Frits van Bommel (7/26) Feb 07 2007 I wanted to post this the first time, but I didn't feel like trying out
hi again,

two days ago I posted that regular expressions were running very slowly; here is a more detailed description of the problem. It seems that the std.regexp.split function has a problem (or I am a moron and am using it the wrong way; as I mentioned before, I am a biologist and not a professional programmer, so I'd be happy about help if the second case is true).

What I want to do is read in a DNA sequence file that also contains information about the genes found in the raw DNA sequence. The file is in a commonly used format called GenBank, and the different data segments are separated by keywords, which should make it easy to dissect with std.regexp.split. The following code is just an example that tries to split a GenBank file at the "ORIGIN" keyword. The file is 323 KB in size, and if you want to reproduce my "experiment" you can obtain it here:

http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=nucleotide&val=76559634

    // regex.d
    import std.stdio;
    import std.regexp;
    import std.file;

    void main()
    {
        char[] gb_data;
        char[][] segments;

        // read the whole GenBank file into memory
        gb_data = cast(char[])read("/home/marc/Desktop/tobacco.gb");
        // split at the keyword, no regexp attributes
        segments = split(gb_data, "ORIGIN", "");
        writefln("seq segments: ", segments.length);
    }

The following happens when I run it:

    marc@marclinux:~/Desktop> time ./regex
    seq segments: 197549

    real 12m52.420s
    user 12m48.132s
    sys  0m2.812s

The execution takes ALMOST THIRTEEN MINUTES!! which made me fall off my chair. After climbing back onto it I tried the same in Perl:

    #regex.pl
    use strict;

    my $gb_data = "";
    my @segments;

    open FILE, "/home/marc/Desktop/tobacco.gb";
    while (<FILE>)
    {
        $gb_data .= $_;
    }
    close FILE;

    @segments = split /ORIGIN/, $gb_data;
    print "seq segment: ".length($segments[1])."\n"

output:

    marc@marclinux:~/Desktop> time ./regex.pl
    seq segment: 197549

    real 0m0.034s
    user 0m0.024s
    sys  0m0.012s

....well, it took 34ms to do the same thing.
I could not believe it and rewrote it using std.regexp.search instead of std.regexp.split:

    //regex2.d
    import std.stdio;
    import std.regexp;
    import std.file;

    void main()
    {
        char[] gb_data;

        gb_data = cast(char[])read("/home/marc/Desktop/tobacco.gb");
        // find the first match only; m.post is the text after it
        auto m = search(gb_data, "ORIGIN", "");
        writefln("seq segment: ", m.post.length);
    }

output:

    marc@marclinux:~/Desktop> time ./regex2
    seq segment: 197549

    real 0m0.025s
    user 0m0.024s
    sys  0m0.000s

AHA. So D is faster than Perl here: it took 25ms. But the split function is obviously *not suitable* for splitting a long text at a simple, single word (this does not even make use of complicated regular expression syntax). Becoming curious, I rewrote the thing again, now using std.string.find:

    //find.d
    import std.stdio;
    import std.string;
    import std.file;

    void main()
    {
        char[] gb_data, seq_segment, pattern;
        long pos;

        pattern = "ORIGIN";
        gb_data = cast(char[])read("/home/marc/Desktop/tobacco.gb");
        // index of the first occurrence of the keyword
        pos = find(gb_data, pattern);
        // everything after the keyword
        seq_segment = gb_data[(pos+pattern.length)..gb_data.length];
        writefln("seq segment: ",seq_segment.length);
        //writefln("SEQ segment", m.post);
    }

output:

    marc@marclinux:~/Desktop> time ./find
    seq segment: 197549

    real 0m0.005s
    user 0m0.000s
    sys  0m0.004s
    marc@marclinux:~/Desktop>

whoa! Now it only takes 5ms. So my problem seems to be solved: I will use either the search or the find variant. Interestingly, when splitting the same text at newlines, the execution takes only about 13ms. I have no idea why the split function behaves so differently, and this is also my question for the experts.

cheers
ml
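For readers trying this with a current D compiler, the find-based variant might look roughly like the sketch below. It rests on assumptions: it uses std.string.indexOf from today's Phobos rather than the 2007 std.string.find shown in the post, and the helper name afterKeyword and the miniature record are made up for illustration. Unlike find.d, it also checks for the case where the keyword is missing.

```d
import std.stdio : writeln;
import std.string : indexOf;

// Hypothetical helper: returns the part of `text` that follows the first
// occurrence of `keyword`, or null when the keyword does not occur.
// The result is a slice of `text`, so nothing is copied.
string afterKeyword(string text, string keyword)
{
    auto pos = text.indexOf(keyword);   // -1 when the keyword is absent
    if (pos < 0)
        return null;
    return text[pos + keyword.length .. $];
}

void main()
{
    // Made-up miniature record standing in for the GenBank file.
    string record = "LOCUS  demo\nORIGIN\n        1 acgt\n//\n";
    writeln("seq segment: ", afterKeyword(record, "ORIGIN").length);
}
```

Because the result is only a slice into the original buffer, the cost is essentially that of one substring search, which fits the 5ms measurement reported above.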
Feb 07 2007
The same funny thing happens when using std.regexp.sub. In the following line I want to remove all non-DNA characters from the sequence segment that was read in, using sub:

    stripped_sequence = sub(seq_segment, "[0-9\n\t/ ]", "", "g");

output:

    time ./bio_test

    real 0m17.154s
    user 0m16.737s
    sys  0m0.032s

Again this expression takes unexpectedly long to execute: about 17s on my Pentium M 1.8 GHz.

When I reformulate the task to avoid regular expressions:

    char[] clean_seq = "";
    foreach (char N; stripped_sequence)
    {
        // skip digits, whitespace and the '/' of the record terminator
        if ((N == '0') || (N == '1') || (N == '2') || (N == '3') ||
            (N == '4') || (N == '5') || (N == '6') || (N == '7') ||
            (N == '8') || (N == '9') || (N == ' ') || (N == '\n') ||
            (N == '\t') || (N == '/')) continue;
        clean_seq ~= N;
    }

it looks (and is) very ugly, but it runs faster:

output:

    time ./bio_test

    real 0m0.413s
    user 0m0.040s
    sys  0m0.004s

Note that the actual computation time is only about 40ms (the real time is longer because the sequence and other info are printed to STDOUT).

Again my question: what's wrong here? I used regexp.sub exactly the way it is used in public example code snippets. Have other people also had these problems, or am I the first to use D's regular expressions on longer text strings? (Although I wouldn't think that 323 KB of text is really long.)

Any help, suggestions or comments would be extremely welcome!
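The hand-written filter above can be tidied up a little. The following is a minimal sketch for a current D compiler, not the code from the post: the names isNoise and stripNonSequence are invented for illustration, and std.array.appender from today's Phobos is assumed in place of repeated `~=` appends. It drops the same characters as the pattern "[0-9\n\t/ ]".

```d
import std.array : appender;
import std.stdio : writeln;

// True for the characters the post wants removed:
// digits, space, newline, tab and '/'.
bool isNoise(char c)
{
    return (c >= '0' && c <= '9') || c == ' ' || c == '\n'
        || c == '\t' || c == '/';
}

// Hypothetical helper: copies every non-noise character into a new string.
string stripNonSequence(const(char)[] raw)
{
    auto cleaned = appender!string();
    cleaned.reserve(raw.length);    // the result is never longer than the input
    foreach (char c; raw)
    {
        if (!isNoise(c))
            cleaned.put(c);
    }
    return cleaned.data;
}

void main()
{
    // Made-up fragment in the style of a GenBank ORIGIN block.
    writeln(stripNonSequence("        1 acgtacgt ttgacc\n       61 ggat\n//\n"));
}
```

This is a single pass over the input with one up-front reservation, so the running time should grow linearly with the length of the sequence.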
Feb 07 2007
Marc Lohse wrote:
> Becoming curious i rewrote the thing again, now using std.string.find:
[snip]
> output:
> marc@marclinux:~/Desktop> time ./find
> seq segment: 197549
> real 0m0.005s
> user 0m0.000s
> sys 0m0.004s
> marc@marclinux:~/Desktop>
> whoa! Now it only takes 5ms. So my problem seems to be solved - i will use either the search or the find variant.

I wanted to post this the first time, but I didn't feel like trying out the difference: std.string *also* has a 'split' function. I just tried it after reading the first part of your post and it took 4 milliseconds on my computer (while the regexp version is still running).

> Interestingly, when splitting the same text at newlines the execution just takes about 13ms. I have no idea why the split function behaves so differently and this is also my question for the experts.

Yes, that's a bit weird.
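A plain substring split of the kind suggested here might look as follows with a current D compiler. This is only a sketch under assumptions: it uses std.array.split from today's Phobos, which may differ from the 2007 std.string version tried in the reply, and the wrapper name splitAtKeyword and the sample record are hypothetical.

```d
import std.array : split;
import std.stdio : writeln;

// Hypothetical wrapper: splits `text` at every literal occurrence of
// `keyword`. No regular-expression engine is involved.
string[] splitAtKeyword(string text, string keyword)
{
    return text.split(keyword);
}

void main()
{
    // Made-up miniature record standing in for the GenBank file.
    string record = "LOCUS  demo\nFEATURES  gene\nORIGIN\n        1 acgt\n//\n";
    auto segments = splitAtKeyword(record, "ORIGIN");
    writeln("segments: ", segments.length);
    writeln("seq segment: ", segments[$ - 1].length);
}
```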
Feb 07 2007