digitalmars.D - Regex DNA benchmark
- bearophile (120/120) Dec 01 2007 (The following notes are relative to DMD V.1.024).
- Jascha Wetzel (6/7) Dec 01 2007 i have implemented a regex engine that is supposed to be categorically
- bearophile (8/11) Dec 01 2007 - Regarding the Shootout I don't know how much useful it can be, because...
- Jascha Wetzel (12/17) Dec 01 2007 unfortunately, the data structures for a DFA based regex engine are too
- bearophile (5/8) Dec 01 2007 I think D needs into its std lib a RE engine as flexible and as fast as ...
- Jascha Wetzel (9/11) Dec 01 2007 i don't know what exactly most people need. i'm just pretty sure they
- Don Clugston (3/14) Dec 02 2007 Actually std.regexp does support non-greedy matching, it's just undocume...
- Robert Fraser (4/11) Dec 01 2007 TCL's is a combination DFA/NFA (it uses an NFA engine where possible,
(The following notes are relative to DMD V.1.024). The Shooutout benchmarks are very interesting, they offer many ways to learn about the languages, how to write fast code, ecc. One of the many good sides of those benchmarks is that they compare different languages, so you can quickly see when/where your language is "abnormally" slow, helping you to improve that too. This time I have tried the Regex DNA benchmark, it reads a simulated FASTA file, and processes it with REs. Despite being a toy, this program is related to real code bioinformatics do often. On the Shootout site there are two D versions, one of them uses std.regexp, the other PCRE. Even if the PCRE-based test is slower than the Python/Perl versions (but they use the trick of splitting the RE, so they can't be fully compared), it shows D will need a much much faster std.regexp (10 seconds with PCRE, versus 117 seconds with std.regexp). I have tried to improve the std.regexp-based version, because I like D for allowing me to do medium-level programming. To create the D version I have followed the Python version: when the std library of D is well done, then the D sourcecode for this benchmark can be as short as the Python one. This gives what the D blurb offers: >D is a systems programming language. Its focus is on combining the power and high performance of C and C++ with the programmer productivity of modern languages like Ruby and Python.< This is the D code (only 2-space indent), I have left some comments that show some D problems: import std.stdio, std.string, std.cstream; import std.regexp: RegExp, search, resplit = split; void main() { string[] sseq; int n; char[1 << 15] cbuf; // auto seq = din.toString(); // SLOW while ((n = din.readBlock(cbuf.ptr, cbuf.length)) > 0) // sseq ~= cbuf[0 .. n][]; // slow sseq ~= cbuf[0 .. n].dup; auto seq = sseq.join(""); auto ilen = seq.length; //seq = sub(seq, ">.*\n|\n", "", "g"); // SLOW!! seq = resplit(seq, ">.*\n|\n").join(""); int clen = seq.length; foreach(p; split("agggtaaa|tttaccct [cgt]gggtaaa|tttaccc[acg] a[act]ggtaaa|tttacc[agt]t ag[act]gtaaa|tttac[agt]ct agg[act]taaa|ttta[agt]cct aggg[acg]aaa|ttt[cgt]ccct agggt[cgt]aa|tt[acg]accct agggta[cgt]a|t[acg]taccct agggtaa[cgt]|[acg]ttaccct")) { int m = 0; foreach(_; RegExp(p).search(seq)) m++; writefln(p, ' ', m); } foreach(el; split("B(c|g|t) D(a|g|t) H(a|c|t) K(g|t) M(a|c) N(a|c|g|t) R(a|g) S(c|g) V(a|c|g) W(a|t) Y(c|t)")) // seq = RegExp(el[0..1], "g").replace(seq, el[1..$]); // Slow seq = (new RegExp(el[0..1], "g")).replace(seq, el[1..$]); writefln("\n", ilen, "\n", clen, "\n", seq.length); } Note: if you don't belive my tests and my words you can try replacing the fast things with the slow commented-out ones. (The input data for this program is a FASTA file, you can generate it with the fasta benchmark on the Shootout site itself). The din.toString() is very slow; in the code I have replaced it with that readBlock+join. If Walter wants to make din.toString() faster he may use something like: string readAllStdin() { string[] seq; size_t len, n, len_tot; char[1 << 15] cbuf; std.gc.disable; // can be nested seq.length = 1; // Initial guess while ((n = din.readBlock(cbuf.ptr, cbuf.length)) > 0) { if (len == seq.length) seq.length = seq.length * 2; seq[len] = cbuf[0 .. n].dup; len++; len_tot += n; } // Join seq string result; if (len_tot) { size_t pos; result.length = len_tot; foreach (word; seq) { result[pos .. pos + word.length] = word; pos += word.length; } } std.gc.enable; return result; } somestring.join("") when the joining sting is empty can become something faster like: TyElem[] join(TyElem)(TyElem[][] words) { TyElem[] result; if (words.length) { size_t len, pos; foreach(word; words) len += word.length; result.length = len; foreach (word; words) { result[pos .. pos + word.length] = word; pos += word.length; } } return result; } From my tests it seems that: sseq ~= cbuf[0 .. n][]; is slower than: sseq ~= cbuf[0 .. n].dup; This is probably a bug of std.regexp, the sub() seems O(n^2): seq = sub(seq, ">.*\n|\n", "", "g"); So I have used: seq = resplit(seq, ">.*\n|\n").join(""); I have also used: seq = (new RegExp(el[0..1], "g")).replace(seq, el[1..$]); because this seems a bit slower: // seq = RegExp(el[0..1], "g").replace(seq, el[1..$]); This is a Python version (as the fast Perl version it splits some REs to speed up, I haven't done it in the D code because it slows down the D code): import re, sys seq = sys.stdin.read() ilen = len(seq) seq = re.sub('>.*', '', seq) seq = re.sub('\n', '', seq) clen = len(seq) for f in ('agggtaaa|tttaccct', '[cgt]gggtaaa|tttaccc[acg]', 'a[act]ggtaaa|tttacc[agt]t', 'ag[act]gtaaa|tttac[agt]ct', 'agg[act]taaa|ttta[agt]cct', 'aggg[acg]aaa|ttt[cgt]ccct', 'agggt[cgt]aa|tt[acg]accct', 'agggta[cgt]a|t[acg]taccct', 'agggtaa[cgt]|[acg]ttaccct'): s, r = f.split('|') print f, len(re.findall(s, seq) + re.findall(r, seq)) sbs = "B(c|g|t) D(a|g|t) H(a|c|t) K(g|t) M(a|c) N(a|c|g|t) R(a|g) S(c|g) V(a|c|g) W(a|t) Y(c|t)" for el in sbs.split(): seq = re.sub(el[0], el[1:], seq) print "\n%d\n%d\n%d" % (ilen, clen, len(seq)) The findall() is quite useful, and other tidbits show that the API of std.regexp (despite being good already) may be improved still a bit. Bye, bearophile
Dec 01 2007
bearophile wrote:On the Shootout site there are two D versions, one of them uses std.regexp, the other PCRE. Even if the PCRE-based test is slower than the Python/Perl versions (but they use the trick of splitting the RE, so they can't be fully compared), it shows D will need a much much faster std.regexp (10 seconds with PCRE, versus 117 seconds with std.regexp).i have implemented a regex engine that is supposed to be categorically faster than the one in Phobos (it's part of the apaged package). i will do some polishing soon, in order to make it useful as a library and do some benchmarking. i'm pretty confident that it can improve on that shootout-benchmark, not only because it can compile regexps to D code.
Dec 01 2007
Jascha Wetzel:i will do some polishing soon, in order to make it useful as a library and do some benchmarking. i'm pretty confident that it can improve on that shootout-benchmark, not only because it can compile regexps to D code.- Regarding the Shootout I don't know how much useful it can be, because that site accepes only a really limited number of external libs, and generally accepts only a single sourcecode (maybe your code spits out D code that can be inserted in a single D sourcecode, but I don't know if the Shootout site accepts that). - Regarding the compilation to D code, does it work only at compile time? Or does it call DMD at runtime? - Regarding the generation of normal code, that later can be compiled, time ago I have found this, to use REs in the lwc compiler: http://students.ceid.upatras.gr/~sxanth/lwc/ex15.html - There are TONS of open source RE engines around (the RE of TCL is very fast). I presume we can find one of them with a license fit for D, so we can just copy it into the std lib of D without writing yet another (slow, buggy, limited, etc) one. Bye, bearophile
Dec 01 2007
bearophile wrote:- Regarding the Shootout I don't know how much useful it can be, because that site accepes only a really limited number of external libs, and generally accepts only a single sourcecode (maybe your code spits out D code that can be inserted in a single D sourcecode, but I don't know if the Shootout site accepts that). - Regarding the compilation to D code, does it work only at compile time? Or does it call DMD at runtime?unfortunately, the data structures for a DFA based regex engine are too complex to (realistically) implement it with compile-time means. it compile the regex to a DFA and interprets it at runtime, or generates D code from it, that can be compiled along with the rest of the program.- Regarding the generation of normal code, that later can be compiled, time ago I have found this, to use REs in the lwc compiler: http://students.ceid.upatras.gr/~sxanth/lwc/ex15.html - There are TONS of open source RE engines around (the RE of TCL is very fast). I presume we can find one of them with a license fit for D, so we can just copy it into the std lib of D without writing yet another (slow, buggy, limited, etc) one.most regex engines are more general parsers that use backtracking and are slow by design (that includes Perl's and Java's engine, PCRE, and many more). that's why i bothered to implement yet another engine. my engine uses the same approach as TRE, one of the very few engines out there that don't do backtracking (i think TCL's is another one). here are details on these issues: http://swtch.com/~rsc/regexp/regexp1.html
Dec 01 2007
Jascha Wetzel:most regex engines are more general parsers that use backtracking and are slow by design (that includes Perl's and Java's engine, PCRE, and many more). that's why i bothered to implement yet another engine.I think D needs into its std lib a RE engine as flexible and as fast as Python/Perl/Ruby/etc ones. (They use an "almost" standard syntax, you can do mostly the same things with them. People use the Perl REs every day). Then later people can add any kind of faster & more specialized RE engines (like your one). Don't you agree? :-) Bye, bearophile
Dec 01 2007
bearophile wrote:I think D needs into its std lib a RE engine as flexible and as fast as Python/Perl/Ruby/etc ones. (They use an "almost" standard syntax, you can do mostly the same things with them. People use the Perl REs every day). Then later people can add any kind of faster & more specialized RE engines (like your one). Don't you agree? :-)i don't know what exactly most people need. i'm just pretty sure they need a faster implementation than the current one. std.regexp also lacks a few fundamental features (like non-greedy matching). syntax-wise i certainly agree, that one should stick to either the unix or perl syntax. i don't propagate my implementation as a replacement for std.regexp, i just wanted to mention it for people who need fast regexps with most standard features.
Dec 01 2007
Jascha Wetzel wrote:bearophile wrote:Actually std.regexp does support non-greedy matching, it's just undocumented. <g>I think D needs into its std lib a RE engine as flexible and as fast as Python/Perl/Ruby/etc ones. (They use an "almost" standard syntax, you can do mostly the same things with them. People use the Perl REs every day). Then later people can add any kind of faster & more specialized RE engines (like your one). Don't you agree? :-)i don't know what exactly most people need. i'm just pretty sure they need a faster implementation than the current one. std.regexp also lacks a few fundamental features (like non-greedy matching).
Dec 02 2007
Jascha Wetzel wrote:most regex engines are more general parsers that use backtracking and are slow by design (that includes Perl's and Java's engine, PCRE, and many more). that's why i bothered to implement yet another engine. my engine uses the same approach as TRE, one of the very few engines out there that don't do backtracking (i think TCL's is another one). here are details on these issues: http://swtch.com/~rsc/regexp/regexp1.htmlTCL's is a combination DFA/NFA (it uses an NFA engine where possible, and defaults to a DFA for constructs that can't be correctly process in an NFA).
Dec 01 2007