digitalmars.D - Perfect hashing for string switch
- bearophile (19/19) Jan 22 2010 This is just a possible optimization, it's not a new feature request.
- BCS (2/6) Jan 22 2010 Have you compared it to a decisition tree or lex style state mechine?
- bearophile (5/6) Jan 22 2010 I have not, I am sorry :-) But more comparisons can be added.
- BCS (10/22) Jan 22 2010 I'm not sure how I'd set it up but I expect it would amount to a hard co...
- strtr (2/2) Jan 22 2010 bearophile Wrote: <>
- bearophile (4/5) Jan 22 2010 AA keys can be all kind of things, while here I was talking just about p...
- bearophile (15/16) Jan 27 2010 I have now implemented that too, it was not an immediate thing to do (I ...
- Matti Niemenmaa (7/21) Jan 27 2010 Your test6 is invalid: it reads beyond the bounds of the given array.
- BCS (21/49) Jan 27 2010 A simple fix to that is to make he first transition of the state machine...
- bearophile (169/173) Jan 27 2010 Shame on me, I am sorry -.- Thank you. I will try to improve the code la...
- BCS (23/36) Jan 28 2010 Things I'd use in place of that:
- bearophile (8/25) Jan 29 2010 When in not-debug mode that struct of mine is just 1 word long, so it ho...
- BCS (15/38) Jan 29 2010 Almost all the cases I can think of fall into one of a few categories:
This is just a possible optimization, it's not a new feature request. A perfect hash computed at compile time can be good to implement the string switches, I have created a small benchmark: http://codepad.org/YVFfGpsM Timings, ldc, seconds: test1: 4.48 // normal switch test2: 2.98 // perfect hash test3: 2.09 test4: 2.07 test5: 5.44 // AA Notice the very nice thing that I have computed the hash at compile time in test2 switch cases, you can't do that in a simple way in C :-) You can't compute the coefficient tables of the perfect hash at compile-time in D because probably this is currently too much slow to do, unless D compilers become staged compilers and compile the code they have to run at run time too :o) (Well, an intermediate and less intrusive solution is to use LLVM to Just-In-time compile the code that LDC runs at compile time, to speed that up.) A possible problem is that the compilation time grows if the number of strings is high. The version based on the perfect hash (test2) becomes more efficient if the number of wrong strings gets lower. In Tango the AA opIn_r is really slow, I think it can be a bug. I have not found a Tango dev interested in this so far. Note that the perfect hash I have used in this little example is minimal, so there's a better way for the compiler to implement that conversion table (as I have shown in test4 that assumes a minimal perfect hash. It's about as fast as the test3 version that can tolerate a table a bit sparse), but the code I have written works equally well if the hash isn't minimal (this is positive to decrease the compilation time): in that case the table case gets a little more sparse, but the running time is the same, because it's just a little sparse, so llvm compiles it still with table (with holes). GCC is able to compile this code better, I have asked LLVM devs to implement the same thing (this is not about perfect hashing, it can be a way for the compiler to use it better in certain situations, where you have a conversion table or string->int compile-time constant associative array): http://llvm.org/bugs/show_bug.cgi?id=6009 Bye, bearophile
Jan 22 2010
Hello bearophile,This is just a possible optimization, it's not a new feature request. A perfect hash computed at compile time can be good to implement the string switches, I have created a small benchmark:Have you compared it to a decisition tree or lex style state mechine?
Jan 22 2010
BCS:Have you compared it to a decisition tree or lex style state mechine?I have not, I am sorry :-) But more comparisons can be added. I know what decision trees are in data mining, but I presume you mean some kind of ternary tree (3 possible results of the string cmp for each char compared). Bye, bearophile
Jan 22 2010
Hello bearophile,BCS:I'm not sure how I'd set it up but I expect it would amount to a hard coded radex sort: - do a normal integer switch on string length - for each length group the strings based on common prefixes (if you have aaab, aaac, aabb, bbbb the grouping would be: [[aaab, aaac], aabb], [bbbb]) - walk down the tree using if statements. I guess it's not much different than a lex DFA but for a small list of static strings it might be faster. Also it has the option of using strcmp for tails and long sub spans.Have you compared it to a decisition tree or lex style state mechine?I have not, I am sorry :-) But more comparisons can be added. I know what decision trees are in data mining, but I presume you mean some kind of ternary tree (3 possible results of the string cmp for each char compared). Bye, bearophile
Jan 22 2010
bearophile Wrote: <> Wouldn't perfect hashing for compile time AAs be a better starting point?
Jan 22 2010
strtr:Wouldn't perfect hashing for compile time AAs be a better starting point?AA keys can be all kind of things, while here I was talking just about perfect hashing of strings, it's a bit less complex problem. So I think it's better to start from string switch first and go on CT AAs later. Bye, bearophile
Jan 22 2010
BCS:Have you compared it to a decisition tree or lex style state mechine?I have now implemented that too, it was not an immediate thing to do (I have removed the versions 2 to 5 to reduce code size on codepad): http://codepad.org/zOmPeE13 The results are good: Timings, ldc, seconds: test1: 4.48 // normal string switch test2: 2.98 // perfect hash test3: 2.09 test4: 2.07 test5: 5.44 // AA. Tango AA opIn_r is bug-slow test6: 1.18 // new result I hope this is enough. I have created that large finite state machine in D with a Python program :-) Bye, bearophile
Jan 27 2010
On 2010-01-27 15:17, bearophile wrote:BCS:Your test6 is invalid: it reads beyond the bounds of the given array. For example with input "th", it will check whether the third character is 'i'; but the length of the array is only 2, so it shouldn't be doing that. -- E-mail address: matti.niemenmaa+news, domain is iki (DOT) fiHave you compared it to a decisition tree or lex style state mechine?I have now implemented that too, it was not an immediate thing to do (I have removed the versions 2 to 5 to reduce code size on codepad): http://codepad.org/zOmPeE13 The results are good: Timings, ldc, seconds: test1: 4.48 // normal string switch test2: 2.98 // perfect hash test3: 2.09 test4: 2.07 test5: 5.44 // AA. Tango AA opIn_r is bug-slow test6: 1.18 // new result I hope this is enough. I have created that large finite state machine in D with a Python program :-)
Jan 27 2010
Hello Matti,On 2010-01-27 15:17, bearophile wrote:A simple fix to that is to make he first transition of the state machine based on the length of the key. that is, 'if' at the top becomes: switch(word.length) { case 13: goto S...: case 12: goto S...: case 11: goto S...: case 10: goto S...: case 9: goto S...: case 8: goto S...: case 7: goto S...: case 6: goto S...: case 5: goto S...: case 4: goto S...: case 3: goto S...: case 2: goto S...: default: goto UNREC; } -- <IXOYE><BCS:Your test6 is invalid: it reads beyond the bounds of the given array. For example with input "th", it will check whether the third character is 'i'; but the length of the array is only 2, so it shouldn't be doing that.Have you compared it to a decisition tree or lex style state mechine?I have now implemented that too, it was not an immediate thing to do (I have removed the versions 2 to 5 to reduce code size on codepad): http://codepad.org/zOmPeE13 The results are good: Timings, ldc, seconds: test1: 4.48 // normal string switch test2: 2.98 // perfect hash test3: 2.09 test4: 2.07 test5: 5.44 // AA. Tango AA opIn_r is bug-slow test6: 1.18 // new result I hope this is enough. I have created that large finite state machine in D with a Python program :-)
Jan 27 2010
Matti Niemenmaa:Your test6 is invalid: it reads beyond the bounds of the given array. For example with input "th", it will check whether the third character is 'i'; but the length of the array is only 2, so it shouldn't be doing that.Shame on me, I am sorry -.- Thank you. I will try to improve the code later. The good thing is that D2 allows me to implement something to catch that kind of out-of-range pointer errors at run-time (D1 is not good enough here). That code at the bottom is not well tested yet, so it can contain many bugs. And probably it's not const-aware yet (if someone has suggestions to improve it I will appreciate them). In programs a certain percentage of pointers run inside an interval of memory, so something like this can be useful. Do you think this is (once well debugged and improved) useful for Phobos2? Up sides: - this code seems to work and catchs the bug at runtime in the test6() function I've written. Downsides: - currently compiled with D2 the test6() gets slower even when the code is not in debug release. Future D2 compilers will need to reduce to zero or near zero this overhead, so safer pointers can be used more freely. Possible future improvements: use an "interval tree" where necessary to quickly find if the pointer is inside more than one interval. This is probably a quite less common need. Bye, bearophile --------------------- import std.stdio: writeln, write; template IsMutable(T) { enum bool IsMutable = __traits(compiles, { T a; a = T.init; }); } unittest { // of IsMutable(T) auto s1 = "hello1"; static assert( !IsMutable!(typeof(s1[0])) ); char[] s2 = cast(char[])"hello2"; static assert( IsMutable!(typeof(s2[0])) ); } /// ***** NOT tested well enough yet ***** struct IntervalPtr(T) { T* ptr; // can be null debug { private T* start_ptr; // can't be null if present private T* end_ptr; // can't be null if present invariant() { assert(start_ptr != null); assert(end_ptr != null); assert(start_ptr <= end_ptr); assert(ptr == null || (ptr >= start_ptr && ptr < end_ptr)); } } else { this(T* aptr) { this.ptr = aptr; } } this(T* start, T* end) { this.ptr = start; debug { assert(start != null); assert(end != null); assert(start <= end); this.start_ptr = start; this.end_ptr = end; } } this(T* start, T* end, T* aptr) { this.ptr = aptr; debug { assert(end != null); assert(end != null); assert(start <= end); if (aptr != null) { assert(aptr >= start); assert(aptr < end); } this.start_ptr = start; this.end_ptr = end; } } T opStar() { return *this.ptr; } void opPostInc() { debug assert(this.ptr < (this.end_ptr - 1)); this.ptr++; } void opPostDec() { debug assert(this.ptr > this.start_ptr); this.ptr--; } void opAddAssign(size_t i) { debug assert(this.ptr < (this.end_ptr - i)); this.ptr += i; } void opSubAssign(size_t i) { debug assert(this.ptr >= (this.start_ptr + i)); this.ptr -= i; } void opAssign(T* other_ptr) { debug { if (other_ptr != null) { assert(other_ptr >= this.start_ptr); assert(other_ptr < this.end_ptr); } } this.ptr = other_ptr; } void opAssign(IntervalPtr other) { debug { assert(other.start_ptr != null); assert(other.end_ptr != null); assert(other.end_ptr != null); if (other.ptr != null) { assert(other.ptr >= other.start_ptr); assert(other.ptr < other.end_ptr); } this.start_ptr = other.start_ptr; this.end_ptr = other.end_ptr; } this.ptr = other.ptr; } const bool opEquals(ref const(IntervalPtr) other) { return this.ptr == other.ptr; } /*ref ?*/ T opIndex(size_t i) { debug { assert((this.ptr + i) >= this.start_ptr); assert((this.ptr + i) < this.end_ptr); } return this.ptr[i]; } static if (IsMutable!T) { void opIndexAssign(T value, size_t i) { debug { assert((this.ptr + i) >= this.start_ptr); assert((this.ptr + i) < this.end_ptr); } this.ptr[i] = value; } } } IntervalPtr!T intervalPtr(T)(/*inout ?*/ T[] arr) { debug return IntervalPtr!T(arr.ptr, arr.ptr + arr.length); else return IntervalPtr!T(arr.ptr); } IntervalPtr!T intervalPtr(T)(/*inout ?*/ T[] arr, T* aptr) { debug return IntervalPtr!T(arr.ptr, arr.ptr + arr.length, aptr); else return IntervalPtr!T(aptr); } IntervalPtr!T intervalPtr(T)(T* start, T* end) { debug return IntervalPtr!T(start, end); else return IntervalPtr!T(start); } IntervalPtr!T intervalPtr(T)(T* start, T* end, T* aptr) { debug return IntervalPtr!T(start, end, aptr); else return IntervalPtr!T(aptr); } void main() { auto s = "hello"; writeln("String: ", s); auto p1 = s.intervalPtr(); writeln("intervalPtr sizeof: ", p1.sizeof); foreach (_; 0 .. 4) { write(">", *p1, "< "); ++p1; } writeln(">", *p1, "< "); auto p2 = intervalPtr(s.ptr, s.ptr + s.length); foreach (_; 0 .. 4) { write(">", *p2, "< "); p2++; } writeln(">", *p2, "< "); p1 = s.ptr; // p1 reset to the start foreach (i; 0 .. s.length) write(">", p1[i], "< "); writeln(); }
Jan 27 2010
Hello bearophile,Matti Niemenmaa:Things I'd use in place of that: ///// char[] str, int at = 0; ... switch(str[at]) { ... } ... at++; or ///// char[] str; ... switch(str[0]) { ... } ... str = str[1..$]; A good compiler should be able to convert the first to about the same as what your type should compile to. A really good compiler should be able to const propagate at and remove the variable and (if you start by switching on the length) fold out even the bounds checks. The second has the same optimization potential but for some reason I think it would be expecting more of a compiler to get it optimized that well. -- <IXOYE><Your test6 is invalid: it reads beyond the bounds of the given array. For example with input "th", it will check whether the third character is 'i'; but the length of the array is only 2, so it shouldn't be doing that.Shame on me, I am sorry -.- Thank you. I will try to improve the code later. The good thing is that D2 allows me to implement something to catch that kind of out-of-range pointer errors at run-time (D1 is not good enough here).
Jan 28 2010
BCS:Things I'd use in place of that: ///// char[] str, int at = 0; ... switch(str[at]) { ... } ... at++; or ///// char[] str; ... switch(str[0]) { ... } ... str = str[1..$];When in not-debug mode that struct of mine is just 1 word long, so it hopefully gets optimized well by LDC2. While removing that free 2-word ling "str" is a bit harder. With that struct you don't need to carry around two things, a vector and index, you have to keep and move around just one variable. And it acts just like a pointer, so you can move it in both directions, etc (it's not handy nor safe to do this with a slice, you need to keep a 3 words of memory as state to be safe). You can (hopefully) use it as drop-in safer replacement for a pointer that you know spans a range of memory (it can have few holes too), so you can use it in D2 code translated from original C code, changing it very little. You don't want to modify too much that kind of C code. In that broken finite state machine of mine I've had to change just one line of code, where the "p" pointer is defined. With DMD you can remove 100% of the overhead defining it as a pointer in not-debug release, and keeping the rest of the code that use the pointer unchanged. This can be done locally (with a bit more complex pointer definition that defines a true pointer in not-debug mode), or more generally modifying those functions that act as constructors, so they return a true pointer in not-debug mode (but I like this less, I think it's not tidy to have functions that return different types in debug mode). Having safer pointers goes well with the style of D language. Probably I'll put this struct in my future d2libs :-) Maybe with small changes such safe ranged pointers can be used in SafeD modules too :-) Bye, bearophile
Jan 29 2010
Hello bearophile,BCS:Almost all the cases I can think of fall into one of a few categories: - the length stuff can be statically verified to be correct and the length need not be carried around - the length stuff can't be statically verified but the code handles he dynamic verification in such a way the optimizer can figure out how to safely omit the length checks for much of the code. - the length stuff can't be statically verified and there is no way to enforce correct dynamic verification so you can't safely omit the length and checks even in release mode. In each case, the native solution can be just as fast as your solution or your solution is not safe in release mode.Things I'd use in place of that: ///// char[] str, int at = 0; ... switch(str[at]) { ... } ... at++; or ///// char[] str; ... switch(str[0]) { ... } ... str = str[1..$];When in not-debug mode that struct of mine is just 1 word long, so it hopefully gets optimized well by LDC2. While removing that free 2-word ling "str" is a bit harder.And it acts just like a pointer, so you can move it in both directions, etcI'll give you that point. -- <IXOYE><
Jan 29 2010