
digitalmars.D - Re: String compare performance

reply bearophile <bearophileHUGS lycos.com> writes:
I have done another test:

Timings, dmd compiler, best of 4, seconds:
  D #1: 5.72
  D #4: 1.84
  D #5: 1.73
  Psy:  1.59
  D #2: 0.55
  D #6: 0.47  
  D #3: 0.34


import std.file: read;
import std.c.stdio: printf;

int test(char[] data) {
    int count;
    foreach (i; 0 ..  data.length - 3) {
        char[] codon = data[i .. i + 3];
        if ((codon.length == 3 && codon[0] == 'T' && codon[1] == 'A' &&
codon[2] == 'G') ||
            (codon.length == 3 && codon[0] == 'T' && codon[1] == 'G' &&
codon[2] == 'A') ||
            (codon.length == 3 && codon[0] == 'T' && codon[1] == 'A' &&
codon[2] == 'A'))
            count++;
    }
    return count;
}

void main() {
    char[] data0 = cast(char[])read("data.txt");
    int n = 300;
    char[] data = new char[data0.length * n];
    for (size_t pos; pos < data.length; pos += data0.length)
        data[pos .. pos+data0.length] = data0;

    printf("%d\n", test(data));
}


So when comparing strings known at compile time to be small (e.g. fewer than
6 chars), the comparison should be replaced with inlined single-char
comparisons. This makes the code longer, so it increases code cache pressure,
but seeing how much slower the alternative is, I think it's an improvement.

(A smart compiler is even able to remove the codon.length == 3 test, because
the slice data[i .. i+3] always has length 3. Mysteriously, though, if you
remove those three length tests here the program compiled with dmd gets
slower.)
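For readers outside D, the inlined single-char form can be sketched in C (an
illustrative port, not from the thread; the function name is made up). This is
the code shape an inlining compiler could emit for the small fixed-string
compares above:

```c
#include <stddef.h>

/* Count stop codons (TAG, TGA, TAA) with inlined per-character
   comparisons instead of a generic array-compare call. */
static int count_stop_codons(const char *data, size_t len)
{
    int count = 0;
    for (size_t i = 0; i + 3 <= len; i++) {
        const char *c = data + i;
        if ((c[0] == 'T' && c[1] == 'A' && c[2] == 'G') ||
            (c[0] == 'T' && c[1] == 'G' && c[2] == 'A') ||
            (c[0] == 'T' && c[1] == 'A' && c[2] == 'A'))
            count++;
    }
    return count;
}
```

Overlapping matches are counted, matching the sliding-window loop in the D
version.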

Bye,
bearophile
Nov 27 2010
parent reply "Robert Jacques" <sandford jhu.edu> writes:
On Sat, 27 Nov 2010 22:08:29 -0500, bearophile <bearophileHUGS lycos.com>  
wrote:
 I have done another test:

 Timings, dmd compiler, best of 4, seconds:
   D #1: 5.72
   D #4: 1.84
   D #5: 1.73
   Psy:  1.59
   D #2: 0.55
   D #6: 0.47
   D #3: 0.34


 import std.file: read;
 import std.c.stdio: printf;

 int test(char[] data) {
     int count;
     foreach (i; 0 ..  data.length - 3) {
         char[] codon = data[i .. i + 3];
         if ((codon.length == 3 && codon[0] == 'T' && codon[1] == 'A' &&  
 codon[2] == 'G') ||
             (codon.length == 3 && codon[0] == 'T' && codon[1] == 'G' &&  
 codon[2] == 'A') ||
             (codon.length == 3 && codon[0] == 'T' && codon[1] == 'A' &&  
 codon[2] == 'A'))
             count++;
     }
     return count;
 }

 void main() {
     char[] data0 = cast(char[])read("data.txt");
     int n = 300;
     char[] data = new char[data0.length * n];
     for (size_t pos; pos < data.length; pos += data0.length)
         data[pos .. pos+data0.length] = data0;

     printf("%d\n", test(data));
 }


 So when comparing strings known at compile time to be small (e.g. fewer
 than 6 chars), the comparison should be replaced with inlined single-char
 comparisons. This makes the code longer, so it increases code cache
 pressure, but seeing how much slower the alternative is, I think it's an
 improvement.

 (A smart compiler is even able to remove the codon.length == 3 test,
 because the slice data[i .. i+3] always has length 3. Mysteriously, though,
 if you remove those three length tests here the program compiled with dmd
 gets slower.)

 Bye,
 bearophile

Hi bearophile,
I've spent some time having fun this afternoon optimizing array-equals using
vectorization techniques. I found that vectorizing using ulongs worked best on
my system, except with the shortest strings, where a simple Duff's device
edged it out. If you'd like to try it out on your data set:

bool arrayComp3(bool useBitCompare = true, T)
               (const T[] a, const T[] b) pure nothrow {
    if (a.length != b.length)
        return false;
    static if (useBitCompare) {
        auto pab = cast(ubyte*)a.ptr;
        auto pbb = cast(ubyte*)b.ptr;
        if (pab is pbb)
            return true;
        auto byte_length = a.length * T.sizeof;
        auto pa_end = cast(ulong*)(pab + byte_length);
        final switch (byte_length % ulong.sizeof) {
            case 7: if (*pab++ != *pbb++) return false;
            case 6: if (*pab++ != *pbb++) return false;
            case 5: if (*pab++ != *pbb++) return false;
            case 4: if (*pab++ != *pbb++) return false;
            case 3: if (*pab++ != *pbb++) return false;
            case 2: if (*pab++ != *pbb++) return false;
            case 1: if (*pab++ != *pbb++) return false;
            case 0:
        }
        auto pa = cast(ulong*)pab;
        auto pb = cast(ulong*)pbb;
        while (pa < pa_end) {
            if (*pa++ != *pb++)
                return false;
        }
    } else { // default to a short Duff's device
        auto pa = a.ptr;
        auto pb = b.ptr;
        if (pa is pb)
            return true;
        auto n = (a.length + 3) / 4;
        final switch (a.length % 4) {
            case 0: do { if (*pa++ != *pb++) return false;
            case 3:      if (*pa++ != *pb++) return false;
            case 2:      if (*pa++ != *pb++) return false;
            case 1:      if (*pa++ != *pb++) return false;
                    } while (--n > 0);
        }
    }
    return true;
}
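The word-at-a-time idea translates to C roughly as follows (a sketch; the
helper name is invented, and memcpy is used for the word loads to avoid
unaligned-access undefined behaviour, which the pointer-cast D version does
not guard against). Equality of whole words is endian-independent, so no
byte-order assumption is needed here:

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Word-wise array equality in the spirit of arrayComp3: compare the
   0..7 odd leading bytes individually, then 8 bytes per step. */
static int bytes_equal(const void *pa, const void *pb, size_t n)
{
    const unsigned char *a = pa, *b = pb;
    if (a == b)
        return 1;
    size_t tail = n % sizeof(uint64_t);   /* leftover byte count */
    for (size_t i = 0; i < tail; i++)     /* byte-wise prologue  */
        if (a[i] != b[i])
            return 0;
    for (size_t i = tail; i < n; i += sizeof(uint64_t)) {
        uint64_t wa, wb;
        memcpy(&wa, a + i, sizeof wa);    /* safe unaligned load */
        memcpy(&wb, b + i, sizeof wb);
        if (wa != wb)
            return 0;
    }
    return 1;
}
```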
Nov 27 2010
next sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Robert Jacques:

 I've spent some time having fun this afternoon optimizing array-equals  
 using vectorization techniques. I found that vectorizing using ulongs  
 worked best on my system except with the shortest strings, where a simple  
 Duff's device edged it out. If you'd like to try it out on your data set:

Thank you for your work :-) A version with your function, D version #8:

// D version #8
import std.file: read;
import std.c.stdio: printf;
import std.exception: assumeUnique;

bool arrayComp(bool useBitCompare = true, T)
              (const T[] a, const T[] b) pure nothrow {
    if (a.length != b.length)
        return false;
    static if (useBitCompare) {
        auto pab = cast(ubyte*)a.ptr;
        auto pbb = cast(ubyte*)b.ptr;
        if (pab is pbb)
            return true;
        auto byte_length = a.length * T.sizeof;
        auto pa_end = cast(ulong*)(pab + byte_length);
        final switch (byte_length % ulong.sizeof) {
            case 7: if (*pab++ != *pbb++) return false;
            case 6: if (*pab++ != *pbb++) return false;
            case 5: if (*pab++ != *pbb++) return false;
            case 4: if (*pab++ != *pbb++) return false;
            case 3: if (*pab++ != *pbb++) return false;
            case 2: if (*pab++ != *pbb++) return false;
            case 1: if (*pab++ != *pbb++) return false;
            case 0:
        }
        auto pa = cast(ulong*)pab;
        auto pb = cast(ulong*)pbb;
        while (pa < pa_end) {
            if (*pa++ != *pb++)
                return false;
        }
    } else { // default to a short Duff's device
        auto pa = a.ptr;
        auto pb = b.ptr;
        if (pa == pb)
            return true;
        auto n = (a.length + 3) / 4;
        final switch (a.length % 4) {
            case 0: do { if (*pa++ != *pb++) return false;
            case 3:      if (*pa++ != *pb++) return false;
            case 2:      if (*pa++ != *pb++) return false;
            case 1:      if (*pa++ != *pb++) return false;
                    } while (--n > 0);
        }
    }
    return true;
}

int test(string data) {
    int count;
    foreach (i; 0 .. data.length - 3) {
        auto codon = data[i .. i + 3];
        if (arrayComp(codon, "TAG") ||
            arrayComp(codon, "TGA") ||
            arrayComp(codon, "TAA"))
            count++;
    }
    return count;
}

void main() {
    char[] data0 = cast(char[])read("data.txt");
    int n = 300;
    char[] data = new char[data0.length * n];
    for (size_t pos; pos < data.length; pos += data0.length)
        data[pos .. pos + data0.length] = data0;
    string sdata = assumeUnique(data);
    printf("%d\n", test(sdata));
}

Timings, dmd compiler, best of 4, seconds:
  D #1: 5.72
  D #4: 1.84
  D #5: 1.73
  Psy:  1.59
  D #8: 1.51
  D #7: 0.56  (like #6 without length comparisons)
  D #2: 0.55
  D #6: 0.47
  D #3: 0.34

Your function can't be inlined because it's big, so this code isn't faster
than inlined code like this, generated by the compiler:

(codon.length == 3 && codon[0] == 'T' && codon[1] == 'A' && codon[2] == 'G')

Bye,
bearophile
Nov 28 2010
next sibling parent reply Pelle Månsson <pelle.mansson gmail.com> writes:
On 11/28/2010 12:44 PM, bearophile wrote:
 Robert Jacques:

 I've spent some time having fun this afternoon optimizing array-equals
 using vectorization techniques. I found that vectorizing using ulongs
 worked best on my system except with the shortest strings, where a simple
 Duff's device edged it out. If you'd like to try it out on your data set:

 [snip]

I don't have your data set, but for me using random data this was within a
factor 2 of your #3, without any fiddly code.

int test(string data) {
    int count;
    while (true) {
        data = data.find("TAG", "TGA", "TAA")[0];
        if (data.empty)
            return count;
        count += 1;
        data.popFront;
    }
}

Also, this one is far easier to generalize for strings of different lengths,
and such :-)
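The search-with-a-library-call style can be sketched in C with memchr and
memcmp standing in for std.algorithm.find (illustrative only; the function
name is invented). Like the popFront loop, it advances one character past
each match so overlapping matches are still counted:

```c
#include <stddef.h>
#include <string.h>

/* Jump to each 'T' with memchr (the "find" step), then check the
   three candidate codons with memcmp. */
static int count_codons_find(const char *data, size_t len)
{
    int count = 0;
    const char *p = data, *end = data + len;
    while ((p = memchr(p, 'T', (size_t)(end - p))) != NULL) {
        if (end - p >= 3 &&
            (memcmp(p, "TAG", 3) == 0 ||
             memcmp(p, "TGA", 3) == 0 ||
             memcmp(p, "TAA", 3) == 0))
            count++;
        p++;   /* resume just past this 'T', like popFront */
    }
    return count;
}
```

The per-candidate memcmp calls are exactly the kind of small-string library
compares whose overhead this thread is measuring.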
Nov 28 2010
parent reply bearophile <bearophileHUGS lycos.com> writes:
Pelle Månsson:

 I don't have your data set,

In my first post I have explained how to create the data set, using a little Python script that I have listed there. You just need a Python2 interpreter.
 but for me using random data this was within a factor 2 of your #3, without
any fiddly code.

Thank you for your code. I have added your version, plus a modified version
that works on ubytes, because it's faster, #9 and #10:

-------------------------
// D version #9
import std.file: read;
import std.c.stdio: printf;
import std.algorithm: find;
import std.array: empty, popFront;

int test(char[] data) {
    int count;
    while (true) {
        data = data.find("TAG", "TGA", "TAA")[0];
        if (data.empty)
            return count;
        count++;
        data.popFront();
    }
}

void main() {
    char[] data0 = cast(char[])read("data.txt");
    int n = 300;
    char[] data = new char[data0.length * n];
    for (size_t pos; pos < data.length; pos += data0.length)
        data[pos .. pos + data0.length] = data0;
    printf("%d\n", test(data));
}

-------------------------
// D version #10
import std.file: read;
import std.c.stdio: printf;
import std.algorithm: find;
import std.array: empty, popFront;

int test(ubyte[] data) {
    int count;
    while (true) {
        data = data.find("TAG", "TGA", "TAA")[0];
        if (data.empty)
            return count;
        count++;
        data.popFront();
    }
}

void main() {
    ubyte[] data0 = cast(ubyte[])read("data.txt");
    int n = 300;
    ubyte[] data = new ubyte[data0.length * n];
    for (size_t pos; pos < data.length; pos += data0.length)
        data[pos .. pos + data0.length] = data0;
    printf("%d\n", test(data));
}
-------------------------

Timings, dmd compiler, best of 4, seconds:
  D #1:  5.72
  D #9:  5.04
  D #10: 3.31
  D #4:  1.84
  D #5:  1.73
  Psy:   1.59
  D #8:  1.51
  D #7:  0.56
  D #2:  0.55
  D #6:  0.47
  D #3:  0.34

So on this PC it's not much faster. And generally I don't like using find,
empty and popFront to solve this problem; it's quite unnatural. Here I have
explained what I think is a good enough solution to this performance problem:
http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.D&article_id=123044

Bye,
bearophile
Nov 28 2010
parent Pelle Månsson <pelle.mansson gmail.com> writes:
On 11/28/2010 04:48 PM, bearophile wrote:
 Pelle Månsson:

 I don't have your data set,

In my first post I have explained how to create the data set, using a little Python script that I have listed there. You just need a Python2 interpreter.
 but for me using random data this was within a factor 2 of your #3, without
any fiddly code.

 [snip]

Measuring incorrectly in a performance test is a silly mistake; I was indeed
wrong about the factor-two thing! I thought find was specialized for chars,
so it wouldn't require casting to ubyte. Maybe I was wrong again. Thank you
for your engaging performance-oriented posts. :-)
Nov 28 2010
prev sibling parent bearophile <bearophileHUGS lycos.com> writes:
Robert Jacques:

 Still, part of the point was that string comparisons in general were way  
 too slow. Anyways, I've applied the same technique in a partially unrolled  
 version if you want to check it out:

The version #11 (I have not reformatted your function):

// D version #11
import std.file: read;
import std.c.stdio: printf;

// not formatted
bool arrayComp(T, size_t N)(const T[] a, ref const T[N] b) pure nothrow {
    static if (T.sizeof * N <= uint.sizeof) {
        return a.length == N &&
            !((*(cast(uint*)a.ptr) ^ *(cast(uint*)b.ptr)) &
              (uint.max >> 8 * (uint.sizeof - T.sizeof * N)));
    } else static if (T.sizeof * N <= ulong.sizeof) {
        return a.length == N &&
            !((*(cast(ulong*)a.ptr) ^ *(cast(ulong*)b.ptr)) &
              (ulong.max >> 8 * (ulong.sizeof - T.sizeof * N)));
    } else { // Fall back to a loop
        if (a.length != N ||
            (*(cast(ulong*)a.ptr) != *(cast(ulong*)b.ptr)))
            return false;
        enum alignment = T.sizeof * N % ulong.sizeof > 0 ?
                         T.sizeof * N % ulong.sizeof : ulong.sizeof;
        auto pa = cast(ulong*)(cast(ubyte*)a.ptr + alignment);
        auto pb = cast(ulong*)(cast(ubyte*)b.ptr + alignment);
        auto pa_end = cast(ulong*)(cast(ubyte*)a.ptr + T.sizeof * N);
        while (pa < pa_end) {
            if (*pa++ != *pb++)
                return false;
        }
        return true;
    }
}

pure int test(const char[] data) {
    int count;
    foreach (i; 0 .. data.length - 3) {
        const char[] codon = data[i .. i + 3];
        if (arrayComp!(char, 3)(codon, "TAG") ||
            arrayComp!(char, 3)(codon, "TGA") ||
            arrayComp!(char, 3)(codon, "TAA"))
            count++;
    }
    return count;
}

void main() {
    char[] data0 = cast(char[])read("data.txt");
    int n = 300;
    char[] data = new char[data0.length * n];
    for (size_t pos; pos < data.length; pos += data0.length)
        data[pos .. pos + data0.length] = data0;
    printf("%d\n", test(data));
}

Timings, dmd compiler, best of 4, seconds:
  D #1:  5.72
  D #9:  5.04
  D #10: 3.31
  D #4:  1.84
  D #5:  1.73
  Psy:   1.59
  D #8:  1.51
  D #7:  0.56
  D #2:  0.55
  D #6:  0.47
  D #11: 0.47
  D #3:  0.34

Only the #3 (tree-based) version is faster. This problem probably needs to be
solved by the compiler.

Bye,
bearophile
Nov 28 2010
prev sibling next sibling parent spir <denis.spir gmail.com> writes:
On Sun, 28 Nov 2010 10:48:46 -0500
bearophile <bearophileHUGS lycos.com> wrote:

 generally I don't like to use find, empty and popFront to solve this
 problem, it's quite unnatural. Here I have explained what I think is a
 good enough solution to this performance problem:
 http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.D&article_id=123044

What I would like is a just decent routine for comparing arrays of elements
which _conceptually_ are values, whatever their complexity (meaning, possibly
structs or nested arrays). This should always (*) use a default '=='
comparison, which in turn may perform a kind of memcomp.
I have no idea how to avoid bad perf such as shown by your tests, but this is
really necessary. Tons of routines and algorithms use '==', and among the
most used; str.find will perform an arbitrary number of '==' checks; and I
don't even want to think at pattern matching ;-)

Denis

(*) I think that:
* If an element type is a value type, it should not override opEquals. '=='
should translate to equality of every sub-element. Else, there is something
wrong in the design. But there may be use cases I overlook. (Eg metadata
stored on the element itself?)
* A plain value may well refer to a "thing"/object, eg a widget's outlook
pointing to a color palette, in which case reference comparison is the right
choice -- so, no complication.
* Real things (referenced objects, entities with self/id) should not have
opEquals at all. They should be compared only by id, meaning using 'is'. This
shows again the ambivalence of dyn arrays...

-- -- -- -- -- -- --
vit esse estrany ☣

spir.wikidot.com
Nov 28 2010
prev sibling parent "Robert Jacques" <sandford jhu.edu> writes:
On Sun, 28 Nov 2010 06:44:23 -0500, bearophile <bearophileHUGS lycos.com>  
wrote:
 Robert Jacques:

 I've spent some time having fun this afternoon optimizing array-equals
 using vectorization techniques. I found that vectorizing using ulongs
 worked best on my system except with the shortest strings, where a  
 simple
 Duff's device edged it out. If you'd like to try it out on your data  
 set:

Thank you for your work :-) A version with your function, D version #8:

[snip]
 Your function can't be inlined because it's big, so this code isn't  
 faster than inlined code like this generated by the compiler:
 (codon.length == 3 && codon[0] == 'T' && codon[1] == 'A' && codon[2] ==  
 'G')

 Bye,
 bearophile

Still, part of the point was that string comparisons in general were way too
slow. Anyways, I've applied the same technique in a partially unrolled
version if you want to check it out:

bool arrayComp(T, size_t N)(const T[] a, ref const T[N] b) pure nothrow {
    static if (T.sizeof * N <= uint.sizeof) {
        return a.length == N &&
            !((*(cast(uint*)a.ptr) ^ *(cast(uint*)b.ptr)) &
              (uint.max >> 8 * (uint.sizeof - T.sizeof * N)));
    } else static if (T.sizeof * N <= ulong.sizeof) {
        return a.length == N &&
            !((*(cast(ulong*)a.ptr) ^ *(cast(ulong*)b.ptr)) &
              (ulong.max >> 8 * (ulong.sizeof - T.sizeof * N)));
    } else { // Fall back to a loop
        if (a.length != N ||
            (*(cast(ulong*)a.ptr) != *(cast(ulong*)b.ptr)))
            return false;
        enum alignment = T.sizeof * N % ulong.sizeof > 0 ?
                         T.sizeof * N % ulong.sizeof : ulong.sizeof;
        auto pa = cast(ulong*)(cast(ubyte*)a.ptr + alignment);
        auto pb = cast(ulong*)(cast(ubyte*)b.ptr + alignment);
        auto pa_end = cast(ulong*)(cast(ubyte*)a.ptr + T.sizeof * N);
        while (pa < pa_end) {
            if (*pa++ != *pb++)
                return false;
        }
        return true;
    }
}

Be warned that you'll have to use strings explicitly typed as immutable
char[N], since neither array literals nor string literals will match this
template.
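The XOR-and-mask trick for lengths up to 8 bytes can be illustrated in C (a
sketch with invented names, not a literal port: it copies exactly n bytes
into zeroed words, so it never reads past the buffers the way the
pointer-cast version can). Like the D original, the mask assumes little-endian
byte order, where the first n bytes land in the low bits of the word:

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Fixed-size equality for n <= 8: load both sides into 64-bit words,
   XOR, and mask away the bytes beyond n.  Because only n bytes were
   copied into zero-initialized words, the mask is redundant here; it
   is kept to show the shape of the original trick. */
static int small_equal(const char *a, const char *b, size_t n)
{
    uint64_t wa = 0, wb = 0;
    memcpy(&wa, a, n);   /* n <= 8 assumed */
    memcpy(&wb, b, n);
    uint64_t mask = (n >= 8) ? UINT64_MAX
                             : (((uint64_t)1 << (8 * n)) - 1);
    return ((wa ^ wb) & mask) == 0;
}
```

The appeal is that a 3-char codon compare becomes one load, one XOR, and one
AND, with no loop and no call.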
Nov 28 2010