
digitalmars.D - Re: String compare performance

reply bearophile <bearophileHUGS lycos.com> writes:
I have done another test:

Timings, dmd compiler, best of 4, seconds:
  D #1: 5.72
  D #4: 1.84
  D #5: 1.73
  Psy:  1.59
  D #2: 0.55
  D #6: 0.47  
  D #3: 0.34


import std.file: read;
import std.c.stdio: printf;

int test(char[] data) {
    int count;
    foreach (i; 0 ..  data.length - 3) {
        char[] codon = data[i .. i + 3];
        if ((codon.length == 3 && codon[0] == 'T' && codon[1] == 'A' &&
codon[2] == 'G') ||
            (codon.length == 3 && codon[0] == 'T' && codon[1] == 'G' &&
codon[2] == 'A') ||
            (codon.length == 3 && codon[0] == 'T' && codon[1] == 'A' &&
codon[2] == 'A'))
            count++;
    }
    return count;
}

void main() {
    char[] data0 = cast(char[])read("data.txt");
    int n = 300;
    char[] data = new char[data0.length * n];
    for (size_t pos; pos < data.length; pos += data0.length)
        data[pos .. pos+data0.length] = data0;

    printf("%d\n", test(data));
}


So when comparing strings known at compile time to be small (e.g. fewer than
6 chars), the comparison should be replaced with inlined single-char
comparisons. This makes the code longer, so it increases code cache pressure,
but seeing how much slower the alternative is, I think it's an improvement.

(A smart compiler is even able to remove the codon.length == 3 test, because
the slice data[i .. i+3] always has length 3. Mysteriously, though, if you
remove those three length tests here the program compiled with dmd gets
slower.)
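For readers outside D, the inlined single-char form can be sketched in C (an
illustrative port, not from the thread; the function name is made up). This is
the code shape an inlining compiler could emit for the small fixed-string
compares above:

```c
#include <stddef.h>

/* Count stop codons (TAG, TGA, TAA) with inlined per-character
   comparisons instead of a generic array-compare call. */
static int count_stop_codons(const char *data, size_t len)
{
    int count = 0;
    for (size_t i = 0; i + 3 <= len; i++) {
        const char *c = data + i;
        if ((c[0] == 'T' && c[1] == 'A' && c[2] == 'G') ||
            (c[0] == 'T' && c[1] == 'G' && c[2] == 'A') ||
            (c[0] == 'T' && c[1] == 'A' && c[2] == 'A'))
            count++;
    }
    return count;
}
```

Overlapping matches are counted, matching the sliding-window loop in the D
version.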

Bye,
bearophile
Nov 27 2010
parent reply "Robert Jacques" <sandford jhu.edu> writes:
On Sat, 27 Nov 2010 22:08:29 -0500, bearophile <bearophileHUGS lycos.com>  
wrote:
 I have done another test:

 Timings, dmd compiler, best of 4, seconds:
   D #1: 5.72
   D #4: 1.84
   D #5: 1.73
   Psy:  1.59
   D #2: 0.55
   D #6: 0.47
   D #3: 0.34


 import std.file: read;
 import std.c.stdio: printf;

 int test(char[] data) {
     int count;
     foreach (i; 0 ..  data.length - 3) {
         char[] codon = data[i .. i + 3];
         if ((codon.length == 3 && codon[0] == 'T' && codon[1] == 'A' &&  
 codon[2] == 'G') ||
             (codon.length == 3 && codon[0] == 'T' && codon[1] == 'G' &&  
 codon[2] == 'A') ||
             (codon.length == 3 && codon[0] == 'T' && codon[1] == 'A' &&  
 codon[2] == 'A'))
             count++;
     }
     return count;
 }

 void main() {
     char[] data0 = cast(char[])read("data.txt");
     int n = 300;
     char[] data = new char[data0.length * n];
     for (size_t pos; pos < data.length; pos += data0.length)
         data[pos .. pos+data0.length] = data0;

     printf("%d\n", test(data));
 }


 So when comparing strings known at compile time to be small (e.g. fewer
 than 6 chars), the comparison should be replaced with inlined single-char
 comparisons. This makes the code longer, so it increases code cache
 pressure, but seeing how much slower the alternative is, I think it's an
 improvement.

 (A smart compiler is even able to remove the codon.length == 3 test,
 because the slice data[i .. i+3] always has length 3. Mysteriously, though,
 if you remove those three length tests here the program compiled with dmd
 gets slower.)

 Bye,
 bearophile

Hi bearophile,
I've spent some time having fun this afternoon optimizing array-equals using
vectorization techniques. I found that vectorizing using ulongs worked best on
my system, except with the shortest strings, where a simple Duff's device
edged it out. If you'd like to try it out on your data set:

bool arrayComp3(bool useBitCompare = true, T)
               (const T[] a, const T[] b) pure nothrow {
    if (a.length != b.length)
        return false;
    static if (useBitCompare) {
        auto pab = cast(ubyte*)a.ptr;
        auto pbb = cast(ubyte*)b.ptr;
        if (pab is pbb)
            return true;
        auto byte_length = a.length * T.sizeof;
        auto pa_end = cast(ulong*)(pab + byte_length);
        final switch (byte_length % ulong.sizeof) {
            case 7: if (*pab++ != *pbb++) return false;
            case 6: if (*pab++ != *pbb++) return false;
            case 5: if (*pab++ != *pbb++) return false;
            case 4: if (*pab++ != *pbb++) return false;
            case 3: if (*pab++ != *pbb++) return false;
            case 2: if (*pab++ != *pbb++) return false;
            case 1: if (*pab++ != *pbb++) return false;
            case 0:
        }
        auto pa = cast(ulong*)pab;
        auto pb = cast(ulong*)pbb;
        while (pa < pa_end) {
            if (*pa++ != *pb++)
                return false;
        }
    } else { // default to a short Duff's device
        auto pa = a.ptr;
        auto pb = b.ptr;
        if (pa is pb)
            return true;
        auto n = (a.length + 3) / 4;
        final switch (a.length % 4) {
            case 0: do { if (*pa++ != *pb++) return false;
            case 3:      if (*pa++ != *pb++) return false;
            case 2:      if (*pa++ != *pb++) return false;
            case 1:      if (*pa++ != *pb++) return false;
                    } while (--n > 0);
        }
    }
    return true;
}
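The word-at-a-time idea translates to C roughly as follows (a sketch; the
helper name is invented, and memcpy is used for the word loads to avoid
unaligned-access undefined behaviour, which the pointer-cast D version does
not guard against). Equality of whole words is endian-independent, so no
byte-order assumption is needed here:

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Word-wise array equality in the spirit of arrayComp3: compare the
   0..7 odd leading bytes individually, then 8 bytes per step. */
static int bytes_equal(const void *pa, const void *pb, size_t n)
{
    const unsigned char *a = pa, *b = pb;
    if (a == b)
        return 1;
    size_t tail = n % sizeof(uint64_t);   /* leftover byte count */
    for (size_t i = 0; i < tail; i++)     /* byte-wise prologue  */
        if (a[i] != b[i])
            return 0;
    for (size_t i = tail; i < n; i += sizeof(uint64_t)) {
        uint64_t wa, wb;
        memcpy(&wa, a + i, sizeof wa);    /* safe unaligned load */
        memcpy(&wb, b + i, sizeof wb);
        if (wa != wb)
            return 0;
    }
    return 1;
}
```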
Nov 27 2010
next sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Robert Jacques:

 I've spent some time having fun this afternoon optimizing array-equals  
 using vectorization techniques. I found that vectorizing using ulongs  
 worked best on my system except with the shortest strings, where a simple  
 Duff's device edged it out. If you'd like to try it out on your data set:

Thank you for your work :-) A version with your function, D version #8:

// D version #8
import std.file: read;
import std.c.stdio: printf;
import std.exception: assumeUnique;

bool arrayComp(bool useBitCompare = true, T)
              (const T[] a, const T[] b) pure nothrow {
    if (a.length != b.length)
        return false;
    static if (useBitCompare) {
        auto pab = cast(ubyte*)a.ptr;
        auto pbb = cast(ubyte*)b.ptr;
        if (pab is pbb)
            return true;
        auto byte_length = a.length * T.sizeof;
        auto pa_end = cast(ulong*)(pab + byte_length);
        final switch (byte_length % ulong.sizeof) {
            case 7: if (*pab++ != *pbb++) return false;
            case 6: if (*pab++ != *pbb++) return false;
            case 5: if (*pab++ != *pbb++) return false;
            case 4: if (*pab++ != *pbb++) return false;
            case 3: if (*pab++ != *pbb++) return false;
            case 2: if (*pab++ != *pbb++) return false;
            case 1: if (*pab++ != *pbb++) return false;
            case 0:
        }
        auto pa = cast(ulong*)pab;
        auto pb = cast(ulong*)pbb;
        while (pa < pa_end) {
            if (*pa++ != *pb++)
                return false;
        }
    } else { // default to a short Duff's device
        auto pa = a.ptr;
        auto pb = b.ptr;
        if (pa == pb)
            return true;
        auto n = (a.length + 3) / 4;
        final switch (a.length % 4) {
            case 0: do { if (*pa++ != *pb++) return false;
            case 3:      if (*pa++ != *pb++) return false;
            case 2:      if (*pa++ != *pb++) return false;
            case 1:      if (*pa++ != *pb++) return false;
                    } while (--n > 0);
        }
    }
    return true;
}

int test(string data) {
    int count;
    foreach (i; 0 .. data.length - 3) {
        auto codon = data[i .. i + 3];
        if (arrayComp(codon, "TAG") ||
            arrayComp(codon, "TGA") ||
            arrayComp(codon, "TAA"))
            count++;
    }
    return count;
}

void main() {
    char[] data0 = cast(char[])read("data.txt");
    int n = 300;
    char[] data = new char[data0.length * n];
    for (size_t pos; pos < data.length; pos += data0.length)
        data[pos .. pos + data0.length] = data0;
    string sdata = assumeUnique(data);
    printf("%d\n", test(sdata));
}

Timings, dmd compiler, best of 4, seconds:
  D #1: 5.72
  D #4: 1.84
  D #5: 1.73
  Psy:  1.59
  D #8: 1.51
  D #7: 0.56  (like #6 without length comparisons)
  D #2: 0.55
  D #6: 0.47
  D #3: 0.34

Your function can't be inlined because it's big, so this code isn't faster
than inlined code like this, generated by the compiler:

(codon.length == 3 && codon[0] == 'T' && codon[1] == 'A' && codon[2] == 'G')

Bye,
bearophile
Nov 28 2010
next sibling parent reply Pelle Månsson <pelle.mansson gmail.com> writes:
On 11/28/2010 12:44 PM, bearophile wrote:
 Robert Jacques:

 I've spent some time having fun this afternoon optimizing array-equals
 using vectorization techniques. I found that vectorizing using ulongs
 worked best on my system except with the shortest strings, where a simple
 Duff's device edged it out. If you'd like to try it out on your data set:

 [snip]

I don't have your data set, but for me using random data this was within a
factor 2 of your #3, without any fiddly code.

int test(string data) {
    int count;
    while (true) {
        data = data.find("TAG", "TGA", "TAA")[0];
        if (data.empty)
            return count;
        count += 1;
        data.popFront;
    }
}

Also, this one is far easier to generalize for strings of different lengths,
and such :-)
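The search-with-a-library-call style can be sketched in C with memchr and
memcmp standing in for std.algorithm.find (illustrative only; the function
name is invented). Like the popFront loop, it advances one character past
each match so overlapping matches are still counted:

```c
#include <stddef.h>
#include <string.h>

/* Jump to each 'T' with memchr (the "find" step), then check the
   three candidate codons with memcmp. */
static int count_codons_find(const char *data, size_t len)
{
    int count = 0;
    const char *p = data, *end = data + len;
    while ((p = memchr(p, 'T', (size_t)(end - p))) != NULL) {
        if (end - p >= 3 &&
            (memcmp(p, "TAG", 3) == 0 ||
             memcmp(p, "TGA", 3) == 0 ||
             memcmp(p, "TAA", 3) == 0))
            count++;
        p++;   /* resume just past this 'T', like popFront */
    }
    return count;
}
```

The per-candidate memcmp calls are exactly the kind of small-string library
compares whose overhead this thread is measuring.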
Nov 28 2010
parent reply bearophile <bearophileHUGS lycos.com> writes:
Pelle Månsson:

 I don't have your data set,

In my first post I have explained how to create the data set, using a little Python script that I have listed there. You just need a Python2 interpreter.
 but for me using random data this was within a factor 2 of your #3, without
any fiddly code.

Thank you for your code. I have added your version, plus a modified version
that works on ubytes, because it's faster, #9 and #10:

-------------------------
// D version #9
import std.file: read;
import std.c.stdio: printf;
import std.algorithm: find;
import std.array: empty, popFront;

int test(char[] data) {
    int count;
    while (true) {
        data = data.find("TAG", "TGA", "TAA")[0];
        if (data.empty)
            return count;
        count++;
        data.popFront();
    }
}

void main() {
    char[] data0 = cast(char[])read("data.txt");
    int n = 300;
    char[] data = new char[data0.length * n];
    for (size_t pos; pos < data.length; pos += data0.length)
        data[pos .. pos + data0.length] = data0;
    printf("%d\n", test(data));
}

-------------------------
// D version #10
import std.file: read;
import std.c.stdio: printf;
import std.algorithm: find;
import std.array: empty, popFront;

int test(ubyte[] data) {
    int count;
    while (true) {
        data = data.find("TAG", "TGA", "TAA")[0];
        if (data.empty)
            return count;
        count++;
        data.popFront();
    }
}

void main() {
    ubyte[] data0 = cast(ubyte[])read("data.txt");
    int n = 300;
    ubyte[] data = new ubyte[data0.length * n];
    for (size_t pos; pos < data.length; pos += data0.length)
        data[pos .. pos + data0.length] = data0;
    printf("%d\n", test(data));
}
-------------------------

Timings, dmd compiler, best of 4, seconds:
  D #1:  5.72
  D #9:  5.04
  D #10: 3.31
  D #4:  1.84
  D #5:  1.73
  Psy:   1.59
  D #8:  1.51
  D #7:  0.56
  D #2:  0.55
  D #6:  0.47
  D #3:  0.34

So on this PC it's not much faster. And generally I don't like using find,
empty and popFront to solve this problem; it's quite unnatural. Here I have
explained what I think is a good enough solution to this performance problem:
http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.D&article_id=123044

Bye,
bearophile
Nov 28 2010
parent Pelle Månsson <pelle.mansson gmail.com> writes:
On 11/28/2010 04:48 PM, bearophile wrote:
 Pelle Månsson:

 I don't have your data set,

In my first post I have explained how to create the data set, using a little Python script that I have listed there. You just need a Python2 interpreter.
 but for me using random data this was within a factor 2 of your #3, without
any fiddly code.

 [snip]

Measuring incorrectly in a performance test is a silly mistake; I was indeed
wrong about the factor-two thing! I thought find was specialized for chars,
so it wouldn't require casting to ubyte. Maybe I was wrong again. Thank you
for your engaging performance-oriented posts. :-)
Nov 28 2010
prev sibling parent bearophile <bearophileHUGS lycos.com> writes:
Robert Jacques:

 Still, part of the point was that string comparisons in general were way  
 too slow. Anyways, I've applied the same technique in a partially unrolled  
 version if you want to check it out:

The version #11 (I have not reformatted your function):

// D version #11
import std.file: read;
import std.c.stdio: printf;

// not formatted
bool arrayComp(T, size_t N)(const T[] a, ref const T[N] b) pure nothrow {
    static if (T.sizeof * N <= uint.sizeof) {
        return a.length == N &&
            !((*(cast(uint*)a.ptr) ^ *(cast(uint*)b.ptr)) &
              (uint.max >> 8 * (uint.sizeof - T.sizeof * N)));
    } else static if (T.sizeof * N <= ulong.sizeof) {
        return a.length == N &&
            !((*(cast(ulong*)a.ptr) ^ *(cast(ulong*)b.ptr)) &
              (ulong.max >> 8 * (ulong.sizeof - T.sizeof * N)));
    } else { // Fall back to a loop
        if (a.length != N ||
            (*(cast(ulong*)a.ptr) != *(cast(ulong*)b.ptr)))
            return false;
        enum alignment = T.sizeof * N % ulong.sizeof > 0 ?
                         T.sizeof * N % ulong.sizeof : ulong.sizeof;
        auto pa = cast(ulong*)(cast(ubyte*)a.ptr + alignment);
        auto pb = cast(ulong*)(cast(ubyte*)b.ptr + alignment);
        auto pa_end = cast(ulong*)(cast(ubyte*)a.ptr + T.sizeof * N);
        while (pa < pa_end) {
            if (*pa++ != *pb++)
                return false;
        }
        return true;
    }
}

pure int test(const char[] data) {
    int count;
    foreach (i; 0 .. data.length - 3) {
        const char[] codon = data[i .. i + 3];
        if (arrayComp!(char, 3)(codon, "TAG") ||
            arrayComp!(char, 3)(codon, "TGA") ||
            arrayComp!(char, 3)(codon, "TAA"))
            count++;
    }
    return count;
}

void main() {
    char[] data0 = cast(char[])read("data.txt");
    int n = 300;
    char[] data = new char[data0.length * n];
    for (size_t pos; pos < data.length; pos += data0.length)
        data[pos .. pos + data0.length] = data0;
    printf("%d\n", test(data));
}

Timings, dmd compiler, best of 4, seconds:
  D #1:  5.72
  D #9:  5.04
  D #10: 3.31
  D #4:  1.84
  D #5:  1.73
  Psy:   1.59
  D #8:  1.51
  D #7:  0.56
  D #2:  0.55
  D #6:  0.47
  D #11: 0.47
  D #3:  0.34

Only the #3 (tree-based) version is faster. This problem probably needs to be
solved by the compiler.

Bye,
bearophile
Nov 28 2010
prev sibling next sibling parent spir <denis.spir gmail.com> writes:
On Sun, 28 Nov 2010 10:48:46 -0500
bearophile <bearophileHUGS lycos.com> wrote:

 generally I don't like to use find, empty and popFront to solve this
 problem, it's quite unnatural. Here I have explained what I think is a
 good enough solution to this performance problem:
 http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.D&article_id=123044

What I would like is a just decent routine for comparing arrays of elements
which _conceptually_ are values, whatever their complexity (meaning, possibly
structs or nested arrays). This should always (*) use a default '=='
comparison, which in turn may perform a kind of memcomp.
I have no idea how to avoid bad perf such as shown by your tests, but this is
really necessary. Tons of routines and algorithms use '==', and among the
most used; str.find will perform an arbitrary number of '==' checks; and I
don't even want to think at pattern matching ;-)

Denis

(*) I think that:
* If an element type is a value type, it should not override opEquals. '=='
should translate to equality of every sub-element. Else, there is something
wrong in the design. But there may be use cases I overlook. (Eg metadata
stored on the element itself?)
* A plain value may well refer to a "thing"/object, eg a widget's outlook
pointing to a color palette, in which case reference comparison is the right
choice -- so, no complication.
* Real things (referenced objects, entities with self/id) should not have
opEquals at all. They should be compared only by id, meaning using 'is'. This
shows again the ambivalence of dyn arrays...

-- -- -- -- -- -- --
vit esse estrany ☣

spir.wikidot.com
Nov 28 2010
prev sibling parent "Robert Jacques" <sandford jhu.edu> writes:
On Sun, 28 Nov 2010 06:44:23 -0500, bearophile <bearophileHUGS lycos.com>  
wrote:
 Robert Jacques:

 I've spent some time having fun this afternoon optimizing array-equals
 using vectorization techniques. I found that vectorizing using ulongs
 worked best on my system except with the shortest strings, where a  
 simple
 Duff's device edged it out. If you'd like to try it out on your data  
 set:

Thank you for your work :-) A version with your function, D version #8:

[snip]
 Your function can't be inlined because it's big, so this code isn't  
 faster than inlined code like this generated by the compiler:
 (codon.length == 3 && codon[0] == 'T' && codon[1] == 'A' && codon[2] ==  
 'G')

 Bye,
 bearophile

Still, part of the point was that string comparisons in general were way too
slow. Anyways, I've applied the same technique in a partially unrolled
version if you want to check it out:

bool arrayComp(T, size_t N)(const T[] a, ref const T[N] b) pure nothrow {
    static if (T.sizeof * N <= uint.sizeof) {
        return a.length == N &&
            !((*(cast(uint*)a.ptr) ^ *(cast(uint*)b.ptr)) &
              (uint.max >> 8 * (uint.sizeof - T.sizeof * N)));
    } else static if (T.sizeof * N <= ulong.sizeof) {
        return a.length == N &&
            !((*(cast(ulong*)a.ptr) ^ *(cast(ulong*)b.ptr)) &
              (ulong.max >> 8 * (ulong.sizeof - T.sizeof * N)));
    } else { // Fall back to a loop
        if (a.length != N ||
            (*(cast(ulong*)a.ptr) != *(cast(ulong*)b.ptr)))
            return false;
        enum alignment = T.sizeof * N % ulong.sizeof > 0 ?
                         T.sizeof * N % ulong.sizeof : ulong.sizeof;
        auto pa = cast(ulong*)(cast(ubyte*)a.ptr + alignment);
        auto pb = cast(ulong*)(cast(ubyte*)b.ptr + alignment);
        auto pa_end = cast(ulong*)(cast(ubyte*)a.ptr + T.sizeof * N);
        while (pa < pa_end) {
            if (*pa++ != *pb++)
                return false;
        }
        return true;
    }
}

Be warned that you'll have to use strings explicitly typed as immutable
char[N], since neither array literals nor string literals will match this
template.
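The XOR-and-mask trick for lengths up to 8 bytes can be illustrated in C (a
sketch with invented names, not a literal port: it copies exactly n bytes
into zeroed words, so it never reads past the buffers the way the
pointer-cast version can). Like the D original, the mask assumes little-endian
byte order, where the first n bytes land in the low bits of the word:

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Fixed-size equality for n <= 8: load both sides into 64-bit words,
   XOR, and mask away the bytes beyond n.  Because only n bytes were
   copied into zero-initialized words, the mask is redundant here; it
   is kept to show the shape of the original trick. */
static int small_equal(const char *a, const char *b, size_t n)
{
    uint64_t wa = 0, wb = 0;
    memcpy(&wa, a, n);   /* n <= 8 assumed */
    memcpy(&wb, b, n);
    uint64_t mask = (n >= 8) ? UINT64_MAX
                             : (((uint64_t)1 << (8 * n)) - 1);
    return ((wa ^ wb) & mask) == 0;
}
```

The appeal is that a 3-char codon compare becomes one load, one XOR, and one
AND, with no loop and no call.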
Nov 28 2010