digitalmars.D - String compare performance
- bearophile (630/630) Nov 27 2010 While translating a common Python script to D, I have found something in...
- Ellery Newcomer (1/1) Nov 27 2010 how well does dmd perform when you use a switch statement?
- bearophile (31/32) Nov 27 2010 4th D version:
- spir (26/138) Nov 27 2010 4.5.
- bearophile (41/44) Nov 27 2010 Here it is:
- Jonathan M Davis (6/11) Nov 27 2010 You might be able to do that for primitive types and for structs with no...
- Kagamin (2/5) Nov 27 2010 You can use memcmp, though only for utf-8 strings.
- Stewart Gordon (4/11) Nov 28 2010 Only for utf-8 strings? Why's that? I would've thought memcmp to be
- bearophobic (4/19) Nov 28 2010 D community is amazing cult of premature optimization fans. Any one of y...
- Michel Fortin (11/37) Nov 28 2010 Comparing unicode UTF-* strings using memcmp is fine as long as what
- Bruno Medeiros (5/37) Dec 07 2010 Why are people still replying to nameless trolls? There has been several...
- =?ISO-8859-1?Q?Br=fcno_Mediocre?= (3/43) Dec 08 2010 Trololol. Maybe they're a bit dumb, my brother. If they some day become ...
- Bruno Medeiros (4/10) Dec 09 2010 Lol, "Brüno Mediocre", well thought, that's actually funny.
- Daniel Gibson (2/15) Dec 09 2010 Nah, I think it's a rather mediocre joke.
- Robert Jacques (5/18) Nov 28 2010 memcmp is type agnostic if all you want to compare is equality. The othe...
- Kagamin (7/7) Nov 27 2010 You need only match 3-char strings?
- Kagamin (7/7) Nov 27 2010 or
- bearophile (4/6) Nov 27 2010 I agree there are are ways to speed up the code, probably the D #3 code ...
- =?UTF-8?B?IkrDqXLDtG1lIE0uIEJlcmdlciI=?= (7/15) Nov 28 2010 This doesn't work on some architectures if a.ptr or b.ptr is odd...
- bearophile (34/34) Nov 27 2010 I have done another test:
- Robert Jacques (46/90) Nov 27 2010 Hi bearophile,
- bearophile (82/86) Nov 28 2010 Thank you for your work :-)
- =?ISO-8859-1?Q?Pelle_M=E5nsson?= (14/100) Nov 28 2010 I don't have your data set, but for me using random data this was within...
- bearophile (68/70) Nov 28 2010 Thank you for your code. I have added your version, plus a modified vers...
- spir (28/30) Nov 28 2010 lem, it's quite unnatural. Here I have explained what I think is a good ...
- =?ISO-8859-1?Q?Pelle_M=E5nsson?= (6/76) Nov 28 2010 Measuring incorrectly in a performance test is a silly mistake, I was
- Robert Jacques (30/45) Nov 28 2010 [snip]
- bearophile (63/66) Nov 28 2010 The version #11 (I have not reformatted your function):
- Austin Hastings (25/28) Nov 27 2010 It's not clear to me if the point of your post is "how do I make this go...
- Jonathan M Davis (6/23) Nov 27 2010 What I think is pretty clear from this is that at some point, some work ...
- Iain Buclaw (6/29) Nov 28 2010 Just food for thought. GDC uses memcmp when using string comparisons in ...
- bearophile (5/10) Nov 28 2010 The asm of the first version of the function compiled with gdc uses "rep...
- Jonathan M Davis (2/26) Nov 27 2010 Enhancement Request: http://d.puremagic.com/issues/show_bug.cgi?id=5282
- bearophile (8/18) Nov 28 2010 This case is very simple, so a simple tree (as in the third D version of...
- spir (10/17) Nov 28 2010 's
- bearophile (7/8) Nov 28 2010 The solution I suggest for this problem is: when DMD knows at compile-ti...
While translating a common Python script to D, I have found something interesting, so I have reduced it to a little benchmark. Below there is the reduced Python2 code (it uses Psyco), and a little program to generate some test data. The timing of the first D2 version is not good compared to the Python-Psyco program (the generation of the *300 array is a quick thing), so I have created two more D2 versions to show myself that D2 wasn't broken :-) The reduced code looks like a syntetic benchmark, but it has about the same performance profile of a 60 lines long Python script (the original code was using xrange(0,len(...)-3,3) instead of xrange(len(...)-3), but the situation doesn't change much). Timings, dmd compiler, best of 4, seconds: Psy: 1.59 2.3 GHz CPU. dmd 2.050, ldc based on DMD v1.057 and llvm 2.6, gdc head based on GCC 4.4.5. -------------------------- import psyco def test(data): count = 0 for i in xrange(len(data) - 3): codon = data[i : i + 3] if codon == "TAG" or codon == "TGA" or codon == "TAA": count += 1 return count def main(): data = open("data.txt").read() data = data * 300 print test(data) psyco.full() main() -------------------------- To produce some test data: from itertools import islice def rnd(ia = 3877, ic = 29573, im = 139968): seed = 42 imf = float(im) while 1: seed = (seed * ia + ic) % im r = seed / imf yield "ACGT"[int(r * 4)] def main(): data = "".join(islice(rnd(), 0, 120000)) file("data.txt", "w").write(data) main() -------------------------- A D2 traslation (I have used printf to reduce a lot the asm produced): import std.file: read; import std.c.stdio: printf; int test(char[] data) { int count; foreach (i; 0 .. data.length - 3) { char[] codon = data[i .. i + 3]; if (codon == "TAG" || codon == "TGA" || codon == "TAA") count++; } return count; } void main() { char[] data0 = cast(char[])read("data.txt"); int n = 300; char[] data = new char[data0.length * n]; for (size_t pos; pos < data.length; pos += data0.length) data[pos .. pos+data0.length] = data0; printf("%d\n", test(data)); } -------------------------- import std.file: read; import std.c.stdio: printf; bool cmp3(char[] s1, string s2) { return s1[0] == s2[0] && s1[1] == s2[1] && s1[2] == s2[2]; } int test(char[] data) { int count; foreach (i; 0 .. data.length - 3) { char[] codon = data[i .. i + 3]; if (cmp3(codon, "TAG") || cmp3(codon, "TGA") || cmp3(codon, "TAA")) count++; } return count; } void main() { char[] data0 = cast(char[])read("data.txt"); int n = 300; char[] data = new char[data0.length * n]; for (size_t pos; pos < data.length; pos += data0.length) data[pos .. pos+data0.length] = data0; printf("%d\n", test(data)); } -------------------------- import std.file: read; import std.c.stdio: printf; int test(char[] data) { int count; foreach (i; 0 .. data.length - 3) { if (data[i] == 'T') { if (data[i+1] == 'A') { if (data[i+2] == 'G') { count++; } else if (data[i+2] == 'A') { count++; } } else if (data[i+1] == 'G') { if (data[i+2] == 'A') count++; } } } return count; } void main() { char[] data0 = cast(char[])read("data.txt"); int n = 300; char[] data = new char[data0.length * n]; for (size_t pos; pos < data.length; pos += data0.length) data[pos .. pos+data0.length] = data0; printf("%d\n", test(data)); } -------------------------- dmd -O -release -inline test1.d _D5test14testFAaZi comdat L0: sub ESP,024h push EBX xor EBX,EBX push EBP mov EBP,030h[ESP] push ESI xor ESI,ESI add EBP,0FFFFFFFDh push EDI je LA8 mov EDX,03Ch[ESP] mov EAX,038h[ESP] mov EDI,EDX L22: mov ECX,offset FLAT:_D11TypeInfo_Aa6__initZ lea EAX,3[EBX] sub EAX,EBX push ECX lea EDX,[EDI][EBX] push dword ptr FLAT:_DATA[0Ch] push dword ptr FLAT:_DATA[08h] mov 020h[ESP],EAX mov 024h[ESP],EDX push EDX push EAX call near ptr __adEq2 add ESP,014h test EAX,EAX jne L9E mov ECX,offset FLAT:_D11TypeInfo_Aa6__initZ push ECX push dword ptr FLAT:_DATA[01Ch] push dword ptr FLAT:_DATA[018h] push dword ptr 024h[ESP] push dword ptr 024h[ESP] call near ptr __adEq2 add ESP,014h test EAX,EAX jne L9E mov EDX,offset FLAT:_D11TypeInfo_Aa6__initZ push EDX push dword ptr FLAT:_DATA[02Ch] push dword ptr FLAT:_DATA[028h] push dword ptr 024h[ESP] push dword ptr 024h[ESP] call near ptr __adEq2 add ESP,014h test EAX,EAX je L9F L9E: inc ESI L9F: inc EBX cmp EBX,EBP jb L22 LA8: pop EDI mov EAX,ESI pop ESI pop EBP pop EBX add ESP,024h ret 8 -------------------------- dmd -O -release -inline test2.d _D5test24testFAaZi comdat sub ESP,03Ch mov ECX,040h[ESP] push EBX push EBP push ESI xor ESI,ESI push EDI xor EDI,EDI add ECX,0FFFFFFFDh mov 01Ch[ESP],ECX je LC5 mov EDX,054h[ESP] mov EAX,050h[ESP] mov EBP,EDX L26: lea EBX,3[ESI] sub EBX,ESI lea ECX,[EBP][ESI] mov 028h[ESP],ECX mov DL,[ECX] mov ECX,FLAT:_DATA[0Ch] mov 024h[ESP],EBX mov EAX,FLAT:_DATA[08h] cmp DL,[ECX] mov 010h[ESP],ECX jne L65 mov EDX,028h[ESP] mov EAX,EBX mov EBX,010h[ESP] mov CL,1[EDX] cmp CL,1[EBX] jne L65 mov DL,2[EDX] cmp DL,2[EBX] je LB9 L65: mov EDX,028h[ESP] mov CL,[EDX] mov 018h[ESP],EDX mov EAX,024h[ESP] mov EDX,FLAT:_DATA[01Ch] cmp CL,[EDX] mov EAX,FLAT:_DATA[018h] jne L96 mov EBX,018h[ESP] mov AL,1[EBX] cmp AL,1[EDX] jne L96 mov AL,2[EBX] cmp AL,2[EDX] je LB9 L96: mov EDX,FLAT:_DATA[02Ch] mov EAX,FLAT:_DATA[028h] cmp CL,[EDX] jne LBA mov ECX,018h[ESP] mov BL,1[ECX] cmp BL,1[EDX] jne LBA mov CL,2[ECX] cmp CL,2[EDX] jne LBA LB9: inc EDI LBA: inc ESI cmp ESI,01Ch[ESP] jb L26 LC5: mov EAX,EDI pop EDI pop ESI pop EBP pop EBX add ESP,03Ch ret 8 -------------------------- dmd -O -release -inline test3.d _D5test34testFAaZi comdat push EAX xor EDX,EDX push EBX xor EBX,EBX push ESI mov ESI,010h[ESP] add ESI,0FFFFFFFDh je L4A mov 8[ESP],EDX mov EDX,014h[ESP] mov ECX,EDX mov EAX,010h[ESP] mov EDX,8[ESP] L22: cmp [ECX][EDX],054h jne L45 mov AL,1[ECX][EDX] cmp AL,041h jne L39 cmp byte ptr 2[ECX][EDX],047h je L44 jmp short L3D L39: cmp AL,047h jne L45 L3D: cmp byte ptr 2[ECX][EDX],041h jne L45 L44: inc EBX L45: inc EDX cmp EDX,ESI jb L22 L4A: pop ESI mov EAX,EBX pop EBX pop ECX ret 8 -------------------------- gdc -O3 -frelease -inline -S test1.d -o test1.s _D5test14testFAaZi: pushl %ebp movl %esp, %ebp pushl %edi pushl %esi pushl %ebx subl $12, %esp movl 8(%ebp), %eax movl 12(%ebp), %edx movl $0, -24(%ebp) subl $3, %eax movl %edx, -20(%ebp) movl %eax, -16(%ebp) je .L5 xorl %edx, %edx jmp .L8 .p2align 4,,7 .p2align 3 .L6: addl $1, -24(%ebp) addl $1, %edx cmpl %edx, -16(%ebp) jbe .L5 .L8: movl -20(%ebp), %eax movl $.LC0, %edi movl $3, %ecx addl %edx, %eax movl %eax, %esi repz cmpsb je .L6 movl %eax, %esi movl $.LC1, %edi movl $3, %ecx repz cmpsb je .L6 movl $.LC2, %edi movl %eax, %esi movl $3, %ecx repz cmpsb je .L6 addl $1, %edx cmpl %edx, -16(%ebp) ja .L8 .L5: movl -24(%ebp), %eax addl $12, %esp popl %ebx popl %esi popl %edi popl %ebp ret -------------------------------------- gdc -O3 -frelease -inline -S test2.d -o test2.s _D5test24testFAaZi: pushl %ebp xorl %edx, %edx movl %esp, %ebp xorl %eax, %eax pushl %edi pushl %esi pushl %ebx subl $16, %esp movl 8(%ebp), %ebx movl 12(%ebp), %esi movl %ebx, %edi subl $3, %edi jne .L18 jmp .L9 .p2align 4,,7 .p2align 3 .L10: movl -16(%ebp), %ecx movzbl .LC1, %ebx cmpb (%ecx), %bl je .L15 .L12: movzbl .LC2, %ebx cmpb (%ecx), %bl je .L21 .L13: addl $1, %edx cmpl %edx, %edi jbe .L9 .p2align 4,,7 .p2align 3 .L18: leal (%esi,%edx), %ecx movzbl .LC0, %ebx movl $3, -20(%ebp) movl %ecx, -16(%ebp) cmpb (%ecx), %bl jne .L10 movzbl .LC0+1, %ebx cmpb 1(%ecx), %bl jne .L10 movzbl .LC0+2, %ebx cmpb 2(%ecx), %bl jne .L10 .L11: addl $1, %edx addl $1, %eax cmpl %edx, %edi ja .L18 .L9: addl $16, %esp popl %ebx popl %esi popl %edi popl %ebp ret .p2align 4,,7 .p2align 3 .L15: movzbl .LC1+1, %ebx cmpb 1(%ecx), %bl jne .L12 movzbl .LC1+2, %ebx cmpb 2(%ecx), %bl je .L11 movzbl .LC2, %ebx cmpb (%ecx), %bl jne .L13 .p2align 4,,7 .p2align 3 .L21: movzbl .LC2+1, %ebx cmpb 1(%ecx), %bl jne .L13 movzbl .LC2+2, %ebx cmpb 2(%ecx), %bl jne .L13 jmp .L11 -------------------------------------- gdc -O3 -frelease -inline -S test3.d -o test3.s _D5test34testFAaZi: pushl %ebp xorl %eax, %eax movl %esp, %ebp pushl %esi movl 8(%ebp), %esi pushl %ebx movl 12(%ebp), %ebx subl $3, %esi je .L6 movl $1, %edx jmp .L7 .p2align 4,,7 .p2align 3 .L3: cmpl %edx, %esi leal 1(%edx), %ecx jbe .L6 .L11: movl %ecx, %edx .L7: cmpb $84, -1(%ebx,%edx) jne .L3 movzbl (%ebx,%edx), %ecx cmpb $65, %cl je .L10 cmpb $71, %cl jne .L3 xorl %ecx, %ecx cmpb $65, 1(%ebx,%edx) sete %cl addl %ecx, %eax cmpl %edx, %esi leal 1(%edx), %ecx ja .L11 .p2align 4,,7 .p2align 3 .L6: popl %ebx popl %esi popl %ebp ret --------------------------- ldc -O3 -release -inline -output-s test1.d _D5test14testFAaZi: pushl %ebp pushl %ebx pushl %edi pushl %esi subl $20, %esp movl 40(%esp), %esi cmpl $3, %esi je .LBB1_8 addl $4294967293, %esi xorl %ebx, %ebx movl %ebx, %edi .align 16 .LBB1_2: movl 44(%esp), %eax leal (%eax,%ebx), %ebp movl %ebp, 4(%esp) movl $_D11TypeInfo_Aa6__initZ, 16(%esp) movl $.str, 12(%esp) movl $3, 8(%esp) movl $3, (%esp) call _adEq testl %eax, %eax jne .LBB1_5 movl %ebp, 4(%esp) movl $_D11TypeInfo_Aa6__initZ, 16(%esp) movl $.str1, 12(%esp) movl $3, 8(%esp) movl $3, (%esp) call _adEq testl %eax, %eax jne .LBB1_5 movl %ebp, 4(%esp) movl $_D11TypeInfo_Aa6__initZ, 16(%esp) movl $.str2, 12(%esp) movl $3, 8(%esp) movl $3, (%esp) call _adEq testl %eax, %eax je .LBB1_6 .LBB1_5: incl %edi .LBB1_6: incl %ebx cmpl %esi, %ebx jb .LBB1_2 .LBB1_7: movl %edi, %eax addl $20, %esp popl %esi popl %edi popl %ebx popl %ebp ret $8 .LBB1_8: xorl %edi, %edi jmp .LBB1_7 --------------------------- ldc -O3 -release -inline -output-s test2.d _D5test24testFAaZi: pushl %ebx pushl %esi movl 12(%esp), %ecx cmpl $3, %ecx je .LBB2_14 movl 16(%esp), %edx addl $4294967293, %ecx xorl %eax, %eax movl %eax, %esi .align 16 .LBB2_2: movb (%edx,%esi), %bl cmpb $84, %bl jne .LBB2_12 movb 1(%edx,%esi), %bh cmpb $65, %bh jne .LBB2_5 cmpb $71, 2(%esi,%edx) je .LBB2_11 .LBB2_5: cmpb $84, %bl jne .LBB2_12 cmpb $71, %bh jne .LBB2_8 cmpb $65, 2(%esi,%edx) je .LBB2_11 .LBB2_8: cmpb $65, %bh setne %bh cmpb $84, %bl jne .LBB2_12 testb %bh, %bh jne .LBB2_12 cmpb $65, 2(%esi,%edx) jne .LBB2_12 .LBB2_11: incl %eax .LBB2_12: incl %esi cmpl %ecx, %esi jb .LBB2_2 .LBB2_13: popl %esi popl %ebx ret $8 .LBB2_14: xorl %eax, %eax jmp .LBB2_13 --------------------------- ldc -O3 -release -inline -output-s test3.d _D5test34testFAaZi: pushl %ebx pushl %edi pushl %esi movl 16(%esp), %ecx cmpl $3, %ecx je .LBB1_12 movl 20(%esp), %edx addl $4294967293, %ecx xorl %eax, %eax movl %eax, %esi .align 16 .LBB1_2: cmpb $84, (%edx,%esi) jne .LBB1_10 movb 1(%edx,%esi), %bl cmpb $71, %bl je .LBB1_8 cmpb $65, %bl jne .LBB1_10 movb 2(%esi,%edx), %bl cmpb $71, %bl jne .LBB1_7 incl %eax jmp .LBB1_10 .LBB1_7: cmpb $65, %bl jmp .LBB1_9 .LBB1_8: cmpb $65, 2(%esi,%edx) .LBB1_9: sete %bl movzbl %bl, %edi addl %edi, %eax .LBB1_10: incl %esi cmpl %ecx, %esi jb .LBB1_2 .LBB1_11: popl %esi popl %edi popl %ebx ret $8 .LBB1_12: xorl %eax, %eax jmp .LBB1_11 --------------------------- Bye, bearophile
Nov 27 2010
how well does dmd perform when you use a switch statement?
Nov 27 2010
Ellery Newcomer:how well does dmd perform when you use a switch statement?4th D version: import std.file: read; import std.c.stdio: printf; int test(char[] data) { int count; foreach (i; 0 .. data.length - 3) { switch (data[i .. i + 3]) { case "TAG", "TGA", "TAA": count++; default: } } return count; } void main() { char[] data0 = cast(char[])read("data.txt"); int n = 300; char[] data = new char[data0.length * n]; for (size_t pos; pos < data.length; pos += data0.length) data[pos .. pos+data0.length] = data0; printf("%d\n", test(data)); } Timings, dmd compiler, best of 4, seconds: Psy: 1.59 Bye, bearophile
Nov 27 2010
On Sat, 27 Nov 2010 11:53:06 -0500 bearophile <bearophileHUGS lycos.com> wrote:Timings, dmd compiler, best of 4, seconds: Psy: 1.59 =20 =20 2.3 GHz CPU. dmd 2.050, ldc based on DMD v1.057 and llvm 2.6, gdc head based on GCC 4.=4.5.=20 -------------------------- =20 import psyco =20 def test(data): count =3D 0 for i in xrange(len(data) - 3): codon =3D data[i : i + 3] if codon =3D=3D "TAG" or codon =3D=3D "TGA" or codon =3D=3D "TAA": count +=3D 1 return count =20 def main(): data =3D open("data.txt").read() data =3D data * 300 print test(data) =20 psyco.full() main() =20 -------------------------- =20 To produce some test data: =20 =20 from itertools import islice =20 def rnd(ia =3D 3877, ic =3D 29573, im =3D 139968): seed =3D 42 imf =3D float(im) while 1: seed =3D (seed * ia + ic) % im r =3D seed / imf yield "ACGT"[int(r * 4)] =20 =20 def main(): data =3D "".join(islice(rnd(), 0, 120000)) file("data.txt", "w").write(data) =20 main() =20 -------------------------- =20 A D2 traslation (I have used printf to reduce a lot the asm produced): =20 =20 import std.file: read; import std.c.stdio: printf; =20 int test(char[] data) { int count; foreach (i; 0 .. data.length - 3) { char[] codon =3D data[i .. i + 3]; if (codon =3D=3D "TAG" || codon =3D=3D "TGA" || codon =3D=3D "TAA=")count++; } return count; } =20 void main() { char[] data0 =3D cast(char[])read("data.txt"); int n =3D 300; char[] data =3D new char[data0.length * n]; for (size_t pos; pos < data.length; pos +=3D data0.length) data[pos .. pos+data0.length] =3D data0; =20 printf("%d\n", test(data)); } =20 -------------------------- =20 import std.file: read; import std.c.stdio: printf; =20 bool cmp3(char[] s1, string s2) { return s1[0] =3D=3D s2[0] && s1[1] =3D=3D s2[1] && s1[2] =3D=3D s2[2]; } =20 int test(char[] data) { int count; foreach (i; 0 .. data.length - 3) { char[] codon =3D data[i .. i + 3]; if (cmp3(codon, "TAG") || cmp3(codon, "TGA") || cmp3(codon, "TAA"=))count++; } return count; } =20 void main() { char[] data0 =3D cast(char[])read("data.txt"); int n =3D 300; char[] data =3D new char[data0.length * n]; for (size_t pos; pos < data.length; pos +=3D data0.length) data[pos .. pos+data0.length] =3D data0; =20 printf("%d\n", test(data)); }Impressive! What's your diagnostic on why basic string compare parforms the way it does? And have you tried a naive hand-written strComp (just for the sake of compl= eteness) like: bool strComp (string s1, string s2) { auto l =3D s1.length; if (l !=3D s2.length) return false; foreach (uint i ; 0..l) if (s1[i] !=3D s2[i]) return false; return true; } (I bet this will perform at about the same order of magnitude as your versi= Also, is there a way to bit-compare given memory areas at much higher speed= than element per element (I mean for arrays in general)? Denis -- -- -- -- -- -- -- vit esse estrany =E2=98=A3 spir.wikidot.com
Nov 27 2010
denis.spirWhat's your diagnostic on why basic string compare parforms the way it does?I prefer the numbers and asm listings speak for themselves, because my diagnostic sometimes is wrong. But I guess it's caused by three not inlined calls to __adEq2 each loop, done to compare 3 char long arrays.And have you tried a naive hand-written strComp (just for the sake of completeness) like:Here it is: Timings, dmd compiler, best of 4, seconds: Psy: 1.59 import std.file: read; import std.c.stdio: printf; bool cmp2(char[] s1, string s2) { if (s1.length != s2.length) return false; foreach (uint i ; 0 .. s1.length) if (s1[i] != s2[i]) return false; return true; } int test(char[] data) { int count; foreach (i; 0 .. data.length - 3) { char[] codon = data[i .. i + 3]; if (cmp2(codon, "TAG") || cmp2(codon, "TGA") || cmp2(codon, "TAA")) count++; } return count; } void main() { char[] data0 = cast(char[])read("data.txt"); int n = 300; char[] data = new char[data0.length * n]; for (size_t pos; pos < data.length; pos += data0.length) data[pos .. pos+data0.length] = data0; printf("%d\n", test(data)); }Also, is there a way to bit-compare given memory areas at much higher speed than element per element (I mean for arrays in general)?I don't know. I think you can't. Bye, bearophile
Nov 27 2010
On Saturday 27 November 2010 14:46:00 bearophile wrote:denis.spirYou might be able to do that for primitive types and for structs with no opEquals() and with no member variables with an opEquals(), but you certainly can't do it in the general case, since in the general case, you'd have to be calling opEquals() on each element individually. - Jonathan M DavisAlso, is there a way to bit-compare given memory areas at much higher speed than element per element (I mean for arrays in general)?I don't know. I think you can't.
Nov 27 2010
bearophile Wrote:You can use memcmp, though only for utf-8 strings.Also, is there a way to bit-compare given memory areas at much higher speed than element per element (I mean for arrays in general)?I don't know. I think you can't.
Nov 27 2010
On 27/11/2010 23:04, Kagamin wrote:bearophile Wrote:Only for utf-8 strings? Why's that? I would've thought memcmp to be type agnostic. Stewart.You can use memcmp, though only for utf-8 strings.Also, is there a way to bit-compare given memory areas at much higher speed than element per element (I mean for arrays in general)?I don't know. I think you can't.
Nov 28 2010
Stewart Gordon Wrote:On 27/11/2010 23:04, Kagamin wrote:D community is amazing cult of premature optimization fans. Any one of you heard of canonically equivalent sequences? The integrated Unicode support is a clusterfuck. Please do compare ASCII strings with memcmp, but no Unicode. Where did the original poster pull this problem from, his ass? "My system runs 100,000,000,000 instructions per second, but this comparison of 4 letter strings uses 5 cycles too much! This is the only problem on the way to world domination with my $500 Microsoft Word clone!". No wait, the problems are completely imaginatory. Bye, bearophobicbearophile Wrote:Only for utf-8 strings? Why's that? I would've thought memcmp to be type agnostic. Stewart.You can use memcmp, though only for utf-8 strings.Also, is there a way to bit-compare given memory areas at much higher speed than element per element (I mean for arrays in general)?I don't know. I think you can't.
Nov 28 2010
On 2010-11-28 20:57:38 -0500, bearophobic <notbear cave.net> said:Stewart Gordon Wrote:Comparing unicode UTF-* strings using memcmp is fine as long as what you want to know is whether the code points are the same. If your point was that per-code-point comparisons aren't the right way to compare Unicode strings (in most situations), then I support this view too. Though if that's what you wanted to say, you could have made your point clearer. -- Michel Fortin michel.fortin michelf.com http://michelf.com/On 27/11/2010 23:04, Kagamin wrote:D community is amazing cult of premature optimization fans. Any one of you heard of canonically equivalent sequences? The integrated Unicode support is a clusterfuck. Please do compare ASCII strings with memcmp, but no Unicode. Where did the original poster pull this problem from, his ass? "My system runs 100,000,000,000 instructions per second, but this comparison of 4 letter strings uses 5 cycles too much! This is the only problem on the way to world domination with my $500 Microsoft Word clone!". No wait, the problems are completely imaginatory.bearophile Wrote:Only for utf-8 strings? Why's that? I would've thought memcmp to be type agnostic. Stewart.You can use memcmp, though only for utf-8 strings.Also, is there a way to bit-compare given memory areas at much higher speed than element per element (I mean for arrays in general)?I don't know. I think you can't.
Nov 28 2010
On 29/11/2010 02:11, Michel Fortin wrote:On 2010-11-28 20:57:38 -0500, bearophobic <notbear cave.net> said:Why are people still replying to nameless trolls? There has been several cases of that in recent threads. :/ -- Bruno Medeiros - Software EngineerStewart Gordon Wrote:Comparing unicode UTF-* strings using memcmp is fine as long as what you want to know is whether the code points are the same. If your point was that per-code-point comparisons aren't the right way to compare Unicode strings (in most situations), then I support this view too. Though if that's what you wanted to say, you could have made your point clearer.On 27/11/2010 23:04, Kagamin wrote:D community is amazing cult of premature optimization fans. Any one of you heard of canonically equivalent sequences? The integrated Unicode support is a clusterfuck. Please do compare ASCII strings with memcmp, but no Unicode. Where did the original poster pull this problem from, his ass? "My system runs 100,000,000,000 instructions per second, but this comparison of 4 letter strings uses 5 cycles too much! This is the only problem on the way to world domination with my $500 Microsoft Word clone!". No wait, the problems are completely imaginatory.bearophile Wrote:Only for utf-8 strings? Why's that? I would've thought memcmp to be type agnostic. Stewart.You can use memcmp, though only for utf-8 strings.Also, is there a way to bit-compare given memory areas at much higher speed than element per element (I mean for arrays in general)?I don't know. I think you can't.
Dec 07 2010
Bruno Medeiros Wrote:On 29/11/2010 02:11, Michel Fortin wrote:Trololol. Maybe they're a bit dumb, my brother. If they some day become smarter, they'll stop using D. They see how much shit it is. I miss my wife. Oh god.... bring back my life! Bring me my.. sandwich!On 2010-11-28 20:57:38 -0500, bearophobic <notbear cave.net> said:Why are people still replying to nameless trolls? There has been several cases of that in recent threads. :/Stewart Gordon Wrote:Comparing unicode UTF-* strings using memcmp is fine as long as what you want to know is whether the code points are the same. If your point was that per-code-point comparisons aren't the right way to compare Unicode strings (in most situations), then I support this view too. Though if that's what you wanted to say, you could have made your point clearer.On 27/11/2010 23:04, Kagamin wrote:D community is amazing cult of premature optimization fans. Any one of you heard of canonically equivalent sequences? The integrated Unicode support is a clusterfuck. Please do compare ASCII strings with memcmp, but no Unicode. Where did the original poster pull this problem from, his ass? "My system runs 100,000,000,000 instructions per second, but this comparison of 4 letter strings uses 5 cycles too much! This is the only problem on the way to world domination with my $500 Microsoft Word clone!". No wait, the problems are completely imaginatory.bearophile Wrote:Only for utf-8 strings? Why's that? I would've thought memcmp to be type agnostic. Stewart.You can use memcmp, though only for utf-8 strings.Also, is there a way to bit-compare given memory areas at much higher speed than element per element (I mean for arrays in general)?I don't know. I think you can't.
Dec 08 2010
On 08/12/2010 15:02, Brüno Mediocre wrote:Bruno Medeiros Wrote:Lol, "Brüno Mediocre", well thought, that's actually funny. -- Brüno Mediocre - Software EngineerWhy are people still replying to nameless trolls? There has been several cases of that in recent threads. :/Trololol. Maybe they're a bit dumb, my brother. If they some day become smarter, they'll stop using D. They see how much shit it is. I miss my wife. Oh god.... bring back my life! Bring me my.. sandwich!
Dec 09 2010
Bruno Medeiros schrieb:On 08/12/2010 15:02, Brüno Mediocre wrote:Nah, I think it's a rather mediocre joke.Bruno Medeiros Wrote:Lol, "Brüno Mediocre", well thought, that's actually funny.Why are people still replying to nameless trolls? There has been several cases of that in recent threads. :/Trololol. Maybe they're a bit dumb, my brother. If they some day become smarter, they'll stop using D. They see how much shit it is. I miss my wife. Oh god.... bring back my life! Bring me my.. sandwich!
Dec 09 2010
On Sun, 28 Nov 2010 20:32:24 -0500, Stewart Gordon <smjg_1998 yahoo.com> wrote:On 27/11/2010 23:04, Kagamin wrote:memcmp is type agnostic if all you want to compare is equality. The other use of memcmp is essentially as an opCmp, in which case it would be type sensitive.bearophile Wrote:Only for utf-8 strings? Why's that? I would've thought memcmp to be type agnostic. Stewart.You can use memcmp, though only for utf-8 strings.Also, is there a way to bit-compare given memory areas at much higher speed than element per element (I mean for arrays in general)?I don't know. I think you can't.
Nov 28 2010
You need only match 3-char strings? Try this hack bool cmp3c(string a, string etalon) pure { if(a.length!=3)return false; return *cast(short*)a.ptr==*cast(short*)b.ptr && a[2]==b[2]; }
Nov 27 2010
or bool cmp3c(string a, string b) pure { if(a.length!=3)return false; return *cast(char[3]*)a.ptr==*cast(char[3]*)b.ptr; } llvm.memcmp should rock with lengths known at compile time.
Nov 27 2010
Kagamin:You need only match 3-char strings? Try this hackenough. But while some things are better left to Phobos, comparing short strings efficiently is the compiler's job, not mine (and I am not sure your trick works in C99). Bye, bearophile
Nov 27 2010
Kagamin wrote:You need only match 3-char strings? Try this hack =20 bool cmp3c(string a, string etalon) pure { if(a.length!=3D3)return false; return *cast(short*)a.ptr=3D=3D*cast(short*)b.ptr && a[2]=3D=3Db[2]; }This doesn't work on some architectures if a.ptr or b.ptr is odd... Jerome --=20 mailto:jeberger free.fr http://jeberger.free.fr Jabber: jeberger jabber.fr
Nov 28 2010
I have done another test: Timings, dmd compiler, best of 4, seconds: Psy: 1.59 import std.file: read; import std.c.stdio: printf; int test(char[] data) { int count; foreach (i; 0 .. data.length - 3) { char[] codon = data[i .. i + 3]; if ((codon.length == 3 && codon[0] == 'T' && codon[1] == 'A' && codon[2] == 'G') || (codon.length == 3 && codon[0] == 'T' && codon[1] == 'G' && codon[2] == 'A') || (codon.length == 3 && codon[0] == 'T' && codon[1] == 'A' && codon[2] == 'A')) count++; } return count; } void main() { char[] data0 = cast(char[])read("data.txt"); int n = 300; char[] data = new char[data0.length * n]; for (size_t pos; pos < data.length; pos += data0.length) data[pos .. pos+data0.length] = data0; printf("%d\n", test(data)); } So when there is to compare among strings known at compile-time to be small (like < 6 char), the comparison shall be replaced with inlined single char comparisons. This makes the code longer so it increases code cache pressure, but seeing how much slow the alternative is, I think it's an improvement. (A smart compiler is even able to remove the codon.length==3 test because the slice data[i..i+3] is always of length 3 (here mysteriously if you remove those three length tests the program compiled with dmd gets slower)). Bye, bearophile
Nov 27 2010
On Sat, 27 Nov 2010 22:08:29 -0500, bearophile <bearophileHUGS lycos.com> wrote:I have done another test: Timings, dmd compiler, best of 4, seconds: Psy: 1.59 import std.file: read; import std.c.stdio: printf; int test(char[] data) { int count; foreach (i; 0 .. data.length - 3) { char[] codon = data[i .. i + 3]; if ((codon.length == 3 && codon[0] == 'T' && codon[1] == 'A' && codon[2] == 'G') || (codon.length == 3 && codon[0] == 'T' && codon[1] == 'G' && codon[2] == 'A') || (codon.length == 3 && codon[0] == 'T' && codon[1] == 'A' && codon[2] == 'A')) count++; } return count; } void main() { char[] data0 = cast(char[])read("data.txt"); int n = 300; char[] data = new char[data0.length * n]; for (size_t pos; pos < data.length; pos += data0.length) data[pos .. pos+data0.length] = data0; printf("%d\n", test(data)); } So when there is to compare among strings known at compile-time to be small (like < 6 char), the comparison shall be replaced with inlined single char comparisons. This makes the code longer so it increases code cache pressure, but seeing how much slow the alternative is, I think it's an improvement. (A smart compiler is even able to remove the codon.length==3 test because the slice data[i..i+3] is always of length 3 (here mysteriously if you remove those three length tests the program compiled with dmd gets slower)). Bye, bearophileHi bearophile, I've spent some time having fun this afternoon optimizing array-equals using vectorization techniques. I found that vectorizing using ulongs worked best on my system except with the shortest strings, where a simple Duff's device edged it out. If you'd like to try it out on your data set: bool arrayComp3(bool useBitCompare = true,T)(const T[] a, const T[] b) pure nothrow { if(a.length != b.length) return false; static if(useBitCompare) { auto pab = cast(ubyte*)a.ptr; auto pbb = cast(ubyte*)b.ptr; if(pab is pbb) return true; auto byte_length = a.length*T.sizeof; auto pa_end = cast(ulong*)(pab+byte_length); final switch (byte_length % ulong.sizeof) { case 7: if(*pab++ != *pbb++ ) return false; case 6: if(*pab++ != *pbb++ ) return false; case 5: if(*pab++ != *pbb++ ) return false; case 4: if(*pab++ != *pbb++ ) return false; case 3: if(*pab++ != *pbb++ ) return false; case 2: if(*pab++ != *pbb++ ) return false; case 1: if(*pab++ != *pbb++ ) return false; case 0: } auto pa = cast(ulong*)pab; auto pb = cast(ulong*)pbb; while (pa < pa_end) { if(*pa++ != *pb++ ) return false; } } else { // default to a short duff's device auto pa = a.ptr; auto pb = b.ptr; if(pa is pb) return true; auto n = (a.length + 3) / 4; final switch (a.length % 4) { case 0: do { if(*pa++ != *pb++ ) return false; case 3: if(*pa++ != *pb++ ) return false; case 2: if(*pa++ != *pb++ ) return false; case 1: if(*pa++ != *pb++ ) return false; } while (--n > 0); } } return true; }
Nov 27 2010
Robert Jacques:I've spent some time having fun this afternoon optimizing array-equals using vectorization techniques. I found that vectorizing using ulongs worked best on my system except with the shortest strings, where a simple Duff's device edged it out. If you'd like to try it out on your data set:Thank you for your work :-) import std.file: read; import std.c.stdio: printf; import std.exception: assumeUnique; bool arrayComp(bool useBitCompare=true, T) (const T[] a, const T[] b) pure nothrow { if (a.length != b.length) return false; static if (useBitCompare) { auto pab = cast(ubyte*)a.ptr; auto pbb = cast(ubyte*)b.ptr; if (pab is pbb) return true; auto byte_length = a.length * T.sizeof; auto pa_end = cast(ulong*)(pab + byte_length); final switch (byte_length % ulong.sizeof) { case 7: if (*pab++ != *pbb++) return false; case 6: if (*pab++ != *pbb++) return false; case 5: if (*pab++ != *pbb++) return false; case 4: if (*pab++ != *pbb++) return false; case 3: if (*pab++ != *pbb++) return false; case 2: if (*pab++ != *pbb++) return false; case 1: if (*pab++ != *pbb++) return false; case 0: } auto pa = cast(ulong*)pab; auto pb = cast(ulong*)pbb; while (pa < pa_end) { if (*pa++ != *pb++) return false; } } else { // default to a short duff's device auto pa = a.ptr; auto pb = b.ptr; if (pa == pb) return true; auto n = (a.length + 3) / 4; final switch (a.length % 4) { case 0: do { if (*pa++ != *pb++) return false; case 3: if (*pa++ != *pb++) return false; case 2: if (*pa++ != *pb++) return false; case 1: if (*pa++ != *pb++) return false; } while (--n > 0); } } return true; } int test(string data) { int count; foreach (i; 0 .. data.length - 3) { auto codon = data[i .. i + 3]; if (arrayComp(codon, "TAG") || arrayComp(codon, "TGA") || arrayComp(codon, "TAA")) count++; } return count; } void main() { char[] data0 = cast(char[])read("data.txt"); int n = 300; char[] data = new char[data0.length * n]; for (size_t pos; pos < data.length; pos += data0.length) data[pos .. pos+data0.length] = data0; string sdata = assumeUnique(data); printf("%d\n", test(sdata)); } Timings, dmd compiler, best of 4, seconds: Psy: 1.59 Your function can't be inlined because it's big, so this code isn't faster than inlined code like this generated by the compiler: (codon.length == 3 && codon[0] == 'T' && codon[1] == 'A' && codon[2] == 'G') Bye, bearophile
Nov 28 2010
On 11/28/2010 12:44 PM, bearophile wrote:Robert Jacques:I don't have your data set, but for me using random data this was within int test(string data) { int count; while (true) { data = data.find("TAG", "TGA", "TAA")[0]; if (data.empty) return count; count += 1; data.popFront; } } Also, this one is far easier to generalize for strings of different lengths, and such :-)I've spent some time having fun this afternoon optimizing array-equals using vectorization techniques. I found that vectorizing using ulongs worked best on my system except with the shortest strings, where a simple Duff's device edged it out. If you'd like to try it out on your data set:Thank you for your work :-) import std.file: read; import std.c.stdio: printf; import std.exception: assumeUnique; bool arrayComp(bool useBitCompare=true, T) (const T[] a, const T[] b) pure nothrow { if (a.length != b.length) return false; static if (useBitCompare) { auto pab = cast(ubyte*)a.ptr; auto pbb = cast(ubyte*)b.ptr; if (pab is pbb) return true; auto byte_length = a.length * T.sizeof; auto pa_end = cast(ulong*)(pab + byte_length); final switch (byte_length % ulong.sizeof) { case 7: if (*pab++ != *pbb++) return false; case 6: if (*pab++ != *pbb++) return false; case 5: if (*pab++ != *pbb++) return false; case 4: if (*pab++ != *pbb++) return false; case 3: if (*pab++ != *pbb++) return false; case 2: if (*pab++ != *pbb++) return false; case 1: if (*pab++ != *pbb++) return false; case 0: } auto pa = cast(ulong*)pab; auto pb = cast(ulong*)pbb; while (pa< pa_end) { if (*pa++ != *pb++) return false; } } else { // default to a short duff's device auto pa = a.ptr; auto pb = b.ptr; if (pa == pb) return true; auto n = (a.length + 3) / 4; final switch (a.length % 4) { case 0: do { if (*pa++ != *pb++) return false; case 3: if (*pa++ != *pb++) return false; case 2: if (*pa++ != *pb++) return false; case 1: if (*pa++ != *pb++) return false; } while (--n> 0); } } return true; } int test(string data) { int count; foreach (i; 0 .. data.length - 3) { auto codon = data[i .. i + 3]; if (arrayComp(codon, "TAG") || arrayComp(codon, "TGA") || arrayComp(codon, "TAA")) count++; } return count; } void main() { char[] data0 = cast(char[])read("data.txt"); int n = 300; char[] data = new char[data0.length * n]; for (size_t pos; pos< data.length; pos += data0.length) data[pos .. pos+data0.length] = data0; string sdata = assumeUnique(data); printf("%d\n", test(sdata)); } Timings, dmd compiler, best of 4, seconds: Psy: 1.59 Your function can't be inlined because it's big, so this code isn't faster than inlined code like this generated by the compiler: (codon.length == 3&& codon[0] == 'T'&& codon[1] == 'A'&& codon[2] == 'G') Bye, bearophile
Nov 28 2010
Pelle Mĺnsson:I don't have your data set,In my first post I have explained how to create the data set, using a little Python script that I have listed there. You just need a Python2 interpreter.any fiddly code.Thank you for your code. I have added your version, plus a modified version ------------------------- import std.file: read; import std.c.stdio: printf; import std.algorithm: find; import std.array: empty, popFront; int test(char[] data) { int count; while (true) { data = data.find("TAG", "TGA", "TAA")[0]; if (data.empty) return count; count++; data.popFront(); } } void main() { char[] data0 = cast(char[])read("data.txt"); int n = 300; char[] data = new char[data0.length * n]; for (size_t pos; pos < data.length; pos += data0.length) data[pos .. pos + data0.length] = data0; printf("%d\n", test(data)); } ------------------------- import std.file: read; import std.c.stdio: printf; import std.algorithm: find; import std.array: empty, popFront; int test(ubyte[] data) { int count; while (true) { data = data.find("TAG", "TGA", "TAA")[0]; if (data.empty) return count; count++; data.popFront(); } } void main() { ubyte[] data0 = cast(ubyte[])read("data.txt"); int n = 300; ubyte[] data = new ubyte[data0.length * n]; for (size_t pos; pos < data.length; pos += data0.length) data[pos .. pos + data0.length] = data0; printf("%d\n", test(data)); } ------------------------- Timings, dmd compiler, best of 4, seconds: Psy: 1.59 So on this PC it's not much fast. And generally I don't like to use find, empty and popFront to solve this problem, it's quite unnatural. Here I have explained what I think is a good enough solution to this performance problem: http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.D&article_id=123044 Bye, bearophile
Nov 28 2010
On Sun, 28 Nov 2010 10:48:46 -0500 bearophile <bearophileHUGS lycos.com> wrote:generally I don't like to use find, empty and popFront to solve this prob=lem, it's quite unnatural. Here I have explained what I think is a good eno= ugh solution to this performance problem:http://www.digitalmars.com/webnews/newsgroups.php?art_group=3Ddigitalmars=.D&article_id=3D123044 What I would like is a just decent routine for comparing arrays of elements= which _conceptually_ are values, whatever their complexity (meaning, possi= bly structs or nested arrays). This should always (*) use a default '=3D=3D= ' comparison, which in turn may perform a kind of memcomp. I have no idea how to avoid bad perf such as shown by your tests, but this = is really necessary. Tons of routines and algorithms use '=3D=3D', and amon= g the most used; str.find will perform an arbitrary number of '=3D=3D' chec= ks; and I don't even want to think at pattern matching ;-) Denis (*) I think that: * If an element type is a value type, it should not overide opEquals. '=3D= =3D' should translate to equality of every sub-element. Else, there is some= thing wrong in the design. But there may be use cases I overlook. (Eg metad= ata stored on the element itself?) * A plain value may well refer to a "thing"/object, eg a wiget's outlook po= inting to a color palette, in which case reference comparison is the right = choice -- so, no complication. * Real things (referenced objects, entities with self/id) should not have o= pEquals at all. They should be compare only by id, meaning using 'is'. This= shows again the ambivalence of dyn arrays... -- -- -- -- -- -- -- vit esse estrany =E2=98=A3 spir.wikidot.com
Nov 28 2010
On 11/28/2010 04:48 PM, bearophile wrote:Pelle Mĺnsson:Measuring incorrectly in a performance test is a silly mistake, I was indeed wrong about the factor two thing! I thought find was specialized for chars so it wouldn't require casting to ubyte. Maybe I was wrong again. Thank you for your engaging performance oriented posts. :-)I don't have your data set,In my first post I have explained how to create the data set, using a little Python script that I have listed there. You just need a Python2 interpreter.any fiddly code.Thank you for your code. I have added your version, plus a modified version ------------------------- import std.file: read; import std.c.stdio: printf; import std.algorithm: find; import std.array: empty, popFront; int test(char[] data) { int count; while (true) { data = data.find("TAG", "TGA", "TAA")[0]; if (data.empty) return count; count++; data.popFront(); } } void main() { char[] data0 = cast(char[])read("data.txt"); int n = 300; char[] data = new char[data0.length * n]; for (size_t pos; pos< data.length; pos += data0.length) data[pos .. pos + data0.length] = data0; printf("%d\n", test(data)); } ------------------------- import std.file: read; import std.c.stdio: printf; import std.algorithm: find; import std.array: empty, popFront; int test(ubyte[] data) { int count; while (true) { data = data.find("TAG", "TGA", "TAA")[0]; if (data.empty) return count; count++; data.popFront(); } } void main() { ubyte[] data0 = cast(ubyte[])read("data.txt"); int n = 300; ubyte[] data = new ubyte[data0.length * n]; for (size_t pos; pos< data.length; pos += data0.length) data[pos .. pos + data0.length] = data0; printf("%d\n", test(data)); } ------------------------- Timings, dmd compiler, best of 4, seconds: Psy: 1.59 So on this PC it's not much fast. And generally I don't like to use find, empty and popFront to solve this problem, it's quite unnatural. Here I have explained what I think is a good enough solution to this performance problem: http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.D&article_id=123044 Bye, bearophile
Nov 28 2010
On Sun, 28 Nov 2010 06:44:23 -0500, bearophile <bearophileHUGS lycos.com> wrote:Robert Jacques:[snip]I've spent some time having fun this afternoon optimizing array-equals using vectorization techniques. I found that vectorizing using ulongs worked best on my system except with the shortest strings, where a simple Duff's device edged it out. If you'd like to try it out on your data set:Thank you for your work :-)Your function can't be inlined because it's big, so this code isn't faster than inlined code like this generated by the compiler: (codon.length == 3 && codon[0] == 'T' && codon[1] == 'A' && codon[2] == 'G') Bye, bearophileStill, part of the point was that string comparisons in general were way too slow. Anyways, I've applied the same technique in a partially unrolled version if you want to check it out: bool arrayComp(T, size_t N)(const T[] a, ref const T[N] b) pure nothrow { static if(T.sizeof*N <= uint.sizeof) { return a.length == N && !( (*(cast(uint*)a.ptr) ^ *(cast(uint*)b.ptr)) & (uint.max >> 8*(uint.sizeof - T.sizeof*N) )); } else static if(T.sizeof*N <= ulong.sizeof) { return a.length == N && !( (*(cast(ulong*)a.ptr) ^ *(cast(ulong*)b.ptr)) & (ulong.max>> 8*(ulong.sizeof - T.sizeof*N) )); } else { // Fall back to a loop if(a.length != N || (*(cast(ulong*)a.ptr) != *(cast(ulong*)b.ptr)) ) return false; enum alignment = T.sizeof*N % ulong.sizeof > 0 ? T.sizeof*N % ulong.sizeof : ulong.sizeof; auto pa = cast(ulong*)(cast(ubyte*)a.ptr + alignment); auto pb = cast(ulong*)(cast(ubyte*)b.ptr + alignment); auto pa_end = cast(ulong*)(cast(ubyte*)a.ptr + T.sizeof*N); while (pa < pa_end) { if(*pa++ != *pb++ ) return false; } return true; } } Be warned that you'll have to use strings explicitly typed as immutable char[N], since both array literals and string literals won't match a this template.
Nov 28 2010
Robert Jacques:Still, part of the point was that string comparisons in general were way too slow. Anyways, I've applied the same technique in a partially unrolled version if you want to check it out:import std.file: read; import std.c.stdio: printf; // not formatted bool arrayComp(T, size_t N)(const T[] a, ref const T[N] b) pure nothrow { static if(T.sizeof*N <= uint.sizeof) { return a.length == N && !( (*(cast(uint*)a.ptr) ^ *(cast(uint*)b.ptr)) & (uint.max >> 8*(uint.sizeof - T.sizeof*N) )); } else static if(T.sizeof*N <= ulong.sizeof) { return a.length == N && !( (*(cast(ulong*)a.ptr) ^ *(cast(ulong*)b.ptr)) & (ulong.max>> 8*(ulong.sizeof - T.sizeof*N) )); } else { // Fall back to a loop if(a.length != N || (*(cast(ulong*)a.ptr) != *(cast(ulong*)b.ptr)) ) return false; enum alignment = T.sizeof*N % ulong.sizeof > 0 ? T.sizeof*N % ulong.sizeof : ulong.sizeof; auto pa = cast(ulong*)(cast(ubyte*)a.ptr + alignment); auto pb = cast(ulong*)(cast(ubyte*)b.ptr + alignment); auto pa_end = cast(ulong*)(cast(ubyte*)a.ptr + T.sizeof*N); while (pa < pa_end) { if(*pa++ != *pb++ ) return false; } return true; } } pure int test(const char[] data) { int count; foreach (i; 0 .. data.length - 3) { const char[] codon = data[i .. i + 3]; if (arrayComp!(char, 3)(codon, "TAG") || arrayComp!(char, 3)(codon, "TGA") || arrayComp!(char, 3)(codon, "TAA")) count++; } return count; } void main() { char[] data0 = cast(char[])read("data.txt"); int n = 300; char[] data = new char[data0.length * n]; for (size_t pos; pos < data.length; pos += data0.length) data[pos .. pos + data0.length] = data0; printf("%d\n", test(data)); } Timings, dmd compiler, best of 4, seconds: Psy: 1.59 This problem probably needs to be solved by the compiler. Bye, bearophile
Nov 28 2010
On 11/27/2010 11:53 AM, bearophile wrote:While translating a common Python script to D, I have found something interesting, so I have reduced it to a little benchmark. Below there is the reduced Python2 code (it uses Psyco), and a little program to generate some test data. The timing of the first D2 version is not good compared to the Python-Psyco program (the generation of the *300 array is a quick thing), so I have created two more D2 versions to show myself that D2 wasn't broken :-) The reduced code looks like a syntetic benchmark, but it has about the same performance profile of a 60 lines long Python script (the original code was using xrange(0,len(...)-3,3) instead of xrange(len(...)-3), but the situation doesn't change much).It's not clear to me if the point of your post is "how do I make this go faster?" or "Waa! D underperforms Python by default". If it is the former, a state machine is probably the right answer for long input strings. start -> E E A,C,G:-> E E T:-> T T A:-> TA T C:-> E T G:-> TG T T:-> T TA A,G:-> E, {++count} TA C:-> E TA T:-> T TG A:-> E, {++count} TG C,G:-> E TG T:-> T Since the probability of failure is about .75, 0.5, {.5 or .75}, the overall comparison cost is about 1 + .25 + .125 = 1.375, times compare+loop times n, plus all the function call overhead, etc. Using a DFA should reduce the cost to 1 * index, times n, assuming you are using a small alphabet (I'm guessing that your codons are ascii+uppercase only, table-size = 26). =Austin
Nov 27 2010
On Saturday 27 November 2010 22:13:39 Austin Hastings wrote:On 11/27/2010 11:53 AM, bearophile wrote:What I think is pretty clear from this is that at some point, some work needs to be done to optimize array comparisons when memcmp can be used instead of having to call the individual opEquals() of each element. It's not the sort of thing that's likely to be a priority, but it really should be done at some point. - Jonathan M DavisWhile translating a common Python script to D, I have found something interesting, so I have reduced it to a little benchmark. Below there is the reduced Python2 code (it uses Psyco), and a little program to generate some test data. The timing of the first D2 version is not good compared to the Python-Psyco program (the generation of the *300 array is a quick thing), so I have created two more D2 versions to show myself that D2 wasn't broken :-) The reduced code looks like a syntetic benchmark, but it has about the same performance profile of a 60 lines long Python script (the original code was using xrange(0,len(...)-3,3) instead of xrange(len(...)-3), but the situation doesn't change much).It's not clear to me if the point of your post is "how do I make this go faster?" or "Waa! D underperforms Python by default".
Nov 27 2010
== Quote from Jonathan M Davis (jmdavisProg gmx.com)'s articleOn Saturday 27 November 2010 22:13:39 Austin Hastings wrote:Just food for thought. GDC uses memcmp when using string comparisons in the first implementation. if (codon == "TAG" || codon == "TGA" || codon == "TAA") And it is still the slowest case of the lot. RegardsOn 11/27/2010 11:53 AM, bearophile wrote:What I think is pretty clear from this is that at some point, some work needs to be done to optimize array comparisons when memcmp can be used instead of having to call the individual opEquals() of each element. It's not the sort of thing that's likely to be a priority, but it really should be done at some point. - Jonathan M DavisWhile translating a common Python script to D, I have found something interesting, so I have reduced it to a little benchmark. Below there is the reduced Python2 code (it uses Psyco), and a little program to generate some test data. The timing of the first D2 version is not good compared to the Python-Psyco program (the generation of the *300 array is a quick thing), so I have created two more D2 versions to show myself that D2 wasn't broken :-) The reduced code looks like a syntetic benchmark, but it has about the same performance profile of a 60 lines long Python script (the original code was using xrange(0,len(...)-3,3) instead of xrange(len(...)-3), but the situation doesn't change much).It's not clear to me if the point of your post is "how do I make this go faster?" or "Waa! D underperforms Python by default".
Nov 28 2010
Iain Buclaw:GDC uses memcmp when using string comparisons in the first implementation. if (codon == "TAG" || codon == "TGA" || codon == "TAA") And it is still the slowest case of the lot.The asm of the first version of the function compiled with gdc uses "repz cmpsb". And while I don't remember precise timings for it, I think it was about two times faster than the first version compiled with dmd (on Windows). Regarding the idea of using memcmp, I have added a comment to the issue 5282. Bye, bearophile
Nov 28 2010
On Saturday 27 November 2010 22:23:39 Jonathan M Davis wrote:On Saturday 27 November 2010 22:13:39 Austin Hastings wrote:Enhancement Request: http://d.puremagic.com/issues/show_bug.cgi?id=5282On 11/27/2010 11:53 AM, bearophile wrote:What I think is pretty clear from this is that at some point, some work needs to be done to optimize array comparisons when memcmp can be used instead of having to call the individual opEquals() of each element. It's not the sort of thing that's likely to be a priority, but it really should be done at some point.While translating a common Python script to D, I have found something interesting, so I have reduced it to a little benchmark. Below there is the reduced Python2 code (it uses Psyco), and a little program to generate some test data. The timing of the first D2 version is not good compared to the Python-Psyco program (the generation of the *300 array is a quick thing), so I have created two more D2 versions to show myself that D2 wasn't broken :-) The reduced code looks like a syntetic benchmark, but it has about the same performance profile of a 60 lines long Python script (the original code was using xrange(0,len(...)-3,3) instead of xrange(len(...)-3), but the situation doesn't change much).It's not clear to me if the point of your post is "how do I make this go faster?" or "Waa! D underperforms Python by default".
Nov 27 2010
Austin Hastings:It's not clear to me if the point of your post is "how do I make this go faster?" or "Waa! D underperforms Python by default".The third D version in my post is fast enough, so the point of that post is that a very common string compare operation, done in the natural way, is a bit too much slow in D/DMD.If it is the former, a state machine is probably the right answer for long input strings.This case is very simple, so a simple tree (as in the third D version of my post) is probably good enough. In some similar but more complex situations of genomic processing I've seen that in C it's good to have computed gotos, to implement faster state machines. See my recent post "explicitly unimplemented computed gotos" :-) This was my use case beside the desire to create interpreters in D (it's a similar thing).Since the probability of failure is about .75, 0.5, {.5 or .75}, the overall comparison cost is about 1 + .25 + .125 = 1.375, times compare+loop times n, plus all the function call overhead, etc.All 3 codons start with T, so if you test that first letter before the 3 string compare you speed up the original code a lot. But I was looking for a more general solution. Implementing a state machine is good, but in this case I'd like the compiler to do more optimizations by itself. A simple but good enough optimization for the compiler is to replace tests with strings (known to be short at compile-time) with inlined comparisons of their single chars.Using a DFA should reduce the cost to 1 * index, times n, assuming you are using a small alphabet (I'm guessing that your codons are ascii+uppercase only, table-size = 26).In this case the alphabet was just 4 symbols: {'A', 'C', 'G', 'T'} (with no undecided or modified nucleosides). Bye, bearophile
Nov 28 2010
On Sat, 27 Nov 2010 22:41:45 -0800 Jonathan M Davis <jmdavisProg gmx.com> wrote:'sWhat I think is pretty clear from this is that at some point, some work needs to be done to optimize array comparisons when memcmp can be used instead of having to call the individual opEquals() of each element. It=uldnot the sort of thing that's likely to be a priority, but it really sho=Yes, but other comments seems to show memcmp only doubles speed. This would= only bring us twice as slow as python ;-) Denis -- -- -- -- -- -- -- vit esse estrany =E2=98=A3 spir.wikidot.combe done at some point. =20=20 Enhancement Request: http://d.puremagic.com/issues/show_bug.cgi?id=3D5282
Nov 28 2010
spir:Yes, but other comments seems to show memcmp only doubles speed. This would only bring us twice as slow as python ;-)The solution I suggest for this problem is: when DMD knows at compile-time the length of one of the two strings to equate, and such length is small (like < 6), then it may replace the string eq code like this: (codon == "TAG") With code like: (codon.length == 3 && codon[0] == 'T' && codon[1] == 'A' && codon[2] == 'G') Bye, bearophile
Nov 28 2010