www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - String compare performance

reply bearophile <bearophileHUGS lycos.com> writes:
While translating a common Python script to D, I have found something
interesting, so I have reduced it to a little benchmark.

Below there is the reduced Python2 code (it uses Psyco), and a little program
to generate some test data. The timing of the first D2 version is not good
compared to the Python-Psyco program (the generation of the *300 array is a
quick thing), so I have created two more D2 versions to show myself that D2
wasn't broken :-)

The reduced code looks like a syntetic benchmark, but it has about the same
performance profile of a 60 lines long Python script (the original code was
using xrange(0,len(...)-3,3)  instead of xrange(len(...)-3), but the situation
doesn't change much).

Timings, dmd compiler, best of 4, seconds:

  Psy:  1.59




2.3 GHz CPU.
dmd 2.050, ldc based on DMD v1.057 and llvm 2.6, gdc head based on GCC 4.4.5.

--------------------------


import psyco

def test(data):
    count = 0
    for i in xrange(len(data) - 3):
        codon = data[i : i + 3]
        if codon == "TAG" or codon == "TGA" or codon == "TAA":
            count += 1
    return count

def main():
    data = open("data.txt").read()
    data = data * 300
    print test(data)

psyco.full()
main()

--------------------------

To produce some test data:



from itertools import islice

def rnd(ia = 3877, ic = 29573, im = 139968):
    seed = 42
    imf = float(im)
    while 1:
        seed = (seed * ia + ic) % im
        r = seed / imf
        yield "ACGT"[int(r * 4)]


def main():
    data = "".join(islice(rnd(), 0, 120000))
    file("data.txt", "w").write(data)

main()

--------------------------

A D2 traslation (I have used printf to reduce a lot the asm produced):



import std.file: read;
import std.c.stdio: printf;

int test(char[] data) {
    int count;
    foreach (i; 0 ..  data.length - 3) {
        char[] codon = data[i .. i + 3];
        if (codon == "TAG" || codon == "TGA" || codon == "TAA")
            count++;
    }
    return count;
}

void main() {
    char[] data0 = cast(char[])read("data.txt");
    int n = 300;
    char[] data = new char[data0.length * n];
    for (size_t pos; pos < data.length; pos += data0.length)
        data[pos .. pos+data0.length] = data0;

    printf("%d\n", test(data));
}

--------------------------


import std.file: read;
import std.c.stdio: printf;

bool cmp3(char[] s1, string s2) {
    return s1[0] == s2[0] && s1[1] == s2[1] && s1[2] == s2[2];
}

int test(char[] data) {
    int count;
    foreach (i; 0 ..  data.length - 3) {
        char[] codon = data[i .. i + 3];
        if (cmp3(codon, "TAG") || cmp3(codon, "TGA") || cmp3(codon, "TAA"))
            count++;
    }
    return count;
}

void main() {
    char[] data0 = cast(char[])read("data.txt");
    int n = 300;
    char[] data = new char[data0.length * n];
    for (size_t pos; pos < data.length; pos += data0.length)
        data[pos .. pos+data0.length] = data0;

    printf("%d\n", test(data));
}

--------------------------


import std.file: read;
import std.c.stdio: printf;

int test(char[] data) {
    int count;

    foreach (i; 0 ..  data.length - 3) {
        if (data[i] == 'T') {
            if (data[i+1] == 'A') {
                if (data[i+2] == 'G') {
                    count++;
                } else if (data[i+2] == 'A') {
                    count++;
                }
            } else if (data[i+1] == 'G') {
                if (data[i+2] == 'A')
                    count++;
            }
        }
    }

    return count;
}

void main() {
    char[] data0 = cast(char[])read("data.txt");
    int n = 300;
    char[] data = new char[data0.length * n];
    for (size_t pos; pos < data.length; pos += data0.length)
        data[pos .. pos+data0.length] = data0;

    printf("%d\n", test(data));
}

--------------------------

dmd -O -release -inline test1.d

_D5test14testFAaZi	comdat
L0:		sub	ESP,024h
		push	EBX
		xor	EBX,EBX
		push	EBP
		mov	EBP,030h[ESP]
		push	ESI
		xor	ESI,ESI
		add	EBP,0FFFFFFFDh
		push	EDI
		je	LA8
		mov	EDX,03Ch[ESP]
		mov	EAX,038h[ESP]
		mov	EDI,EDX
L22:		mov	ECX,offset FLAT:_D11TypeInfo_Aa6__initZ
		lea	EAX,3[EBX]
		sub	EAX,EBX
		push	ECX
		lea	EDX,[EDI][EBX]
		push	dword ptr FLAT:_DATA[0Ch]
		push	dword ptr FLAT:_DATA[08h]
		mov	020h[ESP],EAX
		mov	024h[ESP],EDX
		push	EDX
		push	EAX
		call	near ptr __adEq2
		add	ESP,014h
		test	EAX,EAX
		jne	L9E
		mov	ECX,offset FLAT:_D11TypeInfo_Aa6__initZ
		push	ECX
		push	dword ptr FLAT:_DATA[01Ch]
		push	dword ptr FLAT:_DATA[018h]
		push	dword ptr 024h[ESP]
		push	dword ptr 024h[ESP]
		call	near ptr __adEq2
		add	ESP,014h
		test	EAX,EAX
		jne	L9E
		mov	EDX,offset FLAT:_D11TypeInfo_Aa6__initZ
		push	EDX
		push	dword ptr FLAT:_DATA[02Ch]
		push	dword ptr FLAT:_DATA[028h]
		push	dword ptr 024h[ESP]
		push	dword ptr 024h[ESP]
		call	near ptr __adEq2
		add	ESP,014h
		test	EAX,EAX
		je	L9F
L9E:		inc	ESI
L9F:		inc	EBX
		cmp	EBX,EBP
		jb	L22
LA8:		pop	EDI
		mov	EAX,ESI
		pop	ESI
		pop	EBP
		pop	EBX
		add	ESP,024h
		ret	8

--------------------------

dmd -O -release -inline test2.d

_D5test24testFAaZi	comdat
		sub	ESP,03Ch
		mov	ECX,040h[ESP]
		push	EBX
		push	EBP
		push	ESI
		xor	ESI,ESI
		push	EDI
		xor	EDI,EDI
		add	ECX,0FFFFFFFDh
		mov	01Ch[ESP],ECX
		je	LC5
		mov	EDX,054h[ESP]
		mov	EAX,050h[ESP]
		mov	EBP,EDX
L26:		lea	EBX,3[ESI]
		sub	EBX,ESI
		lea	ECX,[EBP][ESI]
		mov	028h[ESP],ECX
		mov	DL,[ECX]
		mov	ECX,FLAT:_DATA[0Ch]
		mov	024h[ESP],EBX
		mov	EAX,FLAT:_DATA[08h]
		cmp	DL,[ECX]
		mov	010h[ESP],ECX
		jne	L65
		mov	EDX,028h[ESP]
		mov	EAX,EBX
		mov	EBX,010h[ESP]
		mov	CL,1[EDX]
		cmp	CL,1[EBX]
		jne	L65
		mov	DL,2[EDX]
		cmp	DL,2[EBX]
		je	LB9
L65:		mov	EDX,028h[ESP]
		mov	CL,[EDX]
		mov	018h[ESP],EDX
		mov	EAX,024h[ESP]
		mov	EDX,FLAT:_DATA[01Ch]
		cmp	CL,[EDX]
		mov	EAX,FLAT:_DATA[018h]
		jne	L96
		mov	EBX,018h[ESP]
		mov	AL,1[EBX]
		cmp	AL,1[EDX]
		jne	L96
		mov	AL,2[EBX]
		cmp	AL,2[EDX]
		je	LB9
L96:		mov	EDX,FLAT:_DATA[02Ch]
		mov	EAX,FLAT:_DATA[028h]
		cmp	CL,[EDX]
		jne	LBA
		mov	ECX,018h[ESP]
		mov	BL,1[ECX]
		cmp	BL,1[EDX]
		jne	LBA
		mov	CL,2[ECX]
		cmp	CL,2[EDX]
		jne	LBA
LB9:		inc	EDI
LBA:		inc	ESI
		cmp	ESI,01Ch[ESP]
		jb	L26
LC5:		mov	EAX,EDI
		pop	EDI
		pop	ESI
		pop	EBP
		pop	EBX
		add	ESP,03Ch
		ret	8

--------------------------

dmd -O -release -inline test3.d

_D5test34testFAaZi	comdat
		push	EAX
		xor	EDX,EDX
		push	EBX
		xor	EBX,EBX
		push	ESI
		mov	ESI,010h[ESP]
		add	ESI,0FFFFFFFDh
		je	L4A
		mov	8[ESP],EDX
		mov	EDX,014h[ESP]
		mov	ECX,EDX
		mov	EAX,010h[ESP]
		mov	EDX,8[ESP]
L22:		cmp	[ECX][EDX],054h
		jne	L45
		mov	AL,1[ECX][EDX]
		cmp	AL,041h
		jne	L39
		cmp	byte ptr 2[ECX][EDX],047h
		je	L44
		jmp short	L3D
L39:		cmp	AL,047h
		jne	L45
L3D:		cmp	byte ptr 2[ECX][EDX],041h
		jne	L45
L44:		inc	EBX
L45:		inc	EDX
		cmp	EDX,ESI
		jb	L22
L4A:		pop	ESI
		mov	EAX,EBX
		pop	EBX
		pop	ECX
		ret	8

--------------------------

gdc -O3 -frelease -inline -S test1.d -o test1.s

_D5test14testFAaZi:
	pushl	%ebp
	movl	%esp, %ebp
	pushl	%edi
	pushl	%esi
	pushl	%ebx
	subl	$12, %esp
	movl	8(%ebp), %eax
	movl	12(%ebp), %edx
	movl	$0, -24(%ebp)
	subl	$3, %eax
	movl	%edx, -20(%ebp)
	movl	%eax, -16(%ebp)
	je	.L5
	xorl	%edx, %edx
	jmp	.L8
	.p2align 4,,7
	.p2align 3
.L6:
	addl	$1, -24(%ebp)
	addl	$1, %edx
	cmpl	%edx, -16(%ebp)
	jbe	.L5
.L8:
	movl	-20(%ebp), %eax
	movl	$.LC0, %edi
	movl	$3, %ecx
	addl	%edx, %eax
	movl	%eax, %esi
	repz cmpsb
	je	.L6
	movl	%eax, %esi
	movl	$.LC1, %edi
	movl	$3, %ecx
	repz cmpsb
	je	.L6
	movl	$.LC2, %edi
	movl	%eax, %esi
	movl	$3, %ecx
	repz cmpsb
	je	.L6
	addl	$1, %edx
	cmpl	%edx, -16(%ebp)
	ja	.L8
.L5:
	movl	-24(%ebp), %eax
	addl	$12, %esp
	popl	%ebx
	popl	%esi
	popl	%edi
	popl	%ebp
	ret

--------------------------------------

gdc -O3 -frelease -inline -S test2.d -o test2.s

_D5test24testFAaZi:
	pushl	%ebp
	xorl	%edx, %edx
	movl	%esp, %ebp
	xorl	%eax, %eax
	pushl	%edi
	pushl	%esi
	pushl	%ebx
	subl	$16, %esp
	movl	8(%ebp), %ebx
	movl	12(%ebp), %esi
	movl	%ebx, %edi
	subl	$3, %edi
	jne	.L18
	jmp	.L9
	.p2align 4,,7
	.p2align 3
.L10:
	movl	-16(%ebp), %ecx
	movzbl	.LC1, %ebx
	cmpb	(%ecx), %bl
	je	.L15
.L12:
	movzbl	.LC2, %ebx
	cmpb	(%ecx), %bl
	je	.L21
.L13:
	addl	$1, %edx
	cmpl	%edx, %edi
	jbe	.L9
	.p2align 4,,7
	.p2align 3
.L18:
	leal	(%esi,%edx), %ecx
	movzbl	.LC0, %ebx
	movl	$3, -20(%ebp)
	movl	%ecx, -16(%ebp)
	cmpb	(%ecx), %bl
	jne	.L10
	movzbl	.LC0+1, %ebx
	cmpb	1(%ecx), %bl
	jne	.L10
	movzbl	.LC0+2, %ebx
	cmpb	2(%ecx), %bl
	jne	.L10
.L11:
	addl	$1, %edx
	addl	$1, %eax
	cmpl	%edx, %edi
	ja	.L18
.L9:
	addl	$16, %esp
	popl	%ebx
	popl	%esi
	popl	%edi
	popl	%ebp
	ret
	.p2align 4,,7
	.p2align 3
.L15:
	movzbl	.LC1+1, %ebx
	cmpb	1(%ecx), %bl
	jne	.L12
	movzbl	.LC1+2, %ebx
	cmpb	2(%ecx), %bl
	je	.L11
	movzbl	.LC2, %ebx
	cmpb	(%ecx), %bl
	jne	.L13
	.p2align 4,,7
	.p2align 3
.L21:
	movzbl	.LC2+1, %ebx
	cmpb	1(%ecx), %bl
	jne	.L13
	movzbl	.LC2+2, %ebx
	cmpb	2(%ecx), %bl
	jne	.L13
	jmp	.L11

--------------------------------------

gdc -O3 -frelease -inline -S test3.d -o test3.s

_D5test34testFAaZi:
	pushl	%ebp
	xorl	%eax, %eax
	movl	%esp, %ebp
	pushl	%esi
	movl	8(%ebp), %esi
	pushl	%ebx
	movl	12(%ebp), %ebx
	subl	$3, %esi
	je	.L6
	movl	$1, %edx
	jmp	.L7
	.p2align 4,,7
	.p2align 3
.L3:
	cmpl	%edx, %esi
	leal	1(%edx), %ecx
	jbe	.L6
.L11:
	movl	%ecx, %edx
.L7:
	cmpb	$84, -1(%ebx,%edx)
	jne	.L3
	movzbl	(%ebx,%edx), %ecx
	cmpb	$65, %cl
	je	.L10
	cmpb	$71, %cl
	jne	.L3
	xorl	%ecx, %ecx
	cmpb	$65, 1(%ebx,%edx)
	sete	%cl
	addl	%ecx, %eax
	cmpl	%edx, %esi
	leal	1(%edx), %ecx
	ja	.L11
	.p2align 4,,7
	.p2align 3
.L6:
	popl	%ebx
	popl	%esi
	popl	%ebp
	ret

---------------------------

ldc -O3 -release -inline -output-s test1.d

_D5test14testFAaZi:
	pushl	%ebp
	pushl	%ebx
	pushl	%edi
	pushl	%esi
	subl	$20, %esp
	movl	40(%esp), %esi
	cmpl	$3, %esi
	je	.LBB1_8
	addl	$4294967293, %esi
	xorl	%ebx, %ebx
	movl	%ebx, %edi
	.align	16
.LBB1_2:
	movl	44(%esp), %eax
	leal	(%eax,%ebx), %ebp
	movl	%ebp, 4(%esp)
	movl	$_D11TypeInfo_Aa6__initZ, 16(%esp)
	movl	$.str, 12(%esp)
	movl	$3, 8(%esp)
	movl	$3, (%esp)
	call	_adEq
	testl	%eax, %eax
	jne	.LBB1_5
	movl	%ebp, 4(%esp)
	movl	$_D11TypeInfo_Aa6__initZ, 16(%esp)
	movl	$.str1, 12(%esp)
	movl	$3, 8(%esp)
	movl	$3, (%esp)
	call	_adEq
	testl	%eax, %eax
	jne	.LBB1_5
	movl	%ebp, 4(%esp)
	movl	$_D11TypeInfo_Aa6__initZ, 16(%esp)
	movl	$.str2, 12(%esp)
	movl	$3, 8(%esp)
	movl	$3, (%esp)
	call	_adEq
	testl	%eax, %eax
	je	.LBB1_6
.LBB1_5:
	incl	%edi
.LBB1_6:
	incl	%ebx
	cmpl	%esi, %ebx
	jb	.LBB1_2
.LBB1_7:
	movl	%edi, %eax
	addl	$20, %esp
	popl	%esi
	popl	%edi
	popl	%ebx
	popl	%ebp
	ret	$8
.LBB1_8:
	xorl	%edi, %edi
	jmp	.LBB1_7

---------------------------

ldc -O3 -release -inline -output-s test2.d

_D5test24testFAaZi:
	pushl	%ebx
	pushl	%esi
	movl	12(%esp), %ecx
	cmpl	$3, %ecx
	je	.LBB2_14
	movl	16(%esp), %edx
	addl	$4294967293, %ecx
	xorl	%eax, %eax
	movl	%eax, %esi
	.align	16
.LBB2_2:
	movb	(%edx,%esi), %bl
	cmpb	$84, %bl
	jne	.LBB2_12
	movb	1(%edx,%esi), %bh
	cmpb	$65, %bh
	jne	.LBB2_5
	cmpb	$71, 2(%esi,%edx)
	je	.LBB2_11
.LBB2_5:
	cmpb	$84, %bl
	jne	.LBB2_12
	cmpb	$71, %bh
	jne	.LBB2_8
	cmpb	$65, 2(%esi,%edx)
	je	.LBB2_11
.LBB2_8:
	cmpb	$65, %bh
	setne	%bh
	cmpb	$84, %bl
	jne	.LBB2_12
	testb	%bh, %bh
	jne	.LBB2_12
	cmpb	$65, 2(%esi,%edx)
	jne	.LBB2_12
.LBB2_11:
	incl	%eax
.LBB2_12:
	incl	%esi
	cmpl	%ecx, %esi
	jb	.LBB2_2
.LBB2_13:
	popl	%esi
	popl	%ebx
	ret	$8
.LBB2_14:
	xorl	%eax, %eax
	jmp	.LBB2_13

---------------------------

ldc -O3 -release -inline -output-s test3.d

_D5test34testFAaZi:
	pushl	%ebx
	pushl	%edi
	pushl	%esi
	movl	16(%esp), %ecx
	cmpl	$3, %ecx
	je	.LBB1_12
	movl	20(%esp), %edx
	addl	$4294967293, %ecx
	xorl	%eax, %eax
	movl	%eax, %esi
	.align	16
.LBB1_2:
	cmpb	$84, (%edx,%esi)
	jne	.LBB1_10
	movb	1(%edx,%esi), %bl
	cmpb	$71, %bl
	je	.LBB1_8
	cmpb	$65, %bl
	jne	.LBB1_10
	movb	2(%esi,%edx), %bl
	cmpb	$71, %bl
	jne	.LBB1_7
	incl	%eax
	jmp	.LBB1_10
.LBB1_7:
	cmpb	$65, %bl
	jmp	.LBB1_9
.LBB1_8:
	cmpb	$65, 2(%esi,%edx)
.LBB1_9:
	sete	%bl
	movzbl	%bl, %edi
	addl	%edi, %eax
.LBB1_10:
	incl	%esi
	cmpl	%ecx, %esi
	jb	.LBB1_2
.LBB1_11:
	popl	%esi
	popl	%edi
	popl	%ebx
	ret	$8
.LBB1_12:
	xorl	%eax, %eax
	jmp	.LBB1_11

---------------------------

Bye,
bearophile
Nov 27 2010
next sibling parent reply Ellery Newcomer <ellery-newcomer utulsa.edu> writes:
how well does dmd perform when you use a switch statement?
Nov 27 2010
parent bearophile <bearophileHUGS lycos.com> writes:
Ellery Newcomer:

 how well does dmd perform when you use a switch statement?
4th D version: import std.file: read; import std.c.stdio: printf; int test(char[] data) { int count; foreach (i; 0 .. data.length - 3) { switch (data[i .. i + 3]) { case "TAG", "TGA", "TAA": count++; default: } } return count; } void main() { char[] data0 = cast(char[])read("data.txt"); int n = 300; char[] data = new char[data0.length * n]; for (size_t pos; pos < data.length; pos += data0.length) data[pos .. pos+data0.length] = data0; printf("%d\n", test(data)); } Timings, dmd compiler, best of 4, seconds: Psy: 1.59 Bye, bearophile
Nov 27 2010
prev sibling next sibling parent reply spir <denis.spir gmail.com> writes:
On Sat, 27 Nov 2010 11:53:06 -0500
bearophile <bearophileHUGS lycos.com> wrote:

 Timings, dmd compiler, best of 4, seconds:

   Psy:  1.59


=20
=20
 2.3 GHz CPU.
 dmd 2.050, ldc based on DMD v1.057 and llvm 2.6, gdc head based on GCC 4.=
4.5.
=20
 --------------------------
=20

 import psyco
=20
 def test(data):
     count =3D 0
     for i in xrange(len(data) - 3):
         codon =3D data[i : i + 3]
         if codon =3D=3D "TAG" or codon =3D=3D "TGA" or codon =3D=3D "TAA":
             count +=3D 1
     return count
=20
 def main():
     data =3D open("data.txt").read()
     data =3D data * 300
     print test(data)
=20
 psyco.full()
 main()
=20
 --------------------------
=20
 To produce some test data:
=20
=20

 from itertools import islice
=20
 def rnd(ia =3D 3877, ic =3D 29573, im =3D 139968):
     seed =3D 42
     imf =3D float(im)
     while 1:
         seed =3D (seed * ia + ic) % im
         r =3D seed / imf
         yield "ACGT"[int(r * 4)]
=20
=20
 def main():
     data =3D "".join(islice(rnd(), 0, 120000))
     file("data.txt", "w").write(data)
=20
 main()
=20
 --------------------------
=20
 A D2 traslation (I have used printf to reduce a lot the asm produced):
=20
=20

 import std.file: read;
 import std.c.stdio: printf;
=20
 int test(char[] data) {
     int count;
     foreach (i; 0 ..  data.length - 3) {
         char[] codon =3D data[i .. i + 3];
         if (codon =3D=3D "TAG" || codon =3D=3D "TGA" || codon =3D=3D "TAA=
")
             count++;
     }
     return count;
 }
=20
 void main() {
     char[] data0 =3D cast(char[])read("data.txt");
     int n =3D 300;
     char[] data =3D new char[data0.length * n];
     for (size_t pos; pos < data.length; pos +=3D data0.length)
         data[pos .. pos+data0.length] =3D data0;
=20
     printf("%d\n", test(data));
 }
=20
 --------------------------
=20

 import std.file: read;
 import std.c.stdio: printf;
=20
 bool cmp3(char[] s1, string s2) {
     return s1[0] =3D=3D s2[0] && s1[1] =3D=3D s2[1] && s1[2] =3D=3D s2[2];
 }
=20
 int test(char[] data) {
     int count;
     foreach (i; 0 ..  data.length - 3) {
         char[] codon =3D data[i .. i + 3];
         if (cmp3(codon, "TAG") || cmp3(codon, "TGA") || cmp3(codon, "TAA"=
))
             count++;
     }
     return count;
 }
=20
 void main() {
     char[] data0 =3D cast(char[])read("data.txt");
     int n =3D 300;
     char[] data =3D new char[data0.length * n];
     for (size_t pos; pos < data.length; pos +=3D data0.length)
         data[pos .. pos+data0.length] =3D data0;
=20
     printf("%d\n", test(data));
 }
Impressive! What's your diagnostic on why basic string compare parforms the way it does? And have you tried a naive hand-written strComp (just for the sake of compl= eteness) like: bool strComp (string s1, string s2) { auto l =3D s1.length; if (l !=3D s2.length) return false; foreach (uint i ; 0..l) if (s1[i] !=3D s2[i]) return false; return true; } (I bet this will perform at about the same order of magnitude as your versi= Also, is there a way to bit-compare given memory areas at much higher speed= than element per element (I mean for arrays in general)? Denis -- -- -- -- -- -- -- vit esse estrany =E2=98=A3 spir.wikidot.com
Nov 27 2010
parent reply bearophile <bearophileHUGS lycos.com> writes:
denis.spir

 What's your diagnostic on why basic string compare parforms the way it does?
I prefer the numbers and asm listings speak for themselves, because my diagnostic sometimes is wrong. But I guess it's caused by three not inlined calls to __adEq2 each loop, done to compare 3 char long arrays.
 And have you tried a naive hand-written strComp (just for the sake of
completeness) like:
Here it is: Timings, dmd compiler, best of 4, seconds: Psy: 1.59 import std.file: read; import std.c.stdio: printf; bool cmp2(char[] s1, string s2) { if (s1.length != s2.length) return false; foreach (uint i ; 0 .. s1.length) if (s1[i] != s2[i]) return false; return true; } int test(char[] data) { int count; foreach (i; 0 .. data.length - 3) { char[] codon = data[i .. i + 3]; if (cmp2(codon, "TAG") || cmp2(codon, "TGA") || cmp2(codon, "TAA")) count++; } return count; } void main() { char[] data0 = cast(char[])read("data.txt"); int n = 300; char[] data = new char[data0.length * n]; for (size_t pos; pos < data.length; pos += data0.length) data[pos .. pos+data0.length] = data0; printf("%d\n", test(data)); }
 Also, is there a way to bit-compare given memory areas at much higher speed
than element per element (I mean for arrays in general)?
I don't know. I think you can't. Bye, bearophile
Nov 27 2010
parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Saturday 27 November 2010 14:46:00 bearophile wrote:
 denis.spir
 Also, is there a way to bit-compare given memory areas at much higher
 speed than element per element (I mean for arrays in general)?
I don't know. I think you can't.
You might be able to do that for primitive types and for structs with no opEquals() and with no member variables with an opEquals(), but you certainly can't do it in the general case, since in the general case, you'd have to be calling opEquals() on each element individually. - Jonathan M Davis
Nov 27 2010
prev sibling next sibling parent reply Kagamin <spam here.lot> writes:
bearophile Wrote:

 Also, is there a way to bit-compare given memory areas at much higher speed
than element per element (I mean for arrays in general)?
I don't know. I think you can't.
You can use memcmp, though only for utf-8 strings.
Nov 27 2010
parent reply Stewart Gordon <smjg_1998 yahoo.com> writes:
On 27/11/2010 23:04, Kagamin wrote:
 bearophile Wrote:

 Also, is there a way to bit-compare given memory areas at much
 higher speed than element per element (I mean for arrays in
 general)?
I don't know. I think you can't.
You can use memcmp, though only for utf-8 strings.
Only for utf-8 strings? Why's that? I would've thought memcmp to be type agnostic. Stewart.
Nov 28 2010
next sibling parent reply bearophobic <notbear cave.net> writes:
Stewart Gordon Wrote:

 On 27/11/2010 23:04, Kagamin wrote:
 bearophile Wrote:

 Also, is there a way to bit-compare given memory areas at much
 higher speed than element per element (I mean for arrays in
 general)?
I don't know. I think you can't.
You can use memcmp, though only for utf-8 strings.
Only for utf-8 strings? Why's that? I would've thought memcmp to be type agnostic. Stewart.
D community is amazing cult of premature optimization fans. Any one of you heard of canonically equivalent sequences? The integrated Unicode support is a clusterfuck. Please do compare ASCII strings with memcmp, but no Unicode. Where did the original poster pull this problem from, his ass? "My system runs 100,000,000,000 instructions per second, but this comparison of 4 letter strings uses 5 cycles too much! This is the only problem on the way to world domination with my $500 Microsoft Word clone!". No wait, the problems are completely imaginatory. Bye, bearophobic
Nov 28 2010
parent reply Michel Fortin <michel.fortin michelf.com> writes:
On 2010-11-28 20:57:38 -0500, bearophobic <notbear cave.net> said:

 Stewart Gordon Wrote:
 
 On 27/11/2010 23:04, Kagamin wrote:
 bearophile Wrote:
 
 Also, is there a way to bit-compare given memory areas at much
 higher speed than element per element (I mean for arrays in
 general)?
I don't know. I think you can't.
You can use memcmp, though only for utf-8 strings.
Only for utf-8 strings? Why's that? I would've thought memcmp to be type agnostic. Stewart.
D community is amazing cult of premature optimization fans. Any one of you heard of canonically equivalent sequences? The integrated Unicode support is a clusterfuck. Please do compare ASCII strings with memcmp, but no Unicode. Where did the original poster pull this problem from, his ass? "My system runs 100,000,000,000 instructions per second, but this comparison of 4 letter strings uses 5 cycles too much! This is the only problem on the way to world domination with my $500 Microsoft Word clone!". No wait, the problems are completely imaginatory.
Comparing unicode UTF-* strings using memcmp is fine as long as what you want to know is whether the code points are the same. If your point was that per-code-point comparisons aren't the right way to compare Unicode strings (in most situations), then I support this view too. Though if that's what you wanted to say, you could have made your point clearer. -- Michel Fortin michel.fortin michelf.com http://michelf.com/
Nov 28 2010
parent reply Bruno Medeiros <brunodomedeiros+spam com.gmail> writes:
On 29/11/2010 02:11, Michel Fortin wrote:
 On 2010-11-28 20:57:38 -0500, bearophobic <notbear cave.net> said:

 Stewart Gordon Wrote:

 On 27/11/2010 23:04, Kagamin wrote:
 bearophile Wrote:

 Also, is there a way to bit-compare given memory areas at much
 higher speed than element per element (I mean for arrays in
 general)?
I don't know. I think you can't.
You can use memcmp, though only for utf-8 strings.
Only for utf-8 strings? Why's that? I would've thought memcmp to be type agnostic. Stewart.
D community is amazing cult of premature optimization fans. Any one of you heard of canonically equivalent sequences? The integrated Unicode support is a clusterfuck. Please do compare ASCII strings with memcmp, but no Unicode. Where did the original poster pull this problem from, his ass? "My system runs 100,000,000,000 instructions per second, but this comparison of 4 letter strings uses 5 cycles too much! This is the only problem on the way to world domination with my $500 Microsoft Word clone!". No wait, the problems are completely imaginatory.
Comparing unicode UTF-* strings using memcmp is fine as long as what you want to know is whether the code points are the same. If your point was that per-code-point comparisons aren't the right way to compare Unicode strings (in most situations), then I support this view too. Though if that's what you wanted to say, you could have made your point clearer.
Why are people still replying to nameless trolls? There has been several cases of that in recent threads. :/ -- Bruno Medeiros - Software Engineer
Dec 07 2010
parent reply =?ISO-8859-1?Q?Br=fcno_Mediocre?= <gay mail.com> writes:
Bruno Medeiros Wrote:

 On 29/11/2010 02:11, Michel Fortin wrote:
 On 2010-11-28 20:57:38 -0500, bearophobic <notbear cave.net> said:

 Stewart Gordon Wrote:

 On 27/11/2010 23:04, Kagamin wrote:
 bearophile Wrote:

 Also, is there a way to bit-compare given memory areas at much
 higher speed than element per element (I mean for arrays in
 general)?
I don't know. I think you can't.
You can use memcmp, though only for utf-8 strings.
Only for utf-8 strings? Why's that? I would've thought memcmp to be type agnostic. Stewart.
D community is amazing cult of premature optimization fans. Any one of you heard of canonically equivalent sequences? The integrated Unicode support is a clusterfuck. Please do compare ASCII strings with memcmp, but no Unicode. Where did the original poster pull this problem from, his ass? "My system runs 100,000,000,000 instructions per second, but this comparison of 4 letter strings uses 5 cycles too much! This is the only problem on the way to world domination with my $500 Microsoft Word clone!". No wait, the problems are completely imaginatory.
Comparing unicode UTF-* strings using memcmp is fine as long as what you want to know is whether the code points are the same. If your point was that per-code-point comparisons aren't the right way to compare Unicode strings (in most situations), then I support this view too. Though if that's what you wanted to say, you could have made your point clearer.
Why are people still replying to nameless trolls? There has been several cases of that in recent threads. :/
Trololol. Maybe they're a bit dumb, my brother. If they some day become smarter, they'll stop using D. They see how much shit it is. I miss my wife. Oh god.... bring back my life! Bring me my.. sandwich!
Dec 08 2010
parent reply Bruno Medeiros <brunodomedeiros+spam com.gmail> writes:
On 08/12/2010 15:02, Brüno Mediocre wrote:
 Bruno Medeiros Wrote:
 Why are people still replying to nameless trolls? There has been several
 cases of that in recent threads. :/
Trololol. Maybe they're a bit dumb, my brother. If they some day become smarter, they'll stop using D. They see how much shit it is. I miss my wife. Oh god.... bring back my life! Bring me my.. sandwich!
Lol, "Brüno Mediocre", well thought, that's actually funny. -- Brüno Mediocre - Software Engineer
Dec 09 2010
parent Daniel Gibson <metalcaedes gmail.com> writes:
Bruno Medeiros schrieb:
 On 08/12/2010 15:02, Brüno Mediocre wrote:
 Bruno Medeiros Wrote:
 Why are people still replying to nameless trolls? There has been several
 cases of that in recent threads. :/
Trololol. Maybe they're a bit dumb, my brother. If they some day become smarter, they'll stop using D. They see how much shit it is. I miss my wife. Oh god.... bring back my life! Bring me my.. sandwich!
Lol, "Brüno Mediocre", well thought, that's actually funny.
Nah, I think it's a rather mediocre joke.
Dec 09 2010
prev sibling parent "Robert Jacques" <sandford jhu.edu> writes:
On Sun, 28 Nov 2010 20:32:24 -0500, Stewart Gordon <smjg_1998 yahoo.com>  
wrote:

 On 27/11/2010 23:04, Kagamin wrote:
 bearophile Wrote:

 Also, is there a way to bit-compare given memory areas at much
 higher speed than element per element (I mean for arrays in
 general)?
I don't know. I think you can't.
You can use memcmp, though only for utf-8 strings.
Only for utf-8 strings? Why's that? I would've thought memcmp to be type agnostic. Stewart.
memcmp is type agnostic if all you want to compare is equality. The other use of memcmp is essentially as an opCmp, in which case it would be type sensitive.
Nov 28 2010
prev sibling next sibling parent reply Kagamin <spam here.lot> writes:
You need only match 3-char strings?
Try this hack

bool cmp3c(string a, string etalon) pure
{
  if(a.length!=3)return false;
  return *cast(short*)a.ptr==*cast(short*)b.ptr && a[2]==b[2];
}
Nov 27 2010
next sibling parent Kagamin <spam here.lot> writes:
or

bool cmp3c(string a, string b) pure
{
  if(a.length!=3)return false;
  return *cast(char[3]*)a.ptr==*cast(char[3]*)b.ptr;
}

llvm.memcmp should rock with lengths known at compile time.
Nov 27 2010
prev sibling next sibling parent bearophile <bearophileHUGS lycos.com> writes:
Kagamin:

 You need only match 3-char strings?
 Try this hack
enough. But while some things are better left to Phobos, comparing short strings efficiently is the compiler's job, not mine (and I am not sure your trick works in C99). Bye, bearophile
Nov 27 2010
prev sibling parent =?UTF-8?B?IkrDqXLDtG1lIE0uIEJlcmdlciI=?= <jeberger free.fr> writes:
Kagamin wrote:
 You need only match 3-char strings?
 Try this hack
=20
 bool cmp3c(string a, string etalon) pure
 {
   if(a.length!=3D3)return false;
   return *cast(short*)a.ptr=3D=3D*cast(short*)b.ptr && a[2]=3D=3Db[2];
 }
This doesn't work on some architectures if a.ptr or b.ptr is odd... Jerome --=20 mailto:jeberger free.fr http://jeberger.free.fr Jabber: jeberger jabber.fr
Nov 28 2010
prev sibling next sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
I have done another test:

Timings, dmd compiler, best of 4, seconds:



  Psy:  1.59





import std.file: read;
import std.c.stdio: printf;

int test(char[] data) {
    int count;
    foreach (i; 0 ..  data.length - 3) {
        char[] codon = data[i .. i + 3];
        if ((codon.length == 3 && codon[0] == 'T' && codon[1] == 'A' &&
codon[2] == 'G') ||
            (codon.length == 3 && codon[0] == 'T' && codon[1] == 'G' &&
codon[2] == 'A') ||
            (codon.length == 3 && codon[0] == 'T' && codon[1] == 'A' &&
codon[2] == 'A'))
            count++;
    }
    return count;
}

void main() {
    char[] data0 = cast(char[])read("data.txt");
    int n = 300;
    char[] data = new char[data0.length * n];
    for (size_t pos; pos < data.length; pos += data0.length)
        data[pos .. pos+data0.length] = data0;

    printf("%d\n", test(data));
}


So when there is to compare among strings known at compile-time to be small
(like < 6 char), the comparison shall be replaced with inlined single char
comparisons. This makes the code longer so it increases code cache pressure,
but seeing how much slow the alternative is, I think it's an improvement.

(A smart compiler is even able to remove the codon.length==3 test because the
slice data[i..i+3] is always of length 3 (here mysteriously if you remove those
three length tests the program compiled with dmd gets slower)).

Bye,
bearophile
Nov 27 2010
parent reply "Robert Jacques" <sandford jhu.edu> writes:
On Sat, 27 Nov 2010 22:08:29 -0500, bearophile <bearophileHUGS lycos.com>  
wrote:
 I have done another test:

 Timings, dmd compiler, best of 4, seconds:



   Psy:  1.59





 import std.file: read;
 import std.c.stdio: printf;

 int test(char[] data) {
     int count;
     foreach (i; 0 ..  data.length - 3) {
         char[] codon = data[i .. i + 3];
         if ((codon.length == 3 && codon[0] == 'T' && codon[1] == 'A' &&  
 codon[2] == 'G') ||
             (codon.length == 3 && codon[0] == 'T' && codon[1] == 'G' &&  
 codon[2] == 'A') ||
             (codon.length == 3 && codon[0] == 'T' && codon[1] == 'A' &&  
 codon[2] == 'A'))
             count++;
     }
     return count;
 }

 void main() {
     char[] data0 = cast(char[])read("data.txt");
     int n = 300;
     char[] data = new char[data0.length * n];
     for (size_t pos; pos < data.length; pos += data0.length)
         data[pos .. pos+data0.length] = data0;

     printf("%d\n", test(data));
 }


 So when there is to compare among strings known at compile-time to be  
 small (like < 6 char), the comparison shall be replaced with inlined  
 single char comparisons. This makes the code longer so it increases code  
 cache pressure, but seeing how much slow the alternative is, I think  
 it's an improvement.

 (A smart compiler is even able to remove the codon.length==3 test  
 because the slice data[i..i+3] is always of length 3 (here mysteriously  
 if you remove those three length tests the program compiled with dmd  
 gets slower)).

 Bye,
 bearophile
Hi bearophile, I've spent some time having fun this afternoon optimizing array-equals using vectorization techniques. I found that vectorizing using ulongs worked best on my system except with the shortest strings, where a simple Duff's device edged it out. If you'd like to try it out on your data set: bool arrayComp3(bool useBitCompare = true,T)(const T[] a, const T[] b) pure nothrow { if(a.length != b.length) return false; static if(useBitCompare) { auto pab = cast(ubyte*)a.ptr; auto pbb = cast(ubyte*)b.ptr; if(pab is pbb) return true; auto byte_length = a.length*T.sizeof; auto pa_end = cast(ulong*)(pab+byte_length); final switch (byte_length % ulong.sizeof) { case 7: if(*pab++ != *pbb++ ) return false; case 6: if(*pab++ != *pbb++ ) return false; case 5: if(*pab++ != *pbb++ ) return false; case 4: if(*pab++ != *pbb++ ) return false; case 3: if(*pab++ != *pbb++ ) return false; case 2: if(*pab++ != *pbb++ ) return false; case 1: if(*pab++ != *pbb++ ) return false; case 0: } auto pa = cast(ulong*)pab; auto pb = cast(ulong*)pbb; while (pa < pa_end) { if(*pa++ != *pb++ ) return false; } } else { // default to a short duff's device auto pa = a.ptr; auto pb = b.ptr; if(pa is pb) return true; auto n = (a.length + 3) / 4; final switch (a.length % 4) { case 0: do { if(*pa++ != *pb++ ) return false; case 3: if(*pa++ != *pb++ ) return false; case 2: if(*pa++ != *pb++ ) return false; case 1: if(*pa++ != *pb++ ) return false; } while (--n > 0); } } return true; }
Nov 27 2010
parent reply bearophile <bearophileHUGS lycos.com> writes:
Robert Jacques:

 I've spent some time having fun this afternoon optimizing array-equals  
 using vectorization techniques. I found that vectorizing using ulongs  
 worked best on my system except with the shortest strings, where a simple  
 Duff's device edged it out. If you'd like to try it out on your data set:
Thank you for your work :-) import std.file: read; import std.c.stdio: printf; import std.exception: assumeUnique; bool arrayComp(bool useBitCompare=true, T) (const T[] a, const T[] b) pure nothrow { if (a.length != b.length) return false; static if (useBitCompare) { auto pab = cast(ubyte*)a.ptr; auto pbb = cast(ubyte*)b.ptr; if (pab is pbb) return true; auto byte_length = a.length * T.sizeof; auto pa_end = cast(ulong*)(pab + byte_length); final switch (byte_length % ulong.sizeof) { case 7: if (*pab++ != *pbb++) return false; case 6: if (*pab++ != *pbb++) return false; case 5: if (*pab++ != *pbb++) return false; case 4: if (*pab++ != *pbb++) return false; case 3: if (*pab++ != *pbb++) return false; case 2: if (*pab++ != *pbb++) return false; case 1: if (*pab++ != *pbb++) return false; case 0: } auto pa = cast(ulong*)pab; auto pb = cast(ulong*)pbb; while (pa < pa_end) { if (*pa++ != *pb++) return false; } } else { // default to a short duff's device auto pa = a.ptr; auto pb = b.ptr; if (pa == pb) return true; auto n = (a.length + 3) / 4; final switch (a.length % 4) { case 0: do { if (*pa++ != *pb++) return false; case 3: if (*pa++ != *pb++) return false; case 2: if (*pa++ != *pb++) return false; case 1: if (*pa++ != *pb++) return false; } while (--n > 0); } } return true; } int test(string data) { int count; foreach (i; 0 .. data.length - 3) { auto codon = data[i .. i + 3]; if (arrayComp(codon, "TAG") || arrayComp(codon, "TGA") || arrayComp(codon, "TAA")) count++; } return count; } void main() { char[] data0 = cast(char[])read("data.txt"); int n = 300; char[] data = new char[data0.length * n]; for (size_t pos; pos < data.length; pos += data0.length) data[pos .. pos+data0.length] = data0; string sdata = assumeUnique(data); printf("%d\n", test(sdata)); } Timings, dmd compiler, best of 4, seconds: Psy: 1.59 Your function can't be inlined because it's big, so this code isn't faster than inlined code like this generated by the compiler: (codon.length == 3 && codon[0] == 'T' && codon[1] == 'A' && codon[2] == 'G') Bye, bearophile
Nov 28 2010
next sibling parent reply =?ISO-8859-1?Q?Pelle_M=E5nsson?= <pelle.mansson gmail.com> writes:
On 11/28/2010 12:44 PM, bearophile wrote:
 Robert Jacques:

 I've spent some time having fun this afternoon optimizing array-equals
 using vectorization techniques. I found that vectorizing using ulongs
 worked best on my system except with the shortest strings, where a simple
 Duff's device edged it out. If you'd like to try it out on your data set:
Thank you for your work :-) import std.file: read; import std.c.stdio: printf; import std.exception: assumeUnique; bool arrayComp(bool useBitCompare=true, T) (const T[] a, const T[] b) pure nothrow { if (a.length != b.length) return false; static if (useBitCompare) { auto pab = cast(ubyte*)a.ptr; auto pbb = cast(ubyte*)b.ptr; if (pab is pbb) return true; auto byte_length = a.length * T.sizeof; auto pa_end = cast(ulong*)(pab + byte_length); final switch (byte_length % ulong.sizeof) { case 7: if (*pab++ != *pbb++) return false; case 6: if (*pab++ != *pbb++) return false; case 5: if (*pab++ != *pbb++) return false; case 4: if (*pab++ != *pbb++) return false; case 3: if (*pab++ != *pbb++) return false; case 2: if (*pab++ != *pbb++) return false; case 1: if (*pab++ != *pbb++) return false; case 0: } auto pa = cast(ulong*)pab; auto pb = cast(ulong*)pbb; while (pa< pa_end) { if (*pa++ != *pb++) return false; } } else { // default to a short duff's device auto pa = a.ptr; auto pb = b.ptr; if (pa == pb) return true; auto n = (a.length + 3) / 4; final switch (a.length % 4) { case 0: do { if (*pa++ != *pb++) return false; case 3: if (*pa++ != *pb++) return false; case 2: if (*pa++ != *pb++) return false; case 1: if (*pa++ != *pb++) return false; } while (--n> 0); } } return true; } int test(string data) { int count; foreach (i; 0 .. data.length - 3) { auto codon = data[i .. i + 3]; if (arrayComp(codon, "TAG") || arrayComp(codon, "TGA") || arrayComp(codon, "TAA")) count++; } return count; } void main() { char[] data0 = cast(char[])read("data.txt"); int n = 300; char[] data = new char[data0.length * n]; for (size_t pos; pos< data.length; pos += data0.length) data[pos .. pos+data0.length] = data0; string sdata = assumeUnique(data); printf("%d\n", test(sdata)); } Timings, dmd compiler, best of 4, seconds: Psy: 1.59 Your function can't be inlined because it's big, so this code isn't faster than inlined code like this generated by the compiler: (codon.length == 3&& codon[0] == 'T'&& codon[1] == 'A'&& codon[2] == 'G') Bye, bearophile
I don't have your data set, but for me using random data this was within int test(string data) { int count; while (true) { data = data.find("TAG", "TGA", "TAA")[0]; if (data.empty) return count; count += 1; data.popFront; } } Also, this one is far easier to generalize for strings of different lengths, and such :-)
Nov 28 2010
parent reply bearophile <bearophileHUGS lycos.com> writes:
Pelle Mĺnsson:

 I don't have your data set,
In my first post I have explained how to create the data set, using a little Python script that I have listed there. You just need a Python2 interpreter.

any fiddly code.
Thank you for your code. I have added your version, plus a modified version ------------------------- import std.file: read; import std.c.stdio: printf; import std.algorithm: find; import std.array: empty, popFront; int test(char[] data) { int count; while (true) { data = data.find("TAG", "TGA", "TAA")[0]; if (data.empty) return count; count++; data.popFront(); } } void main() { char[] data0 = cast(char[])read("data.txt"); int n = 300; char[] data = new char[data0.length * n]; for (size_t pos; pos < data.length; pos += data0.length) data[pos .. pos + data0.length] = data0; printf("%d\n", test(data)); } ------------------------- import std.file: read; import std.c.stdio: printf; import std.algorithm: find; import std.array: empty, popFront; int test(ubyte[] data) { int count; while (true) { data = data.find("TAG", "TGA", "TAA")[0]; if (data.empty) return count; count++; data.popFront(); } } void main() { ubyte[] data0 = cast(ubyte[])read("data.txt"); int n = 300; ubyte[] data = new ubyte[data0.length * n]; for (size_t pos; pos < data.length; pos += data0.length) data[pos .. pos + data0.length] = data0; printf("%d\n", test(data)); } ------------------------- Timings, dmd compiler, best of 4, seconds: Psy: 1.59 So on this PC it's not much fast. And generally I don't like to use find, empty and popFront to solve this problem, it's quite unnatural. Here I have explained what I think is a good enough solution to this performance problem: http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.D&article_id=123044 Bye, bearophile
Nov 28 2010
next sibling parent spir <denis.spir gmail.com> writes:
On Sun, 28 Nov 2010 10:48:46 -0500
bearophile <bearophileHUGS lycos.com> wrote:

 generally I don't like to use find, empty and popFront to solve this prob=
lem, it's quite unnatural. Here I have explained what I think is a good eno= ugh solution to this performance problem:
 http://www.digitalmars.com/webnews/newsgroups.php?art_group=3Ddigitalmars=
.D&article_id=3D123044 What I would like is a just decent routine for comparing arrays of elements= which _conceptually_ are values, whatever their complexity (meaning, possi= bly structs or nested arrays). This should always (*) use a default '=3D=3D= ' comparison, which in turn may perform a kind of memcomp. I have no idea how to avoid bad perf such as shown by your tests, but this = is really necessary. Tons of routines and algorithms use '=3D=3D', and amon= g the most used; str.find will perform an arbitrary number of '=3D=3D' chec= ks; and I don't even want to think at pattern matching ;-) Denis (*) I think that: * If an element type is a value type, it should not overide opEquals. '=3D= =3D' should translate to equality of every sub-element. Else, there is some= thing wrong in the design. But there may be use cases I overlook. (Eg metad= ata stored on the element itself?) * A plain value may well refer to a "thing"/object, eg a wiget's outlook po= inting to a color palette, in which case reference comparison is the right = choice -- so, no complication. * Real things (referenced objects, entities with self/id) should not have o= pEquals at all. They should be compare only by id, meaning using 'is'. This= shows again the ambivalence of dyn arrays... -- -- -- -- -- -- -- vit esse estrany =E2=98=A3 spir.wikidot.com
Nov 28 2010
prev sibling parent =?ISO-8859-1?Q?Pelle_M=E5nsson?= <pelle.mansson gmail.com> writes:
On 11/28/2010 04:48 PM, bearophile wrote:
 Pelle Mĺnsson:

 I don't have your data set,
In my first post I have explained how to create the data set, using a little Python script that I have listed there. You just need a Python2 interpreter.

any fiddly code.
Thank you for your code. I have added your version, plus a modified version ------------------------- import std.file: read; import std.c.stdio: printf; import std.algorithm: find; import std.array: empty, popFront; int test(char[] data) { int count; while (true) { data = data.find("TAG", "TGA", "TAA")[0]; if (data.empty) return count; count++; data.popFront(); } } void main() { char[] data0 = cast(char[])read("data.txt"); int n = 300; char[] data = new char[data0.length * n]; for (size_t pos; pos< data.length; pos += data0.length) data[pos .. pos + data0.length] = data0; printf("%d\n", test(data)); } ------------------------- import std.file: read; import std.c.stdio: printf; import std.algorithm: find; import std.array: empty, popFront; int test(ubyte[] data) { int count; while (true) { data = data.find("TAG", "TGA", "TAA")[0]; if (data.empty) return count; count++; data.popFront(); } } void main() { ubyte[] data0 = cast(ubyte[])read("data.txt"); int n = 300; ubyte[] data = new ubyte[data0.length * n]; for (size_t pos; pos< data.length; pos += data0.length) data[pos .. pos + data0.length] = data0; printf("%d\n", test(data)); } ------------------------- Timings, dmd compiler, best of 4, seconds: Psy: 1.59 So on this PC it's not much fast. And generally I don't like to use find, empty and popFront to solve this problem, it's quite unnatural. Here I have explained what I think is a good enough solution to this performance problem: http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.D&article_id=123044 Bye, bearophile
Measuring incorrectly in a performance test is a silly mistake, I was indeed wrong about the factor two thing! I thought find was specialized for chars so it wouldn't require casting to ubyte. Maybe I was wrong again. Thank you for your engaging performance oriented posts. :-)
Nov 28 2010
prev sibling parent reply "Robert Jacques" <sandford jhu.edu> writes:
On Sun, 28 Nov 2010 06:44:23 -0500, bearophile <bearophileHUGS lycos.com>  
wrote:
 Robert Jacques:

 I've spent some time having fun this afternoon optimizing array-equals
 using vectorization techniques. I found that vectorizing using ulongs
 worked best on my system except with the shortest strings, where a  
 simple
 Duff's device edged it out. If you'd like to try it out on your data  
 set:
Thank you for your work :-)
[snip]
 Your function can't be inlined because it's big, so this code isn't  
 faster than inlined code like this generated by the compiler:
 (codon.length == 3 && codon[0] == 'T' && codon[1] == 'A' && codon[2] ==  
 'G')

 Bye,
 bearophile
Still, part of the point was that string comparisons in general were way too slow. Anyways, I've applied the same technique in a partially unrolled version if you want to check it out: bool arrayComp(T, size_t N)(const T[] a, ref const T[N] b) pure nothrow { static if(T.sizeof*N <= uint.sizeof) { return a.length == N && !( (*(cast(uint*)a.ptr) ^ *(cast(uint*)b.ptr)) & (uint.max >> 8*(uint.sizeof - T.sizeof*N) )); } else static if(T.sizeof*N <= ulong.sizeof) { return a.length == N && !( (*(cast(ulong*)a.ptr) ^ *(cast(ulong*)b.ptr)) & (ulong.max>> 8*(ulong.sizeof - T.sizeof*N) )); } else { // Fall back to a loop if(a.length != N || (*(cast(ulong*)a.ptr) != *(cast(ulong*)b.ptr)) ) return false; enum alignment = T.sizeof*N % ulong.sizeof > 0 ? T.sizeof*N % ulong.sizeof : ulong.sizeof; auto pa = cast(ulong*)(cast(ubyte*)a.ptr + alignment); auto pb = cast(ulong*)(cast(ubyte*)b.ptr + alignment); auto pa_end = cast(ulong*)(cast(ubyte*)a.ptr + T.sizeof*N); while (pa < pa_end) { if(*pa++ != *pb++ ) return false; } return true; } } Be warned that you'll have to use strings explicitly typed as immutable char[N], since both array literals and string literals won't match a this template.
Nov 28 2010
parent bearophile <bearophileHUGS lycos.com> writes:
Robert Jacques:

 Still, part of the point was that string comparisons in general were way  
 too slow. Anyways, I've applied the same technique in a partially unrolled  
 version if you want to check it out:
import std.file: read; import std.c.stdio: printf; // not formatted bool arrayComp(T, size_t N)(const T[] a, ref const T[N] b) pure nothrow { static if(T.sizeof*N <= uint.sizeof) { return a.length == N && !( (*(cast(uint*)a.ptr) ^ *(cast(uint*)b.ptr)) & (uint.max >> 8*(uint.sizeof - T.sizeof*N) )); } else static if(T.sizeof*N <= ulong.sizeof) { return a.length == N && !( (*(cast(ulong*)a.ptr) ^ *(cast(ulong*)b.ptr)) & (ulong.max>> 8*(ulong.sizeof - T.sizeof*N) )); } else { // Fall back to a loop if(a.length != N || (*(cast(ulong*)a.ptr) != *(cast(ulong*)b.ptr)) ) return false; enum alignment = T.sizeof*N % ulong.sizeof > 0 ? T.sizeof*N % ulong.sizeof : ulong.sizeof; auto pa = cast(ulong*)(cast(ubyte*)a.ptr + alignment); auto pb = cast(ulong*)(cast(ubyte*)b.ptr + alignment); auto pa_end = cast(ulong*)(cast(ubyte*)a.ptr + T.sizeof*N); while (pa < pa_end) { if(*pa++ != *pb++ ) return false; } return true; } } pure int test(const char[] data) { int count; foreach (i; 0 .. data.length - 3) { const char[] codon = data[i .. i + 3]; if (arrayComp!(char, 3)(codon, "TAG") || arrayComp!(char, 3)(codon, "TGA") || arrayComp!(char, 3)(codon, "TAA")) count++; } return count; } void main() { char[] data0 = cast(char[])read("data.txt"); int n = 300; char[] data = new char[data0.length * n]; for (size_t pos; pos < data.length; pos += data0.length) data[pos .. pos + data0.length] = data0; printf("%d\n", test(data)); } Timings, dmd compiler, best of 4, seconds: Psy: 1.59 This problem probably needs to be solved by the compiler. Bye, bearophile
Nov 28 2010
prev sibling parent reply Austin Hastings <ah08010-d yahoo.com> writes:
On 11/27/2010 11:53 AM, bearophile wrote:
 While translating a common Python script to D, I have found something
interesting, so I have reduced it to a little benchmark.

 Below there is the reduced Python2 code (it uses Psyco), and a little program
to generate some test data. The timing of the first D2 version is not good
compared to the Python-Psyco program (the generation of the *300 array is a
quick thing), so I have created two more D2 versions to show myself that D2
wasn't broken :-)

 The reduced code looks like a syntetic benchmark, but it has about the same
performance profile of a 60 lines long Python script (the original code was
using xrange(0,len(...)-3,3)  instead of xrange(len(...)-3), but the situation
doesn't change much).
It's not clear to me if the point of your post is "how do I make this go faster?" or "Waa! D underperforms Python by default". If it is the former, a state machine is probably the right answer for long input strings. start -> E E A,C,G:-> E E T:-> T T A:-> TA T C:-> E T G:-> TG T T:-> T TA A,G:-> E, {++count} TA C:-> E TA T:-> T TG A:-> E, {++count} TG C,G:-> E TG T:-> T Since the probability of failure is about .75, 0.5, {.5 or .75}, the overall comparison cost is about 1 + .25 + .125 = 1.375, times compare+loop times n, plus all the function call overhead, etc. Using a DFA should reduce the cost to 1 * index, times n, assuming you are using a small alphabet (I'm guessing that your codons are ascii+uppercase only, table-size = 26). =Austin
Nov 27 2010
next sibling parent reply Jonathan M Davis <jmdavisProg gmx.com> writes:
On Saturday 27 November 2010 22:13:39 Austin Hastings wrote:
 On 11/27/2010 11:53 AM, bearophile wrote:
 While translating a common Python script to D, I have found something
 interesting, so I have reduced it to a little benchmark.
 
 Below there is the reduced Python2 code (it uses Psyco), and a little
 program to generate some test data. The timing of the first D2 version
 is not good compared to the Python-Psyco program (the generation of the
 *300 array is a quick thing), so I have created two more D2 versions to
 show myself that D2 wasn't broken :-)
 
 The reduced code looks like a syntetic benchmark, but it has about the
 same performance profile of a 60 lines long Python script (the original
 code was using xrange(0,len(...)-3,3)  instead of xrange(len(...)-3),
 but the situation doesn't change much).
It's not clear to me if the point of your post is "how do I make this go faster?" or "Waa! D underperforms Python by default".
What I think is pretty clear from this is that at some point, some work needs to be done to optimize array comparisons when memcmp can be used instead of having to call the individual opEquals() of each element. It's not the sort of thing that's likely to be a priority, but it really should be done at some point. - Jonathan M Davis
Nov 27 2010
parent reply Iain Buclaw <ibuclaw ubuntu.com> writes:
== Quote from Jonathan M Davis (jmdavisProg gmx.com)'s article
 On Saturday 27 November 2010 22:13:39 Austin Hastings wrote:
 On 11/27/2010 11:53 AM, bearophile wrote:
 While translating a common Python script to D, I have found something
 interesting, so I have reduced it to a little benchmark.

 Below there is the reduced Python2 code (it uses Psyco), and a little
 program to generate some test data. The timing of the first D2 version
 is not good compared to the Python-Psyco program (the generation of the
 *300 array is a quick thing), so I have created two more D2 versions to
 show myself that D2 wasn't broken :-)

 The reduced code looks like a syntetic benchmark, but it has about the
 same performance profile of a 60 lines long Python script (the original
 code was using xrange(0,len(...)-3,3)  instead of xrange(len(...)-3),
 but the situation doesn't change much).
It's not clear to me if the point of your post is "how do I make this go faster?" or "Waa! D underperforms Python by default".
What I think is pretty clear from this is that at some point, some work needs to be done to optimize array comparisons when memcmp can be used instead of having to call the individual opEquals() of each element. It's not the sort of thing that's likely to be a priority, but it really should be done at some point. - Jonathan M Davis
Just food for thought. GDC uses memcmp when using string comparisons in the first implementation. if (codon == "TAG" || codon == "TGA" || codon == "TAA") And it is still the slowest case of the lot. Regards
Nov 28 2010
parent bearophile <bearophileHUGS lycos.com> writes:
Iain Buclaw:

 GDC uses memcmp when using string comparisons in the first implementation.
 
 if (codon == "TAG" || codon == "TGA" || codon == "TAA")
 
 And it is still the slowest case of the lot.
The asm of the first version of the function compiled with gdc uses "repz cmpsb". And while I don't remember precise timings for it, I think it was about two times faster than the first version compiled with dmd (on Windows). Regarding the idea of using memcmp, I have added a comment to the issue 5282. Bye, bearophile
Nov 28 2010
prev sibling next sibling parent Jonathan M Davis <jmdavisProg gmx.com> writes:
On Saturday 27 November 2010 22:23:39 Jonathan M Davis wrote:
 On Saturday 27 November 2010 22:13:39 Austin Hastings wrote:
 On 11/27/2010 11:53 AM, bearophile wrote:
 While translating a common Python script to D, I have found something
 interesting, so I have reduced it to a little benchmark.
 
 Below there is the reduced Python2 code (it uses Psyco), and a little
 program to generate some test data. The timing of the first D2 version
 is not good compared to the Python-Psyco program (the generation of the
 *300 array is a quick thing), so I have created two more D2 versions to
 show myself that D2 wasn't broken :-)
 
 The reduced code looks like a syntetic benchmark, but it has about the
 same performance profile of a 60 lines long Python script (the original
 code was using xrange(0,len(...)-3,3)  instead of xrange(len(...)-3),
 but the situation doesn't change much).
It's not clear to me if the point of your post is "how do I make this go faster?" or "Waa! D underperforms Python by default".
What I think is pretty clear from this is that at some point, some work needs to be done to optimize array comparisons when memcmp can be used instead of having to call the individual opEquals() of each element. It's not the sort of thing that's likely to be a priority, but it really should be done at some point.
Enhancement Request: http://d.puremagic.com/issues/show_bug.cgi?id=5282
Nov 27 2010
prev sibling next sibling parent bearophile <bearophileHUGS lycos.com> writes:
Austin Hastings:

 It's not clear to me if the point of your post is "how do I make this go 
 faster?" or "Waa! D underperforms Python by default".
The third D version in my post is fast enough, so the point of that post is that a very common string compare operation, done in the natural way, is a bit too much slow in D/DMD.
 If it is the former, a state machine is probably the right answer for 
 long input strings.
This case is very simple, so a simple tree (as in the third D version of my post) is probably good enough. In some similar but more complex situations of genomic processing I've seen that in C it's good to have computed gotos, to implement faster state machines. See my recent post "explicitly unimplemented computed gotos" :-) This was my use case beside the desire to create interpreters in D (it's a similar thing).
 Since the probability of failure is about .75, 0.5, {.5 or .75}, the 
 overall comparison cost is about 1 + .25 + .125 = 1.375, times 
 compare+loop times n, plus all the function call overhead, etc.
All 3 codons start with T, so if you test that first letter before the 3 string compare you speed up the original code a lot. But I was looking for a more general solution. Implementing a state machine is good, but in this case I'd like the compiler to do more optimizations by itself. A simple but good enough optimization for the compiler is to replace tests with strings (known to be short at compile-time) with inlined comparisons of their single chars.
 Using a DFA should reduce the cost to 1 * index, times n, assuming you 
 are using a small alphabet (I'm guessing that your codons are 
 ascii+uppercase only, table-size = 26).
In this case the alphabet was just 4 symbols: {'A', 'C', 'G', 'T'} (with no undecided or modified nucleosides). Bye, bearophile
Nov 28 2010
prev sibling parent reply spir <denis.spir gmail.com> writes:
On Sat, 27 Nov 2010 22:41:45 -0800
Jonathan M Davis <jmdavisProg gmx.com> wrote:

 What I think is pretty clear from this is that at some point, some work
 needs to be done to optimize array comparisons when memcmp can be used
 instead of having to call the individual opEquals() of each element. It=
's
 not the sort of thing that's likely to be a priority, but it really sho=
uld
 be done at some point. =20
=20 Enhancement Request: http://d.puremagic.com/issues/show_bug.cgi?id=3D5282
Yes, but other comments seems to show memcmp only doubles speed. This would= only bring us twice as slow as python ;-) Denis -- -- -- -- -- -- -- vit esse estrany =E2=98=A3 spir.wikidot.com
Nov 28 2010
parent bearophile <bearophileHUGS lycos.com> writes:
spir:

 Yes, but other comments seems to show memcmp only doubles speed. This would
only bring us twice as slow as python ;-)
The solution I suggest for this problem is: when DMD knows at compile-time the length of one of the two strings to equate, and such length is small (like < 6), then it may replace the string eq code like this: (codon == "TAG") With code like: (codon.length == 3 && codon[0] == 'T' && codon[1] == 'A' && codon[2] == 'G') Bye, bearophile
Nov 28 2010