digitalmars.D.announce - Re: DMD 1.034 and 2.018 releases

Pete <example example.com> Aug 11 2008

Walter Bright <newshound1 digitalmars.com> Aug 11 2008
Georg Lukas <georg op-co.de> Aug 12 2008

Don <nospam nospam.com.au> Aug 13 2008

"Dave" <Dave_member pathlink.com> Aug 13 2008

JAnderson <ask me.com> Aug 13 2008

Pete <example example.com> writes:

Walter Bright Wrote:

 This one has (finally) got array operations implemented. For those who 
 want to show off their leet assembler skills, the initial assembler 
 implementation code is in phobos/internal/array*.d. Burton Radons wrote 
 the assembler. Can you make it faster?
 
 http://www.digitalmars.com/d/1.0/changelog.html
 http://ftp.digitalmars.com/dmd.1.034.zip
 
 http://www.digitalmars.com/d/2.0/changelog.html
 http://ftp.digitalmars.com/dmd.2.018.zip


Not sure if someone else has already mentioned this, but would it be possible
for the compiler to align these arrays on 16 byte boundaries in order to
maximise any possible vector efficiency? AFAIK you can't actually specify
an alignment higher than align 8 at the moment, which is a bit of a problem.

Regards,

Aug 11 2008

Walter Bright <newshound1 digitalmars.com> writes:

Pete wrote:
 Not sure if someone else has already mentioned this but would it be
 possible for the compiler to align these arrays on 16 byte boundaries
 in order to maximise any possible vector efficiency. AFAIK you can't
 actually specify align anything higher than align 8 at the moment
 which is a bit of a problem.


Anything allocated with new will be aligned on 16 byte boundaries.
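
A minimal sketch of how one might inspect this, assuming a hypothetical helper named offsetMod16 (not part of Phobos). It also shows why slices still matter: a slice that starts partway into a `new`-allocated array inherits an offset from its start index, so it need not sit on a 16 byte boundary even if the allocation does.

```d
/// Hypothetical helper: offset of `p` past the previous 16-byte boundary.
size_t offsetMod16(const(void)* p)
{
    return cast(size_t) p & 15;
}

unittest
{
    auto a = new float[](64);
    auto s = a[1 .. $]; // starts one float (4 bytes) after a.ptr

    // Whatever offset the allocation has, the slice is 4 bytes further on.
    assert(offsetMod16(s.ptr) == (offsetMod16(a.ptr) + 4) % 16);
}
```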

Aug 11 2008

Georg Lukas <georg op-co.de> writes:

On Mon, 11 Aug 2008 09:55:26 -0400, Pete wrote:
 Walter Bright Wrote:
 This one has (finally) got array operations implemented. For those who
 want to show off their leet assembler skills, the initial assembler
 implementation code is in phobos/internal/array*.d. Burton Radons wrote
 the assembler. Can you make it faster?


 Not sure if someone else has already mentioned this but would it be
 possible for the compiler to align these arrays on 16 byte boundaries in
 order to maximise any possible vector efficiency. AFAIK you can't
 actually specify align anything higher than align 8 at the moment which
 is a bit of a problem.


From a short look at the array*.d source code, it would be better to 
check if source and destination have the same alignment, i.e.:

a = 0xf00d0013 (3 mod 16)
b = 0xdeaffff3 (3 mod 16)

In that case, the first 16-3 = 13 bytes can be handled using regular D 
code, and the aligned SSE version can be used for the rest.

This would also work for slices, at least when both slices have the same 
alignment remainder. I'm just not sure what overhead such a solution 
would impose for small arrays.
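
The split described above could be sketched roughly as follows. This is only an illustration working on byte counts: Split and planSplit are hypothetical names, not anything found in array*.d, and the actual SSE loop is left out.

```d
/// Hypothetical result type: byte counts for each phase of the operation.
struct Split
{
    size_t head; // bytes handled by regular D code before the boundary
    size_t vec;  // bytes handled by the aligned 16-byte loop
    size_t tail; // leftover bytes handled by regular D code
}

/// Returns true and fills `s` when both addresses have the same
/// remainder mod 16; returns false when no common boundary exists.
bool planSplit(size_t dst, size_t src, size_t length, out Split s)
{
    immutable mis = dst & 15;
    if ((src & 15) != mis)
        return false;

    size_t head = mis == 0 ? 0 : 16 - mis;
    if (head > length)
        head = length; // array ends before the boundary is reached
    immutable rest = length - head;
    s = Split(head, rest & ~cast(size_t) 15, rest & 15);
    return true;
}

unittest
{
    // The two addresses from the example above, with 100 bytes to process.
    Split s;
    assert(planSplit(0xf00d0013, 0xdeaffff3, 100, s));
    assert(s == Split(13, 80, 7));
}
```

When planSplit returns false, the caller would presumably fall back to the unaligned code path.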

Georg
-- 
|| http://op-co.de ++  GCS/CM d? s: a-- C+++ UL+++ !P L+++ E--- W++  ++
|| gpg: 0x962FD2DE ||  N++ o? K- w---() O M V? PS+ PE-- Y+ PGP++ t*  ||
|| Ge0rG: euIRCnet ||  5 X+ R tv b+(+++) DI+(+++) D+ G e* h! r* !y+  ||
++ IRCnet OFTC OPN ||________________________________________________||

Aug 12 2008

Don <nospam nospam.com.au> writes:

Georg Lukas wrote:
 On Mon, 11 Aug 2008 09:55:26 -0400, Pete wrote:
 Walter Bright Wrote:
 This one has (finally) got array operations implemented. For those who
 want to show off their leet assembler skills, the initial assembler
 implementation code is in phobos/internal/array*.d. Burton Radons wrote
 the assembler. Can you make it faster?


 Not sure if someone else has already mentioned this but would it be
 possible for the compiler to align these arrays on 16 byte boundaries in
 order to maximise any possible vector efficiency. AFAIK you can't
 actually specify align anything higher than align 8 at the moment which
 is a bit of a problem.


 From a short look at the array*.d source code, it would be better to 
 check if source and destination have the same alignment, i.e.:
 
 a = 0xf00d0013 (3 mod 16)
 b = 0xdeaffff3 (3 mod 16)
 
 In that case, the first 16-3 = 13 bytes can be handled using regular D 
 code, and the aligned SSE version can be used for the rest.
 
 This would also work for slices, at least when both slices have the same 
 alignment remainder. I'm just not sure what overhead such a solution 
 would impose for small arrays.


Just begin with a check for minimal size. If less than that size, don't 
use SSE at all.
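
A rough sketch of such a size check, written at user level for illustration. The names minVectorLength, Path and addInts are made up, the threshold value is a placeholder that would need benchmarking, and the built-in array operation merely stands in for the SSE path.

```d
/// Hypothetical threshold; a real value would have to come from benchmarks.
enum size_t minVectorLength = 32;

/// Which code path was taken (only here to make the sketch observable).
enum Path { scalar, vector }

/// Sketch: dst[i] = a[i] + b[i], choosing the code path by length.
Path addInts(int[] dst, const(int)[] a, const(int)[] b)
{
    assert(dst.length == a.length && a.length == b.length);
    if (dst.length < minVectorLength)
    {
        // Too short for the setup cost to pay off: plain element loop.
        foreach (i; 0 .. dst.length)
            dst[i] = a[i] + b[i];
        return Path.scalar;
    }
    // Built-in array operation, standing in for the SSE code.
    dst[] = a[] + b[];
    return Path.vector;
}

unittest
{
    auto small = new int[](4);
    assert(addInts(small, [1, 2, 3, 4], [10, 20, 30, 40]) == Path.scalar);
    assert(small == [11, 22, 33, 44]);
}
```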

 
 Georg

Aug 13 2008

"Dave" <Dave_member pathlink.com> writes:

"Don" <nospam nospam.com.au> wrote in message 
news:g7u36h$20j0$1 digitalmars.com...
 Georg Lukas wrote:
 On Mon, 11 Aug 2008 09:55:26 -0400, Pete wrote:
 Walter Bright Wrote:
 This one has (finally) got array operations implemented. For those who
 want to show off their leet assembler skills, the initial assembler
 implementation code is in phobos/internal/array*.d. Burton Radons wrote
 the assembler. Can you make it faster?


 Not sure if someone else has already mentioned this but would it be
 possible for the compiler to align these arrays on 16 byte boundaries in
 order to maximise any possible vector efficiency. AFAIK you can't
 actually specify align anything higher than align 8 at the moment which
 is a bit of a problem.


 From a short look at the array*.d source code, it would be better to 
 check if source and destination have the same alignment, i.e.:

 a = 0xf00d0013 (3 mod 16)
 b = 0xdeaffff3 (3 mod 16)

 In that case, the first 16-3 = 13 bytes can be handled using regular D 
 code, and the aligned SSE version can be used for the rest.




Good idea. Right now in that code there is (usually) a separate case for 
each of the aligned and unaligned variants.

It typically goes like this:

if (cpu_has_sse2 && a.length > min_size)
{
    if (((cast(size_t) aptr | cast(size_t) bptr | cast(size_t) cptr) & 15) != 0)
    {
        // Unaligned case
        asm
        {
            ...
            movdqu  XMM0, [EAX]
            ...
        }
    }
    else
    {
        // Aligned case
        asm
        {
            ...
            movdqa  XMM0, [EAX]
            ...
        }
    }
}

The two blocks of asm code are basically identical except for the 
aligned/unaligned SSE opcodes.

With your idea, one could get rid of the test for alignment, probably some 
bloat and a whole lot of duplication. I guess the question would be if the 
overhead of your idea would be less than the current design.
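
For comparison, here is a small sketch of the two pointer tests involved. anyUnaligned16 mirrors the check in the snippet above; sameRemainder16 is a guess at what the proposed scheme would test instead. Both names are hypothetical.

```d
/// True when at least one pointer is off a 16-byte boundary
/// (the condition that selects the unaligned case above).
bool anyUnaligned16(const(void)* a, const(void)* b, const(void)* c)
{
    return ((cast(size_t) a | cast(size_t) b | cast(size_t) c) & 15) != 0;
}

/// True when all three pointers share the same remainder mod 16,
/// so one scalar prologue could bring them to a boundary together.
bool sameRemainder16(const(void)* a, const(void)* b, const(void)* c)
{
    immutable x = cast(size_t) a;
    return (((x ^ cast(size_t) b) | (x ^ cast(size_t) c)) & 15) == 0;
}

unittest
{
    auto p = cast(const(void)*) cast(size_t) 0x1000; // on a boundary
    auto q = cast(const(void)*) cast(size_t) 0x2003; // 3 mod 16
    auto r = cast(const(void)*) cast(size_t) 0x3003; // 3 mod 16

    assert(!anyUnaligned16(p, p, p));
    assert(anyUnaligned16(p, q, p));
    assert(sameRemainder16(q, r, q));
    assert(!sameRemainder16(p, q, r));
}
```

Either way a pointer test of similar cost seems to remain, so the saving would mostly be in code size and duplication rather than in the test itself.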

- Dave

 This would also work for slices, at least when both slices have the same 
 alignment remainder. I'm just not sure what overhead such a solution 
 would impose for small arrays.


 Just begin with a check for minimal size. If less than that size, don't 
 use SSE at all.

 Georg

Aug 13 2008

JAnderson <ask me.com> writes:

Georg Lukas wrote:
 On Mon, 11 Aug 2008 09:55:26 -0400, Pete wrote:
 Walter Bright Wrote:
 This one has (finally) got array operations implemented. For those who
 want to show off their leet assembler skills, the initial assembler
 implementation code is in phobos/internal/array*.d. Burton Radons wrote
 the assembler. Can you make it faster?


 Not sure if someone else has already mentioned this but would it be
 possible for the compiler to align these arrays on 16 byte boundaries in
 order to maximise any possible vector efficiency. AFAIK you can't
 actually specify align anything higher than align 8 at the moment which
 is a bit of a problem.


 From a short look at the array*.d source code, it would be better to 
 check if source and destination have the same alignment, i.e.:
 
 a = 0xf00d0013 (3 mod 16)
 b = 0xdeaffff3 (3 mod 16)
 
 In that case, the first 16-3 = 13 bytes can be handled using regular D 
 code, and the aligned SSE version can be used for the rest.
 
 This would also work for slices, at least when both slices have the same 
 alignment remainder. I'm just not sure what overhead such a solution 
 would impose for small arrays.


There would be some overhead for small arrays. However, as I said in my 
previous email, if you're using a small array then it's likely that you're 
not doing much. If it is a performance issue you should switch to a 
larger array (by grouping all your smaller ones together). Of course 
there's the edge case where someone actually needs to do a g-billion 
operations on exactly the same small array.

 
 Georg


-Joel

Aug 13 2008
