
digitalmars.D.announce - DMD 1.034 and 2.018 releases

reply Walter Bright <newshound1 digitalmars.com> writes:
This one has (finally) got array operations implemented. For those who 
want to show off their leet assembler skills, the initial assembler 
implementation code is in phobos/internal/array*.d. Burton Radons wrote 
the assembler. Can you make it faster?

http://www.digitalmars.com/d/1.0/changelog.html
http://ftp.digitalmars.com/dmd.1.034.zip

http://www.digitalmars.com/d/2.0/changelog.html
http://ftp.digitalmars.com/dmd.2.018.zip
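
For the curious, a minimal sketch of the new syntax (my own example, not from the changelog; the op dispatches to the routines in phobos/internal/array*.d):

```d
import std.stdio;

void main() {
    int[] a = [1, 2, 3];
    int[] b = [4, 5, 6];
    auto c = new int[3]; // destination needs storage

    c[] = a[] + b[];     // element-wise add via the new runtime routines
    writefln(c);         // prints: [5,7,9]
}
```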
Aug 08 2008
next sibling parent reply Lars Ivar Igesund <larsivar igesund.net> writes:
Walter Bright wrote:

 This one has (finally) got array operations implemented. For those who
 want to show off their leet assembler skills, the initial assembler
 implementation code is in phobos/internal/array*.d. Burton Radons wrote
 the assembler. Can you make it faster?
 
 http://www.digitalmars.com/d/1.0/changelog.html
 http://ftp.digitalmars.com/dmd.1.034.zip
The array op docs aren't actually on the 1.0 array page.

But great! I remember trying to use these 3 years ago :D

-- 
Lars Ivar Igesund
blog at http://larsivi.net
DSource, #d.tango & #D: larsivi
Dancing the Tango
Aug 08 2008
next sibling parent Walter Bright <newshound1 digitalmars.com> writes:
Lars Ivar Igesund wrote:
 The array op docs aren't actually on the 1.0 array page.
Fixed.
Aug 08 2008
prev sibling parent reply "Jarrett Billingsley" <kb3ctd2 yahoo.com> writes:
"Lars Ivar Igesund" <larsivar igesund.net> wrote in message 
news:g7ias2$2kbo$1 digitalmars.com...
 Walter Bright wrote:

 This one has (finally) got array operations implemented. For those who
 want to show off their leet assembler skills, the initial assembler
 implementation code is in phobos/internal/array*.d. Burton Radons wrote
 the assembler. Can you make it faster?

 http://www.digitalmars.com/d/1.0/changelog.html
 http://ftp.digitalmars.com/dmd.1.034.zip
The array op docs aren't actually on the 1.0 array page. But great! I remember trying to use these 3 years ago :D
Too bad Tango doesn't support them yet. :C
Aug 08 2008
parent reply Lars Ivar Igesund <larsivar igesund.net> writes:
Jarrett Billingsley wrote:

 "Lars Ivar Igesund" <larsivar igesund.net> wrote in message
 news:g7ias2$2kbo$1 digitalmars.com...
 Walter Bright wrote:

 This one has (finally) got array operations implemented. For those who
 want to show off their leet assembler skills, the initial assembler
 implementation code is in phobos/internal/array*.d. Burton Radons wrote
 the assembler. Can you make it faster?

 http://www.digitalmars.com/d/1.0/changelog.html
 http://ftp.digitalmars.com/dmd.1.034.zip
The array op docs aren't actually on the 1.0 array page. But great! I remember trying to use these 3 years ago :D
Too bad Tango doesn't support them yet. :C
Are you suggesting that Walter should have told us that he was implementing this feature ahead of releasing 1.034?

-- 
Lars Ivar Igesund
blog at http://larsivi.net
DSource, #d.tango & #D: larsivi
Dancing the Tango
Aug 09 2008
parent reply Walter Bright <newshound1 digitalmars.com> writes:
Lars Ivar Igesund wrote:
 Jarrett Billingsley wrote:
 
 "Lars Ivar Igesund" <larsivar igesund.net> wrote in message
 news:g7ias2$2kbo$1 digitalmars.com...
 Walter Bright wrote:

 This one has (finally) got array operations implemented. For those who
 want to show off their leet assembler skills, the initial assembler
 implementation code is in phobos/internal/array*.d. Burton Radons wrote
 the assembler. Can you make it faster?

 http://www.digitalmars.com/d/1.0/changelog.html
 http://ftp.digitalmars.com/dmd.1.034.zip
The array op docs aren't actually on the 1.0 array page. But great! I remember trying to use these 3 years ago :D
Too bad Tango doesn't support them yet. :C
Are you suggesting that Walter should have told us that he was implementing this feature ahead of releasing 1.034?
All Tango needs to do is copy the internal\array*.d files over and add them to the makefile.
Aug 09 2008
next sibling parent Lars Ivar Igesund <larsivar igesund.net> writes:
Walter Bright wrote:

 Lars Ivar Igesund wrote:
 Jarrett Billingsley wrote:
 
 "Lars Ivar Igesund" <larsivar igesund.net> wrote in message
 news:g7ias2$2kbo$1 digitalmars.com...
 Walter Bright wrote:

 This one has (finally) got array operations implemented. For those who
 want to show off their leet assembler skills, the initial assembler
 implementation code is in phobos/internal/array*.d. Burton Radons
 wrote the assembler. Can you make it faster?

 http://www.digitalmars.com/d/1.0/changelog.html
 http://ftp.digitalmars.com/dmd.1.034.zip
The array op docs aren't actually on the 1.0 array page. But great! I remember trying to use these 3 years ago :D
Too bad Tango doesn't support them yet. :C
Are you suggesting that Walter should have told us that he was implementing this feature ahead of releasing 1.034?
All Tango needs to do is copy the internal\array*.d files over and add them to the makefile.
I know :) No malice intended.

-- 
Lars Ivar Igesund
blog at http://larsivi.net
DSource, #d.tango & #D: larsivi
Dancing the Tango
Aug 09 2008
prev sibling parent Sean Kelly <sean invisibleduck.org> writes:
== Quote from Walter Bright (newshound1 digitalmars.com)'s article
 Lars Ivar Igesund wrote:
 Jarrett Billingsley wrote:

 "Lars Ivar Igesund" <larsivar igesund.net> wrote in message
 news:g7ias2$2kbo$1 digitalmars.com...
 Walter Bright wrote:

 This one has (finally) got array operations implemented. For those who
 want to show off their leet assembler skills, the initial assembler
 implementation code is in phobos/internal/array*.d. Burton Radons wrote
 the assembler. Can you make it faster?

 http://www.digitalmars.com/d/1.0/changelog.html
 http://ftp.digitalmars.com/dmd.1.034.zip
The array op docs aren't actually on the 1.0 array page. But great! I remember trying to use these 3 years ago :D
Too bad Tango doesn't support them yet. :C
Are you suggesting that Walter should have told us that he was implementing this feature ahead of releasing 1.034?
All Tango needs to do is copy the internal\array*.d files over and add them to the makefile.
I took care of this when 1.033 was released, since the files were first included then. There are likely updates in 1.034, but nothing a few minutes with my merge tool can't handle. Sadly, that particular merge tool is a bit broken at the moment, but I'll see about taking care of this anyway.

Sean
Aug 09 2008
prev sibling next sibling parent reply dsimcha <dsimcha yahoo.com> writes:
Uhh... Is it just me or did you accidentally repackage 2.017 and call it 2.018?
I checked all the obvious stuff, cleared my cache and all, and no matter what I
do, the date on DMD.exe in the 2.018 zipfile is 7/10/08, and it identifies
itself as 2.017 when I run it.
Aug 08 2008
parent Walter Bright <newshound1 digitalmars.com> writes:
dsimcha wrote:
 Uhh... Is it just me or did you accidentally repackage 2.017 and call it 2.018?
 I checked all the obvious stuff, cleared my cache and all, and no matter what I
 do, the date on DMD.exe in the 2.018 zipfile is 7/10/08, and it identifies
 itself as 2.017 when I run it.
I don't know what happened there, but I just re-uploaded it.
Aug 08 2008
prev sibling next sibling parent Sean Kelly <sean invisibleduck.org> writes:
== Quote from Walter Bright (newshound1 digitalmars.com)'s article
 This one has (finally) got array operations implemented. For those who
 want to show off their leet assembler skills, the initial assembler
 implementation code is in phobos/internal/array*.d. Burton Radons wrote
 the assembler. Can you make it faster?
Sweet! One of the best updates ever.

Sean
Aug 08 2008
prev sibling next sibling parent Jonathan Crapuchettes <jcrapuchettes gmail.com> writes:
This is great! Thanks for adding this Walter.
JC

Walter Bright wrote:
 This one has (finally) got array operations implemented. For those who 
 want to show off their leet assembler skills, the initial assembler 
 implementation code is in phobos/internal/array*.d. Burton Radons wrote 
 the assembler. Can you make it faster?
 
 http://www.digitalmars.com/d/1.0/changelog.html
 http://ftp.digitalmars.com/dmd.1.034.zip
 
 http://www.digitalmars.com/d/2.0/changelog.html
 http://ftp.digitalmars.com/dmd.2.018.zip
Aug 08 2008
prev sibling next sibling parent reply "Bill Baxter" <wbaxter gmail.com> writes:
On Sat, Aug 9, 2008 at 5:24 AM, Walter Bright
<newshound1 digitalmars.com> wrote:
 This one has (finally) got array operations implemented. For those who want
 to show off their leet assembler skills, the initial assembler
 implementation code is in phobos/internal/array*.d. Burton Radons wrote the
 assembler. Can you make it faster?

 http://www.digitalmars.com/d/1.0/changelog.html
 http://ftp.digitalmars.com/dmd.1.034.zip

 http://www.digitalmars.com/d/2.0/changelog.html
 http://ftp.digitalmars.com/dmd.2.018.zip
That is pretty neat.

So does this mean you've reconsidered your position on adding new features to D1.x? Because q{} strings, 1..10 literals, and that enhancement to IFTI used by std.algorithm all sure would be nice. I'd take those over fancy array ops any day.

--bb
Aug 08 2008
parent reply Walter Bright <newshound1 digitalmars.com> writes:
Bill Baxter wrote:
 That is pretty neat.
 
 So does this mean you've reconsidered your position on adding new
 features to D1.x?
 Because q{} strings,  1..10 literals, and that enhancement to IFTI
 used by std.algorithm, all sure would be nice.  I'd take those over
 fancy array ops any day.
Array ops were always supposed to be there.
Aug 08 2008
parent "Bill Baxter" <wbaxter gmail.com> writes:
On Sat, Aug 9, 2008 at 7:30 AM, Walter Bright
<newshound1 digitalmars.com> wrote:
 Bill Baxter wrote:
 That is pretty neat.

 So does this mean you've reconsidered your position on adding new
 features to D1.x?
 Because q{} strings,  1..10 literals, and that enhancement to IFTI
 used by std.algorithm, all sure would be nice.  I'd take those over
 fancy array ops any day.
Array ops were always supposed to be there.
Ok, I thought the charter for D1 was no new language features, period.

FWIW, I always thought D1 IFTI was supposed to be smarter, so to me adding array ops seems to be similar to porting the fix for 493 to D1.
(http://d.puremagic.com/issues/show_bug.cgi?id=493)

--bb
Aug 08 2008
prev sibling next sibling parent Walter Bright <newshound1 digitalmars.com> writes:
Now on Reddit!

http://www.reddit.com/comments/6vjcv/d_programming_language_gets_vector_operations/
Aug 08 2008
prev sibling next sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Probably I am missing something important. I have tried this code with
1.034 (which compiles my large D libs fine), but I have found many problems:

import std.stdio: putr = writefln;

void main() {
    int[] a1 = [1, 2, 3];
    int[] a2 = [2, 4, 6];

    //putr(a1[] + a2[]); // test.d(6): Error: Array operations not implemented

    auto a3 = a1[] + 4;
    putr(a3); // [1,2,3,0,0,0,0]

    int[] a4 = a1[] + a2[]; // test.d(12): Error: Array operations not
implemented

    int[] a5 = [3, 5, 7, 9];
    int[] a6 = a1 + a5; // test.d(16): Error: Array operations not implemented

    int[] a7;
    a7[] = a1[] + a2[];
    putr(a7); // prints: []

    auto a8 = a1 + a2; // test.d(21): Error: Array operations not implemented
    putr(a8);
}


A few more questions/notes:
- I like a syntax as a+b and a[]+4 instead of a[]+b[] and a[]+4, I am used to
that from PyLab, etc.
- How does it work (or not work) with jagged/square matrices?
- When possible I'll do a few benchmarks compared to normal D code, C code
compiled normally with GCC, and C code automatically vectorized by GCC.
- Is it able to compute a+b+c with a single loop (as all Fortran compilers do)?
I presume the answer is negative.
- Hopefully in the future they may also support the SSE3/SSSE3 that my CPU
supports.

Bye, and good work,
bearophile
Aug 08 2008
next sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
bearophile:
 - I like a syntax as a+b and a[]+4 instead of a[]+b[] and a[]+4,
I meant:

    a + b
    a + 4

instead of:

    a[] + b[]
    a[] + 4
Aug 08 2008
parent Walter Bright <newshound1 digitalmars.com> writes:
bearophile wrote:
 bearophile:
 - I like a syntax as a+b and a[]+4 instead of a[]+b[] and a[]+4,
I meant:

    a + b
    a + 4

instead of:

    a[] + b[]
    a[] + 4
D already distinguishes operations on the array handle, a, from operations on the contents of a, a[]. I think this is a good distinction.
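
A quick sketch of that distinction (my own example, assuming D1 semantics as described here):

```d
import std.stdio;

void main() {
    int[] a = [1, 2, 3];
    int[] b = [4, 5, 6];

    a = b;        // handle assignment: 'a' now refers to b's data
    a[0] = 99;    // writes into b's storage too

    int[] c = [1, 2, 3];
    auto d = new int[3];
    d[] = c[];    // content assignment: copies elements into d's own storage
    d[0] = 42;    // c is unaffected
    writefln(c);  // prints: [1,2,3]
}
```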
Aug 08 2008
prev sibling parent reply Walter Bright <newshound1 digitalmars.com> writes:
bearophile wrote:
 Probably I am missing something important, I have tried this code
 with the 1.034 (that compiles my large d libs fine), but I have found
 many problems:
 
 import std.stdio: putr = writefln;
 
 void main() {
     int[] a1 = [1, 2, 3];
     int[] a2 = [2, 4, 6];
 
 //putr(a1[] + a2[]); // test.d(6): Error: Array operations not implemented
It only works if the top level is an assignment operation.
 
 auto a3 = a1[] + 4;
 putr(a3); // [1,2,3,0,0,0,0]
 
 int[] a4 = a1[] + a2[]; // test.d(12): Error: Array operations not implemented
Doesn't work for initializers.
 
 int[] a5 = [3, 5, 7, 9];
 int[] a6 = a1 + a5; // test.d(16): Error: Array operations not implemented
Doesn't work for initializers.
 
 int[] a7;
 a7[] = a1[] + a2[];
 putr(a7); // prints: []
I don't know what putr is.
 
 auto a8 = a1 + a2; // test.d(21): Error: Array operations not implemented
Have to use slice [] operator.
 putr(a8);
 }
 
 
 A few more questions/notes:
 - I like a syntax as a+b and a[]+4 instead of a[]+b[] and a[]+4, I am used
 to that from PyLab, etc.
 - How does it work (or not work) with jagged/square matrices?
It doesn't.
 - When possible I'll do a few benchmarks compared to normal D code, C code
 compiled normally with GCC, and C code automatically vectorized by GCC.
 - Is it able to compute a+b+c with a single loop (as all Fortran compilers
 do)? I presume the answer is negative.
Yes.
 - Hopefully in the future they may also support the SSE3/SSSE3 that my CPU
 supports.
 
 Bye, and good work, bearophile
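
Putting the answers above together, a sketch of what 1.034 accepts (my own summary example; the array op has to be the top level of an assignment into a slice with already-allocated storage):

```d
import std.stdio: writefln;

void main() {
    int[] a1 = [1, 2, 3];
    int[] a2 = [2, 4, 6];

    // int[] bad = a1[] + a2[]; // initializer: "Array operations not implemented"
    // auto  b   = a1 + a2;     // missing [] slice operator: same error

    auto a3 = new int[a1.length]; // allocate the destination first
    a3[] = a1[] + a2[];           // OK: top-level slice assignment
    writefln(a3);                 // prints: [3,6,9]
}
```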
Aug 08 2008
next sibling parent Moritz Warning <moritzwarning web.de> writes:
On Fri, 08 Aug 2008 22:43:08 -0700, Walter Bright wrote:

 bearophile wrote:
 Probably I am missing something important, I have tried this code with
 the 1.034 (that compiles my large d libs fine), but I have found many
 problems:
 
 import std.stdio: putr = writefln;
 
 void main() {
     int[] a1 = [1, 2, 3];
     int[] a2 = [2, 4, 6];
 
 //putr(a1[] + a2[]); // test.d(6): Error: Array operations not implemented
It only works if the top level is an assignment operation.
[..]
 
 int[] a5 = [3, 5, 7, 9];
 int[] a6 = a1 + a5; // test.d(16): Error: Array operations not implemented
Doesn't work for initializers.
[..]

Looks like there is room for improvement. It does put a strain on the programmer's nerves when things don't work as expected. :)

Anyway - good work!
Aug 09 2008
prev sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
bearophile wrote:
 import std.stdio: putr = writefln;
Walter Bright:
It only works if the top level is an assignment operation.<
Then I have not seen such a comment in the docs; if this is absent from the docs, it deserves to be added. And the error message DMD gives can be improved too.
Doesn't work for initializers.<
Both the docs (if not already covered there) and the error message have to explain this. This output looks like a compiler bug anyway: [1,2,3,0,0,0,0]
 int[] a7; a7[] = a1[] + a2[]; putr(a7); // prints: []
I don't know what putr is.
It's just a shorter alias for writefln.
auto a8 = a1 + a2; // test.d(21): Error: Array operations not implemented<<
Have to use slice [] operator.<
Then I'd like a less misleading error message.
Is it able to compute a+b+c with a single loop (as all Fortran compilers do)?<<
Yes.<
This is very positive :-)
D already distinguishes operations on the array handle, a, from operations on
the contents of a, a[]. I think this is a good distinction.<
I understand and I agree, but the [] makes the code a little less natural to write.

----------------------------

For reference this is the shortened code; it compiles and runs, but the results and error messages are bogus:

import std.stdio: writefln;

void main() {
    int[] a1 = [1, 2, 3];
    int[] a2 = [2, 4, 6];

    auto a3 = a1[] + 4;
    writefln(a3); // prints: [1,2,3,0,0,0,0]

    int[] a7;
    a7[] = a1[] + a2[];
    writefln(a7); // prints: []

    // a7 = a1 + a2; // test2.d(14): Error: Array operations not implemented
}

The last line gives a wrong error message (well, the error messages in the preceding code were all wrong).

-------------------

The following code works, yay! :-)

import std.stdio: writefln;

void main() {
    int[] a1 = [1, 2, 3];
    int[] a2 = [2, 4, 6];

    auto a3 = new int[2];
    a3[] = a1[] + a2[];
    writefln(a3); // prints: [3,6]
}

Later,
bearophile
Aug 09 2008
next sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
First benchmark, just D against itself (GCC not used yet). The results show that
vector ops are generally slower, but maybe there's some bug/problem in my
benchmark (note it needs just Phobos!). Not tested on Linux yet:


import std.stdio: put = writef, putr = writefln;
import std.conv: toInt;

version (Win32) {
    import std.c.windows.windows: QueryPerformanceCounter,
QueryPerformanceFrequency;

    double clock() {
        long t;
        QueryPerformanceCounter(&t);

        return cast(double)t / queryPerformanceFrequency;
    }

    long queryPerformanceFrequency;

    static this() {
        QueryPerformanceFrequency(&queryPerformanceFrequency);
    }
}

version (linux) {
    import std.c.linux.linux: time;

    double clock() {
        return cast(double)time(null);
    }
}


void main(string[] args) {
    int n = args.length >= 2 ? toInt(args[1]) : 10;
    n *= 8; // to avoid problems with SSE2
    int nloops = args.length >= 3 ? toInt(args[2]) : 1;
    bool use_vec = args.length == 4 ? cast(bool)toInt(args[3]) : true;

    putr("array len= ", n, "  nloops= ", nloops, "  Use vec ops: ", use_vec);

    auto a1 = new int[n]; // void?
    auto a2 = new int[n]; // void?
    auto a3 = new int[n];

    foreach (i, ref el; a1)
        el = i * 7 + 1;
    foreach (i, ref el; a2)
        el = i + 1;

    auto t = clock();
    if (use_vec)
        for (int j = 0; j < nloops; j++)
            a3[] = a1[] / a2[];
    else
        for (int j = 0; j < nloops; j++)
            for (int i; i < a3.length; i++)
                a3[i] = a1[i] / a2[i];
    putr("time= ", clock() - t, " s");

    if (a3.length < 300)
        putr("\nResult:\n", a3);
}

/*
D code with /:
    C:\>array_benchmark.exe 10000 10000 0
    array len= 80000  nloops= 10000  Use vec ops: false
    time= 7.10563 s

    C:\>array_benchmark.exe 10000 10000 1
    array len= 80000  nloops= 10000  Use vec ops: true
    time= 7.222 s


    C:\>array_benchmark.exe 12000000 1 0
    array len= 96000000  nloops= 1  Use vec ops: false
    time= 0.654696 s

    C:\>array_benchmark.exe 12000000 1 1
    array len= 96000000  nloops= 1  Use vec ops: true
    time= 0.655401 s


D code with *:
    C:\>array_benchmark.exe 10000 10000 0
    array len= 80000  nloops= 10000  Use vec ops: false
    time= 7.10615 s

    C:\>array_benchmark.exe 10000 10000 1
    array len= 80000  nloops= 10000  Use vec ops: true
    time= 7.21904 s


    C:\>array_benchmark.exe 12000000 1 0
    array len= 96000000  nloops= 1  Use vec ops: false
    time= 0.65515 s

    C:\>array_benchmark.exe 12000000 1 1
    array len= 96000000  nloops= 1  Use vec ops: true
    time= 0.65566 s
    (Note that 0.65566 > 0.65515 isn't due to noise)


D code with +:
    C:\>array_benchmark.exe 10000 10000 0
    array len= 80000  nloops= 10000  Use vec ops: false
    time= 7.10848 s

    C:\>array_benchmark.exe 10000 10000 1
    array len= 80000  nloops= 10000  Use vec ops: true
    time= 7.22527 s


    C:\>array_benchmark.exe 12000000 1 0
    array len= 96000000  nloops= 1  Use vec ops: false
    time= 0.654797 s

    C:\>array_benchmark.exe 12000000 1 1
    array len= 96000000  nloops= 1  Use vec ops: true
    time= 0.654991 s

*/


Bye,
bearophile
Aug 09 2008
next sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Second version, just a bit cleaner code, less bug-prone, etc:
http://codepad.org/BlwSIBKl

Timings on Linux on DMD 2.0 with * as the operation seem much better.

Later,
bearophile
Aug 09 2008
parent bearophile <bearophileHUGS mailas.com> writes:
C version too:

#include <stdlib.h>
#include <stdio.h>
#include <time.h>

#define MYOP *
typedef int T;
#define TFORM "%d "

void error(char *string) {
    fprintf(stderr, "ERROR: %s\n", string);
    exit(EXIT_FAILURE);
}

double myclock() {
    clock_t t = clock();
    if (t == -1)
        return 0.0;
    else
        return t / (double)CLOCKS_PER_SEC;
}

int main(int argc, char** argv) {
    int n = argc >= 2 ? atoi(argv[1]) : 10;

    n *= 8; // to avoid problems with SSE2
    int nloops = argc >= 3 ? atoi(argv[2]) : 1;

    printf("array len= %d  nloops= %d\n", n, nloops);

    //__attribute__((aligned(16)))
    T* __restrict a1 = (T*)malloc(sizeof(T) * n + 16);
    T* __restrict a2 = (T*)malloc(sizeof(T) * n + 16);
    T* __restrict a3 = (T*)malloc(sizeof(T) * n + 16);
    if (a1 == NULL || a2 == NULL || a3 == NULL)
        error("memory overflow");

    int i, j;
    for (i = 0; i < n; i++) {
        a1[i] = i * 7 + 1;
        a2[i] = i + 1;
    }

    double t = myclock();
    for (j = 0; j < nloops; j++)
        for (i = 0; i < n; i++) // Alignment of access forced using peeling.
            a3[i] = a1[i] MYOP a2[i];
    printf("time= %f s\n", myclock() - t);

    if (n < 300) {
        printf("\nResult:\n");
        for (i = 0; i < n; i++)
            printf(TFORM, a3[i]);
        putchar('\n');
    }

    return 0;
}

/*

MYOP = *, compiled with:
gcc -Wall -O3 -s benchmark.c -o benchmark
    C:\>benchmark 100 3000000
    array len= 800  nloops= 3000000
    time= 3.656000 s

    C:\>benchmark 10000 10000
    array len= 80000  nloops= 10000
    time= 1.374000 s

    C:\>benchmark 12000000 1
    array len= 96000000  nloops= 1
    time= 0.547000 s


MYOP = *, compiled with:
gcc -Wall -O3 -s -ftree-vectorize -msse3 -ftree-vectorizer-verbose=5
benchmark.c -o benchmark
    C:\>benchmark 100 3000000
    array len= 800  nloops= 3000000
    time= 3.468000 s

    C:\>benchmark 10000 10000
    array len= 80000  nloops= 10000
    time= 1.156000 s

    C:\>benchmark 12000000 1
    array len= 96000000  nloops= 1
    time= 0.531000 s

In the larger array the cache effects may dominate over computing time.

*/
Aug 09 2008
prev sibling next sibling parent reply dsimcha <dsimcha yahoo.com> writes:
== Quote from bearophile (bearophileHUGS lycos.com)'s article
 First benchmark, just D against itself, not used GCC yet, the results show that
 vector ops are generally slower, but maybe there's some bug/problem in my
 benchmark (note it needs just Phobos!), not tested on Linux yet:

I see at least part of the problem. When you use such huge arrays, it ends up being more a test of your memory bandwidth than of the vector ops. Three arrays of 80000 ints comes to a total of about 960k. This is not going to fit in any L1 cache for a long time. Heck, my CPU only has 512k L2 cache per core.

Here are my results using smaller arrays designed to fit in my 64k L1 data cache, and the same code as bearophile.

+ operator:

D:\code>array_benchmark.exe 500 1000000 0
array len= 4000  nloops= 1000000  Use vec ops: false
time= 4.82841 s

D:\code>array_benchmark.exe 500 1000000 1
array len= 4000  nloops= 1000000  Use vec ops: true
time= 2.32902 s

* operator:

D:\code>array_benchmark.exe 500 1000000 0
array len= 4000  nloops= 1000000  Use vec ops: false
time= 6.1556 s

D:\code>array_benchmark.exe 500 1000000 1
array len= 4000  nloops= 1000000  Use vec ops: true
time= 6.16539 s

/ operator:

D:\code>array_benchmark.exe 500 100000 0
array len= 4000  nloops= 100000  Use vec ops: false
time= 7.02435 s

D:\code>array_benchmark.exe 500 100000 1
array len= 4000  nloops= 100000  Use vec ops: true
time= 6.84251 s

BTW, for the sake of comparison, here are my CPU specs from CPU-Z. Also note that I'm running in 32-bit mode.

Number of processors    1
Number of cores         2 per processor
Number of threads       2 (max 2) per processor
Name                    AMD Athlon 64 X2 3600+
Code Name               Brisbane
Specification           AMD Athlon(tm) 64 X2 Dual Core Processor 3600+
Package                 Socket AM2 (940)
Family/Model/Stepping   F.B.1
Extended Family/Model   F.6B
Brand ID                4
Core Stepping           BH-G1
Technology              65 nm
Core Speed              2698.1 MHz
Multiplier x Bus speed  9.5 x 284.0 MHz
HT Link speed           852.0 MHz
Stock frequency         1900 MHz
Instruction sets        MMX (+), 3DNow! (+), SSE, SSE2, SSE3, x86-64
L1 Data cache           (per processor) 2 x 64 KBytes, 2-way set associative, 64-byte line size
L1 Instruction cache    (per processor) 2 x 64 KBytes, 2-way set associative, 64-byte line size
L2 cache                (per processor) 2 x 512 KBytes, 16-way set associative, 64-byte line size
Aug 09 2008
next sibling parent bearophile <bearophileHUGS mailas.com> writes:
dsimcha:
 I see at least part of the problem.  When you use such huge arrays, it ends up
 being more a test of your memory bandwidth than of the vector ops.
Right. Finding good benchmarks is not easy, and I have shown the code here for people to spot problems in it. I have added a C version too now.

Bye,
bearophile
Aug 09 2008
prev sibling parent Don <nospam nospam.com.au> writes:
dsimcha wrote:
 == Quote from bearophile (bearophileHUGS lycos.com)'s article
 First benchmark, just D against itself, not used GCC yet, the results show that
 vector ops are generally slower, but maybe there's some bug/problem in my
 benchmark (note it needs just Phobos!), not tested on Linux yet:

 I see at least part of the problem. When you use such huge arrays, it ends up
 being more a test of your memory bandwidth than of the vector ops. Three
 arrays of 80000 ints comes to a total of about 960k. This is not going to fit
 in any L1 cache for a long time.
Yes. The solution to that is to check for huge array sizes, and use a different routine (with prefetching) in that case.

Actually, the most important routine that should be doing this is memcpy/array slice assignment, but I'm not sure it does; I think it just does a movsd. So I think this is still a useful case to benchmark, though it's not the most important one.
Aug 10 2008
prev sibling next sibling parent reply Walter Bright <newshound1 digitalmars.com> writes:
bearophile wrote:
 First benchmark, just D against itself, not used GCC yet, the results
 show that vector ops are generally slower, but maybe there's some
 bug/problem in my benchmark (note it needs just Phobos!), not tested
 on Linux yet:
[...]
 a3[] = a1[] / a2[];
I wouldn't be a bit surprised at that since / for int[]s does not have a custom asm routine for it. See phobos/internal/arrayint.d If someone wants to write one, I'll put it in!
Aug 09 2008
parent reply bearophile <bearophileHUGS lycos.com> writes:
Walter Bright:
 I wouldn't be a bit surprised at that since / for int[]s does not have a 
 custom asm routine for it.
I didn't know that. We may want to compile a list of such things. But as you can see I have performed benchmarks with +, *, and /, not just /. It's very easy to write wrong benchmarks, so I am careful, but from the little I have seen so far the speed improvements are absent or the ratio is below 1 (a slowdown). And I haven't yet seen SSE2 asm in my compiled programs :-)
Is it able to compute a+b+c with a single loop (as all Fortran compilers do)?<<
Yes.<
But later on Reddit the answer by Walter was:
This optimization is called "loop fusion", and is well known. It doesn't always
result in a speedup, though. The dmd compiler doesn't do it, but that is not
the fault of D.<
At a closer look the two questions are different; I think he meant:

    a += b + c;  // => single loop

    a += b;
    a += c;      // => two loops

I think this is acceptable.

Bye,
bearophile
Aug 09 2008
parent reply Walter Bright <newshound1 digitalmars.com> writes:
bearophile wrote:
 It's very easy to write wrong benchmarks, so I am careful, but from
 the little I have seen so far the speed improvements are absent or
 less than 1 (slow down).
If this happens, then it's worth verifying that the asm code is actually being run by inserting a printf in it.
 And I haven't yet seen SSE2 asm in my compiled programs :-)
The dmd compiler doesn't generate SSE2 instructions. But the routines in internal\array*.d do.
Aug 09 2008
parent reply bearophile <bearophileHUGS lycos.com> writes:
Walter Bright:
 If this happens, then it's worth verifying that the asm code is actually 
 being run by inserting a printf in it.
I presume I'll have to recompile Phobos for that.
 And I haven't yet seen SSE2 asm in my compiled programs :-)
The dmd compiler doesn't generate SSE2 instructions. But the routines in internal\array*.d do.
I know. I was talking about the parts of the code that, for example, add the arrays; according to the Phobos source code they use SSE2, but in the final compiled code they are absent.

Bye,
bearophile
Aug 10 2008
parent reply Walter Bright <newshound1 digitalmars.com> writes:
bearophile wrote:
 Walter Bright:
 If this happens, then it's worth verifying that the asm code is
 actually being run by inserting a printf in it.
I presume I'll have to recompile Phobos for that.
Not really, it's easier to just copy that particular function out of the library and paste it into your test module, that way it's easier to experiment with.
 And I haven't yet seen SSE2 asm in my compiled programs :-)
The dmd compiler doesn't generate SSE2 instructions. But the routines in internal\array*.d do.
I know. I was talking about the parts of the code that, for example, add the arrays; according to the Phobos source code they use SSE2, but in the final compiled code they are absent.
I don't know what you mean. The SSE2 instructions are in internal/arrayint.d, and they do get compiled in.
Aug 10 2008
parent reply "Dave" <Dave_member pathlink.com> writes:
"Walter Bright" <newshound1 digitalmars.com> wrote in message 
news:g7na5s$qg0$1 digitalmars.com...
 bearophile wrote:
 Walter Bright:
 If this happens, then it's worth verifying that the asm code is
 actually being run by inserting a printf in it.
I presume I'll have to recompile Phobos for that.
Not really, it's easier to just copy that particular function out of the library and paste it into your test module, that way it's easier to experiment with.
 And I haven't yet seen SSE2 asm in my compiled programs :-)
The dmd compiler doesn't generate SSE2 instructions. But the routines in internal\array*.d do.
I know. I was talking about the parts of the code that, for example, add the arrays; according to the Phobos source code they use SSE2, but in the final compiled code they are absent.
I don't know what you mean. The SSE2 instructions are in internal/arrayint.d, and they do get compiled in.
The SSE2 is being used, but what would be nice would be the same code that Burton used for his benchmarks. Is that available?

Thanks,

- Dave

import std.stdio, std.date, std.conv;

void main(string[] args)
{
    if(args.length < 3) {
        writefln("usage: ", args[0], " <array size> <iterations>");
        return;
    }

    auto ASIZE = toInt(args[1]);
    auto ITERS = toInt(args[2]);
    writefln("Array Size = ", ASIZE, ", Iterations = ", ITERS);

    int[] ia, ib, ic;
    ia = new int[ASIZE];
    ib = new int[ASIZE];
    ic = new int[ASIZE];
    ib[] = ic[] = 10;

    double[] da, db, dc;
    da = new double[ASIZE];
    db = new double[ASIZE];
    dc = new double[ASIZE];
    db[] = dc[] = 10.0;

    {
        ia[] = 0;
        int sum = 0;
        d_time s = getUTCtime();
        for(size_t i = 0; i < ITERS; i++) {
            sum += aops!(int)(ia, ib, ic);
        }
        d_time e = getUTCtime();
        writefln("intaops: ", (e - s) / 1000.0, " secs, sum = ", sum);
    }
    {
        ia[] = 0;
        int sum = 0;
        d_time s = getUTCtime();
        for(size_t i = 0; i < ITERS; i++) {
            sum += loop!(int)(ia, ib, ic);
        }
        d_time e = getUTCtime();
        writefln("intloop: ", (e - s) / 1000.0, " secs, sum = ", sum);
    }
    {
        da[] = 0.0;
        double sum = 0.0;
        d_time s = getUTCtime();
        for(size_t i = 0; i < ITERS; i++) {
            sum += aops!(double)(da, db, dc);
        }
        d_time e = getUTCtime();
        writefln("dfpaops: ", (e - s) / 1000.0, " secs, sum = ", sum);
    }
    {
        da[] = 0.0;
        double sum = 0.0;
        d_time s = getUTCtime();
        for(size_t i = 0; i < ITERS; i++) {
            sum += loop!(double)(da, db, dc);
        }
        d_time e = getUTCtime();
        writefln("dfploop: ", (e - s) / 1000.0, " secs, sum = ", sum);
    }
}

T aops(T)(T[] a, T[] b, T[] c)
{
    a[] = b[] + c[];
    return a[$-1];
}

T loop(T)(T[] a, T[] b, T[] c)
{
    foreach(i, inout val; a)
        val = b[i] + c[i];
    return a[$-1];
}

C:\Zz>dmd -O -inline -release top.d

C:\Zz>top 4000 100000
Array Size = 4000, Iterations = 100000
intaops: 0.204 secs, sum = 2000000
intloop: 0.515 secs, sum = 2000000
dfpaops: 0.625 secs, sum = 2e+06
dfploop: 0.563 secs, sum = 2e+06
Aug 11 2008
parent "Dave" <Dave_member pathlink.com> writes:

"Dave" <Dave_member pathlink.com> wrote in message 
news:g7qr3h$2l6$1 digitalmars.com...
 "Walter Bright" <newshound1 digitalmars.com> wrote in message 
 news:g7na5s$qg0$1 digitalmars.com...
 bearophile wrote:
 Walter Bright:
 If this happens, then it's worth verifying that the asm code is
 actually being run by inserting a printf in it.
I presume I'll have to recompile Phobos for that.
Not really, it's easier to just copy that particular function out of the library and paste it into your test module, that way it's easier to experiment with.
 And I haven't yet seen SSE2 asm in my compiled programs :-)
The dmd compiler doesn't generate SSE2 instructions. But the routines in internal\array*.d do.
I know. I was talking about the parts of the code that, for example, add the arrays; according to the Phobos source code they use SSE2, but in the final code the compiler produces they seem to be absent.
I don't know what you mean. The SSE2 instructions are in internal/arrayint.d, and they do get compiled in.
The SSE2 is being used, but what would be nice would be the same code that Burton used for his benchmarks. Is that available? Thanks, - Dave
Before:
 C:\Zz>top 4000 100000
 Array Size = 4000, Iterations = 100000
 intaops: 0.204 secs, sum = 2000000
 intloop: 0.515 secs, sum = 2000000
 dfpaops: 0.625 secs, sum = 2e+06
 dfploop: 0.563 secs, sum = 2e+06
After adding an aligned case for _arraySliceSliceAddSliceAssign_d:

C:\Zz>top 4000 100000
Array Size = 4000, Iterations = 100000
intaops: 0.212 secs, sum = 2000000
intloop: 0.525 secs, sum = 2000000
dfpaops: 0.438 secs, sum = 2e+06
dfploop: 0.557 secs, sum = 2e+06

;--- SiSoftware Sandra
Processor
Model : Intel(R) Core(TM)2 CPU 6700  2.66GHz

Processor Cache(s)
Internal Data Cache : 32kB, Synchronous, Write-Thru, 8-way set, 64 byte line size
Internal Instruction Cache : 32kB, Synchronous, Write-Back, 8-way set, 64 byte line size
L2 On-board Cache : 4MB, ECC, Synchronous, ATC, 16-way set, 64 byte line size, 2 threads sharing
L2 Cache Multiplier : 1/1x (2667MHz)
Aug 11 2008
prev sibling parent Walter Bright <newshound1 digitalmars.com> writes:
bearophile wrote:
 D code with +:
I found the results to be heavily dependent on the data set size:

C:\mars>test5 1000 10000
array len= 8000 nloops= 10000
vec time= 0.0926506 s
non-vec time= 0.626356 s

C:\mars>test5 2000 10000
array len= 16000 nloops= 10000
vec time= 0.279727 s
non-vec time= 1.70048 s

C:\mars>test5 3000 10000
array len= 24000 nloops= 10000
vec time= 0.795482 s
non-vec time= 2.47597 s

C:\mars>test5 4000 10000
array len= 32000 nloops= 10000
vec time= 2.36905 s
non-vec time= 3.90906 s

C:\mars>test5 5000 10000
array len= 40000 nloops= 10000
vec time= 3.12636 s
non-vec time= 3.70741 s

For smaller sets, it's a 2x speedup; for larger ones, only a few percent. What we're seeing here is most likely the effect of the data set size exceeding the cache. It would be a fun project for someone to see if the performance for such large data sets could somehow be improved, perhaps by "warming up" the cache?
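[The cache explanation is easy to sanity-check with a little arithmetic. A sketch follows; note that the element size of test5's arrays is a guess, since the benchmark's source isn't shown in this thread:]

```python
def working_set_bytes(array_len, elem_size, n_arrays=3):
    """Bytes touched per pass of a[] = b[] + c[]: three arrays are streamed."""
    return array_len * elem_size * n_arrays

# Assuming 8-byte (double) elements -- a guess, since test5's source isn't shown:
small = working_set_bytes(8000, 8)    # len 8000  -> 192000 bytes, well under a few-MB L2
large = working_set_bytes(40000, 8)   # len 40000 -> 960000 bytes, approaching L2 capacity
```

[Once the three arrays together no longer fit in cache, both the vectorized and the plain loop become memory-bound, which would explain the shrinking speedup.]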
Aug 10 2008
prev sibling parent Walter Bright <newshound1 digitalmars.com> writes:
bearophile wrote:
 This output looks like a bug of the compiler anyway: [1,2,3,0,0,0,0]
Please post all bugs to bugzilla! thanks
Aug 09 2008
prev sibling next sibling parent reply Michael P. <baseball.mjp gmail.com> writes:
Walter Bright Wrote:

 This one has (finally) got array operations implemented. For those who 
 want to show off their leet assembler skills, the initial assembler 
 implementation code is in phobos/internal/array*.d. Burton Radons wrote 
 the assembler. Can you make it faster?
 
 http://www.digitalmars.com/d/1.0/changelog.html
 http://ftp.digitalmars.com/dmd.1.034.zip
 
 http://www.digitalmars.com/d/2.0/changelog.html
 http://ftp.digitalmars.com/dmd.2.018.zip
This will probably make me sound like an idiot, but what are these array operations everyone's so stoked about? I've only been learning D for a week and a half, fill me in! BTW, nice update!
Aug 08 2008
parent reply bearophile <bearophileHUGS lycos.com> writes:
Michael P.:
 This will probably make me sound like an idiot, but what are these array
operations everyone's so stoked about? I've only been learning D for a week and
a half, fill me in!
If Walter gives you one link, you have to follow it before asking :-) http://www.digitalmars.com/d/1.0/arrays.html#array-operations Bye, bearophile
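[For the curious, the semantics of an expression like `a[] = b[] + c[]` can be sketched in Python; the function name here is illustrative, not the Phobos implementation:]

```python
def slice_add_assign(a, b, c):
    """Elementwise a[] = b[] + c[]: element i of a becomes b[i] + c[i]."""
    assert len(a) == len(b) == len(c), "operands must have matching lengths"
    for i in range(len(a)):
        a[i] = b[i] + c[i]
    return a
```

[The point of the D feature is that the compiler and runtime are free to replace this per-element loop with SSE2 code.]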
Aug 08 2008
parent Michael P. <baseball.mjp gmail.com> writes:
bearophile Wrote:

 Michael P.:
 This will probably make me sound like an idiot, but what are these array
operations everyone's so stoked about? I've only been learning D for a week and
a half, fill me in!
If Walter gives you one link, you have to follow it before asking :-) http://www.digitalmars.com/d/1.0/arrays.html#array-operations Bye, bearophile
Who knows where that could have led to... I was just playing it safe. :D
Aug 08 2008
prev sibling next sibling parent reply JAnderson <ask me.com> writes:
Walter Bright wrote:
 This one has (finally) got array operations implemented. For those who 
 want to show off their leet assembler skills, the initial assembler 
 implementation code is in phobos/internal/array*.d. Burton Radons wrote 
 the assembler. Can you make it faster?
 
 http://www.digitalmars.com/d/1.0/changelog.html
 http://ftp.digitalmars.com/dmd.1.034.zip
 
 http://www.digitalmars.com/d/2.0/changelog.html
 http://ftp.digitalmars.com/dmd.2.018.zip
Sweet! I love the way you put this forth as a challenge. Maybe D will have the world's fastest array operations :)

-Joel
Aug 08 2008
parent Walter Bright <newshound1 digitalmars.com> writes:
JAnderson wrote:
 Sweet!  I love the way you put this forth as a challenge.  Maybe D will 
 have the world's fastest array operations :)
I thought a little competition might bring out the best in people!
Aug 09 2008
prev sibling next sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Walter Bright:
 Can you make it faster?
Lots of people today have 2 (or even 4) cores, and the order of computation of those ops is arbitrary, so a major (nearly linear, hopefully) speedup will probably come as soon as all the cores are used. This job splitting is probably an advantage even when the ops aren't computed by asm code.

I've taken a look at my code, and so far I don't see many spots where the array operations (once they actually give some speedup) can be useful (there are many other things I'd find much more useful than such ops; see my wish lists). But if the array ops are useful for enough people, then it may be worth burning some programming time to make those array ops use all 2-4+ cores.

Bye,
bearophile
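[The splitting bearophile describes is just a partition of the index range. A Python sketch of the decomposition; note that CPython's GIL means this particular code won't actually run faster, it only shows the strategy:]

```python
from concurrent.futures import ThreadPoolExecutor

def chunk_ranges(n, workers):
    """Split [0, n) into `workers` near-equal half-open ranges."""
    base, extra = divmod(n, workers)
    ranges, start = [], 0
    for w in range(workers):
        size = base + (1 if w < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges

def parallel_add(a, b, c, workers=4):
    """a[i] = b[i] + c[i], each index range handed to a separate worker."""
    def work(lo, hi):
        for i in range(lo, hi):
            a[i] = b[i] + c[i]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for lo, hi in chunk_ranges(len(a), workers):
            pool.submit(work, lo, hi)   # pool shutdown on exit joins all workers
    return a
```

[Since the chunks are disjoint, no locking is needed on the output array.]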
Aug 09 2008
parent reply Christopher Wright <dhasenan gmail.com> writes:
bearophile wrote:
 Walter Bright:
 Can you make it faster?
Lot of people today have 2 (or even 4 cores), the order of the computation of those ops is arbitrary, so a major (nearly linear, hopefully) speedup will probably come as soon all the cores are used. This job splitting is probably an advantage even then the ops aren't computed by asm code.
The overhead of creating a new thread for this would be significant. You'd probably be better off using a regular loop for arrays that are not huge.
Aug 09 2008
next sibling parent "Jarrett Billingsley" <kb3ctd2 yahoo.com> writes:
"Christopher Wright" <dhasenan gmail.com> wrote in message 
news:g7ljal$2i84$1 digitalmars.com...
 bearophile wrote:
 Walter Bright:
 Can you make it faster?
Lot of people today have 2 (or even 4 cores), the order of the computation of those ops is arbitrary, so a major (nearly linear, hopefully) speedup will probably come as soon all the cores are used. This job splitting is probably an advantage even then the ops aren't computed by asm code.
The overhead of creating a new thread for this would be significant. You'd probably be better off using a regular loop for arrays that are not huge.
I think we could see a lot more improvement from using vector ops to perform SIMD operations. They are just begging for it.
Aug 09 2008
prev sibling next sibling parent JAnderson <ask me.com> writes:
Christopher Wright wrote:
 bearophile wrote:
 Walter Bright:
 Can you make it faster?
Lot of people today have 2 (or even 4 cores), the order of the computation of those ops is arbitrary, so a major (nearly linear, hopefully) speedup will probably come as soon all the cores are used. This job splitting is probably an advantage even then the ops aren't computed by asm code.
The overhead of creating a new thread for this would be significant. You'd probably be better off using a regular loop for arrays that are not huge.
I agree. I think a lot of profiling would be in order to see when certain things become an advantage to use, then use a branch to jump to the best algorithm for the particular case (platform + length of array). Hopefully the compiler could inline the algorithm so that constant-sized arrays don't pay for the additional overhead.

There would be a small cost in the extra branch for small dynamic arrays. One could argue that if this becomes a performance bottleneck, the program is doing a lot of operations on lots of small arrays; the user could change the design to group their small arrays into a larger array to get the performance they desire.

-Joel
Aug 09 2008
prev sibling parent renoX <renosky free.fr> writes:
Christopher Wright a écrit :
 bearophile wrote:
 Walter Bright:
 Can you make it faster?
Lot of people today have 2 (or even 4 cores), the order of the computation of those ops is arbitrary, so a major (nearly linear, hopefully) speedup will probably come as soon all the cores are used. This job splitting is probably an advantage even then the ops aren't computed by asm code.
The overhead of creating a new thread for this would be significant.
Well, for this kind of scheme you wouldn't start a new set of threads each time! Just start a set of worker threads (one per CPU, pinned to each CPU) at program startup; they do nothing until they are woken up because there is an operation which can be accelerated through parallelism.
 You'd probably be better off using a regular loop for arrays that are 
 not huge.
Sure, even with pre-created threads, using several CPUs induces additional startup and teardown cost, so this would be worthwhile only for loops that are 'big enough'. A pitfall also is to ensure that two CPUs don't write to the same cache line, otherwise this 'false sharing' will reduce performance.

renoX
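[renoX's pre-created worker scheme might look like this in Python. A sketch only: CPU pinning is omitted, the class name is made up, and the GIL prevents a real speedup for pure-Python loops:]

```python
import queue
import threading

class WorkerPool:
    """Workers are created once at startup and sleep on a queue until woken."""

    def __init__(self, workers):
        self.tasks = queue.Queue()
        for _ in range(workers):
            threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        while True:
            fn, lo, hi, done = self.tasks.get()
            fn(lo, hi)          # process one slice of the index range
            done.set()          # signal completion to the submitter

    def run_split(self, fn, n, parts):
        """Hand each worker a half-open slice of [0, n), then wait for all."""
        base, extra = divmod(n, parts)
        events, start = [], 0
        for p in range(parts):
            size = base + (1 if p < extra else 0)
            done = threading.Event()
            self.tasks.put((fn, start, start + size, done))
            events.append(done)
            start += size
        for e in events:
            e.wait()
```

[The threads exist for the lifetime of the program, so the per-operation cost is a queue push and a wakeup rather than a thread creation.]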
Sep 07 2008
prev sibling next sibling parent reply "Craig Black" <craigblack2 cox.net> writes:
Very exciting stuff!  Keep up the good work.

Currently it only optimizes int and float.  I assume you could get it 
working for double pretty easily as well.  Is it extensible to user defined 
types like a Vector3 class?

-Craig 
Aug 09 2008
parent bearophile <bearophileHUGS lycos.com> writes:
Craig Black:
 Currently it only optimizes int and float.
Currently it optimizes very little, I think. I have posted C and D benchmarks:
http://codepad.org/BlwSIBKl
http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.D.announce&article_id=12718

Bye,
bearophile
Aug 09 2008
prev sibling next sibling parent reply Don <nospam nospam.com.au> writes:
Walter Bright wrote:
 This one has (finally) got array operations implemented. For those who 
 want to show off their leet assembler skills, the initial assembler 
 implementation code is in phobos/internal/array*.d. Burton Radons wrote 
 the assembler. Can you make it faster?
 
 http://www.digitalmars.com/d/1.0/changelog.html
 http://ftp.digitalmars.com/dmd.1.034.zip
 
 http://www.digitalmars.com/d/2.0/changelog.html
 http://ftp.digitalmars.com/dmd.2.018.zip
I intend to contribute some asm routines, but have been working on bigint operations (both Tango and Phobos) for the past couple of weeks.
Aug 10 2008
parent Walter Bright <newshound1 digitalmars.com> writes:
Don wrote:
 I intend to contribute some asm routines, but have been working on 
 bigint operations (both Tango and Phobos) for the past couple of weeks.
Cool!
Aug 10 2008
prev sibling next sibling parent reply Pete <example example.com> writes:
Walter Bright Wrote:

 This one has (finally) got array operations implemented. For those who 
 want to show off their leet assembler skills, the initial assembler 
 implementation code is in phobos/internal/array*.d. Burton Radons wrote 
 the assembler. Can you make it faster?
 
 http://www.digitalmars.com/d/1.0/changelog.html
 http://ftp.digitalmars.com/dmd.1.034.zip
 
 http://www.digitalmars.com/d/2.0/changelog.html
 http://ftp.digitalmars.com/dmd.2.018.zip
Not sure if someone else has already mentioned this, but would it be possible for the compiler to align these arrays on 16-byte boundaries in order to maximise any possible vector efficiency? AFAIK you can't actually specify an alignment higher than align 8 at the moment, which is a bit of a problem.

Regards,
Aug 11 2008
next sibling parent Walter Bright <newshound1 digitalmars.com> writes:
Pete wrote:
 Not sure if someone else has already mentioned this but would it be
 possible for the compiler to align these arrays on 16 byte boundaries
 in order to maximise any possible vector efficiency. AFAIK you can't
 actually specify align anything higher than align 8 at the moment
 which is a bit of a problem.
Anything allocated with new will be aligned on 16 byte boundaries.
Aug 11 2008
prev sibling parent reply Georg Lukas <georg op-co.de> writes:
On Mon, 11 Aug 2008 09:55:26 -0400, Pete wrote:
 Walter Bright Wrote:
 This one has (finally) got array operations implemented. For those who
 want to show off their leet assembler skills, the initial assembler
 implementation code is in phobos/internal/array*.d. Burton Radons wrote
 the assembler. Can you make it faster?
Not sure if someone else has already mentioned this but would it be possible for the compiler to align these arrays on 16 byte boundaries in order to maximise any possible vector efficiency. AFAIK you can't actually specify align anything higher than align 8 at the moment which is a bit of a problem.
From a short look at the array*.d source code, it would be better to check
if source and destination have the same alignment, i.e.:

a = 0xf00d0013 (3 mod 16)
b = 0xdeaffff3 (3 mod 16)

In that case, the first 16-3 = 13 bytes can be handled using regular D code,
and the aligned SSE version can be used for the rest. This would also work
for slices, at least when both slices have the same alignment remainder. I'm
just not sure what overhead such a solution would impose for small arrays.

Georg
-- 
|| http://op-co.de ++ GCS/CM d? s: a-- C+++ UL+++ !P L+++ E--- W++ ++
|| gpg: 0x962FD2DE || N++ o? K- w---() O M V? PS+ PE-- Y+ PGP++ t* ||
|| Ge0rG: euIRCnet || 5 X+ R tv b+(+++) DI+(+++) D+ G e* h! r* !y+ ||
++ IRCnet OFTC OPN ||________________________________________________||
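[Georg's arithmetic, as a small Python model; the addresses in the test come straight from his example, and the function names are illustrative:]

```python
def prologue_bytes(address, boundary=16):
    """Bytes to handle with scalar code before `address` reaches the next
    boundary: equivalent to (boundary - address % boundary) % boundary."""
    return (-address) % boundary

def same_remainder(a, b, boundary=16):
    """Precondition for the trick: one scalar prologue can only align both
    operands if they start at the same remainder mod the boundary."""
    return a % boundary == b % boundary
```

[With remainder 3, the prologue is 16 - 3 = 13 bytes, exactly as in the post; an already-aligned address gets a zero-length prologue.]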
Aug 12 2008
next sibling parent reply Don <nospam nospam.com.au> writes:
Georg Lukas wrote:
 On Mon, 11 Aug 2008 09:55:26 -0400, Pete wrote:
 Walter Bright Wrote:
 This one has (finally) got array operations implemented. For those who
 want to show off their leet assembler skills, the initial assembler
 implementation code is in phobos/internal/array*.d. Burton Radons wrote
 the assembler. Can you make it faster?
Not sure if someone else has already mentioned this but would it be possible for the compiler to align these arrays on 16 byte boundaries in order to maximise any possible vector efficiency. AFAIK you can't actually specify align anything higher than align 8 at the moment which is a bit of a problem.
From a short look at the array*.d source code, it would be better to check
if source and destination have the same alignment, i.e.:

a = 0xf00d0013 (3 mod 16)
b = 0xdeaffff3 (3 mod 16)

In that case, the first 16-3 = 13 bytes can be handled using regular D code,
and the aligned SSE version can be used for the rest. This would also work
for slices, at least when both slices have the same alignment remainder. I'm
just not sure what overhead such a solution would impose for small arrays.
Just begin with a check for minimal size. If less than that size, don't use SSE at all.
 
 Georg
Aug 13 2008
parent "Dave" <Dave_member pathlink.com> writes:
"Don" <nospam nospam.com.au> wrote in message 
news:g7u36h$20j0$1 digitalmars.com...
 Georg Lukas wrote:
 On Mon, 11 Aug 2008 09:55:26 -0400, Pete wrote:
 Walter Bright Wrote:
 This one has (finally) got array operations implemented. For those who
 want to show off their leet assembler skills, the initial assembler
 implementation code is in phobos/internal/array*.d. Burton Radons wrote
 the assembler. Can you make it faster?
Not sure if someone else has already mentioned this but would it be possible for the compiler to align these arrays on 16 byte boundaries in order to maximise any possible vector efficiency. AFAIK you can't actually specify align anything higher than align 8 at the moment which is a bit of a problem.
From a short look at the array*.d source code, it would be better to check
if source and destination have the same alignment, i.e.:

a = 0xf00d0013 (3 mod 16)
b = 0xdeaffff3 (3 mod 16)

In that case, the first 16-3 = 13 bytes can be handled using regular D code,
and the aligned SSE version can be used for the rest.
Good idea. Right now in that code there is (usually) a case for both
un/aligned. It typically goes like this:

if(cpu_has_sse2 && a.length > min_size)
{
    if(((cast(size_t)aptr | cast(size_t)bptr | cast(size_t)cptr) & 15) != 0)
    {
        // Unaligned case
        asm
        {
            ...
            movdqu XMM0, [EAX]
            ...
        }
    }
    else
    {
        // Aligned case
        asm
        {
            ...
            movdqa XMM0, [EAX]
            ...
        }
    }
}

The two blocks of asm code are basically identical except for the
un/aligned SSE opcodes. With your idea, one could get rid of the test for
alignment, and probably some bloat and a whole lot of duplication. I guess
the question would be whether the overhead of your idea would be less than
that of the current design.

- Dave
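[The dispatch Dave describes reduces to a couple of integer tests. A Python model of just the branch logic; the threshold value and the path labels are made up for illustration:]

```python
def pick_path(aptr, bptr, cptr, length, min_size=16, cpu_has_sse2=True):
    """Mirror the branch structure sketched above: short arrays take the
    plain loop; otherwise OR the three addresses together and test the
    low four bits to choose the aligned or unaligned SSE2 path."""
    if not (cpu_has_sse2 and length > min_size):
        return "scalar"
    if (aptr | bptr | cptr) & 15:
        return "sse2-unaligned"   # movdqu loads/stores
    return "sse2-aligned"         # movdqa loads/stores
```

[ORing the addresses is the cheap way to ask "are all three 16-byte aligned?" in a single test, since any misaligned operand sets at least one of the low four bits.]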
 This would also work for slices, at least when both slices have the same 
 alignment remainder. I'm just not sure what overhead such a solution 
 would impose for small arrays.
Just begin with a check for minimal size. If less than that size, don't use SSE at all.
 Georg 
Aug 13 2008
prev sibling parent JAnderson <ask me.com> writes:
Georg Lukas wrote:
 On Mon, 11 Aug 2008 09:55:26 -0400, Pete wrote:
 Walter Bright Wrote:
 This one has (finally) got array operations implemented. For those who
 want to show off their leet assembler skills, the initial assembler
 implementation code is in phobos/internal/array*.d. Burton Radons wrote
 the assembler. Can you make it faster?
Not sure if someone else has already mentioned this but would it be possible for the compiler to align these arrays on 16 byte boundaries in order to maximise any possible vector efficiency. AFAIK you can't actually specify align anything higher than align 8 at the moment which is a bit of a problem.
From a short look at the array*.d source code, it would be better to check if source and destination have the same alignment, i.e.: a = 0xf00d0013 (3 mod 16) b = 0xdeaffff3 (3 mod 16) In that case, the first 16-3 = 13 bytes can be handled using regular D code, and the aligned SSE version can be used for the rest. This would also work for slices, at least when both slices have the same alignment remainder. I'm just not sure what overhead such a solution would impose for small arrays.
There would be some overhead for small arrays; however, as I said in my previous email, if you're using a small array then it's likely that you're not doing much. If it is a performance issue, you should switch to a larger array (by grouping all your smaller ones together). Of course, there's the edge case where someone actually needs to do a g-billion operations on exactly the same small array.
 
 Georg
-Joel
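[JAnderson's grouping idea can be made concrete. A list-based Python sketch, purely illustrative:]

```python
def batched_add(pairs):
    """Instead of one elementwise op per small (b, c) array pair, concatenate
    the pairs into one big b[] and c[], do a single big a[] = b[] + c[], and
    split the result back. The per-call dispatch overhead is paid once."""
    big_b = [x for b, _ in pairs for x in b]
    big_c = [x for _, c in pairs for x in c]
    big_a = [x + y for x, y in zip(big_b, big_c)]   # the one big array op
    out, pos = [], 0
    for b, _ in pairs:
        out.append(big_a[pos:pos + len(b)])
        pos += len(b)
    return out
```

[In D terms this corresponds to laying the small arrays out in one contiguous allocation so a single slice operation covers all of them.]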
Aug 13 2008
prev sibling next sibling parent Don <nospam nospam.com.au> writes:
Walter Bright wrote:
 This one has (finally) got array operations implemented. For those who 
 want to show off their leet assembler skills, the initial assembler 
 implementation code is in phobos/internal/array*.d. Burton Radons wrote 
 the assembler. Can you make it faster?
 
 http://www.digitalmars.com/d/1.0/changelog.html
 http://ftp.digitalmars.com/dmd.1.034.zip
 
 http://www.digitalmars.com/d/2.0/changelog.html
 http://ftp.digitalmars.com/dmd.2.018.zip
My tests indicate that array operations also support ^ and ^=, but that's not listed in the spec. Not the first time that D's been better than advertised. <g>
Aug 14 2008
prev sibling parent Pablo Ripolles <in-call gmx.net> writes:
Fantastic!  Thanks!

This is the present already!
http://www.digitalmars.com/d/1.0/future.html
http://www.digitalmars.com/d/2.0/future.html


Walter Bright Wrote:

 This one has (finally) got array operations implemented. For those who 
 want to show off their leet assembler skills, the initial assembler 
 implementation code is in phobos/internal/array*.d. Burton Radons wrote 
 the assembler. Can you make it faster?
 
 http://www.digitalmars.com/d/1.0/changelog.html
 http://ftp.digitalmars.com/dmd.1.034.zip
 
 http://www.digitalmars.com/d/2.0/changelog.html
 http://ftp.digitalmars.com/dmd.2.018.zip
Aug 18 2008