
digitalmars.D.announce - DMD 1.034 and 2.018 releases

reply Walter Bright <newshound1 digitalmars.com> writes:
This one has (finally) got array operations implemented. For those who 
want to show off their leet assembler skills, the initial assembler 
implementation code is in phobos/internal/array*.d. Burton Radons wrote 
the assembler. Can you make it faster?

http://www.digitalmars.com/d/1.0/changelog.html
http://ftp.digitalmars.com/dmd.1.034.zip

http://www.digitalmars.com/d/2.0/changelog.html
http://ftp.digitalmars.com/dmd.2.018.zip
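
For the curious, a minimal sketch of the new syntax (my own example, not from the changelog; the op dispatches to the routines in phobos/internal/array*.d):

```d
import std.stdio;

void main() {
    int[] a = [1, 2, 3];
    int[] b = [4, 5, 6];
    auto c = new int[3]; // destination needs storage

    c[] = a[] + b[];     // element-wise add via the new runtime routines
    writefln(c);         // prints: [5,7,9]
}
```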
Aug 08 2008
next sibling parent reply Lars Ivar Igesund <larsivar igesund.net> writes:
Walter Bright wrote:

 This one has (finally) got array operations implemented. For those who
 want to show off their leet assembler skills, the initial assembler
 implementation code is in phobos/internal/array*.d. Burton Radons wrote
 the assembler. Can you make it faster?
 
 http://www.digitalmars.com/d/1.0/changelog.html
 http://ftp.digitalmars.com/dmd.1.034.zip
The array op docs aren't actually on the 1.0 array page.

But great! I remember trying to use these 3 years ago :D

-- 
Lars Ivar Igesund
blog at http://larsivi.net
DSource, #d.tango & #D: larsivi
Dancing the Tango
Aug 08 2008
next sibling parent Walter Bright <newshound1 digitalmars.com> writes:
Lars Ivar Igesund wrote:
 The array op docs aren't actually on the 1.0 array page.
Fixed.
Aug 08 2008
prev sibling parent reply "Jarrett Billingsley" <kb3ctd2 yahoo.com> writes:
"Lars Ivar Igesund" <larsivar igesund.net> wrote in message 
news:g7ias2$2kbo$1 digitalmars.com...
 Walter Bright wrote:

 This one has (finally) got array operations implemented. For those who
 want to show off their leet assembler skills, the initial assembler
 implementation code is in phobos/internal/array*.d. Burton Radons wrote
 the assembler. Can you make it faster?

 http://www.digitalmars.com/d/1.0/changelog.html
 http://ftp.digitalmars.com/dmd.1.034.zip
The array op docs aren't actually on the 1.0 array page. But great! I remember trying to use these 3 years ago :D
Too bad Tango doesn't support them yet. :C
Aug 08 2008
parent reply Lars Ivar Igesund <larsivar igesund.net> writes:
Jarrett Billingsley wrote:

 "Lars Ivar Igesund" <larsivar igesund.net> wrote in message
 news:g7ias2$2kbo$1 digitalmars.com...
 Walter Bright wrote:

 This one has (finally) got array operations implemented. For those who
 want to show off their leet assembler skills, the initial assembler
 implementation code is in phobos/internal/array*.d. Burton Radons wrote
 the assembler. Can you make it faster?

 http://www.digitalmars.com/d/1.0/changelog.html
 http://ftp.digitalmars.com/dmd.1.034.zip
The array op docs aren't actually on the 1.0 array page. But great! I remember trying to use these 3 years ago :D
Too bad Tango doesn't support them yet. :C
Are you suggesting that Walter should have told us that he was implementing this feature ahead of releasing 1.034?

-- 
Lars Ivar Igesund
blog at http://larsivi.net
DSource, #d.tango & #D: larsivi
Dancing the Tango
Aug 09 2008
parent reply Walter Bright <newshound1 digitalmars.com> writes:
Lars Ivar Igesund wrote:
 Jarrett Billingsley wrote:
 
 "Lars Ivar Igesund" <larsivar igesund.net> wrote in message
 news:g7ias2$2kbo$1 digitalmars.com...
 Walter Bright wrote:

 This one has (finally) got array operations implemented. For those who
 want to show off their leet assembler skills, the initial assembler
 implementation code is in phobos/internal/array*.d. Burton Radons wrote
 the assembler. Can you make it faster?

 http://www.digitalmars.com/d/1.0/changelog.html
 http://ftp.digitalmars.com/dmd.1.034.zip
The array op docs aren't actually on the 1.0 array page. But great! I remember trying to use these 3 years ago :D
Too bad Tango doesn't support them yet. :C
Are you suggesting that Walter should have told us that he was implementing this feature ahead of releasing 1.034?
All Tango needs to do is copy the internal\array*.d files over and add them to the makefile.
Aug 09 2008
next sibling parent Lars Ivar Igesund <larsivar igesund.net> writes:
Walter Bright wrote:

 Lars Ivar Igesund wrote:
 Jarrett Billingsley wrote:
 
 "Lars Ivar Igesund" <larsivar igesund.net> wrote in message
 news:g7ias2$2kbo$1 digitalmars.com...
 Walter Bright wrote:

 This one has (finally) got array operations implemented. For those who
 want to show off their leet assembler skills, the initial assembler
 implementation code is in phobos/internal/array*.d. Burton Radons
 wrote the assembler. Can you make it faster?

 http://www.digitalmars.com/d/1.0/changelog.html
 http://ftp.digitalmars.com/dmd.1.034.zip
The array op docs aren't actually on the 1.0 array page. But great! I remember trying to use these 3 years ago :D
Too bad Tango doesn't support them yet. :C
Are you suggesting that Walter should have told us that he was implementing this feature ahead of releasing 1.034?
All Tango needs to do is copy the internal\array*.d files over and add them to the makefile.
I know :) No malice intended.

-- 
Lars Ivar Igesund
blog at http://larsivi.net
DSource, #d.tango & #D: larsivi
Dancing the Tango
Aug 09 2008
prev sibling parent Sean Kelly <sean invisibleduck.org> writes:
== Quote from Walter Bright (newshound1 digitalmars.com)'s article
 Lars Ivar Igesund wrote:
 Jarrett Billingsley wrote:

 "Lars Ivar Igesund" <larsivar igesund.net> wrote in message
 news:g7ias2$2kbo$1 digitalmars.com...
 Walter Bright wrote:

 This one has (finally) got array operations implemented. For those who
 want to show off their leet assembler skills, the initial assembler
 implementation code is in phobos/internal/array*.d. Burton Radons wrote
 the assembler. Can you make it faster?

 http://www.digitalmars.com/d/1.0/changelog.html
 http://ftp.digitalmars.com/dmd.1.034.zip
The array op docs aren't actually on the 1.0 array page. But great! I remember trying to use these 3 years ago :D
Too bad Tango doesn't support them yet. :C
Are you suggesting that Walter should have told us that he was implementing this feature ahead of releasing 1.034?
All Tango needs to do is copy the internal\array*.d files over and add them to the makefile.
I took care of this when 1.033 was released, since the files were first included then. There are likely updates in 1.034, but nothing a few minutes with my merge tool can't handle. Sadly, that particular merge tool is a bit broken at the moment, but I'll see about taking care of this anyway.

Sean
Aug 09 2008
prev sibling next sibling parent reply dsimcha <dsimcha yahoo.com> writes:
Uhh... Is it just me or did you accidentally repackage 2.017 and call it 2.018?
I checked all the obvious stuff, cleared my cache and all, and no matter what I
do, the date on DMD.exe in the 2.018 zipfile is 7/10/08, and it identifies
itself as 2.017 when I run it.
Aug 08 2008
parent Walter Bright <newshound1 digitalmars.com> writes:
dsimcha wrote:
 Uhh... Is it just me or did you accidentally repackage 2.017 and call it 2.018?
 I checked all the obvious stuff, cleared my cache and all, and no matter what I
 do, the date on DMD.exe in the 2.018 zipfile is 7/10/08, and it identifies
 itself as 2.017 when I run it.
I don't know what happened there, but I just re-uploaded it.
Aug 08 2008
prev sibling next sibling parent Sean Kelly <sean invisibleduck.org> writes:
== Quote from Walter Bright (newshound1 digitalmars.com)'s article
 This one has (finally) got array operations implemented. For those who
 want to show off their leet assembler skills, the initial assembler
 implementation code is in phobos/internal/array*.d. Burton Radons wrote
 the assembler. Can you make it faster?
Sweet! One of the best updates ever.

Sean
Aug 08 2008
prev sibling next sibling parent Jonathan Crapuchettes <jcrapuchettes gmail.com> writes:
This is great! Thanks for adding this Walter.
JC

Walter Bright wrote:
 This one has (finally) got array operations implemented. For those who 
 want to show off their leet assembler skills, the initial assembler 
 implementation code is in phobos/internal/array*.d. Burton Radons wrote 
 the assembler. Can you make it faster?
 
 http://www.digitalmars.com/d/1.0/changelog.html
 http://ftp.digitalmars.com/dmd.1.034.zip
 
 http://www.digitalmars.com/d/2.0/changelog.html
 http://ftp.digitalmars.com/dmd.2.018.zip
Aug 08 2008
prev sibling next sibling parent reply "Bill Baxter" <wbaxter gmail.com> writes:
On Sat, Aug 9, 2008 at 5:24 AM, Walter Bright
<newshound1 digitalmars.com> wrote:
 This one has (finally) got array operations implemented. For those who want
 to show off their leet assembler skills, the initial assembler
 implementation code is in phobos/internal/array*.d. Burton Radons wrote the
 assembler. Can you make it faster?

 http://www.digitalmars.com/d/1.0/changelog.html
 http://ftp.digitalmars.com/dmd.1.034.zip

 http://www.digitalmars.com/d/2.0/changelog.html
 http://ftp.digitalmars.com/dmd.2.018.zip
That is pretty neat.

So does this mean you've reconsidered your position on adding new features to D1.x? Because q{} strings, 1..10 literals, and that enhancement to IFTI used by std.algorithm all sure would be nice. I'd take those over fancy array ops any day.

--bb
Aug 08 2008
parent reply Walter Bright <newshound1 digitalmars.com> writes:
Bill Baxter wrote:
 That is pretty neat.
 
 So does this mean you've reconsidered your position on adding new
 features to D1.x?
 Because q{} strings,  1..10 literals, and that enhancement to IFTI
 used by std.algorithm, all sure would be nice.  I'd take those over
 fancy array ops any day.
Array ops were always supposed to be there.
Aug 08 2008
parent "Bill Baxter" <wbaxter gmail.com> writes:
On Sat, Aug 9, 2008 at 7:30 AM, Walter Bright
<newshound1 digitalmars.com> wrote:
 Bill Baxter wrote:
 That is pretty neat.

 So does this mean you've reconsidered your position on adding new
 features to D1.x?
 Because q{} strings,  1..10 literals, and that enhancement to IFTI
 used by std.algorithm, all sure would be nice.  I'd take those over
 fancy array ops any day.
Array ops were always supposed to be there.
Ok, I thought the charter for D1 was no new language features, period.

FWIW, I always thought D1 IFTI was supposed to be smarter, so to me adding array ops seems to be similar to porting the fix for 493 to D1.
(http://d.puremagic.com/issues/show_bug.cgi?id=493)

--bb
Aug 08 2008
prev sibling next sibling parent Walter Bright <newshound1 digitalmars.com> writes:
Now on Reddit!

http://www.reddit.com/comments/6vjcv/d_programming_language_gets_vector_operations/
Aug 08 2008
prev sibling next sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Probably I am missing something important. I have tried this code with
1.034 (which compiles my large D libs fine), but I have found many problems:

import std.stdio: putr = writefln;

void main() {
    int[] a1 = [1, 2, 3];
    int[] a2 = [2, 4, 6];

    //putr(a1[] + a2[]); // test.d(6): Error: Array operations not implemented

    auto a3 = a1[] + 4;
    putr(a3); // [1,2,3,0,0,0,0]

    int[] a4 = a1[] + a2[]; // test.d(12): Error: Array operations not
implemented

    int[] a5 = [3, 5, 7, 9];
    int[] a6 = a1 + a5; // test.d(16): Error: Array operations not implemented

    int[] a7;
    a7[] = a1[] + a2[];
    putr(a7); // prints: []

    auto a8 = a1 + a2; // test.d(21): Error: Array operations not implemented
    putr(a8);
}


A few more questions/notes:
- I like a syntax as a+b and a[]+4 instead of a[]+b[] and a[]+4, I am used to
that from PyLab, etc.
- How does it work (or not work) with jagged/square matrices?
- When possible I'll do a few benchmarks compared to normal D code, C code
compiled normally with GCC, and C code automatically vectorized by GCC.
- Is it able to compute a+b+c with a single loop (as all Fortran compilers do)?
I presume the answer is negative.
- Hopefully in the future they may also support the SSE3/SSSE3 that my CPU
supports.

Bye, and good work,
bearophile
Aug 08 2008
next sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
bearophile:
 - I like a syntax as a+b and a[]+4 instead of a[]+b[] and a[]+4,
I meant:

    a + b
    a + 4

instead of:

    a[] + b[]
    a[] + 4
Aug 08 2008
parent Walter Bright <newshound1 digitalmars.com> writes:
bearophile wrote:
 bearophile:
 - I like a syntax as a+b and a[]+4 instead of a[]+b[] and a[]+4,
I meant:

    a + b
    a + 4

instead of:

    a[] + b[]
    a[] + 4
D already distinguishes operations on the array handle, a, from operations on the contents of a, a[]. I think this is a good distinction.
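
A quick sketch of that distinction (my own example, assuming D1 semantics as described here):

```d
import std.stdio;

void main() {
    int[] a = [1, 2, 3];
    int[] b = [4, 5, 6];

    a = b;        // handle assignment: 'a' now refers to b's data
    a[0] = 99;    // writes into b's storage too

    int[] c = [1, 2, 3];
    auto d = new int[3];
    d[] = c[];    // content assignment: copies elements into d's own storage
    d[0] = 42;    // c is unaffected
    writefln(c);  // prints: [1,2,3]
}
```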
Aug 08 2008
prev sibling parent reply Walter Bright <newshound1 digitalmars.com> writes:
bearophile wrote:
 Probably I am missing something important, I have tried this code
 with the 1.034 (that compiles my large d libs fine), but I have found
 many problems:
 
 import std.stdio: putr = writefln;
 
 void main() {
     int[] a1 = [1, 2, 3];
     int[] a2 = [2, 4, 6];
 
 //putr(a1[] + a2[]); // test.d(6): Error: Array operations not implemented
It only works if the top level is an assignment operation.
 
 auto a3 = a1[] + 4;
 putr(a3); // [1,2,3,0,0,0,0]
 
 int[] a4 = a1[] + a2[]; // test.d(12): Error: Array operations not implemented
Doesn't work for initializers.
 
 int[] a5 = [3, 5, 7, 9];
 int[] a6 = a1 + a5; // test.d(16): Error: Array operations not implemented
Doesn't work for initializers.
 
 int[] a7;
 a7[] = a1[] + a2[];
 putr(a7); // prints: []
I don't know what putr is.
 
 auto a8 = a1 + a2; // test.d(21): Error: Array operations not implemented
Have to use slice [] operator.
 putr(a8);
 }
 
 
 A few more questions/notes:
 - I like a syntax as a+b and a[]+4 instead of a[]+b[] and a[]+4, I am used
 to that from PyLab, etc.
 - How does it work (or not work) with jagged/square matrices?
It doesn't.
 - When possible I'll do a few benchmarks compared to normal D code, C code
 compiled normally with GCC, and C code automatically vectorized by GCC.
 - Is it able to compute a+b+c with a single loop (as all Fortran compilers
 do)? I presume the answer is negative.
Yes.
 - Hopefully in the future they may also support the SSE3/SSSE3 that my CPU
 supports.
 
 Bye, and good work, bearophile
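
Putting the answers above together, a sketch of what 1.034 accepts (my own summary example; the array op has to be the top level of an assignment into a slice with already-allocated storage):

```d
import std.stdio: writefln;

void main() {
    int[] a1 = [1, 2, 3];
    int[] a2 = [2, 4, 6];

    // int[] bad = a1[] + a2[]; // initializer: "Array operations not implemented"
    // auto  b   = a1 + a2;     // missing [] slice operator: same error

    auto a3 = new int[a1.length]; // allocate the destination first
    a3[] = a1[] + a2[];           // OK: top-level slice assignment
    writefln(a3);                 // prints: [3,6,9]
}
```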
Aug 08 2008
next sibling parent Moritz Warning <moritzwarning web.de> writes:
On Fri, 08 Aug 2008 22:43:08 -0700, Walter Bright wrote:

 bearophile wrote:
 Probably I am missing something important, I have tried this code with
 the 1.034 (that compiles my large d libs fine), but I have found many
 problems:
 
 import std.stdio: putr = writefln;
 
 void main() {
     int[] a1 = [1, 2, 3];
     int[] a2 = [2, 4, 6];
 
 //putr(a1[] + a2[]); // test.d(6): Error: Array operations not implemented
It only works if the top level is an assignment operation.
[..]
 
 int[] a5 = [3, 5, 7, 9];
 int[] a6 = a1 + a5; // test.d(16): Error: Array operations not implemented
Doesn't work for initializers.
[..]

Looks like there is room for improvement. It does put a strain on the programmer's nerves when things don't work as expected. :)

Anyway - good work!
Aug 09 2008
prev sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
bearophile wrote:
 import std.stdio: putr = writefln;
Walter Bright:
It only works if the top level is an assignment operation.<
Then I have not seen such a comment in the docs; if this is absent from the docs, it deserves to be added. And the error message DMD gives can be improved too.
Doesn't work for initializers.<
Both the docs (if not already covered there) and the error message have to explain this. This output looks like a compiler bug anyway: [1,2,3,0,0,0,0]
 int[] a7; a7[] = a1[] + a2[]; putr(a7); // prints: []
I don't know what putr is.
It's just a shorter alias for writefln.
auto a8 = a1 + a2; // test.d(21): Error: Array operations not implemented<<
Have to use slice [] operator.<
Then I'd like a less misleading error message.
Is it able to compute a+b+c with a single loop (as all Fortran compilers do)?<<
Yes.<
This is very positive :-)
D already distinguishes operations on the array handle, a, from operations on
the contents of a, a[]. I think this is a good distinction.<
I understand and I agree, but the [] makes the code a little less natural to write.

----------------------------

For reference this is the shortened code; it compiles and runs, but the results and error messages are bogus:

import std.stdio: writefln;

void main() {
    int[] a1 = [1, 2, 3];
    int[] a2 = [2, 4, 6];

    auto a3 = a1[] + 4;
    writefln(a3); // prints: [1,2,3,0,0,0,0]

    int[] a7;
    a7[] = a1[] + a2[];
    writefln(a7); // prints: []

    // a7 = a1 + a2; // test2.d(14): Error: Array operations not implemented
}

The last line gives a wrong error message (well, the error messages in the preceding code were all wrong).

-------------------

The following code works, yay! :-)

import std.stdio: writefln;

void main() {
    int[] a1 = [1, 2, 3];
    int[] a2 = [2, 4, 6];

    auto a3 = new int[2];
    a3[] = a1[] + a2[];
    writefln(a3); // prints: [3,6]
}

Later,
bearophile
Aug 09 2008
next sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
First benchmark, just D against itself (GCC not used yet). The results show that
vector ops are generally slower, but maybe there's some bug/problem in my
benchmark (note it needs just Phobos!). Not tested on Linux yet:


import std.stdio: put = writef, putr = writefln;
import std.conv: toInt;

version (Win32) {
    import std.c.windows.windows: QueryPerformanceCounter,
QueryPerformanceFrequency;

    double clock() {
        long t;
        QueryPerformanceCounter(&t);

        return cast(double)t / queryPerformanceFrequency;
    }

    long queryPerformanceFrequency;

    static this() {
        QueryPerformanceFrequency(&queryPerformanceFrequency);
    }
}

version (linux) {
    import std.c.linux.linux: time;

    double clock() {
        return cast(double)time(null);
    }
}


void main(string[] args) {
    int n = args.length >= 2 ? toInt(args[1]) : 10;
    n *= 8; // to avoid problems with SSE2
    int nloops = args.length >= 3 ? toInt(args[2]) : 1;
    bool use_vec = args.length == 4 ? cast(bool)toInt(args[3]) : true;

    putr("array len= ", n, "  nloops= ", nloops, "  Use vec ops: ", use_vec);

    auto a1 = new int[n]; // void?
    auto a2 = new int[n]; // void?
    auto a3 = new int[n];

    foreach (i, ref el; a1)
        el = i * 7 + 1;
    foreach (i, ref el; a2)
        el = i + 1;

    auto t = clock();
    if (use_vec)
        for (int j = 0; j < nloops; j++)
            a3[] = a1[] / a2[];
    else
        for (int j = 0; j < nloops; j++)
            for (int i; i < a3.length; i++)
                a3[i] = a1[i] / a2[i];
    putr("time= ", clock() - t, " s");

    if (a3.length < 300)
        putr("\nResult:\n", a3);
}

/*
D code with /:
    C:\>array_benchmark.exe 10000 10000 0
    array len= 80000  nloops= 10000  Use vec ops: false
    time= 7.10563 s

    C:\>array_benchmark.exe 10000 10000 1
    array len= 80000  nloops= 10000  Use vec ops: true
    time= 7.222 s


    C:\>array_benchmark.exe 12000000 1 0
    array len= 96000000  nloops= 1  Use vec ops: false
    time= 0.654696 s

    C:\>array_benchmark.exe 12000000 1 1
    array len= 96000000  nloops= 1  Use vec ops: true
    time= 0.655401 s


D code with *:
    C:\>array_benchmark.exe 10000 10000 0
    array len= 80000  nloops= 10000  Use vec ops: false
    time= 7.10615 s

    C:\>array_benchmark.exe 10000 10000 1
    array len= 80000  nloops= 10000  Use vec ops: true
    time= 7.21904 s


    C:\>array_benchmark.exe 12000000 1 0
    array len= 96000000  nloops= 1  Use vec ops: false
    time= 0.65515 s

    C:\>array_benchmark.exe 12000000 1 1
    array len= 96000000  nloops= 1  Use vec ops: true
    time= 0.65566 s
    (Note that 0.65566 > 0.65515 isn't due to noise)


D code with +:
    C:\>array_benchmark.exe 10000 10000 0
    array len= 80000  nloops= 10000  Use vec ops: false
    time= 7.10848 s

    C:\>array_benchmark.exe 10000 10000 1
    array len= 80000  nloops= 10000  Use vec ops: true
    time= 7.22527 s


    C:\>array_benchmark.exe 12000000 1 0
    array len= 96000000  nloops= 1  Use vec ops: false
    time= 0.654797 s

    C:\>array_benchmark.exe 12000000 1 1
    array len= 96000000  nloops= 1  Use vec ops: true
    time= 0.654991 s

*/


Bye,
bearophile
Aug 09 2008
next sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Second version, just a bit cleaner code, less bug-prone, etc:
http://codepad.org/BlwSIBKl

Timings on Linux on DMD 2.0 with * as the operation seem much better.

Later,
bearophile
Aug 09 2008
parent bearophile <bearophileHUGS mailas.com> writes:
C version too:

#include <stdlib.h>
#include <stdio.h>
#include <time.h>

#define MYOP *
typedef int T;
#define TFORM "%d "

void error(char *string) {
    fprintf(stderr, "ERROR: %s\n", string);
    exit(EXIT_FAILURE);
}

double myclock() {
    clock_t t = clock();
    if (t == -1)
        return 0.0;
    else
        return t / (double)CLOCKS_PER_SEC;
}

int main(int argc, char** argv) {
    int n = argc >= 2 ? atoi(argv[1]) : 10;

    n *= 8; // to avoid problems with SSE2
    int nloops = argc >= 3 ? atoi(argv[2]) : 1;

    printf("array len= %d  nloops= %d\n", n, nloops);

    //__attribute__((aligned(16)))
    T* __restrict a1 = (T*)malloc(sizeof(T) * n + 16);
    T* __restrict a2 = (T*)malloc(sizeof(T) * n + 16);
    T* __restrict a3 = (T*)malloc(sizeof(T) * n + 16);
    if (a1 == NULL || a2 == NULL || a3 == NULL)
        error("memory overflow");

    int i, j;
    for (i = 0; i < n; i++) {
        a1[i] = i * 7 + 1;
        a2[i] = i + 1;
    }

    double t = myclock();
    for (j = 0; j < nloops; j++)
        for (i = 0; i < n; i++) // Alignment of access forced using peeling.
            a3[i] = a1[i] MYOP a2[i];
    printf("time= %f s\n", myclock() - t);

    if (n < 300) {
        printf("\nResult:\n");
        for (i = 0; i < n; i++)
            printf(TFORM, a3[i]);
        putchar('\n');
    }

    return 0;
}

/*

MYOP = *, compiled with:
gcc -Wall -O3 -s benchmark.c -o benchmark
    C:\>benchmark 100 3000000
    array len= 800  nloops= 3000000
    time= 3.656000 s

    C:\>benchmark 10000 10000
    array len= 80000  nloops= 10000
    time= 1.374000 s

    C:\>benchmark 12000000 1
    array len= 96000000  nloops= 1
    time= 0.547000 s


MYOP = *, compiled with:
gcc -Wall -O3 -s -ftree-vectorize -msse3 -ftree-vectorizer-verbose=5
benchmark.c -o benchmark
    C:\>benchmark 100 3000000
    array len= 800  nloops= 3000000
    time= 3.468000 s

    C:\>benchmark 10000 10000
    array len= 80000  nloops= 10000
    time= 1.156000 s

    C:\>benchmark 12000000 1
    array len= 96000000  nloops= 1
    time= 0.531000 s

In the larger array the cache effects may dominate over computing time.

*/
Aug 09 2008
prev sibling next sibling parent reply dsimcha <dsimcha yahoo.com> writes:
== Quote from bearophile (bearophileHUGS lycos.com)'s article
 First benchmark, just D against itself, not used GCC yet, the results show that
 vector ops are generally slower, but maybe there's some bug/problem in my
 benchmark (note it needs just Phobos!), not tested on Linux yet:

I see at least part of the problem. When you use such huge arrays, it ends up being more a test of your memory bandwidth than of the vector ops. Three arrays of 80000 ints comes to a total of about 960k. This is not going to fit in any L1 cache for a long time. Heck, my CPU only has 512k L2 cache per core.

Here are my results using smaller arrays designed to fit in my 64k L1 data cache, and the same code as bearophile.

+ operator:

D:\code>array_benchmark.exe 500 1000000 0
array len= 4000  nloops= 1000000  Use vec ops: false
time= 4.82841 s

D:\code>array_benchmark.exe 500 1000000 1
array len= 4000  nloops= 1000000  Use vec ops: true
time= 2.32902 s

* operator:

D:\code>array_benchmark.exe 500 1000000 0
array len= 4000  nloops= 1000000  Use vec ops: false
time= 6.1556 s

D:\code>array_benchmark.exe 500 1000000 1
array len= 4000  nloops= 1000000  Use vec ops: true
time= 6.16539 s

/ operator:

D:\code>array_benchmark.exe 500 100000 0
array len= 4000  nloops= 100000  Use vec ops: false
time= 7.02435 s

D:\code>array_benchmark.exe 500 100000 1
array len= 4000  nloops= 100000  Use vec ops: true
time= 6.84251 s

BTW, for the sake of comparison, here are my CPU specs from CPU-Z. Also note that I'm running in 32-bit mode.

Number of processors    1
Number of cores         2 per processor
Number of threads       2 (max 2) per processor
Name                    AMD Athlon 64 X2 3600+
Code Name               Brisbane
Specification           AMD Athlon(tm) 64 X2 Dual Core Processor 3600+
Package                 Socket AM2 (940)
Family/Model/Stepping   F.B.1
Extended Family/Model   F.6B
Brand ID                4
Core Stepping           BH-G1
Technology              65 nm
Core Speed              2698.1 MHz
Multiplier x Bus speed  9.5 x 284.0 MHz
HT Link speed           852.0 MHz
Stock frequency         1900 MHz
Instruction sets        MMX (+), 3DNow! (+), SSE, SSE2, SSE3, x86-64
L1 Data cache           (per processor) 2 x 64 KBytes, 2-way set associative, 64-byte line size
L1 Instruction cache    (per processor) 2 x 64 KBytes, 2-way set associative, 64-byte line size
L2 cache                (per processor) 2 x 512 KBytes, 16-way set associative, 64-byte line size
Aug 09 2008
next sibling parent bearophile <bearophileHUGS mailas.com> writes:
dsimcha:
 I see at least part of the problem.  When you use such huge arrays, it ends up
 being more a test of your memory bandwidth than of the vector ops.
Right. Finding good benchmarks is not easy, and I have shown the code here for people to spot problems in it. I have added a C version too now.

Bye,
bearophile
Aug 09 2008
prev sibling parent Don <nospam nospam.com.au> writes:
dsimcha wrote:
 == Quote from bearophile (bearophileHUGS lycos.com)'s article
 First benchmark, just D against itself, not used GCC yet, the results show that
 vector ops are generally slower, but maybe there's some bug/problem in my
 benchmark (note it needs just Phobos!), not tested on Linux yet:

 I see at least part of the problem. When you use such huge arrays, it ends up
 being more a test of your memory bandwidth than of the vector ops. Three
 arrays of 80000 ints comes to a total of about 960k. This is not going to fit
 in any L1 cache for a long time.
Yes. The solution to that is to check for huge array sizes, and use a different routine (with prefetching) in that case.

Actually, the most important routine that should be doing this is memcpy/array slice assignment, but I'm not sure it does; I think it just does a movsd. So I think this is still a useful case to benchmark, though it's not the most important one.
Aug 10 2008
prev sibling next sibling parent reply Walter Bright <newshound1 digitalmars.com> writes:
bearophile wrote:
 First benchmark, just D against itself, not used GCC yet, the results
 show that vector ops are generally slower, but maybe there's some
 bug/problem in my benchmark (note it needs just Phobos!), not tested
 on Linux yet:
[...]
 a3[] = a1[] / a2[];
I wouldn't be a bit surprised at that since / for int[]s does not have a custom asm routine for it. See phobos/internal/arrayint.d If someone wants to write one, I'll put it in!
Aug 09 2008
parent reply bearophile <bearophileHUGS lycos.com> writes:
Walter Bright:
 I wouldn't be a bit surprised at that since / for int[]s does not have a 
 custom asm routine for it.
I didn't know that. We may want to compile a list of such things. But as you can see I have performed benchmarks with +, *, and /, not just /. It's very easy to write wrong benchmarks, so I am careful, but from the little I have seen so far the speed improvements are absent or the ratio is below 1 (a slowdown). And I haven't yet seen SSE2 asm in my compiled programs :-)
Is it able to compute a+b+c with a single loop (as all Fortran compilers do)?<<
Yes.<
But later on Reddit the answer by Walter was:
This optimization is called "loop fusion", and is well known. It doesn't always
result in a speedup, though. The dmd compiler doesn't do it, but that is not
the fault of D.<
At a closer look the two questions are different; I think he meant:

    a += b + c;  // => single loop

    a += b;
    a += c;      // => two loops

I think this is acceptable.

Bye,
bearophile
Aug 09 2008
parent reply Walter Bright <newshound1 digitalmars.com> writes:
bearophile wrote:
 It's very easy to write wrong benchmarks, so I am careful, but from
 the little I have seen so far the speed improvements are absent or
 less than 1 (slow down).
If this happens, then it's worth verifying that the asm code is actually being run by inserting a printf in it.
 And I haven't yet seen SSE2 asm in my compiled programs :-)
The dmd compiler doesn't generate SSE2 instructions. But the routines in internal\array*.d do.
Aug 09 2008
parent reply bearophile <bearophileHUGS lycos.com> writes:
Walter Bright:
 If this happens, then it's worth verifying that the asm code is actually 
 being run by inserting a printf in it.
I presume I'll have to recompile Phobos for that.
 And I haven't yet seen SSE2 asm in my compiled programs :-)
The dmd compiler doesn't generate SSE2 instructions. But the routines in internal\array*.d do.
I know. I was talking about the parts of the code that, for example, add the arrays; according to the Phobos source code they use SSE2, but in the final compiled code they are absent.

Bye,
bearophile
Aug 10 2008
parent reply Walter Bright <newshound1 digitalmars.com> writes:
bearophile wrote:
 Walter Bright:
 If this happens, then it's worth verifying that the asm code is
 actually being run by inserting a printf in it.
I presume I'll have to recompile Phobos for that.
Not really, it's easier to just copy that particular function out of the library and paste it into your test module, that way it's easier to experiment with.
 And I haven't yet seen SSE2 asm in my compiled programs :-)
The dmd compiler doesn't generate SSE2 instructions. But the routines in internal\array*.d do.
I know. I was talking about the parts of the code that, for example, add the arrays; according to the Phobos source code they use SSE2, but in the final compiled code they are absent.
I don't know what you mean. The SSE2 instructions are in internal/arrayint.d, and they do get compiled in.
Aug 10 2008
parent reply "Dave" <Dave_member pathlink.com> writes:
"Walter Bright" <newshound1 digitalmars.com> wrote in message 
news:g7na5s$qg0$1 digitalmars.com...
 bearophile wrote:
 Walter Bright:
 If this happens, then it's worth verifying that the asm code is
 actually being run by inserting a printf in it.
I presume I'll have to recompile Phobos for that.
Not really, it's easier to just copy that particular function out of the library and paste it into your test module, that way it's easier to experiment with.
 And I haven't yet seen SSE2 asm in my compiled programs :-)
The dmd compiler doesn't generate SSE2 instructions. But the routines in internal\array*.d do.
I know. I was talking about the parts of the code that, for example, add the arrays; according to the Phobos source code they use SSE2, but in the final compiled code they are absent.
I don't know what you mean. The SSE2 instructions are in internal/arrayint.d, and they do get compiled in.
The SSE2 is being used, but what would be nice would be the same code that Burton used for his benchmarks. Is that available?

Thanks,

- Dave

import std.stdio, std.date, std.conv;

void main(string[] args)
{
    if(args.length < 3) {
        writefln("usage: ", args[0], " <array size> <iterations>");
        return;
    }

    auto ASIZE = toInt(args[1]);
    auto ITERS = toInt(args[2]);
    writefln("Array Size = ", ASIZE, ", Iterations = ", ITERS);

    int[] ia, ib, ic;
    ia = new int[ASIZE];
    ib = new int[ASIZE];
    ic = new int[ASIZE];
    ib[] = ic[] = 10;

    double[] da, db, dc;
    da = new double[ASIZE];
    db = new double[ASIZE];
    dc = new double[ASIZE];
    db[] = dc[] = 10.0;

    {
        ia[] = 0;
        int sum = 0;
        d_time s = getUTCtime();
        for(size_t i = 0; i < ITERS; i++) {
            sum += aops!(int)(ia, ib, ic);
        }
        d_time e = getUTCtime();
        writefln("intaops: ", (e - s) / 1000.0, " secs, sum = ", sum);
    }
    {
        ia[] = 0;
        int sum = 0;
        d_time s = getUTCtime();
        for(size_t i = 0; i < ITERS; i++) {
            sum += loop!(int)(ia, ib, ic);
        }
        d_time e = getUTCtime();
        writefln("intloop: ", (e - s) / 1000.0, " secs, sum = ", sum);
    }
    {
        da[] = 0.0;
        double sum = 0.0;
        d_time s = getUTCtime();
        for(size_t i = 0; i < ITERS; i++) {
            sum += aops!(double)(da, db, dc);
        }
        d_time e = getUTCtime();
        writefln("dfpaops: ", (e - s) / 1000.0, " secs, sum = ", sum);
    }
    {
        da[] = 0.0;
        double sum = 0.0;
        d_time s = getUTCtime();
        for(size_t i = 0; i < ITERS; i++) {
            sum += loop!(double)(da, db, dc);
        }
        d_time e = getUTCtime();
        writefln("dfploop: ", (e - s) / 1000.0, " secs, sum = ", sum);
    }
}

T aops(T)(T[] a, T[] b, T[] c)
{
    a[] = b[] + c[];
    return a[$-1];
}

T loop(T)(T[] a, T[] b, T[] c)
{
    foreach(i, inout val; a)
        val = b[i] + c[i];
    return a[$-1];
}

C:\Zz>dmd -O -inline -release top.d

C:\Zz>top 4000 100000
Array Size = 4000, Iterations = 100000
intaops: 0.204 secs, sum = 2000000
intloop: 0.515 secs, sum = 2000000
dfpaops: 0.625 secs, sum = 2e+06
dfploop: 0.563 secs, sum = 2e+06
Aug 11 2008
parent "Dave" <Dave_member pathlink.com> writes:

"Dave" <Dave_member pathlink.com> wrote in message 
news:g7qr3h$2l6$1 digitalmars.com...
 "Walter Bright" <newshound1 digitalmars.com> wrote in message 
 news:g7na5s$qg0$1 digitalmars.com...
 bearophile wrote:
 Walter Bright:
 If this happens, then it's worth verifying that the asm code is
 actually being run by inserting a printf in it.
I presume I'll have to recompile Phobos for that.
Not really, it's easier to just copy that particular function out of the library and paste it into your test module, that way it's easier to experiment with.
 And I haven't yet seen SSE2 asm in my compiled programs :-)
The dmd compiler doesn't generate SSE2 instructions. But the routines in internal\array*.d do.
I know. I was talking about the parts of the code that, for example, add the arrays; according to the Phobos source code they use SSE2, but in the final code the compiler produces they seem to be absent.
I don't know what you mean. The SSE2 instructions are in internal/arrayint.d, and they do get compiled in.
The SSE2 is being used, but what would be nice would be the same code that Burton used for his benchmarks. Is that available? Thanks, - Dave
Before:
 C:\Zz>top 4000 100000
 Array Size = 4000, Iterations = 100000
 intaops: 0.204 secs, sum = 2000000
 intloop: 0.515 secs, sum = 2000000
 dfpaops: 0.625 secs, sum = 2e+06
 dfploop: 0.563 secs, sum = 2e+06
After adding an aligned case for _arraySliceSliceAddSliceAssign_d:

C:\Zz>top 4000 100000
Array Size = 4000, Iterations = 100000
intaops: 0.212 secs, sum = 2000000
intloop: 0.525 secs, sum = 2000000
dfpaops: 0.438 secs, sum = 2e+06
dfploop: 0.557 secs, sum = 2e+06

;--- SiSoftware Sandra
Processor
Model : Intel(R) Core(TM)2 CPU 6700  2.66GHz

Processor Cache(s)
Internal Data Cache : 32kB, Synchronous, Write-Thru, 8-way set, 64 byte line size
Internal Instruction Cache : 32kB, Synchronous, Write-Back, 8-way set, 64 byte line size
L2 On-board Cache : 4MB, ECC, Synchronous, ATC, 16-way set, 64 byte line size, 2 threads sharing
L2 Cache Multiplier : 1/1x (2667MHz)
Aug 11 2008
prev sibling parent Walter Bright <newshound1 digitalmars.com> writes:
bearophile wrote:
 D code with +:
I found the results to be heavily dependent on the data set size:

C:\mars>test5 1000 10000
array len= 8000 nloops= 10000
vec time= 0.0926506 s
non-vec time= 0.626356 s

C:\mars>test5 2000 10000
array len= 16000 nloops= 10000
vec time= 0.279727 s
non-vec time= 1.70048 s

C:\mars>test5 3000 10000
array len= 24000 nloops= 10000
vec time= 0.795482 s
non-vec time= 2.47597 s

C:\mars>test5 4000 10000
array len= 32000 nloops= 10000
vec time= 2.36905 s
non-vec time= 3.90906 s

C:\mars>test5 5000 10000
array len= 40000 nloops= 10000
vec time= 3.12636 s
non-vec time= 3.70741 s

For smaller sets, it's a 2x speedup; for larger ones, only a few percent. What we're seeing here is most likely the effect of the data set size exceeding the cache. It would be a fun project for someone to see if the performance for such large data sets could somehow be improved, perhaps by "warming up" the cache?
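[The cache explanation is easy to sanity-check with a little arithmetic. A sketch follows; note that the element size of test5's arrays is a guess, since the benchmark's source isn't shown in this thread:]

```python
def working_set_bytes(array_len, elem_size, n_arrays=3):
    """Bytes touched per pass of a[] = b[] + c[]: three arrays are streamed."""
    return array_len * elem_size * n_arrays

# Assuming 8-byte (double) elements -- a guess, since test5's source isn't shown:
small = working_set_bytes(8000, 8)    # len 8000  -> 192000 bytes, well under a few-MB L2
large = working_set_bytes(40000, 8)   # len 40000 -> 960000 bytes, approaching L2 capacity
```

[Once the three arrays together no longer fit in cache, both the vectorized and the plain loop become memory-bound, which would explain the shrinking speedup.]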
Aug 10 2008
prev sibling parent Walter Bright <newshound1 digitalmars.com> writes:
bearophile wrote:
 This output looks like a bug of the compiler anyway: [1,2,3,0,0,0,0]
Please post all bugs to bugzilla! thanks
Aug 09 2008
prev sibling next sibling parent reply Michael P. <baseball.mjp gmail.com> writes:
Walter Bright Wrote:

 This one has (finally) got array operations implemented. For those who 
 want to show off their leet assembler skills, the initial assembler 
 implementation code is in phobos/internal/array*.d. Burton Radons wrote 
 the assembler. Can you make it faster?
 
 http://www.digitalmars.com/d/1.0/changelog.html
 http://ftp.digitalmars.com/dmd.1.034.zip
 
 http://www.digitalmars.com/d/2.0/changelog.html
 http://ftp.digitalmars.com/dmd.2.018.zip
This will probably make me sound like an idiot, but what are these array operations everyone's so stoked about? I've only been learning D for a week and a half, fill me in! BTW, nice update!
Aug 08 2008
parent reply bearophile <bearophileHUGS lycos.com> writes:
Michael P.:
 This will probably make me sound like an idiot, but what are these array
operations everyone's so stoked about? I've only been learning D for a week and
a half, fill me in!
If Walter gives you one link, you have to follow it before asking :-) http://www.digitalmars.com/d/1.0/arrays.html#array-operations Bye, bearophile
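[For the curious, the semantics of an expression like `a[] = b[] + c[]` can be sketched in Python; the function name here is illustrative, not the Phobos implementation:]

```python
def slice_add_assign(a, b, c):
    """Elementwise a[] = b[] + c[]: element i of a becomes b[i] + c[i]."""
    assert len(a) == len(b) == len(c), "operands must have matching lengths"
    for i in range(len(a)):
        a[i] = b[i] + c[i]
    return a
```

[The point of the D feature is that the compiler and runtime are free to replace this per-element loop with SSE2 code.]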
Aug 08 2008
parent Michael P. <baseball.mjp gmail.com> writes:
bearophile Wrote:

 Michael P.:
 This will probably make me sound like an idiot, but what are these array
operations everyone's so stoked about? I've only been learning D for a week and
a half, fill me in!
If Walter gives you one link, you have to follow it before asking :-) http://www.digitalmars.com/d/1.0/arrays.html#array-operations Bye, bearophile
Who knows where that could have led to... I was just playing it safe. :D
Aug 08 2008
prev sibling next sibling parent reply JAnderson <ask me.com> writes:
Walter Bright wrote:
 This one has (finally) got array operations implemented. For those who 
 want to show off their leet assembler skills, the initial assembler 
 implementation code is in phobos/internal/array*.d. Burton Radons wrote 
 the assembler. Can you make it faster?
 
 http://www.digitalmars.com/d/1.0/changelog.html
 http://ftp.digitalmars.com/dmd.1.034.zip
 
 http://www.digitalmars.com/d/2.0/changelog.html
 http://ftp.digitalmars.com/dmd.2.018.zip
Sweet! I love the way you put this forth as a challenge. Maybe D will have the world's fastest array operations :)

-Joel
Aug 08 2008
parent Walter Bright <newshound1 digitalmars.com> writes:
JAnderson wrote:
 Sweet!  I love the way you put this forth as a challenge.  Maybe D will 
 have the world's fastest array operations :)
I thought a little competition might bring out the best in people!
Aug 09 2008
prev sibling next sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Walter Bright:
 Can you make it faster?
Lots of people today have 2 (or even 4) cores, and the order of computation of those ops is arbitrary, so a major (nearly linear, hopefully) speedup will probably come as soon as all the cores are used. This job splitting is probably an advantage even when the ops aren't computed by asm code.

I've taken a look at my code, and so far I don't see many spots where the array operations (once they actually give some speedup) can be useful (there are many other things I'd find much more useful than such ops; see my wish lists). But if the array ops are useful for enough people, then it may be worth burning some programming time to make those array ops use all 2-4+ cores.

Bye,
bearophile
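[The splitting bearophile describes is just a partition of the index range. A Python sketch of the decomposition; note that CPython's GIL means this particular code won't actually run faster, it only shows the strategy:]

```python
from concurrent.futures import ThreadPoolExecutor

def chunk_ranges(n, workers):
    """Split [0, n) into `workers` near-equal half-open ranges."""
    base, extra = divmod(n, workers)
    ranges, start = [], 0
    for w in range(workers):
        size = base + (1 if w < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges

def parallel_add(a, b, c, workers=4):
    """a[i] = b[i] + c[i], each index range handed to a separate worker."""
    def work(lo, hi):
        for i in range(lo, hi):
            a[i] = b[i] + c[i]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for lo, hi in chunk_ranges(len(a), workers):
            pool.submit(work, lo, hi)   # pool shutdown on exit joins all workers
    return a
```

[Since the chunks are disjoint, no locking is needed on the output array.]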
Aug 09 2008
parent reply Christopher Wright <dhasenan gmail.com> writes:
bearophile wrote:
 Walter Bright:
 Can you make it faster?
Lot of people today have 2 (or even 4 cores), the order of the computation of those ops is arbitrary, so a major (nearly linear, hopefully) speedup will probably come as soon all the cores are used. This job splitting is probably an advantage even then the ops aren't computed by asm code.
The overhead of creating a new thread for this would be significant. You'd probably be better off using a regular loop for arrays that are not huge.
Aug 09 2008
next sibling parent "Jarrett Billingsley" <kb3ctd2 yahoo.com> writes:
"Christopher Wright" <dhasenan gmail.com> wrote in message 
news:g7ljal$2i84$1 digitalmars.com...
 bearophile wrote:
 Walter Bright:
 Can you make it faster?
Lot of people today have 2 (or even 4 cores), the order of the computation of those ops is arbitrary, so a major (nearly linear, hopefully) speedup will probably come as soon all the cores are used. This job splitting is probably an advantage even then the ops aren't computed by asm code.
The overhead of creating a new thread for this would be significant. You'd probably be better off using a regular loop for arrays that are not huge.
I think we could see a lot more improvement from using vector ops to perform SIMD operations. They are just begging for it.
Aug 09 2008
prev sibling next sibling parent JAnderson <ask me.com> writes:
Christopher Wright wrote:
 bearophile wrote:
 Walter Bright:
 Can you make it faster?
Lot of people today have 2 (or even 4 cores), the order of the computation of those ops is arbitrary, so a major (nearly linear, hopefully) speedup will probably come as soon all the cores are used. This job splitting is probably an advantage even then the ops aren't computed by asm code.
The overhead of creating a new thread for this would be significant. You'd probably be better off using a regular loop for arrays that are not huge.
I agree. I think a lot of profiling would be in order to see when certain things become an advantage to use, then use a branch to jump to the best algorithm for the particular case (platform + length of array). Hopefully the compiler could inline the algorithm so that constant-sized arrays don't pay for the additional overhead.

There would be a small cost in the extra branch for small dynamic arrays. One could argue that if this becomes a performance bottleneck, the program is doing a lot of operations on lots of small arrays; the user could change the design to group their small arrays into a larger array to get the performance they desire.

-Joel
Aug 09 2008
prev sibling parent renoX <renosky free.fr> writes:
Christopher Wright a écrit :
 bearophile wrote:
 Walter Bright:
 Can you make it faster?
Lot of people today have 2 (or even 4 cores), the order of the computation of those ops is arbitrary, so a major (nearly linear, hopefully) speedup will probably come as soon all the cores are used. This job splitting is probably an advantage even then the ops aren't computed by asm code.
The overhead of creating a new thread for this would be significant.
Well, for this kind of scheme you wouldn't start a new set of threads each time! Just start a set of worker threads (one per CPU, pinned to each CPU) at program startup; they do nothing until they are woken up because there is an operation which can be accelerated through parallelism.
 You'd probably be better off using a regular loop for arrays that are 
 not huge.
Sure, even with pre-created threads, using several CPUs induces additional startup and teardown cost, so this would be worthwhile only for loops that are 'big enough'. A pitfall also is to ensure that two CPUs don't write to the same cache line, otherwise this 'false sharing' will reduce performance.

renoX
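[renoX's pre-created worker scheme might look like this in Python. A sketch only: CPU pinning is omitted, the class name is made up, and the GIL prevents a real speedup for pure-Python loops:]

```python
import queue
import threading

class WorkerPool:
    """Workers are created once at startup and sleep on a queue until woken."""

    def __init__(self, workers):
        self.tasks = queue.Queue()
        for _ in range(workers):
            threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        while True:
            fn, lo, hi, done = self.tasks.get()
            fn(lo, hi)          # process one slice of the index range
            done.set()          # signal completion to the submitter

    def run_split(self, fn, n, parts):
        """Hand each worker a half-open slice of [0, n), then wait for all."""
        base, extra = divmod(n, parts)
        events, start = [], 0
        for p in range(parts):
            size = base + (1 if p < extra else 0)
            done = threading.Event()
            self.tasks.put((fn, start, start + size, done))
            events.append(done)
            start += size
        for e in events:
            e.wait()
```

[The threads exist for the lifetime of the program, so the per-operation cost is a queue push and a wakeup rather than a thread creation.]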
Sep 07 2008
prev sibling next sibling parent reply "Craig Black" <craigblack2 cox.net> writes:
Very exciting stuff!  Keep up the good work.

Currently it only optimizes int and float.  I assume you could get it 
working for double pretty easily as well.  Is it extensible to user defined 
types like a Vector3 class?

-Craig 
Aug 09 2008
parent bearophile <bearophileHUGS lycos.com> writes:
Craig Black:
 Currently it only optimizes int and float.
Currently it optimizes very little, I think. I have posted C and D benchmarks:
http://codepad.org/BlwSIBKl
http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.D.announce&article_id=12718

Bye,
bearophile
Aug 09 2008
prev sibling next sibling parent reply Don <nospam nospam.com.au> writes:
Walter Bright wrote:
 This one has (finally) got array operations implemented. For those who 
 want to show off their leet assembler skills, the initial assembler 
 implementation code is in phobos/internal/array*.d. Burton Radons wrote 
 the assembler. Can you make it faster?
 
 http://www.digitalmars.com/d/1.0/changelog.html
 http://ftp.digitalmars.com/dmd.1.034.zip
 
 http://www.digitalmars.com/d/2.0/changelog.html
 http://ftp.digitalmars.com/dmd.2.018.zip
I intend to contribute some asm routines, but have been working on bigint operations (both Tango and Phobos) for the past couple of weeks.
Aug 10 2008
parent Walter Bright <newshound1 digitalmars.com> writes:
Don wrote:
 I intend to contribute some asm routines, but have been working on 
 bigint operations (both Tango and Phobos) for the past couple of weeks.
Cool!
Aug 10 2008
prev sibling next sibling parent reply Pete <example example.com> writes:
Walter Bright Wrote:

 This one has (finally) got array operations implemented. For those who 
 want to show off their leet assembler skills, the initial assembler 
 implementation code is in phobos/internal/array*.d. Burton Radons wrote 
 the assembler. Can you make it faster?
 
 http://www.digitalmars.com/d/1.0/changelog.html
 http://ftp.digitalmars.com/dmd.1.034.zip
 
 http://www.digitalmars.com/d/2.0/changelog.html
 http://ftp.digitalmars.com/dmd.2.018.zip
Not sure if someone else has already mentioned this, but would it be possible for the compiler to align these arrays on 16-byte boundaries in order to maximise any possible vector efficiency? AFAIK you can't actually specify an alignment higher than align 8 at the moment, which is a bit of a problem.

Regards,
Aug 11 2008
next sibling parent Walter Bright <newshound1 digitalmars.com> writes:
Pete wrote:
 Not sure if someone else has already mentioned this but would it be
 possible for the compiler to align these arrays on 16 byte boundaries
 in order to maximise any possible vector efficiency. AFAIK you can't
 actually specify align anything higher than align 8 at the moment
 which is a bit of a problem.
Anything allocated with new will be aligned on 16 byte boundaries.
Aug 11 2008
prev sibling parent reply Georg Lukas <georg op-co.de> writes:
On Mon, 11 Aug 2008 09:55:26 -0400, Pete wrote:
 Walter Bright Wrote:
 This one has (finally) got array operations implemented. For those who
 want to show off their leet assembler skills, the initial assembler
 implementation code is in phobos/internal/array*.d. Burton Radons wrote
 the assembler. Can you make it faster?
Not sure if someone else has already mentioned this but would it be possible for the compiler to align these arrays on 16 byte boundaries in order to maximise any possible vector efficiency. AFAIK you can't actually specify align anything higher than align 8 at the moment which is a bit of a problem.
From a short look at the array*.d source code, it would be better to check
if source and destination have the same alignment, i.e.:

a = 0xf00d0013 (3 mod 16)
b = 0xdeaffff3 (3 mod 16)

In that case, the first 16-3 = 13 bytes can be handled using regular D code,
and the aligned SSE version can be used for the rest. This would also work
for slices, at least when both slices have the same alignment remainder. I'm
just not sure what overhead such a solution would impose for small arrays.

Georg
-- 
|| http://op-co.de ++ GCS/CM d? s: a-- C+++ UL+++ !P L+++ E--- W++ ++
|| gpg: 0x962FD2DE || N++ o? K- w---() O M V? PS+ PE-- Y+ PGP++ t* ||
|| Ge0rG: euIRCnet || 5 X+ R tv b+(+++) DI+(+++) D+ G e* h! r* !y+ ||
++ IRCnet OFTC OPN ||________________________________________________||
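[Georg's arithmetic, as a small Python model; the addresses in the test come straight from his example, and the function names are illustrative:]

```python
def prologue_bytes(address, boundary=16):
    """Bytes to handle with scalar code before `address` reaches the next
    boundary: equivalent to (boundary - address % boundary) % boundary."""
    return (-address) % boundary

def same_remainder(a, b, boundary=16):
    """Precondition for the trick: one scalar prologue can only align both
    operands if they start at the same remainder mod the boundary."""
    return a % boundary == b % boundary
```

[With remainder 3, the prologue is 16 - 3 = 13 bytes, exactly as in the post; an already-aligned address gets a zero-length prologue.]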
Aug 12 2008
next sibling parent reply Don <nospam nospam.com.au> writes:
Georg Lukas wrote:
 On Mon, 11 Aug 2008 09:55:26 -0400, Pete wrote:
 Walter Bright Wrote:
 This one has (finally) got array operations implemented. For those who
 want to show off their leet assembler skills, the initial assembler
 implementation code is in phobos/internal/array*.d. Burton Radons wrote
 the assembler. Can you make it faster?
Not sure if someone else has already mentioned this but would it be possible for the compiler to align these arrays on 16 byte boundaries in order to maximise any possible vector efficiency. AFAIK you can't actually specify align anything higher than align 8 at the moment which is a bit of a problem.
From a short look at the array*.d source code, it would be better to check
if source and destination have the same alignment, i.e.:

a = 0xf00d0013 (3 mod 16)
b = 0xdeaffff3 (3 mod 16)

In that case, the first 16-3 = 13 bytes can be handled using regular D code,
and the aligned SSE version can be used for the rest. This would also work
for slices, at least when both slices have the same alignment remainder. I'm
just not sure what overhead such a solution would impose for small arrays.
Just begin with a check for minimal size. If less than that size, don't use SSE at all.
 
 Georg
Aug 13 2008
parent "Dave" <Dave_member pathlink.com> writes:
"Don" <nospam nospam.com.au> wrote in message 
news:g7u36h$20j0$1 digitalmars.com...
 Georg Lukas wrote:
 On Mon, 11 Aug 2008 09:55:26 -0400, Pete wrote:
 Walter Bright Wrote:
 This one has (finally) got array operations implemented. For those who
 want to show off their leet assembler skills, the initial assembler
 implementation code is in phobos/internal/array*.d. Burton Radons wrote
 the assembler. Can you make it faster?
Not sure if someone else has already mentioned this but would it be possible for the compiler to align these arrays on 16 byte boundaries in order to maximise any possible vector efficiency. AFAIK you can't actually specify align anything higher than align 8 at the moment which is a bit of a problem.
From a short look at the array*.d source code, it would be better to check
if source and destination have the same alignment, i.e.:

a = 0xf00d0013 (3 mod 16)
b = 0xdeaffff3 (3 mod 16)

In that case, the first 16-3 = 13 bytes can be handled using regular D code,
and the aligned SSE version can be used for the rest.
Good idea. Right now in that code there is (usually) a case for both
un/aligned. It typically goes like this:

if(cpu_has_sse2 && a.length > min_size)
{
    if(((cast(size_t)aptr | cast(size_t)bptr | cast(size_t)cptr) & 15) != 0)
    {
        // Unaligned case
        asm
        {
            ...
            movdqu XMM0, [EAX]
            ...
        }
    }
    else
    {
        // Aligned case
        asm
        {
            ...
            movdqa XMM0, [EAX]
            ...
        }
    }
}

The two blocks of asm code are basically identical except for the
un/aligned SSE opcodes. With your idea, one could get rid of the test for
alignment, and probably some bloat and a whole lot of duplication. I guess
the question would be whether the overhead of your idea would be less than
that of the current design.

- Dave
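[The dispatch Dave describes reduces to a couple of integer tests. A Python model of just the branch logic; the threshold value and the path labels are made up for illustration:]

```python
def pick_path(aptr, bptr, cptr, length, min_size=16, cpu_has_sse2=True):
    """Mirror the branch structure sketched above: short arrays take the
    plain loop; otherwise OR the three addresses together and test the
    low four bits to choose the aligned or unaligned SSE2 path."""
    if not (cpu_has_sse2 and length > min_size):
        return "scalar"
    if (aptr | bptr | cptr) & 15:
        return "sse2-unaligned"   # movdqu loads/stores
    return "sse2-aligned"         # movdqa loads/stores
```

[ORing the addresses is the cheap way to ask "are all three 16-byte aligned?" in a single test, since any misaligned operand sets at least one of the low four bits.]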
 This would also work for slices, at least when both slices have the same 
 alignment remainder. I'm just not sure what overhead such a solution 
 would impose for small arrays.
Just begin with a check for minimal size. If less than that size, don't use SSE at all.
 Georg 
Aug 13 2008
prev sibling parent JAnderson <ask me.com> writes:
Georg Lukas wrote:
 On Mon, 11 Aug 2008 09:55:26 -0400, Pete wrote:
 Walter Bright Wrote:
 This one has (finally) got array operations implemented. For those who
 want to show off their leet assembler skills, the initial assembler
 implementation code is in phobos/internal/array*.d. Burton Radons wrote
 the assembler. Can you make it faster?
Not sure if someone else has already mentioned this but would it be possible for the compiler to align these arrays on 16 byte boundaries in order to maximise any possible vector efficiency. AFAIK you can't actually specify align anything higher than align 8 at the moment which is a bit of a problem.
From a short look at the array*.d source code, it would be better to check if source and destination have the same alignment, i.e.: a = 0xf00d0013 (3 mod 16) b = 0xdeaffff3 (3 mod 16) In that case, the first 16-3 = 13 bytes can be handled using regular D code, and the aligned SSE version can be used for the rest. This would also work for slices, at least when both slices have the same alignment remainder. I'm just not sure what overhead such a solution would impose for small arrays.
There would be some overhead for small arrays; however, as I said in my previous email, if you're using a small array then it's likely that you're not doing much. If it is a performance issue, you should switch to a larger array (by grouping all your smaller ones together). Of course, there's the edge case where someone actually needs to do a g-billion operations on exactly the same small array.
 
 Georg
-Joel
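[JAnderson's grouping idea can be made concrete. A list-based Python sketch, purely illustrative:]

```python
def batched_add(pairs):
    """Instead of one elementwise op per small (b, c) array pair, concatenate
    the pairs into one big b[] and c[], do a single big a[] = b[] + c[], and
    split the result back. The per-call dispatch overhead is paid once."""
    big_b = [x for b, _ in pairs for x in b]
    big_c = [x for _, c in pairs for x in c]
    big_a = [x + y for x, y in zip(big_b, big_c)]   # the one big array op
    out, pos = [], 0
    for b, _ in pairs:
        out.append(big_a[pos:pos + len(b)])
        pos += len(b)
    return out
```

[In D terms this corresponds to laying the small arrays out in one contiguous allocation so a single slice operation covers all of them.]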
Aug 13 2008
prev sibling next sibling parent Don <nospam nospam.com.au> writes:
Walter Bright wrote:
 This one has (finally) got array operations implemented. For those who 
 want to show off their leet assembler skills, the initial assembler 
 implementation code is in phobos/internal/array*.d. Burton Radons wrote 
 the assembler. Can you make it faster?
 
 http://www.digitalmars.com/d/1.0/changelog.html
 http://ftp.digitalmars.com/dmd.1.034.zip
 
 http://www.digitalmars.com/d/2.0/changelog.html
 http://ftp.digitalmars.com/dmd.2.018.zip
My tests indicate that array operations also support ^ and ^=, but that's not listed in the spec. Not the first time that D's been better than advertised. <g>
Aug 14 2008
prev sibling parent Pablo Ripolles <in-call gmx.net> writes:
Fantastic!  Thanks!

This is the present already!
http://www.digitalmars.com/d/1.0/future.html
http://www.digitalmars.com/d/2.0/future.html


Walter Bright Wrote:

 This one has (finally) got array operations implemented. For those who 
 want to show off their leet assembler skills, the initial assembler 
 implementation code is in phobos/internal/array*.d. Burton Radons wrote 
 the assembler. Can you make it faster?
 
 http://www.digitalmars.com/d/1.0/changelog.html
 http://ftp.digitalmars.com/dmd.1.034.zip
 
 http://www.digitalmars.com/d/2.0/changelog.html
 http://ftp.digitalmars.com/dmd.2.018.zip
Aug 18 2008