digitalmars.D.announce - DMD 1.034 and 2.018 releases
- Walter Bright (8/8) Aug 08 2008 This one has (finally) got array operations implemented. For those who
- Lars Ivar Igesund (8/15) Aug 08 2008 The array op docs aren't actually on the 1.0 array page. But great! I
- Walter Bright (2/3) Aug 08 2008 Fixed.
- Jarrett Billingsley (3/13) Aug 08 2008 Too bad Tango doesn't support them yet. :C
- Lars Ivar Igesund (8/24) Aug 09 2008 Are you suggesting that Walter should have told us that he was implement...
- Walter Bright (3/22) Aug 09 2008 All Tango needs to do is copy the internal\array*.d files over and add
- Lars Ivar Igesund (7/30) Aug 09 2008 I know :) No malice intended.
- Sean Kelly (6/28) Aug 09 2008 I took care of this when 1.033 was released, since the files were first ...
- dsimcha (4/4) Aug 08 2008 Uhh...Is it just me or did you accidentally repackage 2.017 and call it ...
- Walter Bright (2/6) Aug 08 2008 I don't know what happened there, but I just re-uploaded it.
- Sean Kelly (3/7) Aug 08 2008 Sweet! One of the best updates ever.
- Jonathan Crapuchettes (3/13) Aug 08 2008 This is great! Thanks for adding this Walter.
- Bill Baxter (9/17) Aug 08 2008 That is pretty neat.
- Walter Bright (2/9) Aug 08 2008 Array ops were always supposed to be there.
- Bill Baxter (7/17) Aug 08 2008 Ok, I thought the charter for D1 was no new language features, period.
- Walter Bright (2/2) Aug 08 2008 Now on Reddit!
- bearophile (25/25) Aug 08 2008 Probably I am missing something important, I have tried this code with t...
- bearophile (7/8) Aug 08 2008 I meant:
- Walter Bright (3/13) Aug 08 2008 D already distinguishes operations on the array handle, a, from
- Walter Bright (10/46) Aug 08 2008 Doesn't work for initializers.
- Moritz Warning (7/29) Aug 09 2008 [..]
- bearophile (36/46) Aug 09 2008 Then I have not seen such comment in the docs, if this is absent from th...
- bearophile (90/90) Aug 09 2008 First benchmark, just D against itself, not used GCC yet, the results sh...
- bearophile (5/5) Aug 09 2008 Second version, just a bit cleaner code, less bug-prone, etc:
- bearophile (72/72) Aug 09 2008 C version too:
- dsimcha (53/54) Aug 09 2008 vector ops are generally slower, but maybe there's some bug/problem in m...
- bearophile (4/6) Aug 09 2008 Right. Finding good benchmarks is not easy, and I have shown the code he...
- Don (7/16) Aug 10 2008 Yes. The solution to that is to check for huge array sizes, and use a
- Walter Bright (5/10) Aug 09 2008 I wouldn't be a bit surprised at that since / for int[]s does not have a...
- bearophile (11/16) Aug 09 2008 I didn't know it. We may write a list about such things.
- Walter Bright (5/9) Aug 09 2008 If this happens, then it's worth verifying that the asm code is actually...
- bearophile (5/10) Aug 10 2008 I know. I was talking about the parts of the code that for example adds ...
- Walter Bright (6/18) Aug 10 2008 Not really, it's easier to just copy that particular function out of the
- Walter Bright (27/28) Aug 10 2008 I found the results to be heavily dependent on the data set size:
- Walter Bright (2/3) Aug 09 2008 Please post all bugs to bugzilla! thanks
- Michael P. (3/13) Aug 08 2008 This will probably make me sound like an idiot, but what are these array...
- bearophile (5/6) Aug 08 2008 If Walter gives you one link, you have to follow it before asking :-)
- Michael P. (2/10) Aug 08 2008 Who knows where that could have led to... I was just playing it safe. :D
- JAnderson (4/14) Aug 08 2008 Sweet! I love the way you put this forth as a challenge. Maybe D will
- Walter Bright (2/4) Aug 09 2008 I thought a little competition might bring out the best in people!
- bearophile (5/6) Aug 09 2008 Lot of people today have 2 (or even 4 cores), the order of the computati...
- Christopher Wright (4/8) Aug 09 2008 The overhead of creating a new thread for this would be significant.
- Jarrett Billingsley (4/15) Aug 09 2008 I think we could see a lot more improvement from using vector ops to per...
- JAnderson (12/25) Aug 09 2008 I agree. I think a lot of profiling would be in order to see when
- renoX (12/25) Sep 07 2008 Well for this kind of scheme, you wouldn't start a new set of thread
- Craig Black (5/5) Aug 09 2008 Very exciting stuff! Keep up the good work.
- bearophile (6/7) Aug 09 2008 Currently it optimizes very little, I think. I have posted C and D bench...
- Don (3/13) Aug 10 2008 I intend to contribute some asm routines, but have been working on
- Walter Bright (2/4) Aug 10 2008 Cool!
- Pete (3/13) Aug 11 2008 Not sure if someone else has already mentioned this but would it be poss...
- Walter Bright (2/7) Aug 11 2008 Anything allocated with new will be aligned on 16 byte boundaries.
- Georg Lukas (16/27) Aug 12 2008 From a short look at the array*.d source code, it would be better to
- Don (4/14) Aug 14 2008 My tests indicate that array operations also support ^ and ^=, but
- Pablo Ripolles (5/15) Aug 18 2008 Fantastic! Thanks!
This one has (finally) got array operations implemented. For those who want to show off their leet assembler skills, the initial assembler implementation code is in phobos/internal/array*.d. Burton Radons wrote the assembler. Can you make it faster?

http://www.digitalmars.com/d/1.0/changelog.html
http://ftp.digitalmars.com/dmd.1.034.zip
http://www.digitalmars.com/d/2.0/changelog.html
http://ftp.digitalmars.com/dmd.2.018.zip
Aug 08 2008
Walter Bright wrote:
> This one has (finally) got array operations implemented. For those who want to show off their leet assembler skills, the initial assembler implementation code is in phobos/internal/array*.d. Burton Radons wrote the assembler. Can you make it faster?
> http://www.digitalmars.com/d/1.0/changelog.html
> http://ftp.digitalmars.com/dmd.1.034.zip

The array op docs aren't actually on the 1.0 array page. But great! I remember trying to use these 3 years ago :D

-- 
Lars Ivar Igesund
blog at http://larsivi.net
DSource, #d.tango & #D: larsivi
Dancing the Tango
Aug 08 2008
Lars Ivar Igesund wrote:
> The array op docs aren't actually on the 1.0 array page.

Fixed.
Aug 08 2008
"Lars Ivar Igesund" <larsivar igesund.net> wrote in message news:g7ias2$2kbo$1 digitalmars.com...Walter Bright wrote:Too bad Tango doesn't support them yet. :CThis one has (finally) got array operations implemented. For those who want to show off their leet assembler skills, the initial assembler implementation code is in phobos/internal/array*.d. Burton Radons wrote the assembler. Can you make it faster? http://www.digitalmars.com/d/1.0/changelog.html http://ftp.digitalmars.com/dmd.1.034.zipThe array op docs aren't actually on the 1.0 array page. But great! I remember trying to use these 3 years ago :D
Aug 08 2008
Jarrett Billingsley wrote:
> Too bad Tango doesn't support them yet. :C

Are you suggesting that Walter should have told us that he was implementing this feature ahead of releasing 1.034?

-- 
Lars Ivar Igesund
blog at http://larsivi.net
DSource, #d.tango & #D: larsivi
Dancing the Tango
Aug 09 2008
Lars Ivar Igesund wrote:
> Are you suggesting that Walter should have told us that he was implementing this feature ahead of releasing 1.034?

All Tango needs to do is copy the internal\array*.d files over and add them to the makefile.
Aug 09 2008
Walter Bright wrote:
> All Tango needs to do is copy the internal\array*.d files over and add them to the makefile.

I know :) No malice intended.

-- 
Lars Ivar Igesund
blog at http://larsivi.net
DSource, #d.tango & #D: larsivi
Dancing the Tango
Aug 09 2008
== Quote from Walter Bright (newshound1 digitalmars.com)'s article
> All Tango needs to do is copy the internal\array*.d files over and add them to the makefile.

I took care of this when 1.033 was released, since the files were first included then. There are likely updates in 1.034, but nothing a few minutes with my merge tool can't handle. Sadly, that particular merge tool is a bit broken at the moment, but I'll see about taking care of this anyway.

Sean
Aug 09 2008
Uhh...Is it just me or did you accidentally repackage 2.017 and call it 2.018? I checked all the obvious stuff, cleared my cache and all, and no matter what I do, the date on DMD.exe in the 2.018 zipfile is 7/10/08 and it is identified when I run it as 2.017.
Aug 08 2008
dsimcha wrote:
> Uhh... Is it just me or did you accidentally repackage 2.017 and call it 2.018? I checked all the obvious stuff, cleared my cache and all, and no matter what I do, the date on DMD.exe in the 2.018 zipfile is 7/10/08 and it is identified when I run it as 2.017.

I don't know what happened there, but I just re-uploaded it.
Aug 08 2008
== Quote from Walter Bright (newshound1 digitalmars.com)'s article
> This one has (finally) got array operations implemented. For those who want to show off their leet assembler skills, the initial assembler implementation code is in phobos/internal/array*.d. Burton Radons wrote the assembler. Can you make it faster?

Sweet! One of the best updates ever.

Sean
Aug 08 2008
This is great! Thanks for adding this Walter.

JC

Walter Bright wrote:
> This one has (finally) got array operations implemented. For those who want to show off their leet assembler skills, the initial assembler implementation code is in phobos/internal/array*.d. Burton Radons wrote the assembler. Can you make it faster?
> http://www.digitalmars.com/d/1.0/changelog.html
> http://ftp.digitalmars.com/dmd.1.034.zip
> http://www.digitalmars.com/d/2.0/changelog.html
> http://ftp.digitalmars.com/dmd.2.018.zip
Aug 08 2008
On Sat, Aug 9, 2008 at 5:24 AM, Walter Bright <newshound1 digitalmars.com> wrote:
> This one has (finally) got array operations implemented. For those who want to show off their leet assembler skills, the initial assembler implementation code is in phobos/internal/array*.d. Burton Radons wrote the assembler. Can you make it faster?
> http://www.digitalmars.com/d/1.0/changelog.html
> http://ftp.digitalmars.com/dmd.1.034.zip
> http://www.digitalmars.com/d/2.0/changelog.html
> http://ftp.digitalmars.com/dmd.2.018.zip

That is pretty neat. So does this mean you've reconsidered your position on adding new features to D1.x? Because q{} strings, 1..10 literals, and that enhancement to IFTI used by std.algorithm all sure would be nice. I'd take those over fancy array ops any day.

--bb
Aug 08 2008
Bill Baxter wrote:
> That is pretty neat. So does this mean you've reconsidered your position on adding new features to D1.x? Because q{} strings, 1..10 literals, and that enhancement to IFTI used by std.algorithm all sure would be nice. I'd take those over fancy array ops any day.

Array ops were always supposed to be there.
Aug 08 2008
On Sat, Aug 9, 2008 at 7:30 AM, Walter Bright <newshound1 digitalmars.com> wrote:
> Array ops were always supposed to be there.

Ok, I thought the charter for D1 was no new language features, period. FWIW, I always thought D1 IFTI was supposed to be smarter, so to me adding array ops seems to be similar to porting the fix for 493 to D1. (http://d.puremagic.com/issues/show_bug.cgi?id=493)

--bb
Aug 08 2008
Now on Reddit! http://www.reddit.com/comments/6vjcv/d_programming_language_gets_vector_operations/
Aug 08 2008
Probably I am missing something important. I have tried this code with 1.034 (which compiles my large D libs fine), but I have found many problems:

import std.stdio: putr = writefln;
void main() {
    int[] a1 = [1, 2, 3];
    int[] a2 = [2, 4, 6];
    //putr(a1[] + a2[]); // test.d(6): Error: Array operations not implemented
    auto a3 = a1[] + 4;
    putr(a3); // [1,2,3,0,0,0,0]
    int[] a4 = a1[] + a2[]; // test.d(12): Error: Array operations not implemented
    int[] a5 = [3, 5, 7, 9];
    int[] a6 = a1 + a5; // test.d(16): Error: Array operations not implemented
    int[] a7;
    a7[] = a1[] + a2[];
    putr(a7); // prints: []
    auto a8 = a1 + a2; // test.d(21): Error: Array operations not implemented
    putr(a8);
}

Few more questions/notes:
- I like a syntax as a+b and a[]+4 instead of a[]+b[] and a[]+4, I am used to that from PyLab, etc.
- How does it work (or not work) with jagged/square matrices?
- When possible I'll do a few benchmarks compared to normal D code, C code compiled normally with GCC, and C code automatically vectorized by GCC.
- Is it able to compute a+b+c with a single loop (as all Fortran compilers do)? I presume the answer is negative.
- Hopefully in the future they may support SSE3/SSSE3 too, which my CPU supports.

Bye, and good work,
bearophile
Aug 08 2008
bearophile:
> - I like a syntax as a+b and a[]+4 instead of a[]+b[] and a[]+4,

I meant:

a + b
a + 4

instead of:

a[] + b[]
a[] + 4
Aug 08 2008
bearophile wrote:
> I meant:
> a + b
> a + 4
> instead of:
> a[] + b[]
> a[] + 4

D already distinguishes operations on the array handle, a, from operations on the contents of a, a[]. I think this is a good distinction.
Aug 08 2008
bearophile wrote:
> Probably I am missing something important. I have tried this code with 1.034, but I have found many problems:
>
> import std.stdio: putr = writefln;
> void main() {
>     int[] a1 = [1, 2, 3];
>     int[] a2 = [2, 4, 6];
>     //putr(a1[] + a2[]); // test.d(6): Error: Array operations not implemented

It only works if the top level is an assignment operation.

>     auto a3 = a1[] + 4;
>     putr(a3); // [1,2,3,0,0,0,0]
>     int[] a4 = a1[] + a2[]; // test.d(12): Error: Array operations not implemented

Doesn't work for initializers.

>     int[] a5 = [3, 5, 7, 9];
>     int[] a6 = a1 + a5; // test.d(16): Error: Array operations not implemented

Doesn't work for initializers.

>     int[] a7;
>     a7[] = a1[] + a2[];
>     putr(a7); // prints: []

I don't know what putr is.

>     auto a8 = a1 + a2; // test.d(21): Error: Array operations not implemented

Have to use slice [] operator.

>     putr(a8);
> }
>
> - How does it work (or not work) with jagged/square matrices?

It doesn't.

> - Is it able to compute a+b+c with a single loop (as all Fortran compilers do)?

Yes.

> I presume the answer is negative.
> - Hopefully in the future they may support SSE3/SSSE3 too, which my CPU supports.
>
> Bye, and good work,
> bearophile
Aug 08 2008
On Fri, 08 Aug 2008 22:43:08 -0700, Walter Bright wrote:
> It only works if the top level is an assignment operation.
[..]
> Doesn't work for initializers.
[..]

Looks like there is room for improvement. It does put work on the programmer's nerves when things don't work as expected. :)

Anyway - Good work!
Aug 09 2008
Walter Bright:
> It only works if the top level is an assignment operation.

Then I have not seen such a comment in the docs; if this is absent from the docs, it deserves to be added. And the error message DMD gives can be improved too.

> Doesn't work for initializers.

Both the docs (if not already present) and the error message have to explain this. This output looks like a compiler bug anyway: [1,2,3,0,0,0,0]

> I don't know what putr is.

It's just a shorter alias of writefln.

> Have to use slice [] operator.

I'd like a less wrong error message then.

> > Is it able to compute a+b+c with a single loop (as all Fortran compilers do)?
> Yes.

This is very positive :-)

> D already distinguishes operations on the array handle, a, from operations on the contents of a, a[]. I think this is a good distinction.

I understand and I agree, but the [] makes the code a little less natural to write.

----------------------------

For reference this is the shortened code; it compiles and runs, but the results and error messages are bogus:

import std.stdio: writefln;
void main() {
    int[] a1 = [1, 2, 3];
    int[] a2 = [2, 4, 6];
    auto a3 = a1[] + 4;
    writefln(a3); // prints: [1,2,3,0,0,0,0]
    int[] a7;
    a7[] = a1[] + a2[];
    writefln(a7); // prints: []
    // a7 = a1 + a2; // test2.d(14): Error: Array operations not implemented
}

The last line gives a wrong error message (well, the error messages in the preceding code were all wrong).

-------------------

The following code works, yay! :-)

import std.stdio: writefln;
void main() {
    int[] a1 = [1, 2, 3];
    int[] a2 = [2, 4, 6];
    auto a3 = new int[2];
    a3[] = a1[] + a2[];
    writefln(a3); // prints: [3,6]
}

Later,
bearophile
Aug 09 2008
First benchmark, just D against itself, not used GCC yet, the results show that vector ops are generally slower, but maybe there's some bug/problem in my benchmark (note it needs just Phobos!), not tested on Linux yet:

import std.stdio: put = writef, putr = writefln;
import std.conv: toInt;

version (Win32) {
    import std.c.windows.windows: QueryPerformanceCounter, QueryPerformanceFrequency;

    double clock() {
        long t;
        QueryPerformanceCounter(&t);
        return cast(double)t / queryPerformanceFrequency;
    }

    long queryPerformanceFrequency;

    static this() {
        QueryPerformanceFrequency(&queryPerformanceFrequency);
    }
}

version (linux) {
    import std.c.linux.linux: time;
    double clock() {
        return cast(double)time(null);
    }
}

void main(string[] args) {
    int n = args.length >= 2 ? toInt(args[1]) : 10;
    n *= 8; // to avoid problems with SSE2
    int nloops = args.length >= 3 ? toInt(args[2]) : 1;
    bool use_vec = args.length == 4 ? cast(bool)toInt(args[3]) : true;
    putr("array len= ", n, "  nloops= ", nloops, "  Use vec ops: ", use_vec);

    auto a1 = new int[n]; // void?
    auto a2 = new int[n]; // void?
    auto a3 = new int[n];

    foreach (i, ref el; a1)
        el = i * 7 + 1;
    foreach (i, ref el; a2)
        el = i + 1;

    auto t = clock();

    if (use_vec)
        for (int j = 0; j < nloops; j++)
            a3[] = a1[] / a2[];
    else
        for (int j = 0; j < nloops; j++)
            for (int i; i < a3.length; i++)
                a3[i] = a1[i] / a2[i];

    putr("time= ", clock() - t, " s");

    if (a3.length < 300)
        putr("\nResult:\n", a3);
}

/*
D code with /:

C:\>array_benchmark.exe 10000 10000 0
array len= 80000  nloops= 10000  Use vec ops: false
time= 7.10563 s

C:\>array_benchmark.exe 10000 10000 1
array len= 80000  nloops= 10000  Use vec ops: true
time= 7.222 s

C:\>array_benchmark.exe 12000000 1 0
array len= 96000000  nloops= 1  Use vec ops: false
time= 0.654696 s

C:\>array_benchmark.exe 12000000 1 1
array len= 96000000  nloops= 1  Use vec ops: true
time= 0.655401 s

D code with *:

C:\>array_benchmark.exe 10000 10000 0
array len= 80000  nloops= 10000  Use vec ops: false
time= 7.10615 s

C:\>array_benchmark.exe 10000 10000 1
array len= 80000  nloops= 10000  Use vec ops: true
time= 7.21904 s

C:\>array_benchmark.exe 12000000 1 0
array len= 96000000  nloops= 1  Use vec ops: false
time= 0.65515 s

C:\>array_benchmark.exe 12000000 1 1
array len= 96000000  nloops= 1  Use vec ops: true
time= 0.65566 s

(Note that 0.65566 > 0.65515 isn't due to noise)

D code with +:

C:\>array_benchmark.exe 10000 10000 0
array len= 80000  nloops= 10000  Use vec ops: false
time= 7.10848 s

C:\>array_benchmark.exe 10000 10000 1
array len= 80000  nloops= 10000  Use vec ops: true
time= 7.22527 s

C:\>array_benchmark.exe 12000000 1 0
array len= 96000000  nloops= 1  Use vec ops: false
time= 0.654797 s

C:\>array_benchmark.exe 12000000 1 1
array len= 96000000  nloops= 1  Use vec ops: true
time= 0.654991 s
*/

Bye,
bearophile
Aug 09 2008
Second version, just a bit cleaner code, less bug-prone, etc:
http://codepad.org/BlwSIBKl

Timings on Linux on DMD 2.0 with * as the operation seem much better.

Later,
bearophile
Aug 09 2008
C version too:

#include "stdlib.h"
#include "stdio.h"
#include "time.h"

#define MYOP *
typedef int T;
#define TFORM "%d "

void error(char *string) {
    fprintf(stderr, "ERROR: %s\n", string);
    exit(EXIT_FAILURE);
}

double myclock() {
    clock_t t = clock();
    if (t == -1)
        return 0.0;
    else
        return t / (double)CLOCKS_PER_SEC;
}

int main(int argc, char** argv) {
    int n = argc >= 2 ? atoi(argv[1]) : 10;
    n *= 8; // to avoid problems with SSE2
    int nloops = argc >= 3 ? atoi(argv[2]) : 1;
    printf("array len= %d  nloops= %d\n", n, nloops);

    //__attribute__((aligned(16)))
    T* __restrict a1 = (T*)malloc(sizeof(T) * n + 16);
    T* __restrict a2 = (T*)malloc(sizeof(T) * n + 16);
    T* __restrict a3 = (T*)malloc(sizeof(T) * n + 16);
    if (a1 == NULL || a2 == NULL || a3 == NULL)
        error("memory overflow");

    int i, j;
    for (i = 0; i < n; i++) {
        a1[i] = i * 7 + 1;
        a2[i] = i + 1;
    }

    double t = myclock();

    for (j = 0; j < nloops; j++)
        for (i = 0; i < n; i++) // Alignment of access forced using peeling.
            a3[i] = a1[i] MYOP a2[i];

    printf("time= %f s\n", myclock() - t);

    if (n < 300) {
        printf("\nResult:\n");
        for (i = 0; i < n; i++)
            printf(TFORM, a3[i]);
        putchar('\n');
    }

    return 0;
}

/*
MYOP = *, compiled with:
gcc -Wall -O3 -s benchmark.c -o benchmark

C:\>benchmark 100 3000000
array len= 800  nloops= 3000000
time= 3.656000 s

C:\>benchmark 10000 10000
array len= 80000  nloops= 10000
time= 1.374000 s

C:\>benchmark 12000000 1
array len= 96000000  nloops= 1
time= 0.547000 s

MYOP = *, compiled with:
gcc -Wall -O3 -s -ftree-vectorize -msse3 -ftree-vectorizer-verbose=5 benchmark.c -o benchmark

C:\>benchmark 100 3000000
array len= 800  nloops= 3000000
time= 3.468000 s

C:\>benchmark 10000 10000
array len= 80000  nloops= 10000
time= 1.156000 s

C:\>benchmark 12000000 1
array len= 96000000  nloops= 1
time= 0.531000 s

In the larger array the cache effects may dominate over computing time.
*/
Aug 09 2008
== Quote from bearophile (bearophileHUGS lycos.com)'s article
> First benchmark, just D against itself, not used GCC yet, the results show that vector ops are generally slower, but maybe there's some bug/problem in my benchmark (note it needs just Phobos!), not tested on Linux yet:

I see at least part of the problem. When you use such huge arrays, it ends up being more a test of your memory bandwidth than of the vector ops. Three arrays of 80000 ints comes to a total of about 960k. This is not going to fit in any L1 cache for a long time. Heck, my CPU only has 512k L2 cache per core. Here are my results using smaller arrays designed to fit in my 64k L1 data cache, and the same code as Bearophile.

+ operator:

D:\code>array_benchmark.exe 500 1000000 0
array len= 4000  nloops= 1000000  Use vec ops: false
time= 4.82841 s

D:\code>array_benchmark.exe 500 1000000 1
array len= 4000  nloops= 1000000  Use vec ops: true
time= 2.32902 s

* operator:

D:\code>array_benchmark.exe 500 1000000 0
array len= 4000  nloops= 1000000  Use vec ops: false
time= 6.1556 s

D:\code>array_benchmark.exe 500 1000000 1
array len= 4000  nloops= 1000000  Use vec ops: true
time= 6.16539 s

/ operator:

D:\code>array_benchmark.exe 500 100000 0
array len= 4000  nloops= 100000  Use vec ops: false
time= 7.02435 s

D:\code>array_benchmark.exe 500 100000 1
array len= 4000  nloops= 100000  Use vec ops: true
time= 6.84251 s

BTW, for the sake of comparison, here are my CPU specs from CPU-Z. Also note that I'm running in 32-bit mode.

Number of processors: 1
Number of cores: 2 per processor
Number of threads: 2 (max 2) per processor
Name: AMD Athlon 64 X2 3600+
Code Name: Brisbane
Specification: AMD Athlon(tm) 64 X2 Dual Core Processor 3600+
Package: Socket AM2 (940)
Family/Model/Stepping: F.B.1
Extended Family/Model: F.6B
Brand ID: 4
Core Stepping: BH-G1
Technology: 65 nm
Core Speed: 2698.1 MHz
Multiplier x Bus speed: 9.5 x 284.0 MHz
HT Link speed: 852.0 MHz
Stock frequency: 1900 MHz
Instruction sets: MMX (+), 3DNow! (+), SSE, SSE2, SSE3, x86-64
L1 Data cache (per processor): 2 x 64 KBytes, 2-way set associative, 64-byte line size
L1 Instruction cache (per processor): 2 x 64 KBytes, 2-way set associative, 64-byte line size
L2 cache (per processor): 2 x 512 KBytes, 16-way set associative, 64-byte line size
Aug 09 2008
dsimcha:
> I see at least part of the problem. When you use such huge arrays, it ends up being more a test of your memory bandwidth than of the vector ops.

Right. Finding good benchmarks is not easy, and I have shown the code here for people to spot problems in it. I have added a C version too now.

Bye,
bearophile
Aug 09 2008
dsimcha wrote:
> I see at least part of the problem. When you use such huge arrays, it ends up being more a test of your memory bandwidth than of the vector ops. Three arrays of 80000 ints comes to a total of about 960k. This is not going to fit in any L1 cache for a long time.

Yes. The solution to that is to check for huge array sizes, and use a different routine (using prefetching) in that case. Actually, the most important routine to be doing that is memcpy / array slice assignment, but I'm not sure it does. I think it just does a movsd. So I think this is still a useful case to benchmark; it's not the most important one, though.
Aug 10 2008
bearophile wrote:
> First benchmark, just D against itself, not used GCC yet, the results show that vector ops are generally slower, but maybe there's some bug/problem in my benchmark (note it needs just Phobos!), not tested on Linux yet:
[...]
> a3[] = a1[] / a2[];

I wouldn't be a bit surprised at that since / for int[]s does not have a custom asm routine for it. See phobos/internal/arrayint.d.

If someone wants to write one, I'll put it in!
Aug 09 2008
Walter Bright:
> I wouldn't be a bit surprised at that since / for int[]s does not have a custom asm routine for it.

I didn't know it. We may write a list about such things. But as you can see I have performed benchmarks with + * /, not just /. It's very easy to write wrong benchmarks, so I am careful, but from the little I have seen so far the speed improvements are absent or less than 1 (slow down). And I haven't yet seen SSE2 asm in my compiled programs :-)

> > Is it able to compute a+b+c with a single loop (as all Fortran compilers do)?
> Yes.

But later on Reddit the answer by Walter was:

> This optimization is called "loop fusion", and is well known. It doesn't always result in a speedup, though. The dmd compiler doesn't do it, but that is not the fault of D.

At a closer look the two questions are different; I think he meant:

a += b + c;  =>  single loop

a += b;
a += c;  =>  two loops

I think this is acceptable.

Bye,
bearophile
Aug 09 2008
bearophile wrote:
> It's very easy to write wrong benchmarks, so I am careful, but from the little I have seen so far the speed improvements are absent or less than 1 (slow down).

If this happens, then it's worth verifying that the asm code is actually being run by inserting a printf in it.

> And I haven't yet seen SSE2 asm in my compiled programs :-)

The dmd compiler doesn't generate SSE2 instructions. But the routines in internal\array*.d do.
Aug 09 2008
Walter Bright:
> If this happens, then it's worth verifying that the asm code is actually being run by inserting a printf in it.

I presume I'll have to recompile Phobos for that.

> The dmd compiler doesn't generate SSE2 instructions. But the routines in internal\array*.d do.

I know. I was talking about the parts of the code that, for example, add the arrays; according to the Phobos source code they use SSE2, but in the final code produced they are absent.

Bye,
bearophile
Aug 10 2008
bearophile wrote:
> I presume I'll have to recompile Phobos for that.

Not really, it's easier to just copy that particular function out of the library and paste it into your test module; that way it's easier to experiment with.

> I know. I was talking about the parts of the code that, for example, add the arrays; according to the Phobos source code they use SSE2, but in the final code produced they are absent.

I don't know what you mean. The SSE2 instructions are in internal/arrayint.d, and they do get compiled in.
Aug 10 2008
"Walter Bright" <newshound1 digitalmars.com> wrote in message news:g7na5s$qg0$1 digitalmars.com...bearophile wrote:The SSE2 is being used, but what would be nice would be the same code that Burton used for his benchmarks. Is that available? Thanks, - Dave import std.stdio, std.date, std.conv; void main(string[] args) { if(args.length < 3) { writefln("usage: ",args[0]," <array size> <iterations>"); return; } auto ASIZE = toInt(args[1]); auto ITERS = toInt(args[2]); writefln("Array Size = ",ASIZE,", Iterations = ",ITERS); int[] ia, ib, ic; ia = new int[ASIZE]; ib = new int[ASIZE]; ic = new int[ASIZE]; ib[] = ic[] = 10; double[] da, db, dc; da = new double[ASIZE]; db = new double[ASIZE]; dc = new double[ASIZE]; db[] = dc[] = 10.0; { ia[] = 0; int sum = 0; d_time s = getUTCtime(); for(size_t i = 0; i < ITERS; i++) { sum += aops!(int)(ia,ib,ic); } d_time e = getUTCtime(); writefln("intaops: ",(e - s) / 1000.0," secs, sum = ",sum); } { ia[] = 0; int sum = 0; d_time s = getUTCtime(); for(size_t i = 0; i < ITERS; i++) { sum += loop!(int)(ia,ib,ic); } d_time e = getUTCtime(); writefln("intloop: ",(e - s) / 1000.0," secs, sum = ",sum); } { da[] = 0.0; double sum = 0.0; d_time s = getUTCtime(); for(size_t i = 0; i < ITERS; i++) { sum += aops!(double)(da,db,dc); } d_time e = getUTCtime(); writefln("dfpaops: ",(e - s) / 1000.0," secs, sum = ",sum); } { da[] = 0.0; double sum = 0.0; d_time s = getUTCtime(); for(size_t i = 0; i < ITERS; i++) { sum += loop!(double)(da,db,dc); } d_time e = getUTCtime(); writefln("dfploop: ",(e - s) / 1000.0," secs, sum = ",sum); } } T aops(T)(T[] a, T[] b, T[] c) { a[] = b[] + c[]; return a[$-1]; } T loop(T)(T[] a, T[] b, T[] c) { foreach(i, inout val; a) val = b[i] + c[i]; return a[$-1]; } C:\Zz>dmd -O -inline -release top.d C:\Zz>top 4000 100000 Array Size = 4000, Iterations = 100000 intaops: 0.204 secs, sum = 2000000 intloop: 0.515 secs, sum = 2000000 dfpaops: 0.625 secs, sum = 2e+06 dfploop: 0.563 secs, sum = 2e+06Walter Bright:Not really, it's 
easier to just copy that particular function out of the library and paste it into your test module, that way it's easier to experiment with.If this happens, then it's worth verifying that the asm code is actually being run by inserting a printf in it.I presume I'll have to recompile Phobos for that.I don't know what you mean. The SSE2 instructions are in internal/arrayint.d, and they do get compiled in.I know. I was talking about the parts of the code that for example adds the arrays; according to the phobos source code they use SSE2 but in the final source code produces they are absent.And I haven't seen yet SS2 asm in my compiled programs :-)The dmd compiler doesn't generate SS2 instructions. But the routines in internal\array*.d do.
Aug 11 2008
"Dave" <Dave_member pathlink.com> wrote in message 
news:g7qr3h$2l6$1 digitalmars.com...
> <snip benchmark code and earlier SSE2 discussion>

Before:

C:\Zz>top 4000 100000
Array Size = 4000, Iterations = 100000
intaops: 0.204 secs, sum = 2000000
intloop: 0.515 secs, sum = 2000000
dfpaops: 0.625 secs, sum = 2e+06
dfploop: 0.563 secs, sum = 2e+06

After adding an aligned case for _arraySliceSliceAddSliceAssign_d:

C:\Zz>top 4000 100000
Array Size = 4000, Iterations = 100000
intaops: 0.212 secs, sum = 2000000
intloop: 0.525 secs, sum = 2000000
dfpaops: 0.438 secs, sum = 2e+06
dfploop: 0.557 secs, sum = 2e+06

;--- SiSoftware Sandra

Processor
Model : Intel(R) Core(TM)2 CPU 6700  2.66GHz

Processor Cache(s)
Internal Data Cache : 32kB, Synchronous, Write-Thru, 8-way set, 64 byte line size
Internal Instruction Cache : 32kB, Synchronous, Write-Back, 8-way set, 64 byte line size
L2 On-board Cache : 4MB, ECC, Synchronous, ATC, 16-way set, 64 byte line size, 2 threads sharing
L2 Cache Multiplier : 1/1x (2667MHz)
Aug 11 2008
bearophile wrote:
> D code with +:

I found the results to be heavily dependent on the data set size:

C:\mars>test5 1000 10000
array len= 8000 nloops= 10000
vec time= 0.0926506 s
non-vec time= 0.626356 s

C:\mars>test5 2000 10000
array len= 16000 nloops= 10000
vec time= 0.279727 s
non-vec time= 1.70048 s

C:\mars>test5 3000 10000
array len= 24000 nloops= 10000
vec time= 0.795482 s
non-vec time= 2.47597 s

C:\mars>test5 4000 10000
array len= 32000 nloops= 10000
vec time= 2.36905 s
non-vec time= 3.90906 s

C:\mars>test5 5000 10000
array len= 40000 nloops= 10000
vec time= 3.12636 s
non-vec time= 3.70741 s

For smaller sets, it's a 2x speedup; for larger ones, only a few percent. What we're seeing here is most likely the effect of the data set size exceeding the cache. It would be a fun project for someone to see if the performance for such large data sets could somehow be improved, perhaps by "warming up" the cache?
Aug 10 2008
bearophile wrote:
> This output looks like a bug of the compiler anyway:
> [1,2,3,0,0,0,0]

Please post all bugs to bugzilla! Thanks.
Aug 09 2008
Walter Bright Wrote:
> This one has (finally) got array operations implemented. For those who 
> want to show off their leet assembler skills, the initial assembler 
> implementation code is in phobos/internal/array*.d. Burton Radons wrote 
> the assembler. Can you make it faster?
>
> http://www.digitalmars.com/d/1.0/changelog.html
> http://ftp.digitalmars.com/dmd.1.034.zip
> http://www.digitalmars.com/d/2.0/changelog.html
> http://ftp.digitalmars.com/dmd.2.018.zip

This will probably make me sound like an idiot, but what are these array operations everyone's so stoked about? I've only been learning D for a week and a half, fill me in! BTW, nice update!
Aug 08 2008
Michael P.:
> This will probably make me sound like an idiot, but what are these array 
> operations everyone's so stoked about? I've only been learning D for a 
> week and a half, fill me in!

If Walter gives you one link, you have to follow it before asking :-)
http://www.digitalmars.com/d/1.0/arrays.html#array-operations

Bye,
bearophile
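For anyone else new to this, here is a minimal sketch of what that page describes (D1 syntax; the variable names and values are made up for illustration):

```d
import std.stdio;

void main()
{
    int[] a = new int[4];
    int[] b = new int[4];
    int[] c = new int[4];
    b[] = 5;          // fill every element of b with 5
    c[] = 2;          // fill every element of c with 2
    a[] = b[] + c[];  // element-wise add: a becomes [7, 7, 7, 7]
    a[] *= 2;         // element-wise scale: a becomes [14, 14, 14, 14]
    writefln(a);
}
```

The point of the release is that such slice expressions are now compiled down to the (optionally SSE2-accelerated) routines in phobos/internal/array*.d instead of being rejected by the compiler.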
Aug 08 2008
bearophile Wrote:
> Michael P.:
>> This will probably make me sound like an idiot, but what are these 
>> array operations everyone's so stoked about? I've only been learning D 
>> for a week and a half, fill me in!
>
> If Walter gives you one link, you have to follow it before asking :-)
> http://www.digitalmars.com/d/1.0/arrays.html#array-operations

Who knows where that could have led to... I was just playing it safe. :D
Aug 08 2008
Walter Bright wrote:
> This one has (finally) got array operations implemented. For those who 
> want to show off their leet assembler skills, the initial assembler 
> implementation code is in phobos/internal/array*.d. Burton Radons wrote 
> the assembler. Can you make it faster?
>
> http://www.digitalmars.com/d/1.0/changelog.html
> http://ftp.digitalmars.com/dmd.1.034.zip
> http://www.digitalmars.com/d/2.0/changelog.html
> http://ftp.digitalmars.com/dmd.2.018.zip

Sweet! I love the way you put this forth as a challenge. Maybe D will have the world's fastest array operations :)

-Joel
Aug 08 2008
JAnderson wrote:Sweet! I love the way you put this forth as a challenge. Maybe D will have the worlds fastest array operations :)I thought a little competition might bring out the best in people!
Aug 09 2008
Walter Bright:
> Can you make it faster?

Lots of people today have 2 (or even 4) cores, and the order of computation of those ops is arbitrary, so a major (nearly linear, hopefully) speedup will probably come as soon as all the cores are used. This job splitting is probably an advantage even when the ops aren't computed by asm code.

I've taken a look at my code, and so far I don't see many spots where the array operations (once they actually give some speedup) can be useful (there are many other things I can find much more useful than such ops, see my wish lists). But if the array ops are useful for enough people, then it may be worth burning some programming time to make those array ops use all of the 2-4+ cores.

Bye,
bearophile
Aug 09 2008
bearophile wrote:
> Walter Bright:
>> Can you make it faster?
> Lots of people today have 2 (or even 4) cores, the order of computation 
> of those ops is arbitrary, so a major (nearly linear, hopefully) speedup 
> will probably come as soon as all the cores are used. This job splitting 
> is probably an advantage even when the ops aren't computed by asm code.

The overhead of creating a new thread for this would be significant. You'd probably be better off using a regular loop for arrays that are not huge.
Aug 09 2008
"Christopher Wright" <dhasenan gmail.com> wrote in message news:g7ljal$2i84$1 digitalmars.com...bearophile wrote:I think we could see a lot more improvement from using vector ops to perform SIMD operations. They are just begging for it.Walter Bright:The overhead of creating a new thread for this would be significant. You'd probably be better off using a regular loop for arrays that are not huge.Can you make it faster?Lot of people today have 2 (or even 4 cores), the order of the computation of those ops is arbitrary, so a major (nearly linear, hopefully) speedup will probably come as soon all the cores are used. This job splitting is probably an advantage even then the ops aren't computed by asm code.
Aug 09 2008
Christopher Wright wrote:
> The overhead of creating a new thread for this would be significant. 
> You'd probably be better off using a regular loop for arrays that are 
> not huge.

I agree. I think a lot of profiling would be in order to see when certain things become an advantage to use. Then use a branch to jump to the best algorithm for the particular case (platform + length of array). Hopefully the compiler could inline the algorithm so that constant-sized arrays don't pay for the additional overhead.

There would be a small cost for the extra branch for small dynamic arrays. Ideally one could argue that if this becomes a performance bottleneck then the program is doing a lot of operations on lots of small arrays. The user could change the design to group their small arrays into a larger array to get the performance they desire.

-Joel
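The branch-to-best-algorithm idea can be sketched as a hypothetical wrapper like the one below; the function name and the threshold of 16 elements are made-up placeholders that profiling would have to tune per platform:

```d
// Hypothetical dispatch: plain loop for tiny arrays, vector op otherwise.
void addArrays(int[] a, int[] b, int[] c)
{
    if (a.length < 16)           // placeholder threshold, not profiled
    {
        foreach (i, inout v; a)  // tiny input: skip any SSE setup overhead
            v = b[i] + c[i];
    }
    else
    {
        a[] = b[] + c[];         // let the runtime's SSE path handle it
    }
}
```

Note the runtime routines in internal/array*.d already contain a similar length check internally, so a wrapper like this mainly buys something if the compiler can inline the small-array branch for statically known lengths.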
Aug 09 2008
Christopher Wright wrote:
> bearophile wrote:
>> Lots of people today have 2 (or even 4) cores, the order of computation 
>> of those ops is arbitrary, so a major (nearly linear, hopefully) 
>> speedup will probably come as soon as all the cores are used. This job 
>> splitting is probably an advantage even when the ops aren't computed by 
>> asm code.
> The overhead of creating a new thread for this would be significant.

Well, for this kind of scheme you wouldn't start a new set of threads each time! Just start a set of worker threads (one per CPU, pinned to each CPU) at program startup; they do nothing until they are woken up when there is an operation which can be accelerated through parallelism.

> You'd probably be better off using a regular loop for arrays that are 
> not huge.

Sure, even with pre-created threads, using several CPUs incurs additional startup and teardown cost, so this would be worthwhile only with loops 'big enough'. A pitfall also is to ensure that two CPUs don't write to the same cache line, otherwise this 'false sharing' will reduce the performance.

renoX
Sep 07 2008
Very exciting stuff! Keep up the good work. Currently it only optimizes int and float. I assume you could get it working for double pretty easily as well. Is it extensible to user defined types like a Vector3 class? -Craig
Aug 09 2008
Craig Black:Currently it only optimizes int and float.Currently it optimizes very little, I think. I have posted C and D benchmarks: http://codepad.org/BlwSIBKl http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.D.announce&article_id=12718 Bye, bearophile
Aug 09 2008
Walter Bright wrote:
> This one has (finally) got array operations implemented. For those who 
> want to show off their leet assembler skills, the initial assembler 
> implementation code is in phobos/internal/array*.d. Burton Radons wrote 
> the assembler. Can you make it faster?

I intend to contribute some asm routines, but have been working on bigint operations (both Tango and Phobos) for the past couple of weeks.
Aug 10 2008
Don wrote:I intend to contribute some asm routines, but have been working on bigint operations (both Tango and Phobos) for the past couple of weeks.Cool!
Aug 10 2008
Walter Bright Wrote:
> This one has (finally) got array operations implemented. For those who 
> want to show off their leet assembler skills, the initial assembler 
> implementation code is in phobos/internal/array*.d. Burton Radons wrote 
> the assembler. Can you make it faster?

Not sure if someone else has already mentioned this, but would it be possible for the compiler to align these arrays on 16 byte boundaries in order to maximise any possible vector efficiency? AFAIK you can't actually specify anything higher than align 8 at the moment, which is a bit of a problem.

Regards,
Aug 11 2008
Pete wrote:Not sure if someone else has already mentioned this but would it be possible for the compiler to align these arrays on 16 byte boundaries in order to maximise any possible vector efficiency. AFAIK you can't actually specify align anything higher than align 8 at the moment which is a bit of a problem.Anything allocated with new will be aligned on 16 byte boundaries.
Aug 11 2008
On Mon, 11 Aug 2008 09:55:26 -0400, Pete wrote:
> Not sure if someone else has already mentioned this, but would it be 
> possible for the compiler to align these arrays on 16 byte boundaries in 
> order to maximise any possible vector efficiency? AFAIK you can't 
> actually specify anything higher than align 8 at the moment, which is a 
> bit of a problem.

From a short look at the array*.d source code, it would be better to check if source and destination have the same alignment, i.e.:

a = 0xf00d0013 (3 mod 16)
b = 0xdeaffff3 (3 mod 16)

In that case, the first 16-3 = 13 bytes can be handled using regular D code, and the aligned SSE version can be used for the rest. This would also work for slices, at least when both slices have the same alignment remainder. I'm just not sure what overhead such a solution would impose for small arrays.

Georg
Aug 12 2008
Georg Lukas wrote:On Mon, 11 Aug 2008 09:55:26 -0400, Pete wrote:Just begin with a check for minimal size. If less than that size, don't use SSE at all.Walter Bright Wrote:From a short look at the array*.d source code, it would be better to check if source and destination have the same alignment, i.e.: a = 0xf00d0013 (3 mod 16) b = 0xdeaffff3 (3 mod 16) In that case, the first 16-3 = 13 bytes can be handled using regular D code, and the aligned SSE version can be used for the rest. This would also work for slices, at least when both slices have the same alignment remainder. I'm just not sure what overhead such a solution would impose for small arrays.This one has (finally) got array operations implemented. For those who want to show off their leet assembler skills, the initial assembler implementation code is in phobos/internal/array*.d. Burton Radons wrote the assembler. Can you make it faster?Not sure if someone else has already mentioned this but would it be possible for the compiler to align these arrays on 16 byte boundaries in order to maximise any possible vector efficiency. AFAIK you can't actually specify align anything higher than align 8 at the moment which is a bit of a problem.Georg
Aug 13 2008
"Don" <nospam nospam.com.au> wrote in message news:g7u36h$20j0$1 digitalmars.com...Georg Lukas wrote:Good idea. Right now in that code there is (usually) a case for both un/aligned. It typically goes like this: if(cpu_has_sse2 && a.length > min_size) { if(((cast(size_t) aptr | cast(size_t)bptr | cast(size_t)cptr) & 15) != 0) { // Unaligned case asm { ... movdqu XMM0, [EAX] ... } } else { // Aligned case asm { ... movdqa XMM0, [EAX] ... } } } The two blocks of asm code is basically identical except for the un/aligned SSE opcodes. With your idea, one could get rid of the test for alignment, probably some bloat and a whole lot of duplication. I guess the question would be if the overhead of your idea would be less than the current design. - DaveOn Mon, 11 Aug 2008 09:55:26 -0400, Pete wrote:Walter Bright Wrote:From a short look at the array*.d source code, it would be better to check if source and destination have the same alignment, i.e.: a = 0xf00d0013 (3 mod 16) b = 0xdeaffff3 (3 mod 16) In that case, the first 16-3 = 13 bytes can be handled using regular D code, and the aligned SSE version can be used for the rest.This one has (finally) got array operations implemented. For those who want to show off their leet assembler skills, the initial assembler implementation code is in phobos/internal/array*.d. Burton Radons wrote the assembler. Can you make it faster?Not sure if someone else has already mentioned this but would it be possible for the compiler to align these arrays on 16 byte boundaries in order to maximise any possible vector efficiency. AFAIK you can't actually specify align anything higher than align 8 at the moment which is a bit of a problem.This would also work for slices, at least when both slices have the same alignment remainder. I'm just not sure what overhead such a solution would impose for small arrays.Just begin with a check for minimal size. If less than that size, don't use SSE at all.Georg
Aug 13 2008
Georg Lukas wrote:
> This would also work for slices, at least when both slices have the same 
> alignment remainder. I'm just not sure what overhead such a solution 
> would impose for small arrays.

There would be some overhead for small arrays; however, as I said in my previous email, if you're using a small array then it's likely that you're not doing much. If it is a performance issue you should switch to a larger array (by grouping all your smaller ones together). Of course there's the edge case where someone actually needs to do a g-billion operations on exactly the same small array.

-Joel
Aug 13 2008
Walter Bright wrote:
> This one has (finally) got array operations implemented. For those who 
> want to show off their leet assembler skills, the initial assembler 
> implementation code is in phobos/internal/array*.d. Burton Radons wrote 
> the assembler. Can you make it faster?
>
> http://www.digitalmars.com/d/1.0/changelog.html
> http://ftp.digitalmars.com/dmd.1.034.zip
> http://www.digitalmars.com/d/2.0/changelog.html
> http://ftp.digitalmars.com/dmd.2.018.zip

My tests indicate that array operations also support ^ and ^=, but that's not listed in the spec. Not the first time that D's been better than advertised. <g>
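Don's observation, in code form (a small sketch; the variable names and bit patterns are made up for illustration):

```d
void main()
{
    int[] a = new int[4];
    int[] b = new int[4];
    int[] c = new int[4];
    b[] = 0b1100;       // fill b with the bit pattern 12
    c[] = 0b1010;       // fill c with the bit pattern 10
    a[] = b[] ^ c[];    // element-wise xor: each element becomes 0b0110
    a[] ^= c[];         // in-place xor: each element is back to 0b1100
}
```

Since xor is its own inverse, applying `^= c[]` twice restores the original values, which makes this an easy operator to sanity-check even though the spec page doesn't list it.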
Aug 14 2008
Fantastic! Thanks! This is the present already! http://www.digitalmars.com/d/1.0/future.html http://www.digitalmars.com/d/2.0/future.html Walter Bright Wrote:This one has (finally) got array operations implemented. For those who want to show off their leet assembler skills, the initial assembler implementation code is in phobos/internal/array*.d. Burton Radons wrote the assembler. Can you make it faster? http://www.digitalmars.com/d/1.0/changelog.html http://ftp.digitalmars.com/dmd.1.034.zip http://www.digitalmars.com/d/2.0/changelog.html http://ftp.digitalmars.com/dmd.2.018.zip
Aug 18 2008