D - Optimising
- nicO (18/18) Aug 17 2001 I'm from the fcpu project (f-cpu.org), the goal is to create a "free
- Walter (3/21) Aug 17 2001 Since D supports rectangular arrays, much more aggressive array optimiza...
- LuigiG (7/33) Aug 18 2001 BTW, Microsoft's compiler has a
- Walter (4/8) Aug 18 2001 I decided long ago not to support such a switch. It would really only be
- LuigiG (5/13) Aug 19 2001 Yep,
- nicO (3/31) Aug 18 2001 Is that enough to introduice strip mining ? Or to change the order of
- Walter (3/10) Aug 18 2001 I think so.
- nicO (2/15) Aug 19 2001 Good news !
- nicO (2/15) Aug 19 2001 And what about vector computing ? (for the use of MMX and SSE)
- Walter (2/3) Aug 19 2001 I don't know.
- nicO (4/9) Aug 19 2001 To work better with current CPU, any compiler should use SIMD
- Sean L. Palmer (24/33) Oct 29 2001 I think the key concept to take advantage of SSE and MMX and 3DNow etc S...
- Walter (10/45) Jan 09 2002 You're right. -Walter
- Sean L. Palmer (80/131) Jan 10 2002 If D had good native support for SIMD programming, that would seem to be...
- Walter (16/89) Jan 10 2002 Actually, I think D does with its array operation semantics. They aren't
- Pavel Minayev (3/5) Jan 10 2002 But there is no dot and cross product for vectors.
- Walter (3/8) Jan 10 2002 True.
- Pavel Minayev (4/5) Jan 10 2002 Then why not add them? As well as matrix multiply? And optimize
- Walter (3/8) Jan 11 2002 Because I need to finish what I've got first!
- Robert W. Cunningham (6/16) Jan 12 2002 I nominate Walter to be the first human to be cloned...
- Pavel Minayev (4/7) Jan 12 2002 Yes, but who will hold the copyright?
- Pavel Minayev (3/4) Jan 12 2002 I don't mean "now". I mean "in general" =)
I'm from the F-CPU project (f-cpu.org); the goal is to create a "free CPU". One of our main problems is the compiler. I wrote some algorithms in C to improve matrix multiplication (using gcc on a P3). I used loop unrolling and software pipelining, broke a lot of false dependencies, and removed calculations from the inner loop. I reordered the loops to align data accesses and did strip mining. Most of this is processor dependent and should be done by the compiler. I gained an average of 4x compared to the usual algorithm; my best gain was on a 1.2 GHz Athlon processor with a 512*512 matrix: x25! One of the main problems is the aliasing problem induced by pointers. In my tests, I lost 25% of the performance by using pointers instead of arrays. So what will be done in D to improve the use of new instructions? All new CPUs have conditional moves (to avoid emptying the pipeline) and vector instructions (like MMX and SSE). We could also add compiling for 2 or more processors. What do you think?

nicO
Aug 17 2001
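The loop transformations nicO describes above (loop reordering, strip mining, hoisting work out of the inner loop) can be sketched in plain C. This is a minimal illustration, not the F-CPU project's actual code; the function names, the matrix size `N`, and the block size `BS` are assumptions chosen for the example and would need tuning per processor.

```c
#include <stddef.h>

#define N  128   /* matrix dimension (assumption for illustration) */
#define BS 32    /* strip-mining block size; tune to the cache, must divide N */

/* Naive triple loop: the inner loop walks B column-wise, so nearly
 * every access to B misses the cache for large N. */
void matmul_naive(const double A[N][N], const double B[N][N], double C[N][N])
{
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < N; k++)
                sum += A[i][k] * B[k][j];
            C[i][j] = sum;
        }
}

/* Strip-mined (blocked) version: the loops are reordered to i/k/j and
 * split into BS-sized strips, so a block of B stays in cache and the
 * inner loop is a unit-stride traversal.  A[i][k] is hoisted out of
 * the inner loop, removing a redundant load per iteration. */
void matmul_blocked(const double A[N][N], const double B[N][N], double C[N][N])
{
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            C[i][j] = 0.0;

    for (size_t ii = 0; ii < N; ii += BS)
        for (size_t kk = 0; kk < N; kk += BS)
            for (size_t i = ii; i < ii + BS; i++)
                for (size_t k = kk; k < kk + BS; k++) {
                    double a = A[i][k];   /* hoisted out of the j loop */
                    for (size_t j = 0; j < N; j++)
                        C[i][j] += a * B[k][j];
                }
}
```

Both versions compute the same product; only the memory access order differs, which is exactly why this kind of rewrite belongs in the compiler rather than the source.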
Since D supports rectangular arrays, much more aggressive array optimization becomes possible.

nicO wrote in message <3B7DFFFC.7FB87E4F ifrance.com>...
> I'm from the F-CPU project (f-cpu.org); the goal is to create a
> "free CPU". [...]
Aug 17 2001
BTW, Microsoft's compiler has a "/Oa assume no aliasing" switch. Kind of playing with fire, I assume ;)

"Walter" <walter digitalmars.com> wrote in message news:9lkk0g$2qdq$2 digitaldaemon.com...
> Since D supports rectangular arrays, much more aggressive array
> optimization becomes possible. [...]
Aug 18 2001
LuigiG wrote in message <9ll862$5ch$1 digitaldaemon.com>...
> BTW, Microsoft's compiler has a "/Oa assume no aliasing" switch.
> Kind of playing with fire, I assume ;)

I decided long ago not to support such a switch. It would really only be useful to someone who understood exactly how the internal compiler optimizations really worked, which is likely nobody <g>.
Aug 18 2001
"Walter" <walter digitalmars.com> wrote in message news:9lmlj0$111a$3 digitaldaemon.com...
> I decided long ago not to support such a switch. It would really only be
> useful to someone who understood exactly how the internal compiler
> optimizations really worked, which is likely nobody <g>.

Yep, basically, if you know so much about the compiler that you can use the noalias switch, you don't need the noalias switch anymore.
Aug 19 2001
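A per-declaration alternative to a global "assume no aliasing" switch is C99's `restrict` qualifier, which lets the programmer make the no-alias promise only where it is known to hold, instead of for the whole translation unit. A minimal sketch; the function name and shape are invented for illustration:

```c
#include <stddef.h>

/* "restrict" promises the compiler that, for the lifetime of these
 * pointers, dst and src never refer to overlapping memory.  The
 * compiler may then keep src values in registers and reorder or
 * vectorize loads and stores freely, which it cannot safely do for
 * plain pointer parameters. */
void scale_add(size_t n, double * restrict dst,
               const double * restrict src, double k)
{
    for (size_t i = 0; i < n; i++)
        dst[i] += k * src[i];
}
```

Unlike a compiler-wide switch, a wrong `restrict` annotation only breaks the one function it appears on, and the promise is visible in the code under review.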
Walter wrote:
> Since D supports rectangular arrays, much more aggressive array
> optimization becomes possible.

Is that enough to introduce strip mining? Or to change the order of the loops?
Aug 18 2001
nicO wrote in message <3B7EB156.6D6761E6 ifrance.com>...
> Is that enough to introduce strip mining? Or to change the order of
> the loops?

I think so.
Aug 18 2001
Walter wrote:
> I think so.

Good news!
Aug 19 2001
Walter wrote:
> I think so.

And what about vector computing? (For the use of MMX and SSE.)
Aug 19 2001
nicO wrote in message <3B801890.5A7216C7 ifrance.com>...
> And what about vector computing? (For the use of MMX and SSE.)

I don't know.
Aug 19 2001
Walter wrote:
> I don't know.

To work better with current CPUs, any compiler should use SIMD instructions. The difference in performance is great.

nicO
Aug 19 2001
I think the key concept to take advantage of SSE, MMX, 3DNow, etc. SIMD instruction sets is that the language needs to recognize the data types manipulated by these instruction sets (usually short, power-of-two, constant-sized vectors of ints or floats of various sizes) and have the language provide operators to work with these "extra" basic types: vector add, subtract, vector componentwise multiply, vector-by-scalar multiply, square root of one or all components, reciprocal square root, dot product, and matrix multiply (SSE/MMX don't provide much help for the last two). Conditional testing (for every component in A, if less than the corresponding component in B, set the result component to 1, else set the result component to 0). Bitwise ops for integer SIMD. Component swizzling/shuffling/rearrangement.

If you're keen on understanding how SIMD programming works, download the Microsoft Processor Pack for MSVC 6 and install it... it provides new __m64 and __m128 data types and a bunch of new intrinsic C functions that work on these types, as well as __declspec(align(16)), so you have a prayer of getting the compiler to help you meet these types' heavy memory alignment restrictions.

Perhaps if a few operators were added, constant-sized arrays of basic int and float types could automatically take advantage of these kinds of instruction sets: operators that work on two arrays and produce an array as the result. This really ties into the operator overloading discussion.

Sean

"nicO" <nicolas.boulay ifrance.com> wrote in message news:3B807A0C.33576068 ifrance.com...
> To work better with current CPUs, any compiler should use SIMD
> instructions. The difference in performance is great.
Oct 29 2001
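The four-wide type and componentwise operators Sean describes can be sketched in portable C; each loop below corresponds to a single SSE instruction that a compiler (or, on MSVC 6 with the Processor Pack, the __m128 intrinsics) would emit for the whole vector. The names vec4, v4_add, and so on are invented for illustration, not from any real API.

```c
typedef struct { float v[4]; } vec4;

/* Componentwise add: one SSE addps instruction. */
static vec4 v4_add(vec4 a, vec4 b)
{
    vec4 r;
    for (int i = 0; i < 4; i++) r.v[i] = a.v[i] + b.v[i];
    return r;
}

/* Componentwise multiply: mulps. */
static vec4 v4_mul(vec4 a, vec4 b)
{
    vec4 r;
    for (int i = 0; i < 4; i++) r.v[i] = a.v[i] * b.v[i];
    return r;
}

/* Componentwise compare, a la Sean's conditional testing: 1.0f where
 * a < b, else 0.0f.  (cmpltps actually produces all-ones/all-zeros
 * bit masks; 1/0 is used here for readability.) */
static vec4 v4_lt(vec4 a, vec4 b)
{
    vec4 r;
    for (int i = 0; i < 4; i++) r.v[i] = (a.v[i] < b.v[i]) ? 1.0f : 0.0f;
    return r;
}

/* Dot product: the horizontal reduction SSE gives little help with,
 * which is Sean's point about it belonging in the language instead. */
static float v4_dot(vec4 a, vec4 b)
{
    float s = 0.0f;
    for (int i = 0; i < 4; i++) s += a.v[i] * b.v[i];
    return s;
}
```

With built-in operators, the compiler could pick the hardware path when SSE/MMX/3DNow is present and fall back to scalar loops like these when it isn't, which is exactly the emulation path discussed later in the thread.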
You're right. -Walter

"Sean L. Palmer" <spalmer iname.com> wrote in message news:9rj7js$2sq8$1 digitaldaemon.com...
> I think the key concept to take advantage of SSE, MMX, 3DNow, etc. SIMD
> instruction sets is that the language needs to recognize the data types
> manipulated by these instruction sets [...]
Jan 09 2002
If D had good native support for SIMD programming, that would seem to be a great big selling point for D. "D: Language of the future of parallel programming". ;) That's what SIMD programming is: parallel programming. You usually do several operations in parallel and mix the results together occasionally. Modern processors can do several operations in parallel, but we usually have to resort to assembly language to program that way. D could abstract the whole concept of that kind of data structure and emulate it if the actual hardware isn't available (you could choose to have the compiler emit SSE or 3DNow instructions, or emulate using the FPU).

Not just in 3D graphics, but in a *lot* of higher-order math and physics, calculations are done with vectors and matrices. This stuff needs to run as fast as possible, and a compiler is much better suited to the job of reducing stalls and such than a human is (because we humans are lazy, mainly, and it's tedious, repetitive work... work computers are well suited to do). If you don't have language support, you have to go through the assembly language pathway, which is littered with minefields, since usually even inline assembly causes all registers to be saved and restored, or at least causes some kind of inefficiency in the optimizer. Non-inline assembly (not every platform supports inline assembly) involves call/return statements and more register save/load overhead. Not the kind of stuff you want happening all over your tight, fast inner loops where 99% of the time in the application is spent... the parts that benefit most from vector computing.

You could even provide convenience features that may not exist in hardware, such as inner product (which isn't available in the Intel SSE instruction set, yet definitely belongs in any hardware vector processor IMO) or cross product (no instruction sets provide this to my knowledge; it takes like 3 or 4 instructions to emulate).

I think all the 3D engine programmers would switch to D overnight. ;)

But not only that, the code generator could support using these instructions to optimize other parts of the code too. It could unroll this, for instance:

    float a[4], b[4], c[4];
    for (int i = 0; i < 4; ++i)
        a[i] = b[i]*b[i] + c[i];

into this:

    a[] = b[] * b[] + c[];

or perhaps use some Unicode operator (the dot and cross product symbols both exist in Unicode). Let's change that pesky "evaluates right-to-left" assignment into something that reads more left-to-right; in this case:

    b[] * b[] + c[] ==> a[];

which breaks down to this:

    b[0] * b[0] + c[0] => a[0];
    b[1] * b[1] + c[1] => a[1];
    b[2] * b[2] + c[2] => a[2];
    b[3] * b[3] + c[3] => a[3];

Let's just turn that line sideways, rotate it 90 degrees to the right, and drop the square brackets:

    b0 b1 b2 b3
     *  *  *  *
    b0 b1 b2 b3
     +  +  +  +
    c0 c1 c2 c3
    => => => =>
    a0 a1 a2 a3

which would essentially boil down to this kind of assembly language:

    fmov4 xr0, b
    fmov4 xr1, c
    fmul4 xr2, xr0, xr0
    fadd4 xr2, xr2, xr1
    fmov4 a, xr2

or even this:

    fmov4 xr0, b
    fmad4 xr2, xr0, xr0, c  ; multiply each component of xr0 by itself,
                            ; add the corresponding component of c, and
                            ; store all 4 results into xr2
    fmov4 a, xr2

Check out the instruction set for D3D's vertex or pixel shaders sometime. ;) All new video cards have that kind of vector processor. Your kid's PS2 has two of those vector units. Every PC since the Pentium MMX has at least one that can do very fast operations on bytes. Yet I can't name a programming language above assembly that inherently supports such things.

Sean

"Walter" <walter digitalmars.com> wrote in message news:a1j6gs$2b02$1 digitaldaemon.com...
> You're right. -Walter
Jan 10 2002
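Sean's unrolling example can be checked as plain C today; the scalar loop below is exactly the pattern he suggests a vectorizing compiler should collapse into a single four-wide multiply-add (the fmad4-style instruction in his hypothetical assembly). The function name is invented for illustration.

```c
/* a[i] = b[i]*b[i] + c[i] over four lanes: one fmad4 in Sean's
 * hypothetical ISA, or mulps + addps in SSE. */
void square_plus(float a[4], const float b[4], const float c[4])
{
    for (int i = 0; i < 4; ++i)
        a[i] = b[i] * b[i] + c[i];
}
```

The point of the thread is that nothing in this source tells a 2001-era C compiler that the four iterations are independent; a built-in array operation like `a[] = b[] * b[] + c[]` would state that directly.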
Actually, I think D does with its array operation semantics. They aren't implemented :-(, but they are in the spec. -Walter

"Sean L. Palmer" <spalmer iname.com> wrote in message news:a1jj23$2jl8$1 digitaldaemon.com...
> If D had good native support for SIMD programming, that would seem to be
> a great big selling point for D. [...]
Jan 10 2002
"Walter" <walter digitalmars.com> wrote in message news:a1jlrk$2lgs$1 digitaldaemon.com...
> Actually, I think D does with its array operation semantics. They aren't
> implemented :-(, but they are in the spec. -Walter

But there is no dot and cross product for vectors.
Jan 10 2002
True.

"Pavel Minayev" <evilone omen.ru> wrote in message news:a1k1lr$2shq$1 digitaldaemon.com...
> But there is no dot and cross product for vectors.
Jan 10 2002
"Walter" <walter digitalmars.com> wrote in message news:a1kmio$2hu$2 digitaldaemon.com...
> True.

Then why not add them? As well as matrix multiply? And optimize using MMX/3DNow!/SSE?
Jan 10 2002
"Pavel Minayev" <evilone omen.ru> wrote in message news:a1knkr$d57$1 digitaldaemon.com...
> Then why not add them? As well as matrix multiply? And optimize
> using MMX/3DNow!/SSE?

Because I need to finish what I've got first!
Jan 11 2002
Walter wrote:
> Because I need to finish what I've got first!

I nominate Walter to be the first human to be cloned... With enough Walters, and if they all work the same hours and all think alike, then we may have an SIMD-based language author and compiler writer!

-BobC
Jan 12 2002
"Robert W. Cunningham" <rcunning acm.org> wrote in message news:3C408D59.DE2F85AB acm.org...
> With enough Walters, and if they all work the same hours and all think
> alike, then we may have an SIMD-based language author and compiler writer!

Yes, but who will hold the copyright? Also, imagine all those Walters arguing here on the newsgroup... =)
Jan 12 2002
"Walter" <walter digitalmars.com> wrote in message news:a1oifr$brs$2 digitaldaemon.com...
> Because I need to finish what I've got first!

I don't mean "now". I mean "in general" =)
Jan 12 2002