digitalmars.D - __restrict, architecture intrinsics vs asm, consoles, and other stuff
- Manu (70/70) Sep 21 2011 Hello D community.
- Trass3r (9/14) Sep 21 2011 Well DMD only supports x86 including inline asm so that's the only thing...
- Walter Bright (31/83) Sep 21 2011 D doesn't have __restrict. I'm going to argue that it is unnecessary. AF...
- a (86/86) Sep 21 2011 How would one do something like this without intrinsics (the code is c++...
- Don (28/114) Sep 21 2011 [snip]
- a (26/27) Sep 22 2011 Doesn't it often require additional needless movaps instructions?
- Walter Bright (2/4) Sep 22 2011 That's correct, it currently does not.
- Andrei Alexandrescu (6/37) Sep 22 2011 I think we should put swizzle in std.numeric once and for all. Is anyone...
- so (14/16) Sep 22 2011 You mean some helper functions to be used in user structures? Because i ...
- so (2/3) Sep 22 2011 accept...
- Andrei Alexandrescu (3/8) Sep 22 2011 I was thinking of a template that takes and return T[n].
- so (3/13) Sep 22 2011 Something like this?
- Andrei Alexandrescu (4/19) Sep 22 2011 Looks promising, though I was hoping to not need an additional struct V....
- Manu Evans (121/165) Sep 23 2011 code is
- bearophile (5/10) Sep 23 2011 What do you want to do when CPU with 256 bit registers appear? When they...
- Manu Evans (33/42) Sep 23 2011 offer most of the things discussed in this thread. But I think that
- Don (13/178) Sep 23 2011 float[4] is not considered to be a hardware vector. It is only passed as...
- Marco Leise (7/38) Sep 22 2011 That's a nice fresh approach to intrinsics. I bet if other languages had...
- Peter Alexander (11/29) Sep 22 2011 How can it compile into a single shufps? x and y would need to already
- Marco Leise (5/38) Sep 22 2011 I thought about this. Either write long functions, so you don't have to ...
- Don (3/44) Sep 23 2011 Yeah, at the moment you have to work at a higher level, you can't just
- bearophile (8/10) Sep 24 2011 Is it possible to solve some of those problems adding something like thi...
- Benjamin Thaut (16/24) Sep 21 2011 I recently tried that, and I couldn't do it because D has no way of
- Walter Bright (2/6) Sep 22 2011 No, but 64 bit DMD aligns the stack on 16 byte boundaries.
- Benjamin Thaut (2/9) Sep 22 2011 Unfortunately there is no 64 bit dmd on windows.
- Peter Alexander (12/24) Sep 22 2011 It's used for vector stuff, but I wouldn't say mostly. Just about any
- Manu Evans (33/67) Sep 23 2011 Use of __restrict is certainly not limited to your example, it's applica...
- Iain Buclaw (35/52) Sep 24 2011 type, there's
- Iain Buclaw (10/80) Sep 21 2011 The DMD compiler has some basic intrinsics, other compilers build upon t...
- Kagamin (3/10) Sep 22 2011 http://pspemu.soywiz.com/2011/07/fourth-release-d-pspemu-r301.html
Hello D community.

I've been reading a lot about D lately. I have known it existed for ages, but for some reason never even took a moment to look into it. The more I looked into it, the more I realise: this is the language I want. C(/C++) has been ruined, far beyond salvation. D seems to be the reboot that it desperately needs.

Anyway, I work in the games industry, 10 years in cross-platform console games at major studios. Sadly, I don't think Microsoft, Sony, Nintendo, Apple, Google (...maybe Google) will support D any time soon, but I've started some after-hours game projects to test D in some real gamedev environments. So far I have these (critical) questions.

Pointer aliasing... C implementations use a non-standard __restrict keyword to state that a given pointer will not be aliased by any other pointer. This is critical in some pieces of code to eliminate redundant loads and stores, particularly important on RISC architectures like PPC. How does D address pointer aliasing? I can't imagine the compiler has any way to detect that pointer aliasing is not possible in certain cases; many cases are just far too complicated. Is there a keyword? Or plans? This is critical for realtime performance.

C implementations often use compiler intrinsics to implement architecture-provided functionality rather than inline asm. The reason is that the intrinsics allow the compiler to generate better code with knowledge of the context. Inline asm can't really be transformed appropriately to suit the context in some situations, whereas intrinsics operate differently, and run vendor-specific logic to produce the code more intelligently. How does D address this? What options/possibilities are available to the language? Hooks for vendors to implement intrinsics for custom hardware?

Is the D assembler a macro assembler? (ie, assigns registers automatically and manages loads/stores intelligently?) 
I haven't seen any non-x86 examples of the D assembler, and I think it's fair to say that x86 is the single most unnecessary architecture to write inline assembly for. Are there PowerPC or ARM examples anywhere?

As an extension from that, why is there no hardware vector support in the language? Surely a primitive vector4 type would be a sensible thing to have? Is it possible in D currently to pass vectors to functions by value in registers? Without an intrinsic vector type, it would seem impossible.

In addition to that, writing a custom Vector4 class to make use of VMX, SSE, ARM VFP, PSP VFPU, MIPS 'Vector Units', SH4 DR regs, etc, wrapping functions around inline asm blocks, is always clumsy and far from optimal. The compiler (code generator and probably the optimiser) needs to understand the concepts of vectors to make good use of the hardware. How can I do this in a nice way in D? I'm long sick of writing unsightly vector classes in C++, but fortunately using vendor-specific compiler intrinsics usually leads to decent code generation. I can currently imagine an equally ugly (possibly worse) hardware vector library in D, if it's even possible. But perhaps I've missed something here?

I'd love to try out D on some console systems. Fortunately there are some great home-brew scenes available for a bunch of slightly older consoles: PSP/PS2 (MIPS), XBox1 (embedded x86), GameCube/Wii (PPC), Dreamcast (SH4). They all have GCC compilers maintained by the community. How difficult will it be to make GDC work with those toolchains? I know nothing about configuring GCC, so sadly I can't really help here. What about Android (or iPhone, but Apple's 'x-code policy' prevents that)? I'd REALLY love to write an Android project in D... the toolchain is GCC, so I see no reason why it shouldn't be possible to write an Android app if an appropriate toolchain were available?

Sorry it's a bit long, thanks for reading this far! 
I'm looking forward to a brighter future writing lots of D code :P But I need to know that basically all these questions are addressed before I could consider it for serious commercial game dev.
Sep 21 2011
I haven't seen any non-x86 examples of the D assembler, and I think it's fair to say that x86 is the single most unnecessary architecture to write inline assembly that exists. Are there PowerPC or ARM examples anywhere?

Well, DMD only supports x86, including inline asm, so that's the only thing that's tested. You need to try LDC or GDC for most of the things you request.

http://dsource.org/projects/ldc/wiki/InlineAsmExpressions
https://bitbucket.org/goshawk/gdc/wiki/UserDocumentation#!extended-assembler

Some guys have already managed to build cross-compilers for ARM and run some basic code on e.g. the Nintendo DS. For anything serious you would need to make druntime work, though. It's just that nobody has done the dirty work yet.
Sep 21 2011
On 9/21/2011 3:55 PM, Manu wrote:Pointer aliasing... C implementations uses a non-standard __restrict keyword to state that a given pointer will not be aliased by any other pointer. This is critical in some pieces of code to eliminate redundant loads and stores, particularly important on RISC architectures like PPC. How does D address pointer aliasing? I can't imagine the compiler has any way to detect that pointer aliasing is not possible in certain cases, many cases are just far too complicated. Is there a keyword? Or plans? This is critical for realtime performance.D doesn't have __restrict. I'm going to argue that it is unnecessary. AFAIK, __restrict is most used in writing vector operations. D, on the other hand, has a dedicated vector operation syntax: a[] += b[] * c; where a[] and b[] are required to not be overlapping, hence enabling parallelization of the operation.C implementations often use compiler intrinsics to implement architecture provided functionality rather than inline asm, the reason is that the intrinsics allow the compiler to generate better code with knowledge of the context. Inline asm can't really be transformed appropriately to suit the context in some situations, whereas intrinsics operate differently, and run vendor specific logic to produce the code more intelligently. How does D address this? What options/possibilities are available to the language? Hooks for vendors to implement intrinsics for custom hardware?D does have some intrinsics, like sin() and cos(). They tend to get added on a strictly as-needed basis, not a speculative one. D has no current intention to replace the inline assembler with intrinsics. 
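For readers who haven't used __restrict, a minimal C++ sketch of the promise it makes (the `__restrict__` spelling is the GCC/Clang extension; C99 spells it `restrict`; the function name here is illustrative only):

```cpp
#include <cassert>
#include <cstddef>

// Without the restrict qualifier, the compiler must assume `dst` might
// alias `src`, so each store to dst[i] could invalidate later loads of
// src[i] and force reloads. With the qualifier, the programmer promises
// the two arrays do not overlap, so loads can be hoisted and the loop
// vectorized. D's a[] = b[] * k syntax bakes this non-overlap
// requirement into the operation itself.
void scale(float* __restrict__ dst, const float* __restrict__ src,
           float k, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        dst[i] = src[i] * k;
}
```

This is the same guarantee Walter describes D's array-operation syntax as providing implicitly, since overlapping operands are disallowed there by definition.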
As for custom intrinsics, Don Clugston wrote an amazing piece of demonstration D code a while back that would take a string representing a floating point expression, and would literally compile it (using Compile Time Function Execution) and produce a string literal of inline asm functions, which were then compiled by the inline assembler. So yes, it is entirely possible and practical for end users to write custom intrinsics.Is the D assembler a macro assembler?No. It's what-you-write-is-what-you-get.(ie, assigns registers automatically and manage loads/stores intelligently?)No. It's intended to be a low level assembler for those who want to precisely control things.I haven't seen any non-x86 examples of the D assembler, and I think it's fair to say that x86 is the single most unnecessary architecture to write inline assembly that exists.I enjoy writing x86 inline assembler :-)Are there PowerPC or ARM examples anywhere?The intention is for other CPU targets to employ the syntax used in their respective CPU manual datasheets.As an extension from that, why is there no hardware vector support in the language? Surely a primitive vector4 type would be a sensible thing to have?The language supports it now (see the aforementioned vector syntax), it's just that the vector code gen isn't done (currently it is just implemented using loops).Is it possible in D currently to pass vectors to functions by value in registers? Without an intrinsic vector type, it would seem impossible.Vectors (statically dimensioned arrays) are currently passed by value (unlike C or C++).In addition to that, writing a custom Vector4 class to make use of VMX, SSE, ARM VFP, PSP VFPU, MIPS 'Vector Units', SH4 DR regs, etc, wrapping functions around inline asm blocks is always clumsy and far from optimal. The compiler (code generator and probably the optimiser) needs to understand the concepts of vectors to make good use of the hardware.Yes, I agree.How can I do this in a nice way in D? 
I'm long sick of writing unsightly vector classes in C++, but fortunately using vendor specific compiler intrinsics usually leads to decent code generation. I can currently imagine an equally ugly (possibly worse) hardware vector library in D, if it's even possible. But perhaps I've missed something here?Your C++ vector code should be amenable to translation to D, so that effort of yours isn't lost, except that it'd have to be in inline asm rather than intrinsics.I'd love to try out D on some console systems. Fortunately there are some great home-brew scenes available for a bunch of slightly older consoles; PSP/PS2 (MIPS), XBox1 (embedded x86), GameCube/Wii (PPC), Dreamcast (SH4). They all have GCC compilers maintained by the community. How difficult will it be to make GDC work with those toolchains? Sadly I know nothing about configuring GCC, so sadly I can't really help here.I don't know much about GDC's capabilities.
Sep 21 2011
How would one do something like this without intrinsics (the code is c++ using gcc vector extensions)?

template <class V>
struct Fft
{
    typedef typename V::T T;
    typedef typename V::vec vec;
    static const int VecSize = V::Size;
    ...
    template <int Interleaved>
    static NOINLINE void fft_pass_interleaved(
        vec * __restrict pr, vec * __restrict pi,
        vec * __restrict pend, T * __restrict table)
    {
        for(; pr < pend; pr += 2, pi += 2, table += 2*Interleaved)
        {
            vec tmpr, ti, ur, ui, wr, wi;
            V::template expandComplexArrayToRealImagVec<Interleaved>(table, wr, wi);
            V::template deinterleave<Interleaved>(pr[0], pr[1], ur, tmpr);
            V::template deinterleave<Interleaved>(pi[0], pi[1], ui, ti);
            vec tr = tmpr*wr - ti*wi;
            ti = tmpr*wi + ti*wr;
            V::template interleave<Interleaved>(ur + tr, ur - tr, pr[0], pr[1]);
            V::template interleave<Interleaved>(ui + ti, ui - ti, pi[0], pi[1]);
        }
    }
    ...

Here vector elements need to be shuffled around when they are loaded and stored. This is platform dependent and cannot be expressed through vector operations (or gcc vector extensions). Here I abstracted the platform-dependent functionality into member functions of V, which are implemented using intrinsics. 
The assembly generated for SSE single precision and Interleaved=4 is:

0000000000000000 <_ZN3FftI6SSEVecIfEE20fft_pass_interleavedILi4EEEvPDv4_fS5_S5_Pf>:
   0:  48 39 d7              cmp      %rdx,%rdi
   3:  0f 83 9c 00 00 00     jae      a5 <_ZN3FftI6SSEVecIfEE20fft_pass_interleavedILi4EEEvPDv4_fS5_S5_Pf+0xa5>
   9:  0f 1f 80 00 00 00 00  nopl     0x0(%rax)
  10:  0f 28 19              movaps   (%rcx),%xmm3
  13:  0f 28 41 10           movaps   0x10(%rcx),%xmm0
  17:  48 83 c1 20           add      $0x20,%rcx
  1b:  0f 28 f3              movaps   %xmm3,%xmm6
  1e:  0f 28 2f              movaps   (%rdi),%xmm5
  21:  0f c6 d8 dd           shufps   $0xdd,%xmm0,%xmm3
  25:  0f c6 f0 88           shufps   $0x88,%xmm0,%xmm6
  29:  0f 28 e5              movaps   %xmm5,%xmm4
  2c:  0f 28 47 10           movaps   0x10(%rdi),%xmm0
  30:  0f 28 4e 10           movaps   0x10(%rsi),%xmm1
  34:  0f c6 e0 88           shufps   $0x88,%xmm0,%xmm4
  38:  0f c6 e8 dd           shufps   $0xdd,%xmm0,%xmm5
  3c:  0f 28 06              movaps   (%rsi),%xmm0
  3f:  0f 28 d0              movaps   %xmm0,%xmm2
  42:  0f c6 c1 dd           shufps   $0xdd,%xmm1,%xmm0
  46:  0f c6 d1 88           shufps   $0x88,%xmm1,%xmm2
  4a:  0f 28 cd              movaps   %xmm5,%xmm1
  4d:  0f 28 f8              movaps   %xmm0,%xmm7
  50:  0f 59 ce              mulps    %xmm6,%xmm1
  53:  0f 59 fb              mulps    %xmm3,%xmm7
  56:  0f 59 c6              mulps    %xmm6,%xmm0
  59:  0f 59 dd              mulps    %xmm5,%xmm3
  5c:  0f 5c cf              subps    %xmm7,%xmm1
  5f:  0f 58 c3              addps    %xmm3,%xmm0
  62:  0f 28 dc              movaps   %xmm4,%xmm3
  65:  0f 5c d9              subps    %xmm1,%xmm3
  68:  0f 58 cc              addps    %xmm4,%xmm1
  6b:  0f 28 e1              movaps   %xmm1,%xmm4
  6e:  0f 15 cb              unpckhps %xmm3,%xmm1
  71:  0f 14 e3              unpcklps %xmm3,%xmm4
  74:  0f 29 4f 10           movaps   %xmm1,0x10(%rdi)
  78:  0f 28 ca              movaps   %xmm2,%xmm1
  7b:  0f 29 27              movaps   %xmm4,(%rdi)
  7e:  0f 5c c8              subps    %xmm0,%xmm1
  81:  48 83 c7 20           add      $0x20,%rdi
  85:  0f 58 c2              addps    %xmm2,%xmm0
  88:  0f 28 d0              movaps   %xmm0,%xmm2
  8b:  0f 15 c1              unpckhps %xmm1,%xmm0
  8e:  0f 14 d1              unpcklps %xmm1,%xmm2
  91:  0f 29 46 10           movaps   %xmm0,0x10(%rsi)
  95:  0f 29 16              movaps   %xmm2,(%rsi)
  98:  48 83 c6 20           add      $0x20,%rsi
  9c:  48 39 fa              cmp      %rdi,%rdx
  9f:  0f 87 6b ff ff ff     ja       10 <_ZN3FftI6SSEVecIfEE20fft_pass_interleavedILi4EEEvPDv4_fS5_S5_Pf+0x10>
  a5:  f3 c3                 repz retq

Would something like that be possible with D inline assembly, or would there be additional loads and stores for each call of V::interleave, V::deinterleave and V::expandComplexArrayToRealImagVec?
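As a reading aid for the SSE listing above: shufps with mask 0x88 selects lanes 0 and 2 of each operand (the even lanes), and mask 0xdd selects lanes 1 and 3 (the odd lanes), which is how the deinterleave step splits the complex data. A scalar C++ model of that selection (helper names here are illustrative, not the post's actual V:: implementation):

```cpp
#include <array>
#include <cassert>

using vec = std::array<float, 4>;

// Scalar model of the two shufps masks in the listing:
//   shufps $0x88 -> {dst[0], dst[2], src[0], src[2]}  (even lanes)
//   shufps $0xdd -> {dst[1], dst[3], src[1], src[3]}  (odd lanes)
// Applied to two adjacent vectors, this deinterleaves the elements
// into even-indexed and odd-indexed halves.
void deinterleave(vec v0, vec v1, vec& evens, vec& odds) {
    evens = {v0[0], v0[2], v1[0], v1[2]};  // shufps $0x88 behavior
    odds  = {v0[1], v0[3], v1[1], v1[3]};  // shufps $0xdd behavior
}
```

The unpcklps/unpckhps pair at the end of the loop performs the inverse interleave before the results are stored.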
Sep 21 2011
On 22.09.2011 05:24, a wrote:How would one do something like this without intrinsics (the code is c++ using gcc vector extensions):[snip]

At present, you can't do it without ultimately resorting to inline asm. But, what we've done is to move SIMD into the machine model: the D machine model assumes that float[4] + float[4] is a more efficient operation than a loop. Currently, only arithmetic operations are implemented, and on DMD at least, they're still not proper intrinsics. So in the long term it'll be possible to do it directly, but not yet.

At various times, several of us have implemented 'swizzle' using CTFE, giving you a syntax like:

float[4] x, y;
x[] = y[].swizzle!"cdcd"(); // x[0]=y[2], x[1]=y[3], x[2]=y[2], x[3]=y[3]

which compiles to a single shufps instruction. That "cdcd" string is really a tiny DSL: the language consists of four characters, each of which is a, b, c, or d.

A couple of years ago I made a DSL compiler for BLAS1 operations. It was capable of doing some pretty wild stuff, even then. (The DSL looked like normal D code). But the compiler has improved enormously since that time. It's now perfectly feasible to make a DSL for the SIMD operations you need. The really nice thing about this, compared to normal asm, is that you have access to the compiler's symbol table. This lets you add compile-time error messages, for example.

A funny thing about this, which I found after working on the DMD back-end, is that it is MUCH easier to write an optimizer/code generator in a DSL in D than in a compiler back-end.

[snip - remainder of the quoted original post]
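Don's swizzle!"cdcd" can be modelled outside D as well. In D the pattern string is parsed at compile time via CTFE and lowered to a single shufps; the plain C++ sketch below only illustrates the lane-selection semantics of that tiny DSL (the helper is hypothetical, not the actual library code):

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Each character of the pattern ('a'..'d') selects a source lane:
// 'a' -> lane 0, 'b' -> lane 1, 'c' -> lane 2, 'd' -> lane 3.
// swizzle(y, "cdcd") therefore yields {y[2], y[3], y[2], y[3]},
// matching the comment in Don's example.
std::array<float, 4> swizzle(const std::array<float, 4>& src,
                             const char (&pattern)[5]) {
    std::array<float, 4> out{};
    for (std::size_t i = 0; i < 4; ++i)
        out[i] = src[pattern[i] - 'a'];
    return out;
}
```

The point of doing this in CTFE rather than at run time is that the pattern is known when code is generated, so the loop collapses to one shuffle instruction instead of four scalar moves.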
Sep 21 2011
which compiles to a single shufps instruction.Doesn't it often require additional needless movaps instructions? For example, the following:

asm
{
    movaps XMM0, a;
    movaps XMM1, b;
    addps XMM0, XMM1;
    movaps a, XMM0;
}
asm
{
    movaps XMM0, a;
    movaps XMM1, b;
    addps XMM0, XMM1;
    movaps a, XMM0;
}

compiles to

movaps -0x48(%rsp),%xmm0
movaps -0x38(%rsp),%xmm1
addps  %xmm1,%xmm0
movaps %xmm0,-0x48(%rsp)
movaps -0x48(%rsp),%xmm0
movaps -0x38(%rsp),%xmm1
addps  %xmm1,%xmm0
movaps %xmm0,-0x48(%rsp)

Is it possible to avoid needless loading and storing of values when calling multiple functions that use asm blocks? It also seems that the compiler doesn't inline functions containing asm.
Sep 22 2011
On 9/22/2011 5:11 AM, a wrote:It also seems that the compiler doesn't inline functions containing asm.That's correct, it currently does not.
Sep 22 2011
On 9/22/11 1:39 AM, Don wrote:On 22.09.2011 05:24, a wrote:I think we should put swizzle in std.numeric once and for all. Is anyone interested in taking up that task?How would one do something like this without intrinsics (the code is c++ using gcc vector extensions):[snip] At present, you can't do it without ultimately resorting to inline asm. But, what we've done is to move SIMD into the machine model: the D machine model assumes that float[4] + float[4] is a more efficient operation than a loop. Currently, only arithmetic operations are implemented, and on DMD at least, they're still not proper intrinsics. So in the long term it'll be possible to do it directly, but not yet. At various times, several of us have implemented 'swizzle' using CTFE, giving you a syntax like: float[4] x, y; x[] = y[].swizzle!"cdcd"(); // x[0]=y[2], x[1]=y[3], x[2]=y[2], x[3]=y[3] which compiles to a single shufps instruction. That "cdcd" string is really a tiny DSL: the language consists of four characters, each of which is a, b, c, or d.A couple of years ago I made a DSL compiler for BLAS1 operations. It was capable of doing some pretty wild stuff, even then. (The DSL looked like normal D code). But the compiler has improved enormously since that time. It's now perfectly feasible to make a DSL for the SIMD operations you need. The really nice thing about this, compared to normal asm, is that you have access to the compiler's symbol table. This lets you add compile-time error messages, for example. A funny thing about this, which I found after working on the DMD back-end, is that is MUCH easier to write an optimizer/code generator in a DSL in D, than in a compiler back-end.A good argument for (a) moving stuff from the compiler into the library, (b) continuing Don's great work on making CTFE a solid proposition. Andrei
Sep 22 2011
On Thu, 22 Sep 2011 17:07:25 +0300, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:I think we should put swizzle in std.numeric once and for all. Is anyone interested in taking up that task?You mean some helper functions to be used in user structures? Because i don't know of any structure in std.numeric that could use it. We first need to improve opDispatch. Currently i think no one knows how it works or how it was intended to work. It refuses to except a few things which i think it should. For example:

struct A
{
    opDispatch(string)()
    opDispatch(string)() const
}

A a, b;
a.fun = b.run; // This should be perfectly fine.
Sep 22 2011
On Fri, 23 Sep 2011 02:00:50 +0300, so <so so.so> wrote:It refuses to except a few things which i think it should.accept...
Sep 22 2011
On 9/22/11 6:00 PM, so wrote:On Thu, 22 Sep 2011 17:07:25 +0300, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:I was thinking of a template that takes and return T[n]. AndreiI think we should put swizzle in std.numeric once and for all. Is anyone interested in taking up that task?You mean some helper functions to be used in user structures?
Sep 22 2011
On Fri, 23 Sep 2011 02:40:11 +0300, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:On 9/22/11 6:00 PM, so wrote:Something like this?On Thu, 22 Sep 2011 17:07:25 +0300, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:I was thinking of a template that takes and return T[n]. AndreiI think we should put swizzle in std.numeric once and for all. Is anyone interested in taking up that task?You mean some helper functions to be used in user structures?
Sep 22 2011
On 9/22/11 9:11 PM, so wrote:On Fri, 23 Sep 2011 02:40:11 +0300, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:Looks promising, though I was hoping to not need an additional struct V. But I'm not an expert. AndreiOn 9/22/11 6:00 PM, so wrote:Something like this?On Thu, 22 Sep 2011 17:07:25 +0300, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:I was thinking of a template that takes and return T[n]. AndreiI think we should put swizzle in std.numeric once and for all. Is anyone interested in taking up that task?You mean some helper functions to be used in user structures?
Sep 22 2011
On Fri, 23 Sep 2011 06:44:44 +0300, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:On 9/22/11 9:11 PM, so wrote:It was there to show how it should be used in user code, and testing. Swizzle is not just a rvalue operation, there is also a lvalue part to it which plays a bit differently (hence, swizzleR and swizzleL). We could take care of it with an overload but D doesn't act quite like what i expected (like C++), i don't understand why it won't differentiate "fun()" from "fun() const".On Fri, 23 Sep 2011 02:40:11 +0300, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:Looks promising, though I was hoping to not need an additional struct V. But I'm not an expert. AndreiOn 9/22/11 6:00 PM, so wrote:Something like this?On Thu, 22 Sep 2011 17:07:25 +0300, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:I was thinking of a template that takes and return T[n]. AndreiI think we should put swizzle in std.numeric once and for all. Is anyone interested in taking up that task?You mean some helper functions to be used in user structures?
Sep 23 2011
On Fri, 23 Sep 2011 16:09:31 +0300, so <so so.so> wrote:It was there to show how it should be used in user code, and testing. Swizzle is not just a rvalue operation, there is also a lvalue part to it which plays a bit differently (hence, swizzleR and swizzleL). We could take care of it with an overload but D doesn't act quite like what i expected (like C++), i don't understand why it won't differentiate "fun()" from "fun() const".Sorry about the nonsense. It is now with opDispatch (attached) To make a generic "swizzle" function we need to introduce a few traits but if all you want is a support for T[N] that is easy.
Sep 24 2011
== Quote from Andrei Alexandrescu (SeeWebsiteForEmail erdani.org)'s article

On 9/22/11 1:39 AM, Don wrote:
[snip - SIMD in the machine model, and the CTFE swizzle giving x[] = y[].swizzle!"cdcd"() as a single shufps]

I think we should put swizzle in std.numeric once and for all. Is anyone interested in taking up that task?

[snip - the BLAS1 DSL compiler, and writing an optimizer/code generator in a DSL in D]

A good argument for (a) moving stuff from the compiler into the library, (b) continuing Don's great work on making CTFE a solid proposition.

Andrei

This sounds really dangerous to me. 
I really like the idea where CTFE can be used to produce some pretty powerful almost-like-intrinsics code, but applying it in this context sounds like a really bad idea.

Firstly, so I'm not misunderstanding: is this suggestion building on Don's previous post saying that float[4] is somehow intercepted and special-cased by the compiler, reinterpreted as a candidate for hardware vector operations? I think that's a wrong decision in itself, and a poor foundation for this approach. Let me try and convince you that the language should have an explicit hardware vector type, and not attempt to make use of any clever language tricks...

If float[4] is considered a hardware vector by the compiler:
- How do I define an ACTUAL float[4]?
- How can I be confident that it actually WILL be a hardware vector?

Hardware vectors are NOT float[4]'s; they are a reference to a 128-bit hardware register upon which various vector operations may be supported, they are probably aligned, and they are only accessible in 128-bit quantities. I think they should be explicitly defined as such. They may be float4, u/int4, u/short8, u/byte16, double2... All these types are interchangeable within the one register; do you intend to special-case fixed-length arrays of all those types to support the hardware functionality for those?

Hardware vectors are NOT floats; they can not interact with the floating point unit. Dereferencing of this style, 'float x = myVector[0]', is NOT supported by the hardware, and it should not be exposed to the programmer as a trivial possibility. This seemingly harmless line of code will undermine the entire reason for using hardware vector hardware in the first place. Allowing easy access to individual floats within a hardware vector breaks the language's stated premise that the path of least resistance also be the 'correct' optimal choice, whereby a seemingly simple line of code may ruin the entire function. 
float[4] is not even a particularly conveniently sized vector for most inexperienced programmers; the majority will want float[3]. This is NOT a trivial map to float[4], and programmers should be well aware that there is inherent complexity in using the hardware vector architecture, and forced to think it through. Most inexperienced programmers think of the results of operations like dot product and magnitude as being scalar values, but they are not: they are a scalar value repeated across all 4 components of a vector4, and this should be explicit too.

....

I know I'm nobody around here, so I can't expect to be taken too seriously, but I'm very excited about the language, so here's what I would consider instead:

Add a hardware vector type, let's call it 'v128' for the exercise. It is a primitive type, aligned by definition, and does not have any members. You may use this to refer to hardware vector registers explicitly, as vector register function arguments, or as arguments to inline asm blocks.

Add some functions to the standard library (ideally implemented as compiler intrinsics) which do very specific stuff to vectors, and are ideally expandable by hardware vendors or platform holders.

You might want to have some classes in the standard library which wrap said v128, and expose the concept as a float4, int4, byte16, etc. These classes would provide maths operators, comparisons, initialisation and immediate assignment, and casts between various vector types.

Different vector units support completely different methods of permutation. I would be very careful about adding intrinsic support into the library for generalised permutation, and if so, at least leave the capability of implementing intrinsic architecture-specific permutation to the hardware vendors/platform holders. 
At the end of the day, it is imperative that the code generator and optimiser still retain the concept of hardware vectors, and can perform appropriate load/store elimination, apply hardware-specific optimisation to operations like permutes/swizzles, component broadcasts, load immediates, etc.

....

The reason I am so adamant about this is that in almost all architectures (SSE is the most tolerant by far), using the hardware vector unit is an all-or-nothing choice. If you interact between the vector and float registers, you will almost certainly end up with slower code than if you had just used the float unit outright. Also, since people usually use hardware vectors in areas of extreme performance optimisation, it's not tolerable for the compiler to be making mistakes.

As a minimum, the programmer needs to be able to explicitly address the vector registers, pass them to and from functions, and perform explicit (probably IHV-supplied) functions on them. The code generator and optimiser need all the information possible, and as explicit as possible, so IHVs can implement the best possible support for their architecture. The API should reflect this, and not allow easy access to functionality that would violate hardware support.

Ease of programming should be a SECONDARY goal, at which point something like the typed wrapper classes I described would come in, allowing maths operators, comparisons and all I mentioned above, ie. making them look like a real mathematical type, but still keeping their distance from primitive float/int types, to discourage interaction at all costs.

I hope this doesn't sound too much like an overly long rant! :) And hopefully I've managed to sell my point...

Don: I'd love to hear counter-arguments to justify float[4] as a reasonable solution. Currently, no matter how I slice it, I just can't see it. Criticism welcome?

Cheers!
- Manu
Sep 23 2011
Manu Evans:

> Hardware vectors are NOT float[4]'s, they are a reference to an 128bit hardware register upon which various vector operations may be supported, they are probably aligned, and they are only accessible in 128bit quantities. I think they should be explicitly defined as such.

I appreciate your efforts. I answer to the OP that DMD doesn't yet offer most of the things discussed in this thread. But I think that it's better to add and work on high-performance features when the basics of D are in better shape. Currently there are more basic fishes to implement or debug, like tuples syntax sugar, module system issues, const issues, inout, and so on and on (on the other hand I agree that it's OK to discuss even now D design ideas that will allow that future high performance).

What do you want to do when CPUs with 256 bit registers appear? When they grow to 512 bit? To 1024? Do you want to keep adding specific types? How many things do you want to add to D in the next 15 years of CPU evolution?

Bye,
bearophile
Sep 23 2011
== Quote from bearophile (bearophileHUGS lycos.com)'s article
> Manu Evans: I appreciate your efforts. I answer to the OP that DMD doesn't yet offer most of the things discussed in this thread. But I think that it's better to add and work on high-performance features when the basics of D are in better shape. Currently there are more basic fishes to implement or debug, like tuples syntax sugar, module system issues, const issues, inout, and so on and on (on the other hand I agree that it's OK to discuss even now D design ideas that will allow that future high performance).

I make the point because, while I agree the topics you mention are of greater immediate importance, the previous posts in this thread suggest there is already experimentation/implementation of these features happening in the language now, and if they are defined now, and defined incorrectly, it's always very difficult to go back on these decisions.

>> Hardware vectors are NOT float[4]'s, they are a reference to an 128bit hardware register upon which various vector operations may be supported, they are probably aligned, and they are only accessible in 128bit quantities. I think they should be explicitly defined as such.
> What do you want to do when CPU with 256 bit registers appear? When they grow to 512 bit? To 1024? Do you want to keep adding specific types?

Yes. I don't think it's likely to progress as you suggest though. I foresee perhaps a 4 component 64bit-word vector (256bit), and a hardware matrix. I can't see it being any less appropriate to implement a v256 in addition to v128 than a long in addition to an int. A matrix is a fundamentally different concept, and surely worthy of its own type.

> How many things do you want to add to D in the next 15 years of CPU evolution?

As many things as are universally accepted by computer hardware as a normal/standard feature. Hardware vectors definitely fit this bill. 
We've had hardware vector support in virtually every architecture for 10-15 years now, and yet there is still no language that really supports it.
Sep 23 2011
On 24.09.2011 00:47, Manu Evans wrote:
> == Quote from Andrei Alexandrescu (SeeWebsiteForEmail erdani.org)'s article
>> On 9/22/11 1:39 AM, Don wrote:
>>> On 22.09.2011 05:24, a wrote:
>>>> How would one do something like this without intrinsics (the code is c++ using gcc vector extensions):
>>> [snip]
>>> At present, you can't do it without ultimately resorting to inline asm. But, what we've done is to move SIMD into the machine model: the D machine model assumes that float[4] + float[4] is a more efficient operation than a loop. Currently, only arithmetic operations are implemented, and on DMD at least, they're still not proper intrinsics. So in the long term it'll be possible to do it directly, but not yet. At various times, several of us have implemented 'swizzle' using CTFE, giving you a syntax like: float[4] x, y; x[] = y[].swizzle!"cdcd"(); // x[0]=y[2], x[1]=y[3], x[2]=y[2], x[3]=y[3] which compiles to a single shufps instruction. That "cdcd" string is really a tiny DSL: the language consists of four characters, each of which is a, b, c, or d.
>> I think we should put swizzle in std.numeric once and for all. Is anyone interested in taking up that task?
>>> A couple of years ago I made a DSL compiler for BLAS1 operations. It was capable of doing some pretty wild stuff, even then. (The DSL looked like normal D code). But the compiler has improved enormously since that time. It's now perfectly feasible to make a DSL for the SIMD operations you need. The really nice thing about this, compared to normal asm, is that you have access to the compiler's symbol table. This lets you add compile-time error messages, for example. A funny thing about this, which I found after working on the DMD back-end, is that it is MUCH easier to write an optimizer/code generator in a DSL in D, than in a compiler back-end.
>> A good argument for (a) moving stuff from the compiler into the library, (b) continuing Don's great work on making CTFE a solid proposition.
>> Andrei
> This sounds really dangerous to me.

No, it's completely unrelated. It has nothing in common. 
I really like the idea where CTFE can be used to produce some pretty powerful almost-like-intrinsics code, but applying it in this context sounds like a really bad idea. Firstly, so I'm not misunderstanding, is this suggestion building on Don's previous post saying that float[4] is somehow intercepted and special-cased by the compiler, reinterpreting as a candidate for hardware vector operations? I think that's a wrong decision in its self, and a poor foundation for this approach. Let me try and convince you that the language should have an explicit hardware vector type, and not attempt to make use of any clever language tricks... If float[4] is considered a hardware vector by the compiler, - How to I define an ACTUAL float[4]? - How can I be confident that it actually WILL be a hardware vector?

float[4] is not considered to be a hardware vector. It is only passed as one. To pass it the C++ way, declare the parameter as float[], or pass by ref. Everything after that is the responsibility of the compiler/optimizer. A big difference compared to C++ is that, generally, it's pretty strange to pass fixed-length arrays as value parameters. At this stage we don't have any way of forcing it to be a hardware vector. We've just introduced the parameter passing and the vector operations to make it easier for the compiler to use hardware registers. Very little else is decided at this stage. You make some excellent points.

Hardware vectors are NOT float[4]'s, they are a reference to an 128bit hardware register upon which various vector operations may be supported, they are probably aligned, and they are only accessible in 128bit quantities. I think they should be explicitly defined as such. They may be float4, u/int4, u/short8, u/byte16, double2... All these types are interchangeable within the one register, do you intend to special case fixed length arrays of all those types to support the hardware functionality for those? 
Hardware vectors are NOT floats, they can not interact with the floating point unit, dereferencing of this style 'float x = myVector[0]' is NOT supported by the hardware and it should not be exposed to the programmer as a trivial possibility. This seemingly harmless line of code will undermine the entire reason for using hardware vector hardware in the first place. Allowing easy access of individual floats within a hardware vector breaks the languages stated premise that the path of least resistance also be the 'correct' optimal choice, whereby a seemingly simple line of code may ruin the entire function. float[4] is not even a particularly conveniently sized vector for most inexperienced programmers, the majority will want float[3]. This is NOT a trivial map to float[4], and programmers should be well aware that there is inherent complexity in using the hardware vector architecture, and forced to think it through. Most inexperienced programmers think of results of operations like dot product and magnitude as being scalar values, but they are not, they are a scalar value repeated across all 4 components of a vector 4, and this should be explicit too. .... I know I'm nobody around here, so I can't expect to be taken too seriously, but I'm very excited about the language, so here's what I would consider instead: Add a hardware vector type, lets call it 'v128' for the exercise. It is a primitive type, aligned by definition, and does not have any members. You may use this to refer to a hardware vector registers explicitly, as vector register function arguments, or as arguments to inline asm blocks. Add some functions to the standard library (ideally implemented as compiler intrinsics) which do very specific stuff to vectors, and ideally expandable by hardware vendors or platform holders. You might want to have some classes in the standard library which wrap said v128, and expose the concept as a float4, int4, byte16, etc. 
These classes would provide maths operators, comparisons, initialisation and immediate assignment, and casts between various vector types. Different vector units support completely different methods of permutation. I would be very careful about adding intrinsic support into the library for generalised permutation. And if so, at least leave the capability of implementing intrinsic architecture-specific permutation but the hardware vendors/platform holders. At the end of the day, it is imperative that the code generator and optimiser still retain the concept of hardware vectors, and can perform appropriate load/store elimination, apply hardware specific optimisation to operations like permutes/swizzles, component broadcasts, load immediates, etc. .... The reason I am so adamant about this, is in almost all architectures (SSE is the most tolerant by far), using the hardware vector unit is an all-or-nothing choice. If you interact between the vector and float registers, you will almost certainly result in slower code than if you just used the float unit outright. Also, since people usually use hardware vectors in areas of extreme performance optimisation, it's not tolerable for the compiler to be making mistakes. As a minimum the programmer needs to be able to explicitly address the vector registers, pass it to and from functions, and perform explicit (probably IHV supplied) functions on them. The code generator and optimiser needs all the information possible, and as explicit as possible so IHV's can implement the best possible support for their architecture. The API should reflect this, and not allow easy access to functionality that would violate hardware support. 
Ease of programming should be a SECONDARY goal, at which point something like the typed wrapper classes I described would come in, allowing maths operators, comparisons and all I mentioned above, ie, making them look like a real mathematical type, but still keeping their distance from primitive float/int types, to discourage interaction at all costs. I hope this doesn't sound too much like an overly long rant! :) And hopefully I've managed to sell my point... Don: I'd love to hear counter arguments to justify float[4] as a reasonable solution. Currently no matter how I slice it, I just can't see it. Criticism welcome? Cheers! - Manu
Sep 23 2011
Am 22.09.2011, 08:39 Uhr, schrieb Don <nospam nospam.com>:On 22.09.2011 05:24, a wrote:That's a nice fresh approach to intrinsics. I bet if other languages had the CTFE capabilities, they'd probably do the same. Sure, it is ideal if the compiler works magic here, but it takes longer to implement the right code generation in the compiler, than to write an isolated piece of library code and extensions can be added by anyone, especially since there will already be some examples to look at. Thumbs up!How would one do something like this without intrinsics (the code is c++ using gcc vector extensions):[snip] At present, you can't do it without ultimately resorting to inline asm. But, what we've done is to move SIMD into the machine model: the D machine model assumes that float[4] + float[4] is a more efficient operation than a loop. Currently, only arithmetic operations are implemented, and on DMD at least, they're still not proper intrinsics. So in the long term it'll be possible to do it directly, but not yet. At various times, several of us have implemented 'swizzle' using CTFE, giving you a syntax like: float[4] x, y; x[] = y[].swizzle!"cdcd"(); // x[0]=y[2], x[1]=y[3], x[2]=y[2], x[3]=y[3] which compiles to a single shufps instruction. That "cdcd" string is really a tiny DSL: the language consists of four characters, each of which is a, b, c, or d. A couple of years ago I made a DSL compiler for BLAS1 operations. It was capable of doing some pretty wild stuff, even then. (The DSL looked like normal D code). But the compiler has improved enormously since that time. It's now perfectly feasible to make a DSL for the SIMD operations you need. The really nice thing about this, compared to normal asm, is that you have access to the compiler's symbol table. This lets you add compile-time error messages, for example. 
A funny thing about this, which I found after working on the DMD back-end, is that it is MUCH easier to write an optimizer/code generator in a DSL in D, than in a compiler back-end.
Sep 22 2011
On 22/09/11 7:39 AM, Don wrote:On 22.09.2011 05:24, a wrote:How can it compile into a single shufps? x and y would need to already be in vector registers, and unless I've missed something, they won't be. You'll need instructions for loading into registers (using the slow movups because 16-byte alignment isn't guaranteed) then do the shufps, then load back out again. This is too slow for performance critical code. Being stored in XMM registers from creation, passed and returned in XMM registers to/from functions is a key requirement for this sort of code. If you have to keep loading in and out of memory then you lose all performance.How would one do something like this without intrinsics (the code is c++ using gcc vector extensions):[snip] At present, you can't do it without ultimately resorting to inline asm. But, what we've done is to move SIMD into the machine model: the D machine model assumes that float[4] + float[4] is a more efficient operation than a loop. Currently, only arithmetic operations are implemented, and on DMD at least, they're still not proper intrinsics. So in the long term it'll be possible to do it directly, but not yet. At various times, several of us have implemented 'swizzle' using CTFE, giving you a syntax like: float[4] x, y; x[] = y[].swizzle!"cdcd"(); // x[0]=y[2], x[1]=y[3], x[2]=y[2], x[3]=y[3] which compiles to a single shufps instruction.
Sep 22 2011
Am 22.09.2011, 19:26 Uhr, schrieb Peter Alexander <peter.alexander.au gmail.com>:On 22/09/11 7:39 AM, Don wrote:I thought about this. Either write long functions, so you don't have to load and unload often or just make the functions assume that the parameters are in registers without explicit declaration.On 22.09.2011 05:24, a wrote:How can it compile into a single shufps? x and y would need to already be in vector registers, and unless I've missed something, they won't be. You'll need instructions for loading into registers (using the slow movups because 16-byte alignment isn't guaranteed) then do the shufps, then load back out again. This is too slow for performance critical code. Being stored in XMM registers from creation, passed and returned in XMM registers to/from functions is a key requirement for this sort of code. If you have to keep loading in and out of memory then you lose all performance.How would one do something like this without intrinsics (the code is c++ using gcc vector extensions):[snip] At present, you can't do it without ultimately resorting to inline asm. But, what we've done is to move SIMD into the machine model: the D machine model assumes that float[4] + float[4] is a more efficient operation than a loop. Currently, only arithmetic operations are implemented, and on DMD at least, they're still not proper intrinsics. So in the long term it'll be possible to do it directly, but not yet. At various times, several of us have implemented 'swizzle' using CTFE, giving you a syntax like: float[4] x, y; x[] = y[].swizzle!"cdcd"(); // x[0]=y[2], x[1]=y[3], x[2]=y[2], x[3]=y[3] which compiles to a single shufps instruction.
Sep 22 2011
On 22.09.2011 20:19, Marco Leise wrote:Am 22.09.2011, 19:26 Uhr, schrieb Peter Alexander <peter.alexander.au gmail.com>:Yeah, at the moment you have to work at a higher level, you can't just do a single instruction on its own.On 22/09/11 7:39 AM, Don wrote:I thought about this. Either write long functions, so you don't have to load and unload often or just make the functions assume that the parameters are in registers without explicit declaration.On 22.09.2011 05:24, a wrote:How can it compile into a single shufps? x and y would need to already be in vector registers, and unless I've missed something, they won't be. You'll need instructions for loading into registers (using the slow movups because 16-byte alignment isn't guaranteed) then do the shufps, then load back out again. This is too slow for performance critical code. Being stored in XMM registers from creation, passed and returned in XMM registers to/from functions is a key requirement for this sort of code. If you have to keep loading in and out of memory then you lose all performance.How would one do something like this without intrinsics (the code is c++ using gcc vector extensions):[snip] At present, you can't do it without ultimately resorting to inline asm. But, what we've done is to move SIMD into the machine model: the D machine model assumes that float[4] + float[4] is a more efficient operation than a loop. Currently, only arithmetic operations are implemented, and on DMD at least, they're still not proper intrinsics. So in the long term it'll be possible to do it directly, but not yet. At various times, several of us have implemented 'swizzle' using CTFE, giving you a syntax like: float[4] x, y; x[] = y[].swizzle!"cdcd"(); // x[0]=y[2], x[1]=y[3], x[2]=y[2], x[3]=y[3] which compiles to a single shufps instruction.
Sep 23 2011
Don:Yeah, at the moment you have to work at a higher level, you can't just do a single instruction on its own.Is it possible to solve some of those problems adding something like this to D/DMD: http://www.dsource.org/projects/ldc/wiki/InlineAsmExpressions And then, what changes/work is needed to allow inlining of some functions that contain asm? I mean something like this allow_inline? http://www.dsource.org/projects/ldc/wiki/Docs#allow_inline (I have asked similar questions four times in the last two years, with no answers or comments.) Bye, bearophile
Sep 24 2011
Am 22.09.2011 02:38, schrieb Walter Bright:
>> How can I do this in a nice way in D? I'm long sick of writing unsightly vector classes in C++, but fortunately using vendor specific compiler intrinsics usually leads to decent code generation. I can currently imagine an equally ugly (possibly worse) hardware vector library in D, if it's even possible. But perhaps I've missed something here?
> Your C++ vector code should be amenable to translation to D, so that effort of yours isn't lost, except that it'd have to be in inline asm rather than intrinsics.

I recently tried that, and I couldn't do it because D has no way of aligning structs on the stack. Manually allocating the necessary aligned memory is also not always possible because it can not be done for compiler temporary variables:

    vec4 v1 = func1();
    vec4 v2 = func2();
    vec4 result = (v1 + v2) * 0.5f;

Even if I manually allocate v1, v2 and result, the temporary variable that the compiler uses to compute the expression might be unaligned. That is a total killer for SSE optimizations because you can not hide them away.

Does DMC++ have __declspec(align(16)) support?

--
Kind Regards
Benjamin Thaut
Sep 21 2011
On 9/21/2011 10:56 PM, Benjamin Thaut wrote:Even if I manually allocate v1,v2 and result, the temporary variable that the compiler uses to compute the expression might be unaligned. That is a total killer for SSE optimizations because you can not hide them away. Does DMC++ have __declspec(align(16)) support?No, but 64 bit DMD aligns the stack on 16 byte boundaries.
Sep 22 2011
== Auszug aus Walter Bright (newshound2 digitalmars.com)'s ArtikelOn 9/21/2011 10:56 PM, Benjamin Thaut wrote:Unfortunately there is no 64 bit dmd on windows.Even if I manually allocate v1,v2 and result, the temporary variable that the compiler uses to compute the expression might be unaligned. That is a total killer for SSE optimizations because you can not hide them away. Does DMC++ have __declspec(align(16)) support?No, but 64 bit DMD aligns the stack on 16 byte boundaries.
Sep 22 2011
On 22/09/11 1:38 AM, Walter Bright wrote:D doesn't have __restrict. I'm going to argue that it is unnecessary. AFAIK, __restrict is most used in writing vector operations. D, on the other hand, has a dedicated vector operation syntax: a[] += b[] * c; where a[] and b[] are required to not be overlapping, hence enabling parallelization of the operation.It's used for vector stuff, but I wouldn't say mostly. Just about any performance intensive piece of code involving pointers can benefit from __restrict. I use it in a VM for example.I don't see how this would be possible without intrinsics, or at least some form of language extension. Would DMD just *always* put float[4] in XMM registers (assuming they are available)? That doesn't seem like a good idea if you don't want to use it as a vector. BTW, if you want to get a good idea of how game programmers use vector intrinsics on current hardware, there is a good blog post about it here: http://altdevblogaday.com/2011/01/31/vectiquette/As an extension from that, why is there no hardware vector support in the language? Surely a primitive vector4 type would be a sensible thing to have?The language supports it now (see the aforementioned vector syntax), it's just that the vector code gen isn't done (currently it is just implemented using loops).
Sep 22 2011
== Quote from Walter Bright (newshound2 digitalmars.com)'s article
> D doesn't have __restrict. I'm going to argue that it is unnecessary. AFAIK, __restrict is most used in writing vector operations. D, on the other hand, has a dedicated vector operation syntax: a[] += b[] * c; where a[] and b[] are required to not be overlapping, hence enabling parallelization of the operation.

Use of __restrict is certainly not limited to your example; it's applicable basically anywhere that a pointer is dereferenced on either side of a write through any other pointer, or across a function call (since it could potentially do anything): the resident value from the previous dereference is invalidated and must be reloaded needlessly unless the pointer is explicitly marked restrict. http://cellperformance.beyond3d.com/articles/2006/05/demystifying-the-restrict-keyword.html

For RISC architectures in particular, __restrict is mandatory when optimising certain hot functions without making a mess of your code (declaring stack locals all over the place), and I think I've run into cases where even that's not enough.

> D does have some intrinsics, like sin() and cos(). They tend to get added on a strictly as-needed basis, not a speculative one. D has no current intention to replace the inline assembler with intrinsics. As for custom intrinsics, Don Clugston wrote an amazing piece of demonstration D code a while back that would take a string representing a floating point expression, and would literally compile it (using Compile Time Function Execution) and produce a string literal of inline asm functions, which were then compiled by the inline assembler. So yes, it is entirely possible and practical for end users to write custom intrinsics.

I hadn't thought of that using compile-time functions, that's really nice. I'm not sure if that'll be enough to generate good code in all cases, but I'll do some experiments and see where it goes. 
The main problem with writing (intelligently generated) inline asm vs using intrinsics, is in the context of the C (or D) source code, you don't have enough context to know about the state of the register assignment, and producing the appropriate loads/stores. Also, the opcodes selected to perform the operation may change with context. (again, specific examples are hard to fabricate, but I've had them consistently pop up over the years) Also, I think someone else said that you couldn't inline functions with inline asm? Is that correct? If so, I assume that's intended to be fixed?Are you referring to the comment about special casing a float[4]? I can see why one might reach for that as a solution, but it sounds like a really bad idea to me...As an extension from that, why is there no hardware vector support in the language? Surely a primitive vector4 type would be a sensible thing to have?The language supports it now (see the aforementioned vector syntax), it's just that the vector code gen isn't done (currently it is just implemented using loops).Do you mean that like a memcpy to the stack, or somehow intuitively using the hardware vector registers to pass arguments to the function properly?Is it possible in D currently to pass vectors to functions by value in registers? Without an intrinsic vector type, it would seem impossible.Vectors (statically dimensioned arrays) are currently passed by value (unlike C or C++).But sadly, in that case, it wouldn't work. Without an intrinsic hardware vector type, there's no way to pass vectors to functions in registers, and also, using explicit asm, you tend to end up with endless unnecessary loads and stores, and potentially a lot of redundant shuffling/permutation. This will differ radically between architectures too. I think I read in another post too that functions containing inline asm will not be inlined? How does the D compiler go at optimising code around inline asm blocks? 
Most compilers have a lot of trouble optimising around inline asm blocks, and many don't even attempt to do so... How does GDC compare to DMD? Does it do a good job? I really need to take the weekend and do a lot of experiments I think.How can I do this in a nice way in D? I'm long sick of writing unsightly vector classes in C++, but fortunately using vendor specific compiler intrinsics usually leads to decent code generation. I can currently imagine an equally ugly (possibly worse) hardware vector library in D, if it's even possible. But perhaps I've missed something here?Your C++ vector code should be amenable to translation to D, so that effort of yours isn't lost, except that it'd have to be in inline asm rather than intrinsics.
Sep 23 2011
== Quote from Manu Evans (turkeyman gmail.com)'s article
>> Your C++ vector code should be amenable to translation to D, so that effort of yours isn't lost, except that it'd have to be in inline asm rather than intrinsics.
> But sadly, in that case, it wouldn't work. Without an intrinsic hardware vector type, there's no way to pass vectors to functions in registers, and also, using explicit asm, you tend to end up with endless unnecessary loads and stores, and potentially a lot of redundant shuffling/permutation. This will differ radically between architectures too. I think I read in another post too that functions containing inline asm will not be inlined? How does the D compiler go at optimising code around inline asm blocks? Most compilers have a lot of trouble optimising around inline asm blocks, and many don't even attempt to do so... How does GDC compare to DMD? Does it do a good job? I really need to take the weekend and do a lot of experiments I think.

GDC is just the same as DMD (same runtime library implementation for vector array operations). You can define vector types in the language through use of GCC's attribute though (is a pragma in GDC), then use a union to interface between it and the corresponding static array. It's deliberately UGLY and PRONE to you hitting lots of brick walls if you don't handle them in a very specific way though. 
:~)

Stock example:

    pragma(attribute, vector_size()) typedef float __v4sf_t;

    union __v4sf
    {
        float[4] f;
        __v4sf_t v;
    }

    __v4sf a = {[1,2,3,4]}, b = {[1,2,3,4]}, c;
    c.v = a.v + b.v;
    assert(c.f == [2,4,6,8]);

The assignment compiles down to ~5 instructions:

    movaps -0x88(%ebp),%xmm1
    movaps -0x78(%ebp),%xmm0
    addps  %xmm1,%xmm0
    movaps %xmm0,-0x68(%ebp)
    flds   -0x68(%ebp)

And is far quicker than c[] = a[] + b[] due to it being inlined, and not an external library call.

Regards
Iain
Sep 24 2011
On 24 September 2011 15:37, Iain Buclaw <ibuclaw ubuntu.com> wrote:
> GDC is just the same as DMD (same runtime library implementation for vector array operations). You can define vector types in the language through use of GCC's attribute though (is a pragma in GDC), then use a union to interface between it and the corresponding static array. It's deliberately UGLY and PRONE to you hitting lots of brick walls if you don't handle them in a very specific way though. :~)
> [snip example]

Nice!

Is there an IRC channel, or anywhere for realtime D discussion? I'm interested in trying to build some GDC cross compilers, and perhaps contributing to the standard library on a few embedded systems, but I have a lot of little questions and general things that don't suit a mailing list... Perhaps some IM? It seems to me that you are the authority on GDC implementation and support...
Sep 24 2011
On Sat, 24 Sep 2011 16:50:39 +0300, Manu <turkeyman gmail.com> wrote:
> Nice! Is there an IRC channel, or anywhere for realtime D discussion? I'm interested in trying to build some GDC cross compilers, and perhaps contributing to the standard library on a few embedded systems, but I have a lot of little questions and general things that don't suit a mailing list... Perhaps some IM? It seems to me that you are the authority on GDC implementation and support...

We all know where that leads: first it's IM, then a phone number, and finally ...
Sep 24 2011
On 2011-09-24 16:50:39 +0300, Manu said:
> Is there an IRC channel, or anywhere for realtime D discussion?

There is a #d channel for general D discussions and #d.gdc for GDC related topics on irc.freenode.org.
Sep 24 2011
== Quote from Manu (turkeyman gmail.com)'s article
> Hello D community.
>
> I've been reading a lot about D lately. I have known it existed for ages, but for some reason never even took a moment to look into it. The more I looked into it, the more I realise, this is the language I want. C(/C++) has been ruined, far beyond salvation. D seems to be the reboot that it desperately needs.
>
> Anyway, I work in the games industry, 10 years in cross platform console games at major studios. Sadly, I don't think Microsoft, Sony, Nintendo, Apple, Google (...maybe google) will support D any time soon, but I've started some after-hours game projects to test D in some real gamedev environments. So far I have these (critical) questions.
>
> Pointer aliasing... C implementations use a non-standard __restrict keyword to state that a given pointer will not be aliased by any other pointer. This is critical in some pieces of code to eliminate redundant loads and stores, particularly important on RISC architectures like PPC. How does D address pointer aliasing? I can't imagine the compiler has any way to detect that pointer aliasing is not possible in certain cases; many cases are just far too complicated. Is there a keyword? Or plans? This is critical for realtime performance.
>
> C implementations often use compiler intrinsics to implement architecture provided functionality rather than inline asm; the reason is that intrinsics allow the compiler to generate better code with knowledge of the context. Inline asm can't really be transformed appropriately to suit the context in some situations, whereas intrinsics operate differently, and run vendor specific logic to produce the code more intelligently. How does D address this? What options/possibilities are available to the language? Hooks for vendors to implement intrinsics for custom hardware?

The DMD compiler has some basic intrinsics; other compilers build upon this using their own backends.
ie: GCC has hundreds of builtins, including some target builtins where intrinsic types are mappable to D types (__float80 -> real).

> Is the D assembler a macro assembler? (ie, assigns registers automatically and manages loads/stores intelligently?) I haven't seen any non-x86 examples of the D assembler, and I think it's fair to say that x86 is the single most unnecessary architecture to write inline assembly for. Are there PowerPC or ARM examples anywhere?
>
> As an extension from that, why is there no hardware vector support in the language? Surely a primitive vector4 type would be a sensible thing to have? Is it possible in D currently to pass vectors to functions by value in registers? Without an intrinsic vector type, it would seem impossible. In addition to that, writing a custom Vector4 class to make use of VMX, SSE, ARM VFP, PSP VFPU, MIPS 'Vector Units', SH4 DR regs, etc, wrapping functions around inline asm blocks is always clumsy and far from optimal. The compiler (code generator and probably the optimiser) needs to understand the concepts of vectors to make good use of the hardware. How can I do this in a nice way in D? I'm long sick of writing unsightly vector classes in C++, but fortunately using vendor specific compiler intrinsics usually leads to decent code generation. I can currently imagine an equally ugly (possibly worse) hardware vector library in D, if it's even possible. But perhaps I've missed something here?

I would imagine it should now be possible to use GCC vector builtins with the GDC compiler. That is, given that I manage to get round to turning these routines on. :~)

> I'd love to try out D on some console systems. Fortunately there are some great home-brew scenes available for a bunch of slightly older consoles; PSP/PS2 (MIPS), XBox1 (embedded x86), GameCube/Wii (PPC), Dreamcast (SH4). They all have GCC compilers maintained by the community. How difficult will it be to make GDC work with those toolchains?
> Sadly I know nothing about configuring GCC, so I can't really help here. What about Android (or iPhone, but apple's 'x-code policy' prevents that)? I'd REALLY love to write an android project in D... the toolchain is GCC; I see no reason why it shouldn't be possible to write an android app if an appropriate toolchain was available?
>
> Sorry it's a bit long, thanks for reading this far! I'm looking forward to a brighter future writing lots of D code :P But I need to know basically all these questions are addressed before I could consider it for serious commercial game dev.

Someone has recently confirmed D working just fine on the Alpha platform. For D2, your biggest showstopper is the runtime library. There are many gaps to fill to port druntime to your preferred architecture.

Regards
Sep 21 2011
Manu Wrote:
> I'd love to try out D on some console systems. Fortunately there are some great home-brew scenes available for a bunch of slightly older consoles; PSP/PS2 (MIPS), XBox1 (embedded x86), GameCube/Wii (PPC), Dreamcast (SH4). They all have GCC compilers maintained by the community. How difficult will it be to make GDC work with those toolchains? Sadly I know nothing about configuring GCC, so I can't really help here.

http://pspemu.soywiz.com/2011/07/fourth-release-d-pspemu-r301.html

Maybe this man can be of some help to you.
Sep 22 2011