
digitalmars.D - __restrict, architecture intrinsics vs asm, consoles, and other stuff

reply Manu <turkeyman gmail.com> writes:
Hello D community.

I've been reading a lot about D lately. I have known it existed for
ages, but for some reason never even took a moment to look into it.
The more I looked into it, the more I realised this is the language
I want. C(/C++) has been ruined, far beyond salvation. D seems to be
the reboot that it desperately needs.

Anyway, I work in the games industry, 10 years in cross platform
console games at major studios. Sadly, I don't think Microsoft,
Sony, Nintendo, Apple, Google (...maybe google) will support D any
time soon, but I've started some after-hours game projects to test D
in some real gamedev environments.

So far I have these (critical) questions.

Pointer aliasing... C implementations use a non-standard __restrict
keyword to state that a given pointer will not be aliased by any
other pointer. This is critical in some pieces of code to eliminate
redundant loads and stores, particularly important on RISC
architectures like PPC.
How does D address pointer aliasing? I can't imagine the compiler
has any way to detect that pointer aliasing is not possible in
certain cases, many cases are just far too complicated. Is there a
keyword? Or plans? This is critical for realtime performance.
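For reference, the C/C++ idiom being described looks roughly like this (a minimal sketch; the function name is illustrative):

```cpp
#include <cstddef>

// __restrict promises the compiler that 'dst' and 'src' never alias,
// so it may keep loaded values in registers instead of reloading them
// after every store to dst — exactly the redundant loads/stores the
// post is talking about.
void scale_add(float* __restrict dst, const float* __restrict src,
               float k, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        dst[i] += k * src[i];
}
```

Without the qualifier, the compiler must assume each store through `dst` could have modified `*src` and reload accordingly.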

C implementations often use compiler intrinsics to implement
architecture provided functionality rather than inline asm, the
reason is that the intrinsics allow the compiler to generate better
code with knowledge of the context. Inline asm can't really be
transformed appropriately to suit the context in some situations,
whereas intrinsics operate differently, and run vendor specific
logic to produce the code more intelligently.
How does D address this? What options/possibilities are available to
the language? Hooks for vendors to implement intrinsics for custom
hardware?
Is the D assembler a macro assembler? (ie, assigns registers
automatically and manages loads/stores intelligently?) I haven't seen
any non-x86 examples of the D assembler, and I think it's fair to
say that x86 is the single most unnecessary architecture to write
inline assembly that exists. Are there PowerPC or ARM examples
anywhere?

As an extension from that, why is there no hardware vector support
in the language? Surely a primitive vector4 type would be a sensible
thing to have?
Is it possible in D currently to pass vectors to functions by value
in registers? Without an intrinsic vector type, it would seem
impossible.
In addition to that, writing a custom Vector4 class to make use of
VMX, SSE, ARM VFP, PSP VFPU, MIPS 'Vector Units', SH4 DR regs, etc,
wrapping functions around inline asm blocks is always clumsy and far
from optimal. The compiler (code generator and probably the
optimiser) needs to understand the concepts of vectors to make good
use of the hardware.
How can I do this in a nice way in D? I'm long sick of writing
unsightly vector classes in C++, but fortunately using vendor
specific compiler intrinsics usually leads to decent code
generation. I can currently imagine an equally ugly (possibly worse)
hardware vector library in D, if it's even possible. But perhaps
I've missed something here?

I'd love to try out D on some console systems. Fortunately there are
some great home-brew scenes available for a bunch of slightly older
consoles; PSP/PS2 (MIPS), XBox1 (embedded x86), GameCube/Wii (PPC),
Dreamcast (SH4). They all have GCC compilers maintained by the
community. How difficult will it be to make GDC work with those
toolchains? Sadly I know nothing about configuring GCC, so I can't
really help here.

What about Android (or iPhone, but Apple's 'x-code policy' prevents
that)? I'd REALLY love to write an android project in D... the
toolchain is GCC, I see no reason why it shouldn't be possible to
write an android app if an appropriate toolchain was available?

Sorry it's a bit long, thanks for reading this far!
I'm looking forward to a brighter future writing lots of D code :P
But I need to know basically all these questions are addressed
before I could consider it for serious commercial game dev.
Sep 21 2011
next sibling parent Trass3r <un known.com> writes:
 I haven't seen any non-x86 examples of the D assembler, and I think it's  
 fair to
 say that x86 is the single most unnecessary architecture to write
 inline assembly that exists. Are there PowerPC or ARM examples
 anywhere?
Well, DMD only supports x86 (including inline asm), so that's the only
thing that's tested. You need to try LDC or GDC for most of the things
you request.

http://dsource.org/projects/ldc/wiki/InlineAsmExpressions
https://bitbucket.org/goshawk/gdc/wiki/UserDocumentation#!extended-assembler

Some guys have already managed to build cross-compilers for ARM and run
some basic code on e.g. the Nintendo DS. For anything serious you would
need to make druntime work, though. It's just that nobody has done the
dirty work yet.
Sep 21 2011
prev sibling next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 9/21/2011 3:55 PM, Manu wrote:
 Pointer aliasing... C implementations use a non-standard __restrict
 keyword to state that a given pointer will not be aliased by any
 other pointer. This is critical in some pieces of code to eliminate
 redundant loads and stores, particularly important on RISC
 architectures like PPC.
 How does D address pointer aliasing? I can't imagine the compiler
 has any way to detect that pointer aliasing is not possible in
 certain cases, many cases are just far too complicated. Is there a
 keyword? Or plans? This is critical for realtime performance.
D doesn't have __restrict. I'm going to argue that it is unnecessary.
AFAIK, __restrict is mostly used in writing vector operations. D, on
the other hand, has a dedicated vector operation syntax:

    a[] += b[] * c;

where a[] and b[] are required not to overlap, hence enabling
parallelization of the operation.
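Semantically, the D array operation above corresponds to this scalar loop (a C++ sketch; the no-overlap requirement is what lets a compiler vectorize it without aliasing checks):

```cpp
#include <cstddef>

// Element-wise meaning of D's 'a[] += b[] * c;'. Because D requires the
// destination and source slices not to overlap, a compiler may turn
// this loop into SIMD code without inserting a runtime aliasing check.
void array_mul_add(float* a, const float* b, float c, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        a[i] += b[i] * c;
}
```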
 C implementations often use compiler intrinsics to implement
 architecture provided functionality rather than inline asm, the
 reason is that the intrinsics allow the compiler to generate better
 code with knowledge of the context. Inline asm can't really be
 transformed appropriately to suit the context in some situations,
 whereas intrinsics operate differently, and run vendor specific
 logic to produce the code more intelligently.
 How does D address this? What options/possibilities are available to
 the language? Hooks for vendors to implement intrinsics for custom
 hardware?
D does have some intrinsics, like sin() and cos(). They tend to get
added on a strictly as-needed basis, not a speculative one. D has no
current intention to replace the inline assembler with intrinsics.

As for custom intrinsics, Don Clugston wrote an amazing piece of
demonstration D code a while back that would take a string representing
a floating point expression and literally compile it (using Compile
Time Function Execution), producing a string literal of inline asm
functions which were then compiled by the inline assembler. So yes, it
is entirely possible and practical for end users to write custom
intrinsics.
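For comparison, the C/C++ intrinsic style the original poster refers to looks like this (SSE, x86 only; a minimal sketch):

```cpp
#include <xmmintrin.h>  // SSE intrinsics (x86/x86-64 only)

// Unlike an opaque inline-asm block, the compiler understands
// _mm_add_ps: it can schedule the instruction, reorder it with
// surrounding code, and allocate registers around it — the "knowledge
// of the context" the post describes.
__m128 add4(__m128 a, __m128 b) {
    return _mm_add_ps(a, b);
}
```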
 Is the D assembler a macro assembler?
No. It's what-you-write-is-what-you-get.
 (ie, assigns registers
 automatically and manage loads/stores intelligently?)
No. It's intended to be a low level assembler for those who want to precisely control things.
 I haven't seen
 any non-x86 examples of the D assembler, and I think it's fair to
 say that x86 is the single most unnecessary architecture to write
 inline assembly that exists.
I enjoy writing x86 inline assembler :-)
 Are there PowerPC or ARM examples anywhere?
The intention is for other CPU targets to employ the syntax used in their respective CPU manual datasheets.
 As an extension from that, why is there no hardware vector support
 in the language? Surely a primitive vector4 type would be a sensible
 thing to have?
The language supports it now (see the aforementioned vector syntax);
it's just that the vector code gen isn't done (currently it is
implemented using loops).
 Is it possible in D currently to pass vectors to functions by value
 in registers? Without an intrinsic vector type, it would seem
 impossible.
Vectors (statically dimensioned arrays) are currently passed by value (unlike C or C++).
 In addition to that, writing a custom Vector4 class to make use of
 VMX, SSE, ARM VFP, PSP VFPU, MIPS 'Vector Units', SH4 DR regs, etc,
 wrapping functions around inline asm blocks is always clumsy and far
 from optimal. The compiler (code generator and probably the
 optimiser) needs to understand the concepts of vectors to make good
 use of the hardware.
Yes, I agree.
 How can I do this in a nice way in D? I'm long sick of writing
 unsightly vector classes in C++, but fortunately using vendor
 specific compiler intrinsics usually leads to decent code
 generation. I can currently imagine an equally ugly (possibly worse)
 hardware vector library in D, if it's even possible. But perhaps
 I've missed something here?
Your C++ vector code should be amenable to translation to D, so that effort of yours isn't lost, except that it'd have to be in inline asm rather than intrinsics.
 I'd love to try out D on some console systems. Fortunately there are
 some great home-brew scenes available for a bunch of slightly older
 consoles; PSP/PS2 (MIPS), XBox1 (embedded x86), GameCube/Wii (PPC),
 Dreamcast (SH4). They all have GCC compilers maintained by the
 community. How difficult will it be to make GDC work with those
 toolchains? Sadly I know nothing about configuring GCC, so sadly I
 can't really help here.
I don't know much about GDC's capabilities.
Sep 21 2011
next sibling parent reply a <a a.com> writes:
How would one do something like this without intrinsics (the code is
C++ using GCC vector extensions):

template <class V>
struct Fft 
{
  typedef typename V::T T;
  typedef typename V::vec vec;
  static const int VecSize = V::Size;

...

  template <int Interleaved>
  static NOINLINE void fft_pass_interleaved(
    vec * __restrict pr, 
    vec *__restrict pi, 
    vec *__restrict pend, 
    T *__restrict table)  
  {
    for(; pr < pend; pr += 2, pi += 2, table += 2*Interleaved)
    {
      vec tmpr, ti, ur, ui, wr, wi;
      V::template expandComplexArrayToRealImagVec<Interleaved>(table, wr, wi);
      V::template deinterleave<Interleaved>(pr[0],pr[1], ur, tmpr);
      V::template deinterleave<Interleaved>(pi[0],pi[1], ui, ti);
      vec tr = tmpr*wr - ti*wi;
      ti = tmpr*wi + ti*wr;
      V::template interleave<Interleaved>(ur + tr, ur - tr, pr[0], pr[1]);
      V::template interleave<Interleaved>(ui + ti, ui - ti, pi[0], pi[1]);
    }
  }

...

Here vector elements need to be shuffled around when they are loaded
and stored. This is platform dependent and cannot be expressed through
vector operations (or gcc vector extensions). Here I abstracted the
platform dependent functionality in member functions of V, which are
implemented using intrinsics. The assembly generated for SSE single
precision and Interleaved=4 is:

 0000000000000000 <_ZN3FftI6SSEVecIfEE20fft_pass_interleavedILi4EEEvPDv4_fS5_S5_Pf>:
   0:	48 39 d7             	cmp    %rdx,%rdi
   3:	0f 83 9c 00 00 00    	jae    a5
<_ZN3FftI6SSEVecIfEE20fft_pass_interleavedILi4EEEvPDv4_fS5_S5_Pf+0xa5>
   9:	0f 1f 80 00 00 00 00 	nopl   0x0(%rax)
  10:	0f 28 19             	movaps (%rcx),%xmm3
  13:	0f 28 41 10          	movaps 0x10(%rcx),%xmm0
  17:	48 83 c1 20          	add    $0x20,%rcx
  1b:	0f 28 f3             	movaps %xmm3,%xmm6
  1e:	0f 28 2f             	movaps (%rdi),%xmm5
  21:	0f c6 d8 dd          	shufps $0xdd,%xmm0,%xmm3
  25:	0f c6 f0 88          	shufps $0x88,%xmm0,%xmm6
  29:	0f 28 e5             	movaps %xmm5,%xmm4
  2c:	0f 28 47 10          	movaps 0x10(%rdi),%xmm0
  30:	0f 28 4e 10          	movaps 0x10(%rsi),%xmm1
  34:	0f c6 e0 88          	shufps $0x88,%xmm0,%xmm4
  38:	0f c6 e8 dd          	shufps $0xdd,%xmm0,%xmm5
  3c:	0f 28 06             	movaps (%rsi),%xmm0
  3f:	0f 28 d0             	movaps %xmm0,%xmm2
  42:	0f c6 c1 dd          	shufps $0xdd,%xmm1,%xmm0
  46:	0f c6 d1 88          	shufps $0x88,%xmm1,%xmm2
  4a:	0f 28 cd             	movaps %xmm5,%xmm1
  4d:	0f 28 f8             	movaps %xmm0,%xmm7
  50:	0f 59 ce             	mulps  %xmm6,%xmm1
  53:	0f 59 fb             	mulps  %xmm3,%xmm7
  56:	0f 59 c6             	mulps  %xmm6,%xmm0
  59:	0f 59 dd             	mulps  %xmm5,%xmm3
  5c:	0f 5c cf             	subps  %xmm7,%xmm1
  5f:	0f 58 c3             	addps  %xmm3,%xmm0
  62:	0f 28 dc             	movaps %xmm4,%xmm3
  65:	0f 5c d9             	subps  %xmm1,%xmm3
  68:	0f 58 cc             	addps  %xmm4,%xmm1
  6b:	0f 28 e1             	movaps %xmm1,%xmm4
  6e:	0f 15 cb             	unpckhps %xmm3,%xmm1
  71:	0f 14 e3             	unpcklps %xmm3,%xmm4
  74:	0f 29 4f 10          	movaps %xmm1,0x10(%rdi)
  78:	0f 28 ca             	movaps %xmm2,%xmm1
  7b:	0f 29 27             	movaps %xmm4,(%rdi)
  7e:	0f 5c c8             	subps  %xmm0,%xmm1
  81:	48 83 c7 20          	add    $0x20,%rdi
  85:	0f 58 c2             	addps  %xmm2,%xmm0
  88:	0f 28 d0             	movaps %xmm0,%xmm2
  8b:	0f 15 c1             	unpckhps %xmm1,%xmm0
  8e:	0f 14 d1             	unpcklps %xmm1,%xmm2
  91:	0f 29 46 10          	movaps %xmm0,0x10(%rsi)
  95:	0f 29 16             	movaps %xmm2,(%rsi)
  98:	48 83 c6 20          	add    $0x20,%rsi
  9c:	48 39 fa             	cmp    %rdi,%rdx
  9f:	0f 87 6b ff ff ff    	ja     10
<_ZN3FftI6SSEVecIfEE20fft_pass_interleavedILi4EEEvPDv4_fS5_S5_Pf+0x10>
  a5:	f3 c3                	repz retq 

Would something like that be possible with D inline assembly or would there be 
additional loads and stores for each call of V::interleave, V::deinterleave 
and V::expandComplexArrayToRealImagVec?
Sep 21 2011
parent reply Don <nospam nospam.com> writes:
On 22.09.2011 05:24, a wrote:
 How would one do something like this without intrinsics (the code is c++ using
 gcc vector extensions):
[snip] At present, you can't do it without ultimately resorting to
inline asm. But, what we've done is to move SIMD into the machine
model: the D machine model assumes that float[4] + float[4] is a more
efficient operation than a loop. Currently, only arithmetic operations
are implemented, and on DMD at least, they're still not proper
intrinsics. So in the long term it'll be possible to do it directly,
but not yet.

At various times, several of us have implemented 'swizzle' using CTFE,
giving you a syntax like:

    float[4] x, y;
    x[] = y[].swizzle!"cdcd"();
    // x[0]=y[2], x[1]=y[3], x[2]=y[2], x[3]=y[3]

which compiles to a single shufps instruction. That "cdcd" string is
really a tiny DSL: the language consists of four characters, each of
which is a, b, c, or d.

A couple of years ago I made a DSL compiler for BLAS1 operations. It
was capable of doing some pretty wild stuff, even then. (The DSL
looked like normal D code.) But the compiler has improved enormously
since that time. It's now perfectly feasible to make a DSL for the
SIMD operations you need.

The really nice thing about this, compared to normal asm, is that you
have access to the compiler's symbol table. This lets you add
compile-time error messages, for example.

A funny thing about this, which I found after working on the DMD
back-end, is that it is MUCH easier to write an optimizer/code
generator in a DSL in D than in a compiler back-end.
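Don's swizzle DSL is D-specific CTFE, but its semantics can be sketched in C++ (names here are illustrative; a real implementation would lower the pattern to a single shufps rather than scalar moves):

```cpp
#include <array>

// A portable model of the 'swizzle' semantics described above: each
// pattern character selects a source lane (a=0, b=1, c=2, d=3). This
// scalar version only captures the meaning; Don's D version compiles
// the pattern string into one shuffle instruction at compile time.
template <char C0, char C1, char C2, char C3>
std::array<float, 4> swizzle(const std::array<float, 4>& v) {
    return { v[C0 - 'a'], v[C1 - 'a'], v[C2 - 'a'], v[C3 - 'a'] };
}
```

Here `swizzle<'c','d','c','d'>(y)` reproduces the `"cdcd"` example from the post.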
Sep 21 2011
next sibling parent reply a <a a.com> writes:
 which compiles to a single shufps instruction.
Doesn't it often require additional needless movaps instructions? For
example, the following:

asm
{
    movaps XMM0, a;
    movaps XMM1, b;
    addps  XMM0, XMM1;
    movaps a, XMM0;
}
asm
{
    movaps XMM0, a;
    movaps XMM1, b;
    addps  XMM0, XMM1;
    movaps a, XMM0;
}

compiles to

    movaps -0x48(%rsp),%xmm0
    movaps -0x38(%rsp),%xmm1
    addps  %xmm1,%xmm0
    movaps %xmm0,-0x48(%rsp)
    movaps -0x48(%rsp),%xmm0
    movaps -0x38(%rsp),%xmm1
    addps  %xmm1,%xmm0
    movaps %xmm0,-0x48(%rsp)

Is it possible to avoid needless loading and storing of values when
calling multiple functions that use asm blocks? It also seems that the
compiler doesn't inline functions containing asm.
Sep 22 2011
parent Walter Bright <newshound2 digitalmars.com> writes:
On 9/22/2011 5:11 AM, a wrote:
 It also seems that the compiler
 doesn't inline functions containing asm.
That's correct, it currently does not.
Sep 22 2011
prev sibling next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 9/22/11 1:39 AM, Don wrote:
 On 22.09.2011 05:24, a wrote:
 How would one do something like this without intrinsics (the code is
 c++ using
 gcc vector extensions):
 [snip] At present, you can't do it without ultimately resorting to
 inline asm. But, what we've done is to move SIMD into the machine
 model: the D machine model assumes that float[4] + float[4] is a more
 efficient operation than a loop. Currently, only arithmetic operations
 are implemented, and on DMD at least, they're still not proper
 intrinsics. So in the long term it'll be possible to do it directly,
 but not yet.

 At various times, several of us have implemented 'swizzle' using CTFE,
 giving you a syntax like:

 float[4] x, y;
 x[] = y[].swizzle!"cdcd"();
 // x[0]=y[2], x[1]=y[3], x[2]=y[2], x[3]=y[3]

 which compiles to a single shufps instruction. That "cdcd" string is
 really a tiny DSL: the language consists of four characters, each of
 which is a, b, c, or d.
I think we should put swizzle in std.numeric once and for all. Is anyone interested in taking up that task?
 A couple of years ago I made a DSL compiler for BLAS1 operations. It was
 capable of doing some pretty wild stuff, even then. (The DSL looked like
 normal D code).
 But the compiler has improved enormously since that time. It's now
 perfectly feasible to make a DSL for the SIMD operations you need.

 The really nice thing about this, compared to normal asm, is that you
 have access to the compiler's symbol table. This lets you add
 compile-time error messages, for example.

 A funny thing about this, which I found after working on the DMD
 back-end, is that is MUCH easier to write an optimizer/code generator in
 a DSL in D, than in a compiler back-end.
A good argument for (a) moving stuff from the compiler into the
library, and (b) continuing Don's great work on making CTFE a solid
proposition.

Andrei
Sep 22 2011
next sibling parent reply so <so so.so> writes:
On Thu, 22 Sep 2011 17:07:25 +0300, Andrei Alexandrescu  
<SeeWebsiteForEmail erdani.org> wrote:

 I think we should put swizzle in std.numeric once and for all. Is anyone  
 interested in taking up that task?
You mean some helper functions to be used in user structures? Because I
don't know of any structure in std.numeric that could use it.

We first need to improve opDispatch. Currently I think no one knows how
it works or how it was intended to work. It refuses to except a few
things which I think it should. For example:

struct A
{
    opDispatch(string)()
    opDispatch(string)() const
}

A a, b;
a.fun = b.run; // This should be perfectly fine.
Sep 22 2011
next sibling parent so <so so.so> writes:
On Fri, 23 Sep 2011 02:00:50 +0300, so <so so.so> wrote:

 It refuses to except a few things which i think it should.
accept...
Sep 22 2011
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 9/22/11 6:00 PM, so wrote:
 On Thu, 22 Sep 2011 17:07:25 +0300, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:

 I think we should put swizzle in std.numeric once and for all. Is
 anyone interested in taking up that task?
You mean some helper functions to be used in user structures?
I was thinking of a template that takes and returns T[n].

Andrei
Sep 22 2011
parent reply so <so so.so> writes:
On Fri, 23 Sep 2011 02:40:11 +0300, Andrei Alexandrescu  
<SeeWebsiteForEmail erdani.org> wrote:

 On 9/22/11 6:00 PM, so wrote:
 On Thu, 22 Sep 2011 17:07:25 +0300, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:

 I think we should put swizzle in std.numeric once and for all. Is
 anyone interested in taking up that task?
You mean some helper functions to be used in user structures?
I was thinking of a template that takes and return T[n]. Andrei
Something like this?
Sep 22 2011
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 9/22/11 9:11 PM, so wrote:
 On Fri, 23 Sep 2011 02:40:11 +0300, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:

 On 9/22/11 6:00 PM, so wrote:
 On Thu, 22 Sep 2011 17:07:25 +0300, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:

 I think we should put swizzle in std.numeric once and for all. Is
 anyone interested in taking up that task?
You mean some helper functions to be used in user structures?
I was thinking of a template that takes and return T[n]. Andrei
Something like this?
Looks promising, though I was hoping to not need an additional struct V. But I'm not an expert. Andrei
Sep 22 2011
parent reply so <so so.so> writes:
On Fri, 23 Sep 2011 06:44:44 +0300, Andrei Alexandrescu  
<SeeWebsiteForEmail erdani.org> wrote:

 On 9/22/11 9:11 PM, so wrote:
 On Fri, 23 Sep 2011 02:40:11 +0300, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:

 On 9/22/11 6:00 PM, so wrote:
 On Thu, 22 Sep 2011 17:07:25 +0300, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:

 I think we should put swizzle in std.numeric once and for all. Is
 anyone interested in taking up that task?
You mean some helper functions to be used in user structures?
I was thinking of a template that takes and return T[n]. Andrei
Something like this?
Looks promising, though I was hoping to not need an additional struct V. But I'm not an expert. Andrei
It was there to show how it should be used in user code, and for
testing. Swizzle is not just an rvalue operation; there is also an
lvalue part to it which plays a bit differently (hence swizzleR and
swizzleL). We could take care of it with an overload, but D doesn't act
quite like what I expected (like C++): I don't understand why it won't
differentiate "fun()" from "fun() const".
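For context, the C++ behaviour being alluded to: overloads that differ only in const-ness resolve on whether the object itself is const, which is how a read-only (rvalue-style) and mutating (lvalue-style) operation can share one name. A minimal sketch:

```cpp
// Overloads distinguished only by const: the const version is chosen
// when the object is const (read-only use), the non-const version when
// it is mutable (writable use).
struct A {
    int fun()       { return 1; }  // mutable object
    int fun() const { return 2; }  // const object
};
```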
Sep 23 2011
parent so <so so.so> writes:
On Fri, 23 Sep 2011 16:09:31 +0300, so <so so.so> wrote:

 It was there to show how it should be used in user code, and testing.
 Swizzle is not just a rvalue operation, there is also a lvalue part to  
 it which plays a bit differently (hence, swizzleR and swizzleL).
 We could take care of it with an overload but D doesn't act quite like  
 what i expected (like C++), i don't understand why it won't  
 differentiate "fun()" from "fun() const".
Sorry about the nonsense. It is now done with opDispatch (attached).

To make a generic "swizzle" function we need to introduce a few traits,
but if all you want is support for T[N], that is easy.
Sep 24 2011
prev sibling parent reply Manu Evans <turkeyman gmail.com> writes:
== Quote from Andrei Alexandrescu (SeeWebsiteForEmail erdani.org)'s article
[snip]
This sounds really dangerous to me. I really like the idea where CTFE can be used to produce some pretty powerful almost-like-intrinsics code, but applying it in this context sounds like a really bad idea. Firstly, so I'm not misunderstanding, is this suggestion building on Don's previous post saying that float[4] is somehow intercepted and special-cased by the compiler, reinterpreting as a candidate for hardware vector operations? I think that's a wrong decision in its self, and a poor foundation for this approach. Let me try and convince you that the language should have an explicit hardware vector type, and not attempt to make use of any clever language tricks... If float[4] is considered a hardware vector by the compiler, - How to I define an ACTUAL float[4]? - How can I be confident that it actually WILL be a hardware vector? Hardware vectors are NOT float[4]'s, they are a reference to an 128bit hardware register upon which various vector operations may be supported, they are probably aligned, and they are only accessible in 128bit quantities. I think they should be explicitly defined as such. They may be float4, u/int4, u/short8, u/byte16, double2... All these types are interchangeable within the one register, do you intend to special case fixed length arrays of all those types to support the hardware functionality for those? Hardware vectors are NOT floats, they can not interact with the floating point unit, dereferencing of this style 'float x = myVector[0]' is NOT supported by the hardware and it should not be exposed to the programmer as a trivial possibility. This seemingly harmless line of code will undermine the entire reason for using hardware vector hardware in the first place. Allowing easy access of individual floats within a hardware vector breaks the languages stated premise that the path of least resistance also be the 'correct' optimal choice, whereby a seemingly simple line of code may ruin the entire function. 
float[4] is not even a particularly conveniently sized vector for most inexperienced programmers, the majority will want float[3]. This is NOT a trivial map to float[4], and programmers should be well aware that there is inherent complexity in using the hardware vector architecture, and forced to think it through. Most inexperienced programmers think of results of operations like dot product and magnitude as being scalar values, but they are not, they are a scalar value repeated across all 4 components of a vector 4, and this should be explicit too. .... I know I'm nobody around here, so I can't expect to be taken too seriously, but I'm very excited about the language, so here's what I would consider instead: Add a hardware vector type, lets call it 'v128' for the exercise. It is a primitive type, aligned by definition, and does not have any members. You may use this to refer to a hardware vector registers explicitly, as vector register function arguments, or as arguments to inline asm blocks. Add some functions to the standard library (ideally implemented as compiler intrinsics) which do very specific stuff to vectors, and ideally expandable by hardware vendors or platform holders. You might want to have some classes in the standard library which wrap said v128, and expose the concept as a float4, int4, byte16, etc. These classes would provide maths operators, comparisons, initialisation and immediate assignment, and casts between various vector types. Different vector units support completely different methods of permutation. I would be very careful about adding intrinsic support into the library for generalised permutation. And if so, at least leave the capability of implementing intrinsic architecture-specific permutation but the hardware vendors/platform holders. 
At the end of the day, it is imperative that the code generator and optimiser still retain the concept of hardware vectors, and can perform appropriate load/store elimination and apply hardware-specific optimisation to operations like permutes/swizzles, component broadcasts, load immediates, etc.

....

The reason I am so adamant about this is that in almost all architectures (SSE is the most tolerant by far), using the hardware vector unit is an all-or-nothing choice. If you interact between the vector and float registers, you will almost certainly end up with slower code than if you had just used the float unit outright. Also, since people usually use hardware vectors in areas of extreme performance optimisation, it's not tolerable for the compiler to be making mistakes. As a minimum, the programmer needs to be able to explicitly address the vector registers, pass them to and from functions, and perform explicit (probably IHV-supplied) functions on them. The code generator and optimiser need all the information possible, as explicitly as possible, so IHVs can implement the best possible support for their architecture. The API should reflect this, and not allow easy access to functionality that would violate hardware support.

Ease of programming should be a SECONDARY goal, at which point something like the typed wrapper classes I described would come in, allowing maths operators, comparisons and all I mentioned above, i.e. making them look like a real mathematical type, but still keeping their distance from primitive float/int types, to discourage interaction at all costs.

I hope this doesn't sound too much like an overly long rant! :)
And hopefully I've managed to sell my point...

Don: I'd love to hear counter-arguments to justify float[4] as a reasonable solution. Currently, no matter how I slice it, I just can't see it.

Criticism welcome?

Cheers!
- Manu
Sep 23 2011
next sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Manu Evans:

I appreciate your efforts. I answer to the OP that DMD doesn't yet offer most
of the things discussed in this thread. But I think that it's better to add and
work on high-performance features when the basics of D are in better shape.
Currently there are more basic fishes to implement or debug, like tuples syntax
sugar, module system issues, const issues, inout, and so on and on (on the
other hand I agree that it's OK to discuss even now D design ideas that will
allow that future high performance).


 Hardware vectors are NOT float[4]'s, they are a reference to an
 128bit hardware register upon which various vector operations may be
 supported, they are probably aligned, and they are only accessible
 in 128bit quantities. I think they should be explicitly defined as
 such.
What do you want to do when CPUs with 256-bit registers appear? When they grow to 512 bits? To 1024? Do you want to keep adding specific types? How many things do you want to add to D in the next 15 years of CPU evolution?

Bye,
bearophile
Sep 23 2011
parent Manu Evans <turkeyman gmail.com> writes:
== Quote from bearophile (bearophileHUGS lycos.com)'s article
 Manu Evans:
 I appreciate your efforts. I answer to the OP that DMD doesn't yet
offer most of the things discussed in this thread. But I think that it's better to add and work on high-performance features when the basics of D are in better shape. Currently there are more basic fishes to implement or debug, like tuples syntax sugar, module system issues, const issues, inout, and so on and on (on the other hand I agree that it's OK to discuss even now D design ideas that will allow that future high performance).

I make the point because, while I agree the topics you mention are of greater immediate importance, the previous posts in this thread suggest there is already experimentation/implementation of these features happening in the language now, and if they are defined now, and defined incorrectly, it's always very difficult to go back on these decisions.
 Hardware vectors are NOT float[4]'s, they are a reference to an
 128bit hardware register upon which various vector operations
may be
 supported, they are probably aligned, and they are only
accessible
 in 128bit quantities. I think they should be explicitly defined
as
 such.
What do you want to do when CPU with 256 bit registers appear?
When they grow to 512 bit? To 1024? Do you want to keep adding specific types?

Yes. I don't think it's likely to progress as you suggest, though. I foresee perhaps a 4-component 64-bit-word vector (256 bits), and a hardware matrix. I can't see it being any less appropriate to implement a v256 in addition to a v128 than a long in addition to an int. A matrix is a fundamentally different concept, and surely worthy of its own type.
 How many things do you want to add to D in the next 15 years of
CPU evolution?

As many things as are universally accepted by computer hardware as normal/standard features. Hardware vectors definitely fit this bill. We've had hardware vector support in virtually every architecture for 10-15 years now, and yet there is still no language that really supports it.
Sep 23 2011
prev sibling parent Don <nospam nospam.com> writes:
On 24.09.2011 00:47, Manu Evans wrote:
 == Quote from Andrei Alexandrescu (SeeWebsiteForEmail erdani.org)'s
 article
 On 9/22/11 1:39 AM, Don wrote:
 On 22.09.2011 05:24, a wrote:
 How would one do something like this without intrinsics (the
code is
 c++ using
 gcc vector extensions):
[snip] At present, you can't do it without ultimately resorting to inline asm.
 But, what we've done is to move SIMD into the machine model: the D
 machine model assumes that float[4] + float[4] is a more efficient
 operation than a loop.
 Currently, only arithmetic operations are implemented, and on DMD at
 least, they're still not proper intrinsics. So in the long term it'll be
 possible to do it directly, but not yet.

 At various times, several of us have implemented 'swizzle' using CTFE,
 giving you a syntax like:
 
 float[4] x, y;
 x[] = y[].swizzle!"cdcd"();
 // x[0]=y[2], x[1]=y[3], x[2]=y[2], x[3]=y[3]
 
 which compiles to a single shufps instruction.
 
 That "cdcd" string is really a tiny DSL: the language consists of four
 characters, each of which is a, b, c, or d.
I think we should put swizzle in std.numeric once and for all. Is anyone interested in taking up that task?
 A couple of years ago I made a DSL compiler for BLAS1 operations. It was
 capable of doing some pretty wild stuff, even then. (The DSL looked like
 normal D code).
 But the compiler has improved enormously since that time. It's now
 perfectly feasible to make a DSL for the SIMD operations you need.
 The really nice thing about this, compared to normal asm, is that you
 have access to the compiler's symbol table. This lets you add
 compile-time error messages, for example.
 
 A funny thing about this, which I found after working on the DMD
 back-end, is that it is MUCH easier to write an optimizer/code generator
 in a DSL in D, than in a compiler back-end.
A good argument for (a) moving stuff from the compiler into the library,
 (b) continuing Don's great work on making CTFE a solid proposition.
 Andrei
This sounds really dangerous to me. I really like the idea where CTFE can be used to produce some pretty powerful almost-like-intrinsics code, but applying it in this context sounds like a really bad idea. Firstly, so I'm not misunderstanding, is this suggestion building on Don's previous post saying that float[4] is somehow intercepted and special-cased by the compiler, reinterpreting as a candidate for hardware vector operations?
No, it's completely unrelated. It has nothing in common.
 I think that's a wrong decision in its self, and a poor foundation
 for this approach.
 Let me try and convince you that the language should have an
 explicit hardware vector type, and not attempt to make use of any
 clever language tricks...

 If float[4] is considered a hardware vector by the compiler,
    - How do I define an ACTUAL float[4]?
    - How can I be confident that it actually WILL be a hardware
 vector?
float[4] is not considered to be a hardware vector. It is only passed as one. To pass it the C++ way, declare the parameter as float[], or pass it by ref. Everything after that is the responsibility of the compiler/optimizer.

A big difference compared to C++ is that, generally, it's pretty strange to pass fixed-length arrays as value parameters.

At this stage we don't have any way of forcing it to be a hardware vector. We've just introduced the parameter passing and the vector operations to make it easier for the compiler to use hardware registers. Very little else is decided at this stage.

You make some excellent points.
 Hardware vectors are NOT float[4]'s, they are a reference to an
 128bit hardware register upon which various vector operations may be
 supported, they are probably aligned, and they are only accessible
 in 128bit quantities. I think they should be explicitly defined as
 such.

 They may be float4, u/int4, u/short8, u/byte16, double2... All these
 types are interchangeable within the one register, do you intend to
 special case fixed length arrays of all those types to support the
 hardware functionality for those?

 Hardware vectors are NOT floats, they can not interact with the
 floating point unit, dereferencing of this style 'float x =
 myVector[0]' is NOT supported by the hardware and it should not be
 exposed to the programmer as a trivial possibility. This seemingly
 harmless line of code will undermine the entire reason for using
 hardware vector hardware in the first place.

 Allowing easy access of individual floats within a hardware vector breaks the
 language's stated premise that the path of least
 resistance also be the 'correct' optimal choice, whereby a seemingly
 simple line of code may ruin the entire function.

 float[4] is not even a particularly conveniently sized vector for
 most inexperienced programmers, the majority will want float[3].
 This is NOT a trivial map to float[4], and programmers should be
 well aware that there is inherent complexity in using the hardware
 vector architecture, and forced to think it through.

 Most inexperienced programmers think of results of operations like
 dot product and magnitude as being scalar values, but they are not,
 they are a scalar value repeated across all 4 components of a vector
 4, and this should be explicit too.

 ....

 I know I'm nobody around here, so I can't expect to be taken too
 seriously, but I'm very excited about the language, so here's what I
 would consider instead:

 Add a hardware vector type, let's call it 'v128' for the exercise. It
 is a primitive type, aligned by definition, and does not have any
 members.
 You may use this to refer to a hardware vector registers explicitly,
 as vector register function arguments, or as arguments to inline asm
 blocks.

 Add some functions to the standard library (ideally implemented as
 compiler intrinsics) which do very specific stuff to vectors, and
 ideally expandable by hardware vendors or platform holders.

 You might want to have some classes in the standard library which
 wrap said v128, and expose the concept as a float4, int4, byte16,
 etc. These classes would provide maths operators, comparisons,
 initialisation and immediate assignment, and casts between various
 vector types.

 Different vector units support completely different methods of
 permutation. I would be very careful about adding intrinsic support
 into the library for generalised permutation. And if so, at least
 leave the capability of implementing intrinsic architecture-specific
 permutation to the hardware vendors/platform holders.

 At the end of the day, it is imperative that the code generator and
 optimiser still retain the concept of hardware vectors, and can
 perform appropriate load/store elimination, apply hardware specific
 optimisation to operations like permutes/swizzles, component
 broadcasts, load immediates, etc.

 ....

 The reason I am so adamant about this, is in almost all
 architectures (SSE is the most tolerant by far), using the hardware
 vector unit is an all-or-nothing choice. If you interact between the
 vector and float registers, you will almost certainly result in
 slower code than if you just used the float unit outright. Also,
 since people usually use hardware vectors in areas of extreme
 performance optimisation, it's not tolerable for the compiler to be
 making mistakes. As a minimum the programmer needs to be able to
 explicitly address the vector registers, pass it to and from
 functions, and perform explicit (probably IHV supplied) functions on
 them. The code generator and optimiser needs all the information
 possible, and as explicit as possible so IHV's can implement the
 best possible support for their architecture. The API should reflect
 this, and not allow easy access to functionality that would violate
 hardware support.

 Ease of programming should be a SECONDARY goal, at which point
 something like the typed wrapper classes I described would come in,
 allowing maths operators, comparisons and all I mentioned above, ie,
 making them look like a real mathematical type, but still keeping
 their distance from primitive float/int types, to discourage
 interaction at all costs.

 I hope this doesn't sound too much like an overly long rant! :)
 And hopefully I've managed to sell my point...

 Don: I'd love to hear counter arguments to justify float[4] as a
 reasonable solution. Currently no matter how I slice it, I just
 can't see it.

 Criticism welcome?

 Cheers!
 - Manu
Sep 23 2011
prev sibling next sibling parent "Marco Leise" <Marco.Leise gmx.de> writes:
Am 22.09.2011, 08:39 Uhr, schrieb Don <nospam nospam.com>:

 On 22.09.2011 05:24, a wrote:
 How would one do something like this without intrinsics (the code is  
 c++ using
 gcc vector extensions):
[snip] At present, you can't do it without ultimately resorting to inline asm.
 But, what we've done is to move SIMD into the machine model: the D machine model assumes that float[4] + float[4] is a more efficient operation than a loop.
 Currently, only arithmetic operations are implemented, and on DMD at least, they're still not proper intrinsics. So in the long term it'll be possible to do it directly, but not yet.
 
 At various times, several of us have implemented 'swizzle' using CTFE, giving you a syntax like:
 
 float[4] x, y;
 x[] = y[].swizzle!"cdcd"();
 // x[0]=y[2], x[1]=y[3], x[2]=y[2], x[3]=y[3]
 
 which compiles to a single shufps instruction.
 
 That "cdcd" string is really a tiny DSL: the language consists of four characters, each of which is a, b, c, or d.
 
 A couple of years ago I made a DSL compiler for BLAS1 operations. It was capable of doing some pretty wild stuff, even then. (The DSL looked like normal D code).
 But the compiler has improved enormously since that time. It's now perfectly feasible to make a DSL for the SIMD operations you need.
 The really nice thing about this, compared to normal asm, is that you have access to the compiler's symbol table. This lets you add compile-time error messages, for example.
 
 A funny thing about this, which I found after working on the DMD back-end, is that it is MUCH easier to write an optimizer/code generator in a DSL in D, than in a compiler back-end.
That's a nice fresh approach to intrinsics. I bet if other languages had these CTFE capabilities, they'd probably do the same. Sure, it is ideal if the compiler works magic here, but it takes longer to implement the right code generation in the compiler than to write an isolated piece of library code; and extensions can be added by anyone, especially since there will already be some examples to look at. Thumbs up!
Sep 22 2011
prev sibling parent reply Peter Alexander <peter.alexander.au gmail.com> writes:
On 22/09/11 7:39 AM, Don wrote:
 On 22.09.2011 05:24, a wrote:
 How would one do something like this without intrinsics (the code is
 c++ using
 gcc vector extensions):
[snip] At present, you can't do it without ultimately resorting to inline asm.
 But, what we've done is to move SIMD into the machine model: the D machine model assumes that float[4] + float[4] is a more efficient operation than a loop.
 Currently, only arithmetic operations are implemented, and on DMD at least, they're still not proper intrinsics. So in the long term it'll be possible to do it directly, but not yet.
 
 At various times, several of us have implemented 'swizzle' using CTFE, giving you a syntax like:
 
 float[4] x, y;
 x[] = y[].swizzle!"cdcd"();
 // x[0]=y[2], x[1]=y[3], x[2]=y[2], x[3]=y[3]
 
 which compiles to a single shufps instruction.
How can it compile into a single shufps? x and y would need to already be in vector registers, and unless I've missed something, they won't be. You'll need instructions for loading into registers (using the slow movups, because 16-byte alignment isn't guaranteed), then do the shufps, then load back out again. This is too slow for performance-critical code.

Being stored in XMM registers from creation, and passed and returned in XMM registers to/from functions, is a key requirement for this sort of code. If you have to keep loading in and out of memory then you lose all the performance.
Sep 22 2011
parent reply "Marco Leise" <Marco.Leise gmx.de> writes:
Am 22.09.2011, 19:26 Uhr, schrieb Peter Alexander  
<peter.alexander.au gmail.com>:

 On 22/09/11 7:39 AM, Don wrote:
 On 22.09.2011 05:24, a wrote:
 How would one do something like this without intrinsics (the code is
 c++ using
 gcc vector extensions):
[snip] At present, you can't do it without ultimately resorting to inline asm.
 But, what we've done is to move SIMD into the machine model: the D machine model assumes that float[4] + float[4] is a more efficient operation than a loop.
 Currently, only arithmetic operations are implemented, and on DMD at least, they're still not proper intrinsics. So in the long term it'll be possible to do it directly, but not yet.
 
 At various times, several of us have implemented 'swizzle' using CTFE, giving you a syntax like:
 
 float[4] x, y;
 x[] = y[].swizzle!"cdcd"();
 // x[0]=y[2], x[1]=y[3], x[2]=y[2], x[3]=y[3]
 
 which compiles to a single shufps instruction.
How can it compile into a single shufps? x and y would need to already be in vector registers, and unless I've missed something, they won't be. You'll need instructions for loading into registers (using the slow movups because 16-byte alignment isn't guaranteed) then do the shufps, then load back out again. This is too slow for performance critical code. Being stored in XMM registers from creation, passed and returned in XMM registers to/from functions is a key requirement for this sort of code. If you have to keep loading in and out of memory then you lose all performance.
I thought about this. Either write long functions, so you don't have to load and unload often, or just make the functions assume that the parameters are in registers without explicit declaration.
Sep 22 2011
parent reply Don <nospam nospam.com> writes:
On 22.09.2011 20:19, Marco Leise wrote:
 Am 22.09.2011, 19:26 Uhr, schrieb Peter Alexander
 <peter.alexander.au gmail.com>:

 On 22/09/11 7:39 AM, Don wrote:
 On 22.09.2011 05:24, a wrote:
 How would one do something like this without intrinsics (the code is
 c++ using
 gcc vector extensions):
[snip] At present, you can't do it without ultimately resorting to inline asm.
 But, what we've done is to move SIMD into the machine model: the D machine model assumes that float[4] + float[4] is a more efficient operation than a loop.
 Currently, only arithmetic operations are implemented, and on DMD at least, they're still not proper intrinsics. So in the long term it'll be possible to do it directly, but not yet.
 
 At various times, several of us have implemented 'swizzle' using CTFE, giving you a syntax like:
 
 float[4] x, y;
 x[] = y[].swizzle!"cdcd"();
 // x[0]=y[2], x[1]=y[3], x[2]=y[2], x[3]=y[3]
 
 which compiles to a single shufps instruction.
How can it compile into a single shufps? x and y would need to already be in vector registers, and unless I've missed something, they won't be. You'll need instructions for loading into registers (using the slow movups because 16-byte alignment isn't guaranteed) then do the shufps, then load back out again. This is too slow for performance critical code. Being stored in XMM registers from creation, passed and returned in XMM registers to/from functions is a key requirement for this sort of code. If you have to keep loading in and out of memory then you lose all performance.
I thought about this. Either write long functions, so you don't have to load and unload often or just make the functions assume that the parameters are in registers without explicit declaration.
Yeah, at the moment you have to work at a higher level, you can't just do a single instruction on its own.
Sep 23 2011
parent bearophile <bearophileHUGS lycos.com> writes:
Don:

 Yeah, at the moment you have to work at a higher level, you can't just 
 do a single instruction on its own.
Is it possible to solve some of those problems by adding something like this to D/DMD:
http://www.dsource.org/projects/ldc/wiki/InlineAsmExpressions

And then, what changes/work is needed to allow inlining of some functions that contain asm? I mean something like this allow_inline:
http://www.dsource.org/projects/ldc/wiki/Docs#allow_inline

(I have asked similar questions four times in the last two years, with no answers or comments.)

Bye,
bearophile
Sep 24 2011
prev sibling next sibling parent reply Benjamin Thaut <code benjamin-thaut.de> writes:
Am 22.09.2011 02:38, schrieb Walter Bright:
 unsightly vector classes in C++, but fortunately using vendor
 specific compiler intrinsics usually leads to decent code
 generation. I can currently imagine an equally ugly (possibly worse)
 hardware vector library in D, if it's even possible. But perhaps
 I've missed something here?
Your C++ vector code should be amenable to translation to D, so that effort of yours isn't lost, except that it'd have to be in inline asm rather than intrinsics.
I recently tried that, and I couldn't do it, because D has no way of aligning structs on the stack. Manually allocating the necessary aligned memory is also not always possible, because it can not be done for compiler temporary variables:

vec4 v1 = func1();
vec4 v2 = func2();
vec4 result = (v1 + v2) * 0.5f;

Even if I manually allocate v1, v2 and result, the temporary variable that the compiler uses to compute the expression might be unaligned. That is a total killer for SSE optimizations, because you can not hide them away.

Does DMC++ have __declspec(align(16)) support?

--
Kind Regards
Benjamin Thaut
Sep 21 2011
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 9/21/2011 10:56 PM, Benjamin Thaut wrote:
 Even if I manually allocate v1,v2 and result, the temporary variable that the
 compiler uses to compute the expression might be unaligned.
 That is a total killer for SSE optimizations because you can not hide them
away.

 Does DMC++ have __declspec(align(16)) support?
No, but 64 bit DMD aligns the stack on 16 byte boundaries.
Sep 22 2011
parent Benjamin Thaut <code benjamin-thaut.de> writes:
== Auszug aus Walter Bright (newshound2 digitalmars.com)'s Artikel
 On 9/21/2011 10:56 PM, Benjamin Thaut wrote:
 Even if I manually allocate v1,v2 and result, the temporary variable that the
 compiler uses to compute the expression might be unaligned.
 That is a total killer for SSE optimizations because you can not hide them
away.

 Does DMC++ have __declspec(align(16)) support?
No, but 64 bit DMD aligns the stack on 16 byte boundaries.
Unfortunately there is no 64-bit DMD on Windows.
Sep 22 2011
prev sibling next sibling parent Peter Alexander <peter.alexander.au gmail.com> writes:
On 22/09/11 1:38 AM, Walter Bright wrote:
 D doesn't have __restrict. I'm going to argue that it is unnecessary.
 AFAIK, __restrict is most used in writing vector operations. D, on the
 other hand, has a dedicated vector operation syntax:

 a[] += b[] * c;

 where a[] and b[] are required to not be overlapping, hence enabling
 parallelization of the operation.
It's used for vector stuff, but I wouldn't say mostly. Just about any performance-intensive piece of code involving pointers can benefit from __restrict. I use it in a VM, for example.
 As an extension from that, why is there no hardware vector support
 in the language? Surely a primitive vector4 type would be a sensible
 thing to have?
The language supports it now (see the aforementioned vector syntax), it's just that the vector code gen isn't done (currently it is just implemented using loops).
I don't see how this would be possible without intrinsics, or at least some form of language extension. Would DMD just *always* put float[4] in XMM registers (assuming they are available)? That doesn't seem like a good idea if you don't want to use it as a vector.

BTW, if you want to get a good idea of how game programmers use vector intrinsics on current hardware, there is a good blog post about it here:
http://altdevblogaday.com/2011/01/31/vectiquette/
Sep 22 2011
prev sibling parent reply Manu Evans <turkeyman gmail.com> writes:
== Quote from Walter Bright (newshound2 digitalmars.com)'s article
 D doesn't have __restrict. I'm going to argue that it is unnecessary. AFAIK,
 __restrict is most used in writing vector operations. D, on the other hand, has
 a dedicated vector operation syntax:
    a[] += b[] * c;
 where a[] and b[] are required to not be overlapping, hence enabling
 parallelization of the operation.
Use of __restrict is certainly not limited to your example; it's applicable basically anywhere that a pointer is dereferenced on either side of a write through any other pointer, or of a function call (since it could potentially do anything): the resident value from the previous dereference is invalidated and must be reloaded needlessly, unless the pointer is explicitly marked restrict.

http://cellperformance.beyond3d.com/articles/2006/05/demystifying-the-restrict-keyword.html

For RISC architectures in particular, __restrict is mandatory when optimising certain hot functions without making a mess of your code (declaring stack locals all over the place), and I think I've run into cases where even that's not enough.
 D does have some intrinsics, like sin() and cos(). They tend to get added on a
 strictly as-needed basis, not a speculative one.
 D has no current intention to replace the inline assembler with intrinsics.
 As for custom intrinsics, Don Clugston wrote an amazing piece of demonstration D
 code a while back that would take a string representing a floating point
 expression, and would literally compile it (using Compile Time Function
 Execution) and produce a string literal of inline asm functions, which were then
 compiled by the inline assembler.
 So yes, it is entirely possible and practical for end users to write custom
 intrinsics.
I hadn't thought of that use of compile-time functions; that's really nice. I'm not sure if that'll be enough to generate good code in all cases, but I'll do some experiments and see where it goes.

The main problem with writing (intelligently generated) inline asm vs using intrinsics is that, in the context of the C (or D) source code, you don't have enough context to know about the state of the register assignment, and to produce the appropriate loads/stores. Also, the opcodes selected to perform the operation may change with context. (Again, specific examples are hard to fabricate, but I've had them consistently pop up over the years.)

Also, I think someone else said that you couldn't inline functions with inline asm? Is that correct? If so, I assume that's intended to be fixed?
 As an extension from that, why is there no hardware vector support
 in the language? Surely a primitive vector4 type would be a sensible
 thing to have?
The language supports it now (see the aforementioned vector syntax), it's just that the vector code gen isn't done (currently it is just implemented using loops).
Are you referring to the comment about special casing a float[4]? I can see why one might reach for that as a solution, but it sounds like a really bad idea to me...
 Is it possible in D currently to pass vectors to functions by value
 in registers? Without an intrinsic vector type, it would seem
 impossible.
Vectors (statically dimensioned arrays) are currently passed by value (unlike C or C++).
Do you mean that like a memcpy to the stack, or somehow intuitively using the hardware vector registers to pass arguments to the function properly?
 How can I do this in a nice way in D? I'm long sick of writing
 unsightly vector classes in C++, but fortunately using vendor
 specific compiler intrinsics usually leads to decent code
 generation. I can currently imagine an equally ugly (possibly worse)
 hardware vector library in D, if it's even possible. But perhaps
 I've missed something here?
Your C++ vector code should be amenable to translation to D, so that effort of yours isn't lost, except that it'd have to be in inline asm rather than intrinsics.
But sadly, in that case, it wouldn't work. Without an intrinsic hardware vector type, there's no way to pass vectors to functions in registers, and also, using explicit asm, you tend to end up with endless unnecessary loads and stores, and potentially a lot of redundant shuffling/permutation. This will differ radically between architectures too.

I think I read in another post too that functions containing inline asm will not be inlined?

How does the D compiler go at optimising code around inline asm blocks? Most compilers have a lot of trouble optimising around inline asm blocks, and many don't even attempt to do so... How does GDC compare to DMD? Does it do a good job?

I really need to take the weekend and do a lot of experiments, I think.
Sep 23 2011
parent reply Iain Buclaw <ibuclaw ubuntu.com> writes:
== Quote from Manu Evans (turkeyman gmail.com)'s article
 How can I do this in a nice way in D? I'm long sick of writing
 unsightly vector classes in C++, but fortunately using vendor
 specific compiler intrinsics usually leads to decent code
 generation. I can currently imagine an equally ugly (possibly worse)
 hardware vector library in D, if it's even possible. But perhaps
 I've missed something here?
Your C++ vector code should be amenable to translation to D, so that effort of yours isn't lost, except that it'd have to be in inline asm rather than intrinsics.
 But sadly, in that case, it wouldn't work. Without an intrinsic hardware vector type, there's
 no way to pass vectors to functions in registers, and also, using explicit asm, you tend to
 end up with endless unnecessary loads and stores, and potentially a lot of redundant
 shuffling/permutation. This will differ radically between architectures too.
 I think I read in another post too that functions containing inline asm will not be inlined?
 How does the D compiler go at optimising code around inline asm blocks? Most compilers have a
 lot of trouble optimising around inline asm blocks, and many don't even attempt to do so...
 How does GDC compare to DMD? Does it do a good job?
 I really need to take the weekend and do a lot of experiments I think.
GDC is just the same as DMD (same runtime library implementation for vector array operations).

You can define vector types in the language through use of GCC's attribute though (it's a pragma in GDC), then use a union to interface between it and the corresponding static array. It's deliberately UGLY and PRONE to you hitting lots of brick walls if you don't handle them in a very specific way though. :~)

Stock example:

pragma(attribute, vector_size()) typedef float __v4sf_t;

union __v4sf
{
    float[4] f;
    __v4sf_t v;
}

__v4sf a = {[1,2,3,4]}, b = {[1,2,3,4]}, c;
c.v = a.v + b.v;
assert(c.f == [2,4,6,8]);

The assignment compiles down to ~5 instructions:

movaps -0x88(%ebp),%xmm1
movaps -0x78(%ebp),%xmm0
addps  %xmm1,%xmm0
movaps %xmm0,-0x68(%ebp)
flds   -0x68(%ebp)

And it is far quicker than c[] = a[] + b[], due to being inlined rather than an external library call.

Regards
Iain
Sep 24 2011
parent reply Manu <turkeyman gmail.com> writes:
On 24 September 2011 15:37, Iain Buclaw <ibuclaw ubuntu.com> wrote:

 == Quote from Manu Evans (turkeyman gmail.com)'s article
 How can I do this in a nice way in D? I'm long sick of writing
 unsightly vector classes in C++, but fortunately using vendor
 specific compiler intrinsics usually leads to decent code
 generation. I can currently imagine an equally ugly (possibly worse)
 hardware vector library in D, if it's even possible. But perhaps
 I've missed something here?
Your C++ vector code should be amenable to translation to D, so that
effort of
 yours isn't lost, except that it'd have to be in inline asm rather than
intrinsics.
 But sadly, in that case, it wouldn't work. Without an intrinsic hardware
vector type, there's
 no way to pass vectors to functions in registers, and also, using
explicit asm, you tend to
 end up with endless unnecessary loads and stores, and potentially a lot
of redundant
 shuffling/permutation. This will differ radically between architectures
too.
 I think I read in another post too that functions containing inline asm
will not be inlined?
 How does the D compiler go at optimising code around inline asm blocks?
Most compilers have a
 lot of trouble optimising around inline asm blocks, and many don't even
attempt to do so...
 How does GDC compare to DMD? Does it do a good job?
 I really need to take the weekend and do a lot of experiments I think.
GDC is just the same as DMD (same runtime library implementation for vector array operations). You can define vector types in the language through use of GCC's attribute (it's a pragma in GDC), then use a union to interface between it and the corresponding static array. It's deliberately UGLY and PRONE to you hitting lots of brick walls if you don't handle them in a very specific way though. :~)

Stock example:

    pragma(attribute, vector_size()) typedef float __v4sf_t;

    union __v4sf
    {
        float[4] f;
        __v4sf_t v;
    }

    __v4sf a = {[1,2,3,4]}, b = {[1,2,3,4]}, c;
    c.v = a.v + b.v;
    assert(c.f == [2,4,6,8]);

The assignment compiles down to ~5 instructions:

    movaps -0x88(%ebp),%xmm1
    movaps -0x78(%ebp),%xmm0
    addps  %xmm1,%xmm0
    movaps %xmm0,-0x68(%ebp)
    flds   -0x68(%ebp)

And is far quicker than c[] = a[] + b[] due to it being inlined, and not an external library call.

Regards
Iain
Nice!

Is there an IRC channel, or anywhere for realtime D discussion?

I'm interested in trying to build some GDC cross compilers, and perhaps contributing to the standard library on a few embedded systems, but I have a lot of little questions and general things that don't suit a mailing list... Perhaps some IM? It seems to me that you are the authority on GDC implementation and support...
Sep 24 2011
next sibling parent so <so so.so> writes:
On Sat, 24 Sep 2011 16:50:39 +0300, Manu <turkeyman gmail.com> wrote:

 Nice!
 Is there an IRC channel, or anywhere for realtime D discussion?
 I'm interested in trying to build some GDC cross compilers, and perhaps
 contributing to the standard library on a few embedded systems, but I  
 have a
 lot of little questions and general things that don't suit a mailing  
 list...
 Perhaps some IM? It seems to me that you are the authority on GDC
 implementation and support...
We all know where that leads: first it's IM, then a phone call, and finally ...
Sep 24 2011
prev sibling parent Max Klyga <max.klyga gmail.com> writes:
On 2011-09-24 16:50:39 +0300, Manu said:
 Is there an IRC channel, or anywhere for realtime D discussion?
There is a #d channel for general D discussion and #d.gdc for GDC-related topics on irc.freenode.net.
Sep 24 2011
prev sibling next sibling parent Iain Buclaw <ibuclaw ubuntu.com> writes:
== Quote from Manu (turkeyman gmail.com)'s article
 Hello D community.
 I've been reading a lot about D lately. I have known it existed for
 ages, but for some reason never even took a moment to look into it.
 The more I looked into it, the more I realise, this is the language
 I want. C(/C++) has been ruined, far beyond salvation. D seems to be
 the reboot that it desperately needs.
 Anyway, I work in the games industry, 10 years in cross platform
 console games at major studios. Sadly, I don't think Microsoft,
 Sony, Nintendo, Apple, Google (...maybe google) will support D any
 time soon, but I've started some after-hours game projects to test D
 in some real gamedev environments.
 So far I have these (critical) questions.
 Pointer aliasing... C implementations use a non-standard __restrict
 keyword to state that a given pointer will not be aliased by any
 other pointer. This is critical in some pieces of code to eliminate
 redundant loads and stores, particularly important on RISC
 architectures like PPC.
 How does D address pointer aliasing? I can't imagine the compiler
 has any way to detect that pointer aliasing is not possible in
 certain cases, many cases are just far too complicated. Is there a
 keyword? Or plans? This is critical for realtime performance.
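[For context, the C99 feature being asked about looks like this. A minimal sketch, not from the thread; `restrict` is the standard C99 spelling of the `__restrict` extension:]

```c
#include <stddef.h>

/* With restrict, the compiler may assume dst and src never alias,
   so it can keep values in registers across the stores through dst
   instead of conservatively reloading src[i] after every write. */
void scale(float *restrict dst, const float *restrict src,
           float k, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        dst[i] = src[i] * k;
}
```

Without the qualifier, the compiler must assume `dst` might point into `src` and emit the extra loads the poster is complaining about.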
 C implementations often use compiler intrinsics to implement
 architecture provided functionality rather than inline asm, the
 reason is that the intrinsics allow the compiler to generate better
 code with knowledge of the context. Inline asm can't really be
 transformed appropriately to suit the context in some situations,
 whereas intrinsics operate differently, and run vendor specific
 logic to produce the code more intelligently.
 How does D address this? What options/possibilities are available to
 the language? Hooks for vendors to implement intrinsics for custom
 hardware?
The DMD compiler has some basic intrinsics; other compilers build upon this using their own backends. For example, GCC has hundreds of builtins, including some target builtins whose intrinsic types are mappable to D types (__float80 -> real).
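[To illustrate the intrinsics-versus-asm point from the original question: this is a minimal C sketch using SSE intrinsics, which is the style of code the poster wants to be able to write. x86-only, and not from the thread:]

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Unlike an opaque inline-asm block, intrinsics leave register
   allocation and instruction scheduling to the compiler, so this
   function can be inlined and optimised in its calling context. */
void vec_add(float *out, const float *a, const float *b)
{
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ps(va, vb));
}
```

The equivalent inline-asm version would pin specific registers and force the values through memory at the block's boundaries, which is exactly the overhead the question describes.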
 Is the D assembler a macro assembler? (ie, assigns registers
 automatically and manage loads/stores intelligently?) I haven't seen
 any non-x86 examples of the D assembler, and I think it's fair to
 say that x86 is the single most unnecessary architecture to write
 inline assembly that exists. Are there PowerPC or ARM examples
 anywhere?
 As an extension from that, why is there no hardware vector support
 in the language? Surely a primitive vector4 type would be a sensible
 thing to have?
 Is it possible in D currently to pass vectors to functions by value
 in registers? Without an intrinsic vector type, it would seem
 impossible.
 In addition to that, writing a custom Vector4 class to make use of
 VMX, SSE, ARM VFP, PSP VFPU, MIPS 'Vector Units', SH4 DR regs, etc,
 wrapping functions around inline asm blocks is always clumsy and far
 from optimal. The compiler (code generator and probably the
 optimiser) needs to understand the concepts of vectors to make good
 use of the hardware.
 How can I do this in a nice way in D? I'm long sick of writing
 unsightly vector classes in C++, but fortunately using vendor
 specific compiler intrinsics usually leads to decent code
 generation. I can currently imagine an equally ugly (possibly worse)
 hardware vector library in D, if it's even possible. But perhaps
 I've missed something here?
I would imagine it should now be possible to use GCC vector builtins with the GDC compiler - provided I get round to turning these routines on, that is. :~)
 I'd love to try out D on some console systems. Fortunately there are
 some great home-brew scenes available for a bunch of slightly older
 consoles; PSP/PS2 (MIPS), XBox1 (embedded x86), GameCube/Wii (PPC),
 Dreamcast (SH4). They all have GCC compilers maintained by the
 community. How difficult will it be to make GDC work with those
 toolchains? Sadly I know nothing about configuring GCC, so sadly I
 can't really help here.
 What about Android (or iPhone, but apple's 'x-code policy' prevents
 that)? I'd REALLY love to write an android project in D... the
 toolchain is GCC, I see no reason why it shouldn't be possible to
 write an android app if an appropriate toolchain was available?
 Sorry it's a bit long, thanks for reading this far!
 I'm looking forward to a brighter future writing lots of D code :P
 But I need to know basically all these questions are addressed
 before I could consider it for serious commercial game dev.
Someone has recently confirmed D working just fine on the Alpha platform. For D2, your biggest showstopper is the runtime library. There are many gaps to fill to port druntime to your preferred architecture.

Regards
Sep 21 2011
prev sibling parent Kagamin <spam here.lot> writes:
Manu Wrote:

 I'd love to try out D on some console systems. Fortunately there are
 some great home-brew scenes available for a bunch of slightly older
 consoles; PSP/PS2 (MIPS), XBox1 (embedded x86), GameCube/Wii (PPC),
 Dreamcast (SH4). They all have GCC compilers maintained by the
 community. How difficult will it be to make GDC work with those
 toolchains? Sadly I know nothing about configuring GCC, so sadly I
 can't really help here.
http://pspemu.soywiz.com/2011/07/fourth-release-d-pspemu-r301.html Maybe this man can be of some help for you.
Sep 22 2011