digitalmars.D - __restrict, architecture intrinsics vs asm, consoles, and other stuff
- Manu (70/70) Sep 21 2011 Hello D community.
- Trass3r (9/14) Sep 21 2011 Well DMD only supports x86 including inline asm so that's the only thing...
- Walter Bright (31/83) Sep 21 2011 D doesn't have __restrict. I'm going to argue that it is unnecessary. AF...
- a (86/86) Sep 21 2011 How would one do something like this without intrinsics (the code is c++...
- Don (28/114) Sep 21 2011 [snip]
- a (26/27) Sep 22 2011 Doesn't it often require additional needless movaps instructions?
- Walter Bright (2/4) Sep 22 2011 That's correct, it currently does not.
- Andrei Alexandrescu (6/37) Sep 22 2011 I think we should put swizzle in std.numeric once and for all. Is anyone...
- so (14/16) Sep 22 2011 You mean some helper functions to be used in user structures? Because i ...
- so (2/3) Sep 22 2011 accept...
- Andrei Alexandrescu (3/8) Sep 22 2011 I was thinking of a template that takes and return T[n].
- so (3/13) Sep 22 2011 Something like this?
- Andrei Alexandrescu (4/19) Sep 22 2011 Looks promising, though I was hoping to not need an additional struct V....
- Manu Evans (121/165) Sep 23 2011 code is
- bearophile (5/10) Sep 23 2011 What do you want to do when CPU with 256 bit registers appear? When they...
- Manu Evans (33/42) Sep 23 2011 offer most of the things discussed in this thread. But I think that
- Don (13/178) Sep 23 2011 float[4] is not considered to be a hardware vector. It is only passed as...
- Marco Leise (7/38) Sep 22 2011 That's a nice fresh approach to intrinsics. I bet if other languages had...
- Peter Alexander (11/29) Sep 22 2011 How can it compile into a single shufps? x and y would need to already
- Marco Leise (5/38) Sep 22 2011 I thought about this. Either write long functions, so you don't have to ...
- Don (3/44) Sep 23 2011 Yeah, at the moment you have to work at a higher level, you can't just
- bearophile (8/10) Sep 24 2011 Is it possible to solve some of those problems adding something like thi...
- Benjamin Thaut (16/24) Sep 21 2011 I recently tried that, and I couldn't do it because D has no way of
- Walter Bright (2/6) Sep 22 2011 No, but 64 bit DMD aligns the stack on 16 byte boundaries.
- Benjamin Thaut (2/9) Sep 22 2011 Unfortunately there is no 64 bit dmd on windows.
- Peter Alexander (12/24) Sep 22 2011 It's used for vector stuff, but I wouldn't say mostly. Just about any
- Manu Evans (33/67) Sep 23 2011 Use of __restrict is certainly not limited to your example, it's applica...
- Iain Buclaw (35/52) Sep 24 2011 type, there's
- Iain Buclaw (10/80) Sep 21 2011 The DMD compiler has some basic intrinsics, other compilers build upon t...
- Kagamin (3/10) Sep 22 2011 http://pspemu.soywiz.com/2011/07/fourth-release-d-pspemu-r301.html
Hello D community.

I've been reading a lot about D lately. I have known it existed for ages, but for some reason never even took a moment to look into it. The more I looked into it, the more I realise: this is the language I want. C(/C++) has been ruined, far beyond salvation. D seems to be the reboot that it desperately needs.

Anyway, I work in the games industry, 10 years in cross-platform console games at major studios. Sadly, I don't think Microsoft, Sony, Nintendo, Apple, Google (...maybe Google) will support D any time soon, but I've started some after-hours game projects to test D in some real gamedev environments. So far I have these (critical) questions.

Pointer aliasing... C implementations use a non-standard __restrict keyword to state that a given pointer will not be aliased by any other pointer. This is critical in some pieces of code to eliminate redundant loads and stores, particularly important on RISC architectures like PPC. How does D address pointer aliasing? I can't imagine the compiler has any way to detect that pointer aliasing is not possible in certain cases; many cases are just far too complicated. Is there a keyword? Or plans? This is critical for realtime performance.

C implementations often use compiler intrinsics to implement architecture-provided functionality rather than inline asm. The reason is that the intrinsics allow the compiler to generate better code with knowledge of the context. Inline asm can't really be transformed appropriately to suit the context in some situations, whereas intrinsics operate differently, and run vendor-specific logic to produce the code more intelligently. How does D address this? What options/possibilities are available to the language? Hooks for vendors to implement intrinsics for custom hardware?

Is the D assembler a macro assembler? (ie, assigns registers automatically and manages loads/stores intelligently?) 
I haven't seen any non-x86 examples of the D assembler, and I think it's fair to say that x86 is the single most unnecessary architecture to write inline assembly for. Are there PowerPC or ARM examples anywhere?

As an extension from that, why is there no hardware vector support in the language? Surely a primitive vector4 type would be a sensible thing to have? Is it possible in D currently to pass vectors to functions by value in registers? Without an intrinsic vector type, it would seem impossible.

In addition to that, writing a custom Vector4 class to make use of VMX, SSE, ARM VFP, PSP VFPU, MIPS 'Vector Units', SH4 DR regs, etc, wrapping functions around inline asm blocks, is always clumsy and far from optimal. The compiler (code generator and probably the optimiser) needs to understand the concepts of vectors to make good use of the hardware. How can I do this in a nice way in D? I'm long sick of writing unsightly vector classes in C++, but fortunately using vendor-specific compiler intrinsics usually leads to decent code generation. I can currently imagine an equally ugly (possibly worse) hardware vector library in D, if it's even possible. But perhaps I've missed something here?

I'd love to try out D on some console systems. Fortunately there are some great home-brew scenes available for a bunch of slightly older consoles: PSP/PS2 (MIPS), XBox1 (embedded x86), GameCube/Wii (PPC), Dreamcast (SH4). They all have GCC compilers maintained by the community. How difficult will it be to make GDC work with those toolchains? I know nothing about configuring GCC, so sadly I can't really help here. What about Android (or iPhone, but Apple's 'x-code policy' prevents that)? I'd REALLY love to write an Android project in D... the toolchain is GCC, so I see no reason why it shouldn't be possible to write an Android app if an appropriate toolchain were available?

Sorry it's a bit long, thanks for reading this far! 
I'm looking forward to a brighter future writing lots of D code :P But I need to know that basically all these questions are addressed before I could consider it for serious commercial game dev.
Sep 21 2011
I haven't seen any non-x86 examples of the D assembler, and I think it's fair to say that x86 is the single most unnecessary architecture to write inline assembly that exists. Are there PowerPC or ARM examples anywhere?

Well, DMD only supports x86, including inline asm, so that's the only thing that's tested. You need to try LDC or GDC for most of the things you request.

http://dsource.org/projects/ldc/wiki/InlineAsmExpressions
https://bitbucket.org/goshawk/gdc/wiki/UserDocumentation#!extended-assembler

Some guys have already managed to build cross-compilers for ARM and run some basic code on e.g. the Nintendo DS. For anything serious you would need to make druntime work, though. It's just that nobody has done the dirty work yet.
Sep 21 2011
On 9/21/2011 3:55 PM, Manu wrote:Pointer aliasing... C implementations uses a non-standard __restrict keyword to state that a given pointer will not be aliased by any other pointer. This is critical in some pieces of code to eliminate redundant loads and stores, particularly important on RISC architectures like PPC. How does D address pointer aliasing? I can't imagine the compiler has any way to detect that pointer aliasing is not possible in certain cases, many cases are just far too complicated. Is there a keyword? Or plans? This is critical for realtime performance.D doesn't have __restrict. I'm going to argue that it is unnecessary. AFAIK, __restrict is most used in writing vector operations. D, on the other hand, has a dedicated vector operation syntax: a[] += b[] * c; where a[] and b[] are required to not be overlapping, hence enabling parallelization of the operation.C implementations often use compiler intrinsics to implement architecture provided functionality rather than inline asm, the reason is that the intrinsics allow the compiler to generate better code with knowledge of the context. Inline asm can't really be transformed appropriately to suit the context in some situations, whereas intrinsics operate differently, and run vendor specific logic to produce the code more intelligently. How does D address this? What options/possibilities are available to the language? Hooks for vendors to implement intrinsics for custom hardware?D does have some intrinsics, like sin() and cos(). They tend to get added on a strictly as-needed basis, not a speculative one. D has no current intention to replace the inline assembler with intrinsics. 
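For readers who haven't used __restrict, a minimal C++ sketch of the promise it makes (the `__restrict__` spelling is the GCC/Clang extension; C99 spells it `restrict`; the function name here is illustrative only):

```cpp
#include <cassert>
#include <cstddef>

// Without the restrict qualifier, the compiler must assume `dst` might
// alias `src`, so each store to dst[i] could invalidate later loads of
// src[i] and force reloads. With the qualifier, the programmer promises
// the two arrays do not overlap, so loads can be hoisted and the loop
// vectorized. D's a[] = b[] * k syntax bakes this non-overlap
// requirement into the operation itself.
void scale(float* __restrict__ dst, const float* __restrict__ src,
           float k, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        dst[i] = src[i] * k;
}
```

This is the same guarantee Walter describes D's array-operation syntax as providing implicitly, since overlapping operands are disallowed there by definition.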
As for custom intrinsics, Don Clugston wrote an amazing piece of demonstration D code a while back that would take a string representing a floating point expression, and would literally compile it (using Compile Time Function Execution) and produce a string literal of inline asm functions, which were then compiled by the inline assembler. So yes, it is entirely possible and practical for end users to write custom intrinsics.Is the D assembler a macro assembler?No. It's what-you-write-is-what-you-get.(ie, assigns registers automatically and manage loads/stores intelligently?)No. It's intended to be a low level assembler for those who want to precisely control things.I haven't seen any non-x86 examples of the D assembler, and I think it's fair to say that x86 is the single most unnecessary architecture to write inline assembly that exists.I enjoy writing x86 inline assembler :-)Are there PowerPC or ARM examples anywhere?The intention is for other CPU targets to employ the syntax used in their respective CPU manual datasheets.As an extension from that, why is there no hardware vector support in the language? Surely a primitive vector4 type would be a sensible thing to have?The language supports it now (see the aforementioned vector syntax), it's just that the vector code gen isn't done (currently it is just implemented using loops).Is it possible in D currently to pass vectors to functions by value in registers? Without an intrinsic vector type, it would seem impossible.Vectors (statically dimensioned arrays) are currently passed by value (unlike C or C++).In addition to that, writing a custom Vector4 class to make use of VMX, SSE, ARM VFP, PSP VFPU, MIPS 'Vector Units', SH4 DR regs, etc, wrapping functions around inline asm blocks is always clumsy and far from optimal. The compiler (code generator and probably the optimiser) needs to understand the concepts of vectors to make good use of the hardware.Yes, I agree.How can I do this in a nice way in D? 
I'm long sick of writing unsightly vector classes in C++, but fortunately using vendor specific compiler intrinsics usually leads to decent code generation. I can currently imagine an equally ugly (possibly worse) hardware vector library in D, if it's even possible. But perhaps I've missed something here?Your C++ vector code should be amenable to translation to D, so that effort of yours isn't lost, except that it'd have to be in inline asm rather than intrinsics.I'd love to try out D on some console systems. Fortunately there are some great home-brew scenes available for a bunch of slightly older consoles; PSP/PS2 (MIPS), XBox1 (embedded x86), GameCube/Wii (PPC), Dreamcast (SH4). They all have GCC compilers maintained by the community. How difficult will it be to make GDC work with those toolchains? Sadly I know nothing about configuring GCC, so sadly I can't really help here.I don't know much about GDC's capabilities.
Sep 21 2011
How would one do something like this without intrinsics (the code is c++ using gcc vector extensions)?

template <class V>
struct Fft
{
    typedef typename V::T T;
    typedef typename V::vec vec;
    static const int VecSize = V::Size;
    ...
    template <int Interleaved>
    static NOINLINE void fft_pass_interleaved(
        vec * __restrict pr, vec * __restrict pi,
        vec * __restrict pend, T * __restrict table)
    {
        for(; pr < pend; pr += 2, pi += 2, table += 2*Interleaved)
        {
            vec tmpr, ti, ur, ui, wr, wi;
            V::template expandComplexArrayToRealImagVec<Interleaved>(table, wr, wi);
            V::template deinterleave<Interleaved>(pr[0], pr[1], ur, tmpr);
            V::template deinterleave<Interleaved>(pi[0], pi[1], ui, ti);
            vec tr = tmpr*wr - ti*wi;
            ti = tmpr*wi + ti*wr;
            V::template interleave<Interleaved>(ur + tr, ur - tr, pr[0], pr[1]);
            V::template interleave<Interleaved>(ui + ti, ui - ti, pi[0], pi[1]);
        }
    }
    ...

Here vector elements need to be shuffled around when they are loaded and stored. This is platform dependent and cannot be expressed through vector operations (or gcc vector extensions). Here I abstracted the platform-dependent functionality into member functions of V, which are implemented using intrinsics. 
The assembly generated for SSE single precision and Interleaved=4 is:

0000000000000000 <_ZN3FftI6SSEVecIfEE20fft_pass_interleavedILi4EEEvPDv4_fS5_S5_Pf>:
   0:  48 39 d7              cmp      %rdx,%rdi
   3:  0f 83 9c 00 00 00     jae      a5 <_ZN3FftI6SSEVecIfEE20fft_pass_interleavedILi4EEEvPDv4_fS5_S5_Pf+0xa5>
   9:  0f 1f 80 00 00 00 00  nopl     0x0(%rax)
  10:  0f 28 19              movaps   (%rcx),%xmm3
  13:  0f 28 41 10           movaps   0x10(%rcx),%xmm0
  17:  48 83 c1 20           add      $0x20,%rcx
  1b:  0f 28 f3              movaps   %xmm3,%xmm6
  1e:  0f 28 2f              movaps   (%rdi),%xmm5
  21:  0f c6 d8 dd           shufps   $0xdd,%xmm0,%xmm3
  25:  0f c6 f0 88           shufps   $0x88,%xmm0,%xmm6
  29:  0f 28 e5              movaps   %xmm5,%xmm4
  2c:  0f 28 47 10           movaps   0x10(%rdi),%xmm0
  30:  0f 28 4e 10           movaps   0x10(%rsi),%xmm1
  34:  0f c6 e0 88           shufps   $0x88,%xmm0,%xmm4
  38:  0f c6 e8 dd           shufps   $0xdd,%xmm0,%xmm5
  3c:  0f 28 06              movaps   (%rsi),%xmm0
  3f:  0f 28 d0              movaps   %xmm0,%xmm2
  42:  0f c6 c1 dd           shufps   $0xdd,%xmm1,%xmm0
  46:  0f c6 d1 88           shufps   $0x88,%xmm1,%xmm2
  4a:  0f 28 cd              movaps   %xmm5,%xmm1
  4d:  0f 28 f8              movaps   %xmm0,%xmm7
  50:  0f 59 ce              mulps    %xmm6,%xmm1
  53:  0f 59 fb              mulps    %xmm3,%xmm7
  56:  0f 59 c6              mulps    %xmm6,%xmm0
  59:  0f 59 dd              mulps    %xmm5,%xmm3
  5c:  0f 5c cf              subps    %xmm7,%xmm1
  5f:  0f 58 c3              addps    %xmm3,%xmm0
  62:  0f 28 dc              movaps   %xmm4,%xmm3
  65:  0f 5c d9              subps    %xmm1,%xmm3
  68:  0f 58 cc              addps    %xmm4,%xmm1
  6b:  0f 28 e1              movaps   %xmm1,%xmm4
  6e:  0f 15 cb              unpckhps %xmm3,%xmm1
  71:  0f 14 e3              unpcklps %xmm3,%xmm4
  74:  0f 29 4f 10           movaps   %xmm1,0x10(%rdi)
  78:  0f 28 ca              movaps   %xmm2,%xmm1
  7b:  0f 29 27              movaps   %xmm4,(%rdi)
  7e:  0f 5c c8              subps    %xmm0,%xmm1
  81:  48 83 c7 20           add      $0x20,%rdi
  85:  0f 58 c2              addps    %xmm2,%xmm0
  88:  0f 28 d0              movaps   %xmm0,%xmm2
  8b:  0f 15 c1              unpckhps %xmm1,%xmm0
  8e:  0f 14 d1              unpcklps %xmm1,%xmm2
  91:  0f 29 46 10           movaps   %xmm0,0x10(%rsi)
  95:  0f 29 16              movaps   %xmm2,(%rsi)
  98:  48 83 c6 20           add      $0x20,%rsi
  9c:  48 39 fa              cmp      %rdi,%rdx
  9f:  0f 87 6b ff ff ff     ja       10 <_ZN3FftI6SSEVecIfEE20fft_pass_interleavedILi4EEEvPDv4_fS5_S5_Pf+0x10>
  a5:  f3 c3                 repz retq

Would something like that be possible with D inline assembly, or would there be additional loads and stores for each call of V::interleave, V::deinterleave and V::expandComplexArrayToRealImagVec?
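As a reading aid for the SSE listing above: shufps with mask 0x88 selects lanes 0 and 2 of each operand (the even lanes), and mask 0xdd selects lanes 1 and 3 (the odd lanes), which is how the deinterleave step splits the complex data. A scalar C++ model of that selection (helper names here are illustrative, not the post's actual V:: implementation):

```cpp
#include <array>
#include <cassert>

using vec = std::array<float, 4>;

// Scalar model of the two shufps masks in the listing:
//   shufps $0x88 -> {dst[0], dst[2], src[0], src[2]}  (even lanes)
//   shufps $0xdd -> {dst[1], dst[3], src[1], src[3]}  (odd lanes)
// Applied to two adjacent vectors, this deinterleaves the elements
// into even-indexed and odd-indexed halves.
void deinterleave(vec v0, vec v1, vec& evens, vec& odds) {
    evens = {v0[0], v0[2], v1[0], v1[2]};  // shufps $0x88 behavior
    odds  = {v0[1], v0[3], v1[1], v1[3]};  // shufps $0xdd behavior
}
```

The unpcklps/unpckhps pair at the end of the loop performs the inverse interleave before the results are stored.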
Sep 21 2011
On 22.09.2011 05:24, a wrote:How would one do something like this without intrinsics (the code is c++ using gcc vector extensions):[snip]

At present, you can't do it without ultimately resorting to inline asm. But, what we've done is to move SIMD into the machine model: the D machine model assumes that float[4] + float[4] is a more efficient operation than a loop. Currently, only arithmetic operations are implemented, and on DMD at least, they're still not proper intrinsics. So in the long term it'll be possible to do it directly, but not yet.

At various times, several of us have implemented 'swizzle' using CTFE, giving you a syntax like:

float[4] x, y;
x[] = y[].swizzle!"cdcd"(); // x[0]=y[2], x[1]=y[3], x[2]=y[2], x[3]=y[3]

which compiles to a single shufps instruction. That "cdcd" string is really a tiny DSL: the language consists of four characters, each of which is a, b, c, or d.

A couple of years ago I made a DSL compiler for BLAS1 operations. It was capable of doing some pretty wild stuff, even then. (The DSL looked like normal D code). But the compiler has improved enormously since that time. It's now perfectly feasible to make a DSL for the SIMD operations you need. The really nice thing about this, compared to normal asm, is that you have access to the compiler's symbol table. This lets you add compile-time error messages, for example.

A funny thing about this, which I found after working on the DMD back-end, is that it is MUCH easier to write an optimizer/code generator in a DSL in D than in a compiler back-end.

[snip - remainder of the quoted original post]
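Don's swizzle!"cdcd" can be modelled outside D as well. In D the pattern string is parsed at compile time via CTFE and lowered to a single shufps; the plain C++ sketch below only illustrates the lane-selection semantics of that tiny DSL (the helper is hypothetical, not the actual library code):

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Each character of the pattern ('a'..'d') selects a source lane:
// 'a' -> lane 0, 'b' -> lane 1, 'c' -> lane 2, 'd' -> lane 3.
// swizzle(y, "cdcd") therefore yields {y[2], y[3], y[2], y[3]},
// matching the comment in Don's example.
std::array<float, 4> swizzle(const std::array<float, 4>& src,
                             const char (&pattern)[5]) {
    std::array<float, 4> out{};
    for (std::size_t i = 0; i < 4; ++i)
        out[i] = src[pattern[i] - 'a'];
    return out;
}
```

The point of doing this in CTFE rather than at run time is that the pattern is known when code is generated, so the loop collapses to one shuffle instruction instead of four scalar moves.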
Sep 21 2011
which compiles to a single shufps instruction.Doesn't it often require additional needless movaps instructions? For example, the following:

asm
{
    movaps XMM0, a;
    movaps XMM1, b;
    addps XMM0, XMM1;
    movaps a, XMM0;
}
asm
{
    movaps XMM0, a;
    movaps XMM1, b;
    addps XMM0, XMM1;
    movaps a, XMM0;
}

compiles to

movaps -0x48(%rsp),%xmm0
movaps -0x38(%rsp),%xmm1
addps  %xmm1,%xmm0
movaps %xmm0,-0x48(%rsp)
movaps -0x48(%rsp),%xmm0
movaps -0x38(%rsp),%xmm1
addps  %xmm1,%xmm0
movaps %xmm0,-0x48(%rsp)

Is it possible to avoid needless loading and storing of values when calling multiple functions that use asm blocks? It also seems that the compiler doesn't inline functions containing asm.
Sep 22 2011
On 9/22/2011 5:11 AM, a wrote:It also seems that the compiler doesn't inline functions containing asm.That's correct, it currently does not.
Sep 22 2011
On 9/22/11 1:39 AM, Don wrote:On 22.09.2011 05:24, a wrote:I think we should put swizzle in std.numeric once and for all. Is anyone interested in taking up that task?How would one do something like this without intrinsics (the code is c++ using gcc vector extensions):[snip] At present, you can't do it without ultimately resorting to inline asm. But, what we've done is to move SIMD into the machine model: the D machine model assumes that float[4] + float[4] is a more efficient operation than a loop. Currently, only arithmetic operations are implemented, and on DMD at least, they're still not proper intrinsics. So in the long term it'll be possible to do it directly, but not yet. At various times, several of us have implemented 'swizzle' using CTFE, giving you a syntax like: float[4] x, y; x[] = y[].swizzle!"cdcd"(); // x[0]=y[2], x[1]=y[3], x[2]=y[2], x[3]=y[3] which compiles to a single shufps instruction. That "cdcd" string is really a tiny DSL: the language consists of four characters, each of which is a, b, c, or d.A couple of years ago I made a DSL compiler for BLAS1 operations. It was capable of doing some pretty wild stuff, even then. (The DSL looked like normal D code). But the compiler has improved enormously since that time. It's now perfectly feasible to make a DSL for the SIMD operations you need. The really nice thing about this, compared to normal asm, is that you have access to the compiler's symbol table. This lets you add compile-time error messages, for example. A funny thing about this, which I found after working on the DMD back-end, is that is MUCH easier to write an optimizer/code generator in a DSL in D, than in a compiler back-end.A good argument for (a) moving stuff from the compiler into the library, (b) continuing Don's great work on making CTFE a solid proposition. Andrei
Sep 22 2011
On Thu, 22 Sep 2011 17:07:25 +0300, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:I think we should put swizzle in std.numeric once and for all. Is anyone interested in taking up that task?You mean some helper functions to be used in user structures? Because i don't know of any structure in std.numeric that could use it. We first need to improve opDispatch. Currently i think no one knows how it works or how it was intended to work. It refuses to except a few things which i think it should. For example:

struct A
{
    opDispatch(string)()
    opDispatch(string)() const
}

A a, b;
a.fun = b.run; // This should be perfectly fine.
Sep 22 2011
On Fri, 23 Sep 2011 02:00:50 +0300, so <so so.so> wrote:It refuses to except a few things which i think it should.accept...
Sep 22 2011
On 9/22/11 6:00 PM, so wrote:On Thu, 22 Sep 2011 17:07:25 +0300, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:I was thinking of a template that takes and return T[n]. AndreiI think we should put swizzle in std.numeric once and for all. Is anyone interested in taking up that task?You mean some helper functions to be used in user structures?
Sep 22 2011
On Fri, 23 Sep 2011 02:40:11 +0300, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:On 9/22/11 6:00 PM, so wrote:Something like this?On Thu, 22 Sep 2011 17:07:25 +0300, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:I was thinking of a template that takes and return T[n]. AndreiI think we should put swizzle in std.numeric once and for all. Is anyone interested in taking up that task?You mean some helper functions to be used in user structures?
Sep 22 2011
On 9/22/11 9:11 PM, so wrote:On Fri, 23 Sep 2011 02:40:11 +0300, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:Looks promising, though I was hoping to not need an additional struct V. But I'm not an expert. AndreiOn 9/22/11 6:00 PM, so wrote:Something like this?On Thu, 22 Sep 2011 17:07:25 +0300, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:I was thinking of a template that takes and return T[n]. AndreiI think we should put swizzle in std.numeric once and for all. Is anyone interested in taking up that task?You mean some helper functions to be used in user structures?
Sep 22 2011
On Fri, 23 Sep 2011 06:44:44 +0300, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:On 9/22/11 9:11 PM, so wrote:It was there to show how it should be used in user code, and testing. Swizzle is not just a rvalue operation, there is also a lvalue part to it which plays a bit differently (hence, swizzleR and swizzleL). We could take care of it with an overload but D doesn't act quite like what i expected (like C++), i don't understand why it won't differentiate "fun()" from "fun() const".On Fri, 23 Sep 2011 02:40:11 +0300, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:Looks promising, though I was hoping to not need an additional struct V. But I'm not an expert. AndreiOn 9/22/11 6:00 PM, so wrote:Something like this?On Thu, 22 Sep 2011 17:07:25 +0300, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:I was thinking of a template that takes and return T[n]. AndreiI think we should put swizzle in std.numeric once and for all. Is anyone interested in taking up that task?You mean some helper functions to be used in user structures?
Sep 23 2011
On Fri, 23 Sep 2011 16:09:31 +0300, so <so so.so> wrote:It was there to show how it should be used in user code, and testing. Swizzle is not just a rvalue operation, there is also a lvalue part to it which plays a bit differently (hence, swizzleR and swizzleL). We could take care of it with an overload but D doesn't act quite like what i expected (like C++), i don't understand why it won't differentiate "fun()" from "fun() const".Sorry about the nonsense. It is now with opDispatch (attached) To make a generic "swizzle" function we need to introduce a few traits but if all you want is a support for T[N] that is easy.
Sep 24 2011
== Quote from Andrei Alexandrescu (SeeWebsiteForEmail erdani.org)'s article

On 9/22/11 1:39 AM, Don wrote:
[snip - SIMD in the machine model, and the CTFE swizzle giving x[] = y[].swizzle!"cdcd"() as a single shufps]

I think we should put swizzle in std.numeric once and for all. Is anyone interested in taking up that task?

[snip - the BLAS1 DSL compiler, and writing an optimizer/code generator in a DSL in D]

A good argument for (a) moving stuff from the compiler into the library, (b) continuing Don's great work on making CTFE a solid proposition.

Andrei

This sounds really dangerous to me. 
I really like the idea where CTFE can be used to produce some pretty powerful almost-like-intrinsics code, but applying it in this context sounds like a really bad idea.

Firstly, so I'm not misunderstanding: is this suggestion building on Don's previous post saying that float[4] is somehow intercepted and special-cased by the compiler, reinterpreted as a candidate for hardware vector operations? I think that's a wrong decision in itself, and a poor foundation for this approach. Let me try and convince you that the language should have an explicit hardware vector type, and not attempt to make use of any clever language tricks...

If float[4] is considered a hardware vector by the compiler:
- How do I define an ACTUAL float[4]?
- How can I be confident that it actually WILL be a hardware vector?

Hardware vectors are NOT float[4]'s; they are a reference to a 128-bit hardware register upon which various vector operations may be supported, they are probably aligned, and they are only accessible in 128-bit quantities. I think they should be explicitly defined as such. They may be float4, u/int4, u/short8, u/byte16, double2... All these types are interchangeable within the one register; do you intend to special-case fixed-length arrays of all those types to support the hardware functionality for those?

Hardware vectors are NOT floats; they can not interact with the floating point unit. Dereferencing of this style, 'float x = myVector[0]', is NOT supported by the hardware, and it should not be exposed to the programmer as a trivial possibility. This seemingly harmless line of code will undermine the entire reason for using hardware vector hardware in the first place. Allowing easy access to individual floats within a hardware vector breaks the language's stated premise that the path of least resistance also be the 'correct' optimal choice, whereby a seemingly simple line of code may ruin the entire function. 
float[4] is not even a particularly conveniently sized vector for most inexperienced programmers; the majority will want float[3]. This is NOT a trivial map to float[4], and programmers should be well aware that there is inherent complexity in using the hardware vector architecture, and forced to think it through. Most inexperienced programmers think of the results of operations like dot product and magnitude as being scalar values, but they are not: they are a scalar value repeated across all 4 components of a vector4, and this should be explicit too.

....

I know I'm nobody around here, so I can't expect to be taken too seriously, but I'm very excited about the language, so here's what I would consider instead:

Add a hardware vector type, let's call it 'v128' for the exercise. It is a primitive type, aligned by definition, and does not have any members. You may use this to refer to hardware vector registers explicitly, as vector register function arguments, or as arguments to inline asm blocks.

Add some functions to the standard library (ideally implemented as compiler intrinsics) which do very specific stuff to vectors, and are ideally expandable by hardware vendors or platform holders.

You might want to have some classes in the standard library which wrap said v128, and expose the concept as a float4, int4, byte16, etc. These classes would provide maths operators, comparisons, initialisation and immediate assignment, and casts between various vector types.

Different vector units support completely different methods of permutation. I would be very careful about adding intrinsic support into the library for generalised permutation, and if so, at least leave the capability of implementing intrinsic architecture-specific permutation to the hardware vendors/platform holders. 
At the end of the day, it is imperative that the code generator and optimiser still retain the concept of hardware vectors, and can perform appropriate load/store elimination, apply hardware-specific optimisation to operations like permutes/swizzles, component broadcasts, load immediates, etc.

....

The reason I am so adamant about this is that in almost all architectures (SSE is the most tolerant by far), using the hardware vector unit is an all-or-nothing choice. If you interact between the vector and float registers, you will almost certainly end up with slower code than if you had just used the float unit outright. Also, since people usually use hardware vectors in areas of extreme performance optimisation, it's not tolerable for the compiler to be making mistakes.

As a minimum, the programmer needs to be able to explicitly address the vector registers, pass them to and from functions, and perform explicit (probably IHV-supplied) functions on them. The code generator and optimiser need all the information possible, and as explicit as possible, so IHVs can implement the best possible support for their architecture. The API should reflect this, and not allow easy access to functionality that would violate hardware support.

Ease of programming should be a SECONDARY goal, at which point something like the typed wrapper classes I described would come in, allowing maths operators, comparisons and all I mentioned above, ie. making them look like a real mathematical type, but still keeping their distance from primitive float/int types, to discourage interaction at all costs.

I hope this doesn't sound too much like an overly long rant! :) And hopefully I've managed to sell my point...

Don: I'd love to hear counter-arguments to justify float[4] as a reasonable solution. Currently, no matter how I slice it, I just can't see it. Criticism welcome?

Cheers!
- Manu
Sep 23 2011
Manu Evans:

> Hardware vectors are NOT float[4]'s, they are a reference to an 128bit hardware register upon which various vector operations may be supported, they are probably aligned, and they are only accessible in 128bit quantities. I think they should be explicitly defined as such.

I appreciate your efforts. I answer to the OP that DMD doesn't yet offer most of the things discussed in this thread. But I think that it's better to add and work on high-performance features when the basics of D are in better shape. Currently there are more basic fishes to implement or debug, like tuples syntax sugar, module system issues, const issues, inout, and so on and on (on the other hand I agree that it's OK to discuss even now D design ideas that will allow that future high performance).

What do you want to do when CPUs with 256 bit registers appear? When they grow to 512 bit? To 1024? Do you want to keep adding specific types? How many things do you want to add to D in the next 15 years of CPU evolution?

Bye,
bearophile
Sep 23 2011
== Quote from bearophile (bearophileHUGS lycos.com)'s article
> Manu Evans: I appreciate your efforts. I answer to the OP that DMD doesn't yet offer most of the things discussed in this thread. But I think that it's better to add and work on high-performance features when the basics of D are in better shape. Currently there are more basic fishes to implement or debug, like tuples syntax sugar, module system issues, const issues, inout, and so on and on (on the other hand I agree that it's OK to discuss even now D design ideas that will allow that future high performance).

I make the point because, while I agree the topics you mention are of greater immediate importance, the previous posts in this thread suggest there is already experimentation/implementation of these features happening in the language now, and if they are defined now, and defined incorrectly, it's always very difficult to go back on these decisions.

>> Hardware vectors are NOT float[4]'s, they are a reference to an 128bit hardware register upon which various vector operations may be supported, they are probably aligned, and they are only accessible in 128bit quantities. I think they should be explicitly defined as such.
> What do you want to do when CPU with 256 bit registers appear? When they grow to 512 bit? To 1024? Do you want to keep adding specific types?

Yes. I don't think it's likely to progress as you suggest though. I foresee perhaps a 4 component 64bit-word vector (256bit), and a hardware matrix. I can't see it being any less appropriate to implement a v256 in addition to v128 than a long in addition to an int. A matrix is a fundamentally different concept, and surely worthy of its own type.

> How many things do you want to add to D in the next 15 years of CPU evolution?

As many things as are universally accepted by computer hardware as a normal/standard feature. Hardware vectors definitely fit this bill. 
We've had hardware vector support in virtually every architecture for 10-15 years now, and yet there is still no language that really supports it.
Sep 23 2011
On 24.09.2011 00:47, Manu Evans wrote:
> == Quote from Andrei Alexandrescu (SeeWebsiteForEmail erdani.org)'s article
>> On 9/22/11 1:39 AM, Don wrote:
>>> On 22.09.2011 05:24, a wrote:
>>>> How would one do something like this without intrinsics (the code is c++ using gcc vector extensions):
>>> [snip]
>>> At present, you can't do it without ultimately resorting to inline asm. But, what we've done is to move SIMD into the machine model: the D machine model assumes that float[4] + float[4] is a more efficient operation than a loop. Currently, only arithmetic operations are implemented, and on DMD at least, they're still not proper intrinsics. So in the long term it'll be possible to do it directly, but not yet. At various times, several of us have implemented 'swizzle' using CTFE, giving you a syntax like: float[4] x, y; x[] = y[].swizzle!"cdcd"(); // x[0]=y[2], x[1]=y[3], x[2]=y[2], x[3]=y[3] which compiles to a single shufps instruction. That "cdcd" string is really a tiny DSL: the language consists of four characters, each of which is a, b, c, or d.
>> I think we should put swizzle in std.numeric once and for all. Is anyone interested in taking up that task?
>>> A couple of years ago I made a DSL compiler for BLAS1 operations. It was capable of doing some pretty wild stuff, even then. (The DSL looked like normal D code). But the compiler has improved enormously since that time. It's now perfectly feasible to make a DSL for the SIMD operations you need. The really nice thing about this, compared to normal asm, is that you have access to the compiler's symbol table. This lets you add compile-time error messages, for example. A funny thing about this, which I found after working on the DMD back-end, is that it is MUCH easier to write an optimizer/code generator in a DSL in D, than in a compiler back-end.
>> A good argument for (a) moving stuff from the compiler into the library, (b) continuing Don's great work on making CTFE a solid proposition.
>> Andrei
> This sounds really dangerous to me.

No, it's completely unrelated. It has nothing in common. 
I really like the idea where CTFE can be used to produce some pretty powerful almost-like-intrinsics code, but applying it in this context sounds like a really bad idea. Firstly, so I'm not misunderstanding, is this suggestion building on Don's previous post saying that float[4] is somehow intercepted and special-cased by the compiler, reinterpreting as a candidate for hardware vector operations? I think that's a wrong decision in its self, and a poor foundation for this approach. Let me try and convince you that the language should have an explicit hardware vector type, and not attempt to make use of any clever language tricks... If float[4] is considered a hardware vector by the compiler, - How to I define an ACTUAL float[4]? - How can I be confident that it actually WILL be a hardware vector?

float[4] is not considered to be a hardware vector. It is only passed as one. To pass it the C++ way, declare the parameter as float[], or pass by ref. Everything after that is the responsibility of the compiler/optimizer. A big difference compared to C++ is that, generally, it's pretty strange to pass fixed-length arrays as value parameters. At this stage we don't have any way of forcing it to be a hardware vector. We've just introduced the parameter passing and the vector operations to make it easier for the compiler to use hardware registers. Very little else is decided at this stage. You make some excellent points.

Hardware vectors are NOT float[4]'s, they are a reference to an 128bit hardware register upon which various vector operations may be supported, they are probably aligned, and they are only accessible in 128bit quantities. I think they should be explicitly defined as such. They may be float4, u/int4, u/short8, u/byte16, double2... All these types are interchangeable within the one register, do you intend to special case fixed length arrays of all those types to support the hardware functionality for those? 
Hardware vectors are NOT floats, they can not interact with the floating point unit, dereferencing of this style 'float x = myVector[0]' is NOT supported by the hardware and it should not be exposed to the programmer as a trivial possibility. This seemingly harmless line of code will undermine the entire reason for using hardware vector hardware in the first place. Allowing easy access of individual floats within a hardware vector breaks the languages stated premise that the path of least resistance also be the 'correct' optimal choice, whereby a seemingly simple line of code may ruin the entire function. float[4] is not even a particularly conveniently sized vector for most inexperienced programmers, the majority will want float[3]. This is NOT a trivial map to float[4], and programmers should be well aware that there is inherent complexity in using the hardware vector architecture, and forced to think it through. Most inexperienced programmers think of results of operations like dot product and magnitude as being scalar values, but they are not, they are a scalar value repeated across all 4 components of a vector 4, and this should be explicit too. .... I know I'm nobody around here, so I can't expect to be taken too seriously, but I'm very excited about the language, so here's what I would consider instead: Add a hardware vector type, lets call it 'v128' for the exercise. It is a primitive type, aligned by definition, and does not have any members. You may use this to refer to a hardware vector registers explicitly, as vector register function arguments, or as arguments to inline asm blocks. Add some functions to the standard library (ideally implemented as compiler intrinsics) which do very specific stuff to vectors, and ideally expandable by hardware vendors or platform holders. You might want to have some classes in the standard library which wrap said v128, and expose the concept as a float4, int4, byte16, etc. 
These classes would provide maths operators, comparisons, initialisation and immediate assignment, and casts between various vector types. Different vector units support completely different methods of permutation. I would be very careful about adding intrinsic support into the library for generalised permutation. And if so, at least leave the capability of implementing intrinsic architecture-specific permutation but the hardware vendors/platform holders. At the end of the day, it is imperative that the code generator and optimiser still retain the concept of hardware vectors, and can perform appropriate load/store elimination, apply hardware specific optimisation to operations like permutes/swizzles, component broadcasts, load immediates, etc. .... The reason I am so adamant about this, is in almost all architectures (SSE is the most tolerant by far), using the hardware vector unit is an all-or-nothing choice. If you interact between the vector and float registers, you will almost certainly result in slower code than if you just used the float unit outright. Also, since people usually use hardware vectors in areas of extreme performance optimisation, it's not tolerable for the compiler to be making mistakes. As a minimum the programmer needs to be able to explicitly address the vector registers, pass it to and from functions, and perform explicit (probably IHV supplied) functions on them. The code generator and optimiser needs all the information possible, and as explicit as possible so IHV's can implement the best possible support for their architecture. The API should reflect this, and not allow easy access to functionality that would violate hardware support. 
Ease of programming should be a SECONDARY goal, at which point something like the typed wrapper classes I described would come in, allowing maths operators, comparisons and all I mentioned above, ie, making them look like a real mathematical type, but still keeping their distance from primitive float/int types, to discourage interaction at all costs. I hope this doesn't sound too much like an overly long rant! :) And hopefully I've managed to sell my point... Don: I'd love to hear counter arguments to justify float[4] as a reasonable solution. Currently no matter how I slice it, I just can't see it. Criticism welcome? Cheers! - Manu
Sep 23 2011
Am 22.09.2011, 08:39 Uhr, schrieb Don <nospam nospam.com>:On 22.09.2011 05:24, a wrote:That's a nice fresh approach to intrinsics. I bet if other languages had the CTFE capabilities, they'd probably do the same. Sure, it is ideal if the compiler works magic here, but it takes longer to implement the right code generation in the compiler, than to write an isolated piece of library code and extensions can be added by anyone, especially since there will already be some examples to look at. Thumbs up!How would one do something like this without intrinsics (the code is c++ using gcc vector extensions):[snip] At present, you can't do it without ultimately resorting to inline asm. But, what we've done is to move SIMD into the machine model: the D machine model assumes that float[4] + float[4] is a more efficient operation than a loop. Currently, only arithmetic operations are implemented, and on DMD at least, they're still not proper intrinsics. So in the long term it'll be possible to do it directly, but not yet. At various times, several of us have implemented 'swizzle' using CTFE, giving you a syntax like: float[4] x, y; x[] = y[].swizzle!"cdcd"(); // x[0]=y[2], x[1]=y[3], x[2]=y[2], x[3]=y[3] which compiles to a single shufps instruction. That "cdcd" string is really a tiny DSL: the language consists of four characters, each of which is a, b, c, or d. A couple of years ago I made a DSL compiler for BLAS1 operations. It was capable of doing some pretty wild stuff, even then. (The DSL looked like normal D code). But the compiler has improved enormously since that time. It's now perfectly feasible to make a DSL for the SIMD operations you need. The really nice thing about this, compared to normal asm, is that you have access to the compiler's symbol table. This lets you add compile-time error messages, for example. 
A funny thing about this, which I found after working on the DMD back-end, is that it is MUCH easier to write an optimizer/code generator in a DSL in D, than in a compiler back-end.
Sep 22 2011
On 22/09/11 7:39 AM, Don wrote:On 22.09.2011 05:24, a wrote:How can it compile into a single shufps? x and y would need to already be in vector registers, and unless I've missed something, they won't be. You'll need instructions for loading into registers (using the slow movups because 16-byte alignment isn't guaranteed) then do the shufps, then load back out again. This is too slow for performance critical code. Being stored in XMM registers from creation, passed and returned in XMM registers to/from functions is a key requirement for this sort of code. If you have to keep loading in and out of memory then you lose all performance.How would one do something like this without intrinsics (the code is c++ using gcc vector extensions):[snip] At present, you can't do it without ultimately resorting to inline asm. But, what we've done is to move SIMD into the machine model: the D machine model assumes that float[4] + float[4] is a more efficient operation than a loop. Currently, only arithmetic operations are implemented, and on DMD at least, they're still not proper intrinsics. So in the long term it'll be possible to do it directly, but not yet. At various times, several of us have implemented 'swizzle' using CTFE, giving you a syntax like: float[4] x, y; x[] = y[].swizzle!"cdcd"(); // x[0]=y[2], x[1]=y[3], x[2]=y[2], x[3]=y[3] which compiles to a single shufps instruction.
Sep 22 2011
Am 22.09.2011, 19:26 Uhr, schrieb Peter Alexander <peter.alexander.au gmail.com>:On 22/09/11 7:39 AM, Don wrote:I thought about this. Either write long functions, so you don't have to load and unload often or just make the functions assume that the parameters are in registers without explicit declaration.On 22.09.2011 05:24, a wrote:How can it compile into a single shufps? x and y would need to already be in vector registers, and unless I've missed something, they won't be. You'll need instructions for loading into registers (using the slow movups because 16-byte alignment isn't guaranteed) then do the shufps, then load back out again. This is too slow for performance critical code. Being stored in XMM registers from creation, passed and returned in XMM registers to/from functions is a key requirement for this sort of code. If you have to keep loading in and out of memory then you lose all performance.How would one do something like this without intrinsics (the code is c++ using gcc vector extensions):[snip] At present, you can't do it without ultimately resorting to inline asm. But, what we've done is to move SIMD into the machine model: the D machine model assumes that float[4] + float[4] is a more efficient operation than a loop. Currently, only arithmetic operations are implemented, and on DMD at least, they're still not proper intrinsics. So in the long term it'll be possible to do it directly, but not yet. At various times, several of us have implemented 'swizzle' using CTFE, giving you a syntax like: float[4] x, y; x[] = y[].swizzle!"cdcd"(); // x[0]=y[2], x[1]=y[3], x[2]=y[2], x[3]=y[3] which compiles to a single shufps instruction.
Sep 22 2011
On 22.09.2011 20:19, Marco Leise wrote:Am 22.09.2011, 19:26 Uhr, schrieb Peter Alexander <peter.alexander.au gmail.com>:Yeah, at the moment you have to work at a higher level, you can't just do a single instruction on its own.On 22/09/11 7:39 AM, Don wrote:I thought about this. Either write long functions, so you don't have to load and unload often or just make the functions assume that the parameters are in registers without explicit declaration.On 22.09.2011 05:24, a wrote:How can it compile into a single shufps? x and y would need to already be in vector registers, and unless I've missed something, they won't be. You'll need instructions for loading into registers (using the slow movups because 16-byte alignment isn't guaranteed) then do the shufps, then load back out again. This is too slow for performance critical code. Being stored in XMM registers from creation, passed and returned in XMM registers to/from functions is a key requirement for this sort of code. If you have to keep loading in and out of memory then you lose all performance.How would one do something like this without intrinsics (the code is c++ using gcc vector extensions):[snip] At present, you can't do it without ultimately resorting to inline asm. But, what we've done is to move SIMD into the machine model: the D machine model assumes that float[4] + float[4] is a more efficient operation than a loop. Currently, only arithmetic operations are implemented, and on DMD at least, they're still not proper intrinsics. So in the long term it'll be possible to do it directly, but not yet. At various times, several of us have implemented 'swizzle' using CTFE, giving you a syntax like: float[4] x, y; x[] = y[].swizzle!"cdcd"(); // x[0]=y[2], x[1]=y[3], x[2]=y[2], x[3]=y[3] which compiles to a single shufps instruction.
Sep 23 2011
Don:Yeah, at the moment you have to work at a higher level, you can't just do a single instruction on its own.Is it possible to solve some of those problems adding something like this to D/DMD: http://www.dsource.org/projects/ldc/wiki/InlineAsmExpressions And then, what changes/work is needed to allow inlining of some functions that contain asm? I mean something like this allow_inline? http://www.dsource.org/projects/ldc/wiki/Docs#allow_inline (I have asked similar questions four times in the last two years, with no answers or comments.) Bye, bearophile
Sep 24 2011
Am 22.09.2011 02:38, schrieb Walter Bright:
>> How can I do this in a nice way in D? I'm long sick of writing unsightly vector classes in C++, but fortunately using vendor specific compiler intrinsics usually leads to decent code generation. I can currently imagine an equally ugly (possibly worse) hardware vector library in D, if it's even possible. But perhaps I've missed something here?
> Your C++ vector code should be amenable to translation to D, so that effort of yours isn't lost, except that it'd have to be in inline asm rather than intrinsics.

I recently tried that, and I couldn't do it because D has no way of aligning structs on the stack. Manually allocating the necessary aligned memory is also not always possible because it can not be done for compiler temporary variables:

    vec4 v1 = func1();
    vec4 v2 = func2();
    vec4 result = (v1 + v2) * 0.5f;

Even if I manually allocate v1, v2 and result, the temporary variable that the compiler uses to compute the expression might be unaligned. That is a total killer for SSE optimizations because you can not hide them away.

Does DMC++ have __declspec(align(16)) support?

--
Kind Regards
Benjamin Thaut
Sep 21 2011
On 9/21/2011 10:56 PM, Benjamin Thaut wrote:Even if I manually allocate v1,v2 and result, the temporary variable that the compiler uses to compute the expression might be unaligned. That is a total killer for SSE optimizations because you can not hide them away. Does DMC++ have __declspec(align(16)) support?No, but 64 bit DMD aligns the stack on 16 byte boundaries.
Sep 22 2011
== Auszug aus Walter Bright (newshound2 digitalmars.com)'s ArtikelOn 9/21/2011 10:56 PM, Benjamin Thaut wrote:Unfortunately there is no 64 bit dmd on windows.Even if I manually allocate v1,v2 and result, the temporary variable that the compiler uses to compute the expression might be unaligned. That is a total killer for SSE optimizations because you can not hide them away. Does DMC++ have __declspec(align(16)) support?No, but 64 bit DMD aligns the stack on 16 byte boundaries.
Sep 22 2011
On 22/09/11 1:38 AM, Walter Bright wrote:D doesn't have __restrict. I'm going to argue that it is unnecessary. AFAIK, __restrict is most used in writing vector operations. D, on the other hand, has a dedicated vector operation syntax: a[] += b[] * c; where a[] and b[] are required to not be overlapping, hence enabling parallelization of the operation.It's used for vector stuff, but I wouldn't say mostly. Just about any performance intensive piece of code involving pointers can benefit from __restrict. I use it in a VM for example.I don't see how this would be possible without intrinsics, or at least some form of language extension. Would DMD just *always* put float[4] in XMM registers (assuming they are available)? That doesn't seem like a good idea if you don't want to use it as a vector. BTW, if you want to get a good idea of how game programmers use vector intrinsics on current hardware, there is a good blog post about it here: http://altdevblogaday.com/2011/01/31/vectiquette/As an extension from that, why is there no hardware vector support in the language? Surely a primitive vector4 type would be a sensible thing to have?The language supports it now (see the aforementioned vector syntax), it's just that the vector code gen isn't done (currently it is just implemented using loops).
Sep 22 2011
== Quote from Walter Bright (newshound2 digitalmars.com)'s article
> D doesn't have __restrict. I'm going to argue that it is unnecessary. AFAIK, __restrict is most used in writing vector operations. D, on the other hand, has a dedicated vector operation syntax: a[] += b[] * c; where a[] and b[] are required to not be overlapping, hence enabling parallelization of the operation.

Use of __restrict is certainly not limited to your example; it's applicable basically anywhere that a pointer is dereferenced on either side of a write through any other pointer, or across a function call (since it could potentially do anything): the resident value from the previous dereference is invalidated and must be reloaded needlessly unless the pointer is explicitly marked restrict. http://cellperformance.beyond3d.com/articles/2006/05/demystifying-the-restrict-keyword.html

For RISC architectures in particular, __restrict is mandatory when optimising certain hot functions without making a mess of your code (declaring stack locals all over the place), and I think I've run into cases where even that's not enough.

> D does have some intrinsics, like sin() and cos(). They tend to get added on a strictly as-needed basis, not a speculative one. D has no current intention to replace the inline assembler with intrinsics. As for custom intrinsics, Don Clugston wrote an amazing piece of demonstration D code a while back that would take a string representing a floating point expression, and would literally compile it (using Compile Time Function Execution) and produce a string literal of inline asm functions, which were then compiled by the inline assembler. So yes, it is entirely possible and practical for end users to write custom intrinsics.

I hadn't thought of that using compile-time functions, that's really nice. I'm not sure if that'll be enough to generate good code in all cases, but I'll do some experiments and see where it goes. 
The main problem with writing (intelligently generated) inline asm vs using intrinsics, is in the context of the C (or D) source code, you don't have enough context to know about the state of the register assignment, and producing the appropriate loads/stores. Also, the opcodes selected to perform the operation may change with context. (again, specific examples are hard to fabricate, but I've had them consistently pop up over the years) Also, I think someone else said that you couldn't inline functions with inline asm? Is that correct? If so, I assume that's intended to be fixed?Are you referring to the comment about special casing a float[4]? I can see why one might reach for that as a solution, but it sounds like a really bad idea to me...As an extension from that, why is there no hardware vector support in the language? Surely a primitive vector4 type would be a sensible thing to have?The language supports it now (see the aforementioned vector syntax), it's just that the vector code gen isn't done (currently it is just implemented using loops).Do you mean that like a memcpy to the stack, or somehow intuitively using the hardware vector registers to pass arguments to the function properly?Is it possible in D currently to pass vectors to functions by value in registers? Without an intrinsic vector type, it would seem impossible.Vectors (statically dimensioned arrays) are currently passed by value (unlike C or C++).But sadly, in that case, it wouldn't work. Without an intrinsic hardware vector type, there's no way to pass vectors to functions in registers, and also, using explicit asm, you tend to end up with endless unnecessary loads and stores, and potentially a lot of redundant shuffling/permutation. This will differ radically between architectures too. I think I read in another post too that functions containing inline asm will not be inlined? How does the D compiler go at optimising code around inline asm blocks? 
Most compilers have a lot of trouble optimising around inline asm blocks, and many don't even attempt to do so... How does GDC compare to DMD? Does it do a good job? I really need to take the weekend and do a lot of experiments I think.How can I do this in a nice way in D? I'm long sick of writing unsightly vector classes in C++, but fortunately using vendor specific compiler intrinsics usually leads to decent code generation. I can currently imagine an equally ugly (possibly worse) hardware vector library in D, if it's even possible. But perhaps I've missed something here?Your C++ vector code should be amenable to translation to D, so that effort of yours isn't lost, except that it'd have to be in inline asm rather than intrinsics.
Sep 23 2011
== Quote from Manu Evans (turkeyman gmail.com)'s article
>> Your C++ vector code should be amenable to translation to D, so that effort of yours isn't lost, except that it'd have to be in inline asm rather than intrinsics.
> But sadly, in that case, it wouldn't work. Without an intrinsic hardware vector type, there's no way to pass vectors to functions in registers, and also, using explicit asm, you tend to end up with endless unnecessary loads and stores, and potentially a lot of redundant shuffling/permutation. This will differ radically between architectures too. I think I read in another post too that functions containing inline asm will not be inlined? How does the D compiler go at optimising code around inline asm blocks? Most compilers have a lot of trouble optimising around inline asm blocks, and many don't even attempt to do so... How does GDC compare to DMD? Does it do a good job? I really need to take the weekend and do a lot of experiments I think.

GDC is just the same as DMD (same runtime library implementation for vector array operations). You can define vector types in the language through use of GCC's attribute though (is a pragma in GDC), then use a union to interface between it and the corresponding static array. It's deliberately UGLY and PRONE to you hitting lots of brick walls if you don't handle them in a very specific way though. 
:~)

Stock example:

    pragma(attribute, vector_size()) typedef float __v4sf_t;

    union __v4sf
    {
        float[4] f;
        __v4sf_t v;
    }

    __v4sf a = {[1,2,3,4]}, b = {[1,2,3,4]}, c;
    c.v = a.v + b.v;
    assert(c.f == [2,4,6,8]);

The assignment compiles down to ~5 instructions:

    movaps -0x88(%ebp),%xmm1
    movaps -0x78(%ebp),%xmm0
    addps  %xmm1,%xmm0
    movaps %xmm0,-0x68(%ebp)
    flds   -0x68(%ebp)

And is far quicker than c[] = a[] + b[] due to it being inlined, and not an external library call.

Regards
Iain
Sep 24 2011
On 24 September 2011 15:37, Iain Buclaw <ibuclaw ubuntu.com> wrote:
> GDC is just the same as DMD (same runtime library implementation for vector array operations). You can define vector types in the language through use of GCC's attribute though (is a pragma in GDC), then use a union to interface between it and the corresponding static array. It's deliberately UGLY and PRONE to you hitting lots of brick walls if you don't handle them in a very specific way though. :~)
> [snip example]

Nice!

Is there an IRC channel, or anywhere for realtime D discussion? I'm interested in trying to build some GDC cross compilers, and perhaps contributing to the standard library on a few embedded systems, but I have a lot of little questions and general things that don't suit a mailing list... Perhaps some IM? It seems to me that you are the authority on GDC implementation and support...
Sep 24 2011
On Sat, 24 Sep 2011 16:50:39 +0300, Manu <turkeyman gmail.com> wrote:
> Nice! Is there an IRC channel, or anywhere for realtime D discussion? I'm interested in trying to build some GDC cross compilers, and perhaps contributing to the standard library on a few embedded systems, but I have a lot of little questions and general things that don't suit a mailing list... Perhaps some IM? It seems to me that you are the authority on GDC implementation and support...

We all know where that leads: first it's IM, then a phone number, and finally ...
Sep 24 2011
On 2011-09-24 16:50:39 +0300, Manu said:
> Is there an IRC channel, or anywhere for realtime D discussion?

There is a #d channel for general D discussions and #d.gdc for GDC related topics on irc.freenode.org.
Sep 24 2011
== Quote from Manu (turkeyman gmail.com)'s article
> Hello D community.
>
> I've been reading a lot about D lately. I have known it existed for ages, but for some reason never even took a moment to look into it. The more I looked into it, the more I realise, this is the language I want. C(/C++) has been ruined, far beyond salvation. D seems to be the reboot that it desperately needs.
>
> Anyway, I work in the games industry, 10 years in cross platform console games at major studios. Sadly, I don't think Microsoft, Sony, Nintendo, Apple, Google (...maybe google) will support D any time soon, but I've started some after-hours game projects to test D in some real gamedev environments. So far I have these (critical) questions.
>
> Pointer aliasing... C implementations use a non-standard __restrict keyword to state that a given pointer will not be aliased by any other pointer. This is critical in some pieces of code to eliminate redundant loads and stores, particularly important on RISC architectures like PPC. How does D address pointer aliasing? I can't imagine the compiler has any way to detect that pointer aliasing is not possible in certain cases; many cases are just far too complicated. Is there a keyword? Or plans? This is critical for realtime performance.
>
> C implementations often use compiler intrinsics to implement architecture provided functionality rather than inline asm; the reason is that intrinsics allow the compiler to generate better code with knowledge of the context. Inline asm can't really be transformed appropriately to suit the context in some situations, whereas intrinsics operate differently, and run vendor specific logic to produce the code more intelligently. How does D address this? What options/possibilities are available to the language? Hooks for vendors to implement intrinsics for custom hardware?

The DMD compiler has some basic intrinsics; other compilers build upon this using their own backends.
ie: GCC has hundreds of builtins, including some target builtins where intrinsic types are mappable to D types (__float80 -> real).

> Is the D assembler a macro assembler? (ie, assigns registers automatically and manages loads/stores intelligently?) I haven't seen any non-x86 examples of the D assembler, and I think it's fair to say that x86 is the single most unnecessary architecture to write inline assembly for. Are there PowerPC or ARM examples anywhere?
>
> As an extension from that, why is there no hardware vector support in the language? Surely a primitive vector4 type would be a sensible thing to have? Is it possible in D currently to pass vectors to functions by value in registers? Without an intrinsic vector type, it would seem impossible. In addition to that, writing a custom Vector4 class to make use of VMX, SSE, ARM VFP, PSP VFPU, MIPS 'Vector Units', SH4 DR regs, etc, wrapping functions around inline asm blocks is always clumsy and far from optimal. The compiler (code generator and probably the optimiser) needs to understand the concepts of vectors to make good use of the hardware. How can I do this in a nice way in D? I'm long sick of writing unsightly vector classes in C++, but fortunately using vendor specific compiler intrinsics usually leads to decent code generation. I can currently imagine an equally ugly (possibly worse) hardware vector library in D, if it's even possible. But perhaps I've missed something here?

I would imagine it should now be possible to use GCC vector builtins with the GDC compiler. That is, given that I manage to get round to turning these routines on. :~)

> I'd love to try out D on some console systems. Fortunately there are some great home-brew scenes available for a bunch of slightly older consoles; PSP/PS2 (MIPS), XBox1 (embedded x86), GameCube/Wii (PPC), Dreamcast (SH4). They all have GCC compilers maintained by the community. How difficult will it be to make GDC work with those toolchains?
> Sadly I know nothing about configuring GCC, so I can't really help here. What about Android (or iPhone, but apple's 'x-code policy' prevents that)? I'd REALLY love to write an android project in D... the toolchain is GCC; I see no reason why it shouldn't be possible to write an android app if an appropriate toolchain was available?
>
> Sorry it's a bit long, thanks for reading this far! I'm looking forward to a brighter future writing lots of D code :P But I need to know basically all these questions are addressed before I could consider it for serious commercial game dev.

Someone has recently confirmed D working just fine on the Alpha platform. For D2, your biggest showstopper is the runtime library. There are many gaps to fill to port druntime to your preferred architecture.

Regards
Sep 21 2011
Manu Wrote:
> I'd love to try out D on some console systems. Fortunately there are some great home-brew scenes available for a bunch of slightly older consoles; PSP/PS2 (MIPS), XBox1 (embedded x86), GameCube/Wii (PPC), Dreamcast (SH4). They all have GCC compilers maintained by the community. How difficult will it be to make GDC work with those toolchains? Sadly I know nothing about configuring GCC, so I can't really help here.

http://pspemu.soywiz.com/2011/07/fourth-release-d-pspemu-r301.html

Maybe this man can be of some help to you.
Sep 22 2011