digitalmars.D.ldc - LCD inline assembly expressions

NaN <divide by.zero> writes:

Ok, so I'm delving into LDC intrinsics and hit a wall: I can't find 
the

_mm_cmpgt_epi32

intrinsic anywhere. It looks like it's not included, neither in the 
gcc_builtins nor anywhere else. I'm using the wrapper lib that 
gives you Intel-style intrinsics from

https://github.com/AuburnSounds/intel-intrinsics

And from what I can tell, if it's not an LLVM intrinsic and not 
in the gcc builtins, you're out of luck. So I wondered if I could use 
inline assembly expressions, but I'm obviously missing something. 
I've got as far as...

// load dst into EAX and return it
int4 _mm_cmpgt_epi32(int4 a, int4 b) {
   return __asm!int4("pcmpgtd $1,$0", "=&r,r,r", a, b);
}

but get the error...

error: non-trivial scalar-to-vector conversion, possible invalid 
constraint for vector type

Compiler returned: 1

Dec 22 2018

NaN <divide by.zero> writes:

Ignore the comment line in the code (`// load dst into EAX and return 
it`); that's just a leftover copy and paste from the wiki example.

Dec 22 2018

kinke <noone nowhere.com> writes:

On Sunday, 23 December 2018 at 01:27:40 UTC, NaN wrote:
 int4 _mm_cmpgt_epi32(int4 a, int4 b) {
   return __asm!int4("pcmpgtd $1,$0", "=&r,r,r", a, b);
 }

This is a working variant (`r` is a GP register, `x` a vector 
register for x86, see 
https://llvm.org/docs/LangRef.html#supported-constraint-code-list):

int4 _mm_cmpgt_epi32(int4 a, int4 b) {
   int4 r = void;
   __asm("pcmpgtd $1,$0; movdqa $0,$2", "x,x,*m", a, b, &r);
   return r;
}
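
To make the difference between the two constraint letters concrete, here is a minimal sketch, assuming an x86-64 target with SSE2 and LDC's `ldc.llvmasm` module. The function names `copyScalar` and `copyVector` are made up for illustration and are not part of any library. A scalar can live in a general-purpose register (`r`), while a vector operand needs an XMM register (`x`):

```d
import core.simd : int4;
import ldc.llvmasm;

// Scalar operand: "r" asks for a general-purpose register.
// $0 is the output, $1 the input (AT&T order: source, destination).
int copyScalar(int v)
{
    return __asm!int("movl $1, $0", "=r,r", v);
}

// Vector operand: "x" asks for an SSE (XMM) register.
// Using "r" here is what triggers the scalar-to-vector conversion error.
int4 copyVector(int4 v)
{
    return __asm!int4("movdqa $1, $0", "=x,x", v);
}

void main()
{
    assert(copyScalar(42) == 42);

    int4 v = [1, 2, 3, 4];
    int4 r = copyVector(v);
    assert(r.array == [1, 2, 3, 4]);
}
```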

Dec 23 2018

NaN <divide by.zero> writes:

On Sunday, 23 December 2018 at 12:54:01 UTC, kinke wrote:
 On Sunday, 23 December 2018 at 01:27:40 UTC, NaN wrote:
 int4 _mm_cmpgt_epi32(int4 a, int4 b) {
   return __asm!int4("pcmpgtd $1,$0", "=&r,r,r", a, b);
 }

 This is a working variant (`r` is a GP register, `x` a vector 
 register for x86, see 
 https://llvm.org/docs/LangRef.html#supported-constraint-code-list):

 int4 _mm_cmpgt_epi32(int4 a, int4 b) {
   int4 r = void;
   __asm("pcmpgtd $1,$0; movdqa $0,$2", "x,x,*m", a, b, &r);
   return r;
 }

Hi, thanks. I was just coming here to post the solution, as I 
figured it out after following the link to the LLVM docs.

Is there any difference between using this and the other method of 
doing intrinsics?

Dec 23 2018

kinke <noone nowhere.com> writes:

On Sunday, 23 December 2018 at 13:00:54 UTC, NaN wrote:
 Is there any difference between using this vs the other method 
 of doing intrinsics?

Assuming there's really no LLVM intrinsic for your desired 
instruction, the manual variant is what it is, a regular function 
with an inline asm expression. I guess the LLVM backends lower 
calls to these instruction-intrinsics directly to inline asm 
expressions in the caller. With inlining, it might result in 
equivalent final asm.

My version above with the memory indirection isn't nice, this is 
better:

extern(C) int4 _mm_cmpgt_epi32(int4 a, int4 b) {
   return __asm!int4("pcmpgtd $2,$1", "={xmm0},{xmm0},{xmm1}", a, b);
}

and is going to be inlined with `-O`.

Note that if you used equivalent naked DMD-style inline asm 
instead, e.g.,

extern(C) int4 _mm_cmpgt_epi32(int4 a, int4 b) {
   asm {
     naked;
     pcmpgtd XMM0, XMM1;
     ret;
   }
}

that is lowered to *module*-level inline asm and the function is 
NOT inline-able.

Dec 23 2018

NaN <divide by.zero> writes:

On Sunday, 23 December 2018 at 13:33:51 UTC, kinke wrote:
 On Sunday, 23 December 2018 at 13:00:54 UTC, NaN wrote:
 Is there any difference between using this vs the other method 
 of doing intrinsics?

 Assuming there's really no LLVM intrinsic for your desired 
 instruction, the manual variant is what it is, a regular 
 function with an inline asm expression. I guess the LLVM 
 backends lower calls to these instruction-intrinsics directly 
 to inline asm expressions in the caller. With inlining, it 
 might result in equivalent final asm.

 My version above with the memory indirection isn't nice, this 
 is better:

 extern(C) int4 _mm_cmpgt_epi32(int4 a, int4 b) {
   return __asm!int4("pcmpgtd $2,$1", "={xmm0},{xmm0},{xmm1}", a, b);
 }

 and is going to be inlined with `-O`.

That's pretty much what I've got. I've been using Compiler 
Explorer so I can see what actually gets generated. It's been quite an 
eye-opener how good the LLVM optimizer is, tbh.

 Note that if you used equivalent naked DMD-style inline asm 
 instead, e.g.,

 extern(C) int4 _mm_cmpgt_epi32(int4 a, int4 b) {
   asm {
     naked;
     pcmpgtd XMM0, XMM1;
     ret;
   }
 }

 that is lowered to *module*-level inline asm and the function 
 is NOT inline-able.

I'm ignoring DMD since it kills performance by about 60% anyway.

Dec 23 2018

NaN <divide by.zero> writes:

On Sunday, 23 December 2018 at 13:33:51 UTC, kinke wrote:
 On Sunday, 23 December 2018 at 13:00:54 UTC, NaN wrote:
 Is there any difference between using this vs the other method 
 of doing intrinsics?

 Assuming there's really no LLVM intrinsic for your desired 
 instruction, the manual variant is what it is, a regular 
 function with an inline asm expression. I guess the LLVM 
 backends lower calls to these instruction-intrinsics directly 
 to inline asm expressions in the caller. With inlining, it 
 might result in equivalent final asm.

 My version above with the memory indirection isn't nice, this 
 is better:

 extern(C) int4 _mm_cmpgt_epi32(int4 a, int4 b) {
   return __asm!int4("pcmpgtd $2,$1", "={xmm0},{xmm0},{xmm1}", a, b);
 }

So I had this...

__m128i _mm_cmpgt_epi32(__m128i a, __m128i b) {
   return __asm!__m128i("pcmpgtd $2,$1","=x,x,x",a,b);
}

It looked OK at first but it's actually wrong: the compare instruction 
writes to $1, which is actually 'a', and it doesn't write anything 
to $0, which is the return value. So it overwrites one of the inputs 
and doesn't write the output. So it actually needs to be this...

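__m128i _mm_cmpgt_epi32(__m128i a, __m128i b) {
     // "=&x" (early clobber) keeps $0 out of the input registers,
     // since $0 is written before $2 is read.
     return __asm!__m128i("
         movdqu $1,$0
         pcmpgtd $2,$0",
         "=&x,x,x", a, b);
}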

Basically, copy 'a' to the output, then do the compare with 'b' 
and the output.

I don't think there's any way to get around the temporary copy, 
since it depends on knowing whether 'a' is ever used after it's used in 
the compare. And it doesn't seem like the optimiser can cull it 
away in this case.

Dec 23 2018

NaN <divide by.zero> writes:

On Monday, 24 December 2018 at 01:40:42 UTC, NaN wrote:
 I don't think there's any way to get around the temporary copy, 
 since it depends on knowing whether 'a' is ever used after it's used 
 in the compare. And it doesn't seem like the optimiser can cull 
 it away in this case.

OK, I think I've figured it out...

__m128i _mm_cmpgt_epi32(__m128i a, __m128i b) {
   return __asm!__m128i("pcmpgtd $2,$0","=x,0,x",a,b);
}

Basically:

$0 is the return value; the constraint '=x' means it's an output and 
uses an XMM register.
$1 is 'a'; the constraint '0' means this parameter uses the same 
register as $0.
$2 is 'b'; the constraint 'x' means it uses an XMM register.

It's also AT&T syntax, so the operands are reversed compared to what 
I'm used to: source first, destination last.

Although $1 does not appear in the asm string, it has been tied 
to $0 by the '0' constraint. So as far as the compiler is 
concerned, 'a' comes in on the same register the output goes 
out in. Knowing this, it can create a temporary copy of 'a' if 
it needs to avoid trashing 'a'.

I've done some tests, and if you do...

r = _mm_cmpgt_epi32(a,b)

it only creates the temporary if you use 'a' again afterwards.

So it's all working, I think.
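
For anyone who wants to try the tied-register version in isolation, here is a minimal self-contained sketch, assuming an x86-64 target with SSE2 and LDC's `ldc.llvmasm` module. The name `cmpgtEpi32` is hypothetical, chosen so it doesn't clash with the intel-intrinsics symbol, and it uses `int4` directly instead of the `__m128i` alias:

```d
import core.simd : int4;
import ldc.llvmasm;

// Hypothetical stand-in for _mm_cmpgt_epi32.
int4 cmpgtEpi32(int4 a, int4 b)
{
    // "=x": $0 is the result, in an XMM register
    // "0" : $1 ('a') is tied to the register of $0
    // "x" : $2 ('b') is in any XMM register
    // AT&T order: pcmpgtd src, dst  =>  dst = (dst > src) ? -1 : 0 per lane
    return __asm!int4("pcmpgtd $2, $0", "=x,0,x", a, b);
}

void main()
{
    int4 a = [5, -1, 7, 0];
    int4 b = [3, 2, 7, -4];
    int4 r = cmpgtEpi32(a, b);

    // Lanes where a > b (signed) are all ones, i.e. -1; the rest are 0.
    assert(r.array == [-1, 0, 0, -1]);

    // 'a' is still intact after the call; the compiler copies it if needed.
    assert(a.array == [5, -1, 7, 0]);
}
```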

Dec 23 2018

Guillaume Piolat <first.last gmail.com> writes:

On Sunday, 23 December 2018 at 01:27:40 UTC, NaN wrote:
 Ok, so I'm delving into LDC intrinsics and hit a wall: I can't 
 find the

 _mm_cmpgt_epi32

 intrinsic anywhere. It looks like it's not included, neither in the 
 gcc_builtins nor anywhere else. I'm using the wrapper lib that 
 gives you Intel-style intrinsics from

 https://github.com/AuburnSounds/intel-intrinsics

 And from what I can tell, if it's not an LLVM intrinsic and not 
 in the gcc builtins, you're out of luck. So I wondered if I could use 
 inline assembly expressions, but I'm obviously missing something. 
 I've got as far as...

Just a note that after you posted here, the intrinsic has been 
implemented in the "intel-intrinsics" package through ldc.simd:

https://github.com/AuburnSounds/intel-intrinsics/blob/fa3866dc782b0d2c4a567f6547bdc0b321ada8cc/source/inteli/emmintrin.d#L293

It generates pcmpgtd https://d.godbolt.org/z/ronCG_
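
As a rough idea of what the ldc.simd route looks like without inline asm, here is a minimal sketch. It assumes your LDC version ships the `greaterMask` template in `ldc.simd`; the wrapper name `cmpgtEpi32` is hypothetical and this is not a copy of the library's implementation:

```d
import core.simd : int4;
import ldc.simd : greaterMask;

// Hypothetical wrapper; not the intel-intrinsics source.
int4 cmpgtEpi32(int4 a, int4 b)
{
    // Lane-wise signed comparison a > b.
    // True lanes come back as all ones (-1), false lanes as 0.
    return cast(int4) greaterMask!int4(a, b);
}

void main()
{
    int4 a = [5, -1, 7, 0];
    int4 b = [3, 2, 7, -4];
    int4 r = cmpgtEpi32(a, b);
    assert(r.array == [-1, 0, 0, -1]);
}
```

Since this goes through a regular LLVM vector compare rather than an opaque asm string, the optimizer is free to fold or combine it with surrounding code.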

Jan 01 2019
