digitalmars.D.ldc - LDC inline assembly expressions

NaN <divide by.zero> writes:
OK, so I'm delving into LDC intrinsics and have hit a wall: I can't find 
the

_mm_cmpgt_epi32

intrinsic anywhere. It looks like it's not included, not in the 
gcc_builtins or anywhere else. I'm using the wrapper lib that 
gives you Intel-style intrinsics from

https://github.com/AuburnSounds/intel-intrinsics

And from what I can tell, if it's not an LLVM intrinsic and not 
in the gcc builtins, you're out of luck. So I wondered if I could use 
inline assembly expressions, but I'm obviously missing something. 
I've got as far as...

// load dst into EAX and return it
int4 _mm_cmpgt_epi32(int4 a, int4 b) {
   return __asm!int4("pcmpgtd $1,$0", "=&r,r,r", a, b);
}

but get the error...

error: non-trivial scalar-to-vector conversion, possible invalid 
constraint for vector type

Compiler returned: 1
Dec 22 2018
NaN <divide by.zero> writes:
Ignore the comment line in my code above; that's just a leftover copy-and-paste from the wiki example.
Dec 22 2018
kinke <noone nowhere.com> writes:
On Sunday, 23 December 2018 at 01:27:40 UTC, NaN wrote:
 int4 _mm_cmpgt_epi32(int4 a, int4 b) {
   return __asm!int4("pcmpgtd $1,$0", "=&r,r,r", a, b);
 }
This is a working variant (`r` is a GP register, `x` a vector register for x86, see https://llvm.org/docs/LangRef.html#supported-constraint-code-list):

int4 _mm_cmpgt_epi32(int4 a, int4 b) {
   int4 r = void;
   __asm("pcmpgtd $1,$0; movdqa $0,$2", "x,x,*m", a, b, &r);
   return r;
}
Dec 23 2018
NaN <divide by.zero> writes:
Hi, thanks. I was just coming here to post the solution, as I figured it out after following the link to the LLVM docs. Is there any difference between using this and the other method of doing intrinsics?
Dec 23 2018
kinke <noone nowhere.com> writes:
On Sunday, 23 December 2018 at 13:00:54 UTC, NaN wrote:
 Is there any difference between using this vs the other method 
 of doing intrinsics?
Assuming there's really no LLVM intrinsic for your desired instruction, the manual variant is what it is, a regular function with an inline asm expression. I guess the LLVM backends lower calls to these instruction-intrinsics directly to inline asm expressions in the caller. With inlining, it might result in equivalent final asm.

My version above with the memory indirection isn't nice, this is better:

extern(C) int4 _mm_cmpgt_epi32(int4 a, int4 b) {
   return __asm!int4("pcmpgtd $2,$1", "={xmm0},{xmm0},{xmm1}", a, b);
}

and is going to be inlined with `-O`.

Note that if you used equivalent naked DMD-style inline asm instead, e.g.,

extern(C) int4 _mm_cmpgt_epi32(int4 a, int4 b) {
   asm {
     naked;
     pcmpgtd XMM0, XMM1;
     ret;
   }
}

that is lowered to *module*-level inline asm and the function is NOT inline-able.
Dec 23 2018
NaN <divide by.zero> writes:
On Sunday, 23 December 2018 at 13:33:51 UTC, kinke wrote:
 My version above with the memory indirection isn't nice, this 
 is better:

 extern(C) int4 _mm_cmpgt_epi32(int4 a, int4 b) {
   return __asm!int4("pcmpgtd $2,$1", "={xmm0},{xmm0},{xmm1}", a, b);
 }

 and is going to be inlined with `-O`.
That's pretty much what I've got. I've been using Compiler Explorer so I can see what actually gets generated. It's been quite an eye-opener how good the LLVM optimizer is, to be honest.
 Note that if you used equivalent naked DMD-style inline asm 
 instead, e.g.,

 extern(C) int4 _mm_cmpgt_epi32(int4 a, int4 b) {
   asm {
     naked;
     pcmpgtd XMM0, XMM1;
     ret;
   }
 }

 that is lowered to *module*-level inline asm and the function 
 is NOT inline-able.
I'm ignoring DMD since it kills performance by about 60% anyway.
Dec 23 2018
NaN <divide by.zero> writes:
On Sunday, 23 December 2018 at 13:33:51 UTC, kinke wrote:
 My version above with the memory indirection isn't nice, this 
 is better:

 extern(C) int4 _mm_cmpgt_epi32(int4 a, int4 b) {
   return __asm!int4("pcmpgtd $2,$1", "={xmm0},{xmm0},{xmm1}", a, b);
 }
So I had this:

__m128i _mm_cmpgt_epi32(__m128i a, __m128i b) {
   return __asm!__m128i("pcmpgtd $2,$1","=x,x,x",a,b);
}

It looked OK at first, but it's actually wrong: the compare instruction writes to $1, which is actually 'a', and it doesn't write anything to $0, which is the return value. So it overwrites one of the inputs and doesn't write the output. So it actually needs to be this:

__m128i _mm_cmpgt_epi32(__m128i a, __m128i b) {
   return __asm!int4("
      movdqu $1,$0
      pcmpgtd $2,$0",
      "=x,x,x", a,b);
}

Basically, copy 'a' to the output, then do the compare with 'b' and the output.

I don't think there's any way to get around the temporary copy, since it depends on knowing whether 'a' is ever used after it's used in the compare. And it doesn't seem like the optimiser can cull it away in this case.
Dec 23 2018
NaN <divide by.zero> writes:
On Monday, 24 December 2018 at 01:40:42 UTC, NaN wrote:
 I don't think there's any way to get around the temporary copy, 
 since it depends on knowing whether 'a' is ever used after it's 
 used in the compare. And it doesn't seem like the optimiser can 
 cull it away in this case.
OK, I think I've figured it out:

__m128i _mm_cmpgt_epi32(__m128i a, __m128i b) {
   return __asm!__m128i("pcmpgtd $2,$0","=x,0,x",a,b);
}

Basically:

$0 is the return value; the constraint '=x' means it's the output and uses an xmm register.
$1 is 'a'; the constraint '0' means this param uses the same register as $0.
$2 is 'b'; the constraint 'x' means this uses an xmm register.

It's also AT&T syntax, so the operands are reversed from what I'm used to.

Although $1 does not appear in the asm string, it has been tied to $0 by the '0' constraint. So as far as the compiler is concerned, 'a' comes in on the same register as the output goes out in. Knowing this, it can create a temporary copy of 'a' if it needs to avoid trashing 'a'. I've done some tests, and if you do

r = _mm_cmpgt_epi32(a,b)

it only creates the temporary if you use 'a' again afterwards. So it's all working, I think.
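
The tied-constraint approach described above can be sketched as a small self-contained program. This is only a sketch under assumptions: it assumes LDC targeting x86/x86-64 with SSE2, uses `int4` from `core.simd` in place of `__m128i`, and the function name `cmpgtEpi32` and the test values are made up for illustration.

```d
import core.simd : int4;
import ldc.llvmasm;   // LDC-specific: provides __asm inline assembly expressions

// Hypothetical wrapper name; mirrors the "=x,0,x" version in the post.
int4 cmpgtEpi32(int4 a, int4 b)
{
    // "=x": $0, the result, lives in an XMM register
    // "0" : $1 ('a') is tied to the same register as $0
    // "x" : $2 ('b') lives in some XMM register
    // AT&T operand order: pcmpgtd src, dst  =>  dst = (dst > src) ? -1 : 0
    return __asm!int4("pcmpgtd $2,$0", "=x,0,x", a, b);
}

void main()
{
    int4 a = [1, 5, -3, 7];
    int4 b = [0, 5, 2, -7];
    int4 r = cmpgtEpi32(a, b);
    // Each lane is all ones (-1) where a > b, otherwise 0.
    assert(r.array == [-1, 0, 0, -1]);
    // 'a' is still usable afterwards; the compiler copies it if needed.
    assert(a.array == [1, 5, -3, 7]);
}
```

Because the input is declared as tied to the output rather than silently overwritten, the register allocator is free to decide whether a copy of 'a' is required.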
Dec 23 2018
Guillaume Piolat <first.last gmail.com> writes:
Just a note that after you posted here, the intrinsic has been implemented in the "intel-intrinsics" package through ldc.simd:

https://github.com/AuburnSounds/intel-intrinsics/blob/fa3866dc782b0d2c4a567f6547bdc0b321ada8cc/source/inteli/emmintrin.d#L293

It generates pcmpgtd:

https://d.godbolt.org/z/ronCG_
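
For comparison with the inline asm versions earlier in the thread, here is a minimal sketch of a comparison written through `ldc.simd` instead. It assumes the `greaterMask` template available in LDC's `ldc.simd` module; the function name `cmpgt` and the test values are hypothetical, and this is not necessarily the exact code used in intel-intrinsics.

```d
import core.simd : int4;
import ldc.simd : greaterMask;   // LDC-specific module

// Hypothetical wrapper: element-wise signed "greater than" mask.
int4 cmpgt(int4 a, int4 b)
{
    // Each result lane has all bits set (-1) where a > b, otherwise 0.
    return greaterMask!int4(a, b);
}

void main()
{
    int4 a = [1, 5, -3, 7];
    int4 b = [0, 5, 2, -7];
    int4 r = cmpgt(a, b);
    assert(r.array == [-1, 0, 0, -1]);
}
```

Since this goes through ordinary LLVM vector operations rather than an asm string, the optimizer can see through it, and no register constraints need to be written by hand.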
Jan 01