
digitalmars.D.announce - intel-intrinsics v1.0.0

Guillaume Piolat <first.last gmail.com> writes:
"intel-intrinsics" is a DUB package for people interested in x86 
performance that want neither to write assembly, nor a 
LDC-specific snippet... and still have fastest possible code.

Available through DUB: 
http://code.dlang.org/packages/intel-intrinsics


*** Features of v1.0.0:

- All intrinsics in this list: 
https://software.intel.com/sites/landingpage/IntrinsicsGuide/#techs=MMX,SSE,SSE2
Uses the existing Intel documentation and syntax

- Write the same code for both DMD and LDC, across the last 6 
versions of each. (Note that debug performance might suffer a 
lot when no inlining is activated.)

- Use operators on SIMD vectors as if core.simd were implemented 
on DMD 32-bit

- Introduces int2 and float2 because short SIMD vectors are useful

- about 6000 LOC (for now! more to come)

- Bonus: approximate pow/exp/log, computing 4 approximate pow 
values at once.
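To give a sense of the API, here is a minimal usage sketch (assuming the package is listed as a DUB dependency; module and function names follow the Intel Intrinsics Guide, and element access through `.array` mirrors core.simd vectors):

```d
import inteli.xmmintrin; // SSE intrinsics provided by intel-intrinsics

void main()
{
    // Add two 4-float vectors with a single SSE operation.
    __m128 a = _mm_setr_ps(1.0f, 2.0f, 3.0f, 4.0f);
    __m128 b = _mm_set1_ps(10.0f);
    __m128 sum = _mm_add_ps(a, b);

    // Read elements back like a core.simd vector.
    assert(sum.array[0] == 11.0f);
    assert(sum.array[3] == 14.0f);
}
```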


<future>
The long-term goal for this library is to be _only about 
semantics_, and not particularly codegen(!). This is because LLVM 
IR is portable, so forcing a particular instruction undoes 
this portability work. **This can seem odd** for an "intrinsics" 
library, but this way the exact codegen options can be chosen by 
the library user, and most intrinsics can gracefully degrade to 
portable IR in theory.

In the future, "magic" LLVM intrinsics will only be used when 
built for x86, but I think all of it can become portable and not 
x86-specific. Besides, there is a trend in LLVM to remove magic 
intrinsics once they are doable with IR only.
</future>


tl;dr you can use "intel-intrinsics" today and get near-optimal 
code with LDC, without duplication. You may come across early 
bugs too.
http://code.dlang.org/packages/intel-intrinsics

(note: it's important to benchmark against vanilla D code or 
array ops too; in some cases the vanilla code wins)
Feb 05
Simen Kjærås <simen.kjaras gmail.com> writes:
On Wednesday, 6 February 2019 at 01:05:29 UTC, Guillaume Piolat 
wrote:
 "intel-intrinsics" is a DUB package for people interested in 
 x86 performance that want neither to write assembly, nor a 
 LDC-specific snippet... and still have fastest possible code.
Neat. Question: On GitHub it's stated that implicit conversions 
aren't supported, with this example:

    __m128i b = _mm_set1_epi32(42);
    __m128 a = b; // NO, only works in LDC

Couldn't this be solved through something like this:

    struct __m128 {
        float4 value;
        alias value this;
        void opAssign(__m128i rhs) {
            value = cast(float4)rhs.value;
        }
    }

--
  Simen
Feb 05
Guillaume Piolat <first.last gmail.com> writes:
On Wednesday, 6 February 2019 at 07:41:25 UTC, Simen Kjærås wrote:
 struct __m128 {
     float4 value;
     alias value this;
     void opAssign(__m128i rhs) {
         value = cast(float4)rhs.value;
     }
 }

 --
   Simen
The problem is that when you emulate core.simd (DMD 32-bit on 
Windows requires that, if you want super fast OPTLINK build 
times), you have no way to get user-defined implicit conversions.

The magic vector types from the compiler (float4 / int4 / short8 
/ long2 / byte16) are all implicitly convertible to each other, 
but I don't think we can replicate this.
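The distinction can be sketched with plain structs (a toy model with arrays, no real SIMD; all names here are illustrative): a converting constructor and opAssign cover initialization and assignment, but nothing covers implicit conversion at a call site, which the compiler's magic vector types get for free.

```d
struct Int4 { int[4] v; }

struct Float4
{
    float[4] v;

    // Covers initialization: Float4 a = b; is rewritten to Float4(b).
    this(Int4 rhs) { foreach (i; 0 .. 4) v[i] = cast(float) rhs.v[i]; }

    // Covers assignment: a = b;
    void opAssign(Int4 rhs) { foreach (i; 0 .. 4) v[i] = cast(float) rhs.v[i]; }
}

void takesFloat4(Float4 x) {}

void main()
{
    Int4 b = Int4([1, 2, 3, 4]);
    Float4 a = b;  // OK: converting constructor
    a = b;         // OK: opAssign
    assert(a.v[2] == 3.0f);
    // takesFloat4(b); // error: D structs have no user-defined implicit
    //                 // conversions at call sites, unlike magic vector types
}
```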
Feb 06
NaN <divide by.zero> writes:
On Wednesday, 6 February 2019 at 01:05:29 UTC, Guillaume Piolat 
wrote:
 "intel-intrinsics" is a DUB package for people interested in 
 x86 performance that want neither to write assembly, nor a 
 LDC-specific snippet... and still have fastest possible code.

 Available through DUB: 
 http://code.dlang.org/packages/intel-intrinsics
Big thanks for this, it's been a massive help for me. cheers!
Feb 08
Guillaume Piolat <first.last gmail.com> writes:
On Friday, 8 February 2019 at 12:22:14 UTC, NaN wrote:
 On Wednesday, 6 February 2019 at 01:05:29 UTC, Guillaume Piolat 
 wrote:
 "intel-intrinsics" is a DUB package for people interested in 
 x86 performance that want neither to write assembly, nor a 
 LDC-specific snippet... and still have fastest possible code.

 Available through DUB: 
 http://code.dlang.org/packages/intel-intrinsics
Big thanks for this, it's been a massive help for me. cheers!
You're welcome! I'd be interested to know what you are making with it, to feed the "users" list! https://github.com/AuburnSounds/intel-intrinsics/blob/master/README.md
Feb 08
NaN <divide by.zero> writes:
On Friday, 8 February 2019 at 12:39:22 UTC, Guillaume Piolat 
wrote:
 On Friday, 8 February 2019 at 12:22:14 UTC, NaN wrote:
 On Wednesday, 6 February 2019 at 01:05:29 UTC, Guillaume 
 Piolat wrote:
 "intel-intrinsics" is a DUB package for people interested in 
 x86 performance that want neither to write assembly, nor a 
 LDC-specific snippet... and still have fastest possible code.

 Available through DUB: 
 http://code.dlang.org/packages/intel-intrinsics
Big thanks for this, it's been a massive help for me. cheers!
You're welcome! I'd be interested to know what you are making with it, to feed the "users" list! https://github.com/AuburnSounds/intel-intrinsics/blob/master/README.md
I'm the guy from #graphics who's writing a software rasterizer. I'll let you know when I put it on GitHub.
Feb 08
Crayo List <crayolist gmail.com> writes:
On Wednesday, 6 February 2019 at 01:05:29 UTC, Guillaume Piolat 
wrote:
 "intel-intrinsics" is a DUB package for people interested in 
 x86 performance that want neither to write assembly, nor a 
 LDC-specific snippet... and still have fastest possible code.

 [...]
This is really cool and I appreciate your efforts! However (for those who are unaware) there is an alternative way that is (arguably) better:
https://ispc.github.io/index.html
You can write portable vectorized code that can be trivially invoked from D.
Feb 12
Guillaume Piolat <first.last gmail.com> writes:
On Wednesday, 13 February 2019 at 04:57:29 UTC, Crayo List wrote:
 On Wednesday, 6 February 2019 at 01:05:29 UTC, Guillaume Piolat 
 wrote:
 "intel-intrinsics" is a DUB package for people interested in 
 x86 performance that want neither to write assembly, nor a 
 LDC-specific snippet... and still have fastest possible code.
This is really cool and I appreciate your efforts! However (for those who are unaware) there is an alternative way that is (arguably) better; https://ispc.github.io/index.html You can write portable vectorized code that can be trivially invoked from D.
ispc is another compiler in your build, and you'd write in another language, so it's not really the same thing. I haven't used it (nor do I know anyone who does), so I don't really know why it would be any better.
Feb 13
Crayo List <crayolist gmail.com> writes:
On Wednesday, 13 February 2019 at 19:55:05 UTC, Guillaume Piolat 
wrote:
 On Wednesday, 13 February 2019 at 04:57:29 UTC, Crayo List 
 wrote:
 On Wednesday, 6 February 2019 at 01:05:29 UTC, Guillaume 
 Piolat wrote:
 "intel-intrinsics" is a DUB package for people interested in 
 x86 performance that want neither to write assembly, nor a 
 LDC-specific snippet... and still have fastest possible code.
This is really cool and I appreciate your efforts! However (for those who are unaware) there is an alternative way that is (arguably) better; https://ispc.github.io/index.html You can write portable vectorized code that can be trivially invoked from D.
ispc is another compiler in your build, and you'd write in another language, so it's not really the same thing.
That's mostly what I said, except that I did not say it's the same thing. It's an alternative way to produce vectorized code in a deterministic and portable way. This is NOT an auto-vectorizing compiler!
 I haven't used it (nor do I know anyone who do) so don't really 
 know why it would be any better
And that's precisely why I posted here: so that people interested in vectorizing their code in a portable way are aware that there is another (arguably) better way. I highly recommend browsing through the walkthrough example:
https://ispc.github.io/example.html
For example, I have code that I can run on my Xeon Phi 7250 Knights Landing CPU by compiling with --target=avx512knl-i32x16; then I can run the exact same code, with no change at all, on my i7-5820k by compiling with --target=avx2-i32x8. Each time I get optimal code. This is not something you can easily do with intrinsics!
Feb 13
Simen Kjærås <simen.kjaras gmail.com> writes:
On Wednesday, 13 February 2019 at 23:26:48 UTC, Crayo List wrote:
 On Wednesday, 13 February 2019 at 19:55:05 UTC, Guillaume 
 Piolat wrote:
 On Wednesday, 13 February 2019 at 04:57:29 UTC, Crayo List 
 wrote:
 However (for those who are unaware) there is an alternative 
 way that is (arguably) better;
 https://ispc.github.io/index.html

 You can write portable vectorized code that can be trivially 
 invoked from D.
ispc is another compiler in your build, and you'd write in another language, so it's not really the same thing.
That's mostly what I said, except that I did not say it's the same thing. It's an alternative way to produce vectorized code in a deterministic and portable way.
While you didn't say it was the same thing, you did say it's an alternative that 'is arguably better'. Adding another compiler using another language is arguably worse, so there are tradeoffs here, which Guillaume may have felt were undercommunicated (I know I did).

That said, it *is* a good alternative in some cases, and may well be worth pointing out in a thread like this.

--
  Simen
Feb 14
Guillaume Piolat <first.last gmail.com> writes:
On Wednesday, 13 February 2019 at 23:26:48 UTC, Crayo List wrote:
 On Wednesday, 13 February 2019 at 19:55:05 UTC, Guillaume 
 Piolat wrote:
 On Wednesday, 13 February 2019 at 04:57:29 UTC, Crayo List 
 wrote:
 On Wednesday, 6 February 2019 at 01:05:29 UTC, Guillaume 
 Piolat wrote:
 "intel-intrinsics" is a DUB package for people interested in 
 x86 performance that want neither to write assembly, nor a 
 LDC-specific snippet... and still have fastest possible code.
This is really cool and I appreciate your efforts! However (for those who are unaware) there is an alternative way that is (arguably) better; https://ispc.github.io/index.html You can write portable vectorized code that can be trivially invoked from D.
ispc is another compiler in your build, and you'd write in another language, so it's not really the same thing.
That's mostly what I said, except that I did not say it's the same thing. It's an alternative way to produce vectorized code in a deterministic and portable way. This is NOT an auto-vectorizing compiler!
 I haven't used it (nor do I know anyone who do) so don't 
 really know why it would be any better
And that's precisely why I posted here; for those people that have interest in vectorizing their code in a portable way to be aware that there is another (arguably) better way. I highly recommend browsing through the walkthrough example; https://ispc.github.io/example.html For example, I have code that I can run on my Xeon Phi 7250 Knights Landing CPU by compiling with --target=avx512knl-i32x16, then I can run the exact same code with no change at all on my i7-5820k by compiling with --target=avx2-i32x8. Each time I get optimal code. This is not something you can easily do with intrinsics!
I don't disagree, but ispc sounds more like a host-only OpenCL to me, rather than a replacement for, or competition to, intel-intrinsics. Intrinsics are easy: if calling another compiler with another source language might be trivial, then importing a DUB package and starting to use it within the same source code is even more trivial!

I take issue with the claim that Single Program Multiple Data yields much more performance than well-written intrinsics code: when your compiler auto-vectorizes (or you vectorized using SIMD semantics) you _also_ have one instruction for multiple data. The only gain I can see for SPMD would be the use of non-temporal writes, since they are so hard to use effectively in practice.

I also take some issue with "portability": SIMD intrinsics optimize quite deterministically (some instructions have been generated since LDC 1.0.0 at -O0), and LLVM IR is portable to ARM, whereas ispc will likely never be, as admitted by its author:
https://pharr.org/matt/blog/2018/04/29/ispc-retrospective.html

My interest in AVX-512 is subnormal: it can _slow down_ things on some x86 CPUs:
https://gist.github.com/rygorous/32bc3ea8301dba09358fd2c64e02d774
In general the latest instruction sets are increasingly hard to apply and have lower yield. The newer Intel instruction sets are basically a scam for the performance-minded. Sponsored work on x265 yields really abnormally low results when rewriting things with AVX-512:
https://software.intel.com/en-us/articles/accelerating-x265-with-intel-advanced-vector-extensions-512-intel-avx-512

As to compiling precisely for the host target: we are building B2C software here, so we don't control the host machine. Thankfully the ancient SIMD instruction sets yield most of the value, since a lot of the time memory throughput is the bottleneck. I can see ispc being more useful when you know the precise model of your target Intel CPU.

I would also like to see it compared to Intel's own software OpenCL: it seems it started its life as internal competition.
Feb 14
Ethan <gooberman gmail.com> writes:
On Wednesday, 13 February 2019 at 23:26:48 UTC, Crayo List wrote:
 And that's precisely why I posted here; for those people that 
 have interest in vectorizing their code in a portable way to be 
 aware that there is another (arguably) better way.
All power to the people that have code that simple. But auto-vectorising in any capacity is the wrong way to do things in my field. An intrinsics library is vital for writing highly specialised code.

The tl;dr here is that we *FINALLY* have a minimum spec for x64 CPUs represented with SSE intrinsics, instead of whatever core.simd is. That's really important, and talk about auto-vectorisation is really best saved for another thread.
Feb 14
Crayo List <crayolist gmail.com> writes:
On Thursday, 14 February 2019 at 16:13:21 UTC, Ethan wrote:
 On Wednesday, 13 February 2019 at 23:26:48 UTC, Crayo List 
 wrote:
 And that's precisely why I posted here; for those people that 
 have interest in vectorizing their code in a portable way to 
 be aware that there is another (arguably) better way.
All power to the people that have code that simple. But auto-vectorising in any capacity is the wrong way to do things in my field. An intrinsics library is vital to write highly specialised code. The tl;dr here is that we *FINALLY* have a minimum-spec for x64 CPUs represented with SSE intrinsics. Instead of whatever core.simd is. That's really important, and talks about auto-vectorisation are really best saved for another thread.
Please re-read my post carefully!
Feb 14
Guillaume Piolat <first.last gmail.com> writes:
On Thursday, 14 February 2019 at 21:45:57 UTC, Crayo List wrote:
 On Thursday, 14 February 2019 at 16:13:21 UTC, Ethan wrote:
 On Wednesday, 13 February 2019 at 23:26:48 UTC, Crayo List 
 wrote:
 And that's precisely why I posted here; for those people that 
 have interest in vectorizing their code in a portable way to 
 be aware that there is another (arguably) better way.
All power to the people that have code that simple. But auto-vectorising in any capacity is the wrong way to do things in my field. An intrinsics library is vital to write highly specialised code. The tl;dr here is that we *FINALLY* have a minimum-spec for x64 CPUs represented with SSE intrinsics. Instead of whatever core.simd is. That's really important, and talks about auto-vectorisation are really best saved for another thread.
Please re-read my post carefully!
I think ispc is interesting, and a very D-ish thing to have would be an ispc-like compiler at CTFE that outputs LLVM IR (or assembly, or intel-intrinsics). That would break the language boundary and allow inlining. Though we probably need newCTFE for this, as everything interesting seems to need newCTFE :) And it's a gigantic amount of work.
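To make the CTFE code-generation idea concrete, here is a toy sketch (purely illustrative, nothing ispc-like): a string of unrolled D source is built at compile time and compiled in with mixin, so the generated code crosses no language boundary and inlines normally.

```d
import std.conv : to;

// Builds the source of an unrolled element-wise add at compile time.
string unrolledAdd(size_t n)
{
    string s;
    foreach (i; 0 .. n)
        s ~= "dst[" ~ i.to!string ~ "] = a[" ~ i.to!string
           ~ "] + b[" ~ i.to!string ~ "];\n";
    return s;
}

void add4(float[] dst, const(float)[] a, const(float)[] b)
{
    mixin(unrolledAdd(4)); // expanded during compilation, no runtime loop
}

void main()
{
    float[4] a = [1, 2, 3, 4], b = [10, 20, 30, 40], d;
    add4(d[], a[], b[]);
    assert(d[0] == 11 && d[3] == 44);
}
```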
Feb 14
parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Thu, Feb 14, 2019 at 10:15:19PM +0000, Guillaume Piolat via
Digitalmars-d-announce wrote:
[...]
 I think ispc is interesting, and a very D-ish thing to have would be
 an ispc-like compiler at CTFE that outputs LLVM IR (or assembly or
 intel-intrinsics). That would break the language boundary and allows
 inlining. Though probably we need newCTFE for this, as everything
 interesting seems to need newCTFE :) And it's a gigantic amount of
 work.
Much as I love the idea of generating D code at compile-time and look forward to newCTFE, there comes a point when I'd really rather just run the DSL through some kind of preprocessing (i.e., compile with ispc) as part of the build, then link the result to the D code, rather than trying to shoehorn everything into (new)CTFE. T -- You have to expect the unexpected. -- RL
Feb 14
Guillaume Piolat <first.last gmail.com> writes:
On Thursday, 14 February 2019 at 22:28:46 UTC, H. S. Teoh wrote:
 trying to shoehorn everything into (new)CTFE.
Couldn't help but find a similarity between http://www.dsource.org/projects/mathextra/browser/trunk/blade/BladeDemo.d and ispc
Feb 14
Ethan <gooberman gmail.com> writes:
On Thursday, 14 February 2019 at 21:45:57 UTC, Crayo List wrote:
 Please re-read my post carefully!
Or - even better - take the hint that not every use of SIMD can be expressed in a high level manner.
Feb 14