digitalmars.D - Playing SIMD
- Iakh (42/42) Oct 25 2015 Here is my implementation of SIMD find. Function returns index
- Andrei Alexandrescu (3/43) Oct 25 2015 This is compelling but needs a bit of work to integrate. Care to work on...
- Iakh (4/8) Oct 25 2015 First of all I need to do some investigation into PRs and
- Andrei Alexandrescu (2/8) Oct 25 2015 Thanks! -- Andrei
- Matthias Bentrup (3/4) Oct 25 2015 I think it's better to use PMOVMSKB to avoid storing the PCMPEQB
- Iakh (5/10) Oct 25 2015 Yeah, but PMOVMSKB is not implemented in core.simd.
- Iain Buclaw via Digitalmars-d (5/14) Oct 26 2015 in memory and you need only one BSF instruction.
- Marco Leise (7/12) Oct 28 2015 Am Mon, 26 Oct 2015 14:04:18 +0100
- Andrei Alexandrescu (2/11) Oct 28 2015 Guess an enh report would help remembering the matter. -- Andrei
- anonymous (3/10) Oct 25 2015 runBinary calls naiveIndexOf. You're not testing binaryIndexOf.
- Iakh (10/11) Oct 26 2015 You are right.
- Iakh (8/19) Oct 26 2015 At home with default dub config "dub run --build=release":
- John Colvin (4/27) Oct 26 2015 It's good to see work like this. Have you tried with gdc/ldc
- Andrei Alexandrescu (4/15) Oct 26 2015 That's a healthy margin. It may get eroded by startup/finish codes that
- rumbu (14/34) Oct 26 2015 Interpolation search is even faster :)
- Kagamin (4/6) Oct 26 2015 For a funny name it's called Newton search: you can use
- Iakh (10/30) Oct 26 2015 (Binary) searching in a large sorted contiguous array, e.g. 1 MB of
- Iakh (2/11) Oct 26 2015 But yeah... who needs 1_000_000 sorted values in the range 0..256? :D
- Don (12/14) Oct 26 2015 [snip]
- Iain Buclaw via Digitalmars-d (6/12) Oct 26 2015 ubyte in static 16 byte array with unique values.
- Andrei Alexandrescu (7/20) Oct 26 2015 One other note: don't compare with binary search, it's not an
- Iakh (2/4) Oct 26 2015 Can you advise some reading about benchmarking?
Here is my implementation of SIMD find. The function returns the index of
a ubyte in a static 16-byte array with unique values.

------
import core.simd;
import core.bitop : bsf;

immutable size_t ArraySize = 16;

int simdIndexOf(ubyte niddle, ref const ubyte[ArraySize] haystack)
{
    ubyte16 arr;
    arr.array = haystack[];
    ubyte16 niddles;
    niddles.array[] = niddle;
    ubyte16 result;
    result = __simd_sto(XMM.PCMPEQB, arr, niddles);

    alias Mask = ulong;
    static assert(2 * Mask.sizeof == result.sizeof);
    immutable BitsInByte = 8;

    if (Mask mask = *cast(Mask*)(result.array.ptr))
    {
        return bsf(mask) / BitsInByte;
    }
    else if (Mask mask = *cast(Mask*)(result.array.ptr + Mask.sizeof))
    {
        return bsf(mask) / BitsInByte + cast(int)Mask.sizeof;
    }
    else
    {
        return -1;
    }
}
------

Is it optimal, and how do you implement this stuff?

Full example with a comparison of algorithms (SIMD, naive, binary search):
http://dpaste.dzfl.pl/f3b8989841e3

Benchmark result on dpaste.dzfl.pl:
SIMD:   TickDuration(157000)
Binary: TickDuration(472000)
Naive:  TickDuration(437000)

At home with the default dub config, "dub run --build=release":
SIMD:   TickDuration(241566)
Binary: TickDuration(450515)
Naive:  TickDuration(450371)
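For concreteness, a small hypothetical usage sketch (not part of the
original post) could look like this, assuming the function above compiles
with DMD's core.simd on x86/x86-64:

------
// Illustrative only: exercise simdIndexOf on a 16-byte table.
unittest
{
    ubyte[16] table = [ 3,  7, 11, 13, 17, 19, 23, 29,
                       31, 37, 41, 43, 47, 53, 59, 61];
    assert(simdIndexOf(23, table) == 6);   // present: index of the match
    assert(simdIndexOf( 2, table) == -1);  // absent: -1
}
------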
Oct 25 2015
On 10/25/2015 03:37 PM, Iakh wrote:
> Here is my implementation of SIMD find. The function returns the index
> of a ubyte in a static 16-byte array with unique values.
> [...]

This is compelling but needs a bit of work to integrate. Care to work on
a PR that makes std.algorithm use it? Thanks! -- Andrei
Oct 25 2015
On Sunday, 25 October 2015 at 21:13:56 UTC, Andrei Alexandrescu wrote:
> [...]
> This is compelling but needs a bit of work to integrate. Care to work on
> a PR that makes std.algorithm use it? Thanks! -- Andrei

First of all I need to do some investigation into PRs and std.algorithm.
But in general, challenge accepted.
Oct 25 2015
On 10/25/15 6:57 PM, Iakh wrote:
> On Sunday, 25 October 2015 at 21:13:56 UTC, Andrei Alexandrescu wrote:
>> [...]
>> This is compelling but needs a bit of work to integrate. Care to work
>> on a PR that makes std.algorithm use it? Thanks! -- Andrei
>
> First of all I need to do some investigation into PRs and std.algorithm.
> But in general, challenge accepted.

Thanks! -- Andrei
Oct 25 2015
On Sunday, 25 October 2015 at 19:37:32 UTC, Iakh wrote:
> Is it optimal, and how do you implement this stuff?

I think it's better to use PMOVMSKB to avoid storing the PCMPEQB result
in memory; then you need only one BSF instruction.
Oct 25 2015
On Sunday, 25 October 2015 at 22:17:58 UTC, Matthias Bentrup wrote:
> On Sunday, 25 October 2015 at 19:37:32 UTC, Iakh wrote:
>> Is it optimal, and how do you implement this stuff?
>
> I think it's better to use PMOVMSKB to avoid storing the PCMPEQB result
> in memory; then you need only one BSF instruction.

Yeah, but PMOVMSKB is not implemented in core.simd.
A bit more comprehensive discussion here:
http://forum.dlang.org/post/20150923115833.054fdb09 marco-toshiba
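For illustration, a rough sketch of the PMOVMSKB variant Matthias
describes is below. Since PMOVMSKB is not exposed through core.simd, the
sketch drops to DMD's inline assembler; it assumes DMD on x86-64 (version
D_InlineAsm_X86_64) with SSE2, and the function and variable names are
illustrative rather than taken from the thread:

------
import core.bitop : bsf;

int movmskIndexOf(ubyte needle, ref const ubyte[16] haystack)
{
    ubyte[16] needles;
    needles[] = needle;                 // broadcast the needle to all lanes
    const(ubyte)* hay = haystack.ptr;
    const(ubyte)* pat = needles.ptr;
    uint mask;

    asm
    {
        mov RAX, hay;
        mov RCX, pat;
        movdqu XMM0, [RAX];   // load the 16 haystack bytes
        movdqu XMM1, [RCX];   // load the broadcast needle
        pcmpeqb XMM0, XMM1;   // 0xFF in every lane that matches
        pmovmskb EAX, XMM0;   // collapse the 16 lanes into a 16-bit mask
        mov mask, EAX;
    }

    // One BSF on the mask yields the byte index directly, with no stores
    // of the compare result and no per-half reloads.
    return mask ? bsf(mask) : -1;
}
------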
Oct 25 2015
On 25 Oct 2015 11:50 pm, "Iakh via Digitalmars-d"
<digitalmars-d puremagic.com> wrote:
> On Sunday, 25 October 2015 at 22:17:58 UTC, Matthias Bentrup wrote:
>> On Sunday, 25 October 2015 at 19:37:32 UTC, Iakh wrote:
>>> Is it optimal, and how do you implement this stuff?
>>
>> I think it's better to use PMOVMSKB to avoid storing the PCMPEQB result
>> in memory; then you need only one BSF instruction.
>
> Yeah, but PMOVMSKB is not implemented in core.simd.

Don't use core.simd; push for getting std.simd in, then leverage the
generics exposed through that module.
Oct 26 2015
On Mon, 26 Oct 2015 14:04:18 +0100, Iain Buclaw via Digitalmars-d
<digitalmars-d puremagic.com> wrote:
>> Yeah, but PMOVMSKB is not implemented in core.simd.
>
> Don't use core.simd; push for getting std.simd in, then leverage the
> generics exposed through that module.

Yeah, but PMOVMSKB is not implemented in std.simd either.

-- Marco
Oct 28 2015
On 10/28/2015 10:27 AM, Marco Leise wrote:
> On Mon, 26 Oct 2015 14:04:18 +0100, Iain Buclaw via Digitalmars-d
> <digitalmars-d puremagic.com> wrote:
>>> Yeah, but PMOVMSKB is not implemented in core.simd.
>>
>> Don't use core.simd; push for getting std.simd in, then leverage the
>> generics exposed through that module.
>
> Yeah, but PMOVMSKB is not implemented in std.simd either.

Guess an enhancement report would help remembering the matter. -- Andrei
Oct 28 2015
On 25.10.2015 20:37, Iakh wrote:
> Full example with a comparison of algorithms (SIMD, naive, binary search):
> http://dpaste.dzfl.pl/f3b8989841e3

From there:

void runBinary()
{
    static int i = 0;
    naiveIndexOf(cast(ubyte)(i++ / ArraySize + 1), arr);
}

runBinary calls naiveIndexOf. You're not testing binaryIndexOf.
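Presumably the intended benchmark body is the one-line change below (a
guess based only on the snippet above; binaryIndexOf, ArraySize and arr
are names from the linked dpaste):

void runBinary()
{
    static int i = 0;
    // Call binaryIndexOf so the "Binary" timing measures binary search.
    binaryIndexOf(cast(ubyte)(i++ / ArraySize + 1), arr);
}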
Oct 25 2015
On Monday, 26 October 2015 at 00:00:45 UTC, anonymous wrote:
> runBinary calls naiveIndexOf. You're not testing binaryIndexOf.

You are right. This is the fixed example:
http://dpaste.dzfl.pl/f7a54b789a21

and the results at dpaste.dzfl.pl:
-----
SIMD:   TickDuration(151000)
Binary: TickDuration(255000)
Naive:  TickDuration(459000)

So the SIMD version is ~1.68 times faster than binary.
Oct 26 2015
On Monday, 26 October 2015 at 09:49:00 UTC, Iakh wrote:
> On Monday, 26 October 2015 at 00:00:45 UTC, anonymous wrote:
>> runBinary calls naiveIndexOf. You're not testing binaryIndexOf.
>
> You are right. This is the fixed example:
> http://dpaste.dzfl.pl/f7a54b789a21
> [...]

At home with the default dub config, "dub run --build=release":
-----
SIMD:   TickDuration(350644)
Binary: TickDuration(434014)
Naive:  TickDuration(657548)

~1.24 times faster than binary and ~1.87 times faster than naive.
Oct 26 2015
On Monday, 26 October 2015 at 11:10:31 UTC, Iakh wrote:
> [...]
> At home with the default dub config, "dub run --build=release":
> -----
> SIMD:   TickDuration(350644)
> Binary: TickDuration(434014)
> Naive:  TickDuration(657548)
>
> ~1.24 times faster than binary and ~1.87 times faster than naive.

It's good to see work like this. Have you tried with gdc/ldc
(-march=native / -mcpu=native respectively, if you want to use the full
available instruction set)?
Oct 26 2015
On 10/26/2015 05:48 AM, Iakh wrote:
> On Monday, 26 October 2015 at 00:00:45 UTC, anonymous wrote:
>> runBinary calls naiveIndexOf. You're not testing binaryIndexOf.
>
> You are right. This is the fixed example:
> http://dpaste.dzfl.pl/f7a54b789a21
>
> and the results at dpaste.dzfl.pl:
> -----
> SIMD:   TickDuration(151000)
> Binary: TickDuration(255000)
> Naive:  TickDuration(459000)
>
> So the SIMD version is ~1.68 times faster than binary.

That's a healthy margin. It may get eroded by the startup/finish code that
needs to get to the first aligned chunk and handle the misaligned data at
the end, etc. But it's a solid proof of concept. -- Andrei
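As an illustration of the startup/finish code Andrei means, a find over an
arbitrary-length slice usually has the shape sketched below. This is a
hedged sketch rather than code from the thread; it reuses simdIndexOf from
the opening post as the 16-byte kernel and assumes it is in scope:

------
// Scalar "head" until the pointer is 16-byte aligned, SIMD over whole
// 16-byte chunks, scalar "tail" for the leftover bytes.
ptrdiff_t indexOfByte(ubyte needle, const(ubyte)[] haystack)
{
    size_t i = 0;

    // Head: advance until the address is 16-byte aligned (or data runs out).
    while (i < haystack.length && (cast(size_t)(haystack.ptr + i) & 15) != 0)
    {
        if (haystack[i] == needle)
            return i;
        ++i;
    }

    // Body: whole 16-byte chunks handled by the SIMD kernel.
    while (haystack.length - i >= 16)
    {
        immutable hit = simdIndexOf(needle,
                *cast(const(ubyte)[16]*)(haystack.ptr + i));
        if (hit >= 0)
            return i + hit;
        i += 16;
    }

    // Tail: whatever is left after the last full chunk.
    for (; i < haystack.length; ++i)
        if (haystack[i] == needle)
            return i;

    return -1;
}
------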
Oct 26 2015
On Monday, 26 October 2015 at 11:47:56 UTC, Andrei Alexandrescu wrote:
> On 10/26/2015 05:48 AM, Iakh wrote:
>> [...]
>
> That's a healthy margin. It may get eroded by the startup/finish code
> that needs to get to the first aligned chunk and handle the misaligned
> data at the end, etc. But it's a solid proof of concept. -- Andrei

Interpolation search is even faster :)
http://dpaste.dzfl.pl/4443c5753454

-----
SIMD:          TickDuration(144000)
Binary:        TickDuration(229000)
Naive:         TickDuration(472000)
Interpolation: TickDuration(92000)
-----
SIMD:          TickDuration(145000)
Binary:        TickDuration(247000)
Naive:         TickDuration(463000)
Interpolation: TickDuration(91000)
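rumbu's dpaste is not reproduced here, but for reference a plain
first-order interpolation search over a sorted byte array looks roughly
like the sketch below (illustrative names, not the dpaste code). It
estimates the probe position from the value distribution, so on roughly
uniform keys it typically needs only one or two probes:

------
int interpolationIndexOf(ubyte needle, const(ubyte)[] haystack)
{
    size_t lo = 0;
    size_t hi = haystack.length;              // search within [lo, hi)
    while (lo < hi && needle >= haystack[lo] && needle <= haystack[hi - 1])
    {
        // Guess where the needle should be, assuming values grow linearly.
        immutable int range = haystack[hi - 1] - haystack[lo];
        immutable size_t mid = range == 0
            ? lo
            : lo + (needle - haystack[lo]) * (hi - 1 - lo) / range;
        if (haystack[mid] == needle)
            return cast(int) mid;
        if (haystack[mid] < needle)
            lo = mid + 1;
        else
            hi = mid;
    }
    return -1;
}
------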
Oct 26 2015
On Monday, 26 October 2015 at 12:10:41 UTC, rumbu wrote:
> Interpolation search is even faster :)
> http://dpaste.dzfl.pl/4443c5753454

For a funnier name, it's also called Newton search: you can use
higher-order interpolation, which can be more efficient if the data has a
trend.
Oct 26 2015
On Monday, 26 October 2015 at 11:47:56 UTC, Andrei Alexandrescu wrote:
> [...]
> That's a healthy margin. It may get eroded by the startup/finish code
> that needs to get to the first aligned chunk and handle the misaligned
> data at the end, etc. But it's a solid proof of concept. -- Andrei

Consider (binary) searching in a large sorted contiguous array, e.g. 1 MB
of bytes: 1 MB = 2 ^^ 20 bytes. Imagine 4 binary levels pass in 4 ticks,
so 20 levels take 20 ticks. With SIMD, the last 4 levels would be done in
2 or 3 ticks; for 20 levels that's 18-19 ticks, so the overall improvement
is 5-10%. Furthermore, think of the cache and memory-page misses on the
first levels.

IMO SIMD is needed for unsorted data (strings?) or for trees.
Oct 26 2015
On Monday, 26 October 2015 at 15:03:12 UTC, Iakh wrote:
> [...]
> IMO SIMD is needed for unsorted data (strings?) or for trees.

But yeah... who needs 1_000_000 sorted values in the range 0..256? :D
Oct 26 2015
On Sunday, 25 October 2015 at 19:37:32 UTC, Iakh wrote:
> Here is my implementation of SIMD find. The function returns the index
> of a ubyte in a static 16-byte array with unique values.
[snip]

You need to be very careful when benchmarking tiny test cases; they can be
very misleading.

Be aware that the speed of bsf() and bsr() is very strongly processor
dependent. On some machines it is utterly pathetic: e.g. on the AMD K7,
BSR is 23 micro-operations; on the original Pentium it was up to 73 (!);
even on AMD Bobcat it is 11 micro-ops; but on recent Intel it is one
micro-op. That factor of 73 can totally screw up your performance
comparisons.

Just because it is a single machine instruction does not mean it is fast.
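As an aside on "a single instruction is not necessarily fast": where BSF
is slow, the classic software fallback is a De Bruijn multiply-and-lookup,
sketched below (illustrative, not from Don's post):

------
// Index of the lowest set bit of a non-zero uint, without BSF.
// (Like BSF itself, the result is meaningless for v == 0.)
uint bsfDeBruijn(uint v)
{
    static immutable ubyte[32] table = [
         0,  1, 28,  2, 29, 14, 24,  3,
        30, 22, 20, 15, 25, 17,  4,  8,
        31, 27, 13, 23, 21, 19, 16,  7,
        26, 12, 18,  6, 11,  5, 10,  9,
    ];
    // v & (~v + 1) isolates the lowest set bit; the multiply hashes each
    // possible single-bit value to a distinct 5-bit table index.
    return table[((v & (~v + 1)) * 0x077CB531u) >> 27];
}
------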
Oct 26 2015
On 26 Oct 2015 1:40 pm, "Don via Digitalmars-d"
<digitalmars-d puremagic.com> wrote:
> On Sunday, 25 October 2015 at 19:37:32 UTC, Iakh wrote:
>> Here is my implementation of SIMD find. The function returns the index
>> of a ubyte in a static 16-byte array with unique values.
> [snip]
>
> You need to be very careful when benchmarking tiny test cases; they can
> be very misleading.
>
> Be aware that the speed of bsf() and bsr() is very strongly processor
> dependent. On some machines it is utterly pathetic.

Some compilers are smart about uses of bsf and bsr. :-)
Oct 26 2015
On 10/26/2015 08:35 AM, Don wrote:
> On Sunday, 25 October 2015 at 19:37:32 UTC, Iakh wrote:
>> Here is my implementation of SIMD find. The function returns the index
>> of a ubyte in a static 16-byte array with unique values.
> [snip]
>
> You need to be very careful when benchmarking tiny test cases; they can
> be very misleading.
> [...]
> Just because it is a single machine instruction does not mean it is fast.

One other note: don't compare with binary search; it's not an appropriate
baseline. You should use it only if you implemented a SIMD-based binary
search. Good baselines: std.find, memchr, and a naive version with
pointers (no bounds checking).

Andrei
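For illustration, the kinds of baselines Andrei suggests might look like
the sketch below (hypothetical names; memchr is the C routine from
core.stdc.string):

------
import core.stdc.string : memchr;

// memchr baseline: let the C library do the byte search.
ptrdiff_t memchrIndexOf(ubyte needle, const(ubyte)[] haystack)
{
    auto p = memchr(haystack.ptr, needle, haystack.length);
    return p is null ? -1 : cast(const(ubyte)*) p - haystack.ptr;
}

// Naive pointer baseline: straight scan, no bounds checking.
ptrdiff_t naivePtrIndexOf(ubyte needle, const(ubyte)[] haystack)
{
    const(ubyte)* p = haystack.ptr;
    const(ubyte)* end = p + haystack.length;
    for (; p < end; ++p)
        if (*p == needle)
            return p - haystack.ptr;
    return -1;
}
------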
Oct 26 2015
On Monday, 26 October 2015 at 12:35:39 UTC, Don wrote:
> You need to be very careful when benchmarking tiny test cases; they can
> be very misleading.

Can you advise some reading about benchmarking?
Oct 26 2015