digitalmars.D - Benchmark memchr (with GCC builtins)
- Iakh (52/52) Oct 30 2015 I continue to play with SIMD. So I was trying to use std.simd
- Andrei Alexandrescu (3/19) Oct 30 2015 Could you please take a look at GCC's generated code and implementation
- Iakh (5/7) Oct 30 2015 glibc uses something like pseudo-SIMD with ordinary x86
- Marco Leise (25/33) Oct 30 2015 Instead of providing a library implementation, these functions
- Iakh (17/19) Oct 31 2015 Copy-and-paste from glibc's memchr (runGLibC) gave the result
- Iakh (45/47) Nov 02 2015 So I did. I rewrote the code to do the main work in cacheLineSize chunks.
- Andrei Alexandrescu (3/8) Nov 03 2015 Looks like the current memchr is well optimized. Not much blood left in
- Dmitry Olshansky (5/20) Oct 31 2015 More or less matches my experience with C's memchr vs something else -
- rsw0x (4/5) Oct 31 2015 I got it to 1.5x the running time of C using SSE2 but couldn't get
- Iakh (2/3) Oct 31 2015 Can you share your solution?
I continue to play with SIMD. I was trying to use std.simd, but it has lots of things still to be implemented. I also gave up on core.simd.__simd due to problems with the PMOVMSKB instruction (it is not implemented). Today I was playing with memchr for gdc:

memchr: http://www.cplusplus.com/reference/cstring/memchr/
My implementations with benchmark: http://dpaste.dzfl.pl/4c46c0cf340c

Benchmark results:
-----
Naive: 21.9 TickDuration(136456491)
SIMD:  3.04 TickDuration(18920182)
SIMDM: 2.44 TickDuration(15232176)
SIMDU: 1.8  TickDuration(11210454)
C:     1    TickDuration(6233963)
-----
The middle column is the duration relative to the C implementation (core.stdc.string).

memchrSIMD splits the input into three parts: unaligned begin, unaligned end, and aligned mid.

memchrSIMDM, instead of pmovmskb, uses this code:
------
if (Mask mask = *cast(Mask*)(result.array.ptr))
{
    return ptr + bsf(mask) / BitsInByte;
}
else if (Mask mask = *cast(Mask*)(result.array.ptr + Mask.sizeof))
{
    return ptr + bsf(mask) / BitsInByte + cast(int)Mask.sizeof;
}
------

memchrSIMDU (unaligned) applies SIMD instructions starting from the first array element. The SIMD part of the function:
------
ubyte16 niddles;
niddles.ptr[0..16] = value;

ubyte16 result;
ubyte16 arr;
for (; ptr < alignedEnd; ptr += 16)
{
    arr.ptr[0..16] = ptr[0..16];
    result = __builtin_ia32_pcmpeqb128(arr, niddles);
    int i = __builtin_ia32_pmovmskb128(result);

    if (i != 0)
    {
        return ptr + bsf(i);
    }
}
------
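[Editor's note: for reference, here is a minimal, self-contained D sketch of the three-way split described above. The function name and the alignment arithmetic are assumptions of mine, not taken from the linked dpaste, and plain byte loops stand in for the SIMD loop so the sketch runs on its own; in the real memchrSIMD the aligned middle is where the pcmpeqb/pmovmskb loop goes.]
-----
// Hypothetical sketch (names are mine): unaligned head / aligned mid / unaligned tail.
// Plain byte loops stand in for the SIMD loop shown above.
const(ubyte)* memchrSplit(const(ubyte)* ptr, ubyte value, size_t num)
{
    enum vecSize = 16;
    const(ubyte)* end = ptr + num;

    // Round the start up and the end down to the 16-byte vector width.
    auto alignedBegin = cast(const(ubyte)*)
        ((cast(size_t)ptr + vecSize - 1) & ~cast(size_t)(vecSize - 1));
    auto alignedEnd = cast(const(ubyte)*)(cast(size_t)end & ~cast(size_t)(vecSize - 1));
    if (alignedBegin > end) alignedBegin = end;
    if (alignedEnd < alignedBegin) alignedEnd = alignedBegin;

    // Unaligned head, byte by byte.
    for (; ptr < alignedBegin; ++ptr)
        if (*ptr == value) return ptr;

    // Aligned middle: in memchrSIMD this is the pcmpeqb/pmovmskb loop,
    // reading 16 aligned bytes per iteration.
    for (; ptr < alignedEnd; ++ptr)
        if (*ptr == value) return ptr;

    // Unaligned tail.
    for (; ptr < end; ++ptr)
        if (*ptr == value) return ptr;

    return null;
}
-----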
Oct 30 2015
On 10/30/2015 05:29 PM, Iakh wrote:
> [...]
> Benchmark results:
> -----
> Naive: 21.9 TickDuration(136456491)
> SIMD:  3.04 TickDuration(18920182)
> SIMDM: 2.44 TickDuration(15232176)
> SIMDU: 1.8  TickDuration(11210454)
> C:     1    TickDuration(6233963)
> -----
> The middle column is the duration relative to the C implementation
> (core.stdc.string).

Could you please take a look at GCC's generated code and implementation of memchr? -- Andrei
Oct 30 2015
On Friday, 30 October 2015 at 21:33:25 UTC, Andrei Alexandrescu wrote:
> Could you please take a look at GCC's generated code and implementation
> of memchr? -- Andrei

glibc uses something like pseudo-SIMD with ordinary x86 instructions (XOR magic, etc.). A deep comparison I left for next time :)
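[Editor's note: the "XOR magic" refers to the classic word-at-a-time trick, which XORs each machine word with the needle byte repeated in every lane and then tests for a zero byte. A rough D sketch of the idea follows; it is an illustration only, not glibc's actual code, and it assumes little-endian byte order as on x86.]
-----
import core.bitop : bsf;

// Word-at-a-time byte search: a sketch of the "pseudo-SIMD" idea only,
// not glibc's actual code. Assumes little-endian byte order (x86).
const(ubyte)* findByteInWords(const(ulong)* words, size_t numWords, ubyte value)
{
    // Repeat the needle in every byte of a 64-bit word.
    immutable ulong needle = 0x0101010101010101UL * value;

    foreach (i; 0 .. numWords)
    {
        // XOR turns matching bytes into zero bytes...
        immutable ulong x = words[i] ^ needle;
        // ...and this classic expression is non-zero exactly when some byte of x
        // is zero; the lowest set bit falls inside the first such byte.
        immutable ulong zeroMask =
            (x - 0x0101010101010101UL) & ~x & 0x8080808080808080UL;
        if (zeroMask != 0)
            return cast(const(ubyte)*)(words + i) + bsf(zeroMask) / 8;
    }
    return null;
}
-----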
Oct 30 2015
On Fri, 30 Oct 2015 22:31:54 +0000, Iakh <iaktakh gmail.com> wrote:

> On Friday, 30 October 2015 at 21:33:25 UTC, Andrei Alexandrescu wrote:
>> Could you please take a look at GCC's generated code and
>> implementation of memchr? -- Andrei
>
> glibc uses something like pseudo-SIMD with ordinary x86
> instructions (XOR magic, etc.). A deep comparison I left for next time :)

Instead of providing a library implementation, these functions are often best left to compiler intrinsics, which can provide one of several instruction sequences based on the arguments and the target CPU. In particular, compilers often know runtime length arguments from const-folding and can choose the best matching SIMD instruction set when compiling for a specific CPU. In regular D or C code you don't have access to this information.

GlibC might use that particular implementation in its source, but GCC will override these generic functions with builtin intrinsics unless you disable them via a command-line switch (-fno-builtin). The same goes for e.g. ctlz (count leading zeroes): it will be emulated somehow when compiled for older CPUs and will use fast native instructions on more recent ones.

Yes, sometimes the intrinsics are lacking, but in general I trust the compiler devs more than myself, especially since they likely tested it on several target architectures while I mostly only do one. Just ask yourself how you would select the optimal memchr function matching the -march= (gdc) or -mcpu (ldc) flags at compile time.

-- Marco
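[Editor's note: to make that last point concrete, the only compile-time knob ordinary D library code has is quite coarse. The sketch below assumes only the standard D_SIMD predefined version identifier; it cannot distinguish, say, SSE2 from AVX2 the way the compiler's own builtin selection driven by -march/-mcpu can, which is Marco's point.]
-----
import core.stdc.string : memchr;

// Coarse compile-time dispatch: version (D_SIMD) only says that vector types
// exist at all; it does not tell us which -march/-mcpu level (SSE2, AVX2, ...)
// the compiler is targeting.
const(void)* myMemchr(const(void)* s, int c, size_t n)
{
    version (D_SIMD)
    {
        // A hand-written SIMD path would go here (e.g. the pcmpeqb loop from
        // earlier in the thread); the C library call keeps this sketch runnable.
        return memchr(s, c, n);
    }
    else
    {
        // Plain scalar fallback.
        auto p = cast(const(ubyte)*)s;
        foreach (i; 0 .. n)
            if (p[i] == cast(ubyte)c)
                return p + i;
        return null;
    }
}
-----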
Oct 30 2015
On Friday, 30 October 2015 at 21:33:25 UTC, Andrei Alexandrescu wrote:
> Could you please take a look at GCC's generated code and implementation
> of memchr? -- Andrei

A copy-and-paste of glibc's memchr (runGLibC) gave the results below.
-----
Naive:    21.4 TickDuration(132485705)
SIMD:     3.17 TickDuration(19629892)
SIMDM:    2.49 TickDuration(15420462)
C:        1    TickDuration(6195504)
runGLibC: 4.32 TickDuration(26782585)
SIMDU:    1.8  TickDuration(11128618)
-----

The ASM shows memchr is really called; there is neither compiler magic nor a local memchr implementation. The aligned versions use aligned loads from memory and the unaligned one uses unaligned loads. So at this point the optimisation is done well.
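[Editor's note: the aligned/unaligned distinction in the D code comes down to how the 16 bytes reach the vector register. A small sketch of the two load forms used in the thread's code follows; the function names are mine.]
-----
import core.simd : ubyte16;

// Two ways of getting 16 bytes into a vector, as used in the thread's code.
// `ptr` must point to at least 16 readable bytes.

// Unaligned load: copy into the vector's storage (works for any address).
ubyte16 loadUnaligned(const(ubyte)* ptr)
{
    ubyte16 v;
    v.ptr[0 .. 16] = ptr[0 .. 16];
    return v;
}

// Aligned load: reinterpret the memory as a vector. `ptr` must be 16-byte
// aligned; the compiler may emit movdqa, which faults on unaligned addresses.
ubyte16 loadAligned(const(ubyte)* ptr)
{
    return *cast(const(ubyte16)*)ptr;
}
-----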
Oct 31 2015
On Friday, 30 October 2015 at 21:33:25 UTC, Andrei Alexandrescu wrote:
> Could you please take a look at GCC's generated code and implementation
> of memchr? -- Andrei

So I did. I rewrote the code to do the main work in cacheLineSize chunks, which is what the glibc version does. The main loop now looks like this:
-----
do
{
    // ptr16 is 64-byte aligned
    ubyte16 r1 = __builtin_ia32_pcmpeqb128(ptr16[0], niddles);
    ubyte16 r2 = __builtin_ia32_pcmpeqb128(ptr16[1], niddles);
    ubyte16 r3 = __builtin_ia32_pcmpeqb128(ptr16[2], niddles);
    ubyte16 r4 = __builtin_ia32_pcmpeqb128(ptr16[3], niddles);

    r3 = __builtin_ia32_pmaxub128(r1, r3);
    r4 = __builtin_ia32_pmaxub128(r2, r4);
    r4 = __builtin_ia32_pmaxub128(r3, r4);
    mask = __builtin_ia32_pmovmskb128(r4);

    if (mask != 0)
    {
        mask = __builtin_ia32_pmovmskb128(r1);
        mixin(CheckMask); // Check and return value

        ++ptr16;
        num -= 16;
        mask = __builtin_ia32_pmovmskb128(r2);
        mixin(CheckMask);

        ++ptr16;
        num -= 16;
        r3 = __builtin_ia32_pcmpeqb128(*ptr16, niddles);
        mask = __builtin_ia32_pmovmskb128(r3);
        mixin(CheckMask);

        ++ptr16;
        num -= 16;
        r4 = __builtin_ia32_pcmpeqb128(*ptr16, niddles);
        mask = __builtin_ia32_pmovmskb128(r4);
        mixin(CheckMask);
    }

    num -= 64;
    ptr16 += 4;
} while (num > 0);
-----
and my best result:
-----
Naive:     21.46 TickDuration(132842482)
SIMD:      1.161 TickDuration(7188211)
(was)SIMD: 3.04  TickDuration(18920182)
C:         1     TickDuration(6189222)
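[Editor's note: the CheckMask mixin string is not shown in the thread. Going by how mask, ptr16, and bsf are used elsewhere in the code, a plausible reconstruction might look like the following; this is purely an assumption, not the author's actual code.]
-----
import core.bitop : bsf; // needed wherever the mixin is instantiated

// Hypothetical reconstruction of CheckMask: if any bit of `mask` is set, the
// lowest set bit gives the byte offset of the match inside the current
// 16-byte block pointed to by ptr16.
enum CheckMask = q{
    if (mask != 0)
    {
        return cast(const(ubyte)*)ptr16 + bsf(mask);
    }
};
-----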
Nov 02 2015
On 11/02/2015 09:33 PM, Iakh wrote:
> -----
> Naive:     21.46 TickDuration(132842482)
> SIMD:      1.161 TickDuration(7188211)
> (was)SIMD: 3.04  TickDuration(18920182)
> C:         1     TickDuration(6189222)

Looks like the current memchr is well optimized. Not much blood left in that stone. -- Andrei
Nov 03 2015
On 31-Oct-2015 00:29, Iakh wrote:
> [...]
> Benchmark results:
> -----
> Naive: 21.9 TickDuration(136456491)
> SIMD:  3.04 TickDuration(18920182)
> SIMDM: 2.44 TickDuration(15232176)
> SIMDU: 1.8  TickDuration(11210454)
> C:     1    TickDuration(6233963)
> -----

More or less matches my experience with C's memchr vs something else - it's the fastest portable solution to find a byte in a chunk of memory.

-- Dmitry Olshansky
Oct 31 2015
On Friday, 30 October 2015 at 21:29:47 UTC, Iakh wrote:
> ...

I got it to 1.5x the running time of C using SSE2 but couldn't get GDC to emit the correct aligned loads; if I used __builtin_assume_aligned, the optimizer started behaving really oddly.
Oct 31 2015
On Saturday, 31 October 2015 at 08:37:23 UTC, rsw0x wrote:
> I got it to 1.5x the running time of C using SSE2 but couldn't
> [...]

Can you share your solution?
Oct 31 2015