www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - Fast linear search in D array (slice)

reply Per =?UTF-8?B?Tm9yZGzDtnc=?= <per.nordlow gmail.com> writes:
I'm currently implementing overloads of some Phobos's search 
algorithms where the haystack is a `char[]` and the needle is an 
ASCII-clean `char`. I plan to integrate them to Phobos later down 
the road.

One such overload is at

https://github.com/nordlow/phobos-next/blob/master/src/find_split_ex.d#L10

which currently implements the linear search as

foreach (immutable offset; 0 .. haystack.length)
{
     const bool hit = haystack[offset] == needles[0];
     if (hit)
     {
         return Result(haystack, offset);
     }
}

I now have a couple questions regarding the performance of the 
above foreach-loop:

- Is this the preferred way of implementing a linear search in D?
- Andrei has mentioned that sentinel-based search can be faster 
if the haystack is allowed to be temporarily mutated at the end. 
But I presume such a solution is only faster above a certain 
length for the haystack. Has anybody benchmarked and found a 
suitable length limit for this?
- As a special-case of the above there's also the possibility of 
mutating the last character to be a null and then reusing 
`strchr`.
- When is it preferred to search in blocks of 2, 4, 8 or 16 
bytes? I presume for a certain alignment and, again, length limit.
- Would it be relevant to make such a function a compiler 
intrinsic that handles all these cases for the generic 
non-decoding case where haystack is a `T[]` and needle a `T`?
Nov 07 2018
next sibling parent 12345swordy <alexanderheistermann gmail.com> writes:
On Wednesday, 7 November 2018 at 21:06:14 UTC, Per Nordlöw wrote:
 I'm currently implementing overloads of some Phobos's search 
 algorithms where the haystack is a `char[]` and the needle is 
 an ASCII-clean `char`. I plan to integrate them to Phobos later 
 down the road.

 One such overload is at

 https://github.com/nordlow/phobos-next/blob/master/src/find_split_ex.d#L10

 which currently implements the linear search as

 foreach (immutable offset; 0 .. haystack.length)
 {
     const bool hit = haystack[offset] == needles[0];
     if (hit)
     {
         return Result(haystack, offset);
     }
 }

 I now have a couple questions regarding the performance of the 
 above foreach-loop:

 - Is this the preferred way of implementing a linear search in 
 D?
 - Andrei has mentioned that sentinel-based search can be faster 
 if the haystack is allowed to be temporarily mutated at the 
 end. But I presume such a solution is only faster above a 
 certain length for the haystack. Has anybody benchmarked and 
 found a suitable length limit for this?
 - As a special-case of the above there's also the possibility 
 of mutating the last character to be a null and then reusing 
 `strchr`.
 - When is it preferred to search in blocks of 2, 4, 8 or 16 
 bytes? I presume for a certain alignment and, again, length 
 limit.
 - Would it be relevant to make such a function a compiler 
 intrinsic that handles all these cases for the generic 
 non-decoding case where haystack is a `T[]` and needle a `T`?
Have you thought about using simd for this?
Nov 07 2018
prev sibling next sibling parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Wed, Nov 07, 2018 at 09:06:14PM +0000, Per Nordlöw via Digitalmars-d wrote:
[...]
 - Is this the preferred way of implementing a linear search in D?
 - Andrei has mentioned that sentinel-based search can be faster if the
 haystack is allowed to be temporarily mutated at the end. But I
 presume such a solution is only faster above a certain length for the
 haystack. Has anybody benchmarked and found a suitable length limit
 for this?
I would think the suitable length limit would be highly CPU-dependent and memory-configuration-dependent. There is no single magic number that's going to work everywhere. We can sample a wide variety of hardware configurations and arbitrarily choose a middle-ground figure, I suppose, but it's not going to be optimal for everybody, and it will probably become quickly outdated with every new release of hardware.
 - As a special-case of the above there's also the possibility of
 mutating the last character to be a null and then reusing `strchr`.
That's only because strchr is in the C library, and the assumption is that whoever implemented strchr for that platform has already done his homework to find the best-performing implementation for that platform. If you really wanted to, you could do the same in D and figure out the "best" implementation that way.
 - When is it preferred to search in blocks of 2, 4, 8 or 16 bytes? I
 presume for a certain alignment and, again, length limit.
I don't have the luxury of being able to compare performance on a wide variety of platforms, but I did peruse the Glibc source code for memchr once, and it basically used a clever bit-mask hack to scan for byte patterns 64 bits at a time, with extra code at the start/end to account for non-aligned boundaries of the array. I ran some tests against it, and found that while it was definitely faster when the byte being searched occurred deep into the array, it was actually slower when the byte occurred near the beginning, because of the complexity of setting up the bit-mask hack and the starting/ending boundary code. When the byte being searched for occurs very frequently in a series of repeated calls to memchr, the glibc memchr actually performed worse than a naïve for-loop that compared byte by byte. The bottom line is that even supposedly optimized-to-hell-and-back code like glibc's memchr may have poor performance characteristics for certain use cases. As with all things, it's always a matter of compromise, and the best solution for you will be heavily dependent on exactly what you're going to be using the code for.
 - Would it be relevant to make such a function a compiler intrinsic
 that handles all these cases for the generic non-decoding case where
 haystack is a `T[]` and needle a `T`?
I thought somebody had already put in some specializations in std.algorithm.find to actually call C's memchr when the haystack is an array and the needle is a 1-byte value? I'm not sure a compiler intrinsic is necessary for something like this... sounds like something that the standard library (i.e., Phobos) should be responsible for. Have you tried profiling std.algorithm.find to see how it performs on your use case? And if indeed std.algorithm.find is suboptimal, we could consider whether your use case is general enough to be worth adding another specialization to handle it, and that way it would benefit everybody, instead of everyone rolling their own probably-suboptimal version. (One thing to watch out for with std.algorithm.find is that it may trigger the dreaded autodecoding machinery. You may want to cast to ubyte[] just to eliminate that possibility. Or use .byCodeUnit, but I'm unsure if .find is able to recognize that as an array, so the optimized path may not be chosen.) T -- Let's call it an accidental feature. -- Larry Wall
Nov 07 2018
prev sibling next sibling parent Per =?UTF-8?B?Tm9yZGzDtnc=?= <per.nordlow gmail.com> writes:
On Wednesday, 7 November 2018 at 21:06:14 UTC, Per Nordlöw wrote:
 - As a special-case of the above there's also the possibility 
 of mutating the last character to be a null and then reusing 
 `strchr`.
Correction it's presumably better to, in place of `strchr`, use void *memchr(const void *s, int c, size_t n); and void *rawmemchr(const void *s, int c); for the sentinel-at-the-end case. For the non-CTFE case, that is.
Nov 07 2018
prev sibling parent reply Stanislav Blinov <stanislav.blinov gmail.com> writes:
On Wednesday, 7 November 2018 at 21:06:14 UTC, Per Nordlöw wrote:

 - Andrei has mentioned that sentinel-based search can be faster 
 if the haystack is allowed to be temporarily mutated at the 
 end. But I presume such a solution is only faster above a 
 certain length for the haystack. Has anybody benchmarked and 
 found a suitable length limit for this?
Speaking about x86(64), when your haystack fits in the cache entirely and is already in it, you'd be hard pressed to beat linear search. If not, a naive sentinel search is more likely to *degrade* performance, unless needle tends to appear in the last cache line worth of data. That's because you'll be performing unnecessary memory accesses. A sentinel search by cache line may be a better idea. So there isn't a particular length limit. When performance is key, your program has to decide which approach to take based on your data layout and access patterns. A single generic library function can't do that.
 - As a special-case of the above there's also the possibility 
 of mutating the last character to be a null and then reusing 
 `strchr`.
D libraries should reduce dependencies on C libraries, not breed them.
Nov 07 2018
parent Per =?UTF-8?B?Tm9yZGzDtnc=?= <per.nordlow gmail.com> writes:
On Thursday, 8 November 2018 at 01:47:05 UTC, Stanislav Blinov 
wrote:
 So there isn't a particular length limit. When performance is 
 key, your program has to decide which approach to take based on 
 your data layout and access patterns. A single generic library 
 function can't do that.
 D libraries should reduce dependencies on C libraries, not 
 breed them.
Yep, agree. Thanks.
Nov 08 2018