
digitalmars.D - Fast linear search in D array (slice)

Per Nordlöw <per.nordlow gmail.com> writes:
I'm currently implementing overloads of some of Phobos's search 
algorithms where the haystack is a `char[]` and the needle is an 
ASCII-clean `char`. I plan to integrate them into Phobos later 
down the road.

One such overload is at

https://github.com/nordlow/phobos-next/blob/master/src/find_split_ex.d#L10

which currently implements the linear search as

foreach (immutable offset; 0 .. haystack.length)
{
     const bool hit = haystack[offset] == needles[0];
     if (hit)
     {
         return Result(haystack, offset);
     }
}

I now have a couple questions regarding the performance of the 
above foreach-loop:

- Is this the preferred way of implementing a linear search in D?
- Andrei has mentioned that sentinel-based search can be faster 
if the haystack is allowed to be temporarily mutated at the end. 
But I presume such a solution is only faster above a certain 
length for the haystack. Has anybody benchmarked and found a 
suitable length limit for this?
- As a special-case of the above there's also the possibility of 
mutating the last character to be a null and then reusing 
`strchr`.
- When is it preferred to search in blocks of 2, 4, 8 or 16 
bytes? I presume for a certain alignment and, again, length limit.
- Would it be relevant to make such a function a compiler 
intrinsic that handles all these cases for the generic 
non-decoding case where haystack is a `T[]` and needle a `T`?
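To make the sentinel idea concrete, here is a rough sketch in C (the function name `sentinel_find` is invented for illustration; this is not the linked Phobos code, and a real D version would need to justify the temporary mutation and be benchmarked against the plain loop):

```c
#include <assert.h>
#include <stddef.h>

/* Sentinel-based linear search: temporarily write the needle into the
 * last slot so the inner loop needs no bounds check, then restore the
 * original byte and decide whether the hit was real.
 * Returns the index of the first match, or len if not found. */
static size_t sentinel_find(char *haystack, size_t len, char needle)
{
    if (len == 0)
        return len;                 /* nothing to search */
    char saved = haystack[len - 1];
    haystack[len - 1] = needle;     /* plant the sentinel */
    size_t i = 0;
    while (haystack[i] != needle)   /* no `i < len` test needed */
        ++i;
    haystack[len - 1] = saved;      /* undo the mutation */
    if (i == len - 1 && saved != needle)
        return len;                 /* only the sentinel hit: not found */
    return i;
}
```

Note that the buffer must be mutable for the duration of the call, which is exactly the restriction being asked about above.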
Nov 07 2018
12345swordy <alexanderheistermann gmail.com> writes:
On Wednesday, 7 November 2018 at 21:06:14 UTC, Per Nordlöw wrote:
 I'm currently implementing overloads of some of Phobos's search 
 algorithms where the haystack is a `char[]` and the needle is 
 an ASCII-clean `char`. I plan to integrate them into Phobos 
 later down the road.

 One such overload is at

 https://github.com/nordlow/phobos-next/blob/master/src/find_split_ex.d#L10

 which currently implements the linear search as

 foreach (immutable offset; 0 .. haystack.length)
 {
     const bool hit = haystack[offset] == needles[0];
     if (hit)
     {
         return Result(haystack, offset);
     }
 }

 I now have a couple questions regarding the performance of the 
 above foreach-loop:

 - Is this the preferred way of implementing a linear search in 
 D?
 - Andrei has mentioned that sentinel-based search can be faster 
 if the haystack is allowed to be temporarily mutated at the 
 end. But I presume such a solution is only faster above a 
 certain length for the haystack. Has anybody benchmarked and 
 found a suitable length limit for this?
 - As a special-case of the above there's also the possibility 
 of mutating the last character to be a null and then reusing 
 `strchr`.
 - When is it preferred to search in blocks of 2, 4, 8 or 16 
 bytes? I presume for a certain alignment and, again, length 
 limit.
 - Would it be relevant to make such a function a compiler 
 intrinsic that handles all these cases for the generic 
 non-decoding case where haystack is a `T[]` and needle a `T`?
Have you thought about using SIMD for this?
Nov 07 2018
"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Wed, Nov 07, 2018 at 09:06:14PM +0000, Per Nordlöw via Digitalmars-d wrote:
[...]
 - Is this the preferred way of implementing a linear search in D?
 - Andrei has mentioned that sentinel-based search can be faster if the
 haystack is allowed to be temporarily mutated at the end. But I
 presume such a solution is only faster above a certain length for the
 haystack. Has anybody benchmarked and found a suitable length limit
 for this?
I would think the suitable length limit would be highly CPU-dependent and memory-configuration-dependent. There is no single magic number that's going to work everywhere. We can sample a wide variety of hardware configurations and arbitrarily choose a middle-ground figure, I suppose, but it's not going to be optimal for everybody, and it will probably become quickly outdated with every new release of hardware.
 - As a special-case of the above there's also the possibility of
 mutating the last character to be a null and then reusing `strchr`.
That's only because strchr is in the C library, and the assumption is that whoever implemented strchr for that platform has already done his homework to find the best-performing implementation for that platform. If you really wanted to, you could do the same in D and figure out the "best" implementation that way.
 - When is it preferred to search in blocks of 2, 4, 8 or 16 bytes? I
 presume for a certain alignment and, again, length limit.
I don't have the luxury of being able to compare performance on a wide variety of platforms, but I did peruse the glibc source code for memchr once, and it basically used a clever bit-mask hack to scan for byte patterns 64 bits at a time, with extra code at the start/end to account for non-aligned boundaries of the array.

I ran some tests against it, and found that while it was definitely faster when the byte being searched for occurred deep into the array, it was actually slower when the byte occurred near the beginning, because of the complexity of setting up the bit-mask hack and the starting/ending boundary code. When the byte being searched for occurs very frequently in a series of repeated calls to memchr, the glibc memchr actually performed worse than a naïve for-loop that compared byte by byte.

The bottom line is that even supposedly optimized-to-hell-and-back code like glibc's memchr may have poor performance characteristics for certain use cases. As with all things, it's always a matter of compromise, and the best solution for you will be heavily dependent on exactly what you're going to be using the code for.
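For illustration, the word-at-a-time technique described above can be sketched roughly as follows in C (this is a simplified sketch, not glibc's actual implementation; the name `swar_find` is invented, and proper alignment handling is omitted in favor of `memcpy` loads):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Scan 8 bytes at a time: XOR each 64-bit word with the needle
 * replicated into every byte, so matching bytes become zero, then use
 * the classic "has zero byte" bit hack to test all 8 lanes at once.
 * Returns the index of the first match, or len if not found. */
static size_t swar_find(const char *haystack, size_t len, char needle)
{
    const uint64_t ones = 0x0101010101010101ULL;
    const uint64_t mask = (uint8_t)needle * ones;  /* needle in every byte */
    size_t i = 0;
    for (; i + 8 <= len; i += 8) {
        uint64_t word;
        memcpy(&word, haystack + i, 8);            /* avoids unaligned UB */
        uint64_t x = word ^ mask;                  /* matching bytes -> 0 */
        if ((x - ones) & ~x & 0x8080808080808080ULL)
            break;                                 /* a byte in this word matched */
    }
    for (; i < len; ++i)                           /* scalar tail and exact hit */
        if (haystack[i] == needle)
            return i;
    return len;
}
```

This also makes the trade-off above visible: the word loop only pays off when the match is far from the start; for early matches the setup and the final scalar pass are pure overhead.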
 - Would it be relevant to make such a function a compiler intrinsic
 that handles all these cases for the generic non-decoding case where
 haystack is a `T[]` and needle a `T`?
I thought somebody had already put in some specializations in std.algorithm.find to actually call C's memchr when the haystack is an array and the needle is a 1-byte value? I'm not sure a compiler intrinsic is necessary for something like this... it sounds like something that the standard library (i.e., Phobos) should be responsible for.

Have you tried profiling std.algorithm.find to see how it performs on your use case? If std.algorithm.find is indeed suboptimal, we could consider whether your use case is general enough to be worth adding another specialization to handle it; that way it would benefit everybody, instead of everyone rolling their own probably-suboptimal version.

(One thing to watch out for with std.algorithm.find is that it may trigger the dreaded autodecoding machinery. You may want to cast to ubyte[] just to eliminate that possibility. Or use .byCodeUnit, but I'm unsure whether .find is able to recognize that as an array, so the optimized path may not be chosen.)


T

-- 
Let's call it an accidental feature. -- Larry Wall
Nov 07 2018
Per Nordlöw <per.nordlow gmail.com> writes:
On Wednesday, 7 November 2018 at 21:06:14 UTC, Per Nordlöw wrote:
 - As a special-case of the above there's also the possibility 
 of mutating the last character to be a null and then reusing 
 `strchr`.
Correction: in place of `strchr`, it's presumably better to use

    void *memchr(const void *s, int c, size_t n);

and (a GNU extension)

    void *rawmemchr(const void *s, int c);

for the sentinel-at-the-end case. For the non-CTFE case, that is.
Nov 07 2018
Stanislav Blinov <stanislav.blinov gmail.com> writes:
On Wednesday, 7 November 2018 at 21:06:14 UTC, Per Nordlöw wrote:

 - Andrei has mentioned that sentinel-based search can be faster 
 if the haystack is allowed to be temporarily mutated at the 
 end. But I presume such a solution is only faster above a 
 certain length for the haystack. Has anybody benchmarked and 
 found a suitable length limit for this?
Speaking about x86(-64): when your haystack fits in the cache entirely and is already in it, you'd be hard pressed to beat linear search. If not, a naive sentinel search is more likely to *degrade* performance, unless the needle tends to appear in the last cache line's worth of data. That's because you'll be performing unnecessary memory accesses. A sentinel search by cache line may be a better idea.

So there isn't a particular length limit. When performance is key, your program has to decide which approach to take based on your data layout and access patterns. A single generic library function can't do that.
 - As a special-case of the above there's also the possibility 
 of mutating the last character to be a null and then reusing 
 `strchr`.
D libraries should reduce dependencies on C libraries, not breed them.
Nov 07 2018
Per Nordlöw <per.nordlow gmail.com> writes:
On Thursday, 8 November 2018 at 01:47:05 UTC, Stanislav Blinov 
wrote:
 So there isn't a particular length limit. When performance is 
 key, your program has to decide which approach to take based on 
 your data layout and access patterns. A single generic library 
 function can't do that.
 D libraries should reduce dependencies on C libraries, not 
 breed them.
Yep, agree. Thanks.
Nov 08 2018