digitalmars.D - [up for grabs]Two-way string searching

Dmitry Olshansky (13/13) Jun 16 2013 It have long bugged me that Phobos std.algorithm.find is slow. Far

Andrei Alexandrescu (4/16) Jun 16 2013 Awesome idea! Could you please submit an enhancement request for this? I...

monarch_dodra (29/55) Jun 17 2013 One of the "problems" is that find is array agnostic, so doesn't

Dmitry Olshansky (10/59) Jun 17 2013 Agreed and there are many tricks beyond that. Like making a repeated bit...

monarch_dodra (7/9) Jun 17 2013 Well... putting it that way always looks nice on paper, but for

Dmitry Olshansky (8/15) Jun 17 2013 I'm sure I'm reading it right:

Andrea Fontana (5/18) Jun 17 2013 http://volnitsky.com/project/str_search/

Ivan Kazmenko (19/23) Jun 17 2013 That's a strange page, at a glance:
Dmitry Olshansky (5/27) Jun 17 2013 No idea as I do not see Two way search represented here at all.

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

It have long bugged me that Phobos std.algorithm.find is slow. Far 
slower then strstr (esp on *nix where it competes against GLIBC[1]).

The current state of the art in searching substring
with O(1) space requirement is Two-Way algorithm:

http://www-igm.univ-mlv.fr/~lecroq/string/node26.html

which is linear in time.

I could imagine it would be interesting to implement a generic version 
as well. Any takers? I'd gladly go for it myself but unfortunately way 
too busy on other fronts (I planed to do it for a couple of months already).

[1] See a relevant bug report and discussion e.g on glibc
http://sourceware.org/bugzilla/show_bug.cgi?id=5514


-- 
Dmitry Olshansky

Jun 16 2013

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 6/16/13 6:16 PM, Dmitry Olshansky wrote:
 It have long bugged me that Phobos std.algorithm.find is slow. Far
 slower then strstr (esp on *nix where it competes against GLIBC[1]).

 The current state of the art in searching substring
 with O(1) space requirement is Two-Way algorithm:

 http://www-igm.univ-mlv.fr/~lecroq/string/node26.html

 which is linear in time.

 I could imagine it would be interesting to implement a generic version
 as well. Any takers? I'd gladly go for it myself but unfortunately way
 too busy on other fronts (I planed to do it for a couple of months
 already).

 [1] See a relevant bug report and discussion e.g on glibc
 http://sourceware.org/bugzilla/show_bug.cgi?id=5514

Awesome idea! Could you please submit an enhancement request for this? I 
wonder what kind of ranges that could work on.

Andrei

Jun 16 2013

"monarch_dodra" <monarchdodra gmail.com> writes:

On Sunday, 16 June 2013 at 23:09:31 UTC, Andrei Alexandrescu 
wrote:
 On 6/16/13 6:16 PM, Dmitry Olshansky wrote:
 It have long bugged me that Phobos std.algorithm.find is slow. 
 Far
 slower then strstr (esp on *nix where it competes against 
 GLIBC[1]).

 The current state of the art in searching substring
 with O(1) space requirement is Two-Way algorithm:

 http://www-igm.univ-mlv.fr/~lecroq/string/node26.html

 which is linear in time.

 I could imagine it would be interesting to implement a generic 
 version
 as well. Any takers? I'd gladly go for it myself but 
 unfortunately way
 too busy on other fronts (I planed to do it for a couple of 
 months
 already).

 [1] See a relevant bug report and discussion e.g on glibc
 http://sourceware.org/bugzilla/show_bug.cgi?id=5514

 Awesome idea! Could you please submit an enhancement request 
 for this? I wonder what kind of ranges that could work on.

 Andrei

One of the "problems" is that find is array agnostic, so doesn't 
know how to squeeze out all the juice out arrays, such as:
* No bounds check on accesses
* No bounds check on slicing
* vectorized comparisons

I took the existing find(RA, BD) code, and modified it to operate 
on find(ARR, ARR).

On average, I'm getting roughly 20%-25% performance improvements 
(-release -O -inline), although the result is of course highly 
dependent on the tested input.

Goes to say that by addapting the existing algorithm to simply 
better exploit arrays, there is already room for good 
improvements.

Given that string-to-same-width-string boils back down integral 
array search, the gains would also be had for strings.

--------
I was also able to squeeze out similar performace boosts for 
find(R, E), with minimal code changes, exploiting better 
iteration semantics based on the type iterated (range, RA array, 
or narrow string).

--------
I can start by improving find(R, E), because it is a small but 
easy and effective change.

For find(R, R), things are a bit more dicey to properly 
integrate, so I don't want to do anything right now.

But the point is that there is still room for substantial 
improvements without modifying the algorithm too much...

Jun 17 2013

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

17-Jun-2013 15:36, monarch_dodra пишет:
 On Sunday, 16 June 2013 at 23:09:31 UTC, Andrei Alexandrescu wrote:
 On 6/16/13 6:16 PM, Dmitry Olshansky wrote:
 It have long bugged me that Phobos std.algorithm.find is slow. Far
 slower then strstr (esp on *nix where it competes against GLIBC[1]).

 The current state of the art in searching substring
 with O(1) space requirement is Two-Way algorithm:

 http://www-igm.univ-mlv.fr/~lecroq/string/node26.html

 which is linear in time.

 I could imagine it would be interesting to implement a generic version
 as well. Any takers? I'd gladly go for it myself but unfortunately way
 too busy on other fronts (I planed to do it for a couple of months
 already).

 [1] See a relevant bug report and discussion e.g on glibc
 http://sourceware.org/bugzilla/show_bug.cgi?id=5514

 Awesome idea! Could you please submit an enhancement request for this?
 I wonder what kind of ranges that could work on.

 Andrei

 One of the "problems" is that find is array agnostic, so doesn't know
 how to squeeze out all the juice out arrays, such as:
 * No bounds check on accesses
 * No bounds check on slicing
 * vectorized comparisons

 I took the existing find(RA, BD) code, and modified it to operate on
 find(ARR, ARR).

 On average, I'm getting roughly 20%-25% performance improvements
 (-release -O -inline), although the result is of course highly dependent
 on the tested input.

 Goes to say that by addapting the existing algorithm to simply better
 exploit arrays, there is already room for good improvements.

Agreed and there are many tricks beyond that. Like making a repeated bit 
pattern and searching one machine word at a time (if element size < 
size_t.sizeof).

And this is all nice and have its place too but I'm mostly talking about 
O(N*M) vs O(N).

Anyway, filed:
http://d.puremagic.com/issues/show_bug.cgi?id=10392


 Given that string-to-same-width-string boils back down integral array
 search, the gains would also be had for strings.

 --------
 I was also able to squeeze out similar performace boosts for find(R, E),
 with minimal code changes, exploiting better iteration semantics based
 on the type iterated (range, RA array, or narrow string).

 --------
 I can start by improving find(R, E), because it is a small but easy and
 effective change.

 For find(R, R), things are a bit more dicey to properly integrate, so I
 don't want to do anything right now.

 But the point is that there is still room for substantial improvements
 without modifying the algorithm too much...


-- 
Dmitry Olshansky

Jun 17 2013

"monarch_dodra" <monarchdodra gmail.com> writes:

On Monday, 17 June 2013 at 17:13:49 UTC, Dmitry Olshansky wrote:
 And this is all nice and have its place too but I'm mostly 
 talking about O(N*M) vs O(N).

Well... putting it that way always looks nice on paper, but for 
what values of N and M is it actually *faster* ?

Also, are we talking about *actual* O(N), or average case O(N) ? 
Because the link has a worst case scenario of O(N*M).

Actual O(N) comes with a space tradeof, which also comes with a 
very real allocation cost that doesn't appear in the complexity.

Jun 17 2013

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

17-Jun-2013 22:02, monarch_dodra пишет:
 On Monday, 17 June 2013 at 17:13:49 UTC, Dmitry Olshansky wrote:
 And this is all nice and have its place too but I'm mostly talking
 about O(N*M) vs O(N).

 Well... putting it that way always looks nice on paper, but for what
 values of N and M is it actually *faster* ?

 Also, are we talking about *actual* O(N), or average case O(N) ? Because
 the link has a worst case scenario of O(N*M).

I'm sure I'm reading it right:
- preprocessing phase in O(m) time and constant space complexity;
- constant space complexity for the preprocessing phase;
- searching phase in O(n) time;
- performs 2n-m text character comparisons in the worst case.



-- 
Dmitry Olshansky

Jun 17 2013

"Andrea Fontana" <nospam example.com> writes:

http://volnitsky.com/project/str_search/

Reading this, two way algorithm doesn't seem the best one... 
(author says it's  used by GLIBC_memmem)

Have I missed the point?

On Sunday, 16 June 2013 at 22:17:04 UTC, Dmitry Olshansky wrote:
 It have long bugged me that Phobos std.algorithm.find is slow. 
 Far slower then strstr (esp on *nix where it competes against 
 GLIBC[1]).

 The current state of the art in searching substring
 with O(1) space requirement is Two-Way algorithm:

 http://www-igm.univ-mlv.fr/~lecroq/string/node26.html

 which is linear in time.

 I could imagine it would be interesting to implement a generic 
 version as well. Any takers? I'd gladly go for it myself but 
 unfortunately way too busy on other fronts (I planed to do it 
 for a couple of months already).

 [1] See a relevant bug report and discussion e.g on glibc
 http://sourceware.org/bugzilla/show_bug.cgi?id=5514

Jun 17 2013

"Ivan Kazmenko" <gassa mail.ru> writes:

On Monday, 17 June 2013 at 08:04:14 UTC, Andrea Fontana wrote:
 http://volnitsky.com/project/str_search/

 Reading this, two way algorithm doesn't seem the best one... 
 (author says it's  used by GLIBC_memmem)

 Have I missed the point?

That's a strange page, at a glance:

1. The algorithms were tested essentially on one real-life test 
case.
2. Author claims to use a structure similar to hash table, but 
does not include Rabin-Karp algorithm in comparison.
3. Reference implementation does not in fact allow online 
searching, but the algorithm is claimed to have the capacity.
4. Reference implementation has hardcoded constants for array 
sizes.
5. The comment about case-insensitive search seems inefficient at 
best: it suggests to hash every case combination (hello 
combinatorial explosion) instead of just converting everything to 
lowercase on the fly.
(and 6. Reference implementation didn't work for me, but it may 
be actually my fault...)

Given the points above, I won't make any decisions based on data 
from that page :) .

Ivan Kazmenko.

Jun 17 2013

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

17-Jun-2013 12:04, Andrea Fontana пишет:
 http://volnitsky.com/project/str_search/

 Reading this, two way algorithm doesn't seem the best one... (author
 says it's  used by GLIBC_memmem)

 Have I missed the point?

No idea as I do not see Two way search represented here at all.
Otherwise all of the comments by Ivan apply.

 On Sunday, 16 June 2013 at 22:17:04 UTC, Dmitry Olshansky wrote:
 It have long bugged me that Phobos std.algorithm.find is slow. Far
 slower then strstr (esp on *nix where it competes against GLIBC[1]).

 The current state of the art in searching substring
 with O(1) space requirement is Two-Way algorithm:

 http://www-igm.univ-mlv.fr/~lecroq/string/node26.html

 which is linear in time.

 I could imagine it would be interesting to implement a generic version
 as well. Any takers? I'd gladly go for it myself but unfortunately way
 too busy on other fronts (I planed to do it for a couple of months
 already).

 [1] See a relevant bug report and discussion e.g on glibc
 http://sourceware.org/bugzilla/show_bug.cgi?id=5514



-- 
Dmitry Olshansky

Jun 17 2013

D Programming

C/C++ Programming

Other

digitalmars.D - [up for grabs]Two-way string searching