www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - [up for grabs]Two-way string searching

reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
It have long bugged me that Phobos std.algorithm.find is slow. Far 
slower then strstr (esp on *nix where it competes against GLIBC[1]).

The current state of the art in searching substring
with O(1) space requirement is Two-Way algorithm:

http://www-igm.univ-mlv.fr/~lecroq/string/node26.html

which is linear in time.

I could imagine it would be interesting to implement a generic version 
as well. Any takers? I'd gladly go for it myself but unfortunately way 
too busy on other fronts (I planed to do it for a couple of months already).

[1] See a relevant bug report and discussion e.g on glibc
http://sourceware.org/bugzilla/show_bug.cgi?id=5514


-- 
Dmitry Olshansky
Jun 16 2013
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 6/16/13 6:16 PM, Dmitry Olshansky wrote:
 It have long bugged me that Phobos std.algorithm.find is slow. Far
 slower then strstr (esp on *nix where it competes against GLIBC[1]).

 The current state of the art in searching substring
 with O(1) space requirement is Two-Way algorithm:

 http://www-igm.univ-mlv.fr/~lecroq/string/node26.html

 which is linear in time.

 I could imagine it would be interesting to implement a generic version
 as well. Any takers? I'd gladly go for it myself but unfortunately way
 too busy on other fronts (I planed to do it for a couple of months
 already).

 [1] See a relevant bug report and discussion e.g on glibc
 http://sourceware.org/bugzilla/show_bug.cgi?id=5514
Awesome idea! Could you please submit an enhancement request for this? I wonder what kind of ranges that could work on. Andrei
Jun 16 2013
parent reply "monarch_dodra" <monarchdodra gmail.com> writes:
On Sunday, 16 June 2013 at 23:09:31 UTC, Andrei Alexandrescu 
wrote:
 On 6/16/13 6:16 PM, Dmitry Olshansky wrote:
 It have long bugged me that Phobos std.algorithm.find is slow. 
 Far
 slower then strstr (esp on *nix where it competes against 
 GLIBC[1]).

 The current state of the art in searching substring
 with O(1) space requirement is Two-Way algorithm:

 http://www-igm.univ-mlv.fr/~lecroq/string/node26.html

 which is linear in time.

 I could imagine it would be interesting to implement a generic 
 version
 as well. Any takers? I'd gladly go for it myself but 
 unfortunately way
 too busy on other fronts (I planed to do it for a couple of 
 months
 already).

 [1] See a relevant bug report and discussion e.g on glibc
 http://sourceware.org/bugzilla/show_bug.cgi?id=5514
Awesome idea! Could you please submit an enhancement request for this? I wonder what kind of ranges that could work on. Andrei
One of the "problems" is that find is array agnostic, so doesn't know how to squeeze out all the juice out arrays, such as: * No bounds check on accesses * No bounds check on slicing * vectorized comparisons I took the existing find(RA, BD) code, and modified it to operate on find(ARR, ARR). On average, I'm getting roughly 20%-25% performance improvements (-release -O -inline), although the result is of course highly dependent on the tested input. Goes to say that by addapting the existing algorithm to simply better exploit arrays, there is already room for good improvements. Given that string-to-same-width-string boils back down integral array search, the gains would also be had for strings. -------- I was also able to squeeze out similar performace boosts for find(R, E), with minimal code changes, exploiting better iteration semantics based on the type iterated (range, RA array, or narrow string). -------- I can start by improving find(R, E), because it is a small but easy and effective change. For find(R, R), things are a bit more dicey to properly integrate, so I don't want to do anything right now. But the point is that there is still room for substantial improvements without modifying the algorithm too much...
Jun 17 2013
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
17-Jun-2013 15:36, monarch_dodra пишет:
 On Sunday, 16 June 2013 at 23:09:31 UTC, Andrei Alexandrescu wrote:
 On 6/16/13 6:16 PM, Dmitry Olshansky wrote:
 It have long bugged me that Phobos std.algorithm.find is slow. Far
 slower then strstr (esp on *nix where it competes against GLIBC[1]).

 The current state of the art in searching substring
 with O(1) space requirement is Two-Way algorithm:

 http://www-igm.univ-mlv.fr/~lecroq/string/node26.html

 which is linear in time.

 I could imagine it would be interesting to implement a generic version
 as well. Any takers? I'd gladly go for it myself but unfortunately way
 too busy on other fronts (I planed to do it for a couple of months
 already).

 [1] See a relevant bug report and discussion e.g on glibc
 http://sourceware.org/bugzilla/show_bug.cgi?id=5514
Awesome idea! Could you please submit an enhancement request for this? I wonder what kind of ranges that could work on. Andrei
One of the "problems" is that find is array agnostic, so doesn't know how to squeeze out all the juice out arrays, such as: * No bounds check on accesses * No bounds check on slicing * vectorized comparisons I took the existing find(RA, BD) code, and modified it to operate on find(ARR, ARR). On average, I'm getting roughly 20%-25% performance improvements (-release -O -inline), although the result is of course highly dependent on the tested input. Goes to say that by addapting the existing algorithm to simply better exploit arrays, there is already room for good improvements.
Agreed and there are many tricks beyond that. Like making a repeated bit pattern and searching one machine word at a time (if element size < size_t.sizeof). And this is all nice and have its place too but I'm mostly talking about O(N*M) vs O(N). Anyway, filed: http://d.puremagic.com/issues/show_bug.cgi?id=10392
 Given that string-to-same-width-string boils back down integral array
 search, the gains would also be had for strings.

 --------
 I was also able to squeeze out similar performace boosts for find(R, E),
 with minimal code changes, exploiting better iteration semantics based
 on the type iterated (range, RA array, or narrow string).

 --------
 I can start by improving find(R, E), because it is a small but easy and
 effective change.

 For find(R, R), things are a bit more dicey to properly integrate, so I
 don't want to do anything right now.

 But the point is that there is still room for substantial improvements
 without modifying the algorithm too much...
-- Dmitry Olshansky
Jun 17 2013
parent reply "monarch_dodra" <monarchdodra gmail.com> writes:
On Monday, 17 June 2013 at 17:13:49 UTC, Dmitry Olshansky wrote:
 And this is all nice and have its place too but I'm mostly 
 talking about O(N*M) vs O(N).
Well... putting it that way always looks nice on paper, but for what values of N and M is it actually *faster* ? Also, are we talking about *actual* O(N), or average case O(N) ? Because the link has a worst case scenario of O(N*M). Actual O(N) comes with a space tradeof, which also comes with a very real allocation cost that doesn't appear in the complexity.
Jun 17 2013
parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
17-Jun-2013 22:02, monarch_dodra пишет:
 On Monday, 17 June 2013 at 17:13:49 UTC, Dmitry Olshansky wrote:
 And this is all nice and have its place too but I'm mostly talking
 about O(N*M) vs O(N).
Well... putting it that way always looks nice on paper, but for what values of N and M is it actually *faster* ? Also, are we talking about *actual* O(N), or average case O(N) ? Because the link has a worst case scenario of O(N*M).
I'm sure I'm reading it right: - preprocessing phase in O(m) time and constant space complexity; - constant space complexity for the preprocessing phase; - searching phase in O(n) time; - performs 2n-m text character comparisons in the worst case. -- Dmitry Olshansky
Jun 17 2013
prev sibling parent reply "Andrea Fontana" <nospam example.com> writes:
http://volnitsky.com/project/str_search/

Reading this, two way algorithm doesn't seem the best one... 
(author says it's  used by GLIBC_memmem)

Have I missed the point?

On Sunday, 16 June 2013 at 22:17:04 UTC, Dmitry Olshansky wrote:
 It have long bugged me that Phobos std.algorithm.find is slow. 
 Far slower then strstr (esp on *nix where it competes against 
 GLIBC[1]).

 The current state of the art in searching substring
 with O(1) space requirement is Two-Way algorithm:

 http://www-igm.univ-mlv.fr/~lecroq/string/node26.html

 which is linear in time.

 I could imagine it would be interesting to implement a generic 
 version as well. Any takers? I'd gladly go for it myself but 
 unfortunately way too busy on other fronts (I planed to do it 
 for a couple of months already).

 [1] See a relevant bug report and discussion e.g on glibc
 http://sourceware.org/bugzilla/show_bug.cgi?id=5514
Jun 17 2013
next sibling parent "Ivan Kazmenko" <gassa mail.ru> writes:
On Monday, 17 June 2013 at 08:04:14 UTC, Andrea Fontana wrote:
 http://volnitsky.com/project/str_search/

 Reading this, two way algorithm doesn't seem the best one... 
 (author says it's  used by GLIBC_memmem)

 Have I missed the point?
That's a strange page, at a glance: 1. The algorithms were tested essentially on one real-life test case. 2. Author claims to use a structure similar to hash table, but does not include Rabin-Karp algorithm in comparison. 3. Reference implementation does not in fact allow online searching, but the algorithm is claimed to have the capacity. 4. Reference implementation has hardcoded constants for array sizes. 5. The comment about case-insensitive search seems inefficient at best: it suggests to hash every case combination (hello combinatorial explosion) instead of just converting everything to lowercase on the fly. (and 6. Reference implementation didn't work for me, but it may be actually my fault...) Given the points above, I won't make any decisions based on data from that page :) . Ivan Kazmenko.
Jun 17 2013
prev sibling parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
17-Jun-2013 12:04, Andrea Fontana пишет:
 http://volnitsky.com/project/str_search/

 Reading this, two way algorithm doesn't seem the best one... (author
 says it's  used by GLIBC_memmem)

 Have I missed the point?
No idea as I do not see Two way search represented here at all. Otherwise all of the comments by Ivan apply.
 On Sunday, 16 June 2013 at 22:17:04 UTC, Dmitry Olshansky wrote:
 It have long bugged me that Phobos std.algorithm.find is slow. Far
 slower then strstr (esp on *nix where it competes against GLIBC[1]).

 The current state of the art in searching substring
 with O(1) space requirement is Two-Way algorithm:

 http://www-igm.univ-mlv.fr/~lecroq/string/node26.html

 which is linear in time.

 I could imagine it would be interesting to implement a generic version
 as well. Any takers? I'd gladly go for it myself but unfortunately way
 too busy on other fronts (I planed to do it for a couple of months
 already).

 [1] See a relevant bug report and discussion e.g on glibc
 http://sourceware.org/bugzilla/show_bug.cgi?id=5514
-- Dmitry Olshansky
Jun 17 2013