digitalmars.D - [up for grabs]Two-way string searching
- Dmitry Olshansky (13/13) Jun 16 2013 It have long bugged me that Phobos std.algorithm.find is slow. Far
- Andrei Alexandrescu (4/16) Jun 16 2013 Awesome idea! Could you please submit an enhancement request for this? I...
- monarch_dodra (29/55) Jun 17 2013 One of the "problems" is that find is array agnostic, so doesn't
- Dmitry Olshansky (10/59) Jun 17 2013 Agreed and there are many tricks beyond that. Like making a repeated bit...
- monarch_dodra (7/9) Jun 17 2013 Well... putting it that way always looks nice on paper, but for
- Dmitry Olshansky (8/15) Jun 17 2013 I'm sure I'm reading it right:
- Andrea Fontana (5/18) Jun 17 2013 http://volnitsky.com/project/str_search/
- Ivan Kazmenko (19/23) Jun 17 2013 That's a strange page, at a glance:
- Dmitry Olshansky (5/27) Jun 17 2013 No idea as I do not see Two way search represented here at all.
It have long bugged me that Phobos std.algorithm.find is slow. Far slower then strstr (esp on *nix where it competes against GLIBC[1]). The current state of the art in searching substring with O(1) space requirement is Two-Way algorithm: http://www-igm.univ-mlv.fr/~lecroq/string/node26.html which is linear in time. I could imagine it would be interesting to implement a generic version as well. Any takers? I'd gladly go for it myself but unfortunately way too busy on other fronts (I planed to do it for a couple of months already). [1] See a relevant bug report and discussion e.g on glibc http://sourceware.org/bugzilla/show_bug.cgi?id=5514 -- Dmitry Olshansky
Jun 16 2013
On 6/16/13 6:16 PM, Dmitry Olshansky wrote:It have long bugged me that Phobos std.algorithm.find is slow. Far slower then strstr (esp on *nix where it competes against GLIBC[1]). The current state of the art in searching substring with O(1) space requirement is Two-Way algorithm: http://www-igm.univ-mlv.fr/~lecroq/string/node26.html which is linear in time. I could imagine it would be interesting to implement a generic version as well. Any takers? I'd gladly go for it myself but unfortunately way too busy on other fronts (I planed to do it for a couple of months already). [1] See a relevant bug report and discussion e.g on glibc http://sourceware.org/bugzilla/show_bug.cgi?id=5514Awesome idea! Could you please submit an enhancement request for this? I wonder what kind of ranges that could work on. Andrei
Jun 16 2013
On Sunday, 16 June 2013 at 23:09:31 UTC, Andrei Alexandrescu wrote:On 6/16/13 6:16 PM, Dmitry Olshansky wrote:One of the "problems" is that find is array agnostic, so doesn't know how to squeeze out all the juice out arrays, such as: * No bounds check on accesses * No bounds check on slicing * vectorized comparisons I took the existing find(RA, BD) code, and modified it to operate on find(ARR, ARR). On average, I'm getting roughly 20%-25% performance improvements (-release -O -inline), although the result is of course highly dependent on the tested input. Goes to say that by addapting the existing algorithm to simply better exploit arrays, there is already room for good improvements. Given that string-to-same-width-string boils back down integral array search, the gains would also be had for strings. -------- I was also able to squeeze out similar performace boosts for find(R, E), with minimal code changes, exploiting better iteration semantics based on the type iterated (range, RA array, or narrow string). -------- I can start by improving find(R, E), because it is a small but easy and effective change. For find(R, R), things are a bit more dicey to properly integrate, so I don't want to do anything right now. But the point is that there is still room for substantial improvements without modifying the algorithm too much...It have long bugged me that Phobos std.algorithm.find is slow. Far slower then strstr (esp on *nix where it competes against GLIBC[1]). The current state of the art in searching substring with O(1) space requirement is Two-Way algorithm: http://www-igm.univ-mlv.fr/~lecroq/string/node26.html which is linear in time. I could imagine it would be interesting to implement a generic version as well. Any takers? I'd gladly go for it myself but unfortunately way too busy on other fronts (I planed to do it for a couple of months already). [1] See a relevant bug report and discussion e.g on glibc http://sourceware.org/bugzilla/show_bug.cgi?id=5514Awesome idea! Could you please submit an enhancement request for this? I wonder what kind of ranges that could work on. Andrei
Jun 17 2013
17-Jun-2013 15:36, monarch_dodra пишет:On Sunday, 16 June 2013 at 23:09:31 UTC, Andrei Alexandrescu wrote:Agreed and there are many tricks beyond that. Like making a repeated bit pattern and searching one machine word at a time (if element size < size_t.sizeof). And this is all nice and have its place too but I'm mostly talking about O(N*M) vs O(N). Anyway, filed: http://d.puremagic.com/issues/show_bug.cgi?id=10392On 6/16/13 6:16 PM, Dmitry Olshansky wrote:One of the "problems" is that find is array agnostic, so doesn't know how to squeeze out all the juice out arrays, such as: * No bounds check on accesses * No bounds check on slicing * vectorized comparisons I took the existing find(RA, BD) code, and modified it to operate on find(ARR, ARR). On average, I'm getting roughly 20%-25% performance improvements (-release -O -inline), although the result is of course highly dependent on the tested input. Goes to say that by addapting the existing algorithm to simply better exploit arrays, there is already room for good improvements.It have long bugged me that Phobos std.algorithm.find is slow. Far slower then strstr (esp on *nix where it competes against GLIBC[1]). The current state of the art in searching substring with O(1) space requirement is Two-Way algorithm: http://www-igm.univ-mlv.fr/~lecroq/string/node26.html which is linear in time. I could imagine it would be interesting to implement a generic version as well. Any takers? I'd gladly go for it myself but unfortunately way too busy on other fronts (I planed to do it for a couple of months already). [1] See a relevant bug report and discussion e.g on glibc http://sourceware.org/bugzilla/show_bug.cgi?id=5514Awesome idea! Could you please submit an enhancement request for this? I wonder what kind of ranges that could work on. AndreiGiven that string-to-same-width-string boils back down integral array search, the gains would also be had for strings. -------- I was also able to squeeze out similar performace boosts for find(R, E), with minimal code changes, exploiting better iteration semantics based on the type iterated (range, RA array, or narrow string). -------- I can start by improving find(R, E), because it is a small but easy and effective change. For find(R, R), things are a bit more dicey to properly integrate, so I don't want to do anything right now. But the point is that there is still room for substantial improvements without modifying the algorithm too much...-- Dmitry Olshansky
Jun 17 2013
On Monday, 17 June 2013 at 17:13:49 UTC, Dmitry Olshansky wrote:And this is all nice and have its place too but I'm mostly talking about O(N*M) vs O(N).Well... putting it that way always looks nice on paper, but for what values of N and M is it actually *faster* ? Also, are we talking about *actual* O(N), or average case O(N) ? Because the link has a worst case scenario of O(N*M). Actual O(N) comes with a space tradeof, which also comes with a very real allocation cost that doesn't appear in the complexity.
Jun 17 2013
17-Jun-2013 22:02, monarch_dodra пишет:On Monday, 17 June 2013 at 17:13:49 UTC, Dmitry Olshansky wrote:I'm sure I'm reading it right: - preprocessing phase in O(m) time and constant space complexity; - constant space complexity for the preprocessing phase; - searching phase in O(n) time; - performs 2n-m text character comparisons in the worst case. -- Dmitry OlshanskyAnd this is all nice and have its place too but I'm mostly talking about O(N*M) vs O(N).Well... putting it that way always looks nice on paper, but for what values of N and M is it actually *faster* ? Also, are we talking about *actual* O(N), or average case O(N) ? Because the link has a worst case scenario of O(N*M).
Jun 17 2013
http://volnitsky.com/project/str_search/ Reading this, two way algorithm doesn't seem the best one... (author says it's used by GLIBC_memmem) Have I missed the point? On Sunday, 16 June 2013 at 22:17:04 UTC, Dmitry Olshansky wrote:It have long bugged me that Phobos std.algorithm.find is slow. Far slower then strstr (esp on *nix where it competes against GLIBC[1]). The current state of the art in searching substring with O(1) space requirement is Two-Way algorithm: http://www-igm.univ-mlv.fr/~lecroq/string/node26.html which is linear in time. I could imagine it would be interesting to implement a generic version as well. Any takers? I'd gladly go for it myself but unfortunately way too busy on other fronts (I planed to do it for a couple of months already). [1] See a relevant bug report and discussion e.g on glibc http://sourceware.org/bugzilla/show_bug.cgi?id=5514
Jun 17 2013
On Monday, 17 June 2013 at 08:04:14 UTC, Andrea Fontana wrote:http://volnitsky.com/project/str_search/ Reading this, two way algorithm doesn't seem the best one... (author says it's used by GLIBC_memmem) Have I missed the point?That's a strange page, at a glance: 1. The algorithms were tested essentially on one real-life test case. 2. Author claims to use a structure similar to hash table, but does not include Rabin-Karp algorithm in comparison. 3. Reference implementation does not in fact allow online searching, but the algorithm is claimed to have the capacity. 4. Reference implementation has hardcoded constants for array sizes. 5. The comment about case-insensitive search seems inefficient at best: it suggests to hash every case combination (hello combinatorial explosion) instead of just converting everything to lowercase on the fly. (and 6. Reference implementation didn't work for me, but it may be actually my fault...) Given the points above, I won't make any decisions based on data from that page :) . Ivan Kazmenko.
Jun 17 2013
17-Jun-2013 12:04, Andrea Fontana пишет:http://volnitsky.com/project/str_search/ Reading this, two way algorithm doesn't seem the best one... (author says it's used by GLIBC_memmem) Have I missed the point?No idea as I do not see Two way search represented here at all. Otherwise all of the comments by Ivan apply.On Sunday, 16 June 2013 at 22:17:04 UTC, Dmitry Olshansky wrote:-- Dmitry OlshanskyIt have long bugged me that Phobos std.algorithm.find is slow. Far slower then strstr (esp on *nix where it competes against GLIBC[1]). The current state of the art in searching substring with O(1) space requirement is Two-Way algorithm: http://www-igm.univ-mlv.fr/~lecroq/string/node26.html which is linear in time. I could imagine it would be interesting to implement a generic version as well. Any takers? I'd gladly go for it myself but unfortunately way too busy on other fronts (I planed to do it for a couple of months already). [1] See a relevant bug report and discussion e.g on glibc http://sourceware.org/bugzilla/show_bug.cgi?id=5514
Jun 17 2013