digitalmars.D - Improving std.algorithm.find

Andrei Alexandrescu (28/28) Jul 17 2010 I was thinking of improving std.find. We have this bug report:

Jonathan M Davis (64/102) Jul 17 2010 I think that the problem with find() is not so much find() but it's docu...

sybrandy (6/9) Jul 19 2010 I fully agree. I had no idea from the documentation what something like...

Jonathan M Davis (16/27) Jul 19 2010 In most cases, the return types are just ranges, but because they're con...

sybrandy (6/20) Jul 20 2010 Agreed. I'm also a big fan of examples as it's very easy to look at an

bearophile (5/9) Jul 20 2010 I am not sure, but I think find() is not what I need, even if you add lo...

Andrei Alexandrescu (6/16) Jul 19 2010 I agree as well. FWIW [1, 2, 3, 4][] is a workaround for a (now gone)

sybrandy (3/20) Jul 20 2010 Muchas gracias from your loyal minions.

bearophile (15/16) Jul 18 2010 As with most software it's better to first ask yourself what your softwa...
Philippe Sigaud (72/72) Jul 18 2010 On Sun, Jul 18, 2010 at 00:55, Andrei Alexandrescu wrote:

Andrei Alexandrescu (21/66) Jul 18 2010 I agree that a findAll function yielding a range is nice, however it's

Philippe Sigaud (36/94) Jul 19 2010 Yes, I agree that's the common case.

Andrei Alexandrescu (14/42) Jul 19 2010 I think this is a misunderstanding. All of std.algorithm works with

Philippe Sigaud (34/66) Jul 19 2010 Yes, indeed. Unless I define some cousin to range, a recursive range, to

Dmitry Olshansky (11/48) Jul 19 2010 Try this workaround, replacing delegate with nested function, works for ...

Andrei Alexandrescu (4/65) Jul 19 2010 Yes, I'd discussed this with Walter on Friday and he said the semantics
Philippe Sigaud (21/36) Jul 19 2010 Hmm, using a nested function was the first thing I tried, and it didn't ...

Philippe Sigaud (49/49) Jul 18 2010 On Sun, Jul 18, 2010 at 06:19, Jonathan M Davis wrote:

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

I was thinking of improving std.find. We have this bug report:

http://d.puremagic.com/issues/show_bug.cgi?id=3923

which is pretty vague but does have a point.

For starters, it should be told that std.algorithm.find _does_ a lot, 
which at least partially justifies its complexity. One thing that I 
haven't seen other find()s doing is to be able to search in one pass a 
given range for multiple other ranges, like this:

int[] a = [ 1, 4, 2, 3 ];
assert(find(a, 5, 4, 3) == tuple([ 4, 2, 3 ], 2));

When passed more than two arguments, find returns a tuple continuing the 
searched ranged positioned on the element found and the 1-based index of 
the parameter that was found. The trick is that find() makes exactly one 
pass through the searched range, which is often more efficient than 
searching the same range for each element in turn. Also the one-pass 
approach works with input ranges so it doesn't put pressure on the range 
capabilities.

However the simplified find() looks like, I'd like to keep this feature 

complicates find's signature and implementation.

Another aspect I'd like to discuss is use of Boyer-Moore and related 
fast finding techniques. Currently the use of B-M is explicit and a bit 
awkward. I was thinking instead to automatically detect its 
appropriatedness and use it transparently. Bearophile posted at a link 
to such a blend string search routine 
(http://effbot.org/zone/stringlib.htm) that I think can be generalized 
quite easily.

Any ideas - please chime in.


Andrei

Jul 17 2010

Jonathan M Davis <jmdavisprog gmail.com> writes:

On Saturday 17 July 2010 15:55:26 Andrei Alexandrescu wrote:
 I was thinking of improving std.find. We have this bug report:
 
 http://d.puremagic.com/issues/show_bug.cgi?id=3923
 
 which is pretty vague but does have a point.
 
 For starters, it should be told that std.algorithm.find _does_ a lot,
 which at least partially justifies its complexity. One thing that I
 haven't seen other find()s doing is to be able to search in one pass a
 given range for multiple other ranges, like this:
 
 int[] a = [ 1, 4, 2, 3 ];
 assert(find(a, 5, 4, 3) == tuple([ 4, 2, 3 ], 2));
 
 When passed more than two arguments, find returns a tuple continuing the
 searched ranged positioned on the element found and the 1-based index of
 the parameter that was found. The trick is that find() makes exactly one
 pass through the searched range, which is often more efficient than
 searching the same range for each element in turn. Also the one-pass
 approach works with input ranges so it doesn't put pressure on the range
 capabilities.
 
 However the simplified find() looks like, I'd like to keep this feature

 complicates find's signature and implementation.
 
 Another aspect I'd like to discuss is use of Boyer-Moore and related
 fast finding techniques. Currently the use of B-M is explicit and a bit
 awkward. I was thinking instead to automatically detect its
 appropriatedness and use it transparently. Bearophile posted at a link
 to such a blend string search routine
 (http://effbot.org/zone/stringlib.htm) that I think can be generalized
 quite easily.
 
 Any ideas - please chime in.
 
 
 Andrei

I think that the problem with find() is not so much find() but it's
documentation. 
In all honesty, anything with a return type like FindResult!(Range, Ranges) is 
going to scare people off. The find() that takes only a predicate and the range
to 
search in is much better:

Range find(alias pred, Range)(Range haystack);

Now, unfortunately, I'm not sure how to improve the main version of find other 
than the return type.

FindResult!(Range,Ranges) find(alias pred = "a == b", Range, Ranges...)(Range 
haystack, Ranges needles);

becomes

Range find(alias pred = "a == b", Range, Ranges...)(Range haystack, Ranges 
needles);

However, that (alias pred = "a == b", Range, Ranges...) is still pretty ugly. 
Unfortunately, I'm not sure how you would eliminate that. The only part that
the 
caller would normally care about is alias pred = "a == b", but the rest is part 
of the template definition as well, and I don't think that we have a way to
clean 
that up at the moment.

The other problem is that you're getting two totally different return types 
depending on whether you give it one needle or many (which may make Range not 
work as a return type anyway since what you get in the case of multiple needs 
isn't exactly a Range). So, I'd suggest either making it so that they return
the 
same thing and you have to do a comparison to figure out which needle was found 
(in many cases, all you care about is that one of them was found anyway, but it 
would be less efficient in cases where you do care), or you should split find() 
into two functions - one which takes a single needle and one which takes many. 
That should make things a bit clearer. It doesn't really change the 
functionality, but the docs would be less cluttered. Another option would be to 
make find only return a range for multiple needles but have a separate function 
which returns a tuple if you really care, thereby simplifying find() a bit. In 
such a case, it would probably still be best to separate find() itself into two 
functions, but it wouldn't be as critical.

As for integrating BoyerMoore into find(), if you can do it cleanly and 
seemlessly, I'm all for it. It would be good if there were a note about it in 
the documentation so that those using it wouldn't feel the need to reimplement 
it themselves, but I'm all for having versions of find() be specialized for the 
type that they're searching on if it can be done seemlessly. I'd even suggest 
doing the same for the various container types which could benefit from a 
specialized find - though perhaps those would be better served by having a 
special version of find() as part of their definition which
std.algorithm.find() 
then calls.

If anything, I've been more interested in canFind() and until() being made to 
match up with find(). I'd like to be able to give them both pretty much the
same 
arguments and then get the bool from canFind() and the range that find() would 
have walked over in the case of until(). A function which gave you both the 
range that until() would have given you as well as the one which find() would 
have given you would be nice as well. I previously opened a bug with thoughts 
along those lines: http://d.puremagic.com/issues/show_bug.cgi?id=3888

In any case, I think that find() itself is more or less fine from a usage 
standpoint. It's the docs that need help. And splitting up the function and 
trying to simplify the signature in the docs (whether the signature itself is 
really simplified or not) is what really needs to happen. The other thing would 
be to add a lot more examples. That would be a big boon for a lot of 
std.algorithm functions. They're not necessarily hard to use, and the examples 
show it much more clearly than the signatures often do. The problem is that the 
signatures are so ugly. And the fact that they have to be (in terms of the 
actual function if not the documentation) is pretty much required given what 
they do, so I'm not sure how much that can really be fixed. Maybe some
extensions 
to DDoc need to be done for it happen, but it would be good if we could find a 
way to give simplified function signatures for a lot of these functions
(possibly 
with a way in the docs to see the full function signature) so that users of the 
docs can see what they need to to use the functions without having to wade 
through the nasty (albeit necessarily so) signatures.

- Jonathan M Davis

Jul 17 2010

sybrandy <sybrandy gmail.com> writes:

 I think that the problem with find() is not so much find() but it's
documentation.
 In all honesty, anything with a return type like FindResult!(Range, Ranges) is
 going to scare people off.

I fully agree.  I had no idea from the documentation what something like 
this was: [1,2,3,4][].  The array documentation doesn't have any 
details, so for the most part I've stayed away from std.algorithm 
because I don't understand the return types very well.  Better 
documentation would help a lot.

Casey

Jul 19 2010

Jonathan M Davis <jmdavisprog gmail.com> writes:

On Monday, July 19, 2010 16:36:57 sybrandy wrote:
 I think that the problem with find() is not so much find() but it's
 documentation. In all honesty, anything with a return type like
 FindResult!(Range, Ranges) is going to scare people off.

 
 I fully agree.  I had no idea from the documentation what something like
 this was: [1,2,3,4][].  The array documentation doesn't have any
 details, so for the most part I've stayed away from std.algorithm
 because I don't understand the return types very well.  Better
 documentation would help a lot.
 
 Casey

In most cases, the return types are just ranges, but because they're
constructed 
by the algorithm, the algorithm creates its own range type. So, generally, all 
you need to know is that a range is being returned. But even then, you run into 
some issues where you need to do stuff like run the returned range through 
array() to get something useable - depending on what you're doing with it. 
Regardless of the return type though, std.algorithm is definitely a prime
example 
of how auto is your friend. That way you often don't have to care all that much 
about what the return type really looks like.

However, the documentation on it is obviously ugly. It really should give the 
programmer exactly what they need to use the function and nothing more. As it 
stands, you have to really want to use it to figure it out. Personally, I think 
that it's well worth it - std.algorithm is fantastic - but the documentation is 
nowhere near as straightforward as it should be (even if using the functions is 
actually pretty straightforward in most cases).

- Jonathan M Davis

Jul 19 2010

sybrandy <sybrandy gmail.com> writes:

 In most cases, the return types are just ranges, but because they're
constructed
 by the algorithm, the algorithm creates its own range type. So, generally, all
 you need to know is that a range is being returned. But even then, you run into
 some issues where you need to do stuff like run the returned range through
 array() to get something useable - depending on what you're doing with it.
 Regardless of the return type though, std.algorithm is definitely a prime
example
 of how auto is your friend. That way you often don't have to care all that much
 about what the return type really looks like.

That paragraph would have been immensely useful to me a few months ago.
 However, the documentation on it is obviously ugly. It really should give the
 programmer exactly what they need to use the function and nothing more. As it
 stands, you have to really want to use it to figure it out. Personally, I think
 that it's well worth it - std.algorithm is fantastic - but the documentation is
 nowhere near as straightforward as it should be (even if using the functions is
 actually pretty straightforward in most cases).

Agreed.  I'm also a big fan of examples as it's very easy to look at an 
example and get something working.  Then, once you see it work and get 
somewhat of an understanding, you can dig deeper into the details to 
learn more.

Casey

Jul 20 2010

bearophile <bearophileHUGS lycos.com> writes:

sybrandy:
 Agreed.  I'm also a big fan of examples as it's very easy to look at an 
 example and get something working.  Then, once you see it work and get 
 somewhat of an understanding, you can dig deeper into the details to 
 learn more.

I am not sure, but I think find() is not what I need, even if you add lot of
usage examples to it.
(It seems my list of find-like micropatterns got ignored, I don't know why,
maybe they don't look enough like things done by STL C++.)

Bye,
bearophile

Jul 20 2010

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 07/19/2010 06:36 PM, sybrandy wrote:
 I think that the problem with find() is not so much find() but it's
 documentation.
 In all honesty, anything with a return type like FindResult!(Range,
 Ranges) is
 going to scare people off.

 I fully agree. I had no idea from the documentation what something like
 this was: [1,2,3,4][]. The array documentation doesn't have any details,
 so for the most part I've stayed away from std.algorithm because I don't
 understand the return types very well. Better documentation would help a
 lot.

I agree as well. FWIW [1, 2, 3, 4][] is a workaround for a (now gone) 
bug in the compiler. That bug made [1, 2, 3, 4] a fixed-size array by 
default.

I'll work on improving the documentation.


Andrei

Jul 19 2010

sybrandy <sybrandy gmail.com> writes:

On 07/19/2010 10:18 PM, Andrei Alexandrescu wrote:
 On 07/19/2010 06:36 PM, sybrandy wrote:
 I think that the problem with find() is not so much find() but it's
 documentation.
 In all honesty, anything with a return type like FindResult!(Range,
 Ranges) is
 going to scare people off.

 I fully agree. I had no idea from the documentation what something like
 this was: [1,2,3,4][]. The array documentation doesn't have any details,
 so for the most part I've stayed away from std.algorithm because I don't
 understand the return types very well. Better documentation would help a
 lot.

 I agree as well. FWIW [1, 2, 3, 4][] is a workaround for a (now gone)
 bug in the compiler. That bug made [1, 2, 3, 4] a fixed-size array by
 default.

 I'll work on improving the documentation.


 Andrei

Muchas gracias from your loyal minions.

Casey

Jul 20 2010

bearophile <bearophileHUGS lycos.com> writes:

Andrei:

Any ideas - please chime in.<

As with most software it's better to first ask yourself what your software will
mostly be used for, and then to write it. Otherwise you are wasting your time
and the user time.

The most common usages (I call them micropatterns) for find-like functions
(every one of the following seven operations takes an optional transform
function like schwartzSort):


1) boolean predicate, is the item/iterable x inside the iterable foo? ("in"
operator, or if not possible a "isIn" function).
2) int function, tell me how many items/arrays x are inside the iterable foo
("count" function).
3) int function, tell me the index of the first item/iterable x in the iterable
foo ("index" function).
4 extra) find lazily all the indexes of an item/iterable x inside the iterable
foo ("indexes" struct or class).

5) find the first max/min item in a iterable ("max", "min" functions, they have
overloads for 2 and 3 items too).
6) find the index of the max/min item in a iterable ("maxPos", "minPos"
functions).
7) find lazily all the max/min items in a iterable ("maxs", "mins" structs or
classes).


An iterable is a Range (like a string, array, AA, etc) or a struct/class that
defines just opApply.

In my dlibs1 in all those functions if the sequence is an associative arrays,
the search is done on the keys. D2 AAs have .byKey() that reduces this need a
bit.

In the first 4 micropatterns it's good to have a specialization for strings, to
increase performance with some string-specific tricks .

Bye,
bearophile

Jul 18 2010

Philippe Sigaud <philippe.sigaud gmail.com> writes:

On Sun, Jul 18, 2010 at 00:55, Andrei Alexandrescu  wrote:
*****
    I was thinking of improving std.find.

    For starters, it should be told that std.algorithm.find _does_ a lot,
which at least partially justifies its complexity. One thing that I haven't
seen other find()s doing is to be able to search in one pass a given range
for multiple other ranges, like this:

    int[] a = [ 1, 4, 2, 3 ];
    assert(find(a, 5, 4, 3) == tuple([ 4, 2, 3 ], 2));

    When passed more than two arguments, find returns a tuple continuing the
searched ranged positioned on the element found and the 1-based index of the
parameter that was found. The trick is that find() makes exactly one pass
through the searched range, which is often more efficient than searching the
same range for each element in turn. Also the one-pass approach works with
input ranges so it doesn't put pressure on the range capabilities.
*****

You could consider than find() is a multiple-returns function: it may return
0, 1 or many values (the successive finds). Which can, obviously, be easily
represented by a range. I see find() as bit like filter(): its return type
is a range determined by the predicate and whose elements are the rest of
the input range.
so:
find([ 1, 4, 2, 3, 4, 0 ], 4) returns a range whose elements are [4,2,3,4,0]
and [4,0].

If no needle is present in haystack, then the result range is empty. It's
just it's an empty range of ranges, not an empty version of the original
range.

A complement information can be how many steps were taken (that is, the
index of the found elements):  ([4,2,3,4,0], 1), ([4,0], 4). A policy or
another function could take care of that: findIndexed(haystack, needle) or
find!(withIndex)(haystack, needle).

The advantage of returning a range is the possibility to iterate on the
results. I, for one, like the idea of having find() in the same family that
workhorses like map() or filter()... The drawback is the necessity to test
the result with empty to know if you had results, but that's not different
from the current version.

So, you could have a findFirst() that does not return a range, but one
element, like the current find() does. Or call what I suggest findAll().

Now, for many needles... Currently, you pass the predicate "a==b" to find()
and use it on all needles. A possible generalization is to have a isOneOf()
predicate, called with isOneOf(a, b0, b1, ... bn) where a is the current
element being tested and the b's are the needles passed to find(). This
generic calling of pred with (a, needles) could be done for any predicate,
like for example:

find!(isBetween)(haystack, 0, 10)

where isBetween(a, left, right) returns true if (left <= a && a <= right).

The default would be "a==b" for one needle and isOneOf(a, b...) if
needles.length > 1. So you still get the current behavior, with a bit of
possible customization thrown in, like the isBetween example. I don't think
that's possible today?

The limit of this is you cannot easily get the index of the found needle.
btw, could you remind us why you used a 1-based index? My opinion on this is
that if you want to find any needle, then you mostly care for them being
present, and much less for wich one is the first. Also, the found element is
accessible by taking the front of the retuned range. Why provide it to the
caller?

Another problem I see is to get the nice current behavior of having a needle
being a range like haystack:

find([0,1,2,3,4,3,2,1], [4,3]) => [4,3,2,1]

And that's I particularly like.


*****
    Another aspect I'd like to discuss is use of Boyer-Moore and related
fast finding techniques. Currently the use of B-M is explicit and a bit
awkward. I was thinking instead to automatically detect its appropriatedness
and use it transparently. Bearophile posted at a link to such a blend string
search routine (http://effbot.org/zone/stringlib.htm) that I think can be
generalized quite easily.
*****

Indeed, reading this, it seems like it could be generalized easily. You need
a random-access range with a defined length. So the usage of such an algo
could indeed be done transparently. Hmm...


Philippe

Jul 18 2010

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 07/18/2010 04:30 PM, Philippe Sigaud wrote:
 On Sun, Jul 18, 2010 at 00:55, Andrei Alexandrescu  wrote:
 You could consider than find() is a multiple-returns function: it may
 return 0, 1 or many values (the successive finds). Which can, obviously,
 be easily represented by a range. I see find() as bit like filter(): its
 return type is a range determined by the predicate and whose elements
 are the rest of the input range.
 so:
 find([ 1, 4, 2, 3, 4, 0 ], 4) returns a range whose elements are
 [4,2,3,4,0] and [4,0].

 If no needle is present in haystack, then the result range is empty.
 It's just it's an empty range of ranges, not an empty version of the
 original range.

 A complement information can be how many steps were taken (that is, the
 index of the found elements):  ([4,2,3,4,0], 1), ([4,0], 4). A policy or
 another function could take care of that: findIndexed(haystack, needle)
 or find!(withIndex)(haystack, needle).

I agree that a findAll function yielding a range is nice, however it's 
not the simplest interface. I think there should be a function that just 
finds something somewhere.

 Now, for many needles... Currently, you pass the predicate "a==b" to
 find() and use it on all needles. A possible generalization is to have a
 isOneOf() predicate, called with isOneOf(a, b0, b1, ... bn) where a is
 the current element being tested and the b's are the needles passed to
 find(). This generic calling of pred with (a, needles) could be done for
 any predicate, like for example:

 find!(isBetween)(haystack, 0, 10)

Aren't such scenarios better served by putting the limits inside the 
predicate? Otherwise the list of what can be done could go on forever.

 The limit of this is you cannot easily get the index of the found
 needle. btw, could you remind us why you used a 1-based index? My
 opinion on this is that if you want to find any needle, then you mostly
 care for them being present, and much less for wich one is the first.

I used one-based index to make tests easier - as you said, most often 
you care for the presence so zero/nonzero is easiest to tell. Then I 
thought it wouldn't hurt to provide the index since it's not any extra work.

 Also, the found element is accessible by taking the front of the retuned
 range. Why provide it to the caller?

Because that doesn't generalize well to searching ranges (as oposed to 
individual elements).

 Another problem I see is to get the nice current behavior of having a
 needle being a range like haystack:

 find([0,1,2,3,4,3,2,1], [4,3]) => [4,3,2,1]

 And that's I particularly like.

Not sure I understand whether that's good or bad, but I also want to 
provide the likes of findSkip() (it's been discussed here under a 
different name, findConsume) that positions the searched range after the 
found range:

findSkip([0,1,2,3,4,3,2,1], [4,3]) => [2,1]

It's very useful in practice.

 *****
      Another aspect I'd like to discuss is use of Boyer-Moore and
 related fast finding techniques. Currently the use of B-M is explicit
 and a bit awkward. I was thinking instead to automatically detect its
 appropriatedness and use it transparently. Bearophile posted at a link
 to such a blend string search routine
 (http://effbot.org/zone/stringlib.htm) that I think can be generalized
 quite easily.
 *****

 Indeed, reading this, it seems like it could be generalized easily. You
 need a random-access range with a defined length. So the usage of such
 an algo could indeed be done transparently. Hmm...

The current B-M implementation already works with any random-access 
range with length, but it's not integrated transparently with the 
regular find. I wonder what the threshold conditions might look like.

Andrei

Jul 18 2010

Philippe Sigaud <philippe.sigaud gmail.com> writes:

On Mon, Jul 19, 2010 at 03:19, Andrei Alexandrescu <
SeeWebsiteForEmail erdani.org> wrote:

 I agree that a findAll function yielding a range is nice, however it's not
 the simplest interface. I think there should be a function that just finds
 something somewhere.

Yes, I agree that's the common case.
What I personally would like is a way to find an element in an ordered
container (binary tree...)


  Now, for many needles... Currently, you pass the predicate "a==b" to
 find() and use it on all needles. A possible generalization is to have a
 isOneOf() predicate, called with isOneOf(a, b0, b1, ... bn) where a is
 the current element being tested and the b's are the needles passed to
 find(). This generic calling of pred with (a, needles) could be done for
 any predicate, like for example:

 find!(isBetween)(haystack, 0, 10)

 Aren't such scenarios better served by putting the limits inside the
 predicate? Otherwise the list of what can be done could go on forever.


The problem is that, as the predicate is passed at compile-time, the limits
must be CT-defined too. I'd like the limits to be defined at runtime. I run
into this regularly, while using predicates in std.algorithms.

A not-so-uncommon scenario for me is having one range delivering limits (or
zones to explore, or any other runtime arguments), and finding a value
inside those limits in another range. Ideally, I'd like to use find() on a
spatial structure like an R-tree. ( http://en.wikipedia.org/wiki/R-tree ).

But I admit it's not that common.



  The limit of this is you cannot easily get the index of the found
 needle. btw, could you remind us why you used a 1-based index? My
 opinion on this is that if you want to find any needle, then you mostly
 care for them being present, and much less for wich one is the first.

 I used one-based index to make tests easier - as you said, most often you
 care for the presence so zero/nonzero is easiest to tell. Then I thought it
 wouldn't hurt to provide the index since it's not any extra work.


Ah, I thought the scenario was: 1) test for the returned range (as the
tuple._0 element) being empty or not. 2) if not empty, something was found,
get the index. I forgot getting 0 as index would mean 'no element found'.
If I understood you correctly, then I'd prefer a -1 index indicating
failure, and a 0-based indexing on the needles if on is found. But that's a
cosmetic issue.




  Also, the found element is accessible by taking the front of the retuned
 range. Why provide it to the caller?

 Because that doesn't generalize well to searching ranges (as oposed to
 individual elements).


Good point.



  Another problem I see is to get the nice current behavior of having a
 needle being a range like haystack:

 find([0,1,2,3,4,3,2,1], [4,3]) => [4,3,2,1]

 And that's I particularly like.

 Not sure I understand whether that's good or bad,


That's good. I mean, the current situation is good and that's a
functionality I like.



 but I also want to provide the likes of findSkip() (it's been discussed
 here under a different name, findConsume) that positions the searched range
 after the found range:

 findSkip([0,1,2,3,4,3,2,1], [4,3]) => [2,1]

 It's very useful in practice.

I see that. And and empty range if it found nothing?
Another useful one is split() to cut the range in two or three parts :
before the match, the match itself (which can be one of the needles, so we
do not know it in advance) and the part after that. Return that triple as a
tuple. Hey, that way you can even implement replace: return chain(firstpart,
replacedmatch, tailpart).

Anything that can mimic regex functions on generic ranges is good, IMO. I
regularly find myself wanting to do some regex-like matching. We do not need
real pattern matching but even with simplified predicates, we can do many
things.



  *****
     Another aspect I'd like to discuss is use of Boyer-Moore and
 related fast finding techniques. Currently the use of B-M is explicit
 and a bit awkward. I was thinking instead to automatically detect its
 appropriatedness and use it transparently. Bearophile posted at a link
 to such a blend string search routine
 (http://effbot.org/zone/stringlib.htm) that I think can be generalized
 quite easily.
 *****

 Indeed, reading this, it seems like it could be generalized easily. You
 need a random-access range with a defined length. So the usage of such
 an algo could indeed be done transparently. Hmm...

 The current B-M implementation already works with any random-access range
 with length, but it's not integrated transparently with the regular find. I
 wonder what the threshold conditions might look like.

I'm sorry I didn't look at your implementation, as I don't know Boyer-Moore
that much. Does it need to know the alphabet size or do you use another
variant?

Philippe

Jul 19 2010

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 07/19/2010 02:12 PM, Philippe Sigaud wrote:
 On Mon, Jul 19, 2010 at 03:19, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org <mailto:SeeWebsiteForEmail erdani.org>>
 wrote:


     I agree that a findAll function yielding a range is nice, however
     it's not the simplest interface. I think there should be a function
     that just finds something somewhere.


 Yes, I agree that's the common case.
 What I personally would like is a way to find an element in an ordered
 container (binary tree...)

Wouldn't that be a member of the binary tree?

     Aren't such scenarios better served by putting the limits inside the
     predicate? Otherwise the list of what can be done could go on forever.


 The problem is that, as the predicate is passed at compile-time, the
 limits must be CT-defined too. I'd like the limits to be defined at
 runtime. I run into this regularly, while using predicates in
 std.algorithms.

I think this is a misunderstanding. All of std.algorithm works with 
delegates:

     int[] a = [ 1, 4, 2, 3 ];
     int b = 2;
     assert(find!((x) { return x > b; })(a) == [4, 2, 3]);

 A not-so-uncommon scenario for me is having one range delivering limits
 (or zones to explore, or any other runtime arguments), and finding a
 value inside those limits in another range. Ideally, I'd like to use
 find() on a spatial structure like an R-tree. (
 http://en.wikipedia.org/wiki/R-tree ).

 But I admit it's not that common.

Unless you manage to define appropriate and generally applicable 
iterators, I think that would overgeneralize find().

 Anything that can mimic regex functions on generic ranges is good, IMO.
 I regularly find myself wanting to do some regex-like matching. We do
 not need real pattern matching but even with simplified predicates, we
 can do many things.

I agree we're in desperate need for a good range-based, 
character-divorced, statically-computed regex engine.

 I'm sorry I didn't look at your implementation, as I don't know
 Boyer-Moore that much. Does it need to know the alphabet size or do you
 use another variant?

My explanation of B-M would do little more than pasting information from 
various places on the Net.


Andrei

Jul 19 2010

Philippe Sigaud <philippe.sigaud gmail.com> writes:

On Mon, Jul 19, 2010 at 21:23, Andrei Alexandrescu <
SeeWebsiteForEmail erdani.org> wrote:

 What I personally would like is a way to find an element in an ordered
 container (binary tree...)

 Wouldn't that be a member of the binary tree?

Yes, indeed. Unless I define some cousin to range, a recursive range, to
iterate on any recursive container, like trees or graphs and then devise
algorithms on them. I played with this idea a few weeks ago but went to do
other things. Anyway, carry on.
Btw, I still plan to update some simple generic graphs and trees structs to
obey TotalContainer interface when it makes sense, and update my graph
algorithms accordingly. If I ever get somewhere, I'll post something.



     Aren't such scenarios better served by putting the limits inside the
    predicate? Otherwise the list of what can be done could go on forever.


 The problem is that, as the predicate is passed at compile-time, the
 limits must be CT-defined too. I'd like the limits to be defined at
 runtime. I run into this regularly, while using predicates in
 std.algorithms.

 I think this is a misunderstanding. All of std.algorithm works with
 delegates:


    int[] a = [ 1, 4, 2, 3 ];
    int b = 2;
    assert(find!((x) { return x > b; })(a) == [4, 2, 3]);

 Ah yes, but I regularly use algorithms structs as return values, like this:

auto nTimes(E, R)(E multiplier, R range)
{
    return map!((ElementType!R e) { return e*multiplier;})(range);
}

Error: delegate std.algorithm.__dgliteral2 cannot access frame of function
__dgliteral2|

I ran into this yesterday while trying to encode power series as (possibly
infinite) ranges, and getting problems for the operators overloading. Too
bad, I got to the point where I could calculate sin/cos/exp values quite
precisely and then could easily define derivation and such. There are
workaround, but they are not as handy.
I had the same problems some time ago, trying to define predicates on the
fly, though I cannot right now remember what it was.

I know it's not a problem in std.algorithm per se, but it's a limit on its
usefulness, as far as I'm concerned.



  Anything that can mimic regex functions on generic ranges is good, IMO.
 I regularly find myself wanting to do some regex-like matching. We do
 not need real pattern matching but even with simplified predicates, we
 can do many things.

 I agree we're in desperate need for a good range-based, character-divorced,
 statically-computed regex engine.


Oh, a full statically computed engine would be the Graal. But I did not go
that far. Being able to simply match, split and replace at runtime on any
range would be good also, without going all regex. Heck, I may even have
already coded something like this, I'll go and check.

What I'm trying to do right now as a back-burner task is a function
transforming a range in a set-like range (each element is present only
once), but testing for "did I already see that?" with a user-provided
predicate.


  I'm sorry I didn't look at your implementation, as I don't know
 Boyer-Moore that much. Does it need to know the alphabet size or do you
 use another variant?

 My explanation of B-M would do little more than pasting information from
 various places on the Net.

OK.


Philippe

Jul 19 2010

Dmitry Olshansky <dmitry.olsh gmail.com> writes:

On 20.07.2010 0:05, Philippe Sigaud wrote:
 On Mon, Jul 19, 2010 at 21:23, Andrei Alexandrescu 
 <SeeWebsiteForEmail erdani.org <mailto:SeeWebsiteForEmail erdani.org>> 
 wrote:


         What I personally would like is a way to find an element in an
         ordered
         container (binary tree...)


     Wouldn't that be a member of the binary tree?


 Yes, indeed. Unless I define some cousin to range, a recursive range, 
 to iterate on any recursive container, like trees or graphs and then 
 devise algorithms on them. I played with this idea a few weeks ago but 
 went to do other things. Anyway, carry on.
 Btw, I still plan to update some simple generic graphs and trees 
 structs to obey TotalContainer interface when it makes sense, and 
 update my graph algorithms accordingly. If I ever get somewhere, I'll 
 post something.


            Aren't such scenarios better served by putting the limits
         inside the
            predicate? Otherwise the list of what can be done could go
         on forever.


         The problem is that, as the predicate is passed at
         compile-time, the
         limits must be CT-defined too. I'd like the limits to be
         defined at
         runtime. I run into this regularly, while using predicates in
         std.algorithms.


     I think this is a misunderstanding. All of std.algorithm works
     with delegates:


        int[] a = [ 1, 4, 2, 3 ];
        int b = 2;
        assert(find!((x) { return x > b; })(a) == [4, 2, 3]);

 Ah yes, but I regularly use algorithms structs as return values, like 
 this:

 auto nTimes(E, R)(E multiplier, R range)
 {
     return map!((ElementType!R e) { return e*multiplier;})(range);
 }

Try this workaround, replacing delegate with nested function, works for me:
auto nTimes(E, R)(E multiplier, R range){
     ElementType!R fun(ElementType!R e) { return e*multiplier; }
     return map!(fun)(range);
}
For some reason, compiler never plays nice with map, and I believe there 
are some issues with current implementation (2.047 release ) that Andrei 
(hopefully) fixed in recent commit.

 Philippe


-- 
Dmitry Olshansky

Jul 19 2010

Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:

On 07/19/2010 05:36 PM, Dmitry Olshansky wrote:
 On 20.07.2010 0:05, Philippe Sigaud wrote:
 On Mon, Jul 19, 2010 at 21:23, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org <mailto:SeeWebsiteForEmail erdani.org>>
 wrote:


 What I personally would like is a way to find an element in an
 ordered
 container (binary tree...)


 Wouldn't that be a member of the binary tree?


 Yes, indeed. Unless I define some cousin to range, a recursive range,
 to iterate on any recursive container, like trees or graphs and then
 devise algorithms on them. I played with this idea a few weeks ago but
 went to do other things. Anyway, carry on.
 Btw, I still plan to update some simple generic graphs and trees
 structs to obey TotalContainer interface when it makes sense, and
 update my graph algorithms accordingly. If I ever get somewhere, I'll
 post something.


 Aren't such scenarios better served by putting the limits
 inside the
 predicate? Otherwise the list of what can be done could go
 on forever.


 The problem is that, as the predicate is passed at
 compile-time, the
 limits must be CT-defined too. I'd like the limits to be
 defined at
 runtime. I run into this regularly, while using predicates in
 std.algorithms.


 I think this is a misunderstanding. All of std.algorithm works
 with delegates:


 int[] a = [ 1, 4, 2, 3 ];
 int b = 2;
 assert(find!((x) { return x > b; })(a) == [4, 2, 3]);

 Ah yes, but I regularly use algorithms structs as return values, like
 this:

 auto nTimes(E, R)(E multiplier, R range)
 {
 return map!((ElementType!R e) { return e*multiplier;})(range);
 }

 Try this workaround, replacing delegate with nested function, works for me:
 auto nTimes(E, R)(E multiplier, R range){
 ElementType!R fun(ElementType!R e) { return e*multiplier; }
 return map!(fun)(range);
 }
 For some reason, compiler never plays nice with map, and I believe there
 are some issues with current implementation (2.047 release ) that Andrei
 (hopefully) fixed in recent commit.

Yes, I'd discussed this with Walter on Friday and he said the semantics 
should be identical in the two cases. So we're dealing with a compiler bug.


Andrei

Jul 19 2010

Philippe Sigaud <philippe.sigaud gmail.com> writes:

On Tue, Jul 20, 2010 at 00:36, Dmitry Olshansky <dmitry.olsh gmail.com>wrote:


 Ah yes, but I regularly use algorithms structs as return values, like this:
 auto nTimes(E, R)(E multiplier, R range)
 {
    return map!((ElementType!R e) { return e*multiplier;})(range);
 }


 Try this workaround, replacing delegate with nested function, works for me:

 auto nTimes(E, R)(E multiplier, R range){
    ElementType!R fun(ElementType!R e) { return e*multiplier; }
    return map!(fun)(range);
 }
 For some reason, compiler never plays nice with map, and I believe there
 are some issues with current implementation (2.047 release ) that Andrei
 (hopefully) fixed in recent commit.

Hmm, using a nested function was the first thing I tried, and it didn't work
at the time. Good to know it's a bug. right now, it does not work, though:


void main()
{
    auto arr = [0,1,2,3,4];
    writeln(array(nTimes(3, arr))); // prints 0, then four big numbers: it's
stomping on another part of memory?
}

Strangely, even using 0 as a multiplier does not work. Using .save() in
nTimes neither.

You see, these kinds of issues all have solutions (in that particular case,
the workaround is a bit complicated but works in practice), but they are
cumbersome and limit my usage of std.algorithms.

FWIW, the workaround I used there is to take the multiplier, call repeat on
it to make it a range (3,3,3,3 ...), zip the two ranges together and call
map with a function like this:  times(A,B)(Tuple!(A,B) t) { return
t._0*t._1;}
so, you get:  (0,3), (1,3), (2,3), ... and then 0,3,6, etc.
Extension of this idea saved me many times.


Philippe

Jul 19 2010

Philippe Sigaud <philippe.sigaud gmail.com> writes:

On Sun, Jul 18, 2010 at 06:19, Jonathan M Davis wrote:
*****
I think that the problem with find() is not so much find() but it's
documentation.
In all honesty, anything with a return type like FindResult!(Range, Ranges)
is
going to scare people off.
*****

What should be done (in a perfect world) is to have a simplified signature
that can become the complete signature on a click

auto find!()(haystack, needles)  =>  FindResult!(etc.  ...
the empty !() being a cue that it's a template function and that there is
more to it that meets the eye.

(edit: crap, you said it afterwards. I should not reply hours after first
reading a message)

*****
If anything, I've been more interested in canFind() and until() being made
to
match up with find(). I'd like to be able to give them both pretty much the
same
arguments and then get the bool from canFind() and the range that find()
would
have walked over in the case of until(). A function which gave you both the
range that until() would have given you as well as the one which find()
would
have given you would be nice as well. I previously opened a bug with
thoughts
along those lines: http://d.puremagic.com/issues/show_bug.cgi?id=3888
*****

Ah, these are called takeWhile, dropWhile and span, in Haskell and other
sequence-heavy languages.
Or splitBy, cutAt, etc. Quite handy to have, I agree std.algorithm should
have them. But then, it's huge module already.


*****
In any case, I think that find() itself is more or less fine from a usage
standpoint. It's the docs that need help. The other thing would
be to add a lot more examples. That would be a big boon for a lot of
std.algorithm functions. They're not necessarily hard to use, and the
examples
show it much more clearly than the signatures often do.
*****

I sometimes dream of a complement to Phobos that would demonstrate its use,
either by providing lots of examples for each function or by taking a module
and tackling a task with it. Obviously, as both a complement to Phobos and a
way to demo-nstrate D, it should be called Deimos, the second moon of Mars,
Phobos' sibling.

Now, the Deimos Project, that's some cool name, if I may say so.

But I guess the wiki already does this.

Philippe

Jul 18 2010

D Programming

C/C++ Programming

Other

digitalmars.D - Improving std.algorithm.find