digitalmars.D - Improving std.algorithm.find
- Andrei Alexandrescu (28/28) Jul 17 2010 I was thinking of improving std.find. We have this bug report:
- Jonathan M Davis (64/102) Jul 17 2010 I think that the problem with find() is not so much find() but it's docu...
- sybrandy (6/9) Jul 19 2010 I fully agree. I had no idea from the documentation what something like...
- Jonathan M Davis (16/27) Jul 19 2010 In most cases, the return types are just ranges, but because they're con...
- sybrandy (6/20) Jul 20 2010 Agreed. I'm also a big fan of examples as it's very easy to look at an
- bearophile (5/9) Jul 20 2010 I am not sure, but I think find() is not what I need, even if you add lo...
- Andrei Alexandrescu (6/16) Jul 19 2010 I agree as well. FWIW [1, 2, 3, 4][] is a workaround for a (now gone)
- sybrandy (3/20) Jul 20 2010 Muchas gracias from your loyal minions.
- bearophile (15/16) Jul 18 2010 As with most software it's better to first ask yourself what your softwa...
- Philippe Sigaud (72/72) Jul 18 2010 On Sun, Jul 18, 2010 at 00:55, Andrei Alexandrescu wrote:
- Andrei Alexandrescu (21/66) Jul 18 2010 I agree that a findAll function yielding a range is nice, however it's
- Philippe Sigaud (36/94) Jul 19 2010 Yes, I agree that's the common case.
- Andrei Alexandrescu (14/42) Jul 19 2010 I think this is a misunderstanding. All of std.algorithm works with
- Philippe Sigaud (34/66) Jul 19 2010 Yes, indeed. Unless I define some cousin to range, a recursive range, to
- Dmitry Olshansky (11/48) Jul 19 2010 Try this workaround, replacing delegate with nested function, works for ...
- Andrei Alexandrescu (4/65) Jul 19 2010 Yes, I'd discussed this with Walter on Friday and he said the semantics
- Philippe Sigaud (21/36) Jul 19 2010 Hmm, using a nested function was the first thing I tried, and it didn't ...
- Philippe Sigaud (49/49) Jul 18 2010 On Sun, Jul 18, 2010 at 06:19, Jonathan M Davis wrote:
I was thinking of improving std.find. We have this bug report: http://d.puremagic.com/issues/show_bug.cgi?id=3923 which is pretty vague but does have a point. For starters, it should be told that std.algorithm.find _does_ a lot, which at least partially justifies its complexity. One thing that I haven't seen other find()s doing is to be able to search in one pass a given range for multiple other ranges, like this: int[] a = [ 1, 4, 2, 3 ]; assert(find(a, 5, 4, 3) == tuple([ 4, 2, 3 ], 2)); When passed more than two arguments, find returns a tuple continuing the searched ranged positioned on the element found and the 1-based index of the parameter that was found. The trick is that find() makes exactly one pass through the searched range, which is often more efficient than searching the same range for each element in turn. Also the one-pass approach works with input ranges so it doesn't put pressure on the range capabilities. However the simplified find() looks like, I'd like to keep this feature complicates find's signature and implementation. Another aspect I'd like to discuss is use of Boyer-Moore and related fast finding techniques. Currently the use of B-M is explicit and a bit awkward. I was thinking instead to automatically detect its appropriatedness and use it transparently. Bearophile posted at a link to such a blend string search routine (http://effbot.org/zone/stringlib.htm) that I think can be generalized quite easily. Any ideas - please chime in. Andrei
Jul 17 2010
On Saturday 17 July 2010 15:55:26 Andrei Alexandrescu wrote:I was thinking of improving std.find. We have this bug report: http://d.puremagic.com/issues/show_bug.cgi?id=3923 which is pretty vague but does have a point. For starters, it should be told that std.algorithm.find _does_ a lot, which at least partially justifies its complexity. One thing that I haven't seen other find()s doing is to be able to search in one pass a given range for multiple other ranges, like this: int[] a = [ 1, 4, 2, 3 ]; assert(find(a, 5, 4, 3) == tuple([ 4, 2, 3 ], 2)); When passed more than two arguments, find returns a tuple continuing the searched ranged positioned on the element found and the 1-based index of the parameter that was found. The trick is that find() makes exactly one pass through the searched range, which is often more efficient than searching the same range for each element in turn. Also the one-pass approach works with input ranges so it doesn't put pressure on the range capabilities. However the simplified find() looks like, I'd like to keep this feature complicates find's signature and implementation. Another aspect I'd like to discuss is use of Boyer-Moore and related fast finding techniques. Currently the use of B-M is explicit and a bit awkward. I was thinking instead to automatically detect its appropriatedness and use it transparently. Bearophile posted at a link to such a blend string search routine (http://effbot.org/zone/stringlib.htm) that I think can be generalized quite easily. Any ideas - please chime in. AndreiI think that the problem with find() is not so much find() but it's documentation. In all honesty, anything with a return type like FindResult!(Range, Ranges) is going to scare people off. The find() that takes only a predicate and the range to search in is much better: Range find(alias pred, Range)(Range haystack); Now, unfortunately, I'm not sure how to improve the main version of find other than the return type. FindResult!(Range,Ranges) find(alias pred = "a == b", Range, Ranges...)(Range haystack, Ranges needles); becomes Range find(alias pred = "a == b", Range, Ranges...)(Range haystack, Ranges needles); However, that (alias pred = "a == b", Range, Ranges...) is still pretty ugly. Unfortunately, I'm not sure how you would eliminate that. The only part that the caller would normally care about is alias pred = "a == b", but the rest is part of the template definition as well, and I don't think that we have a way to clean that up at the moment. The other problem is that you're getting two totally different return types depending on whether you give it one needle or many (which may make Range not work as a return type anyway since what you get in the case of multiple needs isn't exactly a Range). So, I'd suggest either making it so that they return the same thing and you have to do a comparison to figure out which needle was found (in many cases, all you care about is that one of them was found anyway, but it would be less efficient in cases where you do care), or you should split find() into two functions - one which takes a single needle and one which takes many. That should make things a bit clearer. It doesn't really change the functionality, but the docs would be less cluttered. Another option would be to make find only return a range for multiple needles but have a separate function which returns a tuple if you really care, thereby simplifying find() a bit. In such a case, it would probably still be best to separate find() itself into two functions, but it wouldn't be as critical. As for integrating BoyerMoore into find(), if you can do it cleanly and seemlessly, I'm all for it. It would be good if there were a note about it in the documentation so that those using it wouldn't feel the need to reimplement it themselves, but I'm all for having versions of find() be specialized for the type that they're searching on if it can be done seemlessly. I'd even suggest doing the same for the various container types which could benefit from a specialized find - though perhaps those would be better served by having a special version of find() as part of their definition which std.algorithm.find() then calls. If anything, I've been more interested in canFind() and until() being made to match up with find(). I'd like to be able to give them both pretty much the same arguments and then get the bool from canFind() and the range that find() would have walked over in the case of until(). A function which gave you both the range that until() would have given you as well as the one which find() would have given you would be nice as well. I previously opened a bug with thoughts along those lines: http://d.puremagic.com/issues/show_bug.cgi?id=3888 In any case, I think that find() itself is more or less fine from a usage standpoint. It's the docs that need help. And splitting up the function and trying to simplify the signature in the docs (whether the signature itself is really simplified or not) is what really needs to happen. The other thing would be to add a lot more examples. That would be a big boon for a lot of std.algorithm functions. They're not necessarily hard to use, and the examples show it much more clearly than the signatures often do. The problem is that the signatures are so ugly. And the fact that they have to be (in terms of the actual function if not the documentation) is pretty much required given what they do, so I'm not sure how much that can really be fixed. Maybe some extensions to DDoc need to be done for it happen, but it would be good if we could find a way to give simplified function signatures for a lot of these functions (possibly with a way in the docs to see the full function signature) so that users of the docs can see what they need to to use the functions without having to wade through the nasty (albeit necessarily so) signatures. - Jonathan M Davis
Jul 17 2010
I think that the problem with find() is not so much find() but it's documentation. In all honesty, anything with a return type like FindResult!(Range, Ranges) is going to scare people off.I fully agree. I had no idea from the documentation what something like this was: [1,2,3,4][]. The array documentation doesn't have any details, so for the most part I've stayed away from std.algorithm because I don't understand the return types very well. Better documentation would help a lot. Casey
Jul 19 2010
On Monday, July 19, 2010 16:36:57 sybrandy wrote:In most cases, the return types are just ranges, but because they're constructed by the algorithm, the algorithm creates its own range type. So, generally, all you need to know is that a range is being returned. But even then, you run into some issues where you need to do stuff like run the returned range through array() to get something useable - depending on what you're doing with it. Regardless of the return type though, std.algorithm is definitely a prime example of how auto is your friend. That way you often don't have to care all that much about what the return type really looks like. However, the documentation on it is obviously ugly. It really should give the programmer exactly what they need to use the function and nothing more. As it stands, you have to really want to use it to figure it out. Personally, I think that it's well worth it - std.algorithm is fantastic - but the documentation is nowhere near as straightforward as it should be (even if using the functions is actually pretty straightforward in most cases). - Jonathan M DavisI think that the problem with find() is not so much find() but it's documentation. In all honesty, anything with a return type like FindResult!(Range, Ranges) is going to scare people off.I fully agree. I had no idea from the documentation what something like this was: [1,2,3,4][]. The array documentation doesn't have any details, so for the most part I've stayed away from std.algorithm because I don't understand the return types very well. Better documentation would help a lot. Casey
Jul 19 2010
In most cases, the return types are just ranges, but because they're constructed by the algorithm, the algorithm creates its own range type. So, generally, all you need to know is that a range is being returned. But even then, you run into some issues where you need to do stuff like run the returned range through array() to get something useable - depending on what you're doing with it. Regardless of the return type though, std.algorithm is definitely a prime example of how auto is your friend. That way you often don't have to care all that much about what the return type really looks like.That paragraph would have been immensely useful to me a few months ago.However, the documentation on it is obviously ugly. It really should give the programmer exactly what they need to use the function and nothing more. As it stands, you have to really want to use it to figure it out. Personally, I think that it's well worth it - std.algorithm is fantastic - but the documentation is nowhere near as straightforward as it should be (even if using the functions is actually pretty straightforward in most cases).Agreed. I'm also a big fan of examples as it's very easy to look at an example and get something working. Then, once you see it work and get somewhat of an understanding, you can dig deeper into the details to learn more. Casey
Jul 20 2010
sybrandy:Agreed. I'm also a big fan of examples as it's very easy to look at an example and get something working. Then, once you see it work and get somewhat of an understanding, you can dig deeper into the details to learn more.I am not sure, but I think find() is not what I need, even if you add lot of usage examples to it. (It seems my list of find-like micropatterns got ignored, I don't know why, maybe they don't look enough like things done by STL C++.) Bye, bearophile
Jul 20 2010
On 07/19/2010 06:36 PM, sybrandy wrote:I agree as well. FWIW [1, 2, 3, 4][] is a workaround for a (now gone) bug in the compiler. That bug made [1, 2, 3, 4] a fixed-size array by default. I'll work on improving the documentation. AndreiI think that the problem with find() is not so much find() but it's documentation. In all honesty, anything with a return type like FindResult!(Range, Ranges) is going to scare people off.I fully agree. I had no idea from the documentation what something like this was: [1,2,3,4][]. The array documentation doesn't have any details, so for the most part I've stayed away from std.algorithm because I don't understand the return types very well. Better documentation would help a lot.
Jul 19 2010
On 07/19/2010 10:18 PM, Andrei Alexandrescu wrote:On 07/19/2010 06:36 PM, sybrandy wrote:Muchas gracias from your loyal minions. CaseyI agree as well. FWIW [1, 2, 3, 4][] is a workaround for a (now gone) bug in the compiler. That bug made [1, 2, 3, 4] a fixed-size array by default. I'll work on improving the documentation. AndreiI think that the problem with find() is not so much find() but it's documentation. In all honesty, anything with a return type like FindResult!(Range, Ranges) is going to scare people off.I fully agree. I had no idea from the documentation what something like this was: [1,2,3,4][]. The array documentation doesn't have any details, so for the most part I've stayed away from std.algorithm because I don't understand the return types very well. Better documentation would help a lot.
Jul 20 2010
Andrei:Any ideas - please chime in.<As with most software it's better to first ask yourself what your software will mostly be used for, and then to write it. Otherwise you are wasting your time and the user time. The most common usages (I call them micropatterns) for find-like functions (every one of the following seven operations takes an optional transform function like schwartzSort): 1) boolean predicate, is the item/iterable x inside the iterable foo? ("in" operator, or if not possible a "isIn" function). 2) int function, tell me how many items/arrays x are inside the iterable foo ("count" function). 3) int function, tell me the index of the first item/iterable x in the iterable foo ("index" function). 4 extra) find lazily all the indexes of an item/iterable x inside the iterable foo ("indexes" struct or class). 5) find the first max/min item in a iterable ("max", "min" functions, they have overloads for 2 and 3 items too). 6) find the index of the max/min item in a iterable ("maxPos", "minPos" functions). 7) find lazily all the max/min items in a iterable ("maxs", "mins" structs or classes). An iterable is a Range (like a string, array, AA, etc) or a struct/class that defines just opApply. In my dlibs1 in all those functions if the sequence is an associative arrays, the search is done on the keys. D2 AAs have .byKey() that reduces this need a bit. In the first 4 micropatterns it's good to have a specialization for strings, to increase performance with some string-specific tricks . Bye, bearophile
Jul 18 2010
On Sun, Jul 18, 2010 at 00:55, Andrei Alexandrescu wrote: ***** I was thinking of improving std.find. For starters, it should be told that std.algorithm.find _does_ a lot, which at least partially justifies its complexity. One thing that I haven't seen other find()s doing is to be able to search in one pass a given range for multiple other ranges, like this: int[] a = [ 1, 4, 2, 3 ]; assert(find(a, 5, 4, 3) == tuple([ 4, 2, 3 ], 2)); When passed more than two arguments, find returns a tuple continuing the searched ranged positioned on the element found and the 1-based index of the parameter that was found. The trick is that find() makes exactly one pass through the searched range, which is often more efficient than searching the same range for each element in turn. Also the one-pass approach works with input ranges so it doesn't put pressure on the range capabilities. ***** You could consider than find() is a multiple-returns function: it may return 0, 1 or many values (the successive finds). Which can, obviously, be easily represented by a range. I see find() as bit like filter(): its return type is a range determined by the predicate and whose elements are the rest of the input range. so: find([ 1, 4, 2, 3, 4, 0 ], 4) returns a range whose elements are [4,2,3,4,0] and [4,0]. If no needle is present in haystack, then the result range is empty. It's just it's an empty range of ranges, not an empty version of the original range. A complement information can be how many steps were taken (that is, the index of the found elements): ([4,2,3,4,0], 1), ([4,0], 4). A policy or another function could take care of that: findIndexed(haystack, needle) or find!(withIndex)(haystack, needle). The advantage of returning a range is the possibility to iterate on the results. I, for one, like the idea of having find() in the same family that workhorses like map() or filter()... The drawback is the necessity to test the result with empty to know if you had results, but that's not different from the current version. So, you could have a findFirst() that does not return a range, but one element, like the current find() does. Or call what I suggest findAll(). Now, for many needles... Currently, you pass the predicate "a==b" to find() and use it on all needles. A possible generalization is to have a isOneOf() predicate, called with isOneOf(a, b0, b1, ... bn) where a is the current element being tested and the b's are the needles passed to find(). This generic calling of pred with (a, needles) could be done for any predicate, like for example: find!(isBetween)(haystack, 0, 10) where isBetween(a, left, right) returns true if (left <= a && a <= right). The default would be "a==b" for one needle and isOneOf(a, b...) if needles.length > 1. So you still get the current behavior, with a bit of possible customization thrown in, like the isBetween example. I don't think that's possible today? The limit of this is you cannot easily get the index of the found needle. btw, could you remind us why you used a 1-based index? My opinion on this is that if you want to find any needle, then you mostly care for them being present, and much less for wich one is the first. Also, the found element is accessible by taking the front of the retuned range. Why provide it to the caller? Another problem I see is to get the nice current behavior of having a needle being a range like haystack: find([0,1,2,3,4,3,2,1], [4,3]) => [4,3,2,1] And that's I particularly like. ***** Another aspect I'd like to discuss is use of Boyer-Moore and related fast finding techniques. Currently the use of B-M is explicit and a bit awkward. I was thinking instead to automatically detect its appropriatedness and use it transparently. Bearophile posted at a link to such a blend string search routine (http://effbot.org/zone/stringlib.htm) that I think can be generalized quite easily. ***** Indeed, reading this, it seems like it could be generalized easily. You need a random-access range with a defined length. So the usage of such an algo could indeed be done transparently. Hmm... Philippe
Jul 18 2010
On 07/18/2010 04:30 PM, Philippe Sigaud wrote:On Sun, Jul 18, 2010 at 00:55, Andrei Alexandrescu wrote: You could consider than find() is a multiple-returns function: it may return 0, 1 or many values (the successive finds). Which can, obviously, be easily represented by a range. I see find() as bit like filter(): its return type is a range determined by the predicate and whose elements are the rest of the input range. so: find([ 1, 4, 2, 3, 4, 0 ], 4) returns a range whose elements are [4,2,3,4,0] and [4,0]. If no needle is present in haystack, then the result range is empty. It's just it's an empty range of ranges, not an empty version of the original range. A complement information can be how many steps were taken (that is, the index of the found elements): ([4,2,3,4,0], 1), ([4,0], 4). A policy or another function could take care of that: findIndexed(haystack, needle) or find!(withIndex)(haystack, needle).I agree that a findAll function yielding a range is nice, however it's not the simplest interface. I think there should be a function that just finds something somewhere.Now, for many needles... Currently, you pass the predicate "a==b" to find() and use it on all needles. A possible generalization is to have a isOneOf() predicate, called with isOneOf(a, b0, b1, ... bn) where a is the current element being tested and the b's are the needles passed to find(). This generic calling of pred with (a, needles) could be done for any predicate, like for example: find!(isBetween)(haystack, 0, 10)Aren't such scenarios better served by putting the limits inside the predicate? Otherwise the list of what can be done could go on forever.The limit of this is you cannot easily get the index of the found needle. btw, could you remind us why you used a 1-based index? My opinion on this is that if you want to find any needle, then you mostly care for them being present, and much less for wich one is the first.I used one-based index to make tests easier - as you said, most often you care for the presence so zero/nonzero is easiest to tell. Then I thought it wouldn't hurt to provide the index since it's not any extra work.Also, the found element is accessible by taking the front of the retuned range. Why provide it to the caller?Because that doesn't generalize well to searching ranges (as oposed to individual elements).Another problem I see is to get the nice current behavior of having a needle being a range like haystack: find([0,1,2,3,4,3,2,1], [4,3]) => [4,3,2,1] And that's I particularly like.Not sure I understand whether that's good or bad, but I also want to provide the likes of findSkip() (it's been discussed here under a different name, findConsume) that positions the searched range after the found range: findSkip([0,1,2,3,4,3,2,1], [4,3]) => [2,1] It's very useful in practice.***** Another aspect I'd like to discuss is use of Boyer-Moore and related fast finding techniques. Currently the use of B-M is explicit and a bit awkward. I was thinking instead to automatically detect its appropriatedness and use it transparently. Bearophile posted at a link to such a blend string search routine (http://effbot.org/zone/stringlib.htm) that I think can be generalized quite easily. ***** Indeed, reading this, it seems like it could be generalized easily. You need a random-access range with a defined length. So the usage of such an algo could indeed be done transparently. Hmm...The current B-M implementation already works with any random-access range with length, but it's not integrated transparently with the regular find. I wonder what the threshold conditions might look like. Andrei
Jul 18 2010
On Mon, Jul 19, 2010 at 03:19, Andrei Alexandrescu < SeeWebsiteForEmail erdani.org> wrote:I agree that a findAll function yielding a range is nice, however it's not the simplest interface. I think there should be a function that just finds something somewhere.Yes, I agree that's the common case. What I personally would like is a way to find an element in an ordered container (binary tree...)Now, for many needles... Currently, you pass the predicate "a==b" toThe problem is that, as the predicate is passed at compile-time, the limits must be CT-defined too. I'd like the limits to be defined at runtime. I run into this regularly, while using predicates in std.algorithms. A not-so-uncommon scenario for me is having one range delivering limits (or zones to explore, or any other runtime arguments), and finding a value inside those limits in another range. Ideally, I'd like to use find() on a spatial structure like an R-tree. ( http://en.wikipedia.org/wiki/R-tree ). But I admit it's not that common.find() and use it on all needles. A possible generalization is to have a isOneOf() predicate, called with isOneOf(a, b0, b1, ... bn) where a is the current element being tested and the b's are the needles passed to find(). This generic calling of pred with (a, needles) could be done for any predicate, like for example: find!(isBetween)(haystack, 0, 10)Aren't such scenarios better served by putting the limits inside the predicate? Otherwise the list of what can be done could go on forever.The limit of this is you cannot easily get the index of the foundAh, I thought the scenario was: 1) test for the returned range (as the tuple._0 element) being empty or not. 2) if not empty, something was found, get the index. I forgot getting 0 as index would mean 'no element found'. If I understood you correctly, then I'd prefer a -1 index indicating failure, and a 0-based indexing on the needles if on is found. But that's a cosmetic issue.needle. btw, could you remind us why you used a 1-based index? My opinion on this is that if you want to find any needle, then you mostly care for them being present, and much less for wich one is the first.I used one-based index to make tests easier - as you said, most often you care for the presence so zero/nonzero is easiest to tell. Then I thought it wouldn't hurt to provide the index since it's not any extra work.Also, the found element is accessible by taking the front of the retunedGood point.range. Why provide it to the caller?Because that doesn't generalize well to searching ranges (as oposed to individual elements).Another problem I see is to get the nice current behavior of having aThat's good. I mean, the current situation is good and that's a functionality I like.needle being a range like haystack: find([0,1,2,3,4,3,2,1], [4,3]) => [4,3,2,1] And that's I particularly like.Not sure I understand whether that's good or bad,but I also want to provide the likes of findSkip() (it's been discussed here under a different name, findConsume) that positions the searched range after the found range: findSkip([0,1,2,3,4,3,2,1], [4,3]) => [2,1] It's very useful in practice.I see that. And and empty range if it found nothing? Another useful one is split() to cut the range in two or three parts : before the match, the match itself (which can be one of the needles, so we do not know it in advance) and the part after that. Return that triple as a tuple. Hey, that way you can even implement replace: return chain(firstpart, replacedmatch, tailpart). Anything that can mimic regex functions on generic ranges is good, IMO. I regularly find myself wanting to do some regex-like matching. We do not need real pattern matching but even with simplified predicates, we can do many things.*****I'm sorry I didn't look at your implementation, as I don't know Boyer-Moore that much. Does it need to know the alphabet size or do you use another variant? PhilippeAnother aspect I'd like to discuss is use of Boyer-Moore and related fast finding techniques. Currently the use of B-M is explicit and a bit awkward. I was thinking instead to automatically detect its appropriatedness and use it transparently. Bearophile posted at a link to such a blend string search routine (http://effbot.org/zone/stringlib.htm) that I think can be generalized quite easily. ***** Indeed, reading this, it seems like it could be generalized easily. You need a random-access range with a defined length. So the usage of such an algo could indeed be done transparently. Hmm...The current B-M implementation already works with any random-access range with length, but it's not integrated transparently with the regular find. I wonder what the threshold conditions might look like.
Jul 19 2010
On 07/19/2010 02:12 PM, Philippe Sigaud wrote:On Mon, Jul 19, 2010 at 03:19, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org <mailto:SeeWebsiteForEmail erdani.org>> wrote: I agree that a findAll function yielding a range is nice, however it's not the simplest interface. I think there should be a function that just finds something somewhere. Yes, I agree that's the common case. What I personally would like is a way to find an element in an ordered container (binary tree...)Wouldn't that be a member of the binary tree?Aren't such scenarios better served by putting the limits inside the predicate? Otherwise the list of what can be done could go on forever. The problem is that, as the predicate is passed at compile-time, the limits must be CT-defined too. I'd like the limits to be defined at runtime. I run into this regularly, while using predicates in std.algorithms.I think this is a misunderstanding. All of std.algorithm works with delegates: int[] a = [ 1, 4, 2, 3 ]; int b = 2; assert(find!((x) { return x > b; })(a) == [4, 2, 3]);A not-so-uncommon scenario for me is having one range delivering limits (or zones to explore, or any other runtime arguments), and finding a value inside those limits in another range. Ideally, I'd like to use find() on a spatial structure like an R-tree. ( http://en.wikipedia.org/wiki/R-tree ). But I admit it's not that common.Unless you manage to define appropriate and generally applicable iterators, I think that would overgeneralize find().Anything that can mimic regex functions on generic ranges is good, IMO. I regularly find myself wanting to do some regex-like matching. We do not need real pattern matching but even with simplified predicates, we can do many things.I agree we're in desperate need for a good range-based, character-divorced, statically-computed regex engine.I'm sorry I didn't look at your implementation, as I don't know Boyer-Moore that much. Does it need to know the alphabet size or do you use another variant?My explanation of B-M would do little more than pasting information from various places on the Net. Andrei
Jul 19 2010
On Mon, Jul 19, 2010 at 21:23, Andrei Alexandrescu < SeeWebsiteForEmail erdani.org> wrote:What I personally would like is a way to find an element in an orderedYes, indeed. Unless I define some cousin to range, a recursive range, to iterate on any recursive container, like trees or graphs and then devise algorithms on them. I played with this idea a few weeks ago but went to do other things. Anyway, carry on. Btw, I still plan to update some simple generic graphs and trees structs to obey TotalContainer interface when it makes sense, and update my graph algorithms accordingly. If I ever get somewhere, I'll post something.container (binary tree...)Wouldn't that be a member of the binary tree?Aren't such scenarios better served by putting the limits inside theauto nTimes(E, R)(E multiplier, R range) { return map!((ElementType!R e) { return e*multiplier;})(range); } Error: delegate std.algorithm.__dgliteral2 cannot access frame of function __dgliteral2| I ran into this yesterday while trying to encode power series as (possibly infinite) ranges, and getting problems for the operators overloading. Too bad, I got to the point where I could calculate sin/cos/exp values quite precisely and then could easily define derivation and such. There are workaround, but they are not as handy. I had the same problems some time ago, trying to define predicates on the fly, though I cannot right now remember what it was. I know it's not a problem in std.algorithm per se, but it's a limit on its usefulness, as far as I'm concerned.predicate? Otherwise the list of what can be done could go on forever. The problem is that, as the predicate is passed at compile-time, the limits must be CT-defined too. I'd like the limits to be defined at runtime. I run into this regularly, while using predicates in std.algorithms.I think this is a misunderstanding. All of std.algorithm works with delegates: int[] a = [ 1, 4, 2, 3 ]; int b = 2; assert(find!((x) { return x > b; })(a) == [4, 2, 3]); Ah yes, but I regularly use algorithms structs as return values, like this:Anything that can mimic regex functions on generic ranges is good, IMO.Oh, a full statically computed engine would be the Graal. But I did not go that far. Being able to simply match, split and replace at runtime on any range would be good also, without going all regex. Heck, I may even have already coded something like this, I'll go and check. What I'm trying to do right now as a back-burner task is a function transforming a range in a set-like range (each element is present only once), but testing for "did I already see that?" with a user-provided predicate.I regularly find myself wanting to do some regex-like matching. We do not need real pattern matching but even with simplified predicates, we can do many things.I agree we're in desperate need for a good range-based, character-divorced, statically-computed regex engine.I'm sorry I didn't look at your implementation, as I don't knowOK. PhilippeBoyer-Moore that much. Does it need to know the alphabet size or do you use another variant?My explanation of B-M would do little more than pasting information from various places on the Net.
Jul 19 2010
On 20.07.2010 0:05, Philippe Sigaud wrote:On Mon, Jul 19, 2010 at 21:23, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org <mailto:SeeWebsiteForEmail erdani.org>> wrote: What I personally would like is a way to find an element in an ordered container (binary tree...) Wouldn't that be a member of the binary tree? Yes, indeed. Unless I define some cousin to range, a recursive range, to iterate on any recursive container, like trees or graphs and then devise algorithms on them. I played with this idea a few weeks ago but went to do other things. Anyway, carry on. Btw, I still plan to update some simple generic graphs and trees structs to obey TotalContainer interface when it makes sense, and update my graph algorithms accordingly. If I ever get somewhere, I'll post something. Aren't such scenarios better served by putting the limits inside the predicate? Otherwise the list of what can be done could go on forever. The problem is that, as the predicate is passed at compile-time, the limits must be CT-defined too. I'd like the limits to be defined at runtime. I run into this regularly, while using predicates in std.algorithms. I think this is a misunderstanding. All of std.algorithm works with delegates: int[] a = [ 1, 4, 2, 3 ]; int b = 2; assert(find!((x) { return x > b; })(a) == [4, 2, 3]); Ah yes, but I regularly use algorithms structs as return values, like this: auto nTimes(E, R)(E multiplier, R range) { return map!((ElementType!R e) { return e*multiplier;})(range); }Try this workaround, replacing delegate with nested function, works for me: auto nTimes(E, R)(E multiplier, R range){ ElementType!R fun(ElementType!R e) { return e*multiplier; } return map!(fun)(range); } For some reason, compiler never plays nice with map, and I believe there are some issues with current implementation (2.047 release ) that Andrei (hopefully) fixed in recent commit.Philippe-- Dmitry Olshansky
Jul 19 2010
On 07/19/2010 05:36 PM, Dmitry Olshansky wrote:On 20.07.2010 0:05, Philippe Sigaud wrote:Yes, I'd discussed this with Walter on Friday and he said the semantics should be identical in the two cases. So we're dealing with a compiler bug. AndreiOn Mon, Jul 19, 2010 at 21:23, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org <mailto:SeeWebsiteForEmail erdani.org>> wrote: What I personally would like is a way to find an element in an ordered container (binary tree...) Wouldn't that be a member of the binary tree? Yes, indeed. Unless I define some cousin to range, a recursive range, to iterate on any recursive container, like trees or graphs and then devise algorithms on them. I played with this idea a few weeks ago but went to do other things. Anyway, carry on. Btw, I still plan to update some simple generic graphs and trees structs to obey TotalContainer interface when it makes sense, and update my graph algorithms accordingly. If I ever get somewhere, I'll post something. Aren't such scenarios better served by putting the limits inside the predicate? Otherwise the list of what can be done could go on forever. The problem is that, as the predicate is passed at compile-time, the limits must be CT-defined too. I'd like the limits to be defined at runtime. I run into this regularly, while using predicates in std.algorithms. I think this is a misunderstanding. All of std.algorithm works with delegates: int[] a = [ 1, 4, 2, 3 ]; int b = 2; assert(find!((x) { return x > b; })(a) == [4, 2, 3]); Ah yes, but I regularly use algorithms structs as return values, like this: auto nTimes(E, R)(E multiplier, R range) { return map!((ElementType!R e) { return e*multiplier;})(range); }Try this workaround, replacing delegate with nested function, works for me: auto nTimes(E, R)(E multiplier, R range){ ElementType!R fun(ElementType!R e) { return e*multiplier; } return map!(fun)(range); } For some reason, compiler never plays nice with map, and I believe there are some issues with current implementation (2.047 release ) that Andrei (hopefully) fixed in recent commit.
Jul 19 2010
On Tue, Jul 20, 2010 at 00:36, Dmitry Olshansky <dmitry.olsh gmail.com>wrote:Ah yes, but I regularly use algorithms structs as return values, like this:auto nTimes(E, R)(E multiplier, R range) { return map!((ElementType!R e) { return e*multiplier;})(range); }Try this workaround, replacing delegate with nested function, works for me: auto nTimes(E, R)(E multiplier, R range){ ElementType!R fun(ElementType!R e) { return e*multiplier; } return map!(fun)(range); } For some reason, compiler never plays nice with map, and I believe there are some issues with current implementation (2.047 release ) that Andrei (hopefully) fixed in recent commit.Hmm, using a nested function was the first thing I tried, and it didn't work at the time. Good to know it's a bug. right now, it does not work, though: void main() { auto arr = [0,1,2,3,4]; writeln(array(nTimes(3, arr))); // prints 0, then four big numbers: it's stomping on another part of memory? } Strangely, even using 0 as a multiplier does not work. Using .save() in nTimes neither. You see, these kinds of issues all have solutions (in that particular case, the workaround is a bit complicated but works in practice), but they are cumbersome and limit my usage of std.algorithms. FWIW, the workaround I used there is to take the multiplier, call repeat on it to make it a range (3,3,3,3 ...), zip the two ranges together and call map with a function like this: times(A,B)(Tuple!(A,B) t) { return t._0*t._1;} so, you get: (0,3), (1,3), (2,3), ... and then 0,3,6, etc. Extension of this idea saved me many times. Philippe
Jul 19 2010
On Sun, Jul 18, 2010 at 06:19, Jonathan M Davis wrote: ***** I think that the problem with find() is not so much find() but it's documentation. In all honesty, anything with a return type like FindResult!(Range, Ranges) is going to scare people off. ***** What should be done (in a perfect world) is to have a simplified signature that can become the complete signature on a click auto find!()(haystack, needles) => FindResult!(etc. ... the empty !() being a cue that it's a template function and that there is more to it that meets the eye. (edit: crap, you said it afterwards. I should not reply hours after first reading a message) ***** If anything, I've been more interested in canFind() and until() being made to match up with find(). I'd like to be able to give them both pretty much the same arguments and then get the bool from canFind() and the range that find() would have walked over in the case of until(). A function which gave you both the range that until() would have given you as well as the one which find() would have given you would be nice as well. I previously opened a bug with thoughts along those lines: http://d.puremagic.com/issues/show_bug.cgi?id=3888 ***** Ah, these are called takeWhile, dropWhile and span, in Haskell and other sequence-heavy languages. Or splitBy, cutAt, etc. Quite handy to have, I agree std.algorithm should have them. But then, it's huge module already. ***** In any case, I think that find() itself is more or less fine from a usage standpoint. It's the docs that need help. The other thing would be to add a lot more examples. That would be a big boon for a lot of std.algorithm functions. They're not necessarily hard to use, and the examples show it much more clearly than the signatures often do. ***** I sometimes dream of a complement to Phobos that would demonstrate its use, either by providing lots of examples for each function or by taking a module and tackling a task with it. Obviously, as both a complement to Phobos and a way to demo-nstrate D, it should be called Deimos, the second moon of Mars, Phobos' sibling. Now, the Deimos Project, that's some cool name, if I may say so. But I guess the wiki already does this. Philippe
Jul 18 2010