digitalmars.D - Request for "indexOf" undeprecation.
- monarch_dodra (21/21) Nov 16 2012 I'd like to get a green light to undeprecate "indexOf".
- =?UTF-8?B?U8O2bmtlIEx1ZHdpZw==?= (10/33) Nov 16 2012 Fully agree about the fix of countUntil - I can imagine that there is a ...
- monarch_dodra (22/39) Nov 16 2012 Yeah... I just noticed that exists. Still, the deprecation
- =?UTF-8?B?U8O2bmtlIEx1ZHdpZw==?= (8/21) Nov 16 2012 Just that it returns the index of the _code point_ if you pass a non-ASC...
- monarch_dodra (16/50) Nov 16 2012 I did.
- =?UTF-8?B?U8O2bmtlIEx1ZHdpZw==?= (8/52) Nov 16 2012 Codepoint index would mean that it returns 1 instead of 3 in the example...
- monarch_dodra (11/18) Nov 16 2012 But that's exactly what a narrow string does :D: s.front yields a
- =?UTF-8?B?U8O2bmtlIEx1ZHdpZw==?= (5/13) Nov 16 2012 Yes but narrow string are also not RA ranges ;). Any _general_ range alg...
- Denis Shelomovskij (17/21) Nov 16 2012 Thanks, I forget to officially fill this bug some time ago. Pleased you
- monarch_dodra (6/30) Nov 16 2012 Yes, but that was when he wrote "indexOf", in which the index is
- Jonathan M Davis (16/17) Nov 16 2012 Absolutely not.
- monarch_dodra (4/33) Nov 16 2012 OK. Your explanation/rationale makes sense. TY for your reply.
I'd like to get a green light to undeprecate "indexOf". The rationale for the original deprecation is that it was superseded by "countUntil", which also works on forward ranges (indexOf implies, by name, that it only operates on RA). There is just one problem: "日本語".indexOf(本) => 3 "日本語".countUntil(本) => 1 As you can see, when operating on UTF strings, the (expected) result of countUntil is not the same as that of indexOf. countUntil is currently bugged, in the sense that it will return the index in the string (as opposed to the count of popFront's). However, this needs fixing. If users migrate countUntil to indexOf without thinking, it could lead to some terrible, terrible bugs. Phobos was migrated. I found the issue in datetime (which is "OK", since date is ASCII), but also in std.path, which is NOT OK. At all. Who knows what's out there in the nature? ------- So yeah, I want to un-deprecate indexOf, and re-document it as working only on RA. Can I get some Yays or Nays?
Nov 16 2012
Am 16.11.2012 14:18, schrieb monarch_dodra:I'd like to get a green light to undeprecate "indexOf". The rationale for the original deprecation is that it was superseded by "countUntil", which also works on forward ranges (indexOf implies, by name, that it only operates on RA). There is just one problem: "日本語".indexOf(本) => 3 "日本語".countUntil(本) => 1 As you can see, when operating on UTF strings, the (expected) result of countUntil is not the same as that of indexOf. countUntil is currently bugged, in the sense that it will return the index in the string (as opposed to the count of popFront's). However, this needs fixing. If users migrate countUntil to indexOf without thinking, it could lead to some terrible, terrible bugs. Phobos was migrated. I found the issue in datetime (which is "OK", since date is ASCII), but also in std.path, which is NOT OK. At all. Who knows what's out there in the nature? ------- So yeah, I want to un-deprecate indexOf, and re-document it as working only on RA. Can I get some Yays or Nays?Fully agree about the fix of countUntil - I can imagine that there is a non-trivial amount of code that blindly uses countUntil instead of indexOf because of the deprecation (I know I also did this before). But there is still std.string.indexOf, which should do the right thing here... wait, no, that one also screws up and returns based on the ASCII state of the search character! However this all is fixed, it will probably cause a good amount of breakage, but there is no way around it. Btw. is 'string' actually considered a RA range? After all it provides no useful invariants apart from str[0] == str.front - str[1] could be different from str.popFront(); str.front.
Nov 16 2012
On Friday, 16 November 2012 at 13:49:57 UTC, Sönke Ludwig wrote:Fully agree about the fix of countUntil - I can imagine that there is a non-trivial amount of code that blindly uses countUntil instead of indexOf because of the deprecation (I know I also did this before). But there is still std.string.indexOf, which should do the right thing here...Yeah... I just noticed that exists. Still, the deprecation comment is not very clear. I think there should be a clearer message of indexOf vs countUntil. As a matter of fact, I found that in path, we are going out of our way to use "std.algorithm.countUntil", probably because of a forced migration, without considering using std.string.indexOf instead. Still, if we have a RA range of chars, I think it needs to be possible to call indexOf on that range. std.utf can decode input ranges of chars, why can't we get an indexOf for RA ranges too then ?wait, no, that one also screws up and returns based on the ASCII state of the search character!It doesn't screw up the result, it is meant for slicing your string.However this all is fixed, it will probably cause a good amount of breakage, but there is no way around it. Btw. is 'string' actually considered a RA range? After all it provides no useful invariants apart from str[0] == str.front - str[1] could be different from str.popFront(); str.front.No! String is not an RA range. It can be indexed, but isRandomAccessRange!string is false. This is a fundamental aspect of string, to avoid accidently breaking it. dstring, however, is random access. You should always take this into account for considering whether or not it is worth converting to before operating on it. For example: sorting the chars in a dstring: Easy as pie. Doing it on a string: Not sure if even possible.
Nov 16 2012
Am 16.11.2012 15:02, schrieb monarch_dodra:Just that it returns the index of the _code point_ if you pass a non-ASCII character as the search term and a byte index only if you pass in an ASCII char.wait, no, that one also screws up and returns based on the ASCII state of the search character!It doesn't screw up the result, it is meant for slicing your string.In that case I got confused by the example. I thought you wanted to make "日本語".indexOf('本') == 3 possible again. But that wouldn't work if indexOf operates on RA ranges. If you do have a RA though, how is the result of countUntil different from indexOf? If you have an actual char-range, countUntil should also return 3...Btw. is 'string' actually considered a RA range? After all it provides no useful invariants apart from str[0] == str.front - str[1] could be different from str.popFront(); str.front.No! String is not an RA range. It can be indexed, but isRandomAccessRange!string is false. This is a fundamental aspect of string, to avoid accidently breaking it. dstring, however, is random access. You should always take this into account for considering whether or not it is worth converting to before operating on it. For example: sorting the chars in a dstring: Easy as pie. Doing it on a string: Not sure if even possible.
Nov 16 2012
On Friday, 16 November 2012 at 14:50:26 UTC, Sönke Ludwig wrote:Am 16.11.2012 15:02, schrieb monarch_dodra:I'm not sure what you are trying to say? Aren't those the same?Just that it returns the index of the _code point_ if you pass a non-ASCII character as the search term and a byte index only if you pass in an ASCII char.wait, no, that one also screws up and returns based on the ASCII state of the search character!It doesn't screw up the result, it is meant for slicing your string.I did.In that case I got confused by the example. I thought you wanted to make "日本語".indexOf('本') == 3 possible again.Btw. is 'string' actually considered a RA range? After all it provides no useful invariants apart from str[0] == str.front - str[1] could be different from str.popFront(); str.front.No! String is not an RA range. It can be indexed, but isRandomAccessRange!string is false. This is a fundamental aspect of string, to avoid accidently breaking it. dstring, however, is random access. You should always take this into account for considering whether or not it is worth converting to before operating on it. For example: sorting the chars in a dstring: Easy as pie. Doing it on a string: Not sure if even possible.But that wouldn't work if indexOf operates on RA ranges. If you do have a RA though, how is the result of countUntil different from indexOf? If you have an actual char-range, countUntil should also return 3...One of the problems is that the semantics of a char range containing a utf payload is not very well formalized. In particular, there is no particular rule that states that popFront will pop an entire utf-sequence, or just a sigle char. In this situation, we can't really say what countUntil would return... However, it is perfectly legal to decode a forward range, so I don't see why you wouldn't be able to explicitly search the index in a RA range, as opposed to the amount of popFronts needed to get there. It is two different operations, and they have been (IMO) erroneously merged together. One of the workarounds is to "find" instead, and then calculate `r.length - r.find("本")`. But: 1. This is a pain to do. 2. Doesn't work on infinite ranges (which it should).
Nov 16 2012
Am 16.11.2012 16:08, schrieb monarch_dodra:On Friday, 16 November 2012 at 14:50:26 UTC, Sönke Ludwig wrote:Codepoint index would mean that it returns 1 instead of 3 in the example. But forget what I said. I just tested it and was surprised that in 'foreach( i, dchar ch; str )', i contains byte offsets and not code point indices - so the indexOf() implementation is fully correct after all.Am 16.11.2012 15:02, schrieb monarch_dodra:I'm not sure what you are trying to say? Aren't those the same?Just that it returns the index of the _code point_ if you pass a non-ASCII character as the search term and a byte index only if you pass in an ASCII char.wait, no, that one also screws up and returns based on the ASCII state of the search character!It doesn't screw up the result, it is meant for slicing your string.As far as I see allowing divergent behavior for index based access and popFront/front would basically mean that no sensible algorithm could be implemented. What should some generic algorithm do with a RA range that returns double[] but yields byte values when using index access? But I guess Andrei has some more specific ideas here.I did.In that case I got confused by the example. I thought you wanted to make "日本語".indexOf('本') == 3 possible again.Btw. is 'string' actually considered a RA range? After all it provides no useful invariants apart from str[0] == str.front - str[1] could be different from str.popFront(); str.front.No! String is not an RA range. It can be indexed, but isRandomAccessRange!string is false. This is a fundamental aspect of string, to avoid accidently breaking it. dstring, however, is random access. You should always take this into account for considering whether or not it is worth converting to before operating on it. For example: sorting the chars in a dstring: Easy as pie. Doing it on a string: Not sure if even possible.But that wouldn't work if indexOf operates on RA ranges. If you do have a RA though, how is the result of countUntil different from indexOf? If you have an actual char-range, countUntil should also return 3...One of the problems is that the semantics of a char range containing a utf payload is not very well formalized. In particular, there is no particular rule that states that popFront will pop an entire utf-sequence, or just a sigle char. In this situation, we can't really say what countUntil would return... However, it is perfectly legal to decode a forward range, so I don't see why you wouldn't be able to explicitly search the index in a RA range, as opposed to the amount of popFronts needed to get there. It is two different operations, and they have been (IMO) erroneously merged together. One of the workarounds is to "find" instead, and then calculate `r.length - r.find("本")`. But: 1. This is a pain to do. 2. Doesn't work on infinite ranges (which it should).
Nov 16 2012
On Friday, 16 November 2012 at 16:33:47 UTC, Sönke Ludwig wrote:As far as I see allowing divergent behavior for index based access and popFront/front would basically mean that no sensible algorithm could be implemented. What should some generic algorithm do with a RA range that returns double[] but yields byte values when using index access? But I guess Andrei has some more specific ideas here.But that's exactly what a narrow string does :D: s.front yields a dchar, but s[0] yields a char. The only difference is that a string can be detected as a "isNarrowString". A RA that contains chars would have limited algorithm capabilities, and be only used with care, but in such a case, I think indexOf would be one of those algorithms that should specifically work. In any case, it shouldn't be that big of a problem. I started the thread without realizing that string.indexOf existed ... sorry.
Nov 16 2012
Am 16.11.2012 17:50, schrieb monarch_dodra:On Friday, 16 November 2012 at 16:33:47 UTC, Sönke Ludwig wrote:Yes but narrow string are also not RA ranges ;). Any _general_ range algorithm would fail for a 'string' if it would e.g. try to use the random access for look-ahead or something similar. After all, ranges are a kind of storage abstraction, but if RA would not offer that abstraction they would be of questionable value.As far as I see allowing divergent behavior for index based access and popFront/front would basically mean that no sensible algorithm could be implemented. What should some generic algorithm do with a RA range that returns double[] but yields byte values when using index access? But I guess Andrei has some more specific ideas here.But that's exactly what a narrow string does :D: s.front yields a dchar, but s[0] yields a char.
Nov 16 2012
16.11.2012 17:18, monarch_dodra пишет:As you can see, when operating on UTF strings, the (expected) result of countUntil is not the same as that of indexOf. countUntil is currently bugged, in the sense that it will return the index in the string (as opposed to the count of popFront's).Thanks, I forget to officially fill this bug some time ago. Pleased you are also finding Phobos bugs. And I want to mention that `countUntil` is intentionally spoiled. See the sources: --- // Narrow strings are handled a bit differently --- and Andrei will probably argue that current dangerous like hell behaviour is correct and such cool feature must be undocumented and developers will be pleased to discover it because of data corruption. :) But lets not blame Andrei because it was one of his insane big commits: https://github.com/D-Programming-Language/phobos/commit/3ea2debb8c4a6a707447c684e94f651924efaa96 and he probably was superheated and didn't understand what is going on. -- Денис В. Шеломовский Denis V. Shelomovskij
Nov 16 2012
On Friday, 16 November 2012 at 17:30:57 UTC, Denis Shelomovskij wrote:16.11.2012 17:18, monarch_dodra пишет:Yes, but that was when he wrote "indexOf", in which the index is right. IndexOf was latter simply renamed countUntil, but that changes the semantics, and makes the implementation incorrect...As you can see, when operating on UTF strings, the (expected) result of countUntil is not the same as that of indexOf. countUntil is currently bugged, in the sense that it will return the index in the string (as opposed to the count of popFront's).Thanks, I forget to officially fill this bug some time ago. Pleased you are also finding Phobos bugs. And I want to mention that `countUntil` is intentionally spoiled. See the sources: --- // Narrow strings are handled a bit differently --- and Andrei will probably argue that current dangerous like hell behaviour is correct and such cool feature must be undocumented and developers will be pleased to discover it because of data corruption. :) But lets not blame Andrei because it was one of his insane big commits: https://github.com/D-Programming-Language/phobos/commit/3ea2debb8c4a6a707447c684e94f651924efaa96 and he probably was superheated and didn't understand what is going on.
Nov 16 2012
On Friday, November 16, 2012 14:18:02 monarch_dodra wrote:Can I get some Yays or Nays?Absolutely not. std.algorithm.indexOf was _always_ supposed to return code points for strings. It's an outright bug if it ever did otherwise. It was renamed to countUntil so that people would not mistakingly use it on strings and think that they were getting an index. That's what std.string.indexOf does. Any code that relies on std.algorithm.indexOf or std.algorithm.countUntil returning the index of for strings rather than the number of code points is wrong and needs to be fixed. Undeprecating std.algorithm.indexOf would just perpetuate the problem, and it makes _no_ sense for anything in std.algorithm to operate on string indexes anyway. std.algorithm treats all strings as ranges of dchar, and having anything in there operating on string indexes would violate that. If the deprecation message needs improvement, then feel free to improve it, but anything which was calling std.algorithm.indexOf to get the index of a string was wrong to begin with. - Jonathan M Davis
Nov 16 2012
On Friday, 16 November 2012 at 21:17:00 UTC, Jonathan M Davis wrote:On Friday, November 16, 2012 14:18:02 monarch_dodra wrote:OK. Your explanation/rationale makes sense. TY for your reply. I'll try to improve the documentation.Can I get some Yays or Nays?Absolutely not. std.algorithm.indexOf was _always_ supposed to return code points for strings. It's an outright bug if it ever did otherwise. It was renamed to countUntil so that people would not mistakingly use it on strings and think that they were getting an index. That's what std.string.indexOf does. Any code that relies on std.algorithm.indexOf or std.algorithm.countUntil returning the index of for strings rather than the number of code points is wrong and needs to be fixed. Undeprecating std.algorithm.indexOf would just perpetuate the problem, and it makes _no_ sense for anything in std.algorithm to operate on string indexes anyway. std.algorithm treats all strings as ranges of dchar, and having anything in there operating on string indexes would violate that. If the deprecation message needs improvement, then feel free to improve it, but anything which was calling std.algorithm.indexOf to get the index of a string was wrong to begin with. - Jonathan M Davis
Nov 16 2012