www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - Request for "indexOf" undeprecation.

reply "monarch_dodra" <monarchdodra gmail.com> writes:
I'd like to get a green light to undeprecate "indexOf".

The rationale for the original deprecation is that it was 
superseded by "countUntil", which also works on forward ranges 
(indexOf implies, by name, that it only operates on RA).

There is just one problem:
"日本語".indexOf(本) => 3
"日本語".countUntil(本) => 1

As you can see, when operating on UTF strings, the (expected) 
result of countUntil is not the same as that of indexOf.

countUntil is currently bugged, in the sense that it will return 
the index in the string (as opposed to the count of popFront's). 
However, this needs fixing. If users migrate countUntil to 
indexOf without thinking, it could lead to some terrible, 
terrible bugs.

Phobos was migrated. I found the issue in datetime (which is 
"OK", since date is ASCII), but also in std.path, which is NOT 
OK. At all. Who knows what's out there in the nature?

-------
So yeah, I want to un-deprecate indexOf, and re-document it as 
working only on RA.

Can I get some Yays or Nays?
Nov 16 2012
next sibling parent reply =?UTF-8?B?U8O2bmtlIEx1ZHdpZw==?= <sludwig outerproduct.org> writes:
Am 16.11.2012 14:18, schrieb monarch_dodra:
 I'd like to get a green light to undeprecate "indexOf".
 
 The rationale for the original deprecation is that it was superseded by
"countUntil", which also
 works on forward ranges (indexOf implies, by name, that it only operates on
RA).
 
 There is just one problem:
 "日本語".indexOf(本) => 3
 "日本語".countUntil(本) => 1
 
 As you can see, when operating on UTF strings, the (expected) result of
countUntil is not the same
 as that of indexOf.
 
 countUntil is currently bugged, in the sense that it will return the index in
the string (as opposed
 to the count of popFront's). However, this needs fixing. If users migrate
countUntil to indexOf
 without thinking, it could lead to some terrible, terrible bugs.
 
 Phobos was migrated. I found the issue in datetime (which is "OK", since date
is ASCII), but also in
 std.path, which is NOT OK. At all. Who knows what's out there in the nature?
 
 -------
 So yeah, I want to un-deprecate indexOf, and re-document it as working only on
RA.
 
 Can I get some Yays or Nays?
Fully agree about the fix of countUntil - I can imagine that there is a non-trivial amount of code that blindly uses countUntil instead of indexOf because of the deprecation (I know I also did this before). But there is still std.string.indexOf, which should do the right thing here... wait, no, that one also screws up and returns based on the ASCII state of the search character! However this all is fixed, it will probably cause a good amount of breakage, but there is no way around it. Btw. is 'string' actually considered a RA range? After all it provides no useful invariants apart from str[0] == str.front - str[1] could be different from str.popFront(); str.front.
Nov 16 2012
parent reply "monarch_dodra" <monarchdodra gmail.com> writes:
On Friday, 16 November 2012 at 13:49:57 UTC, Sönke Ludwig wrote:
 Fully agree about the fix of countUntil - I can imagine that 
 there is a non-trivial amount of code
 that blindly uses countUntil instead of indexOf because of the 
 deprecation (I know I also did this
 before).

 But there is still std.string.indexOf, which should do the 
 right thing here...
Yeah... I just noticed that exists. Still, the deprecation comment is not very clear. I think there should be a clearer message of indexOf vs countUntil. As a matter of fact, I found that in path, we are going out of our way to use "std.algorithm.countUntil", probably because of a forced migration, without considering using std.string.indexOf instead. Still, if we have a RA range of chars, I think it needs to be possible to call indexOf on that range. std.utf can decode input ranges of chars, why can't we get an indexOf for RA ranges too then ?
 wait, no, that one
 also screws up and returns based on the ASCII state of the 
 search character!
It doesn't screw up the result, it is meant for slicing your string.
 However this all is fixed, it will probably cause a good amount 
 of breakage, but there is no way
 around it.

 Btw. is 'string' actually considered a RA range? After all it 
 provides no useful invariants apart
 from str[0] == str.front - str[1] could be different from 
 str.popFront(); str.front.
No! String is not an RA range. It can be indexed, but isRandomAccessRange!string is false. This is a fundamental aspect of string, to avoid accidently breaking it. dstring, however, is random access. You should always take this into account for considering whether or not it is worth converting to before operating on it. For example: sorting the chars in a dstring: Easy as pie. Doing it on a string: Not sure if even possible.
Nov 16 2012
parent reply =?UTF-8?B?U8O2bmtlIEx1ZHdpZw==?= <sludwig outerproduct.org> writes:
Am 16.11.2012 15:02, schrieb monarch_dodra:
 wait, no, that one
 also screws up and returns based on the ASCII state of the search character!
It doesn't screw up the result, it is meant for slicing your string.
Just that it returns the index of the _code point_ if you pass a non-ASCII character as the search term and a byte index only if you pass in an ASCII char.
 Btw. is 'string' actually considered a RA range? After all it provides no
useful invariants apart
 from str[0] == str.front - str[1] could be different from str.popFront();
str.front.
No! String is not an RA range. It can be indexed, but isRandomAccessRange!string is false. This is a fundamental aspect of string, to avoid accidently breaking it. dstring, however, is random access. You should always take this into account for considering whether or not it is worth converting to before operating on it. For example: sorting the chars in a dstring: Easy as pie. Doing it on a string: Not sure if even possible.
In that case I got confused by the example. I thought you wanted to make "日本語".indexOf('本') == 3 possible again. But that wouldn't work if indexOf operates on RA ranges. If you do have a RA though, how is the result of countUntil different from indexOf? If you have an actual char-range, countUntil should also return 3...
Nov 16 2012
parent reply "monarch_dodra" <monarchdodra gmail.com> writes:
On Friday, 16 November 2012 at 14:50:26 UTC, Sönke Ludwig wrote:
 Am 16.11.2012 15:02, schrieb monarch_dodra:
 wait, no, that one
 also screws up and returns based on the ASCII state of the 
 search character!
It doesn't screw up the result, it is meant for slicing your string.
Just that it returns the index of the _code point_ if you pass a non-ASCII character as the search term and a byte index only if you pass in an ASCII char.
I'm not sure what you are trying to say? Aren't those the same?
 Btw. is 'string' actually considered a RA range? After all it 
 provides no useful invariants apart
 from str[0] == str.front - str[1] could be different from 
 str.popFront(); str.front.
No! String is not an RA range. It can be indexed, but isRandomAccessRange!string is false. This is a fundamental aspect of string, to avoid accidently breaking it. dstring, however, is random access. You should always take this into account for considering whether or not it is worth converting to before operating on it. For example: sorting the chars in a dstring: Easy as pie. Doing it on a string: Not sure if even possible.
In that case I got confused by the example. I thought you wanted to make "日本語".indexOf('本') == 3 possible again.
I did.
 But that wouldn't work if indexOf operates on RA ranges. If you 
 do have a RA though,
 how is the result of countUntil different from indexOf? If you 
 have an actual char-range, countUntil
 should also return 3...
One of the problems is that the semantics of a char range containing a utf payload is not very well formalized. In particular, there is no particular rule that states that popFront will pop an entire utf-sequence, or just a sigle char. In this situation, we can't really say what countUntil would return... However, it is perfectly legal to decode a forward range, so I don't see why you wouldn't be able to explicitly search the index in a RA range, as opposed to the amount of popFronts needed to get there. It is two different operations, and they have been (IMO) erroneously merged together. One of the workarounds is to "find" instead, and then calculate `r.length - r.find("本")`. But: 1. This is a pain to do. 2. Doesn't work on infinite ranges (which it should).
Nov 16 2012
parent reply =?UTF-8?B?U8O2bmtlIEx1ZHdpZw==?= <sludwig outerproduct.org> writes:
Am 16.11.2012 16:08, schrieb monarch_dodra:
 On Friday, 16 November 2012 at 14:50:26 UTC, Sönke Ludwig wrote:
 Am 16.11.2012 15:02, schrieb monarch_dodra:
 wait, no, that one
 also screws up and returns based on the ASCII state of the search character!
It doesn't screw up the result, it is meant for slicing your string.
Just that it returns the index of the _code point_ if you pass a non-ASCII character as the search term and a byte index only if you pass in an ASCII char.
I'm not sure what you are trying to say? Aren't those the same?
Codepoint index would mean that it returns 1 instead of 3 in the example. But forget what I said. I just tested it and was surprised that in 'foreach( i, dchar ch; str )', i contains byte offsets and not code point indices - so the indexOf() implementation is fully correct after all.
 
 Btw. is 'string' actually considered a RA range? After all it provides no
useful invariants apart
 from str[0] == str.front - str[1] could be different from str.popFront();
str.front.
No! String is not an RA range. It can be indexed, but isRandomAccessRange!string is false. This is a fundamental aspect of string, to avoid accidently breaking it. dstring, however, is random access. You should always take this into account for considering whether or not it is worth converting to before operating on it. For example: sorting the chars in a dstring: Easy as pie. Doing it on a string: Not sure if even possible.
In that case I got confused by the example. I thought you wanted to make "日本語".indexOf('本') == 3 possible again.
I did.
 But that wouldn't work if indexOf operates on RA ranges. If you do have a RA
though,
 how is the result of countUntil different from indexOf? If you have an actual
char-range, countUntil
 should also return 3...
One of the problems is that the semantics of a char range containing a utf payload is not very well formalized. In particular, there is no particular rule that states that popFront will pop an entire utf-sequence, or just a sigle char. In this situation, we can't really say what countUntil would return... However, it is perfectly legal to decode a forward range, so I don't see why you wouldn't be able to explicitly search the index in a RA range, as opposed to the amount of popFronts needed to get there. It is two different operations, and they have been (IMO) erroneously merged together. One of the workarounds is to "find" instead, and then calculate `r.length - r.find("本")`. But: 1. This is a pain to do. 2. Doesn't work on infinite ranges (which it should).
As far as I see allowing divergent behavior for index based access and popFront/front would basically mean that no sensible algorithm could be implemented. What should some generic algorithm do with a RA range that returns double[] but yields byte values when using index access? But I guess Andrei has some more specific ideas here.
Nov 16 2012
parent reply "monarch_dodra" <monarchdodra gmail.com> writes:
On Friday, 16 November 2012 at 16:33:47 UTC, Sönke Ludwig wrote:
 As far as I see allowing divergent behavior for index based 
 access and popFront/front would
 basically mean that no sensible algorithm could be implemented. 
 What should some generic algorithm
 do with a RA range that returns double[] but yields byte values 
 when using index access? But I guess
 Andrei has some more specific ideas here.
But that's exactly what a narrow string does :D: s.front yields a dchar, but s[0] yields a char. The only difference is that a string can be detected as a "isNarrowString". A RA that contains chars would have limited algorithm capabilities, and be only used with care, but in such a case, I think indexOf would be one of those algorithms that should specifically work. In any case, it shouldn't be that big of a problem. I started the thread without realizing that string.indexOf existed ... sorry.
Nov 16 2012
parent =?UTF-8?B?U8O2bmtlIEx1ZHdpZw==?= <sludwig outerproduct.org> writes:
Am 16.11.2012 17:50, schrieb monarch_dodra:
 On Friday, 16 November 2012 at 16:33:47 UTC, Sönke Ludwig wrote:
 As far as I see allowing divergent behavior for index based access and
popFront/front would
 basically mean that no sensible algorithm could be implemented. What should
some generic algorithm
 do with a RA range that returns double[] but yields byte values when using
index access? But I guess
 Andrei has some more specific ideas here.
But that's exactly what a narrow string does :D: s.front yields a dchar, but s[0] yields a char.
Yes but narrow string are also not RA ranges ;). Any _general_ range algorithm would fail for a 'string' if it would e.g. try to use the random access for look-ahead or something similar. After all, ranges are a kind of storage abstraction, but if RA would not offer that abstraction they would be of questionable value.
Nov 16 2012
prev sibling next sibling parent reply Denis Shelomovskij <verylonglogin.reg gmail.com> writes:
16.11.2012 17:18, monarch_dodra пишет:
 As you can see, when operating on UTF strings, the (expected) result of
 countUntil is not the same as that of indexOf.

 countUntil is currently bugged, in the sense that it will return the
 index in the string (as opposed to the count of popFront's).
Thanks, I forget to officially fill this bug some time ago. Pleased you are also finding Phobos bugs. And I want to mention that `countUntil` is intentionally spoiled. See the sources: --- // Narrow strings are handled a bit differently --- and Andrei will probably argue that current dangerous like hell behaviour is correct and such cool feature must be undocumented and developers will be pleased to discover it because of data corruption. :) But lets not blame Andrei because it was one of his insane big commits: https://github.com/D-Programming-Language/phobos/commit/3ea2debb8c4a6a707447c684e94f651924efaa96 and he probably was superheated and didn't understand what is going on. -- Денис В. Шеломовский Denis V. Shelomovskij
Nov 16 2012
parent "monarch_dodra" <monarchdodra gmail.com> writes:
On Friday, 16 November 2012 at 17:30:57 UTC, Denis Shelomovskij 
wrote:
 16.11.2012 17:18, monarch_dodra пишет:
 As you can see, when operating on UTF strings, the (expected) 
 result of
 countUntil is not the same as that of indexOf.

 countUntil is currently bugged, in the sense that it will 
 return the
 index in the string (as opposed to the count of popFront's).
Thanks, I forget to officially fill this bug some time ago. Pleased you are also finding Phobos bugs. And I want to mention that `countUntil` is intentionally spoiled. See the sources: --- // Narrow strings are handled a bit differently --- and Andrei will probably argue that current dangerous like hell behaviour is correct and such cool feature must be undocumented and developers will be pleased to discover it because of data corruption. :) But lets not blame Andrei because it was one of his insane big commits: https://github.com/D-Programming-Language/phobos/commit/3ea2debb8c4a6a707447c684e94f651924efaa96 and he probably was superheated and didn't understand what is going on.
Yes, but that was when he wrote "indexOf", in which the index is right. IndexOf was latter simply renamed countUntil, but that changes the semantics, and makes the implementation incorrect...
Nov 16 2012
prev sibling parent reply "Jonathan M Davis" <jmdavisProg gmx.com> writes:
On Friday, November 16, 2012 14:18:02 monarch_dodra wrote:
 Can I get some Yays or Nays?
Absolutely not. std.algorithm.indexOf was _always_ supposed to return code points for strings. It's an outright bug if it ever did otherwise. It was renamed to countUntil so that people would not mistakingly use it on strings and think that they were getting an index. That's what std.string.indexOf does. Any code that relies on std.algorithm.indexOf or std.algorithm.countUntil returning the index of for strings rather than the number of code points is wrong and needs to be fixed. Undeprecating std.algorithm.indexOf would just perpetuate the problem, and it makes _no_ sense for anything in std.algorithm to operate on string indexes anyway. std.algorithm treats all strings as ranges of dchar, and having anything in there operating on string indexes would violate that. If the deprecation message needs improvement, then feel free to improve it, but anything which was calling std.algorithm.indexOf to get the index of a string was wrong to begin with. - Jonathan M Davis
Nov 16 2012
parent "monarch_dodra" <monarchdodra gmail.com> writes:
On Friday, 16 November 2012 at 21:17:00 UTC, Jonathan M Davis
wrote:
 On Friday, November 16, 2012 14:18:02 monarch_dodra wrote:
 Can I get some Yays or Nays?
Absolutely not. std.algorithm.indexOf was _always_ supposed to return code points for strings. It's an outright bug if it ever did otherwise. It was renamed to countUntil so that people would not mistakingly use it on strings and think that they were getting an index. That's what std.string.indexOf does. Any code that relies on std.algorithm.indexOf or std.algorithm.countUntil returning the index of for strings rather than the number of code points is wrong and needs to be fixed. Undeprecating std.algorithm.indexOf would just perpetuate the problem, and it makes _no_ sense for anything in std.algorithm to operate on string indexes anyway. std.algorithm treats all strings as ranges of dchar, and having anything in there operating on string indexes would violate that. If the deprecation message needs improvement, then feel free to improve it, but anything which was calling std.algorithm.indexOf to get the index of a string was wrong to begin with. - Jonathan M Davis
OK. Your explanation/rationale makes sense. TY for your reply. I'll try to improve the documentation.
Nov 16 2012