digitalmars.D - Request for "indexOf" undeprecation.

monarch_dodra (21/21) Nov 16 2012 I'd like to get a green light to undeprecate "indexOf".

=?UTF-8?B?U8O2bmtlIEx1ZHdpZw==?= (10/33) Nov 16 2012 Fully agree about the fix of countUntil - I can imagine that there is a ...

monarch_dodra (22/39) Nov 16 2012 Yeah... I just noticed that exists. Still, the deprecation

=?UTF-8?B?U8O2bmtlIEx1ZHdpZw==?= (8/21) Nov 16 2012 Just that it returns the index of the _code point_ if you pass a non-ASC...

monarch_dodra (16/50) Nov 16 2012 I did.

=?UTF-8?B?U8O2bmtlIEx1ZHdpZw==?= (8/52) Nov 16 2012 Codepoint index would mean that it returns 1 instead of 3 in the example...

monarch_dodra (11/18) Nov 16 2012 But that's exactly what a narrow string does :D: s.front yields a

=?UTF-8?B?U8O2bmtlIEx1ZHdpZw==?= (5/13) Nov 16 2012 Yes but narrow string are also not RA ranges ;). Any _general_ range alg...

Denis Shelomovskij (17/21) Nov 16 2012 Thanks, I forget to officially fill this bug some time ago. Pleased you

monarch_dodra (6/30) Nov 16 2012 Yes, but that was when he wrote "indexOf", in which the index is

Jonathan M Davis (16/17) Nov 16 2012 Absolutely not.

monarch_dodra (4/33) Nov 16 2012 OK. Your explanation/rationale makes sense. TY for your reply.

"monarch_dodra" <monarchdodra gmail.com> writes:

I'd like to get a green light to undeprecate "indexOf".

The rationale for the original deprecation is that it was 
superseded by "countUntil", which also works on forward ranges 
(indexOf implies, by name, that it only operates on RA).

There is just one problem:
"日本語".indexOf(本) => 3
"日本語".countUntil(本) => 1

As you can see, when operating on UTF strings, the (expected) 
result of countUntil is not the same as that of indexOf.

countUntil is currently bugged, in the sense that it will return 
the index in the string (as opposed to the count of popFront's). 
However, this needs fixing. If users migrate countUntil to 
indexOf without thinking, it could lead to some terrible, 
terrible bugs.

Phobos was migrated. I found the issue in datetime (which is 
"OK", since date is ASCII), but also in std.path, which is NOT 
OK. At all. Who knows what's out there in the nature?

-------
So yeah, I want to un-deprecate indexOf, and re-document it as 
working only on RA.

Can I get some Yays or Nays?

Nov 16 2012

=?UTF-8?B?U8O2bmtlIEx1ZHdpZw==?= <sludwig outerproduct.org> writes:

Am 16.11.2012 14:18, schrieb monarch_dodra:
 I'd like to get a green light to undeprecate "indexOf".
 
 The rationale for the original deprecation is that it was superseded by
"countUntil", which also
 works on forward ranges (indexOf implies, by name, that it only operates on
RA).
 
 There is just one problem:
 "日本語".indexOf(本) => 3
 "日本語".countUntil(本) => 1
 
 As you can see, when operating on UTF strings, the (expected) result of
countUntil is not the same
 as that of indexOf.
 
 countUntil is currently bugged, in the sense that it will return the index in
the string (as opposed
 to the count of popFront's). However, this needs fixing. If users migrate
countUntil to indexOf
 without thinking, it could lead to some terrible, terrible bugs.
 
 Phobos was migrated. I found the issue in datetime (which is "OK", since date
is ASCII), but also in
 std.path, which is NOT OK. At all. Who knows what's out there in the nature?
 
 -------
 So yeah, I want to un-deprecate indexOf, and re-document it as working only on
RA.
 
 Can I get some Yays or Nays?

Fully agree about the fix of countUntil - I can imagine that there is a
non-trivial amount of code
that blindly uses countUntil instead of indexOf because of the deprecation (I
know I also did this
before).

But there is still std.string.indexOf, which should do the right thing here...
wait, no, that one
also screws up and returns based on the ASCII state of the search character!

However this all is fixed, it will probably cause a good amount of breakage,
but there is no way
around it.

Btw. is 'string' actually considered a RA range? After all it provides no
useful invariants apart
from str[0] == str.front - str[1] could be different from str.popFront();
str.front.

Nov 16 2012

"monarch_dodra" <monarchdodra gmail.com> writes:

On Friday, 16 November 2012 at 13:49:57 UTC, Sönke Ludwig wrote:
 Fully agree about the fix of countUntil - I can imagine that 
 there is a non-trivial amount of code
 that blindly uses countUntil instead of indexOf because of the 
 deprecation (I know I also did this
 before).

 But there is still std.string.indexOf, which should do the 
 right thing here...

Yeah... I just noticed that exists. Still, the deprecation 
comment is not very clear. I think there should be a clearer 
message of indexOf vs countUntil.

As a matter of fact, I found that in path, we are going out of 
our way to use "std.algorithm.countUntil", probably because of a 
forced migration, without considering using std.string.indexOf 
instead.

Still, if we have a RA range of chars, I think it needs to be 
possible to call indexOf on that range. std.utf can decode input 
ranges of chars, why can't we get an indexOf for RA ranges too 
then ?

 wait, no, that one
 also screws up and returns based on the ASCII state of the 
 search character!

It doesn't screw up the result, it is meant for slicing your 
string.

 However this all is fixed, it will probably cause a good amount 
 of breakage, but there is no way
 around it.

 Btw. is 'string' actually considered a RA range? After all it 
 provides no useful invariants apart
 from str[0] == str.front - str[1] could be different from 
 str.popFront(); str.front.

No! String is not an RA range. It can be indexed, but 
isRandomAccessRange!string is false. This is a fundamental aspect 
of string, to avoid accidently breaking it.

dstring, however, is random access. You should always take this 
into account for considering whether or not it is worth 
converting to before operating on it. For example: sorting the 
chars in a dstring: Easy as pie. Doing it on a string: Not sure 
if even possible.

Nov 16 2012

=?UTF-8?B?U8O2bmtlIEx1ZHdpZw==?= <sludwig outerproduct.org> writes:

Am 16.11.2012 15:02, schrieb monarch_dodra:
 wait, no, that one
 also screws up and returns based on the ASCII state of the search character!

 
 It doesn't screw up the result, it is meant for slicing your string.

Just that it returns the index of the _code point_ if you pass a non-ASCII
character as the search
term and a byte index only if you pass in an ASCII char.

 Btw. is 'string' actually considered a RA range? After all it provides no
useful invariants apart
 from str[0] == str.front - str[1] could be different from str.popFront();
str.front.

 
 No! String is not an RA range. It can be indexed, but
isRandomAccessRange!string is false. This is a
 fundamental aspect of string, to avoid accidently breaking it.
 
 dstring, however, is random access. You should always take this into account
for considering whether
 or not it is worth converting to before operating on it. For example: sorting
the chars in a
 dstring: Easy as pie. Doing it on a string: Not sure if even possible.

In that case I got confused by the example. I thought you wanted to make
"日本語".indexOf('本') == 3
possible again. But that wouldn't work if indexOf operates on RA ranges. If you
do have a RA though,
how is the result of countUntil different from indexOf? If you have an actual
char-range, countUntil
should also return 3...

Nov 16 2012

"monarch_dodra" <monarchdodra gmail.com> writes:

On Friday, 16 November 2012 at 14:50:26 UTC, Sönke Ludwig wrote:
 Am 16.11.2012 15:02, schrieb monarch_dodra:
 wait, no, that one
 also screws up and returns based on the ASCII state of the 
 search character!

 
 It doesn't screw up the result, it is meant for slicing your 
 string.

 Just that it returns the index of the _code point_ if you pass 
 a non-ASCII character as the search
 term and a byte index only if you pass in an ASCII char.

I'm not sure what you are trying to say? Aren't those the same?

 Btw. is 'string' actually considered a RA range? After all it 
 provides no useful invariants apart
 from str[0] == str.front - str[1] could be different from 
 str.popFront(); str.front.

 
 No! String is not an RA range. It can be indexed, but 
 isRandomAccessRange!string is false. This is a
 fundamental aspect of string, to avoid accidently breaking it.
 
 dstring, however, is random access. You should always take 
 this into account for considering whether
 or not it is worth converting to before operating on it. For 
 example: sorting the chars in a
 dstring: Easy as pie. Doing it on a string: Not sure if even 
 possible.

 In that case I got confused by the example. I thought you 
 wanted to make
 "日本語".indexOf('本') == 3
 possible again.

I did.

 But that wouldn't work if indexOf operates on RA ranges. If you 
 do have a RA though,
 how is the result of countUntil different from indexOf? If you 
 have an actual char-range, countUntil
 should also return 3...

One of the problems is that the semantics of a char range 
containing a utf payload is not very well formalized. In 
particular, there is no particular rule that states that popFront 
will pop an entire utf-sequence, or just a sigle char. In this 
situation, we can't really say what countUntil would return...

However, it is perfectly legal to decode a forward range, so I 
don't see why you wouldn't be able to explicitly search the index 
in a RA range, as opposed to the amount of popFronts needed to 
get there. It is two different operations, and they have been 
(IMO) erroneously merged together.

One of the workarounds is to "find" instead, and then calculate 
`r.length - r.find("本")`. But: 1. This is a pain to do. 2. 
Doesn't work on infinite ranges (which it should).

Nov 16 2012

=?UTF-8?B?U8O2bmtlIEx1ZHdpZw==?= <sludwig outerproduct.org> writes:

Am 16.11.2012 16:08, schrieb monarch_dodra:
 On Friday, 16 November 2012 at 14:50:26 UTC, Sönke Ludwig wrote:
 Am 16.11.2012 15:02, schrieb monarch_dodra:
 wait, no, that one
 also screws up and returns based on the ASCII state of the search character!

 It doesn't screw up the result, it is meant for slicing your string.

 Just that it returns the index of the _code point_ if you pass a non-ASCII
character as the search
 term and a byte index only if you pass in an ASCII char.

 
 I'm not sure what you are trying to say? Aren't those the same?

Codepoint index would mean that it returns 1 instead of 3 in the example. But
forget what I said. I
just tested it and was surprised that in 'foreach( i, dchar ch; str )', i
contains byte offsets and
not code point indices - so the indexOf() implementation is fully correct after
all.

 
 Btw. is 'string' actually considered a RA range? After all it provides no
useful invariants apart
 from str[0] == str.front - str[1] could be different from str.popFront();
str.front.

 No! String is not an RA range. It can be indexed, but
isRandomAccessRange!string is false. This is a
 fundamental aspect of string, to avoid accidently breaking it.

 dstring, however, is random access. You should always take this into account
for considering whether
 or not it is worth converting to before operating on it. For example: sorting
the chars in a
 dstring: Easy as pie. Doing it on a string: Not sure if even possible.

 In that case I got confused by the example. I thought you wanted to make
 "日本語".indexOf('本') == 3
 possible again.

 
 I did.
 
 But that wouldn't work if indexOf operates on RA ranges. If you do have a RA
though,
 how is the result of countUntil different from indexOf? If you have an actual
char-range, countUntil
 should also return 3...

 
 One of the problems is that the semantics of a char range containing a utf
payload is not very well
 formalized. In particular, there is no particular rule that states that
popFront will pop an entire
 utf-sequence, or just a sigle char. In this situation, we can't really say
what countUntil would
 return...
 
 However, it is perfectly legal to decode a forward range, so I don't see why
you wouldn't be able to
 explicitly search the index in a RA range, as opposed to the amount of
popFronts needed to get
 there. It is two different operations, and they have been (IMO) erroneously
merged together.
 
 One of the workarounds is to "find" instead, and then calculate `r.length -
r.find("本")`. But: 1.
 This is a pain to do. 2. Doesn't work on infinite ranges (which it should).
 

As far as I see allowing divergent behavior for index based access and
popFront/front would
basically mean that no sensible algorithm could be implemented. What should
some generic algorithm
do with a RA range that returns double[] but yields byte values when using
index access? But I guess
Andrei has some more specific ideas here.

Nov 16 2012

"monarch_dodra" <monarchdodra gmail.com> writes:

On Friday, 16 November 2012 at 16:33:47 UTC, Sönke Ludwig wrote:
 As far as I see allowing divergent behavior for index based 
 access and popFront/front would
 basically mean that no sensible algorithm could be implemented. 
 What should some generic algorithm
 do with a RA range that returns double[] but yields byte values 
 when using index access? But I guess
 Andrei has some more specific ideas here.

But that's exactly what a narrow string does :D: s.front yields a 
dchar, but s[0] yields a char.

The only difference is that a string can be detected as a 
"isNarrowString".

A RA that contains chars would have limited algorithm 
capabilities, and be only used with care, but in such a case, I 
think indexOf would be one of those algorithms that should 
specifically work.

In any case, it shouldn't be that big of a problem. I started the 
thread without realizing that string.indexOf existed ... sorry.

Nov 16 2012

=?UTF-8?B?U8O2bmtlIEx1ZHdpZw==?= <sludwig outerproduct.org> writes:

Am 16.11.2012 17:50, schrieb monarch_dodra:
 On Friday, 16 November 2012 at 16:33:47 UTC, Sönke Ludwig wrote:
 As far as I see allowing divergent behavior for index based access and
popFront/front would
 basically mean that no sensible algorithm could be implemented. What should
some generic algorithm
 do with a RA range that returns double[] but yields byte values when using
index access? But I guess
 Andrei has some more specific ideas here.

 
 But that's exactly what a narrow string does :D: s.front yields a dchar, but
s[0] yields a char.

Yes but narrow string are also not RA ranges ;). Any _general_ range algorithm
would fail for a
'string' if it would e.g. try to use the random access for look-ahead or
something similar. After
all, ranges are a kind of storage abstraction, but if RA would not offer that
abstraction they would
be of questionable value.

Nov 16 2012

Denis Shelomovskij <verylonglogin.reg gmail.com> writes:

16.11.2012 17:18, monarch_dodra пишет:
 As you can see, when operating on UTF strings, the (expected) result of
 countUntil is not the same as that of indexOf.

 countUntil is currently bugged, in the sense that it will return the
 index in the string (as opposed to the count of popFront's).

Thanks, I forget to officially fill this bug some time ago. Pleased you 
are also finding Phobos bugs.

And I want to mention that `countUntil` is intentionally spoiled. See 
the sources:
---
// Narrow strings are handled a bit differently
---
and Andrei will probably argue that current dangerous like hell 
behaviour is correct and such cool feature must be undocumented and 
developers will be pleased to discover it because of data corruption. :)

But lets not blame Andrei because it was one of his insane big commits:
https://github.com/D-Programming-Language/phobos/commit/3ea2debb8c4a6a707447c684e94f651924efaa96
and he probably was superheated and didn't understand what is going on.


-- 
Денис В. Шеломовский
Denis V. Shelomovskij

Nov 16 2012

"monarch_dodra" <monarchdodra gmail.com> writes:

On Friday, 16 November 2012 at 17:30:57 UTC, Denis Shelomovskij 
wrote:
 16.11.2012 17:18, monarch_dodra пишет:
 As you can see, when operating on UTF strings, the (expected) 
 result of
 countUntil is not the same as that of indexOf.

 countUntil is currently bugged, in the sense that it will 
 return the
 index in the string (as opposed to the count of popFront's).

 Thanks, I forget to officially fill this bug some time ago. 
 Pleased you are also finding Phobos bugs.

 And I want to mention that `countUntil` is intentionally 
 spoiled. See the sources:
 ---
 // Narrow strings are handled a bit differently
 ---
 and Andrei will probably argue that current dangerous like hell 
 behaviour is correct and such cool feature must be undocumented 
 and developers will be pleased to discover it because of data 
 corruption. :)

 But lets not blame Andrei because it was one of his insane big 
 commits:
 https://github.com/D-Programming-Language/phobos/commit/3ea2debb8c4a6a707447c684e94f651924efaa96
 and he probably was superheated and didn't understand what is 
 going on.

Yes, but that was when he wrote "indexOf", in which the index is 
right.

IndexOf was latter simply renamed countUntil, but that changes 
the semantics, and makes the implementation incorrect...

Nov 16 2012

"Jonathan M Davis" <jmdavisProg gmx.com> writes:

On Friday, November 16, 2012 14:18:02 monarch_dodra wrote:
 Can I get some Yays or Nays?

Absolutely not.

std.algorithm.indexOf was _always_ supposed to return code points for strings. 
It's an outright bug if it ever did otherwise. It was renamed to countUntil so 
that people would not mistakingly use it on strings and think that they were 
getting an index. That's what std.string.indexOf does. Any code that relies on 
std.algorithm.indexOf or std.algorithm.countUntil returning the index of for 
strings rather than the number of code points is wrong and needs to be fixed.

Undeprecating std.algorithm.indexOf would just perpetuate the problem, and it 
makes _no_ sense for anything in std.algorithm to operate on string indexes 
anyway. std.algorithm treats all strings as ranges of dchar, and having 
anything in there operating on string indexes would violate that.

If the deprecation message needs improvement, then feel free to improve it, 
but anything which was calling std.algorithm.indexOf to get the index of a 
string was wrong to begin with.

- Jonathan M Davis

Nov 16 2012

"monarch_dodra" <monarchdodra gmail.com> writes:

On Friday, 16 November 2012 at 21:17:00 UTC, Jonathan M Davis
wrote:
 On Friday, November 16, 2012 14:18:02 monarch_dodra wrote:
 Can I get some Yays or Nays?

 Absolutely not.

 std.algorithm.indexOf was _always_ supposed to return code 
 points for strings.
 It's an outright bug if it ever did otherwise. It was renamed 
 to countUntil so
 that people would not mistakingly use it on strings and think 
 that they were
 getting an index. That's what std.string.indexOf does. Any code 
 that relies on
 std.algorithm.indexOf or std.algorithm.countUntil returning the 
 index of for
 strings rather than the number of code points is wrong and 
 needs to be fixed.

 Undeprecating std.algorithm.indexOf would just perpetuate the 
 problem, and it
 makes _no_ sense for anything in std.algorithm to operate on 
 string indexes
 anyway. std.algorithm treats all strings as ranges of dchar, 
 and having
 anything in there operating on string indexes would violate 
 that.

 If the deprecation message needs improvement, then feel free to 
 improve it,
 but anything which was calling std.algorithm.indexOf to get the 
 index of a
 string was wrong to begin with.

 - Jonathan M Davis

OK. Your explanation/rationale makes sense. TY for your reply.
I'll try to improve the documentation.

Nov 16 2012

D Programming

C/C++ Programming

Other

digitalmars.D - Request for "indexOf" undeprecation.