digitalmars.D.bugs - toUTFindex
- Derek Parnell (43/43) May 24 2005 When an UCS index of zero is supplied to the 'toUTFindex' function, and ...
- Uwe Salomon (9/12) May 24 2005 I have already changed that in my changed std.utf module (posted some da...
When a UCS index of zero is supplied to the 'toUTFindex' function and the supplied string does not have a valid UTF-8 sequence at offset zero, the function fails to throw an exception. Instead it returns zero, implying that the supplied string is valid up to that point. This bug may exist in other similar functions too.

The following code illustrates the issue.

<code>
import std.utf;
import std.stdio;

void main()
{
    char[] B;
    B = "\xFF\xFF\xFF"; // Not a valid UTF-8 string
    writefln("Index 0=%d", std.utf.toUTFindex(B, 0)); // should fail
    writefln("Index 1=%d", std.utf.toUTFindex(B, 1)); // does fail
}
</code>

Suggested fix:

<code>
size_t toUTFindex(char[] s, size_t n)
{
    size_t i;
    size_t r;

    // The body runs at least once, so offset 0 is validated even when n == 0.
    do
    {
        if (i >= s.length)
            throw new UtfError("3invalid UCS index", i);
        size_t j = std.utf.UTF8stride[s[i]];
        if (j == 0xFF)
            throw new UtfError("3invalid UTF-8 sequence", i);
        r = i;  // byte offset of the current character
        i += j; // advance to the next character
    } while (n--);
    return r;
}
</code>

Also, I note that the UTF8stride table has entries for 5 and 6 byte sequences. I was under the impression that these are no longer valid UTF-8 sequences.

--
Derek
Melbourne, Australia
25/05/2005 11:57:02 AM
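On the last point: RFC 3629 restricts UTF-8 to code points up to U+10FFFF, so a valid sequence is at most 4 bytes long. A minimal sketch of a lead-byte check under that rule follows. The function name `sequenceLength` is hypothetical and is not part of std.utf; it only classifies the lead byte and does not validate the continuation bytes.

```d
// Hypothetical helper, not part of std.utf.
// Returns the length in bytes of the UTF-8 sequence introduced by lead
// byte b, or 0 if b cannot start a valid sequence under RFC 3629.
size_t sequenceLength(ubyte b)
{
    if (b < 0x80)
        return 1; // 0xxxxxxx: ASCII
    if (b >= 0xC2 && b <= 0xDF)
        return 2; // 110xxxxx (0xC0 and 0xC1 would only encode overlong forms)
    if (b >= 0xE0 && b <= 0xEF)
        return 3; // 1110xxxx
    if (b >= 0xF0 && b <= 0xF4)
        return 4; // 11110xxx, capped so the result stays <= U+10FFFF
    return 0;     // continuation bytes, 0xC0, 0xC1, 0xF5..0xFF
}

unittest
{
    assert(sequenceLength(0x41) == 1); // 'A'
    assert(sequenceLength(0xC3) == 2);
    assert(sequenceLength(0xE2) == 3);
    assert(sequenceLength(0xF0) == 4);
    assert(sequenceLength(0x80) == 0); // continuation byte
    assert(sequenceLength(0xF8) == 0); // old 5-byte lead
    assert(sequenceLength(0xFC) == 0); // old 6-byte lead
    assert(sequenceLength(0xFF) == 0);
}
```

With a check like this, the lead bytes that used to introduce 5 and 6 byte sequences (0xF8 to 0xFD) are rejected in the same way as 0xFF.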
May 24 2005
> Also, I note that the UTF8stride table has entries for 5 and 6 byte sequences. I was under the impression that these are no longer valid UTF-8 sequences.

I have already changed that in my modified std.utf module (posted some days ago). The toUtfX() functions were also changed to reject any invalid encodings. Regrettably, I have not heard anything about it. I don't know if Walter will include the changed code in Phobos (I don't think so...). As I said in that posting, I would also rework the other functions in std.utf. But I am not sure what to do about toUCSindex()/toUTFindex(): they are very inefficient if used the wrong way...

Ciao
uwe
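The post does not say what "the wrong way" is. One plausible reading is calling toUTFindex once per character inside a loop: each call rescans the string from offset zero, so the loop becomes quadratic in the string length. The sketch below contrasts that with a single pass using std.utf.stride. It assumes a current (D2) Phobos, and the function names `offsetsSlow` and `offsetsFast` are made up for illustration.

```d
import std.utf : stride, toUTFindex;

// Quadratic: every toUTFindex call walks the string again from offset 0.
size_t[] offsetsSlow(string s, size_t count)
{
    size_t[] result;
    foreach (n; 0 .. count)
        result ~= toUTFindex(s, n);
    return result;
}

// Linear: one pass, advancing by the stride of each character.
size_t[] offsetsFast(string s)
{
    size_t[] result;
    for (size_t i = 0; i < s.length; i += stride(s, i))
        result ~= i;
    return result;
}

unittest
{
    // 'a' (1 byte), U+00E9 (2 bytes), U+20AC (3 bytes)
    string s = "a\u00E9\u20AC";
    size_t[] expected = [0, 1, 3];
    assert(offsetsFast(s) == expected);
    assert(offsetsSlow(s, 3) == expected);
}
```

Both functions return the same byte offsets for valid input; only the amount of rescanning differs.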
May 24 2005