digitalmars.D.bugs - toUTFindex

Derek Parnell (43/43) May 24 2005 When a UCS index of zero is supplied to the 'toUTFindex' function, and ...

Uwe Salomon (9/12) May 24 2005 I have already changed that in my changed std.utf module (posted some da...

Derek Parnell <derek psych.ward> writes:

When a UCS index of zero is supplied to the 'toUTFindex' function, and the
supplied string does not have a valid UTF-8 sequence at offset zero, the
function fails to throw an exception. Instead it returns zero, implying
that the supplied string is valid up to that point.

This bug may exist in other similar functions too.

The following code illustrates the issue.

<code>
import std.utf;
import std.stdio;

void main()
{
   char[] B;

   B = "\xFF\xFF\xFF"; // Not a valid UTF-8 string

   writefln("Index 0=%d", std.utf.toUTFindex(B, 0)); // should throw, but returns 0
   writefln("Index 1=%d", std.utf.toUTFindex(B, 1)); // throws, as expected
}
</code>

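Suggested fix:
<code>
size_t toUTFindex(char[] s, size_t n)
{
    size_t i;   // byte offset of the next character to examine
    size_t r;   // byte offset of the last character whose lead byte was checked

    // do/while runs at least once, so the lead byte at offset 0
    // is checked even when n == 0.
    do
    {
        if (i >= s.length)
            throw new UtfError("3invalid UCS index", i);
        size_t j = std.utf.UTF8stride[s[i]];
        if (j == 0xFF)
            throw new UtfError("3invalid UTF-8 sequence", i);
        r = i;
        i += j;
    } while (n--);

    return r;
}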
</code>


Also, I note that the UTF8stride table has entries for 5 and 6 byte
sequences. I was under the impression that these are no longer valid UTF-8
sequences.
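
For reference, RFC 3629 limits UTF-8 to code points up to U+10FFFF, so no
sequence is longer than four bytes. A minimal sketch of a lead-byte check
under that rule follows. The name `strictStride` is hypothetical and not
part of std.utf, and the code assumes a current D compiler rather than the
2005 Phobos sources. It classifies lead bytes only; continuation bytes
would still need their own checks.

```d
// Hypothetical helper, not part of std.utf: returns the length in bytes
// of the UTF-8 sequence introduced by lead byte b, or 0 if b cannot
// start a sequence under RFC 3629.
size_t strictStride(ubyte b)
{
    if (b < 0x80) return 1;                 // 0xxxxxxx: ASCII
    if (b >= 0xC2 && b <= 0xDF) return 2;   // 0xC0 and 0xC1 are always overlong
    if (b >= 0xE0 && b <= 0xEF) return 3;
    if (b >= 0xF0 && b <= 0xF4) return 4;   // 0xF5 and above exceed U+10FFFF
    return 0;   // continuation bytes, 0xC0, 0xC1, and 0xF5..0xFF
}

unittest
{
    assert(strictStride(0x41) == 1);   // 'A'
    assert(strictStride(0xC3) == 2);   // lead byte of a two-byte sequence
    assert(strictStride(0xF8) == 0);   // old five-byte lead byte, rejected
    assert(strictStride(0xFC) == 0);   // old six-byte lead byte, rejected
    assert(strictStride(0xFF) == 0);   // never valid in UTF-8
}
```

Compiled with `dmd -unittest -main` and run, the unittest block exercises
the cases above.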

-- 
Derek
Melbourne, Australia
25/05/2005 11:57:02 AM

May 24 2005

"Uwe Salomon" <post uwesalomon.de> writes:

> Also, I note that the UTF8stride table has entries for 5 and 6 byte
> sequences. I was under the impression that these are no longer valid
> UTF-8 sequences.

I have already changed that in my modified std.utf module (posted some days
ago). The toUtfX() functions were also changed to reject any invalid
encodings. Regrettably, I have not heard anything about it. I don't know
whether Walter will include the changed code in Phobos (I don't think so...).

As I said in that posting, I would also rework the other functions in
std.utf. But I am not sure what to do about toUCSindex()/toUTFindex(): they
are very inefficient if used the wrong way...
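
To illustrate the concern: presumably the "wrong way" is calling
toUTFindex(s, k) once per character index k, since every call rescans the
string from offset zero and the loop becomes quadratic. A minimal sketch of
the single-pass alternative follows. The name `charOffsets` is
hypothetical, and the code assumes a current D compiler, whose std.utf
provides `stride(str, index)`.

```d
import std.utf : stride;

// Hypothetical helper: collects the byte offset of every character start
// in s with one left-to-right pass, instead of one rescan per index.
size_t[] charOffsets(const(char)[] s)
{
    size_t[] offsets;
    // stride() returns the byte length of the sequence starting at i
    // and throws if the byte at i cannot start a UTF-8 sequence.
    for (size_t i = 0; i < s.length; i += stride(s, i))
        offsets ~= i;
    return offsets;
}

unittest
{
    // 'a' is 1 byte, U+00E9 is 2 bytes, U+20AC is 3 bytes.
    size_t[] expected = [0, 1, 3];
    assert(charOffsets("a\u00E9\u20AC") == expected);
}
```

The table costs one allocation per string, which only pays off when many
indices of the same string are needed.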

Ciao
uwe

May 24 2005
