digitalmars.D.bugs - std.string.maketrans and std.string.translate not unicode aware
- Sam McCall (69/69) Jun 30 2004 The std.string.maketrans and translate functions are meant to create and...
- Arcane Jill (13/16) Jun 30 2004 In agreement with Sam here, but I should point out that the bug is actua...
The std.string.maketrans and translate functions are meant to create and apply a character translation table, respectively. However, at the moment they create and apply a byte translation table. This causes translation errors. For example, trying to replace an ASCII character with a non-ASCII character trips an assertion, because the two arrays have different lengths once encoded as UTF-8.

Unfortunately it's not possible to fix this without changing the function signatures, because a lookup table covering every character would be too big. Something like this should work...

Sam

    import std.utf;   // for toUTF8

    /************************************
     * Construct translation table for translate().
     */
    dchar[dchar] maketrans(dchar[] from, dchar[] to)
    in
    {
        assert(from.length == to.length);
    }
    body
    {
        dchar[dchar] t;
        for (int i = 0; i < from.length; i++)
            t[from[i]] = to[i];
        return t;
    }

    /******************************************
     * Translate characters in s[] using table created by maketrans().
     * Delete chars in delchars[].
     */
    dchar[] translate(dchar[] s, dchar[dchar] transtab, dchar[] delchars)
    {
        dchar[] r;
        int i;
        int count;
        bit[dchar] deltab;

        for (i = 0; i < delchars.length; i++)
            deltab[delchars[i]] = true;

        // First pass: count the characters that survive deletion.
        count = 0;
        foreach (dchar d; s)
            if (!(d in deltab))
                count++;

        // Second pass: translate the survivors. Characters without a
        // table entry pass through unchanged.
        r = new dchar[count];
        count = 0;
        foreach (dchar d; s)
            if (!(d in deltab))
                r[count++] = (d in transtab) ? transtab[d] : d;
        return r;
    }

    /******************************************
     * Translate characters in s[] using table created by maketrans().
     * Delete chars in delchars[].
     */
    char[] translate(char[] s, dchar[dchar] transtab, dchar[] delchars)
    {
        dchar[] r;
        int i;
        int count;
        bit[dchar] deltab;

        for (i = 0; i < delchars.length; i++)
            deltab[delchars[i]] = true;

        count = 0;
        foreach (dchar d; s)    // iterates properly over characters
            if (!(d in deltab))
                count++;

        r = new dchar[count];
        count = 0;
        foreach (dchar d; s)
            if (!(d in deltab))
                r[count++] = (d in transtab) ? transtab[d] : d;
        return toUTF8(r);
    }
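The length mismatch between bytes and characters can be seen in a few lines. This is only a minimal sketch, assuming a current D compiler (the syntax differs from the 2004 code in the post); the helper name codePoints is hypothetical and not part of std.string.

```d
// Hypothetical helper: counts characters (code points) by letting
// foreach decode the UTF-8 input into dchar values.
size_t codePoints(const(char)[] s)
{
    size_t n = 0;
    foreach (dchar c; s)
        ++n;
    return n;
}

unittest
{
    string from = "e";        // one UTF-8 byte
    string to   = "\u00E9";   // e with acute accent: two UTF-8 bytes (C3 A9)

    // A byte-based table sees arrays of different lengths...
    assert(from.length == 1);
    assert(to.length == 2);

    // ...although both strings hold exactly one character.
    assert(codePoints(from) == codePoints(to));
}
```

Saved to a file, the check can be run with something like `dmd -unittest -main -run example.d`.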
Jun 30 2004
In article <cbttmg$1u68$1 digitaldaemon.com>, Sam McCall says...

> The std.string.maketrans and translate functions are meant to create and apply a character translation table, respectively. However at the moment they create and apply a byte translation table.

In agreement with Sam here, but I should point out that the bug is actually much more serious than Sam suggests. It's not just a matter of missing features - it's a matter of serious UTF-8 corruption.

The current implementation allows users to modify char values in the range 0x80 to 0xFF. These bytes have specific meaning in UTF-8: they are the lead and continuation bytes of multi-byte sequences. Allowing users to modify such values with a translate() routine is DANGEROUS, and is pretty much guaranteed to result in a string containing invalid UTF-8.

Sam suggests a number of ways of making these functions dchar-based instead of char-based. But if you want to keep them char-based, then you absolutely must disallow the modification of any char >0x7F, and document such functions as ASCII-only.

Arcane Jill
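The corruption described here can be reproduced in a short sketch, assuming a current D compiler with Phobos. All helper names (identityTable, translateBytes, isAsciiOnlyTable, isValidUtf8) are hypothetical and only imitate a 256-entry byte table; they are not the std.string implementation. The guard also rejects ASCII-to-non-ASCII mappings, which goes slightly beyond the rule stated above.

```d
import std.utf : validate, UTFException;

// Hypothetical: a 256-entry table mapping every byte to itself.
char[] identityTable()
{
    auto t = new char[256];
    foreach (i, ref c; t)
        c = cast(char) i;
    return t;
}

// Hypothetical: applies the table byte by byte, as a char-based
// translate() would.
char[] translateBytes(const(char)[] s, const(char)[] table)
{
    assert(table.length == 256);
    auto r = new char[s.length];
    foreach (i, char c; s)
        r[i] = table[c];
    return r;
}

// Hypothetical guard: accepts a table only if bytes 0x80..0xFF are
// left untouched and no ASCII byte is mapped to a non-ASCII byte.
bool isAsciiOnlyTable(const(char)[] table)
{
    if (table.length != 256)
        return false;
    foreach (i, char c; table)
    {
        if (i >= 0x80 && c != i)
            return false;
        if (i < 0x80 && c >= 0x80)
            return false;
    }
    return true;
}

// True if s is well-formed UTF-8 according to std.utf.validate.
bool isValidUtf8(const(char)[] s)
{
    try
    {
        validate(s);
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

unittest
{
    auto table = identityTable();
    assert(isAsciiOnlyTable(table));

    table[0xC3] = 'A';                  // remap a UTF-8 lead byte
    assert(!isAsciiOnlyTable(table));   // the guard rejects this table

    // "\u00E9" is the bytes C3 A9; translating C3 leaves a stray
    // continuation byte behind, so the result is not valid UTF-8.
    auto broken = translateBytes("\u00E9", table);
    assert(!isValidUtf8(broken));
}
```

A char-based translate() that called a check like isAsciiOnlyTable in its in-contract would refuse such tables up front, which is one way to enforce the ASCII-only restriction.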
Jun 30 2004