digitalmars.D - [Fix] std.utf bad conversions from UTF-16
- Stewart Gordon (77/77) Jul 12 2004 Using DMD 0.95, Windows 98SE.
- Arcane Jill (8/13) Jul 12 2004 The values U+FFFE and U+FFFF are not illegal either in UTF-16 or UTF-32....
Using DMD 0.95, Windows 98SE.

I've just been experimenting with std.utf. Two separate bugs cropped up, both in functions that have no unit tests:

1. toUTF32(wchar[]) runs into an infinite loop when it encounters a non-ASCII single-word character. The problem is in decode: a missing else block means that the counter doesn't get incremented.

2. toUTF8(wchar[]) also tends to fail. The problem is that each wchar is cast to a dchar, one by one, instead of decoding the UTF-16 string.

The fixed functions are below.

Stewart.

----------
dchar decode(wchar[] s, inout size_t idx)
    in
    {
        assert(idx >= 0 && idx < s.length);
    }
    out (result)
    {
        assert(isValidDchar(result));
    }
    body
    {
        char[] msg;
        dchar V;
        size_t i = idx;
        uint u = s[i];

        if (u >= 0xD800 && u <= 0xDBFF)
        {
            uint u2;

            if (i + 1 == s.length)
            {
                msg = "surrogate UTF-16 high value past end of string";
                goto Lerr;
            }
            u2 = s[i + 1];
            if (u2 < 0xDC00 || u2 > 0xDFFF)
            {
                msg = "surrogate UTF-16 low value out of range";
                goto Lerr;
            }
            u = ((u - 0xD7C0) << 10) + (u2 - 0xDC00);
            i += 2;
        }
        else if (u >= 0xDC00 && u <= 0xDFFF)
        {
            msg = "unpaired surrogate UTF-16 value";
            goto Lerr;
        }
        else if (u == 0xFFFE || u == 0xFFFF)
        {
            msg = "illegal UTF-16 value";
            goto Lerr;
        }
        // default: single-word character (0x0000 to 0xD7FF, 0xE000 to 0xFFFD)
        // SG fixed bug - previous if (u <= 0x7F) becomes redundant
        else
        {
            i++;
        }

        idx = i;
        return cast(dchar)u;

      Lerr:
        throw new UtfError(msg, i);
    }

char[] toUTF8(wchar[] s)
{
    char[] r;

    for (size_t i = 0; i < s.length; )
    {
        encode(r, decode(s, i));
    }
    return r;
}

-- 
My e-mail is valid but not my primary mailbox, aside from its being the unfortunate victim of intensive mail-bombing at the moment. Please keep replies on the 'group where everyone may benefit.
Jul 12 2004
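For anyone who wants to try the decoding logic from the post above outside Phobos, here is a minimal sketch of the same steps. It is written for a current D compiler rather than DMD 0.95, so it uses ref and const where the 2004 code uses inout. The names decodeOne and toUtf32Sketch are hypothetical, it throws a plain Exception where the posted code throws UtfError, and it leaves out the U+FFFE/U+FFFF test. Treat it as an illustration under those assumptions, not as the std.utf source.

```d
import std.exception : assertThrown;

// Hypothetical helper (not the std.utf API): decodes one code point from
// UTF-16 starting at s[idx] and advances idx past the units it consumed.
dchar decodeOne(const(wchar)[] s, ref size_t idx)
{
    uint u = s[idx];

    if (u >= 0xD800 && u <= 0xDBFF)
    {
        // High surrogate: a low surrogate must follow.
        if (idx + 1 == s.length)
            throw new Exception("surrogate UTF-16 high value past end of string");
        uint u2 = s[idx + 1];
        if (u2 < 0xDC00 || u2 > 0xDFFF)
            throw new Exception("surrogate UTF-16 low value out of range");
        // 0xD7C0 == 0xD800 - 0x40, so subtracting it and shifting by 10
        // also adds the 0x10000 offset of the supplementary planes.
        u = ((u - 0xD7C0) << 10) + (u2 - 0xDC00);
        idx += 2;
    }
    else if (u >= 0xDC00 && u <= 0xDFFF)
    {
        throw new Exception("unpaired surrogate UTF-16 value");
    }
    else
    {
        // Single-word character, ASCII or not: always advance by one unit.
        idx++;
    }
    return cast(dchar) u;
}

// Hypothetical helper: decodes a whole UTF-16 string into UTF-32.
dstring toUtf32Sketch(const(wchar)[] s)
{
    dstring r;
    for (size_t i = 0; i < s.length; )
        r ~= decodeOne(s, i);
    return r;
}

unittest
{
    // ASCII, a non-ASCII single-word character, and a surrogate pair.
    assert(toUtf32Sketch("a\u00E9\U0001D11E"w) == "a\u00E9\U0001D11E"d);

    // A lone low surrogate is rejected.
    wchar[] bad = [cast(wchar) 0xDC00];
    size_t j = 0;
    assertThrown!Exception(decodeOne(bad, j));
}
```

The final else branch is the part the first bug report is about: without it, a single-word character above 0x7F never advances the index, so a loop like the one in toUtf32Sketch would not terminate.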
In article <ccto20$18bq$1 digitaldaemon.com>, Stewart Gordon says...

Cool. Excellent. Like it. Apart from these lines...

>else if (u == 0xFFFE || u == 0xFFFF)
>{
>    msg = "illegal UTF-16 value";
>    goto Lerr;
>}

The values U+FFFE and U+FFFF are not illegal in either UTF-16 or UTF-32. They are noncharacters: code points that are permanently reserved and will never be assigned, that's all. There are 66 such code points in total, of which U+FFFE and U+FFFF are but two examples.

Walter has now changed isValidDchar() to return true for U+FFFE and U+FFFF.

Arcane Jill
Jul 12 2004
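To make the distinction in the reply above concrete, here is a small sketch that separates "is a valid code point for encoding purposes" from "is a noncharacter". Both function names are hypothetical and neither is claimed to be the std.utf implementation. The noncharacter ranges (U+FDD0 to U+FDEF, plus the last two code points of each of the 17 planes) are taken from the Unicode Standard's definition, which gives 66 code points. The code assumes a current D compiler.

```d
// Hypothetical helper: true for any Unicode scalar value, that is, any
// code point up to U+10FFFF that is not a surrogate. U+FFFE and U+FFFF pass.
bool isScalarValue(dchar c)
{
    return c < 0xD800 || (c > 0xDFFF && c <= 0x10FFFF);
}

// Hypothetical helper: true for the noncharacters, U+FDD0..U+FDEF plus
// U+xxFFFE and U+xxFFFF in every plane.
bool isNoncharacter(dchar c)
{
    return (c >= 0xFDD0 && c <= 0xFDEF)
        || (c <= 0x10FFFF && (c & 0xFFFE) == 0xFFFE);
}

// Hypothetical helper: counts noncharacters by brute force over all planes.
size_t countNoncharacters()
{
    size_t n;
    foreach (uint c; 0 .. 0x110000)
        if (isNoncharacter(cast(dchar) c))
            n++;
    return n;
}

unittest
{
    // Noncharacters are still encodable, so a decoder should not reject them.
    assert(isScalarValue(cast(dchar) 0xFFFE));
    assert(isScalarValue(cast(dchar) 0xFFFF));
    assert(isNoncharacter(cast(dchar) 0xFFFE));

    // Surrogates are the code points a UTF-16 decoder must never return.
    assert(!isScalarValue(cast(dchar) 0xD800));

    assert(countNoncharacters() == 66);
}
```

Keeping the two predicates separate lets a decoder accept every encodable code point while leaving it to higher-level code to decide whether noncharacters are acceptable in a given context.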