digitalmars.D.learn - utf.d codeLength asserts false on certain input
- Anonymouse (19/53) Mar 27 2018 My IRC bot is suddenly seeing crashes. It reads characters from a
- Jonathan M Davis (10/79) Mar 27 2018 It means that codeLength requires that dchar be a valid code point, thou...
My IRC bot is suddenly seeing crashes. It reads characters from a Socket into an ubyte[] array, then idups parts of that (full lines) into strings for parsing. Parsing involves slicing such strings into meaningful segments; sender, event type, target channel/user, message content, etc. I can assume all of them to be char[]-compliant except for the content field.

Running it in a debugger I see I'm tripping an assert in utf.d[1] when calling stripRight on a content slice[2].

    /++
        Returns the number of code units that are required to encode the code
        point $(D c) when $(D C) is the character type used to encode it.
      +/
    ubyte codeLength(C)(dchar c) @safe pure nothrow @nogc
    if (isSomeChar!C)
    {
        static if (C.sizeof == 1)
        {
            if (c <= 0x7F) return 1;
            if (c <= 0x7FF) return 2;
            if (c <= 0xFFFF) return 3;
            if (c <= 0x10FFFF) return 4;
            assert(false); // <--
        }
        // ...

This trips it:

    import std.string;

    void main()
    {
        string s = "\355\342\256 \342\245\341⮢\256\245 ᮮ\241饭\250\245".stripRight; // <-- asserts false
    }

The real backtrace:

    /usr/include/dlang/dmd/std/utf.d:2530
    _D3std6string__T10stripRightTAyaZQrFQhZ14__foreachbody2MFNaNbNiNfKmKwZi (this=0x7fffffff99c0, __applyArg1= 0x7fffffff9978: 26663461, __applyArg0= 0x7fffffff9970: 17)
        at /usr/include/dlang/dmd/std/string.d:2918
    _aApplyRcd2 () from /usr/lib/libphobos2.so.0.78
    _D3std6string__T10stripRightTAyaZQrFNaNiNfQnZQq (str=...)
        at /usr/include/dlang/dmd/std/string.d:2915
    _D8kameloso3irc17parseSpecialcasesFNaNfKSQBnQBh9IRCParserKSQCf7ircdefs8IRCEventKAyaZv (slice=..., event=..., parser=...)
        at source/kameloso/irc.d:1184

Should that not be an Exception, as it's based on input? I'm not sure where the character 26663461 came from. Even so, should it assert?

I don't know what to do right now. I'd like to avoid sanitizing all lines. I could catch an Exception but not so much an AssertError.

[1]: https://github.com/dlang/phobos/blob/master/std/utf.d#L2522
[2]: https://github.com/zorael/kameloso/blob/master/source/kameloso/irc.d#L1184
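For reference, the value 26663461 reported as __applyArg1 in the backtrace can be checked against the code point range directly, which shows why none of the four branches in codeLength matched. This is only a minimal sketch: it assumes std.utf.isValidDchar from Phobos, the name `reported` is made up for illustration, and it says nothing about how the decoder arrived at that value.

```d
import std.utf : isValidDchar;

void main()
{
    // Value taken from the backtrace (__applyArg1); the name is hypothetical.
    enum uint reported = 26_663_461;

    // The largest Unicode code point is 0x10FFFF, so every range check in
    // codeLength!char fails for this value and assert(false) is reached.
    static assert(reported > 0x10FFFF);

    // isValidDchar rejects anything above 0x10FFFF (and surrogates).
    assert(!isValidDchar(cast(dchar) reported));

    // An ordinary character passes the same check.
    assert(isValidDchar('a'));
}
```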
Mar 27 2018
On Tuesday, March 27, 2018 23:29:57 Anonymouse via Digitalmars-d-learn wrote:

> My IRC bot is suddenly seeing crashes. It reads characters from a Socket into an ubyte[] array, then idups parts of that (full lines) into strings for parsing. Parsing involves slicing such strings into meaningful segments; sender, event type, target channel/user, message content, etc. I can assume all of them to be char[]-compliant except for the content field. Running it in a debugger I see I'm tripping an assert in utf.d[1] when calling stripRight on a content slice[2].

It means that codeLength requires that dchar be a valid code point, though the documentation doesn't say that. It probably should. It was probably assumed that no one would try to pass it an invalid code point - especially since it's usually called with well-known values rather than data from some place like a socket.

Regardless, the way to work around it would be to call isValidDchar on the dchar before passing it to codeLength so that you can handle the invalid code point rather than calling codeLength on it.

- Jonathan M Davis
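The workaround described above might look roughly like the following. This is a hedged sketch, not code from the thread or from kameloso: the helper names checkedCodeLength and stripRightLenient are hypothetical, and the second helper is an assumption about how the same idea could be applied when the call site is stripRight rather than codeLength itself. It relies on std.utf.validate, which throws a UTFException on malformed UTF-8, so the failure becomes catchable before stripRight ever decodes the string.

```d
import std.string : stripRight;
import std.utf : codeLength, isValidDchar, validate, UTFException;

/// Hypothetical helper: returns 0 for an invalid code point instead of
/// letting codeLength reach its assert(false).
ubyte checkedCodeLength(dchar c)
{
    if (!isValidDchar(c)) return 0;
    return codeLength!char(c);
}

/// Hypothetical helper: validates first, so malformed input surfaces as a
/// catchable UTFException rather than an AssertError inside stripRight.
string stripRightLenient(string line)
{
    try
    {
        validate(line);          // throws UTFException on malformed UTF-8
        return line.stripRight;  // safe to decode now
    }
    catch (UTFException)
    {
        // Assumed fallback: trim only trailing ASCII whitespace, byte-wise,
        // without decoding anything.
        size_t end = line.length;
        while (end > 0 && (line[end - 1] == ' ' || line[end - 1] == '\t'
            || line[end - 1] == '\r' || line[end - 1] == '\n'))
        {
            --end;
        }
        return line[0 .. end];
    }
}

void main()
{
    assert(checkedCodeLength('a') == 1);
    assert(checkedCodeLength(cast(dchar) 26_663_461) == 0);

    // Bytes as they might arrive from a socket: not valid UTF-8.
    ubyte[] buf = [0xED, 0xE2, 0xAE, ' ', ' '];
    string line = cast(string) buf.idup;

    assert(stripRightLenient(line).length == 3);
    assert(stripRightLenient("hello  ") == "hello");
}
```

Only lines that fail validation take the fallback path, so well-formed lines are handled exactly as before. Note that isValidDchar also rejects surrogate code points, so checkedCodeLength is stricter than codeLength alone.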
Mar 27 2018