digitalmars.D.learn - Crash in byCodeUnit() <- byDchar() when converting faulty text to HTML
- Nordlöw (71/71) Jun 15 2014 I'm using the following snippet to convert a UTF-8 string to HTML
- Nordlöw (2/3) Jun 15 2014 See also:
- monarch_dodra (10/16) Jun 16 2014 AFAIK, no. You hit an Error, and those shouldn't occur unless you
- monarch_dodra (3/4) Jun 16 2014 Yeah, there's an issue in the implementation. I brought it up in
- Nordlöw (9/18) Jun 16 2014 Excuse me for the kind of dumb question. I was unsure about the
I'm using the following snippet to convert a UTF-8 string to HTML:

    import std.traits : isSomeChar;

    /** Convert character $(D c) to HTML representation. */
    string toHTML(C)(C c) @safe pure if (isSomeChar!C)
    {
        import std.conv : to;
        if      (c == '&')  return "&amp;";  // ampersand
        else if (c == '<')  return "&lt;";   // less than
        else if (c == '>')  return "&gt;";   // greater than
        else if (c == '\"') return "&quot;"; // double quote
        else if (0 < c && c < 128)
            return to!string(cast(char)c);   // plain ASCII passes through
        else
            // This branch was lost from the post as archived; a numeric
            // character reference is assumed here.
            return "&#" ~ to!string(cast(uint)c) ~ ";";
    }

    static if (__VERSION__ >= 2066L)
    {
        /** Convert string $(D s) to HTML representation. */
        auto encodeHTML(string s) @safe pure
        {
            import std.utf : byDchar;
            import std.algorithm : joiner, map;
            return s.byDchar.map!toHTML.joiner("");
        }
    }

Note that it uses Walter's new std.utf.byDchar.

But it triggers

    core.exception.RangeError std/utf.d(2703): Range violation
    ----------------
    Stack trace:
    /home/per/opt/x86_64-unknown-linux-gnu/dmd/bin/../import/std/utf.d line (2703)
    /home/per/opt/x86_64-unknown-linux-gnu/dmd/bin/../import/std/utf.d line (3232)
    /home/per/opt/x86_64-unknown-linux-gnu/dmd/bin/../import/std/algorithm.d line (510)
    /home/per/opt/x86_64-unknown-linux-gnu/dmd/bin/../import/std/algorithm.d line (3440)
    /home/per/opt/x86_64-unknown-linux-gnu/dmd/bin/../import/std/algorithm.d line (3540)
    /home/per/opt/x86_64-unknown-linux-gnu/dmd/bin/../import/std/range.d line (1861)
    /home/per/opt/x86_64-unknown-linux-gnu/dmd/bin/../import/std/format.d line (2172)
    /home/per/opt/x86_64-unknown-linux-gnu/dmd/bin/../import/std/format.d line (2843)
    /home/per/opt/x86_64-unknown-linux-gnu/dmd/bin/../import/std/format.d line (3167)
    /home/per/opt/x86_64-unknown-linux-gnu/dmd/bin/../import/std/format.d line (526)
    /home/per/opt/x86_64-unknown-linux-gnu/dmd/bin/../import/std/stdio.d line (1168)

for non-UTF-8 input.

Is this intentional? utf.d on line 2703 is inside byCodeUnit().

When I use byChar() it doesn't crash, but then I get incorrect conversions. Could somebody explain the difference between byChar, byWchar and byDchar?
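One way to keep malformed input away from the conversion is to check it first. The following is only a sketch, assuming it is acceptable to reject bad input up front: it relies on std.utf.validate, which throws a UTFException on malformed UTF-8. The helper name isValidUTF8 and the sample bytes are made up for illustration and are not part of the original post.

```d
import std.utf : validate, UTFException;

/// Hypothetical helper: returns true if `s` is well-formed UTF-8.
bool isValidUTF8(string s)
{
    try
    {
        validate(s); // throws UTFException on malformed input
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

void main()
{
    // 0xFF can never appear in well-formed UTF-8.
    immutable(ubyte)[] raw = [0x61, 0xFF, 0x62];
    string bad = cast(string) raw;

    assert(isValidUTF8("caf\u00E9 <b>"));
    assert(!isValidUTF8(bad));
}
```

A caller could run this check before encodeHTML and handle the failure case itself, instead of depending on how the range implementation reacts to bad bytes.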
Jun 15 2014
> But it triggers

See also: https://github.com/nordlow/justd/blob/master/test/t_err.d
Jun 15 2014
On Sunday, 15 June 2014 at 23:09:24 UTC, Nordlöw wrote:
> Is this intentional? utf.d on line 2703 is inside byCodeUnit().

AFAIK, no. You hit an Error, and those shouldn't occur unless you go out of your way for them. I'll look into it.

> When I use byChar() it doesn't crash, but then I get incorrect conversions. Could somebody explain the difference between byChar, byWchar and byDchar?

What's there to say? They all take a range of characters and return it as a range of the corresponding requested type.

In the case of "byDchar", it decodes the string (returning a "BadChar" for invalid encodings).

The others first decode using "byDchar", and then re-encode the individual dchars into the corresponding requested char type.
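The difference between the three ranges can be seen by counting elements. This is a small sketch, assuming a current Phobos rather than the 2014 pre-release discussed in this thread: byChar yields UTF-8 code units, byWchar yields UTF-16 code units, and byDchar yields decoded code points. In current Phobos the substitute for an invalid sequence is std.utf.replacementDchar (U+FFFD); the sample strings are invented for illustration.

```d
import std.algorithm : canFind, equal;
import std.range : walkLength;
import std.utf : byChar, byWchar, byDchar, replacementDchar;

void main()
{
    string s = "h\u00F6!"; // U+00F6 needs two UTF-8 code units

    assert(s.length == 4);               // raw UTF-8 code units
    assert(s.byChar.walkLength == 4);    // range of char
    assert(s.byWchar.walkLength == 3);   // range of wchar (UTF-16 code units)
    assert(s.byDchar.walkLength == 3);   // range of dchar (code points)
    assert(s.byDchar.equal("h\u00F6!"d));

    // Malformed input: 0xFF is never valid in UTF-8.
    immutable(ubyte)[] raw = ['a', 0xFF, 'b'];
    string bad = cast(string) raw;

    // The decoding range substitutes a replacement character
    // for the bad byte instead of throwing.
    assert(bad.byDchar.canFind(replacementDchar));
}
```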
Jun 16 2014
On Monday, 16 June 2014 at 10:02:16 UTC, monarch_dodra wrote:
> I'll look into it.

Yeah, there's an issue in the implementation. I brought it up in the pull page. If it doesn't get attention there, I'll file it.
Jun 16 2014
> AFAIK, no. You hit an Error, and those shouldn't occur unless you go out of your way for them. I'll look into it.

Superb!

> What's there to say? They all take a range of characters, and return it as a range of the corresponding requested type.

Excuse me for the kind of dumb question. I was unsure about the details.

Is there a bleeding-edge (in sync with git master) variant of the dlang.org docs I can read instead of the source? If not, I build dmd, druntime and Phobos daily for testing purposes, so I might as well build the docs too and get it from there.

> In the case of "byDchar", it decodes the string (while returning a "BadChar") for invalid encodings.

This is what I want/need :)

> The others first decode using "byDchar", and then re-encode the individual dchars into the corresponding requested char-type.

OK. Got it! Thanks a lot.
Jun 16 2014