digitalmars.D.learn - Crash in byCodeUnit() <- byDchar() when converting faulty text to HTML

reply "Nordlöw" <per.nordlow gmail.com> writes:
I'm using the following snippet to convert a UTF-8 string to HTML:

import std.traits: isSomeChar;

/** Convert character $(D c) to HTML representation. */
string toHTML(C)(C c) @safe pure if (isSomeChar!C)
{
     import std.conv: to;
     if      (c == '&')  return "&amp;"; // ampersand
     else if (c == '<')  return "&lt;"; // less than
     else if (c == '>')  return "&gt;"; // greater than
     else if (c == '\"') return "&quot;"; // double quote
     else if (0 < c && c < 128)
         return to!string(cast(char)c);
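     else
         // Assumption: the original statement is missing here; emit a numeric character reference.
         return "&#" ~ to!string(cast(uint)c) ~ ";";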
}

static if (__VERSION__ >= 2066L)
{
     /** Convert string $(D s) to HTML representation. */
     auto encodeHTML(string s) @safe pure
     {
         import std.utf: byDchar;
         import std.algorithm: joiner, map;
         return s.byDchar.map!toHTML.joiner("");
     }
}

Note that it uses Walter's new std.utf.byDchar.

But it triggers

core.exception.RangeError@std/utf.d(2703): Range violation
----------------
Stack trace:
/home/per/opt/x86_64-unknown-linux-gnu/dmd/bin/../import/std/utf.d line (2703)
/home/per/opt/x86_64-unknown-linux-gnu/dmd/bin/../import/std/utf.d line (3232)
/home/per/opt/x86_64-unknown-linux-gnu/dmd/bin/../import/std/algorithm.d line (510)
/home/per/opt/x86_64-unknown-linux-gnu/dmd/bin/../import/std/algorithm.d line (3440)
/home/per/opt/x86_64-unknown-linux-gnu/dmd/bin/../import/std/algorithm.d line (3540)
/home/per/opt/x86_64-unknown-linux-gnu/dmd/bin/../import/std/range.d line (1861)
/home/per/opt/x86_64-unknown-linux-gnu/dmd/bin/../import/std/format.d line (2172)
/home/per/opt/x86_64-unknown-linux-gnu/dmd/bin/../import/std/format.d line (2843)
/home/per/opt/x86_64-unknown-linux-gnu/dmd/bin/../import/std/format.d line (3167)
/home/per/opt/x86_64-unknown-linux-gnu/dmd/bin/../import/std/format.d line (526)
/home/per/opt/x86_64-unknown-linux-gnu/dmd/bin/../import/std/stdio.d line (1168)

for non-UTF-8 input.

Is this intentional?

utf.d on line 2703 is inside byCodeUnit().

When I use byChar() it doesn't crash, but then I get incorrect 
conversions.

Could somebody explain the difference between byChar, byWchar and 
byDchar?
Jun 15 2014
next sibling parent "Nordlöw" <per.nordlow gmail.com> writes:
> But it triggers
See also: https://github.com/nordlow/justd/blob/master/test/t_err.d
Jun 15 2014
prev sibling parent reply "monarch_dodra" <monarchdodra gmail.com> writes:
On Sunday, 15 June 2014 at 23:09:24 UTC, Nordlöw wrote:
> Is this intentional?
>
> utf.d on line 2703 is inside byCodeUnit().
AFAIK, no. You hit an Error, and those shouldn't occur unless you go out of your way for them. I'll look into it.
> When I use byChar() it doesn't crash, but then I get incorrect 
> conversions.
>
> Could somebody explain the difference between byChar, byWchar 
> and byDchar?
What's there to say? They all take a range of characters and return it as a range of the requested character type. "byDchar" decodes the string (returning a "BadChar" for invalid encodings). The others first decode using "byDchar", and then re-encode the individual dchars into the requested char type.
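
To make the difference concrete, here is a minimal sketch, assuming a recent Phobos in which std.utf exposes byChar, byWchar, byDchar and replacementDchar. The helper name hasReplacement is hypothetical, and treating replacementDchar (U+FFFD) as the "BadChar" mentioned above is an assumption, not something confirmed in this thread.

```d
import std.algorithm.searching : canFind;
import std.range : walkLength;
import std.utf : byChar, byDchar, byWchar, replacementDchar;

/// Hypothetical helper: true if decoding `s` with byDchar yields at least one U+FFFD.
bool hasReplacement(string s)
{
    return s.byDchar.canFind(replacementDchar);
}

void main()
{
    // 'a' takes 1 UTF-8 code unit, 'ö' (U+00F6) takes 2, U+1F600 takes 4.
    // In UTF-16, U+1F600 needs a surrogate pair, i.e. 2 code units.
    string s = "aö\U0001F600";
    assert(s.byChar.walkLength == 7);  // UTF-8 code units
    assert(s.byWchar.walkLength == 4); // UTF-16 code units
    assert(s.byDchar.walkLength == 3); // code points

    // The byte 0xFF never occurs in well-formed UTF-8.
    immutable(ubyte)[] raw = [0x61, 0xFF, 0x62];
    assert(hasReplacement(cast(string) raw)); // assumed: invalid input is replaced, not thrown
    assert(!hasReplacement("ab"));
}
```

So for HTML conversion the element type matters: mapping over byChar hands each UTF-8 code unit to the mapped function separately, which would explain the incorrect conversions for non-ASCII text, while byDchar hands over whole code points.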
Jun 16 2014
next sibling parent "monarch_dodra" <monarchdodra gmail.com> writes:
On Monday, 16 June 2014 at 10:02:16 UTC, monarch_dodra wrote:
> I'll look into it.
Yeah, there's an issue in the implementation. I brought it up in the pull page. If it doesn't get attention there, I'll file it.
Jun 16 2014
prev sibling parent "Nordlöw" <per.nordlow gmail.com> writes:
> AFAIK, no. You hit an Error, and those shouldn't occur unless 
> you go out of your way for them.
>
> I'll look into it.
Superb!
> What's there to say? They all take a range of characters, and 
> return it as a range of the corresponding requested type.
Excuse me for the kind of dumb question. I was unsure about the details. Is there a bleeding-edge (in sync with git master) variant of the dlang.org docs I can read instead of the source? If not, I build dmd, druntime and phobos daily for testing purposes, so I might as well build the docs too and get it from there.
> In the case of "byDchar", it decodes the string (while 
> returning a "BadChar" for invalid encodings).
This is what I want/need :)
> The others first decode using "byDchar", and then re-encode the 
> individual dchars into the corresponding requested char-type.
Ok. Got it! Thx a lot.
Jun 16 2014