digitalmars.D.learn - Crash in byCodeUnit() <- byDchar() when converting faulty text to HTML

Nordlöw (71/71) Jun 15 2014 I'm using the following snippet to convert a UTF-8 string to HTML

Nordlöw (2/3) Jun 15 2014 See also:
monarch_dodra (10/16) Jun 16 2014 AFAIK, no. You hit an Error, and those shouldn't occur unless you

monarch_dodra (3/4) Jun 16 2014 Yeah, there's an issue in the implementation. I brought it up in
Nordlöw (9/18) Jun 16 2014 Excuse me for the kind of dumb question. I was unsure about the

"Nordlöw" <per.nordlow gmail.com> writes:

I'm using the following snippet to convert a UTF-8 string to HTML:

import std.traits: isSomeChar;

/** Convert character $(D c) to HTML representation. */
string toHTML(C)(C c) @safe pure if (isSomeChar!C)
{
     import std.conv: to;
     if      (c == '&')  return "&amp;"; // ampersand
     else if (c == '<')  return "&lt;"; // less than
     else if (c == '>')  return "&gt;"; // greater than
     else if (c == '\"') return "&quot;"; // double quote
     else if (0 < c && c < 128)
         return to!string(cast(char)c);
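     else
         // NOTE: the original line was lost in the archive; a numeric
         // character reference is assumed here.
         return "&#" ~ to!string(cast(uint)c) ~ ";";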
}

static if (__VERSION__ >= 2066L)
{
     /** Convert string $(D s) to HTML representation. */
     auto encodeHTML(string s) @safe pure
     {
         import std.utf: byDchar;
         import std.algorithm: joiner, map;
         return s.byDchar.map!toHTML.joiner("");
     }
}

Note that it uses Walter's new std.utf.byDchar.

But it triggers

core.exception.RangeError@std/utf.d(2703): Range violation
----------------
Stack trace:
/home/per/opt/x86_64-unknown-linux-gnu/dmd/bin/../import/std/utf.d line (2703)
/home/per/opt/x86_64-unknown-linux-gnu/dmd/bin/../import/std/utf.d line (3232)
/home/per/opt/x86_64-unknown-linux-gnu/dmd/bin/../import/std/algorithm.d line (510)
/home/per/opt/x86_64-unknown-linux-gnu/dmd/bin/../import/std/algorithm.d line (3440)
/home/per/opt/x86_64-unknown-linux-gnu/dmd/bin/../import/std/algorithm.d line (3540)
/home/per/opt/x86_64-unknown-linux-gnu/dmd/bin/../import/std/range.d line (1861)
/home/per/opt/x86_64-unknown-linux-gnu/dmd/bin/../import/std/format.d line (2172)
/home/per/opt/x86_64-unknown-linux-gnu/dmd/bin/../import/std/format.d line (2843)
/home/per/opt/x86_64-unknown-linux-gnu/dmd/bin/../import/std/format.d line (3167)
/home/per/opt/x86_64-unknown-linux-gnu/dmd/bin/../import/std/format.d line (526)
/home/per/opt/x86_64-unknown-linux-gnu/dmd/bin/../import/std/stdio.d line (1168)

for non-utf-8 input.

Is this intentional?

utf.d on line 2703 is inside byCodeUnit().

When I use byChar() it doesn't crash, but then I get incorrect 
conversions.

Could somebody explain the difference between byChar, byWchar and 
byDchar?

Jun 15 2014

"Nordlöw" <per.nordlow gmail.com> writes:

 But it triggers

See also: 
https://github.com/nordlow/justd/blob/master/test/t_err.d

Jun 15 2014

"monarch_dodra" <monarchdodra gmail.com> writes:

On Sunday, 15 June 2014 at 23:09:24 UTC, Nordlöw wrote:
 Is this intentional?

 utf.d on line 2703 is inside byCodeUnit().

AFAIK, no. You hit an Error, and those shouldn't occur unless you 
go out of your way for them.

I'll look into it.

 When I use byChar() it doesn't crash, but then I get incorrect 
 conversions.

 Could somebody explain the difference between byChar, byWchar 
 and byDchar?

What's there to say? They all take a range of characters, and 
return it as a range of the corresponding requested type.

In the case of "byDchar", it decodes the string (while returning 
a "BadChar" for invalid encodings).

The others first decode using "byDchar", and then re-encode the 
individual dchars into the corresponding requested char-type.
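To make the difference concrete, here is a minimal sketch against a current Phobos release (not the 2.066 pre-release that crashed above). The helpers unitCounts and decodeLossy are hypothetical names used only for illustration. It assumes that byDchar substitutes std.utf.replacementDchar (U+FFFD) for each invalid sequence, which is what the "BadChar" above refers to.

```d
import std.array : array;
import std.range : walkLength;
import std.utf : byChar, byWchar, byDchar, replacementDchar;

/// Hypothetical helper: how many elements each adapter yields for `s`,
/// in the order [byChar, byWchar, byDchar].
size_t[3] unitCounts(string s)
{
    size_t[3] counts;
    counts[0] = s.byChar.walkLength;  // UTF-8 code units
    counts[1] = s.byWchar.walkLength; // UTF-16 code units
    counts[2] = s.byDchar.walkLength; // code points (UTF-32)
    return counts;
}

/// Hypothetical helper: decode possibly invalid UTF-8 into code points.
/// Assumes invalid input is replaced with U+FFFD instead of throwing.
dstring decodeLossy(const(char)[] s)
{
    return s.byDchar.array.idup;
}

void main()
{
    // 'a' is one code unit in every encoding; U+1F600 needs four UTF-8
    // code units, two UTF-16 code units (a surrogate pair), one dchar.
    auto counts = unitCounts("a\U0001F600");
    assert(counts[0] == 5);
    assert(counts[1] == 3);
    assert(counts[2] == 2);

    // 0xFF can never appear in well-formed UTF-8.
    immutable ubyte[] raw = [0x61, 0xFF, 0x62];
    auto bad = cast(string) raw;
    auto decoded = decodeLossy(bad);
    assert(decoded.length == 3);
    assert(decoded[1] == replacementDchar);
}
```

For the HTML conversion this means toHTML would see U+FFFD for every broken byte sequence, so faulty input ends up as a visible replacement character rather than a crash.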

Jun 16 2014

"monarch_dodra" <monarchdodra gmail.com> writes:

On Monday, 16 June 2014 at 10:02:16 UTC, monarch_dodra wrote:
 I'll look into it.

Yeah, there's an issue in the implementation. I brought it up in 
the pull page. If it doesn't get attention there, I'll file it.

Jun 16 2014

"Nordlöw" <per.nordlow gmail.com> writes:

 AFAIK, no. You hit an Error, and those shouldn't occur unless 
 you go out of your way for them.

 I'll look into it.

Superb!

 What's there to say? They all take a range of characters, and 
 return it as a range of the corresponding requested type.

Excuse me for the kind of dumb question. I was unsure about the 
details. Is there a bleeding-edge (in sync with git master) 
variant of the dlang.org docs I can read instead of the source? If 
not, I build dmd, druntime and phobos daily for testing purposes, 
so I might as well build the docs too and get it from there.

 In the case of "byDchar", it decodes the string (while 
 returning a "BadChar" for invalid encodings).

This is what I want/need :)

 The others first decode using "byDchar", and then re-encode the 
 individual dchars into the corresponding requested char-type.

Ok. Got it!

Thx a lot.

Jun 16 2014
