digitalmars.D.learn - Decoding HTML escape sequences

Hugo Florentino via Digitalmars-d-learn writes:

Hi, I have some documents where some strings appear as HTML escape 
sequences in one of these forms:

\x3C\x53\x43\x52\x49\x50\x54\x20\x4C\x41\x4E\x47\x55\x41\x47\x45\x3D\x22\x4A\x61\x76\x61\x53\x63\x72\x69\x70\x74\x22\x3e

%3C%53%43%52%49%50%54%20%4C%41%4E%47%55%41%47%45%3D%22%4A%61%76%61%53%63%72%69%70%74%22%3e

And I would like to recode them to readable form:

<SCRIPT LANGUAGE="JavaScript">

I tried something like this, using regular expressions and the uri 
module:


import std.stdio, std.file, std.encoding, std.string, std.regex, 
std.uri;

static auto re = regex(`(%[a-fA-F0-9]{2})`);

int main(in string[] args)
{
   if (args.length < 2)
   {
     writeln("Usage: unescape file1.htm > file2.htm");
     return -1;
   }
   auto input = cast(Latin1String) read(args[1]);
   string buffer;
   transcode(input, buffer);

   string output;
   foreach(m; matchAll(buffer, re)) output ~= decode(m.hit);

   writeln(output);

   return 0;
}


Unfortunately it doesn't seem to work 100%.

I would appreciate any suggestion.

Regards, Hugo

May 12 2014

"Adam D. Ruppe" <destructionator gmail.com> writes:

You should use decodeComponent instead of decode in your matchAll 
loop.

IMO encodeComponent and decodeComponent are the only two useful 
uri encode functions (btw same in JS, use decodeURIComponent 
instead of the other functions). The other ones have weird rules.

May 12 2014
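
For readers who land on this thread later, here is a minimal sketch of how the suggestion above might be applied. It assumes two things that are not spelled out in the thread: that the unescaped text between the sequences should be kept (the original loop appends only the matches), and that both the %NN and \xNN forms should be handled in one pass. The helper name unescapeHex is made up for illustration. As far as I know, std.uri.decode leaves reserved characters such as = encoded, which would explain why %3D survives in the original output, while decodeComponent decodes them.

```d
import std.regex : regex, replaceAll;
import std.uri : decodeComponent;

// Hypothetical helper, not from the thread: decodes both %NN and \xNN
// escapes in place and leaves all other text untouched.
string unescapeHex(string text)
{
    // Group 1 captures the two hex digits after either prefix.
    auto re = regex(`(?:%|\\x)([0-9a-fA-F]{2})`);

    // Re-prefix the digits with '%' so decodeComponent accepts both forms.
    // Assumption: every escaped byte is ASCII (see the note below).
    return replaceAll!(m => decodeComponent("%" ~ m[1]))(text, re);
}

unittest
{
    // Surrounding text is preserved, escapes are decoded.
    assert(unescapeHex(`a %3C%53%3e b`) == `a <S> b`);
    // The backslash-x form goes through the same path.
    assert(unescapeHex(`\x3C\x53\x3e`) == `<S>`);
    // Reserved characters such as '=' are decoded as well.
    assert(unescapeHex(`%3D%22`) == `="`);
}
```

Compiling with dmd -unittest -main runs the checks. One caveat: because each escape is decoded on its own, a byte of 0x80 or above (for example %E9) is not a complete UTF-8 sequence and decodeComponent should throw a URIException for it. If the documents contain such bytes, the hex digits would have to be converted some other way.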
