digitalmars.D.learn - Decoding HTML escape sequences
- Hugo Florentino via Digitalmars-d-learn (29/29) May 12 2014 Hi, I have some documents where some strings appears in HTML escape
- Adam D. Ruppe (5/5) May 12 2014 You should use decodeComponent instead of decode in your matchAll
Hi,

I have some documents where some strings appear as HTML escape sequences in one of these forms:

\x3C\x53\x43\x52\x49\x50\x54\x20\x4C\x41\x4E\x47\x55\x41\x47\x45\x3D\x22\x4A\x61\x76\x61\x53\x63\x72\x69\x70\x74\x22\x3e
%3C%53%43%52%49%50%54%20%4C%41%4E%47%55%41%47%45%3D%22%4A%61%76%61%53%63%72%69%70%74%22%3e

And I would like to recode them to readable form:

<SCRIPT LANGUAGE="JavaScript">

I tried something like this, using regular expressions and the uri module:

    import std.stdio, std.file, std.encoding, std.string, std.regex, std.uri;

    static auto re = regex(`(%[a-fA-F0-9]{2})`);

    int main(in string[] args)
    {
        if (args.length < 2)
        {
            writeln("Usage: unescape file1.htm > file2.htm");
            return -1;
        }
        auto input = cast(Latin1String) read(args[1]);
        string buffer;
        transcode(input, buffer);
        string output;
        foreach(m; matchAll(buffer, re))
            output ~= decode(m.hit);
        writeln(output);
        return 0;
    }

Unfortunately it doesn't seem to work 100%. I would appreciate any suggestion.

Regards, Hugo
May 12 2014
You should use decodeComponent instead of decode in your matchAll loop. IMO encodeComponent and decodeComponent are the only two useful URI encoding functions (btw the same goes for JS: use decodeURIComponent instead of the other functions). The other ones have weird rules.
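To make the suggestion concrete, here is a minimal sketch of how it could be applied. The helper name unescapeHex is hypothetical and not from the thread. It assumes the "weird rules" in question are that std.uri.decode leaves escapes for reserved URI characters (such as %3D for "=") untouched, while decodeComponent decodes them. The sketch also rewrites the \xHH form to %HH first, and replaces matches in place instead of concatenating only the matched parts, so the surrounding text is preserved.

```d
import std.regex : regex, replaceAll;
import std.uri : decodeComponent;

/// Hypothetical helper (not from the thread): decodes \xHH and %HH
/// escapes in `text` and leaves everything else unchanged.
string unescapeHex(string text)
{
    // Step 1: rewrite \xHH as %HH so one decoder handles both forms.
    auto percentForm = replaceAll(text, regex(`\\x([0-9a-fA-F]{2})`), "%$1");

    // Step 2: decode each run of %HH escapes in place. Matching whole runs
    // keeps multi-byte UTF-8 sequences together; a stray '%' that is not
    // followed by two hex digits is never passed to decodeComponent.
    auto escapedRun = regex(`(?:%[0-9a-fA-F]{2})+`);
    return replaceAll!(m => decodeComponent(m.hit))(percentForm, escapedRun);
}
```

Called on either of the two sample strings from the question, this should give back the readable tag. One caveat: decodeComponent throws a URIException if a run of escapes is not valid UTF-8 (for example a lone %E9 from a Latin-1 page), so real input may need a try/catch around the call.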
May 12 2014