digitalmars.D - htmlget.d example and unicode parsing
- Tyro[a.c.edwards] (23/23) Apr 30 2011 Hello all,
- Nick Sabalausky (27/48) May 01 2011 Depends on what exactly you're doing. There are many cases where indexin...
Hello all,

I am trying to learn how to parse, modify, and redisplay a Japanese webpage passed to me in a form, and am wondering if anyone has an example of how to do this. I looked at htmlget and found that it has a couple of problems: namely, it does not conform to current D2 practices. I am not sure that my hack can be considered a fix, but I have attached it nonetheless. It now works correctly on ASCII-based URLs but not on UTF-8 ones.

My lack of knowledge of how to properly parse Unicode documents has left me stumped. I am therefore requesting some assistance in updating the code so that it works with any URL. I have taken a look at std.utf, and there are a few things there that could possibly help me, but without examples I'm somewhat at a loss.

I'm assuming that the problem exists here:

    for (iw = 0; iw != line.length; iw++)
    {
        if (!icmp("</html>", line[iw .. line.length]))
            break print_lines;
    }

From what I understand, one cannot index a UTF sequence the same way one indexes ASCII. What is the proper way to rewrite this so that it parses the UTF characters correctly? An example would do wonders.

Thanks
Apr 30 2011
"Tyro[a.c.edwards]" <nospam home.com> wrote in message news:ipinj3$1c77$1 digitalmars.com...Hello all, I am trying to learn how to parse, modify, and redisplay a Japanese webpage passed to me in a form and am wondering if anyone has an example of how to do this. I looked at htmlget and found that it has a couple problems: namely, it is not conform to current D2 practices. I am not sure that my hack can be considered a fix but have attached it nonetheless. It now works correctly on ascii based urls but not utf-8. My lack of knowledge on how to properly parsing unicode documents has left me stumped. I am therefore requesting some assistance in updating the code such that it works with any url. I have taken a look at std.utf and there are a few things there that could possibly assist me however without examples I'm somewhat at a loss. I'm assuming that the problem exists here: for (iw = 0; iw != line.length; iw++) { if (!icmp("</html>", line[iw .. line.length])) break print_lines; } From what I understanding, one cannot index a utf sequence the same as you index ASCII.Depends on what exactly you're doing. There are many cases where indexing utf like ASCII works fine, and your code above looks like one of the cases where it should work (Unless icmp throws or asserts on invalid code-unit sequences. Anyone know offhand if it does?). But you do have a non-utf-related bug in that loop. If there's anything in 'line' after the "</html>" tag, then it won't detect the tag because you're slicing with the length of 'line' instead of the length of "</html>". So it should be: for (iw = 0; iw != line.length; iw++) { immutable endTag = "</html>"; if (line.length >= endTag.length && !icmp(endTag, line[iw .. 
endTag.length])) break print_lines; } On the topic of unicode, this is a really good introduction to the details of it: http://www.joelonsoftware.com/articles/Unicode.html But once you read that, keep in mind there's a few important details he failed to mention: A code-point is made up of code-units, yes, but a single code-point is *not* always an entire character (aka "grapheme"). Because of combining codes, a character could be made up of multiple code points (just like how a code point can be made up of multiple code units). Also, there are certain characters that can be represented with more than one specific sequence of code points (and that gets into unicode normalization).
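
To make the two points above concrete, here is a minimal sketch, not taken from htmlget.d. The helper name containsAsciiTag and the sample strings are made up for illustration. It assumes the tag being searched for is pure ASCII, which is what makes a plain code-unit scan safe in UTF-8 text; it compares bytes directly instead of calling icmp, so the open question about icmp and invalid slices does not arise. The second half counts one character at the three levels mentioned: code units, code points, and graphemes.

```d
import std.ascii : toLower;
import std.range : walkLength;
import std.uni : byGrapheme, normalize, NFC;
import std.utf : count;

/// Hypothetical helper (not from htmlget.d): true if `line` contains the
/// ASCII-only `tag`, ignoring ASCII case. It compares raw UTF-8 code units
/// and never decodes, so multi-byte characters around the tag are harmless.
bool containsAsciiTag(const(char)[] line, const(char)[] tag)
{
    if (line.length < tag.length)
        return false;

    outer: foreach (start; 0 .. line.length - tag.length + 1)
    {
        foreach (k, c; tag)
        {
            // std.ascii.toLower leaves code units >= 0x80 unchanged,
            // so they can never match an ASCII tag character.
            if (toLower(line[start + k]) != toLower(c))
                continue outer;
        }
        return true;
    }
    return false;
}

void main()
{
    // A Japanese line with the end tag in upper case and trailing text.
    string line = "日本語のページ</HTML> trailing text";
    assert(containsAsciiTag(line, "</html>"));
    assert(!containsAsciiTag("日本語", "</html>"));

    // 'e' followed by U+0301 COMBINING ACUTE ACCENT: one visible character.
    string s = "e\u0301";
    assert(s.length == 3);                // UTF-8 code units
    assert(count(s) == 2);                // code points
    assert(s.byGrapheme.walkLength == 1); // graphemes

    // The same character as a single precomposed code point (U+00E9).
    assert(normalize!NFC(s) == "\u00E9");
}
```

The code-unit scan works because in UTF-8 every byte of a multi-byte sequence is 0x80 or above, so an ASCII tag can never match in the middle of a Japanese character. If the search string itself could contain non-ASCII text, comparing decoded and normalized text would be the safer route.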
May 01 2011