digitalmars.D.learn - html2txt library, anyone?
- jicman (4/4) Jan 19 2006 yes, I know I can use cygwin tools or lynx, w3m, etc., to take an html f...
- James Dunne (4/12) Jan 21 2006 I've just written such a thing for C#... code is mostly
- jicman (8/20) Jan 21 2006 He he he he... c doesn't scare me. ;-) Neither does c#. :-) yes, pleas...
- James Dunne (5/35) Jan 27 2006 So, keep me in suspense...
- Charles (4/8) Jan 23 2006 Here's a PCRE regex that will do it
- Charles (5/18) Jan 23 2006 Oops,
- James Dunne (10/39) Jan 27 2006 What about reflowing whitespace runs? BR tags to newlines, P tags,
- Charles (10/50) Jan 29 2006 Yea I agree . I've been using AJAX lately but its hard for me to get ov...
Yes, I know I can use Cygwin tools or lynx, w3m, etc., to take an HTML file and change it to text, but has anyone written a D library to do this?

Thanks,
josé
Jan 19 2006
jicman wrote:
yes, I know I can use cygwin tools or lynx, w3m, etc., to take an html file and change it to text, but has anyone written a d library to do this? Thanks, josé

I've just written such a thing for C#... code is mostly platform-agnostic so it should move to D easily enough. It looks like C and scares everyone, which is why I like it :) Interested?
Jan 21 2006
James Dunne says...
I've just written such a thing for C#... code is mostly platform-agnostic so it should move to D easily enough. It looks like C and scares everyone, which is why I like it :) Interested?

He he he he... c doesn't scare me. ;-) Neither does c#. :-) yes, pleas... would love to have it. Would you be so kind as to email it to cabrera at wrc.xerox.com?

Thanks,
josé
Jan 21 2006
jicman wrote:
would love to have it. Would you be so kind as to email it to, cabrera at wrc.xerox.com thanks. josé

So, keep me in suspense...

--
Regards,
James Dunne
Jan 27 2006
Here's a PCRE regex that will do it

"jicman" <jicman_member pathlink.com> wrote in message news:dqpvdf$1h3j$1 digitaldaemon.com...
yes, I know I can use cygwin tools or lynx, w3m, etc., to take an html file and change it to text, but has anyone written a d library to do this? Thanks, josé
Jan 23 2006
Oops,

char [] htmlContents = `<([^>])+>|&([^;])+;`; // to extract all <tag> plain text </tag> from html

"Charles" <noone nowhere.com> wrote in message news:dr2tmr$2g9a$1 digitaldaemon.com...
Here's a PCRE regex that will do it
Jan 23 2006
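For anyone who wants to try the pattern from the post above, here is a minimal sketch. It assumes a current D compiler with Phobos `std.regex`, which is not the regex module that shipped in 2006. The function name `stripMarkup` is made up for illustration, and the capture groups are dropped because nothing reads them. Replacing every match with an empty string is what leaves the plain text behind.

```d
import std.regex : regex, replaceAll;
import std.stdio : writeln;

/// Hypothetical helper: deletes anything that looks like a tag (<...>)
/// or a character entity (&...;) and returns what is left.
string stripMarkup(string html)
{
    // Same alternation as the posted pattern, minus the unused groups.
    auto markup = regex(`<[^>]+>|&[^;]+;`);
    return replaceAll(html, markup, "");
}

void main()
{
    writeln(stripMarkup("<p>Hello <b>world</b></p>")); // prints "Hello world"
}
```

Note that entities are deleted rather than decoded, so `a &amp; b` comes back as `a  b` with the ampersand gone.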
Charles wrote:
Oops, char [] htmlContents = `<([^>])+>|&([^;])+;`; // to extract all <tag> plain text </tag> from html

What about reflowing whitespace runs? BR tags to newlines, P tags, ordered lists, bulleted lists? Incorrect tag close nestings (e.g. <i><b></i></b>)? Not to mention that you have to parse each tag's attributes so you don't accidentally hit a right angle bracket inside a string value... HTML/XML comments... the list never ends.

This is why HTML is such a hacked standard.

--
Regards,
James Dunne
Jan 27 2006
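A few of the objections above can be handled with a handful of extra passes. The following is only a sketch under assumptions: it uses modern Phobos `std.regex`, the name `htmlToText` is hypothetical, and it is still not a parser. It covers comments, whitespace runs, BR and closing P tags, and a right angle bracket inside a quoted attribute value. It does nothing for lists, misnested tags, entities, or script and style content.

```d
import std.regex : regex, replaceAll;
import std.string : strip;
import std.stdio : writeln;

/// Hypothetical helper: a slightly more careful tag stripper.
string htmlToText(string html)
{
    // Comments may span lines, so let '.' match newlines ("s" flag).
    auto comments = regex(`<!--.*?-->`, "s");
    // Any run of whitespace reflows to a single space.
    auto spaces = regex(`\s+`);
    // <br>, <br/> and </p> each end a line.
    auto breaks = regex(`<br\s*/?>|</p\s*>`, "i");
    // A tag is '<', then plain characters or whole quoted strings, then '>'.
    // Skipping quoted strings keeps a '>' inside an attribute value
    // from ending the tag early.
    auto tags = regex(`<(?:[^>"']|"[^"]*"|'[^']*')*>`);

    auto text = replaceAll(html, comments, "");
    text = replaceAll(text, spaces, " ");
    text = replaceAll(text, breaks, "\n");
    text = replaceAll(text, tags, "");
    return text.strip;
}

void main()
{
    // Prints "One" and "two" on separate lines.
    writeln(htmlToText(`<p title="a > b">One<br>two</p>`));
}
```

On the same input, the simpler pattern `<[^>]+>` ends the opening tag at the bracket inside the title attribute and leaves ` b">` in the output, which is exactly the attribute problem described above.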
"James Dunne" <james.jdunne gmail.com> wrote in message news:drf2b5$gb9$1 digitaldaemon.com...
This is why HTML is such a hacked standard.

Yeah, I agree. I've been using AJAX lately, but it's hard for me to get over how 'hackish' it is, jumping through tons of hurdles just to overcome the limitations of HTTP/HTML. Have you seen HTML 2.0? http://www.w3.org/MarkUp/html-spec/html-spec_toc.html

I'd love to see a new design language for the web, connection-based and with some better widgets, using Mango for the server and the Harmonia code base to display this unnamed new language :D.
Jan 29 2006