digitalmars.D.learn - parsing HTML for a web robot (crawler) like application

Martin Tschierschke (8/8) Mar 23 2016 Hello!

Rene Zwanenburg (5/13) Mar 23 2016 Adam's dom.d will get you pretty far. I believe it can also

Martin Tschierschke (4/7) Mar 23 2016 On Wednesday, 23 March 2016 at 09:06:37 UTC, Rene Zwanenburg
Nordlöw (5/8) Mar 23 2016 HTML-docs here:

Adam D. Ruppe (23/25) Mar 23 2016 Indeed, though the docs are still a work in progress (the lib is

Andrea Fontana (3/11) Mar 23 2016 See also: http://code.dlang.org/packages/htmld

Martin Tschierschke <mt smartdolphin.de> writes:

Hello!
I want to set up a web robot to detect changes on certain web 
pages or sites.
Any hints on similar projects or libraries on dub or git to look 
at before I start developing my own RegExp for parsing?

Best regards
mt.

Mar 23 2016

Rene Zwanenburg <renezwanenburg gmail.com> writes:

On Wednesday, 23 March 2016 at 09:02:37 UTC, Martin Tschierschke 
wrote:
 Hello!
 I want to set up a web robot to detect changes on certain web 
 pages or sites.
 Any hints on similar projects or libraries on dub or git to look 
 at before I start developing my own RegExp for parsing?

 Best regards
 mt.

Adam's dom.d will get you pretty far. I believe it can also 
handle documents that aren't completely well-formed.

https://github.com/adamdruppe/arsd/blob/master/dom.d

Mar 23 2016

Martin Tschierschke <mt smartdolphin.de> writes:

On Wednesday, 23 March 2016 at 09:06:37 UTC, Rene Zwanenburg 
wrote:
[...]
 Adam's dom.d will get you pretty far. I believe it can also 
 handle documents that aren't completely well-formed.

 https://github.com/adamdruppe/arsd/blob/master/dom.d

Thank you! This forum has an incredibly fast auto responder ;-)

Mar 23 2016

Nordlöw <per.nordlow gmail.com> writes:

On Wednesday, 23 March 2016 at 09:06:37 UTC, Rene Zwanenburg 
wrote:
 Adam's dom.d will get you pretty far. I believe it can also 
 handle documents that aren't completely well-formed.

 https://github.com/adamdruppe/arsd/blob/master/dom.d

HTML-docs here:

http://dpldocs.info/experimental-docs/arsd.dom.html

through Adam's own web service.

Mar 23 2016

Adam D. Ruppe <destructionator gmail.com> writes:

On Wednesday, 23 March 2016 at 10:49:03 UTC, Nordlöw wrote:
 HTML-docs here:

 http://dpldocs.info/experimental-docs/arsd.dom.html

Indeed, though the docs are still a work in progress. The lib is 
now about 6 years old, but until recently ddoc blocked me from 
using examples in the comments, so I didn't bother. I've fixed 
that now, but haven't finished writing them all up.


Basic idea though for web scraping:

import arsd.dom;

auto document = new Document();
document.parseGarbage(your_html_string);

// supports most of the CSS selector syntax, which you might also know from jQuery
Element[] elements = document.querySelectorAll("css selector");
// or, if you just want the first hit (or null if there is none):
Element element = document.querySelector("css selector");


And once you have a reference:

element.innerText
element.innerHTML

to print its contents in some form.
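
Putting those calls together, a minimal sketch of a complete program might look like the following. It assumes dom.d and characterencodings.d from arsd are compiled in alongside it; the HTML string, the selectors, and the extractLinks helper are made up for illustration and are not part of the library.

```d
import std.stdio : writeln;
import arsd.dom; // assumption: dom.d and characterencodings.d from arsd are on the build line

/// Hypothetical helper: returns the href of every element matched by `selector`.
string[] extractLinks(string html, string selector = "a[href]")
{
    auto document = new Document();
    document.parseGarbage(html);

    string[] links;
    foreach (Element link; document.querySelectorAll(selector))
        links ~= link.getAttribute("href");
    return links;
}

void main()
{
    // Made-up sample page.
    string html = `<html><body>
        <h1 id="title">News</h1>
        <ul class="items">
            <li><a href="/a">First</a></li>
            <li><a href="/b">Second</a></li>
        </ul>
    </body></html>`;

    // Every link inside the list.
    writeln(extractLinks(html, "ul.items a"));

    // querySelector gives null when nothing matches, so check before use.
    auto document = new Document();
    document.parseGarbage(html);
    Element title = document.querySelector("#title");
    if (title !is null)
        writeln(title.innerText);
}
```

The null check matters for a crawler, since a page that changes its layout may simply stop matching the selector.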



You can do a lot more too (a LOT more), but just these functions 
should get you started.


The parseGarbage function also requires you to compile in the 
characterencodings.d file from my same GitHub repository. It handles 
charset detection and translation as well as tag soup parsing. I 
use it for a lot of web scraping myself.
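
For the original goal of detecting changes, one possible approach (not something arsd provides, just Phobos) is to fingerprint the extracted text and compare it with the fingerprint from the previous run. A rough sketch, where fingerprint and hasChanged are hypothetical helper names and the first sighting of a URL is treated as a change:

```d
import std.digest : toHexString;
import std.digest.sha : sha256Of;
import std.stdio : writeln;

/// Hypothetical helper: SHA-256 fingerprint of the extracted page text, as hex.
string fingerprint(string text)
{
    // toHexString on a fixed-size digest returns a stack array, so copy it out.
    auto hex = toHexString(sha256Of(text));
    return hex[].idup;
}

/// Hypothetical helper: true if `text` differs from what was last seen for `url`.
/// Assumption: a URL that has never been seen counts as changed.
bool hasChanged(ref string[string] seen, string url, string text)
{
    immutable current = fingerprint(text);
    auto previous = url in seen;
    immutable changed = previous is null || *previous != current;
    seen[url] = current; // remember the latest fingerprint for the next check
    return changed;
}

void main()
{
    string[string] seen;
    writeln(hasChanged(seen, "http://example.com/", "hello"));        // true (first sighting)
    writeln(hasChanged(seen, "http://example.com/", "hello"));        // false
    writeln(hasChanged(seen, "http://example.com/", "hello, world")); // true
}
```

Hashing element.innerText of a selected element rather than the whole page keeps ads, timestamps and similar noise from triggering false alarms. A real robot would also need to persist the seen table between runs.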

Mar 23 2016

Andrea Fontana <nospam example.com> writes:

On Wednesday, 23 March 2016 at 09:02:37 UTC, Martin Tschierschke 
wrote:
 Hello!
 I want to set up a web robot to detect changes on certain web 
 pages or sites.
 Any hints on similar projects or libraries on dub or git to look 
 at before I start developing my own RegExp for parsing?

 Best regards
 mt.

See also: http://code.dlang.org/packages/htmld

Mar 23 2016
