digitalmars.D.learn - html fetcher/parser
- Faux Amis (9/9) Aug 12 2017 I would like to get into D again by making a small program which fetches...
- Adam D. Ruppe (50/53) Aug 12 2017 My dom.d and http2.d combine to make this easy:
- Michael (4/10) Aug 12 2017 Sometimes it feels like there's the standard D library, Phobos,
- Faux Amis (5/29) Aug 13 2017 Just curious, but is there a spec of sorts which defines which errors
- Adam D. Ruppe (14/16) Aug 13 2017 The HTML5 spec describes how you are supposed to parse various
- Faux Amis (4/23) Aug 14 2017 Sounds good!
- Adam D. Ruppe (6/8) Aug 14 2017 Oh, I've actually done some of that before too.
- Soulsbane (3/13) Aug 12 2017 I've found the requests module nice to work with:
- Faux Amis (2/17) Aug 13 2017 Thanks, looks nice! I'll try it if Adam's modules fail me :)
I would like to get into D again by writing a small program that fetches a website every X amount of time and keeps track of all changes within specified DOM elements.

Fetching: should I go for std.net.curl, vibe.d, or something else?

Parsing: I could only find these dub packages: htmld and libdominator. Neither seems very active. Any recommendations?

As I haven't used D for some time, I just don't want to get off to a bad start :)

Thanks
Aug 12 2017
On Saturday, 12 August 2017 at 19:53:22 UTC, Faux Amis wrote:
> I would like to get into D again by making a small program which fetches a website every X-time and keeps track of all changes within specified dom elements.

My dom.d and http2.d combine to make this easy:

https://github.com/adamdruppe/arsd/blob/master/dom.d
https://github.com/adamdruppe/arsd/blob/master/http2.d

and a support file for random encodings:

https://github.com/adamdruppe/arsd/blob/master/characterencodings.d

Or via dub: http://code.dlang.org/packages/arsd-official (the dom and http subpackages are the ones you want).

Docs: http://dpldocs.info/arsd.dom

Sample program:

---
// compile: $ dmd thisfile.d ~/arsd/{dom,http2,characterencodings}
import std.stdio;
import arsd.dom;

void main() {
    // fetch the page and parse it into a DOM document
    auto document = Document.fromUrl("https://dlang.org/");
    // print the text of the first <p> element, stripped of markup
    writeln(document.optionSelector("p").innerText);
}
---

Output:

D is a general-purpose programming language with static typing, systems-level access, and C-like syntax. It combines efficiency, control and modeling power with safety and programmer productivity.

Note that the https support requires OpenSSL to be available on your system. It works best on Linux with OpenSSL installed as a devel lib (so like openssl-devel or whatever, just like you would if using it from C).

How it works:

Document.fromUrl uses the http lib to fetch the page, then automatically parses the contents as a DOM document. It will correct for common errors in webpage markup, character sets, etc.

Document and Element both have various methods for navigating, modifying, and accessing the DOM tree. Here, I used `optionSelector`, which works like `querySelector` in JavaScript (and the same syntax is used for CSS), returning the first matching element.

querySelector, however, returns null if nothing is found. optionSelector returns a dummy object instead, so you don't have to explicitly test it for null and can just access its methods.

`innerText` returns the text inside, stripped of markup. You might also want `innerHTML`, or `toString` to get the whole thing, markup and all.

There's a lot more you can do too, but I think just these few functions will be enough for your task.

Bonus fact: http://dpldocs.info/experimental-docs/std.algorithm.comparison.levenshteinDistanceAndPath.1.html

That function from the standard library makes doing a diff display of before and after pretty simple....
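
A minimal sketch of that diff idea, using only Phobos. The helper name `diffLines` and the "  " / "- " / "+ " prefixes are made up for illustration and are not part of arsd or Phobos; the inputs are assumed to be already split into lines (for example the `innerText` of an element before and after a fetch).

```d
import std.algorithm.comparison : EditOp, levenshteinDistanceAndPath;
import std.stdio : writeln;

/// Hypothetical helper: turns the edit path between two line arrays
/// into a simple unified-style listing.
string[] diffLines(string[] before, string[] after)
{
    // result[0] is the edit distance, result[1] the list of edit operations
    auto result = levenshteinDistanceAndPath(before, after);

    string[] output;
    size_t i, j; // positions in `before` and `after`
    foreach (op; result[1])
    {
        final switch (op)
        {
            case EditOp.none:       // line unchanged
                output ~= "  " ~ before[i];
                i++; j++;
                break;
            case EditOp.substitute: // line replaced
                output ~= "- " ~ before[i++];
                output ~= "+ " ~ after[j++];
                break;
            case EditOp.insert:     // line only in the new version
                output ~= "+ " ~ after[j++];
                break;
            case EditOp.remove:     // line only in the old version
                output ~= "- " ~ before[i++];
                break;
        }
    }
    return output;
}

void main()
{
    foreach (line; diffLines(["a", "b", "c"], ["a", "x", "c"]))
        writeln(line);
}
```

Comparing whole lines keeps the edit matrix small; for long pages the quadratic cost of the Levenshtein algorithm may still matter, so this is probably best applied to the selected elements only.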
Aug 12 2017
On Saturday, 12 August 2017 at 20:22:44 UTC, Adam D. Ruppe wrote:
> On Saturday, 12 August 2017 at 19:53:22 UTC, Faux Amis wrote:
>> [...]
> My dom.d and http2.d combine to make this easy:
> https://github.com/adamdruppe/arsd/blob/master/dom.d
> https://github.com/adamdruppe/arsd/blob/master/http2.d
> [...]

Sometimes it feels like there's the standard D library, Phobos, and then for everything else you have already developed a suitable library to supplement it, haha!
Aug 12 2017
On 2017-08-12 22:22, Adam D. Ruppe wrote:
> On Saturday, 12 August 2017 at 19:53:22 UTC, Faux Amis wrote:
>> [...]
> [...]
> ---
> // compile: $ dmd thisfile.d ~/arsd/{dom,http2,characterencodings}
> import std.stdio;
> import arsd.dom;
>
> void main() {
>     auto document = Document.fromUrl("https://dlang.org/");
>     writeln(document.optionSelector("p").innerText);
> }
> ---

Nice!

> [...]
> Document.fromUrl uses the http lib to fetch it, then automatically parse the contents as a dom document. It will correct for common errors in webpage markup, character sets, etc.

Just curious, but is there a spec of sorts which defines which errors should be fixed and such?

> [...]
> Bonus fact: http://dpldocs.info/experimental-docs/std.algorithm.comparison.levenshteinDistanceAndPath.1.html that function from the standard library makes doing a diff display of before and after pretty simple....

Thanks for the pointer!
Aug 13 2017
On Sunday, 13 August 2017 at 15:54:45 UTC, Faux Amis wrote:
> Just curious, but is there a spec of sorts which defines which errors should be fixed and such?

The HTML5 spec describes how you are supposed to parse various things, including the recovery paths for broken markup.

My module, however, isn't so formal. I just used it for a web scraping thing at work that hit a few hundred sites and fixed bugs as they came up to give good enough results for me.... (one thing I found is that a lot of sites claiming to be UTF-8 are actually Latin-1, so it validates and falls back to handle that. My http thing, while buggier, is similar: I once hit a server that ignored the Accept-Encoding header and always sent gzip anyway, so I had to handle that... and I noticed curl actually didn't!)

So on the one hand, there are surely still bugs and weird cases, but on the other hand, it did get a fair chunk of real-world use, so I am fairly confident it will be OK for most things.
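
The validate-and-fall-back idea can be sketched with Phobos alone. This is not the code from dom.d or characterencodings.d, just an illustration under the assumption that anything failing UTF-8 validation is Latin-1; `decodeBody` is a hypothetical name.

```d
import std.encoding : Latin1String, transcode;
import std.utf : UTFException, validate;

/// Hypothetical helper: returns the bytes as a UTF-8 string, treating
/// them as Latin-1 when they are not valid UTF-8 (an assumption of this
/// sketch, not a general rule).
string decodeBody(immutable(ubyte)[] raw)
{
    auto asUtf8 = cast(string) raw;
    try
    {
        validate(asUtf8); // throws UTFException on malformed UTF-8
        return asUtf8;
    }
    catch (UTFException)
    {
        // Reinterpret the bytes as Latin-1 and convert them to UTF-8.
        string converted;
        transcode(cast(Latin1String) raw, converted);
        return converted;
    }
}

void main()
{
    import std.stdio : writeln;

    // "café" encoded as Latin-1: 0xE9 alone is not valid UTF-8
    immutable(ubyte)[] latin1 = [0x63, 0x61, 0x66, 0xE9];
    writeln(decodeBody(latin1));
}
```

Valid UTF-8 input is returned unchanged, so the fallback only kicks in for pages whose declared charset turns out to be wrong.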
Aug 13 2017
On 2017-08-13 19:51, Adam D. Ruppe wrote:
> On Sunday, 13 August 2017 at 15:54:45 UTC, Faux Amis wrote:
>> Just curious, but is there a spec of sorts which defines which errors should be fixed and such?
> The HTML5 spec describes how you are supposed to parse various things, including the recovery paths for broken markup. My module, however, isn't so formal. [...]

Sounds good! (Although following the spec would be the first step to a D html layout engine :D )
Aug 14 2017
On Monday, 14 August 2017 at 23:15:13 UTC, Faux Amis wrote:
> (Although following the spec would be the first step to a D html layout engine :D )

Oh, I've actually done some of that before too.

https://github.com/adamdruppe/arsd/blob/master/htmlwidget.d

It is pretty horrible... but it managed to render my old homepage, which used CSS float, boxes, and basic tables. I don't know if it still compiles; I haven't even tried it for years.
Aug 14 2017
On Saturday, 12 August 2017 at 19:53:22 UTC, Faux Amis wrote:
> I would like to get into D again by making a small program which fetches a website every X-time and keeps track of all changes within specified dom elements.
> [...]

I've found the requests module nice to work with:

http://code.dlang.org/packages/requests
Aug 12 2017
On 2017-08-13 01:49, Soulsbane wrote:
> On Saturday, 12 August 2017 at 19:53:22 UTC, Faux Amis wrote:
>> I would like to get into D again by making a small program which fetches a website every X-time and keeps track of all changes within specified dom elements.
>> [...]
> I've the requests module nice to work with: http://code.dlang.org/packages/requests

Thanks, looks nice! I'll try it if Adam's modules fail me :)
Aug 13 2017