
digitalmars.D.learn - Extracting Structure from HTML using Adam's dom.d

reply "Nordlöw" <per.nordlow gmail.com> writes:
I'm trying to figure out how to most easily extract structured 
information using Adam D Ruppe's dom.d.

Typically I want the following HTML example

...
<h2> <span class="mw-headline" id="H2_A">More important</span> 
</h2>
<p>This is <i>important</i>.</p>
<h2> <span class="mw-headline" id="H2_B">Less important</span> 
</h2>
<p>This is not important.</p>
...

to be reduced to

This is <i>important</i>.

This means that I need some kind of interface to extract all the 
contents of each <p> paragraph that is preceded by a <h2> 
heading with a specific id (say "H2_A") or content (say "More 
important"). How do I accomplish that?

Further, is there a way to extract only the "contents" of an 
Element instance, that is, "Stuff" from "<p>Stuff</p>", for each 
Element in the return of, for example, getElementsByTagName(`p`)?
Jan 21 2015
parent reply "Adam D. Ruppe" <destructionator gmail.com> writes:
On Wednesday, 21 January 2015 at 23:31:26 UTC, Nordlöw wrote:
 This means that I need some kind of interface to extract all 
 the contents of each <p> paragraph that is preceded by a <h2> 
 heading with a specific id (say "H2_A") or content (say "More 
 important"). How do I accomplish that?
You can do that with a CSS selector like:

    document.querySelector("#H2_A + p");

or even

    document.querySelectorAll("h2 + p")

to get every P immediately following a h2.

My implementation works mostly the same as in javascript so you can read more about css selectors anywhere on the net like https://developer.mozilla.org/en-US/docs/Web/API/Document.querySelector
 Further, is there a way to extract the "contents" only of an 
 Element instance, that is  "Stuff" from "<p>Stuff</p>" for each 
 Element in the return of for example getElementsByTagName(`p`)?
Element.innerText returns all the plain text inside with all tags stripped out (same as the function in IE)

Element.innerHTML returns all the content inside, including tags (same as the function in all browsers)

Element.firstInnerText returns all the text up to the first tag, but then stops there. (this is a custom extension)

You can call those in a regular foreach loop or with something like std.algorithm.map to get the info from an array of elements.
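
A minimal sketch of the above, assuming dom.d is on the import path as module arsd.dom. In the sample markup the id sits on the <span> inside the <h2>, so "#H2_A + p" would ask for a <p> directly after that span; the sketch therefore walks element.tree in document order instead. The helper paragraphAfterHeading and the <html><body> wrapper around the sample are illustrative assumptions, not part of dom.d or of the original page.

```d
// Assumes Adam D. Ruppe's dom.d is available as module arsd.dom.
import std.stdio;
import arsd.dom;

// Sample markup from the question; the html/body wrapper is an assumption.
enum sampleHtml = `<html><body>
<h2> <span class="mw-headline" id="H2_A">More important</span> </h2>
<p>This is <i>important</i>.</p>
<h2> <span class="mw-headline" id="H2_B">Less important</span> </h2>
<p>This is not important.</p>
</body></html>`;

// Hypothetical helper: returns the first <p> that follows, in document
// order, the <h2> containing the element with the given id, or null.
Element paragraphAfterHeading(Document document, string headingId)
{
    bool headingSeen = false;
    foreach (e; document.root.tree)
    {
        if (e.tagName == "h2")
            headingSeen = e.querySelector("#" ~ headingId) !is null;
        else if (headingSeen && e.tagName == "p")
            return e;
    }
    return null;
}

void main()
{
    auto document = new Document(sampleHtml);

    auto p = paragraphAfterHeading(document, "H2_A");
    if (p !is null)
    {
        writeln(p.innerHTML);      // content including tags
        writeln(p.innerText);      // plain text, tags stripped
        writeln(p.firstInnerText); // text up to the first tag
    }

    // Every <p> immediately following an <h2>, as suggested above.
    foreach (e; document.querySelectorAll("h2 + p"))
        writeln(e.innerText);
}
```

If the page puts the id on the <h2> itself, the plain "#H2_A + p" selector from the reply should be all that is needed.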
Jan 21 2015
next sibling parent reply "Per Nordlöw" <per.nordlow gmail.com> writes:
Brilliant! Thanks!

BTW: Would you be interested in receiving a PR for dom.d where I replace array allocations with calls to lazy ranges?
Jan 22 2015
next sibling parent reply "Suliman" <evermind live.ru> writes:
Adam, please add some simpler docs about your parser to the site.

Also, it would be great to create a dub package, to make it easier 
to include the parser in a project.
Jan 22 2015
parent "Adam D. Ruppe" <destructionator gmail.com> writes:
On Thursday, 22 January 2015 at 10:14:58 UTC, Suliman wrote:
 Adam, please add more simple docs about your parser on site.
I'll post some ddoc in the next dmd release, now that dmd finally supports some way to automatically escape xml examples.
 Also it would be perfect to create dub, for easier including 
 parser to project.
I don't use dub and don't like its requirements, so I won't do that. Someone else is free to package it though. But like I just said in the other email, it is just a single file for core functionality so you can just download that and add it to your project as a source file.
Jan 22 2015
prev sibling parent "Adam D. Ruppe" <destructionator gmail.com> writes:
On Thursday, 22 January 2015 at 09:27:17 UTC, Per Nordlöw wrote:
 BTW: Would you be interested in receiving a PR for dom.d where 
 I replace array allocations with calls to lazy ranges?
Maybe. It was on my todo list to do that for getElementsByTagName at least, which is supposed to be a live list rather than a copy of references. querySelectorAll, however, is supposed to be a copy, so I don't want that to be a range. (this is to match the W3C standard and what javascript does)

There are lazy range functions in there btw: element.tree is a lazy range. If you combine it with stuff like std.algorithm.filter and map, etc., it'd be easy to do a bunch of them. getElementsByTagName for example is

    filter!((e) => e.tagName == want)(element.tree)

So the lazy implementations could just be in those terms. (actually though, that's not hard to write on the spot, so maybe it should just be explained instead of adding/changing methods. It is nice that they are plain methods instead of templates now because they can be so easily wrapped in things like script code)
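
A short sketch of that idea, assuming dom.d is importable as arsd.dom and that element.tree works with std.algorithm as described above. The helper name elementsByTagNameLazy is hypothetical and not a dom.d function.

```d
// Assumes Adam D. Ruppe's dom.d is available as module arsd.dom.
import std.algorithm : filter, map;
import std.stdio;
import arsd.dom;

// Illustrative markup, not taken from the thread.
enum nestedHtml = `<html><body><p>One</p><div><p>Two</p></div></body></html>`;

// Hypothetical helper: a lazy counterpart to getElementsByTagName, built
// from the filter expression quoted above. No array is allocated here;
// elements are produced as the caller iterates.
auto elementsByTagNameLazy(Element element, string want)
{
    return element.tree.filter!(e => e.tagName == want);
}

void main()
{
    auto document = new Document(nestedHtml);

    // Chain map on top of the lazy filter to pull out the text of each <p>.
    foreach (text; elementsByTagNameLazy(document.root, "p").map!(e => e.innerText))
        writeln(text);
}
```

Because the result is an ordinary input range, it can be passed on to other std.algorithm functions, or turned into an array with std.array.array when a copy is wanted.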
Jan 22 2015
prev sibling parent reply "Nordlöw" <per.nordlow gmail.com> writes:
On Thursday, 22 January 2015 at 02:06:16 UTC, Adam D. Ruppe wrote:
 You can do that with a CSS selector like:

 document.querySelector("#H2_A + p");
What is the meaning of selectors such as `a[href]` used in doc.querySelectorAll(`a[href]`) ?
Jan 22 2015
parent reply "Gary Willoughby" <dev nomad.so> writes:
On Thursday, 22 January 2015 at 11:23:49 UTC, Nordlöw wrote:
 What is the meaning of selectors such as

     `a[href]`

 used in

     doc.querySelectorAll(`a[href]`)

 ?
Select all `a` tags that have a `href` attribute.

You can also select using the attribute value too. For example get all the text inputs in a form:

    doc.querySelectorAll(`form[name="myform"] input[type="text"]`)

dom.d is awesome!
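
A minimal sketch of both selectors, assuming dom.d is importable as arsd.dom and that Element.getAttribute is available for reading attribute values. The markup in formHtml is made up for illustration.

```d
// Assumes Adam D. Ruppe's dom.d is available as module arsd.dom.
import std.stdio;
import arsd.dom;

// Illustrative markup, not taken from the thread.
enum formHtml = `<html><body>
<a href="/docs">Docs</a> <a name="top">Anchor</a>
<form name="myform">
<input type="text" name="user" />
<input type="submit" value="Go" />
</form>
</body></html>`;

void main()
{
    auto document = new Document(formHtml);

    // <a> elements that carry an href attribute; the name-only anchor
    // has no href and is not selected.
    foreach (a; document.querySelectorAll(`a[href]`))
        writeln(a.getAttribute("href"));

    // Text inputs inside the form named "myform"; the submit button
    // has a different type and is not selected.
    foreach (input; document.querySelectorAll(`form[name="myform"] input[type="text"]`))
        writeln(input.getAttribute("name"));
}
```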
Jan 22 2015
next sibling parent reply ketmar via Digitalmars-d-learn <digitalmars-d-learn puremagic.com> writes:
i miss it in Phobos.
Jan 22 2015
parent reply "Adam D. Ruppe" <destructionator gmail.com> writes:
On Thursday, 22 January 2015 at 16:22:14 UTC, ketmar via 
Digitalmars-d-learn wrote:
 i miss it in Phobos.
I'm sure it'd fail the phobos review process though. But since it is an independent file (or it + characterencodings.d for full functionality), it is easy to just download and add to your project anyway.
Jan 22 2015
next sibling parent ketmar <ketmar ketmar.no-ip.org> writes:
yes, but that's one more file to download. if it's in Phobos, i can just install dmd and go on writing my k00l skriptz right away. that's why i want it there. i know that it will hardly happen, but can i dream? ;-)
Jan 22 2015
prev sibling parent reply "Suliman" <evermind live.ru> writes:
Adam, I understand how to select URLs, but how do I extract the
values of attributes from such a selection?

   <a href="/" class="post-tag" title="show questions tagged
'javascript'" rel="somedata">

I need to extract the data inside rel ("somedata")
Jan 25 2015
parent reply "Adam D. Ruppe" <destructionator gmail.com> writes:
On Sunday, 25 January 2015 at 21:24:06 UTC, Suliman wrote:
 I need extract data inside rel ("somedata")
string s = element.rel;
Jan 25 2015
parent reply "Suliman" <evermind live.ru> writes:
On Sunday, 25 January 2015 at 22:14:09 UTC, Adam D. Ruppe wrote:
 On Sunday, 25 January 2015 at 21:24:06 UTC, Suliman wrote:
 I need extract data inside rel ("somedata")
string s = element.rel;
Do you mean something like this?

    string s = element.rel;
    foreach(row; document.querySelectorAll("a[href]"))
    {
        auto data = document.querySelectorAll(s);
        writeln(data);
    }
Jan 26 2015
parent reply "Adam D. Ruppe" <destructionator gmail.com> writes:
On Monday, 26 January 2015 at 17:17:55 UTC, Suliman wrote:
 Do you mean something like this?
I just mean to get the value of the rel="something" attribute, use element.rel;

    assert(element.rel == "something");

The element is the thing you see in the loop and stuff. querySelectorAll returns an array of elements.
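
Put together, a loop along these lines reads the rel value from each selected element without knowing it in advance. This is a sketch under assumptions: dom.d is importable as arsd.dom, and it uses Element.getAttribute rather than the element.rel shorthand shown above. The helper relValues, the link text, and the second link are made up for illustration.

```d
// Assumes Adam D. Ruppe's dom.d is available as module arsd.dom.
import std.stdio;
import arsd.dom;

// The first link is the one from the question; the closing tag, the link
// text, and the second link are added for illustration.
enum linkHtml = `<html><body>
<a href="/" class="post-tag" title="show questions tagged 'javascript'" rel="somedata">javascript</a>
<a href="/about">about</a>
</body></html>`;

// Hypothetical helper: collects the rel value of every <a> that has one.
string[] relValues(Document document)
{
    string[] values;
    // Each item in the array returned by querySelectorAll is an Element.
    foreach (element; document.querySelectorAll("a[rel]"))
        values ~= element.getAttribute("rel");
    return values;
}

void main()
{
    foreach (rel; relValues(new Document(linkHtml)))
        writeln(rel);
}
```

The selector only picks which elements to visit; the attribute value is then read from each element inside the loop.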
Jan 26 2015
parent "Suliman" <evermind live.ru> writes:
But I need to query all the data from this field. I do not know what will be in the quotes. I need to select this value.
Jan 26 2015
prev sibling parent "Adam D. Ruppe" <destructionator gmail.com> writes:
On Thursday, 22 January 2015 at 11:40:53 UTC, Gary Willoughby 
wrote:
 doc.querySelectorAll(`form[name="myform"] input[type="text"]`)

 dom.d is awesome!
Something to remember btw is this also works in browser JavaScript AND css itself, since IE8 and Firefox 3.5. (no need for slow, bloated jquery)

My implementation is different in some ways but mostly compatible, including some of the more advanced features like [attr^=starts_with_this] and $= and *= and so on. Also the sibling selectors ~ and +, and so on. (search for CSS selector info to learn more)

dom.d also does things like :first-child, but it does NOT support things like :nth-of-type and a few more of those newer CSS3 things. I might add them some day but I find this is pretty good as is.
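
A small sketch of those operators, assuming dom.d is importable as arsd.dom and that quoted attribute values are accepted with ^=, $= and *= the same way as with plain =. The markup and the helper countMatches are made up for illustration.

```d
// Assumes Adam D. Ruppe's dom.d is available as module arsd.dom.
import std.stdio;
import arsd.dom;

// Illustrative markup, not taken from the thread.
enum listHtml = `<html><body><ul><li><a href="https://dlang.org/spec.html">spec</a></li><li><a href="/local/page.html">page</a></li><li><a href="/files/report.pdf">report</a></li></ul></body></html>`;

// Hypothetical helper: number of elements matching a selector.
size_t countMatches(Document document, string selector)
{
    return document.querySelectorAll(selector).length;
}

void main()
{
    auto document = new Document(listHtml);

    // ^= matches the start of the value, $= the end, *= any substring.
    writeln(countMatches(document, `a[href^="https"]`));
    writeln(countMatches(document, `a[href$=".pdf"]`));
    writeln(countMatches(document, `a[href*="local"]`));

    // :first-child is among the pseudo-classes mentioned above.
    foreach (li; document.querySelectorAll("li:first-child"))
        writeln(li.innerText);
}
```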
Jan 22 2015