digitalmars.D.learn - Extracting Structure from HTML using Adam's dom.d

Nordlöw (20/20) Jan 21 2015 I'm trying to figure out how to most easily extract structured

Adam D. Ruppe (16/23) Jan 21 2015 You can do that with a CSS selector like:

Per Nordlöw (4/29) Jan 22 2015 Brilliant! Thanks!

Suliman (3/3) Jan 22 2015 Adam, please add more simple docs about your parser on site.

Adam D. Ruppe (8/11) Jan 22 2015 I'll post some ddoc in the next dmd release, now that dmd finally

Adam D. Ruppe (18/20) Jan 22 2015 Maybe. It was on my todo list to do that for getElementsByTagName

Nordlöw (6/8) Jan 22 2015 What is the meaning of selectors such as

Gary Willoughby (6/11) Jan 22 2015 Select all `a` tags that have a `href` attribute.

ketmar via Digitalmars-d-learn (4/23) Jan 22 2015 On Thu, 22 Jan 2015 11:40:52 +0000

Adam D. Ruppe (6/7) Jan 22 2015 I'm sure it'd fail the phobos review process though. But since it

ketmar (5/12) Jan 22 2015 yes, but that's one more file to download. if it's in Phobos, i can just...

Suliman (5/5) Jan 25 2015 Adam, I understood how to select URLs, but how extract values of

Adam D. Ruppe (2/3) Jan 25 2015 string s = element.rel;

Suliman (8/11) Jan 26 2015 Do you mean something like this?

Adam D. Ruppe (6/7) Jan 26 2015 I just mean to get the value of the rel="something" attribute, use

Suliman (3/11) Jan 26 2015 But I need to query all data from this field. I do not know what

Adam D. Ruppe (13/15) Jan 22 2015 Something to remember btw is this also works in browser

"Nordlöw" <per.nordlow gmail.com> writes:

I'm trying to figure out how to most easily extract structured 
information using Adam D Ruppe's dom.d.

Typically I want the following HTML example

...
<h2> <span class="mw-headline" id="H2_A">More important</span> 
</h2>
<p>This is <i>important</i>.</p>
<h2> <span class="mw-headline" id="H2_B">Less important</span> 
</h2>
<p>This is not important.</p>
...

to be reduced to

This is <i>important</i>.

This means that I need some kind of interface to extract all the 
contents of each <p> paragraph that is preceded by a <h2> 
heading with a specific id (say "H2_A") or content (say "More 
important"). How do I accomplish that?

Further, is there a way to extract the "contents" only of an 
Element instance, that is  "Stuff" from "<p>Stuff</p>" for each 
Element in the return of for example getElementsByTagName(`p`)?

Jan 21 2015

"Adam D. Ruppe" <destructionator gmail.com> writes:

On Wednesday, 21 January 2015 at 23:31:26 UTC, Nordlöw wrote:
 This means that I need some kind of interface to extract all 
 the contents of each <p> paragraph that is preceded by a <h2> 
 heading with a specific id (say "H2_A") or content (say "More 
 important"). How do I accomplish that?

You can do that with a CSS selector like:

document.querySelector("#H2_A + p");

or even document.querySelectorAll("h2 + p") to get every P 
immediately following a h2.


My implementation works mostly the same as in JavaScript, so you 
can read more about CSS selectors anywhere on the net, e.g. 
https://developer.mozilla.org/en-US/docs/Web/API/Document.querySelector

 Further, is there a way to extract the "contents" only of an 
 Element instance, that is  "Stuff" from "<p>Stuff</p>" for each 
 Element in the return of for example getElementsByTagName(`p`)?

Element.innerText returns all the plain text inside with all tags 
stripped out (same as the function in IE)

Element.innerHTML returns all the content inside, including tags 
(same as the function in all browsers)

Element.firstInnerText returns all the text up to the first tag, 
but then stops there. (this is a custom extension)


You can call those in a regular foreach loop or with something 
like std.algorithm.map to get the info from an array of elements.

Jan 21 2015
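
A minimal sketch of the calls mentioned above, assuming dom.d is importable as `arsd.dom` and that `new Document(html)` parses an HTML string. The names `sampleHtml` and `paragraphsAfterHeadings` are made up for illustration. One thing to watch with the sample markup from the question: the id `H2_A` sits on the <span> inside the <h2>, and `+` only matches siblings, so `#H2_A + p` needs the id to be on the <h2> itself. The sketch therefore uses `h2 + p`.

```d
import arsd.dom;
import std.algorithm : map;
import std.array : array;
import std.stdio : writeln;

// Sample markup from the question, wrapped in a full document.
enum sampleHtml = `<html><body>
<h2> <span class="mw-headline" id="H2_A">More important</span> </h2>
<p>This is <i>important</i>.</p>
<h2> <span class="mw-headline" id="H2_B">Less important</span> </h2>
<p>This is not important.</p>
</body></html>`;

/// Returns the inner HTML of every <p> that immediately follows an <h2>.
string[] paragraphsAfterHeadings(string html)
{
    auto document = new Document(html);
    // querySelectorAll returns an array, so std.algorithm.map applies directly.
    return document.querySelectorAll("h2 + p").map!(p => p.innerHTML).array;
}

void main()
{
    auto document = new Document(sampleHtml);

    // The same elements in a plain foreach loop.
    foreach (p; document.querySelectorAll("h2 + p"))
    {
        writeln(p.innerHTML);      // content including tags
        writeln(p.innerText);      // plain text with tags stripped
        writeln(p.firstInnerText); // text up to the first tag
    }

    writeln(paragraphsAfterHeadings(sampleHtml));
}
```

To keep only the paragraph under one particular heading, one option is to filter the result by the heading's text or id in D code rather than in the selector.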

"Per Nordlöw" <per.nordlow gmail.com> writes:

On Thursday, 22 January 2015 at 02:06:16 UTC, Adam D. Ruppe wrote:
 On Wednesday, 21 January 2015 at 23:31:26 UTC, Nordlöw wrote:
 This means that I need some kind of interface to extract all 
 the contents of each <p> paragraph that is preceded by a <h2> 
 heading with a specific id (say "H2_A") or content (say "More 
 important"). How do I accomplish that?

 You can do that with a CSS selector like:

 document.querySelector("#H2_A + p");

 or even document.querySelectorAll("h2 + p") to get every P 
 immediately following a h2.


 My implementation works mostly the same as in javascript so you 
 can read more about css selectors anywhere on the net like 
 https://developer.mozilla.org/en-US/docs/Web/API/Document.querySelector

 Further, is there a way to extract the "contents" only of an 
 Element instance, that is  "Stuff" from "<p>Stuff</p>" for 
 each Element in the return of for example 
 getElementsByTagName(`p`)?

 Element.innerText returns all the plain text inside with all 
 tags stripped out (same as the function in IE)

 Element.innerHTML returns all the content inside, including 
 tags (same as the function in all browsers)

 Element.firstInnerText returns all the text up to the first 
 tag, but then stops there. (this is a custom extension)


 You can call those in a regular foreach loop or with something 
 like std.algorithm.map to get the info from an array of 
 elements.

Brilliant! Thanks!

BTW: Would you be interested in receiving a PR for dom.d where I 
replace array allocations with calls to lazy ranges?

Jan 22 2015

"Suliman" <evermind live.ru> writes:

Adam, please add some simpler docs about your parser to the site.

Also, it would be perfect to create a dub package, to make it 
easier to include the parser in a project.

Jan 22 2015

"Adam D. Ruppe" <destructionator gmail.com> writes:

On Thursday, 22 January 2015 at 10:14:58 UTC, Suliman wrote:
 Adam, please add some simpler docs about your parser to the site.

I'll post some ddoc in the next dmd release, now that dmd finally 
supports some way to automatically escape xml examples.

 Also, it would be perfect to create a dub package, to make it 
 easier to include the parser in a project.

I don't use dub and don't like its requirements, so I won't do 
that. Someone else is free to package it though.

But like I just said in the other email, it is just a single file 
for core functionality so you can just download that and add it 
to your project as a source file.

Jan 22 2015

"Adam D. Ruppe" <destructionator gmail.com> writes:

On Thursday, 22 January 2015 at 09:27:17 UTC, Per Nordlöw wrote:
 BTW: Would you be interested in receiving a PR for dom.d where 
 I replace array allocations with calls to lazy ranges?

Maybe. It was on my todo list to do that for getElementsByTagName 
at least, which is supposed to be a live list rather than a copy 
of references.

querySelectorAll, however, is supposed to be a copy, so I don't 
want that to be a range. (This is to match the W3C standard and 
what JavaScript does.)


There are lazy range functions in there btw: element.tree is a 
lazy range. If you combine it with stuff like 
std.algorithm.filter and map, etc., it'd be easy to do a bunch of 
them.

getElementsByTagName for example is filter!((e) => e.tagName == 
want)(element.tree). So the lazy implementations could just be in 
those terms.

(actually though, that's not hard to write on the spot, so maybe 
it should just be explained instead of adding/changing methods. 
It is nice that they are plain methods instead of templates now 
because they can be so easily wrapped in things like script code)

Jan 22 2015
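
The filter-over-tree idea above can be sketched as follows. This is only an illustration, assuming `arsd.dom` is the import name, that `Document.root` gives the root element, and that `element.tree` visits elements in document order. The helper name `lazyByTagName` is hypothetical and not part of dom.d.

```d
import arsd.dom;
import std.algorithm : filter, map;
import std.array : array;
import std.stdio : writeln;

/// Lazily yields the elements under `start` whose tag name equals `want`.
/// Nothing is collected until the caller iterates the result.
auto lazyByTagName(Element start, string want)
{
    return start.tree.filter!(e => e.tagName == want);
}

/// Eagerly gathers the plain text of every matching element (for demonstration).
string[] textsByTagName(string html, string want)
{
    auto document = new Document(html);
    return lazyByTagName(document.root, want).map!(e => e.innerText).array;
}

void main()
{
    enum html = `<html><body><p>one</p><div><p>two</p></div></body></html>`;
    writeln(textsByTagName(html, "p"));
}
```

Because the range is lazy, a caller that only needs the first match can stop early without walking the rest of the tree.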

"Nordlöw" <per.nordlow gmail.com> writes:

On Thursday, 22 January 2015 at 02:06:16 UTC, Adam D. Ruppe wrote:
 You can do that with a CSS selector like:

 document.querySelector("#H2_A + p");

What is the meaning of selectors such as

     `a[href]`

used in

     doc.querySelectorAll(`a[href]`)

?

Jan 22 2015

"Gary Willoughby" <dev nomad.so> writes:

On Thursday, 22 January 2015 at 11:23:49 UTC, Nordlöw wrote:
 What is the meaning of selectors such as

     `a[href]`

 used in

     doc.querySelectorAll(`a[href]`)

 ?

Select all `a` tags that have a `href` attribute.

You can also select using the attribute value too. For example 
get all the text inputs in a form:

doc.querySelectorAll(`form[name="myform"] input[type="text"]`)

dom.d is awesome!

Jan 22 2015
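
A small sketch of the two attribute selectors above, assuming `arsd.dom` as the import and `new Document(html)` for parsing. The markup, the form field names, and the helper names `linkCount` and `textInputNames` are invented for the example.

```d
import arsd.dom;
import std.stdio : writeln;

// Invented markup: one link with href, one anchor without, and a small form.
enum formHtml = `<html><body>
<a href="/home">Home</a> <a name="anchor">No link</a>
<form name="myform">
<input type="text" name="user" />
<input type="password" name="pass" />
<input type="text" name="email" />
</form>
</body></html>`;

/// Counts the <a> elements that carry an href attribute, whatever its value.
size_t linkCount(Document doc)
{
    return doc.querySelectorAll(`a[href]`).length;
}

/// Names of the text inputs inside the form called "myform".
string[] textInputNames(Document doc)
{
    string[] names;
    foreach (input; doc.querySelectorAll(`form[name="myform"] input[type="text"]`))
        names ~= input.getAttribute("name");
    return names;
}

void main()
{
    auto doc = new Document(formHtml);
    writeln(linkCount(doc));
    writeln(textInputNames(doc));
}
```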

ketmar via Digitalmars-d-learn <digitalmars-d-learn puremagic.com> writes:

On Thu, 22 Jan 2015 11:40:52 +0000
Gary Willoughby via Digitalmars-d-learn
<digitalmars-d-learn puremagic.com> wrote:

 On Thursday, 22 January 2015 at 11:23:49 UTC, Nordlöw wrote:
 What is the meaning of selectors such as

     `a[href]`

 used in

     doc.querySelectorAll(`a[href]`)

 ?

 Select all `a` tags that have a `href` attribute.

 You can also select using the attribute value too. For example
 get all the text inputs in a form:

 doc.querySelectorAll(`form[name="myform"] input[type="text"]`)

 dom.d is awesome!

i miss it in Phobos.

Jan 22 2015

"Adam D. Ruppe" <destructionator gmail.com> writes:

On Thursday, 22 January 2015 at 16:22:14 UTC, ketmar via 
Digitalmars-d-learn wrote:
 i miss it in Phobos.

I'm sure it'd fail the phobos review process though. But since it 
is an independent file (or it + characterencodings.d for full 
functionality), it is easy to just download and add to your 
project anyway.

Jan 22 2015

ketmar <ketmar ketmar.no-ip.org> writes:

On Thu, 22 Jan 2015 18:39:25 +0000, Adam D. Ruppe wrote:

 On Thursday, 22 January 2015 at 16:22:14 UTC, ketmar via
 Digitalmars-d-learn wrote:
 i miss it in Phobos.

 I'm sure it'd fail the phobos review process though. But since it is an
 independent file (or it + characterencodings.d for full functionality),
 it is easy to just download and add to your project anyway.

yes, but that's one more file to download. if it's in Phobos, i can just
install dmd and go on writing my k00l skriptz right away. that's why i
want it there.

i know that it will hardly happen, but can i dream? ;-)

Jan 22 2015

"Suliman" <evermind live.ru> writes:

Adam, I understand how to select URLs, but how do I extract the values
of attributes from such a selection?

   <a href="/" class="post-tag" title="show questions tagged 'javascript'" rel="somedata">

I need to extract the data inside rel ("somedata").

Jan 25 2015

"Adam D. Ruppe" <destructionator gmail.com> writes:

On Sunday, 25 January 2015 at 21:24:06 UTC, Suliman wrote:
 I need to extract the data inside rel ("somedata").

string s = element.rel;

Jan 25 2015

"Suliman" <evermind live.ru> writes:

On Sunday, 25 January 2015 at 22:14:09 UTC, Adam D. Ruppe wrote:
 On Sunday, 25 January 2015 at 21:24:06 UTC, Suliman wrote:
 I need to extract the data inside rel ("somedata").

 string s = element.rel;

Do you mean something like this?

string s = element.rel;
foreach(row; document.querySelectorAll("a[href]"))
{
	auto data = document.querySelectorAll(s);
	writeln(data);
}

Jan 26 2015

"Adam D. Ruppe" <destructionator gmail.com> writes:

On Monday, 26 January 2015 at 17:17:55 UTC, Suliman wrote:
 Do you mean something like this?

I just mean to get the value of the rel="something" attribute, use

element.rel;

assert(element.rel == "something");


The element is the loop variable (row in your code), so read the 
attribute from it inside the loop. querySelectorAll returns an 
array of elements.

Jan 26 2015

"Suliman" <evermind live.ru> writes:

On Monday, 26 January 2015 at 17:24:23 UTC, Adam D. Ruppe wrote:
 On Monday, 26 January 2015 at 17:17:55 UTC, Suliman wrote:
 Do you mean something like this?

 I just mean to get the value of the rel="something" attribute, 
 use

 element.rel;

 assert(element.rel == "something");


 The element is the thing you see in the loop and stuff. 
 querySelectorAll returns an array of elements.

But I need to query all the data from this field. I do not know what 
will be in the quotes. I need to select this value.

Jan 26 2015
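
For the case where the rel value is not known in advance, a loop along these lines may be what is wanted. This is a sketch only, assuming `arsd.dom` as the import and `getAttribute` for reading the value (the `element.rel` shorthand from the posts above is expected to give the same string). The helper name `relValues` and the extra links are made up.

```d
import arsd.dom;
import std.stdio : writeln;

// The link from the question plus two invented ones.
enum linksHtml = `<html><body>
<a href="/" class="post-tag" title="show questions tagged 'javascript'" rel="somedata">javascript</a>
<a href="/d" class="post-tag" rel="otherdata">d</a>
<a href="/plain">no rel here</a>
</body></html>`;

/// Collects the rel value of every link that has a rel attribute.
string[] relValues(Document document)
{
    string[] values;
    // a[rel] matches on the attribute's presence, so its value need not be known.
    foreach (link; document.querySelectorAll("a[rel]"))
        values ~= link.getAttribute("rel");
    return values;
}

void main()
{
    auto document = new Document(linksHtml);
    foreach (value; relValues(document))
        writeln(value);
}
```

The selector only picks the elements. The attribute value is then read from each element inside the loop, not through a second querySelectorAll call.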

"Adam D. Ruppe" <destructionator gmail.com> writes:

On Thursday, 22 January 2015 at 11:40:53 UTC, Gary Willoughby 
wrote:
 doc.querySelectorAll(`form[name="myform"] input[type="text"]`)

 dom.d is awesome!

Something to remember, btw, is that this also works in browser 
JavaScript AND in CSS itself, since IE8 and Firefox 3.5. (no need 
for slow, bloated jQuery)

My implementation is different in some ways but mostly 
compatible, including some of the more advanced features like 
[attr^=starts_with_this] and $= and *= and so on. Also the 
sibling selectors ~ and +, and so on. (search for CSS selector 
info to learn more)

dom.d also does things like :first-child, but it does NOT support 
things like :nth-of-type and a few more of those newer CSS3 
features. I might add them some day, but I find this is pretty 
good as is.

Jan 22 2015
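
A rough sketch of the selector features listed above, assuming `arsd.dom` as the import and that quoted attribute values are accepted with ^=, $= and *= the same way they are with = in the earlier post. The markup and the helper name `countMatches` are invented for illustration.

```d
import arsd.dom;
import std.stdio : writeln;

// Invented markup with a few links and two paragraphs after a heading.
enum selectorHtml = `<html><body>
<ul><li><a href="https://dlang.org/spec.html">spec</a></li><li><a href="http://example.com/file.pdf">pdf</a></li><li><a href="/docs/dom-notes.txt">notes</a></li></ul>
<h2>Title</h2><p>first</p><p>second</p>
</body></html>`;

/// Number of elements in `doc` matching `selector`.
size_t countMatches(Document doc, string selector)
{
    return doc.querySelectorAll(selector).length;
}

void main()
{
    auto doc = new Document(selectorHtml);
    writeln(countMatches(doc, `a[href^="https:"]`)); // href starts with "https:"
    writeln(countMatches(doc, `a[href$=".pdf"]`));   // href ends with ".pdf"
    writeln(countMatches(doc, `a[href*="dom"]`));    // href contains "dom"
    writeln(countMatches(doc, `h2 ~ p`));            // every <p> sibling after an <h2>
    writeln(countMatches(doc, `h2 + p`));            // only the <p> right after an <h2>
    writeln(countMatches(doc, `li:first-child`));    // an <li> that is its parent's first child
}
```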
