digitalmars.D.learn - Extracting Structure from HTML using Adam's dom.d
- Nordlöw (20/20) Jan 21 2015 I'm trying to figure out how to most easily extract structured
- Adam D. Ruppe (16/23) Jan 21 2015 You can do that with a CSS selector like:
- Per Nordlöw (4/29) Jan 22 2015 Brilliant! Thanks!
- Suliman (3/3) Jan 22 2015 Adam, please add more simple docs about your parser on site.
- Adam D. Ruppe (8/11) Jan 22 2015 I'll post some ddoc in the next dmd release, now that dmd finally
- Adam D. Ruppe (18/20) Jan 22 2015 Maybe. It was on my todo list to do that for getElementsByTagName
- Nordlöw (6/8) Jan 22 2015 What is the meaning of selectors such as
- Gary Willoughby (6/11) Jan 22 2015 Select all `a` tags that have a `href` attribute.
- ketmar via Digitalmars-d-learn (4/23) Jan 22 2015 On Thu, 22 Jan 2015 11:40:52 +0000
- Adam D. Ruppe (6/7) Jan 22 2015 I'm sure it'd fail the phobos review process though. But since it
- ketmar (5/12) Jan 22 2015 yes, but that's one more file to download. if it's in Phobos, i can just...
- Suliman (5/5) Jan 25 2015 Adam, I understood how to select URLs, but how extract values of
- Adam D. Ruppe (2/3) Jan 25 2015 string s = element.rel;
- Suliman (8/11) Jan 26 2015 Do you mean something like this?
- Adam D. Ruppe (6/7) Jan 26 2015 I just mean to get the value of the rel="something" attribute, use
- Suliman (3/11) Jan 26 2015 But I need to query all data from this field. I do not know what
- Adam D. Ruppe (13/15) Jan 22 2015 Something to remember btw is this also works in browser
I'm trying to figure out how to most easily extract structured information using Adam D Ruppe's dom.d. Typically I want the following HTML example

    ...
    <h2> <span class="mw-headline" id="H2_A">More important</span> </h2>
    <p>This is <i>important</i>.</p>
    <h2> <span class="mw-headline" id="H2_B">Less important</span> </h2>
    <p>This is not important.</p>
    ...

to be reduced to

    This is <i>important</i>.

This means that I need some kind of interface to extract all the contents of each <p> paragraph that is preceded by a <h2> heading with a specific id (say "H2_A") or content (say "More important"). How do I accomplish that?

Further, is there a way to extract only the "contents" of an Element instance, that is "Stuff" from "<p>Stuff</p>", for each Element in the return of, for example, getElementsByTagName(`p`)?
Jan 21 2015
On Wednesday, 21 January 2015 at 23:31:26 UTC, Nordlöw wrote:
> This means that I need some kind of interface to extract all the contents of each <p> paragraph that is preceded by a <h2> heading with a specific id (say "H2_A") or content (say "More important"). How do I accomplish that?

You can do that with a CSS selector like:

    document.querySelector("#H2_A + p");

or even

    document.querySelectorAll("h2 + p")

to get every P immediately following a h2. My implementation works mostly the same as in javascript, so you can read more about css selectors anywhere on the net, like https://developer.mozilla.org/en-US/docs/Web/API/Document.querySelector

> Further, is there a way to extract the "contents" only of an Element instance, that is "Stuff" from "<p>Stuff</p>" for each Element in the return of for example getElementsByTagName(`p`)?

Element.innerText returns all the plain text inside with all tags stripped out (same as the function in IE).

Element.innerHTML returns all the content inside, including tags (same as the function in all browsers).

Element.firstInnerText returns all the text up to the first tag, but then stops there (this is a custom extension).

You can call those in a regular foreach loop or with something like std.algorithm.map to get the info from an array of elements.
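The selector and accessor calls described above can be put together roughly as follows. This is only a sketch under a few assumptions: dom.d is available in the project as module arsd.dom, the helper names `paragraphAfter` and `paragraphTexts` are made up for illustration, and the sample markup puts the id on the <h2> itself rather than on an inner <span>.

```d
import arsd.dom;                 // assumption: dom.d is in the project as module arsd.dom
import std.algorithm : map;
import std.array : array;

// Hypothetical sample; unlike the HTML in the question, the id sits on the <h2> itself.
enum sampleHtml = `<html><body>`
    ~ `<h2 id="H2_A">More important</h2><p>This is <i>important</i>.</p>`
    ~ `<h2 id="H2_B">Less important</h2><p>This is not important.</p>`
    ~ `</body></html>`;

/// Inner HTML of the <p> directly following the heading with the given id,
/// or null when nothing matches.
string paragraphAfter(Document document, string headingId)
{
    auto p = document.querySelector("#" ~ headingId ~ " + p");
    return p is null ? null : p.innerHTML;
}

/// Plain text (tags stripped) of every <p> that immediately follows an <h2>.
string[] paragraphTexts(Document document)
{
    return document.querySelectorAll("h2 + p").map!(e => e.innerText).array;
}
```

Usage would look like `auto document = new Document(sampleHtml);` followed by `paragraphAfter(document, "H2_A")`. One caveat: `+` only relates siblings, so if the id sits on a <span> inside the <h2>, as in the HTML from the question, `#H2_A + p` would not be expected to match, while `h2 + p` still would.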
Jan 21 2015
On Thursday, 22 January 2015 at 02:06:16 UTC, Adam D. Ruppe wrote:
> You can do that with a CSS selector like: document.querySelector("#H2_A + p");

Brilliant! Thanks!

BTW: Would you be interested in receiving a PR for dom.d where I replace array allocations with calls to lazy ranges?
Jan 22 2015
Adam, please add some simpler docs about your parser to the site. It would also be great to create a dub package, to make it easier to include the parser in a project.
Jan 22 2015
On Thursday, 22 January 2015 at 10:14:58 UTC, Suliman wrote:
> Adam, please add more simple docs about your parser on site.

I'll post some ddoc in the next dmd release, now that dmd finally supports some way to automatically escape xml examples.

> Also it would be perfect to create dub, for easier including parser to project.

I don't use dub and don't like its requirements, so I won't do that. Someone else is free to package it though. But like I just said in the other email, it is just a single file for core functionality, so you can just download that and add it to your project as a source file.
Jan 22 2015
On Thursday, 22 January 2015 at 09:27:17 UTC, Per Nordlöw wrote:
> BTW: Would you be interested in receiving a PR for dom.d where I replace array allocations with calls to lazy ranges?

Maybe. It was on my todo list to do that for getElementsByTagName at least, which is supposed to be a live list rather than a copy of references. querySelectorAll, however, is supposed to be a copy, so I don't want that to be a range. (This is to match the W3C standard and what javascript does.)

There are lazy range functions in there btw: element.tree is a lazy range. If you combine it with stuff like std.algorithm.filter and map, etc., it'd be easy to do a bunch of them. getElementsByTagName, for example, is filter!((e) => e.tagName == want)(element.tree). So the lazy implementations could just be in those terms.

(Actually though, that's not hard to write on the spot, so maybe it should just be explained instead of adding/changing methods. It is nice that they are plain methods instead of templates now, because they can be so easily wrapped in things like script code.)
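As an illustration of the filter-over-element.tree idea, a lazy tag-name lookup might look like the sketch below. It assumes dom.d is importable as arsd.dom; the function name `elementsByTagNameLazy` and the sample markup are hypothetical and not part of dom.d.

```d
import arsd.dom;                 // assumption: dom.d is in the project as module arsd.dom
import std.algorithm : filter, map;
import std.array : array;

// Hypothetical sample document used only for this sketch.
enum sampleHtml = `<html><body>`
    ~ `<h2>More important</h2><p>This is <i>important</i>.</p>`
    ~ `<h2>Less important</h2><p>This is not important.</p>`
    ~ `</body></html>`;

/// Lazily walks the tree below `root` and yields only elements whose tag
/// name equals `want`. No array is built until the caller asks for one.
auto elementsByTagNameLazy(Element root, string want)
{
    return root.tree.filter!(e => e.tagName == want);
}

/// Example consumer: collects the plain text of each matching element.
string[] textsByTagName(Document document, string want)
{
    return elementsByTagNameLazy(document.root, want)
        .map!(e => e.innerText)
        .array;
}
```

Because the result is a range, it composes with other std.algorithm functions, and the caller decides whether to materialize it with `array`. Tag names are compared in lowercase here, which assumes the document was parsed in the default case-insensitive mode.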
Jan 22 2015
On Thursday, 22 January 2015 at 02:06:16 UTC, Adam D. Ruppe wrote:
> You can do that with a CSS selector like: document.querySelector("#H2_A + p");

What is the meaning of selectors such as `a[href]` used in

    doc.querySelectorAll(`a[href]`)

?
Jan 22 2015
On Thursday, 22 January 2015 at 11:23:49 UTC, Nordlöw wrote:
> What is the meaning of selectors such as `a[href]` used in doc.querySelectorAll(`a[href]`) ?

Select all `a` tags that have a `href` attribute.

You can also select using the attribute value too. For example, get all the text inputs in a form:

    doc.querySelectorAll(`form[name="myform"] input[type="text"]`)

dom.d is awesome!
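Both attribute selectors can be tried on a small document, roughly as sketched below. The assumptions are that dom.d is importable as arsd.dom and that the sample markup, the form field names, and the helper names `linkTargets` and `textInputNames` are invented for illustration.

```d
import arsd.dom;                 // assumption: dom.d is in the project as module arsd.dom
import std.algorithm : map;
import std.array : array;

// Hypothetical sample: one link with href, one anchor without, and a small form.
enum sampleHtml = `<html><body>`
    ~ `<a href="/docs">Docs</a><a name="top">Anchor only</a>`
    ~ `<form name="myform">`
    ~ `<input type="text" name="user"><input type="password" name="pw">`
    ~ `<input type="text" name="city">`
    ~ `</form></body></html>`;

/// href values of every <a> that actually carries an href attribute.
string[] linkTargets(Document document)
{
    return document.querySelectorAll(`a[href]`)
        .map!(a => a.getAttribute("href"))
        .array;
}

/// name attributes of the text inputs inside the form named "myform".
string[] textInputNames(Document document)
{
    return document.querySelectorAll(`form[name="myform"] input[type="text"]`)
        .map!(i => i.getAttribute("name"))
        .array;
}
```

The first selector tests only for the presence of the attribute, while the second also compares its value, so the password field and the anchor without href are skipped.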
Jan 22 2015
On Thu, 22 Jan 2015 11:40:52 +0000, Gary Willoughby via Digitalmars-d-learn wrote:
> dom.d is awesome!

i miss it in Phobos.
Jan 22 2015
On Thursday, 22 January 2015 at 16:22:14 UTC, ketmar via Digitalmars-d-learn wrote:
> i miss it in Phobos.

I'm sure it'd fail the phobos review process though. But since it is an independent file (or it + characterencodings.d for full functionality), it is easy to just download and add to your project anyway.
Jan 22 2015
On Thu, 22 Jan 2015 18:39:25 +0000, Adam D. Ruppe wrote:
> I'm sure it'd fail the phobos review process though. But since it is an independent file (or it + characterencodings.d for full functionality), it is easy to just download and add to your project anyway.

yes, but that's one more file to download. if it's in Phobos, i can just install dmd and go on writing my k00l skriptz right away. that's why i want it there. i know that it will hardly happen, but can i dream? ;-)
Jan 22 2015
Adam, I understand how to select URLs, but how do I extract the values of attributes from such a selection?

    <a href="/" class="post-tag" title="show questions tagged 'javascript'" rel="somedata">

I need to extract the data inside rel ("somedata").
Jan 25 2015
On Sunday, 25 January 2015 at 21:24:06 UTC, Suliman wrote:
> I need extract data inside rel ("somedata")

    string s = element.rel;
Jan 25 2015
On Sunday, 25 January 2015 at 22:14:09 UTC, Adam D. Ruppe wrote:
> string s = element.rel;

Do you mean something like this?

    string s = element.rel;
    foreach(row; document.querySelectorAll("a[href]"))
    {
        auto data = document.querySelectorAll(s);
        writeln(data);
    }
Jan 26 2015
On Monday, 26 January 2015 at 17:17:55 UTC, Suliman wrote:
> Do you mean something like this?

I just mean to get the value of the rel="something" attribute, use element.rel;

    assert(element.rel == "something");

The element is the thing you see in the loop and stuff. querySelectorAll returns an array of elements.
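Put as a loop, reading the attribute from each selected element could look like the sketch below. It assumes dom.d is importable as arsd.dom and uses `getAttribute("rel")` as the explicit spelling of the `element.rel` shorthand mentioned above; the helper name `relValues` and the second link in the sample are hypothetical.

```d
import arsd.dom;                 // assumption: dom.d is in the project as module arsd.dom

// Hypothetical sample: the link from the question plus a second one for contrast.
enum sampleHtml = `<html><body>`
    ~ `<a href="/" class="post-tag" title="show questions tagged 'javascript'" rel="somedata">javascript</a>`
    ~ `<a href="/d" class="post-tag" rel="otherdata">d</a>`
    ~ `<a href="/plain">no rel here</a>`
    ~ `</body></html>`;

/// Collects whatever is inside rel="..." for every <a> that has a rel attribute.
/// The value does not need to be known in advance.
string[] relValues(Document document)
{
    string[] result;
    foreach (element; document.querySelectorAll("a[rel]"))
    {
        // `element` is one matched <a>; ask it for the attribute value.
        result ~= element.getAttribute("rel");
    }
    return result;
}
```

The selector only picks which elements to visit; the attribute value is then read from each element, rather than being passed back into querySelectorAll.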
Jan 26 2015
On Monday, 26 January 2015 at 17:24:23 UTC, Adam D. Ruppe wrote:
> I just mean to get the value of the rel="something" attribute, use element.rel; assert(element.rel == "something");

But I need to query all the data from this field. I do not know what will be in the quotes. I need to select this value.
Jan 26 2015
On Thursday, 22 January 2015 at 11:40:53 UTC, Gary Willoughby wrote:
> doc.querySelectorAll(`form[name="myform"] input[type="text"]`) dom.d is awesome!

Something to remember btw is that this also works in browser JavaScript AND css itself, since IE8 and Firefox 3.5. (No need for slow, bloated jquery.)

My implementation is different in some ways but mostly compatible, including some of the more advanced features like [attr^=starts_with_this] and $= and *= and so on. Also the sibling selectors ~ and +, and so on. (Search for CSS selector info to learn more.)

dom.d also does things like :first-child, but it does NOT support :nth-of-type and a few more of those newer CSS3 things. I might add them some day, but I find this is pretty good as is.
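A few of the selectors mentioned above, applied to a small document, might look like this. The sketch assumes dom.d is importable as arsd.dom; the helper name `texts`, the sample markup, and the example.org URLs are made up for illustration.

```d
import arsd.dom;                 // assumption: dom.d is in the project as module arsd.dom
import std.algorithm : map;
import std.array : array;

// Hypothetical sample with a short list and three links.
enum sampleHtml = `<html><body>`
    ~ `<ul><li>one</li><li>two</li></ul>`
    ~ `<a href="https://example.org/a.pdf">Secure PDF</a>`
    ~ `<a href="http://example.org/b.html">Plain page</a>`
    ~ `<a href="/local/c.pdf">Local PDF</a>`
    ~ `</body></html>`;

/// Plain text of every element matched by `selector`, in document order.
string[] texts(Document document, string selector)
{
    return document.querySelectorAll(selector).map!(e => e.innerText).array;
}

// Selectors to try with texts():
//   a[href^="https"]    href starts with "https"
//   a[href$="pdf"]      href ends with "pdf"
//   a[href*="example"]  href contains "example"
//   li:first-child      an <li> that is the first child of its parent
```

For instance, `texts(document, "li:first-child")` should pick only the first list item, and the three attribute operators narrow the links by prefix, suffix, and substring respectively.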
Jan 22 2015