digitalmars.D.learn - Can I parse this kind of HTML with arsd.dom module?

Dr.No <jckj33 gmail.com> writes:
This is the module I'm speaking about: 
https://arsd-official.dpldocs.info/arsd.dom.html

So I have this HTML that not even parseGarbage() can deal with:

<a href = "https://hostname.com/?file=foo.png&foo=baa">G!</a>

There are spaces between "href", "=" and "https...", which 
make the code below fail:


	string html = get(page, client).text;
	auto document = new Document();
	document.parseGarbage(html);
	Element attEle = document.querySelector("span[id=link2]");
	Element aEle = attEle.querySelector("a");
	// If there are spaces around the "=" in href, this returns "href" rather than the link.
	string link = aEle.href;



Let's say the page's HTML looks like this:


<font color="yellow">
<h2>
	Hello, dear world!
	<span id="link2">
<a href = "https://hostname.com/?file=foo.png&foo=baa">G!</a>
	</span>
</h2>
</font>

I know the library author posts on this forum often; I hope he sees 
this and can somehow help to make it work. But if anyone else knows 
how to fix this, that would be very welcome too!
Jun 23 2018
Timoses <timosesu gmail.com> writes:
On Sunday, 24 June 2018 at 03:46:09 UTC, Dr.No wrote:
 	string html = get(page, client).text;
 	auto document = new Document();
 	document.parseGarbage(html);
 	Element attEle = document.querySelector("span[id=link2]");
 	Element aEle = attEle.querySelector("a");
 	// If there are spaces around the "=" in href, this returns "href" rather than the link.
 	string link = aEle.href;

 [...]


 <font color="yellow">
 <h2>
 	Hello, dear world!
 	<span id="link2">
 <a href = "https://hostname.com/?file=foo.png&foo=baa">G!</a>
 	</span>
 </h2>
 </font>
missing </body>

Seems to be buggy; the part of the parsed document referring to "a" looks like this:

<a "https:="&quot;https:" href="href" />G!
Jun 24 2018
Adam D. Ruppe <destructionator gmail.com> writes:
On Sunday, 24 June 2018 at 10:49:51 UTC, Timoses wrote:
 <a href = "https://hostname.com/?file=foo.png&foo=baa">G!</a>
 	</span>
 </h2>
 </font>
 missing </body>

 Seems to be buggy, the parsed document part referring to "a" looks like this:

 <a "https:="&quot;https:" href="href" />G!

It reads href as a no-content attribute (like `checked`, which becomes `checked="checked"` in XHTML style), then ignores the = as misplaced trash, then does the same with the https. So the fix is to collapse whitespace around the =...
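
Until the parser handles this itself, one possible workaround is to collapse that whitespace before the markup reaches parseGarbage(). The following is only a minimal sketch under some assumptions: `collapseAttributeEquals` is a hypothetical helper (not part of arsd.dom), it treats anything between `<` and `>` as a tag, and it only touches an `=` that is followed by a quoted value. A `>` inside an attribute value, or a quoted value that itself contains ` = "`, would confuse it.

```d
import std.regex : regex, replaceAll;

/// Hypothetical helper, not part of arsd.dom: removes whitespace around
/// any '=' that sits inside a tag and is followed by a quoted value.
string collapseAttributeEquals(string html)
{
    // Assumption: a tag is whatever lies between '<' and the next '>'.
    auto tagRe = regex(`<[^<>]+>`);
    // An '=' with optional whitespace on both sides, followed by a quote.
    auto equalsRe = regex(`\s*=\s*(?=["'])`);

    // Rewrite each tag on its own so text outside tags is left untouched.
    return replaceAll!(m => replaceAll(m.hit, equalsRe, "="))(html, tagRe);
}
```

The cleaned string could then be passed to document.parseGarbage() in place of the raw html.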
Jun 24 2018
Adam D. Ruppe <destructionator gmail.com> writes:
On Sunday, 24 June 2018 at 03:46:09 UTC, Dr.No wrote:
 I know the library author posts on this forum often; I hope he 
 sees this and can somehow help
Yeah, I'm out this week, but it shouldn't be too hard to add: the garbage attribute parser can special-case an = surrounded by spaces and just skip the spaces. I won't get to it today, but I might be able to tomorrow. Shoot me a reminder email if I don't by tomorrow night. The parser code is unbelievably bad, but the code to change is somewhere around line 450 if you wanna take a stab at it yourself.
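
To illustrate the kind of special case described here, below is a rough, standalone sketch of an attribute reader that looks past whitespace for an `=` before deciding that a name is a no-content attribute. It is not arsd.dom's actual parser code; `parseAttributes` and its behavior are assumptions made up for this example, and it takes only the attribute portion of a tag as input.

```d
import std.ascii : isWhite;

/// Hypothetical, simplified attribute reader (not arsd.dom's real code).
/// Input is the attribute part of a tag, e.g. `href = "x" checked`.
string[string] parseAttributes(string s)
{
    string[string] attrs;
    size_t i = 0;

    void skipSpaces() { while (i < s.length && isWhite(s[i])) i++; }

    while (true)
    {
        skipSpaces();
        if (i >= s.length) break;

        // Read the attribute name up to whitespace or '='.
        size_t start = i;
        while (i < s.length && !isWhite(s[i]) && s[i] != '=') i++;
        string name = s[start .. i];

        // The special case: peek past any spaces to see if an '=' follows.
        size_t save = i;
        skipSpaces();
        if (i < s.length && s[i] == '=')
        {
            i++;          // consume '='
            skipSpaces(); // and any spaces before the value
            string value;
            if (i < s.length && (s[i] == '"' || s[i] == '\''))
            {
                char quote = s[i++];
                start = i;
                while (i < s.length && s[i] != quote) i++;
                value = s[start .. i];
                if (i < s.length) i++; // consume the closing quote
            }
            else
            {
                // Unquoted value: runs until the next whitespace.
                start = i;
                while (i < s.length && !isWhite(s[i])) i++;
                value = s[start .. i];
            }
            attrs[name] = value;
        }
        else
        {
            // No '=' follows: treat it as a no-content attribute.
            i = save;
            attrs[name] = name;
        }
    }
    return attrs;
}
```

With this approach, `href = "..."` yields a normal href attribute, while a bare `checked` still becomes `checked="checked"`.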
Jun 24 2018
Adam D. Ruppe <destructionator gmail.com> writes:
On Sunday, 24 June 2018 at 03:46:09 UTC, Dr.No wrote:
 to make it work. But if anyone else knows how to fix this, that 
 would be very welcome too!
Try it now. Thanks to Sandman83 on GitHub.
Jun 25 2018