digitalmars.D.announce - dxml 0.2.0 released
- Jonathan M Davis (23/23) Feb 11 2018 dxml 0.2.0 has now been released.
- Aravinda VK (50/76) Feb 11 2018 Awesome. Just tried it now as below and it works. Thanks for this
- Chris (3/9) Feb 12 2018 Will this replace `std.xml` one day?
- rikki cattermole (3/14) Feb 12 2018 As long as DTD support is essentially non-existent, my vote will always
- Chris (5/20) Feb 12 2018 How hard would it be to add DTD support? One could take dxml and
- rikki cattermole (10/30) Feb 12 2018 From what I read in the other thread, it would require a complete
- Jonathan M Davis (42/53) Feb 12 2018 Maybe. That depends on community feedback and ultimately on the Phobos
- Chris (21/42) Feb 12 2018 I thought the same when I glanced over std.xml. There's no DTD
- rikki cattermole (10/54) Feb 12 2018 https://github.com/dlang-community/experimental.xml
- Adam D. Ruppe (11/15) Feb 12 2018 About 5 years ago (I think, I actually have the link on my other
- rikki cattermole (23/37) Feb 12 2018 It depends.
- Jonathan M Davis (38/43) Feb 12 2018 That literally cannot be done. dxml returns slices (or takeExactly's) of...
- rikki cattermole (10/18) Feb 12 2018 We are definitely not better off with just std.xml currently.
- H. S. Teoh (29/35) Feb 12 2018 And thus Phobos continues to let the perfect be the enemy of the good,
- rikki cattermole (14/57) Feb 12 2018 In other places it was said that it wasn't possible to build it on top
- Nick Sabalausky (Abscissa) (11/25) Feb 12 2018 +Several billion.
- Russel Winder (12/17) Feb 13 2018 The problem is that std.xml needs removing to make it clear there is
- Adam D. Ruppe (4/6) Feb 12 2018 I wrote one 8 years ago... though mine is more focused on HTML
- bachmeier (6/11) Feb 12 2018 Can't you simply give it a name other than std.xml that indicates
- bachmeier (2/13) Feb 12 2018 Hit send too fast. std.xml.base would be reasonable.
- Jonathan M Davis (14/29) Feb 12 2018 I have no interest in bikeshedding the name right now or even really arg...
- H. S. Teoh (32/43) Feb 12 2018 Actually, thinking about this, I'm wondering if a combination of
- rikki cattermole (4/12) Feb 12 2018 dxml 7.5k LOC
- Chris (9/12) Feb 12 2018 How could it possibly make the situation any worse than it is
- Jacob Carlborg (6/8) Feb 12 2018 I'm using std.xml in a new project right now. It's a really small
- Chris (9/18) Feb 12 2018 A few lines of code that could be replaced easily once something
- Jacob Carlborg (5/7) Feb 12 2018 Fairly easy because it's so small. I'm actually using the SAX interface
- Nick Sabalausky (Abscissa) (3/8) Feb 12 2018 4.5k LOC == "a lot worse"?
- Jonathan M Davis (22/29) Feb 12 2018 There is sometimes a tendency for folks to think that something having a...
- Nick Sabalausky (Abscissa) (10/19) Feb 12 2018 Yea, totally. Another example: mysql-native used to be one (!!) source
- Kagamin (3/11) Feb 13 2018 And it's like 2k LOC of code and 5.5k LOC of tests and docs.
- Jonathan M Davis (65/88) Feb 12 2018 The core problem is that entity references get replaced with more XML th...
- Kagamin (4/16) Feb 13 2018 Standard entities like & have the same problem, so the same
- Jonathan M Davis (27/43) Feb 13 2018 That depends on what exactly an entity reference can contain. If it can ...
- Patrick Schluter (31/82) Feb 13 2018 There's also the issue that entity references open a whole can of
- Jonathan M Davis (48/52) Feb 13 2018 Well, if dxml just passes the entity references along unparsed beyond
- Patrick Schluter (7/29) Feb 14 2018 Yikes! In any case, even if I had to implement a parser I would
- Jonathan M Davis (18/50) Feb 14 2018 Well, since folks other than me are going to use this parser, and it's e...
- H. S. Teoh (93/135) Feb 13 2018 This made me go to the W3C spec (https://www.w3.org/TR/xml/) to figure
- Chris (8/32) Feb 14 2018 Thanks for the analysis. I'd say you're right. It makes no sense
- H. S. Teoh (36/61) Feb 13 2018 AFAICT, section 4.3.2 in the spec (probably the one you're referring to)
- Kagamin (9/12) Feb 14 2018 The parser now returns raw text, entity replacement can be done
- Jonathan M Davis (13/15) Feb 14 2018 It's very difficult in general to write a parser that isn't at least a
- rikki cattermole (34/52) Feb 14 2018 See lines:
- Adrian Matoga (5/18) Feb 14 2018 `temp = input.save` is exactly what you want here, which means
- rikki cattermole (2/22) Feb 14 2018 Ah I must be thinking of ranges that support indexing.
- Jonathan M Davis (5/27) Feb 14 2018 Random access ranges are also forward ranges and would require a call to
- rikki cattermole (2/33) Feb 14 2018 Luckily in my code I can forget that ;)
- Jonathan M Davis (20/60) Feb 14 2018 wrote:
- jmh530 (3/6) Feb 15 2018 That sounds like an interesting topic for a blog post.
- Jonathan M Davis (38/41) Feb 13 2018 Well, there are plenty of folks who talk like XML is a pile of steaming ...
- nkm1 (13/20) Aug 30 2018 Bump!
- H. S. Teoh (15/31) Sep 13 2018 +1. I vote for adding dxml to Phobos.
- H. S. Teoh (49/60) Feb 12 2018 [...]
- Chris (14/20) Feb 13 2018 In this vein, if a new version of std.xml didn't offer pure and
- Jonathan M Davis (31/58) Feb 12 2018 Which was my point. The API as-is doesn't work with DTD support for thos...
- Jonathan M Davis (22/38) Feb 13 2018 XML 1.0 does not require the section - which is the main reas...
- Johannes Loher (12/14) Feb 12 2018 Thank you very much for your efforts, I really appreciate it, as
- Jonathan M Davis (9/23) Feb 12 2018 Thanks. When you do use it, please give feedback - particularly if you f...
- Jesse Phillips (9/14) Feb 23 2018 This is absolutely awesome. It is a little low level (compared to
dxml 0.2.0 has now been released. I really wasn't planning on releasing anything this quickly after announcing dxml, but when I went to start working on DOM support, it turned out to be surprisingly quick and easy to implement. So, dxml now has basic DOM support. As part of that, it became clear that dxml.parser.stax should be renamed to dxml.parser, since it's really the only parser (DOM support involves just providing a way to hold the results of the parser, not any actual parsing, and that's clear from the API rather than being an implementation detail), and it makes for a shorter import path. So, I figured that I should do a release sooner rather than later to reduce how many folks the rename ends up affecting. For this release, dxml.parser.stax is now an empty, deprecated, module that publicly imports dxml.parser, but it will be removed in 0.3.0, whenever that is released. So, the few folks who grabbed the initial release won't end up with immediate code breakage if they upgrade. One nice side effect of how I implemented DOM support is that it's trivial to get the DOM for a portion of an XML document rather than the entire thing, since it will produce a DOMEntity from any point in an EntityRange. Documentation: http://jmdavisprog.com/docs/dxml/0.2.0/ Github: https://github.com/jmdavis/dxml/tree/v0.2.0 Dub: http://code.dlang.org/packages/dxml - Jonathan M Davis
Feb 11 2018
On Monday, 12 February 2018 at 05:36:51 UTC, Jonathan M Davis wrote:
> dxml 0.2.0 has now been released. [...]

Awesome. Just tried it now as below and it works. Thanks for this library.

    import std.stdio;
    import dxml.dom;

    struct Record
    {
        string name;
        string email;
    }

    Record[] parseRecords(string xml)
    {
        Record[] records;
        auto d = parseDOM!simpleXML(xml);
        auto root = d.children[0];
        foreach(record; root.children)
        {
            auto rec = Record();
            foreach(ele; record.children)
            {
                if (ele.name == "name")
                    rec.name = ele.children[0].text;
                if (ele.name == "email")
                    rec.email = ele.children[0].text;
            }
            records ~= rec;
        }
        return records;
    }

    void main()
    {
        auto xml = "<root>\n" ~
            " <record>\n" ~
            " <name>N1</name>\n" ~
            " <email>E1</email>\n" ~
            " </record>\n" ~
            " <record>\n" ~
            " <name>N2</name>\n" ~
            " <email>E2</email>\n" ~
            " </record>\n" ~
            " <record>\n" ~
            " <email>E3</email>\n" ~
            " <name>N3</name>\n" ~
            " </record>\n" ~
            "<!--no comment -->\n" ~
            "</root>";
        auto records = parseRecords(xml);
        writeln(records);
    }
Feb 11 2018
On Monday, 12 February 2018 at 05:36:51 UTC, Jonathan M Davis wrote:
> dxml 0.2.0 has now been released. I really wasn't planning on releasing anything this quickly after announcing dxml, but when I went to start working on DOM support, it turned out to be surprisingly quick and easy to implement. So, dxml now has basic DOM support. [...]

Will this replace `std.xml` one day?
Feb 12 2018
On 12/02/2018 12:38 PM, Chris wrote:
> On Monday, 12 February 2018 at 05:36:51 UTC, Jonathan M Davis wrote:
>> dxml 0.2.0 has now been released. I really wasn't planning on releasing anything this quickly after announcing dxml, but when I went to start working on DOM support, it turned out to be surprisingly quick and easy to implement. So, dxml now has basic DOM support. [...]
>
> Will this replace `std.xml` one day?

As long as DTD support is essentially non-existent, my vote will always be no.
Feb 12 2018
On Monday, 12 February 2018 at 12:49:30 UTC, rikki cattermole wrote:
> On 12/02/2018 12:38 PM, Chris wrote:
>> On Monday, 12 February 2018 at 05:36:51 UTC, Jonathan M Davis wrote:
>>> dxml 0.2.0 has now been released. I really wasn't planning on releasing anything this quickly after announcing dxml, but when I went to start working on DOM support, it turned out to be surprisingly quick and easy to implement. So, dxml now has basic DOM support. [...]
>>
>> Will this replace `std.xml` one day?
>
> As long as DTD support is essentially non-existent, my vote will always be no.

How hard would it be to add DTD support? One could take dxml and extend it in order to include it in Phobos. I haven't used `std.xml` for years now. It is essentially dead and unusable atm.
Feb 12 2018
On 12/02/2018 1:51 PM, Chris wrote:
> On Monday, 12 February 2018 at 12:49:30 UTC, rikki cattermole wrote:
>> On 12/02/2018 12:38 PM, Chris wrote:
>>> Will this replace `std.xml` one day?
>>
>> As long as DTD support is essentially non-existent, my vote will always be no.
>
> How hard would it be to add DTD support? One could take dxml and extend it in order to include it in Phobos. I haven't used `std.xml` for years now. It is essentially dead and unusable atm.

From what I read in the other thread, it would require a complete redesign and a major performance hit.

I don't care what J.M.D. puts in his own library. We just can't advertise having an 'XML' library when we outright ignore a large portion of the specification (and one that is fairly important to real-world adoption, IMO) for no other reason than the personal opinions of the author.

Now if you want a subset as the 'default' but have full support, including DTD, as an opt-in where the only difference is how you initialize the parser, I'd be happy and so will our end users in the future.
Feb 12 2018
On Monday, February 12, 2018 12:38:51 Chris via Digitalmars-d-announce wrote:
> On Monday, 12 February 2018 at 05:36:51 UTC, Jonathan M Davis wrote:
>> dxml 0.2.0 has now been released. I really wasn't planning on releasing anything this quickly after announcing dxml, but when I went to start working on DOM support, it turned out to be surprisingly quick and easy to implement. So, dxml now has basic DOM support. [...]
>
> Will this replace `std.xml` one day?

Maybe. That depends on community feedback and ultimately on the Phobos review process. Assuming that there's support for putting it through the Phobos review process, then once I feel that it's complete enough and had enough use to make it clear that I didn't miss something critical, then I'll submit it for review. What little feedback there has been thus far has been positive, but it would be nice to get it battle-tested a bit, and there is still functionality that I need to add.

Given that std.xml needs to be replaced, I think that it would be good if dxml were able to do that, but that depends heavily on what others think of what I've done and what they think Phobos' xml solution should look like. But the way things are going, if dxml doesn't replace std.xml, I don't know that anything ever will. XML parsers are one of those things that everyone seems to want and no one seems to want to work on.

However, if folks as a whole think that Phobos' xml parser needs to support the DTD section to be acceptable, then dxml won't replace std.xml, because dxml is not going to implement DTD support. DTD support fundamentally does not fit in with dxml's design. Someone would basically have to write an entirely new parser to be able to handle it (some of dxml's internals could be reused, but they'd also have to be refactored a fair bit, and a ton of extra stuff would have to be added). Such a parser could theoretically coexist with dxml's parser, since each would provide its own advantages, but I have no plans to implement an XML parser to handle the DTD section. It's simply not worth my time or effort, and this project has already taken way more time and effort than I anticipated.

However, std.xml does not support the DTD section, and glancing over it, it doesn't look like it even handles skipping the DTD section properly (it doesn't handle the fact that '>' can appear within quoted sections within the DTD). So, dxml is not worse than std.xml in that regard, and we wouldn't lose any functionality by having dxml replace std.xml. It just wouldn't necessarily do as much as some folks might like.

My guess is that DTD support won't be a deal breaker given that std.xml doesn't support it, that std.xml has needed to be replaced for years now, and that no one else is working on replacing it, but I don't know. Disagreements over what should be done with std.json's replacement have meant that it has never been replaced even though significant work was done towards replacing it, so unfortunately, there's already precedent for a module not being replaced with something better due to disagreements over what the replacement would ideally be. So, I don't know.

- Jonathan M Davis
Feb 12 2018
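The remark above that '>' can appear within quoted sections of the DTD can be illustrated with a small sketch. This is not how dxml or std.xml skips the DTD; `doctypeEnd` is a hypothetical helper written only for this example, and it deliberately ignores comments and processing instructions inside the internal subset.

```d
import std.algorithm.searching : startsWith;

/// Hypothetical helper: returns the index just past the '>' that closes a
/// DOCTYPE declaration starting at xml[0], or -1 if xml does not start with
/// one or the declaration never closes. Comments and processing
/// instructions inside the internal subset are NOT handled.
ptrdiff_t doctypeEnd(string xml)
{
    if (!xml.startsWith("<!DOCTYPE"))
        return -1;

    char quote = 0;        // the quote character we are inside, or 0
    bool inSubset = false; // true between '[' and ']' of the internal subset

    foreach (i, char c; xml)
    {
        if (quote != 0)
        {
            if (c == quote)
                quote = 0;         // closing quote
        }
        else if (c == '"' || c == '\'')
            quote = c;             // opening quote
        else if (c == '[')
            inSubset = true;
        else if (c == ']')
            inSubset = false;
        else if (c == '>' && !inSubset)
            return cast(ptrdiff_t)(i + 1);
    }
    return -1;
}

unittest
{
    import std.string : indexOf;

    auto xml = `<!DOCTYPE root [<!ENTITY cmp "a > b">]><root/>`;

    // A naive search stops at the '>' inside the quoted entity value.
    assert(xml[xml.indexOf('>') + 1 .. $] != "<root/>");

    // The quote-aware scan finds the real end of the declaration.
    assert(xml[doctypeEnd(xml) .. $] == "<root/>");
}
```

Even a parser that does not interpret the DTD needs this kind of quote tracking to skip over it reliably.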
On Monday, 12 February 2018 at 14:04:38 UTC, Jonathan M Davis wrote:
> [...]
> However, std.xml does not support the DTD section, and glancing over it, it doesn't look like it even handles skipping the DTD section properly (it doesn't handle the fact that '>' can appear within quoted sections within the DTD). So, dxml is not worse than std.xml in that regard, and we wouldn't lose any functionality by having dxml replace std.xml. It just wouldn't necessarily do as much as some folks might like.

I thought the same when I glanced over std.xml. There's no DTD support there either and I don't think it would be a deal breaker for most users.

> My guess is that DTD support won't be a deal breaker given that std.xml doesn't support it, that std.xml has needed to be replaced for years now, and that no one else is working on replacing it, but I don't know. Disagreements over what should be done with std.json's replacement have meant that it has never been replaced even though significant work was done towards replacing it, so unfortunately, there's already precedent for a module not being replaced with something better due to disagreements over what the replacement would ideally be. So, I don't know.
>
> - Jonathan M Davis

Wasn't there a replacement module that never got past the initial review steps? Some GSoC thing or so. But I wonder if that module would be up to the latest D standards.

While one may argue that DTD support is important, I would rather have something fast and simple like dxml that covers, say, 90% of the cases than nothing. It doesn't make sense to me that we should accept the current situation, only because of some bikeshedding that concerns 10% of the use cases. After all, it's only a module, not a fundamental decision that concerns the direction D will take in the future. I think stuff like that can seriously turn off potential users.

A lot of useful things begin with one person deciding to give it a go. vibe.d, dub, DScanner and DlangUI, for example. If the creators had started bikeshedding before writing the first line of code, there would still be a flamewar about the best way to go about it - and nothing would have happened.
Feb 12 2018
On 12/02/2018 2:45 PM, Chris wrote:
> On Monday, 12 February 2018 at 14:04:38 UTC, Jonathan M Davis wrote:
>> [...]
>
> Wasn't there a replacement module that never got past the initial review steps? Some GSoC thing or so. But I wonder if that module would be up to the latest D standards.

https://github.com/dlang-community/experimental.xml

Code isn't great, and not complete yet. Author has just disappeared sadly.

> While one may argue that DTD support is important, I would rather have something fast and simple like dxml that covers, say, 90% of the cases than nothing. It doesn't make sense to me that we should accept the current situation, only because of some bikeshedding that concerns 10% of the use cases. After all, it's only a module not a fundamental decision that concerns the direction D will take in the future. I think stuff like that can seriously turn off potential users. A lot of useful things begin with one person deciding to give it a go. vibe.d, dub, DScanner and DlangUI, for example. If the creators had started bikeshedding before writing the first line of code, there would still be a flamewar about the best way to go about it - and nothing would have happened.

Everything you have mentioned is not in Phobos.

Just because something is 'good enough' does not make it 'good enough' for Phobos. In the words of Andrei "Good enough is not good enough", we need to aim higher to show what we actually can do.

Personally I find J.M.D.'s arguments quite reasonable for a third-party library, since yes it does cover 90% of the use cases.
Feb 12 2018
On Monday, 12 February 2018 at 14:54:48 UTC, rikki cattermole wrote:
> Just because something is 'good enough' does not make it 'good enough' for Phobos. In the words of Andrei "Good enough is not good enough", we need to aim higher to show what we actually can do.

About 5 years ago (I think, I actually have the link on my other computer but it is 2,000 miles away right now), Andrei said something along the lines of "without the review process, we get junk like std.json".

Ironically, that same review process may be why we still have such "junk". (actually personally, I don't hate std.json).

If std.xml is really so bad and has been for so long, surely we ought to take an opportunity to change that, even if the change isn't perfect.
Feb 12 2018
On 12/02/2018 3:08 PM, Adam D. Ruppe wrote:
> On Monday, 12 February 2018 at 14:54:48 UTC, rikki cattermole wrote:
>> Just because something is 'good enough' does not make it 'good enough' for Phobos. In the words of Andrei "Good enough is not good enough", we need to aim higher to show what we actually can do.
>
> About 5 years ago (I think, I actually have the link on my other computer but it is 2,000 miles away right now), Andrei said something along the lines of "without the review process, we get junk like std.json". Ironically, that same review process may be why we still have such "junk". (actually personally, I don't hate std.json). If std.xml is really so bad and has been for so long, surely we ought to take an opportunity to change that, even if the change isn't perfect.

It depends. The implementation does not need to be perfect or full fledged to go into experimental. But if at the start of the review process it is already well known that the public API would require a complete change to accommodate the intended goal it is unacceptable.

Take std.experimental.allocators as an example. It currently is going through a massive API change, but when it first got PR'd, did we know that we should be RC'ing allocators? No of course not, otherwise we'd have done it.

At this point in time I cannot say that dxml in good faith serves to represent the XML specification for the D community in full.

This is unfortunately not about bike shedding. It is one thing to bike shed features, but when scope does not match the intended goal, we have got to be careful about what goes into Phobos.

All J.M.D. has to do to change this, is make the API match the spec (as close as possible, without writing another parser) and separate out the implementation into a different and very clear module (probably a sub package) which states clearly that it is a subset with the full grammar listed that it supports. That way everybody is clear and we can later on get a full implementation as part of taking it out of experimental :)
Feb 12 2018
On Monday, February 12, 2018 15:26:24 rikki cattermole via Digitalmars-d-announce wrote:
> All J.M.D. has to do to change this, is make the API match the spec (as close as possible, without writing another parser) and separate out the implementation into a different and very clear module (probably a sub package) which states clearly that it is a subset with the full grammar listed that it supports.

That literally cannot be done. dxml returns slices (or takeExactly's) of the original input. For it to do otherwise would harm performance and usability, but in order to implement full DTD support, it's impossible to return slices of the original input in the general case, because you have to be able to mutate the data whenever entity references get involved.

If the API were entirely string-based, then whether the implementation returned slices or newly allocated strings could be an implementation detail, but as soon as you're dealing with arbitrary ranges of characters, that doesn't work. At that point, you're forced to either return strings for everything (which means allocating for any ranges that aren't strings) or to return a lazy range of characters and thus can't return the original type. And that means that if you pass it a string, you're stuck with a lazy range out the other end instead of a string, and to get a string again, you have to allocate, whereas with what I have now, the parser does almost no allocations, and as long as the input type supports slicing, you get exactly the same type out the other end, which is a huge usability improvement IMHO.

So, you can't have DTD support with the kind of API that dxml has, and changing the API to something that could work with DTD support would harm the parser for all of the cases where DTD support is unnecessary.

Even if I were going to implement full DTD support, I would do it with another parser, not change the parser that dxml already has. And if dxml ends up in Phobos with the parser that it has, that doesn't prevent another parser from being added for the DTD case later if someone actually decides to put in the time and effort to do it. Either way, for any XML document that doesn't need DTD support, the way that dxml does things is more efficient and user-friendly than one that had DTD support would be, much as that obviously doesn't cut it for those documents that do need DTD support.

In any case, I'm going to finish implementing dxml without any kind of DTD support and then see how things go as far as the Phobos review process goes. If dxml gets rejected, because the majority of folks think that we're better off with std.xml (or no xml parser at all in Phobos) than one that doesn't have DTD support, then oh well. That sucks, but anyone who wants dxml can then use it as a 3rd party library. I think that the D community would be worse off because of that, but it's not ultimately my decision to make, and either way, I have the parser that I need.

- Jonathan M Davis
Feb 12 2018
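The trade-off described above (slices or takeExactly results of the input versus newly allocated text once entity references must be replaced) can be sketched with Phobos alone. This is not dxml's code: `firstN` and `expandAmp` are hypothetical helpers, and `expandAmp` stands in for real entity replacement by handling only `&amp;`.

```d
import std.array : replace;
import std.range : takeExactly;
import std.range.primitives : ElementType, hasSlicing, isForwardRange;
import std.traits : isDynamicArray, isSomeChar;

/// Hypothetical helper: returns the first n elements of input without
/// copying them. Arrays and sliceable ranges come back as the same type;
/// other forward ranges come back wrapped by takeExactly.
auto firstN(R)(R input, size_t n)
if (isForwardRange!R && isSomeChar!(ElementType!R))
{
    static if (isDynamicArray!R || hasSlicing!R)
        return input[0 .. n];
    else
        return takeExactly(input.save, n);
}

/// Hypothetical stand-in for entity replacement: only handles "&amp;".
/// The result can no longer be a slice of the input when a match exists.
string expandAmp(string text)
{
    return text.replace("&amp;", "&");
}

unittest
{
    import std.algorithm : equal, filter;

    string xml = "<a>x &amp; y</a>";

    // String in, string out, pointing at the same memory.
    auto s = firstN(xml, 3);
    static assert(is(typeof(s) == string));
    assert(s == "<a>" && s.ptr is xml.ptr);

    // A non-sliceable range comes back as a wrapper type instead.
    auto lazyInput = xml.filter!(c => c != '\r');
    auto t = firstN(lazyInput, 3);
    static assert(!is(typeof(t) == typeof(lazyInput)));
    assert(t.equal("<a>"));

    // Replacing an entity reference forces a new allocation.
    auto text = xml[3 .. 12];
    auto expanded = expandAmp(text);
    assert(expanded == "x & y");
    assert(expanded.ptr !is text.ptr);
}
```

The sketch only shows why the return type and allocation behaviour change once text has to be rewritten; it makes no claim about how a DTD-aware parser would actually be structured.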
On 12/02/2018 3:50 PM, Jonathan M Davis wrote:
> In any case, I'm going to finish implementing dxml without any kind of DTD support and then see how things go as far as the Phobos review process goes. If dxml gets rejected, because the majority of folks think that we're better off with std.xml (or no xml parser at all in Phobos) than one that doesn't have DTD support, then oh well. That sucks, but anyone who wants dxml can then use it as a 3rd party library. I think that the D community would be worse off because of that, but it's not ultimately my decision to make, and either way, I have the parser that I need.

We are definitely not better off with just std.xml currently. The problem comes from the word currently. By going into Phobos, even if experimental, it's going to be around for a while in some form or another. So we need to invest a decent amount of time into not creating more problems for new users expecting the world and not getting it.

If somebody (say a student?) were to write up a proper API and use dxml as a basis for a simpler parser, now that could be a worthwhile project and definitely could go into Phobos. I may even consider doing it at some point in the future.
Feb 12 2018
On Mon, Feb 12, 2018 at 02:54:48PM +0000, rikki cattermole via Digitalmars-d-announce wrote:
[...]
> Everything you have mentioned is not in Phobos. Just because something is 'good enough' does not make it 'good enough' for Phobos. In the words of Andrei "Good enough is not good enough", we need to aim higher to show what we actually can do.

And thus Phobos continues to let the perfect be the enemy of the good, and 10 years later std.xml will still be around, and we will still be arguing over how to replace it.

> Personally I find J.M.D. arguments quite reasonable for a third-party library, since yes it does cover 90% of the use cases.

As I have just said in another post, dxml itself does not need to be changed to implement DTD support. It's perfectly possible to write a wrapper on top of it that *does* implement DTD support. In fact, I dare say it might be possible to lazily switch from a thin wrapper over dxml to full DTD mode, so that end users don't even need to care about the difference if they don't care to.

As far as API is concerned, it could be as simple as something like:

    auto parseXml(R, DtdSupport = dtdSupport.true)(R input)
        if (...)
    {
        static if (DtdSupport)
            return dtdWrapper(dxmlParse(input));
        else
            return dxmlParse(input);
    }

Then just note in the documentation that turning off DTD support would provide extra features X, Y, and Z (speed, slices, whatever). Then let the user choose.

Seriously, I would have thought something like this would be obvious to programmers of the calibre found on these forums. I'm a little astonished that this would even be such a point of contention in the first place, since the solution is so simple.

T

--
Many open minds should be closed for repairs. -- K5 user
Feb 12 2018
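The opt-in switch proposed above can be written as compilable D using `std.typecons.Flag`. This sketch shows only the compile-time selection mechanism; `plainParse` and `dtdWrapper` are hypothetical stand-ins (the latter expands one hard-coded entity), and whether real DTD support can be layered on dxml this way is exactly what the thread disputes.

```d
import std.array : replace;
import std.typecons : Flag, No, Yes;

alias DtdSupport = Flag!"dtdSupport";

/// Hypothetical stand-in for the fast parser: hands back the input untouched.
string plainParse(string input)
{
    return input;
}

/// Hypothetical stand-in for a DTD layer: expands one hard-coded entity
/// reference, which requires producing new text.
string dtdWrapper(string parsed)
{
    return parsed.replace("&author;", "Jane Doe");
}

/// The flag is a template parameter, so the unused branch is never compiled
/// and the two modes are free to return different types.
auto parseXml(DtdSupport dtd = Yes.dtdSupport)(string input)
{
    static if (dtd)
        return dtdWrapper(plainParse(input));
    else
        return plainParse(input);
}

unittest
{
    auto xml = "<by>&author;</by>";

    // Default: the (stand-in) DTD layer rewrites the text.
    assert(parseXml(xml) == "<by>Jane Doe</by>");

    // Opt out: the very same slice comes back, nothing is allocated.
    assert(parseXml!(No.dtdSupport)(xml) is xml);
}
```

Because the choice is made at compile time, callers that opt out pay nothing for the DTD path, which matches the "then let the user choose" idea in the post.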
On 12/02/2018 10:02 PM, H. S. Teoh wrote:
> On Mon, Feb 12, 2018 at 02:54:48PM +0000, rikki cattermole via Digitalmars-d-announce wrote:
> [...]
>> Personally I find J.M.D. arguments quite reasonable for a third-party library, since yes it does cover 90% of the use cases.
>
> As I have just said in another post, dxml itself does not need to be changed to implement DTD support. It's perfectly possible to write a wrapper on top of it that *does* implement DTD support. In fact, I dare say it might be possible to lazily switch from a thin wrapper over dxml to full DTD mode, so that end users don't even need to care about the difference if they don't care to.
>
> As far as API is concerned, it could be as simple as something like:
>
>     auto parseXml(R, DtdSupport = dtdSupport.true)(R input)
>         if (...)
>     {
>         static if (DtdSupport)
>             return dtdWrapper(dxmlParse(input));
>         else
>             return dxmlParse(input);
>     }
>
> Then just note in the documentation that turning off DTD support would provide extra features X, Y, and Z (speed, slices, whatever). Then let the user choose. [...]

In other places it was said that it wasn't possible to build it on top of it.

But yes, I would be expecting an entry point like you described and is something that I mentioned :)

std.experimental.xml:
- interfaces.d: interface Element {...}
- entry.d: auto parseXML(...)(...) {...}
- impl_subset:
  - dom.d ext.
- impl_full:
  - entry.d ext.
Feb 12 2018
On 02/12/2018 05:02 PM, H. S. Teoh wrote:
> On Mon, Feb 12, 2018 at 02:54:48PM +0000, rikki cattermole via Digitalmars-d-announce wrote:
> [...]
>> Everything you have mentioned is not in Phobos. Just because something is 'good enough' does not make it 'good enough' for Phobos. In the words of Andrei "Good enough is not good enough", we need to aim higher to show what we actually can do.
>
> And thus Phobos continues to let the perfect be the enemy of the good, and 10 years later std.xml will still be around, and we will still be arguing over how to replace it.

+Several billion.

Like the improved assert messages we would've had since many years ago and was implemented, done and ready to go, but it was instead thrown away because...(and here's the real kicker, considering current D climate)...because it was a fully in-library solution instead of a new compiler feature. Go figure ::eyeroll::

> Seriously, I would have thought something like this would be obvious to programmers of the calibre found on these forums. I'm a little astonished that this would even be such a point of contention in the first place, since the solution is so simple.

I would've expected so too, if it weren't that one of the top favorite activities 'round these parts is nitpicking reasonable ideas to death for stupid reasons. And, generally letting the perfect be the enemy of the good.
Feb 12 2018
On Mon, 2018-02-12 at 14:54 +0000, rikki cattermole via Digitalmars-d-announce wrote: […] Personally I find J.M.D. arguments quite reasonable for a third-party library, since yes it does cover 90% of the use cases. The problem is that std.xml needs removing to make it clear there is no good XML package in Phobos. Then people will go looking in the Dub repository. -- Russel. Dr Russel Winder t: +44 20 7585 2200 41 Buckmaster Road m: +44 7770 465 077 London SW11 1EN, UK w: www.russel.org.uk
Feb 13 2018
On Monday, 12 February 2018 at 14:04:38 UTC, Jonathan M Davis wrote:XML parsers are one of those things that everyone seems to want and no one seems to want to work on.I wrote one 8 years ago... though mine is more focused on HTML parsing, and the XML aspect is just a side effect!
Feb 12 2018
On Monday, 12 February 2018 at 14:04:38 UTC, Jonathan M Davis wrote:However, if folks as a whole think that Phobos' xml parser needs to support the DTD section to be acceptable, then dxml won't replace std.xml, because dxml is not going to implement DTD support. DTD support fundamentally does not fit in with dxml's design.Can't you simply give it a name other than std.xml that indicates it doesn't do everything related to xml? It doesn't make sense to not put it into Phobos because of the name, and that should be an easy problem to solve.
Feb 12 2018
On Monday, 12 February 2018 at 15:43:59 UTC, bachmeier wrote:On Monday, 12 February 2018 at 14:04:38 UTC, Jonathan M Davis wrote:Hit send too fast. std.xml.base would be reasonable.However, if folks as a whole think that Phobos' xml parser needs to support the DTD section to be acceptable, then dxml won't replace std.xml, because dxml is not going to implement DTD support. DTD support fundamentally does not fit in with dxml's design.Can't you simply give it a name other than std.xml that indicates it doesn't do everything related to xml? It doesn't make sense to not put it into Phobos because of the name, and that should be an easy problem to solve.
Feb 12 2018
On Monday, February 12, 2018 15:45:50 bachmeier via Digitalmars-d-announce wrote: On Monday, 12 February 2018 at 15:43:59 UTC, bachmeier wrote: I have no interest in bikeshedding the name right now or even really arguing about Phobos inclusion (I've already said more in this thread about that than I probably should have). That can be left up to the review process, which already tends to be nasty enough that it wouldn't surprise me at all if dxml doesn't get accepted. The only reason that I have any plans to try for Phobos inclusion with dxml is because std.xml needs to be replaced. If Phobos didn't have an XML parser already, I don't expect that I'd bother, since I don't think that it's all that important that a standard library have an XML parser. I just think that it's important that it not have a bad one. In general, I think that XML is the sort of thing that's perfectly fine as a 3rd party solution. - Jonathan M Davis On Monday, 12 February 2018 at 14:04:38 UTC, Jonathan M Davis wrote: Hit send too fast. std.xml.base would be reasonable. However, if folks as a whole think that Phobos' xml parser needs to support the DTD section to be acceptable, then dxml won't replace std.xml, because dxml is not going to implement DTD support. DTD support fundamentally does not fit in with dxml's design. Can't you simply give it a name other than std.xml that indicates it doesn't do everything related to xml? It doesn't make sense to not put it into Phobos because of the name, and that should be an easy problem to solve.
Feb 12 2018
On Mon, Feb 12, 2018 at 07:04:38AM -0700, Jonathan M Davis via Digitalmars-d-announce wrote: [...]However, if folks as a whole think that Phobos' xml parser needs to support the DTD section to be acceptable, then dxml won't replace std.xml, because dxml is not going to implement DTD support. DTD support fundamentally does not fit in with dxml's design.Actually, thinking about this, I'm wondering if a combination of preprocessing and/or postprocessing might make it possible to implement DTD support without needing to rewrite the guts of dxml. AIUI, dxml does parse the DTD section correctly, i.e., as an XML directive, but only doesn't look into its internal details. So one way to implement DTD support might be: - Write an auxiliary parser that's basically a wrapper around dxml, forwarding XML events to the caller, except: - If a DTD event is encountered, eagerly parse it, store DTD declarations internally for future reference. - If there's a DTD that has been seen, perform on-the-fly validation as XML events are forwarded. - In PCDATA sections, if there are entity references to the DTD, expand them, possibly inserting more XML events into the stream based on what's defined in the DTD. (This may need to reuse some dxml internals to parse XML snippets that might be contained in an entity definition, for example.) [...]However, std.xml does not support the DTD section, and glancing over it, it doesn't look like it even handles skipping the DTD section properly (it doesn't handle the fact that '>' can appear within quoted sections within the DTD). So, dxml is not worse than std.xml in that regard, and we wouldn't lose any functionality by having dxml replace std.xml. It just wouldn't necessarily do as much as some folks might like.[...] If std.xml currently does not support DTDs, then I say dxml is definitely a Phobos candidate. At the very least, it does not make the current situation worse. 
Rejecting dxml because it doesn't support DTDs is basically letting the perfect be the enemy of the good, which is something this community has been plagued with for far too long. What's worse: a std.dxml that doesn't support DTDs, or a std.xml with fundamental problems that continue to plague us for the next decade while nobody else steps up to implement a suitable replacement? T -- Ph.D. = Permanent head Damage
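The entity-expansion step of the wrapper proposed above can be sketched as follows. It assumes the DTD has already been parsed into a plain name-to-replacement table and that the text comes from a lower-level parser as a `string`. The function name and the `maxDepth` guard are inventions for this sketch, and it treats replacements as text only, so it does not cover entities whose replacement text contains markup.

```d
import std.array : appender;
import std.string : indexOf;

/// Expands `&name;` references in `text` using `entities` (name -> replacement).
/// Unknown references are copied through unchanged. `maxDepth` bounds nested
/// expansion (an assumption of this sketch, not something from the thread).
string expandEntities(string text, string[string] entities, uint maxDepth = 8)
{
    // Common case: no reference at all, so hand back the original slice
    // without allocating.
    if (text.indexOf('&') < 0)
        return text;

    auto result = appender!string();
    size_t pos = 0;
    while (pos < text.length)
    {
        auto amp = text.indexOf('&', pos);
        if (amp < 0)
        {
            result.put(text[pos .. $]);
            break;
        }
        result.put(text[pos .. amp]);

        auto semi = text.indexOf(';', amp);
        if (semi < 0)
        {
            // Unterminated reference: copy the rest verbatim.
            result.put(text[amp .. $]);
            break;
        }

        auto name = text[amp + 1 .. semi];
        if (auto replacement = name in entities)
        {
            if (maxDepth == 0)
                throw new Exception("entity nesting too deep");
            // Replacement text may itself contain references.
            result.put(expandEntities(*replacement, entities, maxDepth - 1));
        }
        else
        {
            result.put(text[amp .. semi + 1]);
        }
        pos = semi + 1;
    }
    return result.data;
}
```

Text without references comes back as the same slice, while any expansion forces a new allocation. That difference is the cost discussed in the replies that follow.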
Feb 12 2018
On 12/02/2018 3:59 PM, H. S. Teoh wrote:If std.xml currently does not support DTDs, then I say dxml is definitely a Phobos candidate. At the very least, it does not make the current situation worse. Rejecting dxml because it doesn't support DTDs is basically letting the perfect be the enemy of the good, which is something this community has been plagued with for far too long. What's worse: a std.dxml that doesn't support DTDs, or a std.xml with fundamental problems that continue to plague us for the next decade while nobody else steps up to implement a suitable replacement?dxml 7.5k LOC std.xml 3k LOC dxml would make the situation a lot worse.
Feb 12 2018
On Monday, 12 February 2018 at 16:15:54 UTC, rikki cattermole wrote: dxml 7.5k LOC std.xml 3k LOC dxml would make the situation a lot worse. How could it possibly make the situation any worse than it is now? Atm, nobody will ever use std.xml, because it is sub-standard and has no future. As others have already mentioned: a DTD parser can still be added at a later point. It's like not moving into a newly built house because the winter garden is not yet finished (and you live in Florida :)
Feb 12 2018
On 2018-02-12 17:49, Chris wrote: How could it possibly make the situation any worse than it is now? Atm, nobody will ever use std.xml, because it is sub-standard and has no future. I'm using std.xml in a new project right now. It's a really small private project that just needs to extract some data from an XML document. I started it a couple of days before dxml was announced. -- /Jacob Carlborg
Feb 12 2018
On Monday, 12 February 2018 at 19:47:09 UTC, Jacob Carlborg wrote:On 2018-02-12 17:49, Chris wrote:A few lines of code that could be replaced easily once something better is available? But who will start an important commercial project with std.xml when it says in red letters: "Warning: This module is considered out-dated and not up to Phobos' current standards. It will remain until we have a suitable replacement, but be aware that it will not remain long term." I for my part wouldn't and I'm glad there's dxml now.How could it possibly make the situation any worse than it is now? Atm, nobody will ever use std.xml, because it is sub-standard and has no future.I'm using std.xml in a new project right now. It's a really small private project that just need to extracts some data from an XML document. I started it a couple of days before dxml was announced.
Feb 12 2018
On 2018-02-12 21:19, Chris wrote:A few lines of code that could be replaced easily once something better is available?Fairly easy because it's so small. I'm actually using the SAX interface from std.xml and it quite nicely fits my needs. -- /Jacob Carlborg
Feb 12 2018
On 02/12/2018 11:15 AM, rikki cattermole wrote:dxml 7.5k LOC std.xml 3k LOC dxml would make the situation a lot worse.4.5k LOC == "a lot worse"? Uuuuhhh...WAT?
Feb 12 2018
On Monday, February 12, 2018 21:53:21 Nick Sabalausky via Digitalmars-d- announce wrote:On 02/12/2018 11:15 AM, rikki cattermole wrote:There is sometimes a tendency for folks to think that something having a lot of lines of code is bad, and there can be some truth to that. If something can be done in a simpler way, it tends to be shorter and easier to maintain, but shorter isn't always better, and simpler isn't always better - especially if that complexity is needed to get the job done. So, LOC tells you something, but what it really tells you is up for debate. And actually, well-written D code is going to have a much higher line count in general because of stuff like documentation and unit tests being in the source file. In this case, while std.xml does seem to have a fair bit of documentation, it has very little in the way of unit tests, whereas dxml has fairly thorough unit tests - maybe not quite as extreme as std.datetime, but I do tend to be thorough with unit tests. Andrei used to complain periodically about how large std.datetime was, thinking that it was way too much code, and then someone actually went to the effort of stripping out all of the comments and unit tests and whatnot to count the actual lines of code in the implementation, and it was a _way_ smaller number than the lines in the file (IIRC, it might have even been something like only 10% of the file, if that). That's what happens when you write documentation and unit tests that are thorough. - Jonathan M Davisdxml 7.5k LOC std.xml 3k LOC dxml would make the situation a lot worse.4.5k LOC == "a lot worse"? Uuuuhhh...WAT?
Feb 12 2018
On 02/12/2018 10:49 PM, Jonathan M Davis wrote: Andrei used to complain periodically about how large std.datetime was, thinking that it was way too much code, and then someone actually went to the effort of stripping out all of the comments and unit tests and whatnot to count the actual lines of code in the implementation, and it was a _way_ smaller number than the lines in the file (IIRC, it might have even been something like only 10% of the file, if that). That's what happens when you write documentation and unit tests that are thorough. Yea, totally. Another example: mysql-native used to be one (!!) source file. It was maybe a bit on the large side for a single module, but it was still workable. In the last several years, that library has grown to many times its old size. But now, I'd say that easily the majority of lines are either comments or tests. The *actual* implementation and API isn't really all that much more LOC than it used to be. The original one-module version, by contrast, was less documented and had...I don't think it even had a single test (IIRC, the now-old-and-probably-bitrotted "app.d" wasn't even there.)
Feb 12 2018
On Tuesday, 13 February 2018 at 02:53:21 UTC, Nick Sabalausky (Abscissa) wrote:On 02/12/2018 11:15 AM, rikki cattermole wrote:And it's like 2k LOC of code and 5.5k LOC of tests and docs.dxml 7.5k LOC std.xml 3k LOC dxml would make the situation a lot worse.4.5k LOC == "a lot worse"? Uuuuhhh...WAT?
Feb 13 2018
On Monday, February 12, 2018 07:59:24 H. S. Teoh via Digitalmars-d-announce wrote:On Mon, Feb 12, 2018 at 07:04:38AM -0700, Jonathan M Davis via Digitalmars-d-announce wrote: [...]The core problem is that entity references get replaced with more XML that needs to be parsed. So, they can't simply be passed on for post-processing. As I understand it, they have to be replaced while the parsing is going on. And that means that you can't do something like return slices of the original input that don't bother with the entity references and then have a separate parser take that and process it further to deal with the entity references. The first parser has to deal with them, and that means not returning slices of the original input unless you're dealing purely with strings and are willing to allocate new strings in the cases where the data needs to be mutated because of an entity reference. If we were going to stick to strings and only strings, it would be quite possible to define the API in a way that it may or may not do DTD processing, but that doesn't work with arbitrary ranges of characters, not unless you give up on returning slices of the original input, and that means harming the performance and usability for the common case in order to support DTDs. Also, anything that has the concept of "events" would be drastically different from what dxml does. dxml is completely range-based. It has no callbacks or anything of the sort, and having anything like that would complicate it considerably. There are lots of interesting things that could be done to try and deal with the DTD section, but they fundamentally don't work with returning slices of the original input unless you're only using strings. In any case, I refuse to change dxml so that it has DTD support, and I refuse to change it so that it doesn't return slices of the original input. 
If I were to do so, it would make the parser worse for any use case I care about and require a lot of time and effort on my part that I'm not willing to spend. So, if that makes it so that dxml is never included in Phobos, then so be it. Folks are free to decide to support dxml for inclusion when the time comes and free to vote it as unacceptable. Personally, I think that dxml's approach is ideal for XML that doesn't use entity references, and I'd much rather use that kind of parser regardless of whether it's in the standard library or not. I think that the D community would be far better off with std.xml being replaced by dxml, but whatever happens happens. I'd be just as fine with a decision to remove std.xml and not include dxml. I'm less fine with std.xml being left in Phobos and dxml being rejected, because std.xml has been recognized as bad, and it sure doesn't look like anyone else is going to write a replacement any time soon. I also think that dxml's approach is better for the common case than anything that supported DTDs would be, so I think that having dxml's solution in Phobos would be better for the community even if Phobos also had a solution that supported DTDs, but at this point, it looks like the options are going to be 1. std.xml stays and continues to suck. 2. std.xml gets ripped out and dxml replaces it. 3. std.xml gets ripped out and we have no xml solution in Phobos. But as it stands, it doesn't seem likely that any XML solution that supports DTDs being in Phobos is likely to happen any time soon, if ever, because AFAIK, only three people have put in any real effort towards replacing std.xml since 2010 or whenever it was that we decided it needed to be replaced. 
The first two people both disappeared into oblivion without ever finishing, and here I am with a working StAX parser (now with DOM support) and an XML writer in the works - and given how involved I am with D, I think that it's pretty unlikely that I'm disappearing anywhere short of getting hit by a bus or whatnot. So, at least I've actually put in the time and effort towards a solution and made it available, and it will almost certainly be an essentially complete solution by the time that dconf rolls around if not well before. So, I do expect that the question of Phobos inclusion will ultimately be a question of whether std.xml _ever_ gets replaced, but regardless, at least there is a solution, and it will continue to be available as a 3rd party library even if it never makes it into Phobos. - Jonathan M DavisHowever, if folks as a whole think that Phobos' xml parser needs to support the DTD section to be acceptable, then dxml won't replace std.xml, because dxml is not going to implement DTD support. DTD support fundamentally does not fit in with dxml's design.Actually, thinking about this, I'm wondering if a combination of preprocessing and/or postprocessing might make it possible to implement DTD support without needing to rewrite the guts of dxml. AIUI, dxml does parse the DTD section correctly, i.e., as an XML directive, but only doesn't look into its internal details. So one way to implement DTD support might be: - Write an auxiliary parser that's basically a wrapper around dxml, forwarding XML events to the caller, except: - If a DTD event is encountered, eagerly parse it, store DTD declarations internally for future reference. - If there's a DTD that has been seen, perform on-the-fly validation as XML events are forwarded. - In PCDATA sections, if there are entity references to the DTD, expand them, possibly inserting more XML events into the stream based on what's defined in the DTD. 
(This may need to reuse some dxml internals to parse XML snippets that might be contained in an entity definition, for example.)
Feb 12 2018
On Monday, 12 February 2018 at 16:50:16 UTC, Jonathan M Davis wrote: The core problem is that entity references get replaced with more XML that needs to be parsed. So, they can't simply be passed on for post-processing. As I understand it, they have to be replaced while the parsing is going on. And that means that you can't do something like return slices of the original input that don't bother with the entity references and then have a separate parser take that and process it further to deal with the entity references. The first parser has to deal with them, and that means not returning slices of the original input unless you're dealing purely with strings and are willing to allocate new strings in the cases where the data needs to be mutated because of an entity reference. Standard entities like &amp; have the same problem, so the same solution should work too.
Feb 13 2018
On Tuesday, February 13, 2018 15:22:32 Kagamin via Digitalmars-d-announce wrote: On Monday, 12 February 2018 at 16:50:16 UTC, Jonathan M Davis wrote: That depends on what exactly an entity reference can contain. If it can do something like put a start tag in there, and then it has to be terminated by the document putting an end tag in there or another entity reference containing an end tag, then it can't be handled after the fact like &amp; can be, since &amp; is just replaced by text. If an entity reference can't contain a start tag without a matching end tag, then sure. But I find the XML spec to be surprisingly hard to understand with regards to entity references. It's not clear to me where it's even legal to put them or not, let alone what you're allowed to put in them exactly. And I can't even really trust the XML grammar as long as entity references are involved, because the grammar in the spec is the grammar _after_ entity references have all been replaced, which I was quite dismayed to figure out. If it's 100% sure that entity references can be treated as just text and that you can't end up with stuff like start tags or end tags being inserted and messing with the parsing such that they all have to be replaced for the XML to be correctly parsed, then I have no problem passing entity references along, and a higher level parser could try to do something with them, but it's not clear to me at all that an XML document with entity references is correct enough to be parsed while not replacing the entity references with whatever XML markup they contain. I had originally passed them along with the idea that a higher level parser could do something with them, but I decided that I couldn't do that if you could do something like drop a start tag in there and change the meaning of the stuff that needs to be parsed that isn't directly in the entity reference. - Jonathan M Davis The core problem is that entity references get replaced with more XML that needs to be parsed. 
So, they can't simply be passed on for post-processing. As I understand it, they have to be replaced while the parsing is going on. And that means that you can't do something like return slices of the original input that don't bother with the entity references and then have a separate parser take that and process it further to deal with the entity references. The first parser has to deal with them, and that means not returning slices of the original input unless you're dealing purely with strings and are willing to allocate new strings in the cases where the data needs to be mutated because of an entity reference.Standard entities like & have the same problem, so the same solution should work too.
Feb 13 2018
On Tuesday, 13 February 2018 at 20:10:59 UTC, Jonathan M Davis wrote: On Tuesday, February 13, 2018 15:22:32 Kagamin via Digitalmars-d-announce wrote: There's also the issue that entity references open a whole can of worms concerning security. It is quite possible to have an exponentially growing entity replacement that can take down any parser.
<!DOCTYPE root [
 <!ELEMENT root ANY>
 <!ENTITY LOL "LOL">
 <!ENTITY LOL1 "&LOL;&LOL;&LOL;&LOL;&LOL;&LOL;&LOL;&LOL;&LOL;&LOL;">
 <!ENTITY LOL2 "&LOL1;&LOL1;&LOL1;&LOL1;&LOL1;&LOL1;&LOL1;&LOL1;&LOL1;&LOL1;">
 <!ENTITY LOL3 "&LOL2;&LOL2;&LOL2;&LOL2;&LOL2;&LOL2;&LOL2;&LOL2;&LOL2;&LOL2;">
 <!ENTITY LOL4 "&LOL3;&LOL3;&LOL3;&LOL3;&LOL3;&LOL3;&LOL3;&LOL3;&LOL3;&LOL3;">
 <!ENTITY LOL5 "&LOL4;&LOL4;&LOL4;&LOL4;&LOL4;&LOL4;&LOL4;&LOL4;&LOL4;&LOL4;">
 <!ENTITY LOL6 "&LOL5;&LOL5;&LOL5;&LOL5;&LOL5;&LOL5;&LOL5;&LOL5;&LOL5;&LOL5;">
 <!ENTITY LOL7 "&LOL6;&LOL6;&LOL6;&LOL6;&LOL6;&LOL6;&LOL6;&LOL6;&LOL6;&LOL6;">
 <!ENTITY LOL8 "&LOL7;&LOL7;&LOL7;&LOL7;&LOL7;&LOL7;&LOL7;&LOL7;&LOL7;&LOL7;">
 <!ENTITY LOL9 "&LOL8;&LOL8;&LOL8;&LOL8;&LOL8;&LOL8;&LOL8;&LOL8;&LOL8;&LOL8;">
]>
<root>&LOL9;</root>
Hope you have enough memory (this expands to 1 000 000 000 LOLs, i.e. about 3 000 000 000 characters). On Monday, 12 February 2018 at 16:50:16 UTC, Jonathan M Davis wrote: That depends on what exactly an entity reference can contain. If it can do something like put a start tag in there, and then it has to be terminated by the document putting an end tag in there or another entity reference containing an end tag, then it can't be handled after the fact like &amp; can be, since &amp; is just replaced by text. If an entity reference can't contain a start tag without a matching end tag, then sure. But I find the XML spec to be surprisingly hard to understand with regards to entity references. It's not clear to me where it's even legal to put them or not, let alone what you're allowed to put in them exactly. 
And I can't even really trust the XML gramamr as long as entity references are involved, because the gramamr in the spec is the grammar _after_ entity references have all been replaced, which I was quite dismayed to figure out. If it's 100% sure that entity references can be treated as just text and that you can't end up with stuff like start tags or end tags being inserted and messing with the parsing such that they all have to be replaced for the XML to be correctly parsed, then I have no problem passing entity references along, and a higher level parser could try to do something with them, but it's not clear to me at all that an XML document with entity references is correct enough to be parsed while not replacing the entity references with whatever XML markup they contain. I had originally passed them along with the idea that a higher level parser could do something with them, but I decided that I couldn't do that if you could do something like drop a start tag in there and change the meaning of the stuff that needs to be parsed that isn't directly in the entity reference.The core problem is that entity references get replaced with more XML that needs to be parsed. So, they can't simply be passed on for post-processing. As I understand it, they have to be replaced while the parsing is going on. And that means that you can't do something like return slices of the original input that don't bother with the entity references and then have a separate parser take that and process it further to deal with the entity references. The first parser has to deal with them, and that means not returning slices of the original input unless you're dealing purely with strings and are willing to allocate new strings in the cases where the data needs to be mutated because of an entity reference.Standard entities like & have the same problem, so the same solution should work too.
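The size of the expansion in the entity example above can be checked without materializing it. A minimal sketch of the arithmetic, assuming ten references per level and nine levels as in that DOCTYPE (the function name is made up for illustration):

```d
/// Number of copies of the innermost text produced by `levels` layers of
/// entities that each reference the previous layer `fanout` times.
ulong expandedCopies(uint levels, ulong fanout)
{
    ulong copies = 1;
    foreach (_; 0 .. levels)
        copies *= fanout;
    return copies;
}
```

`expandedCopies(9, 10)` is 1 000 000 000, and each copy is the three-character string "LOL", so the fully expanded text is roughly 3 GB from a document of under a kilobyte. A DTD-aware layer would presumably need to cap the expanded size or the nesting depth before substituting anything.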
Feb 13 2018
On Tuesday, February 13, 2018 21:18:12 Patrick Schluter via Digitalmars-d-announce wrote: There's also the issue that entity references open a whole can of worms concerning security. It is quite possible to have an exponentially growing entity replacement that can take down any parser. Well, if dxml just passes the entity references along unparsed beyond validating that the entity reference itself contains valid characters (e.g. it's not something like &.; or & by itself), then dxml would still not be replacing the entity references with anything. Any security or performance problems associated with entity references would be left up to whatever parser parsed the DTD section and then used dxml to parse the rest of the XML and replaced the entity references in dxml's parsing results with whatever they were. The big problem is how the entity references affect the parsing. If start tags can be dropped in and affect the parsing (and it's still not clear to me from the spec whether that's legal - there is a section talking about being nested properly which might indicate that that's not legal, but it's not very specific or clear), and if it's legal to do something like use an entity reference for a tag name - e.g. <&foo;>, then that's a serious problem. And problems like that are the main reason why I completely dropped any attempt to do anything with the DTD section. If entity references are only legal in the text between start and end tags and between the quotes of attribute values, and whatever they're replaced with cannot actually affect anything else in the XML document (i.e. it can't just be a start or end tag or anything like that - it has to be fully parseable on its own and not affect the parsing of the document itself), then passing them along should be fine. 
Basically, if I can change dxml so that in the places where it currently allows one of the standard entity references to be, it then also allows other entity references but passes them along without replacing them instead of throwing an XMLParsingException, and that works without having documents be screwed up due to missing start tags or something, then passing them along should be fine. But if entity references allow arbitrary enough chunks of XML, that doesn't work. It also doesn't work if entity references are allowed in places other than the text between start and end tags or within attribute values. And it's not clear to me at all what is legal in an entity reference or where exactly they're legal. The spec talks about the grammar being the grammar _after_ all of the references have been replaced, which makes the grammar rather untrustworthy, and I find the spec very hard to understand in general. Regardless, there's no risk of dxml's parser ever being changed to actually replace entity references. That doesn't work with returning slices of the original input, and it really doesn't work with a parser that's just supposed to take a range of characters and parse it. To fully handle all of the DTD stuff means actually reading files from disk or from the internet - which of course is where the security problems come in, but it also means that you're not just dealing with a parser anymore. In principle, dxml's parser should be pure (though some implementation details make it so that it isn't right now), whereas an XML parser that fully handles the DTD section could never be pure. - Jonathan M Davis
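The "validate the reference but pass it along unreplaced" check described above might look roughly like this. It is a sketch under a loud assumption: it accepts only an ASCII subset of the XML Name production, whereas the real production also allows many non-ASCII characters. The function name is hypothetical and is not dxml's API.

```d
import std.ascii : isAlpha, isAlphaNum;

/// Returns true if `s` has the shape `&Name;`, where Name is restricted to
/// an ASCII-only subset of the XML Name production (a simplification).
bool looksLikeEntityRef(string s)
{
    if (s.length < 3 || s[0] != '&' || s[$ - 1] != ';')
        return false;

    auto name = s[1 .. $ - 1];

    // First character: letter, '_' or ':'.
    if (!(isAlpha(name[0]) || name[0] == '_' || name[0] == ':'))
        return false;

    // Remaining characters additionally allow digits, '-' and '.'.
    foreach (char c; name[1 .. $])
    {
        if (!(isAlphaNum(c) || c == '_' || c == ':' || c == '-' || c == '.'))
            return false;
    }
    return true;
}
```

With this check, `&foo;` would be accepted and handed on as-is, while `&.;` and a bare `&` would be rejected, matching the two invalid cases mentioned in the post.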
Feb 13 2018
On Tuesday, 13 February 2018 at 22:00:59 UTC, Jonathan M Davis wrote:On Tuesday, February 13, 2018 21:18:12 Patrick Schluter via Digitalmars-d- announce wrote:Yikes! In any case, even if I had to implement a parser I would tend to not implement this "feature" as it sounds quite unreasonable. Only if a real need (i.e. one in the real world, not one that could be contrived out of the specs) arises would I then potentially implement the real deal.[...]Well, if dxml just passes the entity references along unparsed beyond validating that the entity reference itself contains valid characters (e.g. it's not something like &.; or & by itself), then dxml would still not be replacing the entity references with anything. Any security or performance problems associated with entity references would be left up to whatever parser parsed the DTD section and then used dxml to parse the rest of the XML and replaced the entity references in dxml's parsing results with whatever they were. The big problem is how the entity references affect the parsing. If start tags can be dropped in and affect the parsing (and it's still not clear to me from the spec whether that's legal - there is a section talking about being nested properly which might indicate that that's not legal, but it's not very specific or clear), and if it's legal to do something like use an entity reference for a tag name - e.g. <&foo;>, then that's a serious problem. And problems like that are the main reason why I completely dropped any attempt to do anything with the DTD section.
Feb 14 2018
On Wednesday, February 14, 2018 10:03:45 Patrick Schluter via Digitalmars-d-announce wrote: On Tuesday, 13 February 2018 at 22:00:59 UTC, Jonathan M Davis wrote: Well, since folks other than me are going to use this parser, and it's even potentially going to end up in D's standard library, it needs to at least be good enough to not let through invalid XML or incorrectly interpret any XML. It can potentially not support portions of the spec as long as it does so in a clear and clean manner, but it's going to have to correctly handle anything that it does handle. For better or worse, I'm the sort of person who prefers to completely implement a spec when I'm implementing one, but in this case, it wasn't really reasonable. Fortunately however, from the perspective of implementing something that's useful for me personally, the DTD section is completely unnecessary. From that perspective, processing instructions and CDATA sections are also unnecessary, since I'd never do anything with them, but I don't think that it would be reasonable to skip those, so they're implemented. And it's not like they're hard to implement support for, unlike the DTD section. - Jonathan M Davis On Tuesday, February 13, 2018 21:18:12 Patrick Schluter via Digitalmars-d-announce wrote: Yikes! In any case, even if I had to implement a parser I would tend to not implement this "feature" as it sounds quite unreasonable. Only if a real need (i.e. one in the real world, not one that could be contrived out of the specs) arises would I then potentially implement the real deal. [...] Well, if dxml just passes the entity references along unparsed beyond validating that the entity reference itself contains valid characters (e.g. it's not something like &.; or & by itself), then dxml would still not be replacing the entity references with anything. 
Any security or performance problems associated with entity references would be left up to whatever parser parsed the DTD section and then used dxml to parse the rest of the XML and replaced the entity references in dxml's parsing results with whatever they were. The big problem is how the entity references affect the parsing. If start tags can be dropped in and affect the parsing (and it's still not clear to me from the spec whether that's legal - there is a section talking about being nested properly which might indicate that that's not legal, but it's not very specific or clear), and if it's legal to do something like use an entity reference for a tag name - e.g. <&foo;>, then that's a serious problem. And problems like that are the main reason why I completely dropped any attempt to do anything with the DTD section.
Feb 14 2018
On Tue, Feb 13, 2018 at 09:18:12PM +0000, Patrick Schluter via Digitalmars-d-announce wrote:On Tuesday, 13 February 2018 at 20:10:59 UTC, Jonathan M Davis wrote:[...]This made me go to the W3C spec (https://www.w3.org/TR/xml/) to figure out what exactly is/isn't defined. I discovered to my chagrin that XML entities are a huge rabbit hole with extremely pathological behaviour that makes it almost impossible to implement in any way that's even remotely efficient. Here's a page with examples of how nasty it can get: http://www.floriankaeferboeck.at/XML/Comparison.html Here's an example given in the W3C spec itself: <?xml version='1.0'?> <!DOCTYPE test [ <!ELEMENT test (#PCDATA) > %xx; ]> <test>This sample shows a &tricky; method.</test> A correct XML parser is supposed to produce the following text as the body of the <test>...</test> tag (the grammatical error is intentional): This sample shows a error-prone method. Fortunately, there's a glimmer of hope on the horizon: in section 4.3.2 of the spec (https://www.w3.org/TR/xml/#wf-entities), it is explicitly stated: A consequence of well-formedness in general entities is that the logical and physical structures in an XML document are properly nested; no start-tag, end-tag, empty-element tag, element, comment, processing instruction, character reference, or entity reference can begin in one entity and end in another. Meaning, if I understand it correctly, that you can't have a start tag in &entity1; and its corresponding end tag in &entity2;, and then have your document contain "&entity1; &entity2;". This is because the body of the entity can only contain text or entire tags (the production "content" in the spec); an entity that contains an open tag without an end tag (or vice versa) does not match this rule and is thus illegal. So this means that we *can* use dxml as a backend to drive a DTD-supporting XML parser implementation. 
The wrapper / higher-level parser would scan the slices returned by dxml for entity references, and substitute them accordingly, which may involve handing the body of the entity to another instance of dxml to parse any tags that may be nested in there. The nastiness involving partially-formed entity references (as seen in the above examples) apparently only applies inside the DOCTYPE declaration, so AIUI this can be handled by the higher-level parser as part of replacing inline entities with their replacement text. (The higher-level parser has a pretty tall order to fill, though, because entities can refer to remote resources via URI, meaning that an innocuous-looking 5-line XML file can potentially expand to terabytes of XML tags downloaded from who knows how many external resources recursively. Not to mention a bunch of security issues like described below.)If it's 100% sure that entity references can be treated as just text and that you can't end up with stuff like start tags or end tags being inserted and messing with the parsing such that they all have to be replaced for the XML to be correctly parsed, then I have no problem passing entity references along, and a higher level parser could try to do something with them, but it's not clear to me at all that an XML document with entity references is correct enough to be parsed while not replacing the entity references with whatever XML markup they contain. I had originally passed them along with the idea that a higher level parser could do something with them, but I decided that I couldn't do that if you could do something like drop a start tag in there and change the meaning of the stuff that needs to be parsed that isn't directly in the entity reference.There's also the issue that entity references open a whole can of worms concerning security. It quite possible to have an exponential growing entity replacement that can take down any parser. 
    <!DOCTYPE root [
     <!ELEMENT root ANY>
     <!ENTITY LOL "LOL">
     <!ENTITY LOL1 "&LOL;&LOL;&LOL;&LOL;&LOL;&LOL;&LOL;&LOL;&LOL;&LOL;">
     <!ENTITY LOL2 "&LOL1;&LOL1;&LOL1;&LOL1;&LOL1;&LOL1;&LOL1;&LOL1;&LOL1;&LOL1;">
     <!ENTITY LOL3 "&LOL2;&LOL2;&LOL2;&LOL2;&LOL2;&LOL2;&LOL2;&LOL2;&LOL2;&LOL2;">
     <!ENTITY LOL4 "&LOL3;&LOL3;&LOL3;&LOL3;&LOL3;&LOL3;&LOL3;&LOL3;&LOL3;&LOL3;">
     <!ENTITY LOL5 "&LOL4;&LOL4;&LOL4;&LOL4;&LOL4;&LOL4;&LOL4;&LOL4;&LOL4;&LOL4;">
     <!ENTITY LOL6 "&LOL5;&LOL5;&LOL5;&LOL5;&LOL5;&LOL5;&LOL5;&LOL5;&LOL5;&LOL5;">
     <!ENTITY LOL7 "&LOL6;&LOL6;&LOL6;&LOL6;&LOL6;&LOL6;&LOL6;&LOL6;&LOL6;&LOL6;">
     <!ENTITY LOL8 "&LOL7;&LOL7;&LOL7;&LOL7;&LOL7;&LOL7;&LOL7;&LOL7;&LOL7;&LOL7;">
     <!ENTITY LOL9 "&LOL8;&LOL8;&LOL8;&LOL8;&LOL8;&LOL8;&LOL8;&LOL8;&LOL8;&LOL8;">
    ]>
    <root>&LOL9;</root>

Hope you have enough memory (this expands to 1 000 000 000 LOLs, i.e. 3 000 000 000 characters) [...] Yeah, after reading through relevant portions of the spec, I have to say that full DTD support is a HUGE can of worms. I tip my hat in advance to the brave soul (or poor fool :-P) who would attempt to implement the spec in full. :-D There are ways to deal with exponential entity growth, e.g., if the expansion was carried out lazily. But it's still a DOS vulnerability if the software then spins practically forever trying to traverse the huge range of stuff being churned out. Not to mention that having embedded external references is itself a security issue, particularly since the partial entity formation thing can be used to obfuscate the real URI of a referenced entity, so you could potentially trick a remote XML parser into downloading stuff from questionable sources. It could be used as a covert surveillance method, for example, or a malware delivery vector, if combined with an exploitable bug in the parser code. Or it could be used to read sensitive files (e.g., if an entity references file:///etc/passwd or some such system file). Ick. 
Ironically, the general advice I found online w.r.t XML vulnerabilities is "don't allow DTDs", "don't expand entities", "don't resolve externals", etc.. There also aren't many XML parsers out there that fully support all the features called for in the spec. IOW, this basically amounts to "just use dxml and forget about everything else". :-D Now of course, there *are* valid use cases for DTDs... but a naïve implementation of the spec is only going to end in tears. My current inclination is, just merge dxml into Phobos, then whoever dares implement DTD support can do so on top of dxml, and shoulder their own responsibility for vulnerabilities or whatever. (I mean, seriously, just for the sake of being able to say "my XML is validated" we have to implement network access, local filesystem access, a security framework, and what amounts to a sandbox to control pathological behaviour like exponentially recursive entities? And all of this, just to handle rare corner cases? That's completely ridiculous. It's an obvious design smell to me. The only thing missing from this poisonous mix is Turing completeness, which would have made XML hackers' heaven. Oh wait, on further googling, I see that XSLT *is* Turing complete. Great, just great. Now I know why I've always had this gut feeling that *something* is off about the whole XML mania.) T -- English is useful because it is a mess. Since English is a mess, it maps well onto the problem space, which is also a mess, which we call reality. Similarly, Perl was designed to be a mess, though in the nicest of all possible ways. -- Larry Wall
Feb 13 2018
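To put a number on the entity blow-up quoted above: the size of the expansion can be computed without ever building it. The following is a minimal sketch, not part of dxml or Phobos; `expandedLength` and `billionLaughs` are hypothetical helper names, and the code assumes the entity definitions are acyclic and that every `&name;` reference is terminated and defined. With the ten definitions from the post, LOL9 comes out at 10^9 copies of "LOL", which is 3 000 000 000 characters.

```d
import std.array : replicate;
import std.conv : to;

/// Length of the fully expanded replacement text of `name`, computed
/// without building the text. `defs` maps entity names to raw replacement
/// text that may itself contain `&name;` references.
/// Hypothetical helper: assumes acyclic, well-terminated, defined references.
ulong expandedLength(string name, string[string] defs, ref ulong[string] memo)
{
    if (auto cached = name in memo)
        return *cached;            // each entity is measured only once

    string text = defs[name];
    ulong total = 0;
    size_t i = 0;
    while (i < text.length)
    {
        if (text[i] == '&')
        {
            // Locate the ';' that terminates the reference.
            size_t j = i + 1;
            while (j < text.length && text[j] != ';')
                ++j;
            total += expandedLength(text[i + 1 .. j], defs, memo);
            i = j + 1;
        }
        else
        {
            ++total;               // ordinary character
            ++i;
        }
    }
    memo[name] = total;
    return total;
}

/// Rebuilds the LOL .. LOL9 definitions from the quoted example.
string[string] billionLaughs()
{
    string[string] defs;
    defs["LOL"] = "LOL";
    string prev = "LOL";
    foreach (n; 1 .. 10)
    {
        string cur = "LOL" ~ n.to!string;
        defs[cur] = replicate("&" ~ prev ~ ";", 10);   // ten references to the previous level
        prev = cur;
    }
    return defs;
}

void main()
{
    auto defs = billionLaughs();
    ulong[string] memo;
    assert(expandedLength("LOL1", defs, memo) == 30);
    assert(expandedLength("LOL9", defs, memo) == 3_000_000_000UL);
}
```

A DTD layer could run a check like this before expanding anything and refuse documents whose expansion exceeds some budget.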
On Tuesday, 13 February 2018 at 22:13:36 UTC, H. S. Teoh wrote:Ironically, the general advice I found online w.r.t XML vulnerabilities is "don't allow DTDs", "don't expand entities", "don't resolve externals", etc.. There also aren't many XML parsers out there that fully support all the features called for in the spec. IOW, this basically amounts to "just use dxml and forget about everything else". :-D Now of course, there *are* valid use cases for DTDs... but a naïve implementation of the spec is only going to end in tears. My current inclination is, just merge dxml into Phobos, then whoever dares implement DTD support can do so on top of dxml, and shoulder their own responsibility for vulnerabilities or whatever. (I mean, seriously, just for the sake of being able to say "my XML is validated" we have to implement network access, local filesystem access, a security framework, and what amounts to a sandbox to control pathological behaviour like exponentially recursive entities? And all of this, just to handle rare corner cases? That's completely ridiculous. It's an obvious design smell to me. The only thing missing from this poisonous mix is Turing completeness, which would have made XML hackers' heaven. Oh wait, on further googling, I see that XSLT *is* Turing complete. Great, just great. Now I know why I've always had this gut feeling that *something* is off about the whole XML mania.) TThanks for the analysis. I'd say you're right. It makes no sense to keep dxml from becoming std.xml's successor only because it doesn't support DTDs. Also, as I said before, if we had DTD support in std.xml, people would complain about the lack of efficiency, and the discussion about interpreting the specs correctly, implementing them 100%, complaints about the lack of security would just never end.
Feb 14 2018
On Tue, Feb 13, 2018 at 03:00:59PM -0700, Jonathan M Davis via Digitalmars-d-announce wrote: [...] The big problem is how the entity references affect the parsing. If start tags can be dropped in and affect the parsing (and it's still not clear to me from the spec whether that's legal - there is a section talking about being nested properly which might indicate that that's not legal, but it's not very specific or clear), and if it's legal to do something like use an entity reference for a tag name - e.g. <&foo;>, then that's a serious problem. And problems like that are the main reason why I completely dropped any attempt to do anything with the DTD section. AFAICT, section 4.3.2 in the spec (probably the one you're referring to) seems to be saying that you can't do that: A consequence of well-formedness in general entities is that the logical and physical structures in an XML document are properly nested; no start-tag, end-tag, empty-element tag, element, comment, processing instruction, character reference, or entity reference can begin in one entity and end in another. If entity references are only legal in the text between start and end tags and between the quotes of attribute values, and whatever they're replaced with cannot actually affect anything else in the XML document (i.e. it can't just be a start or end tag or anything like that - it has to be fully parseable on its own and not affect the parsing of the document itself), then passing them along should be fine. That's the approach I'm thinking of. [...] Regardless, there's no risk of dxml's parser ever being changed to actually replace entity references. That doesn't work with returning slices of the original input, and it really doesn't work with a parser that's just supposed to take a range of characters and parse it. 
To fully handle all of the DTD stuff means actually reading files from disk or from the internet - which of course is where the security problems come in, but it also means that you're not just dealing with a parser anymore. In principle, dxml's parser should be pure (though some implementation details make it so that it isn't right now), whereas an XML parser that fully handles the DTD section could never be pure. [...] Given the insane complexities of DTD that I'm only slowly beginning to grasp from actually reading the spec, I'm quickly adopting the opinion that dxml should remain as-is, and any DTD implementation should be layered on top. The only potential changes that might be needed are:
- provide a way to parse XML snippets that don't have a <?xml ...> declaration, so that a DTD implementation could, for example, hand an entity body over to dxml to extract any tags that may be nested in there (and if my reading of section 4.3.2 is correct, all such tags must always be closed inside the entity body, so there should be no errors produced).
- provide some way of hooking into non-default entities so that DTD-defined entities can be expanded by the DTD implementation. This could be as simple as leaving such entities untouched in the returned range, or inventing a special EntityType representing such entities (with a slice of the input containing the entity name) so that the DTD implementation can insert the replacement text.
Everything else should be handled by the DTD layer, e.g., parsing the DOCTYPE section (which is itself pretty pathological, given the actual examples in the W3C spec to this effect), expanding entities, looking up external entities, limiting recursive entity expansion, implementing a security model, etc.. T -- Why do conspiracy theories always come from the same people??
Feb 13 2018
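The "leave such entities untouched in the returned range" option from the post above implies that a higher layer has to find the references in the text it gets back. A minimal sketch of that scanning step follows, assuming the text is a plain `string` slice that has already passed well-formedness checks. `EntityRefs` is a hypothetical name, not a dxml type, and it deliberately skips character references (`&#...;`) and the five predefined entities.

```d
import std.algorithm.comparison : among, equal;
import std.string : indexOf;

/// Input range over the names of non-predefined entity references in a
/// text slice. Each name is a slice of the original text (no allocation).
/// Hypothetical helper; performs no validation of the name characters.
struct EntityRefs
{
    private string text;
    private string current;
    private bool done;

    this(string text)
    {
        this.text = text;
        popFront();                // position on the first reference, if any
    }

    @property bool empty() const { return done; }
    @property string front() const { return current; }

    void popFront()
    {
        while (true)
        {
            const amp = text.indexOf('&');
            if (amp < 0) { done = true; return; }
            const semi = text[amp .. $].indexOf(';');
            if (semi < 0) { done = true; return; }

            string name = text[amp + 1 .. amp + semi];
            text = text[amp + semi + 1 .. $];

            // Skip character references and the predefined entities.
            if (name.length && name[0] != '#'
                && !name.among("amp", "lt", "gt", "apos", "quot"))
            {
                current = name;
                return;
            }
        }
    }
}

void main()
{
    auto refs = EntityRefs("a &foo; b &amp; c &#65; d &bar;");
    assert(equal(refs, ["foo", "bar"]));
    assert(EntityRefs("no references here").empty);
}
```

Because the names are slices of the input, this step stays compatible with the slicing approach the parser itself takes; only the later substitution step has to allocate.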
On Tuesday, 13 February 2018 at 22:29:27 UTC, H. S. Teoh wrote:- provide some way of hooking into non-default entities so that DTD-defined entities can be expanded by the DTD implementation.The parser now returns raw text, entity replacement can be done by DTD processor without any modification of API. So it's good for experimental if there's incentive to maintain it, but it's purely PR problem: there's nothing wrong in having xml support in dub registry and std.xml in phobos, if phobos is ok with it, it can stay as is. It looks like EntityRange requires forward range, is it ok for a parser?
Feb 14 2018
On Wednesday, February 14, 2018 10:14:44 Kagamin via Digitalmars-d-announce wrote:It looks like EntityRange requires forward range, is it ok for a parser?It's very difficult in general to write a parser that isn't at least a forward range, because without that, you're stuck at only one character of look ahead unless you play a lot of games with putting data from the input range in a buffer so that you can keep it around to look at it again after you've looked farther ahead. Honestly, pure input ranges are borderline useless for a _lot_ of cases. It's generally only the cases where you only care about operating on each element individually irrespective of what's going on with other elements in the range that pure input ranges are really useable, and parsing definitely doesn't fall into that camp. - Jonathan M Davis
Feb 14 2018
On 14/02/2018 10:32 AM, Jonathan M Davis wrote: On Wednesday, February 14, 2018 10:14:44 Kagamin via Digitalmars-d-announce wrote: See lines:
- Input!IR temp = input;
- input = temp;

    bool commentLine() {
        Input!IR temp = input;

        if (!temp.empty && temp.front.c == '/') {
            temp.popFront;
            if (!temp.empty && temp.front.c == '/')
                temp.popFront;
            else
                return false;
        } else
            return false;

        if (!temp.empty) {
            size_t endOffset = temp.front.location.fileOffset;

            while(temp.front.location.lineOffset != 0) {
                endOffset = temp.front.location.fileOffset;
                temp.popFront;
                if (temp.empty) {
                    endOffset++;
                    break;
                }
            }

            current.type = Token.Type.Comment_Line;
            current.location = input.front.location;
            current.location.length = endOffset - input.front.location.fileOffset;

            input = temp;
            return true;
        } else
            return false;
    }

It looks like EntityRange requires forward range, is it ok for a parser? It's very difficult in general to write a parser that isn't at least a forward range, because without that, you're stuck at only one character of look ahead unless you play a lot of games with putting data from the input range in a buffer so that you can keep it around to look at it again after you've looked farther ahead. Honestly, pure input ranges are borderline useless for a _lot_ of cases. It's generally only the cases where you only care about operating on each element individually irrespective of what's going on with other elements in the range that pure input ranges are really useable, and parsing definitely doesn't fall into that camp. - Jonathan M Davis
Feb 14 2018
On Wednesday, 14 February 2018 at 10:57:26 UTC, rikki cattermole wrote:See lines: - Input!IR temp = input; - input = temp; bool commentLine() { Input!IR temp = input; (...) if (!temp.empty) { (...) input = temp; return true; } else return false; }`temp = input.save` is exactly what you want here, which means forward range is required. Your example won't work for range objects with reference semantics.
Feb 14 2018
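The point about `save` and reference semantics can be shown in a few lines. This is a sketch only: `CharSource` and `skipSlashes` are hypothetical names made up for illustration, and the function is a much-reduced stand-in for the `commentLine` lookahead quoted above. With a class-based range, a plain copy merely aliases the same object, so only `save` gives the lookahead an independent position.

```d
import std.range.primitives : empty, front, isForwardRange, popFront, save;

/// A tiny forward range with reference semantics (a class).
/// Hypothetical type, for illustration only.
final class CharSource
{
    private string data;
    this(string data) { this.data = data; }

    @property bool empty() const { return data.length == 0; }
    @property char front() const { return data[0]; }
    void popFront() { data = data[1 .. $]; }
    @property CharSource save() { return new CharSource(data); }
}

static assert(isForwardRange!CharSource);

/// Consumes a leading "//" if present; otherwise leaves `input` untouched.
bool skipSlashes(R)(ref R input)
    if (isForwardRange!R)
{
    auto temp = input.save;        // independent copy used for lookahead
    foreach (_; 0 .. 2)
    {
        if (temp.empty || temp.front != '/')
            return false;          // `input` has not been advanced
        temp.popFront();
    }
    input = temp;                  // commit only after a full match
    return true;
}

void main()
{
    // Failed match on a reference-type range: position is preserved.
    auto src = new CharSource("/x");
    assert(!skipSlashes(src));
    assert(src.front == '/');

    // Without save, a "copy" is just another reference to the same object.
    auto aliased = src;
    aliased.popFront();
    assert(src.front == 'x');      // src was advanced as well

    // The same function also works for value-like ranges such as strings.
    string s = "//rest";
    assert(skipSlashes(s) && s == "rest");
}
```

Testing generic range code against a class-based range like this one is a cheap way to catch a missing `save`.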
On 14/02/2018 2:02 PM, Adrian Matoga wrote:On Wednesday, 14 February 2018 at 10:57:26 UTC, rikki cattermole wrote:Ah I must be thinking of ranges that support indexing.See lines: - Input!IR temp = input; - input = temp;           bool commentLine() {        Input!IR temp = input; (...)        if (!temp.empty) { (...)            input = temp;            return true;        } else            return false;     }`temp = input.save` is exactly what you want here, which means forward range is required. Your example won't work for range objects with reference semantics.
Feb 14 2018
On Wednesday, February 14, 2018 14:09:21 rikki cattermole via Digitalmars-d- announce wrote:On 14/02/2018 2:02 PM, Adrian Matoga wrote:Random access ranges are also forward ranges and would require a call to save here. - Jonathan M DavisOn Wednesday, 14 February 2018 at 10:57:26 UTC, rikki cattermole wrote:Ah I must be thinking of ranges that support indexing.See lines: - Input!IR temp = input; - input = temp; bool commentLine() { Input!IR temp = input; (...) if (!temp.empty) { (...) input = temp; return true; } else return false; }`temp = input.save` is exactly what you want here, which means forward range is required. Your example won't work for range objects with reference semantics.
Feb 14 2018
On 14/02/2018 5:13 PM, Jonathan M Davis wrote:On Wednesday, February 14, 2018 14:09:21 rikki cattermole via Digitalmars-d- announce wrote:Luckily in my code I can forget that ;)On 14/02/2018 2:02 PM, Adrian Matoga wrote:Random access ranges are also forward ranges and would require a call to save here. - Jonathan M DavisOn Wednesday, 14 February 2018 at 10:57:26 UTC, rikki cattermole wrote:Ah I must be thinking of ranges that support indexing.See lines: - Input!IR temp = input; - input = temp; bool commentLine() { Input!IR temp = input; (...) if (!temp.empty) { (...) input = temp; return true; } else return false; }`temp = input.save` is exactly what you want here, which means forward range is required. Your example won't work for range objects with reference semantics.
Feb 14 2018
On Thursday, February 15, 2018 01:55:28 rikki cattermole via Digitalmars-d-announce wrote: On 14/02/2018 5:13 PM, Jonathan M Davis wrote: On Wednesday, February 14, 2018 14:09:21 rikki cattermole via Digitalmars-d-announce wrote: On 14/02/2018 2:02 PM, Adrian Matoga wrote: On Wednesday, 14 February 2018 at 10:57:26 UTC, rikki cattermole wrote: LOL. That's actually part of what makes writing range-based libraries so much harder to get right than simply using ranges in your program. When a piece of code is used with only a few types of ranges (or even only one type of range, as is often the case), then it's generally not very hard to write code that works just fine, but as soon as you have to worry about arbitrary ranges, you get all kinds of nonsense that you have to worry about in order to make sure that the code works correctly for any range that's passed to it. save is the classic example of something that a lot of range-based code gets wrong, because for most ranges, it really doesn't matter, but for those ranges where it does, a single missed call to save results in code that doesn't work properly. To get it right, you basically have to call save every time you pass a range to a range-based function that is not supposed to consume the range, and folks rarely get that right. Certainly, pretty much any range-based code that doesn't have unit tests which include reference-type ranges is going to be wrong for reference-type ranges. Even Phobos has had quite a few issues with that historically. - Jonathan M Davis Luckily in my code I can forget that ;) Random access ranges are also forward ranges and would require a call to save here. - Jonathan M Davis Ah I must be thinking of ranges that support indexing. See lines: - Input!IR temp = input; - input = temp; bool commentLine() { Input!IR temp = input; (...) if (!temp.empty) { (...) input = temp; return true; } else return false; } `temp = input.save` is exactly what you want here, which means forward range is required. 
Your example won't work for range objects with reference semantics.
Feb 14 2018
On Thursday, 15 February 2018 at 02:40:03 UTC, Jonathan M Davis wrote:LOL. That's actually part of what makes writing range-based libraries so much harder to get right than simply using ranges in your program. [snip]That sounds like an interesting topic for a blog post.
Feb 15 2018
On Tuesday, February 13, 2018 14:13:36 H. S. Teoh via Digitalmars-d-announce wrote: Great, just great. Now I know why I've always had this gut feeling that *something* is off about the whole XML mania.) Well, there are plenty of folks who talk like XML is a pile of steaming muck that should never be used (and then usually talk about how great JSON is). I think that basic XML is actually pretty okay - basically the subset that dxml supports, though if I were designing XML I'd take it a bit further. Personally, I'd make XML documents completely recursive - meaning that the top level is the same as any deeper level, so you could have as many element tags at the top level as you want and as much text as you want, whereas XML requires a root element and only allows stuff like processing instructions, comments, and the DOCTYPE stuff outside of the root element. I'd get rid of the <?xml...?> and <!DOCTYPE...> declarations as well as processing instructions, and I'd probably get rid of the CDATA section in favor of escaping characters with backslashes like you typically do in strings (or in JSON), and related to that, I'd get rid of the predefined entity references, making stuff like & legal. I also might get rid of empty element tags because they're annoying to deal with when parsing, but they do reduce the verbosity of the document such that they might be worth keeping. It's also tempting to get rid of the tag name on end tags, which would actually make parsing much easier, but having them helps the legibility of XML documents, and it's a bit like semicolons in D in the sense that they can help ensure that error messages refer to the right thing rather than something later in the document, so I don't know. I'd also allow all Unicode characters instead of disallowing a number of them, since it won't really matter for most documents, and then the parser doesn't need to care about them when validating. 
So, basically, you end up with start tags, end tags, and comments, with start tags optionally having attributes. backslashes would then be used for escaping stuff, and you end up with something pretty dead simple. However, as you're finding out when reading through the XML spec, the folks who created XML didn't think like that at all, and were clearly coming from a _very_ different point of view as to what an XML document was for and should contain. But as you might imagine, given my take on what XML should have been, finding out in detail what XML actually _is_ was pretty horrifying. I started dxml with the intention of fully implementing all aspects of the spec but ultimately decided that it simply wasn't worth it. - Jonathan M Davis
Feb 13 2018
On Monday, 12 February 2018 at 16:50:16 UTC, Jonathan M Davis wrote:Folks are free to decide to support dxml for inclusion when the time comes and free to vote it as unacceptable. Personally, I think that dxml's approach is ideal for XML that doesn't use entity references, and I'd much rather use that kind of parser regardless of whether it's in the standard library or not. I think that the D community would be far better off with std.xml being replaced by dxml, but whatever happens happens.Bump! I'm using dxml now, and it's a very good library. So I thought "it should be in Phobos instead of std.xml" and searched the newsgroup. Sorry for necroposting. Anyway, what I wanted to say is just take an example from Perl and call it std.xml.simple. Then people would know what to expect from it and would use it (because everyone likes simple). That would also leave a way to include std.xml.full (or some such) at some indefinite point in the future. Which is, in practice, probably never - and that's fine, because who needs DTD? screw it... Anyway, thanks for the library, Jonathan.
Aug 30 2018
On Thu, Aug 30, 2018 at 07:26:28PM +0000, nkm1 via Digitalmars-d-announce wrote:On Monday, 12 February 2018 at 16:50:16 UTC, Jonathan M Davis wrote:+1. I vote for adding dxml to Phobos. [...]Folks are free to decide to support dxml for inclusion when the time comes and free to vote it as unacceptable. Personally, I think that dxml's approach is ideal for XML that doesn't use entity references, and I'd much rather use that kind of parser regardless of whether it's in the standard library or not. I think that the D community would be far better off with std.xml being replaced by dxml, but whatever happens happens.I'm using dxml now, and it's a very good library. So I thought "it should be in Phobos instead of std.xml" and searched the newsgroup. Sorry for necroposting. Anyway, what I wanted to say is just take an example from Perl and call it std.xml.simple. Then people would know what to expect from it and would use it (because everyone likes simple). That would also leave a way to include std.xml.full (or some such) at some indefinite point in the future. Which is, in practice, probably never - and that's fine, because who needs DTD? screw it...[...] That's a good idea, actually. That will stop people who expect full DTD support from complaining that it's not supported by the standard library. I vote for adding dxml to Phobos as std.xml.simple. We can either leave std.xml as-is, or deprecate it and work on std.xml.full (or std.xml.complex, or whatever). The current state of std.xml gives a poor impression to anyone coming to D the first time and wanting to work with XML, and having std.xml.simple would be a big plus. T -- This is not a sentence.
Sep 13 2018
On Mon, Feb 12, 2018 at 09:50:16AM -0700, Jonathan M Davis via Digitalmars-d-announce wrote: [...] The core problem is that entity references get replaced with more XML that needs to be parsed. So, they can't simply be passed on for post-processing. As I understand it, they have to be replaced while the parsing is going on. And that means that you can't do something like return slices of the original input that don't bother with the entity references and then have a separate parser take that and process it further to deal with the entity references. The first parser has to deal with them, and that means not returning slices of the original input unless you're dealing purely with strings and are willing to allocate new strings in the cases where the data needs to be mutated because of an entity reference. [...] I think you missed my point. What I'm trying to say is, given the current functionality of dxml, one *can* build an XML interface that implements DTD support. Of course, some concessions obviously have to be made, such as needing to allocate memory (I don't see how else one could keep a dictionary of DTD rules / entity declarations otherwise, for example), or not being able to return only slices of the input anymore. For example, entity support pretty much means plain slices are no longer an option, because you have to perform substitution of entity definitions, so you'll have to either wrap it in some kind of lazy range that chains the entity definition to the surrounding text, or you'll have to use strings or something else. Which means you'll need to have memory allocation / slower parsing / whatever, but that's the price of DTD support. But again, the point is, basic XML parsing (without DTD support) doesn't *need* to pay this price. What's currently in dxml doesn't need to change. DTD support can be implemented in a submodule / separate module that wraps around dxml and builds DTD support on top of it. 
Put another way, we can implement DTD support *on top of* dxml this way: - Parse the XML using dxml as an initial step (this can be done lazily, or semi-lazily, as needed). - As an intermediate step, parse the DTD section, construct whatever internal state is needed to handle DTD rules, a dictionary of entity references, etc.. - Filter the output of dxml to insert whatever extra behaviour is needed to implement DTD support before handing it to the calling code, e.g., expand entity references, or implement validation and throw an exception if validation fails, etc.. *We don't need to change dxml's current API at all.* At the most, I anticipate that the only potential change needed is to expose an interface to parse XML fragments (i.e., not a complete XML document that contains an outer <xml> tag, but just some PCDATA that may contain entities or tags) so that the DTD support wrapper can use it to expand entities and insert any tags that may appear inside the entity definition. The DTD wrapper doesn't guarantee (and doesn't need to!) to return slices of the input like dxml does. I don't see that as a problem, since I can't see how anyone would be able to implement full DTD support with only slices, even independently from the way dxml is implemented right now. We can even design the DTD support wrapper to start with being just a thin wrapper around dxml, and lazily switch to full DTD mode only if a DTD section is encountered. Then user code that doesn't care to use dxml's raw API won't even need to care about the difference. T -- Curiosity kills the cat. Moral: don't be the cat.
Feb 12 2018
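The "filter the output" step described in the post above can be sketched for the simplest case, where entity replacement text is treated purely as text. This is a minimal sketch under stated assumptions, not dxml's API: `expandEntities`, `maxDepth` and `maxLength` are hypothetical names, the input is assumed to be a text slice handed back by the parser, and replacement text containing markup would still have to be parsed separately, which is the open question in this thread. The limits are there because of the exponential-expansion problem discussed earlier.

```d
import std.array : appender;
import std.exception : assertThrown, enforce;
import std.string : indexOf;

/// Replaces `&name;` references in `text` with entries from `defs`,
/// recursively. References that are not in `defs` (e.g. `&amp;`) are
/// passed through unchanged. Throws if nesting exceeds `maxDepth` or the
/// output would exceed `maxLength` characters.
/// Hypothetical helper; allocates, unlike the slicing parser underneath.
string expandEntities(string text, string[string] defs,
                      size_t maxDepth = 8, size_t maxLength = 1 << 20)
{
    auto output = appender!string();

    void emit(string s)
    {
        enforce(output.data.length + s.length <= maxLength,
                "entity expansion too large");
        output.put(s);
    }

    void expand(string s, size_t depth)
    {
        enforce(depth <= maxDepth, "entity nesting too deep");
        while (s.length)
        {
            const amp = s.indexOf('&');
            if (amp < 0) { emit(s); return; }
            emit(s[0 .. amp]);

            const semi = s[amp .. $].indexOf(';');
            enforce(semi > 0, "unterminated entity reference");

            string name = s[amp + 1 .. amp + semi];
            if (auto def = name in defs)
                expand(*def, depth + 1);           // substitute, then keep scanning
            else
                emit(s[amp .. amp + semi + 1]);    // unknown reference: pass through
            s = s[amp + semi + 1 .. $];
        }
    }

    expand(text, 0);
    return output.data;
}

void main()
{
    string[string] defs = ["who": "&greeting; world", "greeting": "hello"];
    assert(expandEntities("&who; &amp; co", defs) == "hello world &amp; co");

    // Cyclic definitions are cut off by the depth limit instead of looping.
    string[string] cyclic = ["a": "&b;", "b": "&a;"];
    assertThrown(expandEntities("&a;", cyclic));
}
```

Text without any references still costs one copy here; a real wrapper could return the original slice in that case and only allocate when a substitution actually happens.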
On Monday, 12 February 2018 at 21:51:56 UTC, H. S. Teoh wrote: [...]We can even design the DTD support wrapper to start with being just a thin wrapper around dxml, and lazily switch to full DTD mode only if a DTD section is encountered. Then user code that doesn't care to use dxml's raw API won't even need to care about the difference. TIn this vein, if a new version of std.xml didn't offer pure and fast parsing like dxml, but included DTD by default, people would complain that that was the real deal breaker (too slow, man!). Remember `autodecode`? Right. DTD inclusion should only be available on demand. Imagine you want to implement a library project where ebooks (say classics) are catalogued and presented in an ebook reader on the web (or in an app on your smart phone). It is likely that the whole DTD thing would probably be done at the cataloguing stage, but once the books are in the library most users will probably just want to go through them page by page or search for quotes etc. - and for that you'd need a fast tool like dxml with no overhead.
Feb 13 2018
On Monday, February 12, 2018 13:51:56 H. S. Teoh via Digitalmars-d-announce wrote:
> For example, entity support pretty much means plain slices are no longer an option, because you have to perform substitution of entity definitions, so you'll have to either wrap it in some kind of lazy range that chains the entity definition to the surrounding text, or you'll have to use strings or something else. Which means you'll need to have memory allocation / slower parsing / whatever, but that's the price of DTD support.

Which was my point. The API as-is doesn't work with DTD support for those very reasons.

> But again, the point is, basic XML parsing (without DTD support) doesn't *need* to pay this price. What's currently in dxml doesn't need to change. DTD support can be implemented in a submodule / separate module that wraps around dxml and builds DTD support on top of it. Put another way, we can implement DTD support *on top of* dxml this way:
>
> - Parse the XML using dxml as an initial step (this can be done lazily, or semi-lazily, as needed).
> - As an intermediate step, parse the DTD section, construct whatever internal state is needed to handle DTD rules, a dictionary of entity references, etc.
> - Filter the output of dxml to insert whatever extra behaviour is needed to implement DTD support before handing it to the calling code, e.g., expand entity references, or implement validation and throw an exception if validation fails, etc.
>
> *We don't need to change dxml's current API at all.*

I don't think that this works, because the entity references insert new XML and thus affect the parsing. And as such, you can't simply pass through the entity references to be processed by another parser. They need to be handled by the core parser, otherwise it's going to give incorrect results, not just results that need further parsing.

I'm sure that dxml's internals could be refactored so that they could be shared with another parser that did that, but unless I'm misunderstanding how entity references work, you can't use what's there now as-is and build another parser on top of it. The entity reference replacement needs to happen in the core parser.

> The DTD wrapper doesn't guarantee (and doesn't need to!) to return slices of the input like dxml does. I don't see that as a problem, since I can't see how anyone would be able to implement full DTD support with only slices, even independently from the way dxml is implemented right now.

Yeah, if I were writing a parser that handled the DTD section, I wouldn't make it deal with slices of the input like dxml does unless I decided to make it always return string, in which case, you could get slices of the original input for strings but no other range types - it's either that or using a lazy range, which would be worse if you passed strings but better for other range types. And that's the main reason that I gave up on having dxml handle the DTD section. I consider that approach unacceptable. One of the key goals for dxml was that it would be providing slices of the input and not lazy ranges or allocating new strings.

In any case, unless I misunderstand how entity references work, that would have to be its own parser and not simply a wrapper around dxml because of how the entity references affect the parsing. If I'm wrong, then great, someone else can come along later and add some sort of DTD parser on top of dxml, and if I'm right, well, then anyone who wants to do anything like that is going to need to write a new parser, but that can then coexist alongside dxml's parser just fine. Either way, I like dxml's approach and don't want to compromise what it's doing in an attempt to fully deal with DTDs.

- Jonathan M Davis
Feb 12 2018
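The two constraints argued over in the post above (entity substitution rules out plain slices, and replacement text can bring in new markup) can be illustrated with a minimal sketch that uses only Phobos. The helper `expandEntity`, the entity name `note`, and its replacement text are hypothetical and chosen for illustration; nothing here is dxml's API or a real DTD implementation.

```d
import std.algorithm.searching : canFind;
import std.array : replace;

/// Hypothetical helper: substitutes one named entity reference in `text`.
/// A real DTD-aware parser would take the replacement from the DTD.
string expandEntity(string text, string name, string replacement)
{
    return text.replace("&" ~ name ~ ";", replacement);
}

void main()
{
    string input = "<p>see &note; here</p>";

    // The text between the tags as a slice: it shares memory with `input`.
    string raw = input[3 .. $ - 4];
    assert(raw == "see &note; here");
    assert(raw.ptr == input.ptr + 3);
    assert(!raw.canFind('<'));

    // Assumed DTD declaration: <!ENTITY note "<b>the note</b>">
    string expanded = expandEntity(raw, "note", "<b>the note</b>");
    assert(expanded == "see <b>the note</b> here");

    // The expanded text is a newly allocated string, not a slice of `input`.
    assert(expanded.ptr != raw.ptr);

    // It also contains markup that the first pass over `input` never saw,
    // so it would have to be parsed again.
    assert(expanded.canFind('<'));
}
```

The sketch takes no side on whether a wrapper or a separate parser is the right design; it only shows why the expanded text can be neither a slice of the original input nor plain character data.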
On Tuesday, February 13, 2018 14:29:27 H. S. Teoh via Digitalmars-d-announce wrote:
> Given the insane complexities of DTD that I'm only slowly beginning to grasp from actually reading the spec, I'm quickly adopting the opinion that dxml should remain as-is, and any DTD implementation should be layered on top. The only potential changes that might be needed are:
>
> - provide a way to parse XML snippets that don't have a <?xml ...?> declaration, so that a DTD implementation could, for example, hand an entity body over to dxml to extract any tags that may be nested in there (and if my reading of section 4.3.2 is correct, all such tags must always be closed inside the entity body, so there should be no errors produced).

XML 1.0 does not require the <?xml...?> section - which is the main reason why dxml implements XML 1.0 and not 1.1. When working on one of my projects with std_experimental_xml, I had to keep adding the <?xml...?> declaration to the start of XML snippets in all of my tests which had to deal with sections of an XML document, and it was _really_ annoying. dxml does require that what it's given be a valid XML 1.0 document, which means that you have to have exactly one root element in what it's passed. That does limit what kinds of XML snippets you can pass it, but it will work for a lot of XML snippets as-is.

> - provide some way of hooking into non-default entities so that DTD-defined entities can be expanded by the DTD implementation. This could be as simple as leaving such entities untouched in the returned range, or invent a special EntityType representing such entities (with a slice of the input containing the entity name) so that the DTD implementation can insert the replacement text.

After having actually implemented full parsing for the entire DTD section before figuring out that references could be inserted in it just about anywhere and that the grammar in the spec is only the grammar _after_ all of the replacements were made (when I figured that out was when I gave up on DTD support), I would strongly argue in favor of simply passing along entity references as-is and leaving any and all such processing to a DTD-enabled parser.

Originally, the Config had options like SkipDTD and SkipProlog, and I even provided a way to get at the information in the <?xml...?> declaration if you wanted it, but all that just wasn't worth the extra complexity.

- Jonathan M Davis
Feb 13 2018
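As a hedged illustration of the point above about snippets: the sketch below passes dxml a fragment that has no <?xml...?> declaration but exactly one root element. It assumes dxml is available as a dub dependency and relies on `parseXML`, `EntityType`, and the entity's `type` and `name` properties as described in dxml's `dxml.parser` documentation. The helper `elementNames` and the sample markup are made up for this example.

```d
import dxml.parser;

/// Hypothetical helper: collects the names of all elements, in document order.
string[] elementNames(string xml)
{
    string[] names;
    foreach (entity; parseXML(xml))
    {
        // With the default config, <para/> is reported as elementEmpty.
        if (entity.type == EntityType.elementStart ||
            entity.type == EntityType.elementEmpty)
        {
            names ~= entity.name;
        }
    }
    return names;
}

void main()
{
    // No <?xml...?> declaration, one root element.
    auto names = elementNames("<chapter><title>One</title><para/></chapter>");
    assert(names == ["chapter", "title", "para"]);
}
```

A fragment with two top-level elements would not be a valid XML 1.0 document, so it would need to be wrapped in a single root element first.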
On Monday, 12 February 2018 at 05:36:51 UTC, Jonathan M Davis wrote:
> dxml 0.2.0 has now been released. [...]

Thank you very much for your efforts, I really appreciate it, as I have been looking for a decent XML library for quite some time. Whether or not this is a candidate for inclusion into Phobos is certainly up for debate, but as you already mentioned several times, this thread is hardly the right place for that. So instead I'd like to emphasize how much I appreciate you working on this and I am sure I am not the only one. The absence of a usable, high-quality XML library is/was a big problem for D in my opinion and it is great to see that this is finally being worked on :)
Feb 12 2018
On Monday, February 12, 2018 21:26:45 Johannes Loher via Digitalmars-d-announce wrote:
> On Monday, 12 February 2018 at 05:36:51 UTC, Jonathan M Davis wrote:
>> dxml 0.2.0 has now been released. [...]
>
> Thank you very much for your efforts, I really appreciate it, as I have been looking for a decent XML library for quite some time. Whether or not this is a candidate for inclusion into Phobos is certainly up for debate, but as you already mentioned several times, this thread is hardly the right place for that. So instead I'd like to emphasize how much I appreciate you working on this and I am sure I am not the only one. The absence of a usable, high-quality XML library is/was a big problem for D in my opinion and it is great to see that this is finally being worked on :)

Thanks. When you do use it, please give feedback - particularly if you find any problems or pain points. I definitely think that the API is solid overall, but that doesn't mean that I got it completely right, and even with all of the tests that I have, I could have missed something and ended up with a bug in the parser. I'm reasonably confident in the code quality, but that doesn't mean that I didn't miss anything.

- Jonathan M Davis
Feb 12 2018
On Monday, 12 February 2018 at 05:36:51 UTC, Jonathan M Davis wrote:
> dxml 0.2.0 has now been released.
>
> Documentation: http://jmdavisprog.com/docs/dxml/0.2.0/
> Github: https://github.com/jmdavis/dxml/tree/v0.2.0
> Dub: http://code.dlang.org/packages/dxml
>
> - Jonathan M Davis

This is absolutely awesome. It is a little low level (compared to SAX), so there is more to deal with, but having this provide a flat range makes the ordering of elements so much clearer. If I need to handle nesting then I can build that out, but if I don't I can just fly by the seat of my pants and grab the elements I want. This will definitely be my go-to for XML parsing.
Feb 23 2018