digitalmars.D - On the subject of an XML parser
- solidstate1991 (8/8) Aug 22 2022 Since the XML parsing library was removed from Phobos, I'm
- Adam D Ruppe (3/4) Aug 22 2022 You might want to look into my dom.d too, it isn't suitable for
- H. S. Teoh (5/14) Aug 22 2022 Why not use Jonathan's dxml?
- solidstate1991 (3/5) Aug 22 2022 I might have a lot of free time in the near future, so I could
- Chris Piker (10/11) Aug 22 2022 I build a project off of dxml and found it to be quite nice, but
- H. S. Teoh (13/27) Aug 22 2022 Do you need to parse xml on-the-fly, or would it work to just slurp the
- Chris Piker (7/9) Sep 13 2022 Thanks for the suggestion, though the purpose of the lib is to
- =?UTF-8?Q?Ali_=c3=87ehreli?= (14/16) Aug 24 2022 The 'cached' range adaptor I mentioned on these forums a couple of times...
- =?UTF-8?Q?Ali_=c3=87ehreli?= (9/24) Sep 12 2022 It is now available:
- Chris Piker (9/21) Sep 13 2022 Wow pretty slick, thanks! I know everyone wants the D community
- =?UTF-8?Q?Ali_=c3=87ehreli?= (4/6) Oct 02 2022 Please don't give up but give feedback if it doesn't fit your use case.
- tsbockman (8/20) Oct 03 2022 I believe it's actually **O(log(n))** amortized because
- tsbockman (5/16) Oct 03 2022 Nevermind - I forgot that the **n** when moving the old contents
- =?UTF-8?Q?Ali_=c3=87ehreli?= (7/9) Oct 03 2022 CircularBlocks does not move elements. It just allocates and adds a new
- tsbockman (12/14) Oct 03 2022 Repeatedly appending `~=` to a dynamic array, as `CircularBlocks`
- =?UTF-8?Q?Ali_=c3=87ehreli?= (24/37) Oct 03 2022 [here](https://github.com/acehreli/alid/blob/main/circularblocks/alid/ci...
- tsbockman (9/20) Oct 03 2022 I meant "elements" in the general sense of "items in an array".
- JN (11/21) Aug 22 2022 I never really understood why we have to make a new library
- Dejan Lekic (10/15) Aug 24 2022 You made my day as I need a good XML library that is as good as
- solidstate1991 (8/18) Aug 25 2022 I took a look at experimental.xml. According to its tests, it's
- solidstate1991 (23/30) Sep 01 2022 So work have begun here:
- James Blachly (3/12) Oct 02 2022 Would be nice to have XSD support; many (most?) XML libraries I've
Since the XML parsing library was removed from Phobos, I'm thinking about either getting dlang-community/experimental.xml into a usable state, or write a completely new parser. First I'd want some community input, and would like to hear from the users of lodo1995's library. Depending on some circumstances, I'll be losing my job next month, so I'll have some extra time on my hands (no money will be a tough thing), and even without that I'll try to pull it off somehow.
Aug 22 2022
On Monday, 22 August 2022 at 11:48:47 UTC, solidstate1991 wrote:First I'd want some community inputYou might want to look into my dom.d too, it isn't suitable for all xml but it does a decent job on much of it.
Aug 22 2022
On Mon, Aug 22, 2022 at 11:48:47AM +0000, solidstate1991 via Digitalmars-d wrote:Since the XML parsing library was removed from Phobos, I'm thinking about either getting dlang-community/experimental.xml into a usable state, or write a completely new parser. First I'd want some community input, and would like to hear from the users of lodo1995's library. Depending on some circumstances, I'll be losing my job next month, so I'll have some extra time on my hands (no money will be a tough thing), and even without that I'll try to pull it off somehow.Why not use Jonathan's dxml? T -- "You know, maybe we don't *need* enemies." "Yeah, best friends are about all I can take." -- Calvin & Hobbes
Aug 22 2022
On Monday, 22 August 2022 at 14:51:34 UTC, H. S. Teoh wrote:Why not use Jonathan's dxml? TI might have a lot of free time in the near future, so I could write something for Phobos.
Aug 22 2022
On Monday, 22 August 2022 at 14:51:34 UTC, H. S. Teoh wrote:Why not use Jonathan's dxml?I build a project off of dxml and found it to be quite nice, but I forgot to read the fine print (aka the template specialization). After a week of development when everything was working well, I tried to use it for parsing stdin and that's when the compiler let me know than an InputRange wasn't sufficient. Totally my fault, no knock against the author. So depending on the use case, dxml works quite well. For my own purposes I'll need to find/create a ForwardRange adapter for stdin or refactor my code to use another library.
Aug 22 2022
On Mon, Aug 22, 2022 at 10:51:44PM +0000, Chris Piker via Digitalmars-d wrote:On Monday, 22 August 2022 at 14:51:34 UTC, H. S. Teoh wrote:Do you need to parse xml on-the-fly, or would it work to just slurp the entire stdin into a buffer and then parse that? In the former case, you could probably just accumulate incoming stdin lines into a buffer and parse that (though you'll probably need a wrapper, otherwise dxml may terminate prematurely at the end of the current line). In the latter case it should be a simple matter of using std.algorithm.copy to read stdin into a buffer, which can then be parsed with dxml. T -- There's light at the end of the tunnel. It's the oncoming train.Why not use Jonathan's dxml?I build a project off of dxml and found it to be quite nice, but I forgot to read the fine print (aka the template specialization). After a week of development when everything was working well, I tried to use it for parsing stdin and that's when the compiler let me know than an InputRange wasn't sufficient. Totally my fault, no knock against the author. So depending on the use case, dxml works quite well. For my own purposes I'll need to find/create a ForwardRange adapter for stdin or refactor my code to use another library.
Aug 22 2022
On Monday, 22 August 2022 at 23:30:58 UTC, H. S. Teoh wrote:Do you need to parse xml on-the-fly, or would it work to just slurp the entire stdin into a buffer and then parse that?Thanks for the suggestion, though the purpose of the lib is to support stream based processing of very long time series datasets (> 2 TB occurs occasionally). Due to data volume, we typically work with binary formats, but there is a supported XML representation and I'd prefer to apply the same mentality when processing it so as not to break user expectations.
Sep 13 2022
On 8/22/22 15:51, Chris Piker wrote:So depending on the use case, dxml works quite well. For my own purposes I'll need to find/create a ForwardRange adapter for stdinThe 'cached' range adaptor I mentioned on these forums a couple of times and in my DConf 2022 lightning talk converts any InputRange to a ForwardRange. (It does this by evaluating the elements once; so it would be valuable with generators as well; and in fact, a generator use case was why I wrote it.) (Aside: It actually makes a RandomAccessRange because it supports opIndex as well but it does not honor O(1): It will grab 'n' elements if you say myRange[n] and if those elements are not in the cache yet.) Currently it has an assumed performance issue because it uses a regular D slice, and the way it uses the slice incurs an allocation cost per element. There are different ways of dealing with that issue but I haven't finished that yet. Ali
Aug 24 2022
On 8/24/22 08:16, Ali Çehreli wrote:On 8/22/22 15:51, Chris Piker wrote: > So depending on the use case, dxml works quite well. For my own > purposes I'll need to find/create a ForwardRange adapter for stdin The 'cached' range adaptor I mentioned on these forums a couple of times and in my DConf 2022 lightning talk converts any InputRange to a ForwardRange. (It does this by evaluating the elements once; so it would be valuable with generators as well; and in fact, a generator use case was why I wrote it.)It is now available: https://code.dlang.org/packages/alid(Aside: It actually makes a RandomAccessRange because it supports opIndex as well but it does not honor O(1): It will grab 'n' elements if you say myRange[n] and if those elements are not in the cache yet.)I realized that it is still O(1) because the seemingly unnecessarily grabbed elements would still count as "amortized" because they are readily available at O(1) for consumption of both this range and all its .save'd ranges.Currently it has an assumed performance issue because it uses a regular D slice, and the way it uses the slice incurs an allocation cost per element. There are different ways of dealing with that issue but I haven't finished that yet.I solved that by writing the `alid.circularblocks` module. Ali
Sep 12 2022
On Monday, 12 September 2022 at 09:29:11 UTC, Ali Çehreli wrote:It is now available: https://code.dlang.org/packages/alidWow pretty slick, thanks! I know everyone wants the D community to be larger, but there are some advantages to a tight group. Heck, I just got help on a ground support project from my favorite computer textbook author. Outstanding! As soon as I get back around to working on that project again I'll try out alid. Neck deep in another sprint right now which depends on dpq2. Best,(Aside: It actually makes a RandomAccessRange because itsupportsopIndex as well but it does not honor O(1): It will grab 'n'elements ifyou say myRange[n] and if those elements are not in the cacheyet.) I realized that it is still O(1) because the seemingly unnecessarily grabbed elements would still count as "amortized" because they are readily available at O(1) for consumption of both this range and all its .save'd ranges.
Sep 13 2022
On 9/13/22 19:00, Chris Piker wrote:As soon as I get back around to working on that project again I'll try out alid.Please don't give up but give feedback if it doesn't fit your use case. It desperately needs to be tested in the wild. :) Ali
Oct 02 2022
On Monday, 12 September 2022 at 09:29:11 UTC, Ali Çehreli wrote:On 8/24/22 08:16, Ali Çehreli wrote: https://code.dlang.org/packages/alidI believe it's actually **O(log(n))** amortized because `CircularBlocks.addExistingBlock_` will reallocate `blocks` and move the contents over to the new address, an **O(n)** operation, for every **O(log(n))** accesses to `ElementCache`. (This extra **O(log(n))** factor is typical for [Appender](https://dlang.org/phobos/std_array.html#Appender)-like systems.)(Aside: It actually makes a RandomAccessRange because itsupportsopIndex as well but it does not honor O(1): It will grab 'n'elements ifyou say myRange[n] and if those elements are not in the cacheyet.) I realized that it is still O(1) because the seemingly unnecessarily grabbed elements would still count as "amortized" because they are readily available at O(1) for consumption of both this range and all its .save'd ranges.
Oct 03 2022
On Monday, 3 October 2022 at 07:28:46 UTC, tsbockman wrote:On Monday, 12 September 2022 at 09:29:11 UTC, Ali Çehreli wrote:Nevermind - I forgot that the **n** when moving the old contents over is actually a different variable each time, whose average value is **O(n / log(n))**, which makes the whole thing reduce to amortized **O(1)** time per element, as you claimed.I realized that it is still O(1) because the seemingly unnecessarily grabbed elements would still count as "amortized" because they are readily available at O(1) for consumption of both this range and all its .save'd ranges.I believe it's actually **O(log(n))** amortized because `CircularBlocks.addExistingBlock_` will reallocate `blocks` and move the contents over to the new address, an **O(n)** operation, for every **O(log(n))** accesses to `ElementCache`. (This extra **O(log(n))** factor is typical for [Appender](https://dlang.org/phobos/std_array.html#Appender)-like systems.)
Oct 03 2022
On 10/3/22 00:28, tsbockman wrote:`CircularBlocks.addExistingBlock_` will reallocate `blocks` and move the contents over to the new address,CircularBlocks does not move elements. It just allocates and adds a new block to its "array of slices" queue. Algorithmic complexity does get complicated :) if the block size is too small compared to live elements. Then there would be too many small blocks to shuffle around e.g. to the back of the queue to be reused. Ali
Oct 03 2022
On Monday, 3 October 2022 at 14:46:13 UTC, Ali Çehreli wrote:CircularBlocks does not move elements. It just allocates and adds a new block to its "array of slices" queue.Repeatedly appending `~=` to a dynamic array, as `CircularBlocks` does [here](https://github.com/acehreli/alid/blob/main/circularblocks/alid/cir ularblocks.d#L422), will reallocate and move the elements over to the new memory whenever the new `.length` would exceed `.capacity`: ```D void addExistingBlock_(ubyte[] buffer) { import std.array : back; blocks ~= ReusableBlock!T(buffer); capacity_ += blocks.back.capacity; } ```
Oct 03 2022
On 10/3/22 09:12, tsbockman wrote:On Monday, 3 October 2022 at 14:46:13 UTC, Ali Çehreli wrote:[here](https://github.com/acehreli/alid/blob/main/circularblocks/alid/cir ularblocks.d#L422), will reallocate and move the elements over to the new memory whenever the new `.length` would exceed `.capacity`:CircularBlocks does not move elements. It just allocates and adds a new block to its "array of slices" queue.Repeatedly appending `~=` to a dynamic array, as `CircularBlocks` does```D void addExistingBlock_(ubyte[] buffer) { import std.array : back; blocks ~= ReusableBlock!T(buffer); capacity_ += blocks.back.capacity; } ```Yes but blocks_ does not carry elements but blocks to place elements on. CircularBlocks uses a circular buffer of blocks of elements. As elements are popped from the head, blocks that become empty are moved to the end of blocks_ array to be reused. (Here, we have O(M); if M (number of blocks) is small compared to N (number of live elements as in a sliding window), then we have just a few blocks to shuffle with std.algorithm.bringToFront.) addExistingBlock_() is called only when existing blocks are all in use. The typical use case is a sliding window of cached elements. Empty blocks at the frot are moved to the end and they get reused for new elements later. If the window width is constant, there will either be no memory allocation (if the user provided initial buffers), or some during the preamble as the number of buffers stabilize. There may be occasional new block allocations if the window enlarges depending on the application, but this high water mark helps reduce block allocations in the future. I appreciate your looking into this and I would be very happy to know about all performance issues. If my analysis is wrong above, please give me data for me to work with. :) Ali
Oct 03 2022
On Monday, 3 October 2022 at 17:10:29 UTC, Ali Çehreli wrote:Yes but blocks_ does not carry elements but blocks to place elements on.I meant "elements" in the general sense of "items in an array". The blocks are stored in an array, and hence are "elements".Here, we have O(M); if M (number of blocks) is small compared to N (number of live elements as in a sliding window), then we have just a few blocks to shuffle with std.algorithm.bringToFront.) addExistingBlock_() is called only when existing blocks are all in use. The typical use case is a sliding window of cached elements. Empty blocks at the frot are moved to the end and they get reused for new elements later.I don't think this changes the big O run time analysis. It sounds like, at least in the worst case, **M** is proportional to **N**, meaning that for big O purposes **M** is **N**. Regardless, it would still be **O(1)** amortized time like you said before. To change that **M** would have to be greater than **N**, which never happens.
Oct 03 2022
On Monday, 22 August 2022 at 11:48:47 UTC, solidstate1991 wrote:Since the XML parsing library was removed from Phobos, I'm thinking about either getting dlang-community/experimental.xml into a usable state, or write a completely new parser. First I'd want some community input, and would like to hear from the users of lodo1995's library. Depending on some circumstances, I'll be losing my job next month, so I'll have some extra time on my hands (no money will be a tough thing), and even without that I'll try to pull it off somehow.I never really understood why we have to make a new library instead of just fixing std.xml. I found std.xml to be the easiest to use out of all D libraries, but my understanding was it's not up to par performance wise.I might have a lot of free time in the near future, so I could write something for Phobos.Honestly, that would probably end up as std.experimental.xml2. I think it'd be very hard to get a new library into Phobos. As soon as you open yourself to comments, everyone will have their own idea of what a perfect XML library would look like. And after that, there will be a struggle whether it should use exceptions or not, GC or no GC, or maybe even betterC.
Aug 22 2022
On Monday, 22 August 2022 at 11:48:47 UTC, solidstate1991 wrote:First I'd want some community input, and would like to hear from the users of lodo1995's library. Depending on some circumstances, I'll be losing my job next month, so I'll have some extra time on my hands (no money will be a tough thing), and even without that I'll try to pull it off somehow.You made my day as I need a good XML library that is as good as Python's xml (https://docs.python.org/3/library/xml.html) package. You are absolutely right, https://github.com/dlang-community/experimental.xml is a good starting point. See what is missing, as well as what can be improved. It is a pity that package did not get finished... I gave up using D for any XML processing long ago but perhaps your library will be good enough for some of my future XML processing tasks.
Aug 24 2022
On Wednesday, 24 August 2022 at 17:15:42 UTC, Dejan Lekic wrote:You made my day as I need a good XML library that is as good as Python's xml (https://docs.python.org/3/library/xml.html) package. You are absolutely right, https://github.com/dlang-community/experimental.xml is a good starting point. See what is missing, as well as what can be improved. It is a pity that package did not get finished... I gave up using D for any XML processing long ago but perhaps your library will be good enough for some of my future XML processing tasks.I took a look at experimental.xml. According to its tests, it's biggest issue is that it accepts malformed documents. I'll attempt to reverse-engineer the code, then add the necessary checks to reject the malformed documents. Since it has multiple options for allocators (stdx-allocator), it'll be a bit of a challenge, but at worst I can strip that function and replace it with GC only.
Aug 25 2022
On Thursday, 25 August 2022 at 19:41:19 UTC, solidstate1991 wrote:I took a look at experimental.xml. According to its tests, it's biggest issue is that it accepts malformed documents. I'll attempt to reverse-engineer the code, then add the necessary checks to reject the malformed documents. Since it has multiple options for allocators (stdx-allocator), it'll be a bit of a challenge, but at worst I can strip that function and replace it with GC only.So work have begun here: https://github.com/ZILtoid1991/experimental.xml Things I've done so far: * Stripped the allocators and the custom error handling functions. Not much people are using allocators anyways, it just complicates the project, and GC is otherwise the best option for anything that builds a complex tree structure. With that gone, I can just use exceptions for error handling, which can be toggled with a flag: turning it off will enable parsing badly formed XML documents, and even SGML in theory. * Simplifying a lot of things in general, with array slicing and appending. * Enabled character escaping, which led me into the DTD hellhole. * Enabled checking for bad characters in names and texts. * Started working on the processing of XML declarations (important for setting version and checking for correct encoding), and the DTD. I know that the removal of the allocators might doom my project from the inclusion in the Phobos library, but even then I can just release it as a regular dub library. Soon I'll be renaming it to newXML or something similar, while keeping the credits to its previous authors.
Sep 01 2022
On 8/22/22 7:48 AM, solidstate1991 wrote:Since the XML parsing library was removed from Phobos, I'm thinking about either getting dlang-community/experimental.xml into a usable state, or write a completely new parser. First I'd want some community input, and would like to hear from the users of lodo1995's library. Depending on some circumstances, I'll be losing my job next month, so I'll have some extra time on my hands (no money will be a tough thing), and even without that I'll try to pull it off somehow.Would be nice to have XSD support; many (most?) XML libraries I've looked at _across all languages_ only support DTD but not XSD schema def.
Oct 02 2022