digitalmars.D - High performance XML parser
- Tomek Sowiński (7/7) Feb 04 2011 I am now intensely accumulating information on how to go about creating ...
- Steven Schveighoffer (14/33) Feb 04 2011 Here is how I would approach it (without doing any research).
- Simen kjaeraas (11/23) Feb 04 2011 Question:
- Steven Schveighoffer (15/42) Feb 07 2011 The goal is to avoid double-buffering data. So you are using the buffer...
- Tomek Sowiński (26/42) Feb 04 2011
- Tomek Sowiński (27/51) Feb 04 2011
- Steven Schveighoffer (12/42) Feb 07 2011 That might not scale well. For instance, if you are accessing the 1500t...
- Robert Jacques (5/52) Feb 07 2011 I would consider a tokenizer which can be used for SAX style parsing to ...
- Michael Rynn (46/74) Feb 11 2011 XML parser
- Tomek Sowiński (46/66) Feb 08 2011
- spir (20/53) Feb 08 2011 That's very similar to what I was thinking at. What I wonder is whether,...
- Steven Schveighoffer (6/40) Feb 09 2011 OK, so you mean a buffer other than the I/O buffer. This means double
- Tomek Sowiński (11/15) Feb 09 2011
- Michel Fortin (12/33) Feb 04 2011 I agree it's important, especially when receiving XML over the network,
- Tomek Sowiński (7/15) Feb 04 2011 These are valid concerns. Yet, in overwhelming majority XML documents co...
- Michel Fortin (6/21) Feb 04 2011 A memory-mapped file comes from the hard drive too.
- Roman Ivanov (18/31) Feb 07 2011 This reminds me of some things I was thinking about when I worked with
- Denis Koroskin (26/45) Feb 04 2011
- Jacob Carlborg (6/11) Feb 06 2011 I don't think it's up to the parser to decide where the content comes
- Kagamin (2/5) Feb 07 2011 Did you measure how much memory is wasted by markup?
- Piotr Szturmaj (13/18) Feb 08 2011 Few years ago I needed to write my own parser in Delphi for my XMPP
I am now intensely accumulating information on how to go about creating a high-performance parser as it quickly became clear that my old one won't deliver. And if anything is clear, it is that memory is the key. One way is the slicing approach mentioned on this NG, notably used by RapidXML. I already contacted Marcin (the author) to ensure that using solutions inspired by his lib is OK with him; it is. But I don't think I'll go this way. One reason is, surprisingly, performance. RapidXML cannot start parsing until the entire document is loaded and ready as a random-access string. Then it's blazingly fast but the time for I/O has already elapsed. Besides, as Marcin himself said, we need a 100% W3C-compliant implementation and RapidXML isn't one. I think a much more fertile approach is to operate on a forward range, perhaps assuming buffered input. That way I can start parsing as soon as the first buffer gets filled. Not to mention that the end result will use much less memory. Plenty of the XML data stream is indents, spaces, and markup -- there's no reason to copy all this into memory. To sum up, I believe memory and overlapping I/O latencies with parsing effort are pivotal. Please comment on this. -- Tomek
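The forward-range idea can be sketched in a few lines of D. `readTagName` is an invented helper, not part of any real parser; the point is that it uses only empty/front/popFront, so parsing can begin before the whole document is available:

```d
import std.range.primitives;

// Hypothetical helper: read the first element name using only the
// forward-range primitives -- no random access, no full document needed.
string readTagName(R)(ref R r)
    if (isForwardRange!R && is(ElementType!R : dchar))
{
    string name;
    while (!r.empty && r.front != '<') r.popFront();
    if (!r.empty) r.popFront();          // consume '<'
    while (!r.empty && r.front != '>' && r.front != ' ')
    {
        name ~= cast(char) r.front;      // ASCII-only sketch
        r.popFront();
    }
    return name;
}

void main()
{
    auto input = "<root attr='1'/>";     // a string is a forward range of dchar
    assert(readTagName(input) == "root");
}
```

The same function would work unchanged on a buffered stream wrapper, which is what makes the range abstraction attractive here.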
Feb 04 2011
On Fri, 04 Feb 2011 16:02:39 -0500, Tomek Sowiński <just ask.me> wrote:I am now intensely accumulating information on how to go about creating a high-performance parser as it quickly became clear that my old one won't deliver. And if anything is clear is that memory is the key. One way is the slicing approach mentioned on this NG, notably used by RapidXML. I already contacted Marcin (the author) to ensure that using solutions inspired by his lib is OK with him; it is. But I don't think I'll go this way. One reason is, surprisingly, performance. RapidXML cannot start parsing until the entire document is loaded and ready as a random-access string. Then it's blazingly fast but the time for I/O has already elapsed. Besides, as Marcin himself said, we need a 100% W3C-compliant implementation and RapidXML isn't one. I think a much more fertile approach is to operate on a forward range, perhaps assuming bufferized input. That way I can start parsing as soon as the first buffer gets filled. Not to mention that the end result will use much less memory. Plenty of the XML data stream is indents, spaces, and markup -- there's no reason to copy all this into memory. To sum up, I belive memory and overlapping I/O latencies with parsing effort are pivotal. Please comment on this.Here is how I would approach it (without doing any research). First, we need a buffered I/O system where you can easily access and manipulate the buffer. I have proposed one a few months ago in this NG. Second, I'd implement the XML lib as a range where "front()" gives you an XMLNode. If the XMLNode is an element, it will have eager access to the element tag, and lazy access to the attributes and the sub-nodes. Each XMLNode will provide a forward range for the child nodes. Thus you can "skip" whole elements in the stream by popFront'ing a range, and dive deeper via accessing the nodes of the range. 
I'm unsure how well this will work, or if you can accomplish all of it without reallocation (in particular, you may need to store the element information, maybe via a specialized member function?). -Steve
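A toy model of this lazy-children design might look like the sketch below (all names are invented, not a proposed Phobos API). The point is that skipping past a node never forces its subtree to be parsed:

```d
import std.stdio;

// Toy stand-in for the proposed design: children are produced lazily
// by a delegate, so they are only parsed when you dive into the node.
struct XMLNode
{
    string tag;                       // eager: known when the node is reached
    XMLNode[] delegate() children;    // lazy: subtree parsed only on call
}

void main()
{
    bool subtreeParsed = false;

    // Two top-level elements; the first one's subtree is never entered.
    auto nodes = [
        XMLNode("skipped", delegate XMLNode[]() {
            subtreeParsed = true;     // would be the expensive parse
            return [];
        }),
        XMLNode("wanted", delegate XMLNode[]() {
            return [XMLNode("leaf", null)];
        }),
    ];

    foreach (node; nodes)             // popFront'ing past "skipped"...
        if (node.tag == "wanted")
            foreach (child; node.children())
                writeln(child.tag);

    assert(!subtreeParsed);           // ...never forced its subtree
}
```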
Feb 04 2011
Steven Schveighoffer <schveiguy yahoo.com> wrote:Here is how I would approach it (without doing any research). First, we need a buffered I/O system where you can easily access and manipulate the buffer. I have proposed one a few months ago in this NG. Second, I'd implement the XML lib as a range where "front()" gives you an XMLNode. If the XMLNode is an element, it will have eager access to the element tag, and lazy access to the attributes and the sub-nodes. Each XMLNode will provide a forward range for the child nodes. Thus you can "skip" whole elements in the stream by popFront'ing a range, and dive deeper via accessing the nodes of the range. I'm unsure how well this will work, or if you can accomplish all of it without reallocation (in particular, you may need to store the element information, maybe via a specialized member function?).Question: For the lazily computed attributes and subnodes, will accessing one element cause all elements to be computed? Same goes for getting the number of elements. Also, can this be efficiently combined with mmapping? What I sorta imagine is a kind of lazy slice: It determines whether it ends within this page, and if not, does not progress past that page until asked to do so. -- Simen
Feb 04 2011
On Fri, 04 Feb 2011 17:03:08 -0500, Simen kjaeraas <simen.kjaras gmail.com> wrote:Steven Schveighoffer <schveiguy yahoo.com> wrote:The goal is to avoid double-buffering data. So you are using the buffer of the input stream to contain all data. So, advancing to the 'next' element/node/attribute makes the previous element/node/attribute invalid (i.e. the buffer is reused). The trick is to make it seem like the node is fully there without actually reading the stream until you need it (hence the lazy part), because reading the entire node means reading the entire file (in the case of the root element).Here is how I would approach it (without doing any research). First, we need a buffered I/O system where you can easily access and manipulate the buffer. I have proposed one a few months ago in this NG. Second, I'd implement the XML lib as a range where "front()" gives you an XMLNode. If the XMLNode is an element, it will have eager access to the element tag, and lazy access to the attributes and the sub-nodes. Each XMLNode will provide a forward range for the child nodes. Thus you can "skip" whole elements in the stream by popFront'ing a range, and dive deeper via accessing the nodes of the range. I'm unsure how well this will work, or if you can accomplish all of it without reallocation (in particular, you may need to store the element information, maybe via a specialized member function?).Question: For the lazily computed attributes and subnodes, will accessing one element cause all elements to be computed? Same goes for getting the number of elements.Also, can this be efficiently combined with mmapping? What I sorta imagine is a kind of lazy slice: It determines whether it ends within this page, and if not, does not progress past that page until asked to do so.mmapping would make things more accessible, but the common denominator is not mmap. 
If it's supported as a special case, then maybe it can offer some interesting features, but something like mmap can't be done for say a network stream. -Steve
Feb 07 2011
Steven Schveighoffer napisał:Here is how I would approach it (without doing any research). First, we need a buffered I/O system where you can easily access and manipulate the buffer. I have proposed one a few months ago in this NG. Second, I'd implement the XML lib as a range where "front()" gives you an XMLNode. If the XMLNode is an element, it will have eager access to the element tag, and lazy access to the attributes and the sub-nodes. Each XMLNode will provide a forward range for the child nodes. Thus you can "skip" whole elements in the stream by popFront'ing a range, and dive deeper via accessing the nodes of the range. I'm unsure how well this will work, or if you can accomplish all of it without reallocation (in particular, you may need to store the element information, maybe via a specialized member function?).Heh, yesterday when I couldn't sleep I was sketching the design. I converged to pretty much the same concept, so your comment is reassuring :). The design I'm thinking of is that the node iterator will own a buffer. One consequence is that the fields of the current node will point to the buffer, akin to foreach(line; File.byLine), so in order to lift the input the user will have to dup (or process the node in-place). As new nodes are overwritten onto the same piece of memory, an important trait of the design emerges: cache intensity. Because of XML namespaces I think it is necessary for the buffer to contain the current node plus all its parents. Namespaces are the technical reason, but having access to the path all the way to the root node is of value regardless. This suggests mark-release memory management. The buffer will have to be long enough to fit the deepest tag sequence: theoretically infinite, not that much in practice. Like I said, the buffer will be owned by the iterator, so deterministic deallocation is probably possible when the processing is done.
One drawback is that you won't know you're dealing with a well-formed DOM until the closing tag comes. If it doesn't, it'll of course throw, but the malformed DOM may already have been digested. So providing some rollback possibility is up to the user.
-- 
Tomek
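The byLine-style ownership rule described here could be mocked up as follows. `NodeIter` and `load` are invented stand-ins for the real iterator; the demonstrated contract is that a node's fields alias a reused buffer and must be dup'ed to outlive the next advance:

```d
import std.stdio;

struct NodeIter
{
    char[] buf;              // owned, reused storage
    const(char)[] current;   // always a window into buf

    void load(string next)   // stand-in for "parse the next node"
    {
        if (buf.length < next.length) buf.length = next.length;
        buf[0 .. next.length] = next[];
        current = buf[0 .. next.length];
    }
}

void main()
{
    NodeIter it;
    it.load("<a>");
    string kept    = it.current.idup;  // lifted out: survives the advance
    auto transient = it.current;       // borrowed: will be clobbered

    it.load("<b>");                    // the advance: buffer is reused
    writeln(kept);         // still "<a>"
    writeln(transient);    // now "<b>" -- overwritten in place, as with byLine
}
```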
Feb 04 2011
Steven Schveighoffer napisał:Here is how I would approach it (without doing any research). First, we need a buffered I/O system where you can easily access and manipulate the buffer. I have proposed one a few months ago in this NG. Second, I'd implement the XML lib as a range where "front()" gives you an XMLNode. If the XMLNode is an element, it will have eager access to the element tag, and lazy access to the attributes and the sub-nodes. Each XMLNode will provide a forward range for the child nodes. Thus you can "skip" whole elements in the stream by popFront'ing a range, and dive deeper via accessing the nodes of the range. I'm unsure how well this will work, or if you can accomplish all of it without reallocation (in particular, you may need to store the element information, maybe via a specialized member function?).Heh, yesterday when I couldn't sleep I was sketching the design. I converged to pretty much the same concept, so your comment is reassuring :). The design I'm thinking of is that the node iterator will own a buffer. One consequence is that the fields of the current node will point to the buffer, akin to foreach(line; File.byLine), so in order to lift the input the user will have to dup (or process the node in-place). As new nodes are overwritten onto the same piece of memory, an important trait of the design emerges: cache intensity. Because of XML namespaces I think it is necessary for the buffer to contain the current node plus all its parents. Namespaces are the technical reason, but having access to the path all the way to the root node is of value regardless. This suggests mark-release memory management. The buffer will have to be long enough to fit the deepest tag sequence: theoretically infinite, not that much in practice. Like I said, the buffer will be owned by the iterator, so deterministic deallocation is probably possible when the processing is done. One drawback is that you won't know you're dealing with a well-formed DOM until the closing tag comes. If it doesn't, it'll of course throw, but the malformed DOM may already have been digested. So providing some rollback possibility is up to the user.Oh, and the direction of iteration (deeper/farther) will of course be controllable in the fashion you presented.
-- 
Tomek
Feb 04 2011
On Fri, 04 Feb 2011 17:36:50 -0500, Tomek Sowiński <just ask.me> wrote:Steven Schveighoffer napisał:That might not scale well. For instance, if you are accessing the 1500th child element of a parent, doesn't that mean that the buffer must contain the full text for the previous 1499 elements in order to also contain the parent? Maybe I'm misunderstanding what you mean. I would start out with a non-compliant parser, but one that allocates nothing beyond the I/O buffer, one that simply parses lazily and can be used as well as a SAX parser. Then see how many extra allocations we need to get it to be compliant. Then, one can choose the compliancy level based on what performance penalties one is willing to incur. -SteveHere is how I would approach it (without doing any research). First, we need a buffered I/O system where you can easily access and manipulate the buffer. I have proposed one a few months ago in this NG. Second, I'd implement the XML lib as a range where "front()" gives you an XMLNode. If the XMLNode is an element, it will have eager access to the element tag, and lazy access to the attributes and the sub-nodes. Each XMLNode will provide a forward range for the child nodes. Thus you can "skip" whole elements in the stream by popFront'ing a range, and dive deeper via accessing the nodes of the range. I'm unsure how well this will work, or if you can accomplish all of it without reallocation (in particular, you may need to store the element information, maybe via a specialized member function?).Heh, yesterday when I couldn't sleep I was sketching the design. I converged to a pretty much same concept, so your comment is reassuring :). The design I'm thinking is that the node iterator will own a buffer. One consequence is that the fields of the current node will point to the buffer akin to foreach(line; File.byLine), so in order to lift the input the user will have to dup (or process the node in-place). 
As new nodes will be overwritten on the same piece of memory, an important trait of the design emerges: cache intensity. Because of XML namespaces I think it is necessary for the buffer to contain the current node plus all its parents.
Feb 07 2011
On Mon, 07 Feb 2011 07:40:30 -0500, Steven Schveighoffer <schveiguy yahoo.com> wrote:On Fri, 04 Feb 2011 17:36:50 -0500, Tomek Sowiński <just ask.me> wrote:I would consider a tokenizer which can be used for SAX style parsing to be a key feature of std.xml. I know it was considered very important when I was gathering requirements for my std.JSON re-write.Steven Schveighoffer napisał:That might not scale well. For instance, if you are accessing the 1500th child element of a parent, doesn't that mean that the buffer must contain the full text for the previous 1499 elements in order to also contain the parent? Maybe I'm misunderstanding what you mean. I would start out with a non-compliant parser, but one that allocates nothing beyond the I/O buffer, one that simply parses lazily and can be used as well as a SAX parser. Then see how much extra allocations we need to get it to be compliant. Then, one can choose the compliancy level based on what performance penalties one is willing to incur. -SteveHere is how I would approach it (without doing any research). First, we need a buffered I/O system where you can easily access and manipulate the buffer. I have proposed one a few months ago in this NG. Second, I'd implement the XML lib as a range where "front()" gives you an XMLNode. If the XMLNode is an element, it will have eager access to the element tag, and lazy access to the attributes and the sub-nodes. Each XMLNode will provide a forward range for the child nodes. Thus you can "skip" whole elements in the stream by popFront'ing a range, and dive deeper via accessing the nodes of the range. I'm unsure how well this will work, or if you can accomplish all of it without reallocation (in particular, you may need to store the element information, maybe via a specialized member function?).Heh, yesterday when I couldn't sleep I was sketching the design. I converged to a pretty much same concept, so your comment is reassuring :). 
The design I'm thinking is that the node iterator will own a buffer. One consequence is that the fields of the current node will point to the buffer akin to foreach(line; File.byLine), so in order to lift the input the user will have to dup (or process the node in-place). As new nodes will be overwritten on the same piece of memory, an important trait of the design emerges: cache intensity. Because of XML namespaces I think it is necessary for the buffer to contain the current node plus all its parents.
Feb 07 2011
On Mon, 07 Feb 2011 10:37:46 -0500, Robert Jacques wrote:On Mon, 07 Feb 2011 07:40:30 -0500, Steven Schveighoffer <schveiguy yahoo.com> wrote:On Fri, 04 Feb 2011 17:36:50 -0500, Tomek Sowiński <just ask.me> wrote:Steven Schveighoffer napisał:Here is how I would approach it (without doing any research). First, we need a buffered I/O system where you can easily access and manipulate the buffer. I have proposed one a few months ago in this NG. Second, I'd implement the XML lib as a range where "front()" gives you an XMLNode. If the XMLNode is an element, it will have eager access to the element tag, and lazy access to the attributes and the sub-nodes. Each XMLNode will provide a forward range for the child nodes. Thus you can "skip" whole elements in the stream by popFront'ing a range, and dive deeper via accessing the nodes of the range.I would consider a tokenizer which can be used for SAX style parsing to be a key feature of std.xml. I know it was considered very important when I was gathering requirements for my std.JSON re-write.XML parser /dsource/xmlp/trunk/std I have experimented with various means to balance efficiency and flexibility in XML parsing. The core parsing uses an ItemReturn struct. This returns transient pointers to reused buffers, so that there is little memory buffer reallocation when just churning through the XML document.

struct ItemReturn
{
    ItemType type = ItemType.RET_NULL;
    char[] scratch;
    char[][] names;
    char[][] values;
}

The central parse method fills the ItemReturn with transient tag names, and pointers to names and values, somewhat like a SAX parser. To measure performance components, take the throughput of building an XML document as a linked DOM tree structure, with validation and attribute normalisation, as 100%. With this implementation, using buffers for file or string sources, I get the following breakdown of processing. This is done on books.xml. Other examples of documents, with different structure, entities, DTD, schema, namespaces, will differ. 
Input overhead. Throughput of examining each single Unicode character in the document, as a dchar value: about 12-15% of the time. So there is not a relatively great cost in the input buffering. Tag, attribute and content throughput. Parsing and filling the ItemReturn struct for each parse method call, called for every identifiable XML token in the document sequence: about 60% of the time, without doing anything with the result. This also includes the input overhead. No DOM structure is assumed, and there is no recursion. The general sort of work done is keeping track of state, and assembling and returning the various types of tokens. Actually building a full DOM, without doing much in the way of validation and attribute normalisation, takes up to about 86% of the total time. This includes converting the returned transient buffered values of tags, attribute names, values, and content into strings, and the creation and linking of DOM nodes. This involves some recursive method calls that mirror the XML document structure. Some time and memory seem to be saved by aliasing tag and attribute names using an AA; this brings it to about 85% of the full job. Additional validation and attribute normalisation take more time.
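A consumer of this token style might look like the following self-contained mock. Everything except the ItemReturn struct itself -- the ItemType members beyond RET_NULL and the pre-tokenized item array standing in for the parse method -- is invented for illustration; the contract shown is that the slices are transient and must be idup'ed to keep:

```d
import std.stdio;

enum ItemType { RET_NULL, RET_ELEMENT, RET_TEXT }  // members are guesses

struct ItemReturn
{
    ItemType type = ItemType.RET_NULL;
    char[] scratch;
    char[][] names;
    char[][] values;
}

void main()
{
    // Pretend token stream; a real parser would refill one ItemReturn
    // per parse call, reusing the same underlying buffers.
    char[] buffer = "title".dup;
    auto items = [
        ItemReturn(ItemType.RET_ELEMENT, buffer, [buffer], null),
        ItemReturn(ItemType.RET_TEXT, "hi".dup, null, null),
    ];

    string[] kept;
    foreach (item; items)
        final switch (item.type)
        {
            case ItemType.RET_ELEMENT: kept ~= item.names[0].idup; break;
            case ItemType.RET_TEXT:    kept ~= item.scratch.idup;  break;
            case ItemType.RET_NULL:    break;
        }
    writeln(kept);  // ["title", "hi"]
}
```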
Feb 11 2011
Steven Schveighoffer napisał:The design I'm thinking of is that the node iterator will own a buffer. One consequence is that the fields of the current node will point to the buffer akin to foreach(line; File.byLine), so in order to lift the input the user will have to dup (or process the node in-place). As new nodes are overwritten onto the same piece of memory, an important trait of the design emerges: cache intensity. Because of XML namespaces I think it is necessary for the buffer to contain the current node plus all its parents.That might not scale well. For instance, if you are accessing the 1500th child element of a parent, doesn't that mean that the buffer must contain the full text for the previous 1499 elements in order to also contain the parent? Maybe I'm misunderstanding what you mean.Let's talk through an example:

<a name="value">
    <b>
        Some Text 1
        <c2> <!-- HERE -->
            Some text 2
        </c2>
        Some Text 3
    </b>
</a>

The buffer of the iterator positioned HERE would be:

[Node a | Node b | Node c2]

Node c2 and all its parents are available for inspection. Node a's attribute is stored in the buffer, but not b's "Some Text 1", as it is c2's sibling; "Some Text 1" was available in the previous iteration, now it's overwritten by c2. To get to "Some text 2" let's advance the iterator in depth to get:

[Node a | Node b | Node c2 | Text node "Some text 2"]

Advancing it once more we get to:

[Node a | Node b | Text node "Some Text 3"]

So "Some Text 3" is written where c2 and the text node used to be. The element type of the range would always be the child, with parents available through pointers:

foreach (node; xmlRange) {
    doStuff(node);
    if (Node* parent = node.parent)
        doOtherStuff(parent);
}

Having no access to siblings is quite limiting, but the iterator can form an efficient (zero-allocation) basis on which more convenient schemes are built. It's still just brain-storming, though. I fear there's something that'll make the whole idea crash & burn.I would start out with a non-compliant parser, but one that allocates nothing beyond the I/O buffer, one that simply parses lazily and can be used as well as a SAX parser. Then see how many extra allocations we need to get it to be compliant. Then, one can choose the compliancy level based on what performance penalties one is willing to incur.Yeah, 100% compliance is a long way off.
-- 
Tomek
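The buffer snapshots in the example behave like a stack with mark-release semantics. This toy (invented names, strings standing in for the byte buffer) reproduces the three states above:

```d
import std.stdio;

struct NodeStack
{
    string[] buf;   // stands in for the iterator-owned byte buffer

    void enter(string n)   { buf ~= n; }             // go deeper
    void leave()           { buf = buf[0 .. $ - 1]; } // mark-release
    void sibling(string n) { buf[$ - 1] = n; }        // overwrite in place
}

void main()
{
    NodeStack s;
    s.enter("a"); s.enter("b"); s.enter("c2");
    writeln(s.buf);   // ["a", "b", "c2"] -- positioned at HERE

    s.enter("text:Some text 2");
    writeln(s.buf);   // ["a", "b", "c2", "text:Some text 2"]

    s.leave();                        // done with the text node
    s.sibling("text:Some Text 3");    // c2's slot is reused
    writeln(s.buf);   // ["a", "b", "text:Some Text 3"]
}
```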
Feb 08 2011
On 02/09/2011 01:16 AM, Tomek Sowiński wrote:Steven Schveighoffer napisał:That's very similar to what I was thinking of. What I wonder is whether, in your buffer representations above, 'Node x' represents an instantiated node, or collected data necessary to later instantiate --once the current part of the source is validated (proved correct). What I mean is, the whole <a> node will be validated only when </a> is matched, so that if your parsing process fails on the way (and/or it was just following a wrong parsing path), then all nodes instantiated along that way are just to be thrown away, aren't they (unless some memoisation may be useful). Possibly all I say here is just stupid, depending on the parsing algo and nature of the grammar, and also how costly Node creation is. In my case, all of this seems relevant. Thus, I'm thinking of just collecting data along the way (rather easy & far cheaper than node construction), and (recursively) instantiating only once a section is validated. (needs to be tried concretely --maybe there are issues I'm not yet aware of) Denis -- _________________ vita es estrany spir.wikidot.comLet's talk on an example: <a name="value"> <b> Some Text 1 <c2> <!-- HERE --> Some text 2 </c2> Some Text 3 </b> </a> The buffer of the iterator positioned HERE would be: [Node a | Node b | Node c2] Node c2 and all its parents are available for inspection. Node a's attribute is stored in the buffer, but not b's "Some text 1" as it is c2's sibling; "Some text 1" was available in the previous iteration, now it's overwritten by c2. To get to "Some text 2" let's advance the iterator in depth to get: [Node a | Node b | Node c2 | Text node "Some text 2"] Advancing it once more we get to: [Node a | Node b | Text node "Some text 3"] So "Some text 3" is written where c2 and the text node 2 used to be.The design I'm thinking is that the node iterator will own a buffer. 
One consequence is that the fields of the current node will point to the buffer akin to foreach(line; File.byLine), so in order to lift the input the user will have to dup (or process the node in-place). As new nodes will be overwritten on the same piece of memory, an important trait of the design emerges: cache intensity. Because of XML namespaces I think it is necessary for the buffer to contain the current node plus all its parents.That might not scale well. For instance, if you are accessing the 1500th child element of a parent, doesn't that mean that the buffer must contain the full text for the previous 1499 elements in order to also contain the parent? Maybe I'm misunderstanding what you mean.
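The collect-then-instantiate idea might be sketched like this (all names invented): raw data is stacked up cheaply as it streams by, and a real Node is only constructed once the matching close tag proves the section well-formed:

```d
struct RawElem
{
    string tag;
    string[] texts;   // collected slices, no Node objects yet
}

class Node
{
    string tag;
    string[] texts;
    this(string t, string[] x) { tag = t; texts = x; }
}

// Pay for Node construction only after validation; on a mismatch nothing
// was instantiated, so there is nothing to roll back.
Node finish(RawElem raw, string closeTag)
{
    if (raw.tag != closeTag)
        throw new Exception("mismatched close tag </" ~ closeTag ~ ">");
    return new Node(raw.tag, raw.texts);
}

void main()
{
    auto raw  = RawElem("a", ["Some text"]);
    auto node = finish(raw, "a");          // </a> seen: safe to build
    assert(node.tag == "a" && node.texts == ["Some text"]);
}
```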
Feb 08 2011
On Tue, 08 Feb 2011 19:16:37 -0500, Tomek Sowiński <just ask.me> wrote:Steven Schveighoffer napisał:OK, so you mean a buffer other than the I/O buffer. This means double buffering data. I was thinking of a solution that allows simply using the I/O buffer for parsing. I think this is one of the keys to Tango's xml performance. -SteveLet's talk on an example: <a name="value"> <b> Some Text 1 <c2> <!-- HERE --> Some text 2 </c2> Some Text 3 </b> </a> The buffer of the iterator positioned HERE would be: [Node a | Node b | Node c2]The design I'm thinking is that the node iterator will own a buffer. One consequence is that the fields of the current node will point to the buffer akin to foreach(line; File.byLine), so in order to lift the input the user will have to dup (or process the node in-place). As new nodes will be overwritten on the same piece of memory, an important trait of the design emerges: cache intensity. Because of XML namespaces I think it is necessary for the buffer to contain the current node plus all its parents.That might not scale well. For instance, if you are accessing the 1500th child element of a parent, doesn't that mean that the buffer must contain the full text for the previous 1499 elements in order to also contain the parent? Maybe I'm misunderstanding what you mean.
Feb 09 2011
Steven Schveighoffer napisał:OK, so you mean a buffer other than the I/O buffer. This means double buffering data. I was thinking of a solution that allows simply using the I/O buffer for parsing. I think this is one of the keys to Tango's xml performance.I'd be glad to hear your idea. I think they are convergent. In mine, the I/O could be asked to dump data into the iterator's buffer at a given position (right after the previous nodes), then the iterator forms a node out of the raw data. Some moving would be done, but all within the cached buffer, so it should be quick. I guess that's as far as I can predict performance in a newsgroup post. ;-) Gotta write some code and whip out the stopwatch, then we'll see.
-- 
Tomek
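Parsing straight out of the I/O buffer roughly means a sliding window: the only copying is compacting unconsumed bytes to the front when a token straddles the window's end, then refilling behind them. A minimal sketch, with invented names and a string standing in for the stream:

```d
import std.stdio;

struct Window
{
    char[] buf;
    size_t pos;      // parse cursor
    size_t filled;   // valid bytes

    // Slide unconsumed bytes to the front, then top up from `source`.
    void refill(ref string source)
    {
        foreach (i; 0 .. filled - pos)
            buf[i] = buf[pos + i];
        filled -= pos;
        pos = 0;
        size_t n = buf.length - filled;
        if (n > source.length) n = source.length;
        buf[filled .. filled + n] = source[0 .. n];
        source = source[n .. $];
        filled += n;
    }
}

void main()
{
    string source = "<aa><bb>";
    auto w = Window(new char[6]);
    w.refill(source);
    writeln(w.buf[0 .. w.filled]);  // <aa><b
    w.pos = 4;                      // pretend <aa> has been consumed
    w.refill(source);
    writeln(w.buf[0 .. w.filled]);  // <bb>
}
```

Nodes handed to the user would be slices of `buf`, which is exactly why they go stale on the next refill.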
Feb 09 2011
On 2011-02-04 16:02:39 -0500, Tomek Sowiński <just ask.me> said:I am now intensely accumulating information on how to go about creating a high-performance parser as it quickly became clear that my old one won't deliver. And if anything is clear, it is that memory is the key. One way is the slicing approach mentioned on this NG, notably used by RapidXML. I already contacted Marcin (the author) to ensure that using solutions inspired by his lib is OK with him; it is. But I don't think I'll go this way. One reason is, surprisingly, performance. RapidXML cannot start parsing until the entire document is loaded and ready as a random-access string. Then it's blazingly fast but the time for I/O has already elapsed. Besides, as Marcin himself said, we need a 100% W3C-compliant implementation and RapidXML isn't one. I think a much more fertile approach is to operate on a forward range, perhaps assuming buffered input. That way I can start parsing as soon as the first buffer gets filled. Not to mention that the end result will use much less memory. Plenty of the XML data stream is indents, spaces, and markup -- there's no reason to copy all this into memory. To sum up, I believe memory and overlapping I/O latencies with parsing effort are pivotal.I agree it's important, especially when receiving XML over the network, but I also think it's important to be able to support slicing. Imagine all the memory you could save by just making slices of a memory-mapped file. The difficulty is to support both models: the input range model which requires copying the strings and the slicing model where you're just taking slices of a string. -- Michel Fortin michel.fortin michelf.com http://michelf.com/
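One way to serve both models from a single parser is a compile-time branch on the source type. `capture` is a hypothetical helper, not an existing API: for an immutable string it hands out a zero-copy slice, while a transient buffer forces a copy:

```d
auto capture(S)(S src, size_t lo, size_t hi)
{
    static if (is(S == string))
        return src[lo .. hi];        // slicing model: aliases the document
    else
        return src[lo .. hi].idup;   // copying model: lift out of the buffer
}

void main()
{
    string doc       = "<a/>";
    char[] transient = "<a/>".dup;

    auto sliced = capture(doc, 1, 2);
    auto copied = capture(transient, 1, 2);

    assert(sliced.ptr == doc.ptr + 1);        // no allocation happened
    assert(copied.ptr != transient.ptr + 1);  // owns its own memory
}
```

A memory-mapped file could present itself to such a parser as a `string`, getting the slicing path for free.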
Feb 04 2011
Michel Fortin napisał:I agree it's important, especially when receiving XML over the network, but I also think it's important to be able to support slicing. Imagine all the memory you could save by just making slices of a memory-mapped file. The difficulty is to support both models: the input range model which requires copying the strings and the slicing model where you're just taking slices of a string.These are valid concerns. Yet, in the overwhelming majority of cases XML documents come from the hard drive and the network -- these are the places we need to drill. I fear that trying to cover every remote use case will render the library incomprehensible.
-- 
Tomek
Feb 04 2011
On 2011-02-04 16:47:26 -0500, Tomek Sowiński <just ask.me> said:Michel Fortin napisał:I agree it's important, especially when receiving XML over the network, but I also think it's important to be able to support slicing. Imagine all the memory you could save by just making slices of a memory-mapped file. The difficulty is to support both models: the input range model which requires copying the strings and the slicing model where you're just taking slices of a string.These are valid concerns. Yet, in the overwhelming majority of cases XML documents come from the hard drive and the network -- these are the places we need to drill. I fear that trying to cover every remote use case will render the library incomprehensible.A memory-mapped file comes from the hard drive too. -- Michel Fortin michel.fortin michelf.com http://michelf.com/
Feb 04 2011
On 2/4/2011 4:47 PM, Tomek Sowiński wrote:Michel Fortin napisał:I agree it's important, especially when receiving XML over the network, but I also think it's important to be able to support slicing. Imagine all the memory you could save by just making slices of a memory-mapped file. The difficulty is to support both models: the input range model which requires copying the strings and the slicing model where you're just taking slices of a string.These are valid concerns. Yet, in the overwhelming majority of cases XML documents come from the hard drive and the network -- these are the places we need to drill. I fear that trying to cover every remote use case will render the library incomprehensible.This reminds me of some things I was thinking about when I worked with some XML-heavy apps in Java and experimented with writing parsers for my own simple markup languages. If you have the entire XML string loaded in memory, the most time-consuming part of parsing it is probably going to be allocation of node objects. So it makes sense to do a quick scan of the char array, and generate just the root node, which would lazily allocate sub-nodes upon access. I can see several different implementations of a high-performance parser, depending on the typical use case. Do you want to work efficiently with lots of small files or one huge file? Deeply nested or mostly flat? Coming from memory or a stream of characters? Problem is, with lazy parsing XML nodes would need to be able to call upon the parser that created them. Perhaps it would be possible to specify some kind of generic XML node interface and allow people to use/generate different implementations depending on what they need?
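The quick-scan-then-lazy-allocate idea could look like this toy (invented names; splitting on commas stands in for real sub-node scanning). Children are allocated only on first access and memoized, so repeated traversal doesn't reparse:

```d
import std.stdio;

class LazyNode
{
    string slice;            // this node's region of the document
    private LazyNode[] kids;
    private bool parsed;

    this(string s) { slice = s; }

    // Sub-nodes are allocated on first call, then cached; the node keeps
    // enough context (its slice) to call back into the parsing logic.
    LazyNode[] children()
    {
        if (!parsed)
        {
            import std.algorithm.iteration : splitter;
            foreach (part; slice.splitter(','))
                kids ~= new LazyNode(part);
            parsed = true;
        }
        return kids;
    }
}

void main()
{
    auto root = new LazyNode("a,b,c");
    writeln(root.children().length);  // 3 -- allocated on this first call
    writeln(root.children().length);  // 3 -- memoized, no reparse
}
```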
Feb 07 2011
On Sat, 05 Feb 2011 00:02:39 +0300, Tomek Sowiński <just ask.me> wrote:

> I am now intensely accumulating information on how to go about creating a high-performance parser, as it quickly became clear that my old one won't deliver. And if anything is clear, it is that memory is the key.
>
> One way is the slicing approach mentioned on this NG, notably used by RapidXML. I already contacted Marcin (the author) to ensure that using solutions inspired by his lib is OK with him; it is. But I don't think I'll go this way. One reason is, surprisingly, performance. RapidXML cannot start parsing until the entire document is loaded and ready as a random-access string. Then it's blazingly fast, but the time for I/O has already elapsed. Besides, as Marcin himself said, we need a 100% W3C-compliant implementation and RapidXML isn't one.
>
> I think a much more fertile approach is to operate on a forward range, perhaps assuming buffered input. That way I can start parsing as soon as the first buffer gets filled. Not to mention that the end result will use much less memory. Plenty of the XML data stream is indents, spaces, and markup -- there's no reason to copy all this into memory.
>
> To sum up, I believe memory and overlapping I/O latencies with parsing effort are pivotal. Please comment on this.

I don't have much experience with XML, but as far as I can tell a DOM parser pretty much needs to store the entire file in memory. You can also load and parse files asynchronously.

On the contrary, SAX parsers don't require having the entire file in memory, but that's a completely different approach to XML parsing.

I'd also recommend you take a look at pugixml, which has been developed and supported by my co-worker since 2006. It is free (MIT license), small, lightweight, fast, clean, and has very good documentation.
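For comparison with the DOM side, the SAX model fits in a few lines of D. This is a toy event walker (names invented here, nowhere near W3C-compliant: no attributes, entities, or error handling), but it shows why memory stays bounded -- nothing outlives the callbacks:

```d
// Toy SAX-style driver: one pass over the text, firing a callback per
// open tag, close tag, and text run. Memory use is bounded by the
// deepest open element, not by the document size.
void saxParse(string doc,
              void delegate(string) onOpen,
              void delegate(string) onClose,
              void delegate(string) onText)
{
    size_t i;
    while (i < doc.length)
    {
        if (doc[i] == '<')
        {
            auto j = i + 1;
            while (doc[j] != '>') ++j;      // find the end of the tag
            auto tag = doc[i + 1 .. j];
            if (tag[0] == '/') onClose(tag[1 .. $]);
            else               onOpen(tag);
            i = j + 1;
        }
        else
        {
            auto j = i;
            while (j < doc.length && doc[j] != '<') ++j;
            onText(doc[i .. j]);            // text run between tags
            i = j;
        }
    }
}
```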
Feb 04 2011
On 2011-02-04 22:02, Tomek Sowiński wrote:

> I am now intensely accumulating information on how to go about creating a high-performance parser, as it quickly became clear that my old one won't deliver. And if anything is clear, it is that memory is the key.
>
> One way is the slicing approach mentioned on this NG, notably used by RapidXML. I already contacted Marcin (the author) to ensure that using solutions inspired by his lib is OK with him; it is. But I don't think I'll go this way. One reason is, surprisingly, performance. RapidXML cannot start parsing until the entire document is loaded and ready as a random-access string. Then it's blazingly fast, but the time for I/O has already elapsed. Besides, as Marcin himself said, we need a 100% W3C-compliant implementation and RapidXML isn't one.
>
> I think a much more fertile approach is to operate on a forward range, perhaps assuming buffered input. That way I can start parsing as soon as the first buffer gets filled. Not to mention that the end result will use much less memory. Plenty of the XML data stream is indents, spaces, and markup -- there's no reason to copy all this into memory.
>
> To sum up, I believe memory and overlapping I/O latencies with parsing effort are pivotal. Please comment on this.

I don't think it's up to the parser to decide where the content comes from. It should be able to handle the whole content of an XML file in memory.

-- 
/Jacob Carlborg
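One way to get that separation in D is to template the parser on any forward range of characters, so an in-memory string, a memory-mapped buffer, or a buffered stream all work unchanged. A sketch of the shape such an API could take (hypothetical, not an actual Phobos signature):

```d
import std.range.primitives : ElementType, isForwardRange;

// The parser never knows where the characters come from; the caller
// decides by picking the range type R.
struct XmlParser(R)
    if (isForwardRange!R && is(ElementType!R : dchar))
{
    R input;
    // ... tokenizer state would live here ...
}

// Convenience factory so R is inferred at the call site.
auto xmlParser(R)(R input)
{
    return XmlParser!R(input);
}
```

The same instantiation mechanism then lets a string input specialize into the slicing model, while stream-backed ranges fall back to copying.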
Feb 06 2011
Tomek Sowiński Wrote:

> One way is the slicing approach mentioned on this NG, notably used by RapidXML. I already contacted Marcin (the author) to ensure that using solutions inspired by his lib is OK with him; it is. But I don't think I'll go this way. One reason is, surprisingly, performance. RapidXML cannot start parsing until the entire document is loaded and ready as a random-access string. Then it's blazingly fast, but the time for I/O has already elapsed. Besides, as Marcin himself said, we need a 100% W3C-compliant implementation and RapidXML isn't one.
>
> I think a much more fertile approach is to operate on a forward range, perhaps assuming buffered input. That way I can start parsing as soon as the first buffer gets filled. Not to mention that the end result will use much less memory. Plenty of the XML data stream is indents, spaces, and markup -- there's no reason to copy all this into memory.

Did you measure how much memory is wasted by markup?
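The question is easy to approximate with a few lines of D. The toy heuristic below counts tag bytes plus whitespace as markup; it misclassifies whitespace inside text content, so read the result as an upper bound:

```d
// Toy estimate of markup overhead: bytes inside '<'..'>' plus
// whitespace (indents, newlines) count as markup, the rest as content.
size_t markupBytes(string doc)
{
    size_t markup;
    bool inTag;
    foreach (c; doc)
    {
        if (c == '<') inTag = true;
        if (inTag || c == ' ' || c == '\t' || c == '\n' || c == '\r')
            ++markup;
        if (c == '>') inTag = false;
    }
    return markup;
}
```

On a small pretty-printed fragment like `<a>\n  <b>hi</b>\n</a>` this reports 18 markup bytes out of 20, which is the kind of ratio the original post worries about.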
Feb 07 2011
Tomek Sowiński wrote:

> I am now intensely accumulating information on how to go about creating a high-performance parser, as it quickly became clear that my old one won't deliver. And if anything is clear, it is that memory is the key.
>
> One way is the slicing approach mentioned on this NG, notably used by RapidXML. I already contacted Marcin (the author) to ensure that using solutions inspired by his lib is OK with him; it is. But I don't think I'll go this way. One reason is, surprisingly, performance. RapidXML cannot start parsing until the entire document is loaded and ready as a random-access string. Then it's blazingly fast, but the time for I/O has already elapsed. Besides, as Marcin himself said, we need a 100% W3C-compliant implementation and RapidXML isn't one.
>
> I think a much more fertile approach is to operate on a forward range, perhaps assuming buffered input. That way I can start parsing as soon as the first buffer gets filled. Not to mention that the end result will use much less memory. Plenty of the XML data stream is indents, spaces, and markup -- there's no reason to copy all this into memory.
>
> To sum up, I believe memory and overlapping I/O latencies with parsing effort are pivotal. Please comment on this.

A few years ago I needed to write my own parser in Delphi for my XMPP (Jabber) client and server. In XMPP you get a socket-streamed XML document, with XML elements as protocol messages ("XMPP stanzas"). The problem I had was that no Delphi parser had hybrid SAX/DOM support, i.e. I wanted to parse XML nodes like SAX, but when I had received a whole message I wanted to return it as an XML element (as in DOM). That way I could easily process incoming messages -- nowadays this is achievable with pull parsers.

I think std needs both SAX/pull and DOM parsers. For DOM, if the whole document is in memory, maybe this approach could be advantageous:

http://en.wikipedia.org/wiki/VTD-XML
http://vtd-xml.sourceforge.net/
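The VTD-XML idea can be sketched briefly: instead of allocating node objects, the parser appends small fixed-size token records that point back into the untouched source buffer. (The layout below is my simplification for illustration, not VTD's actual record format.)

```d
// Each parsed token is a compact (kind, offset, length) record; the
// document text itself is never copied or mutated.
struct Token
{
    enum Kind : ubyte { open, close, text }
    Kind kind;
    uint offset; // where the token starts in the source buffer
    uint length; // how many bytes it covers
}

// Reconstructing a token's text is a zero-copy slice of the source.
string tokenText(string doc, Token t)
{
    return doc[t.offset .. t.offset + t.length];
}
```

An array of such records is cache-friendly and trivially serializable, which is roughly where VTD-XML gets its speed and its ability to persist a parse alongside the document.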
Feb 08 2011