digitalmars.D.learn - dxml behavior after exception: continue parsing

Jesse Phillips <Jesse.K.Phillips+D gmail.com> writes:

So I have an XML-like document which fails to adhere completely
to XML. One such deviation is that & is used without
escaping.

My observation is that after the exception it is possible to move 
to the next element without issue. Is this something expected and 
will be maintained?


     try {
         range.popFront();
     } catch (Exception e) {
         range.popFront;
     }


As an aside, here is a snippet for skipping over the BOM since 
dxml doesn't expect the BOM to be there:

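     import std.algorithm : skipOver;
     import std.encoding : BOM, getBOM;
     import std.file : read;

     auto fileContent = cast(ubyte[]) read(file);
     // getBOM reports BOM.none when the content does not start with a BOM.
     auto bom = getBOM(fileContent);
     if (bom.schema != BOM.none)
         fileContent.skipOver(bom.sequence);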

May 07 2018

Jesse Phillips <Jesse.K.Phillips+D gmail.com> writes:

On Monday, 7 May 2018 at 19:46:00 UTC, Jesse Phillips wrote:
 So I have an XML-like document which fails to adhere completely
 to XML. One such deviation is that & is used without
 escaping.

 My observation is that after the exception it is possible to 
 move to the next element without issue. Is this something 
 expected and will be maintained?


     try {
         range.popFront();
     } catch (Exception e) {
         range.popFront;
     }

Ok so this worked when inside a quoted attribute value but not a 
normal tag body. Clearly I'm not parsing valid XML so I'm going 
outside the bounds of valid parameters. But rather than writing a 
custom parser to handle this, it would be nice to have:

      try {
          range.popFront();
      } catch (Exception e) {
          range.moveToNextTag();
      }

Which would make front a MalformedParse containing the content up 
to the next <.

May 07 2018

Jonathan M Davis <newsgroup.d jmdavisprog.com> writes:

On Monday, May 07, 2018 22:16:58 Jesse Phillips via Digitalmars-d-learn 
wrote:
 On Monday, 7 May 2018 at 19:46:00 UTC, Jesse Phillips wrote:
 So I have an XML-like document which fails to adhere completely
 to XML. One such deviation is that & is used without
 escaping.

 My observation is that after the exception it is possible to
 move to the next element without issue. Is this something
 expected and will be maintained?

     try {
         range.popFront();
     } catch (Exception e) {
         range.popFront;
     }

 Ok so this worked when inside a quoted attribute value but not a
 normal tag body. Clearly I'm not parsing valid XML so I'm going
 outside the bounds of valid parameters. But rather than writing a
 custom parser to handle this, it would be nice to have:

       try {
           range.popFront();
       } catch (Exception e) {
           range.moveToNextTag();
       }

 Which would make front a MalformedParse containing the content up
 to the next <.

I don't think that such an approach would work with how dxml does its
validation, because it's designed with the idea that only the range farthest
along does the validation (which was critical in avoiding having to allocate
memory in functions like save). Some validation is currently done by every
range, but it's been my plan to look at making it so that as little
validation as possible is done by the other ranges. Either way, the fact
that any validation is skipped by ranges that are farther behind would cause
a definite problem with trying to then continue parsing past invalid XML.
As it stands, any range that is farther behind should throw the same
exception when it reaches the one that first hit the invalid XML, whereas if
that range could somehow continue, then the range that's farther behind
would then not do the same validation and would not do the right thing when
it hit the point where moveToNextTag had been called on the first range.

- Jonathan M Davis

May 07 2018

Jonathan M Davis <newsgroup.d jmdavisprog.com> writes:

On Monday, May 07, 2018 19:46:00 Jesse Phillips via Digitalmars-d-learn 
wrote:
 So I have an XML-like document which fails to adhere completely
 to XML. One such deviation is that & is used without
 escaping.

 My observation is that after the exception it is possible to move
 to the next element without issue. Is this something expected and
 will be maintained?


      try {
          range.popFront();
      } catch (Exception e) {
          range.popFront;
      }

The documentation on EntityRange / parseXML specifically states:

"If invalid XML is encountered at any point during the parsing process, an
XMLParsingException will be thrown. If an exception has been thrown, then
the parser is in an invalid state, and it is an error to call any functions
on it."

What happens if you continue parsing after an exception is effectively
undefined behavior and could vary wildly depending on what was invalid in
the XML and which part of the parser threw. It may very well be that in some
circumstances, you would be able to continue parsing without any real
negative side effects, but the parser could also end up asserting or doing
who-knows-what, because it's not in a valid state. I could add a member to
the parser which says whether it's in a valid state or not and then have the
parser throw if you try to call anything on it after an exception has been
thrown, but that would add overhead that I'd rather avoid. At most, such a
check would be done with assertions like the checks for whether you're
allowed to call name, text, or attributes are assertions.
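
The same kind of check can also be approximated on the caller's side. What
follows is only a minimal sketch under stated assumptions: Guarded and guarded
are hypothetical names (nothing like them exists in dxml), and Flaky is a toy
range that exists purely to trigger an exception. The wrapper remembers that
popFront threw and turns any later call into an exception instead of touching
the wrapped range again. The per-call check is exactly the kind of overhead
mentioned above.

```d
import std.range.primitives : isInputRange;

/// Hypothetical wrapper (not part of dxml): remembers that the wrapped
/// range threw and refuses any further use of it.
struct Guarded(R)
if (isInputRange!R)
{
    private R inner;
    private bool failed;

    this(R r) { inner = r; }

    // Throws if an earlier popFront already failed.
    private void check() const
    {
        if (failed)
            throw new Exception("range used after an exception was thrown");
    }

    @property bool empty() { check(); return inner.empty; }
    @property auto front() { check(); return inner.front; }

    void popFront()
    {
        check();
        scope (failure) failed = true; // runs only if inner.popFront throws
        inner.popFront();
    }
}

/// Convenience function so the element type can be inferred.
auto guarded(R)(R r) { return Guarded!R(r); }

/// Toy range for demonstration only: counts 0..4 and throws while
/// advancing past 1, standing in for a parser hitting invalid input.
struct Flaky
{
    int n;
    @property bool empty() { return n >= 5; }
    @property int front() { return n; }
    void popFront()
    {
        if (n == 1)
            throw new Exception("bad input");
        ++n;
    }
}
```

With this in place, a second call after the failure throws the wrapper's own
exception rather than reaching into a range whose state is no longer valid.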

I've been considering adding more configuration options where you say
something like you don't care if any invalid characters are encountered, in
which case, you could cleanly parse past something like an unescaped &, but
you'd then potentially be operating on invalid XML without knowing it and
could get undesirable results depending on what exactly is wrong with the
XML. I haven't decided for sure whether I'm going to add any such
configuration options or how fine-grained they'd be, but either way, the
current behavior will continue to be the default behavior.

- Jonathan M Davis

May 07 2018

Jesse Phillips <Jesse.K.Phillips+D gmail.com> writes:

On Monday, 7 May 2018 at 22:24:25 UTC, Jonathan M Davis wrote:

 I've been considering adding more configuration options where 
 you say something like you don't care if any invalid characters 
 are encountered, in which case, you could cleanly parse past 
 something like an unescaped &, but you'd then potentially be 
 operating on invalid XML without knowing it and could get 
 undesirable results depending on what exactly is wrong with the 
 XML. I haven't decided for sure whether I'm going to add any 
 such configuration options or how fine-grained they'd be, but 
 either way, the current behavior will continue to be the 
 default behavior.

 - Jonathan M Davis

I'm not going to ask for that (configuration). I may look into 
cloning dxml and changing it to parse the badly formed XML.

May 08 2018

Jonathan M Davis <newsgroup.d jmdavisprog.com> writes:

On Tuesday, May 08, 2018 16:18:40 Jesse Phillips via Digitalmars-d-learn 
wrote:
 On Monday, 7 May 2018 at 22:24:25 UTC, Jonathan M Davis wrote:
 I've been considering adding more configuration options where
 you say something like you don't care if any invalid characters
 are encountered, in which case, you could cleanly parse past
 something like an unescaped &, but you'd then potentially be
 operating on invalid XML without knowing it and could get
 undesirable results depending on what exactly is wrong with the
 XML. I haven't decided for sure whether I'm going to add any
 such configuration options or how fine-grained they'd be, but
 either way, the current behavior will continue to be the
 default behavior.

 - Jonathan M Davis

 I'm not going to ask for that (configuration). I may look into
 cloning dxml and changing it to parse the badly formed XML.

Well, for the general case at least, being able to configure the parser to
not care about certain types of validation is the best that I can think of
at the moment for dealing with invalid XML (especially with the issues
caused by the fact that only one range actually does the validation, making
selective skipping of invalid stuff while parsing a very iffy proposition).
dxml was designed with the idea that it would be operating on valid XML, and
designing a parser to operate on invalid XML can get very tricky - to the
point that it may simply be best for the programmer to design their own
solution tailored to their particular use case if they're going to be
encountering a lot of invalid XML.
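
One possible shape for such a tailored solution, when unescaped ampersands
are the only problem, is to repair the text before it ever reaches the parser.
This is a rough sketch, not something dxml provides: escapeLoneAmpersands and
looksLikeReference are hypothetical helper names, the entity-name check is an
ASCII-only simplification of the XML name rules, and the code does not
special-case CDATA sections or comments, where a raw & is legal and would be
escaped anyway.

```d
import std.array : appender;
import std.ascii : isAlpha, isAlphaNum, isDigit, isHexDigit;

/// Returns true if `s` (the text right after an '&') starts with something
/// shaped like a reference body followed by ';', e.g. "amp;" or "#x26;".
bool looksLikeReference(const(char)[] s)
{
    size_t i = 0;
    if (s.length && s[0] == '#')
    {
        // Character reference: decimal (#38) or hexadecimal (#x26).
        i = 1;
        const hex = i < s.length && s[i] == 'x';
        if (hex)
            ++i;
        const start = i;
        while (i < s.length && (hex ? isHexDigit(s[i]) : isDigit(s[i])))
            ++i;
        if (i == start)
            return false; // no digits at all
    }
    else
    {
        // Entity reference: simplified, ASCII-only name check (assumption).
        if (!s.length || !(isAlpha(s[0]) || s[0] == '_' || s[0] == ':'))
            return false;
        i = 1;
        while (i < s.length && (isAlphaNum(s[i]) || s[i] == '_'
                || s[i] == ':' || s[i] == '-' || s[i] == '.'))
            ++i;
    }
    return i < s.length && s[i] == ';';
}

/// Replaces every '&' that does not start a reference with "&amp;".
string escapeLoneAmpersands(string xml)
{
    auto result = appender!string();
    size_t last = 0; // start of the text not yet copied
    foreach (i; 0 .. xml.length)
    {
        if (xml[i] == '&' && !looksLikeReference(xml[i + 1 .. $]))
        {
            result.put(xml[last .. i]);
            result.put("&amp;");
            last = i + 1;
        }
    }
    result.put(xml[last .. $]);
    return result.data;
}
```

The output of escapeLoneAmpersands could then be passed to the parser as
usual. Because the input is rewritten up front, every range (including saved
copies) sees the same repaired text, which sidesteps the issue of only one
range doing the validation.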

If all that's needed is to tell the parser to allow stuff like lone
ampersands, then that's quite straightforward, but if you're dealing with
anything more wrong than that, then things get hairy fast. It's those sorts
of problems that have made HTML parsers so wildly inconsistent in what they
do.

Personally, I think that we'd have all been better off if the various
protocols (particularly those related to the web) had always called for
strict validation and rejected anything that didn't follow the spec.
Instead, we've got this whole idea of "be strict in what you emit but relaxed
in what you accept," and the result is that we've got a lot of incorrect
implementations and a lot of invalid data floating around. And of course, if
you don't accept something and someone else does, then your code is
considered buggy even if it follows the protocol perfectly and the data is
clearly invalid. So, in general, we're all kind of permanently screwed. :(

If I can do reasonable things to make dxml better handle bad data, then I'm
open to it, but given dxml's design, the options are somewhat limited, and
it's just plain a hard problem in general.

- Jonathan M Davis

May 09 2018
