
digitalmars.D - std.xml2 (collecting features)

reply "Robert burner Schadek" <rburners gmail.com> writes:
std.xml has been considered not up to spec for nearly 3 years now. 
Time to build a successor. I currently plan the following features 
for it:

- SAX and DOM parser
- in-situ / slicing parsing when possible (forward range?)
- compile time switch (CTS) for lazy attribute parsing
- CTS for encoding (ubyte(ASCII), char(utf8), ... )
- CTS for input validating
- performance

Not much code yet, I'm currently building the performance test 
suite https://github.com/burner/std.xml2

Please post your feature requests, and please keep the posts DRY 
and on topic.
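To make the compile-time-switch (CTS) idea above concrete, here is a purely hypothetical sketch of what such an API could look like. None of these names exist yet; they only illustrate how the listed features could map onto template parameters:

```d
import std.range.primitives : isForwardRange;

// Hypothetical configuration flags, resolved at compile time.
enum LazyAttributes { no, yes }
enum Validate { no, yes }

// One parser instantiation per input type, so ubyte (ASCII) and
// char (UTF-8) ranges each get their own specialized code path.
struct XmlParser(Input, LazyAttributes lazyAttr, Validate validate)
        if (isForwardRange!Input)
{
    Input input;
    // In-situ parsing: events would hand out slices of `input`
    // instead of freshly allocated strings whenever possible.
}

// Convenience factory; Input is deduced, the switches are explicit.
auto xmlParser(LazyAttributes lazyAttr = LazyAttributes.yes,
        Validate validate = Validate.no, Input)(Input input)
{
    return XmlParser!(Input, lazyAttr, validate)(input);
}
```

With this shape, `xmlParser!(LazyAttributes.no, Validate.yes)(doc)` would select a fully eager, validating instantiation at compile time, with no runtime branching.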
May 03 2015
next sibling parent reply "Joakim" <dlang joakim.fea.st> writes:
On Sunday, 3 May 2015 at 17:39:48 UTC, Robert burner Schadek 
wrote:
 std.xml has been considered not up to spec for nearly 3 years now. 
 Time to build a successor. I currently plan the following 
 features for it:

 - SAX and DOM parser
 - in-situ / slicing parsing when possible (forward range?)
 - compile time switch (CTS) for lazy attribute parsing
 - CTS for encoding (ubyte(ASCII), char(utf8), ... )
 - CTS for input validating
 - performance

 Not much code yet, I'm currently building the performance test 
 suite https://github.com/burner/std.xml2

 Please post your feature requests, and please keep the posts DRY 
 and on topic.
My request: just skip it. XML is a horrible waste of space for a standard; better that D doesn't support it well, anything to discourage its use. I'd rather see you spend your time on something worthwhile. If data formats are your thing, you could help get Ludwig's JSON stuff in, or better yet, enable some nice binary data format.
May 03 2015
next sibling parent "Meta" <jared771 gmail.com> writes:
On Sunday, 3 May 2015 at 17:47:15 UTC, Joakim wrote:
 On Sunday, 3 May 2015 at 17:39:48 UTC, Robert burner Schadek 
 wrote:
 std.xml has been considered not up to spec for nearly 3 years 
 now. Time to build a successor. I currently plan the following 
 features for it:

 - SAX and DOM parser
 - in-situ / slicing parsing when possible (forward range?)
 - compile time switch (CTS) for lazy attribute parsing
 - CTS for encoding (ubyte(ASCII), char(utf8), ... )
 - CTS for input validating
 - performance

 Not much code yet, I'm currently building the performance test 
 suite https://github.com/burner/std.xml2

 Please post your feature requests, and please keep the posts 
 DRY and on topic.
My request: just skip it. XML is a horrible waste of space for a standard; better that D doesn't support it well, anything to discourage its use. I'd rather see you spend your time on something worthwhile. If data formats are your thing, you could help get Ludwig's JSON stuff in, or better yet, enable some nice binary data format.
That's not really an option considering the huge amount of XML data there is out there.
May 03 2015
prev sibling next sibling parent reply "w0rp" <devw0rp gmail.com> writes:
On Sunday, 3 May 2015 at 17:47:15 UTC, Joakim wrote:
 On Sunday, 3 May 2015 at 17:39:48 UTC, Robert burner Schadek 
 wrote:
 std.xml has been considered not up to spec for nearly 3 years 
 now. Time to build a successor. I currently plan the following 
 features for it:

 - SAX and DOM parser
 - in-situ / slicing parsing when possible (forward range?)
 - compile time switch (CTS) for lazy attribute parsing
 - CTS for encoding (ubyte(ASCII), char(utf8), ... )
 - CTS for input validating
 - performance

 Not much code yet, I'm currently building the performance test 
 suite https://github.com/burner/std.xml2

 Please post your feature requests, and please keep the posts 
 DRY and on topic.
My request: just skip it. XML is a horrible waste of space for a standard; better that D doesn't support it well, anything to discourage its use. I'd rather see you spend your time on something worthwhile. If data formats are your thing, you could help get Ludwig's JSON stuff in, or better yet, enable some nice binary data format.
I agree that JSON is superior through-and-through, but legacy support matters, and XML is in many places. It's good to have a quality XML parsing library.
May 03 2015
parent reply Marco Leise <Marco.Leise gmx.de> writes:
On Sunday, 3 May 2015 at 17:47:15 UTC, Joakim wrote:

 My request: just skip it.  XML is a horrible waste of space for 
 a standard; better that D doesn't support it well, anything to 
 discourage its use.  I'd rather see you spend your time on 
 something worthwhile.  If data formats are your thing, you 
 could help get Ludwig's JSON stuff in, or better yet, enable 
 some nice binary data format.
On Sun, 03 May 2015 18:44:11 +0000, "w0rp" <devw0rp gmail.com> wrote:
 I agree that JSON is superior through-and-through, but legacy 
 support matters, and XML is in many places. It's good to have a 
 quality XML parsing library.
You two are terrible at motivating people. "Better D doesn't support it well" and "JSON is superior through-and-through" is overly dismissive. To me it sounds like someone saying replace C++ with JavaScript, because C++ is a horrible standard and JavaScript is so much superior. Honestly.

Remember that while JSON is simpler, XML is not just a structured container for bool, Number and String data. It comes with many official sidekicks covering a broad range of use cases:

XPath:
 * allows you to use XML files like a textual database
 * complex enough to allow for almost any imaginable query
 * many tools emerged to test XPath expressions against XML documents
 * also powers XSLT (http://www.liquid-technologies.com/xpath-tutorial.aspx)

XSL (Extensible Stylesheet Language) and XSLT (XSL Transformations):
 * written as XML documents
 * standard way to transform XML from one structure into another
 * convert or "compile" data to XHTML or SVG for display in a browser
 * output to XSL-FO

XSL-FO (XSL formatting objects):
 * written as XSL
 * type-setting for XML; an XSL-FO processor is similar to a LaTeX processor
 * reads an XML document (a "Format" document) and outputs to PDF, RTF or a similar format

XML Schema Definition (XSD):
 * written as XML
 * linked in by an XML file
 * defines structure and validates content to some extent
 * can set constraints on how often an element can occur in a list
 * can validate the data type of values (length, regex, positive, etc.)
 * database-like unique IDs and references

I think XML is the most eat-your-own-dog-food language ever and nicely covers a wide range of use cases. In any case there are many XML based file formats that we might want to parse. Amongst them SVG, OpenDocument (Open/LibreOffice), RSS feeds, several MS Office, XMP and other metadata formats. When it comes to which features to support, I personally used XSD more than XPath and the tech using it. But quite frankly both would be expected by users. 
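As a small, self-contained illustration of the XSD point (the schema below is invented for this example):

```xml
<!-- books.xsd: constrains structure, occurrence counts and value types -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="books">
    <xs:complexType>
      <xs:sequence>
        <!-- at least one, arbitrarily many <book> elements -->
        <xs:element name="book" maxOccurs="unbounded">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="title" type="xs:string"/>
              <!-- the value must parse as a positive integer -->
              <xs:element name="year" type="xs:positiveInteger"/>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
```

A document such as `<books><book><title>D</title><year>2015</year></book></books>` passes validation; a non-numeric `<year>` is rejected by any conforming validator, on the sender's and the receiver's side alike.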
Based on XPath, XSL transformations can be added any time then. Anything beyond that doesn't feel quite "core" enough to be in an XML module. -- Marco
May 04 2015
parent reply "Joakim" <dlang joakim.fea.st> writes:
On Monday, 4 May 2015 at 18:50:43 UTC, Marco Leise wrote:
 On Sunday, 3 May 2015 at 17:47:15 UTC, Joakim wrote:

 My request: just skip it.  XML is a horrible waste of space 
 for a standard; better that D doesn't support it well, anything to 
 discourage its use.  I'd rather see you spend your time on 
 something worthwhile.  If data formats are your thing, you 
 could help get Ludwig's JSON stuff in, or better yet, enable 
 some nice binary data format.
You two are terrible at motivating people. "Better D doesn't support it well" and "JSON is superior through-and-through" is overly dismissive. To me it sounds like someone saying replace C++ with JavaScript, because C++ is a horrible standard and JavaScript is so much superior. Honestly.
You seem to have missed the point of my post, which was to discourage him from working on an XML module for phobos. As for "motivating" him, I suggested better alternatives. And I never said JSON was great, but it's certainly _much_ more readable than XML, which is one of the basic goals of a text format.
 Remember that while JSON is simpler, XML is not just a
 structured container for bool, Number and String data. It
 comes with many official side kicks covering a broad range of
 use cases:

 XPath:
  * allows you to use XML files like a textual database
  * complex enough to allow for almost any imaginable query
  * many tools emerged to test XPath expressions against XML 
 documents
  * also powers XSLT
    (http://www.liquid-technologies.com/xpath-tutorial.aspx)

 XSL (Extensible Stylesheet Language) and
 XSLT (XSL Transformations):
  * written as XML documents
  * standard way to transform XML from one structure into another
  * convert or "compile" data to XHTML or SVG for display in a 
 browser
  * output to XSL-FO

 XSL-FO (XSL formatting objects):
  * written as XSL
  * type-setting for XML; an XSL-FO processor is similar to a 
 LaTeX processor
  * reads an XML document (a "Format" document) and outputs to a 
 PDF, RTF or similar format

 XML Schema Definition (XSD):
  * written as XML
  * linked in by an XML file
  * defines structure and validates content to some extent
  * can set constraints on how often an element can occur in a 
 list
  * can validate data type of values (length, regex, positive, 
 etc.)
  * database like unique IDs and references
These are all incredibly dumb ideas. I don't deny that many people may use these things, but then people use hammers for all kinds of things they shouldn't use them for too. :)
 I think XML is the most eat-your-own-dog-food language ever
 and nicely covers a wide range of use cases.
The problem is you're still eating dog food. ;)
 In any case there
 are many XML based file formats that we might want to parse.
 Amongst them SVG, OpenDocument (Open/LibreOffics), RSS feeds,
 several US Offices, XMP and other meta data formats.
Sure, and if he has any real need for any of those, who are we to stop him? But if he's just looking for some way to contribute, there are better ways.

On Monday, 4 May 2015 at 20:44:42 UTC, Jonathan M Davis wrote:
 Also true. Many of us just don't find enough time to work on D, 
 and we don't seem to do a good job of encouraging larger 
 contributions to Phobos, so newcomers don't tend to contribute 
 like that. And there's so much to do all around that the big 
 stuff just falls by the wayside, and it really shouldn't.
This is why I keep asking Walter and Andrei for a list of "big stuff" on the wiki - they don't have to be big, just important - so that newcomers know where help is most needed. Of course, it doesn't have to be them, it could be any member of the D core team, though whatever the BDFLs push for would have a bit more weight.
May 09 2015
parent reply "Craig Dillabaugh" <craig.dillabaugh gmail.com> writes:
On Saturday, 9 May 2015 at 10:28:53 UTC, Joakim wrote:
 On Monday, 4 May 2015 at 18:50:43 UTC, Marco Leise wrote:
 On Sunday, 3 May 2015 at 17:47:15 UTC, Joakim wrote:
clip
 Remember that while JSON is simpler, XML is not just a
 structured container for bool, Number and String data. It
 comes with many official side kicks covering a broad range of
 use cases:

 XPath:
 * allows you to use XML files like a textual database
 * complex enough to allow for almost any imaginable query
 * many tools emerged to test XPath expressions against XML 
 documents
 * also powers XSLT
   (http://www.liquid-technologies.com/xpath-tutorial.aspx)

 XSL (Extensible Stylesheet Language) and
 XSLT (XSL Transformations):
 * written as XML documents
 * standard way to transform XML from one structure into another
 * convert or "compile" data to XHTML or SVG for display in a 
 browser
 * output to XSL-FO

 XSL-FO (XSL formatting objects):
 * written as XSL
 * type-setting for XML; an XSL-FO processor is similar to a 
 LaTeX processor
 * reads an XML document (a "Format" document) and outputs to a 
 PDF, RTF or similar format

 XML Schema Definition (XSD):
 * written as XML
 * linked in by an XML file
 * defines structure and validates content to some extent
 * can set constraints on how often an element can occur in a 
 list
 * can validate data type of values (length, regex, positive, 
 etc.)
 * database like unique IDs and references
These are all incredibly dumb ideas. I don't deny that many people may use these things, but then people use hammers for all kinds of things they shouldn't use them for too. :)
 I think XML is the most eat-your-own-dog-food language ever
 and nicely covers a wide range of use cases.
The problem is you're still eating dog food. ;)
I have to agree with Joakim on this. Having spent much of this past week trying to get XML generated by gSOAP (the project has some legacy code) to work with JAXB (Java) has reinforced my dislike for XML.

I've used things like XPath and XSLT in the past, so I can appreciate their power, but I think the 'jobs' they perform would be better supported elsewhere (i.e. language-specific XML frameworks). In trying to pass data between applications I just want a simple way of packaging up the data, and ideally making serialization/deserialization easy for me. At some point the programmer working on these needs to understand and validate the data anyway. Sure, you can use DTD/XML Schema to handle the validation part, but it is just easier to deal with that within your own code - without having to learn a 'whole new language' that is likely harder to grok than the tools you would have at your disposal in your language of choice.

Having said all that: as much as I share Joakim's sentiment that I wish XML would just go away, there is a lot of it out there, and I think having good support in Phobos is very valuable, so I thank Robert for his efforts.

Craig
May 09 2015
parent reply Marco Leise <Marco.Leise gmx.de> writes:
On Sat, 09 May 2015 10:28:52 +0000,
"Joakim" <dlang joakim.fea.st> wrote:

 On Monday, 4 May 2015 at 18:50:43 UTC, Marco Leise wrote:

 You two are terrible at motivating people. "Better D doesn't
 support it well" and "JSON is superior through-and-through" is
 overly dismissive.
 …
You seem to have missed the point of my post, which was to discourage him from working on an XML module for phobos. As for "motivating" him, I suggested better alternatives. And I never said JSON was great, but it's certainly _much_ more readable than XML, which is one of the basic goals of a text format.
Well, I was mostly answering w0rp here. JSON is both readable and easy to parse, no question.
 Remember that while JSON is simpler, XML is not just a
 structured container for bool, Number and String data. It
 comes with many official side kicks covering a broad range of
 use cases:

 XPath:
  …

 XSL and XSLT
  …

 XSL-FO (XSL formatting objects):
  …

 XML Schema Definition (XSD):
  …
These are all incredibly dumb ideas. I don't deny that many people may use these things, but then people use hammers for all kinds of things they shouldn't use them for too. :)
:) One can't really answer this one. But with many hundreds of published data exchange formats built on XML, it can't have been too shabby all along. And sometimes small things matter, like being able to add comments along with the "payload". JSON doesn't have that. Or knowing that both sender and receiver will validate the XML the same way through XSD. So if it doesn't blow up on your end, it will pass validation on the other end, too.

On Sat, 09 May 2015 13:04:57 +0000, "Craig Dillabaugh" <craig.dillabaugh gmail.com> wrote:
 I have to agree with Joakim on this.  Having spent much of this
 past week trying to get XML generated by gSOAP (project has some
 legacy code) to work with JAXB (Java) has reinforced my dislike
 for XML.

 I've used things like XPath and XSLT in the past, so I can
 appreciate their power, but think the 'jobs' they perform would
 be better supported elsewhere (i.e. language-specific XML
 frameworks).

 In trying to pass data between applications I just want a simple
 way of packaging up the data and ideally making
 serialization/deserialization easy for me.  At some point the
 programmer working on these needs to understand and validate the
 data anyway.  Sure you can use DTD/XML Schema to handle the
 validation part, but it is just easier to deal with that within
 your own code - without having to learn a 'whole new language',
 that is likely harder to grok than the tools you would have at
 your disposal in your language of choice.
You see, the thing is that XSD is _not_ a whole new language, it is written in XML as well, probably specifically to make it so. Try to switch the perspective: with XSD (if it is sufficient for your validation needs) _one_ person needs to learn and write it, and other programmers (inside or outside the company) just use the XML library of their choice to handle validation via that schema. Once the schema is loaded it is usually no more than doc.validate(); (There are also good GUI tools to assist in writing XSD.) What you propose on the other hand is that everyone involved in the data exchange writes their own validation code in their language of choice, with either no access to existing sources or functionality that doesn't translate to their language!

-- Marco
May 10 2015
next sibling parent reply "Joakim" <dlang joakim.fea.st> writes:
On Sunday, 10 May 2015 at 07:01:58 UTC, Marco Leise wrote:
 Am Sat, 09 May 2015 10:28:52 +0000
 schrieb "Joakim" <dlang joakim.fea.st>:

 On Monday, 4 May 2015 at 18:50:43 UTC, Marco Leise wrote:
 Remember that while JSON is simpler, XML is not just a
 structured container for bool, Number and String data. It
 comes with many official side kicks covering a broad range of
 use cases:

 XPath:
  …

 XSL and XSLT
  …

 XSL-FO (XSL formatting objects):
  …

 XML Schema Definition (XSD):
  …
These are all incredibly dumb ideas. I don't deny that many people may use these things, but then people use hammers for all kinds of things they shouldn't use them for too. :)
:) One can't really answer this one. But with many hundreds of published data exchange formats built on XML, it can't have been too shabby all along.
It's worse than shabby, it's a horrible, horrible choice. Not just for data formats, but for _anything_. XML should not be used.
 And sometimes small things matter, like being able to add 
 comments
 along with the "payload". JSON doesn't have that.
 Or knowing that both sender and receiver will validate the XML 
 the
 same way through XSD. So if it doesn't blow up on your end, it 
 will
 pass validation on the other end, too.
One can do all these things with better formats than either XML or JSON. But why do we often end up dealing with these two? Familiarity, that is the only reason. XML seems familiar to anybody who's written some HTML, and JSON became familiar to web developers initially. Starting from those two large niches, they've expanded out to become the two most popular data interchange formats, despite XML being a horrible mess and JSON being too simple for many uses.

I'd like to see a move back to binary formats, which is why I mentioned that to Robert. D would be an ideal language in which to show the superiority of binary to text formats, given its emphasis on efficiency. Many devs have learned the wrong lessons from past closed binary formats, when open binary formats wouldn't have many of those deficiencies. There have been some interesting moves back to open binary formats/protocols in recent years, like Hessian (http://hessian.caucho.com/), Thrift (https://thrift.apache.org/), MessagePack (http://msgpack.org/), and Cap'n Proto (from the protobufs guy after he left Google - https://capnproto.org/). I'd rather see phobos support these, which are the future, rather than flash-in-the-pan text formats like XML or JSON.
May 10 2015
next sibling parent "Laeeth Isharc" <nospamlaeeth nospamlaeeth.com> writes:
On Sunday, 10 May 2015 at 08:54:09 UTC, Joakim wrote:
 It's worse than shabby, it's a horrible, horrible choice.  Not 
 just for data formats, but for _anything_.  XML should not be 
 used.
I feel the same way about XML, and I also think that having strong aesthetic internal emotional responses is often necessary to achieve excellence in engineering.
 But why do we often end up dealing with these two?  
 Familiarity, that is the only reason.  XML seems familiar to 
 anybody who's written some HTML, and JSON became familiar to 
 web developers initially.  Starting from those two large 
 niches, they've expanded out to become the two most popular 
 data interchange formats, despite XML being a horrible mess and 
 JSON being too simple for many uses.
Sometimes you get to pick, but often not. I can hardly tell the UK Debt Management Office to give up XML and switch to msgpack structs (well, I can, but I am not sure they would listen). So at the moment for some data series I use a python library via PyD to convert xml files to JSON. But it would be nice to do it all in D. I am not sure XML is going away very soon since new protocols keep being created using it. (Most recent one I heard of is one for allowing hedge funds to achieve full transparency of their portfolio to end investors - not necessarily something that will achieve what people think it will, but one in tune with the times). Laeeth.
May 10 2015
prev sibling parent "Kagamin" <spam here.lot> writes:
On Sunday, 10 May 2015 at 08:54:09 UTC, Joakim wrote:
 One can do all these things with better formats than either XML 
 or JSON.
Hypothetically, yes, though formats better than XML don't exist. I personally find XML perfectly readable.
May 12 2015
prev sibling parent reply "Ola Fosheim Grøstad" writes:
On Sunday, 10 May 2015 at 07:01:58 UTC, Marco Leise wrote:
 Well, I was mostly answering to w0rp here. JSON is both
 readable and easy to parse, no question.
JSON is just JavaScript literals with some silly constraints. As crappy a format as it gets. Even pure Lisp would have been better. And much more powerful!
 :) One can't really answer this one. But with many hundreds of
 published data exchange formats built on XML, it can't have been
 too shabby all along.
 And sometimes small things matter, like being able to add 
 comments
 along with the "payload".
XML is actually great for what it is: eXtensible. It means you can build forward compatible formats and annotate existing formats with metadata without breaking existing (compliant) applications etc... It also means you can datamine files without knowing the full format.
 Or knowing that both sender and receiver will validate the XML 
 the
 same way through XSD.
Right, or build a database/archival service that is generic. XML is not going away until there is something better, and that won't happen anytime soon. It is also one of the few formats that I actually need library and _good_ DOM support for. (JSON can be done in an afternoon, so I don't care if it is supported or not...)
May 10 2015
parent reply "Alex Parrill" <initrd.gz gmail.com> writes:
Can we please not turn this thread into an XML vs JSON flamewar?

XML is one of the most popular data formats (for better or for 
worse), so a parser would be a good addition to the standard 
library.
May 11 2015
parent "Ola Fosheim Grøstad" writes:
On Monday, 11 May 2015 at 15:20:12 UTC, Alex Parrill wrote:
 Can we please not turn this thread into an XML vs JSON flamewar?
This is not a flamewar. JSON is ad hoc and I use it a lot, but it isn't actually suitable as a file and archival exchange format. It is important that people understand what the point of XML is in order to build something useful. Full XML support and tooling is very valuable for typed GC-backed batch processing. That means namespaces, entities, XQuery equivalents, DOMs, etc. A library-backed tooling pipeline would be a valuable asset for D. The value is not in _reading_ or _writing_ XML. The value is all about providing a framework for structured grammar/namespace based _processing_ and _transforms_.
May 11 2015
prev sibling parent reply Chris <wendlec tcd.ie> writes:
On Sunday, 3 May 2015 at 17:47:15 UTC, Joakim wrote:
 On Sunday, 3 May 2015 at 17:39:48 UTC, Robert burner Schadek 
 wrote:
 std.xml has been considered not up to spec for nearly 3 years 
 now. Time to build a successor. I currently plan the following 
 features for it:

 - SAX and DOM parser
 - in-situ / slicing parsing when possible (forward range?)
 - compile time switch (CTS) for lazy attribute parsing
 - CTS for encoding (ubyte(ASCII), char(utf8), ... )
 - CTS for input validating
 - performance

 Not much code yet, I'm currently building the performance test 
 suite https://github.com/burner/std.xml2

 Please post your feature requests, and please keep the posts 
 DRY and on topic.
My request: just skip it. XML is a horrible waste of space for a standard; better that D doesn't support it well, anything to discourage its use. I'd rather see you spend your time on something worthwhile. If data formats are your thing, you could help get Ludwig's JSON stuff in, or better yet, enable some nice binary data format.
Glad to hear that someone is working on XML support. We cannot just "skip it". XML/HTML like mark up comes up all the time, here and there. I recently had to write a mini-parser (nowhere near the stuff Robert is doing, just a quick fix!) to extract data from XML input. This has nothing to do with personal preferences, it's just there [1] and has to be dealt with. [1] https://en.wikipedia.org/wiki/Speech_Synthesis_Markup_Language
Feb 19 2016
parent reply Joakim <dlang joakim.fea.st> writes:
On Friday, 19 February 2016 at 12:13:53 UTC, Chris wrote:
 On Sunday, 3 May 2015 at 17:47:15 UTC, Joakim wrote:
 On Sunday, 3 May 2015 at 17:39:48 UTC, Robert burner Schadek 
 wrote:
 std.xml has been considered not up to spec for nearly 3 years 
 now. Time to build a successor. I currently plan the 
 following features for it:

 - SAX and DOM parser
 - in-situ / slicing parsing when possible (forward range?)
 - compile time switch (CTS) for lazy attribute parsing
 - CTS for encoding (ubyte(ASCII), char(utf8), ... )
 - CTS for input validating
 - performance

 Not much code yet, I'm currently building the performance 
 test suite https://github.com/burner/std.xml2

 Please post your feature requests, and please keep the posts 
 DRY and on topic.
My request: just skip it. XML is a horrible waste of space for a standard; better that D doesn't support it well, anything to discourage its use. I'd rather see you spend your time on something worthwhile. If data formats are your thing, you could help get Ludwig's JSON stuff in, or better yet, enable some nice binary data format.
Glad to hear that someone is working on XML support. We cannot just "skip it". XML/HTML like mark up comes up all the time, here and there. I recently had to write a mini-parser (nowhere near the stuff Robert is doing, just a quick fix!) to extract data from XML input. This has nothing to do with personal preferences, it's just there [1] and has to be dealt with. [1] https://en.wikipedia.org/wiki/Speech_Synthesis_Markup_Language
Then write a good XML extraction-only library and dub it. I see no reason to include this in Phobos, which will encourage those who don't know any better to use it, since it comes with the compiler.

I'll close with a quote from Saint Linus of Torvalds, which I was unaware of till a couple days ago:

"XML is crap. Really. There are no excuses. XML is nasty to parse for humans, and it's a disaster to parse even for computers. There's just no reason for that horrible crap to exist." https://en.wikiquote.org/wiki/Linus_Torvalds#2014
Feb 23 2016
parent reply Dmitry <dmitry indiedev.ru> writes:
On Tuesday, 23 February 2016 at 11:22:23 UTC, Joakim wrote:
 Then write a good XML extraction-only library and dub it. I see 
 no reason to include this in Phobos
You won't be able to sleep if it's in Phobos? I use XML, and I don't like checking tons of third-party libraries to see which will be good for me, which have support (bugfixes), which will still have support in a few years, etc. Lots of systems already use XML, and any serious language _must_ have official support for it.
 If data formats are your thing, you could help get Ludwig's 
 JSON stuff in, or better yet, enable some nice binary data 
 format.
If it's better for you, that doesn't mean it will be better for everyone.
Feb 23 2016
parent Craig Dillabaugh <craig.dillabaugh gmail.com> writes:
On Tuesday, 23 February 2016 at 12:46:38 UTC, Dmitry wrote:
 On Tuesday, 23 February 2016 at 11:22:23 UTC, Joakim wrote:
 Then write a good XML extraction-only library and dub it. I 
 see no reason to include this in Phobos
You won't be able to sleep if it's in Phobos? I use XML, and I don't like checking tons of third-party libraries to see which will be good for me, which have support (bugfixes), which will still have support in a few years, etc. Lots of systems already use XML, and any serious language _must_ have official support for it.
So are you trying to say C/C++ are not serious languages? :o) Having said that, as much as I hate XML, basic support would be a nice feature for the language.
Feb 24 2016
prev sibling next sibling parent "Robert burner Schadek" <rburners gmail.com> writes:
- CTS to disable parsing location (line,column)
May 03 2015
prev sibling next sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 5/3/2015 10:39 AM, Robert burner Schadek wrote:
 Please post your feature requests, and please keep the posts DRY and on topic.
Pipeline range interface, for example: source.xmlparse(configuration).whatever();
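Spelled out a little, such a pipeline could read as follows; `xmlparse`, `Configuration` and the event names are hypothetical, only the range-composition shape is the point:

```d
import std.stdio : File;
import std.algorithm.iteration : filter, joiner, map;

// Each stage is a lazy range; nothing happens until iteration.
auto names = File("data.xml")
    .byChunk(4096)                  // lazy I/O, outside the XML package
    .joiner                         // flatten chunks into one ubyte range
    .xmlparse(Configuration.init)   // hypothetical parser stage
    .filter!(e => e.kind == EntityKind.elementStart)
    .map!(e => e.name);
```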
May 03 2015
prev sibling next sibling parent "wobbles " <grogan.colin gmail.com> writes:
On Sunday, 3 May 2015 at 17:39:48 UTC, Robert burner Schadek
wrote:
 std.xml has been considered not up to spec for nearly 3 years now. 
 Time to build a successor. I currently plan the following 
 features for it:

 - SAX and DOM parser
 - in-situ / slicing parsing when possible (forward range?)
 - compile time switch (CTS) for lazy attribute parsing
 - CTS for encoding (ubyte(ASCII), char(utf8), ... )
 - CTS for input validating
 - performance

 Not much code yet, I'm currently building the performance test 
 suite https://github.com/burner/std.xml2

 Please post your feature requests, and please keep the posts DRY 
 and on topic.
Could you possibly use Pegged to do it? It may simplify the parsing portion for you, at least.
May 03 2015
prev sibling next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 5/3/2015 10:39 AM, Robert burner Schadek wrote:
 - CTS for encoding (ubyte(ASCII), char(utf8), ... )
Encoding schemes should be handled by adapter algorithms, not in the XML parser itself, which should only handle UTF8.
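Such an adapter can be an ordinary range composition. For instance, an ASCII ubyte stream can be widened into the char range the parser expects (a sketch, not an existing Phobos or std.xml2 API):

```d
import std.algorithm.iteration : map;

// ASCII bytes are already valid UTF-8 code units, so a
// cast-per-element adapter is enough for this case.
auto asciiToUtf8(R)(R bytes)
{
    return bytes.map!(b => cast(char) b);
}

// Latin-1, Windows-1252, UTF-16 and friends would each get their
// own adapter range; the parser itself only ever sees char.
```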
May 03 2015
parent Marco Leise <Marco.Leise gmx.de> writes:
On Sun, 03 May 2015 14:00:11 -0700,
Walter Bright <newshound2 digitalmars.com> wrote:

 On 5/3/2015 10:39 AM, Robert burner Schadek wrote:
 - CTS for encoding (ubyte(ASCII), char(utf8), ... )
Encoding schemes should be handled by adapter algorithms, not in the XML parser itself, which should only handle UTF8.
Unlike JSON, XML actually declares the encoding in the prolog, e.g.: <?xml version="1.0" encoding="Windows-1252"?> -- Marco
May 04 2015
prev sibling next sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 5/3/2015 10:39 AM, Robert burner Schadek wrote:
 Please post your feature requests, and please keep the posts DRY and on topic.
Try to design the interface to it so it does not inherently require the implementation to allocate GC memory.
May 03 2015
prev sibling next sibling parent reply "Ilya Yaroshenko" <ilyayaroshenko gmail.com> writes:
Can it lazily read huge files (files greater than memory)?
May 03 2015
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 5/3/2015 2:31 PM, Ilya Yaroshenko wrote:
 Can it lazily read huge files (files greater than memory)?
If a range interface is used, it doesn't need to be aware of where the data is coming from. In fact, the xml package should NOT be doing I/O.
May 03 2015
next sibling parent reply "Ola Fosheim Grøstad" writes:
On Sunday, 3 May 2015 at 22:02:13 UTC, Walter Bright wrote:
 On 5/3/2015 2:31 PM, Ilya Yaroshenko wrote:
 Can it lazily read huge files (files greater than memory)?
If a range interface is used, it doesn't need to be aware of where the data is coming from. In fact, the xml package should NOT be doing I/O.
Wouldn't D-ranges make it impossible to use SIMD optimizations when scanning? However, it would make a lot of sense to just convert an existing XML solution with Boost license. I don't know which ones are any good, but RapidXML is at least Boost.
May 04 2015
next sibling parent reply "Jonathan M Davis" <jmdavisProg gmx.com> writes:
On Monday, 4 May 2015 at 09:35:55 UTC, Ola Fosheim Grøstad wrote:
 However, it would make a lot of sense to just convert an 
 existing XML solution with Boost license. I don't know which 
 ones are any good, but RapidXML is at least Boost.
Given how D's arrays work, we have the opportunity to have an _extremely_ fast XML parser thanks to slices. It's highly unlikely that any C or C++ solution is going to be able to compete, and if it can, it's likely to be far more complex than necessary. Parsing is an area where we definitely should write our own stuff rather than porting existing code from other languages or use existing libraries in other languages via C bindings. Fast parsing is definitely a killer feature of D and the fact that std.xml botches that so badly is just embarrassing. - Jonathan M Davis
May 04 2015
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 5/4/15 12:31 PM, Jonathan M Davis wrote:
 On Monday, 4 May 2015 at 09:35:55 UTC, Ola Fosheim Grøstad wrote:
 However, it would make a lot of sense to just convert an existing XML
 solution with Boost license. I don't know which ones are any good, but
 RapidXML is at least Boost.
Given how D's arrays work, we have the opportunity to have an _extremely_ fast XML parser thanks to slices. It's highly unlikely that any C or C++ solution is going to be able to compete, and if it can, it's likely to be far more complex than necessary. Parsing is an area where we definitely should write our own stuff rather than porting existing code from other languages or use existing libraries in other languages via C bindings. Fast parsing is definitely a killer feature of D and the fact that std.xml botches that so badly is just embarrassing.
To be frank, what's more embarrassing is that we managed to do nothing about it for years (aside from endlessly wailing about it in an a cappella ensemble). It's a failure of leadership (that Walter and I need to work on) that very many unimportant and arguably less interesting areas of Phobos get attention at the expense of this one. -- Andrei
May 04 2015
parent "Jonathan M Davis" <jmdavisProg gmx.com> writes:
On Monday, 4 May 2015 at 19:45:18 UTC, Andrei Alexandrescu wrote:
 On 5/4/15 12:31 PM, Jonathan M Davis wrote:
 Fast parsing is definitely a killer feature of
 D and the fact that std.xml botches that so badly is just 
 embarrassing.
To be frank what's more embarrassing is that we managed to do nothing about it for years (aside from endlessly wailing about it in an a capella ensemble). It's a failure of leadership (that Walter and I need to work on) that very many unimportant and arguably less interesting areas of Phobos get attention at the expense of this one. -- Andrei
Also true. Many of us just don't find enough time to work on D, and we don't seem to do a good job of encouraging larger contributions to Phobos, so newcomers don't tend to contribute like that. And there's so much to do all around that the big stuff just falls by the wayside, and it really shouldn't. - Jonathan M Davis
May 04 2015
prev sibling next sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 5/4/2015 12:31 PM, Jonathan M Davis wrote:
 Given how D's arrays work, we have the opportunity to have an _extremely_ fast
 XML parser thanks to slices. It's highly unlikely that any C or C++ solution is
 going to be able to compete, and if it can, it's likely to be far more complex
 than necessary. Parsing is an area where we definitely should write our own
 stuff rather than porting existing code from other languages or use existing
 libraries in other languages via C bindings. Fast parsing is definitely a
killer
 feature of D and the fact that std.xml botches that so badly is just
embarrassing.
Tango's XML package was well regarded and the fastest in the business. It used slicing, and almost no memory allocation.
May 04 2015
prev sibling parent "Ola Fosheim Grøstad" writes:
On Monday, 4 May 2015 at 19:31:59 UTC, Jonathan M Davis wrote:
 Given how D's arrays work, we have the opportunity to have an 
 _extremely_ fast XML parser thanks to slices.
Yes, that would be great. XML is a flexible go-to archive, exchange and application format. Things like entities, namespaces and so on make it non-trivial, but being able to conveniently process Inkscape and Open Office files etc. would be very useful. One should probably look at what applications generate XML and create some large test files with existing applications.
May 05 2015
prev sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 5/4/2015 2:35 AM, "Ola Fosheim Grøstad" <ola.fosheim.grostad+dlang gmail.com> wrote:
 Wouldn't D-ranges make it impossible to use SIMD optimizations when scanning?
Not at all. Algorithms can be specialized for various forms of input ranges, including ones where SIMD optimizations can be used. Specialization is one of the very cool things about D algorithms.
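A sketch of what such specialization can look like, with `memchr` standing in for a hand-written SIMD kernel (the `findLt` name is made up): the slice overload exploits contiguous memory, while a generic overload covers arbitrary input ranges.

```d
import core.stdc.string : memchr;
import std.range.primitives;

/// Offset of the first '<', or the input length if none.
/// Slice overload: contiguous memory, so the scan can use memchr
/// (typically vectorized in the C runtime) or real SIMD.
size_t findLt(const(char)[] s)
{
    auto p = memchr(s.ptr, '<', s.length);
    return p is null ? s.length : cast(const(char)*) p - s.ptr;
}

/// Generic fallback for arbitrary input ranges of characters.
size_t findLt(R)(R r)
    if (isInputRange!R && !is(R : const(char)[]))
{
    size_t i;
    for (; !r.empty; r.popFront(), ++i)
        if (r.front == '<') return i;
    return i;
}
```

Overload resolution picks the fast path automatically whenever the caller hands the parser a slice, with no change to the parser's range-based interface.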
May 04 2015
prev sibling parent "Jonathan M Davis" <jmdavisProg gmx.com> writes:
On Sunday, 3 May 2015 at 22:02:13 UTC, Walter Bright wrote:
 On 5/3/2015 2:31 PM, Ilya Yaroshenko wrote:
 Can it lazily read huge files (files greater than memory)?
If a range interface is used, it doesn't need to be aware of where the data is coming from. In fact, the xml package should NOT be doing I/O.
Indeed. It should operate on ranges without caring where they came from (though it may end up supporting both input ranges and random-access ranges with the idea that it can support reading from a socket with a range in a less efficient manner or operating on a whole file at once via a random-access range for more efficient parsing). But if I/O is a big concern, I'd suggest just using std.mmfile to do the trick, since then you can still operate on the whole file as a single array without having to actually have the whole thing in memory. - Jonathan M Davis
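The std.mmfile trick is a one-liner; a sketch (the `mapDocument` helper is hypothetical, and a real API would manage the mapping's lifetime, since the slice is only valid while the MmFile lives):

```d
import std.mmfile : MmFile;

/// Returns the whole document as a char slice backed by a memory
/// mapping: pages are faulted in on demand, so the file does not
/// have to fit in physical memory at once.
const(char)[] mapDocument(MmFile mmf)
{
    return cast(const(char)[]) mmf[];
}
```

Usage would be something like `auto mmf = new MmFile("huge.xml"); auto text = mapDocument(mmf);` (file name illustrative), after which `text` is an ordinary slice that a slicing parser can consume directly.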
May 04 2015
prev sibling next sibling parent reply Michel Fortin <michel.fortin michelf.ca> writes:
On 2015-05-03 17:39:46 +0000, "Robert burner Schadek" 
<rburners gmail.com> said:

 std.xml has been considered not up to specs nearly 3 years now. Time to 
 build a successor. I currently plan the following features for it:
 
 - SAX and DOM parser
 - in-situ / slicing parsing when possible (forward range?)
 - compile time switch (CTS) for lazy attribute parsing
 - CTS for encoding (ubyte(ASCII), char(utf8), ... )
 - CTS for input validating
 - performance
 
 Not much code yet, I'm currently building the performance test suite 
 https://github.com/burner/std.xml2
 
 Please post your feature requests, and please keep the posts DRY and on topic.
This isn't a feature request (sorry?), but I just want to point out that you should feel free to borrow code from https://github.com/michelf/mfr-xml-d There's probably a lot you can reuse in there. -- Michel Fortin michel.fortin michelf.ca http://michelf.ca
May 03 2015
parent "Robert burner Schadek" <rburners gmail.com> writes:
On Sunday, 3 May 2015 at 23:32:28 UTC, Michel Fortin wrote:
 This isn't a feature request (sorry?), but I just want to point 
 out that you should feel free to borrow code from 
 https://github.com/michelf/mfr-xml-d  There's probably a lot 
 you can reuse in there.
nice, thank you
May 04 2015
prev sibling next sibling parent Rikki Cattermole <alphaglosined gmail.com> writes:
On 4/05/2015 5:39 a.m., Robert burner Schadek wrote:
 std.xml has been considered not up to specs nearly 3 years now. Time to
 build a successor. I currently plan the following features for it:

 - SAX and DOM parser
 - in-situ / slicing parsing when possible (forward range?)
 - compile time switch (CTS) for lazy attribute parsing
 - CTS for encoding (ubyte(ASCII), char(utf8), ... )
 - CTS for input validating
 - performance

 Not much code yet, I'm currently building the performance test suite
 https://github.com/burner/std.xml2

 Please post your feature requests, and please keep the posts DRY and on
 topic.
Preferably the interfaces are made first, 1:1 as the spec requires. Then it's just a matter of building the actual reader/writer code. That way we could theoretically rewrite the reader/writer to support other formats such as HTML5/SVG, independently of Phobos. Also, it would be nice to be CTFE'able!
May 03 2015
prev sibling next sibling parent reply "Jonathan M Davis" <jmdavisProg gmx.com> writes:
On Sunday, 3 May 2015 at 17:39:48 UTC, Robert burner Schadek 
wrote:
 std.xml has been considered not up to specs nearly 3 years now. 
 Time to build a successor. I currently plan the following 
 features for it:

 - SAX and DOM parser
 - in-situ / slicing parsing when possible (forward range?)
 - compile time switch (CTS) for lazy attribute parsing
 - CTS for encoding (ubyte(ASCII), char(utf8), ... )
 - CTS for input validating
 - performance

 Not much code yet, I'm currently building the performance test 
 suite https://github.com/burner/std.xml2

 Please post your feature requests, and please keep the posts DRY 
 and on topic.
If I were doing it, I'd do three types of parsers: 1. A parser that was pretty much as low level as you can get, where you basically have a range of XML attributes or tags. Exactly how to build that could be a bit entertaining, since it would have to be hierarchical, and ranges aren't, but something like a range of tags where you can get a range of its attributes and sub-tags from it so that the whole document can be processed without actually getting to the level of even a SAX parser. That parser could then be used to build the other parsers, and anyone who needed insanely fast speeds could use it rather than the SAX or DOM parser so long as they were willing to pay the inevitable loss in user-friendliness. 2. SAX parser built on the low level parser. 3. DOM parser built either on the low level parser or the SAX parser (whichever made more sense). I doubt that I'm really explaining the low level parser well enough or have even thought it through enough, but I really think that even a SAX parser is too high level for the base parser and that something slightly higher than a lexer (high enough to actually be processing XML rather than individual tokens but pretty much only as high as is required to do that) would be a far better choice. IIRC, Michel Fortin's work went in that direction, and he linked to his code in another post, so I'd suggest at least looking at that for ideas. Regardless, by building layers of XML parsers rather than just the standard ones, it should be possible to get higher performance while still having the more standard, user-friendly ones for those that don't need the full performance and do need the user-friendliness (though of course, we do want the SAX and DOM parsers to be efficient as well). - Jonathan M Davis
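One possible shape for that lowest layer, just to make the idea concrete (all names here are hypothetical): the parser is an input range of shallow event nodes whose strings are slices of the input, and the SAX and DOM layers are thin consumers of that range.

```d
/// The event kinds the low-level layer would emit.
enum NodeKind { open, close, emptyTag, text, comment, cdata, pi }

/// One event from the low-level parser. `name` and `data` are
/// slices into the original document, so no copying happens.
struct Node
{
    NodeKind kind;
    const(char)[] name; // tag name or PI target; empty for text nodes
    const(char)[] data; // text/comment/cdata payload
}

// A SAX layer is then a dispatcher over an input range of Node,
// and a DOM layer folds the same event stream into a tree:
//   foreach (n; lowLevelParse(doc)) { ... }
```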
May 04 2015
parent Jacob Carlborg <doob me.com> writes:
On 2015-05-04 21:14, Jonathan M Davis wrote:

 If I were doing it, I'd do three types of parsers:

 1. A parser that was pretty much as low level as you can get, where you
 basically have a range of XML attributes or tags. Exactly how to build that
 could be a bit entertaining, since it would have to be hierarchical, and
 ranges aren't, but something like a range of tags where you can get a
 range of its attributes and sub-tags from it so that the whole document
 can be processed without actually getting to the level of even a SAX
 parser. That parser could then be used to build the other parsers, and
 anyone who needed insanely fast speeds could use it rather than the SAX
 or DOM parser so long as they were willing to pay the inevitable loss in
 user-friendliness.

 2. SAX parser built on the low level parser.

 3. DOM parser built either on the low level parser or the SAX parser
 (whichever made more sense).

 I doubt that I'm really explaining the low level parser well enough or
 have even thought it through enough, but I really think that even a SAX
 parser is too high level for the base parser and that something
 slightly higher than a lexer (high enough to actually be processing XML
 rather than individual tokens but pretty much only as high as is
 required to do that) would be a far better choice.

 IIRC, Michel Fortin's work went in that direction, and he linked to his
 code in another post, so I'd suggest at least looking at that for ideas.
This is the way the XML parser is structured in Tango. A pull parser at the lowest level, a SAX parser on top of that, and I think the DOM parser builds on top of the pull parser. The Tango pull parser can give you the following tokens: * start element * attribute * end element * end empty element * data * comment * cdata * doctype * pi -- /Jacob Carlborg
May 04 2015
prev sibling next sibling parent reply Jacob Carlborg <doob me.com> writes:
On 2015-05-03 19:39, Robert burner Schadek wrote:

 Not much code yet, I'm currently building the performance test suite
 https://github.com/burner/std.xml2
I recommend benchmarking against the Tango pull parser. -- /Jacob Carlborg
May 04 2015
next sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 5/4/2015 12:28 PM, Jacob Carlborg wrote:
 On 2015-05-03 19:39, Robert burner Schadek wrote:

 Not much code yet, I'm currently building the performance test suite
 https://github.com/burner/std.xml2
I recommend benchmarking against the Tango pull parser.
I agree. The Tango XML parser has set the performance bar. If any new solution can't match that, throw it out and try again.
May 04 2015
prev sibling parent reply "Mario Kröplin" <linkrope github.com> writes:
On Monday, 4 May 2015 at 19:28:25 UTC, Jacob Carlborg wrote:
 On 2015-05-03 19:39, Robert burner Schadek wrote:

 Not much code yet, I'm currently building the performance test 
 suite
 https://github.com/burner/std.xml2
I recommend benchmarking against the Tango pull parser.
Recently, I compared DOM parsers for an XML file of 100 MByte: 15.8 s tango.text.xml (SiegeLord/Tango-D2) 13.4 s ae.utils.xml (CyberShadow/ae) 8.5 s xml.etree (Python) Either the Tango DOM parser is slow compared to the Tango pull parser, or the D2 port ruined the performance.
May 05 2015
next sibling parent "John Colvin" <john.loughran.colvin gmail.com> writes:
On Tuesday, 5 May 2015 at 10:41:37 UTC, Mario Kröplin wrote:
 On Monday, 4 May 2015 at 19:28:25 UTC, Jacob Carlborg wrote:
 On 2015-05-03 19:39, Robert burner Schadek wrote:

 Not much code yet, I'm currently building the performance 
 test suite
 https://github.com/burner/std.xml2
I recommend benchmarking against the Tango pull parser.
 Recently, I compared DOM parsers for an XML file of 100 MByte: 15.8 s tango.text.xml (SiegeLord/Tango-D2) 13.4 s ae.utils.xml (CyberShadow/ae) 8.5 s xml.etree (Python) Either the Tango DOM parser is slow compared to the Tango pull parser, or the D2 port ruined the performance.
As usual: system, compiler, compiler version, compilation flags?
May 05 2015
prev sibling next sibling parent reply Richard Webb <richard.webb boldonjames.com> writes:
On 05/05/2015 11:41, "Mario Kröplin" <linkrope github.com> wrote:
 Recently, I compared DOM parsers for an XML file of 100 MByte:

 15.8 s tango.text.xml (SiegeLord/Tango-D2)
 13.4 s ae.utils.xml (CyberShadow/ae)
   8.5 s xml.etree (Python)

 Either the Tango DOM parser is slow compared to the Tango pull parser,
 or the D2 port ruined the performance.
fwiw I did some tests a couple of years back with https://launchpad.net/d2-xml on 20 odd megabyte files and found it faster than Tango. Unfortunately that would need some work to test now, as xmlp is abandoned and wouldn't build last time I tried it :-( I also had some success with https://github.com/opticron/kxml, though it had some issues with chuffy entity decoding performance. Also, profiling showed a lot of time spent in the GC, and the recent improvements in that area might have changed things by now.
May 05 2015
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 5/5/2015 4:16 AM, Richard Webb wrote:
 Also, profiling showed a lot of time spent in the GC, and the recent
 improvements in that area might have changed things by now.
I haven't read the Tango source code, but the performance of its XML parser was supposedly because it did not use the GC, it used slices.
May 05 2015
parent reply Jacob Carlborg <doob me.com> writes:
On 2015-05-06 01:38, Walter Bright wrote:

 I haven't read the Tango source code, but the performance of its XML parser
 was supposedly because it did not use the GC, it used slices.
That's only true for the pull parser (not sure about the SAX parser). The DOM parser needs to allocate the nodes, but if I recall correctly those are allocated in a free list. Not sure which parser was used in the test. -- /Jacob Carlborg
May 05 2015
parent Richard Webb <richard.webb boldonjames.com> writes:
On 06/05/2015 07:31, Jacob Carlborg wrote:
 On 2015-05-06 01:38, Walter Bright wrote:

 I haven't read the Tango source code, but the performance of its XML parser
 was supposedly because it did not use the GC, it used slices.
That's only true for the pull parser (not sure about the SAX parser). The DOM parser needs to allocate the nodes, but if I recall correctly those are allocated in a free list. Not sure which parser was used in the test.
The direct comparisons were with the DOM parsers (I was playing with a D port of some C++ code at work at the time, and that is DOM based). xmlp has alternate parsers (event driven etc.) which were faster in some simple tests I did, but I don't recall if I did a direct comparison with Tango there.
May 06 2015
prev sibling parent reply Jacob Carlborg <doob me.com> writes:
On 2015-05-05 12:41, "Mario Kröplin" <linkrope github.com> wrote:

 Recently, I compared DOM parsers for an XML file of 100 MByte:

 15.8 s tango.text.xml (SiegeLord/Tango-D2)
 13.4 s ae.utils.xml (CyberShadow/ae)
   8.5 s xml.etree (Python)

 Either the Tango DOM parser is slow compared to the Tango pull parser,
Yes, of course it's slower. The DOM parser creates a DOM as well, which the pull parser doesn't. These other libraries, what kind of parsers are those using? I mean, it's not fair to compare a pull parser against a DOM parser. Could you try D1 Tango as well? Or do you have the benchmark available somewhere?
 or the D2 port ruined the performance.
Might be the case as well, see this comment [1]. [1] http://forum.dlang.org/thread/vsbsxfeciryrdsjhhfak forum.dlang.org?page=3#post-mi8hs8:24b0j:241:40digitalmars.com -- /Jacob Carlborg
May 05 2015
parent reply "Ola Fosheim Grøstad" writes:
On Tuesday, 5 May 2015 at 12:10:59 UTC, Jacob Carlborg wrote:
 Yes, of course it's slower. The DOM parser creates a DOM as 
 well, which the pull parser doesn't.

 These other libraries, what kind of parsers are those using? I 
 mean, it's not fair to compare a pull parser against a DOM 
 parser.
I agree. Most applications will use a DOM parser for convenience, so sacrificing some speed initially in favour of ease of use makes a lot of sense. As long as it is possible to improve it later (e.g. use SIMD scanning to find the end of CDATA etc). In my opinion it is rather difficult to build a good API without also using the API in an application in parallel. So it would be a good strategy to build a specific DOM along with writing the XML infrastructure, like SVG/HTML. Also, some parsers, like RapidXML, only support a subset of XML. So they cannot be used for comparisons.
May 05 2015
parent reply Jacob Carlborg <doob me.com> writes:
On 2015-05-05 16:04, "Ola Fosheim Grøstad" <ola.fosheim.grostad+dlang gmail.com> wrote:

 In my opinion it is rather difficult to build a good API without also
 using the API in an application in parallel. So it would be a good
 strategy to build a specific DOM along with writing the XML
 infrastructure, like SVG/HTML.
Agree.
 Also, some parsers, like RapidXML only support a subset of XML. So they
 cannot be used for comparisons.
The Tango parser has some limitations as well. In some places it sacrificed correctness for speed. There's a comment claiming the parser might read past the input if it's not well formed. -- /Jacob Carlborg
May 05 2015
parent Brad Roberts via Digitalmars-d <digitalmars-d puremagic.com> writes:
An old friend of mine who was intimate with the Microsoft XML parsers 
was fond of saying, particularly with respect to XML parsers, that if 
you hadn't finished implementing and testing error handling and negative 
tests (i.e., malformed documents), your positive benchmarks were 
fairly meaningless. A whole lot of work goes into that 'second half' of 
things that can quickly cost performance.

I didn't dive into, or don't recall, the specific details, as this was years ago.

The (over-)generalization from there is an old adage: it's easy to write 
an incorrect program.

On 5/5/2015 11:33 PM, Jacob Carlborg via Digitalmars-d wrote:
 On 2015-05-05 16:04, "Ola Fosheim Grøstad" <ola.fosheim.grostad+dlang gmail.com> wrote:

 In my opinion it is rather difficult to build a good API without also
 using the API in an application in parallel. So it would be a good
 strategy to build a specific DOM along with writing the XML
 infrastructure, like SVG/HTML.
Agree.
 Also, some parsers, like RapidXML only support a subset of XML. So they
 cannot be used for comparisons.
The Tango parser has some limitations as well. In some places it sacrificed correctness for speed. There's a comment claiming the parser might read past the input if it's not well formed.
May 05 2015
prev sibling next sibling parent Jacob Carlborg <doob me.com> writes:
On 2015-05-03 19:39, Robert burner Schadek wrote:

 Not much code yet, I'm currently building the performance test suite
 https://github.com/burner/std.xml2
There are a couple of interesting comments about the Tango pull parser that can be worth mentioning: * Use -version=whitespace to retain whitespace as data nodes. We see a 25% increase in token count and a 10% throughput drop when parsing "hamlet.xml" with this option enabled (pullparser alone) * The parser is constructed with some tradeoffs relating to document integrity. It is generally optimized for well-formed documents, and currently may read past a document-end for those that are not well formed * Making some tiny unrelated change to the code can cause notable throughput changes. We're not yet clear why these swings are so pronounced (for changes outside the code path) but they seem to be related to the alignment of codegen. It could be a cache-line issue, or something else The last comment might not be relevant anymore since these are all quite old comments. -- /Jacob Carlborg
May 04 2015
prev sibling next sibling parent reply "Liam McSherry" <mcsherry.liam gmail.com> writes:
On Sunday, 3 May 2015 at 17:39:48 UTC, Robert burner Schadek
wrote:
 std.xml has been considered not up to specs nearly 3 years now. 
 Time to build a successor. I currently plan the following 
 featues for it:

 - SAX and DOM parser
 - in-situ / slicing parsing when possible (forward range?)
 - compile time switch (CTS) for lazy attribute parsing
 - CTS for encoding (ubyte(ASCII), char(utf8), ... )
 - CTS for input validating
 - performance

 Not much code yet, I'm currently building the performance test 
 suite https://github.com/burner/std.xml2

 Please post your feature requests, and please keep the posts DRY 
 and on topic.
Not a feature, but if `std.data.json` [1] gets accepted into Phobos, it may be worth considering naming this `std.data.xml` (although that might not as effectively differentiate it from `std.xml`). [1]: http://wiki.dlang.org/Review_Queue
May 04 2015
parent Rikki Cattermole <alphaglosined gmail.com> writes:
On 5/05/2015 10:45 a.m., Liam McSherry wrote:
 On Sunday, 3 May 2015 at 17:39:48 UTC, Robert burner Schadek
 wrote:
 std.xml has been considered not up to specs nearly 3 years now. Time
 to build a successor. I currently plan the following features for it:

 - SAX and DOM parser
 - in-situ / slicing parsing when possible (forward range?)
 - compile time switch (CTS) for lazy attribute parsing
 - CTS for encoding (ubyte(ASCII), char(utf8), ... )
 - CTS for input validating
 - performance

 Not much code yet, I'm currently building the performance test suite
 https://github.com/burner/std.xml2

 Please post your feature requests, and please keep the posts DRY and on
 topic.
 Not a feature, but if `std.data.json` [1] gets accepted into Phobos, it may be worth considering naming this `std.data.xml` (although that might not as effectively differentiate it from `std.xml`). [1]: http://wiki.dlang.org/Review_Queue
It really should be std.data.xml, to keep with the new structuring. Plus it'll make transitioning a little easier.
May 04 2015
prev sibling next sibling parent reply "weaselcat" <weaselcat gmail.com> writes:
On Sunday, 3 May 2015 at 17:39:48 UTC, Robert burner Schadek 
wrote:
 std.xml has been considered not up to specs nearly 3 years now. 
 Time to build a successor. I currently plan the following 
 features for it:

 - SAX and DOM parser
 - in-situ / slicing parsing when possible (forward range?)
 - compile time switch (CTS) for lazy attribute parsing
 - CTS for encoding (ubyte(ASCII), char(utf8), ... )
 - CTS for input validating
 - performance

 Not much code yet, I'm currently building the performance test 
 suite https://github.com/burner/std.xml2

 Please post your feature requests, and please keep the posts DRY 
 and on topic.
maybe off-topic, but it would be nice if the standard json, xml, etc. all had identical interfaces (except for implementation-specific quirks). This might be something worth discussing if it wasn't already agreed upon.
May 04 2015
parent Marco Leise <Marco.Leise gmx.de> writes:
Am Tue, 05 May 2015 02:01:50 +0000
schrieb "weaselcat" <weaselcat gmail.com>:

 maybe off-topic, but it would be nice if the standard json, xml, 
 etc. all had identical interfaces (except for 
 implementation-specific quirks). This might be something worth 
 discussing if it wasn't already agreed upon.
I don't think this needs discussion. It is plain impossible to have a sophisticated JSON parser and a sophisticated XML parser share the same API. Established function names, structural differences in the formats, and feature sets differ too much. For example, in XML attributes and child elements are used somewhat interchangeably, whereas in JSON attributes don't exist. So while in JSON "obj.field" makes sense, in XML you would want to select either an attribute or an element with the name "field". -- Marco
May 04 2015
prev sibling next sibling parent reply Alex Vincent <ajvincent gmail.com> writes:
On Sunday, 3 May 2015 at 17:39:48 UTC, Robert burner Schadek 
wrote:
 std.xml has been considered not up to specs nearly 3 years now. 
 Time to build a successor. I currently plan the following 
 features for it:

 - SAX and DOM parser
 - in-situ / slicing parsing when possible (forward range?)
 - compile time switch (CTS) for lazy attribute parsing
 - CTS for encoding (ubyte(ASCII), char(utf8), ... )
 - CTS for input validating
 - performance

 Not much code yet, I'm currently building the performance test 
 suite https://github.com/burner/std.xml2

 Please post your feature requests, and please keep the posts DRY 
 and on topic.
I'm looking for a status update. DUB doesn't seem to have many options posted. I was thinking about starting a SAXParser implementation.
Feb 17 2016
parent reply Robert burner Schadek <rburners gmail.com> writes:
On Thursday, 18 February 2016 at 04:34:13 UTC, Alex Vincent wrote:
 I'm looking for a status update.  DUB doesn't seem to have many 
 options posted.  I was thinking about starting a SAXParser 
 implementation.
I'm working on it, but recently I had to do some major restructuring of the code. Currently I'm trying to get this merged https://github.com/D-Programming-Language/phobos/pull/3880 because I had some problems with the encoding of test files. XML has a lot of corner cases, it just takes time. If you want to work on some XML stuff, please join me. It is probably more productive working together than creating two competing implementations.
Feb 18 2016
next sibling parent reply Robert burner Schadek <rburners gmail.com> writes:
On Thursday, 18 February 2016 at 10:18:18 UTC, Robert burner 
Schadek wrote:
 If you want to work on some XML stuff, please join me. It is 
 probably more productive working together than creating two 
 competing implementations.
also I would like to see this https://github.com/D-Programming-Language/phobos/pull/2995 go in first to be able to accurately measure and compare performance
Feb 18 2016
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 02/18/2016 05:49 AM, Robert burner Schadek wrote:
 On Thursday, 18 February 2016 at 10:18:18 UTC, Robert burner Schadek wrote:
 If you want to work on some XML stuff, please join me. It is probably more
 productive working together than creating two competing implementations.
also I would like to see this https://github.com/D-Programming-Language/phobos/pull/2995 go in first to be able to accurately measure and compare performance
Would the measuring be possible with 2995 as a dub package? -- Andrei
Feb 18 2016
parent reply Robert burner Schadek <rburners gmail.com> writes:
On Thursday, 18 February 2016 at 12:30:29 UTC, Andrei 
Alexandrescu wrote:
 also I would like to see this
 https://github.com/D-Programming-Language/phobos/pull/2995 go 
 in first
 to be able to accurately measure and compare performance
Would the measuring be possible with 2995 as a dub package? -- Andrei
yes, after having synced the dub package to the PR
Feb 18 2016
parent Robert burner Schadek <rburners gmail.com> writes:
On Thursday, 18 February 2016 at 15:39:01 UTC, Robert burner 
Schadek wrote:
 On Thursday, 18 February 2016 at 12:30:29 UTC, Andrei 
 Alexandrescu wrote:
 Would the measuring be possible with 2995 as a dub package? -- 
 Andrei
 yes, after having synced the dub package to the PR
brought the dub package up to date with the PR (v0.0.6)
Feb 23 2016
prev sibling next sibling parent Alex Vincent <ajvincent gmail.com> writes:
On Thursday, 18 February 2016 at 10:18:18 UTC, Robert burner 
Schadek wrote:
 If you want to work on some XML stuff, please join me. It is 
 probably more productive working together than creating two 
 competing implementations.
Oh, I absolutely agree, independent implementation is a bad thing. (Someone should rename DRY as "don't repeat yourself or others"... but DRYOO sounds weird.) Where's your repo?
Feb 18 2016
prev sibling parent reply Craig Dillabaugh <craig.dillabaugh gmail.com> writes:
On Thursday, 18 February 2016 at 10:18:18 UTC, Robert burner 
Schadek wrote:
 On Thursday, 18 February 2016 at 04:34:13 UTC, Alex Vincent 
 wrote:
 I'm looking for a status update.  DUB doesn't seem to have 
 many options posted.  I was thinking about starting a 
 SAXParser implementation.
 I'm working on it, but recently I had to do some major restructuring of the code. Currently I'm trying to get this merged https://github.com/D-Programming-Language/phobos/pull/3880 because I had some problems with the encoding of test files. XML has a lot of corner cases, it just takes time. If you want to work on some XML stuff, please join me. It is probably more productive working together than creating two competing implementations.
Would you be interested in mentoring a student for the Google Summer of Code to do work on std.xml?
Feb 18 2016
parent Robert burner Schadek <rburners gmail.com> writes:
On Friday, 19 February 2016 at 04:02:02 UTC, Craig Dillabaugh 
wrote:
 Would you be interested in mentoring a student for the Google 
 Summer of Code to do work on std.xml?
Yes, why not!
Feb 19 2016
prev sibling next sibling parent reply Robert burner Schadek <rburners gmail.com> writes:
While working on a new xml implementation I came across "control 
characters (CC)". [1]
When trying to validate/convert a UTF string these lead to 
exceptions, because they are not valid UTF characters.
Unfortunately, some of these characters are allowed to appear in 
valid xml 1.* documents.

I currently see two option how to go about it:

1. Do not allow CCs that do not work with the existing 
functionality.
1.Pros
   * easy
1.Cons
   * the resulting xml implementation will not be xml 1.* complete

2. Add special cases to the existing functionality to handle CCs 
that are allowed in 1.0.
2.Pros
   * the resulting xml implementation will be xml 1.* complete
2.Cons
   * will make utf de/encoding slower as I would need to add 
additional logic

Any other ideas, feedback?




[1] https://en.wikipedia.org/wiki/C0_and_C1_control_codes
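To make the difference between the two options concrete, here is a small sketch (in Python, for illustration only; not part of the eventual D implementation) of the Char productions from the XML 1.0 and 1.1 specs:

```python
# Sketch of the Char productions from the XML 1.0 and 1.1 specs.
# XML 1.0 allows raw C1 controls (U+0080-U+009F); XML 1.1 restricts
# them to character references (RestrictedChar), except NEL (U+0085).

def legal_in_xml10(cp: int) -> bool:
    # Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD]
    #        | [#x10000-#x10FFFF]
    return cp in (0x9, 0xA, 0xD) \
        or 0x20 <= cp <= 0xD7FF \
        or 0xE000 <= cp <= 0xFFFD \
        or 0x10000 <= cp <= 0x10FFFF

def legal_raw_in_xml11(cp: int) -> bool:
    # Char minus RestrictedChar: C0/C1 controls (other than tab, LF,
    # CR and NEL) may only appear as character references in 1.1.
    if cp in (0x9, 0xA, 0xD, 0x85):
        return True
    if cp <= 0x1F or 0x7F <= cp <= 0x9F:
        return False
    return 0x20 <= cp <= 0xD7FF \
        or 0xE000 <= cp <= 0xFFFD \
        or 0x10000 <= cp <= 0x10FFFF

print(legal_in_xml10(0x80))      # True: fine raw in XML 1.0
print(legal_raw_in_xml11(0x80))  # False: restricted in XML 1.1
```

Option 2 would essentially push a check like this into the hot de/encoding path, which is where the slowdown comes from.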
Feb 18 2016
parent reply Adam D. Ruppe <destructionator gmail.com> writes:
On Thursday, 18 February 2016 at 15:56:58 UTC, Robert burner 
Schadek wrote:
 When trying to validate/convert a UTF string these lead to 
 exceptions, because they are not valid UTF characters.
That means the user didn't encode them properly... Which one specifically are you thinking of? I'm pretty sure all those control characters have a spot in the Unicode space and can be properly encoded as UTF-8 (though I think even if they are properly encoded, some of them are illegal in XML anyway). If they appear in another form, it is invalid and/or needs a charset conversion, which should be specified in the XML document itself.
Feb 18 2016
parent reply Robert burner Schadek <rburners gmail.com> writes:
for instance, quite often I find <80> in tests that are supposed 
to be valid XML 1.0. They are invalid XML 1.1, though
Feb 18 2016
parent reply Adam D. Ruppe <destructionator gmail.com> writes:
On Thursday, 18 February 2016 at 16:41:52 UTC, Robert burner 
Schadek wrote:
 for instance, quite often I find <80> in tests that are 
 supposed to be valid XML 1.0. They are invalid XML 1.1, though
What char encoding does the document declare itself as?
Feb 18 2016
parent reply Robert burner Schadek <rburners gmail.com> writes:
On Thursday, 18 February 2016 at 16:47:35 UTC, Adam D. Ruppe 
wrote:
 On Thursday, 18 February 2016 at 16:41:52 UTC, Robert burner 
 Schadek wrote:
 for instance, quite often I find <80> in tests that are 
 supposed to be valid XML 1.0. They are invalid XML 1.1, though
What char encoding does the document declare itself as?
It does not; it has no prolog and therefore no EncodingInfo. unix file says it is a utf8-encoded file, but no BOM is present.
Feb 18 2016
next sibling parent reply Robert burner Schadek <rburners gmail.com> writes:
On Thursday, 18 February 2016 at 16:54:10 UTC, Robert burner 
Schadek wrote:
 unix file says it is a utf8-encoded file, but no BOM is 
 present.
the hex dump is "3C 66 6F 6F 3E C2 80 3C 2F 66 6F 6F 3E"
Feb 18 2016
next sibling parent reply Adam D. Ruppe <destructionator gmail.com> writes:
On Thursday, 18 February 2016 at 16:56:08 UTC, Robert burner 
Schadek wrote:
 unix file says it is a utf8-encoded file, but no BOM is 
 present.
the hex dump is "3C 66 6F 6F 3E C2 80 3C 2F 66 6F 6F 3E"
Gah, I should have read this before replying... well, that does appear to be valid utf-8.... why is it throwing an exception then? I'm pretty sure that byte stream *is* actually well-formed xml 1.0 and should pass utf validation as well as the XML well-formedness check.
Feb 18 2016
parent reply Alex Vincent <ajvincent gmail.com> writes:
On Thursday, 18 February 2016 at 17:26:30 UTC, Adam D. Ruppe 
wrote:
 On Thursday, 18 February 2016 at 16:56:08 UTC, Robert burner 
 Schadek wrote:
 unix file says it is a utf8-encoded file, but no BOM is 
 present.
the hex dump is "3C 66 6F 6F 3E C2 80 3C 2F 66 6F 6F 3E"
Gah, I should have read this before replying... well, that does appear to be valid utf-8.... why is it throwing an exception then? I'm pretty sure that byte stream *is* actually well-formed xml 1.0 and should pass utf validation as well as the XML well-formedness check.
Regarding control characters: If you give me a complete sample file, I can run it through Mozilla's UTF stream conversion and/or XML parsing code (via either SAX or DOMParser) to tell you how that reacts as a reference. Mozilla supports XML 1.0, but not 1.1.
Feb 18 2016
parent reply Robert burner Schadek <rburners gmail.com> writes:
On Thursday, 18 February 2016 at 18:28:10 UTC, Alex Vincent wrote:
 Regarding control characters:  If you give me a complete sample 
 file, I can run it through Mozilla's UTF stream conversion 
 and/or XML parsing code (via either SAX or DOMParser) to tell 
 you how that reacts as a reference.  Mozilla supports XML 1.0, 
 but not 1.1.
thank you for making the effort https://github.com/burner/std.xml2/blob/master/tests/eduni/xml-1.1/out/010.xml
Feb 18 2016
parent Alex Vincent <ajvincent gmail.com> writes:
On Thursday, 18 February 2016 at 21:53:24 UTC, Robert burner 
Schadek wrote:
 On Thursday, 18 February 2016 at 18:28:10 UTC, Alex Vincent 
 wrote:
 Regarding control characters:  If you give me a complete 
 sample file, I can run it through Mozilla's UTF stream 
 conversion and/or XML parsing code (via either SAX or 
 DOMParser) to tell you how that reacts as a reference.  
 Mozilla supports XML 1.0, but not 1.1.
thank you for making the effort https://github.com/burner/std.xml2/blob/master/tests/eduni/xml-1.1/out/010.xml
In this case, Firefox just passes the control characters through 
to the contentHandler.characters method:

Starting runTest
Retrieved source
contentHandler.startDocument()
contentHandler.startElement("", "foo", "foo", {})
contentHandler.characters("\u0080")
contentHandler.endElement("", "foo", "foo")
contentHandler.endDocument()
Done reading
Feb 19 2016
prev sibling parent reply Kagamin <spam here.lot> writes:
On Thursday, 18 February 2016 at 16:56:08 UTC, Robert burner 
Schadek wrote:
 the hex dump is "3C 66 6F 6F 3E C2 80 3C 2F 66 6F 6F 3E"
http://dpaste.dzfl.pl/80888ed31958 like this?
Feb 19 2016
parent reply Robert burner Schadek via Digitalmars-d <digitalmars-d puremagic.com> writes:
On 2016-02-19 11:58, Kagamin via Digitalmars-d wrote:
 On Thursday, 18 February 2016 at 16:56:08 UTC, Robert burner Schadek
 wrote:
 the hex dump is "3C 66 6F 6F 3E C2 80 3C 2F 66 6F 6F 3E"
http://dpaste.dzfl.pl/80888ed31958 like this?
No, the program just takes the hex dump as a string. You would need 
to do something like:

import std.conv : to;

ubyte[] arr = [0x3C, 0x66, 0x6F, 0x6F, 0x3E, 0xC2, 0x80,
               0x3C, 0x2F, 0x66, 0x6F, 0x6F, 0x3E];
string s = cast(string)arr;
dstring ds = to!dstring(s);

and see what happens
Feb 19 2016
parent reply Kagamin <spam here.lot> writes:
On Friday, 19 February 2016 at 12:30:06 UTC, Robert burner 
Schadek wrote:
 ubyte[] arr = [0x3C, 0x66, 0x6F, 0x6F, 0x3E, 0xC2, 0x80,
                0x3C, 0x2F, 0x66, 0x6F, 0x6F, 0x3E];
 string s = cast(string)arr;
 dstring ds = to!dstring(s);

 and see what happens
http://dpaste.dzfl.pl/2f8a8ff10bde like this?
Feb 19 2016
parent Robert burner Schadek <rburners gmail.com> writes:
On Friday, 19 February 2016 at 12:55:52 UTC, Kagamin wrote:
 http://dpaste.dzfl.pl/2f8a8ff10bde like this?
yes
Feb 19 2016
prev sibling parent Adam D. Ruppe <destructionator gmail.com> writes:
On Thursday, 18 February 2016 at 16:54:10 UTC, Robert burner 
Schadek wrote:
 It does not, it has no prolog and therefore no EncodingInfo.
In that case, it needs to be valid UTF-8 or valid UTF-16 and it is a fatal error if there's any invalid bytes: https://www.w3.org/TR/REC-xml/#charencoding == It is a fatal error if an XML entity is determined (via default, encoding declaration, or higher-level protocol) to be in a certain encoding but contains byte sequences that are not legal in that encoding. Specifically, it is a fatal error if an entity encoded in UTF-8 contains any ill-formed code unit sequences, as defined in section 3.9 of Unicode [Unicode]. Unless an encoding is determined by a higher-level protocol, it is also a fatal error if an XML entity contains no encoding declaration and its content is not legal UTF-8 or UTF-16. ==
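A quick sketch (Python, for illustration) of the distinction the spec draws here: the properly encoded pair 0xC2 0x80 decodes fine, while a bare 0x80 byte is an ill-formed code unit sequence and must be treated as a fatal error:

```python
# Well-formed: C2 80 is the valid UTF-8 encoding of U+0080.
ok  = bytes([0x3C, 0x66, 0x6F, 0x6F, 0x3E, 0xC2, 0x80,
             0x3C, 0x2F, 0x66, 0x6F, 0x6F, 0x3E])
# Ill-formed: a lone 0x80 continuation byte with no lead byte.
bad = bytes([0x3C, 0x66, 0x6F, 0x6F, 0x3E, 0x80,
             0x3C, 0x2F, 0x66, 0x6F, 0x6F, 0x3E])

ok.decode("utf-8")          # succeeds
try:
    bad.decode("utf-8")     # ill-formed sequence -> decode error
except UnicodeDecodeError:
    print("fatal error per the spec")
```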
Feb 18 2016
prev sibling next sibling parent reply crimaniak <crimaniak gmail.com> writes:
On Sunday, 3 May 2015 at 17:39:48 UTC, Robert burner Schadek 
wrote:

 Please post you feature requests...
- the ability to read documents with missing or incorrectly 
specified encoding
- additional feature: relaxed mode for reading html and broken 
XML documents

Some time ago I worked at Accusoft on document viewing/converting 
software. The main lesson I took away: every theoretically 
possible kind of error in a document turns up for real once the 
application is popular.
Feb 20 2016
parent reply Adam D. Ruppe <destructionator gmail.com> writes:
On Saturday, 20 February 2016 at 19:08:25 UTC, crimaniak wrote:
 - the ability to read documents with missing or incorrectly 
 specified encoding
 - additional feature: relaxed mode for reading html and broken 
 XML documents
fyi, my dom.d can do those, I use it for web scraping where there's all kinds of hideous stuff out there. https://github.com/adamdruppe/arsd/blob/master/dom.d
Feb 20 2016
parent reply crimaniak <crimaniak gmail.com> writes:
On Saturday, 20 February 2016 at 19:16:47 UTC, Adam D. Ruppe 
wrote:
 On Saturday, 20 February 2016 at 19:08:25 UTC, crimaniak wrote:
 - the ability to read documents with missing or incorrectly 
 specified encoding
 - additional feature: relaxed mode for reading html and broken 
 XML documents
fyi, my dom.d can do those, I use it for web scraping where there's all kinds of hideous stuff out there. https://github.com/adamdruppe/arsd/blob/master/dom.d
It works, thanks! I will use it in my experiments, but the getElementsBySelector() selector language needs to be improved, I think.
Feb 21 2016
parent reply Adam D. Ruppe <destructionator gmail.com> writes:
On Sunday, 21 February 2016 at 23:01:22 UTC, crimaniak wrote:
 I will use it in my experiments, but the getElementsBySelector() 
 selector language needs to be improved, I think.
What, specifically, do you have in mind?
Feb 21 2016
parent reply crimaniak <crimaniak gmail.com> writes:
On Sunday, 21 February 2016 at 23:57:40 UTC, Adam D. Ruppe wrote:
 On Sunday, 21 February 2016 at 23:01:22 UTC, crimaniak wrote:
 I will use it in my experiments, but the getElementsBySelector() 
 selector language needs to be improved, I think.
What, specifically, do you have in mind?
There are only a couple of ad-hoc checks for attribute values. This language is not XPath-compatible, so the easiest way to cover a lot of cases is a regex check for attributes. Something like "script[src/https:.+\\.googleapis\\.com/i]"
Feb 25 2016
parent Adam D. Ruppe <destructionator gmail.com> writes:
On Thursday, 25 February 2016 at 23:59:04 UTC, crimaniak wrote:
 There are only a couple of ad-hoc checks for attribute values. 
 This language is not XPath-compatible, so the easiest way to 
 cover a lot of cases is a regex check for attributes. Something 
 like "script[src/https:.+\\.googleapis\\.com/i]"
The css3 selector standard offers three substring searches: [attr^=foo] if it begins with foo, [attr$=foo] if it ends with foo, and [attr*=foo] if it includes foo somewhere. dom.d supports all three now. So for your regex, you could probably match [src*=googleapis.com] well enough.
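The three operators boil down to plain substring predicates; a sketch (Python, with a hypothetical helper name, not dom.d's actual implementation):

```python
# Model of the CSS3 attribute substring operators:
# ^= (prefix), $= (suffix), *= (contains).
def attr_matches(op: str, value: str, needle: str) -> bool:
    if op == "^=":
        return value.startswith(needle)
    if op == "$=":
        return value.endswith(needle)
    if op == "*=":
        return needle in value
    raise ValueError("unknown operator: " + op)

src = "https://ajax.googleapis.com/ajax/libs/jquery.js"
print(attr_matches("*=", src, "googleapis.com"))  # True
print(attr_matches("^=", src, "https:"))          # True
```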
Feb 25 2016
prev sibling next sibling parent reply Dejan Lekic <dejan.lekic gmail.com> writes:
If you really want to be serious about the XML package, then I 
humbly believe implementing the commonly-known DOM interfaces is 
a must. Luckily there is IDL available for it: 
https://www.w3.org/TR/DOM-Level-2-Core/idl/dom.idl . Also, 
speaking about DOM, all levels need to be supported!

Also, I would recommend borrowing Tango's XML pull parser, as 
it is blazingly fast.

Finally, integration with a signal/slot module should perhaps 
be considered as well.
Feb 24 2016
next sibling parent reply Alex Vincent <ajvincent gmail.com> writes:
On Wednesday, 24 February 2016 at 10:55:01 UTC, Dejan Lekic wrote:
 If you really want to be serious about the XML package, then I 
 humbly believe implementing the commonly-known DOM interfaces 
 is a must. Luckily there is IDL available for it: 
 https://www.w3.org/TR/DOM-Level-2-Core/idl/dom.idl . Also, 
 speaking about DOM, all levels need to be supported!
I agree, but the Document Object Model (DOM) is a huuuuuuuuge 
project.  It's a project I'd love to take an active hand in 
driving.  Also, DOM "level 4" is a living standard at whatwg.org, 
along with rules for parsing HTML.  (Which naturally means the 
rules are always changing.)  I have a partial implementation of 
DOM in JavaScript, so I am serious when I say it's going to take 
time.

Ideally (imho), we'd have a set of related packages, prefixed 
with std.web:

* html
* xml
* dom
* css
* javascript

(Yes, I'm suggesting a rename of std.xml2 to std.web.xml.)  But 
from what I can see, realistically the community is a long way 
from that.  I'm trying to write the SAX interfaces now.  I only 
have a limited amount of time to devote to this (a common 
complaint, I gather)...
Mar 01 2016
parent Adam D. Ruppe <destructionator gmail.com> writes:
On Wednesday, 2 March 2016 at 02:50:22 UTC, Alex Vincent wrote:
 I agree, but the Document Object Model (DOM) is a huuuuuuuuge 
 project.  It's a project I'd love to take an active hand in 
 driving.
My dom.d implements a fair chunk of it already. https://github.com/adamdruppe/arsd/blob/master/dom.d Yes, indeed, it is quite a lot of code, but easy to use if you are familiar with javascript and css selectors. http://dpldocs.info/experimental-docs/arsd.dom.html
Mar 02 2016
prev sibling parent reply =?UTF-8?Q?Tobias=20M=C3=BCller?= <troplin bluewin.ch> writes:
Dejan Lekic <dejan.lekic gmail.com> wrote:
 If you really want to be serious about the XML package, then I 
 humbly believe implementing the commonly-known DOM interfaces is 
 a must. Luckily there is IDL available for it: 
 https://www.w3.org/TR/DOM-Level-2-Core/idl/dom.idl . Also, 
 speaking about DOM, all levels need to be supported!
 
 Also, I would recommend borrowing the Tango's XML pull parser as 
 it is blazingly fast.
 
 Finally, perhaps integration with signal/slot module should 
 perhaps be considered as well.
 
What's the use case of DOM outside of browser interoperability/scripting? The API isn't particularly nice, especially in languages with a rich type system.
Mar 01 2016
parent reply Adam D. Ruppe <destructionator gmail.com> writes:
On Wednesday, 2 March 2016 at 06:59:49 UTC, Tobias Müller wrote:
 What's the use case of DOM outside of browser 
 interoperability/scripting? The API isn't particularly nice, 
 especially in languages with a rich type system.
I find my extended dom to be very nice, especially thanks to D's type system. I use it for a lot of things: using web apis, html scraping, config file stuff, working on my own documents, and even as my web template system. Basically, dom.d made xml cool to me.
Mar 02 2016
parent =?UTF-8?Q?Tobias=20M=C3=BCller?= <troplin bluewin.ch> writes:
Adam D. Ruppe <destructionator gmail.com> wrote:
 On Wednesday, 2 March 2016 at 06:59:49 UTC, Tobias Müller wrote:
 What's the usecase of DOM outside of browser 
 interoperability/scripting? The API isn't particularly nice, 
 especially in languages with a rich type system.
I find my extended dom to be very nice, especially thanks to D's type system. I use it for a lot of things: using web apis, html scraping, config file stuff, working on my own documents, and even as my web template system. Basically, dom.d made xml cool to me.
Sure, some kind of DOM is certainly useful. But the standard XML-DOM isn't particularly nice. What's the point of a linked list style interface when you have ranges in the language?
Mar 03 2016
prev sibling parent reply Craig Dillabaugh <craig.dillabaugh gmail.com> writes:
On Sunday, 3 May 2015 at 17:39:48 UTC, Robert burner Schadek 
wrote:
 std.xml has been considered not up to specs nearly 3 years now. 
 Time to build a successor. I currently plan the following 
 featues for it:

 - SAX and DOM parser
 - in-situ / slicing parsing when possible (forward range?)
 - compile time switch (CTS) for lazy attribute parsing
 - CTS for encoding (ubyte(ASCII), char(utf8), ... )
 - CTS for input validating
 - performance

 Not much code yet, I'm currently building the performance test 
 suite https://github.com/burner/std.xml2

 Please post you feature requests, and please keep the posts DRY 
 and on topic.
Robert, we have had some student interest in GSOC for XML. Would you be interested in mentoring a student to work with you on this. Craig
Mar 05 2016
parent reply Robert burner Schadek <rburners gmail.com> writes:
On Saturday, 5 March 2016 at 15:20:12 UTC, Craig Dillabaugh wrote:
 Robert, we have had some student interest in GSOC for XML.  
 Would you be interested in mentoring a student to work with you 
 on this.

 Craig
Of course
Mar 06 2016
next sibling parent Lodovico Giaretta <lodovico giaretart.net> writes:
On Sunday, 6 March 2016 at 11:46:00 UTC, Robert burner Schadek 
wrote:
 On Saturday, 5 March 2016 at 15:20:12 UTC, Craig Dillabaugh 
 wrote:
 Robert, we have had some student interest in GSOC for XML.  
 Would you be interested in mentoring a student to work with 
 you on this.

 Craig
Of course
Hi, I don't know if this is the right spot to join the conversation; I'm a student and I'd really love to work on std.xml for GSoC! I'm just waiting for March 14 to apply.
Mar 07 2016
prev sibling next sibling parent Craig Dillabaugh <craig.dillabaugh gmail.com> writes:
On Sunday, 6 March 2016 at 11:46:00 UTC, Robert burner Schadek 
wrote:
 On Saturday, 5 March 2016 at 15:20:12 UTC, Craig Dillabaugh 
 wrote:
 Robert, we have had some student interest in GSOC for XML.  
 Would you be interested in mentoring a student to work with 
 you on this.

 Craig
Of course
Great. Can you please get in touch by email so I can add you to the mentors list: craig dot dillabaugh at gmail dot com Cheers
Mar 07 2016
prev sibling parent Alex Vincent <ajvincent gmail.com> writes:
For everyone's information, I've posted a pull request to Mr. 
Schadek's github repository, with a proposed Simple API for XML 
(SAX) stub.  I'd really appreciate reviews of the stub's 
interfaces.

https://github.com/burner/std.xml2/pull/5
Mar 12 2016