digitalmars.D - Performance of std.json
- David Soria Parra (28/28) Jun 01 2014 Hi,
- Joshua Niehus (4/9) Jun 01 2014 std.json is underpowered and in need of an overhaul. In the mean
- Jonathan M Davis via Digitalmars-d (20/48) Jun 01 2014 It's my understanding that the current design of std.json is considered
- w0rp (65/95) Jun 02 2014 I implemented a JSON library myself which parses JSON and
- w0rp (5/5) Jun 02 2014 It's worth noting that "pretty printing" could be configured entirely
- Jacob Carlborg (5/10) Jun 02 2014 I think there should be quite a minimal API, then a proper serialization...
- Sean Kelly (20/20) Jun 02 2014 The vibe.d parser is better, but it still creates a DOM-style
- Jacob Carlborg (4/24) Jun 02 2014 Yes, exactly.
- Jacob Carlborg (4/7) Jun 02 2014 That would be awesome. Is it written in D or was it C++?
- Sean Kelly (9/14) Jun 03 2014 It's written in C, and so would need an overhaul regardless. The
- Johannes Pfau (7/10) Jun 03 2014 I'd probably prefer a tokenizer/lexer as the lowest layer, then SAX and
- Jacob Carlborg (6/11) Jun 03 2014 If I recall correctly it will allocate strings instead of slicing the
- Jonathan M Davis via Digitalmars-d (13/23) Jun 03 2014 Agreed, though it might make sense to have something even lower level th...
- Sönke Ludwig (5/25) Jun 03 2014 For some time now, the vibe.d parser has been able to serialize
- Chris Williams (24/29) Jun 02 2014 In general, I've been pretty happy with vibe.d, and I've heard
- David Soria Parra (7/30) Jun 02 2014 I think the main question is, given that std.json is close to being
- Chris Williams (18/24) Jun 02 2014 std.json really only has two methods, parseJSON and toJSON. Any
- Masahiro Nakagawa (27/56) Jun 03 2014 I don't know the status of other D-based JSON libraries.
- Masahiro Nakagawa (14/43) Jun 03 2014 BTW, my acquaintance points out that your Haskell code behaves
Hi,

I have recently had to deal with large amounts of JSON data in D. While doing that I've found that std.json is remarkably slow in comparison to other languages' standard JSON implementations. I've created a small and simple benchmark parsing a local copy of a github API call "https://api.github.com/repos/D-Programming-Language/dmd/pulls", parsing it 100 times and writing the title to stdout. My results are as follows:

./d-test > /dev/null  3.54s user 0.02s system 99% cpu 3.560 total
./hs-test > /dev/null  0.02s user 0.00s system 93% cpu 0.023 total
python test.py > /dev/null  0.77s user 0.02s system 99% cpu 0.792 total

The concrete implementations (sorry for my terrible haskell implementation) can be found here: https://github.com/dsp/D-Json-Tests/

This is comparing D's std.json vs Haskell's Data.Aeson and the Python standard library json. I am a bit concerned with the current state of our JSON parser given that a lot of applications these days use JSON. I personally consider a high speed implementation of JSON a critical part of a standard library. Would it make sense to start thinking about using ujson4c as an external library, or maybe come up with a better implementation? I know Orvid has something and might add some analysis as to why std.json is slow. Any ideas or pointers as to how to start with that?
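A minimal sketch of the D side of that benchmark (the file name and the 100-iteration loop follow the description above; the actual code in the repo may differ):

import std.file : readText;
import std.json : parseJSON;
import std.stdio : writeln;

void main()
{
    // Read once, then parse the same document 100 times and print
    // each pull request title, as described above.
    auto text = readText("test.json");
    foreach (i; 0 .. 100)
    {
        auto root = parseJSON(text);
        foreach (elem; root.array)
            writeln(elem.object["title"].str);
    }
}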
Jun 01 2014
On Monday, 2 June 2014 at 00:18:19 UTC, David Soria Parra wrote:

> Would it make sense to start thinking about using ujson4c as an external library, or maybe come up with a better implementation? I know Orvid has something and might add some analysis as to why std.json is slow. Any ideas or pointers as to how to start with that?

std.json is underpowered and in need of an overhaul. In the meantime, have you tried vibe.d's json? http://vibed.org/api/vibe.data.json/
Jun 01 2014
On Mon, 02 Jun 2014 00:18:18 +0000
David Soria Parra via Digitalmars-d <digitalmars-d@puremagic.com> wrote:

> Hi,
>
> I have recently had to deal with large amounts of JSON data in D. While doing that I've found that std.json is remarkably slow in comparison to other languages' standard JSON implementations. I've created a small and simple benchmark parsing a local copy of a github API call "https://api.github.com/repos/D-Programming-Language/dmd/pulls", parsing it 100 times and writing the title to stdout. My results are as follows:
>
> ./d-test > /dev/null  3.54s user 0.02s system 99% cpu 3.560 total
> ./hs-test > /dev/null  0.02s user 0.00s system 93% cpu 0.023 total
> python test.py > /dev/null  0.77s user 0.02s system 99% cpu 0.792 total
>
> The concrete implementations (sorry for my terrible haskell implementation) can be found here: https://github.com/dsp/D-Json-Tests/
>
> This is comparing D's std.json vs Haskell's Data.Aeson and the Python standard library json. I am a bit concerned with the current state of our JSON parser given that a lot of applications these days use JSON. I personally consider a high speed implementation of JSON a critical part of a standard library. Would it make sense to start thinking about using ujson4c as an external library, or maybe come up with a better implementation? I know Orvid has something and might add some analysis as to why std.json is slow. Any ideas or pointers as to how to start with that?

It's my understanding that the current design of std.json is considered to be poor, but I haven't used it, so I don't know any of the details. But if it's as slow as you're finding it to be, then I think that supports the idea that it needs a redesign. The question then is what a new std.json should look like and who would do it. And that pretty much comes down to an interested and motivated developer coming up with and implementing a new design and then proposing it here. Until someone takes up that torch, we'll be stuck with what we have.

Certainly, there's no fundamental reason why we can't have a lightning-fast std.json. With ranges and slices, parsing in D in general should be faster than C/C++ (and definitely faster than Haskell or Python), and if it isn't, that indicates that the implementation (if not the whole design) of that code needs to be redone.

I know that vibe.d uses its own json implementation, but I don't know how much of that is part of its public API and how much of that is simply used internally: http://vibed.org

- Jonathan M Davis
Jun 01 2014
On Monday, 2 June 2014 at 00:39:48 UTC, Jonathan M Davis via Digitalmars-d wrote:

> It's my understanding that the current design of std.json is considered to be poor, but I haven't used it, so I don't know any of the details. But if it's as slow as you're finding it to be, then I think that supports the idea that it needs a redesign. The question then is what a new std.json should look like and who would do it. And that pretty much comes down to an interested and motivated developer coming up with and implementing a new design and then proposing it here. Until someone takes up that torch, we'll be stuck with what we have.
>
> Certainly, there's no fundamental reason why we can't have a lightning-fast std.json. With ranges and slices, parsing in D in general should be faster than C/C++ (and definitely faster than Haskell or Python), and if it isn't, that indicates that the implementation (if not the whole design) of that code needs to be redone.
>
> I know that vibe.d uses its own json implementation, but I don't know how much of that is part of its public API and how much of that is simply used internally: http://vibed.org

I implemented a JSON library myself which parses JSON and generates JSON objects, similar to what std.json does now. I wrote it largely because of the poor API in the standard library at the time, but I think by this point nearly all of those concerns have been alleviated. At the time I benchmarked it against std.json and vibe.d's implementation, and they were all pretty equivalent in terms of performance. I settled for edging just slightly ahead of std.json. If there are any major performance gains to be made, I believe we will have to completely rethink how we go about parsing JSON. I suspect transparent character encoding and decoding (dchar ranges) might be one potential source of trouble.

In terms of API, I wouldn't go completely for an approach based on serialising to structs. Having a tagged union type is still helpful for situations where you just want to quickly get at some JSON data and do something with it. I have thought a great deal about writing data *to* JSON strings, however, and I have an idea for this I would like to share.

First, you define by convention that there is a function writeJSON which takes some value and an OutputRange, and writes the value in a JSON representation directly to the OutputRange. You define in the library writeJSON functions for the standard types:

void writeJSON(OutputRange)(JSONValue value, OutputRange outRange);
void writeJSON(OutputRange)(string value, OutputRange outRange);
void writeJSON(OutputRange)(int value, OutputRange outRange);
void writeJSON(OutputRange)(bool value, OutputRange outRange);
void writeJSON(OutputRange)(typeof(null) value, OutputRange outRange);
// ...

You define one additional writeJSON function which takes any InputRange of type T and writes an array of Ts. (So string[] will write an array of strings, int[] will write ints, etc.)

void writeJSON(InputRange, OutputRange)(InputRange inRange, OutputRange outRange)
{
    outRange.put('[');
    bool first = true;
    foreach (ref value; inRange)
    {
        if (!first)
            outRange.put(',');
        first = false;
        writeJSON(value, outRange);
    }
    outRange.put(']');
}

Add a convenience function which takes varargs alternating string, T, string, U, ... Call it, say, writeJSONObject. You now have a decent framework for writing objects directly to OutputRanges.

struct Foo
{
    AnotherType bar;
    string stringValue;
    int intValue;
}

void writeJSON(OutputRange)(Foo foo, OutputRange outRange)
{
    // Writes {"bar":<bar_value>, ... }
    writeJSONObject(outRange,
        // writeJSONObject calls writeJSON for AnotherType, etc.
        "bar", foo.bar,
        "stringValue", foo.stringValue,
        "intValue", foo.intValue
    );
}

There are more details, and something would need to be done for handling stack overflows (inlining?), but there's the idea I had for improving writing JSON, at least. One advantage of this approach is that it wouldn't be dependent on the GC, and scoped buffers could be used. (A @nogc candidate, I think.) You can't get this ability out of something like toJSON, which produces a string all at once.
Jun 02 2014
It's worth noting that "pretty printing" could be configured entirely in an OutputRange which watches for certain syntax coming into the range and inserts whitespace where it believes it to be appropriate, so writeJSON functions would not need to know anything about pretty printing.
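A rough sketch of that idea: a wrapper range watches the raw characters and inserts the whitespace itself, so writeJSON stays oblivious. This toy version would also fire on braces and commas inside string literals; a real one would track whether it is inside a string:

struct PrettyPrinter(OutputRange)
{
    OutputRange sink;
    int depth;

    void put(char c)
    {
        switch (c)
        {
            case '{':
            case '[':
                sink.put(c);
                ++depth;
                newline();
                break;
            case '}':
            case ']':
                --depth;
                newline();
                sink.put(c);
                break;
            case ',':
                sink.put(c);
                newline();
                break;
            default:
                sink.put(c);
                break;
        }
    }

    private void newline()
    {
        sink.put('\n');
        foreach (i; 0 .. depth * 4)
            sink.put(' ');  // four spaces per nesting level
    }
}

Wrapping an Appender!string in such a printer would make every writeJSON call produce indented output for free.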
Jun 02 2014
On 02/06/14 13:36, w0rp wrote:

> In terms of API, I wouldn't go completely for an approach based on serialising to structs. Having a tagged union type is still helpful for situations where you just want to quickly get at some JSON data and do something with it. I have thought a great deal about writing data *to* JSON strings, however, and I have an idea for this I would like to share.

I think there should be quite a minimal API, then a proper serialization module can be built on top of that.

--
/Jacob Carlborg
Jun 02 2014
The vibe.d parser is better, but it still creates a DOM-style tree of objects, which isn't acceptable in some circumstances. I posted a performance comparison of the JSON parser I created for work use with std.json a while back, and mine is almost 100x faster than std.json in a simple test and allocates zero memory to boot:

http://forum.dlang.org/thread/cyzcirslzcgnyxbyzycc@forum.dlang.org#post-gxgeizjsurulklzftfqz:40forum.dlang.org

I haven't tried it vs. the vibe.d parser, but I suspect it will still beat it by an order of magnitude or more because of the not allocating thing.

I've said this a bunch of times, but what I want to see is a SAX-style parser as the bottom layer with an optional DOM-style parser built on top of it. Then people who want the tree generated can get it, and people who want performance or don't want allocations can get that too. I'm starting to wonder if I should just try and get permission from work to open source my parser so I can submit it. Parsing JSON really isn't terribly difficult though. It shouldn't take more than a few days for one of the more parser-oriented people here to produce something comparable.
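To make that layering concrete, the SAX-style bottom layer might look something like this (a sketch with invented names, not the actual work parser's API):

// Hypothetical SAX layer: the parser walks the input and fires
// callbacks; it never builds a tree and never allocates nodes.
struct JSONEvents
{
    void delegate(const(char)[] name) onKey;
    void delegate(const(char)[] value) onString;
    void delegate(double value) onNumber;
    void delegate(bool value) onBool;
    void delegate() onNull;
    void delegate() onObjectStart, onObjectEnd;
    void delegate() onArrayStart, onArrayEnd;
}

void saxParse(const(char)[] input, ref JSONEvents events);

// The optional DOM layer is then just one particular set of callbacks
// that builds JSONValue nodes; performance-sensitive callers install
// their own callbacks and allocate nothing.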
Jun 02 2014
On 02/06/14 21:13, Sean Kelly wrote:

> The vibe.d parser is better, but it still creates a DOM-style tree of objects, which isn't acceptable in some circumstances. I posted a performance comparison of the JSON parser I created for work use with std.json a while back, and mine is almost 100x faster than std.json in a simple test and allocates zero memory to boot:
>
> http://forum.dlang.org/thread/cyzcirslzcgnyxbyzycc@forum.dlang.org#post-gxgeizjsurulklzftfqz:40forum.dlang.org
>
> I haven't tried it vs. the vibe.d parser, but I suspect it will still beat it by an order of magnitude or more because of the not allocating thing.
>
> I've said this a bunch of times, but what I want to see is a SAX-style parser as the bottom layer with an optional DOM-style parser built on top of it. Then people who want the tree generated can get it, and people who want performance or don't want allocations can get that too. I'm starting to wonder if I should just try and get permission from work to open source my parser so I can submit it. Parsing JSON really isn't terribly difficult though. It shouldn't take more than a few days for one of the more parser-oriented people here to produce something comparable.

Yes, exactly.

--
/Jacob Carlborg
Jun 02 2014
On 02/06/14 21:13, Sean Kelly wrote:

> I'm starting to wonder if I should just try and get permission from work to open source my parser so I can submit it.

That would be awesome. Is it written in D or was it C++?

--
/Jacob Carlborg
Jun 02 2014
On Tuesday, 3 June 2014 at 06:39:04 UTC, Jacob Carlborg wrote:

> On 02/06/14 21:13, Sean Kelly wrote:
>
>> I'm starting to wonder if I should just try and get permission from work to open source my parser so I can submit it.
>
> That would be awesome. Is it written in D or was it C++?

It's written in C, and so would need an overhaul regardless. The user basically assigns a bunch of function pointers for the callbacks. Using the parser at this level is really kind of difficult, because you have to create a state machine for parsing anything reasonably complex, so what I usually do is nest calls to foreachObjectField and foreachArrayElem. I'm wondering if we can't do something similar here, but with corresponding ForwardRanges instead of the opApply style.
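The nested-call pattern described might look roughly like this in use (all names and signatures here are guesses at a D-ified wrapper, not the real C API):

import std.stdio : writeln;

struct Parser;  // opaque handle to the C parser state (hypothetical)

// Hypothetical wrappers over the C callbacks: each iterates one level
// of the current value and invokes the delegate per field/element.
void foreachObjectField(ref Parser p, void delegate(const(char)[] name) dg);
void foreachArrayElem(ref Parser p, void delegate() dg);
const(char)[] currentString(ref Parser p);  // assumed string accessor

// Pulling "title" out of an array of objects by nesting the calls:
void printTitles(ref Parser p)
{
    p.foreachArrayElem({
        p.foreachObjectField((const(char)[] name) {
            if (name == "title")
                writeln(p.currentString());
        });
    });
}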
Jun 03 2014
Am Mon, 02 Jun 2014 19:13:07 +0000
schrieb "Sean Kelly" <sean@invisibleduck.org>:

> I've said this a bunch of times, but what I want to see is a SAX-style parser as the bottom layer with an optional DOM-style parser built on top of it.

I'd probably prefer a tokenizer/lexer as the lowest layer, then SAX and DOM implemented using the tokenizer. This way we can provide a kind of input range. I actually used Brian Schott's std.lexer proposal to build a simple JSON tokenizer/lexer and it worked quite well. But I don't think std.lexer is zero-allocation yet, so that's an important drawback.
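A sketch of what that lowest tokenizer layer could look like (token kinds and names invented for illustration):

enum JSONTokenKind
{
    objectStart, objectEnd, arrayStart, arrayEnd,
    colon, comma, str, number, true_, false_, null_
}

struct JSONToken
{
    JSONTokenKind kind;
    const(char)[] text;  // slice of the input; no allocation
}

// An input range of tokens; SAX and DOM layers are built on top of it.
struct JSONLexer
{
    const(char)[] input;
    JSONToken front;
    bool empty;

    void popFront()
    {
        // Scan the next token, setting `front` to a slice of `input`
        // and `empty` when the input is exhausted. (Body omitted.)
    }
}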
Jun 03 2014
On 03/06/14 09:15, Johannes Pfau wrote:

> I'd probably prefer a tokenizer/lexer as the lowest layer, then SAX and DOM implemented using the tokenizer. This way we can provide a kind of input range. I actually used Brian Schott's std.lexer proposal to build a simple JSON tokenizer/lexer and it worked quite well. But I don't think std.lexer is zero-allocation yet, so that's an important drawback.

If I recall correctly it will allocate strings instead of slicing the input. The strings are then reused. If the input is sliced, the whole input is retained in memory even if not all of the input is used.

--
/Jacob Carlborg
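The trade-off in miniature (a sketch, with `token` and `input` as in the lexer sketched above):

// Slicing: zero-copy, but every token keeps the whole input buffer alive.
token.text = input[start .. end];

// Copying: each token owns a small string and the input can be released,
// at the cost of an allocation per token (which std.lexer apparently
// amortizes by reusing the strings).
token.text = input[start .. end].idup;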
Jun 03 2014
On Mon, 02 Jun 2014 19:13:07 +0000
Sean Kelly via Digitalmars-d <digitalmars-d@puremagic.com> wrote:

> I've said this a bunch of times, but what I want to see is a SAX-style parser as the bottom layer with an optional DOM-style parser built on top of it. Then people who want the tree generated can get it, and people who want performance or don't want allocations can get that too. I'm starting to wonder if I should just try and get permission from work to open source my parser so I can submit it. Parsing JSON really isn't terribly difficult though. It shouldn't take more than a few days for one of the more parser-oriented people here to produce something comparable.

Agreed, though it might make sense to have something even lower level than a SAX parser. Certainly, for XML, I'd implement something that just gave you a range of the attributes without any consideration for what you might do with them, whereas it's my understanding that a SAX parser uses callbacks which are triggered when it finds what you're looking for. A SAX parser and a DOM parser could then be built on top of the simple range API. I'd be looking to do something similar with JSON were I implementing a JSON parser, though since JSON is a bit different from XML in structure, I'm not quite sure what the lowest level API which would still be useful would be. I'd have to think about it. But in principle, I agree with what you're suggesting.

- Jonathan M Davis
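Consuming such a lowest-level pull range might look like this (reusing the invented JSONLexer/JSONTokenKind names from the sketch above):

void countStrings(const(char)[] jsonText)
{
    auto tokens = JSONLexer(jsonText);
    size_t strings;

    // The caller drives the iteration; no callbacks and no tree.
    foreach (tok; tokens)
    {
        switch (tok.kind)
        {
            case JSONTokenKind.str:
                ++strings;
                break;
            default:
                break;
        }
    }
}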
Jun 03 2014
Am 02.06.2014 21:13, schrieb Sean Kelly:

> The vibe.d parser is better, but it still creates a DOM-style tree of objects, which isn't acceptable in some circumstances. I posted a performance comparison of the JSON parser I created for work use with std.json a while back, and mine is almost 100x faster than std.json in a simple test and allocates zero memory to boot:
>
> http://forum.dlang.org/thread/cyzcirslzcgnyxbyzycc@forum.dlang.org#post-gxgeizjsurulklzftfqz:40forum.dlang.org
>
> I haven't tried it vs. the vibe.d parser, but I suspect it will still beat it by an order of magnitude or more because of the not allocating thing.
>
> I've said this a bunch of times, but what I want to see is a SAX-style parser as the bottom layer with an optional DOM-style parser built on top of it. Then people who want the tree generated can get it, and people who want performance or don't want allocations can get that too. I'm starting to wonder if I should just try and get permission from work to open source my parser so I can submit it. Parsing JSON really isn't terribly difficult though. It shouldn't take more than a few days for one of the more parser-oriented people here to produce something comparable.

For some time now, the vibe.d parser has been able to serialize directly from and to string form, circumventing the DOM step and without unnecessary allocations. But I agree that an intermediate SAX layer would be nice to have. Maybe even an additional StAX layer.
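For illustration, that direct path in vibe.d looks roughly like this (function names from vibe.data.json as I remember them; worth double-checking against the current docs):

import vibe.data.json;

struct Pull
{
    string title;
}

void roundTrip(string jsonText)
{
    // String -> structs and back, with no intermediate Json DOM.
    auto pulls = deserializeJson!(Pull[])(jsonText);
    string back = serializeToJsonString(pulls);
}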
Jun 03 2014
On Monday, 2 June 2014 at 00:39:48 UTC, Jonathan M Davis via Digitalmars-d wrote:

> I know that vibe.d uses its own json implementation, but I don't know how much of that is part of its public API and how much of that is simply used internally: http://vibed.org

In general, I've been pretty happy with vibe.d, and I've heard that the parser speed of the JSON implementation is good. But I must admit that I found the API to be fairly obtuse. In order to do much of anything, you really need to serialize/deserialize from structs. The JSON objects themselves are pretty impossible to modify.

I haven't looked at how vibe's parser works, but any very-fast parser would probably need to support an input stream, so that it can build out data in parallel with I/O, and do a lot of manual memory management. E.g. you probably want a stack of reusable node buffers that you add elements to as you scan the JSON tree, then clone off purpose-sized nodes from the work buffers when you encounter the end of the definition. Whereas the current implementation in std.json only accepts a complete string, and for each node starts with no memory and has to allocate/reallocate for every fresh piece of information.

Having worked with JSON libraries quite a bit, the key to a good one is the ability to refer to paths through the data. So besides the JSON objects themselves, you need something like a "struct JPath" that represents an array of strings and size_ts, which you can pass into get, set, has, and count methods. I'd view the lack of that as the larger issue with the current JSON implementations.
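A sketch of that JPath idea (hypothetical; using std.variant's Algebraic for the mixed string/index keys, with get/set/has/count left as the methods described above):

import std.variant : Algebraic;

alias JKey = Algebraic!(string, size_t);  // object key or array index

struct JPath
{
    JKey[] keys;
}

// Hypothetical usage against some JSON document type:
//
//   auto title = JPath([JKey("pulls"), JKey(cast(size_t) 0), JKey("title")]);
//   if (doc.has(title))
//       writeln(doc.get(title).str);
//   writeln(doc.count(JPath([JKey("pulls")])));  // number of elements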
Jun 02 2014
On Monday, 2 June 2014 at 19:05:15 UTC, Chris Williams wrote:

> In general, I've been pretty happy with vibe.d, and I've heard that the parser speed of the JSON implementation is good. But I must admit that I found the API to be fairly obtuse. In order to do much of anything, you really need to serialize/deserialize from structs. The JSON objects themselves are pretty impossible to modify.
>
> I haven't looked at how vibe's parser works, but any very-fast parser would probably need to support an input stream, so that it can build out data in parallel with I/O, and do a lot of manual memory management. E.g. you probably want a stack of reusable node buffers that you add elements to as you scan the JSON tree, then clone off purpose-sized nodes from the work buffers when you encounter the end of the definition. Whereas the current implementation in std.json only accepts a complete string, and for each node starts with no memory and has to allocate/reallocate for every fresh piece of information.
>
> Having worked with JSON libraries quite a bit, the key to a good one is the ability to refer to paths through the data. So besides the JSON objects themselves, you need something like a "struct JPath" that represents an array of strings and size_ts, which you can pass into get, set, has, and count methods. I'd view the lack of that as the larger issue with the current JSON implementations.

I think the main question is, given that std.json is close to being unusable for anything serious due to its poor performance, whether we can come up with something faster that has the same API. I am not sure what Phobos' take on backwards compatibility is, but I'd rather keep the API than break it for whoever is using std.json.
Jun 02 2014
On Monday, 2 June 2014 at 20:10:52 UTC, David Soria Parra wrote:

> I think the main question is, given that std.json is close to being unusable for anything serious due to its poor performance, whether we can come up with something faster that has the same API. I am not sure what Phobos' take on backwards compatibility is, but I'd rather keep the API than break it for whoever is using std.json.

std.json really only has two methods, parseJSON and toJSON. Any implementation is going to have those two methods, so in terms of not breaking anything, you're pretty safe there. Since it doesn't have any methods except those two, it really comes down to the underlying data structure. Right now, you have to read the source and understand the structure in order to operate on it, which is a hassle, but is presumably what people are doing. So maintaining the current structure would be the key necessity. I think that limits the optimizations which could be performed, but doesn't make them impossible.

Adding a stream-based parsing method would probably be the main optimization. That adds to the API, but is backwards compatible. The module has a lot of inner methods and recursion. Reducing the number of function calls, using manual stack management instead of recursion, etc. might give another significant gain. How parseJSON() works is irrelevant to the caller, so all of that can be optimized to the heart's content.
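The manual-stack idea in miniature (a sketch reusing the invented token types from the lexer example earlier in the thread; a real parser would push value slots rather than just nesting kinds):

enum Ctx { object, array }

void trackNesting(JSONLexer tokens)
{
    Ctx[] stack;

    foreach (tok; tokens)
    {
        switch (tok.kind)
        {
            case JSONTokenKind.objectStart: stack ~= Ctx.object; break;
            case JSONTokenKind.arrayStart:  stack ~= Ctx.array;  break;
            case JSONTokenKind.objectEnd:
            case JSONTokenKind.arrayEnd:    stack.length -= 1;   break;
            default: break;  // scalars attach to the value at stack[$ - 1]
        }
    }
}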
Jun 02 2014
On Monday, 2 June 2014 at 00:18:19 UTC, David Soria Parra wrote:

> Hi,
>
> I have recently had to deal with large amounts of JSON data in D. While doing that I've found that std.json is remarkably slow in comparison to other languages' standard JSON implementations. I've created a small and simple benchmark parsing a local copy of a github API call "https://api.github.com/repos/D-Programming-Language/dmd/pulls", parsing it 100 times and writing the title to stdout. My results are as follows:
>
> ./d-test > /dev/null  3.54s user 0.02s system 99% cpu 3.560 total
> ./hs-test > /dev/null  0.02s user 0.00s system 93% cpu 0.023 total
> python test.py > /dev/null  0.77s user 0.02s system 99% cpu 0.792 total
>
> The concrete implementations (sorry for my terrible haskell implementation) can be found here: https://github.com/dsp/D-Json-Tests/
>
> This is comparing D's std.json vs Haskell's Data.Aeson and the Python standard library json. I am a bit concerned with the current state of our JSON parser given that a lot of applications these days use JSON. I personally consider a high speed implementation of JSON a critical part of a standard library. Would it make sense to start thinking about using ujson4c as an external library, or maybe come up with a better implementation? I know Orvid has something and might add some analysis as to why std.json is slow. Any ideas or pointers as to how to start with that?

I don't know the status of other D-based JSON libraries. If you can install the yajl library, then yajl-d is another candidate.

% time ./yajl_test > /dev/null
./yajl_test > /dev/null  0.42s user 0.01s system 99% cpu 0.434 total
% time python test.py > /dev/null
python test.py > /dev/null  0.65s user 0.02s system 99% cpu 0.671 total
% time ./test > /dev/null
./test > /dev/null  3.10s user 0.02s system 99% cpu 3.125 total

import yajl.yajl, std.datetime, std.file, std.stdio;

void parse()
{
    foreach (elem; readText("test.json").decode.array)
    {
        writeln(elem.object["title"]);
    }
}

int main(string[] args)
{
    for (uint i = 0; i < 100; i++)
    {
        parse();
    }
    return 0;
}

http://code.dlang.org/packages/yajl

NOTE: yajl-d doesn't expose yajl's SAX-style API, unlike Sean's implementation.
Jun 03 2014
On Monday, 2 June 2014 at 00:18:19 UTC, David Soria Parra wrote:

> Hi,
>
> I have recently had to deal with large amounts of JSON data in D. While doing that I've found that std.json is remarkably slow in comparison to other languages' standard JSON implementations. I've created a small and simple benchmark parsing a local copy of a github API call "https://api.github.com/repos/D-Programming-Language/dmd/pulls", parsing it 100 times and writing the title to stdout. My results are as follows:
>
> ./d-test > /dev/null  3.54s user 0.02s system 99% cpu 3.560 total
> ./hs-test > /dev/null  0.02s user 0.00s system 93% cpu 0.023 total
> python test.py > /dev/null  0.77s user 0.02s system 99% cpu 0.792 total
>
> The concrete implementations (sorry for my terrible haskell implementation) can be found here: https://github.com/dsp/D-Json-Tests/
>
> This is comparing D's std.json vs Haskell's Data.Aeson and the Python standard library json. I am a bit concerned with the current state of our JSON parser given that a lot of applications these days use JSON. I personally consider a high speed implementation of JSON a critical part of a standard library. Would it make sense to start thinking about using ujson4c as an external library, or maybe come up with a better implementation? I know Orvid has something and might add some analysis as to why std.json is slow. Any ideas or pointers as to how to start with that?

BTW, my acquaintance points out that your Haskell code behaves differently from the other samples: it parses the JSON array only once, which is why it is so fast. He uploaded code with the same behaviour as the others, which parses the JSON array on each loop iteration. Please check it:

https://gist.github.com/maoe/e5f72c3cf3687610fe5c

On my environment the results are:

% time ./new_test > /dev/null
./new_test > /dev/null  1.13s user 0.02s system 99% cpu 1.144 total
% time ./test > /dev/null
./test > /dev/null  0.02s user 0.00s system 91% cpu 0.023 total
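In D terms, the difference between the two benchmarks is this (printTitles standing in for the output loop):

import std.json : JSONValue, parseJSON;

void printTitles(JSONValue root) { /* the output loop */ }

// What the original Haskell version effectively measured: one parse.
void parsedOnce(string text)
{
    auto root = parseJSON(text);
    foreach (i; 0 .. 100)
        printTitles(root);
}

// What the D and Python versions measured: a parse per iteration.
void parsedEachTime(string text)
{
    foreach (i; 0 .. 100)
        printTitles(parseJSON(text));
}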
Jun 03 2014