digitalmars.D - std.csv Performance Review
- Jesse Phillips (50/50) Jun 02 2017 Author here:
- Johan Engelen (7/8) Jun 03 2017 Quick note:
- bachmeier (5/11) Jun 03 2017 Do you know what happened with fastcsv [0], original thread [1].
- Jesse Phillips (6/10) Jun 03 2017 I do not. Rereading that in light of this new article I'm a
- H. S. Teoh via Digitalmars-d (49/58) Jun 03 2017 [...]
- Patrick Schluter (11/19) Jun 03 2017 If the file is in the file cache of the kernel, memory mapping
- Patrick Schluter (9/29) Jun 04 2017 To be precise, it's the copying of data that is spared by mmap.
- Jesse Phillips (15/34) Jun 04 2017 Ok, I took you up on that, I'm still skeptical:
- Seb (4/20) Jun 04 2017 In case you have time, it would be very interesting to compare it
- H. S. Teoh via Digitalmars-d (60/76) Jun 05 2017 Thank you for testing it yourself. I also tried to run the tests again
- Anonymouse (3/9) Jun 04 2017 Immediate idea is to have the cake and eat it too with
- Jon Degenhardt (10/20) Jun 05 2017 I'm the author of the TSV tools. I'd be happy to provide insights
Author here:

The discussion[1] and articles[2] around "Faster Command Line Tools" had me trying out std.csv for the task. Now I know std.csv isn't fast and it allocates. When I wrote my CSV parser, I'd also left around a parser which used slicing instead of allocation[3]. I compared these two:

LDC -O3 -release
std.csv:     over 10 seconds
csv slicing: under 5 seconds

Over 50% improvement isn't bad, but this still wasn't competing with the other implementations. Now I didn't profile std.csv's implementation, but I did take a look at the one with slicing. The majority of the time was spent in std.algorithm.startsWith, which is being called by countUntil. The calls made to empty() also add up from the use in countUntil and startsWith. These functions are by no means slow; startsWith averaged 1 millisecond execution time while countUntil was up to 5 milliseconds. The thing is, startsWith was called a whopping 384,806,160 times. Keep in mind that the file itself has 10,512,769 rows of data with four columns.

Now I've talked about std.csv's performance in the past, probably with the author of the fast command line tools. Essentially it came down to this: std.csv is restricted to parsing with only the input range API, and you can't correctly parse CSV without allocation. But now I'm working outside those restrictions, and so I offer an additional point.

Both of these parsers do something none of the other implementations do: they validate that the CSV is well formed. If they find that the file no longer conforms to the correct CSV layout, they make a choice: either throw an exception, or guess and continue on (based on what the user requested). While the Nim implementation does handle escaped quotes (and newlines, unlike fast csv), its parsing assumes the file is well formed, and std.csv was quite prompt to point out that this file is indeed not well formed. Even though the issue can be ignored, the overhead of parsing to identify issues still remains. I haven't attempted to write the algorithm assuming properly structured data, so I don't know what the performance would look like, but I suspect the difference isn't negligible. There is also likely some overhead for providing the tokens through range interfaces.

On another note, this slicing version of the CSV parser can and should be included in std.csv as a specialization. But it is by no means ready: the feature requirements need to be spelled out better (hasSlicing!Range fails for strings, but strings are the primary use case for the optimization), and escaped quotes remain in the returned data (like I said, proper parsing requires allocation).

1. http://forum.dlang.org/post/chvukhbscgamxecvpwlw@forum.dlang.org
2. https://www.euantorano.co.uk/posts/faster-command-line-tools-in-nim/
3. https://github.com/JesseKPhillips/JPDLibs/tree/csvoptimize
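To make the hot path concrete: below is a minimal sketch (not the code from either parser) of scanning an unquoted field by direct indexing, which avoids the per-element empty/front/popFront calls that countUntil and startsWith go through on an input range. nextField is a made-up name; quoting and validation are deliberately left out.

import std.stdio : writeln;

// Hypothetical helper: return the next unquoted field as a slice of the
// input and advance `line` past the separator. No per-element range
// calls, no allocation; quoted fields and validation are omitted.
const(char)[] nextField(ref const(char)[] line, char sep = ',')
{
    size_t i = 0;
    while (i < line.length && line[i] != sep)
        ++i;
    auto field = line[0 .. i];
    line = (i < line.length) ? line[i + 1 .. $] : line[$ .. $];
    return field;
}

void main()
{
    const(char)[] row = "2017,Jun,02,std.csv";
    while (row.length)               // (a trailing empty field would
        writeln(nextField(row));     //  need extra handling)
}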
Jun 02 2017
On Saturday, 3 June 2017 at 04:25:27 UTC, Jesse Phillips wrote:
> I compared these two: LDC -O3 -release

Quick note: Keep in mind that LDC does not do cross-module inlining (non-template functions) by default yet. It's good to check whether you see big differences with `-enable-cross-module-inlining`.

-Johan
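A related option, as far as I know: LDC's pragma(inline, true) marks a specific hot helper for inlining even at call sites in other modules, without turning on the global switch. A sketch (isSep is a made-up example):

// With LDC, pragma(inline, true) asks the compiler to inline this helper
// across module boundaries as well.
pragma(inline, true)
bool isSep(char c) pure nothrow @nogc @safe
{
    return c == ',' || c == '\n' || c == '\r';
}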
Jun 03 2017
On Saturday, 3 June 2017 at 04:25:27 UTC, Jesse Phillips wrote:
> Author here: The discussion[1] and articles[2] around "Faster Command Line Tools" had me trying out std.csv for the task. Now I know std.csv isn't fast and it allocates. When I wrote my CSV parser, I'd also left around a parser which used slicing instead of allocation[3].

Do you know what happened with fastcsv [0]? Original thread: [1].

[0] https://github.com/quickfur/fastcsv
[1] http://forum.dlang.org/post/mailman.3952.1453600915.22025.digitalmars-d-learn@puremagic.com
Jun 03 2017
On Saturday, 3 June 2017 at 23:18:26 UTC, bachmeier wrote:
> Do you know what happened with fastcsv [0]? Original thread: [1].
>
> [0] https://github.com/quickfur/fastcsv
> [1] http://forum.dlang.org/post/mailman.3952.1453600915.22025.digitalmars-d-learn@puremagic.com

I do not. Rereading that in light of this new article, I'm a little sceptical of the "51 times faster", since I'm seeing only 10x against these other implementations.

In other news, I'm not sure the validation std.csv is doing is worth it.
Jun 03 2017
On Sun, Jun 04, 2017 at 05:41:10AM +0000, Jesse Phillips via Digitalmars-d wrote:
> On Saturday, 3 June 2017 at 23:18:26 UTC, bachmeier wrote:
> > Do you know what happened with fastcsv [0]? Original thread: [1].
> >
> > [0] https://github.com/quickfur/fastcsv
> > [1] http://forum.dlang.org/post/mailman.3952.1453600915.22025.digitalmars-d-learn@puremagic.com
>
> I do not. Rereading that in light of this new article, I'm a little sceptical of the "51 times faster", since I'm seeing only 10x against these other implementations.
[...]

You don't have to be skeptical, and neither do you have to believe what I claimed. I posted the entire code I used in the original thread, as well as the URLs of the exact data files I used for testing. You can just run it yourself and see the results for yourself.

And yes, fastcsv has its limitations (the file has to fit in memory, no validation is done, etc.), which are also documented up-front in the README file. I wrote the code targeting a specific use case mentioned by the OP of the original thread, so I do not expect or claim you will see the same kind of performance for other use cases. If you want validation, then it's a given that you won't get maximum performance, simply because there's just more work to do. For data sets that don't fit into memory, I already have some ideas about how to extend my algorithm to work with them, so some of the performance may still be retained. But obviously it's not going to be as fast as if you can just read the entire file into memory first.

(Note that this is much less of a limitation than it seems; for example, you could use std.mmfile to memory-map the file into your address space so that it doesn't actually have to fit into memory, and you can still take slices of it. The OS will manage the paging from/to disk for you. Of course, it will be slower when something has to be paged from disk, but IME this is often much faster than if you read the data into memory yourself. Again, you don't have to believe me: the fastcsv code is there, just import std.mmfile, mmap the largest CSV file you can find, call fastcsv on it, and measure the performance yourself. If your code performs better, great, tell us all about it. I'm interested to learn how you did it.)

Note that besides slicing, another major part of the performance boost in fastcsv is in minimizing GC allocations. If you allocate a string for each field in a row, it will be much slower than if you either slice the original string, or allocate a large buffer for holding the data and just take slices for each field. Furthermore, if you allocate a new array per row to hold the list of fields, it will be much slower than if you allocate a large array for holding all the fields of all the rows, and merely slice this array to get your rows. Of course, you cannot know ahead of time exactly how many rows there will be, so the next best thing is to allocate a series of large arrays, each capable of holding the field slices of k rows, for sufficiently large k. Once the current array runs out of space, copy the (partial) slices of the last row to the beginning of a new large array, and continue from there. This way, you will be making n/k allocations, where n is the number of rows and k is the number of rows that fit into each buffer, as opposed to n allocations. For large values of k, this greatly reduces the GC load and significantly speeds things up. Again, don't take my word for it: run a profiler on the fastcsv code and see for yourself.

T

--
People say I'm indecisive, but I'm not sure about that. -- YHL, CONLANG
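A minimal sketch of the two techniques just described, assuming a data.csv with four columns; the parsing loop is elided, and fastcsv's actual internals may differ:

import std.mmfile : MmFile;

void main()
{
    // Memory-map the file instead of reading it in: the OS pages it in
    // on demand, and the mapping can be sliced like an ordinary array.
    auto mmf = new MmFile("data.csv");
    auto data = cast(const(char)[]) mmf[];

    // Chunked slice buffers: room for the fields of k rows at a time,
    // giving roughly n/k GC allocations instead of one or more per row.
    enum k = 1024 * 1024;            // rows per chunk (assumed)
    enum cols = 4;                   // columns in this data set
    auto chunk = new const(char)[][](k * cols);
    size_t used = 0;

    // The parsing loop would store field slices via chunk[used++] = ...;
    // when a chunk fills mid-row, copy that row's partial slices to the
    // front of a freshly allocated chunk and continue from there.
}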
Jun 03 2017
On Sunday, 4 June 2017 at 06:15:24 UTC, H. S. Teoh wrote:
> (Note that this is much less of a limitation than it seems; for example, you could use std.mmfile to memory-map the file into your address space so that it doesn't actually have to fit into memory, and you can still take slices of it. The OS will manage the paging from/to disk for you. Of course, it will be slower when something has to be paged from disk, but IME this is often much faster than if you read the data into memory yourself.)

If the file is in the file cache of the kernel, memory mapping does not need to reload the file, as it is already in memory. In fact, calling mmap() generally changes only the sharing of the pages. That's where most of the performance win from memory mapping comes from. This stackoverflow discussion [1] links to a realworldtech discussion where Linus Torvalds explains it in detail. On Windows and Solaris the mechanism is the same.

[1] https://stackoverflow.com/questions/5902629/mmap-msync-and-linux-process-termination/6219962#6219962
Jun 03 2017
On Sunday, 4 June 2017 at 06:54:46 UTC, Patrick Schluter wrote:
> On Sunday, 4 June 2017 at 06:15:24 UTC, H. S. Teoh wrote:
> > (Note that this is much less of a limitation than it seems; for example, you could use std.mmfile to memory-map the file into your address space so that it doesn't actually have to fit into memory, and you can still take slices of it. [...])
>
> If the file is in the file cache of the kernel, memory mapping does not need to reload the file, as it is already in memory. In fact, calling mmap() generally changes only the sharing of the pages. That's where most of the performance win from memory mapping comes from.

To be precise, it's the copying of data that is spared by mmap. If the file is in the file cache, the open/read/write/close syscalls will also be fed from the memory-mapped cache entry, but this requires that the data be copied from kernel memory space to the process's buffer space, so each call to read() has to do this copying. The gain from mmap therefore comes from avoiding the copying of memory and avoiding the read/write/seek syscalls. The loading of the physical file into memory is the same in both cases.
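A rough harness for seeing that difference yourself (the file name is assumed, and absolute numbers will vary with machine and cache state):

import std.datetime.stopwatch : StopWatch;
import std.mmfile : MmFile;
import std.stdio : File, writeln;

void main()
{
    ulong sum;
    auto sw = StopWatch();

    // read() path: each 64 KiB block is copied from the kernel's page
    // cache into our buffer before we can touch it.
    sw.start();
    foreach (ubyte[] block; File("data.csv", "rb").byChunk(64 * 1024))
        foreach (b; block) sum += b;
    sw.stop();
    writeln("read: ", sw.peek);

    // mmap path: we touch the cached pages in place, with no copy.
    sw.reset();
    sw.start();
    auto mmf = new MmFile("data.csv");
    foreach (b; cast(const(ubyte)[]) mmf[])
        sum += b;
    sw.stop();
    writeln("mmap: ", sw.peek);

    writeln(sum); // keep the work from being optimized away
}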
Jun 04 2017
On Sunday, 4 June 2017 at 06:15:24 UTC, H. S. Teoh wrote:
> You don't have to be skeptical, and neither do you have to believe what I claimed. I posted the entire code I used in the original thread, as well as the URLs of the exact data files I used for testing. You can just run it yourself and see the results for yourself.

Ok, I took you up on that, and I'm still skeptical:

LDC2 -O3 -release -enable-cross-module-inlining
std.csv:         12487 msecs
fastcsv (no gc):  1376 msecs
csvslicing:       3039 msecs

That looks like about 10 times faster to me. Using the slicing version failed because of \r\n line endings (I guess multi-part separators are broken), so I changed the data file so I could get the execution time.

Anyway, I'm not trying to claim fastcsv isn't good at what it does. All I'm trying to point out is that std.csv is doing more work than these faster CSV parsers. And I don't even want to claim that std.csv is better because of that work; it actually appears that it was a mistake to do validation.
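For reference, the data-file edit could also be done in code; a one-off normalization like this sketch (which allocates a copy, so it eats into the slicing win) side-steps the multi-part separator problem:

import std.array : replace;

// Rewrite Windows line endings so a parser that only understands a
// single-character row separator can handle the buffer.
const(char)[] normalizeNewlines(const(char)[] data)
{
    return data.replace("\r\n", "\n");
}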
Jun 04 2017
On Sunday, 4 June 2017 at 15:59:03 UTC, Jesse Phillips wrote:
> [...]
> LDC2 -O3 -release -enable-cross-module-inlining
> std.csv:         12487 msecs
> fastcsv (no gc):  1376 msecs
> csvslicing:       3039 msecs
> [...]

In case you have time, it would be very interesting to compare it with other state-of-the-art tools like paratext: http://www.wise.io/tech/paratext
Jun 04 2017
On Sun, Jun 04, 2017 at 03:59:03PM +0000, Jesse Phillips via Digitalmars-d wrote:
[...]
> Ok, I took you up on that, and I'm still skeptical:
>
> LDC2 -O3 -release -enable-cross-module-inlining
> std.csv:         12487 msecs
> fastcsv (no gc):  1376 msecs
> csvslicing:       3039 msecs
>
> That looks like about 10 times faster to me. Using the slicing version failed because of \r\n line endings (I guess multi-part separators are broken), so I changed the data file so I could get the execution time.

Thank you for testing it yourself. I also tried to run the tests again on my machine, and I can't seem to reproduce the 102136 msecs reading again. It seems that different compilers give somewhat different readings, and we are also using different compile flags. In any case, in the spirit of full disclosure, here's my test with the 3 compilers, which I ran just now to be sure I'm not just copying old bad measurements:

$ dmd -O -inline benchmark.d fastcsv.d
$ ./benchmark stdstruct
std.csv read 2126883 records
std.csv (struct): 33119 msecs
$ ./benchmark faststruct2
fastcsv read 2126883 records
fastcsv (struct with const(char)[]): 2596 msecs

$ gdc -O3 -finline benchmark.d fastcsv.d -o benchmark
$ ./benchmark stdstruct
std.csv read 2126883 records
std.csv (struct): 23103 msecs
$ ./benchmark faststruct2
fastcsv read 2126883 records
fastcsv (struct with const(char)[]): 1909 msecs

$ ldc2 -O3 benchmark.d fastcsv.d
$ ./benchmark stdstruct
std.csv read 2126883 records
std.csv (struct): 20776 msecs
$ ./benchmark faststruct2
fastcsv read 2126883 records
fastcsv (struct with const(char)[]): 1813 msecs

So it looks like your 10x figure is more or less on target. I've no idea where the original 102136 msecs reading came from. Perhaps it was an unfortunate coincidence with a heavy load on my machine, or something like that. Or maybe just a bungled copy-n-paste.

> Anyway, I'm not trying to claim fastcsv isn't good at what it does. All I'm trying to point out is that std.csv is doing more work than these faster CSV parsers. And I don't even want to claim that std.csv is better because of that work; it actually appears that it was a mistake to do validation.

I never intended for fastcsv to become a point of contention or some kind of competition with std.csv, and I apologize if I ever came across that way. I fully understand that std.csv does more work than fastcsv; certainly, being able to assume an in-memory input and free slicing gives a big advantage over being restricted to just input range primitives. I had hoped to actually work fastcsv into a suitable form to merge into std.csv -- to dispel wrong perceptions of D being "slow", you see -- but it turned out to be more work than I had time for, so I didn't get very far beyond the initial promising results.

My original hope was that the fastcsv code would be taken as a source of ideas that we could adopt for speeding up std.csv, rather than be taken in the way of "haha, I wrote faster code than std.csv, so std.csv sux". The latter was not my intention at all.

Anyway, I'm glad that you're looking into using slicing in std.csv. We need Phobos modules to be highly performant so that newcomers don't get the wrong impression about the language being slow. I'd also recommend investigating reducing GC load, as I described in my previous post, as another angle for improving the performance of std.csv.

As for whether to validate or not: if you were to ask me, I'd leave it in, with a caveat in the docs that it would be less performant. As the standard library, Phobos should give the user options, including the option to validate input files that could potentially be malformed. But where the user knows the input is always well-formed, we can (and should) take advantage of that to achieve better performance.

T

--
Why waste time reinventing the wheel, when you could be reinventing the engine? -- Damian Conway
Jun 05 2017
On Saturday, 3 June 2017 at 04:25:27 UTC, Jesse Phillips wrote:
> Even though the issue can be ignored, the overhead of parsing to identify issues still remains. I haven't attempted to write the algorithm assuming properly structured data, so I don't know what the performance would look like, but I suspect the difference isn't negligible. There is also likely some overhead for providing the tokens through range interfaces.

The immediate idea is to have your cake and eat it too with parseXML!(Yes.validate)(...), but that's more work.
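Sketched with std.typecons.Flag, that could look like the following; parseCsv is a made-up name, not std.csv's API, and the body is a placeholder:

import std.typecons : Flag, Yes, No;

// Hypothetical reader: validation is a compile-time choice, so the
// checking branches vanish entirely from the No.validate instantiation.
auto parseCsv(Flag!"validate" validate = Yes.validate)(const(char)[] data)
{
    static if (validate)
    {
        // full well-formedness checks would go here
    }
    // ... the fast slicing parse runs either way ...
    return data; // placeholder result for the sketch
}

void main()
{
    auto trusted = parseCsv!(No.validate)("a,b,c"); // skips the checks
    auto checked = parseCsv("a,b,c");               // validates
}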
Jun 04 2017
On Saturday, 3 June 2017 at 04:25:27 UTC, Jesse Phillips wrote:
> Author here: The discussion[1] and articles[2] around "Faster Command Line Tools" had me trying out std.csv for the task.
> [snip]
> Keep in mind that the file itself has 10,512,769 rows of data with four columns.
>
> Now I've talked about std.csv's performance in the past, probably with the author of the fast command line tools.
> [snip]

I'm the author of the TSV tools. I'd be happy to provide insights I've gained from this exercise to help improve std.csv. I did examine std.csv when I first started writing the TSV tools, but decided it wasn't appropriate for what I was trying to do. I don't remember the specifics at this point, though. CSV imposes many requirements on an implementation that TSV does not, which makes it challenging to produce an implementation optimal for both.

--Jon
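As a concrete illustration of that last point (not tsv-utils code): a TSV row splits with no parser state at all, because fields may not contain tabs or newlines, whereas CSV must track quote state before it can split:

import std.algorithm.iteration : splitter;
import std.stdio : writeln;

void main()
{
    // TSV: a field can never contain the separator, so splitting the
    // line is the whole parse.
    auto line = "2017\tJun\t05\ttsv-utils";
    foreach (field; line.splitter('\t'))
        writeln(field);

    // CSV has no such guarantee: `"a,""b""",c` is two fields, so a
    // parser must handle quoted fields, doubled quotes, and separators
    // or newlines embedded inside quotes before it can split on commas.
}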
Jun 05 2017