digitalmars.D.learn - Help optimizing UnCompress for gzipped files
- Christian Köstlin (18/18) Jan 02 2018 Hi all,
- Stefan Koch (6/25) Jan 02 2018 Yes indeed. You can make it much faster by using a sliced static
- Adam D. Ruppe (13/17) Jan 02 2018 Only if you want data corruption! It keeps a copy of your pointer
- Steven Schveighoffer (12/23) Jan 02 2018 iopipe handles this quite well. And deals with the buffers properly
- Christian Köstlin (14/43) Jan 02 2018 Thanks Steve for this proposal (actually I already had an iopipe version
- Steven Schveighoffer (19/61) Jan 02 2018 Well, you don't need to use appender for that (and doing so is copying a...
- Steven Schveighoffer (5/8) Jan 02 2018 Would you mind telling me the source of the data? When I do get around
- Christian Köstlin (9/19) Jan 02 2018 Hi Steve,
- Christian Köstlin (12/29) Jan 02 2018 Thanks for this input, I updated the program to make use of this method
- Steven Schveighoffer (4/18) Jan 03 2018 Hm.. the numbers are worse! I would have expected to be at least
- Steven Schveighoffer (18/36) Jan 03 2018 Alright. I have spent some time examining the issues, and here are my
- Steven Schveighoffer (9/14) Jan 03 2018 Looking at when C realloc actually moves the data, it appears it all of
- Christian Köstlin (11/23) Jan 03 2018 That's a very interesting finding! If I find some time today, I will also
- Christian Köstlin (23/23) Jan 04 2018 I added now a C variant that always does malloc/memcpy/free. It's much
- Steven Schveighoffer (18/41) Jan 04 2018 Another thing I did was print out the size of the buffer before the
- Steven Schveighoffer (14/16) Jan 04 2018 v0.0.3 has been released. To take advantage of using malloc/realloc, you...
- Christian Köstlin (7/30) Jan 04 2018 Thanks Steve,
- Steven Schveighoffer (6/8) Jan 04 2018 free(mypipe.window.ptr);
- Steven Schveighoffer (6/8) Jan 04 2018 Still a bit irked that I can't match the C speed :/
- Christian Köstlin (9/18) Jan 04 2018 I guess the code in my program is small enough that the optimization flags
- Steven Schveighoffer (24/40) Jan 05 2018 Yeah, I guess most of the bottlenecks are inside libz, or the memory
- Christian Köstlin (12/37) Jan 05 2018 I added another version that tries to do the "same" as the C version
- Steven Schveighoffer (25/40) Jan 05 2018 Hm... I think really there is some magic initial state of the allocator,...
- Christian Köstlin (11/55) Jan 06 2018 yes ... this is something i forgot to try out ... will do now :)
- Steven Schveighoffer (27/41) Jan 07 2018 Not from what I'm reading, the C solution is about the same (257 vs.
- Christian Köstlin (12/44) Jan 09 2018 Yes, you are right ... for proper benchmarking, proper statistics should
- Adam D. Ruppe (9/12) Jan 02 2018 Yeah, std.zlib is VERY poorly written. You can get much better
- Christian Köstlin (8/20) Jan 02 2018 I added a version that uses the gzip low-level APIs (similar to my
Hi all,

over the holidays I played around with processing some gzipped JSON data. The first version was implemented in ruby, but took too long, so I tried dlang. This was already faster, but not really satisfactorily fast. Then I wrote another version in java, which was much faster.

After this I analyzed the first step of the process (gunzipping the data from a file to memory), and found out that dlang's UnCompress is much slower than java, ruby and plain c. There was some discussion on the forum a while ago: http://forum.dlang.org/thread/pihxxhjgnveulcdtadvg@forum.dlang.org

The code I used and the numbers I got are here: https://github.com/gizmomogwai/benchmarks/tree/master/gunzip

I used an i7 MacBook with OS X 10.13.2, ruby 2.5.0 built via rvm, python3 installed by homebrew, the builtin clang compiler, ldc-1.7.0-beta1, and java 1.8.0_152.

Is there anything I can do to speed up the dlang stuff?

Thanks in advance,
Christian
Jan 02 2018
On Tuesday, 2 January 2018 at 10:27:11 UTC, Christian Köstlin wrote:Hi all, over the holidays, I played around with processing some gzipped json data. First version was implemented in ruby, but took too long, so I tried, dlang. This was already faster, but not really satisfactory fast. Then I wrote another version in java, which was much faster. After this I analyzed the first step of the process (gunzipping the data from a file to memory), and found out, that dlangs UnCompress is much slower than java, and ruby and plain c. There was some discussion on the forum a while ago: http://forum.dlang.org/thread/pihxxhjgnveulcdtadvg forum.dlang.org The code I used and the numbers I got are here: https://github.com/gizmomogwai/benchmarks/tree/master/gunzip I used an i7 macbook with os x 10.13.2, ruby 2.5.0 built via rvm, python3 installed by homebrew, builtin clang compiler, ldc-1.7.0-beta1, java 1.8.0_152. Is there anything I can do to speed up the dlang stuff? Thanks in advance, ChristianYes indeed. You can make it much faster by using a sliced static array as buffer. I suspect that most of the slowdown is caused by the gc. As there should be only calls to the gzip library
Jan 02 2018
On Tuesday, 2 January 2018 at 11:22:06 UTC, Stefan Koch wrote:
> You can make it much faster by using a sliced static array as buffer.

Only if you want data corruption! It keeps a copy of your pointer internally:
https://github.com/dlang/phobos/blob/master/std/zlib.d#L605

It also will always overallocate new buffers on each call:
https://github.com/dlang/phobos/blob/master/std/zlib.d#L602

There is no efficient way to use it. The implementation is substandard because the API limits the design. If we really want a fast std.zlib, the API will need to be extended with new functions to fix these. Those new functions will probably look a LOT like the underlying C functions... which is why I say just use them right now.

> I suspect that most of the slowdown is caused by the gc. As there should be only calls to the gzip library

plz measure before spreading FUD about the GC.
Jan 02 2018
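For readers who want to follow Adam's advice and call the zlib C API directly, a minimal sketch could look like the code below. It is not taken from the thread or from Christian's benchmark repo; the chunk size and error handling are arbitrary choices, and the GC append at the end is exactly the copy the rest of the thread discusses.

import etc.c.zlib;
import std.exception : enforce;
import std.file : read;

ubyte[] gunzip(string path)
{
    auto compressed = cast(ubyte[]) read(path);

    z_stream zs;
    enforce(inflateInit2(&zs, 15 + 16) == Z_OK);   // 15 + 16: expect a gzip header
    scope (exit) inflateEnd(&zs);

    zs.next_in = compressed.ptr;
    zs.avail_in = cast(uint) compressed.length;

    ubyte[] result;
    ubyte[64 * 1024] chunk = void;                 // fixed scratch buffer, size is arbitrary
    int status;
    do
    {
        zs.next_out = chunk.ptr;
        zs.avail_out = cast(uint) chunk.length;
        status = inflate(&zs, Z_NO_FLUSH);
        enforce(status == Z_OK || status == Z_STREAM_END, "inflate failed");
        // this append is the copy (and GC growth) the rest of the thread is about
        result ~= chunk[0 .. chunk.length - zs.avail_out];
    } while (status != Z_STREAM_END);
    return result;
}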
On 1/2/18 8:57 AM, Adam D. Ruppe wrote:On Tuesday, 2 January 2018 at 11:22:06 UTC, Stefan Koch wrote:iopipe handles this quite well. And deals with the buffers properly (yes, it is very tricky. You have to ref-count the zstream structure, because it keeps internal pointers to *itself* as well!). And no, iopipe doesn't use std.zlib, I use the etc.zlib functions (but I poached some ideas from std.zlib when writing it). https://github.com/schveiguy/iopipe/blob/master/source/iopipe/zip.d I even wrote a json parser for iopipe. But it's far from complete. And probably needs updating since I changed some of the iopipe API. https://github.com/schveiguy/jsoniopipe Depending on the use case, it might be enough, and should be very fast. -SteveYou can make it much faster by using a sliced static array as buffer.Only if you want data corruption! It keeps a copy of your pointer internally: https://github.com/dlang/phobos/blob/master/std/zlib.d#L605 It also will always overallocate new buffers on each call <https://github.com/dlang/phobos/blob/master/std/zlib.d#L602> There is no efficient way to use it. The implementation is substandard because the API limits the design.
Jan 02 2018
On 02.01.18 15:09, Steven Schveighoffer wrote:On 1/2/18 8:57 AM, Adam D. Ruppe wrote:Thanks Steve for this proposal (actually I already had an iopipe version on my harddisk that I applied to this problem) Its more or less your unzip example + putting the data to an appender (I hope this is how it should be done, to get the data to RAM). iopipe is already better than the normal dlang version, almost like java, but still far from the solution. I updated https://github.com/gizmomogwai/benchmarks/tree/master/gunzip I will give the direct gunzip calls a try ... In terms of json parsing, I had really nice results with the fast.json pull parser, but its comparing a little bit apples with oranges, because I did not pull out all the data there. --- ChristianOn Tuesday, 2 January 2018 at 11:22:06 UTC, Stefan Koch wrote:iopipe handles this quite well. And deals with the buffers properly (yes, it is very tricky. You have to ref-count the zstream structure, because it keeps internal pointers to *itself* as well!). And no, iopipe doesn't use std.zlib, I use the etc.zlib functions (but I poached some ideas from std.zlib when writing it). https://github.com/schveiguy/iopipe/blob/master/source/iopipe/zip.d I even wrote a json parser for iopipe. But it's far from complete. And probably needs updating since I changed some of the iopipe API. https://github.com/schveiguy/jsoniopipe Depending on the use case, it might be enough, and should be very fast. -SteveYou can make it much faster by using a sliced static array as buffer.Only if you want data corruption! It keeps a copy of your pointer internally: https://github.com/dlang/phobos/blob/master/std/zlib.d#L605 It also will always overallocate new buffers on each call <https://github.com/dlang/phobos/blob/master/std/zlib.d#L602> There is no efficient way to use it. The implementation is substandard because the API limits the design.
Jan 02 2018
On 1/2/18 1:01 PM, Christian Köstlin wrote:On 02.01.18 15:09, Steven Schveighoffer wrote:Well, you don't need to use appender for that (and doing so is copying a lot of the data an extra time). All you need is to extend the pipe until there isn't any more new data, and it will all be in the buffer. // almost the same line from your current version auto mypipe = openDev("../out/nist/2011.json.gz") .bufd.unzip(CompressionFormat.gzip); // This line here will work with the current release (0.0.2): while(mypipe.extend(0) != 0) {} //But I have a fix for a bug that hasn't been released yet, this would work if you use iopipe-master: mypipe.ensureElems(); // getting the data is as simple as looking at the buffer. auto data = mypipe.window; // ubyte[] of the dataOn 1/2/18 8:57 AM, Adam D. Ruppe wrote:Thanks Steve for this proposal (actually I already had an iopipe version on my harddisk that I applied to this problem) Its more or less your unzip example + putting the data to an appender (I hope this is how it should be done, to get the data to RAM).On Tuesday, 2 January 2018 at 11:22:06 UTC, Stefan Koch wrote:iopipe handles this quite well. And deals with the buffers properly (yes, it is very tricky. You have to ref-count the zstream structure, because it keeps internal pointers to *itself* as well!). And no, iopipe doesn't use std.zlib, I use the etc.zlib functions (but I poached some ideas from std.zlib when writing it). https://github.com/schveiguy/iopipe/blob/master/source/iopipe/zip.d I even wrote a json parser for iopipe. But it's far from complete. And probably needs updating since I changed some of the iopipe API. https://github.com/schveiguy/jsoniopipe Depending on the use case, it might be enough, and should be very fast.You can make it much faster by using a sliced static array as buffer.Only if you want data corruption! It keeps a copy of your pointer internally: https://github.com/dlang/phobos/blob/master/std/zlib.d#L605 It also will always overallocate new buffers on each call <https://github.com/dlang/phobos/blob/master/std/zlib.d#L602> There is no efficient way to use it. The implementation is substandard because the API limits the design.iopipe is already better than the normal dlang version, almost like java, but still far from the solution. I updated https://github.com/gizmomogwai/benchmarks/tree/master/gunzip I will give the direct gunzip calls a try ... In terms of json parsing, I had really nice results with the fast.json pull parser, but its comparing a little bit apples with oranges, because I did not pull out all the data there.Yeah, with jsoniopipe being very raw, I wouldn't be sure it was usable in your case. The end goal is to have something fast, but very easy to construct. I wasn't planning on focusing on the speed (yet) like other libraries do, but ease of writing code to use it. -Steve
Jan 02 2018
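Steve's snippets above, assembled into one runnable sketch. The import paths are my guess for the iopipe layout of that time; openDev, bufd, unzip, extend and window are the calls quoted in the post.

import iopipe.bufpipe;                  // bufd, extend, window
import iopipe.zip;                      // unzip, CompressionFormat
import iopipe.stream;                   // openDev -- module name is my assumption
import std.stdio : writeln;

void main()
{
    auto mypipe = openDev("../out/nist/2011.json.gz")
        .bufd                           // buffer the raw file bytes
        .unzip(CompressionFormat.gzip); // decompress lazily into a second buffer

    // keep extending until the decompressed stream is exhausted
    // (on iopipe master, Steve's mypipe.ensureElems() does the same)
    while (mypipe.extend(0) != 0) {}

    auto data = mypipe.window;          // ubyte[] holding the whole decompressed file
    writeln(data.length, " bytes decompressed");
}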
On 1/2/18 3:13 PM, Steven Schveighoffer wrote:// almost the same line from your current version auto mypipe = openDev("../out/nist/2011.json.gz") .bufd.unzip(CompressionFormat.gzip);Would you mind telling me the source of the data? When I do get around to it, I want to have a good dataset to test things against, and would be good to use what others reach for. -Steve
Jan 02 2018
On 02.01.18 21:48, Steven Schveighoffer wrote:
> On 1/2/18 3:13 PM, Steven Schveighoffer wrote:
>> // almost the same line from your current version
>> auto mypipe = openDev("../out/nist/2011.json.gz")
>>     .bufd.unzip(CompressionFormat.gzip);
> Would you mind telling me the source of the data? When I do get around to it, I want to have a good dataset to test things against, and would be good to use what others reach for.
> -Steve

Hi Steve,
thanks for looking into this. I use data from nist.gov; the Makefile includes these download instructions:

curl -s https://static.nvd.nist.gov/feeds/json/cve/1.0/nvdcve-1.0-2011.json.gz > out/nist/2011.json.gz

-- Christian Köstlin
Jan 02 2018
On 02.01.18 21:13, Steven Schveighoffer wrote:Well, you don't need to use appender for that (and doing so is copying a lot of the data an extra time). All you need is to extend the pipe until there isn't any more new data, and it will all be in the buffer. // almost the same line from your current version auto mypipe = openDev("../out/nist/2011.json.gz") .bufd.unzip(CompressionFormat.gzip); // This line here will work with the current release (0.0.2): while(mypipe.extend(0) != 0) {}Thanks for this input, I updated the program to make use of this method and compare it to the appender thing as well.I added direct gunzip calls as well... Those are really good, as long as I do not try to get the data into ram :) then it is "bad" again. I wonder what the real difference to the lowlevel solution with own appender and the c version is. For me they look almost the same (ugly, only the performance seems to be nice). Funny thing is, that if I add the clang address sanitizer things to the c program, I get almost the same numbers as for java :)I will give the direct gunzip calls a try ...Yeah, with jsoniopipe being very raw, I wouldn't be sure it was usable in your case. The end goal is to have something fast, but very easy to construct. I wasn't planning on focusing on the speed (yet) like other libraries do, but ease of writing code to use it. -Steve-- Christian Köstlin
Jan 02 2018
On 1/3/18 2:47 AM, Christian Köstlin wrote:On 02.01.18 21:13, Steven Schveighoffer wrote:Hm.. the numbers are worse! I would have expected to be at least comparable. I'll have to look into it. Thanks for posting this. -SteveWell, you don't need to use appender for that (and doing so is copying a lot of the data an extra time). All you need is to extend the pipe until there isn't any more new data, and it will all be in the buffer. // almost the same line from your current version auto mypipe = openDev("../out/nist/2011.json.gz") .bufd.unzip(CompressionFormat.gzip); // This line here will work with the current release (0.0.2): while(mypipe.extend(0) != 0) {}Thanks for this input, I updated the program to make use of this method and compare it to the appender thing as well.
Jan 03 2018
On 1/3/18 9:42 AM, Steven Schveighoffer wrote:On 1/3/18 2:47 AM, Christian Köstlin wrote:Alright. I have spent some time examining the issues, and here are my findings: 1. The major differentiator between the C and D algorithms is the use of C realloc. This one thing saves the most time. I'm going to update iopipe so you can use it (stand by). I will also be examining how to simulate using realloc when not using C malloc in iopipe. I think it's the copying of data to the new buffer that is causing issues. 2. extend(0) always attempts to read 8k more bytes. The buffer extends by 8k by going into another couple pages. Somehow this is choking the algorithm. I think the cost of extending pages actually ends up hurting performance (something we need to look at in the GC in general). 3. extendElems should allow extending all the elements in the most optimal fashion. Fixing that, iopipe performs as well as the Appender/iopipe version. This is coming out as well. Stay tuned, there will be updates to iopipe to hopefully make it as fast in this microbenchmark as the C version :) -SteveOn 02.01.18 21:13, Steven Schveighoffer wrote:Hm.. the numbers are worse! I would have expected to be at least comparable. I'll have to look into it. Thanks for posting this.Well, you don't need to use appender for that (and doing so is copying a lot of the data an extra time). All you need is to extend the pipe until there isn't any more new data, and it will all be in the buffer. // almost the same line from your current version auto mypipe = openDev("../out/nist/2011.json.gz") .bufd.unzip(CompressionFormat.gzip); // This line here will work with the current release (0.0.2): while(mypipe.extend(0) != 0) {}Thanks for this input, I updated the program to make use of this method and compare it to the appender thing as well.
Jan 03 2018
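To illustrate point 1, the realloc advantage boils down to a buffer like the hand-rolled sketch below (the names are made up and have no relation to iopipe internals): when libc can extend the block in place nothing is copied, while a GC array append that outgrows its block always copies.

import core.stdc.stdlib : free, realloc;

struct CBuffer
{
    ubyte* ptr;
    size_t used;                        // bytes filled so far
    size_t capacity;                    // bytes allocated

    @disable this(this);                // raw pointer owner: forbid copies

    // make room for `more` additional bytes, doubling the allocation as needed
    void reserve(size_t more)
    {
        if (used + more <= capacity) return;
        size_t newcap = capacity ? capacity : 64 * 1024;
        while (newcap < used + more) newcap *= 2;
        // realloc may extend the block in place (no copy) or move it (one copy)
        ptr = cast(ubyte*) realloc(ptr, newcap);
        assert(ptr !is null, "out of memory");
        capacity = newcap;
    }

    ubyte[] window() { return ptr[0 .. used]; }

    ~this() { free(ptr); }
}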
On 1/3/18 3:28 PM, Steven Schveighoffer wrote:1. The major differentiator between the C and D algorithms is the use of C realloc. This one thing saves the most time. I'm going to update iopipe so you can use it (stand by). I will also be examining how to simulate using realloc when not using C malloc in iopipe. I think it's the copying of data to the new buffer that is causing issues.Looking at when C realloc actually moves the data, it appears it all of a sudden over-allocates very very large blocks, much larger than the GC will over-allocate. This is why the GC is losing. After a certain size, the GC doesn't allocate blocks to grow into, so we start copying on every realloc. So it's not so much the performance of the copying, but the number of times copying is required that is affecting the ultimate performance. -Steve
Jan 03 2018
On 03.01.18 22:33, Steven Schveighoffer wrote:
> On 1/3/18 3:28 PM, Steven Schveighoffer wrote:
>> 1. The major differentiator between the C and D algorithms is the use of C realloc. This one thing saves the most time. I'm going to update iopipe so you can use it (stand by). I will also be examining how to simulate using realloc when not using C malloc in iopipe. I think it's the copying of data to the new buffer that is causing issues.
> Looking at when C realloc actually moves the data, it appears it all of a sudden over-allocates very very large blocks, much larger than the GC will over-allocate. This is why the GC is losing. After a certain size, the GC doesn't allocate blocks to grow into, so we start copying on every realloc.

That's a very interesting finding! If I find some time today, I will also try a pure malloc/memcpy/free solution in C and also look at how many real reallocs are done.

That's funny, because it means that this whole *2 thing for growing buffers that you learn about is already well taken care of in libc? Or does it still mean that you should allocate in a pattern like this for libc's algorithm to kick in? Is there an API to see how much realloc really allocated, or is it only possible to compare pointers to see whether realloc delivers a new or the old pointer?

-- Christian Köstlin
Jan 03 2018
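There is no portable API for the question above, but on macOS malloc_size() from <malloc/malloc.h> reports the usable size of a block, and glibc offers malloc_usable_size(). A small sketch declaring the macOS function from D; it is platform-specific and purely illustrative.

import core.stdc.stdlib : free, malloc, realloc;
import std.stdio : writefln;

// macOS only; on glibc declare `extern (C) size_t malloc_usable_size(void*);` instead
extern (C) size_t malloc_size(const(void)* ptr);

void main()
{
    auto p = malloc(1000);
    writefln("asked for 1000, block really holds %s bytes", malloc_size(p));
    p = realloc(p, 100_000);
    writefln("asked for 100000, block really holds %s bytes", malloc_size(p));
    free(p);
}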
I added now a C variant that always does malloc/memcpy/free. It's much slower for sure. I also put in some output that shows when a real realloc happens. It's like you said:

did a real realloc
did a real realloc
did a real realloc
did a real realloc
did a real realloc
did a real realloc
did not a real realloc
did not a real realloc
did not a real realloc
did not a real realloc
did not a real realloc
did not a real realloc
did not a real realloc
did not a real realloc
did a real realloc
did not a real realloc

Funny thing is, I do not always get the same sequence of real reallocs.

-- Christian Köstlin
Jan 04 2018
On 1/4/18 4:47 AM, Christian Köstlin wrote:I added now a c variant, that does always malloc/memcpy/free. Its much slower for sure. Also I put some output in thats shows when a real realloc happens. Its like you said: did a real realloc did a real realloc did a real realloc did a real realloc did a real realloc did a real realloc did not a real realloc did not a real realloc did not a real realloc did not a real realloc did not a real realloc did not a real realloc did not a real realloc did not a real realloc did a real realloc did not a real realloc funny thing is, i do not always get the same sequence of real reallocs.Another thing I did was print out the size of the buffer before the realloc. In other words: auto oldsize = buffer.length; auto oldptr = buffer.ptr; buffer = (cast(ubyte *)realloc(buffer.ptr, newsize))[0 .. newsize]; if(buffer.ptr != oldptr) writeln("moved: ", oldsize); Even the sizes malloc chooses for its big leap are not consistent from run to run! I'm wondering if possibly when you keep reallocing a large block, and it doesn't have enough memory, needs more from the system, it just tacks it onto the end of the existing big block. This would make the performance for this microbenchmark highly dependent on not doing any other allocations! BTW, my numbers are consistent, but there is a difference from yours. For example, my C runs are about 430ms, and my fast D runs are about 650-700ms. I never get the 200ms range runs that you are seeing. I have a similar machine, but still running Sierra. -Steve
Jan 04 2018
On 1/3/18 3:28 PM, Steven Schveighoffer wrote:Stay tuned, there will be updates to iopipe to hopefully make it as fast in this microbenchmark as the C version :)v0.0.3 has been released. To take advantage of using malloc/realloc, you can use std.experimental.allocator.mallocator.Mallocator as the BufferManager's allocator. e.g.: auto myPipe = openDev("file.gz").bufd // not here .unzip!Mallocator; // here myPipe.ensureElems(); // this works now. auto bytesRead = myPipe.window.length; The reason you don't need it on the first bufd is because that is the buffer of the file stream, which isn't going to grow. Note: the buffer manager does not free the data on destruction! You are responsible for freeing it (if you use Mallocator). -Steve
Jan 04 2018
On 04.01.18 16:53, Steven Schveighoffer wrote:On 1/3/18 3:28 PM, Steven Schveighoffer wrote:Thanks Steve, this runs now faster, I will update the table. Sorry, but I do not get how to cleanup the mallocated memory :) Could you help me out here? -- Christian KöstlinStay tuned, there will be updates to iopipe to hopefully make it as fast in this microbenchmark as the C version :)v0.0.3 has been released. To take advantage of using malloc/realloc, you can use std.experimental.allocator.mallocator.Mallocator as the BufferManager's allocator. e.g.: auto myPipe = openDev("file.gz").bufd // not here .unzip!Mallocator; // here myPipe.ensureElems(); // this works now. auto bytesRead = myPipe.window.length; The reason you don't need it on the first bufd is because that is the buffer of the file stream, which isn't going to grow. Note: the buffer manager does not free the data on destruction! You are responsible for freeing it (if you use Mallocator).
Jan 04 2018
On 1/4/18 1:57 PM, Christian Köstlin wrote:
> Sorry, but I do not get how to clean up the mallocated memory :) Could you help me out here?

free(mypipe.window.ptr);

At the moment this works because you haven't released any data from the front. But eventually, I will need to figure out how to handle this when it's not so neat.

-Steve
Jan 04 2018
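Putting the two answers together, the Mallocator variant plus cleanup might look like the following sketch (the file name is a placeholder, and the import paths are assumed as in the earlier sketch).

import core.stdc.stdlib : free;
import iopipe.bufpipe;                 // bufd, ensureElems, window
import iopipe.zip;                     // unzip
import iopipe.stream;                  // openDev -- module name is my assumption
import std.experimental.allocator.mallocator : Mallocator;

void main()
{
    auto myPipe = openDev("file.gz")   // placeholder file name
        .bufd                          // file-side buffer: default allocator, it never grows
        .unzip!Mallocator;             // decompression buffer grows via C malloc/realloc
    // valid because nothing has been released from the front of the pipe;
    // window.ptr is evaluated when main exits, i.e. after all reallocs
    scope (exit) free(myPipe.window.ptr);

    myPipe.ensureElems();              // pull the whole decompressed stream into the buffer
    auto bytesRead = myPipe.window.length;
}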
On 1/4/18 1:57 PM, Christian Köstlin wrote:Thanks Steve, this runs now faster, I will update the table.Still a bit irked that I can't match the C speed :/ But, I can't get your C speed to duplicate on my mac even with gcc, so I'm not sure where to start. I find it interesting that you are not using any optimization flags for gcc. -Steve
Jan 04 2018
On 04.01.18 20:46, Steven Schveighoffer wrote:
> On 1/4/18 1:57 PM, Christian Köstlin wrote:
>> Thanks Steve, this runs now faster, I will update the table.
> Still a bit irked that I can't match the C speed :/ But, I can't get your C speed to duplicate on my mac even with gcc, so I'm not sure where to start. I find it interesting that you are not using any optimization flags for gcc.

I guess the code in my program is small enough that the optimization flags do not matter... most of the stuff is pulled from libz, which is dynamically linked against /usr/lib/libz.1.dylib.

I also cannot figure out what more I should do (I will try realloc with Mallocator) for the dlang-low-level variant to get to the C speed. Rust is doing quite well there.

-- Christian Köstlin
Jan 04 2018
On 1/5/18 1:01 AM, Christian Köstlin wrote:On 04.01.18 20:46, Steven Schveighoffer wrote:Yeah, I guess most of the bottlenecks are inside libz, or the memory allocator. There isn't much optimization to be done in the main program itself.On 1/4/18 1:57 PM, Christian Köstlin wrote:I guess, the code in my program is small enough that the optimize flags do not matter... most of the stuff is pulled from libz? Which is dynamically linked against /usr/lib/libz.1.dylib.Thanks Steve, this runs now faster, I will update the table.Still a bit irked that I can't match the C speed :/ But, I can't get your C speed to duplicate on my mac even with gcc, so I'm not sure where to start. I find it interesting that you are not using any optimization flags for gcc.I also cannot understand what I should do more (will try realloc with Mallocator) for the dlang-low-level variant to get to the c speed.D compiles just the same as C. So theoretically you should be able to get the same performance with a ported version of your C code. It's worth a shot.rust is doing quite well thereI'll say a few words of caution here: 1. Almost all of these tests use the same C library to unzip. So it's really not a test of the performance of decompression, but the performance of memory management. And it appears that any test using malloc/realloc is in a different tier. Presumably because of the lack of copies (as discussed earlier). 2. Your rust test (I think, I'm not sure) is testing 2 things in the same run, which could potentially have dramatic consequences for the second test. For instance, it could already have all the required memory blocks ready, and the allocation strategy suddenly gets better. Or maybe there is some kind of caching of the input being done. I think you have a fairer test for the second option by running it in a separate program. I've never used rust, so I don't know what exactly your code is doing. 3. It's hard to make a decision based on such microbenchmarks as to which solution is "better" in an actual real-world program, especially when the state/usage of the memory allocator plays a huge role in this. -Steve
Jan 05 2018
On 05.01.18 15:39, Steven Schveighoffer wrote:
> Yeah, I guess most of the bottlenecks are inside libz, or the memory allocator. There isn't much optimization to be done in the main program itself.
> D compiles just the same as C. So theoretically you should be able to get the same performance with a ported version of your C code. It's worth a shot.

I added another version that tries to do the "same" as the C version using Mallocator, but I am still way off; perhaps it's creating too many ranges on the underlying array. But it's around the same speed as your great iopipe thing. My solution does have the same memory leak, as I am not sure how best to get the memory out of the FastAppender so that it is automagically cleaned up. Perhaps if we get RC things, this gets easier?

I updated https://github.com/gizmomogwai/benchmarks/tree/master/gunzip with the newest numbers on my machine, but I think your iopipe solution is the best one we can get at the moment!

>> rust is doing quite well there
> I'll say a few words of caution here:
> 1. Almost all of these tests use the same C library to unzip. So it's really not a test of the performance of decompression, but the performance of memory management. And it appears that any test using malloc/realloc is in a different tier. Presumably because of the lack of copies (as discussed earlier).
> 2. Your rust test (I think, I'm not sure) is testing 2 things in the same run, which could potentially have dramatic consequences for the second test. For instance, it could already have all the required memory blocks ready, and the allocation strategy suddenly gets better. Or maybe there is some kind of caching of the input being done. I think you have a fairer test for the second option by running it in a separate program. I've never used rust, so I don't know what exactly your code is doing.
> 3. It's hard to make a decision based on such microbenchmarks as to which solution is "better" in an actual real-world program, especially when the state/usage of the memory allocator plays a huge role in this.

Sure ... that's true.
Jan 05 2018
On 1/5/18 3:09 PM, Christian Köstlin wrote:On 05.01.18 15:39, Steven Schveighoffer wrote:Hm... I think really there is some magic initial state of the allocator, and that's what allows it to go so fast. One thing about the D version, because druntime is also using malloc (the GC is backed by malloc'd data after all), the initial state of the heap is quite different from when you start in C. It may be impossible or nearly impossible to duplicate the performance. But the flipside (if this is indeed the case) is that you won't see the same performance in a real-world app anyway, even in C. One thing to try, you preallocate the ENTIRE buffer. This only works if you know how many bytes it will decompress to (not always possible), but it will take the allocator out of the equation completely. And it's probably going to be the most efficient method (you aren't leaving behind smaller unused blocks when you realloc). If for some reason we can't beat/tie the C version doing that, then something else is going on.Yeah, I guess most of the bottlenecks are inside libz, or the memory allocator. There isn't much optimization to be done in the main program itself. D compiles just the same as C. So theoretically you should be able to get the same performance with a ported version of your C code. It's worth a shot.I added another version that tries to do the "same" as the c version using mallocator, but i am still way off, perhaps its creating too many ranges on the underlying array. but its around the same speed as your great iopipe thing.My solution does have the same memory leak, as I am not sure how to best get the memory out of the FastAppender so that it is automagically cleaned up. Perhaps if we get rc things, this gets easier?I've been giving some thought to this. I think iopipe needs some buffer management primitives that allow you to finagle the buffer. I've been needing this for some time anyway (for file seeking). Right now, the buffer itself is buried in the chain, so it's hard to get at the actual buffer. Alternatively, I probably also need to give some thought to a mechanism that auto-frees the memory when it can tell nobody is still using the iopipe. Given that iopipe's signature feature is direct buffer access, this would mean anything that uses such a feature would have to be unsafe. -Steve
Jan 05 2018
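A sketch of what Steve's preallocation experiment could look like with raw etc.c.zlib calls. gunzipPreallocated and its parameters are made-up names, and it assumes the caller knows the decompressed size up front (as in the benchmark, which uses a 200*1024*1024 byte buffer).

import etc.c.zlib;
import std.array : uninitializedArray;
import std.exception : enforce;
import std.file : read;

ubyte[] gunzipPreallocated(string path, size_t decompressedSize)
{
    auto compressed = cast(ubyte[]) read(path);
    auto buffer = uninitializedArray!(ubyte[])(decompressedSize);

    z_stream zs;
    enforce(inflateInit2(&zs, 15 + 16) == Z_OK);   // 15 + 16: expect a gzip header
    scope (exit) inflateEnd(&zs);

    zs.next_in = compressed.ptr;
    zs.avail_in = cast(uint) compressed.length;
    zs.next_out = buffer.ptr;
    zs.avail_out = cast(uint) buffer.length;       // one output window: no growing, no copying

    enforce(inflate(&zs, Z_FINISH) == Z_STREAM_END,
            "buffer too small or corrupt stream");
    return buffer[0 .. cast(size_t) zs.total_out];
}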
On 05.01.18 23:04, Steven Schveighoffer wrote:On 1/5/18 3:09 PM, Christian Köstlin wrote:yes ... this is something i forgot to try out ... will do now :) mhh .. interesting numbers ... c is even faster, my d lowlevel solution is also a little bit faster, but much slower than the no copy version (funnily, no copy is the wrong name, it just overwrites all the data in a small buffer).On 05.01.18 15:39, Steven Schveighoffer wrote:Hm... I think really there is some magic initial state of the allocator, and that's what allows it to go so fast. One thing about the D version, because druntime is also using malloc (the GC is backed by malloc'd data after all), the initial state of the heap is quite different from when you start in C. It may be impossible or nearly impossible to duplicate the performance. But the flipside (if this is indeed the case) is that you won't see the same performance in a real-world app anyway, even in C. One thing to try, you preallocate the ENTIRE buffer. This only works if you know how many bytes it will decompress to (not always possible), but it will take the allocator out of the equation completely. And it's probably going to be the most efficient method (you aren't leaving behind smaller unused blocks when you realloc). If for some reason we can't beat/tie the C version doing that, then something else is going on.Yeah, I guess most of the bottlenecks are inside libz, or the memory allocator. There isn't much optimization to be done in the main program itself. D compiles just the same as C. So theoretically you should be able to get the same performance with a ported version of your C code. It's worth a shot.I added another version that tries to do the "same" as the c version using mallocator, but i am still way off, perhaps its creating too many ranges on the underlying array. but its around the same speed as your great iopipe thing.yes .. thats tricky ... one question about iopipe. is it possible to transform the elements in the pipe as well ... e.g. away from a buffer of bytes to json objects? -- Christian KöstlinMy solution does have the same memory leak, as I am not sure how to best get the memory out of the FastAppender so that it is automagically cleaned up. Perhaps if we get rc things, this gets easier?I've been giving some thought to this. I think iopipe needs some buffer management primitives that allow you to finagle the buffer. I've been needing this for some time anyway (for file seeking). Right now, the buffer itself is buried in the chain, so it's hard to get at the actual buffer. Alternatively, I probably also need to give some thought to a mechanism that auto-frees the memory when it can tell nobody is still using the iopipe. Given that iopipe's signature feature is direct buffer access, this would mean anything that uses such a feature would have to be unsafe.
Jan 06 2018
On 1/6/18 11:14 AM, Christian Köstlin wrote:
> On 05.01.18 23:04, Steven Schveighoffer wrote:
>> One thing to try, you preallocate the ENTIRE buffer. This only works if you know how many bytes it will decompress to (not always possible), but it will take the allocator out of the equation completely. And it's probably going to be the most efficient method (you aren't leaving behind smaller unused blocks when you realloc). If for some reason we can't beat/tie the C version doing that, then something else is going on.
> yes ... this is something i forgot to try out ... will do now :) mhh .. interesting numbers ... c is even faster, my d lowlevel solution is also a little bit faster, but much slower than the no copy version (funnily, no copy is the wrong name, it just overwrites all the data in a small buffer).

Not from what I'm reading, the C solution is about the same (257 vs. 261). Not sure if you have averaged these numbers, especially on a real computer that might be doing other things. Note: I would expect it to be a tiny bit faster, but not monumentally faster. From my testing with the reallocation, it only reallocates a large quantity of data once.

However, the D solution should be much faster. Part of the issue is that you still aren't low-level enough :) Instead of allocating the ubyte array with this line:

ubyte[] buffer = new ubyte[200*1024*1024];

Try this instead:

// from std.array
auto buffer = uninitializedArray!(ubyte[])(200*1024*1024);

The difference is that the first one will have the runtime 0-initialize all the data.

> one question about iopipe. is it possible to transform the elements in the pipe as well ... e.g. away from a buffer of bytes to json objects?

Yes! I am working on doing just that, but haven't had a chance to update the toy project I wrote: https://github.com/schveiguy/jsoniopipe

I was planning actually on having an iopipe of JsonItem, which would work just like a normal buffer, but reference the ubyte buffer underneath. Eventually, the final product should have a range of JsonValue, which you would recurse into in order to parse its children. All of it will be lazy, and stream-based, so you don't have to load the whole file if it's huge. Note, you can't have an iopipe of JsonValue, because it's a recursive format. JsonItems are just individual defined tokens, so they can be linear.

-Steve
Jan 07 2018
On 07.01.18 14:44, Steven Schveighoffer wrote:
> Not from what I'm reading, the C solution is about the same (257 vs. 261). Not sure if you have averaged these numbers, especially on a real computer that might be doing other things.

Yes, you are right ... for proper benchmarking, proper statistics should be in place: taking out extreme values, averaging them, ...

> Note: I would expect it to be a tiny bit faster, but not monumentally faster. From my testing with the reallocation, it only reallocates a large quantity of data once. However, the D solution should be much faster. Part of the issue is that you still aren't low-level enough :) Instead of allocating the ubyte array with this line:
>   ubyte[] buffer = new ubyte[200*1024*1024];
> Try this instead:
>   // from std.array
>   auto buffer = uninitializedArray!(ubyte[])(200*1024*1024);

Thanks for that ... I just did not know how to get an uninitialized array. I was aware that dlang is nice and puts init there :)

> Yes! I am working on doing just that, but haven't had a chance to update the toy project I wrote: https://github.com/schveiguy/jsoniopipe
> I was planning actually on having an iopipe of JsonItem, which would work just like a normal buffer, but reference the ubyte buffer underneath. Eventually, the final product should have a range of JsonValue, which you would recurse into in order to parse its children. All of it will be lazy, and stream-based, so you don't have to load the whole file if it's huge. Note, you can't have an iopipe of JsonValue, because it's a recursive format. JsonItems are just individual defined tokens, so they can be linear.

Sounds really good. I played around with https://github.com/mleise/fast/blob/master/source/fast/json.d ... that's an interesting pull parser with the wrong license, unfortunately ... I wonder if something like this could be done on top of iopipe instead of a "real" buffer.

--- Christian Köstlin
Jan 09 2018
On Tuesday, 2 January 2018 at 10:27:11 UTC, Christian Köstlin wrote:
> After this I analyzed the first step of the process (gunzipping the data from a file to memory), and found out that dlang's UnCompress is much slower than java, ruby and plain c.

Yeah, std.zlib is VERY poorly written. You can get much better performance by just calling the C functions yourself instead. (You can just import etc.c.zlib; it is included still.)

Improving it would mean changing the public API. I think the one-shot compress/uncompress functions are ok, but the streaming class does a lot of unnecessary work inside like copying stuff around.
Jan 02 2018
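For completeness, the one-shot std.zlib route Adam considers acceptable, as a sketch: destlen is only a size hint, and winbits is forwarded to inflateInit2, so 15 + 32 lets zlib auto-detect the gzip wrapper (the plain default of 15 would reject a .gz header).

import std.file : read;
import std.zlib : uncompress;

void main()
{
    auto compressed = read("../out/nist/2011.json.gz");
    // destlen need not be exact, but a good guess avoids internal reallocations
    auto data = cast(ubyte[]) uncompress(compressed, 200 * 1024 * 1024, 15 + 32);
}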
On 02.01.18 14:51, Adam D. Ruppe wrote:
> On Tuesday, 2 January 2018 at 10:27:11 UTC, Christian Köstlin wrote:
>> After this I analyzed the first step of the process (gunzipping the data from a file to memory), and found out that dlang's UnCompress is much slower than java, ruby and plain c.
> Yeah, std.zlib is VERY poorly written. You can get much better performance by just calling the C functions yourself instead. (You can just import etc.c.zlib; it is included still.) Improving it would mean changing the public API. I think the one-shot compress/uncompress functions are ok, but the streaming class does a lot of unnecessary work inside like copying stuff around.

I added a version that uses the gzip low-level APIs (similar to my example C program). I am still having problems copying the data fast enough to a dlang array. Please see the updated page: https://github.com/gizmomogwai/benchmarks/tree/master/gunzip

-- Christian Köstlin
Jan 02 2018