digitalmars.D.learn - Help optimizing UnCompress for gzipped files
- Christian Köstlin (18/18) Jan 02 2018 Hi all,
- Stefan Koch (6/25) Jan 02 2018 Yes indeed. You can make it much faster by using a sliced static
- Adam D. Ruppe (13/17) Jan 02 2018 Only if you want data corruption! It keeps a copy of your pointer
- Steven Schveighoffer (12/23) Jan 02 2018 iopipe handles this quite well. And deals with the buffers properly
- Christian Köstlin (14/43) Jan 02 2018 Thanks Steve for this proposal (actually I already had an iopipe version
- Steven Schveighoffer (19/61) Jan 02 2018 Well, you don't need to use appender for that (and doing so is copying a...
- Steven Schveighoffer (5/8) Jan 02 2018 Would you mind telling me the source of the data? When I do get around
- Christian Köstlin (9/19) Jan 02 2018 Hi Steve,
- Christian Köstlin (12/29) Jan 02 2018 Thanks for this input, I updated the program to make use of this method
- Steven Schveighoffer (4/18) Jan 03 2018 Hm.. the numbers are worse! I would have expected to be at least
- Steven Schveighoffer (18/36) Jan 03 2018 Alright. I have spent some time examining the issues, and here are my
- Steven Schveighoffer (9/14) Jan 03 2018 Looking at when C realloc actually moves the data, it appears it all of
- Christian Köstlin (11/23) Jan 03 2018 That's a very interesting finding! If I find some time today, I will also
- Christian Köstlin (23/23) Jan 04 2018 I added now a C variant that always does malloc/memcpy/free. It's much
- Steven Schveighoffer (18/41) Jan 04 2018 Another thing I did was print out the size of the buffer before the
- Steven Schveighoffer (14/16) Jan 04 2018 v0.0.3 has been released. To take advantage of using malloc/realloc, you...
- Christian Köstlin (7/30) Jan 04 2018 Thanks Steve,
- Steven Schveighoffer (6/8) Jan 04 2018 free(mypipe.window.ptr);
- Steven Schveighoffer (6/8) Jan 04 2018 Still a bit irked that I can't match the C speed :/
- Christian Köstlin (9/18) Jan 04 2018 I guess the code in my program is small enough that the optimization flags
- Steven Schveighoffer (24/40) Jan 05 2018 Yeah, I guess most of the bottlenecks are inside libz, or the memory
- Christian Köstlin (12/37) Jan 05 2018 I added another version that tries to do the "same" as the C version
- Steven Schveighoffer (25/40) Jan 05 2018 Hm... I think really there is some magic initial state of the allocator,...
- Christian Köstlin (11/55) Jan 06 2018 yes ... this is something i forgot to try out ... will do now :)
- Steven Schveighoffer (27/41) Jan 07 2018 Not from what I'm reading, the C solution is about the same (257 vs.
- Christian Köstlin (12/44) Jan 09 2018 Yes, you are right ... for proper benchmarking, proper statistics should
- Adam D. Ruppe (9/12) Jan 02 2018 Yeah, std.zlib is VERY poorly written. You can get much better
- Christian Köstlin (8/20) Jan 02 2018 I added a version that uses the gzip low-level APIs (similar to my
Hi all,

over the holidays I played around with processing some gzipped JSON data. The first version was implemented in ruby, but took too long, so I tried dlang. This was already faster, but not really satisfactorily fast. Then I wrote another version in java, which was much faster.

After this I analyzed the first step of the process (gunzipping the data from a file to memory), and found out that dlang's UnCompress is much slower than java, ruby and plain c. There was some discussion on the forum a while ago: http://forum.dlang.org/thread/pihxxhjgnveulcdtadvg@forum.dlang.org

The code I used and the numbers I got are here: https://github.com/gizmomogwai/benchmarks/tree/master/gunzip

I used an i7 MacBook with OS X 10.13.2, ruby 2.5.0 built via rvm, python3 installed by homebrew, the builtin clang compiler, ldc-1.7.0-beta1, and java 1.8.0_152.

Is there anything I can do to speed up the dlang stuff?

Thanks in advance,
Christian
Jan 02 2018
On Tuesday, 2 January 2018 at 10:27:11 UTC, Christian Köstlin wrote:Hi all, over the holidays, I played around with processing some gzipped json data. First version was implemented in ruby, but took too long, so I tried, dlang. This was already faster, but not really satisfactory fast. Then I wrote another version in java, which was much faster. After this I analyzed the first step of the process (gunzipping the data from a file to memory), and found out, that dlangs UnCompress is much slower than java, and ruby and plain c. There was some discussion on the forum a while ago: http://forum.dlang.org/thread/pihxxhjgnveulcdtadvg forum.dlang.org The code I used and the numbers I got are here: https://github.com/gizmomogwai/benchmarks/tree/master/gunzip I used an i7 macbook with os x 10.13.2, ruby 2.5.0 built via rvm, python3 installed by homebrew, builtin clang compiler, ldc-1.7.0-beta1, java 1.8.0_152. Is there anything I can do to speed up the dlang stuff? Thanks in advance, ChristianYes indeed. You can make it much faster by using a sliced static array as buffer. I suspect that most of the slowdown is caused by the gc. As there should be only calls to the gzip library
Jan 02 2018
On Tuesday, 2 January 2018 at 11:22:06 UTC, Stefan Koch wrote:
> You can make it much faster by using a sliced static array as buffer.

Only if you want data corruption! It keeps a copy of your pointer internally:
https://github.com/dlang/phobos/blob/master/std/zlib.d#L605

It also will always overallocate new buffers on each call:
https://github.com/dlang/phobos/blob/master/std/zlib.d#L602

There is no efficient way to use it. The implementation is substandard because the API limits the design. If we really want a fast std.zlib, the API will need to be extended with new functions to fix these. Those new functions will probably look a LOT like the underlying C functions... which is why I say just use them right now.

> I suspect that most of the slowdown is caused by the gc. As there should be only calls to the gzip library

plz measure before spreading FUD about the GC.
Jan 02 2018
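For readers who want to follow Adam's advice and call the zlib C API directly, a minimal sketch could look like the code below. It is not taken from the thread or from Christian's benchmark repo; the chunk size and error handling are arbitrary choices, and the GC append at the end is exactly the copy the rest of the thread discusses.

import etc.c.zlib;
import std.exception : enforce;
import std.file : read;

ubyte[] gunzip(string path)
{
    auto compressed = cast(ubyte[]) read(path);

    z_stream zs;
    enforce(inflateInit2(&zs, 15 + 16) == Z_OK);   // 15 + 16: expect a gzip header
    scope (exit) inflateEnd(&zs);

    zs.next_in = compressed.ptr;
    zs.avail_in = cast(uint) compressed.length;

    ubyte[] result;
    ubyte[64 * 1024] chunk = void;                 // fixed scratch buffer, size is arbitrary
    int status;
    do
    {
        zs.next_out = chunk.ptr;
        zs.avail_out = cast(uint) chunk.length;
        status = inflate(&zs, Z_NO_FLUSH);
        enforce(status == Z_OK || status == Z_STREAM_END, "inflate failed");
        // this append is the copy (and GC growth) the rest of the thread is about
        result ~= chunk[0 .. chunk.length - zs.avail_out];
    } while (status != Z_STREAM_END);
    return result;
}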
On 1/2/18 8:57 AM, Adam D. Ruppe wrote:On Tuesday, 2 January 2018 at 11:22:06 UTC, Stefan Koch wrote:iopipe handles this quite well. And deals with the buffers properly (yes, it is very tricky. You have to ref-count the zstream structure, because it keeps internal pointers to *itself* as well!). And no, iopipe doesn't use std.zlib, I use the etc.zlib functions (but I poached some ideas from std.zlib when writing it). https://github.com/schveiguy/iopipe/blob/master/source/iopipe/zip.d I even wrote a json parser for iopipe. But it's far from complete. And probably needs updating since I changed some of the iopipe API. https://github.com/schveiguy/jsoniopipe Depending on the use case, it might be enough, and should be very fast. -SteveYou can make it much faster by using a sliced static array as buffer.Only if you want data corruption! It keeps a copy of your pointer internally: https://github.com/dlang/phobos/blob/master/std/zlib.d#L605 It also will always overallocate new buffers on each call <https://github.com/dlang/phobos/blob/master/std/zlib.d#L602> There is no efficient way to use it. The implementation is substandard because the API limits the design.
Jan 02 2018
On 02.01.18 15:09, Steven Schveighoffer wrote:On 1/2/18 8:57 AM, Adam D. Ruppe wrote:Thanks Steve for this proposal (actually I already had an iopipe version on my harddisk that I applied to this problem) Its more or less your unzip example + putting the data to an appender (I hope this is how it should be done, to get the data to RAM). iopipe is already better than the normal dlang version, almost like java, but still far from the solution. I updated https://github.com/gizmomogwai/benchmarks/tree/master/gunzip I will give the direct gunzip calls a try ... In terms of json parsing, I had really nice results with the fast.json pull parser, but its comparing a little bit apples with oranges, because I did not pull out all the data there. --- ChristianOn Tuesday, 2 January 2018 at 11:22:06 UTC, Stefan Koch wrote:iopipe handles this quite well. And deals with the buffers properly (yes, it is very tricky. You have to ref-count the zstream structure, because it keeps internal pointers to *itself* as well!). And no, iopipe doesn't use std.zlib, I use the etc.zlib functions (but I poached some ideas from std.zlib when writing it). https://github.com/schveiguy/iopipe/blob/master/source/iopipe/zip.d I even wrote a json parser for iopipe. But it's far from complete. And probably needs updating since I changed some of the iopipe API. https://github.com/schveiguy/jsoniopipe Depending on the use case, it might be enough, and should be very fast. -SteveYou can make it much faster by using a sliced static array as buffer.Only if you want data corruption! It keeps a copy of your pointer internally: https://github.com/dlang/phobos/blob/master/std/zlib.d#L605 It also will always overallocate new buffers on each call <https://github.com/dlang/phobos/blob/master/std/zlib.d#L602> There is no efficient way to use it. The implementation is substandard because the API limits the design.
Jan 02 2018
On 1/2/18 1:01 PM, Christian Köstlin wrote:On 02.01.18 15:09, Steven Schveighoffer wrote:Well, you don't need to use appender for that (and doing so is copying a lot of the data an extra time). All you need is to extend the pipe until there isn't any more new data, and it will all be in the buffer. // almost the same line from your current version auto mypipe = openDev("../out/nist/2011.json.gz") .bufd.unzip(CompressionFormat.gzip); // This line here will work with the current release (0.0.2): while(mypipe.extend(0) != 0) {} //But I have a fix for a bug that hasn't been released yet, this would work if you use iopipe-master: mypipe.ensureElems(); // getting the data is as simple as looking at the buffer. auto data = mypipe.window; // ubyte[] of the dataOn 1/2/18 8:57 AM, Adam D. Ruppe wrote:Thanks Steve for this proposal (actually I already had an iopipe version on my harddisk that I applied to this problem) Its more or less your unzip example + putting the data to an appender (I hope this is how it should be done, to get the data to RAM).On Tuesday, 2 January 2018 at 11:22:06 UTC, Stefan Koch wrote:iopipe handles this quite well. And deals with the buffers properly (yes, it is very tricky. You have to ref-count the zstream structure, because it keeps internal pointers to *itself* as well!). And no, iopipe doesn't use std.zlib, I use the etc.zlib functions (but I poached some ideas from std.zlib when writing it). https://github.com/schveiguy/iopipe/blob/master/source/iopipe/zip.d I even wrote a json parser for iopipe. But it's far from complete. And probably needs updating since I changed some of the iopipe API. https://github.com/schveiguy/jsoniopipe Depending on the use case, it might be enough, and should be very fast.You can make it much faster by using a sliced static array as buffer.Only if you want data corruption! It keeps a copy of your pointer internally: https://github.com/dlang/phobos/blob/master/std/zlib.d#L605 It also will always overallocate new buffers on each call <https://github.com/dlang/phobos/blob/master/std/zlib.d#L602> There is no efficient way to use it. The implementation is substandard because the API limits the design.iopipe is already better than the normal dlang version, almost like java, but still far from the solution. I updated https://github.com/gizmomogwai/benchmarks/tree/master/gunzip I will give the direct gunzip calls a try ... In terms of json parsing, I had really nice results with the fast.json pull parser, but its comparing a little bit apples with oranges, because I did not pull out all the data there.Yeah, with jsoniopipe being very raw, I wouldn't be sure it was usable in your case. The end goal is to have something fast, but very easy to construct. I wasn't planning on focusing on the speed (yet) like other libraries do, but ease of writing code to use it. -Steve
Jan 02 2018
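Steve's snippets above, assembled into one runnable sketch. The import paths are my guess for the iopipe layout of that time; openDev, bufd, unzip, extend and window are the calls quoted in the post.

import iopipe.bufpipe;                  // bufd, extend, window
import iopipe.zip;                      // unzip, CompressionFormat
import iopipe.stream;                   // openDev -- module name is my assumption
import std.stdio : writeln;

void main()
{
    auto mypipe = openDev("../out/nist/2011.json.gz")
        .bufd                           // buffer the raw file bytes
        .unzip(CompressionFormat.gzip); // decompress lazily into a second buffer

    // keep extending until the decompressed stream is exhausted
    // (on iopipe master, Steve's mypipe.ensureElems() does the same)
    while (mypipe.extend(0) != 0) {}

    auto data = mypipe.window;          // ubyte[] holding the whole decompressed file
    writeln(data.length, " bytes decompressed");
}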
On 1/2/18 3:13 PM, Steven Schveighoffer wrote:// almost the same line from your current version auto mypipe = openDev("../out/nist/2011.json.gz") .bufd.unzip(CompressionFormat.gzip);Would you mind telling me the source of the data? When I do get around to it, I want to have a good dataset to test things against, and would be good to use what others reach for. -Steve
Jan 02 2018
On 02.01.18 21:48, Steven Schveighoffer wrote:
> On 1/2/18 3:13 PM, Steven Schveighoffer wrote:
>> // almost the same line from your current version
>> auto mypipe = openDev("../out/nist/2011.json.gz")
>>     .bufd.unzip(CompressionFormat.gzip);
> Would you mind telling me the source of the data? When I do get around to it, I want to have a good dataset to test things against, and would be good to use what others reach for.
> -Steve

Hi Steve,
thanks for looking into this. I use data from nist.gov; the Makefile includes these download instructions:

curl -s https://static.nvd.nist.gov/feeds/json/cve/1.0/nvdcve-1.0-2011.json.gz > out/nist/2011.json.gz

-- Christian Köstlin
Jan 02 2018
On 02.01.18 21:13, Steven Schveighoffer wrote:Well, you don't need to use appender for that (and doing so is copying a lot of the data an extra time). All you need is to extend the pipe until there isn't any more new data, and it will all be in the buffer. // almost the same line from your current version auto mypipe = openDev("../out/nist/2011.json.gz") .bufd.unzip(CompressionFormat.gzip); // This line here will work with the current release (0.0.2): while(mypipe.extend(0) != 0) {}Thanks for this input, I updated the program to make use of this method and compare it to the appender thing as well.I added direct gunzip calls as well... Those are really good, as long as I do not try to get the data into ram :) then it is "bad" again. I wonder what the real difference to the lowlevel solution with own appender and the c version is. For me they look almost the same (ugly, only the performance seems to be nice). Funny thing is, that if I add the clang address sanitizer things to the c program, I get almost the same numbers as for java :)I will give the direct gunzip calls a try ...Yeah, with jsoniopipe being very raw, I wouldn't be sure it was usable in your case. The end goal is to have something fast, but very easy to construct. I wasn't planning on focusing on the speed (yet) like other libraries do, but ease of writing code to use it. -Steve-- Christian Köstlin
Jan 02 2018
On 1/3/18 2:47 AM, Christian Köstlin wrote:On 02.01.18 21:13, Steven Schveighoffer wrote:Hm.. the numbers are worse! I would have expected to be at least comparable. I'll have to look into it. Thanks for posting this. -SteveWell, you don't need to use appender for that (and doing so is copying a lot of the data an extra time). All you need is to extend the pipe until there isn't any more new data, and it will all be in the buffer. // almost the same line from your current version auto mypipe = openDev("../out/nist/2011.json.gz") .bufd.unzip(CompressionFormat.gzip); // This line here will work with the current release (0.0.2): while(mypipe.extend(0) != 0) {}Thanks for this input, I updated the program to make use of this method and compare it to the appender thing as well.
Jan 03 2018
On 1/3/18 9:42 AM, Steven Schveighoffer wrote:On 1/3/18 2:47 AM, Christian Köstlin wrote:Alright. I have spent some time examining the issues, and here are my findings: 1. The major differentiator between the C and D algorithms is the use of C realloc. This one thing saves the most time. I'm going to update iopipe so you can use it (stand by). I will also be examining how to simulate using realloc when not using C malloc in iopipe. I think it's the copying of data to the new buffer that is causing issues. 2. extend(0) always attempts to read 8k more bytes. The buffer extends by 8k by going into another couple pages. Somehow this is choking the algorithm. I think the cost of extending pages actually ends up hurting performance (something we need to look at in the GC in general). 3. extendElems should allow extending all the elements in the most optimal fashion. Fixing that, iopipe performs as well as the Appender/iopipe version. This is coming out as well. Stay tuned, there will be updates to iopipe to hopefully make it as fast in this microbenchmark as the C version :) -SteveOn 02.01.18 21:13, Steven Schveighoffer wrote:Hm.. the numbers are worse! I would have expected to be at least comparable. I'll have to look into it. Thanks for posting this.Well, you don't need to use appender for that (and doing so is copying a lot of the data an extra time). All you need is to extend the pipe until there isn't any more new data, and it will all be in the buffer. // almost the same line from your current version auto mypipe = openDev("../out/nist/2011.json.gz") .bufd.unzip(CompressionFormat.gzip); // This line here will work with the current release (0.0.2): while(mypipe.extend(0) != 0) {}Thanks for this input, I updated the program to make use of this method and compare it to the appender thing as well.
Jan 03 2018
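To illustrate point 1, the realloc advantage boils down to a buffer like the hand-rolled sketch below (the names are made up and have no relation to iopipe internals): when libc can extend the block in place nothing is copied, while a GC array append that outgrows its block always copies.

import core.stdc.stdlib : free, realloc;

struct CBuffer
{
    ubyte* ptr;
    size_t used;                        // bytes filled so far
    size_t capacity;                    // bytes allocated

    @disable this(this);                // raw pointer owner: forbid copies

    // make room for `more` additional bytes, doubling the allocation as needed
    void reserve(size_t more)
    {
        if (used + more <= capacity) return;
        size_t newcap = capacity ? capacity : 64 * 1024;
        while (newcap < used + more) newcap *= 2;
        // realloc may extend the block in place (no copy) or move it (one copy)
        ptr = cast(ubyte*) realloc(ptr, newcap);
        assert(ptr !is null, "out of memory");
        capacity = newcap;
    }

    ubyte[] window() { return ptr[0 .. used]; }

    ~this() { free(ptr); }
}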
On 1/3/18 3:28 PM, Steven Schveighoffer wrote:1. The major differentiator between the C and D algorithms is the use of C realloc. This one thing saves the most time. I'm going to update iopipe so you can use it (stand by). I will also be examining how to simulate using realloc when not using C malloc in iopipe. I think it's the copying of data to the new buffer that is causing issues.Looking at when C realloc actually moves the data, it appears it all of a sudden over-allocates very very large blocks, much larger than the GC will over-allocate. This is why the GC is losing. After a certain size, the GC doesn't allocate blocks to grow into, so we start copying on every realloc. So it's not so much the performance of the copying, but the number of times copying is required that is affecting the ultimate performance. -Steve
Jan 03 2018
On 03.01.18 22:33, Steven Schveighoffer wrote:
> On 1/3/18 3:28 PM, Steven Schveighoffer wrote:
>> 1. The major differentiator between the C and D algorithms is the use of C realloc. This one thing saves the most time. I'm going to update iopipe so you can use it (stand by). I will also be examining how to simulate using realloc when not using C malloc in iopipe. I think it's the copying of data to the new buffer that is causing issues.
> Looking at when C realloc actually moves the data, it appears it all of a sudden over-allocates very very large blocks, much larger than the GC will over-allocate. This is why the GC is losing. After a certain size, the GC doesn't allocate blocks to grow into, so we start copying on every realloc.

That's a very interesting finding! If I find some time today, I will also try a pure malloc/memcpy/free solution in C and also look at how many real reallocs are done.

That's funny, because it means that this whole *2 thing for growing buffers that you learn about is already well taken care of in libc? Or does it still mean that you should allocate in a pattern like this for libc's algorithm to kick in? Is there an API to see how much realloc really allocated, or is it only possible to compare pointers to see whether realloc delivers a new or the old pointer?

-- Christian Köstlin
Jan 03 2018
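There is no portable API for the question above, but on macOS malloc_size() from <malloc/malloc.h> reports the usable size of a block, and glibc offers malloc_usable_size(). A small sketch declaring the macOS function from D; it is platform-specific and purely illustrative.

import core.stdc.stdlib : free, malloc, realloc;
import std.stdio : writefln;

// macOS only; on glibc declare `extern (C) size_t malloc_usable_size(void*);` instead
extern (C) size_t malloc_size(const(void)* ptr);

void main()
{
    auto p = malloc(1000);
    writefln("asked for 1000, block really holds %s bytes", malloc_size(p));
    p = realloc(p, 100_000);
    writefln("asked for 100000, block really holds %s bytes", malloc_size(p));
    free(p);
}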
I added now a C variant that always does malloc/memcpy/free. It's much slower for sure. I also put in some output that shows when a real realloc happens. It's like you said:

did a real realloc
did a real realloc
did a real realloc
did a real realloc
did a real realloc
did a real realloc
did not a real realloc
did not a real realloc
did not a real realloc
did not a real realloc
did not a real realloc
did not a real realloc
did not a real realloc
did not a real realloc
did a real realloc
did not a real realloc

Funny thing is, I do not always get the same sequence of real reallocs.

-- Christian Köstlin
Jan 04 2018
On 1/4/18 4:47 AM, Christian Köstlin wrote:I added now a c variant, that does always malloc/memcpy/free. Its much slower for sure. Also I put some output in thats shows when a real realloc happens. Its like you said: did a real realloc did a real realloc did a real realloc did a real realloc did a real realloc did a real realloc did not a real realloc did not a real realloc did not a real realloc did not a real realloc did not a real realloc did not a real realloc did not a real realloc did not a real realloc did a real realloc did not a real realloc funny thing is, i do not always get the same sequence of real reallocs.Another thing I did was print out the size of the buffer before the realloc. In other words: auto oldsize = buffer.length; auto oldptr = buffer.ptr; buffer = (cast(ubyte *)realloc(buffer.ptr, newsize))[0 .. newsize]; if(buffer.ptr != oldptr) writeln("moved: ", oldsize); Even the sizes malloc chooses for its big leap are not consistent from run to run! I'm wondering if possibly when you keep reallocing a large block, and it doesn't have enough memory, needs more from the system, it just tacks it onto the end of the existing big block. This would make the performance for this microbenchmark highly dependent on not doing any other allocations! BTW, my numbers are consistent, but there is a difference from yours. For example, my C runs are about 430ms, and my fast D runs are about 650-700ms. I never get the 200ms range runs that you are seeing. I have a similar machine, but still running Sierra. -Steve
Jan 04 2018
On 1/3/18 3:28 PM, Steven Schveighoffer wrote:Stay tuned, there will be updates to iopipe to hopefully make it as fast in this microbenchmark as the C version :)v0.0.3 has been released. To take advantage of using malloc/realloc, you can use std.experimental.allocator.mallocator.Mallocator as the BufferManager's allocator. e.g.: auto myPipe = openDev("file.gz").bufd // not here .unzip!Mallocator; // here myPipe.ensureElems(); // this works now. auto bytesRead = myPipe.window.length; The reason you don't need it on the first bufd is because that is the buffer of the file stream, which isn't going to grow. Note: the buffer manager does not free the data on destruction! You are responsible for freeing it (if you use Mallocator). -Steve
Jan 04 2018
On 04.01.18 16:53, Steven Schveighoffer wrote:On 1/3/18 3:28 PM, Steven Schveighoffer wrote:Thanks Steve, this runs now faster, I will update the table. Sorry, but I do not get how to cleanup the mallocated memory :) Could you help me out here? -- Christian KöstlinStay tuned, there will be updates to iopipe to hopefully make it as fast in this microbenchmark as the C version :)v0.0.3 has been released. To take advantage of using malloc/realloc, you can use std.experimental.allocator.mallocator.Mallocator as the BufferManager's allocator. e.g.: auto myPipe = openDev("file.gz").bufd // not here .unzip!Mallocator; // here myPipe.ensureElems(); // this works now. auto bytesRead = myPipe.window.length; The reason you don't need it on the first bufd is because that is the buffer of the file stream, which isn't going to grow. Note: the buffer manager does not free the data on destruction! You are responsible for freeing it (if you use Mallocator).
Jan 04 2018
On 1/4/18 1:57 PM, Christian Köstlin wrote:
> Sorry, but I do not get how to clean up the mallocated memory :) Could you help me out here?

free(mypipe.window.ptr);

At the moment this works because you haven't released any data from the front. But eventually, I will need to figure out how to handle this when it's not so neat.

-Steve
Jan 04 2018
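Putting the two answers together, the Mallocator variant plus cleanup might look like the following sketch (the file name is a placeholder, and the import paths are assumed as in the earlier sketch).

import core.stdc.stdlib : free;
import iopipe.bufpipe;                 // bufd, ensureElems, window
import iopipe.zip;                     // unzip
import iopipe.stream;                  // openDev -- module name is my assumption
import std.experimental.allocator.mallocator : Mallocator;

void main()
{
    auto myPipe = openDev("file.gz")   // placeholder file name
        .bufd                          // file-side buffer: default allocator, it never grows
        .unzip!Mallocator;             // decompression buffer grows via C malloc/realloc
    // valid because nothing has been released from the front of the pipe;
    // window.ptr is evaluated when main exits, i.e. after all reallocs
    scope (exit) free(myPipe.window.ptr);

    myPipe.ensureElems();              // pull the whole decompressed stream into the buffer
    auto bytesRead = myPipe.window.length;
}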
On 1/4/18 1:57 PM, Christian Köstlin wrote:Thanks Steve, this runs now faster, I will update the table.Still a bit irked that I can't match the C speed :/ But, I can't get your C speed to duplicate on my mac even with gcc, so I'm not sure where to start. I find it interesting that you are not using any optimization flags for gcc. -Steve
Jan 04 2018
On 04.01.18 20:46, Steven Schveighoffer wrote:
> On 1/4/18 1:57 PM, Christian Köstlin wrote:
>> Thanks Steve, this runs now faster, I will update the table.
> Still a bit irked that I can't match the C speed :/ But, I can't get your C speed to duplicate on my mac even with gcc, so I'm not sure where to start. I find it interesting that you are not using any optimization flags for gcc.

I guess the code in my program is small enough that the optimization flags do not matter... most of the stuff is pulled from libz, which is dynamically linked against /usr/lib/libz.1.dylib.

I also cannot figure out what more I should do (I will try realloc with Mallocator) for the dlang-low-level variant to get to the C speed. Rust is doing quite well there.

-- Christian Köstlin
Jan 04 2018
On 1/5/18 1:01 AM, Christian Köstlin wrote:On 04.01.18 20:46, Steven Schveighoffer wrote:Yeah, I guess most of the bottlenecks are inside libz, or the memory allocator. There isn't much optimization to be done in the main program itself.On 1/4/18 1:57 PM, Christian Köstlin wrote:I guess, the code in my program is small enough that the optimize flags do not matter... most of the stuff is pulled from libz? Which is dynamically linked against /usr/lib/libz.1.dylib.Thanks Steve, this runs now faster, I will update the table.Still a bit irked that I can't match the C speed :/ But, I can't get your C speed to duplicate on my mac even with gcc, so I'm not sure where to start. I find it interesting that you are not using any optimization flags for gcc.I also cannot understand what I should do more (will try realloc with Mallocator) for the dlang-low-level variant to get to the c speed.D compiles just the same as C. So theoretically you should be able to get the same performance with a ported version of your C code. It's worth a shot.rust is doing quite well thereI'll say a few words of caution here: 1. Almost all of these tests use the same C library to unzip. So it's really not a test of the performance of decompression, but the performance of memory management. And it appears that any test using malloc/realloc is in a different tier. Presumably because of the lack of copies (as discussed earlier). 2. Your rust test (I think, I'm not sure) is testing 2 things in the same run, which could potentially have dramatic consequences for the second test. For instance, it could already have all the required memory blocks ready, and the allocation strategy suddenly gets better. Or maybe there is some kind of caching of the input being done. I think you have a fairer test for the second option by running it in a separate program. I've never used rust, so I don't know what exactly your code is doing. 3. It's hard to make a decision based on such microbenchmarks as to which solution is "better" in an actual real-world program, especially when the state/usage of the memory allocator plays a huge role in this. -Steve
Jan 05 2018
On 05.01.18 15:39, Steven Schveighoffer wrote:
> Yeah, I guess most of the bottlenecks are inside libz, or the memory allocator. There isn't much optimization to be done in the main program itself.
> D compiles just the same as C. So theoretically you should be able to get the same performance with a ported version of your C code. It's worth a shot.

I added another version that tries to do the "same" as the C version using Mallocator, but I am still way off; perhaps it's creating too many ranges on the underlying array. But it's around the same speed as your great iopipe thing. My solution does have the same memory leak, as I am not sure how best to get the memory out of the FastAppender so that it is automagically cleaned up. Perhaps if we get RC things, this gets easier?

I updated https://github.com/gizmomogwai/benchmarks/tree/master/gunzip with the newest numbers on my machine, but I think your iopipe solution is the best one we can get at the moment!

>> rust is doing quite well there
> I'll say a few words of caution here:
> 1. Almost all of these tests use the same C library to unzip. So it's really not a test of the performance of decompression, but the performance of memory management. And it appears that any test using malloc/realloc is in a different tier. Presumably because of the lack of copies (as discussed earlier).
> 2. Your rust test (I think, I'm not sure) is testing 2 things in the same run, which could potentially have dramatic consequences for the second test. For instance, it could already have all the required memory blocks ready, and the allocation strategy suddenly gets better. Or maybe there is some kind of caching of the input being done. I think you have a fairer test for the second option by running it in a separate program. I've never used rust, so I don't know what exactly your code is doing.
> 3. It's hard to make a decision based on such microbenchmarks as to which solution is "better" in an actual real-world program, especially when the state/usage of the memory allocator plays a huge role in this.

Sure ... that's true.
Jan 05 2018
On 1/5/18 3:09 PM, Christian Köstlin wrote:On 05.01.18 15:39, Steven Schveighoffer wrote:Hm... I think really there is some magic initial state of the allocator, and that's what allows it to go so fast. One thing about the D version, because druntime is also using malloc (the GC is backed by malloc'd data after all), the initial state of the heap is quite different from when you start in C. It may be impossible or nearly impossible to duplicate the performance. But the flipside (if this is indeed the case) is that you won't see the same performance in a real-world app anyway, even in C. One thing to try, you preallocate the ENTIRE buffer. This only works if you know how many bytes it will decompress to (not always possible), but it will take the allocator out of the equation completely. And it's probably going to be the most efficient method (you aren't leaving behind smaller unused blocks when you realloc). If for some reason we can't beat/tie the C version doing that, then something else is going on.Yeah, I guess most of the bottlenecks are inside libz, or the memory allocator. There isn't much optimization to be done in the main program itself. D compiles just the same as C. So theoretically you should be able to get the same performance with a ported version of your C code. It's worth a shot.I added another version that tries to do the "same" as the c version using mallocator, but i am still way off, perhaps its creating too many ranges on the underlying array. but its around the same speed as your great iopipe thing.My solution does have the same memory leak, as I am not sure how to best get the memory out of the FastAppender so that it is automagically cleaned up. Perhaps if we get rc things, this gets easier?I've been giving some thought to this. I think iopipe needs some buffer management primitives that allow you to finagle the buffer. I've been needing this for some time anyway (for file seeking). Right now, the buffer itself is buried in the chain, so it's hard to get at the actual buffer. Alternatively, I probably also need to give some thought to a mechanism that auto-frees the memory when it can tell nobody is still using the iopipe. Given that iopipe's signature feature is direct buffer access, this would mean anything that uses such a feature would have to be unsafe. -Steve
Jan 05 2018
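A sketch of what Steve's preallocation experiment could look like with raw etc.c.zlib calls. gunzipPreallocated and its parameters are made-up names, and it assumes the caller knows the decompressed size up front (as in the benchmark, which uses a 200*1024*1024 byte buffer).

import etc.c.zlib;
import std.array : uninitializedArray;
import std.exception : enforce;
import std.file : read;

ubyte[] gunzipPreallocated(string path, size_t decompressedSize)
{
    auto compressed = cast(ubyte[]) read(path);
    auto buffer = uninitializedArray!(ubyte[])(decompressedSize);

    z_stream zs;
    enforce(inflateInit2(&zs, 15 + 16) == Z_OK);   // 15 + 16: expect a gzip header
    scope (exit) inflateEnd(&zs);

    zs.next_in = compressed.ptr;
    zs.avail_in = cast(uint) compressed.length;
    zs.next_out = buffer.ptr;
    zs.avail_out = cast(uint) buffer.length;       // one output window: no growing, no copying

    enforce(inflate(&zs, Z_FINISH) == Z_STREAM_END,
            "buffer too small or corrupt stream");
    return buffer[0 .. cast(size_t) zs.total_out];
}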
On 05.01.18 23:04, Steven Schveighoffer wrote:On 1/5/18 3:09 PM, Christian Köstlin wrote:yes ... this is something i forgot to try out ... will do now :) mhh .. interesting numbers ... c is even faster, my d lowlevel solution is also a little bit faster, but much slower than the no copy version (funnily, no copy is the wrong name, it just overwrites all the data in a small buffer).On 05.01.18 15:39, Steven Schveighoffer wrote:Hm... I think really there is some magic initial state of the allocator, and that's what allows it to go so fast. One thing about the D version, because druntime is also using malloc (the GC is backed by malloc'd data after all), the initial state of the heap is quite different from when you start in C. It may be impossible or nearly impossible to duplicate the performance. But the flipside (if this is indeed the case) is that you won't see the same performance in a real-world app anyway, even in C. One thing to try, you preallocate the ENTIRE buffer. This only works if you know how many bytes it will decompress to (not always possible), but it will take the allocator out of the equation completely. And it's probably going to be the most efficient method (you aren't leaving behind smaller unused blocks when you realloc). If for some reason we can't beat/tie the C version doing that, then something else is going on.Yeah, I guess most of the bottlenecks are inside libz, or the memory allocator. There isn't much optimization to be done in the main program itself. D compiles just the same as C. So theoretically you should be able to get the same performance with a ported version of your C code. It's worth a shot.I added another version that tries to do the "same" as the c version using mallocator, but i am still way off, perhaps its creating too many ranges on the underlying array. but its around the same speed as your great iopipe thing.yes .. thats tricky ... one question about iopipe. is it possible to transform the elements in the pipe as well ... e.g. away from a buffer of bytes to json objects? -- Christian KöstlinMy solution does have the same memory leak, as I am not sure how to best get the memory out of the FastAppender so that it is automagically cleaned up. Perhaps if we get rc things, this gets easier?I've been giving some thought to this. I think iopipe needs some buffer management primitives that allow you to finagle the buffer. I've been needing this for some time anyway (for file seeking). Right now, the buffer itself is buried in the chain, so it's hard to get at the actual buffer. Alternatively, I probably also need to give some thought to a mechanism that auto-frees the memory when it can tell nobody is still using the iopipe. Given that iopipe's signature feature is direct buffer access, this would mean anything that uses such a feature would have to be unsafe.
Jan 06 2018
On 1/6/18 11:14 AM, Christian Köstlin wrote:
> On 05.01.18 23:04, Steven Schveighoffer wrote:
>> One thing to try, you preallocate the ENTIRE buffer. This only works if you know how many bytes it will decompress to (not always possible), but it will take the allocator out of the equation completely. And it's probably going to be the most efficient method (you aren't leaving behind smaller unused blocks when you realloc). If for some reason we can't beat/tie the C version doing that, then something else is going on.
> yes ... this is something i forgot to try out ... will do now :) mhh .. interesting numbers ... c is even faster, my d lowlevel solution is also a little bit faster, but much slower than the no copy version (funnily, no copy is the wrong name, it just overwrites all the data in a small buffer).

Not from what I'm reading, the C solution is about the same (257 vs. 261). Not sure if you have averaged these numbers, especially on a real computer that might be doing other things. Note: I would expect it to be a tiny bit faster, but not monumentally faster. From my testing with the reallocation, it only reallocates a large quantity of data once.

However, the D solution should be much faster. Part of the issue is that you still aren't low-level enough :) Instead of allocating the ubyte array with this line:

ubyte[] buffer = new ubyte[200*1024*1024];

Try this instead:

// from std.array
auto buffer = uninitializedArray!(ubyte[])(200*1024*1024);

The difference is that the first one will have the runtime 0-initialize all the data.

> one question about iopipe. is it possible to transform the elements in the pipe as well ... e.g. away from a buffer of bytes to json objects?

Yes! I am working on doing just that, but haven't had a chance to update the toy project I wrote: https://github.com/schveiguy/jsoniopipe

I was planning actually on having an iopipe of JsonItem, which would work just like a normal buffer, but reference the ubyte buffer underneath. Eventually, the final product should have a range of JsonValue, which you would recurse into in order to parse its children. All of it will be lazy, and stream-based, so you don't have to load the whole file if it's huge. Note, you can't have an iopipe of JsonValue, because it's a recursive format. JsonItems are just individual defined tokens, so they can be linear.

-Steve
Jan 07 2018
On 07.01.18 14:44, Steven Schveighoffer wrote:
> Not from what I'm reading, the C solution is about the same (257 vs. 261). Not sure if you have averaged these numbers, especially on a real computer that might be doing other things.

Yes, you are right ... for proper benchmarking, proper statistics should be in place: taking out extreme values, averaging them, ...

> Note: I would expect it to be a tiny bit faster, but not monumentally faster. From my testing with the reallocation, it only reallocates a large quantity of data once. However, the D solution should be much faster. Part of the issue is that you still aren't low-level enough :) Instead of allocating the ubyte array with this line:
>   ubyte[] buffer = new ubyte[200*1024*1024];
> Try this instead:
>   // from std.array
>   auto buffer = uninitializedArray!(ubyte[])(200*1024*1024);

Thanks for that ... I just did not know how to get an uninitialized array. I was aware that dlang is nice and puts init there :)

> Yes! I am working on doing just that, but haven't had a chance to update the toy project I wrote: https://github.com/schveiguy/jsoniopipe
> I was planning actually on having an iopipe of JsonItem, which would work just like a normal buffer, but reference the ubyte buffer underneath. Eventually, the final product should have a range of JsonValue, which you would recurse into in order to parse its children. All of it will be lazy, and stream-based, so you don't have to load the whole file if it's huge. Note, you can't have an iopipe of JsonValue, because it's a recursive format. JsonItems are just individual defined tokens, so they can be linear.

Sounds really good. I played around with https://github.com/mleise/fast/blob/master/source/fast/json.d ... that's an interesting pull parser with the wrong license, unfortunately ... I wonder if something like this could be done on top of iopipe instead of a "real" buffer.

--- Christian Köstlin
Jan 09 2018
On Tuesday, 2 January 2018 at 10:27:11 UTC, Christian Köstlin wrote:
> After this I analyzed the first step of the process (gunzipping the data from a file to memory), and found out that dlang's UnCompress is much slower than java, ruby and plain c.

Yeah, std.zlib is VERY poorly written. You can get much better performance by just calling the C functions yourself instead. (You can just import etc.c.zlib; it is included still.)

Improving it would mean changing the public API. I think the one-shot compress/uncompress functions are ok, but the streaming class does a lot of unnecessary work inside like copying stuff around.
Jan 02 2018
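For completeness, the one-shot std.zlib route Adam considers acceptable, as a sketch: destlen is only a size hint, and winbits is forwarded to inflateInit2, so 15 + 32 lets zlib auto-detect the gzip wrapper (the plain default of 15 would reject a .gz header).

import std.file : read;
import std.zlib : uncompress;

void main()
{
    auto compressed = read("../out/nist/2011.json.gz");
    // destlen need not be exact, but a good guess avoids internal reallocations
    auto data = cast(ubyte[]) uncompress(compressed, 200 * 1024 * 1024, 15 + 32);
}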
On 02.01.18 14:51, Adam D. Ruppe wrote:
> On Tuesday, 2 January 2018 at 10:27:11 UTC, Christian Köstlin wrote:
>> After this I analyzed the first step of the process (gunzipping the data from a file to memory), and found out that dlang's UnCompress is much slower than java, ruby and plain c.
> Yeah, std.zlib is VERY poorly written. You can get much better performance by just calling the C functions yourself instead. (You can just import etc.c.zlib; it is included still.) Improving it would mean changing the public API. I think the one-shot compress/uncompress functions are ok, but the streaming class does a lot of unnecessary work inside like copying stuff around.

I added a version that uses the gzip low-level APIs (similar to my example C program). I am still having problems copying the data fast enough to a dlang array. Please see the updated page: https://github.com/gizmomogwai/benchmarks/tree/master/gunzip

-- Christian Köstlin
Jan 02 2018