digitalmars.D.learn - d2 file input performance

reply Christian Köstlin <christian.koestlin gmail.com> writes:
Hi guys,


i started the thread: 
http://stackoverflow.com/questions/7202710/fastest-way-of-reading-bytes-in-d2 
on stackoverflow, because i ran into kind of a problem.

i wanted to read data from a file (or even better from a stream, but 
let's stay with a file), byte-by-byte. the whole thing was part of my 
protobuf implementation for d2, and there you have to look at each byte 
to read out the varints. i was very proud of my implementation until i
benchmarked it first against java (ok ... i was a little slower than 
java) and then against c++ (ok ... this was a completely different game).
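
For readers unfamiliar with why varints force a look at every byte, here is a minimal sketch of base-128 varint decoding. It is written in C++ to match the thread's C++ baseline and is not the code from the repository; `ByteSource` and `readVarint` are hypothetical names, and the in-memory source only stands in for a real file reader with the same `read()` contract.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical in-memory byte source standing in for a file reader.
// read() returns the next byte (0..255), or -1 at end of input.
struct ByteSource {
    std::vector<unsigned char> bytes;
    std::size_t pos = 0;
    int read() { return pos < bytes.size() ? bytes[pos++] : -1; }
};

// Decodes one base-128 varint: 7 payload bits per byte, least significant
// group first, high bit set on every byte except the last one.
// Returns false on premature end of input or if the varint exceeds 10 bytes.
bool readVarint(ByteSource& src, std::uint64_t& out) {
    std::uint64_t value = 0;
    for (int shift = 0; shift < 70; shift += 7) {   // at most 10 bytes
        int b = src.read();
        if (b < 0) return false;                    // input ended mid-varint
        value |= static_cast<std::uint64_t>(b & 0x7F) << shift;
        if ((b & 0x80) == 0) {                      // continuation bit clear
            out = value;
            return true;
        }
    }
    return false;
}
```

Every decoded value costs one `read()` call per encoded byte, which is why the per-call overhead of `read()` dominates this kind of benchmark.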

after some optimizing i got better, but was still way slower than c++. 
so i started some small microbenchmarks regarding fileio: 
https://github.com/gizmomogwai/performance in c++, java and d2.

could you help me improve on the d2 performance? i am sure that i am 
missing something fundamental, because i think it should at least be 
possible to be equal to or better than java.

thanks in advance

christian
Aug 26 2011
next sibling parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
Two things:

First, there is a large difference:

C++ version: int read() {...}
D version: int read(ubyte *bufferptr) {...}

This may not be optimized as well. You should make it the same.

Second, use -inline, it will help tremendously. I'd bet money that the largest slowdown is the function calls. Inlining makes things run so much faster it's not even funny.

Also, note that FILE* is *already buffered*, there is no reason to do anything but fgetc. In fact, it would probably be faster.

-Steve
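
The point about FILE* already being buffered can be sketched as follows. This is only an illustration in C++ (the same stdio calls are reachable from D through core.stdc.stdio); `Totals` and `sumBytes` are hypothetical names, and whether this beats a hand-rolled buffer on a given platform is something to measure, not assume.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>

// Hypothetical result type: number of bytes read and a cheap additive checksum.
struct Totals {
    std::size_t count;
    unsigned long sum;
};

// Reads the stream byte by byte with fgetc. No second user-level buffer is
// layered on top, because the FILE* itself is already buffered by stdio.
Totals sumBytes(std::FILE* f) {
    Totals t = {0, 0};
    int c;
    while ((c = std::fgetc(f)) != EOF) {
        ++t.count;
        t.sum += static_cast<unsigned char>(c);
    }
    return t;
}
```

A caller would open the file with `std::fopen(path, "rb")`, pass the handle to `sumBytes`, and close it with `std::fclose` afterwards.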
Aug 26 2011
parent reply bearophile <bearophileHUGS lycos.com> writes:
Steven Schveighoffer:

 In fact, it would probably be faster.
I suggest that the OP keep us updated on this matter and, if no solution is found after some time, bring the issue to the main D newsgroup and to Bugzilla too. This is a significant issue.

Bye,
bearophile
Aug 26 2011
parent Christian Köstlin <christian.koestlin gmail.com> writes:
Small update:

I added some more example implementations as a reaction to Mehrdad's suggestion to make sure to use the same file-read api. So the c++ and the d version both load libc dynamically and look up the symbol fread from it.

respective times for c++ and d: 115 ms vs. 504 ms.

the only other thing i could try is to use ldc or gdc (but i first have to install those).

regards

christian
Aug 29 2011
prev sibling next sibling parent Heywood Floyd <soul8o8 gmail.com> writes:
Christian Köstlin Wrote:
 after some optimizing i got better, but was still way slower than c++. 
 so i started some small microbenchmarks regarding fileio: 
 https://github.com/gizmomogwai/performance in c++, java and d2.
 
 christian
Hello!

Thanks for your effort in putting this together! I found this interesting and played around with some of your code examples. My findings differ somewhat from yours, so I thought I'd post them.

From what I can tell, G++ does generate almost twice (~1.9x) as fast code, in the fread()/File-example, as DMD. Even though the D-code does handle errors encountered by fread(), that certainly can't explain the dramatic difference in speed alone.

It would be very interesting to see how GDC and LDC perform in these tests!! (I don't have them installed.)

Anyway, here are my notes:

I concentrated on the G++-fread-example and the DMD-File-example, as they seem comparable enough. However, I did some changes to the benchmark in order to "level" the playing field:

1) Made sure both C++ and D used a 1 kb (fixed-size) buffer
2) Made sure the underlying setvbuf() buffer is the same (64 kb)
3) Made sure the read data has an actual side effect by printing out the accumulated data after each file. (Cheapo CRC)

The last point, 3, particularly seemed to make the G++-example considerably slower, perhaps hinting G++ is otherwise doing some clever optimization here. The second point, 2, seemed to have no effect on C++, but it helped somewhat for D. This may hint at C++ doing its own buffering or something. (?) In that case the benchmark is bogus.

Anyway, these are the results:

(G++ 4.2.1, fread()+crc, osx)
G++      1135 ms  (no flags)
G++       399 ms  -O1
G++       368 ms  -O2
G++       368 ms  -O3
G++nofx   156 ms  -O3  (Disqualified!)

(DMD 2.054, rawRead()+crc, osx)
DMD       995 ms  (no flags)
DMD       913 ms  -O
DMD       888 ms  -release
DMD       713 ms  -release -O -inline
DMD       703 ms  -release -O
DMD       693 ms  -release -O -inline -noboundscheck

Well, I suppose a possible (and to me plausible) explanation is that G++'s optimizations are a lot more extensive than DMD's. Skipping printing out the CRC-value ("nofx") makes the C++ code more than twice as fast. Note that the code calculating the CRC-value is still in place, the value is just not printed out, and surely, calling printf() 10 times can hardly account for a 200 ms increase. (?) I think it's safe to assume code is simply being ignored here, as it's not having any side effect.

My gut feel is DMD is not doing inlining, at least not to the same extent G++ is, as that seems to be especially important since we're making a function call for every single byte here. (Using the -inline flag even seems to make the D code slower. Weird.) But of course I don't really know. Again, GDC and LDC would be interesting to see here.

Finally, to this I must add the size of the generated binary:

G++    15 kb
DMD   882 kb

Yikes. I believe there's nothing (large enough) to hide behind for DMD there.

That's it!

Kind regards
/HF

Here's the modified code: (Original https://github.com/gizmomogwai/performance)

// - - - - - - 8< - - - - - -

import std.stdio, std.datetime, core.stdc.stdio;

struct FileReader {
private:
    File file;
    enum BUFFER_SIZE = 1024;
    ubyte[BUFFER_SIZE] readBuf;
    size_t pos, len;

    this(string name) {
        file = File(name, "rb");
        //setbuf(file.getFP(), null); // No buffer
        setvbuf(file.getFP(), null, _IOFBF, BUFFER_SIZE * 64);
    }

    // Refills readBuf from the file; returns false at end of file.
    bool fillBuffer() {
        auto tmpBuf = file.rawRead(readBuf);
        len = tmpBuf.length;
        pos = 0;
        return len > 0;
    }

public:
    // Returns the next byte, or -1 at end of file.
    int read() {
        if (pos == len) {
            if (fillBuffer() == false)
                return -1;
        }
        return readBuf[pos++];
    }
}

size_t readBytes() {
    size_t count = 0;
    ulong crc = 0;
    for (int i = 0; i < 10; i++) {
        auto file = FileReader("/tmp/shop_with_ids.pb");
        auto data = file.read();
        while (data != -1) {
            count++;
            crc += data;
            data = file.read();
        }
        writeln(crc);
    }
    return count;
}

int main(string[] args) {
    auto sw = StopWatch(AutoStart.no);
    sw.start();
    auto count = readBytes();
    sw.stop();
    writeln("<tr><td>d2-6-B</td><td>", count, "</td><td>", sw.peek().msecs,
            "</td><td>using std.stdio.File </td></tr>");
    return 0;
}

// - - - - - - 8< - - - - - -

#include "stopwatch.h"
#include <cassert>
#include <iostream>
#include <stdio.h>
#include <string>

class StdioFileReader {
private:
    FILE* fFile;
    static const size_t BUFFER_SIZE = 1024;
    unsigned char fBuffer[BUFFER_SIZE];
    unsigned char* fBufferPtr;
    unsigned char* fBufferEnd;

public:
    StdioFileReader(std::string s)
        : fFile(fopen(s.c_str(), "rb")), fBufferPtr(fBuffer), fBufferEnd(fBuffer) {
        assert(fFile);
        //setbuf(fFile, NULL); // No buffer
        setvbuf(fFile, NULL, _IOFBF, BUFFER_SIZE * 64);
    }

    ~StdioFileReader() {
        fclose(fFile);
    }

    // Returns the next byte, or -1 at end of file.
    int read() {
        bool finished = fBufferPtr == fBufferEnd;
        if (finished) {
            finished = fillBuffer();
            if (finished) {
                return -1;
            }
        }
        return *fBufferPtr++;
    }

private:
    // Refills fBuffer from the file; returns true when nothing more could be read.
    bool fillBuffer() {
        size_t l = fread(fBuffer, 1, BUFFER_SIZE, fFile);
        fBufferPtr = fBuffer;
        fBufferEnd = fBufferPtr + l;
        return l == 0;
    }
};

size_t readBytes() {
    size_t res = 0;
    unsigned long crc = 0;
    for (int i = 0; i < 10; i++) {
        StdioFileReader r("/tmp/shop_with_ids.pb");
        int read = r.read();
        while (read != -1) {
            ++res;
            crc += read;
            read = r.read();
        }
        std::cout << crc << "\n"; // Comment out for "nofx"
    }
    return res;
}

int main(int argc, char** args) {
    StopWatch sw;
    sw.start();
    size_t count = readBytes();
    sw.stop();
    std::cout << "<tr><td>cpp-1-B</td><td>" << count << "</td><td>" << sw.delta()
              << "</td><td>straight forward implementation using fread with buffering.</td></tr>"
              << std::endl;
    return 0;
}
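
The "nofx" result suggests that an optimizer may drop work whose result is never observed. One way to keep the checksum observable without printing it inside the timed loop is to store it into a volatile object, since a volatile write counts as an observable side effect in C++. The sketch below only illustrates that idea; `g_sink`, `cheapCrc`, and `consume` are hypothetical names, not part of the benchmark above, and how much this changes the measured times would have to be tested.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical sink: the compiler may not drop a write to a volatile object,
// so the value written here must still be produced.
volatile unsigned long g_sink = 0;

// Same "cheapo CRC" idea as in the benchmark: a plain sum of all bytes.
unsigned long cheapCrc(const unsigned char* data, std::size_t n) {
    unsigned long crc = 0;
    for (std::size_t i = 0; i < n; ++i) {
        crc += data[i];
    }
    return crc;
}

// Makes a computed value observable without doing any I/O.
void consume(unsigned long value) {
    g_sink = value;
}
```

In a benchmark, `consume(crc)` would replace the print statement after each file, so both the printing and the non-printing variants keep the read loop alive.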
Aug 28 2011
prev sibling next sibling parent reply Christian Köstlin <christian.koestlin gmail.com> writes:
Update:

I added performance tests for ldc and gdc with the same programs.
The results are interesting (please see the github page for the details).

regards

christian

Aug 31 2011
parent reply David Nadlinger <see klickverbot.at> writes:
Oh wow, LDC must accidentally call some druntime functions for the ubyte[1] case, or something similar. Could you please file a ticket at http://dsource.org/projects/ldc/newticket?

Thanks,
David
Aug 31 2011
parent Christian Köstlin <christian.koestlin gmail.com> writes:
hi david,

i am not sure what i have to do to open a ticket, i suppose that i should get a trac account and so on. but what would be the description? i suppose the time for this particular test is quite strange and out of bounds :)

right now my tests show that ldc seems to be faster than dmd and slower than gdc in most cases. even my c++ program runs slower compiled with llvm-c++.

perhaps you could open the bug report? feel free to point to the github repository or take the source and put it into the bug report.

thanks for your feedback

christian
Sep 04 2011
prev sibling parent reply dennis luehring <dl.soluz gmx.net> writes:
i would change the test scenario a little bit

1. use a ramdisk - so stuff like location on disk, fragmentation, driver speed will be reduced down to a little bit of noise

2. make your scenario much bigger

3. would be interesting to see, for example, the time accumulated every 1000 benchmark steps or something like that - to see caching coming in etc.

running 10.000 times

time for 1. 1000 steps xyzw
time for 2. 1000 steps xyzw
time for 3. 1000 steps xyzw
time for 4. 1000 steps xyzw
overall time ... xyz
...
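
The idea of reporting a time per block of steps can be sketched like this. It is a C++ illustration only, assuming `std::chrono::steady_clock` as the timer; `timeInChunks` is a hypothetical helper, and the run and chunk counts are parameters, not values taken from the actual benchmark.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <vector>

// Runs `work` totalRuns times and returns the elapsed milliseconds of each
// consecutive chunk of chunkSize runs (the last chunk may be shorter).
// Comparing the first chunk with later ones shows cache warm-up effects.
std::vector<double> timeInChunks(const std::function<void()>& work,
                                 int totalRuns, int chunkSize) {
    assert(chunkSize > 0);
    using Clock = std::chrono::steady_clock;
    std::vector<double> chunkMs;
    for (int done = 0; done < totalRuns; done += chunkSize) {
        int remaining = totalRuns - done;
        int n = remaining < chunkSize ? remaining : chunkSize;
        Clock::time_point start = Clock::now();
        for (int i = 0; i < n; ++i) {
            work();
        }
        std::chrono::duration<double, std::milli> elapsed = Clock::now() - start;
        chunkMs.push_back(elapsed.count());
    }
    return chunkMs;
}
```

With 10000 runs and a chunk size of 1000 this yields ten timings, and their sum gives the overall time.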
Sep 02 2011
parent reply Christian Köstlin writes:
good point ... will see if i can adapt the tests...

cK
Sep 04 2011
parent "Marco Leise" <Marco.Leise gmx.de> writes:
-release -O -inline -noboundscheck is the set of options to use for D2. In D1, -release included -noboundscheck.
Oct 17 2011