digitalmars.D.learn - d2 file input performance
- Christian Köstlin (18/18) Aug 26 2011 Hi guys,
- Steven Schveighoffer (21/38) Aug 26 2011 Two things:
- bearophile (4/5) Aug 26 2011 I suggest that the OP keep us updated on this matter. And later, after some...
- Christian Köstlin (10/15) Aug 29 2011 Small update:
- Heywood Floyd (160/165) Aug 28 2011 Hello!
- Christian Köstlin (6/24) Aug 31 2011 Update:
- David Nadlinger (6/9) Aug 31 2011 Oh wow, LDC must accidentally call some druntime functions for the
- Christian Köstlin (13/23) Sep 04 2011 hi david,
- dennis luehring (14/32) Sep 02 2011 i would change the test scenario a little bit
- Christian Köstlin (4/44) Sep 04 2011 good point ...
- Marco Leise (12/62) Oct 17 2011 -release -O -inline -noboundscheck is the option set for D2.
Hi guys,

i started the thread http://stackoverflow.com/questions/7202710/fastest-way-of-reading-bytes-in-d2 on stackoverflow because i ran into a kind of problem. i wanted to read data from a file (or even better from a stream, but let's stay with a file), byte-by-byte. the whole thing was part of my protobuf implementation for d2, where you have to look at each byte to read out the varints. i was very proud of my implementation until i benchmarked it, first against java (ok ... i was a little slower than java) and then against c++ (ok ... this was a completely different game). after some optimizing i got better, but was still way slower than c++.

so i started some small microbenchmarks regarding file io in c++, java and d2: https://github.com/gizmomogwai/performance

could you help me improve the d2 performance? i am sure that i am missing something fundamental, because i think it should be at least possible to be equal to or better than java.

thanks in advance
christian
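for reference, the access pattern behind all those single-byte reads is a varint decode loop roughly like the sketch below (illustrative only -- the readVarint helper and the test values are made up; this is not the code from the protobuf implementation or from the benchmark repository):

// - - - - - - 8< - - - - - -
// sketch: decode one protobuf-style varint from a byte source that returns
// the next byte, or -1 at end of file (same convention as the benchmark readers)
ulong readVarint(int delegate() nextByte)
{
    ulong result = 0;
    int shift = 0;
    for (;;)
    {
        int b = nextByte();
        if (b == -1)
            throw new Exception("unexpected end of input");
        result |= (cast(ulong)(b & 0x7f)) << shift; // low 7 bits carry payload
        if ((b & 0x80) == 0)                        // high bit clear: last byte
            return result;
        shift += 7;
    }
}

unittest
{
    ubyte[] data = [0xAC, 0x02]; // 300 encoded as a varint
    size_t i = 0;
    int next() { return i < data.length ? data[i++] : -1; }
    assert(readVarint(&next) == 300);
}
// - - - - - - 8< - - - - - -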
Aug 26 2011
On Fri, 26 Aug 2011 13:43:23 -0400, Christian Köstlin <christian.koestlin gmail.com> wrote:

> Hi guys, i started the thread
> http://stackoverflow.com/questions/7202710/fastest-way-of-reading-bytes-in-d2
> on stackoverflow, because i ran into kind of a problem. i wanted to read data
> from a file (or even better from a stream, but lets stay with file),
> byte-by-byte. the whole thing was part of my protobuf implementation for d2,
> and there you have to look at each byte to read out the varints. i was very
> proud of my implementation until i benchmarked it first against java
> (ok ... i was a little slower than java) and then against c++ (ok ... this
> was a completely different game). after some optimizing i got better, but was
> still way slower than c++. so i started some small microbenchmarks regarding
> fileio: https://github.com/gizmomogwai/performance in c++, java and d2.
> could you help me improve on the d2 performance? i am sure that i am missing
> something fundamental, because i think it should be at least possible to be
> equal to or better than java. thanks in advance

Two things:

First, there is a large difference:

C++ version: int read() {...}
D version: int read(ubyte *bufferptr) {...}

This may not be optimized as well. You should make it the same.

Second, use -inline, it will help tremendously. I'd bet money that the largest slowdown is the function calls. Inlining makes things run so much faster it's not even funny.

Also, note that FILE* is *already buffered*, there is no reason to do anything but fgetc. In fact, it would probably be faster.

-Steve
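A rough sketch of that fgetc-based variant (illustrative only, not the benchmark code from the repository -- it just counts the bytes in the same test file and leans on FILE*'s built-in buffering):

// - - - - - - 8< - - - - - -
import core.stdc.stdio;
import std.stdio : writeln;

size_t countBytes(const(char)* path)
{
    FILE* f = fopen(path, "rb");
    if (f is null)
        return 0;
    scope(exit) fclose(f);

    size_t count = 0;
    int c;
    // fgetc reads from the FILE's internal buffer, one byte per call
    while ((c = fgetc(f)) != EOF)
        ++count;
    return count;
}

void main()
{
    writeln(countBytes("/tmp/shop_with_ids.pb"));
}
// - - - - - - 8< - - - - - -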
Aug 26 2011
Steven Schveighoffer:

> In fact, it would probably be faster.

I suggest that the OP keep us updated on this matter, and later, after some time, if no solutions are found, bring the issue to the main D newsgroup and to Bugzilla too. This is a significant issue.

Bye,
bearophile
Aug 26 2011
On 8/26/11 23:56 , bearophile wrote:

> Steven Schveighoffer:
>> In fact, it would probably be faster.
>
> I suggest that the OP keep us updated on this matter, and later, after some
> time, if no solutions are found, bring the issue to the main D newsgroup and
> to Bugzilla too. This is a significant issue.
>
> Bye,
> bearophile

Small update:
i added some more example implementations as a reaction to Mehrdad's suggestion to make sure the same file-read api is used. so the c++ and the d versions both load libc dynamically and resolve the symbol fread from it. respective times for c++ and d: 115ms vs. 504ms.
the only thing i could also try is to use ldc or gdc (but i first have to install those).

regards
christian
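roughly, "load libc dynamically and resolve fread from it" can look like the sketch below (posix only; the library name, the /tmp path and the loop are assumptions for illustration, not the actual code from the repository; on linux you may also need to link with -L-ldl):

// - - - - - - 8< - - - - - -
import core.sys.posix.dlfcn;
import core.stdc.stdio : FILE, fopen, fclose, fread;
import std.stdio : writeln;

void main()
{
    // "libc.so.6" on linux; the osx name would be "libc.dylib" (assumption)
    void* libc = dlopen("libc.so.6", RTLD_NOW);
    assert(libc !is null);
    scope(exit) dlclose(libc);

    // reuse the type of the statically known fread for the dynamically loaded symbol
    auto fread_ = cast(typeof(&fread)) dlsym(libc, "fread");
    assert(fread_ !is null);

    FILE* f = fopen("/tmp/shop_with_ids.pb", "rb");
    assert(f !is null);
    scope(exit) fclose(f);

    ubyte[1024] buf;
    size_t total = 0, n;
    while ((n = fread_(buf.ptr, 1, buf.length, f)) > 0)
        total += n;
    writeln(total);
}
// - - - - - - 8< - - - - - -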
Aug 29 2011
Christian Köstlin Wrote:

> after some optimizing i got better, but was still way slower than c++. so i
> started some small microbenchmarks regarding fileio:
> https://github.com/gizmomogwai/performance in c++, java and d2.
> christian

Hello!

Thanks for your effort in putting this together! I found this interesting and played around with some of your code examples. My findings differ somewhat from yours, so I thought I'd post them.

From what I can tell, G++ generates almost twice as fast code (~1.9x) as DMD in the fread()/File example. Even though the D code does handle errors encountered by fread(), that certainly can't explain the dramatic difference in speed alone.

It would be very interesting to see how GDC and LDC perform in these tests!! (I don't have them installed.)

Anyway, here are my notes:

I concentrated on the G++ fread example and the DMD File example, as they seem comparable enough. However, I made some changes to the benchmark in order to "level" the playing field:

1) Made sure both C++ and D used a 1 kb (fixed-size) buffer
2) Made sure the underlying setvbuf() buffer is the same (64 kb)
3) Made sure the read data has an actual side effect, by printing out the accumulated data after each file. (Cheapo CRC)

The last point, 3, particularly seemed to make the G++ example considerably slower, perhaps hinting that G++ is otherwise doing some clever optimization here. The second point, 2, seemed to have no effect on C++, but it helped somewhat for D. This may hint at C++ doing its own buffering or something. (?) In that case the benchmark is bogus.

Anyway, these are the results:

(G++ 4.2.1, fread()+crc, osx)
G++      1135 ms  (no flags)
G++       399 ms  -O1
G++       368 ms  -O2
G++       368 ms  -O3
G++nofx   156 ms  -O3  (Disqualified!)

(DMD 2.054, rawRead()+crc, osx)
DMD       995 ms  (no flags)
DMD       913 ms  -O
DMD       888 ms  -release
DMD       713 ms  -release -O -inline
DMD       703 ms  -release -O
DMD       693 ms  -release -O -inline -noboundscheck

Well, I suppose a possible (and to me plausible) explanation is that G++'s optimizations are a lot more extensive than DMD's. Skipping printing out the CRC value ("nofx") makes the C++ code more than twice as fast. Note that the code calculating the CRC value is still in place, the value is just not printed out, and surely, calling printf() 10 times can hardly account for a 200 ms increase. (?) I think it's safe to assume code is simply being ignored here, as it's not having any side effect.

My gut feel is that DMD is not doing inlining, at least not to the same extent G++ is, and that seems to be especially important since we're making a function call for every single byte here. (Using the -inline flag even seems to make the D code slower. Weird.) But of course I don't really know. Again, GDC and LDC would be interesting to see here.

Finally, to this I must add the size of the generated binary:

G++    15 kb
DMD   882 kb

Yikes. I believe there's nothing (large enough) to hide behind for DMD there.

That's it!
Kind regards
/HF

Here's the modified code: (Original https://github.com/gizmomogwai/performance)

// - - - - - - 8< - - - - - -

import std.stdio, std.datetime, core.stdc.stdio;

struct FileReader {
private:
  File file;
  enum BUFFER_SIZE = 1024;
  ubyte[BUFFER_SIZE] readBuf;
  size_t pos, len;

  this(string name) {
    file = File(name, "rb");
    //setbuf(file.getFP(), null); // No buffer
    setvbuf(file.getFP(), null, _IOFBF, BUFFER_SIZE * 64);
  }

  bool fillBuffer() {
    auto tmpBuf = file.rawRead(readBuf);
    len = tmpBuf.length;
    pos = 0;
    return len > 0;
  }

public:
  int read() {
    if (pos == len) {
      if (fillBuffer() == false)
        return -1;
    }
    return readBuf[pos++];
  }
}

size_t readBytes() {
  size_t count = 0;
  ulong crc = 0;
  for (int i = 0; i < 10; i++) {
    auto file = FileReader("/tmp/shop_with_ids.pb");
    auto data = file.read();
    while (data != -1) {
      count++;
      crc += data;
      data = file.read();
    }
    writeln(crc);
  }
  return count;
}

int main(string[] args) {
  auto sw = StopWatch(AutoStart.no);
  sw.start();
  auto count = readBytes();
  sw.stop();
  writeln("<tr><td>d2-6-B</td><td>", count, "</td><td>", sw.peek().msecs,
          "</td><td>using std.stdio.File </td></tr>");
  return 0;
}

// - - - - - - 8< - - - - - -

#include "stopwatch.h"
#include <iostream>
#include <string>
#include <cassert>
#include <stdio.h>

class StdioFileReader {
private:
  FILE* fFile;
  static const size_t BUFFER_SIZE = 1024;
  unsigned char fBuffer[BUFFER_SIZE];
  unsigned char* fBufferPtr;
  unsigned char* fBufferEnd;

public:
  StdioFileReader(std::string s)
      : fFile(fopen(s.c_str(), "rb")), fBufferPtr(fBuffer), fBufferEnd(fBuffer) {
    assert(fFile);
    //setbuf(fFile, NULL); // No buffer
    setvbuf(fFile, NULL, _IOFBF, BUFFER_SIZE * 64);
  }

  ~StdioFileReader() {
    fclose(fFile);
  }

  int read() {
    bool finished = fBufferPtr == fBufferEnd;
    if (finished) {
      finished = fillBuffer();
      if (finished) {
        return -1;
      }
    }
    return *fBufferPtr++;
  }

private:
  bool fillBuffer() {
    size_t l = fread(fBuffer, 1, BUFFER_SIZE, fFile);
    fBufferPtr = fBuffer;
    fBufferEnd = fBufferPtr + l;
    return l == 0;
  }
};

size_t readBytes() {
  size_t res = 0;
  unsigned long crc = 0;
  for (int i = 0; i < 10; i++) {
    StdioFileReader r("/tmp/shop_with_ids.pb");
    int read = r.read();
    while (read != -1) {
      ++res;
      crc += read;
      read = r.read();
    }
    std::cout << crc << "\n"; // Comment out for "nofx"
  }
  return res;
}

int main(int argc, char** args) {
  StopWatch sw;
  sw.start();
  size_t count = readBytes();
  sw.stop();
  std::cout << "<tr><td>cpp-1-B</td><td>" << count << "</td><td>" << sw.delta()
            << "</td><td>straight forward implementation using fread with buffering.</td></tr>"
            << std::endl;
  return 0;
}
Aug 28 2011
Update:
I added performance tests for ldc and gdc with the same programs. The results are interesting (please see the github page for the details).

regards
christian

On 8/26/11 19:43 , Christian Köstlin wrote:

> Hi guys, i started the thread
> http://stackoverflow.com/questions/7202710/fastest-way-of-reading-bytes-in-d2
> on stackoverflow, because i ran into kind of a problem. i wanted to read data
> from a file (or even better from a stream, but lets stay with file),
> byte-by-byte. the whole thing was part of my protobuf implementation for d2,
> and there you have to look at each byte to read out the varints. i was very
> proud of my implementation until i benchmarked it first against java
> (ok ... i was a little slower than java) and then against c++ (ok ... this
> was a completely different game). after some optimizing i got better, but was
> still way slower than c++. so i started some small microbenchmarks regarding
> fileio: https://github.com/gizmomogwai/performance in c++, java and d2.
> could you help me improve on the d2 performance? i am sure that i am missing
> something fundamental, because i think it should be at least possible to be
> equal to or better than java. thanks in advance
> christian
Aug 31 2011
On 9/1/11 7:12 AM, Christian Köstlin wrote:

> Update:
> I added performance tests for ldc and gdc with the same programs. The
> results are interesting (please see the github page for the details).

Oh wow, LDC must accidentally call some druntime functions for the ubyte[1] case, or something similar. Could you please file a ticket at http://dsource.org/projects/ldc/newticket?

Thanks,
David
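For reference, the "ubyte[1] case" presumably means the benchmark variant that reads one byte per rawRead call into a one-element buffer, roughly like this sketch (illustrative, not the exact repository code):

// - - - - - - 8< - - - - - -
import std.stdio;

void main()
{
    auto f = File("/tmp/shop_with_ids.pb", "rb");
    ubyte[1] one;
    size_t count = 0;
    // each iteration is a separate rawRead of a length-1 slice, so any per-call
    // overhead in phobos/druntime gets multiplied by the file size
    while (f.rawRead(one[]).length == 1)
        ++count;
    writeln(count);
}
// - - - - - - 8< - - - - - -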
Aug 31 2011
On 9/1/11 7:24 , David Nadlinger wrote:

> On 9/1/11 7:12 AM, Christian Köstlin wrote:
>> Update:
>> I added performance tests for ldc and gdc with the same programs. The
>> results are interesting (please see the github page for the details).
>
> Oh wow, LDC must accidentally call some druntime functions for the ubyte[1]
> case, or something similar. Could you please file a ticket at
> http://dsource.org/projects/ldc/newticket?
>
> Thanks,
> David

hi david,

i am not sure what i have to do to open a ticket; i suppose that i should get a trac account and so on. but what would the description be? i suppose the time for this particular test is quite strange and out of bounds :) right now my tests show that ldc seems to be faster than dmd and slower than gdc in most cases. even my c++ program runs slower compiled with llvm-c++.

perhaps you could open the bug report? feel free to point to the github repository or take the source and put it into the bug report.

thanks for your feedback
christian
Sep 04 2011
On 26.08.2011 19:43, Christian Köstlin wrote:

> Hi guys, i started the thread
> http://stackoverflow.com/questions/7202710/fastest-way-of-reading-bytes-in-d2
> on stackoverflow, because i ran into kind of a problem. i wanted to read data
> from a file (or even better from a stream, but lets stay with file),
> byte-by-byte. the whole thing was part of my protobuf implementation for d2,
> and there you have to look at each byte to read out the varints. i was very
> proud of my implementation until i benchmarked it first against java
> (ok ... i was a little slower than java) and then against c++ (ok ... this
> was a completely different game). after some optimizing i got better, but was
> still way slower than c++. so i started some small microbenchmarks regarding
> fileio: https://github.com/gizmomogwai/performance in c++, java and d2.
> could you help me improve on the d2 performance? i am sure that i am missing
> something fundamental, because i think it should be at least possible to be
> equal to or better than java. thanks in advance
> christian

i would change the test scenario a little bit:

1. use a ramdisk - so stuff like location on disk, fragmentation and driver speed will be reduced down to a little bit of noise
2. make your scenario much bigger
3. it would be interesting, for example, to see the cumulated time every 1000 benchmark steps or something like that - to see caching coming in etc.

running 10.000 times
time for 1. 1000 steps xyzw
time for 2. 1000 steps xyzw
time for 3. 1000 steps xyzw
time for 4. 1000 steps xyzw
overall time ... xyz ...
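a rough sketch of point 3 -- cumulated time every 1000 steps -- using the same StopWatch the existing benchmarks already use; readOneFile(), the run count and the step size are placeholders, not code from the repository:

// - - - - - - 8< - - - - - -
import std.stdio, std.datetime;

void readOneFile()
{
    // placeholder for one benchmark step, e.g. reading /tmp/shop_with_ids.pb once
}

void main()
{
    enum runs = 10_000;
    enum stepSize = 1_000;
    auto sw = StopWatch(AutoStart.yes);
    foreach (i; 1 .. runs + 1)
    {
        readOneFile();
        // print the cumulated wall-clock time after every block of steps
        if (i % stepSize == 0)
            writeln("time after ", i, " steps: ", sw.peek().msecs, " ms");
    }
    writeln("overall time: ", sw.peek().msecs, " ms");
}
// - - - - - - 8< - - - - - -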
Sep 02 2011
On 9/3/11 7:53 , dennis luehring wrote:

> On 26.08.2011 19:43, Christian Köstlin wrote:
>> Hi guys, i started the thread
>> http://stackoverflow.com/questions/7202710/fastest-way-of-reading-bytes-in-d2
>> on stackoverflow, because i ran into kind of a problem. i wanted to read data
>> from a file (or even better from a stream, but lets stay with file),
>> byte-by-byte. the whole thing was part of my protobuf implementation for d2,
>> and there you have to look at each byte to read out the varints. i was very
>> proud of my implementation until i benchmarked it first against java
>> (ok ... i was a little slower than java) and then against c++ (ok ... this
>> was a completely different game). after some optimizing i got better, but was
>> still way slower than c++. so i started some small microbenchmarks regarding
>> fileio: https://github.com/gizmomogwai/performance in c++, java and d2.
>> could you help me improve on the d2 performance? i am sure that i am missing
>> something fundamental, because i think it should be at least possible to be
>> equal to or better than java. thanks in advance
>> christian
>
> i would change the test scenario a little bit:
>
> 1. use a ramdisk - so stuff like location on disk, fragmentation and driver
>    speed will be reduced down to a little bit of noise
> 2. make your scenario much bigger
> 3. it would be interesting, for example, to see the cumulated time every 1000
>    benchmark steps or something like that - to see caching coming in etc.
>
> running 10.000 times
> time for 1. 1000 steps xyzw
> time for 2. 1000 steps xyzw
> time for 3. 1000 steps xyzw
> time for 4. 1000 steps xyzw
> overall time ... xyz ...

good point ... will see if i can adapt the tests...

cK
Sep 04 2011
On 04.09.2011 19:01, Christian Köstlin <christian.koestlin gmail.com> wrote:

> On 9/3/11 7:53 , dennis luehring wrote:
>> On 26.08.2011 19:43, Christian Köstlin wrote:
>>> Hi guys, i started the thread
>>> http://stackoverflow.com/questions/7202710/fastest-way-of-reading-bytes-in-d2
>>> on stackoverflow, because i ran into kind of a problem. i wanted to read data
>>> from a file (or even better from a stream, but lets stay with file),
>>> byte-by-byte. the whole thing was part of my protobuf implementation for d2,
>>> and there you have to look at each byte to read out the varints. i was very
>>> proud of my implementation until i benchmarked it first against java
>>> (ok ... i was a little slower than java) and then against c++ (ok ... this
>>> was a completely different game). after some optimizing i got better, but was
>>> still way slower than c++. so i started some small microbenchmarks regarding
>>> fileio: https://github.com/gizmomogwai/performance in c++, java and d2.
>>> could you help me improve on the d2 performance? i am sure that i am missing
>>> something fundamental, because i think it should be at least possible to be
>>> equal to or better than java. thanks in advance
>>> christian
>>
>> i would change the test scenario a little bit:
>>
>> 1. use a ramdisk - so stuff like location on disk, fragmentation and driver
>>    speed will be reduced down to a little bit of noise
>> 2. make your scenario much bigger
>> 3. it would be interesting, for example, to see the cumulated time every 1000
>>    benchmark steps or something like that - to see caching coming in etc.
>>
>> running 10.000 times
>> time for 1. 1000 steps xyzw
>> time for 2. 1000 steps xyzw
>> time for 3. 1000 steps xyzw
>> time for 4. 1000 steps xyzw
>> overall time ... xyz ...
>
> good point ... will see if i can adapt the tests...
>
> cK

-release -O -inline -noboundscheck is the option set for D2. In D1, -release included -noboundscheck.
Oct 17 2011