digitalmars.D.learn - d2 file input performance

Christian Köstlin <christian.koestlin gmail.com> writes:

Hi guys,


i started the thread: 
http://stackoverflow.com/questions/7202710/fastest-way-of-reading-bytes-in-d2 
on stackoverflow, because i ran into kind of a problem.

i wanted to read data from a file (or even better from a stream, but 
let's stay with file), byte-by-byte. the whole thing was part of my 
protobuf implementation for d2, and there you have to look at each byte 
to read out the varints. i was very proud of my implementation until i
benchmarked it first against java (ok ... i was a little slower than 
java) and then against c++ (ok ... this was a completely different game).
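
For readers unfamiliar with the varints mentioned above: protobuf encodes integers in base 128, seven payload bits per byte, least significant group first, with the top bit set on every byte except the last. That is why the parser has to inspect every single byte. A minimal C++ sketch of the decoding step, assuming the input is already in memory; `decodeVarint` is a hypothetical helper, not code from the benchmark repository:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not from the original project): decodes one base-128
// varint starting at bytes[pos] and advances pos past the bytes it consumed.
// Returns false if the input ends early or the varint exceeds 10 bytes;
// in that case pos and value are left in an unspecified intermediate state.
bool decodeVarint(const std::vector<std::uint8_t>& bytes,
                  std::size_t& pos,
                  std::uint64_t& value) {
  value = 0;
  // 10 groups of 7 bits cover a 64-bit value (shifts 0, 7, ..., 63).
  for (int shift = 0; shift < 70; shift += 7) {
    if (pos >= bytes.size()) {
      return false;  // ran out of input in the middle of a varint
    }
    const std::uint8_t b = bytes[pos++];
    value |= static_cast<std::uint64_t>(b & 0x7F) << shift;
    if ((b & 0x80) == 0) {
      return true;  // continuation bit clear: this was the last byte
    }
  }
  return false;  // more than 10 bytes: malformed
}
```

With the two bytes 0xAC 0x02 as input this yields 300 and advances `pos` by two.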

after some optimizing i got better, but was still way slower than c++. 
so i started some small microbenchmarks regarding fileio: 
https://github.com/gizmomogwai/performance in c++, java and d2.

could you help me improve on the d2 performance? i am sure that i am 
missing something fundamental, because i think it should at least be 
possible to be equal to or better than java.

thanks in advance

christian

Aug 26 2011

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

Two things:

First, there is a large difference:

C++ version:

  int read() {...}

D version:

int read(ubyte *bufferptr) {...}

This may not be optimized as well.  You should make it the same.

Second, use -inline, it will help tremendously.

I'd bet money that the largest slowdown is the function calls.  Inlining
makes things run so much faster it's not even funny.

Also, note that FILE* is *already buffered*, there is no reason to do
anything but fgetc.  In fact, it would probably be faster.
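
A sketch of what the fgetc-only variant might look like in C++, relying on the buffering stdio already does. `sumBytes` is an illustrative name, and whether this really beats a hand-rolled buffer would have to be measured:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>

// Illustrative helper: reads a file byte by byte with fgetc only, relying on
// the buffering that FILE* already provides. Counts the bytes and sums them
// up so that the loop has a visible result.
// Returns false if the file cannot be opened.
bool sumBytes(const char* path, std::size_t& count, unsigned long& sum) {
  std::FILE* f = std::fopen(path, "rb");
  if (f == nullptr) {
    return false;
  }
  count = 0;
  sum = 0;
  int c;  // int, not char, so that EOF stays distinguishable from byte 0xFF
  while ((c = std::fgetc(f)) != EOF) {
    ++count;
    sum += static_cast<unsigned long>(c);
  }
  std::fclose(f);
  return true;
}
```

The caller passes the path of the test file and prints `sum` afterwards, which also keeps the loop from being optimized away.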

-Steve

Aug 26 2011

bearophile <bearophileHUGS lycos.com> writes:

Steven Schveighoffer:

 In fact, it would probably be faster.

I suggest the OP keep us updated on this matter. And later, after some time,
if no solutions are found, bring the issue to the main D newsgroup and to
Bugzilla too. This is a significant issue.

Bye,
bearophile

Aug 26 2011

Christian Köstlin <christian.koestlin gmail.com> writes:

On 8/26/11 23:56 , bearophile wrote:
 Steven Schveighoffer:

 In fact, it would probably be faster.

 I suggest the OP to keep us updated on this matter. And later after some time,
if no solutions are found, to bring the issue to the main D newsgroup and to
Bugzilla too. This is a significant issue.

 Bye,
 bearophile

Small update:
I added some more example implementations as a reaction to Mehrdad's 
suggestion to make sure to use the same file-read api. So the c++ and 
the d version both load libc dynamically and from that the symbol fread.
respective times from c++ and d: 115ms vs. 504ms.

the only thing i could also try is to use ldc or gdc (but i first have 
to install those).


regards
christian

Aug 29 2011

Heywood Floyd <soul8o8 gmail.com> writes:

Christian Köstlin wrote:
 after some optimizing i got better, but was still way slower than c++. 
 so i started some small microbenchmarks regarding fileio: 
 https://github.com/gizmomogwai/performance in c++, java and d2.
 
 christian


Hello!

Thanks for your effort in putting this together!


I found this interesting and played around with some of your code examples.
My findings differ somewhat from yours, so I thought I'd post them.



From what I can tell, G++ does generate almost twice (~1.9x) as fast code, in
the fread()/File-example, as DMD. Even though the D-code does handle errors
encountered by fread(), that certainly can't explain the dramatic difference in
speed alone.

It would be very interesting to see how GDC and LDC perform in these tests!! (I
don't have them installed.)



Anyway, here are my notes:

I concentrated on the G++-fread-example and the DMD-File-example, as they seem
comparable enough. However, I did some changes to the benchmark in order to
"level" the playing field:

  1) Made sure both C++ and D used a 1 kb (fixed-size) buffer
  2) Made sure the underlying setvbuf() buffer is the same (64 kb)
  3) Made sure the read data has an actual side effect by printing out the
     accumulated data after each file. (Cheapo CRC)

The last point, 3, particularly seemed to make the G++-example considerably
slower, perhaps hinting G++ is otherwise doing some clever optimization here.
The second point, 2, seemed to have no effect on C++, but it helped somewhat
for D. This may hint at C++ doing its own buffering or something. (?) In that
case the benchmark is bogus.

Anyway, these are the results:

    (G++ 4.2.1, fread()+crc, osx)
    G++        1135 ms    (no flags)
    G++         399 ms    -O1
    G++         368 ms    -O2
    G++         368 ms    -O3
    G++nofx     156 ms    -O3 (Disqualified!)

    (DMD 2.054, rawRead()+crc, osx)
    DMD         995 ms    (no flags)
    DMD         913 ms    -O
    DMD         888 ms    -release
    DMD         713 ms    -release -O -inline
    DMD         703 ms    -release -O
    DMD         693 ms    -release -O -inline -noboundscheck

Well, I suppose a possible (and to me plausible) explanation is that G++'s
optimizations are a lot more extensive than DMD's.

Skipping printing out the CRC-value ("nofx") makes the C++ code more than twice
as fast. Note that the code calculating the CRC-value is still in place, the
value is just not printed out, and surely, calling printf() 10 times can hardly
account for a 200 ms increase. (?) I think it's safe to assume code is simply
being ignored here, as it's not having any side effect.
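
One way to keep an optimizer from discarding such a computation, without paying for I/O inside the timed region, is to store the result in a volatile object, since volatile accesses count as observable behavior in C++. This is only a sketch; `g_sink` and `checksum` are made-up names, and it remains an assumption that dead-code elimination explains the "nofx" number:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sink: a write to a volatile object is an observable side
// effect, so the compiler has to keep the computation that produces the value.
volatile unsigned long g_sink = 0;

// Adds up all bytes (the "cheapo CRC" from the benchmark) and publishes the
// result through the volatile sink instead of printing it.
unsigned long checksum(const std::vector<unsigned char>& data) {
  unsigned long crc = 0;
  for (unsigned char b : data) {
    crc += b;
  }
  g_sink = crc;  // the store cannot be optimized away
  return crc;
}
```

Printing the value once after the stopwatch has been stopped would serve the same purpose.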

My gut feel is DMD is not doing inlining, at least not to the same extent G++
is, as that seems to be especially important since we're making a function call
for every single byte here. (Using the -inline flag even seems to make the D
code slower. Weird.) But of course I don't really know. Again, GDC and LDC
would be interesting to see here.

Finally, to this I must add the size of the generated binary:

    G++   15 kb
    DMD   882 kb

Yikes. I believe there's nothing (large enough) to hide behind for DMD there.



That's it!
Kind regards
/HF





Here's the modified code: (Original https://github.com/gizmomogwai/performance)

// - - - - - - 8< - - - - - - 

import 	std.stdio,
		std.datetime,
		core.stdc.stdio;
	
struct FileReader
{
private:
	File file;
	
	enum BUFFER_SIZE = 1024;
	ubyte[BUFFER_SIZE] readBuf;
	size_t pos, len;
	
	this(string name){
		file = File(name, "rb");
		//setbuf(file.getFP(), null); // No buffer
		setvbuf(file.getFP(), null, _IOFBF, BUFFER_SIZE * 64);
	}

	bool fillBuffer()
	{
		auto tmpBuf = file.rawRead(readBuf);
		len = tmpBuf.length;
		pos = 0;
		return len > 0;
	}
	
public:	
	int read()
	{
		if(pos == len){
			if(fillBuffer() == false)
				return -1;
		}
		return readBuf[pos++];
	}
}

size_t readBytes()
{
	size_t count = 0;
	ulong crc = 0;
	for (int i=0; i<10; i++) {
		auto file = FileReader("/tmp/shop_with_ids.pb");	
		auto data = file.read();
		while(data != -1){
			count++;
			crc += data;
			data = file.read();
		}
		writeln(crc);
	}
	return count;
}


int main(string[] args) {
  auto sw = StopWatch(AutoStart.no);
  sw.start();
  auto count = readBytes();
  sw.stop();
  writeln("<tr><td>d2-6-B</td><td>", count, "</td><td>", sw.peek().msecs,
"</td><td>using std.stdio.File </td></tr>");
  return 0;
}


// - - - - - - 8< - - - - - - 


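#include "stopwatch.h"
#include <cassert>   // assert() in the constructor
#include <iostream>
#include <string>    // std::string constructor argument
#include <stdio.h>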

class StdioFileReader {
private:
  FILE* fFile;
  static const size_t BUFFER_SIZE = 1024;
  unsigned char fBuffer[BUFFER_SIZE];
  unsigned char* fBufferPtr;
  unsigned char* fBufferEnd;

public:
  StdioFileReader(std::string s) : fFile(fopen(s.c_str(), "rb")),
fBufferPtr(fBuffer), fBufferEnd(fBuffer) {
    assert(fFile);
	//setbuf(fFile, NULL); // No buffer
	setvbuf(fFile, NULL, _IOFBF, BUFFER_SIZE * 64);
  }
  ~StdioFileReader() {
    fclose(fFile);
  }

  int read() {
    bool finished = fBufferPtr == fBufferEnd;
    if (finished) {
      finished = fillBuffer();
      if (finished) {
	return -1;
      }
    }
    return *fBufferPtr++;
  }

private:
  bool fillBuffer() {
    size_t l = fread(fBuffer, 1, BUFFER_SIZE, fFile);
    fBufferPtr = fBuffer;
    fBufferEnd = fBufferPtr+l;
    return l == 0;
  }
};

size_t readBytes() {
  size_t res = 0;
  unsigned long crc = 0;
  for (int i=0; i<10; i++) {
    StdioFileReader r("/tmp/shop_with_ids.pb");
    int read = r.read();

    while (read != -1) {
      ++res;
      crc += read;
      read = r.read();
    }
    std::cout << crc << "\n"; // Comment out for "nofx"
  }
  return res;
}

int main(int argc, char** args) {
  StopWatch sw;
  sw.start();
  size_t count = readBytes();
  sw.stop();
  std::cout << "<tr><td>cpp-1-B</td><td>" << count << "</td><td>" << sw.delta()
            << "</td><td>straight forward implementation using fread with buffering.</td></tr>"
            << std::endl;
  return 0;
}

Aug 28 2011

Christian Köstlin <christian.koestlin gmail.com> writes:

Update:

I added performance tests for ldc and gdc with the same programs.
The results are interesting (please see the github page for the details).

regards

christian

Aug 31 2011

David Nadlinger <see klickverbot.at> writes:

On 9/1/11 7:12 AM, Christian Köstlin wrote:
 Update:

 I added performance tests for ldc and gdc with the same programs.
 The results are interesting (please see the github page for the details).

Oh wow, LDC must accidentally call some druntime functions for the 
ubyte[1] case, or something similar, could you please file a ticket at 
http://dsource.org/projects/ldc/newticket?

Thanks,
David

Aug 31 2011

Christian Köstlin <christian.koestlin gmail.com> writes:

On 9/1/11 7:24 , David Nadlinger wrote:
 On 9/1/11 7:12 AM, Christian Köstlin wrote:
 Update:

 I added performance tests for ldc and gdc with the same programs.
 The results are interesting (please see the github page for the details).

 Oh wow, LDC must accidentally call some druntime functions for the
 ubyte[1] case, or something similar, could you please file a ticket at
 http://dsource.org/projects/ldc/newticket?

 Thanks,
 David

hi david,

i am not sure what i have to do to open a ticket, i suppose that i 
should get a trac account and so on. but what would be the description? 
i suppose the time for this particular test is quite strange and out of 
bounds :)

right now my tests show that ldc seems to be faster than dmd and slower 
than gdc in most cases. even my c++ program runs slower compiled with 
llvm-c++.

perhaps you could open the bug report? feel free to point to the github 
repository or take the source and put it into the bugreport.

thanks for your feedback

christian

Sep 04 2011

dennis luehring <dl.soluz gmx.net> writes:

i would change the test scenario a little bit

1. use a ramdisk - so stuff like location on disk, fragmentation, driver 
speed will be reduced down to a little bit of noise

2. make your scenario much bigger

3. it would be interesting to see, for example, the cumulated time for 
every 1000 benchmark steps or something like that - to see caching coming 
in etc.

running 10,000 times

time for 1. 1000 steps xyzw
time for 2. 1000 steps xyzw
time for 3. 1000 steps xyzw
time for 4. 1000 steps xyzw
overall time ... xyz
...
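
The per-batch reporting suggested above could be sketched like this in C++. `timeInBatches` is a hypothetical helper, and `step` stands for one run of whatever is being benchmarked:

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical helper: runs `step` `total` times and returns the elapsed
// wall-clock time in milliseconds for each batch of `batchSize` steps.
// A final, shorter batch is reported as well if `total` is not a multiple
// of `batchSize`.
std::vector<double> timeInBatches(const std::function<void()>& step,
                                  std::size_t total,
                                  std::size_t batchSize) {
  assert(batchSize > 0);
  using Clock = std::chrono::steady_clock;  // monotonic, suited for intervals
  std::vector<double> batchMs;
  auto batchStart = Clock::now();
  for (std::size_t i = 1; i <= total; ++i) {
    step();
    if (i % batchSize == 0 || i == total) {
      const auto now = Clock::now();
      batchMs.push_back(
          std::chrono::duration<double, std::milli>(now - batchStart).count());
      batchStart = now;
    }
  }
  return batchMs;
}
```

Called with a total of 10000 and a batch size of 1000 it returns ten values; if the first one is much larger than the rest, that points at file system caching.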

Sep 02 2011

Christian Köstlin writes:

good point ...
will see if i can adapt the tests...

cK

Sep 04 2011

"Marco Leise" <Marco.Leise gmx.de> writes:

-release -O -inline -noboundscheck is the option set for D2. In D1,
-release included -noboundscheck.

Oct 17 2011
