
digitalmars.D - some regex vs std.ascii vs handcode times

reply "Jay Norwood" <jayn prismnet.com> writes:
I'm timing operations that process 10 2MB text files in parallel.  
I haven't gotten to the part where I put the words in the map, 
but I've done enough up to this point to say a few things about 
the measurements.

This first one took 1ms, which tells me that the taskPool tasks 
are doing a good job, and this will not be a problem.

// do nothing
//finished! time: 1 ms
void wcp_nothing(string fn)
{
	//G:\d\a7\a7\Release>a7

}
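
The timing harness itself isn't shown in these snippets; for reference, here is a minimal sketch of what it presumably looks like -- a StopWatch around a parallel foreach over the ten files.  The runTest name and signature are hypothetical, not from the actual test program.

import std.datetime;     // StopWatch
import std.parallelism;  // taskPool
import std.stdio;

// Hypothetical driver: times one wcp_* function across all files in parallel
// and prints a line in the same "finished ...! time: N ms" format as below.
void runTest(string name, void function(string) wcp, string[] files)
{
	StopWatch sw;
	sw.start();
	foreach (fn; taskPool.parallel(files))
		wcp(fn);
	sw.stop();
	writeln("finished ", name, "! time: ", sw.peek().msecs, " ms");
}

// usage: runTest("wcp_nothing", &wcp_nothing, files);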


This second one was just to read in the files to get a baseline. 
One surprise, seen later, is that byChunk actually completed 
faster by several ms.

// read files
//finished! time: 31 ms
void wcp_whole_file(string fn)
{
	auto input = std.file.read(fn);
}


On the other end of the spectrum is the byLine version of the 
read.  This is way too slow to be promoting in our examples, and 
if anyone is using it in their code they should instead read 
chunks ... maybe 1MB like in my example later below ... and then 
split up the lines themselves (see the sketch after the byLine 
example).

// read files by line ... yikes! don't want to do this
//finished! time: 485 ms
void wcp_byLine(string fn)
{
	auto f = File(fn);
	foreach(line; f.byLine(std.string.KeepTerminator.yes)){
	}
}
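
As a point of reference for the "split the lines yourself" suggestion above, here is a minimal sketch.  The function name and the carry-over handling are mine, not part of the benchmark; it assumes std.stdio's File/byChunk like the examples here.

import std.stdio;   // File, byChunk

// Read 1MB chunks and split the lines yourself, carrying any partial
// line over to the next chunk so lines that straddle a chunk boundary
// are still seen whole.
void wcp_linesFromChunks(string fn)
{
	auto f = File(fn);
	char[] carry;                         // unterminated tail from the previous chunk
	ulong l_cnt;
	foreach (chunk; f.byChunk(1_000_000))
	{
		auto data = carry ~ cast(char[]) chunk;
		size_t start = 0;
		foreach (i, c; data)
		{
			if (c == '\n')
			{
				auto line = data[start .. i + 1];  // one complete line, terminator kept
				l_cnt++;
				start = i + 1;
			}
		}
		carry = data[start .. $].dup;     // byChunk reuses its buffer, so copy the tail
	}
}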


Ok, this was the good surprise.  Reading by chunks was faster 
than reading the whole file, by several ms.

// read files by chunk ...!better than full input
//finished! time: 23 ms
void wcp_files_by_chunk(string fn)
{
	auto f = File(fn);
	foreach(chunk; f.byChunk(1_000_000)){
	}
}


So this is doing the line count while reading chunks.  It is back 
to the same time as the read of the full input.  It is ok though, 
since it would avoid issues with reading in a really big file.  
So the final thing will probably use this rather than the 
full-input read version.  But for now I'm going to read the full 
input for the other tests.

// read lc by chunk ...same as full input
//finished! time: 34 ms
void wcp_lc_bychunk (string fn)
{
	ulong l_cnt;
	auto f = File(fn);
	foreach(chunk; f.byChunk(1_000_000)){
		foreach(c;chunk){
			if (c=='\n')
				l_cnt++;
		}
	}
}

There doesn't appear to be any significant overhead to reading 
dchar vs char.  I haven't looked at the code (or decode) 
underneath.  I presume it is decoding and expanding into dchar...

// read dchar lc by chunk ...same
//finished! time: 34 ms
void wcp_dchar(string fn)
{
	ulong l_cnt;
	auto f = File(fn);
	foreach(chunk; f.byChunk(1_000_000)){
		foreach(dchar c;chunk){
			if (c=='\n')
				l_cnt++;
		}
	}
}

So here is a surprise ... why is regex 136 ms vs 34 ms for the 
hand code?

// count lines with regex
//finished! time: 136 ms
void wcp_lines_with_regex(string fn)
{
	string input = cast(string)std.file.read(fn);
	auto rx = regex("\n");
	ulong l_cnt;
	foreach(e; splitter(input, rx))
	{
		l_cnt ++;
	}
}

So this is faster than the non-compiled regex, but still slower 
than hand code, which was in the 35 ms range.  Also, the 
documentation implies that the "g" flag should have read in all 
the matches at once, but when I tested for that, the length from 
the result of a single call to match was always 1.  So I think it 
must be broken.


// count lines with compiled ctRegex matcher
//finished! time: 97 ms
void wcp_ctRegex(string fn)
{
	string input = cast(string)std.file.read(fn);
	enum ctr =  ctRegex!("\n","g");
	ulong l_cnt;
	foreach(m; match(input,ctr))
	{
		l_cnt ++;
	}
}


//count lines with char match; this or the one with chunks is about the same
//finished! time: 34 ms
void wcp_char(string fn)
{
	string input = cast(string)std.file.read(fn);
	ulong l_cnt;
	foreach(c; input)
	{
		if (c == '\n')
		l_cnt ++;
	}
}

This is the fastest of the word finders, maybe by 40 ms, which is 
meaningful if you project to larger tasks.  This is similar to 
the code used in the u++ benchmark processing of alice.txt.  
Never mind the line counts for now; I just want to measure 
the word-count impact separately.

//find words using pointers
//finished! time: 110 ms
void wcp_pointer(string fn)
{
	string input = cast(string)std.file.read(fn);
	ulong w_cnt;
	char c;
	auto p = input.ptr;
	auto pe = p+input.length;
	while(p<pe)
	{
		c = *p;

		if (c >= 'a' && c <= 'z' ||
			c >= 'A' && c <= 'Z')
		{
			++w_cnt;
			auto st = p++;
			while (p<pe){
				c = *p;
				if 	(!(c >= 'a' && c <= 'z' ||
					 c >= 'A' && c <= 'Z' ||
					 c >= '0' && c <= '9'))
				{
					break;
				}
				p++;
			}
			auto wpend = p;
		}
		p++;
	}
}

Ok, this is way too slow with both ctRegex and with regex, and so 
again I think there is something broken with the greedy flag 
since moving it outside the foreach only returned a length of 1 
in both cases.

// count words with compiled ctRegex matcher !way too slow
//finished! time: 1299 ms for ctRegex
//finished! time: 2608 ms for regex
void wcp_words_ctRegex(string fn)
{
	string input = cast(string)std.file.read(fn);
	enum ctr =  regex("[a-zA-Z][a-zA-Z0-9]*","g");
	ulong w_cnt;
	foreach(m; match(input,ctr))
	{
		//auto s = m.hit;
		w_cnt ++;
	}
}

Using slices is slower than the previous pointer version by 
about 40 ms.  I turned off range checking in this release 
build, so that doesn't explain it.

//find words with slices .. ok this is slower by a bunch
//finished! time: 153 ms
void wcp_word_slices(string fn)
{
	string input = cast(string)std.file.read(fn);
	ulong w_cnt;
	while (!input.empty)
	{
		auto c = input[0];
		if (c >= 'a' && c <= 'z' ||
		    c >= 'A' && c <= 'Z')
		{
			++w_cnt;
			auto word = input;
			foreach (j, char w; input) {
				if (!(w >= 'a' && w <= 'z' ||
				      w >= 'A' && w <= 'Z' ||
				      w >= '0' && w <= '9'))
				{
					word = input[0..j];
					input = input[j..$];
					break;
				}
			}
		}
		else input = input[1..$];
	}
}

Note that using std.ascii.isXx is actually calling a function, 
which I find surprising given all the emphasis on template code.  
Note that it added another 80 ms vs the inline code above.  
So I would say provide a mixin template for this, because it is 
currently being reused by several libraries (see the sketch after 
the example below).

//find words with std.ascii.isAlpha and isAlphaNum .. worse
//finished! time: 232 ms
void wcp_std_ascii (string fn)
{
	string input = cast(string)std.file.read(fn);
	ulong w_cnt;
	while (!input.empty)
	{
		auto c = input[0];
		if (std.ascii.isAlpha(c))
		{
			++w_cnt;
			auto word = input;
			foreach (j, char w; input) {
				if (!std.ascii.isAlphaNum(w))
				{
					word = input[0..j];
					input = input[j..$];
					break;
				}
			}
		}
		else input = input[1..$];
	}
}
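
For illustration, here is roughly what such a helper could look like.  This is a hedged sketch, not Phobos code; the names are made up, and whether the function form actually gets inlined still depends on -inline.

// Plain inlinable predicates over the ASCII ranges used above.
bool isAsciiAlpha(char c) pure nothrow
{
	return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
}

bool isAsciiAlphaNum(char c) pure nothrow
{
	return isAsciiAlpha(c) || (c >= '0' && c <= '9');
}

// Or, as suggested, a string mixin that drops the comparison inline
// at the call site, with no function call at all:
template alphaNumTest(string c)
{
	enum alphaNumTest = "(" ~ c ~ " >= 'a' && " ~ c ~ " <= 'z' || "
	                  ~ c ~ " >= 'A' && " ~ c ~ " <= 'Z' || "
	                  ~ c ~ " >= '0' && " ~ c ~ " <= '9')";
}
// usage: if (!mixin(alphaNumTest!"w")) { ... }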
Mar 18 2012
next sibling parent dennis luehring <dl.soluz gmx.net> writes:
please put the test-files and testcode (which seem to fit in one file)
somewhere so others can check your timings

On 19.03.2012 05:12, Jay Norwood wrote:
 I'm timing operations that process 10 2MB text files in parallel.
 I haven't gotten to the part where I put the words in the map,
 but I've done enough up to this point to say a few things about
 the measurements.
 [....]
Mar 18 2012
prev sibling next sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 19.03.2012 8:12, Jay Norwood wrote:
 I'm timing operations processing 10 2MB text files in parallel. I
 haven't gotten to the part where I put the words in the map, but I've
 done enough through this point to say a few things about the measurements.
I hope you are going to tell us the dmd flags being used. Hopefully -inline is among them.
 This first one took 1ms, which tells me that the taskPool tasks are
 doing a good job, and this will not be a problem.

 // do nothing
 //finished! time: 1 ms
 void wcp_nothing(string fn)
 {
 //G:\d\a7\a7\Release>a7

 }


 This second one was just to read in the files to get a baseline. One
 surpise, seen later is that the byChunk actually completed faster by
 several ms.

 // read files
 //finished! time: 31 ms
 void wcp_whole_file(string fn)
 {
 auto input = std.file.read(fn);
 }


 On the other end of the spectrum is the byLine version of the read. So
 this is way too slow to be promoting in our examples, and if anyone is
 using this in the code you should instead read chunks ... maybe 1MB like
 in my example later below, and then split up the lines yourself.

 // read files by line ... yikes! don't want to do this
 //finished! time: 485 ms
 void wcp_byLine(string fn)
 {
 auto f = File(fn);
 foreach(line; f.byLine(std.string.KeepTerminator.yes)){
 }
 }


 Ok, this was the good surprise. Reading by chunks was faster than
 reading the whole file, by several ms.

 // read files by chunk ...!better than full input
 //finished! time: 23 ms
 void wcp_files_by_chunk(string fn)
 {
 auto f = File(fn);
 foreach(chunk; f.byChunk(1_000_000)){
 }
 }


 So this is doing the line count while reading chunks. It is back to the
 same time as the read of full input. It is ok though since it would
 avoid issues with reading in a really big file. So the final thing will
 probably use this rather than the full input read version. But for now
 I'm going to read the full input for the other tests.

 // read lc by chunk ...same as full input
 //finished! time: 34 ms
 void wcp_lc_bychunk (string fn)
 {
 ulong l_cnt;
 auto f = File(fn);
 foreach(chunk; f.byChunk(1_000_000)){
 foreach(c;chunk){
 if (c=='\n')
 l_cnt++;
 }
 }
 }

 There doesn't appear to be any significant overhead to reading dchar vs
 char. I haven't looked at the code (or decode) underneath. I presume it
 is decoding and expanding into dchar...

 // read dchar lc by chunk ...same
 //finished! time: 34 ms
 void wcp_dchar(string fn)
 {
 ulong l_cnt;
 auto f = File(fn);
 foreach(chunk; f.byChunk(1_000_000)){
 foreach(dchar c;chunk){
 if (c=='\n')
 l_cnt++;
 }
 }
 }
That was strange - decoding takes time, and that should surely cost a few msecs compared to the non-decoding version. Probably I/O is hiding the facts here.
 So here is some surprise ... why is regex 136ms vs 34 ms hand code?
Are you really that surprised?! regex is a very generic tool, vs a hardcoded byte comparison. In detail: regex still finds a "hit" and takes a slice of the input, plus it works as if "\n" were an unknown string before running. A more logical comparison would be splitting the chunk by a string separator that is a variable. I actually find the timing not bad at all here.
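
For illustration, a sketch of that more apples-to-apples baseline: splitting on a separator held in a run-time variable.  The function is hypothetical and not part of the benchmark; it assumes std.algorithm's splitter.

import std.algorithm;   // splitter
import std.file;        // read

// Count lines by splitting on a separator that is only known at run time,
// which is closer to the amount of work a regex has to assume.
void wcp_lines_with_splitter(string fn)
{
	string input = cast(string) std.file.read(fn);
	string sep = "\n";                 // separator is a run-time variable
	ulong l_cnt;
	foreach (e; splitter(input, sep))
	{
		l_cnt++;
	}
}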
 // count lines with regex
 //finished! time: 136 ms
 void wcp_lines_with_regex(string fn)
 {
 string input = cast(string)std.file.read(fn);
 auto rx = regex("\n");
 ulong l_cnt;
 foreach(e; splitter(input, rx))
 {
 l_cnt ++;
 }
 }

 So this is faster than the non-compiled regex, but still slower than
 hand code, which was in the 35 ms range. Also, the documentation implies
 that the "g" flag should have read in all the matches at once, but when
 I tested for that, the length from the result of a single call to match
 was always 1. So I think it must be broken.
It's by design: match returns a lazy _range_. "g" makes it search for more than just the first match.
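
To illustrate the distinction (assuming the std.regex API of that time; the countMatches helper is hypothetical):

import std.range;       // walkLength
import std.regex;

// match() returns a lazy range of per-match Captures.  captures.length is
// (roughly) the number of capture groups in one match -- 1 here, since the
// pattern has no parentheses -- not the number of matches in the input.
ulong countMatches(string input)
{
	auto re = regex("\n", "g");
	// foreach (m; match(input, re)) { ... }   // iterates the matches lazily
	return walkLength(match(input, re));       // consumes the lazy range to count them
}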
 // count lines with compiled ctRegex matcher
 //finished! time: 97 ms
 void wcp_ctRegex(string fn)
 {
 string input = cast(string)std.file.read(fn);
 enum ctr = ctRegex!("\n","g");
 ulong l_cnt;
 foreach(m; match(input,ctr))
 {
 l_cnt ++;
 }
 }
 //count lines with char match, this or the one with chunks about the same
 //finished! time: 34 ms
 void wcp_char(string fn)
 {
 string input = cast(string)std.file.read(fn);
 ulong l_cnt;
 foreach(c; input)
 {
 if (c == '\n')
 l_cnt ++;
 }
 }

 This is the fastest of the word finders, maybe by 40ms, which is
 meaningful if you project to larger tasks. This is similar to the code
 used in the u++ benchmark processing of alice.txt. Never mind the line
 counts for now. I'm just wanting to measure the word count impact
 separately.
Sorry to spoil your excitement, but none of these actually count words - not in a proper Unicode sense, at all.
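
For comparison, a rough sketch of what a Unicode-aware count would involve -- not what the benchmark measures, and real word segmentation needs more than this.  It assumes std.uni's isAlpha (isUniAlpha in older Phobos); the function name is made up.

import std.uni;         // isAlpha over Unicode code points

ulong unicodeWordCount(string input)
{
	ulong w_cnt;
	bool inWord = false;
	foreach (dchar c; input)           // foreach decodes UTF-8 to code points
	{
		bool alpha = isAlpha(c);       // Unicode alphabetic, not just [a-zA-Z]
		if (alpha && !inWord)
			++w_cnt;
		inWord = alpha;
	}
	return w_cnt;
}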
 //find words using pointers
 //finished! time: 110 ms
 void wcp_pointer(string fn)
 {
 string input = cast(string)std.file.read(fn);
 ulong w_cnt;
 char c;
 auto p = input.ptr;
 auto pe = p+input.length;
 while(p<pe)
 {
 c = *p;

 if (c >= 'a' && c <= 'z' ||
 c >= 'A' && c <= 'Z')
 {
 ++w_cnt;
 auto st = p++;
 while (p<pe){
 c = *p;
 if (!(c >= 'a' && c <= 'z' ||
 c >= 'A' && c <= 'Z' ||
 c >= '0' && c <= '9'))
 {
 break;
 }
 p++;
 }
 auto wpend = p;
 }
 p++;
 }
 }

 Ok, this is way too slow with both ctRegex and with regex, and so again
 I think there is something broken with the greedy flag since moving it
 outside the foreach only returned a length of 1 in both cases.
Nothing is wrong with the "g" flag; it implies that regex seeks all matches but returns a lazy range. Doing all the work eagerly in one go requires allocations. Again, the extra work being done by the run-time regex here is quite high and unavoidable; the compile-time one can do much better, though.
 // count words with compiled ctRegex matcher !way too slow
 //finished! time: 1299 ms for ctRegex
 //finished! time: 2608 ms for regex
 void wcp_words_ctRegex
 (string fn)
 {
 string input = cast(string)std.file.read(fn);
 enum ctr = regex("[a-zA-Z][a-zA-Z0-9]*","g");
 ulong w_cnt;
 foreach(m; match(input,ctr))
 {
 //auto s = m.hit;
 w_cnt ++;
 }
 }

 Using slices is slower than the pointer version previously by about
 40ms. I turned off the range checking in this release build, so that
 doesn't explain it.

 //find words with slices .. ok this is slower by a bunch
 //finished! time: 153 ms
 void wcp_word_slices(string fn)
 {
 string input = cast(string)std.file.read(fn);
 ulong w_cnt;
 while (!input.empty)
 {
 auto c = input[0];
 if (c >= 'a' && c <= 'z' ||
 c >= 'A' && c <= 'Z')
 {
 ++w_cnt;
 auto word = input;
 foreach(j,char w ;input){
 if (!(w >= 'a' && w <= 'z' ||
 w >= 'A' && w <= 'Z' ||
 w >= '0' && w <= '9'))
 {
 word = input[0..j];
 input = input[j..$];
 break;
 }
 }
 }
 else input = input[1..$];
 }
 }

 Note that using std.ascii.isXx is actually calling a function, which I
 find surprising with all the template code emphasis. Note that that it
 added another 80ms vs the in-line code above. So I would say provide a
 mixin template for this, because it is currently being reused by several
 libraries.

 //find words with std.ascii.isAlpha and isAlphaNum .. worse
 //finished! time: 232 ms
 void wcp_std_ascii (string fn)
 {
 string input = cast(string)std.file.read(fn);
 ulong w_cnt;
 while (!input.empty)
 {
 auto c = input[0];
 if (std.ascii.isAlpha(c))
 {
 ++w_cnt;
 auto word = input;
 foreach(j,char w ;input){
 if (!std.ascii.isAlphaNum(w))
 {
 word = input[0..j];
 input = input[j..$];
 break;
 }
 }
 }
 else input = input[1..$];
 }
 }
-- Dmitry Olshansky
Mar 19 2012
parent "Martin Nowak" <dawg dawgfoto.de> writes:
On Mon, 19 Mar 2012 08:56:28 +0100, Dmitry Olshansky  
<dmitry.olsh gmail.com> wrote:

 There doesn't appear to be any significant overhead to reading dchar vs
 char. I haven't looked at the code (or decode) underneath. I presume it
 is decoding and expanding into dchar...
  // read dchar lc by chunk ...same
 //finished! time: 34 ms
 void wcp_dchar(string fn)
 {
 ulong l_cnt;
 auto f = File(fn);
 foreach(chunk; f.byChunk(1_000_000)){
 foreach(dchar c;chunk){
 if (c=='\n')
 l_cnt++;
 }
 }
 }
  That was strange - decoding takes time, that surely trips few msesc in  
 favor of non-decoding version. Probably I/o is hiding the facts here.
It's 1.5x-3x, and we haven't even ported the faster Phobos std.utf.decode back to druntime, which is what foreach calls.
Mar 21 2012
prev sibling next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/18/12 11:12 PM, Jay Norwood wrote:
 I'm timing operations processing 10 2MB text files in parallel. I
 haven't gotten to the part where I put the words in the map, but I've
 done enough through this point to say a few things about the measurements.
Great work! This prompts quite a few bug reports and enhancement suggestions - please submit them to bugzilla. Two quick notes:
 On the other end of the spectrum is the byLine version of the read. So
 this is way too slow to be promoting in our examples, and if anyone is
 using this in the code you should instead read chunks ... maybe 1MB like
 in my example later below, and then split up the lines yourself.

 // read files by line ... yikes! don't want to do this
 //finished! time: 485 ms
 void wcp_byLine(string fn)
 {
 auto f = File(fn);
 foreach(line; f.byLine(std.string.KeepTerminator.yes)){
 }
 }
What OS did you use? (The implementation of byLine varies a lot across OSs.)

I have wanted for a long time to improve byLine by allowing it to do its own buffering. That means once you've used byLine it's not possible to stop it, get back to the original File, and continue reading it. Using byLine is a commitment. This is what most uses of it do anyway.
 Ok, this was the good surprise. Reading by chunks was faster than
 reading the whole file, by several ms.
What may be at work here is cache effects. Reusing the same 1MB may place it in faster cache memory, whereas reading 20MB at once may spill into slower memory.

Andrei
Mar 19 2012
next sibling parent "Jay Norwood" <jayn prismnet.com> writes:
On Monday, 19 March 2012 at 17:23:36 UTC, Andrei Alexandrescu 
wrote:
 On 3/18/12 11:12 PM, Jay Norwood wrote:
 I'm timing operations processing 10 2MB text files in 
 parallel. I
 haven't gotten to the part where I put the words in the map, 
 but I've
 done enough through this point to say a few things about the 
 measurements.
 Great work! This prompts quite a few bug reports and enhancement 
 suggestions - please submit them to bugzilla.

I don't know if they are bugs.  On D.learn I got the explanation 
that matches.captures.length() just returns the matches in the 
expressions surrounded by (), so I don't think this can be used, 
other than in a for loop, to count lines, for example.  
std.algorithm.count works ok, but I was hoping that there was 
something in ctRegex that would make it work as fast as the 
hand-coded string scan.
 Two quick notes:

 On the other end of the spectrum is the byLine version of the 
 read. So
 this is way too slow to be promoting in our examples, and if 
 anyone is
 using this in the code you should instead read chunks ... 
 maybe 1MB like
 in my example later below, and then split up the lines 
 yourself.

 // read files by line ... yikes! don't want to do this
 //finished! time: 485 ms
 void wcp_byLine(string fn)
 {
 auto f = File(fn);
 foreach(line; f.byLine(std.string.KeepTerminator.yes)){
 }
 }
 What OS did you use? (The implementation of byLine varies a lot 
 across OSs.)

I'm doing everything on win7-64 right now.
 I wanted for a long time to improve byLine by allowing it to do 
 its own buffering. That means once you used byLine it's not 
 possible to stop it, get back to the original File, and 
 continue reading it. Using byLine is a commitment. This is what 
 most uses of it do anyway.

 Ok, this was the good surprise. Reading by chunks was faster 
 than
 reading the whole file, by several ms.
 What may be at work here is cache effects. Reusing the same 1MB 
 may place it in faster cache memory, whereas reading 20MB at once 
 may spill into slower memory.

Yes, I would guess that's the problem.  This corei7 has 8MB cache, 
and the threadpool creates 7 active tasks by default, as I 
understand, so even 1MB blocks are on the border when running in 
parallel.  I'll lower the chunk size to some level that seems 
reasonable and retest.
 Andrei
Mar 19 2012
prev sibling parent "Juan Manuel Cabo" <juanmanuel.cabo gmail.com> writes:
On Monday, 19 March 2012 at 17:23:36 UTC, Andrei Alexandrescu 
wrote:

[.....]

 I wanted for a long time to improve byLine by allowing it to do 
 its own buffering. That means once you used byLine it's not 
 possible to stop it, get back to the original File, and 
 continue reading it. Using byLine is a commitment. This is what 
 most uses of it do anyway.

Great!! Perhaps we don't have to choose. We may have both!!
Allow me to suggest:
    byLineBuffered(bufferSize, keepTerminator);
or
    byLineOnly(bufferSize, keepTerminator);
or
    byLineChunked(bufferSize, keepTerminator);
or
    byLineFastAndDangerous  :-)  hahah :-)

Or the other way around:
    byLine(keepTerminator, underlyingBufferSize);
renaming the current one to:
    byLineUnbuffered(keepTerminator);

Other ideas (I think I read them somewhere about this same byLine topic):

* I think it'd be cool if 'line' could be a slice of the underlying 
  buffer when possible, if buffering is added.

* Another good idea would be a new argument, maxLineLength, so that 
  one can avoid reading and allocating the whole file into a big line 
  string if there are no newlines in the file, and one knows the max 
  length desired.

--jm
Mar 20 2012
prev sibling parent reply "Juan Manuel Cabo" <juanmanuel.cabo gmail.com> writes:
On Monday, 19 March 2012 at 04:12:33 UTC, Jay Norwood wrote:

[....]

 Ok, this was the good surprise.  Reading by chunks was faster 
 than reading the whole file, by several ms.

 // read files by chunk ...!better than full input
 //finished! time: 23 ms
 void wcp_files_by_chunk(string fn)
 {
 	auto f = File(fn);
 	foreach(chunk; f.byChunk(1_000_000)){
 	}
 }

Try copying each received chunk into an array of the size of the 
file, so that the comparison is fair (allocation time of one big 
array != allocation time of a single chunk). Plus, std.file.read() 
might reallocate.

Plus, I think that the GC runs at the end of a program, and it's 
definitely not the same to have it scan through a bigger heap (20Mb) 
than just scan through 1Mb of ram, and with 20Mb of a file, it might 
have gotten 'distracted' with pointer-looking data.

Are you benchmarking the time of the whole program, or of just that 
snippet? Is the big array out of scope after the std.file.read()? 
If so, try putting the benchmark start and end inside the function, 
at the same scope, maybe inside that wcp_whole_file() function.

[....]
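
For illustration, a sketch of that "fair" byChunk variant: read by chunk, but also pay for one file-sized allocation and the copies, roughly like std.file.read does.  The function name is hypothetical and not part of wcTest.

import std.stdio;
import std.file : getSize;

void wcp_byChunkIntoBuffer(string fn)
{
	auto f = File(fn);
	auto buf = new ubyte[cast(size_t) getSize(fn)];  // one big up-front allocation
	size_t pos;
	foreach (chunk; f.byChunk(1_000_000))
	{
		buf[pos .. pos + chunk.length] = chunk[];    // copy each chunk into place
		pos += chunk.length;
	}
}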
 So here is some surprise ... why is regex 136ms vs 34 ms hand 
 code?

It's not surprising to me. I don't think that there is a single 
regex engine in the world (I don't think even the legendary Ken 
Thompson machine code engine) that can surpass a hand coded:

    foreach(c; buffer) {
        lineCount += (c == '\n');
    }

for line counting.

The same goes for indexOf. If you are searching for the position of 
a single char (unless maybe the regex is optimized for that specific 
case; I'd like to know of one if it exists), I think nothing beats 
indexOf.

Regular expressions are for what you would rather never hand code, 
or for convenience. I would rather write a regex to do a complex 
substitution in Vim than write a Vim function just for that. They 
have an amazing descriptive power. You are basically describing a 
program in a single regex line.

When people talk about *fast* regexen, they talk comparatively to 
other engines, not as an absolute measure. This confuses a lot of 
people.

--jm
Mar 20 2012
next sibling parent "Juan Manuel Cabo" <juanmanuel.cabo gmail.com> writes:
On Wednesday, 21 March 2012 at 05:49:29 UTC, Juan Manuel Cabo 
wrote:
 On Monday, 19 March 2012 at 04:12:33 UTC, Jay Norwood wrote:

 [....]

 Ok, this was the good surprise.  Reading by chunks was faster 
 than reading the whole file, by several ms.

 // read files by chunk ...!better than full input
 //finished! time: 23 ms
 void wcp_files_by_chunk(string fn)
 {
 	auto f = File(fn);
 	foreach(chunk; f.byChunk(1_000_000)){
 	}
 }

mmm, I just looked in std/file.d and std/stdio.d. The std.file.read() 
function calls GetFileSize() before reading, and you are dealing with 
very tight differences (23ms vs 31ms). So there is a chance that 
either the difference is accounted for by the extra GetFileSize() 
(and the extra stat() in the posix version), or by your 
threads/process losing their scheduled slice for that extra I/O call 
of GetFileSize().

Also, in windows, std.file.read() uses ReadFile, while byChunk uses 
fread(), though it should all be the same in the end.

It is better to try with files big enough to see whether the timing 
difference gets either bigger or stays just around those 8ms. I'll 
later try all this too!!!

Nice benchmarking by the way!! Got me interested!

--jm
Mar 20 2012
prev sibling parent reply "Jay Norwood" <jayn prismnet.com> writes:
On Wednesday, 21 March 2012 at 05:49:29 UTC, Juan Manuel Cabo 
wrote:

   Are you benchmarking the time of the whole program,
 or of just that snippet? Is the big array out of scope
 after the std.file.read() ? If so, try putting the benchmark
 start and end inside the function, at the same scope
 maybe inside that wcp_whole_file() function.
The empty loop measurement, which was the first benchmark, shows that the overhead of everything outside the measurement is only 1ms. What I'm measuring is a parallel foreach loop that calls the small functions on 10 files, with the default number of threadpool threads, which is 7 threads, based on the documentation in std.parallelism. I'm not measuring until end of program or even the console output. It is all measured with the stopwatch timer, and around the parallel foreach loop.
 [....]
 So here is some surprise ... why is regex 136ms vs 34 ms hand 
 code?
 It's not surprising to me. I don't think that there is a single 
 regex engine in the world that can surpass a hand coded:
     foreach(c; buffer) { lineCount += (c == '\n'); }
 for line counting.

Well, ok, but I think comments like the one below from the std.regex 
documentation raise hopes of something approaching hand code:

    // create static regex at compile-time, contains fast native code
    enum ctr = ctRegex!(`^.*/([^/]+)/?$`);

I'll put the test code on github this weekend. I still want to try a 
few things that have been suggested.

On the chunk measurements ... I don't understand what is not "fair". 
In both measurements I processed the entire file.

On the use of larger files ... yes, that will be interesting, but for 
these current measurements the file reads are only taking on the 
order of 30ms for 20MB, which tells me they are already either being 
cached by win7, or else by the ssd's cache.  I'll use the article 
instructions below and put the files being read into the cache prior 
to the test, so that the file read time should be small and 
consistent relative to the other buffer processing time inside the 
loops.

http://us.generation-nt.com/activate-windows-file-caching-tip-tips-tricks-2130881-0.html

Thanks
Mar 21 2012
parent reply "Jay Norwood" <jayn prismnet.com> writes:
On Thursday, 22 March 2012 at 04:29:41 UTC, Jay Norwood wrote:
 On the use of larger files ... yes that will be interesting, 
 but for these current measurements  the file reads are only 
 taking on the order of 30ms for 20MB, which tells me they are 
 already either being cached by win7, or else by the ssd's cache.

  I'll use the article instructions below and put the files 
 being read into the cache prior to the test,  so that the file 
 read time  should be small and consistent relative to the other 
 buffer processing time inside the loops.

 http://us.generation-nt.com/activate-windows-file-caching-tip-tips-tricks-2130881-0.html


 Thanks

I tried using a ramdisk from imdisk, because the above article was 
just for caching network drives to your local disk.  The first set 
of times below is from the ram disk, the second from the ssd, and 
both are about the same.  So I guess win7 is caching these file 
reads already.

I got imdisk for the ramdisk here:
http://www.ltr-data.se/opencode.html/#ImDisk

These are the times for the imdisk reads (still executing from the 
G hard drive, but reading from the F ram disk):

G:\d\a7\a7\Release>wctest f:\al*.txt
finished wcp_nothing! time: 1 ms
finished wcp_whole_file! time: 31 ms
finished wcp_byLine! time: 525 ms
finished wcp_byChunk! time: 22 ms
finished wcp_lcByChunk! time: 33 ms
finished wcp_lcDcharByChunk! time: 30 ms
finished wcp_lcRegex! time: 141 ms
finished wcp_lcCtRegex! time: 104 ms
finished wcp_lcStdAlgoCount! time: 139 ms
finished wcp_lcChar! time: 37 ms
finished wcp_wcPointer! time: 121 ms
finished wcp_wcCtRegex! time: 1269 ms
finished wcp_wcRegex! time: 2908 ms
finished wcp_wcRegex2! time: 2693 ms
finished wcp_wcSlices! time: 179 ms
finished wcp_wcStdAscii! time: 222 ms

This is reading from the ssd, an Intel 510 series 120GB:

G:\d\a7\a7\Release>wctest h:\al*.txt
finished wcp_nothing! time: 1 ms
finished wcp_whole_file! time: 32 ms
finished wcp_byLine! time: 518 ms
finished wcp_byChunk! time: 23 ms
finished wcp_lcByChunk! time: 33 ms
finished wcp_lcDcharByChunk! time: 31 ms
finished wcp_lcRegex! time: 159 ms
finished wcp_lcCtRegex! time: 89 ms
finished wcp_lcStdAlgoCount! time: 144 ms
finished wcp_lcChar! time: 34 ms
finished wcp_wcPointer! time: 118 ms
finished wcp_wcCtRegex! time: 1273 ms
finished wcp_wcRegex! time: 2889 ms
finished wcp_wcRegex2! time: 2688 ms
finished wcp_wcSlices! time: 175 ms
finished wcp_wcStdAscii! time: 220 ms

I added the source and the test text files on github:
https://github.com/jnorwood/wc_test
Mar 26 2012
parent reply "Juan Manuel Cabo" <juanmanuel.cabo gmail.com> writes:
On Monday, 26 March 2012 at 07:10:00 UTC, Jay Norwood wrote:
 I tried using a ramdisk from imdisk, because the above article was 
 just for caching network drives to your local disk.  The first set 
 of times below is from the ram disk, the second from the ssd, and 
 both are about the same.  So I guess win7 is caching these file 
 reads already.
 [....]
 I added the source and the test text files on github:
 https://github.com/jnorwood/wc_test

I downloaded and tried your benchmark. I first tried it with the ten 
10Mb files that you put in github, then truncated them to 2Mb to get 
results comparable to the test you said you did.

* Used dmd 2.058.
* I tested both Windows7 64bit and then booted into Linux Kubuntu 
  64bits to test there too.
* I tested in the following desktop computer, previously disabling 
  cpu throttling (disabled cool&quiet in the bios setup):

    vendor_id  : AuthenticAMD
    cpu family : 15
    model      : 107
    model name : AMD Athlon(tm) 64 X2 Dual Core Processor 4000+
    stepping   : 1
    cpu MHz    : 2109.443
    cache size : 512 KB

* The computer has 4Gb of RAM.
* Ran wcTest many times (more than 10) before saving the results.

Results in Windows7 x64 with ten 10Mb files:
-------------------------------------------
finished wcp_nothing! time: 1 ms
finished wcp_whole_file! time: 130 ms
finished wcp_byLine! time: 1574 ms
finished wcp_byChunk! time: 133 ms
finished wcp_lcByChunk! time: 207 ms
finished wcp_lcDcharByChunk! time: 181 ms
finished wcp_lcRegex! time: 579 ms
finished wcp_lcCtRegex! time: 365 ms
finished wcp_lcStdAlgoCount! time: 511 ms
finished wcp_lcChar! time: 188 ms
finished wcp_wcPointer! time: 438 ms
finished wcp_wcCtRegex! time: 5448 ms
finished wcp_wcRegex! time: 17277 ms
finished wcp_wcRegex2! time: 15524 ms
finished wcp_wcSlices! time: 632 ms
finished wcp_wcStdAscii! time: 814 ms

Results in Windows7 x64 with ten 2Mb files:
-------------------------------------------
finished wcp_nothing! time: 1 ms
finished wcp_whole_file! time: 27 ms
finished wcp_byLine! time: 329 ms
finished wcp_byChunk! time: 34 ms
finished wcp_lcByChunk! time: 79 ms
finished wcp_lcDcharByChunk! time: 79 ms
finished wcp_lcRegex! time: 298 ms
finished wcp_lcCtRegex! time: 150 ms
finished wcp_lcStdAlgoCount! time: 216 ms
finished wcp_lcChar! time: 77 ms
finished wcp_wcPointer! time: 127 ms
finished wcp_wcCtRegex! time: 3250 ms
finished wcp_wcRegex! time: 6164 ms
finished wcp_wcRegex2! time: 5724 ms
finished wcp_wcSlices! time: 171 ms
finished wcp_wcStdAscii! time: 194 ms

Results in Kubuntu 64bits with ten 2Mb files:
----------------------------------------------
finished wcp_nothing! time: 0 ms
finished wcp_whole_file! time: 28 ms
finished wcp_byLine! time: 212 ms
finished wcp_byChunk! time: 20 ms
finished wcp_lcByChunk! time: 90 ms
finished wcp_lcDcharByChunk! time: 77 ms
finished wcp_lcRegex! time: 190 ms
finished wcp_lcCtRegex! time: 108 ms
finished wcp_lcStdAlgoCount! time: 132 ms
finished wcp_lcChar! time: 71 ms
finished wcp_wcPointer! time: 159 ms
finished wcp_wcCtRegex! time: 2308 ms
finished wcp_wcRegex! time: 4382 ms
finished wcp_wcRegex2! time: 3819 ms
finished wcp_wcSlices! time: 171 ms
finished wcp_wcStdAscii! time: 296 ms

Results in Kubuntu 64bits with ten 10Mb files:
----------------------------------------------
finished wcp_nothing! time: 0 ms
finished wcp_whole_file! time: 104 ms
finished wcp_byLine! time: 958 ms
finished wcp_byChunk! time: 85 ms
finished wcp_lcByChunk! time: 340 ms
finished wcp_lcDcharByChunk! time: 345 ms
finished wcp_lcRegex! time: 911 ms
finished wcp_lcCtRegex! time: 434 ms
finished wcp_lcStdAlgoCount! time: 614 ms
finished wcp_lcChar! time: 340 ms
finished wcp_wcPointer! time: 558 ms
finished wcp_wcCtRegex! time: 11300 ms
finished wcp_wcRegex! time: 19368 ms
finished wcp_wcRegex2! time: 17721 ms
finished wcp_wcSlices! time: 796 ms
finished wcp_wcStdAscii! time: 1216 ms

About the cache:
----------------
Each I/O operation also carries overhead not mitigated by the cache. 
Depending on the OS, kernel, and filesystem, every syscall that deals 
with I/O might or might not acquire a mutex or wait on a mutex. Your 
process is not the only one doing I/O, and might not be the only one 
accessing a filesystem object or attribute, or reading or writing to 
the memory disk cache's entries.

About byChunk vs std.file.read:
-------------------------------
Well, surprises are fun! In windows I got them about the same in the 
ten 10Mb files test (depending on the run, wcp_byChunk was below or 
above wcp_whole_file). But, in linux, I got wcp_byChunk consistently 
faster than wcp_whole_file. I even inverted the order in which they 
are tested to see if there was any effect there, and it was the same. 
Though the difference shrank (in percentage) when testing 10Mb files 
instead of 2Mb files.

I'd like to change the benchmark to make automatic repetitions and 
get the standard deviation, but I first wanted to run it as is.

I would jump to conclude that in the desktop computer I tested, 
byChunk and std.file.read are the same in windows, and byChunk is 
faster in Linux, but very little faster (might or might not be 
explained by the extra stat call that it does to get the size before 
reading). But since I didn't add repetitions, averaging and standard 
deviation to your test, I can't really conclude anything for those 
two (byChunk and std.file.read), which are too close.

-inline
-------
I tried to compile with -inline, but got 0ms for all the tests. This 
is because there is a bug in the globbed overload version of 
dirEntries() (or in the globbed matcher). I don't know if it has been 
reported already.

--jm
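
P.S. A rough sketch of that repetitions idea (the timeRepeated helper is hypothetical, not part of wcTest; it assumes std.datetime's StopWatch):

import std.datetime;    // StopWatch
import std.math;        // sqrt
import std.stdio;

// Run a test N times and report mean and standard deviation of the times.
void timeRepeated(string name, void delegate() run, int reps = 10)
{
	double[] times;
	foreach (i; 0 .. reps)
	{
		StopWatch sw;
		sw.start();
		run();
		sw.stop();
		times ~= cast(double) sw.peek().msecs;
	}
	double mean = 0;
	foreach (t; times) mean += t;
	mean /= times.length;
	double var = 0;
	foreach (t; times) var += (t - mean) * (t - mean);
	var /= times.length;
	writefln("%s: mean %.1f ms, stddev %.1f ms", name, mean, sqrt(var));
}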
Mar 26 2012
next sibling parent "Juan Manuel Cabo" <juanmanuel.cabo gmail.com> writes:
On Tuesday, 27 March 2012 at 00:23:46 UTC, Juan Manuel Cabo wrote:
[....]

I forgot to mention that for Linux Kubuntu x64 I had to change
the type of a variable to auto in this line of wcTest.d:
    auto c_cnt = input.length;
And that in windows7 x64, the default for dmd is -m32, so I
compiled and tried your benchmark that way in windows.

--jm
Mar 26 2012
prev sibling parent "Jay Norwood" <jayn prismnet.com> writes:
On Tuesday, 27 March 2012 at 00:23:46 UTC, Juan Manuel Cabo wrote:
  > I downloaded and tried your benchmark. I first tried it
 with the ten 10Mb files that you put in github, then
 truncated them to 2Mb to get results comparable to
 the test you said you did.
> --jm

All my times were with the 10 10MB files that I uploaded.  I tried 
on the ssd, on the hd, and on an imdisk ramdrive, and got about the 
same times for all.  I suppose win7 is caching these.  I'd expect 
several of the measured times to be dominated by disk read time if 
they weren't somehow being cached.

Thanks for testing on the other systems.
Mar 26 2012