
digitalmars.D - Slower than Python

reply "cvk012c" <cvk012c motorolasolutions.com> writes:
Tried to port my SIP parser from Python to D to boost performance
but got the opposite result.
I created a simple test script which splits a SIP REGISTER message
10 million times. The Python version takes about 14 seconds to
execute; the D version takes about 23 seconds, which is 1.6 times slower.
I used DMD 2.062 and compiled my script with options -release and
-O. I used Python 3.3 64 bit.
I ran both scripts on the same hardware with Windows 7 64.
Is this a general problem with string performance in D, or just a
splitter() issue?
Has anybody compared the performance of D's string-manipulating
functions with other languages like Python, Perl, Java and C++?


Here is the Python version of the test script:

import time

message = "REGISTER sip:comm.example.com SIP/2.0\r\n\
Content-Length: 0\r\n\
Contact:
<sip:12345 10.1.3.114:59788;transport=tls>;expires=4294967295;events=\"message-summary\";q=0.9\r\n\
To: <sip:12345 comm.example.com>\r\n\
User-Agent: (\"VENDOR=MyCompany\" \"My User Agent\")\r\n\
Max-Forwards: 70\r\n\
CSeq: 1 REGISTER\r\n\
Via: SIP/2.0/TLS
10.1.3.114:59788;branch=z9hG4bK2910497772630690\r\n\
Call-ID: 2910497622026445\r\n\
From: <sip:12345 comm.example.com>;tag=2910497618150713\r\n\r\n"

t1 = time.perf_counter()
for i in range(10000000):
    for notused in message.split("\r\n"):
      pass
print(time.perf_counter()-t1)


Here is the D version:

import std.stdio,std.algorithm,std.datetime;

void main()
{
    auto message = "REGISTER sip:example.com SIP/2.0\r\n~
Content-Length: 0\r\n~
Contact:
<sip:12345 10.1.3.114:59788;transport=tls>;expires=4294967295;events=\"message-summary\";q=0.9\r\n~
To: <sip:12345 comm.example.com>\r\n~
User-Agent: (\"VENDOR=MyCompany\" \"My User Agent\")\r\n~
Max-Forwards: 70\r\n~
CSeq: 1 REGISTER\r\n~
Via: SIP/2.0/TLS
10.1.3.114:59788;branch=z9hG4bK2910497772630690\r\n~
Call-ID: 2910497622026445\r\n~
From: <sip:12345 comm.example.com>;tag=2910497618150713\r\n\r\n";

    auto t1 = Clock.currTime();
    foreach(i; 0..10000000)
    {
      foreach(notused; splitter(message, "\r\n"))
      {
      }
    }
    writeln(Clock.currTime()-t1);
}
Mar 01 2013
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/1/13 3:30 PM, cvk012c wrote:
 Tried to port my SIP parser from Python to D to boost performance
 but got opposite result.
 I created a simple test script which splits SIP REGISTER message
 10 million times. Python version takes about 14 seconds to
 execute, D version is about 23 seconds which is 1.6 times slower.
 I used DMD 2.062 and compiled my script with options -release and
 -O. I used Python 3.3 64 bit.
 I ran both scripts on the same hardware with Windows 7 64.
Add -inline to the options. Andrei
Mar 01 2013
parent reply "simendsjo" <simendsjo gmail.com> writes:
On Friday, 1 March 2013 at 20:50:15 UTC, Andrei Alexandrescu 
wrote:
 On 3/1/13 3:30 PM, cvk012c wrote:
 Tried to port my SIP parser from Python to D to boost 
 performance
 but got opposite result.
 I created a simple test script which splits SIP REGISTER 
 message
 10 million times. Python version takes about 14 seconds to
 execute, D version is about 23 seconds which is 1.6 times 
 slower.
 I used DMD 2.062 and compiled my script with options -release 
 and
 -O. I used Python 3.3 64 bit.
 I ran both scripts on the same hardware with Windows 7 64.
Add -inline to the options. Andrei
--noboundscheck can also help if you don't mind missing the safety net.

$ rdmd -O -release sip
22 secs, 977 ms, 299 μs, and 8 hnsecs
$ rdmd -O -release -inline sip
12 secs, 245 ms, 567 μs, and 9 hnsecs
$ rdmd -O -release -inline -noboundscheck sip
10 secs, 171 ms, 209 μs, and 9 hnsecs
Mar 01 2013
next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/1/13 3:58 PM, simendsjo wrote:
 On Friday, 1 March 2013 at 20:50:15 UTC, Andrei Alexandrescu wrote:
 On 3/1/13 3:30 PM, cvk012c wrote:
 Tried to port my SIP parser from Python to D to boost performance
 but got opposite result.
 I created a simple test script which splits SIP REGISTER message
 10 million times. Python version takes about 14 seconds to
 execute, D version is about 23 seconds which is 1.6 times slower.
 I used DMD 2.062 and compiled my script with options -release and
 -O. I used Python 3.3 64 bit.
 I ran both scripts on the same hardware with Windows 7 64.
Add -inline to the options. Andrei
 --noboundscheck can also help if you don't mind missing the safety net.

 $ rdmd -O -release sip
 22 secs, 977 ms, 299 μs, and 8 hnsecs
 $ rdmd -O -release -inline sip
 12 secs, 245 ms, 567 μs, and 9 hnsecs
 $ rdmd -O -release -inline -noboundscheck sip
 10 secs, 171 ms, 209 μs, and 9 hnsecs
Also, the D version has a different string to parse (~ is not a line
continuation character). The fixed version:

auto message = "REGISTER sip:example.com SIP/2.0\r
Content-Length: 0\r
Contact: <sip:12345 10.1.3.114:59788;transport=tls>;expires=4294967295;events=\"message-summary\";q=0.9\r
To: <sip:12345 comm.example.com>\r
User-Agent: (\"VENDOR=MyCompany\" \"My User Agent\")\r
Max-Forwards: 70\r
CSeq: 1 REGISTER\r
Via: SIP/2.0/TLS 10.1.3.114:59788;branch=z9hG4bK2910497772630690\r
Call-ID: 2910497622026445\r
From: <sip:12345 comm.example.com>;tag=2910497618150713\r\n\r\n";

That shaves off one extra second, bringing it down to 9.

Andrei
Mar 01 2013
prev sibling parent reply "cvk012c" <cvk012c motorolasolutions.com> writes:
On Friday, 1 March 2013 at 20:58:09 UTC, simendsjo wrote:
 On Friday, 1 March 2013 at 20:50:15 UTC, Andrei Alexandrescu 
 wrote:
 On 3/1/13 3:30 PM, cvk012c wrote:
 Tried to port my SIP parser from Python to D to boost 
 performance
 but got opposite result.
 I created a simple test script which splits SIP REGISTER 
 message
 10 million times. Python version takes about 14 seconds to
 execute, D version is about 23 seconds which is 1.6 times 
 slower.
 I used DMD 2.062 and compiled my script with options -release 
 and
 -O. I used Python 3.3 64 bit.
 I ran both scripts on the same hardware with Windows 7 64.
Add -inline to the options. Andrei
 --noboundscheck can also help if you don't mind missing the safety net.

 $ rdmd -O -release sip
 22 secs, 977 ms, 299 μs, and 8 hnsecs
 $ rdmd -O -release -inline sip
 12 secs, 245 ms, 567 μs, and 9 hnsecs
 $ rdmd -O -release -inline -noboundscheck sip
 10 secs, 171 ms, 209 μs, and 9 hnsecs
On my hardware, with the -inline option it now takes about 15 secs, which is still slower than Python, but with both -inline and -noboundscheck it takes 13 secs and finally beats Python. I am still kind of disappointed, though, because I expected a much better performance boost and got only 7%. Considering that Python is not the fastest scripting language, I think that similar Perl and Java scripts will outperform D easily.

Thanks Andrei and simendsjo for the quick responses, though.
Mar 01 2013
next sibling parent reply "bearophile" <bearophileHUGS lycos.com> writes:
cvk012c:

 I think that similar Perl and Java scripts will outperform D 
 easily.
 Thanks Andrei and simendsjo for a quick response though.
Why don't you write a Java version? It takes only a few minutes, and you will have one more data point.

Python string functions are written in C, compiled very efficiently (the standard Python binaries on Windows are built with the Microsoft compiler, but Intel-compiled builds can also be found), and they have been well optimized over several years of work by people like Hettinger :-) You will see Python2 code like this easily beat D for normal text files:

for line in file("foo.txt"):
    ...

Neither D nor performance in general is magical. Performance comes from a long effort of optimizing algorithms, code, and the compilers/virtual machines that run them.

Bye,
bearophile
Mar 01 2013
parent reply "cvk012c" <cvk012c motorolasolutions.com> writes:
On Friday, 1 March 2013 at 21:52:13 UTC, bearophile wrote:
 cvk012c:

 I think that similar Perl and Java scripts will outperform D 
 easily.
 Thanks Andrei and simendsjo for a quick response though.
Why don't you write a Java version? It takes only few minutes, and you will have one more data point.
You are right. Why not. But instead of using Java's split() method I used a combination of the indexOf() and substring() methods to do the same job. The reason: Java's split method is implemented as a regular expression, which would be unfair to compare to D's splitter. Again, I created a similar D version of the script, compiled it with all suggested options: -release -O -inline -noboundscheck, and this time the D version is more than twice as slow as Java: 8.4 seconds vs 4. D experts, please take a look at my code and tell me what is wrong with it.

Java version:

public class Test {
    public static void main(String[] args) {
        String message = "REGISTER sip:comm.example.com SIP/2.0\r\nContent-Length: 0\r\nContact: <sip:12345 10.1.3.114:59788;transport=tls>;expires=4294967295;events=\"message-summary\";q=0.9\r\nTo: <sip:12345 comm.example.com>\r\nUser-Agent: (\"VENDOR=MyCompany\" \"My User Agent\")\r\nMax-Forwards: 70\r\nCSeq: 1 REGISTER\r\nVia: SIP/2.0/TLS 10.1.3.114:59788;branch=z9hG4bK2910497772630690\r\nCall-ID: 2910497622026445\r\nFrom: <sip:12345 comm.example.com>;tag=2910497618150713\r\n\r\n";

        long t1 = System.currentTimeMillis();
        for (int i=0; i<10000000; i++) {
            int index1 = 0;
            while (true) {
                int index2 = message.indexOf("\r\n", index1);
                if (index2 == -1) break;
                String notused = message.substring(index1, index2);
                index1 = index2+2;
            }
        }
        System.out.println(System.currentTimeMillis()-t1);
    }
}

D version:

import std.stdio, std.string, std.datetime;

void main()
{
    auto message = "REGISTER sip:example.com SIP/2.0\r\nContent-Length: 0\r\nContact: <sip:12345 10.1.3.114:59788;transport=tls>;expires=4294967295;events=\"message-summary\";q=0.9\r\nTo: <sip:12345 comm.example.com>\r\nUser-Agent: (\"VENDOR=MyCompany\" \"My User Agent\")\r\nMax-Forwards: 70\r\nCSeq: 1 REGISTER\r\nVia: SIP/2.0/TLS 10.1.3.114:59788;branch=z9hG4bK2910497772630690\r\nCall-ID: 2910497622026445\r\nFrom: <sip:12345 comm.example.com>;tag=2910497618150713\r\n\r\n";

    auto t1 = Clock.currTime();
    for (int i=0; i<10000000; i++)
    {
        auto substring = message;
        while (true)
        {
            auto index = indexOf(substring, "\r\n");
            if (index == -1) break;
            auto notused = substring[0..index];
            substring = substring[index+2..$];
        }
    }
    writeln(Clock.currTime()-t1);
}
Mar 01 2013
parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Fri, 01 Mar 2013 19:19:35 -0500, cvk012c  
<cvk012c motorolasolutions.com> wrote:

 On Friday, 1 March 2013 at 21:52:13 UTC, bearophile wrote:
 cvk012c:

 I think that similar Perl and Java scripts will outperform D easily.
 Thanks Andrei and simendsjo for a quick response though.
Why don't you write a Java version? It takes only few minutes, and you will have one more data point.
You are right. Why not. But instead of using Java split() method I used combination of indexOf() and substring() methods to do the same job. The reason: Java split method implemented as a regular expression which will be unfair to compare to D splitter. Again, I created a similar D version of the script, compiled it with all suggested options: -release -O -inline -noboundscheck and this time D version is more then twice slower than Java: 8.4 seconds vs 4. D experts, please, take a look at my code and tell me what is wrong with it.
Try my hand-written version (elsewhere in thread). I think it can be done better too (use pointers instead of arrays).

The issue is a combination of the facts that:

1. splitter is designed for any range, not just strings. Not an excuse really, but a string-specific version could be written that does better (clearly).
2. dmd is not always the best optimizer. I saw one other person who said using a different D compiler resulted in a quicker time.
3. Any time you are looping 10 million times, small insignificant differences will be magnified.

Note one other thing -- be VERY wary of test cases that are fully visible at compile time. Very smart compilers are known to reduce your code to something that isn't realistic. I remember seeing a similar comparison not too long ago where one wondered why g++ was so much faster (0.09 seconds or something) than D (4 or more seconds). It turned out the g++ compiler had optimized out his ENTIRE program :). You may want to read in the "message" string at runtime to avoid such issues.

-Steve
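Steve's last point can be sketched as follows; this is an illustrative Python rendition (not from the thread) of feeding the benchmark its input at runtime, so a compiler or VM cannot treat the message as a compile-time constant. The temp-file approach and the scaled-down iteration count are arbitrary choices:

```python
import tempfile, os, time

message_literal = "REGISTER sip:example.com SIP/2.0\r\nCSeq: 1 REGISTER\r\n\r\n"

# Write the message out, then read it back, so its contents are only
# known at runtime rather than baked into the program as a constant.
with tempfile.NamedTemporaryFile("w", delete=False, newline="") as f:
    f.write(message_literal)
    path = f.name
with open(path, newline="") as f:
    message = f.read()
os.unlink(path)

t1 = time.perf_counter()
for i in range(100_000):  # scaled down from the thread's 10 million
    for notused in message.split("\r\n"):
        pass
elapsed = time.perf_counter() - t1
```

The same idea applies to the D and Java versions: take the message from a file or command-line argument instead of a literal.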
Mar 01 2013
next sibling parent reply "cvk012c" <cvk012c motorolasolutions.com> writes:
On Saturday, 2 March 2013 at 00:47:02 UTC, Steven Schveighoffer 
wrote:
 On Fri, 01 Mar 2013 19:19:35 -0500, cvk012c 
 <cvk012c motorolasolutions.com> wrote:

 On Friday, 1 March 2013 at 21:52:13 UTC, bearophile wrote:
 cvk012c:

 I think that similar Perl and Java scripts will outperform D 
 easily.
 Thanks Andrei and simendsjo for a quick response though.
Why don't you write a Java version? It takes only few minutes, and you will have one more data point.
You are right. Why not. But instead of using Java split() method I used combination of indexOf() and substring() methods to do the same job. The reason: Java split method implemented as a regular expression which will be unfair to compare to D splitter. Again, I created a similar D version of the script, compiled it with all suggested options: -release -O -inline -noboundscheck and this time D version is more then twice slower than Java: 8.4 seconds vs 4. D experts, please, take a look at my code and tell me what is wrong with it.
The issue is a combination of the fact that: 1. splitter is designed for any range, not just strings. Not an excuse really, but a string-specific version could be written that does better (clearly).
In my latest version of the D script I didn't use splitter at all; I used the string-specific indexOf function. Still, the result is very bad. For text-based protocols such as SIP, the performance of string-manipulating functions is very important. Unfortunately, it looks like that is not D's strongest point at this time.
Mar 01 2013
next sibling parent reply "anon123" <z z.z> writes:
On Saturday, 2 March 2013 at 01:05:35 UTC, cvk012c wrote:
 On Saturday, 2 March 2013 at 00:47:02 UTC, Steven Schveighoffer 
 wrote:
 On Fri, 01 Mar 2013 19:19:35 -0500, cvk012c 
 <cvk012c motorolasolutions.com> wrote:

 On Friday, 1 March 2013 at 21:52:13 UTC, bearophile wrote:
 cvk012c:

 I think that similar Perl and Java scripts will outperform 
 D easily.
 Thanks Andrei and simendsjo for a quick response though.
Why don't you write a Java version? It takes only few minutes, and you will have one more data point.
You are right. Why not. But instead of using Java split() method I used combination of indexOf() and substring() methods to do the same job. The reason: Java split method implemented as a regular expression which will be unfair to compare to D splitter. Again, I created a similar D version of the script, compiled it with all suggested options: -release -O -inline -noboundscheck and this time D version is more then twice slower than Java: 8.4 seconds vs 4. D experts, please, take a look at my code and tell me what is wrong with it.
The issue is a combination of the fact that: 1. splitter is designed for any range, not just strings. Not an excuse really, but a string-specific version could be written that does better (clearly).
In my latest version of D script I didn't use splitter at all. I used string specific indexOf function. Still result is very bad. For text based protocols, such as SIP, performance of string manipulating functions is very important. Unfortunately, looks like it is not D strongest point at this time.
My version is functionally equivalent, and measures 55 - 94 ms (depending on compiler; LDC is best). Your version performed in about 7 s on my machine. A similar optimization taking advantage of immutability can be done on your Python translation, resulting in measurements of <1 ms.

import std.stdio, std.string, std.datetime;

void main()
{
    auto message = "REGISTER sip:example.com SIP/2.0\r\nContent-Length: 0\r\nContact: <sip:12345 10.1.3.114:59788;transport=tls>;expires=4294967295;events=\"message-summary\";q=0.9\r\nTo: <sip:12345 comm.example.com>\r\nUser-Agent: (\"VENDOR=MyCompany\" \"My User Agent\")\r\nMax-Forwards: 70\r\nCSeq: 1 REGISTER\r\nVia: SIP/2.0/TLS 10.1.3.114:59788;branch=z9hG4bK2910497772630690\r\nCall-ID: 2910497622026445\r\nFrom: <sip:12345 comm.example.com>;tag=2910497618150713\r\n\r\n";

    auto t1 = Clock.currTime();
    for (int i=0; i<10_000_000; i++)
    {
        while (true)
        {
            auto index = indexOf(message, "\r\n");
            if (index == -1) break;
            auto notused = message[0..index];
            message = message[index+2..$];
        }
    }
    writeln(Clock.currTime()-t1);
}
Mar 01 2013
next sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 3/1/2013 5:19 PM, anon123 wrote:
 My version is functionally equivalent,
I don't think so. Take a look at what happens to the message variable. It is never restored to its original string on the next iteration of the loop.
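Walter's observation can be reproduced in miniature; the following is an illustrative Python sketch (not from the thread) of the same bug: the message is consumed on the first pass of the outer loop, so the remaining passes split nothing and the benchmark measures almost no work.

```python
# The "equivalent" benchmark reassigns `message` to its own tail, so after
# the first outer iteration there is nothing left to split.
message = "a\r\nb\r\nc\r\n"
splits = 0
for i in range(3):  # stand-in for the 10-million-iteration loop
    while True:
        index = message.find("\r\n")
        if index == -1:
            break
        message = message[index + 2:]  # consumes the shared string
        splits += 1
# Only the first pass does any splitting: splits == 3, not 3 passes * 3.
```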
Mar 01 2013
prev sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Fri, 01 Mar 2013 20:19:19 -0500, anon123 <z z.z> wrote:

 My version is functionally equivalent, and measures 55 - 94 ms  
 (depending on compiler; LDC is best). Your version performed in about 7s  
 on my machine.
Try printing out message each time through the 10,000,000 iterations.

-Steve
Mar 01 2013
prev sibling next sibling parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Fri, 01 Mar 2013 20:05:34 -0500, cvk012c  
<cvk012c motorolasolutions.com> wrote:

 In my latest version of D script I didn't use splitter at all. I used  
 string specific indexOf function. Still result is very bad. For text  
 based protocols, such as SIP, performance of string manipulating  
 functions is very important. Unfortunately, looks like it is not D  
 strongest point at this time.
indexOf uses the same mechanism as splitter to find the separators. If it doesn't improve anything, I'd say that is where the problem lies (std.algorithm.find). -Steve
Mar 01 2013
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/1/13 8:29 PM, Steven Schveighoffer wrote:
 On Fri, 01 Mar 2013 20:05:34 -0500, cvk012c
 <cvk012c motorolasolutions.com> wrote:

 In my latest version of D script I didn't use splitter at all. I used
 string specific indexOf function. Still result is very bad. For text
 based protocols, such as SIP, performance of string manipulating
 functions is very important. Unfortunately, looks like it is not D
 strongest point at this time.
indexOf uses the same mechanism as splitter to find the separators. If it doesn't improve anything, I'd say that is where the problem lies (std.algorithm.find).
find and indexOf on arrays are on par with handwritten loops. Andrei
Mar 02 2013
parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Sat, 02 Mar 2013 12:17:54 -0500, Andrei Alexandrescu  
<SeeWebsiteForEmail erdani.org> wrote:

 On 3/1/13 8:29 PM, Steven Schveighoffer wrote:
 On Fri, 01 Mar 2013 20:05:34 -0500, cvk012c
 <cvk012c motorolasolutions.com> wrote:

 In my latest version of D script I didn't use splitter at all. I used
 string specific indexOf function. Still result is very bad. For text
 based protocols, such as SIP, performance of string manipulating
 functions is very important. Unfortunately, looks like it is not D
 strongest point at this time.
indexOf uses the same mechanism as splitter to find the separators. If it doesn't improve anything, I'd say that is where the problem lies (std.algorithm.find).
find and indexOf on arrays are on par with handwritten loops.
This is not a personal attack, please don't take it that way.

My anecdotal tests with hand-writing a custom splitter range to handle the OP's program gave me a 40% savings. Whether it's find or not, I'm not sure, but there definitely is room for improvement.

Using indexOf instead of splitter seems to result in the same time usage, so something that is common between the two would be a logical choice to target.

It would be interesting to examine the assembly differences, but I haven't got the time right now.

-Steve
Mar 02 2013
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/2/13 2:32 PM, Steven Schveighoffer wrote:
 This is not a personal attack, please don't take it that way.
Not at all. I'm just confused about this exchange. You mention some correct facts ("MySplitter finished in 6 seconds instead of 10"), then some mistaken suppositions ("a string-specific version could be written"), and derive conclusions from both ("Maybe there is a bad constraint somewhere"). Hard to deal with the mix.
 My
 anecdotal tests with hand writing a custom splitter range to handle the
 OP's program gave me a 40% savings. Whether it's find or not, I'm not
 sure, but there definitely is room for improvement.
I think I understand where the confusion comes from. If you're referring to MySplitter, that's not comparable. It uses this at the core:

for(; i + 1 < source.length; i++)
{
    if(source[i] == '\r' && source[i + 1] == '\n')
    {
        found = true;
        break;
    }
    ...
}

This is not close to the code that would work with arrays. We could specialize things for small arrays, but that hasn't been done yet. My point is it's not comparable with the classic brute-force subsequence search.

What is comparable is shown below. indexOf2 defines the straight array-in-array brute-force search. With dmd on my machine it runs in some 7 seconds. So does indexOf (this was my point). Then indexOf3 uses a simple trick that the Java implementation does: cache the first character. That reduces the running time to 5 seconds. Then indexOf4 optimizes the loop searching for that first character, bringing run time to about 4 seconds. On that machine the Java implementation runs in 3.5 seconds.

Andrei

import std.stdio, std.string, std.datetime;

long indexOf2(string haystack, string needle)
{
    if (!needle.length) return 0;
    if (needle.length > haystack.length) return -1;
    auto limit = haystack.length - needle.length;
bigloop:
    for (size_t i; i <= limit; ++i)
    {
        foreach (q; 0 .. needle.length)
        {
            if (haystack[i + q] != needle[q]) continue bigloop;
        }
        return i;
    }
    return -1;
}

long indexOf3(string haystack, string needle)
{
    if (!needle.length) return 0;
    if (needle.length > haystack.length) return -1;
    immutable c = needle[0];
    auto limit = haystack.length - needle.length;
    assert(limit < uint.max, "not handled");
bigloop:
    for (uint i; i <= limit; ++i)
    {
        if (haystack.ptr[i] != c) continue;
        foreach (q; 1 .. needle.length)
        {
            if (haystack[i + q] != needle[q]) continue bigloop;
        }
        return i;
    }
    return -1;
}

long indexOf4(string haystack, string needle)
{
    if (!needle.length) return 0;
    if (needle.length > haystack.length) return -1;
    immutable c = needle[0];
    auto limit = haystack.length - needle.length;
    size_t i = 0;
bigloop:
    for (;;)
    {
        while (i <= limit && haystack[i++] != c) {}
        if (i-- > limit) break;
        foreach (q; 1 .. needle.length)
        {
            if (haystack[i + q] != needle[q]) continue bigloop;
        }
        return i;
    }
    return -1;
}

unittest
{
    assert(indexOf2("hello, world", ",") == 5);
    assert(indexOf2("hello, world", "a") == -1);
    assert(indexOf3("hello, world", ",") == 5);
    assert(indexOf3("hello, world", "a") == -1);
    assert(indexOf4("hello, world", ",") == 5);
    assert(indexOf4("hello, world", "a") == -1);
}

void main(string[] args)
{
    auto message = "REGISTER sip:example.com SIP/2.0\r\nContent-Length: 0\r\nContact: <sip:12345 10.1.3.114:59788;transport=tls>;expires=4294967295;events=\"message-summary\";q=0.9\r\nTo: <sip:12345 comm.example.com>\r\nUser-Agent: (\"VENDOR=MyCompany\" \"My User Agent\")\r\nMax-Forwards: 70\r\nCSeq: 1 REGISTER\r\nVia: SIP/2.0/TLS 10.1.3.114:59788;branch=z9hG4bK2910497772630690\r\nCall-ID: 2910497622026445\r\nFrom: <sip:12345 comm.example.com>;tag=2910497618150713\r\n\r\n";

    auto t1 = Clock.currTime();
    for (int i=0; i<10000000; i++)
    {
        long index1 = 0;
        while (true)
        {
            auto index2 = indexOf(message[index1 .. $], "\r\n");
            if (index2 == -1) break;
            auto notused = message[index1 .. index1 + index2];
            if (args.length == 42) writeln(notused);
            index1 += index2 + 2;
        }
    }
    writeln(Clock.currTime()-t1);
}
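The first-character-caching trick that indexOf3 uses can be sketched language-independently; the following Python rendition (illustrative, not from the thread) compares the brute-force search with the cached-first-character variant and checks both against the built-in str.find:

```python
def index_of_brute(haystack, needle):
    # Straight brute-force subsequence search (indexOf2's approach):
    # compare the full needle at every position.
    if not needle:
        return 0
    for i in range(len(haystack) - len(needle) + 1):
        if haystack[i:i + len(needle)] == needle:
            return i
    return -1

def index_of_cached(haystack, needle):
    # indexOf3's idea: scan cheaply for needle[0] first, and only do the
    # full comparison at positions where the first character matches.
    if not needle:
        return 0
    c = needle[0]
    limit = len(haystack) - len(needle)
    i = 0
    while i <= limit:
        if haystack[i] == c and haystack[i:i + len(needle)] == needle:
            return i
        i += 1
    return -1

msg = "CSeq: 1 REGISTER\r\nVia: SIP/2.0/TLS\r\n"
assert index_of_brute(msg, "\r\n") == index_of_cached(msg, "\r\n") == msg.find("\r\n")
```

The win comes from making the common case (first character does not match) as cheap as possible; indexOf4 then tightens that inner scan further.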
Mar 02 2013
parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Sat, 02 Mar 2013 17:29:38 -0500, Andrei Alexandrescu  
<SeeWebsiteForEmail erdani.org> wrote:

 On 3/2/13 2:32 PM, Steven Schveighoffer wrote:
 This is not a personal attack, please don't take it that way.
Not at all. I'm just confused about this exchange. You mention some correct facts ("MySplitter finished in 6 seconds instead of 10"), then some mistaken suppositions ("a string-specific version could be written"), and derive conclusions from both ("Maybe there is a bad constraint somewhere"). Hard to deal with the mix.
I didn't realize the situation. I assumed that there wasn't a version of splitter for strings that took array-specific shortcuts. My corrected statement should read "the existing string-specific version could be improved."

My conclusion of "Maybe there is a bad constraint somewhere" was not derived from that; it was based on your statement elsewhere that "I wrote a custom splitter specialized for arrays. It's as fast." Given that my tests have shown I can quite easily write a faster one than the implementation selected by Phobos, and that you said you wrote one that's "as fast," I took that to mean you had written one that was more optimized than the chosen splitter version (and logically assumed you had included this version in Phobos with the intent that it would be chosen when compiling with strings), leading me to suggest that, possibly due to some bug, the "fast" implementation wasn't being chosen.

I didn't realize that "as fast" didn't mean "as fast as yours". I actually don't know what you meant by that now.
 My
 anecdotal tests with hand writing a custom splitter range to handle the
 OP's program gave me a 40% savings. Whether it's find or not, I'm not
 sure, but there definitely is room for improvement.
I think I understand where the confusion comes from. If you're referring to MySplitter, that's not comparable. It uses this at the core:

for(; i + 1 < source.length; i++)
{
    if(source[i] == '\r' && source[i + 1] == '\n')
    {
        found = true;
        break;
    }
    ...
}

This is not close to the code that would work with arrays. We could specialize things for small arrays, but that hasn't been done yet. My point is it's not comparable with the classic brute-force subsequence search.
Very good point, here is a new version that takes any string as a separator:

struct MySplitter
{
    private string s;
    private string separator;
    private string source;

    this(string src, string sep)
    {
        source = src;
        separator = sep;
        popFront();
    }

    @property string front()
    {
        return s;
    }

    @property bool empty()
    {
        return s.ptr == null;
    }

    void popFront()
    {
        if(!source.length)
        {
            s = source;
            source = null;
        }
        else
        {
            size_t i = 0;
            bool found = false;
            for(; i + separator.length <= source.length; i++)
            {
                if(source[i] == separator[0])
                {
                    found = true;
                    for(size_t j = i+1, k=1; k < separator.length; ++j, ++k)
                        if(source[j] != separator[k])
                        {
                            found = false;
                            break;
                        }
                    if(found)
                        break;
                }
            }
            s = source[0..i];
            if(found)
                source = source[i + separator.length..$];
            else
                source = source[$..$];
        }
    }
}

Takes 7 seconds on my machine instead of 6, but not 10 like std.algorithm.splitter. I don't even like the loop that well; it looks crude, and I can probably optimize it further.

And it does not use any of your specified tricks.

Is this sufficiently comparable, or am I missing something else?

-Steve
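For readers more at home in Python, the scan-confirm-slice logic of MySplitter can be rendered as a generator. This is an illustrative sketch only (the D version above is what was actually benchmarked), checked against the built-in str.split:

```python
def my_splitter(source, separator):
    # Mirror MySplitter's popFront: scan for the separator's first
    # character, confirm the rest, then slice off the field.
    while True:
        found = False
        i = 0
        while i + len(separator) <= len(source):
            if source[i] == separator[0] and source[i:i + len(separator)] == separator:
                found = True
                break
            i += 1
        if found:
            yield source[:i]
            source = source[i + len(separator):]
        else:
            yield source  # last field (possibly empty), then stop
            return

msg = "a\r\nbb\r\n\r\nccc"
assert list(my_splitter(msg, "\r\n")) == msg.split("\r\n")
```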
Mar 04 2013
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/4/13 10:39 AM, Steven Schveighoffer wrote:
 struct MySplitter
 {
 private string s;
 private string separator;
 private string source;
 this(string src, string sep)
 {
 source = src;
 separator = sep;
 popFront();
 }

  @property string front()
 {
 return s;
 }

  @property bool empty()
 {
 return s.ptr == null;
 }

 void popFront()
 {
 if(!source.length)
 {
 s = source;
 source = null;
 }
 else
 {
 size_t i = 0;
 bool found = false;
 for(; i + separator.length <= source.length; i++)
 {
 if(source[i] == separator[0])
 {
 found = true;
 for(size_t j = i+1, k=1; k < separator.length; ++j, ++k)
 if(source[j] != separator[k])
 {
 found = false;
 break;
 }
 if(found)
 break;
 }
 }
 s = source[0..i];
 if(found)
 source = source[i + separator.length..$];
 else
 source = source[$..$];
 }
 }
 }

 Takes 7 seconds on my machine instead of 6, but not 10 like
 std.algorithm.splitter. I don't even like the loop that well, it looks
 crude, I can probably optimize it further.

 And it does not use any of your specified tricks.

 Is this sufficiently comparable, or am I missing something else?

 -Steve
That's comparable, please file an enh request.

Thanks,

Andrei
Mar 04 2013
parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Mon, 04 Mar 2013 12:35:23 -0500, Andrei Alexandrescu  
<SeeWebsiteForEmail erdani.org> wrote:

 That's comparable, please file an enh request.
http://d.puremagic.com/issues/show_bug.cgi?id=9646
Mar 04 2013
parent reply qznc <qznc web.de> writes:
On Monday, 4 March 2013 at 19:11:17 UTC, Steven Schveighoffer 
wrote:
 On Mon, 04 Mar 2013 12:35:23 -0500, Andrei Alexandrescu 
 <SeeWebsiteForEmail erdani.org> wrote:

 That's comparable, please file an enh request.
http://d.puremagic.com/issues/show_bug.cgi?id=9646
Below is an implementation which matches MySplitter with dmd, but not with ldc. More precisely:

dmd -O -release -inline -noboundscheck:
  std.algorithm.splitter took 5 secs, 170 ms, and 656 μs
  MySplitter took 4 secs, 835 ms, 748 μs, and 1 hnsec
  my_splitter took 4 secs, 862 ms, 914 μs, and 4 hnsecs

ldc2 -O -release:
  std.algorithm.splitter took 3 secs, 540 ms, and 84 μs
  MySplitter took 2 secs, 288 ms, and 535 μs
  my_splitter took 3 secs, 463 ms, and 897 μs

CPU: Intel i7 M 620 2.67GHz
dmd: v2.070.2
ldc: 0.17.1 (based on DMD v2.068.2 and LLVM 3.7.1)

auto my_splitter(alias pred = "a == b", Range, Separator)(Range r, Separator s)
if (is(typeof(binaryFun!pred(r.front, s.front)) : bool)
    && (hasSlicing!Range || isNarrowString!Range)
    && isForwardRange!Separator
    && (hasLength!Separator || isNarrowString!Separator))
{
    import std.algorithm.searching : find;
    import std.conv : unsigned;

    static struct Result
    {
    private:
        Range _input;
        Separator _separator;
        size_t _next_index;
        bool _empty;

        @property auto separatorLength() { return _separator.length; }

        void findIndex()
        {
            auto found = find!pred(_input, _separator);
            _next_index = _input.length - found.length;
        }

    public:
        this(Range input, Separator separator)
        {
            _input = input;
            _separator = separator;
            _empty = false;
            findIndex();
        }

        @property Range front()
        {
            assert(!empty);
            return _input[0 .. _next_index];
        }

        static if (isInfinite!Range)
        {
            enum bool empty = false;  // Propagate infiniteness
        }
        else
        {
            @property bool empty() { return _empty; }
        }

        void popFront()
        {
            assert(!empty);
            if (_input.empty)
            {
                _empty = true;
                assert(_next_index == 0);
            }
            else if (_input.length == _next_index)
            {
                _input = _input[$ .. $];
                _empty = true;
            }
            else
            {
                _input = _input[_next_index + separatorLength .. $];
                findIndex();
            }
        }

        static if (isForwardRange!Range)
        {
            @property typeof(this) save()
            {
                auto ret = this;
                ret._input = _input.save;
                return ret;
            }
        }
    }
    return Result(r, s);
}
May 22 2016
next sibling parent reply Seb <seb wilzba.ch> writes:
On Sunday, 22 May 2016 at 18:56:30 UTC, qznc wrote:
 On Monday, 4 March 2013 at 19:11:17 UTC, Steven Schveighoffer 
 wrote:
 [...]
Below is an implementation, which matches MySplitter with dmd, but not with ldc. More precisely: [...]
have you thought about opening a PR to improve `splitter`?
May 23 2016
parent reply qznc <qznc web.de> writes:
On Monday, 23 May 2016 at 11:52:35 UTC, Seb wrote:
 On Sunday, 22 May 2016 at 18:56:30 UTC, qznc wrote:
 On Monday, 4 March 2013 at 19:11:17 UTC, Steven Schveighoffer 
 wrote:
 [...]
Below is an implementation, which matches MySplitter with dmd, but not with ldc. More precisely: [...]
have you thought about opening a PR to improve `splitter`?
Yes, but I'm not sure about the goals. I also want to dig a little deeper. Being "sometimes faster" without some understanding of why and when feels unsatisfying. Additionally, there is this weird special case for a bidirectional range, which just adds unnecessary overhead. Is "remove dead code" a good enough reason in itself for a PR?
May 23 2016
next sibling parent Seb <seb wilzba.ch> writes:
On Monday, 23 May 2016 at 12:01:52 UTC, qznc wrote:
 On Monday, 23 May 2016 at 11:52:35 UTC, Seb wrote:
 On Sunday, 22 May 2016 at 18:56:30 UTC, qznc wrote:
 On Monday, 4 March 2013 at 19:11:17 UTC, Steven Schveighoffer 
 wrote:
 [...]
Below is an implementation, which matches MySplitter with dmd, but not with ldc. More precisely: [...]
have you thought about opening a PR to improve `splitter`?
Yes, but I'm not sure about the goals. I also want to dig a little deeper. Being "sometimes faster" without some understanding of why and when feels unsatisfying.
Good luck. If you need help, it's probably the best to open the PR as people can directly see your code and the Phobos repo is usually pretty active.
 Additionally, there is this weird special case for a 
 bidirectional range, which just adds unnecessary overhead.
 Is  "remove dead code" a good enough reason in itself for a PR?
Absolutely ;-)
May 23 2016
prev sibling parent reply qznc <qznc web.de> writes:
On Monday, 23 May 2016 at 12:01:52 UTC, qznc wrote:
 Additionally, there is this weird special case for a 
 bidirectional range, which just adds unnecessary overhead. Is 
 "remove dead code" a good enough reason in itself for a PR?
Forget the "dead code" comment; it is actually a missing feature. In the case where the separator is a range and the input range is bidirectional, the splitter result should be bidirectional as well. It is not, because the implementation of back() and popBack() is missing, although some bookkeeping code for it exists.
May 23 2016
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 05/23/2016 08:17 AM, qznc wrote:
 On Monday, 23 May 2016 at 12:01:52 UTC, qznc wrote:
 Additionally, there is this weird special case for a bidirectional
 range, which just adds unnecessary overhead. Is "remove dead code" a
 good enough reason in itself for a PR?
Forget the "dead code" comment; it is actually a missing feature. In the case where the separator is a range and the input range is bidirectional, the splitter result should be bidirectional as well. It is not, because the implementation of back() and popBack() is missing, although some bookkeeping code for it exists.
Most uses are forward, so perhaps it's best to eliminate the bidirectional bookkeeping if it adds overhead, and confine those to a separate function e.g. splitterBidirectional. There is precedent for that in Phobos. -- Andrei
May 23 2016
parent reply Jack Stouffer <jack jackstouffer.com> writes:
On Monday, 23 May 2016 at 13:20:55 UTC, Andrei Alexandrescu wrote:
 Most uses are forward, so perhaps it's best to eliminate the 
 bidirectional bookkeeping if it adds overhead, and confine 
 those to a separate function e.g. splitterBidirectional. There 
 is precedent for that in Phobos. -- Andrei
I don't see the value add that things like filterBidirectional give when the back and popBack functions can be static if-ed in or out of the normal filter. filterBidirectional always felt like a relic of a time when static if's didn't exist, even though I know that's not the case.
May 23 2016
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 5/23/16 9:44 AM, Jack Stouffer wrote:
 On Monday, 23 May 2016 at 13:20:55 UTC, Andrei Alexandrescu wrote:
 Most uses are forward, so perhaps it's best to eliminate the
 bidirectional bookkeeping if it adds overhead, and confine those to a
 separate function e.g. splitterBidirectional. There is precedent for
 that in Phobos. -- Andrei
I don't see the value add that things like filterBidirectional give when the back and popBack functions can be static if-ed in or out of the normal filter.
What would be the condition tested in the static if? Recall that sometimes you do want a forward range even when you could define a bidirectional range. -- Andrei
May 23 2016
parent reply Jack Stouffer <jack jackstouffer.com> writes:
On Monday, 23 May 2016 at 14:17:59 UTC, Andrei Alexandrescu wrote:
 What would be the condition tested in the static if?
I would assume `static if (isBidirectionalRange!Range)`
 Recall that sometimes you do want a forward range even when you 
 could define a bidirectional range. -- Andrei
I honestly can't think of a time when this would be the case. What are you losing by having back and popBack defined?
May 23 2016
next sibling parent reply qznc <qznc web.de> writes:
On Monday, 23 May 2016 at 14:25:52 UTC, Jack Stouffer wrote:
 On Monday, 23 May 2016 at 14:17:59 UTC, Andrei Alexandrescu 
 wrote:
 What would be the condition tested in the static if?
I would assume `static if (isBidirectionalRange!Range)`
 Recall that sometimes you do want a forward range even when 
 you could define a bidirectional range. -- Andrei
I honestly can't think of a time when this would be the case. What are you losing by having back and popBack defined?
Here is a commit, which removes the dead code:

https://github.com/qznc/phobos/commit/0cb1e70f4504a52b81c30342be7ec26d597fac69

It would be strictly better to implement the back and popBack, no? My single test case says removing the code does not result in faster code. Maybe dmd optimizes the bookkeeping away.

I see three options:

1. Remove dead bookkeeping code
2. Implement back() and popBack()
3. Use alternative splitter implementation (and implement back() and popBack())

The third one would be the best, if it is really faster.
May 23 2016
next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 05/23/2016 10:47 AM, qznc wrote:
 It would be strictly better to implement the back and popBack, no?
If it adds no overhead, you're right. -- Andrei
May 23 2016
prev sibling parent reply Jack Stouffer <jack jackstouffer.com> writes:
On Monday, 23 May 2016 at 14:47:22 UTC, qznc wrote:
 I see three options:

 1. Remove dead bookkeeping code
 2. Implement back() and popBack()
 3. Use alternative splitter implementation (and implement 
 back() and popBack())

 The third one would be the best, if it is really faster.
If the performance is really a problem (I think it's a micro optimization at best), then the best option is not to violate DRY. Have a template parameter, std.typecons.Flag, to turn off the back and popBack. Don't have two functions that are 95% the same code like filter and filterBidirectional.
May 23 2016
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 05/23/2016 03:11 PM, Jack Stouffer wrote:
 (I think it's a micro optimization at best)
splitter in the stdlib is likely to be very frequently useful, any bit of speedup we put in it is likely to pay off immensely. -- Andrei
May 23 2016
prev sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 05/23/2016 10:25 AM, Jack Stouffer wrote:
 On Monday, 23 May 2016 at 14:17:59 UTC, Andrei Alexandrescu wrote:
 What would be the condition tested in the static if?
I would assume `static if (isBidirectionalRange!Range)`
 Recall that sometimes you do want a forward range even when you could
 define a bidirectional range. -- Andrei
I honestly can't think of a time when this would be the case. What are you losing by having back and popBack defined?
You might be doing extra work even if the user never calls back and popBack. -- Andrei
May 23 2016
prev sibling parent reply qznc <qznc web.de> writes:
On Sunday, 22 May 2016 at 18:56:30 UTC, qznc wrote:
 On Monday, 4 March 2013 at 19:11:17 UTC, Steven Schveighoffer 
 wrote:
 On Mon, 04 Mar 2013 12:35:23 -0500, Andrei Alexandrescu 
 <SeeWebsiteForEmail erdani.org> wrote:

 That's comparable, please file an enh request.
http://d.puremagic.com/issues/show_bug.cgi?id=9646
I think the actual slow thing is std.find, which splitter uses.

Benchmark: https://dpaste.dzfl.pl/1e31b6f95659

The output is:

std find    took 162167000
manual find took  99852000

Actually, std find should be faster, since it could use the Boyer Moore algorithm instead of naive string matching.
May 23 2016
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 05/23/2016 03:11 PM, qznc wrote:
 Actually, std find should be faster, since it could use the Boyer Moore
 algorithm instead of naive string matching.
Conventional wisdom has it that find() is brute force and that's that, but probably it's time to destroy. Selectively using advanced searching algorithms for the appropriate inputs is very DbI-ish. There are a few nice precedents of blend algorithms, see e.g. http://effbot.org/zone/stringlib.htm. Writing a generic subsequence search blend algorithm, one that chooses the right algorithm based on a combination of static and dynamic decisions, is quite a fun and challenging project. Who wanna? Andrei
May 23 2016
next sibling parent reply qznc <qznc web.de> writes:
On Monday, 23 May 2016 at 22:19:18 UTC, Andrei Alexandrescu wrote:
 On 05/23/2016 03:11 PM, qznc wrote:
 Actually, std find should be faster, since it could use the 
 Boyer Moore
 algorithm instead of naive string matching.
Conventional wisdom has it that find() is brute force and that's that, but probably it's time to destroy. Selectively using advanced searching algorithms for the appropriate inputs is very DbI-ish. There are a few nice precedents of blend algorithms, see e.g. http://effbot.org/zone/stringlib.htm. Writing a generic subsequence search blend algorithm, one that chooses the right algorithm based on a combination of static and dynamic decisions, is quite a fun and challenging project. Who wanna?
I observed that Boyer-Moore from std is still slower. My guess is that it is due to the relatively short strings in my benchmark and the need to allocate for the tables. Knuth-Morris-Pratt might be worthwhile, because it only requires a smaller table, which could be allocated on the stack.

The overall sentiment seems to be that KMP vs BM depends on the input. This means an uninformed find function could only use heuristics. Anyway, Phobos could use a KnuthMorrisPrattFinder implementation next to the BoyerMooreFinder. I filed an issue [0].

Apart from advanced algorithms, find should not be slower than a naive nested-for-loop implementation.

[0] https://issues.dlang.org/show_bug.cgi?id=16066
May 24 2016
next sibling parent reply Chris <wendlec tcd.ie> writes:
On Tuesday, 24 May 2016 at 07:54:38 UTC, qznc wrote:
 On Monday, 23 May 2016 at 22:19:18 UTC, Andrei Alexandrescu 
 wrote:
 On 05/23/2016 03:11 PM, qznc wrote:
 Actually, std find should be faster, since it could use the 
 Boyer Moore
 algorithm instead of naive string matching.
Conventional wisdom has it that find() is brute force and that's that, but probably it's time to destroy. Selectively using advanced searching algorithms for the appropriate inputs is very DbI-ish. There are a few nice precedents of blend algorithms, see e.g. http://effbot.org/zone/stringlib.htm. Writing a generic subsequence search blend algorithm, one that chooses the right algorithm based on a combination of static and dynamic decisions, is quite a fun and challenging project. Who wanna?
I observed that Boyer-Moore from std is still slower. My guess is that it is due to the relatively short strings in my benchmark and the need to allocate for the tables. Knuth-Morris-Pratt might be worthwhile, because it only requires a smaller table, which could be allocated on the stack. The overall sentiment seems to be that KMP vs BM depends on the input. This means an uninformed find function could only use heuristics. Anyway, Phobos could use a KnuthMorrisPrattFinder implementation next to the BoyerMooreFinder. I filed an issue [0]. Apart from advanced algorithms, find should not be slower than a naive nested-for-loop implementation. [0] https://issues.dlang.org/show_bug.cgi?id=16066
From Phobos [1]:

/* Note: This step could be made a lot faster.
 * A simple implementation is shown here. */
this.skip = new size_t[needle.length];
foreach (a; 0 .. needle.length)
{
    size_t value = 0;
    while (value < needle.length && !needlematch(needle, a, value))
    {
        ++value;
    }
    this.skip[needle.length - a - 1] = value;
}

[1] https://github.com/dlang/phobos/blob/master/std/algorithm/searching.d#L335
May 24 2016
parent reply qznc <qznc web.de> writes:
On Tuesday, 24 May 2016 at 09:34:30 UTC, Chris wrote:
 On Tuesday, 24 May 2016 at 07:54:38 UTC, qznc wrote:
 Apart from advanced algorithms, find should not be slower than 
 a naive nested-for-loop implementation.

 [0] https://issues.dlang.org/show_bug.cgi?id=16066
From Phobos [1]:

/* Note: This step could be made a lot faster.
 * A simple implementation is shown here. */
this.skip = new size_t[needle.length];
foreach (a; 0 .. needle.length)
{
    size_t value = 0;
    while (value < needle.length && !needlematch(needle, a, value))
    {
        ++value;
    }
    this.skip[needle.length - a - 1] = value;
}

[1] https://github.com/dlang/phobos/blob/master/std/algorithm/searching.d#L335
Unfortunately, it is slower. My current measurements [0]:

std find    took 753837231
manual find took 129561980

std find    took 721676721
manual find took 216418870

The nested-for-loop implementation is roughly 4 times faster!

Caveat: I'm not completely sure my benchmark is fair yet.

[0] https://github.com/qznc/find-is-too-slow
May 24 2016
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 05/24/2016 06:44 AM, qznc wrote:
 On Tuesday, 24 May 2016 at 09:34:30 UTC, Chris wrote:
 On Tuesday, 24 May 2016 at 07:54:38 UTC, qznc wrote:
 Apart from advanced algorithms, find should not be slower than a
 naive nested-for-loop implementation.

 [0] https://issues.dlang.org/show_bug.cgi?id=16066
From Phobos [1]:

/* Note: This step could be made a lot faster.
 * A simple implementation is shown here. */
this.skip = new size_t[needle.length];
foreach (a; 0 .. needle.length)
{
    size_t value = 0;
    while (value < needle.length && !needlematch(needle, a, value))
    {
        ++value;
    }
    this.skip[needle.length - a - 1] = value;
}

[1] https://github.com/dlang/phobos/blob/master/std/algorithm/searching.d#L335
Unfortunately, it is slower. My current measurements [0]:

std find    took 753837231
manual find took 129561980

std find    took 721676721
manual find took 216418870

The nested-for-loop implementation is roughly 4 times faster!

Caveat: I'm not completely sure my benchmark is fair yet.

[0] https://github.com/qznc/find-is-too-slow
One somewhat unrelated thing that can be done with Boyer-Moore: use a constant-length skip array instead of dynamic allocation. Upon saturation, continue with brute force. -- Andrei
May 24 2016
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 05/24/2016 08:38 AM, Andrei Alexandrescu wrote:
 One somewhat unrelated thing that can be done with Boyer-Moore: use a
 constant-length skip array instead of dynamic allocation. Upon
 saturation, continue with brute force. -- Andrei
Another idea: in every searched string there is one specific index that offers the most information to a failed search, allowing for a large skip. Roughly speaking it's the largest index around which there are many characters not seen elsewhere in the searched string. (In a string with all distinct characters, that would be the last character.) If that offset is cheap to compute and offers a good skip, a search starting from that index would be very fast. -- Andrei
May 24 2016
prev sibling parent reply qznc <qznc web.de> writes:
On Tuesday, 24 May 2016 at 10:44:12 UTC, qznc wrote:
 Unfortunately, it is slower. My current measurements [0]:


 std find    took    753837231
 manual find took    129561980

 std find    took    721676721
 manual find took    216418870

 The nested-for-loop implementation is roughly 4 times faster!

 Caveat: I'm not completely sure my benchmark is fair yet.

 [0] https://github.com/qznc/find-is-too-slow
Plot twist: manual find is not always faster. My benchmark now generates a random haystack and needle. Sometimes Phobos clearly wins. Example:

$ ./benchmark.ldc -n39 -l10000 -i10000
Found at 10000
std find    took 289529102
manual find took 368958679

This is compiled with LDC; the needle is 39 chars, the haystack 10000 chars, with 10000 iterations. "Found at 10000" means the needle was not found. Very often manual find wins, though.

Takeaway: manual find is too naive. However, something slows down std find in general. More research needed.

PS: Any recommendations for a Linux disassembler? Objdump is ok, but some arrows for jumps would be nice.
May 24 2016
next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 05/24/2016 03:47 PM, qznc wrote:
 PS: Any recommendations for a Linux disassembler? Objdump is ok, but
 some arrows for jumps would be nice.
Try http://asm.dlang.org for short tests. -- Andrei
May 24 2016
prev sibling parent reply qznc <qznc web.de> writes:
On Tuesday, 24 May 2016 at 19:47:52 UTC, qznc wrote:
 PS: Any recommendations for a Linux disassembler? Objdump is 
 ok, but some arrows for jumps would be nice.
I discovered radare2. While it is actually a debugger (like gdb), it includes a nice disassembler. Example view, which shows the loop calling std.algorithm.find repeatedly:

       |   0x00402e08      nop [rax+rax+0x0]
      .--> 0x00402e10      mov rdx, [rbx]
      ||   0x00402e13      mov rcx, [rbx+0x8]
      ||   0x00402e17      mov rdi, r15
      ||   0x00402e1a      mov rsi, r12
      ||   0x00402e1d      call 0x409160  ; sym._D3std9algorithm9searching34__T4fin
      ||   0x00402e22      mov rcx, [rbx]
      ||   0x00402e25      sub rcx, rax
      ||   0x00402e28      mov [rbx+0x20], rcx
      ||   0x00402e2c      dec ebp
      `==< 0x00402e2e      jnz 0x402e10
       |   0x00402e30      lea r15, [rsp]

Note the ASCII arrow on the left! Objdump should do that. :)
May 24 2016
parent reply Chris <wendlec tcd.ie> writes:
On Tuesday, 24 May 2016 at 22:01:17 UTC, qznc wrote:
 On Tuesday, 24 May 2016 at 19:47:52 UTC, qznc wrote:

 I discovered radare2. While it is actually a debugger (like 
 gdb), it includes a nice disassembler. Example view, which 
 shows the loop calling std.algorithm.find repeatedly:

        |   0x00402e08,   0 |           nop [rax+rax+0x0]
       .--> 0x00402e10,   0 |           mov rdx, [rbx]
       ||   0x00402e13    0 |           mov rcx, [rbx+0x8]
       ||   0x00402e17    0 |           mov rdi, r15
       ||   0x00402e1a    0 |           mov rsi, r12
       ||   0x00402e1d    0 |           call 0x409160  ; 4 = 
 sym._D3std9algorithm9searching34__T4fin
       ||   0x00402e22    0 |           mov rcx, [rbx]
       ||   0x00402e25    0 |           sub rcx, rax
       ||   0x00402e28,   0 |           mov [rbx+0x20], rcx
       ||   0x00402e2c,   0 |           dec ebp
       `==< 0x00402e2e    0 |           jnz 0x402e10  ; 5 = 
 0x00402e10
        |   0x00402e30,   0 |           lea r15, [rsp]

 Note the ASCII arrow on the left! Objdump should do that. :)
Any suggestions how we can make find more efficient?
May 25 2016
parent reply qznc <qznc web.de> writes:
On Wednesday, 25 May 2016 at 10:28:47 UTC, Chris wrote:
 Any suggestions how we can make find more efficient?
I will send a pull request soon. My current patch [0] improves it, but still contains a bug. [0] https://github.com/qznc/find-is-too-slow/commit/76efc706a2c1d9e885800ca9b830850935700a95
May 25 2016
parent reply qznc <qznc web.de> writes:
On Wednesday, 25 May 2016 at 10:45:35 UTC, qznc wrote:
 On Wednesday, 25 May 2016 at 10:28:47 UTC, Chris wrote:
 Any suggestions how we can make find more efficient?
I will send a pull request soon. My current patch [0] improves it, but still contains a bug. [0] https://github.com/qznc/find-is-too-slow/commit/76efc706a2c1d9e885800ca9b830850935700a95
And here it is: https://github.com/dlang/phobos/pull/4362 Destroy!
May 25 2016
parent Chris <wendlec tcd.ie> writes:
On Wednesday, 25 May 2016 at 11:30:33 UTC, qznc wrote:
 On Wednesday, 25 May 2016 at 10:45:35 UTC, qznc wrote:
 On Wednesday, 25 May 2016 at 10:28:47 UTC, Chris wrote:
 Any suggestions how we can make find more efficient?
I will send a pull request soon. My current patch [0] improves it, but still contains a bug. [0] https://github.com/qznc/find-is-too-slow/commit/76efc706a2c1d9e885800ca9b830850935700a95
And here it is: https://github.com/dlang/phobos/pull/4362 Destroy!
Great stuff! I really appreciate your efforts. As I said, my code uses find/canFind a lot.
May 25 2016
prev sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 05/24/2016 03:54 AM, qznc wrote:
 On Monday, 23 May 2016 at 22:19:18 UTC, Andrei Alexandrescu wrote:
 On 05/23/2016 03:11 PM, qznc wrote:
 Actually, std find should be faster, since it could use the Boyer Moore
 algorithm instead of naive string matching.
Conventional wisdom has it that find() is brute force and that's that, but probably it's time to destroy. Selectively using advanced searching algorithms for the appropriate inputs is very DbI-ish. There are a few nice precedents of blend algorithms, see e.g. http://effbot.org/zone/stringlib.htm. Writing a generic subsequence search blend algorithm, one that chooses the right algorithm based on a combination of static and dynamic decisions, is quite a fun and challenging project. Who wanna?
I observed that Boyer-Moore from std is still slower. My guess is that it is due to the relatively short strings in my benchmark and the need to allocate for the tables. Knuth-Morris-Pratt might be worthwhile, because it only requires a smaller table, which could be allocated on the stack.
There's also Rabin-Karp that can be used when the sought range is relatively short.
 The overall sentiment seems to be that KMP vs BM depends on the input.
 This means an uninformed find function could only use heuristics.
 Anyway, Phobos could use a KnuthMorrisPrattFinder implementation next to
 the BoyerMooreFinder. I filed an issue [0].
Great. I keep on thinking how the advanced algorithms should "kick in" only after a partial match was long enough and frequent enough. As long as the partial match is short and infrequent, brute force will do best.
 Apart from advanced algorithms, find should not be slower than a naive
 nested-for-loop implementation.

 [0] https://issues.dlang.org/show_bug.cgi?id=16066
Nice, thanks. Andrei
May 24 2016
prev sibling next sibling parent Chris <wendlec tcd.ie> writes:
On Monday, 23 May 2016 at 22:19:18 UTC, Andrei Alexandrescu wrote:
 On 05/23/2016 03:11 PM, qznc wrote:
 Actually, std find should be faster, since it could use the 
 Boyer Moore
 algorithm instead of naive string matching.
Conventional wisdom has it that find() is brute force and that's that, but probably it's time to destroy. Selectively using advanced searching algorithms for the appropriate inputs is very DbI-ish.
Since I work a lot with text, I use canFind etc. a lot. It'd be nice to be able to choose faster algorithms. People are often advised to use library functions instead of rolling their own, because library functions are optimized for speed, but this doesn't seem to be true of Phobos. Simple everyday tasks like string matching should be as fast as C.
 There are a few nice precedents of blend algorithms, see e.g. 
 http://effbot.org/zone/stringlib.htm.

 Writing a generic subsequence search blend algorithm, one that 
 chooses the right algorithm based on a combination of static 
 and dynamic decisions, is quite a fun and challenging project. 
 Who wanna?


 Andrei
May 24 2016
prev sibling parent reply Chris <wendlec tcd.ie> writes:
On Monday, 23 May 2016 at 22:19:18 UTC, Andrei Alexandrescu wrote:
 On 05/23/2016 03:11 PM, qznc wrote:

 Conventional wisdom has it that find() is brute force and 
 that's that, but probably it's time to destroy. Selectively 
 using advanced searching algorithms for the appropriate inputs 
 is very DbI-ish.

 There are a few nice precedents of blend algorithms, see e.g. 
 http://effbot.org/zone/stringlib.htm.

 Writing a generic subsequence search blend algorithm, one that 
 chooses the right algorithm based on a combination of static 
 and dynamic decisions, is quite a fun and challenging project. 
 Who wanna?


 Andrei
I've played around with some algorithms and kept them as simple as possible, but I've found that a linear brute-force for-loop always beats them (it's the extra decision(s), I suppose). It looks nice in theory, but in hardware reality a stupid loop is more efficient.

NB: I've found that `foreach` is usually faster than a manual `for`-loop.

I used qznc's benchmarking, and it's clear that the current implementation of std.algorithm.find is quite a poor performer. We do have to work on that. Library functions should be fast, no matter what. If someone uses `find`, they wanna be sure it's optimized for speed.
May 27 2016
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 5/27/16 8:28 AM, Chris wrote:
 I've played around with some algorithms and kept them as simple as
 possible, but I've found that a linear brute force for-loop always beats
 them (it's the extra decision(s), I suppose). It looks nice in theory,
 but in hardware reality a stupid loop is more efficient.
This is surprising. It should be easy to construct examples where brute force search performs arbitrarily poorly, e.g. searching for "aaaaaaa...aab" in a long string with only "a"s.
 NB: I've found that `foreach` is usually faster than a manual `for`-loop.
That's surprising too. I'd be interested to look at a disassembly if you have one.
 I used qznc's benchmarking, and it's clear that the current
 implementation of std.algorithm.find is quite a poor performer. We do
 have to work on that. Library functions should be fast, no matter what.
 If someone uses `find`, they wanna be sure it's optimized for speed.
That's definitely the way to go. Thanks for looking into it! Andrei
May 27 2016
parent reply Chris <wendlec tcd.ie> writes:
On Friday, 27 May 2016 at 13:29:00 UTC, Andrei Alexandrescu wrote:
 On 5/27/16 8:28 AM, Chris wrote:

 This is surprising. It should be easy to construct examples 
 where brute force search performs arbitrarily poorly, e.g. 
 searching for "aaaaaaa...aab" in a long string with only "a"s.
I will look into this. Atm, I'm using qznc's benchmark.
 NB: I've found that `foreach` is usually faster than a manual 
 `for`-loop.
That's surprising too. I'd be interested to look at a disassembly if you have one.
I have to correct myself: a manual loop is faster. I couldn't believe it either, so I benchmarked the same function with `foreach` and `for`. Turns out that `for` is _way_ faster.
 I used qznc's benchmarking, and it's clear that the current
 implementation of std.algorithm.find is quite a poor 
 performer. We do
 have to work on that. Library functions should be fast, no 
 matter what.
 If someone uses `find`, they wanna be sure it's optimized for 
 speed.
That's definitely the way to go. Thanks for looking into it! Andrei
There is a bug, if it is one, in qznc's `manual_find`. It doesn't find strings at the end of a string, such as "ing" in "string", and it does not find a string that is equal to the search string, as in "abba" == "abba".

This outperforms both `manual_find` and the improved `std find`:

string findStringS_Manual(string haystack, string needle)
{
    if (needle.length > haystack.length)
        return haystack[$ .. $];
    outer:
    for (auto i = 0; i < haystack.length; i++)
    {
        if (haystack[i] != needle[0])
            continue;
        for (size_t j = i+1, k = 1; k < needle.length; ++j, ++k)
            if (haystack[j] != needle[k])
                continue outer;
        return haystack[i .. $];
    }
    return haystack[$ .. $];
}

A typical example would be:

std find     took 25642297
manual find  took  7426676  // manual_find
my std find  took  6654496  // improved find
findStringS  took  7159556  // foreach (i, c; haystack)
findStringS_ took  5315498  // for (auto i = 0 ...), see above

I think it's clear that `std find` is veeeery slow. If anyone wants to test my function, please do so. I don't want to spread wrong data, but this is what I get at the moment (ldc latest version).
May 27 2016
next sibling parent reply Chris <wendlec tcd.ie> writes:
On Friday, 27 May 2016 at 14:06:09 UTC, Chris wrote:
 This outperforms both `manual_find` and the improved `std find`

 string findStringS_Manual(string haystack, string needle)
 {
 	
 	if (needle.length > haystack.length)
 		return haystack[$..$];
 	outer:
 	for (auto i = 0; i < haystack.length; i++)
 	{
 		if (haystack[i] != needle[0])
 			continue;
 		for (size_t j = i+1, k = 1; k < needle.length; ++j, ++k)
 			if (haystack[j] != needle[k])
 			continue outer;
 		return haystack[i..$];
 	}
 	return haystack[$..$];
 }
Little bug fix:

for (size_t j = (i+1 < haystack.length) ? i+1 : i, k = 1; k < needle.length; ++j, ++k)

Else it will be out of range when you look for "za" and the string ends with "z". There is a slight performance penalty, but it's still faster than the rest. Maybe there are cleverer ways. But every little check and decision comes with a performance penalty.

The improved `std find` comes very close to the manual function above now, sometimes it's even faster or at least as fast.

std find     took 12573666
manual find  took  7418454
my std find  took  6903854  <===
findStringS  took  7166720
findStringS_ took  6579529  <===

std find     took 11849892
manual find  took  7407091
my std find  took  6573102  <===
findStringS  took  7296610
findStringS_ took  6573214  <===

std find     took 10744361
manual find  took  7576133
my std find  took  6332014  <===
findStringS  took  7224946
findStringS_ took  6555976  <===
May 27 2016
next sibling parent qznc <qznc web.de> writes:
On Friday, 27 May 2016 at 14:41:29 UTC, Chris wrote:
 The improved `std find` comes very close to the manual function 
 above now, sometimes it's even faster or at least as fast.
I think it really depends on the use case. The manual algorithm is really naive and std-find is slightly more advanced. My benchmark creates a random haystack and needle, but this is not real-world data. I believe the "slightly more advanced" approach is a good idea, but there is no data to prove it. It might be interesting to see what algorithms other languages' standard libraries (C++, Python, Java, etc.) use.

Thanks for looking at the benchmarking, Chris! The more eyes, the better. :)
May 27 2016
prev sibling next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 5/27/16 10:41 AM, Chris wrote:
 The improved `std find` comes very close to the manual function above
 now, sometimes it's even faster or at least as fast.
What you want to do here for indexed access is to match the last element first. If no match, continue etc. If there's a match, enter an inner loop where you don't need to check for the end of the haystack. -- Andrei
May 27 2016
parent reply qznc <qznc web.de> writes:
On Friday, 27 May 2016 at 18:21:06 UTC, Andrei Alexandrescu wrote:
 What you want to do here for indexed access is to match the 
 last element first. If no match, continue etc. If there's a 
 match, enter an inner loop where you don't need to check for 
 the end of the haystack. -- Andrei
Another plot twist: sentinels look interesting. Wilzbach [0] mentioned it in the pull request:

"Btw on a related issue - it seems like no one has actually bother to implement Andrei's "Fastware" sentinels in Phobos. If you are already in benchmarking, you might give this specialization for mutable RandomAccessRanges a try ;-)"

I tried it [1]. It changes the numbers from

std find:    153 ±32
manual find: 112 ±19
qznc find:   121 ±18
Chris find:  132 ±20

to

std find:    160 ±31
manual find: 118 ±24
qznc find:   109 ±13  <--- using the sentinel trick
Chris find:  142 ±27

It is normal that the numbers of the other tests change, since those are relative to the fastest one in each run. When qznc find 'wins' more often, the others get more slowdown.

This means another special case for mutable ranges could be worthwhile.

[0] https://github.com/dlang/phobos/pull/4362#issuecomment-222224387
[1] https://github.com/qznc/find-is-too-slow/commit/043d1d2f6dced625e964e8df883ad5a45d7fe654
May 27 2016
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 05/27/2016 06:19 PM, qznc wrote:
 manual find: 118 ±24
 qznc find:   109 ±13  <--- using the sentinel trick
 Chris find:  142 ±27

 It is normal that the numbers of the other tests change, since those are
 relative to the fastest one in each run. When qznc find 'wins' more
 often, the others get more slowdown.
What compiler?
 This means another special case for mutable ranges could be worthwhile.
Also mind the throwing possibilities. Andrei
May 27 2016
parent reply qznc <qznc web.de> writes:
On Saturday, 28 May 2016 at 01:30:11 UTC, Andrei Alexandrescu 
wrote:
 On 05/27/2016 06:19 PM, qznc wrote:
 manual find: 118 ±24
 qznc find:   109 ±13  <--- using the sentinel trick
 Chris find:  142 ±27

 It is normal that the numbers of the other tests change, since 
 those are
 relative to the fastest one in each run. When qznc find 'wins' 
 more
 often, the others get more slowdown.
What compiler?
LDC. For DMD the numbers show no improvement:

   std find: 170 ±31
manual find: 132 ±25
  qznc find: 103 ±6
 Chris find: 157 ±30

to

   std find: 168 ±30
manual find: 131 ±25
  qznc find: 104 ±8   <--- sentinel
 Chris find: 155 ±29

(Caveat: Chris find has a bug, which is triggered occasionally within the 10000 runs.)

The sentinel trick is only a proof of concept. The sentinel value is `needleBack+1`, but range elements need not support addition. Finding a sentinel is hard and most certainly requires more assumptions about the ranges.
 This means another special case for mutable ranges could be 
 worthwhile.
Also mind the throwing possibilities.
What do you mean? Using exceptions?

If anybody wants to replicate: it only takes a few seconds and there are no dependencies except dmd/ldc.

git clone https://github.com/qznc/find-is-too-slow
cd find-is-too-slow
make dmd
make ldc
May 28 2016
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 5/28/16 6:56 AM, qznc wrote:
 The sentinel value is `needleBack+1`, but range elements need not
 support addition. Finding a sentinel is hard and most certainly requires
 more assumptions about the ranges.
No need for a sentinel really so long as you first search for the last element of the needle, in the haystack. Allow me to put on the table a simple brute force routine, specialized for arrays with copyable elements, no custom predicate etc. Far as I can tell it does no redundant work:

T[] find(T)(T[] haystack, T[] needle)
{
    if (needle.length == 0) return haystack;
    immutable lastIndex = needle.length - 1;
    auto last = needle[lastIndex];
    size_t j = lastIndex;
    for (; j < haystack.length; ++j)
    {
        if (haystack[j] != last) continue;
        immutable k = j - lastIndex; // last elements match
        for (size_t i = 0; ; ++i)
        {
            if (i == lastIndex) return haystack[k .. $];
            if (needle[i] != haystack[k + i]) break;
        }
    }
    return haystack[$ .. $];
}

unittest
{
    string s1 = "hello abc world";
    assert(find(s1, "abc") == "abc world");
    assert(find(s1, "def") == "");
}

void main(){}


Andrei
May 28 2016
next sibling parent reply Chris <wendlec tcd.ie> writes:
On Saturday, 28 May 2016 at 12:47:59 UTC, Andrei Alexandrescu 
wrote:
 On 5/28/16 6:56 AM, qznc wrote:
 The sentinel value is `needleBack+1`, but range elements need 
 not
 support addition. Finding a sentinel is hard and most 
 certainly requires
 more assumptions about the ranges.
 No need for a sentinel really so long as you first search for the last element of the needle, in the haystack. Allow me to put on the table a simple brute force routine, specialized for arrays with copyable elements, no custom predicate etc. Far as I can tell it does no redundant work:

 T[] find(T)(T[] haystack, T[] needle)
 {
     if (needle.length == 0) return haystack;
     immutable lastIndex = needle.length - 1;
     auto last = needle[lastIndex];
     size_t j = lastIndex;
     for (; j < haystack.length; ++j)
     {
         if (haystack[j] != last) continue;
         immutable k = j - lastIndex; // last elements match
         for (size_t i = 0; ; ++i)
         {
             if (i == lastIndex) return haystack[k .. $];
             if (needle[i] != haystack[k + i]) break;
         }
     }
     return haystack[$ .. $];
 }

 unittest
 {
     string s1 = "hello abc world";
     assert(find(s1, "abc") == "abc world");
     assert(find(s1, "def") == "");
 }

 void main(){}

 Andrei
I included your function (I also fixed the bug in my findStringS_). Here's a typical result:

    std find: 208 ±61
 manual find: 127 ±29
   qznc find: 108 ±13
  Chris find: 140 ±32
 Andrei find: 126 ±27

=====

    std find: 207 ±59
 manual find: 128 ±30
   qznc find: 108 ±13
  Chris find: 141 ±32
 Andrei find: 125 ±27

I used dmd, because I don't have ldc on my laptop. qznc's find is clearly the winner.
May 28 2016
parent qznc <qznc web.de> writes:
On Saturday, 28 May 2016 at 14:18:10 UTC, Chris wrote:
 I used dmd, because I don't have ldc on my laptop. qznc's find 
 is clearly the winner.
DMD performance feels flaky to me. If performance is important, you should use ldc or gdc. Alternatively, if you are Walter Bright, you can simply optimize dmd. ;)
May 28 2016
prev sibling parent reply qznc <qznc web.de> writes:
On Saturday, 28 May 2016 at 12:47:59 UTC, Andrei Alexandrescu 
wrote:
 On 5/28/16 6:56 AM, qznc wrote:
 The sentinel value is `needleBack+1`, but range elements need 
 not
 support addition. Finding a sentinel is hard and most 
 certainly requires
 more assumptions about the ranges.
No need for a sentinel really so long as you first search for the last element of the needle, in the haystack. Allow me to put on the table a simple brute force routine, specialized for arrays with copyable elements, no custom predicate etc. Far as I can tell it does no redundant work:
A nice one. Here are the numbers I got:

LDC:
    std find: 156 ±33
 manual find: 117 ±24
   qznc find: 114 ±14
  Chris find: 135 ±24
 Andrei find: 112 ±27

DMD:
    std find: 177 ±31
 manual find: 140 ±28
   qznc find: 102 ±4
  Chris find: 165 ±31
 Andrei find: 130 ±25

My sentinel idea is to have only one conditional branch in the inner loop. Andrei find still has two `if`s in the inner loop. My inner loop looks like this:

haystack[scout] = cast(typeof(needleBack)) (needleBack+1); // place sentinel
for (size_t j = scout + 1 - needleLength, k = 0;; ++j, ++k) { // no condition
    if (!binaryFun!pred(haystack[j], needle[k])) { // only condition
        // ... ok, in here comes another, but it is not as costly
    }
}

As long as you are matching the needle, there is only one condition to check. There is no check for k < needleLength, because the check always fails for needle[$] aka haystack[scout] due to the sentinel. I thought it might be slower, since we need to write to memory twice instead of only reading. Turns out it is faster.
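For reference, the sentinel idea above can be condensed into a minimal self-contained sketch. This is a hypothetical helper specialized to mutable ubyte[] (the actual commit works on generic ranges with a predicate): the element under `scout` is temporarily overwritten with a value that cannot equal the needle's last element, so the inner matching loop needs no `k < needle.length` bounds check.

```d
import std.stdio;

// Hypothetical sketch of the sentinel trick, specialized to mutable ubyte[].
ubyte[] findSentinel(ubyte[] haystack, const(ubyte)[] needle)
{
    if (needle.length == 0) return haystack;
    if (needle.length > haystack.length) return haystack[$ .. $];
    immutable lastIndex = needle.length - 1;
    immutable last = needle[lastIndex];
    size_t scout = lastIndex;
    while (true)
    {
        // advance to the next position where the last elements match
        while (scout < haystack.length && haystack[scout] != last)
            ++scout;
        if (scout >= haystack.length) return haystack[$ .. $];
        immutable start = scout - lastIndex;
        immutable saved = haystack[scout];
        haystack[scout] = cast(ubyte) (last + 1); // place sentinel
        size_t k = 0;
        while (haystack[start + k] == needle[k])  // guaranteed to stop at the sentinel
            ++k;
        haystack[scout] = saved;                  // restore the element
        if (k == lastIndex) return haystack[start .. $];
        ++scout;
    }
}

void main()
{
    auto haystack = cast(ubyte[]) "hello abc world".dup;
    assert(findSentinel(haystack, cast(const(ubyte)[]) "abc") ==
           cast(ubyte[]) "abc world".dup);
    assert(findSentinel(haystack, cast(const(ubyte)[]) "def").length == 0);
    writeln("ok");
}
```

The write/restore pair is the price for dropping one branch per inner-loop iteration; as the numbers above show, that trade-off can pay off.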
May 28 2016
parent reply qznc <qznc web.de> writes:
I played around with the benchmark. Some more numbers:

$ make ldc
ldmd2 -O -release -inline -noboundscheck *.d -ofbenchmark.ldc
./benchmark.ldc
E: wrong result with Chris find
E: wrong result with Chris find
E: wrong result with Chris find
     std find: 153 ±25    +66 (1934)  -15 (7860)
  manual find: 122 ±28    +80 (1812)  -17 (8134)
    qznc find: 125 ±16    +18 (4644)  -15 (5126)
   Chris find: 148 ±29    +75 (1976)  -18 (7915)
  Andrei find: 114 ±23   +100 (1191)  -13 (8770)
  (avg slowdown vs fastest; absolute deviation)
$ make dmd
dmd -O -release -inline -noboundscheck *.d -ofbenchmark.dmd
./benchmark.dmd
E: wrong result with Chris find
E: wrong result with Chris find
E: wrong result with Chris find
     std find: 160 ±27    +44 (3162)  -20 (6709)
  manual find: 148 ±28    +54 (2701)  -19 (7178)
    qznc find: 102 ±3     +27 ( 766)   -1 (9136)
   Chris find: 175 ±30    +55 (2796)  -21 (7106)
  Andrei find: 122 ±22    +46 (2554)  -14 (7351)
  (avg slowdown vs fastest; absolute deviation)

The additional numbers on the right are the ±MAD separated by 
above or below the mean. For example Andrei find with ldc:

   Andrei find: 114 ±23   +100 (1191)  -13 (8770)

The mean slowdown is 114, which means 14% slower than the fastest 
one. The mean absolute deviation (MAD) is 23. More precisely, the 
mean deviation above the mean slowdown of 103 is 100 and -13 
below the mean slowdown. 1191 of the 10000 runs were above the 
mean slowdown and 8770 below. The 39 missing runs are equal to 
the mean slowdown.

What bothers me is that changing the alphabet changes the numbers 
so much. Currently, if you restrict the alphabets for haystack 
and needle, the numbers change. The benchmark already does a 
random subset on each run, but there is definitely a bias.
May 29 2016
next sibling parent reply Chris <wendlec tcd.ie> writes:
On Sunday, 29 May 2016 at 12:22:23 UTC, qznc wrote:
 I played around with the benchmark. Some more numbers:

 $ make ldc
 ldmd2 -O -release -inline -noboundscheck *.d -ofbenchmark.ldc
 ./benchmark.ldc
 E: wrong result with Chris find
 E: wrong result with Chris find
 E: wrong result with Chris find
     std find: 153 ±25    +66 (1934)  -15 (7860)
  manual find: 122 ±28    +80 (1812)  -17 (8134)
    qznc find: 125 ±16    +18 (4644)  -15 (5126)
   Chris find: 148 ±29    +75 (1976)  -18 (7915)
  Andrei find: 114 ±23   +100 (1191)  -13 (8770)
  (avg slowdown vs fastest; absolute deviation)
 $ make dmd
 dmd -O -release -inline -noboundscheck *.d -ofbenchmark.dmd
 ./benchmark.dmd
 E: wrong result with Chris find
 E: wrong result with Chris find
 E: wrong result with Chris find
     std find: 160 ±27    +44 (3162)  -20 (6709)
  manual find: 148 ±28    +54 (2701)  -19 (7178)
    qznc find: 102 ±3     +27 ( 766)   -1 (9136)
   Chris find: 175 ±30    +55 (2796)  -21 (7106)
  Andrei find: 122 ±22    +46 (2554)  -14 (7351)
  (avg slowdown vs fastest; absolute deviation)

 The additional numbers on the right are the ±MAD separated by 
 above or below the mean. For example Andrei find with ldc:

   Andrei find: 114 ±23   +100 (1191)  -13 (8770)

 The mean slowdown is 114, which means 14% slower than the 
 fastest one. The mean absolute deviation (MAD) is 23. More 
 precisely, the mean deviation above the mean slowdown of 103 is 
 100 and -13 below the mean slowdown. 1191 of the 10000 runs 
 were above the mean slowdown and 8770 below. The 39 missing 
 runs are equal to the mean slowdown.

 What bothers me is that changing the alphabet changes the 
 numbers so much. Currently, if you restrict the alphabets for 
 haystack and needle, the numbers change. The benchmark already 
 does a random subset on each run, but there is definitely a 
 bias.
You can avoid "E: wrong result with Chris find" by using

outer:
for (auto i = 0; i < haystack.length-needle.length; i++) {
    if (haystack[i] != needle[0])
        continue;
    for (size_t j = i+1, k = 1; k < needle.length; ++j, ++k)
        if (haystack[j] != needle[k])
            continue outer;
    return haystack[i..$];
}

It's a tad faster. I'm planning to test on more varied data and see where a bias may occur.
May 29 2016
parent Chris <wendlec tcd.ie> writes:
It's weird, on my machine, my find function is consistently 
faster than `manual find`

LDC:

     std find: 137 ±22    +33 (3580)  -17 (6251)
  manual find: 137 ±32    +53 (3105)  -23 (6834)
    qznc find: 106 ±8     +17 (2608)   -5 (7195)
   Chris find: 132 ±32    +58 (2803)  -23 (7131)
  Andrei find: 120 ±26    +54 (2466)  -17 (7454)

===

     std find: 136 ±22    +33 (3503)  -17 (6304)
  manual find: 137 ±33    +55 (3000)  -23 (6920)
    qznc find: 106 ±8     +17 (2535)   -5 (7287)
   Chris find: 132 ±33    +59 (2803)  -23 (7137)
  Andrei find: 119 ±25    +51 (2569)  -16 (7374)


It's negligible, but I wonder is it the extra initialization:

`size_t end = haystack.length - needle.length + 1;`

and

`size_t i = 0` is defined before we know if it's a legitimate 
search. But that shouldn't matter, or should it?

Or does it depend on the machine?

DMD slows all searches down considerably (except for `qznc find`, 
well done!:)

DMD:

     std find: 161 ±36    +46 (4015)  -31 (5847)
  manual find: 147 ±40    +68 (3001)  -28 (6917)
    qznc find: 106 ±8     +15 (2623)   -5 (7172)
   Chris find: 201 ±41    +73 (2830)  -28 (7120)
  Andrei find: 150 ±40    +61 (3347)  -30 (6575)


PS I've corrected the end of the search string to

`outer:
     for (auto i = 0; i < haystack.length-(needle.length-1); i++)`

otherwise it will not match at the end of a string.
May 30 2016
prev sibling parent reply Jon Degenhardt <jond noreply.com> writes:
On Sunday, 29 May 2016 at 12:22:23 UTC, qznc wrote:
 I played around with the benchmark. Some more numbers:

 The mean slowdown is 114, which means 14% slower than the 
 fastest one. The mean absolute deviation (MAD) is 23. More 
 precisely, the mean deviation above the mean slowdown of 103 is 
 100 and -13 below the mean slowdown. 1191 of the 10000 runs 
 were above the mean slowdown and 8770 below. The 39 missing 
 runs are equal to the mean slowdown.
A minor thing - you might consider also calculating the median and the median version of MAD (median of absolute deviations from the median). The reason is that benchmarks often have outliers in the max-time dimension, and the median will do a better job of reducing the effect of those outliers than the mean. Your benchmark code could publish both forms. --Jon
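For what it's worth, the robust statistics Jon describes are only a few lines. This is a hypothetical sketch (not part of the benchmark repo); `median` and `medianAD` are illustrative names:

```d
import std.algorithm : map, sort;
import std.array : array;
import std.math : abs;

// Median of a non-empty sample (works on a copy, input stays untouched).
double median(const(double)[] xs)
{
    auto s = xs.dup;
    s.sort();
    immutable n = s.length;
    return n % 2 == 1 ? s[n / 2] : (s[n / 2 - 1] + s[n / 2]) / 2.0;
}

// Median of absolute deviations from the median: a single extreme
// max-time outlier barely moves it, unlike the mean-based MAD.
double medianAD(const(double)[] xs)
{
    immutable m = median(xs);
    return median(xs.map!(x => abs(x - m)).array);
}

void main()
{
    double[] runs = [101, 103, 104, 102, 350]; // one extreme outlier
    assert(median(runs) == 103);               // outlier barely matters
    assert(medianAD(runs) == 1);               // deviations: 2,0,1,1,247
}
```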
May 29 2016
parent reply qznc <qznc web.de> writes:
On Sunday, 29 May 2016 at 18:50:40 UTC, Jon Degenhardt wrote:
 A minor thing - you might consider also calculating the median 
 and median version of MAD (median of absolute deviations from 
 the median). The reason is that benchmarks often have outliers 
 in the max time dimension, median will do a job reducing the 
 effect of those outliers than mean. Your benchmark code could 
 publish both forms.
I don't think that would help. This is not a normal distribution, so mean and median don't match anyways. What would you learn from the median?

When looking at the assembly I don't like the single-byte loads. Since string (ubyte[] here) is of extraordinary importance, it should be worthwhile to use word loads [0] instead. Really fancy would be SSE. However, this is more advanced and requires further introspection.

[0] http://forum.dlang.org/post/aahhopdcrengakvoemrn forum.dlang.org
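The word-load idea can be illustrated with a hypothetical equality helper (illustrative name `equalWordwise`; assumes a platform that tolerates unaligned reads, like x86). Byte-exact semantics, just wider loads:

```d
// Compare two byte slices one size_t word at a time, falling back to
// single bytes for the tail. Same result as a[] == b[], but with
// potentially fewer load instructions on long inputs.
bool equalWordwise(const(ubyte)[] a, const(ubyte)[] b)
{
    if (a.length != b.length) return false;
    size_t i = 0;
    for (; i + size_t.sizeof <= a.length; i += size_t.sizeof)
    {
        // unaligned word loads; fine on x86, may trap on stricter ISAs
        immutable wa = *cast(const(size_t)*) (a.ptr + i);
        immutable wb = *cast(const(size_t)*) (b.ptr + i);
        if (wa != wb) return false;
    }
    for (; i < a.length; ++i)       // leftover tail, byte by byte
        if (a[i] != b[i]) return false;
    return true;
}

void main()
{
    auto x = cast(const(ubyte)[]) "hello, wordwise world";
    auto y = cast(const(ubyte)[]) "hello, wordwise world";
    auto z = cast(const(ubyte)[]) "hello, wordwise w0rld";
    assert(equalWordwise(x, y));
    assert(!equalWordwise(x, z));
}
```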
May 29 2016
next sibling parent Jon Degenhardt <jond noreply.com> writes:
On Sunday, 29 May 2016 at 21:07:21 UTC, qznc wrote:
 On Sunday, 29 May 2016 at 18:50:40 UTC, Jon Degenhardt wrote:
 A minor thing - you might consider also calculating the median 
 and median version of MAD (median of absolute deviations from 
 the median). The reason is that benchmarks often have outliers 
 in the max time dimension, median will do a job reducing the 
 effect of those outliers than mean. Your benchmark code could 
 publish both forms.
I don't think that would help. This is not a normal distribution, so mean and median don't match anyways. What would you learn from the median?
Oh, I see. The benchmark varies the data on each run and aggregates, is that right? Sorry, I missed that.
May 29 2016
prev sibling parent reply qznc <qznc web.de> writes:
On Sunday, 29 May 2016 at 21:07:21 UTC, qznc wrote:
 When looking at the assembly I don't like the single-byte 
 loads. Since string (ubyte[] here) is of extraordinary 
 importance, it should be worthwhile to use word loads [0] 
 instead. Really fancy would be SSE.
So far, the results look disappointing. Andrei find does not get faster with wordwise matching:

./benchmark.ldc
     std find: 133 ±25    +38 (3384)  -19 (6486)
  manual find: 140 ±37    +64 (2953)  -25 (6962)
    qznc find: 114 ±17    +33 (2610)  -11 (7262)
   Chris find: 146 ±39    +66 (3045)  -28 (6873)
  Andrei find: 126 ±29    +54 (2720)  -19 (7189)
Wordwise find: 130 ±30    +53 (2934)  -21 (6980)

Interesting side-note: On my laptop Andrei find is faster than qznc find (for LDC), but on my desktop it reverses (see above). Both are Intel i7. Need to find a simpler processor. Maybe wordwise is faster there. Alternatively, find is purely memory bound and the L1 cache makes every difference disappear.

Also, note how std find is faster than manual find! Finding a reliable benchmark is hard. :/
May 30 2016
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 05/30/2016 05:31 AM, qznc wrote:
 On Sunday, 29 May 2016 at 21:07:21 UTC, qznc wrote:
 When looking at the assembly I don't like the single-byte loads. Since
 string (ubyte[] here) is of extraordinary importance, it should be
 worthwhile to use word loads [0] instead. Really fancy would be SSE.
 So far, the results look disappointing. Andrei find does not get faster with wordwise matching:

 ./benchmark.ldc
      std find: 133 ±25    +38 (3384)  -19 (6486)
   manual find: 140 ±37    +64 (2953)  -25 (6962)
     qznc find: 114 ±17    +33 (2610)  -11 (7262)
    Chris find: 146 ±39    +66 (3045)  -28 (6873)
   Andrei find: 126 ±29    +54 (2720)  -19 (7189)
 Wordwise find: 130 ±30    +53 (2934)  -21 (6980)

 Interesting side-note: On my laptop Andrei find is faster than qznc find (for LDC), but on my desktop it reverses (see above). Both are Intel i7. Need to find a simpler processor. Maybe wordwise is faster there. Alternatively, find is purely memory bound and the L1 cache makes every difference disappear.

 Also, note how std find is faster than manual find! Finding a reliable benchmark is hard. :/
Please throw this hat into the ring as well: it should improve average search on large vocabulary dramatically.

https://dpaste.dzfl.pl/dc8dc6e1eb53

It uses a BM-inspired trick - once the last characters matched, if the match subsequently fails it needn't start from the next character in the haystack. The "skip" is computed lazily and in a separate function so as to keep the loop tight. All in all a routine worth a look. I wanted to write this for a long time. -- Andrei
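The trick described above boils down to something like the following condensed sketch (illustrative names, not the exact dpaste code): a Horspool-style skip keyed on the needle's last element, computed only after the first full-match failure and kept in a separate cold function so the hot loop stays tight.

```d
import std.stdio;

// How far may we shift after the last element matched but the full match
// failed? The distance from the end to the previous occurrence of the
// last element inside the needle, or needle.length if there is none.
size_t computeSkip(T)(const(T)[] needle)
{
    immutable lastIndex = needle.length - 1;
    foreach_reverse (i; 0 .. lastIndex)
        if (needle[i] == needle[lastIndex])
            return lastIndex - i;
    return needle.length;
}

inout(T)[] findLazySkip(T)(inout(T)[] haystack, const(T)[] needle)
{
    if (needle.length == 0) return haystack;
    immutable lastIndex = needle.length - 1;
    immutable last = needle[lastIndex];
    size_t skip = 0;                       // 0 means "not computed yet"
    size_t j = lastIndex;
    while (j < haystack.length)
    {
        if (haystack[j] != last) { ++j; continue; }
        immutable k = j - lastIndex;       // last elements match
        size_t i = 0;
        for (; i < lastIndex && needle[i] == haystack[k + i]; ++i) {}
        if (i == lastIndex) return haystack[k .. $];
        if (skip == 0) skip = computeSkip(needle); // lazy, cold path
        j += skip;
    }
    return haystack[$ .. $];
}

void main()
{
    assert(findLazySkip("hello abc world", "abc") == "abc world");
    assert(findLazySkip("xadaad", "aad") == "aad"); // exercises the skip
    assert(findLazySkip("hello", "xyz").length == 0);
    writeln("ok");
}
```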
May 30 2016
next sibling parent reply Chris <wendlec tcd.ie> writes:
On Monday, 30 May 2016 at 18:20:39 UTC, Andrei Alexandrescu wrote:
 Please throw this hat into the ring as well: it should improve 
 average search on large vocabulary dramatically.

 https://dpaste.dzfl.pl/dc8dc6e1eb53

 It uses a BM-inspired trick - once the last characters matched, 
 if the match subsequently fails it needn't start from the next 
 character in the haystack. The "skip" is computed lazily and in 
 a separate function so as to keep the loop tight. All in all a 
 routine worth a look. I wanted to write this for a long time. 
 -- Andrei
Cool. So far, my experience with separate functions has been that they slow down the loop. But this might do the trick.
May 30 2016
parent reply Chris <wendlec tcd.ie> writes:
On Monday, 30 May 2016 at 18:57:15 UTC, Chris wrote:
 On Monday, 30 May 2016 at 18:20:39 UTC, Andrei Alexandrescu 
 wrote:
 Please throw this hat into the ring as well: it should improve 
 average search on large vocabulary dramatically.

 https://dpaste.dzfl.pl/dc8dc6e1eb53

 It uses a BM-inspired trick - once the last characters 
 matched, if the match subsequently fails it needn't start from 
 the next character in the haystack. The "skip" is computed 
 lazily and in a separate function so as to keep the loop 
 tight. All in all a routine worth a look. I wanted to write 
 this for a long time. -- Andrei
Cool. So far, my experience with separate functions has been that they slow down the loop. But this might do the trick.
Congratulations!

DMD:

./benchmark.dmd
        std: 178 ±31    +36 (4475)  -29 (5344)
     manual: 167 ±46    +82 (2883)  -32 (7054)
       qznc: 114 ±7     +18 (1990)   -5 (7288)
      Chris: 228 ±49    +82 (3050)  -35 (6780)
     Andrei: 103 ±5     +47 ( 658)   -2 (9295)
  (avg slowdown vs fastest; absolute deviation)
CPU ID: GenuineIntel Intel(R) Core(TM) i5-4200U CPU 1.60GHz

LDC:

        std: 184 ±19    +28 (3420)  -14 (6523)
     manual: 205 ±59   +120 (2463)  -39 (7443)
       qznc: 151 ±25    +44 (2983)  -17 (6911)
      Chris: 194 ±57    +78 (3702)  -46 (6251)
     Andrei: 101 ±2     +42 ( 435)   -1 (9542)
  (avg slowdown vs fastest; absolute deviation)
CPU ID: GenuineIntel Intel(R) Core(TM) i5-4200U CPU 1.60GHz
May 30 2016
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 05/30/2016 04:00 PM, Chris wrote:
 ./benchmark.dmd
         std: 178 ±31    +36 (4475)  -29 (5344)
      manual: 167 ±46    +82 (2883)  -32 (7054)
        qznc: 114 ±7     +18 (1990)   -5 (7288)
       Chris: 228 ±49    +82 (3050)  -35 (6780)
      Andrei: 103 ±5     +47 ( 658)   -2 (9295)
 (avg slowdown vs fastest; absolute deviation)
 CPU ID: GenuineIntel Intel(R) Core(TM) i5-4200U CPU   1.60GHz

 LDC:

         std: 184 ±19    +28 (3420)  -14 (6523)
      manual: 205 ±59   +120 (2463)  -39 (7443)
        qznc: 151 ±25    +44 (2983)  -17 (6911)
       Chris: 194 ±57    +78 (3702)  -46 (6251)
      Andrei: 101 ±2     +42 ( 435)   -1 (9542)
   (avg slowdown vs fastest; absolute deviation)
 CPU ID: GenuineIntel Intel(R) Core(TM) i5-4200U CPU   1.60GHz
Thanks for looking into this! qznc, could you please look into updating https://github.com/dlang/phobos/pull/4362 with this result? One possible tweak is to see whether replacing the function call with inline code helps. Thanks! -- Andrei
May 30 2016
parent reply qznc <qznc web.de> writes:
On Monday, 30 May 2016 at 20:08:46 UTC, Andrei Alexandrescu wrote:
 On 05/30/2016 04:00 PM, Chris wrote:
 ./benchmark.dmd
         std: 178 ±31    +36 (4475)  -29 (5344)
      manual: 167 ±46    +82 (2883)  -32 (7054)
        qznc: 114 ±7     +18 (1990)   -5 (7288)
       Chris: 228 ±49    +82 (3050)  -35 (6780)
      Andrei: 103 ±5     +47 ( 658)   -2 (9295)
 (avg slowdown vs fastest; absolute deviation)
 CPU ID: GenuineIntel Intel(R) Core(TM) i5-4200U CPU   1.60GHz

 LDC:

         std: 184 ±19    +28 (3420)  -14 (6523)
      manual: 205 ±59   +120 (2463)  -39 (7443)
        qznc: 151 ±25    +44 (2983)  -17 (6911)
       Chris: 194 ±57    +78 (3702)  -46 (6251)
      Andrei: 101 ±2     +42 ( 435)   -1 (9542)
   (avg slowdown vs fastest; absolute deviation)
 CPU ID: GenuineIntel Intel(R) Core(TM) i5-4200U CPU   1.60GHz
Thanks for looking into this! qznc, could you please look into updating https://github.com/dlang/phobos/pull/4362 with this result? One possible tweak is see whether replacing the function call with inline code helps. Thanks! -- Andrei
What function call should be replaced with inline code?

Overall, I'm very confused and displeased by the benchmark right now. Initially, the problem was that std find was slower than a naive manual find. Now look at the LDC numbers: manual find is slower than std find. By tweaking the benchmark, the numbers can be shifted around arbitrarily. Even without tweaking, it is very processor dependent. Here are the numbers from my laptop:

./benchmark.ldc
        std: 149 ±24    +29 (4275)  -21 (5604)
     manual: 129 ±37    +90 (2055)  -23 (7917)
       qznc: 131 ±22    +26 (4381)  -20 (5488)
      Chris: 154 ±35    +81 (2192)  -22 (7672)
     Andrei: 118 ±28    +82 (1762)  -16 (8194)
  (avg slowdown vs fastest; absolute deviation)
CPU ID: GenuineIntel Intel(R) Core(TM) i7 CPU M 620 2.67GHz

./benchmark.dmd
        std: 143 ±22    +26 (4219)  -19 (5602)
     manual: 158 ±31    +47 (3436)  -24 (6492)
       qznc: 101 ±2     +12 (1349)   -1 (7851)
      Chris: 216 ±38    +56 (3393)  -28 (6541)
     Andrei: 135 ±31    +48 (3326)  -24 (6589)
  (avg slowdown vs fastest; absolute deviation)
CPU ID: GenuineIntel Intel(R) Core(TM) i7 CPU M 620 2.67GHz

And desktop:

./benchmark.ldc
        std: 129 ±24    +40 (3121)  -17 (6767)
     manual: 129 ±31    +59 (2668)  -21 (7244)
       qznc: 112 ±14    +30 (2542)   -9 (7312)
      Chris: 134 ±33    +58 (2835)  -23 (7068)
     Andrei: 123 ±27    +53 (2679)  -18 (7225)
  (avg slowdown vs fastest; absolute deviation)
CPU ID: GenuineIntel Intel(R) Core(TM) i7-3770 CPU 3.40GHz

./benchmark.dmd
        std: 157 ±31    +44 (3693)  -24 (6234)
     manual: 143 ±41    +73 (2854)  -28 (7091)
       qznc: 116 ±21    +35 (3092)  -14 (6844)
      Chris: 181 ±50    +74 (3452)  -38 (6510)
     Andrei: 136 ±38    +64 (2975)  -27 (6953)
  (avg slowdown vs fastest; absolute deviation)
CPU ID: GenuineIntel Intel(R) Core(TM) i7-3770 CPU 3.40GHz
May 30 2016
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 05/30/2016 06:16 PM, qznc wrote:
 What function call should be replaced with inline code?
The call to computeSkip.
 Overall, I'm very confused and displeased by the benchmark right now.
I agree it's difficult to characterize the behavior of substring search with one number. There are many dimensions of variation. (But there's no reason for an emotional response.) A few possible baselines come to mind:

* Search a long string for a one-character string, match and fail.

* Take an English text string. Search for a substring consisting of its last portion (e.g. 5% of the length).

* Take an English text string. Search for a substring consisting of a fraction of the text (e.g. 3%) with additional characters prepended. Repeat for appended.


Andrei
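The first baseline is easy to script. A hypothetical harness (made-up sizes, current Phobos module paths) might look like:

```d
import std.algorithm.searching : find;
import std.array : replicate;
import std.datetime.stopwatch : AutoStart, StopWatch;
import std.stdio : writefln;

void main()
{
    // Baseline: long haystack, one-character needle, match and fail.
    // "b" matches only at the very end; "c" never matches.
    auto haystack = "a".replicate(10_000_000) ~ "b";
    foreach (needle; ["b", "c"])
    {
        auto sw = StopWatch(AutoStart.yes);
        auto hit = find(haystack, needle);
        writefln("needle %s: %s ms, found=%s",
                 needle, sw.peek.total!"msecs", hit.length != 0);
    }
}
```

The English-text baselines follow the same shape, slicing the needle out of the haystack's tail instead of constructing it.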
May 30 2016
parent reply qznc <qznc web.de> writes:
On Tuesday, 31 May 2016 at 01:55:16 UTC, Andrei Alexandrescu 
wrote:
 I agree it's difficult to characterize the behavior of 
 substring search with one number. There are many dimensions of 
 variation. (But there's no reason for an emotional response.) A 
 few possible baselines come to mind:

 * Search a long string for a one-character string, match and 
 fail.
There is a special version of find for searching a single char in a string. Using a one-letter needle string is more like a user mistake than something to optimize for.
 * Take an English text string. Search for a substring 
 consisting of its last portion (e.g. 5% of the length).
How long should the English text be? A Tweet? A book? A gigabyte of log files? English text means basically ASCII and no Unicode?
 * Take an English text string. Search for a substring 
 consisting of a fraction of the text (e.g. 3%) with additional 
 characters prepended. Repeat for appended.
Why the prepend/append? To force a mismatch?
May 31 2016
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 5/31/16 1:54 PM, qznc wrote:
 Using a one-letter needle string is more like a user mistake than
 something to optimize for.
How is splitting on ',' a user mistake? -- Andrei
May 31 2016
parent reply qznc <qznc web.de> writes:
On Tuesday, 31 May 2016 at 18:31:47 UTC, Andrei Alexandrescu 
wrote:
 On 5/31/16 1:54 PM, qznc wrote:
 Using a one-letter needle string is more like a user mistake 
 than
 something to optimize for.
How is splitting on ',' a user mistake? -- Andrei
The mistake is to split on "," instead of ','. The slow splitter at the start of this thread searches for "\r\n".

Your lazy-skip algorithm looks great!

./benchmark.ldc
Search in Alice in Wonderland
       std: 168 ±6     +29 ( 107)   -3 ( 893)
    manual: 112 ±3     +28 (  81)   -1 ( 856)
      qznc: 149 ±4     +30 (  79)   -1 ( 898)
     Chris: 142 ±5     +28 ( 102)   -2 ( 898)
    Andrei: 153 ±3     +28 (  79)   -1 ( 919)
   Andrei2: 101 ±2     +34 (  31)   -1 ( 969)
Search random haystack with random needle
       std: 172 ±19    +61 ( 161)  -11 ( 825)
    manual: 161 ±47    +73 ( 333)  -35 ( 666)
      qznc: 163 ±21    +33 ( 314)  -15 ( 661)
     Chris: 190 ±47    +80 ( 302)  -33 ( 693)
    Andrei: 140 ±37    +60 ( 315)  -27 ( 676)
   Andrei2: 103 ±6     +57 (  64)   -2 ( 935)
  (avg slowdown vs fastest; absolute deviation)
CPU ID: GenuineIntel Intel(R) Core(TM) i7 CPU M 620 2.67GHz

The Alice benchmark searches Alice in Wonderland for "find a pleasure in all their simple joys" and finds it in the last sentence.
May 31 2016
parent reply Chris <wendlec tcd.ie> writes:
On Tuesday, 31 May 2016 at 18:56:14 UTC, qznc wrote:

 The mistake is to split on "," instead of ','. The slow 
 splitter at the start of this thread searches for "\r\n".

 Your lazy-skip algorithm looks great!

 ./benchmark.ldc
 Search in Alice in Wonderland
        std: 168 ±6     +29 ( 107)   -3 ( 893)
     manual: 112 ±3     +28 (  81)   -1 ( 856)
       qznc: 149 ±4     +30 (  79)   -1 ( 898)
      Chris: 142 ±5     +28 ( 102)   -2 ( 898)
     Andrei: 153 ±3     +28 (  79)   -1 ( 919)
    Andrei2: 101 ±2     +34 (  31)   -1 ( 969)
 Search random haystack with random needle
        std: 172 ±19    +61 ( 161)  -11 ( 825)
     manual: 161 ±47    +73 ( 333)  -35 ( 666)
       qznc: 163 ±21    +33 ( 314)  -15 ( 661)
      Chris: 190 ±47    +80 ( 302)  -33 ( 693)
     Andrei: 140 ±37    +60 ( 315)  -27 ( 676)
    Andrei2: 103 ±6     +57 (  64)   -2 ( 935)
  (avg slowdown vs fastest; absolute deviation)
 CPU ID: GenuineIntel Intel(R) Core(TM) i7 CPU       M 620    
 2.67GHz

 The Alice benchmark searches Alice in Wonderland for "find a 
 pleasure in all their simple joys" and
 finds it in the last sentence.
Would it speed things up even more if we put the function `computeSkip` into the loop, or is this done automatically by the compiler?
May 31 2016
parent reply qznc <qznc web.de> writes:
On Tuesday, 31 May 2016 at 19:29:25 UTC, Chris wrote:
 Would it speed things up even more, if we put the function 
 `computeSkip` into the loop or is this done automatically by 
 the compiler?
LDC inlines it. DMD does not.

More numbers:

./benchmark.ldc
Search in Alice in Wonderland
       std: 147 ±1
    manual: 100 ±0
      qznc: 121 ±1
     Chris: 103 ±1
    Andrei: 144 ±1
   Andrei2: 105 ±1
Search in random short strings
       std: 125 ±15
    manual: 117 ±10
      qznc: 104 ±6
     Chris: 123 ±14
    Andrei: 104 ±5
   Andrei2: 103 ±4
Mismatch in random long strings
       std: 140 ±22
    manual: 164 ±64
      qznc: 115 ±13
     Chris: 167 ±63
    Andrei: 161 ±68
   Andrei2: 106 ±9
Search random haystack with random needle
       std: 138 ±27
    manual: 135 ±33
      qznc: 116 ±16
     Chris: 141 ±36
    Andrei: 131 ±33
   Andrei2: 109 ±12
  (avg slowdown vs fastest; absolute deviation)
CPU ID: GenuineIntel Intel(R) Core(TM) i7-3770 CPU 3.40GHz

Random short strings has haystacks of 10 to 300 characters and needles of 2 to 10. Basically, no time for initialisation. Random long strings has haystacks of size 1000, 10_000, 100_000, or 1_000_000 and needles 50 to 500. It inserts a character into a random index of the needle to force a mismatch. The last one is the configuration as before.

Overall, Andrei2 (the lazy compute skip) is really impressive. :)
May 31 2016
parent reply Chris <wendlec tcd.ie> writes:
On Tuesday, 31 May 2016 at 19:59:50 UTC, qznc wrote:
 On Tuesday, 31 May 2016 at 19:29:25 UTC, Chris wrote:
 Would it speed things up even more, if we put the function 
 `computeSkip` into the loop or is this done automatically by 
 the compiler?
 LDC inlines it. DMD does not.

 More numbers:

 ./benchmark.ldc
 Search in Alice in Wonderland
        std: 147 ±1
     manual: 100 ±0
       qznc: 121 ±1
      Chris: 103 ±1
     Andrei: 144 ±1
    Andrei2: 105 ±1
 Search in random short strings
        std: 125 ±15
     manual: 117 ±10
       qznc: 104 ±6
      Chris: 123 ±14
     Andrei: 104 ±5
    Andrei2: 103 ±4
 Mismatch in random long strings
        std: 140 ±22
     manual: 164 ±64
       qznc: 115 ±13
      Chris: 167 ±63
     Andrei: 161 ±68
    Andrei2: 106 ±9
 Search random haystack with random needle
        std: 138 ±27
     manual: 135 ±33
       qznc: 116 ±16
      Chris: 141 ±36
     Andrei: 131 ±33
    Andrei2: 109 ±12
   (avg slowdown vs fastest; absolute deviation)
 CPU ID: GenuineIntel Intel(R) Core(TM) i7-3770 CPU 3.40GHz

 Random short strings has haystacks of 10 to 300 characters and needles of 2 to 10. Basically, no time for initialisation. Random long strings has haystacks of size 1000, 10_000, 100_000, or 1_000_000 and needles 50 to 500. It inserts a character into a random index of the needle to force a mismatch. The last one is the configuration as before.

 Overall, Andrei2 (the lazy compute skip) is really impressive. :)
Yep. It's really impressive. I actually thought that dmd didn't place `computeSkip` inside of the loop. This raises the question of whether it should be moved into the loop, in case we use it in Phobos, to make sure that it is as fast as possible even with dmd. However, I like it the way it is now. The nice thing about `Andrei2` is that it performs consistently well.
May 31 2016
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 05/31/2016 04:18 PM, Chris wrote:
 I actually thought that dmd didn't place
 `computeSkip` inside of the loop. This begs the question if it should be
 moved to the loop, in case we use it in Phobos, to make sure that it is
 as fast as possible even with dmd. However, I like it the way it is now.
You may want to then try https://dpaste.dzfl.pl/392710b765a9, which generates inline code on all compilers. -- Andrei
May 31 2016
next sibling parent reply David Nadlinger <code klickverbot.at> writes:
On Tuesday, 31 May 2016 at 21:29:34 UTC, Andrei Alexandrescu 
wrote:
 You may want to then try https://dpaste.dzfl.pl/392710b765a9, 
 which generates inline code on all compilers. -- Andrei
In general, it might be beneficial to use ldc.intrinsics.llvm_expect (cf. __builtin_expect) for things like that in order to optimise basic block placement. (We should probably have a compiler-independent API for that in core.*, by the way.) In this case, the skip computation path is probably small enough for that not to matter much, though.

Another thing that might be interesting to do (now that you have a "clever" baseline) is to start counting cycles and make some comparisons against manual asm/intrinsics implementations. For short(-ish) needles, PCMPESTRI is probably the most promising candidate, although I suspect that for \r\n scanning in long strings in particular, an optimised AVX2 solution might have higher throughput.

Of course these observations are not very valuable without backing them up with measurements, but it seems like before optimising a generic search algorithm for short-needle test cases, having one's eyes on a solid SIMD baseline would be a prudent thing to do.

— David
May 31 2016
parent qznc <qznc web.de> writes:
On Tuesday, 31 May 2016 at 22:50:37 UTC, David Nadlinger wrote:
 Another thing that might be interesting to do (now that you 
 have a "clever" baseline) is to start counting cycles and make 
 some comparisons against manual asm/intrinsics implementations. 
 For short(-ish) needles, PCMPESTRI is probably the most 
 promising candidate, although I suspect that for \r\n scanning 
 in long strings in particular, an optimised AVX2 solution might 
 have higher throughput.

 Of course these observations are not very valuable without 
 backing them up with measurements, but it seems like before 
 optimising a generic search algorithm for short-needle test 
 cases, having one's eyes on a solid SIMD baseline would be a 
 prudent thing to do.
The current algorithm is generic with respect to the predicate. Once we use SSE/AVX tricks, it is a special case for equality. As a next step in Phobos, this is probably worth it for strings. We could probably steal some well-optimized strcmp from somewhere.
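One way to picture "stealing a well-optimized strcmp": delegate the scan to libc's `memchr` and the candidate check to `memcmp`, both typically SIMD-accelerated. A hedged C sketch (illustrative, not Phobos code), valid only for the plain-equality special case discussed above:

```c
#include <stddef.h>
#include <string.h>

/* Substring search built on libc's optimised memchr/memcmp. A custom
   predicate could not take this path; it assumes plain byte equality. */
static const char *find_sub(const char *hay, size_t hlen,
                            const char *nee, size_t nlen)
{
    if (nlen == 0) return hay;
    if (nlen > hlen) return NULL;
    const char *p = hay;
    const char *end = hay + (hlen - nlen) + 1;  /* one past last valid start */
    while ((p = memchr(p, nee[0], (size_t)(end - p))) != NULL) {
        if (memcmp(p, nee, nlen) == 0)
            return p;
        ++p;
    }
    return NULL;
}
```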
Jun 02 2016
prev sibling next sibling parent Chris <wendlec tcd.ie> writes:
On Tuesday, 31 May 2016 at 21:29:34 UTC, Andrei Alexandrescu 
wrote:
 You may want to then try https://dpaste.dzfl.pl/392710b765a9, 
 which generates inline code on all compilers. -- Andrei
I've added it as `Andrei3`. It runs faster with dmd, but it's not as good with ldc. Seems like ldc performs some extra optimization when moving `computeSkip` into the loop, something it doesn't bother to do when it's already there.

dmd -O -release -inline -noboundscheck *.d -ofbenchmark.dmd
./benchmark.dmd
Search in Alice in Wonderland
        std: 190 ±4
     manual: 115 ±4
       qznc: 106 ±2
      Chris: 160 ±4
     Andrei: 159 ±4
    Andrei2: 108 ±2
    Andrei3: 100 ±0
Search in random short strings
        std: 222 ±27
     manual: 193 ±49
       qznc: 120 ±12
      Chris: 224 ±57
     Andrei: 114 ±9
    Andrei2: 106 ±5
    Andrei3: 102 ±3
Mismatch in random long strings
        std: 186 ±28
     manual: 206 ±85
       qznc: 118 ±14
      Chris: 265 ±104
     Andrei: 194 ±85
    Andrei2: 116 ±18
    Andrei3: 102 ±4
Search random haystack with random needle
        std: 189 ±38
     manual: 171 ±45
       qznc: 118 ±11
      Chris: 225 ±52
     Andrei: 149 ±32
    Andrei2: 110 ±7
    Andrei3: 103 ±6
  (avg slowdown vs fastest; absolute deviation)
CPU ID: GenuineIntel Intel(R) Core(TM) i7-4770 CPU   3.40GHz

./benchmark.ldc
Search in Alice in Wonderland
        std: 170 ±1
     manual: 143 ±2
       qznc: 133 ±1
      Chris: 144 ±1
     Andrei: 196 ±10
    Andrei2: 100 ±0
    Andrei3: 111 ±1
Search in random short strings
        std: 223 ±30
     manual: 211 ±51
       qznc: 124 ±12
      Chris: 223 ±61
     Andrei: 115 ±10
    Andrei2: 102 ±3
    Andrei3: 105 ±4
Mismatch in random long strings
        std: 181 ±17
     manual: 253 ±109
       qznc: 146 ±24
      Chris: 248 ±108
     Andrei: 228 ±96
    Andrei2: 101 ±2
    Andrei3: 108 ±6
Search random haystack with random needle
        std: 187 ±22
     manual: 208 ±60
       qznc: 152 ±27
      Chris: 202 ±58
     Andrei: 173 ±35
    Andrei2: 102 ±4
    Andrei3: 110 ±8
  (avg slowdown vs fastest; absolute deviation)
CPU ID: GenuineIntel Intel(R) Core(TM) i7-4770 CPU   3.40GHz
Jun 01 2016
prev sibling parent qznc <qznc web.de> writes:
On Tuesday, 31 May 2016 at 21:29:34 UTC, Andrei Alexandrescu 
wrote:
 On 05/31/2016 04:18 PM, Chris wrote:
 I actually thought that dmd didn't place
 `computeSkip` inside of the loop. This begs the question if it 
 should be
 moved to the loop, in case we use it in Phobos, to make sure 
 that it is
 as fast as possible even with dmd. However, I like it the way 
 it is now.
You may want to then try https://dpaste.dzfl.pl/392710b765a9, which generates inline code on all compilers. -- Andrei
I ported this to Phobos style (A2Phobos in the benchmark) and changed the pull request [0]. In practically all tests, this algorithm wipes the floor with the competition.

[0] https://github.com/dlang/phobos/pull/4362
Jun 02 2016
prev sibling next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 5/31/16 1:54 PM, qznc wrote:
 On Tuesday, 31 May 2016 at 01:55:16 UTC, Andrei Alexandrescu wrote:
 I agree it's difficult to characterize the behavior of substring
 search with one number. There are many dimensions of variation. (But
 there's no reason for an emotional response.) A few possible baselines
 come to mind:

 * Search a long string for a one-character string, match and fail.
There is a special version of find for searching a single char in a string. Using a one-letter needle string is more like a user mistake than something to optimize for.
Oh, I see what you mean - it's a string consisting only of one character, not one character 'x'. I agree, but I'm just saying it's a sound baseline.
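As a concrete reference point for this baseline: in C the one-character case is exactly what `memchr` covers, and most libc implementations vectorise it. A small illustrative wrapper (the name `index_of` is hypothetical):

```c
#include <stddef.h>
#include <string.h>

/* Single-character search via libc's memchr; returns the index of the
   first occurrence, or -1 if the character is absent. */
static long index_of(const char *s, size_t len, char c)
{
    const char *p = memchr(s, c, len);
    return p ? (long)(p - s) : -1;
}
```

A specialised single-character find in D would be competing against this.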
 * Take an English text string. Search for a substring consisting of
 its last portion (e.g. 5% of the length).
How long should the english text be? A Tweet? A book? A Gigabyte of log files?
I'm thinking of a few exponentially increasing sizes. e.g. 100, 1000, 10_000, 100_000, 1_000_000.
 English text means basically ASCII and no Unicode?
My understanding (and correct me if I'm wrong) is that we're discussing search with the default predicate, ignoring equivalent grapheme representations etc. So there's no need to mind encoding at all, which means the text could be in any language. Foreign languages would further increase the vocabulary, but that's the only effect.
 * Take an English text string. Search for a substring consisting of a
 fraction of the text (e.g. 3%) with additional characters prepended.
 Repeat for appended.
Why the prepend/append? To force a mismatch?
Yah, and it's a good test for strings that have some equal portion yet they are different, which seems to me puts a greater strain on the algorithms. It is arguably a likely situation in the real world. Thanks, Andrei
May 31 2016
prev sibling parent reply Patrick Schluter <Patrick.Schluter bbox.fr> writes:
On Tuesday, 31 May 2016 at 17:54:34 UTC, qznc wrote:
 There is a special version of find for searching a single char 
 in a string. Using a one-letter needle string is more like a 
 user mistake than something to optimize for.
At compile time you may not know the length of the needle, like in the grep command.
Jun 01 2016
parent reply Seb <seb wilzba.ch> writes:
On Wednesday, 1 June 2016 at 12:14:07 UTC, Patrick Schluter wrote:
 On Tuesday, 31 May 2016 at 17:54:34 UTC, qznc wrote:
 There is a special version of find for searching a single char 
 in a string. Using a one-letter needle string is more like a 
 user mistake than something to optimize for.
At compile time you may not know the length of the needle, like in the grep command.
1) how about a CTFE find?

s.find!(needle, pred)

If we can initialize boyer-moore or KMP at compile time - it should be the fastest! At ndslice such functions are called bifacial: http://dlang.org/phobos/std_experimental_ndslice_iteration.html

Imho a lot more in std.algorithm should be able to profit from facts known at compile-time.

2) Even for a runtime one-letter needle I am pretty sure it's worth to specialize
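The table such a CTFE find would precompute can be sketched in C (illustrative names; a Boyer-Moore-Horspool variant, not the exact Phobos algorithm). `build_skip` is the part that a literal needle would allow evaluating at compile time:

```c
#include <stddef.h>
#include <string.h>

/* Boyer-Moore-Horspool bad-character table: for each byte value, how far
   the search window may shift when that byte lines up with the needle's
   last position. */
static void build_skip(const char *nee, size_t nlen, size_t skip[256])
{
    for (size_t i = 0; i < 256; ++i)
        skip[i] = nlen;                      /* byte absent from needle */
    for (size_t i = 0; i + 1 < nlen; ++i)
        skip[(unsigned char)nee[i]] = nlen - 1 - i;
}

static const char *bmh_find(const char *hay, size_t hlen,
                            const char *nee, size_t nlen)
{
    if (nlen == 0) return hay;
    if (nlen > hlen) return NULL;
    size_t skip[256];
    build_skip(nee, nlen, skip);             /* runtime here; CTFE in D */
    size_t i = 0;
    while (i + nlen <= hlen) {
        if (memcmp(hay + i, nee, nlen) == 0)
            return hay + i;
        i += skip[(unsigned char)hay[i + nlen - 1]];
    }
    return NULL;
}
```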
Jun 01 2016
next sibling parent Chris <wendlec tcd.ie> writes:
On Wednesday, 1 June 2016 at 12:41:19 UTC, Seb wrote:
 On Wednesday, 1 June 2016 at 12:14:07 UTC, Patrick Schluter 
 wrote:
 On Tuesday, 31 May 2016 at 17:54:34 UTC, qznc wrote:
 There is a special version of find for searching a single 
 char in a string. Using a one-letter needle string is more 
 like a user mistake than something to optimize for.
At compile time you may not know the length of the needle, like in the grep command.
 1) how about a CTFE find?

 s.find!(needle, pred)

 If we can initialize boyer-moore or KMP at compile time - it should be the fastest! At ndslice such functions are called bifacial: http://dlang.org/phobos/std_experimental_ndslice_iteration.html

 Imho a lot more in std.algorithm should be able to profit from facts known at compile-time.

 2) Even for a runtime one-letter needle I am pretty sure it's worth to specialize
That makes sense. I think that std.algorithm needs to be revised and optimized for speed. We cannot afford to have suboptimal algorithms in there.
Jun 01 2016
prev sibling next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 06/01/2016 08:41 AM, Seb wrote:
 On Wednesday, 1 June 2016 at 12:14:07 UTC, Patrick Schluter wrote:
 On Tuesday, 31 May 2016 at 17:54:34 UTC, qznc wrote:
 There is a special version of find for searching a single char in a
 string. Using a one-letter needle string is more like a user mistake
 than something to optimize for.
At compile time you may not know the length of the needle, like in the grep command.
1) how about a CTFE find? s.find!(needle, pred) If we can initialize boyer-moore or KMP at compile time - it should be the fastest!
That would require partial evaluation, which sadly we don't have in D. -- Andrei
Jun 01 2016
prev sibling parent reply Patrick Schluter <Patrick.Schluter bbox.fr> writes:
On Wednesday, 1 June 2016 at 12:41:19 UTC, Seb wrote:
 On Wednesday, 1 June 2016 at 12:14:07 UTC, Patrick Schluter 
 wrote:
 On Tuesday, 31 May 2016 at 17:54:34 UTC, qznc wrote:
 There is a special version of find for searching a single 
 char in a string. Using a one-letter needle string is more 
 like a user mistake than something to optimize for.
At compile time you may not know the length of the needle, like in the grep command.
1) how about a CTFE find?
What I wanted to say is that in real life, the input of the search routine is very often run-time user-provided data. Think of the search box in browsers and apps, command-line parameters à la grep, etc. The "string" search function should not catastrophically break down on special input, like 1-character strings, unusual Unicode, or needle==haystack. I only said this so we don't lose focus on what is actually being achieved here.

It's often a danger of micro-optimization and unit-test-focused development that a lot of time and effort are spent on improvements that are completely irrelevant when checked against what is really needed in the real world (i.e. we're well into bike-shed territory here).
Jun 01 2016
parent reply Chris <wendlec tcd.ie> writes:
On Wednesday, 1 June 2016 at 13:47:10 UTC, Patrick Schluter wrote:

 What I wanted to say, is that in real life, the input of the 
 search routine is very often run-time user provided data. Think 
 of search box in browsers and apps, command line parameter à la 
 grep, etc. The "string" search function should not 
 catastrophically break down on special input, like 1 character 
 strings, unusual Unicode or when needle==haystack. I only said 
 this to not lose the focus on what is being tried to be 
 achieved here.
 It's often a danger of micro-optimization and unit test focused 
 development, that a lot of time and effort are spent on 
 improvements that are completely irrelevant when checked 
 against what is really needed in real world (i.e. we're full in 
 bike shed territory here).
That's true. We should try to optimize for the worst case, i.e. random input. However, it is often the case that the needle is known at compile time, e.g. things like

if (input.canFind("foo"))

There might be ways to optimize those cases at compile time.
Jun 01 2016
parent Chris <wendlec tcd.ie> writes:
On Wednesday, 1 June 2016 at 14:32:38 UTC, Chris wrote:

For fun, I've written a search function that alternates between 
the beginning and the end of a string. I'm basically using 
Andrei's search algorithm for this. It is, of course only useful, 
if `needle` is expected to be either towards the beginning or the 
end of a string. If `needle` occurs multiple times, it matches 
where ever it encounters it first.

T[] findUnique(T)(T[] haystack, T[] needle)
{
    static size_t computeSkip(T[] needle)
    {
        size_t result = 1;
        while (result < needle.length && needle[$ - 1 - result] != needle[$ - 1])
        {
            ++result;
        }
        return result;
    }

    if (needle.length == 0) return haystack;
    immutable size_t len = haystack.length;
    immutable size_t lastIndex = needle.length - 1;
    auto last = needle[lastIndex];
    auto first = needle[0];
    size_t skip = 0;
    for (size_t i = lastIndex, j = len - lastIndex - 1; i < len; ++i, --j)
    {
        if (last != haystack[i])
            goto back;
        for (size_t k = 0; ; ++k)
        {
            if (k == lastIndex) return haystack[i - lastIndex .. $];
            if (haystack[i - lastIndex + k] != needle[k]) break;
        }
    back:
        if (i > j) break;
        if (first != haystack[j])
            continue;
        for (size_t k = lastIndex; ; --k)
        {
            if (k == 0) return haystack[j .. $];
            if (haystack[j + k] != needle[k]) break;
        }
        if (skip == 0) skip = computeSkip(needle);
        i += skip;
        j -= skip - 1;
    }
    return haystack[$ .. $];
}

unittest
{
    string s1 = "Hello abc world!";
    assert(findUnique(s1, "abc") == "abc world!");
    assert(findUnique(s1, "def") == "");
    string s2 = "Ola, mundo!";
    assert(findUnique(s2, "do!") == "do!");
    string s3 = "abc";
    assert(findUnique(s3, "c") == "c");
    assert(findUnique(s3, "z") == "");
    string s4 = "aaabaaa";
    assert(findUnique(s4, "b") == "baaa");
}

I added it to the benchmarking suite (I had to turn off the error 
warnings, see description above). Naturally, it shines in the 
first test, as the needle is at the end of the text. The whole 
thing is to be taken with a grain of salt ;)

Search in Alice in Wonderland
        std: 535 ±8
     manual: 454 ±10
       qznc: 421 ±5
      Chris: 460 ±10
     Andrei: 643 ±19
    Andrei2: 314 ±6
    Andrei3: 352 ±5
     Biloop: 100 ±0
Search in random short strings
        std: 219 ±33
     manual: 194 ±47
       qznc: 123 ±11
      Chris: 219 ±69
     Andrei: 113 ±10
    Andrei2: 104 ±4
    Andrei3: 103 ±3
     Biloop: 195 ±46
Mismatch in random long strings
        std: 193 ±30
     manual: 249 ±103
       qznc: 156 ±30
      Chris: 260 ±108
     Andrei: 238 ±97
    Andrei2: 107 ±11
    Andrei3: 115 ±12
     Biloop: 141 ±44
Search random haystack with random needle
        std: 189 ±25
     manual: 193 ±57
       qznc: 152 ±26
      Chris: 203 ±59
     Andrei: 176 ±35
    Andrei2: 103 ±6
    Andrei3: 110 ±8
     Biloop: 141 ±28
  (avg slowdown vs fastest; absolute deviation)
CPU ID: GenuineIntel Intel(R) Core(TM) i7-4770 CPU   3.40GHz
Jun 02 2016
prev sibling parent reply Chris <wendlec tcd.ie> writes:
On Monday, 30 May 2016 at 22:16:27 UTC, qznc wrote:
 And Desktop:

 ./benchmark.ldc
        std: 129 ±24    +40 (3121)  -17 (6767)
     manual: 129 ±31    +59 (2668)  -21 (7244)
       qznc: 112 ±14    +30 (2542)   -9 (7312)
      Chris: 134 ±33    +58 (2835)  -23 (7068)
     Andrei: 123 ±27    +53 (2679)  -18 (7225)
  (avg slowdown vs fastest; absolute deviation)
 CPU ID: GenuineIntel Intel(R) Core(TM) i7-3770 CPU   3.40GHz

 ./benchmark.dmd
        std: 157 ±31    +44 (3693)  -24 (6234)
     manual: 143 ±41    +73 (2854)  -28 (7091)
       qznc: 116 ±21    +35 (3092)  -14 (6844)
      Chris: 181 ±50    +74 (3452)  -38 (6510)
     Andrei: 136 ±38    +64 (2975)  -27 (6953)
  (avg slowdown vs fastest; absolute deviation)
 CPU ID: GenuineIntel Intel(R) Core(TM) i7-3770 CPU   3.40GHz
Benchmark from desktop machine:

DMD:
        std: 164 ±34    +43 (4054)  -29 (5793)
     manual: 150 ±41    +72 (2889)  -29 (7032)
       qznc: 103 ±6     +42 ( 878)   -2 (9090)
      Chris: 205 ±43    +81 (2708)  -29 (7232)
     Andrei: 136 ±31    +53 (2948)  -22 (6977)
  (avg slowdown vs fastest; absolute deviation)
CPU ID: GenuineIntel Intel(R) Core(TM) i7-4770 CPU   3.40GHz

===

LDC:
        std: 138 ±23    +35 (3457)  -18 (6360)
     manual: 145 ±33    +45 (3748)  -27 (6181)
       qznc: 105 ±7     +17 (2267)   -4 (7534)
      Chris: 135 ±33    +56 (3061)  -23 (6882)
     Andrei: 121 ±27    +52 (2630)  -18 (7301)
  (avg slowdown vs fastest; absolute deviation)
CPU ID: GenuineIntel Intel(R) Core(TM) i7-4770 CPU   3.40GHz

On my laptop Andrei's was the fastest (see post above).
May 31 2016
parent reply Wyatt <wyatt.epp gmail.com> writes:
On Tuesday, 31 May 2016 at 08:43:59 UTC, Chris wrote:
 On Monday, 30 May 2016 at 22:16:27 UTC, qznc wrote:
 And Desktop:

 ./benchmark.ldc
        std: 129 ±24    +40 (3121)  -17 (6767)
     manual: 129 ±31    +59 (2668)  -21 (7244)
       qznc: 112 ±14    +30 (2542)   -9 (7312)
      Chris: 134 ±33    +58 (2835)  -23 (7068)
     Andrei: 123 ±27    +53 (2679)  -18 (7225)
  (avg slowdown vs fastest; absolute deviation)
 CPU ID: GenuineIntel Intel(R) Core(TM) i7-3770 CPU   3.40GHz

 ./benchmark.dmd
        std: 157 ±31    +44 (3693)  -24 (6234)
     manual: 143 ±41    +73 (2854)  -28 (7091)
       qznc: 116 ±21    +35 (3092)  -14 (6844)
      Chris: 181 ±50    +74 (3452)  -38 (6510)
     Andrei: 136 ±38    +64 (2975)  -27 (6953)
  (avg slowdown vs fastest; absolute deviation)
 CPU ID: GenuineIntel Intel(R) Core(TM) i7-3770 CPU   3.40GHz
 Benchmark from desktop machine:

 DMD:
        std: 164 ±34    +43 (4054)  -29 (5793)
     manual: 150 ±41    +72 (2889)  -29 (7032)
       qznc: 103 ±6     +42 ( 878)   -2 (9090)
      Chris: 205 ±43    +81 (2708)  -29 (7232)
     Andrei: 136 ±31    +53 (2948)  -22 (6977)
   (avg slowdown vs fastest; absolute deviation)
 CPU ID: GenuineIntel Intel(R) Core(TM) i7-4770 CPU   3.40GHz

 ===

 LDC:
        std: 138 ±23    +35 (3457)  -18 (6360)
     manual: 145 ±33    +45 (3748)  -27 (6181)
       qznc: 105 ±7     +17 (2267)   -4 (7534)
      Chris: 135 ±33    +56 (3061)  -23 (6882)
     Andrei: 121 ±27    +52 (2630)  -18 (7301)
   (avg slowdown vs fastest; absolute deviation)
 CPU ID: GenuineIntel Intel(R) Core(TM) i7-4770 CPU   3.40GHz

 On my laptop Andrei's was the fastest (see post above).
Comparing the chips involved, could it be cache related?

3M cache; Andrei wins both:
http://ark.intel.com/products/75459/Intel-Core-i5-4200U-Processor-3M-Cache-up-to-2_60-GHz

4M cache; qznc wins DMD (and is faster than the LDC's best? What?); Andrei wins LDC:
http://ark.intel.com/products/43560/Intel-Core-i7-620M-Processor-4M-Cache-2_66-GHz

8M cache; qznc wins both:
http://ark.intel.com/products/65719/Intel-Core-i7-3770-Processor-8M-Cache-up-to-3_90-GHz
http://ark.intel.com/products/75122/Intel-Core-i7-4770-Processor-8M-Cache-up-to-3_90-GHz

Normally, I'd expect the 4200U to be similar to the desktop parts. Unless... Say, for the laptops (and I guess the desktops too, but it's more important in a mobile), did you verify the CPU frequency scaling wasn't interfering?

-Wyatt
May 31 2016
parent qznc <qznc web.de> writes:
On Tuesday, 31 May 2016 at 17:31:29 UTC, Wyatt wrote:
 qznc wins DMD (and is faster than the LDC's best?
Careful! These are not absolute numbers, but relative slowdowns. You cannot compare the numbers between LDC and DMD.
May 31 2016
prev sibling parent Henrique bucher <henry vitorian.com> writes:
On Monday, 30 May 2016 at 18:20:39 UTC, Andrei Alexandrescu wrote:
 On 05/30/2016 05:31 AM, qznc wrote:
 On Sunday, 29 May 2016 at 21:07:21 UTC, qznc wrote:
 worthwhile to use word loads [0] instead. Really fancy would 
 be SSE.
I wrote a splitter in SSE4.2 some time ago as a contribution to a github project. Perhaps it is related.

https://github.com/HFTrader/string-splitting/blob/master/splithb2.cpp

Cheers,
Jul 11 2016
prev sibling next sibling parent reply Patrick Schluter <Patrick.Schluter bbox.fr> writes:
On Friday, 27 May 2016 at 14:41:29 UTC, Chris wrote:
 On Friday, 27 May 2016 at 14:06:09 UTC, Chris wrote:
 This outperforms both `manual_find` and the improved `std find`

 string findStringS_Manual(string haystack, string needle)
 {
 	
 	if (needle.length > haystack.length)
 		return haystack[$..$];
 	outer:
 	for (auto i = 0; i < haystack.length; i++)
 	{
 		if (haystack[i] != needle[0])
 			continue;
 		for (size_t j = i+1, k = 1; k < needle.length; ++j, ++k)
 			if (haystack[j] != needle[k])
 			continue outer;
 		return haystack[i..$];
 	}
 	return haystack[$..$];
 }
Little bug fix:

for (size_t j = (i+1 < haystack.length) ? i+1 : i, k = 1; k < needle.length; ++j, ++k)
The upper bound of the outer loop should be haystack.length - needle.length. There's no point in searching beyond that, as the needle would no longer fit, and it avoids the buffer overrun your code has when the first character of the needle is found past that bound. You can also remove the ?: test in the j loop, as that case is then handled automatically.
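The bound Patrick describes, in a minimal C sketch (a hypothetical naive search, not Chris's code): with the outer loop stopping at haystack.length - needle.length, the inner comparison can never read past the haystack, so no per-iteration limit check or ?: guard is needed.

```c
#include <stddef.h>

/* Naive search with the tightened outer bound: once fewer than nlen
   characters remain, the needle cannot fit, so scanning stops early
   and the inner loop needs no haystack-limit check. */
static const char *naive_find(const char *hay, size_t hlen,
                              const char *nee, size_t nlen)
{
    if (nlen == 0) return hay;
    if (nlen > hlen) return NULL;
    for (size_t i = 0; i <= hlen - nlen; ++i) {
        size_t k = 0;
        while (k < nlen && hay[i + k] == nee[k])
            ++k;                              /* safe: i + k < hlen */
        if (k == nlen)
            return hay + i;
    }
    return NULL;
}
```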
May 27 2016
parent reply Chris <wendlec tcd.ie> writes:
Oops, I've been found out. :-) Thanks. You're right of course, 
and I've already noticed that bug as it fails on not found. I got 
the bounds wrong.
May 27 2016
parent Patrick Schluter <Patrick.Schluter bbox.fr> writes:
On Friday, 27 May 2016 at 19:17:44 UTC, Chris wrote:
 Oops, I've been found out. :-) Thanks. You're right of course, 
 and I've already noticed that bug as it fails on not found. I 
 got the bounds wrong.
I had the same "bug" when I wrote my search function on the project at work. I also found out that using the simplest loop is generally the best way to go. Processors nowadays have loop caches, fast level-1 caches, etc.; all the fancy search algorithms like Boyer-Moore can only overcome their memory and initialisation overhead for really huge haystacks.

Here's the C code of my memmem() function that replaced the Boyer-Moore search I had before. This implementation outran the BM search for sizes up to several hundred megabytes on a SPARC64 VII+ machine. On x86-64 I'm sure it would be even worse.

void *memmem(const void *haystack, size_t haystack_len,
             const void *needle, size_t needle_len)
{
    if (!haystack_len || !needle_len || needle_len > haystack_len)
        return NULL;

    uint8_t u8 = *((uint8_t *) needle);
    const uint8_t *p = (const uint8_t *) haystack;

    for (size_t i = haystack_len - needle_len + 1; i; --i) {
        if (*p == u8 && !memcmp(p, needle, needle_len))
            return (void *) p;
        else
            p++;
    }
    return NULL;
}

(Sorry for posting C code, as I'm only beginning to learn D.)
May 27 2016
prev sibling parent reply qznc <qznc web.de> writes:
On Friday, 27 May 2016 at 14:41:29 UTC, Chris wrote:
 The improved `std find` comes very close to the manual function 
 above now, sometimes it's even faster or at least as fast.

 std find    	took     12573666
 manual find 	took      7418454
 my std find 	took      6903854 <===
 findStringS 	took      7166720
 findStringS_  	took      6579529 <===
I just updated my benchmark. It now checks lots of different scenarios instead of you having to specify one. You can only set the number of iterations now.

It generates a random scenario (haystack length, needle length, alphabet, etc.) and runs it once with all algorithms. Instead of recording the absolute runtime, it records the slowdown compared to the fastest of the three. Average those and also compute the absolute deviation (the second number). Example:

std find:    153 ±33
manual find: 112 ±19
my std find: 119 ±16

So manual find is on average 12% slower than the fastest, ±19%. I would interpret that as no significant difference between manual and improved find.
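The two numbers per line can be reproduced with a couple of lines of arithmetic. An illustrative C sketch (hypothetical helpers, not the actual harness):

```c
#include <math.h>

/* Relative slowdown: the fastest run maps to 100, a run taking 1.4x as
   long maps to 140 -- the first number in the benchmark output. */
static double slowdown(double t, double fastest)
{
    return 100.0 * t / fastest;
}

/* Mean absolute deviation of three scores -- the "+-" second number. */
static double mad3(double a, double b, double c)
{
    double mean = (a + b + c) / 3.0;
    return (fabs(a - mean) + fabs(b - mean) + fabs(c - mean)) / 3.0;
}
```

So a line like "manual find: 112 ±19" means an average scaled score of 112 with a mean absolute deviation of 19 across scenarios.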
May 27 2016
parent reply qznc <qznc web.de> writes:
Now added Chris' algo to the benchmark:

std find:    155 ±33
manual find: 112 ±19
qznc find:   122 ±18  <--- improved find
Chris find:  133 ±20  <--- findStringS_
May 27 2016
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 05/27/2016 04:39 PM, qznc wrote:
 Now added Chris' algo to the benchmark:

 std find:    155 ±33
 manual find: 112 ±19
 qznc find:   122 ±18  <--- improved find
 Chris find:  133 ±20  <--- findStringS_
Does any of these feature simultaneously:

* check the last element first
* consequently stop early when the length of the haystack falls below the length of the needle
* consequently no check against the haystack limit is necessary in the inner loop
* use only one induction variable

?

Andrei
May 27 2016
parent qznc <qznc web.de> writes:
On Friday, 27 May 2016 at 20:50:52 UTC, Andrei Alexandrescu wrote:
 On 05/27/2016 04:39 PM, qznc wrote:
 Now added Chris' algo to the benchmark:

 std find:    155 ±33
 manual find: 112 ±19
 qznc find:   122 ±18  <--- improved find
 Chris find:  133 ±20  <--- findStringS_
Does any of these feature simultaneously: * check the last element first * consequently stop early when the length of the haystack falls below the length of the needle * consequently no check against the haystack limit is necessary in the inner loop
Std find (currently in Phobos) and qznc find (see pull request) fulfill those three. Manual and Chris find are simpler.
 * use only one induction variable
None of those. I made such a variant of qznc find and it gives me the same numbers.
May 27 2016
prev sibling parent Chris <wendlec tcd.ie> writes:
On Friday, 27 May 2016 at 20:39:03 UTC, qznc wrote:
 Now added Chris' algo to the benchmark:

 std find:    155 ±33
 manual find: 112 ±19
 qznc find:   122 ±18  <--- improved find
 Chris find:  133 ±20  <--- findStringS_
Did you fix the bugs in my algorithm? S
May 28 2016
prev sibling next sibling parent qznc <qznc web.de> writes:
On Friday, 27 May 2016 at 14:06:09 UTC, Chris wrote:
 I think it's clear that `std find` is veeeery slow. If anyone 
 wants to test my function, please do so. I don't want to spread 
 wrong data, but this is what I get at the moment (ldc latest 
 version).
If you want to see find (current or my improved version) get really slow, then reduce the alphabet for the haystack (e.g. --alphabet=ab). ;)
May 27 2016
prev sibling parent reply David Nadlinger <code klickverbot.at> writes:
On Friday, 27 May 2016 at 14:06:09 UTC, Chris wrote:
 I have to correct myself. A manual loop is faster, I couldn't 
 believe it either, so I benchmarked the same function with 
 `foreach` and `for`. Turns out that `for` is _way_ faster.
Just about the only reason I could think of for this to happen is if the compiler fails to inline the range primitives from std.array. Otherwise, the loops should be pretty much equivalent to LLVM's optimiser.

This is so fundamental to D's strategic promise that we can't afford to get it wrong.

 — David
May 27 2016
parent Chris <wendlec tcd.ie> writes:
On Friday, 27 May 2016 at 21:31:48 UTC, David Nadlinger wrote:
 Just about the only reason I could think of for this to happen 
 is if the compiler fails to inline the range primitives from 
 std.array. Otherwise, the loops should be pretty much 
 equivalent to LLVM's optimiser.

 This is so fundamental to D's strategic promise that we can't 
 afford to get it wrong.

  — David
I agree, and it has been said on this forum that tight manual loops are faster than the range paradigm (I think bearophile was one of those who said so, if I remember correctly). Maybe we should benchmark this too, just in case.
May 28 2016
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/1/13 8:05 PM, cvk012c wrote:
 In my latest version of D script I didn't use splitter at all. I used
 string specific indexOf function. Still result is very bad. For text
 based protocols, such as SIP, performance of string manipulating
 functions is very important. Unfortunately, looks like it is not D
 strongest point at this time.
That conclusion would be hasty if not missing the whole point. You essentially measured the speed of one loop in various translators implementing various languages. Java code doing straight computation is on a par with C speed, no two ways about that. Python code using library primitives ain't no slouch either.

Performance tuning in these languages becomes more difficult in larger applications where data layout, allocation, and indirect function calls start to dominate. The claim of systems languages being fast materializes only when you need to optimize data layout and nonstandard, custom algorithms. At that point systems languages give you additional options, whereas with high-level languages optimization becomes a very difficult proposition.


Andrei
Mar 02 2013
next sibling parent reply Russel Winder <russel winder.org.uk> writes:
On Sat, 2013-03-02 at 10:33 -0500, Andrei Alexandrescu wrote:
[…]
 That conclusion would be hasty if not missing the whole point. You 
 essentially measured the speed of one loop in various translators 
 implementing various languages. Java code doing straight computation is 
 on a par with C speed, no two ways about that. Python code using library 
 primitives ain't no slouch either. Performance tuning in these languages 
 becomes more difficult in larger applications where data layout, 
 allocation, and indirect function calls start to dominate.
[…]

Interestingly, there isn't only one Python implementation. There is only one language but there is CPython, PyPy, Jython, IronPython, to mention but 4. On computationally intensive code, PyPy (Python execution environment in RPython) is generally 10 to 30 times faster than CPython (Python execution environment written in C).

C is a (reasonably) well known and used language thought to create fast code. RPython is Python but with some restrictions that is statically compiled. For writing interpreters, RPython spanks C. PyPy is not the only language using RPython to implement the interpreter. C's days in this game are seriously numbered.

-- 
Russel.
===========================================================================
Dr Russel Winder      t: +44 20 7585 2200   voip: sip:russel.winder ekiga.net
41 Buckmaster Road    m: +44 7770 465 077   xmpp: russel winder.org.uk
London SW11 1EN, UK   w: www.russel.org.uk  skype: russel_winder
Mar 02 2013
next sibling parent reply "John Colvin" <john.loughran.colvin gmail.com> writes:
On Saturday, 2 March 2013 at 15:43:57 UTC, Russel Winder wrote:
 On Sat, 2013-03-02 at 10:33 -0500, Andrei Alexandrescu wrote:
 […]
 That conclusion would be hasty if not missing the whole point. 
 You essentially measured the speed of one loop in various 
 translators implementing various languages. Java code doing 
 straight computation is on a par with C speed, no two ways 
 about that. Python code using library primitives ain't no 
 slouch either. Performance tuning in these languages becomes 
 more difficult in larger applications where data layout, 
 allocation, and indirect function calls start to dominate.
[…] Interestingly, there isn't only one Python implementation. There is only one language but there is CPython, PyPy, Jython, IronPython, to mention but 4. On computationally intensive code, PyPy (Python execution environment in RPython) is generally 10 to 30 times faster than CPython (Python execution environment written in C). C is a (reasonably) well known and used language thought to create fast code. RPython is Python but with some restrictions that is statically compiled. For writing interpreters, RPython spanks C. PyPy is not the only language using RPython to implement the interpreter. C's days in this game are seriously numbered.
I'm not sure that's entirely fair. PyPy is fast because it implements a JIT, not because it's written in RPython.
Mar 02 2013
parent Russel Winder <russel winder.org.uk> writes:
On Sat, 2013-03-02 at 16:55 +0100, John Colvin wrote:
[…]
 I'm not sure that's entirely fair. PyPy is fast because it 
 implements a JIT, not because it's written in RPython.

The second sentence is almost entirely true; the first I am not so sure about. Armin Rigo used to manage Psyco, which was an amendment to the CPython code base to include a JIT. He quit trying to keep that going, and so CPython doesn't have a JIT of any real importance. Unladen Swallow disappeared. The JIT framework is an integral part of RPython rather than being specific only to PyPy.

The point is that for I/O-bound systems all languages are more or less equivalent; for computationally intensive code, JITs are needed for virtual machine based systems or they will not compete.

-- 
Russel.
===========================================================================
Dr Russel Winder      t: +44 20 7585 2200   voip: sip:russel.winder ekiga.net
41 Buckmaster Road    m: +44 7770 465 077   xmpp: russel winder.org.uk
London SW11 1EN, UK   w: www.russel.org.uk  skype: russel_winder
Mar 02 2013
prev sibling next sibling parent reply "Peter Alexander" <peter.alexander.au gmail.com> writes:
On Saturday, 2 March 2013 at 15:43:57 UTC, Russel Winder wrote:
 C is a (reasonably) well known and used language thought to 
 create fast
 code. RPython is Python but with some restrictions that is 
 statically
 compiled.  For writing interpreters, RPython spanks C. PyPy is 
 not the
 only language using RPython to implement the interpreter. C's 
 days in
 this game are seriously numbered.
Source? (Googling "rpython interpreter speed" didn't show anything.) There are always claims that systems like RPython, or Haskell, or LuaJIT, or Java HotSpot are able to rival or beat C, yet all I ever see are small toy programs. Show me something big, something with hundreds of thousands of lines of code. If all this is true, where are the video games written in these languages that rival the performance/fidelity of AAA titles? I've never seen that. Ever.
Mar 02 2013
next sibling parent "John Colvin" <john.loughran.colvin gmail.com> writes:
On Saturday, 2 March 2013 at 16:06:57 UTC, Peter Alexander wrote:
 On Saturday, 2 March 2013 at 15:43:57 UTC, Russel Winder wrote:
 C is a (reasonably) well known and used language thought to 
 create fast
 code. RPython is Python but with some restrictions that is 
 statically
 compiled.  For writing interpreters, RPython spanks C. PyPy is 
 not the
 only language using RPython to implement the interpreter. C's 
 days in
 this game are seriously numbered.
Source? (Googling "rpython interpreter speed" didn't show anything) There's always claims that systems like RPython, or Haskell, or LuaJIT, or Java HotSpot are able to rival or beat C, yet all I ever see are small toy programs. Show me something big. Something with 100's of thousands of lines of code. If all this is true, where are the video games written in these languages that rival performance/fidelity of AAA titles? I've never seen that. Ever.
RPython has been used to write a very fast JIT (PyPy), which is not the same thing as an interpreter. The interpreter is what you fall back to when you can't use the JIT. I don't know how fast PyPy's interpreter is specifically, but it's not its main selling point.
Mar 02 2013
prev sibling parent reply Russel Winder <russel winder.org.uk> writes:
On Sat, 2013-03-02 at 17:06 +0100, Peter Alexander wrote:
[…]
 Source? (Googling "rpython interpreter speed" didn't show anything)
For me, my source is my microbenchmarks, which test just loops with computation, i.e. compute-intensive, data-parallel code. There is plenty of other stuff.
 There's always claims that systems like RPython, or Haskell, or=20
 LuaJIT, or Java HotSpot are able to rival or beat C, yet all I=20
 ever see are small toy programs. Show me something big. Something=20
 with 100's of thousands of lines of code.
100,000+ lines of code is a total irrelevance regarding system performance, as anyone who knows anything of profiling and benchmarking will tell you. If C is the only viable language, why has Java got the majority of the commercial market? C++ and Fortran have the vast majority of the computational world.
 If all this is true, where are the video games written in these=20
 languages that rival performance/fidelity of AAA titles? I've=20
 never seen that. Ever.
Where is D in this arena? It is either C or C++ in the high performance graphics arena. If there are no AAA titles written in D, why does this email list have anyone on it? Clearly I think your response is just trolling; I will stop feeding the troll.
--
Russel.
Mar 02 2013
parent "Peter Alexander" <peter.alexander.au gmail.com> writes:
On Saturday, 2 March 2013 at 19:19:09 UTC, Russel Winder wrote:
 100,000+ lines of code is a total irrelevance regarding system
 performance, as anyone who knows anything of profiling and 
 benchmarking
 will tell you.
I don't care about benchmarking, I care about writing large, fast systems. Small toy programs are irrelevant because they are not representative of real programs. Real programs are large (many thousands or millions of lines of code) and you don't have the luxury of being able to hand optimize every line of code.
 If C is the only viable language why has Java got the majority 
 of the
 commercial market. C++ and Fortran have the vast majority of the
 computational world.
Commercial market for what? For high performance applications C++ is still champion.
 If all this is true, where are the video games written in 
 these languages that rival performance/fidelity of AAA titles? 
 I've never seen that. Ever.
Where is D in this arena? It is either C or C++ in the high performance graphics arena. If there are no AAA titles written in D, why does this email list have anyone on it?
D is getting there. Remedy Games are sponsoring DConf, and I predict we'll start to see some high quality indie games over the next few years written in D. Up till now, I imagine language stability has been the main thing holding back D. Tomasz was working on a high quality renderer before he got fed up with D's instability (http://h3.gd/code/nucleus/).
Mar 02 2013
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 3/2/2013 7:43 AM, Russel Winder wrote:
 For writing interpreters, RPython spanks C.
What's RPython doing that makes it faster?
Mar 02 2013
parent reply Russel Winder <russel winder.org.uk> writes:
On Sat, 2013-03-02 at 11:30 -0800, Walter Bright wrote:
 On 3/2/2013 7:43 AM, Russel Winder wrote:
 For writing interpreters, RPython spanks C.
What's RPython doing that makes it faster?
Allowing PyPy to have a good JIT compiler.
--
Russel.
Mar 02 2013
next sibling parent "jerro" <a a.com> writes:
On Saturday, 2 March 2013 at 19:42:47 UTC, Russel Winder wrote:
 On Sat, 2013-03-02 at 11:30 -0800, Walter Bright wrote:
 On 3/2/2013 7:43 AM, Russel Winder wrote:
 For writing interpreters, RPython spanks C.
What's RPython doing that makes it faster?
Allowing PyPy to have a good JIT compiler.
And why do you think this wouldn't be possible if it was written in C? There are fast JITs written in C.
Mar 02 2013
prev sibling next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 3/2/2013 11:42 AM, Russel Winder wrote:
 On Sat, 2013-03-02 at 11:30 -0800, Walter Bright wrote:
 On 3/2/2013 7:43 AM, Russel Winder wrote:
 For writing interpreters, RPython spanks C.
What's RPython doing that makes it faster?
Allowing PyPy to have a good JIT compiler.
I don't understand. Does that JIT generate faster code than a C compiler would generate?
Mar 02 2013
next sibling parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Sat, Mar 02, 2013 at 12:02:08PM -0800, Walter Bright wrote:
 On 3/2/2013 11:42 AM, Russel Winder wrote:
On Sat, 2013-03-02 at 11:30 -0800, Walter Bright wrote:
On 3/2/2013 7:43 AM, Russel Winder wrote:
For writing interpreters, RPython spanks C.
What's RPython doing that makes it faster?
Allowing PyPy to have a good JIT compiler.
I don't understand. Does that JIT generate faster code than a C compiler would generate?
I don't know, but my wild guess is that a JIT optimizes the *right* hotspots based on real-time performance measurements, whereas a lot of C programmers are obsessed with optimizing what they *think* are the hotspots, but which really aren't. T -- Everybody talks about it, but nobody does anything about it! -- Mark Twain
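The "measure, don't guess" point is easy to demonstrate with Python's own profiler. A minimal sketch (the `hot`/`cold` function names and workload are invented for illustration): the report ranks functions by where time actually went, rather than where one guessed it would go.

```python
import cProfile
import io
import pstats

def hot():
    # The real hotspot: dominated by the arithmetic loop.
    return sum(i * i for i in range(10_000))

def cold():
    # Looks suspicious in the source, but is actually cheap.
    return len("REGISTER sip:example.com SIP/2.0".split())

def work():
    for _ in range(100):
        hot()
        cold()

# Profile the workload and render the top entries by cumulative time.
profiler = cProfile.Profile()
profiler.enable()
work()
profiler.disable()

buffer = io.StringIO()
pstats.Stats(profiler, stream=buffer).sort_stats("cumulative").print_stats(10)
print(buffer.getvalue())
```

The profile makes `hot` stand out immediately, which is exactly the feedback a JIT gets for free at run time.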
Mar 02 2013
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 3/2/2013 12:08 PM, H. S. Teoh wrote:
 On Sat, Mar 02, 2013 at 12:02:08PM -0800, Walter Bright wrote:
 On 3/2/2013 11:42 AM, Russel Winder wrote:
 On Sat, 2013-03-02 at 11:30 -0800, Walter Bright wrote:
 On 3/2/2013 7:43 AM, Russel Winder wrote:
 For writing interpreters, RPython spanks C.
What's RPython doing that makes it faster?
Allowing PyPy to have a good JIT compiler.
I don't understand. Does that JIT generate faster code than a C compiler would generate?
I don't know, but my wild guess is that a JIT optimizes the *right* hotspots based on real-time performance measurements, whereas a lot of C programmers are obsessed with optimizing what they *think* are the hotspots, but which really aren't.
I meant what the C *compiler* generates.
Mar 02 2013
parent reply Russel Winder <russel winder.org.uk> writes:
On Sat, 2013-03-02 at 12:52 -0800, Walter Bright wrote:
 On 3/2/2013 12:08 PM, H. S. Teoh wrote:
 On Sat, Mar 02, 2013 at 12:02:08PM -0800, Walter Bright wrote:
 On 3/2/2013 11:42 AM, Russel Winder wrote:
 On Sat, 2013-03-02 at 11:30 -0800, Walter Bright wrote:
 On 3/2/2013 7:43 AM, Russel Winder wrote:
 For writing interpreters, RPython spanks C.
What's RPython doing that makes it faster?
Allowing PyPy to have a good JIT compiler.
I don't understand. Does that JIT generate faster code than a C compiler would generate?
 I don't know, but my wild guess is that a JIT optimizes the *right*
 hotspots based on real-time performance measurements, whereas a lot of C
 programmers are obsessed with optimizing what they *think* are the
 hotspots, but which really aren't.
I meant what the C *compiler* generates.
Yes, because the C/C++/D/etc. compilers are attempting to predict the control flow of the program in execution and optimize all cases for all possibilities. JITs are just focussing on the runtime bottlenecks with the actual data as being used. This allows for more focussed code generation in the actual context. I would suspect that in many cases the generated code is effectively the same, but JITs can often generate unexpected and faster code because they have more data to optimize with and less optimization to do. To say more would require actual comparative data, and I suspect no-one on this list will do that.
--
Russel.
Mar 03 2013
next sibling parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
03-Mar-2013 12:03, Russel Winder wrote:
 On Sat, 2013-03-02 at 12:52 -0800, Walter Bright wrote:
 On 3/2/2013 12:08 PM, H. S. Teoh wrote:
 On Sat, Mar 02, 2013 at 12:02:08PM -0800, Walter Bright wrote:
 On 3/2/2013 11:42 AM, Russel Winder wrote:
 On Sat, 2013-03-02 at 11:30 -0800, Walter Bright wrote:
 On 3/2/2013 7:43 AM, Russel Winder wrote:
 For writing interpreters, RPython spanks C.
What's RPython doing that makes it faster?
Allowing PyPy to have a good JIT compiler.
I don't understand. Does that JIT generate faster code than a C compiler would generate?
I don't know, but my wild guess is that a JIT optimizes the *right* hotspots based on real-time performance measurements, whereas a lot of C programmers are obsessed with optimizing what they *think* are the hotspots, but which really aren't.
I meant what the C *compiler* generates.
Yes because the C/C++/D/etc. compilers are attempting to predict the control flow of the program in execution and optimize all cases for all possibilities. JITs are just focussing on the runtime bottlenecks with the actual data as being used. This allows for more focussed code generation in the actual context. I would suspect that in many cases the generated code is effectively the same but JITs can often generate unexpected and faster code because they have more data to optimize with and less optimization to do.
The only thing missing here is that a JIT typically has much less time to do any of this.
 To say more would require actual comparative data and I suspect no-one
 on this list will do that.
Indeed unlikely. -- Dmitry Olshansky
Mar 03 2013
prev sibling next sibling parent "deadalnix" <deadalnix gmail.com> writes:
On Sunday, 3 March 2013 at 08:03:53 UTC, Russel Winder wrote:
 Yes because the C/C++/D/etc. compilers are attempting to 
 predict the
 control flow of the program in execution and optimize all cases 
 for all
 possibilities. JIT's are just focussing on the runtime 
 bottlenecks with
 the actual data as being used. This allows for more focussed 
 code
 generation in the actual context. I would suspect that in many 
 cases the
 generated code is effectively the same but JITs can often do 
 unexpected
 and faster codes because they have more data to optimize with 
 and less
 optimization to do.
That is the theory, but in practice it doesn't work that well: you have to instrument the code to get the measurements needed to optimize according to runtime data. Plus, you can't simply monkey-patch the generated code (it is unspecified how x86 CPUs load instructions, so it isn't guaranteed that the CPU won't see garbage). That cost, plus the cost of the optimization itself, makes the whole thing worthwhile only if the win from optimizing is high.
 To say more would require actual comparative data and I suspect 
 no-one
 on this list will do that.
It isn't worth it. As explained above, you can conclude that this is highly dependent on the code you manipulate and the runtime data thrown at it. Note that I discussed with people from LLVM how this can be done in statically compiled code. In fact it can, but it is rather complicated and not worth automating for now.
Mar 03 2013
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 3/3/2013 12:03 AM, Russel Winder wrote:
 Yes because the C/C++/D/etc. compilers are attempting to predict the
 control flow of the program in execution and optimize all cases for all
 possibilities. JIT's are just focussing on the runtime bottlenecks with
 the actual data as being used. This allows for more focussed code
 generation in the actual context. I would suspect that in many cases the
 generated code is effectively the same but JITs can often do unexpected
 and faster codes because they have more data to optimize with and less
 optimization to do.
I've heard this was the advantage of JITs for the last 15 years. I haven't seen any data to justify it, and I have trouble thinking of any significant improvements to code gen that could be made with runtime data. That is for JITting statically typed languages. The situation is a bit better for dynamic ones, because the JIT can do a better job than a compiler because most of the time, only one type is used in a particular spot, and the JIT can pick that up and "statically" type it, as opposed to a compiler which cannot.
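The point about a JIT "statically" typing a dynamic call site can be modeled in a few lines of Python. This is a toy sketch of type feedback: the `specialize`/`CallSite` names are invented, and a real JIT would emit machine code for the observed type rather than pick a lambda.

```python
def specialize(t):
    """Pretend compiler: return a version of 'x + 1' specialized for type t."""
    if t is int:
        # Monomorphic fast path: would compile down to a plain machine add.
        return lambda x: x + 1
    # Generic fallback: build a "1" of the observed type and combine.
    return lambda x: x + t("1")

class CallSite:
    """Records the type seen at one call site and caches a specialized body."""
    def __init__(self):
        self.cached_type = None
        self.compiled = None

    def call(self, x):
        t = type(x)
        if t is not self.cached_type:
            # First call, or the type changed: "deoptimize" and respecialize.
            self.cached_type = t
            self.compiled = specialize(t)
        return self.compiled(x)

site = CallSite()
print(site.call(41))   # int fast path -> 42
print(site.call("4"))  # respecialized for str -> "41"
```

Because most call sites in real dynamic programs see only one type, the cached fast path is taken almost always, which is what lets a JIT approach statically compiled speed.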
Mar 03 2013
parent reply "bearophile" <bearophileHUGS lycos.com> writes:
Walter Bright:

 and I have trouble thinking of any significant improvements to 
 code gen that could be made with runtime data.
I've seen that Java HotSpot is able to unroll loops with a length known only at run time. It unrolls them only when the run-time statistics say it's advantageous. (In theory the same is possible with profile-driven optimization.) Bye, bearophile
Mar 03 2013
parent "SomeDude" <lovelydear mailmetrash.com> writes:
On Sunday, 3 March 2013 at 11:57:36 UTC, bearophile wrote:
 Walter Bright:

 and I have trouble thinking of any significant improvements to 
 code gen that could be made with runtime data.
I've seen that the JavaHotSpot is able to unroll loops with a length known only at run-time. It unrolls them only when the run-time statistics say it's advantageous. (In theory the same is possible with profile-driven optimization). Bye, bearophile
I believe the HotSpot runtime compiler is about the only one to perform runtime optimizations like that. I don't know if the CIL does this kind of thing.
Mar 03 2013
prev sibling parent "John Colvin" <john.loughran.colvin gmail.com> writes:
On Saturday, 2 March 2013 at 20:02:09 UTC, Walter Bright wrote:
 On 3/2/2013 11:42 AM, Russel Winder wrote:
 On Sat, 2013-03-02 at 11:30 -0800, Walter Bright wrote:
 On 3/2/2013 7:43 AM, Russel Winder wrote:
 For writing interpreters, RPython spanks C.
What's RPython doing that makes it faster?
Allowing PyPy to have a good JIT compiler.
I don't understand. Does that JIT generate faster code than a C compiler would generate?
In some cases yes, it can, but as you know that has nothing to do with the language the JIT is written in. As I've said elsewhere, PyPy being fast compared to CPython has nothing to do with the former being written in RPython and the latter in C* *other than perhaps the higher productivity of writing in RPython allowing more resources to be dedicated to optimisation.
Mar 02 2013
prev sibling parent "bearophile" <bearophileHUGS lycos.com> writes:
Russel Winder:

 Allowing PyPy to have a good JIT compiler.
PyPy is not just a JIT, it's a more complex beast. It's more like a meta-JIT. The only one I know of. Bye, bearophile
Mar 04 2013
prev sibling parent reply "cvk012c" <cvk012c motorolasolutions.com> writes:
On Saturday, 2 March 2013 at 15:33:42 UTC, Andrei Alexandrescu 
wrote:
 On 3/1/13 8:05 PM, cvk012c wrote:
 In my latest version of D script I didn't use splitter at all. 
 I used
 string specific indexOf function. Still result is very bad. 
 For text
 based protocols, such as SIP, performance of string 
 manipulating
 functions is very important. Unfortunately, looks like it is 
 not D
 strongest point at this time.
That conclusion would be hasty if not missing the whole point. You essentially measured the speed of one loop in various translators implementing various languages. Java code doing straight computation is on a par with C speed, no two ways about that. Python code using library primitives ain't no slouch either.
That is true, but what prevents D from doing the same? Why can Java code be on par with C speed while D cannot? If the Java and Python folks can do it, why can't the D creators? If performance optimization is always a number one priority in the Java and Python communities, why is it not so for D? I will go even further: what prevents D from being faster than C? That's where D will really shine and people will start talking about it. But that is probably a distant future for D, and for now my personal opinion is that NO part of D should be slower than at least Java. If it is, it should be reported as an issue and fixed. Being more than twice slower than Java is a failure.
Mar 02 2013
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 3/2/2013 11:14 AM, cvk012c wrote:
 That is true, but what prevents D from doing the same?
Nothing.
 I will go even further: what prevents D from being faster than C?
Nothing. In fact, there are semantics in D that allow faster code to be generated than can be for C. They are unexploited at the moment, but they are there.
Mar 02 2013
parent "bearophile" <bearophileHUGS lycos.com> writes:
Walter Bright:

 there are semantics in D that allow faster code to be generated 
 than can be for C. They are unexploited at the moment, but they 
 are there.
This is interesting. Maybe you should write a small article on this. Bye, bearophile
Mar 04 2013
prev sibling parent reply "deadalnix" <deadalnix gmail.com> writes:
On Saturday, 2 March 2013 at 00:47:02 UTC, Steven Schveighoffer 
wrote:
 Try my hand-written version (elsewhere in thread).  I think it 
 can be done better too (use pointers instead of arrays).
That is usually a bad idea, as it will mess up the compiler's aliasing analysis pretty badly.
Mar 02 2013
parent "bearophile" <bearophileHUGS lycos.com> writes:
deadalnix:

 That is usually a bad idea as it will fuck up pretty bad the 
 aliasing analysis capabilities of the compiler.
With LDC I've seen that using raw pointers sometimes gives a little extra performance, but with DMD I've sometimes seen lower performance. Bye, bearophile
Mar 02 2013
prev sibling next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/1/13 4:28 PM, cvk012c wrote:
 On my hardware with -inline options it now takes about 15 secs which is
 still slower than Python but with both -inline and -noboundscheck it
 takes 13 secs and finally beats Python.
 But I still kind of disappointed because I expected a much better
 performance boost and got only 7%. Counting that Python is not the
 fastest scripting language I think that similar Perl and Java scripts
 will outperform D easily.
I doubt that.

1. Microbenchmarks are a crapshoot: they exercise a tiny portion of the language and library.

2. With Python, after comparing 2-3 idioms, well, this is pretty much it. We doubled the speed in no time by just tuning options. D being a systems language allows you to take your code to a million places if you want to optimize.

3. split has been discussed and improved for years in the Python community (just google for e.g. python split performance).

Perl isn't all that fast actually, try this (takes 30+ seconds):

$message = "REGISTER sip:comm.example.com SIP/2.0\r
Content-Length: 0\r
Contact: <sip:12345 10.1.3.114:59788;transport=tls>;expires=4294967295;events=\"message-summary\";q=0.9\r
To: <sip:12345 comm.example.com>\r
User-Agent: (\"VENDOR=MyCompany\" \"My User Agent\")\r
Max-Forwards: 70\r
CSeq: 1 REGISTER\r
Via: SIP/2.0/TLS 10.1.3.114:59788;branch=z9hG4bK2910497772630690\r
Call-ID: 2910497622026445\r
From: <sip:12345 comm.example.com>;tag=2910497618150713\r\n\r\n";

foreach my $i (0 .. 10000000) {
    foreach my $notused (split(/\r\n/, $message)) {
    }
}

Andrei
Mar 01 2013
parent reply Jacob Carlborg <doob me.com> writes:
On 2013-03-01 23:01, Andrei Alexandrescu wrote:

 I doubt that.

 1. Microbenchmarks are a crapshoot, they exercise a tiny portion of the
 language and library.
If that tiny portion is what's only used, then the rest doesn't matter.
 2. With Python, after comparing 2-3 idioms - well, this is pretty much
 it. We doubled the speed in no time by just tuning options. D being a
 systems language allows you take your code to a million places if you
 want to optimize.
Perhaps we should look over the default configuration. -- /Jacob Carlborg
Mar 02 2013
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/2/13 11:11 AM, Jacob Carlborg wrote:
 On 2013-03-01 23:01, Andrei Alexandrescu wrote:

 I doubt that.

 1. Microbenchmarks are a crapshoot, they exercise a tiny portion of the
 language and library.
If that tiny portion is what's only used, then the rest doesn't matter.
Agreed. But I doubt that's the case in nontrivial applications. Andrei
Mar 02 2013
prev sibling next sibling parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 03/01/2013 10:28 PM, cvk012c wrote:
 ...

 On my hardware with -inline options it now takes about 15 secs which is
 still slower than Python but with both -inline and -noboundscheck it
 takes 13 secs and finally beats Python.
 But I still kind of disappointed because I expected a much better
 performance boost and got only 7%. Counting that Python is not the
 fastest scripting language I think that similar Perl and Java scripts
 will outperform D easily.
Never make such statements without doing actual measurements. Furthermore, it is completely meaningless anyway. Performance benchmarks always compare language implementations, not languages. (Case in point: You get twice the speed by using another compiler backend implementation.)
Mar 01 2013
parent "SomeDude" <lovelydear mailmetrash.com> writes:
On Friday, 1 March 2013 at 22:02:02 UTC, Timon Gehr wrote:
 On 03/01/2013 10:28 PM, cvk012c wrote:
 ...

 On my hardware with -inline options it now takes about 15 secs 
 which is
 still slower than Python but with both -inline and 
 -noboundscheck it
 takes 13 secs and finally beats Python.
 But I still kind of disappointed because I expected a much 
 better
 performance boost and got only 7%. Counting that Python is not 
 the
 fastest scripting language I think that similar Perl and Java 
 scripts
 will outperform D easily.
Never make such statements without doing actual measurements. Furthermore, it is completely meaningless anyway. Performance benchmarks always compare language implementations, not languages. (Case in point: You get twice the speed by using another compiler backend implementation.)
Still, there is a case to be made for a performance test suite that could be run after (or before) each release of the language, like http://speed.pypy.org .
Mar 01 2013
prev sibling next sibling parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Fri, 01 Mar 2013 16:28:08 -0500, cvk012c  
<cvk012c motorolasolutions.com> wrote:

 On my hardware with -inline options it now takes about 15 secs which is  
 still slower than Python but with both -inline and -noboundscheck it  
 takes 13 secs and finally beats Python.
 But I still kind of disappointed because I expected a much better  
 performance boost and got only 7%. Counting that Python is not the  
 fastest scripting language I think that similar Perl and Java scripts  
 will outperform D easily.
 Thanks Andrei and simendsjo for a quick response though.
Phobos kind of refuses to treat strings like arrays of characters; it insists on decoding. With DMD and a hand-written splitter, it takes 6 seconds instead of 10 on my system (64-bit macosx).

struct MySplitter
{
    private string s;
    private string source;

    this(string src)
    {
        source = src;
        popFront();
    }

    @property string front() { return s; }
    @property bool empty() { return s.ptr == null; }

    void popFront()
    {
        s = source;
        if (!source.length)
        {
            source = null;
        }
        else
        {
            size_t i = 0;
            bool found = false;
            for (; i + 1 < source.length; i++)
            {
                if (source[i] == '\r' && source[i + 1] == '\n')
                {
                    found = true;
                    break;
                }
            }
            s = source[0 .. i];
            if (found)
                source = source[i + 2 .. $];
            else
                source = source[$ .. $];
        }
    }
}

I'm sure splitter could be optimized to do the same thing I'm doing. Probably can reduce that a bit using pointers instead of strings.

-Steve
Mar 01 2013
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/1/13 5:31 PM, Steven Schveighoffer wrote:
 Phobos kind of refuses to treat strings like arrays of characters, it
 insists on decoding.
There's no decoding in find and splitter as far as I remember. Andrei
Mar 01 2013
parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Fri, 01 Mar 2013 17:35:04 -0500, Andrei Alexandrescu  
<SeeWebsiteForEmail erdani.org> wrote:

 On 3/1/13 5:31 PM, Steven Schveighoffer wrote:
 Phobos kind of refuses to treat strings like arrays of characters, it
 insists on decoding.
There's no decoding in find and splitter as far as I remember.
Looking at splitter, it uses skipOver to skip over the separator, and that seems to call R.front and R.popFront. Actually, it calls them for both the source string and the separator. Maybe fixing skipOver would fix the problem? Or don't call skipOver at all: since find isn't doing decoding, why do decoding when skipping the separator? I still feel we need a specialized string type... -Steve
Mar 01 2013
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/1/13 5:47 PM, Steven Schveighoffer wrote:
 On Fri, 01 Mar 2013 17:35:04 -0500, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:

 On 3/1/13 5:31 PM, Steven Schveighoffer wrote:
 Phobos kind of refuses to treat strings like arrays of characters, it
 insists on decoding.
There's no decoding in find and splitter as far as I remember.
Looking at splitter, it uses skipOver to skip over the separator, and that seems to call R.front and R.popFront. Actually, it calls it for both the source string and the separator.
You may be looking at the wrong splitter overload. Andrei
Mar 01 2013
parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Fri, 01 Mar 2013 18:07:45 -0500, Andrei Alexandrescu  
<SeeWebsiteForEmail erdani.org> wrote:

 On 3/1/13 5:47 PM, Steven Schveighoffer wrote:
 On Fri, 01 Mar 2013 17:35:04 -0500, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:

 On 3/1/13 5:31 PM, Steven Schveighoffer wrote:
 Phobos kind of refuses to treat strings like arrays of characters, it
 insists on decoding.
There's no decoding in find and splitter as far as I remember.
Looking at splitter, it uses skipOver to skip over the separator, and that seems to call R.front and R.popFront. Actually, it calls it for both the source string and the separator.
You may be looking at the wrong splitter overload.
Yes, I was. Very difficult to tell with the way template constraints are written! So it's just pure heuristics. Not hard to see how that would affect a 10 million cycle program. We may be able to create a string-specific version of splitter that would take advantage of the representation. -Steve
Mar 01 2013
parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Fri, 01 Mar 2013 18:22:54 -0500, Steven Schveighoffer  
<schveiguy yahoo.com> wrote:

 So it's just pure heuristics.  Not hard to see how that would affect a  
 10 million cycle program.

 We may be able to create a string-specific version of splitter that  
 would take advantage of the representation.
Just found a disturbing artifact: splitter(message, '\n') is more than twice as slow as splitter(message, "\n") -Steve
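Differences like this are easy to quantify with a small timing harness. Python has no char-vs-string overload distinction, so as a stand-in the sketch below times two different implementations of the same split (str.split vs. a precompiled regex); the message is abbreviated and the iteration count reduced from the original 10 million.

```python
import re
import timeit

# Abbreviated stand-in for the SIP REGISTER message from the benchmark.
message = (
    "REGISTER sip:example.com SIP/2.0\r\n"
    "Content-Length: 0\r\n"
    "CSeq: 1 REGISTER\r\n"
    "\r\n"
)

splitter_re = re.compile(r"\r\n")
n = 50_000

# Time each variant over the same number of iterations and compare totals.
t_str = timeit.timeit(lambda: message.split("\r\n"), number=n)
t_re = timeit.timeit(lambda: splitter_re.split(message), number=n)

print(f"str.split: {t_str:.4f}s, re.split: {t_re:.4f}s over {n} iterations")
```

Both variants return identical lists, so any timing gap is pure implementation overhead, which is the same kind of artifact as the char/string splitter difference above.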
Mar 01 2013
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/1/13 6:26 PM, Steven Schveighoffer wrote:
 On Fri, 01 Mar 2013 18:22:54 -0500, Steven Schveighoffer
 <schveiguy yahoo.com> wrote:

 So it's just pure heuristics. Not hard to see how that would affect a
 10 million cycle program.

 We may be able to create a string-specific version of splitter that
 would take advantage of the representation.
Just found a disturbing artifact: splitter(message, '\n') is more than twice as slow as splitter(message, "\n") -Steve
That is a problem. Could you please file a but report? Thanks, Andrei
Mar 02 2013
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/2/13 12:20 PM, Andrei Alexandrescu wrote:
 On 3/1/13 6:26 PM, Steven Schveighoffer wrote:
 On Fri, 01 Mar 2013 18:22:54 -0500, Steven Schveighoffer
 <schveiguy yahoo.com> wrote:

 So it's just pure heuristics. Not hard to see how that would affect a
 10 million cycle program.

 We may be able to create a string-specific version of splitter that
 would take advantage of the representation.
Just found a disturbing artifact: splitter(message, '\n') is more than twice as slow as splitter(message, "\n") -Steve
That is a problem. Could you please file a but report?
s/but/bug/ :o) Andrei
Mar 02 2013
parent "Zach the Mystic" <reachBUTMINUSTHISzach gOOGLYmail.com> writes:
On Saturday, 2 March 2013 at 17:21:17 UTC, Andrei Alexandrescu
wrote:
 On 3/2/13 12:20 PM, Andrei Alexandrescu wrote:
 On 3/1/13 6:26 PM, Steven Schveighoffer wrote:
 On Fri, 01 Mar 2013 18:22:54 -0500, Steven Schveighoffer
 <schveiguy yahoo.com> wrote:

 So it's just pure heuristics. Not hard to see how that would 
 affect a
 10 million cycle program.

 We may be able to create a string-specific version of 
 splitter that
 would take advantage of the representation.
Just found a disturbing artifact: splitter(message, '\n') is more than twice as slow as splitter(message, "\n") -Steve
That is a problem. Could you please file a but report?
s/but/bug/ :o)
Ohhhhh!
 Andrei
Mar 04 2013
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Sat, 02 Mar 2013 12:20:22 -0500, Andrei Alexandrescu  
<SeeWebsiteForEmail erdani.org> wrote:

 On 3/1/13 6:26 PM, Steven Schveighoffer wrote:
 On Fri, 01 Mar 2013 18:22:54 -0500, Steven Schveighoffer
 <schveiguy yahoo.com> wrote:

 So it's just pure heuristics. Not hard to see how that would affect a
 10 million cycle program.

 We may be able to create a string-specific version of splitter that
 would take advantage of the representation.
Just found a disturbing artifact: splitter(message, '\n') is more than twice as slow as splitter(message, "\n") -Steve
That is a problem. Could you please file a but report?
http://d.puremagic.com/issues/show_bug.cgi?id=9645

In trying to make a minimal test case, I found that the algorithm performs worse vs. the substring version as the number of content characters between separators grows.

It actually starts out performing better than the substring version, by quite a bit (I would think we should be able to optimize there as well, the difference was about half) when the ratio is 1:1 (i.e. splitting "ababababa" on 'a' or "a"), but gets worse as you put more content between the separators. That should be a large clue.

-Steve
Mar 04 2013
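The density experiment Steve describes can be sketched as a small harness. Python here is only to illustrate the methodology (Python's own split won't reproduce the D artifact, which is what the bug report is for), and the `density_test` helper name is invented for this example:

```python
import time

def density_test(ratio, total_len=1_000_000):
    """Build a string where roughly 1 in `ratio` characters is the
    separator 'a' and the rest are content ('b')."""
    return ("b" * (ratio - 1) + "a") * (total_len // ratio)

# Time a split at several content-to-separator ratios; in the D case,
# the report is that cost grows with the content between separators.
for ratio in (2, 8, 32):
    s = density_test(ratio)
    t0 = time.perf_counter()
    parts = s.split("a")
    elapsed = time.perf_counter() - t0
    print(ratio, len(parts), f"{elapsed:.4f}s")
```

At ratio 2 this produces exactly the "ababab"-style 1:1 input mentioned above.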
prev sibling parent "Zach the Mystic" <reachBUTMINUSTHISzach gOOGLYmail.com> writes:
On Saturday, 2 March 2013 at 17:20:23 UTC, Andrei Alexandrescu 
wrote:
 On 3/1/13 6:26 PM, Steven Schveighoffer wrote:
 On Fri, 01 Mar 2013 18:22:54 -0500, Steven Schveighoffer
 <schveiguy yahoo.com> wrote:

 So it's just pure heuristics. Not hard to see how that would 
 affect a
 10 million cycle program.

 We may be able to create a string-specific version of 
 splitter that
 would take advantage of the representation.
Just found a disturbing artifact: splitter(message, '\n') is more than twice as slow as splitter(message, "\n") -Steve
That is a problem. Could you please file a but report? Thanks, Andrei
I would have filed a bug report, but I was WAY too busy.
Mar 04 2013
prev sibling parent reply Jacob Carlborg <doob me.com> writes:
On 2013-03-01 23:31, Steven Schveighoffer wrote:

 Phobos kind of refuses to treat strings like arrays of characters, it
 insists on decoding.

 With DMD and a hand-written splitter, it takes 6 seconds instead of 10
 on my system (64-bit macosx).
If you need to write a custom splitter to get acceptable performance then it sounds to me that D or Phobos have failed. -- /Jacob Carlborg
Mar 02 2013
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/2/13 11:12 AM, Jacob Carlborg wrote:
 On 2013-03-01 23:31, Steven Schveighoffer wrote:

 Phobos kind of refuses to treat strings like arrays of characters, it
 insists on decoding.

 With DMD and a hand-written splitter, it takes 6 seconds instead of 10
 on my system (64-bit macosx).
If you need to write a custom splitter to get acceptable performance then it sounds to me that D or Phobos have failed.
I wrote a custom splitter specialized for arrays. It's as fast. Andrei
Mar 02 2013
parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Sat, 02 Mar 2013 12:14:07 -0500, Andrei Alexandrescu  
<SeeWebsiteForEmail erdani.org> wrote:

 On 3/2/13 11:12 AM, Jacob Carlborg wrote:
 On 2013-03-01 23:31, Steven Schveighoffer wrote:

 Phobos kind of refuses to treat strings like arrays of characters, it
 insists on decoding.

 With DMD and a hand-written splitter, it takes 6 seconds instead of 10
 on my system (64-bit macosx).
If you need to write a custom splitter to get acceptable performance then it sounds to me that D or Phobos have failed.
I wrote a custom splitter specialized for arrays. It's as fast.
Why doesn't it get used for strings in this instance? Maybe there is a bad constraint somewhere. -Steve
Mar 02 2013
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/2/13 2:37 PM, Steven Schveighoffer wrote:
 On Sat, 02 Mar 2013 12:14:07 -0500, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:

 On 3/2/13 11:12 AM, Jacob Carlborg wrote:
 On 2013-03-01 23:31, Steven Schveighoffer wrote:

 Phobos kind of refuses to treat strings like arrays of characters, it
 insists on decoding.

 With DMD and a hand-written splitter, it takes 6 seconds instead of 10
 on my system (64-bit macosx).
If you need to write a custom splitter to get acceptable performance then it sounds to me that D or Phobos have failed.
I wrote a custom splitter specialized for arrays. It's as fast.
Why doesn't it get used for strings in this instance? Maybe there is a bad constraint somewhere.
I fear there's a misunderstanding somewhere. Splitter and find are already specialized appropriately for strings. Andrei
Mar 02 2013
parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Sat, 02 Mar 2013 16:16:22 -0500, Andrei Alexandrescu  
<SeeWebsiteForEmail erdani.org> wrote:

 I fear there's a misunderstanding somewhere. Splitter and find are  
 already specialized appropriately for strings.
OK, I would then counter that I have created a splitter range that beats its performance, by a lot. As far as I can tell, I'm not doing anything special.

All I feel that I'm doing differently is instead of using find or indexOf, I'm looking at each char to see if it matches the first in the separator, and if it does, then I'm comparing the remaining characters. Pretty braindead, probably room for improvement. Splitter should beat this, or match it at least.

From testing the sample code in this thread, my simple version beats splitter by 40%. Something isn't right somewhere, the difference should not be that drastic.

-Steve
Mar 02 2013
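The "braindead" scan Steve describes (look at each char, and only on a first-character hit compare the rest of the separator) might look roughly like this. A Python sketch of the control flow only, not the actual range code; `naive_split` is a hypothetical name:

```python
def naive_split(s, sep):
    """Sketch of the described scan: compare each character against the
    first character of the separator, and only when it matches compare
    the remaining separator characters."""
    n, m = len(s), len(sep)
    first = sep[0]
    start = i = 0
    while i <= n - m:
        if s[i] == first and s[i:i + m] == sep:
            yield s[start:i]   # piece before the separator
            i += m
            start = i
        else:
            i += 1
    yield s[start:]            # tail after the last separator
```

For the SIP message in the benchmark, `list(naive_split(message, "\r\n"))` should agree with `message.split("\r\n")`.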
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/2/13 5:48 PM, Steven Schveighoffer wrote:
 On Sat, 02 Mar 2013 16:16:22 -0500, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:

 I fear there's a misunderstanding somewhere. Splitter and find are
 already specialized appropriately for strings.
OK, I would then counter that I have created a splitter range
Prolly a link would clarify that. Andrei
Mar 02 2013
parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Sat, 02 Mar 2013 18:14:28 -0500, Andrei Alexandrescu  
<SeeWebsiteForEmail erdani.org> wrote:

 On 3/2/13 5:48 PM, Steven Schveighoffer wrote:
 On Sat, 02 Mar 2013 16:16:22 -0500, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:

 I fear there's a misunderstanding somewhere. Splitter and find are
 already specialized appropriately for strings.
OK, I would then counter that I have created a splitter range
Prolly a link would clarify that.
Forget this, I read your other post. The MySplitter range is the one I was referring to. I saw your points and will respond there, but not today. I don't have time right now to continue the debate. -Steve
Mar 02 2013
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 3/2/2013 8:12 AM, Jacob Carlborg wrote:
 If you need to write a custom splitter to get acceptable performance then it
 sounds to me that D or Phobos have failed.
In D, you can write custom versions of algorithms to exploit unique characteristics of particular data structures. I don't regard that as a failure.

But if, to write a custom version, you had to switch to another language, then I would agree that is a failure of a "systems" programming language.
Mar 02 2013
parent reply Jacob Carlborg <doob me.com> writes:
On 2013-03-02 20:27, Walter Bright wrote:

 In D, you can write custom versions of algorithms to exploit unique
 characteristics of particular data structures. I don't regard that as a
 failure.
If you use the most obvious function from the standard library to perform a task in Python, and you do the same in D, and the D version is slower, that's not good.

As you've seen here, people will use what's available in the standard library, and if that's not good enough compared to what's available in other languages' standard libraries, they will complain. If a function in Phobos can't match the corresponding function in another std lib, I think we're doing something wrong. Or they just happened to use a language where the std lib was optimized better for that particular task.

--
/Jacob Carlborg
Mar 03 2013
parent Walter Bright <newshound2 digitalmars.com> writes:
On 3/3/2013 7:07 AM, Jacob Carlborg wrote:
 If you use the most obvious function/way from the standard library to perform a
 task in Python and you do the same in D. If the D version is slower that's not
good.
Sure. But you should be cognizant of exactly what it is you're comparing, or you're apt to draw very wrong conclusions.
Mar 03 2013
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 3/1/2013 1:28 PM, cvk012c wrote:
 But I still kind of disappointed because I expected a much better performance
 boost and got only 7%. Counting that Python is not the fastest scripting
 language I think that similar Perl and Java scripts will outperform D easily.
 Thanks Andrei and simendsjo for a quick response though.
Python's splitter, which you are measuring, isn't written in Python. It is written in C. You're actually comparing a bit of C code with a bit of D code.
Mar 01 2013
next sibling parent "John Colvin" <john.loughran.colvin gmail.com> writes:
On Saturday, 2 March 2013 at 01:10:39 UTC, Walter Bright wrote:
 On 3/1/2013 1:28 PM, cvk012c wrote:
 But I still kind of disappointed because I expected a much 
 better performance
 boost and got only 7%. Counting that Python is not the fastest 
 scripting
 language I think that similar Perl and Java scripts will 
 outperform D easily.
 Thanks Andrei and simendsjo for a quick response though.
Python's splitter, which you are measuring, isn't written in Python. It is written in C. You're actually comparing a bit of C code with a bit of D code.
This. If you wrote the splitter in pure python you would see an enormous performance gap between it and the D version.

Having said that, maybe there are even more improvements we can make to improve speed in that part of Phobos, but considering we're already on a par with some quite mature and optimised C, there really isn't a problem.
Mar 02 2013
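The gap John mentions is easy to demonstrate from the Python side: re-implement split at the Python level and time it against the built-in. A rough sketch (the `py_split` helper is invented for illustration, and even here the substring search still runs in C via str.find; exact numbers vary by machine):

```python
import time

def py_split(s, sep):
    """str.split re-implemented at the Python level on top of str.find.
    The per-iteration interpreter overhead is what the timing exposes."""
    out, start = [], 0
    while True:
        i = s.find(sep, start)
        if i < 0:
            out.append(s[start:])
            return out
        out.append(s[start:i])
        start = i + len(sep)

message = "Via: SIP/2.0/TLS 10.1.3.114\r\nCall-ID: 123\r\nFrom: <sip:12345>\r\n\r\n"
assert py_split(message, "\r\n") == message.split("\r\n")

# Time the C-implemented builtin against the Python-level loop.
for name, fn in [("builtin", lambda: message.split("\r\n")),
                 ("python-level", lambda: py_split(message, "\r\n"))]:
    t0 = time.perf_counter()
    for _ in range(200_000):
        fn()
    print(name, f"{time.perf_counter() - t0:.3f}s")
```

A character-by-character loop with no C help at all would widen the gap much further.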
prev sibling parent reply Jacob Carlborg <doob me.com> writes:
On 2013-03-02 02:10, Walter Bright wrote:

 Python's splitter, which you are measuring, isn't written in Python. It
 is written in C. You're actually comparing a bit of C code with a bit of
 D code.
Does that really matter? He's using Python; if the function is part of the standard library, whether it's implemented in Python or C doesn't really matter.

--
/Jacob Carlborg
Mar 02 2013
next sibling parent reply "John Colvin" <john.loughran.colvin gmail.com> writes:
On Saturday, 2 March 2013 at 16:09:02 UTC, Jacob Carlborg wrote:
 On 2013-03-02 02:10, Walter Bright wrote:

 Python's splitter, which you are measuring, isn't written in 
 Python. It
 is written in C. You're actually comparing a bit of C code 
 with a bit of
 D code.
Does that really matter. He's using Python, if the function is part of the standard library and if it's implement in Python or C doesn't really matter.
It does. Failing to beat mature, optimised C is not anything to be ashamed of, but being slower than python would be an abject failure of D as a compiled, systems programming language.
Mar 02 2013
parent reply Jacob Carlborg <doob me.com> writes:
On 2013-03-02 17:48, John Colvin wrote:

 It does.

 Failing to beat mature, optimised C is not anything to be ashamed of,
 but being slower than python would be an abject failure of D as a
 compiled, systems programming language.
Then D needs to get faster, or we need to switch to C for some std lib functions. In that case, as Walter said, we have failed. -- /Jacob Carlborg
Mar 03 2013
next sibling parent reply "John Colvin" <john.loughran.colvin gmail.com> writes:
On Sunday, 3 March 2013 at 15:09:58 UTC, Jacob Carlborg wrote:
 On 2013-03-02 17:48, John Colvin wrote:

 It does.

 Failing to beat mature, optimised C is not anything to be 
 ashamed of,
 but being slower than python would be an abject failure of D 
 as a
 compiled, systems programming language.
Then D needs to get faster, or we need to switch to C for some std lib functions. In that case, as Walter said, we have failed.
I agree that anything that makes us faster is good, but I wouldn't go so far as to say we've failed if we're not as fast as the very fastest.
Mar 03 2013
parent reply Jacob Carlborg <doob me.com> writes:
On 2013-03-03 16:41, John Colvin wrote:

 I agree that anything that makes us faster is good, but I wouldn't go so
 far as to say we've failed if we're not as fast as the very fastest.
No, but if we need to drop down to C to get fast enough. -- /Jacob Carlborg
Mar 03 2013
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 3/3/2013 11:38 PM, Jacob Carlborg wrote:
 On 2013-03-03 16:41, John Colvin wrote:

 I agree that anything that makes us faster is good, but I wouldn't go so
 far as to say we've failed if we're not as fast as the very fastest.
No, but if we need to drop down to C to get fast enough.
I can't think of a reason to need to do that.
Mar 03 2013
parent reply Jonathan M Davis <jmdavisProg gmx.com> writes:
On Sunday, March 03, 2013 23:40:18 Walter Bright wrote:
 On 3/3/2013 11:38 PM, Jacob Carlborg wrote:
 On 2013-03-03 16:41, John Colvin wrote:
 I agree that anything that makes us faster is good, but I wouldn't go so
 far as to say we've failed if we're not as fast as the very fastest.
No, but if we need to drop down to C to get fast enough.
I can't think of a reason to need to do that.
Given how D works, there is something _very_ wrong if we have to drop to C, if nothing else, because it's trivial to write code in D which is equivalent to what it would be in C. If our implementation of something is worse than a C implementation, that merely means that it needs to be refactored and optimized.

It's possible that some of our abstractions will hamper efficiency in some cases (e.g. depending on how ranges are used - particularly with strings - you risk a performance hit in comparison to pure C), but that should generally be fixable via specializations and whatnot.

If we have an efficiency problem, it's merely an implementation issue which needs to be sorted out. There should be nothing inherent to D which makes it so that you can't write code as fast as C or C++.

- Jonathan M Davis
Mar 04 2013
parent Jacob Carlborg <doob me.com> writes:
On 2013-03-04 09:18, Jonathan M Davis wrote:

 Given how D works, there is something _very_ wrong if we have to drop to C, if
 nothing else, because it's trivial to write code in D which is equivalent to
 what it would be in C. If our implementation of something is worse than a C
 implementation, that merely means that it needs to be refactored and
 optimized. It's possible that some of our abstractions will hamper efficiency in
 some cases (e.g. depending on how ranges are used - particularly with strings
 - you risk a performance hit in comparison to pure C), but that should
 generally be fixable via specializations and whatnot.

 If we have an efficiency problem, it's merely an implementation issue which
 needs to be sorted out. There should be nothing inherent to D which makes it
 so that you can't write code as fast as C or C++.
That's what I'm saying. -- /Jacob Carlborg
Mar 04 2013
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 3/3/2013 7:09 AM, Jacob Carlborg wrote:
 On 2013-03-02 17:48, John Colvin wrote:

 It does.

 Failing to beat mature, optimised C is not anything to be ashamed of,
 but being slower than python would be an abject failure of D as a
 compiled, systems programming language.
Then D needs to get faster, or we need to switch to C for some std lib functions. In that case, as Walter said, we have failed.
Nothing in this thread suggests that D needs to switch its library implementations to C.

Interestingly, I tried the indexOf() benchmark using C's memchr() and memcmp() from VC's runtime library. It was not faster than Andrei's optimized D one.
Mar 03 2013
next sibling parent reply "deadalnix" <deadalnix gmail.com> writes:
On Sunday, 3 March 2013 at 20:18:58 UTC, Walter Bright wrote:
 On 3/3/2013 7:09 AM, Jacob Carlborg wrote:
 On 2013-03-02 17:48, John Colvin wrote:

 It does.

 Failing to beat mature, optimised C is not anything to be 
 ashamed of,
 but being slower than python would be an abject failure of D 
 as a
 compiled, systems programming language.
Then D needs to get faster, or we need to switch to C for some std lib functions. In that case, as Walter said, we have failed.
Nothing in this thread suggests that D needs to switch its library implementations to C. Interestingly, I tried the indexOf() benchmark using C's memchr() and memcmp() from VC's runtime library. It was not faster than Andrei's optimized D one.
Maybe it is time to look at the python implementation and see why it is faster.
Mar 03 2013
next sibling parent reply "jerro" <a a.com> writes:
 Maybe it is time to look at the python implementation and see 
 why it is faster.
It isn't faster:

$ time python3 test.py

real    0m14.217s
user    0m14.209s
sys     0m0.004s

$ gdmd -O -inline -release -noboundscheck test
$ time ./test

real    0m5.323s
user    0m5.312s
sys     0m0.008s

D code here uses the same string as the python code, not the one in cvk012c's D code.
Mar 03 2013
next sibling parent reply "bearophile" <bearophileHUGS lycos.com> writes:
jerro:

 $ time python3 test.py
Are Python3 strings still like wstrings/dstrings, or have they applied the patch that makes them UTF8? Bye, bearophile
Mar 03 2013
next sibling parent Russel Winder <russel winder.org.uk> writes:
On Mon, 2013-03-04 at 04:31 +0100, bearophile wrote:
[…]
 Are Python3 strings still like wstrings/dstrings, or have they
 applied the patch that makes them UTF8?
Python3 is Unicode by default, with UTF-8 the default encoding.

-- 
Russel.

Dr Russel Winder          t: +44 20 7585 2200   voip: sip:russel.winder ekiga.net
41 Buckmaster Road        m: +44 7770 465 077   xmpp: russel winder.org.uk
London SW11 1EN, UK       w: www.russel.org.uk  skype: russel_winder
Mar 04 2013
prev sibling parent Ellery Newcomer <ellery-newcomer utulsa.edu> writes:
On 03/03/2013 07:31 PM, bearophile wrote:
 jerro:

 $ time python3 test.py
Are Python3 strings still like wstrings/dstrings, or have they applied the patch that makes them UTF8? Bye, bearophile
Looks like 3.3 will store them as utf8, but not 3.2 or below.
Mar 04 2013
prev sibling parent reply "deadalnix" <deadalnix gmail.com> writes:
On Monday, 4 March 2013 at 03:20:57 UTC, jerro wrote:
 Maybe it is time to look at the python implementation and see 
 why it is faster.
 It isn't faster:

 $ time python3 test.py

 real    0m14.217s
 user    0m14.209s
 sys     0m0.004s

 $ gdmd -O -inline -release -noboundscheck test
 $ time ./test

 real    0m5.323s
 user    0m5.312s
 sys     0m0.008s

 D code here uses the same string as the python code, not the one in cvk012c's D code.
Using -noboundscheck isn't fair, as you give up on safety, so you are not equivalent to Python either.
Mar 03 2013
next sibling parent "Don" <turnyourkidsintocash nospam.com> writes:
On Monday, 4 March 2013 at 03:58:20 UTC, deadalnix wrote:
 On Monday, 4 March 2013 at 03:20:57 UTC, jerro wrote:
 Maybe it is time to look at the python implementation and see 
 why it is faster.
 It isn't faster:

 $ time python3 test.py

 real    0m14.217s
 user    0m14.209s
 sys     0m0.004s

 $ gdmd -O -inline -release -noboundscheck test
 $ time ./test

 real    0m5.323s
 user    0m5.312s
 sys     0m0.008s

 D code here uses the same string as the python code, not the one in cvk012c's D code.
Using noboundcheck isn't fair as you give up on safety, so you are not equivalent to python either.
But that does suggest that the main problem is that DMD has very little optimization aimed at reducing the number of bounds checks required.
Mar 04 2013
prev sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Sun, 03 Mar 2013 22:58:07 -0500, deadalnix <deadalnix gmail.com> wrote:

 On Monday, 4 March 2013 at 03:20:57 UTC, jerro wrote:
 Maybe it is time to look at the python implementation and see why it  
 is faster.
 It isn't faster:

 $ time python3 test.py

 real    0m14.217s
 user    0m14.209s
 sys     0m0.004s

 $ gdmd -O -inline -release -noboundscheck test
 $ time ./test

 real    0m5.323s
 user    0m5.312s
 sys     0m0.008s

 D code here uses the same string as the python code, not the one in cvk012c's D code.
Using noboundcheck isn't fair as you give up on safety, so you are not equivalent to python either.
In this type of test, there is no danger with using noboundscheck. It's a very fair switch to use; D is just able to do it where python is not.

A sufficiently smart compiler could eliminate the bounds check for this code, since it can be proven not to go out of bounds (in fact, a simple run without the noboundscheck proves it in very deterministic code like this).

-Steve
Mar 04 2013
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/3/13 9:31 PM, deadalnix wrote:
 On Sunday, 3 March 2013 at 20:18:58 UTC, Walter Bright wrote:
 On 3/3/2013 7:09 AM, Jacob Carlborg wrote:
 On 2013-03-02 17:48, John Colvin wrote:

 It does.

 Failing to beat mature, optimised C is not anything to be ashamed of,
 but being slower than python would be an abject failure of D as a
 compiled, systems programming language.
Then D needs to get faster, or we need to switch to C for some std lib functions. In that case, as Walter said, we have failed.
Nothing in this thread suggests that D needs to switch its library implementations to C. Interestingly, I tried the indexOf() benchmark using C's memchr() and memcmp() from VC's runtime library. It was not faster than Andrei's optimized D one.
Maybe it is time to look at the python implementation and see why it is faster.
Was the conclusion that it's faster? Andrei
Mar 03 2013
parent "deadalnix" <deadalnix gmail.com> writes:
On Monday, 4 March 2013 at 04:36:30 UTC, Andrei Alexandrescu 
wrote:
 On 3/3/13 9:31 PM, deadalnix wrote:
 On Sunday, 3 March 2013 at 20:18:58 UTC, Walter Bright wrote:
 On 3/3/2013 7:09 AM, Jacob Carlborg wrote:
 On 2013-03-02 17:48, John Colvin wrote:

 It does.

 Failing to beat mature, optimised C is not anything to be 
 ashamed of,
 but being slower than python would be an abject failure of 
 D as a
 compiled, systems programming language.
Then D needs to get faster, or we need to switch to C for some std lib functions. In that case, as Walter said, we have failed.
Nothing in this thread suggests that D needs to switch its library implementations to C. Interestingly, I tried the indexOf() benchmark using C's memchr() and memcmp() from VC's runtime library. It was not faster than Andrei's optimized D one.
Maybe it is time to look at the python implementation and see why it is faster.
Was the conclusion that it's faster? Andrei
It seems that I was wrong, and missed the point that the two benchmarks weren't using the same string. Sorry for that.
Mar 03 2013
prev sibling parent Jacob Carlborg <doob me.com> writes:
On 2013-03-03 21:18, Walter Bright wrote:

 Nothing in this thread suggests that D needs to switch its library
 implementations to C.
Then that's good.
 Interestingly, I tried the indexOf() benchmark using C's memchr() and
 memcmp() from VC's runtime library. It was not faster than Andrei's
 optimized D one.
-- /Jacob Carlborg
Mar 03 2013
prev sibling next sibling parent "jerro" <a a.com> writes:
 Does that really matter. He's using Python, if the function is 
 part of the standard library and if it's implement in Python or 
 C doesn't really matter.
You can look at it that way, but still, the fact that most of the work in the Python version is done by C code makes the timings less surprising.

If split() were implemented in pure Python and were only three times slower (when run with CPython) than std.algorithm.splitter, that would be very surprising and a sign that std.algorithm.splitter is doing something horribly wrong.
Mar 02 2013
prev sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/2/13 11:09 AM, Jacob Carlborg wrote:
 On 2013-03-02 02:10, Walter Bright wrote:

 Python's splitter, which you are measuring, isn't written in Python. It
 is written in C. You're actually comparing a bit of C code with a bit of
 D code.
Does that really matter. He's using Python, if the function is part of the standard library and if it's implement in Python or C doesn't really matter.
The point here is that often one needs to write code that's more than gluing library calls together. I've seen that many, many times. Andrei
Mar 02 2013
prev sibling next sibling parent "jerro" <a a.com> writes:
On Friday, 1 March 2013 at 20:30:24 UTC, cvk012c wrote:
 Tried to port my SIP parser from Python to D to boost 
 performance
 but got opposite result.
 I created a simple test script which splits SIP REGISTER message
 10 million times. Python version takes about 14 seconds to
 execute, D version is about 23 seconds which is 1.6 times 
 slower.
 I used DMD 2.062 and compiled my script with options -release 
 and
 -O. I used Python 3.3 64 bit.
 I ran both scripts on the same hardware with Windows 7 64.
 Is this general problem with string performance in D or just
 splitter() issue?
 Did anybody compared performance of D string manipulating
 functions with other languages like Python,Perl,Java and C++?
I'm guessing you are building without optimization options. When compiled with "dmd -O -inline -noboundscheck -release tmp" the D code takes 11.1 seconds on my machine and the python script takes 16.1 seconds.

You can make the D code faster by building with LDC or GDC:

ldmd2 -O -noboundscheck -release tmp: 6.8 seconds
gdmd -O -noboundscheck -inline -release tmp: 6.1 seconds

So no, not slower than Python. I also suspect that much of the work in the Python code is actually done by functions that are implemented in C.
Mar 01 2013
prev sibling parent Robert <jfanatiker gmx.at> writes:
Hm, just recently a friend of mine and I hacked together at the FB
Hacker Cup, he in Python and I in D. My solutions were always faster by
at least a factor of 80. For your example, I could not get a factor of
80, but with -inline it is at least faster than the Python version
(about 30% faster on my machine).

Best regards,

Robert
Mar 01 2013