
digitalmars.D - Slower than Python

reply "cvk012c" <cvk012c motorolasolutions.com> writes:
Tried to port my SIP parser from Python to D to boost performance
but got the opposite result.
I created a simple test script which splits a SIP REGISTER message
10 million times. The Python version takes about 14 seconds to
execute; the D version takes about 23 seconds, which is 1.6 times slower.
I used DMD 2.062 and compiled my script with options -release and
-O. I used Python 3.3 64 bit.
I ran both scripts on the same hardware with Windows 7 64.
Is this a general problem with string performance in D, or just a
splitter() issue?
Has anybody compared the performance of D's string-manipulating
functions with other languages like Python, Perl, Java and C++?


Here is the Python version of the test script:

import time

message = "REGISTER sip:comm.example.com SIP/2.0\r\n\
Content-Length: 0\r\n\
Contact:
<sip:12345 10.1.3.114:59788;transport=tls>;expires=4294967295;events=\"message-summary\";q=0.9\r\n\
To: <sip:12345 comm.example.com>\r\n\
User-Agent: (\"VENDOR=MyCompany\" \"My User Agent\")\r\n\
Max-Forwards: 70\r\n\
CSeq: 1 REGISTER\r\n\
Via: SIP/2.0/TLS
10.1.3.114:59788;branch=z9hG4bK2910497772630690\r\n\
Call-ID: 2910497622026445\r\n\
From: <sip:12345 comm.example.com>;tag=2910497618150713\r\n\r\n"

t1 = time.perf_counter()
for i in range(10000000):
    for notused in message.split("\r\n"):
      pass
print(time.perf_counter()-t1)


Here is the D version:

import std.stdio,std.algorithm,std.datetime;

void main()
{
    auto message = "REGISTER sip:example.com SIP/2.0\r\n~
Content-Length: 0\r\n~
Contact:
<sip:12345 10.1.3.114:59788;transport=tls>;expires=4294967295;events=\"message-summary\";q=0.9\r\n~
To: <sip:12345 comm.example.com>\r\n~
User-Agent: (\"VENDOR=MyCompany\" \"My User Agent\")\r\n~
Max-Forwards: 70\r\n~
CSeq: 1 REGISTER\r\n~
Via: SIP/2.0/TLS
10.1.3.114:59788;branch=z9hG4bK2910497772630690\r\n~
Call-ID: 2910497622026445\r\n~
From: <sip:12345 comm.example.com>;tag=2910497618150713\r\n\r\n";

    auto t1 = Clock.currTime();
    foreach(i; 0..10000000)
    {
      foreach(notused; splitter(message, "\r\n"))
      {
      }
    }
    writeln(Clock.currTime()-t1);
}
Mar 01 2013
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/1/13 3:30 PM, cvk012c wrote:
 Tried to port my SIP parser from Python to D to boost performance
 but got opposite result.
 I created a simple test script which splits SIP REGISTER message
 10 million times. Python version takes about 14 seconds to
 execute, D version is about 23 seconds which is 1.6 times slower.
 I used DMD 2.062 and compiled my script with options -release and
 -O. I used Python 3.3 64 bit.
 I ran both scripts on the same hardware with Windows 7 64.
Add -inline to the options. Andrei
Mar 01 2013
parent reply "simendsjo" <simendsjo gmail.com> writes:
On Friday, 1 March 2013 at 20:50:15 UTC, Andrei Alexandrescu 
wrote:
 On 3/1/13 3:30 PM, cvk012c wrote:
 Tried to port my SIP parser from Python to D to boost 
 performance
 but got opposite result.
 I created a simple test script which splits SIP REGISTER 
 message
 10 million times. Python version takes about 14 seconds to
 execute, D version is about 23 seconds which is 1.6 times 
 slower.
 I used DMD 2.062 and compiled my script with options -release 
 and
 -O. I used Python 3.3 64 bit.
 I ran both scripts on the same hardware with Windows 7 64.
Add -inline to the options. Andrei
--noboundscheck can also help if you don't mind missing the safety net.

$ rdmd -O -release sip
22 secs, 977 ms, 299 μs, and 8 hnsecs
$ rdmd -O -release -inline sip
12 secs, 245 ms, 567 μs, and 9 hnsecs
$ rdmd -O -release -inline -noboundscheck sip
10 secs, 171 ms, 209 μs, and 9 hnsecs
Mar 01 2013
next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/1/13 3:58 PM, simendsjo wrote:
 On Friday, 1 March 2013 at 20:50:15 UTC, Andrei Alexandrescu wrote:
 On 3/1/13 3:30 PM, cvk012c wrote:
 Tried to port my SIP parser from Python to D to boost performance
 but got opposite result.
 I created a simple test script which splits SIP REGISTER message
 10 million times. Python version takes about 14 seconds to
 execute, D version is about 23 seconds which is 1.6 times slower.
 I used DMD 2.062 and compiled my script with options -release and
 -O. I used Python 3.3 64 bit.
 I ran both scripts on the same hardware with Windows 7 64.
Add -inline to the options. Andrei
 --noboundscheck can also help if you don't mind missing the safety net.

 $ rdmd -O -release sip
 22 secs, 977 ms, 299 μs, and 8 hnsecs
 $ rdmd -O -release -inline sip
 12 secs, 245 ms, 567 μs, and 9 hnsecs
 $ rdmd -O -release -inline -noboundscheck sip
 10 secs, 171 ms, 209 μs, and 9 hnsecs
Also, the D version has a different string to parse (~ is not a line
continuation character). The fixed version:

auto message = "REGISTER sip:example.com SIP/2.0\r
Content-Length: 0\r
Contact: <sip:12345 10.1.3.114:59788;transport=tls>;expires=4294967295;events=\"message-summary\";q=0.9\r
To: <sip:12345 comm.example.com>\r
User-Agent: (\"VENDOR=MyCompany\" \"My User Agent\")\r
Max-Forwards: 70\r
CSeq: 1 REGISTER\r
Via: SIP/2.0/TLS 10.1.3.114:59788;branch=z9hG4bK2910497772630690\r
Call-ID: 2910497622026445\r
From: <sip:12345 comm.example.com>;tag=2910497618150713\r\n\r\n";

That shaves off one extra second, bringing it down to 9.

Andrei
Mar 01 2013
prev sibling parent reply "cvk012c" <cvk012c motorolasolutions.com> writes:
On Friday, 1 March 2013 at 20:58:09 UTC, simendsjo wrote:
 On Friday, 1 March 2013 at 20:50:15 UTC, Andrei Alexandrescu 
 wrote:
 On 3/1/13 3:30 PM, cvk012c wrote:
 Tried to port my SIP parser from Python to D to boost 
 performance
 but got opposite result.
 I created a simple test script which splits SIP REGISTER 
 message
 10 million times. Python version takes about 14 seconds to
 execute, D version is about 23 seconds which is 1.6 times 
 slower.
 I used DMD 2.062 and compiled my script with options -release 
 and
 -O. I used Python 3.3 64 bit.
 I ran both scripts on the same hardware with Windows 7 64.
Add -inline to the options. Andrei
 --noboundscheck can also help if you don't mind missing the safety net.

 $ rdmd -O -release sip
 22 secs, 977 ms, 299 μs, and 8 hnsecs
 $ rdmd -O -release -inline sip
 12 secs, 245 ms, 567 μs, and 9 hnsecs
 $ rdmd -O -release -inline -noboundscheck sip
 10 secs, 171 ms, 209 μs, and 9 hnsecs
On my hardware, with the -inline option it now takes about 15 secs, which is still slower than Python, but with both -inline and -noboundscheck it takes 13 secs and finally beats Python. I am still kind of disappointed, though, because I expected a much better performance boost and got only 7%. Considering that Python is not the fastest scripting language, I think that similar Perl and Java scripts will outperform D easily.

Thanks Andrei and simendsjo for the quick responses, though.
Mar 01 2013
next sibling parent reply "bearophile" <bearophileHUGS lycos.com> writes:
cvk012c:

 I think that similar Perl and Java scripts will outperform D 
 easily.
 Thanks Andrei and simendsjo for a quick response though.
Why don't you write a Java version? It takes only a few minutes, and you will have one more data point.

Python string functions are written in C, compiled very efficiently (the standard Python binaries on Windows are built with the Microsoft compiler, but Intel-compiled builds can also be found), and they have been well optimized over several years of work by people like Hettinger :-) You will see Python2 code like this easily beat D for normal text files:

for line in file("foo.txt"):
    ...

Neither D nor performance in general is magical. Performance comes from a long effort of optimizing algorithms, code, and the compilers/virtual machines that run them.

Bye,
bearophile
Mar 01 2013
parent reply "cvk012c" <cvk012c motorolasolutions.com> writes:
On Friday, 1 March 2013 at 21:52:13 UTC, bearophile wrote:
 cvk012c:

 I think that similar Perl and Java scripts will outperform D 
 easily.
 Thanks Andrei and simendsjo for a quick response though.
Why don't you write a Java version? It takes only few minutes, and you will have one more data point.
You are right. Why not. But instead of using Java's split() method I used a combination of the indexOf() and substring() methods to do the same job. The reason: Java's split method is implemented as a regular expression, which would be unfair to compare to D's splitter. Again, I created a similar D version of the script, compiled it with all suggested options: -release -O -inline -noboundscheck, and this time the D version is more than twice as slow as Java: 8.4 seconds vs 4. D experts, please take a look at my code and tell me what is wrong with it.

Java version:

public class Test {
    public static void main(String[] args) {
        String message = "REGISTER sip:comm.example.com SIP/2.0\r\nContent-Length: 0\r\nContact: <sip:12345 10.1.3.114:59788;transport=tls>;expires=4294967295;events=\"message-summary\";q=0.9\r\nTo: <sip:12345 comm.example.com>\r\nUser-Agent: (\"VENDOR=MyCompany\" \"My User Agent\")\r\nMax-Forwards: 70\r\nCSeq: 1 REGISTER\r\nVia: SIP/2.0/TLS 10.1.3.114:59788;branch=z9hG4bK2910497772630690\r\nCall-ID: 2910497622026445\r\nFrom: <sip:12345 comm.example.com>;tag=2910497618150713\r\n\r\n";

        long t1 = System.currentTimeMillis();
        for (int i=0; i<10000000; i++) {
            int index1 = 0;
            while (true) {
                int index2 = message.indexOf("\r\n", index1);
                if (index2 == -1) break;
                String notused = message.substring(index1, index2);
                index1 = index2+2;
            }
        }
        System.out.println(System.currentTimeMillis()-t1);
    }
}

D version:

import std.stdio, std.string, std.datetime;

void main()
{
    auto message = "REGISTER sip:example.com SIP/2.0\r\nContent-Length: 0\r\nContact: <sip:12345 10.1.3.114:59788;transport=tls>;expires=4294967295;events=\"message-summary\";q=0.9\r\nTo: <sip:12345 comm.example.com>\r\nUser-Agent: (\"VENDOR=MyCompany\" \"My User Agent\")\r\nMax-Forwards: 70\r\nCSeq: 1 REGISTER\r\nVia: SIP/2.0/TLS 10.1.3.114:59788;branch=z9hG4bK2910497772630690\r\nCall-ID: 2910497622026445\r\nFrom: <sip:12345 comm.example.com>;tag=2910497618150713\r\n\r\n";

    auto t1 = Clock.currTime();
    for (int i=0; i<10000000; i++)
    {
        auto substring = message;
        while (true)
        {
            auto index = indexOf(substring, "\r\n");
            if (index == -1) break;
            auto notused = substring[0..index];
            substring = substring[index+2..$];
        }
    }
    writeln(Clock.currTime()-t1);
}
Mar 01 2013
parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Fri, 01 Mar 2013 19:19:35 -0500, cvk012c  
<cvk012c motorolasolutions.com> wrote:

 On Friday, 1 March 2013 at 21:52:13 UTC, bearophile wrote:
 cvk012c:

 I think that similar Perl and Java scripts will outperform D easily.
 Thanks Andrei and simendsjo for a quick response though.
Why don't you write a Java version? It takes only few minutes, and you will have one more data point.
You are right. Why not. But instead of using Java split() method I used combination of indexOf() and substring() methods to do the same job. The reason: Java split method implemented as a regular expression which will be unfair to compare to D splitter. Again, I created a similar D version of the script, compiled it with all suggested options: -release -O -inline -noboundscheck and this time D version is more then twice slower than Java: 8.4 seconds vs 4. D experts, please, take a look at my code and tell me what is wrong with it.
Try my hand-written version (elsewhere in thread). I think it can be done better too (use pointers instead of arrays).

The issue is a combination of the facts that:

1. splitter is designed for any range, not just strings. Not an excuse really, but a string-specific version could be written that does better (clearly).
2. dmd is not always the best optimizer. I saw one other person who said using a different D compiler resulted in a quicker time.
3. Any time you are looping 10 million times, small insignificant differences will be magnified.

Note one other thing -- be VERY wary of test cases that are fully visible at compile time. Very smart compilers are known to reduce your code to something that isn't realistic. I remember seeing a similar comparison not too long ago where one wondered why g++ was so much faster (0.09 seconds or something) than D (4 or more seconds). It turned out the g++ compiler had optimized out his ENTIRE program :). You may want to read in the "message" string at runtime to avoid such issues.

-Steve
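Steve's last point can be sketched as follows; this is an illustrative Python rendition (not from the thread) of feeding the benchmark its input at runtime, so a compiler or VM cannot treat the message as a compile-time constant. The temp-file approach and the scaled-down iteration count are arbitrary choices:

```python
import tempfile, os, time

message_literal = "REGISTER sip:example.com SIP/2.0\r\nCSeq: 1 REGISTER\r\n\r\n"

# Write the message out, then read it back, so its contents are only
# known at runtime rather than baked into the program as a constant.
with tempfile.NamedTemporaryFile("w", delete=False, newline="") as f:
    f.write(message_literal)
    path = f.name
with open(path, newline="") as f:
    message = f.read()
os.unlink(path)

t1 = time.perf_counter()
for i in range(100_000):  # scaled down from the thread's 10 million
    for notused in message.split("\r\n"):
        pass
elapsed = time.perf_counter() - t1
```

The same idea applies to the D and Java versions: take the message from a file or command-line argument instead of a literal.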
Mar 01 2013
next sibling parent reply "cvk012c" <cvk012c motorolasolutions.com> writes:
On Saturday, 2 March 2013 at 00:47:02 UTC, Steven Schveighoffer 
wrote:
 On Fri, 01 Mar 2013 19:19:35 -0500, cvk012c 
 <cvk012c motorolasolutions.com> wrote:

 On Friday, 1 March 2013 at 21:52:13 UTC, bearophile wrote:
 cvk012c:

 I think that similar Perl and Java scripts will outperform D 
 easily.
 Thanks Andrei and simendsjo for a quick response though.
Why don't you write a Java version? It takes only few minutes, and you will have one more data point.
You are right. Why not. But instead of using Java split() method I used combination of indexOf() and substring() methods to do the same job. The reason: Java split method implemented as a regular expression which will be unfair to compare to D splitter. Again, I created a similar D version of the script, compiled it with all suggested options: -release -O -inline -noboundscheck and this time D version is more then twice slower than Java: 8.4 seconds vs 4. D experts, please, take a look at my code and tell me what is wrong with it.
The issue is a combination of the fact that: 1. splitter is designed for any range, not just strings. Not an excuse really, but a string-specific version could be written that does better (clearly).
In my latest version of the D script I didn't use splitter at all; I used the string-specific indexOf function. Still, the result is very bad. For text-based protocols such as SIP, the performance of string-manipulating functions is very important. Unfortunately, it looks like that is not D's strongest point at this time.
Mar 01 2013
next sibling parent reply "anon123" <z z.z> writes:
On Saturday, 2 March 2013 at 01:05:35 UTC, cvk012c wrote:
 On Saturday, 2 March 2013 at 00:47:02 UTC, Steven Schveighoffer 
 wrote:
 On Fri, 01 Mar 2013 19:19:35 -0500, cvk012c 
 <cvk012c motorolasolutions.com> wrote:

 On Friday, 1 March 2013 at 21:52:13 UTC, bearophile wrote:
 cvk012c:

 I think that similar Perl and Java scripts will outperform 
 D easily.
 Thanks Andrei and simendsjo for a quick response though.
Why don't you write a Java version? It takes only few minutes, and you will have one more data point.
You are right. Why not. But instead of using Java split() method I used combination of indexOf() and substring() methods to do the same job. The reason: Java split method implemented as a regular expression which will be unfair to compare to D splitter. Again, I created a similar D version of the script, compiled it with all suggested options: -release -O -inline -noboundscheck and this time D version is more then twice slower than Java: 8.4 seconds vs 4. D experts, please, take a look at my code and tell me what is wrong with it.
The issue is a combination of the fact that: 1. splitter is designed for any range, not just strings. Not an excuse really, but a string-specific version could be written that does better (clearly).
In my latest version of D script I didn't use splitter at all. I used string specific indexOf function. Still result is very bad. For text based protocols, such as SIP, performance of string manipulating functions is very important. Unfortunately, looks like it is not D strongest point at this time.
My version is functionally equivalent, and measures 55 - 94 ms (depending on compiler; LDC is best). Your version performed in about 7 s on my machine. A similar optimization taking advantage of immutability can be done on your Python translation, resulting in measurements of <1 ms.

import std.stdio, std.string, std.datetime;

void main()
{
    auto message = "REGISTER sip:example.com SIP/2.0\r\nContent-Length: 0\r\nContact: <sip:12345 10.1.3.114:59788;transport=tls>;expires=4294967295;events=\"message-summary\";q=0.9\r\nTo: <sip:12345 comm.example.com>\r\nUser-Agent: (\"VENDOR=MyCompany\" \"My User Agent\")\r\nMax-Forwards: 70\r\nCSeq: 1 REGISTER\r\nVia: SIP/2.0/TLS 10.1.3.114:59788;branch=z9hG4bK2910497772630690\r\nCall-ID: 2910497622026445\r\nFrom: <sip:12345 comm.example.com>;tag=2910497618150713\r\n\r\n";

    auto t1 = Clock.currTime();
    for (int i=0; i<10_000_000; i++)
    {
        while (true)
        {
            auto index = indexOf(message, "\r\n");
            if (index == -1) break;
            auto notused = message[0..index];
            message = message[index+2..$];
        }
    }
    writeln(Clock.currTime()-t1);
}
Mar 01 2013
next sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 3/1/2013 5:19 PM, anon123 wrote:
 My version is functionally equivalent,
I don't think so. Take a look at what happens to the message variable. It is never restored to its original string on the next iteration of the loop.
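Walter's observation can be reproduced in miniature; the following is an illustrative Python sketch (not from the thread) of the same bug: the message is consumed on the first pass of the outer loop, so the remaining passes split nothing and the benchmark measures almost no work.

```python
# The "equivalent" benchmark reassigns `message` to its own tail, so after
# the first outer iteration there is nothing left to split.
message = "a\r\nb\r\nc\r\n"
splits = 0
for i in range(3):  # stand-in for the 10-million-iteration loop
    while True:
        index = message.find("\r\n")
        if index == -1:
            break
        message = message[index + 2:]  # consumes the shared string
        splits += 1
# Only the first pass does any splitting: splits == 3, not 3 passes * 3.
```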
Mar 01 2013
prev sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Fri, 01 Mar 2013 20:19:19 -0500, anon123 <z z.z> wrote:

 My version is functionally equivalent, and measures 55 - 94 ms  
 (depending on compiler; LDC is best). Your version performed in about 7s  
 on my machine.
Try printing out message each time through the 10,000,000 iterations.

-Steve
Mar 01 2013
prev sibling next sibling parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Fri, 01 Mar 2013 20:05:34 -0500, cvk012c  
<cvk012c motorolasolutions.com> wrote:

 In my latest version of D script I didn't use splitter at all. I used  
 string specific indexOf function. Still result is very bad. For text  
 based protocols, such as SIP, performance of string manipulating  
 functions is very important. Unfortunately, looks like it is not D  
 strongest point at this time.
indexOf uses the same mechanism as splitter to find the separators. If it doesn't improve anything, I'd say that is where the problem lies (std.algorithm.find). -Steve
Mar 01 2013
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/1/13 8:29 PM, Steven Schveighoffer wrote:
 On Fri, 01 Mar 2013 20:05:34 -0500, cvk012c
 <cvk012c motorolasolutions.com> wrote:

 In my latest version of D script I didn't use splitter at all. I used
 string specific indexOf function. Still result is very bad. For text
 based protocols, such as SIP, performance of string manipulating
 functions is very important. Unfortunately, looks like it is not D
 strongest point at this time.
indexOf uses the same mechanism as splitter to find the separators. If it doesn't improve anything, I'd say that is where the problem lies (std.algorithm.find).
find and indexOf on arrays are on par with handwritten loops. Andrei
Mar 02 2013
parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Sat, 02 Mar 2013 12:17:54 -0500, Andrei Alexandrescu  
<SeeWebsiteForEmail erdani.org> wrote:

 On 3/1/13 8:29 PM, Steven Schveighoffer wrote:
 On Fri, 01 Mar 2013 20:05:34 -0500, cvk012c
 <cvk012c motorolasolutions.com> wrote:

 In my latest version of D script I didn't use splitter at all. I used
 string specific indexOf function. Still result is very bad. For text
 based protocols, such as SIP, performance of string manipulating
 functions is very important. Unfortunately, looks like it is not D
 strongest point at this time.
indexOf uses the same mechanism as splitter to find the separators. If it doesn't improve anything, I'd say that is where the problem lies (std.algorithm.find).
find and indexOf on arrays are on par with handwritten loops.
This is not a personal attack, please don't take it that way.

My anecdotal tests with hand-writing a custom splitter range to handle the OP's program gave me a 40% savings. Whether it's find or not, I'm not sure, but there definitely is room for improvement.

Using indexOf instead of splitter seems to result in the same time usage, so something that is common between the two would be a logical choice to target.

It would be interesting to examine the assembly differences, but I haven't got the time right now.

-Steve
Mar 02 2013
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/2/13 2:32 PM, Steven Schveighoffer wrote:
 This is not a personal attack, please don't take it that way.
Not at all. I'm just confused about this exchange. You mention some correct facts ("MySplitter finished in 6 seconds instead of 10"), then some mistaken suppositions ("a string-specific version could be written"), and derive conclusions from both ("Maybe there is a bad constraint somewhere"). Hard to deal with the mix.
 My
 anecdotal tests with hand writing a custom splitter range to handle the
 OP's program gave me a 40% savings. Whether it's find or not, I'm not
 sure, but there definitely is room for improvement.
I think I understand where the confusion comes from. If you're referring to MySplitter, that's not comparable. It uses this at the core:

for(; i + 1 < source.length; i++)
{
    if(source[i] == '\r' && source[i + 1] == '\n')
    {
        found = true;
        break;
    }
    ...
}

This is not close to the code that would work with arrays. We could specialize things for small arrays, but that hasn't been done yet. My point is it's not comparable with the classic brute-force subsequence search.

What is comparable is shown below. indexOf2 defines the straight array-in-array brute-force search. With dmd on my machine it runs in some 7 seconds. So does indexOf (this was my point). Then indexOf3 uses a simple trick that the Java implementation does: cache the first character. That reduces the running time to 5 seconds. Then indexOf4 optimizes the loop searching for that first character, bringing run time to about 4 seconds. On that machine the Java implementation runs in 3.5 seconds.

Andrei

import std.stdio, std.string, std.datetime;

long indexOf2(string haystack, string needle)
{
    if (!needle.length) return 0;
    if (needle.length > haystack.length) return -1;
    auto limit = haystack.length - needle.length;
bigloop:
    for (size_t i; i <= limit; ++i)
    {
        foreach (q; 0 .. needle.length)
        {
            if (haystack[i + q] != needle[q]) continue bigloop;
        }
        return i;
    }
    return -1;
}

long indexOf3(string haystack, string needle)
{
    if (!needle.length) return 0;
    if (needle.length > haystack.length) return -1;
    immutable c = needle[0];
    auto limit = haystack.length - needle.length;
    assert(limit < uint.max, "not handled");
bigloop:
    for (uint i; i <= limit; ++i)
    {
        if (haystack.ptr[i] != c) continue;
        foreach (q; 1 .. needle.length)
        {
            if (haystack[i + q] != needle[q]) continue bigloop;
        }
        return i;
    }
    return -1;
}

long indexOf4(string haystack, string needle)
{
    if (!needle.length) return 0;
    if (needle.length > haystack.length) return -1;
    immutable c = needle[0];
    auto limit = haystack.length - needle.length;
    size_t i = 0;
bigloop:
    for (;;)
    {
        while (i <= limit && haystack[i++] != c) {}
        if (i-- > limit) break;
        foreach (q; 1 .. needle.length)
        {
            if (haystack[i + q] != needle[q]) continue bigloop;
        }
        return i;
    }
    return -1;
}

unittest
{
    assert(indexOf2("hello, world", ",") == 5);
    assert(indexOf2("hello, world", "a") == -1);
    assert(indexOf3("hello, world", ",") == 5);
    assert(indexOf3("hello, world", "a") == -1);
    assert(indexOf4("hello, world", ",") == 5);
    assert(indexOf4("hello, world", "a") == -1);
}

void main(string[] args)
{
    auto message = "REGISTER sip:example.com SIP/2.0\r\nContent-Length: 0\r\nContact: <sip:12345 10.1.3.114:59788;transport=tls>;expires=4294967295;events=\"message-summary\";q=0.9\r\nTo: <sip:12345 comm.example.com>\r\nUser-Agent: (\"VENDOR=MyCompany\" \"My User Agent\")\r\nMax-Forwards: 70\r\nCSeq: 1 REGISTER\r\nVia: SIP/2.0/TLS 10.1.3.114:59788;branch=z9hG4bK2910497772630690\r\nCall-ID: 2910497622026445\r\nFrom: <sip:12345 comm.example.com>;tag=2910497618150713\r\n\r\n";

    auto t1 = Clock.currTime();
    for (int i=0; i<10000000; i++)
    {
        long index1 = 0;
        while (true)
        {
            auto index2 = indexOf(message[index1 .. $], "\r\n");
            if (index2 == -1) break;
            auto notused = message[index1 .. index1 + index2];
            if (args.length == 42) writeln(notused);
            index1 += index2 + 2;
        }
    }
    writeln(Clock.currTime()-t1);
}
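The first-character-caching trick that indexOf3 uses can be sketched language-independently; the following Python rendition (illustrative, not from the thread) compares the brute-force search with the cached-first-character variant and checks both against the built-in str.find:

```python
def index_of_brute(haystack, needle):
    # Straight brute-force subsequence search (indexOf2's approach):
    # compare the full needle at every position.
    if not needle:
        return 0
    for i in range(len(haystack) - len(needle) + 1):
        if haystack[i:i + len(needle)] == needle:
            return i
    return -1

def index_of_cached(haystack, needle):
    # indexOf3's idea: scan cheaply for needle[0] first, and only do the
    # full comparison at positions where the first character matches.
    if not needle:
        return 0
    c = needle[0]
    limit = len(haystack) - len(needle)
    i = 0
    while i <= limit:
        if haystack[i] == c and haystack[i:i + len(needle)] == needle:
            return i
        i += 1
    return -1

msg = "CSeq: 1 REGISTER\r\nVia: SIP/2.0/TLS\r\n"
assert index_of_brute(msg, "\r\n") == index_of_cached(msg, "\r\n") == msg.find("\r\n")
```

The win comes from making the common case (first character does not match) as cheap as possible; indexOf4 then tightens that inner scan further.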
Mar 02 2013
parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Sat, 02 Mar 2013 17:29:38 -0500, Andrei Alexandrescu  
<SeeWebsiteForEmail erdani.org> wrote:

 On 3/2/13 2:32 PM, Steven Schveighoffer wrote:
 This is not a personal attack, please don't take it that way.
Not at all. I'm just confused about this exchange. You mention some correct facts ("MySplitter finished in 6 seconds instead of 10"), then some mistaken suppositions ("a string-specific version could be written"), and derive conclusions from both ("Maybe there is a bad constraint somewhere"). Hard to deal with the mix.
I didn't realize the situation. I assumed that there wasn't a version of splitter for strings that took array-specific shortcuts. My corrected statement should read "the existing string-specific version could be improved."

My conclusion of "Maybe there is a bad constraint somewhere" was not derived from that; it was based on your statement elsewhere that "I wrote a custom splitter specialized for arrays. It's as fast." Given that my tests have shown I can quite easily write a faster one than the implementation selected by Phobos, and that you said you wrote one that's "as fast," I took that to mean you had written one that was more optimized than the chosen splitter version (and logically assumed you had included this version in Phobos with the intent that it would be chosen when compiling with strings), leading me to suggest that, possibly due to some bug, the "fast" implementation wasn't being chosen.

I didn't realize that "as fast" didn't mean "as fast as yours". I actually don't know what you meant by that now.
 My
 anecdotal tests with hand writing a custom splitter range to handle the
 OP's program gave me a 40% savings. Whether it's find or not, I'm not
 sure, but there definitely is room for improvement.
I think I understand where the confusion comes from. If you're referring to MySplitter, that's not comparable. It uses this at the core:

for(; i + 1 < source.length; i++)
{
    if(source[i] == '\r' && source[i + 1] == '\n')
    {
        found = true;
        break;
    }
    ...
}

This is not close to the code that would work with arrays. We could specialize things for small arrays, but that hasn't been done yet. My point is it's not comparable with the classic brute-force subsequence search.
Very good point, here is a new version that takes any string as a separator:

struct MySplitter
{
    private string s;
    private string separator;
    private string source;

    this(string src, string sep)
    {
        source = src;
        separator = sep;
        popFront();
    }

    @property string front()
    {
        return s;
    }

    @property bool empty()
    {
        return s.ptr == null;
    }

    void popFront()
    {
        if(!source.length)
        {
            s = source;
            source = null;
        }
        else
        {
            size_t i = 0;
            bool found = false;
            for(; i + separator.length <= source.length; i++)
            {
                if(source[i] == separator[0])
                {
                    found = true;
                    for(size_t j = i+1, k=1; k < separator.length; ++j, ++k)
                        if(source[j] != separator[k])
                        {
                            found = false;
                            break;
                        }
                    if(found)
                        break;
                }
            }
            s = source[0..i];
            if(found)
                source = source[i + separator.length..$];
            else
                source = source[$..$];
        }
    }
}

Takes 7 seconds on my machine instead of 6, but not 10 like std.algorithm.splitter. I don't even like the loop that well; it looks crude, and I can probably optimize it further.

And it does not use any of your specified tricks.

Is this sufficiently comparable, or am I missing something else?

-Steve
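For readers more at home in Python, the scan-confirm-slice logic of MySplitter can be rendered as a generator. This is an illustrative sketch only (the D version above is what was actually benchmarked), checked against the built-in str.split:

```python
def my_splitter(source, separator):
    # Mirror MySplitter's popFront: scan for the separator's first
    # character, confirm the rest, then slice off the field.
    while True:
        found = False
        i = 0
        while i + len(separator) <= len(source):
            if source[i] == separator[0] and source[i:i + len(separator)] == separator:
                found = True
                break
            i += 1
        if found:
            yield source[:i]
            source = source[i + len(separator):]
        else:
            yield source  # last field (possibly empty), then stop
            return

msg = "a\r\nbb\r\n\r\nccc"
assert list(my_splitter(msg, "\r\n")) == msg.split("\r\n")
```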
Mar 04 2013
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/4/13 10:39 AM, Steven Schveighoffer wrote:
 struct MySplitter
 {
 private string s;
 private string separator;
 private string source;
 this(string src, string sep)
 {
 source = src;
 separator = sep;
 popFront();
 }

  @property string front()
 {
 return s;
 }

  @property bool empty()
 {
 return s.ptr == null;
 }

 void popFront()
 {
 if(!source.length)
 {
 s = source;
 source = null;
 }
 else
 {
 size_t i = 0;
 bool found = false;
 for(; i + separator.length <= source.length; i++)
 {
 if(source[i] == separator[0])
 {
 found = true;
 for(size_t j = i+1, k=1; k < separator.length; ++j, ++k)
 if(source[j] != separator[k])
 {
 found = false;
 break;
 }
 if(found)
 break;
 }
 }
 s = source[0..i];
 if(found)
 source = source[i + separator.length..$];
 else
 source = source[$..$];
 }
 }
 }

 Takes 7 seconds on my machine instead of 6, but not 10 like
 std.algorithm.splitter. I don't even like the loop that well, it looks
 crude, I can probably optimize it further.

 And it does not use any of your specified tricks.

 Is this sufficiently comparable, or am I missing something else?

 -Steve
That's comparable, please file an enh request.

Thanks,

Andrei
Mar 04 2013
parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Mon, 04 Mar 2013 12:35:23 -0500, Andrei Alexandrescu  
<SeeWebsiteForEmail erdani.org> wrote:

 That's comparable, please file an enh request.
http://d.puremagic.com/issues/show_bug.cgi?id=9646
Mar 04 2013
parent reply qznc <qznc web.de> writes:
On Monday, 4 March 2013 at 19:11:17 UTC, Steven Schveighoffer 
wrote:
 On Mon, 04 Mar 2013 12:35:23 -0500, Andrei Alexandrescu 
 <SeeWebsiteForEmail erdani.org> wrote:

 That's comparable, please file an enh request.
http://d.puremagic.com/issues/show_bug.cgi?id=9646
Below is an implementation which matches MySplitter with dmd, but not with ldc. More precisely:

dmd -O -release -inline -noboundscheck:
  std.algorithm.splitter took 5 secs, 170 ms, and 656 μs
  MySplitter took 4 secs, 835 ms, 748 μs, and 1 hnsec
  my_splitter took 4 secs, 862 ms, 914 μs, and 4 hnsecs

ldc2 -O -release:
  std.algorithm.splitter took 3 secs, 540 ms, and 84 μs
  MySplitter took 2 secs, 288 ms, and 535 μs
  my_splitter took 3 secs, 463 ms, and 897 μs

CPU: Intel i7 M 620 2.67GHz
dmd: v2.070.2
ldc: 0.17.1 (based on DMD v2.068.2 and LLVM 3.7.1)

auto my_splitter(alias pred = "a == b", Range, Separator)(Range r, Separator s)
if (is(typeof(binaryFun!pred(r.front, s.front)) : bool)
    && (hasSlicing!Range || isNarrowString!Range)
    && isForwardRange!Separator
    && (hasLength!Separator || isNarrowString!Separator))
{
    import std.algorithm.searching : find;
    import std.conv : unsigned;

    static struct Result
    {
    private:
        Range _input;
        Separator _separator;
        size_t _next_index;
        bool _empty;

        @property auto separatorLength() { return _separator.length; }

        void findIndex()
        {
            auto found = find!pred(_input, _separator);
            _next_index = _input.length - found.length;
        }

    public:
        this(Range input, Separator separator)
        {
            _input = input;
            _separator = separator;
            _empty = false;
            findIndex();
        }

        @property Range front()
        {
            assert(!empty);
            return _input[0 .. _next_index];
        }

        static if (isInfinite!Range)
        {
            enum bool empty = false;  // Propagate infiniteness
        }
        else
        {
            @property bool empty() { return _empty; }
        }

        void popFront()
        {
            assert(!empty);
            if (_input.empty)
            {
                _empty = true;
                assert(_next_index == 0);
            }
            else if (_input.length == _next_index)
            {
                _input = _input[$ .. $];
                _empty = true;
            }
            else
            {
                _input = _input[_next_index + separatorLength .. $];
                findIndex();
            }
        }

        static if (isForwardRange!Range)
        {
            @property typeof(this) save()
            {
                auto ret = this;
                ret._input = _input.save;
                return ret;
            }
        }
    }
    return Result(r, s);
}
May 22 2016
next sibling parent reply Seb <seb wilzba.ch> writes:
On Sunday, 22 May 2016 at 18:56:30 UTC, qznc wrote:
 On Monday, 4 March 2013 at 19:11:17 UTC, Steven Schveighoffer 
 wrote:
 [...]
Below is an implementation, which matches MySplitter with dmd, but not with ldc. More precisely: [...]
have you thought about opening a PR to improve `splitter`?
May 23 2016
parent reply qznc <qznc web.de> writes:
On Monday, 23 May 2016 at 11:52:35 UTC, Seb wrote:
 On Sunday, 22 May 2016 at 18:56:30 UTC, qznc wrote:
 On Monday, 4 March 2013 at 19:11:17 UTC, Steven Schveighoffer 
 wrote:
 [...]
Below is an implementation, which matches MySplitter with dmd, but not with ldc. More precisely: [...]
have you thought about opening a PR to improve `splitter`?
Yes, but I'm not sure about the goals. I also want to dig a little deeper. Being "sometimes faster" without some understanding of why and when feels unsatisfying. Additionally, there is this weird special case for a bidirectional range, which just adds unnecessary overhead. Is "remove dead code" a good enough reason in itself for a PR?
May 23 2016
next sibling parent Seb <seb wilzba.ch> writes:
On Monday, 23 May 2016 at 12:01:52 UTC, qznc wrote:
 On Monday, 23 May 2016 at 11:52:35 UTC, Seb wrote:
 On Sunday, 22 May 2016 at 18:56:30 UTC, qznc wrote:
 On Monday, 4 March 2013 at 19:11:17 UTC, Steven Schveighoffer 
 wrote:
 [...]
Below is an implementation, which matches MySplitter with dmd, but not with ldc. More precisely: [...]
have you thought about opening a PR to improve `splitter`?
Yes, but I'm not sure about the goals. I also want to dig a little deeper. Being "sometimes faster" without some understanding of why and when feels unsatisfying.
Good luck. If you need help, it's probably the best to open the PR as people can directly see your code and the Phobos repo is usually pretty active.
 Additionally, there is this weird special case for a 
 bidirectional range, which just adds unnecessary overhead.
 Is  "remove dead code" a good enough reason in itself for a PR?
Absolutely ;-)
May 23 2016
prev sibling parent reply qznc <qznc web.de> writes:
On Monday, 23 May 2016 at 12:01:52 UTC, qznc wrote:
 Additionally, there is this weird special case for a 
 bidirectional range, which just adds unnecessary overhead. Is 
 "remove dead code" a good enough reason in itself for a PR?
Forget the "dead code" comment; it is actually a missing feature. In the case where the separator is a range and the input range is bidirectional, the splitter result should be bidirectional as well. It is not, because the implementation of back() and popBack() is missing, although some bookkeeping code for it exists.
May 23 2016
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 05/23/2016 08:17 AM, qznc wrote:
 On Monday, 23 May 2016 at 12:01:52 UTC, qznc wrote:
 Additionally, there is this weird special case for a bidirectional
 range, which just adds unnecessary overhead. Is "remove dead code" a
 good enough reason in itself for a PR?
Forget the "dead code" comment; it is actually a missing feature. In the case where the separator is a range and the input range is bidirectional, the splitter result should be bidirectional as well. It is not, because the implementation of back() and popBack() is missing, although some bookkeeping code for it exists.
Most uses are forward, so perhaps it's best to eliminate the bidirectional bookkeeping if it adds overhead, and confine those to a separate function e.g. splitterBidirectional. There is precedent for that in Phobos. -- Andrei
May 23 2016
parent reply Jack Stouffer <jack jackstouffer.com> writes:
On Monday, 23 May 2016 at 13:20:55 UTC, Andrei Alexandrescu wrote:
 Most uses are forward, so perhaps it's best to eliminate the 
 bidirectional bookkeeping if it adds overhead, and confine 
 those to a separate function e.g. splitterBidirectional. There 
 is precedent for that in Phobos. -- Andrei
I don't see the value add that things like filterBidirectional give when the back and popBack functions can be static if-ed in or out of the normal filter. filterBidirectional always felt like a relic of a time when static if's didn't exist, even though I know that's not the case.
May 23 2016
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 5/23/16 9:44 AM, Jack Stouffer wrote:
 On Monday, 23 May 2016 at 13:20:55 UTC, Andrei Alexandrescu wrote:
 Most uses are forward, so perhaps it's best to eliminate the
 bidirectional bookkeeping if it adds overhead, and confine those to a
 separate function e.g. splitterBidirectional. There is precedent for
 that in Phobos. -- Andrei
I don't see the value add that things like filterBidirectional give when the back and popBack functions can be static if-ed in or out of the normal filter.
What would be the condition tested in the static if? Recall that sometimes you do want a forward range even when you could define a bidirectional range. -- Andrei
May 23 2016
parent reply Jack Stouffer <jack jackstouffer.com> writes:
On Monday, 23 May 2016 at 14:17:59 UTC, Andrei Alexandrescu wrote:
 What would be the condition tested in the static if?
I would assume `static if (isBidirectionalRange!Range)`
 Recall that sometimes you do want a forward range even when you 
 could define a bidirectional range. -- Andrei
I honestly can't think of a time when this would be the case. What are you losing by having back and popBack defined?
May 23 2016
next sibling parent reply qznc <qznc web.de> writes:
On Monday, 23 May 2016 at 14:25:52 UTC, Jack Stouffer wrote:
 On Monday, 23 May 2016 at 14:17:59 UTC, Andrei Alexandrescu 
 wrote:
 What would be the condition tested in the static if?
I would assume `static if (isBidirectionalRange!Range)`
 Recall that sometimes you do want a forward range even when 
 you could define a bidirectional range. -- Andrei
I honestly can't think of a time when this would be the case. What are you losing by having back and popBack defined?
Here is a commit, which removes the dead code:

https://github.com/qznc/phobos/commit/0cb1e70f4504a52b81c30342be7ec26d597fac69

It would be strictly better to implement the back and popBack, no? My single test case says removing the code does not result in faster code. Maybe dmd optimizes the bookkeeping away.

I see three options:

1. Remove dead bookkeeping code
2. Implement back() and popBack()
3. Use alternative splitter implementation (and implement back() and popBack())

The third one would be the best, if it is really faster.
May 23 2016
next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 05/23/2016 10:47 AM, qznc wrote:
 It would be strictly better to implement the back and popBack, no?
If it adds no overhead, you're right. -- Andrei
May 23 2016
prev sibling parent reply Jack Stouffer <jack jackstouffer.com> writes:
On Monday, 23 May 2016 at 14:47:22 UTC, qznc wrote:
 I see three options:

 1. Remove dead bookkeeping code
 2. Implement back() and popBack()
 3. Use alternative splitter implementation (and implement 
 back() and popBack())

 The third one would be the best, if it is really faster.
If the performance is really a problem (I think it's a micro optimization at best), then the best option is not to violate DRY. Have a template parameter, std.typecons.Flag, to turn off the back and popBack. Don't have two functions that are 95% the same code like filter and filterBidirectional.
May 23 2016
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 05/23/2016 03:11 PM, Jack Stouffer wrote:
 (I think it's a micro optimization at best)
splitter in the stdlib is likely to be very frequently useful, any bit of speedup we put in it is likely to pay off immensely. -- Andrei
May 23 2016
prev sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 05/23/2016 10:25 AM, Jack Stouffer wrote:
 On Monday, 23 May 2016 at 14:17:59 UTC, Andrei Alexandrescu wrote:
 What would be the condition tested in the static if?
I would assume `static if (isBidirectionalRange!Range)`
 Recall that sometimes you do want a forward range even when you could
 define a bidirectional range. -- Andrei
I honestly can't think of a time when this would be the case. What are you losing by having back and popBack defined?
You might be doing extra work even if the user never calls back and popBack. -- Andrei
May 23 2016
prev sibling parent reply qznc <qznc web.de> writes:
On Sunday, 22 May 2016 at 18:56:30 UTC, qznc wrote:
 On Monday, 4 March 2013 at 19:11:17 UTC, Steven Schveighoffer 
 wrote:
 On Mon, 04 Mar 2013 12:35:23 -0500, Andrei Alexandrescu 
 <SeeWebsiteForEmail erdani.org> wrote:

 That's comparable, please file an enh request.
http://d.puremagic.com/issues/show_bug.cgi?id=9646
I think the actual slow thing is std.find, which splitter uses.

Benchmark: https://dpaste.dzfl.pl/1e31b6f95659

The output is:

std find    took 162167000
manual find took  99852000

Actually, std find should be faster, since it could use the Boyer Moore algorithm instead of naive string matching.
May 23 2016
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 05/23/2016 03:11 PM, qznc wrote:
 Actually, std find should be faster, since it could use the Boyer Moore
 algorithm instead of naive string matching.
Conventional wisdom has it that find() is brute force and that's that, but probably it's time to destroy. Selectively using advanced searching algorithms for the appropriate inputs is very DbI-ish. There are a few nice precedents of blend algorithms, see e.g. http://effbot.org/zone/stringlib.htm. Writing a generic subsequence search blend algorithm, one that chooses the right algorithm based on a combination of static and dynamic decisions, is quite a fun and challenging project. Who wanna? Andrei
May 23 2016
next sibling parent reply qznc <qznc web.de> writes:
On Monday, 23 May 2016 at 22:19:18 UTC, Andrei Alexandrescu wrote:
 On 05/23/2016 03:11 PM, qznc wrote:
 Actually, std find should be faster, since it could use the 
 Boyer Moore
 algorithm instead of naive string matching.
Conventional wisdom has it that find() is brute force and that's that, but probably it's time to destroy. Selectively using advanced searching algorithms for the appropriate inputs is very DbI-ish. There are a few nice precedents of blend algorithms, see e.g. http://effbot.org/zone/stringlib.htm. Writing a generic subsequence search blend algorithm, one that chooses the right algorithm based on a combination of static and dynamic decisions, is quite a fun and challenging project. Who wanna?
I observed that Boyer-Moore from std is still slower. My guess is that it is due to the relatively short strings in my benchmark and the need to allocate for the tables. Knuth-Morris-Pratt might be worthwhile, because it only requires a smaller table, which could be allocated on the stack.

The overall sentiment seems to be that KMP vs BM depends on the input. This means an uninformed find function could only use heuristics. Anyway, Phobos could use a KnuthMorrisPrattFinder implementation next to the BoyerMooreFinder. I filed an issue [0].

Apart from advanced algorithms, find should not be slower than a naive nested-for-loop implementation.

[0] https://issues.dlang.org/show_bug.cgi?id=16066
May 24 2016
next sibling parent reply Chris <wendlec tcd.ie> writes:
On Tuesday, 24 May 2016 at 07:54:38 UTC, qznc wrote:
 On Monday, 23 May 2016 at 22:19:18 UTC, Andrei Alexandrescu 
 wrote:
 On 05/23/2016 03:11 PM, qznc wrote:
 Actually, std find should be faster, since it could use the 
 Boyer Moore
 algorithm instead of naive string matching.
Conventional wisdom has it that find() is brute force and that's that, but probably it's time to destroy. Selectively using advanced searching algorithms for the appropriate inputs is very DbI-ish. There are a few nice precedents of blend algorithms, see e.g. http://effbot.org/zone/stringlib.htm. Writing a generic subsequence search blend algorithm, one that chooses the right algorithm based on a combination of static and dynamic decisions, is quite a fun and challenging project. Who wanna?
I observed that Boyer-Moore from std is still slower. My guess is that it is due to the relatively short strings in my benchmark and the need to allocate for the tables. Knuth-Morris-Pratt might be worthwhile, because it only requires a smaller table, which could be allocated on the stack. The overall sentiment seems to be that KMP vs BM depends on the input. This means an uninformed find function could only use heuristics. Anyway, Phobos could use a KnuthMorrisPrattFinder implementation next to the BoyerMooreFinder. I filed an issue [0]. Apart from advanced algorithms, find should not be slower than a naive nested-for-loop implementation. [0] https://issues.dlang.org/show_bug.cgi?id=16066
From Phobos [1]:

/* Note: This step could be made a lot faster.
 * A simple implementation is shown here. */
this.skip = new size_t[needle.length];
foreach (a; 0 .. needle.length)
{
    size_t value = 0;
    while (value < needle.length && !needlematch(needle, a, value))
    {
        ++value;
    }
    this.skip[needle.length - a - 1] = value;
}

[1] https://github.com/dlang/phobos/blob/master/std/algorithm/searching.d#L335
May 24 2016
parent reply qznc <qznc web.de> writes:
On Tuesday, 24 May 2016 at 09:34:30 UTC, Chris wrote:
 On Tuesday, 24 May 2016 at 07:54:38 UTC, qznc wrote:
 Apart from advanced algorithms, find should not be slower than 
 a naive nested-for-loop implementation.

 [0] https://issues.dlang.org/show_bug.cgi?id=16066
From Phobos [1]:

/* Note: This step could be made a lot faster.
 * A simple implementation is shown here. */
this.skip = new size_t[needle.length];
foreach (a; 0 .. needle.length)
{
    size_t value = 0;
    while (value < needle.length && !needlematch(needle, a, value))
    {
        ++value;
    }
    this.skip[needle.length - a - 1] = value;
}

[1] https://github.com/dlang/phobos/blob/master/std/algorithm/searching.d#L335
Unfortunately, it is slower. My current measurements [0]:

std find    took 753837231
manual find took 129561980

std find    took 721676721
manual find took 216418870

The nested-for-loop implementation is roughly 4 times faster!

Caveat: I'm not completely sure my benchmark is fair yet.

[0] https://github.com/qznc/find-is-too-slow
May 24 2016
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 05/24/2016 06:44 AM, qznc wrote:
 On Tuesday, 24 May 2016 at 09:34:30 UTC, Chris wrote:
 On Tuesday, 24 May 2016 at 07:54:38 UTC, qznc wrote:
 Apart from advanced algorithms, find should not be slower than a
 naive nested-for-loop implementation.

 [0] https://issues.dlang.org/show_bug.cgi?id=16066
From Phobos [1]:

/* Note: This step could be made a lot faster.
 * A simple implementation is shown here. */
this.skip = new size_t[needle.length];
foreach (a; 0 .. needle.length)
{
    size_t value = 0;
    while (value < needle.length && !needlematch(needle, a, value))
    {
        ++value;
    }
    this.skip[needle.length - a - 1] = value;
}

[1] https://github.com/dlang/phobos/blob/master/std/algorithm/searching.d#L335
Unfortunately, it is slower. My current measurements [0]:

std find    took 753837231
manual find took 129561980

std find    took 721676721
manual find took 216418870

The nested-for-loop implementation is roughly 4 times faster!

Caveat: I'm not completely sure my benchmark is fair yet.

[0] https://github.com/qznc/find-is-too-slow
One somewhat unrelated thing that can be done with Boyer-Moore: use a constant-length skip array instead of dynamic allocation. Upon saturation, continue with brute force. -- Andrei
May 24 2016
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 05/24/2016 08:38 AM, Andrei Alexandrescu wrote:
 One somewhat unrelated thing that can be done with Boyer-Moore: use a
 constant-length skip array instead of dynamic allocation. Upon
 saturation, continue with brute force. -- Andrei
Another idea: in every searched string there is one specific index that offers the most information to a failed search, allowing for a large skip. Roughly speaking it's the largest index around which there are many characters not seen elsewhere in the searched string. (In a string with all distinct characters, that would be the last character.) If that offset is cheap to compute and offers a good skip, a search starting from that index would be very fast. -- Andrei
May 24 2016
prev sibling parent reply qznc <qznc web.de> writes:
On Tuesday, 24 May 2016 at 10:44:12 UTC, qznc wrote:
 Unfortunately, it is slower. My current measurements [0]:


 std find    took    753837231
 manual find took    129561980

 std find    took    721676721
 manual find took    216418870

 The nested-for-loop implementation is roughly 4 times faster!

 Caveat: I'm not completely sure my benchmark is fair yet.

 [0] https://github.com/qznc/find-is-too-slow
Plot twist: manual find is not always faster. My benchmark now generates a random haystack and needle. Sometimes Phobos clearly wins. Example:

$ ./benchmark.ldc -n39 -l10000 -i10000
Found at 10000
std find    took 289529102
manual find took 368958679

This is compiled with LDC; the needle is 39 chars, the haystack 10000 chars, with 10000 iterations. "Found at 10000" means the needle was not found. Very often manual find wins, though.

Takeaway: manual find is too naive. However, something slows down std find in general. More research needed.

PS: Any recommendations for a Linux disassembler? Objdump is ok, but some arrows for jumps would be nice.
May 24 2016
next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 05/24/2016 03:47 PM, qznc wrote:
 PS: Any recommendations for a Linux disassembler? Objdump is ok, but
 some arrows for jumps would be nice.
Try http://asm.dlang.org for short tests. -- Andrei
May 24 2016
prev sibling parent reply qznc <qznc web.de> writes:
On Tuesday, 24 May 2016 at 19:47:52 UTC, qznc wrote:
 PS: Any recommendations for a Linux disassembler? Objdump is 
 ok, but some arrows for jumps would be nice.
I discovered radare2. While it is actually a debugger (like gdb), it includes a nice disassembler. Example view, which shows the loop calling std.algorithm.find repeatedly:

       |   0x00402e08      nop [rax+rax+0x0]
      .--> 0x00402e10      mov rdx, [rbx]
      ||   0x00402e13      mov rcx, [rbx+0x8]
      ||   0x00402e17      mov rdi, r15
      ||   0x00402e1a      mov rsi, r12
      ||   0x00402e1d      call 0x409160  ; sym._D3std9algorithm9searching34__T4fin
      ||   0x00402e22      mov rcx, [rbx]
      ||   0x00402e25      sub rcx, rax
      ||   0x00402e28      mov [rbx+0x20], rcx
      ||   0x00402e2c      dec ebp
      `==< 0x00402e2e      jnz 0x402e10
       |   0x00402e30      lea r15, [rsp]

Note the ASCII arrow on the left! Objdump should do that. :)
May 24 2016
parent reply Chris <wendlec tcd.ie> writes:
On Tuesday, 24 May 2016 at 22:01:17 UTC, qznc wrote:
 On Tuesday, 24 May 2016 at 19:47:52 UTC, qznc wrote:

 I discovered radare2. While it is actually a debugger (like 
 gdb), it includes a nice disassembler. Example view, which 
 shows the loop calling std.algorithm.find repeatedly:

        |   0x00402e08,   0 |           nop [rax+rax+0x0]
       .--> 0x00402e10,   0 |           mov rdx, [rbx]
       ||   0x00402e13    0 |           mov rcx, [rbx+0x8]
       ||   0x00402e17    0 |           mov rdi, r15
       ||   0x00402e1a    0 |           mov rsi, r12
       ||   0x00402e1d    0 |           call 0x409160  ; 4 = 
 sym._D3std9algorithm9searching34__T4fin
       ||   0x00402e22    0 |           mov rcx, [rbx]
       ||   0x00402e25    0 |           sub rcx, rax
       ||   0x00402e28,   0 |           mov [rbx+0x20], rcx
       ||   0x00402e2c,   0 |           dec ebp
       `==< 0x00402e2e    0 |           jnz 0x402e10  ; 5 = 
 0x00402e10
        |   0x00402e30,   0 |           lea r15, [rsp]

 Note the ASCII arrow on the left! Objdump should do that. :)
Any suggestions how we can make find more efficient?
May 25 2016
parent reply qznc <qznc web.de> writes:
On Wednesday, 25 May 2016 at 10:28:47 UTC, Chris wrote:
 Any suggestions how we can make find more efficient?
I will send a pull request soon. My current patch [0] improves it, but still contains a bug. [0] https://github.com/qznc/find-is-too-slow/commit/76efc706a2c1d9e885800ca9b830850935700a95
May 25 2016
parent reply qznc <qznc web.de> writes:
On Wednesday, 25 May 2016 at 10:45:35 UTC, qznc wrote:
 On Wednesday, 25 May 2016 at 10:28:47 UTC, Chris wrote:
 Any suggestions how we can make find more efficient?
I will send a pull request soon. My current patch [0] improves it, but still contains a bug. [0] https://github.com/qznc/find-is-too-slow/commit/76efc706a2c1d9e885800ca9b830850935700a95
And here it is: https://github.com/dlang/phobos/pull/4362 Destroy!
May 25 2016
parent Chris <wendlec tcd.ie> writes:
On Wednesday, 25 May 2016 at 11:30:33 UTC, qznc wrote:
 On Wednesday, 25 May 2016 at 10:45:35 UTC, qznc wrote:
 On Wednesday, 25 May 2016 at 10:28:47 UTC, Chris wrote:
 Any suggestions how we can make find more efficient?
I will send a pull request soon. My current patch [0] improves it, but still contains a bug. [0] https://github.com/qznc/find-is-too-slow/commit/76efc706a2c1d9e885800ca9b830850935700a95
And here it is: https://github.com/dlang/phobos/pull/4362 Destroy!
Great stuff! I really appreciate your efforts. As I said, my code uses find/canFind a lot.
May 25 2016
prev sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 05/24/2016 03:54 AM, qznc wrote:
 On Monday, 23 May 2016 at 22:19:18 UTC, Andrei Alexandrescu wrote:
 On 05/23/2016 03:11 PM, qznc wrote:
 Actually, std find should be faster, since it could use the Boyer Moore
 algorithm instead of naive string matching.
Conventional wisdom has it that find() is brute force and that's that, but probably it's time to destroy. Selectively using advanced searching algorithms for the appropriate inputs is very DbI-ish. There are a few nice precedents of blend algorithms, see e.g. http://effbot.org/zone/stringlib.htm. Writing a generic subsequence search blend algorithm, one that chooses the right algorithm based on a combination of static and dynamic decisions, is quite a fun and challenging project. Who wanna?
I observed that Boyer-Moore from std is still slower. My guess is that it is due to the relatively short strings in my benchmark and the need to allocate for the tables. Knuth-Morris-Pratt might be worthwhile, because it only requires a smaller table, which could be allocated on the stack.
There's also Rabin-Karp that can be used when the sought range is relatively short.
 The overall sentiment seems to be that KMP vs BM depends on the input.
 This means an uninformed find function could only use heuristics.
 Anyway, Phobos could use a KnuthMorrisPrattFinder implementation next to
 the BoyerMooreFinder. I filed an issue [0].
Great. I keep on thinking how the advanced algorithms should "kick in" only after a partial match was long enough and frequent enough. As long as the partial match is short and infrequent, brute force will do best.
 Apart from advanced algorithms, find should not be slower than a naive
 nested-for-loop implementation.

 [0] https://issues.dlang.org/show_bug.cgi?id=16066
Nice, thanks. Andrei
May 24 2016
prev sibling next sibling parent Chris <wendlec tcd.ie> writes:
On Monday, 23 May 2016 at 22:19:18 UTC, Andrei Alexandrescu wrote:
 On 05/23/2016 03:11 PM, qznc wrote:
 Actually, std find should be faster, since it could use the 
 Boyer Moore
 algorithm instead of naive string matching.
Conventional wisdom has it that find() is brute force and that's that, but probably it's time to destroy. Selectively using advanced searching algorithms for the appropriate inputs is very DbI-ish.
Since I work a lot with text, I use canFind etc. a lot. It'd be nice to be able to choose faster algorithms. People are often advised to use library functions instead of rolling their own, because library functions are optimized for speed, but this doesn't seem to be true of Phobos. Simple everyday tasks like string matching should be as fast as C.
 There are a few nice precedents of blend algorithms, see e.g. 
 http://effbot.org/zone/stringlib.htm.

 Writing a generic subsequence search blend algorithm, one that 
 chooses the right algorithm based on a combination of static 
 and dynamic decisions, is quite a fun and challenging project. 
 Who wanna?


 Andrei
May 24 2016
prev sibling parent reply Chris <wendlec tcd.ie> writes:
On Monday, 23 May 2016 at 22:19:18 UTC, Andrei Alexandrescu wrote:
 On 05/23/2016 03:11 PM, qznc wrote:

 Conventional wisdom has it that find() is brute force and 
 that's that, but probably it's time to destroy. Selectively 
 using advanced searching algorithms for the appropriate inputs 
 is very DbI-ish.

 There are a few nice precedents of blend algorithms, see e.g. 
 http://effbot.org/zone/stringlib.htm.

 Writing a generic subsequence search blend algorithm, one that 
 chooses the right algorithm based on a combination of static 
 and dynamic decisions, is quite a fun and challenging project. 
 Who wanna?


 Andrei
I've played around with some algorithms and kept them as simple as possible, but I've found that a linear brute-force for-loop always beats them (it's the extra decision(s), I suppose). It looks nice in theory, but in hardware reality a stupid loop is more efficient.

NB: I've found that `foreach` is usually faster than a manual `for`-loop.

I used qznc's benchmarking, and it's clear that the current implementation of std.algorithm.find is quite a poor performer. We do have to work on that. Library functions should be fast, no matter what. If someone uses `find`, they wanna be sure it's optimized for speed.
May 27 2016
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 5/27/16 8:28 AM, Chris wrote:
 I've played around with some algorithms and kept them as simple as
 possible, but I've found that a linear brute force for-loop always beats
 them (it's the extra decision(s), I suppose). It looks nice in theory,
 but in hardware reality a stupid loop is more efficient.
This is surprising. It should be easy to construct examples where brute force search performs arbitrarily poorly, e.g. searching for "aaaaaaa...aab" in a long string with only "a"s.
 NB: I've found that `foreach` is usually faster than a manual `for`-loop.
That's surprising too. I'd be interested to look at a disassembly if you have one.
 I used qznc's benchmarking, and it's clear that the current
 implementation of std.algorithm.find is quite a poor performer. We do
 have to work on that. Library functions should be fast, no matter what.
 If someone uses `find`, they wanna be sure it's optimized for speed.
That's definitely the way to go. Thanks for looking into it! Andrei
May 27 2016
parent reply Chris <wendlec tcd.ie> writes:
On Friday, 27 May 2016 at 13:29:00 UTC, Andrei Alexandrescu wrote:
 On 5/27/16 8:28 AM, Chris wrote:

 This is surprising. It should be easy to construct examples 
 where brute force search performs arbitrarily poorly, e.g. 
 searching for "aaaaaaa...aab" in a long string with only "a"s.
I will look into this. Atm, I'm using qznc's benchmark.
 NB: I've found that `foreach` is usually faster than a manual 
 `for`-loop.
That's surprising too. I'd be interested to look at a disassembly if you have one.
I have to correct myself: a manual loop is faster. I couldn't believe it either, so I benchmarked the same function with `foreach` and `for`. Turns out that `for` is _way_ faster.
 I used qznc's benchmarking, and it's clear that the current
 implementation of std.algorithm.find is quite a poor 
 performer. We do
 have to work on that. Library functions should be fast, no 
 matter what.
 If someone uses `find`, they wanna be sure it's optimized for 
 speed.
That's definitely the way to go. Thanks for looking into it! Andrei
There is a bug, if it is one, in qznc's `manual_find`. It doesn't find strings at the end of a string, such as "ing" in "string", and it does not find a string that is equal to the search string, as in "abba" == "abba".

This outperforms both `manual_find` and the improved `std find`:

string findStringS_Manual(string haystack, string needle)
{
    if (needle.length > haystack.length)
        return haystack[$ .. $];
    outer:
    for (auto i = 0; i < haystack.length; i++)
    {
        if (haystack[i] != needle[0])
            continue;
        for (size_t j = i+1, k = 1; k < needle.length; ++j, ++k)
            if (haystack[j] != needle[k])
                continue outer;
        return haystack[i .. $];
    }
    return haystack[$ .. $];
}

A typical example would be:

std find     took 25642297
manual find  took  7426676  // manual_find
my std find  took  6654496  // improved find
findStringS  took  7159556  // foreach (i, c; haystack)
findStringS_ took  5315498  // for (auto i = 0 ...), see above

I think it's clear that `std find` is veeeery slow. If anyone wants to test my function, please do so. I don't want to spread wrong data, but this is what I get at the moment (ldc latest version).
May 27 2016
next sibling parent reply Chris <wendlec tcd.ie> writes:
On Friday, 27 May 2016 at 14:06:09 UTC, Chris wrote:
 This outperforms both `manual_find` and the improved `std find`

 string findStringS_Manual(string haystack, string needle)
 {
 	
 	if (needle.length > haystack.length)
 		return haystack[$..$];
 	outer:
 	for (auto i = 0; i < haystack.length; i++)
 	{
 		if (haystack[i] != needle[0])
 			continue;
 		for (size_t j = i+1, k = 1; k < needle.length; ++j, ++k)
 			if (haystack[j] != needle[k])
 			continue outer;
 		return haystack[i..$];
 	}
 	return haystack[$..$];
 }
Little bug fix:

for (size_t j = (i+1 < haystack.length) ? i+1 : i, k = 1; k < needle.length; ++j, ++k)

Else it will be out of range when you look for "za" and the string ends with "z". There is a slight performance penalty, but it's still faster than the rest. Maybe there are cleverer ways. But every little check and decision comes with a performance penalty.

The improved `std find` comes very close to the manual function above now, sometimes it's even faster or at least as fast.

std find     took 12573666
manual find  took  7418454
my std find  took  6903854  <===
findStringS  took  7166720
findStringS_ took  6579529  <===

std find     took 11849892
manual find  took  7407091
my std find  took  6573102  <===
findStringS  took  7296610
findStringS_ took  6573214  <===

std find     took 10744361
manual find  took  7576133
my std find  took  6332014  <===
findStringS  took  7224946
findStringS_ took  6555976  <===
May 27 2016
next sibling parent qznc <qznc web.de> writes:
On Friday, 27 May 2016 at 14:41:29 UTC, Chris wrote:
 The improved `std find` comes very close to the manual function 
 above now, sometimes it's even faster or at least as fast.
I think it really depends on the use case. The manual algorithm is really naive and std-find is slightly more advanced. My benchmark creates a random haystack and needle, but this is not real-world data. I believe the "slightly more advanced" approach is a good idea, but there is no data to prove it. It might be interesting to see what algorithms other languages' standard libraries (C++, Python, Java, etc.) use.

Thanks for looking at the benchmarking, Chris! The more eyes, the better. :)
May 27 2016
prev sibling next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 5/27/16 10:41 AM, Chris wrote:
 The improved `std find` comes very close to the manual function above
 now, sometimes it's even faster or at least as fast.
What you want to do here for indexed access is to match the last element first. If no match, continue etc. If there's a match, enter an inner loop where you don't need to check for the end of the haystack. -- Andrei
May 27 2016
parent reply qznc <qznc web.de> writes:
On Friday, 27 May 2016 at 18:21:06 UTC, Andrei Alexandrescu wrote:
 What you want to do here for indexed access is to match the 
 last element first. If no match, continue etc. If there's a 
 match, enter an inner loop where you don't need to check for 
 the end of the haystack. -- Andrei
Another plot twist: sentinels look interesting. Wilzbach [0] mentioned it in the pull request:

"Btw on a related issue - it seems like no one has actually bother to implement Andrei's "Fastware" sentinels in Phobos. If you are already in benchmarking, you might give this specialization for mutable RandomAccessRanges a try ;-)"

I tried it [1]. It changes the numbers from

std find:    153 ±32
manual find: 112 ±19
qznc find:   121 ±18
Chris find:  132 ±20

to

std find:    160 ±31
manual find: 118 ±24
qznc find:   109 ±13  <--- using the sentinel trick
Chris find:  142 ±27

It is normal that the numbers of the other tests change, since those are relative to the fastest one in each run. When qznc find 'wins' more often, the others get more slowdown.

This means another special case for mutable ranges could be worthwhile.

[0] https://github.com/dlang/phobos/pull/4362#issuecomment-222224387
[1] https://github.com/qznc/find-is-too-slow/commit/043d1d2f6dced625e964e8df883ad5a45d7fe654
May 27 2016
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 05/27/2016 06:19 PM, qznc wrote:
 manual find: 118 ±24
 qznc find:   109 ±13  <--- using the sentinel trick
 Chris find:  142 ±27

 It is normal that the numbers of the other tests change, since those are
 relative to the fastest one in each run. When qznc find 'wins' more
 often, the others get more slowdown.
What compiler?
 This means another special case for mutable ranges could be worthwhile.
Also mind the throwing possibilities. Andrei
May 27 2016
parent reply qznc <qznc web.de> writes:
On Saturday, 28 May 2016 at 01:30:11 UTC, Andrei Alexandrescu 
wrote:
 On 05/27/2016 06:19 PM, qznc wrote:
 manual find: 118 ±24
 qznc find:   109 ±13  <--- using the sentinel trick
 Chris find:  142 ±27

 It is normal that the numbers of the other tests change, since 
 those are
 relative to the fastest one in each run. When qznc find 'wins' 
 more
 often, the others get more slowdown.
What compiler?
LDC. For DMD the numbers show no improvement:

   std find: 170 ±31
manual find: 132 ±25
  qznc find: 103 ±6
 Chris find: 157 ±30

to

   std find: 168 ±30
manual find: 131 ±25
  qznc find: 104 ±8   <--- sentinel
 Chris find: 155 ±29

(Caveat: Chris find has a bug, which is triggered occasionally within the 10000 runs.)

The sentinel trick is only a proof of concept. The sentinel value is `needleBack+1`, but range elements need not support addition. Finding a sentinel is hard and most certainly requires more assumptions about the ranges.
 This means another special case for mutable ranges could be 
 worthwhile.
Also mind the throwing possibilities.
What do you mean? Using exceptions?

If anybody wants to replicate: it only takes a few seconds and there are no dependencies except dmd/ldc.

git clone https://github.com/qznc/find-is-too-slow
cd find-is-too-slow
make dmd
make ldc
May 28 2016
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 5/28/16 6:56 AM, qznc wrote:
 The sentinel value is `needleBack+1`, but range elements need not
 support addition. Finding a sentinel is hard and most certainly requires
 more assumptions about the ranges.
No need for a sentinel really so long as you first search for the last element of the needle, in the haystack. Allow me to put on the table a simple brute force routine, specialized for arrays with copyable elements, no custom predicate etc. Far as I can tell it does no redundant work:

T[] find(T)(T[] haystack, T[] needle)
{
    if (needle.length == 0) return haystack;
    immutable lastIndex = needle.length - 1;
    auto last = needle[lastIndex];
    size_t j = lastIndex;
    for (; j < haystack.length; ++j)
    {
        if (haystack[j] != last) continue;
        immutable k = j - lastIndex; // last elements match
        for (size_t i = 0; ; ++i)
        {
            if (i == lastIndex) return haystack[k .. $];
            if (needle[i] != haystack[k + i]) break;
        }
    }
    return haystack[$ .. $];
}

unittest
{
    string s1 = "hello abc world";
    assert(find(s1, "abc") == "abc world");
    assert(find(s1, "def") == "");
}

void main(){}


Andrei
May 28 2016
next sibling parent reply Chris <wendlec tcd.ie> writes:
On Saturday, 28 May 2016 at 12:47:59 UTC, Andrei Alexandrescu 
wrote:
 On 5/28/16 6:56 AM, qznc wrote:
 The sentinel value is `needleBack+1`, but range elements need 
 not
 support addition. Finding a sentinel is hard and most 
 certainly requires
 more assumptions about the ranges.
 No need for a sentinel really so long as you first search for the last element of the needle, in the haystack. Allow me to put on the table a simple brute force routine, specialized for arrays with copyable elements, no custom predicate etc. Far as I can tell it does no redundant work:

 T[] find(T)(T[] haystack, T[] needle)
 {
     if (needle.length == 0) return haystack;
     immutable lastIndex = needle.length - 1;
     auto last = needle[lastIndex];
     size_t j = lastIndex;
     for (; j < haystack.length; ++j)
     {
         if (haystack[j] != last) continue;
         immutable k = j - lastIndex; // last elements match
         for (size_t i = 0; ; ++i)
         {
             if (i == lastIndex) return haystack[k .. $];
             if (needle[i] != haystack[k + i]) break;
         }
     }
     return haystack[$ .. $];
 }

 unittest
 {
     string s1 = "hello abc world";
     assert(find(s1, "abc") == "abc world");
     assert(find(s1, "def") == "");
 }

 void main(){}

 Andrei
I included your function (I also fixed the bug in my findStringS_). Here's a typical result:

    std find: 208 ±61
 manual find: 127 ±29
   qznc find: 108 ±13
  Chris find: 140 ±32
 Andrei find: 126 ±27

=====

    std find: 207 ±59
 manual find: 128 ±30
   qznc find: 108 ±13
  Chris find: 141 ±32
 Andrei find: 125 ±27

I used dmd, because I don't have ldc on my laptop. qznc's find is clearly the winner.
May 28 2016
parent qznc <qznc web.de> writes:
On Saturday, 28 May 2016 at 14:18:10 UTC, Chris wrote:
 I used dmd, because I don't have ldc on my laptop. qznc's find 
 is clearly the winner.
DMD performance feels flaky to me. If performance is important, you should use ldc or gdc. Alternatively, if you are Walter Bright, you can simply optimize dmd. ;)
May 28 2016
prev sibling parent reply qznc <qznc web.de> writes:
On Saturday, 28 May 2016 at 12:47:59 UTC, Andrei Alexandrescu 
wrote:
 On 5/28/16 6:56 AM, qznc wrote:
 The sentinel value is `needleBack+1`, but range elements need 
 not
 support addition. Finding a sentinel is hard and most 
 certainly requires
 more assumptions about the ranges.
No need for a sentinel really so long as you first search for the last element of the needle, in the haystack. Allow me to put on the table a simple brute force routine, specialized for arrays with copyable elements, no custom predicate etc. Far as I can tell it does no redundant work:
A nice one. Here are the numbers I got:

LDC:
    std find: 156 ±33
 manual find: 117 ±24
   qznc find: 114 ±14
  Chris find: 135 ±24
 Andrei find: 112 ±27

DMD:
    std find: 177 ±31
 manual find: 140 ±28
   qznc find: 102 ±4
  Chris find: 165 ±31
 Andrei find: 130 ±25

My sentinel idea is to have only one conditional branch in the inner loop. Andrei find still has two `if`s in the inner loop. My inner loop looks like this:

haystack[scout] = cast(typeof(needleBack)) (needleBack+1); // place sentinel
for (size_t j = scout + 1 - needleLength, k = 0;; ++j, ++k) { // no condition
    if (!binaryFun!pred(haystack[j], needle[k])) { // only condition
        // ... ok, in here comes another, but it is not as costly
    }
}

As long as you are matching the needle, there is only one condition to check. There is no check for k < needleLength, because the check always fails for needle[$] aka haystack[scout] due to the sentinel. I thought it might be slower, since we need to write to memory twice instead of only reading. Turns out it is faster.
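For reference, the sentinel idea above can be condensed into a minimal self-contained sketch. This is a hypothetical helper specialized to mutable ubyte[] (the actual commit works on generic ranges with a predicate): the element under `scout` is temporarily overwritten with a value that cannot equal the needle's last element, so the inner matching loop needs no `k < needle.length` bounds check.

```d
import std.stdio;

// Hypothetical sketch of the sentinel trick, specialized to mutable ubyte[].
ubyte[] findSentinel(ubyte[] haystack, const(ubyte)[] needle)
{
    if (needle.length == 0) return haystack;
    if (needle.length > haystack.length) return haystack[$ .. $];
    immutable lastIndex = needle.length - 1;
    immutable last = needle[lastIndex];
    size_t scout = lastIndex;
    while (true)
    {
        // advance to the next position where the last elements match
        while (scout < haystack.length && haystack[scout] != last)
            ++scout;
        if (scout >= haystack.length) return haystack[$ .. $];
        immutable start = scout - lastIndex;
        immutable saved = haystack[scout];
        haystack[scout] = cast(ubyte) (last + 1); // place sentinel
        size_t k = 0;
        while (haystack[start + k] == needle[k])  // guaranteed to stop at the sentinel
            ++k;
        haystack[scout] = saved;                  // restore the element
        if (k == lastIndex) return haystack[start .. $];
        ++scout;
    }
}

void main()
{
    auto haystack = cast(ubyte[]) "hello abc world".dup;
    assert(findSentinel(haystack, cast(const(ubyte)[]) "abc") ==
           cast(ubyte[]) "abc world".dup);
    assert(findSentinel(haystack, cast(const(ubyte)[]) "def").length == 0);
    writeln("ok");
}
```

The write/restore pair is the price for dropping one branch per inner-loop iteration; as the numbers above show, that trade-off can pay off.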
May 28 2016
parent reply qznc <qznc web.de> writes:
I played around with the benchmark. Some more numbers:

$ make ldc
ldmd2 -O -release -inline -noboundscheck *.d -ofbenchmark.ldc
./benchmark.ldc
E: wrong result with Chris find
E: wrong result with Chris find
E: wrong result with Chris find
     std find: 153 ±25    +66 (1934)  -15 (7860)
  manual find: 122 ±28    +80 (1812)  -17 (8134)
    qznc find: 125 ±16    +18 (4644)  -15 (5126)
   Chris find: 148 ±29    +75 (1976)  -18 (7915)
  Andrei find: 114 ±23   +100 (1191)  -13 (8770)
  (avg slowdown vs fastest; absolute deviation)
$ make dmd
dmd -O -release -inline -noboundscheck *.d -ofbenchmark.dmd
./benchmark.dmd
E: wrong result with Chris find
E: wrong result with Chris find
E: wrong result with Chris find
     std find: 160 ±27    +44 (3162)  -20 (6709)
  manual find: 148 ±28    +54 (2701)  -19 (7178)
    qznc find: 102 ±3     +27 ( 766)   -1 (9136)
   Chris find: 175 ±30    +55 (2796)  -21 (7106)
  Andrei find: 122 ±22    +46 (2554)  -14 (7351)
  (avg slowdown vs fastest; absolute deviation)

The additional numbers on the right are the ±MAD separated by 
above or below the mean. For example Andrei find with ldc:

   Andrei find: 114 ±23   +100 (1191)  -13 (8770)

The mean slowdown is 114, which means 14% slower than the fastest 
one. The mean absolute deviation (MAD) is 23. More precisely, the 
mean deviation above the mean slowdown of 103 is 100 and -13 
below the mean slowdown. 1191 of the 10000 runs were above the 
mean slowdown and 8770 below. The 39 missing runs are equal to 
the mean slowdown.

What bothers me is that changing the alphabet changes the numbers 
so much. Currently, if you restrict the alphabets for haystack 
and needle, the numbers change. The benchmark already does a 
random subset on each run, but there is definitely a bias.
May 29 2016
next sibling parent reply Chris <wendlec tcd.ie> writes:
On Sunday, 29 May 2016 at 12:22:23 UTC, qznc wrote:
 I played around with the benchmark. Some more numbers:

 $ make ldc
 ldmd2 -O -release -inline -noboundscheck *.d -ofbenchmark.ldc
 ./benchmark.ldc
 E: wrong result with Chris find
 E: wrong result with Chris find
 E: wrong result with Chris find
     std find: 153 ±25    +66 (1934)  -15 (7860)
  manual find: 122 ±28    +80 (1812)  -17 (8134)
    qznc find: 125 ±16    +18 (4644)  -15 (5126)
   Chris find: 148 ±29    +75 (1976)  -18 (7915)
  Andrei find: 114 ±23   +100 (1191)  -13 (8770)
  (avg slowdown vs fastest; absolute deviation)
 $ make dmd
 dmd -O -release -inline -noboundscheck *.d -ofbenchmark.dmd
 ./benchmark.dmd
 E: wrong result with Chris find
 E: wrong result with Chris find
 E: wrong result with Chris find
     std find: 160 ±27    +44 (3162)  -20 (6709)
  manual find: 148 ±28    +54 (2701)  -19 (7178)
    qznc find: 102 ±3     +27 ( 766)   -1 (9136)
   Chris find: 175 ±30    +55 (2796)  -21 (7106)
  Andrei find: 122 ±22    +46 (2554)  -14 (7351)
  (avg slowdown vs fastest; absolute deviation)

 The additional numbers on the right are the ±MAD separated by 
 above or below the mean. For example Andrei find with ldc:

   Andrei find: 114 ±23   +100 (1191)  -13 (8770)

 The mean slowdown is 114, which means 14% slower than the 
 fastest one. The mean absolute deviation (MAD) is 23. More 
 precisely, the mean deviation above the mean slowdown of 103 is 
 100 and -13 below the mean slowdown. 1191 of the 10000 runs 
 were above the mean slowdown and 8770 below. The 39 missing 
 runs are equal to the mean slowdown.

 What bothers me is that changing the alphabet changes the 
 numbers so much. Currently, if you restrict the alphabets for 
 haystack and needle, the numbers change. The benchmark already 
 does a random subset on each run, but there is definitely a 
 bias.
You can avoid "E: wrong result with Chris find" by using

outer:
for (auto i = 0; i < haystack.length-needle.length; i++) {
    if (haystack[i] != needle[0])
        continue;
    for (size_t j = i+1, k = 1; k < needle.length; ++j, ++k)
        if (haystack[j] != needle[k])
            continue outer;
    return haystack[i..$];
}

It's a tad faster. I'm planning to test on more varied data and see where a bias may occur.
May 29 2016
parent Chris <wendlec tcd.ie> writes:
It's weird, on my machine, my find function is consistently 
faster than `manual find`

LDC:

     std find: 137 ±22    +33 (3580)  -17 (6251)
  manual find: 137 ±32    +53 (3105)  -23 (6834)
    qznc find: 106 ±8     +17 (2608)   -5 (7195)
   Chris find: 132 ±32    +58 (2803)  -23 (7131)
  Andrei find: 120 ±26    +54 (2466)  -17 (7454)

===

     std find: 136 ±22    +33 (3503)  -17 (6304)
  manual find: 137 ±33    +55 (3000)  -23 (6920)
    qznc find: 106 ±8     +17 (2535)   -5 (7287)
   Chris find: 132 ±33    +59 (2803)  -23 (7137)
  Andrei find: 119 ±25    +51 (2569)  -16 (7374)


It's negligible, but I wonder is it the extra initialization:

`size_t end = haystack.length - needle.length + 1;`

and

`size_t i = 0` is defined before we know if it's a legitimate 
search. But that shouldn't matter, or should it?

Or does it depend on the machine?

DMD slows all searches down considerably (except for `qznc find`, 
well done!:)

DMD:

     std find: 161 ±36    +46 (4015)  -31 (5847)
  manual find: 147 ±40    +68 (3001)  -28 (6917)
    qznc find: 106 ±8     +15 (2623)   -5 (7172)
   Chris find: 201 ±41    +73 (2830)  -28 (7120)
  Andrei find: 150 ±40    +61 (3347)  -30 (6575)


PS I've corrected the end of the search string to

`outer:
     for (auto i = 0; i < haystack.length-(needle.length-1); i++)`

otherwise it will not match at the end of a string.
May 30 2016
prev sibling parent reply Jon Degenhardt <jond noreply.com> writes:
On Sunday, 29 May 2016 at 12:22:23 UTC, qznc wrote:
 I played around with the benchmark. Some more numbers:

 The mean slowdown is 114, which means 14% slower than the 
 fastest one. The mean absolute deviation (MAD) is 23. More 
 precisely, the mean deviation above the mean slowdown of 103 is 
 100 and -13 below the mean slowdown. 1191 of the 10000 runs 
 were above the mean slowdown and 8770 below. The 39 missing 
 runs are equal to the mean slowdown.
A minor thing - you might consider also calculating the median and the median version of MAD (median of absolute deviations from the median). The reason is that benchmarks often have outliers in the max-time dimension, and the median will do a better job of reducing the effect of those outliers than the mean. Your benchmark code could publish both forms. --Jon
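For what it's worth, the robust statistics Jon describes are only a few lines. This is a hypothetical sketch (not part of the benchmark repo); `median` and `medianAD` are illustrative names:

```d
import std.algorithm : map, sort;
import std.array : array;
import std.math : abs;

// Median of a non-empty sample (works on a copy, input stays untouched).
double median(const(double)[] xs)
{
    auto s = xs.dup;
    s.sort();
    immutable n = s.length;
    return n % 2 == 1 ? s[n / 2] : (s[n / 2 - 1] + s[n / 2]) / 2.0;
}

// Median of absolute deviations from the median: a single extreme
// max-time outlier barely moves it, unlike the mean-based MAD.
double medianAD(const(double)[] xs)
{
    immutable m = median(xs);
    return median(xs.map!(x => abs(x - m)).array);
}

void main()
{
    double[] runs = [101, 103, 104, 102, 350]; // one extreme outlier
    assert(median(runs) == 103);               // outlier barely matters
    assert(medianAD(runs) == 1);               // deviations: 2,0,1,1,247
}
```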
May 29 2016
parent reply qznc <qznc web.de> writes:
On Sunday, 29 May 2016 at 18:50:40 UTC, Jon Degenhardt wrote:
 A minor thing - you might consider also calculating the median 
 and median version of MAD (median of absolute deviations from 
 the median). The reason is that benchmarks often have outliers 
 in the max time dimension, median will do a job reducing the 
 effect of those outliers than mean. Your benchmark code could 
 publish both forms.
I don't think that would help. This is not a normal distribution, so mean and median don't match anyways. What would you learn from the median?

When looking at the assembly I don't like the single-byte loads. Since string (ubyte[] here) is of extraordinary importance, it should be worthwhile to use word loads [0] instead. Really fancy would be SSE. However, this is more advanced and requires further introspection.

[0] http://forum.dlang.org/post/aahhopdcrengakvoemrn forum.dlang.org
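The word-load idea can be illustrated with a hypothetical equality helper (illustrative name `equalWordwise`; assumes a platform that tolerates unaligned reads, like x86). Byte-exact semantics, just wider loads:

```d
// Compare two byte slices one size_t word at a time, falling back to
// single bytes for the tail. Same result as a[] == b[], but with
// potentially fewer load instructions on long inputs.
bool equalWordwise(const(ubyte)[] a, const(ubyte)[] b)
{
    if (a.length != b.length) return false;
    size_t i = 0;
    for (; i + size_t.sizeof <= a.length; i += size_t.sizeof)
    {
        // unaligned word loads; fine on x86, may trap on stricter ISAs
        immutable wa = *cast(const(size_t)*) (a.ptr + i);
        immutable wb = *cast(const(size_t)*) (b.ptr + i);
        if (wa != wb) return false;
    }
    for (; i < a.length; ++i)       // leftover tail, byte by byte
        if (a[i] != b[i]) return false;
    return true;
}

void main()
{
    auto x = cast(const(ubyte)[]) "hello, wordwise world";
    auto y = cast(const(ubyte)[]) "hello, wordwise world";
    auto z = cast(const(ubyte)[]) "hello, wordwise w0rld";
    assert(equalWordwise(x, y));
    assert(!equalWordwise(x, z));
}
```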
May 29 2016
next sibling parent Jon Degenhardt <jond noreply.com> writes:
On Sunday, 29 May 2016 at 21:07:21 UTC, qznc wrote:
 On Sunday, 29 May 2016 at 18:50:40 UTC, Jon Degenhardt wrote:
 A minor thing - you might consider also calculating the median 
 and median version of MAD (median of absolute deviations from 
 the median). The reason is that benchmarks often have outliers 
 in the max time dimension, median will do a job reducing the 
 effect of those outliers than mean. Your benchmark code could 
 publish both forms.
I don't think that would help. This is not a normal distribution, so mean and median don't match anyways. What would you learn from the median?
Oh, I see. The benchmark varies the data on each run and aggregates, is that right? Sorry, I missed that.
May 29 2016
prev sibling parent reply qznc <qznc web.de> writes:
On Sunday, 29 May 2016 at 21:07:21 UTC, qznc wrote:
 When looking at the assembly I don't like the single-byte 
 loads. Since string (ubyte[] here) is of extraordinary 
 importance, it should be worthwhile to use word loads [0] 
 instead. Really fancy would be SSE.
So far, the results look disappointing. Andrei find does not get faster with wordwise matching:

./benchmark.ldc
     std find: 133 ±25    +38 (3384)  -19 (6486)
  manual find: 140 ±37    +64 (2953)  -25 (6962)
    qznc find: 114 ±17    +33 (2610)  -11 (7262)
   Chris find: 146 ±39    +66 (3045)  -28 (6873)
  Andrei find: 126 ±29    +54 (2720)  -19 (7189)
Wordwise find: 130 ±30    +53 (2934)  -21 (6980)

Interesting side-note: On my laptop Andrei find is faster than qznc find (for LDC), but on my desktop it reverses (see above). Both are Intel i7. Need to find a simpler processor. Maybe wordwise is faster there. Alternatively, find is purely memory bound and the L1 cache makes every difference disappear.

Also, note how std find is faster than manual find! Finding a reliable benchmark is hard. :/
May 30 2016
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 05/30/2016 05:31 AM, qznc wrote:
 On Sunday, 29 May 2016 at 21:07:21 UTC, qznc wrote:
 When looking at the assembly I don't like the single-byte loads. Since
 string (ubyte[] here) is of extraordinary importance, it should be
 worthwhile to use word loads [0] instead. Really fancy would be SSE.
 So far, the results look disappointing. Andrei find does not get faster with wordwise matching:

 ./benchmark.ldc
      std find: 133 ±25    +38 (3384)  -19 (6486)
   manual find: 140 ±37    +64 (2953)  -25 (6962)
     qznc find: 114 ±17    +33 (2610)  -11 (7262)
    Chris find: 146 ±39    +66 (3045)  -28 (6873)
   Andrei find: 126 ±29    +54 (2720)  -19 (7189)
 Wordwise find: 130 ±30    +53 (2934)  -21 (6980)

 Interesting side-note: On my laptop Andrei find is faster than qznc find (for LDC), but on my desktop it reverses (see above). Both are Intel i7. Need to find a simpler processor. Maybe wordwise is faster there. Alternatively, find is purely memory bound and the L1 cache makes every difference disappear.

 Also, note how std find is faster than manual find! Finding a reliable benchmark is hard. :/
Please throw this hat into the ring as well: it should improve average search on large vocabulary dramatically.

https://dpaste.dzfl.pl/dc8dc6e1eb53

It uses a BM-inspired trick - once the last characters matched, if the match subsequently fails it needn't start from the next character in the haystack. The "skip" is computed lazily and in a separate function so as to keep the loop tight. All in all a routine worth a look. I wanted to write this for a long time. -- Andrei
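The trick described above boils down to something like the following condensed sketch (illustrative names, not the exact dpaste code): a Horspool-style skip keyed on the needle's last element, computed only after the first full-match failure and kept in a separate cold function so the hot loop stays tight.

```d
import std.stdio;

// How far may we shift after the last element matched but the full match
// failed? The distance from the end to the previous occurrence of the
// last element inside the needle, or needle.length if there is none.
size_t computeSkip(T)(const(T)[] needle)
{
    immutable lastIndex = needle.length - 1;
    foreach_reverse (i; 0 .. lastIndex)
        if (needle[i] == needle[lastIndex])
            return lastIndex - i;
    return needle.length;
}

inout(T)[] findLazySkip(T)(inout(T)[] haystack, const(T)[] needle)
{
    if (needle.length == 0) return haystack;
    immutable lastIndex = needle.length - 1;
    immutable last = needle[lastIndex];
    size_t skip = 0;                       // 0 means "not computed yet"
    size_t j = lastIndex;
    while (j < haystack.length)
    {
        if (haystack[j] != last) { ++j; continue; }
        immutable k = j - lastIndex;       // last elements match
        size_t i = 0;
        for (; i < lastIndex && needle[i] == haystack[k + i]; ++i) {}
        if (i == lastIndex) return haystack[k .. $];
        if (skip == 0) skip = computeSkip(needle); // lazy, cold path
        j += skip;
    }
    return haystack[$ .. $];
}

void main()
{
    assert(findLazySkip("hello abc world", "abc") == "abc world");
    assert(findLazySkip("xadaad", "aad") == "aad"); // exercises the skip
    assert(findLazySkip("hello", "xyz").length == 0);
    writeln("ok");
}
```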
May 30 2016
next sibling parent reply Chris <wendlec tcd.ie> writes:
On Monday, 30 May 2016 at 18:20:39 UTC, Andrei Alexandrescu wrote:
 Please throw this hat into the ring as well: it should improve 
 average search on large vocabulary dramatically.

 https://dpaste.dzfl.pl/dc8dc6e1eb53

 It uses a BM-inspired trick - once the last characters matched, 
 if the match subsequently fails it needn't start from the next 
 character in the haystack. The "skip" is computed lazily and in 
 a separate function so as to keep the loop tight. All in all a 
 routine worth a look. I wanted to write this for a long time. 
 -- Andrei
Cool. So far, my experience with separate functions has been that they slow down the loop. But this might do the trick.
May 30 2016
parent reply Chris <wendlec tcd.ie> writes:
On Monday, 30 May 2016 at 18:57:15 UTC, Chris wrote:
 On Monday, 30 May 2016 at 18:20:39 UTC, Andrei Alexandrescu 
 wrote:
 Please throw this hat into the ring as well: it should improve 
 average search on large vocabulary dramatically.

 https://dpaste.dzfl.pl/dc8dc6e1eb53

 It uses a BM-inspired trick - once the last characters 
 matched, if the match subsequently fails it needn't start from 
 the next character in the haystack. The "skip" is computed 
 lazily and in a separate function so as to keep the loop 
 tight. All in all a routine worth a look. I wanted to write 
 this for a long time. -- Andrei
Cool. So far, my experience with separate functions has been that they slow down the loop. But this might do the trick.
Congratulations!

DMD:

./benchmark.dmd
        std: 178 ±31    +36 (4475)  -29 (5344)
     manual: 167 ±46    +82 (2883)  -32 (7054)
       qznc: 114 ±7     +18 (1990)   -5 (7288)
      Chris: 228 ±49    +82 (3050)  -35 (6780)
     Andrei: 103 ±5     +47 ( 658)   -2 (9295)
  (avg slowdown vs fastest; absolute deviation)
CPU ID: GenuineIntel Intel(R) Core(TM) i5-4200U CPU 1.60GHz

LDC:

        std: 184 ±19    +28 (3420)  -14 (6523)
     manual: 205 ±59   +120 (2463)  -39 (7443)
       qznc: 151 ±25    +44 (2983)  -17 (6911)
      Chris: 194 ±57    +78 (3702)  -46 (6251)
     Andrei: 101 ±2     +42 ( 435)   -1 (9542)
  (avg slowdown vs fastest; absolute deviation)
CPU ID: GenuineIntel Intel(R) Core(TM) i5-4200U CPU 1.60GHz
May 30 2016
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 05/30/2016 04:00 PM, Chris wrote:
 ./benchmark.dmd
         std: 178 ±31    +36 (4475)  -29 (5344)
      manual: 167 ±46    +82 (2883)  -32 (7054)
        qznc: 114 ±7     +18 (1990)   -5 (7288)
       Chris: 228 ±49    +82 (3050)  -35 (6780)
      Andrei: 103 ±5     +47 ( 658)   -2 (9295)
 (avg slowdown vs fastest; absolute deviation)
 CPU ID: GenuineIntel Intel(R) Core(TM) i5-4200U CPU   1.60GHz

 LDC:

         std: 184 ±19    +28 (3420)  -14 (6523)
      manual: 205 ±59   +120 (2463)  -39 (7443)
        qznc: 151 ±25    +44 (2983)  -17 (6911)
       Chris: 194 ±57    +78 (3702)  -46 (6251)
      Andrei: 101 ±2     +42 ( 435)   -1 (9542)
   (avg slowdown vs fastest; absolute deviation)
 CPU ID: GenuineIntel Intel(R) Core(TM) i5-4200U CPU   1.60GHz
Thanks for looking into this! qznc, could you please look into updating https://github.com/dlang/phobos/pull/4362 with this result? One possible tweak is to see whether replacing the function call with inline code helps. Thanks! -- Andrei
May 30 2016
parent reply qznc <qznc web.de> writes:
On Monday, 30 May 2016 at 20:08:46 UTC, Andrei Alexandrescu wrote:
 On 05/30/2016 04:00 PM, Chris wrote:
 ./benchmark.dmd
         std: 178 ±31    +36 (4475)  -29 (5344)
      manual: 167 ±46    +82 (2883)  -32 (7054)
        qznc: 114 ±7     +18 (1990)   -5 (7288)
       Chris: 228 ±49    +82 (3050)  -35 (6780)
      Andrei: 103 ±5     +47 ( 658)   -2 (9295)
 (avg slowdown vs fastest; absolute deviation)
 CPU ID: GenuineIntel Intel(R) Core(TM) i5-4200U CPU   1.60GHz

 LDC:

         std: 184 ±19    +28 (3420)  -14 (6523)
      manual: 205 ±59   +120 (2463)  -39 (7443)
        qznc: 151 ±25    +44 (2983)  -17 (6911)
       Chris: 194 ±57    +78 (3702)  -46 (6251)
      Andrei: 101 ±2     +42 ( 435)   -1 (9542)
   (avg slowdown vs fastest; absolute deviation)
 CPU ID: GenuineIntel Intel(R) Core(TM) i5-4200U CPU   1.60GHz
Thanks for looking into this! qznc, could you please look into updating https://github.com/dlang/phobos/pull/4362 with this result? One possible tweak is see whether replacing the function call with inline code helps. Thanks! -- Andrei
What function call should be replaced with inline code?

Overall, I'm very confused and displeased by the benchmark right now. Initially, the problem was that std find was slower than a naive manual find. Now look at the LDC numbers: manual find is slower than std find. By tweaking the benchmark, the numbers can be shifted around arbitrarily. Even without tweaking, it is very processor dependent. Here are the numbers from my laptop:

./benchmark.ldc
        std: 149 ±24    +29 (4275)  -21 (5604)
     manual: 129 ±37    +90 (2055)  -23 (7917)
       qznc: 131 ±22    +26 (4381)  -20 (5488)
      Chris: 154 ±35    +81 (2192)  -22 (7672)
     Andrei: 118 ±28    +82 (1762)  -16 (8194)
  (avg slowdown vs fastest; absolute deviation)
CPU ID: GenuineIntel Intel(R) Core(TM) i7 CPU M 620 2.67GHz

./benchmark.dmd
        std: 143 ±22    +26 (4219)  -19 (5602)
     manual: 158 ±31    +47 (3436)  -24 (6492)
       qznc: 101 ±2     +12 (1349)   -1 (7851)
      Chris: 216 ±38    +56 (3393)  -28 (6541)
     Andrei: 135 ±31    +48 (3326)  -24 (6589)
  (avg slowdown vs fastest; absolute deviation)
CPU ID: GenuineIntel Intel(R) Core(TM) i7 CPU M 620 2.67GHz

And desktop:

./benchmark.ldc
        std: 129 ±24    +40 (3121)  -17 (6767)
     manual: 129 ±31    +59 (2668)  -21 (7244)
       qznc: 112 ±14    +30 (2542)   -9 (7312)
      Chris: 134 ±33    +58 (2835)  -23 (7068)
     Andrei: 123 ±27    +53 (2679)  -18 (7225)
  (avg slowdown vs fastest; absolute deviation)
CPU ID: GenuineIntel Intel(R) Core(TM) i7-3770 CPU 3.40GHz

./benchmark.dmd
        std: 157 ±31    +44 (3693)  -24 (6234)
     manual: 143 ±41    +73 (2854)  -28 (7091)
       qznc: 116 ±21    +35 (3092)  -14 (6844)
      Chris: 181 ±50    +74 (3452)  -38 (6510)
     Andrei: 136 ±38    +64 (2975)  -27 (6953)
  (avg slowdown vs fastest; absolute deviation)
CPU ID: GenuineIntel Intel(R) Core(TM) i7-3770 CPU 3.40GHz
May 30 2016
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 05/30/2016 06:16 PM, qznc wrote:
 What function call should be replaced with inline code?
The call to computeSkip.
 Overall, I'm very confused and displeased by the benchmark right now.
I agree it's difficult to characterize the behavior of substring search with one number. There are many dimensions of variation. (But there's no reason for an emotional response.) A few possible baselines come to mind:

* Search a long string for a one-character string, match and fail.

* Take an English text string. Search for a substring consisting of its last portion (e.g. 5% of the length).

* Take an English text string. Search for a substring consisting of a fraction of the text (e.g. 3%) with additional characters prepended. Repeat for appended.


Andrei
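The first baseline is easy to script. A hypothetical harness (made-up sizes, current Phobos module paths) might look like:

```d
import std.algorithm.searching : find;
import std.array : replicate;
import std.datetime.stopwatch : AutoStart, StopWatch;
import std.stdio : writefln;

void main()
{
    // Baseline: long haystack, one-character needle, match and fail.
    // "b" matches only at the very end; "c" never matches.
    auto haystack = "a".replicate(10_000_000) ~ "b";
    foreach (needle; ["b", "c"])
    {
        auto sw = StopWatch(AutoStart.yes);
        auto hit = find(haystack, needle);
        writefln("needle %s: %s ms, found=%s",
                 needle, sw.peek.total!"msecs", hit.length != 0);
    }
}
```

The English-text baselines follow the same shape, slicing the needle out of the haystack's tail instead of constructing it.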
May 30 2016
parent reply qznc <qznc web.de> writes:
On Tuesday, 31 May 2016 at 01:55:16 UTC, Andrei Alexandrescu 
wrote:
 I agree it's difficult to characterize the behavior of 
 substring search with one number. There are many dimensions of 
 variation. (But there's no reason for an emotional response.) A 
 few possible baselines come to mind:

 * Search a long string for a one-character string, match and 
 fail.
There is a special version of find for searching a single char in a string. Using a one-letter needle string is more like a user mistake than something to optimize for.
 * Take an English text string. Search for a substring 
 consisting of its last portion (e.g. 5% of the length).
How long should the English text be? A Tweet? A book? A gigabyte of log files? English text means basically ASCII and no Unicode?
 * Take an English text string. Search for a substring 
 consisting of a fraction of the text (e.g. 3%) with additional 
 characters prepended. Repeat for appended.
Why the prepend/append? To force a mismatch?
May 31 2016
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 5/31/16 1:54 PM, qznc wrote:
 Using a one-letter needle string is more like a user mistake than
 something to optimize for.
How is splitting on ',' a user mistake? -- Andrei
May 31 2016
parent reply qznc <qznc web.de> writes:
On Tuesday, 31 May 2016 at 18:31:47 UTC, Andrei Alexandrescu 
wrote:
 On 5/31/16 1:54 PM, qznc wrote:
 Using a one-letter needle string is more like a user mistake 
 than
 something to optimize for.
How is splitting on ',' a user mistake? -- Andrei
The mistake is to split on "," instead of ','. The slow splitter at the start of this thread searches for "\r\n".

Your lazy-skip algorithm looks great!

./benchmark.ldc
Search in Alice in Wonderland
       std: 168 ±6     +29 ( 107)   -3 ( 893)
    manual: 112 ±3     +28 (  81)   -1 ( 856)
      qznc: 149 ±4     +30 (  79)   -1 ( 898)
     Chris: 142 ±5     +28 ( 102)   -2 ( 898)
    Andrei: 153 ±3     +28 (  79)   -1 ( 919)
   Andrei2: 101 ±2     +34 (  31)   -1 ( 969)
Search random haystack with random needle
       std: 172 ±19    +61 ( 161)  -11 ( 825)
    manual: 161 ±47    +73 ( 333)  -35 ( 666)
      qznc: 163 ±21    +33 ( 314)  -15 ( 661)
     Chris: 190 ±47    +80 ( 302)  -33 ( 693)
    Andrei: 140 ±37    +60 ( 315)  -27 ( 676)
   Andrei2: 103 ±6     +57 (  64)   -2 ( 935)
  (avg slowdown vs fastest; absolute deviation)
CPU ID: GenuineIntel Intel(R) Core(TM) i7 CPU M 620 2.67GHz

The Alice benchmark searches Alice in Wonderland for "find a pleasure in all their simple joys" and finds it in the last sentence.
May 31 2016
parent reply Chris <wendlec tcd.ie> writes:
On Tuesday, 31 May 2016 at 18:56:14 UTC, qznc wrote:

 The mistake is to split on "," instead of ','. The slow 
 splitter at the start of this thread searches for "\r\n".

 Your lazy-skip algorithm looks great!

 ./benchmark.ldc
 Search in Alice in Wonderland
        std: 168 ±6     +29 ( 107)   -3 ( 893)
     manual: 112 ±3     +28 (  81)   -1 ( 856)
       qznc: 149 ±4     +30 (  79)   -1 ( 898)
      Chris: 142 ±5     +28 ( 102)   -2 ( 898)
     Andrei: 153 ±3     +28 (  79)   -1 ( 919)
    Andrei2: 101 ±2     +34 (  31)   -1 ( 969)
 Search random haystack with random needle
        std: 172 ±19    +61 ( 161)  -11 ( 825)
     manual: 161 ±47    +73 ( 333)  -35 ( 666)
       qznc: 163 ±21    +33 ( 314)  -15 ( 661)
      Chris: 190 ±47    +80 ( 302)  -33 ( 693)
     Andrei: 140 ±37    +60 ( 315)  -27 ( 676)
    Andrei2: 103 ±6     +57 (  64)   -2 ( 935)
  (avg slowdown vs fastest; absolute deviation)
 CPU ID: GenuineIntel Intel(R) Core(TM) i7 CPU       M 620    
 2.67GHz

 The Alice benchmark searches Alice in Wonderland for "find a 
 pleasure in all their simple joys" and
 finds it in the last sentence.
Would it speed things up even more if we put the function `computeSkip` into the loop, or is this done automatically by the compiler?
May 31 2016
parent reply qznc <qznc web.de> writes:
On Tuesday, 31 May 2016 at 19:29:25 UTC, Chris wrote:
 Would it speed things up even more, if we put the function 
 `computeSkip` into the loop or is this done automatically by 
 the compiler?
LDC inlines it. DMD does not.

More numbers:

./benchmark.ldc
Search in Alice in Wonderland
       std: 147 ±1
    manual: 100 ±0
      qznc: 121 ±1
     Chris: 103 ±1
    Andrei: 144 ±1
   Andrei2: 105 ±1
Search in random short strings
       std: 125 ±15
    manual: 117 ±10
      qznc: 104 ±6
     Chris: 123 ±14
    Andrei: 104 ±5
   Andrei2: 103 ±4
Mismatch in random long strings
       std: 140 ±22
    manual: 164 ±64
      qznc: 115 ±13
     Chris: 167 ±63
    Andrei: 161 ±68
   Andrei2: 106 ±9
Search random haystack with random needle
       std: 138 ±27
    manual: 135 ±33
      qznc: 116 ±16
     Chris: 141 ±36
    Andrei: 131 ±33
   Andrei2: 109 ±12
  (avg slowdown vs fastest; absolute deviation)
CPU ID: GenuineIntel Intel(R) Core(TM) i7-3770 CPU 3.40GHz

Random short strings has haystacks of 10 to 300 characters and needles of 2 to 10. Basically, no time for initialisation. Random long strings has haystacks of size 1000, 10_000, 100_000, or 1_000_000 and needles 50 to 500. It inserts a character into a random index of the needle to force a mismatch. The last one is the configuration as before.

Overall, Andrei2 (the lazy compute skip) is really impressive. :)
May 31 2016
parent reply Chris <wendlec tcd.ie> writes:
On Tuesday, 31 May 2016 at 19:59:50 UTC, qznc wrote:
 On Tuesday, 31 May 2016 at 19:29:25 UTC, Chris wrote:
 Would it speed things up even more, if we put the function 
 `computeSkip` into the loop or is this done automatically by 
 the compiler?
 LDC inlines it. DMD does not.

 More numbers:

 ./benchmark.ldc
 Search in Alice in Wonderland
        std: 147 ±1
     manual: 100 ±0
       qznc: 121 ±1
      Chris: 103 ±1
     Andrei: 144 ±1
    Andrei2: 105 ±1
 Search in random short strings
        std: 125 ±15
     manual: 117 ±10
       qznc: 104 ±6
      Chris: 123 ±14
     Andrei: 104 ±5
    Andrei2: 103 ±4
 Mismatch in random long strings
        std: 140 ±22
     manual: 164 ±64
       qznc: 115 ±13
      Chris: 167 ±63
     Andrei: 161 ±68
    Andrei2: 106 ±9
 Search random haystack with random needle
        std: 138 ±27
     manual: 135 ±33
       qznc: 116 ±16
      Chris: 141 ±36
     Andrei: 131 ±33
    Andrei2: 109 ±12
   (avg slowdown vs fastest; absolute deviation)
 CPU ID: GenuineIntel Intel(R) Core(TM) i7-3770 CPU 3.40GHz

 Random short strings has haystacks of 10 to 300 characters and needles of 2 to 10. Basically, no time for initialisation. Random long strings has haystacks of size 1000, 10_000, 100_000, or 1_000_000 and needles 50 to 500. It inserts a character into a random index of the needle to force a mismatch. The last one is the configuration as before.

 Overall, Andrei2 (the lazy compute skip) is really impressive. :)
Yep. It's really impressive. I actually thought that dmd didn't place `computeSkip` inside of the loop. This raises the question of whether it should be moved into the loop, in case we use it in Phobos, to make sure that it is as fast as possible even with dmd. However, I like it the way it is now. The nice thing about `Andrei2` is that it performs consistently well.
May 31 2016
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 05/31/2016 04:18 PM, Chris wrote:
 I actually thought that dmd didn't place
 `computeSkip` inside of the loop. This begs the question if it should be
 moved to the loop, in case we use it in Phobos, to make sure that it is
 as fast as possible even with dmd. However, I like it the way it is now.
You may want to then try https://dpaste.dzfl.pl/392710b765a9, which generates inline code on all compilers. -- Andrei
May 31 2016
next sibling parent reply David Nadlinger <code klickverbot.at> writes:
On Tuesday, 31 May 2016 at 21:29:34 UTC, Andrei Alexandrescu 
wrote:
 You may want to then try https://dpaste.dzfl.pl/392710b765a9, 
 which generates inline code on all compilers. -- Andrei
In general, it might be beneficial to use ldc.intrinsics.llvm_expect (cf. __builtin_expect) for things like that in order to optimise basic block placement. (We should probably have a compiler-independent API for that in core.*, by the way.) In this case, the skip computation path is probably small enough for that not to matter much, though.

Another thing that might be interesting to do (now that you have a "clever" baseline) is to start counting cycles and make some comparisons against manual asm/intrinsics implementations. For short(-ish) needles, PCMPESTRI is probably the most promising candidate, although I suspect that for \r\n scanning in long strings in particular, an optimised AVX2 solution might have higher throughput.

Of course these observations are not very valuable without backing them up with measurements, but it seems like before optimising a generic search algorithm for short-needle test cases, having one's eyes on a solid SIMD baseline would be a prudent thing to do.

— David
May 31 2016
parent qznc <qznc web.de> writes:
On Tuesday, 31 May 2016 at 22:50:37 UTC, David Nadlinger wrote:
 Another thing that might be interesting to do (now that you 
 have a "clever" baseline) is to start counting cycles and make 
 some comparisons against manual asm/intrinsics implementations. 
 For short(-ish) needles, PCMPESTRI is probably the most 
 promising candidate, although I suspect that for \r\n scanning 
 in long strings in particular, an optimised AVX2 solution might 
 have higher throughput.

 Of course these observations are not very valuable without 
 backing them up with measurements, but it seems like before 
 optimising a generic search algorithm for short-needle test 
 cases, having one's eyes on a solid SIMD baseline would be a 
 prudent thing to do.
The current algorithm is generic with respect to the predicate. Once we use SSE/AVX tricks, it is a special case for equality. As a next step in Phobos, this is probably worth it for strings. We could probably steal some well-optimized strcmp from somewhere.
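One way to picture "stealing a well-optimized strcmp": delegate the scan to libc's `memchr` and the candidate check to `memcmp`, both typically SIMD-accelerated. A hedged C sketch (illustrative, not Phobos code), valid only for the plain-equality special case discussed above:

```c
#include <stddef.h>
#include <string.h>

/* Substring search built on libc's optimised memchr/memcmp. A custom
   predicate could not take this path; it assumes plain byte equality. */
static const char *find_sub(const char *hay, size_t hlen,
                            const char *nee, size_t nlen)
{
    if (nlen == 0) return hay;
    if (nlen > hlen) return NULL;
    const char *p = hay;
    const char *end = hay + (hlen - nlen) + 1;  /* one past last valid start */
    while ((p = memchr(p, nee[0], (size_t)(end - p))) != NULL) {
        if (memcmp(p, nee, nlen) == 0)
            return p;
        ++p;
    }
    return NULL;
}
```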
Jun 02 2016
prev sibling next sibling parent Chris <wendlec tcd.ie> writes:
On Tuesday, 31 May 2016 at 21:29:34 UTC, Andrei Alexandrescu 
wrote:
 You may want to then try https://dpaste.dzfl.pl/392710b765a9, 
 which generates inline code on all compilers. -- Andrei
I've added it as `Andrei3`. It runs faster with dmd, but it's not as good with ldc. Seems like ldc performs some extra optimization when moving `computeSkip` into the loop, something it doesn't bother to do when it's already there.

dmd -O -release -inline -noboundscheck *.d -ofbenchmark.dmd
./benchmark.dmd
Search in Alice in Wonderland
        std: 190 ±4
     manual: 115 ±4
       qznc: 106 ±2
      Chris: 160 ±4
     Andrei: 159 ±4
    Andrei2: 108 ±2
    Andrei3: 100 ±0
Search in random short strings
        std: 222 ±27
     manual: 193 ±49
       qznc: 120 ±12
      Chris: 224 ±57
     Andrei: 114 ±9
    Andrei2: 106 ±5
    Andrei3: 102 ±3
Mismatch in random long strings
        std: 186 ±28
     manual: 206 ±85
       qznc: 118 ±14
      Chris: 265 ±104
     Andrei: 194 ±85
    Andrei2: 116 ±18
    Andrei3: 102 ±4
Search random haystack with random needle
        std: 189 ±38
     manual: 171 ±45
       qznc: 118 ±11
      Chris: 225 ±52
     Andrei: 149 ±32
    Andrei2: 110 ±7
    Andrei3: 103 ±6
  (avg slowdown vs fastest; absolute deviation)
CPU ID: GenuineIntel Intel(R) Core(TM) i7-4770 CPU   3.40GHz

./benchmark.ldc
Search in Alice in Wonderland
        std: 170 ±1
     manual: 143 ±2
       qznc: 133 ±1
      Chris: 144 ±1
     Andrei: 196 ±10
    Andrei2: 100 ±0
    Andrei3: 111 ±1
Search in random short strings
        std: 223 ±30
     manual: 211 ±51
       qznc: 124 ±12
      Chris: 223 ±61
     Andrei: 115 ±10
    Andrei2: 102 ±3
    Andrei3: 105 ±4
Mismatch in random long strings
        std: 181 ±17
     manual: 253 ±109
       qznc: 146 ±24
      Chris: 248 ±108
     Andrei: 228 ±96
    Andrei2: 101 ±2
    Andrei3: 108 ±6
Search random haystack with random needle
        std: 187 ±22
     manual: 208 ±60
       qznc: 152 ±27
      Chris: 202 ±58
     Andrei: 173 ±35
    Andrei2: 102 ±4
    Andrei3: 110 ±8
  (avg slowdown vs fastest; absolute deviation)
CPU ID: GenuineIntel Intel(R) Core(TM) i7-4770 CPU   3.40GHz
Jun 01 2016
prev sibling parent qznc <qznc web.de> writes:
On Tuesday, 31 May 2016 at 21:29:34 UTC, Andrei Alexandrescu 
wrote:
 On 05/31/2016 04:18 PM, Chris wrote:
 I actually thought that dmd didn't place
 `computeSkip` inside of the loop. This begs the question if it 
 should be
 moved to the loop, in case we use it in Phobos, to make sure 
 that it is
 as fast as possible even with dmd. However, I like it the way 
 it is now.
You may want to then try https://dpaste.dzfl.pl/392710b765a9, which generates inline code on all compilers. -- Andrei
I ported this to Phobos style (A2Phobos in the benchmark) and changed the pull request [0]. In practically all tests, this algorithm wipes the floor with the competition.

[0] https://github.com/dlang/phobos/pull/4362
Jun 02 2016
prev sibling next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 5/31/16 1:54 PM, qznc wrote:
 On Tuesday, 31 May 2016 at 01:55:16 UTC, Andrei Alexandrescu wrote:
 I agree it's difficult to characterize the behavior of substring
 search with one number. There are many dimensions of variation. (But
 there's no reason for an emotional response.) A few possible baselines
 come to mind:

 * Search a long string for a one-character string, match and fail.
There is a special version of find for searching a single char in a string. Using a one-letter needle string is more like a user mistake than something to optimize for.
Oh, I see what you mean - it's a string consisting only of one character, not one character 'x'. I agree, but I'm just saying it's a sound baseline.
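As a concrete reference point for this baseline: in C the one-character case is exactly what `memchr` covers, and most libc implementations vectorise it. A small illustrative wrapper (the name `index_of` is hypothetical):

```c
#include <stddef.h>
#include <string.h>

/* Single-character search via libc's memchr; returns the index of the
   first occurrence, or -1 if the character is absent. */
static long index_of(const char *s, size_t len, char c)
{
    const char *p = memchr(s, c, len);
    return p ? (long)(p - s) : -1;
}
```

A specialised single-character find in D would be competing against this.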
 * Take an English text string. Search for a substring consisting of
 its last portion (e.g. 5% of the length).
How long should the english text be? A Tweet? A book? A Gigabyte of log files?
I'm thinking of a few exponentially increasing sizes. e.g. 100, 1000, 10_000, 100_000, 1_000_000.
 English text means basically ASCII and no Unicode?
My understanding (and correct me if I'm wrong) is that we're discussing search with the default predicate, ignoring equivalent grapheme representations etc. So there's no need to mind encoding at all, which means the text could be in any language. Foreign languages would further increase the vocabulary, but that's the only effect.
 * Take an English text string. Search for a substring consisting of a
 fraction of the text (e.g. 3%) with additional characters prepended.
 Repeat for appended.
Why the prepend/append? To force a mismatch?
Yah, and it's a good test for strings that have some equal portion yet they are different, which seems to me puts a greater strain on the algorithms. It is arguably a likely situation in the real world. Thanks, Andrei
May 31 2016
prev sibling parent reply Patrick Schluter <Patrick.Schluter bbox.fr> writes:
On Tuesday, 31 May 2016 at 17:54:34 UTC, qznc wrote:
 There is a special version of find for searching a single char 
 in a string. Using a one-letter needle string is more like a 
 user mistake than something to optimize for.
At compile time you may not know the length of the needle, like in the grep command.
Jun 01 2016
parent reply Seb <seb wilzba.ch> writes:
On Wednesday, 1 June 2016 at 12:14:07 UTC, Patrick Schluter wrote:
 On Tuesday, 31 May 2016 at 17:54:34 UTC, qznc wrote:
 There is a special version of find for searching a single char 
 in a string. Using a one-letter needle string is more like a 
 user mistake than something to optimize for.
At compile time you may not know the length of the needle, like in the grep command.
1) how about a CTFE find?

s.find!(needle, pred)

If we can initialize boyer-moore or KMP at compile time - it should be the fastest! At ndslice such functions are called bifacial: http://dlang.org/phobos/std_experimental_ndslice_iteration.html

Imho a lot more in std.algorithm should be able to profit from facts known at compile-time.

2) Even for a runtime one-letter needle I am pretty sure it's worth to specialize
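The table such a CTFE find would precompute can be sketched in C (illustrative names; a Boyer-Moore-Horspool variant, not the exact Phobos algorithm). `build_skip` is the part that a literal needle would allow evaluating at compile time:

```c
#include <stddef.h>
#include <string.h>

/* Boyer-Moore-Horspool bad-character table: for each byte value, how far
   the search window may shift when that byte lines up with the needle's
   last position. */
static void build_skip(const char *nee, size_t nlen, size_t skip[256])
{
    for (size_t i = 0; i < 256; ++i)
        skip[i] = nlen;                      /* byte absent from needle */
    for (size_t i = 0; i + 1 < nlen; ++i)
        skip[(unsigned char)nee[i]] = nlen - 1 - i;
}

static const char *bmh_find(const char *hay, size_t hlen,
                            const char *nee, size_t nlen)
{
    if (nlen == 0) return hay;
    if (nlen > hlen) return NULL;
    size_t skip[256];
    build_skip(nee, nlen, skip);             /* runtime here; CTFE in D */
    size_t i = 0;
    while (i + nlen <= hlen) {
        if (memcmp(hay + i, nee, nlen) == 0)
            return hay + i;
        i += skip[(unsigned char)hay[i + nlen - 1]];
    }
    return NULL;
}
```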
Jun 01 2016
next sibling parent Chris <wendlec tcd.ie> writes:
On Wednesday, 1 June 2016 at 12:41:19 UTC, Seb wrote:
 On Wednesday, 1 June 2016 at 12:14:07 UTC, Patrick Schluter 
 wrote:
 On Tuesday, 31 May 2016 at 17:54:34 UTC, qznc wrote:
 There is a special version of find for searching a single 
 char in a string. Using a one-letter needle string is more 
 like a user mistake than something to optimize for.
At compile time you may not know the length of the needle, like in the grep command.
 1) how about a CTFE find?

 s.find!(needle, pred)

 If we can initialize boyer-moore or KMP at compile time - it should be the fastest! At ndslice such functions are called bifacial: http://dlang.org/phobos/std_experimental_ndslice_iteration.html

 Imho a lot more in std.algorithm should be able to profit from facts known at compile-time.

 2) Even for a runtime one-letter needle I am pretty sure it's worth to specialize
That makes sense. I think that std.algorithm needs to be revised and optimized for speed. We cannot afford to have suboptimal algorithms in there.
Jun 01 2016
prev sibling next sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 06/01/2016 08:41 AM, Seb wrote:
 On Wednesday, 1 June 2016 at 12:14:07 UTC, Patrick Schluter wrote:
 On Tuesday, 31 May 2016 at 17:54:34 UTC, qznc wrote:
 There is a special version of find for searching a single char in a
 string. Using a one-letter needle string is more like a user mistake
 than something to optimize for.
At compile time you may not know the length of the needle, like in the grep command.
1) how about a CTFE find? s.find!(needle, pred) If we can initialize boyer-moore or KMP at compile time - it should be the fastest!
That would require partial evaluation, which sadly we don't have in D. -- Andrei
Jun 01 2016
prev sibling parent reply Patrick Schluter <Patrick.Schluter bbox.fr> writes:
On Wednesday, 1 June 2016 at 12:41:19 UTC, Seb wrote:
 On Wednesday, 1 June 2016 at 12:14:07 UTC, Patrick Schluter 
 wrote:
 On Tuesday, 31 May 2016 at 17:54:34 UTC, qznc wrote:
 There is a special version of find for searching a single 
 char in a string. Using a one-letter needle string is more 
 like a user mistake than something to optimize for.
At compile time you may not know the length of the needle, like in the grep command.
1) how about a CTFE find?
What I wanted to say is that in real life, the input of the search routine is very often run-time user-provided data. Think of the search box in browsers and apps, command-line parameters à la grep, etc. The "string" search function should not catastrophically break down on special input, like 1-character strings, unusual Unicode, or needle==haystack. I only said this so we don't lose focus on what is actually being achieved here.

It's often a danger of micro-optimization and unit-test-focused development that a lot of time and effort are spent on improvements that are completely irrelevant when checked against what is really needed in the real world (i.e. we're well into bike-shed territory here).
Jun 01 2016
parent reply Chris <wendlec tcd.ie> writes:
On Wednesday, 1 June 2016 at 13:47:10 UTC, Patrick Schluter wrote:

 What I wanted to say, is that in real life, the input of the 
 search routine is very often run-time user provided data. Think 
 of search box in browsers and apps, command line parameter à la 
 grep, etc. The "string" search function should not 
 catastrophically break down on special input, like 1 character 
 strings, unusual Unicode or when needle==haystack. I only said 
 this to not lose the focus on what is being tried to be 
 achieved here.
 It's often a danger of micro-optimization and unit test focused 
 development, that a lot of time and effort are spent on 
 improvements that are completely irrelevant when checked 
 against what is really needed in real world (i.e. we're full in 
 bike shed territory here).
That's true. We should try to optimize for the worst case, i.e. random input. However, it is often the case that the needle is known at compile time, e.g. things like

if (input.canFind("foo"))

There might be ways to optimize those cases at compile time.
Jun 01 2016
parent Chris <wendlec tcd.ie> writes:
On Wednesday, 1 June 2016 at 14:32:38 UTC, Chris wrote:

For fun, I've written a search function that alternates between 
the beginning and the end of a string. I'm basically using 
Andrei's search algorithm for this. It is, of course only useful, 
if `needle` is expected to be either towards the beginning or the 
end of a string. If `needle` occurs multiple times, it matches 
where ever it encounters it first.

T[] findUnique(T)(T[] haystack, T[] needle)
{
    static size_t computeSkip(T[] needle)
    {
        size_t result = 1;
        while (result < needle.length && needle[$ - 1 - result] != needle[$ - 1])
        {
            ++result;
        }
        return result;
    }

    if (needle.length == 0) return haystack;
    immutable size_t len = haystack.length;
    immutable size_t lastIndex = needle.length - 1;
    auto last = needle[lastIndex];
    auto first = needle[0];
    size_t skip = 0;
    for (size_t i = lastIndex, j = len - lastIndex - 1; i < len; ++i, --j)
    {
        if (last != haystack[i])
            goto back;
        for (size_t k = 0; ; ++k)
        {
            if (k == lastIndex) return haystack[i - lastIndex .. $];
            if (haystack[i - lastIndex + k] != needle[k]) break;
        }
    back:
        if (i > j) break;
        if (first != haystack[j])
            continue;
        for (size_t k = lastIndex; ; --k)
        {
            if (k == 0) return haystack[j .. $];
            if (haystack[j + k] != needle[k]) break;
        }
        if (skip == 0) skip = computeSkip(needle);
        i += skip;
        j -= skip - 1;
    }
    return haystack[$ .. $];
}

unittest
{
    string s1 = "Hello abc world!";
    assert(findUnique(s1, "abc") == "abc world!");
    assert(findUnique(s1, "def") == "");
    string s2 = "Ola, mundo!";
    assert(findUnique(s2, "do!") == "do!");
    string s3 = "abc";
    assert(findUnique(s3, "c") == "c");
    assert(findUnique(s3, "z") == "");
    string s4 = "aaabaaa";
    assert(findUnique(s4, "b") == "baaa");
}

I added it to the benchmarking suite (I had to turn off the error 
warnings, see description above). Naturally, it shines in the 
first test, as the needle is at the end of the text. The whole 
thing is to be taken with a grain of salt ;)

Search in Alice in Wonderland
        std: 535 ±8
     manual: 454 ±10
       qznc: 421 ±5
      Chris: 460 ±10
     Andrei: 643 ±19
    Andrei2: 314 ±6
    Andrei3: 352 ±5
     Biloop: 100 ±0
Search in random short strings
        std: 219 ±33
     manual: 194 ±47
       qznc: 123 ±11
      Chris: 219 ±69
     Andrei: 113 ±10
    Andrei2: 104 ±4
    Andrei3: 103 ±3
     Biloop: 195 ±46
Mismatch in random long strings
        std: 193 ±30
     manual: 249 ±103
       qznc: 156 ±30
      Chris: 260 ±108
     Andrei: 238 ±97
    Andrei2: 107 ±11
    Andrei3: 115 ±12
     Biloop: 141 ±44
Search random haystack with random needle
        std: 189 ±25
     manual: 193 ±57
       qznc: 152 ±26
      Chris: 203 ±59
     Andrei: 176 ±35
    Andrei2: 103 ±6
    Andrei3: 110 ±8
     Biloop: 141 ±28
  (avg slowdown vs fastest; absolute deviation)
CPU ID: GenuineIntel Intel(R) Core(TM) i7-4770 CPU   3.40GHz
Jun 02 2016
prev sibling parent reply Chris <wendlec tcd.ie> writes:
On Monday, 30 May 2016 at 22:16:27 UTC, qznc wrote:
 And Desktop:

 ./benchmark.ldc
        std: 129 ±24    +40 (3121)  -17 (6767)
     manual: 129 ±31    +59 (2668)  -21 (7244)
       qznc: 112 ±14    +30 (2542)   -9 (7312)
      Chris: 134 ±33    +58 (2835)  -23 (7068)
     Andrei: 123 ±27    +53 (2679)  -18 (7225)
  (avg slowdown vs fastest; absolute deviation)
 CPU ID: GenuineIntel Intel(R) Core(TM) i7-3770 CPU   3.40GHz

 ./benchmark.dmd
        std: 157 ±31    +44 (3693)  -24 (6234)
     manual: 143 ±41    +73 (2854)  -28 (7091)
       qznc: 116 ±21    +35 (3092)  -14 (6844)
      Chris: 181 ±50    +74 (3452)  -38 (6510)
     Andrei: 136 ±38    +64 (2975)  -27 (6953)
  (avg slowdown vs fastest; absolute deviation)
 CPU ID: GenuineIntel Intel(R) Core(TM) i7-3770 CPU   3.40GHz
Benchmark from desktop machine:

DMD:
        std: 164 ±34    +43 (4054)  -29 (5793)
     manual: 150 ±41    +72 (2889)  -29 (7032)
       qznc: 103 ±6     +42 ( 878)   -2 (9090)
      Chris: 205 ±43    +81 (2708)  -29 (7232)
     Andrei: 136 ±31    +53 (2948)  -22 (6977)
  (avg slowdown vs fastest; absolute deviation)
CPU ID: GenuineIntel Intel(R) Core(TM) i7-4770 CPU   3.40GHz

===

LDC:
        std: 138 ±23    +35 (3457)  -18 (6360)
     manual: 145 ±33    +45 (3748)  -27 (6181)
       qznc: 105 ±7     +17 (2267)   -4 (7534)
      Chris: 135 ±33    +56 (3061)  -23 (6882)
     Andrei: 121 ±27    +52 (2630)  -18 (7301)
  (avg slowdown vs fastest; absolute deviation)
CPU ID: GenuineIntel Intel(R) Core(TM) i7-4770 CPU   3.40GHz

On my laptop Andrei's was the fastest (see post above).
May 31 2016
parent reply Wyatt <wyatt.epp gmail.com> writes:
On Tuesday, 31 May 2016 at 08:43:59 UTC, Chris wrote:
 On Monday, 30 May 2016 at 22:16:27 UTC, qznc wrote:
 And Desktop:

 ./benchmark.ldc
        std: 129 ±24    +40 (3121)  -17 (6767)
     manual: 129 ±31    +59 (2668)  -21 (7244)
       qznc: 112 ±14    +30 (2542)   -9 (7312)
      Chris: 134 ±33    +58 (2835)  -23 (7068)
     Andrei: 123 ±27    +53 (2679)  -18 (7225)
  (avg slowdown vs fastest; absolute deviation)
 CPU ID: GenuineIntel Intel(R) Core(TM) i7-3770 CPU   3.40GHz

 ./benchmark.dmd
        std: 157 ±31    +44 (3693)  -24 (6234)
     manual: 143 ±41    +73 (2854)  -28 (7091)
       qznc: 116 ±21    +35 (3092)  -14 (6844)
      Chris: 181 ±50    +74 (3452)  -38 (6510)
     Andrei: 136 ±38    +64 (2975)  -27 (6953)
  (avg slowdown vs fastest; absolute deviation)
 CPU ID: GenuineIntel Intel(R) Core(TM) i7-3770 CPU   3.40GHz
 Benchmark from desktop machine:

 DMD:
        std: 164 ±34    +43 (4054)  -29 (5793)
     manual: 150 ±41    +72 (2889)  -29 (7032)
       qznc: 103 ±6     +42 ( 878)   -2 (9090)
      Chris: 205 ±43    +81 (2708)  -29 (7232)
     Andrei: 136 ±31    +53 (2948)  -22 (6977)
   (avg slowdown vs fastest; absolute deviation)
 CPU ID: GenuineIntel Intel(R) Core(TM) i7-4770 CPU   3.40GHz

 ===

 LDC:
        std: 138 ±23    +35 (3457)  -18 (6360)
     manual: 145 ±33    +45 (3748)  -27 (6181)
       qznc: 105 ±7     +17 (2267)   -4 (7534)
      Chris: 135 ±33    +56 (3061)  -23 (6882)
     Andrei: 121 ±27    +52 (2630)  -18 (7301)
   (avg slowdown vs fastest; absolute deviation)
 CPU ID: GenuineIntel Intel(R) Core(TM) i7-4770 CPU   3.40GHz

 On my laptop Andrei's was the fastest (see post above).
Comparing the chips involved, could it be cache related?

3M cache; Andrei wins both:
http://ark.intel.com/products/75459/Intel-Core-i5-4200U-Processor-3M-Cache-up-to-2_60-GHz

4M cache; qznc wins DMD (and is faster than the LDC's best? What?); Andrei wins LDC:
http://ark.intel.com/products/43560/Intel-Core-i7-620M-Processor-4M-Cache-2_66-GHz

8M cache; qznc wins both:
http://ark.intel.com/products/65719/Intel-Core-i7-3770-Processor-8M-Cache-up-to-3_90-GHz
http://ark.intel.com/products/75122/Intel-Core-i7-4770-Processor-8M-Cache-up-to-3_90-GHz

Normally, I'd expect the 4200U to be similar to the desktop parts. Unless... Say, for the laptops (and I guess the desktops too, but it's more important in a mobile), did you verify the CPU frequency scaling wasn't interfering?

-Wyatt
May 31 2016
parent qznc <qznc web.de> writes:
On Tuesday, 31 May 2016 at 17:31:29 UTC, Wyatt wrote:
 qznc wins DMD (and is faster than the LDC's best?
Careful! These are not absolute numbers, but relative slowdowns. You cannot compare the numbers between LDC and DMD.
May 31 2016
prev sibling parent Henrique bucher <henry vitorian.com> writes:
On Monday, 30 May 2016 at 18:20:39 UTC, Andrei Alexandrescu wrote:
 On 05/30/2016 05:31 AM, qznc wrote:
 On Sunday, 29 May 2016 at 21:07:21 UTC, qznc wrote:
 worthwhile to use word loads [0] instead. Really fancy would 
 be SSE.
I wrote a splitter in SSE4.2 some time ago as a contribution to a github project. Perhaps it is related.

https://github.com/HFTrader/string-splitting/blob/master/splithb2.cpp

Cheers,
Jul 11 2016
prev sibling next sibling parent reply Patrick Schluter <Patrick.Schluter bbox.fr> writes:
On Friday, 27 May 2016 at 14:41:29 UTC, Chris wrote:
 On Friday, 27 May 2016 at 14:06:09 UTC, Chris wrote:
 This outperforms both `manual_find` and the improved `std find`

 string findStringS_Manual(string haystack, string needle)
 {
 	
 	if (needle.length > haystack.length)
 		return haystack[$..$];
 	outer:
 	for (auto i = 0; i < haystack.length; i++)
 	{
 		if (haystack[i] != needle[0])
 			continue;
 		for (size_t j = i+1, k = 1; k < needle.length; ++j, ++k)
 			if (haystack[j] != needle[k])
 			continue outer;
 		return haystack[i..$];
 	}
 	return haystack[$..$];
 }
Little bug fix:

for (size_t j = (i+1 < haystack.length) ? i+1 : i, k = 1; k < needle.length; ++j, ++k)
The upper bound of the outer loop should be haystack.length - needle.length. There's no point in searching beyond that, as the needle would no longer fit, and it avoids the buffer overrun your code has when the first character of the needle is found past that bound. You can also remove the ?: test in the j loop, as that case is then handled automatically.
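The bound Patrick describes, in a minimal C sketch (a hypothetical naive search, not Chris's code): with the outer loop stopping at haystack.length - needle.length, the inner comparison can never read past the haystack, so no per-iteration limit check or ?: guard is needed.

```c
#include <stddef.h>

/* Naive search with the tightened outer bound: once fewer than nlen
   characters remain, the needle cannot fit, so scanning stops early
   and the inner loop needs no haystack-limit check. */
static const char *naive_find(const char *hay, size_t hlen,
                              const char *nee, size_t nlen)
{
    if (nlen == 0) return hay;
    if (nlen > hlen) return NULL;
    for (size_t i = 0; i <= hlen - nlen; ++i) {
        size_t k = 0;
        while (k < nlen && hay[i + k] == nee[k])
            ++k;                              /* safe: i + k < hlen */
        if (k == nlen)
            return hay + i;
    }
    return NULL;
}
```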
May 27 2016
parent reply Chris <wendlec tcd.ie> writes:
Oops, I've been found out. :-) Thanks. You're right of course, 
and I've already noticed that bug as it fails on not found. I got 
the bounds wrong.
May 27 2016
parent Patrick Schluter <Patrick.Schluter bbox.fr> writes:
On Friday, 27 May 2016 at 19:17:44 UTC, Chris wrote:
 Oops, I've been found out. :-) Thanks. You're right of course, 
 and I've already noticed that bug as it fails on not found. I 
 got the bounds wrong.
I had the same "bug" when I wrote my search function on the project at work. I also found out that using the simplest loop is generally the best way to go. Processors nowadays have loop caches, fast level-1 caches, etc.; all the fancy search algorithms like Boyer-Moore can only overcome their memory and initialisation overhead for really huge haystacks.

Here's the C code of my memmem() function that replaced the Boyer-Moore search I had before. This implementation outran the BM search for sizes up to several hundred megabytes on a SPARC64 VII+ machine. On x86-64 I'm sure it would be even worse.

void *memmem(const void *haystack, size_t haystack_len,
             const void *needle, size_t needle_len)
{
    if (!haystack_len || !needle_len || needle_len > haystack_len)
        return NULL;

    uint8_t u8 = *((uint8_t *) needle);
    const uint8_t *p = (const uint8_t *) haystack;

    for (size_t i = haystack_len - needle_len + 1; i; --i) {
        if (*p == u8 && !memcmp(p, needle, needle_len))
            return (void *) p;
        else
            p++;
    }
    return NULL;
}

(Sorry for posting C code, as I'm only beginning to learn D.)
May 27 2016
prev sibling parent reply qznc <qznc web.de> writes:
On Friday, 27 May 2016 at 14:41:29 UTC, Chris wrote:
 The improved `std find` comes very close to the manual function 
 above now, sometimes it's even faster or at least as fast.

 std find    	took     12573666
 manual find 	took      7418454
 my std find 	took      6903854 <===
 findStringS 	took      7166720
 findStringS_  	took      6579529 <===
I just updated my benchmark. It now checks lots of different scenarios instead of you having to specify one. You can only set the number of iterations now.

It generates a random scenario (haystack length, needle length, alphabet, etc.) and runs it once with all algorithms. Instead of recording the absolute runtime, it records the slowdown compared to the fastest of the three. Average those and also compute the absolute deviation (the second number). Example:

std find:    153 ±33
manual find: 112 ±19
my std find: 119 ±16

So manual find is on average 12% slower than the fastest, ±19%. I would interpret that as no significant difference between manual and improved find.
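The two numbers per line can be reproduced with a couple of lines of arithmetic. An illustrative C sketch (hypothetical helpers, not the actual harness):

```c
#include <math.h>

/* Relative slowdown: the fastest run maps to 100, a run taking 1.4x as
   long maps to 140 -- the first number in the benchmark output. */
static double slowdown(double t, double fastest)
{
    return 100.0 * t / fastest;
}

/* Mean absolute deviation of three scores -- the "+-" second number. */
static double mad3(double a, double b, double c)
{
    double mean = (a + b + c) / 3.0;
    return (fabs(a - mean) + fabs(b - mean) + fabs(c - mean)) / 3.0;
}
```

So a line like "manual find: 112 ±19" means an average scaled score of 112 with a mean absolute deviation of 19 across scenarios.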
May 27 2016
parent reply qznc <qznc web.de> writes:
Now added Chris' algo to the benchmark:

std find:    155 ±33
manual find: 112 ±19
qznc find:   122 ±18  <--- improved find
Chris find:  133 ±20  <--- findStringS_
May 27 2016
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 05/27/2016 04:39 PM, qznc wrote:
 Now added Chris' algo to the benchmark:

 std find:    155 ±33
 manual find: 112 ±19
 qznc find:   122 ±18  <--- improved find
 Chris find:  133 ±20  <--- findStringS_
Does any of these feature simultaneously:

* check the last element first
* consequently stop early when the length of the haystack falls below the length of the needle
* consequently no check against the haystack limit is necessary in the inner loop
* use only one induction variable

?

Andrei
May 27 2016
parent qznc <qznc web.de> writes:
On Friday, 27 May 2016 at 20:50:52 UTC, Andrei Alexandrescu wrote:
 On 05/27/2016 04:39 PM, qznc wrote:
 Now added Chris' algo to the benchmark:

 std find:    155 ±33
 manual find: 112 ±19
 qznc find:   122 ±18  <--- improved find
 Chris find:  133 ±20  <--- findStringS_
Does any of these feature simultaneously: * check the last element first * consequently stop early when the length of the haystack falls below the length of the needle * consequently no check against the haystack limit is necessary in the inner loop
Std find (currently in Phobos) and qznc find (see pull request) fulfill those three. Manual and Chris find are simpler.
 * use only one induction variable
None of those. I made such a variant of qznc find and it gives me the same numbers.
May 27 2016
prev sibling parent Chris <wendlec tcd.ie> writes:
On Friday, 27 May 2016 at 20:39:03 UTC, qznc wrote:
 Now added Chris' algo to the benchmark:

 std find:    155 ±33
 manual find: 112 ±19
 qznc find:   122 ±18  <--- improved find
 Chris find:  133 ±20  <--- findStringS_
Did you fix the bugs in my algorithm? S
May 28 2016
prev sibling next sibling parent qznc <qznc web.de> writes:
On Friday, 27 May 2016 at 14:06:09 UTC, Chris wrote:
 I think it's clear that `std find` is veeeery slow. If anyone 
 wants to test my function, please do so. I don't want to spread 
 wrong data, but this is what I get at the moment (ldc latest 
 version).
If you want to see find (current or my improved version) get really slow, then reduce the alphabet for the haystack (e.g. --alphabet=ab). ;)
May 27 2016
prev sibling parent reply David Nadlinger <code klickverbot.at> writes:
On Friday, 27 May 2016 at 14:06:09 UTC, Chris wrote:
 I have to correct myself. A manual loop is faster, I couldn't 
 believe it either, so I benchmarked the same function with 
 `foreach` and `for`. Turns out that `for` is _way_ faster.
Just about the only reason I could think of for this to happen is if the compiler fails to inline the range primitives from std.array. Otherwise, the loops should be pretty much equivalent to LLVM's optimiser.

This is so fundamental to D's strategic promise that we can't afford to get it wrong.

 — David
May 27 2016
parent Chris <wendlec tcd.ie> writes:
On Friday, 27 May 2016 at 21:31:48 UTC, David Nadlinger wrote:
 Just about the only reason I could think of for this to happen 
 is if the compiler fails to inline the range primitives from 
 std.array. Otherwise, the loops should be pretty much 
 equivalent to LLVM's optimiser.

 This is so fundamental to D's strategic promise that we can't 
 afford to get it wrong.

  — David
I agree, and it has been said on this forum that tight manual loops are faster than the range paradigm (I think bearophile was one of those who said so, if I remember correctly). Maybe we should benchmark this too, just in case.
May 28 2016
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/1/13 8:05 PM, cvk012c wrote:
 In my latest version of D script I didn't use splitter at all. I used
 string specific indexOf function. Still result is very bad. For text
 based protocols, such as SIP, performance of string manipulating
 functions is very important. Unfortunately, looks like it is not D
 strongest point at this time.
That conclusion would be hasty if not missing the whole point. You essentially measured the speed of one loop in various translators implementing various languages. Java code doing straight computation is on a par with C speed, no two ways about that. Python code using library primitives ain't no slouch either.

Performance tuning in these languages becomes more difficult in larger applications where data layout, allocation, and indirect function calls start to dominate. The claim of systems languages being fast materializes only when you need to optimize data layout and nonstandard, custom algorithms. At that point systems languages give you additional options, whereas with high-level languages optimization becomes a very difficult proposition.


Andrei
Mar 02 2013
next sibling parent reply Russel Winder <russel winder.org.uk> writes:
On Sat, 2013-03-02 at 10:33 -0500, Andrei Alexandrescu wrote:
[…]
 That conclusion would be hasty if not missing the whole point. You 
 essentially measured the speed of one loop in various translators 
 implementing various languages. Java code doing straight computation is 
 on a par with C speed, no two ways about that. Python code using library 
 primitives ain't no slouch either. Performance tuning in these languages 
 becomes more difficult in larger applications where data layout, 
 allocation, and indirect function calls start to dominate.
[…]

Interestingly, there isn't only one Python implementation. There is only one language but there is CPython, PyPy, Jython, IronPython, to mention but 4. On computationally intensive code, PyPy (Python execution environment in RPython) is generally 10 to 30 times faster than CPython (Python execution environment written in C).

C is a (reasonably) well known and used language thought to create fast code. RPython is Python but with some restrictions that is statically compiled. For writing interpreters, RPython spanks C. PyPy is not the only language using RPython to implement the interpreter. C's days in this game are seriously numbered.

-- 
Russel.
===========================================================================
Dr Russel Winder      t: +44 20 7585 2200   voip: sip:russel.winder ekiga.net
41 Buckmaster Road    m: +44 7770 465 077   xmpp: russel winder.org.uk
London SW11 1EN, UK   w: www.russel.org.uk  skype: russel_winder
Mar 02 2013
next sibling parent reply "John Colvin" <john.loughran.colvin gmail.com> writes:
On Saturday, 2 March 2013 at 15:43:57 UTC, Russel Winder wrote:
 On Sat, 2013-03-02 at 10:33 -0500, Andrei Alexandrescu wrote:
 […]
 That conclusion would be hasty if not missing the whole point. 
 You essentially measured the speed of one loop in various 
 translators implementing various languages. Java code doing 
 straight computation is on a par with C speed, no two ways 
 about that. Python code using library primitives ain't no 
 slouch either. Performance tuning in these languages becomes 
 more difficult in larger applications where data layout, 
 allocation, and indirect function calls start to dominate.
[…] Interestingly, there isn't only one Python implementation. There is only one language but there is CPython, PyPy, Jython, IronPython, to mention but 4. On computationally intensive code, PyPy (Python execution environment in RPython) is generally 10 to 30 times faster than CPython (Python execution environment written in C). C is a (reasonably) well known and used language thought to create fast code. RPython is Python but with some restrictions that is statically compiled. For writing interpreters, RPython spanks C. PyPy is not the only language using RPython to implement the interpreter. C's days in this game are seriously numbered.
I'm not sure that's entirely fair. PyPy is fast because it implements a JIT, not because it's written in RPython.
Mar 02 2013
parent Russel Winder <russel winder.org.uk> writes:
On Sat, 2013-03-02 at 16:55 +0100, John Colvin wrote:
[…]
 I'm not sure that's entirely fair. PyPy is fast because it 
 implements a JIT, not because it's written in RPython.

The second sentence is almost entirely true; the first I am not so sure about. Armin Rigo used to manage Psyco, which was an amendment to the CPython code base to include a JIT. He quit trying to keep that going, and so CPython doesn't have a JIT of any real importance. Unladen Swallow disappeared. The JIT framework is an integral part of RPython rather than being specific only to PyPy.

The point is that for I/O-bound systems all languages are more or less equivalent; for computationally intensive code, JITs are needed for virtual machine based systems or they will not compete.

-- 
Russel.
===========================================================================
Dr Russel Winder      t: +44 20 7585 2200   voip: sip:russel.winder ekiga.net
41 Buckmaster Road    m: +44 7770 465 077   xmpp: russel winder.org.uk
London SW11 1EN, UK   w: www.russel.org.uk  skype: russel_winder
Mar 02 2013
prev sibling next sibling parent reply "Peter Alexander" <peter.alexander.au gmail.com> writes:
On Saturday, 2 March 2013 at 15:43:57 UTC, Russel Winder wrote:
 C is a (reasonably) well known and used language thought to 
 create fast
 code. RPython is Python but with some restrictions that is 
 statically
 compiled.  For writing interpreters, RPython spanks C. PyPy is 
 not the
 only language using RPython to implement the interpreter. C's 
 days in
 this game are seriously numbered.
Source? (Googling "rpython interpreter speed" didn't show anything.) There are always claims that systems like RPython, or Haskell, or LuaJIT, or Java HotSpot are able to rival or beat C, yet all I ever see are small toy programs. Show me something big, something with hundreds of thousands of lines of code. If all this is true, where are the video games written in these languages that rival the performance/fidelity of AAA titles? I've never seen that. Ever.
Mar 02 2013
next sibling parent "John Colvin" <john.loughran.colvin gmail.com> writes:
On Saturday, 2 March 2013 at 16:06:57 UTC, Peter Alexander wrote:
 On Saturday, 2 March 2013 at 15:43:57 UTC, Russel Winder wrote:
 C is a (reasonably) well known and used language thought to 
 create fast
 code. RPython is Python but with some restrictions that is 
 statically
 compiled.  For writing interpreters, RPython spanks C. PyPy is 
 not the
 only language using RPython to implement the interpreter. C's 
 days in
 this game are seriously numbered.
Source? (Googling "rpython interpreter speed" didn't show anything) There's always claims that systems like RPython, or Haskell, or LuaJIT, or Java HotSpot are able to rival or beat C, yet all I ever see are small toy programs. Show me something big. Something with 100's of thousands of lines of code. If all this is true, where are the video games written in these languages that rival performance/fidelity of AAA titles? I've never seen that. Ever.
RPython has been used to write a very fast JIT (PyPy), which is not the same thing as an interpreter. The interpreter is what you fall back to when you can't use the JIT. I don't know how fast PyPy's interpreter is specifically, but it's not its main selling point.
Mar 02 2013
prev sibling parent reply Russel Winder <russel winder.org.uk> writes:
On Sat, 2013-03-02 at 17:06 +0100, Peter Alexander wrote:
[…]
 Source? (Googling "rpython interpreter speed" didn't show anything)
For me, my source is my microbenchmarks, which test just loops with computation, i.e. compute-intensive, data-parallel code. There is plenty of other stuff.
 There's always claims that systems like RPython, or Haskell, or=20
 LuaJIT, or Java HotSpot are able to rival or beat C, yet all I=20
 ever see are small toy programs. Show me something big. Something=20
 with 100's of thousands of lines of code.
100,000+ lines of code is a total irrelevance regarding system performance, as anyone who knows anything of profiling and benchmarking will tell you. If C is the only viable language, why has Java got the majority of the commercial market? C++ and Fortran have the vast majority of the computational world.
 If all this is true, where are the video games written in these=20
 languages that rival performance/fidelity of AAA titles? I've=20
 never seen that. Ever.
Where is D in this arena? It is either C or C++ in the high performance graphics arena. If there are no AAA titles written in D, why does this email list have anyone on it? Clearly I think your response is just trolling; I will stop feeding the troll.
--
Russel.
Mar 02 2013
parent "Peter Alexander" <peter.alexander.au gmail.com> writes:
On Saturday, 2 March 2013 at 19:19:09 UTC, Russel Winder wrote:
 100,000+ lines of code is a total irrelevance regarding system
 performance, as anyone who knows anything of profiling and 
 benchmarking
 will tell you.
I don't care about benchmarking, I care about writing large, fast systems. Small toy programs are irrelevant because they are not representative of real programs. Real programs are large (many thousands or millions of lines of code) and you don't have the luxury of being able to hand optimize every line of code.
 If C is the only viable language why has Java got the majority 
 of the
 commercial market. C++ and Fortran have the vast majority of the
 computational world.
Commercial market for what? For high performance applications C++ is still champion.
 If all this is true, where are the video games written in 
 these languages that rival performance/fidelity of AAA titles? 
 I've never seen that. Ever.
Where is D in this arena? It is either C or C++ in the high performance graphics arena. If there are no AAA titles written in D, why does this email list have anyone on it?
D is getting there. Remedy Games are sponsoring DConf, and I predict we'll start to see some high quality indie games over the next few years written in D. Up till now, I imagine language stability has been the main thing holding back D. Tomasz was working on a high quality renderer before he got fed up with D's instability (http://h3.gd/code/nucleus/).
Mar 02 2013
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 3/2/2013 7:43 AM, Russel Winder wrote:
 For writing interpreters, RPython spanks C.
What's RPython doing that makes it faster?
Mar 02 2013
parent reply Russel Winder <russel winder.org.uk> writes:
On Sat, 2013-03-02 at 11:30 -0800, Walter Bright wrote:
 On 3/2/2013 7:43 AM, Russel Winder wrote:
 For writing interpreters, RPython spanks C.
What's RPython doing that makes it faster?
Allowing PyPy to have a good JIT compiler.
--
Russel.
Mar 02 2013
next sibling parent "jerro" <a a.com> writes:
On Saturday, 2 March 2013 at 19:42:47 UTC, Russel Winder wrote:
 On Sat, 2013-03-02 at 11:30 -0800, Walter Bright wrote:
 On 3/2/2013 7:43 AM, Russel Winder wrote:
 For writing interpreters, RPython spanks C.
What's RPython doing that makes it faster?
Allowing PyPy to have a good JIT compiler.
And why do you think this wouldn't be possible if it was written in C? There are fast JITs written in C.
Mar 02 2013
prev sibling next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 3/2/2013 11:42 AM, Russel Winder wrote:
 On Sat, 2013-03-02 at 11:30 -0800, Walter Bright wrote:
 On 3/2/2013 7:43 AM, Russel Winder wrote:
 For writing interpreters, RPython spanks C.
What's RPython doing that makes it faster?
Allowing PyPy to have a good JIT compiler.
I don't understand. Does that JIT generate faster code than a C compiler would generate?
Mar 02 2013
next sibling parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Sat, Mar 02, 2013 at 12:02:08PM -0800, Walter Bright wrote:
 On 3/2/2013 11:42 AM, Russel Winder wrote:
On Sat, 2013-03-02 at 11:30 -0800, Walter Bright wrote:
On 3/2/2013 7:43 AM, Russel Winder wrote:
For writing interpreters, RPython spanks C.
What's RPython doing that makes it faster?
Allowing PyPy to have a good JIT compiler.
I don't understand. Does that JIT generate faster code than a C compiler would generate?
I don't know, but my wild guess is that a JIT optimizes the *right* hotspots based on real-time performance measurements, whereas a lot of C programmers are obsessed with optimizing what they *think* are the hotspots, but which really aren't. T -- Everybody talks about it, but nobody does anything about it! -- Mark Twain
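The "measure, don't guess" point is easy to demonstrate with Python's own profiler. A minimal sketch (the `hot`/`cold` function names and workload are invented for illustration): the report ranks functions by where time actually went, rather than where one guessed it would go.

```python
import cProfile
import io
import pstats

def hot():
    # The real hotspot: dominated by the arithmetic loop.
    return sum(i * i for i in range(10_000))

def cold():
    # Looks suspicious in the source, but is actually cheap.
    return len("REGISTER sip:example.com SIP/2.0".split())

def work():
    for _ in range(100):
        hot()
        cold()

# Profile the workload and render the top entries by cumulative time.
profiler = cProfile.Profile()
profiler.enable()
work()
profiler.disable()

buffer = io.StringIO()
pstats.Stats(profiler, stream=buffer).sort_stats("cumulative").print_stats(10)
print(buffer.getvalue())
```

The profile makes `hot` stand out immediately, which is exactly the feedback a JIT gets for free at run time.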
Mar 02 2013
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 3/2/2013 12:08 PM, H. S. Teoh wrote:
 On Sat, Mar 02, 2013 at 12:02:08PM -0800, Walter Bright wrote:
 On 3/2/2013 11:42 AM, Russel Winder wrote:
 On Sat, 2013-03-02 at 11:30 -0800, Walter Bright wrote:
 On 3/2/2013 7:43 AM, Russel Winder wrote:
 For writing interpreters, RPython spanks C.
What's RPython doing that makes it faster?
Allowing PyPy to have a good JIT compiler.
I don't understand. Does that JIT generate faster code than a C compiler would generate?
I don't know, but my wild guess is that a JIT optimizes the *right* hotspots based on real-time performance measurements, whereas a lot of C programmers are obsessed with optimizing what they *think* are the hotspots, but which really aren't.
I meant what the C *compiler* generates.
Mar 02 2013
parent reply Russel Winder <russel winder.org.uk> writes:
On Sat, 2013-03-02 at 12:52 -0800, Walter Bright wrote:
 On 3/2/2013 12:08 PM, H. S. Teoh wrote:
 On Sat, Mar 02, 2013 at 12:02:08PM -0800, Walter Bright wrote:
 On 3/2/2013 11:42 AM, Russel Winder wrote:
 On Sat, 2013-03-02 at 11:30 -0800, Walter Bright wrote:
 On 3/2/2013 7:43 AM, Russel Winder wrote:
 For writing interpreters, RPython spanks C.
What's RPython doing that makes it faster?
Allowing PyPy to have a good JIT compiler.
I don't understand. Does that JIT generate faster code than a C compiler would generate?
 I don't know, but my wild guess is that a JIT optimizes the *right*
 hotspots based on real-time performance measurements, whereas a lot of C
 programmers are obsessed with optimizing what they *think* are the
 hotspots, but which really aren't.
I meant what the C *compiler* generates.
Yes, because the C/C++/D/etc. compilers are attempting to predict the control flow of the program in execution and optimize all cases for all possibilities. JITs are just focussing on the runtime bottlenecks with the actual data as being used. This allows for more focussed code generation in the actual context. I would suspect that in many cases the generated code is effectively the same, but JITs can often generate unexpected and faster code because they have more data to optimize with and less optimization to do. To say more would require actual comparative data, and I suspect no-one on this list will do that.
--
Russel.
Mar 03 2013
next sibling parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
03-Mar-2013 12:03, Russel Winder wrote:
 On Sat, 2013-03-02 at 12:52 -0800, Walter Bright wrote:
 On 3/2/2013 12:08 PM, H. S. Teoh wrote:
 On Sat, Mar 02, 2013 at 12:02:08PM -0800, Walter Bright wrote:
 On 3/2/2013 11:42 AM, Russel Winder wrote:
 On Sat, 2013-03-02 at 11:30 -0800, Walter Bright wrote:
 On 3/2/2013 7:43 AM, Russel Winder wrote:
 For writing interpreters, RPython spanks C.
What's RPython doing that makes it faster?
Allowing PyPy to have a good JIT compiler.
I don't understand. Does that JIT generate faster code than a C compiler would generate?
I don't know, but my wild guess is that a JIT optimizes the *right* hotspots based on real-time performance measurements, whereas a lot of C programmers are obsessed with optimizing what they *think* are the hotspots, but which really aren't.
I meant what the C *compiler* generates.
Yes because the C/C++/D/etc. compilers are attempting to predict the control flow of the program in execution and optimize all cases for all possibilities. JITs are just focussing on the runtime bottlenecks with the actual data as being used. This allows for more focussed code generation in the actual context. I would suspect that in many cases the generated code is effectively the same but JITs can often generate unexpected and faster code because they have more data to optimize with and less optimization to do.
The only thing missing here is that a JIT typically has much less time to do any of this.
 To say more would require actual comparative data and I suspect no-one
 on this list will do that.
Indeed unlikely. -- Dmitry Olshansky
Mar 03 2013
prev sibling next sibling parent "deadalnix" <deadalnix gmail.com> writes:
On Sunday, 3 March 2013 at 08:03:53 UTC, Russel Winder wrote:
 Yes because the C/C++/D/etc. compilers are attempting to 
 predict the
 control flow of the program in execution and optimize all cases 
 for all
 possibilities. JIT's are just focussing on the runtime 
 bottlenecks with
 the actual data as being used. This allows for more focussed 
 code
 generation in the actual context. I would suspect that in many 
 cases the
 generated code is effectively the same but JITs can often do 
 unexpected
 and faster codes because they have more data to optimize with 
 and less
 optimization to do.
That is the theory, but in practice it doesn't work that well: you have to instrument the code to get the measurements needed to optimize according to runtime data. Plus, you can't simply monkey-patch the generated code (it is unspecified how x86 CPUs load instructions, so it isn't guaranteed that the CPU won't see garbage). That cost, plus the cost of the optimization itself, makes the whole thing worthwhile only if the win from optimizing is high.
 To say more would require actual comparative data and I suspect 
 no-one
 on this list will do that.
It isn't worth it. As explained above, you can conclude that this is highly dependent on the code you manipulate and the runtime data thrown at it. Note that I discussed with people from LLVM how this can be done in statically compiled code. In fact it can, but it is rather complicated and not worth automating for now.
Mar 03 2013
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 3/3/2013 12:03 AM, Russel Winder wrote:
 Yes because the C/C++/D/etc. compilers are attempting to predict the
 control flow of the program in execution and optimize all cases for all
 possibilities. JIT's are just focussing on the runtime bottlenecks with
 the actual data as being used. This allows for more focussed code
 generation in the actual context. I would suspect that in many cases the
 generated code is effectively the same but JITs can often do unexpected
 and faster codes because they have more data to optimize with and less
 optimization to do.
I've heard this was the advantage of JITs for the last 15 years. I haven't seen any data to justify it, and I have trouble thinking of any significant improvements to code gen that could be made with runtime data. That is for JITting statically typed languages. The situation is a bit better for dynamic ones, because the JIT can do a better job than a compiler because most of the time, only one type is used in a particular spot, and the JIT can pick that up and "statically" type it, as opposed to a compiler which cannot.
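The point about a JIT "statically" typing a dynamic call site can be modeled in a few lines of Python. This is a toy sketch of type feedback: the `specialize`/`CallSite` names are invented, and a real JIT would emit machine code for the observed type rather than pick a lambda.

```python
def specialize(t):
    """Pretend compiler: return a version of 'x + 1' specialized for type t."""
    if t is int:
        # Monomorphic fast path: would compile down to a plain machine add.
        return lambda x: x + 1
    # Generic fallback: build a "1" of the observed type and combine.
    return lambda x: x + t("1")

class CallSite:
    """Records the type seen at one call site and caches a specialized body."""
    def __init__(self):
        self.cached_type = None
        self.compiled = None

    def call(self, x):
        t = type(x)
        if t is not self.cached_type:
            # First call, or the type changed: "deoptimize" and respecialize.
            self.cached_type = t
            self.compiled = specialize(t)
        return self.compiled(x)

site = CallSite()
print(site.call(41))   # int fast path -> 42
print(site.call("4"))  # respecialized for str -> "41"
```

Because most call sites in real dynamic programs see only one type, the cached fast path is taken almost always, which is what lets a JIT approach statically compiled speed.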
Mar 03 2013
parent reply "bearophile" <bearophileHUGS lycos.com> writes:
Walter Bright:

 and I have trouble thinking of any significant improvements to 
 code gen that could be made with runtime data.
I've seen that Java HotSpot is able to unroll loops with a length known only at run time. It unrolls them only when the run-time statistics say it's advantageous. (In theory the same is possible with profile-driven optimization.) Bye, bearophile
Mar 03 2013
parent "SomeDude" <lovelydear mailmetrash.com> writes:
On Sunday, 3 March 2013 at 11:57:36 UTC, bearophile wrote:
 Walter Bright:

 and I have trouble thinking of any significant improvements to 
 code gen that could be made with runtime data.
I've seen that the JavaHotSpot is able to unroll loops with a length known only at run-time. It unrolls them only when the run-time statistics say it's advantageous. (In theory the same is possible with profile-driven optimization). Bye, bearophile
I believe the HotSpot runtime compiler is about the only one to perform runtime optimizations like that. I don't know if the CIL does this kind of thing.
Mar 03 2013
prev sibling parent "John Colvin" <john.loughran.colvin gmail.com> writes:
On Saturday, 2 March 2013 at 20:02:09 UTC, Walter Bright wrote:
 On 3/2/2013 11:42 AM, Russel Winder wrote:
 On Sat, 2013-03-02 at 11:30 -0800, Walter Bright wrote:
 On 3/2/2013 7:43 AM, Russel Winder wrote:
 For writing interpreters, RPython spanks C.
What's RPython doing that makes it faster?
Allowing PyPy to have a good JIT compiler.
I don't understand. Does that JIT generate faster code than a C compiler would generate?
In some cases yes, it can, but as you know that has nothing to do with the language the JIT is written in. As I've said elsewhere, PyPy being fast compared to CPython has nothing to do with the former being written in RPython and the latter in C* *other than perhaps the higher productivity of writing in RPython allowing more resources to be dedicated to optimisation.
Mar 02 2013
prev sibling parent "bearophile" <bearophileHUGS lycos.com> writes:
Russel Winder:

 Allowing PyPy to have a good JIT compiler.
PyPy is not just a JIT, it's a more complex beast. It's more like a meta-JIT. The only one I know of. Bye, bearophile
Mar 04 2013
prev sibling parent reply "cvk012c" <cvk012c motorolasolutions.com> writes:
On Saturday, 2 March 2013 at 15:33:42 UTC, Andrei Alexandrescu 
wrote:
 On 3/1/13 8:05 PM, cvk012c wrote:
 In my latest version of D script I didn't use splitter at all. 
 I used
 string specific indexOf function. Still result is very bad. 
 For text
 based protocols, such as SIP, performance of string 
 manipulating
 functions is very important. Unfortunately, looks like it is 
 not D
 strongest point at this time.
That conclusion would be hasty if not missing the whole point. You essentially measured the speed of one loop in various translators implementing various languages. Java code doing straight computation is on a par with C speed, no two ways about that. Python code using library primitives ain't no slouch either.
That is true, but what prevents D from doing the same? Why can Java code be on par with C speed while D cannot? If the Java and Python folks can do it, why can't the D creators? If performance optimization is always a number one priority in the Java and Python communities, why is it not so for D? I will go even further: what prevents D from being faster than C? That's where D will really shine and people will start talking about it. But that is probably a distant future for D, and for now my personal opinion is that NO part of D should be slower than at least Java. If it is, it should be reported as an issue and fixed. Being more than twice slower than Java is a failure.
Mar 02 2013
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 3/2/2013 11:14 AM, cvk012c wrote:
 That is true, but what prevents D from doing the same?
Nothing.
 I will go even further: what prevents D from being faster than C?
Nothing. In fact, there are semantics in D that allow faster code to be generated than can be for C. They are unexploited at the moment, but they are there.
Mar 02 2013
parent "bearophile" <bearophileHUGS lycos.com> writes:
Walter Bright:

 there are semantics in D that allow faster code to be generated 
 than can be for C. They are unexploited at the moment, but they 
 are there.
This is interesting. Maybe you should write a small article on this. Bye, bearophile
Mar 04 2013
prev sibling parent reply "deadalnix" <deadalnix gmail.com> writes:
On Saturday, 2 March 2013 at 00:47:02 UTC, Steven Schveighoffer 
wrote:
 Try my hand-written version (elsewhere in thread).  I think it 
 can be done better too (use pointers instead of arrays).
That is usually a bad idea, as it will mess up the compiler's aliasing analysis pretty badly.
Mar 02 2013
parent "bearophile" <bearophileHUGS lycos.com> writes:
deadalnix:

 That is usually a bad idea as it will fuck up pretty bad the 
 aliasing analysis capabilities of the compiler.
With LDC I've seen that using raw pointers sometimes gives a little extra performance, but with DMD I've sometimes seen lower performance. Bye, bearophile
Mar 02 2013
prev sibling next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/1/13 4:28 PM, cvk012c wrote:
 On my hardware with -inline options it now takes about 15 secs which is
 still slower than Python but with both -inline and -noboundscheck it
 takes 13 secs and finally beats Python.
 But I still kind of disappointed because I expected a much better
 performance boost and got only 7%. Counting that Python is not the
 fastest scripting language I think that similar Perl and Java scripts
 will outperform D easily.
I doubt that.

1. Microbenchmarks are a crapshoot: they exercise a tiny portion of the language and library.

2. With Python, after comparing 2-3 idioms, well, this is pretty much it. We doubled the speed in no time by just tuning options. D being a systems language allows you to take your code to a million places if you want to optimize.

3. split has been discussed and improved for years in the Python community (just google for e.g. python split performance).

Perl isn't all that fast actually, try this (takes 30+ seconds):

$message = "REGISTER sip:comm.example.com SIP/2.0\r
Content-Length: 0\r
Contact: <sip:12345 10.1.3.114:59788;transport=tls>;expires=4294967295;events=\"message-summary\";q=0.9\r
To: <sip:12345 comm.example.com>\r
User-Agent: (\"VENDOR=MyCompany\" \"My User Agent\")\r
Max-Forwards: 70\r
CSeq: 1 REGISTER\r
Via: SIP/2.0/TLS 10.1.3.114:59788;branch=z9hG4bK2910497772630690\r
Call-ID: 2910497622026445\r
From: <sip:12345 comm.example.com>;tag=2910497618150713\r\n\r\n";

foreach my $i (0 .. 10000000) {
    foreach my $notused (split(/\r\n/, $message)) {
    }
}

Andrei
Mar 01 2013
parent reply Jacob Carlborg <doob me.com> writes:
On 2013-03-01 23:01, Andrei Alexandrescu wrote:

 I doubt that.

 1. Microbenchmarks are a crapshoot, they exercise a tiny portion of the
 language and library.
If that tiny portion is what's only used, then the rest doesn't matter.
 2. With Python, after comparing 2-3 idioms - well, this is pretty much
 it. We doubled the speed in no time by just tuning options. D being a
 systems language allows you take your code to a million places if you
 want to optimize.
Perhaps we should look over the default configuration. -- /Jacob Carlborg
Mar 02 2013
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/2/13 11:11 AM, Jacob Carlborg wrote:
 On 2013-03-01 23:01, Andrei Alexandrescu wrote:

 I doubt that.

 1. Microbenchmarks are a crapshoot, they exercise a tiny portion of the
 language and library.
If that tiny portion is what's only used, then the rest doesn't matter.
Agreed. But I doubt that's the case in nontrivial applications. Andrei
Mar 02 2013
prev sibling next sibling parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 03/01/2013 10:28 PM, cvk012c wrote:
 ...

 On my hardware with -inline options it now takes about 15 secs which is
 still slower than Python but with both -inline and -noboundscheck it
 takes 13 secs and finally beats Python.
 But I still kind of disappointed because I expected a much better
 performance boost and got only 7%. Counting that Python is not the
 fastest scripting language I think that similar Perl and Java scripts
 will outperform D easily.
Never make such statements without doing actual measurements. Furthermore, it is completely meaningless anyway. Performance benchmarks always compare language implementations, not languages. (Case in point: You get twice the speed by using another compiler backend implementation.)
Mar 01 2013
parent "SomeDude" <lovelydear mailmetrash.com> writes:
On Friday, 1 March 2013 at 22:02:02 UTC, Timon Gehr wrote:
 On 03/01/2013 10:28 PM, cvk012c wrote:
 ...

 On my hardware with -inline options it now takes about 15 secs 
 which is
 still slower than Python but with both -inline and 
 -noboundscheck it
 takes 13 secs and finally beats Python.
 But I still kind of disappointed because I expected a much 
 better
 performance boost and got only 7%. Counting that Python is not 
 the
 fastest scripting language I think that similar Perl and Java 
 scripts
 will outperform D easily.
Never make such statements without doing actual measurements. Furthermore, it is completely meaningless anyway. Performance benchmarks always compare language implementations, not languages. (Case in point: You get twice the speed by using another compiler backend implementation.)
Still, there is a case to be made for a performance test suite that could be run after (or before) each release of the language, like http://speed.pypy.org .
Mar 01 2013
prev sibling next sibling parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Fri, 01 Mar 2013 16:28:08 -0500, cvk012c  
<cvk012c motorolasolutions.com> wrote:

 On my hardware with -inline options it now takes about 15 secs which is  
 still slower than Python but with both -inline and -noboundscheck it  
 takes 13 secs and finally beats Python.
 But I still kind of disappointed because I expected a much better  
 performance boost and got only 7%. Counting that Python is not the  
 fastest scripting language I think that similar Perl and Java scripts  
 will outperform D easily.
 Thanks Andrei and simendsjo for a quick response though.
Phobos kind of refuses to treat strings like arrays of characters; it insists on decoding. With DMD and a hand-written splitter, it takes 6 seconds instead of 10 on my system (64-bit macosx).

struct MySplitter
{
    private string s;
    private string source;

    this(string src)
    {
        source = src;
        popFront();
    }

    @property string front() { return s; }
    @property bool empty() { return s.ptr == null; }

    void popFront()
    {
        s = source;
        if (!source.length)
        {
            source = null;
        }
        else
        {
            size_t i = 0;
            bool found = false;
            for (; i + 1 < source.length; i++)
            {
                if (source[i] == '\r' && source[i + 1] == '\n')
                {
                    found = true;
                    break;
                }
            }
            s = source[0 .. i];
            if (found)
                source = source[i + 2 .. $];
            else
                source = source[$ .. $];
        }
    }
}

I'm sure splitter could be optimized to do the same thing I'm doing. Probably can reduce that a bit using pointers instead of strings.

-Steve
Mar 01 2013
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/1/13 5:31 PM, Steven Schveighoffer wrote:
 Phobos kind of refuses to treat strings like arrays of characters, it
 insists on decoding.
There's no decoding in find and splitter as far as I remember. Andrei
Mar 01 2013
parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Fri, 01 Mar 2013 17:35:04 -0500, Andrei Alexandrescu  
<SeeWebsiteForEmail erdani.org> wrote:

 On 3/1/13 5:31 PM, Steven Schveighoffer wrote:
 Phobos kind of refuses to treat strings like arrays of characters, it
 insists on decoding.
There's no decoding in find and splitter as far as I remember.
Looking at splitter, it uses skipOver to skip over the separator, and that seems to call R.front and R.popFront. Actually, it calls them for both the source string and the separator. Maybe fixing skipOver would fix the problem? Or don't call skipOver at all: since find isn't doing decoding, why do decoding when skipping the separator? I still feel we need a specialized string type... -Steve
Mar 01 2013
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/1/13 5:47 PM, Steven Schveighoffer wrote:
 On Fri, 01 Mar 2013 17:35:04 -0500, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:

 On 3/1/13 5:31 PM, Steven Schveighoffer wrote:
 Phobos kind of refuses to treat strings like arrays of characters, it
 insists on decoding.
There's no decoding in find and splitter as far as I remember.
Looking at splitter, it uses skipOver to skip over the separator, and that seems to call R.front and R.popFront. Actually, it calls it for both the source string and the separator.
You may be looking at the wrong splitter overload. Andrei
Mar 01 2013
parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Fri, 01 Mar 2013 18:07:45 -0500, Andrei Alexandrescu  
<SeeWebsiteForEmail erdani.org> wrote:

 On 3/1/13 5:47 PM, Steven Schveighoffer wrote:
 On Fri, 01 Mar 2013 17:35:04 -0500, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:

 On 3/1/13 5:31 PM, Steven Schveighoffer wrote:
 Phobos kind of refuses to treat strings like arrays of characters, it
 insists on decoding.
There's no decoding in find and splitter as far as I remember.
Looking at splitter, it uses skipOver to skip over the separator, and that seems to call R.front and R.popFront. Actually, it calls it for both the source string and the separator.
You may be looking at the wrong splitter overload.
Yes, I was. Very difficult to tell with the way template constraints are written! So it's just pure heuristics. Not hard to see how that would affect a 10 million cycle program. We may be able to create a string-specific version of splitter that would take advantage of the representation. -Steve
Mar 01 2013
parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Fri, 01 Mar 2013 18:22:54 -0500, Steven Schveighoffer  
<schveiguy yahoo.com> wrote:

 So it's just pure heuristics.  Not hard to see how that would affect a  
 10 million cycle program.

 We may be able to create a string-specific version of splitter that  
 would take advantage of the representation.
Just found a disturbing artifact: splitter(message, '\n') is more than twice as slow as splitter(message, "\n") -Steve
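Differences like this are easy to quantify with a small timing harness. Python has no char-vs-string overload distinction, so as a stand-in the sketch below times two different implementations of the same split (str.split vs. a precompiled regex); the message is abbreviated and the iteration count reduced from the original 10 million.

```python
import re
import timeit

# Abbreviated stand-in for the SIP REGISTER message from the benchmark.
message = (
    "REGISTER sip:example.com SIP/2.0\r\n"
    "Content-Length: 0\r\n"
    "CSeq: 1 REGISTER\r\n"
    "\r\n"
)

splitter_re = re.compile(r"\r\n")
n = 50_000

# Time each variant over the same number of iterations and compare totals.
t_str = timeit.timeit(lambda: message.split("\r\n"), number=n)
t_re = timeit.timeit(lambda: splitter_re.split(message), number=n)

print(f"str.split: {t_str:.4f}s, re.split: {t_re:.4f}s over {n} iterations")
```

Both variants return identical lists, so any timing gap is pure implementation overhead, which is the same kind of artifact as the char/string splitter difference above.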
Mar 01 2013
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/1/13 6:26 PM, Steven Schveighoffer wrote:
 On Fri, 01 Mar 2013 18:22:54 -0500, Steven Schveighoffer
 <schveiguy yahoo.com> wrote:

 So it's just pure heuristics. Not hard to see how that would affect a
 10 million cycle program.

 We may be able to create a string-specific version of splitter that
 would take advantage of the representation.
Just found a disturbing artifact: splitter(message, '\n') is more than twice as slow as splitter(message, "\n") -Steve
That is a problem. Could you please file a but report? Thanks, Andrei
Mar 02 2013
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/2/13 12:20 PM, Andrei Alexandrescu wrote:
 On 3/1/13 6:26 PM, Steven Schveighoffer wrote:
 On Fri, 01 Mar 2013 18:22:54 -0500, Steven Schveighoffer
 <schveiguy yahoo.com> wrote:

 So it's just pure heuristics. Not hard to see how that would affect a
 10 million cycle program.

 We may be able to create a string-specific version of splitter that
 would take advantage of the representation.
Just found a disturbing artifact: splitter(message, '\n') is more than twice as slow as splitter(message, "\n") -Steve
That is a problem. Could you please file a but report?
s/but/bug/ :o) Andrei
Mar 02 2013
parent "Zach the Mystic" <reachBUTMINUSTHISzach gOOGLYmail.com> writes:
On Saturday, 2 March 2013 at 17:21:17 UTC, Andrei Alexandrescu
wrote:
 On 3/2/13 12:20 PM, Andrei Alexandrescu wrote:
 On 3/1/13 6:26 PM, Steven Schveighoffer wrote:
 On Fri, 01 Mar 2013 18:22:54 -0500, Steven Schveighoffer
 <schveiguy yahoo.com> wrote:

 So it's just pure heuristics. Not hard to see how that would 
 affect a
 10 million cycle program.

 We may be able to create a string-specific version of 
 splitter that
 would take advantage of the representation.
Just found a disturbing artifact: splitter(message, '\n') is more than twice as slow as splitter(message, "\n") -Steve
That is a problem. Could you please file a but report?
s/but/bug/ :o)
Ohhhhh!
 Andrei
Mar 04 2013
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Sat, 02 Mar 2013 12:20:22 -0500, Andrei Alexandrescu  
<SeeWebsiteForEmail erdani.org> wrote:

 On 3/1/13 6:26 PM, Steven Schveighoffer wrote:
 On Fri, 01 Mar 2013 18:22:54 -0500, Steven Schveighoffer
 <schveiguy yahoo.com> wrote:

 So it's just pure heuristics. Not hard to see how that would affect a
 10 million cycle program.

 We may be able to create a string-specific version of splitter that
 would take advantage of the representation.
Just found a disturbing artifact: splitter(message, '\n') is more than twice as slow as splitter(message, "\n") -Steve
That is a problem. Could you please file a but report?
http://d.puremagic.com/issues/show_bug.cgi?id=9645

In trying to make a minimal test case, I found that the algorithm performs worse vs. the substring version as the number of content characters between separators grows.

It actually starts out performing better than the substring version, by quite a bit (I would think we should be able to optimize there as well, the difference was about half) when the ratio is 1:1 (i.e. splitting "ababababa" on 'a' or "a"), but gets worse as you put more content between the separators. That should be a large clue.

-Steve
Mar 04 2013
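The density experiment Steve describes can be sketched as a small harness. Python here is only to illustrate the methodology (Python's own split won't reproduce the D artifact, which is what the bug report is for), and the `density_test` helper name is invented for this example:

```python
import time

def density_test(ratio, total_len=1_000_000):
    """Build a string where roughly 1 in `ratio` characters is the
    separator 'a' and the rest are content ('b')."""
    return ("b" * (ratio - 1) + "a") * (total_len // ratio)

# Time a split at several content-to-separator ratios; in the D case,
# the report is that cost grows with the content between separators.
for ratio in (2, 8, 32):
    s = density_test(ratio)
    t0 = time.perf_counter()
    parts = s.split("a")
    elapsed = time.perf_counter() - t0
    print(ratio, len(parts), f"{elapsed:.4f}s")
```

At ratio 2 this produces exactly the "ababab"-style 1:1 input mentioned above.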
prev sibling parent "Zach the Mystic" <reachBUTMINUSTHISzach gOOGLYmail.com> writes:
On Saturday, 2 March 2013 at 17:20:23 UTC, Andrei Alexandrescu 
wrote:
 On 3/1/13 6:26 PM, Steven Schveighoffer wrote:
 On Fri, 01 Mar 2013 18:22:54 -0500, Steven Schveighoffer
 <schveiguy yahoo.com> wrote:

 So it's just pure heuristics. Not hard to see how that would 
 affect a
 10 million cycle program.

 We may be able to create a string-specific version of 
 splitter that
 would take advantage of the representation.
Just found a disturbing artifact: splitter(message, '\n') is more than twice as slow as splitter(message, "\n") -Steve
That is a problem. Could you please file a but report? Thanks, Andrei
I would have filed a bug report, but I was WAY too busy.
Mar 04 2013
prev sibling parent reply Jacob Carlborg <doob me.com> writes:
On 2013-03-01 23:31, Steven Schveighoffer wrote:

 Phobos kind of refuses to treat strings like arrays of characters, it
 insists on decoding.

 With DMD and a hand-written splitter, it takes 6 seconds instead of 10
 on my system (64-bit macosx).
If you need to write a custom splitter to get acceptable performance then it sounds to me that D or Phobos have failed. -- /Jacob Carlborg
Mar 02 2013
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/2/13 11:12 AM, Jacob Carlborg wrote:
 On 2013-03-01 23:31, Steven Schveighoffer wrote:

 Phobos kind of refuses to treat strings like arrays of characters, it
 insists on decoding.

 With DMD and a hand-written splitter, it takes 6 seconds instead of 10
 on my system (64-bit macosx).
If you need to write a custom splitter to get acceptable performance then it sounds to me that D or Phobos have failed.
I wrote a custom splitter specialized for arrays. It's as fast. Andrei
Mar 02 2013
parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Sat, 02 Mar 2013 12:14:07 -0500, Andrei Alexandrescu  
<SeeWebsiteForEmail erdani.org> wrote:

 On 3/2/13 11:12 AM, Jacob Carlborg wrote:
 On 2013-03-01 23:31, Steven Schveighoffer wrote:

 Phobos kind of refuses to treat strings like arrays of characters, it
 insists on decoding.

 With DMD and a hand-written splitter, it takes 6 seconds instead of 10
 on my system (64-bit macosx).
If you need to write a custom splitter to get acceptable performance then it sounds to me that D or Phobos have failed.
I wrote a custom splitter specialized for arrays. It's as fast.
Why doesn't it get used for strings in this instance? Maybe there is a bad constraint somewhere. -Steve
Mar 02 2013
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/2/13 2:37 PM, Steven Schveighoffer wrote:
 On Sat, 02 Mar 2013 12:14:07 -0500, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:

 On 3/2/13 11:12 AM, Jacob Carlborg wrote:
 On 2013-03-01 23:31, Steven Schveighoffer wrote:

 Phobos kind of refuses to treat strings like arrays of characters, it
 insists on decoding.

 With DMD and a hand-written splitter, it takes 6 seconds instead of 10
 on my system (64-bit macosx).
If you need to write a custom splitter to get acceptable performance then it sounds to me that D or Phobos have failed.
I wrote a custom splitter specialized for arrays. It's as fast.
Why doesn't it get used for strings in this instance? Maybe there is a bad constraint somewhere.
I fear there's a misunderstanding somewhere. Splitter and find are already specialized appropriately for strings. Andrei
Mar 02 2013
parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Sat, 02 Mar 2013 16:16:22 -0500, Andrei Alexandrescu  
<SeeWebsiteForEmail erdani.org> wrote:

 I fear there's a misunderstanding somewhere. Splitter and find are  
 already specialized appropriately for strings.
OK, I would then counter that I have created a splitter range that beats its performance, by a lot. As far as I can tell, I'm not doing anything special.

All I feel that I'm doing differently is instead of using find or indexOf, I'm looking at each char to see if it matches the first in the separator, and if it does, then I'm comparing the remaining characters. Pretty braindead, probably room for improvement. Splitter should beat this, or match it at least.

From testing the sample code in this thread, my simple version beats splitter by 40%. Something isn't right somewhere, the difference should not be that drastic.

-Steve
Mar 02 2013
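The "braindead" scan Steve describes (look at each char, and only on a first-character hit compare the rest of the separator) might look roughly like this. A Python sketch of the control flow only, not the actual range code; `naive_split` is a hypothetical name:

```python
def naive_split(s, sep):
    """Sketch of the described scan: compare each character against the
    first character of the separator, and only when it matches compare
    the remaining separator characters."""
    n, m = len(s), len(sep)
    first = sep[0]
    start = i = 0
    while i <= n - m:
        if s[i] == first and s[i:i + m] == sep:
            yield s[start:i]   # piece before the separator
            i += m
            start = i
        else:
            i += 1
    yield s[start:]            # tail after the last separator
```

For the SIP message in the benchmark, `list(naive_split(message, "\r\n"))` should agree with `message.split("\r\n")`.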
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/2/13 5:48 PM, Steven Schveighoffer wrote:
 On Sat, 02 Mar 2013 16:16:22 -0500, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:

 I fear there's a misunderstanding somewhere. Splitter and find are
 already specialized appropriately for strings.
OK, I would then counter that I have created a splitter range
Prolly a link would clarify that. Andrei
Mar 02 2013
parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Sat, 02 Mar 2013 18:14:28 -0500, Andrei Alexandrescu  
<SeeWebsiteForEmail erdani.org> wrote:

 On 3/2/13 5:48 PM, Steven Schveighoffer wrote:
 On Sat, 02 Mar 2013 16:16:22 -0500, Andrei Alexandrescu
 <SeeWebsiteForEmail erdani.org> wrote:

 I fear there's a misunderstanding somewhere. Splitter and find are
 already specialized appropriately for strings.
OK, I would then counter that I have created a splitter range
Prolly a link would clarify that.
Forget this, I read your other post. The MySplitter range is the one I was referring to. I saw your points and will respond there, but not today. I don't have time right now to continue the debate. -Steve
Mar 02 2013
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 3/2/2013 8:12 AM, Jacob Carlborg wrote:
 If you need to write a custom splitter to get acceptable performance then it
 sounds to me that D or Phobos have failed.
In D, you can write custom versions of algorithms to exploit unique characteristics of particular data structures. I don't regard that as a failure.

But if, to write a custom version, you had to switch to another language, then I would agree that is a failure of a "systems" programming language.
Mar 02 2013
parent reply Jacob Carlborg <doob me.com> writes:
On 2013-03-02 20:27, Walter Bright wrote:

 In D, you can write custom versions of algorithms to exploit unique
 characteristics of particular data structures. I don't regard that as a
 failure.
If you use the most obvious function from the standard library to perform a task in Python, and you do the same in D, and the D version is slower, that's not good.

As you've seen here, people will use what's available in the standard library, and if that's not good enough compared to what's available in other languages' standard libraries, they will complain. If a function in Phobos can't match the corresponding function in another std lib, I think we're doing something wrong. Or they just happened to use a language where the std lib was optimized better for that particular task.

--
/Jacob Carlborg
Mar 03 2013
parent Walter Bright <newshound2 digitalmars.com> writes:
On 3/3/2013 7:07 AM, Jacob Carlborg wrote:
 If you use the most obvious function/way from the standard library to perform a
 task in Python and you do the same in D. If the D version is slower that's not
good.
Sure. But you should be cognizant of exactly what it is you're comparing, or you're apt to draw very wrong conclusions.
Mar 03 2013
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 3/1/2013 1:28 PM, cvk012c wrote:
 But I still kind of disappointed because I expected a much better performance
 boost and got only 7%. Counting that Python is not the fastest scripting
 language I think that similar Perl and Java scripts will outperform D easily.
 Thanks Andrei and simendsjo for a quick response though.
Python's splitter, which you are measuring, isn't written in Python. It is written in C. You're actually comparing a bit of C code with a bit of D code.
Mar 01 2013
next sibling parent "John Colvin" <john.loughran.colvin gmail.com> writes:
On Saturday, 2 March 2013 at 01:10:39 UTC, Walter Bright wrote:
 On 3/1/2013 1:28 PM, cvk012c wrote:
 But I still kind of disappointed because I expected a much 
 better performance
 boost and got only 7%. Counting that Python is not the fastest 
 scripting
 language I think that similar Perl and Java scripts will 
 outperform D easily.
 Thanks Andrei and simendsjo for a quick response though.
Python's splitter, which you are measuring, isn't written in Python. It is written in C. You're actually comparing a bit of C code with a bit of D code.
This. If you wrote the splitter in pure python you would see an enormous performance gap between it and the D version.

Having said that, maybe there are even more improvements we can make to improve speed in that part of Phobos, but considering we're already on a par with some quite mature and optimised C, there really isn't a problem.
Mar 02 2013
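The gap John mentions is easy to demonstrate from the Python side: re-implement split at the Python level and time it against the built-in. A rough sketch (the `py_split` helper is invented for illustration, and even here the substring search still runs in C via str.find; exact numbers vary by machine):

```python
import time

def py_split(s, sep):
    """str.split re-implemented at the Python level on top of str.find.
    The per-iteration interpreter overhead is what the timing exposes."""
    out, start = [], 0
    while True:
        i = s.find(sep, start)
        if i < 0:
            out.append(s[start:])
            return out
        out.append(s[start:i])
        start = i + len(sep)

message = "Via: SIP/2.0/TLS 10.1.3.114\r\nCall-ID: 123\r\nFrom: <sip:12345>\r\n\r\n"
assert py_split(message, "\r\n") == message.split("\r\n")

# Time the C-implemented builtin against the Python-level loop.
for name, fn in [("builtin", lambda: message.split("\r\n")),
                 ("python-level", lambda: py_split(message, "\r\n"))]:
    t0 = time.perf_counter()
    for _ in range(200_000):
        fn()
    print(name, f"{time.perf_counter() - t0:.3f}s")
```

A character-by-character loop with no C help at all would widen the gap much further.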
prev sibling parent reply Jacob Carlborg <doob me.com> writes:
On 2013-03-02 02:10, Walter Bright wrote:

 Python's splitter, which you are measuring, isn't written in Python. It
 is written in C. You're actually comparing a bit of C code with a bit of
 D code.
Does that really matter? He's using Python; if the function is part of the standard library, whether it's implemented in Python or C doesn't really matter.

--
/Jacob Carlborg
Mar 02 2013
next sibling parent reply "John Colvin" <john.loughran.colvin gmail.com> writes:
On Saturday, 2 March 2013 at 16:09:02 UTC, Jacob Carlborg wrote:
 On 2013-03-02 02:10, Walter Bright wrote:

 Python's splitter, which you are measuring, isn't written in 
 Python. It
 is written in C. You're actually comparing a bit of C code 
 with a bit of
 D code.
Does that really matter. He's using Python, if the function is part of the standard library and if it's implement in Python or C doesn't really matter.
It does. Failing to beat mature, optimised C is not anything to be ashamed of, but being slower than python would be an abject failure of D as a compiled, systems programming language.
Mar 02 2013
parent reply Jacob Carlborg <doob me.com> writes:
On 2013-03-02 17:48, John Colvin wrote:

 It does.

 Failing to beat mature, optimised C is not anything to be ashamed of,
 but being slower than python would be an abject failure of D as a
 compiled, systems programming language.
Then D needs to get faster, or we need to switch to C for some std lib functions. In that case, as Walter said, we have failed. -- /Jacob Carlborg
Mar 03 2013
next sibling parent reply "John Colvin" <john.loughran.colvin gmail.com> writes:
On Sunday, 3 March 2013 at 15:09:58 UTC, Jacob Carlborg wrote:
 On 2013-03-02 17:48, John Colvin wrote:

 It does.

 Failing to beat mature, optimised C is not anything to be 
 ashamed of,
 but being slower than python would be an abject failure of D 
 as a
 compiled, systems programming language.
Then D needs to get faster, or we need to switch to C for some std lib functions. In that case, as Walter said, we have failed.
I agree that anything that makes us faster is good, but I wouldn't go so far as to say we've failed if we're not as fast as the very fastest.
Mar 03 2013
parent reply Jacob Carlborg <doob me.com> writes:
On 2013-03-03 16:41, John Colvin wrote:

 I agree that anything that makes us faster is good, but I wouldn't go so
 far as to say we've failed if we're not as fast as the very fastest.
No, but if we need to drop down to C to get fast enough. -- /Jacob Carlborg
Mar 03 2013
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 3/3/2013 11:38 PM, Jacob Carlborg wrote:
 On 2013-03-03 16:41, John Colvin wrote:

 I agree that anything that makes us faster is good, but I wouldn't go so
 far as to say we've failed if we're not as fast as the very fastest.
No, but if we need to drop down to C to get fast enough.
I can't think of a reason to need to do that.
Mar 03 2013
parent reply Jonathan M Davis <jmdavisProg gmx.com> writes:
On Sunday, March 03, 2013 23:40:18 Walter Bright wrote:
 On 3/3/2013 11:38 PM, Jacob Carlborg wrote:
 On 2013-03-03 16:41, John Colvin wrote:
 I agree that anything that makes us faster is good, but I wouldn't go so
 far as to say we've failed if we're not as fast as the very fastest.
No, but if we need to drop down to C to get fast enough.
I can't think of a reason to need to do that.
Given how D works, there is something _very_ wrong if we have to drop to C, if nothing else, because it's trivial to write code in D which is equivalent to what it would be in C. If our implementation of something is worse than a C implementation, that merely means that it needs to be refactored and optimized.

It's possible that some of our abstractions will hamper efficiency in some cases (e.g. depending on how ranges are used - particularly with strings - you risk a performance hit in comparison to pure C), but that should generally be fixable via specializations and whatnot.

If we have an efficiency problem, it's merely an implementation issue which needs to be sorted out. There should be nothing inherent to D which makes it so that you can't write code as fast as C or C++.

- Jonathan M Davis
Mar 04 2013
parent Jacob Carlborg <doob me.com> writes:
On 2013-03-04 09:18, Jonathan M Davis wrote:

 Given how D works, there is something _very_ wrong if we have to drop to C, if
 nothing else, because it's trivial to write code in D which is equivalent to
 what it would be in C. If our implementation of something is worse than a C
 implementation, that merely means that it needs to be refactored and
 optimized. It's possible that some of our abstractions will hamper efficiency in
 some cases (e.g. depending on how ranges are used - particularly with strings
 - you risk a performance hit in comparison to pure C), but that should
 generally be fixable via specializations and whatnot.

 If we have an efficiency problem, it's merely an implementation issue which
 needs to be sorted out. There should be nothing inherent to D which makes it
 so that you can't write code as fast as C or C++.
That's what I'm saying. -- /Jacob Carlborg
Mar 04 2013
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 3/3/2013 7:09 AM, Jacob Carlborg wrote:
 On 2013-03-02 17:48, John Colvin wrote:

 It does.

 Failing to beat mature, optimised C is not anything to be ashamed of,
 but being slower than python would be an abject failure of D as a
 compiled, systems programming language.
Then D needs to get faster, or we need to switch to C for some std lib functions. In that case, as Walter said, we have failed.
Nothing in this thread suggests that D needs to switch its library implementations to C.

Interestingly, I tried the indexOf() benchmark using C's memchr() and memcmp() from VC's runtime library. It was not faster than Andrei's optimized D one.
Mar 03 2013
next sibling parent reply "deadalnix" <deadalnix gmail.com> writes:
On Sunday, 3 March 2013 at 20:18:58 UTC, Walter Bright wrote:
 On 3/3/2013 7:09 AM, Jacob Carlborg wrote:
 On 2013-03-02 17:48, John Colvin wrote:

 It does.

 Failing to beat mature, optimised C is not anything to be 
 ashamed of,
 but being slower than python would be an abject failure of D 
 as a
 compiled, systems programming language.
Then D needs to get faster, or we need to switch to C for some std lib functions. In that case, as Walter said, we have failed.
Nothing in this thread suggests that D needs to switch its library implementations to C. Interestingly, I tried the indexOf() benchmark using C's memchr() and memcmp() from VC's runtime library. It was not faster than Andrei's optimized D one.
Maybe it is time to look at the python implementation and see why it is faster.
Mar 03 2013
next sibling parent reply "jerro" <a a.com> writes:
 Maybe it is time to look at the python implementation and see 
 why it is faster.
It isn't faster:

$ time python3 test.py

real    0m14.217s
user    0m14.209s
sys     0m0.004s

$ gdmd -O -inline -release -noboundscheck test
$ time ./test

real    0m5.323s
user    0m5.312s
sys     0m0.008s

D code here uses the same string as the python code, not the one in cvk012c's D code.
Mar 03 2013
next sibling parent reply "bearophile" <bearophileHUGS lycos.com> writes:
jerro:

 $ time python3 test.py
Are Python3 strings still like wstrings/dstrings, or have they applied the patch that makes them UTF8? Bye, bearophile
Mar 03 2013
next sibling parent Russel Winder <russel winder.org.uk> writes:
On Mon, 2013-03-04 at 04:31 +0100, bearophile wrote:
[…]
 Are Python3 strings still like wstrings/dstrings, or have they
 applied the patch that makes them UTF8?
Python3 is Unicode by default, with UTF-8 the default encoding.

-- 
Russel.

Dr Russel Winder          t: +44 20 7585 2200   voip: sip:russel.winder ekiga.net
41 Buckmaster Road        m: +44 7770 465 077   xmpp: russel winder.org.uk
London SW11 1EN, UK       w: www.russel.org.uk  skype: russel_winder
Mar 04 2013
prev sibling parent Ellery Newcomer <ellery-newcomer utulsa.edu> writes:
On 03/03/2013 07:31 PM, bearophile wrote:
 jerro:

 $ time python3 test.py
Are Python3 strings still like wstrings/dstrings, or have they applied the patch that makes them UTF8? Bye, bearophile
Looks like 3.3 will store them as utf8, but not 3.2 or below.
Mar 04 2013
prev sibling parent reply "deadalnix" <deadalnix gmail.com> writes:
On Monday, 4 March 2013 at 03:20:57 UTC, jerro wrote:
 Maybe it is time to look at the python implementation and see 
 why it is faster.
 It isn't faster:

 $ time python3 test.py

 real    0m14.217s
 user    0m14.209s
 sys     0m0.004s

 $ gdmd -O -inline -release -noboundscheck test
 $ time ./test

 real    0m5.323s
 user    0m5.312s
 sys     0m0.008s

 D code here uses the same string as the python code, not the one in cvk012c's D code.
Using -noboundscheck isn't fair, as you give up on safety, so you are not equivalent to Python either.
Mar 03 2013
next sibling parent "Don" <turnyourkidsintocash nospam.com> writes:
On Monday, 4 March 2013 at 03:58:20 UTC, deadalnix wrote:
 On Monday, 4 March 2013 at 03:20:57 UTC, jerro wrote:
 Maybe it is time to look at the python implementation and see 
 why it is faster.
 It isn't faster:

 $ time python3 test.py

 real    0m14.217s
 user    0m14.209s
 sys     0m0.004s

 $ gdmd -O -inline -release -noboundscheck test
 $ time ./test

 real    0m5.323s
 user    0m5.312s
 sys     0m0.008s

 D code here uses the same string as the python code, not the one in cvk012c's D code.
Using noboundcheck isn't fair as you give up on safety, so you are not equivalent to python either.
But that does suggest that the main problem is that DMD has very little optimization aimed at reducing the number of bounds checks required.
Mar 04 2013
prev sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Sun, 03 Mar 2013 22:58:07 -0500, deadalnix <deadalnix gmail.com> wrote:

 On Monday, 4 March 2013 at 03:20:57 UTC, jerro wrote:
 Maybe it is time to look at the python implementation and see why it  
 is faster.
 It isn't faster:

 $ time python3 test.py

 real    0m14.217s
 user    0m14.209s
 sys     0m0.004s

 $ gdmd -O -inline -release -noboundscheck test
 $ time ./test

 real    0m5.323s
 user    0m5.312s
 sys     0m0.008s

 D code here uses the same string as the python code, not the one in cvk012c's D code.
Using noboundcheck isn't fair as you give up on safety, so you are not equivalent to python either.
In this type of test, there is no danger with using noboundscheck. It's a very fair switch to use; D is just able to do it where python is not.

A sufficiently smart compiler could eliminate the bounds check for this code, since it can be proven not to go out of bounds (in fact, a simple run without the noboundscheck proves it in very deterministic code like this).

-Steve
Mar 04 2013
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/3/13 9:31 PM, deadalnix wrote:
 On Sunday, 3 March 2013 at 20:18:58 UTC, Walter Bright wrote:
 On 3/3/2013 7:09 AM, Jacob Carlborg wrote:
 On 2013-03-02 17:48, John Colvin wrote:

 It does.

 Failing to beat mature, optimised C is not anything to be ashamed of,
 but being slower than python would be an abject failure of D as a
 compiled, systems programming language.
Then D needs to get faster, or we need to switch to C for some std lib functions. In that case, as Walter said, we have failed.
Nothing in this thread suggests that D needs to switch its library implementations to C. Interestingly, I tried the indexOf() benchmark using C's memchr() and memcmp() from VC's runtime library. It was not faster than Andrei's optimized D one.
Maybe it is time to look at the python implementation and see why it is faster.
Was the conclusion that it's faster? Andrei
Mar 03 2013
parent "deadalnix" <deadalnix gmail.com> writes:
On Monday, 4 March 2013 at 04:36:30 UTC, Andrei Alexandrescu 
wrote:
 On 3/3/13 9:31 PM, deadalnix wrote:
 On Sunday, 3 March 2013 at 20:18:58 UTC, Walter Bright wrote:
 On 3/3/2013 7:09 AM, Jacob Carlborg wrote:
 On 2013-03-02 17:48, John Colvin wrote:

 It does.

 Failing to beat mature, optimised C is not anything to be 
 ashamed of,
 but being slower than python would be an abject failure of 
 D as a
 compiled, systems programming language.
Then D needs to get faster, or we need to switch to C for some std lib functions. In that case, as Walter said, we have failed.
Nothing in this thread suggests that D needs to switch its library implementations to C. Interestingly, I tried the indexOf() benchmark using C's memchr() and memcmp() from VC's runtime library. It was not faster than Andrei's optimized D one.
Maybe it is time to look at the python implementation and see why it is faster.
Was the conclusion that it's faster? Andrei
It seems that I was wrong, and missed the point that the two benchmarks weren't using the same string. Sorry for that.
Mar 03 2013
prev sibling parent Jacob Carlborg <doob me.com> writes:
On 2013-03-03 21:18, Walter Bright wrote:

 Nothing in this thread suggests that D needs to switch its library
 implementations to C.
Then that's good.
 Interestingly, I tried the indexOf() benchmark using C's memchr() and
 memcmp() from VC's runtime library. It was not faster than Andrei's
 optimized D one.
-- /Jacob Carlborg
Mar 03 2013
prev sibling next sibling parent "jerro" <a a.com> writes:
 Does that really matter. He's using Python, if the function is 
 part of the standard library and if it's implement in Python or 
 C doesn't really matter.
You can look at it that way, but still, the fact that most of the work in the Python version is done by C code makes the timings less surprising.

If split() were implemented in pure Python and were only three times slower (when run with CPython) than std.algorithm.splitter, that would be very surprising and a sign that std.algorithm.splitter is doing something horribly wrong.
Mar 02 2013
prev sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 3/2/13 11:09 AM, Jacob Carlborg wrote:
 On 2013-03-02 02:10, Walter Bright wrote:

 Python's splitter, which you are measuring, isn't written in Python. It
 is written in C. You're actually comparing a bit of C code with a bit of
 D code.
Does that really matter. He's using Python, if the function is part of the standard library and if it's implement in Python or C doesn't really matter.
The point here is that often one needs to write code that's more than gluing library calls together. I've seen that many, many times. Andrei
Mar 02 2013
prev sibling next sibling parent "jerro" <a a.com> writes:
On Friday, 1 March 2013 at 20:30:24 UTC, cvk012c wrote:
 Tried to port my SIP parser from Python to D to boost 
 performance
 but got opposite result.
 I created a simple test script which splits SIP REGISTER message
 10 million times. Python version takes about 14 seconds to
 execute, D version is about 23 seconds which is 1.6 times 
 slower.
 I used DMD 2.062 and compiled my script with options -release 
 and
 -O. I used Python 3.3 64 bit.
 I ran both scripts on the same hardware with Windows 7 64.
 Is this general problem with string performance in D or just
 splitter() issue?
 Did anybody compared performance of D string manipulating
 functions with other languages like Python,Perl,Java and C++?
I'm guessing you are building without optimization options. When compiled with "dmd -O -inline -noboundscheck -release tmp" the D code takes 11.1 seconds on my machine and the python script takes 16.1 seconds.

You can make the D code faster by building with LDC or GDC:

ldmd2 -O -noboundscheck -release tmp: 6.8 seconds
gdmd -O -noboundscheck -inline -release tmp: 6.1 seconds

So no, not slower than Python. I also suspect that much of the work in the Python code is actually done by functions that are implemented in C.
Mar 01 2013
prev sibling parent Robert <jfanatiker gmx.at> writes:
Hm, just recently a friend of mine and I hacked together at the FB
Hacker Cup, he in Python and I in D. My solutions were always faster by
at least a factor of 80. For your example, I could not get a factor of
80, but with -inline it is at least faster than the Python version
(about 30% faster on my machine).

Best regards,

Robert
Mar 01 2013