digitalmars.D - Slower than Python
- cvk012c (56/56) Mar 01 2013 Tried to port my SIP parser from Python to D to boost performance
- Andrei Alexandrescu (3/11) Mar 01 2013 Add -inline to the options.
- simendsjo (10/25) Mar 01 2013 --noboundscheck can also help if you don't mind missing the
- Andrei Alexandrescu (17/38) Mar 01 2013 Also, the D version has a different string to parse (~ is not a line
- cvk012c (9/36) Mar 01 2013 On my hardware with -inline options it now takes about 15 secs
- bearophile (16/19) Mar 01 2013 Why don't you write a Java version? It takes only few minutes,
- cvk012c (69/75) Mar 01 2013 You are right. Why not. But instead of using Java split() method
- Steven Schveighoffer (21/39) Mar 01 2013 Try my hand-written version (elsewhere in thread). I think it can be do...
- cvk012c (7/34) Mar 01 2013 In my latest version of D script I didn't use splitter at all. I
- anon123 (32/68) Mar 01 2013 My version is functionally equivalent, and measures 55 - 94 ms
- Walter Bright (3/4) Mar 01 2013 I don't think so. Take a look at what happens to the message variable. I...
- Steven Schveighoffer (3/6) Mar 01 2013 Try printing out message each time through the 10,000,000 iterations
- Steven Schveighoffer (6/11) Mar 01 2013 indexOf uses the same mechanism as splitter to find the separators. If ...
- Andrei Alexandrescu (3/13) Mar 02 2013 find and indexOf on arrays are on par with handwritten loops.
- Steven Schveighoffer (11/25) Mar 02 2013 This is not a personal attack, please don't take it that way. My
- Andrei Alexandrescu (121/126) Mar 02 2013 Not at all. I'm just confused about this exchange. You mention some
- Steven Schveighoffer (80/106) Mar 04 2013 I didn't realize the situation. I assumed that there wasn't a version o...
- Andrei Alexandrescu (4/63) Mar 04 2013 That's comparable, please file an enh request.
- Steven Schveighoffer (3/4) Mar 04 2013 http://d.puremagic.com/issues/show_bug.cgi?id=9646
- qznc (89/93) May 22 2016 Below is an implementation, which matches MySplitter with dmd,
- Seb (2/8) May 23 2016 have you thought about opening a PR to improve `splitter`?
- qznc (7/18) May 23 2016 Yes, but I'm not sure about the goals.
- Seb (5/24) May 23 2016 Good luck. If you need help, it's probably the best to open the
- qznc (7/10) May 23 2016 Forget the "dead code comment", it is actually a missing
- Andrei Alexandrescu (5/14) May 23 2016 Most uses are forward, so perhaps it's best to eliminate the
- Jack Stouffer (6/10) May 23 2016 I don't see the value add that things like filterBidirectional
- Andrei Alexandrescu (4/12) May 23 2016 What would be the condition tested in the static if? Recall that
- Jack Stouffer (4/7) May 23 2016 I honestly can't think of a time when would this be the case.
- qznc (12/20) May 23 2016 Here is a commit, which removes the dead code:
- Andrei Alexandrescu (2/3) May 23 2016 If it adds no overhead, you're right. -- Andrei
- Jack Stouffer (6/12) May 23 2016 If the performance is really a problem (I think it's a micro
- Andrei Alexandrescu (3/4) May 23 2016 splitter in the stdlib is likely to be very frequently useful, any bit
- Andrei Alexandrescu (3/10) May 23 2016 You might be doing extra work even if the user never calls back and
- qznc (8/16) May 23 2016 I think the actual slow thing is std.find, which splitter uses.
- Andrei Alexandrescu (10/12) May 23 2016 Conventional wisdom has it that find() is brute force and that's that,
- qznc (13/27) May 24 2016 I observed that Boyer-Moore from std is still slower. My guess is
- Chris (18/50) May 24 2016 From Phobos [1]:
- qznc (11/33) May 24 2016 Unfortunately, it is slower. My current measurements [0]:
- Andrei Alexandrescu (4/41) May 24 2016 One somewhat unrelated thing that can be done with Boyer-Moore: use a
- Andrei Alexandrescu (8/11) May 24 2016 Another idea: in every searched string there is one specific index that
- qznc (16/26) May 24 2016 Plot twist: manual find is not always faster.
- Andrei Alexandrescu (2/4) May 24 2016 Try http://asm.dlang.org for short tests. -- Andrei
- qznc (19/21) May 24 2016 I discovered radare2. While it is actually a debugger (like gdb),
- Chris (2/21) May 25 2016 Any suggestions how we can make find more efficient?
- Andrei Alexandrescu (8/34) May 24 2016 There's also Rabin-Karp that can be used when the sought range is
- Chris (7/22) May 24 2016 Since I work a lot with text, I use canFind etc. a lot. It'd be
- Chris (13/25) May 27 2016 I've played around with some algorithms and kept them as simple
- Andrei Alexandrescu (8/17) May 27 2016 This is surprising. It should be easy to construct examples where brute
- Chris (38/55) May 27 2016 I have to correct myself. A manual loop is faster, I couldn't
- Chris (26/44) May 27 2016 Little bug fix:
- qznc (10/12) May 27 2016 I think it really depends on the use case. The manual algorithm
- Andrei Alexandrescu (4/6) May 27 2016 What you want to do here for indexed access is to match the last element...
- qznc (26/30) May 27 2016 Another plot twist: Sentinels look interesting.
- Andrei Alexandrescu (4/11) May 27 2016 Also mind the throwing possibilities.
- qznc (25/39) May 28 2016 LDC. For DMD the numbers show no improvement:
- Andrei Alexandrescu (33/36) May 28 2016 No need for a sentinel really so long as you first search for the last
- Chris (17/55) May 28 2016 I included your function (I also fixed the bug in my
- qznc (4/6) May 28 2016 DMD performance feels flaky to me.
- qznc (33/44) May 28 2016 A nice one. Here are the numbers I got:
- qznc (38/38) May 29 2016 I played around with the benchmark. Some more numbers:
- Chris (15/54) May 29 2016 You can avoid "E: wrong result with Chris find" by using
- Chris (32/32) May 30 2016 It's weird, on my machine, my find function is consistently
- Jon Degenhardt (8/15) May 29 2016 A minor thing - you might consider also calculating the median
- qznc (11/17) May 29 2016 I don't think that would help. This is not a normal distribution,
- Jon Degenhardt (3/13) May 29 2016 Oh, I see. The benchmark varies the data on each run and
- qznc (17/21) May 30 2016 So far, the results look disappointing. Andrei find does not get
- Andrei Alexandrescu (9/29) May 30 2016 Please throw this hat into the ring as well: it should improve average
- Chris (3/12) May 30 2016 Cool. So far, my experience with separate functions has been that
- Chris (19/35) May 30 2016 Congratulations!
- Andrei Alexandrescu (5/21) May 30 2016 Thanks for looking into this! @qznc, could you please look into updating...
- qznc (44/67) May 30 2016 What function call should be replaced with inline code?
- Andrei Alexandrescu (12/14) May 30 2016 I agree it's difficult to characterize the behavior of substring search
- qznc (9/20) May 31 2016 There is a special version of find for searching a single char in
- Andrei Alexandrescu (2/4) May 31 2016 How is splitting on ',' a user mistake? -- Andrei
- qznc (26/31) May 31 2016 The mistake is to split on "," instead of ','. The slow splitter
- Chris (4/28) May 31 2016 Would it speed things up even more, if we put the function
- qznc (41/44) May 31 2016 LDC inlines it. DMD does not.
- Chris (7/52) May 31 2016 Yep. It's really impressive. I actually thought that dmd didn't
- Andrei Alexandrescu (3/7) May 31 2016 You may want to then try https://dpaste.dzfl.pl/392710b765a9, which
- David Nadlinger (21/23) May 31 2016 In general, it might be beneficial to use
- qznc (5/17) Jun 02 2016 The current algorithm is generic with respect to the predicate.
- Chris (77/79) Jun 01 2016 I've added it as `Andrei3`. It runs faster with dmd, but it's not
- qznc (7/17) Jun 02 2016 I ported this to Phobos style (A2Phobos in the benchmark) and
- Andrei Alexandrescu (16/35) May 31 2016 Oh, I see what you mean - it's a string consisting only of one
- Patrick Schluter (3/6) Jun 01 2016 At compile time you may not know the length of the needle, like
- Seb (11/17) Jun 01 2016 1) how about a CTFE find?
- Chris (4/23) Jun 01 2016 That makes sense. I think that std.algorithm needs to be revised
- Andrei Alexandrescu (3/15) Jun 01 2016 That would require partial evaluation, which sadly we don't have in D.
- Patrick Schluter (14/24) Jun 01 2016 What I wanted to say, is that in real life, the input of the
- Chris (20/37) May 31 2016 Benchmark from desktop machine:
- Henrique bucher (5/9) Jul 11 2016 I wrote a splitter in SSE4.2 some time ago as acontribution to a
- Patrick Schluter (8/33) May 27 2016 The upper bound of the outer loop should be
- Chris (3/3) May 27 2016 Oops, I've been found out. :-) Thanks. You're right of course,
- Patrick Schluter (27/30) May 27 2016 I had the same "bug" when I wrote my search function on the
- qznc (16/23) May 27 2016 I just updated my benchmark. It now checks lots of different
- qznc (5/5) May 27 2016 Now added Chris' algo to the benchmark:
- Andrei Alexandrescu (10/15) May 27 2016 Does any of these feature simultaneously:
- qznc (5/19) May 27 2016 Std find (currently in Phobos) and qznc find (see pull request)
- Chris (2/7) May 28 2016 Did you fix the bugs in my algorithm? S
- qznc (4/8) May 27 2016 If you want to see find (current or my improved version) get
- David Nadlinger (8/11) May 27 2016 Just about the only reason I could think of for this to happen is
- Chris (5/12) May 28 2016 I agree, and it has been said on this forum that tight manual
- Andrei Alexandrescu (13/18) Mar 02 2013 That conclusion would be hasty if not missing the whole point. You
- Russel Winder (27/34) Mar 02 2013
- John Colvin (3/32) Mar 02 2013 I'm not sure that's entirely fair. PyPy is fast because it
- Russel Winder (21/23) Mar 02 2013 The second sentence is almost entirely true, the first I am not so sure.
- Peter Alexander (10/19) Mar 02 2013 Source? (Googling "rpython interpreter speed" didn't show
- John Colvin (6/25) Mar 02 2013 Rpython has been used to write a very fast jit (pypy) which is
- Russel Winder (26/35) Mar 02 2013 For me, my source is my microbenchmarks, which are testing just loop
- Peter Alexander (14/29) Mar 02 2013 I don't care about benchmarking, I care about writing large, fast
- Walter Bright (2/3) Mar 02 2013 What's RPython doing that makes it faster?
- Russel Winder (12/16) Mar 02 2013 Allowing PyPy to have a good JIT compiler.
- jerro (3/9) Mar 02 2013 And why do you think this wouldn't be possible if it was written
- Walter Bright (3/9) Mar 02 2013 I don't understand. Does that JIT generate faster code than a C compiler...
- H. S. Teoh (8/19) Mar 02 2013 I don't know, but my wild guess is that a JIT optimizes the *right*
- Walter Bright (2/18) Mar 02 2013 I meant what the C *compiler* generates.
- Russel Winder (22/42) Mar 03 2013 Yes because the C/C++/D/etc. compilers are attempting to predict the
- Dmitry Olshansky (6/37) Mar 03 2013 Only one thing missing here is that JITs typically has much less time to...
- deadalnix (13/31) Mar 03 2013 That is theory, but in practice, it doesn't work that well : you
- Walter Bright (8/16) Mar 03 2013 I've heard this was the advantage of JITs for the last 15 years. I haven...
- bearophile (7/9) Mar 03 2013 I've seen that the JavaHotSpot is able to unroll loops with a
- SomeDude (4/14) Mar 03 2013 I believe the Hotspot runtime compiler is about the only one to
- John Colvin (7/17) Mar 02 2013 In some cases yes, it can, but as you know that has nothing to
- bearophile (5/6) Mar 04 2013 PyPy is not just a JIT, it's a more complex beast. It's more like
- cvk012c (15/31) Mar 02 2013 That is true, but what prevents D to do the same? Why Java code
- Walter Bright (4/6) Mar 02 2013 Nothing. In fact, there are semantics in D that allow faster code to be
- bearophile (5/8) Mar 04 2013 This is interesting. Maybe you should write a small article on
- deadalnix (4/6) Mar 02 2013 That is usually a bad idea as it will fuck up pretty bad the
- bearophile (6/8) Mar 02 2013 With LDC I've seen that using raw pointers sometimes gives a
- Andrei Alexandrescu (30/37) Mar 01 2013 I doubt that.
- Jacob Carlborg (5/12) Mar 02 2013 Perhaps we should look over the default configuration.
- Andrei Alexandrescu (3/9) Mar 02 2013 Agreed. But I doubt that's the case in nontrivial applications.
- Timon Gehr (6/14) Mar 01 2013 Never make such statements without doing actual measurements.
- SomeDude (5/26) Mar 01 2013 Still, there is a case to be made for a performance tests suite
- Steven Schveighoffer (53/61) Mar 01 2013 Phobos kind of refuses to treat strings like arrays of characters, it
- Andrei Alexandrescu (3/5) Mar 01 2013 There's no decoding in find and splitter as far as I remember.
- Steven Schveighoffer (10/14) Mar 01 2013 Looking at splitter, it uses skipOver to skip over the separator, and th...
- Andrei Alexandrescu (3/13) Mar 01 2013 You may be looking at the wrong splitter overload.
- Steven Schveighoffer (9/23) Mar 01 2013 Yes, I was. Very difficult to tell with the way template constraints ar...
- Steven Schveighoffer (5/9) Mar 01 2013 Just found a disturbing artifact: splitter(message, '\n') is more than
- Andrei Alexandrescu (4/14) Mar 02 2013 That is a problem. Could you please file a but report?
- Andrei Alexandrescu (4/19) Mar 02 2013 s/but/bug/
- Zach the Mystic (3/26) Mar 04 2013 Ohhhhh!
- Steven Schveighoffer (11/26) Mar 04 2013 http://d.puremagic.com/issues/show_bug.cgi?id=9645
- Zach the Mystic (3/23) Mar 04 2013 I would have filed a bug report, but I was WAY too busy.
- Jacob Carlborg (5/9) Mar 02 2013 If you need to write a custom splitter to get acceptable performance
- Andrei Alexandrescu (3/11) Mar 02 2013 I wrote a custom splitter specialized for arrays. It's as fast.
- Steven Schveighoffer (5/17) Mar 02 2013 Why doesn't it get used for strings in this instance? Maybe there is a ...
- Andrei Alexandrescu (4/21) Mar 02 2013 I fear there's a misunderstanding somewhere. Splitter and find are
- Steven Schveighoffer (13/15) Mar 02 2013 OK, I would then counter that I have created a splitter range that beats...
- Andrei Alexandrescu (3/8) Mar 02 2013 Prolly a link would clarify that.
- Steven Schveighoffer (7/16) Mar 02 2013 Forget this, I read your other post. The MySplitter range is the one I ...
- Walter Bright (5/7) Mar 02 2013 In D, you can write custom versions of algorithms to exploit unique
- Jacob Carlborg (12/15) Mar 03 2013 If you use the most obvious function/way from the standard library to
- Walter Bright (3/5) Mar 03 2013 Sure. But you should be cognizant of exactly what it is you're comparing...
- Walter Bright (3/7) Mar 01 2013 Python's splitter, which you are measuring, isn't written in Python. It ...
- John Colvin (8/19) Mar 02 2013 This.
- Jacob Carlborg (6/9) Mar 02 2013 Does that really matter. He's using Python, if the function is part of
- John Colvin (5/14) Mar 02 2013 It does.
- Jacob Carlborg (5/9) Mar 03 2013 Then D needs to get faster, or we need to switch to C for some std lib
- John Colvin (4/14) Mar 03 2013 I agree that anything that makes us faster is good, but I
- Jacob Carlborg (4/6) Mar 03 2013 No, but if we need to drop down to C to get fast enough.
- Walter Bright (2/6) Mar 03 2013 I can't think of a reason to need to do that.
- Jonathan M Davis (13/21) Mar 04 2013 Given how D works, there is something _very_ wrong if we have to drop to...
- Jacob Carlborg (4/15) Mar 04 2013 That's what I'm saying.
- Walter Bright (5/13) Mar 03 2013 Nothing in this thread suggests that D needs to switch its library
- deadalnix (3/22) Mar 03 2013 Maybe it is time to look at the python implementation and see why
- jerro (12/14) Mar 03 2013 It isn't faster:
- bearophile (5/6) Mar 03 2013 Are Python3 strings still like wstrings/dstrings, or have they
- Russel Winder (13/15) Mar 04 2013 Python3 is Unicode by default, with UTF-8 the default encoding.
- Ellery Newcomer (2/8) Mar 04 2013 Looks like 3.3 will store them as utf8, but not 3.2 or below.
- deadalnix (3/17) Mar 03 2013 Using noboundcheck isn't fair as you give up on safety, so you
- Don (4/26) Mar 04 2013 But that does suggest that the main problem is that DMD has very
- Steven Schveighoffer (8/30) Mar 04 2013 In this type of test, there is no danger with using noboundscheck. It's...
- Andrei Alexandrescu (3/24) Mar 03 2013 Was the conclusion that it's faster?
- deadalnix (4/36) Mar 03 2013 It seems that I was wrong, and missed the point that both
- Jacob Carlborg (4/9) Mar 03 2013 --
- jerro (6/9) Mar 02 2013 You can look at it that way, but still, the fact that most of the
- Andrei Alexandrescu (4/11) Mar 02 2013 The point here is that often one needs to write code that's more than
- jerro (13/28) Mar 01 2013 I'm guessing you are building without optimization options. When
- Robert (9/9) Mar 01 2013 Hm, just recently a friend of mine and I hacked together at the FB
Tried to port my SIP parser from Python to D to boost performance but got the opposite result. I created a simple test script which splits a SIP REGISTER message 10 million times. The Python version takes about 14 seconds to execute; the D version takes about 23 seconds, which is 1.6 times slower. I used DMD 2.062 and compiled my script with the options -release and -O. I used Python 3.3 64 bit. I ran both scripts on the same hardware with Windows 7 64.

Is this a general problem with string performance in D or just a splitter() issue? Has anybody compared the performance of D's string-manipulating functions with other languages like Python, Perl, Java and C++?

Here is the Python version of the test script:

import time

message = "REGISTER sip:comm.example.com SIP/2.0\r\n\
Content-Length: 0\r\n\
Contact: <sip:12345 10.1.3.114:59788;transport=tls>;expires=4294967295;events=\"message-summary\";q=0.9\r\n\
To: <sip:12345 comm.example.com>\r\n\
User-Agent: (\"VENDOR=MyCompany\" \"My User Agent\")\r\n\
Max-Forwards: 70\r\n\
CSeq: 1 REGISTER\r\n\
Via: SIP/2.0/TLS 10.1.3.114:59788;branch=z9hG4bK2910497772630690\r\n\
Call-ID: 2910497622026445\r\n\
From: <sip:12345 comm.example.com>;tag=2910497618150713\r\n\r\n"

t1 = time.perf_counter()
for i in range(10000000):
    for notused in message.split("\r\n"):
        pass
print(time.perf_counter() - t1)

Here is the D version:

import std.stdio, std.algorithm, std.datetime;

void main()
{
    auto message = "REGISTER sip:example.com SIP/2.0\r\n~
Content-Length: 0\r\n~
Contact: <sip:12345 10.1.3.114:59788;transport=tls>;expires=4294967295;events=\"message-summary\";q=0.9\r\n~
To: <sip:12345 comm.example.com>\r\n~
User-Agent: (\"VENDOR=MyCompany\" \"My User Agent\")\r\n~
Max-Forwards: 70\r\n~
CSeq: 1 REGISTER\r\n~
Via: SIP/2.0/TLS 10.1.3.114:59788;branch=z9hG4bK2910497772630690\r\n~
Call-ID: 2910497622026445\r\n~
From: <sip:12345 comm.example.com>;tag=2910497618150713\r\n\r\n";

    auto t1 = Clock.currTime();
    foreach (i; 0 .. 10000000)
    {
        foreach (notused; splitter(message, "\r\n"))
        {
        }
    }
    writeln(Clock.currTime() - t1);
}
Mar 01 2013
On 3/1/13 3:30 PM, cvk012c wrote:
> Tried to port my SIP parser from Python to D to boost performance but
> got opposite result. I created a simple test script which splits SIP
> REGISTER message 10 million times. Python version takes about 14
> seconds to execute, D version is about 23 seconds which is 1.6 times
> slower. I used DMD 2.062 and compiled my script with options -release
> and -O. I used Python 3.3 64 bit. I ran both scripts on the same
> hardware with Windows 7 64.

Add -inline to the options.

Andrei
Mar 01 2013
On Friday, 1 March 2013 at 20:50:15 UTC, Andrei Alexandrescu wrote:
> On 3/1/13 3:30 PM, cvk012c wrote:
>> Tried to port my SIP parser from Python to D to boost performance but
>> got opposite result. [...] I ran both scripts on the same hardware
>> with Windows 7 64.
>
> Add -inline to the options.
>
> Andrei

--noboundscheck can also help if you don't mind missing the safety net.

$ rdmd -O -release sip
22 secs, 977 ms, 299 μs, and 8 hnsecs
$ rdmd -O -release -inline sip
12 secs, 245 ms, 567 μs, and 9 hnsecs
$ rdmd -O -release -inline -noboundscheck sip
10 secs, 171 ms, 209 μs, and 9 hnsecs
Mar 01 2013
On 3/1/13 3:58 PM, simendsjo wrote:
> On Friday, 1 March 2013 at 20:50:15 UTC, Andrei Alexandrescu wrote:
>> On 3/1/13 3:30 PM, cvk012c wrote:
>>> Tried to port my SIP parser from Python to D to boost performance
>>> but got opposite result. [...]
>>
>> Add -inline to the options.
>>
>> Andrei
>
> --noboundscheck can also help if you don't mind missing the safety net.
>
> $ rdmd -O -release sip
> 22 secs, 977 ms, 299 μs, and 8 hnsecs
> $ rdmd -O -release -inline sip
> 12 secs, 245 ms, 567 μs, and 9 hnsecs
> $ rdmd -O -release -inline -noboundscheck sip
> 10 secs, 171 ms, 209 μs, and 9 hnsecs

Also, the D version has a different string to parse (~ is not a line continuation character). The fixed version:

auto message = "REGISTER sip:example.com SIP/2.0\r
Content-Length: 0\r
Contact: <sip:12345 10.1.3.114:59788;transport=tls>;expires=4294967295;events=\"message-summary\";q=0.9\r
To: <sip:12345 comm.example.com>\r
User-Agent: (\"VENDOR=MyCompany\" \"My User Agent\")\r
Max-Forwards: 70\r
CSeq: 1 REGISTER\r
Via: SIP/2.0/TLS 10.1.3.114:59788;branch=z9hG4bK2910497772630690\r
Call-ID: 2910497622026445\r
From: <sip:12345 comm.example.com>;tag=2910497618150713\r\n\r\n";

That shaves one extra second, bringing it down to 9.

Andrei
Mar 01 2013
On Friday, 1 March 2013 at 20:58:09 UTC, simendsjo wrote:
> On Friday, 1 March 2013 at 20:50:15 UTC, Andrei Alexandrescu wrote:
>> [...]
>> Add -inline to the options.
>
> --noboundscheck can also help if you don't mind missing the safety net.
>
> $ rdmd -O -release sip
> 22 secs, 977 ms, 299 μs, and 8 hnsecs
> $ rdmd -O -release -inline sip
> 12 secs, 245 ms, 567 μs, and 9 hnsecs
> $ rdmd -O -release -inline -noboundscheck sip
> 10 secs, 171 ms, 209 μs, and 9 hnsecs

On my hardware with the -inline option it now takes about 15 secs, which is still slower than Python, but with both -inline and -noboundscheck it takes 13 secs and finally beats Python. But I'm still kind of disappointed, because I expected a much better performance boost and got only 7%. Considering that Python is not the fastest scripting language, I think that similar Perl and Java scripts will outperform D easily. Thanks Andrei and simendsjo for the quick response though.
Mar 01 2013
cvk012c:
> I think that similar Perl and Java scripts will outperform D easily.
> Thanks Andrei and simendsjo for a quick response though.

Why don't you write a Java version? It takes only a few minutes, and you will have one more data point.

Python string functions are written in C, compiled very efficiently (the standard Python binaries on Windows are compiled with the Microsoft compiler, but Intel-compiled builds can also be found), and they have been well optimized over several years of work by people like Hettinger :-) You will see Python2 code like this easily beat D for normal text files:

for line in file("foo.txt"):
    ...

Both D and performance in general aren't magical things. Performance comes from a long process of optimizing the algorithms, the code, and the compilers/virtual machines that run them.

Bye,
bearophile
Mar 01 2013
On Friday, 1 March 2013 at 21:52:13 UTC, bearophile wrote:
> cvk012c:
>> I think that similar Perl and Java scripts will outperform D easily.
>> Thanks Andrei and simendsjo for a quick response though.
>
> Why don't you write a Java version? It takes only few minutes, and
> you will have one more data point.

You are right. Why not. But instead of using Java's split() method I used a combination of the indexOf() and substring() methods to do the same job. The reason: Java's split method is implemented as a regular expression, which would be unfair to compare to D's splitter. Again, I created a similar D version of the script, compiled it with all the suggested options: -release -O -inline -noboundscheck, and this time the D version is more than twice as slow as Java: 8.4 seconds vs 4. D experts, please take a look at my code and tell me what is wrong with it.

Java version:

public class Test {
    public static void main(String[] args) {
        String message = "REGISTER sip:comm.example.com SIP/2.0\r\nContent-Length: 0\r\nContact: <sip:12345 10.1.3.114:59788;transport=tls>;expires=4294967295;events=\"message-summary\";q=0.9\r\nTo: <sip:12345 comm.example.com>\r\nUser-Agent: (\"VENDOR=MyCompany\" \"My User Agent\")\r\nMax-Forwards: 70\r\nCSeq: 1 REGISTER\r\nVia: SIP/2.0/TLS 10.1.3.114:59788;branch=z9hG4bK2910497772630690\r\nCall-ID: 2910497622026445\r\nFrom: <sip:12345 comm.example.com>;tag=2910497618150713\r\n\r\n";
        long t1 = System.currentTimeMillis();
        for (int i = 0; i < 10000000; i++) {
            int index1 = 0;
            while (true) {
                int index2 = message.indexOf("\r\n", index1);
                if (index2 == -1) break;
                String notused = message.substring(index1, index2);
                index1 = index2 + 2;
            }
        }
        System.out.println(System.currentTimeMillis() - t1);
    }
}

D version:

import std.stdio, std.string, std.datetime;

void main() {
    auto message = "REGISTER sip:example.com SIP/2.0\r\nContent-Length: 0\r\nContact: <sip:12345 10.1.3.114:59788;transport=tls>;expires=4294967295;events=\"message-summary\";q=0.9\r\nTo: <sip:12345 comm.example.com>\r\nUser-Agent: (\"VENDOR=MyCompany\" \"My User Agent\")\r\nMax-Forwards: 70\r\nCSeq: 1 REGISTER\r\nVia: SIP/2.0/TLS 10.1.3.114:59788;branch=z9hG4bK2910497772630690\r\nCall-ID: 2910497622026445\r\nFrom: <sip:12345 comm.example.com>;tag=2910497618150713\r\n\r\n";
    auto t1 = Clock.currTime();
    for (int i = 0; i < 10000000; i++) {
        auto substring = message;
        while (true) {
            auto index = indexOf(substring, "\r\n");
            if (index == -1) break;
            auto notused = substring[0 .. index];
            substring = substring[index + 2 .. $];
        }
    }
    writeln(Clock.currTime() - t1);
}
Mar 01 2013
On Fri, 01 Mar 2013 19:19:35 -0500, cvk012c <cvk012c motorolasolutions.com> wrote:
> You are right. Why not. But instead of using Java split() method I
> used combination of indexOf() and substring() methods to do the same
> job. [...] Again, I created a similar D version of the script,
> compiled it with all suggested options: -release -O -inline
> -noboundscheck and this time D version is more then twice slower than
> Java: 8.4 seconds vs 4. D experts, please, take a look at my code and
> tell me what is wrong with it.

Try my hand-written version (elsewhere in thread). I think it can be done better too (use pointers instead of arrays).

The issue is a combination of the facts that:

1. splitter is designed for any range, not just strings. Not an excuse
   really, but a string-specific version could be written that does
   better (clearly).
2. dmd is not always the best optimizer. I saw one other person who
   said using a different D compiler resulted in a quicker time.
3. Any time you are looping 10 million times, small insignificant
   differences will be magnified.

Note one other thing -- be VERY wary of test cases that are fully visible at compile time. Very smart compilers are known to reduce your code to something that isn't realistic. I remember seeing a similar comparison not too long ago where one wondered why g++ was so much faster (0.09 seconds or something) than D (4 or more seconds). It turned out the g++ compiler had optimized out his ENTIRE program :). You may want to read in the "message" string at runtime to avoid such issues.

-Steve
Mar 01 2013
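Steve's warning above (a benchmark fully visible at compile time can be folded away) is easy to act on. Below is a minimal sketch, in Python since the thread's original benchmark is Python, of loading the message at run time instead of hard-coding it; the default file name is a hypothetical stand-in:

```python
import time


def load_message(path="message.txt"):
    """Read the SIP message at run time (from a file named on disk)
    so the benchmark input is not a compile-time constant that a
    smart compiler or JIT could specialize on. "message.txt" is a
    hypothetical input file holding the raw SIP request."""
    with open(path, "rb") as f:
        return f.read().decode("ascii")


def benchmark_split(message, iterations):
    """Split `message` on CRLF `iterations` times; return elapsed seconds."""
    t1 = time.perf_counter()
    for _ in range(iterations):
        for notused in message.split("\r\n"):
            pass
    return time.perf_counter() - t1
```

The same idea applies to the D version: reading `message` via `std.file.readText` at startup keeps the optimizer honest.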
On Saturday, 2 March 2013 at 00:47:02 UTC, Steven Schveighoffer wrote:
> The issue is a combination of the facts that:
>
> 1. splitter is designed for any range, not just strings. Not an
>    excuse really, but a string-specific version could be written
>    that does better (clearly).
> [...]

In my latest version of the D script I didn't use splitter at all; I used the string-specific indexOf function. Still, the result is very bad. For text-based protocols such as SIP, the performance of string-manipulating functions is very important. Unfortunately, it looks like that is not D's strongest point at this time.
Mar 01 2013
On Saturday, 2 March 2013 at 01:05:35 UTC, cvk012c wrote:
> In my latest version of D script I didn't use splitter at all. I used
> string specific indexOf function. Still result is very bad. For text
> based protocols, such as SIP, performance of string manipulating
> functions is very important. Unfortunately, looks like it is not D
> strongest point at this time.

My version is functionally equivalent, and measures 55 - 94 ms (depending on compiler; LDC is best). Your version performed in about 7s on my machine. A similar optimization taking advantage of immutability can be done on your Python translation, which results in measurements of <1 ms.

import std.stdio, std.string, std.datetime;

void main() {
    auto message = "REGISTER sip:example.com SIP/2.0\r\nContent-Length: 0\r\nContact: <sip:12345 10.1.3.114:59788;transport=tls>;expires=4294967295;events=\"message-summary\";q=0.9\r\nTo: <sip:12345 comm.example.com>\r\nUser-Agent: (\"VENDOR=MyCompany\" \"My User Agent\")\r\nMax-Forwards: 70\r\nCSeq: 1 REGISTER\r\nVia: SIP/2.0/TLS 10.1.3.114:59788;branch=z9hG4bK2910497772630690\r\nCall-ID: 2910497622026445\r\nFrom: <sip:12345 comm.example.com>;tag=2910497618150713\r\n\r\n";
    auto t1 = Clock.currTime();
    for (int i = 0; i < 10_000_000; i++) {
        while (true) {
            auto index = indexOf(message, "\r\n");
            if (index == -1) break;
            auto notused = message[0 .. index];
            message = message[index + 2 .. $];
        }
    }
    writeln(Clock.currTime() - t1);
}
Mar 01 2013
On 3/1/2013 5:19 PM, anon123 wrote:My version is functionally equivalent,I don't think so. Take a look at what happens to the message variable. It is never restored to its original string on the next iteration of the loop.
Mar 01 2013
On Fri, 01 Mar 2013 20:19:19 -0500, anon123 <z z.z> wrote:My version is functionally equivalent, and measures 55 - 94 ms (depending on compiler; LDC is best). Your version performed in about 7s on my machine.Try printing out message each time through the 10,000,000 iterations -Steve
Mar 01 2013
On Fri, 01 Mar 2013 20:05:34 -0500, cvk012c <cvk012c motorolasolutions.com> wrote:In my latest version of D script I didn't use splitter at all. I used string specific indexOf function. Still result is very bad. For text based protocols, such as SIP, performance of string manipulating functions is very important. Unfortunately, looks like it is not D strongest point at this time.indexOf uses the same mechanism as splitter to find the separators. If it doesn't improve anything, I'd say that is where the problem lies (std.algorithm.find). -Steve
Mar 01 2013
On 3/1/13 8:29 PM, Steven Schveighoffer wrote:On Fri, 01 Mar 2013 20:05:34 -0500, cvk012c <cvk012c motorolasolutions.com> wrote:find and indexOf on arrays are on par with handwritten loops. AndreiIn my latest version of D script I didn't use splitter at all. I used string specific indexOf function. Still result is very bad. For text based protocols, such as SIP, performance of string manipulating functions is very important. Unfortunately, looks like it is not D strongest point at this time.indexOf uses the same mechanism as splitter to find the separators. If it doesn't improve anything, I'd say that is where the problem lies (std.algorithm.find).
Mar 02 2013
On Sat, 02 Mar 2013 12:17:54 -0500, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:On 3/1/13 8:29 PM, Steven Schveighoffer wrote:This is not a personal attack, please don't take it that way. My anecdotal tests with hand writing a custom splitter range to handle the OP's program gave me a 40% savings. Whether it's find or not, I'm not sure, but there definitely is room for improvement. Using indexOf instead of splitter seems to result in the same time usage, so something that is common between the two would be a logical choice to target. It would be interesting to examine the assembly differences, but I haven't got the time right now. -SteveOn Fri, 01 Mar 2013 20:05:34 -0500, cvk012c <cvk012c motorolasolutions.com> wrote:find and indexOf on arrays are on par with handwritten loops.In my latest version of D script I didn't use splitter at all. I used string specific indexOf function. Still result is very bad. For text based protocols, such as SIP, performance of string manipulating functions is very important. Unfortunately, looks like it is not D strongest point at this time.indexOf uses the same mechanism as splitter to find the separators. If it doesn't improve anything, I'd say that is where the problem lies (std.algorithm.find).
Mar 02 2013
On 3/2/13 2:32 PM, Steven Schveighoffer wrote:This is not a personal attack, please don't take it that way.Not at all. I'm just confused about this exchange. You mention some correct facts ("MySplitter finished in 6 seconds instead of 10"), then some mistaken suppositions ("a string-specific version could be written"), and derive conclusions from both ("Maybe there is a bad constraint somewhere"). Hard to deal with the mix.My anecdotal tests with hand writing a custom splitter range to handle the OP's program gave me a 40% savings. Whether it's find or not, I'm not sure, but there definitely is room for improvement.I think I understand where the confusion comes from. If you're referring to MySplitter, that's not comparable. It uses this at the core: for(; i + 1 < source.length; i++) { if(source[i] == '\r' && source[i + 1] == '\n') { found = true; break; } ... } This is not close to the code that would work with arrays. We could specialize things for small arrays, but that hasn't been done yet. My point is it's not comparable with the classic brute force subsequence search. What is comparable is shown below. indexOf2 defines the straight array-in-array brute force search. With dmd on my machine it runs in some 7 seconds. So does indexOf (this was my point). Then indexOf3 uses a simple trick that the Java implementation does: cache the first character. That reduces the running time to 5 seconds. Then indexOf4 optimizes the loop searching that first character, bringing run time to about 4 seconds. On that machine the Java implementation runs in 3.5 seconds. Andrei import std.stdio, std.string, std.datetime; long indexOf2(string haystack, string needle) { if (!needle.length) return 0; if (needle.length > haystack.length) return -1; auto limit = haystack.length - needle.length; bigloop: for (size_t i; i <= limit; ++i) { foreach (q; 0 .. 
needle.length) { if (haystack[i + q] != needle[q]) continue bigloop; } return i; } return -1; } long indexOf3(string haystack, string needle) { if (!needle.length) return 0; if (needle.length > haystack.length) return -1; immutable c = needle[0]; auto limit = haystack.length - needle.length; assert(limit < uint.max, "not handled"); bigloop: for (uint i; i <= limit; ++i) { if (haystack.ptr[i] != c) continue; foreach (q; 1 .. needle.length) { if (haystack[i + q] != needle[q]) continue bigloop; } return i; } return -1; } long indexOf4(string haystack, string needle) { if (!needle.length) return 0; if (needle.length > haystack.length) return -1; immutable c = needle[0]; auto limit = haystack.length - needle.length; size_t i = 0; bigloop: for (;;) { while (i <= limit && haystack[i++] != c) {} if (i-- > limit) break; foreach (q; 1 .. needle.length) { if (haystack[i + q] != needle[q]) continue bigloop; } return i; } return -1; } unittest { assert(indexOf2("hello, world", ",") == 5); assert(indexOf2("hello, world", "a") == -1); assert(indexOf3("hello, world", ",") == 5); assert(indexOf3("hello, world", "a") == -1); assert(indexOf4("hello, world", ",") == 5); assert(indexOf4("hello, world", "a") == -1); } void main(string[] args) { auto message = "REGISTER sip:example.com SIP/2.0\r\nContent-Length: 0\r\nContact: <sip:12345 10.1.3.114:59788;transport=tls>;expires=4294967295;events=\"message-su mary\";q=0.9\r\nTo: <sip:12345 comm.example.com>\r\nUser-Agent: (\"VENDOR=MyCompany\" \"My User Agent\")\r\nMax-Forwards: 70\r\nCSeq: 1 REGISTER\r\nVia: SIP/2.0/TLS 10.1.3.114:59788;branch=z9hG4bK2910497772630690\r\nCall-ID: 2910497622026445\r\nFrom: <sip:12345 comm.example.com>;tag=2910497618150713\r\n\r\n"; auto t1 = Clock.currTime(); for (int i=0; i<10000000; i++) { long index1 = 0; while (true) { auto index2 = indexOf(message[index1 .. $], "\r\n"); if (index2 == -1) break; auto notused = message[index1 .. 
index1 + index2]; if (args.length == 42) writeln(notused); index1 += index2 + 2; } } writeln(Clock.currTime()-t1); }
Mar 02 2013
On Sat, 02 Mar 2013 17:29:38 -0500, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:On 3/2/13 2:32 PM, Steven Schveighoffer wrote:I didn't realize the situation. I assumed that there wasn't a version of splitter for strings that took array-specific shortcuts. My corrected statement should read "the existing string-specific version could be improved." My conclusion of "Maybe there is a bad constraint somewhere" was not derived from that, it was based on your statement elsewhere that "I wrote a custom splitter specialized for arrays. It's as fast." Given that my tests have shown I can write a faster one quite easily than the implementation selected by phobos, and that you said you wrote one that's "as fast," I took that to mean you had written one that was more optimized than the chosen splitter version (and logically assumed you had included this version in Phobos with the intent it would be chosen when compiling with strings), leading me to suggest possibly that due to some bug, the "fast" implementation wasn't being chosen. I didn't realize that "as fast" didn't mean "as fast as yours". I actually don't know what you meant by that now.This is not a personal attack, please don't take it that way.Not at all. I'm just confused about this exchange. You mention some correct facts ("MySplitter finished in 6 seconds instead of 10"), then some mistaken suppositions ("a string-specific version could be written"), and derive conclusions from both ("Maybe there is a bad constraint somewhere"). 
Hard to deal with the mix.Very good point, here is a new version that takes any string as a separator: struct MySplitter { private string s; private string separator; private string source; this(string src, string sep) { source = src; separator = sep; popFront(); } property string front() { return s; } property bool empty() { return s.ptr == null; } void popFront() { if(!source.length) { s = source; source = null; } else { size_t i = 0; bool found = false; for(; i + separator.length <= source.length; i++) { if(source[i] == separator[0]) { found = true; for(size_t j = i+1, k=1; k < separator.length; ++j, ++k) if(source[j] != separator[k]) { found = false; break; } if(found) break; } } s = source[0..i]; if(found) source = source[i + separator.length..$]; else source = source[$..$]; } } } Takes 7 seconds on my machine instead of 6, but not 10 like std.algorithm.splitter. I don't even like the loop that well, it looks crude, I can probably optimize it further. And it does not use any of your specified tricks. Is this sufficiently comparable, or am I missing something else? -SteveMy anecdotal tests with hand writing a custom splitter range to handle the OP's program gave me a 40% savings. Whether it's find or not, I'm not sure, but there definitely is room for improvement.I think I understand where the confusion comes from. If you're referring to MySplitter, that's not comparable. It uses this at the core: for(; i + 1 < source.length; i++) { if(source[i] == '\r' && source[i + 1] == '\n') { found = true; break; } ... } This is not close to the code that would work with arrays. We could specialize things for small arrays, but that hasn't been done yet. My point is it's not comparable with the classic brute force subsequence search.
Mar 04 2013
On 3/4/13 10:39 AM, Steven Schveighoffer wrote:struct MySplitter { private string s; private string separator; private string source; this(string src, string sep) { source = src; separator = sep; popFront(); } property string front() { return s; } property bool empty() { return s.ptr == null; } void popFront() { if(!source.length) { s = source; source = null; } else { size_t i = 0; bool found = false; for(; i + separator.length <= source.length; i++) { if(source[i] == separator[0]) { found = true; for(size_t j = i+1, k=1; k < separator.length; ++j, ++k) if(source[j] != separator[k]) { found = false; break; } if(found) break; } } s = source[0..i]; if(found) source = source[i + separator.length..$]; else source = source[$..$]; } } } Takes 7 seconds on my machine instead of 6, but not 10 like std.algorithm.splitter. I don't even like the loop that well, it looks crude, I can probably optimize it further. And it does not use any of your specified tricks. Is this sufficiently comparable, or am I missing something else? -SteveThat's comparable, please file an enh request. Thanks, Andrei
Mar 04 2013
On Mon, 04 Mar 2013 12:35:23 -0500, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:That's comparable, please file an enh request.http://d.puremagic.com/issues/show_bug.cgi?id=9646
Mar 04 2013
On Monday, 4 March 2013 at 19:11:17 UTC, Steven Schveighoffer wrote:On Mon, 04 Mar 2013 12:35:23 -0500, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:Below is an implementation, which matches MySplitter with dmd, but not with ldc. More precisely: dmd -O -release -inline -noboundscheck: std.algorithm.splitter took 5 secs, 170 ms, and 656 μs MySplitter took 4 secs, 835 ms, 748 μs, and 1 hnsec my_splitter took 4 secs, 862 ms, 914 μs, and 4 hnsecs ldc2 -O -release: std.algorithm.splitter took 3 secs, 540 ms, and 84 μs MySplitter took 2 secs, 288 ms, and 535 μs my_splitter took 3 secs, 463 ms, and 897 μs CPU: Intel i7 M 620 2.67GHz dmd: v2.070.2 ldc: 0.17.1 (based on DMD v2.068.2 and LLVM 3.7.1) auto my_splitter(alias pred = "a == b", Range, Separator)(Range r, Separator s) if (is(typeof(binaryFun!pred(r.front, s.front)) : bool) && (hasSlicing!Range || isNarrowString!Range) && isForwardRange!Separator && (hasLength!Separator || isNarrowString!Separator)) { import std.algorithm.searching : find; import std.conv : unsigned; static struct Result { private: Range _input; Separator _separator; size_t _next_index; bool _empty; property auto separatorLength() { return _separator.length; } void findIndex() { auto found = find!pred(_input, _separator); _next_index = _input.length - found.length; } public: this(Range input, Separator separator) { _input = input; _separator = separator; _empty = false; findIndex(); } property Range front() { assert(!empty); return _input[0 .. _next_index]; } static if (isInfinite!Range) { enum bool empty = false; // Propagate infiniteness } else { property bool empty() { return _empty; } } void popFront() { assert(!empty); if (_input.empty) { _empty = true; assert(_next_index == 0); } else if (_input.length == _next_index) { _input = _input[$ .. $]; _empty = true; } else { _input = _input[_next_index + separatorLength .. 
$]; findIndex(); } } static if (isForwardRange!Range) { property typeof(this) save() { auto ret = this; ret._input = _input.save; return ret; } } } return Result(r, s); }That's comparable, please file an enh request.http://d.puremagic.com/issues/show_bug.cgi?id=9646
May 22 2016
On Sunday, 22 May 2016 at 18:56:30 UTC, qznc wrote:On Monday, 4 March 2013 at 19:11:17 UTC, Steven Schveighoffer wrote:have you thought about opening a PR to improve `splitter`?[...]Below is an implementation, which matches MySplitter with dmd, but not with ldc. More precisely: [...]
May 23 2016
On Monday, 23 May 2016 at 11:52:35 UTC, Seb wrote:On Sunday, 22 May 2016 at 18:56:30 UTC, qznc wrote:Yes, but I'm not sure about the goals. I also want to dig a little deeper. Being "sometimes faster" without some understanding of why and when feels unsatisfying. Additionally, there is this weird special case for a bidirectional range, which just adds unnecessary overhead. Is "remove dead code" a good enough reason in itself for a PR?On Monday, 4 March 2013 at 19:11:17 UTC, Steven Schveighoffer wrote:have you thought about opening a PR to improve `splitter`?[...]Below is an implementation, which matches MySplitter with dmd, but not with ldc. More precisely: [...]
May 23 2016
On Monday, 23 May 2016 at 12:01:52 UTC, qznc wrote:On Monday, 23 May 2016 at 11:52:35 UTC, Seb wrote:Good luck. If you need help, it's probably best to open the PR, as people can directly see your code and the Phobos repo is usually pretty active.On Sunday, 22 May 2016 at 18:56:30 UTC, qznc wrote:Yes, but I'm not sure about the goals. I also want to dig a little deeper. Being "sometimes faster" without some understand why and when feels unsatisfying.On Monday, 4 March 2013 at 19:11:17 UTC, Steven Schveighoffer wrote:have you thought about opening a PR to improve `splitter`?[...]Below is an implementation, which matches MySplitter with dmd, but not with ldc. More precisely: [...]Additionally, there is this weird special case for a bidirectional range, which just adds unnecessary overhead.Is "remove dead code" a good enough reason in itself for a PR?Absolutely ;-)
May 23 2016
On Monday, 23 May 2016 at 12:01:52 UTC, qznc wrote:Additionally, there is this weird special case for a bidirectional range, which just adds unnecessary overhead. Is "remove dead code" a good enough reason in itself for a PR?Forget the "dead code" comment; it is actually a missing feature. In the case where the separator is a range and the input range is bidirectional, the splitter result should be bidirectional as well. It is not, because the implementation of back() and popBack() is missing, although some bookkeeping code for it exists.
May 23 2016
On 05/23/2016 08:17 AM, qznc wrote:On Monday, 23 May 2016 at 12:01:52 UTC, qznc wrote:Most uses are forward, so perhaps it's best to eliminate the bidirectional bookkeeping if it adds overhead, and confine those to a separate function e.g. splitterBidirectional. There is precedent for that in Phobos. -- AndreiAdditionally, there is this weird special case for a bidirectional range, which just adds unnecessary overhead. Is "remove dead code" a good enough reason in itself for a PR?Forget the "dead code comment" it is a actually a missing feature. In the case the separator is a range and the input range is bidirectional, the splitter result should be bidirectional as well. It is not, because the implementation of back() and popBack() is missing, although some bookkeeping code for it exists.
May 23 2016
On Monday, 23 May 2016 at 13:20:55 UTC, Andrei Alexandrescu wrote:Most uses are forward, so perhaps it's best to eliminate the bidirectional bookkeeping if it adds overhead, and confine those to a separate function e.g. splitterBidirectional. There is precedent for that in Phobos. -- AndreiI don't see the value add that things like filterBidirectional give when the back and popBack functions can be static if-ed in or out of the normal filter. filterBidirectional always felt like a relic of a time when static if's didn't exist, even though I know that's not the case.
May 23 2016
On 5/23/16 9:44 AM, Jack Stouffer wrote:On Monday, 23 May 2016 at 13:20:55 UTC, Andrei Alexandrescu wrote:What would be the condition tested in the static if? Recall that sometimes you do want a forward range even when you could define a bidirectional range. -- AndreiMost uses are forward, so perhaps it's best to eliminate the bidirectional bookkeeping if it adds overhead, and confine those to a separate function e.g. splitterBidirectional. There is precedent for that in Phobos. -- AndreiI don't see the value add that things like filterBidirectional give when the back and popBack functions can be static if-ed in or out the normal filter.
May 23 2016
On Monday, 23 May 2016 at 14:17:59 UTC, Andrei Alexandrescu wrote:What would be the condition tested in the static if?I would assume `static if (isBidirectionalRange!Range)`Recall that sometimes you do want a forward range even when you could define a bidirectional range. -- AndreiI honestly can't think of a time when would this be the case. What are you losing by having back and popBack defined?
May 23 2016
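To make the trade-off concrete, here is a minimal sketch of the static if gating under discussion. This is a hypothetical pass-through range written for illustration (the names `Identity`/`identity` are invented, and it is not Phobos code): back/popBack exist only when the wrapped range is itself bidirectional, and cost nothing otherwise.

```d
import std.range.primitives;

// Hypothetical illustration only: a pass-through range whose
// bidirectional members are compiled in or out with static if.
struct Identity(R)
    if (isInputRange!R)
{
    R r;
    @property auto front() { return r.front; }
    @property bool empty() { return r.empty; }
    void popFront() { r.popFront(); }
    static if (isForwardRange!R)
        @property auto save() { return Identity(r.save); }
    static if (isBidirectionalRange!R)
    {
        // These members only exist when R can support them.
        @property auto back() { return r.back; }
        void popBack() { r.popBack(); }
    }
}

auto identity(R)(R r) { return Identity!R(r); }
```

Wrapping an array yields a bidirectional result, while wrapping an input-only range still compiles, just without back/popBack. Note that this sketch only gates the member functions; Andrei's caveat elsewhere in the thread is that a real splitter may also do extra bookkeeping work inside popFront, which a static if on the members alone would not remove.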
On Monday, 23 May 2016 at 14:25:52 UTC, Jack Stouffer wrote:On Monday, 23 May 2016 at 14:17:59 UTC, Andrei Alexandrescu wrote:Here is a commit, which removes the dead code: https://github.com/qznc/phobos/commit/0cb1e70f4504a52b81c30342be7ec26d597fac69 It would be strictly better to implement the back and popBack, no? My single test case says removing the code does not result in faster code. Maybe dmd optimizes the bookkeeping away. I see three options: 1. Remove dead bookkeeping code 2. Implement back() and popBack() 3. Use alternative splitter implementation (and implement back() and popBack()) The third one would be the best, if it is really faster.What would be the condition tested in the static if?I would assume `static if (isBidirectionalRange!Range)`Recall that sometimes you do want a forward range even when you could define a bidirectional range. -- AndreiI honestly can't think of a time when would this be the case. What are you losing by having back and popBack defined?
May 23 2016
On 05/23/2016 10:47 AM, qznc wrote:It would be strictly better to implement the back and popBack, no?If it adds no overhead, you're right. -- Andrei
May 23 2016
On Monday, 23 May 2016 at 14:47:22 UTC, qznc wrote:I see three options: 1. Remove dead bookkeeping code 2. Implement back() and popBack() 3. Use alternative splitter implementation (and implement back() and popBack()) The third one would be the best, if it is really faster.If the performance is really a problem (I think it's a micro optimization at best), then the best option is not to violate DRY. Have a template parameter, std.typecons.Flag, to turn off the back and popBack. Don't have two functions that are 95% the same code like filter and filterBidirectional.
May 23 2016
On 05/23/2016 03:11 PM, Jack Stouffer wrote:(I think it's a micro optimization at best)splitter in the stdlib is likely to be very frequently useful, any bit of speedup we put in it is likely to pay off immensely. -- Andrei
May 23 2016
On 05/23/2016 10:25 AM, Jack Stouffer wrote:On Monday, 23 May 2016 at 14:17:59 UTC, Andrei Alexandrescu wrote:You might be doing extra work even if the user never calls back and popBack. -- AndreiWhat would be the condition tested in the static if?I would assume `static if (isBidirectionalRange!Range)`Recall that sometimes you do want a forward range even when you could define a bidirectional range. -- AndreiI honestly can't think of a time when would this be the case. What are you losing by having back and popBack defined?
May 23 2016
On Sunday, 22 May 2016 at 18:56:30 UTC, qznc wrote:On Monday, 4 March 2013 at 19:11:17 UTC, Steven Schveighoffer wrote:I think the actual slow thing is std.find, which splitter uses. Benchmark: https://dpaste.dzfl.pl/1e31b6f95659 The output is: std find took 162167000 manual find took 99852000 Actually, std find should be faster, since it could use the Boyer Moore algorithm instead of naive string matching.On Mon, 04 Mar 2013 12:35:23 -0500, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:That's comparable, please file an enh request.http://d.puremagic.com/issues/show_bug.cgi?id=9646
May 23 2016
On 05/23/2016 03:11 PM, qznc wrote:Actually, std find should be faster, since it could use the Boyer Moore algorithm instead of naive string matching.Conventional wisdom has it that find() is brute force and that's that, but probably it's time to destroy. Selectively using advanced searching algorithms for the appropriate inputs is very DbI-ish. There are a few nice precedents of blend algorithms, see e.g. http://effbot.org/zone/stringlib.htm. Writing a generic subsequence search blend algorithm, one that chooses the right algorithm based on a combination of static and dynamic decisions, is quite a fun and challenging project. Who wanna? Andrei
May 23 2016
On Monday, 23 May 2016 at 22:19:18 UTC, Andrei Alexandrescu wrote:On 05/23/2016 03:11 PM, qznc wrote:I observed that Boyer-Moore from std is still slower. My guess is due to the relatively short strings in my benchmark and the need to allocate for the tables. Knuth-Morris-Pratt might be worthwhile, because it only requires a smaller table, which could be allocated on the stack. The overall sentiment seems to be that KMP vs BM depends on the input. This means an uninformed find function could only use heuristics. Anyway, Phobos could use a KnuthMorrisPrattFinder implementation next to the BoyerMooreFinder. I filed an issue [0]. Apart from advanced algorithms, find should not be slower than a naive nested-for-loop implementation. [0] https://issues.dlang.org/show_bug.cgi?id=16066Actually, std find should be faster, since it could use the Boyer Moore algorithm instead of naive string matching.Conventional wisdom has it that find() is brute force and that's that, but probably it's time to destroy. Selectively using advanced searching algorithms for the appropriate inputs is very DbI-ish. There are a few nice precedents of blend algorithms, see e.g. http://effbot.org/zone/stringlib.htm. Writing a generic subsequence search blend algorithm, one that chooses the right algorithm based on a combination of static and dynamic decisions, is quite a fun and challenging project. Who wanna?
May 24 2016
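As a reference point for that issue, here is a textbook Knuth-Morris-Pratt search sketched in D. It is an assumption about what such a finder could look like, not the Phobos implementation; the prefix table has exactly needle.length entries, and the `new` here could become a stack buffer for short needles.

```d
// Textbook Knuth-Morris-Pratt, sketched in D (not Phobos code).
ptrdiff_t kmpFind(string haystack, string needle)
{
    if (needle.length == 0) return 0;
    // fail[i] is the length of the longest proper border of needle[0 .. i+1].
    auto fail = new size_t[needle.length];
    size_t k = 0;
    foreach (i; 1 .. needle.length)
    {
        while (k > 0 && needle[i] != needle[k]) k = fail[k - 1];
        if (needle[i] == needle[k]) ++k;
        fail[i] = k;
    }
    // Scan the haystack, never moving backwards in it.
    k = 0;
    foreach (i; 0 .. haystack.length)
    {
        while (k > 0 && haystack[i] != needle[k]) k = fail[k - 1];
        if (haystack[i] == needle[k]) ++k;
        if (k == needle.length)
            return cast(ptrdiff_t)(i + 1 - needle.length);
    }
    return -1;
}
```

Unlike Boyer-Moore, the table build is O(needle.length) with no per-character skip heuristics, which is what makes it attractive for the short needles in these benchmarks.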
On Tuesday, 24 May 2016 at 07:54:38 UTC, qznc wrote:On Monday, 23 May 2016 at 22:19:18 UTC, Andrei Alexandrescu wrote:From Phobos [1]: /* Note: This step could be made a lot faster. * A simple implementation is shown here. */ this.skip = new size_t[needle.length]; foreach (a; 0 .. needle.length) { size_t value = 0; while (value < needle.length && !needlematch(needle, a, value)) { ++value; } this.skip[needle.length - a - 1] = value; } [1] https://github.com/dlang/phobos/blob/master/std/algorithm/searching.d#L335On 05/23/2016 03:11 PM, qznc wrote:I observed that Boyer-Moore from std is still slower. My guess is due to the relatively short strings in my benchmark and the need to allocate for the tables. Knuth-Morris-Pratt might be worthwhile, because only requires a smaller table, which could be allocated on the stack. The overall sentiment seems to be that KMP vs BM depends on the input. This means an uninformed find function could only use heuristics. Anyway, Phobos could use a KnuthMorrisPrattFinder implementation next to the BoyerMooreFinder. I filed an issue [0]. Apart from advanced algorithms, find should not be slower than a naive nested-for-loop implementation. [0] https://issues.dlang.org/show_bug.cgi?id=16066Actually, std find should be faster, since it could use the Boyer Moore algorithm instead of naive string matching.Conventional wisdom has it that find() is brute force and that's that, but probably it's time to destroy. Selectively using advanced searching algorithms for the appropriate inputs is very DbI-ish. There are a few nice precedents of blend algorithms, see e.g. http://effbot.org/zone/stringlib.htm. Writing a generic subsequence search blend algorithm, one that chooses the right algorithm based on a combination of static and dynamic decisions, is quite a fun and challenging project. Who wanna?
May 24 2016
On Tuesday, 24 May 2016 at 09:34:30 UTC, Chris wrote:On Tuesday, 24 May 2016 at 07:54:38 UTC, qznc wrote:Unfortunately, it is slower. My current measurements [0]: std find took 753837231 manual find took 129561980 std find took 721676721 manual find took 216418870 The nested-for-loop implementation is roughly 4 times faster! Caveat: I'm not completely sure my benchmark is fair yet. [0] https://github.com/qznc/find-is-too-slowApart from advanced algorithms, find should not be slower than a naive nested-for-loop implementation. [0] https://issues.dlang.org/show_bug.cgi?id=16066From Phobos [1]: /* Note: This step could be made a lot faster. * A simple implementation is shown here. */ this.skip = new size_t[needle.length]; foreach (a; 0 .. needle.length) { size_t value = 0; while (value < needle.length && !needlematch(needle, a, value)) { ++value; } this.skip[needle.length - a - 1] = value; } [1] https://github.com/dlang/phobos/blob/master/std/algorithm/searching.d#L335
May 24 2016
On 05/24/2016 06:44 AM, qznc wrote:On Tuesday, 24 May 2016 at 09:34:30 UTC, Chris wrote:One somewhat unrelated thing that can be done with Boyer-Moore: use a constant-length skip array instead of dynamic allocation. Upon saturation, continue with brute force. -- AndreiOn Tuesday, 24 May 2016 at 07:54:38 UTC, qznc wrote:Unfortunately, it is slower. My current measurements [0]: std find took 753837231 manual find took 129561980 std find took 721676721 manual find took 216418870 The nested-for-loop implementation is roughly 4 times faster! Caveat: I'm not completely sure my benchmark is fair yet. [0] https://github.com/qznc/find-is-too-slowApart from advanced algorithms, find should not be slower than a naive nested-for-loop implementation. [0] https://issues.dlang.org/show_bug.cgi?id=16066From Phobos [1]: /* Note: This step could be made a lot faster. * A simple implementation is shown here. */ this.skip = new size_t[needle.length]; foreach (a; 0 .. needle.length) { size_t value = 0; while (value < needle.length && !needlematch(needle, a, value)) { ++value; } this.skip[needle.length - a - 1] = value; } [1] https://github.com/dlang/phobos/blob/master/std/algorithm/searching.d#L335
May 24 2016
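One possible reading of the constant-length-table idea (a sketch under that assumption, not Phobos code and not necessarily what Andrei has in mind): a Horspool-style bad-character table over bytes is inherently fixed at 256 entries, so it can live on the stack with no allocation at all.

```d
// Horspool-style search with a fixed 256-entry skip table on the stack.
// Illustration only; byte-indexed, so it treats the string as raw bytes.
ptrdiff_t horspoolFind(string haystack, string needle)
{
    if (needle.length == 0) return 0;
    if (needle.length > haystack.length) return -1;
    size_t[256] skip = void;
    skip[] = needle.length; // default shift: whole needle length
    foreach (i; 0 .. needle.length - 1)
        skip[cast(ubyte) needle[i]] = needle.length - 1 - i;
    size_t pos = 0;
    while (pos <= haystack.length - needle.length)
    {
        // Compare right to left within the current window.
        size_t j = needle.length;
        while (j > 0 && haystack[pos + j - 1] == needle[j - 1]) --j;
        if (j == 0) return cast(ptrdiff_t) pos;
        // Shift by the skip value of the window's last character.
        pos += skip[cast(ubyte) haystack[pos + needle.length - 1]];
    }
    return -1;
}
```

On pathological inputs the shift degrades toward 1, which is one place a brute-force fallback of the kind suggested above could kick in.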
On 05/24/2016 08:38 AM, Andrei Alexandrescu wrote:One somewhat unrelated thing that can be done with Boyer-Moore: use a constant-length skip array instead of dynamic allocation. Upon saturation, continue with brute force. -- AndreiAnother idea: in every searched string there is one specific index that offers the most information to a failed search, allowing for a large skip. Roughly speaking it's the largest index around which there are many characters not seen elsewhere in the searched string. (In a string with all distinct characters, that would be the last character.) If that offset is cheap to compute and offers a good skip, a search starting from that index would be very fast. -- Andrei
May 24 2016
On Tuesday, 24 May 2016 at 10:44:12 UTC, qznc wrote:Unfortunately, it is slower. My current measurements [0]: std find took 753837231 manual find took 129561980 std find took 721676721 manual find took 216418870 The nested-for-loop implementation is roughly 4 times faster! Caveat: I'm not completely sure my benchmark is fair yet. [0] https://github.com/qznc/find-is-too-slowPlot twist: manual find is not always faster. My benchmark now generates random haystack and needle. Sometimes Phobos clearly wins. Example: $ ./benchmark.ldc -n39 -l10000 -i10000 Found at 10000 std find took 289529102 manual find took 368958679 This is compiled with LDC, needle is 39 chars, haystack 10000 chars, and 10000 iterations. "Found at 10000" means the needle was not found. Very often manual find wins though. Takeaway: Manual find is too naive. However, something slows down std find in general. More research needed. PS: Any recommendations for a Linux disassembler? Objdump is ok, but some arrows for jumps would be nice.
May 24 2016
On 05/24/2016 03:47 PM, qznc wrote:PS: Any recommendations for a Linux dissassembler? Objdump is ok, but some arrows for jumps would be nice.Try http://asm.dlang.org for short tests. -- Andrei
May 24 2016
On Tuesday, 24 May 2016 at 19:47:52 UTC, qznc wrote:PS: Any recommendations for a Linux dissassembler? Objdump is ok, but some arrows for jumps would be nice.I discovered radare2. While it is actually a debugger (like gdb), it includes a nice disassembler. Example view, which shows the loop calling std.algorithm.find repeatedly: | 0x00402e08, 0 | nop [rax+rax+0x0] .--> 0x00402e10, 0 | mov rdx, [rbx] || 0x00402e13 0 | mov rcx, [rbx+0x8] || 0x00402e17 0 | mov rdi, r15 || 0x00402e1a 0 | mov rsi, r12 || 0x00402e1d 0 | call 0x409160 ; 4 = sym._D3std9algorithm9searching34__T4fin || 0x00402e22 0 | mov rcx, [rbx] || 0x00402e25 0 | sub rcx, rax || 0x00402e28, 0 | mov [rbx+0x20], rcx || 0x00402e2c, 0 | dec ebp `==< 0x00402e2e 0 | jnz 0x402e10 ; 5 = 0x00402e10 | 0x00402e30, 0 | lea r15, [rsp] Note the ASCII arrow on the left! Objdump should do that. :)
May 24 2016
On Tuesday, 24 May 2016 at 22:01:17 UTC, qznc wrote:On Tuesday, 24 May 2016 at 19:47:52 UTC, qznc wrote: I discovered radare2. While it is actually a debugger (like gdb), it includes a nice disassembler. Example view, which shows the loop calling std.algorithm.find repeatedly: | 0x00402e08, 0 | nop [rax+rax+0x0] .--> 0x00402e10, 0 | mov rdx, [rbx] || 0x00402e13 0 | mov rcx, [rbx+0x8] || 0x00402e17 0 | mov rdi, r15 || 0x00402e1a 0 | mov rsi, r12 || 0x00402e1d 0 | call 0x409160 ; 4 = sym._D3std9algorithm9searching34__T4fin || 0x00402e22 0 | mov rcx, [rbx] || 0x00402e25 0 | sub rcx, rax || 0x00402e28, 0 | mov [rbx+0x20], rcx || 0x00402e2c, 0 | dec ebp `==< 0x00402e2e 0 | jnz 0x402e10 ; 5 = 0x00402e10 | 0x00402e30, 0 | lea r15, [rsp] Note the ASCII arrow on the left! Objdump should do that. :)Any suggestions how we can make find more efficient?
May 25 2016
On Wednesday, 25 May 2016 at 10:28:47 UTC, Chris wrote:Any suggestions how we can make find more efficient?I will send a pull request soon. My current patch [0] improves it, but still contains a bug. [0] https://github.com/qznc/find-is-too-slow/commit/76efc706a2c1d9e885800ca9b830850935700a95
May 25 2016
On Wednesday, 25 May 2016 at 10:45:35 UTC, qznc wrote:On Wednesday, 25 May 2016 at 10:28:47 UTC, Chris wrote:And here it is: https://github.com/dlang/phobos/pull/4362 Destroy!Any suggestions how we can make find more efficient?I will send a pull request soon. My current patch [0] improves it, but still contains a bug. [0] https://github.com/qznc/find-is-too-slow/commit/76efc706a2c1d9e885800ca9b830850935700a95
May 25 2016
On Wednesday, 25 May 2016 at 11:30:33 UTC, qznc wrote:
> And here it is: https://github.com/dlang/phobos/pull/4362
>
> Destroy!

Great stuff! I really appreciate your efforts. As I said, my code uses find/canFind a lot.
May 25 2016
On 05/24/2016 03:54 AM, qznc wrote:
> On Monday, 23 May 2016 at 22:19:18 UTC, Andrei Alexandrescu wrote:
>> On 05/23/2016 03:11 PM, qznc wrote:
>>> Actually, std find should be faster, since it could use the Boyer Moore algorithm instead of naive string matching.
>> Conventional wisdom has it that find() is brute force and that's that, but probably it's time to destroy. Selectively using advanced searching algorithms for the appropriate inputs is very DbI-ish. There are a few nice precedents of blend algorithms, see e.g. http://effbot.org/zone/stringlib.htm. Writing a generic subsequence search blend algorithm, one that chooses the right algorithm based on a combination of static and dynamic decisions, is quite a fun and challenging project. Who wanna?
> I observed that Boyer-Moore from std is still slower. My guess is it is due to the relatively short strings in my benchmark and the need to allocate for the tables. Knuth-Morris-Pratt might be worthwhile, because it only requires a smaller table, which could be allocated on the stack.

There's also Rabin-Karp that can be used when the sought range is relatively short.

> The overall sentiment seems to be that KMP vs BM depends on the input. This means an uninformed find function could only use heuristics. Anyway, Phobos could use a KnuthMorrisPrattFinder implementation next to the BoyerMooreFinder. I filed an issue [0].
>
> [0] https://issues.dlang.org/show_bug.cgi?id=16066

Great. I keep on thinking how the advanced algorithms should "kick in" only after a partial match was long enough and frequent enough. As long as the partial match is short and infrequent, brute force will do best.

> Apart from advanced algorithms, find should not be slower than a naive nested-for-loop implementation.

Nice, thanks.

Andrei
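[Editor's note: for readers unfamiliar with the Knuth-Morris-Pratt variant mentioned above, here is a minimal sketch in C (not the Phobos code). The `table` array is the small prefix table qznc refers to; the fixed-size stack allocation is an assumption for illustration, valid only for needles up to 256 bytes.]

```c
#include <assert.h>
#include <stddef.h>

/* KMP substring search: returns index of first match, or -1.
   Sketch only; assumes 0 < needle_len <= 256 so the prefix
   table fits on the stack. */
ptrdiff_t kmp_find(const char *hay, size_t hay_len,
                   const char *needle, size_t needle_len)
{
    if (needle_len == 0) return 0;
    size_t table[256];
    /* table[i] = length of the longest proper prefix of
       needle[0..i] that is also a suffix of it */
    table[0] = 0;
    for (size_t i = 1, k = 0; i < needle_len; ++i) {
        while (k > 0 && needle[i] != needle[k])
            k = table[k - 1];
        if (needle[i] == needle[k])
            ++k;
        table[i] = k;
    }
    /* scan the haystack; on mismatch, fall back via the table
       instead of restarting at the next haystack position */
    for (size_t i = 0, k = 0; i < hay_len; ++i) {
        while (k > 0 && hay[i] != needle[k])
            k = table[k - 1];
        if (hay[i] == needle[k])
            ++k;
        if (k == needle_len)
            return (ptrdiff_t)(i + 1 - needle_len);
    }
    return -1;
}
```

The appeal for short needles is exactly what qznc notes: the table is tiny and needs no heap allocation, unlike the Boyer-Moore tables.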
May 24 2016
On Monday, 23 May 2016 at 22:19:18 UTC, Andrei Alexandrescu wrote:
> On 05/23/2016 03:11 PM, qznc wrote:
>> Actually, std find should be faster, since it could use the Boyer Moore algorithm instead of naive string matching.
> Conventional wisdom has it that find() is brute force and that's that, but probably it's time to destroy. Selectively using advanced searching algorithms for the appropriate inputs is very DbI-ish. There are a few nice precedents of blend algorithms, see e.g. http://effbot.org/zone/stringlib.htm. Writing a generic subsequence search blend algorithm, one that chooses the right algorithm based on a combination of static and dynamic decisions, is quite a fun and challenging project. Who wanna?
>
> Andrei

Since I work a lot with text, I use canFind etc. a lot. It'd be nice to be able to choose faster algorithms. People are often advised to use library functions instead of rolling their own, because library functions are optimized for speed, but this doesn't seem to be true of Phobos. Simple everyday tasks like string matching should be as fast as C.
May 24 2016
On Monday, 23 May 2016 at 22:19:18 UTC, Andrei Alexandrescu wrote:
> Conventional wisdom has it that find() is brute force and that's that, but probably it's time to destroy. Selectively using advanced searching algorithms for the appropriate inputs is very DbI-ish. There are a few nice precedents of blend algorithms, see e.g. http://effbot.org/zone/stringlib.htm. Writing a generic subsequence search blend algorithm, one that chooses the right algorithm based on a combination of static and dynamic decisions, is quite a fun and challenging project. Who wanna?
>
> Andrei

I've played around with some algorithms and kept them as simple as possible, but I've found that a linear brute force for-loop always beats them (it's the extra decision(s), I suppose). It looks nice in theory, but in hardware reality a stupid loop is more efficient.

NB: I've found that `foreach` is usually faster than a manual `for`-loop.

I used qznc's benchmarking, and it's clear that the current implementation of std.algorithm.find is quite a poor performer. We do have to work on that. Library functions should be fast, no matter what. If someone uses `find`, they wanna be sure it's optimized for speed.
May 27 2016
On 5/27/16 8:28 AM, Chris wrote:
> I've played around with some algorithms and kept them as simple as possible, but I've found that a linear brute force for-loop always beats them (it's the extra decision(s), I suppose). It looks nice in theory, but in hardware reality a stupid loop is more efficient.

This is surprising. It should be easy to construct examples where brute force search performs arbitrarily poorly, e.g. searching for "aaaaaaa...aab" in a long string with only "a"s.

> NB: I've found that `foreach` is usually faster than a manual `for`-loop.

That's surprising too. I'd be interested to look at a disassembly if you have one.

> I used qznc's benchmarking, and it's clear that the current implementation of std.algorithm.find is quite a poor performer. We do have to work on that. Library functions should be fast, no matter what. If someone uses `find`, they wanna be sure it's optimized for speed.

That's definitely the way to go. Thanks for looking into it!

Andrei
May 27 2016
On Friday, 27 May 2016 at 13:29:00 UTC, Andrei Alexandrescu wrote:
> This is surprising. It should be easy to construct examples where brute force search performs arbitrarily poorly, e.g. searching for "aaaaaaa...aab" in a long string with only "a"s.

I will look into this. Atm, I'm using qznc's benchmark.

> That's surprising too. I'd be interested to look at a disassembly if you have one.

I have to correct myself. A manual loop is faster; I couldn't believe it either, so I benchmarked the same function with `foreach` and `for`. Turns out that `for` is _way_ faster.

There is a bug, if it is one, in qznc's `manual_find`. It doesn't find strings at the end of a string, as "ing" in "string", and it does not find a string that is equal to the search string, as in "abba" == "abba".

This outperforms both `manual_find` and the improved `std find`:

string findStringS_Manual(string haystack, string needle)
{
    if (needle.length > haystack.length)
        return haystack[$..$];
outer:
    for (auto i = 0; i < haystack.length; i++)
    {
        if (haystack[i] != needle[0])
            continue;
        for (size_t j = i+1, k = 1; k < needle.length; ++j, ++k)
            if (haystack[j] != needle[k])
                continue outer;
        return haystack[i..$];
    }
    return haystack[$..$];
}

A typical example would be:

std find took      25642297
manual find took    7426676  // manual_find
my std find took    6654496  // improved find
findStringS took    7159556  // foreach(i, c; haystack)
findStringS_ took   5315498  // for (auto i = 0 ...) see above

I think it's clear that `std find` is veeeery slow. If anyone wants to test my function, please do so. I don't want to spread wrong data, but this is what I get at the moment (ldc latest version).

> That's definitely the way to go. Thanks for looking into it!
>
> Andrei
May 27 2016
On Friday, 27 May 2016 at 14:06:09 UTC, Chris wrote:
> This outperforms both `manual_find` and the improved `std find`:
>
> string findStringS_Manual(string haystack, string needle) [...]

Little bug fix:

for (size_t j = (i+1 < haystack.length) ? i+1 : i, k = 1; k < needle.length; ++j, ++k)

Else it will be out of range when you look for "za" and the string ends with "z". There is a slight performance penalty, but it's still faster than the rest. Maybe there are cleverer ways, but every little check and decision comes with a performance penalty.

The improved `std find` comes very close to the manual function above now, sometimes it's even faster or at least as fast.

std find took      12573666
manual find took    7418454
my std find took    6903854  <===
findStringS took    7166720
findStringS_ took   6579529  <===

std find took      11849892
manual find took    7407091
my std find took    6573102  <===
findStringS took    7296610
findStringS_ took   6573214  <===

std find took      10744361
manual find took    7576133
my std find took    6332014  <===
findStringS took    7224946
findStringS_ took   6555976  <===
May 27 2016
On Friday, 27 May 2016 at 14:41:29 UTC, Chris wrote:
> The improved `std find` comes very close to the manual function above now, sometimes it's even faster or at least as fast.

I think it really depends on the use case. The manual algorithm is really naive and std find is slightly more advanced. My benchmark creates a random haystack and needle, but this is not real world data. I believe the "slightly more advanced" approach is a good idea, but there is no data to prove it.

It might be interesting what algorithms other languages' standard libraries (C++, Python, Java, etc.) use.

Thanks for looking at the benchmarking, Chris! The more eyes, the better. :)
May 27 2016
On 5/27/16 10:41 AM, Chris wrote:
> The improved `std find` comes very close to the manual function above now, sometimes it's even faster or at least as fast.

What you want to do here for indexed access is to match the last element first. If no match, continue etc. If there's a match, enter an inner loop where you don't need to check for the end of the haystack. -- Andrei
May 27 2016
On Friday, 27 May 2016 at 18:21:06 UTC, Andrei Alexandrescu wrote:
> What you want to do here for indexed access is to match the last element first. If no match, continue etc. If there's a match, enter an inner loop where you don't need to check for the end of the haystack. -- Andrei

Another plot twist: Sentinels look interesting. Wilzbach [0] mentioned it in the pull request:

"Btw on a related issue - it seems like no one has actually bother to implement Andrei's "Fastware" sentinels in Phobos. If you are already in benchmarking, you might give this specialization for mutable RandomAccessRanges a try ;-)"

I tried it [1]. It changes the numbers from

std find:    153 ±32
manual find: 112 ±19
qznc find:   121 ±18
Chris find:  132 ±20

to

std find:    160 ±31
manual find: 118 ±24
qznc find:   109 ±13  <--- using the sentinel trick
Chris find:  142 ±27

It is normal that the numbers of the other tests change, since those are relative to the fastest one in each run. When qznc find 'wins' more often, the others get more slowdown.

This means another special case for mutable ranges could be worthwhile.

[0] https://github.com/dlang/phobos/pull/4362#issuecomment-222224387
[1] https://github.com/qznc/find-is-too-slow/commit/043d1d2f6dced625e964e8df883ad5a45d7fe654
May 27 2016
On 05/27/2016 06:19 PM, qznc wrote:
> manual find: 118 ±24
> qznc find:   109 ±13  <--- using the sentinel trick
> Chris find:  142 ±27
>
> It is normal that the numbers of the other tests change, since those are relative to the fastest one in each run. When qznc find 'wins' more often, the others get more slowdown.

What compiler?

> This means another special case for mutable ranges could be worthwhile.

Also mind the throwing possibilities.

Andrei
May 27 2016
On Saturday, 28 May 2016 at 01:30:11 UTC, Andrei Alexandrescu wrote:
> What compiler?

LDC. For DMD the numbers show no improvement:

std find:    170 ±31
manual find: 132 ±25
qznc find:   103 ±6
Chris find:  157 ±30

to

std find:    168 ±30
manual find: 131 ±25
qznc find:   104 ±8  <--- sentinel
Chris find:  155 ±29

(Caveat: Chris find has a bug, which is triggered once in the 10000 runs each.)

The sentinel trick is only proof of concept. The sentinel value is `needleBack+1`, but range elements need not support addition. Finding a sentinel is hard and most certainly requires more assumptions about the ranges.

> Also mind the throwing possibilities.

What do you mean? Using exceptions?

If anybody wants to replicate: it only takes a few seconds and there are no dependencies except dmd/ldc.

git clone https://github.com/qznc/find-is-too-slow
cd find-is-too-slow
make dmd
make ldc
May 28 2016
On 5/28/16 6:56 AM, qznc wrote:
> The sentinel value is `needleBack+1`, but range elements need not support addition. Finding a sentinel is hard and most certainly requires more assumptions about the ranges.

No need for a sentinel really so long as you first search for the last element of the needle, in the haystack. Allow me to put on the table a simple brute force routine, specialized for arrays with copyable elements, no custom predicate etc. Far as I can tell it does no redundant work:

T[] find(T)(T[] haystack, T[] needle)
{
    if (needle.length == 0) return haystack;
    immutable lastIndex = needle.length - 1;
    auto last = needle[lastIndex];
    size_t j = lastIndex;
    for (; j < haystack.length; ++j)
    {
        if (haystack[j] != last) continue;
        immutable k = j - lastIndex;
        // last elements match
        for (size_t i = 0; ; ++i)
        {
            if (i == lastIndex)
                return haystack[k .. $];
            if (needle[i] != haystack[k + i])
                break;
        }
    }
    return haystack[$ .. $];
}

unittest
{
    string s1 = "hello abc world";
    assert(find(s1, "abc") == "abc world");
    assert(find(s1, "def") == "");
}

void main() {}

Andrei
May 28 2016
On Saturday, 28 May 2016 at 12:47:59 UTC, Andrei Alexandrescu wrote:
> No need for a sentinel really so long as you first search for the last element of the needle, in the haystack. Allow me to put on the table a simple brute force routine, specialized for arrays with copyable elements, no custom predicate etc. Far as I can tell it does no redundant work: [...]

I included your function (I also fixed the bug in my findStringS_). Here's a typical result:

std find:    208 ±61
manual find: 127 ±29
qznc find:   108 ±13
Chris find:  140 ±32
Andrei find: 126 ±27

=====

std find:    207 ±59
manual find: 128 ±30
qznc find:   108 ±13
Chris find:  141 ±32
Andrei find: 125 ±27

I used dmd, because I don't have ldc on my laptop. qznc's find is clearly the winner.
May 28 2016
On Saturday, 28 May 2016 at 14:18:10 UTC, Chris wrote:
> I used dmd, because I don't have ldc on my laptop. qznc's find is clearly the winner.

DMD performance feels flaky to me. If performance is important, you should use ldc or gdc. Alternatively, you are Walter Bright and simply optimize dmd. ;)
May 28 2016
On Saturday, 28 May 2016 at 12:47:59 UTC, Andrei Alexandrescu wrote:
> No need for a sentinel really so long as you first search for the last element of the needle, in the haystack. Allow me to put on the table a simple brute force routine, specialized for arrays with copyable elements, no custom predicate etc. Far as I can tell it does no redundant work: [...]

A nice one. Here are the numbers I got:

LDC:
std find:    156 ±33
manual find: 117 ±24
qznc find:   114 ±14
Chris find:  135 ±24
Andrei find: 112 ±27

DMD:
std find:    177 ±31
manual find: 140 ±28
qznc find:   102 ±4
Chris find:  165 ±31
Andrei find: 130 ±25

My sentinel idea is to have only one conditional branch in the inner loop. Andrei find still has two `if`s in the inner loop. My inner loop looks like this:

haystack[scout] = cast(typeof(needleBack)) (needleBack+1); // place sentinel
for (size_t j = scout + 1 - needleLength, k = 0;; ++j, ++k) { // no condition
    if (!binaryFun!pred(haystack[j], needle[k])) { // only condition
        // ... ok, in here comes another, but this is not as costly
    }
}

As long as you are matching the needle, there is only one condition to check. There is no check for k < needleLength, because the check always fails for needle[$] aka haystack[scout] due to the sentinel. I thought it might be slower, since we need to write to memory twice instead of only reading. Turns out it is faster.
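[Editor's note: the sentinel trick above may be easier to follow as a self-contained sketch. The following is my own reconstruction in C of the idea described, not qznc's actual patch: the scout byte is temporarily overwritten with a value that cannot match the needle's last element, so the inner comparison loop needs no length check.]

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Sentinel-style substring search on a MUTABLE buffer.
   Returns the match index or -1; the buffer is restored before
   returning. Reconstruction of the idea, not the Phobos code. */
ptrdiff_t sentinel_find(char *hay, size_t hay_len,
                        const char *needle, size_t needle_len)
{
    if (needle_len == 0) return 0;
    if (needle_len > hay_len) return -1;
    const char last = needle[needle_len - 1];
    for (size_t scout = needle_len - 1; scout < hay_len; ++scout) {
        if (hay[scout] != last) continue;
        /* last element matched: plant the sentinel so the inner
           loop needs no k < needle_len check */
        const char saved = hay[scout];
        hay[scout] = (char)(last + 1);      /* differs from `last` even on wrap */
        size_t j = scout + 1 - needle_len, k = 0;
        while (hay[j] == needle[k]) { ++j; ++k; }  /* stops at the sentinel at latest */
        hay[scout] = saved;                 /* restore the buffer */
        if (k == needle_len - 1)            /* only the sentinel mismatched: full match */
            return (ptrdiff_t)(scout + 1 - needle_len);
    }
    return -1;
}
```

Note the trade-off qznc mentions: two memory writes per candidate (plant and restore) buy a branch-free-except-one inner loop, and this only works for mutable random-access data.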
May 28 2016
I played around with the benchmark. Some more numbers:

$ make ldc
ldmd2 -O -release -inline -noboundscheck *.d -ofbenchmark.ldc
./benchmark.ldc
E: wrong result with Chris find
E: wrong result with Chris find
E: wrong result with Chris find
   std find: 153 ±25    +66 (1934)  -15 (7860)
manual find: 122 ±28    +80 (1812)  -17 (8134)
  qznc find: 125 ±16    +18 (4644)  -15 (5126)
 Chris find: 148 ±29    +75 (1976)  -18 (7915)
Andrei find: 114 ±23   +100 (1191)  -13 (8770)
(avg slowdown vs fastest; absolute deviation)

$ make dmd
dmd -O -release -inline -noboundscheck *.d -ofbenchmark.dmd
./benchmark.dmd
E: wrong result with Chris find
E: wrong result with Chris find
E: wrong result with Chris find
   std find: 160 ±27    +44 (3162)  -20 (6709)
manual find: 148 ±28    +54 (2701)  -19 (7178)
  qznc find: 102 ±3     +27 ( 766)   -1 (9136)
 Chris find: 175 ±30    +55 (2796)  -21 (7106)
Andrei find: 122 ±22    +46 (2554)  -14 (7351)
(avg slowdown vs fastest; absolute deviation)

The additional numbers on the right are the ±MAD separated by above or below the mean. For example Andrei find with ldc:

Andrei find: 114 ±23   +100 (1191)  -13 (8770)

The mean slowdown is 114, which means 14% slower than the fastest one. The mean absolute deviation (MAD) is 23. More precisely, the mean deviation above the mean slowdown of 103 is 100 and -13 below the mean slowdown. 1191 of the 10000 runs were above the mean slowdown and 8770 below. The 39 missing runs are equal to the mean slowdown.

What bothers me is that changing the alphabet changes the numbers so much. Currently, if you restrict the alphabets for haystack and needle, the numbers change. The benchmark already does a random subset on each run, but there is definitely a bias.
May 29 2016
On Sunday, 29 May 2016 at 12:22:23 UTC, qznc wrote:
> I played around with the benchmark. Some more numbers:
>
> E: wrong result with Chris find
> [...]
> What bothers me is that changing the alphabet changes the numbers so much. Currently, if you restrict the alphabets for haystack and needle, the numbers change. The benchmark already does a random subset on each run, but there is definitely a bias.

You can avoid "E: wrong result with Chris find" by using

outer:
for (auto i = 0; i < haystack.length-needle.length; i++)
{
    if (haystack[i] != needle[0])
        continue;
    for (size_t j = i+1, k = 1; k < needle.length; ++j, ++k)
        if (haystack[j] != needle[k])
            continue outer;
    return haystack[i..$];
}

It's a tad faster. I'm planning to test on more varied data and see where a bias may occur.
May 29 2016
It's weird: on my machine, my find function is consistently faster than `manual find`.

LDC:
   std find: 137 ±22   +33 (3580)  -17 (6251)
manual find: 137 ±32   +53 (3105)  -23 (6834)
  qznc find: 106 ±8    +17 (2608)   -5 (7195)
 Chris find: 132 ±32   +58 (2803)  -23 (7131)
Andrei find: 120 ±26   +54 (2466)  -17 (7454)

===

   std find: 136 ±22   +33 (3503)  -17 (6304)
manual find: 137 ±33   +55 (3000)  -23 (6920)
  qznc find: 106 ±8    +17 (2535)   -5 (7287)
 Chris find: 132 ±33   +59 (2803)  -23 (7137)
Andrei find: 119 ±25   +51 (2569)  -16 (7374)

It's negligible, but I wonder if it's the extra initialization: `size_t end = haystack.length - needle.length + 1;` and `size_t i = 0` are defined before we know if it's a legitimate search. But that shouldn't matter, or should it? Or does it depend on the machine?

DMD slows all searches down considerably (except for `qznc find`, well done! :)

DMD:
   std find: 161 ±36   +46 (4015)  -31 (5847)
manual find: 147 ±40   +68 (3001)  -28 (6917)
  qznc find: 106 ±8    +15 (2623)   -5 (7172)
 Chris find: 201 ±41   +73 (2830)  -28 (7120)
Andrei find: 150 ±40   +61 (3347)  -30 (6575)

PS I've corrected the end of the search string to

outer: for (auto i = 0; i < haystack.length-(needle.length-1); i++)

else it will not match at the end of a string.
May 30 2016
On Sunday, 29 May 2016 at 12:22:23 UTC, qznc wrote:
> I played around with the benchmark. Some more numbers:
>
> The mean slowdown is 114, which means 14% slower than the fastest one. The mean absolute deviation (MAD) is 23. More precisely, the mean deviation above the mean slowdown of 103 is 100 and -13 below the mean slowdown. 1191 of the 10000 runs were above the mean slowdown and 8770 below. The 39 missing runs are equal to the mean slowdown.

A minor thing - you might consider also calculating the median and the median version of MAD (median of absolute deviations from the median). The reason is that benchmarks often have outliers in the max time dimension; the median will do a better job than the mean of reducing the effect of those outliers. Your benchmark code could publish both forms.

--Jon
May 29 2016
On Sunday, 29 May 2016 at 18:50:40 UTC, Jon Degenhardt wrote:
> A minor thing - you might consider also calculating the median and the median version of MAD (median of absolute deviations from the median). The reason is that benchmarks often have outliers in the max time dimension; the median will do a better job than the mean of reducing the effect of those outliers. Your benchmark code could publish both forms.

I don't think that would help. This is not a normal distribution, so mean and median don't match anyways. What would you learn from the median?

When looking at the assembly I don't like the single-byte loads. Since string (ubyte[] here) is of extraordinary importance, it should be worthwhile to use word loads [0] instead. Really fancy would be SSE. However, this is more advanced and requires further introspection.

[0] http://forum.dlang.org/post/aahhopdcrengakvoemrn@forum.dlang.org
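[Editor's note: one practical way to avoid hand-written word loads or SSE, shown here as a hedged C sketch (my own illustration, not qznc's patch), is to let libc's `memchr` (typically SIMD-optimized) skip to each candidate position of the needle's last byte and then verify with `memcmp`.]

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Substring search driven by memchr on the needle's last byte.
   Returns the match index or -1. Sketch only. */
ptrdiff_t memchr_find(const char *hay, size_t hay_len,
                      const char *needle, size_t needle_len)
{
    if (needle_len == 0) return 0;
    if (needle_len > hay_len) return -1;
    const char last = needle[needle_len - 1];
    /* the last byte of a match can only occur in this region */
    const char *p = hay + needle_len - 1;
    size_t rest = hay_len - needle_len + 1;
    while ((p = memchr(p, last, rest)) != NULL) {
        const char *start = p - (needle_len - 1);
        if (memcmp(start, needle, needle_len - 1) == 0)
            return start - hay;               /* candidate verified */
        ++p;                                  /* skip rejected candidate */
        rest = (size_t)((hay + hay_len) - p); /* bytes left to scan */
    }
    return -1;
}
```

The bulk scanning then happens inside the vectorized libc routine rather than in a byte-at-a-time D loop, which is close in spirit to the word-load idea above.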
May 29 2016
On Sunday, 29 May 2016 at 21:07:21 UTC, qznc wrote:
> I don't think that would help. This is not a normal distribution, so mean and median don't match anyways. What would you learn from the median?

Oh, I see. The benchmark varies the data on each run and aggregates, is that right? Sorry, I missed that.
May 29 2016
On Sunday, 29 May 2016 at 21:07:21 UTC, qznc wrote:
> When looking at the assembly I don't like the single-byte loads. Since string (ubyte[] here) is of extraordinary importance, it should be worthwhile to use word loads [0] instead. Really fancy would be SSE.

So far, the results look disappointing. Andrei find does not get faster with wordwise matching:

./benchmark.ldc
     std find: 133 ±25   +38 (3384)  -19 (6486)
  manual find: 140 ±37   +64 (2953)  -25 (6962)
    qznc find: 114 ±17   +33 (2610)  -11 (7262)
   Chris find: 146 ±39   +66 (3045)  -28 (6873)
  Andrei find: 126 ±29   +54 (2720)  -19 (7189)
Wordwise find: 130 ±30   +53 (2934)  -21 (6980)

Interesting side note: On my laptop Andrei find is faster than qznc find (for LDC), but on my desktop it reverses (see above). Both are Intel i7. Need to find a simpler processor. Maybe wordwise is faster there. Alternatively, find is purely memory bound and the L1 cache makes every difference disappear.

Also, note how std find is faster than manual find! Finding a reliable benchmark is hard. :/
May 30 2016
On 05/30/2016 05:31 AM, qznc wrote:
> So far, the results look disappointing. Andrei find does not get faster with wordwise matching: [...]
>
> Also, note how std find is faster than manual find! Finding a reliable benchmark is hard. :/

Please throw this hat into the ring as well: it should improve average search on large vocabulary dramatically.

https://dpaste.dzfl.pl/dc8dc6e1eb53

It uses a BM-inspired trick - once the last characters matched, if the match subsequently fails it needn't start from the next character in the haystack. The "skip" is computed lazily and in a separate function so as to keep the loop tight. All in all a routine worth a look. I wanted to write this for a long time. -- Andrei
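[Editor's note: the lazily computed "skip" can be reconstructed roughly as follows. This C sketch is my own reading of the described trick, not the dpaste code: after the last character matched but the candidate failed, the scan may jump ahead by the distance from the needle's last character to its previous occurrence inside the needle (or by the whole needle length if there is none). The skip is computed at most once, out of line, so the hot loop stays tight.]

```c
#include <assert.h>
#include <stddef.h>

/* Distance from the needle's last char back to its previous
   occurrence in the needle; needle_len if it occurs only once. */
static size_t compute_skip(const char *needle, size_t len)
{
    for (size_t i = len - 1; i-- > 0; )
        if (needle[i] == needle[len - 1])
            return len - 1 - i;
    return len;
}

/* Brute force with lazily computed BM-style skip.
   Returns the match index or -1. Sketch only. */
ptrdiff_t skip_find(const char *hay, size_t hay_len,
                    const char *needle, size_t needle_len)
{
    if (needle_len == 0) return 0;
    if (needle_len > hay_len) return -1;
    const char last = needle[needle_len - 1];
    size_t skip = 0;                 /* 0 means "not computed yet" */
    for (size_t j = needle_len - 1; j < hay_len; ) {
        if (hay[j] != last) { ++j; continue; }
        const size_t k = j + 1 - needle_len;   /* candidate start */
        size_t i = 0;
        while (i < needle_len - 1 && needle[i] == hay[k + i]) ++i;
        if (i == needle_len - 1)
            return (ptrdiff_t)k;
        /* last char matched but candidate failed: safe to jump */
        if (skip == 0) skip = compute_skip(needle, needle_len);
        j += skip;
    }
    return -1;
}
```

The jump is sound because any smaller shift would have to align the known `last` character at position `j` with a non-matching needle position; positions where `hay[j] != last` are still advanced one by one.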
May 30 2016
On Monday, 30 May 2016 at 18:20:39 UTC, Andrei Alexandrescu wrote:
> Please throw this hat into the ring as well: it should improve average search on large vocabulary dramatically. https://dpaste.dzfl.pl/dc8dc6e1eb53
>
> It uses a BM-inspired trick - once the last characters matched, if the match subsequently fails it needn't start from the next character in the haystack. The "skip" is computed lazily and in a separate function so as to keep the loop tight. All in all a routine worth a look. I wanted to write this for a long time. -- Andrei

Cool. So far, my experience with separate functions has been that they slow down the loop. But this might do the trick.
May 30 2016
On Monday, 30 May 2016 at 18:57:15 UTC, Chris wrote:
> Cool. So far, my experience with separate functions has been that they slow down the loop. But this might do the trick.

Congratulations!

DMD:
./benchmark.dmd
   std: 178 ±31   +36 (4475)  -29 (5344)
manual: 167 ±46   +82 (2883)  -32 (7054)
  qznc: 114 ±7    +18 (1990)   -5 (7288)
 Chris: 228 ±49   +82 (3050)  -35 (6780)
Andrei: 103 ±5    +47 ( 658)   -2 (9295)
(avg slowdown vs fastest; absolute deviation)
CPU ID: GenuineIntel Intel(R) Core(TM) i5-4200U CPU @ 1.60GHz

LDC:
   std: 184 ±19   +28 (3420)  -14 (6523)
manual: 205 ±59  +120 (2463)  -39 (7443)
  qznc: 151 ±25   +44 (2983)  -17 (6911)
 Chris: 194 ±57   +78 (3702)  -46 (6251)
Andrei: 101 ±2    +42 ( 435)   -1 (9542)
(avg slowdown vs fastest; absolute deviation)
CPU ID: GenuineIntel Intel(R) Core(TM) i5-4200U CPU @ 1.60GHz
May 30 2016
On 05/30/2016 04:00 PM, Chris wrote:
> ./benchmark.dmd
> [...]
> Andrei: 103 ±5    +47 ( 658)   -2 (9295)
> [...]
> Andrei: 101 ±2    +42 ( 435)   -1 (9542)

Thanks for looking into this! qznc, could you please look into updating https://github.com/dlang/phobos/pull/4362 with this result? One possible tweak is to see whether replacing the function call with inline code helps. Thanks! -- Andrei
May 30 2016
On Monday, 30 May 2016 at 20:08:46 UTC, Andrei Alexandrescu wrote:On 05/30/2016 04:00 PM, Chris wrote:What function call should be replaced with inline code? Overall, I'm very confused and displeased by the benchmark right now. Initially, the problem was that a naive manual find was slower than std find in Phobos. Look at the LDC numbers–manual find is slower than std find. By tweaking the benchmark, the numbers can be shifted around arbitrarily. Even without tweaking, it is very processor dependent. Here are the numbers from my laptop: ./benchmark.ldc std: 149 ±24 +29 (4275) -21 (5604) manual: 129 ±37 +90 (2055) -23 (7917) qznc: 131 ±22 +26 (4381) -20 (5488) Chris: 154 ±35 +81 (2192) -22 (7672) Andrei: 118 ±28 +82 (1762) -16 (8194) (avg slowdown vs fastest; absolute deviation) CPU ID: GenuineIntel Intel(R) Core(TM) i7 CPU M 620 2.67GHz ./benchmark.dmd std: 143 ±22 +26 (4219) -19 (5602) manual: 158 ±31 +47 (3436) -24 (6492) qznc: 101 ±2 +12 (1349) -1 (7851) Chris: 216 ±38 +56 (3393) -28 (6541) Andrei: 135 ±31 +48 (3326) -24 (6589) (avg slowdown vs fastest; absolute deviation) CPU ID: GenuineIntel Intel(R) Core(TM) i7 CPU M 620 2.67GHz And Desktop: ./benchmark.ldc std: 129 ±24 +40 (3121) -17 (6767) manual: 129 ±31 +59 (2668) -21 (7244) qznc: 112 ±14 +30 (2542) -9 (7312) Chris: 134 ±33 +58 (2835) -23 (7068) Andrei: 123 ±27 +53 (2679) -18 (7225) (avg slowdown vs fastest; absolute deviation) CPU ID: GenuineIntel Intel(R) Core(TM) i7-3770 CPU 3.40GHz ./benchmark.dmd std: 157 ±31 +44 (3693) -24 (6234) manual: 143 ±41 +73 (2854) -28 (7091) qznc: 116 ±21 +35 (3092) -14 (6844) Chris: 181 ±50 +74 (3452) -38 (6510) Andrei: 136 ±38 +64 (2975) -27 (6953) (avg slowdown vs fastest; absolute deviation) CPU ID: GenuineIntel Intel(R) Core(TM) i7-3770 CPU 3.40GHz./benchmark.dmd std: 178 ±31 +36 (4475) -29 (5344) manual: 167 ±46 +82 (2883) -32 (7054) qznc: 114 ±7 +18 (1990) -5 (7288) Chris: 228 ±49 +82 (3050) -35 (6780) Andrei: 103 ±5 +47 ( 658) -2 (9295) (avg slowdown vs fastest; 
absolute deviation) CPU ID: GenuineIntel Intel(R) Core(TM) i5-4200U CPU 1.60GHz
LDC: std: 184 ±19 +28 (3420) -14 (6523) manual: 205 ±59 +120 (2463) -39 (7443) qznc: 151 ±25 +44 (2983) -17 (6911) Chris: 194 ±57 +78 (3702) -46 (6251) Andrei: 101 ±2 +42 ( 435) -1 (9542) (avg slowdown vs fastest; absolute deviation) CPU ID: GenuineIntel Intel(R) Core(TM) i5-4200U CPU 1.60GHz
Thanks for looking into this! qznc, could you please look into updating https://github.com/dlang/phobos/pull/4362 with this result? One possible tweak is to see whether replacing the function call with inline code helps. Thanks! -- Andrei
May 30 2016
On 05/30/2016 06:16 PM, qznc wrote:What function call should be replaced with inline code?The call to computeSkip.Overall, I'm very confused and displeased by the benchmark right now.I agree it's difficult to characterize the behavior of substring search with one number. There are many dimensions of variation. (But there's no reason for an emotional response.) A few possible baselines come to mind: * Search a long string for a one-character string, match and fail. * Take an English text string. Search for a substring consisting of its last portion (e.g. 5% of the length). * Take an English text string. Search for a substring consisting of a fraction of the text (e.g. 3%) with additional characters prepended. Repeat for appended. Andrei
May 30 2016
On Tuesday, 31 May 2016 at 01:55:16 UTC, Andrei Alexandrescu wrote:I agree it's difficult to characterize the behavior of substring search with one number. There are many dimensions of variation. (But there's no reason for an emotional response.) A few possible baselines come to mind: * Search a long string for a one-character string, match and fail.There is a special version of find for searching a single char in a string. Using a one-letter needle string is more like a user mistake than something to optimize for.* Take an English text string. Search for a substring consisting of its last portion (e.g. 5% of the length).How long should the english text be? A Tweet? A book? A Gigabyte of log files? English text means basically ASCII and no Unicode?* Take an English text string. Search for a substring consisting of a fraction of the text (e.g. 3%) with additional characters prepended. Repeat for appended.Why the prepend/append? To force a mismatch?
May 31 2016
On 5/31/16 1:54 PM, qznc wrote:Using a one-letter needle string is more like a user mistake than something to optimize for.How is splitting on ',' a user mistake? -- Andrei
May 31 2016
On Tuesday, 31 May 2016 at 18:31:47 UTC, Andrei Alexandrescu wrote:On 5/31/16 1:54 PM, qznc wrote:The mistake is to split on "," instead of ','. The slow splitter at the start of this thread searches for "\r\n". Your lazy-skip algorithm looks great! ./benchmark.ldc Search in Alice in Wonderland std: 168 ±6 +29 ( 107) -3 ( 893) manual: 112 ±3 +28 ( 81) -1 ( 856) qznc: 149 ±4 +30 ( 79) -1 ( 898) Chris: 142 ±5 +28 ( 102) -2 ( 898) Andrei: 153 ±3 +28 ( 79) -1 ( 919) Andrei2: 101 ±2 +34 ( 31) -1 ( 969) Search random haystack with random needle std: 172 ±19 +61 ( 161) -11 ( 825) manual: 161 ±47 +73 ( 333) -35 ( 666) qznc: 163 ±21 +33 ( 314) -15 ( 661) Chris: 190 ±47 +80 ( 302) -33 ( 693) Andrei: 140 ±37 +60 ( 315) -27 ( 676) Andrei2: 103 ±6 +57 ( 64) -2 ( 935) (avg slowdown vs fastest; absolute deviation) CPU ID: GenuineIntel Intel(R) Core(TM) i7 CPU M 620 2.67GHz The Alice benchmark searches Alice in Wonderland for "find a pleasure in all their simple joys" and finds it in the last sentence.Using a one-letter needle string is more like a user mistake than something to optimize for.How is splitting on ',' a user mistake? -- Andrei
May 31 2016
On Tuesday, 31 May 2016 at 18:56:14 UTC, qznc wrote:The mistake is to split on "," instead of ','. The slow splitter at the start of this thread searches for "\r\n". Your lazy-skip algorithm looks great! ./benchmark.ldc Search in Alice in Wonderland std: 168 ±6 +29 ( 107) -3 ( 893) manual: 112 ±3 +28 ( 81) -1 ( 856) qznc: 149 ±4 +30 ( 79) -1 ( 898) Chris: 142 ±5 +28 ( 102) -2 ( 898) Andrei: 153 ±3 +28 ( 79) -1 ( 919) Andrei2: 101 ±2 +34 ( 31) -1 ( 969) Search random haystack with random needle std: 172 ±19 +61 ( 161) -11 ( 825) manual: 161 ±47 +73 ( 333) -35 ( 666) qznc: 163 ±21 +33 ( 314) -15 ( 661) Chris: 190 ±47 +80 ( 302) -33 ( 693) Andrei: 140 ±37 +60 ( 315) -27 ( 676) Andrei2: 103 ±6 +57 ( 64) -2 ( 935) (avg slowdown vs fastest; absolute deviation) CPU ID: GenuineIntel Intel(R) Core(TM) i7 CPU M 620 2.67GHz The Alice benchmark searches Alice in Wonderland for "find a pleasure in all their simple joys" and finds it in the last sentence.Would it speed things up even more, if we put the function `computeSkip` into the loop or is this done automatically by the compiler?
May 31 2016
On Tuesday, 31 May 2016 at 19:29:25 UTC, Chris wrote:Would it speed things up even more, if we put the function `computeSkip` into the loop or is this done automatically by the compiler?LDC inlines it. DMD does not. More numbers: ./benchmark.ldc Search in Alice in Wonderland std: 147 ±1 manual: 100 ±0 qznc: 121 ±1 Chris: 103 ±1 Andrei: 144 ±1 Andrei2: 105 ±1 Search in random short strings std: 125 ±15 manual: 117 ±10 qznc: 104 ±6 Chris: 123 ±14 Andrei: 104 ±5 Andrei2: 103 ±4 Mismatch in random long strings std: 140 ±22 manual: 164 ±64 qznc: 115 ±13 Chris: 167 ±63 Andrei: 161 ±68 Andrei2: 106 ±9 Search random haystack with random needle std: 138 ±27 manual: 135 ±33 qznc: 116 ±16 Chris: 141 ±36 Andrei: 131 ±33 Andrei2: 109 ±12 (avg slowdown vs fastest; absolute deviation) CPU ID: GenuineIntel Intel(R) Core(TM) i7-3770 CPU 3.40GHz Random short strings has haystacks of 10 to 300 characters and needles of 2 to 10. Basically, no time for initialisation. Random long strings has haystacks of size 1000, 10_000, 100_000, or 1_000_000 and needles 50 to 500. It inserts a character into a random index of the needle to force a mismatch. The last one is the configuration as before. Overall, Andrei2 (the lazy compute skip) is really impressive. :)
May 31 2016
On Tuesday, 31 May 2016 at 19:59:50 UTC, qznc wrote:On Tuesday, 31 May 2016 at 19:29:25 UTC, Chris wrote:Yep. It's really impressive. I actually thought that dmd didn't place `computeSkip` inside of the loop. This begs the question of whether it should be moved into the loop, in case we use it in Phobos, to make sure that it is as fast as possible even with dmd. However, I like it the way it is now. What I like about `Andrei2` is that it performs consistently well.
Would it speed things up even more, if we put the function `computeSkip` into the loop or is this done automatically by the compiler?LDC inlines it. DMD does not. More numbers: ./benchmark.ldc Search in Alice in Wonderland std: 147 ±1 manual: 100 ±0 qznc: 121 ±1 Chris: 103 ±1 Andrei: 144 ±1 Andrei2: 105 ±1 Search in random short strings std: 125 ±15 manual: 117 ±10 qznc: 104 ±6 Chris: 123 ±14 Andrei: 104 ±5 Andrei2: 103 ±4 Mismatch in random long strings std: 140 ±22 manual: 164 ±64 qznc: 115 ±13 Chris: 167 ±63 Andrei: 161 ±68 Andrei2: 106 ±9 Search random haystack with random needle std: 138 ±27 manual: 135 ±33 qznc: 116 ±16 Chris: 141 ±36 Andrei: 131 ±33 Andrei2: 109 ±12 (avg slowdown vs fastest; absolute deviation) CPU ID: GenuineIntel Intel(R) Core(TM) i7-3770 CPU 3.40GHz Random short strings has haystacks of 10 to 300 characters and needles of 2 to 10. Basically, no time for initialisation. Random long strings has haystacks of size 1000, 10_000, 100_000, or 1_000_000 and needles 50 to 500. It inserts a character into a random index of the needle to force a mismatch. The last one is the configuration as before. Overall, Andrei2 (the lazy compute skip) is really impressive. :)
May 31 2016
On 05/31/2016 04:18 PM, Chris wrote:I actually thought that dmd didn't place `computeSkip` inside of the loop. This begs the question if it should be moved to the loop, in case we use it in Phobos, to make sure that it is as fast as possible even with dmd. However, I like it the way it is now.You may want to then try https://dpaste.dzfl.pl/392710b765a9, which generates inline code on all compilers. -- Andrei
May 31 2016
On Tuesday, 31 May 2016 at 21:29:34 UTC, Andrei Alexandrescu wrote:You may want to then try https://dpaste.dzfl.pl/392710b765a9, which generates inline code on all compilers. -- AndreiIn general, it might be beneficial to use ldc.intrinsics.llvm_expect (cf. __builtin_expect) for things like that in order to optimise basic block placement. (We should probably have a compiler-independent API for that in core.*, by the way.) In this case, the skip computation path is probably small enough for that not to matter much, though. Another thing that might be interesting to do (now that you have a "clever" baseline) is to start counting cycles and make some comparisons against manual asm/intrinsics implementations. For short(-ish) needles, PCMPESTRI is probably the most promising candidate, although I suspect that for \r\n scanning in long strings in particular, an optimised AVX2 solution might have higher throughput. Of course these observations are not very valuable without backing them up with measurements, but it seems like before optimising a generic search algorithm for short-needle test cases, having one's eyes on a solid SIMD baseline would be a prudent thing to do. — David
May 31 2016
On Tuesday, 31 May 2016 at 22:50:37 UTC, David Nadlinger wrote:Another thing that might be interesting to do (now that you have a "clever" baseline) is to start counting cycles and make some comparisons against manual asm/intrinsics implementations. For short(-ish) needles, PCMPESTRI is probably the most promising candidate, although I suspect that for \r\n scanning in long strings in particular, an optimised AVX2 solution might have higher throughput. Of course these observations are not very valuable without backing them up with measurements, but it seems like before optimising a generic search algorithm for short-needle test cases, having one's eyes on a solid SIMD baseline would be a prudent thing to do.The current algorithm is generic with respect to the predicate. Once we use SSE/AVX tricks, it is a special case for equality. As a next step in Phobos, this is probably worth it for strings. We could probably steal some well-optimized strcmp from somewhere.
Jun 02 2016
On Tuesday, 31 May 2016 at 21:29:34 UTC, Andrei Alexandrescu wrote:You may want to then try https://dpaste.dzfl.pl/392710b765a9, which generates inline code on all compilers. -- AndreiI've added it as `Andrei3`. It runs faster with dmd, but it's not as good with ldc. Seems like ldc performs some extra optimization when moving `computeSkip` into the loop, something it doesn't bother to do when it's already there. dmd -O -release -inline -noboundscheck *.d -ofbenchmark.dmd ./benchmark.dmd Search in Alice in Wonderland std: 190 ±4 manual: 115 ±4 qznc: 106 ±2 Chris: 160 ±4 Andrei: 159 ±4 Andrei2: 108 ±2 Andrei3: 100 ±0 Search in random short strings std: 222 ±27 manual: 193 ±49 qznc: 120 ±12 Chris: 224 ±57 Andrei: 114 ±9 Andrei2: 106 ±5 Andrei3: 102 ±3 Mismatch in random long strings std: 186 ±28 manual: 206 ±85 qznc: 118 ±14 Chris: 265 ±104 Andrei: 194 ±85 Andrei2: 116 ±18 Andrei3: 102 ±4 Search random haystack with random needle std: 189 ±38 manual: 171 ±45 qznc: 118 ±11 Chris: 225 ±52 Andrei: 149 ±32 Andrei2: 110 ±7 Andrei3: 103 ±6 (avg slowdown vs fastest; absolute deviation) CPU ID: GenuineIntel Intel(R) Core(TM) i7-4770 CPU 3.40GHz ./benchmark.ldc Search in Alice in Wonderland std: 170 ±1 manual: 143 ±2 qznc: 133 ±1 Chris: 144 ±1 Andrei: 196 ±10 Andrei2: 100 ±0 Andrei3: 111 ±1 Search in random short strings std: 223 ±30 manual: 211 ±51 qznc: 124 ±12 Chris: 223 ±61 Andrei: 115 ±10 Andrei2: 102 ±3 Andrei3: 105 ±4 Mismatch in random long strings std: 181 ±17 manual: 253 ±109 qznc: 146 ±24 Chris: 248 ±108 Andrei: 228 ±96 Andrei2: 101 ±2 Andrei3: 108 ±6 Search random haystack with random needle std: 187 ±22 manual: 208 ±60 qznc: 152 ±27 Chris: 202 ±58 Andrei: 173 ±35 Andrei2: 102 ±4 Andrei3: 110 ±8 (avg slowdown vs fastest; absolute deviation) CPU ID: GenuineIntel Intel(R) Core(TM) i7-4770 CPU 3.40GHz
Jun 01 2016
On Tuesday, 31 May 2016 at 21:29:34 UTC, Andrei Alexandrescu wrote:On 05/31/2016 04:18 PM, Chris wrote:I ported this to Phobos style (A2Phobos in the benchmark) and changed the pull request [0]. In practically all tests, this algorithm wipes the floor with the competition. [0] https://github.com/dlang/phobos/pull/4362
I actually thought that dmd didn't place `computeSkip` inside of the loop. This begs the question if it should be moved to the loop, in case we use it in Phobos, to make sure that it is as fast as possible even with dmd. However, I like it the way it is now.You may want to then try https://dpaste.dzfl.pl/392710b765a9, which generates inline code on all compilers. -- Andrei
Jun 02 2016
On 5/31/16 1:54 PM, qznc wrote:On Tuesday, 31 May 2016 at 01:55:16 UTC, Andrei Alexandrescu wrote:Oh, I see what you mean - it's a string consisting only of one character, not one character 'x'. I agree, but I'm just saying it's a sound baseline.I agree it's difficult to characterize the behavior of substring search with one number. There are many dimensions of variation. (But there's no reason for an emotional response.) A few possible baselines come to mind: * Search a long string for a one-character string, match and fail.There is a special version of find for searching a single char in a string. Using a one-letter needle string is more like a user mistake than something to optimize for.I'm thinking of a few exponentially increasing sizes. e.g. 100, 1000, 10_000, 100_000, 1_000_000.* Take an English text string. Search for a substring consisting of its last portion (e.g. 5% of the length).How long should the english text be? A Tweet? A book? A Gigabyte of log files?English text means basically ASCII and no Unicode?My understanding (and correct me if I'm wrong) is that we're discussing search with the default predicate, ignoring equivalent grapheme representations etc. So there's no need to mind encoding at all, which means the text could be in any language. Foreign languages would further increase the vocabulary, but that's the only effect.Yah, and it's a good test for strings that have some equal portion yet they are different, which seems to me puts a greater strain on the algorithms. It is arguably a likely situation in the real world. Thanks, Andrei* Take an English text string. Search for a substring consisting of a fraction of the text (e.g. 3%) with additional characters prepended. Repeat for appended.Why the prepend/append? To force a mismatch?
May 31 2016
On Tuesday, 31 May 2016 at 17:54:34 UTC, qznc wrote:There is a special version of find for searching a single char in a string. Using a one-letter needle string is more like a user mistake than something to optimize for.At compile time you may not know the length of the needle, like in the grep command.
Jun 01 2016
On Wednesday, 1 June 2016 at 12:14:07 UTC, Patrick Schluter wrote:On Tuesday, 31 May 2016 at 17:54:34 UTC, qznc wrote:1) how about a CTFE find? s.find!(needle, pred) If we can initialize boyer-moore or KMP at compile time - it should be the fastest! At ndslice such functions are called bifacial: http://dlang.org/phobos/std_experimental_ndslice_iteration.html Imho a lot more in std.algorithm should be able to profit from facts known at compile-time. 2) Even for a run-time one-letter needle I am pretty sure it's worth specializing
There is a special version of find for searching a single char in a string. Using a one-letter needle string is more like a user mistake than something to optimize for.At compile time you may not know the length of the needle, like in the grep command.
Jun 01 2016
On Wednesday, 1 June 2016 at 12:41:19 UTC, Seb wrote:On Wednesday, 1 June 2016 at 12:14:07 UTC, Patrick Schluter wrote:That makes sense. I think that std.algorithm needs to be revised and optimized for speed. We cannot afford to have suboptimal algorithms in there.On Tuesday, 31 May 2016 at 17:54:34 UTC, qznc wrote:1) how about a CTFE find? s.find!(needle, pred) If we can initialize boyer-moore or KMP at compile time - it should be the fastest! At ndslice such functions are called bifacial: http://dlang.org/phobos/std_experimental_ndslice_iteration.html Imho a lot more in std.algorithm should be able to profit from facts known at compile-time. 2) Even for a runtime runtime one-letter needle I am pretty sure it's worth to specializeThere is a special version of find for searching a single char in a string. Using a one-letter needle string is more like a user mistake than something to optimize for.At compile time you may not know the length of the needle, like in the grep command.
Jun 01 2016
On 06/01/2016 08:41 AM, Seb wrote:On Wednesday, 1 June 2016 at 12:14:07 UTC, Patrick Schluter wrote:That would require partial evaluation, which sadly we don't have in D. -- AndreiOn Tuesday, 31 May 2016 at 17:54:34 UTC, qznc wrote:1) how about a CTFE find? s.find!(needle, pred) If we can initialize boyer-moore or KMP at compile time - it should be the fastest!There is a special version of find for searching a single char in a string. Using a one-letter needle string is more like a user mistake than something to optimize for.At compile time you may not know the length of the needle, like in the grep command.
Jun 01 2016
On Wednesday, 1 June 2016 at 12:41:19 UTC, Seb wrote:On Wednesday, 1 June 2016 at 12:14:07 UTC, Patrick Schluter wrote:What I wanted to say is that in real life, the input of the search routine is very often run-time, user-provided data. Think of search boxes in browsers and apps, command-line parameters à la grep, etc. The "string" search function should not catastrophically break down on special input, like 1-character strings, unusual Unicode, or when needle==haystack. I only said this so as not to lose focus on what is being attempted here. It's a frequent danger of micro-optimization and unit-test-focused development that a lot of time and effort is spent on improvements that are completely irrelevant to what is really needed in the real world (i.e. we're deep in bike-shed territory here).On Tuesday, 31 May 2016 at 17:54:34 UTC, qznc wrote:1) how about a CTFE find?There is a special version of find for searching a single char in a string. Using a one-letter needle string is more like a user mistake than something to optimize for.At compile time you may not know the length of the needle, like in the grep command.
Jun 01 2016
On Wednesday, 1 June 2016 at 13:47:10 UTC, Patrick Schluter wrote:That's true. We should try to optimize for the worst case, i.e. random input. However, it is often the case that the needle is known at compile time, e.g. things like if (input.canFind("foo")) There might be ways to optimize those cases at compile time.What I wanted to say, is that in real life, the input of the search routine is very often run-time user provided data. Think of search box in browsers and apps, command line parameter à la grep, etc. The "string" search function should not catastrophically break down on special input, like 1 character strings, unusual Unicode or when needle==haystack. I only said this to not lose the focus on what is being tried to be achieved here. It's often a danger of micro-optimization and unit test focused development, that a lot of time and effort are spent on improvements that are completely irrelevant when checked against what is really needed in real world (i.e. we're full in bike shed territory here).
Jun 01 2016
On Wednesday, 1 June 2016 at 14:32:38 UTC, Chris wrote: For fun, I've written a search function that alternates between the beginning and the end of a string. I'm basically using Andrei's search algorithm for this. It is, of course, only useful if `needle` is expected to be either towards the beginning or the end of a string. If `needle` occurs multiple times, it matches wherever it encounters it first.

T[] findUnique(T)(T[] haystack, T[] needle)
{
    static size_t computeSkip(T[] needle)
    {
        size_t result = 1;
        while (result < needle.length && needle[$ - 1 - result] != needle[$ - 1])
        {
            ++result;
        }
        return result;
    }

    if (needle.length == 0) return haystack;
    immutable size_t len = haystack.length;
    immutable size_t lastIndex = needle.length - 1;
    auto last = needle[lastIndex];
    auto first = needle[0];
    size_t skip = 0;

    for (size_t i = lastIndex, j = len-lastIndex-1; i < len; ++i, --j)
    {
        if (last != haystack[i]) goto back;
        for (size_t k = 0; ; ++k)
        {
            if (k == lastIndex) return haystack[i-lastIndex..$];
            if (haystack[i - lastIndex + k] != needle[k]) break;
        }
    back:
        if (i > j) break;
        if (first != haystack[j]) continue;
        for (size_t k = lastIndex; ; --k)
        {
            if (k == 0) return haystack[j..$];
            if (haystack[j + k] != needle[k]) break;
        }
        if (skip == 0) skip = computeSkip(needle);
        i += skip;
        j -= skip-1;
    }
    return haystack[$..$];
}

unittest
{
    string s1 = "Hello abc world!";
    assert (findUnique(s1, "abc") == "abc world!");
    assert (findUnique(s1, "def") == "");
    string s2 = "Ola, mundo!";
    assert (findUnique(s2, "do!") == "do!");
    string s3 = "abc";
    assert (findUnique(s3, "c") == "c");
    assert (findUnique(s3, "z") == "");
    string s4 = "aaabaaa";
    assert (findUnique(s4, "b") == "baaa");
}

I added it to the benchmarking suite (I had to turn off the error warnings, see description above). Naturally, it shines in the first test, as the needle is at the end of the text. The whole thing is to be taken with a grain of salt ;)

Search in Alice in Wonderland
std: 535 ±8 manual: 454 ±10 qznc: 421 ±5 Chris: 460 ±10 Andrei: 643 ±19 Andrei2: 314 ±6 Andrei3: 352 ±5 Biloop: 100 ±0
Search in random short strings
std: 219 ±33 manual: 194 ±47 qznc: 123 ±11 Chris: 219 ±69 Andrei: 113 ±10 Andrei2: 104 ±4 Andrei3: 103 ±3 Biloop: 195 ±46
Mismatch in random long strings
std: 193 ±30 manual: 249 ±103 qznc: 156 ±30 Chris: 260 ±108 Andrei: 238 ±97 Andrei2: 107 ±11 Andrei3: 115 ±12 Biloop: 141 ±44
Search random haystack with random needle
std: 189 ±25 manual: 193 ±57 qznc: 152 ±26 Chris: 203 ±59 Andrei: 176 ±35 Andrei2: 103 ±6 Andrei3: 110 ±8 Biloop: 141 ±28
(avg slowdown vs fastest; absolute deviation) CPU ID: GenuineIntel Intel(R) Core(TM) i7-4770 CPU 3.40GHz
Jun 02 2016
On Monday, 30 May 2016 at 22:16:27 UTC, qznc wrote:And Desktop: ./benchmark.ldc std: 129 ±24 +40 (3121) -17 (6767) manual: 129 ±31 +59 (2668) -21 (7244) qznc: 112 ±14 +30 (2542) -9 (7312) Chris: 134 ±33 +58 (2835) -23 (7068) Andrei: 123 ±27 +53 (2679) -18 (7225) (avg slowdown vs fastest; absolute deviation) CPU ID: GenuineIntel Intel(R) Core(TM) i7-3770 CPU 3.40GHz ./benchmark.dmd std: 157 ±31 +44 (3693) -24 (6234) manual: 143 ±41 +73 (2854) -28 (7091) qznc: 116 ±21 +35 (3092) -14 (6844) Chris: 181 ±50 +74 (3452) -38 (6510) Andrei: 136 ±38 +64 (2975) -27 (6953) (avg slowdown vs fastest; absolute deviation) CPU ID: GenuineIntel Intel(R) Core(TM) i7-3770 CPU 3.40GHz
Benchmark from desktop machine: DMD: std: 164 ±34 +43 (4054) -29 (5793) manual: 150 ±41 +72 (2889) -29 (7032) qznc: 103 ±6 +42 ( 878) -2 (9090) Chris: 205 ±43 +81 (2708) -29 (7232) Andrei: 136 ±31 +53 (2948) -22 (6977) (avg slowdown vs fastest; absolute deviation) CPU ID: GenuineIntel Intel(R) Core(TM) i7-4770 CPU 3.40GHz === LDC: std: 138 ±23 +35 (3457) -18 (6360) manual: 145 ±33 +45 (3748) -27 (6181) qznc: 105 ±7 +17 (2267) -4 (7534) Chris: 135 ±33 +56 (3061) -23 (6882) Andrei: 121 ±27 +52 (2630) -18 (7301) (avg slowdown vs fastest; absolute deviation) CPU ID: GenuineIntel Intel(R) Core(TM) i7-4770 CPU 3.40GHz
On my laptop Andrei's was the fastest (see post above).
May 31 2016
On Tuesday, 31 May 2016 at 08:43:59 UTC, Chris wrote:On Monday, 30 May 2016 at 22:16:27 UTC, qznc wrote:Comparing the chips involved, could it be cache related? 3M cache; Andrei wins both: http://ark.intel.com/products/75459/Intel-Core-i5-4200U-Processor-3M-Cache-up-to-2_60-GHz 4M cache; qznc wins DMD (and is faster than the LDC's best? What?); Andrei wins LDC: http://ark.intel.com/products/43560/Intel-Core-i7-620M-Processor-4M-Cache-2_66-GHz 8M cache; qznc wins both: http://ark.intel.com/products/65719/Intel-Core-i7-3770-Processor-8M-Cache-up-to-3_90-GHz http://ark.intel.com/products/75122/Intel-Core-i7-4770-Processor-8M-Cache-up-to-3_90-GHz Normally, I'd expect the 4200U to be similar to the desktop parts. Unless... Say, for the laptops (and I guess the desktops too, but it's more important in a mobile), did you verify the CPU frequency scaling wasn't interfering? -WyattAnd Desktop: ./benchmark.ldc std: 129 ±24 +40 (3121) -17 (6767) manual: 129 ±31 +59 (2668) -21 (7244) qznc: 112 ±14 +30 (2542) -9 (7312) Chris: 134 ±33 +58 (2835) -23 (7068) Andrei: 123 ±27 +53 (2679) -18 (7225) (avg slowdown vs fastest; absolute deviation) CPU ID: GenuineIntel Intel(R) Core(TM) i7-3770 CPU 3.40GHz ./benchmark.dmd std: 157 ±31 +44 (3693) -24 (6234) manual: 143 ±41 +73 (2854) -28 (7091) qznc: 116 ±21 +35 (3092) -14 (6844) Chris: 181 ±50 +74 (3452) -38 (6510) Andrei: 136 ±38 +64 (2975) -27 (6953) (avg slowdown vs fastest; absolute deviation) CPU ID: GenuineIntel Intel(R) Core(TM) i7-3770 CPU 3.40GHzBenchmark from desktop machine: DMD: std: 164 ±34 +43 (4054) -29 (5793) manual: 150 ±41 +72 (2889) -29 (7032) qznc: 103 ±6 +42 ( 878) -2 (9090) Chris: 205 ±43 +81 (2708) -29 (7232) Andrei: 136 ±31 +53 (2948) -22 (6977) (avg slowdown vs fastest; absolute deviation) CPU ID: GenuineIntel Intel(R) Core(TM) i7-4770 CPU 3.40GHz === LDC: std: 138 ±23 +35 (3457) -18 (6360) manual: 145 ±33 +45 (3748) -27 (6181) qznc: 105 ±7 +17 (2267) -4 (7534) Chris: 135 ±33 +56 (3061) -23 (6882) Andrei: 121 ±27 
+52 (2630) -18 (7301) (avg slowdown vs fastest; absolute deviation) CPU ID: GenuineIntel Intel(R) Core(TM) i7-4770 CPU 3.40GHz On my laptop Andrei's was the fasted (see post above).
May 31 2016
On Tuesday, 31 May 2016 at 17:31:29 UTC, Wyatt wrote:qznc wins DMD (and is faster than the LDC's best?Careful! These are not absolute numbers, but relative slowdowns. You cannot compare the numbers between LDC and DMD.
May 31 2016
On Monday, 30 May 2016 at 18:20:39 UTC, Andrei Alexandrescu wrote:On 05/30/2016 05:31 AM, qznc wrote:I wrote a splitter in SSE4.2 some time ago as a contribution to a github project. Perhaps it is related. https://github.com/HFTrader/string-splitting/blob/master/splithb2.cpp Cheers,On Sunday, 29 May 2016 at 21:07:21 UTC, qznc wrote:worthwhile to use word loads [0] instead. Really fancy would be SSE.
Jul 11 2016
On Friday, 27 May 2016 at 14:41:29 UTC, Chris wrote:On Friday, 27 May 2016 at 14:06:09 UTC, Chris wrote:The upper bound of the outer loop should be haystack.length-needle.length. There's no point in searching after that point, as the needle would not fit anymore, and it avoids the buffer overrun you have in your code when the first character of the needle is found after that point. You can then also remove your ?: test in the j loop, as that case is handled automatically.

This outperforms both `manual_find` and the improved `std find`

string findStringS_Manual(string haystack, string needle)
{
    if (needle.length > haystack.length)
        return haystack[$..$];
outer:
    for (auto i = 0; i < haystack.length; i++)
    {
        if (haystack[i] != needle[0]) continue;
        for (size_t j = i+1, k = 1; k < needle.length; ++j, ++k)
            if (haystack[j] != needle[k]) continue outer;
        return haystack[i..$];
    }
    return haystack[$..$];
}

Little bug fix:

for (size_t j = (i+1 < haystack.length) ? i+1 : i, k = 1; k < needle.length; ++j, ++k)
May 27 2016
Oops, I've been found out. :-) Thanks. You're right of course, and I've already noticed that bug as it fails on not found. I got the bounds wrong.
May 27 2016
On Friday, 27 May 2016 at 19:17:44 UTC, Chris wrote:Oops, I've been found out. :-) Thanks. You're right of course, and I've already noticed that bug as it fails on not found. I got the bounds wrong.I had the same "bug" when I wrote my search function on the project at work. I also found out that using the simplest loop is generally the best way to go. Processors nowadays have loop caches, fast level-1 caches, etc.; all the fancy search algorithms like Boyer-Moore can only overcome their memory and initialisation overhead for really huge haystacks. Here's the C code of my memmem() function that replaced the Boyer-Moore I had before. This implementation outran the BM search for sizes up to several hundred megabytes on a SPARC64 VII+ machine. On x86-64 I'm sure it would be even worse.

void *memmem(const void *haystack, size_t haystack_len,
             const void *needle, size_t needle_len)
{
    if (!haystack_len || !needle_len || needle_len > haystack_len)
        return NULL;

    uint8_t u8 = *((uint8_t *) needle);
    const uint8_t *p = (const uint8_t *) haystack;

    for (size_t i = haystack_len - needle_len + 1; i; --i) {
        if (*p == u8 && !memcmp(p, needle, needle_len))
            return (void *) p;
        else
            p++;
    }
    return NULL;
}

(sorry for posting C code, as I'm only beginning to learn D).
May 27 2016
On Friday, 27 May 2016 at 14:41:29 UTC, Chris wrote:The improved `std find` comes very close to the manual function above now; sometimes it's even faster, or at least as fast. std find took 12573666 manual find took 7418454 my std find took 6903854 <=== findStringS took 7166720 findStringS_ took 6579529 <===
I just updated my benchmark. It now checks lots of different scenarios instead of you having to specify one. You can only set the number of iterations now. It generates a random scenario (haystack length, needle length, alphabet, etc.) and runs it once with all algorithms. Instead of recording the absolute runtime, it records the slowdown compared to the fastest of the three. Those are averaged, and the absolute deviation (the second number) is computed as well. Example: std find: 153 ±33 manual find: 112 ±19 my std find: 119 ±16 So manual find is on average 12% slower than the fastest, ±19%. I would interpret that as no significant difference between manual and improved find.
May 27 2016
Now added Chris' algo to the benchmark: std find: 155 ±33 manual find: 112 ±19 qznc find: 122 ±18 <--- improved find Chris find: 133 ±20 <--- findStringS_
May 27 2016
On 05/27/2016 04:39 PM, qznc wrote:Now added Chris' algo to the benchmark: std find: 155 ±33 manual find: 112 ±19 qznc find: 122 ±18 <--- improved find Chris find: 133 ±20 <--- findStringS_
Does any of these feature simultaneously:
* check the last element first
* consequently stop early when the length of the haystack falls below the length of the needle
* consequently no check against the haystack limit is necessary in the inner loop
* use only one induction variable
? Andrei
May 27 2016
On Friday, 27 May 2016 at 20:50:52 UTC, Andrei Alexandrescu wrote:On 05/27/2016 04:39 PM, qznc wrote:Std find (currently in Phobos) and qznc find (see pull request) fulfill those three. Manual and Chris find are simpler.Now added Chris' algo to the benchmark: std find: 155 ±33 manual find: 112 ±19 qznc find: 122 ±18 <--- improved find Chris find: 133 ±20 <--- findStringS_Does any of these feature simultaneously: * check the last element first * consequently stop early when the length of the haystack falls below the length of the needle * consequently no check against the haystack limit is necessary in the inner loop* use only one induction variableNone of those. I made such a variant of qznc find and it gives me the same numbers.
May 27 2016
On Friday, 27 May 2016 at 20:39:03 UTC, qznc wrote:Now added Chris' algo to the benchmark: std find: 155 ±33 manual find: 112 ±19 qznc find: 122 ±18 <--- improved find Chris find: 133 ±20 <--- findStringS_Did you fix the bugs in my algorithm? S
May 28 2016
On Friday, 27 May 2016 at 14:06:09 UTC, Chris wrote:I think it's clear that `std find` is veeeery slow. If anyone wants to test my function, please do so. I don't want to spread wrong data, but this is what I get at the moment (ldc latest version).If you want to see find (current or my improved version) get really slow, then reduce the alphabet for the haystack (e.g. --alphabet=ab). ;)
May 27 2016
On Friday, 27 May 2016 at 14:06:09 UTC, Chris wrote:
> I have to correct myself. A manual loop is faster. I couldn't believe it either, so I benchmarked the same function with `foreach` and `for`. Turns out that `for` is _way_ faster.

Just about the only reason I could think of for this to happen is if the compiler fails to inline the range primitives from std.array. Otherwise, the loops should be pretty much equivalent to LLVM's optimiser. This is so fundamental to D's strategic promise that we can't afford to get it wrong.

— David
May 27 2016
On Friday, 27 May 2016 at 21:31:48 UTC, David Nadlinger wrote:
> Just about the only reason I could think of for this to happen is if the compiler fails to inline the range primitives from std.array. Otherwise, the loops should be pretty much equivalent to LLVM's optimiser. This is so fundamental to D's strategic promise that we can't afford to get it wrong.

I agree, and it has been said on this forum that tight manual loops are faster than the range paradigm (I think bearophile was one of them, if I remember correctly). Maybe we should benchmark this too, just in case.
May 28 2016
On 3/1/13 8:05 PM, cvk012c wrote:
> In my latest version of D script I didn't use splitter at all. I used string specific indexOf function. Still result is very bad. For text based protocols, such as SIP, performance of string manipulating functions is very important. Unfortunately, looks like it is not D strongest point at this time.

That conclusion would be hasty if not missing the whole point. You essentially measured the speed of one loop in various translators implementing various languages. Java code doing straight computation is on a par with C speed, no two ways about that. Python code using library primitives ain't no slouch either. Performance tuning in these languages becomes more difficult in larger applications where data layout, allocation, and indirect function calls start to dominate.

The claim of systems languages being fast materializes only when you need to optimize data layout and nonstandard, custom algorithms. At that point systems languages give you additional options, whereas with high-level languages optimization becomes a very difficult proposition.

Andrei
Mar 02 2013
On Sat, 2013-03-02 at 10:33 -0500, Andrei Alexandrescu wrote:
[…]
> That conclusion would be hasty if not missing the whole point. You essentially measured the speed of one loop in various translators implementing various languages. Java code doing straight computation is on a par with C speed, no two ways about that. Python code using library primitives ain't no slouch either. Performance tuning in these languages becomes more difficult in larger applications where data layout, allocation, and indirect function calls start to dominate.
[…]

Interestingly, there isn't only one Python implementation. There is only one language, but there is CPython, PyPy, Jython, IronPython, to mention but 4. On computationally intensive code, PyPy (Python execution environment in RPython) is generally 10 to 30 times faster than CPython (Python execution environment written in C).

C is a (reasonably) well known and used language thought to create fast code. RPython is Python but with some restrictions that is statically compiled. For writing interpreters, RPython spanks C. PyPy is not the only language using RPython to implement the interpreter. C's days in this game are seriously numbered.

--
Russel.
Dr Russel Winder      t: +44 20 7585 2200     voip: sip:russel.winder ekiga.net
41 Buckmaster Road    m: +44 7770 465 077     xmpp: russel winder.org.uk
London SW11 1EN, UK   w: www.russel.org.uk    skype: russel_winder
Mar 02 2013
On Saturday, 2 March 2013 at 15:43:57 UTC, Russel Winder wrote:
> Interestingly, there isn't only one Python implementation. There is only one language but there is CPython, PyPy, Jython, IronPython, to mention but 4. On computationally intensive code, PyPy (Python execution environment in RPython) is generally 10 to 30 times faster than CPython (Python execution environment written in C).
>
> C is a (reasonably) well known and used language thought to create fast code. RPython is Python but with some restrictions that is statically compiled. For writing interpreters, RPython spanks C. PyPy is not the only language using RPython to implement the interpreter. C's days in this game are seriously numbered.

I'm not sure that's entirely fair. PyPy is fast because it implements a JIT, not because it's written in RPython.
Mar 02 2013
On Sat, 2013-03-02 at 16:55 +0100, John Colvin wrote:
[…]
> I'm not sure that's entirely fair. PyPy is fast because it implements a JIT, not because it's written in RPython.

The second sentence is almost entirely true; the first I am not so sure. Armin Rigo used to manage Psyco, which was an amendment to the CPython code base to include a JIT. He quit trying to keep that going, and so CPython doesn't have a JIT of any real importance. Unladen Swallow disappeared. The JIT framework is an integral part of RPython rather than being specific only to PyPy.

The point is that for I/O bound systems all languages are more or less equivalent; for computationally intensive code, JITs are needed for virtual machine based systems or they will not compete.

--
Russel.
Mar 02 2013
On Saturday, 2 March 2013 at 15:43:57 UTC, Russel Winder wrote:
> C is a (reasonably) well known and used language thought to create fast code. RPython is Python but with some restrictions that is statically compiled. For writing interpreters, RPython spanks C. PyPy is not the only language using RPython to implement the interpreter. C's days in this game are seriously numbered.

Source? (Googling "rpython interpreter speed" didn't show anything)

There's always claims that systems like RPython, or Haskell, or LuaJIT, or Java HotSpot are able to rival or beat C, yet all I ever see are small toy programs. Show me something big. Something with 100's of thousands of lines of code.

If all this is true, where are the video games written in these languages that rival the performance/fidelity of AAA titles? I've never seen that. Ever.
Mar 02 2013
On Saturday, 2 March 2013 at 16:06:57 UTC, Peter Alexander wrote:
> Source? (Googling "rpython interpreter speed" didn't show anything)
>
> There's always claims that systems like RPython, or Haskell, or LuaJIT, or Java HotSpot are able to rival or beat C, yet all I ever see are small toy programs. Show me something big. Something with 100's of thousands of lines of code.

RPython has been used to write a very fast JIT (PyPy), which is not the same thing as an interpreter. The interpreter is what you fall back to when you can't use the JIT. I don't know how fast PyPy's interpreter is specifically, but it's not its main selling point.
Mar 02 2013
On Sat, 2013-03-02 at 17:06 +0100, Peter Alexander wrote:
[…]
> Source? (Googling "rpython interpreter speed" didn't show anything)

For me, my source is my microbenchmarks, which are testing just a loop with computation, i.e. compute intensive data parallel. There is plenty of other stuff.

> There's always claims that systems like RPython, or Haskell, or LuaJIT, or Java HotSpot are able to rival or beat C, yet all I ever see are small toy programs. Show me something big. Something with 100's of thousands of lines of code.

100,000+ lines of code is a total irrelevance regarding system performance, as anyone who knows anything of profiling and benchmarking will tell you. If C is the only viable language, why has Java got the majority of the commercial market? C++ and Fortran have the vast majority of the computational world.

> If all this is true, where are the video games written in these languages that rival performance/fidelity of AAA titles? I've never seen that. Ever.

Where is D in this arena? It is either C++ or C++ in the high performance graphics arena. If there are no AAA titles written in D, why does this email list have anyone on it? Clearly I think your response is just trolling; I will stop feeding the troll.

--
Russel.
Mar 02 2013
On Saturday, 2 March 2013 at 19:19:09 UTC, Russel Winder wrote:
> 100,000+ lines of code is a total irrelevance regarding system performance, as anyone who knows anything of profiling and benchmarking will tell you.

I don't care about benchmarking, I care about writing large, fast systems. Small toy programs are irrelevant because they are not representative of real programs. Real programs are large (many thousands or millions of lines of code) and you don't have the luxury of being able to hand optimize every line of code.

> If C is the only viable language why has Java got the majority of the commercial market. C++ and Fortran have the vast majority of the computational world.

Commercial market for what? For high performance applications C++ is still champion.

> Where is D in this arena. It is either C++ or C++ in the high performance graphics arena. If there are no AAA titles written in D why does this email list have anyone on it?

D is getting there. Remedy Games are sponsoring DConf, and I predict we'll start to see some high quality indie games over the next few years written in D. Up till now, I imagine language stability has been the main thing holding back D. Tomasz was working on a high quality renderer before he got fed up with D's instability (http://h3.gd/code/nucleus/).
Mar 02 2013
On 3/2/2013 7:43 AM, Russel Winder wrote:For writing interpreters, RPython spanks C.What's RPython doing that makes it faster?
Mar 02 2013
On Sat, 2013-03-02 at 11:30 -0800, Walter Bright wrote:
> On 3/2/2013 7:43 AM, Russel Winder wrote:
>> For writing interpreters, RPython spanks C.
>
> What's RPython doing that makes it faster?

Allowing PyPy to have a good JIT compiler.

--
Russel.
Mar 02 2013
On Saturday, 2 March 2013 at 19:42:47 UTC, Russel Winder wrote:
> Allowing PyPy to have a good JIT compiler.

And why do you think this wouldn't be possible if it was written in C? There are fast JITs written in C.
Mar 02 2013
On 3/2/2013 11:42 AM, Russel Winder wrote:
> Allowing PyPy to have a good JIT compiler.

I don't understand. Does that JIT generate faster code than a C compiler would generate?
Mar 02 2013
On Sat, Mar 02, 2013 at 12:02:08PM -0800, Walter Bright wrote:
> I don't understand. Does that JIT generate faster code than a C compiler would generate?

I don't know, but my wild guess is that a JIT optimizes the *right* hotspots based on real-time performance measurements, whereas a lot of C programmers are obsessed with optimizing what they *think* are the hotspots, but which really aren't.

T

--
Everybody talks about it, but nobody does anything about it! -- Mark Twain
Mar 02 2013
On 3/2/2013 12:08 PM, H. S. Teoh wrote:
> I don't know, but my wild guess is that a JIT optimizes the *right* hotspots based on real-time performance measurements, whereas a lot of C programmers are obsessed with optimizing what they *think* are the hotspots, but which really aren't.

I meant what the C *compiler* generates.
Mar 02 2013
On Sat, 2013-03-02 at 12:52 -0800, Walter Bright wrote:
> On 3/2/2013 12:08 PM, H. S. Teoh wrote:
>> I don't know, but my wild guess is that a JIT optimizes the *right* hotspots based on real-time performance measurements, whereas a lot of C programmers are obsessed with optimizing what they *think* are the hotspots, but which really aren't.
>
> I meant what the C *compiler* generates.

Yes, because the C/C++/D/etc. compilers are attempting to predict the control flow of the program in execution and optimize all cases for all possibilities. JITs are just focussing on the runtime bottlenecks with the actual data as being used. This allows for more focussed code generation in the actual context. I would suspect that in many cases the generated code is effectively the same, but JITs can often do unexpected and faster code because they have more data to optimize with and less optimization to do.

To say more would require actual comparative data, and I suspect no-one on this list will do that.

--
Russel.
Mar 03 2013
On 03-Mar-2013 12:03, Russel Winder wrote:
> Yes because the C/C++/D/etc. compilers are attempting to predict the control flow of the program in execution and optimize all cases for all possibilities. JITs are just focussing on the runtime bottlenecks with the actual data as being used. This allows for more focussed code generation in the actual context. I would suspect that in many cases the generated code is effectively the same but JITs can often do unexpected and faster codes because they have more data to optimize with and less optimization to do.

The only thing missing here is that JITs typically have much less time to do all and any of this.

> To say more would require actual comparative data and I suspect no-one on this list will do that.

Indeed unlikely.

--
Dmitry Olshansky
Mar 03 2013
On Sunday, 3 March 2013 at 08:03:53 UTC, Russel Winder wrote:
> Yes because the C/C++/D/etc. compilers are attempting to predict the control flow of the program in execution and optimize all cases for all possibilities. JITs are just focussing on the runtime bottlenecks with the actual data as being used. This allows for more focussed code generation in the actual context. I would suspect that in many cases the generated code is effectively the same but JITs can often do unexpected and faster codes because they have more data to optimize with and less optimization to do.

That is the theory, but in practice it doesn't work that well: you have to instrument the code to get measurements and optimize according to runtime data. Plus you can't monkey-patch codegen (it is unspecified how x86 CPUs load instructions, and so it isn't guaranteed that the CPU won't see garbage). That cost, plus the cost of the optimization itself, makes the whole thing worth it only if the win when you optimize is high.

> To say more would require actual comparative data and I suspect no-one on this list will do that.

It isn't worth it. As explained above, you can conclude that this is highly dependent on the code you manipulate and the runtime data that is thrown at it.

Note that I discussed with people from LLVM how this can be done in statically compiled code. In fact, it can, but it is rather complicated and not worth automating for now.
Mar 03 2013
On 3/3/2013 12:03 AM, Russel Winder wrote:
> Yes because the C/C++/D/etc. compilers are attempting to predict the control flow of the program in execution and optimize all cases for all possibilities. JITs are just focussing on the runtime bottlenecks with the actual data as being used. This allows for more focussed code generation in the actual context. I would suspect that in many cases the generated code is effectively the same but JITs can often do unexpected and faster codes because they have more data to optimize with and less optimization to do.

I've heard this was the advantage of JITs for the last 15 years. I haven't seen any data to justify it, and I have trouble thinking of any significant improvements to code gen that could be made with runtime data.

That is for JITting statically typed languages. The situation is a bit better for dynamic ones, because the JIT can do a better job than a compiler: most of the time, only one type is used in a particular spot, and the JIT can pick that up and "statically" type it, as opposed to a compiler which cannot.
Mar 03 2013
Walter Bright:
> and I have trouble thinking of any significant improvements to code gen that could be made with runtime data.

I've seen that Java HotSpot is able to unroll loops with a length known only at run-time. It unrolls them only when the run-time statistics say it's advantageous. (In theory the same is possible with profile-driven optimization.)

Bye,
bearophile
Mar 03 2013
On Sunday, 3 March 2013 at 11:57:36 UTC, bearophile wrote:
> I've seen that Java HotSpot is able to unroll loops with a length known only at run-time. It unrolls them only when the run-time statistics say it's advantageous.

I believe the HotSpot runtime compiler is about the only one to perform runtime optimizations like that. I don't know if the CIL does this kind of thing.
Mar 03 2013
On Saturday, 2 March 2013 at 20:02:09 UTC, Walter Bright wrote:
> I don't understand. Does that JIT generate faster code than a C compiler would generate?

In some cases yes, it can, but as you know that has nothing to do with the language the JIT is written in. As I've said elsewhere, PyPy being fast compared to CPython has nothing to do with the former being written in RPython and the latter in C*.

* other than perhaps the higher productivity of writing in RPython allowing more resources to be dedicated to optimisation.
Mar 02 2013
Russel Winder:
> Allowing PyPy to have a good JIT compiler.

PyPy is not just a JIT, it's a more complex beast. It's more like a meta-JIT. The only one I know of.

Bye,
bearophile
Mar 04 2013
On Saturday, 2 March 2013 at 15:33:42 UTC, Andrei Alexandrescu wrote:
> That conclusion would be hasty if not missing the whole point. You essentially measured the speed of one loop in various translators implementing various languages. Java code doing straight computation is on a par with C speed, no two ways about that. Python code using library primitives ain't no slouch either.

That is true, but what prevents D from doing the same? Why can Java code be on par with C speed and D cannot? If the Java and Python folks can do it, why can't D's creators? If performance optimization is always the number one priority in the Java and Python communities, why is it not the case for D?

I will go even further: what prevents D from being faster than C? That's where D will really shine and people will start talking about it. But that is probably a distant future for D, and for now my personal opinion is that NO part of D code should be slower than at least Java. If it is, it should be reported as an issue and fixed. Being more than twice slower than Java is a failure.
Mar 02 2013
On 3/2/2013 11:14 AM, cvk012c wrote:
> That is true, but what prevents D to do the same?

Nothing.

> I will go even further: what prevents D to be faster than C?

Nothing. In fact, there are semantics in D that allow faster code to be generated than can be for C. They are unexploited at the moment, but they are there.
Mar 02 2013
Walter Bright:there are semantics in D that allow faster code to be generated than can be for C. They are unexploited at the moment, but they are there.This is interesting. Maybe you should write a small article on this. Bye, bearophile
Mar 04 2013
On Saturday, 2 March 2013 at 00:47:02 UTC, Steven Schveighoffer wrote:
> Try my hand-written version (elsewhere in thread). I think it can be done better too (use pointers instead of arrays).

That is usually a bad idea, as it will fuck up pretty badly the aliasing analysis capabilities of the compiler.
Mar 02 2013
deadalnix:That is usually a bad idea as it will fuck up pretty bad the aliasing analysis capabilities of the compiler.With LDC I've seen that using raw pointers sometimes gives a little extra performance, but with DMD I've some times a lower performance. Bye, bearophile
Mar 02 2013
On 3/1/13 4:28 PM, cvk012c wrote:
> On my hardware with -inline options it now takes about 15 secs which is still slower than Python but with both -inline and -noboundscheck it takes 13 secs and finally beats Python. But I still kind of disappointed because I expected a much better performance boost and got only 7%. Counting that Python is not the fastest scripting language I think that similar Perl and Java scripts will outperform D easily.

I doubt that.

1. Microbenchmarks are a crapshoot; they exercise a tiny portion of the language and library.

2. With Python, after comparing 2-3 idioms - well, this is pretty much it. We doubled the speed in no time by just tuning options. D being a systems language allows you to take your code to a million places if you want to optimize.

3. split has been discussed and improved for years in the Python community (just google for e.g. python split performance).

Perl isn't all that fast actually, try this (takes 30+ seconds):

$message = "REGISTER sip:comm.example.com SIP/2.0\r
Content-Length: 0\r
Contact: <sip:12345 10.1.3.114:59788;transport=tls>;expires=4294967295;events=\"message-summary\";q=0.9\r
To: <sip:12345 comm.example.com>\r
User-Agent: (\"VENDOR=MyCompany\" \"My User Agent\")\r
Max-Forwards: 70\r
CSeq: 1 REGISTER\r
Via: SIP/2.0/TLS 10.1.3.114:59788;branch=z9hG4bK2910497772630690\r
Call-ID: 2910497622026445\r
From: <sip:12345 comm.example.com>;tag=2910497618150713\r\n\r\n";

foreach my $i (0 .. 10000000) {
    foreach my $notused (split(/\r\n/, $message)) {
    }
}

Andrei
Mar 01 2013
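[Editorial note: the Python script being benchmarked against is not quoted in this thread; the sketch below is a guess at its shape, using an abbreviated version of the SIP REGISTER message from the Perl snippet above. str.split runs in C inside CPython, which is why it holds its own in this microbenchmark.]

```python
# Hypothetical reconstruction of the Python side of the benchmark
# (headers abbreviated from the SIP message used in the Perl version).
message = (
    "REGISTER sip:comm.example.com SIP/2.0\r\n"
    "Content-Length: 0\r\n"
    "Max-Forwards: 70\r\n"
    "CSeq: 1 REGISTER\r\n"
    "\r\n"
)

def run(iterations):
    # split the message over and over, discarding the result,
    # exactly like the Perl and D loops in this thread
    for _ in range(iterations):
        parts = message.split("\r\n")
    return parts

parts = run(1000)  # the thread uses 10,000,000 iterations
```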
On 2013-03-01 23:01, Andrei Alexandrescu wrote:
> I doubt that.
>
> 1. Microbenchmarks are a crapshoot, they exercise a tiny portion of the language and library.

If that tiny portion is all that's used, then the rest doesn't matter.

> 2. With Python, after comparing 2-3 idioms - well, this is pretty much it. We doubled the speed in no time by just tuning options. D being a systems language allows you take your code to a million places if you want to optimize.

Perhaps we should look over the default configuration.

--
/Jacob Carlborg
Mar 02 2013
On 3/2/13 11:11 AM, Jacob Carlborg wrote:
> If that tiny portion is all that's used, then the rest doesn't matter.

Agreed. But I doubt that's the case in nontrivial applications.

Andrei
Mar 02 2013
On 03/01/2013 10:28 PM, cvk012c wrote:... On my hardware with -inline options it now takes about 15 secs which is still slower than Python but with both -inline and -noboundscheck it takes 13 secs and finally beats Python. But I still kind of disappointed because I expected a much better performance boost and got only 7%. Counting that Python is not the fastest scripting language I think that similar Perl and Java scripts will outperform D easily.Never make such statements without doing actual measurements. Furthermore, it is completely meaningless anyway. Performance benchmarks always compare language implementations, not languages. (Case in point: You get twice the speed by using another compiler backend implementation.)
Mar 01 2013
On Friday, 1 March 2013 at 22:02:02 UTC, Timon Gehr wrote:On 03/01/2013 10:28 PM, cvk012c wrote:Still, there is a case to be made for a performance tests suite that could be run after (or before) each release of the language, like http://speed.pypy.org .... On my hardware with -inline options it now takes about 15 secs which is still slower than Python but with both -inline and -noboundscheck it takes 13 secs and finally beats Python. But I still kind of disappointed because I expected a much better performance boost and got only 7%. Counting that Python is not the fastest scripting language I think that similar Perl and Java scripts will outperform D easily.Never make such statements without doing actual measurements. Furthermore, it is completely meaningless anyway. Performance benchmarks always compare language implementations, not languages. (Case in point: You get twice the speed by using another compiler backend implementation.)
Mar 01 2013
On Fri, 01 Mar 2013 16:28:08 -0500, cvk012c <cvk012c motorolasolutions.com> wrote:
> On my hardware with -inline options it now takes about 15 secs which is still slower than Python but with both -inline and -noboundscheck it takes 13 secs and finally beats Python. But I still kind of disappointed because I expected a much better performance boost and got only 7%. Counting that Python is not the fastest scripting language I think that similar Perl and Java scripts will outperform D easily. Thanks Andrei and simendsjo for a quick response though.

Phobos kind of refuses to treat strings like arrays of characters; it insists on decoding. With DMD and a hand-written splitter, it takes 6 seconds instead of 10 on my system (64-bit macosx).

struct MySplitter
{
    private string s;
    private string source;

    this(string src)
    {
        source = src;
        popFront();
    }

    @property string front() { return s; }

    @property bool empty() { return s.ptr is null; }

    void popFront()
    {
        s = source;
        if(!source.length)
        {
            source = null;
        }
        else
        {
            size_t i = 0;
            bool found = false;
            for(; i + 1 < source.length; i++)
            {
                if(source[i] == '\r' && source[i + 1] == '\n')
                {
                    found = true;
                    break;
                }
            }
            if(found)
            {
                s = source[0 .. i];
                source = source[i + 2 .. $];
            }
            else
            {
                // no separator left: yield the whole remainder
                // (slicing to i here would drop the last character)
                s = source;
                source = source[$ .. $];
            }
        }
    }
}

I'm sure splitter could be optimized to do the same thing I'm doing. Probably can reduce that a bit using pointers instead of strings.

-Steve
Mar 01 2013
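[Editorial note: a quick way to sanity-check a hand-written splitter like the one above is to compare it against a reference in another language. The hedged sketch below is a Python generator with the intended semantics — split on "\r\n", with a trailing separator producing a final empty piece — matching str.split on non-empty input.]

```python
def my_split(source):
    """Reference splitter: yields pieces of `source` separated by "\r\n"."""
    while source:
        idx = source.find("\r\n")
        if idx == -1:
            # no separator left: yield the whole remainder
            yield source
            return
        yield source[:idx]
        source = source[idx + 2:]
        if not source:
            yield ""  # separator at the very end -> trailing empty piece
            return
```

Comparing its output against `message.split("\r\n")` for the benchmark message exercises exactly the edge cases (trailing "\r\n\r\n", last segment without separator) discussed in this thread.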
On 3/1/13 5:31 PM, Steven Schveighoffer wrote:
> Phobos kind of refuses to treat strings like arrays of characters, it insists on decoding.

There's no decoding in find and splitter as far as I remember.

Andrei
Mar 01 2013
On Fri, 01 Mar 2013 17:35:04 -0500, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:
> There's no decoding in find and splitter as far as I remember.

Looking at splitter, it uses skipOver to skip over the separator, and that seems to call R.front and R.popFront. Actually, it calls it for both the source string and the separator. Maybe fixing skipOver would fix the problem? Or don't call skipOver at all; since find isn't doing decoding, why do decoding when skipping the separator?

I still feel we need a specialized string type...

-Steve
Mar 01 2013
On 3/1/13 5:47 PM, Steven Schveighoffer wrote:
> Looking at splitter, it uses skipOver to skip over the separator, and that seems to call R.front and R.popFront. Actually, it calls it for both the source string and the separator.

You may be looking at the wrong splitter overload.

Andrei
Mar 01 2013
On Fri, 01 Mar 2013 18:07:45 -0500, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:On 3/1/13 5:47 PM, Steven Schveighoffer wrote:Yes, I was. Very difficult to tell with the way template constraints are written! So it's just pure heuristics. Not hard to see how that would affect a 10 million cycle program. We may be able to create a string-specific version of splitter that would take advantage of the representation. -SteveOn Fri, 01 Mar 2013 17:35:04 -0500, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:You may be looking at the wrong splitter overload.On 3/1/13 5:31 PM, Steven Schveighoffer wrote:Looking at splitter, it uses skipOver to skip over the separator, and that seems to call R.front and R.popFront. Actually, it calls it for both the source string and the separator.Phobos kind of refuses to treat strings like arrays of characters, it insists on decoding.There's no decoding in find and splitter as far as I remember.
Mar 01 2013
On Fri, 01 Mar 2013 18:22:54 -0500, Steven Schveighoffer <schveiguy yahoo.com> wrote:

> So it's just pure heuristics. Not hard to see how that would affect a 10 million cycle program. We may be able to create a string-specific version of splitter that would take advantage of the representation.

Just found a disturbing artifact:

splitter(message, '\n')

is more than twice as slow as

splitter(message, "\n")

-Steve
Mar 01 2013
On 3/1/13 6:26 PM, Steven Schveighoffer wrote:On Fri, 01 Mar 2013 18:22:54 -0500, Steven Schveighoffer <schveiguy yahoo.com> wrote:That is a problem. Could you please file a but report? Thanks, AndreiSo it's just pure heuristics. Not hard to see how that would affect a 10 million cycle program. We may be able to create a string-specific version of splitter that would take advantage of the representation.Just found a disturbing artifact: splitter(message, '\n') is more than twice as slow as splitter(message, "\n") -Steve
Mar 02 2013
On 3/2/13 12:20 PM, Andrei Alexandrescu wrote:On 3/1/13 6:26 PM, Steven Schveighoffer wrote:s/but/bug/ :o) AndreiOn Fri, 01 Mar 2013 18:22:54 -0500, Steven Schveighoffer <schveiguy yahoo.com> wrote:That is a problem. Could you please file a but report?So it's just pure heuristics. Not hard to see how that would affect a 10 million cycle program. We may be able to create a string-specific version of splitter that would take advantage of the representation.Just found a disturbing artifact: splitter(message, '\n') is more than twice as slow as splitter(message, "\n") -Steve
Mar 02 2013
On Saturday, 2 March 2013 at 17:21:17 UTC, Andrei Alexandrescu wrote:On 3/2/13 12:20 PM, Andrei Alexandrescu wrote:Ohhhhh!On 3/1/13 6:26 PM, Steven Schveighoffer wrote:s/but/bug/ :o)On Fri, 01 Mar 2013 18:22:54 -0500, Steven Schveighoffer <schveiguy yahoo.com> wrote:That is a problem. Could you please file a but report?So it's just pure heuristics. Not hard to see how that would affect a 10 million cycle program. We may be able to create a string-specific version of splitter that would take advantage of the representation.Just found a disturbing artifact: splitter(message, '\n') is more than twice as slow as splitter(message, "\n") -SteveAndrei
Mar 04 2013
On Sat, 02 Mar 2013 12:20:22 -0500, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:

> That is a problem. Could you please file a but report?

http://d.puremagic.com/issues/show_bug.cgi?id=9645

In trying to make a minimal test case, I found that the algorithm performs worse vs. the substring version as the number of content characters between separators grows. It actually starts out performing better than the substring version, by quite a bit (I would think we should be able to optimize there as well, the difference was about half) when the ratio is 1:1 (i.e. splitting "ababababa" on 'a' or "a"), but gets worse as you put more content between the separators. That should be a large clue.

-Steve
Mar 04 2013
On Saturday, 2 March 2013 at 17:20:23 UTC, Andrei Alexandrescu wrote:On 3/1/13 6:26 PM, Steven Schveighoffer wrote:I would have filed a bug report, but I was WAY too busy.On Fri, 01 Mar 2013 18:22:54 -0500, Steven Schveighoffer <schveiguy yahoo.com> wrote:That is a problem. Could you please file a but report? Thanks, AndreiSo it's just pure heuristics. Not hard to see how that would affect a 10 million cycle program. We may be able to create a string-specific version of splitter that would take advantage of the representation.Just found a disturbing artifact: splitter(message, '\n') is more than twice as slow as splitter(message, "\n") -Steve
Mar 04 2013
On 2013-03-01 23:31, Steven Schveighoffer wrote:Phobos kind of refuses to treat strings like arrays of characters, it insists on decoding. With DMD and a hand-written splitter, it takes 6 seconds instead of 10 on my system (64-bit macosx).If you need to write a custom splitter to get acceptable performance then it sounds to me that D or Phobos have failed. -- /Jacob Carlborg
Mar 02 2013
On 3/2/13 11:12 AM, Jacob Carlborg wrote:On 2013-03-01 23:31, Steven Schveighoffer wrote:I wrote a custom splitter specialized for arrays. It's as fast. AndreiPhobos kind of refuses to treat strings like arrays of characters, it insists on decoding. With DMD and a hand-written splitter, it takes 6 seconds instead of 10 on my system (64-bit macosx).If you need to write a custom splitter to get acceptable performance then it sounds to me that D or Phobos have failed.
Mar 02 2013
On Sat, 02 Mar 2013 12:14:07 -0500, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:On 3/2/13 11:12 AM, Jacob Carlborg wrote:Why doesn't it get used for strings in this instance? Maybe there is a bad constraint somewhere. -SteveOn 2013-03-01 23:31, Steven Schveighoffer wrote:I wrote a custom splitter specialized for arrays. It's as fast.Phobos kind of refuses to treat strings like arrays of characters, it insists on decoding. With DMD and a hand-written splitter, it takes 6 seconds instead of 10 on my system (64-bit macosx).If you need to write a custom splitter to get acceptable performance then it sounds to me that D or Phobos have failed.
Mar 02 2013
On 3/2/13 2:37 PM, Steven Schveighoffer wrote:On Sat, 02 Mar 2013 12:14:07 -0500, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:I fear there's a misunderstanding somewhere. Splitter and find are already specialized appropriately for strings. AndreiOn 3/2/13 11:12 AM, Jacob Carlborg wrote:Why doesn't it get used for strings in this instance? Maybe there is a bad constraint somewhere.On 2013-03-01 23:31, Steven Schveighoffer wrote:I wrote a custom splitter specialized for arrays. It's as fast.Phobos kind of refuses to treat strings like arrays of characters, it insists on decoding. With DMD and a hand-written splitter, it takes 6 seconds instead of 10 on my system (64-bit macosx).If you need to write a custom splitter to get acceptable performance then it sounds to me that D or Phobos have failed.
Mar 02 2013
On Sat, 02 Mar 2013 16:16:22 -0500, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:

> I fear there's a misunderstanding somewhere. Splitter and find are already specialized appropriately for strings.

OK, I would then counter that I have created a splitter range that beats its performance, by a lot. As far as I can tell, I'm not doing anything special. All I feel that I'm doing differently is instead of using find or indexOf, I'm looking at each char to see if it matches the first in the separator, and if it does, then I'm comparing the remaining characters. Pretty braindead, probably room for improvement. Splitter should beat this, or match it at least.

From testing the sample code in this thread, my simple version beats splitter by 40%. Something isn't right somewhere, the difference should not be that drastic.

-Steve
Mar 02 2013
On 3/2/13 5:48 PM, Steven Schveighoffer wrote:On Sat, 02 Mar 2013 16:16:22 -0500, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:Prolly a link would clarify that. AndreiI fear there's a misunderstanding somewhere. Splitter and find are already specialized appropriately for strings.OK, I would then counter that I have created a splitter range
Mar 02 2013
On Sat, 02 Mar 2013 18:14:28 -0500, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:On 3/2/13 5:48 PM, Steven Schveighoffer wrote:Forget this, I read your other post. The MySplitter range is the one I was referring to. I saw your points and will respond there, but not today. I don't have time right now to continue the debate. -SteveOn Sat, 02 Mar 2013 16:16:22 -0500, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:Prolly a link would clarify that.I fear there's a misunderstanding somewhere. Splitter and find are already specialized appropriately for strings.OK, I would then counter that I have created a splitter range
Mar 02 2013
On 3/2/2013 8:12 AM, Jacob Carlborg wrote:If you need to write a custom splitter to get acceptable performance then it sounds to me that D or Phobos have failed.In D, you can write custom versions of algorithms to exploit unique characteristics of particular data structures. I don't regard that as a failure. But if, to write a custom version, you had to switch to another language, then I would agree that is a failure of a "systems" programming language.
Mar 02 2013
On 2013-03-02 20:27, Walter Bright wrote:In D, you can write custom versions of algorithms to exploit unique characteristics of particular data structures. I don't regard that as a failure.If you use the most obvious function/way from the standard library to perform a task in Python and you do the same in D. If the D version is slower that's not good. As you've seen here people will use what's available in the standard library and if that's not good enough compared to what's available in other languages' standard libraries, they will complain. If a function in Phobos can't match the corresponding function in another std lib, I think we're doing something wrong. Or the they just happened to use a language where the std lib was optimized better for that particular task. -- /Jacob Carlborg
Mar 03 2013
On 3/3/2013 7:07 AM, Jacob Carlborg wrote:If you use the most obvious function/way from the standard library to perform a task in Python and you do the same in D. If the D version is slower that's not good.Sure. But you should be cognizant of exactly what it is you're comparing, or you're apt to draw very wrong conclusions.
Mar 03 2013
On 3/1/2013 1:28 PM, cvk012c wrote:

> But I still kind of disappointed because I expected a much better performance boost and got only 7%. Counting that Python is not the fastest scripting language I think that similar Perl and Java scripts will outperform D easily. Thanks Andrei and simendsjo for a quick response though.

Python's splitter, which you are measuring, isn't written in Python. It is written in C. You're actually comparing a bit of C code with a bit of D code.
Mar 01 2013
On Saturday, 2 March 2013 at 01:10:39 UTC, Walter Bright wrote:

> Python's splitter, which you are measuring, isn't written in Python. It is written in C. You're actually comparing a bit of C code with a bit of D code.

This. If you wrote the splitter in pure python you would see an enormous performance gap between it and the D version.

Having said that, maybe there are even more improvements we can make to improve speed in that part of Phobos, but considering we're already on a par with some quite mature and optimised C, there really isn't a problem.
Mar 02 2013
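The point above — that a splitter written in pure Python would be enormously slower than the built-in — is easy to make concrete. The sketch below (the function name is mine) splits character by character, avoiding the C-implemented str.split / str.find fast paths, yet produces exactly the same result as str.split; only the timing differs.

```python
def pure_python_split(s: str, sep: str) -> list:
    # Character-by-character split written entirely in Python,
    # deliberately avoiding the C-implemented str.split / str.find.
    parts, cur, i, n = [], [], 0, len(sep)
    while i < len(s):
        if s[i:i + n] == sep:
            parts.append("".join(cur))
            cur = []
            i += n
        else:
            cur.append(s[i])
            i += 1
    parts.append("".join(cur))
    return parts

msg = "REGISTER sip:x SIP/2.0\r\nVia: host\r\n\r\n"
assert pure_python_split(msg, "\r\n") == msg.split("\r\n")
```

Timing either version is left to the reader; the correctness check is the point here — same answer, very different amount of interpreted bytecode per character.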
On 2013-03-02 02:10, Walter Bright wrote:Python's splitter, which you are measuring, isn't written in Python. It is written in C. You're actually comparing a bit of C code with a bit of D code.Does that really matter. He's using Python, if the function is part of the standard library and if it's implement in Python or C doesn't really matter. -- /Jacob Carlborg
Mar 02 2013
On Saturday, 2 March 2013 at 16:09:02 UTC, Jacob Carlborg wrote:On 2013-03-02 02:10, Walter Bright wrote:It does. Failing to beat mature, optimised C is not anything to be ashamed of, but being slower than python would be an abject failure of D as a compiled, systems programming language.Python's splitter, which you are measuring, isn't written in Python. It is written in C. You're actually comparing a bit of C code with a bit of D code.Does that really matter. He's using Python, if the function is part of the standard library and if it's implement in Python or C doesn't really matter.
Mar 02 2013
On 2013-03-02 17:48, John Colvin wrote:It does. Failing to beat mature, optimised C is not anything to be ashamed of, but being slower than python would be an abject failure of D as a compiled, systems programming language.Then D needs to get faster, or we need to switch to C for some std lib functions. In that case, as Walter said, we have failed. -- /Jacob Carlborg
Mar 03 2013
On Sunday, 3 March 2013 at 15:09:58 UTC, Jacob Carlborg wrote:On 2013-03-02 17:48, John Colvin wrote:I agree that anything that makes us faster is good, but I wouldn't go so far as to say we've failed if we're not as fast as the very fastest.It does. Failing to beat mature, optimised C is not anything to be ashamed of, but being slower than python would be an abject failure of D as a compiled, systems programming language.Then D needs to get faster, or we need to switch to C for some std lib functions. In that case, as Walter said, we have failed.
Mar 03 2013
On 2013-03-03 16:41, John Colvin wrote:I agree that anything that makes us faster is good, but I wouldn't go so far as to say we've failed if we're not as fast as the very fastest.No, but if we need to drop down to C to get fast enough. -- /Jacob Carlborg
Mar 03 2013
On 3/3/2013 11:38 PM, Jacob Carlborg wrote:On 2013-03-03 16:41, John Colvin wrote:I can't think of a reason to need to do that.I agree that anything that makes us faster is good, but I wouldn't go so far as to say we've failed if we're not as fast as the very fastest.No, but if we need to drop down to C to get fast enough.
Mar 03 2013
On Sunday, March 03, 2013 23:40:18 Walter Bright wrote:

> On 3/3/2013 11:38 PM, Jacob Carlborg wrote:
>> On 2013-03-03 16:41, John Colvin wrote:
>>> I agree that anything that makes us faster is good, but I wouldn't go so far as to say we've failed if we're not as fast as the very fastest.
>> No, but if we need to drop down to C to get fast enough.
>
> I can't think of a reason to need to do that.

Given how D works, there is something _very_ wrong if we have to drop to C, if nothing else, because it's trivial to write code in D which is equivalent to what it would be in C. If our implementation of something is worse than a C implementation, that merely means that it needs to be refactored and optimized. It's possible that some of our abstractions will hamper efficiency in some cases (e.g. depending on how ranges are used - particularly with strings - you risk a performance hit in comparison to pure C), but that should generally be fixable via specializations and whatnot. If we have an efficiency problem, it's merely an implementation issue which needs to be sorted out. There should be nothing inherent to D which makes it so that you can't write code as fast as C or C++.

- Jonathan M Davis
Mar 04 2013
On 2013-03-04 09:18, Jonathan M Davis wrote:Given how D works, there is something _very_ wrong if we have to drop to C, if nothing else, because it's trivial to write code in D which is equivalent to what it would be in C. If our implementation of something is worse than a C implementation, that merely means that it needs to be refactored and optimized. It's possible that some of our abstractions will hamper efficiency in some cases (e.g. depending on how ranges are used - particularly with strings - you risk a performance hit in comparison to pure C), but that should generally be fixable via specializations and whatnot. If we have an efficiency problem, it's merely an implementation issue which needs to be sorted out. There should be nothing inherent to D which makes it so that you can't write code as fast as C or C++.That's what I'm saying. -- /Jacob Carlborg
Mar 04 2013
On 3/3/2013 7:09 AM, Jacob Carlborg wrote:On 2013-03-02 17:48, John Colvin wrote:Nothing in this thread suggests that D needs to switch its library implementations to C. Interestingly, I tried the indexOf() benchmark using C's memchr() and memcmp() from VC's runtime library. It was not faster than Andrei's optimized D one.It does. Failing to beat mature, optimised C is not anything to be ashamed of, but being slower than python would be an abject failure of D as a compiled, systems programming language.Then D needs to get faster, or we need to switch to C for some std lib functions. In that case, as Walter said, we have failed.
Mar 03 2013
On Sunday, 3 March 2013 at 20:18:58 UTC, Walter Bright wrote:On 3/3/2013 7:09 AM, Jacob Carlborg wrote:Maybe it is time to look at the python implementation and see why it is faster.On 2013-03-02 17:48, John Colvin wrote:Nothing in this thread suggests that D needs to switch its library implementations to C. Interestingly, I tried the indexOf() benchmark using C's memchr() and memcmp() from VC's runtime library. It was not faster than Andrei's optimized D one.It does. Failing to beat mature, optimised C is not anything to be ashamed of, but being slower than python would be an abject failure of D as a compiled, systems programming language.Then D needs to get faster, or we need to switch to C for some std lib functions. In that case, as Walter said, we have failed.
Mar 03 2013
> Maybe it is time to look at the python implementation and see why it is faster.

It isn't faster:

$ time python3 test.py

real    0m14.217s
user    0m14.209s
sys     0m0.004s

$ gdmd -O -inline -release -noboundscheck test
$ time ./test

real    0m5.323s
user    0m5.312s
sys     0m0.008s

D code here uses the same string as the python code, not the one in cvk012c's D code.
Mar 03 2013
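The test.py being timed is not shown in the thread, but from the original post it is presumably along these lines — a sketch in which the message text and field layout are placeholders of mine, and the loop is scaled down from the 10,000,000 iterations the original uses:

```python
# Hypothetical stand-in for the benchmark: split a SIP-like message
# repeatedly and accumulate the field count so the work is consumed.
message = ("REGISTER sip:example.com SIP/2.0\r\n"
           "Via: SIP/2.0/UDP host\r\n"
           "To: <sip:user@example.com>\r\n"
           "\r\n")

total = 0
for _ in range(100_000):  # scaled down from 10,000,000 for illustration
    total += len(message.split("\r\n"))

print(total)
```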
jerro:$ time python3 test.pyAre Python3 strings still like wstrings/dstrings, or have they applied the patch that makes them UTF8? Bye, bearophile
Mar 03 2013
On Mon, 2013-03-04 at 04:31 +0100, bearophile wrote:

[…]

> Are Python3 strings still like wstrings/dstrings, or have they applied the patch that makes them UTF8?

Python3 is Unicode by default, with UTF-8 the default encoding.

--
Dr Russel Winder        t: +44 20 7585 2200   voip: sip:russel.winder ekiga.net
41 Buckmaster Road      m: +44 7770 465 077   xmpp: russel winder.org.uk
London SW11 1EN, UK     w: www.russel.org.uk  skype: russel_winder
Mar 04 2013
On 03/03/2013 07:31 PM, bearophile wrote:jerro:Looks like 3.3 will store them as utf8, but not 3.2 or below.$ time python3 test.pyAre Python3 strings still like wstrings/dstrings, or have they applied the patch that makes them UTF8? Bye, bearophile
Mar 04 2013
On Monday, 4 March 2013 at 03:20:57 UTC, jerro wrote:

> It isn't faster:
>
> $ time python3 test.py
>
> real    0m14.217s
> user    0m14.209s
> sys     0m0.004s
>
> $ gdmd -O -inline -release -noboundscheck test
> $ time ./test
>
> real    0m5.323s
> user    0m5.312s
> sys     0m0.008s

Using -noboundscheck isn't fair as you give up on safety, so you are not equivalent to python either.
Mar 03 2013
On Monday, 4 March 2013 at 03:58:20 UTC, deadalnix wrote:

> Using -noboundscheck isn't fair as you give up on safety, so you are not equivalent to python either.

But that does suggest that the main problem is that DMD has very little optimization aimed at reducing the number of bounds checks required.
Mar 04 2013
On Sun, 03 Mar 2013 22:58:07 -0500, deadalnix <deadalnix gmail.com> wrote:

> Using -noboundscheck isn't fair as you give up on safety, so you are not equivalent to python either.

In this type of test, there is no danger with using noboundscheck. It's a very fair switch to use, D is just able to do it where python is not. A sufficiently smart compiler could eliminate the bounds check for this code, since it can be proven not to go out of bounds (in fact, a simple run without the noboundscheck proves it in very deterministic code like this).

-Steve
Mar 04 2013
On 3/3/13 9:31 PM, deadalnix wrote:On Sunday, 3 March 2013 at 20:18:58 UTC, Walter Bright wrote:Was the conclusion that it's faster? AndreiOn 3/3/2013 7:09 AM, Jacob Carlborg wrote:Maybe it is time to look at the python implementation and see why it is faster.On 2013-03-02 17:48, John Colvin wrote:Nothing in this thread suggests that D needs to switch its library implementations to C. Interestingly, I tried the indexOf() benchmark using C's memchr() and memcmp() from VC's runtime library. It was not faster than Andrei's optimized D one.It does. Failing to beat mature, optimised C is not anything to be ashamed of, but being slower than python would be an abject failure of D as a compiled, systems programming language.Then D needs to get faster, or we need to switch to C for some std lib functions. In that case, as Walter said, we have failed.
Mar 03 2013
On Monday, 4 March 2013 at 04:36:30 UTC, Andrei Alexandrescu wrote:On 3/3/13 9:31 PM, deadalnix wrote:It seems that I was wrong, and missed the point that both benchmark weren't using the same string. Sorry for that.On Sunday, 3 March 2013 at 20:18:58 UTC, Walter Bright wrote:Was the conclusion that it's faster? AndreiOn 3/3/2013 7:09 AM, Jacob Carlborg wrote:Maybe it is time to look at the python implementation and see why it is faster.On 2013-03-02 17:48, John Colvin wrote:Nothing in this thread suggests that D needs to switch its library implementations to C. Interestingly, I tried the indexOf() benchmark using C's memchr() and memcmp() from VC's runtime library. It was not faster than Andrei's optimized D one.It does. Failing to beat mature, optimised C is not anything to be ashamed of, but being slower than python would be an abject failure of D as a compiled, systems programming language.Then D needs to get faster, or we need to switch to C for some std lib functions. In that case, as Walter said, we have failed.
Mar 03 2013
On 2013-03-03 21:18, Walter Bright wrote:Nothing in this thread suggests that D needs to switch its library implementations to C.Then that's good.Interestingly, I tried the indexOf() benchmark using C's memchr() and memcmp() from VC's runtime library. It was not faster than Andrei's optimized D one.-- /Jacob Carlborg
Mar 03 2013
Does that really matter. He's using Python, if the function is part of the standard library and if it's implement in Python or C doesn't really matter.You can look at it that way, but still, the fact that most of the work in the Python version is done by C code makes the timings less surprising. If split() was implemented in pure python and it would be only three times slower (when run with CPython) than std.algorithm.splitter, that would be very surprising and a sign that std.algorithm.splitter is doing something horribly wrong.
Mar 02 2013
On 3/2/13 11:09 AM, Jacob Carlborg wrote:On 2013-03-02 02:10, Walter Bright wrote:The point here is that often one needs to write code that's more than gluing library calls together. I've seen that many, many times. AndreiPython's splitter, which you are measuring, isn't written in Python. It is written in C. You're actually comparing a bit of C code with a bit of D code.Does that really matter. He's using Python, if the function is part of the standard library and if it's implement in Python or C doesn't really matter.
Mar 02 2013
On Friday, 1 March 2013 at 20:30:24 UTC, cvk012c wrote:

> Tried to port my SIP parser from Python to D to boost performance but got opposite result. I created a simple test script which splits SIP REGISTER message 10 million times. Python version takes about 14 seconds to execute, D version is about 23 seconds which is 1.6 times slower. I used DMD 2.062 and compiled my script with options -release and -O. I used Python 3.3 64 bit. I ran both scripts on the same hardware with Windows 7 64. Is this general problem with string performance in D or just splitter() issue? Did anybody compared performance of D string manipulating functions with other languages like Python, Perl, Java and C++?

I'm guessing you are building without optimization options. When compiled with "dmd -O -inline -noboundscheck -release tmp" the D code takes 11.1 seconds on my machine and the python script takes 16.1 seconds.

You can make the D code faster by building with LDC or GDC:

ldmd2 -O -noboundscheck -release tmp: 6.8 seconds
gdmd -O -noboundscheck -inline -release tmp: 6.1 seconds

So no, not slower than Python. I also suspect that much of the work in the Python code is actually done by functions that are implemented in C.
Mar 01 2013
Hm, just recently a friend of mine and I hacked together at the FB hacking cup, he in Python and I in D. My solutions were always faster, by at least a factor of 80. For your example I could not get a factor of 80, but with -inline it is at least faster than the Python version (about 30% faster on my machine).

Best regards,
Robert
Mar 01 2013