digitalmars.D - Why is D slower than LuaJIT?
- Andreas Mayer (10/14) Dec 22 2010 It runs in 281 ms on my computer.
- Trass3r (2/4) Dec 22 2010 LuaJIT seems to have a really good backend.
- Andreas Mayer (2/4) Dec 22 2010 Maybe someone could do that for me? I don't have ldc or gdc here. There ...
- Steven Schveighoffer (9/30) Dec 22 2010 Without any imperical testing, I would guess this has something to do wi...
- BLS (5/6) Dec 22 2010 quote...
- Adam D. Ruppe (4/6) Dec 22 2010 Yeah, this is almost certainly the problem. I rewrote the
- loser (2/9) Dec 22 2010 So is it justified enough to throw my W's incompetence card on the table...
- bearophile (5/6) Dec 22 2010 It's not wise to prematurely improve the inlining a lot right now when t...
- Andrej Mitrovic (4/9) Dec 22 2010 I have just tried removing enforce usage from Phobos and recompiling
- Walter Bright (2/5) Dec 22 2010 Try looking at the obj2asm dump of the inner loop.
- Gary Whatmore (11/32) Dec 22 2010 Your mp3 player or file system was doing stuff while executing the bench...
- BLS (7/8) Dec 22 2010 You talk like a prayer, don't you ? No need to measure. I believe ...
- Ary Borenszweig (10/10) Dec 22 2010 Lua is a proven, robust language
- Andreas Mayer (12/32) Dec 22 2010 Both benchmarks were run under the same conditions. Once the executables...
- Eric Poggel (2/3) Dec 22 2010 Could you point me to more information? This sounds interesting.
- Iain Buclaw (11/20) Dec 22 2010 language, I made a small benchmark. It consists of this code:
- Andreas Mayer (2/4) Dec 22 2010 What else does it do? I want to add it to the Lua version.
- spir (19/41) Dec 22 2010 language, I made a small benchmark. It consists of this code:
- Gary Whatmore (2/13) Dec 22 2010 Are you suggesting D's conferences have smaller publicity value than ACM...
- Ary Borenszweig (1/1) Dec 22 2010 You said Lua is a joke language. It doesn't seem to be one...
- Gary Whatmore (2/3) Dec 22 2010 Okay then, maybe it's not completely true. I meant it doesn't work in la...
- spir (16/21) Dec 22 2010
- Walter Bright (9/11) Dec 22 2010 I notice you are using doubles in D. dmd currently uses the x87 to evalu...
- Andreas Mayer (4/13) Dec 22 2010 Using long instead of double, it is still slower than LuaJIT (223 ms on ...
- Walter Bright (2/3) Dec 22 2010 It depends on what you want to do. A lot of people are successfully usin...
- Lutger Blijdestijn (10/29) Dec 22 2010 You may want to explore the great language shootout before drawing that
- Lutger Blijdestijn (3/3) Dec 22 2010 I meant to link this, it includes all benchmarks and ranks gdc at 5th pl...
- Walter Bright (2/6) Dec 22 2010 I forgot to mention. In the D version, use integers as a loop counter, n...
- bearophile (11/13) Dec 22 2010 I have done (and I am doing) many benchmarks with D, and I too have seen...
- Walter Bright (9/18) Dec 22 2010 The Java JIT did that 15 years ago. I think you forget that I wrote on a...
- bearophile (4/6) Dec 22 2010 I didn't know this, much...
- Andrei Alexandrescu (34/48) Dec 22 2010 Thanks for posting the numbers. That's a long time, particularly
- bearophile (6/8) Dec 22 2010 From other tests I have seen that often FP-heavy code is faster with Lua...
- spir (17/35) Dec 23 2010
- Simen kjaeraas (13/23) Dec 23 2010 With floating-point numbers, the above solution does not always work. If
- bearophile (4/5) Dec 23 2010 The type is known at compile time, so you can split the algorithm in two...
- Simen kjaeraas (7/11) Dec 23 2010 Absolutely. However, in this example doubles were used.
- Andrei Alexandrescu (3/6) Dec 23 2010 That's what the code currently does.
- Andrei Alexandrescu (5/30) Dec 23 2010 You need to look at this specialization:
- Andrei Alexandrescu (6/21) Dec 23 2010 Oops, I was wrong. The two instances of map do something, I thought
- Andrei Alexandrescu (61/75) Dec 22 2010 I reproduced the problem with a test program as shown below. On my
- Andrej Mitrovic (7/97) Dec 22 2010 That's odd, I'm getting opposite results:
- Andreas Mayer (4/6) Dec 22 2010 Your version: 40 ms (iota and baseline give the same timings)
- Andrei Alexandrescu (23/29) Dec 22 2010 Cool, thanks. I also tested against this C++ baseline:
- Brad Roberts (9/47) Dec 22 2010 Since the timing code isn't here, I'm assuming you guys are doing the
- Andrei Alexandrescu (5/13) Dec 23 2010 One thing I didn't mention is that I also measured with 10x the counter
- bearophile (4/10) Dec 23 2010 Is improving the compiler instead an option?
- Andrei Alexandrescu (5/13) Dec 23 2010 It's more of a separate matter than an option. Iota currently does a
- spir (18/24) Dec 23 2010
- Simen kjaeraas (15/34) Dec 23 2010 Nothing. The syntax only works in foreach and opSlice.
- spir (7/24) Dec 23 2010 Nice! (even impressive :-)
- Jonathan M Davis (5/28) Dec 23 2010 I believe that the only place that .. works is within []. If an object o...
- spir (21/49) Dec 23 2010
- spir (60/63) Dec 23 2010 After having watched Iota's very general implementation, I tried the sam...
- Andrei Alexandrescu (4/14) Dec 23 2010 I cut my losses reading here :o). No performance test is meaningful
- Pelle Månsson (11/25) Dec 23 2010 I changed the code to this:
- Andrei Alexandrescu (70/84) Dec 23 2010 I wrote a new test bench and got 41 ms for the baseline and 220 ms for
- Simen kjaeraas (5/10) Dec 23 2010 It seems to me that having a Cached range might be a better, more genera...
- Andrei Alexandrescu (3/11) Dec 23 2010 Agreed.
- Simen kjaeraas (5/6) Dec 23 2010 BTW, shouldn't range constructors call .save for forward ranges? This
- Andrei Alexandrescu (5/9) Dec 23 2010 Currently higher-order ranges assume that the range passed-in is good to...
- Jimmy Cao (5/5) Dec 23 2010 I hope that in the future more implementations in D can be compared for
- bearophile (7/8) Dec 24 2010 This is for the better, to increase language consistency (as in Python).
- Marco Leise (36/48) Jun 01 2013 Actually "D" is 1.5 times faster on my computer*:
To see what performance advantage D would give me over using a scripting language, I made a small benchmark. It consists of this code:

auto L = iota(0.0, 10000000.0);
auto L2 = map!"a / 2"(L);
auto L3 = map!"a + 2"(L2);
auto V = reduce!"a + b"(L3);

It runs in 281 ms on my computer. The same code in Lua (using LuaJIT) runs in 23 ms. That's about 10 times faster. I would have expected D to be faster. Did I do something wrong?

The first Lua version uses a simplified design. I thought maybe that is unfair to ranges, which are more complicated. You could argue ranges have more features and do more work. To make it fair, I made a second Lua version of the above benchmark that emulates ranges. It still runs in 29 ms.

The full D version is here: http://pastebin.com/R5AGHyPx
The Lua version: http://pastebin.com/Sa7rp6uz
Lua version that emulates ranges: http://pastebin.com/eAKMSWyr

Could someone help me solve this mystery? Or is D, unlike I thought, not suitable for high performance computing? What should I do?
Dec 22 2010
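[Editor's note: for anyone re-running the benchmark today, here is a self-contained timing harness along the lines of the pastebin version. This is a sketch using current Phobos (`std.datetime.stopwatch`); the 2010-era timing API differed.]

```d
import std.algorithm : map, reduce;
import std.range : iota;
import std.datetime.stopwatch : AutoStart, StopWatch;
import std.stdio : writeln;

// The benchmark pipeline from the original post, wrapped in a function.
double runPipeline()
{
    auto L  = iota(0.0, 10_000_000.0);
    auto L2 = map!"a / 2"(L);
    auto L3 = map!"a + 2"(L2);
    return reduce!"a + b"(L3);
}

void main()
{
    auto sw = StopWatch(AutoStart.yes);
    auto V = sw.peek, _ = 0; // placeholder; real timing below
    auto result = runPipeline();
    sw.stop();
    // Printing the result guards against the pipeline being optimized away.
    writeln(result, " computed in ", sw.peek.total!"msecs", " ms");
}
```

Compile with `dmd -O -release -inline` (as discussed below) to match the numbers in this thread.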
> Or is D, unlike I thought, not suitable for high performance computing? What should I do?

LuaJIT seems to have a really good backend. You better compare with ldc or gdc.
Dec 22 2010
Trass3r Wrote:
> LuaJIT seems to have a really good backend. You better compare with ldc or gdc.

Maybe someone could do that for me? I don't have ldc or gdc here. There are some Debian packages, but they are D version 1 only?
Dec 22 2010
On Wed, 22 Dec 2010 17:04:21 -0500, Andreas Mayer <spam bacon.eggs> wrote:
> To see what performance advantage D would give me over using a scripting language, I made a small benchmark. [...] It runs in 281 ms on my computer. The same code in Lua (using LuaJIT) runs in 23 ms. [...] Could someone help me solve this mystery?

Without any empirical testing, I would guess this has something to do with the lack of inlining for algorithmic functions. This is due primarily to uses of enforce, which use lazy parameters, which are currently not inlinable (also, ensure you use -O -release -inline for the most optimized code). I hope that someday this is solved, because it doesn't look very good for D...

-Steve
Dec 22 2010
On 22/12/2010 23:06, Steven Schveighoffer wrote:
> (also, ensure you use -O -release -inline for the most optimized code).

quote...
// D version, with std.algorithm
// ~ 281 ms, using dmd 2.051 (dmd -O -release -inline)

Sometimes having a look at the source helps :)
Dec 22 2010
Steven Schveighoffer wrote:
> I would guess this has something to do with the lack of inlining for algorithmic functions.

Yeah, this is almost certainly the problem. I rewrote the code using a traditional C style loop, no external functions, and I'm getting roughly equal performance.
Dec 22 2010
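[Editor's note: Adam's post doesn't include his rewrite; the C-style loop presumably looked something like this reconstruction, which computes the same sum as the range pipeline.]

```d
import std.stdio : writeln;

void main()
{
    // Same computation as the range pipeline, as one hand-written loop:
    // sum of (x / 2 + 2) for x in [0, 10_000_000).
    double sum = 0.0;
    for (double x = 0.0; x < 10_000_000.0; ++x)
        sum += x / 2 + 2;
    writeln(sum);
}
```

This version has exactly one addition, one division, and one increment per element, with nothing left for the compiler to fail to inline.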
Adam D. Ruppe Wrote:
>> I would guess this has something to do with the lack of inlining for algorithmic functions.
> Yeah, this is almost certainly the problem. I rewrote the code using a traditional C style loop, no external functions, and I'm getting roughly equal performance.

So is it justified enough to throw my W's incompetence card on the table at this point? How else is it possible that a simple scripting language with simple JIT optimization heuristics can outperform a performance oriented systems programming language? It seems most D design decisions are based on the perceived performance value (though not as aggressively as in C++ groups). I'd like to see how this theory doesn't hold water now?
Dec 22 2010
loser:
> So is it justified enough to throw my W's incompetence card on the table at this point? How else is it possible that a simple scripting language with simple JIT optimization heuristics can outperform a performance oriented systems programming language?

It's not wise to prematurely improve the inlining a lot right now when there is no 64 bit version yet, and there are holes or missing parts in several corners of the language. Performance tuning has a lower priority. Designing a good language and performance-tuning its implementation ask for different skills. The very good author of LuaJIT is probably not good at designing a C++-class language :-) What's needed now is to smooth the rough corners of the D language, not to squeeze out every bit of performance.

Bye,
bearophile
Dec 22 2010
On 12/22/10, Steven Schveighoffer <schveiguy yahoo.com> wrote:
> Without any empirical testing, I would guess this has something to do with the lack of inlining for algorithmic functions. This is due primarily to uses of enforce, which use lazy parameters, which are currently not inlinable (also, ensure you use -O -release -inline for the most optimized code).

I have just tried removing enforce usage from Phobos and recompiling the library, and compiling again with -O -release -inline. It doesn't appear to make a difference in the timing speed over multiple runs.
Dec 22 2010
Andrej Mitrovic wrote:
> I have just tried removing enforce usage from Phobos and recompiling the library, and compiling again with -O -release -inline. It doesn't appear to make a difference in the timing speed over multiple runs.

Try looking at the obj2asm dump of the inner loop.
Dec 22 2010
Andreas Mayer Wrote:
> To see what performance advantage D would give me over using a scripting language, I made a small benchmark. It consists of this code:
>
> auto L = iota(0.0, 10000000.0);
> auto L2 = map!"a / 2"(L);
> auto L3 = map!"a + 2"(L2);
> auto V = reduce!"a + b"(L3);
>
> It runs in 281 ms on my computer. The same code in Lua (using LuaJIT) runs in 23 ms.

First note: this is a synthetic toy benchmark. Take it with a grain of salt. It in no way represents the true state of D.

Your mp3 player or file system was doing stuff while executing the benchmark. You probably don't know how to run the test many times and use the average/minimum result for both languages. For example, D does not have JIT startup cost, so take the minimum result for D; JIT has varying startup speed, so take the average or slowest result for LuaJIT. Compare these. More fair for native code D.

> Could someone help me solve this mystery?

My guesses are:

1) you didn't even test this and didn't use optimizations. -> User error
2) whenever doing benchmarks you must compare the competing stuff against all D compilers, cut and paste the object code of different compilers and manually build the fastest executable.
3) you didn't use inline assembler or a profiler for D
4) you were using unstable Phobos functions. There is no doubt the final Phobos 2.0 will beat LuaJIT. D *is* a statically compiled language, LuaJIT just a joke.
5) you were using the old D runtime garbage collector. One fellow here made a patch; patch your dmd to use this instead.

Not intending to start a religious war, but if your native code runs slower than *JIT* code, you're doing something wrong. D will always beat JIT. Lua is also a joke language; D is for high performance servers and operating systems. In the worst case, disassemble the LuaJIT program, steal its codes and write it using inline assembler in D. D must win these performance battles. It's technically superior.

> Or is D, unlike I thought, not suitable for high performance computing? What should I do?

It is. D a) takes time to mature and b) you didn't fully utilize the compiler.
Dec 22 2010
On 22/12/2010 23:31, Gary Whatmore wrote:
> Not intending to start a religious war but if your native code runs slower than *JIT* code, you're doing something wrong. D will always beat JIT.

You talk like a preacher, don't you? No need to measure, I believe... Anyway, I don't care about Lua. The dynamic ASM generation, on the other hand, is something which should seriously be considered for D templates. Why? What comes immediately to mind: no more bloated executables.

my 2 euro cents
Dec 22 2010
Lua is a proven, robust language.

Lua has been used in many industrial applications (e.g., Adobe's Photoshop Lightroom), with an emphasis on embedded systems (e.g., the Ginga middleware for digital TV in Brazil) and games (e.g., World of Warcraft). Lua is currently the leading scripting language in games. Lua has a solid reference manual and there are several books about it. Several versions of Lua have been released and used in real applications since its creation in 1993. Lua featured in HOPL III, the Third ACM SIGPLAN History of Programming Languages Conference, in June 2007.
Dec 22 2010
Gary Whatmore Wrote:
> First note: this is a synthetic toy benchmark. Take it with a grain of salt. It in no way represents the true state of D.

True enough. Yet it doesn't make me very optimistic about what the final performance tradeoff would be as long as you use high level abstractions. Sure, with D you can always go down to the C or even assembly level.

> Your mp3 player or file system was doing stuff while executing the benchmark. You probably don't know how to run the test many times and use the average/minimum result for both languages. [...]

Both benchmarks were run under the same conditions. Once the executables were inside the disk cache, the run times didn't vary much. Plus this benchmark is already unfair against LuaJIT: the startup time and the time needed for optimization and code generation are included in the times I gave. The D example on the other hand doesn't include the time needed for compilation. The D compiler needs 360 ms to compile this example. If the comparison were fair and included compilation time in the D timings, D would lose even more.

> 1) you didn't even test this and didn't use optimizations. -> User error

I enabled all dmd optimizations I was aware of. Maybe I forgot some?

> 2) whenever doing benchmarks you must compare the competing stuff against all D compilers, cut and paste the object code of different compilers and manually build the fastest executable.

That seems like an unreasonable task. Writing the code in assembler would be simpler. But I'm using a high level language because I want to use high level abstractions, like map and reduce, instead of writing assembler.

> 3) you didn't use inline assembler or profiler for D

See 2).

> 4) you were using unstable Phobos functions. There is no doubt the final Phobos 2.0 will beat LuaJIT. D *is* a statically compiled language, LuaJIT just a joke.

I used the latest dmd release (and that is very new). As you can see, LuaJIT beats D by far. I wouldn't call it a joke. If a joke beats D, then what is D? That line of argument doesn't work in your favor.

> 5) you were using the old D runtime garbage collector. One fellow here made a patch; patch your dmd to use this instead.

There shouldn't be any GC activity. Ranges work lazily. They don't allocate arrays for the data they are working on. You could post a package with "bleeding edge" dmd and Phobos sources with updated GC and so on. Then I could try that.

> Not intending to start a religious war but if your native code runs slower than *JIT* code, you're doing something wrong. D will always beat JIT. [...] D must win these performance battles. It's technically superior.

But D didn't win. Not here. And what was I doing wrong? Please point it out. I posted this because I was surprised myself and thought "that can't be".
Dec 22 2010
On 12/22/2010 5:31 PM, Gary Whatmore wrote:
> 5) you were using the old D runtime garbage collector. One fellow here made a patch; patch your dmd to use this instead.

Could you point me to more information? This sounds interesting.
Dec 22 2010
== Quote from Andreas Mayer (spam bacon.eggs)'s article
> To see what performance advantage D would give me over using a scripting language, I made a small benchmark. It consists of this code: [...]
> The first Lua version uses a simplified design. I thought maybe that is unfair to ranges, which are more complicated. You could argue ranges have more features and do more work. To make it fair, I made a second Lua version of the above benchmark that emulates ranges. It still runs in 29 ms.

One corner case doesn't amount to an end-all be-all proof of anything. As has already been echoed, the lack of inlining of algorithmic functions may be one reason for the added cost at runtime. Another may be simply that there is a lot more going on behind the scenes than what you give credit for in D.

Regards :~)
Dec 22 2010
Iain Buclaw Wrote:
> Another may be simply that there is a lot more going on behind the scenes than what you give credit for in D.

What else does it do? I want to add it to the Lua version.
Dec 22 2010
On Wed, 22 Dec 2010 17:04:21 -0500 Andreas Mayer <spam bacon.eggs> wrote:
> To see what performance advantage D would give me over using a scripting language, I made a small benchmark. It consists of this code:
>
> auto L = iota(0.0, 10000000.0);
> auto L2 = map!"a / 2"(L);
> auto L3 = map!"a + 2"(L2);
> auto V = reduce!"a + b"(L3);
>
> It runs in 281 ms on my computer. The same code in Lua (using LuaJIT) runs in 23 ms. That's about 10 times faster. I would have expected D to be faster. Did I do something wrong? [...]
> Could someone help me solve this mystery? Or is D, unlike I thought, not suitable for high performance computing? What should I do?

Dunno why D seems slow. But Lua is a very fast dynamic language, very simple & rather low level (both on the language & implementation sides). Benchmark trials in Lua often run much faster than Python or Ruby equivalents (often 10x). Depending on the domain, LuaJIT often adds a speed factor of an order of magnitude. This altogether brings performance comparable to some compiled languages, using high-level features such as virtual funcs, GC,... higher-order funcs, ranges ;-)

denis
-- -- -- -- -- -- --
vit esse estrany ☣

spir.wikidot.com
Dec 22 2010
Ary Borenszweig Wrote:
> Lua is a proven, robust language. Lua has been used in many industrial applications (e.g., Adobe's Photoshop Lightroom) [...] Lua featured in HOPL III, the Third ACM SIGPLAN History of Programming Languages Conference, in June 2007.

Are you suggesting D's conferences have smaller publicity value than ACM SIGPLAN HOPL? D has had several international conferences (videos available online) and the ACCU fellowship likes D. The NWCPP group likes D. I find it hard to believe D wouldn't replace Lua everywhere, when the time comes.
Dec 22 2010
You said Lua is a joke language. It doesn't seem to be one...
Dec 22 2010
Ary Borenszweig Wrote:
> You said Lua is a joke language. It doesn't seem to be one...

Okay then, maybe it's not completely true. I meant it doesn't work in large scale applications, unlike a static systems programming language. I need to study how extensively it's used in that game. I just think being popular doesn't mean it's not a joke. Think about PHP!
Dec 22 2010
On Wed, 22 Dec 2010 18:26:33 -0500 Gary Whatmore <no spam.sp> wrote:
> Ary Borenszweig Wrote:
>> You said Lua is a joke language. It doesn't seem to be one...
> Okay then, maybe it's not completely true. I meant it doesn't work in large scale applications unlike a static systems programming language. Need to study how extensively it's used in that game. I just think being popular doesn't mean it's not a joke. Think about PHP !

I find your assertions rather pointless, if not meaningless. Lua is certainly a joke language in D's preferred application domains, just like D is a joke language as an embedded, data-description, or user-scripting language. And precisely: the fact that Lua (well helped by LuaJIT) can outperform D on its own terrain --even if on a single example-- should give you *even more _respect_* for this language and its designers. (And raise serious worries about D ;-)

Denis
-- -- -- -- -- -- --
vit esse estrany ☣

spir.wikidot.com
Dec 22 2010
Andreas Mayer wrote:
> Or is D, unlike I thought, not suitable for high performance computing? What should I do?

I notice you are using doubles in D. dmd currently uses the x87 to evaluate doubles, and on some processors the x87 is slow relative to using the XMM instructions. Also, dmd's back end doesn't align doubles on 16 byte boundaries, which can also slow down the floating point on some processors. Both of these code gen issues with dmd are well known, and I'd like to solve them after we address higher priority issues. If it's not clear, I'd like to emphasize that these are compiler issues, not D language issues.
Dec 22 2010
Walter Bright Wrote:
> I notice you are using doubles in D. dmd currently uses the x87 to evaluate doubles, and on some processors the x87 is slow relative to using the XMM instructions. Also, dmd's back end doesn't align doubles on 16 byte boundaries, which can also slow down the floating point on some processors.

Using long instead of double, it is still slower than LuaJIT (223 ms on my machine). Even with int it still takes 101 ms and is at least 3x slower than LuaJIT.

> Both of these code gen issues with dmd are well known, and I'd like to solve them after we address higher priority issues. If it's not clear, I'd like to emphasize that these are compiler issues, not D language issues.

I shouldn't use D now? How long until it is ready?
Dec 22 2010
Andreas Mayer wrote:
> I shouldn't use D now? How long until it is ready?

It depends on what you want to do. A lot of people are successfully using D.
Dec 22 2010
Andreas Mayer wrote:
> Using long instead of double, it is still slower than LuaJIT (223 ms on my machine). Even with int it still takes 101 ms and is at least 3x slower than LuaJIT. [...]
> I shouldn't use D now? How long until it is ready?

You may want to explore the great language shootout before drawing that conclusion: http://shootout.alioth.debian.org/

LuaJIT ranks high there, but still a bit below the fastest compiled languages (and the fastest Java). D is not included anymore, but it once was and these benchmarks can still be found: http://shootout.alioth.debian.org/debian/performance.php

LuaJIT performance is impressive, far above any 'scripting' language. Just look at some numbers in the shootout comparing it to Ruby or Python.
Dec 22 2010
I meant to link this, it includes all benchmarks and ranks gdc at 5th place and dmd at 8 (from 2008): http://shootout.alioth.debian.org/debian/benchmark.php?test=all&lang=all
Dec 22 2010
Walter Bright wrote:
> Andreas Mayer wrote:
>> Or is D, unlike I thought, not suitable for high performance computing? What should I do?

I forgot to mention. In the D version, use integers as a loop counter, not doubles.
Dec 22 2010
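[Editor's note: Walter's suggestion amounts to something like the following sketch. The counter is an int, so only the per-element arithmetic is done in floating point; the division uses `2.0` so the result still matches the original pipeline.]

```d
import std.stdio : writeln;

void main()
{
    // Integer loop counter, per Walter's advice: the increment and the
    // loop-bound compare stay in integer registers; only the element
    // computation (i / 2.0 + 2) touches the FPU.
    double sum = 0.0;
    foreach (i; 0 .. 10_000_000)
        sum += i / 2.0 + 2;
    writeln(sum);
}
```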
Andreas Mayer:
> To see what performance advantage D would give me over using a scripting language, I made a small benchmark. It consists of this code:

I have done (and I am doing) many benchmarks with D, and I too have seen similar results. I have discussed this topic two times in the past:
http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.D&article_id=110419
http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.D&article_id=110420

For floating-point-heavy code LuaJIT is often faster than D compiled with DMD. I have found that on the SciMark2 benchmark too LuaJIT is faster than D code compiled with DMD. On the other hand, if I use LDC I am often able to beat LuaJIT 2.0.0-beta3 (we are now at beta5), if the D code doesn't ask for too much inlining.

LuaJIT is written by a very smart person, maybe a kind of genius, who has recently given ideas to the designers of V8 and the Firefox JS engine. LuaJIT uses SSE registers very well, and being a JIT it has more runtime information about the code, so it is able to optimize it better. It unrolls dynamically, inlines dynamic things, etc.

DMD doesn't perform enough optimizations. Keep in mind that the main purpose of DMD is now to finish implementing D (and sometimes to find what to implement! Because there are some unfinished corners in D's design). Performance tuning is mostly for later.

-------------------------

Walter Bright:
> If it's not clear, I'd like to emphasize that these are compiler issues, not D language issues.

Surely Lua looks like a far worse language regarding optimization opportunities. But people around here (like you) must start to realize that JIT compilation is not what it used to be. Today the JIT compilation done by the Java VM is able to perform de-virtualization, dynamic loop unrolling, inlining across "compilation units", and some other optimizations that, despite being "not language issues", are not done or not done enough by static compilers like LDC, GCC, DMD. The result is that the SciMark2 benchmark is about as fast in Java as in C, and for some sub-benchmarks it is faster :-)

Bye,
bearophile
Dec 22 2010
bearophile wrote:
> Surely Lua looks like a far worse language regarding optimization opportunities. But people around here (like you) must start to realize that JIT compilation is not what it used to be. Today the JIT compilation done by the JavaVM is able to perform de-virtualization, dynamic loop unrolling,

The Java JIT did that 15 years ago. I think you forget that I wrote on a Java compiler way back in the day (the companion JIT was done by Steve Russell, yep, the Optlink guy).

> inlining across "compilation units",

dmd does cross-module inlining.

> and some other optimizations that are not done or not done enough by static compilers like LDC, GCC, DMD. The result is that SciMark2 benchmark is about as fast in Java and C, and for some sub-benchmarks it is faster :-)

Inherent Java slowdowns are not in numerical code. The Java language isn't inherently worse at numerics than C, C++, D, etc. Where Java is inherently worse is in its excessive reliance on dynamic allocation (and that is rare in numeric code - you don't "new" a double).
Dec 22 2010
> I think you forget that I wrote on a Java compiler way back in the day

I remember it :-)

> dmd does cross-module inlining.

I didn't know this, much...

Bye,
bearophile
Dec 22 2010
On 12/22/10 4:04 PM, Andreas Mayer wrote:
> To see what performance advantage D would give me over using a scripting language, I made a small benchmark. It consists of this code:
>
> auto L = iota(0.0, 10000000.0);
> auto L2 = map!"a / 2"(L);
> auto L3 = map!"a + 2"(L2);
> auto V = reduce!"a + b"(L3);
>
> It runs in 281 ms on my computer.

Thanks for posting the numbers. That's a long time, particularly considering that the two map instances don't do anything. So the bulk of the computation is:

auto L = iota(0.0, 10000000.0);
auto V = reduce!"a + b"(L3);

There is one inherent problem that affects the speed of iota: in iota, the value at position i is computed as 0.0 + i * step, where step is computed from the limits. That's one addition and a multiplication for each pass through iota. Given that the actual workload of the loop is only one addition, we are doing a lot more work. I suspect that that's the main issue there. The reason for which iota does that instead of the simpler increment is that iota must iterate the same values forward and backward. Using ++ may interact with floating-point vagaries, so the code is currently conservative.

Another issue is the implementation of reduce. Reduce is fairly general, which may mean that it generates mediocre code for that particular case. We can always optimize the general case and perhaps specialize for select cases.

Once we figure out where the problem is, there are numerous possibilities to improve the code:

1. Have iota check in the constructor whether the limits allow ++ to be precise. If so, use that. Of course, that means an extra runtime test...

3. Give up on iota being a random access or bidirectional range. If it's a forward range, we don't need to worry about going backwards.

4. Improve reduce as described above.

> The same code in Lua (using LuaJIT) runs in 23 ms. That's about 10 times faster. I would have expected D to be faster. Did I do something wrong? [...]
> Could someone help me solve this mystery? Or is D, unlike I thought, not suitable for high performance computing? What should I do?

Thanks very much for taking the time to measure and post results, this is very helpful. As this test essentially measures the performance of iota and reduce, it would be hasty to generalize the assessment. Nevertheless, we need to look into improving this particular microbenchmark. Please don't forget to print the result of the computation in both languages, as there's always the possibility of some oversight.

Andrei
Dec 22 2010
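Andrei's point about iota's per-element cost can be made concrete. The sketch below is illustrative code, not the Phobos implementation: it contrasts computing each value as start + i * step (one multiply plus one add per element, but exact and reversible) against a plain running increment (one add per element). The function names are made up for the example.

```d
import std.stdio;

// index-based: each value is recomputed from the index (exact,
// reversible, but one multiply + one add per element)
double sumIndexBased(double start, double step, size_t n)
{
    double sum = 0.0;
    foreach (i; 0 .. n)
        sum += start + i * step;   // the strategy iota uses for floating point
    return sum;
}

// increment-based: one add per element, but the increment may stall
// for floating point (see the 2^24 example later in the thread)
double sumIncrement(double start, double step, size_t n)
{
    double sum = 0.0;
    double current = start;
    foreach (i; 0 .. n)
    {
        sum += current;
        current += step;
    }
    return sum;
}

void main()
{
    // for step == 1.0 and values well below 2^53, both are exact
    assert(sumIndexBased(0.0, 1.0, 10_000_000)
        == sumIncrement(0.0, 1.0, 10_000_000));
    writefln("%f", sumIndexBased(0.0, 1.0, 10_000_000));
}
```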
Andrei:As this test essentially measures the performance of iota and reduce, it would be hasty to generalize the assessment.From other tests I have seen that often FP-heavy code is faster with Lua-JIT than with D-DMD. But on average the speed difference is much less than 10 times, generally no more than 2 times. One benchmark, Lua and D code (both OOP and C-style included, plus several manually optimized D versions): http://tinyurl.com/yeo2g8j Bye, bearophile
Dec 22 2010
On Wed, 22 Dec 2010 20:16:45 -0600 Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:Thanks for posting the numbers. That's a long time, particularly considering that the two map instances don't do anything. So the bulk of the computation is: auto L = iota(0.0, 10000000.0); auto V = reduce!"a + b"(L3); There is one inherent problem that affects the speed of iota: in iota, the value at position i is computed as 0.0 + i * step, where step is computed from the limits. That's one addition and a multiplication for each pass through iota. Given that the actual workload of the loop is only one addition, we are doing a lot more work. I suspect that that's the main issue there. The reason for which iota does that instead of the simpler increment is that iota must iterate the same values forward and backward. Using ++ may interact with floating-point vagaries, so the code is currently conservative.There is a point I don't understand here: Iota is a range-struct template, with void popFront() { current += step; } So, how does the computation of an arbitrary element at a given index affect looping speed? For mappings (and any kind of traversal, indeed), there should be an addition per element. Else, why define a range interface at all? What do I miss? Denis -- vit esse estrany ☣ spir.wikidot.com
Dec 23 2010
spir <denis.spir gmail.com> wrote:There is a point I don't understand here: Iota is a range-struct template, with void popFront() { current += step; } So, how does the computation of an arbitrary element at a given index affect looping speed? For mappings (and any kind of traversal, indeed), there should be an addition per element. Else, why define a range interface at all? What do I miss?With floating-point numbers, the above solution does not always work. If step == 1, increasing current by step amount will stop working at some point, at which the range will then grind to a halt. If instead one multiplies step by the current number of steps taken, and adds to the origin, this problem disappears. As an example of when this problem shows up, try this code: float f = 16_777_216; auto f2 = f + 1; assert( f == f2 ); The assert passes. -- Simen
Dec 23 2010
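Simen's example generalizes: with 32-bit floats the increment stalls at 2^24, so a range that advances by `current += 1` past that point never reaches its end bound, while the index-based formula start + i * step keeps advancing because the index i itself grows exactly. A small sketch:

```d
void main()
{
    // a 32-bit float has a 24-bit significand: at 2^24, adding 1 is lost
    float f = 16_777_216;           // 2^24
    assert(f + 1 == f);             // the increment vanishes entirely

    // the index-based formula sidesteps the stall:
    // 2^24 + 2 is even, hence exactly representable as a float
    float start = 16_777_216;
    assert(start + 2 * 1.0f == 16_777_218.0f);
}
```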
Simen kjaeraas:With floating-point numbers, the above solution does not always work.The type is known at compile time, so you can split the algorithm in two with a "static if", and do something else if it's an integral type. Bye, bearophile
Dec 23 2010
bearophile <bearophileHUGS lycos.com> wrote:Simen kjaeraas:Absolutely. However, in this example doubles were used. Also, though it may be doable for integers, other types may also want that optimization (BigInt comes to mind). A behavesAsIntegral!T might fix that. -- SimenWith floating-point numbers, the above solution does not always work.The type is known at compile time, so you can split the algorithm in two with a "static if", and do something else if it's an integral type.
Dec 23 2010
On 12/23/10 6:57 AM, bearophile wrote:Simen kjaeraas:That's what the code currently does. AndreiWith floating-point numbers, the above solution does not always work.The type is known at compile time, so you can split the algorithm in two with a "static if", and do something else if it's an integral type.
Dec 23 2010
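The shape of such a static-if split, as a sketch only (the real iota in std.range is more elaborate and supports arbitrary steps; this one fixes the step at 1):

```d
import std.traits : isFloatingPoint;

// sketch: choose the iteration strategy at compile time
struct SimpleIota(T)
{
    T start, end;
    static if (isFloatingPoint!T)
    {
        // floating point: recompute from an exact integer index
        size_t index;
        @property bool empty() const { return start + index >= end; }
        @property T front() const { return start + index; }
        void popFront() { ++index; }
    }
    else
    {
        // integers: a plain increment is always exact
        @property bool empty() const { return start >= end; }
        @property T front() const { return start; }
        void popFront() { ++start; }
    }
}

void main()
{
    import std.algorithm : equal;
    assert(equal(SimpleIota!int(0, 4), [0, 1, 2, 3]));
    assert(equal(SimpleIota!double(0.0, 3.0), [0.0, 1.0, 2.0]));
}
```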
On 12/23/10 6:22 AM, spir wrote:On Wed, 22 Dec 2010 20:16:45 -0600 Andrei Alexandrescu<SeeWebsiteForEmail erdani.org> wrote:You need to look at this specialization: http://www.dsource.org/projects/phobos/browser/trunk/phobos/std/range.d#L3800 and keep in mind Simen's explanation. AndreiThanks for posting the numbers. That's a long time, particularly considering that the two map instances don't do anything. So the bulk of the computation is: auto L = iota(0.0, 10000000.0); auto V = reduce!"a + b"(L3); There is one inherent problem that affects the speed of iota: in iota, the value at position i is computed as 0.0 + i * step, where step is computed from the limits. That's one addition and a multiplication for each pass through iota. Given that the actual workload of the loop is only one addition, we are doing a lot more work. I suspect that that's the main issue there. The reason for which iota does that instead of the simpler increment is that iota must iterate the same values forward and backward. Using ++ may interact with floating-point vagaries, so the code is currently conservative.There is a point I don't understand here: Iota is a range-struct template, with void popFront() { current += step; }
Dec 23 2010
On 12/22/10 8:16 PM, Andrei Alexandrescu wrote:On 12/22/10 4:04 PM, Andreas Mayer wrote:Oops, I was wrong. The two instances of map do something, I thought they're all applied to L when in fact they are chained. So my estimates are incorrect. At any rate, clearly iota incurs a 2x cost, which probably composes with other similar costs incurred by map. AndreiTo see what performance advantage D would give me over using a scripting language, I made a small benchmark. It consists of this code:Thanks for posting the numbers. That's a long time, particularly considering that the two map instances don't do anything. So the bulk of the computation is: auto L = iota(0.0, 10000000.0); auto V = reduce!"a + b"(L3);auto L = iota(0.0, 10000000.0); auto L2 = map!"a / 2"(L); auto L3 = map!"a + 2"(L2); auto V = reduce!"a + b"(L3);It runs in 281 ms on my computer.
Dec 23 2010
On 12/22/10 4:04 PM, Andreas Mayer wrote:To see what performance advantage D would give me over using a scripting language, I made a small benchmark. It consists of this code:I reproduced the problem with a test program as shown below. On my machine the D iota runs in 108 ms, whereas a baseline using a handwritten loop runs in 43 ms. I then replaced iota's implementation with a simpler one that's a forward range. Then the performance became exactly the same as for the simple loop. Andreas, any chance you could run this on your machine and compare it with Lua? (I don't have Lua installed.) Thanks! Andrei

// D version, with std.algorithm
// ~ 281 ms, using dmd 2.051 (dmd -O -release -inline)
import std.algorithm;
import std.stdio;
import std.range;
import std.traits;

struct Iota2(N, S) if (isFloatingPoint!N && isNumeric!S)
{
    private N start, end, current;
    private S step;

    this(N start, N end, S step)
    {
        this.start = start;
        this.end = end;
        this.step = step;
        current = start;
    }

    /// Range primitives
    @property bool empty() const { return current >= end; }

    /// Ditto
    @property N front() { return current; }

    /// Ditto
    alias front moveFront;

    /// Ditto
    void popFront()
    {
        assert(!empty);
        current += step;
    }

    @property Iota2 save() { return this; }
}

auto iota2(B, E, S)(B begin, E end, S step)
if (is(typeof((E.init - B.init) + 1 * S.init)))
{
    return Iota2!(CommonType!(Unqual!B, Unqual!E), S)(begin, end, step);
}

void main(string args[])
{
    double result;
    auto limit = 10_000_000.0;
    if (args.length > 1)
    {
        writeln("iota");
        auto L = iota2(0.0, limit, 1.0);
        result = reduce!"a + b"(L);
    }
    else
    {
        writeln("baseline");
        result = 0.0;
        for (double i = 0; i != limit; ++i)
        {
            result += i;
        }
    }
    writefln("%f", result);
}

auto L = iota(0.0, 10000000.0); auto L2 = map!"a / 2"(L); auto L3 = map!"a + 2"(L2); auto V = reduce!"a + b"(L3);It runs in 281 ms on my computer. The same code in Lua (using LuaJIT) runs in 23 ms. That's about 10 times faster. I would have expected D to be faster. Did I do something wrong? 
The first Lua version uses a simplified design. I thought maybe that is unfair to ranges, which are more complicated. You could argue ranges have more features and do more work. To make it fair, I made a second Lua version of the above benchmark that emulates ranges. It is still 29 ms fast. The full D version is here: http://pastebin.com/R5AGHyPx The Lua version: http://pastebin.com/Sa7rp6uz Lua version that emulates ranges: http://pastebin.com/eAKMSWyr Could someone help me solving this mystery? Or is D, unlike I thought, not suitable for high performance computing? What should I do?
Dec 22 2010
That's odd, I'm getting opposite results: iota = 78ms baseline = 187ms Andreas' old code gives: 421ms This is over multiple runs so I'm getting the average out of about 20 runs. On 12/23/10, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:On 12/22/10 4:04 PM, Andreas Mayer wrote:To see what performance advantage D would give me over using a scripting language, I made a small benchmark. It consists of this code:I reproduced the problem with a test program as shown below. On my machine the D iota runs in 108ms, whereas a baseline using a handwritten loop runs in 43 ms. I then replaced iota's implementation with a simpler one that's a forward range. Then the performance became exactly the same as for the simple loop. Andreas, any chance you could run this on your machine and compare it with Lua? (I don't have Lua installed.) Thanks! Andrei // D version, with std.algorithm // ~ 281 ms, using dmd 2.051 (dmd -O -release -inline) import std.algorithm; import std.stdio; import std.range; import std.traits; struct Iota2(N, S) if (isFloatingPoint!N && isNumeric!S) { private N start, end, current; private S step; this(N start, N end, S step) { this.start = start; this.end = end; this.step = step; current = start; } /// Range primitives property bool empty() const { return current >= end; } /// Ditto property N front() { return current; } /// Ditto alias front moveFront; /// Ditto void popFront() { assert(!empty); current += step; } property Iota2 save() { return this; } } auto iota2(B, E, S)(B begin, E end, S step) if (is(typeof((E.init - B.init) + 1 * S.init))) { return Iota2!(CommonType!(Unqual!B, Unqual!E), S)(begin, end, step); } void main(string args[]) { double result; auto limit = 10_000_000.0; if (args.length > 1) { writeln("iota"); auto L = iota2(0.0, limit, 1.0); result = reduce!"a + b"(L); } else { writeln("baseline"); result = 0.0; for (double i = 0; i != limit; ++i) { result += i; } } writefln("%f", result); }auto L = iota(0.0, 10000000.0); auto L2 = 
map!"a / 2"(L); auto L3 = map!"a + 2"(L2); auto V = reduce!"a + b"(L3);It runs in 281 ms on my computer. The same code in Lua (using LuaJIT) runs in 23 ms. That's about 10 times faster. I would have expected D to be faster. Did I do something wrong? The first Lua version uses a simplified design. I thought maybe that is unfair to ranges, which are more complicated. You could argue ranges have more features and do more work. To make it fair, I made a second Lua version of the above benchmark that emulates ranges. It is still 29 ms fast. The full D version is here: http://pastebin.com/R5AGHyPx The Lua version: http://pastebin.com/Sa7rp6uz Lua version that emulates ranges: http://pastebin.com/eAKMSWyr Could someone help me solving this mystery? Or is D, unlike I thought, not suitable for high performance computing? What should I do?
Dec 22 2010
Andrei Alexandrescu Wrote:Andreas, any chance you could run this on your machine and compare it with Lua? (I don't have Lua installed.) Thanks!Your version: 40 ms (iota and baseline give the same timings) LuaJIT with map calls removed: 21 ms Interesting results.
Dec 22 2010
On 12/22/10 11:06 PM, Andreas Mayer wrote:Andrei Alexandrescu Wrote:Cool, thanks. I also tested against this C++ baseline: #include <stdio.h> int main() { const double limit = 10000000.0; double result = 0.0; for (double i = 0; i != limit; ++i) { result += i; } printf("%f\n", result); } The baseline (compiled with -O3) runs in 21 ms on my machine, which means (if my and Andreas' machines are similar in performance) that Lua has essentially native performance for this loop and D has an issue in code generation that makes it 2x slower. I think this could be filed as a performance bug for dmd. I'm thinking what to do about iota, which has good features but exacts too much cost on tight loop performance. One solution would be to define iota to be the simple, forward range that I defined as Iota2 in my previous post. Then, we need a different name for the full-fledged iota (random-access, has known length, iterates through the same numbers forward and backward etc). Ideas? AndreiAndreas, any chance you could run this on your machine and compare it with Lua? (I don't have Lua installed.) Thanks!Your version: 40 ms (iota and baseline give the same timings) LuaJIT with map calls removed: 21 ms Interesting results.
Dec 22 2010
On Wed, 22 Dec 2010, Andrei Alexandrescu wrote:On 12/22/10 11:06 PM, Andreas Mayer wrote:Since the timing code isn't here, I'm assuming you guys are doing the testing around the whole app. While that might be interesting, it's hiding an awfully large and important difference, application startup time. C has very little, D quite a bit more, and I don't know what Lua looks like there. If the goal is to test this math code, you'll need to separate the two. At this point, I highly suspect you're really measuring the runtime costs.Andrei Alexandrescu Wrote:Cool, thanks. I also tested against this C++ baseline: #include <stdio.h> int main() { const double limit = 10000000.0; double result = 0.0; for (double i = 0; i != limit; ++i) { result += i; } printf("%f\n", result); } The baseline (compiled with -O3) runs in 21 ms on my machine, which means (if my and Andreas' machines are similar in performance) that Lua has essentially native performance for this loop and D has an issue in code generation that makes it 2x slower. I think this could be filed as a performance bug for dmd. I'm thinking what to do about iota, which has good features but exacts too much cost on tight loop performance. One solution would be to define iota to be the simple, forward range that I defined as Iota2 in my previous post. Then, we need a different name for the full-fledged iota (random-access, has known length, iterates through the same numbers forward and backward etc). Ideas? AndreiAndreas, any chance you could run this on your machine and compare it with Lua? (I don't have Lua installed.) Thanks!Your version: 40 ms (iota and baseline give the same timings) LuaJIT with map calls removed: 21 ms Interesting results.
Dec 22 2010
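Brad's concern about whole-process timing is easy to address with an in-process timer, so that runtime startup is excluded from the measurement. A sketch using std.datetime's StopWatch (the module layout has shifted across D releases; in current releases it lives in std.datetime.stopwatch):

```d
import std.datetime.stopwatch : StopWatch, AutoStart;
import std.stdio;

void main()
{
    // the watch starts here, after process and runtime startup
    auto sw = StopWatch(AutoStart.yes);

    double result = 0.0;
    for (double i = 0; i != 10_000_000.0; ++i)
        result += i;

    sw.stop();
    // only the loop is measured, not runtime initialization
    writefln("%f in %s ms", result, sw.peek.total!"msecs");
}
```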
On 12/22/10 11:40 PM, Brad Roberts wrote:Since the timing code isn't here, I'm assuming you guys are doing the testing around the whole app. While that might be interesting, it's hiding an awfully large and important difference, application startup time. C has very little, D quite a bit more, and I don't know what Lua looks like there. If the goal is to test this math code, you'll need to separate the two. At this point, I highly suspect you're really measuring the runtime costs.One thing I didn't mention is that I also measured with 10x the counter limit. That brings the run time to seconds, and the relative difference persists. So application startup time is negligible in this case. Andrei
Dec 23 2010
Andrei:I'm thinking what to do about iota, which has good features but exacts too much cost on tight loop performance. One solution would be to define iota to be the simple, forward range that I defined as Iota2 in my previous post. Then, we need a different name for the full-fledged iota (random-access, has known length, iterates through the same numbers forward and backward etc). Ideas?Is improving the compiler instead an option? Bye, bearophile
Dec 23 2010
On 12/23/10 2:52 AM, bearophile wrote:Andrei:It's more of a separate matter than an option. Iota currently does a fair amount of work for floating-point types, and a contemporary optimizer cannot be reasonably expected to simplify that code. AndreiI'm thinking what to do about iota, which has good features but exacts too much cost on tight loop performance. One solution would be to define iota to be the simple, forward range that I defined as Iota2 in my previous post. Then, we need a different name for the full-fledged iota (random-access, has known length, iterates through the same numbers forward and backward etc). Ideas?Is improving the compiler instead an option?
Dec 23 2010
On Wed, 22 Dec 2010 23:22:56 -0600 Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:I'm thinking what to do about iota, which has good features but exacts too much cost on tight loop performance. One solution would be to define iota to be the simple, forward range that I defined as Iota2 in my previous post. Then, we need a different name for the full-fledged iota (random-access, has known length, iterates through the same numbers forward and backward etc). Ideas?I would keep length and add an opIn: if (e in interval) {...}. (I'm unsure whether it's worth allowing different types for bounds and/or for step; I'd rather make things simple.) Then, you could call it Interval, what do you think? Note: The result would be very similar to Python (x)ranges. D has a notation for a slightly narrower notion: '..'. Thus, what about: Interval!int interval = 1..9; or else: auto interval = Interval!int(1..9); ? What kind of thingie does "i..j" actually construct as of now? Denis -- vit esse estrany ☣ spir.wikidot.com
Dec 23 2010
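spir's opIn idea can be sketched with opBinaryRight, the D2 operator-overloading hook for `in`. This Interval is hypothetical, not an existing Phobos type:

```d
struct Interval(T)
{
    T first, last;   // right-exclusive, like i..j and like iota

    // enables the `x in interval` syntax spir proposes
    bool opBinaryRight(string op : "in")(T value) const
    {
        return first <= value && value < last;
    }
}

void main()
{
    auto interval = Interval!int(1, 9);
    assert(5 in interval);
    assert(1 in interval);        // lower bound included
    assert(!(9 in interval));     // upper bound excluded
}
```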
spir <denis.spir gmail.com> wrote:On Wed, 22 Dec 2010 23:22:56 -0600 Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:Nothing. The syntax only works in foreach and opSlice. However, this works: final abstract class Intervals { struct Interval( T ) { T start, end; } static Interval!T opSlice( T )( T start, T end ) { return Interval!T( start, end ); } } auto intInterval = Intervals[1..2]; auto stringInterval = Intervals["foo".."bar"]; -- SimenI'm thinking what to do about iota, which has good features but exacts too much cost on tight loop performance. One solution would be to define iota to be the simple, forward range that I defined as Iota2 in my previous post. Then, we need a different name for the full-fledged iota (random-access, has known length, iterates through the same numbers forward and backward etc). Ideas?I would keep length and add an opIn: if (e in interval) {...}. (I'm unsure whether it's worth allowing different types for bounds and/or for step; I'd rather make things simple.) Then, you could call it Interval, what do you think? Note: The result would be very similar to python (x)ranges. D has a notation for a slightly narrower notion: '..'. Thus, what about: Interval!int interval = 1..9; or else: auto interval = Interval!int(1..9); ? What kind of thingie does "i..j" actually construct as of now?
Dec 23 2010
On Thu, 23 Dec 2010 14:40:13 +0100 "Simen kjaeraas" <simen.kjaras gmail.com> wrote:What kind of thingie does "i..j" actually construct as of now?Nothing. The syntax only works in foreach and opSlice. However, this works:

final abstract class Intervals {
    struct Interval( T ) { T start, end; }
    static Interval!T opSlice( T )( T start, T end ) {
        return Interval!T( start, end );
    }
}

auto intInterval = Intervals[1..2];
auto stringInterval = Intervals["foo".."bar"];

Nice! (even impressive :-) Denis -- vit esse estrany ☣ spir.wikidot.com
Dec 23 2010
On Thursday 23 December 2010 05:22:55 spir wrote:On Wed, 22 Dec 2010 23:22:56 -0600 Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:I believe that the only place that .. works is within []. If an object overrides an opSlice() which takes parameters, then that syntax can be used. I don't believe that it works on its own at all. - Jonathan M DavisI'm thinking what to do about iota, which has good features but exacts too much cost on tight loop performance. One solution would be to define iota to be the simple, forward range that I defined as Iota2 in my previous post. Then, we need a different name for the full-fledged iota (random-access, has known length, iterates through the same numbers forward and backward etc). Ideas?I would keep length and add an opIn: if (e in interval) {...}. (I'm unsure whether it's worth allowing different types for bounds and/or for step; I'd rather make things simple.) Then, you could call it Interval, what do you think? Note: The result would be very similar to python (x)ranges. D has a notation for a slightly narrower notion: '..'. Thus, what about: Interval!int interval = 1..9; or else: auto interval = Interval!int(1..9); ? What kind of thingie does "i..j" actually construct as of now?
Dec 23 2010
On Thu, 23 Dec 2010 05:29:32 -0800 Jonathan M Davis <jmdavisProg gmx.com> wrote:On Thursday 23 December 2010 05:22:55 spir wrote:On Wed, 22 Dec 2010 23:22:56 -0600 Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:I'm thinking what to do about iota, which has good features but exacts too much cost on tight loop performance. One solution would be to define iota to be the simple, forward range that I defined as Iota2 in my previous post. Then, we need a different name for the full-fledged iota (random-access, has known length, iterates through the same numbers forward and backward etc). Ideas?I would keep length and add an opIn: if (e in interval) {...}. (I'm unsure whether it's worth allowing different types for bounds and/or for step; I'd rather make things simple.) Then, you could call it Interval, what do you think? Note: The result would be very similar to Python (x)ranges. D has a notation for a slightly narrower notion: '..'. Thus, what about: Interval!int interval = 1..9; or else: auto interval = Interval!int(1..9); ? What kind of thingie does "i..j" actually construct as of now?I believe that the only place that .. works is within []. If an object overrides an opSlice() which takes parameters, then that syntax can be used.;-) There's also foreach(n ; i..j) {...} Precisely, that's what I was thinking of when stating that D has a notation for a very close (but narrower) notion. Slicing is related, but much farther, since it does not necessarily require iteration (though its result does allow it). Note: Iota is right-side exclusive like i..j. (I've just been caught by this trap ;-)I don't believe that it works on its own at all.Certainly not. This would be a syntactic addition. The reason why I asked what "i..j" currently yields, if it yields anything (it could just rewrite). denis -- vit esse estrany ☣ spir.wikidot.com
Dec 23 2010
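Wrapping up this subthread, the two places where `i..j` is legal can be checked directly; outside foreach and the brackets of an opSlice call it is not an expression at all:

```d
void main()
{
    // legal use #1: foreach over a right-exclusive integer range
    int sum = 0;
    foreach (n; 1 .. 9)   // visits 1 through 8; 9 is excluded, like iota
        sum += n;
    assert(sum == 36);

    // legal use #2 is inside the brackets of an opSlice call, as in
    // Simen's Intervals trick above; `auto r = 1 .. 9;` does not compile.
}
```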
On Wed, 22 Dec 2010 22:14:34 -0600 Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:I then replaced iota's implementation with a simpler one that's a forward range. Then the performance became exactly the same as for the simple loop.After having watched Iota's very general implementation, I tried the same change, precisely. Actually, with an even simpler range requiring a single element type for (first, last, step). For some reason, this alternative is slightly slower for me than using Iota (don't cry watching absolute times, my computer is old and slow ;-). Sample code below; typical results are: 1.1 3.3 5.5 7.7 Interval time: 1149 Iota time: 1066 Note: adding an assert to ensure front or popFront is not wrongly called past the end adds ~20% time. Note: I think this demonstrates that using Iota does not perform undue computations (multiplication to get the Nth element, plus addition), or do I misinterpret? Anyway, what is wrong in my code? Why doesn't it perform better?

import std.algorithm : map, filter, reduce;
import std.range : iota;
import std.stdio;   // needed for writef/writeln/writefln (missing in the original post)

struct Interval (T) {
    alias T Element;
    Element first, last, step;
    private Element element;

    this (Element first, Element last, Element step = 1) {
        this.first = first;
        this.last = last;
        this.step = step;
        this.element = first;
    }

    void popFront () { this.element += this.step; }
    @property bool empty () { return (this.element > this.last); }
    @property Element front () { return this.element; }
}

void main () {
    auto nums = Interval!float(1.1, 8.8, 2.2);
    foreach (n ; nums) writef("%s ", n);
    writeln();

    auto t1 = time();   // time() is the poster's own timer helper (not shown)
    auto nums1 = Interval!int(0, 10_000_000);
    auto halves1 = map!"a/2"(nums1);
    auto incs1 = map!"a+2"(halves1);
    auto result1 = reduce!"a+b"(incs1);
    writefln("Interval time: %s", time() - t1);

    auto t2 = time();
    auto nums2 = iota(0, 10_000_000);
    auto halves2 = map!"a/2"(nums2);
    auto incs2 = map!"a+2"(halves2);
    auto result2 = reduce!"a+b"(incs2);
    writefln("Iota time: %s", time() - t2);
}

Denis -- vit esse estrany ☣ spir.wikidot.com
Dec 23 2010
On 12/23/10 7:04 AM, spir wrote:On Wed, 22 Dec 2010 22:14:34 -0600 Andrei Alexandrescu<SeeWebsiteForEmail erdani.org> wrote:I cut my losses reading here :o). No performance test is meaningful without all optimizations turned on. AndreiI then replaced iota's implementation with a simpler one that's a forward range. Then the performance became exactly the same as for the simple loop.After having watched Iota's very general implementation, I tried the same change, precisely. Actually, with an even simpler range requiring a single element type for (first,last,step). For any reason, this alternative is slightly slower by me than using Iota (don't cry watching absolute times, my computer is old and slow ;-). Sample code below, typical results are: 1.1 3.3 5.5 7.7 Interval time: 1149 Iota time: 1066 Note: adding an assert to ensure front or popfront is not wrongly called past the end adds ~ 20% time.
Dec 23 2010
On 12/22/2010 11:04 PM, Andreas Mayer wrote:To see what performance advantage D would give me over using a scripting language, I made a small benchmark. It consists of this code:I changed the code to this: auto L = iota(0, 10000000); auto L2 = map!"a / 2.0"(L); auto L3 = map!"a + 2"(L2); auto V = reduce!"a + b"(L3); and ripped the caching out of std.algorithm.map. :-) This made it go from about 1.4 seconds to about 0.4 seconds on my machine. Note that I did no rigorous or scientific testing. Also, if you really really need the performance you can change it all to lower level code, should you want to.auto L = iota(0.0, 10000000.0); auto L2 = map!"a / 2"(L); auto L3 = map!"a + 2"(L2); auto V = reduce!"a + b"(L3);It runs in 281 ms on my computer. The same code in Lua (using LuaJIT) runs in 23 ms. That's about 10 times faster. I would have expected D to be faster. Did I do something wrong? The first Lua version uses a simplified design. I thought maybe that is unfair to ranges, which are more complicated. You could argue ranges have more features and do more work. To make it fair, I made a second Lua version of the above benchmark that emulates ranges. It is still 29 ms fast. The full D version is here: http://pastebin.com/R5AGHyPx The Lua version: http://pastebin.com/Sa7rp6uz Lua version that emulates ranges: http://pastebin.com/eAKMSWyr Could someone help me solving this mystery? Or is D, unlike I thought, not suitable for high performance computing? What should I do?
Dec 23 2010
On 12/22/10 4:04 PM, Andreas Mayer wrote:To see what performance advantage D would give me over using a scripting language, I made a small benchmark. It consists of this code:I wrote a new test bench and got 41 ms for the baseline and 220 ms for the code based on map and iota. (Surprisingly, the extra work didn't affect the run time, which suggests the loop is dominated by the counter increment and test.) Then I took out the cache in map and got 136 ms. Finally, I replaced the use of iota with iota2 and got performance equal to that of handwritten code. Code below. I decided to check in the map cache removal. We discussed it a fair amount among Phobos devs. I have no doubts caching might help in certain cases, but it does lead to surprising performance loss for simple cases like the one tested here. See http://www.dsource.org/projects/phobos/changeset/2231 If the other Phobos folks approve, I'll also specialize iota for floating point numbers to be a forward range and defer the decision on defining a "randomAccessIota" for floating point numbers to later. That would complete the library improvements pointed to by this test, leaving further optimization to compiler improvements. Thanks Andreas for starting this. 
Andrei

import std.algorithm;
import std.stdio;
import std.range;
import std.traits;

struct Iota2(N, S) if (isFloatingPoint!N && isNumeric!S)
{
    private N start, end, current;
    private S step;

    this(N start, N end, S step)
    {
        this.start = start;
        this.end = end;
        this.step = step;
        current = start;
    }

    /// Range primitives
    @property bool empty() const { return current >= end; }

    /// Ditto
    @property N front() { return current; }

    /// Ditto
    alias front moveFront;

    /// Ditto
    void popFront()
    {
        assert(!empty);
        current += step;
    }

    @property Iota2 save() { return this; }
}

auto iota2(B, E, S)(B begin, E end, S step)
if (is(typeof((E.init - B.init) + 1 * S.init)))
{
    return Iota2!(CommonType!(Unqual!B, Unqual!E), S)(begin, end, step);
}

void main(string args[])
{
    double result;
    auto limit = 10_000_000.0;
    if (args.length > 1)
    {
        writeln("iota");
        auto L = iota2(0.0, limit, 1.0);
        auto L2 = map!"a / 2"(L);
        auto L3 = map!"a + 2"(L2);
        result = reduce!"a + b"(L3);
    }
    else
    {
        writeln("baseline");
        result = 0.0;
        for (double i = 0; i != limit; ++i)
        {
            result += (i / 2) + 2;
        }
    }
    writefln("%f", result);
}

auto L = iota(0.0, 10000000.0); auto L2 = map!"a / 2"(L); auto L3 = map!"a + 2"(L2); auto V = reduce!"a + b"(L3);It runs in 281 ms on my computer. The same code in Lua (using LuaJIT) runs in 23 ms. That's about 10 times faster. I would have expected D to be faster. Did I do something wrong? The first Lua version uses a simplified design. I thought maybe that is unfair to ranges, which are more complicated. You could argue ranges have more features and do more work. To make it fair, I made a second Lua version of the above benchmark that emulates ranges. It is still 29 ms fast. The full D version is here: http://pastebin.com/R5AGHyPx The Lua version: http://pastebin.com/Sa7rp6uz Lua version that emulates ranges: http://pastebin.com/eAKMSWyr Could someone help me solving this mystery? Or is D, unlike I thought, not suitable for high performance computing? What should I do?
Dec 23 2010
Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:I decided to check in the map cache removal. We discussed it a fair amount among Phobos devs. I have no doubts caching might help in certain cases, but it does lead to surprising performance loss for simple cases like the one tested here. See http://www.dsource.org/projects/phobos/changeset/2231It seems to me that having a Cached range might be a better, more general solution in any case. -- Simen
Dec 23 2010
On 12/23/10 10:09 AM, Simen kjaeraas wrote:Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:Agreed. AndreiI decided to check in the map cache removal. We discussed it a fair amount among Phobos devs. I have no doubts caching might help in certain cases, but it does lead to surprising performance loss for simple cases like the one tested here. See http://www.dsource.org/projects/phobos/changeset/2231It seems to me that having a Cached range might be a better, more general solution in any case.
Dec 23 2010
Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:

> http://www.dsource.org/projects/phobos/changeset/2231

BTW, shouldn't range constructors call .save for forward ranges? This one
certainly doesn't.

-- 
Simen
Dec 23 2010
On 12/23/10 10:14 AM, Simen kjaeraas wrote:

> Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:
>> http://www.dsource.org/projects/phobos/changeset/2231
>
> BTW, shouldn't range constructors call .save for forward ranges? This
> one certainly doesn't.

Currently, higher-order ranges assume that the range passed in is good to
take ownership of. A range or algorithm should call save only in case
extra copies need to be created.

Andrei
Dec 23 2010
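The ownership convention described above can be sketched as follows. The two-pass algorithm here is hypothetical, but it shows the one situation where .save must be called: when an independent copy of a forward range's iteration state is genuinely needed.

```d
import std.range;
import std.stdio;

// Hypothetical two-pass algorithm: it consumes the range twice, so it
// takes an independent copy up front via .save. A single-pass algorithm
// would simply take ownership and never call .save.
auto sumTwice(R)(R r) if (isForwardRange!R)
{
    auto copy = r.save;  // independent iteration state
    double total = 0;

    for (; !r.empty; r.popFront())        // first pass consumes r
        total += r.front;
    for (; !copy.empty; copy.popFront())  // copy is unaffected
        total += copy.front;

    return total;
}

void main()
{
    // Sums 0+1+2+3 twice over the same forward range.
    writeln(sumTwice(iota(0.0, 4.0)));
}
```

Had the constructor itself called .save defensively, every wrapper in a pipeline like map-over-map-over-iota would pay for a copy that is never used.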
I hope that in the future more D implementations will be benchmarked
against their Lua translations. LuaJIT is a remarkably fast dynamic
language implementation, specifically designed to break into the
performance range of optimized static languages, which makes it a
formidable competitor.
Dec 23 2010
spir:

> Note: iota is right-side exclusive, like i..j. (I've just been caught
> by this trap ;-)

This is for the better, to increase language consistency (as in Python).
In APL (where the iota name comes from) the semantics was different:

⍳ 5 ==> 1, 2, 3, 4, 5

Integer ranges are very common, so some generalization seems useful here.
An x..y:z syntax could denote a range with stride z, as syntax sugar for
iota(x, y, z).

Bye,
bearophile
Dec 24 2010
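Both points above — the right-exclusive bound and the stride that a hypothetical x..y:z syntax would desugar to — can be illustrated with std.range's existing iota, whose three-argument form already takes a step:

```d
import std.range : iota;
import std.algorithm : equal;

void main()
{
    // Right-side exclusive, like an i..j slice: 5 is not included.
    assert(equal(iota(1, 5), [1, 2, 3, 4]));

    // Three-argument form: begin, end, step. This is what the
    // proposed x..y:z sugar would expand to.
    assert(equal(iota(0, 10, 3), [0, 3, 6, 9]));
}
```

So the proposal is purely syntactic; the library semantics it would map onto are already in place.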
Am Wed, 22 Dec 2010 17:04:21 -0500
schrieb Andreas Mayer <spam bacon.eggs>:

> To see what performance advantage D would give me over using a
> scripting language, I made a small benchmark. It consists of this code:
>
> auto L = iota(0.0, 10000000.0);
> auto L2 = map!"a / 2"(L);
> auto L3 = map!"a + 2"(L2);
> auto V = reduce!"a + b"(L3);
>
> It runs in 281 ms on my computer. The same code in Lua (using LuaJIT)
> runs in 23 ms. That's about 10 times faster. I would have expected D
> to be faster. Did I do something wrong?

Actually "D" is 1.5 times faster on my computer*:

LDC**           ======== 18 ms
GDC***          =========== 25 ms
LuaJIT 2.0.0 b7 ============ 27 ms
DMD             ========================================= 93 ms

All compilers based on the DMD 2.062 front-end.

*   64-bit Linux, 2.0 GHz Mobile Core 2 Duo.
**  based on LLVM 3.2
*** based on GCC 4.7.2

I modified the iota template to more closely reflect the one used in the
original Lua code:

---------------------
import std.algorithm;
import std.stdio;
import std.traits;

auto iota(B, E)(B begin, E end)
if (isFloatingPoint!(CommonType!(B, E)))
{
    alias CommonType!(B, E) Value;

    static struct Result
    {
        private Value start, end;

        @property bool empty() const { return start >= end; }
        @property Value front() const { return start; }
        void popFront() { start++; }
    }

    return Result(begin, end);
}

void main()
{
    auto L  = iota(0.0, 10000000.0),
         L2 = map!(a => a / 2)(L),
         L3 = map!(a => a + 2)(L2),
         V  = reduce!((a, b) => a + b)(L3);
    writefln("%f", V);
}
---------------------

-- 
Marco
Jun 01 2013