digitalmars.D - More embarrassing microbenchmarks for D's GC.

reply Leandro Lucarella <llucax gmail.com> writes:

I've done a very simple microbenchmark testing a singly linked list.

What the test does is fill the list with long elements, and then iterate over
the list and increment the associated value. This test mostly tests
allocation speed.

I've done a plain C version, a C++ version (using the STL double-linked list,
so it's at some disadvantage here), a D version without using the GC, a D
version using the GC, and a Python version.
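
For reference, here is a minimal sketch of what the D versions do (the actual
tests were attached to this post; the names here are illustrative):

	// Singly linked list of longs: fill it, then walk it and increment.
	struct Node
	{
	    long data;
	    Node* next;
	}

	void runTest(size_t count)
	{
	    // Fill phase: one allocation per node. The GC version uses new;
	    // the no-GC version does the same with std.c.stdlib.malloc.
	    Node* head = new Node;
	    Node* n = head;
	    for (size_t i = 0; i < count; i++)
	    {
	        n.data = i;
	        n.next = new Node;
	        n = n.next;
	    }
	    n.next = null;

	    // Inc phase: pure pointer chasing, no allocation at all.
	    for (Node* p = head; p !is null; p = p.next)
	        p.data++;
	}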

Here are the results (Celeron ~2GHz, 1M elements, average of 3 runs):

       C         C++       D no gc   D gc       D gc dis  Python
Fill   0.136136  0.142705  0.131238  22.628577  0.242637  12.952816
Inc    0.025134  0.038768  0.037456   0.050480  0.051545   3.765271
Total  0.161270  0.181473  0.168694  22.679057  0.294182  16.718087

Results are in seconds, compiled with gcc/g++/gdc (-O3 -finline, plus
-frelease for D code). Using phobos (gdc 0.25-20080419-4.1.2-22, with
debian patches).

You can see Python is almost twice as fast as D at allocation (and the
Python example has some intentional overhead to keep it as close to the D
code as possible; using more Pythonic code could yield better results).

But there are a few other results I can't explain:
1) Why is the D GC version (disabled or not) ~25% slower than the D version
   that uses malloc when iterating the list? There shouldn't be any GC
   activity in that part. Could it be some GC locality issue that yields
   more cache misses?
2) Why is the D malloc version ~33% slower than the C version? I took a look
   at the generated assembly and it's almost identical:
	<_Dmain+198>:   lea    -0x20(%ebp),%eax
	<_Dmain+201>:   lea    0x0(%esi,%eiz,1),%esi
	<_Dmain+208>:   addl   $0x1,0x8(%eax)
	<_Dmain+212>:   adcl   $0x0,0xc(%eax)
	<_Dmain+216>:   mov    (%eax),%eax
	<_Dmain+218>:   test   %eax,%eax
	<_Dmain+220>:   jne    0x804a240 <_Dmain+208>


	<main+248>:     lea    -0x1c(%ebp),%eax
	<main+251>:     nop
	<main+252>:     lea    0x0(%esi,%eiz,1),%esi
	<main+256>:     addl   $0x1,0x4(%eax)
	<main+260>:     adcl   $0x0,0x8(%eax)
	<main+264>:     mov    (%eax),%eax
	<main+266>:     test   %eax,%eax
	<main+268>:     jne    0x8048550 <main+256>
	<main+270>:     movl   $0x0,0x4(%esp)
	<main+278>:     movl   $0x8049800,(%esp)

Tests attached.

Jun 09 2008
next sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Leandro Lucarella:
 You can see Python is almost twice as fast as D at allocation (and the
 Python example has some intentional overhead to keep it as close to the D
 code as possible; using more Pythonic code could yield better results).
Note that Psyco can't compile code outside functions (global code is slower
even if you don't use Psyco). Try putting your code into a main() and add a:

	from psyco.classes import __metaclass__

at the top. The code:

	import sys
	from datetime import datetime
	import psyco
	psyco.full()
	from psyco.classes import __metaclass__

	class Node:
	    def __init__(self):
	        self.data = 0
	        self.next = None

	def main():
	    start = datetime.now()
	    list = Node()
	    n = list
	    for i in xrange(long(sys.argv[1])):
	        n.data = i
	        n.next = Node()
	        n = n.next
	    n.next = None
	    end = datetime.now()
	    delta = datetime.now() - start
	    elapsed1 = delta.seconds * 1000000 + delta.microseconds

	    start = datetime.now()
	    p = list
	    while p:
	        p.data += 1
	        p = p.next
	    delta = datetime.now() - start
	    elapsed2 = delta.seconds * 1000000 + delta.microseconds

	    print "Fill: %lu usec" % elapsed1
	    print "Inc: %lu usec" % elapsed2
	    print "Total: %lu usec" % (elapsed1 + elapsed2)

	main()

-----------------------
Original:                          9.32 s
Just moving code into a function:  8.39 s
The same, using Psyco:             1.94 s

Bye,
bearophile
Jun 09 2008
parent reply Leandro Lucarella <llucax gmail.com> writes:
bearophile, on June 9 at 17:03, you wrote:
 Leandro Lucarella:
 You can see Python is almost twice as fast as D at allocation (and the
 Python example has some intentional overhead to keep it as close to the D
 code as possible; using more Pythonic code could yield better results).
Note that Psyco can't compile code outside functions (global code is slower even if you don't use Psyco).
Thanks for the info, I've imported psyco but didn't really play with it =)
Jun 10 2008
parent bearophile <bearophileHUGS lycos.com> writes:
Leandro Lucarella:
 Thanks for the info, I've imported psyco but didn't play with it really =)
If you have a bit of free time you may time that code of mine (that's equal
to yours + Psyco) on your PC, so you can show us those timings too, compared
to the other ones...

Bye,
bearophile
Jun 10 2008
prev sibling next sibling parent reply Sean Kelly <sean invisibleduck.org> writes:
== Quote from Leandro Lucarella (llucax gmail.com)'s article
 But there are a few other results I can't explain:
 1) Why is the D GC version (disabled or not) ~25% slower than the D version
    that uses malloc when iterating the list? There shouldn't be any GC
    activity in that part. Could it be some GC locality issue that yields
    more cache misses?
I think it may have more to do with the allocation strategy in the GC. It
obtains memory in chunks from the OS, and each chunk is typically a max of
8MB. So for a test like this the D GC will end up hitting the OS quite a few
times asking for more memory. If I had to guess I'd say that malloc has a
more efficient strategy here. If you're interested, try running the same
test using Tango with and without a call to tango.core.Memory.GC.reserve()
for the amount of memory you expect the app to use, before the loop.
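
A sketch of that suggestion (assuming Tango's tango.core.Memory module; the
size below is an illustrative guess, not a measured figure):

	import tango.core.Memory;

	void main()
	{
	    // Pre-allocate enough heap up front so the fill loop doesn't
	    // repeatedly exhaust the GC pools and trigger collections.
	    GC.reserve(64 * 1024 * 1024);
	    // ... run the fill/inc benchmark as before ...
	}
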
 2) Why is the D malloc version ~33% slower than the C version? I took a look
    at the generated assembly and it's almost identical:
 	<_Dmain+198>:   lea    -0x20(%ebp),%eax
 	<_Dmain+201>:   lea    0x0(%esi,%eiz,1),%esi
 	<_Dmain+208>:   addl   $0x1,0x8(%eax)
 	<_Dmain+212>:   adcl   $0x0,0xc(%eax)
 	<_Dmain+216>:   mov    (%eax),%eax
 	<_Dmain+218>:   test   %eax,%eax
 	<_Dmain+220>:   jne    0x804a240 <_Dmain+208>
 	<main+248>:     lea    -0x1c(%ebp),%eax
 	<main+251>:     nop
 	<main+252>:     lea    0x0(%esi,%eiz,1),%esi
 	<main+256>:     addl   $0x1,0x4(%eax)
 	<main+260>:     adcl   $0x0,0x8(%eax)
 	<main+264>:     mov    (%eax),%eax
 	<main+266>:     test   %eax,%eax
 	<main+268>:     jne    0x8048550 <main+256>
 	<main+270>:     movl   $0x0,0x4(%esp)
 	<main+278>:     movl   $0x8049800,(%esp)
No idea. I'd expect them to be roughly equivalent.

Sean
Jun 09 2008
parent reply Leandro Lucarella <llucax gmail.com> writes:

Sean Kelly, on June 9 at 22:53, you wrote:
 == Quote from Leandro Lucarella (llucax gmail.com)'s article
 But there are a few other results I can't explain:
 1) Why is the D GC version (disabled or not) ~25% slower than the D version
    that uses malloc when iterating the list? There shouldn't be any GC
    activity in that part. Could it be some GC locality issue that yields
    more cache misses?
I think it may have more to do with the allocation strategy in the GC. It
obtains memory in chunks from the OS, and each chunk is typically a max of
8MB. So for a test like this the D GC will end up hitting the OS quite a few
times asking for more memory. If I had to guess I'd say that malloc has a
more efficient strategy here. If you're interested, try running the same
test using Tango with and without a call to tango.core.Memory.GC.reserve()
for the amount of memory you expect the app to use, before the loop.
Really? I thought it was more collection-related. I don't know exactly how
the D GC is implemented, but I thought that, as you say, it allocates a
chunk of memory and each new gives you a piece of it; when it runs out of
memory it runs a collection cycle, and when that cycle can't recover unused
memory, it allocates more.

Considering that the version with the GC disabled is almost as fast as the
malloc version, and that the time grows exponentially (graph attached) with
the number of nodes, it looks to me like the problem is the GC collection
cycles. Is the GC algorithm exponential?
Jun 10 2008
next sibling parent Sean Kelly <sean invisibleduck.org> writes:
== Quote from Leandro Lucarella (llucax gmail.com)'s article
 Sean Kelly, on June 9 at 22:53, you wrote:
 == Quote from Leandro Lucarella (llucax gmail.com)'s article
 But there are a few other results I can't explain:
 1) Why is the D GC version (disabled or not) ~25% slower than the D version
    that uses malloc when iterating the list? There shouldn't be any GC
    activity in that part. Could it be some GC locality issue that yields
    more cache misses?
I think it may have more to do with the allocation strategy in the GC. It
obtains memory in chunks from the OS, and each chunk is typically a max of
8MB. So for a test like this the D GC will end up hitting the OS quite a few
times asking for more memory. If I had to guess I'd say that malloc has a
more efficient strategy here. If you're interested, try running the same
test using Tango with and without a call to tango.core.Memory.GC.reserve()
for the amount of memory you expect the app to use, before the loop.
Really? I thought it was more collection-related. I don't know exactly how
the D GC is implemented, but I thought that, as you say, it allocates a
chunk of memory and each new gives you a piece of it; when it runs out of
memory it runs a collection cycle, and when that cycle can't recover unused
memory, it allocates more. Considering that the version with the GC disabled
is almost as fast as the malloc version, and that the time grows
exponentially (graph attached) with the number of nodes, it looks to me like
the problem is the GC collection cycles. Is the GC algorithm exponential?
Oh okay. I was replying to the post that said the malloc version was still
much faster than D with the GC disabled. But if they're comparable then all
the better. This is what I would expect anyway.

As for the GC algorithm being exponential... all the nodes of the list
contain pointers, so they will all be scanned by the GC. Given this and the
chance of false positives, I would expect performance to decrease in a
superlinear manner. But an exponential slow-down still seems a bit extreme.

Sean
Jun 10 2008
prev sibling parent reply Marius Muja <mariusm cs.ubc.ca> writes:
Leandro Lucarella wrote:
 
 Considering that the version with GC disabled is almost as fast as the
 malloc version, and that the time grows exponentially (graph attached)
 with the number of nodes, it looks like the problem is the GC
 collection cycles to me. Is the GC algorithm exponential?
 
The graph may look exponential but that's often misleading (it may very well
be polynomial). To tell for sure, plot the data on a graph with a
logarithmic Y axis, and then on a graph with both axes logarithmic. If the
plot is a straight line on the semilog-y graph then it's exponential. If
it's a straight line on the log-log graph then it's polynomial.
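
Spelling out the algebra behind that test: an exponential y = a*e^(b*x)
gives log y = log a + b*x, a straight line when only the Y axis is
logarithmic, while a polynomial y = a*x^k gives log y = log a + k*log x, a
straight line when both axes are logarithmic.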
Jun 10 2008
parent Leandro Lucarella <llucax gmail.com> writes:
Marius Muja, on June 10 at 21:38, you wrote:
 Leandro Lucarella wrote:
Considering that the version with GC disabled is almost as fast as the
malloc version, and that the time grows exponentially (graph attached)
with the number of nodes, it looks like the problem is the GC
collection cycles to me. Is the GC algorithm exponential?
The graph may look exponential but that's often misleading (it may very well
be polynomial). To tell for sure, plot the data on a graph with a
logarithmic Y axis, and then on a graph with both axes logarithmic. If the
plot is a straight line on the semilog-y graph then it's exponential. If
it's a straight line on the log-log graph then it's polynomial.
You and Sean are right, the graph is somewhere in between linear and
exponential, but it's really far from linear, which I thought was the worst
case for GC algorithms...
Jun 11 2008
prev sibling next sibling parent reply Walter Bright <newshound1 digitalmars.com> writes:
Leandro Lucarella wrote:
 But there are a few other results I can't explain:
 1) Why is the D GC version (disabled or not) ~25% slower than the D version
    that uses malloc when iterating the list? There shouldn't be any GC
    activity in that part. Could it be some GC locality issue that yields
    more cache misses?
There are so many factors that can influence this, it's hard to say without spending a lot of time carefully examining it.
 2) Why is the D malloc version ~33% slower than the C version? I took a look
    at the generated assembly and it's almost identical:
 	<_Dmain+198>:   lea    -0x20(%ebp),%eax
 	<_Dmain+201>:   lea    0x0(%esi,%eiz,1),%esi
 	<_Dmain+208>:   addl   $0x1,0x8(%eax)
 	<_Dmain+212>:   adcl   $0x0,0xc(%eax)
 	<_Dmain+216>:   mov    (%eax),%eax
 	<_Dmain+218>:   test   %eax,%eax
 	<_Dmain+220>:   jne    0x804a240 <_Dmain+208>
 
 
 	<main+248>:     lea    -0x1c(%ebp),%eax
 	<main+251>:     nop
 	<main+252>:     lea    0x0(%esi,%eiz,1),%esi
 	<main+256>:     addl   $0x1,0x4(%eax)
 	<main+260>:     adcl   $0x0,0x8(%eax)
 	<main+264>:     mov    (%eax),%eax
 	<main+266>:     test   %eax,%eax
 	<main+268>:     jne    0x8048550 <main+256>
 	<main+270>:     movl   $0x0,0x4(%esp)
 	<main+278>:     movl   $0x8049800,(%esp)
 
 Tests attached.
Without running a profiler, there's no way to be sure about just where in the code the time is being consumed.
Jun 09 2008
parent reply Leandro Lucarella <llucax gmail.com> writes:

Walter Bright, on June 9 at 17:12, you wrote:
 Leandro Lucarella wrote:
But there are a few other results I can't explain:
1) Why is the D GC version (disabled or not) ~25% slower than the D version
   that uses malloc when iterating the list? There shouldn't be any GC
   activity in that part. Could it be some GC locality issue that yields
   more cache misses?
There are so many factors that can influence this, it's hard to say without spending a lot of time carefully examining it.
But there is no GC activity there, right?
2) Why is the D malloc version ~33% slower than the C version? I took a look
   at the generated assembly and it's almost identical:
	<_Dmain+198>:   lea    -0x20(%ebp),%eax
	<_Dmain+201>:   lea    0x0(%esi,%eiz,1),%esi
	<_Dmain+208>:   addl   $0x1,0x8(%eax)
	<_Dmain+212>:   adcl   $0x0,0xc(%eax)
	<_Dmain+216>:   mov    (%eax),%eax
	<_Dmain+218>:   test   %eax,%eax
	<_Dmain+220>:   jne    0x804a240 <_Dmain+208>
	<main+248>:     lea    -0x1c(%ebp),%eax
	<main+251>:     nop
	<main+252>:     lea    0x0(%esi,%eiz,1),%esi
	<main+256>:     addl   $0x1,0x4(%eax)
	<main+260>:     adcl   $0x0,0x8(%eax)
	<main+264>:     mov    (%eax),%eax
	<main+266>:     test   %eax,%eax
	<main+268>:     jne    0x8048550 <main+256>
	<main+270>:     movl   $0x0,0x4(%esp)
	<main+278>:     movl   $0x8049800,(%esp)
Tests attached.
Without running a profiler, there's no way to be sure about just where in the code the time is being consumed.
Attached is the output of the gprof program (for the list-test-d-gc version,
with the GC running). I don't see any helpful information for point 2), but
it shows clearly that most of the time is spent in garbage collection.
Jun 10 2008
next sibling parent reply Walter Bright <newshound1 digitalmars.com> writes:
Leandro Lucarella wrote:
 Walter Bright, el  9 de junio a las 17:12 me escribiste:
 Leandro Lucarella wrote:
 But there are a few other results I can't explain:
 1) Why is the D GC version (disabled or not) ~25% slower than the D version
   that uses malloc when iterating the list? There shouldn't be any GC
   activity in that part. Could it be some GC locality issue that yields
   more cache misses?
There are so many factors that can influence this, it's hard to say without spending a lot of time carefully examining it.
But there is no GC activity there, right?
Instrument it to be sure.
 
 2) Why is the D malloc version ~33% slower than the C version? I took a look
   at the generated assembly and it's almost identical:
 	<_Dmain+198>:   lea    -0x20(%ebp),%eax
 	<_Dmain+201>:   lea    0x0(%esi,%eiz,1),%esi
 	<_Dmain+208>:   addl   $0x1,0x8(%eax)
 	<_Dmain+212>:   adcl   $0x0,0xc(%eax)
 	<_Dmain+216>:   mov    (%eax),%eax
 	<_Dmain+218>:   test   %eax,%eax
 	<_Dmain+220>:   jne    0x804a240 <_Dmain+208>
 	<main+248>:     lea    -0x1c(%ebp),%eax
 	<main+251>:     nop
 	<main+252>:     lea    0x0(%esi,%eiz,1),%esi
 	<main+256>:     addl   $0x1,0x4(%eax)
 	<main+260>:     adcl   $0x0,0x8(%eax)
 	<main+264>:     mov    (%eax),%eax
 	<main+266>:     test   %eax,%eax
 	<main+268>:     jne    0x8048550 <main+256>
 	<main+270>:     movl   $0x0,0x4(%esp)
 	<main+278>:     movl   $0x8049800,(%esp)
 Tests attached.
Without running a profiler, there's no way to be sure about just where in the code the time is being consumed.
Attached is the output of the gprof program (for the list-test-d-gc version,
with the GC running). I don't see any helpful information for point 2), but
it shows clearly that most of the time is spent in garbage collection.
Break up your code into more functions to get better info from the profiler.
Jun 10 2008
parent Leandro Lucarella <llucax gmail.com> writes:
Walter Bright, on June 10 at 14:13, you wrote:
 But there is no GC activity there, right?
Instrument it to be sure.
It doesn't seem to be running. But it's weird that you can't tell for sure,
being the author :S
2) Why is the D malloc version ~33% slower than the C version? I took a look
  at the generated assembly and it's almost identical:
	<_Dmain+198>:   lea    -0x20(%ebp),%eax
	<_Dmain+201>:   lea    0x0(%esi,%eiz,1),%esi
	<_Dmain+208>:   addl   $0x1,0x8(%eax)
	<_Dmain+212>:   adcl   $0x0,0xc(%eax)
	<_Dmain+216>:   mov    (%eax),%eax
	<_Dmain+218>:   test   %eax,%eax
	<_Dmain+220>:   jne    0x804a240 <_Dmain+208>
	<main+248>:     lea    -0x1c(%ebp),%eax
	<main+251>:     nop
	<main+252>:     lea    0x0(%esi,%eiz,1),%esi
	<main+256>:     addl   $0x1,0x4(%eax)
	<main+260>:     adcl   $0x0,0x8(%eax)
	<main+264>:     mov    (%eax),%eax
	<main+266>:     test   %eax,%eax
	<main+268>:     jne    0x8048550 <main+256>
	<main+270>:     movl   $0x0,0x4(%esp)
	<main+278>:     movl   $0x8049800,(%esp)
Tests attached.
Without running a profiler, there's no way to be sure about just where in the code the time is being consumed.
Attached is the output of the gprof program (for the list-test-d-gc version,
with the GC running). I don't see any helpful information for point 2), but
it shows clearly that most of the time is spent in garbage collection.
Break up your code into more functions to get better info from the profiler.
Is this useful? (kcachegrind screenshots, using callgrind with
--dump-instr=yes and --trace-jump=yes):

http://llucax.com.ar/~luca/kcachegrind-d.png
http://llucax.com.ar/~luca/kcachegrind-c.png
Jun 11 2008
prev sibling parent reply Rounin <davidjo student.matnat.uio.no> writes:
I found this posting after using gprof to investigate a huge slowdown in a D
program I'd made; the profile showed the culprit to be a function called
_D3gcx3Gcx4markMFPvPvZv, in which the program spent 87% of its time. That
turned out to be the garbage collector.

The program takes 1.74 seconds to run with two particular input files in Python.
The garbage collected D version takes 5.27 seconds.
The non-garbage collected version takes only 0.68 seconds.

Here's a more detailed report:

http://rounin.livejournal.com/21815.html

Obviously, the compiled program should run at least as fast as Python, so the
garbage collector could use some improvement.
Sep 15 2010
parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Wed, 15 Sep 2010 15:53:38 -0400, Rounin <davidjo student.matnat.uio.no>  
wrote:

 I found this posting after investigating a huge slowdown in a D program  
 I'd made
 using gprof, which showed the culprit to be a function called
 _D3gcx3Gcx4markMFPvPvZv, in which the program spent 87% of its time.  
 That turned
 out to be the garbage collector.

 The program takes 1.74 seconds to run with two particular input files in  
 Python.
 The garbage collected D version takes 5.27 seconds.
 The non-garbage collected version takes only 0.68 seconds.

 Here's a more detailed report:

 http://rounin.livejournal.com/21815.html

 Obviously, the compiled program should run at least as fast as Python,  
 so the
 garbage collector could use some improvement.
gdc is very old. You may want to try the thing on the latest dmd. No
guarantee this is going to help, though.

Another thing you may want to try is dcollections' HashMap instead of the
builtin associative arrays, as it uses a free-list style allocator that
helps avoid GC collections.

I agree the GC is a major issue with performance. I hope that after D2 is a
bit more stable, we start working on improving it.

-Steve
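
A minimal sketch of the free-list idea (illustrative only, not dcollections'
actual implementation):

	struct Node { long data; Node* next; }

	Node* freeList;  // recycled nodes, reused instead of asking the GC again

	Node* allocNode()
	{
	    if (freeList !is null)
	    {
	        Node* n = freeList;
	        freeList = n.next;
	        return n;
	    }
	    return new Node;  // hit the GC only when the free list is empty
	}

	void freeNode(Node* n)
	{
	    // Push the node back for reuse; nothing is returned to the GC, so
	    // later allocations served from here can't trigger a collection.
	    n.next = freeList;
	    freeList = n;
	}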
Sep 15 2010
parent reply Rounin <davidjo student.matnat.uio.no> writes:
Thank you for that advice. I'm using GDC because it's available from Ubuntu
Linux's package system, whereas DMD currently is not. (And the .deb posted
on digitalmars.com is only for i386.) Hopefully, D will gain more popularity
and more up-to-date tools will be made available.

By the way, today I re-compiled the program with a "std.gc.enable;" right
before the final "return 0" statement, and it still runs in 0.68 seconds. So
either those objects aren't marked as needing garbage collection, or it's
really just an issue of keeping the garbage collector from activating too
quickly.
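
The pattern being described, as a D1/Phobos sketch (buildTables and report
are hypothetical stand-ins for the real program's phases):

	import std.gc;

	// Hypothetical stand-ins for the real program's phases.
	char[][] buildTables(char[][] args) { return args; }
	void report(char[][] tables) { }

	int main(char[][] args)
	{
	    std.gc.disable();  // keep collections out of the allocation-heavy phase
	    auto tables = buildTables(args);
	    std.gc.enable();   // from here on the GC may collect again
	    report(tables);
	    return 0;
	}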
Sep 16 2010
next sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Rounin:

 Thank you for that advice. I'm using GDC because it's available from Ubuntu
 Linux's package system, whereas DMD currently is not. (And the .deb posted
 on digitalmars.com is only for i386.) Hopefully, D will gain more
 popularity and more up-to-date tools will be made available.
On a 32 bit system I find LDC to produce efficient programs, when possible (the GC is similar, in D2 it's essentially the same).
 By the way, today I re-compiled the program with a "std.gc.enable;" right
 before the final "return 0" statement, and it still runs in 0.68 seconds.
You may try to disable/enable the GC in the Python code too (on 32 bit
systems there's the very good Psyco too).

Your benchmark is not portable, so I can't help you find where the
performance problem is. When you perform a benchmark it's better to give all
the source code and all the data too, to allow others to reproduce the
results and look for performance problems.

Keep in mind that D associative arrays are usually slower than Python dicts.
Probably you build data structures like associative arrays, and this slows
down the GC. If you disable & enable the GC around that build phase, the
program is probably fast (so I suggest you narrow the disable/enable span as
much as possible, to see where the GC problem is). If you put an exit(0) at
the end of the program (to kill the final collection) the D program may save
even more time.

In Python 2.7 they have added a GC optimization that may be used in D too:
http://bugs.python.org/issue4074
 The garbage collector now performs better for one common usage pattern:
 when many objects are being allocated without deallocating any of them.
 This would previously take quadratic time for garbage collection, but now
 the number of full garbage collections is reduced as the number of objects
 on the heap grows. The new logic only performs a full garbage collection
 pass when the middle generation has been collected 10 times and when the
 number of survivor objects from the middle generation exceeds 10% of the
 number of objects in the oldest generation. (Suggested by Martin von Löwis
 and implemented by Antoine Pitrou; issue 4074.)
 while D's splitlines() wasn't quite working.
 int space = std.regexp.find(line, r"\s");
What do you mean? If there's a bug in splitlines() or split() it's better to
add it to Bugzilla, possibly with an inlined string to split (no external
file to read). splitlines() and split() are simple functions of a module,
written in D, so if there's a problem it's usually not too hard to fix; they
are not built-in methods written in C as in CPython.

	if(!(path in oldpaths) && !(checksum in oldsums))

In D2 this may be written (unfortunately there is no "and" keyword, Walter
doesn't get its usefulness yet):

	if (path !in oldpaths && checksum !in oldsums)

Bye,
bearophile
Sep 16 2010
parent reply Rounin <davidjo student.matnat.uio.no> writes:
 Keep in mind that D associative arrays are usually slower than Python
 dicts. Probably you build data structures like associative arrays, and this
 slows down the GC. If you disable & enable the GC around that build phase,
 the program is probably fast (so I suggest you narrow the disable/enable
 span as much as possible, to see where the GC problem is). If you put an
 exit(0) at the end of the program (to kill the final collection) the D
 program may save even more time.
This could be the case, as I started noticing the slowdown as I was writing either the second or third block, which is where I build a ton of data structures. Complicating the code by avoiding associative arrays isn't really an option, though, as the whole point of writing the program was comparing D to Python in terms of simplicity. Turning off the garbage collector during time-critical bits is simple enough, though, as long as one knows that it has to be done.
 What do you mean? If there's a bug in splitlines() or split() it's better
 to add it to Bugzilla, possibly with an inlined string to split (no
 external file to read). splitlines() and split() are simple functions of a
 module, written in D, so if there's a problem it's usually not too hard to
 fix; they are not built-in methods written in C as in CPython.
There's no bug in split() as far as I know, but during development of the
program, splitlines() would split the data on things other than newlines. It
could be a bug in D, but it could just as well be a temporary glitch in my
code, so I'm not filing a bug.
Sep 16 2010
parent reply dsimcha <dsimcha yahoo.com> writes:
== Quote from Rounin (davidjo student.matnat.uio.no)'s article
 Complicating the code by avoiding associative arrays isn't really an option,
 though, as the whole point of writing the program was comparing D to Python in
 terms of simplicity.
How about using a library defined associative array? As I've mentioned
several times here before, D's current builtin associative array
implementation interacts horribly with the GC. This is true both in terms
of speed and memory usage. Neither D's GC nor its AA implementation is that
bad per se, but the AA implementation seems to expose all the weaknesses in
the GC, such that the result when using large AAs is laughably bad. I think
eventually we need to completely redesign and rewrite it, or at least put a
sealed implementation in std.container that's more geared towards huge AAs.

My RandAA can be several times faster than the builtin AA when dealing with
huge arrays (purely due to memory management overhead), and some tweaking by
others in the community has produced an even faster version. Furthermore,
it doesn't create tons of heap fragmentation and false pointers for the GC.
I've solved out of memory errors with the builtin AA in some small projects
just by swapping in RandAA for it. (See http://dsource.org/projects/aa)
Sep 16 2010
parent reply Sean Kelly <sean invisibleduck.org> writes:
dsimcha Wrote:

 == Quote from Rounin (davidjo student.matnat.uio.no)'s article
 Complicating the code by avoiding associative arrays isn't really an option,
 though, as the whole point of writing the program was comparing D to Python in
 terms of simplicity.
How about using a library defined associative array? As I've mentioned several times here before, D's current builtin associative array implementation interacts horribly with the GC. This is true both in terms of speed and memory usage. Neither D's GC nor its AA implementation is that bad per se, but the AA implementation seems to expose all the weaknesses in the GC, such that the result when using large AAs is laughably bad. I think eventually we need to completely redesign and rewrite it, or at least put a sealed implementation in std.container that's more geared towards huge AAs.
The precise scanning patch would probably help tremendously here.
Sep 16 2010
parent dsimcha <dsimcha yahoo.com> writes:
== Quote from Sean Kelly (sean invisibleduck.org)'s article
 dsimcha Wrote:
 == Quote from Rounin (davidjo student.matnat.uio.no)'s article
 Complicating the code by avoiding associative arrays isn't really an option,
 though, as the whole point of writing the program was comparing D to Python in
 terms of simplicity.
How about using a library defined associative array? As I've mentioned several times here before, D's current builtin associative array implementation interacts horribly with the GC. This is true both in terms of speed and memory usage. Neither D's GC nor its AA implementation is that bad per se, but the AA implementation seems to expose all the weaknesses in the GC, such that the result when using large AAs is laughably bad. I think eventually we need to completely redesign and rewrite it, or at least put a sealed implementation in std.container that's more geared towards huge AAs.
The precise scanning patch would probably help tremendously here.
Yes, it would, but even so, I don't think that a good AA implementation for
a GC like D's should require an allocation on **every single insert**. This
makes AAs basically unusable in parallel code.
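
For contrast, a sketch of why an open-addressing table can insert without a
per-insert heap allocation (hypothetical code, not RandAA's actual
implementation): entries live in one flat, preallocated array, so an insert
just writes into a slot.

	struct Entry { string key; int val; bool used; }

	struct FlatTable
	{
	    Entry[] slots;  // preallocated up front; reallocated only on rehash

	    // Assumes slots was sized ahead of time and never fills up.
	    void insert(string key, int val)
	    {
	        size_t i = typeid(key).getHash(&key) % slots.length;
	        while (slots[i].used && slots[i].key != key)
	            i = (i + 1) % slots.length;  // linear probing
	        slots[i].key = key;
	        slots[i].val = val;
	        slots[i].used = true;
	    }
	}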
Sep 16 2010
prev sibling next sibling parent reply "Steven Schveighoffer" <schveiguy yahoo.com> writes:
On Thu, 16 Sep 2010 06:41:14 -0400, Rounin <davidjo student.matnat.uio.no>  
wrote:

 Thank you for that advice. I'm using GDC because it's available from  
 Ubuntu
 Linux's package system, whereas DMD currently is not. (And the .deb  
 posted on
 digitalmars.com is only for i386.) Hopefully, D will gain more  
 popularity and more
 up-to-date tools will be made available.
I've heard that dmd is a bit tricky to set up on a 64-bit system (mine is 32-bit), but it will give more credence to your position to use the latest available tools.
 By the way, today I re-compiled the program with a "std.gc.enable;"  
 right before
 the final "return 0" statement, and it still runs in 0.68 seconds. So  
 either those
 objects aren't marked as needing garbage collection, or it's really just  
 an issue
 of keeping the garbage collector from activating too quickly.
The GC runs are not scheduled. Essentially, you are probably encountering a
syndrome where the GC runs more than it should. This usually occurs when
incrementally allocating large amounts of memory without dropping any of it.

The GC collection runs when an allocation cannot find any free memory. Most
likely this is what's happening:

allocate -> no free memory available -> run GC collection, get nothing back
-> allocate more memory from the OS -> allocate ...

Of course, the allocations will succeed for a bit after getting OS memory,
but the GC collection is run way too often (as your profiling suggests).
Disabling it at the beginning, then enabling it at the end, means it's going
to run at most once. Essentially, you take out the "run GC collection" step;
everything else is the same.

Given that 87% of the run time is spent in the collection cycle, it's pretty
obvious why this reduced your runtime ;)

-Steve
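
That cycle as code, for illustration (a simplified model, not the actual
druntime allocator; the helpers below are stubs):

	// Stubs standing in for the real pool search, collector, and OS calls.
	void* tryFreeLists(size_t size) { /* search the GC pools */ return null; }
	void fullCollect() { /* mark & sweep */ }
	void growHeapFromOS() { /* map more pool space from the OS */ }

	void* gcAlloc(size_t size)
	{
	    void* p = tryFreeLists(size);
	    if (p is null)
	    {
	        fullCollect();               // a collection is tried first...
	        p = tryFreeLists(size);
	        if (p is null)
	        {
	            growHeapFromOS();        // ...the heap only grows after the
	            p = tryFreeLists(size);  // collection recovered nothing
	        }
	    }
	    return p;
	}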
Sep 16 2010
parent Justin Johansson <no spam.com> writes:
On 16/09/2010 9:53 PM, Steven Schveighoffer wrote:
 [...]
 Given that 87% of the run time is spent in the collection cycle, it's
 pretty obvious why this reduced your runtime ;)
The redux of this discussion is surely that no matter how "good" the
semantics of a PL are, the proof of the pudding is in the GC. Given the
more-or-less well-founded value vs reference semantics in D, it might be
useful to consider region-based memory management idioms and integrate these
idioms into the language. Perhaps we might see this in D99 as a measure to
circumvent traditional GC issues.
Sep 16 2010
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 9/16/10 5:41 CDT, Rounin wrote:
 Thank you for that advice. I'm using GDC because it's available from Ubuntu
 Linux's package system, whereas DMD currently is not. (And the .deb posted
 on digitalmars.com is only for i386.) Hopefully, D will gain more
 popularity and more up-to-date tools will be made available.

 By the way, today I re-compiled the program with a "std.gc.enable;" right
 before the final "return 0" statement, and it still runs in 0.68 seconds.
 So either those objects aren't marked as needing garbage collection, or
 it's really just an issue of keeping the garbage collector from activating
 too quickly.
I have noted this trigger-happiness of the GC since way back during my
doctoral work, and conveyed it several times to Walter and Sean. They
haven't gotten around to it, but I carry most of the blame because it was a
pressing matter to me, not to them.

There are two issues: one is that the GC jumps in too early, and the other
is that it jumps in too often. A program of mine had a pattern in which it
would need a lot of memory for a long time, during which it would do the
occasional small string allocation etc. There would still be a lot of
collection attempts going on; the GC failed to notice that repeated
collections do not recover much memory.

Thanks Rounin for the comparison. I think it's a great test bed for D's
collector. If you could share the data too, that would be great.


Andrei
Sep 16 2010
parent Rounin <davidjo student.matnat.uio.no> writes:
Thank you for those kind words! Hopefully, it'll become a prioritized issue
whenever more pressing things are resolved.

In the meantime, here are (some slightly anonymized versions of) the MD5
files in question. They're about 9-10 megabytes each, so no wonder it
allocates a bit of memory!

http://folk.uio.no/davidjo/a.md5
http://folk.uio.no/davidjo/b.md5
[uuencoded attachment: comparechecksums.d]
Sep 16 2010
prev sibling next sibling parent Leandro Lucarella <llucax gmail.com> writes:
Adding info about Python+Psyco:
Here are the results (Celeron ~2GHz, 1M elements, average of 3 runs):

       C        C++      D no gc  D gc       D gc dis  Python     Py+Psyco
Fill   .136136  .142705  .131238  22.628577  0.242637  12.952816  2.564722
Inc    .025134  .038768  .037456   0.050480  0.051545   3.765271  0.087375
Total  .161270  .181473  .168694  22.679057  0.294182  16.718087  2.652097

Results are in seconds, compiled with gcc/g++/gdc (-O3 -finline, plus
-frelease for D code). Using phobos (gdc 0.25-20080419-4.1.2-22, with
debian patches).

Jun 10 2008
prev sibling parent reply Michael Neumann <mneumann ntecs.de> writes:
Leandro Lucarella wrote:
 I've done a very simple microbenchmark testing a singly linked list.
 
 What the test does is fill the list with long elements, and then iterate
 over the list and increment the associated value. This test mostly tests
 allocation speed.
 
 I've done a plain C version, a C++ version (using the STL double-linked
 list, so it's at some disadvantage here), a D version without using the GC,
 a D version using the GC, and a Python version.
 
 Here are the results (Celeron ~2GHz, 1M elements, average of 3 runs):
 
        C         C++       D no gc   D gc       D gc dis  Python
 Fill   0.136136  0.142705  0.131238  22.628577  0.242637  12.952816
 Inc    0.025134  0.038768  0.037456   0.050480  0.051545   3.765271
 Total  0.161270  0.181473  0.168694  22.679057  0.294182  16.718087
 
 Results are in seconds, compiled with gcc/g++/gdc (-O3 -finline, plus
 -frelease for D code). Using phobos (gdc 0.25-20080419-4.1.2-22, with
 debian patches).
 
 You can see Python is almost twice as fast as D at allocation (and the
 Python example has some intentional overhead to keep it as close to the D
 code as possible; using more Pythonic code could yield better results).
Python used reference counting (~ Python 1.x) in the past, but I think they
switched to a tracing garbage collector in 2.x. Using reference counting
would explain the speedup IMHO.

I think any mark and sweep garbage collector would be somewhat slow in the
case of a linked list, as it has to traverse all the elements to determine
which memory is live and which can be freed (so it's O(n)). You'd probably
be better off using a dynamic array here.

Regards,

Michael
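
A sketch of why the mark phase is O(n) on a list (illustrative code, not the
actual gcx mark routine):

	struct Node { long data; bool marked; Node* next; }

	// Every collection has to chase the whole chain to prove the nodes are
	// still live: one visit per node per collection cycle.
	void mark(Node* p)
	{
	    for (; p !is null && !p.marked; p = p.next)
	        p.marked = true;
	}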
Jun 25 2008
next sibling parent "Jarrett Billingsley" <kb3ctd2 yahoo.com> writes:
"Michael Neumann" <mneumann ntecs.de> wrote in message 
news:4862675F.3020603 ntecs.de...

 Python used reference counting (~ Python 1.x) in the past, but I think
 they switched to a tracing garbage collector in 2.x. Using reference
 counting would explain the speedup IMHO. I think any mark and sweep
 garbage collector would be somewhat slow in the case of a linked list,
 as it has to traverse all elements to determine which memory is live and
 which can be freed (so it's O(n)). You'd probably be better off using a
 dynamic array here.
Maybe they just didn't change the API for backwards compatibility, but all the Python 2.x source I've seen used reference counting macros. Maybe it's a hybrid memory manager? Or maybe the macros are now no-ops ;)
Jun 25 2008
prev sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Michael Neumann:
 Python used reference counting (~ Python 1.x) in the past, but I think
 they switched to a tracing garbage collector in 2.x.
Python 2.5.2 uses a reference counting GC plus logic to break reference
cycles; part of the code:

http://svn.python.org/view/python/trunk/Modules/gcmodule.c?rev=64048&view=markup

You may want to take a look at this page of the results too (Psyco uses the
same GC, it's not another language):

http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.D&article_id=72520

Bye,
bearophile
Jun 25 2008
parent "Jarrett Billingsley" <kb3ctd2 yahoo.com> writes:
"bearophile" <bearophileHUGS lycos.com> wrote in message 
news:g3tqs0$57g$1 digitalmars.com...
 Michael Neumann:
 Python used reference counting (~ Python 1.x) in the past, but I think
 they switched to a tracing garbage collector in 2.x.
Python 2.5.2 uses a reference counting GC plus logic to break reference cycles, part of the code: http://svn.python.org/view/python/trunk/Modules/gcmodule.c?rev=64048&view=markup
And there it is :)
Jun 25 2008