digitalmars.D - More embarrassing microbenchmarks for D's GC.
- Leandro Lucarella (52/52) Jun 09 2008 I've done a very simple microbenchmark testing a singly linked list...
- bearophile (44/47) Jun 09 2008 Note that Psyco can't compile code outside functions (global code is slo...
- Leandro Lucarella (13/20) Jun 10 2008 Thanks for the info, I've imported psyco but didn't play with it really ...
- bearophile (4/5) Jun 10 2008 If you have a bit of free time you may time that code of mine (that's eq...
- Sean Kelly (10/34) Jun 09 2008 I think it may have more to do with the allocation strategy in the GC. ...
- Leandro Lucarella (20/34) Jun 10 2008 Really? I thought it was more collection-related...
- Sean Kelly (9/37) Jun 10 2008 Oh okay. I was replying to the post that said the malloc
- Marius Muja (6/12) Jun 10 2008 The graph may look exponential but that's often misleading (it may very
- Leandro Lucarella (13/24) Jun 11 2008 You and Sean are right, the graph is somewhere in between linear and
- Walter Bright (5/33) Jun 09 2008 There are so many factors that can influence this, it's hard to say
- Leandro Lucarella (14/46) Jun 10 2008 But there is no GC activity there, right?...
- Walter Bright (3/41) Jun 10 2008 Break up your code into more functions to get better info from the profi...
- Leandro Lucarella (14/44) Jun 11 2008 Doesn't seem to be running. But it's weird you can't tell for sure, being
- Rounin (11/11) Sep 15 2010 I found this posting after investigating a huge slowdown in a D program ...
- Steven Schveighoffer (10/25) Sep 15 2010 gdc is very old. You may want to try the thing on the latest dmd. No
- Rounin (8/8) Sep 16 2010 Thank you for that advice. I'm using GDC because it's available from Ubu...
- bearophile (13/22) Sep 16 2010 You may try to disable/enable the GC in the Python code too (on 32 bit s...
- Rounin (9/21) Sep 16 2010 This could be the case, as I started noticing the slowdown as I was writ...
- dsimcha (15/18) Sep 16 2010 How about using a library defined associative array? As I've mentioned ...
- Sean Kelly (2/15) Sep 16 2010 The precise scanning patch would probably help tremendously here.
- dsimcha (4/19) Sep 16 2010 Yes, it would but even so, I don't think that a good AA implementation f...
- Steven Schveighoffer (21/35) Sep 16 2010 I've heard that dmd is a bit tricky to set up on a 64-bit system (mine i...
- Justin Johansson (7/44) Sep 16 2010 The redux of this discussion is surely that no matter how "good" the
- Andrei Alexandrescu (14/22) Sep 16 2010 I have noted this trigger happiness of the GC since way back during my
- Rounin (68/69) Sep 16 2010 Thank you for those kind words! Hopefully, it'll become a prioritized is...
- Leandro Lucarella (16/16) Jun 10 2008 Adding info about Python+Psyco:
- Michael Neumann (10/34) Jun 25 2008 Python used reference counting (~ Python 1.x) in the past, but I think
- Jarrett Billingsley (5/12) Jun 25 2008 Maybe they just didn't change the API for backwards compatibility, but a...
- bearophile (7/9) Jun 25 2008 Python 2.5.2 uses a reference counting GC plus logic to break reference ...
- Jarrett Billingsley (3/9) Jun 25 2008 And there it is :)
I've done a very simple microbenchmark testing a singly linked list. What the test does is fill the list with long elements, and then iterate the list and increment the associated value. This test mostly tests the allocation speed. I've done a plain C version, a C++ version (using the STL double-linked list, so it has some disadvantage here), a D version without using the GC, a D version using the GC, and a Python version. Here are the results (Celeron ~2GHz, 1M elements, average of 3 runs):

            C          C++        D no gc    D gc        D gc dis   Python
  Fill      0.136136   0.142705   0.131238   22.628577   0.242637   12.952816
  Inc       0.025134   0.038768   0.037456    0.050480   0.051545    3.765271
  Total     0.161270   0.181473   0.168694   22.679057   0.294182   16.718087

Results are in seconds, compiled with gcc/g++/gdc (-O3 -finline, plus -frelease for D code). Using phobos (gdc 0.25-20080419-4.1.2-22, with debian patches).

You can see Python is almost twice as fast as D doing allocation (and the Python example has some intentional overhead to make it as close to the D code as possible; using more Pythonic code could yield better results). But there are a few other results I can't explain:

1) Why is the D gc version (disabled or not) ~25% slower than the D version that uses malloc when iterating the list? There shouldn't be any GC activity in that part. Could it be some GC locality issue that yields more cache misses?

2) Why is the D malloc version ~33% slower than the C version? I took a look at the generated assembly and it's almost identical:

  <_Dmain+198>: lea    -0x20(%ebp),%eax
  <_Dmain+201>: lea    0x0(%esi,%eiz,1),%esi
  <_Dmain+208>: addl   $0x1,0x8(%eax)
  <_Dmain+212>: adcl   $0x0,0xc(%eax)
  <_Dmain+216>: mov    (%eax),%eax
  <_Dmain+218>: test   %eax,%eax
  <_Dmain+220>: jne    0x804a240 <_Dmain+208>

  <main+248>: lea    -0x1c(%ebp),%eax
  <main+251>: nop
  <main+252>: lea    0x0(%esi,%eiz,1),%esi
  <main+256>: addl   $0x1,0x4(%eax)
  <main+260>: adcl   $0x0,0x8(%eax)
  <main+264>: mov    (%eax),%eax
  <main+266>: test   %eax,%eax
  <main+268>: jne    0x8048550 <main+256>
  <main+270>: movl   $0x0,0x4(%esp)
  <main+278>: movl   $0x8049800,(%esp)

Tests attached.
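A minimal sketch of the D GC version being timed (D1-era Phobos; the actual attached tests may differ in details):

import std.stdio;
import std.conv;
import std.date;

struct Node { long data; Node* next; }

int main(char[][] args)
{
    int count = toInt(args[1]);

    d_time start = getUTCtime();
    Node* list = new Node;       // every iteration allocates from the GC
    Node* n = list;
    for (int i = 0; i < count; i++) {
        n.data = i;
        n.next = new Node;
        n = n.next;
    }
    n.next = null;
    writefln("Fill: %s s", cast(double)(getUTCtime() - start) / TicksPerSecond);

    start = getUTCtime();
    for (Node* p = list; p; p = p.next)  // pure pointer chasing, no allocation
        p.data++;
    writefln("Inc: %s s", cast(double)(getUTCtime() - start) / TicksPerSecond);
    return 0;
}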
Jun 09 2008
Leandro Lucarella:

> You can see Python is almost twice as fast as D doing allocation (and the Python example has some intentional overhead to make it as close to the D code as possible; using more Pythonic code could yield better results).

Note that Psyco can't compile code outside functions (global code is slower even if you don't use Psyco). Try putting your code into a main() and adding a:

from psyco.classes import __metaclass__

at the top. The code:

import sys
from datetime import datetime

import psyco
psyco.full()
from psyco.classes import __metaclass__

class Node:
    def __init__(self):
        self.data = 0
        self.next = None

def main():
    start = datetime.now()
    list = Node()
    n = list
    for i in xrange(long(sys.argv[1])):
        n.data = i
        n.next = Node()
        n = n.next
    n.next = None
    end = datetime.now()
    delta = datetime.now() - start
    elapsed1 = delta.seconds * 1000000 + delta.microseconds

    start = datetime.now()
    p = list
    while p:
        p.data += 1
        p = p.next
    delta = datetime.now() - start
    elapsed2 = delta.seconds * 1000000 + delta.microseconds

    print "Fill: %lu usec" % elapsed1
    print "Inc: %lu usec" % elapsed2
    print "Total: %lu usec" % (elapsed1 + elapsed2)

main()

-----------------------

Original:                          9.32 s
Just moving code into a function:  8.39 s
The same, using Psyco:             1.94 s

Bye,
bearophile
Jun 09 2008
bearophile, on June 9 at 17:03 you wrote:

> Note that Psyco can't compile code outside functions (global code is slower even if you don't use Psyco).

Thanks for the info, I've imported psyco but didn't play with it really =)
Jun 10 2008
Leandro Lucarella:

> Thanks for the info, I've imported psyco but didn't play with it really =)

If you have a bit of free time, you could time that code of mine (which is equal to yours + Psyco) on your PC, so you can show us those timings too, compared to the other ones...

Bye,
bearophile
Jun 10 2008
== Quote from Leandro Lucarella (llucax gmail.com)'s article

> But there are a few other results I can't explain:
>
> 1) Why is the D gc version (disabled or not) ~25% slower than the D version that uses malloc when iterating the list? There shouldn't be any GC activity in that part. Could it be some GC locality issue that yields more cache misses?

I think it may have more to do with the allocation strategy in the GC. It obtains memory in chunks from the OS, and each chunk is typically a max of 8MB. So for a test like this the D GC will end up hitting the OS quite a few times asking for more memory. If I had to guess I'd say that malloc has a more efficient strategy here. If you're interested, try running the same test using Tango with and without a call to tango.core.Memory.GC.reserve() for the amount of memory you expect the app to use before the loop (see the sketch below).

> 2) Why is the D malloc version ~33% slower than the C version? I took a look at the generated assembly and it's almost identical:
> [assembly snipped]

No idea. I'd expect them to be roughly equivalent.

Sean
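Roughly what that Tango variant could look like (a sketch; the 32 bytes/node reservation is a guess, not a measured figure):

import tango.core.Memory;

struct Node { long data; Node* next; }

void main()
{
    const int count = 1_000_000;
    GC.reserve(count * 32);      // pre-grow the GC pools before the loop
    Node* list = new Node;
    Node* n = list;
    for (int i = 0; i < count; i++) {
        n.data = i;
        n.next = new Node;       // should now mostly hit pre-reserved memory
        n = n.next;
    }
    n.next = null;
}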
Jun 09 2008
Sean Kelly, on June 9 at 22:53 you wrote:

> I think it may have more to do with the allocation strategy in the GC. It obtains memory in chunks from the OS, and each chunk is typically a max of 8MB. So for a test like this the D GC will end up hitting the OS quite a few times asking for more memory. If I had to guess I'd say that malloc has a more efficient strategy here.

Really? I thought it was more collection-related. I don't know exactly how the D GC is implemented, but I thought that, as you say, it allocates a chunk of memory, each new gives you a piece of that memory, and when it runs out of memory, it runs a collection cycle. When that cycle can't recover unused memory, it allocates more.

Considering that the version with the GC disabled is almost as fast as the malloc version, and that the time grows exponentially (graph attached) with the number of nodes, it looks like the problem is the GC collection cycles to me. Is the GC algorithm exponential?
Jun 10 2008
== Quote from Leandro Lucarella (llucax gmail.com)'s article

> Considering that the version with the GC disabled is almost as fast as the malloc version, and that the time grows exponentially (graph attached) with the number of nodes, it looks like the problem is the GC collection cycles to me. Is the GC algorithm exponential?

Oh okay. I was replying to the post that said the malloc version was still much faster than D with the GC disabled. But if they're comparable then all the better. This is what I would expect anyway.

As for the GC algorithm being exponential... all nodes of the list contain pointers, so they will all be scanned by the GC. Given this and the chance of false positives, I would expect performance to decrease in a superlinear manner. But an exponential slow-down still seems a bit extreme.

Sean
Jun 10 2008
Leandro Lucarella wrote:

> Considering that the version with the GC disabled is almost as fast as the malloc version, and that the time grows exponentially (graph attached) with the number of nodes, it looks like the problem is the GC collection cycles to me. Is the GC algorithm exponential?

The graph may look exponential but that's often misleading (it may very well be a polynomial). To tell for sure, plot the data on a graph with a logarithmic Y axis and then on a graph with both axes logarithmic. If the plot is a straight line on the semilogy graph then it's exponential. If it's a straight line on the loglog graph then it's polynomial. (See the sketch below.)
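In code form, the same check (a sketch; the sample points are hypothetical, not measured data):

import std.stdio;
import std.math;

void main()
{
    // hypothetical (nodes, seconds) measurements
    double[] n = [1e4, 1e5, 1e6];
    double[] t = [0.05, 1.2, 23.0];

    // t ~ a * b^n => log t linear in n     (constant semilog slope)
    // t ~ a * n^k => log t linear in log n (constant loglog slope, ~k)
    for (size_t i = 1; i < n.length; i++) {
        double semilog = (log(t[i]) - log(t[i-1])) / (n[i] - n[i-1]);
        double loglog  = (log(t[i]) - log(t[i-1])) / (log(n[i]) - log(n[i-1]));
        writefln("semilog slope: %s   loglog slope: %s", semilog, loglog);
    }
    // roughly constant semilog slopes point to exponential growth;
    // roughly constant loglog slopes point to polynomial growth
}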
Jun 10 2008
Marius Muja, on June 10 at 21:38 you wrote:

> The graph may look exponential but that's often misleading (it may very well be a polynomial). To tell for sure, plot the data on a graph with a logarithmic Y axis and then on a graph with both axes logarithmic.

You and Sean are right: the graph is somewhere in between linear and exponential, but it's really far from linear, which I think is the worst case in GC algorithms...
Jun 11 2008
Leandro Lucarella wrote:

> But there are a few other results I can't explain:
>
> 1) Why is the D gc version (disabled or not) ~25% slower than the D version that uses malloc when iterating the list? There shouldn't be any GC activity in that part. Could it be some GC locality issue that yields more cache misses?

There are so many factors that can influence this, it's hard to say without spending a lot of time carefully examining it.

> 2) Why is the D malloc version ~33% slower than the C version? I took a look at the generated assembly and it's almost identical:
> [assembly snipped]

Without running a profiler, there's no way to be sure about just where in the code the time is being consumed.
Jun 09 2008
Walter Bright, on June 9 at 17:12 you wrote:

> There are so many factors that can influence this, it's hard to say without spending a lot of time carefully examining it.

But there is no GC activity there, right?

> Without running a profiler, there's no way to be sure about just where in the code the time is being consumed.

Attached is the output of the gprof program (for the list-test-d-gc with the GC running). I don't see any helpful information for point 2), but it shows clearly that most of the time is spent in garbage collection.
Jun 10 2008
Leandro Lucarella wrote:

> But there is no GC activity there, right?

Instrument it to be sure.

> Attached is the output of the gprof program (for the list-test-d-gc with the GC running). I don't see any helpful information for point 2), but it shows clearly that most of the time is spent in garbage collection.

Break up your code into more functions to get better info from the profiler.
Jun 10 2008
Walter Bright, on June 10 at 14:13 you wrote:

> > But there is no GC activity there, right?
>
> Instrument it to be sure.

Doesn't seem to be running. But it's weird you can't tell for sure, being the author :S

> Break up your code into more functions to get better info from the profiler.

Is this useful (kcachegrind screenshots, using callgrind with --dump-instr=yes and --trace-jump=yes)?

http://llucax.com.ar/~luca/kcachegrind-d.png
http://llucax.com.ar/~luca/kcachegrind-c.png
Jun 11 2008
I found this posting after using gprof to investigate a huge slowdown in a D program I'd made; gprof showed the culprit to be a function called _D3gcx3Gcx4markMFPvPvZv, in which the program spent 87% of its time. That turned out to be the garbage collector.

The program takes 1.74 seconds to run with two particular input files in Python. The garbage-collected D version takes 5.27 seconds. The non-garbage-collected version takes only 0.68 seconds. Here's a more detailed report:

http://rounin.livejournal.com/21815.html

Obviously, the compiled program should run at least as fast as Python, so the garbage collector could use some improvement.
Sep 15 2010
On Wed, 15 Sep 2010 15:53:38 -0400, Rounin <davidjo student.matnat.uio.no> wrote:

> I found this posting after using gprof to investigate a huge slowdown in a D program I'd made; gprof showed the culprit to be a function called _D3gcx3Gcx4markMFPvPvZv, in which the program spent 87% of its time. That turned out to be the garbage collector. [...]

gdc is very old. You may want to try the thing on the latest dmd. No guarantee this is going to help though.

Another thing you may want to try is dcollections HashMap instead of builtin associative arrays, as they use a free-list style allocator that helps avoid GC collections.

I agree the GC is a major issue with performance. I hope that after D2 is a bit more stable, we start working on improving it.

-Steve
Sep 15 2010
Thank you for that advice. I'm using GDC because it's available from Ubuntu Linux's package system, whereas DMD currently is not. (And the .deb posted on digitalmars.com is only for i386.) Hopefully, D will gain more popularity and more up-to-date tools will be made available. By the way, today I re-compiled the program with a "std.gc.enable;" right before the final "return 0" statement, and it still runs in 0.68 seconds. So either those objects aren't marked as needing garbage collection, or it's really just an issue of keeping the garbage collector from activating too quickly.
Sep 16 2010
Rounin:

> Thank you for that advice. I'm using GDC because it's available from Ubuntu Linux's package system, whereas DMD currently is not. (And the .deb posted on digitalmars.com is only for i386.) Hopefully, D will gain more popularity and more up-to-date tools will be made available.

On a 32 bit system I find LDC to produce efficient programs, when possible (the GC is similar; in D2 it's essentially the same).

> By the way, today I re-compiled the program with a "std.gc.enable;" right before the final "return 0" statement, and it still runs in 0.68 seconds.

You may try to disable/enable the GC in the Python code too (on 32 bit systems there's the very good Psyco too). Your benchmark is not portable, so I can't help you find where the performance problem is. When you perform a benchmark it's better to give all source code and all data too, to allow others to reproduce the results and look for performance problems.

Keep in mind that D associative arrays are usually slower than Python dicts. Probably you build data structures like associative arrays, and this slows down the GC. If you disable & enable the GC around that build phase, the program is probably fast (so I suggest you narrow as much as possible the width of the disable/enable span, so you may see where the GC problem is; see the sketch below). If you put an exit(0) at the end of the program (to kill the final collection) the D program may save more time.

In Python 2.7 they have added a GC optimization that may be used in D too: http://bugs.python.org/issue4074

> The garbage collector now performs better for one common usage pattern: when many objects are being allocated without deallocating any of them. This would previously take quadratic time for garbage collection, but now the number of full garbage collections is reduced as the number of objects on the heap grows. The new logic only performs a full garbage collection pass when the middle generation has been collected 10 times and when the number of survivor objects from the middle generation exceeds 10% of the number of objects in the oldest generation. (Suggested by Martin von Löwis and implemented by Antoine Pitrou; issue 4074.)

> while D's splitlines() wasn't quite working.
> int space = std.regexp.find(line, r"\s");

What do you mean? If there's a bug in splitlines() or split() it's better to add it to Bugzilla, possibly with an inlined string to split (no external file to read). splitlines() and split() are simple functions of a module, written in D, so if there's a problem it's usually not too hard to fix; they are not built-in methods written in C as in CPython.

if(!(path in oldpaths) && !(checksum in oldsums))

In D2 this may be written (unfortunately there is no "and" keyword, Walter doesn't get its usefulness yet):

if (path !in oldpaths && checksum !in oldsums)

Bye,
bearophile
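A minimal sketch of that narrowed disable/enable span (D1 Phobos API to match Rounin's gdc setup; buildTables() is a hypothetical stand-in for the real allocation-heavy phase):

import std.gc;
import std.string;

// hypothetical stand-in for the real allocation-heavy phase
int[char[]] buildTables()
{
    int[char[]] t;
    for (int i = 0; i < 1_000_000; i++)
        t["key" ~ toString(i)] = i;  // many small allocations, none freed
    return t;
}

void main()
{
    std.gc.disable();            // suspend collections while the heap only grows
    auto tables = buildTables();
    std.gc.enable();             // resume collections once the build is done
    // ... read-mostly phase: lookups, reporting ...
}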
Sep 16 2010
> Keep in mind that D associative arrays are usually slower than Python dicts. Probably you build data structures like associative arrays, and this slows down the GC. If you disable & enable the GC around that build phase, the program is probably fast (so I suggest you narrow as much as possible the width of the disable/enable span, so you may see where the GC problem is). If you put an exit(0) at the end of the program (to kill the final collection) the D program may save more time.

This could be the case, as I started noticing the slowdown as I was writing either the second or third block, which is where I build a ton of data structures. Complicating the code by avoiding associative arrays isn't really an option, though, as the whole point of writing the program was comparing D to Python in terms of simplicity. Turning off the garbage collector during time-critical bits is simple enough, though, as long as one knows that it has to be done.

> What do you mean? If there's a bug in splitlines() or split() it's better to add it to Bugzilla, possibly with an inlined string to split (no external file to read). [...]

There's no bug in split() as far as I know, but during development of the program, splitlines() would split the data on other things than newlines. It could be a bug in D, but it could just be a temporary glitch in the code, so I'm not filing a bug.
Sep 16 2010
== Quote from Rounin (davidjo student.matnat.uio.no)'s article

> Complicating the code by avoiding associative arrays isn't really an option, though, as the whole point of writing the program was comparing D to Python in terms of simplicity.

How about using a library-defined associative array? As I've mentioned several times here before, D's current builtin associative array implementation interacts horribly with the GC. This is true both in terms of speed and memory usage. Neither D's GC nor its AA implementation is that bad per se, but the AA implementation seems to expose all the weaknesses in the GC, such that the result when using large AAs is laughably bad. I think eventually we need to completely redesign and rewrite it, or at least put a sealed implementation in std.container that's more geared towards huge AAs.

My RandAA can be several times faster than the builtin AA when dealing with huge arrays (purely due to memory management overhead), and some tweaking by others in the community has produced an even faster version. Furthermore, it doesn't create tons of heap fragmentation and false pointers for the GC. I've solved out-of-memory errors with the builtin AA in some small projects just by swapping in RandAA for it. (See http://dsource.org/projects/aa)
Sep 16 2010
dsimcha Wrote:

> How about using a library-defined associative array? As I've mentioned several times here before, D's current builtin associative array implementation interacts horribly with the GC. [...]

The precise scanning patch would probably help tremendously here.
Sep 16 2010
== Quote from Sean Kelly (sean invisibleduck.org)'s article

> The precise scanning patch would probably help tremendously here.

Yes, it would, but even so, I don't think that a good AA implementation for a GC like D's should require an allocation on **every single insert**. This makes AAs basically unusable in parallel code.
Sep 16 2010
On Thu, 16 Sep 2010 06:41:14 -0400, Rounin <davidjo student.matnat.uio.no> wrote:

> Thank you for that advice. I'm using GDC because it's available from Ubuntu Linux's package system, whereas DMD currently is not. (And the .deb posted on digitalmars.com is only for i386.) Hopefully, D will gain more popularity and more up-to-date tools will be made available.

I've heard that dmd is a bit tricky to set up on a 64-bit system (mine is 32-bit), but it will give more credence to your position to use the latest available tools.

> By the way, today I re-compiled the program with a "std.gc.enable;" right before the final "return 0" statement, and it still runs in 0.68 seconds. So either those objects aren't marked as needing garbage collection, or it's really just an issue of keeping the garbage collector from activating too quickly.

The GC runs are not scheduled. Essentially, you are probably encountering a syndrome where the GC runs more than it should. This usually occurs when incrementally allocating large amounts of memory without dropping any of it. The GC collection runs when allocating memory cannot find any free memory. Most likely this is what's happening:

allocate -> no free memory available -> run GC collection, get nothing back -> allocate more memory from OS -> allocate -> ...

Of course, the allocations will succeed for a bit after getting OS memory, but the GC collection is run way too often (as your profiling suggests, and as the sketch below illustrates). Disabling it at the beginning, then enabling it at the end means it's going to run at most once. Essentially, you take out the "run GC collection" step; everything else is the same. Given that 87% of the run time is spent in the collection cycle, it's pretty obvious why this reduced your runtime ;)

-Steve
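A tiny sketch of that grow-only pattern (hypothetical Node type; the heap never shrinks, so each collection scans every live node and reclaims nothing):

struct Node { long data; Node* next; }

void main()
{
    Node* head = null;
    for (int i = 0; i < 1_000_000; i++) {
        auto n = new Node;  // once the pool is exhausted, this triggers a
        n.data = i;         // full collection that finds nothing to free,
        n.next = head;      // then falls back to asking the OS for more
        head = n;
    }
}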
Sep 16 2010
On 16/09/2010 9:53 PM, Steven Schveighoffer wrote:

> The GC runs are not scheduled. Essentially, you are probably encountering a syndrome where the GC runs more than it should. This usually occurs when incrementally allocating large amounts of memory without dropping any of it. [...]

The redux of this discussion is surely that no matter how "good" the semantics of a PL are, the proof of the pudding is in the GC. Given the more-or-less well-founded value vs reference semantics in D, it might be useful to consider region-based memory management idioms and integrate these idioms into the language (see the sketch below). Perhaps we might see this in D99 as a measure to circumvent traditional GC issues.
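For illustration, a minimal region idiom expressible in today's D (a hedged sketch, not a language proposal; alignment and pool growth are glossed over):

struct Node { long data; Node* next; }

// toy region: nodes are carved out of one chunk and released all at once
struct Region
{
    ubyte[] pool;
    size_t used;

    Node* make()
    {
        if (used + Node.sizeof > pool.length)
            return null;                 // a real region would grow here
        auto n = cast(Node*)&pool[used]; // alignment glossed over
        used += Node.sizeof;
        return n;
    }
}

void main()
{
    Region r;
    r.pool = new ubyte[1_000_000 * Node.sizeof];
    Node* head = null;
    for (int i = 0; i < 1_000_000; i++) {
        auto n = r.make();
        n.data = i;
        n.next = head;
        head = n;
    }
    // dropping the last reference to r.pool frees every node at once
}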
Sep 16 2010
On 9/16/10 5:41 CDT, Rounin wrote:

> By the way, today I re-compiled the program with a "std.gc.enable;" right before the final "return 0" statement, and it still runs in 0.68 seconds. So either those objects aren't marked as needing garbage collection, or it's really just an issue of keeping the garbage collector from activating too quickly.

I have noted this trigger happiness of the GC since way back during my doctoral work, and conveyed it several times to Walter and Sean. They haven't gotten around to it, but I carry most of the blame because it was a pressing matter to me, not to them.

There are two issues: one is the GC jumps in too early, and the other is it jumps in too often. A program of mine had a pattern in which it would need a lot of memory for a long time, during which it would do the occasional small string allocation etc. There would still be a lot of collection attempts going on; the GC failed to notice that repeated collections do not recover much memory.

Thanks Rounin for the comparison. I think it's a great test bed for D's collector. If you could share the data too, that would be great.

Andrei
Sep 16 2010
Thank you for those kind words! Hopefully, it'll become a prioritized issue whenever more pressing things are resolved. In the meantime, here are (some slightly anonymized versions of) the MD5 files in question. They're about 9-10 megabytes each, so no wonder it allocates a bit of memory!

http://folk.uio.no/davidjo/a.md5
http://folk.uio.no/davidjo/b.md5

[uuencoded attachment: comparechecksums.d]
Sep 16 2010
Adding info about Python+Psyco. Here are the results (Celeron ~2GHz, 1M elements, average of 3 runs):

            C          C++        D no gc    D gc        D gc dis   Python      Py+Psyco
  Fill      0.136136   0.142705   0.131238   22.628577   0.242637   12.952816   2.564722
  Inc       0.025134   0.038768   0.037456    0.050480   0.051545    3.765271   0.087375
  Total     0.161270   0.181473   0.168694   22.679057   0.294182   16.718087   2.652097

Results are in seconds, compiled with gcc/g++/gdc (-O3 -finline, plus -frelease for D code). Using phobos (gdc 0.25-20080419-4.1.2-22, with debian patches).
Jun 10 2008
Leandro Lucarella wrote:

> You can see Python is almost twice as fast as D doing allocation (and the Python example has some intentional overhead to make it as close to the D code as possible; using more Pythonic code could yield better results).

Python used reference counting (~ Python 1.x) in the past, but I think they switched to a tracing garbage collector in 2.x. Using reference counting would explain the speedup IMHO.

I think any mark-and-sweep garbage collector would be somewhat slow in the case of a linked list, as it has to traverse all elements to determine which memory is live and which can be freed (so it's O(n)). You'd probably be better off using a dynamic array here (see the sketch below).

Regards,
Michael
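A quick sketch of that dynamic-array variant (assuming the benchmark only needs the values rather than real list nodes, so a single contiguous allocation replaces a million small ones):

import std.stdio;

void main()
{
    const int N = 1_000_000;
    long[] data = new long[N];  // a single GC allocation
    foreach (i, ref d; data)    // fill
        d = i;
    foreach (ref d; data)       // increment
        d++;
    writefln("last: %s", data[$ - 1]);
}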
Jun 25 2008
"Michael Neumann" <mneumann ntecs.de> wrote in message news:4862675F.3020603 ntecs.de...Python used reference counting (~ Python 1.x) in the past, but I think they switched to a tracing garbage collector in 2.x. Using reference counting would explain the speedup IMHO. I think any mark and sweep garbage collector would be somewhat slow in the case of a linked list, as it has to traverse all elements to determine which memory is live and which can be freed (so it's O(n)). You'd be probably better off using a dynamic array here.Maybe they just didn't change the API for backwards compatibility, but all the Python 2.x source I've seen used reference counting macros. Maybe it's a hybrid memory manager? Or maybe the macros are now no-ops ;)
Jun 25 2008
Michael Neumann:

> Python used reference counting (~ Python 1.x) in the past, but I think they switched to a tracing garbage collector in 2.x.

Python 2.5.2 uses a reference counting GC plus logic to break reference cycles; part of the code:

http://svn.python.org/view/python/trunk/Modules/gcmodule.c?rev=64048&view=markup

You may want to take a look at this page of the results too (Psyco uses the same GC, it's not another language):

http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.D&article_id=72520

Bye,
bearophile
Jun 25 2008
"bearophile" <bearophileHUGS lycos.com> wrote in message news:g3tqs0$57g$1 digitalmars.com...Michael Neumann:And there it is :)Python used reference counting (~ Python 1.x) in the past, but I think they switched to a tracing garbage collector in 2.x.Python 2.5.2 uses a reference counting GC plus logic to break reference cycles, part of the code: http://svn.python.org/view/python/trunk/Modules/gcmodule.c?rev=64048&view=markup
Jun 25 2008