digitalmars.D - Slow GC?
- bearophile (35/35) Mar 13 2008 While writing a program I have seen it's quite slow (about 35-40 seconds...
- Vladimir Panteleev (10/11) Mar 14 2008 Splitting a huge block of text by whitespace will cause an array that's ...
- Vladimir Panteleev (9/10) Mar 14 2008 ...the "mark" part of the scan.
- bearophile (8/11) Mar 14 2008 Note that in practical situations I don't use a split() on such big chun...
- Vladimir Panteleev (26/32) Mar 14 2008 Yes, just apply the patch (to src/phobos/internal/gc/gcx.d) and rebuild ...
- Walter Bright (2/4) Mar 14 2008 What exactly does the patch do?
- Vladimir Panteleev (16/20) Mar 14 2008 In gcx.mark(), it remembers the last pointer it examined that pointed to...
- Walter Bright (1/1) Mar 14 2008 Thanks, that makes sense.
- bearophile (34/37) Mar 14 2008 I'm sure for most of you doing this is really easy, but for me it wasn't...
- bearophile (4/5) Mar 14 2008 But watch the RAM used too, to me it seems to use more...
- Vladimir Panteleev (5/8) Mar 14 2008 Odd, it shouldn't. The only memory overhead is one DWORD on the stack.
While writing a program I have seen it's quite slow (about 35-40 seconds to run a program that requires less than 8 seconds with Python), so I have progressively simplified it down to the following D code, which shows a similar problem:

import std.file, std.string, std.gc;
import std.stdio: putr = writefln;
static import std.c.time;

double clock() {
    auto t = std.c.time.clock();
    return t / cast(double)std.c.time.CLOCKS_PER_SEC;
}

void main() {
    //disable;
    auto t0 = clock();
    auto txt = cast(string)read("text.txt"); // 6.3 MB of text
    auto t1 = clock();
    auto words = txt.split();
    auto t2 = clock();
    putr("loading time: ", t1 - t0);   // 0.08 s
    putr("splitting time: ", t2 - t1); // 3.69 s with GC, 0.66 s without
    // Total running time with GC = 10.85 s
}

As you can see, the splitting time is about 3.7 seconds, but the program then needs about 7 more seconds to deallocate the memory used. I have written the same program in Python, not using Psyco, not disabling the GC, not doing anything fancy:

from time import clock
t0 = clock()
txt = file("text.txt").read()  # 6.3 MB of text
t1 = clock()
words = txt.split()
t2 = clock()

Splitting is twice as fast, but that's not the most important thing to see: when the program is finished it needs only about 1.1 seconds to deallocate its memory and terminate. It seems there is some problem in the D GC (in my original D program, out of a total running time of about 40 seconds, 17 are spent in the final memory deallocation, while the same program in Python needs less than 2 seconds for the final memory + virtual machine deallocation).

Tests performed on Win2K, 256 MB RAM, Pentium 3 at 500 MHz, with Python 2.5.2 and DMD 1.028. My timings are approximate, but they don't need more than 0.2 s resolution to show the situation.

Bye,
bearophile
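[Editor's note: bearophile's snippet uses the Python 2 `file()` and `time.clock()` APIs. A minimal Python 3 sketch of the same split benchmark, using a synthetic in-memory text instead of his 6.3 MB `text.txt` (the word pattern and sizes here are assumptions, not his data), might look like this:]

```python
import time

# Synthetic stand-in for text.txt: 100,000 short whitespace-separated words.
text = " ".join("word%d" % i for i in range(100_000))

t0 = time.perf_counter()
words = text.split()  # allocates one string object per word
t1 = time.perf_counter()

print("words:", len(words))
print("splitting time: %.3f s" % (t1 - t0))
```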
Mar 13 2008
On Fri, 14 Mar 2008 04:38:23 +0200, bearophile <bearophileHUGS lycos.com> wrote:

> While writing a program I have seen it's quite slow (about 35-40 seconds to run a program that requires less than 8 seconds to run with Python), I have progressively simplified it, to the following D code, able to show a similar problem:

Splitting a huge block of text by whitespace will create an array that's comparable in size to the original text (a slice is 8 bytes, so if words are 6-7 characters long on average, the sizes would be about the same). The slowdown appears because the GC needs to look up every one of those pointers, all of which point into the same large memory region.

I made a patch for Tango's GC which caches the last page during the "mark" part of the scan. This reduces the execution time of your program tenfold for me, but it will only help when there are many consecutive pointers to the same GC page which is part of a big (>2048 bytes) allocation. There should be no noticeable performance penalty in the general case (the overhead is just one compare and sometimes an assignment), so I've created an enhancement ticket:

http://dsource.org/projects/tango/ticket/982

Interestingly, when I first compiled your program with Tango's runtime, it ran a lot slower. I found the bottleneck, then created a ticket with an explanation and patch here:

http://dsource.org/projects/tango/ticket/981

I also have a patch for Phobos (for expanding arrays, not the GC), but it doesn't yield a big improvement (about 13% faster).

--
Best regards,
Vladimir                          mailto:thecybershadow gmail.com
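[Editor's note: Vladimir's size estimate is easy to check with a little arithmetic. On the 32-bit DMD of the time a D slice is a (pointer, length) pair, 8 bytes. The sketch below is a back-of-the-envelope model (the 7-byte average word length, separator included, is an assumption), not GC code:]

```python
# Model: a D1 (32-bit) slice is ptr (4 bytes) + length (4 bytes) = 8 bytes.
SLICE_BYTES = 8

text_bytes = 6_300_000   # the ~6.3 MB input file from the benchmark
avg_word = 7             # assumed average word length, counting the separator
n_words = text_bytes // avg_word
slice_array_bytes = n_words * SLICE_BYTES

print("words:", n_words)                                   # → 900000
print("slice array: %.1f MB" % (slice_array_bytes / 1e6))  # → 7.2 MB
# Each of those 900,000 slices is a pointer the GC must classify during "mark".
```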
Mar 14 2008
On Fri, 14 Mar 2008 09:54:22 +0200, Vladimir Panteleev <thecybershadow gmail.com> wrote:

> I made a patch for Tango's GC which caches the last page during the "mark" part of the scan.

Phobos patch: http://d.puremagic.com/issues/show_bug.cgi?id=1923

Improvement is also almost tenfold :)

--
Best regards,
Vladimir                          mailto:thecybershadow gmail.com
Mar 14 2008
Vladimir Panteleev:

You are very fast :-)

> Splitting a huge block of text by whitespace will cause an array that's comparable to the size of the original text (a slice is 8 bytes, so if words are 6-7 characters long by average, the size would be the same).

Note that in practical situations I don't use split() on such a big chunk of text. In such situations I use xsplit(), the lazy version (which you can find in my d libs, and which I hope to see added both to the Python and D string methods / string libraries), and it is "hugely" faster there :-)

> Phobos patch: http://d.puremagic.com/issues/show_bug.cgi?id=1923

I'd like to try that code; can I do it?

> Improvement is also almost tenfold :)

Sorry for the potentially stupid question: is this an improvement in the text splitting time, in the final garbage collection time when the program finishes, or is it just a blind sum of both?

General note: I think more people should compare the speed of programs written in other languages (like Python, Java, OCaml, Clean, C with GCC, Perl, etc.) to the same programs written in D, because this (often?) shows spots where D may need some improvement. I know D is still a language in development, so too much early optimization may be bad, but I think a bit of care in that regard may help anyway.

Bye and thank you,
bearophile
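[Editor's note: bearophile's xsplit() isn't shown in the thread; the sketch below is a hypothetical generator analogue of the idea, not his actual library code. Instead of materializing the full array of slices up front, a lazy splitter yields one word at a time, so no huge array of references is ever built for the GC to scan:]

```python
def xsplit(text):
    """Lazily yield whitespace-separated words, one at a time."""
    start = None
    for i, ch in enumerate(text):
        if ch.isspace():
            if start is not None:
                yield text[start:i]   # emit the word that just ended
                start = None
        elif start is None:
            start = i                 # a new word begins here
    if start is not None:
        yield text[start:]            # trailing word with no separator after it

print(list(xsplit("  one two\tthree\n")))  # → ['one', 'two', 'three']
```

Because the consumer usually processes one word and drops it, peak memory stays roughly constant regardless of the input size.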
Mar 14 2008
On Fri, 14 Mar 2008 15:00:48 +0200, bearophile <bearophileHUGS lycos.com> wrote:

>> Phobos patch: http://d.puremagic.com/issues/show_bug.cgi?id=1923
>
> I'd like to try that code; can I do it?

Yes, just apply the patch (to src/phobos/internal/gc/gcx.d) and rebuild Phobos (using the makefiles).

>> Improvement is also almost tenfold :)
>
> Sorry for the potentially stupid question: is this an improvement in the text splitting time, in the final garbage collection time when the program is finished, or is it just a blind sum of both?

Of the total running time. The slowdown happens when D starts a garbage collection cycle while split() is running. Here are the exact times for me:

== Before ==
loading time: 0.14
splitting time: 0.875
total execution time: 3.234 seconds.

== After ==
loading time: 0.063
splitting time: 0.156
total execution time: 0.358 seconds.

> General note: I think more people have to compare the speed of programs written in other languages (like Python, Java, OCaML, Clean, C with GCC, Perl, etc) to the same programs written in D, because this (often?) shows spots where D may need some improvements. I know D is a language in development still, so too much early optimization may be bad, but I think a bit of care in that regard may help anyway.

Good optimizations often require significant and complicated changes to the code. Luckily this one doesn't :)

--
Best regards,
Vladimir                          mailto:thecybershadow gmail.com
Mar 14 2008
Vladimir Panteleev wrote:

> Good optimizations often require significant and complicated changes to the code. Luckily this one doesn't :)

What exactly does the patch do?
Mar 14 2008
On Sat, 15 Mar 2008 04:48:27 +0200, Walter Bright <newshound1 digitalmars.com> wrote:

> Vladimir Panteleev wrote:
>> Good optimizations often require significant and complicated changes to the code. Luckily this one doesn't :)
>
> What exactly does the patch do?

In gcx.mark(), it remembers the last pointer it examined that pointed to a B_PAGE or B_PAGEPLUS memory page. If the next value that looks like a pointer (is >= minAddr && < maxAddr) also points to the same memory page ((lastPointer & ~(PAGESIZE-1)) == (newPointer & ~(PAGESIZE-1))), it skips looking at it, thus skipping a call to findPool, looking up the page's bin size, and testing / setting the respective bits.

This greatly optimizes the GC for scanning areas that have consecutive pointers pointing to the same block of memory, e.g. slices of a bigger array. Note that I also changed "if (p >= minAddr)" to "if (p >= minAddr && p < maxAddr)" (which should also give a speed boost when scanning objects with static byte/ubyte/char arrays or floats).

--
Best regards,
Vladimir                          mailto:thecybershadow gmail.com
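[Editor's note: the effect of this last-page cache can be modeled outside the GC. The sketch below is a simulation with made-up addresses, not the gcx.d code; it counts how many findPool-style lookups survive when consecutive scanned values fall in the same page, exactly the situation split() creates:]

```python
PAGESIZE = 4096
PAGE_MASK = ~(PAGESIZE - 1)   # clears the low 12 bits, keeping the page address

def mark(pointers):
    """Return (lookups without the cache, lookups with the last-page cache)."""
    lookups_plain = len(pointers)   # baseline: one full lookup per pointer
    lookups_cached = 0
    last_page = None
    for p in pointers:
        page = p & PAGE_MASK
        if page != last_page:       # cache miss: do the expensive lookup
            lookups_cached += 1
            last_page = page
    return lookups_plain, lookups_cached

# 4096 consecutive 8-byte slices into one big allocation: 512 slices per page,
# so the 4096 pointers span only 8 pages.
base = 0x40000000
ptrs = [base + 8 * i for i in range(4096)]
print(mark(ptrs))  # → (4096, 8)
```

With scattered pointers the cache never hits and the cost degrades gracefully to the baseline, which matches Vladimir's claim that the general case pays only one compare.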
Mar 14 2008
Vladimir Panteleev:

> Yes, just apply the patch (to src/phobos/internal/gc/gcx.d) and rebuild Phobos (using the makefiles).

I'm sure for most of you doing this is really easy, but for me it wasn't an immediate thing. I have succeeded. Phobos of DMD 1.028 for Win compiled with the patch (I'll keep them online for just a few days):
http://www.fantascienza.net/leonardo/phobos.7z
Or bigger download (use the smaller one if possible):
http://www.fantascienza.net/leonardo/phobos.zip
I suggest interested people try this patch, to find out whether it improves or worsens code that uses the GC.

> Here are the exact times for me:

My timings; the deallocation is about 9 times faster!

OLD TIMINGS, with GC:
loading time: 0.41 s
splitting time: 3.76 s
TOTAL time: 11.4 s

OLD TIMINGS, without GC:
loading time: 0.40 s
splitting time: 0.67 s
TOTAL time: 8.25 s

NEW TIMINGS, with GC:
loading time: 0.39 s
splitting time: 0.8 s
TOTAL time: 1.71 s

NEW TIMINGS, without GC:
loading time: 0.39 s
splitting time: 0.67 s
TOTAL time: 1.58 s

The generation of a long list of matches with a regex like the following one goes from 7.63 s to 3.28 s:

auto txt = cast(string)read("text.txt");
auto words = RegExp("[a-z]+", "g").match(txt);

The "bintrees" benchmark of the Shootout (with the version of the code that doesn't use manual memory management; the code is online on that site, look for my nick in it) goes, with input n = 15, from 42.1 s to 34.1 s.

> Good optimizations often require significant and complicated changes to the code. Luckily this one doesn't :)

I know things aren't easy, but:
- Various tests have shown me that the D GC is (was? With this patch it may be a bit faster in some situations) up to 10-11 times slower than the GC of the latest Java, so I presume there's room for improvement.
- Comparing the speed of the same programs in D and in other fast languages (C with GCC, OCaml, Clean, etc.) helps you *know* the spots where the performance is bad. Then you can leave things as they are, or you can try to improve them. Otherwise you simply don't know, and you can't even start thinking about solving the problems. It's like a higher-level form of code (language) profiling :-)

And by the way, now the D program I was talking about is about 15% faster than the Python version ;-)

Bye and thank you,
bearophile
Mar 14 2008
bearophile:

> My timings, the deallocation is about 9 times faster!

But watch the RAM used too; to me it seems to use more...

Bye,
bearophile
Mar 14 2008
On Sat, 15 Mar 2008 05:00:14 +0200, bearophile <bearophileHUGS lycos.com> wrote:

> bearophile:
>> My timings, the deallocation is about 9 times faster!
>
> But watch the RAM used too, to me it seems to use more...

Odd, it shouldn't. The only memory overhead is one DWORD on the stack.

--
Best regards,
Vladimir                          mailto:thecybershadow gmail.com
Mar 14 2008