digitalmars.D - AA performance again
- bearophile (Jun 09 2008)
- Steven Schveighoffer (Jun 09 2008)
- bearophile (Jun 09 2008)
- Steven Schveighoffer (Jun 09 2008)
- bearophile (Jun 09 2008)
- bearophile (Jun 09 2008)
- Steven Schveighoffer (Jun 09 2008)
- bearophile (Jun 09 2008)
- Walter Bright (Jun 09 2008)
- bearophile (Jun 09 2008)
- Steven Schveighoffer (Jun 09 2008)
- Walter Bright (Jun 09 2008)
- bearophile (Jun 09 2008)
- davidl (Jun 13 2008)
- Lars Ivar Igesund (Jun 13 2008)
- Lutger (Jun 10 2008)
- Robert Fraser (Jun 10 2008)
- Robert Fraser (Jun 10 2008)
Sometimes the simplest code is enough to show a problem. Other gentle people in the #D channel confirm those timings aren't a fluke: http://leonardo-m.livejournal.com/64284.html

The D versions eat 425 MB of RAM with 10 million longs and about 757 MB with 20 million longs (compiled with dmd).

Bye,
bearophile
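For reference, the benchmark under discussion boils down to a loop like the following. This is a minimal sketch; the exact D and Python versions appear later in the thread.

import std.conv: toUint;

void main(string[] args) {
    uint n = toUint(args[1]); // e.g. 10_000_000
    size_t[long] aa;          // the built-in AA under test
    for (uint i; i < n; ++i)
        aa[i] = 0;            // insert-only workload
}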
Jun 09 2008
Have you tried dcollections HashMap? www.dsource.org/projects/dcollections

If you use it with Tango, the memory performance should be better (I have a special allocator that requires some of Tango's GC functions).

On my box (don't have Python installed), 10m inserts took 15.3s, and the memory usage was 290MB (I think :) )

Of course, the Python numbers are much better, but the issue basically comes down to this: when you allocate more memory, the GC tries to run and free some up too often. There's not a lot I can do about that.

-Steve
Jun 09 2008
Steven Schveighoffer:
> Have you tried dcollections HashMap? www.dsource.org/projects/dcollections

Not yet, I may try them later.

> On my box (don't have Python installed), 10m inserts took 15.3s, and the
> memory usage was 290MB (I think :) )

If you want, there are "mobile" versions of Python that don't require an installation. May I ask what your CPU/clock is?

> Of course, the Python numbers are much better, but the issue basically
> comes down to this: when you allocate more memory, the GC tries to run
> and free some up too often. There's not a lot I can do about that.

I don't think it's just a matter of the memory allocator (though I have no proof of this); there are many different macro-level and subtle ways to implement a hash map in C.

Two more things:
- In my blog post I have added a third Python version that's better; it creates a true dict (AA): dict.fromkeys(xrange(20000000), 0)
- I have used D longs because Python ints are like that (Python 2.5 switches to multiprecision numbers past that range), but Python integers are objects (so they are "boxed"). Inside sets and dicts their memory usage for Python 2.5 is:
  set(int): 28.2 bytes/element
  dict(int:None): 36.2 bytes/element

Thank you for your answer, bear hugs,
bearophile
Jun 09 2008
"bearophile" wroteSteven Schveighoffer:CPU is dual-core Xeon 1.6GHz, only using one core of course. To give you an idea of the speedup, the builtin AA's take 40.5 seconds to do 10m inserts on my system.On my box (don't have python installed), 10m inserts took 15.3s, and the memory usage was 290MB (I think :) )If you want, there are "mobile" versions of Python that doesn't require an installation. I may want to ask you what's your CPU/clock.At least in D, that seems to be the bottleneck. This is why I get such a huge speedup using a custom allocator. You might find this interesting, print out the time it takes for each million inserts :)Of course, the python numbers are much better, but the issue basically comes down to when you allocate more memory, the GC tries to run and free some up too often. There's not a lot I can do about that.I don't think it's just a matter of memory allocator (but I have no proof of this), there are many different macro and subtle ways to implement an hash map in C.Two more things: - In my blog post I have added a third Python version that's better, it creates a true dict (AA): dict.fromkeys(xrange(20000000), 0) - I have used D longs because Python ints are like that (Python 2.5 uses multiprecision numbers later), but Python integers are objects (so they are "boxed"). Inside sets and dicts their memory usage for Python 2.5 is: set(int): 28.2 byte/element dict(int:None): 36.2 byte/elementYou might also want to try setting the value of each element differently. I'm not sure it will make a difference, but perhaps python is doing something clever if all the values are 0. The times you quote seem awfully fast, but the complexity of the issues is really beyond my understanding. -Steve
Jun 09 2008
Steven Schveighoffer:
> To give you an idea of the speedup, the builtin AAs take 40.5 seconds to
> do 10m inserts on my system.

I see, quite different from the ~15 s with dcollections.

> print out the time it takes for each million inserts :)

Okay, for 10 million inserts:

0.44 s
0.75 s
1.20 s
1.34 s
2.09 s
2.62 s
2.33 s
1.92 s
4.67 s
5.48 s

Total time: 24.0 s

The code I have used:

import std.conv: toUint;
import d.time: clock; // www.fantascienza.net/leonardo/so/libs_d.zip

void main() {
    size_t[long] aa;
    for (uint j; j < 10; ++j) {
        float t = clock();
        uint min = 1_000_000 * j;
        uint max = 1_000_000 * (j + 1);
        for (uint i = min; i < max; ++i)
            aa[i] = 0;
        printf("%f\n", clock() - t);
    }
}

> You might also want to try setting the value of each element differently.
> I'm not sure it will make a difference, but perhaps Python is doing
> something clever when all the values are 0.

Okay, but I think the results will not change much. The D timings show very little change:

1_000_000      0.48 s
2_000_000      1.26 s
5_000_000      5.97 s
10_000_000    24.14 s

The D code I have used:

import std.conv: toUint;

void main(string[] args) {
    uint n = toUint(args[1]);
    size_t[long] aa;
    for (uint i; i < n; ++i)
        aa[i] = i;
}

The Python timings are exactly the same:

1_000_000     0.25 s
2_000_000     0.41 s
5_000_000     0.80 s
10_000_000    1.48 s
20_000_000    2.88 s

The Python code I have used:

import sys

def main(n):
    aa = {}
    for i in xrange(n):
        aa[i] = i

import psyco; psyco.bind(main)
main(int(sys.argv[1]))

Without Psyco the timings are similar but a little higher. Disabling the garbage collector makes them a bit lower still.

> The times you quote seem awfully fast, but the complexity of the issues
> is really beyond my understanding.

Generally Python performs one or more dict accesses every time you use a variable, whether reading or writing it, so dicts must be fast. In Python the sort (Timsort) and the dicts/sets are optimized to death; they are rather old by now, and not even Raymond Hettinger seems able to improve them :-) You may want to take a look here again, there are some more questions and answers: http://leonardo-m.livejournal.com/64284.html

Bye,
bearophile
Jun 09 2008
Steven Schveighoffer:
> CPU is dual-core Xeon 1.6GHz, only using one core of course.

But my new CPU has two cores, so I hope to see data structures able to use both cores at the same time, where possible...

Bye,
bearophile
Jun 09 2008
"bearophile" wroteSteven Schveighoffer:I don't see how hash insertions could be split into multiple-core operations without locking, and I think you would then introduce a huge slowdown. At the very least, memory allocation needs to be synchronized, and since that is where the current bottleneck is, we would end up at the same place. -SteveCPU is dual-core Xeon 1.6GHz, only using one core of course.But my new CPU has two cores, so I hope to see data structures able to use both cores at the same time, where possible...
Jun 09 2008
Steven Schveighoffer:
> I don't see how hash insertions could be split into multiple-core
> operations without locking, and I think you would then introduce a huge
> slowdown.

Okay, sorry for my comment then. (But some time ago I found a difficult book about the implementation of purely functional data structures; among them there are hashes too. I don't know whether on a dual core they can be an improvement.)

Bye,
bearophile
Jun 09 2008
bearophile wrote:
> Sometimes the simplest code is enough to show a problem. [...]
> The D versions eat 425 MB with 10 million longs and about 757 MB RAM
> with dmd with 20 million longs.

I've investigated a similar issue before, and discovered it has nothing to do with AAs. What is consuming the time is that, as the memory usage grows, the garbage collector repeatedly kicks in to look for things to free. You can see this by wrapping the loop in:

   std.gc.disable();
   ... loop ...
   std.gc.enable();
Jun 09 2008
Walter Bright:
> What is consuming the time is that, as the memory usage grows, the
> garbage collector repeatedly kicks in to look for things to free. You
> can see this by wrapping the loop in:
>    std.gc.disable();
>    ... loop ...
>    std.gc.enable();

The new timings with the GC disabled are interesting:

1_000_000      0.26 s
2_000_000      0.57 s
5_000_000      2.11 s
10_000_000     8.79 s
20_000_000    29.78 s

The modified D code I have used is:

import std.conv: toUint;
import std.gc: disable, enable;

void main(string[] args) {
    uint n = toUint(args[1]);
    size_t[long] aa;
    disable;
    for (uint i; i < n; ++i)
        aa[i] = 0;
    enable;
}

------------

The original timings for D were:

100_000        0.05 s
500_000        0.20 s
1_000_000      0.47 s
2_000_000      1.25 s
5_000_000      5.93 s
10_000_000    24.34 s
20_000_000   129.8 s

The original timings for Python (version 3, the fairest) were, best of 3:

10_000_000: 1.56 s
20_000_000: 3.04 s

So even without the GC, at N = 20 million, Python is about 9.8 times faster, despite using boxed (64-bit, in this range) integers.

Thank you for your answer,
bearophile
Jun 09 2008
"bearophile" <bearophileHUGS lycos.com> wrote in message news:g2k51g$2j0j$1 digitalmars.com...Walter Bright:for further reference, disabling the GC yields 6.5 seconds to do 10m inserts on my system using dcollections' HashMap. I think it would be faster on your system since your system's AAs do much better than mine. Here is my code: import dcollections.HashMap; import tango.core.Memory; void main() { const uint N = 10_000_000; auto aa = new HashMap!(long, size_t); tango.core.Memory.GC.disable(); for(uint i; i < N; ++i) aa[i] = 0; tango.core.Memory.GC.enable(); } -SteveWhat is consuming the time is, as the memory usage grows, the garbage collector repeatedly kicks in to look for things to free. You can see this by wrapping the loop in:std.gc.disable(); ... loop ... std.gc.enable(); < The new timings disabling the GC are interesting: 1_000_000 0.26 s 2_000_000 0.57 s 5_000_000 2.11 s 10_000_000 8.79 s 20_000_000 29.78 s The modified D code I have used is: import std.conv: toUint; import std.gc: disable, enable; void main(string[] args) { uint n = toUint(args[1]); size_t[long] aa; disable; for (uint i; i < n; ++i) aa[i] = 0; enable; } ------------ The original timings for D were: 100_000 0.05 s 500_000 0.20 s 1_000_000 0.47 s 2_000_000 1.25 s 5_000_000 5.93 s 10_000_000 24.34 s 20_000_000 129.8 s The original timings for Python (version 3, the most fair) were, best of 3: 10_000_000: 1.56 s 20_000_000: 3.04 s So even without GC, with N = 20 millions, Python is about 9.8 times faster, despite using boxed (64 bit, for this range) integers. Thank you for your answer, bearophile
Jun 09 2008
bearophile wrote:
> size_t[long] aa;

Python uses 32 bit integers, so the equivalent D AA would be:

   size_t[int] aa;

D longs are 64 bits.
Jun 09 2008
Walter Bright:
> Python uses 32 bit integers, so the equivalent D AA would be:
>    size_t[int] aa;
> D longs are 64 bits.

As usual you are right and I am wrong :-) My purpose is to have a good language to use, and you can see I like D :-) (I count AAs as part of the language because they are built in.)

Python 2.5 uses 32-bit signed numbers by default, but they are objects, and they are converted to multiprecision numbers whenever necessary:

>>> a = 2147483647
>>> type(a)
<type 'int'>
>>> b = a + 1
>>> b
2147483648L
>>> type(b)
<type 'long'>

In Python 3 all integers will be longs (at the moment they are about 25-30% slower, I think, but Python 3 is still in beta).

------------------

I have then tested with uints (no GC); the new D code:

import std.conv: toUint;
import std.gc: disable, enable;

void main(string[] args) {
    uint n = toUint(args[1]);
    size_t[uint] aa;
    disable;
    for (uint i; i < n; ++i)
        aa[i] = 0;
    enable;
}

The timings seem the same:

1_000_000      0.27 s
2_000_000      0.59 s
5_000_000      2.14 s
10_000_000     8.87 s
20_000_000    29.9 s

If someone can confirm these timings...

Bye and thank you,
bearophile
Jun 09 2008
On Tue, 10 Jun 2008 06:02:00 +0800, bearophile <bearophileHUGS lycos.com> wrote:
> I have then tested with uints (no GC); the new D code:
> [...]
> The timings seem the same:
> [...]

I think the most important issue is that D's AA (Phobos) is not balanced. Tango should perform worse in this case. And this case is almost a useless showcase for testing: AA keys are seldom well organized in this way, and this way is actually the worst case for D's AA. You should first generate some random strings to a file, and read this file to test both Python's and D's AAs. It's ridiculous that Python can be any bit faster than D. The crucial bottleneck I can see is that the bucket tree is not balanced in D. Once this is fixed in Phobos, you should get the best-performing AA; no other AA will be exponentially faster than it, and if one is, the fault is in your code.

--
Using Opera's revolutionary e-mail client: http://www.opera.com/mail/
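A minimal sketch of the test davidl proposes, assuming D1 Phobos (std.random.rand, std.file, std.string.splitlines; the file name and key length here are made up): generate random keys once and write them to a file, then time inserting every line into an AA, so the D and Python programs read identical input.

import std.file: write, read;
import std.string: splitlines;
import std.random: rand;

void main() {
    // Generate N pseudo-random 16-char keys and save them once.
    const uint N = 1_000_000;
    char[] buf;
    for (uint i; i < N; ++i) {
        for (int k; k < 16; ++k)
            buf ~= cast(char)('a' + rand() % 26);
        buf ~= '\n';
    }
    write("keys.txt", buf);

    // The part to time: insert every line as an AA key.
    size_t[char[]] aa;
    foreach (line; splitlines(cast(char[])read("keys.txt")))
        aa[line] = 0;
}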
Jun 13 2008
davidl wrote:
> I think the most important issue is that D's AA (Phobos) is not balanced.
> Tango should perform worse in this case.

Curious about your argumentation here (or test results showing that this is true), if you intended to write what you did.

--
Lars Ivar Igesund
blog at http://larsivi.net
DSource, #d.tango & #D: larsivi
Dancing the Tango
Jun 13 2008
bearophile wrote:
> [...]

I have some different timings:

dmd without GC:
  real 7.51    user 5.25    sys 0.79

dmd with GC:
  real 206.53  user 167.50  sys 2.14

python version 1, not using psyco:
  real 8.39    user 5.81    sys 1.06

python version 2:
  real 4.86    user 3.15    sys 0.82

python version 3:
  real 5.74    user 3.58    sys 1.10

I used Python version 2.5.2 and dmd with the Tango runtime. Here's the D version:

import tango.core.Memory;

void main() {
    GC.disable();
    const uint N = 20_000_000;
    size_t[long] aa;
    for (uint i; i < N; ++i)
        aa[i] = 0;
}
Jun 10 2008
bearophile wrote:
> Sometimes the simplest code is enough to show a problem. [...]

If the bottleneck is the GC, then it might be interesting to see how a dictionary type that uses less memory (i.e. Judy or another tree-based implementation) compares instead of using a hash (obviously with a lot of pre-allocation). If the bottleneck is in aaA.d, then maybe that just needs to be optimized, and I'm sure at least the Tango folks have the knowledge to do that. I'm guessing they've done all they can already, though.
Jun 10 2008
Robert Fraser wrote:
> i.e. Judy or another tree-based

*trie, although tree-based would be something to look into, too.
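To make the trie idea concrete: a radix trie walks the key a few bits at a time instead of hashing, so there is no rehashing and keys with shared prefixes share nodes. Below is a hypothetical sketch keyed on uint, split into 4-bit nibbles; note that a real Judy array compresses its nodes heavily, which is where the memory savings Robert mentions would come from, while this naive version does not.

class TrieNode {
    TrieNode[16] kids;   // one child per 4-bit nibble
    size_t value;
    bool hasValue;
}

class UintTrie {
    TrieNode root;
    this() { root = new TrieNode; }

    // Supports the same t[key] = value syntax as an AA.
    void opIndexAssign(size_t v, uint key) {
        TrieNode n = root;
        for (int shift = 28; shift >= 0; shift -= 4) {
            uint nib = (key >> shift) & 0xF;  // next nibble, high bits first
            if (n.kids[nib] is null)
                n.kids[nib] = new TrieNode;
            n = n.kids[nib];
        }
        n.value = v;
        n.hasValue = true;
    }
}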
Jun 10 2008