digitalmars.D - Message passing between threads: Java 4 times faster than D
- Nicolae Mihalache (10/10) Feb 09 2012 Hello,
- Marco Leise (8/19) Feb 09 2012 I cannot give you an explanation, just want to say that a message in
- Alex_Dovhal (5/15) Feb 09 2012 Hi, I downloaded your two programs, I didn't run them but noticed that i...
- Alex_Dovhal (2/2) Feb 09 2012 Sorry, my mistake. It's strange to have different 'n', but you measure s...
- Gor Gyolchanyan (11/13) Feb 09 2012 Generally, D's message passing is implemented in quite easy-to-use
- Andrei Alexandrescu (12/19) Feb 09 2012 cc Sean Kelly
- Marco Leise (3/22) Feb 09 2012 Well, what does +1 Variant and +1 LinkedListNode sum up to?
- Andrei Alexandrescu (3/30) Feb 09 2012 Sorry, I don't understand...
- Marco Leise (7/18) Feb 09 2012 There are at least 2 allocations, one for the Variant and one for the ne...
- bearophile (5/7) Feb 09 2012 Maybe this is able to help Sean and similar situations:
- Sean Kelly (3/10) Feb 09 2012 This would be handy. I don't always think to check the asm dump when =
- Andrei Alexandrescu (5/25) Feb 09 2012 I understand. The good news is, this looks like low-hanging fruit! I'll
- Sean Kelly (182/200) Feb 09 2012
- Sean Kelly (9/28) Feb 09 2012 So a queue per message type? How would ordering be preserved? Also, how...
- dsimcha (7/43) Feb 09 2012 I wonder how much it helps to just optimize the GC a little. How
- Brad Anderson (21/63) Feb 09 2012 dmd 2.057:
- Marco Leise (14/20) Feb 09 2012 I did some OProfile-ing. The full report is attached, but for simplicity...
- Sean Kelly (18/29) Feb 09 2012 much does the performance gap close when you use DMD 2.058 beta instead ...
- Timon Gehr (4/19) Feb 09 2012 You can mark an entire tuple as scope without trouble:
- Oliver Plow (5/9) Feb 10 2012 Is there a way to "turn off" the GC, e.g. a compiler switch to set the h...
- Jacob Carlborg (6/12) Feb 10 2012 There's a stub implementation of the GC which uses malloc, IIRC. Try
- Artur Skawina (87/93) Feb 10 2012 Calling GC.disable() at runtime will delay the GC until it's actually
- Sean Kelly (7/19) Feb 10 2012 GC.disable and GC.reserve are applicable. I tested with these and they d...
- Graham St Jack (12/32) Feb 09 2012 I suggest using a template-generated type that can contain any of the
- Martin Nowak (11/18) Feb 09 2012 I didn't yet get around to polishing my lock-free SList/DList implementatio...
- deadalnix (2/15) Feb 10 2012 Doesn't this require shared to be working ?
- Martin Nowak (10/29) Feb 10 2012
- Sean Kelly (15/23) Feb 09 2012 how would this work for interprocess messaging? An array-based queue is...
- David Nadlinger (4/9) Feb 09 2012 And the neat thing is that you don't have to worry about node deletion
- Sean Kelly (17/21) Feb 09 2012 deterministically.
- Sean Kelly (8/15) Feb 10 2012 you need to block.
- Nicolae Mihalache (8/27) Feb 09 2012 That would be funny but it's not true. I tested with different values,
- mw (8/18) Jun 11 2020 I just tried the attached program, D is still 3~4 times slower
Hello, I'm a complete newbie in D and trying to compare with Java. I implemented a simple test for measuring the throughput in message passing between threads. I see that Java can pass about 4mil messages/sec while D only achieves 1mil/sec. I thought that D should be faster. The messages are simply integers (which are converted to Integer in Java). The two programs are attached. I tried compiling the D version with both dmd and gdc and various optimization flags. mache
Feb 09 2012
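The attached mp.d is not reproduced in this archive, so the following is only a hedged sketch of what such a std.concurrency throughput test looks like; the function names, the message count n, and the output format are assumptions, not the original code.

```d
import std.concurrency;
import std.datetime.stopwatch : AutoStart, StopWatch;
import std.stdio;

enum n = 1_000_000; // the original tests used 10M (Java) / 100M (D) messages

// Receiving thread: sum up n ints, then report the total back to the owner.
void counter(Tid owner)
{
    long sum = 0;
    foreach (i; 0 .. n)
        sum += receiveOnly!int();
    send(owner, sum);
}

void main()
{
    auto sw = StopWatch(AutoStart.yes);
    auto tid = spawn(&counter, thisTid);
    foreach (int i; 0 .. n)
        send(tid, i);           // each send wraps the int in a message
    immutable sum = receiveOnly!long();
    sw.stop();
    immutable msec = sw.peek.total!"msecs";
    writefln("received %s messages in %s msec sum=%s speed=%s msg/sec",
             n, msec, sum, msec ? 1000L * n / msec : n);
}
```

The expected sum is n*(n-1)/2, which matches the sum=4999999950000000 reported later in the thread for n=100_000_000.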
On 09.02.2012 10:06, Nicolae Mihalache <xpromache gmail.com> wrote:

Hello, I'm a complete newbie in D and trying to compare with Java. I implemented a simple test for measuring the throughput in message passing between threads. I see that Java can pass about 4mil messages/sec while D only achieves 1mil/sec. I thought that D should be faster. The messages are simply integers (which are converted to Integer in Java). The two programs are attached. I tried compiling the D version with both dmd and gdc and various optimization flags. mache

I cannot give you an explanation, just want to say that a message in std.concurrency also uses a wrapper (a 'Variant') plus a type field (standard, priority, linkDead). So you effectively have no optimization for int, but the same situation as in Java. The second thing I notice is that std.concurrency uses a doubly linked list implementation, while you use an array in the Java version, which needs no additional node allocations.
Feb 09 2012
"Nicolae Mihalache" <xpromache gmail.com> wrote:

Hello, I'm a complete newbie in D and trying to compare with Java. I implemented a simple test for measuring the throughput in message passing between threads. I see that Java can pass about 4mil messages/sec while D only achieves 1mil/sec. I thought that D should be faster. The messages are simply integers (which are converted to Integer in Java). The two programs are attached. I tried compiling the D version with both dmd and gdc and various optimization flags. mache

Hi, I downloaded your two programs, I didn't run them but noticed that in 'mp.d' you have n set to 100_000_000, while in 'ThroughputMpTest.java' n is set to 10_000_000, so with this D code is 10/4 = 2.5 times faster :)
Feb 09 2012
Sorry, my mistake. It's strange to have different 'n', but you measure speed as 1000*n/time, so it doesn't matter if n is 10 times bigger.
Feb 09 2012
Generally, D's message passing is implemented in quite an easy-to-use way, but it is far from being fast. I dislike the Variant structure, because it adds a huge overhead. I'd rather have a templated message passing system with a type-safe message queue, so no Variant is necessary. In specific cases messages can be polymorphic objects. This will be way faster than Variant.

On Thu, Feb 9, 2012 at 3:12 PM, Alex_Dovhal <alex_dovhal yahoo.com> wrote:

Sorry, my mistake. It's strange to have different 'n', but you measure speed as 1000*n/time, so it doesn't matter if n is 10 times bigger.

-- Bye, Gor Gyolchanyan.
Feb 09 2012
On 2/9/12 6:10 AM, Gor Gyolchanyan wrote:

Generally, D's message passing is implemented in quite an easy-to-use way, but far from being fast. I dislike the Variant structure, because it adds a huge overhead. I'd rather have a templated message passing system with a type-safe message queue, so no Variant is necessary. In specific cases messages can be polymorphic objects. This will be way faster than Variant.

cc Sean Kelly

I haven't looked at the implementation, but one possible liability is that large messages don't fit in a Variant and must use dynamic allocation under the wraps. There are a number of ways to avoid that, such as parallel arrays (one array per type for data and one for the additional tags). We must make the message passing subsystem not use any memory allocation in the quiescent state. If we're doing one allocation per message passed, that might explain the 4x performance difference (I have no trouble figuring Java's allocator is this much faster than D's).

Andrei
Feb 09 2012
On 09.02.2012 18:35, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:

I haven't looked at the implementation, but one possible liability is that large messages don't fit in a Variant and must use dynamic allocation under the wraps. There are a number of ways to avoid that, such as parallel arrays (one array per type for data and one for the additional tags). We must make the message passing subsystem not use any memory allocation in the quiescent state. If we're doing one allocation per message passed, that might explain the 4x performance difference (I have no trouble figuring Java's allocator is this much faster than D's).

Well, what does +1 Variant and +1 LinkedListNode sum up to?
Feb 09 2012
On 2/9/12 10:31 AM, Marco Leise wrote:

Well, what does +1 Variant and +1 LinkedListNode sum up to?

Sorry, I don't understand...

Andrei
Feb 09 2012
On 09.02.2012 20:35, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote:

Sorry, I don't understand...

There are at least 2 allocations: one for the Variant and one for the new node in the linked list, aka the message box. But from what you wrote it sounds like a Variant doesn't allocate unless the contained data exceeds some internal storage. Sean found another possible allocation in the other branch of this discussion.
Feb 09 2012
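Marco's point about Variant's internal storage can be illustrated with a small sketch. This is hedged: Small and Big are hypothetical types, and the exact in-place capacity depends on the VariantN size parameter (32 bytes for std.concurrency's messages), but the documented behavior is that payloads fitting the internal buffer are stored in place while larger ones go through a heap allocation. An int message therefore shouldn't itself allocate; the per-message cost would be the list node.

```d
import std.variant;

struct Small { int x; }            // fits in Variant's in-place buffer
struct Big   { ubyte[256] data; }  // exceeds it: stored via a heap allocation

void main()
{
    Variant a = Small(42);  // no extra allocation for the payload
    Variant b = Big.init;   // payload is heap-allocated behind the scenes
    assert(a.get!Small.x == 42);
    assert(b.type is typeid(Big));
}
```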
Marco Leise:Sean found another possible allocation in the other branch of this discussion.Maybe this is able to help Sean and similar situations: http://d.puremagic.com/issues/show_bug.cgi?id=5070 Bye, bearophile
Feb 09 2012
On Feb 9, 2012, at 11:53 AM, bearophile wrote:

Marco Leise: Sean found another possible allocation in the other branch of this discussion.

Maybe this is able to help Sean and similar situations: http://d.puremagic.com/issues/show_bug.cgi?id=5070

This would be handy. I don't always think to check the asm dump when I'm working with delegates.
Feb 09 2012
On 2/9/12 11:49 AM, Marco Leise wrote:

There are at least 2 allocations, one for the Variant and one for the new node in the linked list aka message box. But from what you wrote it sounds like a Variant doesn't allocate unless the contained data exceeds some internal storage. Sean found another possible allocation in the other branch of this discussion.

I understand. The good news is, this looks like low-hanging fruit! I'll keep an eye on pull requests in druntime. Thanks to fellow Romanian Nicolae Mihalache for contributing the comparison.

Andrei
Feb 09 2012
On Feb 9, 2012, at 10:31 AM, Marco Leise wrote:

On 09.02.2012 18:35, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> wrote: I haven't looked at the implementation, but one possible liability is that large messages don't fit in a Variant and must use dynamic allocation under the wraps. There are a number of ways to avoid that, such as parallel arrays (one array per type for data and one for the additional tags). We must make the message passing subsystem not use any memory allocation in the quiescent state. If we're doing one allocation per message passed, that might explain the 4x performance difference (I have no trouble figuring Java's allocator is this much faster than D's).

Well, what does +1 Variant and +1 LinkedListNode sum up to?

FWIW, you can use DMD's built-in profiler so long as the receiving thread is the same as the sending thread:

import std.concurrency;

void main() {
    for (int i = 0; i < 1_000_000; i++) {
        send(thisTid, 12345);
        auto x = receiveOnly!int();
    }
}

I generated timings for this both before and after adding "scope" to mbox.get():

$ dmd -release -inline -O abc
$ time abc

real  0m0.831s
user  0m0.829s
sys   0m0.002s

… add "scope" to mbox.get()

$ dmd -release -inline -O abc
$ time abc

real  0m0.653s
user  0m0.649s
sys   0m0.003s

And here's the trace log after "scope" was added. Notice that there were 61 calls to Gcx.fullcollect(). We can also see that there was 1 allocation per send/receive operation, so only an alloc for the message list node.
$ dmd -O -release -profile abc
gladsheim:misc sean$ time abc

real  0m11.348s
user  0m11.331s
sys   0m0.015s

======== Timer Is 3579545 Ticks/Sec, Times are in Microsecs ========

  Num          Tree        Func        Per
  Calls        Time        Time        Call

1000000  437709765  220179413       220  void std.concurrency._send!(int)._send(std.concurrency.MsgType, std.concurrency.Tid, int)
1000000  300987757  140736393       140  bool std.concurrency.MessageBox.get!(nothrow @safe void delegate(int), pure @safe void function(std.concurrency.LinkTerminated)*, pure @safe void function(std.concurrency.OwnerTerminated)*, pure @safe void function(std.variant.VariantN!(32u).VariantN)*).get(scope nothrow @safe void delegate(int), scope pure @safe void function(std.concurrency.LinkTerminated)*, scope pure @safe void function(std.concurrency.OwnerTerminated)*, scope pure @safe void function(std.variant.VariantN!(32u).VariantN)*)
1000000  202131609   89479808        89  void* gc.gcx.GC.malloc(uint, uint, uint*)
      1  825045422   57556501  57556501  _Dmain
1000033  112651800   52026745        52  void* gc.gcx.GC.mallocNoSync(uint, uint, uint*)
     61   53422342   49606106    813214  uint gc.gcx.Gcx.fullcollect(void*)
2000000  160103753   42531732        21  bool std.concurrency.MessageBox.get!(nothrow @safe void delegate(int), pure @safe void function(std.concurrency.LinkTerminated)*, pure @safe void function(std.concurrency.OwnerTerminated)*, pure @safe void function(std.variant.VariantN!(32u).VariantN)*).get(scope nothrow @safe void delegate(int), scope pure @safe void function(std.concurrency.LinkTerminated)*, scope pure @safe void function(std.concurrency.OwnerTerminated)*, scope pure @safe void function(std.variant.VariantN!(32u).VariantN)*).bool scan(ref std.concurrency.List!(std.concurrency.Message).List)
2000000   42018612   39837170        19  int std.variant.VariantN!(32u).VariantN.handler!(int).handler(std.variant.VariantN!(32u).VariantN.OpID, ubyte[32]*, void*)
1000000  117572021   24641771        24  bool std.concurrency.MessageBox.get!(nothrow @safe void delegate(int), pure @safe void function(std.concurrency.LinkTerminated)*, pure @safe void function(std.concurrency.OwnerTerminated)*, pure @safe void function(std.variant.VariantN!(32u).VariantN)*).get(scope nothrow @safe void delegate(int), scope pure @safe void function(std.concurrency.LinkTerminated)*, scope pure @safe void function(std.concurrency.OwnerTerminated)*, scope pure @safe void function(std.variant.VariantN!(32u).VariantN)*).bool onStandardMsg(ref std.concurrency.Message)
1000000   47280794   20418675        20  void std.concurrency.Message.map!(nothrow @safe void delegate(int)).map(nothrow @safe void delegate(int))
1000000  316556767   15569009        15  int std.concurrency.receiveOnly!(int).receiveOnly()
1000000   36317362   13212905        13  @property bool std.variant.VariantN!(32u).VariantN.convertsTo!(int).convertsTo()
1000000   15445906   10879089        10  std.concurrency.Message std.concurrency.Message.__ctor!(int).__ctor(std.concurrency.MsgType, int)
1000000   45649454    9332092         9  @property bool std.concurrency.Message.convertsTo!(int).convertsTo()
1000000   26790032    7875877         7  @property int std.variant.VariantN!(32u).VariantN.get!(int).get()
1000000  444757778    7048013         7  void std.concurrency._send!(int)._send(std.concurrency.Tid, int)
  15512    6681162    6657279       429  int gc.gcx.Gcx.allocPage(ubyte)
1000000  450932153    6174374         6  void std.concurrency.send!(int).send(std.concurrency.Tid, int)
1000000    4566817    4566817         4  std.variant.VariantN!(32u).VariantN std.variant.VariantN!(32u).VariantN.opAssign!(int).opAssign(int)
   4087    3635735    3518353       860  void gc.gcx.Gcx.mark(void*, void*, int)
2000000    2069965    2069965         1  int std.variant.VariantN!(32u).VariantN.handler!(int).handler(std.variant.VariantN!(32u).VariantN.OpID, ubyte[32]*, void*).bool tryPutting(int*, TypeInfo, void*)
2990271     195287     195287         0  void* gc.gcx.sentinel_add(void*)
1000000     147610     147610         0  bool std.concurrency.MessageBox.get!(nothrow @safe void delegate(int), pure @safe void function(std.concurrency.LinkTerminated)*, pure @safe void function(std.concurrency.OwnerTerminated)*, pure @safe void function(std.variant.VariantN!(32u).VariantN)*).get(scope nothrow @safe void delegate(int), scope pure @safe void function(std.concurrency.LinkTerminated)*, scope pure @safe void function(std.concurrency.OwnerTerminated)*, scope pure @safe void function(std.variant.VariantN!(32u).VariantN)*).bool pty(ref std.concurrency.List!(std.concurrency.Message).List)
1000032     121459     121459         0  void gc.gcx.Gcx.setBits(gc.gcx.Pool*, uint, uint)
2000000     111475     111475         0  int std.variant.VariantN!(32u).VariantN.handler!(int).handler(std.variant.VariantN!(32u).VariantN.OpID, ubyte[32]*, void*).int* getPtr(void*)
1000033     102413     102413         0  ubyte gc.gcx.Gcx.findBin(uint)
1002228      94742      94742         0  @property uint gc.gcx.Pool.shiftBy()
1000000      72086      72086         0  int std.concurrency.receiveOnly!(int).receiveOnly().nothrow @safe void __lambda17(int)
 995119      70854      70854         0  void gc.gcx.Gcx.log_free(void*)
1000033      61505      61505         0  void gc.gcx.sentinel_init(void*, uint)
1000033      55070      55070         0  void gc.gcx.Gcx.log_malloc(void*, uint)
 995119      51748      51748         0  void gc.gcx.sentinel_Invariant(const(void*))
      1      24692      24692     24692  void gc.gcx.Pool.initialize(uint, bool)
   3538    3544517      24525         6  void gc.gcx.Gcx.mark(void*, void*)
  15511      23883      23522         1  uint gc.gcx.Pool.allocPages(uint)
 124447      11615      11615         0  void gc.gcx.Gcx.clrBitsSmallSweep(gc.gcx.Pool*, uint, uint)
      1       7051       5921      5921  void gc.gcx.GC.initialize()
      1      27490       2797      2797  gc.gcx.Pool* gc.gcx.Gcx.newPool(uint, bool)
     61   53424783       2441        40  uint gc.gcx.Gcx.fullcollectshell()
     55       2227       2227        40  void gc.gcx.Gcx.addRange(void*, void*)
      1       1129       1119      1119  void gc.gcx.Gcx.initialize()
     16        360        360        22  uint gc.gcx.Pool.extendPages(uint)
      1       2547        319       319  void gc.gcx.GC.addRange(void*, uint)
   2196        278        278         0  gc.gcx.Pool* gc.gcx.Gcx.findPool(void*)
      1         65         65        65  void gc.gcx.GC.disable()
      1          9          9         9  void gc.gcx.Gcx.log_init()
      1          5          5         5  void gc.gcx.GC.enable()
      1          0          0         0  void gc.gcx.GC.setStackBottom(void*)
      1          0          0         0  @property uint gc.gcx.Pool.divisor()
Feb 09 2012
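For readers following along, the reason "scope" helps in the timings above: a scope delegate parameter promises the callee won't store the delegate, so the compiler may keep the closure on the caller's stack instead of allocating it on the GC heap. A minimal sketch of the difference (the function names here are illustrative, not std.concurrency code):

```d
// Without scope, the compiler must assume `dg` could escape the call,
// so a closure over caller variables may be GC-heap-allocated.
void eachEscaping(void delegate(int) dg)
{
    foreach (i; 0 .. 4) dg(i);
}

// With scope, the delegate is promised not to escape, so its closure
// can live on the caller's stack: no GC allocation per call.
void eachScoped(scope void delegate(int) dg)
{
    foreach (i; 0 .. 4) dg(i);
}

void main()
{
    int sum;
    eachScoped((int x) { sum += x; }); // closure over `sum`, stack-friendly
    assert(sum == 0 + 1 + 2 + 3);
}
```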
So a queue per message type? How would ordering be preserved? Also, how would this work for interprocess messaging? An array-based queue is an option however (though it would mean memmoves on receive), as are free-lists for nodes, etc. I guess the easiest thing there would be a lock-free shared slist for the node free-list, though I couldn't weigh the chance of cache misses from using old memory blocks vs. just expecting the allocator to be fast.

On Feb 9, 2012, at 6:10 AM, Gor Gyolchanyan <gor.f.gyolchanyan gmail.com> wrote:

Generally, D's message passing is implemented in quite an easy-to-use way, but far from being fast. I dislike the Variant structure, because it adds a huge overhead. I'd rather have a templated message passing system with a type-safe message queue, so no Variant is necessary. In specific cases messages can be polymorphic objects. This will be way faster than Variant.

On Thu, Feb 9, 2012 at 3:12 PM, Alex_Dovhal <alex_dovhal yahoo.com> wrote: Sorry, my mistake. It's strange to have different 'n', but you measure speed as 1000*n/time, so it doesn't matter if n is 10 times bigger.

-- Bye, Gor Gyolchanyan.
Feb 09 2012
I wonder how much it helps to just optimize the GC a little. How much does the performance gap close when you use DMD 2.058 beta instead of 2.057? This upcoming release has several new garbage collector optimizations. If the GC is the bottleneck, then it's not surprising that anything that relies heavily on it is slow because D's GC is still fairly naive. On Thursday, 9 February 2012 at 15:44:59 UTC, Sean Kelly wrote:So a queue per message type? How would ordering be preserved? Also, how would this work for interprocess messaging? An array-based queue is an option however (though it would mean memmoves on receive), as are free-lists for nodes, etc. I guess the easiest thing there would be a lock-free shared slist for the node free-list, though I couldn't weigh the chance of cache misses from using old memory blocks vs. just expecting the allocator to be fast. On Feb 9, 2012, at 6:10 AM, Gor Gyolchanyan <gor.f.gyolchanyan gmail.com> wrote:Generally, D's message passing is implemented in quite easy-to-use way, but far from being fast. I dislike the Variant structure, because it adds a huge overhead. I'd rather have a templated message passing system with type-safe message queue, so no Variant is necessary. In specific cases Messages can be polymorphic objects. This will be way faster, then Variant. On Thu, Feb 9, 2012 at 3:12 PM, Alex Dovhal <alex dovhal yahoo.com> wrote:Sorry, my mistake. It's strange to have different 'n', but you measure speed as 1000*n/time, so it's doesn't matter if n is 10 times bigger.-- Bye, Gor Gyolchanyan.
Feb 09 2012
On Thu, Feb 9, 2012 at 9:22 AM, dsimcha <dsimcha yahoo.com> wrote:

I wonder how much it helps to just optimize the GC a little. How much does the performance gap close when you use DMD 2.058 beta instead of 2.057? This upcoming release has several new garbage collector optimizations. If the GC is the bottleneck, then it's not surprising that anything that relies heavily on it is slow because D's GC is still fairly naive.

dmd 2.057:
received 100000000 messages in 192034 msec sum=4999999950000000 speed=520741 msg/sec
received 100000000 messages in 84118 msec sum=4999999950000000 speed=1188806 msg/sec
received 100000000 messages in 88274 msec sum=4999999950000000 speed=1132836 msg/sec

dmd 2.058 beta:
received 100000000 messages in 93539 msec sum=4999999950000000 speed=1069072 msg/sec
received 100000000 messages in 96422 msec sum=4999999950000000 speed=1037107 msg/sec
received 100000000 messages in 203961 msec sum=4999999950000000 speed=490289 msg/sec

Both versions would inexplicably run at approximately half the speed sometimes. I have no idea what is up with that. I have no Java development environment to test for comparison. This machine has 4 cores and is running Windows.

Regards,
Brad Anderson
Feb 09 2012
On 09.02.2012 17:22, dsimcha <dsimcha yahoo.com> wrote:

I wonder how much it helps to just optimize the GC a little. How much does the performance gap close when you use DMD 2.058 beta instead of 2.057? This upcoming release has several new garbage collector optimizations. If the GC is the bottleneck, then it's not surprising that anything that relies heavily on it is slow because D's GC is still fairly naive.

I did some OProfile-ing. The full report is attached, but for simplicity it is without call graph this time. Here is an excerpt:

CPU: Core 2, speed 2001 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000

samples    %       linenr info   symbol name
13838   18.8416   gcx.d:426     void* gc.gcx.GC.malloc(ulong, uint, ulong*)
 4465    6.0795   gcx.d:2454    ulong gc.gcx.Gcx.fullcollect(void*)
...

Compiled with: gcc version 4.6.2 20111026 (gdc 0.31 - r751:34491c2e7bb4, using dmd 2.057) (GCC)
Feb 09 2012
On Feb 9, 2012, at 10:14 AM, Marco Leise wrote:

I did some OProfile-ing. The full report is attached, but for simplicity it is without call graph this time. Here is an excerpt:

CPU: Core 2, speed 2001 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000

samples    %       linenr info   symbol name
13838   18.8416   gcx.d:426     void* gc.gcx.GC.malloc(ulong, uint, ulong*)
 4465    6.0795   gcx.d:2454    ulong gc.gcx.Gcx.fullcollect(void*)

One random thing that just occurred to me… if the standard receive pattern is:

receive((int x) { … });

There's a good chance that a stack frame is being dynamically allocated for the delegate when it's passed to receive (since I don't believe there's any way to declare the parameters to receive as "scope"). I'll have to check this, and maybe consider changing receive to use alias template parameters instead of normal function parameters?
Feb 09 2012
On 02/09/2012 08:27 PM, Sean Kelly wrote:On Feb 9, 2012, at 10:14 AM, Marco Leise wrote:You can mark an entire tuple as scope without trouble: void foo(T,S...)(T arg1, scope S args) {...} Does this improve the run time?Am 09.02.2012, 17:22 Uhr, schrieb dsimcha<dsimcha yahoo.com>:One random thing that just occurred to me… if the standard receive pattern is: receive((int x) { … }); There's a good chance that a stack frame is being dynamically allocated for the delegate when it's passed to receive (since I don't believe there's any way to declare the parameters to receive as "scope"). I'll have to check this, and maybe consider changing receive to use alias template parameters instead of normal function parameters?I wonder how much it helps to just optimize the GC a little. How much does the performance gap close when you use DMD 2.058 beta instead of 2.057? This upcoming release has several new garbage collector optimizations. If the GC is the bottleneck, then it's not surprising that anything that relies heavily on it is slow because D's GC is still fairly naive.I did some OProfile-ing. The full report is attached, but for simplicity it is without call graph this time. Here is an excerpt: CPU: Core 2, speed 2001 MHz (estimated) Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000 samples % linenr info symbol name 13838 18.8416 gcx.d:426 void* gc.gcx.GC.malloc(ulong, uint, ulong*) 4465 6.0795 gcx.d:2454 ulong gc.gcx.Gcx.fullcollect(void*)
Feb 09 2012
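Timon's trick generalizes: an entire variadic handler pack can be marked scope, so none of the handler closures need heap allocation. A hedged sketch of the idea in a hypothetical dispatch helper (this is not the actual receive implementation):

```d
// The whole variadic parameter pack `handlers` is scope, so closures
// passed in can stay on the caller's stack.
void dispatch(T, Handlers...)(T value, scope Handlers handlers)
{
    foreach (h; handlers)
    {
        // Only invoke handlers whose parameter type matches `value`.
        static if (is(typeof(h(value))))
            h(value);
    }
}

void main()
{
    int seen;
    dispatch(5, (int x) { seen = x; }, (string s) { /* ignored for int */ });
    assert(seen == 5);
}
```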
Is there a way to "turn off" the GC, e.g. a compiler switch to set the heap size to a large number so that the GC is likely not to set in? I searched through this page: http://www.d-programming-language.org/dmd-windows.html#switches but couldn't find anything helpful. Then you could measure the thing with GC "turned off" to see whether the GC is the problem or not.

-- Oliver
Feb 10 2012
On 2012-02-10 14:54, Oliver Plow wrote:There's a stub implementation of the GC which uses malloc, IIRC. Try using that one at link time. https://github.com/D-Programming-Language/druntime/blob/master/src/gcstub/gc.d -- /Jacob CarlborgIs there a way to "turn off" the GC, e.g. a compiler switch to set the heap size to a large number so that the GC is likely not to set in? I searched through this page: http://www.d-programming-language.org/dmd-w ndows.html#switches But couldn't find anything helpful. Then you could measure the thing with GC "turned off" to see whether the GC is the problem or not. -- OliverI wonder how much it helps to just optimize the GC a little. How much does the performance gap close when you use DMD 2.058 beta instead of 2.057? This upcoming release has several new garbage collector optimizations. If the GC is the bottleneck, then it's not surprising
Feb 10 2012
On 02/10/12 14:54, Oliver Plow wrote:Calling GC.disable() at runtime will delay the GC until it's actually needed, but won't disable it completely. Having a std noop GC stub selected by a switch would be nice, but you can get the same effect by giving the linker an object that has the necessary stubs. For this test case something like the patch below improves things significantly, for more gains, std.concurrency needs more invasive changes. Note it's just POC, to measure the current std.concurrency efficiency vs other approaches. The "freelist" arrays are not freed and leak, a complete implementation would free them when the link is shut down. Ignore the synchronize() calls - "synchronized" isn't properly lowered by the compiler, so i had to resort to this after switching the locking primitives. It should work with std "synchronized" equally well. The original testcase from this thread achieves ~4M msg/sec with this change (the numbers aren't stable, but mostly in the 3.5..4.0M range, 4.5M+ happens sometimes). The memory usage also decreases noticeably. 
artur

--- std/concurrency.d
+++ std/concurrency.d
@@ -1387,7 +1396,7 @@ private
                 m_last = n;
             Node* todelete = n.next;
             n.next = n.next.next;
-            //delete todelete;
+            delete todelete;
             m_count--;
         }
@@ -1430,6 +1439,56 @@ private
         {
             val = v;
         }
+        import core.memory;
+        import core.exception;
+        new(size_t size) {
+            void* p;
+            if (afreelist.length)
+                p = afreelist[--afreelist.length];
+            else if (gfreelist.length) {
+                {
+                    scope lock = synchronize(fl);
+                    if (gfreelist.length) {
+                        afreelist = cast(Node*[])gfreelist;
+                        gfreelist.length = 0;
+                    }
+                }
+                if (afreelist.length)
+                    p = afreelist[--afreelist.length];
+            }
+
+            if (p)
+                return p;
+
+            p = std.c.stdlib.malloc(size);
+            if (!p)
+                throw new OutOfMemoryError();
+            GC.addRange(p, size);
+            return p;
+        }
+        delete(void* p) {
+            if (!p)
+                return;
+            pfreelist ~= cast(Node*)p;
+            if (pfreelist.length >= 8)
+            {
+                {
+                    scope lock = synchronize(fl);
+                    gfreelist ~= cast(shared Node*[])pfreelist;
+                }
+                pfreelist.length = 0;
+                pfreelist.assumeSafeAppend();
+            }
+            // At some point all free nodes need to be freed, using:
+            //GC.removeRange(p);
+            //std.c.stdlib.free(p);
+        }
+        static Node*[] afreelist;
+        static ubyte[56] d1;
+        static Node*[] pfreelist;
+        static ubyte[56] d2;
+        shared static Node*[] gfreelist;
+        shared static Mutex fl;
Feb 10 2012
GC.disable and GC.reserve are applicable. I tested with these and they did help but not a ton.

On Feb 10, 2012, at 5:54 AM, "Oliver Plow" <saxo123 gmx.de> wrote:
> Is there a way to "turn off" the GC, e.g. a compiler switch to set the heap size to a large number so that the GC is likely not to set in? I searched through this page: http://www.d-programming-language.org/dmd-windows.html#switches But couldn't find anything helpful. Then you could measure the thing with GC "turned off" to see whether the GC is the problem or not.
>
>> I wonder how much it helps to just optimize the GC a little. How much does the performance gap close when you use DMD 2.058 beta instead of 2.057? This upcoming release has several new garbage collector optimizations. If the GC is the bottleneck, then it's not surprising
>
> -- Oliver
Feb 10 2012
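For concreteness, a minimal sketch of the "GC turned off" measurement discussed above, using the core.memory API (the reserve size and the placement of the benchmark are illustrative, not from the original test programs):

```d
import core.memory : GC;
import std.stdio : writeln;

void main()
{
    GC.reserve(64 * 1024 * 1024); // grow the pool up front to delay collections
    GC.disable();                 // no collections run until GC.enable()
    scope(exit) GC.enable();

    // ... run the message-passing benchmark here ...
    writeln("done");
}
```

As Artur notes above, GC.disable() only delays collection; if the heap is exhausted the runtime may still collect, so this isolates but does not fully remove GC cost.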
I suggest using a template-generated type that can contain any of the messages to be sent over a channel. It is reasonably straightforward to generate all the boilerplate code necessary to make this happen. My prototype (attached) still needs work to remove linux dependencies and tighten it up, but it works ok.

Another advantage of this approach (well, I see it as an advantage) is that you declare in a single location all the messages that can be sent over the channel, and of course the messages are type-safe. The file of interest is concurrency.d.

On 10/02/12 02:14, Sean Kelly wrote:
> So a queue per message type? How would ordering be preserved? Also, how would this work for interprocess messaging? An array-based queue is an option however (though it would mean memmoves on receive), as are free-lists for nodes, etc. I guess the easiest thing there would be a lock-free shared slist for the node free-list, though I couldn't weigh the chance of cache misses from using old memory blocks vs. just expecting the allocator to be fast.
>
> On Feb 9, 2012, at 6:10 AM, Gor Gyolchanyan <gor.f.gyolchanyan gmail.com> wrote:
>> Generally, D's message passing is implemented in quite an easy-to-use way, but it is far from being fast. I dislike the Variant structure, because it adds a huge overhead. I'd rather have a templated message passing system with a type-safe message queue, so no Variant is necessary. In specific cases messages can be polymorphic objects. This will be way faster than Variant.
>>
>> On Thu, Feb 9, 2012 at 3:12 PM, Alex_Dovhal <alex_dovhal yahoo.com> wrote:
>>> Sorry, my mistake. It's strange to have different 'n', but you measure speed as 1000*n/time, so it doesn't matter if n is 10 times bigger.
>>
>> -- Bye, Gor Gyolchanyan.

-- Graham St Jack
Feb 09 2012
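The "single declared message set" idea above can be sketched as a hand-rolled tagged union; the names below (ChannelMsg, Kind) are hypothetical and not from the attached prototype:

```d
// Hypothetical sketch: one channel, a closed set of message kinds,
// type-safe and with no Variant boxing.
struct ChannelMsg
{
    enum Kind { number, text, shutdown }
    Kind kind;
    union
    {
        long   num; // valid when kind == Kind.number
        string txt; // valid when kind == Kind.text
    }

    static ChannelMsg ofNum(long v)
    {
        ChannelMsg m;
        m.kind = Kind.number;
        m.num = v;
        return m;
    }

    static ChannelMsg ofText(string s)
    {
        ChannelMsg m;
        m.kind = Kind.text;
        m.txt = s;
        return m;
    }
}
```

A full implementation would presumably generate the Kind enum, the union, and the constructors from the declared type list with template mixins, which is the boilerplate generation the post refers to.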
On Thu, 09 Feb 2012 16:44:46 +0100, Sean Kelly <sean invisibleduck.org> wrote:
> So a queue per message type? How would ordering be preserved? Also, how would this work for interprocess messaging? An array-based queue is an option however (though it would mean memmoves on receive), as are free-lists for nodes, etc. I guess the easiest thing there would be a lock-free shared slist for the node free-list, though I couldn't weigh the chance of cache misses from using old memory blocks vs. just expecting the allocator to be fast.

I didn't yet get around to polishing my lock-free SList/DList implementations, but mutexes should only become a problem with high contention, when you need to block. You'd also need some kind of blocking for lock-free lists. The best first-order optimization would be to allocate the list node deterministically. The only reason to use GC memory for them is that mallocating is still too cumbersome. Nodes are unshared, so you'd want a unique_pointer.
Feb 09 2012
On 09/02/2012 20:57, Martin Nowak wrote:
> On Thu, 09 Feb 2012 16:44:46 +0100, Sean Kelly <sean invisibleduck.org> wrote:
>> So a queue per message type? How would ordering be preserved? Also, how would this work for interprocess messaging? An array-based queue is an option however (though it would mean memmoves on receive), as are free-lists for nodes, etc. I guess the easiest thing there would be a lock-free shared slist for the node free-list, though I couldn't weigh the chance of cache misses from using old memory blocks vs. just expecting the allocator to be fast.
>
> I didn't yet get around to polishing my lock-free SList/DList implementations, but mutexes should only become a problem with high contention, when you need to block. You'd also need some kind of blocking for lock-free lists.

Doesn't this require shared to be working?
Feb 10 2012
On Fri, 10 Feb 2012 15:07:41 +0100, deadalnix <deadalnix gmail.com> wrote:
> On 09/02/2012 20:57, Martin Nowak wrote:
>> On Thu, 09 Feb 2012 16:44:46 +0100, Sean Kelly <sean invisibleduck.org> wrote:
>>> So a queue per message type? How would ordering be preserved? Also, how would this work for interprocess messaging? An array-based queue is an option however (though it would mean memmoves on receive), as are free-lists for nodes, etc. I guess the easiest thing there would be a lock-free shared slist for the node free-list, though I couldn't weigh the chance of cache misses from using old memory blocks vs. just expecting the allocator to be fast.
>>
>> I didn't yet get around to polishing my lock-free SList/DList implementations, but mutexes should only become a problem with high contention, when you need to block. You'd also need some kind of blocking for lock-free lists.
>
> Doesn't this require shared to be working?

Shared is already very helpful for writing lock-free stuff. I still need to use atomic load/store to make them portable, though. If shared added memory fences for full sequential ordering, one would rather hack around this using weaker ordering.
Feb 10 2012
On Feb 9, 2012, at 11:57 AM, Martin Nowak wrote:
>> So a queue per message type? How would ordering be preserved? Also, how would this work for interprocess messaging? An array-based queue is an option however (though it would mean memmoves on receive), as are free-lists for nodes, etc. I guess the easiest thing there would be a lock-free shared slist for the node free-list, though I couldn't weigh the chance of cache misses from using old memory blocks vs. just expecting the allocator to be fast.
>
> I didn't yet get around to polishing my lock-free SList/DList implementations, but mutexes should only become a problem with high contention, when you need to block. You'd also need some kind of blocking for lock-free lists.

No blocking should be necessary for the lock-free list. Just try to steal a node with a CAS. If the result was null (i.e. if the list ended up being empty), allocate a node via malloc/GC.

> The best first-order optimization would be to allocate the list node deterministically.

Neat idea. I think I can make that change fairly trivially.
Feb 09 2012
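The "steal a node with a CAS" step might look roughly like this with core.atomic (a sketch only; as noted later in the thread, this naive stack pop is exposed to the ABA problem):

```d
import core.atomic : atomicLoad, cas;

struct Node { Node* next; /* message payload would live here */ }

shared(Node*) freeHead; // top of the free-list (a Treiber-style stack)

Node* acquireNode()
{
    for (;;)
    {
        auto head = cast(Node*) atomicLoad(freeHead);
        if (head is null)
            return new Node; // list empty: fall back to the allocator
        // Try to swing the head to the next node; retry on contention.
        if (cas(&freeHead, cast(shared Node*) head, cast(shared Node*) head.next))
            return head;
    }
}
```

Release would be the mirror image: CAS the returned node back in as the new head, retrying until it sticks.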
On 2/9/12 11:17 PM, Sean Kelly wrote:
> On Feb 9, 2012, at 11:57 AM, Martin Nowak wrote:
>> I didn't yet get around to polishing my lock-free SList/DList implementations, but mutexes should only become a problem with high contention, when you need to block. You'd also need some kind of blocking for lock-free lists.
>
> No blocking should be necessary for the lock-free list. Just try to steal a node with a CAS. If the result was null (i.e. if the list ended up being empty), allocate a node via malloc/GC.

And the neat thing is that you don't have to worry about node deletion as much when you have a GC…

David
Feb 09 2012
On Feb 9, 2012, at 2:17 PM, Sean Kelly wrote:
>> The best first-order optimization would be to allocate the list node deterministically.
>
> Neat idea. I think I can make that change fairly trivially.

$ time abc

real    0m0.556s
user    0m0.555s
sys     0m0.001s

So another 100ms improvement. Switching to a (__gshared, no mutex) free-list that falls back on malloc yields:

$ time abc

real    0m0.505s
user    0m0.503s
sys     0m0.001s

Not as much of a gain there, and I believe we've eliminated all the allocations (though I'd have to do a pile build to verify). Still, that's approaching being twice as fast as before, which is definitely something.
Feb 09 2012
On Feb 9, 2012, at 2:17 PM, Sean Kelly wrote:
> On Feb 9, 2012, at 11:57 AM, Martin Nowak wrote:
>> I didn't yet get around to polishing my lock-free SList/DList implementations, but mutexes should only become a problem with high contention, when you need to block. You'd also need some kind of blocking for lock-free lists.
>
> No blocking should be necessary for the lock-free list. Just try to steal a node with a CAS. If the result was null (i.e. if the list ended up being empty), allocate a node via malloc/GC.

I just realized that the free-list is actually a stack, so doing this lock-free runs into the ABA problem. There goes the idea of this being an easy optimization.
Feb 10 2012
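The hazard described above can be spelled out as a timeline (comments only; the classic mitigations are version-tagged pointers or hazard pointers, neither of which appears in this thread's code):

```d
// ABA on a lock-free stack pop, step by step:
//
//   thread 1: reads head == A, reads A.next == B   // about to cas(A -> B)
//   thread 2: pops A, pops B, then pushes A back   // head is A once more
//   thread 1: cas(&head, A, B) succeeds            // but B was removed,
//                                                  // so the list is corrupt
//
// The pointer value matched, but the list it described had changed.
// Pairing the head pointer with a version counter and comparing both
// with a double-width CAS makes the recycled "A" no longer match.
```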
That would be funny but it's not true. I tested with different values, that's why I ended up uploading different versions. The programs print the computed message rate and take into account the number of messages.

mache

On Thu, Feb 9, 2012 at 11:57 AM, Alex_Dovhal <alex_dovhal yahoo.com> wrote:
> "Nicolae Mihalache" <xpromache gmail.com> wrote:
>> Hello, I'm a complete newbie in D and trying to compare with Java. I implemented a simple test for measuring the throughput in message passing between threads. I see that Java can pass about 4mil messages/sec while D only achieves 1mil/sec. I thought that D should be faster. The messages are simply integers (which are converted to Integer in Java). The two programs are attached. I tried compiling the D version with both dmd and gdc and various optimization flags. mache
>
> Hi, I downloaded your two programs, I didn't run them but noticed that in 'mp.d' you have n set to 100_000_000, while in 'ThroughputMpTest.java' n is set to 10_000_000, so with this D code is 10/4 = 2.5 times faster :)
Feb 09 2012
On Thursday, 9 February 2012 at 09:09:24 UTC, Nicolae Mihalache wrote:
> Hello, I'm a complete newbie in D and trying to compare with Java. I implemented a simple test for measuring the throughput in message passing between threads. I see that Java can pass about 4mil messages/sec while D only achieves 1mil/sec. I thought that D should be faster. The messages are simply integers (which are converted to Integer in Java). The two programs are attached. I tried compiling the D version with both dmd and gdc and various optimization flags.

I just tried the attached program; D is still 3~4 times slower than the Java version. Have there been any improvements to message passing between threads since 2012? Or is there any new way to improve the D version? Thanks.
Jun 11 2020
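For anyone wanting to re-measure without digging up the 2012 attachments, here is a minimal self-contained throughput test in the spirit of the original mp.d (the value of n and the rate formula are illustrative; the original used a larger n):

```d
import std.concurrency : spawn, send, receiveOnly, ownerTid, Tid;
import std.datetime.stopwatch : StopWatch, AutoStart;
import std.stdio : writeln;

enum n = 1_000_000;

void counter()
{
    foreach (i; 0 .. n)
        receiveOnly!int();   // drain n integer messages
    ownerTid.send(true);     // tell the sender we are done
}

void main()
{
    Tid tid = spawn(&counter);
    auto sw = StopWatch(AutoStart.yes);
    foreach (i; 0 .. n)
        tid.send(i);         // the measured hot path
    receiveOnly!bool();      // wait until everything is received
    immutable ms = sw.peek.total!"msecs";
    writeln("rate: ", ms ? n * 1000L / ms : 0, " msg/sec");
}
```

Each send still allocates a Variant-carrying list node internally, which is exactly the overhead this thread is about.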