digitalmars.D - Faster Virtual Method Dispatch
- Craig Black (25/25) Apr 23 2006 Most programmers do not realize the overhead involved with calling a met...
- kris (38/70) Apr 23 2006 I'm no expert either, Craig, yet it seems that there's more to it? An
- BCS (6/76) Apr 23 2006 I'm not following you train of thought completely but...
- BCS (14/39) Apr 23 2006 IIRC what is being searched for isn't which entry in this vtbl, but whic...
- Craig Black (10/70) Apr 23 2006 Nope you are right. Which vtable. My bad.
- Hasan Aljudy (4/36) Apr 23 2006 I'm far from being an expert .. but I thought vtable lookups involve two...
- Craig Black (8/11) Apr 23 2006 From one non-expert to another, I have no idea how many assembly languag...
- Dave (14/45) Apr 23 2006 This is a little OT because it involves removing the need for a lot of
- kris (5/59) Apr 23 2006 Yes, I recall seeing Walter write something along these lines somewhere
- Dan (30/30) Apr 24 2006 Actually, having "if statements" is absolutely TERRIBLE for trying to im...
- Craig Black (13/57) Apr 24 2006 Maybe you are just way over my head but I'm not quite following how this...
- BCS (7/21) Apr 24 2006 A bad GC will definitely make for bad performance. And any GC will be sl...
- Craig Black (17/50) Apr 24 2006 I have heard similar. "The code required to manually free everything
- kris (7/67) Apr 24 2006 The nice thing about D, as a language, is that it supports both
- Craig Black (10/16) Apr 24 2006 Correct. But since the language core relies on Phobos I don't think tha...
- kris (15/27) Apr 24 2006 That's really quite easy, Craig; start with Ares, and then apply
- Sean Kelly (6/13) Apr 25 2006 At the very least, a GC is required if you want to do much of anything
- Walter Bright (21/63) Apr 24 2006 There's more to it than that. One of the big benefits of GC is that
- Craig Black (12/79) Apr 24 2006 Hmmm. I have glanced at the word count example many times but was not a...
- Walter Bright (23/34) Apr 24 2006 Even in that case, gc can win. For example, for explicit memory
- Craig Black (18/52) Apr 25 2006 I don't see a performance issue here. Nor do think this would be terrib...
- Sean Kelly (9/17) Apr 25 2006 One reason is that Java compilers perform basically no compile-time
- Craig Black (8/13) Apr 25 2006 It is also worth mentioning that compacting can be performed without
- Sean Kelly (11/21) Apr 25 2006 For what it's worth, the performance improvement may be attributed to
- Mike Capp (19/29) Apr 25 2006 Not true for intrusive refcounting. (This isn't always feasible if you'r...
- Craig Black (8/35) Apr 25 2006 What's intrusive refcounting? Please use small words. My brain is almos...
- Sean Kelly (8/44) Apr 25 2006 Maintaining the reference count within the class being pointed to
- Craig Black (8/53) Apr 25 2006 That's not the point. It demonstrates that the same performance can be
- Sean Kelly (12/29) Apr 25 2006 I merely meant that that's not the point of the wordcount example. Howe...
- Craig Black (25/68) Apr 25 2006 Great! I look forward to seeing the results.
- Sean Kelly (45/45) Apr 27 2006 I ran some quick tests this morning using the word count example code
- Sean Kelly (24/24) Apr 27 2006 Quick follow-up. I decided to re-run the tests using the full text of
- Craig Black (7/10) Apr 27 2006 Wow! It's over twice as fast! How difficult would it be to use the sam...
- Sean Kelly (23/33) Apr 27 2006 The C++ slice code should mirror what is already happening in D--as far
- Sean Kelly (8/13) Apr 27 2006 By the way, I do think that GC performance can be improved significantly...
- Craig Black (6/13) Apr 28 2006 I didn't know such selective scanning was possible with GC. Would this
- Sean Kelly (76/76) Apr 27 2006 Last follow-up. I re-ran all tests with a bit less stuff going on in
- Craig Black (9/17) Apr 28 2006 That is very strange ... so it seems that these benchmarks, being
- Sean Kelly (5/7) Apr 29 2006 I wanted to reduce the chance that random system noise would affect a
- Walter Bright (15/41) Apr 25 2006 True, but shared_ptr<> is billed as the C++ answer to gc, and
- Mike Capp (18/30) Apr 26 2006 Shrug. I'm finding myself increasingly unconcerned these days with what'...
- Walter Bright (11/43) Apr 26 2006 I can see that point, but I can see that once you stray into such
- Sean Kelly (14/21) Apr 24 2006 I believe this is indeed the common argument. In C++, smart pointers
- James Dunne (23/29) Apr 26 2006 Not true. A hierachical memory management system is great! It sure
- xs0 (5/22) Apr 26 2006 But how can you tell _when_ to delete the root? AFAIK, the main benefit
- James Dunne (29/56) Apr 26 2006 That's a different problem and depends on the nature of the application....
- Walter Bright (11/14) Apr 26 2006 One example that has plagued me in the past is having two distinct tree
- James Dunne (17/37) Apr 26 2006 Your example is having two trees _point_ to data within each other. The...
- James Dunne (14/49) Apr 26 2006 I should also mention that my compiler implementation makes use of
- Dave (2/9) Apr 26 2006 What kinda compiler?
- James Dunne (7/18) Apr 26 2006 My own research language named Iris. I'd like to keep it under the
- Walter Bright (3/9) Apr 26 2006 But when this happens in the general case, I can't delete *any* of the
- James Dunne (7/19) Apr 26 2006 When are you trying to delete one of these trees?
- Walter Bright (4/21) Apr 26 2006 I could wait until the program finishes before deleting anything, but
- James Dunne (8/28) Apr 26 2006 If you parent the memory you're creating in the tree data structures to
- Walter Bright (7/14) Apr 26 2006 I've designed data structures around the memory management requirement
- Craig Black (12/12) Apr 26 2006 Good idea! I suppose that every time you allocate memory you must alway...
- James Dunne (25/40) Apr 26 2006 Holy geez... Yes it is a good idea, but alas I cannot take credit for
- Craig Black (10/45) Apr 27 2006 Of coarse it does. Along those lines ... I've been thinking of an
- Craig Black (3/5) Apr 27 2006 I was in a hurry. Let me revise.
- James Dunne (13/23) Apr 27 2006 Hehe. That's okay. I understand text through typos.
- Craig Black (13/16) Apr 27 2006 Instead of calling malloc for each allocation unit, you could write an
- James Dunne (8/32) Apr 27 2006 Yes I understand. Thank you. Very interesting concept.
Most programmers do not realize the overhead involved in calling a method virtually. A virtual method call is an O(1) operation: just a vtable lookup and invoking a function pointer. However, we must also consider that the CPU has no way of performing branch prediction with vtable lookups, so the pipeline is trashed. Thus, virtual method calls can incur a large performance penalty.

Interestingly, I found that one of the MSVC++ 2005 compiler's profile-guided optimizations replaces the vtable lookup with if statements for the most commonly called virtual methods, followed by the vtable lookup as a fallback, so that branch prediction can be performed in the majority of cases.

I do not consider myself an expert in the area of compiler optimization, but knowing the little that I do know, I have come up with a simple approach that may improve performance. This approach is described below. Performing a binary search using if statements, and then invoking the methods directly rather than through function pointers, would allow the method dispatch to take advantage of pipelining. Algorithmically, however, this is no longer O(1) but O(log n), which is bad. So I would assume that it would only be faster if n (or the number of methods in the vtable) is lower than a given threshold. This threshold would be determined by testing and would perhaps differ across processors. Thus, in order to ensure that this approach is always faster, a relatively low number should be used.

Like I said, I'm not an expert when it comes to low-level optimization strategies, and I don't know how well such an algorithm would work in practice. Comments are obviously welcome.

-Craig
Apr 23 2006
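For readers following along, the mechanism under discussion can be sketched as a hand-rolled vtable in C++. Every name in this sketch (VTable, Animal, dog/cat) is illustrative rather than from the thread:

```cpp
// What the compiler effectively emits for a virtual call: each object
// carries a hidden pointer to its class's table of function pointers,
// and the call site jumps indirectly through that table.
struct Animal;

struct VTable {
    int (*speak)(Animal* self);  // one slot per virtual method
};

struct Animal {
    const VTable* vptr;  // the hidden first field of every object
};

int dog_speak(Animal*) { return 1; }
int cat_speak(Animal*) { return 2; }

const VTable dog_vtable = { dog_speak };
const VTable cat_vtable = { cat_speak };

// The call site: two dependent loads (object -> vptr -> slot) followed
// by an indirect call whose target the CPU must predict dynamically --
// this indirect call is what the branch-prediction debate is about.
int speak(Animal* a) {
    return a->vptr->speak(a);
}
```

The two loads are cheap when the vtable is cached; the cost under discussion is the indirect call at the end.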
Craig Black wrote:
> [original post quoted in full; snipped]

I'm no expert either, Craig, yet it seems that there's more to it? An indirect address lookup (via a vtable) doesn't trash the pipeline per se; I forget the cycles involved, but there's very little overhead at all if the vtable is a "popular" one, or otherwise happens to be in the cache lines. None of the other in-flight instructions have to be flushed at that time. If the vtable is not cached, then a bubble will appear in the pipeline while main memory is addressed. But, again (as I understand it), this doesn't invalidate any in-flight instructions; it just feeds the core less efficiently.

On the other hand, a branch misprediction must invalidate a large number of in-flight instructions; usually all of them (unless multi-path speculative execution is occurring). The P4 is particularly maimed by this, as compared to its peers, because of its particularly long pipeline (which, in turn, was needed to reach the frequencies Intel marketeers felt they needed to drive AMD off the map).

Obviously I don't know enough about this, but it might be interesting to research how one might introduce some locality-of-reference bias on the more common calls made. For instance, some fancy runtime might rewrite "common" calls to a dynamically constructed "global vtable". That would perhaps avoid the misprediction concerns, whilst managing to retain the globally popular set of lookups within cache? Clearly MS compiler engineers have noticed something beneficial. On the other hand, it seems a bit like flogging a dead horse to introduce something that will likely cause at least one branch misprediction? I suppose the idea is that a direct call (versus an indirect one) will enable speculative execution within the impending function call itself?

Then again, heavyweight speculative execution is starting to lose favour against a grouping of simple symmetrical cores on one die (a la Cell, Kilocore, Niagara, etc). The idea there is that it doesn't matter if a bubble appears, or a core stalls; if you partition an app to take advantage of more than one execution thread, then it will (overall) execute faster anyway. The silicon saved via simplicity is applied to other areas such as massive on-die bandwidth.

Of course, that doesn't help single-threaded applications, but the industry is changing, albeit slowly :) Wither the Transputer? <g>

2 Cents
Apr 23 2006
In article <e2gkms$rur$1 digitaldaemon.com>, kris says...
> [Craig's post and kris's reply snipped]

I'm not following your train of thought completely but... I don't think that the cache hit is the issue; I think that the fact that the function call is to an indirect address is the issue. The CPU might not be able to safely predict where it's supposed to go until it gets all the way to the function call. As a result the instruction queue gets completely drained.
Apr 23 2006
In article <e2ggdq$lat$1 digitaldaemon.com>, Craig Black says...
> [...] So I would assume that it would only be faster if n (or the number of methods in the vtable) is lower than a given threshold. [...]

IIRC what is being searched for isn't which entry in this vtbl, but which vtbl to use. Or maybe I am misunderstanding you.

Maybe some sort of run-and-review optimization could be done. This would entail having the compiler build a heavily instrumented version of the program and then run it a bunch to find out which classes are most commonly used at a given call, and then on the next build let the compiler optimize each call based on the measured data.

If this is class ABC, call ABC.foo
else if XYZ, call XYZ.foo
else use the vtbl
Apr 23 2006
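BCS's pseudocode can be made concrete in C++. Here `typeid` stands in for the vtable-pointer comparison a compiler would actually emit; the class names ABC and XYZ come from the pseudocode, everything else is made up for illustration:

```cpp
#include <typeinfo>

struct Base {
    virtual int foo() const { return 0; }
    virtual ~Base() {}
};
struct ABC : Base { int foo() const override { return 1; } };
struct XYZ : Base { int foo() const override { return 2; } };

// Profile-guided dispatch: the (hypothetical) profile says ABC and XYZ
// dominate this call site, so test for them explicitly. The guarded
// calls are direct (and inlinable) and the branches are predictable;
// everything else falls back to normal vtable dispatch.
int call_foo(const Base* b) {
    if (typeid(*b) == typeid(ABC))
        return static_cast<const ABC*>(b)->ABC::foo();  // direct call
    if (typeid(*b) == typeid(XYZ))
        return static_cast<const XYZ*>(b)->XYZ::foo();  // direct call
    return b->foo();  // vtable fallback for rare receiver types
}
```

The qualified calls (`ABC::foo`) bypass the vtable entirely, which is what lets the pipeline speculate into the callee.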
"BCS" <BCS_member pathlink.com> wrote in message news:e2go6m$13u7$1 digitaldaemon.com...In article <e2ggdq$lat$1 digitaldaemon.com>, Craig Black says...Nope you are right. Which vtable. My bad.Most programmers do not realize the overhead involved with calling a method virtually. A virtual method call is an O(1) operation. Just a vtable lookup and a invoking a function pointer. However, we must also consider that the CPU has no way of performing branch prediction with vtable lookups, so the pipeline is trashed. Thus, virtual method calls incur a large performance pentalty. Interestly, I found that one of the MSVC++ 2005 compiler's profile guided optimizations is replacing the vtable lookup with if statements for the most commonly called virtual methods followed by the vtable lookup, so that branch prediction can be performed in the majority of cases. I do not consider myself an expert in the area of compiler optimization, but knowing the little that I do know I have come up with a simple approach that may improve performance. This approach is described below. Performing a binary search using if statements, and then invoking the methods directly rather than using fuction pointers would allow the method dispatch to take advantage of pipelining. Algorithmically, however, this is no longer O(1), but O(log n), which is bad. So I would assume that it would only be faster if n (or the number of methods in the vtable) is lower than aIIRC what is being searched for isn't which entry in this vtbl, but which vtbl to use. Or maybe I am miss understanding you.Yes, like MSVC's profile guided optimization. A more general optimization would be nice too, but you could probably be even faster and more efficient with a profile-guided one. One real good thing about profile guided optimizations is that the compiler can spend more time on optimizing performance-critical code sections and less time on code that is not pertinent to performance. -Craiggiven threshold. 
This threshold would be determined by testing and would perhaps be different on different processors. Thus, in order to ensure that this approach is always faster, a relatively low number should be used. Like I said I'm not an expert when it comes to low-level optimization strategies. I don't know how well such an algorithm would work in practice. Comments are obviously welcome. -CraigMaybe some sort of run-and-review optimization could be done. This would entail having the compiler build a heavily instrumented version of the program and then run it a bunch to find out which classes are most commonly used at a given call and then on the next build let the compiler optimize each call based on the measured data. If this is class ABC call ABC.foo else if XYZ call XYZ.foo else use the vtbl
Apr 23 2006
I'm far from being an expert .. but I thought vtable lookups involve two or three jump/call instructions at the assembly level; at least that's the impression I got from stepping through code in the VC++ debugger.

Craig Black wrote:
> [original post snipped]
Apr 23 2006
"Hasan Aljudy" <hasan.aljudy gmail.com> wrote in message news:e2gr8j$18r2$1 digitaldaemon.com...I'm far from being an expert .. but I thought vtable lookups involve two or three jump/call instructions at the assembly level; at least that's the impression I got from stepping through code in the VC++ debugger.From one non-expert to another, I have no idea how many assembly language instructions are involved. However, I have been informed by some very experienced programmers that virtual methods don't mesh well with pipelining. This is the main performance issue with calling a virtual method. -Craig
Apr 23 2006
Craig Black wrote:
> [original post snipped]

This is a little OT, because it involves removing the need for a lot of VMD altogether, before the assembly code is even generated, instead of a way to optimize the VMD.

But I think it would be good to take a look at 'auto-finalization', where the compiler front-end determines that a virtual method is not needed and so just does a direct call. Not only can more direct calls be made, but potentially the vtable gets smaller, the lookups will be less frequent, and there will be no need for any branch prediction for direct calls, getting rid of much of the need for VMD optimizations for many programs in the first place.

Maybe the semantic processing differences of 'import' over #include would make this more straightforward to do for D compilers over C++.

- Dave
Apr 23 2006
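A sketch of the effect Dave describes, using C++'s `final` annotation (a later addition to that language) to state by hand what 'auto-finalization' would infer from whole-program analysis; the class names are invented:

```cpp
struct Stream {
    virtual int read() { return 0; }
    virtual ~Stream() {}
};

// `final` promises that no class derives from FileStream, so every
// call through a FileStream reference has exactly one possible target.
struct FileStream final : Stream {
    int read() override { return 42; }
};

int read_from(FileStream& f) {
    // The compiler knows the exact target here and can emit a direct,
    // inlinable call instead of a vtable lookup. Auto-finalization
    // would reach the same conclusion automatically when the compiler
    // sees the whole program and finds no overrides.
    return f.read();
}
```
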
Dave wrote:
> [quoted post snipped] But, I think it would be good to take a look at 'auto-finalization' where the compiler front-end determines that a virtual method is not needed so just does a direct call. [...]

Yes, I recall seeing Walter write something along these lines somewhere in the D documentation: being able to compile the entire program, versus linking things in from a library, permits much more aggressive finalization of methods as you describe. That's definitely a benefit.
Apr 23 2006
Actually, having "if statements" is absolutely TERRIBLE for trying to improve branch prediction. By its very nature, an if statement is a branch, so you are causing at least one branch misprediction anyways.

Furthermore, pointers are not O(1) algorithms. What the computer is doing is performing a:

LEA (reg) [(v_pointer)];
J(cc)|JMP [(reg)];

My understanding is that the v_pointer address is known at compile time; however, it implies:

- that the v_table's cache has to be loaded every time a function is called
- that the extra lea buggers instruction pairing, tagging on several cycles
- the potential that the v_table is a cache miss (highly unlikely a page miss)

Interestingly however, if one were to prepend some Self Modifying Code to the program that instead loaded the vtable into a set of addresses attached to the vtable, then one could achieve the benefits of static functions and most of the benefits of virtual functions, at the cost of having an extra void*.sizeof for every function call, and the SMC code overhead. The problem is that *some* functions would thereafter be virtual, so that a single pointer could be used to point to any of a set of functions - a case not easily rendered with static functions (but still possible, remembering that all code is just data).

That said, I think the overhead involved in using virtual functions is quite reasonable, and comparable to the overhead in using 'calling conventions' versus an internally-optimized system with primarily register arguments. It's 7-8 clocks every function call, with a little bit of cache overhead and the potential for a cache miss.

Ultimately though, I think the GC causes the highest level of performance troubles on the benchmarks. I love it to death though, and wish someone would get their hands dirty making it smaller and sexier.

Thanks!
Apr 24 2006
"Dan" <Dan_member pathlink.com> wrote in message news:e2im9o$29gi$1 digitaldaemon.com...Actually, having "if statements" is absolutely TERRIBLE for trying to improve branch prediction. By it's very nature, if statements are branches, so you are causing at least one branch misprediction anyways. Furthermore, pointers are not O(1) algorithms. What the computer is doing is performing a: LEA (reg) [(v_pointer)]; J(cc)|JMP [(reg)];Maybe you are just way over my head but I'm not quite following how this is anything bug O(1).My understanding is that the v_pointer address is known at compile time, however it implies: that the v_table's cache has to be loaded every time a function is called that extra lea buggers instruction pairing tagging on several cycles the potential that the v_table is a cache miss (highly unlikely a page miss) Interestingly however, if one were to prepend some Self Modifying Code to the program that instead loaded the vtable into a set of addresses attached to the vtable, then one could achieve the benefits of static functions and most of the benefits of virtual functions at the cost of having an extra void*.sizeof for every function call, and the SMC code overhead.]OK, now you are definitely over my head. :)The problem is that *some* functions would thereafter be virtual so that a single pointer could be used to point to any of a set of functions - a case not easily rendered with static functions (but still possible remembering that all code is just data) That said, I think the overhead involved in using virtual functions is quite reasonable and comparable to the overhead in using 'calling conventions' versus a internally-optimized system with primarily register arguments. It's 7-8 clocks every function call with a little bit of cache overhead and the potential for a cache miss. Ultimately though, I think the GC causes the highest level of performance troubles on the benchmarks. 
I love it to death though and wish someone would get their hands dirty making it smaller and sexier.Or (truly) optional. :) The notion that the GC is optional in D is really misleading. Everything uses it. While you could certainly make it faster at the expense of more complexity, I don't think that GC that scans the stack testing for non-null pointers will ever provide the performance of C's malloc and free. GC is simply performing much more work. I have heard many arguments to the contrary, but I still don't see how it is possible that GC could compete with malloc and free. -Craig
Apr 24 2006
In article <e2inmo$2bjp$1 digitaldaemon.com>, Craig Black says..."Dan" <Dan_member pathlink.com> wrote in message news:e2im9o$29gi$1 digitaldaemon.com...A bad GC will definitely make for bad performance. And any GC will be slower than an optimum malloc/free solution. However (and I'm still undecided on this one) I think the argument is that in any non-trivial project, manually freeing all of the memory without any memory leaks or axing something too soon would require more overhead than using GC. The trade-off is not in the freeing time but in the can-I-free-this check.Ultimately though, I think the GC causes the highest level of performance troubles on the benchmarks. I love it to death though and wish someone would get their hands dirty making it smaller and sexier.Or (truly) optional. :) The notion that the GC is optional in D is really misleading. Everything uses it. While you could certainly make it faster at the expense of more complexity, I don't think that GC that scans the stack testing for non-null pointers will ever provide the performance of C's malloc and free. GC is simply performing much more work. I have heard many arguments to the contrary, but I still don't see how it is possible that GC could compete with malloc and free.
Apr 24 2006
"BCS" <BCS_member pathlink.com> wrote in message news:e2isuc$2jmd$1 digitaldaemon.com...In article <e2inmo$2bjp$1 digitaldaemon.com>, Craig Black says...I have heard similar. "The code required to manually free everything outweighs the overhead of automatic GC." Unless you are doing something ridiculous in your code when freeing heap objects, this statement is false. Manual freeing does not require gobs of code or eat up lots of processor cycles. Typically it involves testing a pointer to see if it is null, calling free() if it is not, and then setting it to null. How could this ever be slower than scanning the entire stack for non-null pointers? GC is simply doing way more work. IMO this statement is simply absurd. The benefit of GC is not performance, but convenience. If you want the convenience of not having to free memory manually then use GC. However, if you want performance, use malloc and free (new and delete). That is why GC should be completely optional, not just on a per-class basis. If I want to completely eliminate GC from my application, I should be able to do so and still be able to make use of any standard library capabilities. -Craig"Dan" <Dan_member pathlink.com> wrote in message news:e2im9o$29gi$1 digitaldaemon.com...A bad GC will definitely make for bad performance. And any GC will be slower than an optimum malloc/free solution. However (and I'm still undecided on this one) I think the argument is that in any non-trivial project, manually freeing all of the memory without any memory leaks or axing something too soon would require more overhead than using GC. The trade-off is not in the freeing time but in the can-I-free-this check.Ultimately though, I think the GC causes the highest level of performance troubles on the benchmarks. I love it to death though and wish someone would get their hands dirty making it smaller and sexier.Or (truly) optional. :) The notion that the GC is optional in D is really misleading. Everything uses it. 
While you could certainly make it faster at the expense of more complexity, I don't think that GC that scans the stack testing for non-null pointers will ever provide the performance of C's malloc and free. GC is simply performing much more work. I have heard many arguments to the contrary, but I still don't see how it is possible that GC could compete with malloc and free.
Apr 24 2006
Craig Black wrote:"BCS" <BCS_member pathlink.com> wrote in message news:e2isuc$2jmd$1 digitaldaemon.com...The nice thing about D, as a language, is that it supports both approaches. The bad thing, for your needs, is that phobos is tightly bound to the GC. My particular gripes are with the GC activity caused via utf conversions, std.string and so on. There's a surprising, perhaps troublesome, amount of GC activity generated from within phobos itself. This is one of the reasons alternative libraries exist.In article <e2inmo$2bjp$1 digitaldaemon.com>, Craig Black says...I have heard similar. "The code required to manually free everything outweighs the overhead of automatic GC." Unless you are doing something rediculous in your code when freeing heap objects, this statement is false. Manual freeing does not require gobs of code or eat up lots of processor cycles. Typically it involves testing a pointer to see if it is null, calling free() if it is not, and then setting it to null. How could this ever be slower than scanning the entire stack for non-null pointers? GC is simply doing way more work. IMO this statement is simply obsurd. The benefit of GC is not performance, but convenience. If you want the convenience of not having to free memory manually then use GC. However if you want performance, use malloc and free (new and delete). That is why GC should be completely optional, not just on a per class basis. If I want to completely eliminate GC from my application, I should be able to do so and still be able to make use of any standard library capabilities. -Craig"Dan" <Dan_member pathlink.com> wrote in message news:e2im9o$29gi$1 digitaldaemon.com...A bad GC will definitely make for bad performance. And any GC will be slower than an optimum malloc/free solution. 
However (and I'm still undecided on this one) I think the argument is that in any non trivial project, manually freeing all of the memory without any memory leaks or axing something to soon would require more overhead than using GC. The trade-off is not in the freeing time but in the can-I-free-this check.Ultimately though, I think the GC causes the highest level of performance troubles on the benchmarks. I love it to death though and wish someone would get their hands dirty making it smaller and sexier.Or (truly) optional. :) The notion that the GC is optional in D is really misleading. Everything uses it. While you could certainly make it faster at the expense of more complexity, I don't think that GC that scans the stack testing for non-null pointers will ever provide the performance of C's malloc and free. GC is simply performing much more work. I have heard many arguments to the contrary, but I still don't see how it is possible that GC could compete with malloc and free.
Apr 24 2006
The nice thing about D, as a language, is that it supports both approaches. The bad thing, for your needs, is that phobos is tightly bound to the GC. My particular gripes are with the GC activity caused via utf conversions, std.string and so on. There's a surprising, perhaps troublesome, amount of GC activity generated from within phobos itself. This is one of the reasons alternative libraries exist.Correct. But since the language core relies on Phobos I don't think that there is a reasonable way to completely decouple D from GC. To me this situation is quite ugly. It would be nice if there was a way to replace Phobos completely. After writing standard libraries that don't rely on GC, we could easily produce D non-GC vs. D GC benchmarks. Some steps would have to be taken in order to keep some source compatibility between the two modes, so that it wouldn't turn into two completely different languages. Manual memory management code could be versioned out when compiled in GC mode. -Craig
Apr 24 2006
Craig Black wrote:That's really quite easy, Craig; start with Ares, and then apply whatever you need on top. For example, Ares and Mango go together really well, and both make a point about not allocating memory where there's a sensible alternative. In fact, the Mango HttpServer does not touch the GC even once per service request, once it warms up -- all hail to the groovy array-slice, and behold the power of the trout-slap :) (kudos to IRC #d) There again, I suspect fear of the GC is somewhat overplayed; although it certainly troubles me when it is abused. Thus, it doesn't bother me at all that Ares kinda' needs a GC also. Perhaps it's more a question of libraries, and code in general, treating it with a little respect :) After all, using any kind of malloc() should perhaps be a question-mark for high-performance code - KrisThe nice thing about D, as a language, is that it supports both approaches. The bad thing, for your needs, is that phobos is tightly bound to the GC. My particular gripes are with the GC activity caused via utf conversions, std.string and so on. There's a surprising, perhaps troublesome, amount of GC activity generated from within phobos itself. This is one of the reasons alternative libraries exist.Correct. But since the language core relies on Phobos I don't think that there is a reasonable way to completely decouple D from GC. To me this situation is quite ugly. It would be nice if there was a way to replace Phobos completely.
Apr 24 2006
kris wrote:There again, I suspect fear of the GC is somewhat overplayed; although it certainly troubles me when it is abused. Thus, it doesn't bother me at all that Ares kinda' needs a GC also. Perhaps it's more a question of libraries, and code in general, treating it with a little respect :)At the very least, a GC is required if you want to do much of anything useful with dynamic arrays. For apps with a huge footprint, I may follow this route and define a custom allocator for all object types.After all, using any kind of malloc() should perhaps be a question-mark for high-performance codeTrue enough. Sean
Apr 25 2006
Craig Black wrote:"BCS" <BCS_member pathlink.com> wrote in message news:e2isuc$2jmd$1 digitaldaemon.com...There's more to it than that. One of the big benefits of GC is that fewer allocations are necessary - and a sure way to better allocation performance is to do less of them. The reason fewer allocations are necessary is that the normal behavior in C/C++, when faced with two pointers to the same object, is to copy the object. That way, there's no grief with worrying about freeing the object that someone else points to. To see this effect at work, see www.digitalmars.com/d/cppstrings.html. This problem is bad enough that the notion of shared_ptr<> has appeared. shared_ptr<> implements reference counting to eliminate the need for extra copies. But it has its own overhead problems: 1) for every allocation, shared_ptr<> allocates an *extra* block of memory to hold its accounting information. So you've got, out of the box, twice as many allocations/deallocations happening. 2) for every copy of the shared_ptr<>, like passing one to a function, returning one, etc., the reference count has to be incremented, then decremented. The inc/dec must be done in an exception safe manner, i.e. they need to be unwindable. Even worse, for multithreaded apps these inc/dec's need to be synchronized. And even with all that falderol, shared_ptr<> still can't do slicing, which are an important efficiency improvement.In article <e2inmo$2bjp$1 digitaldaemon.com>, Craig Black says...I have heard similar. "The code required to manually free everything outweighs the overhead of automatic GC." Unless you are doing something rediculous in your code when freeing heap objects, this statement is false. Manual freeing does not require gobs of code or eat up lots of processor cycles. Typically it involves testing a pointer to see if it is null, calling free() if it is not, and then setting it to null. How could this ever be slower than scanning the entire stack for non-null pointers? 
GC is simply doing way more work. IMO this statement is simply obsurd."Dan" <Dan_member pathlink.com> wrote in message news:e2im9o$29gi$1 digitaldaemon.com...A bad GC will definitely make for bad performance. And any GC will be slower than an optimum malloc/free solution. However (and I'm still undecided on this one) I think the argument is that in any non trivial project, manually freeing all of the memory without any memory leaks or axing something to soon would require more overhead than using GC. The trade-off is not in the freeing time but in the can-I-free-this check.Ultimately though, I think the GC causes the highest level of performance troubles on the benchmarks. I love it to death though and wish someone would get their hands dirty making it smaller and sexier.Or (truly) optional. :) The notion that the GC is optional in D is really misleading. Everything uses it. While you could certainly make it faster at the expense of more complexity, I don't think that GC that scans the stack testing for non-null pointers will ever provide the performance of C's malloc and free. GC is simply performing much more work. I have heard many arguments to the contrary, but I still don't see how it is possible that GC could compete with malloc and free.
Apr 24 2006
"Walter Bright" <newshound digitalmars.com> wrote in message news:e2k12i$19bi$1 digitaldaemon.com...Craig Black wrote:Hmmm. I have glanced at the word count example many times but was not aware of why it was faster in D until now. Is it really the GC paradigm that allows this performance? I think I'm beginning to finally see some method to the madness. Perhaps I need to rethink my stance on GC a little. It would seem then that GC can come out on top if there is a lot of sharing of allocated memory e.g. multiple references to each object. However, if you are certain that there will be only a single reference to each object, manual memory management should always outperform GC. Am I on the right track here? -Craig"BCS" <BCS_member pathlink.com> wrote in message news:e2isuc$2jmd$1 digitaldaemon.com...There's more to it than that. One of the big benefits of GC is that fewer allocations are necessary - and a sure way to better allocation performance is to do less of them. The reason fewer allocations are necessary is that the normal behavior in C/C++, when faced with two pointers to the same object, is to copy the object. That way, there's no grief with worrying about freeing the object that someone else points to. To see this effect at work, see www.digitalmars.com/d/cppstrings.html. This problem is bad enough that the notion of shared_ptr<> has appeared. shared_ptr<> implements reference counting to eliminate the need for extra copies. But it has its own overhead problems: 1) for every allocation, shared_ptr<> allocates an *extra* block of memory to hold its accounting information. So you've got, out of the box, twice as many allocations/deallocations happening. 2) for every copy of the shared_ptr<>, like passing one to a function, returning one, etc., the reference count has to be incremented, then decremented. The inc/dec must be done in an exception safe manner, i.e. they need to be unwindable. 
Even worse, for multithreaded apps these inc/dec's need to be synchronized. And even with all that falderol, shared_ptr<> still can't do slicing, which are an important efficiency improvement.In article <e2inmo$2bjp$1 digitaldaemon.com>, Craig Black says...I have heard similar. "The code required to manually free everything outweighs the overhead of automatic GC." Unless you are doing something rediculous in your code when freeing heap objects, this statement is false. Manual freeing does not require gobs of code or eat up lots of processor cycles. Typically it involves testing a pointer to see if it is null, calling free() if it is not, and then setting it to null. How could this ever be slower than scanning the entire stack for non-null pointers? GC is simply doing way more work. IMO this statement is simply obsurd."Dan" <Dan_member pathlink.com> wrote in message news:e2im9o$29gi$1 digitaldaemon.com...A bad GC will definitely make for bad performance. And any GC will be slower than an optimum malloc/free solution. However (and I'm still undecided on this one) I think the argument is that in any non trivial project, manually freeing all of the memory without any memory leaks or axing something to soon would require more overhead than using GC. The trade-off is not in the freeing time but in the can-I-free-this check.Ultimately though, I think the GC causes the highest level of performance troubles on the benchmarks. I love it to death though and wish someone would get their hands dirty making it smaller and sexier.Or (truly) optional. :) The notion that the GC is optional in D is really misleading. Everything uses it. While you could certainly make it faster at the expense of more complexity, I don't think that GC that scans the stack testing for non-null pointers will ever provide the performance of C's malloc and free. GC is simply performing much more work. 
I have heard many arguments to the contrary, but I still don't see how it is possible that GC could compete with malloc and free.
Apr 24 2006
Craig Black wrote:Hmmm. I have glanced at the word count example many times but was not aware of why it was faster in D until now. Is it really the GC paradigm that allows this performance?Yes. It turns out to be very difficult for C++ to match it.I think I'm beginning to finally see some method to the madness. Perhaps I need to rethink my stance on GC a little. It would seem then that GC can come out on top if there is a lot of sharing of allocated memory e.g. multiple references to each object. However, if you are certain that there will be only a single reference to each object, manual memory management should always outperform GC. Am I on the right track here?Even in that case, gc can win. For example, for explicit memory allocation in the presence of exception handling, one has to carefully set up handlers that will delete any temporaries if an exception is thrown: S *p = new S(); p->a = foo(); bar(p); delete p; What if foo() throws an exception? We've got a memory leak in p, so we've got to add code to fix that. Dealing with that is an endless source of bugs and tedium in C++. With gc, nothing needs to be added. Another advantage of gc (that isn't implemented, but could be, in D's gc) is that allocated memory can be compacted. Compacting existing allocated memory means that allocating a new object is as easy as bumping a pointer, whereas the usual malloc algorithm requires searching a free list for a best match. So, the obvious question then becomes, if gc is faster, why is Java slower than C++? (I know, lots of Java people will claim it really isn't slower <g>.) The reason is that Java's design requires a lot more objects to be allocated than in C++ - for example, you cannot aggregate arrays or other objects inside other objects, they have to be allocated separately. More objects allocated means lower performance.
Apr 24 2006
"Walter Bright" <newshound digitalmars.com> wrote in message news:e2kffd$1rl5$1 digitaldaemon.com...Craig Black wrote:I don't see a performance issue here. Nor do I think this would be terribly difficult to solve in D, especially with the fancy-shmancy new scope() features, as long as scope(exit) works properly with exceptions. This would, however, pose a problem for C++.Hmmm. I have glanced at the word count example many times but was not aware of why it was faster in D until now. Is it really the GC paradigm that allows this performance?Yes. It turns out to be very difficult for C++ to match it.I think I'm beginning to finally see some method to the madness. Perhaps I need to rethink my stance on GC a little. It would seem then that GC can come out on top if there is a lot of sharing of allocated memory e.g. multiple references to each object. However, if you are certain that there will be only a single reference to each object, manual memory management should always outperform GC. Am I on the right track here?Even in that case, gc can win. For example, for explicit memory allocation in the presence of exception handling, one has to carefully set up handlers that will delete any temporaries if an exception is thrown: S *p = new S(); p->a = foo(); bar(p); delete p; What if foo() throws an exception? We've got a memory leak in p, so we've got to add code to fix that. Dealing with that is an endless source of bugs and tedium in C++. With gc, nothing needs to be added.Another advantage of gc (that isn't implemented, but could be, in D's gc) is that allocated memory can be compacted. Compacting existing allocated memory means that allocating a new object is as easy as bumping a pointer, whereas the usual malloc algorithm requires searching a free list for a best match.I can see the obvious performance improvement here, but I am skeptical of whether it can actually be done with D, given its liberal syntax with raw pointers. 
Seems to me that if the GC started moving pointers around, the compiler would have to enforce more safety rules with regards to pointers.So, the obvious question then becomes, if gc is faster, why is Java slower than C++? (I know, lots of Java people will claim it really isn't slower <g>.) The reason is that Java's design requires a lot more objects to be allocated than in C++ - for example, you cannot aggregate arrays or other objects inside other objects, they have to be allocated separately. More objects allocated means lower performance.that I don't know all the technical details with VM's, but I have run my own benchmarks. VM-based code is just way slower than native code. Can't really tell you exactly why though because I don't really know or care to know what goes on inside a VM. (Except for the purpose of explaining to people why it is so much slower.) -Craig
Apr 25 2006
Craig Black wrote:that I don't know all the technical details with VM's, but I have run my own benchmarks. VM-based code is just way slower than native code. Can't really tell you exactly why though because I don't really know or care to know what goes on inside a VM. (Except for the purpose of explaining to people why it is so much slower.)One reason is that Java compilers perform basically no compile-time optimization, and the JIT does very little as well. C/C++, on the other hand, can see huge performance improvements with optimization turned on. However, some VM languages (such as the Lisp variants) seem to perform quite well compared to natively compiled code, though I don't have the experience to say why. Sean
Apr 25 2006
Another advantage of gc (that isn't implemented, but could be, in D's gc) is that allocated memory can be compacted. Compacting existing allocated memory means that allocating a new object is as easy as bumping a pointer, whereas the usual malloc algorithm requires searching a free list for a best match.It is also worth mentioning that compacting can be performed without scanning the stack. Manual memory management can still be used along with a compacting allocator. A compacting allocator would certainly outperform CRT malloc and free. There is a tradeoff here because you can no longer use pedal-to-the-metal algorithms that make use of raw pointers and pointer arithmetic, or if you do so, you would have to turn the allocator off temporarily to ensure safety. -Craig
Apr 25 2006
Craig Black wrote:Hmmm. I have glanced at the word count example many times but was not aware of why it was faster in D until now. Is it really the GC paradigm that allows this performance? I think I'm beginning to finally see some method to the madness. Perhaps I need to rethink my stance on GC a little.For what it's worth, the performance improvement may be attributed to two factors: the C++ map class is a balanced binary tree while D uses a hash table, and C++ indeed copies strings around while D passes references.It would seem then that GC can come out on top if there is a lot of sharing of allocated memory e.g. multiple references to each object. However, if you are certain that there will be only a single reference to each object, manual memory management should always outperform GC. Am I on the right track here?D allows manual memory management as well, so you aren't forced to use the GC if you don't want it in most cases. However, the lack of copy ctors does essentially eliminate the possibility of using smart pointers. One downside of a GC is that it needs to store bookkeeping data, which may amount to as much as 50% memory overhead. This can be problematic for very large applications. Sean
Apr 25 2006
In article <e2k12i$19bi$1 digitaldaemon.com>, Walter Bright says...1) for every allocation, shared_ptr<> allocates an *extra* block of memory to hold its accounting information. So you've got, out of the box, twice as many allocations/deallocations happening.Not true for intrusive refcounting. (This isn't always feasible if you're dealing with third-party classes, which is why shared_ptr isn't intrusive, but if you _can_ use it it's very close to ideal.)2) for every copy of the shared_ptr<>, like passing one to a function, returning one, etc., the reference count has to be incremented, then decremented.Again, not true for intrusive refcounting. You can assign an intrusive smartpointer from a raw pointer or reference, so most of the time you can use those to pass/return. Refcount twiddling is only needed when changing an "owning" pointer, plus the rare "safety" case where you have a raw pointer but are calling code which might drop the last "real" owning ref.And even with all that falderol, shared_ptr<> still can't do slicing, which are an important efficiency improvement.Agreed, but couldn't you do something very similar with a simple ptr class containing a pointer to the (refcounted) base string plus a couple of range end indices? That's worked well for me in the past, but there may be benefits of D slices that I'm missing. [And in a subsequent posting]Another advantage of gc (that isn't implemented, but could be, in D's gc) is that allocated memory can be compacted.Has there ever been a language that successfully combined compacting GC with raw pointers? How would this work? And it's going to be tough to drop raw pointers without breaking ease-of-use for C libraries. cheers Mike
Apr 25 2006
"Mike Capp" <mike.capp gmail.com> wrote in message news:e2l4g1$2n5s$1 digitaldaemon.com...In article <e2k12i$19bi$1 digitaldaemon.com>, Walter Bright says...What's intrusive refcounting? Please use small words. My brain is almost full right now. :)1) for every allocation, shared_ptr<> allocates an *extra* block of memory to hold its accounting information. So you've got, out of the box, twice as many allocations/deallocations happening.Not true for intrusive refcounting. (This isn't always feasible if you're dealing with third-party classes, which is why shared_ptr isn't intrusive, but if you _can_ use it it's very close to ideal.)If you think this is doable, then you should write a C++ version of wordcount based on this principle. If you can get comparable performance to the D version then you may be on to something. -Craig2) for every copy of the shared_ptr<>, like passing one to a function, returning one, etc., the reference count has to be incremented, then decremented.Again, not true for intrusive refcounting. You can assign an intrusive smartpointer from a raw pointer or reference, so most of the time you can use those to pass/return. Refcount twiddling is only needed when changing an "owning" pointer, plus the rare "safety" case where you have a raw pointer but are calling code which might drop the last "real" owning ref.And even with all that falderol, shared_ptr<> still can't do slicing, which are an important efficiency improvement.Agreed, but couldn't you do something very similar with a simple ptr class containing a pointer to the (refcounted) base string plus a couple of range end indices? That's worked well for me in the past, but there may be benefits of D slices that I'm missing.
Apr 25 2006
Craig Black wrote:"Mike Capp" <mike.capp gmail.com> wrote in message news:e2l4g1$2n5s$1 digitaldaemon.com...Maintaining the reference count within the class being pointed to instead of within the shared pointer code.In article <e2k12i$19bi$1 digitaldaemon.com>, Walter Bright says...What's intrusive refcounting? Please use small words. My brain is almost full right now. :)1) for every allocation, shared_ptr<> allocates an *extra* block of memory to hold its accounting information. So you've got, out of the box, twice as many allocations/deallocations happening.Not true for intrusive refcounting. (This isn't always feasible if you're dealing with third-party classes, which is why shared_ptr isn't intrusive, but if you _can_ use it it's very close to ideal.)I'd meant to do this and never got around to it. However, the wordcount example is meant to compare performance using typical programming techniques for each language, and writing your own specialized containers isn't typical :-) SeanIf you think this is doable, then you should write a C++ version of wordcount based on this principal. If you can get comparable performance to the D version then you may be on to something.2) for every copy of the shared_ptr<>, like passing one to a function, returning one, etc., the reference count has to be incremented, then decremented.Again, not true for intrusive refcounting. You can assign an intrusive smartpointer from a raw pointer or reference, so most of the time you can use those to pass/return. Refcount twiddling is only needed when changing an "owning" pointer, plus the rare "safety" case where you have a raw pointer but are calling code which might drop the last "real" owning ref.And even with all that falderol, shared_ptr<> still can't do slicing, which are an important efficiency improvement.Agreed, but couldn't you do something very similar with a simple ptr class containing a pointer to the (refcounted) base string plus a couple of range end indices? 
That's worked well for me in the past, but there may be benefits of D slices that I'm missing.
Apr 25 2006
"Sean Kelly" <sean f4.ca> wrote in message news:e2lgsh$5tf$1 digitaldaemon.com...Craig Black wrote:That's not the point. It demonstrates that the same performance can be achieved without GC, and all the baggage that accompanies it. So performance-wise it would be like having your cake and eating it too. If this was possible, perhaps D could be streamlined to take advantage of this somehow. -Craig"Mike Capp" <mike.capp gmail.com> wrote in message news:e2l4g1$2n5s$1 digitaldaemon.com...Maintaining the reference count within the class being pointed to instead of within the shared pointer code.In article <e2k12i$19bi$1 digitaldaemon.com>, Walter Bright says...What's intrusive refcounting? Please use small words. My brain is almost full right now. :)1) for every allocation, shared_ptr<> allocates an *extra* block of memory to hold its accounting information. So you've got, out of the box, twice as many allocations/deallocations happening.Not true for intrusive refcounting. (This isn't always feasible if you're dealing with third-party classes, which is why shared_ptr isn't intrusive, but if you _can_ use it it's very close to ideal.)I'd meant to do this and never got around to it. However, the wordcount example is meant to compare performance using typical programming techniques for each language, and writing your own specialized containers isn't typical :-)If you think this is doable, then you should write a C++ version of wordcount based on this principal. If you can get comparable performance to the D version then you may be on to something.2) for every copy of the shared_ptr<>, like passing one to a function, returning one, etc., the reference count has to be incremented, then decremented.Again, not true for intrusive refcounting. You can assign an intrusive smartpointer from a raw pointer or reference, so most of the time you can use those to pass/return. 
Refcount twiddling is only needed when changing an "owning" pointer, plus the rare "safety" case where you have a raw pointer but are calling code which might drop the last "real" owning ref.And even with all that falderol, shared_ptr<> still can't do slicing, which are an important efficiency improvement.Agreed, but couldn't you do something very similar with a simple ptr class containing a pointer to the (refcounted) base string plus a couple of range end indices? That's worked well for me in the past, but there may be benefits of D slices that I'm missing.
Apr 25 2006
In article <e2lhhq$6qv$1 digitaldaemon.com>, Craig Black says..."Sean Kelly" <sean f4.ca> wrote in message news:e2lgsh$5tf$1 digitaldaemon.com...I merely meant that that's not the point of the wordcount example. However, I'll try and find the time to recode it using slices and hashtables in the next few days. IIRC that brings performance in line with the D version, so I suppose that means you can "have your cake and eat it too." It's worth mentioning, however, that slicing with explicit memory management doesn't have quite the utility it does with garbage collection. It works great for apps where the entire buffer can be loaded en masse, as with the wordcount example, but in other cases it's a bit more complicated to manage the lifetime of the referenced data. Thus the GC version is still a clear win in terms of ease of use, whether or not performance can be matched with specialized code elsewhere. SeanCraig Black wrote:That's not the point. It demonstrates that the same performance can be achieved without GC, and all the baggage that accompanies it. So performance-wise it would be like having your cake and eating it too. If this was possible, perhaps D could be streamlined to take advantage of this somehow.If you think this is doable, then you should write a C++ version of wordcount based on this principal. If you can get comparable performance to the D version then you may be on to something.I'd meant to do this and never got around to it. However, the wordcount example is meant to compare performance using typical programming techniques for each language, and writing your own specialized containers isn't typical :-)
Apr 25 2006
"Sean Kelly" <sean f4.ca> wrote in message news:e2lpjj$hvv$1 digitaldaemon.com...In article <e2lhhq$6qv$1 digitaldaemon.com>, Craig Black says...Great! I look forward to seeing the results."Sean Kelly" <sean f4.ca> wrote in message news:e2lgsh$5tf$1 digitaldaemon.com...I merely meant that that's not the point of the wordcount example. However, I'll try and find the time to recode it using slices and hashtables in the next few days. IIRC that brings performance in line with the D version, so I suppose that means you can "have your cake and eat it too." It's worth mentioning,Craig Black wrote:That's not the point. It demonstrates that the same performance can be achieved without GC, and all the baggage that accompanies it. So performance-wise it would be like having your cake and eating it too. If this was possible, perhaps D could be streamlined to take advantage of this somehow.If you think this is doable, then you should write a C++ version of wordcount based on this principal. If you can get comparable performance to the D version then you may be on to something.I'd meant to do this and never got around to it. However, the wordcount example is meant to compare performance using typical programming techniques for each language, and writing your own specialized containers isn't typical :-)however, that slicing with explicit memory management doesn't have quite the utility it does with garbage collection. It works great for apps where the entire buffer can be loaded en masse, as with the wordcount example, but in other cases it's a bit more complicated to manage the lifetime of the referenced data. Thus the GC version is still a clear win in terms of ease of use, whether or not performance can be matched with specialized code elsewhere.Different programmers have different problems that require different solutions. I doubt that we will ever find one approach to make everyone happy about everything. That's why I think GC should be completely optional. 
Ideally, turning off GC could be as easy as specifying a compiler switch. But maybe I'm wrong. Maybe GC can be faster than manual memory management, but I doubt it. Either way, we need to optimize this bottleneck somehow. Although D currently delivers competitive performance, and is in some ways faster than C++, I still believe we have a few more steps to take before we can say with confidence that D provides performance superior to C++. IMO, the fastest way to get extreme performance from D is (1) make GC optional and (2) provide a safe, easy way to take advantage of concurrency on multi-core systems. (Other languages are working on the same problems; D could beat them to the punch and provide a superior solution.) This would make D very attractive for performance-demanding applications. I am especially thinking of the game industry, with which I am indirectly involved. This is the kind of stuff that they are very concerned about. If we do not provide a solution to these issues, I fear it will be a big deterrent for many, especially since migrating large projects to D from C++ requires a large investment. -Craig
Apr 25 2006
I ran some quick tests this morning using the word count example code and here are my results so far:

D version (wc):
Execution time: 0.179 s
Execution time: 0.183 s
Execution time: 0.149 s

C++ version 1 (wccpp2):
Execution time: 0.216 s
Execution time: 0.263 s
Execution time: 0.183 s

C++ version 2 (wccpp3):
Execution time: 0.221 s
Execution time: 0.168 s
Execution time: 0.185 s

C++ version 3 (wccpp4):
Execution time: 0.191 s
Execution time: 0.243 s
Execution time: 0.142 s

The apps were compiled according to the word count description and timed using ptime (I had some applications open during the tests so the results may be a tad inaccurate). wccpp2 is straight from the web page, wccpp3 replaces the C++ IO code with a C version quite similar to the D code in std.file, and wccpp4 replaces std::string with slice<char>. You'll notice that there is practically no difference between the std::string and slice tests (wccpp3 and wccpp4), and I think this is to be expected. Typical C++ string implementations have a "small string optimization" to avoid dynamic allocations for strings of less than 16 characters. And since there are likely very few words in the sample text that are larger than this, there are few if any dynamic allocations being performed by std::string. Thus I think the real notable performance difference between C++ and D is that C++ uses a balanced binary tree while D uses a hash table. I tried replacing std::map with a hash table implementation (technically an unordered_map implemented according to C++ TR1 that I did for a project a while back), but DMC choked on it and I don't have any more time to spend on this right now. When I get a chance, I'll try to verify that the unordered_map code conforms to the C++ spec (it compiles cleanly at WL:4 in Visual Studio 2005) and will file a bug report for DMC if it does.
For reference, the bulk of the errors I saw were related to symbol lookup--a template class using static and non-static data from a parent class without fully qualifying the names--adding some 'using' statements took care of these. The final one was a bit more subtle and offered me almost no useful information to go on, so that one may take some investigation. Sean
Apr 27 2006
Quick follow-up. I decided to re-run the tests using the full text of the King James Bible (http://www.gutenberg.org/dirs/etext90/kjv10.txt) in an attempt to get better timing resolution, as it's 4,342 KB vs. Alice in Wonderland's 160 KB. Here are my results:

D version (wc):
Execution time: 1.409 s
Execution time: 1.473 s
Execution time: 1.295 s

C++ version 1 (wccpp2):
Execution time: 1.779 s
Execution time: 1.762 s
Execution time: 1.792 s

C++ version 2 (wccpp3):
Execution time: 1.758 s
Execution time: 1.309 s
Execution time: 1.297 s

C++ version 3 (wccpp4):
Execution time: 0.693 s
Execution time: 0.767 s
Execution time: 0.704 s

Surprisingly, the slice code does substantially better than both the other C++ versions and the D code. Perhaps this evening I'll re-run the C++ tests under Visual Studio to give the hash table a test drive. Sean
Apr 27 2006
Surprisingly, the slice code does substantially better than both the other C++ versions and the D code. Perhaps this evening I'll re-run the C++ tests under Visual Studio to give the hash table a test drive.Wow! It's over twice as fast! How difficult would it be to use the same approach for wordcount in D by bypassing the GC? So what do you think we should do about GC to make D faster? Do you favor making GC optional or writing a more optimized, compacting GC? I personally think that even optimized GC won't come close to code that has been optimized with manual memory management. -Craig
Apr 27 2006
Craig Black wrote:The C++ slice code should mirror what is already happening in D--as far as I know, the D code should allocate no memory at all for strings, only for the AA. I suspect the problem may be GC scans of the rather large (4 megabyte) data area where the book contents are stored. If the D code were modified to allocate this memory via malloc instead of new, the D program may speed up substantially. I'll see about dropping the C++ file code into the D app this evening.Surprisingly, the slice code does substantially better than both the other C++ versions and the D code. Perhaps this evening I'll re-run the C++ tests under Visual Studio to give the hash table a test drive.Wow! It's over twice as fast! How difficult would it be to use the same approach for wordcount in D by bypassing the GC?So what do you think we should do about GC to make D faster? Do you favor making GC optional or writing a more optimized, compacting GC? I personally think that even optimized GC won't come close to code that has been optimized with manual memory management.I'm not sure either approach could be said to be "better" than the other insofar as performance is concerned as it really comes down to application design. I'll admit that, until D, I'd always dismissed garbage collection for one reason or another. But after using D for a while I suddenly realized how much time and code I used to invest in managing data ownership issues. I won't go so far as to say that data ownership is not important, but being freed from the yoke of smart pointer code and such I've found both that I'm more productive and that the result is more elegant than its C++ equivalent. My only remaining issue with GC is that it makes it too easy to ignore what's actually going on at a low level. 
This probably isn't an issue for experienced programmers who can't help but think about these things, but I still feel that learning programmers should be forced to deal with this stuff simply so they're aware of it. Sean
Apr 27 2006
Craig Black wrote:So what do you think we should do about GC to make D faster? Do you favor making GC optional or writing a more optimized, compacting GC? I personally think that even optimized GC won't come close to code that has been optimized with manual memory management.By the way, I do think that GC performance can be improved significantly over where it is now, so don't take my other reply as an admission that GC is slow. Also, I suspect that by the time D hits 1.0, GC performance will be much improved over what it is now. This can be achieved a number of different ways, but the most obvious would be for the GC to ignore memory blocks that don't contain pointers. Sean
Apr 27 2006
By the way, I do think that GC performance can be improved significantly over where it is now, so don't take my other reply as an admission that GC is slow. Also, I suspect that by the time D hits 1.0, GC performance will be much improved over what it is now.I hope that you are right.This can be achieved a number of different ways, but the most obvious would be for the GC to ignore memory blocks that don't contain pointers.I didn't know such selective scanning was possible with GC. Would this optimization be easy to implement? If such an optimization were implemented, then I suppose it would be wise to keep heap allocation local to small blocks of memory so that the GC would not have much to scan. -Craig
Apr 28 2006
Last follow-up. I re-ran all tests with a bit less stuff going on in the background, and I ran the C++ tests with MSVC as well, including the hash table version (though I have no DMC run for this particular app). Here are the results.

D version 1 (wc) : the original example
D version 2 (wc2): using malloc for the buffer
C++ version 1 (wccpp2): the original example
C++ version 2 (wccpp3): replaced C++ stream code with C routines
C++ version 3 (wccpp4): replaced std::string with slice<char>
C++ version 4 (wccpp5): replaced std::map with unordered_map

D version 1 (wc):
Execution time: 0.380 s
Execution time: 0.423 s
Execution time: 0.446 s

D version 2 (wc2):
Execution time: 0.659 s
Execution time: 0.424 s
Execution time: 0.377 s

----------

DMC version 1 (wccpp2):
Execution time: 1.656 s
Execution time: 1.652 s
Execution time: 1.656 s

DMC version 2 (wccpp3):
Execution time: 1.230 s
Execution time: 1.257 s
Execution time: 1.290 s

DMC version 3 (wccpp4):
Execution time: 0.674 s
Execution time: 0.729 s
Execution time: 0.697 s

----------

MSVC version 1 (wccpp2):
Execution time: 1.093 s
Execution time: 1.120 s
Execution time: 1.042 s

MSVC version 2 (wccpp3):
Execution time: 0.797 s
Execution time: 0.833 s
Execution time: 0.799 s

MSVC version 3 (wccpp4):
Execution time: 0.761 s
Execution time: 0.744 s
Execution time: 0.726 s

MSVC version 4 (wccpp5):
Execution time: 0.438 s
Execution time: 0.436 s
Execution time: 0.406 s

The most amazing thing here is the dramatic performance difference between these D runs and the previous ones. I won't speculate on why, but the exact same app appears to be running 3 times as fast as it did this morning. Beyond that, we can see that the C++ app using slices and a hash table performs similarly to the D implementation, though it's difficult to draw any solid conclusions without a test run under DMC.
Also, I should note for the record that the hash table implementation doesn't rehash automatically (it wasn't a feature I needed at the time), so I preallocated 4096 buckets before running the word count. There are 793875 words in the KJ bible so this results in somewhat long chains, but I saw roughly similar results with double the number of buckets so I think the bulk of processing at this speed may be taken up by IO. The only other notable difference is that the C++ code uses standard C calls for the file operations while the D code uses Windows calls. I can't see this resulting in a noticeable speed difference, but I figured it was worth mentioning. Finally, as these tests were quite informal I don't suggest reading too much into them. However I think a reasonable conclusion would be that D is faster than C++ for typical user applications (assuming a use pattern similar to the word count application) and while C++ can be quite competitive, making it so requires the use of hand-written code over standard library code. C++ will fare a bit better in 2009 when unordered_map is an official part of the standard, but that still leaves std::string as the suggested way to store string data. This may not be an issue if you're dealing with English text where most words are under 16 characters, but it could be for other data sets where allocations will occur. Sean
Apr 27 2006
Please allow me to ask a stupid question. Why are there three benchmarks for each version? Are you testing on different computers or something?The most amazing thing here is the dramatic performance difference between these D runs and the previous ones. I won't speculate on why, but the exact same app appears to be running 3 times as fast as it did this morning.That is very strange ... so it seems that these benchmarks, being inconsistent, were inconclusive. Sounds like we need more testing. Anyway, D came out on top that time so that is good news!Finally, as these tests were quite informal I don't suggest reading too much into them.Agreed. Benchmarking one small program doesn't really say much.However I think a reasonable conclusion would be that D is faster than C++ for typical user applicationsI would agree, but not just based on a single benchmark. -Craig
Apr 28 2006
Craig Black wrote:Please allow me to ask a stupid question. Why are there three benchmarks for each version? Are you testing on different computers or something?I wanted to reduce the chance that random system noise would affect a result. I suppose I could have simply posted the mean for each test, but it was easier to cut & paste than to get out a calculator. Sean
Apr 29 2006
Mike Capp wrote:In article <e2k12i$19bi$1 digitaldaemon.com>, Walter Bright says...True, but shared_ptr<> is billed as the C++ answer to gc, and shared_ptr<> is not intrusive.1) for every allocation, shared_ptr<> allocates an *extra* block of memory to hold its accounting information. So you've got, out of the box, twice as many allocations/deallocations happening.Not true for intrusive refcounting. (This isn't always feasible if you're dealing with third-party classes, which is why shared_ptr isn't intrusive, but if you _can_ use it it's very close to ideal.)But that means just going back to the old way of being very, very careful about who "owns" the pointer and who doesn't. Refcounting is supposed to eliminate that nuisance.2) for every copy of the shared_ptr<>, like passing one to a function, returning one, etc., the reference count has to be incremented, then decremented.Again, not true for intrusive refcounting. You can assign an intrusive smartpointer from a raw pointer or reference, so most of the time you can use those to pass/return. Refcount twiddling is only needed when changing an "owning" pointer, plus the rare "safety" case where you have a raw pointer but are calling code which might drop the last "real" owning ref.Sure you could do that, but then slices and shared_ptr's are not interchangeable. In D, you can build a data structure that contains slices, references to the heap, and references to static data willy-nilly. In C++, because each of those would be a different type, you'd have to build a wall between each. I think that's one of the key benefits of D slicing - the user of the slice need not know or care that its a slice. 
I don't have to have two versions of toupper(), one for shared_ptr and one for slices.And even with all that falderol, shared_ptr<> still can't do slicing, which are an important efficiency improvement.Agreed, but couldn't you do something very similar with a simple ptr class containing a pointer to the (refcounted) base string plus a couple of range end indices? That's worked well for me in the past, but there may be benefits of D slices that I'm missing.Oh, it's quite doable <g>.Another advantage of gc (that isn't implemented, but could be, in D's gc) is that allocated memory can be compacted.Has there ever been a language that successfully combined compacting GC with raw pointers? How would this work? And it's going to be tough to drop raw pointers without breaking ease-of-use for C libraries.
Apr 25 2006
In article <e2mfjd$1f3d$1 digitaldaemon.com>, Walter Bright says...True, but shared_ptr<> is billed as the C++ answer to gc, and shared_ptr<> is not intrusive.Shrug. I'm finding myself increasingly unconcerned these days with what's "billed as the C++ answer". I mean, iostreams?Refcount twiddling is only needed when changing an "owning" pointer, plus the rare "safety" caseBut that means just going back to the old way of being very, very careful about who "owns" the pointer and who doesn't. Refcounting is supposed to eliminate that nuisance.I'm not sure that's true. Smartpointer refcounting is supposed to make ownership rules explicit in code where they can be a) seen and b) automated, instead of hidden in documentation. In practice, I still make daft array-overrun and fencepost errors from time to time, but I've never had a problem with refcounting and ownership. As a first approximation: member variables, globals and new'ed locals are owners, and everything else isn't. Exceptions to this rule (e.g. double-linked lists) tend to be glaringly obvious.Sure you could do that, but then slices and shared_ptr's are not interchangeable.True, but again, I'm not especially interested in shared_ptr, and this isn't a fundamental problem - you can write intrusive interchangeable string and slice classes in C++.I find your ideas intriguing, and would like to subscribe to your newsletter. Can you elaborate, or is this a trade secret? cheers MikeHas there ever been a language that successfully combined compacting GC with raw pointers? How would this work?Oh, it's quite doable <g>.
Apr 26 2006
Mike Capp wrote:In article <e2mfjd$1f3d$1 digitaldaemon.com>, Walter Bright says...I know what you mean.True, but shared_ptr<> is billed as the C++ answer to gc, and shared_ptr<> is not intrusive.Shrug. I'm finding myself increasingly unconcerned these days with what's "billed as the C++ answer". I mean, iostreams?I can see that point, but I can see that once you stray into such optimizations, there's no help from the compiler, and you'll need to be very careful. A corollary to that is extra time is spent, and bugs are likely.I'm not sure that's true. Smartpointer refcounting is supposed to make ownership rules explicit in code where they can be a) seen and b) automated, instead of hidden in documentation.Refcount twiddling is only needed when changing an "owning" pointer, plus the rare "safety" caseBut that means just going back to the old way of being very, very careful about who "owns" the pointer and who doesn't. Refcounting is supposed to eliminate that nuisance.In practice, I still make daft array-overrun and fencepost errors from time to time, but I've never had a problem with refcounting and ownership. As a first approximation: member variables, globals and new'ed locals are owners, and everything else isn't. Exceptions to this rule (e.g. double-linked lists) tend to be glaringly obvious.Andrei Alexandrescu has proposed a series of modifications to C++ to enable the compiler to enforce such rules, but nobody else seemed much interested in it.Perhaps, but I've never seen anyone pull it off.Sure you could do that, but then slices and shared_ptr's are not interchangeable.True, but again, I'm not especially interested in shared_ptr, and this isn't a fundamental problem - you can write intrusive interchangeable string and slice classes in C++.I'd rather not, until I get it working.I find your ideas intriguing, and would like to subscribe to your newsletter. 
Can you elaborate, or is this a trade secret?Has there ever been a language that successfully combined compacting GC with raw pointers? How would this work?Oh, it's quite doable <g>.
Apr 26 2006
BCS wrote:A bad GC will definitely make for bad performance. And any GC will be slower than an optimum malloc/free solution. However (and I’m still undecided on this one) I think the argument is that in any non trivial project, manually freeing all of the memory without any memory leaks or axing something to soon would require more overhead than using GC. The trade-off is not in the freeing time but in the can-I-free-this check.I believe this is indeed the common argument. In C++, smart pointers have become a very popular way to manage memory, and while smart pointers are technically a garbage collection strategy, I think they aren't considered as such here. In any case, the cost comes from updating the reference count, not from the malloc/free, as the updates must typically be performed atomically. That said, an accurate performance comparison between GC and manual memory management is extremely difficult because implementation details and use strategies have a tremendous impact on results. For anyone interested, there's been an extensive argu- um, dialog on the topic recently on comp.lang.c++.moderated titled "Can GC be benifical": Sean
Apr 24 2006
BCS wrote:[snip] I think the argument is that in any non trivial project, manually freeing all of the memory without any memory leaks or axing something to soon would require more overhead than using GC. The trade-off is not in the freeing time but in the can-I-free-this check.Not true. A hierarchical memory management system is great! It sure beats reference counting, manual memory management, and conservative GC. What you do is allocate some memory, but also provide a "parent" memory pointer to relate the new memory to. When all is said and done you have a nice tree of related memory pointers (hopefully with just one root, provided you designed correctly). To free some memory, simply walk all the branches of that node, freeing their memory, then free the original memory. You can even get predictable destructors for objects this way. I'm currently developing a project in C which uses this method. I have it allocating lots of small blocks of memory to perform a job. The total comes to 32KB. When the program is finished I do one "free" call to the root structure and I get back ALL of my memory. Not a single byte is leaked. -- -----BEGIN GEEK CODE BLOCK----- Version: 3.1 GCS/MU/S d-pu s:+ a-->? C++++$ UL+++ P--- L+++ !E W-- N++ o? K? w--- O M-- V? PS PE Y+ PGP- t+ 5 X+ !R tv-->!tv b- DI++(+) D++ G e++>e h>--->++ r+++ y+++ ------END GEEK CODE BLOCK------ James Dunne
Apr 26 2006
Not true. A hierachical memory management system is great! It sure beats reference counting, manual memory management, and conservative GC. What you do is allocate some memory, but also provide a "parent" memory pointer to relate the new memory to. When all is said and done you have a nice tree of related memory pointers (hopefully with just one root, provided you designed correctly). To free some memory, simply walk all the branches of that node, freeing their memory, then free the original memory. You can even get predictable destructors for objects this way. I'm currently developing a project in C which uses this method. I have it allocating lots of small blocks of memory to perform a job. The total comes to 32KB. When the program is finished I do one "free" call to the root structure and I get back ALL of my memory. Not a single byte is leaked.But how can you tell _when_ to delete the root? AFAIK, the main benefit of GC is not that you don't have to manually free the memory, but that you don't have to know when it is safe to do so. I don't see how organizing memory blocks into a tree solves that... xs0
Apr 26 2006
xs0 wrote:That's a different problem and depends on the nature of the application. If you're in a server environment, you free memory when the client request has been completely processed. If you're in an interactive user-interface environment, you free memory when you close documents, close windows, etc. When the data is no longer needed, you'll know when to free and what to free. I think you thought I was implying you should only have one memory root which all things are related to and that is the only thing that should ever be freed. The first part is right, but not the second. You should keep all the memory related to each other, but it's not that you have to free it in an "all or nothing" circumstance. Then again, that depends on your application type. In my example application I have lots of memory free operations on the individual nodes when I know I won't need that memory anymore, such as for temporary strings. If I forget to include an explicit free call to that temporary memory, it's okay since I know that memory has been related to another parent memory block that _will_ be freed. Can you provide me a few legitimate examples of when you won't know what to free and when to free it? I'm having a tough time coming up with examples on my own... -- James DunneNot true. A hierarchical memory management system is great! It sure beats reference counting, manual memory management, and conservative GC. What you do is allocate some memory, but also provide a "parent" memory pointer to relate the new memory to. When all is said and done you have a nice tree of related memory pointers (hopefully with just one root, provided you designed correctly). 
To free some memory, simply walk all the branches of that node, freeing their memory, then free the original memory. You can even get predictable destructors for objects this way. I'm currently developing a project in C which uses this method. I have it allocating lots of small blocks of memory to perform a job. The total comes to 32KB. When the program is finished I do one "free" call to the root structure and I get back ALL of my memory. Not a single byte is leaked.But how can you tell _when_ to delete the root? AFAIK, the main benefit of GC is not that you don't have to manually free the memory, but that you don't have to know when it is safe to do so. I don't see how organizing memory blocks into a tree solves that... xs0
Apr 26 2006
James Dunne wrote:Can you provide me a few legitimate examples of when you won't know what to free and when to free it? I'm having a tough time coming up with examples on my own...One example that has plagued me in the past is having two distinct tree data structures, and it becomes necessary for a node in one tree to refer to data in the other. Or when you have a third data structure, and you need to point to data in one or the other or both of the first two trees. A typical example of that is symbol tables inside the compiler. In the DMD front end (which is garbage collected) the data has all kinds of cross-connections in it. I.e. the problem crops up when the data in memory just doesn't fit into a strictly hierarchical format.
Apr 26 2006
Walter Bright wrote:James Dunne wrote:Your example is having two trees _point_ to data within each other. The original data is still _owned_ by the original tree which the data resides in. You wouldn't be swapping ownership of the data like you're talking about. In this system, when memory is allocated it is assigned an owner and that ownership does not ever change. So the system resembles a tree (one parent only), not a directed graph. Funny you mention compiler, because that is the example application I was referring to. :) -- James DunneCan you provide me a few legitimate examples of when you won't know what to free and when to free it? I'm having a tough time coming up with examples on my own...One example that has plagued me in the past is having two distinct tree data structures, and it becomes necessary for a node in one tree to refer to data in the other. Or when you have a third data structure, and you need to point to data in one or the other or both of the first two trees. A typical example of that is symbol tables inside the compiler. In the DMD front end (which is garbage collected) the data has all kinds of cross-connections in it. I.e. the problem crops up when the data in memory just doesn't fit into a strictly hierarchical format.
Apr 26 2006
James Dunne wrote:Walter Bright wrote:I should also mention that my compiler implementation makes use of context structures, so every memory block that the compiler will allocate is somehow parented to that context structure. So when the compile job is complete, free the context structure and all the memory goes away. -- James DunneJames Dunne wrote:Your example is having two trees _point_ to data within each other. The original data is still _owned_ by the original tree which the data resides in. You wouldn't be swapping ownership of the data like you're talking about. In this system, when memory is allocated it is assigned an owner and that ownership does not ever change. So the system resembles a tree (one parent only), not a directed graph. Funny you mention compiler, because that is the example application I was referring to. :)Can you provide me a few legitimate examples of when you won't know what to free and when to free it? I'm having a tough time coming up with examples on my own...One example that has plagued me in the past is having two distinct tree data structures, and it becomes necessary for a node in one tree to refer to data in the other. Or when you have a third data structure, and you need to point to data in one or the other or both of the first two trees. A typical example of that is symbol tables inside the compiler. In the DMD front end (which is garbage collected) the data has all kinds of cross-connections in it. I.e. the problem crops up when the data in memory just doesn't fit into a strictly hierarchical format.
Apr 26 2006
James Dunne wrote:I should also mention that my compiler implementation makes use of context structures, so every memory block that the compiler will allocate is somehow parented to that context structure. So when the compile job is complete, free the context structure and all the memory goes away.What kinda compiler?
Apr 26 2006
Dave wrote:James Dunne wrote:My own research language named Iris. I'd like to keep it under the radar for now. You can read about it in my monthly rants on http://iris-design-log.blogspot.com/. -- Regards, James DunneI should also mention that my compiler implementation makes use of context structures, so every memory block that the compiler will allocate is somehow parented to that context structure. So when the compile job is complete, free the context structure and all the memory goes away.What kinda compiler?
Apr 26 2006
James Dunne wrote:Your example is having two trees _point_ to data within each other. The original data is still _owned_ by the original tree which the data resides in. You wouldn't be swapping ownership of the data like you're talking about. In this system, when memory is allocated it is assigned an owner and that ownership does not ever change. So the system resembles a tree (one parent only), not a directed graph.But when this happens in the general case, I can't delete *any* of the trees.
Apr 26 2006
Walter Bright wrote:James Dunne wrote:When are you trying to delete one of these trees? To which 'tree' are you referring, the data structure tree or the hierarchical memory manager's tree? -- Regards, James DunneYour example is having two trees _point_ to data within each other. The original data is still _owned_ by the original tree which the data resides in. You wouldn't be swapping ownership of the data like you're talking about. In this system, when memory is allocated it is assigned an owner and that ownership does not ever change. So the system resembles a tree (one parent only), not a directed graph.But when this happens in the general case, I can't delete *any* of the trees.
Apr 26 2006
James Dunne wrote:Walter Bright wrote:I could wait until the program finishes before deleting anything, but that isn't memory management.James Dunne wrote:When are you trying to delete one of these trees?Your example is having two trees _point_ to data within each other. The original data is still _owned_ by the original tree which the data resides in. You wouldn't be swapping ownership of the data like you're talking about. In this system, when memory is allocated it is assigned an owner and that ownership does not ever change. So the system resembles a tree (one parent only), not a directed graph.But when this happens in the general case, I can't delete *any* of the trees.To which 'tree' are you referring, the data structure tree or the hierarchical memory manager's tree?The nested symbol table of the local scope of a function, for example.
Apr 26 2006
Walter Bright wrote:James Dunne wrote:If you parent the memory you're creating in the tree data structures to a structure that is higher up the chain than both tree data structures, then the data doesn't get lost simply because you freed the data structure that held references to that memory. -- Regards, James DunneCan you provide me a few legitimate examples of when you won't know what to free and when to free it? I'm having a tough time coming up with examples on my own...One example that has plagued me in the past is having two distinct tree data structures, and it becomes necessary for a node in one tree to refer to data in the other. Or when you have a third data structure, and you need to point to data in one or the other or both of the first two trees. A typical example of that is symbol tables inside the compiler. In the DMD front end (which is garbage collected) the data has all kinds of cross-connections in it. I.e. the problem crops up when the data in memory just doesn't fit into a strictly hierarchical format.
Apr 26 2006
James Dunne wrote:Walter Bright wrote:I've designed data structures around the memory management requirement for keeping a sole 'parent' around for it that can free it. But I'd rather design data structures around how they are used instead. For example, if I have an associative array of strings, I can stuff any old string into it - even static strings - without having to worry about who 'owns' those strings. It's quite liberating <g>.I.e. the problem crops up when the data in memory just doesn't fit into a strictly hierarchical format.If you parent the memory you're creating in the tree data structures to a structure that is higher up the chain than both tree data structures, then the data doesn't get lost simply because you freed the data structure that held references to that memory.
Apr 26 2006
Good idea! I suppose that every time you allocate memory you must always specify a parent node in the hierarchy. This requires a little more thought than GC, because you have to consider how to organize the hierarchy so that it will be easy to free memory relating to a given task once the task is done. Such a system would be very fast, and with just a small amount of preparation, memory leaks are completely eliminated. It could potentially be faster than traditional memory management, because you don't necessarily have to call free on every block of memory. Rather, one free could free the entire hierarchy, or a large sub-hierarchy. Have you experienced any drawbacks? Does this approach work well with exceptions? threads? Also, what kind of data structure do you use for this hierarchy? -Craig
Apr 26 2006
Craig Black wrote:Good idea! I suppose that every time you allocate memory you must always specify a parent node in the hierarchy. This requires a little more thought than GC, because you have to consider how to organize the hierarchy so that it will be easy to free memory relating to a given task once the task is done. Such a system is would be very fast, and with just a small amount of preparation, memory leaks are completely eliminated. It could potentially be faster than traditional memory management, because you don't necessarily have to call free on every block of memory. Rather, one free could free the entire hierarchy, or a large sub-hierarchy. Have you experienced any drawbacks? Does this approach work well with exceptions? threads? Also, what kind of data structure do you use for this hierarchy? -CraigHoly geez... Yes it is a good idea, but alas I cannot take credit for it. You point out all the benefits that I've gotten from it so far quite accurately. :) I've only used it with my compiler implementation written in C, so I'll discuss my experience relating to that, if that's okay with you... Drawbacks? Only that you have to give it that much more thought about what to parent to what... There's no serious runtime penalty - it just adds a pointer to the parent's list of children, but that cost is tacked on to allocating memory in general. Exceptions? I'm coding in C, and I wrote my own pseudo-exception model, which makes use of that memory model so I don't even leak memory from exceptions... a good thing! Threads? Haven't gotten there yet, but I plan to design my threads to be completely independent of each other. It's easy to do when you're lexing/parsing/analyzing different source files at the same time. Only need to synchronize when each pass is completed. I'll look into concurrency issues in the allocator code when I get to it. The data structure is a simple list of child pointers and a single parent pointer for each memory block. Linked lists. 
The code still relies, of course, on malloc and free to actually get the memory in the first place. -- Regards, James Dunne
Apr 26 2006
"James Dunne" <james.jdunne gmail.com> wrote in message news:e2pjk0$96l$1 digitaldaemon.com...Craig Black wrote:Of course it does. Along those lines ... I've been thinking of an optimization for this. Couldn't you make come of your parent nodes an allocators, so that all of their children get their memory from it? This way, you could allocate large blocks of memory, and could cut down on your calls to malloc and free by an order of magnitude. Obviously you would have to choose what nodes in the hierarchy become allocators such that the least number of CRT calls would be made. -CraigGood idea! I suppose that every time you allocate memory you must always specify a parent node in the hierarchy. This requires a little more thought than GC, because you have to consider how to organize the hierarchy so that it will be easy to free memory relating to a given task once the task is done. Such a system would be very fast, and with just a small amount of preparation, memory leaks are completely eliminated. It could potentially be faster than traditional memory management, because you don't necessarily have to call free on every block of memory. Rather, one free could free the entire hierarchy, or a large sub-hierarchy. Have you experienced any drawbacks? Does this approach work well with exceptions? threads? Also, what kind of data structure do you use for this hierarchy? -CraigHoly geez... Yes it is a good idea, but alas I cannot take credit for it. You point out all the benefits that I've gotten from it so far quite accurately. :) I've only used it with my compiler implementation written in C, so I'll discuss my experience relating to that, if that's okay with you... Drawbacks? Only that you have to give it that much more thought about what to parent to what... There's no serious runtime penalty - it just adds a pointer to the parent's list of children, but that cost is tacked on to allocating memory in general. Exceptions?
I'm coding in C, and I wrote my own pseudo-exception model, which makes use of that memory model so I don't even leak memory from exceptions... a good thing! Threads? Haven't gotten there yet, but I plan to design my threads to be completely independent of each other. It's easy to do when you're lexing/parsing/analyzing different source files at the same time. Only need to synchronize when each pass is completed. I'll look into concurrency issues in the allocator code when I get to it. The data structure is a simple list of child pointers and a single parent pointer for each memory block. Linked lists. The code still relies, of course, on malloc and free to actually get the memory in the first place.
Apr 27 2006
Couldn't you make come of your parent nodes an allocators, so that all of their children get their memory from it?I was in a hurry. Let me revise. Couldn't you make some of your parent nodes allocators, so that all of their children get their memory from their parent allocator nodes?
Apr 27 2006
Craig Black wrote:Hehe. That's okay. I understand text through typos. I'm not entirely familiar with the concept of allocators as you describe. Is this similar to memory pooling? I always skipped over that portion of the C++ STL; way too much complexity for my small brain =P. -- James DunneCouldn't you make come of your parent nodes an allocators, so that all of their children get their memory from it?I was in a hurry. Let me revise. Couldn't you make some of your parent nodes allocators, so that all of their children get their memory from their parent allocator nodes?
Apr 27 2006
I'm not entirely familiar with the concept of allocators as you describe. Is this similar to memory pooling? I always skipped over that portion of the C++ STL; way too much complexity for my small brain =P.Instead of calling malloc for each allocation unit, you could write an allocator class that would allocate memory in large chunks. Then when you want an allocation unit, you request it from the allocator. The allocator simply hands out the memory, and marks it as used. The hierarchical memory system you describe lends itself to using such custom allocators, because you would never have to call free for each unit. You could call free at the allocator level instead. In this way, you would only call malloc and free for large chunks of memory. You could also inform each allocator how much memory to reserve when they are created, since you may have a good idea of how much memory a given task will require. Did that help? -Craig
Apr 27 2006
Craig Black wrote:Yes I understand. Thank you. Very interesting concept. I don't think in my current situation that calling malloc is a definite bottleneck. When I see a performance concern later on, I'll definitely look into it. -- Regards, James DunneI'm not entirely familiar with the concept of allocators as you describe. Is this similar to memory pooling? I always skipped over that portion of the C++ STL; way too much complexity for my small brain =P.Instead of calling malloc for each allocation unit, you could write an allocator class that would allocate memory in large chunks. Then when you want an allocation unit, you request it from the allocator. The allocator simply hands out the memory, an marks it as used. The hierarchical memory system you describe lends itself to using such custom allocators, because you would never have to call free for each unit. You could call free at the allocator level instead. In this way, you would only call malloc and free for large chunks of memory. You could also inform each allocator how much memory to reserve when they are created, since you may have a good idea of how much memory a given task will require. Did that help? -Craig
Apr 27 2006