www.digitalmars.com         C & C++   DMDScript  

digitalmars.D.learn - Memory Allocation and Locking

reply dsimcha <dsimcha yahoo.com> writes:
I guess this really is more of a C++ question then a D question, but it's kind
of both, so what the heck?

Anyhow, I had some old C++ code laying around that I've been using, and I had
added some quick and dirty multithreading of embarrassingly parallel parts of
it to it to speed it up.  Lo and behold, it actually worked.  A while later, I
decided I needed to extend it in ways that couldn't be done nearly as easily
in C++ as in D, and it wasn't much code, so I just translated it to D.

First iteration was slow as dirt because of resource contention for memory
allocation.  Put in a few ugly hacks to avoid all memory allocation in inner
loops (i.e. ~= ), and now the D version is faster than the C++ version.

Anyhow, the real quesiton is, why is it that I can get away w/ heap
allocations in inner loops (i.e. vector.push_back() ) in multithreaded C++
code but not in multithreaded D code?  I'm assuming, since there didn't seem
to be any resource contention in the C++ version, that there was no locking
for the memory allocation, but all of the allocation was done through STL
rather than explicitly, so I don't really know.  However, the code actually
worked like this, so I never really gave it much thought until today.  Is it
just a minor miracle that this worked at all w/o me explicitly using some kind
of lock?

By the way, if it's implementation defined, my C++ compiler is MinGW GCC 3.45.
Aug 21 2008
next sibling parent BCS <ao pathlink.com> writes:
Reply to dsimcha,

 I guess this really is more of a C++ question then a D question, but
 it's kind of both, so what the heck?
 
 Anyhow, I had some old C++ code laying around that I've been using,
 and I had added some quick and dirty multithreading of embarrassingly
 parallel parts of it to it to speed it up.  Lo and behold, it actually
 worked.  A while later, I decided I needed to extend it in ways that
 couldn't be done nearly as easily in C++ as in D, and it wasn't much
 code, so I just translated it to D.
 
 First iteration was slow as dirt because of resource contention for
 memory allocation.  Put in a few ugly hacks to avoid all memory
 allocation in inner loops (i.e. ~= ), and now the D version is faster
 than the C++ version.
 
 Anyhow, the real quesiton is, why is it that I can get away w/ heap
 allocations in inner loops (i.e. vector.push_back() ) in multithreaded
 C++ code but not in multithreaded D code?  I'm assuming, since there
 didn't seem to be any resource contention in the C++ version, that
 there was no locking for the memory allocation, but all of the
 allocation was done through STL rather than explicitly, so I don't
 really know.  However, the code actually worked like this, so I never
 really gave it much thought until today.  Is it just a minor miracle
 that this worked at all w/o me explicitly using some kind of lock?
 
 By the way, if it's implementation defined, my C++ compiler is MinGW
 GCC 3.45.
 
under C++ I'll guess that new is a thin skin on malloc, under D, new uses the GC. If one of these is thread safe and the other isn't, there's your issue. But I'm just guessing.
Aug 21 2008
prev sibling next sibling parent "Steven Schveighoffer" <schveiguy yahoo.com> writes:
"dsimcha" wrote
I guess this really is more of a C++ question then a D question, but it's 
kind
 of both, so what the heck?

 Anyhow, I had some old C++ code laying around that I've been using, and I 
 had
 added some quick and dirty multithreading of embarrassingly parallel parts 
 of
 it to it to speed it up.  Lo and behold, it actually worked.  A while 
 later, I
 decided I needed to extend it in ways that couldn't be done nearly as 
 easily
 in C++ as in D, and it wasn't much code, so I just translated it to D.

 First iteration was slow as dirt because of resource contention for memory
 allocation.  Put in a few ugly hacks to avoid all memory allocation in 
 inner
 loops (i.e. ~= ), and now the D version is faster than the C++ version.

 Anyhow, the real quesiton is, why is it that I can get away w/ heap
 allocations in inner loops (i.e. vector.push_back() ) in multithreaded C++
 code but not in multithreaded D code?  I'm assuming, since there didn't 
 seem
 to be any resource contention in the C++ version, that there was no 
 locking
 for the memory allocation, but all of the allocation was done through STL
 rather than explicitly, so I don't really know.  However, the code 
 actually
 worked like this, so I never really gave it much thought until today.  Is 
 it
 just a minor miracle that this worked at all w/o me explicitly using some 
 kind
 of lock?

 By the way, if it's implementation defined, my C++ compiler is MinGW GCC 
 3.45.
I think it may be the way the GC is implemented. If it needs memory, it first tries running a collect cycle (not sure if there is some algorithm to limit running the GC) before trying to get more memory from the OS. If you put the allocations in an inner loop, you are running collect cycles more frequently (you should be able to see this by profiling the code). I'm not sure of the algorithm to decide when to run, but worst case, it is every time you allocate and there is no free space left. Most memory allocation schemes that I've seen need to lock the heap, so it is always a point of contention. However, with the new shared/unshared thing, it looks like D might be advancing past that obstacle. But the GC really needs some work to be smarter about how many times it runs. When implementing dcollections, I created a custom allocator that allocates pages of elements at once instead of individual elements. This saw a huge speedup, because I'm not running the GC as frequently. -Steve
Aug 22 2008
prev sibling parent reply Sean Kelly <sean invisibleduck.org> writes:
dsimcha wrote:
 I guess this really is more of a C++ question then a D question, but it's kind
 of both, so what the heck?
 
 Anyhow, I had some old C++ code laying around that I've been using, and I had
 added some quick and dirty multithreading of embarrassingly parallel parts of
 it to it to speed it up.  Lo and behold, it actually worked.  A while later, I
 decided I needed to extend it in ways that couldn't be done nearly as easily
 in C++ as in D, and it wasn't much code, so I just translated it to D.
 
 First iteration was slow as dirt because of resource contention for memory
 allocation.  Put in a few ugly hacks to avoid all memory allocation in inner
 loops (i.e. ~= ), and now the D version is faster than the C++ version.
 
 Anyhow, the real quesiton is, why is it that I can get away w/ heap
 allocations in inner loops (i.e. vector.push_back() ) in multithreaded C++
 code but not in multithreaded D code?  I'm assuming, since there didn't seem
 to be any resource contention in the C++ version, that there was no locking
 for the memory allocation, but all of the allocation was done through STL
 rather than explicitly, so I don't really know.  However, the code actually
 worked like this, so I never really gave it much thought until today.  Is it
 just a minor miracle that this worked at all w/o me explicitly using some kind
 of lock?
You can get away with it in C++ because there are no garbage collection cycles occurring when the allocator runs low on memory. To easiest way to approach an apples-apples comparison would be to disable the GC at the start of your program. Another performance issue for some programs is a result of how the GC allocator obtains memory from the OS. It does so in bite-sized blocks, so if your app does a ton of allocation and then runs the GC is not only collecting periodically but then it's also obtaining a chunk of memory from the OS to store the latest allocation (with some extra room for additional blocks), then it's doing the same thing again, etc. The Tango GC provides a reserve() method to optimize for this situation by allowing the user to have the GC pre-build a pool of N bytes for use by future allocations. In my testing this increased app performance by 50% or more given certain program designs. Sean
Aug 22 2008
parent reply dsimcha <dsimcha yahoo.com> writes:
But D's malloc appears to require locking even when memory is available, just to
allocate it.  Heck, it even synchronizes to check the capacity of an allocated
block, which (I'm guessing) means every ~= requires a synch if the code is
multithreaded, even if no realloc ends up being necessary.  I base this on
reading
the GC implementation.  I assume that allocating via builtin arrays is
basically a
thin layer on top of D's malloc.  Here's the Phobos implementation.  For those
who
haven't looked at this, the function thread_needLock just returns true if more
than 1 thread is currently running.  Haven't looked at Tango b/c, unfortunately,
it's not on D2 yet.

    void *malloc(size_t size)
    {
        if (!size)
        {
            return null;
        }

	if (!thread_needLock())
	{
	    return mallocNoSync(size);
	}
	else synchronized (gcLock)
	{
	    return mallocNoSync(size);
	}
    }

Bottom line is, it's mostly just a curiosity/future reference question, since
with
a few tweaks, the D version is now *faster* than the C++ version, but how can
C++
get away w/ concurrent memory allocations while D can't, or should my C++
version
never have worked in the first place?
Aug 22 2008
parent Sean Kelly <sean invisibleduck.org> writes:
dsimcha wrote:
 But D's malloc appears to require locking even when memory is available, just
to
 allocate it.  Heck, it even synchronizes to check the capacity of an allocated
 block, which (I'm guessing) means every ~= requires a synch if the code is
 multithreaded, even if no realloc ends up being necessary.  I base this on
reading
 the GC implementation.
You're right. But it's likely that an allocator for C++ would have to do the same thing, unless you're talking about a special-purpose one like HOARD. Also, the Tango GC will only lock on this mutex if the app is actually multithreaded. Single-threaded apps skip the locking process. The Phobos GC does this as well, but I recall finding some gaps in its coverage for this feature.
 I assume that allocating via builtin arrays is basically a
 thin layer on top of D's malloc.  Here's the Phobos implementation.  For those
who
 haven't looked at this, the function thread_needLock just returns true if more
 than 1 thread is currently running.  Haven't looked at Tango b/c,
unfortunately,
 it's not on D2 yet.
 
     void *malloc(size_t size)
     {
         if (!size)
         {
             return null;
         }
 
 	if (!thread_needLock())
 	{
 	    return mallocNoSync(size);
 	}
 	else synchronized (gcLock)
 	{
 	    return mallocNoSync(size);
 	}
     }
thread_needLock() was actually cribbed from Tango a while back. Phobos used to do something like "Thread.getAll.length > 1." The effect is essentially the same though.
 Bottom line is, it's mostly just a curiosity/future reference question, since
with
 a few tweaks, the D version is now *faster* than the C++ version, but how can
C++
 get away w/ concurrent memory allocations while D can't, or should my C++
version
 never have worked in the first place?
I don't think it's the locking that's an issue but rather the collection cycles plus possibly the need to obtain additional pools from the OS. Sean
Aug 22 2008