digitalmars.D.learn - Memory Allocation and Locking

dsimcha (20/20) Aug 21 2008 I guess this really is more of a C++ question then a D question, but it'...

BCS (4/32) Aug 21 2008 under C++ I'll guess that new is a thin skin on malloc, under D, new use...
Steven Schveighoffer (16/48) Aug 22 2008 I think it may be the way the GC is implemented. If it needs memory, it...
Sean Kelly (16/38) Aug 22 2008 You can get away with it in C++ because there are no garbage collection

dsimcha (28/28) Aug 22 2008 But D's malloc appears to require locking even when memory is available,...

Sean Kelly (13/44) Aug 22 2008 You're right. But it's likely that an allocator for C++ would have to

dsimcha <dsimcha yahoo.com> writes:

I guess this really is more of a C++ question then a D question, but it's kind
of both, so what the heck?

Anyhow, I had some old C++ code laying around that I've been using, and I had
added some quick and dirty multithreading of embarrassingly parallel parts of
it to it to speed it up.  Lo and behold, it actually worked.  A while later, I
decided I needed to extend it in ways that couldn't be done nearly as easily
in C++ as in D, and it wasn't much code, so I just translated it to D.

First iteration was slow as dirt because of resource contention for memory
allocation.  Put in a few ugly hacks to avoid all memory allocation in inner
loops (i.e. ~= ), and now the D version is faster than the C++ version.

Anyhow, the real quesiton is, why is it that I can get away w/ heap
allocations in inner loops (i.e. vector.push_back() ) in multithreaded C++
code but not in multithreaded D code?  I'm assuming, since there didn't seem
to be any resource contention in the C++ version, that there was no locking
for the memory allocation, but all of the allocation was done through STL
rather than explicitly, so I don't really know.  However, the code actually
worked like this, so I never really gave it much thought until today.  Is it
just a minor miracle that this worked at all w/o me explicitly using some kind
of lock?

By the way, if it's implementation defined, my C++ compiler is MinGW GCC 3.45.

Aug 21 2008

BCS <ao pathlink.com> writes:

Reply to dsimcha,

 I guess this really is more of a C++ question then a D question, but
 it's kind of both, so what the heck?
 
 Anyhow, I had some old C++ code laying around that I've been using,
 and I had added some quick and dirty multithreading of embarrassingly
 parallel parts of it to it to speed it up.  Lo and behold, it actually
 worked.  A while later, I decided I needed to extend it in ways that
 couldn't be done nearly as easily in C++ as in D, and it wasn't much
 code, so I just translated it to D.
 
 First iteration was slow as dirt because of resource contention for
 memory allocation.  Put in a few ugly hacks to avoid all memory
 allocation in inner loops (i.e. ~= ), and now the D version is faster
 than the C++ version.
 
 Anyhow, the real quesiton is, why is it that I can get away w/ heap
 allocations in inner loops (i.e. vector.push_back() ) in multithreaded
 C++ code but not in multithreaded D code?  I'm assuming, since there
 didn't seem to be any resource contention in the C++ version, that
 there was no locking for the memory allocation, but all of the
 allocation was done through STL rather than explicitly, so I don't
 really know.  However, the code actually worked like this, so I never
 really gave it much thought until today.  Is it just a minor miracle
 that this worked at all w/o me explicitly using some kind of lock?
 
 By the way, if it's implementation defined, my C++ compiler is MinGW
 GCC 3.45.
 

under C++ I'll guess that new is a thin skin on malloc, under D, new uses 
the GC. If one of these is thread safe and the other isn't, there's your 
issue. But I'm just guessing.

Aug 21 2008

"Steven Schveighoffer" <schveiguy yahoo.com> writes:

"dsimcha" wrote
I guess this really is more of a C++ question then a D question, but it's 
kind
 of both, so what the heck?

 Anyhow, I had some old C++ code laying around that I've been using, and I 
 had
 added some quick and dirty multithreading of embarrassingly parallel parts 
 of
 it to it to speed it up.  Lo and behold, it actually worked.  A while 
 later, I
 decided I needed to extend it in ways that couldn't be done nearly as 
 easily
 in C++ as in D, and it wasn't much code, so I just translated it to D.

 First iteration was slow as dirt because of resource contention for memory
 allocation.  Put in a few ugly hacks to avoid all memory allocation in 
 inner
 loops (i.e. ~= ), and now the D version is faster than the C++ version.

 Anyhow, the real quesiton is, why is it that I can get away w/ heap
 allocations in inner loops (i.e. vector.push_back() ) in multithreaded C++
 code but not in multithreaded D code?  I'm assuming, since there didn't 
 seem
 to be any resource contention in the C++ version, that there was no 
 locking
 for the memory allocation, but all of the allocation was done through STL
 rather than explicitly, so I don't really know.  However, the code 
 actually
 worked like this, so I never really gave it much thought until today.  Is 
 it
 just a minor miracle that this worked at all w/o me explicitly using some 
 kind
 of lock?

 By the way, if it's implementation defined, my C++ compiler is MinGW GCC 
 3.45.

I think it may be the way the GC is implemented.  If it needs memory, it 
first tries running a collect cycle (not sure if there is some algorithm to 
limit running the GC) before trying to get more memory from the OS.  If you 
put the allocations in an inner loop, you are running collect cycles more 
frequently (you should be able to see this by profiling the code).  I'm not 
sure of the algorithm to decide when to run, but worst case, it is every 
time you allocate and there is no free space left.

Most memory allocation schemes that I've seen need to lock the heap, so it 
is always a point of contention.  However, with the new shared/unshared 
thing, it looks like D might be advancing past that obstacle.  But the GC 
really needs some work to be smarter about how many times it runs.  When 
implementing dcollections, I created a custom allocator that allocates pages 
of elements at once instead of individual elements.  This saw a huge 
speedup, because I'm not running the GC as frequently.

-Steve

Aug 22 2008

Sean Kelly <sean invisibleduck.org> writes:

dsimcha wrote:
 I guess this really is more of a C++ question then a D question, but it's kind
 of both, so what the heck?
 
 Anyhow, I had some old C++ code laying around that I've been using, and I had
 added some quick and dirty multithreading of embarrassingly parallel parts of
 it to it to speed it up.  Lo and behold, it actually worked.  A while later, I
 decided I needed to extend it in ways that couldn't be done nearly as easily
 in C++ as in D, and it wasn't much code, so I just translated it to D.
 
 First iteration was slow as dirt because of resource contention for memory
 allocation.  Put in a few ugly hacks to avoid all memory allocation in inner
 loops (i.e. ~= ), and now the D version is faster than the C++ version.
 
 Anyhow, the real quesiton is, why is it that I can get away w/ heap
 allocations in inner loops (i.e. vector.push_back() ) in multithreaded C++
 code but not in multithreaded D code?  I'm assuming, since there didn't seem
 to be any resource contention in the C++ version, that there was no locking
 for the memory allocation, but all of the allocation was done through STL
 rather than explicitly, so I don't really know.  However, the code actually
 worked like this, so I never really gave it much thought until today.  Is it
 just a minor miracle that this worked at all w/o me explicitly using some kind
 of lock?

You can get away with it in C++ because there are no garbage collection 
cycles occurring when the allocator runs low on memory.  To easiest way 
to approach an apples-apples comparison would be to disable the GC at 
the start of your program.

Another performance issue for some programs is a result of how the GC 
allocator obtains memory from the OS.  It does so in bite-sized blocks, 
so if your app does a ton of allocation and then runs the GC is not only 
collecting periodically but then it's also obtaining a chunk of memory 
from the OS to store the latest allocation (with some extra room for 
additional blocks), then it's doing the same thing again, etc.  The 
Tango GC provides a reserve() method to optimize for this situation by 
allowing the user to have the GC pre-build a pool of N bytes for use by 
future allocations.  In my testing this increased app performance by 50% 
or more given certain program designs.


Sean

Aug 22 2008

dsimcha <dsimcha yahoo.com> writes:

But D's malloc appears to require locking even when memory is available, just to
allocate it.  Heck, it even synchronizes to check the capacity of an allocated
block, which (I'm guessing) means every ~= requires a synch if the code is
multithreaded, even if no realloc ends up being necessary.  I base this on
reading
the GC implementation.  I assume that allocating via builtin arrays is
basically a
thin layer on top of D's malloc.  Here's the Phobos implementation.  For those
who
haven't looked at this, the function thread_needLock just returns true if more
than 1 thread is currently running.  Haven't looked at Tango b/c, unfortunately,
it's not on D2 yet.

    void *malloc(size_t size)
    {
        if (!size)
        {
            return null;
        }

	if (!thread_needLock())
	{
	    return mallocNoSync(size);
	}
	else synchronized (gcLock)
	{
	    return mallocNoSync(size);
	}
    }

Bottom line is, it's mostly just a curiosity/future reference question, since
with
a few tweaks, the D version is now *faster* than the C++ version, but how can
C++
get away w/ concurrent memory allocations while D can't, or should my C++
version
never have worked in the first place?

Aug 22 2008

Sean Kelly <sean invisibleduck.org> writes:

dsimcha wrote:
 But D's malloc appears to require locking even when memory is available, just
to
 allocate it.  Heck, it even synchronizes to check the capacity of an allocated
 block, which (I'm guessing) means every ~= requires a synch if the code is
 multithreaded, even if no realloc ends up being necessary.  I base this on
reading
 the GC implementation.

You're right.  But it's likely that an allocator for C++ would have to 
do the same thing, unless you're talking about a special-purpose one 
like HOARD.  Also, the Tango GC will only lock on this mutex if the app 
is actually multithreaded.  Single-threaded apps skip the locking 
process.  The Phobos GC does this as well, but I recall finding some 
gaps in its coverage for this feature.

 I assume that allocating via builtin arrays is basically a
 thin layer on top of D's malloc.  Here's the Phobos implementation.  For those
who
 haven't looked at this, the function thread_needLock just returns true if more
 than 1 thread is currently running.  Haven't looked at Tango b/c,
unfortunately,
 it's not on D2 yet.
 
     void *malloc(size_t size)
     {
         if (!size)
         {
             return null;
         }
 
 	if (!thread_needLock())
 	{
 	    return mallocNoSync(size);
 	}
 	else synchronized (gcLock)
 	{
 	    return mallocNoSync(size);
 	}
     }

thread_needLock() was actually cribbed from Tango a while back.  Phobos 
used to do something like "Thread.getAll.length > 1."  The effect is 
essentially the same though.

 Bottom line is, it's mostly just a curiosity/future reference question, since
with
 a few tweaks, the D version is now *faster* than the C++ version, but how can
C++
 get away w/ concurrent memory allocations while D can't, or should my C++
version
 never have worked in the first place?

I don't think it's the locking that's an issue but rather the collection 
cycles plus possibly the need to obtain additional pools from the OS.


Sean

Aug 22 2008

D Programming

C/C++ Programming

Other

digitalmars.D.learn - Memory Allocation and Locking