digitalmars.D - Locality and Custom Allocations

John Demme (113/113) May 14 2006 Walter (and other compiler geniuses)- I've got two questions for you at ...

John Demme (6/147) May 14 2006 Sorry to reply to myself, but I though I kinda sounded like an ass in th...
Sean Kelly (6/22) May 14 2006 This was the conclusion in some research papers I've read as well, the

John Demme <me teqdruid.com> writes:

Walter (and other compiler geniuses)- I've got two questions for you at the
bottom of the post, if you want to skip the bulk of it.

-------

In order to decrease page faults (and thus increase performance) of my
LinkedList, I implemented a block allocate scheme using the below snippet
in a node struct:
struct Node
{
        static void[] block;
        /*
          If we allocate large blocks in advance for the nodes, then
          we increase our locality and thus cut down on page-faults
          and increase our cache-hit percentage
         */
        new (size_t sz, int length)
        {
                if (block.length < sz)
                        block = new void[sz*(length+1)];   
                        
                void* p = block.ptr;
                block = block[sz..$];
                return p;
        }
...
}

Then, I used some variants of the following benchmark code to test
ArrayList, LinkedList with block allocation and LinkedList without block
allocations.  You can find ArrayList and LinkedList (w/o block allocations)
in the mango.containers SVN.  ArrayList is backed by a D dynamic array, and
LinkedList uses struct nodes.

void main()
{
        MutableList!(int) list = new LinkedList!(int)();

        for (int i=0; i<1_000_000; i++)
        {
                list ~= i;
        }

        for (int j=0; j<1; j++)
        {
                foreach (int index, int i; list)
                {
                        if (i != index)
                                Stdout("Error: "c)(i)(" != "c)(index)(CR);
                }
        }
}

I found the results quite puzzling.  I used the "time" command in unix to
get the results. Here they are:

LinkedList, block allocations, compiler optimizations
(-O -release -inline) , 1_000_000 elements, 1_000 read loops
real    0m35.673s
user    0m34.274s
sys     0m0.240s

LinkedList, no block allocations, compiler optimizations
(-O -release -inline) , 1_000_000 elements, 1_000 read loops
real    1m2.546s
user    0m57.824s
sys     0m0.412s

ArrayList, compiler optimizations (-O -release -inline) , 1_000_000
elements, 1_000 read loops
real    0m48.569s
user    0m46.559s
sys     0m0.288s

LinkedList, block allocations, compiler optimizations
(-O -release -inline) , 1_000_000 elements, 1 read loop
real    0m2.597s
user    0m2.500s
sys     0m0.016s

LinkedList, no block allocations, compiler optimizations
(-O -release -inline) , 1_000_000 elements, 1 read loop
real    0m0.988s
user    0m0.944s
sys     0m0.028s

ArrayList, compiler optimizations (-O -release -inline) , 1_000_000
elements, 1 read loop
real    0m0.203s
user    0m0.148s
sys     0m0.028s


Interpretation:
Clearly, the block allocations are very helpful in the reads.  And surprise:
the LinkedList performs better than the ArrayList!!!  How?  I suspect that
it has to do with the method of iteration.  mango.container Containers are
iterated through using Iterator objects (the opApplys use them as well) and
the generic ListIterator calls .get(index) on the array in addition to
the .next() call to the iterator.  The LinkedListIterator, however, is
specific to LinkedList and uses a reference to the current node to avoid
the obvious inefficiency of .get(index) on a LinkedList.  Assuming that the
extra call to .get(index) on the ArrayList was causing some strain, I built
an ArrayListIterator to avoid it, and came up with the following results:

ArrayList, compiler optimizations (-O -release -inline) , 1_000_000
elements, 1_000 read loops, using ArrayListIterator
real    0m30.966s
user    0m29.774s
sys     0m0.180s

OK, looks like I was right.  What does this tell us?  That virtual function
calls suck, but we knew this.  The LinkedList with block allocations only
runs about 15% longer than the ArrayList- that's not too shabby.

The bulk of my surpise came when only one read loop was done, so the
allocation part of the test program has far more weight.  The LinkedList
without block allocations does better by an order of magnitude!!!  This
blew me away.  I figured that in terms of allocations, the block
allocations would do no harm, and probably do better.  In fact, I expected
them to beat out the ArrayList-- the whole point of a LinkedList!  Here is
what I surmise from this result:

Custom Allocators are Slow.

Is my reasoning faulty, or am I right?  If I'm right, here's my question to
Walter:  why?  Are calls to these allocators not inlined?  If not, can they
be?  In addition, to what extent can virtual function calls be sped up via
optimization?  Under what circumstances can calls to virtual methods be
inlined?

Sorry for the long post.... I may just be compensating for not posting much
lately.

~John Demme

May 14 2006

John Demme <me teqdruid.com> writes:

John Demme wrote:

 Walter (and other compiler geniuses)- I've got two questions for you at
 the bottom of the post, if you want to skip the bulk of it.
 
 -------
 
 In order to decrease page faults (and thus increase performance) of my
 LinkedList, I implemented a block allocate scheme using the below snippet
 in a node struct:
 struct Node
 {
         static void[] block;
         /*
           If we allocate large blocks in advance for the nodes, then
           we increase our locality and thus cut down on page-faults
           and increase our cache-hit percentage
          */
         new (size_t sz, int length)
         {
                 if (block.length < sz)
                         block = new void[sz*(length+1)];
                         
                 void* p = block.ptr;
                 block = block[sz..$];
                 return p;
         }
 ...
 }
 
 Then, I used some variants of the following benchmark code to test
 ArrayList, LinkedList with block allocation and LinkedList without block
 allocations.  You can find ArrayList and LinkedList (w/o block
 allocations)
 in the mango.containers SVN.  ArrayList is backed by a D dynamic array,
 and LinkedList uses struct nodes.
 
 void main()
 {
         MutableList!(int) list = new LinkedList!(int)();
 
         for (int i=0; i<1_000_000; i++)
         {
                 list ~= i;
         }
 
         for (int j=0; j<1; j++)
         {
                 foreach (int index, int i; list)
                 {
                         if (i != index)
                                 Stdout("Error: "c)(i)(" != "c)(index)(CR);
                 }
         }
 }
 
 I found the results quite puzzling.  I used the "time" command in unix to
 get the results. Here they are:
 
 LinkedList, block allocations, compiler optimizations
 (-O -release -inline) , 1_000_000 elements, 1_000 read loops
 real    0m35.673s
 user    0m34.274s
 sys     0m0.240s
 
 LinkedList, no block allocations, compiler optimizations
 (-O -release -inline) , 1_000_000 elements, 1_000 read loops
 real    1m2.546s
 user    0m57.824s
 sys     0m0.412s
 
 ArrayList, compiler optimizations (-O -release -inline) , 1_000_000
 elements, 1_000 read loops
 real    0m48.569s
 user    0m46.559s
 sys     0m0.288s
 
 LinkedList, block allocations, compiler optimizations
 (-O -release -inline) , 1_000_000 elements, 1 read loop
 real    0m2.597s
 user    0m2.500s
 sys     0m0.016s
 
 LinkedList, no block allocations, compiler optimizations
 (-O -release -inline) , 1_000_000 elements, 1 read loop
 real    0m0.988s
 user    0m0.944s
 sys     0m0.028s
 
 ArrayList, compiler optimizations (-O -release -inline) , 1_000_000
 elements, 1 read loop
 real    0m0.203s
 user    0m0.148s
 sys     0m0.028s
 
 
 Interpretation:
 Clearly, the block allocations are very helpful in the reads.  And
 surprise:
 the LinkedList performs better than the ArrayList!!!  How?  I suspect that
 it has to do with the method of iteration.  mango.container Containers are
 iterated through using Iterator objects (the opApplys use them as well)
 and the generic ListIterator calls .get(index) on the array in addition to
 the .next() call to the iterator.  The LinkedListIterator, however, is
 specific to LinkedList and uses a reference to the current node to avoid
 the obvious inefficiency of .get(index) on a LinkedList.  Assuming that
 the extra call to .get(index) on the ArrayList was causing some strain, I
 built an ArrayListIterator to avoid it, and came up with the following
 results:
 
 ArrayList, compiler optimizations (-O -release -inline) , 1_000_000
 elements, 1_000 read loops, using ArrayListIterator
 real    0m30.966s
 user    0m29.774s
 sys     0m0.180s
 
 OK, looks like I was right.  What does this tell us?  That virtual
 function
 calls suck, but we knew this.  The LinkedList with block allocations only
 runs about 15% longer than the ArrayList- that's not too shabby.
 
 The bulk of my surpise came when only one read loop was done, so the
 allocation part of the test program has far more weight.  The LinkedList
 without block allocations does better by an order of magnitude!!!  This
 blew me away.  I figured that in terms of allocations, the block
 allocations would do no harm, and probably do better.  In fact, I expected
 them to beat out the ArrayList-- the whole point of a LinkedList!  Here is
 what I surmise from this result:
 
 Custom Allocators are Slow.
 
 Is my reasoning faulty, or am I right?  If I'm right, here's my question
 to
 Walter:  why?  Are calls to these allocators not inlined?  If not, can
 they
 be?  In addition, to what extent can virtual function calls be sped up via
 optimization?  Under what circumstances can calls to virtual methods be
 inlined?
 
 Sorry for the long post.... I may just be compensating for not posting
 much lately.
 
 ~John Demme

Sorry to reply to myself, but I though I kinda sounded like an ass in the
post.  I'm not at all upset with DMD's performance- I'd rather see the
debugging stuff we're getting now over optimizations... I'm just wondering
what we might see in the future, and how to improve my code now.

~John Demme (again)

May 14 2006

Sean Kelly <sean f4.ca> writes:

John Demme wrote:
 
 The bulk of my surpise came when only one read loop was done, so the
 allocation part of the test program has far more weight.  The LinkedList
 without block allocations does better by an order of magnitude!!!  This
 blew me away.  I figured that in terms of allocations, the block
 allocations would do no harm, and probably do better.  In fact, I expected
 them to beat out the ArrayList-- the whole point of a LinkedList!  Here is
 what I surmise from this result:
 
 Custom Allocators are Slow.
 
 Is my reasoning faulty, or am I right?  If I'm right, here's my question to
 Walter:  why?  Are calls to these allocators not inlined?  If not, can they
 be?  In addition, to what extent can virtual function calls be sped up via
 optimization?  Under what circumstances can calls to virtual methods be
 inlined?

This was the conclusion in some research papers I've read as well, the 
most relevant being "Reconsidering Custom Memory Allocation":

http://www.cs.umass.edu/~emery/pubs/berger-oopsla2002.pdf

I'll leave the paper to outline why.


Sean

May 14 2006

D Programming

C/C++ Programming

Other

digitalmars.D - Locality and Custom Allocations