
digitalmars.D - std.allocator: FreeList uses simple statistics to control number

reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
https://github.com/andralex/phobos/commit/90120fc290bc7840ffbee22766798518b3418e15

There is a bothersome issue with freelists fronting general-purpose 
allocators (https://en.wikipedia.org/wiki/Free_list): they can grow 
indefinitely. Because they keep memory allocated in their parent, they 
cause fragmentation to it.

The worst pattern for a freelist is: an application allocates a bunch of 
blocks of the specific size the freelist is tuned for, then frees most 
of them and proceeds with allocating other sizes. In that case the large 
freelist is essentially leaked.

My past FreeList implementation had a maximum list length as a 
parameter. That was quite a primitive way to go about it. But what I 
want is a way to model the application's behavior and automatically 
adapt to the allocation patterns. In particular, if demand for blocks of 
the freelist size dwindles, it should be possible for the freelist 
length to progressively go down to zero.

So there are three possible allocation events:

1. The request is for a fit size and the free list is empty. That's a 
miss. It will be served from the parent allocator, and upon freeing the 
block will be added to the free list.

2. The request is for a fit size and the free list is not empty. That's 
a hit - nice. Serve it from the free list.

3. The request is for an unfit size. That's always served from the 
parent allocator, and returned to it upon deallocation.

What we want is to minimize the likelihood of a miss. Say over the past 
N allocation requests we have N1, N2, and N3 counts for the three 
events. Then we can estimate the probability of a miss from these past 
events:

Pmiss = N1 / N

Trouble is, all counts keep increasing and there is little adaptation to 
changing patterns - the stats are averaged over the entire lifetime. 
It's better to keep stats over a fixed-size rolling window, say the past 
Nw events. In that case, if a new miss event occurs, we update the 
estimated miss probability as follows:

Pmiss = (Pmiss * Nw + 1) / (Nw + 1)

That is, in the recent past we've seen Pmiss * Nw misses, now we see one 
more (+ 1), and we divide everything by the increased number of events. 
In case of a non-miss (hit or unfit), the update is:

Pmiss = Pmiss * Nw / (Nw + 1)

Note how the first equation converges to Pmiss = 1 and the second to 
Pmiss = 0. Nice.

In a batch of events Nbatch of which there were Nmiss misses, the update is:

Pmiss = (Pmiss * Nw + Nmiss) / (Nw + Nbatch)
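
To make this concrete, here's a minimal sketch of the estimator in D 
(the names below are just for illustration, not what the FreeList code 
actually uses):

struct MissEstimator
{
    double pMiss = 0;                 // current estimate of Pmiss
    enum double windowLength = 1000;  // Nw, the rolling window size

    // Per-event update: pass true on a miss, false on a hit or unfit request.
    void update(bool miss)
    {
        pMiss = (pMiss * windowLength + (miss ? 1 : 0)) / (windowLength + 1);
    }

    // Batched update: nMiss misses observed among nBatch events.
    void updateBatch(size_t nMiss, size_t nBatch)
    {
        pMiss = (pMiss * windowLength + nMiss) / (windowLength + nBatch);
    }
}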

So after each allocation event, or after a few of them, we have a nice 
estimate of the probability we're missing the opportunity of allocating 
off the freelist. What I'm less sure about is how to efficiently wire 
that information into a feedback loop. Specifically:

1. How to update Pmiss efficiently? Most allocations should be fast, so 
it shouldn't be updated with every call to allocate(). What I have now is 
I update every K calls.

2. How and when should the estimated Pmiss be used to reduce the size of 
the freelist? What I do now is when Pmiss is below a fixed threshold, I 
just periodically yank one element off of it.
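
Roughly, the pieces could fit together like this (a sketch only; the 
constants and names are invented, and deallocate() is omitted):

// Simplified free list front that refreshes the estimate every K calls
// and trims one cached block when Pmiss stays below a threshold.
struct AdaptiveFreeList(Parent, size_t blockSize)
{
    Parent parent;

    private static struct Node { Node* next; }
    private Node* root;                 // head of the free list
    private size_t calls, recentMisses;
    private double pMiss = 0;

    enum updateEvery = 64;              // K: how often to refresh the estimate
    enum double windowLength = 1000;    // Nw
    enum double trimThreshold = 0.01;

    void[] allocate(size_t n)
    {
        if (++calls % updateEvery == 0)
        {
            // Batched update of the estimate, as in the formula above.
            pMiss = (pMiss * windowLength + recentMisses)
                / (windowLength + updateEvery);
            recentMisses = 0;
            if (pMiss < trimThreshold && root !is null)
            {
                // Demand looks low: yank one element, return it to the parent.
                auto b = root;
                root = root.next;
                parent.deallocate((cast(void*) b)[0 .. blockSize]);
            }
        }
        if (n == blockSize)
        {
            if (root !is null)          // hit: serve from the free list
            {
                auto b = root;
                root = root.next;
                return (cast(void*) b)[0 .. blockSize];
            }
            ++recentMisses;             // miss: fall through to the parent
        }
        return parent.allocate(n);      // miss, or unfit size
    }

    // deallocate() would push fit-size blocks back onto the list (not shown).
}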

Better ideas are welcome!


Andrei
May 20 2015
next sibling parent Etienne Cimon <etcimon gmail.com> writes:
What you could do is calculate the average allocation size and standard 
deviation in a moving window, compute the z-score for each freelist, and 
use this lookup table:

https://www.stat.tamu.edu/~lzhou/stat302/standardnormaltable.pdf

If P < 0.10 (maybe make this a setting), the probability of the next 
allocations going through this freelist is too low, and you can decide 
to tighten the freelist and let deallocations fall through to the 
underlying allocator.
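
A rough sketch of what I mean, in D (the names and the exact cutoff are 
just illustrative):

struct SizeStats
{
    enum windowLen = 256;
    double[windowLen] window = 0;
    size_t index, count;

    // Record the size of each incoming allocation request.
    void record(size_t requestedSize)
    {
        window[index] = requestedSize;
        index = (index + 1) % windowLen;
        if (count < windowLen) ++count;
    }

    // True when the freelist's block size looks improbable given recent requests.
    bool unlikely(size_t blockSize) const
    {
        import std.algorithm.iteration : map, sum;
        import std.math : abs, sqrt;

        if (count < windowLen) return false;    // not enough data yet
        immutable mean = window[].sum / windowLen;
        immutable var = window[].map!(x => (x - mean) * (x - mean)).sum / windowLen;
        immutable sd = sqrt(var);
        if (sd == 0) return false;
        immutable z = abs(blockSize - mean) / sd;
        // From the standard normal table, |z| > ~1.64 leaves about 10% in
        // the two tails combined, i.e. P < 0.10.
        return z > 1.64;
    }
}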
May 20 2015
prev sibling next sibling parent reply "Morbid.Obesity" <Morbid.Obesity mail.com> writes:
On Wednesday, 20 May 2015 at 17:28:50 UTC, Andrei Alexandrescu 
wrote:
 https://github.com/andralex/phobos/commit/90120fc290bc7840ffbee22766798518b3418e15

 There is a bothersome issue with freelists fronting 
 general-purpose allocators 
 (https://en.wikipedia.org/wiki/Free_list): they can grow 
 indefinitely. Because they keep memory allocated in their 
 parent, they cause fragmentation to it.

 The worst pattern for a freelist is: an application allocates a 
 bunch of blocks of the specific size the freelist is tuned for, 
 then frees most of them and proceeds with allocating other 
 sizes. In that case the large freelist is essentially leaked.

 My past FreeList implementation had a maximum list length as a 
 parameter. That was quite a primitive way to go about it. But 
 what I want is a way to model the application's behavior and 
 automatically adapt to the allocation patterns. In particular, 
 if demand for blocks of the freelist size dwindles, it should 
 be possible for the freelist length to progressively go down to 
 zero.

 So there are three possible allocation events:

 1. The request is for a fit size and the free list is empty. 
 That's a miss. It will be served from the parent allocator, and 
 upon freeing the block will be added to the free list.

 2. The request is for a fit size and the free list is not 
 empty. That's a hit - nice. Serve it from the free list.

 3. The request is for an unfit size. That's always served from 
 the parent allocator, and returned to it upon deallocation.

 What we want is to minimize the likelihood of a miss. Say over 
 the past N allocation requests we have N1, N2, and N3 counts 
 for the three events. Then we can estimate the probability of a 
 miss from these past events:

 Pmiss = N1 / N

 Trouble is, all counts keep increasing and there is little 
 adaptation to changing patterns - the stats are averaged over 
 the entire lifetime. It's better to keep stats over a 
 fixed-size rolling window, say the past Nw events. In that 
 case, if a new miss event occurs, we update the estimated miss 
 probability as follows:

 Pmiss = (Pmiss * Nw + 1) / (Nw + 1)

 That is, in the recent past we've seen Pmiss * Nw misses, now 
 we see one more (+ 1), and we divide everything by the 
 increased number of events. In case of a non-miss (hit or 
 unfit), the update is:

 Pmiss = Pmiss * Nw / (Nw + 1)

 Note how the first equation converges to Pmiss = 1 and the 
 second to Pmiss = 0. Nice.

 In a batch of events Nbatch of which there were Nmiss misses, 
 the update is:

 Pmiss = (Pmiss * Nw + Nmiss) / (Nw + Nbatch)

 So after each allocation event, or after a few of them, we have 
 a nice estimate of the probability we're missing the 
 opportunity of allocating off the freelist. What I'm less sure 
 about is how to efficiently wire that information into a 
 feedback loop. Specifically:

 1. How to update Pmiss efficiently? Most allocations should be 
 fast, so it shouldn't be updated with every call to allocate(). 
 What I have now is I update every K calls.
Could you not simply run this in a thread that quickly informs the free 
list that it needs to change its optimization strategy? In the free list 
allocator, you simply check a flag:

allocator() { if (needs_size_u) { /* do whatever */ } }

The threaded optimization routine can run as frequently as one wants, 
and the frequency could even be based on the recent allocations per 
second (since higher allocations per second fragment the list faster, 
optimization would be needed more often than when allocations are low; 
why keep trying to optimize the list when no allocations are made?).

The thread could also do such things as defragment the free list during 
the GC's stop-the-world cycle, if it is ever called. If one uses the GC, 
the free list could piggyback on its fragmentation routine to clean 
itself up.
 2. How and when should the estimated Pmiss be used to reduce 
 the size of the freelist? What I do now is when Pmiss is below 
 a fixed threshold, I just periodically yank one element off of 
 it.
A. When Pmiss is large there are more misses and you can yank in 
"proportion", e.g.,

addcount = floor(C*(Pmiss^n - 1/2))

Here we simply scale Pmiss (or Phit) into an integer that represents the 
count to add or yank (if addcount < 0 we yank, which is why we subtract 
1/2). C and n are constants that give us our range and "intensity". They 
can also be optimized over the longer-term use of the program (e.g., 
over hours of use, C and n slowly change to try to minimize Pmiss). This 
is a way to have the program optimize itself to the specific way 
allocations work in context (both the programming context and 
user-specific behavior). E.g., for Pmiss = 0, addcount = floor(-C/2), so 
C represents the maximum count we want to yank; for Pmiss = 1, addcount 
= floor(C/2), so C also represents the maximum to add. (It happens to be 
symmetric in this case but does not need to be; e.g., one could tailor 
the algorithm to adapt two different C's instead of one, which would 
optimize it for high vs. low memory usage.)

B. Simply keep a running average of the previous n values, then use that 
with your threshold. I tend to use these circular history buffers for 
statistical analysis as they help smooth out the data (essentially a 
low-pass filter). This keeps weird issues from happening: think of how 
some differential equations can become numerically unstable, and it's 
possible that your algorithm may suffer the same issues in some cases 
(weird allocation strategies). Averaging makes it much more stable. 
Since you already average over a window, it is not a huge concern (if 
you only depended on the previous value, your optimization algorithm 
would probably be unstable in some cases).

C. You can essentially design whatever statistical measures you like - 
the standard deviation, as mentioned, the median, the mean, 
multi-parameter statistical estimations, etc. - and simply form a 
weighted sum of the rules, then optimize the weights in a thread. The 
"sum" is a measure of many aspects of how Pmiss behaves. One can then 
use the correlation between Pmiss and the sum (essentially compare their 
instantaneous directional derivatives): if the sum and Pmiss are 
correlated, one adjusts the weights and sees whether they stay 
correlated or not. If so, the weights varied in the correct direction. 
Of course, one should take averages of the sums and Pmiss's to avoid 
instability (essentially low-pass filtering the functions to avoid 
numerical aliasing).

The great thing about any of these methods (and they can be combined in 
any way you want) is that they can be used to continuously optimize the 
allocation strategy for the exact context it is used in (e.g., adapt to 
different computers, memory layouts and amounts, program usage, etc.). 
This happens naturally in the algorithms I've mentioned (it is a form of 
parametric modeling and neural networks). Another thing is that one can 
use profiling to get a baseline optimization that will probably be 
significantly better than the non-adapting version.

The main problem is that a larger number of weights produces dimensional 
explosion, which means you'll have to use neural networks to solve this 
issue... which then gets into nested neural networks. Since we are 
already using a neural network to optimize the program and then have to 
use another one to optimize that network's coefficients efficiently, it 
gets complicated real quick ;)
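
A quick sketch of A in D (C, n, and the function name are placeholders, 
nothing from the actual code):

import std.math : floor, pow;

// Returns how many blocks to add to (positive) or yank from (negative)
// the free list, proportional to the current miss estimate.
int adjustment(double pMiss, double C = 8.0, double n = 1.0)
{
    return cast(int) floor(C * (pow(pMiss, n) - 0.5));
}

unittest
{
    assert(adjustment(0.0) == -4);  // low miss rate: shrink the list
    assert(adjustment(1.0) == 4);   // high miss rate: grow the list
}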
May 20 2015
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 5/20/15 11:05 PM, Morbid.Obesity wrote:
[snip]

Thanks for sharing your insights! -- Andrei
May 21 2015
parent reply "Márcio Martins" <marcioapm gmail.com> writes:
On Thursday, 21 May 2015 at 15:17:03 UTC, Andrei Alexandrescu 
wrote:
 [snip]
Could something like this work for the statistics part?

enum AdaptabilityWindow = 256;
enum AdaptabilitySpeed = 2;    // log2 of rescaling divisor

size_t hitCount, missCount;    // persistent counters kept by the free list

// on each fit-size request ("hit" = served from the free list):
if (hit) ++hitCount; else ++missCount;

if ((hitCount + missCount) >= AdaptabilityWindow)
{
    hitCount >>= AdaptabilitySpeed;
    missCount >>= AdaptabilitySpeed;
}

// hitCount / (hitCount + missCount) and missCount / (hitCount + missCount)
// are a decent approximation of p(hit) and p(miss) given recent history.
// If you can make decisions just by comparing them, you don't have to
// perform expensive divisions.

This is based on re-scaling frequencies, similar to what you'd find in 
model-based compression algorithms. With it you could try to always keep 
a minimum reserve of blocks just in case, and when freeing a block of 
fit size, decide between keeping it in the free-list if the tendency is 
to allocate fit sizes, or returning it to the parent if the tendency is 
to allocate bad sizes.

To incrementally reduce the big free-list, you could just reduce the 
free-list size by a factor proportional to p(miss) - p(hit).
May 21 2015
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 5/21/15 10:18 AM, "Márcio Martins" <marcioapm gmail.com> wrote:
 On Thursday, 21 May 2015 at 15:17:03 UTC, Andrei Alexandrescu wrote:
 [snip]
 Could something like this work for the statistics part?

 enum AdaptabilityWindow = 256;
 enum AdaptabilitySpeed = 2;    // log2 of rescaling divisor

 size_t hitCount, missCount;    // persistent counters kept by the free list

 // on each fit-size request ("hit" = served from the free list):
 if (hit) ++hitCount; else ++missCount;

 if ((hitCount + missCount) >= AdaptabilityWindow)
 {
     hitCount >>= AdaptabilitySpeed;
     missCount >>= AdaptabilitySpeed;
 }

 // hitCount / (hitCount + missCount) and missCount / (hitCount + missCount)
 // are a decent approximation of p(hit) and p(miss) given recent history.
 // If you can make decisions just by comparing them, you don't have to
 // perform expensive divisions.

 This is based on re-scaling frequencies, similar to what you'd find in 
 model-based compression algorithms. With it you could try to always keep 
 a minimum reserve of blocks just in case, and when freeing a block of 
 fit size, decide between keeping it in the free-list if the tendency is 
 to allocate fit sizes, or returning it to the parent if the tendency is 
 to allocate bad sizes.

 To incrementally reduce the big free-list, you could just reduce the 
 free-list size by a factor proportional to p(miss) - p(hit).
Yah, I think that's a good idea. The size of the generated code was getting worrisome. Thanks! -- Andrei
May 21 2015
prev sibling next sibling parent "Marc Schütz" <schuetzm gmx.net> writes:
On Wednesday, 20 May 2015 at 17:28:50 UTC, Andrei Alexandrescu 
wrote:
 1. How to update Pmiss efficiently? Most allocations should be 
 fast, so it shouldn't be updated with every call to allocate(). 
 What I have now is I update every K calls.
Have you measured whether it actually makes a difference? I would expect memory access latency to dominate, at least in some scenarios.
May 21 2015
prev sibling next sibling parent "Martin Nowak" <code dawg.eu> writes:
On Wednesday, 20 May 2015 at 17:28:50 UTC, Andrei Alexandrescu 
wrote:
 There is a bothersome issue with freelists fronting 
 general-purpose allocators 
 (https://en.wikipedia.org/wiki/Free_list): they can grow 
 indefinitely. Because they keep memory allocated in their 
 parent, they cause fragmentation to it.
You can look at jemalloc which has a GC algorithm for this problem.

https://m.facebook.com/notes/facebook-engineering/scalable-memory-allocation-using-jemalloc/480222803919
May 22 2015
prev sibling parent reply "Martin Nowak" <code dawg.eu> writes:
On Wednesday, 20 May 2015 at 17:28:50 UTC, Andrei Alexandrescu 
wrote:
 1. How to update Pmiss efficiently? Most allocations should be 
 fast, so it shouldn't be updated with every call to allocate(). 
 What I have now is I update every K calls.
A one-pole low pass filter is a very efficient moving average:

https://github.com/D-Programming-Language/druntime/blob/6e55b7aaff7566d374c2f253f831d3489e7fd1a5/src/gc/gc.d#L1732

Feed it with 1 on hit and 0 on miss to get Phit on average.
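
Something along these lines (a sketch; the smoothing constant and names 
are illustrative, not the linked druntime code):

// One-pole low-pass filter (exponential moving average). Feeding it 1 on
// a hit and 0 on a miss yields a smoothed estimate of Phit.
struct OnePole
{
    double value = 0;      // current smoothed estimate
    double alpha = 0.01;   // smoothing constant; smaller = longer memory

    void put(double sample)
    {
        value += alpha * (sample - value);
    }
}

unittest
{
    OnePole phit;
    phit.put(1);           // hit
    phit.put(0);           // miss
    assert(phit.value > 0 && phit.value < 1);
}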
May 22 2015
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 5/22/15 9:26 AM, Martin Nowak wrote:
 A one-pole low pass filter is a very efficient moving average.
 https://github.com/D-Programming-Language/druntime/blob/6e55b7aaff7566d374c2f253f831d3489e7fd1a5/src/gc/gc.d#L1732
 Feed it with 1 on hit and 0 on miss to get Phit on average.
On 5/22/15 9:19 AM, Martin Nowak wrote:
 You can look at jemalloc which has a GC algorithm for this problem.
 https://m.facebook.com/notes/facebook-engineering/scalable-memory-allocation-using-jemalloc/480222803919
These are both solid. Thanks!! -- Andrei
May 22 2015