
digitalmars.D - How to save RAM in D programs (on zero initialized buffers): Reloaded

reply "Marco Leise" <Marco.Leise gmx.de> writes:
Hi, this is me again with some "size matters" topic. This time, it's not
the executable size, no! Instead I want to discuss a runtime memory
footprint and speed issue that affects everyone, and how to improve the
situation dramatically.

In D we allocate memory through the GC, and it is initialized according to
the type's .init, which gives us a safe default. In most cases this will
result in the memory block being zeroed out, as in the case of
allocating ubyte[] buffers. Let's assume we have a program that
allocates some buffers in advance that it may not use fully. This often
happens when the input data is much smaller than the anticipated case.
So our memory manager should handle this situation well:
    o  zero out a memory block
    o  we probably don't need all of it

So here is a small benchmark that allocates 512 * 1 MB, first using the
typical method: new ubyte[1024 * 1024]. The output is:

	** new ubyte[1024 * 1024]
	   ressource usage: +526840 KB
	   user time: +0.098s | sys. time: +0.368s

As expected, physical memory usage increases by ~512 MB, and we spend a
considerable amount of time in the kernel finding free memory blocks and
in our program initializing the data to zero. Can we use the GC more
directly? Let's try GC.calloc:

	** GC.calloc(1024 * 1024)
	   ressource usage: +525104 KB
	   user time: +0.089s | sys. time: +0.370s

Again, 512 MB and about the same time. Nothing gained, but my RAM is
starting to fill up. By the way, how does good old C 'malloc' compare?
That gives us a block of garbage-'initialized' data - a situation we left
behind for good in D! So here we go with another test:

	** malloc(1024 * 1024)
	   ressource usage: +2048 KB
	   user time: +0.000s | sys. time: +0.002s

Oh nice! May I say... these 512 calls were for free? 2 MB and 0.002
seconds ain't worth talking about. The operating system didn't actually
allocate the memory; it merely gave us a virtual memory range to use.
Only when we write to the memory will physical memory be bound. That's
perfect for a generously sized buffer, right? Well... we still want it
zeroed out, so let's initialize the data with ptr[0 .. 1024 * 1024] = 0:

	** malloc(1024 * 1024) + zero out
	   ressource usage: +526336 KB
	   user time: +0.053s | sys. time: +0.366s

... and we are back at square one - except that the user time is notably
lower. What we need is a facility that gives us lazily allocated,
zeroed-out memory. And guess what, it's not too much to ask for. Here is
'calloc' to the rescue:

	** calloc(1, 1024 * 1024)
	   ressource usage: +2048 KB
	   user time: +0.001s | sys. time: +0.001s

How does it work? The operating system fakes the memory allocation and
just gives us 131072 references (one per 4 KB page of our 512 MB) to a
special read-only memory page of zeroes. The semantics are copy-on-write:
we start with a view of zeroed-out memory and get the real thing once we
write into it. (Sorry if I'm telling some of you nothing new, but I just
found this out today ;) )
The question I have is: should we go and improve druntime with that
knowledge? I'm not aware of any caveats - are there any?
Thanks for reading. The test program for Linux is in the attachment (I
used GDC to compile).

-- Marco
Feb 07 2012
next sibling parent reply "Nick Sabalausky" <a a.a> writes:
Is void initialization not good enough?

IIRC it's something like:

ubyte[] buf = void;
Feb 07 2012
next sibling parent reply "Marco Leise" <Marco.Leise gmx.de> writes:
Am 07.02.2012, 21:11 Uhr, schrieb Nick Sabalausky <a a.a>:

 Is void initialization not good enough?

 IIRC it's something like:

 ubyte[] buf = void;
That gives me a) no buffer, whose pointer is b) not initialized to null. I want instead a defined pointer to a valid array that is initialized to zero.

Anyway, I think the flaw in my proposal is the use of the GC. Since we don't get the memory directly from the operating system, but from a memory pool in the GC, it is generally 'recycled', already-used memory. It has to be zeroed out manually, unless there were a way to tell the OS to rebind some virtual memory addresses in our program to this magic 'zero page'.
Feb 07 2012
next sibling parent reply Michel Fortin <michel.fortin michelf.com> writes:
On 2012-02-07 20:24:40 +0000, "Marco Leise" <Marco.Leise gmx.de> said:

 Am 07.02.2012, 21:11 Uhr, schrieb Nick Sabalausky <a a.a>:
 
 Is void initialization not good enough?
 
 IIRC it's something like:
 
 ubyte[] buf = void;
That gives me a) no buffer, who's pointer is b) not initialized to null. I want instead a defined pointer, to a valid array, that is initialized to zero. Anyway, I think the flaw in my proposal is the use of a GC. Since we don't get the memory directly from the operating system, but from a memory pool in the GC, it is generally 'recycled' and already used memory. It has to be zeroed out manually, unless there was a way to tell the OS to rebind some virtual memory addresses in our program to this magic 'zero page'.
What would be nice is a GC that would just track system-allocated memory blocks. What would be really great is if you could allocate a block with malloc/calloc (or something else) and later pass ownership to the GC, so that the GC calls free (or something else) when all references disappear. But I'm not sure how you can do that efficiently.

-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/
Feb 07 2012
next sibling parent "Martin Nowak" <dawg dawgfoto.de> writes:
On Tue, 07 Feb 2012 21:37:03 +0100, Michel Fortin  
<michel.fortin michelf.com> wrote:

 On 2012-02-07 20:24:40 +0000, "Marco Leise" <Marco.Leise gmx.de> said:

 Am 07.02.2012, 21:11 Uhr, schrieb Nick Sabalausky <a a.a>:

 Is void initialization not good enough?
  IIRC it's something like:
  ubyte[] buf = void;
That gives me a) no buffer, who's pointer is b) not initialized to null. I want instead a defined pointer, to a valid array, that is initialized to zero. Anyway, I think the flaw in my proposal is the use of a GC. Since we don't get the memory directly from the operating system, but from a memory pool in the GC, it is generally 'recycled' and already used memory. It has to be zeroed out manually, unless there was a way to tell the OS to rebind some virtual memory addresses in our program to this magic 'zero page'.
What would be nice is a GC that would just track system-allocated memory blocks. What would be really great is if you could allocate a block with malloc/calloc (or something else) and later pass ownership to the GC, so that the GC calls free (or something else) when all references disappear. But I'm not sure how you can do that efficiently.
A similar idea popped up to solve non-deterministic shared library unloading and/or mapped memory freeing. I don't think, though, that putting additional pressure on the GC is the right approach to resource management. The interface would be simple for non-overlapping ranges:

void GC.trackRange(void* p, size_t sz, void delegate() finalizer)
Feb 07 2012
prev sibling parent reply Jacob Carlborg <doob me.com> writes:
On 2012-02-07 21:37, Michel Fortin wrote:
 On 2012-02-07 20:24:40 +0000, "Marco Leise" <Marco.Leise gmx.de> said:

 Am 07.02.2012, 21:11 Uhr, schrieb Nick Sabalausky <a a.a>:

 Is void initialization not good enough?

 IIRC it's something like:

 ubyte[] buf = void;
That gives me a) no buffer, who's pointer is b) not initialized to null. I want instead a defined pointer, to a valid array, that is initialized to zero. Anyway, I think the flaw in my proposal is the use of a GC. Since we don't get the memory directly from the operating system, but from a memory pool in the GC, it is generally 'recycled' and already used memory. It has to be zeroed out manually, unless there was a way to tell the OS to rebind some virtual memory addresses in our program to this magic 'zero page'.
What would be nice is a GC that would just track system-allocated memory blocks. What would be really great is if you could allocate a block with malloc/calloc (or something else) and later pass ownership to the GC, so that the GC calls free (or something else) when all references disappear. But I'm not sure how you can do that efficiently.
What about GC.addRoot?

-- 
/Jacob Carlborg
Feb 07 2012
parent Michel Fortin <michel.fortin michelf.com> writes:
On 2012-02-08 07:30:29 +0000, Jacob Carlborg <doob me.com> said:

 On 2012-02-07 21:37, Michel Fortin wrote:
 On 2012-02-07 20:24:40 +0000, "Marco Leise" <Marco.Leise gmx.de> said:
 
 Am 07.02.2012, 21:11 Uhr, schrieb Nick Sabalausky <a a.a>:
 
 Is void initialization not good enough?
 
 IIRC it's something like:
 
 ubyte[] buf = void;
That gives me a) no buffer, who's pointer is b) not initialized to null. I want instead a defined pointer, to a valid array, that is initialized to zero. Anyway, I think the flaw in my proposal is the use of a GC. Since we don't get the memory directly from the operating system, but from a memory pool in the GC, it is generally 'recycled' and already used memory. It has to be zeroed out manually, unless there was a way to tell the OS to rebind some virtual memory addresses in our program to this magic 'zero page'.
What would be nice is a GC that would just track system-allocated memory blocks. What would be really great is if you could allocate a block with malloc/calloc (or something else) and later pass ownership to the GC, so that the GC calls free (or something else) when all references disappear. But I'm not sure how you can do that efficiently.
What about GC.addRoot ?
This is perfect if you want the GC to scan a memory range for pointers. But the GC can't track pointers pointing to the given memory range and notify you when that range is no longer referenced.

-- 
Michel Fortin
michel.fortin michelf.com
http://michelf.com/
Feb 08 2012
prev sibling parent "Martin Nowak" <dawg dawgfoto.de> writes:
On Tue, 07 Feb 2012 21:24:40 +0100, Marco Leise <Marco.Leise gmx.de> wrote:

 Am 07.02.2012, 21:11 Uhr, schrieb Nick Sabalausky <a a.a>:

 Is void initialization not good enough?

 IIRC it's something like:

 ubyte[] buf = void;
That gives me a) no buffer, who's pointer is b) not initialized to null. I want instead a defined pointer, to a valid array, that is initialized to zero. Anyway, I think the flaw in my proposal is the use of a GC. Since we don't get the memory directly from the operating system, but from a memory pool in the GC, it is generally 'recycled' and already used memory. It has to be zeroed out manually, unless there was a way to tell the OS to rebind some virtual memory addresses in our program to this magic 'zero page'.
You usually don't need initialized memory from the GC. The calloc behavior is still pretty interesting.
Feb 07 2012
prev sibling parent Kapps <Kapps NotValidEmail.com> writes:
On 07/02/2012 2:11 PM, Nick Sabalausky wrote:
 Is void initialization not good enough?

 IIRC it's something like:

 ubyte[] buf = void;
This example would be uninitializedArray!(ubyte[])(1024 * 1024). I would guess that it gives significantly better performance. There's also minimallyInitializedArray!(ubyte[])(1024 * 1024), which initializes only the bare minimum needed to be safe. They're both in std.array.
Feb 07 2012
prev sibling next sibling parent reply Iain Buclaw <ibuclaw ubuntu.com> writes:
On 7 February 2012 19:39, Marco Leise <Marco.Leise gmx.de> wrote:
 [full quote of the original post snipped]
What about these functions?

import std.array;

ubyte[][ALLOCS] b, c;

writeln("** uninitializedArray!(ubyte[])(1024*1024)");
foreach(i; 0 .. ALLOCS) b[i] = uninitializedArray!(ubyte[])(1024 * 1024);
prev = print_ressources(&prev);

writeln("** minimallyInitializedArray!(ubyte[])(1024*1024)");
foreach(i; 0 .. ALLOCS) c[i] = minimallyInitializedArray!(ubyte[])(1024 * 1024);
prev = print_ressources(&prev);

Regards
-- 
Iain Buclaw
*(p < e ? p++ : p) = (c & 0x0f) + '0';
Feb 07 2012
parent "Marco Leise" <Marco.Leise gmx.de> writes:
Am 07.02.2012, 22:15 Uhr, schrieb Iain Buclaw <ibuclaw ubuntu.com>:

   o  zero out a memory block     <---------- !!!
 What about these functions?


 import std.array;

 byte[][ALLOCS] a, b;

 writeln("** uninitializedArray!(ubyte[])(1024*1024)");
 foreach(i; 0 .. ALLOCS) b[i] = uninitializedArray!(ubyte[])(1024 * 1024);
 prev = print_ressources(&prev);

 writeln("** minimallyInitializedArray!(ubyte[])(1024*1024)");
 foreach(i; 0 .. ALLOCS) c[i] = minimallyInitializedArray!(ubyte[])(1024  
 * 1024);
 prev = print_ressources(&prev);



 Regards
Yeah, I know these and used them, only to find out that the program I am porting relies on the data being zeroed out (it is a certain type of compression utility). It's a shame to see the C version get away with a low memory footprint _and_ zero-initialized buffers. I'll just copy the 'Array' class from the original then, which uses calloc and free.
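Such an 'Array' wrapper can be sketched in C (ZeroBuffer and its interface are made up here for illustration; the real class lives in the original source):

```c
#include <stdlib.h>

/* A zero-initialized buffer with explicit lifetime. calloc gives us
   zeroed memory whose pages the OS commits lazily on first write. */
typedef struct {
    unsigned char *ptr;
    size_t len;
} ZeroBuffer;

/* Returns a buffer of `len` zero bytes; ptr == NULL signals failure. */
static ZeroBuffer zbuf_create(size_t len)
{
    ZeroBuffer b = { calloc(1, len), len };
    if (!b.ptr)
        b.len = 0;
    return b;
}

static void zbuf_free(ZeroBuffer *b)
{
    free(b->ptr);
    b->ptr = NULL;
    b->len = 0;
}
```

The point of the design is that allocation cost is deferred: creating the buffer is cheap no matter its size, and only the pages actually written ever consume physical memory.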
Feb 07 2012
prev sibling next sibling parent reply "F i L" <witte2008 gmail.com> writes:
Can't you just write a custom allocator using calloc for 
performance critical structures 
(http://dlang.org/memory.html#newdelete), or do what Iain said.
Feb 07 2012
next sibling parent Timon Gehr <timon.gehr gmx.ch> writes:
On 02/08/2012 12:01 AM, F i L wrote:
 Can't you just write a custom allocator using calloc for performance
 critical structures (http://dlang.org/memory.html#newdelete), or do what
 Iain said.
The solution with the best performance and least memory requirements obviously must be the default.
Feb 07 2012
prev sibling parent reply Artur Skawina <art.08.09 gmail.com> writes:
On 02/08/12 00:01, F i L wrote:
 Can't you just write a custom allocator using calloc for performance critical
structures (http://dlang.org/memory.html#newdelete), or do what Iain said.
That won't help - the compiler will defeat the optimization by "initializing" the area... artur
Feb 07 2012
parent reply "F i L" <witte2008 gmail.com> writes:
Artur Skawina wrote:
 That won't help - the compiler will defeat the optimization by 
 "initializing" the area...
I see. Timon Gehr wrote:
 The solution with the best performance and least memory 
 requirements obviously must be the default.
No argument here. Only, if calloc is an all-around better allocation method than malloc, why is malloc even used?
Feb 07 2012
parent "F i L" <witte2008 gmail.com> writes:
F i L wrote:
 Only, if calloc is an all-around better allocation method than 
 malloc, why is malloc even used?
Note-to-self: google before asking stupid questions...
Feb 07 2012
prev sibling next sibling parent reply Manfred Nowak <svv1999 hotmail.com> writes:
Marco Leise wrote:

 I'm not aware of any caveats, are there any?
The tests only cover a very small fraction of an unknown data structure: the allocation phase. Of course one may want to make a bad design run faster, especially if one needs to allocate 0.5 TB of main memory and can allocate it in 1 second instead of 98 seconds. But neglecting
1) the time for building a usable data structure of 0.5 TB of main memory,
2) the time for querying this data structure for some element,
3) the time for inserting some element into this data structure,
4) the time for deleting some element from this data structure and
5) the amortized times of the actions in 3) and 4)
is at least not in accordance with engineering principles.

-manfred
Feb 07 2012
parent reply "Marco Leise" <Marco.Leise gmx.de> writes:
Am 08.02.2012, 04:37 Uhr, schrieb Manfred Nowak <svv1999 hotmail.com>:

 Marco Leise wrote:

 I'm not aware of any caveats, are there any?
The tests only cover a very small fraction of an unknown data structure: the allocation phase. Of course one can want to make a bad design running faster. Especially if one need to allocate 0.5 TB main memory and can allocate this in 1 second instead of 98 seconds. But neglecting 1) the time for building a usable data structure of 0.5 TB main memory, 2) the time for quering this data structure for some element, 3) the time for inserting some element into this data structure, 4) the time for deleting some element from this data structure and 5) the amortized times of the actions in 3) and 4) is at least not in accordance with engineering principles. -manfred
I meant caveats with calloc vs. other allocators. I don't want to discuss the design of compression algorithms that allocate a fixed-size buffer in advance. I am porting one of those to D, and of the many data structures it allocates with calloc, there are some that may not be fully filled when the program exits.

Yet the user notices a start-up delay in the D version for the memory allocation and initialization, as well as an instant jump to the maximum memory use. The C++ version starts instantly and maps memory pages on demand. The algorithm relies on some of the buffers being zeroed out, so there is really no alternative to calloc.
Feb 07 2012
parent reply Manfred Nowak <svv1999 hotmail.com> writes:
Marco Leise wrote:

 In D we allocate memory through the GC
[...]
 Let's assume, we have a program that allocates some buffers in
 advance, that it may not use fully.
[...]
 there is really no alternative to calloc.
1)
calloc implements a strategy for allocating memory. If this strategy is useful for parts of a program, then the GC should be informed from the code and be capable of adapting its behavior to the requested strategy. This seems necessary because there is not only the null pattern to initialize allocated memory.

2)
According to your OP, the request for allocating memory can be disjointly partitioned into three:
a) memory that is guaranteed to be used,
b) memory that is guaranteed not to be used and
c) memory that might be used.

For the case that a + b exceeds available memory, should the coder predeclare a strategy, i.e. should the GC signal out of memory on request of the memory, or should the GC wait until that sum reaches some deadline?

-manfred
Feb 08 2012
parent reply "Marco Leise" <Marco.Leise gmx.de> writes:
Am 08.02.2012, 14:30 Uhr, schrieb Manfred Nowak <svv1999 hotmail.com>:

 Marco Leise wrote:

 In D we allocate memory through the GC
[...]
 Let's assume, we have a program that allocates some buffers in
 advance, that it may not use fully.
[...]
 there is really no alternative to calloc.
1) calloc implements a strategie for allocating memory. If this strategy is usefull for parts of a program, then the GC should be informed from the code and be capable to adapt its behavior to the requested strategy.
That sounds a bit vague. If I understand you correctly, you would implement this as a hint flag, like

GC.allocHint = AllocationHint.lazyZero;
auto arr = new ubyte[…];
 This seems necessary because there is not only the null pattern to
 initialize allocated memory.

 2)
 According to your OP the request for allocating memory can be
 disjoint partitioned into three:
 a) memory that is guaranteed to be used
 b) memory that is guaranteed not to be used and
 c) memory that might be used.

 For the case that a + b exceeds available memory, should the coder
 predeclare a strategy, i.e. should the GC signal out of memory on
 request of the memory or should the GC wait until that sum reaches
 some deadline?

 -manfred
Here I can't follow you. The request to allocate memory contains memory regions that are guaranteed not to be used? Why would I request them then? Also, what is your definition of available memory? Available RAM, RAM + swap, or available virtual memory going beyond what the operating system can commit, resulting in random killing of applications (http://linux-mm.org/OOM_Killer)?

I don't see where the GC enters the picture, since all this 'may use memory, but don't commit on it' is handled solely by the OS. I could only imagine a function in the GC that always allocates a new chunk of memory and does that through calloc - or the hammer method: all allocations use calloc and some of the manual memset(..., 0, ...) calls are removed.

-- Marco
Feb 08 2012
parent reply Manfred Nowak <svv1999 hotmail.com> writes:
Marco Leise wrote:

 That sounds a bit vague.
Andrei has written a paper on allocation:
http://erdani.com/publications/cuj-2005-12.pdf

-manfred
Feb 08 2012
parent reply "Marco Leise" <Marco.Leise gmx.de> writes:
Am 09.02.2012, 03:56 Uhr, schrieb Manfred Nowak <svv1999 hotmail.com>:

 Marco Leise wrote:

 That sounds a bit vague.
Andrei has written a paper on allocation: http://erdani.com/publications/cuj-2005-12.pdf -manfred
Oh ok, that's farther than I ever dug into memory management. My current problem with the compression utility has been solved at a small scale with a manual memory management 'ZeroInitializedAndAlignedBuffer' struct that uses calloc - right in the spirit of the original source code. The startup time was reduced by ~0.7 seconds and is now relatively close to the original; at least there is no longer that perceived delay. Techniques like calloc make it possible to use the algorithm in other places like batch processing, compressed FUSE file systems on Linux, or libraries, where new instances may be spawned in quick succession.

All technical details aside, I would wish for some solution to "I spawn a new instance of some huge complex data structure, but I probably won't need all of it, so use calloc". I don't know what the situation with other D programs is, but I would be interested in some larger-scale experiments with calloc and app startup time. It may be worse in some cases, but what if it improved the general situation - without writing a memory management library? I don't know the GC internals well enough to tell whether calloc is already used somewhere (in place of malloc and memset).
Feb 09 2012
parent reply "Kagamin" <spam here.lot> writes:
I guess, calloc will reuse blocks too, so if you run the 
compressing function twice, it will reuse the memory block used 
and freed previously and zero it out honestly.
Feb 09 2012
parent "Marco Leise" <Marco.Leise gmx.de> writes:
Am 09.02.2012, 11:55 Uhr, schrieb Kagamin <spam here.lot>:

 I guess, calloc will reuse blocks too, so if you run the compressing  
 function twice, it will reuse the memory block used and freed previously  
 and zero it out honestly.
You don't understand how it works. calloc gives you exactly 0 KB of physical memory. There is nothing to zero out :) There is one page, a block of 4096 bytes somewhere in the kernel, that is all zeroes and read-only. If you allocate memory, you get a bunch of references to it - or in other words, a zillion views of the same 4096 bytes repeating over and over. Only once you need to write to it will what you describe happen: the zero page is 'copied' into some (probably previously freed) page. This is equivalent to zeroing out the target page. The main difference is that the zeroing out happens with a 'lazy' keyword attached to it.
Feb 09 2012
prev sibling parent reply Jose Armando Garcia <jsancio gmail.com> writes:
On Tue, Feb 7, 2012 at 5:39 PM, Marco Leise <Marco.Leise gmx.de> wrote:
 [...]
 How does it work? The operating system fakes the memory allocation and
 just gives us 131072 references to a special read-only memory page of
 zeroes. The semantic is copy-on-write. So we start with a view on zeroed
 out memory and get the real thing once we write into it.
Special? What do you mean by special? Most OSes use virtual memory, so sure, they can say "here is a page" and yet not have that page backed by physical memory. To my knowledge, they can only do this if the allocated memory points to an unmapped page. I doubt this is the case in non-trivial programs.
Feb 07 2012
parent "Marco Leise" <Marco.Leise gmx.de> writes:
Am 08.02.2012, 04:40 Uhr, schrieb Jose Armando Garcia <jsancio gmail.com>:

 Special? What do you mean by special? Most OS use Virtual Memory so
 sure they can say here is a page and yet not have that page backed by
 physical memory. To my knowledge, they can only do this if the
 allocated memory points to an unmapped page. I doubt this is the case
 in non-trivial programs.
Yes, special. Like, say, you fork a process on Unix: instead of copying all the used memory, the processes share the same pages in read-only, copy-on-write mode. As soon as one of the processes writes to a memory page, it is duplicated. The same is true for the 'zero page', except that there is one for the whole system and it can have a zillion references to it: http://en.wikipedia.org/wiki/Copy-on-write <- look for calloc. So the page can even be mapped and read without increasing the memory footprint, as I understand it.
Feb 07 2012