Should Out-of-Memory Default to Being a Non-Recoverable Error?

January 22, 2009

By recovering from such an error, I mean that the program is able to clean up and continue operating. This is as opposed to a non-recoverable error, where the process cleans up and exits. Far and away most places in code where an exception could be thrown are memory allocation points, so eliminating those possibilities will simplify a lot of code. This will greatly increase the range of functions that can be marked as nothrow (presuming that out-of-memory errors manifest themselves as a thrown exception), opening up opportunities for optimization and program analysis, and reducing the complexity of coverage testing. In my experience, the vast majority of circumstances where an allocation fails should be treated as fatal errors. I’ve asked this question on Stack Overflow and in the D programming language newsgroups, to try to gauge under what circumstances an out-of-memory error needs to be recoverable.

I’ll try to summarize the cases and ways to deal with them.

An error recovery system could free memory that is not critically needed, such as caches, and continue on.

This capability can be built into the memory allocation system so it happens automatically without ever needing to signal an error. The memory manager can have user-supplied callbacks to release memory in such cases.
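To make the callback idea concrete, here is a minimal C++ sketch. The class `ReclaimingAllocator`, its byte budget, and the `on_low_memory` hook are all hypothetical names invented for this illustration, not part of any real allocator; the budget stands in for the point at which a real memory manager would notice it is running short. It is single-threaded and only shows the shape of the mechanism.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical allocator wrapper with a fixed byte budget (illustrative only).
class ReclaimingAllocator {
public:
    // A callback frees non-critical memory (caches and the like) by calling
    // release() on blocks it owns.
    using ReleaseCallback = std::function<void()>;

    explicit ReclaimingAllocator(std::size_t budget) : budget_(budget) {}

    void on_low_memory(ReleaseCallback cb) { callbacks_.push_back(std::move(cb)); }

    // Runs the callbacks when the request does not fit, then retries.
    // Returns nullptr only if the request still cannot be met.
    void* allocate(std::size_t n) {
        if (!fits(n)) {
            for (auto& cb : callbacks_) {
                cb();
                if (fits(n)) break;  // stop reclaiming once there is room
            }
        }
        if (!fits(n)) return nullptr;
        void* p = std::malloc(n);
        if (p) used_ += n;
        return p;
    }

    // The caller passes back the size it originally asked for.
    void release(void* p, std::size_t n) {
        assert(n <= used_);
        std::free(p);
        used_ -= n;
    }

    std::size_t used() const { return used_; }

private:
    bool fits(std::size_t n) const { return n <= budget_ - used_; }

    std::size_t budget_;
    std::size_t used_ = 0;
    std::vector<ReleaseCallback> callbacks_;
};
```

The calling code never sees an error as long as the callbacks can free enough memory; the cache owner registers a callback once and the rest of the program allocates as usual.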

A server app could fail for one client request, but recover and continue serving later requests.

This certainly sounds compelling, but the problem is that when a multitasking system starts getting low on memory, the system can slow down drastically. When it gets really low, other processes can start failing, and can bring down the system. It really isn’t a good idea to run such a system close to the edge of running out of memory. A better approach is to set a reasonable maximum memory use for the server app, estimate the memory cost of serving requests, and fail requests that would likely put it over the estimated amount before ever attempting to service them. This has two additional advantages. It makes recovering from a failed request easy, as the request has no half-baked state that needs unwinding. And there is a single point where the recovery has to be done, rather than it being distributed throughout the request service code.
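The admission check described above can be sketched in a few lines of C++. Everything here is an assumption made for illustration: the names `MemoryBudget`, `Admission`, and `serve`, the idea that the caller supplies a per-request estimate in bytes, and the single-threaded bookkeeping (a real server would need synchronization).

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical bookkeeping of estimated memory committed to in-flight requests.
class MemoryBudget {
public:
    explicit MemoryBudget(std::size_t limit) : limit_(limit) {}

    // Admit the request only if its estimate fits under the limit.
    bool try_admit(std::size_t estimated_bytes) {
        if (estimated_bytes > limit_ - committed_) return false;
        committed_ += estimated_bytes;
        return true;
    }

    void finish(std::size_t estimated_bytes) {
        assert(estimated_bytes <= committed_);
        committed_ -= estimated_bytes;
    }

    std::size_t committed() const { return committed_; }

private:
    std::size_t limit_;
    std::size_t committed_ = 0;
};

// Gives the estimate back when the request ends, even if the handler throws.
struct Admission {
    MemoryBudget& budget;
    std::size_t bytes;
    ~Admission() { budget.finish(bytes); }
};

enum class Status { Served, RejectedBusy };

// The single point where a request can be failed for lack of memory.
// Nothing has been started yet, so there is no state to unwind.
template <typename Handler>
Status serve(MemoryBudget& budget, std::size_t estimated_bytes, Handler handle) {
    if (!budget.try_admit(estimated_bytes)) return Status::RejectedBusy;
    Admission guard{budget, estimated_bytes};
    handle();
    return Status::Served;
}
```

The request handler itself contains no out-of-memory handling at all; the only recovery path is the early return in `serve`.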

A critical program, such as life support machines, flight control systems, etc., cannot be allowed to fail.

This implies that it is possible to create bug-free perfect programs that run on perfect hardware that will never fail. Such is, of course, impossible. A far more workable approach to critical systems is to design in redundancy rather than pray for perfection. But let’s say we’re going to try to design a perfect program that allocates memory and cannot fail. This implies having a test procedure that can simulate an out-of-memory error at every point in the program that could allocate memory, and test its recovery. That’s a tall order. A more practical approach is to preallocate all needed memory before entering the critical section of the program, and then not allocate inside that section.
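One way to preallocate, sketched here in C++ under my own assumptions, is a fixed-capacity pool whose storage is all acquired in the constructor. `PreallocatedPool` is a hypothetical name; it requires a default-constructible element type and trusts the caller not to release the same slot twice.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical fixed-capacity pool. All memory is obtained in the constructor,
// so any allocation failure happens before the critical section is entered.
template <typename T>
class PreallocatedPool {
public:
    explicit PreallocatedPool(std::size_t capacity) : slots_(capacity) {
        free_.reserve(capacity);
        for (std::size_t i = capacity; i > 0; --i) free_.push_back(i - 1);
    }

    // Never allocates. Returns nullptr when every slot is in use.
    T* acquire() noexcept {
        if (free_.empty()) return nullptr;
        std::size_t i = free_.back();
        free_.pop_back();
        return &slots_[i];
    }

    // Never allocates: free_ was reserved to full capacity up front.
    void release(T* p) noexcept {
        assert(p >= slots_.data() && p < slots_.data() + slots_.size());
        free_.push_back(static_cast<std::size_t>(p - slots_.data()));
    }

    std::size_t available() const noexcept { return free_.size(); }

private:
    std::vector<T> slots_;
    std::vector<std::size_t> free_;
};
```

Inside the critical section the program calls only `acquire` and `release`. Running out of slots is then a sizing question that can be settled by analysis or testing ahead of time, rather than an allocation failure that can strike at any point.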

A program can probe for how big a memory buffer can be allocated by repeatedly requesting a large block and then successively smaller ones until the allocation succeeds.

This strategy is not very practical on a multitasking system with lots of memory and processes, because the other processes will be badly squeezed by it. A better approach is to have a memory allocator API that allows querying how much memory is available, or perhaps an API that can reserve a certain amount of memory for use later, much like a hotel reserves funds in advance on your credit card.
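A reservation API might look something like the following C++ sketch. `Reservation` is a name I made up for illustration; it takes one block up front using the standard `new (std::nothrow)` form, which returns a null pointer on failure, and later hands out pieces of that block. Note the assumption: on operating systems that overcommit memory, a successful allocation does not guarantee that the pages are actually backed, so a real implementation would need operating system support.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <new>

// Hypothetical reservation: claim the memory now, carve it up later.
class Reservation {
public:
    // Tries to reserve `bytes`. Check ok() afterwards; failure is reported
    // once, here, rather than at each later use.
    explicit Reservation(std::size_t bytes)
        : block_(new (std::nothrow) unsigned char[bytes]),
          size_(block_ ? bytes : 0) {}

    bool ok() const { return block_ != nullptr; }

    std::size_t remaining() const {
        assert(used_ <= size_);
        return size_ - used_;
    }

    // Hands out n bytes from the reserved block, rounded up so that each
    // piece stays suitably aligned. Returns nullptr if the reservation is
    // used up.
    void* take(std::size_t n) {
        constexpr std::size_t a = alignof(std::max_align_t);
        std::size_t rounded = (n + a - 1) / a * a;
        if (rounded < n || rounded > remaining()) return nullptr;
        void* p = block_.get() + used_;
        used_ += rounded;
        return p;
    }

private:
    std::unique_ptr<unsigned char[]> block_;
    std::size_t size_;
    std::size_t used_ = 0;
};
```

The program learns whether it has the memory at the moment of reservation, before committing to the work, and the whole reservation is returned when the object is destroyed.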

A program can scale its operations to fit in whatever memory is available, relying on the relatively simple expedient of recovering from an out-of-memory error to flag its limits.

Such programs are rare, but they illustrate the convenience of having an alternate allocation API that does return an error when memory runs out.
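As a rough C++ illustration of the two kinds of API side by side: `alloc_or_die` treats failure as fatal, `try_alloc` reports it to the caller, and `largest_available` is the kind of rare program that scales itself to what it can get. All three names are hypothetical, and the allocator is passed in as a parameter so the scaling logic can be exercised without actually exhausting memory.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <cstdlib>

// Default path: an allocation failure is fatal, so callers never check.
void* alloc_or_die(std::size_t n) {
    void* p = std::malloc(n);
    if (!p && n != 0) {
        std::fputs("out of memory\n", stderr);
        std::abort();
    }
    return p;
}

// Alternate path: failure is returned to the caller as a null pointer.
void* try_alloc(std::size_t n) noexcept { return std::malloc(n); }

struct Buffer {
    void* data;
    std::size_t size;
};

// Asks for `wanted` bytes and halves the request on each failure, giving up
// once the size drops below `minimum`. `try_alloc_fn` is any callable that
// returns nullptr on failure, such as try_alloc above.
template <typename TryAllocFn>
Buffer largest_available(TryAllocFn try_alloc_fn, std::size_t wanted,
                         std::size_t minimum) {
    assert(minimum > 0);
    for (std::size_t n = wanted; n >= minimum; n /= 2) {
        if (void* p = try_alloc_fn(n)) return {p, n};
    }
    return {nullptr, 0};
}
```

Most of the program would call `alloc_or_die` and stay free of error handling; only the one routine that adapts to available memory needs `try_alloc`.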

Conclusion

Making out-of-memory errors fatal by default is a practical language design choice, especially for larger systems. There are alternative design strategies for programs that might otherwise try to recover from out-of-memory errors, strategies that are arguably superior and not a compromise. But it’s still worthwhile to have an alternate method of allocating memory that offers some sort of recovery ability for the unusual cases that demand it.

Acknowledgements

I am indebted to David Held, Jason House, Bartosz Milewski and Andrei Alexandrescu for their helpful comments and suggestions. Any remaining errors are entirely my fault.
