Arrays: Core or Library Type?
written by Walter Bright
May 28, 2008
As programming languages provide ever more powerful ways to specify user defined types, where do arrays fit in? Are arrays a fundamental type that must be in the core language, or are they a type that can be built up from other core types, and so be a user defined library type?
In languages that have pointers, there are several ways in which an array of type T could be represented, such as (in D programming language notation):
- length and pointer to the array data:
struct Array { size_t length; T* begin; }
- pointer pair representing the start and end of the array:
struct Array { T* begin; T* pastEnd; }
- length prefixed array data:
struct Array { size_t length; ... data ... }
- slice out of memory block:
struct Array { void* datablock; T* begin; size_t length; }
Add in some operator overloading for [ ] and other boilerplate, and there’s a nice, servicable user defined array type.
Or is it?
Consider it from the perspective of a tool that is analyzing the types. That tool could be the compiler, the optimizer, a debugger, the reflection/introspection facilities, the garbage collector, the security auditor, etc. Some of these tools will collect information at compile time (static analysis) while others at runtime (dynamic analysis). The tool will see the pointer(s), and that’s it. It will have no idea that it is pointing to an array, how many elements are in that array, or even if the pointer is pointing off the end of an array and is not pointing at any valid data at all. It hits an impermeable wall in its attempt to discover things about the program both at compile time and at run time.
There are a couple of ways one might break through that barrier. The first is for the tool to recognize the struct Array as something special that the tool has special knowledge of. While this works, it flies in the face of the whole point of a user defined array type. The user cannot change how the Arrays work, they are essentially fixed and become core types anyway.
The second way is to establish conventions for a user defined aggregate type like begin(), end(), and an iterator type. Then the tool looks for those names and uses them to walk the array type. The problem is that the code for begin(), end(), etc., can be arbitrarily complex, and so the tool must now be able to execute code. This requires a massive increase in the implementation complexity of the tool, as compared with the pretty simple case of just having a built in array type. Worse, the tool will have to use this strategy for every aggregate because it cannot tell the difference between an array and any other collection type. Dealing with this will also unacceptably slow down things like type aware garbage collectors that need to quickly walk a runtime data structure.
In addition, consider the problems an introspection facility must face:
- A generic swap needs to be able to ‘crack’ open the type and look at its composition. It can’t do anything with a pointer.
- A deep copy can be implemented generically with core arrays. But with user defined arrays, the generic facility will see the pointer, but will have no idea how that relates to the array. The memory held by the array object cannot be transitively accessed.
- A move can be implemented generically with core arrays. But how would it be done with a pointer?
- Many generic algorithms rely on being able to tell if two memory blocks are distinct or are overlapping, or whether they have interior references. (For example, a generic move, copy or swap would need to detect interior references so they can be fixed up.) Relying on begin(), end() and an iterator does not give a clue as to whether the memory is continuous or not.
- Templates parameterized on values (as opposed to types) have been shown by C++ to be very useful and popular. Such templates could be easily specialized on a built—in array value (with strings as an obvious candidate). But without built—in arrays, the proposition becomes much more difficult as the template engine must impose its own convention for passing pointers and sizes around.
Another problem with user defined arrays is that their implementations require pointer arithmetic. This means that the array implementations must necessarily be outside any memory safe subset of the language. The industry and academia both agree that the Trusted Computing Base (i.e. code that is not demonstrably safe and must be considered safe “by faith”) should be minimal in size so as to reduce bug incidence. See for example Daniel Bernstein’s notes about the exceptionally safe qmail program or Nickolai Zeldovich’s work.
Other collection types, such as linked lists, binary trees, etc., do not have this issue because they can be built up out of safe references and arrays. Arrays cannot be, and so they should properly be put into the core language.
Acknowledgements
This article was inspired and contributed to by Andrei Alexandrescu.