digitalmars.D - Dataflow analysis requirements document
- Richard (Rikki) Andrew Cattermole (636/636) Jun 15
A few months ago I wrote up a bit about what I thought the user experience should be in using a DFA engine tied to the language. This was originally meant for Walter, and he has read it; I am now making it public, which was always intended. It was presented at a monthly meeting.

Since then I have been working on a DFA engine, described in this document as the fast DFA. Currently, its scope of work is focused on truthiness and nullability, but should it succeed in this role more features would be added. It has been approved for merging once it is ready, under the condition that it is self-contained in its implementation, so no attributes. It will be behind a preview switch. As experimental code, it can be expected to fail in code that isn't my own.

For those who have not been watching the progress on Discord, here are a few snippets that currently work:

```d
void loopy7() {
    int* ptr = new int;

    foreach (i; 0 .. 2) {
        if (ptr !is null) {
            int val1 = *ptr; // ok
        }
        ptr = null;
    }

    ptr = new int;

    foreach (i; 0 .. 2) {
        if (ptr !is null) {
            int val1 = *ptr; // ok
        }
        int val2 = *ptr; // error
        ptr = null;
    }
    // See less than state at testdfa\start.d(262)
    // Due to variable `ptr` was non-null and becomes null
}

void loopy8() {
    int* ptr = new int;
    bool b = ptr !is null;

    foreach (i; 0 .. 2) {
        assert(b); // outdated consequences
        ptr = null; // error
    }
    // Variable `ptr` was assumed due to assert of variable `b` which gets invalidated at testdfa\start.d(272)
}

void nested1() {
    static void nested2() {
        int* ptr;
        int v = *ptr; // Dereference on null `ptr` at testdfa\start.d(280)
    }

    int* ptr;

    void nested3() {
        int v = *ptr;
    }

    nested2;
    nested3;
}
```

Function calls and inference are not working just yet, and my big codebase isn't compiling with unknown AST detection turned on.

Here is the document:

In the following sections, examples are given of requirements that new data flow analysis-based language features should attempt to solve for the D programming language.
They are not exhaustive, nor do they introduce syntax that must be utilized by future proposals. An overview of the following sections:

- Attributes
  - Locally provable: annotations are not required for analysis to be beneficial to all D users.
  - Attributes extractable: attributes complicate matters when inferred.
  - Explicit attributes: are useful even for those who don't like using them.
  - Inference of attributes results in less: when inferred, fewer attributes will be in your code.
  - Prove what is true or guarantee it: attributes come in one of two forms; either they exist because something was proven by inference, or because the guarantee was requested by the user.
  - Defaults matter: if you need to opt into catching obvious things, it isn't right. It does not need to catch everything to be useful.
  - Absence means it was not analyzed: the distinction between an absent attribute and a provided attribute is useful in itself.
  - All or nothing, or something in between: sometimes a feature should be all on or all off; other times it makes sense to make it selective.
- Type state analysis: not all operations should be allowed on a variable; dereferencing a null pointer is inherently incorrect behaviour that can have significant side effects in the usage of D.
- Preventing aliasing: the difficulty in aliasing, and how ownership transfer with reference counting offers useful guarantees.
- Escape analysis: establishes relationships between variables that enable applying restrictions.
  - Relationship strength: a relationship alone doesn't tell you what restrictions should be placed upon it; they must be specified in addition.
  - Extra information: once established, relationships allow for propagating other information, such as an output holding GC memory.

With information only available local to a function, there are plenty of things that can be checked for without the use of annotations. For example, we can tell that the variable ``ptr`` is never assigned to and will be ``null``.
Yet in the example, it was dereferenced. This is a clear bug that could never be correct.

```d
int* ptr;
*ptr = 2;
```

Catching this may seem like a waste of the compiler's time; however, this is a very simple case. Now consider a 200-line function with branches. How certain are you that you'll get it right without 100% code coverage? A benefit to locally provable things is that at no point were annotations required.

But it doesn't end at a simple non-null check; it can work for other type states! In the following example the variable ``buffer`` is not initialized, which means it cannot be read from. At no point did we need to understand what ``writeln`` does; everything needed came from the argument expression, and what is in it is an error.

```d
ubyte[24] buffer = void;
writeln(buffer[22]);
```

Inferred attributes do not assist in enabling separate compilation; instead, they cause numerous issues and can potentially slow down a build process. Without attributes, it would be necessary to do a whole-program analysis, which is not possible with separate compilation. Therefore DFA attributes must:

1. Not be part of the mangling.
2. Be extractable and passed into a new phase of compilation to apply.

If an attribute is part of mangling, it would take only one aspect of it being wrong to make a symbol no longer link. This may appear to be a feature, but it does not aid in encouraging usage. It could unintentionally mean that inference silently causes breakage without any obvious reason in the source.

Once inferred, an attribute must be extractable and then passable to the next phase of compilation. Without this, neither shared nor static libraries (where the source is not available) can work with DFA-based language features without being explicitly annotated in their entirety. For shared and static libraries where the entire source code is available, it matters not that they are in such artifacts.
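As a hypothetical illustration of what such extraction could look like (the function and file names here are invented, not part of the document's proposal), the attributes proven by inference would be made explicit in the generated interface that the next phase of compilation consumes:

```d
// library.d — as written by the author, no annotations
void fill(int* ptr) {
    *ptr = 99;
}

// library.di — as extracted after inference, with `scope` now explicit
void fill(scope int* ptr);
```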
But it does require that import paths be run through the DFA as well as the source code being compiled, which will slow things down.

To extract the inferred attributes, the .di generator could be used. However, it currently runs before semantic analysis, making it incapable of doing this. Furthermore, the rewriting of D features has no path back through the parse tree, nor can it be fully expressed using D (comma expressions and anonymous classes).

Regardless of what attributes the compiler can prove to exist, there are times when you want to make stronger guarantees to callers, or to yourself. In the following example, the two functions are identical, except the latter guarantees to the caller that the parameter will not be escaped. Naturally, with inference, they would both be annotated as the latter.

```d
void unknownEscaping(int* ptr) {
    *ptr = 99;
}

void nonEscapingExplicit(scope int* ptr) {
    *ptr = 99;
}
```

This may be needed to assist in separate compilation guarantees to callers, or where inference may have failed (or was not attempted due to language features like ``@system``).

Both attributes and their inference have some consequences, but they also offer some nice benefits. For example, in a previous example, we showed that ``scope`` on a parameter for a given function could be inferred.

```d
void unknownEscaping(int* ptr) {
    *ptr = 99;
}
```

The compiler in this case could prove that the variable ``ptr`` doesn't escape the function. So the parameter should be ``scope``. But why didn't the user write it? Do they not know about escape analysis? Do they not want attributes in their code? It doesn't matter why; it wasn't written, nor should it be expected of a human being to write what the compiler can prove.

Following on from the earlier mention of separate compilation and how attributes must be extractable, being able to extract the inferred attributes means that only the undeterminable attributes (i.e.
when function pointers are involved) would need to be written by a human.

When a user of D writes an attribute for code that is being compiled, it is because they want to guarantee that this behaviour is true; however, that may not be the complete story in the analysis of this aspect of their code. For example, the parameter ``input`` is marked as ``scope`` in ``heapOrStack``, yet this parameter gets escaped into heap memory. Should this cause an error? No!

```d
void heapOrStack(scope int* input) {
    int** heap = new int*;
    *heap = input;
}
```

In this case, only local information is required to determine that the ``heap`` variable should be promoted to the stack, and then limited in its lifetime. What this indicates is that growth alone does not indicate memory escaping. Escape analysis must consider the escape sets after convergence, not during it.

Furthermore, if the user had never written the ``scope`` on the parameter ``input``, the compiler could still prove in this case that ``input`` never escaped. The difference between having written it and not is whether the user wanted the guarantee that it does not escape, versus the compiler merely being able to prove it.

```d
void stillNoEscapes(int* input) {
    int** heap = new int*;
    *heap = input;
}
```

If the user needs to write an attribute when function pointers are not in play, it is either a bug or the DFA cannot model the problem (which may not be a bug). Writing an attribute should be tuned so that those who want the _guarantee_ of that attribute can have it, but the average person does not _need_ to write it.

As for what analysis is applied to a function by default (with no discussion of editions here): if it does not cover the things that can be proved locally to a function with 100% certainty of incorrect behaviour, then the defaults are not correct. Currently, in D the default is ``@system``, with checks being opt-in. This is not the correct default given the previous statement. However, within "correct default" is a fairly large range of behaviours.
Changing the default to be ``@safe``, or to an intermediary, falls within this range.

Local information may not be the only verifiable information in a function. If the required information is available (either inferred or explicitly attributed), then errors should be emitted when incorrect code is detected. But this isn't the complete story: without this information, and without a DFA capable of handling the full control flow graph, it is not possible to prove anything related to a variable after it has become unanalyzable. This leads to an important understanding of the default DFA: it must have the ability to give up on a variable. This prevents the requirement of all D code being annotated. As a consequence, it has the potential to be significantly faster than if it had full control flow graph support.

If an attribute exists, it is something that has been, or will be, proven to be true if compilation succeeds. If an attribute has not been provided, then the absence must imply it couldn't be analyzed.

Consider the ``scope`` attribute. It tells the compiler that this variable, when converged successfully at the end of its scope, will have an empty escape set. This gives us the annotation ``@escape()``. However, this is not enough; we still need a way of differentiating between a variable potentially escaping into an unknown location and a variable that hasn't been analyzed. An annotation that could be used for this is ``@escape(__unknown)``; note how this annotation is functioning as a set. This gives us three states a variable can be in:

1. No set
2. Set is empty
3. Set contains ``__unknown``

An example of the usefulness of this three-state positive annotation approach can be seen with ``printf``. As a non-D function it would not be analyzed by a D DFA and therefore could not be called; it would need to be explicitly attributed.
In this example, the default DFA, as per the ``Defaults Matter`` section, would give up on and not emit errors for function calls that are not attributed, allowing the call to ``printf`` to work in the ``absentAnnotations`` function.

```d
extern(C) int printf(const char*, ...) @trusted;

void absentAnnotations() @safe {
    int* ptr;
    printf("something %p\n", ptr);
}
```

If either the absence meant something (such as ``@escape(__unknown)``) or the default DFA did not give up, this code would not compile.

The invasiveness of a language feature greatly impacts its usage. Take ``@live`` as an example: three out of the 65 respondents to the ``memory safety in D what is your view?`` survey said that they use it. As a feature, ``@live`` is opt-in on a per-function basis, with no transitive guarantees or verification that it is applied to a graph of functions. This is not what the prevention of aliasing is meant for.

Prevention of aliasing, either for lifetimes or for code generation purposes, is object-specific behaviour, not function-specific. In both use cases, it would need to be specified per variable and transitively applied to any location the object is transferred into. Not transitively applying it breaks the guarantee in any use case where lifetimes or code generation are concerned, leading to minimal adoption. As a third option, such a feature could be applied to all variables in every function; as an ownership transfer prevention mechanism this would cause great breakage in D.

What this demonstrates is that for each DFA-based language feature, a choice has to be made as to its invasiveness. Some features may be best turned on for everything, or for a subset of functions. Others benefit from being on per object in what guarantees they offer.

For a given object stored within a variable, each has a type state that dictates the accessibility and manipulation possibilities for it.
For example, in ``@safe`` code, it should not be possible to read from an uninitialized variable.

```d
int var = void;
writeln(var); // Error
```

Or to dereference a null pointer in any function:

```d
int* ptr;
writeln(*ptr); // Error
```

To prevent this, each variable is given a type state property that has one of the following built-in type states (although there may be user-defined ones):

1. Unreachable: you cannot do anything with it.
2. Reachable: can be written to, which will increase its type state. May be explicitly opted into with ``= void``.
3. Initialized: can be read or written. This is the default in D.
4. Non-null: can be dereferenced.

It is affected by scope, so that checks can affect the type state seen within:

```d
int* ptr = ...; // `ptr` is in the initialized type state, not non-null (unknown nullability)

if (ptr !is null) {
    int v = *ptr; // ok
} else {
    int v = *ptr; // Error: variable `ptr` is null, it cannot be dereferenced.
}
```

When variables converge, the lowest type state wins.

```d
int* ptr = ...; // `ptr` is in the initialized type state, not non-null (unknown nullability)

if (ptr !is null) {
}

int v = *ptr; // Error: variable `ptr` could be null, it cannot be dereferenced.
```

Type state analysis utilises variable tracking, not value tracking, in determining what the type state is, although value tracking may be used to increase the type state for all variables that contain an object within scope.

When used, it acts as a first line of defence against program, data, and operating-system corruption-level events, as well as against the introduction of unsafe logical behaviour. Its job is to prevent these events from happening, not to stop them from making things worse once they have occurred. Without this capability, especially in a multithreaded environment, it is too easy to corrupt the data that the program is operating on, or to trigger operating-system-level behaviour that prevents further usage of the program or machine.
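The convergence rule above ("when variables converge, the lowest type state wins") can be sketched as taking the minimum over a totally ordered set of states. This is a minimal illustration, not the engine's actual implementation; the state names mirror this document's list:

```d
import std.algorithm : minElement;

// Built-in type states, ordered from lowest to highest.
enum TypeState
{
    unreachable,
    reachable,
    initialized,
    nonNull,
}

// Convergence of a variable's state across branches: the lowest wins.
TypeState converge(TypeState[] incoming...)
{
    return incoming.minElement;
}

unittest
{
    // After an `if (ptr !is null) { }` with no else, the branch where
    // nothing is known converges in, so the dereference is not allowed.
    assert(converge(TypeState.nonNull, TypeState.initialized) == TypeState.initialized);
}
```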
If uptime is a consideration, unrequested program shutdown is a major risk to a company's ability to use D.

Unknown locations (variables outside of a function) should be assumed to have the type state initialized, even after a null check. The reason for this is that in the interim between the check and further access, the value could have changed.

```d
__gshared int* global;

// thread 1
assert(global !is null);

// thread 2
global = null;

// thread 1
int v = *global; // Null dereference!!!
```

The way to resolve this, ideally with syntax sugar, is to perform a load.

```d
if ((auto thing = global) !is null) {
    // *thing ok
}
```

The precision of type state analysis may be finer-grained than a single variable. It may allow for some indirection, for instance static arrays, structs, ``isolated``, ``immutable``, or tuples.

```d
int*[4] vals;

assert(vals[2] !is null);
int v = *vals[2]; // ok
int w = *vals[1]; // Error: variable `vals` entry 1 could be null, cannot dereference
```

Problematic aliasing is when an object is in multiple variables within a function without the static analyzer knowing that this has occurred. This is problematic for code generation, and the performance penalties may not be desired. Therefore prevention may be required.

In preventing aliasing, a language feature like isolated from Midori may be used. The basic premise is that going into or out of a function takes place with a transfer. A transfer is when the old variable(s) are destroyed and the previous value is copied into the next location. There is a set of problems associated with preventing aliasing:

1. A transfer must be an atomic transaction, without necessarily being an atomic instruction. This prevents a destructor, copy, or move constructor call; see the compare-and-swap atomic instruction for what this should be implemented as when moving into and out of unknown locations.
2.
All function parameters must be assumed to alias. ``@restrict`` can be used to indicate that this is not the case, but it is not implemented by dmd and there is no enforcement on the caller side.
3. Limited to any type that can be supported by the compare-and-swap atomic instruction. Usually, this supports sizes of one and two pointers. Therefore it is limited to heap-allocated memory, or stack memory that has had a pointer taken to it.

To account for these requirements, an ownership transfer mechanism is required. This mechanism is annotated per object, which can be done as a type qualifier (such as ``isolated(T)``). When a transfer occurs, the old variable must be put into its initialized type state; it must not hold a value. A transfer is very similar to a move, but it cannot have a function called on the old variable in response to, or to assist, the move operation. The compiler must guarantee that no matter how many variables an object is within, it will be notified only one time that its end-of-life event has occurred; this is especially important when using reference counting with isolated.

```d
struct S {
    void opRCInc();
    void opRCDec();
}

void takesIsolatedIn1(isolated(S*) input) {
    isolated(S*) temp = input;
    input = typeof(input).init;
    temp.opRCDec();
}

void takesIsolatedIn2(isolated(S*) input) {
    isolated(S*) temp = input;

    // elided as the object is in both `temp` and `input`
    // temp.opRCDec();
    input.opRCDec();
}
```

For this to function it requires value tracking. Both reference counting and isolated require it to function. For reference counting it is used for eliding pairs of increment and decrement calls. For added optimization opportunities when a function call occurs, the increment happens within the callee, not the caller.
```d
struct S {
    void opRCInc();
    void opRCDec();
}

// the pair of RC calls may be elided in this example
void rcCallee(S input) {
    input.opRCInc();
    input.opRCDec();
}

void rcCaller() {
    S s = ...;
    rcCallee(s);
    s.opRCDec();
}
```

Of note is that reference counting establishes safe access and mutation of memory at runtime, rather than at compile time like isolated does. When paired together, isolated guarantees the reference count will only ever be one or zero. Reference counting offers a hook for when that count reaches zero, while allowing multiple references within a function body. Coupled with escape analysis, this offers a flexible method of memory management when doing data processing that enables the best code generation with regard to aliasing.

Any access to a variable post-transfer (or post-move) should be prevented. This will likely indicate a logical bug in the code. However, if a non-null pointer check occurs, then access in that branch can be allowed.

```d
isolated(int*) obj = new int;

foreach (_; 0 .. 10) {
    if (random() > 0.5) {
        transfer(obj); // uh oh! error
    } else {
        obj = new int;
    }

    if (obj !is null) {
        transfer(obj); // all ok!
    }
}
```

Alternatively, type state analysis can be used for a simpler implementation, wherein a transfer puts all variables an object is in into the type state reachable, preventing all reads.

The establishment of a relationship between variables is a useful bit of information to have. With this relationship, for each input to its respective output, there is a strength that determines what restrictions are placed upon the relationship. These relationships form a kind of upside-down tree, which can be seen in the following example:

```d
int* escapeThese(int* a, int* b, int* c, int* d) {
    int* middle1 = random() > 0.5 ? a : b;
    int* middle2 = random() > 0.5 ? c : d;
    int* bottom = random() > 0.5 ?
        middle1 : middle2;
    return bottom;
}
```

In that example, there is only one scope and no interesting language features that would complicate what the data flow analysis must support. What it is meant to show is how the tree affects the cleanup process of a single scope. Given that the DFA has been run forward, now it's time to do the backwards cleanup at the end of the scope. It occurs in the following order:

1. The return value, if not annotated, receives the ``bottom`` variable's state; otherwise the annotated state of the return value must match or exceed that of the ``bottom`` variable.
2. ``bottom`` then has its state bubbled up to ``middle1`` and ``middle2``.
3. ``middle2`` then has its state bubbled up to ``c`` and ``d``.
4. ``middle1`` then has its state bubbled up to ``a`` and ``b``.
5. Finally ``a``, ``b``, ``c`` and ``d``. If they are not annotated, then the parameters receive the state from the variable; otherwise, the annotated state of the parameter must match or exceed that of the variable.

For multiple scopes, convergence takes place at the end of the scope with its parent, or when other events occur such as returning from the function.

An important piece of feedback that came from the survey ``memory safety in D what is your view?`` was that the annotation that DIP1000 uses for escapes is too confusing. So let's break down what parts there are to the annotation:

1. The escape set: defines the relationship between an input variable and its outputs.
2. The relationship strength: what establishes the relationship between input and output, and how that relationship should be restricted.

Unfortunately, DIP1000 provides only a limited escape set: it supports the empty escape set (``scope`` by itself), and ``return`` + ``scope`` means that the input escapes into either the return value or the first parameter (which could be the this pointer). There is no attribute for stating that it escapes into the unknown; see ``Absence Means It Was Not Analyzed`` for why this matters.
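As an illustrative contrast, the two escape sets DIP1000 can express today look like the following; the commented lines use this document's proposed ``@escape`` spelling, which is not implemented:

```d
// DIP1000 today: the input does not escape at all (empty escape set)
int* keep(scope int* p) {
    *p = 1;
    return null;
}

// DIP1000 today: the input may escape into the return value
int* pass(return scope int* p) {
    return p;
}

// The proposed positive annotation spells the set out explicitly, and
// adds the missing third state for escaping into the unknown:
// int* pass(@escape(return) int* p) { return p; }
// void leak(@escape(__unknown) int* p);
```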
To make matters worse, the relationship strength is determined by the order of the attributes ``return`` and ``scope``.

The syntax this document uses is of the form ``@escape(output S, ...)``, where ``S`` is the strength. At the minimum, the relationship strengths needed are:

1. Assignment: ``=``
2. Taking a pointer to the input and storing it in the output (which could be a field); the input must outlive the output: ``&``
3. Protecting the input from mutation as long as the output exists; the input must outlive the output: ``^``

When both the input and the output are by-ref, the relationship will default to taking a pointer (``&``); otherwise the default is assignment (``=``). This default matters; see ``Defaults Matter`` for why. The goal of having a default for the relationship strength is to prevent the need for it to be specified by the user.

Relationships between variables facilitate the transmission of extra information that could be useful in other analyses. For example, knowing that a variable contains GC memory with no restrictions placed upon it means it can be escaped freely. In the following example, the ``escapableGC`` function has one parameter, and this parameter contributes to the return value. There are no other parameters; therefore, if the argument to that parameter is GC memory, so is the output.

```d
int* escapableGC(@escape(return) int* ptr);

int* someVal = escapableGC(new int);
return someVal; // perfectly safe as `someVal` is GC
```

While this tells us the behaviour of parameters, what if there are none? Well, we'd need to annotate a given output with this extra information.

```d
@gcout int* outGC() {
    return new int;
}
```

The purpose of this extra information is twofold. It allows code that would otherwise be considered unsafe to be known to be safe. And it helps to prevent bad behaviour without requiring attributes to go down the call stack, which is very invasive.
An example of catching bad behaviour would be preventing TLS memory from crossing a coroutine's yield point. This helps to prevent some coloring of functions.

```d
int tls;

@tlsout int* grab() {
    return &tls;
}

void co() async {
    int* ptr = grab();

    async return;

    int v = *ptr; // Error: the variable `ptr` contains TLS memory and cannot cross
                  // a coroutine yield point due to the possibility of being on a
                  // different thread.
    return;
}
```

- [Report: Analysis of survey: Memory safety in D, what is your view?](https://forum.dlang.org/post/kijxuuxgtyzciuhjanrq@forum.dlang.org)
- [What color is your function](https://journal.stuffwithstuff.com/2015/02/01/what-color-is-your-function/)