digitalmars.D - Dataflow analysis requirements document

A few months ago I wrote up my thoughts on what the user 
experience of a DFA engine tied to the language should be. It was 
originally meant for Walter, who has read it, and it was 
presented at a monthly meeting. I am now making it public, which 
was always the intention.

Since then I have been working on a DFA engine described in this 
document as the fast DFA.

Currently its scope is focused on truthiness and nullability, but 
should it succeed in this role, more features would be added.

It has been approved for merging once it is ready, under the 
condition that it is self-contained in its implementation, so no 
attributes. It will be behind a preview switch. As experimental 
code, it can be expected to fail on code that isn't my own.

For those who have not been watching the progress on Discord, 
here are a few snippets that currently work:

```d
void loopy7()
{
     int* ptr = new int;

     foreach (i; 0 .. 2)
     {
         if (ptr !is null)
             int val1 = *ptr; // ok

         ptr = null;
     }

     ptr = new int;

     foreach (i; 0 .. 2)
     {
         if (ptr !is null)
             int val1 = *ptr; // ok

         int val2 = *ptr; // error
         ptr = null;
     } // See less than state at testdfa\start.d(262)
       //     Due to variable `ptr` was non-null and becomes null
}

void loopy8() {
     int* ptr = new int;
     bool b = ptr !is null;

     foreach(i; 0 .. 2) {
         assert(b); // outdated consequences
         ptr = null; // error
     } // Variable `ptr` was assumed due to assert of variable `b`
       // which gets invalidated at testdfa\start.d(272)
}

void nested1()
{
     static void nested2()
     {
         int* ptr;
         int v = *ptr; // Dereference on null `ptr` at testdfa\start.d(280)
     }

     int* ptr;

     void nested3()
     {
         int v = *ptr;
     }

     nested2;
     nested3;
}
```

Function calls and inference are not working just yet, and my big 
codebase isn't compiling with unknown AST detection turned on.

Here is the document:





The following sections give examples of requirements that new 
data flow analysis-based language features should attempt to 
solve for the D programming language. The list is not exhaustive, 
nor does it introduce syntax that must be used by future 
proposals.

An overview of the following sections:

- Attributes
	- Locally provable: annotations are not required for analysis to 
be beneficial to all D users.
	- Attributes extractable: attributes complicate matters when 
inferred.
	- Explicit attributes: are useful even for those who don't like 
using them.
	- Inference of attributes results in less: when attributes are 
inferred, fewer of them will appear in your code.
	- Prove what is true or guarantee it: attributes come in one of 
two forms; either something was proven by inference, or the 
guarantee was requested by the user.
	- Defaults matter: if you need to opt into catching obvious 
things, it isn't right. It does not need to catch everything to 
be useful.
	- Absence means it was not analyzed: the distinction between an 
absent attribute and a provided attribute is useful in itself.
- All or nothing, or something in between: sometimes a feature 
should be all on or all off; other times it makes sense to make 
it selective.
- Type state analysis: not all operations should be allowed on a 
variable; dereferencing a null pointer is inherently incorrect 
behaviour that can have significant side effects in the usage of 
D.
- Preventing aliasing: the difficulty in aliasing and how 
ownership transfer with reference counting offers useful 
guarantees.
- Escape analysis: establishes relationships between variables 
that enable applying restrictions.
	- Relationship strength: a relationship alone doesn't tell you 
what restrictions should be placed upon it; they must be 
specified in addition.
	- Extra information: once established, relationships allow for 
propagating other information, such as whether an output holds GC 
memory.





## Attributes

### Locally Provable

With only the information available locally to a function, there 
are plenty of things that can be checked without the use of 
annotations.

For example, we can tell that the variable ``ptr`` below is never 
assigned to and will be ``null``, yet it is dereferenced. This is 
a clear bug that could never behave correctly.

```d
int* ptr;
*ptr = 2;
```

Catching this may seem like a waste of the compiler's time; 
however, this is a very simple case. Now consider a 200-line 
function with branches. How certain are you that you'll get it 
right without 100% code coverage?
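
As a hedged sketch of the harder case (the function and its 
condition are invented for illustration), even a little branching 
is enough to hide the same provably-null dereference:

```d
void branchy(bool cond) {
	int* ptr;

	if (cond)
		ptr = new int;

	if (!cond)
		*ptr = 2; // on this path `ptr` was never assigned
}
```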

A benefit of locally provable things is that at no point are 
annotations required. And it doesn't end at a simple non-null 
check; it works for other type states too!

In the following example the variable ``buffer`` is not 
initialized, which means it cannot be read from. At no point did 
we need to understand what ``writeln`` does; everything needed 
came from the argument expression, and what is in it is an error.

```d
ubyte[24] buffer = void;
writeln(buffer[22]);
```



### Attributes Extractable

Inferred attributes do not assist in enabling separate 
compilation; instead, they cause numerous issues and can 
potentially slow down a build. Without attributes at all, 
whole-program analysis would be necessary, which is not possible 
with separate compilation.

Therefore DFA attributes must:

1. Not be part of the mangling.
2. Be extractable and passable into a new phase of compilation to 
apply.

If an attribute is part of mangling, it would take only one 
aspect of it being wrong for a symbol to no longer link. This may 
appear to be a feature, but it does not encourage usage. It could 
mean that inference silently causes breakage without any obvious 
reason in the source.

Once inferred, an attribute must be extractable and then passable 
to the next phase of compilation. Without this, neither shared 
nor static libraries (where the source is not available) can work 
with DFA-based language features without being explicitly 
annotated in their entirety.

For shared and static libraries where the entire source code is 
available, it does not matter that they are in such artifacts. 
But it does require that the import paths be run through the DFA 
as well as the source code being compiled, which will slow things 
down.

To extract the inferred attributes, the .di generator could be 
used. However, it currently runs before semantic analysis, making 
it incapable of doing this. Furthermore, the rewriting of D 
features has no path back through the parse tree, nor can it be 
fully expressed in D (comma expressions and anonymous classes).
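
As a sketch of what extraction could look like (the function is 
illustrative, not something the compiler produces today), the 
inferred attributes would be written out explicitly in a 
generated interface, ready for the next phase of compilation to 
consume:

```d
// implementation, compiled with inference enabled
int* identity(int* ptr) {
	return ptr;
}

// hypothetical extracted interface: the escape information the
// DFA inferred is now explicit for consumers of the library
int* identity(return scope int* ptr);
```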



### Explicit Attributes

Regardless of which attributes the compiler can prove to hold, 
there are times when you want to make stronger guarantees to 
callers, or to yourself.

In the following example, the two functions are identical except 
that the latter guarantees to the caller that the parameter will 
not be escaped. Naturally, with inference, both would be 
annotated as the latter.

```d
void unknownEscaping(int* ptr) {
	*ptr = 99;
}

void nonEscapingExplicit(scope int* ptr) {
	*ptr = 99;
}
```

This may be needed to assist separate compilation guarantees to 
callers, or where inference may have failed (or not been 
attempted due to language features like ``@system``).



### Inference of Attributes Results in Less

Both attributes and their inference have some consequences, but 
they also offer some nice benefits.

For example, we showed previously that ``scope`` on a parameter 
of a given function could be inferred.

```d
void unknownEscaping(int* ptr) {
	*ptr = 99;
}
```

The compiler in this case can prove that the variable ``ptr`` 
does not escape the function, so the parameter should be 
``scope``. But why didn't the user write it? Do they not know 
about escape analysis? Do they not want attributes in their code? 
It doesn't matter why; it wasn't written, nor should a human 
being be expected to write what the compiler can prove.

Following on from the earlier point that attributes must be 
extractable for separate compilation, being able to extract the 
inferred attributes means that only the undeterminable ones (e.g. 
when function pointers are involved) would need to be written by 
hand.



### Prove What Is True or Guarantee It

When a user of D writes an attribute on code that is being 
compiled, it is because they want to guarantee that the behaviour 
is true. However, that may not be the complete story in the 
analysis of this aspect of their code.

For example, the parameter ``input`` is marked as ``scope`` in 
``heapOrStack``, yet it escapes into heap memory. Should this 
cause an error? No!

```d
void heapOrStack(scope int* input) {
	int** heap = new int*;
	*heap = input;
}
```

In this case, only local information is required to determine 
that the ``heap`` variable should be promoted to the stack and 
then limited in its lifetime. What this indicates is that growth 
of an escape set alone does not mean memory escapes. Escape 
analysis must consider the escape sets after convergence, not 
during it.

Furthermore, had the user never written ``scope`` on the 
parameter ``input``, the compiler could still prove in this case 
that ``input`` never escapes. The difference between having 
written it and not is whether the user wanted the guarantee that 
it does not escape, versus the compiler merely being able to 
prove it.

```d
void stillNoEscapes(int* input) {
	int** heap = new int*;
	*heap = input;
}
```

If the user needs to write an attribute when function pointers 
are not in play, it is either a bug or the DFA cannot model the 
problem (which may not be a bug). Writing an attribute should be 
tuned so that those who want the _guarantee_ of that attribute 
can have it, but the average person does not _need_ to write it.
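
To illustrate the other side of this (a hypothetical example, 
assuming the DFA described in this document): when the user asks 
for the ``scope`` guarantee and the parameter genuinely escapes, 
the compiler must refuse.

```d
int* leaked;

void wantsGuarantee(scope int* input) {
	// Error: `input` escapes into the global `leaked`,
	// violating the requested `scope` guarantee
	leaked = input;
}
```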



### Defaults Matter

Whatever analysis is applied to a function by default (setting 
aside editions here), if it does not cover the things that can be 
proven locally, with 100% certainty, to be incorrect behaviour, 
then the default is not correct.

Currently the default in D is ``@system``, with checks being 
opt-in. Given the previous statement, this is not the correct 
default.

However, within "correct default" lies a fairly large range of 
behaviours. Changing the default to ``@safe`` or to an 
intermediary falls within it.

Local information may not be the only verifiable information in a 
function. If the required information is available (either 
inferred or explicitly attributed), then errors should be emitted 
when incorrect code is detected. But this isn't the complete 
story: without this information, and without a DFA capable of 
handling a complete control flow graph, it is not possible to 
prove anything about a variable after it has become unanalyzable.

This leads to an important requirement for the default DFA: it 
must have the ability to give up on a variable. This prevents the 
requirement that all D code be annotated. As a consequence, it 
has the potential to be significantly faster than if it had full 
control flow graph support.
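
A minimal sketch of the give-up behaviour, assuming an 
unannotated function whose source is unavailable (both functions 
are invented for illustration):

```d
// no attributes, no source: nothing can be inferred
void mystery(ref int* ptr);

void caller() {
	int* ptr = new int;
	int a = *ptr; // ok: provably non-null

	mystery(ptr); // `ptr` becomes unanalyzable here

	int b = *ptr; // no error: the DFA has given up on `ptr`
}
```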



### Absence Means It Was Not Analyzed

If an attribute exists, it is something that has been, or will 
be, proven true if compilation succeeds. If an attribute has not 
been provided, then its absence must imply that it couldn't be 
analyzed.

Consider the ``scope`` attribute. It tells the compiler that this 
variable, when converged successfully at the end of its scope, 
will have an empty escape set.

This gives us the annotation ``@escape()``.

However, this is not enough; we still need a way of 
differentiating between a variable potentially escaping into an 
unknown location and a variable that hasn't been analyzed.

An annotation that could be used for this is 
``@escape(__unknown)``; note how this annotation functions as a 
set. This gives us three states a variable can be in (illustrated 
after this list):

1. No set
2. Set is empty
3. Set contains ``__unknown``
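
On declarations, the three states could look like this (a sketch 
using the hypothetical ``@escape`` annotation):

```d
// 1. No set: the parameter was never analyzed
void notAnalyzed(int* ptr);

// 2. Empty set: proven (or guaranteed) to escape nowhere
void provenLocal(@escape() int* ptr);

// 3. Set contains __unknown: escapes somewhere unanalyzable
void escapesSomewhere(@escape(__unknown) int* ptr);
```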

An example of the usefulness of this three-state positive 
annotation approach can be seen with ``printf``. As a non-D 
function, it would not be analyzed by a D DFA and therefore could 
not be called; it would need to be explicitly attributed.

In this example, the default DFA, as per the ``Defaults Matter`` 
section, would give up on and not emit errors for calls to 
functions that are not attributed, allowing the call to 
``printf`` to work in the ``absentAnnotations`` function.

```d
extern(C) int printf(const char*, ...) @trusted;

void absentAnnotations() @safe {
	int* ptr;
	printf("something %p\n", ptr);
}
```

If either the absence meant something (such as 
``@escape(__unknown)``) or the default DFA did not give up, this 
code would not compile.



## All or Nothing, or Something in Between

The invasiveness of a language feature greatly impacts its usage. 
Take ``@live`` as an example: three out of the 65 respondents to 
the ``memory safety in D what is your view?`` survey said that 
they use it.

As a feature, ``@live`` is opt-in on a per-function basis, with 
no transitive guarantees or verification that it is applied to a 
graph of functions. This is not what the prevention of aliasing 
is meant for.

Prevention of aliasing, whether for lifetimes or for code 
generation purposes, is object-specific behaviour, not 
function-specific. In both use cases it would need to be 
specified per variable and transitively applied to any location 
the object is transferred into. Not applying it transitively 
breaks the guarantee in any use case where lifetimes or code 
generation are concerned, leading to minimal adoption.

As a third option, such a feature could be applied to all 
variables in every function; as an ownership transfer prevention 
mechanism this would cause great breakage in D.

What this demonstrates is that for each DFA-based language 
feature, a choice has to be made as to its invasiveness. Some 
features may be best turned on for everything, or for a subset of 
functions. Others benefit from being per object in the guarantees 
they offer.



## Type State Analysis

Each object, and each variable it is stored within, has a type 
state that dictates how it may be accessed and manipulated.

For example, in ``@safe`` code, it should not be possible to read 
from an uninitialized variable.

```d
int var = void;
writeln(var); // Error
```

Or to dereference a null pointer in any function:

```d
int* ptr;
writeln(*ptr); // Error
```

To prevent this, each variable is given a type state property 
that holds one of the following built-in type states (although 
there may be user-defined ones), as sketched after the list:

1. Unreachable: you cannot do anything with it.
2. Reachable: can be written to, which will increase its type 
state. May be explicitly opted into with ``= void``.
3. Initialized: can be read or written. This is the default in D.
4. Non-null: can be dereferenced.
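
As a small sketch of how writes raise a variable through these 
states:

```d
int* ptr = void; // reachable: may only be written to
ptr = null;      // initialized: may now be read or written
ptr = new int;   // non-null: may additionally be dereferenced
```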

Type state is affected by scope, so that checks can refine the 
type state seen within:

```d
int* ptr = ...;
// `ptr` is in the initialized type state, not non-null (unknown nullability)

if (ptr !is null) {
	int v = *ptr; // ok
} else {
	int v = *ptr; // Error: variable `ptr` is null, it cannot be dereferenced.
}
```

When a variable's states converge from multiple branches, the 
lowest type state wins.

```d
int* ptr = ...;
// `ptr` is in the initialized type state, not non-null (unknown nullability)

if (ptr !is null) {
}

int v = *ptr; // Error: variable `ptr` is null, it cannot be dereferenced.
```

Type state analysis uses variable tracking, not value tracking, 
to determine what the type state is, although value tracking may 
be used to raise the type state of all in-scope variables that 
contain a given object.
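
A hedged sketch of that value-tracking extension 
(``unknownPointer`` is a stand-in for any unanalyzed source of a 
pointer): because ``a`` and ``b`` are known to hold the same 
object, a check on one raises both.

```d
int* unknownPointer();

void example() {
	int* a = unknownPointer();
	int* b = a; // value tracking: `a` and `b` hold the same object

	if (a !is null) {
		int v = *b; // ok: the check on `a` raises `b` as well
	}
}
```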

When used, it acts as a first line of defence against program, 
data, and operating-system corruption-level events, as well as 
against the introduction of unsafe logical behaviour. Its job is 
to prevent these events from happening, not to stop them from 
making things worse once they have occurred. Without this 
capability, especially in a multithreaded environment, it is too 
easy to corrupt the data the program is operating on, or to 
trigger operating-system-level behaviour that prevents further 
use of the program or machine. Where uptime matters, unrequested 
program shutdown is a major risk to a company's ability to use D.

Unknown locations (variables outside of a function) should be 
assumed to have the type state initialized, even after a null 
check. The reason is that in the interim between the check and a 
further access, the value could have changed.

```d
__gshared int* global;

// thread 1
assert(global !is null);

// thread 2
global = null;

// thread 1
int v = *global; // Null dereference!!!
```

The way to resolve this, ideally with syntax sugar, is to perform 
a load into a local:

```d
if ((auto thing = global) !is null) {
	// *thing ok
}
```
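
Without the proposed sugar, the same effect can be had today with 
an explicit local (a sketch, assuming the DFA treats locals as 
unaliased by other threads):

```d
int* thing = global; // a single load into a local
if (thing !is null) {
	int v = *thing; // ok: `thing` cannot change underneath us
}
```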

The precision of type state analysis may be finer-grained than a 
single variable. It may allow for some indirection, for instance 
static arrays, structs, ``isolated``, ``immutable``, or tuples.

```d
int*[4] vals;

assert(vals[2] !is null);

int v = *vals[2]; // ok
int w = *vals[1]; // Error: variable `vals` entry 1 could be null, cannot dereference
```



## Preventing Aliasing

Problematic aliasing is when an object is in multiple variables 
within a function without the static analyzer knowing that this 
has occurred. This is problematic for code generation, and the 
resulting performance penalties may not be desired; therefore 
prevention may be required.

To prevent aliasing, a language feature like isolated from Midori 
may be used. The basic premise is that going into or out of a 
function takes place via a transfer. A transfer is when the old 
variable(s) are destroyed and the previous value is copied into 
the next location.

There are a set of problems associated with preventing aliasing:

1. A transfer must be an atomic transaction, without necessarily 
being an atomic instruction. This precludes a destructor, copy, 
or move constructor call; see the compare-and-swap sketch after 
this list for how moving into and out of unknown locations could 
be implemented.
2. All function parameters must be assumed to alias. 
``@restrict`` can be used to indicate that this is not the case, 
but it is not implemented by dmd and there is no enforcement on 
the caller side.
3. It is limited to types that can be supported by the 
compare-and-swap atomic instruction, which usually handles sizes 
of one and two pointers. It is therefore limited to 
heap-allocated memory, or stack memory that has had a pointer 
taken to it.
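
As a sketch of point 1 (``slot`` and ``transferIn`` are invented 
names; the semantics assume the isolated scheme described below), 
a transfer into an unknown location could be built on 
``core.atomic.cas``:

```d
import core.atomic : cas;

shared(int*) slot; // an "unknown location" shared between threads

// Transfer `owned` into `slot` as one atomic transaction: no
// destructor, copy, or move constructor is involved.
bool transferIn(int* owned) {
	// Only succeeds if the slot is currently empty; on success the
	// caller must reset its variable back to the init value.
	return cas(&slot, cast(int*) null, owned);
}
```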

To account for these requirements, an ownership transfer 
mechanism is required. This mechanism is annotated per object, 
which can be done with a type qualifier (such as 
``isolated(T)``).

When a transfer occurs, the old variable must be put back into 
its initialized type state; it must no longer hold the value. A 
transfer is very similar to a move, but no function may be called 
on the old variable in response to, or to assist, the move 
operation.

The compiler must guarantee that no matter how many variables an 
object is within, it will be notified only once that its 
end-of-life event has occurred. This is especially important when 
using reference counting with isolated.

```d
struct S {
	void opRCInc();
	void opRCDec();
}

void takesIsolatedIn1(isolated(S*) input) {
	isolated(S*) temp = input;
	input = typeof(input).init;
	temp.opRCDec();
}

void takesIsolatedIn2(isolated(S*) input) {
	isolated(S*) temp = input;
	
	// elided as the object in both `temp` and `input`
	// temp.opRCDec();
	
	input.opRCDec();
}
```

For this to function, value tracking is required; both reference 
counting and isolated depend on it. For reference counting, it is 
used to elide pairs of increment and decrement calls. For added 
optimization opportunities, when a function call occurs the 
increment happens within the callee, not the caller.

```d
struct S {
	void opRCInc();
	void opRCDec();
}

// the pair of RC calls may be elided in this example
void rcCallee(S input) {
	input.opRCInc();
	
	input.opRCDec();
}

void rcCaller() {
	S s = ...;
	
	rcCallee(s);
	
	s.opRCDec();
}
```

Of note is that reference counting establishes safe access and 
mutation of memory at runtime, rather than at compile time as 
isolated does. When paired together, isolated guarantees that the 
reference count will only ever be one or zero, while reference 
counting offers a hook for when that count reaches zero and still 
allows multiple references within a function body. Coupled with 
escape analysis, this offers a flexible method of memory 
management for data processing that enables the best code 
generation with respect to aliasing.

Any access to a variable post-transfer (or post-move) should be 
prevented, as it likely indicates a logical bug in the code. 
However, if a non-null pointer check occurs, then access in that 
branch can be allowed.

```d
isolated(int*) obj = new int;

foreach(_; 0 .. 10) {
	if (random() > 0.5) {
		transfer(obj); // uh oh! error
	} else {
		obj = new int;
	}

	if (obj !is null) {
		transfer(obj); // all ok!
	}
}
```

Alternatively, type state analysis can be used for a simpler 
implementation, wherein a transfer puts every variable the object 
is in into the reachable type state, preventing all reads.
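
Under this simpler scheme (a sketch reusing the hypothetical 
``isolated`` and ``transfer`` from above), the states would move 
as follows:

```d
isolated(int*) obj = new int;

transfer(obj); // `obj` drops to the reachable type state
int v = *obj;  // Error: reachable variables may not be read
obj = new int; // ok: a write raises the state again
```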



## Escape Analysis

The establishment of a relationship between variables is a useful 
bit of information to have. Each relationship from an input to 
its respective output has a strength that determines what 
restrictions are placed upon it.

These relationships form a kind of upside-down tree, which can be 
seen in the following example:

```d
int* escapeThese(int* a, int* b, int* c, int* d) {
	int* middle1 = random() > 0.5 ? a : b;
	int* middle2 = random() > 0.5 ? c : d;

	int* bottom = random() > 0.5 ? middle1 : middle2;
	return bottom;
}
```

In that example, there is only one scope and no interesting 
language features that would complicate what the data flow 
analysis must support. What it is meant to show is how the tree 
affects the cleanup process of a single scope.

Given that the DFA has been run forward, it is now time to do the 
backwards cleanup at the end of the scope. It occurs in the 
following order:

1. The return value, if not annotated, receives the ``bottom`` 
variable's state; otherwise the annotated state of the return 
value must match or exceed that of the ``bottom`` variable.
2. ``bottom`` then has its state bubbled up to ``middle1`` and 
``middle2``.
3. ``middle2`` then has its state bubbled up to ``c`` and ``d``.
4. ``middle1`` then has its state bubbled up to ``a`` and ``b``.
5. Finally ``a``, ``b``, ``c`` and ``d``: if a parameter is not 
annotated, it receives the state from its variable; otherwise the 
annotated state of the parameter must match or exceed that of the 
variable.

For multiple scopes, convergence takes place at the end of the 
scope with its parent or when other events occur such as 
returning from the function.
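
As a small sketch of that convergence (an invented example): the 
inner scope's relationships are folded into its parent at the 
closing brace, and the return statement then bubbles the state up 
to the parameter.

```d
int* acrossScopes(int* a) {
	int* local;

	{
		local = a; // inner scope establishes `local` -> `a`
	} // the inner scope converges with its parent here

	return local; // `a` escapes via the return value
}
```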



### Relationship Strength

An important piece of feedback from the survey ``memory safety in 
D what is your view?`` was that the annotation DIP1000 uses for 
escapes is too confusing.

So let's break down what parts there are to the annotation.

1. The escape set: defines the relationship between an input 
variable and its outputs.
2. The relationship strength: what establishes the relationship 
between input and output, and how that relationship should be 
restricted.

Unfortunately, DIP1000 provides only a limited escape set: it 
supports the empty escape set with ``scope`` by itself, while 
``return`` + ``scope`` means that the input escapes into either 
the return value or the first parameter (which could be the this 
pointer). There is no attribute for stating that it escapes into 
the unknown; see ``Absence Means It Was Not Analyzed`` for why 
this matters.

To make matters worse, the relationship strength is determined by 
the order of the attributes ``return`` and ``scope``.

The syntax this document uses is of the form ``@escape(output 
S, ...)``, where ``S`` is the strength.

At the minimum, the relationship strengths needed are (see the 
sketch after this list):

1. Assignment ``=``
2. Taking a pointer to the input and storing it in the output 
(which could be a field); the input must outlive the output ``&``
3. Protecting the input from mutation as long as the output 
exists; the input must outlive the output ``^``
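
With the hypothetical ``@escape(output S)`` syntax, the three 
strengths might be written like this (all names are 
illustrative):

```d
// `=` assignment: `input` is assigned into `output`
void assigns(@escape(output =) int* input, ref int* output);

// `&` address taken: `input` must outlive `output`
void pointsAt(@escape(output &) ref int input, ref int* output);

// `^` protected: `input` may not be mutated while `output` exists
void protects(@escape(output ^) ref int input, ref const(int)* output);
```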

When both the input and the output are by-ref, the relationship 
defaults to taking a pointer ``&``; otherwise the default is 
assignment ``=``. This default matters; see ``Defaults Matter`` 
for why. The goal of having a default for the relationship 
strength is to prevent the need for the user to specify it.



### Extra Information

Relationships between variables facilitate the transmission of 
extra information that could be useful in other analyses. For 
example, knowing that a variable contains GC memory with no 
restrictions placed upon it means it can be escaped freely.

In the following example, the ``escapableGC`` function has one 
parameter, and this parameter contributes to the return value. 
There are no other parameters, therefore if the argument to that 
parameter is GC, so is the output.

```d
int* escapableGC(@escape(return) int* ptr);

int* someVal = escapableGC(new int);
return someVal; // perfectly safe as `someVal` is GC
```

While this tells us the behaviour of parameters, what if there 
are none? Then we'd need to annotate the given output with this 
extra information.

```d
@gcout int* outGC() {
	return new int;
}
```

The purpose of this extra information is twofold. It allows code 
that would otherwise be considered unsafe to be known to be safe. 
And it helps to prevent bad behaviour without requiring 
attributes all the way down the call stack, which is very 
invasive.

An example of catching bad behaviour would be preventing TLS 
memory from crossing a coroutine's yield point. This helps to 
prevent some coloring of functions.

```d
int tls;

@tlsout int* grab() {
	return &tls;
}

void co() @async {
	int* ptr = grab();

	@async return;

	int v = *ptr; // Error: the variable `ptr` contains TLS memory
	              // and cannot cross a coroutine yield point due to
	              // the possibility of being on a different thread.

	return;
}
```



- [Report: Analysis of survey: Memory safety in D, what is your 
view?](https://forum.dlang.org/post/kijxuuxgtyzciuhjanrq@forum.dlang.org)
- [What color is your 
function](https://journal.stuffwithstuff.com/2015/02/01/what-color-is-your-function/)