
digitalmars.D - Re: Lints, Condate and bugs

reply bearophile <bearophileHUGS lycos.com> writes:
Walter:

 I'd hoped to be able to obsolete them by designing the need for such checks
 out of the language.

This is a good goal, and I agree that when possible it's the best solution; indeed D designs away (or makes much less likely) many of the bugs found by C lints. That Condate tool is simple, its rule set is less than a page long, so it can't find very complex kinds of bugs. Much more refined lints for C, which contain far more than just a few simple rules, are able to find more complex bugs. There is a free online demo for a good lint.
 There's no need to check for an error that cannot be expressed.

Right :-)
 Looking at what the rule based analyzers do falls into predictable categories:

Commercial lints for C are probably able to find other kinds of bugs too. Even Splint (a free lint for C) is probably able to do more than your list (but you need to add some semantic annotations to the C code if you want Splint to do that).
 Keep in mind that such tools can also produce large quantities of false
 positives,

That Condate tool is able to find only simple bugs, but I think it doesn't give many false positives. The static analyzer of Clang is supposed to have a really low rate of false positives, so low that I think it may be configured to submit bug reports automatically :-)
 requiring ugly workarounds or causing the programmer to miss the real
 bugs. Keep in mind also that these tools are often way oversold - they catch a
 few kinds of bugs, but not logic errors. Over time, I've found my own coding
 bugs that such tools might catch get more and more rare. The bugs in my code
 are logic errors that no tool could catch.

I am quite sure that if I run a good C/C++ lint on the D front-end it may catch a large number (hundreds) of bugs. But even if you are right that you don't write bugs that a simple rule-based analyzer is able to catch, the world is full of people who don't have your level of experience in C/C++ coding. So for them a lint may be useful. I have some practice with C, and I keep my brain switched on when I write C code, but I have found that a good C lint is able to find some mistakes in my code anyway. It gives many false positives (well, most of them are true "bugs", but ones I don't want to fix, like giving a signed integer to the malloc function), but in the end I want to keep using it. I am sold on it.
 Here's how D deals with them:

Once in a while I use a low-level coding style in D, and in such cases I may fall into several of C's traps. A lint tool for D may help in those parts of the code too. I have merged your two parallel lists of points to make my answers more readable.
 1. Memory allocation errors - failure to free, dangling pointers, redundant frees
 1. Garbage collection.

The GC avoids a large number of bugs. For the situations where the GC is unfit, Ada shows the usage of more than one class of pointers, with different capabilities. This may reduce the bug rate (but introduces some extra complexity). In the past, for example, I have suggested statically differentiating pointers to GC-managed memory from pointers to manually managed memory (so they are two different kinds of pointers), because they are quite different (example: putting tags inside a GC-managed pointer is a bad idea). You answered me that this introduces too much complexity in the language.
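To illustrate the idea (this is just a sketch in today's D, not a language feature: wrapping the two kinds of pointers in distinct structs makes them different static types, so mixing them becomes a compile-time error):

    // Illustrative sketch only: two statically distinct pointer kinds.
    struct GcPtr(T)  { T* p; }  // points into GC-managed memory
    struct RawPtr(T) { T* p; }  // points into manually managed memory

    void manualFree(T)(RawPtr!T q) { /* would call free(q.p) here */ }

    void main() {
        GcPtr!int g;
        RawPtr!int r;
        manualFree(r);    // fine
        // manualFree(g); // compile-time error: GcPtr!int is not RawPtr!int
    }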
 4. Memory corruption, such as buffer overflows
 4. Array bounds checking, and safe mode in general, solves this.

Array bounds checking slows down code a lot, often more than 20%. See below for my answer to point 8. (Static analysis in recent Java VMs is able to infer that many of those checks are useless and remove them with no harm to the code.)
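A minimal example of what such an analysis can prove (in the loop below every a[i] is range-tested at run time when bounds checking is active, even though i < a.length always holds; DMD's -release switch just removes the tests wholesale instead of proving them redundant):

    // With bounds checking on, each a[i] is range-tested at run time,
    // although here the test is provably redundant because i < a.length.
    void scale(double[] a, double k) {
        foreach (i; 0 .. a.length)
            a[i] *= k;
    }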
 6. Failure to deal with error returns
 6. Exceptions solve this

I don't see exceptions used much in Phobos yet :-] Example: in Python list.index() throws an exception if the item is missing, while indexOf returns -1.
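A small example of the two styles, using std.string.indexOf and its -1 sentinel (the enforce-based variant is just a sketch of the exception style, not an existing Phobos function):

    import std.exception : enforce;
    import std.string : indexOf;

    void main() {
        auto s = "hello";

        // Sentinel style: forgetting the -1 test silently misuses the result.
        auto i = indexOf(s, 'x');
        if (i == -1) { /* handle the missing item */ }

        // Exception style: a missing item can't be ignored by accident.
        auto j = indexOf(s, 'l');
        enforce(j != -1, "item not found");
    }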
 7. Null pointer dereferencing
 7. Certainly the idea of non-null types has a lot of adherents, D doesn't have
 that.

It's too late now to make D2 references non-null by default, so that discussion is closed. But in bug http://d.puremagic.com/issues/show_bug.cgi?id=4571 I have proposed something different that, once improved, may still be added, and may improve the situation. The idea is in two parts. The first is to add a simple syntax to denote a non-null reference or pointer. A possibility is to use the @ suffix:

    class T {}
    T nullable_reference;
    T@ nonnullable_reference = new T@();
    struct S {}
    S nullable_pointer;
    S@ nonnullable_pointer = new S@();

A possible alternative is to use the - (or +) suffix:

    class T {}
    T nullable_reference;
    T+ nonnullable_reference = new T+();
    struct S {}
    S nullable_pointer;
    S+ nonnullable_pointer = new S+();

A possible problem with non-null class references can be seen with this D program that uses the trailing @ syntax:

    class Foo {}
    class A {
        Foo@ name;
        this(Foo@ s) {
            this.name = s;
            this.m();
        }
        void m() { /*...*/ }
    }
    class B : A {
        Foo@ path;
        this(Foo@ p, Foo@ s) {
            super(s);
            this.path = p;
        }
        override void m() {
            // here this.path is null despite being declared non-null
            assert(this.path !is null);
        }
    }
    void main() {
        new B(new Foo, new Foo);
    }

I have adapted that example from this paper, which also discusses partially uninitialized objects: http://research.microsoft.com/pubs/67461/non-null.pdf
A comment about that program from the paper:
 The problem with the code is that during the base call to A's constructor, the
 virtual method B.m may be invoked. At this time, field path of the object under
 construction has not yet been initialized. Thus, accesses of this.path in
 method B.m may yield a possibly-null value, even though the field has been
 declared as being non-null.

[Another, more recent idea, not present in bug 4571: an island of null-safe code may be present inside more code that is not null-safe. The soundness of the whole program may be assured if the compiler adds run-time tests at the interface points between the two realms. I have seen this solution used to "gradually" introduce dependently typed code inside a larger amount of code that lacks dependent type annotations: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.118.4557 ]

That's just the first half of a solution. Besides introducing non-null pointers/references and a handy syntax to denote them, to have a null-safe language you also need to require an explicit test every time a nullable pointer/reference is about to be dereferenced; after such a test, in the else branch, the reference type "becomes" a non-nullable one. This requires a small amount of flow analysis. This second part of the solution is optional, but it makes the overall solution good enough.
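A minimal sketch of such an interface point, written in today's D with no language support (NonNull is a hypothetical library wrapper, not a Phobos type): a run-time test re-establishes the non-null invariant when a reference crosses from unchecked code into the null-safe island.

    // Illustrative only.
    struct NonNull(T) if (is(T == class)) {
        private T payload;
        this(T p) {
            assert(p !is null, "null crossed into null-safe code");
            payload = p;
        }
        alias payload this; // usable like the underlying reference
    }

    class Foo { void hi() {} }

    void nullSafeCode(NonNull!Foo f) {
        f.hi(); // no null test needed inside the island
    }

    void main() {
        auto f = NonNull!Foo(new Foo); // run-time check at the boundary
        nullSafeCode(f);
    }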
 8. Signed/unsigned mismatching
 8. The only successful solution to this I've seen is Java's simply not having unsigned types.

When I write C code I always keep this GCC warning active, and I find it useful. A partial solution to this problem is compile-time/run-time overflow checking on integral types: if I assign a -1 to an unsigned value (without an explicit static cast), there is a compile-time error. In your list you are also forgetting integral overflows, which for example static analysis in SPARK is often able to avoid (but you need a lot of time and brain to write such code, so it's not a general solution for D or other languages, it's fit only for special code). See below for more answers.
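The classic trap compiles silently in today's D because, as in C, the signed operand is implicitly converted to the unsigned type:

    void main() {
        int i = -1;
        uint u = 1;
        // i is converted to uint, so -1 becomes uint.max
        // and the "obvious" comparison is false.
        assert(!(i < u));
    }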
 I don't think there's much value left for add-on static analysis tools.

In the end I don't have a D lint to test/try, so I can't know if you are right :-) Recently in this newsgroup I have shown something about dependent types; their type system is able to statically catch higher-order bugs (well, to enforce certain higher-order invariants) in the code :-) D is not able to catch such bugs: http://www.bluishcoder.co.nz/2010/09/01/dependent-types-in-ats.html

Thank you for all your answers.

--------------------------
Ellery Newcomer:
 or python's variation: not having fixed integer types at all.
 
 Question: would it be feasible to make D use some sort of bignum type as
 the default index type for arrays, etc, but make it possible for the
 programmer to use uint or ulong or whatever for e.g. places where
 performance is an issue?

Another solution is run-time integral overflow checking.
 I went to the trouble of modifying dmd to warn on unsigned/signed
 comparison. It found me some bugs which probably would not have been
 noticed otherwise. Did it produce false positives? Yes. Did that make me
 wish I hadn't done it? Hell no.

I guess most non-DMD D compilers will have this warning.

--------------------------
Walter:
 I suspect that's a big jump in complexity for very little gain.

Well, avoiding many integral-related bugs is a nice gain.
 I also suspect that it would result in a drastic performance problem (remember,
 Python is 100x slower than native code),

Python is 100-130x slower than native code only if your native code is numerically intensive and really well written for performance; otherwise in most cases you don't end up more than 20-50x slower. If your Python code uses strings, it uses functions written in C that are usually faster than the ones found in std.string, so your Python code is not slow. If your D code is high-level and looks like Java code, it's probably no more than 4-20x faster than equivalent Python code.

Regarding the source of the low Python performance you are quite wrong. Psyco is a JIT for Python; if you use Psyco you keep using multi-precision numbers, and it's very uncommon to see good Psyco programs run more than 10-15 times slower than perfectly optimized D programs. Python is slow first of all because it's dynamic, and because its interpreter is designed to be simple and hackable by not very expert C open source programmers. CLisp and OCaml implementations use tagged integers, and from what I have seen you can't expect CLisp code that uses lots of integers to run more than 2-3 times slower than D code. Often the difference is lower (and very often the difference in speed doesn't come from tagged integral numbers but has other causes, first of all that lots of CLisp code is dynamically typed).

I have used Delphi and C# with integer overflow checks, and if you use them in C# code you usually don't see the code get more than 50% slower even if the code is using integer numbers all the time :-) Usually the percentage is lower. In practice this is not a problem, especially if you are debugging your program :-)
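For reference, a minimal sketch of what run-time overflow checking looks like, done portably here through a wider intermediate type (a compiler could instead test the CPU overflow flag, which is much cheaper):

    // Illustrative sketch of checked signed 32-bit addition.
    int checkedAdd(int a, int b) {
        immutable long r = cast(long)a + b;
        if (r < int.min || r > int.max)
            throw new Exception("integer overflow in addition");
        return cast(int)r;
    }

    void main() {
        assert(checkedAdd(1, 2) == 3);
        // checkedAdd(int.max, 1) would throw.
    }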
 and will disable the advantages D has
 with type inference (as the user will be back to explicitly naming the type).

You may be right, but here I don't fully understand you.
 You might want to consider changing your coding style to eschew the use of
 unsigned types.

This is unfortunately impossible in D: Phobos, arrays and other things are full of unsigned numbers...
 That's a good sentiment, but it doesn't work that way in practice. Warnings
 always become de-facto requirements. They aren't the solution to a disagreement
 about how the language should work.

Comparing signed and unsigned values is a hazard in D.

-----------------------------
KennyTM~:
 Why .length cannot return a signed integer (not size_t) by default?

I have a bug report on this: http://d.puremagic.com/issues/show_bug.cgi?id=3843 Recently it was discussed, but not appreciated :-)

-----------------------------
Walter:
 20% is an unacceptably huge penalty.

In D there are array bounds tests that often slow down array-intensive code by more than 20%. You probably accept that slowdown because it's optional (though lots of future D programmers will not want to remove those run-time tests; you can't assume that most people will want to remove them!). I am using OCaml, which uses tagged integers, and it's far from being a slow language. And I am willing to accept an optional 50% slowdown (or even 100%, meaning the code runs at half speed) for optional integral overflow checks (on non-growable integral values). I'd like to learn to modify DMD just to introduce those overflow checks; maybe in a few more years I'll be able to do it :-)

Bye,
bearophile
Oct 27 2010
next sibling parent bearophile <bearophileHUGS lycos.com> writes:
 I don't see exceptions used much in Phobos yet :-]

Andrei has used enforce() in many places, so there are many exceptions. But enforce() is used like a non-removable assert(), and the exceptions it generates are generic... Once DMD ships two compiled Phobos libs in its binary distribution (one with asserts enabled too), those enforce() calls are better replaced by asserts.

Bye,
bearophile
Oct 27 2010
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
bearophile wrote:
 Commercial lints for C are probably able to find other kinds of bugs too. Even
 Splint (a free lint for C) is probably able to do more than your list (but
 you need to add some semantic annotations to the C code if you want Splint to
 do that).

When you find yourself adding semantic annotations to the C code, or providing extra semantic rules to the static analyzer, what that really is saying is that the abstract type abilities of the language being analyzed are deficient.
 The static analyzer of Clang is supposed to have a
 really low rate of false positives, so low that I think it may be
 configured to submit bug reports automatically :-)

We'll see. A lot of organizations treat false positives as "bugs" simply because it's easier to deal with them that way.
 I am quite sure that if I run a good C/C++ lint on the D front-end it may
 catch a large number (hundreds) of bugs.

I don't believe it, but feel free to prove me wrong. I'll be happy to fix any bugs found that way, but not false positives.
 But even if you are right that you don't
 write bugs that a simple rule-based analyzer is able to catch, the world is
 full of people who don't have your level of experience in C/C++ coding. So
 for them a lint may be useful.

I found lint useful for maybe a year or so, and then it just stopped finding any problems in my code. Not that there weren't bugs, not at all, but I had simply learned to not do the kinds of things lint detects.
 1. Memory allocation errors - failure to free, dangling pointers, redundant frees
 1. Garbage collection.

 The GC avoids a large number of bugs. For the situations where the GC is unfit,
 Ada shows the usage of more than one class of pointers, with different
 capabilities. This may reduce the bug rate (but introduces some extra
 complexity). In the past, for example, I have suggested statically
 differentiating pointers to GC-managed memory from pointers to manually managed
 memory (so they are two different kinds of pointers), because they are quite
 different (example: putting tags inside a GC-managed pointer is a bad idea).
 You answered me that this introduces too much complexity in the language.

Microsoft's Managed C++ does exactly this. While a technical success, it is a complete failure in regards to pleasing programmers. I'm also very familiar with using multiple pointer types, which are a necessity for DOS programming. I'm sick of it. It sucks. I don't want to go back to it, and I don't know anyone who does.
 4. Memory corruption, such as buffer overflows
 4. Array bounds checking, and safe mode in general, solves this.

 Array bounds checking slows down code a lot, often more than 20%. See below
 for my answer to point 8.

That's why it's selectable with a switch.
 (Static analysis in recent Java VMs is able to infer that many of those
 checks are useless and remove them with no harm to the code.)

I know about data flow analysis, and I was able to implement such checking for array bounds. But it is of only limited effectiveness.
 6. Failure to deal with error returns
 6. Exceptions solve this

 Example: in Python list.index() throws an exception if the item is missing,
 while indexOf returns -1.

That is because there is a different idea of what an "error" is when indexing a list.
 8. Signed/unsigned mismatching
 8. The only successful solution to this I've seen is Java's simply not having
 unsigned types.

 When I write C code I always keep this GCC warning active, and I find it useful.

The rate of false positives for such makes them unsuitable for inclusion in the language.
 In your list you are also forgetting integral overflows, which for example
 static analysis in SPARK is often able to avoid (but you need a lot of time and
 brain to write such code, so it's not a general solution for D or other
 languages, it's fit only for special code).

I did that deliberately, as I haven't seen any focus on them in static analysis tools. But I do know that it is of particular interest to you.
 I also suspect that it would result in a drastic performance problem
 (remember, Python is 100x slower than native code),

 Python is 100-130x slower than native code only if your native code is
 numerically intensive and really well written for performance; otherwise in
 most cases you don't end up more than 20-50x slower.

Even 20% slower, let alone 20 times slower, is unacceptable.
 CLisp and OCaml implementations use tagged integers, and from what I have
 seen you can't expect CLisp code that uses lots of integers to run more than
 2-3 times slower than D code.

Such slowdowns are simply not acceptable for D.
 I have used Delphi and C# with integer overflow checks, and if you use them in
 C# code you usually don't see the code get more than 50% slower even if the
 code is using integer numbers all the time :-) Usually the percentage is
 lower. In practice this is not a problem, especially if you are debugging your
 program :-)

50% slower than C++ means that people will not switch from C++ to D. I think a total 5% slowdown relative to C++ is about the max acceptable. In particular, I do not view integer overflows as remotely big enough of a problem to justify such massive slowdowns. Yes, I've had an overflow bug here and there over the years, but nothing remotely as debilitating as uninitialized data bugs or pointer bugs.
Oct 27 2010
parent reply bearophile <bearophileHUGS lycos.com> writes:
Walter:

 When you find yourself adding semantic annotations to the C code, or providing
 extra semantic rules to the static analyzer, what that really is saying is that
 the abstract type abilities of the language being analyzed are deficient.

Right. Some of my recent suggestions, like the outer or the optional_tag() ( http://d.puremagic.com/issues/show_bug.cgi?id=5125 ) attributes, are ways to give more semantics to the D compiler. On the other hand, Cyclone (and heavily annotated C code for Splint) shows that too many annotations make code writing an unpleasant experience, or even a pain (writing D2 code is already a bit more painful just because of const correctness), so you must keep a balance. The idea about islands in that paper on dynamic dependent types is a possible way to solve this problem with annotations.
 I found lint useful for maybe a year or so, and then it just stopped finding
 any problems in my code.

I have been using C for quite a bit more than a year, yet I still find a good lint useful. You are surely much better than me at C.
 Microsoft's Managed C++ does exactly this. While a technical success, it is a
 complete failure in regards to pleasing programmers.

I see.
 I'm also very familiar with using multiple pointer types, which are a necessity
 for DOS programming. I'm sick of it. It sucks. I don't want to go back to it,
 and I don't know anyone who does.

I see. I will need to program more in Ada to see if you are right regarding why Ada uses multiple pointer types.
 I know about data flow analysis, and I was able to implement such checking for
 array bounds. But it is of only limited effectiveness.

This is interesting. You have done a lot of things. (Some time ago I used the Java SciMark 2.0 benchmark before and after the introduction of that static analysis, and I saw the difference.)
 as I haven't seen any focus on them in static analysis tools.

Most lints don't focus on them because they are hard to spot statically, especially in the C language. To avoid overflow bugs with static analysis you need big guns, as done by SPARK on a subset of Ada, but that's a language for special purposes.
 [integer overflows]
 50% slower than C++ means that people will not switch from C++ to D. I think a
 total 5% slowdown relative to C++ is about the max acceptable.
 ...
 In particular, I do not view integer overflows as remotely big enough of a
 problem to justify such massive slowdowns.

(There I was talking about integer overflows.) I answer with one of your answers:
 That's why it's selectable with a switch.

(But for integral overflows in D I suggest two switches, one for unsigned and one for both signed and unsigned values).
 Yes, I've had an overflow bug here
 and there over the years, but nothing remotely as debilitating as uninitialized
 data bugs or pointer bugs.

D allows you to remove a large percentage of memory-related bugs, so now the percentage of integer-related bugs (over the new total of bugs) becomes higher :-)

Thank you for your many answers.

Bye,
bearophile
Oct 27 2010
parent bearophile <bearophileHUGS lycos.com> writes:
 (But for integral overflows in D I suggest two switches, one for unsigned and
 one for both signed and unsigned values).

I meant one for signed and one for both signed and unsigned values, sorry.

Bye,
bearophile
Oct 27 2010