digitalmars.D.learn - Is garbage detection a thing?

Mark <was330 via.tokyo.jp> writes:

Hi,

can I ask you something in general? I don't know anyone whom I 
could ask. I'm a hobbyist with no science degree or job in 
computing, and also know no other programmers.

I have no good understanding why "garbage collection" is a big 
thing and why "garbage detection" is no thing (I think so).

I want to get rid of undefined behavior. So I tell myself, what 
is this actually? It's most of the time corrupted heap memory and 
the C++ compiler giving me errors that I thought were kind of 
impossible.

Now I could follow all C++ guidelines and almost everything would 
be okay. But many people went into different directions, e.g. 
1995 they released Java and you would use garbage collection.

What I don't understand is, when today there exist tools for C++ 
(allocator APIs for debugging purposes, or Address-Sanitizer or 
maybe also MPX) to just detect that your program tried to use a 
memory address that was actually freed and invalidated,

why did Java and other languages not stop there but also make a 
system that keeps every address alive as long as it is used?

One very minor criticism that I have is: With GC there can be 
"semantically old data" (a problematic term, sorry) which is 
still alive and valid, and the language gives me the feeling that 
it is a nice system that way. But the overall behavior isn't 
necessarily very correct, it's just that it is much better than a 
corrupted heap which could lead to everything possibly crashing 
soon.

My bigger criticism is just, that compilers with garbage 
collection are big software (with big libraries) and tend to have 
defects in other parts. E.g. such compilers (two different ones) 
lately gave me wrong line numbers in error messages.

And other people's (not mine, not really much) criticism is that 
they say garbage collection increases the use of memory and it 
can create a blocking of threads when accessing shared memory, or 
something like this.




So... I wonder where the languages are that only try to give this 
type of error: Your faulty program has (at runtime) used memory 
which has already been freed. Not garbage collection. The 
compiled program just stops all execution and tells me this, so 
that I would go on with my manual memory management.

Now, from today's perspective I could use Rust to create a very 
formal representation of my requirements and create a program 
that is very deterministic and at the same time uses very few 
resources.

But I'd like to pretend there is no Rust (because the lifetimes 
and some other things make it a domain-specific language to some 
extent), and I would like to ask about the "runtime-solution".
Why shouldn't it be a good thing? Has it been tried?

All I would *need* to do additionally is dividing the project 
into two sub-projects as it is done with C++: a debug build and a 
release build.

Then the debug build would use a virtual machine that uses type 
information from compilation for garbage detection, but not 
garbage collection.

And when I have tested all runtime cases of my compiled software, 
which runs slow, but quite deterministically, I will go on and 
build the release build.

And if the release build (which is faster) does not behave 
deterministically, I would fix the "VM/Non-VM compiler" I'm 
talking about until the release build shows the same behavior.

I guess there is a way this approach could fail: Timing may have 
influence and make the VM behave differently from the Non-VM 
(e.g. x64). And it's surely not easy to write a compiler that 
creates code which traces pointers and still leaves you much 
freedom to cast and alter pointers. In some way it is doomed to 
fail, but there are language constructs that work.

There have been C interpreters, iterators as pointer 
replacements, or just any replacement. BTW I know of CINT and 
safe-c, but I'm not happy how these projects look from the 
outside.

If I had the education and persistence I would like to try to 
build my own "safe-c", yet another one. But I think it's better 
to ask you why garbage detection isn't a popular thing. Does it 
exist at all as core idea in a language (probably a C 
improvement)?

Where are the flaws in my thinking?

I currently think, if I were serious about it (I'm not 100% 
sure), I should just find a C interpreter. CINT? Or this one 
academic compiler from five years ago? (I believe this compiler 
needs a special CPU) To be honest, I have no clue. Just one 
"interpreter" that tries to mimic pointers as much as it can, and 
later I would be free to port the code to Microsoft's C.

Or maybe I could use the safe-c subset in D? But I believe it 
uses garbage collection. I know nothing about it, sorry.

What I tried in the past few days was porting working Go code to 
C. I wanted the C code to be Go-idiomatic, and I was looking 
there for the common subset from Golang combined with C. Well, I 
used macros, had a few ideas, but then this C style quickly 
failed. Really frustrating. But.. I'm not planning to give up. ;)

Thanks a lot for reading, and sorry for a lot of text that is 
off-topic and is not related to D.

Nov 29 2020

Daniel N <no public.email> writes:

On Sunday, 29 November 2020 at 16:05:04 UTC, Mark wrote:
 Thanks a lot for reading, and sorry for a lot of text that is 
 off-topic and is not related to D.

Sounds like what you want is ASAN? You can use it with plain C or 
D(LDC).
https://clang.llvm.org/docs/AddressSanitizer.html

Nov 29 2020

Mark <was330 via.tokyo.jp> writes:

On Sunday, 29 November 2020 at 16:21:59 UTC, Daniel N wrote:
 On Sunday, 29 November 2020 at 16:05:04 UTC, Mark wrote:
 Thanks a lot for reading, and sorry for a lot of text that is 
 off-topic and is not related to D.

 Sounds like what you want is ASAN? You can use it with plain C 
 or D(LDC).
 https://clang.llvm.org/docs/AddressSanitizer.html

I could use AddressSanitizer indirectly by using Go. But their 
compiler gave me wrong line numbers for errors and I have not yet 
overcome this psychologically, to be honest. They have a fixed 
version, which is a WIP which is already using generics.

So I went on and saw that Visual C++ now features 
AddressSanitizer. It showed faulty behavior very soon, a false 
positive AFAIR. It's in experimental stage. /d2MPX is out of the 
early development stage but it's no prominent feature.

I went on with Nim, which then gave me wrong line numbers for 
error messages. I'm not counting wrong, they really do, both 
Golang and Nim gave me errors that triggered me. ;)

I began a little C compiler project based on c4, knowing that I 
would be very old when this should ever be finished.

Actually I am looking for a good compiler for Windows or maybe 
macOS, and looking at JAI, too.

Maybe I should just install Linux. But ... the drivers... My 
Thinkpad just doesn't like any Linux. I run out of ideas.

In the first place all I wanted to do is make some music.

Kind regards

Nov 29 2020

Mark <was330 via.tokyo.jp> writes:

 I could use AddressSanitizer indirectly by using Go. But their

Oh wait, it was ThreadSanitizer that Go uses, right? I failed at 
talking.

I would probably use ASAN under Linux, because that is the right 
thing to do?

Looking at Ada now.

Nov 29 2020

Mark <was330 via.tokyo.jp> writes:

 Looking at Ada now.

I found: Ada is not good for me. It has no augmented assignment. 
It's just that I want DRY because I use very verbose variable 
names, and in the past I had a real world case (game in Lua) 
where I became frustrated when I had to repeat the names. I 
understand that NASA or so will repeat their variable names. They 
get paid. ;)

Kind regards

Nov 29 2020

Kagamin <spam here.lot> writes:

On Sunday, 29 November 2020 at 19:09:07 UTC, Mark wrote:
 Looking at Ada now.

 I found: Ada is not good for me. It has no augmented 
 assignment. It's just that I want DRY because I use very 
 verbose variable names

Using a reasonable naming convention should be much easier than 
looking for a perfect custom language. Well, another variant is 
Zig, which was supposed to be C but safe.

Dec 01 2020

Daniel N <no public.email> writes:

On Sunday, 29 November 2020 at 16:35:26 UTC, Mark wrote:
 Maybe I should just install Linux. But ... the drivers... My 
 Thinkpad just doesn't like any Linux. I run out of ideas.

 In the first place all I wanted to do is make some music.

 Kind regards

You could try a Linux image in VirtualBox or VMware, to more 
easily evaluate if Linux + ASAN matches your expectations or if 
it's another dead-end.

Regards,
Daniel

Nov 29 2020

Kagamin <spam here.lot> writes:

Maybe Ada.

Nov 29 2020

Elronnd <elronnd elronnd.net> writes:

On Sunday, 29 November 2020 at 16:05:04 UTC, Mark wrote:
 I have no good understanding why "garbage collection" is a big 
 thing and why "garbage detection" is no thing (I think so).

Because it's just as expensive to do garbage detection as 
automatic garbage collection.  So if you're going to go to the 
work of detecting when something is garbage, it's basically free 
to collect it at that point.


 today there exist tools for C++ (allocator APIs for debugging 
 purposes, or Address-Sanitizer or maybe also MPX) to just 
 detect that your program tried to use a memory address that was 
 actually freed and invalidated,

Note that address sanitizer is significantly slower than most 
‘real’ GCs (such as those used by Java and others).


 why did Java and other languages not stop there but also make a 
 system that keeps every address alive as long as it is used?

 Then the debug build would use a virtual machine that uses type 
 information from compilation for garbage detection, but not 
 garbage collection.

Address sanitizer does exactly what you propose here.  The 
problem is this:

Testing can prove only the presence of bugs, never their 
absence.  You may run your C++ program a thousand times with 
address sanitizer enabled and get no errors; yet, your code may 
still be incorrect and contain memory errors.  Safety features in 
a language--like a GC--prevent an entire class of bugs 
definitively.
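
To make the point about testing concrete, here is a minimal sketch of run-time "garbage detection", written in Python as a stand-in. It uses generation-tagged handles, a technique chosen purely for illustration; AddressSanitizer itself works differently (shadow memory and a quarantine for freed blocks). `DebugHeap`, `UseAfterFree` and `process` are hypothetical names, not part of any real tool.

```python
class UseAfterFree(Exception):
    """Raised when a freed slot is accessed through a stale handle."""


class DebugHeap:
    """Toy heap that detects, rather than prevents, use of freed memory."""

    def __init__(self):
        self._slots = []        # stored values, None once freed
        self._generation = []   # bumped every time that slot is freed
        self._free = []         # slot indices available for reuse

    def alloc(self, value):
        if self._free:
            index = self._free.pop()
            self._slots[index] = value
        else:
            index = len(self._slots)
            self._slots.append(value)
            self._generation.append(0)
        # A handle is (slot index, generation at allocation time).
        return (index, self._generation[index])

    def free(self, handle):
        self._check(handle)     # a double free is reported as well
        index, _ = handle
        self._slots[index] = None
        self._generation[index] += 1
        self._free.append(index)

    def load(self, handle):
        self._check(handle)
        return self._slots[handle[0]]

    def _check(self, handle):
        index, generation = handle
        if self._generation[index] != generation:
            raise UseAfterFree(f"stale handle {handle}")


def process(heap, values, flush_early):
    """Hypothetical buggy routine: frees its buffer too early on one path."""
    buf = heap.alloc(list(values))
    if flush_early:
        heap.free(buf)          # bug: buf is still read below
    total = sum(heap.load(buf))
    if not flush_early:
        heap.free(buf)
    return total
```

Every test that calls `process` with `flush_early=False` passes cleanly, so the detector stays silent; the error is reported only once some run actually takes the `flush_early=True` path. Detection covers the executions you tried, not the ones you did not.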


 One very minor criticism that I have is: With GC there can be 
 "semantically old data" (a problematic term, sorry) which is 
 still alive and valid, and the language gives me the feeling 
 that it is a nice system that way. But the overall behavior 
 isn't necessarily very correct, it's just that it is much 
 better than a corrupted heap which could lead to everything 
 possibly crashing soon.

The distinction here is _reachability_ vs _liveness_.

So, GC theory:

A _graph_ is a type of data structure.  Imagine you have a sheet of 
paper, and on the sheet of paper you have a bunch of dots.  There 
are lines connecting some of the dots.  In graph theory, the dots 
are called nodes, and the lines are edges.  We say that nodes A 
and B are _connected_ if there is an edge going between them.  We 
also say that A is _reachable_ from B if either A and B are 
connected, or A is connected to some C, where C is reachable from 
B.  Basically, if you can reach one point from another just by 
following lines, then each is reachable from the other.

A _directed_ graph is one in which the edges have directionality. 
  Imagine the lines have little arrows at the ends.  There may be 
an edge that goes A -> B; or there may be an edge that goes B -> 
A.  Or there may be both: A <-> B.  (Or they can be unconnected.) 
  In this case, to reach one node from another, you have to follow 
the arrows.  So it may be that, starting at A, you can reach B; 
but you can't go the other way round.

The _heap_ is all the objects you've ever created.  This includes 
the objects you allocate with 'new', as well as all the objects 
you allocate from the stack and all your global variables.  
What's interesting is that we can think of the heap as a directed 
graph.  If object A contains a pointer to object B, we can think 
of that the same way as there being an edge going from node A 
to node B.

The _root set_ is some relatively small number of heap objects 
that are always available.  Generally, this is all the global 
variables and stack-allocated objects.  The name _reachable_ is 
given to any object which is reachable from one of the root set.

It is impossible for your program to access an unreachable 
object; there's no way to get a pointer to it in the first place. 
  So it is safe for the GC to free unreachable objects.
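
The reachability idea above can be sketched in a few lines of Python. This is only an illustration of the marking step under simplifying assumptions: the heap is modelled as a dictionary from object names to the names they point to, and all the names (`reachable_from`, `heap`, `roots`, and the objects themselves) are made up for the example.

```python
def reachable_from(roots, edges):
    """Return the set of objects reachable from the root set.

    edges maps each object name to the names it holds pointers to.
    """
    marked = set()
    worklist = list(roots)
    while worklist:
        obj = worklist.pop()
        if obj in marked:
            continue                        # already visited, e.g. via a cycle
        marked.add(obj)
        worklist.extend(edges.get(obj, ()))  # follow outgoing pointers
    return marked


# Hypothetical heap: one global and one stack variable form the root set.
heap = {
    "global_config": ["settings"],
    "stack_var": ["node_a"],
    "node_a": ["node_b"],
    "node_b": ["node_a"],       # cycle, but still reachable from stack_var
    "settings": [],
    "orphan_x": ["orphan_y"],   # these two point at each other,
    "orphan_y": ["orphan_x"],   # yet no root leads to them
}
roots = ["global_config", "stack_var"]

reachable = reachable_from(roots, heap)
garbage = set(heap) - reachable
```

Here `garbage` ends up containing only `orphan_x` and `orphan_y`: they reference each other, but the arrows never lead to them from a root, so the program has no way to obtain a pointer to either.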

But we can also add another category of objects: _live_ vs _dead_ 
objects.  Live objects are ones which you're actually going to 
access at some point.  Dead objects are objects that you're never 
going to access again, even if they're reachable.  If a GC could 
detect which reachable objects were dead, it would be able to be 
more efficient and use less memory...hypothetically.

The reason this distinction is important, and the reason I bring 
up graph theory, is that liveness is impossible to prove.  
Seriously: it's impossible, in the general case, for the GC to 
prove that an object is still alive.  Whereas it's trivial to 
prove reachability.

Now, it is true that there are some cases where an object is dead 
but still reachable.  The fact of the matter is that in most such 
cases, the object becomes unreachable shortly thereafter.  In the 
cases when it's not, it tends to be impractical to prove an 
object is dead.  The extra work that it would take to prove 
deadness in such cases, if it were even possible to prove, would 
make it a not worthwhile optimization.
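
A small Python sketch of a dead-but-reachable object, assuming a long-lived list plays the part of a root. `Record`, `history` and `handle_request` are hypothetical names; the weak reference is only there to observe whether the object has been reclaimed.

```python
import gc
import weakref


class Record:
    """Hypothetical payload object."""

    def __init__(self, name):
        self.name = name


history = []                      # long-lived container, acts as a root


def handle_request(name):
    record = Record(name)
    history.append(record)        # kept "just in case", never read again
    return weakref.ref(record)    # probe that does not keep the object alive


probe = handle_request("first")
gc.collect()
still_allocated = probe() is not None   # reachable through history: retained

history.clear()                   # the program drops its last reference
gc.collect()
freed_after_clear = probe() is None     # unreachable now: reclaimed
```

The collector cannot know that the program will never look inside `history` again, so the record stays allocated until the reference is actually removed. This is the "semantically old data" from the original question: dead in the liveness sense, but kept because it is still reachable.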


 And when I have tested all runtime cases of my compiled 
 software, which runs slow, but quite deterministically, I will 
 go on and build the release build.

 And if the release build (which is faster) does not behave 
 deterministically, I would fix the "VM/Non-VM compiler" I'm 
 talking about until the release build shows the same behavior.

 I guess there is a way this approach could fail: Timing may 
 have influence and make the VM behave differently from the 
 Non-VM (e.g. x64).

I don't know why you're so hung up on timing.  It's easy to write 
code which isn't sensitive to timing, as long as you don't use 
threads.  That doesn't mean it's possible to test it 
exhaustively; see the above note about testing.

Nov 29 2020

Mark <was330 via.tokyo.jp> writes:

 The reason this distinction is important, and the reason I 
 bring up graph theory, is that liveness is impossible to prove.
  Seriously: it's impossible, in the general case, for the GC to 
 prove that an object is still alive.  Whereas it's trivial to 
 prove reachability.

My motivation was actually just that I wanted a very small 
compiler with no libraries, because I got a bit tired of big 
things with little defects.

But the liveness is a thing I would like to say something about: 
I don't want a compiler that tries to prove it for me. The reason 
is that if I did manual memory management and created a 
use-after-free bug, then my personal world would be still very 
good. It's just that the industry isn't happy when releasing 
banana software is a serious problem?

In the case of the use-after-free I would say that my software 
needs correct memory state and correct logic. The logic so to 
speak is what I actually wanted to implement in the first place. 
If the program is perfect and works in all situations, how could 
it do this with bugs in the memory management? It can't.

So, I didn't really think often about it, but when Rust came out, 
there was some trend into this direction, and my feeling was: I 
can follow it, it looks good, but can I still just use C? The way 
it works would just be that I have correct out-of-bounds and 
use-after-free/double-free detection and race condition detection 
if my software is supposed to be chaotic server or browser 
software (?), both things at runtime. Given that it is true (?) 
that software cannot do a task correctly when it does part of it 
(memory management) incorrectly.

More or less it was an idea to get what Rust and Swift try to do 
without all the language features. C with less language 
constructs would be nice. It's just that the processors are bad 
for me, for what I'm trying to do. Intel offers MPX. It's good I 
guess. But why does Visual C++ implement it like an easter egg? 
Why have they now added ASAN as an experimental feature that 
immediately fails? They do things I don't really like, because 
the industry needs it like this, but not me. And at the high 
level, there's bloat and fancy colors.

I want just a toolkit to create exactly what can be created, and 
if I do it wrong it should fail hard. And I'm trying to make it 
so that I don't write assembly, if that is possible and good in 
my situation.

I'd say, I have just not understood the whole thing and maybe 
should try a different hobby, or finally create the thing I'm 
looking for, which turns out to be an endless story. It's just 
that a few hours ago I had the hope that the holy grail would 
exist.

I had found the solution to almost every problem in Golang. And 
then after one year my hobby just broke down, because Go outputs 
wrong line numbers. It's no good compiler when it does that. I'd 
rather quit my hobby than accept it.

Thanks a lot for your explanation!
Really kind regards,
you helped me to understand it.

Nov 29 2020

Bastiaan Veelo <Bastiaan Veelo.net> writes:

On Sunday, 29 November 2020 at 16:05:04 UTC, Mark wrote:
 Hi,

 can I ask you something in general? I don't know anyone whom I 
 could ask. I'm a hobbyist with no science degree or job in 
 computing, and also know no other programmers.

 I have no good understanding why "garbage collection" is a big 
 thing and why "garbage detection" is no thing (I think so).

In order to detect garbage, you need extensive run-time 
instrumentation, the difficulties of which you have indicated 
yourself. In addition, detection depends on 
circumstance, which is an argument against the debug/release 
strategy you proposed. There is no guarantee that you’ll find all 
problems in the debug build. Garbage collection also comes at a 
runtime cost, but strategies exist to minimise those, and in 
addition a GC enables valuable language features. One such 
strategy is to minimise allocations, which improves performance 
in any memory management scheme.

[...]
 What I don't understand is, when today there exist tools for 
 C++ (allocator APIs for debugging purposes, or 
 Address-Sanitizer or maybe also MPX) to just detect that your 
 program tried to use a memory address that was actually freed 
 and invalidated,

 why did Java and other languages not stop there but also make a 
 system that keeps every address alive as long as it is used?

Elimination of memory problems is much more valuable than 
detection. Recovering from memory errors at run time is 
unreliable.

 One very minor criticism that I have is: With GC there can be 
 "semantically old data" (a problematic term, sorry) which is 
 still alive and valid, and the language gives me the feeling 
 that it is a nice system that way. But the overall behavior 
 isn't necessarily very correct, it's just that it is much 
 better than a corrupted heap which could lead to everything 
 possibly crashing soon.

At least in D, you can keep old data from hanging around for too 
long. See core.memory.

 Or maybe I could use the safe-c subset in D? But I believe it 
 uses garbage collection. I know nothing about it, sorry.

@safe D is not a subset; indeed, it uses garbage collection. The fact 
is that there are very few domains where this is a problem. Not 
all garbage collectors are equal either, so if you think garbage 
collection is bad in one language, this may not directly apply in 
another. In D the garbage collector is even pluggable; various 
implementations exist. Have you seen the GC category on the 
blog? https://dlang.org/blog/2017/03/20/dont-fear-the-reaper/

BetterC is a subset of D, it does not use garbage collection.

You may be interested in current work being done in static 
analysis of manual memory management in D: 
https://youtu.be/XQHAIglE9CU

The advantage of D is that all options are open. This allows the 
following approach:
1) Start development without worrying about memory. Should 
collection cycles be noticeable:
2) Profile your program and make strategic optimisations 
https://youtu.be/dRORNQIB2wA. If this is not enough:
3) Force explicit collection in idle moments. If you need to go 
further:
4) Completely eliminate collection in hot loops using @nogc 
and/or GC.disable. When even this is not enough:
5) Try another GC implementation. And if you really need to:
6) Switch to manual memory management where it matters.

This makes starting a project in D a safe choice, in multiple 
meanings of the word.

— Bastiaan.

Nov 29 2020

Mark <was330 via.tokyo.jp> writes:

 Elimination of memory problems is much more valuable than 
 detection. Recovering from memory errors at run time is 
 unreliable.

I'm not sure but I have a gut feeling that I am just in a 
position that is not good to defend. I want small software that 
fails hard on weak causes, and the industry wants software that 
fails soft on strong causes. I cannot win any argument and will, 
as hobbyist who likes elegant code, face frustration after 
frustration, admittedly in all digital products, not only 
compilers.

I'm sure you're right. Elimination of memory problems is what 
must be provided. I agree.

 The advantage of D is that all options are open. This allows 
 the following approach:
 ...
 ...
 This makes starting a project in D a safe choice, in multiple 
 meanings of the word.

 — Bastiaan.

Thanks! :) Well, in the end, there should be a way like this.

Nov 29 2020

Mark <was330 via.tokyo.jp> writes:

 Recovering from memory errors at run time is unreliable.


I should add that I have more like a romantic view of software 
release cycles where testing is done until the software is in a 
very, very sophisticated and stable state. More than usual.

Not that I want to solely rely on such an approach. It should 
still try to recover from failures, and recovering should be 
trained like fire fighting is trained. But there should be no 
smokers in the building, so to speak.

Nov 29 2020
