
digitalmars.D.learn - Is garbage detection a thing?

reply Mark <was330 via.tokyo.jp> writes:
Hi,

can I ask you something in general? I don't know anyone whom I 
could ask. I'm a hobbyist with no science degree or job in 
computing, and also know no other programmers.

I have no good understanding of why "garbage collection" is a 
big thing and why "garbage detection" is not a thing (as far as 
I can tell).

I want to get rid of undefined behavior. So I ask myself: what 
is it, actually? Most of the time it's corrupted heap memory, 
and the C++ compiler giving me errors that I thought were kind 
of impossible.

Now I could follow all the C++ guidelines and almost everything 
would be okay. But many people went in different directions; 
e.g. in 1995 Java was released and with it you would use 
garbage collection.

What I don't understand is, when today there exist tools for C++ 
(allocator APIs for debugging purposes, or Address-Sanitizer or 
maybe also MPX) to just detect that your program tried to use a 
memory address that was actually freed and invalidated,

why did Java and other languages not stop there, but instead 
build a system that keeps every address alive as long as it is 
used?

One very minor criticism I have: with GC there can be 
"semantically old data" (a problematic term, sorry) that is 
still alive and valid, and the language gives me the feeling 
that this is a nice system. But the overall behavior isn't 
necessarily correct; it's just much better than a corrupted 
heap, which could lead to everything crashing soon.

My bigger criticism is just that compilers with garbage 
collection are big software (with big libraries) and tend to 
have defects in other parts. E.g. such compilers (two different 
ones) lately gave me wrong line numbers in error messages.

And other people's criticism (not really mine) is that garbage 
collection increases memory use and can block threads when they 
access shared memory, or something like that.




So... I wonder where the languages are that only try to give this 
type of error: Your faulty program has (at runtime) used memory 
which has already been freed. Not garbage collection. The 
compiled program just stops all execution and tells me this, so 
that I would go on with my manual memory management.

Now, from today's perspective I could use Rust to create a very 
formal representation of my requirements and create a program 
that is very deterministic and at the same time uses very few 
resources.

But I'd like to pretend there is no Rust (because the lifetimes 
and some other things make it a domain-specific language to some 
extent), and I would like to ask about the "runtime-solution".
Why shouldn't it be a good thing? Has it been tried?

All I would *need* to do additionally is divide the project 
into two sub-projects, as is done with C++: a debug build and a 
release build.

Then the debug build would use a virtual machine that uses type 
information from compilation for garbage detection, but not 
garbage collection.

And when I have tested all runtime cases of my compiled software, 
which runs slow, but quite deterministically, I will go on and 
build the release build.

And if the release build (which is faster) does not behave 
deterministically, I would fix the "VM/Non-VM compiler" I'm 
talking about until the release build shows the same behavior.

I guess there is a way this approach could fail: Timing may have 
influence and make the VM behave differently from the Non-VM 
(e.g. x64). And it's surely not easy to write a compiler that 
creates code which traces pointers and still leaves you much 
freedom to cast and alter pointers. In some way it is doomed to 
fail, but there are language constructs that work.

There have been C interpreters, iterators as pointer 
replacements, or just any replacement. BTW I know of CINT and 
safe-c, but I'm not happy how these projects look from the 
outside.

If I had the education and persistence I would like to try to 
build my own "safe-c", yet another one. But I think it's better 
to ask you why garbage detection isn't a popular thing. Does it 
exist at all as a core idea in a language (probably a C 
improvement)?

Where are the flaws in my thinking?

I currently think, if I were serious about it (I'm not 100% 
sure), I should just find a C interpreter. CINT? Or this one 
academic compiler from five years ago? (I believe this compiler 
needs a special CPU) To be honest, I have no clue. Just one 
"interpreter" that tries to mimic pointers as much as it can, and 
later I would be free to port the code to Microsoft's C.

Or maybe I could use the safe-c subset in D? But I believe it 
uses garbage collection. I know nothing about it, sorry.

What I tried in the past few days was porting working Go code 
to C. I wanted the C code to be Go-idiomatic, and I was looking 
for the common subset of Golang and C. Well, I used macros, had 
a few ideas, but then this C style quickly failed. Really 
frustrating. But.. I'm not planning to give up. ;)

Thanks a lot for reading, and sorry for a lot of text that is 
off-topic and is not related to D.
Nov 29 2020
next sibling parent reply Daniel N <no public.email> writes:
On Sunday, 29 November 2020 at 16:05:04 UTC, Mark wrote:
 Thanks a lot for reading, and sorry for a lot of text that is 
 off-topic and is not related to D.
Sounds like what you want is ASAN? You can use it with plain C or D(LDC). https://clang.llvm.org/docs/AddressSanitizer.html
Nov 29 2020
parent reply Mark <was330 via.tokyo.jp> writes:
On Sunday, 29 November 2020 at 16:21:59 UTC, Daniel N wrote:
 On Sunday, 29 November 2020 at 16:05:04 UTC, Mark wrote:
 Thanks a lot for reading, and sorry for a lot of text that is 
 off-topic and is not related to D.
Sounds like what you want is ASAN? You can use it with plain C or D(LDC). https://clang.llvm.org/docs/AddressSanitizer.html
I could use AddressSanitizer indirectly by using Go. But their compiler gave me wrong line numbers for errors and I have not yet gotten over this psychologically, to be honest. They have a fixed version, which is a WIP that is already using generics.

So I went on and saw that Visual C++ now features AddressSanitizer. It showed faulty behavior very soon, a false positive AFAIR. It's in an experimental stage. /d2MPX is out of the early development stage, but it's not a prominent feature.

I went on with Nim, which then gave me wrong line numbers for error messages. I'm not counting wrong, they really do; both Golang and Nim gave me errors that triggered me. ;)

I began a little C compiler project based on c4, knowing that I would be very old if this were ever finished. Actually I am looking for a good compiler for Windows or maybe macOS, and looking at JAI, too.

Maybe I should just install Linux. But ... the drivers... My Thinkpad just doesn't like any Linux. I've run out of ideas. In the first place all I wanted to do was make some music.

Kind regards
Nov 29 2020
next sibling parent reply Mark <was330 via.tokyo.jp> writes:
 I could use AddressSanitizer indirectly by using Go. But their
Oh wait, it was ThreadSanitizer that Go uses, right? I failed at talking. I would probably use ASAN under Linux, because that is the right thing to do? Looking at Ada now.
Nov 29 2020
parent reply Mark <was330 via.tokyo.jp> writes:
 Looking at Ada now.
I found: Ada is not good for me. It has no augmented assignment.

It's just that I want DRY because I use very verbose variable names, and in the past I had a real-world case (a game in Lua) where I became frustrated when I had to repeat the names. I understand that NASA or so will repeat their variable names. They get paid. ;)

Kind regards
Nov 29 2020
parent Kagamin <spam here.lot> writes:
On Sunday, 29 November 2020 at 19:09:07 UTC, Mark wrote:
 Looking at Ada now.
I found: Ada is not good for me. It has no augmented assignment. It's just that I want DRY because I use very verbose variable names
Using a reasonable naming convention should be much easier than looking for a perfect custom language. Well, another option is Zig, which was supposed to be like C, but safe.
Dec 01 2020
prev sibling parent Daniel N <no public.email> writes:
On Sunday, 29 November 2020 at 16:35:26 UTC, Mark wrote:
 Maybe I should just install Linux. But ... the drivers... My 
 Thinkpad just doesn't like any Linux. I run out of ideas.

 In the first place all I wanted to do is make some music.

 Kind regards
You could try a Linux image in VirtualBox or VMware, to more easily evaluate whether Linux + ASAN matches your expectations or is another dead end.

Regards,
Daniel
Nov 29 2020
prev sibling next sibling parent Kagamin <spam here.lot> writes:
Maybe Ada.
Nov 29 2020
prev sibling next sibling parent reply Elronnd <elronnd elronnd.net> writes:
On Sunday, 29 November 2020 at 16:05:04 UTC, Mark wrote:
 I have no good understanding why "garbage collection" is a big 
 thing and why "garbage detection" is no thing (I think so).
Because garbage detection is just as expensive as automatic garbage collection. So if you're already going to do the work of detecting when something is garbage, it's basically free to collect it at that point.
 today there exist tools for C++ (allocator APIs for debugging 
 purposes, or Address-Sanitizer or maybe also MPX) to just 
 detect that your program tried to use a memory address that was 
 actually freed and invalidated,
Note that address sanitizer is significantly slower than most ‘real’ GCs (such as are used by java, or others).
 why did Java and other languages not stop there but also made a 
 system that keeps every address alive as long as it is used?
 Then the debug build would use a virtual machine that uses type 
 information from compilation for garbage detection, but not 
 garbage collection.
Address sanitizer does exactly what you propose here. The problem is this: testing can only prove the presence of bugs, never their absence. You may run your C++ program a thousand times with address sanitizer enabled and get no errors; yet your code may still be incorrect and contain memory errors. Safety features in a language--like a GC--prevent an entire class of bugs definitively.
 One very minor criticism that I have is: With GC there can be 
 "semantically old data" (a problematic term, sorry) which is 
 still alive and valid, and the language gives me the feeling 
 that it is a nice system that way. But the overall behavior 
 isn't necessarily very correct, it's just that it is much 
 better than a corrupted heap which could lead to everything 
 possibly crashing soon.
The distinction here is _reachability_ vs _liveness_. So, GC theory:

A _graph_ is a type of data structure. Imagine you have a sheet of paper, and on the sheet of paper you have a bunch of dots. There are lines connecting some of the dots. In graph theory, the dots are called nodes, and the lines are edges. We say that nodes A and B are _connected_ if there is an edge going between them. We also say that A is _reachable_ from B if either A and B are connected, or A is connected to some C, where C is reachable from B. Basically, if you can reach one point from another just by following lines, then each is reachable from the other.

A _directed_ graph is one in which the edges have directionality. Imagine the lines have little arrows at the ends. There may be an edge that goes A -> B; or there may be an edge that goes B -> A. Or there may be both: A <-> B. (Or they can be unconnected.) In this case, to reach one node from another, you have to follow the arrows. So it may be that, starting at A, you can reach B, but you can't go the other way round.

The _heap_ is all the objects you've ever created. This includes the objects you allocate with 'new', as well as all the objects you allocate on the stack and all your global variables. What's interesting is that we can think of the heap as a directed graph: if object A contains a pointer to object B, we can think of that the same way as there being an edge going from node A to node B.

The _root set_ is some relatively small number of heap objects that are always available. Generally, this is all the global variables and stack-allocated objects. The name _reachable_ is given to any object which is reachable from one of the root set. It is impossible for your program to access an unreachable object; there's no way to get a pointer to it in the first place. So it is safe for the GC to free unreachable objects.

But we can also add another category of objects: _live_ vs _dead_ objects.
Live objects are ones which you're actually going to access at some point. Dead objects are objects that you're never going to access again, even if they're reachable. If a GC could detect which reachable objects were dead, it would be able to be more efficient and use less memory... hypothetically.

The reason this distinction is important, and the reason I bring up graph theory, is that liveness is impossible to prove. Seriously: it's impossible, in the general case, for the GC to prove that an object is still alive. Whereas it's trivial to prove reachability.

Now, it is true that there are some cases where an object is dead but still reachable. The fact of the matter is that in most such cases, the object becomes unreachable shortly thereafter. In the cases when it's not, it tends to be impractical to prove the object is dead. The extra work it would take to prove deadness in such cases, if it were even possible, would make it not a worthwhile optimization.
 And when I have tested all runtime cases of my compiled 
 software, which runs slow, but quite deterministically, I will 
 go on and build the release build.

 And if the release build (which is faster) does not behave 
 deterministically, I would fix the "VM/Non-VM compiler" I'm 
 talking about until the release build shows the same behavior.

 I guess there is a way this approach could fail: Timing may 
 have influence and make the VM behave differently from the 
 Non-VM (e.g. x64).
I don't know why you're so hung up on timing. It's easy to write code which isn't sensitive to timing, as long as you don't use threads. That doesn't mean it's possible to test it exhaustively; see the above note about testing.
Nov 29 2020
parent Mark <was330 via.tokyo.jp> writes:
 The reason this distinction is important, and the reason I 
 bring up graph theory, is that liveness is impossible to prove.
  Seriously: it's impossible, in the general case, for the GC to 
 prove that an object is still alive.  Whereas it's trivial to 
 prove reachability.
My motivation was actually just that I wanted a very small compiler with no libraries, because I got a bit tired of big things with little defects.

But liveness is a thing I would like to say something about: I don't want a compiler that tries to prove it for me. The reason is that if I did manual memory management and created a use-after-free bug, then my personal world would still be very good. It's just that the industry isn't happy, because for them releasing banana software is a serious problem(?).

In the case of the use-after-free I would say that my software needs correct memory state and correct logic. The logic, so to speak, is what I actually wanted to implement in the first place. If the program is perfect and works in all situations, how could it do this with bugs in the memory management? It can't.

So, I didn't really think about it often, but when Rust came out, there was some trend in this direction, and my feeling was: I can follow it, it looks good, but can I still just use C? The way it would work is just that I have correct out-of-bounds and use-after-free/double-free detection, and race condition detection if my software is supposed to be chaotic server or browser software (?), both at runtime. Given that it is true (?) that software cannot do a task correctly when it does part of it (memory management) incorrectly.

More or less it was an idea to get what Rust and Swift try to do, without all the language features. C with fewer language constructs would be nice. It's just that the processors are bad for me, for what I'm trying to do. Intel offers MPX; it's good, I guess. But why does Visual C++ implement it like an easter egg? Why have they now added ASAN as an experimental feature that immediately fails? They do things I don't really like, because the industry needs it like this, but not me. And at the high level, there's bloat and fancy colors. I want just a toolkit to create exactly what can be created, and if I do it wrong it should fail hard.
And I'm trying to make it so that I don't write assembly, if that is possible and good in my situation.

I'd say I have just not understood the whole thing, and maybe I should try a different hobby, or finally create the thing I'm looking for, which turns out to be an endless story. It's just that a few hours ago I had the hope that the holy grail would exist. I had found the solution to almost every problem in Golang. And then after one year my hobby just broke down, because Go outputs wrong line numbers. It's no good compiler when it does that. I'd rather quit my hobby than accept it.

Thanks a lot for your explanation! Really kind regards, you helped me understand it.
Nov 29 2020
prev sibling parent reply Bastiaan Veelo <Bastiaan Veelo.net> writes:
On Sunday, 29 November 2020 at 16:05:04 UTC, Mark wrote:
 Hi,

 can I ask you something in general? I don't know anyone whom I 
 could ask. I'm a hobbyist with no science degree or job in 
 computing, and also know no other programmers.

 I have no good understanding why "garbage collection" is a big 
 thing and why "garbage detection" is no thing (I think so).
In order to detect garbage, you need extensive run-time instrumentation, the difficulties of which you have indicated yourself. In addition, detection depends on circumstance, which is an argument against the debug/release strategy you proposed: there is no guarantee that you'll find all problems in the debug build.

Garbage collection also comes at a runtime cost, but strategies exist to minimise it, and in addition a GC enables valuable language features. One such strategy is to minimise allocations, which improves performance under any memory management scheme.

[...]
 What I don't understand is, when today there exist tools for 
 C++ (allocator APIs for debugging purposes, or 
 Address-Sanitizer or maybe also MPX) to just detect that your 
 program tried to use a memory address that was actually freed 
 and invalidated,

 why did Java and other languages not stop there but also made a 
 system that keeps every address alive as long as it is used?
Elimination of memory problems is much more valuable than detection. Recovering from memory errors at run time is unreliable.
 One very minor criticism that I have is: With GC there can be 
 "semantically old data" (a problematic term, sorry) which is 
 still alive and valid, and the language gives me the feeling 
 that it is a nice system that way. But the overall behavior 
 isn't necessarily very correct, it's just that it is much 
 better than a corrupted heap which could lead to everything 
 possibly crashing soon.
At least in D, you can avoid having old data hang around for too long. See core.memory.
 Or maybe I could use the safe-c subset in D? But I believe it 
 uses garbage collection. I know nothing about it, sorry.
@safe D is not a subset; it does indeed use garbage collection. In fact there are very few domains where this is a problem. Not all garbage collectors are equal either, so if garbage collection is bad in one language, that may not directly apply in another. In D the garbage collector is even pluggable; various implementations exist. Have you seen the GC category on the blog? https://dlang.org/blog/2017/03/20/dont-fear-the-reaper/

BetterC is a subset of D that does not use garbage collection. You may be interested in current work being done on static analysis of manual memory management in D: https://youtu.be/XQHAIglE9CU

The advantage of D is that all options are open. This allows the following approach:
1) Start development without worrying about memory. Should collection cycles be noticeable:
2) Profile your program and make strategic optimisations (https://youtu.be/dRORNQIB2wA). If this is not enough:
3) Force explicit collection in idle moments. If you need to go further:
4) Completely eliminate collection in hot loops using @nogc and/or GC.disable. When even this is not enough:
5) Try another GC implementation. And if you really need to:
6) Switch to manual memory management where it matters.

This makes starting a project in D a safe choice, in multiple meanings of the word.

— Bastiaan.
Nov 29 2020
parent reply Mark <was330 via.tokyo.jp> writes:
 Elimination of memory problems is much more valuable than 
 detection. Recovering from memory errors at run time is 
 unreliable.
I'm not sure, but I have a gut feeling that I am just in a position that is not good to defend. I want small software that fails hard on weak causes, and the industry wants software that fails soft on strong causes. I cannot win any argument and will, as a hobbyist who likes elegant code, face frustration after frustration, admittedly with all digital products, not only compilers.

I'm sure you're right. Elimination of memory problems is what must be provided. I agree.
 The advantage of D is that all options are open. This allows 
 the following approach:
 ...
 ...
 This makes starting a project in D a safe choice, in multiple 
 meanings of the word.

 — Bastiaan.
Thanks! :) Well, in the end, there should be a way like this.
Nov 29 2020
parent Mark <was330 via.tokyo.jp> writes:
 Recovering from memory errors at run time is unreliable.
I should add that I have more like a romantic view of software release cycles where testing is done until the software is in a very, very sophisticated and stable state. More than usual. Not that I want to solely rely on such an approach. It should still try to recover from failures, and recovering should be trained like fire fighting is trained. But there should be no smokers in the building, so to speak.
Nov 29 2020