
digitalmars.D - What's the go with the GC these days?

reply Manu <turkeyman gmail.com> writes:
I'm somewhere between a light GC user and a @nogc user, and I don't
really know much about where we're at, or much about state-of-the-art
GC in general.

I read comments like this:
https://www.reddit.com/r/programming/comments/acsg61/version_20840_of_dmd_the_d_reference_compiler_has/edbhgzi

"""
And so we do nothing and D continues to languish as every thread is
filled with discussion about a poor implementation of a sorely
outdated GC? Other languages in this domain get by just fine - Go and
Nim to name some off the top of my head (with the full disclosure that
the former isn't quite D performance and Nim uses a thread-local GC).
There are C / C++ libraries that provide a better GC. D's tradeoff is
basically having one of the worst GCs of any language I've used this
century to avoid a write barrier that causes a, what, 5%
nigh-deterministic overall reduction in performance (assuming we use
fast barriers out of a cycle).

And if a reasonably performant GC is fundamentally incompatible and
you think it will always be this way, then maybe D as a language needs
to make serious re-evaluations? D keeps saying "Don't Fear the
Reaper", but D is one of the only languages where I actually do.
"""

How much truth is in here?
What is this business about write barriers making the GC fast? Is that
actually fact, or are there other hurdles?

Is the claim that write-barriers generally slow your runtime
performance significant? And if it is, is it possible/reasonable to
control the places where they are emitted?
Can @nogc be used to inhibit write barriers in code that we know
doesn't interact with the GC, such that we have control over that loss
of perf?

It's obvious that the quality of the GC implementation is a mental
barrier to entry for many... and D has a GC, which is *attractive* to
some users. Despite the fact that I don't care about the GC much
personally, we do need to care about this as a group, and nobody seems
to be making substantial progress.

Is progress possible, or is the hard reality that the language is just
designed such to be resistant to a quality GC, while the ecosystem
sadly tends to rely on it?

Where's the ARC stuff? What happened to opAddRef/opRelease?
Jan 05
next sibling parent reply Meta <jared771 gmail.com> writes:
On Saturday, 5 January 2019 at 22:05:19 UTC, Manu wrote:
 Is progress possible, or is the hard reality that the language 
 is just designed such to be resistant to a quality GC, while 
 the ecosystem sadly tends to rely on it?

 Where's the ARC stuff? What happened to opAddRef/opRelease?
As per Andrei's talk at the last Dconf, ref counting requires __mutable to play nicely with const and immutable.
Jan 05
parent reply Nicholas Wilson <iamthewilsonator hotmail.com> writes:
On Saturday, 5 January 2019 at 22:42:11 UTC, Meta wrote:
 On Saturday, 5 January 2019 at 22:05:19 UTC, Manu wrote:
 Is progress possible, or is the hard reality that the language 
 is just designed such to be resistant to a quality GC, while 
 the ecosystem sadly tends to rely on it?

 Where's the ARC stuff? What happened to opAddRef/opRelease?
As per Andrei's talk at the last Dconf, ref counting requires __mutable to play nicely with const and immutable.
I'd rather have opHeadMutable than __mutable, does the same thing but doesn't subvert the type system
Jan 05
next sibling parent reply Meta <jared771 gmail.com> writes:
On Saturday, 5 January 2019 at 23:51:53 UTC, Nicholas Wilson 
wrote:
 On Saturday, 5 January 2019 at 22:42:11 UTC, Meta wrote:
 On Saturday, 5 January 2019 at 22:05:19 UTC, Manu wrote:
 Is progress possible, or is the hard reality that the 
 language is just designed such to be resistant to a quality 
 GC, while the ecosystem sadly tends to rely on it?

 Where's the ARC stuff? What happened to opAddRef/opRelease?
As per Andrei's talk at the last Dconf, ref counting requires __mutable to play nicely with const and immutable.
I'd rather have opHeadMutable than __mutable, does the same thing but doesn't subvert the type system
I'm fairly dubious of adding __mutable as well, but I'm assuming the previous solution of using an AffixAllocator didn't pan out. I don't know enough about opHeadMutable to consider whether it would address the same problems.
Jan 05
parent Timon Gehr <timon.gehr gmx.ch> writes:
On 06.01.19 04:02, Meta wrote:
 On Saturday, 5 January 2019 at 23:51:53 UTC, Nicholas Wilson wrote:
 On Saturday, 5 January 2019 at 22:42:11 UTC, Meta wrote:
 On Saturday, 5 January 2019 at 22:05:19 UTC, Manu wrote:
 Is progress possible, or is the hard reality that the language is 
 just designed such to be resistant to a quality GC, while the 
 ecosystem sadly tends to rely on it?

 Where's the ARC stuff? What happened to opAddRef/opRelease?
As per Andrei's talk at the last Dconf, ref counting requires __mutable to play nicely with const and immutable.
I'd rather have opHeadMutable than __mutable, does the same thing but doesn't subvert  the type system
I'm fairly dubious of adding __mutable as well, but I'm assuming the previous solution of using an AffixAllocator didn't pan out.
It can't work without `__mutable` because it breaks transitivity of immutability. The entire point of `__mutable` is to add a transitivity escape hatch for system-level data structures and runtime implementations that doesn't block high-level optimizations based on immutability and purity.
 I don't know 
 enough about opHeadMutable to consider whether it would address the same 
 problems.
It's orthogonal. If you have an immutable object, it can't have a reference to a reference count without `__mutable`. opHeadMutable can't change that.
Jan 07
prev sibling parent Timon Gehr <timon.gehr gmx.ch> writes:
On 06.01.19 01:51, Nicholas Wilson wrote:
 On Saturday, 5 January 2019 at 22:42:11 UTC, Meta wrote:
 On Saturday, 5 January 2019 at 22:05:19 UTC, Manu wrote:
 Is progress possible, or is the hard reality that the language is 
 just designed such to be resistant to a quality GC, while the 
 ecosystem sadly tends to rely on it?

 Where's the ARC stuff? What happened to opAddRef/opRelease?
As per Andrei's talk at the last Dconf, ref counting requires __mutable to play nicely with const and immutable.
I'd rather have opHeadMutable than __mutable, does the same thing but doesn't subvert the type system
I'd like opHeadMutable too, but it certainly doesn't do the same thing.
Jan 07
prev sibling next sibling parent reply Neia Neutuladh <neia ikeran.org> writes:
On Sat, 05 Jan 2019 14:05:19 -0800, Manu wrote:
 I'm somewhere between a light GC user and a @nogc user, and I don't
 really know much about where we're at, or much about state-of-the-art GC
 in general.
I use the GC unabashedly and only try to make sure I reuse memory when it's reasonably convenient. I've also looked into GC a bit.
 How much truth is in here?
D uses a simple GC. Simple means easy to reason about. It's also got better tools to stress the GC less.

One thing it could get that would be interesting is a thread-local GC. (You'd have a stop-the-world phase that happened infrequently. Casting something to shared would pin it in the thread it was allocated from.)

Another thing that it probably should have is a precise pointer map. For small allocations, that probably wouldn't be a net benefit -- the overhead of storing a typeinfo pointer by each allocation isn't awesome. But for mid-size arrays and larger, it could be helpful.

The Boehm collector is more advanced than D's GC. It's (at least optionally) generational with write barriers. Go is a GC-heavy language, only a bit less so than Java, so its GC performance makes a huge difference.
 What is this business about write barriers making the GC fast? Is that
 actually fact, or are there other hurdles?
With D's GC, your normal operation is unimpeded; you're not running GC code at all unless you allocate. This makes it easy to reason about the GC. However, the GC must then scan all your memory to determine if you have pointers to a memory allocation.

With write barriers, any write to any piece of memory that might contain pointers can be intercepted. The fast-but-inaccurate way is to use the MMU: mark that memory readonly and set up a fault handler. The handler will mark the memory read-write and enqueue that page of memory for scanning. The GC must also maintain a graph of which pages of memory point to which other pages of memory.
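The injected-code variant of a write barrier can be sketched as card marking: every pointer store also flags a fixed-size "card" covering the written slot, and the collector rescans only flagged cards. A minimal illustration -- `writeBarrier`, `cardTable`, and `CARD_SHIFT` are invented for the sketch; a real compiler would emit the equivalent of this call itself:

```d
// Sketch of a compiler-inserted ("injected code") write barrier using
// card marking. All names here are invented for illustration.
enum CARD_SHIFT = 9; // one card covers 512 bytes

// One dirty-flag byte per card; a real GC sizes this to the heap.
__gshared ubyte[1 << 16] cardTable;

// The call the compiler would emit in place of a plain pointer store.
void writeBarrier(void** slot, void* newValue)
{
    *slot = newValue; // the actual pointer write
    // Flag the card containing the written slot so the collector
    // rescans only that region instead of the whole heap.
    auto card = (cast(size_t) slot >> CARD_SHIFT) % cardTable.length;
    cardTable[card] = 1;
}

void main()
{
    static void* target; // a slot the GC would need to track
    writeBarrier(&target, new int);

    auto card = (cast(size_t) &target >> CARD_SHIFT) % cardTable.length;
    assert(cardTable[card] == 1); // the write was recorded
}
```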
 Is the claim that write-barriers generally slow your runtime performance
 significant? And if it is, is it possible/reasonable to control the
 places where they are emit?
You can control it a little. Any time you think you might possibly need to write to a pointer, you do a fake write to it in advance. An optimizing compiler might try to remove those writes, so maybe define it in assembly. This is hideously cumbersome. Alternatively, you could manually mark a segment of memory for the GC to clear the write barrier, and it will have to scan that region next collection.
 Can @nogc be used to inhibit write barriers in code that we know doesn't
 interact with the GC, such that we have control over that loss of perf?
The point of a write barrier is to record where pointers might have changed. @nogc doesn't mean that you can't change a pointer, so it can't sidestep write barriers. Not unless you want both memory leaks and use-after-free bugs.
 Is progress possible, or is the hard reality that the language is just
 designed such to be resistant to a quality GC, while the ecosystem sadly
 tends to rely on it?
Three things about D make it harder to make a good GC for it:

* unaligned pointers
* unions
* externally allocated memory (malloc and friends)

We've pretty much addressed malloc by telling people to manually add and remove malloced memory from what the GC scans. A union is pretty much just a pointer that might not be valid. Unaligned pointers just kind of suck.
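To make the union problem concrete, this is the kind of code that forces a conservative scan -- nothing at runtime tells the GC which member is live:

```d
// Why unions are a hurdle for a precise GC: the same word may be a
// pointer or an integer, and nothing at runtime says which is live.
union Slot
{
    int*   ptr;  // if live, the GC must keep the pointee alive
    size_t bits; // if live, the same bytes are just an integer
}

void main()
{
    Slot s;
    s.bits = 0xDEAD_BEEF; // might be mistaken for an address
    // A conservative GC has to treat s as a possible pointer and pin
    // whatever 0xDEAD_BEEF happens to land in; a precise GC would need
    // a runtime tag that D does not have.
    assert(s.bits == 0xDEAD_BEEF);
}
```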
Jan 05
next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 1/5/2019 3:12 PM, Neia Neutuladh wrote:
 The Boehm collector is more advanced than D's GC. It's (at least
 optionally) generational with write barriers.
Hmm, how does the Boehm collector insert write barriers? Back in the 90's, I implemented a collector that did write barriers by setting the page to read-only. When a write to it was executed, the resulting seg fault was captured, the write was logged, and the page was set to read-write. It was a fabulous idea, but the problem was it was SLOWER! The operating system's dealing with seg faults and changing write permissions was execrably slow.
Jan 05
parent reply Neia Neutuladh <neia ikeran.org> writes:
On Sat, 05 Jan 2019 19:46:22 -0800, Walter Bright wrote:
 On 1/5/2019 3:12 PM, Neia Neutuladh wrote:
 The Boehm collector is more advanced than D's GC. It's (at least
 optionally) generational with write barriers.
Hmm, how does the Boehm collector insert write barriers?
The docs say it "provides incremental and generational collection under operating systems which provide the right kind of virtual memory support." The source code says they're using mprotect on most posix systems, vm_protect on OSX, and something on Windows.

The only other way I know of to implement a write barrier is to insert some code on every pointer write. GC-as-a-library would require you to use their pointer type instead of the builtin, and I expect most C++ devs would rather just use a reference counting pointer type instead.
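The mprotect approach can be sketched roughly as follows (a POSIX-only sketch; the SIGSEGV handler is elided because a real one must be async-signal-safe, so its job is performed inline in comments):

```d
// Rough shape of the virtual-memory write barrier (POSIX-only sketch).
import core.sys.posix.sys.mman;

void main()
{
    enum pageSize = 4096;

    // A page of GC-managed memory.
    void* page = mmap(null, pageSize, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANON, -1, 0);
    assert(page != MAP_FAILED);

    // After a collection the GC marks clean pages read-only.
    mprotect(page, pageSize, PROT_READ);

    // The first write would fault; the handler would log "page dirty",
    // restore write access, and return so the write retries. Here we
    // perform the handler's work directly instead of actually faulting:
    mprotect(page, pageSize, PROT_READ | PROT_WRITE);
    *cast(int*) page = 42; // succeeds; page is queued for rescanning

    assert(*cast(int*) page == 42);
    munmap(page, pageSize);
}
```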
 Back in the 90's, I implemented a collector that did write barriers by
 setting the page to read-only. When a write to it was executed, the
 resulting seg fault was captured, the write was logged, and the page was
 set to read-write.
 
 It was a fabulous idea, but the problem was it was SLOWER! The operating
 system's dealing with seg faults and changing write permissions was
 execrably slow.
Either the speed has improved, or they're just eating the time cost. The other side effect is that you (and the GC) have to be very careful about replacing other segfault handlers.
Jan 05
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 1/5/2019 8:52 PM, Neia Neutuladh wrote:
 The only other way I know of to implement a write barrier is to insert
 some code on every pointer write. GC-as-a-library would require you to use
 their pointer type instead of the builtin, and I expect most C++ devs
 would rather just use a reference counting pointer type instead.
Microsoft tried this with their "Managed C++" variant of C++, where they had two fundamental pointer types. It was a laudable effort, but failed to gain traction. D learns from that mistake :-)
 Either the speed has improved, or they're just eating the time cost.
The fact that Java/Go/etc. use inserted write gates suggest the speed hasn't improved, which leaves eating the cost.
 The other side effect is that you (and the GC) have to be very careful
 about replacing other segfault handlers.
Yeah, you can do that with a sandboxed language, but not one that is a systems programming language where users will want to use to muck about with segfault handlers themselves.
Jan 05
next sibling parent reply Manu <turkeyman gmail.com> writes:
On Sat, Jan 5, 2019 at 10:55 PM Walter Bright via Digitalmars-d
<digitalmars-d puremagic.com> wrote:
 On 1/5/2019 8:52 PM, Neia Neutuladh wrote:
 The only other way I know of to implement a write barrier is to insert
 some code on every pointer write. GC-as-a-library would require you to use
 their pointer type instead of the builtin, and I expect most C++ devs
 would rather just use a reference counting pointer type instead.
Microsoft tried this with their "Managed C++" variant of C++, where they had two fundamental pointer types. It was a laudable effort, but failed to gain traction. D learns from that mistake :-)
I was very much 'there' when this was rejected by the community. I don't think it 'failed to gain traction' despite trying, so much as it was just plain rejected or dismissed. And not for any particularly good reason. It went like this among the ~hundreds of programmers I interacted with at the time:

1. "Managed C++" is code for "lame C#", or just "C# interop layer"
2. C++ users want absolutely nothing to do with C#, and nothing to do with the word 'managed'
3. C# users want absolutely nothing to do with C++
4. The feature didn't stand alone as a C++ feature
5. Even if it did, it was limited to Windows, and to MS compilers. Completely not-portable.

It's obvious why it failed. I don't think comment on the merit of the design has any relevance at all to that particular story. The particulars of the design and whether it was good or not couldn't have had less effect on the feature being ignored and/or rejected.
 Either the speed has improved, or they're just eating the time cost.
The fact that Java/Go/etc. use inserted write gates suggest the speed hasn't improved, which leaves eating the cost.
Could D insert such code, and also effectively elide it where nogc is used (or inferred)? If we can help the GC in common code, but still have the expressive power we need to elide that extra work where we know it's not necessary, then it might be reasonable to consider.
Jan 05
next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 1/5/2019 11:58 PM, Manu wrote:
 I was very much 'there' when this was rejected by the community. I
 don't think it 'failed to gain traction' despite trying, so much as it
 was just plain rejected or dismissed.
 And not for any particularly good reason. It went like this among the
 ~hundreds of programmers I interacted with at the time.
 1. "Managed C++" is code for "lame C#", or just "C# interop layer"
 2. C++ users want absolutely nothing to do with C#, and nothing to do
 with the word 'managed'
 3. C# users want absolutely nothing to do with C++
 4. The feature didn't stand alone as a C++ feature
 5. Even if it did, it was limited to windows, and to MS compilers.
 Completely not-portable.
I had a lot of experience with multiple pointer types from the DOS world. While it works, it is a constant headache. With every .. single .. pointer one had to decide: near*? far*? ss*? cs*? huge*? and Zortech's vptr*? And of course every .. single .. data .. structure had the same issues, and no, they could not interoperate.

Programmers (myself included) were very happy to abandon that way of doing things.
 It's obvious why it failed. I don't think comment on the merit of the
 design has any relevance at all to that particular story.
 The particulars of the design and whether it was good or not couldn't
 have had less affect on the feature being ignored and/or rejected.
It had a large effect in my experience. I'm an expert on near/far/etc, and I don't miss it like I don't miss EBCDIC. I was kinda glad to see Managed C++ fail, so I wouldn't get any pressure to implement it.

(I also recall programmers looking at it and responding "two kinds of pointers? Blech." Not just me.)
 Either the speed has improved, or they're just eating the time cost.
The fact that Java/Go/etc. use inserted write gates suggest the speed hasn't improved, which leaves eating the cost.
Could D insert such code, and also effectively elide it where nogc is used (or inferred)?
nogc code means no allocations happen, not that there are no pointers to the gc in the code. It's orthogonal.
 If we can help the GC in common code, but still have the expressive
 power we need to elide that extra work where we know it's not
 necessary, then it might be reasonable to consider.
If we could do that, we wouldn't need GC, the compiler could figure out the last use of each pointer. With all the research into GCs, nobody has ever figured that out.
Jan 06
next sibling parent reply Manu <turkeyman gmail.com> writes:
On Sun, Jan 6, 2019 at 12:55 AM Walter Bright via Digitalmars-d
<digitalmars-d puremagic.com> wrote:
 It's obvious why it failed. I don't think comment on the merit of the
 design has any relevance at all to that particular story.
 The particulars of the design and whether it was good or not couldn't
 have had less affect on the feature being ignored and/or rejected.
It had a large effect in my experience. I'm an expert on near/far/etc, and I don't miss it like I don't miss EBCDIC. I was kinda glad to see Managed C++ fail, so I wouldn't get any pressure to implement it. (I also recall programmers looking at it, and respond "two kinds of pointers? Blech." Not just me.)
But now we have heaps of pointers... just that they have ugly names. Now we have std::unique_ptr<T>, std::shared_ptr<T>, std::auto_ptr<T>, std::weak_ptr<T>. These are all just as equally 'multiple kinds of pointers' as `T^` was for ARC pointers, except they have hideous names, instead of a nice concise 1-byte syntax.

It's all irrelevant though; nobody's asking for multiple pointer types in D. All I want from language support for ARC in D is an opInc/opDec function which are called appropriately around assignments, and elided appropriately. Copy ctors can't offer this functionality.
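To make the request concrete, here is what such hooks might look like if written out by hand today. `opInc`/`opDec` as compiler-invoked, compiler-elidable hooks are hypothetical, so the calls the compiler would insert are written explicitly:

```d
// Hand-written stand-in for the hypothetical opInc/opDec hooks: the
// explicit calls below are what the compiler would insert (and, where
// provably balanced, elide). All names are illustrative.
struct RC(T)
{
    private static struct Payload { T value; size_t count; }
    private Payload* p;

    static RC make(T value) { return RC(new Payload(value, 1)); }

    void opInc() { if (p) ++p.count; }  // would run when a copy is made
    void opDec()                        // would run when a copy dies
    {
        if (p && --p.count == 0)
            p = null; // last owner; the GC reclaims the payload
    }

    size_t refs() const { return p ? p.count : 0; }
}

void main()
{
    auto a = RC!int.make(7);
    auto b = a;  // today no hook runs on this copy...
    a.opInc();   // ...so we insert the call the compiler would make
    assert(a.refs == 2);

    a.opDec();   // simulate one reference going out of scope
    assert(b.refs == 1);
}
```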
Jan 06
next sibling parent Ethan <gooberman gmail.com> writes:
On Sunday, 6 January 2019 at 09:40:25 UTC, Manu wrote:

 But now we have heaps of pointers... just that they have ugly 
 names.
 Now we have std::unique_ptr<T>, std::shared_ptr<T>, 
 std::auto_ptr<T>,
 std::weak_ptr<T>
 These are all just as equally 'multiple kinds of pointers' as 
 `T^` was
 for ARC pointers, except they have hideous names, instead of a 
 nice
 concise 1-byte syntax.
Not much to add to this discussion other than "Where's my ARC???" But, the impression I got from digging around UWP code for Quantum Break is that T^ is just syntactic sugar for a COM pointer. So C++ Windows programmers have had different types of pointers for much longer than C++11. If Microsoft had just come out and said "They're COM pointers" there'd have been far less gnashing of teeth.
Jan 06
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 1/6/2019 1:40 AM, Manu wrote:
 It's all irrelevant though, Nobody's asking for multiple pointer types
 in D. All I want from language support for ARC in D, is an opInc/opDec
 function which are called appropriately around assignments, and elided
 appropriately.
It turns out to be fiendishly difficult to automatically elide counter bumps.
 copy ctor's can't offer this functionality.
They can produce a working ref counting solution. D's will have a couple fundamental advantages over the C++ one:

1. D won't need the locking on the increment, because D has different types for threaded vs non-threaded.

2. With dip25 and dip1000, D can offer memory safe access to rc object contents.
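A minimal sketch of what this looks like -- ref counting built from just a postblit and a destructor, with no new language hooks (and no elision of the bumps):

```d
// Minimal ref counting from a postblit and a destructor: every copy
// bumps the count, every destruction drops it.
struct Counted
{
    private size_t* count;

    static Counted create()
    {
        auto c = Counted(new size_t);
        *c.count = 1;
        return c;
    }

    this(this) { if (count) ++*count; }  // copy: increment
    ~this()                              // destruction: decrement
    {
        if (count && --*count == 0)
            count = null; // last reference; the GC frees the counter
    }

    size_t refs() const { return count ? *count : 0; }
}

void main()
{
    auto a = Counted.create();
    {
        auto b = a;           // postblit fires
        assert(a.refs == 2);
    }                         // b's destructor fires
    assert(a.refs == 1);
}
```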
Jan 06
parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Sun, Jan 06, 2019 at 01:25:32PM -0800, Walter Bright via Digitalmars-d wrote:
 On 1/6/2019 1:40 AM, Manu wrote:
 It's all irrelevant though, Nobody's asking for multiple pointer
 types in D. All I want from language support for ARC in D, is an
 opInc/opDec function which are called appropriately around
 assignments, and elided appropriately.
It turns out to be fiendishly difficult to automatically elide counter bumps.
Just out of curiosity, any concrete examples of difficulties that prevent easy elision of counter bumps? Just so we have a better idea of the challenges we're up against. Are there any common cases that are easy enough to implement, that might be "good enough", leaving more tricky cases aside?
 copy ctor's can't offer this functionality.
They can produce a working ref counting solution.
Care to elaborate? Is it related to how current move + postblit semantics make it impractical to maintain a consistent ref count?
 D's will have a couple fundamental advantages over the C++ one:
 
 1. D won't need the locking on the increment, because D has different
 types for threaded vs non-threaded.
 
 2. With dip25 and dip1000, D can offer memory safe access to rc object
 contents.
dip25 and dip1000 have been around for a long time now. What are the remaining issues blocking their full implementation / deployment? T -- Give a man a fish, and he eats once. Teach a man to fish, and he will sit forever.
Jan 06
next sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 1/6/2019 6:00 PM, H. S. Teoh wrote:
 On Sun, Jan 06, 2019 at 01:25:32PM -0800, Walter Bright via Digitalmars-d
wrote:
 It turns out to be fiendishly difficult to automatically elide counter
 bumps.
Just out of curiosity, any concrete examples of difficulties that prevent easy elision of counter bumps? Just so we have a better idea of the challenges we're up against. Are there any common cases that are easy enough to implement, that might be "good enough", leaving more tricky cases aside?
See: https://digitalmars.com/d/archives/digitalmars/D/draft_proposal_for_ref_counting_in_D_211885.html There's more on the Dlang-study mailing list: Dlang-study puremagic.com http://lists.puremagic.com/cgi-bin/mailman/listinfo/dlang-study
 copy ctor's can't offer this functionality.
They can produce a working ref counting solution.
Care to elaborate?
https://github.com/RazvanN7/DIPs/blob/efacbf8efde8027291113633984b0e7a039e8f30/DIPs/DIP1xxx-rn.md
 D's will have a couple fundamental advantages over the C++ one:

 1. D won't need the locking on the increment, because D has different
 types for threaded vs non-threaded.

 2. With dip25 and dip1000, D can offer memory safe access to rc object
 contents.
dip25 and dip1000 have been around for a long time now. What are the remaining issues blocking their full implementation / deployment?
dip25 is becoming the default compiler behavior. For dip1000, the next step is getting Phobos to compile with dip1000. Now that https://github.com/dlang/dmd/pull/8504 is merged, I still need to improve it with `return` inference.
Jan 06
prev sibling next sibling parent Nicholas Wilson <iamthewilsonator hotmail.com> writes:
On Monday, 7 January 2019 at 02:00:05 UTC, H. S. Teoh wrote:
 dip25 and dip1000 have been around for a long time now. What 
 are the remaining issues blocking their full implementation / 
 deployment?
For a long while, documentation. See the dates on https://github.com/dlang/dlang.org/pull/2453 -- a good waste of 4 1/2 months. But the documentation still needs to improve a lot.

dip25 is in the process of being turned on by default as an opt-out -transition switch and going through a deprecation cycle. dip1000 has changed quite a bit since the reviews, and I suspect it will need to go through more once the spec for it is nailed down more (hello dconf?), and it too will probably need to be transitioned in.
Jan 06
prev sibling parent reply Ola Fosheim =?UTF-8?B?R3LDuHN0YWQ=?= <ola.fosheim.grostad gmail.com> writes:
On Monday, 7 January 2019 at 02:00:05 UTC, H. S. Teoh wrote:
 Just out of curiosity, any concrete examples of difficulties 
 that prevent easy elision of counter bumps?
Let's say you obtain ownership of every other object in an array; then you have to prove that you also release every other object in that array before you return from the function.

If you always hold onto the ref-counted object in a named reference that is never changed, then it is reasonably easy to do. Then you only have to prove that the acquired object is released once while the reference is live.

So basically, whenever the ref-counted object is accessed through a graph-like structure, you need an advanced prover.
Jan 08
parent Ola Fosheim =?UTF-8?B?R3LDuHN0YWQ=?= <ola.fosheim.grostad gmail.com> writes:
On Tuesday, 8 January 2019 at 10:56:43 UTC, Ola Fosheim Grøstad 
wrote:
 Let's say you obtain ownership of every other object in an 
 array, then you have to prove that  you also release every 
 other object in that array before you return from the function.
Or rather, that every other object has been acquired (locked to memory) by the context (the calling function). So, it is quite doable, but you have to use ideas used for proving program correctness, which is a difficult topic.
Jan 08
prev sibling parent welkam <wwwelkam gmail.com> writes:
On Sunday, 6 January 2019 at 08:52:53 UTC, Walter Bright wrote:
 On 1/5/2019 11:58 PM, Manu wrote:
 If we can help the GC in common code, but still have the 
 expressive
 power we need to elide that extra work where we know it's not
 necessary, then it might be reasonable to consider.
If we could do that, we wouldn't need GC, the compiler could figure out the last use of each pointer. With all the research into GCs, nobody has ever figured that out.
Don't think of it as all or nothing. With dip1000, the compiler knows the lifetime of some objects and whether they have a reference that outlives the object's scope. With this information the compiler can replace some GC allocations with stack or specialized malloc allocations, with deallocation at the end of the scope. This simple addition to the compiler could increase the performance of GC code by a lot in poorly written code.

With some mentoring I could even implement it myself. It doesn't sound that hard (famous last words), since lifetime tracking is already implemented.
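A manual version of this transformation already exists in D: marking a local class reference `scope` makes the compiler allocate the object on the stack instead of the GC heap. The proposal is essentially to apply this automatically wherever dip1000 proves the reference doesn't escape:

```d
// `scope` on a local class reference makes the compiler allocate the
// object on the stack instead of the GC heap -- the manual form of the
// automatic transformation suggested above.
class Point
{
    int x, y;
    this(int x, int y) { this.x = x; this.y = y; }
}

int sumOfCoords()
{
    scope p = new Point(3, 4); // stack-allocated; no GC allocation
    return p.x + p.y;          // p must not escape this function
}

void main()
{
    assert(sumOfCoords() == 7);
}
```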
Jan 07
prev sibling parent Neia Neutuladh <neia ikeran.org> writes:
On Sat, 05 Jan 2019 23:58:55 -0800, Manu wrote:
 Could D insert such code, and also effectively elide it where @nogc is
 used (or inferred)?
Write barriers are how the GC maintains the graph of which memory points to which other memory, so it is very difficult to avoid them. You don't need to include them for writes to non-pointers. For injected runtime calls, this means any non-pointer. For mprotect(), it means the entire block of memory must be pointer-free.

You could use @nogc as a means to prove that you don't have to call the "pointer was written" code multiple times for the same address. The compiler has to prove that it's the same address, that no other thread could have written to that data, and that the GC couldn't have run in between. But D as is can't prove any of that.
Jan 06
prev sibling parent reply Russel Winder <russel winder.org.uk> writes:
On Sat, 2019-01-05 at 22:54 -0800, Walter Bright via Digitalmars-d wrote:
[…]
 The fact that Java/Go/etc. use inserted write gates suggest the speed
 hasn't improved, which leaves eating the cost.
[…]

Java GC continues to change and improve in leaps and bounds, both speed and latency. And indeed a lack of "stop the world time". The G1 GC that was seen as the very best there could be two Java versions ago has been surpassed again with Shenandoah. JVM GC people just keep coming up with ways of killing off GC cost.

-- 
Russel.
===========================================================
Dr Russel Winder      t: +44 20 7585 2200
41 Buckmaster Road    m: +44 7770 465 077
London SW11 1EN, UK   w: www.russel.org.uk
Jan 06
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 1/6/2019 3:32 AM, Russel Winder wrote:
 Java GC continues to change and improve in leaps and bounds, both speed and
 latency. And indeed a lack of "stop the world time". The G1 GC that was seen
 as the very best there could be two Java versions ago has been surpassed again
 with Shenandoah. JVM GC people just keep coming up with ways of killing off GC
 cost.
1. Java has a very constrained interface to C, with a lot of rules to follow, mainly so the GC is not corrupted. D, being a systems programming language, simply does not have that luxury. 2. Let me know when Java lets go of write barriers!
Jan 06
next sibling parent Russel Winder <russel winder.org.uk> writes:
On Sun, 2019-01-06 at 13:28 -0800, Walter Bright via Digitalmars-d wrote:
[…]
 1. Java has a very constrained interface to C, with a lot of rules to
 follow, mainly so the GC is not corrupted. D, being a systems
 programming language, simply does not have that luxury.
I understand that the Java language and its heap are very different from D, so the context and constraints are different. I am assuming Go is the same, though a priori I would think less different to D than Java. My concern (if it can be called that) is that the JVM folk and the Go folk spend a lot of time finding new and/or improved GC algorithms, whereas in D there appears to be no movement from an old and simple GC.
 2. Let me know when Java lets go of write barriers!
G1 GC certainly avoids write barriers whenever it can. I do not know the details, but there are conference papers out there looking at avoiding write barriers at all costs in the newer GC algorithms.

-- 
Russel.
===========================================================
Dr Russel Winder      t: +44 20 7585 2200
41 Buckmaster Road    m: +44 7770 465 077
London SW11 1EN, UK   w: www.russel.org.uk
Jan 07
prev sibling parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Mon, Jan 07, 2019 at 06:59:10PM +0000, Russel Winder via Digitalmars-d wrote:
 On Sun, 2019-01-06 at 13:28 -0800, Walter Bright via Digitalmars-d wrote:
 
[…]
 1. Java has a very constrained interface to C, with a lot of rules
 to follow, mainly so the GC is not corrupted. D, being a systems
 programming language, simply does not have that luxury.
I understand that the Java language and its heap are very different from D, so the context and constraints are different. I am assuming Go is the same, though a priori I would think less different to D than Java. My concern (if it can be called that) is that the JVM folk and the Go folk spend a lot of time finding new and/or improved GC algorithms, whereas in D there appears to be no movement from an old and simple GC.
IOW it's more of a matter of perception than an actual technical issue then? And just FYI, there *have* been gradual improvements to D's GC over the years, even if they are not ground-breaking ones.

In any case, as Walter said, Java / Go / similar languages impose quite a lot of restrictions on what the programmer can do, which gives the language implementors many more assumptions they can work with. D pretty much gives the programmer free rein, except perhaps under @safe, but even then the restrictions are pretty minimal and nowhere near the managed languages.

This leaves the language implementors very little room for maneuver, meaning that any prospective GC algorithm for D is constrained by being unable to make the same assumptions that a Java or Go GC can take for granted -- in fact, unable to make many assumptions at all. Naturally, this also rules out a lot of algorithms that could otherwise be applicable, and makes finding a significantly improved algorithm a very challenging task. (And that's assuming said significantly improved algorithm exists at all, under D's operating conditions.)
 2. Let me know when Java lets go of write barriers!
G1 GC certainly avoids write barriers whenever it can.
The stickler is in "whenever it can". Currently, Walter is operating under the premise of "no write barriers *at all*". Whether or not this is a tenable stance as far as GC improvement is concerned remains to be seen, but being able to fall back to write barriers occasionally is a completely different ballpark from having *zero* write barriers, *anywhere*. It may sound like a subtle difference, but it makes a world of difference as far as GC algorithms are concerned.
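[For readers unfamiliar with the term: a write barrier is a small piece of code the compiler emits alongside every pointer store so the GC can track cross-region references. A minimal sketch of one common scheme (card marking); all names here are invented for illustration, this is not druntime code:]

```d
// Hypothetical card-marking write barrier sketch (illustrative only).
enum cardShift = 9;                 // 512-byte "cards", a common choice
__gshared bool[1 << 20] cardTable;  // one dirty flag per card

void writeBarrier(void* containerAddr)
{
    // Mark the card holding the modified object dirty, so a minor
    // collection rescans only dirty cards, not the whole old heap.
    cardTable[(cast(size_t)containerAddr >> cardShift) % cardTable.length] = true;
}

// A barrier-emitting compiler would rewrite every pointer store
//     obj.field = p;
// into roughly
//     writeBarrier(cast(void*)obj); obj.field = p;
// That per-store cost is the overhead being debated in this thread.
```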
 I do not know the details, but there are conference papers out there
 looking at avoiding write barriers at all costs in the newer GC
 algorithms.
[...] If there are any GC algorithms that require no write barriers at all and still offer the same benefits as a modern generational GC, I'm sure we'll be jumping all over it in no time at all! Assuming that it doesn't make other assumptions incompatible with D, that is. T -- The best compiler is between your ears. -- Michael Abrash
Jan 07
prev sibling next sibling parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Sat, Jan 05, 2019 at 11:12:52PM +0000, Neia Neutuladh via Digitalmars-d
wrote:
 On Sat, 05 Jan 2019 14:05:19 -0800, Manu wrote:
 I'm somewhere between a light GC user and a @nogc user, and I don't
 really know much about where we're at, or much about
 state-of-the-art GC in general.
I use the GC unabashedly and only try to make sure I reuse memory when it's reasonably convenient. I've also looked into GC a bit.
I also use the GC freely, and only bother with GC optimization when my profiler shows that there's an actual problem. As I've said numerous times before, unless you're working on extremely time-sensitive code like real-time applications or 3D game engines, D's GC usually does not cause much noticeable difference -- unless you do something extreme like allocate tens of millions of small strings (or other small objects) per second, or allocate huge objects rapidly and expect unreferenced memory to be quickly reused.

And when the GC does start becoming a problem, there are often easy first-stab solutions that offer improvements that suffice for most cases, the most obvious one being calling GC.disable and then scheduling GC.collect yourself at more convenient times than when the default system might run collections. This requires the least code changes, and IME often yields quite good improvements.

Past that, the next most obvious one is to reduce GC pressure by reusing frequently-allocated objects -- standard advice for improving GC performance, y'know. This requires a bit more work, but should be obvious by identifying code hotspots with a profiler and examining which objects are being most frequently allocated. Use stdx.allocator to allocate from a pool instead, or just retain the object in a cache and reuse it the next time round, etc..

If the GC is still an issue after this, look into preallocating objects outside your main loop, so that you can control exactly when GC pauses happen. Again, standard advice for GC optimization.

Past this point, if GC remains a big issue, you could start pulling out @nogc and using malloc/free, etc.. (Though I might do malloc/free in the previous step if I know there are big objects I'm gonna need, and they have straightforward lifetimes that are easy to track -- this reduces the size of the GC heap, and thereby also improves collection performance.)
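[The GC.disable / GC.collect approach described above can be sketched with the real core.memory API; the frame loop and collection interval are made up for illustration:]

```d
import core.memory : GC;

void processFrame(int frame)
{
    auto buf = new ubyte[](1024); // may allocate freely
}

void mainLoop()
{
    GC.disable();            // no automatic collections from here on
    scope(exit) GC.enable();

    foreach (frame; 0 .. 1_000)
    {
        processFrame(frame);
        // Collect at a point of our choosing, e.g. every N frames,
        // instead of mid-frame when a pause would be visible.
        if (frame % 100 == 0)
            GC.collect();
    }
}
```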
As far as improving the GC itself is concerned, it will surely be nice, and an overall win for D, and certainly we shouldn't delay on doing this. But I don't think it's a life-or-death problem that we must fix Right Here And Now Or Else(tm).
 How much truth is in here?
D uses a simple GC. Simple means easy to reason about. It's also got better tools to stress the GC less.
Yeah, I find that with my recent, more idiomatic D code, I tend to use ranges with lazy evaluation a lot more than plain arrays, meaning that what I'd normally allocate as arrays in equivalent C/C++ code, in D no allocation (or significantly less allocation) actually happens because of lazy evaluation. The one place where I still tend to allocate more would be string manipulation, but even here, using slices instead of copying substrings like in C/C++ still reduces the actual number of allocations.

The situation is significantly different from GC-heavy languages like Java, where basically every non-trivial type requires an allocation with few or no alternatives or ways of bypassing it. The absence of by-value aggregates in Java causes your average code to allocate far more than the equivalent D code, not to mention the heavy OO emphasis often leads to many indirections like interfaces and vtables, which often also entail allocating adaptor objects, wrappers, etc.. So in Java, GC performance would play a much larger role in overall performance, whereas in D, the prevalence of by-value types like ranges, and passing small parcels of information as structs rather than classes, makes the GC pressure much lower, so GC performance is not as significant.

(Of course, it's possible for Java compilers to optimize away some allocations if object lifetimes can be statically determined -- I don't know if any Java compilers actually do this. Still, Java code does tend to be more allocation-heavy than typical D code.)
 But one thing it could get that would be interesting is a thread-local
 GC.
Yes, yes, and yes! This would allow per-thread segregation of the GC heap, which would allow much better control of GC pauses.
 (You'd have a stop-the-world phase that happened infrequently.
 Casting something to shared would pin it in the thread it was
 allocated from.)
How does this solve the problem of shared, though? The last time I checked, casting to/from shared is the main showstopper for a thread-local GC. [...]
 Is progress possible, or is the hard reality that the language is
 just designed such to be resistant to a quality GC, while the
 ecosystem sadly tends to rely on it?
Three things about D make it harder to make a good GC for it:

* unaligned pointers
* unions
* externally allocated memory (malloc and friends)

We've pretty much addressed malloc by telling people to manually add and remove malloced memory from what the GC scans. A union is pretty much just a pointer that might not be valid. Unaligned pointers just kind of suck.
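[The malloc workaround mentioned above is done with the real core.memory API; a minimal sketch:]

```d
import core.memory : GC;
import core.stdc.stdlib : free, malloc;

void example()
{
    enum n = 100;
    // Allocate outside the GC heap...
    void* p = malloc(n * (void*).sizeof);
    // ...but register the block so the GC scans it, in case it
    // holds pointers to GC-allocated objects.
    GC.addRange(p, n * (void*).sizeof);
    scope(exit)
    {
        GC.removeRange(p); // must come before free
        free(p);
    }
    // ... use p ...
}
```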
Unaligned pointers are generally just a bad idea IMO. I'm tempted to say it should be defined as UB, along with obscured pointers (like the doubly-linked list with 1 pointer per node trick using XOR). Maybe with a compiler / runtime switch to enable a more conservative GC.

Now that I think of it, we could deal with pointers in unions the same way -- if the compiler detects it, then trigger conservative mode in the GC. With these two out of the way, a generational GC for D seems closer to the realm of possibility.

T -- "How are you doing?" "Doing what?"
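[The XOR-linked-list trick mentioned above shows why obscured pointers defeat a tracing GC: the stored value is not a valid address, so a scanner can never discover the neighbours. A minimal sketch, illustrative only:]

```d
// XOR linked list: each node stores prev ^ next in a single field.
// A GC scanning this struct sees `link`, which is almost never a
// valid heap address -- so both neighbours look unreachable to it.
struct XNode
{
    int value;
    size_t link; // cast(size_t)prev ^ cast(size_t)next
}

XNode* next(XNode* cur, XNode* prev)
{
    // The next node is recoverable only if you already know the
    // previous one; the GC has no such context.
    return cast(XNode*)(cur.link ^ cast(size_t)prev);
}
```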
Jan 05
next sibling parent Nicholas Wilson <iamthewilsonator hotmail.com> writes:
On Sunday, 6 January 2019 at 04:34:30 UTC, H. S. Teoh wrote:
 On Sat, Jan 05, 2019 at 11:12:52PM +0000, Neia Neutuladh via
 But one thing it could get that would be interesting is a 
 thread-local GC.
Yes, yes, and yes! This would allow per-thread segregation of the GC heap, which would allow much better control of GC pauses.
 (You'd have a stop-the-world phase that happened infrequently. 
 Casting something to shared would pin it in the thread it was 
 allocated from.)
How does this solve the problem of shared, though? The last time I checked, casting to/from shared is the main showstopper for a thread-local GC.
The recent shared thread by Manu was trying (only somewhat successfully) to tie shared to @safe/@trusted with an inductive model, i.e. the user writes a @trusted foundational layer and safety is verified from there on up the call stack. I wonder if that sort of a model would safely permit a thread-local GC?
 Three things about D make it harder to make a good GC for it:
 * unaligned pointers
 * unions
 * externally allocated memory (malloc and friends)
 
 We've pretty much addressed malloc by telling people to 
 manually add and remove malloced memory from what the GC 
 scans. A union is pretty much just a pointer that might not be 
 valid. Unaligned pointers just kind of suck.
Unaligned pointers are generally just a bad idea IMO. I'm tempted to say it should be defined as UB, along with obscured pointers (like the doubly-linked list with 1 pointer per node trick using XOR). Maybe with a compiler / runtime switch to enable a more conservative GC. Now that I think of it, we could deal with pointers in unions the same way -- if the compiler detects it, then trigger conservative mode in the GC. With these two out of the way, a generational GC for D seems closer to the realm of possibility.
With a precise GC that knows the types/layout of what it's dealing with, it ought to be possible to have a way to retrieve all the pointers from a type for scanning. For an inactive union that would be returning nothing. It could possibly work for an XOR list as well, not that I think we should be letting that kind of monstrosity dictate our designs. Ditto for unaligned pointers.
Jan 05
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 1/5/2019 8:34 PM, H. S. Teoh wrote:
 Of course, it's possible for Java compilers to optimize away some
 allocations if object lifetimes can be statically determined -- I don't
 know if any Java compilers actually do this.
They do. I forgot the jargon term for it, but they'll determine if the lifetime does not escape the function scope, and allocate it on the stack instead of the heap.
Jan 05
parent reply Nicholas Wilson <iamthewilsonator hotmail.com> writes:
On Sunday, 6 January 2019 at 06:57:40 UTC, Walter Bright wrote:
 On 1/5/2019 8:34 PM, H. S. Teoh wrote:
 Of course, it's possible for Java compilers to optimize away 
 some
 allocations if object lifetimes can be statically determined 
 -- I don't
 know if any Java compilers actually do this.
They do. I forgot the jargon term for it, but they'll determine if the lifetime does not escape the function scope, and allocate it on the stack instead of the heap.
Heap to stack promotion? BTW LDC does this https://github.com/ldc-developers/ldc/blob/master/gen/passes/GarbageCollect2Stack.cpp
Jan 05
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 1/5/2019 11:31 PM, Nicholas Wilson wrote:
 Heap to stack promotion? BTW LDC does this 
https://github.com/ldc-developers/ldc/blob/master/gen/passes/GarbageCollect2Stack.cpp
I suppose that's as good a word as any! I just remember SROA as a candidate for the most pretentious, obtuse compiler acronym ever, right behind SFINAE.
Jan 06
parent reply Nicholas Wilson <iamthewilsonator hotmail.com> writes:
On Sunday, 6 January 2019 at 08:55:31 UTC, Walter Bright wrote:
 On 1/5/2019 11:31 PM, Nicholas Wilson wrote:
 Heap to stack promotion? BTW LDC does this 
 https://github.com/ldc-developers/ldc/blob/master/gen/passes/GarbageCollect2Stack.cpp
I suppose that's as good a word as any! I just remember SROA as a candidate for the most pretentious, obtuse compiler acronym ever, right behind SFINAE.
SROA is at least intuitive in what it does and makes sense when you consider padding; SFINAE OTOH, I know what it stands for, but I have no idea what it does or why. Mind you, that's par for the course for C++.
Jan 06
parent Walter Bright <newshound2 digitalmars.com> writes:
On 1/6/2019 7:05 PM, Nicholas Wilson wrote:
 SROA is at least intuitive in what is does and makes sense when you consider 
 padding,
I would have called something like Aggregate Slicing, or simply Slice-O-Matic :-)
 SFINAE OTOH, I know what it stands for, I have no idea what it does or 
 why.
SFINAE => Substitution Failure Is Not An Error

It means if a function template instantiation does not produce compilable code, the function is (silently) not instantiated. In D we do this with template constraints.
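[Walter's last sentence can be illustrated with a small template-constraint example; the function name is made up for illustration:]

```d
import std.range : isInputRange, front;

// This overload exists only for types satisfying the constraint; for
// anything else, overload resolution simply skips it -- much as SFINAE
// silently discards a failed substitution in C++, but with a readable
// boolean condition instead of a substitution trick.
auto firstElement(R)(R r)
    if (isInputRange!R)
{
    return r.front;
}

// firstElement([1, 2, 3]) compiles; firstElement(42) is rejected with
// an error saying no overload matches the constraint.
```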
Jan 06
prev sibling parent reply Steven Schveighoffer <schveiguy gmail.com> writes:
On 1/5/19 6:12 PM, Neia Neutuladh wrote:
 On Sat, 05 Jan 2019 14:05:19 -0800, Manu wrote:
 Is progress possible, or is the hard reality that the language is just
 designed such to be resistant to a quality GC, while the ecosystem sadly
 tends to rely on it?
Three things about D make it harder to make a good GC for it: * unaligned pointers
Unaligned pointers are not scanned, and so this is not a problem. As far as the GC is concerned, they aren't pointers. Perhaps you meant interior pointers?
 * unions
Unions and void * (or essentially untyped blocks) are the biggest problems. But depending on what we want to implement, we may have solutions for that.

It's difficult to paint with broad brushes here. There are certain problems that prevent certain solutions. But, for instance, a precise GC is getting traction as we speak: https://github.com/dlang/druntime/pull/2418

But precise GC is not necessarily better performing. And it's still "stop-the-world". I think the biggest improvement for D's GC could be the thread-local pools (and avoid stop-the-world for those), but since casting shared and local data around is valid, I don't know how it can be implemented without language changes.

There are always extra heuristics and algorithmic improvements that people can analyze in the current GC as well.

-Steve
Jan 07
parent reply Neia Neutuladh <neia ikeran.org> writes:
On Mon, 07 Jan 2019 10:21:58 -0500, Steven Schveighoffer wrote:
 On 1/5/19 6:12 PM, Neia Neutuladh wrote:
 On Sat, 05 Jan 2019 14:05:19 -0800, Manu wrote:
 Is progress possible, or is the hard reality that the language is just
 designed such to be resistant to a quality GC, while the ecosystem
 sadly tends to rely on it?
Three things about D make it harder to make a good GC for it: * unaligned pointers
Unaligned pointers are not scanned, and so this is not a problem. As far as the GC is concerned, they aren't pointers.
In that case, it's already solved! And in pretty much the same way as malloc'd memory. (I was going to check but was overcome with a bout of laziness.)
 * unions
Unions and void * (or essentially untyped blocks) are the biggest problems. But depending on what we want to implement, we may have solutions for that. It's difficult to paint with broad brushes here. There are certain problems that prevent certain solutions. But, for instance, a precise GC is getting traction as we speak: https://github.com/dlang/druntime/pull/2418
That's a more precise GC instead of a fully precise GC. Unions and void[] mean you can't have a fully precise GC.
 But precise GC is not necessarily better performing. And it's still
 "stop-the-world".
Right. Precise means you collect more data more often, and it means you use up cache lines for pointer maps. It might also fragment your heap more (each allocation needs a pointer map, or maybe you have different pools for sufficiently common things like all pointers or no pointers).
 I think the biggest improvement for D's GC could be the thread-local
 pools (and avoid stop-the-world for those), but since casting shared and
 local data around is valid, I don't know how it can be implemented
 without language changes.
Like I said before, casting to shared would have to invoke a runtime function. That's a runtime change and compiler change, but I don't think the spec says anything about whether any cast might invoke a runtime function.
Jan 07
next sibling parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Mon, Jan 07, 2019 at 06:59:37PM +0000, Neia Neutuladh via Digitalmars-d
wrote:
 On Mon, 07 Jan 2019 10:21:58 -0500, Steven Schveighoffer wrote:
[...]
 I think the biggest improvement for D's GC could be the thread-local
 pools (and avoid stop-the-world for those), but since casting shared
 and local data around is valid, I don't know how it can be
 implemented without language changes.
Like I said before, casting to shared would have to invoke a runtime function. That's a runtime change and compiler change, but I don't think the spec says anything about whether any cast might invoke a runtime function.
Casting between arrays of different types does already call a druntime function in some cases (e.g., casting int[] to S[] where S is some struct). So I'd say it's fair game. T -- Only boring people get bored. -- JM
Jan 07
prev sibling parent reply Steven Schveighoffer <schveiguy gmail.com> writes:
On 1/7/19 1:59 PM, Neia Neutuladh wrote:
 On Mon, 07 Jan 2019 10:21:58 -0500, Steven Schveighoffer wrote:
 On 1/5/19 6:12 PM, Neia Neutuladh wrote:
 On Sat, 05 Jan 2019 14:05:19 -0800, Manu wrote:
 Is progress possible, or is the hard reality that the language is just
 designed such to be resistant to a quality GC, while the ecosystem
 sadly tends to rely on it?
Three things about D make it harder to make a good GC for it: * unaligned pointers
Unaligned pointers are not scanned, and so this is not a problem. As far as the GC is concerned, they aren't pointers.
In that case, it's already solved! And in pretty much the same way as malloc'd memory. (I was going to check but was overcome with a bout of laziness.)
 * unions
Unions and void * (or essentially untyped blocks) are the biggest problems. But depending on what we want to implement, we may have solutions for that. It's difficult to paint with broad brushes here. There are certain problems that prevent certain solutions. But, for instance, a precise GC is getting traction as we speak: https://github.com/dlang/druntime/pull/2418
That's a more precise GC instead of a fully precise GC. Unions and void[] mean you can't have a fully precise GC.
Yes, well said. It's more precise, and a fully precise GC is not possible due to how D works.
 
 But precise GC is not necessarily better performing. And it's still
 "stop-the-world".
Right. Precise means you collect more data more often, and it means you use up cache lines for pointer maps. It might also fragment your heap more (each allocation needs a pointer map, or maybe you have different pools for sufficiently common things like all pointers or no pointers).
The pointer maps are static, and so they are stored with typeinfo I believe. We already store a typeinfo pointer for everything but low-level allocations and basic types (for destruction). There may be tweakable cases where it's OK to be imprecise to achieve better speed, or have a "small pointer map optimization".
 
 I think the biggest improvement for D's GC could be the thread-local
 pools (and avoid stop-the-world for those), but since casting shared and
 local data around is valid, I don't know how it can be implemented
 without language changes.
Like I said before, casting to shared would have to invoke a runtime function. That's a runtime change and compiler change, but I don't think the spec says anything about whether any cast might invoke a runtime function.
This is doable. Casting to/from shared should be a rare thing, and probably fine to hook. However, the problem is more (for me) that the memory would have to be copied, as the memory is currently in one pool and has to be moved to another pool.

It's better probably to just allocate it shared to begin with, or copy it if you need to "cast" to/from shared. Some notion of tail-modified type constructors would be really useful here.

-Steve
Jan 07
parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Mon, Jan 07, 2019 at 02:33:56PM -0500, Steven Schveighoffer via
Digitalmars-d wrote:
 On 1/7/19 1:59 PM, Neia Neutuladh wrote:
[...]
 Like I said before, casting to shared would have to invoke a runtime
 function. That's a runtime change and compiler change, but I don't
 think the spec says anything about whether any cast might invoke a
 runtime function.
 
This is doable. Casting to/from shared should be a rare thing, and probably fine to hook. However, the problem is more (for me) that the memory would have to be copied, as the memory is currently in one pool and has to be moved to another pool.
[...] I think what Neia has in mind, which he mentioned in another post, is to have the druntime function simply pin the object, whichever pool it belongs to. The idea being to tell the respective thread-local GC "this object might be referenced by another thread, do not collect or move". T -- "The number you have dialed is imaginary. Please rotate your phone 90 degrees and try again."
Jan 07
next sibling parent reply Steven Schveighoffer <schveiguy gmail.com> writes:
On 1/7/19 2:52 PM, H. S. Teoh wrote:
 On Mon, Jan 07, 2019 at 02:33:56PM -0500, Steven Schveighoffer via
Digitalmars-d wrote:
 On 1/7/19 1:59 PM, Neia Neutuladh wrote:
[...]
 Like I said before, casting to shared would have to invoke a runtime
 function. That's a runtime change and compiler change, but I don't
 think the spec says anything about whether any cast might invoke a
 runtime function.
This is doable. Casting to/from shared should be a rare thing, and probably fine to hook. However, the problem is more (for me) that the memory would have to be copied, as the memory is currently in one pool and has to be moved to another pool.
[...] I think what Neia has in mind, which he mentioned in another post, is to have the druntime function simply pin the object, whichever pool it belongs to. The idea being to tell the respective thread-local GC "this object might be referenced by another thread, do not collect or move".
Imagine I have a shared piece of data, then I cast it to thread local. Now, it is able to point at thread local data, but it lives in the shared heap. Pinned or not, this means a local collection must scan the shared heap to see if it can collect any thread-local data. This defeats the purpose of having a thread-local heap in the first place (which you should be able to scan without having to stop-the-world). -Steve
Jan 07
parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Mon, Jan 07, 2019 at 03:28:26PM -0500, Steven Schveighoffer via
Digitalmars-d wrote:
 On 1/7/19 2:52 PM, H. S. Teoh wrote:
[...]
 I think what Neia has in mind, which he mentioned in another post,
 is to have the druntime function simply pin the object, whichever
 pool it belongs to.  The idea being to tell the respective
 thread-local GC "this object might be referenced by another thread,
 do not collect or move".
Imagine I have a shared piece of data, then I cast it to thread local. Now, it is able to point at thread local data, but it lives in the shared heap. Pinned or not, this means a local collection must scan the shared heap to see if it can collect any thread-local data. This defeats the purpose of having a thread-local heap in the first place (which you should be able to scan without having to stop-the-world).
[...] Hmph, good point. That sux. :-( I suppose that's where copying/moving the object comes in -- migrate it to a different pool somehow so that we can avoid stop-the-world. T -- They pretend to pay us, and we pretend to work. -- Russian saying
Jan 07
parent reply Neia Neutuladh <neia ikeran.org> writes:
On Mon, 07 Jan 2019 13:06:22 -0800, H. S. Teoh wrote:
 Hmph, good point.  That sux. :-(  I suppose that's where copying/moving
 the object comes in -- migrate it to a different pool somehow so that we
 can avoid stop-the-world.
class Node
{
    enum Type { leaf, intermediate }
    Type type;
    union
    {
        Node child;
        ulong data;
    }
}

auto node = new Node;
node.type = Node.Type.leaf;
node.data = cast(ulong)cast(void*)node;

How do you copy this?

Pinning avoids those problems, but it still involves scanning memory as if doing a collection.
Jan 07
parent reply Steven Schveighoffer <schveiguy gmail.com> writes:
On 1/7/19 4:57 PM, Neia Neutuladh wrote:
 On Mon, 07 Jan 2019 13:06:22 -0800, H. S. Teoh wrote:
 Hmph, good point.  That sux. :-(  I suppose that's where copying/moving
 the object comes in -- migrate it to a different pool somehow so that we
 can avoid stop-the-world.
class Node
{
    enum Type { leaf, intermediate }
    Type type;
    union
    {
        Node child;
        ulong data;
    }
}

auto node = new Node;
node.type = Node.Type.leaf;
node.data = cast(ulong)cast(void*)node;

How do you copy this?
You copy the memory, then update the references to that memory. Or you make the user explicitly copy, and disallow the cast.
 Pinning avoids those problems, but it still involves scanning memory as if
 doing a collection.
Even if you pin, if you cast the above from shared to unshared, you then have made it so the thread-local pool cannot be scanned independently. If we get thread-local pools, and all scans have to be done on all thread local pools along with the shared pool, we haven't gained anything. -Steve
Jan 08
parent reply Neia Neutuladh <neia ikeran.org> writes:
On Tue, 08 Jan 2019 09:21:23 -0500, Steven Schveighoffer wrote:
 You copy the memory, then update the references to that memory.
Again, in the case of unions and void[], how do you determine if something is a reference to that memory instead of a false pointer? Moving collectors *have* to be fully precise, and D doesn't allow a fully precise collector.
 Even if you pin, if you cast the above from shared to unshared, you then
 have made it so the thread-local pool cannot be scanned independently.
Because the reference to an object in the current thread might be stored inside an object allocated by a different thread's GC, and that might have happened without a cast. Fixing that would require redesigning shared.
 If we get thread-local pools, and all scans have to be done on all
 thread local pools along with the shared pool, we haven't gained
 anything.
We can shorten the stop-the-world phase to the size of shared memory. If you're only sharing a little data and have a decent number of threads, this would be a big improvement.
Jan 08
parent reply Ola Fosheim =?UTF-8?B?R3LDuHN0YWQ=?= <ola.fosheim.grostad gmail.com> writes:
On Tuesday, 8 January 2019 at 18:04:26 UTC, Neia Neutuladh wrote:
 We can shorten the stop-the-world phase to the size of shared 
 memory. If you're only sharing a little data and have a decent 
 number of threads, this would be a big improvement.
Pointer storage isn't limited to shared memory?
Jan 08
parent Neia Neutuladh <neia ikeran.org> writes:
On Wed, 09 Jan 2019 00:24:14 +0000, Ola Fosheim Grøstad wrote:
 On Tuesday, 8 January 2019 at 18:04:26 UTC, Neia Neutuladh wrote:
 We can shorten the stop-the-world phase to the size of shared memory.
 If you're only sharing a little data and have a decent number of
 threads, this would be a big improvement.
Pointer storage isn't limited to shared memory?
1. Stop all the threads.
2. Scan shared memory: everything that's been allocated as shared, everything that's been explicitly casted to shared, and everything reachable from those objects. Use that to mark things in the current thread's local GC heap as reachable.
3. Resume every thread but this one.
4. Run mark/sweep on the current thread's heap.
5. Resume this thread.

If you have a small amount of shared memory, the stop-the-world phase would be pretty short.

As an alternative, this could hijack all the threads to run a parallel mark/sweep, where each individual thread can resume as soon as its own local mark/sweep finishes. If you have a worker thread that deals with very few pointers, it can resume very quickly after the shared mark/sweep phase.
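[The five steps above might be sketched in pseudocode; every type and function name here is hypothetical, no such interface exists in druntime:]

```d
// Pseudocode sketch of a per-thread collection with a short
// stop-the-world phase, following the numbered steps above.
void localCollect(ThreadHeap myHeap, SharedHeap sharedHeap)
{
    stopAllThreads();                        // 1. stop the world
    auto roots = sharedHeap.scanForLocalRefs(myHeap); // 2. find local
    myHeap.markReachable(roots);             //    objects shared memory
                                             //    can still reach
    resumeOtherThreads();                    // 3. world restarts early
    myHeap.markAndSweep();                   // 4. purely local work,
                                             //    no other thread waits
}                                            // 5. this thread resumes
```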
Jan 08
prev sibling parent reply Neia Neutuladh <neia ikeran.org> writes:
On Mon, 07 Jan 2019 11:52:18 -0800, H. S. Teoh wrote:
 I think what Neia has in mind, which he mentioned in another post, is to
 have the druntime function simply pin the object, whichever pool it
 belongs to.  The idea being to tell the respective thread-local GC "this
 object might be referenced by another thread, do not collect or move".
She, not he.

Thinking about this some more, this strategy isn't as cheap or easy as I was hoping. Let's say you're using fearless:

    auto a = gcExclusive!Node;
    spawn({
        {
            auto tmp = a.lock;
            tmp.parent = new Object;
        }
        GC.threadLocal.collect();
        GC.threadLocal.minimize();
        {
            auto tmp = a.lock;
            writeln(tmp.parent.toString);
        }
    });

Where do we insert the runtime call to mark a.parent as shared? Fearless would have to:

1. Acquire the mutex
2. Tell the current thread's GC to add the locked object as a root
3. Run your code
4. Tell the thread's GC to pin everything it can find from `a` (by casting back to shared)
5. Release the mutex

This effectively incurs a mark (but not sweep) cycle every time you cast to shared. If it were just pinning one object, that would be a lot easier to sell.
Jan 07
parent reply Jonathan M Davis <newsgroup.d jmdavisprog.com> writes:
On Monday, January 7, 2019 1:48:04 PM MST Neia Neutuladh via Digitalmars-d 
wrote:
 On Mon, 07 Jan 2019 11:52:18 -0800, H. S. Teoh wrote:
 I think what Neia has in mind, which he mentioned in another post, is to
 have the druntime function simply pin the object, whichever pool it
 belongs to.  The idea being to tell the respective thread-local GC "this
 object might be referenced by another thread, do not collect or move".
She, not he.

Thinking about this some more, this strategy isn't as cheap or easy as I was hoping. Let's say you're using fearless:

    auto a = gcExclusive!Node;
    spawn({
        {
            auto tmp = a.lock;
            tmp.parent = new Object;
        }
        GC.threadLocal.collect();
        GC.threadLocal.minimize();
        {
            auto tmp = a.lock;
            writeln(tmp.parent.toString);
        }
    });

Where do we insert the runtime call to mark a.parent as shared? Fearless would have to:

1. Acquire the mutex
2. Tell the current thread's GC to add the locked object as a root
3. Run your code
4. Tell the thread's GC to pin everything it can find from `a` (by casting back to shared)
5. Release the mutex

This effectively incurs a mark (but not sweep) cycle every time you cast to shared. If it were just pinning one object, that would be a lot easier to sell.
Casting between shared and thread-local doesn't have to be free, but it cannot be expensive, because unless you're using atomics, the way that it's designed pretty much requires that you cast away shared (after locking a mutex to protect the object) in order to operate on it. So, if casting away shared is expensive, then code using shared gets screwed.

Also, there isn't anything stopping a programmer from assigning a thread-local reference to something inside an object that has had shared cast away, and there likely isn't any reason for the programmer to cast anything to shared in that scenario. They'd just get rid of the thread-local reference to the shared object before releasing the mutex, meaning that the runtime couldn't even see that what had been assigned to one of its internals had gone from being treated as thread-local to shared.

Having objects go from shared to thread-local in such code could be possible as well, if a reference is taken from the object which has had shared cast away and then stored as thread-local. In fact, that sort of scenario is pretty much exactly what would happen with something like a consumer-producer queue. Programmers dealing with shared need to be aware of such things and deal with them appropriately in order to avoid having problems (which is part of why casting away shared has to be @trusted), but it's perfectly legal so long as no shared object is ever treated as thread-local when it can actually be operated on by multiple threads, and no thread-local objects end up on multiple threads at the same time.

Given that sort of situation, I don't see how we can have the GC accurately track whether objects are thread-local or shared. Casting is just too blunt an instrument and allows too much.

- Jonathan M Davis
Jan 07
parent reply Neia Neutuladh <neia ikeran.org> writes:
On Mon, 07 Jan 2019 14:53:50 -0700, Jonathan M Davis wrote:
 Given that sort of situation, I don't see how we can have the
 GC accurately track whether objects are thread-local or shared. Casting
 is just too blunt an instrument and allows too much.
This is exactly the situation I brought up in the post you're replying to. I explained what the solution is in the post you just replied to. That was in fact the entire point of that post. It requires a number of careful steps. It's not automatic. It's *mostly* automatic, and you can wrap the rest in a library. But it's expensive and it makes it easy to write incorrect code.

To review:

When you cast to shared, the thread-local GC pins the thing as shared. It also finds every bit of its memory that you can reach from that object, the same way as if it were doing a mark/sweep collection, and *that* is also pinned as shared.

When you are modifying an object that's had shared cast away, you need to tell the GC not to run, or you need to pin objects-that-will-become-shared temporarily (for instance, by keeping local references to them).

When you are done modifying an object that's had shared cast away, you need to cast it back to shared.

This is not a good solution, so D isn't getting a thread-local GC that eliminates stop-the-world unless someone else comes up with a cleverer solution. On the other hand, it doesn't have a cost for casting away shared as such; casting shared(T) to const(T) is totally free.
Jan 07
next sibling parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Mon, Jan 07, 2019 at 10:27:15PM +0000, Neia Neutuladh via Digitalmars-d
wrote:
[...]
 To review:
 
 When you cast to shared, the thread-local GC pins the thing as shared.
 It also finds every bit of its memory that you can reach from that
 object, the same way as if it were doing a mark/sweep collection, and
 *that* is also pinned as shared.
 
 When you are modifying an object that's had shared cast away, you need
 to tell the GC not to run, or you need to pin
 objects-that-will-become-shared temporarily (for instance, by keeping
 local references to them).
This step is too easy to get wrong. Once you cast shared away, the type system can no longer help you identify it as shared (or "once was shared"). As far as the compiler is concerned, storing a pointer to a thread-local object into a shared object that has been cast to thread-local is identical to assigning one thread-local pointer to another. So you'll run into the problem Steven described, without any warning whatsoever.
 When you are done modifying an object that's had shared cast away, you
 need to cast it back to shared.
This is also too easy to miss, because under current D, such a cast is redundant and not done in practice. It also looks weird:

    class C { ... }
    shared(C) myObj;

    mutex.acquire();
    C tmp = cast(C) myObj;
    tmp.doStuff();
    myObj = cast(shared(C)) tmp; // people will not think to do this
    mutex.release();
 This is not a good solution, so D isn't getting a thread-local GC that
 eliminates stop-the-world unless someone else comes up with a cleverer
 solution. On the other hand, it doesn't have a cost for casting away
 shared as such; casting shared(T) to const(T) is totally free.
Yeah, shared is the fly in the ointment that prevents us from having a thread-local GC.

Perhaps there can be some way of detecting whether the code casts away shared? I mean, if your code never actually casts away shared, e.g. by using messaging or whatever for thread communication, then you won't run into this problem, and you could safely use a thread-local GC without any ill effects. Though all it takes is for *one* cast to exist *somewhere* and your code becomes memory-unsafe. So it cannot be enabled by default.

T

-- 
They say that "guns don't kill people, people kill people." Well I think the gun helps. If you just stood there and yelled BANG, I don't think you'd kill too many people. -- Eddie Izzard, Dressed to Kill
Jan 07
prev sibling parent reply Jonathan M Davis <newsgroup.d jmdavisprog.com> writes:
On Monday, January 7, 2019 3:27:15 PM MST Neia Neutuladh via Digitalmars-d 
wrote:
 On Mon, 07 Jan 2019 14:53:50 -0700, Jonathan M Davis wrote:
 Given that sort of situation, I don't see how we can have the
 GC accurately track whether objects are thread-local or shared. Casting
 is just too blunt an instrument and allows too much.
This is exactly the situation I brought up in the post you're replying to. I explained what the solution is in the post you just replied to. That was in fact the entire point of that post.
Well, then I clearly read over it way too quickly.
 It requires a number of careful steps. It's not automatic. It's *mostly*
 automatic, and you can wrap the rest in a library. But it's expensive and
 it makes it easy to write incorrect code.

 To review:

 When you cast to shared, the thread-local GC pins the thing as shared. It
 also finds every bit of its memory that you can reach from that object,
 the same way as if it were doing a mark/sweep collection, and *that* is
 also pinned as shared.

 When you are modifying an object that's had shared cast away, you need to
 tell the GC not to run, or you need to pin objects-that-will-become-shared
 temporarily (for instance, by keeping local references to them).

 When you are done modifying an object that's had shared cast away, you
 need to cast it back to shared.

 This is not a good solution, so D isn't getting a thread-local GC that
 eliminates stop-the-world unless someone else comes up with a cleverer
 solution. On the other hand, it doesn't have a cost for casting away
 shared as such; casting shared(T) to const(T) is totally free.
Yeah, such a solution wouldn't fly. Basically, you're talking about having to have a way to tell the GC that you're moving stuff between threads, and that would be so error prone that it's not even funny. It's already problematic enough to get code that deals with sharing data across threads right as it is.

If we could solve the forking problem on Windows so that we could actually have a cross-platform concurrent GC like the Linux one that Sociomantic has used, then that would likely give similar benefits (if not better) without having to muck with the type system. I forget exactly what the stop-the-world pause times were, but they were pretty low. If a thread really couldn't afford to be stopped at all, it would still need to be separate from the GC, just like now, but that would be true of a solution that involved thread-local heaps as well.

In any case, it's issues like these which definitely make it much harder to drastically improve D's GC. We have much worse constraints to work under than languages like Java, and we don't have the same kind of manpower trying to improve the situation. But at least D is set up in a way that works quite well with minimizing heap allocations - especially with the idioms that are typical in idiomatic D.

- Jonathan M Davis
Jan 07
next sibling parent reply Paulo Pinto <pjmlp progtools.org> writes:
On Tuesday, 8 January 2019 at 04:34:11 UTC, Jonathan M Davis 
wrote:
 On Monday, January 7, 2019 3:27:15 PM MST Neia Neutuladh via 
 Digitalmars-d wrote:
 On Mon, 07 Jan 2019 14:53:50 -0700, Jonathan M Davis wrote:
 Given that sort of situation, I don't see how we can have the
 GC accurately track whether objects are thread-local or 
 shared. Casting
 is just too blunt an instrument and allows too much.
This is exactly the situation I brought up in the post you're replying to. I explained what the solution is in the post you just replied to. That was in fact the entire point of that post.
Well, then I clearly read over it way too quickly.
 It requires a number of careful steps. It's not automatic. 
 It's *mostly* automatic, and you can wrap the rest in a 
 library. But it's expensive and it makes it easy to write 
 incorrect code.

 To review:

 When you cast to shared, the thread-local GC pins the thing as 
 shared. It also finds every bit of its memory that you can 
 reach from that object, the same way as if it were doing a 
 mark/sweep collection, and *that* is also pinned as shared.

 When you are modifying an object that's had shared cast away, 
 you need to tell the GC not to run, or you need to pin 
 objects-that-will-become-shared temporarily (for instance, by 
 keeping local references to them).

 When you are done modifying an object that's had shared cast 
 away, you need to cast it back to shared.

 This is not a good solution, so D isn't getting a thread-local 
 GC that eliminates stop-the-world unless someone else comes up 
 with a cleverer solution. On the other hand, it doesn't have a 
 cost for casting away shared as such; casting shared(T) to 
 const(T) is totally free.
Yeah, such a solution wouldn't fly. Basically, you're talking about having to have a way to tell the GC that you're moving stuff between threads, and that would be so error prone that it's not even funny. It's already problematic enough to get code that deals with sharing data across threads right as it is. If we could solve the forking problem on Windows so that we could actually have a cross-platform concurrent GC like the Linux one that Sociomantic has used, then that would likely give similar benefits (if not better) without having to muck with the type system. I forget exactly what the stop-the-world pause times were, but they were pretty low. If a thread really couldn't afford to be stopped at all, it would still need to be separate from the GC, just like now, but that would be true of a solution that involved thread-local heaps as well. In any case, it's issues like these which definitely make it much harder to drastically improve D's GC. We have much worse constraints to work under than languages like Java, and we don't have the same kind of manpower trying to improve the situation. But at least D is set up in a way that works quite well with minimizing heap allocations - especially with the idioms that are typical in idiomatic D. - Jonathan M Davis
Yet all the GC-enabled languages happened to beat Swift, with its reference counting, on the implementation of a high-performance userspace network driver:

"Safe and Secure Drivers in High-Level Languages: How to write PCIe drivers in Rust, Go, C#, Swift, Haskell, and OCaml"
https://media.ccc.de/v/35c3-9670-safe_and_secure_drivers_in_high-level_languages

The professor that organized the research thesis didn't even consider D for the project, and he had plenty to choose from.

And MIT is doing research on writing POSIX kernels in Go: https://github.com/mit-pdos/biscuit

Meanwhile, C# now has all the nice features from Project Midori for low-level programming, and is getting all the love from game developers fed up with C++, including some very well known ones.

The theme that D's GC cannot be improved won't take the language very far.
Jan 08
parent reply =?UTF-8?B?TcOhcmNpbw==?= Martins <marcioapm gmail.com> writes:
On Tuesday, 8 January 2019 at 08:44:35 UTC, Paulo Pinto wrote:
 The theme that D's GC cannot be improved won't take the 
 language very far.
I share the same sentiment. Every GC conversation ends up being about the performance of write barriers, shared, comparisons with other languages, or some other argument that doesn't move the bar at all.

Is the current implementation of simple MS not improvable in any way?
Jan 08
parent Reimer Behrends <behrends gmail.com> writes:
On Tuesday, 8 January 2019 at 09:31:09 UTC, Márcio Martins wrote:
 Is the current implementation of simple MS not improvable in 
 any way?
I'd say it is. I've written up a small benchmark here:

https://gist.github.com/rbehrends/3d4ab87d34e79722272189abc68eee00

Options:

* D with its native GC, compiled with ldc 1.13.0.
* jemalloc, version 5.1.0.
* The Boehm GC, version 8.0.2 with sequential marking.
* The Boehm GC with parallel marking.
* The Boehm GC using incremental GC.
* The Boehm GC with explicit deallocation instead of GC.
* System malloc on macOS 10.12.6.

This is a version of the "binary trees" benchmark [1], as it was the one allocation-oriented benchmark readily available and easily portable between languages. Because it is a very specialized benchmark (it allocates a *lot* of small fixed-size objects), I'll caution against generalizing its results, but it can give us some insights all the same.

Here's a sample run with several GC options enabled, run on a 2.6GHz Core i7 with four cores; code that doesn't do allocation/collection (i.e. checksum()) accounts for less than two seconds of runtime. Most actual tuned non-functional programs will spend only 10%-20% of their time doing allocation and collection, so this is not terribly representative of real applications and exaggerates differences in performance.
$ make benchmark DEPTH=21
# D garbage collector
/usr/bin/time ./btree-d 21 >/dev/null
        34.24 real        33.96 user         0.21 sys
# jemalloc explicit malloc()/free()
/usr/bin/time ./btree-jemalloc 21 >/dev/null
        21.62 real        21.35 user         0.22 sys
# Boehm GC with four parallel marker threads
GC_MARKERS=4 /usr/bin/time ./btree-gc 21 >/dev/null
         9.39 real        12.02 user         0.18 sys
# Boehm GC with single-threaded marking
GC_MARKERS=1 /usr/bin/time ./btree-gc 21 >/dev/null
        11.42 real        11.34 user         0.06 sys
# Boehm GC with explicit deallocation
GC_MARKERS=1 /usr/bin/time ./btree-gc-free 21 >/dev/null
        13.12 real        13.05 user         0.05 sys
# Boehm GC with incremental collection (single-threaded)
/usr/bin/time ./btree-gc-inc 21 >/dev/null
        20.74 real        17.21 user         6.29 sys
# System malloc()/free()
/usr/bin/time ./btree-sysmalloc 21 >/dev/null
        67.73 real        66.97 user         0.74 sys

The fastest version is the parallel-mark version of the Boehm-Demers-Weiser GC, which beats out all the competitors. The rear is brought up by system malloc() (to nobody's surprise, standard malloc()/free() on a Mac is dog slow).

The Boehm GC benchmarks are run with interior pointers disabled in order to prevent extra padding for the small objects that would blow up the heap size, but even with interior pointers enabled, the relative performance doesn't change all that much; the single-threaded run clocks in at around 16 seconds instead, still twice as fast as the D GC, and the version with parallel marking finishes in around 13 seconds. It is worth noting that the Boehm GC still does go through all the conservative marking machinery and does check interior pointers from the stack regardless, so a precise or semi-precise collector may still be able to gain additional performance.

Parallel marking gives us a noticeable speedup in wall clock time and a really big one in GC pauses (not displayed here) at the expense of using more CPU time.
Incremental collection is super expensive in comparison, due to the cost of mprotect(), for code that technically would not require write barriers. It still is on par with jemalloc() for throughput.

In terms of GC pause times, we get the following for the Boehm GC:

$ time GC_MARKERS=4 ./btree-gc 23 | tail -5
88 collections
0.056 seconds/collection
0.080 max seconds/collection
818 MB heap size
306 MB free bytes

This is again with parallel marking (four marker threads) and using a larger heap. The collection times are wall clock times, based on gettimeofday(). Note that most heaps will contain at least some non-pointer data; ours is almost 100% pointers. With single-threaded marking, the maximum collection time more than triples to nearly .3 seconds.

As an interesting aside, the maximum pause time for explicitly deleting trees in the benchmark is much larger than the GC pause time (nearly a second for jemalloc()) and would remain larger even if the delete-tree operation were parallelized. This is a good reminder that cascading deletes in RAII can also be problematic for pause times and generally require tradeoffs (deferred/concurrent deletion, sacrificing timely destructor calls) in order to work in real-time systems.

[1] https://benchmarksgame-team.pages.debian.net/benchmarksgame/performance/binarytrees.html
Jan 08
prev sibling parent reply Francesco Mecca <me francescomecca.eu> writes:
On Tuesday, 8 January 2019 at 04:34:11 UTC, Jonathan M Davis 
wrote:
 If we could solve the forking problem on Windows so that we 
 could actually have a cross-platform concurrent GC like the 
 Linux one that Sociomantic has used, then that would likely 
 give similar benefits (if not better) without having to muck 
 with the type system. I forget exactly what the stop-the-world 
 pause times were, but they were pretty low. If a thread really 
 couldn't afford to be stopped at all, it would still need to be 
 separate from the GC, just like now, but that would be true of 
 a solution that involved thread-local heaps as well.
I don't want to talk about my work (mentored by Leandro of Sociomantic) until it is done, but the concurrent GC has been mentioned so many times in this thread that I can't refrain from showing current results.

These are the pause times for Dustmite:

Not forking GC (Linux):
     Collection time    Stop-the-world time
0    0.000139           0.000137
1    0.000172           0.000155
2    0.001917           0.001033
3    0.013672           0.007031
4    0.028831           0.017149
5    0.073201           0.042553
6    0.175916           0.103788
7    0.271328           0.179002
8    0.607061           0.375101

Forking GC (Linux):
     Collection time    Stop-the-world time
0    0.000128           0.000125
1    0.000025           0.000125
2    0.000163           0.000155
3    0.003665           0.000155
4    0.000569           0.000564
5    0.023067           0.000564
6    0.001888           0.001882
7    0.062640           0.001882
8    0.004437           0.004429
9    0.118210           0.004429
10   0.009619           0.009613
11   0.251923           0.009613

As you can see, the pause times are way lower. This is true for many other small programs used for benchmarking. Back in the old TangoRT times, the concurrent GC even improved program execution time, but this is no longer true, mainly because Martin Nowak implemented a similar strategy for the eager allocation of pools.

I was already contacted by Mike in order to report the status of the project for the SAOC, so I hope we can disclose more in the upcoming weeks.

This post highlights a possible way to implement the forking behaviour on Windows as well, but I haven't touched a Windows box in years:

https://rainers.github.io/visuald/druntime/concurrentgc.html

I would like to see a thread-local GC in the future as well, but I still have to understand the weak points of `shared` and how it can be reshaped.
Jan 08
parent Sebastiaan Koppe <mail skoppe.eu> writes:
On Tuesday, 8 January 2019 at 09:39:07 UTC, Francesco Mecca wrote:
 As you can see the pause times are way lower.
 This is true for many other small programs used for 
 benchmarking.
 Back in the old TangoRT times the concurrent GC even improved 
 the program execution time but this is no longer true mainly 
 because Martin Nowak implemented a similar strategy for the 
 eager allocation of pools.

 I was already contacted by Mike in order to report the status 
 of the project for the SAOC so I hope we can disclose more in 
 the upcoming weeks.
That is just plain awesome. Good to see work is being done on that front.
Jan 09
prev sibling next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 1/5/2019 2:05 PM, Manu wrote:
 How much truth is in here?
There's a lot of triggering on the word "GC". At some level, it doesn't matter how good or bad the GC is, it's "GC is bad". Even if your code never calls the GC.
 What is this business about write barriers making the GC fast? Is that
 actually fact, or are there other hurdles?
Here's an explanation: https://stackoverflow.com/questions/19154607/how-actually-card-table-and-writer-barrier-works There are many more if you google "gc write barrier". The penalty, of course, is extra code gets run for every write through a pointer. The benefits exceed the penalties for a language where all dynamic allocation is done via the GC. But for D, where quite a lot of it is not, it isn't so clear. What is clear is that if you're going to compare speed with C++, having those write barriers in there is going to be a serious penalty, because they'll be there even if you don't use the GC.
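To make the per-store penalty concrete, here is a minimal card-table write barrier of the kind the linked answer describes, sketched in D. This is illustrative only: `cardTable`, `heapBase`, and `cardShift` are hypothetical names, not part of druntime, and a real compiler would inline this logic around every pointer store it emits.

```d
enum cardShift = 9;              // 512-byte cards
__gshared ubyte[] cardTable;     // one dirty byte per card (runtime-owned)
__gshared void* heapBase;        // start of the GC-managed heap

// Every write through a pointer would be lowered to something like this:
void storePtr(void** slot, void* value)
{
    *slot = value;
    // Mark the card containing the written-to slot as dirty, so the next
    // collection rescans only dirty cards for cross-generation pointers
    // instead of the whole heap.
    cardTable[(cast(size_t) slot - cast(size_t) heapBase) >> cardShift] = 1;
}
```

The extra shift and store per pointer write is exactly the cost being debated: cheap in isolation, but paid unconditionally, GC or no GC.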
 Where's the ARC stuff? What happened to opAddRef/opRelease?
Andrei, Razvan, and I have decided that the most pragmatic way forward to support reference counting is to support copy constructors ala C++. C++ is clearly successful with this approach, and Andrei/Razvan concluded that D's postblit just doesn't cut it. Razvan has a proposal for copy constructors, and an implementation PR that has fallen a bit behind his latest proposal.
Jan 05
next sibling parent reply Manu <turkeyman gmail.com> writes:
On Sat, Jan 5, 2019 at 7:45 PM Walter Bright via Digitalmars-d
<digitalmars-d puremagic.com> wrote:
 On 1/5/2019 2:05 PM, Manu wrote:
 How much truth is in here?
There's a lot of triggering on the word "GC". At some level, it doesn't matter how good or bad the GC is, it's "GC is bad". Even if your code never calls the GC.
 What is this business about write barriers making the GC fast? Is that
 actually fact, or are there other hurdles?
Here's an explanation: https://stackoverflow.com/questions/19154607/how-actually-card-table-and-writer-barrier-works There are many more if you google "gc write barrier". The penalty, of course, is extra code gets run for every write through a pointer. The benefits exceed the penalties for a language where all dynamic allocation is done via the GC. But for D, where quite a lot of it is not, it isn't so clear. What is clear is that if you're going to compare speed with C++, having those write barriers in there is going to be a serious penalty, because they'll be there even if you don't use the GC. > Where's the ARC stuff? What happened to opAddRef/opRelease? Andrei, Razvan, and I have decided that the most pragmatic way forward to support reference counting is to support copy constructors ala C++. C++ is clearly successful with this approach, and Andrei/Razvan concluded that D's postblit just doesn't cut it. Razvan has a proposal for copy constructors, and an implementation PR that has fallen a bit behind his latest proposal.
Hmnm, that's pretty disappointing... as a C++ user who has been doing C++ ref counting for years, I would say that, while maybe it is *kinda* successful, it's not at all satisfying, and it's one major reason I've been pushing it in D for the last 8-9 years. Ref counting in C++ is a sloppy hack.

I really just want the compiler to emit opInc and opDec appropriately on either side of a reference assignment, and let me take it from there. Also, with escape analysis, we can elide such calls very effectively, and we can't do that using the C++ strategy.

I think it's disappointing to embrace C++'s hacky 'solution'; we could do better.
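For context, the lowering being asked for might look roughly like the sketch below. It is hypothetical: `opInc`/`opDec` were only ever a proposal, and nothing in today's compiler calls them; the point is that the compiler, not a copy constructor, would own the increment/decrement pairs and could elide them.

```d
import core.atomic : atomicOp;

struct RC(T)
{
    T* payload;
    shared(int)* count;

    // Hypothetical hooks: under the proposal, the compiler would invoke
    // these around every copy and destruction of a reference, and elide
    // matched pairs when escape analysis proves the copy doesn't
    // outlive the source.
    void opInc() { atomicOp!"+="(*count, 1); }

    void opDec()
    {
        if (atomicOp!"-="(*count, 1) == 0)
        {
            // last reference gone; free payload here
        }
    }
}

// What `RC!int b = a;` would lower to under the proposal:
//   a.opInc();  b = a (bitwise);
// and at the end of b's scope:
//   b.opDec();
```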
Jan 05
next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 1/5/2019 11:48 PM, Manu wrote:
 Hmnm, that's pretty disappointing... as a C++ user who has been doing
 C++ ref counting for years, I would say that, while maybe it is
 *kinda* successful, it's not at all satisfying, and it's one major
 reason I've been pushing it in D for the last 8-9 years.
 Ref counting in C++ is a sloppy hack. I really just want the compiler
 to emit opInc and opDec appropriately on either sides of a reference
 assignment, and let me take it from there.
 Also, with escape analysis, we can elide such calls very effectively,
 and we can't do that using the C++ strategy.
 
 I think it's disappointing to embrace C++'s hacky 'solution', we could
 do better.
So far, we've failed at every attempt at it.
Jan 06
next sibling parent reply Manu <turkeyman gmail.com> writes:
On Sun, Jan 6, 2019 at 1:25 AM Walter Bright via Digitalmars-d
<digitalmars-d puremagic.com> wrote:
 On 1/5/2019 11:48 PM, Manu wrote:
 Hmnm, that's pretty disappointing... as a C++ user who has been doing
 C++ ref counting for years, I would say that, while maybe it is
 *kinda* successful, it's not at all satisfying, and it's one major
 reason I've been pushing it in D for the last 8-9 years.
 Ref counting in C++ is a sloppy hack. I really just want the compiler
 to emit opInc and opDec appropriately on either sides of a reference
 assignment, and let me take it from there.
 Also, with escape analysis, we can elide such calls very effectively,
 and we can't do that using the C++ strategy.

 I think it's disappointing to embrace C++'s hacky 'solution', we could
 do better.
So far, we've failed at every attempt at it.
Why did opInc/opDec fail? It was never available in any experimental form, I was never able to try it out. I've been waiting eagerly for almost a decade...
Jan 06
parent Walter Bright <newshound2 digitalmars.com> writes:
On 1/6/2019 1:42 AM, Manu wrote:
 Why did opInc/opDec fail? It was never available in any experimental
 form, I was never able to try it out. I've been waiting eagerly for
 almost a decade...
There was a looong thread about it. It wasn't implemented before major problems were found for it. See: https://digitalmars.com/d/archives/digitalmars/D/draft_proposal_for_ref_counting_in_D_211885.html There's more on the Dlang-study mailing list: Dlang-study puremagic.com http://lists.puremagic.com/cgi-bin/mailman/listinfo/dlang-study
Jan 06
prev sibling next sibling parent Russel Winder <russel winder.org.uk> writes:
On Sun, 2019-01-06 at 01:22 -0800, Walter Bright via Digitalmars-d wrote:
 On 1/5/2019 11:48 PM, Manu wrote:
 Hmnm, that's pretty disappointing... as a C++ user who has been doing
 C++ ref counting for years, I would say that, while maybe it is
 *kinda* successful, it's not at all satisfying, and it's one major
 reason I've been pushing it in D for the last 8-9 years.
 Ref counting in C++ is a sloppy hack. I really just want the compiler
 to emit opInc and opDec appropriately on either sides of a reference
 assignment, and let me take it from there.
 Also, with escape analysis, we can elide such calls very effectively,
 and we can't do that using the C++ strategy.
 I think it's disappointing to embrace C++'s hacky 'solution', we could
 do better.
So far, we've failed at every attempt at it.
This is not an excuse to stop trying!

In this, Java and Go are excellent examples: someone comes up with the best that can be done, and then someone finds a new and usually better way of doing things.

I don't have any specific ideas, so I can't put my activity where my mouth is. Sorry.

-- 
Russel.
===========================================================
Dr Russel Winder      t: +44 20 7585 2200
41 Buckmaster Road    m: +44 7770 465 077
London SW11 1EN, UK   w: www.russel.org.uk
Jan 07
prev sibling parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Mon, Jan 07, 2019 at 07:18:02PM +0000, Russel Winder via Digitalmars-d wrote:
 On Sun, 2019-01-06 at 01:22 -0800, Walter Bright via Digitalmars-d wrote:
 On 1/5/2019 11:48 PM, Manu wrote:
 Hmnm, that's pretty disappointing... as a C++ user who has been
 doing C++ ref counting for years, I would say that, while maybe it
 is *kinda* successful, it's not at all satisfying, and it's one
 major reason I've been pushing it in D for the last 8-9 years.
 Ref counting in C++ is a sloppy hack. I really just want the
 compiler to emit opInc and opDec appropriately on either sides of
 a reference assignment, and let me take it from there.  Also, with
 escape analysis, we can elide such calls very effectively, and we
 can't do that using the C++ strategy.
 
 I think it's disappointing to embrace C++'s hacky 'solution', we
 could do better.
So far, we've failed at every attempt at it.
This is not an excuse to stop trying!
[...]

I think you may have misunderstood. Read the "study" mailing lists that Walter referred to. You'll see some concrete examples of tricky situations that may not be immediately obvious from a naive conception of ARC. It's not that we've stopped trying, but that we've run out of good ideas and nobody has stepped up offering better ones.

After reading through said mailing lists, I think I better understand now why Walter is pushing for dip25 and dip1000. They tackle essential issues related to proper escape analysis, which is a requirement for RC counter bump elisions that won't cause UB or other nasty problems. In themselves they won't give you RC just yet, but they pave the way and lay some of the essential foundations that an RC implementation will require.

T

-- 
Never step over a puddle, always step around it. Chances are that whatever made it is still dripping.
Jan 07
prev sibling parent Ola Fosheim =?UTF-8?B?R3LDuHN0YWQ=?= <ola.fosheim.grostad gmail.com> writes:
On Sunday, 6 January 2019 at 07:48:03 UTC, Manu wrote:
 Ref counting in C++ is a sloppy hack. I really just want the 
 compiler
 to emit opInc and opDec appropriately on either sides of a 
 reference
 assignment, and let me take it from there.
I believe C++ compilers are allowed to do this for shared_ptr, but you would need global optimization for it to make much sense. But shared_ptr is not frequently used in performance sensitive loops, because it is slow, so it doesn't make a lot of sense for compiler implementors to spend that much time on it...
Jan 08
prev sibling next sibling parent reply Jacob Carlborg <doob me.com> writes:
On 2019-01-06 04:42, Walter Bright wrote:

  > Where's the ARC stuff? What happened to opAddRef/opRelease?
 
 Andrei, Razvan, and I have decided that the most pragmatic way forward 
 to support reference counting is to support copy constructors ala C++. 
 C++ is clearly successful with this approach, and Andrei/Razvan 
 concluded that D's postblit just doesn't cut it. Razvan has a proposal 
 for copy constructors, and an implementation PR that has fallen a bit 
 behind his latest proposal.
Razvan has confirmed that copy constructors are only for structs [1]. The ARC (opAddRef/opRelease) proposal was for classes as well. So copy constructors are not a substitute as defined now. [1] https://forum.dlang.org/post/hqqszotxfwauczupjmez forum.dlang.org -- /Jacob Carlborg
Jan 07
parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Mon, Jan 07, 2019 at 08:31:06PM +0100, Jacob Carlborg via Digitalmars-d
wrote:
 On 2019-01-06 04:42, Walter Bright wrote:
[...]
 Andrei, Razvan, and I have decided that the most pragmatic way
 forward to support reference counting is to support copy
 constructors ala C++.  C++ is clearly successful with this approach,
 and Andrei/Razvan concluded that D's postblit just doesn't cut it.
 Razvan has a proposal for copy constructors, and an implementation
 PR that has fallen a bit behind his latest proposal.
Razvan has confirmed that copy constructors are only for structs [1]. The ARC (opAddRef/opRelease) proposal was for classes as well. So copy constructors are not a substitute as defined now.
[...]

I think the idea is probably to use struct wrappers to implement RC, i.e., you'd have a struct wrapping a class reference of the object it's managing. As long as the struct can have semantics compatible with RC, it can be used to implement RC. At least in theory.

My suspicion, though, is that at some point further language support is going to be needed, because of interaction with D's const system. The RC count needs to be mutable, yet the managed object must be allowed to be const or immutable in a way that doesn't require onerous, incomplete workarounds (like using a Const template instead of just typing `const RC!MyClass` -- which won't work if you have the RC'd object inside another aggregate that also needs const/immutable qualifiers). Requiring that RC objects can only be mutable is going to lead to more dissatisfaction about RC in D.

T

-- 
The easy way is the wrong way, and the hard way is the stupid way. Pick one.
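A sketch of that struct-wrapper idea, and where const bites, assuming a hand-rolled wrapper (illustrative; Phobos has no exact equivalent of this type):

```d
struct RC(C) if (is(C == class))
{
    private C instance;
    private int* count;  // must stay mutable even when the wrapper is const

    this(C inst)
    {
        instance = inst;
        count = new int(1);
    }

    this(this) { ++*count; }   // postblit: bump the count on copy

    ~this()
    {
        if (count !is null && --*count == 0)
            destroy(instance); // last owner tears the object down
    }

    alias instance this;       // forward member access to the wrapped class
}
```

This works for mutable wrappers, but a `const RC!MyClass` cannot run `++*count` through its const view, and D has no C++-style `mutable` escape hatch, which is exactly the const-system gap described above.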
Jan 07
parent Jacob Carlborg <doob me.com> writes:
On 2019-01-07 20:46, H. S. Teoh wrote:

 I think the idea is probably to use struct wrappers to implement RC,
 i.e., you'd have a struct wrapping a class reference of the object it's
 managing.  As long as the struct can have semantics compatible with RC,
 it can be used to implement RC. At least in theory.
That will be quite ugly. Will these struct wrappers be covariant if the types they wrap are covariant? I mean, is that possible to implement? -- /Jacob Carlborg
Jan 08
prev sibling parent bioinfornatics <bioinfornatics fedoraproject.org> writes:
On Sunday, 6 January 2019 at 03:42:05 UTC, Walter Bright wrote:
 On 1/5/2019 2:05 PM, Manu wrote:
 How much truth is in here?
There's a lot of triggering on the word "GC". At some level, it doesn't matter how good or bad the GC is, it's "GC is bad". Even if your code never calls the GC.
 What is this business about write barriers making the GC fast? 
 Is that
 actually fact, or are there other hurdles?
Here's an explanation: https://stackoverflow.com/questions/19154607/how-actually-card-table-and-writer-barrier-works There are many more if you google "gc write barrier". The penalty, of course, is extra code gets run for every write through a pointer. The benefits exceed the penalties for a language where all dynamic allocation is done via the GC. But for D, where quite a lot of it is not, it isn't so clear. What is clear is that if you're going to compare speed with C++, having those write barriers in there is going to be a serious penalty, because they'll be there even if you don't use the GC.
 Where's the ARC stuff? What happened to opAddRef/opRelease?
Andrei, Razvan, and I have decided that the most pragmatic way forward to support reference counting is to support copy constructors ala C++. C++ is clearly successful with this approach, and Andrei/Razvan concluded that D's postblit just doesn't cut it. Razvan has a proposal for copy constructors, and an implementation PR that has fallen a bit behind his latest proposal.
Topics: get a better understanding of D's application bottlenecks. It could be interesting to be able to analyze execution traces and GC use with tools designed for that purpose. For example, process mining (a scientific field) seems interesting as a way to understand what happens and where the bottlenecks are. With such tools, people will blame the GC only when it is responsible. So a tool able to generate an event trace in the XES file format (http://www.processmining.org/logs/xes) would be awesome. Indeed, there are some tools, such as ProM (http://www.promtools.org/doku.php), which are able to analyse and represent it. Developers will get a better understanding with such (animated) representations: http://fluxicon.com/blog/wp-content/uploads/2013/11/Disco-Process-Mining-Animation.gif . Best regards
Jan 08
prev sibling next sibling parent Walter Bright <newshound2 digitalmars.com> writes:
Some more on write barriers with LLVM:

https://llvm.org/docs/GarbageCollection.html
Jan 05
prev sibling next sibling parent reply Neia Neutuladh <neia ikeran.org> writes:
On Sat, 05 Jan 2019 20:34:30 -0800, H. S. Teoh wrote:
 On Sat, Jan 05, 2019 at 11:12:52PM +0000, Neia Neutuladh via
 Digitalmars-d wrote:
 On Sat, 05 Jan 2019 14:05:19 -0800, Manu wrote:
 I'm somewhere between a light GC user and a @nogc user, and I don't
 really know much about where we're at, or much about state-of-the-art
 GC in general.
I use the GC unabashedly and only try to make sure I reuse memory when it's reasonably convenient. I've also looked into GC a bit.
I also use the GC freely, and only bother with GC optimization when my profiler shows that there's an actual problem. As I've said numerous times before, unless you're working on extremely time-sensitive code like real-time applications or 3D game engines, D's GC usually does not cause much noticeable difference. Unless you do something extreme like allocate tens of millions of small strings (or other small objects) per second, or allocate huge objects rapidly and expect unreferenced memory to be quickly reused.
String allocation is one area where D makes it absurdly easy to reuse existing memory while other languages force you to jump through hoops. One of the reasons I know as much as I do about Unicode is that I ported some code from D to C# and was annoyed by the C# code taking 50 times as long and allocating 17000 times as much memory. The main culprits were UTF-16 and copy-on-substring. To resolve that, I made my own string type. (And used Mono's ahead-of-time compilation and did a couple other things.)
 (You'd have a stop-the-world phase that happened infrequently. Casting
 something to shared would pin it in the thread it was allocated from.)
How does this solve the problem of shared, though? The last time I checked, casting to/from shared is the main showstopper for a thread-local GC.
The issue with a thread-local GC and casting to shared is that the GC for thread A doesn't know about references held by thread B. That means thread A might collect and reuse that shared object. If we had a runtime call in the cast to shared, thread A's GC can record that that object is possibly referenced from another thread. It won't collect it until a stop-the-world phase says it's not referenced. This would give teeth to some of the undefined behavior we currently have.
 Now that I think of it, we could deal with pointers in unions the same
 way -- if the compiler detects it, then trigger conservative mode in the
 GC.
Right. In the past when people have looked at this problem, the idea was to treat anything that could possibly be a pointer as a pointer.
 With these two out of the way, a generational GC for D seems closer to
 the realm of possibility.
I'm not sure how much sense it makes to have a generational GC without write barriers.
Jan 05
next sibling parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Sun, Jan 06, 2019 at 05:11:19AM +0000, Neia Neutuladh via Digitalmars-d
wrote:
 On Sat, 05 Jan 2019 20:34:30 -0800, H. S. Teoh wrote:
[...]
 One of the reasons I know as much as I do about Unicode is that I
 ported some code from D to C# and was annoyed by the C# code taking 50
 times as long and allocating 17000 times as much memory. The main
 culprits were UTF-16 and copy-on-substring. To resolve that, I made my
 own string type.  (And used Mono's ahead-of-time compilation and did a
 couple other things.)
Yeah, Walter's right on the money about copy-on-substring being a big performance hit in C/C++, and apparently also C# (caveat: I don't know anything about C#). D's slices really do eliminate a lot of the background cost that most people don't normally think about.
 (You'd have a stop-the-world phase that happened infrequently.
 Casting something to shared would pin it in the thread it was
 allocated from.)
How does this solve the problem of shared, though? The last time I checked, casting to/from shared is the main showstopper for a thread-local GC.
The issue with a thread-local GC and casting to shared is that the GC for thread A doesn't know about references held by thread B. That means thread A might collect and reuse that shared object. If we had a runtime call in the cast to shared, thread A's GC can record that that object is possibly referenced from another thread. It won't collect it until a stop-the-world phase says it's not referenced. This would give teeth to some of the undefined behavior we currently have.
Ahh, I see what you mean. Hmm, neat idea. Might actually work! Worth exploring, I think.
 Now that I think of it, we could deal with pointers in unions the
 same way -- if the compiler detects it, then trigger conservative
 mode in the GC.
Right. In the past when people have looked at this problem, the idea was to treat anything that could possibly be a pointer as a pointer.
 With these two out of the way, a generational GC for D seems closer
 to the realm of possibility.
I'm not sure how much sense it makes to have a generational GC without write barriers.
True. But it would allow at least a precise GC. And perhaps a few other GC improvements that are currently not possible. T -- Why waste time reinventing the wheel, when you could be reinventing the engine? -- Damian Conway
Jan 05
parent reply Neia Neutuladh <neia ikeran.org> writes:
On Sat, 05 Jan 2019 22:17:27 -0800, H. S. Teoh wrote:
 Yeah, Walter's right on the money about copy-on-substring being a big
 performance hit in C/C++, and apparently also C# (caveat: I don't know
 anything about C#).  D's slices really do eliminate a lot of the
 background cost that most people don't normally think about.
Interestingly, Java used to use slicing for substrings. This was changed (circa Java 7, IIRC). The logic was that you might read in a large string and preserve a short substring of it for a long time, and that costs you a lot of extra memory. I mean, you could just call `new String(substring)`, but apparently people weren't doing that and were confused by their memory usage.
Jan 05
parent Walter Bright <newshound2 digitalmars.com> writes:
On 1/5/2019 11:03 PM, Neia Neutuladh wrote:
 Interestingly, Java used to use slicing for substrings. This was changed
 (circa java 7, IIRC). The logic was that you might read in a large string
 and preserve a short substring of it for a long time, and that costs you a
 lot of extra memory.
That's why DMD doesn't use slicing of the source file buffer for strings and identifiers. There's also a caching issue with it, as the slices will be spread out in memory rather than concentrated.
Jan 06
prev sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 1/5/2019 9:11 PM, Neia Neutuladh wrote:
 I'm not sure how much sense it makes to have a generational GC without
 write barriers.
Eh, I invented a technique in the 90's to do that. Then I read a paper about it, it's called a "mostly copying" generational GC. All it does is pin the objects it isn't sure about (i.e. ambiguous pointers to them).
Jan 06
prev sibling next sibling parent Guillaume Piolat <first.last gmail.com> writes:
On Saturday, 5 January 2019 at 22:05:19 UTC, Manu wrote:
 How much truth is in here?
The truth of oft-repeated mantras? http://www.infognition.com/blog/2014/the_real_problem_with_gc_in_d.html taught us how small our heap has to be to keep performance. The situation has improved since, with -betterC, -profile=gc, allocators, @nogc. Pieces of knowledge have spread too (e.g. avoid void[]). The D GC doesn't seem to stop people from implementing fast systems. When we disabled the GC completely and went fully manual, we got zero speed improvement, only a decrease in memory usage. The GC is surprisingly affordable.
Jan 06
prev sibling next sibling parent reply Reimer Behrends <behrends gmail.com> writes:
On Saturday, 5 January 2019 at 22:05:19 UTC, Manu wrote:
 How much truth is in here?
 What is this business about write barriers making the GC fast? 
 Is that
 actually fact, or are there other hurdles?
First of all, stop-the-world GC gets an unnecessarily bad rap these days. A high throughput stop-the-world GC can actually be pretty competitive in terms of throughput, and latency may be lower than what you'd expect. You probably aren't going to run AAA video games with one, but there are many application domains where latency just doesn't matter or doesn't matter as much; even normal interactive use (such as your typical GUI application) often works plenty fine. Stop-the-world collectors are more of a problem for a functional programming language (or languages that exercise allocation-heavy functional idioms a lot), but less so for imperative languages that have more control over allocation frequency. In general, a modern precise stop-the-world collector should be able to handle 1GB worth of pointers in something like .2 seconds on modern hardware, give or take (especially if you can leverage parallelism). Given that much of the heap will generally be occupied by non-pointer data in an imperative language, such as strings, bitmaps, or numerical data, this is plenty for typical interactive applications (you will often have interaction hiccups of comparable size from other system sources) and reasonably written server-side software can use a multi-process approach to limit the per-process heap size. For batch programs and most command line tools, latency matters little, anyway. Throughput-wise, you should be able to perform on par with something like jemalloc, possibly faster, especially if you have compiler support (to e.g. inline pool allocations and avoid unnecessary zeroing of newly allocated memory if you statically know the object's size and layout). You generally won't be able to beat specialized allocators for throughput, though, just general purpose allocators. Note that manual memory allocation is not a free lunch, either. As an extreme case, normal reference counting is like having a very expensive write barrier that even applies to stack frames. 
Manual allocation only really wins out for throughput if you can use optimized allocation techniques (such as a pool/arena allocator where you throw the pool/arena away at the end). The problem with the D garbage collector is that it doesn't do well in terms of either throughput or latency. Even so, GC work is proportional to allocation work, so the impact on throughput in an imperative language like D is often less than you think, even if the GC doesn't have good performance. For example, I've used Tiny GC [1] in real world C/C++ applications where I needed to avoid external dependencies and the code still had good performance. That despite the fact that due its design (a conservative GC based on system malloc()), its throughput isn't particularly great. Fun fact: Erlang uses a stop-the-world collector (or used to, I haven't checked in a while), but as it is a distributed system with tons of processes with small heaps, it can also keep latency low, as each individual collection only has to traverse a limited amount of memory. In order to get incremental GC with low pause times, you will need one of several approaches: - Write barriers. - Read barriers. - Snapshot techniques. - Optimized reference counting. In practice, read barriers and snapshots without hardware support are often too inefficient for typical programming languages. Read barriers are still used (with heavy compiler support) in Azul's GC and Shenandoah, because (inter alia) they allow you to get around the last problem for extremely low latency GC when it comes to scanning stacks on a multi-threaded system. Snapshot-based systems are rare, but there was at one point a fork()-based snapshot collector for D1. The dual problem with that is that fork() is POSIX-specific and fork() does not get along well with threads. (It's possible, but a pain in the neck.) 
Write barriers are usually the technique of choice, because they only incur overhead when pointers are written to the heap, so they have fairly low and predictable overhead. Write barriers can also improve throughput. If you use generational collection *and* the generational hypothesis holds (i.e. you have a fair amount of short-lived objects), then the cost of the write barrier is usually more than offset by the greatly reduced tracing work necessary. You will need a write barrier for generational collection, anyway, so you get the incremental part for (mostly) free. Write barriers are also particularly attractive in functional (including mostly functional) languages, where pointer writes to the heap are uncommon to begin with. That said, in an imperative language you can often avoid having short-lived heap-allocated objects, so the generational hypothesis may not always apply. The most common use cases where generational GC might benefit D are probably if you are doing lots of work with strings or are incrementally growing a dynamic array starting from a small size. The overhead of a write barrier is context-dependent. Probably the worst case is something like sorting an array in place, where each pointer swap will require two write barriers. That said, these situations can be worked around, especially in an optimizing compiler that can elide multiple write barriers for the same object, and more typical code will not write pointers to the heap that much to begin with. Basic reference counting is normally dog slow due to memory locality issues. However, you can optimize the most performance degrading case (assignments to local variables) in a couple of ways. You can optimize away some, which is what Swift does [2]. Or you can use deferred reference counting, which counts only references from global variables and the heap, but requires occasional stack scanning. 
Reference counting is less attractive for the multi-threaded case with shared heaps, because atomic reference counting is even slower; competitive implementations of deferred reference counting for Java therefore buffer increments and decrements in thread-local storage and process them in batches [3]. Also, additional work has to be done to deal with cycles. And, of course, there are various hybrid approaches, such as generational collection for the minor heap and reference counting for the major heap. Going back to D, the obvious low-hanging fruit is to bring its throughput up to par with modern stop-the-world collectors. If your latency requirements are much higher than what that can give you, chances are that you are already operating in a highly specialized domain and have very specific requirements where you'd also have to invest non-trivial effort into tuning a general purpose GC. That said, having (optional) support for write barriers in the compiler would be pretty nice, though I don't know how much effort that would be. [1] https://github.com/ivmai/tinygc [2] https://github.com/apple/swift/blob/master/docs/ARCOptimization.rst [3] See David Bacon's seminal paper at https://researcher.watson.ibm.com/researcher/files/us-bacon/Bacon01Concurrent.pdf
Jan 06
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 1/6/2019 12:28 PM, Reimer Behrends wrote:
 In general, a modern precise stop-the-world collector should be able to
 handle 1GB worth of pointers in something like .2 seconds on modern
 hardware,
I use Thunderbird mail, latest version, and it has random pauses that are up to 10 seconds. It'll happen when I'm in the middle of typing, which is frustrating. I'm not positive it is a GC pause, maybe it's something else, but the randomness of it suggests it is GC.
Jan 06
next sibling parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Sun, Jan 06, 2019 at 07:34:41PM -0800, Walter Bright via Digitalmars-d wrote:
 On 1/6/2019 12:28 PM, Reimer Behrends wrote:
 In general, a modern precise stop-the-world collector should be able
 to handle 1GB worth of pointers in something like .2 seconds on
 modern hardware,
I use Thunderbird mail, latest version, and it has random pauses that are up to 10 seconds. It'll happen when I'm in the middle of typing, which is frustrating. I'm not positive it is a GC pause, maybe it's something else, but the randomness of it suggests it is GC.
I've experienced a similar effect in Firefox, and though I cannot say for sure it isn't a GC problem, I notice that it causes a long spike of intensive I/O, and appears to be correlated with occasional segfaults and signs of memory corruption / memory leak, and generally happens only after significant usage over a prolonged timeframe, generally ending up in a state of extreme memory usage (several GBs in resident set size for just a small number of persistent tabs) that reset to more reasonable levels upon restarting and restoring exactly the same tabs. T -- What are you when you run out of Monet? Baroque.
Jan 06
next sibling parent Patrick Schluter <Patrick.Schluter bbox.fr> writes:
On Monday, 7 January 2019 at 06:52:40 UTC, H. S. Teoh wrote:
 On Sun, Jan 06, 2019 at 07:34:41PM -0800, Walter Bright via 
 Digitalmars-d wrote:
 [...]
I've experienced a similar effect in Firefox, and though I cannot say for sure it isn't a GC problem, I notice that it causes a long spike of intensive I/O, and appears to be correlated with occasional segfaults and signs of memory corruption / memory leak, and generally happens only after significant usage over a prolonged timeframe, generally ending up in a state of extreme memory usage (several GBs in resident set size for just a small number of persistent tabs) that reset to more reasonable levels upon restarting and restoring exactly the same tabs.
I have that too. It always happens on JavaScript-heavy pages like Facebook or YouTube. Killing the process with the task manager is the quickest way to resolve the issue.
Jan 06
prev sibling next sibling parent Reimer Behrends <behrends gmail.com> writes:
On Monday, 7 January 2019 at 06:52:40 UTC, H. S. Teoh wrote:
 I've experienced a similar effect in Firefox, and though I 
 cannot say for sure it isn't a GC problem, I notice that it 
 causes a long spike of intensive I/O, and appears to be 
 correlated with occasional segfaults and signs of memory 
 corruption / memory leak, and generally happens only after 
 significant usage over a prolonged timeframe, generally ending 
 up in a state of extreme memory usage (several GBs in resident 
 set size for just a small number of persistent tabs) that reset 
 to more reasonable levels upon restarting and restoring exactly 
 the same tabs.
Firefox/Thunderbird are written mostly in C/C++/Rust. SpiderMonkey (the JS engine) does have a GC, but that one is incremental and generational and limited to JS memory. So I think it's unlikely that GC is the culprit. There are any number of reasons why Firefox/Thunderbird might freeze, from synchronous JavaScript to inefficient or misplaced [1] SQLite queries. [1] E.g. https://bugzilla.mozilla.org/show_bug.cgi?id=1266829
Jan 07
prev sibling parent welkam <wwwelkam gmail.com> writes:
On Monday, 7 January 2019 at 06:52:40 UTC, H. S. Teoh wrote:
 I've experienced a similar effect in Firefox, and though I 
 cannot say for sure it isn't a GC problem, I notice that it 
 causes a long spike of intensive I/O, and appears to be 
 correlated with occasional segfaults and signs of memory 
 corruption / memory leak, and generally happens only after 
 significant usage over a prolonged timeframe, generally ending 
 up in a state of extreme memory usage (several GBs in resident 
 set size for just a small number of persistent tabs) that reset 
 to more reasonable levels upon restarting and restoring exactly 
 the same tabs.


 T
This sounds a lot like memory fragmentation. See "The Curse of External Fragmentation: Relocate or Bust!": http://ithare.com/the-curse-of-external-fragmentation-relocate-or-bust/
Jan 07
prev sibling next sibling parent Russel Winder <russel winder.org.uk> writes:
On Sun, 2019-01-06 at 22:52 -0800, H. S. Teoh via Digitalmars-d wrote:
 On Sun, Jan 06, 2019 at 07:34:41PM -0800, Walter Bright via Digitalmars-d
 wrote:
 On 1/6/2019 12:28 PM, Reimer Behrends wrote:
 In general, a modern precise stop-the-world collector should be able
 to handle 1GB worth of pointers in something like .2 seconds on
 modern hardware,
 I use Thunderbird mail, latest version, and it has random pauses that are up to 10 seconds. It'll happen when I'm in the middle of typing, which is frustrating. I'm not positive it is a GC pause, maybe it's something else, but the randomness of it suggests it is GC.
 I've experienced a similar effect in Firefox, and though I cannot say for sure it isn't a GC problem, I notice that it causes a long spike of intensive I/O, and appears to be correlated with occasional segfaults and signs of memory corruption / memory leak, and generally happens only after significant usage over a prolonged timeframe, generally ending up in a state of extreme memory usage (several GBs in resident set size for just a small number of persistent tabs) that reset to more reasonable levels upon restarting and restoring exactly the same tabs.
Some plugins to JetBrains CLion, IntelliJ IDEA, GoLand, and PyCharm (the ones I use) can be prone to quite lengthy stoppages, and sadly the D plugin 1.18 suffers this. Many people blame the Java GC, usually from prejudice rather than actual data. The Rust plugin had such a problem some weeks back, but it turned out not to be a GC problem but an exception and logging problem. Whilst GCs can, indeed do, have problems, some people are too quick to blame the GC when actually the problems are elsewhere in most of the situations. -- Russel. Dr Russel Winder t: +44 20 7585 2200 41 Buckmaster Road m: +44 7770 465 077 London SW11 1EN, UK w: www.russel.org.uk
Jan 07
prev sibling next sibling parent Jacob Carlborg <doob me.com> writes:
On 2019-01-07 04:34, Walter Bright wrote:

 I use Thunderbird mail, latest version, and it has random pauses that 
 are up to 10 seconds. It'll happen when I'm in the middle of typing, 
 which is frustrating.
 
 I'm not positive it is a GC pause, maybe it's something else, but the 
 randomness of it suggests it is GC.
For me it only happens when downloading new emails, which suggests they're not using a separate thread for that. That would be a very poor implementation. -- /Jacob Carlborg
Jan 07
prev sibling parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Mon, Jan 07, 2019 at 07:13:59PM +0000, Russel Winder via Digitalmars-d wrote:
[...]
 Whilst GCs can, indeed do, have problems, some people are too quick to
 blame the GC when actually the problems are elsewhere in most of the
 situations.
[...] Yep, that's typical GC phobia, fueled mostly by prejudice, fear, and ignorance, in the absence of hard evidence. The GC is a convenient whipping boy to blame for whatever malady you're suffering from when you're not sure what the actual cause is, and a convenient excuse not to use something you don't want to use for reasons you're too embarrassed to say out loud. T -- MSDOS = MicroSoft's Denial Of Service
Jan 07
prev sibling next sibling parent Radu <void null.pt> writes:
On Saturday, 5 January 2019 at 22:05:19 UTC, Manu wrote:
 I'm somewhere between a light GC user and a @nogc user, and I 
 don't really know much about where we're at, or much about 
 state-of-the-art GC in general.

 I read comments like this: 
 https://www.reddit.com/r/programming/comments/acsg61/version_20840_of_dmd_the_d_reference_compiler_has/edbhgzi

 """
 And so we do nothing and D continues to languish as every 
 thread is
 filled with discussion about a poor implementation of a sorely
 outdated GC? Other languages in this domain get by just fine - 
 Go and
 Nim to name some off the top of my head (with the full 
 disclosure that
 the former isn't quite D performance and Nim uses a 
 thread-local GC).
 There are C / C++ libraries that provide a better GC. D's 
 tradeoff is
 basically having one of the worst GCs of any language I've used 
 this
 century to avoid a write barrier that causes a, what, 5%
 nigh-deterministic overall reduction in performance (assuming 
 we use
 fast barriers out of a cycle).

 And if a reasonably performant GC is fundamentally incompatible 
 and
 you think it will always be this way, then maybe D as a 
 language needs
 to make serious re-evaluations? D keeps saying "Don't Fear the
 Reaper", but D is one of the only languages where I actually do.
 """

 How much truth is in here?
 What is this business about write barriers making the GC fast? 
 Is that
 actually fact, or are there other hurdles?

 Is the claim that write-barriers generally slow your runtime
 performance significant? And if it is, is it 
 possible/reasonable to
 control the places where they are emit?
 Can @nogc be used to inhibit write barriers in code that we know
 doesn't interact with the GC, such that we have control over 
 that loss
 of perf?

 It's obvious that the quality of the GC implementation is a 
 mental barrier to entry for many... and D has a GC, which is 
 *attractive* to some users. Despite the fact that I don't care 
 about the GC much personally, we do need to care about this as 
 a group, and nobody seems to be making substantial progress.

 Is progress possible, or is the hard reality that the language 
 is just designed such to be resistant to a quality GC, while 
 the ecosystem sadly tends to rely on it?

 Where's the ARC stuff? What happened to opAddRef/opRelease?
There is this article [1] which covers some of the details and opportunities for getting the D GC into better shape. For a while, improvements on the GC front were slow, but I see that lately Rainer has submitted some PRs [2] to get various GC issues sorted. My favorite one is the new precise GC [3]. I think a big improvement would be the use of thread-local heaps and GC. But this requires a rethink of `shared`, which would be nice to do anyhow. 1 - https://olshansky.me/gc/runtime/dlang/2017/06/14/inside-d-gc.html 2 - https://github.com/dlang/druntime/commits?author=rainers&since=2018-12-1&until=2019-01-07 3 - https://github.com/dlang/druntime/pull/2418
Jan 07
prev sibling parent reply welkam <wwwelkam gmail.com> writes:
On Saturday, 5 January 2019 at 22:05:19 UTC, Manu wrote:
 poor implementation of a sorely outdated GC?
The whole of r/programming needs to be garbage collected. I can't find a source, but if I recall correctly, mark-and-sweep GC has the highest program throughput of all GCs.
 What is this business about write barriers making the GC fast?
In Java and other managed languages, common code has lots of small short-lived objects and not a lot of pointer modifications, so adding write barriers can get you a concurrent GC almost for free. A programmer like you, who can turn heap allocations into stack allocations, use scope(exit) or similar for simple allocations, and leave only the difficult allocations to the GC, won't benefit from them.
 It's obvious that the quality of the GC implementation is a 
 mental barrier to entry for many...<...> we do need to care 
 about this as a group, and nobody seems to be making 
 substantial progress.
For things to change, we need to change minds. I have watched many debates in religion, climate change and politics and saw just how hard it is to change some people's opinions. The best hope is to target people in the middle who don't have strong opinions one way or another and repeatedly expose them to information that shows the GC is not a problem. It's not correct to say that people in this group don't care; it's just that everything needs work and this is not the only problem D has. I have an idea for a D blog post that would somewhat address this problem, but I know myself and I know there is a low chance of this idea materializing. I even picked a title: "Why D with GC is faster than your C/C++ code". TL;DR: memory management.
 Is progress possible, or is the hard reality that the language 
 is just designed such to be resistant to a quality GC
With dip1000, in principle we can reduce the overhead of the GC by turning GC allocations into stack or malloc allocations, making a generational GC less necessary.
Jan 07
parent reply JN <666total wp.pl> writes:
On Monday, 7 January 2019 at 15:54:32 UTC, welkam wrote:
 For things to change, we need to change minds. I have watched 
 many debates in religion, climate change and politics and saw 
 just how hard it is to change some people's opinions. The best 
 hope is to target people in the middle who don't have strong 
 opinions one way or another and repeatedly expose them to 
 information that shows the GC is not a problem.
I think for many people (especially C++ folks which seem to be the main target of D leadership) the mere existence of GC is enough to turn them off. Here's a quote I found on Rust subreddit today: "I remember getting excited about D in 2002 exactly because it was supposed to be 'C++-improved' and then learning that D had GC which made it less 'C++-improved' and more 'Java-improved'. I switched to Rust exactly because I find it 'C++-improved'." The mere existence of GC instantly puts D into application programming language rather than systems programming language for most people. Are they correct? Debatable. However the mental barrier is there and it's hard to get through with a nice message.
Jan 07
next sibling parent welkam <wwwelkam gmail.com> writes:
On Monday, 7 January 2019 at 17:18:49 UTC, JN wrote:
 On Monday, 7 January 2019 at 15:54:32 UTC, welkam wrote:
 For things to change, we need to change minds. I have watched 
 many debates in religion, climate change and politics and saw 
 just how hard it is to change some people's opinions. The best 
 hope is to target people in the middle who don't have strong 
 opinions one way or another and repeatedly expose them to 
 information that shows the GC is not a problem.
I think for many people (especially C++ folks which seem to be the main target of D leadership) the mere existence of GC is enough to turn them off. Here's a quote I found on Rust subreddit today: "I remember getting excited about D in 2002 exactly because it was supposed to be 'C++-improved' and then learning that D had GC which made it less 'C++-improved' and more 'Java-improved'. I switched to Rust exactly because I find it 'C++-improved'." The mere existence of GC instantly puts D into application programming language rather than systems programming language for most people. Are they correct? Debatable. However the mental barrier is there and it's hard to get through with a nice message.
Your sampling is biased. You pay attention to those who commented, not to the silent majority. People who don't have strong opinions and are open-minded won't comment with such strong opinions on reddit threads discussing language merits. They just read them, and those are the people you can convince with solid arguments.
Jan 07
prev sibling parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Mon, Jan 07, 2019 at 05:18:49PM +0000, JN via Digitalmars-d wrote:
[...]
 I think for many people (especially C++ folks, who seem to be the
 main target of D leadership) the mere existence of a GC is enough to
 turn them off.
Yep. GC phobia is a common malady among C/C++ folks (I used to be one of them). I wrote D off the first time I found it because I was suffering from a particularly acute case of GC phobia, mostly colored by bad impressions of early versions of Java. It took me a while to come around. Reading this helped a little: https://dlang.org/spec/garbage.html but it didn't convince me; it only made me cautiously willing to try out D a little and see for myself. I don't think I was even willing to try D until I read that I could optionally still use malloc/free if I really wanted to, or if the GC turned out to be a problem. Clutching onto that option was an important crutch for my initial foray into D.

It wasn't until I actually wrote code that let the GC do its job that I was finally able to see, in retrospect, just how much time I had wasted basically reimplementing my own version of GC (manually, and rather poorly!) and debugging the associated tricky pointer code -- all of which was unnecessary with a GC. Furthermore, having a GC enabled a cleanness in my internal APIs that manually-managed memory did not allow: it removed memory management from being an integral aspect of my APIs and freed up the design space (and my mental concentration!) to focus on the problem domain instead of being bogged down with memory management details. The result was: (1) my code quality was better, from having eliminated an entire class of pointer bugs and from being able to focus on the problem domain rather than fiddle with memory management at every turn; (2) it had a better API that directly addressed the problem domain, uncluttered by dirty memory management specifics; and (3) I could achieve the above in a much shorter time (and with much less effort) than without the GC, since I had eliminated the big time sink of manual memory management. This was a clear win for me. So eventually I convinced myself. :-P

The other important thing I had to get over was the oft-cited performance concerns -- which in my case (and, I surmise, in the case of many others in the C/C++ camp) were colored by bad impressions of early versions of Java. It took me a while to convince myself that the chain of `delete`s in the C++ dtor of a large recursive structure is basically tantamount to a GC pause, just scheduled differently. Well, OK, it's not *exactly* the same thing, but it dispelled the wrong notion that malloc/free (or C++ new/delete) come "for free". You still have to do roughly the same amount of work and bookkeeping either way. Of course, the cost of a GC collection is likely more than manually freeing stuff, depending on the situation; but there's the tradeoff of spending easily 3x to 4x more effort writing correct pointer code under manual MM and *still* running the risk of pointer bugs (with the associated debugging effort -- and pointer bugs are notorious for being hard to debug), vs. paying a (usually) small performance cost and being much more productive, with the bonus of eliminating a whole class of pointer bugs "for free".

T

-- 
Study gravitation, it's a field with a lot of potential.
Jan 07
next sibling parent welkam <wwwelkam gmail.com> writes:
On Monday, 7 January 2019 at 18:18:05 UTC, H. S. Teoh wrote:
 but there's the tradeoff of spending easily 3x to 4x more 
 effort in writing correct pointer code under manual MM
Most code is slow because the programmer didn't have time to optimize it. For normal code the GC won't add more than ~15% overhead, but with some optimization you can easily get over a 2x improvement in your program. This could be an excellent topic for the D blog.
Jan 07
prev sibling parent reply Guillaume Piolat <first.last gmail.com> writes:
On Monday, 7 January 2019 at 18:18:05 UTC, H. S. Teoh wrote:
 Yep. GC phobia is a common malady among C/C++ folks (I used to 
 be one of them).
Guilty as charged here! What I like most about the GC is the efficiency! With scanning and a global owner, you don't need to keep ownership information anywhere. Which leads to less copying (std::string does a malloc for every string copy; the GC avoids that), slices being two machine words instead of 3, etc.
Jan 07
next sibling parent "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Mon, Jan 07, 2019 at 10:02:02PM +0000, Guillaume Piolat via Digitalmars-d wrote:
 On Monday, 7 January 2019 at 18:18:05 UTC, H. S. Teoh wrote:
 Yep. GC phobia is a common malady among C/C++ folks (I used to be
 one of them).
Guilty as charged here!
I wonder if some of us ex-GC-phobists(?) should throw together a D blog entry with brief summaries / excerpts of how we got over our GC phobia and came to embrace the GC. Could be a useful way to collect some common GC myths / arguments for GC in a place where we can point people to.
 What I like most about GC is the efficiency! With scanning and a
 global owner, you don't need to keep ownership information anywhere.
That's true! And having a dedicated collection thread also means you get better cache coherency and economies of scale, as opposed to individually freeing small objects here and there and incurring many RAM roundtrips.
 Which leads to less copying (std::string leads malloc for every string
 copy, GC avoids that), slices being two machine words instead of 3,
 etc.
Yeah, std::string's copy-on-assign and copy-on-substring do add a lot of overhead that people are often unaware of. Well, some people *are* aware of it, but the "solution" many adopt is to avoid std::* and go back to the bad ole world of char* and buffer-overrun heaven. Both are pessimal. D's slices r0x0rs.

T

-- 
Too many people have open minds but closed eyes.
Jan 07
prev sibling parent Russel Winder <russel winder.org.uk> writes:
On Mon, 2019-01-07 at 14:29 -0800, H. S. Teoh via Digitalmars-d wrote:
 […]
 I wonder if some of us ex-GC-phobists(?) should throw together a D blog
 entry with brief summaries / excerpts of how we got over our GC phobia
 and came to embrace the GC.  Could be a useful way to collect some
 common GC myths / arguments for GC in a place where we can point people
 to.
Or create an article for Overload or CVu which can then be a blog post on the D website somewhere.

-- 
Russel.
===========================================
Dr Russel Winder      t: +44 20 7585 2200
41 Buckmaster Road    m: +44 7770 465 077
London SW11 1EN, UK   w: www.russel.org.uk
Jan 07