digitalmars.D - GC buggy in windows?

tchaloupka <chalucha gmail.com> writes:
We're experiencing some really strange, nasty GC behavior in our
IOCP I/O-heavy Windows app.

Sometimes it hangs with just: "Unable to load thread context"

I've spent the last three days experimenting and trying to
narrow it down to find the exact cause :(

The problem is in the GC and its stop-the-world behavior.
In the core.thread.osthread suspend function there is basically:

```
SuspendThread( t.m_hndl );
GetThreadContext( t.m_hndl, &context );
```

In some cases GetThreadContext fails with `ERROR_GEN_FAILURE` (31),
which leads to the error being thrown.

The first problem is that the application doesn't terminate after
this error, but just hangs.
That's because the thread is still suspended, and somewhere down
the line `join` is called on this thread, which never returns.

This is a nice blog post explaining that `SuspendThread` is
actually asynchronous:
https://devblogs.microsoft.com/oldnewthing/?p=44743

But it also states that once `GetThreadContext` has been called on
the thread, we can be sure that it is actually suspended.

So what could lead to the error? Searching the Windows API
documentation - nah, nothing, as usual..

Searching the internet - sure, a lot of problems with some game
engines using a GC (Unity) combined with some anti-cheat or
antivirus programs - not our case.

OK, so I've tried to compile a custom druntime (what a pleasure
in itself) and found that:

* when you call Thread.yield and try to get the context again, it
doesn't help, still an error
* the only way I could work around this problem was resuming the
thread, calling Thread.yield, suspending the thread and trying to
get the context again; usually the first or second try succeeds -
HOORAY.
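
The resume/yield/suspend retry described above could be sketched roughly as follows. This is not the actual druntime patch: `suspendWithRetry`, its delegate parameters and the `maxAttempts` bound are hypothetical names chosen for illustration. The three delegates stand in for `SuspendThread`, `GetThreadContext` and `ResumeThread` on one target thread, so the control flow can be exercised without touching the Win32 API.

```d
import core.thread : Thread;

/// Hypothetical helper, not part of druntime.
/// `suspend`, `loadContext` and `resume` stand in for SuspendThread,
/// GetThreadContext and ResumeThread on a single target thread.
/// Returns true with the thread left suspended, or false with it left running.
bool suspendWithRetry(scope bool delegate() suspend,
                      scope bool delegate() loadContext,
                      scope void delegate() resume,
                      uint maxAttempts = 10) // arbitrary bound, an assumption
{
    foreach (attempt; 0 .. maxAttempts)
    {
        if (!suspend())
            return false;   // could not suspend at all
        if (loadContext())
            return true;    // context captured, thread stays suspended
        resume();           // let the thread run again before retrying
        Thread.yield();     // give it a chance to make progress
    }
    return false;           // every attempt failed, thread is running
}

unittest
{
    // Simulated thread whose context read fails twice before succeeding.
    int failuresLeft = 2;
    int suspendCount = 0; // mirrors the per-thread suspend count

    bool ok = suspendWithRetry(
        () { ++suspendCount; return true; },
        () { if (failuresLeft > 0) { --failuresLeft; return false; } return true; },
        () { --suspendCount; });

    assert(ok);
    assert(suspendCount == 1); // suspended exactly once on return
}
```

Because a failed attempt always resumes the thread before the next try, the helper never returns `false` with the thread still suspended, which is the state that makes the later `join` hang.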

Then I spent a lot of time figuring out what is actually causing
the error, and I have a theory that the problem is some I/O
operation running in kernel context that can't finish while the
thread is suspended, so the error is returned.

I ended up with this minimized test app that causes this error 
really fast.

```
import core.memory : GC;
import core.stdc.stdio;
import core.thread;
import std.random;
import std.range;

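void main() {
	Thread t;
	while (true) {
		// Force a stop-the-world collection on every iteration.
		GC.collect();
		// Keep one worker thread doing file I/O alive at all times.
		if (t is null || !t.isRunning) {
			t = new Thread(&threadProc);
			t.start();
		}
	}
}

void threadProc() {
	// Open and close a file a random number of times (0 to 99).
	foreach (_; iota(uniform(0, 100))) {
		FILE* f = fopen("dummy", "a");
		if (f is null) continue; // fopen failed, nothing to close
		scope (exit) fclose(f);
	}
}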
```

Compiled with: `dmd -m64 -debug test.d`
Tested on 64-bit Windows 10.

I definitely think that this is a bug in the Windows GC
implementation.

Should I file it?

What seems to be a fix for both problems (the failing context
read and the hang) is:
* retry the resume/suspend/get context sequence on the failing
thread a few more times - how many times?
* before raising the error, resume the thread so it can be
joined (I haven't looked at where join is called from on
termination)

For me it is also questionable whether terminating the application
in this case is even the correct behavior. It might be better to
abandon the GC attempt, resume the threads and retry on the next
collection? That might lead to other problems, but as this occurs
pretty rarely it might have a better outcome. Ideas?
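
The "abandon the attempt and retry on the next collection" idea could look something like the sketch below. It is a simulation only: `SimThread`, `trySuspendAll` and `tryCollect` are hypothetical names that do not exist in druntime, and the `contextReadable` flag merely stands in for whether `GetThreadContext` would succeed on a suspended thread.

```d
/// Hypothetical stand-in for a runtime thread; real code would wrap
/// SuspendThread, GetThreadContext and ResumeThread instead.
struct SimThread
{
    bool contextReadable = true; // false simulates a failing GetThreadContext
    int suspendCount;            // > 0 means the thread is suspended

    bool suspendAndLoadContext()
    {
        ++suspendCount;
        return contextReadable;
    }

    void resume()
    {
        --suspendCount;
    }
}

/// Suspends every thread, or none: if one context read fails, all threads
/// suspended so far (including the failing one) are resumed.
bool trySuspendAll(SimThread[] threads)
{
    foreach (i, ref t; threads)
    {
        if (!t.suspendAndLoadContext())
        {
            foreach (ref touched; threads[0 .. i + 1])
                touched.resume();
            return false;
        }
    }
    return true;
}

/// Returns true if a collection ran, false if this cycle was skipped.
bool tryCollect(SimThread[] threads)
{
    if (!trySuspendAll(threads))
        return false; // skip this cycle, a later collection can retry

    // ... marking and sweeping would happen here ...

    foreach (ref t; threads)
        t.resume();
    return true;
}

unittest
{
    auto threads = [SimThread(true), SimThread(false), SimThread(true)];
    assert(!tryCollect(threads));      // second thread blocks the cycle
    foreach (t; threads)
        assert(t.suspendCount == 0);   // nobody is left suspended
}
```

The point of the rollback is that a skipped cycle leaves every thread running, so nothing can later block in `join`. Whether skipping collections is acceptable under memory pressure is a separate question that this sketch does not address.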

PS: I'm beginning to understand why C/C++ devs don't like GC
languages ;-)
PPS: Now I hate Windows even more.. (normally a Linux dev)
PPPS: This kind of experience would definitely drive away devs
that just need to get "shit done" and don't care about the tool
used..
Nov 08
rikki cattermole <rikki cattermole.co.nz> writes:
Just to confirm, this code snippet is meant to lock the entire process 
up and CPU usage go down to 0%?

If so, so far I have not confirmed it using dmd 2.087.0.
Nov 08
tchaloupka <chalucha gmail.com> writes:
On Friday, 8 November 2019 at 14:39:34 UTC, rikki cattermole 
wrote:
 Just to confirm, this code snippet is meant to lock the entire 
 process up and CPU usage go down to 0%?

 If so, so far I have not confirmed it using dmd 2.087.0.
Yep, it just outputs:

C:\Users\tcha\Workspace>test.exe
core.thread.osthread.ThreadError@src\core\thread\osthread.d(3176): Unable to load thread context
----------------

and hangs on Thread.join (0% CPU).

Tested both on physical and virtual Windows 10 x86_64.
Nov 08
tchaloupka <chalucha gmail.com> writes:
On Friday, 8 November 2019 at 14:47:18 UTC, tchaloupka wrote:
 Yep, it just outputs:

 C:\Users\tcha\Workspace>test.exe
 core.thread.osthread.ThreadError@src\core\thread\osthread.d(3176): Unable to load thread context
 ----------------

 and hangs on Thread.join (0% CPU).

 Tested both on physical and virtual windows 10 x86_64.
We've just tried it on 5 more physical PCs (all Win 10 x86_64 with SSD/M.2, Core i5/i7 of various models), with dmd-master, dmd-2.086.1 and dmd-2.089.0. All ended up same within a few secs.
Nov 08
Dennis <dkorpel gmail.com> writes:
On Friday, 8 November 2019 at 15:01:26 UTC, tchaloupka wrote:
 All ended up same within a few secs.
I tried it a few times on my Windows 10 laptop with dmd 2.088, it just sat there for minutes using ~14% CPU (note: I have 8 logical processors) taking 10 Mb, and nothing appeared in the console. So unfortunately I couldn't reproduce it either.
Nov 08
rikki cattermole <rikki cattermole.co.nz> writes:
On 09/11/2019 4:21 AM, Dennis wrote:
 I tried it a few times on my Windows 10 laptop with dmd 2.088,
 it just sat there for minutes using ~14% CPU (note: I have 8
 logical processors) taking 10 Mb, and nothing appeared in the
 console. So unfortunately I couldn't reproduce it either.
Yes. This is looking more and more like an environment issue, not a bug on druntime's end. Potentially AV related (I use Avast) and I'm on Windows 10 Home 64bit.
Nov 08
tchaloupka <chalucha gmail.com> writes:
On Friday, 8 November 2019 at 15:30:18 UTC, rikki cattermole
wrote:
 Yes. This is looking more and more like an environment issue,
 not a bug on druntime's end. Potentially AV related (I use
 Avast) and I'm on Windows 10 Home 64bit.
Thanks for the feedback. I've tried it on more servers, where it actually worked as you both described.

In the end the difference was the Eset antivirus being installed. I had it completely disabled to rule out exactly this, but only after uninstalling it did the test start to work.. So some crap was still active in it.

Well still it's pretty unfortunate if some 3rd side app can brick the GC runtime. We can't just say to customers "You've got Eset installed? Screw you it won't work together."

So bug or not a bug?
Nov 08
rikki cattermole <rikki cattermole.co.nz> writes:
On 09/11/2019 4:54 AM, tchaloupka wrote:
 Thanks for feedback, I've tried it on more servers where it
 actually worked as you both described. At the end the difference
 was Eset antivirus installed. I had it whole disabled to
 eliminate exactly this but only after it's uninstall it started
 to work.. So some crap was still active in it. Well still it's
 pretty unfortunate if some 3rd side app can brick the GC
 runtime. We can't just say to customers "You've got Eset
 installed? Screw you it won't work together." So bug or not a
 bug?
Bug on Eset's side. They are misbehaving in some way.

You can confirm that this is the case by installing an AV like Avast with full firewall capability turned on (may need to pay, but worthwhile to confirm).

The reason I am confident that it is a bug on the AV side and not D's is because I don't remember hearing about this happening before.

It may be possible to add a workaround on our end, but we'll need Eset on our side for that I think.

Based upon a quick search on Google, it's looking like Eset consider this a feature not a bug.

https://forum.unity.com/threads/getthreadcontext-failed.140925/
Nov 08
bachmeier <no spam.net> writes:
On Friday, 8 November 2019 at 15:54:56 UTC, tchaloupka wrote:

 Well still it's pretty unfortunate if some 3rd side app can 
 brick the GC runtime. We can't just say to customers "You've 
 got Eset installed? Screw you it won't work together."
But isn't that the purpose of antivirus software? Isn't the whole point to allow it to be able to interfere with the execution of other programs?
Nov 08
Gregor Mückl <gregormueckl gmx.de> writes:
On Friday, 8 November 2019 at 20:30:28 UTC, bachmeier wrote:
 But isn't that the purpose of antivirus software? Isn't the
 whole point to allow it to be able to interfere with the
 execution of other programs?
It's not OK if the interference consists of injecting random bugs into legitimate programs. Antivirus programs have a pretty awful track record in this regard. I can't think of an antivirus product that I used that didn't turn out to be defective in one way or another.
Nov 08
Fathou <fathou mail.address> writes:
On Saturday, 9 November 2019 at 03:02:12 UTC, Gregor Mückl wrote:
 It's not OK if the interference consists of injecting random
 bugs into legitimate programs. Antivirus programs have a pretty
 awful track record in this regard. I can't think of an
 antivirus product that I used that didn't turn out to be
 defective in one way or another.
OT, but the last time I used an AV was to disinfect a relative's laptop... that already had AV on it. I don't see the point of it as a preventative measure, especially on mobile devices like phones. But, perhaps that's myopia.
Nov 08
bachmeier <no spam.net> writes:
On Saturday, 9 November 2019 at 03:02:12 UTC, Gregor Mückl wrote:
 It's not OK if the interference consists of injecting random
 bugs into legitimate programs. Antivirus programs have a pretty
 awful track record in this regard. I can't think of an
 antivirus product that I used that didn't turn out to be
 defective in one way or another.
When you install antivirus on your computer, you're giving it control over your computer. If other programs had a way around that, it would be useless. You could make the argument that the antivirus is crappy at its job. There's nothing the D compiler (or a compiler for any other language) can do about it.
Nov 09
Gregor Mückl <gregormueckl gmx.de> writes:
On Saturday, 9 November 2019 at 15:43:45 UTC, bachmeier wrote:
 On Saturday, 9 November 2019 at 03:02:12 UTC, Gregor Mückl 
 wrote:
 It's not OK if the interference consists of injecting random 
 bugs into legitimate programs. Antivirus programs have a 
 pretty awful track record in this regard. I can't think of an 
 antivirus product that I used that didn't turn out to be 
 defective in one way or another.
 When you install antivirus on your computer, you're giving it
 control over your computer. If other programs had a way around
 that, it would be useless. You could make the argument that the
 antivirus is crappy at its job. There's nothing the D compiler
 (or a compiler for any other language) can do about it.
I'm trying to make exactly that argument: ALL antivirus software I've ever used turned out to be crappy. They induced bugs in legitimate, clean programs in various ways. They were somewhat successful at stopping the occasional malicious file, but the collateral damage is not pretty. I could list a few fun examples, if you want. There is a reason why first level support often tells people to temporarily deactivate their virus scanner and try again. This works more often than it actually should.
Nov 09
JN <666total wp.pl> writes:
On Saturday, 9 November 2019 at 03:02:12 UTC, Gregor Mückl wrote:
 It's not OK if the interference consists of injecting random 
 bugs into legitimate programs. Antivirus programs have a pretty 
 awful track record in this regard. I can't think of an 
 antivirus product that I used that didn't turn out to be 
 defective in one way or another.
I consider most AV software snake oil. They slow down your OS too. The only AV software I trust is Windows Defender, for a simple reason: it's in AV vendors' best interest for your PC to be infected, because it sells their software and other "malware removal" and "PC optimization" crapware. In the case of Microsoft, it's in their best interest not to have any viruses at all, because it reflects badly on them as a platform. It's also in their best interest to minimize any slowdowns and inconveniences AV brings.

Also, these are different times. In the pre-internet times virus infections were prevalent, carried over with USB drives or in drive-by Java applet/Flash attacks. The modern web environment is sandboxed well enough that it protects you from most attacks.
Nov 09