digitalmars.D - Multithreading woes on Linux

Juan Jose Comellas <jcomellas gmail.com> writes:
It seems that there is a problem in the code generated by DMD or the code in
Phobos when using multithreading on Linux. I've been trying several ways of
rewriting my programs to avoid this problem, but I've had no success so
far. The crashes always happen inside the garbage collector. The line
reported by gdb is:


1318                byte *p = cast(byte *)(*p1);

It looks like the pointer that's being dereferenced by the GC is invalid.
I've added checks before this line to see if it was a NULL pointer and it's
not. Surprisingly (or not), my program crashes almost immediately if Phobos
and the GC are compiled with optimizations. If I only leave "-g" as the
DFLAGS in the makefiles I get these crashes much less frequently.  

In the test program I'm using I have two threads. The crash is happening on
thread 1. The full backtrace I get for the crash is attached to this post.

I'm trying to write a simplified sample program and I'll post it once I have
it ready. Walter, if you have a minute, I'd appreciate you looking into
this.
Apr 23 2006
Thomas Kuehne <thomas-dloop kuehne.cn> writes:
Juan Jose Comellas wrote on 2006-04-23:
 It seems that there is a problem in the code generated by DMD or the code in
 Phobos when using multithreading on Linux. I've been trying several ways of
 rewriting my programs to avoid this problem, but I've had no success so
 far. The crashes always happen inside the garbage collector. The line
 reported by gdb is:

 1318                byte *p = cast(byte *)(*p1);

Might be related to http://d.puremagic.com/bugzilla/show_bug.cgi?id=72

A potential workaround:

1) Edit dmd/src/phobos/internal/gc/linux.mak and remove -release from DFLAGS:
   DFLAGS=-O -inline -I../..
2) Recompile libphobos.a.
3) Replace your current libphobos.a with the one found at
   dmd/src/phobos/libphobos.a.

Thomas
Apr 23 2006
Dave <Dave_member pathlink.com> writes:
I just ran into this - the fix in std/thread.d:

     extern (C) static void pauseHandler(int sig)
     {
         int result;

         // Save all registers on the stack so they'll be scanned by the GC
         asm
         {
             pusha   ;
         }

         assert(sig == SIGUSR1);
         // The sem_post call used to be here; it is moved to after
         // t.stackTop = getESP() below.
         //sem_post(&flagSuspend);

         sigset_t sigmask;
         result = sigfillset(&sigmask);
         assert(result == 0);
         result = sigdelset(&sigmask, SIGUSR2);
         assert(result == 0);

         Thread t = getThis();
         t.stackTop = getESP();
         t.flags &= ~1;
         sem_post(&flagSuspend); // HERE: post only after stackTop is set
         while (1)
         {
             sigsuspend(&sigmask);   // suspend until SIGUSR2
             if (t.flags & 1)        // ensure it was resumeHandler()
                 break;
         }

         // Restore all registers
         asm
         {
             popa    ;
         }
     }

The problem is that t.stackTop is not yet valid when it is passed into
gcx.mark(): with the sem_post at the top of the handler, pauseAll returns
(and lets the GC commence) before stackTop has been set for all of the
paused threads.
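
The ordering issue can be shown in isolation. What follows is a minimal C
sketch, not the Phobos code: it assumes a POSIX system with unnamed
semaphores, and the names paused_thread, worker and demo_pause_handshake are
hypothetical. A plain thread stands in for the signal handler, and the
"collector" side only checks that the published stack top is set. The point
is the same as in the fix: publish the value first, post the semaphore second.

```c
#include <assert.h>
#include <pthread.h>
#include <semaphore.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-thread record kept by std/thread.d. */
typedef struct {
    void *stack_top;   /* published by the paused thread */
    sem_t suspended;   /* posted once stack_top is valid */
    sem_t resume;      /* posted by the collector side to release the thread */
} paused_thread;

static void *worker(void *arg)
{
    paused_thread *t = arg;
    int marker;                   /* a local, so its address is on this stack */

    t->stack_top = &marker;       /* 1. publish the stack top first */
    sem_post(&t->suspended);      /* 2. only then report "I am paused" */
    sem_wait(&t->resume);         /* 3. stay parked until resumed */
    return NULL;
}

/* Returns 1 if the waiting side observed a non-NULL stack top, else 0. */
int demo_pause_handshake(void)
{
    paused_thread t;
    pthread_t tid;
    int rc, ok;

    t.stack_top = NULL;
    rc = sem_init(&t.suspended, 0, 0);
    assert(rc == 0);
    rc = sem_init(&t.resume, 0, 0);
    assert(rc == 0);
    rc = pthread_create(&tid, NULL, worker, &t);
    assert(rc == 0);

    sem_wait(&t.suspended);       /* plays the role of pauseAll's wait */
    ok = (t.stack_top != NULL);   /* safe: the write happened before the post */

    sem_post(&t.resume);
    pthread_join(tid, NULL);
    sem_destroy(&t.suspended);
    sem_destroy(&t.resume);
    return ok;
}
```

If the two numbered steps in worker were swapped, the waiting side could wake
up and read stack_top before it was written, which is the situation described
above for gcx.mark(). Build with -pthread where the toolchain requires it.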

Please give it a try and if it also solves your problem then it will be 
a confirmed fix.

- Dave

Juan Jose Comellas wrote:
 ------------------------------------------------------------------------

 (gdb) thread apply all bt

 Thread 2 (process 8953):
 [...]
 cket6Socket5FlagsZi () at
 /home/jcomellas/devel/d/mango_test/mango/io/Socket.d:1423
 [...]
 /home/jcomellas/devel/d/mango_test/mango/io/Socket.d:879
 [...]
 /home/jcomellas/devel/d/mango_test/mango/io/Conduit.d:198
 [...]
 std/thread.d:845
 [...]

 Thread 1 (process 8949):
 [...]
 ctor13ISelectionSet ()
      at /home/jcomellas/devel/d/mango_test/mango/io/selector/PollSelector.d:353
 [...]
 elector9ISelectorZv () at selector.d:142
 [...]

Apr 23 2006
Juan Jose Comellas <jcomellas gmail.com> writes:
Great fix! This solved all the problems I've found so far when working with
multiple threads on Linux. I'm going to start running more complex test
cases with several hundred threads to see if I can find any additional
problems.

Thank you very much for this.

Walter, please add this fix to Phobos. Should I create an entry in D's
bugzilla?


Apr 23 2006
Justin C Calvarese <technocrat7 gmail.com> writes:
Juan Jose Comellas wrote:
 Great fix! This solved all the problems I've found so far when working with
 multiple threads on Linux. I'm going to start running more complex test
 cases with several hundred threads to see if I can find any additional
 problems.
 
 Thank you very much for this.
 
 Walter, please add this fix to Phobos. Should I create an entry in D's
 bugzilla?

I think this is exactly what bugzilla is for. I think you should go ahead and
add it.

-- jcc7
Apr 23 2006
pmoore <pmoore_member pathlink.com> writes:
Slightly off topic:

Why does this function do a pusha and popa? Surely they are 16-bit pushes and
pops? Wouldn't you want pushad and popad instead? Note, though, that individual
pushes and pops would probably be better with the 64-bit future in mind, as
pushad and popad become invalid instructions in x86_64.


In article <e2gvv6$217a$1 digitaldaemon.com>, Juan Jose Comellas says...
Great fix! This solved all the problems I've found so far when working with
multiple threads on Linux. I'm going to start running more complex test
cases with several hundred threads to see if I can find any additional
problems.

Thank you very much for this.

Walter, please add this fix to Phobos. Should I create an entry in D's
bugzilla?


Apr 24 2006