digitalmars.D - Compiler optimization breaks multi-threaded code

Michal Minich (11/11) Nov 14 2010 There is one question on SO which seems like a serious problem for atomi...

Sean Kelly (2/19) Nov 14 2010 What a mess. DMD isn't supposed to optimize across asm blocks.

Kagamin (2/10) Nov 15 2010 There're no asm blocks in the code. It's a violated contract of shared d...

Sean Kelly (6/20) Nov 16 2010 cas() contains an asm block. Though I guess in this case the compiler

stephan (46/51) Nov 16 2010 Thanks for looking into the issue around here. Just three comments from

Sean Kelly (3/56) Nov 16 2010 I'm thinking of exposing atomicStore and atomicLoad in core.atomic so fo...

stephan (6/8) Nov 17 2010 Ignore my comment. I should have looked at the code in core.atomic

Michal Minich <michal.minich gmail.com> writes:

There is one question on SO which seems like a serious problem for atomic
ops.

http://stackoverflow.com/questions/4165149/compiler-optimization-breaks-
multi-threaded-code

in short:

shared uint cnt;
void atomicInc  ( ) { uint o; while ( !cas( &cnt, o, o + 1 ) ) o = cnt; }

is compiled with dmd -O to something like:

shared uint cnt;
void atomicInc  ( ) { while ( !cas( &cnt, cnt, cnt + 1 ) ) { } }

see the web page for details.

Nov 14 2010

Sean Kelly <sean invisibleduck.org> writes:

Michal Minich Wrote:

 There is one question on SO which seems like a serious problem for atomic
 ops.
 
 http://stackoverflow.com/questions/4165149/compiler-optimization-breaks-
 multi-threaded-code
 
 in short:
 
 shared uint cnt;
 void atomicInc  ( ) { uint o; while ( !cas( &cnt, o, o + 1 ) ) o = cnt; }
 
 is compiled with dmd -O to something like:
 
 shared uint cnt;
 void atomicInc  ( ) { while ( !cas( &cnt, cnt, cnt + 1 ) ) { } }
 
 see the web page for details.

What a mess.  DMD isn't supposed to optimize across asm blocks.

Nov 14 2010

Kagamin <spam here.lot> writes:

Sean Kelly Wrote:

 shared uint cnt;
 void atomicInc  ( ) { uint o; while ( !cas( &cnt, o, o + 1 ) ) o = cnt; }
 
 is compiled with dmd -O to something like:
 
 shared uint cnt;
 void atomicInc  ( ) { while ( !cas( &cnt, cnt, cnt + 1 ) ) { } }

 What a mess.  DMD isn't supposed to optimize across asm blocks.

There're no asm blocks in the code. It's a violated contract of shared data
access.

Nov 15 2010

Sean Kelly <sean invisibleduck.org> writes:

Kagamin <spam here.lot> wrote:
 Sean Kelly Wrote:
 
 shared uint cnt;
 void atomicInc  ( ) { uint o; while ( !cas( &cnt, o, o + 1 ) ) o =
 cnt; }
 
 is compiled with dmd -O to something like:
 
 shared uint cnt;
 void atomicInc  ( ) { while ( !cas( &cnt, cnt, cnt + 1 ) ) { } }

 What a mess.  DMD isn't supposed to optimize across asm blocks.

 
 There're no asm blocks in the code. It's a violated contract of shared
 data access.

cas() contains an asm block. Though I guess in this case the compiler
isn't actually optimizing across it. Does atomic!"+="(&cnt, 1) work
correctly?  I know the issue with shared would still have to be fixed,
but that code uses asm for the load as well, so it probably won't be
optimized the same way.

Nov 16 2010

stephan <none example.com> writes:

Am 16.11.2010 18:09, schrieb Sean Kelly:
 cas() contains an asm block. Though I guess in this case the compiler
 isn't actually optimizing across it. Does atomic!"+="(&cnt, 1) work
 correctly?  I know the issue with shared would still have to be fixed,
 but that code uses asm for the load as well, so it probably won't be
 optimized the same way.

Thanks for looking into the issue around here. Just three comments from 
my side, Sean.

Disclaimer: based on a couple of hours chasing a bug and not much D 
experience (but some optimizing C++ compiler experience - so the issue 
looked familiar :-) )

1) atomicOp is not concerned. You only read memory once in the function 
call. Whether from a local variable that was loaded from something 
global or directly from a global, doesn't really matter (except for 
timing, maybe).

2) You are right, the compiler seems to not optimize across asm 
statements. So, the example can be fixed with the following hack:
     void atomicInc  ( ) {
         uint o;
         while ( !cas( &cnt, o, o + 1 ) ) {
             asm { nop; } o = cnt;
         }
     }
This is however more brittle than it looks, because it is not always 
clear what "optimizing across an asm block". This version has the issue 
again:
     void atomicInc  ( ) {
         uint o = cnt;
         do {
             asm { nop; } o = cnt;
         } while ( !cas( &cnt, o, o + 1 ) )
     }
While this case might look somewhat obvious, I encountered some problems 
in more complex code, and finally went for the all-inline-assembler 
solution to be on the safe side.

3) During my debugging, I believe that I saw the optimizer not only 
re-ordering reads of shared variables, but also writes to shared 
variables. IIRC, my Dekker example on SO (which fails for the missing 
s/l/mfence instructions), also sports a re-ordering of the lines
         cnt++;
         turn2 = true; flag1 = false;
into
         turn2 = true;
         cnt++;
         flag1 = false;
which in this case is not really important, but might introduce another 
bug if I was prepared to live with the risk of starvation (and remove 
turn2). If the compiler would still re-order (haven't tested), cnt++ 
would be outside of the critical section.

Hope this helps & cheers,
Stephan

Nov 16 2010

Sean Kelly <sean invisibleduck.org> writes:

stephan Wrote:

 Am 16.11.2010 18:09, schrieb Sean Kelly:
 cas() contains an asm block. Though I guess in this case the compiler
 isn't actually optimizing across it. Does atomic!"+="(&cnt, 1) work
 correctly?  I know the issue with shared would still have to be fixed,
 but that code uses asm for the load as well, so it probably won't be
 optimized the same way.

 
 Thanks for looking into the issue around here. Just three comments from 
 my side, Sean.
 
 Disclaimer: based on a couple of hours chasing a bug and not much D 
 experience (but some optimizing C++ compiler experience - so the issue 
 looked familiar :-) )
 
 1) atomicOp is not concerned. You only read memory once in the function 
 call. Whether from a local variable that was loaded from something 
 global or directly from a global, doesn't really matter (except for 
 timing, maybe).

atomicOp uses a CAS loop for the RMW operations.  The load in asm code was
actually to bypass the synchronization generated by the compiler since it isn't
needed in this case, but I guess it helped anyway.  I'm wondering if I should
change the binary op to use asm too, so the compiler doesn't do anything evil
there until shared is implemented more fully.

 2) You are right, the compiler seems to not optimize across asm 
 statements. So, the example can be fixed with the following hack:
      void atomicInc  ( ) {
          uint o;
          while ( !cas( &cnt, o, o + 1 ) ) {
              asm { nop; } o = cnt;
          }
      }
 This is however more brittle than it looks, because it is not always 
 clear what "optimizing across an asm block". This version has the issue 
 again:
      void atomicInc  ( ) {
          uint o = cnt;
          do {
              asm { nop; } o = cnt;
          } while ( !cas( &cnt, o, o + 1 ) )
      }
 While this case might look somewhat obvious, I encountered some problems 
 in more complex code, and finally went for the all-inline-assembler 
 solution to be on the safe side.
 
 3) During my debugging, I believe that I saw the optimizer not only 
 re-ordering reads of shared variables, but also writes to shared 
 variables. IIRC, my Dekker example on SO (which fails for the missing 
 s/l/mfence instructions), also sports a re-ordering of the lines
          cnt++;
          turn2 = true; flag1 = false;
 into
          turn2 = true;
          cnt++;
          flag1 = false;
 which in this case is not really important, but might introduce another 
 bug if I was prepared to live with the risk of starvation (and remove 
 turn2). If the compiler would still re-order (haven't tested), cnt++ 
 would be outside of the critical section.

I'm thinking of exposing atomicStore and atomicLoad in core.atomic so folks
have something to use until the compiler work is taken care of.  At least then
your code could still look somewhat normal.

Nov 16 2010

stephan <none example.com> writes:

 atomicOp uses a CAS loop for the RMW operations.

Ignore my comment. I should have looked at the code in core.atomic 
before commenting. I just had one test case with atomicOp!("+=") that 
worked, and assumed that atomicOp!("+=") was implemented with "lock xadd".

 I'm thinking of exposing atomicStore and atomicLoad in core.atomic so folks
have something to use until the compiler work is taken care of.

This would be a good solution. Not only does it solve this ordering 
issue, but it is actually what I want to do with a shared variable while 
I am already within a critical section anyways.

Nov 17 2010

D Programming

C/C++ Programming

Other

digitalmars.D - Compiler optimization breaks multi-threaded code