
digitalmars.D - Floating Point + Threads?

reply dsimcha <dsimcha yahoo.com> writes:
I'm trying to debug an extremely strange bug whose symptoms appear in a 
std.parallelism example, though I'm not at all sure the root cause is in 
std.parallelism.  The bug report is at 
https://github.com/dsimcha/std.parallelism/issues/1#issuecomment-1011717 .

Basically, the example in question sums up all the elements of a lazy 
range (actually, std.algorithm.map) in parallel.  It uses 
taskPool.reduce, which divides the summation into work units to be 
executed in parallel.  When executed in parallel, the results of the 
summation are non-deterministic after about the 12th decimal place, even 
though all of the following properties are true:

1.  The work is divided into work units in a deterministic fashion.

2.  Within each work unit, the summation happens in a deterministic order.

3.  The final summation of the results of all the work units is done in 
a deterministic order.

4.  The smallest term in the summation is about 5e-10.  This means the 
difference across runs is about two orders of magnitude smaller than the 
smallest term.  It can't be a concurrency bug where some terms sometimes 
get skipped.

5.  The results for the individual tasks, not just the final summation, 
differ in the low-order bits.  Each task is executed in a single thread.

6.  The rounding mode is apparently the same in all of the threads.

7.  The bug appears even on machines with only one core, as long as the 
number of task pool threads is manually set to >0.  Since it's a single 
core machine, it can't be a low level memory model issue.

What could possibly cause such small, non-deterministic differences in 
floating point results, given everything above?  I'm just looking for 
suggestions here, as I don't even know where to start hunting for a bug 
like this.
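
For reference, the example in question boils down to the parallel pi computation from the std.parallelism documentation; a minimal sketch (the constants here are illustrative, the exact code is in the linked report):

import std.algorithm, std.parallelism, std.range;

void main() {
    immutable n = 1_000_000;
    immutable delta = 1.0 / n;

    real getTerm(int i) {
        immutable x = ( i - 0.5 ) * delta;
        return delta / ( 1.0 + x * x );
    }

    // taskPool.reduce splits iota(n) into work units, sums each unit in a
    // worker thread, then combines the per-unit results in a fixed order.
    immutable pi = 4.0 * taskPool.reduce!"a + b"(std.algorithm.map!getTerm(iota(n)));
}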
Apr 15 2011
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 4/15/11 10:22 PM, dsimcha wrote:
 I'm trying to debug an extremely strange bug whose symptoms appear in a
 std.parallelism example, though I'm not at all sure the root cause is in
 std.parallelism. The bug report is at
 https://github.com/dsimcha/std.parallelism/issues/1#issuecomment-1011717 .
Does the scheduling affect the summation order? Andrei
Apr 15 2011
next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 4/15/2011 8:40 PM, Andrei Alexandrescu wrote:
 On 4/15/11 10:22 PM, dsimcha wrote:
 I'm trying to debug an extremely strange bug whose symptoms appear in a
 std.parallelism example, though I'm not at all sure the root cause is in
 std.parallelism. The bug report is at
 https://github.com/dsimcha/std.parallelism/issues/1#issuecomment-1011717 .
Does the scheduling affect the summation order?
That's a good thought. FP addition results can differ dramatically depending on associativity.
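
A tiny illustration of the point (values chosen so the effect is visible with 80-bit reals; these are not the thread's actual numbers):

import std.stdio;

void main() {
    real big = 1.0e19L, small = 1.0L;
    // At 1e19 the spacing between adjacent reals is 2.0, so the result
    // depends entirely on where the parentheses fall:
    writefln("%.1f", (big + small) + small);  // 10000000000000000000.0
    writefln("%.1f", big + (small + small));  // 10000000000000000002.0
}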
Apr 16 2011
parent reply Iain Buclaw <ibuclaw ubuntu.com> writes:
== Quote from Walter Bright (newshound2 digitalmars.com)'s article
 On 4/15/2011 8:40 PM, Andrei Alexandrescu wrote:
 On 4/15/11 10:22 PM, dsimcha wrote:
 I'm trying to debug an extremely strange bug whose symptoms appear in a
 std.parallelism example, though I'm not at all sure the root cause is in
 std.parallelism. The bug report is at
 https://github.com/dsimcha/std.parallelism/issues/1#issuecomment-1011717 .
Does the scheduling affect the summation order?
That's a good thought. FP addition results can differ dramatically depending on associativity.
And not to forget optimisations too. ;)
Apr 16 2011
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 4/16/2011 6:46 AM, Iain Buclaw wrote:
 == Quote from Walter Bright (newshound2 digitalmars.com)'s article
 That's a good thought. FP addition results can differ dramatically depending on
 associativity.
And not to forget optimisations too. ;)
The dmd optimizer is careful not to reorder evaluation in such a way as to change the results.
Apr 16 2011
parent reply Iain Buclaw <ibuclaw ubuntu.com> writes:
== Quote from Walter Bright (newshound2 digitalmars.com)'s article
 On 4/16/2011 6:46 AM, Iain Buclaw wrote:
 == Quote from Walter Bright (newshound2 digitalmars.com)'s article
 That's a good thought. FP addition results can differ dramatically depending on
 associativity.
And not to forget optimisations too. ;)
The dmd optimizer is careful not to reorder evaluation in such a way as to change the results.
And so it rightly shouldn't! I was thinking more of a case of FPU precision rather than ordering: as in, you get a different result computing on SSE in double precision mode on the one hand, and by computing on x87 in double precision then writing to a double variable in memory on the other.

Classic example (which could either be a bug or non-bug depending on your POV):

void test(double x, double y) {
    double y2 = x + 1.0;
    assert(y == y2);  // triggers with -O
}

void main() {
    double x = .012;
    double y = x + 1.0;
    test(x, y);
}

(With -O, y2 can stay in an 80-bit x87 register while y was rounded to 64 bits when it was stored, so the two compare unequal.)
Apr 16 2011
parent Walter Bright <newshound2 digitalmars.com> writes:
On 4/16/2011 11:43 AM, Iain Buclaw wrote:
 I was thinking more of a case of FPU precision rather than ordering: as in, you get a different result computing on SSE in double precision mode on the one hand, and by computing on x87 in double precision then writing to a double variable in memory.
You're right on that one.
Apr 16 2011
prev sibling parent dsimcha <dsimcha yahoo.com> writes:
On 4/15/2011 11:40 PM, Andrei Alexandrescu wrote:
 On 4/15/11 10:22 PM, dsimcha wrote:
 I'm trying to debug an extremely strange bug whose symptoms appear in a
 std.parallelism example, though I'm not at all sure the root cause is in
 std.parallelism. The bug report is at
 https://github.com/dsimcha/std.parallelism/issues/1#issuecomment-1011717
 .
Does the scheduling affect the summation order? Andrei
No. I realize floating point addition isn't associative, but unless there's some detail I'm forgetting about, the ordering is deterministic within each work unit and the ordering of the final summation is deterministic.
Apr 16 2011
prev sibling parent reply "Robert Jacques" <sandford jhu.edu> writes:
On Fri, 15 Apr 2011 23:22:04 -0400, dsimcha <dsimcha yahoo.com> wrote:

 I'm trying to debug an extremely strange bug whose symptoms appear in a  
 std.parallelism example, though I'm not at all sure the root cause is in  
 std.parallelism.  The bug report is at  
 https://github.com/dsimcha/std.parallelism/issues/1#issuecomment-1011717  
 .

 Basically, the example in question sums up all the elements of a lazy  
 range (actually, std.algorithm.map) in parallel.  It uses  
 taskPool.reduce, which divides the summation into work units to be  
 executed in parallel.  When executed in parallel, the results of the  
 summation are non-deterministic after about the 12th decimal place, even  
 though all of the following properties are true:

 1.  The work is divided into work units in a deterministic fashion.

 2.  Within each work unit, the summation happens in a deterministic  
 order.

 3.  The final summation of the results of all the work units is done in  
 a deterministic order.

 4.  The smallest term in the summation is about 5e-10.  This means the  
 difference across runs is about two orders of magnitude smaller than the  
 smallest term.  It can't be a concurrency bug where some terms sometimes  
 get skipped.

 5.  The results for the individual tasks, not just the final summation,  
 differ in the low-order bits.  Each task is executed in a single thread.

 6.  The rounding mode is apparently the same in all of the threads.

 7.  The bug appears even on machines with only one core, as long as the  
 number of task pool threads is manually set to >0.  Since it's a single  
 core machine, it can't be a low level memory model issue.

 What could possibly cause such small, non-deterministic differences in  
 floating point results, given everything above?  I'm just looking for  
 suggestions here, as I don't even know where to start hunting for a bug  
 like this.
Well, on one hand floating point math is not associative, and running sums have many known issues (I'd recommend looking up Kahan summation). On the other hand, it should be repeatably different.

As for suggestions? First and foremost, you should always add small to large, so try using iota(n-1,-1,-1) instead of iota(n). Not only should the answer be better, but if your error rate goes down, you have a good idea of where the problem is. I'd also try isolating your implementation's numerics from the underlying concurrency, i.e. use a task pool of 1 and don't let the host thread join it, so the entire job is done by one worker. The other thing to try is isolating/removing map and iota from the equation.
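
(For reference, a minimal sketch of the compensated summation being recommended here; kahanSum is a hypothetical helper, not a Phobos function:)

real kahanSum(const(real)[] terms) {
    real sum = 0.0;
    real c = 0.0;                 // running compensation: low-order bits lost so far
    foreach(t; terms) {
        immutable y = t - c;      // feed back the previously lost bits
        immutable next = sum + y;
        c = (next - sum) - y;     // algebraically zero; in FP it captures the rounding error
        sum = next;
    }
    return sum;
}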
Apr 15 2011
parent reply dsimcha <dsimcha yahoo.com> writes:
On 4/16/2011 2:09 AM, Robert Jacques wrote:
 On Fri, 15 Apr 2011 23:22:04 -0400, dsimcha <dsimcha yahoo.com> wrote:

 I'm trying to debug an extremely strange bug whose symptoms appear in
 a std.parallelism example, though I'm not at all sure the root cause
 is in std.parallelism. The bug report is at
 https://github.com/dsimcha/std.parallelism/issues/1#issuecomment-1011717
 .

 Basically, the example in question sums up all the elements of a lazy
 range (actually, std.algorithm.map) in parallel. It uses
 taskPool.reduce, which divides the summation into work units to be
 executed in parallel. When executed in parallel, the results of the
 summation are non-deterministic after about the 12th decimal place,
 even though all of the following properties are true:

 1. The work is divided into work units in a deterministic fashion.

 2. Within each work unit, the summation happens in a deterministic order.

 3. The final summation of the results of all the work units is done in
 a deterministic order.

 4. The smallest term in the summation is about 5e-10. This means the
 difference across runs is about two orders of magnitude smaller than
 the smallest term. It can't be a concurrency bug where some terms
 sometimes get skipped.

 5. The results for the individual tasks, not just the final summation,
 differ in the low-order bits. Each task is executed in a single thread.

 6. The rounding mode is apparently the same in all of the threads.

 7. The bug appears even on machines with only one core, as long as the
 number of task pool threads is manually set to >0. Since it's a single
 core machine, it can't be a low level memory model issue.

 What could possibly cause such small, non-deterministic differences in
 floating point results, given everything above? I'm just looking for
 suggestions here, as I don't even know where to start hunting for a
 bug like this.
Well, on one hand floating point math is not associative, and running sums have many known issues (I'd recommend looking up Kahan summation). On the other hand, it should be repeatably different.

As for suggestions? First and foremost, you should always add small to large, so try using iota(n-1,-1,-1) instead of iota(n). Not only should the answer be better, but if your error rate goes down, you have a good idea of where the problem is. I'd also try isolating your implementation's numerics from the underlying concurrency, i.e. use a task pool of 1 and don't let the host thread join it, so the entire job is done by one worker. The other thing to try is isolating/removing map and iota from the equation.
Right. For this example, though, assuming floating point math behaves like regular math is a good enough approximation. The issue isn't that the results aren't reasonably accurate. Furthermore, the results will always change slightly depending on how many work units you have. (I even warn in the documentation that floating point addition is not associative, though it is approximately associative in the well-behaved cases.)

My only concern is whether this non-determinism represents some deep underlying bug. For a given work unit allocation (work unit allocations are deterministic and only change when the number of threads changes or they're changed explicitly), I can't figure out how scheduling could change the results at all. If I could be sure that it wasn't a symptom of an underlying bug in std.parallelism, I wouldn't care about this small amount of numerical fuzz. Floating point math is always inexact and parallel summation by its nature can't be made to give the exact same results as serial summation.
Apr 16 2011
parent reply dsimcha <dsimcha yahoo.com> writes:
On 4/16/2011 10:11 AM, dsimcha wrote:
 My only concern is whether this non-determinism represents some deep
 underlying bug. For a given work unit allocation (work unit allocations
 are deterministic and only change when the number of threads changes or
 they're changed explicitly), I can't figure out how scheduling could
 change the results at all. If I could be sure that it wasn't a symptom
 of an underlying bug in std.parallelism, I wouldn't care about this
 small amount of numerical fuzz. Floating point math is always inexact
 and parallel summation by its nature can't be made to give the exact
 same results as serial summation.
Ok, it's definitely **NOT** a bug in std.parallelism. Here's a reduced test case that only uses core.thread, not std.parallelism. All it does is sum an array using std.algorithm.reduce from the main thread, then start a new thread to do the same thing and compare answers. At the beginning of the summation function the rounding mode is printed to verify that it's the same for both threads. The two threads get slightly different answers.

Just to thoroughly rule out a concurrency bug, I didn't even let the two threads execute concurrently. The main thread produces its result, then starts and immediately joins the second thread.

import std.algorithm, core.thread, std.stdio, core.stdc.fenv;

real sumRange(const(real)[] range) {
    writeln("Rounding mode: ", fegetround);  // 0 from both threads.
    return reduce!"a + b"(range);
}

void main() {
    immutable n = 1_000_000;
    immutable delta = 1.0 / n;
    auto terms = new real[1_000_000];

    foreach(i, ref term; terms) {
        immutable x = ( i - 0.5 ) * delta;
        term = delta / ( 1.0 + x * x ) * 1;
    }

    immutable res1 = sumRange(terms);
    writefln("%.19f", res1);

    real res2;
    auto t = new Thread( { res2 = sumRange(terms); } );
    t.start();
    t.join();
    writefln("%.19f", res2);
}

Output:

Rounding mode: 0
0.7853986633972191094
Rounding mode: 0
0.7853986633972437348
Apr 16 2011
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 4/16/11 9:52 AM, dsimcha wrote:
 Output:
 Rounding mode: 0
 0.7853986633972191094
 Rounding mode: 0
 0.7853986633972437348
I think at this precision the difference may be in random bits. Probably you don't need to worry about it. Andrei
Apr 16 2011
parent reply dsimcha <dsimcha yahoo.com> writes:
On 4/16/2011 10:55 AM, Andrei Alexandrescu wrote:
 On 4/16/11 9:52 AM, dsimcha wrote:
 Output:
 Rounding mode: 0
 0.7853986633972191094
 Rounding mode: 0
 0.7853986633972437348
I think at this precision the difference may be in random bits. Probably you don't need to worry about it. Andrei
"random bits"? I am fully aware that these low order bits are numerical fuzz and are meaningless from a practical perspective. I am only concerned because I thought these bits are supposed to be deterministic even if they're meaningless. Now that I've ruled out a bug in std.parallelism, I'm wondering if it's a bug in druntime or DMD.
Apr 16 2011
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 4/16/11 9:59 AM, dsimcha wrote:
 On 4/16/2011 10:55 AM, Andrei Alexandrescu wrote:
 On 4/16/11 9:52 AM, dsimcha wrote:
 Output:
 Rounding mode: 0
 0.7853986633972191094
 Rounding mode: 0
 0.7853986633972437348
I think at this precision the difference may be in random bits. Probably you don't need to worry about it. Andrei
"random bits"? I am fully aware that these low order bits are numerical fuzz and are meaningless from a practical perspective. I am only concerned because I thought these bits are supposed to be deterministic even if they're meaningless. Now that I've ruled out a bug in std.parallelism, I'm wondering if it's a bug in druntime or DMD.
I seem to remember that some of the last bits printed in such a result are essentially arbitrary. I forget what could cause this. Andrei
Apr 16 2011
parent Walter Bright <newshound2 digitalmars.com> writes:
On 4/16/2011 9:52 AM, Andrei Alexandrescu wrote:
 I seem to remember that some of the last bits printed in such a result are essentially arbitrary. I forget what could cause this.
To see what the exact bits are, print using %A format. In any case, floating point bits are not random. They are completely deterministic.
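
For example, a quick sketch using the two values posted above (the L suffix keeps them as real literals):

import std.stdio;

void main() {
    real a = 0.7853986633972191094L;
    real b = 0.7853986633972437348L;
    // %A prints the exact significand in hexadecimal, so low-order bit
    // differences that %f obscures become unambiguous.
    writefln("%A", a);
    writefln("%A", b);
}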
Apr 16 2011
prev sibling next sibling parent reply Iain Buclaw <ibuclaw ubuntu.com> writes:
== Quote from dsimcha (dsimcha yahoo.com)'s article
 On 4/16/2011 10:11 AM, dsimcha wrote:
 Output:
 Rounding mode:  0
 0.7853986633972191094
 Rounding mode:  0
 0.7853986633972437348
This is not something I can replicate on my workstation.
Apr 16 2011
parent dsimcha <dsimcha yahoo.com> writes:
On 4/16/2011 11:16 AM, Iain Buclaw wrote:
 == Quote from dsimcha (dsimcha yahoo.com)'s article
 On 4/16/2011 10:11 AM, dsimcha wrote:
 Output:
 Rounding mode:  0
 0.7853986633972191094
 Rounding mode:  0
 0.7853986633972437348
This is not something I can replicate on my workstation.
Interesting. Since I know you're a Linux user, I fired up my Ubuntu VM and tried out my test case. I can't reproduce it either on Linux, only on Windows.
Apr 16 2011
prev sibling parent reply "JimBob" <jim bob.com> writes:
"dsimcha" <dsimcha yahoo.com> wrote in message 
news:iocalv$2h58$1 digitalmars.com...
 Output:
 Rounding mode:  0
 0.7853986633972191094
 Rounding mode:  0
 0.7853986633972437348
Could be something somewhere is getting truncated from real to double, which would mean 12 fewer bits of mantissa. Maybe the FPU is set to lower precision in one of the threads?
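
One quick way to check would be to read the x87 control word directly (a sketch for DMD's x86 inline asm; the PC-field masks follow Intel's encoding):

import std.stdio;

short fpuControlWord() {
    short cw;
    asm { fstcw cw; }  // store the x87 control word to memory
    return cw;
}

void main() {
    // Bits 8-9 are the precision-control (PC) field:
    // 0x0300 = 64-bit mantissa (real), 0x0200 = 53-bit (double), 0x0000 = 24-bit (float)
    writefln("PC field: 0x%04x", fpuControlWord() & 0x0300);
}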
Apr 16 2011
parent reply Timon Gehr <timon.gehr gmx.ch> writes:
 Could be something somewhere is getting truncated from real to double, which
 would mean 12 fewer bits of mantissa. Maybe the FPU is set to lower precision
 in one of the threads?
Yes indeed, this is a _Windows_ "bug". I have experienced this on Windows before: the main thread's FPU control word is initialized to lower FPU precision (64-bit doubles) by default by the OS, presumably to make FP calculations faster. However, when you start a new thread, the FPU will use the whole 80 bits for computation because, curiously, the FPU is not reconfigured for those.

Suggested fix: add asm { fninit; } to the beginning of your main function, and the difference between the two will be gone.

This would be a compatibility issue between DMD and Windows which effectively disables the "real" data type. You might want to file a bug report for druntime if my suggested fix works. (This would imply that the real type was basically identical to the double type on Windows all along!)
Apr 16 2011
parent reply dsimcha <dsimcha yahoo.com> writes:
On 4/16/2011 2:15 PM, Timon Gehr wrote:
 Could be something somewhere is getting truncated from real to double, which
 would mean 12 fewer bits of mantissa. Maybe the FPU is set to lower precision
 in one of the threads?
Yes indeed, this is a _Windows_ "bug". I have experienced this on Windows before: the main thread's FPU control word is initialized to lower FPU precision (64-bit doubles) by default by the OS, presumably to make FP calculations faster. However, when you start a new thread, the FPU will use the whole 80 bits for computation because, curiously, the FPU is not reconfigured for those.

Suggested fix: add asm { fninit; } to the beginning of your main function, and the difference between the two will be gone.

This would be a compatibility issue between DMD and Windows which effectively disables the "real" data type. You might want to file a bug report for druntime if my suggested fix works. (This would imply that the real type was basically identical to the double type on Windows all along!)
Close: If I add this instruction to the function for the new thread, the difference goes away. The relevant statement is:

auto t = new Thread( {
    asm { fninit; }
    res2 = sumRange(terms);
} );

At any rate, this is a **huge** WTF that should probably be fixed in druntime. Once I understand it a little better, I'll file a bug report.
Apr 16 2011
parent reply dsimcha <dsimcha yahoo.com> writes:
On 4/16/2011 2:24 PM, dsimcha wrote:
 On 4/16/2011 2:15 PM, Timon Gehr wrote:
 Could be something somewhere is getting truncated from real to double, which would mean 12 fewer bits of mantissa. Maybe the FPU is set to lower precision in one of the threads?
Yes indeed, this is a _Windows_ "bug". I have experienced this on Windows before: the main thread's FPU control word is initialized to lower FPU precision (64-bit doubles) by default by the OS, presumably to make FP calculations faster. However, when you start a new thread, the FPU will use the whole 80 bits for computation because, curiously, the FPU is not reconfigured for those.

Suggested fix: add asm { fninit; } to the beginning of your main function, and the difference between the two will be gone.

This would be a compatibility issue between DMD and Windows which effectively disables the "real" data type. You might want to file a bug report for druntime if my suggested fix works. (This would imply that the real type was basically identical to the double type on Windows all along!)
Close: If I add this instruction to the function for the new thread, the difference goes away. The relevant statement is:

auto t = new Thread( {
    asm { fninit; }
    res2 = sumRange(terms);
} );

At any rate, this is a **huge** WTF that should probably be fixed in druntime. Once I understand it a little better, I'll file a bug report.
Read up a little on what fninit does, etc. This is IMHO a druntime bug. Filed as http://d.puremagic.com/issues/show_bug.cgi?id=5847 .
Apr 16 2011
parent reply Sean Kelly <sean invisibleduck.org> writes:
On Apr 16, 2011, at 11:43 AM, dsimcha wrote:

 Close: If I add this instruction to the function for the new thread, the
 difference goes away. The relevant statement is:

 auto t = new Thread( {
     asm { fninit; }
     res2 = sumRange(terms);
 } );

 At any rate, this is a **huge** WTF that should probably be fixed in
 druntime. Once I understand it a little better, I'll file a bug report.

 Read up a little on what fninit does, etc. This is IMHO a druntime bug. Filed as http://d.puremagic.com/issues/show_bug.cgi?id=5847 .

Really a Windows bug that should be fixed in druntime :-) I know I'm splitting hairs. This will be fixed for the next release.
Apr 16 2011
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 4/16/2011 11:51 AM, Sean Kelly wrote:
 On Apr 16, 2011, at 11:43 AM, dsimcha wrote:
 Close: If I add this instruction to the function for the new thread, the
 difference goes away. The relevant statement is:

 auto t = new Thread( { asm { fninit; } res2 = sumRange(terms); } );

 At any rate, this is a **huge** WTF that should probably be fixed in
 druntime. Once I understand it a little better, I'll file a bug report.
Read up a little on what fninit does, etc. This is IMHO a druntime bug. Filed as http://d.puremagic.com/issues/show_bug.cgi?id=5847 .
Really a Windows bug that should be fixed in druntime :-) I know I'm splitting hairs. This will be fixed for the next release.
The dmd startup code (actually the C startup code) does an fninit. I never thought about new thread starts. So, yeah, druntime should do an fninit on thread creation.
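
Until then, a user-level sketch of the workaround: wrap the thread body so fninit runs before any user code (fpuSafe is a hypothetical helper, not a druntime function):

import core.thread;

// Returns a delegate that normalizes the FPU state -- as the C startup
// code does for the main thread -- before running dg.
void delegate() fpuSafe(void delegate() dg) {
    return {
        version(D_InlineAsm_X86) asm { fninit; }
        dg();
    };
}

// Usage: auto t = new Thread(fpuSafe({ /* thread work */ }));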
Apr 16 2011
next sibling parent bearophile <bearophileHUGS lycos.com> writes:
Walter:

 The dmd startup code (actually the C startup code) does an fninit. I never 
 thought about new thread starts. So, yeah, druntime should do an fninit on 
 thread creation.
My congratulations to all the (mostly two) people involved in finding this bug and its causes :-) I'd like to see this module in Phobos. Bye, bearophile
Apr 16 2011
prev sibling parent reply "Robert Jacques" <sandford jhu.edu> writes:
On Sat, 16 Apr 2011 15:32:12 -0400, Walter Bright  
<newshound2 digitalmars.com> wrote:

 On 4/16/2011 11:51 AM, Sean Kelly wrote:
 On Apr 16, 2011, at 11:43 AM, dsimcha wrote:
 Close: If I add this instruction to the function for the new thread,  
 the
 difference goes away. The relevant statement is:

 auto t = new Thread( { asm { fninit; } res2 = sumRange(terms); } );

 At any rate, this is a **huge** WTF that should probably be fixed in
 druntime. Once I understand it a little better, I'll file a bug  
 report.
Read up a little on what fninit does, etc. This is IMHO a druntime bug. Filed as http://d.puremagic.com/issues/show_bug.cgi?id=5847 .
Really a Windows bug that should be fixed in druntime :-) I know I'm splitting hairs. This will be fixed for the next release.
The dmd startup code (actually the C startup code) does an fninit. I never thought about new thread starts. So, yeah, druntime should do an fninit on thread creation.
The documentation I've found on fninit seems to indicate it defaults to 64-bit precision, which means that by default we aren't seeing the benefit of D's reals. I'd much prefer 80-bit precision by default.
Apr 16 2011
parent reply Sean Kelly <sean invisibleduck.org> writes:
On Apr 16, 2011, at 1:02 PM, Robert Jacques wrote:

 On Sat, 16 Apr 2011 15:32:12 -0400, Walter Bright <newshound2 digitalmars.com> wrote:

 The dmd startup code (actually the C startup code) does an fninit. I never thought about new thread starts. So, yeah, druntime should do an fninit on thread creation.

 The documentation I've found on fninit seems to indicate it defaults to 64-bit precision, which means that by default we aren't seeing the benefit of D's reals. I'd much prefer 80-bit precision by default.

There is no option to set "80-bit precision" via the FPU control word. Section 8.1.5.2 of the Intel 64 SDM says the following:

"The precision-control (PC) field (bits 8 and 9 of the x87 FPU control word) determines the precision (64, 53, or 24 bits) of floating-point calculations made by the x87 FPU (see Table 8-2). The default precision is double extended precision, which uses the full 64-bit significand available with the double extended-precision floating-point format of the x87 FPU data registers. This setting is best suited for most applications, because it allows applications to take full advantage of the maximum precision available with the x87 FPU data registers."

So it sounds like finit/fninit does what we want.
Apr 19 2011
next sibling parent "Robert Jacques" <sandford jhu.edu> writes:
On Tue, 19 Apr 2011 14:18:46 -0400, Sean Kelly <sean invisibleduck.org>  
wrote:

 On Apr 16, 2011, at 1:02 PM, Robert Jacques wrote:

 On Sat, 16 Apr 2011 15:32:12 -0400, Walter Bright  
 <newshound2 digitalmars.com> wrote:
 The dmd startup code (actually the C startup code) does an fninit. I  
 never thought about new thread starts. So, yeah, druntime should do an  
 fninit on thread creation.
The documentation I've found on fninit seems to indicate it defaults to 64-bit precision, which means that by default we aren't seeing the benefit of D's reals. I'd much prefer 80-bit precision by default.
 There is no option to set "80-bit precision" via the FPU control word. Section 8.1.5.2 of the Intel 64 SDM says the following:

 "The precision-control (PC) field (bits 8 and 9 of the x87 FPU control word) determines the precision (64, 53, or 24 bits) of floating-point calculations made by the x87 FPU (see Table 8-2). The default precision is double extended precision, which uses the full 64-bit significand available with the double extended-precision floating-point format of the x87 FPU data registers. This setting is best suited for most applications, because it allows applications to take full advantage of the maximum precision available with the x87 FPU data registers."

 So it sounds like finit/fninit does what we want.

Yes, that sounds right. Thanks for clarifying.
Yes, that sounds right. Thanks for clarifying.
Apr 19 2011
prev sibling parent reply Don <nospam nospam.com> writes:
Sean Kelly wrote:
 On Apr 16, 2011, at 1:02 PM, Robert Jacques wrote:
 
 On Sat, 16 Apr 2011 15:32:12 -0400, Walter Bright <newshound2 digitalmars.com>
wrote:
 The dmd startup code (actually the C startup code) does an fninit. I never
thought about new thread starts. So, yeah, druntime should do an fninit on
thread creation.
The documentation I've found on fninit seems to indicate it defaults to 64-bit precision, which means that by default we aren't seeing the benefit of D's reals. I'd much prefer 80-bit precision by default.
There is no option to set "80-bit precision" via the FPU control word.
??? Yes there is.

enum PrecisionControl : short {
    PRECISION80 = 0x300,
    PRECISION64 = 0x200,
    PRECISION32 = 0x000
};

/** Set the number of bits of precision used by 'real'.
 *
 * Returns: the old precision.
 * This is not supported on all platforms.
 */
PrecisionControl reduceRealPrecision(PrecisionControl prec) {
    version(D_InlineAsm_X86) {
        short cont;
        asm {
            fstcw cont;
            mov CX, cont;
            mov AX, cont;
            and EAX, 0x0300;  // Form the return value
            and CX, 0xFCFF;
            or CX, prec;
            mov cont, CX;
            fldcw cont;
        }
    } else {
        assert(0, "Not yet supported");
    }
}
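
Usage would look something like this, assuming the function above (PRECISION64 selects the 53-bit significand, i.e. double-width rounding):

// Drop x87 arithmetic to double precision for a region of code:
auto old = reduceRealPrecision(PrecisionControl.PRECISION64);
scope(exit) reduceRealPrecision(old);  // restore the previous setting on exit
// ... computations here round significands to 53 bits ...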
Apr 20 2011
parent reply Sean Kelly <sean invisibleduck.org> writes:
On Apr 20, 2011, at 5:06 AM, Don wrote:

 Sean Kelly wrote:
 On Apr 16, 2011, at 1:02 PM, Robert Jacques wrote:
 On Sat, 16 Apr 2011 15:32:12 -0400, Walter Bright <newshound2 digitalmars.com> wrote:

 The dmd startup code (actually the C startup code) does an fninit. I never thought about new thread starts. So, yeah, druntime should do an fninit on thread creation.

 The documentation I've found on fninit seems to indicate it defaults to 64-bit precision, which means that by default we aren't seeing the benefit of D's reals. I'd much prefer 80-bit precision by default.

 There is no option to set "80-bit precision" via the FPU control word.

 ??? Yes there is.

 enum PrecisionControl : short {
    PRECISION80 = 0x300,
    PRECISION64 = 0x200,
    PRECISION32 = 0x000
 };

So has Intel deprecated 80-bit FPU support? Why do the docs for this say that 64-bit is the highest precision? And more importantly, does this mean that we should be setting the PC field explicitly instead of relying on fninit? The docs say that fninit initializes to 64-bit precision. Or is that inaccurate as well?
Apr 20 2011
parent reply "JimBob" <jim bob.com> writes:
"Sean Kelly" <sean invisibleduck.org> wrote in message 
news:mailman.3597.1303316625.4748.digitalmars-d puremagic.com...
On Apr 20, 2011, at 5:06 AM, Don wrote:

 Sean Kelly wrote:
 On Apr 16, 2011, at 1:02 PM, Robert Jacques wrote:
 On Sat, 16 Apr 2011 15:32:12 -0400, Walter Bright 
 <newshound2 digitalmars.com> wrote:
 The dmd startup code (actually the C startup code) does an fninit. I 
 never thought about new thread starts. So, yeah, druntime should do an 
 fninit on thread creation.
The documentation I've found on fninit seems to indicate it defaults to 64-bit precision, which means that by default we aren't seeing the benefit of D's reals. I'd much prefer 80-bit precision by default.
There is no option to set "80-bit precision" via the FPU control word.
 ??? Yes there is.

 enum PrecisionControl : short {
    PRECISION80 = 0x300,
    PRECISION64 = 0x200,
    PRECISION32 = 0x000
 };

 So has Intel deprecated 80-bit FPU support? Why do the docs for this say that 64-bit is the highest precision? And more importantly, does this mean that we should be setting the PC field explicitly instead of relying on fninit? The docs say that fninit initializes to 64-bit precision. Or is that inaccurate as well?
You misread the docs: they're talking about precision, which is just the size of the mantissa, not the full size of the floating point format. I.e...

 80-bit float = 64-bit precision
 64-bit float = 53-bit precision
 32-bit float = 24-bit precision
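
The same mapping is visible straight from the language's own properties; a quick check:

import std.stdio;

void main() {
    // Significand (mantissa) bits for each floating point width:
    writeln(real.mant_dig);    // 64 on x86 (80-bit real)
    writeln(double.mant_dig);  // 53
    writeln(float.mant_dig);   // 24
}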
Apr 20 2011
parent reply Sean Kelly <sean invisibleduck.org> writes:
On Apr 20, 2011, at 10:46 AM, JimBob wrote:

 "Sean Kelly" <sean invisibleduck.org> wrote in message
 news:mailman.3597.1303316625.4748.digitalmars-d puremagic.com...
 On Apr 20, 2011, at 5:06 AM, Don wrote:

 Sean Kelly wrote:
 On Apr 16, 2011, at 1:02 PM, Robert Jacques wrote:
 On Sat, 16 Apr 2011 15:32:12 -0400, Walter Bright
 <newshound2 digitalmars.com> wrote:

 The dmd startup code (actually the C startup code) does an fninit. I never thought about new thread starts. So, yeah, druntime should do an fninit on thread creation.

 The documentation I've found on fninit seems to indicate it defaults to 64-bit precision, which means that by default we aren't seeing the benefit of D's reals. I'd much prefer 80-bit precision by default.

 There is no option to set "80-bit precision" via the FPU control word.

 ??? Yes there is.

 enum PrecisionControl : short {
   PRECISION80 = 0x300,
   PRECISION64 = 0x200,
   PRECISION32 = 0x000
 };

 So has Intel deprecated 80-bit FPU support? Why do the docs for this say that 64-bit is the highest precision? And more importantly, does this mean that we should be setting the PC field explicitly instead of relying on fninit? The docs say that fninit initializes to 64-bit precision. Or is that inaccurate as well?

 You misread the docs: they're talking about precision, which is just the size of the mantissa, not the full size of the floating point format. I.e...

 80-bit float = 64-bit precision
 64-bit float = 53-bit precision
 32-bit float = 24-bit precision

Oops, you're right. So to summarize: fninit does what we want because it sets 64-bit precision, which is effectively 80-bit mode. Is this correct?
Apr 20 2011
parent Don <nospam nospam.com> writes:
Sean Kelly wrote:
 On Apr 20, 2011, at 10:46 AM, JimBob wrote:
 "Sean Kelly" <sean invisibleduck.org> wrote in message 
 news:mailman.3597.1303316625.4748.digitalmars-d puremagic.com...
 On Apr 20, 2011, at 5:06 AM, Don wrote:

 Sean Kelly wrote:
 On Apr 16, 2011, at 1:02 PM, Robert Jacques wrote:
 On Sat, 16 Apr 2011 15:32:12 -0400, Walter Bright 
 <newshound2 digitalmars.com> wrote:
 The dmd startup code (actually the C startup code) does an fninit. I 
 never thought about new thread starts. So, yeah, druntime should do an 
 fninit on thread creation.
The documentation I've found on fninit seems to indicate it defaults to 64-bit precision, which means that by default we aren't seeing the benefit of D's reals. I'd much prefer 80-bit precision by default.
There is no option to set "80-bit precision" via the FPU control word.
 ??? Yes there is.

 enum PrecisionControl : short {
    PRECISION80 = 0x300,
    PRECISION64 = 0x200,
    PRECISION32 = 0x000
 };

 So has Intel deprecated 80-bit FPU support? Why do the docs for this say that 64-bit is the highest precision? And more importantly, does this mean that we should be setting the PC field explicitly instead of relying on fninit? The docs say that fninit initializes to 64-bit precision. Or is that inaccurate as well?
 You misread the docs: they're talking about precision, which is just the size of the mantissa, not the full size of the floating point format. I.e...

 80-bit float = 64-bit precision
 64-bit float = 53-bit precision
 32-bit float = 24-bit precision
Oops, you're right. So to summarize: fninit does what we want because it sets 64-bit precision, which is effectively 80-bit mode. Is this correct?
Yes.
Apr 20 2011