
digitalmars.D - Floating Point + Threads?

reply dsimcha <dsimcha yahoo.com> writes:
I'm trying to debug an extremely strange bug whose symptoms appear in a 
std.parallelism example, though I'm not at all sure the root cause is in 
std.parallelism.  The bug report is at 
https://github.com/dsimcha/std.parallelism/issues/1#issuecomment-1011717 .

Basically, the example in question sums up all the elements of a lazy 
range (actually, std.algorithm.map) in parallel.  It uses 
taskPool.reduce, which divides the summation into work units to be 
executed in parallel.  When executed in parallel, the results of the 
summation are non-deterministic after about the 12th decimal place, even 
though all of the following properties are true:

1.  The work is divided into work units in a deterministic fashion.

2.  Within each work unit, the summation happens in a deterministic order.

3.  The final summation of the results of all the work units is done in 
a deterministic order.

4.  The smallest term in the summation is about 5e-10.  This means the 
difference across runs is about two orders of magnitude smaller than the 
smallest term.  It can't be a concurrency bug where some terms sometimes 
get skipped.

5.  The results for the individual tasks, not just the final summation, 
differ in the low-order bits.  Each task is executed in a single thread.

6.  The rounding mode is apparently the same in all of the threads.

7.  The bug appears even on machines with only one core, as long as the 
number of task pool threads is manually set to >0.  Since it's a single 
core machine, it can't be a low level memory model issue.

What could possibly cause such small, non-deterministic differences in 
floating point results, given everything above?  I'm just looking for 
suggestions here, as I don't even know where to start hunting for a bug 
like this.
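
For reference, the example in question boils down to the parallel pi computation from the std.parallelism documentation; a minimal sketch (the constants here are illustrative, the exact code is in the linked report):

import std.algorithm, std.parallelism, std.range;

void main() {
    immutable n = 1_000_000;
    immutable delta = 1.0 / n;

    real getTerm(int i) {
        immutable x = ( i - 0.5 ) * delta;
        return delta / ( 1.0 + x * x );
    }

    // taskPool.reduce splits iota(n) into work units, sums each unit in a
    // worker thread, then combines the per-unit results in a fixed order.
    immutable pi = 4.0 * taskPool.reduce!"a + b"(std.algorithm.map!getTerm(iota(n)));
}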
Apr 15 2011
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 4/15/11 10:22 PM, dsimcha wrote:
 I'm trying to debug an extremely strange bug whose symptoms appear in a
 std.parallelism example, though I'm not at all sure the root cause is in
 std.parallelism. The bug report is at
 https://github.com/dsimcha/std.parallelism/issues/1#issuecomment-1011717 .
Does the scheduling affect the summation order? Andrei
Apr 15 2011
next sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 4/15/2011 8:40 PM, Andrei Alexandrescu wrote:
 On 4/15/11 10:22 PM, dsimcha wrote:
 I'm trying to debug an extremely strange bug whose symptoms appear in a
 std.parallelism example, though I'm not at all sure the root cause is in
 std.parallelism. The bug report is at
 https://github.com/dsimcha/std.parallelism/issues/1#issuecomment-1011717 .
Does the scheduling affect the summation order?
That's a good thought. FP addition results can differ dramatically depending on associativity.
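
A tiny illustration of the point (values chosen so the effect is visible with 80-bit reals; these are not the thread's actual numbers):

import std.stdio;

void main() {
    real big = 1.0e19L, small = 1.0L;
    // At 1e19 the spacing between adjacent reals is 2.0, so the result
    // depends entirely on where the parentheses fall:
    writefln("%.1f", (big + small) + small);  // 10000000000000000000.0
    writefln("%.1f", big + (small + small));  // 10000000000000000002.0
}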
Apr 16 2011
parent reply Iain Buclaw <ibuclaw ubuntu.com> writes:
== Quote from Walter Bright (newshound2 digitalmars.com)'s article
 On 4/15/2011 8:40 PM, Andrei Alexandrescu wrote:
 On 4/15/11 10:22 PM, dsimcha wrote:
 I'm trying to debug an extremely strange bug whose symptoms appear in a
 std.parallelism example, though I'm not at all sure the root cause is in
 std.parallelism. The bug report is at
 https://github.com/dsimcha/std.parallelism/issues/1#issuecomment-1011717 .
Does the scheduling affect the summation order?
That's a good thought. FP addition results can differ dramatically depending on associativity.
And not to forget optimisations too. ;)
Apr 16 2011
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 4/16/2011 6:46 AM, Iain Buclaw wrote:
 == Quote from Walter Bright (newshound2 digitalmars.com)'s article
 That's a good thought. FP addition results can differ dramatically depending on
 associativity.
And not to forget optimisations too. ;)
The dmd optimizer is careful not to reorder evaluation in such a way as to change the results.
Apr 16 2011
parent reply Iain Buclaw <ibuclaw ubuntu.com> writes:
== Quote from Walter Bright (newshound2 digitalmars.com)'s article
 On 4/16/2011 6:46 AM, Iain Buclaw wrote:
 == Quote from Walter Bright (newshound2 digitalmars.com)'s article
 That's a good thought. FP addition results can differ dramatically depending on
 associativity.
And not to forget optimisations too. ;)
The dmd optimizer is careful not to reorder evaluation in such a way as to change the results.
And so it rightly shouldn't! I was thinking more of a case of FPU precision rather than ordering: as in, you get a different result computing on SSE in double precision mode on the one hand, and by computing on x87 in double precision then writing to a double variable in memory on the other.

Classic example (which could either be a bug or non-bug depending on your POV):

void test(double x, double y) {
    double y2 = x + 1.0;
    assert(y == y2);  // triggers with -O
}

void main() {
    double x = .012;
    double y = x + 1.0;
    test(x, y);
}

(With -O, y2 can stay in an 80-bit x87 register while y was rounded to 64 bits when it was stored, so the two compare unequal.)
Apr 16 2011
parent Walter Bright <newshound2 digitalmars.com> writes:
On 4/16/2011 11:43 AM, Iain Buclaw wrote:
 I was thinking more of a case of FPU precision rather than ordering: as in, you get a different result computing on SSE in double precision mode on the one hand, and by computing on x87 in double precision then writing to a double variable in memory.
You're right on that one.
Apr 16 2011
prev sibling parent dsimcha <dsimcha yahoo.com> writes:
On 4/15/2011 11:40 PM, Andrei Alexandrescu wrote:
 On 4/15/11 10:22 PM, dsimcha wrote:
 I'm trying to debug an extremely strange bug whose symptoms appear in a
 std.parallelism example, though I'm not at all sure the root cause is in
 std.parallelism. The bug report is at
 https://github.com/dsimcha/std.parallelism/issues/1#issuecomment-1011717
 .
Does the scheduling affect the summation order? Andrei
No. I realize floating point addition isn't associative, but unless there's some detail I'm forgetting about, the ordering is deterministic within each work unit and the ordering of the final summation is deterministic.
Apr 16 2011
prev sibling parent reply "Robert Jacques" <sandford jhu.edu> writes:
On Fri, 15 Apr 2011 23:22:04 -0400, dsimcha <dsimcha yahoo.com> wrote:

 I'm trying to debug an extremely strange bug whose symptoms appear in a  
 std.parallelism example, though I'm not at all sure the root cause is in  
 std.parallelism.  The bug report is at  
 https://github.com/dsimcha/std.parallelism/issues/1#issuecomment-1011717  
 .

 Basically, the example in question sums up all the elements of a lazy  
 range (actually, std.algorithm.map) in parallel.  It uses  
 taskPool.reduce, which divides the summation into work units to be  
 executed in parallel.  When executed in parallel, the results of the  
 summation are non-deterministic after about the 12th decimal place, even  
 though all of the following properties are true:

 1.  The work is divided into work units in a deterministic fashion.

 2.  Within each work unit, the summation happens in a deterministic  
 order.

 3.  The final summation of the results of all the work units is done in  
 a deterministic order.

 4.  The smallest term in the summation is about 5e-10.  This means the  
 difference across runs is about two orders of magnitude smaller than the  
 smallest term.  It can't be a concurrency bug where some terms sometimes  
 get skipped.

 5.  The results for the individual tasks, not just the final summation,  
 differ in the low-order bits.  Each task is executed in a single thread.

 6.  The rounding mode is apparently the same in all of the threads.

 7.  The bug appears even on machines with only one core, as long as the  
 number of task pool threads is manually set to >0.  Since it's a single  
 core machine, it can't be a low level memory model issue.

 What could possibly cause such small, non-deterministic differences in  
 floating point results, given everything above?  I'm just looking for  
 suggestions here, as I don't even know where to start hunting for a bug  
 like this.
Well, on one hand floating point math is not associative, and running sums have many known issues (I'd recommend looking up Kahan summation). On the other hand, it should be repeatably different.

As for suggestions? First and foremost, you should always add small to large, so try using iota(n-1,-1,-1) instead of iota(n). Not only should the answer be better, but if your error rate goes down, you have a good idea of where the problem is. I'd also try isolating your implementation's numerics from the underlying concurrency, i.e. use a task pool of 1 and don't let the host thread join it, so the entire job is done by one worker. The other thing to try is isolating/removing map and iota from the equation.
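
(For reference, a minimal sketch of the compensated summation being recommended here; kahanSum is a hypothetical helper, not a Phobos function:)

real kahanSum(const(real)[] terms) {
    real sum = 0.0;
    real c = 0.0;                 // running compensation: low-order bits lost so far
    foreach(t; terms) {
        immutable y = t - c;      // feed back the previously lost bits
        immutable next = sum + y;
        c = (next - sum) - y;     // algebraically zero; in FP it captures the rounding error
        sum = next;
    }
    return sum;
}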
Apr 15 2011
parent reply dsimcha <dsimcha yahoo.com> writes:
On 4/16/2011 2:09 AM, Robert Jacques wrote:
 On Fri, 15 Apr 2011 23:22:04 -0400, dsimcha <dsimcha yahoo.com> wrote:

 I'm trying to debug an extremely strange bug whose symptoms appear in
 a std.parallelism example, though I'm not at all sure the root cause
 is in std.parallelism. The bug report is at
 https://github.com/dsimcha/std.parallelism/issues/1#issuecomment-1011717
 .

 Basically, the example in question sums up all the elements of a lazy
 range (actually, std.algorithm.map) in parallel. It uses
 taskPool.reduce, which divides the summation into work units to be
 executed in parallel. When executed in parallel, the results of the
 summation are non-deterministic after about the 12th decimal place,
 even though all of the following properties are true:

 1. The work is divided into work units in a deterministic fashion.

 2. Within each work unit, the summation happens in a deterministic order.

 3. The final summation of the results of all the work units is done in
 a deterministic order.

 4. The smallest term in the summation is about 5e-10. This means the
 difference across runs is about two orders of magnitude smaller than
 the smallest term. It can't be a concurrency bug where some terms
 sometimes get skipped.

 5. The results for the individual tasks, not just the final summation,
 differ in the low-order bits. Each task is executed in a single thread.

 6. The rounding mode is apparently the same in all of the threads.

 7. The bug appears even on machines with only one core, as long as the
 number of task pool threads is manually set to >0. Since it's a single
 core machine, it can't be a low level memory model issue.

 What could possibly cause such small, non-deterministic differences in
 floating point results, given everything above? I'm just looking for
 suggestions here, as I don't even know where to start hunting for a
 bug like this.
Well, on one hand floating point math is not associative, and running sums have many known issues (I'd recommend looking up Kahan summation). On the other hand, it should be repeatably different.

As for suggestions? First and foremost, you should always add small to large, so try using iota(n-1,-1,-1) instead of iota(n). Not only should the answer be better, but if your error rate goes down, you have a good idea of where the problem is. I'd also try isolating your implementation's numerics from the underlying concurrency, i.e. use a task pool of 1 and don't let the host thread join it, so the entire job is done by one worker. The other thing to try is isolating/removing map and iota from the equation.
Right. For this example, though, assuming floating point math behaves like regular math is a good enough approximation. The issue isn't that the results aren't reasonably accurate. Furthermore, the results will always change slightly depending on how many work units you have. (I even warn in the documentation that floating point addition is not associative, though it is approximately associative in the well-behaved cases.)

My only concern is whether this non-determinism represents some deep underlying bug. For a given work unit allocation (work unit allocations are deterministic and only change when the number of threads changes or they're changed explicitly), I can't figure out how scheduling could change the results at all. If I could be sure that it wasn't a symptom of an underlying bug in std.parallelism, I wouldn't care about this small amount of numerical fuzz. Floating point math is always inexact and parallel summation by its nature can't be made to give the exact same results as serial summation.
Apr 16 2011
parent reply dsimcha <dsimcha yahoo.com> writes:
On 4/16/2011 10:11 AM, dsimcha wrote:
 My only concern is whether this non-determinism represents some deep
 underlying bug. For a given work unit allocation (work unit allocations
 are deterministic and only change when the number of threads changes or
 they're changed explicitly), I can't figure out how scheduling could
 change the results at all. If I could be sure that it wasn't a symptom
 of an underlying bug in std.parallelism, I wouldn't care about this
 small amount of numerical fuzz. Floating point math is always inexact
 and parallel summation by its nature can't be made to give the exact
 same results as serial summation.
Ok, it's definitely **NOT** a bug in std.parallelism. Here's a reduced test case that only uses core.thread, not std.parallelism. All it does is sum an array using std.algorithm.reduce from the main thread, then start a new thread to do the same thing and compare answers. At the beginning of the summation function the rounding mode is printed to verify that it's the same for both threads. The two threads get slightly different answers.

Just to thoroughly rule out a concurrency bug, I didn't even let the two threads execute concurrently. The main thread produces its result, then starts and immediately joins the second thread.

import std.algorithm, core.thread, std.stdio, core.stdc.fenv;

real sumRange(const(real)[] range) {
    writeln("Rounding mode: ", fegetround);  // 0 from both threads.
    return reduce!"a + b"(range);
}

void main() {
    immutable n = 1_000_000;
    immutable delta = 1.0 / n;
    auto terms = new real[1_000_000];

    foreach(i, ref term; terms) {
        immutable x = ( i - 0.5 ) * delta;
        term = delta / ( 1.0 + x * x ) * 1;
    }

    immutable res1 = sumRange(terms);
    writefln("%.19f", res1);

    real res2;
    auto t = new Thread( { res2 = sumRange(terms); } );
    t.start();
    t.join();
    writefln("%.19f", res2);
}

Output:

Rounding mode: 0
0.7853986633972191094
Rounding mode: 0
0.7853986633972437348
Apr 16 2011
next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 4/16/11 9:52 AM, dsimcha wrote:
 Output:
 Rounding mode: 0
 0.7853986633972191094
 Rounding mode: 0
 0.7853986633972437348
I think at this precision the difference may be in random bits. Probably you don't need to worry about it. Andrei
Apr 16 2011
parent reply dsimcha <dsimcha yahoo.com> writes:
On 4/16/2011 10:55 AM, Andrei Alexandrescu wrote:
 On 4/16/11 9:52 AM, dsimcha wrote:
 Output:
 Rounding mode: 0
 0.7853986633972191094
 Rounding mode: 0
 0.7853986633972437348
I think at this precision the difference may be in random bits. Probably you don't need to worry about it. Andrei
"random bits"? I am fully aware that these low order bits are numerical fuzz and are meaningless from a practical perspective. I am only concerned because I thought these bits are supposed to be deterministic even if they're meaningless. Now that I've ruled out a bug in std.parallelism, I'm wondering if it's a bug in druntime or DMD.
Apr 16 2011
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 4/16/11 9:59 AM, dsimcha wrote:
 On 4/16/2011 10:55 AM, Andrei Alexandrescu wrote:
 On 4/16/11 9:52 AM, dsimcha wrote:
 Output:
 Rounding mode: 0
 0.7853986633972191094
 Rounding mode: 0
 0.7853986633972437348
I think at this precision the difference may be in random bits. Probably you don't need to worry about it. Andrei
"random bits"? I am fully aware that these low order bits are numerical fuzz and are meaningless from a practical perspective. I am only concerned because I thought these bits are supposed to be deterministic even if they're meaningless. Now that I've ruled out a bug in std.parallelism, I'm wondering if it's a bug in druntime or DMD.
I seem to remember that some of the last bits printed in such a result are essentially arbitrary. I forget what could cause this. Andrei
Apr 16 2011
parent Walter Bright <newshound2 digitalmars.com> writes:
On 4/16/2011 9:52 AM, Andrei Alexandrescu wrote:
 I seem to remember that some of the last bits printed in such a result are essentially arbitrary. I forget what could cause this.
To see what the exact bits are, print using %A format. In any case, floating point bits are not random. They are completely deterministic.
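
For example, a quick sketch using the two values posted above (the L suffix keeps them as real literals):

import std.stdio;

void main() {
    real a = 0.7853986633972191094L;
    real b = 0.7853986633972437348L;
    // %A prints the exact significand in hexadecimal, so low-order bit
    // differences that %f obscures become unambiguous.
    writefln("%A", a);
    writefln("%A", b);
}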
Apr 16 2011
prev sibling next sibling parent reply Iain Buclaw <ibuclaw ubuntu.com> writes:
== Quote from dsimcha (dsimcha yahoo.com)'s article
 On 4/16/2011 10:11 AM, dsimcha wrote:
 Output:
 Rounding mode:  0
 0.7853986633972191094
 Rounding mode:  0
 0.7853986633972437348
This is not something I can replicate on my workstation.
Apr 16 2011
parent dsimcha <dsimcha yahoo.com> writes:
On 4/16/2011 11:16 AM, Iain Buclaw wrote:
 == Quote from dsimcha (dsimcha yahoo.com)'s article
 On 4/16/2011 10:11 AM, dsimcha wrote:
 Output:
 Rounding mode:  0
 0.7853986633972191094
 Rounding mode:  0
 0.7853986633972437348
This is not something I can replicate on my workstation.
Interesting. Since I know you're a Linux user, I fired up my Ubuntu VM and tried out my test case. I can't reproduce it either on Linux, only on Windows.
Apr 16 2011
prev sibling parent reply "JimBob" <jim bob.com> writes:
"dsimcha" <dsimcha yahoo.com> wrote in message 
news:iocalv$2h58$1 digitalmars.com...
 Output:
 Rounding mode:  0
 0.7853986633972191094
 Rounding mode:  0
 0.7853986633972437348
Could be something somewhere is getting truncated from real to double, which would mean 12 fewer bits of mantissa. Maybe the FPU is set to lower precision in one of the threads?
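
One quick way to check would be to read the x87 control word directly (a sketch for DMD's x86 inline asm; the PC-field masks follow Intel's encoding):

import std.stdio;

short fpuControlWord() {
    short cw;
    asm { fstcw cw; }  // store the x87 control word to memory
    return cw;
}

void main() {
    // Bits 8-9 are the precision-control (PC) field:
    // 0x0300 = 64-bit mantissa (real), 0x0200 = 53-bit (double), 0x0000 = 24-bit (float)
    writefln("PC field: 0x%04x", fpuControlWord() & 0x0300);
}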
Apr 16 2011
parent reply Timon Gehr <timon.gehr gmx.ch> writes:
 Could be something somewhere is getting truncated from real to double, which
 would mean 12 fewer bits of mantissa. Maybe the FPU is set to lower precision
 in one of the threads?
Yes indeed, this is a _Windows_ "bug". I have experienced this on Windows before: the main thread's FPU control word is initialized to lower FPU precision (64-bit doubles) by default by the OS, presumably to make FP calculations faster. However, when you start a new thread, the FPU will use the whole 80 bits for computation because, curiously, the FPU is not reconfigured for those.

Suggested fix: add asm { fninit; } to the beginning of your main function, and the difference between the two will be gone.

This would be a compatibility issue between DMD and Windows which effectively disables the "real" data type. You might want to file a bug report for druntime if my suggested fix works. (This would imply that the real type was basically identical to the double type on Windows all along!)
Apr 16 2011
parent reply dsimcha <dsimcha yahoo.com> writes:
On 4/16/2011 2:15 PM, Timon Gehr wrote:
 Could be something somewhere is getting truncated from real to double, which
 would mean 12 fewer bits of mantissa. Maybe the FPU is set to lower precision
 in one of the threads?
Yes indeed, this is a _Windows_ "bug". I have experienced this on Windows before: the main thread's FPU control word is initialized to lower FPU precision (64-bit doubles) by default by the OS, presumably to make FP calculations faster. However, when you start a new thread, the FPU will use the whole 80 bits for computation because, curiously, the FPU is not reconfigured for those.

Suggested fix: add asm { fninit; } to the beginning of your main function, and the difference between the two will be gone.

This would be a compatibility issue between DMD and Windows which effectively disables the "real" data type. You might want to file a bug report for druntime if my suggested fix works. (This would imply that the real type was basically identical to the double type on Windows all along!)
Close: If I add this instruction to the function for the new thread, the difference goes away. The relevant statement is:

auto t = new Thread( {
    asm { fninit; }
    res2 = sumRange(terms);
} );

At any rate, this is a **huge** WTF that should probably be fixed in druntime. Once I understand it a little better, I'll file a bug report.
Apr 16 2011
parent reply dsimcha <dsimcha yahoo.com> writes:
On 4/16/2011 2:24 PM, dsimcha wrote:
 On 4/16/2011 2:15 PM, Timon Gehr wrote:
 Could be something somewhere is getting truncated from real to double, which would mean 12 fewer bits of mantissa. Maybe the FPU is set to lower precision in one of the threads?
Yes indeed, this is a _Windows_ "bug". I have experienced this on Windows before: the main thread's FPU control word is initialized to lower FPU precision (64-bit doubles) by default by the OS, presumably to make FP calculations faster. However, when you start a new thread, the FPU will use the whole 80 bits for computation because, curiously, the FPU is not reconfigured for those.

Suggested fix: add asm { fninit; } to the beginning of your main function, and the difference between the two will be gone.

This would be a compatibility issue between DMD and Windows which effectively disables the "real" data type. You might want to file a bug report for druntime if my suggested fix works. (This would imply that the real type was basically identical to the double type on Windows all along!)
Close: If I add this instruction to the function for the new thread, the difference goes away. The relevant statement is:

auto t = new Thread( {
    asm { fninit; }
    res2 = sumRange(terms);
} );

At any rate, this is a **huge** WTF that should probably be fixed in druntime. Once I understand it a little better, I'll file a bug report.
Read up a little on what fninit does, etc. This is IMHO a druntime bug. Filed as http://d.puremagic.com/issues/show_bug.cgi?id=5847 .
Apr 16 2011
parent reply Sean Kelly <sean invisibleduck.org> writes:
On Apr 16, 2011, at 11:43 AM, dsimcha wrote:

 Close: If I add this instruction to the function for the new thread, the
 difference goes away. The relevant statement is:

 auto t = new Thread( {
     asm { fninit; }
     res2 = sumRange(terms);
 } );

 At any rate, this is a **huge** WTF that should probably be fixed in
 druntime. Once I understand it a little better, I'll file a bug report.

 Read up a little on what fninit does, etc. This is IMHO a druntime bug. Filed as http://d.puremagic.com/issues/show_bug.cgi?id=5847 .

Really a Windows bug that should be fixed in druntime :-) I know I'm splitting hairs. This will be fixed for the next release.
Apr 16 2011
parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 4/16/2011 11:51 AM, Sean Kelly wrote:
 On Apr 16, 2011, at 11:43 AM, dsimcha wrote:
 Close: If I add this instruction to the function for the new thread, the
 difference goes away. The relevant statement is:

 auto t = new Thread( { asm { fninit; } res2 = sumRange(terms); } );

 At any rate, this is a **huge** WTF that should probably be fixed in
 druntime. Once I understand it a little better, I'll file a bug report.
Read up a little on what fninit does, etc. This is IMHO a druntime bug. Filed as http://d.puremagic.com/issues/show_bug.cgi?id=5847 .
Really a Windows bug that should be fixed in druntime :-) I know I'm splitting hairs. This will be fixed for the next release.
The dmd startup code (actually the C startup code) does an fninit. I never thought about new thread starts. So, yeah, druntime should do an fninit on thread creation.
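
Until then, a user-level sketch of the workaround: wrap the thread body so fninit runs before any user code (fpuSafe is a hypothetical helper, not a druntime function):

import core.thread;

// Returns a delegate that normalizes the FPU state -- as the C startup
// code does for the main thread -- before running dg.
void delegate() fpuSafe(void delegate() dg) {
    return {
        version(D_InlineAsm_X86) asm { fninit; }
        dg();
    };
}

// Usage: auto t = new Thread(fpuSafe({ /* thread work */ }));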
Apr 16 2011
next sibling parent bearophile <bearophileHUGS lycos.com> writes:
Walter:

 The dmd startup code (actually the C startup code) does an fninit. I never 
 thought about new thread starts. So, yeah, druntime should do an fninit on 
 thread creation.
My congratulations to all the (mostly two) people involved in finding this bug and its causes :-) I'd like to see this module in Phobos. Bye, bearophile
Apr 16 2011
prev sibling parent reply "Robert Jacques" <sandford jhu.edu> writes:
On Sat, 16 Apr 2011 15:32:12 -0400, Walter Bright  
<newshound2 digitalmars.com> wrote:

 On 4/16/2011 11:51 AM, Sean Kelly wrote:
 On Apr 16, 2011, at 11:43 AM, dsimcha wrote:
 Close: If I add this instruction to the function for the new thread,  
 the
 difference goes away. The relevant statement is:

 auto t = new Thread( { asm { fninit; } res2 = sumRange(terms); } );

 At any rate, this is a **huge** WTF that should probably be fixed in
 druntime. Once I understand it a little better, I'll file a bug  
 report.
Read up a little on what fninit does, etc. This is IMHO a druntime bug. Filed as http://d.puremagic.com/issues/show_bug.cgi?id=5847 .
Really a Windows bug that should be fixed in druntime :-) I know I'm splitting hairs. This will be fixed for the next release.
The dmd startup code (actually the C startup code) does an fninit. I never thought about new thread starts. So, yeah, druntime should do an fninit on thread creation.
The documentation I've found on fninit seems to indicate it defaults to 64-bit precision, which means that by default we aren't seeing the benefit of D's reals. I'd much prefer 80-bit precision by default.
Apr 16 2011
parent reply Sean Kelly <sean invisibleduck.org> writes:
On Apr 16, 2011, at 1:02 PM, Robert Jacques wrote:

 On Sat, 16 Apr 2011 15:32:12 -0400, Walter Bright <newshound2 digitalmars.com> wrote:

 The dmd startup code (actually the C startup code) does an fninit. I never thought about new thread starts. So, yeah, druntime should do an fninit on thread creation.

 The documentation I've found on fninit seems to indicate it defaults to 64-bit precision, which means that by default we aren't seeing the benefit of D's reals. I'd much prefer 80-bit precision by default.

There is no option to set "80-bit precision" via the FPU control word. Section 8.1.5.2 of the Intel 64 SDM says the following:

"The precision-control (PC) field (bits 8 and 9 of the x87 FPU control word) determines the precision (64, 53, or 24 bits) of floating-point calculations made by the x87 FPU (see Table 8-2). The default precision is double extended precision, which uses the full 64-bit significand available with the double extended-precision floating-point format of the x87 FPU data registers. This setting is best suited for most applications, because it allows applications to take full advantage of the maximum precision available with the x87 FPU data registers."

So it sounds like finit/fninit does what we want.
Apr 19 2011
next sibling parent "Robert Jacques" <sandford jhu.edu> writes:
On Tue, 19 Apr 2011 14:18:46 -0400, Sean Kelly <sean invisibleduck.org>  
wrote:

 On Apr 16, 2011, at 1:02 PM, Robert Jacques wrote:

 On Sat, 16 Apr 2011 15:32:12 -0400, Walter Bright  
 <newshound2 digitalmars.com> wrote:
 The dmd startup code (actually the C startup code) does an fninit. I  
 never thought about new thread starts. So, yeah, druntime should do an  
 fninit on thread creation.
The documentation I've found on fninit seems to indicate it defaults to 64-bit precision, which means that by default we aren't seeing the benefit of D's reals. I'd much prefer 80-bit precision by default.
 There is no option to set "80-bit precision" via the FPU control word. Section 8.1.5.2 of the Intel 64 SDM says the following:

 "The precision-control (PC) field (bits 8 and 9 of the x87 FPU control word) determines the precision (64, 53, or 24 bits) of floating-point calculations made by the x87 FPU (see Table 8-2). The default precision is double extended precision, which uses the full 64-bit significand available with the double extended-precision floating-point format of the x87 FPU data registers. This setting is best suited for most applications, because it allows applications to take full advantage of the maximum precision available with the x87 FPU data registers."

 So it sounds like finit/fninit does what we want.

Yes, that sounds right. Thanks for clarifying.
Yes, that sounds right. Thanks for clarifying.
Apr 19 2011
prev sibling parent reply Don <nospam nospam.com> writes:
Sean Kelly wrote:
 On Apr 16, 2011, at 1:02 PM, Robert Jacques wrote:
 
 On Sat, 16 Apr 2011 15:32:12 -0400, Walter Bright <newshound2 digitalmars.com>
wrote:
 The dmd startup code (actually the C startup code) does an fninit. I never
thought about new thread starts. So, yeah, druntime should do an fninit on
thread creation.
The documentation I've found on fninit seems to indicate it defaults to 64-bit precision, which means that by default we aren't seeing the benefit of D's reals. I'd much prefer 80-bit precision by default.
There is no option to set "80-bit precision" via the FPU control word.
??? Yes there is.

enum PrecisionControl : short {
    PRECISION80 = 0x300,
    PRECISION64 = 0x200,
    PRECISION32 = 0x000
};

/** Set the number of bits of precision used by 'real'.
 *
 * Returns: the old precision.
 * This is not supported on all platforms.
 */
PrecisionControl reduceRealPrecision(PrecisionControl prec) {
    version(D_InlineAsm_X86) {
        short cont;
        asm {
            fstcw cont;
            mov CX, cont;
            mov AX, cont;
            and EAX, 0x0300;  // Form the return value
            and CX, 0xFCFF;
            or CX, prec;
            mov cont, CX;
            fldcw cont;
        }
    } else {
        assert(0, "Not yet supported");
    }
}
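
Usage would look something like this, assuming the function above (PRECISION64 selects the 53-bit significand, i.e. double-width rounding):

// Drop x87 arithmetic to double precision for a region of code:
auto old = reduceRealPrecision(PrecisionControl.PRECISION64);
scope(exit) reduceRealPrecision(old);  // restore the previous setting on exit
// ... computations here round significands to 53 bits ...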
Apr 20 2011
parent reply Sean Kelly <sean invisibleduck.org> writes:
On Apr 20, 2011, at 5:06 AM, Don wrote:

 Sean Kelly wrote:
 On Apr 16, 2011, at 1:02 PM, Robert Jacques wrote:
 On Sat, 16 Apr 2011 15:32:12 -0400, Walter Bright <newshound2 digitalmars.com> wrote:

 The dmd startup code (actually the C startup code) does an fninit. I never thought about new thread starts. So, yeah, druntime should do an fninit on thread creation.

 The documentation I've found on fninit seems to indicate it defaults to 64-bit precision, which means that by default we aren't seeing the benefit of D's reals. I'd much prefer 80-bit precision by default.

 There is no option to set "80-bit precision" via the FPU control word.

 ??? Yes there is.

 enum PrecisionControl : short {
    PRECISION80 = 0x300,
    PRECISION64 = 0x200,
    PRECISION32 = 0x000
 };

So has Intel deprecated 80-bit FPU support? Why do the docs for this say that 64-bit is the highest precision? And more importantly, does this mean that we should be setting the PC field explicitly instead of relying on fninit? The docs say that fninit initializes to 64-bit precision. Or is that inaccurate as well?
Apr 20 2011
parent reply "JimBob" <jim bob.com> writes:
"Sean Kelly" <sean invisibleduck.org> wrote in message 
news:mailman.3597.1303316625.4748.digitalmars-d puremagic.com...
On Apr 20, 2011, at 5:06 AM, Don wrote:

 Sean Kelly wrote:
 On Apr 16, 2011, at 1:02 PM, Robert Jacques wrote:
 On Sat, 16 Apr 2011 15:32:12 -0400, Walter Bright 
 <newshound2 digitalmars.com> wrote:
 The dmd startup code (actually the C startup code) does an fninit. I 
 never thought about new thread starts. So, yeah, druntime should do an 
 fninit on thread creation.
The documentation I've found on fninit seems to indicate it defaults to 64-bit precision, which means that by default we aren't seeing the benefit of D's reals. I'd much prefer 80-bit precision by default.
There is no option to set "80-bit precision" via the FPU control word.
 ??? Yes there is.

 enum PrecisionControl : short {
    PRECISION80 = 0x300,
    PRECISION64 = 0x200,
    PRECISION32 = 0x000
 };

 So has Intel deprecated 80-bit FPU support? Why do the docs for this say that 64-bit is the highest precision? And more importantly, does this mean that we should be setting the PC field explicitly instead of relying on fninit? The docs say that fninit initializes to 64-bit precision. Or is that inaccurate as well?
You misread the docs: they're talking about precision, which is just the size of the mantissa, not the full size of the floating point format. I.e...

 80-bit float = 64-bit precision
 64-bit float = 53-bit precision
 32-bit float = 24-bit precision
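
The same mapping is visible straight from the language's own properties; a quick check:

import std.stdio;

void main() {
    // Significand (mantissa) bits for each floating point width:
    writeln(real.mant_dig);    // 64 on x86 (80-bit real)
    writeln(double.mant_dig);  // 53
    writeln(float.mant_dig);   // 24
}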
Apr 20 2011
parent reply Sean Kelly <sean invisibleduck.org> writes:
On Apr 20, 2011, at 10:46 AM, JimBob wrote:

 "Sean Kelly" <sean invisibleduck.org> wrote in message
 news:mailman.3597.1303316625.4748.digitalmars-d puremagic.com...
 On Apr 20, 2011, at 5:06 AM, Don wrote:

 Sean Kelly wrote:
 On Apr 16, 2011, at 1:02 PM, Robert Jacques wrote:
 On Sat, 16 Apr 2011 15:32:12 -0400, Walter Bright
 <newshound2 digitalmars.com> wrote:

 The dmd startup code (actually the C startup code) does an fninit. I never thought about new thread starts. So, yeah, druntime should do an fninit on thread creation.

 The documentation I've found on fninit seems to indicate it defaults to 64-bit precision, which means that by default we aren't seeing the benefit of D's reals. I'd much prefer 80-bit precision by default.

 There is no option to set "80-bit precision" via the FPU control word.

 ??? Yes there is.

 enum PrecisionControl : short {
   PRECISION80 = 0x300,
   PRECISION64 = 0x200,
   PRECISION32 = 0x000
 };

 So has Intel deprecated 80-bit FPU support? Why do the docs for this say that 64-bit is the highest precision? And more importantly, does this mean that we should be setting the PC field explicitly instead of relying on fninit? The docs say that fninit initializes to 64-bit precision. Or is that inaccurate as well?

 You misread the docs: they're talking about precision, which is just the size of the mantissa, not the full size of the floating point format. I.e...

 80-bit float = 64-bit precision
 64-bit float = 53-bit precision
 32-bit float = 24-bit precision

Oops, you're right. So to summarize: fninit does what we want because it sets 64-bit precision, which is effectively 80-bit mode. Is this correct?
Apr 20 2011
parent Don <nospam nospam.com> writes:
Sean Kelly wrote:
 On Apr 20, 2011, at 10:46 AM, JimBob wrote:
 "Sean Kelly" <sean invisibleduck.org> wrote in message 
 news:mailman.3597.1303316625.4748.digitalmars-d puremagic.com...
 On Apr 20, 2011, at 5:06 AM, Don wrote:

 Sean Kelly wrote:
 On Apr 16, 2011, at 1:02 PM, Robert Jacques wrote:
 On Sat, 16 Apr 2011 15:32:12 -0400, Walter Bright 
 <newshound2 digitalmars.com> wrote:
 The dmd startup code (actually the C startup code) does an fninit. I 
 never thought about new thread starts. So, yeah, druntime should do an 
 fninit on thread creation.
The documentation I've found on fninit seems to indicate it defaults to 64-bit precision, which means that by default we aren't seeing the benefit of D's reals. I'd much prefer 80-bit precision by default.
There is no option to set "80-bit precision" via the FPU control word.
 ??? Yes there is.

 enum PrecisionControl : short {
    PRECISION80 = 0x300,
    PRECISION64 = 0x200,
    PRECISION32 = 0x000
 };

 So has Intel deprecated 80-bit FPU support? Why do the docs for this say that 64-bit is the highest precision? And more importantly, does this mean that we should be setting the PC field explicitly instead of relying on fninit? The docs say that fninit initializes to 64-bit precision. Or is that inaccurate as well?
 You misread the docs: they're talking about precision, which is just the size of the mantissa, not the full size of the floating point format. I.e...

 80-bit float = 64-bit precision
 64-bit float = 53-bit precision
 32-bit float = 24-bit precision
Oops, you're right. So to summarize: fninit does what we want because it sets 64-bit precision, which is effectively 80-bit mode. Is this correct?
Yes.
Apr 20 2011