digitalmars.D - Who Ordered Memory Fences on an x86?
- Walter Bright (3/3) Nov 05 2008 Another one of Bartosz' great blogs:
- Nick Sabalausky (14/17) Nov 05 2008 Call me a grumpy old fart, but I'd be happy just tossing fences in
- Russell Lewis (9/15) Nov 05 2008 Ok, I'm not going to go as far as that. :) But I've heard that Intel
- Nick Sabalausky (14/29) Nov 05 2008 Yea, I've been hearing a lot about that. Be interesting to see what happ...
- BCS (3/18) Nov 06 2008 One option people are throwing around in this is the Field Programmable ...
- Nick Sabalausky (4/22) Nov 06 2008 Oh yes, the processor that can rewire itself on-the-fly. A very interest...
- Walter Bright (32/45) Nov 05 2008 Bartosz, Andrei, Sean and I have discussed this at length. My personal
- Nick Sabalausky (17/62) Nov 05 2008 From reading the article, I was under the impression that not using expl...
- Walter Bright (3/9) Nov 06 2008 It's the way it is for performance reasons.
- Russell Lewis (8/39) Nov 06 2008 In theory, you can write a portable program that uses explicit fences.
- Russell Lewis (3/6) Nov 06 2008 Wise choice. I'm readying my firehose so that I can spray you down when...
Another one of Bartosz' great blogs: http://www.reddit.com/r/programming/comments/7bmkt/who_ordered_memory_fences_on_an_x86/ This will be required reading when we start implementing shared types.
Nov 05 2008
"Walter Bright" <newshound1 digitalmars.com> wrote in message news:get95v$1r39$1 digitalmars.com...Another one of Bartosz' great blogs: http://www.reddit.com/r/programming/comments/7bmkt/who_ordered_memory_fences_on_an_x86/ This will be required reading when we start implementing shared types.Call me a grumpy old fart, but I'd be happy just tossing fences in everywhere (when a multicore is detected) and be done with the whole mess, just because trying to wring every little bit of speed from, say, a 3+ GHz multicore processor strikes me as a highly unworthy pursuit. I'd rather optimize for the lower end and let the fancy overpriced crap handle it however it will. And that's even before tossing in the consideration that (to my dismay) most code these days is written in languages/platforms (ex, "Ajaxy" web-apps) that throw any notion of performance straight into the trash anyway (what's 100 extra cycles here and there, when the browser/interpreter/OS/whatever makes something as simple as navigation and text entry less responsive than it was on a 1MHz 6502?).
Nov 05 2008
Nick Sabalausky wrote:
> Call me a grumpy old fart, but I'd be happy just tossing fences in everywhere (when a multicore is detected) and be done with the whole mess, just because trying to wring every little bit of speed from, say, a 3+ GHz multicore processor strikes me as a highly unworthy pursuit. I'd rather optimize for the lower end and let the fancy overpriced crap handle it however it will.

Ok, I'm not going to go as far as that. :) But I've heard that Intel has been pondering doing almost that, at the hardware level. The theory is that years from now, our CPUs will not be "one or a few extremely complex processors," but instead "hundreds or thousands of simplistic processors." You could implement a pretty braindead execution model (no reordering, etc.) if you had 1024 cores all working in parallel.

The question, of course, is how fast the software will come along to the point where it can actually make use of that many cores.
Nov 05 2008
"Russell Lewis" <webmaster villagersonline.com> wrote in message news:geu020$6fd$1 digitalmars.com...Nick Sabalausky wrote:Yea, I've been hearing a lot about that. Be interesting to see what happens. Console gaming will probably be the best place to watch to see how that turns out (what with Sony's Cell and all). Also, a few years ago I heard about some research on a "smart memory" chip that would allow certain basic operations to be performed within the memory chip itself (basically turning the memory cells into registers, AIUI). IIRC, the benefit they were aiming for was reducing the bottleneck of CPU<->RAM bus traffic. I haven't heard anything about it since then, but between that and the "lots of simple CPU cores" predictions, I wouldn't be surprised (though I'm not sure I would bet on it either) to eventually see traditional memory and processors (and maybe even hard drives) become replaced by a hybrid "CPU/RAM" chip.Call me a grumpy old fart, but I'd be happy just tossing fences in everywhere (when a multicore is detected) and be done with the whole mess, just because trying to wring every little bit of speed from, say, a 3+ GHz multicore processor strikes me as a highly unworthy pursuit. I'd rather optimize for the lower end and let the fancy overpriced crap handle it however it will.Ok, I'm not going to go as far as that. :) But I've heard that Intel has been pondering doing almost that, at the hardware level. The theory is that years from now, our CPUs will not be "one or a few extremely complex processors," but instead "hundreds or thousands of simplistic processors." You could implement a pretty braindead execution model (no reordering, etc.) if you had 1024 cores all working in parallel. The question, of course, is how fast the software will come along to the point where it can actually make use of that many cores.
Nov 05 2008
Reply to Nick,
> Yea, I've been hearing a lot about that. Be interesting to see what happens. Console gaming will probably be the best place to watch to see how that turns out (what with Sony's Cell and all).
>
> Also, a few years ago I heard about some research on a "smart memory" chip that would allow certain basic operations to be performed within the memory chip itself (basically turning the memory cells into registers, AIUI). IIRC, the benefit they were aiming for was reducing the bottleneck of CPU<->RAM bus traffic. I haven't heard anything about it since then, but between that and the "lots of simple CPU cores" predictions, I wouldn't be surprised (though I'm not sure I would bet on it either) to eventually see traditional memory and processors (and maybe even hard drives) become replaced by a hybrid "CPU/RAM" chip.

One option people are throwing around in this is the Field Programmable Processor Array: think an FPGA with more complex gates.
Nov 06 2008
"BCS" <ao pathlink.com> wrote in message news:78ccfa2d350718cb0e1e3ffdd376 news.digitalmars.com...Reply to Nick,Oh yes, the processor that can rewire itself on-the-fly. A very interesting idea.Yea, I've been hearing a lot about that. Be interesting to see what happens. Console gaming will probably be the best place to watch to see how that turns out (what with Sony's Cell and all). Also, a few years ago I heard about some research on a "smart memory" chip that would allow certain basic operations to be performed within the memory chip itself (basically turning the memory cells into registers, AIUI). IIRC, the benefit they were aiming for was reducing the bottleneck of CPU<->RAM bus traffic. I haven't heard anything about it since then, but between that and the "lots of simple CPU cores" predictions, I wouldn't be surprised (though I'm not sure I would bet on it either) to eventually see traditional memory and processors (and maybe even hard drives) become replaced by a hybrid "CPU/RAM" chip.One option people are throwing around in this is the Field Programmable Processor Array, think a FPGA with more complex gates.
Nov 06 2008
Nick Sabalausky wrote:
> Call me a grumpy old fart, but I'd be happy just tossing fences in everywhere (when a multicore is detected) and be done with the whole mess, just because trying to wring every little bit of speed from, say, a 3+ GHz multicore processor strikes me as a highly unworthy pursuit. I'd rather optimize for the lower end and let the fancy overpriced crap handle it however it will.
>
> And that's even before tossing in the consideration that (to my dismay) most code these days is written in languages/platforms (ex, "Ajaxy" web-apps) that throw any notion of performance straight into the trash anyway (what's 100 extra cycles here and there, when the browser/interpreter/OS/whatever makes something as simple as navigation and text entry less responsive than it was on a 1MHz 6502?).

Bartosz, Andrei, Sean and I have discussed this at length. My personal view is that nobody actually understands the proper use of fences (the CPU documentation on exactly what they do is frustratingly obtuse, which does not help at all). Then there's the issue of fences behaving very differently on different CPUs. If you use explicit fences, you have no hope of portability.

To address this, the idea we've been tossing about is to allow the only operations on shared variables to be read and write, implemented as compiler intrinsics:

    shared int x;
    ...
    int y = shared_read(x);
    shared_write(x, y + 1);

which implements:

    int y = x++;

(Forget the names of the intrinsics for the moment.) Yes, it's painfully explicit. But it's easy to visually verify correctness, and one can grep for them for code review purposes. Each shared_read and shared_write is guaranteed to be sequentially consistent, within a thread as well as among multiple threads.

How they are implemented is up to the compiler. The compiler can do the naive approach and lard them up with airtight fences, or a more advanced compiler can do data flow analysis and compute a reasonable minimum number of fences required.

The point here is that only *one* person needs to know how the fences actually work on the target CPU, the person who writes the compiler back end. And even that person only needs to solve the problem once. I think that's a far more tractable problem than trying to educate every programmer out there on the subtleties of fences for every CPU variant.

Yes, this screws down very tightly what can be done with shared variables. Once we get this done, and get it right, we'll be able to see much more clearly where the right places are to loosen those screws.
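For concreteness, here is a rough sketch of what the naive "lard them up with airtight fences" lowering described above might look like on x86. It is only a sketch: shared_read and shared_write are the placeholder names from the post, the shared qualifier is left off the parameters because it isn't implemented yet, and it assumes the inline assembler accepts MFENCE (an SSE2 instruction).

    int shared_read(ref int v)
    {
        int result;
        asm { mfence; }   // complete every earlier load and store before reading
        result = v;       // an aligned word-sized load is itself atomic on x86
        asm { mfence; }   // keep later accesses from being hoisted above the read
        return result;
    }

    void shared_write(ref int v, int value)
    {
        asm { mfence; }   // complete every earlier access before the store
        v = value;        // an aligned word-sized store is itself atomic on x86
        asm { mfence; }   // drain the store before anything that follows runs
    }

Written this way, the int y = x++; example above ends up with four fences, two of them back to back; that redundancy is exactly the kind of thing the data-flow-analysis pass mentioned above would be expected to trim.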
Nov 05 2008
"Walter Bright" <newshound1 digitalmars.com> wrote in message news:geu161$91s$1 digitalmars.com...Nick Sabalausky wrote:From reading the article, I was under the impression that not using explicit fences lead to CPUs inevitably making false assumptions and thus spitting out erroneus results. So it sounds like explicit fences are a case of "dammed if you do, dammed if you don't": ie, "Use explicit fences everywhere and you get unportable machine code. Don't use explicit fences and you get errors." Is this accurate? (If so, what a mess!) Also, one thing I'ma little nclear on, is this whole mess only applicable when multiple cores are in use, or do the same problems crop up on unicore chips?Call me a grumpy old fart, but I'd be happy just tossing fences in everywhere (when a multicore is detected) and be done with the whole mess, just because trying to wring every little bit of speed from, say, a 3+ GHz multicore processor strikes me as a highly unworthy pursuit. I'd rather optimize for the lower end and let the fancy overpriced crap handle it however it will. And that's even before tossing in the consideration that (to my dismay) most code these days is written in languages/platforms (ex, "Ajaxy" web-apps) that throw any notion of performance straight into the trash anyway (what's 100 extra cycles here and there, when the browser/interpreter/OS/whatever makes something as simple as navigation and text entry less responsive than it was on a 1MHz 6502?).Bartosz, Andrei, Sean and I have discussed this at length. My personal view is that nobody actually understands the proper use of fences (the CPU documentation on exactly what they do is frustratingly obtuse, which does not help at all). Then there's the issue of fences behaving very differently on different CPUs. If you use explicit fences, you have no hope of portability.To address this, the idea we've been tossing about is to allow the only operations on shared variables to be read and write, implemented as compiler intrinsics: shared int x; ... int y = shared_read(x); shared_write(x, y + 1); which implements: int y = x++; (Forget the names of the intrinsics for the moment.) Yes, it's painfully explicit. But it's easy to visually verify correctness, and one can grep for them for code review purposes. Each shared_read and shared_write are guaranteed to be sequentially consistent, within a thread as well as among multiple threads.I volunteer to respond to inevitable "Why does D's shared memory access syntax suck so badly?" inqueries with "You can thank the CPU vendors for that" ;) (Or do I misunderstand the root issue?)How they are implemented is up to the compiler. The compiler can do the naive approach and lard them up with airtight fences, or a more advanced compiler can do data flow analysis and compute a reasonable minimum number of fences required. The point here is that only *one* person needs to know how the fences actually work on the target CPU, the person who writes the compiler back end. And even that person only needs to solve the problem once. I think that's a far more tractable problem than trying to educate every programmer out there on the subtleties of fences for every CPU variant. Yes, this screws down very tightly what can be done with shared variables. Once we get this done, and get it right, we'll be able to see much more clearly where the right places are to loosen those screws.Seems to make sense. 
Maybe I'm being naive, but would it have made more sense for CPUs to assume memory accesses *cannot* be reordered unless told otherwise, instead of the other way around?
Nov 05 2008
Nick Sabalausky wrote:
> Also, one thing I'm a little unclear on: is this whole mess only applicable when multiple cores are in use, or do the same problems crop up on unicore chips?

It's a problem only on multicore chips.

> Maybe I'm being naive, but would it have made more sense for CPUs to assume memory accesses *cannot* be reordered unless told otherwise, instead of the other way around?

It's the way it is for performance reasons.
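The performance being bought is mostly the store buffer: x86 lets a later load pass an earlier store to a different location, and forbidding that would mean stalling every store until it is globally visible. Here is a minimal sketch of the classic litmus test (names made up for illustration; thread creation omitted):

    int flagA = 0;   // written by thread 1, read by thread 2
    int flagB = 0;   // written by thread 2, read by thread 1
    int r1, r2;      // results observed by each thread

    void thread1()
    {
        flagA = 1;
        // without a fence here, this load can complete before the store
        // above has left this core's store buffer
        r1 = flagB;
    }

    void thread2()
    {
        flagB = 1;
        r2 = flagA;
    }

Sequential consistency forbids the outcome r1 == 0 && r2 == 0, but an x86 permits it unless an MFENCE (or a locked instruction) sits between each store and the following load.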
Nov 06 2008
Nick Sabalausky wrote:"Walter Bright" <newshound1 digitalmars.com> wrote in message news:geu161$91s$1 digitalmars.com...In theory, you can write a portable program that uses explicit fences. The problem is that you have to design for the absolute worst-case CPU, which basically means putting an absolute read/write fence in almost every conceivable place. That would make performance suck. For reasonable performance, you need to cut down on the fences to only those that are required in order to get correctness...and that is definitely *not* portable.Nick Sabalausky wrote:From reading the article, I was under the impression that not using explicit fences lead to CPUs inevitably making false assumptions and thus spitting out erroneus results. So it sounds like explicit fences are a case of "dammed if you do, dammed if you don't": ie, "Use explicit fences everywhere and you get unportable machine code. Don't use explicit fences and you get errors." Is this accurate? (If so, what a mess!) Also, one thing I'ma little nclear on, is this whole mess only applicable when multiple cores are in use, or do the same problems crop up on unicore chips?Call me a grumpy old fart, but I'd be happy just tossing fences in everywhere (when a multicore is detected) and be done with the whole mess, just because trying to wring every little bit of speed from, say, a 3+ GHz multicore processor strikes me as a highly unworthy pursuit. I'd rather optimize for the lower end and let the fancy overpriced crap handle it however it will. And that's even before tossing in the consideration that (to my dismay) most code these days is written in languages/platforms (ex, "Ajaxy" web-apps) that throw any notion of performance straight into the trash anyway (what's 100 extra cycles here and there, when the browser/interpreter/OS/whatever makes something as simple as navigation and text entry less responsive than it was on a 1MHz 6502?).Bartosz, Andrei, Sean and I have discussed this at length. My personal view is that nobody actually understands the proper use of fences (the CPU documentation on exactly what they do is frustratingly obtuse, which does not help at all). Then there's the issue of fences behaving very differently on different CPUs. If you use explicit fences, you have no hope of portability.
Nov 06 2008
Walter Bright wrote:
> Yes, this screws down very tightly what can be done with shared variables. Once we get this done, and get it right, we'll be able to see much more clearly where the right places are to loosen those screws.

Wise choice. I'm readying my firehose so that I can spray you down when the flames come. :)
Nov 06 2008