digitalmars.D - Who Ordered Memory Fences on an x86?
- Walter Bright (3/3) Nov 05 2008 Another one of Bartosz' great blogs:
- Nick Sabalausky (14/17) Nov 05 2008 Call me a grumpy old fart, but I'd be happy just tossing fences in
- Russell Lewis (9/15) Nov 05 2008 Ok, I'm not going to go as far as that. :) But I've heard that Intel
- Nick Sabalausky (14/29) Nov 05 2008 Yea, I've been hearing a lot about that. Be interesting to see what happ...
- BCS (3/18) Nov 06 2008 One option people are throwing around in this is the Field Programmable ...
- Nick Sabalausky (4/22) Nov 06 2008 Oh yes, the processor that can rewire itself on-the-fly. A very interest...
- Walter Bright (32/45) Nov 05 2008 Bartosz, Andrei, Sean and I have discussed this at length. My personal
- Nick Sabalausky (17/62) Nov 05 2008 From reading the article, I was under the impression that not using expl...
- Walter Bright (3/9) Nov 06 2008 It's the way it is for performance reasons.
- Russell Lewis (8/39) Nov 06 2008 In theory, you can write a portable program that uses explicit fences.
- Russell Lewis (3/6) Nov 06 2008 Wise choice. I'm readying my firehose so that I can spray you down when...
Another one of Bartosz' great blogs: http://www.reddit.com/r/programming/comments/7bmkt/who_ordered_memory_fences_on_an_x86/ This will be required reading when we start implementing shared types.
Nov 05 2008
"Walter Bright" <newshound1 digitalmars.com> wrote in message news:get95v$1r39$1 digitalmars.com...Another one of Bartosz' great blogs: http://www.reddit.com/r/programming/comments/7bmkt/who_ordered_memory_fences_on_an_x86/ This will be required reading when we start implementing shared types.Call me a grumpy old fart, but I'd be happy just tossing fences in everywhere (when a multicore is detected) and be done with the whole mess, just because trying to wring every little bit of speed from, say, a 3+ GHz multicore processor strikes me as a highly unworthy pursuit. I'd rather optimize for the lower end and let the fancy overpriced crap handle it however it will. And that's even before tossing in the consideration that (to my dismay) most code these days is written in languages/platforms (ex, "Ajaxy" web-apps) that throw any notion of performance straight into the trash anyway (what's 100 extra cycles here and there, when the browser/interpreter/OS/whatever makes something as simple as navigation and text entry less responsive than it was on a 1MHz 6502?).
Nov 05 2008
Nick Sabalausky wrote:
> Call me a grumpy old fart, but I'd be happy just tossing fences in everywhere (when a multicore is detected) and be done with the whole mess, just because trying to wring every little bit of speed from, say, a 3+ GHz multicore processor strikes me as a highly unworthy pursuit. I'd rather optimize for the lower end and let the fancy overpriced crap handle it however it will.

Ok, I'm not going to go as far as that. :) But I've heard that Intel has been pondering doing almost that, at the hardware level. The theory is that years from now, our CPUs will not be "one or a few extremely complex processors," but instead "hundreds or thousands of simplistic processors." You could implement a pretty braindead execution model (no reordering, etc.) if you had 1024 cores all working in parallel.

The question, of course, is how fast the software will come along to the point where it can actually make use of that many cores.
Nov 05 2008
"Russell Lewis" <webmaster villagersonline.com> wrote in message news:geu020$6fd$1 digitalmars.com...Nick Sabalausky wrote:Yea, I've been hearing a lot about that. Be interesting to see what happens. Console gaming will probably be the best place to watch to see how that turns out (what with Sony's Cell and all). Also, a few years ago I heard about some research on a "smart memory" chip that would allow certain basic operations to be performed within the memory chip itself (basically turning the memory cells into registers, AIUI). IIRC, the benefit they were aiming for was reducing the bottleneck of CPU<->RAM bus traffic. I haven't heard anything about it since then, but between that and the "lots of simple CPU cores" predictions, I wouldn't be surprised (though I'm not sure I would bet on it either) to eventually see traditional memory and processors (and maybe even hard drives) become replaced by a hybrid "CPU/RAM" chip.Call me a grumpy old fart, but I'd be happy just tossing fences in everywhere (when a multicore is detected) and be done with the whole mess, just because trying to wring every little bit of speed from, say, a 3+ GHz multicore processor strikes me as a highly unworthy pursuit. I'd rather optimize for the lower end and let the fancy overpriced crap handle it however it will.Ok, I'm not going to go as far as that. :) But I've heard that Intel has been pondering doing almost that, at the hardware level. The theory is that years from now, our CPUs will not be "one or a few extremely complex processors," but instead "hundreds or thousands of simplistic processors." You could implement a pretty braindead execution model (no reordering, etc.) if you had 1024 cores all working in parallel. The question, of course, is how fast the software will come along to the point where it can actually make use of that many cores.
Nov 05 2008
Reply to Nick,
> Yea, I've been hearing a lot about that. Be interesting to see what happens. Console gaming will probably be the best place to watch to see how that turns out (what with Sony's Cell and all).
>
> Also, a few years ago I heard about some research on a "smart memory" chip that would allow certain basic operations to be performed within the memory chip itself (basically turning the memory cells into registers, AIUI). IIRC, the benefit they were aiming for was reducing the bottleneck of CPU<->RAM bus traffic. I haven't heard anything about it since then, but between that and the "lots of simple CPU cores" predictions, I wouldn't be surprised (though I'm not sure I would bet on it either) to eventually see traditional memory and processors (and maybe even hard drives) become replaced by a hybrid "CPU/RAM" chip.

One option people are throwing around in this is the Field Programmable Processor Array: think an FPGA with more complex gates.
Nov 06 2008
"BCS" <ao pathlink.com> wrote in message news:78ccfa2d350718cb0e1e3ffdd376 news.digitalmars.com...Reply to Nick,Oh yes, the processor that can rewire itself on-the-fly. A very interesting idea.Yea, I've been hearing a lot about that. Be interesting to see what happens. Console gaming will probably be the best place to watch to see how that turns out (what with Sony's Cell and all). Also, a few years ago I heard about some research on a "smart memory" chip that would allow certain basic operations to be performed within the memory chip itself (basically turning the memory cells into registers, AIUI). IIRC, the benefit they were aiming for was reducing the bottleneck of CPU<->RAM bus traffic. I haven't heard anything about it since then, but between that and the "lots of simple CPU cores" predictions, I wouldn't be surprised (though I'm not sure I would bet on it either) to eventually see traditional memory and processors (and maybe even hard drives) become replaced by a hybrid "CPU/RAM" chip.One option people are throwing around in this is the Field Programmable Processor Array, think a FPGA with more complex gates.
Nov 06 2008
Nick Sabalausky wrote:
> Call me a grumpy old fart, but I'd be happy just tossing fences in everywhere (when a multicore is detected) and be done with the whole mess, just because trying to wring every little bit of speed from, say, a 3+ GHz multicore processor strikes me as a highly unworthy pursuit. I'd rather optimize for the lower end and let the fancy overpriced crap handle it however it will.
>
> And that's even before tossing in the consideration that (to my dismay) most code these days is written in languages/platforms (ex, "Ajaxy" web-apps) that throw any notion of performance straight into the trash anyway (what's 100 extra cycles here and there, when the browser/interpreter/OS/whatever makes something as simple as navigation and text entry less responsive than it was on a 1MHz 6502?).

Bartosz, Andrei, Sean and I have discussed this at length. My personal view is that nobody actually understands the proper use of fences (the CPU documentation on exactly what they do is frustratingly obtuse, which does not help at all). Then there's the issue of fences behaving very differently on different CPUs. If you use explicit fences, you have no hope of portability.

To address this, the idea we've been tossing about is to allow the only operations on shared variables to be read and write, implemented as compiler intrinsics:

    shared int x;
    ...
    int y = shared_read(x);
    shared_write(x, y + 1);

which implements:

    int y = x++;

(Forget the names of the intrinsics for the moment.) Yes, it's painfully explicit. But it's easy to visually verify correctness, and one can grep for them for code review purposes. Each shared_read and shared_write is guaranteed to be sequentially consistent, within a thread as well as among multiple threads.

How they are implemented is up to the compiler. The compiler can do the naive approach and lard them up with airtight fences, or a more advanced compiler can do data flow analysis and compute a reasonable minimum number of fences required.

The point here is that only *one* person needs to know how the fences actually work on the target CPU, the person who writes the compiler back end. And even that person only needs to solve the problem once. I think that's a far more tractable problem than trying to educate every programmer out there on the subtleties of fences for every CPU variant.

Yes, this screws down very tightly what can be done with shared variables. Once we get this done, and get it right, we'll be able to see much more clearly where the right places are to loosen those screws.
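For concreteness, here is a rough sketch of what the naive "lard them up with airtight fences" lowering described above might look like on x86. It is only a sketch: shared_read and shared_write are the placeholder names from the post, the shared qualifier is left off the parameters because it isn't implemented yet, and it assumes the inline assembler accepts MFENCE (an SSE2 instruction).

    int shared_read(ref int v)
    {
        int result;
        asm { mfence; }   // complete every earlier load and store before reading
        result = v;       // an aligned word-sized load is itself atomic on x86
        asm { mfence; }   // keep later accesses from being hoisted above the read
        return result;
    }

    void shared_write(ref int v, int value)
    {
        asm { mfence; }   // complete every earlier access before the store
        v = value;        // an aligned word-sized store is itself atomic on x86
        asm { mfence; }   // drain the store before anything that follows runs
    }

Written this way, the int y = x++; example above ends up with four fences, two of them back to back; that redundancy is exactly the kind of thing the data-flow-analysis pass mentioned above would be expected to trim.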
Nov 05 2008
"Walter Bright" <newshound1 digitalmars.com> wrote in message news:geu161$91s$1 digitalmars.com...Nick Sabalausky wrote:From reading the article, I was under the impression that not using explicit fences lead to CPUs inevitably making false assumptions and thus spitting out erroneus results. So it sounds like explicit fences are a case of "dammed if you do, dammed if you don't": ie, "Use explicit fences everywhere and you get unportable machine code. Don't use explicit fences and you get errors." Is this accurate? (If so, what a mess!) Also, one thing I'ma little nclear on, is this whole mess only applicable when multiple cores are in use, or do the same problems crop up on unicore chips?Call me a grumpy old fart, but I'd be happy just tossing fences in everywhere (when a multicore is detected) and be done with the whole mess, just because trying to wring every little bit of speed from, say, a 3+ GHz multicore processor strikes me as a highly unworthy pursuit. I'd rather optimize for the lower end and let the fancy overpriced crap handle it however it will. And that's even before tossing in the consideration that (to my dismay) most code these days is written in languages/platforms (ex, "Ajaxy" web-apps) that throw any notion of performance straight into the trash anyway (what's 100 extra cycles here and there, when the browser/interpreter/OS/whatever makes something as simple as navigation and text entry less responsive than it was on a 1MHz 6502?).Bartosz, Andrei, Sean and I have discussed this at length. My personal view is that nobody actually understands the proper use of fences (the CPU documentation on exactly what they do is frustratingly obtuse, which does not help at all). Then there's the issue of fences behaving very differently on different CPUs. If you use explicit fences, you have no hope of portability.To address this, the idea we've been tossing about is to allow the only operations on shared variables to be read and write, implemented as compiler intrinsics: shared int x; ... int y = shared_read(x); shared_write(x, y + 1); which implements: int y = x++; (Forget the names of the intrinsics for the moment.) Yes, it's painfully explicit. But it's easy to visually verify correctness, and one can grep for them for code review purposes. Each shared_read and shared_write are guaranteed to be sequentially consistent, within a thread as well as among multiple threads.I volunteer to respond to inevitable "Why does D's shared memory access syntax suck so badly?" inqueries with "You can thank the CPU vendors for that" ;) (Or do I misunderstand the root issue?)How they are implemented is up to the compiler. The compiler can do the naive approach and lard them up with airtight fences, or a more advanced compiler can do data flow analysis and compute a reasonable minimum number of fences required. The point here is that only *one* person needs to know how the fences actually work on the target CPU, the person who writes the compiler back end. And even that person only needs to solve the problem once. I think that's a far more tractable problem than trying to educate every programmer out there on the subtleties of fences for every CPU variant. Yes, this screws down very tightly what can be done with shared variables. Once we get this done, and get it right, we'll be able to see much more clearly where the right places are to loosen those screws.Seems to make sense. 
Maybe I'm being naive, but would it have made more sense for CPUs to assume memory accesses *cannot* be reordered unless told otherwise, instead of the other way around?
Nov 05 2008
Nick Sabalausky wrote:
> Also, one thing I'm a little unclear on: is this whole mess only applicable when multiple cores are in use, or do the same problems crop up on unicore chips?

It's a problem only on multicore chips.

> Maybe I'm being naive, but would it have made more sense for CPUs to assume memory accesses *cannot* be reordered unless told otherwise, instead of the other way around?

It's the way it is for performance reasons.
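The performance being bought is mostly the store buffer: x86 lets a later load pass an earlier store to a different location, and forbidding that would mean stalling every store until it is globally visible. Here is a minimal sketch of the classic litmus test (names made up for illustration; thread creation omitted):

    int flagA = 0;   // written by thread 1, read by thread 2
    int flagB = 0;   // written by thread 2, read by thread 1
    int r1, r2;      // results observed by each thread

    void thread1()
    {
        flagA = 1;
        // without a fence here, this load can complete before the store
        // above has left this core's store buffer
        r1 = flagB;
    }

    void thread2()
    {
        flagB = 1;
        r2 = flagA;
    }

Sequential consistency forbids the outcome r1 == 0 && r2 == 0, but an x86 permits it unless an MFENCE (or a locked instruction) sits between each store and the following load.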
Nov 06 2008
Nick Sabalausky wrote:"Walter Bright" <newshound1 digitalmars.com> wrote in message news:geu161$91s$1 digitalmars.com...In theory, you can write a portable program that uses explicit fences. The problem is that you have to design for the absolute worst-case CPU, which basically means putting an absolute read/write fence in almost every conceivable place. That would make performance suck. For reasonable performance, you need to cut down on the fences to only those that are required in order to get correctness...and that is definitely *not* portable.Nick Sabalausky wrote:From reading the article, I was under the impression that not using explicit fences lead to CPUs inevitably making false assumptions and thus spitting out erroneus results. So it sounds like explicit fences are a case of "dammed if you do, dammed if you don't": ie, "Use explicit fences everywhere and you get unportable machine code. Don't use explicit fences and you get errors." Is this accurate? (If so, what a mess!) Also, one thing I'ma little nclear on, is this whole mess only applicable when multiple cores are in use, or do the same problems crop up on unicore chips?Call me a grumpy old fart, but I'd be happy just tossing fences in everywhere (when a multicore is detected) and be done with the whole mess, just because trying to wring every little bit of speed from, say, a 3+ GHz multicore processor strikes me as a highly unworthy pursuit. I'd rather optimize for the lower end and let the fancy overpriced crap handle it however it will. And that's even before tossing in the consideration that (to my dismay) most code these days is written in languages/platforms (ex, "Ajaxy" web-apps) that throw any notion of performance straight into the trash anyway (what's 100 extra cycles here and there, when the browser/interpreter/OS/whatever makes something as simple as navigation and text entry less responsive than it was on a 1MHz 6502?).Bartosz, Andrei, Sean and I have discussed this at length. My personal view is that nobody actually understands the proper use of fences (the CPU documentation on exactly what they do is frustratingly obtuse, which does not help at all). Then there's the issue of fences behaving very differently on different CPUs. If you use explicit fences, you have no hope of portability.
Nov 06 2008
Walter Bright wrote:
> Yes, this screws down very tightly what can be done with shared variables. Once we get this done, and get it right, we'll be able to see much more clearly where the right places are to loosen those screws.

Wise choice. I'm readying my firehose so that I can spray you down when the flames come. :)
Nov 06 2008