digitalmars.D - Multicores and Publication Safety
- Walter Bright (4/4) Aug 04 2008 "What memory fences are useful for on multiprocessors; and why you
- Walter Bright (3/4) Aug 04 2008 There seems to be a cadre of reddit readers who immediately vote down
- Jb (7/11) Aug 04 2008 None of that is relevant on x86 as far as I understand. I could only fin...
- Brad Roberts (2/19) Aug 04 2008 Pay very close attention to sections 2.3 and 2.4 of that document.
- Sean Kelly (33/52) Aug 04 2008 2.4 is the most interesting aspect of PC. It means that you can run
- Brad Roberts (10/76) Aug 04 2008 For that example, section 2.8 kicks in, locked instructions (such as
- Jb (22/43) Aug 05 2008 They dont override 2.1, they complement it. IE...
- Walter Bright (6/9) Aug 05 2008 It's risky to write such code, however, because:
- Jb (18/27) Aug 05 2008 You cant design / write your code based on the idea that someone who doe...
- Walter Bright (3/36) Aug 05 2008 I think that is because the current language technology is deficient. We...
- Jb (6/19) Aug 06 2008 FWIW i think you're right.
- Sean Kelly (11/27) Aug 04 2008 Not true. The actual behavior of IA-32 processors has been hotly
- Jb (6/28) Aug 05 2008 Thats news to me.
- Sean Kelly (3/34) Aug 05 2008 True enough. It's mostly an issue with creating mutexes and the like.
- Sean Kelly (4/27) Aug 05 2008 I don't know that this was ever confirmed with anyone at AMD, but it did...
- Jb (15/42) Aug 05 2008 I did a bit of googling and it does seem older AMDs were less strongly
- Sean Kelly (19/65) Aug 05 2008 At least AMD and Intel have figured out how to separate discussion of
- Benji Smith (8/13) Aug 05 2008 Interesting you should bring this up. I was just reading an article
"What memory fences are useful for on multiprocessors; and why you should care, even if you're not an assembly programmer." http://bartoszmilewski.wordpress.com/2008/08/04/multicores-and-publication-safety/ http://www.reddit.com/comments/6uuqc/multicores_and_publication_safety/
Aug 04 2008
Walter Bright wrote:
> http://www.reddit.com/comments/6uuqc/multicores_and_publication_safety/

There seems to be a cadre of reddit readers who immediately vote down anything on D. That can be counteracted if the community votes them up!
Aug 04 2008
"Walter Bright" <newshound1 digitalmars.com> wrote in message news:g7855a$2sd3$1 digitalmars.com..."What memory fences are useful for on multiprocessors; and why you should care, even if you're not an assembly programmer." http://bartoszmilewski.wordpress.com/2008/08/04/multicores-and-publication-safety/ http://www.reddit.com/comments/6uuqc/multicores_and_publication_safety/None of that is relevant on x86 as far as I understand. I could only find the one regarding x86-64, but as far as I know it's the same on x86-32. http://www.intel.com/products/processor/manuals/318147.pdf The key point being loads are not reordered with other loads, and stores are not reordered with other stores.
Aug 04 2008
Jb wrote:
> None of that is relevant on x86 as far as I understand. I could only find the one regarding x86-64, but as far as I know it's the same on x86-32.
>
> http://www.intel.com/products/processor/manuals/318147.pdf
>
> The key point being loads are not reordered with other loads, and stores are not reordered with other stores.

Pay very close attention to sections 2.3 and 2.4 of that document.
Aug 04 2008
Brad Roberts wrote:
> Pay very close attention to sections 2.3 and 2.4 of that document.

2.4 is the most interesting aspect of PC. It means that you can run into situations like this:

    // Thread A
    x = 1;

    // Thread B
    if( x == 1 )
        y = 1;

    // Thread C
    if( y == 1 )
        assert( x == 1 ); // may fail

Alex Terekhov came up with a sneaky solution for this based on how the IA-32 spec says CAS is currently implemented:

    // Thread A
    x = 1;

    // Thread B
    t = CAS( x, 0, 0 );
    if( t == 1 )
        y = 1;

    // Thread C
    if( y == 1 )
        assert( x == 1 ); // true

In essence, Intel currently implements CAS by either storing the new value /or/ re-storing the old value based on the result of the comparison, and because all stores from a single processor are ordered, Thread C is therefore guaranteed to see the store to x before the store to y.

As cool as I find the above solution, however, I do hope that this helps to demonstrate the complexity of lock-free programming. It also shows just how complex analysis of this stuff is. Even with the full source code available it would take some doing for a compiler to recognize a problem similar to the above.

Sean
Aug 04 2008
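For readers who want to try the three-thread chain from the post above, here is a minimal sketch using C++11 atomics. It is not from the thread: the names `wrc_x`, `wrc_y` and `run_causality_chain` are made up for illustration, and the `if` tests are replaced by spin loops so that the run always reaches the final read. Under the C++ memory model, the release/acquire pairs chain together, so thread C is guaranteed to read 1 here. The sketch says nothing about what a particular x86 part does with plain loads and stores.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical stand-ins for the x and y of the post.
std::atomic<int> wrc_x{0};
std::atomic<int> wrc_y{0};

// Runs threads A, B and C once and returns the value of x that
// thread C read after it had observed y == 1.
int run_causality_chain() {
    wrc_x.store(0);
    wrc_y.store(0);
    int seen_by_c = -1;

    // Thread A: x = 1
    std::thread a([] { wrc_x.store(1, std::memory_order_release); });

    // Thread B: wait for x == 1, then y = 1
    std::thread b([] {
        while (wrc_x.load(std::memory_order_acquire) != 1) { /* spin */ }
        wrc_y.store(1, std::memory_order_release);
    });

    // Thread C: wait for y == 1, then read x
    std::thread c([&seen_by_c] {
        while (wrc_y.load(std::memory_order_acquire) != 1) { /* spin */ }
        // A's store happens-before this load through the A->B->C chain.
        seen_by_c = wrc_x.load(std::memory_order_relaxed);
    });

    a.join();
    b.join();
    c.join();
    return seen_by_c;
}
```

Replacing every ordering with `std::memory_order_relaxed` removes the guarantee at the language level, even if a given machine happens to behave.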
Sean Kelly wrote:
> Alex Terekhov came up with a sneaky solution for this based on how the IA-32 spec says CAS is currently implemented:
>
>     // Thread A
>     x = 1;
>
>     // Thread B
>     t = CAS( x, 0, 0 );
>     if( t == 1 )
>         y = 1;
>
>     // Thread C
>     if( y == 1 )
>         assert( x == 1 ); // true

For that example, section 2.8 kicks in: locked instructions (such as CAS) help constrain ordering.

So.. summary. Reordering is real, even on x86 class hardware.

To make life even more interesting, there are also various CPU bugs that help make things even worse. See this thread (unconfirmed info, but interesting nonetheless) on the linux-kernel mailing list:

http://www.ussg.iu.edu/hypermail/linux/kernel/0808.0/0882.html

Whee,
Brad
Aug 04 2008
"Brad Roberts" <braddr puremagic.com> wrote in message news:mailman.10.1217908384.1156.digitalmars-d puremagic.com...Jb wrote:They dont override 2.1, they complement it. IE... *Stores cannot be reordered with other stores* *Loads cannot be reordered with other loads* x = 1; ready = 1; Happens in order whether or not a load is reordered with those stores. You cant have a situation where a processor sees the write to "ready" before it sees the write "x". What Bartoz said.. "writes to memory can be completed out of order and" Is not true on x86. What 2.3 is saying is that a later load could be reordered before either store, but it still cant be reordered before the store to 'x' and after the store to 'ready', because the order of those stores cannot be changed. If it gets reordered before the store to 'x' it implicity gets reordered before the store to ready. That's the whole point of the ordering of stores / loads being enforced. Reagrding 2.4 : What this is saying is that there may be a delay between processors seeing each others stores, not that they can be seen out of order. Processor 1 may see it's own write to 'x' before processor 2 does, but processor 2 still wont see the write to 'ready' before the write to 'x'."Walter Bright" <newshound1 digitalmars.com> wrote in message news:g7855a$2sd3$1 digitalmars.com...Pay very close attention to sections 2.3 and 2.4 of that document."What memory fences are useful for on multiprocessors; and why you should care, even if you're not an assembly programmer." http://bartoszmilewski.wordpress.com/2008/08/04/multicores-and-publication-safety/ http://www.reddit.com/comments/6uuqc/multicores_and_publication_safety/None of that is relevant on x86 as far as I understand. I could only find the one regarding x86-64, but as far as I know it's the same on x86-32. http://www.intel.com/products/processor/manuals/318147.pdf The key point being loads are not reordered with other loads, and stores are not reordered with other stores.
Aug 05 2008
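The `x = 1; ready = 1;` publication pattern discussed above can be written portably, so that it does not depend on what a given processor promises about store ordering. A minimal sketch in C++11 follows. It is an illustration only: `payload_x`, `ready_flag` and `publish_and_read` are hypothetical names, not code from the thread or from the article. The release store and the acquire load are what make the earlier plain write to `payload_x` visible to the reader.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical names: payload_x plays the role of x, ready_flag of ready.
int payload_x = 0;
std::atomic<int> ready_flag{0};

// Starts one writer and one reader; returns the value of x the reader
// saw after it observed ready == 1.
int publish_and_read() {
    payload_x = 0;
    ready_flag.store(0, std::memory_order_relaxed);
    int seen = -1;

    std::thread reader([&seen] {
        // Acquire: reads after this load cannot be moved before it.
        while (ready_flag.load(std::memory_order_acquire) == 0) { /* spin */ }
        seen = payload_x;
    });

    std::thread writer([] {
        payload_x = 1;
        // Release: writes before this store cannot be moved after it.
        ready_flag.store(1, std::memory_order_release);
    });

    writer.join();
    reader.join();
    return seen;
}
```

On a strongly ordered target a compiler may emit plain moves for both operations, which is why the portable form need not cost anything there.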
Jb wrote:
> What Bartosz said..
>
> "writes to memory can be completed out of order and"
>
> is not true on x86.

It's risky to write such code, however, because:

1. someone else may try to port it to another processor, and then be mystified as to why it breaks

2. Intel may change this behavior on future x86's, which means your code will break years from now
Aug 05 2008
"Walter Bright" <newshound1 digitalmars.com> wrote in message news:g795mq$25jq$1 digitalmars.com...Jb wrote:You cant design / write your code based on the idea that someone who doesnt know what they are doing will try and modify it later. And if they are unaware of memory ordering they are likely unaware of alignment atomicity, and probably dont understand the subtleties of syncronization, and a whole bunch of other issues. I'm not saying every joe blogs programmer should know about memory ordering and use it where they can to avoid more expensive syncronization primatives. But the compiler and stdlib, or multithreding librarys, should know about it. I dont think the compiler should be dumping memory fences all over the place on the assumtion that they might be needed by the x86 processors of 2012.What Bartoz said.. "writes to memory can be completed out of order and" Is not true on x86.It's risky to write such code, however, because: 1. someone else may try to port it to another processor, and then be mystified as to why it breaks2. Intel may change this behavior on future x86's, which means your code will break years from nowI dont think they could because i think a lot of code probably already relys on it. And i think it's likely that the new comitment to strong memory ordering, from both AMD and INTEL (both have pdfs regarding 64 bit that specify it), is mainly because they realize it is needed to help progress with multi core.
Aug 05 2008
Jb wrote:
> I'm not saying every joe blogs programmer should know about memory ordering and use it where they can to avoid more expensive syncronization primatives. But the compiler and stdlib, or multithreding librarys, should know about it.
>
> I dont think the compiler should be dumping memory fences all over the place on the assumtion that they might be needed by the x86 processors of 2012.

The model the compiler uses is to generate code "as if" fences were inserted everywhere. The compiler may, however, as part of optimization and generating code for a particular CPU, elide as many as it can.

> > 2. Intel may change this behavior on future x86's, which means your code will break years from now
>
> I dont think they could because i think a lot of code probably already relys on it.

I think that is because the current language technology is deficient. We aim to fix that with D :-)
Aug 05 2008
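One way to picture "as if fences were inserted everywhere" is the default that C++11 atomics ended up with: every operation on a `std::atomic` is sequentially consistent unless the programmer asks for something weaker. The sketch below is offered only as an analogy. It is not a description of how D or any D compiler works, and `hits` and `count_hits` are hypothetical names. The counter needs atomicity but no ordering, so the increment opts out explicitly, while the surrounding operations keep the strong default.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical shared counter.
std::atomic<long> hits{0};

// Spawns `threads` workers that each add `per_thread` to the counter.
long count_hits(int threads, int per_thread) {
    hits.store(0);  // no ordering argument: defaults to memory_order_seq_cst

    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([per_thread] {
            for (int i = 0; i < per_thread; ++i) {
                // Explicit opt-out: atomicity only, no ordering needed here.
                hits.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& th : pool) th.join();  // join orders the workers before the read

    return hits.load();  // default seq_cst again
}
```

The design point is that the safe behavior is what you get by saying nothing, and the weaker one has to be spelled out where someone has checked that it is enough.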
"Walter Bright" <walter nospammm-digitalmars.com> wrote in message news:g7b7h1$aeb$1 digitalmars.com...FWIW i think you're right. But a little more help from the hardware would be nice aswell. I'd like to see "lock free" (non blocking) syncronization made a bit easier, somthing like a double CAS.I think that is because the current language technology is deficient. We aim to fix that with D :-)2. Intel may change this behavior on future x86's, which means your code will break years from nowI dont think they could because i think a lot of code probably already relys on it. And i think it's likely that the new comitment to strong memory ordering, from both AMD and INTEL (both have pdfs regarding 64 bit that specify it), is mainly because they realize it is needed to help progress with multi core.
Aug 06 2008
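A true double CAS would compare and swap two independent memory locations at once, and the sketch below does not provide that. It only shows a narrower trick that is available with an ordinary single-word CAS: pack two small fields into one 64-bit word so that both are checked and replaced together. All names here (`packed`, `pack`, `update_both`) are hypothetical and chosen for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical pair: a 32-bit value in the low half and a 32-bit
// version counter in the high half of one 64-bit word.
std::atomic<std::uint64_t> packed{0};

std::uint64_t pack(std::uint32_t value, std::uint32_t version) {
    return (static_cast<std::uint64_t>(version) << 32) | value;
}

std::uint32_t value_of(std::uint64_t p) {
    return static_cast<std::uint32_t>(p);
}

std::uint32_t version_of(std::uint64_t p) {
    return static_cast<std::uint32_t>(p >> 32);
}

// Replaces the value and bumps the version in one atomic step.
// Returns false if either half changed since `expected` was read.
bool update_both(std::uint64_t expected, std::uint32_t new_value) {
    const std::uint64_t desired = pack(new_value, version_of(expected) + 1);
    return packed.compare_exchange_strong(expected, desired);
}
```

A caller reads a snapshot with `packed.load()`, computes the new value, and retries if `update_both` returns false. The version half is what lets a stale snapshot be detected even when the value half has been set back to an old number.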
Jb wrote:
> None of that is relevant on x86 as far as I understand. I could only find the one regarding x86-64, but as far as I know it's the same on x86-32.
>
> http://www.intel.com/products/processor/manuals/318147.pdf
>
> The key point being loads are not reordered with other loads, and stores are not reordered with other stores.

Not true. The actual behavior of IA-32 processors has been hotly debated, but it's been established that at least certain AMD processors may reorder loads. Also, even under the PCsc model it is completely legal to "hoist" loads above stores, or equivalently, to "sink" stores below loads.

In short, unless you've *really* done your homework I suggest being very careful with respect to lock-free programming, i.e. always perform fully sequenced operations just to be safe. Tango has had such a module from the start, and it looks like Phobos2 may get one fairly soon as well.

Sean
Aug 04 2008
"Sean Kelly" <sean invisibleduck.org> wrote in message news:g78man$17sb$1 digitalmars.com...Jb wrote:Thats news to me."Walter Bright" <newshound1 digitalmars.com> wrote in message news:g7855a$2sd3$1 digitalmars.com...Not true. The actual behavior of IA-32 processors has been hotly debated, but it's been established that at least certain AMD processors may reorder loads."What memory fences are useful for on multiprocessors; and why you should care, even if you're not an assembly programmer." http://bartoszmilewski.wordpress.com/2008/08/04/multicores-and-publication-safety/ http://www.reddit.com/comments/6uuqc/multicores_and_publication_safety/None of that is relevant on x86 as far as I understand. I could only find the one regarding x86-64, but as far as I know it's the same on x86-32. http://www.intel.com/products/processor/manuals/318147.pdf The key point being loads are not reordered with other loads, and stores are not reordered with other stores.Also, even under the PCsc model it is completely legal to "hoist" loads above stores, or equivalently, to "sink" stores below loads.Yes but as long as stores are not reordered with other stores, and loads not reordered with other loads, then that kind of re-ordering wont result in the situation Bartoz described.
Aug 05 2008
Jb wrote:
> Yes but as long as stores are not reordered with other stores, and loads not reordered with other loads, then that kind of re-ordering wont result in the situation Bartoz described.

True enough. It's mostly an issue with creating mutexes and the like.

Sean
Aug 05 2008
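The mutex case mentioned above is the classic place where hoisting a load above an earlier store matters: each thread writes its own flag and then reads the other's, and a Dekker-style lock breaks if both reads return 0. Here is a minimal sketch in C++11, with hypothetical names (`sb_x`, `sb_y`, `count_violations`). It uses the default sequentially consistent ordering, under which the both-zero outcome is ruled out by the language.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical flags, one per thread.
std::atomic<int> sb_x{0};
std::atomic<int> sb_y{0};

// One round: each thread stores to its own flag, then loads the other.
// Returns true if both loads returned 0.
bool both_read_zero_once() {
    sb_x.store(0);
    sb_y.store(0);
    int r1 = -1;
    int r2 = -1;

    // Default ordering on store() and load() is memory_order_seq_cst,
    // which forbids moving the load ahead of the preceding store.
    std::thread t1([&r1] { sb_x.store(1); r1 = sb_y.load(); });
    std::thread t2([&r2] { sb_y.store(1); r2 = sb_x.load(); });

    t1.join();
    t2.join();
    return r1 == 0 && r2 == 0;
}

// Repeats the round and counts how often both threads read 0.
int count_violations(int rounds) {
    int violations = 0;
    for (int i = 0; i < rounds; ++i) {
        if (both_read_zero_once()) ++violations;
    }
    return violations;
}
```

With release stores and acquire loads instead of the default, the both-zero result becomes permitted again, which is the store-to-load case that release/acquire alone does not cover.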
Jb wrote:
> > Not true. The actual behavior of IA-32 processors has been hotly debated, but it's been established that at least certain AMD processors may reorder loads.
>
> Thats news to me.

I don't know that this was ever confirmed with anyone at AMD, but it did come up in the C++0x talks and I believe the linux kernel accounts for it.

Sean
Aug 05 2008
"Sean Kelly" <sean invisibleduck.org> wrote in message news:g79ugv$mdd$1 digitalmars.com...Jb wrote:I did a bit of googling and it does seem older AMDs were less strongly ordered. It seems SSE/3DNow non temporal stores particulary. But it looks like they have gone for strong ordering with AMD64. http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf From 7.2 : Multiprocessor Memory Ordering. "Loads do not pass previous loads (loads are not re-ordered). Stores do not pass previous stores (stores are not re-ordered)" Although skim reading more of chapter 7 it looks like they might do reordering behind the scence, or "such that the appearance of in-order execution is maintained" as they say. My guess is that strong ordering, or at least the appearance of it, is an important factor in multi core cpus scalling well."Sean Kelly" <sean invisibleduck.org> wrote in message news:g78man$17sb$1 digitalmars.com...I don't know that this was ever confirmed with anyone at AMD, but it did come up in the C++0x talks and I believe the linux kernel accounts for it.Jb wrote:Thats news to me."Walter Bright" <newshound1 digitalmars.com> wrote in message news:g7855a$2sd3$1 digitalmars.com...Not true. The actual behavior of IA-32 processors has been hotly debated, but it's been established that at least certain AMD processors may reorder loads."What memory fences are useful for on multiprocessors; and why you should care, even if you're not an assembly programmer." http://bartoszmilewski.wordpress.com/2008/08/04/multicores-and-publication-safety/ http://www.reddit.com/comments/6uuqc/multicores_and_publication_safety/None of that is relevant on x86 as far as I understand. I could only find the one regarding x86-64, but as far as I know it's the same on x86-32. http://www.intel.com/products/processor/manuals/318147.pdf The key point being loads are not reordered with other loads, and stores are not reordered with other stores.
Aug 05 2008
Jb wrote:
> I did a bit of googling and it does seem older AMDs were less strongly ordered. It seems SSE/3DNow non temporal stores particulary. But it looks like they have gone for strong ordering with AMD64.
>
> http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf
>
> From 7.2 : Multiprocessor Memory Ordering.
>
> "Loads do not pass previous loads (loads are not re-ordered). Stores do not pass previous stores (stores are not re-ordered)"
>
> Although skim reading more of chapter 7 it looks like they might do reordering behind the scence, or "such that the appearance of in-order execution is maintained" as they say.

At least AMD and Intel have figured out how to separate discussion of implementation issues from visible behavior. The original IA-32 spec was an absolute disaster in this respect. I'm also encouraged that the memory model has been both fully specified and strengthened to PCsc or better. The x86 has always been pretty easy to deal with and it's nice to see that this will continue to be true. I suppose my only question at this point is how the official memory barrier instructions apply to normal (non-SSE) instruction ordering. I don't suppose the recent specs say anything about this?

> My guess is that strong ordering, or at least the appearance of it, is an important factor in multi core cpus scalling well.

Yup. And the Intel announcement makes the very good point that it's a huge factor in performance per watt as well. Strengthening the memory model and shrinking the pipeline allows for a tremendous amount of logic hardware to simply be thrown away, which means smaller, cooler, more energy-efficient CPUs. My big question now is how computers will be built in the coming years... will we have a few traditional (fast) cores plus a general-purpose parallel computing cluster? I suppose I should read that Intel paper posted yesterday.

Sean
Aug 05 2008
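The question above about how the hardware barrier instructions interact with ordinary loads and stores is one for the vendor manuals, and this sketch does not answer it. What can be shown is the language-level counterpart: C++11 has standalone fences that order the plain and relaxed accesses around them. The example below is a sketch under that assumption, with hypothetical names (`fenced_payload`, `fenced_flag`, `publish_with_fences`). Which machine instructions, if any, a compiler emits for each fence depends on the target.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical payload and flag.
int fenced_payload = 0;
std::atomic<int> fenced_flag{0};

// Publishes the payload using relaxed flag accesses plus explicit fences.
// Returns the payload value the reader observed.
int publish_with_fences() {
    fenced_payload = 0;
    fenced_flag.store(0, std::memory_order_relaxed);
    int seen = -1;

    std::thread reader([&seen] {
        while (fenced_flag.load(std::memory_order_relaxed) == 0) { /* spin */ }
        // Acquire fence: later reads stay after the flag load above.
        std::atomic_thread_fence(std::memory_order_acquire);
        seen = fenced_payload;
    });

    std::thread writer([] {
        fenced_payload = 42;
        // Release fence: earlier writes stay before the flag store below.
        std::atomic_thread_fence(std::memory_order_release);
        fenced_flag.store(1, std::memory_order_relaxed);
    });

    writer.join();
    reader.join();
    return seen;
}
```

The two fences pair up through the flag: once the reader has seen the flag set and passed its acquire fence, everything the writer did before its release fence is visible.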
Sean Kelly wrote:
> My big question now is how computers will be built in the coming years... will we have a few traditional (fast) cores plus a general-purpose parallel computing cluster?
>
> Sean

Interesting you should bring this up. I was just reading an article yesterday about the "Cell Broadband Engine" used in the Playstation 3. It features one general-purpose 64-bit PowerPC chip (the "Power Processor Element") and eight co-processing cores (the "Synergistic Processing Units"), each with a 128-bit SIMD architecture.

So, at least from the perspective of IBM and Sony, the answer is "yes".

--benji
Aug 05 2008