digitalmars.D - Is 2X faster large memcpy interesting?
- Don (16/16) Mar 26 2009 The next D2 runtime will include my cache-size detection code. This
- Georg Wrede (6/24) Mar 26 2009 What's the alternative? What would you do instead? Is there something
- Andrei Alexandrescu (29/39) Mar 26 2009 I'd think so. In this day and age it is appalling that we don't quite
- Sean Kelly (7/21) Mar 26 2009 I don't know how sad this is. For better or worse, programming is still...
- Walter Bright (5/10) Mar 26 2009 It turns out that efficiently copying objects only a few bytes long
- Don (19/61) Mar 27 2009 More precisely, the optimal algorithm depends on the size of the
- Walter Bright (2/12) Mar 26 2009 I say go for it!
- Christopher Wright (5/15) Mar 26 2009 I don't use large arrays very often. When I do, I would not copy them if...
- Trass3r (3/14) Mar 26 2009 Well, arrays > 32K aren't that unusual, esp. in scientific computing.
- Thomas Moran (10/16) Mar 26 2009 Don, have you seen Agner Fog's memcpy() and memmove() implementations
- Don (5/24) Mar 27 2009 I'm aware of Agner's code (it was a motivation), but I deliberately
- Moritz Warning (3/4) Mar 26 2009 It's a basic building block for many programs.
- JC (5/23) Mar 27 2009 The applications that I write usually work with matrices of size 600x600...
The next D2 runtime will include my cache-size detection code. This makes it possible to write a cache-aware memcpy, using (for example) non-temporal writes when the arrays being copied exceed the size of the largest cache. In my tests, it gives a speed-up of approximately 2X in such cases.

The downside is that it's a fair bit of work to implement, and it only affects extremely large arrays, so I fear it's basically irrelevant (it probably won't help arrays < 32K in size). Do people actually copy megabyte-sized arrays? Is it worth spending any more time on it?

BTW: I tested the memcpy() code provided in AMD's 1992 optimisation manual, and in Intel's 2007 manual. Only one of them actually gave any benefit when run on a 2008 Intel Core2 -- which was it? (Hint: it wasn't Intel!) I've noticed that AMD's docs are usually greatly superior to Intel's, but this time the difference is unbelievable.
Mar 26 2009
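The cache-aware approach Don describes can be sketched as a size-based dispatch: below a cache-size threshold, copy normally; above it, use non-temporal (streaming) stores so the destination bypasses the cache. A minimal C sketch with several stated assumptions: `LARGEST_CACHE_SIZE` is a placeholder for what the runtime's cache-detection code would report, the function names are hypothetical, and the streaming path uses SSE2 `_mm_stream_si128` where available, falling back to plain `memcpy` elsewhere.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Placeholder: the real runtime would obtain this from cache detection. */
#define LARGEST_CACHE_SIZE (4u * 1024u * 1024u)

#if defined(__SSE2__)
#include <emmintrin.h>

/* Streaming copy: non-temporal stores send data to memory without
   filling the cache, so a huge copy doesn't evict the working set. */
static void stream_copy(void *dst, const void *src, size_t n)
{
    char *d = dst;
    const char *s = src;
    /* copy an unaligned head so the destination is 16-byte aligned */
    size_t head = (16 - ((uintptr_t)d & 15)) & 15;
    if (head > n) head = n;
    memcpy(d, s, head);
    d += head; s += head; n -= head;
    while (n >= 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)s);
        _mm_stream_si128((__m128i *)d, v);   /* non-temporal store */
        d += 16; s += 16; n -= 16;
    }
    _mm_sfence();      /* order streaming stores before later accesses */
    memcpy(d, s, n);   /* tail */
}
#else
static void stream_copy(void *dst, const void *src, size_t n)
{
    memcpy(dst, src, n);   /* fallback on non-SSE2 targets */
}
#endif

/* Size-based dispatch: normal copy while the data fits in cache,
   streaming copy once it clearly doesn't. */
void cache_aware_copy(void *dst, const void *src, size_t n)
{
    if (n < LARGEST_CACHE_SIZE)
        memcpy(dst, src, n);
    else
        stream_copy(dst, src, n);
}
```

Whether the threshold should be the full cache size, half of it, or something tuned per CPU is exactly the kind of question the cache-detection code is meant to answer.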
Don wrote:
> The next D2 runtime will include my cache-size detection code. [...] Do people actually copy megabyte-sized arrays? Is it worth spending any more time on it?

What's the alternative? What would you do instead? Is there something cooler or more important for D to do? (IMHO, if the other alternatives have any merit, then I'd vote for them.) But then again, you've already invested in this, and it clearly interests you. Laborious, yes, but it sounds fun.
Mar 26 2009
Don wrote:
> The next D2 runtime will include my cache-size detection code. [...] Do people actually copy megabyte-sized arrays? Is it worth spending any more time on it?

I'd think so. In this day and age it is appalling that we don't quite know how to quickly copy memory around. A long time ago I ran some measurements (http://www.ddj.com/cpp/184403799) and I was quite surprised. My musings were as true then as now. And now we're getting to the second freakin' Space Odyssey!

===============
Things are clearly hazy, aren't they? First off, maybe it came as a surprise to you that there's more than one way to fill and copy objects. Then, there's no single variant of fill and copy that works best on all compilers, data sets, and machines. (I guess if I tested the same code on a Celeron, which has less cache, I would have gotten very different results. To say nothing about other architectures.)

As a rule of thumb, it's generally good to use memcpy (and consequently fill-by-copy) if you can — for large data sets, memcpy doesn't make much difference, and for smaller data sets, it might be much faster. For cheap-to-copy objects, Duff's Device might perform faster than a simple for loop. Ultimately, all this is subject to your compiler's and machine's whims and quirks.

There is a very deep, and sad, realization underlying all this. We are in 2001, the year of the Spatial Odyssey. We've done electronic computing for more than 50 years now, and we strive to design more and more complex systems, with unsatisfactory results. Software development is messy. Could it be because the fundamental tools and means we use are low-level, inefficient, and not standardized? Just step out of the box and look at us — after 50 years, we're still not terribly good at filling and copying memory.
================

Andrei
Mar 26 2009
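Duff's Device, which Andrei mentions, folds the remainder handling of an unrolled copy loop into the loop itself: a switch jumps into the middle of the first pass to absorb `count % 8`, so no separate cleanup loop is needed. A byte-wise C sketch (real uses would copy larger units, and modern compilers often unroll plain loops just as well):

```c
#include <stddef.h>

/* Duff's Device: an 8-way unrolled copy where the switch jumps into
   the middle of the first pass to handle count % 8 leftover bytes. */
void duff_copy(char *to, const char *from, size_t count)
{
    if (count == 0)
        return;
    size_t n = (count + 7) / 8;       /* number of passes, rounded up */
    switch (count % 8) {
    case 0: do { *to++ = *from++;     /* full passes enter here */
    case 7:      *to++ = *from++;
    case 6:      *to++ = *from++;
    case 5:      *to++ = *from++;
    case 4:      *to++ = *from++;
    case 3:      *to++ = *from++;
    case 2:      *to++ = *from++;
    case 1:      *to++ = *from++;
            } while (--n > 0);
    }
}
```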
== Quote from Andrei Alexandrescu (SeeWebsiteForEmail erdani.org)'s article
> There is a very deep, and sad, realization underlying all this. [...] Just step out of the box and look at us — after 50 years, we're still not terribly good at filling and copying memory.

I don't know how sad this is. For better or worse, programming is still a craft, much like blacksmithing. Code is largely written from scratch for each project, techniques are jealously guarded (in our case via copyright law), etc. This may not be great from the perspective of progress, but it certainly makes the work more interesting. But then I'm a tinker at heart, so YMMV.
Mar 26 2009
Andrei Alexandrescu wrote:
> I'd think so. In this day and age it is appalling that we don't quite know how to quickly copy memory around. [...]

It turns out that efficiently copying objects only a few bytes long requires a bunch of code. So, I gave up on having the compiler generate intrinsic memcpy code, and instead just call the library function memcpy. This is implemented in the next update.
Mar 26 2009
Andrei Alexandrescu wrote:
> Don wrote:
>> The next D2 runtime will include my cache-size detection code. [...] Do people actually copy megabyte-sized arrays? Is it worth spending any more time on it?
> I'd think so. In this day and age it is appalling that we don't quite know how to quickly copy memory around. [...] (I guess if I tested the same code on a Celeron, which has less cache, I would have gotten very different results. To say nothing about other architectures.)

More precisely, the optimal algorithm depends on the size of the affected memory and the size of the CPU caches. However, the only thing which really varies between machines is where the thresholds are.

> As a rule of thumb, it's generally good to use memcpy (and consequently fill-by-copy) if you can [...] Ultimately, all this is subject to your compiler's and machine's whims and quirks.

This is why I think it's absolutely critical to have cache size determination as a fundamental runtime library primitive. Since memory speeds have only tripled since 1980, but processor speeds have improved by 1000X, I think it's become less and less acceptable for a systems language to be cache-unaware.

> There is a very deep, and sad, realization underlying all this. [...] after 50 years, we're still not terribly good at filling and copying memory.

Summarizing everyone's comments: nobody thinks that a fast large memcpy is terribly important, but a slow memcpy seems philosophically wrong.

I wanted to work on getting array operations to be cache-aware. Of course, the simplest possible array operation is a[] = b[]. On a Core2 with 1 MB/core of L2 cache, DMD's memcpy performance drops to 0.6 bytes/cycle once you fall out of the L2 cache, but a 64-bit CPU can do 8 bytes/cycle under optimum conditions, when the data is coming from the L1 cache. So I figured it doesn't make sense to optimize the more complex operations while the trivial one has so much potential for improvement.
Mar 27 2009
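The in-cache vs. out-of-cache cliff Don quantifies (8 bytes/cycle from L1 down to 0.6 bytes/cycle past L2) can be reproduced with a crude microbenchmark: time repeated memcpy passes over buffers on either side of the cache size and compare throughput. A rough C sketch; the function name `copy_mb_per_s` is ours, and clock()-based timing makes the numbers indicative only:

```c
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Time `iters` memcpy passes over an n-byte buffer and return MB/s.
   Crude: clock() has coarse resolution, so use enough iterations. */
double copy_mb_per_s(size_t n, int iters)
{
    unsigned char *src = malloc(n), *dst = malloc(n);
    if (!src || !dst) { free(src); free(dst); return 0.0; }
    memset(src, 1, n);
    volatile unsigned char sink = 0;   /* defeat dead-code elimination */
    clock_t t0 = clock();
    for (int i = 0; i < iters; i++) {
        memcpy(dst, src, n);
        sink ^= dst[n - 1];
    }
    clock_t t1 = clock();
    (void)sink;
    double secs = (double)(t1 - t0) / CLOCKS_PER_SEC;
    double mb = (double)n * iters / (1024.0 * 1024.0);
    free(src); free(dst);
    return secs > 0 ? mb / secs : 0.0;
}
```

Comparing, say, `copy_mb_per_s(32 * 1024, 20000)` against `copy_mb_per_s(32 * 1024 * 1024, 20)` should typically show the in-cache figure several times higher, which is the gap a cache-aware copy tries to narrow.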
Don wrote:
> The next D2 runtime will include my cache-size detection code. [...] Do people actually copy megabyte-sized arrays? Is it worth spending any more time on it?

I say go for it!
Mar 26 2009
Don wrote:
> The next D2 runtime will include my cache-size detection code. [...] Do people actually copy megabyte-sized arrays? Is it worth spending any more time on it?

I don't use large arrays very often. When I do, I would not copy them if I could avoid it. Usually I keep catenating to an array up to a certain point, and from then on I only ever need to read from it, with no copying ever necessary. So I would rarely, if ever, benefit from this.
Mar 26 2009
Don schrieb:
> The next D2 runtime will include my cache-size detection code. [...] Do people actually copy megabyte-sized arrays? Is it worth spending any more time on it?

Well, arrays > 32K aren't that unusual, especially in scientific computing. Even a small 200x200 matrix of doubles takes up 40000 * 8 = 320,000 bytes.
Mar 26 2009
On 26/03/2009 20:08, Don wrote:
> BTW: I tested the memcpy() code provided in AMD's 1992 optimisation manual, and in Intel's 2007 manual. [...] I've noticed that AMD's docs are usually greatly superior to Intel's, but this time the difference is unbelievable.

Don, have you seen Agner Fog's memcpy() and memmove() implementations included with the most recent versions of his manuals? In the unaligned case they read two XMM words and shift/combine them into the target alignment, so all loads and stores are aligned. Pretty cool. He says (modestly):

; This method is 2 - 6 times faster than the implementations in the
; standard C libraries (MS, Gnu) when src or dest are misaligned.
; When src and dest are aligned by 16 (relative to each other) then this
; function is only slightly faster than the best standard libraries.
Mar 26 2009
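The shift/combine trick Thomas describes also works with plain 64-bit registers: perform every load at an aligned address, and rebuild each misaligned source word from two adjacent aligned words with shifts. A little-endian C sketch with two stated assumptions: the function name is ours, and the source buffer must have at least 8 readable bytes of slack past the end, since the combining loop can read one aligned word beyond the data (a production version would handle that boundary carefully). The 8-byte `memcpy` calls compile down to single aligned loads and stores on mainstream compilers, while staying within strict-aliasing rules.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy n bytes from a possibly-misaligned src using only aligned
   8-byte loads: each output word is built from two adjacent aligned
   source words, shifted into place (little-endian).
   Assumes 8 readable slack bytes exist past src + n. */
void shift_combine_copy(void *dst, const void *src, size_t n)
{
    uintptr_t off = (uintptr_t)src & 7;        /* src misalignment */
    if (off == 0) {                            /* already aligned */
        memcpy(dst, src, n);
        return;
    }
    const char *base = (const char *)src - off;   /* round down to 8 */
    char *d = dst;
    unsigned shift = (unsigned)(8 * off);      /* 8..56 bits */
    uint64_t lo, hi;
    memcpy(&lo, base, 8);                      /* one aligned load */
    size_t words = n / 8;
    for (size_t i = 0; i < words; i++) {
        memcpy(&hi, base + 8 * (i + 1), 8);    /* next aligned word */
        uint64_t w = (lo >> shift) | (hi << (64 - shift));
        memcpy(d + 8 * i, &w, 8);
        lo = hi;                               /* reuse previous load */
    }
    /* remaining n % 8 tail bytes */
    memcpy(d + 8 * words, (const char *)src + 8 * words, n % 8);
}
```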
Thomas Moran wrote:
> Don, have you seen Agner Fog's memcpy() and memmove() implementations included with the most recent versions of his manuals? [...]

I'm aware of Agner's code (it was a motivation), but I deliberately haven't looked at it since it's GPLed. BTW, I already have copy-with-shifting implemented for the implementation of bigint.
Mar 27 2009
On Thu, 26 Mar 2009 21:08:40 +0100, Don wrote:
> Is it worth spending any more time on it?

It's a basic building block for many programs. Let the code flow! :p
Mar 26 2009
Don wrote:
> The next D2 runtime will include my cache-size detection code. [...] Do people actually copy megabyte-sized arrays? Is it worth spending any more time on it?

The applications that I write usually work with matrices of size 600x600 up to 2000x2000, and since they are doubles, that is a good chunk of memory. Unleash the optimizations!

JC
Mar 27 2009