digitalmars.D - Andrei Alexandrescu needs to read this
I have watched many of his talks, and he frequently talks about optimizations that produce single-digit % speedups in frequently used algorithms, but he doesn't provide adequate proof that his change to the algorithm is the reason we see the performance difference. Modern CPUs are sensitive to many things, and one of them is code layout in memory. Hot loops are the most susceptible to this, to the point where changing the user name under which the executable is run changes performance. The paper below goes deeper into this.

Producing Wrong Data Without Doing Anything Obviously Wrong!
https://users.cs.northwestern.edu/~robby/courses/322-2013-spring/mytkowicz-wrong-data.pdf
Oct 23 2019
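As an aside, the kind of experiment the Mytkowicz et al. paper describes can be sketched in a few lines of D: rerun the same binary while changing nothing but the size of the UNIX environment, which shifts the initial stack address and therefore the alignment of code and data. This is only an illustrative harness, not the paper's actual setup; "./benchmark" and the PADDING variable are placeholders.

    // Time the same program while only the size of the environment changes.
    // Whole-process time is measured, so spawn overhead is included; the point
    // is to see run-to-run variation caused purely by layout, not by code.
    import std.array : replicate;
    import std.datetime.stopwatch : StopWatch, AutoStart;
    import std.process : execute, Config;
    import std.stdio : writefln;

    void main()
    {
        foreach (pad; [0, 128, 256, 512, 1024, 2048])
        {
            // Same program, same input; only an unused variable's length differs.
            string[string] env = ["PADDING": "x".replicate(pad)];
            auto sw = StopWatch(AutoStart.yes);
            execute(["./benchmark"], env, Config.newEnv);
            writefln("env padding %4d bytes: %d msecs", pad, sw.peek.total!"msecs");
        }
    }

If the reported times move by a few percent between rows, that is the same order of effect as the single-digit wins under discussion.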
On Wednesday, 23 October 2019 at 21:37:26 UTC, welkam wrote:
I watched many of his talks and he frequently talks about optimization that produce single digits % of speed up in frequently used algorithms but doesnt provide adequate prove that his change in algorithm was the reason why we see performance differences. Modern CPUs are sensitive to many things and one of them is code layout in memory. Hot loops are the most susceptible to this to the point where changing user name under which executable is run changes performance. A paper below goes deeper into this. Producing Wrong Data Without Doing Anything Obviously Wrong! https://users.cs.northwestern.edu/~robby/courses/322-2013-spring/mytkowicz-wrong-data.pdf

That's why Andrei says "always measure". He understands how complex modern CPUs are and that it's basically pointless to try to predict performance. He has some good talks where he shows that one of the biggest causes of performance problems is not understanding how the processor cache works, but his point was that you'll never be able to theoretically predict the performance of hardware in today's world. Always measure.

What I find funny is that there are a lot of clever tricks you can do to make your code execute fewer operations, but with modern CPUs it's more about making your code more predictable so that the CPU can predict what to load next and which branches you're more likely to take. So in a way, as CPUs get smarter, you want to make your code "dumber" (i.e. more predictable) in order to get the best performance. When hardware was "dumber", it was better to make your code smarter. An odd switch in paradigms.
Oct 23 2019
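A concrete (and classic) illustration of the "make it predictable" point, sketched in D: the loop below does the same amount of work either way, but once the data is sorted the branch inside it becomes predictable and the loop typically runs noticeably faster. The exact numbers are machine-dependent; this is an illustration, not a claim about any particular CPU.

    // Same work, different branch predictability.
    import std.algorithm : sort;
    import std.datetime.stopwatch : StopWatch, AutoStart;
    import std.random : uniform;
    import std.stdio : writeln;

    long sumAbove(const int[] a)
    {
        long s = 0;
        foreach (x; a)
            if (x >= 128)       // this is the branch the predictor has to guess
                s += x;
        return s;
    }

    void main()
    {
        auto data = new int[](1 << 20);
        foreach (ref x; data) x = uniform(0, 256);

        long sink;                               // keeps the work from being optimized away
        auto sw = StopWatch(AutoStart.yes);
        foreach (i; 0 .. 100) sink += sumAbove(data);
        writeln("unsorted: ", sw.peek);

        sort(data);
        sw.reset();
        foreach (i; 0 .. 100) sink += sumAbove(data);
        writeln("sorted:   ", sw.peek, " (sink=", sink, ")");
    }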
On Wednesday, 23 October 2019 at 22:03:29 UTC, Jonathan Marler wrote:
That's why Andrei says "always measure".

I see you didn't read the paper. The performance measurement that you talk about, and that Andrei does, measures two things: 1. change due to the code change and 2. change due to code layout changes. When the performance change is an order of magnitude, you can safely assume it was because of the code change you made, but when the difference is less than 10% it becomes unclear what is actually responsible for that difference. If you had read the paper you would have found out that gcc's -O3 changes performance relative to -O2 by anywhere from -8% to +12% on the same application. A simple measurement is not sufficient to conclude that your change in algorithm is what is responsible for the measured performance increase.
Oct 23 2019
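One way to act on this, sketched here as an assumed workflow rather than anything the posters prescribe: run each build many times in fresh processes (so ASLR and other per-run layout effects get resampled) and compare the distributions instead of two single numbers. "./app-old" and "./app-new" are placeholder binaries.

    // Run two builds many times each and report the spread, not one number.
    import std.algorithm : sort, sum;
    import std.datetime.stopwatch : StopWatch, AutoStart;
    import std.process : execute;
    import std.stdio : writefln;

    long[] sample(string exe, int runs)
    {
        long[] msecs;
        foreach (i; 0 .. runs)
        {
            auto sw = StopWatch(AutoStart.yes);
            execute([exe]);                    // a fresh process for every run
            msecs ~= sw.peek.total!"msecs";
        }
        sort(msecs);
        return msecs;
    }

    void main()
    {
        foreach (exe; ["./app-old", "./app-new"])
        {
            auto s = sample(exe, 30);
            writefln("%s  min=%d  median=%d  max=%d  mean=%.1f",
                     exe, s[0], s[$ / 2], s[$ - 1], cast(double) sum(s) / s.length);
        }
    }

If the two distributions overlap heavily, a 5% "improvement" is not distinguishable from layout noise.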
On Wed, Oct 23, 2019 at 11:20:07PM +0000, welkam via Digitalmars-d wrote:
[...]
When performance change is in order of magnitude then you can safely assume it was because of code change you made but when the difference is less than 10% it becomes unclear what actually is responsible for that difference. If you had read the paper you would find out that gcc's -O3 changes performance over -O2 from -8% to +12% on the same application. Simple measurement is not sufficient to conclude that your change in algorithm is what is responsible for measured performance increase.

Yeah, I tend to get skeptical when people start micro-optimizing for small performance increases. When it's a 30%-40% or higher increase, then I'd say it's reasonably safe to conclude that the algorithm change was responsible. But if it's 2% or 5% then it's harder to be confident you aren't being misled by other factors. Also, I tend to ignore differences of less than 1s or 0.5s in benchmarks, because it's hard to tell whether the 0.002s increase is caused by the code, or by other factors. When people start optimizing over sub-second performance increases I start getting suspicious.

As a general principle I'd say that if a set of benchmarks are being compared, they need to be run in the *exact* same environment and then compared. If a set of measurements were made 5 days ago and we compare them with measurements made today, there's no accounting for what subtle system differences may have crept in in the meantime, that may throw off the results.

But this paper reveals even more interesting points, though, about performance differences across systems or CPU models. For this, I'd propose that any benchmarks we're basing algorithm decisions on ought to be verified on at least two (preferably more) divergent systems. E.g., if running benchmark B on a Debian Linux server leads to the conclusion that algorithm A1 is better than A2, then I'd say we should check whether running B on a Windows machine leads to the same conclusion. Or if the code change is Linux-specific, I'd say test it also on a Slackware desktop system to see if we get different conclusions from a Debian server system. The more variety incorporated into the sample set the better.

However, I felt the point the paper makes about address randomization should actually be a *beneficial* point: rather than turn it off, run the exact same benchmark, say, 500 times with ASLR *enabled*, which should balance out any biases by incorporating into the results both beneficial and detrimental address alignments that may have resulted from ASLR. If you make conclusions based on benchmarks taken with ASLR disabled, then you run the risk that the algorithm performs better without ASLR but worse in typical user environments which *would* have ASLR enabled.

Another problem with benchmark-based optimizations, that the paper didn't mention, is that you run into the risk of optimizing for the benchmark at the expense of actual, real-world software. Typical software, for example, doesn't run memcpy 50 million times in a tight loop; typically memcpy calls are sprinkled among a whole bunch of other stuff. If you micro-optimize memcpy to beat the benchmark, you could be misled by CPU cache / branch predictor effects, which would *not* trigger in actual application software because the usage pattern is completely different. You could potentially be making memcpy run *slower* in Excel even though it runs faster in a benchmark, for example.

Anecdote.
Once I wrote several D analogues of the Unix wc utility and ran a comparison with my Debian distro's stock wc utility (which is the GNU version). It basically amounted to calling memchr on newline characters, for GNU wc. In the D versions I used various different algorithms, including using std.mmap, std.parallelism, reading the file the traditional way by blocks, etc.. Then I ran the various versions of wc on various sets of data with different characteristics.

I discovered something very interesting: GNU wc was generally on par with, or outperformed the D versions of the code for files that contained long lines, but performed more poorly when given files that contained short lines.

Glancing at the glibc source code revealed why: glibc's memchr used an elaborate bit hack based algorithm that scanned the target string 8 bytes at a time. This required the data to be aligned, however, so when the string was not aligned, it had to manually process up to 7 bytes at either end of the string with a different algorithm. So when the lines were long, the overall performance was dominated by the 8-byte at a time scanning code, which was very fast for large buffers. However, when given a large number of short strings, the overhead of setting up for the 8-byte scan became more costly than the savings, so it performed more poorly than a naïve byte-by-byte scan.

I confirmed this conclusion by artificially constructing an input file with extremely long lines, and another input file with extremely short lines. Then I compared GNU wc's performance with a naïve D function that basically did a byte-by-byte scan. As expected, wc lost to the naïve scanner on the input file with extremely short lines, but won by a large margin on the file with extremely long lines. Subsequently I was able to duplicate this result in D by writing the same 8-byte at a time scanner in D.

Taking a step back, I realized that this was a classic case of optimizing for the benchmark: it seemed likely that glibc's memchr was optimized for scanning large buffers, given the kind of algorithm used to implement it. Since benchmarks tend to be written for large test cases, my suspicion was that this algorithm was favored because the author(s) were testing the code on large buffers. But this optimization came at the expense of performance in the small buffer case. Which means the actual benefit of this optimization depended on what your application uses memchr for. If you regularly scan large buffers, then glibc's implementation will give you better performance. However, if your application deals with a lot of small strings, you might be better off writing your own naïve byte-by-byte scanning code, because it will actually outperform glibc.

My gut feeling is that a lot of software actually frequently deals with short strings: think about your typical Facebook post or text message, or database of customer names. Your customers aren't going to have names that are several KB long, so if the software uses glibc's memchr on names, then it's performing rather poorly. A live chat system sends short messages at a time, and a lot of network protocols also center around short(ish) messages. Large data like video or music generally don't use memchr() because they aren't textual. And even then, you tend to process them in blocks typically 4KB each or so.
So it's questionable whether glibc's choice of memchr implementation is really optimal in the sense of benefitting the most common applications, rather than just excelling at an artificial benchmark that doesn't accurately represent typical real-world usage.

Another corollary from all this is that sometimes there is no unique optimization that will work best for everybody. There is no "optimal" code that's detached from its surrounding context; you optimize for one use case at the detriment of another. And unless you have specific use cases in mind, it's hard, or even impossible, to make the "right" decisions -- and this is especially the case for standard libraries that must be as generic as possible. The more generic you get, the harder it is to choose the best algorithm. At a certain point it becomes outright impossible because "best" becomes ill-defined (best for whom?).

And this comes back to my point that I get suspicious when people start trying to squeeze that last 5-10% performance out of their code. Beware, because you could be optimizing for your benchmark rather than the user's actual environment.

T -- An old friend is better than two new ones.
Oct 23 2019
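To make the memchr trade-off above concrete, here is a greatly simplified sketch in D of the two approaches: a naive byte loop and a word-at-a-time scanner in the spirit of (but much cruder than) glibc's memchr. It is not glibc's actual code; the alignment head loop and the final byte loop are the setup/teardown cost that only pays off on long buffers.

    import std.stdio : writeln;

    size_t naiveFind(const(ubyte)[] buf, ubyte c)
    {
        foreach (i, b; buf)
            if (b == c) return i;
        return buf.length;
    }

    size_t wordFind(const(ubyte)[] buf, ubyte c)
    {
        enum ulong ones  = 0x0101010101010101UL;
        enum ulong highs = 0x8080808080808080UL;
        immutable ulong pattern = ones * c;        // byte c repeated in every lane

        size_t i = 0;
        // Head: go byte by byte until the read position is 8-byte aligned.
        while (i < buf.length && ((cast(size_t) buf.ptr + i) & 7) != 0)
        {
            if (buf[i] == c) return i;
            ++i;
        }
        // Body: check 8 bytes per iteration using the "is any byte zero" trick.
        while (i + 8 <= buf.length)
        {
            ulong w = *cast(const(ulong)*)(buf.ptr + i) ^ pattern; // match bytes become 0
            if (((w - ones) & ~w & highs) != 0) break;             // a byte in this word matched
            i += 8;
        }
        // Tail (and the word that contained the match): finish byte by byte.
        for (; i < buf.length; ++i)
            if (buf[i] == c) return i;
        return buf.length;
    }

    void main()
    {
        auto line = new ubyte[](64 * 1024);
        line[$ - 1] = '\n';
        assert(wordFind(line, '\n') == naiveFind(line, '\n'));
        writeln("scanners agree; time them on many short lines vs. a few long ones");
    }

Timing both functions over millions of short slices versus a handful of long ones reproduces the effect described in the post.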
On Thursday, 24 October 2019 at 00:53:27 UTC, H. S. Teoh wrote:
you could be optimizing for your benchmark rather than the user's actual environment.

That's the feeling I get when reading blog posts on Rust compiler speed improvements.

The other thing to keep in mind is that you need to be mindful of which CPU resources are limiting your performance. If you change your algorithm to use more D-cache and it shows a speed improvement in a micro-benchmark, but your application is already starved for D-cache, you could reduce performance when you add that change to the whole application.
Oct 24 2019
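welkam's caveat can be seen with a toy experiment like the sketch below (the table and buffer sizes are invented for illustration): the same lookups are timed once while the table has the caches to itself, and once with unrelated work evicting it between bursts.

    import core.time : Duration;
    import std.datetime.stopwatch : StopWatch, AutoStart;
    import std.stdio : writefln;

    int lookup(const int[] table, size_t key)
    {
        return table[key & (table.length - 1)];
    }

    void evict(ubyte[] junk)
    {
        foreach (ref b; junk) b += 1;   // stream through a big buffer, pushing the table out
    }

    void main()
    {
        auto table = new int[](1 << 16);    // 256 KB lookup table
        auto junk  = new ubyte[](8 << 20);  // 8 MB of unrelated working set
        long sink;

        // Micro-benchmark style: the table stays cache-resident the whole time.
        auto sw = StopWatch(AutoStart.yes);
        foreach (i; 0 .. 1_000_000) sink += lookup(table, i);
        writefln("hot cache:  %d usecs", sw.peek.total!"usecs");

        // "Inside the whole application" style: the same 1M lookups are timed,
        // but other work keeps evicting the table between bursts.
        Duration inApp;
        foreach (burst; 0 .. 1_000)
        {
            evict(junk);
            auto t = StopWatch(AutoStart.yes);
            foreach (i; 0 .. 1_000) sink += lookup(table, i);
            inApp += t.peek;
        }
        writefln("cold cache: %d usecs  (sink=%d)", inApp.total!"usecs", sink);
    }

A change that grows the table in exchange for fewer instructions can win the first measurement and lose the second.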
On Thursday, 24 October 2019 at 00:53:27 UTC, H. S. Teoh wrote:I discovered something very interesting: GNU wc was generally on par with, or outperformed the D versions of the code for files that contained long lines, but performed more poorly when given files that contained short lines. Glancing at the glibc source code revealed why: glibc's memchr used an elaborate bit hack based algorithm that scanned the target string 8 bytes at a time. This required the data to be aligned, however, so when the string was not aligned, it had to manually process up to 7 bytes at either end of the string with a different algorithm. So when the lines were long, the overall performance was dominated by the 8-byte at a time scanning code, which was very fast for large buffers. However, when given a large number of short strings, the overhead of setting up for the 8-byte scan became more costly than the savings, so it performed more poorly than a naïve byte-by-byte scan.Interesting observation. On the surface it seems this might also apply to splitter and find when used on narrow strings. I believe these call memchr on narrow strings. A common paradigm is to read lines, then call splitter to identify individual fields. Fields are often short, even when lines are long. --Jon
Oct 24 2019
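For reference, the read-lines-then-split paradigm looks like the sketch below; each splitter step searches for the next delimiter inside what is usually a very short slice, which is exactly the small-buffer case from the memchr discussion. Whether a given Phobos version routes this through memchr on narrow strings is an implementation detail; "data.tsv" is a placeholder file name.

    // Read lines, then split each line into (typically short) fields.
    import std.algorithm : splitter;
    import std.stdio : File, writeln;

    void main()
    {
        size_t fields;
        foreach (line; File("data.tsv").byLine)     // "data.tsv" is a placeholder
            foreach (field; line.splitter('\t'))    // each field is usually a few bytes
                ++fields;
        writeln("fields: ", fields);
    }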
On 10/23/2019 3:03 PM, Jonathan Marler wrote:What I find funny is that there are alot of clever tricks you can do to make your code execute less operations, but with modern CPUs it's more about making your code more predictable so that the cache can predict what to load next and which branches you're more likely to take. So in a way, as CPUs get smarter, you want to make your code "dumber" (i.e . more predictable) in order to get the best performance. When hardware was "dumber", it was better to make your code smarter. An odd switch in paradigms.Keep in mind that starting in the late 70's, CPUs started being designed around the way compilers generate code. (Before then, instruction sets were a wacky collection of seemingly unrelated instructions. Compilers like orthogonality, and specialized instructions to do things like stack frame setup / teardown.) This means that unusual instruction sequences tend to perform less well than the ordinary stuff a compiler generates. It's also true that code optimizers are tuned to what the local C/C++ compiler generates, even if the optimizer is designed to work with multiple diverse languages.
Oct 23 2019
On Wed, Oct 23, 2019 at 04:22:08PM -0700, Walter Bright via Digitalmars-d wrote:
On 10/23/2019 3:03 PM, Jonathan Marler wrote:
What I find funny is that there are alot of clever tricks you can do to make your code execute less operations, but with modern CPUs it's more about making your code more predictable so that the cache can predict what to load next and which branches you're more likely to take. So in a way, as CPUs get smarter, you want to make your code "dumber" (i.e. more predictable) in order to get the best performance. When hardware was "dumber", it was better to make your code smarter. An odd switch in paradigms.
Keep in mind that starting in the late 70's, CPUs started being designed around the way compilers generate code. (Before then, instruction sets were a wacky collection of seemingly unrelated instructions. Compilers like orthogonality, and specialized instructions to do things like stack frame setup / teardown.) This means that unusual instruction sequences tend to perform less well than the ordinary stuff a compiler generates.

Indeed! In the old days it was all about minimizing instructions. But nowadays, minimizing instructions may make your code perform worse if it increases the number of branches, thereby causing more branch hazards. On the flip side, some good optimizers can eliminate branch hazards in certain cases, e.g.:

    bool cond;
    x = cond ? y+1 : y;

can be rewritten by the optimizer as:

    x = y + cond;

which allows for a branchless translation into machine code. Generally, though, it's a bad idea to write this sort of optimization in the source code: it runs the risk of confusing the optimizer, which may cause it to be disabled for that piece of code, resulting in poor generated code. It's usually better to just trust the optimizer to do its job.

Another recent development is the occasional divergence of performance characteristics of CPUs across members of the same family, i.e., the same instruction on two different CPU models may perform quite differently. Meaning that this sort of low-level optimization is really best left to the optimizer to optimize for the actual target CPU, rather than to choose a fixed series of instructions in an asm block that may perform poorly on some CPUs. (This is also where JIT compilation can win over static compilation, if you ship a generic binary that isn't specifically targeted for the customer's CPU model.)

Yeah, nowadays with microcode, you can't trust the surface appearance of the assembly instructions anymore. What looks like the same number of instructions can have very different performance characteristics depending on how it's actually implemented in the microcode.

It's also true that code optimizers are tuned to what the local C/C++ compiler generates, even if the optimizer is designed to work with multiple diverse languages.

Interesting, I didn't know this.

T -- Guns don't kill people. Bullets do.
Oct 23 2019
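Spelling out the rewrite from the post above as compilable D (the function names here are made up for the example): both forms compute the same value, and as the post notes, a decent optimizer will often emit branchless code (a conditional move or setcc-plus-add) for the ternary form anyway, so hand-writing the arithmetic version rarely buys anything.

    // Two equivalent ways to write the conditional increment.
    int selectBranchy(bool cond, int y)
    {
        return cond ? y + 1 : y;
    }

    int selectBranchless(bool cond, int y)
    {
        return y + cond;        // bool promotes to 0 or 1
    }

    unittest
    {
        foreach (y; [-3, 0, 7])
        {
            assert(selectBranchy(false, y) == selectBranchless(false, y));
            assert(selectBranchy(true,  y) == selectBranchless(true,  y));
        }
    }

The equivalence can be checked with `dmd -unittest -main -run file.d`; comparing the generated assembly of both functions at -O is a quick way to see whether the optimizer already does the transformation.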
On 10/23/2019 4:51 PM, H. S. Teoh wrote:It's pretty straightforward why - the optimizer developers tend to be C/C++ developers who look at the generated code from their C/C++ compiler.It's also true that code optimizers are tuned to what the local C/C++ compiler generates, even if the optimizer is designed to work with multiple diverse languages.Interesting, I didn't know this.
Oct 23 2019
On Wednesday, 23 October 2019 at 23:51:16 UTC, H. S. Teoh wrote:Another recent development is the occasional divergence of performance characteristics of CPUs across members of the same family, i.e., the same instruction on two different CPU models may perform quite differently. Meaning that this sort of low-level optimization is really best left to the optimizer to optimize for the actual target CPU, rather than to choose a fixed series of instructions in an asm block that may perform poorly on some CPUs. (This is also where JIT compilation can win over static compilation, if you ship a generic binary that isn't specifically targeted for the customer's CPU model.) TWould it be reasonable to say that modern CPUs basically do JIT compilation of assembly instructions? Or at the very least, that they have a built-in "runtime" that is responsible for all that ILP magic - cache policy algorithms, MESI protocol, the branch predictor and so on. If so, you could argue that the Itanium was an attempt to avoid this "runtime" and transfer all these responsibilities to the compiler and/or programmer. Not a very successful one, apparently. It is also a bit analogous to the GC vs. deterministic manual memory management debate.
Oct 27 2019
On 27.10.2019 23:11, Mark wrote:
On Wednesday, 23 October 2019 at 23:51:16 UTC, H. S. Teoh wrote:
Another recent development is the occasional divergence of performance characteristics of CPUs across members of the same family, i.e., the same instruction on two different CPU models may perform quite differently. Meaning that this sort of low-level optimization is really best left to the optimizer to optimize for the actual target CPU, rather than to choose a fixed series of instructions in an asm block that may perform poorly on some CPUs. (This is also where JIT compilation can win over static compilation, if you ship a generic binary that isn't specifically targeted for the customer's CPU model.) T
Would it be reasonable to say that modern CPUs basically do JIT compilation of assembly instructions? Or at the very least, that they have a built-in "runtime" that is responsible for all that ILP magic - cache policy algorithms, MESI protocol, the branch predictor and so on. If so, you could argue that the Itanium was an attempt to avoid this "runtime" and transfer all these responsibilities to the compiler and/or programmer. Not a very successful one, apparently. It is also a bit analogous to the GC vs. deterministic manual memory management debate.

Wasn't the Itanium failure caused by AMD suggesting another architecture that was capable of running existing software, while in the case of Itanium users were forced to recompile their source code? So Itanium failed just because it was incompatible with x86?
Oct 27 2019
On 28/10/2019 9:39 AM, drug wrote:27.10.2019 23:11, Mark пишет:Intel created Itanium. AMD instead created AMD64 aka x86_64.On Wednesday, 23 October 2019 at 23:51:16 UTC, H. S. Teoh wrote:Wasn't Itanium fail be caused that AMD suggested another architecture that was capable to run existing software while in case of Itanium users was forced to recompile their source code? So Itanium failed just because it was incompatible to x86?Another recent development is the occasional divergence of performance characteristics of CPUs across members of the same family, i.e., the same instruction on two different CPU models may perform quite differently. Meaning that this sort of low-level optimization is really best left to the optimizer to optimize for the actual target CPU, rather than to choose a fixed series of instructions in an asm block that may perform poorly on some CPUs. (This is also where JIT compilation can win over static compilation, if you ship a generic binary that isn't specifically targeted for the customer's CPU model.) TWould it be reasonable to say that modern CPUs basically do JIT compilation of assembly instructions? Or at the very least, that they have a built-in "runtime" that is responsible for all that ILP magic - cache policy algorithms, MESI protocol, the branch predictor and so on. If so, you could argue that the Itanium was an attempt to avoid this "runtime" and transfer all these responsibilities to the compiler and/or programmer. Not a very successful one, apparently. It is also a bit analogous to the GC vs. deterministic manual memory management debate.
Oct 27 2019
On 10/28/19 2:03 AM, rikki cattermole wrote:
On 28/10/2019 9:39 AM, drug wrote:
27.10.2019 23:11, Mark пишет:
On Wednesday, 23 October 2019 at 23:51:16 UTC, H. S. Teoh wrote:
Another recent development is the occasional divergence of performance characteristics of CPUs across members of the same family, i.e., the same instruction on two different CPU models may perform quite differently. Meaning that this sort of low-level optimization is really best left to the optimizer to optimize for the actual target CPU, rather than to choose a fixed series of instructions in an asm block that may perform poorly on some CPUs. (This is also where JIT compilation can win over static compilation, if you ship a generic binary that isn't specifically targeted for the customer's CPU model.) T
Would it be reasonable to say that modern CPUs basically do JIT compilation of assembly instructions? Or at the very least, that they have a built-in "runtime" that is responsible for all that ILP magic - cache policy algorithms, MESI protocol, the branch predictor and so on. If so, you could argue that the Itanium was an attempt to avoid this "runtime" and transfer all these responsibilities to the compiler and/or programmer. Not a very successful one, apparently. It is also a bit analogous to the GC vs. deterministic manual memory management debate.
Wasn't Itanium fail be caused that AMD suggested another architecture that was capable to run existing software while in case of Itanium users was forced to recompile their source code? So Itanium failed just because it was incompatible to x86?
Intel created Itanium. AMD instead created AMD64 aka x86_64.

That's a well-known fact, I believe. I meant that Intel planned for Itanium to be the next-generation processor after x86. But AMD extended x86 and created amd64, and Intel's plan failed because to use Itanium you had to recompile everything, while to use amd64 you could just run your software as before.
Oct 31 2019
On 28/10/2019 9:11 AM, Mark wrote:On Wednesday, 23 October 2019 at 23:51:16 UTC, H. S. Teoh wrote:I have described modern x86 cpu's as an application VM, so I will agree with you :)Another recent development is the occasional divergence of performance characteristics of CPUs across members of the same family, i.e., the same instruction on two different CPU models may perform quite differently. Meaning that this sort of low-level optimization is really best left to the optimizer to optimize for the actual target CPU, rather than to choose a fixed series of instructions in an asm block that may perform poorly on some CPUs. (This is also where JIT compilation can win over static compilation, if you ship a generic binary that isn't specifically targeted for the customer's CPU model.) TWould it be reasonable to say that modern CPUs basically do JIT compilation of assembly instructions? Or at the very least, that they have a built-in "runtime" that is responsible for all that ILP magic - cache policy algorithms, MESI protocol, the branch predictor and so on. If so, you could argue that the Itanium was an attempt to avoid this "runtime" and transfer all these responsibilities to the compiler and/or programmer. Not a very successful one, apparently. It is also a bit analogous to the GC vs. deterministic manual memory management debate.
Oct 27 2019
On Sunday, 27 October 2019 at 20:11:38 UTC, Mark wrote:
Would it be reasonable to say that modern CPUs basically do JIT compilation of assembly instructions?

Old CISC CPUs did just that, so they could have high-level instructions, and were reprogrammable... (I guess x86 also has that feature, at least to some extent). Then RISC CPUs came in the 90s and didn't do that, thus they were faster and more compact as they could throw out the decoder (the bits in the instructions were carefully designed so that the decoding was instantaneous). But then memory bandwidth became an issue and developers started to write more and more bloated software... x86 is an old CISC architecture and simply survives because of market dominance and R&D investments. Also, with increased real estate (more transistors) they can sacrifice lots of space for the instruction decoding...

The major change over the past 40 years that is causing sensitivity to instruction ordering is that modern CPUs can have deep pipelines (executing many instructions at the same time in a long staging queue), that they are superscalar (execute instructions in parallel), execute instructions speculatively (execute instructions even though the result might be discarded later), do tight-loop instruction unrolling before pipelining, and have various schemes for branch prediction (so that they execute the right sequence after a branch before they know what the branch condition looks like).

Is this a good approach? Probably not... You would get much better performance from the same number of transistors by using many simple cores and a clever memory architecture, but that would not work with current software and development practice...

branch predictor and so on. If so, you could argue that the Itanium was an attempt to avoid this "runtime" and transfer all these responsibilities to the compiler and/or programmer. Not a very successful one, apparently.

VLIW is not a bad concept, re RISC, but perhaps not profitable in terms of R&D. You probably could get better scheduling of instructions if it was determined to be optimal statically. As the compiler would then have a "perfect model" of how much of the CPU is being utilized, and could give programmers feedback on it too. But then you would need to recompile software for the actual CPU and have more advanced compilers, and perhaps write software in a different manner to avoid bad branching patterns.

Existing software code bases and a developer culture that is resistant to change do limit progress... People pay to have their existing stuff run well; they won't pay if they have to write new stuff in new ways, unless the benefits are extreme (e.g. GPUs)...
Oct 31 2019
On Wednesday, 23 October 2019 at 23:22:08 UTC, Walter Bright wrote:
instruction sets were a wacky collection of seemingly unrelated instructions.

Yeah, at some point compiler writers and chip designers were competing over who could produce better code. You can look at those instructions as an attempt at peephole optimization.
Oct 24 2019
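For readers unfamiliar with the term, a peephole optimizer just slides a small window over the instruction stream and rewrites patterns it recognizes as cheaper equivalents. The sketch below uses a made-up toy instruction set purely to illustrate the idea; it is not how any real backend represents code.

    // Toy peephole pass over an invented instruction list.
    import std.stdio : writeln;

    struct Insn { string op; string[] args; }

    Insn[] peephole(Insn[] code)
    {
        Insn[] result;
        for (size_t i = 0; i < code.length; ++i)
        {
            // Pattern: "mul reg, 2"  =>  "shl reg, 1"
            if (code[i].op == "mul" && code[i].args.length == 2 && code[i].args[1] == "2")
            {
                result ~= Insn("shl", [code[i].args[0], "1"]);
                continue;
            }
            // Pattern: "push X" immediately followed by "pop X" is a no-op pair.
            if (i + 1 < code.length && code[i].op == "push" && code[i + 1].op == "pop"
                && code[i].args == code[i + 1].args)
            {
                ++i;            // drop both instructions
                continue;
            }
            result ~= code[i];
        }
        return result;
    }

    void main()
    {
        auto before = [Insn("mul", ["r1", "2"]), Insn("push", ["r2"]), Insn("pop", ["r2"])];
        writeln(peephole(before));   // [Insn("shl", ["r1", "1"])]
    }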
On Wed, Oct 23, 2019 at 11:40 PM welkam via Digitalmars-d <digitalmars-d puremagic.com> wrote:I watched many of his talks and he frequently talks about optimization that produce single digits % of speed up in frequently used algorithms but doesnt provide adequate prove that his change in algorithm was the reason why we see performance differences. Modern CPUs are sensitive to many things and one of them is code layout in memory. Hot loops are the most susceptible to this to the point where changing user name under which executable is run changes performance. A paper below goes deeper into this. Producing Wrong Data Without Doing Anything Obviously Wrong! https://users.cs.northwestern.edu/~robby/courses/322-2013-spring/mytkowicz-wrong-data.pdfhttps://www.youtube.com/watch?v=r-TLSBdHe1A&t=20s
Oct 24 2019
On 10/23/19 5:37 PM, welkam wrote:
I watched many of his talks and he frequently talks about optimization that produce single digits % of speed up in frequently used algorithms but doesnt provide adequate prove that his change in algorithm was the reason why we see performance differences. Modern CPUs are sensitive to many things and one of them is code layout in memory. Hot loops are the most susceptible to this to the point where changing user name under which executable is run changes performance. A paper below goes deeper into this. Producing Wrong Data Without Doing Anything Obviously Wrong! https://users.cs.northwestern.edu/~robby/courses/322-2013-spring/mytkowicz-wrong-data.pdf

I know of the paper and its follow-up by Berger et al. Thanks.
Oct 27 2019