
digitalmars.D - Andrei Alexandrescu needs to read this

reply welkam <wwwelkam gmail.com> writes:
I watched many of his talks and he frequently talks about 
optimizations that produce single-digit % speedups in 
frequently used algorithms, but he doesn't provide adequate 
proof that his change in algorithm was the reason we see the 
performance difference. Modern CPUs are sensitive to many 
things, and one of them is code layout in memory. Hot loops are 
the most susceptible to this, to the point where changing the 
user name under which the executable is run changes performance. 
The paper below goes deeper into this.

Producing Wrong Data Without Doing Anything Obviously Wrong!
https://users.cs.northwestern.edu/~robby/courses/322-2013-spring/mytkowicz-wrong-data.pdf
Oct 23 2019
next sibling parent reply Jonathan Marler <johnnymarler gmail.com> writes:
On Wednesday, 23 October 2019 at 21:37:26 UTC, welkam wrote:
 I watched many of his talks and he frequently talks about 
 optimizations that produce single-digit % speedups in 
 frequently used algorithms, but he doesn't provide adequate 
 proof that his change in algorithm was the reason we see the 
 performance difference. Modern CPUs are sensitive to many 
 things, and one of them is code layout in memory. Hot loops are 
 the most susceptible to this, to the point where changing the 
 user name under which the executable is run changes performance. 
 The paper below goes deeper into this.

 Producing Wrong Data Without Doing Anything Obviously Wrong!
 https://users.cs.northwestern.edu/~robby/courses/322-2013-spring/mytkowicz-wrong-data.pdf
That's why Andrei says "always measure". He understands how complex modern CPUs are and that it's basically pointless to try to predict performance. He has some good talks where he shows that one of the biggest causes of performance problems is not understanding how the processor cache works, but his point was that you'll never be able to theoretically predict the performance of hardware in today's world. Always measure.

What I find funny is that there are a lot of clever tricks you can do to make your code execute fewer operations, but with modern CPUs it's more about making your code more predictable so that the cache can predict what to load next and which branches you're more likely to take. So in a way, as CPUs get smarter, you want to make your code "dumber" (i.e. more predictable) in order to get the best performance. When hardware was "dumber", it was better to make your code smarter. An odd switch in paradigms.
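As a rough illustration of the branch-predictability point (a minimal sketch, not from any of the talks; the array size, value range and repeat counts are arbitrary): the loop below does identical work on identical values, but the branch is unpredictable on shuffled data and almost perfectly predictable once the data is sorted.

    import std.algorithm : sort;
    import std.datetime.stopwatch : StopWatch, AutoStart;
    import std.random : uniform;
    import std.stdio : writeln;

    __gshared long sink;   // keep the result live so the loop isn't optimized away

    void sumAbove(const(int)[] data)
    {
        long total;
        foreach (x; data)
            if (x >= 128)          // this branch is what the predictor has to guess
                total += x;
        sink = total;
    }

    void main()
    {
        auto shuffled = new int[](1_000_000);
        foreach (ref x; shuffled)
            x = uniform(0, 256);   // random values: unpredictable branch pattern

        auto ordered = shuffled.dup;
        ordered.sort();            // same values, same work, predictable branch pattern

        auto sw = StopWatch(AutoStart.no);

        sw.start();
        foreach (_; 0 .. 20) sumAbove(shuffled);
        sw.stop();
        writeln("shuffled: ", sw.peek);

        sw.reset();
        sw.start();
        foreach (_; 0 .. 20) sumAbove(ordered);
        sw.stop();
        writeln("sorted:   ", sw.peek);
    }

On typical hardware, unless the compiler turns the branch into a conditional move, the shuffled pass comes out noticeably slower even though the instructions and the data are the same; only the predictability changed.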
Oct 23 2019
next sibling parent reply welkam <wwwelkam gmail.com> writes:
On Wednesday, 23 October 2019 at 22:03:29 UTC, Jonathan Marler 
wrote:

 That's why Andrei says "always measure".
I see you didn't read the paper. The kind of performance measurement you talk about, and that Andrei does, measures two things at once: 1. the change due to the code change and 2. the change due to code layout changes. When the performance change is an order of magnitude, you can safely assume it was because of the code change you made, but when the difference is less than 10% it becomes unclear what is actually responsible for that difference. If you had read the paper you would find out that gcc's -O3 changes performance over -O2 by anywhere from -8% to +12% on the same application.

Simple measurement is not sufficient to conclude that your change in algorithm is what is responsible for a measured performance increase.
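To make the attribution problem concrete, here is a minimal D sketch (the two variants, the loop sizes and the repeat counts are all placeholders) of repeating a comparison enough times that the run-to-run spread becomes visible. If the spread of each variant is on the same order as the difference between them, the measurement alone cannot tell you which change was responsible.

    import std.algorithm : sort;
    import std.datetime.stopwatch : benchmark;
    import std.stdio : writefln;

    __gshared long sink;   // keep results live so the work isn't optimized away

    // Two stand-ins for "before" and "after" versions of an algorithm.
    void variantA() { long s; foreach (i; 0 .. 200_000) s += i % 7; sink = s; }
    void variantB() { long s; foreach (i; 0 .. 200_000) s += i % 8; sink = s; }

    void main()
    {
        long[] a, b;
        foreach (run; 0 .. 30)
        {
            // Time 200 calls of each variant per run, 30 runs in total.
            auto r = benchmark!(variantA, variantB)(200);
            a ~= r[0].total!"usecs";
            b ~= r[1].total!"usecs";
        }
        a.sort();
        b.sort();
        writefln("A: min=%d median=%d max=%d usecs", a[0], a[$ / 2], a[$ - 1]);
        writefln("B: min=%d median=%d max=%d usecs", b[0], b[$ / 2], b[$ - 1]);
    }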
Oct 23 2019
parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Wed, Oct 23, 2019 at 11:20:07PM +0000, welkam via Digitalmars-d wrote:
[...]
 When the performance change is an order of magnitude, you can safely
 assume it was because of the code change you made, but when the
 difference is less than 10% it becomes unclear what is actually
 responsible for that difference. If you had read the paper you would
 find out that gcc's -O3 changes performance over -O2 by anywhere from
 -8% to +12% on the same application.
 
 Simple measurement is not sufficient to conclude that your change in
 algorithm is what is responsible for a measured performance increase.
Yeah, I tend to get skeptical when people start micro-optimizing for small performance increases. When it's a 30-40% or higher increase, then I'd say it's reasonably safe to conclude that the algorithm change was responsible. But if it's 2% or 5%, it's harder to be confident you aren't being misled by other factors. Also, I tend to ignore differences of less than 0.5-1s in benchmarks, because it's hard to tell whether a 0.002s improvement is caused by the code or by other factors. When people start optimizing over sub-second performance increases I start getting suspicious.

As a general principle, I'd say that if a set of benchmarks is being compared, they need to be run in the *exact* same environment. If a set of measurements was made 5 days ago and we compare them with measurements made today, there's no accounting for what subtle system differences may have crept in in the meantime that may throw off the results.

This paper raises even more interesting points, though, about performance differences across systems or CPU models. For this, I'd propose that any benchmarks we're basing algorithm decisions on ought to be verified on at least two (preferably more) divergent systems. E.g., if running benchmark B on a Debian Linux server leads to the conclusion that algorithm A1 is better than A2, then I'd say we should check whether running B on a Windows machine leads to the same conclusion. Or if the code change is Linux-specific, test it also on a Slackware desktop system to see if we get different conclusions from a Debian server system. The more variety incorporated into the sample set, the better.

However, I felt the point the paper makes about address randomization could actually be turned into a *benefit*: rather than turn it off, run the exact same benchmark, say, 500 times with ASLR *enabled*, which should balance out any biases by incorporating into the results both beneficial and detrimental address alignments that may have resulted from ASLR. If you draw conclusions from benchmarks taken with ASLR disabled, you run the risk that the algorithm performs better without ASLR but worse in typical user environments, which *would* have ASLR enabled.
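A minimal sketch of that kind of setup (the "./benchmark_binary" name is a placeholder, and each measurement includes process startup): launching a fresh process per run lets ASLR hand out a different layout each time, so the reported spread includes layout effects instead of hiding them behind a single number.

    import std.algorithm : sort, sum;
    import std.datetime.stopwatch : StopWatch, AutoStart;
    import std.process : execute;
    import std.stdio : writefln;

    void main()
    {
        double[] seconds;
        foreach (run; 0 .. 500)
        {
            auto sw = StopWatch(AutoStart.yes);
            auto r = execute(["./benchmark_binary"]);   // placeholder executable
            sw.stop();
            if (r.status == 0)
                seconds ~= sw.peek.total!"usecs" / 1e6;
        }

        seconds.sort();
        writefln("runs=%s  min=%.3f  median=%.3f  mean=%.3f  max=%.3f (s)",
                 seconds.length, seconds[0], seconds[$ / 2],
                 seconds.sum / seconds.length, seconds[$ - 1]);
    }

Comparing the whole distributions of two candidate algorithms, rather than one number each, is what tells you whether a small difference survives the layout noise.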
Another problem with benchmark-based optimizations, one the paper didn't mention, is that you run the risk of optimizing for the benchmark at the expense of actual, real-world software. Typical software, for example, doesn't run memcpy 50 million times in a tight loop; typically memcpy calls are sprinkled among a whole bunch of other stuff. If you micro-optimize memcpy to beat the benchmark, you could be misled by CPU cache / branch predictor effects which would *not* trigger in actual application software, because the usage pattern is completely different. You could potentially be making memcpy run *slower* in Excel even though it runs faster in a benchmark, for example.

An anecdote: once I wrote several D analogues of the Unix wc utility and ran a comparison with my Debian distro's stock wc utility (which is the GNU version). For GNU wc, the job basically amounted to calling memchr on newline characters. In the D versions I used various different algorithms, including std.mmfile, std.parallelism, reading the file the traditional way by blocks, etc. Then I ran the various versions of wc on data sets with different characteristics.

I discovered something very interesting: GNU wc was generally on par with, or outperformed, the D versions of the code for files that contained long lines, but performed more poorly when given files that contained short lines.

Glancing at the glibc source code revealed why: glibc's memchr used an elaborate bit hack based algorithm that scanned the target string 8 bytes at a time. This required the data to be aligned, however, so when the string was not aligned, it had to manually process up to 7 bytes at either end of the string with a different algorithm. So when the lines were long, the overall performance was dominated by the 8-byte at a time scanning code, which was very fast for large buffers. However, when given a large number of short strings, the overhead of setting up for the 8-byte scan became more costly than the savings, so it performed more poorly than a naïve byte-by-byte scan.

I confirmed this conclusion by artificially constructing an input file with extremely long lines, and another input file with extremely short lines. Then I compared GNU wc's performance with a naïve D function that basically did a byte-by-byte scan. As expected, wc lost to the naïve scanner on the input file with extremely short lines, but won by a large margin on the file with extremely long lines. Subsequently I was able to duplicate this result by writing the same 8-byte at a time scanner in D.
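For reference, here is a sketch of that word-at-a-time technique in D (the general idea, not glibc's actual code): advance byte-by-byte until aligned, then test 8 bytes per iteration with a zero-byte bit trick, then finish byte-by-byte. For short inputs the head and tail handling dominates, which is exactly the effect described above.

    import std.stdio : writeln;

    ptrdiff_t findByte(const(ubyte)[] buf, ubyte c)
    {
        enum ulong ones  = 0x0101010101010101UL;
        enum ulong highs = 0x8080808080808080UL;
        immutable ulong pattern = ones * c;   // the byte c repeated 8 times

        size_t i = 0;

        // Head: go byte-by-byte until the address is 8-byte aligned.
        while (i < buf.length && (cast(size_t) &buf[i]) % 8 != 0)
        {
            if (buf[i] == c) return cast(ptrdiff_t) i;
            ++i;
        }

        // Body: 8 bytes per iteration. XOR turns matching bytes into zero
        // bytes; (w - ones) & ~w & highs is nonzero iff some byte is zero.
        while (i + 8 <= buf.length)
        {
            ulong w = *cast(const(ulong)*) &buf[i] ^ pattern;
            if ((w - ones) & ~w & highs)
                break;                        // the match is somewhere in this word
            i += 8;
        }

        // Tail (and the word containing the match): byte-by-byte again.
        for (; i < buf.length; ++i)
            if (buf[i] == c) return cast(ptrdiff_t) i;

        return -1;
    }

    void main()
    {
        auto text = cast(const(ubyte)[]) "one\ntwo\nthree\n";
        writeln(findByte(text, '\n'));        // 3
    }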
Taking a step back, I realized that this was a classic case of optimizing for the benchmark: it seemed likely that glibc's memchr was optimized for scanning large buffers, given the kind of algorithm used to implement it. Since benchmarks tend to be written for large test cases, my suspicion was that this algorithm was favored because the author(s) were testing the code on large buffers. But this optimization came at the expense of performance in the small-buffer case.

Which means the actual benefit of this optimization depends on what your application uses memchr for. If you regularly scan large buffers, then glibc's implementation will give you better performance. However, if your application deals with a lot of small strings, you might be better off writing your own naïve byte-by-byte scanning code, because it will actually outperform glibc.

My gut feeling is that a lot of software actually deals with short strings quite frequently: think about your typical Facebook post or text message, or a database of customer names. Your customers aren't going to have names that are several KB long, so if the software uses glibc's memchr on names, it's performing rather poorly. A live chat system sends short messages at a time, and a lot of network protocols also center around short(ish) messages. Large data like video or music generally doesn't go through memchr() because it isn't textual, and even then you tend to process it in blocks of typically 4KB or so. So it's questionable whether glibc's choice of memchr implementation is really optimal in the sense of benefitting the most common applications, rather than just excelling at an artificial benchmark that doesn't accurately represent typical real-world usage.

Another corollary from all this is that sometimes there is no unique optimization that will work best for everybody. There is no "optimal" code that's detached from its surrounding context; you optimize for one use case to the detriment of another.

And unless you have specific use cases in mind, it's hard, or even impossible, to make the "right" decisions -- and this is especially the case for standard libraries that must be as generic as possible. The more generic you get, the harder it is to choose the best algorithm. At a certain point it becomes outright impossible because "best" becomes ill-defined (best for whom?).

And this comes back to my point that I get suspicious when people start trying to squeeze that last 5-10% of performance out of their code. Beware, because you could be optimizing for your benchmark rather than the user's actual environment.

T

-- 
Старый друг лучше новых двух.
Oct 23 2019
next sibling parent welkam <wwwelkam gmail.com> writes:
On Thursday, 24 October 2019 at 00:53:27 UTC, H. S. Teoh wrote:
 you could be optimizing for your benchmark rather than the 
 user's actual environment.
That's the feeling I get when reading blog posts on Rust compiler speed improvements.

The other thing to keep in mind is which CPU resources are limiting your performance. If you change your algorithm to use more D-cache and it shows a speed improvement in microbenchmarks, but your application is already starving for D-cache, you will reduce performance when you add that change to the whole application.
Oct 24 2019
prev sibling parent Jon Degenhardt <jond noreply.com> writes:
On Thursday, 24 October 2019 at 00:53:27 UTC, H. S. Teoh wrote:
 I discovered something very interesting: GNU wc was generally 
 on par with, or outperformed the D versions of the code for 
 files that contained long lines, but performed more poorly when 
 given files that contained short lines.

 Glancing at the glibc source code revealed why: glibc's memchr 
 used an elaborate bit hack based algorithm that scanned the 
 target string 8 bytes at a time. This required the data to be 
 aligned, however, so when the string was not aligned, it had to 
 manually process up to 7 bytes at either end of the string with 
 a different algorithm.  So when the lines were long, the 
 overall performance was dominated by the 8-byte at a time 
 scanning code, which was very fast for large buffers.  However, 
 when given a large number of short strings, the overhead of 
 setting up for the 8-byte scan became more costly than the 
 savings, so it performed more poorly than a naïve byte-by-byte 
 scan.
Interesting observation. On the surface it seems this might also apply to splitter and find when used on narrow strings. I believe these call memchr on narrow strings. A common paradigm is to read lines, then call splitter to identify individual fields. Fields are often short, even when lines are long. --Jon
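A minimal illustration of that paradigm (the field names and values are made up; whether each search actually goes through memchr depends on the element type and the Phobos version):

    import std.algorithm.iteration : splitter;
    import std.stdio : writeln;

    void main()
    {
        // One line, several short fields: each field lookup is a search
        // over a short substring, i.e. the small-buffer case above.
        auto line = "alice\t42\tnew york";
        foreach (field; line.splitter('\t'))
            writeln(field);
    }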
Oct 24 2019
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 10/23/2019 3:03 PM, Jonathan Marler wrote:
 What I find funny is that there are a lot of clever tricks you can do to 
 make your code execute fewer operations, but with modern CPUs it's more 
 about making your code more predictable so that the cache can predict 
 what to load next and which branches you're more likely to take. So in 
 a way, as CPUs get smarter, you want to make your code "dumber" (i.e. 
 more predictable) in order to get the best performance. When hardware 
 was "dumber", it was better to make your code smarter. An odd switch 
 in paradigms.
Keep in mind that starting in the late 70's, CPUs started being designed around the way compilers generate code. (Before then, instruction sets were a wacky collection of seemingly unrelated instructions. Compilers like orthogonality, and specialized instructions to do things like stack frame setup / teardown.) This means that unusual instruction sequences tend to perform less well than the ordinary stuff a compiler generates. It's also true that code optimizers are tuned to what the local C/C++ compiler generates, even if the optimizer is designed to work with multiple diverse languages.
Oct 23 2019
next sibling parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Wed, Oct 23, 2019 at 04:22:08PM -0700, Walter Bright via Digitalmars-d wrote:
 On 10/23/2019 3:03 PM, Jonathan Marler wrote:
 What I find funny is that there are alot of clever tricks you can do
 to make your code execute less operations, but with modern CPUs it's
 more about making your code more predictable so that the cache can
 predict what to load next and which branches you're more likely to
 take.  So in a way, as CPUs get smarter, you want to make your code
 "dumber" (i.e .  more predictable) in order to get the best
 performance.  When hardware was "dumber", it was better to make your
 code smarter.  An odd switch in paradigms.
Indeed! In the old days it was all about minimizing instructions. But nowadays, minimizing instructions may make your code perform worse if you increase the number of branches, thereby causing more branch hazards. On the flip side, some good optimizers can eliminate branch hazards in certain cases, e.g.:

    bool cond;
    x = cond ? y+1 : y;

can be rewritten by the optimizer as:

    x = y + cond;

which allows for a branchless translation into machine code. Generally, though, it's a bad idea to write this sort of optimization in the source code: it runs the risk of confusing the optimizer, which may cause it to be disabled for that piece of code, resulting in poor generated code. It's usually better to just trust the optimizer to do its job.

Another recent development is the occasional divergence of performance characteristics of CPUs across members of the same family, i.e., the same instruction on two different CPU models may perform quite differently.  Meaning that this sort of low-level optimization is really best left to the optimizer to optimize for the actual target CPU, rather than to choose a fixed series of instructions in an asm block that may perform poorly on some CPUs.  (This is also where JIT compilation can win over static compilation, if you ship a generic binary that isn't specifically targeted for the customer's CPU model.)
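Going back to the ternary example above, here is a self-contained version of the two forms (a sketch; whether the compiler actually emits a branch for either one depends on the backend and the target CPU, so inspect the generated asm rather than assuming):

    int branchy(bool cond, int y)
    {
        return cond ? y + 1 : y;   // may or may not compile to a branch
    }

    int branchless(bool cond, int y)
    {
        return y + cond;           // bool promotes to 0 or 1
    }

    void main()
    {
        assert(branchy(true, 4)  == 5 && branchless(true, 4)  == 5);
        assert(branchy(false, 4) == 4 && branchless(false, 4) == 4);
    }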
 Keep in mind that starting in the late 70's, CPUs started being
 designed around the way compilers generate code. (Before then,
 instruction sets were a wacky collection of seemingly unrelated
 instructions. Compilers like orthogonality, and specialized
 instructions to do things like stack frame setup / teardown.)
 
 This means that unusual instruction sequences tend to perform less
 well than the ordinary stuff a compiler generates.
Yeah, nowadays with microcode, you can't trust the surface appearance of the assembly instructions anymore. What looks like the same number of instructions can have very different performance characteristics depending on how it's actually implemented in the microcode.
 It's also true that code optimizers are tuned to what the local C/C++
 compiler generates, even if the optimizer is designed to work with
 multiple diverse languages.
Interesting, I didn't know this.

T

-- 
Guns don't kill people. Bullets do.
Oct 23 2019
next sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 10/23/2019 4:51 PM, H. S. Teoh wrote:
 It's also true that code optimizers are tuned to what the local C/C++
 compiler generates, even if the optimizer is designed to work with
 multiple diverse languages.
Interesting, I didn't know this.
It's pretty straightforward why - the optimizer developers tend to be C/C++ developers who look at the generated code from their C/C++ compiler.
Oct 23 2019
prev sibling parent reply Mark <smarksc gmail.com> writes:
On Wednesday, 23 October 2019 at 23:51:16 UTC, H. S. Teoh wrote:
 Another recent development is the occasional divergence of 
 performance characteristics of CPUs across members of the same 
 family, i.e., the same instruction on two different CPU models 
 may perform quite differently.  Meaning that this sort of 
 low-level optimization is really best left to the optimizer to 
 optimize for the actual target CPU, rather than to choose a 
 fixed series of instructions in an asm block that may perform 
 poorly on some CPUs.  (This is also where JIT compilation can 
 win over static compilation, if you ship a generic binary that 
 isn't specifically targeted for the customer's CPU model.)

 T
Would it be reasonable to say that modern CPUs basically do JIT compilation of assembly instructions? Or at the very least, that they have a built-in "runtime" that is responsible for all that ILP magic - cache policy algorithms, MESI protocol, the branch predictor and so on. If so, you could argue that the Itanium was an attempt to avoid this "runtime" and transfer all these responsibilities to the compiler and/or programmer. Not a very successful one, apparently. It is also a bit analogous to the GC vs. deterministic manual memory management debate.
Oct 27 2019
next sibling parent reply drug <drug2004 bk.ru> writes:
27.10.2019 23:11, Mark writes:
 On Wednesday, 23 October 2019 at 23:51:16 UTC, H. S. Teoh wrote:
 Another recent development is the occasional divergence of performance 
 characteristics of CPUs across members of the same family, i.e., the 
 same instruction on two different CPU models may perform quite 
 differently.  Meaning that this sort of low-level optimization is 
 really best left to the optimizer to optimize for the actual target 
 CPU, rather than to choose a fixed series of instructions in an asm 
 block that may perform poorly on some CPUs.  (This is also where JIT 
 compilation can win over static compilation, if you ship a generic 
 binary that isn't specifically targeted for the customer's CPU model.)

 T
Would it be reasonable to say that modern CPUs basically do JIT compilation of assembly instructions? Or at the very least, that they have a built-in "runtime" that is responsible for all that ILP magic - cache policy algorithms, MESI protocol, the branch predictor and so on. If so, you could argue that the Itanium was an attempt to avoid this "runtime" and transfer all these responsibilities to the compiler and/or programmer. Not a very successful one, apparently. It is also a bit analogous to the GC vs. deterministic manual memory management debate.
Wasn't Itanium's failure caused by AMD suggesting another architecture that was capable of running existing software, while in the case of Itanium users were forced to recompile their source code? So Itanium failed just because it was incompatible with x86?
Oct 27 2019
parent reply rikki cattermole <rikki cattermole.co.nz> writes:
On 28/10/2019 9:39 AM, drug wrote:
 27.10.2019 23:11, Mark writes:
 On Wednesday, 23 October 2019 at 23:51:16 UTC, H. S. Teoh wrote:
 Another recent development is the occasional divergence of 
 performance characteristics of CPUs across members of the same 
 family, i.e., the same instruction on two different CPU models may 
 perform quite differently.  Meaning that this sort of low-level 
 optimization is really best left to the optimizer to optimize for the 
 actual target CPU, rather than to choose a fixed series of 
 instructions in an asm block that may perform poorly on some CPUs.  
 (This is also where JIT compilation can win over static compilation, 
 if you ship a generic binary that isn't specifically targeted for the 
 customer's CPU model.)

 T
Would it be reasonable to say that modern CPUs basically do JIT compilation of assembly instructions? Or at the very least, that they have a built-in "runtime" that is responsible for all that ILP magic - cache policy algorithms, MESI protocol, the branch predictor and so on. If so, you could argue that the Itanium was an attempt to avoid this "runtime" and transfer all these responsibilities to the compiler and/or programmer. Not a very successful one, apparently. It is also a bit analogous to the GC vs. deterministic manual memory management debate.
Wasn't Itanium's failure caused by AMD suggesting another architecture that was capable of running existing software, while in the case of Itanium users were forced to recompile their source code? So Itanium failed just because it was incompatible with x86?
Intel created Itanium. AMD instead created AMD64 aka x86_64.
Oct 27 2019
parent drug <drug2004 bk.ru> writes:
On 10/28/19 2:03 AM, rikki cattermole wrote:
 On 28/10/2019 9:39 AM, drug wrote:
 27.10.2019 23:11, Mark writes:
 On Wednesday, 23 October 2019 at 23:51:16 UTC, H. S. Teoh wrote:
 Another recent development is the occasional divergence of 
 performance characteristics of CPUs across members of the same 
 family, i.e., the same instruction on two different CPU models may 
 perform quite differently.  Meaning that this sort of low-level 
 optimization is really best left to the optimizer to optimize for 
 the actual target CPU, rather than to choose a fixed series of 
 instructions in an asm block that may perform poorly on some CPUs. 
 (This is also where JIT compilation can win over static compilation, 
 if you ship a generic binary that isn't specifically targeted for 
 the customer's CPU model.)

 T
Would it be reasonable to say that modern CPUs basically do JIT compilation of assembly instructions? Or at the very least, that they have a built-in "runtime" that is responsible for all that ILP magic - cache policy algorithms, MESI protocol, the branch predictor and so on. If so, you could argue that the Itanium was an attempt to avoid this "runtime" and transfer all these responsibilities to the compiler and/or programmer. Not a very successful one, apparently. It is also a bit analogous to the GC vs. deterministic manual memory management debate.
Wasn't Itanium's failure caused by AMD suggesting another architecture that was capable of running existing software, while in the case of Itanium users were forced to recompile their source code? So Itanium failed just because it was incompatible with x86?
Intel created Itanium. AMD instead created AMD64 aka x86_64.
That's a well-known fact, I believe. I meant that Intel planned for Itanium to be the next-generation processor after x86, but AMD extended x86 and created amd64, and Intel's plan failed because to use Itanium you had to recompile everything, whereas with amd64 you just run your software as before.
Oct 31 2019
prev sibling next sibling parent rikki cattermole <rikki cattermole.co.nz> writes:
On 28/10/2019 9:11 AM, Mark wrote:
 On Wednesday, 23 October 2019 at 23:51:16 UTC, H. S. Teoh wrote:
 Another recent development is the occasional divergence of performance 
 characteristics of CPUs across members of the same family, i.e., the 
 same instruction on two different CPU models may perform quite 
 differently.  Meaning that this sort of low-level optimization is 
 really best left to the optimizer to optimize for the actual target 
 CPU, rather than to choose a fixed series of instructions in an asm 
 block that may perform poorly on some CPUs.  (This is also where JIT 
 compilation can win over static compilation, if you ship a generic 
 binary that isn't specifically targeted for the customer's CPU model.)

 T
Would it be reasonable to say that modern CPUs basically do JIT compilation of assembly instructions? Or at the very least, that they have a built-in "runtime" that is responsible for all that ILP magic - cache policy algorithms, MESI protocol, the branch predictor and so on. If so, you could argue that the Itanium was an attempt to avoid this "runtime" and transfer all these responsibilities to the compiler and/or programmer. Not a very successful one, apparently. It is also a bit analogous to the GC vs. deterministic manual memory management debate.
I have described modern x86 CPUs as an application VM, so I will agree with you :)
Oct 27 2019
prev sibling parent Ola Fosheim Grøstad <ola.fosheim.grostad gmail.com> writes:
On Sunday, 27 October 2019 at 20:11:38 UTC, Mark wrote:
 Would it be reasonable to say that modern CPUs basically do JIT 
 compilation of assembly instructions?
Old CISC CPUs did just that, so they could have high level instructions, and were reprogrammable... (I guess x86 also has that feature, at least to some extent). Then RISC CPUs came in the 90s and didn't do that, thus they were faster and more compact as they could throw out the decoder (the bits in the instructions were carefully designed so that the decoding was instantaneous). But then memory bandwidth became an issue and developers started to write more and more bloated software... x86 is an old CISC architecture and simply survives because of market dominance and R&D investments. Also, with increased real estate (more transistors) they can sacrifice lots of space for the instruction decoding...

The major change over the past 40 years that is causing sensitivity to instruction ordering is that modern CPUs can have deep pipelines (executing many instructions at the same time in a long staging queue), that they are superscalar (execute instructions in parallel), execute instructions speculatively (execute instructions even though the result might be discarded later), do tight-loop instruction unrolling before pipelining, and have various schemes for branch prediction (so that they execute the right sequence after a branch before they know what the branch condition looks like).

Is this a good approach? Probably not... You would get much better performance from the same number of transistors by using many simple cores and a clever memory architecture, but that would not work with current software and development practice...
 branch predictor and so on. If so, you could argue that the 
 Itanium was an attempt to avoid this "runtime" and transfer all 
 these responsibilities to the compiler and/or programmer. Not a 
 very successful one, apparently.
VLIW is not a bad concept, re RISC, but perhaps not profitable in terms of R&D. You probably could get better scheduling of instructions if the optimal schedule was determined statically, as the compiler would then have a "perfect model" of how much of the CPU is being utilized, and could give programmers feedback on it too. But then you would need to recompile software for the actual CPU and have more advanced compilers, and perhaps write software in a different manner to avoid bad branching patterns.

Existing software code bases and a developer culture that is resistant to change do limit progress... People pay to have their existing stuff run well; they won't pay if they have to write new stuff in new ways, unless the benefits are extreme (e.g. GPUs)...
Oct 31 2019
prev sibling parent welkam <wwwelkam gmail.com> writes:
On Wednesday, 23 October 2019 at 23:22:08 UTC, Walter Bright 
wrote:
 instruction sets were a wacky collection of seemingly unrelated 
 instructions.
Yeah, at some point compiler writers and chip designers were in competition over who could produce better code. You can look at those instructions as an attempt at peephole optimization.
Oct 24 2019
prev sibling next sibling parent Daniel Kozak <kozzi11 gmail.com> writes:
On Wed, Oct 23, 2019 at 11:40 PM welkam via Digitalmars-d
<digitalmars-d puremagic.com> wrote:
 I watched many of his talks and he frequently talks about
 optimizations that produce single-digit % speedups in
 frequently used algorithms, but he doesn't provide adequate
 proof that his change in algorithm was the reason we see the
 performance difference. Modern CPUs are sensitive to many
 things, and one of them is code layout in memory. Hot loops are
 the most susceptible to this, to the point where changing the
 user name under which the executable is run changes performance.
 The paper below goes deeper into this.

 Producing Wrong Data Without Doing Anything Obviously Wrong!
 https://users.cs.northwestern.edu/~robby/courses/322-2013-spring/mytkowicz-wrong-data.pdf
https://www.youtube.com/watch?v=r-TLSBdHe1A&t=20s
Oct 24 2019
prev sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 10/23/19 5:37 PM, welkam wrote:
 I watched many of his talks and he frequently talks about optimizations 
 that produce single-digit % speedups in frequently used algorithms, 
 but he doesn't provide adequate proof that his change in algorithm was 
 the reason we see the performance difference. Modern CPUs are sensitive 
 to many things, and one of them is code layout in memory. Hot loops are 
 the most susceptible to this, to the point where changing the user name 
 under which the executable is run changes performance. The paper below 
 goes deeper into this.
 
 Producing Wrong Data Without Doing Anything Obviously Wrong!
 https://users.cs.northwestern.edu/~robby/courses/322-2013-spring/mytkowicz-wrong-data.pdf
I know of the paper and its follow-up by Berger et al. Thanks.
Oct 27 2019