
digitalmars.D - Andrei Alexandrescu needs to read this

reply welkam <wwwelkam gmail.com> writes:
I watched many of his talks and he frequently talks about 
optimizations that produce single-digit % speedups in 
frequently used algorithms, but he doesn't provide adequate 
proof that his change in algorithm was the reason we see the 
performance difference. Modern CPUs are sensitive to many 
things, and one of them is code layout in memory. Hot loops are 
the most susceptible to this, to the point where changing the 
user name under which the executable is run changes performance. 
The paper below goes deeper into this.

Producing Wrong Data Without Doing Anything Obviously Wrong!
https://users.cs.northwestern.edu/~robby/courses/322-2013-spring/mytkowicz-wrong-data.pdf
Oct 23 2019
next sibling parent reply Jonathan Marler <johnnymarler gmail.com> writes:
On Wednesday, 23 October 2019 at 21:37:26 UTC, welkam wrote:
 I watched many of his talks and he frequently talks about 
 optimizations that produce single-digit % speedups in 
 frequently used algorithms, but he doesn't provide adequate 
 proof that his change in algorithm was the reason we see the 
 performance difference. Modern CPUs are sensitive to many 
 things, and one of them is code layout in memory. Hot loops are 
 the most susceptible to this, to the point where changing the 
 user name under which the executable is run changes performance. 
 The paper below goes deeper into this.

 Producing Wrong Data Without Doing Anything Obviously Wrong!
 https://users.cs.northwestern.edu/~robby/courses/322-2013-spring/mytkowicz-wrong-data.pdf
That's why Andrei says "always measure". He understands how complex modern CPUs are and that it's basically pointless to try to predict performance. He has some good talks where he shows that one of the biggest causes of performance problems is not understanding how the processor cache works, but his point was that you'll never be able to theoretically predict the performance of hardware in today's world. Always measure.

What I find funny is that there are a lot of clever tricks you can do to make your code execute fewer operations, but with modern CPUs it's more about making your code more predictable so that the cache can predict what to load next and which branches you're more likely to take. So in a way, as CPUs get smarter, you want to make your code "dumber" (i.e. more predictable) in order to get the best performance. When hardware was "dumber", it was better to make your code smarter. An odd switch in paradigms.
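As a rough illustration of the branch-predictability point (a minimal sketch, not from any of the talks; the array size, value range and repeat counts are arbitrary): the loop below does identical work on identical values, but the branch is unpredictable on shuffled data and almost perfectly predictable once the data is sorted.

    import std.algorithm : sort;
    import std.datetime.stopwatch : StopWatch, AutoStart;
    import std.random : uniform;
    import std.stdio : writeln;

    __gshared long sink;   // keep the result live so the loop isn't optimized away

    void sumAbove(const(int)[] data)
    {
        long total;
        foreach (x; data)
            if (x >= 128)          // this branch is what the predictor has to guess
                total += x;
        sink = total;
    }

    void main()
    {
        auto shuffled = new int[](1_000_000);
        foreach (ref x; shuffled)
            x = uniform(0, 256);   // random values: unpredictable branch pattern

        auto ordered = shuffled.dup;
        ordered.sort();            // same values, same work, predictable branch pattern

        auto sw = StopWatch(AutoStart.no);

        sw.start();
        foreach (_; 0 .. 20) sumAbove(shuffled);
        sw.stop();
        writeln("shuffled: ", sw.peek);

        sw.reset();
        sw.start();
        foreach (_; 0 .. 20) sumAbove(ordered);
        sw.stop();
        writeln("sorted:   ", sw.peek);
    }

On typical hardware, unless the compiler turns the branch into a conditional move, the shuffled pass comes out noticeably slower even though the instructions and the data are the same; only the predictability changed.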
Oct 23 2019
next sibling parent reply welkam <wwwelkam gmail.com> writes:
On Wednesday, 23 October 2019 at 22:03:29 UTC, Jonathan Marler 
wrote:

 That's why Andrei says "always measure".
I see you didn't read the paper. The kind of performance measurement you talk about, and that Andrei does, measures two things at once: 1. the change due to the code change and 2. the change due to code layout changes. When the performance change is an order of magnitude, you can safely assume it was because of the code change you made, but when the difference is less than 10% it becomes unclear what is actually responsible for that difference. If you had read the paper you would find out that gcc's -O3 changes performance over -O2 by anywhere from -8% to +12% on the same application.

Simple measurement is not sufficient to conclude that your change in algorithm is what is responsible for a measured performance increase.
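To make the attribution problem concrete, here is a minimal D sketch (the two variants, the loop sizes and the repeat counts are all placeholders) of repeating a comparison enough times that the run-to-run spread becomes visible. If the spread of each variant is on the same order as the difference between them, the measurement alone cannot tell you which change was responsible.

    import std.algorithm : sort;
    import std.datetime.stopwatch : benchmark;
    import std.stdio : writefln;

    __gshared long sink;   // keep results live so the work isn't optimized away

    // Two stand-ins for "before" and "after" versions of an algorithm.
    void variantA() { long s; foreach (i; 0 .. 200_000) s += i % 7; sink = s; }
    void variantB() { long s; foreach (i; 0 .. 200_000) s += i % 8; sink = s; }

    void main()
    {
        long[] a, b;
        foreach (run; 0 .. 30)
        {
            // Time 200 calls of each variant per run, 30 runs in total.
            auto r = benchmark!(variantA, variantB)(200);
            a ~= r[0].total!"usecs";
            b ~= r[1].total!"usecs";
        }
        a.sort();
        b.sort();
        writefln("A: min=%d median=%d max=%d usecs", a[0], a[$ / 2], a[$ - 1]);
        writefln("B: min=%d median=%d max=%d usecs", b[0], b[$ / 2], b[$ - 1]);
    }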
Oct 23 2019
parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Wed, Oct 23, 2019 at 11:20:07PM +0000, welkam via Digitalmars-d wrote:
[...]
 When the performance change is an order of magnitude, you can safely
 assume it was because of the code change you made, but when the
 difference is less than 10% it becomes unclear what is actually
 responsible for that difference. If you had read the paper you would
 find out that gcc's -O3 changes performance over -O2 by anywhere from
 -8% to +12% on the same application.
 
 Simple measurement is not sufficient to conclude that your change in
 algorithm is what is responsible for a measured performance increase.
Yeah, I tend to get skeptical when people start micro-optimizing for small performance increases. When it's a 30-40% or higher increase, then I'd say it's reasonably safe to conclude that the algorithm change was responsible. But if it's 2% or 5%, it's harder to be confident you aren't being misled by other factors. Also, I tend to ignore differences of less than 0.5-1s in benchmarks, because it's hard to tell whether a 0.002s improvement is caused by the code or by other factors. When people start optimizing over sub-second performance increases I start getting suspicious.

As a general principle, I'd say that if a set of benchmarks is being compared, they need to be run in the *exact* same environment. If a set of measurements was made 5 days ago and we compare them with measurements made today, there's no accounting for what subtle system differences may have crept in in the meantime that may throw off the results.

This paper raises even more interesting points, though, about performance differences across systems or CPU models. For this, I'd propose that any benchmarks we're basing algorithm decisions on ought to be verified on at least two (preferably more) divergent systems. E.g., if running benchmark B on a Debian Linux server leads to the conclusion that algorithm A1 is better than A2, then I'd say we should check whether running B on a Windows machine leads to the same conclusion. Or if the code change is Linux-specific, test it also on a Slackware desktop system to see if we get different conclusions from a Debian server system. The more variety incorporated into the sample set, the better.

However, I felt the point the paper makes about address randomization could actually be turned into a *benefit*: rather than turn it off, run the exact same benchmark, say, 500 times with ASLR *enabled*, which should balance out any biases by incorporating into the results both beneficial and detrimental address alignments that may have resulted from ASLR. If you draw conclusions from benchmarks taken with ASLR disabled, you run the risk that the algorithm performs better without ASLR but worse in typical user environments, which *would* have ASLR enabled.
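A minimal sketch of that kind of setup (the "./benchmark_binary" name is a placeholder, and each measurement includes process startup): launching a fresh process per run lets ASLR hand out a different layout each time, so the reported spread includes layout effects instead of hiding them behind a single number.

    import std.algorithm : sort, sum;
    import std.datetime.stopwatch : StopWatch, AutoStart;
    import std.process : execute;
    import std.stdio : writefln;

    void main()
    {
        double[] seconds;
        foreach (run; 0 .. 500)
        {
            auto sw = StopWatch(AutoStart.yes);
            auto r = execute(["./benchmark_binary"]);   // placeholder executable
            sw.stop();
            if (r.status == 0)
                seconds ~= sw.peek.total!"usecs" / 1e6;
        }

        seconds.sort();
        writefln("runs=%s  min=%.3f  median=%.3f  mean=%.3f  max=%.3f (s)",
                 seconds.length, seconds[0], seconds[$ / 2],
                 seconds.sum / seconds.length, seconds[$ - 1]);
    }

Comparing the whole distributions of two candidate algorithms, rather than one number each, is what tells you whether a small difference survives the layout noise.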
Another problem with benchmark-based optimizations, one the paper didn't mention, is that you run the risk of optimizing for the benchmark at the expense of actual, real-world software. Typical software, for example, doesn't run memcpy 50 million times in a tight loop; typically memcpy calls are sprinkled among a whole bunch of other stuff. If you micro-optimize memcpy to beat the benchmark, you could be misled by CPU cache / branch predictor effects which would *not* trigger in actual application software, because the usage pattern is completely different. You could potentially be making memcpy run *slower* in Excel even though it runs faster in a benchmark, for example.

An anecdote: once I wrote several D analogues of the Unix wc utility and ran a comparison with my Debian distro's stock wc utility (which is the GNU version). For GNU wc, the job basically amounted to calling memchr on newline characters. In the D versions I used various different algorithms, including std.mmfile, std.parallelism, reading the file the traditional way by blocks, etc. Then I ran the various versions of wc on data sets with different characteristics.

I discovered something very interesting: GNU wc was generally on par with, or outperformed, the D versions of the code for files that contained long lines, but performed more poorly when given files that contained short lines.

Glancing at the glibc source code revealed why: glibc's memchr used an elaborate bit hack based algorithm that scanned the target string 8 bytes at a time. This required the data to be aligned, however, so when the string was not aligned, it had to manually process up to 7 bytes at either end of the string with a different algorithm. So when the lines were long, the overall performance was dominated by the 8-byte at a time scanning code, which was very fast for large buffers. However, when given a large number of short strings, the overhead of setting up for the 8-byte scan became more costly than the savings, so it performed more poorly than a naïve byte-by-byte scan.

I confirmed this conclusion by artificially constructing an input file with extremely long lines, and another input file with extremely short lines. Then I compared GNU wc's performance with a naïve D function that basically did a byte-by-byte scan. As expected, wc lost to the naïve scanner on the input file with extremely short lines, but won by a large margin on the file with extremely long lines. Subsequently I was able to duplicate this result by writing the same 8-byte at a time scanner in D.
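For reference, here is a sketch of that word-at-a-time technique in D (the general idea, not glibc's actual code): advance byte-by-byte until aligned, then test 8 bytes per iteration with a zero-byte bit trick, then finish byte-by-byte. For short inputs the head and tail handling dominates, which is exactly the effect described above.

    import std.stdio : writeln;

    ptrdiff_t findByte(const(ubyte)[] buf, ubyte c)
    {
        enum ulong ones  = 0x0101010101010101UL;
        enum ulong highs = 0x8080808080808080UL;
        immutable ulong pattern = ones * c;   // the byte c repeated 8 times

        size_t i = 0;

        // Head: go byte-by-byte until the address is 8-byte aligned.
        while (i < buf.length && (cast(size_t) &buf[i]) % 8 != 0)
        {
            if (buf[i] == c) return cast(ptrdiff_t) i;
            ++i;
        }

        // Body: 8 bytes per iteration. XOR turns matching bytes into zero
        // bytes; (w - ones) & ~w & highs is nonzero iff some byte is zero.
        while (i + 8 <= buf.length)
        {
            ulong w = *cast(const(ulong)*) &buf[i] ^ pattern;
            if ((w - ones) & ~w & highs)
                break;                        // the match is somewhere in this word
            i += 8;
        }

        // Tail (and the word containing the match): byte-by-byte again.
        for (; i < buf.length; ++i)
            if (buf[i] == c) return cast(ptrdiff_t) i;

        return -1;
    }

    void main()
    {
        auto text = cast(const(ubyte)[]) "one\ntwo\nthree\n";
        writeln(findByte(text, '\n'));        // 3
    }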
Taking a step back, I realized that this was a classic case of optimizing for the benchmark: it seemed likely that glibc's memchr was optimized for scanning large buffers, given the kind of algorithm used to implement it. Since benchmarks tend to be written for large test cases, my suspicion was that this algorithm was favored because the author(s) were testing the code on large buffers. But this optimization came at the expense of performance in the small-buffer case.

Which means the actual benefit of this optimization depends on what your application uses memchr for. If you regularly scan large buffers, then glibc's implementation will give you better performance. However, if your application deals with a lot of small strings, you might be better off writing your own naïve byte-by-byte scanning code, because it will actually outperform glibc.

My gut feeling is that a lot of software actually deals with short strings quite frequently: think about your typical Facebook post or text message, or a database of customer names. Your customers aren't going to have names that are several KB long, so if the software uses glibc's memchr on names, it's performing rather poorly. A live chat system sends short messages at a time, and a lot of network protocols also center around short(ish) messages. Large data like video or music generally doesn't go through memchr() because it isn't textual, and even then you tend to process it in blocks of typically 4KB or so. So it's questionable whether glibc's choice of memchr implementation is really optimal in the sense of benefitting the most common applications, rather than just excelling at an artificial benchmark that doesn't accurately represent typical real-world usage.

Another corollary from all this is that sometimes there is no unique optimization that will work best for everybody. There is no "optimal" code that's detached from its surrounding context; you optimize for one use case to the detriment of another.

And unless you have specific use cases in mind, it's hard, or even impossible, to make the "right" decisions -- and this is especially the case for standard libraries that must be as generic as possible. The more generic you get, the harder it is to choose the best algorithm. At a certain point it becomes outright impossible because "best" becomes ill-defined (best for whom?).

And this comes back to my point that I get suspicious when people start trying to squeeze that last 5-10% of performance out of their code. Beware, because you could be optimizing for your benchmark rather than the user's actual environment.

T

-- 
Старый друг лучше новых двух.
Oct 23 2019
next sibling parent welkam <wwwelkam gmail.com> writes:
On Thursday, 24 October 2019 at 00:53:27 UTC, H. S. Teoh wrote:
 you could be optimizing for your benchmark rather than the 
 user's actual environment.
That's the feeling I get when reading blog posts on Rust compiler speed improvements.

The other thing to keep in mind is which CPU resources are limiting your performance. If you change your algorithm to use more D-cache and it shows a speed improvement in microbenchmarks, but your application is already starving for D-cache, you will reduce performance when you add that change to the whole application.
Oct 24 2019
prev sibling parent Jon Degenhardt <jond noreply.com> writes:
On Thursday, 24 October 2019 at 00:53:27 UTC, H. S. Teoh wrote:
 I discovered something very interesting: GNU wc was generally 
 on par with, or outperformed the D versions of the code for 
 files that contained long lines, but performed more poorly when 
 given files that contained short lines.

 Glancing at the glibc source code revealed why: glibc's memchr 
 used an elaborate bit hack based algorithm that scanned the 
 target string 8 bytes at a time. This required the data to be 
 aligned, however, so when the string was not aligned, it had to 
 manually process up to 7 bytes at either end of the string with 
 a different algorithm.  So when the lines were long, the 
 overall performance was dominated by the 8-byte at a time 
 scanning code, which was very fast for large buffers.  However, 
 when given a large number of short strings, the overhead of 
 setting up for the 8-byte scan became more costly than the 
 savings, so it performed more poorly than a naïve byte-by-byte 
 scan.
Interesting observation. On the surface it seems this might also apply to splitter and find when used on narrow strings. I believe these call memchr on narrow strings. A common paradigm is to read lines, then call splitter to identify individual fields. Fields are often short, even when lines are long. --Jon
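A minimal illustration of that paradigm (the field names and values are made up; whether each search actually goes through memchr depends on the element type and the Phobos version):

    import std.algorithm.iteration : splitter;
    import std.stdio : writeln;

    void main()
    {
        // One line, several short fields: each field lookup is a search
        // over a short substring, i.e. the small-buffer case above.
        auto line = "alice\t42\tnew york";
        foreach (field; line.splitter('\t'))
            writeln(field);
    }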
Oct 24 2019
prev sibling parent reply Walter Bright <newshound2 digitalmars.com> writes:
On 10/23/2019 3:03 PM, Jonathan Marler wrote:
 What I find funny is that there are a lot of clever tricks you can do to 
 make your code execute fewer operations, but with modern CPUs it's more 
 about making your code more predictable so that the cache can predict 
 what to load next and which branches you're more likely to take. So in 
 a way, as CPUs get smarter, you want to make your code "dumber" (i.e. 
 more predictable) in order to get the best performance. When hardware 
 was "dumber", it was better to make your code smarter. An odd switch 
 in paradigms.
Keep in mind that starting in the late 70's, CPUs started being designed around the way compilers generate code. (Before then, instruction sets were a wacky collection of seemingly unrelated instructions. Compilers like orthogonality, and specialized instructions to do things like stack frame setup / teardown.) This means that unusual instruction sequences tend to perform less well than the ordinary stuff a compiler generates. It's also true that code optimizers are tuned to what the local C/C++ compiler generates, even if the optimizer is designed to work with multiple diverse languages.
Oct 23 2019
next sibling parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Wed, Oct 23, 2019 at 04:22:08PM -0700, Walter Bright via Digitalmars-d wrote:
 On 10/23/2019 3:03 PM, Jonathan Marler wrote:
 What I find funny is that there are alot of clever tricks you can do
 to make your code execute less operations, but with modern CPUs it's
 more about making your code more predictable so that the cache can
 predict what to load next and which branches you're more likely to
 take.  So in a way, as CPUs get smarter, you want to make your code
 "dumber" (i.e .  more predictable) in order to get the best
 performance.  When hardware was "dumber", it was better to make your
 code smarter.  An odd switch in paradigms.
Indeed! In the old days it was all about minimizing instructions. But nowadays, minimizing instructions may make your code perform worse if you increase the number of branches, thereby causing more branch hazards. On the flip side, some good optimizers can eliminate branch hazards in certain cases, e.g.:

    bool cond;
    x = cond ? y+1 : y;

can be rewritten by the optimizer as:

    x = y + cond;

which allows for a branchless translation into machine code. Generally, though, it's a bad idea to write this sort of optimization in the source code: it runs the risk of confusing the optimizer, which may cause it to be disabled for that piece of code, resulting in poor generated code. It's usually better to just trust the optimizer to do its job.

Another recent development is the occasional divergence of performance characteristics of CPUs across members of the same family, i.e., the same instruction on two different CPU models may perform quite differently.  Meaning that this sort of low-level optimization is really best left to the optimizer to optimize for the actual target CPU, rather than to choose a fixed series of instructions in an asm block that may perform poorly on some CPUs.  (This is also where JIT compilation can win over static compilation, if you ship a generic binary that isn't specifically targeted for the customer's CPU model.)
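Going back to the ternary example above, here is a self-contained version of the two forms (a sketch; whether the compiler actually emits a branch for either one depends on the backend and the target CPU, so inspect the generated asm rather than assuming):

    int branchy(bool cond, int y)
    {
        return cond ? y + 1 : y;   // may or may not compile to a branch
    }

    int branchless(bool cond, int y)
    {
        return y + cond;           // bool promotes to 0 or 1
    }

    void main()
    {
        assert(branchy(true, 4)  == 5 && branchless(true, 4)  == 5);
        assert(branchy(false, 4) == 4 && branchless(false, 4) == 4);
    }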
 Keep in mind that starting in the late 70's, CPUs started being
 designed around the way compilers generate code. (Before then,
 instruction sets were a wacky collection of seemingly unrelated
 instructions. Compilers like orthogonality, and specialized
 instructions to do things like stack frame setup / teardown.)
 
 This means that unusual instruction sequences tend to perform less
 well than the ordinary stuff a compiler generates.
Yeah, nowadays with microcode, you can't trust the surface appearance of the assembly instructions anymore. What looks like the same number of instructions can have very different performance characteristics depending on how it's actually implemented in the microcode.
 It's also true that code optimizers are tuned to what the local C/C++
 compiler generates, even if the optimizer is designed to work with
 multiple diverse languages.
Interesting, I didn't know this.

T

-- 
Guns don't kill people. Bullets do.
Oct 23 2019
next sibling parent Walter Bright <newshound2 digitalmars.com> writes:
On 10/23/2019 4:51 PM, H. S. Teoh wrote:
 It's also true that code optimizers are tuned to what the local C/C++
 compiler generates, even if the optimizer is designed to work with
 multiple diverse languages.
Interesting, I didn't know this.
It's pretty straightforward why - the optimizer developers tend to be C/C++ developers who look at the generated code from their C/C++ compiler.
Oct 23 2019
prev sibling parent reply Mark <smarksc gmail.com> writes:
On Wednesday, 23 October 2019 at 23:51:16 UTC, H. S. Teoh wrote:
 Another recent development is the occasional divergence of 
 performance characteristics of CPUs across members of the same 
 family, i.e., the same instruction on two different CPU models 
 may perform quite differently.  Meaning that this sort of 
 low-level optimization is really best left to the optimizer to 
 optimize for the actual target CPU, rather than to choose a 
 fixed series of instructions in an asm block that may perform 
 poorly on some CPUs.  (This is also where JIT compilation can 
 win over static compilation, if you ship a generic binary that 
 isn't specifically targeted for the customer's CPU model.)

 T
Would it be reasonable to say that modern CPUs basically do JIT compilation of assembly instructions? Or at the very least, that they have a built-in "runtime" that is responsible for all that ILP magic - cache policy algorithms, MESI protocol, the branch predictor and so on. If so, you could argue that the Itanium was an attempt to avoid this "runtime" and transfer all these responsibilities to the compiler and/or programmer. Not a very successful one, apparently. It is also a bit analogous to the GC vs. deterministic manual memory management debate.
Oct 27 2019
next sibling parent reply drug <drug2004 bk.ru> writes:
27.10.2019 23:11, Mark writes:
 On Wednesday, 23 October 2019 at 23:51:16 UTC, H. S. Teoh wrote:
 Another recent development is the occasional divergence of performance 
 characteristics of CPUs across members of the same family, i.e., the 
 same instruction on two different CPU models may perform quite 
 differently.  Meaning that this sort of low-level optimization is 
 really best left to the optimizer to optimize for the actual target 
 CPU, rather than to choose a fixed series of instructions in an asm 
 block that may perform poorly on some CPUs.  (This is also where JIT 
 compilation can win over static compilation, if you ship a generic 
 binary that isn't specifically targeted for the customer's CPU model.)

 T
Would it be reasonable to say that modern CPUs basically do JIT compilation of assembly instructions? Or at the very least, that they have a built-in "runtime" that is responsible for all that ILP magic - cache policy algorithms, MESI protocol, the branch predictor and so on. If so, you could argue that the Itanium was an attempt to avoid this "runtime" and transfer all these responsibilities to the compiler and/or programmer. Not a very successful one, apparently. It is also a bit analogous to the GC vs. deterministic manual memory management debate.
Wasn't Itanium's failure caused by AMD suggesting another architecture that was capable of running existing software, while in the case of Itanium users were forced to recompile their source code? So Itanium failed just because it was incompatible with x86?
Oct 27 2019
parent reply rikki cattermole <rikki cattermole.co.nz> writes:
On 28/10/2019 9:39 AM, drug wrote:
 27.10.2019 23:11, Mark writes:
 On Wednesday, 23 October 2019 at 23:51:16 UTC, H. S. Teoh wrote:
 Another recent development is the occasional divergence of 
 performance characteristics of CPUs across members of the same 
 family, i.e., the same instruction on two different CPU models may 
 perform quite differently.  Meaning that this sort of low-level 
 optimization is really best left to the optimizer to optimize for the 
 actual target CPU, rather than to choose a fixed series of 
 instructions in an asm block that may perform poorly on some CPUs.  
 (This is also where JIT compilation can win over static compilation, 
 if you ship a generic binary that isn't specifically targeted for the 
 customer's CPU model.)

 T
Would it be reasonable to say that modern CPUs basically do JIT compilation of assembly instructions? Or at the very least, that they have a built-in "runtime" that is responsible for all that ILP magic - cache policy algorithms, MESI protocol, the branch predictor and so on. If so, you could argue that the Itanium was an attempt to avoid this "runtime" and transfer all these responsibilities to the compiler and/or programmer. Not a very successful one, apparently. It is also a bit analogous to the GC vs. deterministic manual memory management debate.
Wasn't Itanium's failure caused by AMD suggesting another architecture that was capable of running existing software, while in the case of Itanium users were forced to recompile their source code? So Itanium failed just because it was incompatible with x86?
Intel created Itanium. AMD instead created AMD64 aka x86_64.
Oct 27 2019
parent drug <drug2004 bk.ru> writes:
On 10/28/19 2:03 AM, rikki cattermole wrote:
 On 28/10/2019 9:39 AM, drug wrote:
 27.10.2019 23:11, Mark writes:
 On Wednesday, 23 October 2019 at 23:51:16 UTC, H. S. Teoh wrote:
 Another recent development is the occasional divergence of 
 performance characteristics of CPUs across members of the same 
 family, i.e., the same instruction on two different CPU models may 
 perform quite differently.  Meaning that this sort of low-level 
 optimization is really best left to the optimizer to optimize for 
 the actual target CPU, rather than to choose a fixed series of 
 instructions in an asm block that may perform poorly on some CPUs. 
 (This is also where JIT compilation can win over static compilation, 
 if you ship a generic binary that isn't specifically targeted for 
 the customer's CPU model.)

 T
Would it be reasonable to say that modern CPUs basically do JIT compilation of assembly instructions? Or at the very least, that they have a built-in "runtime" that is responsible for all that ILP magic - cache policy algorithms, MESI protocol, the branch predictor and so on. If so, you could argue that the Itanium was an attempt to avoid this "runtime" and transfer all these responsibilities to the compiler and/or programmer. Not a very successful one, apparently. It is also a bit analogous to the GC vs. deterministic manual memory management debate.
Wasn't Itanium's failure caused by AMD suggesting another architecture that was capable of running existing software, while in the case of Itanium users were forced to recompile their source code? So Itanium failed just because it was incompatible with x86?
Intel created Itanium. AMD instead created AMD64 aka x86_64.
That's a well-known fact, I believe. I meant that Intel planned for Itanium to be the next-generation processor after x86, but AMD extended x86 and created amd64, and Intel's plan failed because to use Itanium you had to recompile everything, whereas with amd64 you just run your software as before.
Oct 31 2019
prev sibling next sibling parent rikki cattermole <rikki cattermole.co.nz> writes:
On 28/10/2019 9:11 AM, Mark wrote:
 On Wednesday, 23 October 2019 at 23:51:16 UTC, H. S. Teoh wrote:
 Another recent development is the occasional divergence of performance 
 characteristics of CPUs across members of the same family, i.e., the 
 same instruction on two different CPU models may perform quite 
 differently.  Meaning that this sort of low-level optimization is 
 really best left to the optimizer to optimize for the actual target 
 CPU, rather than to choose a fixed series of instructions in an asm 
 block that may perform poorly on some CPUs.  (This is also where JIT 
 compilation can win over static compilation, if you ship a generic 
 binary that isn't specifically targeted for the customer's CPU model.)

 T
Would it be reasonable to say that modern CPUs basically do JIT compilation of assembly instructions? Or at the very least, that they have a built-in "runtime" that is responsible for all that ILP magic - cache policy algorithms, MESI protocol, the branch predictor and so on. If so, you could argue that the Itanium was an attempt to avoid this "runtime" and transfer all these responsibilities to the compiler and/or programmer. Not a very successful one, apparently. It is also a bit analogous to the GC vs. deterministic manual memory management debate.
I have described modern x86 CPUs as an application VM, so I will agree with you :)
Oct 27 2019
prev sibling parent Ola Fosheim Grøstad <ola.fosheim.grostad gmail.com> writes:
On Sunday, 27 October 2019 at 20:11:38 UTC, Mark wrote:
 Would it be reasonable to say that modern CPUs basically do JIT 
 compilation of assembly instructions?
Old CISC CPUs did just that, so they could have high level instructions, and were reprogrammable... (I guess x86 also has that feature, at least to some extent). Then RISC CPUs came in the 90s and didn't do that, thus they were faster and more compact as they could throw out the decoder (the bits in the instructions were carefully designed so that the decoding was instantaneous). But then memory bandwidth became an issue and developers started to write more and more bloated software... x86 is an old CISC architecture and simply survives because of market dominance and R&D investments. Also, with increased real estate (more transistors) they can sacrifice lots of space for the instruction decoding...

The major change over the past 40 years that is causing sensitivity to instruction ordering is that modern CPUs can have deep pipelines (executing many instructions at the same time in a long staging queue), that they are superscalar (execute instructions in parallel), execute instructions speculatively (execute instructions even though the result might be discarded later), do tight-loop instruction unrolling before pipelining, and have various schemes for branch prediction (so that they execute the right sequence after a branch before they know what the branch condition looks like).

Is this a good approach? Probably not... You would get much better performance from the same number of transistors by using many simple cores and a clever memory architecture, but that would not work with current software and development practice...
 branch predictor and so on. If so, you could argue that the 
 Itanium was an attempt to avoid this "runtime" and transfer all 
 these responsibilities to the compiler and/or programmer. Not a 
 very successful one, apparently.
VLIW is not a bad concept, re RISC, but perhaps not profitable in terms of R&D. You probably could get better scheduling of instructions if the optimal schedule was determined statically, as the compiler would then have a "perfect model" of how much of the CPU is being utilized, and could give programmers feedback on it too. But then you would need to recompile software for the actual CPU and have more advanced compilers, and perhaps write software in a different manner to avoid bad branching patterns.

Existing software code bases and a developer culture that is resistant to change do limit progress... People pay to have their existing stuff run well; they won't pay if they have to write new stuff in new ways, unless the benefits are extreme (e.g. GPUs)...
Oct 31 2019
prev sibling parent welkam <wwwelkam gmail.com> writes:
On Wednesday, 23 October 2019 at 23:22:08 UTC, Walter Bright 
wrote:
 instruction sets were a wacky collection of seemingly unrelated 
 instructions.
Yeah, at some point compiler writers and chip designers were in competition over who could produce better code. You can look at those instructions as an attempt at peephole optimization.
Oct 24 2019
prev sibling next sibling parent Daniel Kozak <kozzi11 gmail.com> writes:
On Wed, Oct 23, 2019 at 11:40 PM welkam via Digitalmars-d
<digitalmars-d puremagic.com> wrote:
 I watched many of his talks and he frequently talks about
 optimizations that produce single-digit % speedups in
 frequently used algorithms, but he doesn't provide adequate
 proof that his change in algorithm was the reason we see the
 performance difference. Modern CPUs are sensitive to many
 things, and one of them is code layout in memory. Hot loops are
 the most susceptible to this, to the point where changing the
 user name under which the executable is run changes performance.
 The paper below goes deeper into this.

 Producing Wrong Data Without Doing Anything Obviously Wrong!
 https://users.cs.northwestern.edu/~robby/courses/322-2013-spring/mytkowicz-wrong-data.pdf
https://www.youtube.com/watch?v=r-TLSBdHe1A&t=20s
Oct 24 2019
prev sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 10/23/19 5:37 PM, welkam wrote:
 I watched many of his talks and he frequently talks about optimizations 
 that produce single-digit % speedups in frequently used algorithms, 
 but he doesn't provide adequate proof that his change in algorithm was 
 the reason we see the performance difference. Modern CPUs are sensitive 
 to many things, and one of them is code layout in memory. Hot loops are 
 the most susceptible to this, to the point where changing the user name 
 under which the executable is run changes performance. The paper below 
 goes deeper into this.
 
 Producing Wrong Data Without Doing Anything Obviously Wrong!
 https://users.cs.northwestern.edu/~robby/courses/322-2013-spring/mytkowicz-wrong-data.pdf
I know of the paper and its follow-up by Berger et al. Thanks.
Oct 27 2019