digitalmars.D - Requesting Help with Optimizing Code
- Kyle Ingraham (101/101) Apr 07 2021 Hello all. I have been working on learning computational
- tsbockman (40/52) Apr 07 2021 I agree, intuitively that sounds like there is a lot of room for
- Kyle Ingraham (14/39) Apr 08 2021 In hindsight you are right here. I'll divert this sort of post
- Bastiaan Veelo (14/15) Apr 08 2021 No, but you can quite easily:
- tsbockman (5/10) Apr 08 2021 In keeping with my earlier comment about structuring nested loops
- Max Haughton (63/166) Apr 07 2021 I am away from a proper dev machine at the moment so I can't
- tsbockman (14/18) Apr 07 2021 From what I've seen, LLVM's code generation and optimization for
- Max Haughton (12/30) Apr 07 2021 You can do multiversioning fairly easily these days.
- Iain Buclaw (2/5) Apr 08 2021 Perhaps a @target_clones UDA in {gcc/ldc}.attributes. ;-)
- Kyle Ingraham (7/26) Apr 08 2021 I haven't tried this yet but will give it a go. Ideally I'd want
- Max Haughton (12/38) Apr 08 2021 GDC has equivalent options - dmd does not have any equivalent
- tsbockman (10/13) Apr 08 2021 I learned a lot about micro-optimization from studying the ASM
- Guillaume Piolat (4/5) Apr 08 2021 - profiling?
- Guillaume Piolat (5/12) Apr 08 2021 Also if you don't need the precision: always use powf instead of
- Kyle Ingraham (4/18) Apr 08 2021 Great tips here. I wasn't aware of powf and its cousins. I also
Hello all. I have been working on learning computational photography and have been using D to do that. I recently got some code running that performs [chromatic adaptation](https://en.wikipedia.org/wiki/Chromatic_adaptation) (white balancing). The output is still not ideal (image is overexposed) but it does correct color casts. The issue I have is with performance. With a few optimizations found with profiling I have been able to drop processing time from ~10.8 to ~6.2 seconds for a 16 megapixel test image. That still feels like too long however. Image editing programs are usually much faster.

The optimizations that I've implemented:

* Remove `immutable` from constants. The type mismatch between constants (`immutable(double)`) and pixel values (`double`) caused time-consuming checks for compatible types in mir operations and triggered run-time type conversions and memory allocations (sorry if I butchered this description).
* Use `mir.math.common.pow` in place of `std.math.pow`.
* Use `@optmath` for linearization functions (https://github.com/kyleingraham/photog/blob/up-chromadapt-perf/source/photog/color.d#L192 and https://github.com/kyleingraham/photog/blob/up-chromadapt-perf/source/photog/color.d#L318).

Is there anything else I can do to improve performance?

I tested the code under the following conditions:

* Compiled with `dub build --build=release --compiler=ldmd2`
* dub v1.23.0, ldc v1.24.0
* Intel Xeon W-2170B 2.5GHz (4.3GHz turbo)
* [Test image](https://user-images.githubusercontent.com/25495787/113943277-52054180-97d0-11eb-82be-934cf3d22112.jpg)
* Test code:

```d
/+ dub.sdl:
name "photog-test"
dependency "photog" version="~>0.1.1-alpha"
dependency "jpeg-turbod" version="~>0.2.0"
+/

import std.datetime.stopwatch : AutoStart, StopWatch;
import std.file : read, write;
import std.stdio : writeln, writefln;

import jpeg_turbod;
import mir.ndslice : reshape, sliced;

import photog.color : chromAdapt, Illuminant, rgb2Xyz;
import photog.utils : imageMean, toFloating, toUnsigned;

void main()
{
    const auto jpegFile = "image-in.jpg";
    auto jpegInput = cast(ubyte[]) jpegFile.read;

    auto dc = new Decompressor();
    ubyte[] pixels;
    int width, height;
    bool decompressed = dc.decompress(jpegInput, pixels, width, height);

    if (!decompressed)
    {
        dc.errorInfo.writeln;
        return;
    }

    auto image = pixels.sliced(height, width, 3).toFloating;

    int err;
    double[] srcIlluminant = image
        .imageMean
        .reshape([1, 1, 3], err)
        .rgb2Xyz
        .field;
    assert(err == 0);

    auto sw = StopWatch(AutoStart.no);
    sw.start;

    auto ca = chromAdapt(image, srcIlluminant, Illuminant.d65).toUnsigned;

    sw.stop;
    auto timeTaken = sw.peek.split!("seconds", "msecs");
    writefln("%d.%d seconds", timeTaken.seconds, timeTaken.msecs);

    auto c = new Compressor();
    ubyte[] jpegOutput;
    bool compressed = c.compress(ca.field, jpegOutput, width, height, 90);

    if (!compressed)
    {
        c.errorInfo.writeln;
        return;
    }

    "image-out.jpg".write(jpegOutput);
}
```

Functions found through profiling to be taking most time:

* Chromatic adaptation: https://github.com/kyleingraham/photog/blob/up-chromadapt-perf/source/photog/color.d#L354
* RGB to XYZ: https://github.com/kyleingraham/photog/blob/up-chromadapt-perf/source/photog/color.d#L142
* XYZ to RGB: https://github.com/kyleingraham/photog/blob/up-chromadapt-perf/source/photog/color.d#L268

A profile for the test code is [here](https://github.com/kyleingraham/photog/files/6274974/trace.zip). The trace.log.dot file can be viewed with xdot. The PDF version is [here](https://github.com/kyleingraham/photog/files/6275358/trace.log.pdf).
The profile was generated using:

* Compiled with `dub build --build=profile --compiler=ldmd2`
* Visualized with profdump - `dub run profdump -- -f -d -t 0.1 trace.log trace.log.dot`

The branch containing the optimized code is here: https://github.com/kyleingraham/photog/tree/up-chromadapt-perf

The corresponding release is here: https://github.com/kyleingraham/photog/releases/tag/v0.1.1-alpha

If you've gotten this far thank you so much for reading. I hope there's enough information here to ease thinking about optimizations.
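For reference, the linearization functions mentioned above amount to a per-channel transfer curve. A minimal sketch of that kind of helper with the mir substitutions applied, assuming the usual sRGB-style curve and that `@optmath` and `pow` come from `mir.math.common` (illustrative only, not photog's exact code):

```d
import mir.math.common : optmath, pow; // LLVM-friendly pow plus the @optmath attribute

@optmath double srgbToLinear(double channel)
{
    // Standard sRGB decoding curve; the function name is illustrative only.
    return channel <= 0.04045
        ? channel / 12.92
        : pow((channel + 0.055) / 1.055, 2.4);
}
```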
Apr 07 2021
On Thursday, 8 April 2021 at 01:24:23 UTC, Kyle Ingraham wrote:
> The issue I have is with performance.

This belongs in the "Learn" forum, I think.

> With a few optimizations found with profiling I have been able to drop processing time from ~10.8 to ~6.2 seconds for a 16 megapixel test image. That still feels like too long however. Image editing programs are usually much faster.

I agree, intuitively that sounds like there is a lot of room for further optimization.

> Functions found through profiling to be taking most time:
>
> * Chromatic adaptation: https://github.com/kyleingraham/photog/blob/up-chromadapt-perf/source/photog/color.d#L354
> * RGB to XYZ: https://github.com/kyleingraham/photog/blob/up-chromadapt-perf/source/photog/color.d#L142
> * XYZ to RGB: https://github.com/kyleingraham/photog/blob/up-chromadapt-perf/source/photog/color.d#L268

This is very high level code, of a sort which depends heavily on inlining, loop unrolling and/or auto-vectorization, and a few other key optimizations to perform well. So, I really can't tell which parts are likely to compile down to good ASM just from skimming it.

Personally, I try to stick to `static foreach` and normal runtime loops for the inner loops of really CPU-intensive stuff like this, so that it's easier to see from the source code what the ASM will look like. With inlining and value range propagation it's definitely possible to get good performance out of range-based or functional-style code, but it's harder because the D source code is that much further away from what the CPU will actually do.

All that said, I can offer a few tips which may or may not help here:

The only concrete floating-point type I see mentioned in those functions is `double`, which is almost always extreme overkill for color calculations. You should probably be using `float` (32 bits per channel), instead. Even half-precision (16 bits per channel) is more than the human eye can distinguish. The extra 13 bits of precision offered by `float` should be more than sufficient to absorb any rounding errors.

Structure your loops (or functional equivalents) such that the "for each pixel" loop is the outermost, or as far out as you can get it. Do as much as you can with each pixel while it is still in registers or L1 cache. Otherwise, you may end up bottle-necked by memory bandwidth.

Make sure you have optimizations enabled, especially cross module inlining, -O3, the most recent SIMD instruction set you are comfortable requiring (almost everyone in the developed world now has Intel SSE4 or ARM Neon), and fused multiply add/subtract.

There will always be three primary colors throughout the entire maintenance life of your code, so just go ahead and specialize for that. You don't need a generic matrix multiplication algorithm, for instance. A specialized 3x3 or 4x4 version could be branchless and SIMD accelerated.
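To make that last point concrete, here is a minimal sketch (names invented for the example, not taken from photog) of a fixed-size multiply written with `static foreach`, a shape the optimizer can turn into branchless, FMA- and SIMD-friendly straight-line code:

```d
// Hypothetical fixed-size colour transform: a 3x3 matrix applied to one pixel.
// Sizes are compile-time constants, so the optimizer sees straight-line code
// it can keep in registers, fuse into FMAs, and vectorize.
float[3] transformPixel(const ref float[3][3] m, const ref float[3] px)
{
    float[3] result;
    static foreach (row; 0 .. 3)
    {
        result[row] = m[row][0] * px[0] + m[row][1] * px[1] + m[row][2] * px[2];
    }
    return result;
}
```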
Apr 07 2021
On Thursday, 8 April 2021 at 03:25:39 UTC, tsbockman wrote:
> On Thursday, 8 April 2021 at 01:24:23 UTC, Kyle Ingraham wrote:
>> The issue I have is with performance.
>
> This belongs in the "Learn" forum, I think.

In hindsight you are right here. I'll divert this sort of post there in the future.

> Personally, I try to stick to `static foreach` and normal runtime loops for the inner loops of really CPU-intensive stuff like this, so that it's easier to see from the source code what the ASM will look like.

Are compilers able to take loops and parallelize them?

> The only concrete floating-point type I see mentioned in those functions is `double`, which is almost always extreme overkill for color calculations.

Great tip here. I'll think about the needs for the data I'm working on before choosing the first type on the shelf.

> Structure your loops (or functional equivalents) such that the "for each pixel" loop is the outermost, or as far out as you can get it. Do as much as you can with each pixel while it is still in registers or L1 cache. Otherwise, you may end up bottle-necked by memory bandwidth.
>
> Make sure you have optimizations enabled, especially cross module inlining, -O3, the most recent SIMD instruction set you are comfortable requiring (almost everyone in the developed world now has Intel SSE4 or ARM Neon), and fused multiply add/subtract.

Are there any sources you would recommend for learning more about these techniques e.g. code bases, articles etc.? I'll see how far I can get with creative googling.

> There will always be three primary colors throughout the entire maintenance life of your code, so just go ahead and specialize for that. You don't need a generic matrix multiplication algorithm, for instance. A specialized 3x3 or 4x4 version could be branchless and SIMD accelerated.

I had thought about this for one of my functions but didn't think to extend it further to a function I can use everywhere. I'll do that.

Thank you for taking the time to write this up. I'm sure these tips will go a long way for improving performance.
Apr 08 2021
On Thursday, 8 April 2021 at 16:37:57 UTC, Kyle Ingraham wrote:
> Are compilers able to take loops and parallelize them?

No, but you can quite easily:

```
// Find the logarithm of every number from
// 1 to 1_000_000 in parallel, using the
// default TaskPool instance.
auto logs = new double[1_000_000];

foreach (i, ref elem; parallel(logs))
{
    elem = log(i + 1.0);
}
```

https://dlang.org/phobos/std_parallelism.html#.parallel

-- Bastiaan.
Apr 08 2021
On Thursday, 8 April 2021 at 16:59:07 UTC, Bastiaan Veelo wrote:
> On Thursday, 8 April 2021 at 16:37:57 UTC, Kyle Ingraham wrote:
>> Are compilers able to take loops and parallelize them?
>
> No, but you can quite easily:
> ...
> https://dlang.org/phobos/std_parallelism.html#.parallel

In keeping with my earlier comment about structuring nested loops optimally, the parallelism should be applied at the level of the "for each pixel" loop, so that each pixel's memory only needs to be touched by one CPU core.
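A hedged sketch of what that can look like, parallelizing over rows of an interleaved buffer (the layout and function name are assumptions for illustration, not photog's actual API):

```d
import std.parallelism : parallel;
import std.range : iota;

// height x width x 3 interleaved float buffer; each worker owns whole rows,
// so a given pixel is only ever touched by one core and stays hot in its cache.
void adaptImage(float[] pixels, size_t width, size_t height)
{
    foreach (y; parallel(iota(height)))
    {
        float[] row = pixels[y * width * 3 .. (y + 1) * width * 3];
        foreach (x; 0 .. width)
        {
            // per-pixel work on row[x * 3 .. x * 3 + 3] goes here
        }
    }
}
```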
Apr 08 2021
On Thursday, 8 April 2021 at 01:24:23 UTC, Kyle Ingraham wrote:
> [...]
> Is there anything else I can do to improve performance?
> [...]

I am away from a proper dev machine at the moment so I can't really delve into this in much detail, but some thoughts:

Are you making the compiler aware of your machine? Although the obvious point here is vector width (you have AVX-512 from what I can see, however I'm not sure if this is actually a win or not on Skylake W), the compiler is also forced to use a completely generic scheduling model which may or may not yield good code for your processor. For LDC, you'll want `-mcpu=native`. Also use cross module inlining if you aren't already.

I also notice in your hot code, you are using function contracts: when contracts are being checked, LDC actually does use them to optimize the code, however in release builds when they are turned off this is not the case. If you are happy to assume that they are true in release builds (I believe you should at least), you can give the compiler this additional information via the use of an intrinsic or similar (with LDC I just use inline IR, on GDC you have `__builtin_unreachable()`). LDC *will* assume asserts in release builds as far as I have tested, however this still invokes a branch (just not to a proper assert handler).

Via `pragma(inline, true)` you can wrap this idea into a little "function" like this:

```D
pragma(LDC_inline_ir)
R inlineIR(string s, R, P...)(P);

pragma(inline, true)
void assume()(bool condition)
{
    // Flesh this out if you want to actually do something in debug builds.
    if (!condition)
    {
        inlineIR!("unreachable;", void)();
    }
}
```

So you call this with information you know about your code, and the optimizer will assume that the condition is true - in your case it looks like the arrays are supposed to be of length 3, so if you let the optimizer assume this it can yield a much much much better function because of it (compilers these days will usually generate a big unrolled loop using the full vector width of the machine, and some little ones that - if you put your modular arithmetic hat on - make up the difference for the whole array, but in your case you could just want 3 movs maybe). Using this method can also yield much more aggressive autovectorization for some loops.

Helping the optimizer in the aforementioned manner will often *not* yield dramatic performance increases, at least in throughput, however it will yield code that is smaller and has a simpler control flow graph - these two facts are kind to latency and your instruction cache (the I$ is quite big in 2021, however processors now utilise a so-called L0 cache for the decoded micro-ops which is not very big and can also yield counterintuitive results on - admittedly contrived - loop examples due to tricks that the CPU does to make typical code faster ceasing to apply). I think AMD Zen has some interesting loop alignment requirements but I could be misremembering.
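To make that concrete, a hedged sketch of dropping such an assumption into a hot per-pixel routine, reusing the `assume` helper above (the function name and layout are illustrative, not photog's actual signature):

```D
void xyzToRgbPixel(double[] pixel)
{
    // The contract promised pixel.length == 3; hand that fact to the optimizer
    // in release builds so it can drop generic loop prologues and epilogues.
    assume(pixel.length == 3);

    // ... straight-line math over pixel[0], pixel[1], pixel[2] ...
}
```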
As for the usual performance traps, you'll need to look at the program running in perf, or vTune if you can be bothered (vTune is like a rifle to perf's air pistol, but it is enormous on your disk; it does, however, understand D code pretty well, so I am finding it quite interesting to play with).
Apr 07 2021
On Thursday, 8 April 2021 at 03:27:12 UTC, Max Haughton wrote:
> Although the obvious point here is vector width (you have AVX-512 from what I can see, however I'm not sure if this is actually a win or not on Skylake W)

From what I've seen, LLVM's code generation and optimization for AVX-512 auto-vectorization is still quite bad and immature compared to AVX2 and earlier, and the wider the SIMD register the more that data structures and algorithms have to be specifically tailored to really benefit from them. Also, using AVX-512 instructions forces the CPU to downclock. So, I wouldn't expect much benefit from AVX-512 for the time being, unless you're going to hand optimize for it.

> For LDC, you'll want `-mcpu=native`.

Only do this if you don't care about the binary working on any CPU but your own. Otherwise, you need to look at something like the Steam Hardware survey and decide what percentage of the market you want to capture (open the "Other Settings" section): https://store.steampowered.com/hwsurvey
Apr 07 2021
On Thursday, 8 April 2021 at 03:45:06 UTC, tsbockman wrote:
> From what I've seen, LLVM's code generation and optimization for AVX-512 auto-vectorization is still quite bad and immature compared to AVX2 and earlier, [...] Also, using AVX-512 instructions forces the CPU to downclock. So, I wouldn't expect much benefit from AVX-512 for the time being, unless you're going to hand optimize for it.
>
>> For LDC, you'll want `-mcpu=native`.
>
> Only do this if you don't care about the binary working on any CPU but your own. [...]

You can do multiversioning fairly easily these days.

And AVX-512 downclocking can be quite complicated; I have seen benchmarks where one can still achieve a decent speedup even *with* downclocking. At the very least it's worth profiling - the reason why I brought up Skylake W specifically is that some of the earlier ones actually emulated the 512-bit vector instructions rather than having proper support in the functional units, IIRC.

D needs finer-grained control of the optimizer *inside* loops - e.g. I don't care about inlining writeln, but if something doesn't get inlined inside a hot loop you're fucked.
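On the multiversioning point, a rough sketch of the manual approach with LDC; `@target` is from `ldc.attributes`, while the `core.cpuid` feature query used here is an assumption to verify against your druntime version:

```d
import ldc.attributes : target;
import core.cpuid : avx2; // assumed name of druntime's runtime AVX2 query

@target("avx2")
void adaptKernelAVX2(float[] pixels) { /* wide path */ }

void adaptKernelBaseline(float[] pixels) { /* portable path */ }

void adaptKernel(float[] pixels)
{
    // Dispatch once per call, or cache a function pointer at startup.
    if (avx2) adaptKernelAVX2(pixels);
    else      adaptKernelBaseline(pixels);
}
```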
Apr 07 2021
On Thursday, 8 April 2021 at 03:58:36 UTC, Max Haughton wrote:
> On Thursday, 8 April 2021 at 03:45:06 UTC, tsbockman wrote:
>> [...]
>
> You can do multiversioning fairly easily these days.

Perhaps a `@target_clones` UDA in {gcc/ldc}.attributes. ;-)
Apr 08 2021
On Thursday, 8 April 2021 at 03:27:12 UTC, Max Haughton wrote:
> Are you making the compiler aware of your machine? [...] For LDC, you'll want `-mcpu=native`. Also use cross module inlining if you aren't already.

I haven't tried this yet but will give it a go. Ideally I'd want performance independent of compilers but could sprinkle this into the build config for when LDC is used.

> I also notice in your hot code, you are using function contracts: [...] LDC *will* assume asserts in release builds as far as I have tested, however this still invokes a branch (just not to a proper assert handler)...

Thanks for shining light on a level of optimization that I did not know existed. I've got a lot of learning to do in this area. Do you have any resources you'd recommend?
Apr 08 2021
On Thursday, 8 April 2021 at 17:00:31 UTC, Kyle Ingraham wrote:
> [...] Ideally I'd want performance independent of compilers but could sprinkle this into the build config for when LDC is used.
>
> [...] Thanks for shining light on a level of optimization that I did not know existed. I've got a lot of learning to do in this area. Do you have any resources you'd recommend?

GDC has equivalent options - dmd does not have any equivalent, however.

As for resources in this particular area, there isn't really an exact science to it beyond not assuming that the compiler can read your mind. These optimizations can be quite ad hoc, i.e. the propagation of this information through the function is not always intuitive (or possible). One trick is to declare a method `pragma(inline, true)`, then assume something in that method - the compiler will then assume (say) that the return value is aligned or whatever. But proceed with caution.
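A hedged sketch of that trick, reusing the `assume` helper from earlier in the thread (the 32-byte alignment figure is just an example):

```d
pragma(inline, true)
const(float)[] assumeAligned32(const(float)[] data)
{
    // The caller promises 32-byte alignment; after inlining, the optimizer
    // carries that fact into whatever loop consumes `data`.
    assume((cast(size_t) data.ptr & 31) == 0);
    return data;
}
```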
Apr 08 2021
On Thursday, 8 April 2021 at 17:00:31 UTC, Kyle Ingraham wrote:
> Thanks for shining light on a level of optimization that I did not know existed. I've got a lot of learning to do in this area. Do you have any resources you'd recommend?

I learned a lot about micro-optimization from studying the ASM output of my inner loop code in Godbolt's Compiler Explorer, which supports D: https://godbolt.org/

Other complementary resources:

* https://www.felixcloutier.com/x86/
* https://www.agner.org/optimize/

Understanding the basics of how prefetchers and cache memory hierarchies work is important, too.
Apr 08 2021
On Thursday, 8 April 2021 at 01:24:23 UTC, Kyle Ingraham wrote:
> Is there anything else I can do to improve performance?

- profiling?
- you can use `_mm_pow_ps` in the intel-intrinsics package to have 4x pow at once for the price of one.
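A sketch of how that might look; `_mm_pow_ps` is the routine named above, while the umbrella `inteli` import and the helper name are assumptions to adapt to the package's actual module layout:

```d
import inteli; // intel-intrinsics umbrella import (adjust to specific modules if needed)

// Raise four channels to the 2.4 power in one shot, e.g. for sRGB linearization.
void pow4(float[] chans) // expects chans.length >= 4
{
    __m128 v = _mm_loadu_ps(chans.ptr);
    __m128 r = _mm_pow_ps(v, _mm_set1_ps(2.4f));
    _mm_storeu_ps(chans.ptr, r);
}
```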
Apr 08 2021
On Thursday, 8 April 2021 at 14:17:09 UTC, Guillaume Piolat wrote:
> On Thursday, 8 April 2021 at 01:24:23 UTC, Kyle Ingraham wrote:
>> Is there anything else I can do to improve performance?
>
> - profiling?
> - you can use `_mm_pow_ps` in the intel-intrinsics package to have 4x pow at once for the price of one.

Also if you don't need the precision: always use `powf` instead of `pow` for double, use `expf` instead of `exp` for double, etc. Else you pay extra. In particular, `llvm_pow` with a float argument is a lot faster.
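With LDC, that advice looks roughly like this minimal sketch (the function name is illustrative):

```d
import ldc.intrinsics : llvm_pow;

float gammaDecode(float c)
{
    // The float instantiation lowers to powf rather than pow, avoiding a
    // needless round trip through double.
    return llvm_pow(c, 2.4f);
}
```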
Apr 08 2021
On Thursday, 8 April 2021 at 14:20:27 UTC, Guillaume Piolat wrote:
> Also if you don't need the precision: always use `powf` instead of `pow` for double, use `expf` instead of `exp` for double, etc. Else you pay extra. In particular, `llvm_pow` with a float argument is a lot faster.

Great tips here. I wasn't aware of powf and its cousins. I also like that intel-intrinsics works across compilers and CPU architectures. I'll see what I can do with them.
Apr 08 2021