## digitalmars.D - Vector Optimizations

Kyle Furlong <kylefurlong gmail.com> writes:
```Well my templated Vector is nearly complete! What remains is some
complex algebra quirks, implementing fast algorithms for cross products
for different dimension vectors, and optimizing the various operations.

In this vein, what is the best way to get SIMD, SSE, SSE2, etc. support? Any
suggestions on my Vector struct or vector optimization in D are welcome.

P.S. - The sources are attached, but the latest Velocity source can be
found at http://svn.dsource.org/projects/velocity/trunk/source/
```
Jan 26 2006
James Dunne <james.jdunne gmail.com> writes:
```I'm afraid with this sort of approach the best kind of SIMD optimization
you'd get would be single operations.  Unless the compiler can perform
ridiculous amounts of optimization, you won't get *much* benefit.  You
will get some, and it might be enough for you.  But obviously
hand-written SIMD instructions in assembly language are going to yield
far superior results.

True support for SIMD instructions should be a function of the compiler
because it needs to know how to align the data in memory, what registers
are available for use, and how it can streamline vectorized operations.
If the compiler doesn't optimize your code much and leaves the method
calls in, you'll just have exactly that: basically calling a function to
copy a vector into a SIMD register, executing the SIMD instruction,
copying the vector out of the register into the stack, and returning
that value to the caller somehow.  All in all, extremely suboptimal IMHO.

So, in summary, the OO methodology doesn't apply well to vectorization,
it seems.

--
Regards,
James Dunne
```
Jan 26 2006
Kyle Furlong <kylefurlong gmail.com> writes:
```Basically I want to utilize all the capabilities of the processor for
each operation. For example, most of the operations at the moment run a
for loop across all the elements of the wrapped array and apply the
operation to each one. I'm asking how to optimize this to use the SIMD
instructions of the various processors. I tried to unroll the loops
using the Duff's device Walter put together with templates, but couldn't
figure out how to get it to work with array indexing.
```
Jan 26 2006
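A minimal sketch of the loop described in the post above, next to the same operation written as a D array operation. It assumes a current D compiler; the `Vector` struct and its `addLoop`/`addArrayOp` methods are hypothetical stand-ins, not the Velocity source. Whether the array form is actually turned into SIMD instructions depends on the compiler, its flags, and the target CPU.

```d
import std.stdio : writeln;

// Hypothetical stand-in for the templated Vector discussed in the thread.
struct Vector(T, size_t N)
{
    T[N] data;

    // Element-by-element loop, the form the post describes.
    Vector addLoop(const Vector rhs) const
    {
        Vector result;
        for (size_t i = 0; i < N; ++i)
            result.data[i] = data[i] + rhs.data[i];
        return result;
    }

    // The same addition as an array operation. The compiler may
    // (but is not guaranteed to) lower this to SIMD instructions.
    Vector addArrayOp(const Vector rhs) const
    {
        Vector result;
        result.data[] = data[] + rhs.data[];
        return result;
    }
}

void main()
{
    auto a = Vector!(float, 4)([1.0f, 2.0f, 3.0f, 4.0f]);
    auto b = Vector!(float, 4)([10.0f, 20.0f, 30.0f, 40.0f]);

    // Both forms must agree element for element.
    assert(a.addLoop(b).data == a.addArrayOp(b).data);
    writeln(a.addArrayOp(b).data);
}
```

The array form keeps the struct generic over `T` and `N` while leaving the choice of instructions to the compiler, which is the trade-off the replies in this thread discuss.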
James Dunne <james.jdunne gmail.com> writes:
```1) You must know the capabilities of the processor on which your library
will be running.  This is possible to determine with the CPUID
instruction on x86 machines.  You can check for flags which indicate
support for SSE, SSE2, SSE3, MMX, 3DNow!, etc.

2) You must guarantee that the data with which you are working is
aligned in memory per the requirements of the specific SIMD capabilities
you're targeting/using.  This can be done with custom class allocators.
Do structs have this ability too?

3) SIMD instruction sets usually have very fixed capabilities, such as
working with either a vector of 4 32-bit floats, or 2 64-bit doubles.
These hardware registers have fixed size limitations.  Continuing to
allow for n-dimensional vectors via templates will have you writing so
much special-case code for SIMD support that the templating support just
won't be worth it.  Besides, most mathematics dealing with physical laws
uses vectors of dimension 4 or less.  Also, you might have to ditch the
complex-number support.

If you really want SIMD support, there isn't any way around these
requirements.

As for loop unrolling, you can simply hand-unroll your loops.  The basic
concept is to do as much work as you can within one loop iteration, and
to simultaneously cut down on the number of loop iterations.  Loops mean
branches, and branches are bad for pipelines.

In fact, I'm quite sure a generic templating approach to loop unrolling
would be great!

--
James Dunne
```
Jan 27 2006
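The points in the last post can be sketched in present-day D as follows. This is an illustration under stated assumptions, not code from the thread: it assumes a current D compiler and an x86 target, `Vec4f` and `addUnrolled` are hypothetical names, and the feature checks use the `core.cpuid` module from the D runtime.

```d
import core.cpuid : mmx, sse, sse2, sse3, amd3dnow;
import std.stdio : writefln;

// Point 2: a struct can request its own alignment with the align
// attribute. Vec4f is a hypothetical type, not part of Velocity.
align(16) struct Vec4f
{
    float[4] v;
}

static assert(Vec4f.alignof == 16);
static assert(Vec4f.sizeof == 16);

// Template-based unrolling: static foreach expands at compile time
// into N straight-line assignments, so no loop branch remains.
void addUnrolled(size_t N)(ref float[N] dst,
                           const ref float[N] a,
                           const ref float[N] b)
{
    static foreach (i; 0 .. N)
    {
        dst[i] = a[i] + b[i];
    }
}

void main()
{
    // Point 1: runtime feature flags, obtained via CPUID on x86.
    writefln("MMX: %s  SSE: %s  SSE2: %s  SSE3: %s  3DNow!: %s",
             mmx(), sse(), sse2(), sse3(), amd3dnow());

    // Point 3: a fixed four-float vector matches one 128-bit register.
    auto a = Vec4f([1.0f, 2.0f, 3.0f, 4.0f]);
    auto b = Vec4f([10.0f, 20.0f, 30.0f, 40.0f]);
    Vec4f r;
    addUnrolled(r.v, a.v, b.v);
    assert(r.v == [11.0f, 22.0f, 33.0f, 44.0f]);
}
```

A library could branch once on a flag such as `sse2()` and select a specialised code path, keeping the unrolled generic version as the fallback.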