digitalmars.D - FFT in D (using SIMD) and benchmarks

a (131/131) Jan 24 2012 Since SIMD types were added to D I've ported an FFT that I was

Iain Buclaw (13/141) Jan 24 2012 e

a (2/4) Jan 24 2012 No, GDC vectors worked fine. It was a pretty straight forward

bearophile (6/10) Jan 24 2012 I know your code is relatively simple, so it's not meant to be the faste...

a (44/59) Jan 24 2012 "bench" program in the fftw test directory gives this when run in

bearophile (8/12) Jan 24 2012 So your FFT code is rather good, despite being so much simpler and short...

a (2/17) Jan 24 2012 I have updated the graph now.

Marco Leise (3/3) Jan 25 2012 If you want to, you can take a look at the link in this forum post:

Robert Jacques (2/5) Jan 25 2012 For those who want the short version. MIT has developed a faster method ...

Manu (8/136) Jan 25 2012 Can you paste disassembly's of the GDC code and the G++ code?

a (87/98) Jan 25 2012 I've uploaded it here:

Iain Buclaw (5/12) Jan 25 2012 That's right, go on and give that GDC maintainer guy more work to do.

Trass3r (2/2) Jan 25 2012 For some reason your messages never have the proper position (i.e. they ...

Jonathan M Davis (5/7) Jan 25 2012 I believe that it has something to do with newsgroup vs mailing list pos...

Manu (2/14) Jan 25 2012 How can I resist, he does such a fast and awesome job of it! :P

"a" <a a.com> writes:

Since SIMD types were added to D I've ported an FFT that I was 
writing in C++ to D. The code is here:

https://github.com/jerro/pfft

Because dmd currently doesn't have an intrinsic for the SHUFPS 
instruction I've included a version block with some GDC specific 
code (this gave me a speedup of up to 80%). I've benchmarked the 
scalar and SSE version of code compiled with both DMD and GDC and 
also the c++ code using SSE. The results are below. The left 
column is base two logarithm of the array size and the right 
column is GFLOPS defined as the number of floating point 
operations that the most basic FFT algorithm would perform 
divided by the time taken (the algorithm I used performs just a 
bit less operations):

GFLOPS = 5 n log2(n) / (time for one FFT in nanoseconds)   (I 
took that definition from http://www.fftw.org/speed/ )

Chart: http://cloud.github.com/downloads/jerro/pfft/image.png

Results:

GDC SSE:

2	0.833648
3	1.23383
4	6.92712
5	8.93348
6	10.9212
7	11.9306
8	12.5338
9	13.4025
10	13.5835
11	13.6992
12	13.493
13	12.7082
14	9.32621
15	9.15256
16	9.31431
17	8.38154
18	8.267
19	7.61852
20	7.14305
21	7.01786
22	6.58934

G++ SSE:

2	1.65933
3	1.96071
4	7.09683
5	9.66308
6	11.1498
7	11.9315
8	12.5712
9	13.4241
10	13.4907
11	13.6524
12	13.4215
13	12.6472
14	9.62755
15	9.24289
16	9.64412
17	8.88006
18	8.66819
19	8.28623
20	7.74581
21	7.6395
22	7.33506

GDC scalar:

2	0.808422
3	1.20835
4	2.66921
5	2.81166
6	2.99551
7	3.26423
8	3.61477
9	3.90741
10	4.04009
11	4.20405
12	4.21491
13	4.30896
14	3.79835
15	3.80497
16	3.94784
17	3.98417
18	3.58506
19	3.33992
20	3.42309
21	3.21923
22	3.25673

DMD SSE:

2	0.497946
3	0.773551
4	3.79912
5	3.78027
6	3.85155
7	4.06491
8	4.30895
9	4.53038
10	4.61006
11	4.82098
12	4.7455
13	4.85332
14	3.37768
15	3.44962
16	3.54049
17	3.40236
18	3.47339
19	3.40212
20	3.15997
21	3.32644
22	3.22767

DMD scalar:

2	0.478998
3	0.772341
4	1.6106
5	1.68516
6	1.7083
7	1.70625
8	1.68684
9	1.66931
10	1.66125
11	1.63756
12	1.61885
13	1.60459
14	1.402
15	1.39665
16	1.37894
17	1.36306
18	1.27189
19	1.21033
20	1.25719
21	1.21315
22	1.21606

SIMD gives between 2 and 3.5 speedup for GDC compiled code and 
between 2.5 and 3 for DMD. Code compiled with GDC is just a 
little bit slower than G++ (and just for some values of n), which 
is really nice.

Jan 24 2012

Iain Buclaw <ibuclaw ubuntu.com> writes:

On 25 January 2012 00:04, a <a a.com> wrote:
 Since SIMD types were added to D I've ported an FFT that I was writing in
 C++ to D. The code is here:

 https://github.com/jerro/pfft

 Because dmd currently doesn't have an intrinsic for the SHUFPS instructio=

n
 I've included a version block with some GDC specific code (this gave me a
 speedup of up to 80%). I've benchmarked the scalar and SSE version of cod=

e
 compiled with both DMD and GDC and also the c++ code using SSE. The resul=

ts
 are below. The left column is base two logarithm of the array size and th=

e
 right column is GFLOPS defined as the number of floating point operations
 that the most basic FFT algorithm would perform divided by the time taken
 (the algorithm I used performs just a bit less operations):

 GFLOPS =3D 5 n log2(n) / (time for one FFT in nanoseconds) =A0 (I took th=

at
 definition from http://www.fftw.org/speed/ )

 Chart: http://cloud.github.com/downloads/jerro/pfft/image.png

 Results:

 GDC SSE:

 2 =A0 =A0 =A0 0.833648
 3 =A0 =A0 =A0 1.23383
 4 =A0 =A0 =A0 6.92712
 5 =A0 =A0 =A0 8.93348
 6 =A0 =A0 =A0 10.9212
 7 =A0 =A0 =A0 11.9306
 8 =A0 =A0 =A0 12.5338
 9 =A0 =A0 =A0 13.4025
 10 =A0 =A0 =A013.5835
 11 =A0 =A0 =A013.6992
 12 =A0 =A0 =A013.493
 13 =A0 =A0 =A012.7082
 14 =A0 =A0 =A09.32621
 15 =A0 =A0 =A09.15256
 16 =A0 =A0 =A09.31431
 17 =A0 =A0 =A08.38154
 18 =A0 =A0 =A08.267
 19 =A0 =A0 =A07.61852
 20 =A0 =A0 =A07.14305
 21 =A0 =A0 =A07.01786
 22 =A0 =A0 =A06.58934

 G++ SSE:

 2 =A0 =A0 =A0 1.65933
 3 =A0 =A0 =A0 1.96071
 4 =A0 =A0 =A0 7.09683
 5 =A0 =A0 =A0 9.66308
 6 =A0 =A0 =A0 11.1498
 7 =A0 =A0 =A0 11.9315
 8 =A0 =A0 =A0 12.5712
 9 =A0 =A0 =A0 13.4241
 10 =A0 =A0 =A013.4907
 11 =A0 =A0 =A013.6524
 12 =A0 =A0 =A013.4215
 13 =A0 =A0 =A012.6472
 14 =A0 =A0 =A09.62755
 15 =A0 =A0 =A09.24289
 16 =A0 =A0 =A09.64412
 17 =A0 =A0 =A08.88006
 18 =A0 =A0 =A08.66819
 19 =A0 =A0 =A08.28623
 20 =A0 =A0 =A07.74581
 21 =A0 =A0 =A07.6395
 22 =A0 =A0 =A07.33506

 GDC scalar:

 2 =A0 =A0 =A0 0.808422
 3 =A0 =A0 =A0 1.20835
 4 =A0 =A0 =A0 2.66921
 5 =A0 =A0 =A0 2.81166
 6 =A0 =A0 =A0 2.99551
 7 =A0 =A0 =A0 3.26423
 8 =A0 =A0 =A0 3.61477
 9 =A0 =A0 =A0 3.90741
 10 =A0 =A0 =A04.04009
 11 =A0 =A0 =A04.20405
 12 =A0 =A0 =A04.21491
 13 =A0 =A0 =A04.30896
 14 =A0 =A0 =A03.79835
 15 =A0 =A0 =A03.80497
 16 =A0 =A0 =A03.94784
 17 =A0 =A0 =A03.98417
 18 =A0 =A0 =A03.58506
 19 =A0 =A0 =A03.33992
 20 =A0 =A0 =A03.42309
 21 =A0 =A0 =A03.21923
 22 =A0 =A0 =A03.25673

 DMD SSE:

 2 =A0 =A0 =A0 0.497946
 3 =A0 =A0 =A0 0.773551
 4 =A0 =A0 =A0 3.79912
 5 =A0 =A0 =A0 3.78027
 6 =A0 =A0 =A0 3.85155
 7 =A0 =A0 =A0 4.06491
 8 =A0 =A0 =A0 4.30895
 9 =A0 =A0 =A0 4.53038
 10 =A0 =A0 =A04.61006
 11 =A0 =A0 =A04.82098
 12 =A0 =A0 =A04.7455
 13 =A0 =A0 =A04.85332
 14 =A0 =A0 =A03.37768
 15 =A0 =A0 =A03.44962
 16 =A0 =A0 =A03.54049
 17 =A0 =A0 =A03.40236
 18 =A0 =A0 =A03.47339
 19 =A0 =A0 =A03.40212
 20 =A0 =A0 =A03.15997
 21 =A0 =A0 =A03.32644
 22 =A0 =A0 =A03.22767

 DMD scalar:

 2 =A0 =A0 =A0 0.478998
 3 =A0 =A0 =A0 0.772341
 4 =A0 =A0 =A0 1.6106
 5 =A0 =A0 =A0 1.68516
 6 =A0 =A0 =A0 1.7083
 7 =A0 =A0 =A0 1.70625
 8 =A0 =A0 =A0 1.68684
 9 =A0 =A0 =A0 1.66931
 10 =A0 =A0 =A01.66125
 11 =A0 =A0 =A01.63756
 12 =A0 =A0 =A01.61885
 13 =A0 =A0 =A01.60459
 14 =A0 =A0 =A01.402
 15 =A0 =A0 =A01.39665
 16 =A0 =A0 =A01.37894
 17 =A0 =A0 =A01.36306
 18 =A0 =A0 =A01.27189
 19 =A0 =A0 =A01.21033
 20 =A0 =A0 =A01.25719
 21 =A0 =A0 =A01.21315
 22 =A0 =A0 =A01.21606

 SIMD gives between 2 and 3.5 speedup for GDC compiled code and between 2.=

5
 and 3 for DMD. Code compiled with GDC is just a little bit slower than G+=

+
 (and just for some values of n), which is really nice.


That is quite interesting, and really cool at the same time.  :)

Did you run into any issues with GDC's implementation of vectors?

--=20
Iain Buclaw

*(p < e ? p++ : p) =3D (c & 0x0f) + '0';

Jan 24 2012

"a" <a a.com> writes:

 Did you run into any issues with GDC's implementation of 
 vectors?

No, GDC vectors worked fine. It was a pretty straight forward 
port from C++ version using intrinsics from emmintrin.h.

Jan 24 2012

bearophile <bearophileHUGS lycos.com> writes:

a:

 Because dmd currently doesn't have an intrinsic for the SHUFPS 
 instruction I've included a version block with some GDC specific 
 code (this gave me a speedup of up to 80%).

It seems an instruction worth having in dmd too.


 Chart: http://cloud.github.com/downloads/jerro/pfft/image.png

I know your code is relatively simple, so it's not meant to be the fastest on
the ground, but in your nice graph _as reference point_ I'd like to see a line
for the FTTW too. Such line is able to show us how close or how far all this is
from an industry standard performance.
(And if possible I'd like to see two lines for the LDC2 compiler too.)

Bye,
bearophile

Jan 24 2012

"a" <a a.com> writes:

On Wednesday, 25 January 2012 at 00:49:15 UTC, bearophile wrote:
 a:

 Because dmd currently doesn't have an intrinsic for the SHUFPS 
 instruction I've included a version block with some GDC 
 specific code (this gave me a speedup of up to 80%).

 It seems an instruction worth having in dmd too.


 Chart: http://cloud.github.com/downloads/jerro/pfft/image.png

 I know your code is relatively simple, so it's not meant to be 
 the fastest on the ground, but in your nice graph _as reference 
 point_ I'd like to see a line for the FTTW too. Such line is 
 able to show us how close or how far all this is from an 
 industry standard performance.
 (And if possible I'd like to see two lines for the LDC2 
 compiler too.)

 Bye,
 bearophile

"bench" program in the fftw test directory gives this when run in 
a loop:


2	Problem: 4, setup: 21.00 us, time: 11.16 ns, ``mflops'': 3583.7
3	Problem: 8, setup: 21.00 us, time: 22.84 ns, ``mflops'': 5254.3
4	Problem: 16, setup: 24.00 us, time: 46.83 ns, ``mflops'': 6833.9
5	Problem: 32, setup: 290.00 us, time: 56.71 ns, ``mflops'': 14108
6	Problem: 64, setup: 1.00 ms, time: 111.47 ns, ``mflops'': 17225
7	Problem: 128, setup: 2.06 ms, time: 227.22 ns, ``mflops'': 19717
8	Problem: 256, setup: 3.99 ms, time: 499.48 ns, ``mflops'': 20501
9	Problem: 512, setup: 7.11 ms, time: 1.10 us, ``mflops'': 20958
10	Problem: 1024, setup: 14.51 ms, time: 2.47 us, ``mflops'': 
20690
11	Problem: 2048, setup: 30.18 ms, time: 5.72 us, ``mflops'': 
19693
12	Problem: 4096, setup: 61.20 ms, time: 13.20 us, ``mflops'': 
18622
13	Problem: 8192, setup: 127.97 ms, time: 36.02 us, ``mflops'': 
14784
14	Problem: 16384, setup: 252.58 ms, time: 82.43 us, ``mflops'': 
13913
15	Problem: 32768, setup: 490.55 ms, time: 194.14 us, ``mflops'': 
12659
16	Problem: 65536, setup: 1.13 s, time: 422.50 us, ``mflops'': 
12409
17	Problem: 131072, setup: 2.67 s, time: 994.75 us, ``mflops'': 
11200
18	Problem: 262144, setup: 5.77 s, time: 2.28 ms, ``mflops'': 
10338
19	Problem: 524288, setup: 1.72 s, time: 9.50 ms, ``mflops'': 
5243.4
20	Problem: 1048576, setup: 5.51 s, time: 20.55 ms, ``mflops'': 
5102.8
21	Problem: 2097152, setup: 9.55 s, time: 42.88 ms, ``mflops'': 
5135.2
22	Problem: 4194304, setup: 26.51 s, time: 88.56 ms, ``mflops'': 
5209.8

This was with fftw compiled for single precision and with SSE, 
but without AVX support. When I compiled fftw with AVX support, 
the peak was at about 30 GFLOPS, IIRC. It is possible that it 
would be even faster if I configured it in a different way. The 
C++ version of my FFT also supports AVX and gets to about 24 
GFLOPS when using it. If AVX types will be added to D, I will 
port that part too.

Jan 24 2012

bearophile <bearophileHUGS lycos.com> writes:

a:

 I have updated the graph now.

Thank you, it's a very nice graph. The reference lines of FFTW help. From how
much AVX improves the situation I see Intel engineers know their work; I didn't
know AVX is able to bring such performance improvement (in carefully written
code).


 The C++ version of my FFT also supports AVX and gets to
 about 24 GFLOPS when using it.

So your FFT code is rather good, despite being so much simpler and shorter than
FFTW code :-) Your code seems good to replace Phobos std.numeric.fft(), if you
will ever want to donate your FFT code to Phobos. 


 If AVX types will be added to D, I will port that part too.

Walter has added support for YMM registers too in D/DMD, so I presume having
AVX1 instructions are quite an option. I presume we will have them too. But ask
Walter for more details on this.

And hopefully we'll see good implementations of this algorithm too :-)
http://web.mit.edu/newsoffice/2012/faster-fourier-transforms-0118.html?tmpl=component&print=1

Bye,
bearophile

Jan 24 2012

"a" <a a.com> writes:

On Wednesday, 25 January 2012 at 00:49:15 UTC, bearophile wrote:
 a:

 Because dmd currently doesn't have an intrinsic for the SHUFPS 
 instruction I've included a version block with some GDC 
 specific code (this gave me a speedup of up to 80%).

 It seems an instruction worth having in dmd too.


 Chart: http://cloud.github.com/downloads/jerro/pfft/image.png

 I know your code is relatively simple, so it's not meant to be 
 the fastest on the ground, but in your nice graph _as reference 
 point_ I'd like to see a line for the FTTW too. Such line is 
 able to show us how close or how far all this is from an 
 industry standard performance.
 (And if possible I'd like to see two lines for the LDC2 
 compiler too.)

 Bye,
 bearophile

I have updated the graph now.

Jan 24 2012

"Marco Leise" <Marco.Leise gmx.de> writes:

If you want to, you can take a look at the link in this forum post:  
http://encode.ru/threads/1461-New-Fast-Fourier-Transform-Algorithm?highlight=FFT
It looks like MIT has a new FFT algorithm.

Jan 25 2012

"Robert Jacques" <sandford jhu.edu> writes:

On Wed, 25 Jan 2012 05:58:52 -0600, Marco Leise <Marco.Leise gmx.de> wrote:
 If you want to, you can take a look at the link in this forum post:
 http://encode.ru/threads/1461-New-Fast-Fourier-Transform-Algorithm?highlight=FFT
 It looks like MIT has a new FFT algorithm.

For those who want the short version. MIT has developed a faster method of
doing a sparse FFT transform. Since many applications (like compression) work
by throwing away the 'little' frequencies, those algorithms can get a practical
speed-up of ~10x, depending on compression level by switching from the FFT's
O(n log n) algorithm to MIT's new O(k log n) algorithm. The new algorithm is
still slower for doing all n frequencies, so it's not going to replace all FFT
usage, just lossy FFT usage.

Jan 25 2012

Manu <turkeyman gmail.com> writes:

Can you paste disassembly's of the GDC code and the G++ code?
I imagine there's something trivial in the scheduler that GDC has missed.
Like the other day I noticed GDC was unnecessarily generating a stack frame
for leaf functions, which Iain already fixed.

I'd also be interested to try out my experimental std.simd (portable)
library in the context of your FFT... might give that a shot, I think it'll
work well.


On 25 January 2012 02:04, a <a a.com> wrote:

 Since SIMD types were added to D I've ported an FFT that I was writing in
 C++ to D. The code is here:

 https://github.com/jerro/pfft

 Because dmd currently doesn't have an intrinsic for the SHUFPS instruction
 I've included a version block with some GDC specific code (this gave me a
 speedup of up to 80%). I've benchmarked the scalar and SSE version of code
 compiled with both DMD and GDC and also the c++ code using SSE. The results
 are below. The left column is base two logarithm of the array size and the
 right column is GFLOPS defined as the number of floating point operations
 that the most basic FFT algorithm would perform divided by the time taken
 (the algorithm I used performs just a bit less operations):

 GFLOPS = 5 n log2(n) / (time for one FFT in nanoseconds)   (I took that
 definition from http://www.fftw.org/speed/ )

 Chart: http://cloud.github.com/**downloads/jerro/pfft/image.png<http://cloud.github.com/downloads/jerro/pfft/image.png>

 Results:

 GDC SSE:

 2       0.833648
 3       1.23383
 4       6.92712
 5       8.93348
 6       10.9212
 7       11.9306
 8       12.5338
 9       13.4025
 10      13.5835
 11      13.6992
 12      13.493
 13      12.7082
 14      9.32621
 15      9.15256
 16      9.31431
 17      8.38154
 18      8.267
 19      7.61852
 20      7.14305
 21      7.01786
 22      6.58934

 G++ SSE:

 2       1.65933
 3       1.96071
 4       7.09683
 5       9.66308
 6       11.1498
 7       11.9315
 8       12.5712
 9       13.4241
 10      13.4907
 11      13.6524
 12      13.4215
 13      12.6472
 14      9.62755
 15      9.24289
 16      9.64412
 17      8.88006
 18      8.66819
 19      8.28623
 20      7.74581
 21      7.6395
 22      7.33506

 GDC scalar:

 2       0.808422
 3       1.20835
 4       2.66921
 5       2.81166
 6       2.99551
 7       3.26423
 8       3.61477
 9       3.90741
 10      4.04009
 11      4.20405
 12      4.21491
 13      4.30896
 14      3.79835
 15      3.80497
 16      3.94784
 17      3.98417
 18      3.58506
 19      3.33992
 20      3.42309
 21      3.21923
 22      3.25673

 DMD SSE:

 2       0.497946
 3       0.773551
 4       3.79912
 5       3.78027
 6       3.85155
 7       4.06491
 8       4.30895
 9       4.53038
 10      4.61006
 11      4.82098
 12      4.7455
 13      4.85332
 14      3.37768
 15      3.44962
 16      3.54049
 17      3.40236
 18      3.47339
 19      3.40212
 20      3.15997
 21      3.32644
 22      3.22767

 DMD scalar:

 2       0.478998
 3       0.772341
 4       1.6106
 5       1.68516
 6       1.7083
 7       1.70625
 8       1.68684
 9       1.66931
 10      1.66125
 11      1.63756
 12      1.61885
 13      1.60459
 14      1.402
 15      1.39665
 16      1.37894
 17      1.36306
 18      1.27189
 19      1.21033
 20      1.25719
 21      1.21315
 22      1.21606

 SIMD gives between 2 and 3.5 speedup for GDC compiled code and between 2.5
 and 3 for DMD. Code compiled with GDC is just a little bit slower than G++
 (and just for some values of n), which is really nice.

Jan 25 2012

"a" <a a.com> writes:

On Wednesday, 25 January 2012 at 18:36:35 UTC, Manu wrote:
 Can you paste disassembly's of the GDC code and the G++ code?
 I imagine there's something trivial in the scheduler that GDC 
 has missed.

 Like the other day I noticed GDC was unnecessarily generating a 
 stack frame
 for leaf functions, which Iain already fixed.

I've uploaded it here:

https://github.com/downloads/jerro/pfft/disassembly.zip

For g++ there is the disassembly of the entire object file. For 
gdc that included parts of the standard library and was about 
170k lines so I coppied just the functions that took the most 
time. I have also included percentages of time taken by functions 
for "./bench 22". I noticed that 2% of time is taken by memset. 
This could be caused by static array initialization in 
fft_passes_stride, so I'll void initialize static arrays there.

For very small sizes the code compiled with g++ is probably 
faster because it's more aggressively inlined.

 I'd also be interested to try out my experimental std.simd 
 (portable)
 library in the context of your FFT... might give that a shot, I 
 think it'll
 work well.

For that you probably only need to replace the code in sse.d 
(create stdsimd.d or something). The type that you pass as a 
first parameter for the FFT template should define at least the 
following:

- an alias "vec" for the vector type
- an alias "T" for the scalar type
- an enum vec_size which is the number of scalars in a vector - a 
function "static vec scalar_to_vector(T)", which takes a scalar 
and returns
a vector with all elements set to that scalar
- a function "static void bit_reverse_swap_16(T * p0, T * p1, T * 
p2, T * p3, size_t i1, size_t i2)"
- a function "static void bit_reverse_16(T * p0, T * p1, T * p2, 
T * p3, size_t i)"

You can start with using Scalar!T.bit_reverse_16 and 
Scalar!T.bit_reverse_swap_16 from fft_impl.d but that won't be 
very fast. Those to funcions do the following:

bit_reverse_16: This one reads 16 elements - four from p0 + i, 
four from p1 + i, four from p2 + i and four from p3 + i. Then it 
"bit reverses" them - this means that for each pair of indices j 
and k such that k has the reverse
bits of j, it swaps element at j and element at k. I define the 
index here so that the element that was read from address pn + i 
+ m has index n * 4 + m.
After bit reversing them, it writes the elements back. You cann 
assume that
(p0 + i), (p1 + i), (p2 + i) and (p3 + i) are all aligned to 
4*T.sizeof.

bit_reverse_swap_16: This one reads 16 elements from  (p0 + i), 
(p1 + i), (p2 + i) and (p3 + i)
and bitreverses them and also the elements from (p0 + j), (p1 + 
j), (p2 + j) and (p3 + j)
and bitreverses those. Then it writes the elements that were read 
from offsets i to offsets j and vice versa. You can assue that 
(p0 + i), (p1 + i), (p2 + i) and (p3 + i),
(p0 + j), (p1 + j), (p2 + j) and (p3 + j) are all aligned to 
4*T.sizeof.

If you want to speed up the fft for larg sizes (greater or equal 
to 1<<Options.largeLimit, which is roughly those that do not fit 
in L1 cache) a bit, you can also write those functions:

- static void interleave(int n)(vec a, vec b, ref vec c, ref vec 
d)
Here n will be a power of two larger than 1 and less or eaqual to 
vec_size.
The function breaks vector a into n equaly sized parts and does 
the same
for b, then interleaves those parts and writes them to c,d. For 
example:
for vec_size = 4 and n = 4 it should write (a0, b0, a1, b1) to c 
and (a2, b2, a3, b3) to d. For vec_size = 4 and n = 2 it should 
write (a0, a1, b0, b1) to c and (a2,a3,b2,b3) to d.

- static void deinterleave(int n)(vec a, vec b, ref vec c, ref 
vec d)
This is an inverse of interleave!n(a, b, c, d)

- static void complex_array_to_real_imag_vec(int n)(float * arr, 
ref vec rr, ref vec ri)
This one reads n pairs from arr. It repeats the first element of 
each pair
vec_size / n times and writes them to rr. It also repeats the 
second element
of each pair vec_size / n times and writes them to ri. For 
example: if vec_size
is 4 and n is 2 then the it should write (arr[0], arr[0], arr[2], 
arr[2])
to rr and (arr[1], arr[1], arr[3], arr[3]) to ri. You can assume 
that arr
is aligned to 2*n*T.sizeof.

The functions interleave, deinterleave and 
complex_array_to_real_imag_vec
are called in the inner loop, so if you can't make them very 
fast, you should
just ommit them. The algorithm will then fall back (it checks for 
the presence
of interleave!2) to a scalar one for the passes where these 
functions would be used otherwise.

Jan 25 2012

Iain Buclaw <ibuclaw ubuntu.com> writes:

On 25 January 2012 12:54, Manu <turkeyman gmail.com> wrote:
 Can you paste disassembly's of the GDC code and the G++ code?
 I imagine there's something trivial in the scheduler that GDC has missed.
 Like the other day I noticed GDC was unnecessarily generating a stack frame
 for leaf functions, which Iain already fixed.

 I'd also be interested to try out my experimental std.simd (portable)
 library in the context of your FFT... might give that a shot, I think it'll
 work well.

That's right, go on and give that GDC maintainer guy more work to do.


-- 
Iain Buclaw

*(p < e ? p++ : p) = (c & 0x0f) + '0';

Jan 25 2012

Trass3r <un known.com> writes:

For some reason your messages never have the proper position (i.e. they  
aren't connected with the post the respond to) in the message tree.

Jan 25 2012

Jonathan M Davis <jmdavisProg gmx.com> writes:

On Wednesday, January 25, 2012 22:37:41 Trass3r wrote:
 For some reason your messages never have the proper position (i.e. they
 aren't connected with the post the respond to) in the message tree.

I believe that it has something to do with newsgroup vs mailing list post, but 
I'm not sure. It could also be your client. Some clients are smarter about 
threading posts than others.

- Jonathan M Davis

Jan 25 2012

Manu <turkeyman gmail.com> writes:

On 25 January 2012 20:50, Iain Buclaw <ibuclaw ubuntu.com> wrote:

 On 25 January 2012 12:54, Manu <turkeyman gmail.com> wrote:
 Can you paste disassembly's of the GDC code and the G++ code?
 I imagine there's something trivial in the scheduler that GDC has missed.
 Like the other day I noticed GDC was unnecessarily generating a stack

 frame
 for leaf functions, which Iain already fixed.

 I'd also be interested to try out my experimental std.simd (portable)
 library in the context of your FFT... might give that a shot, I think

 it'll
 work well.

 That's right, go on and give that GDC maintainer guy more work to do.


How can I resist, he does such a fast and awesome job of it! :P

Jan 25 2012

D Programming

C/C++ Programming

Other

digitalmars.D - FFT in D (using SIMD) and benchmarks