
digitalmars.D.learn - Performance of tables slower than built in?

reply JS <JS.Music.Works gmail.com> writes:
I am trying to create some fast sin, sinc, and exponential 
routines to speed up some code by using tables... but it seems 
it's slower than the function itself?!?


The code below uses matplotlibd for plotting, but that part is not necessary. The timing code is further down.
Ideally I'd like to have quadratic interpolation and all that but 
if the code is going to be slower and less accurate then what is 
the point?

Surely I'm doing something wrong?

I wasn't able to test this with LDC because it's giving me linker 
errors for some reason.

I'm also curious as to what D is using internally in the first 
place.

The generation uses a quarter sin since the waveform is symmetric. This does slow down the code a little, but I can't imagine that the few ops it uses to do this are more costly than whatever is used for computing sin.





import std.math;
import std.range;
import std.random;
import std.algorithm;
import plt = matplotlibd.pyplot;
import std.stdio, core.stdc.stdlib;


__gshared double[] QuarterSinTab;

auto GenQuarterSinTab(int res = 4*100)
{
	res++;	// Add extra endpoint to make linear interpolation easier/faster
	alias T = typeof(QuarterSinTab[0]);
	if (QuarterSinTab !is null) free(QuarterSinTab.ptr);
	QuarterSinTab = (cast(T*)malloc(T.sizeof*res))[0..res];

	res--;
	for(int i = 0; i < res; i++)
		QuarterSinTab[i] = sin(PI*(i/cast(double)res));	

	QuarterSinTab[$-1] = QuarterSinTab[0];
}

/*
	Uses the tabularized quarter sine and extends it to the full range
*/
auto sinTab(string Method = "Linear")(typeof(QuarterSinTab[0]) x)
{
	alias T = typeof(QuarterSinTab[0]);
	auto len = QuarterSinTab.length - 1;
	auto s = -1;		
	auto m = x - cast(int)x;	
	auto m2 = 1;
	
	if (x < 0) { m = m + 1; s = 1; }
	if ((abs(x) % 2) < 1) s = -s;			
	auto a = m*len;
	auto ai = cast(int)(m*len);

	switch(Method)
	{	
		default:
		case "Linear":
			auto f = a - ai;
			return s*((1 - f)*QuarterSinTab[ai] + f*QuarterSinTab[ai+1]);
		case "Constant":
			return s*QuarterSinTab[ai];
		
	}
}

void main() {
	GenQuarterSinTab;

	auto x = iota(-10, 10, 0.01).map!(x => x * PI);
	auto y = x.map!("sin(PI*a)");
	auto z = x.map!((x){ return sin(PI*x) - sinTab(x);});
	//plt.plot(x, y, "r-", ["label": "$y=sin(x)$"]);
	plt.plot(x, z, "b-", ["label": "$y$"]);
	plt.xlim(-5, 5);
	plt.ylim(-0.001, 0.001);
	plt.legend();
	plt.show();

	import std.datetime;
	double xxx = 0;
	StopWatch sw;
	sw.start();
	for(double i = 0; i < 10000000; i++) xxx += sinTab(i);
	auto t = sw.peek().msecs;
	writeln(t);
	sw.stop();
	writeln(xxx);
	
	xxx = 0;
	sw.reset();
	sw.start();
	for(double i = 0; i < 10000000; i++) xxx += sin(PI*i);
	t = sw.peek().msecs;
	writeln(t);
	sw.stop();
}
May 21 2019
next sibling parent reply Adam D. Ruppe <destructionator gmail.com> writes:
On Wednesday, 22 May 2019 at 00:22:09 UTC, JS wrote:
 I am trying to create some fast sin, sinc, and exponential 
 routines to speed up some code by using tables... but it seems 
 it's slower than the function itself?!?
There's intrinsic cpu instructions for some of those that can do the math faster than waiting on memory access. It is quite likely calculating it is actually faster. Even carefully written and optimized tables tend to just have a very small win relative to the cpu nowadays.
May 21 2019
next sibling parent matheus <m g.com> writes:
On Wednesday, 22 May 2019 at 00:55:37 UTC, Adam D. Ruppe wrote:
 On Wednesday, 22 May 2019 at 00:22:09 UTC, JS wrote:
 I am trying to create some fast sin, sinc, and exponential 
 routines to speed up some code by using tables... but it seems 
 it's slower than the function itself?!?
There's intrinsic cpu instructions for some of those that can do the math faster than waiting on memory access. It is quite likely calculating it is actually faster. Even carefully written and optimized tables tend to just have a very small win relative to the cpu nowadays.
Exactly, and by the way this is an old finding already. I remember a long time ago, when I was studying/writing some games, I decided to test pre-computed arrays (LUTs) for SIN/COS vs functions, and the latter would beat the arrays pretty easily.

And by the way, when porting old games (if not changing the game logic too much), you usually get rid of the LUT and use the functions directly.

Matheus.
May 21 2019
prev sibling parent reply Alex <AJ gmail.com> writes:
On Wednesday, 22 May 2019 at 00:55:37 UTC, Adam D. Ruppe wrote:
 On Wednesday, 22 May 2019 at 00:22:09 UTC, JS wrote:
 I am trying to create some fast sin, sinc, and exponential 
 routines to speed up some code by using tables... but it seems 
 it's slower than the function itself?!?
There's intrinsic cpu instructions for some of those that can do the math faster than waiting on memory access. It is quite likely calculating it is actually faster. Even carefully written and optimized tables tend to just have a very small win relative to the cpu nowadays.
Surely not? I'm not sure what method is used to calculate them and maybe a table method is used internally for the common functions (maybe the periodic ones), but memory access surely is faster than multiplying doubles?

And most of the time these functions are computed by some series that requires many terms. I'd expect, say, to compute sin one would require at least 10 multiplies for any accuracy... and surely that is much slower than simply accessing a table (it's true that my code is more complex due to the modulos and maybe that is eating up the diff).

Do you have any proof of your claims? Like a paper that discusses such things so I can see what's really going on and how they achieve such performance (and how accurate)?
May 23 2019
parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 23.05.19 12:21, Alex wrote:
 On Wednesday, 22 May 2019 at 00:55:37 UTC, Adam D. Ruppe wrote:
 On Wednesday, 22 May 2019 at 00:22:09 UTC, JS wrote:
 I am trying to create some fast sin, sinc, and exponential routines 
 to speed up some code by using tables... but it seems it's slower 
 than the function itself?!?
There's intrinsic cpu instructions for some of those that can do the math faster than waiting on memory access. It is quite likely calculating it is actually faster. Even carefully written and optimized tables tend to just have a very small win relative to the cpu nowadays.
Surely not? I'm not sure what method is used to calculate them and maybe a table method is used internally for the common functions(maybe the periodic ones) but memory access surely is faster than multiplying doubles? ...
Depends on what kind of memory access, and what kind of faster. If you hit L1 cache then a memory access might be (barely) faster than a single double multiplication. (But modern hardware usually can do multiple double multiplies in parallel, and presumably also multiple memory reads, using SIMD and/or instruction-level parallelism.) I think a single in-register double multiplication will be roughly 25 times faster than an access to main memory.

Each access to main memory will pull an entire cache line from main memory to the cache, so if you have good locality (you usually won't with a LUT), your memory accesses will be faster on average. There are a lot of other microarchitectural details that can matter quite a lot for performance.
 And most of the time these functions are computed by some series that 
 requires many terms. I'd expect, say, to compute sin one would require 
 at least 10 multiplies for any accuracy... and surely that is much 
 slower than simply accessing a table(it's true that my code is more 
 complex due to the modulos and maybe that is eating up the diff).
 
 Do you have any proof of your claims? Like a paper that discusses such 
 things so I can see what's really going on and how they achieve such 
 performance(and how accurate)?
Not exactly what you asked, but this might help:

https://www.agner.org/optimize

Also, look up the CORDIC algorithm.
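For illustration, a minimal CORDIC sine/cosine sketch in D (rotation mode, angle assumed already reduced to [-PI/2, PI/2]); real fixed-point implementations precompute the atan table and the gain rather than computing them per iteration, so this only shows the shift-and-add idea, not what any particular libm does:

import std.math : atan, sqrt, PI;
import std.stdio : writefln;

// CORDIC in rotation mode: rotate (1, 0) towards the target angle using only
// adds and scalings by 2^-i, accumulating the rotation gain as we go.
void cordicSinCos(double angle, out double s, out double c, size_t n = 32)
{
    double x = 1.0, y = 0.0, z = angle;
    double pow2 = 1.0;   // 2^-i
    double gain = 1.0;   // accumulated rotation gain, divided out at the end
    foreach (i; 0 .. n)
    {
        immutable d  = (z >= 0) ? 1.0 : -1.0;
        immutable xn = x - d * y * pow2;
        immutable yn = y + d * x * pow2;
        z -= d * atan(pow2);             // real implementations table these
        x = xn;
        y = yn;
        gain *= sqrt(1.0 + pow2 * pow2);
        pow2 *= 0.5;
    }
    c = x / gain;
    s = y / gain;
}

void main()
{
    double s, c;
    cordicSinCos(PI / 6, s, c);
    writefln("sin(pi/6) ~ %.9f, cos(pi/6) ~ %.9f", s, c); // ~0.5, ~0.866025404
}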
May 23 2019
parent reply Alex <AJ gmail.com> writes:
On Thursday, 23 May 2019 at 15:20:22 UTC, Timon Gehr wrote:
 On 23.05.19 12:21, Alex wrote:
 On Wednesday, 22 May 2019 at 00:55:37 UTC, Adam D. Ruppe wrote:
 On Wednesday, 22 May 2019 at 00:22:09 UTC, JS wrote:
 I am trying to create some fast sin, sinc, and exponential 
 routines to speed up some code by using tables... but it 
 seems it's slower than the function itself?!?
There's intrinsic cpu instructions for some of those that can do the math faster than waiting on memory access. It is quite likely calculating it is actually faster. Even carefully written and optimized tables tend to just have a very small win relative to the cpu nowadays.
Surely not? I'm not sure what method is used to calculate them and maybe a table method is used internally for the common functions(maybe the periodic ones) but memory access surely is faster than multiplying doubles? ...
Depends on what kind of memory access, and what kind of faster. If you hit L1 cache then a memory access might be (barely) faster than a single double multiplication. (But modern hardware usually can do multiple double multiplies in parallel, and presumably also multiple memory reads, using SIMD and/or instruction-level parallelism.) I think a single in-register double multiplication will be roughly 25 times faster than an access to main memory. Each access to main memory will pull an entire cache line from main memory to the cache, so if you have good locality (you usually won't with a LUT), your memory accesses will be faster on average. There are a lot of other microarchitectural details that can matter quite a lot for performance.
I realize a lot of optimizations could be made, but I still find it hard to believe that it could be done faster even with special hardware, unless a LUT is used in the hardware already or some highly optimized algorithms. The whole point of general purpose computing was to get away from specialized hardware because it was cheaper and one could spend the resources on improving raw cycle performance, which would benefit the entire system. But I guess because of the complexity of hardware nowadays it's hard to really figure out what is going on.

http://ithare.com/infographics-operation-costs-in-cpu-clock-cycles/

Maybe when it is an "intrinsic" it can avoid some of the issues that might otherwise cost in a LUT (such as a context switch).
 And most of the time these functions are computed by some 
 series that requires many terms. I'd expect, say, to compute 
 sin one would require at least 10 multiplies for any 
 accuracy... and surely that is much slower than simply 
 accessing a table(it's true that my code is more complex due 
 to the modulos and maybe that is eating up the diff).
 
 Do you have any proof of your claims? Like a paper that 
 discusses such things so I can see what's really going on and 
 how they achieve such performance(and how accurate)?
Not exactly what you asked, but this might help: https://www.agner.org/optimize Also, look up the CORDIC algorithm.
I don't see how that can necessarily be faster. A LUT can give full 64-bit precision with one operation. The CORDIC needs iteration, at least 10 to be of any use. LUTs are precision independent, assuming the creation cost is not included.

I couldn't find anything in that link that specifically addressed speeding up typical math functions. It would be nice to know exactly what is going on, which functions are optimized, and the typical speedup over a LUT.

I have some code that uses sin, exp and a few other primitive algebraic functions, and in one case the code is extremely slow (it uses exp). I haven't done much testing of all this but something just seems off somewhere. Either sin is heavily optimized and exp is not (though exp too could have a CORDIC-like algorithm, since exp(a+b) = exp(a)*exp(b)), or something else is happening that I'm not aware of... which is why I went to tabularizing these functions in the first place. Also, I don't know what kind of accuracy they have, which doesn't help much, but I could use a more accurate algorithm that is slower since it is pre-computed.

That site I linked does talk about some of the issues and stuff... I guess I just need to update my thinking on modern cpus a little better.

I guess there is no tool that can tell one exactly what is happening to a piece of code in the cpu... basically an instruction level profiler?
May 23 2019
parent Ola Fosheim Grøstad <ola.fosheim.grostad gmail.com> writes:
On Thursday, 23 May 2019 at 21:30:47 UTC, Alex wrote:
 I don't see how that can necessarily be faster. A LUT can give 
 full 64-bit precision with one operation. The CORDIC needs 
 iteration, at least 10 to be of any use. LUT's are precision 
 independent assuming the creation cost is not included.
It isn't, IEEE sin is typically not fast. Arm cpus let you run fewer iterations though for nonstandard floats.
 I have some code that uses sin, exp and a few other primitive 
 algebraic functions and in one case the code is extremely 
 slow(it uses exp).  I haven't done much testing of all this but 
 something just seems off somewhere.
Don't know about exp, but some operations are slow when you get too close to zero, so-called denormal numbers.
 I guess there is no tool that can tell one exactly what is 
 happening to a piece of code in the cpu... basically an 
 instruction level profiler?
Vtune from Intel, not free... Afaik.
May 23 2019
prev sibling next sibling parent rikki cattermole <rikki cattermole.co.nz> writes:
On 22/05/2019 12:22 PM, JS wrote:
 I am trying to create some fast sin, sinc, and exponential routines to 
 speed up some code by using tables... but it seems it's slower than the 
 function itself?!?
 
 
 The code below uses matplotlibd but is not necessary. The timing code is 
 lower.
 Ideally I'd like to have quadratic interpolation and all that but if the 
 code is going to be slower and less accurate then what is the point?
 
 Surely I'm doing something wrong?
 
 I wasn't able to test this with LDC because it's giving me linker errors 
 for some reason.
 
 I'm also curious as to what D is using internally in the first place.
It should be the libc's. With regards to your code, it looks like your computation is wrong. I gave it a go (although refactored) and it just wouldn't output the right set of values. So performance isn't all that important right now.
May 21 2019
prev sibling next sibling parent reply Basile B. <b2.temp gmx.com> writes:
On Wednesday, 22 May 2019 at 00:22:09 UTC, JS wrote:
 I am trying to create some fast sin, sinc, and exponential 
 routines to speed up some code by using tables... but it seems 
 it's slower than the function itself?!?

 [...]
Hi, lookup tables ARE faster, but the problem you have here, and I'm surprised that nobody noticed it so far, is that YOUR SWITCH LEADS TO A STRING COMPARISON AT RUNTIME.

Just replace it with a

    static if (Method = "Linear") { /*...*/ } else { /*...*/ }

Also take care with the type used. With DMD the implicit coercion of float and double can lead to extra conversions.

You'll directly see a 15% gain after refactoring the switch.
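For illustration, a minimal sketch of that refactor applied to the sinTab from the original post (written with "==", as corrected in the follow-up below); this is just the suggested shape, not tested code:

auto sinTab(string Method = "Linear")(double x)
{
	auto len = QuarterSinTab.length - 1;
	auto s = -1;
	auto m = x - cast(int)x;

	if (x < 0) { m = m + 1; s = 1; }
	if ((abs(x) % 2) < 1) s = -s;
	auto a = m*len;
	auto ai = cast(int)(m*len);

	// resolved at compile time, no string comparison left in the hot path
	static if (Method == "Linear")
	{
		auto f = a - ai;
		return s*((1 - f)*QuarterSinTab[ai] + f*QuarterSinTab[ai+1]);
	}
	else static if (Method == "Constant")
	{
		return s*QuarterSinTab[ai];
	}
	else
		static assert(0, "unknown Method: " ~ Method);
}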
May 22 2019
next sibling parent Basile B. <b2.temp gmx.com> writes:
On Wednesday, 22 May 2019 at 08:25:58 UTC, Basile B. wrote:
 On Wednesday, 22 May 2019 at 00:22:09 UTC, JS wrote:
 I am trying to create some fast sin, sinc, and exponential 
 routines to speed up some code by using tables... but it seems 
 it's slower than the function itself?!?

 [...]
Hi, lookup tables ARE faster but the problem you have here, and I'm surprised that nobody noticed it so far, is that YOUR SWITCH LEADS TO A RUNTIME STRING COMPARISON AT RUNTIME. Just replace it with a static if (Method = "Linear") { /*...*/} else { /*...*/}
Oh no... I meant "==" obviously
 Also takes care to the type used. With DMD the implicit 
 coercion of float and double can lead to extra conversions.

 You'll directly see a 15% gain after refactoring the switch.
May 22 2019
prev sibling parent reply Alex <AJ gmail.com> writes:
On Wednesday, 22 May 2019 at 08:25:58 UTC, Basile B. wrote:
 On Wednesday, 22 May 2019 at 00:22:09 UTC, JS wrote:
 I am trying to create some fast sin, sinc, and exponential 
 routines to speed up some code by using tables... but it seems 
 it's slower than the function itself?!?

 [...]
Hi, lookup tables ARE faster but the problem you have here, and I'm surprised that nobody noticed it so far, is that YOUR SWITCH LEADS TO A RUNTIME STRING COMPARISON AT RUNTIME. Just replace it with a static if (Method = "Linear") { /*...*/} else { /*...*/} Also takes care to the type used. With DMD the implicit coercion of float and double can lead to extra conversions. You'll directly see a 15% gain after refactoring the switch.
Surely not?!?! Surely the compiler can optimize that switch since the value passed is CT? I thought the whole point of not having a static switch (analogous to static if) was because it would go ahead and optimize these cases for us... and it's just a switch, just a jmp table.
May 23 2019
parent Basilez B. <b2.temp gmx.com> writes:
On Thursday, 23 May 2019 at 10:16:42 UTC, Alex wrote:
 On Wednesday, 22 May 2019 at 08:25:58 UTC, Basile B. wrote:
 On Wednesday, 22 May 2019 at 00:22:09 UTC, JS wrote:
 I am trying to create some fast sin, sinc, and exponential 
 routines to speed up some code by using tables... but it 
 seems it's slower than the function itself?!?

 [...]
Hi, lookup tables ARE faster but the problem you have here, and I'm surprised that nobody noticed it so far, is that YOUR SWITCH LEADS TO A RUNTIME STRING COMPARISON AT RUNTIME. Just replace it with a static if (Method = "Linear") { /*...*/} else { /*...*/} Also takes care to the type used. With DMD the implicit coercion of float and double can lead to extra conversions. You'll directly see a 15% gain after refactoring the switch.
Surely not?!?! Surely the compiler can optimize that switch since the value passed is CT? I thought the whole point of not having static switch(analogous to static if) was because it would go ahead and optimize these cases for us... and it's just a switch, just a jmp table.
Try it yourself, but to be clear, note that I don't like your attitude, which I find disrespectful. I'm just here to help; I explained to you the big problem you have in your code and you start discussing something that's not to be discussed AT ALL.

Look at this https://d.godbolt.org/z/vtzVdp, in sinTab you'll be able to see

    call pure nothrow nogc safe int object.__switch!(immutable(char), "Linear", "Constant").__switch(scope const(immutable(char)[])) PLT

and this even with LDC2 -O3. That's why your LUT is so slow. Refactor the switch with "static if", the branch will be eliminated and you'll see the perf improvement.
May 24 2019
prev sibling next sibling parent reply Ola Fosheim Grøstad <ola.fosheim.grostad gmail.com> writes:
On Wednesday, 22 May 2019 at 00:22:09 UTC, JS wrote:
 I am trying to create some fast sin, sinc, and exponential 
 routines to speed up some code by using tables... but it seems 
 it's slower than the function itself?!?
Not when I tried it with one of the online compilers, the LUT is 3-4 times faster on your tight inner loop, but I guess it depends on the CPU. A LUT should be very fast in long-running tight inner loops like that once the cache is warm, with such a small LUT that fits in working memory (cache).
May 23 2019
parent reply Alex <AJ gmail.com> writes:
On Thursday, 23 May 2019 at 18:57:03 UTC, Ola Fosheim Grøstad 
wrote:
 On Wednesday, 22 May 2019 at 00:22:09 UTC, JS wrote:
 I am trying to create some fast sin, sinc, and exponential 
 routines to speed up some code by using tables... but it seems 
 it's slower than the function itself?!?
Not when I tried it with one of the online compilers, LUT is 3-4 times faster on your tight inner loop, but I guess it depends on the CPU. LUT should be very fast in long-running tight inner loops like that once the cache is warm with such as small LUT that fits in working memory (cache).
I've used very small LUTs, like a length of 5, and it didn't significantly change anything. Unless it is being hit hard at the start and that is skewing the results, it doesn't seem to be the issue.

I haven't tested this well, but was just thrown off by the results, as it should easily have been inverted and I expected quite a significant speed up (several factors) and not the reverse.
May 23 2019
parent Ola Fosheim Grøstad <ola.fosheim.grostad gmail.com> writes:
On Thursday, 23 May 2019 at 21:50:38 UTC, Alex wrote:
 I've used very small LUT's like a length of 5 and it didn't 
 significantly change anything.
Use a size that is 2^n, then mask the index and hopefully that will turn off bounds checks. E.g. If LUT size is 16, then index the lut with "i&15"?
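A tiny illustration of the masking idea (my own sketch, with a made-up table just to show the indexing):

import std.stdio : writeln;

enum size_t N = 16;              // table length, must be a power of two
__gshared double[N] table;

double lookup(size_t i)
{
    // mask instead of modulo: the index can never leave [0, N-1], so the
    // optimizer can drop the bounds check
    return table[i & (N - 1)];
}

void main()
{
    foreach (i; 0 .. N) table[i] = i;
    writeln(lookup(3));    // 3
    writeln(lookup(19));   // wraps around to index 3
}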
 I haven't tested this well but was just thrown off by the 
 results as it should easily have been inverted and I expected 
 quite a significant speed up(several factors) and not the 
 reverse.
Well, you could take the time times clock frequency, divide it by number of iterations and calculate number of cycles per iteration. If it is more than a dozen, then something is wrong.
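As a sketch of that arithmetic (the 3 GHz clock is a placeholder; substitute your CPU's frequency and your measured time):

import std.stdio : writefln;

void main()
{
    enum double clockHz    = 3.0e9;        // assumed 3 GHz CPU
    enum double iterations = 10_000_000;   // loop count from the benchmark
    double elapsedSeconds  = 0.1;          // plug in the measured time here

    double cyclesPerIteration = elapsedSeconds * clockHz / iterations;
    writefln("~%.1f cycles per iteration", cyclesPerIteration);
}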
May 23 2019
prev sibling next sibling parent reply Ola Fosheim Grøstad <ola.fosheim.grostad gmail.com> writes:
On Wednesday, 22 May 2019 at 00:22:09 UTC, JS wrote:
 	xxx = 0;
 	sw.reset();
 	sw.start();
 	for(double i = 0; i < 10000000; i++) xxx += sin(PI*i);
 	t = sw.peek().msecs;
 	writeln(t);
 	sw.stop();
 }
What you are doing wrong is that xxx is never used... So DMD removes it altogether? If you add writeln(xxx) after the second loop as well, then maybe the measurements make more sense? Ola.
May 23 2019
parent reply Alex <AJ gmail.com> writes:
On Thursday, 23 May 2019 at 19:17:40 UTC, Ola Fosheim Grøstad 
wrote:
 On Wednesday, 22 May 2019 at 00:22:09 UTC, JS wrote:
 	xxx = 0;
 	sw.reset();
 	sw.start();
 	for(double i = 0; i < 10000000; i++) xxx += sin(PI*i);
 	t = sw.peek().msecs;
 	writeln(t);
 	sw.stop();
 }
What you are doing wrong is that xxx is never used... So DMD removes it altogether? If you add writeln(xxx) after the second loop as well, then maybe the measurements make more sense? Ola.
maybe, I thought I put in the writeln(xxx) in there to do that... yeah, looking at my code it's there, so that is not the problem. This is the code I had,

import std.datetime;
double xxx = 0;
StopWatch sw;
sw.start();
for(double i = 0; i < 10000000; i++) xxx += sinTab(i);
auto t = sw.peek().msecs;
writeln(t);
sw.stop();
writeln(xxx);

xxx = 0;
sw.reset();
sw.start();
for(double i = 0; i < 10000000; i++) xxx += sin(PI*i);
t = sw.peek().msecs;
writeln(t);
sw.stop();
writeln(xxx);

Either I modified it later or deleted one of those lines or the editor screwed up when I pasted it... I don't recall touching the code afterwards.

Either way, sin is still twice as fast. Also, in the code the sinTab version is missing the writeln so it would have been faster.. so it is not being optimized out.
May 23 2019
parent reply Ola Fosheim Grøstad <ola.fosheim.grostad gmail.com> writes:
On Thursday, 23 May 2019 at 21:47:45 UTC, Alex wrote:
 Either way, sin it's still twice as fast. Also, in the code the 
 sinTab version is missing the writeln so it would have been 
 faster.. so it is not being optimized out.
Well, when I run this modified version:

https://gist.github.com/run-dlang/9f29a83b7b6754da98993063029ef93c

on https://run.dlang.io/ then I get:

LUT: 709
sin(x): 2761

So the LUT is 3-4 times faster even with your quarter-LUT overhead.
May 24 2019
next sibling parent Ola Fosheim Grøstad <ola.fosheim.grostad gmail.com> writes:
On Friday, 24 May 2019 at 08:33:34 UTC, Ola Fosheim Grøstad wrote:
 So the LUT is 3-4 times faster even with your quarter-LUT 
 overhead.
4-5 times faster actually, since I made the LUT size known at compile time.
May 24 2019
prev sibling next sibling parent Ola Fosheim Grøstad <ola.fosheim.grostad gmail.com> writes:
On Friday, 24 May 2019 at 08:33:34 UTC, Ola Fosheim Grøstad wrote:
 https://gist.github.com/run-dlang/9f29a83b7b6754da98993063029ef93c
I made an error here:

return s*((1 - f)*QuarterSinTab[ai&511] + f*QuarterSinTab[(ai+1)&511]);

Should of course be:

return s*((1 - f)*QuarterSinTab[ai&511] + f*QuarterSinTab[(ai&511)+1]);

However that does not impact the performance.
May 24 2019
prev sibling parent reply Ola Fosheim Grøstad <ola.fosheim.grostad gmail.com> writes:
On Friday, 24 May 2019 at 08:33:34 UTC, Ola Fosheim Grøstad wrote:
 On Thursday, 23 May 2019 at 21:47:45 UTC, Alex wrote:
 Either way, sin it's still twice as fast. Also, in the code 
 the sinTab version is missing the writeln so it would have 
 been faster.. so it is not being optimized out.
Well, when I run this modified version: https://gist.github.com/run-dlang/9f29a83b7b6754da98993063029ef93c on https://run.dlang.io/ then I get: LUT: 709 sin(x): 2761 So the LUT is 3-4 times faster even with your quarter-LUT overhead.
FWIW, as far as I can tell I managed to get the lookup version down to 104 by using bit manipulation tricks like these:

auto fastQuarterLookup(double x){
     const ulong mantissa = cast(ulong)( (x - floor(x)) * (cast(double)(1UL<<63)*2.0) );
     const double sign = cast(double)(-cast(uint)((mantissa>>63)&1));
     … etc

So it seems like a quarter-wave LUT is 27 times faster than sin… You just have to make sure that the generated instructions fills the entire CPU pipeline.
May 24 2019
next sibling parent reply Alex <AJ gmail.com> writes:
On Friday, 24 May 2019 at 11:45:46 UTC, Ola Fosheim Grøstad wrote:
 On Friday, 24 May 2019 at 08:33:34 UTC, Ola Fosheim Grøstad 
 wrote:
 On Thursday, 23 May 2019 at 21:47:45 UTC, Alex wrote:
 Either way, sin it's still twice as fast. Also, in the code 
 the sinTab version is missing the writeln so it would have 
 been faster.. so it is not being optimized out.
Well, when I run this modified version: https://gist.github.com/run-dlang/9f29a83b7b6754da98993063029ef93c on https://run.dlang.io/ then I get: LUT: 709 sin(x): 2761 So the LUT is 3-4 times faster even with your quarter-LUT overhead.
FWIW, as far as I can tell I managed to get the lookup version down to 104 by using bit manipulation tricks like these: auto fastQuarterLookup(double x){ const ulong mantissa = cast(ulong)( (x - floor(x)) * (cast(double)(1UL<<63)*2.0) ); const double sign = cast(double)(-cast(uint)((mantissa>>63)&1)); … etc So it seems like a quarter-wave LUT is 27 times faster than sin… You just have to make sure that the generated instructions fills the entire CPU pipeline.
Well, the QuarterWave was suppose to generate just a quarter since that is all that is required for these functions due to symmetry and periodicity. I started with a half to get that working then figure out the sign flipping. Essentially one just has to tabulate a quarter of sin, that is, from 0 to 90o and then get the sin right. This allows one to have 4 times the resolution or 1/4 the size at the same cost. Or, to put it another say, sin as 4 fold redundancy. I'll check out your code, thanks for looking in to it.
May 24 2019
parent Ola Fosheim Grøstad <ola.fosheim.grostad gmail.com> writes:
On Friday, 24 May 2019 at 12:01:55 UTC, Alex wrote:
 Well, the QuarterWave was suppose to generate just a quarter 
 since that is all that is required for these functions due to 
 symmetry and periodicity. I started with a half to get that 
 working then figure out the sign flipping.
Sure, it is a tradeoff. You pollute the cache less this way, but you have to figure out the sign and the lookup-direction. The trick is then to turn the phase into an unsigned integer, then you get:

1. the highest bit will tell you that you need to use the inverse sign for the result.

2. the next highest bit will tell you that you need to look up in the reverse direction

What is key to performance here is that x86 can do many simple integer/bit operations in parallel, but only a few floating point operations. Also avoid all conditionals. Use bitmasking instead, something along the line of:

const ulong phase = mantissa^((1UL<<63)-((mantissa>>62)&1));
const uint quarterphase = (phase>>53)&511;

(Haven't checked the correctness of this, but this shows the general principle.)

Ola.
May 24 2019
prev sibling next sibling parent reply Alex <AJ gmail.com> writes:
On Friday, 24 May 2019 at 11:45:46 UTC, Ola Fosheim Grøstad wrote:
 On Friday, 24 May 2019 at 08:33:34 UTC, Ola Fosheim Grøstad 
 wrote:
 On Thursday, 23 May 2019 at 21:47:45 UTC, Alex wrote:
 Either way, sin it's still twice as fast. Also, in the code 
 the sinTab version is missing the writeln so it would have 
 been faster.. so it is not being optimized out.
Well, when I run this modified version: https://gist.github.com/run-dlang/9f29a83b7b6754da98993063029ef93c on https://run.dlang.io/ then I get: LUT: 709 sin(x): 2761 So the LUT is 3-4 times faster even with your quarter-LUT overhead.
FWIW, as far as I can tell I managed to get the lookup version down to 104 by using bit manipulation tricks like these: auto fastQuarterLookup(double x){ const ulong mantissa = cast(ulong)( (x - floor(x)) * (cast(double)(1UL<<63)*2.0) ); const double sign = cast(double)(-cast(uint)((mantissa>>63)&1)); … etc So it seems like a quarter-wave LUT is 27 times faster than sin…
If so then that is great and what I'd expected to achieve originally. I guess this is using LDC though? I wasn't able to compile with LDC since after updating I'm getting linker errors that I have to go figure out.
 You just have to make sure that the generated instructions 
 fills the entire CPU pipeline.
What exactly does this mean? I realize the pipeline in cpus is how the cpu decodes and optimizes the instructions, but when you say "You have to make sure", that presupposes there is a method or algorithm to know. Are you saying that I did not have enough instructions that the pipeline could take advantage of?

In any case, I'll check your code out and try to figure out the details and see exactly what is going on. If it truly is 27x faster then that is very relevant and knowing why is important. Of course, a lot of that might simply be due to LDC and I wasn't able to determine this. Can you do some stats for dmd and ldc?

You seem to be interested in this, are you up for a challenge? The idea is to use tables to optimize these functions. Half sin was done above but quarter sine can be used (there are 4 quadrants but only one has to be tabularized because all the others differ by sign and reversion (1 - x); it's a matter of figuring out the sign). Of course it requires extra computation, so it would be interesting to see the difference in performance for the extra logic.

Then there is exp. exp(x) can be written as exp(floor(x) + {x}) = exp(floor(x))*exp({x}), and so one can optimize this by tabulating exp(x) for 0 <= x < 1, which handles the fractional part, and then tabulating it for a wide range of integers (probably in 2's). E.g., exp(3.5) = exp(3)*exp(.5), both coming from a lookup table. Or one could do exp(3) = exp(1 + 1 + 1) = exp(1)*exp(1)*exp(1) (this requires iteration if we do not tabulate exp(3)). Hence one would limit the iteration count by tabulating things like exp(10^k) and exp(k) for -10 < k < 10. The idea is basically that one can get really dense LUTs for a small range that are then used to build the function for arbitrary inputs.

With linear interpolation one can get very accurate (for all practical purposes) LUT methods that, if your code is right, are at least an order of magnitude faster. The memory requirements will be quite small with linear interpolation (and ideally quadratic or cubic if the costs are not too high). That was what I was starting to work on before I got thrown off by it being much slower.

It seems you already have the half-sin done. One could do things like sin, cos (obviously easy), exp, exp()^2, erf(x), sinh, cosh, etc. Things like sec could also be done as it would save a division, since it seems they take about 30 cycles. But it would depend on the memory used.

[I can't mess with this now because I've got to work on other things at the moment]

Thanks.
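A rough sketch of that exp idea (my own illustration, untested for accuracy, and assuming the integer part of x stays within the tabulated range):

import std.math : exp, floor;
import std.stdio : writefln;

enum int minExp = -64;
enum int maxExp = 64;
enum size_t fracRes = 1024;

__gshared double[maxExp - minExp + 1] intTab;  // exp(k) for k in [minExp, maxExp]
__gshared double[fracRes + 1] fracTab;         // exp(f) for f in [0, 1]

void buildExpTabs()
{
	foreach (k; minExp .. maxExp + 1)
		intTab[k - minExp] = exp(cast(double)k);
	foreach (i; 0 .. fracRes + 1)
		fracTab[i] = exp(cast(double)i / fracRes);
}

double expTab(double x)
{
	immutable fl = floor(x);
	immutable k  = cast(int)fl;         // integer part, assumed in [minExp, maxExp]
	immutable f  = (x - fl) * fracRes;  // fractional part scaled to the table
	immutable i  = cast(size_t)f;
	immutable t  = f - i;
	// exp(x) = exp(floor(x)) * exp(frac(x)), second factor linearly interpolated
	immutable fracVal = (1 - t) * fracTab[i] + t * fracTab[i + 1];
	return intTab[k - minExp] * fracVal;
}

void main()
{
	buildExpTabs();
	writefln("expTab(3.5) = %.6f  exp(3.5) = %.6f", expTab(3.5), exp(3.5));
}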
May 24 2019
next sibling parent reply Ola Fosheim Grøstad <ola.fosheim.grostad gmail.com> writes:
On Friday, 24 May 2019 at 12:24:02 UTC, Alex wrote:
 So it seems like a quarter-wave LUT is 27 times faster than 
 sin…
If so then that is great and what I'd expected to achieve originally. I guess this is using LDC though? I wasn't able to compile with LDC since after updating I'm getting linker errors that I have to go figure out.
Yes, the gist linked above is just your code with minor changes, that was 4-5 times faster. To get to 27 times faster you need to use the integer bit-manipulation scheme that I suggest above. Just beware that I haven't checked the code, so it might be off by ±1 and such. Anyway, it is more fun for you to code up your own version than to try to figure out mine. Just follow the principles and you should get close to that performance, I think. (I'll refine the code later, but don't really have time now)
 You just have to make sure that the generated instructions 
 fills the entire CPU pipeline.
What exactly does this mean? I realize the pipeline in cpu's is how the cpu decodes and optimizes the instructions but when you say "You have to make sure" that pre-supposes there is a method or algorithm to know.
Yes, you have to look up information about the CPU in your computer. Each core has a set of "lanes" that are computed simultanously. Some instructions can go into many lanes, but not all. Then there might be bubbles in the pipeline (the lane) that can be filled up with integer/bit manipulation instructions. It is tedious to look that stuff up. So, last resort. Just try to mix simple integer with simple double computations (avoid division).
 Are you saying that I did not have enough instructions that the 
 pipeline could take advantage of?
Yes, you most likely got bubbles. Empty space where the core has nothing to send down a lane, because it is waiting for some computation to finish so that it can figure out what to do next.

Basic optimization:

Step 1: reduce dependencies between computations

Step 2: make sure you generate a mix of simple integer/double instructions that can fill up all the computation lanes at the same time

Step 3: make sure loops only contain a few instructions, the CPU can unroll loops in hardware if they are short. (not valid here though)
 Of course, a lot of that might simply be due to LDC and I 
 wasn't able to determine this.
I think I got better performance because I filled more lanes in the pipeline.
 Half sin was done above but quarter sine can be used(there are 
 4 quadrants but only one has to be tabularized because all the 
 others differ by sign and reversion(1 - x), it's a matter of 
 figuring out the sign).
Yes, as I mentioned, the first bit of the phase is the sign and the second bit of the phase is the reversion of the indexing.
 Of course it requires extra computation so it would be 
 interesting to see the difference in performance for the extra 
 logic.
It adds perhaps 2-5 cycles or so, I'm guessing.
 exp(x) can be written as exp(floor(x) + {x}) = 
 exp(floor(x))*exp({x})
[...]
 With linear interpolation one can get very accurate(for all 
 practical purposes) LUT table methods that, if your code is 
 right, is at least an order of magnitude faster. The memory 
 requirements will be quite small with linear interpolation
I think you need to do something with the x before you look up, so that you have some kind of fast nonlinear mapping to the indexes. But for exp() you might prefer an approximation instead, perhaps a polynomial Taylor series. Searching the web should give some ideas.
 It seems you already have the half-sin done.
I did the quarter sin though, not the half-sin (but that is almost the same, just drop the reversion of the indexing). (Let's talk about this later, since we both have other things on our plate. Fun topic! :-)
May 24 2019
parent Ola Fosheim Grøstad <ola.fosheim.grostad gmail.com> writes:
On Friday, 24 May 2019 at 12:56:55 UTC, Ola Fosheim Grøstad wrote:
 suggest above. Just beware that I haven't checked the code, so 
 it might be off by ±1 and such.
So before using such code for anything important, make sure that it is tested for the edge cases, like denormal numbers (values close to zero), roundoff errors where computations "accidentally" cross over 1.0, and stuff like that.
May 24 2019
prev sibling next sibling parent reply Ola Fosheim Grøstad <ola.fosheim.grostad gmail.com> writes:
On Friday, 24 May 2019 at 12:24:02 UTC, Alex wrote:
 If it truly is a 27x faster then then that is very relevant and 
 knowing why is important.

 Of course, a lot of that might simply be due to LDC and I 
 wasn't able to determine this.
Just one more thing you really ought to consider: It isn't obvious that a LUT using double will be more precise than computing sin using single precision float.

So when I use single precision float with ldc and "-O3 -ffast-math" then I get roughly 300ms. So in that case the LUT is only 3 times faster. Perhaps not worth it then. You might as well just use float instead of double. The downside is that -ffast-math makes all your computations more inaccurate.

It is also possible that recent processors can do multiple sin/cos as simd. Too many options… I know.
May 24 2019
parent reply Alex <AJ gmail.com> writes:
On Friday, 24 May 2019 at 13:57:30 UTC, Ola Fosheim Grøstad wrote:
 On Friday, 24 May 2019 at 12:24:02 UTC, Alex wrote:
 If it truly is a 27x faster then then that is very relevant 
 and knowing why is important.

 Of course, a lot of that might simply be due to LDC and I 
 wasn't able to determine this.
Just one more thing you really ought to consider: It isn't obvious that a LUT using double will be more precise than computing sin using single precision float. So when I use single precision float with ldc and "-O3 -ffast-math" then I get roughly 300ms. So in that case the LUT is only 3 times faster. Perhaps not worth it then. You might as well just use float instead of double. The downside is that -ffast-math makes all your computations more inaccurate. It is also possible that recent processors can do multiple sin/cos as simd. Too many options… I know.
The thing is, the LUT can have as much precision as one wants. One could even spend several days calculating it and then load it from disk.

I'm not sure what the real precision of the built-in functions is, but it shouldn't be hard to max out a double using standard methods (even if slow, which is irrelevant after the LUT has been created). Right now I'm just using the built-ins... Maybe later I'll get back around to all this and make some progress.
May 24 2019
next sibling parent reply Ola Fosheim Grøstad <ola.fosheim.grostad gmail.com> writes:
On Friday, 24 May 2019 at 17:04:33 UTC, Alex wrote:
 I'm not sure what the real precision of the build in functions 
 are but it shouldn't be hard to max out a double using standard 
 methods(even if slow, but irrelevant after the LUT has been 
 created).
LUTs are primarily useful when you use sin(x) as a signal or when a crude approximation is good enough. One advantage of a LUT is that you can store a more complex computation than the basic function. Like a filtered square wave.
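As a small illustration of storing something more complex than a plain sine (my own sketch, not code from the thread): a band-limited square wave built once by summing odd harmonics into the table, which can then be read back like any other wavetable:

import std.math : sin, PI;
import std.stdio : writefln;

enum size_t N = 1024;
__gshared double[N] squareTab;

void buildSquareTab(size_t harmonics = 15)
{
    // additive synthesis of a band-limited square wave: odd harmonics, 1/k amplitude
    foreach (i; 0 .. N)
    {
        immutable phase = 2.0 * PI * i / N;
        double v = 0;
        for (size_t k = 1; k <= harmonics; k += 2)
            v += sin(k * phase) / k;
        squareTab[i] = v * (4.0 / PI);
    }
}

void main()
{
    buildSquareTab();
    writefln("%(%.3f %)", squareTab[0 .. 8]);
}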
May 24 2019
parent reply NaN <divide by.zero> writes:
On Friday, 24 May 2019 at 17:40:40 UTC, Ola Fosheim Grøstad wrote:
 On Friday, 24 May 2019 at 17:04:33 UTC, Alex wrote:
 I'm not sure what the real precision of the build in functions 
 are but it shouldn't be hard to max out a double using 
 standard methods(even if slow, but irrelevant after the LUT 
 has been created).
LUTs are primarily useful when you use sin(x) as a signal or when a crude approximation is good enough. One advantage of a LUT is that you can store a more complex computation than the basic function. Like a filtered square wave.
It's a pretty common technique in audio synthesis. What I've done in the past is store a table of polynomial segments that were optimised with curve fitting. It's a bit of extra work to calculate the waveform, but it actually works out faster than having huge LUTs, since you're typically only producing maybe 100 samples in each interrupt callback, so it gets pretty likely that your LUT is pushed into slower cache memory in between calls to generate the audio.
May 25 2019
next sibling parent reply Ola Fosheim Grøstad <ola.fosheim.grostad gmail.com> writes:
On Saturday, 25 May 2019 at 09:04:31 UTC, NaN wrote:
 Its pretty common technique in audio synthesis.
Indeed. CSound does this.
 What i've done in the past is store a table of polynomial 
 segments that were optimised with curve fitting.
That's an interesting solution, how do you avoid higher order discontinuities between segments? Crossfading? Or maybe it wasn't audible.
May 25 2019
parent reply NaN <divide by.zero> writes:
On Saturday, 25 May 2019 at 09:52:22 UTC, Ola Fosheim Grøstad 
wrote:
 On Saturday, 25 May 2019 at 09:04:31 UTC, NaN wrote:
 Its pretty common technique in audio synthesis.
Indeed. CSound does this.
 What i've done in the past is store a table of polynomial 
 segments that were optimised with curve fitting.
That's an interesting solution, how do you avoid higher order discontinuities between segments? Crossfading? Or maybe it wasn't audible.
I used an evolutionary optimisation algorithm on the table all at once. So you do a weighted sum of max deviation, and 1st and 2nd order discontinuity at the joins. And minimise that across the table as a whole. It seemed you could massively overweight the discontinuities without really affecting the curve fitting that much. So perfect joins only cost a little extra deviation in the fit of the polynomial.
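A sketch of how such a weighted cost might look (my reading of the description above, not the actual code): maximum deviation from one sine cycle over each segment, plus heavily overweighted value/slope/curvature mismatches at the joins, which is what the evolutionary optimiser would then minimise:

import std.math : sin, PI, abs, fmax;
import std.stdio : writeln;

struct Segment { double a, b, c, d; }   // cubic: a + b*t + c*t^2 + d*t^3

double eval (Segment s, double t) { return s.a + t*(s.b + t*(s.c + t*s.d)); }
double eval1(Segment s, double t) { return s.b + t*(2*s.c + t*3*s.d); }  // d/dt
double eval2(Segment s, double t) { return 2*s.c + t*6*s.d; }            // d2/dt2

double cost(Segment[] segs, double joinWeight = 1e4)
{
    immutable n = segs.length;
    double worst = 0;

    // max deviation from the target (one sine cycle) over each segment
    foreach (i, s; segs)
        foreach (k; 0 .. 64)
        {
            immutable t = k / 63.0;
            immutable x = (i + t) / n;
            worst = fmax(worst, abs(eval(s, t) - sin(2 * PI * x)));
        }

    // 0th/1st/2nd-order discontinuities at the joins, heavily overweighted
    double joins = 0;
    foreach (i; 0 .. n - 1)
    {
        joins += abs(eval (segs[i], 1) - eval (segs[i + 1], 0));
        joins += abs(eval1(segs[i], 1) - eval1(segs[i + 1], 0));
        joins += abs(eval2(segs[i], 1) - eval2(segs[i + 1], 0));
    }
    return worst + joinWeight * joins;
}

void main()
{
    // two dummy segments just to show the call; an optimiser such as
    // differential evolution would evolve these coefficients
    auto segs = [Segment(0, 1, 0, 0), Segment(0.5, 1, 0, 0)];
    writeln(cost(segs));
}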
May 25 2019
parent reply Ola Fosheim Grøstad <ola.fosheim.grostad gmail.com> writes:
On Saturday, 25 May 2019 at 12:51:20 UTC, NaN wrote:
 I used an evolutionary optimisation algorithm on the table all 
 at once. So you do a weighted sum of max deviation, and 1st and 
 2nd order discontinuity at the joins. And minimise that across 
 the table as a whole. It seemed you could massively overweight 
 the discontinuities without really affecting the curve fitting 
 that much. So perfect joins only cost a little extra deviation 
 in the fit of the polynomial.
Wow, it is pretty cool that you managed to minimize 2nd order discontinuity like that! Is there a paper describing the optimisation algorithm you used?
May 25 2019
parent reply NaN <divide by.zero> writes:
On Saturday, 25 May 2019 at 14:58:25 UTC, Ola Fosheim Grøstad 
wrote:
 On Saturday, 25 May 2019 at 12:51:20 UTC, NaN wrote:
 I used an evolutionary optimisation algorithm on the table all 
 at once. So you do a weighted sum of max deviation, and 1st 
 and 2nd order discontinuity at the joins. And minimise that 
 across the table as a whole. It seemed you could massively 
 overweight the discontinuities without really affecting the 
 curve fitting that much. So perfect joins only cost a little 
 extra deviation in the fit of the polynomial.
Wow, it is pretty cool that you managed to minimize 2nd order discontinuity like that! Is there a paper describing the optimisation algorithm you used?
It's differential evolution, you can pretty much optimise anything you can quantify.

http://www1.icsi.berkeley.edu/~storn/code.html
May 25 2019
parent Ola Fosheim Grøstad <ola.fosheim.grostad gmail.com> writes:
On Saturday, 25 May 2019 at 20:07:19 UTC, NaN wrote:
 Its differential evolution, you can pretty much optimise 
 anything you can quantify.

 http://www1.icsi.berkeley.edu/~storn/code.html
Thanks, they even have Python example code. Nice.
May 25 2019
prev sibling parent Ola Fosheim Grøstad <ola.fosheim.grostad gmail.com> writes:
On Saturday, 25 May 2019 at 09:04:31 UTC, NaN wrote:
 It's bit extra work to calculate the the waveform but actual 
 works out faster than having huge LUTs since you're typically 
 only producing maybe 100 samples in each interrupt callback
Another hybrid option when filling a buffer might be to fill two buffers with an approximation that is crossfaded between the end-points. A bit tricky, but say you have a taylor polynomial (or even a recurrence relation) that is "correct" near the beginning of the buffer, and another one that is correct near the end of the buffer. Then you could fill two buffers from each end and cross fade between the buffers. You might get some phase-cancellation errors and phasing-like distortion though.
May 25 2019
prev sibling parent NaN <divide by.zero> writes:
On Friday, 24 May 2019 at 17:04:33 UTC, Alex wrote:
 On Friday, 24 May 2019 at 13:57:30 UTC, Ola Fosheim Grøstad 
 wrote:
 On Friday, 24 May 2019 at 12:24:02 UTC, Alex wrote:
 If it truly is a 27x faster then then that is very relevant 
 and knowing why is important.

 Of course, a lot of that might simply be due to LDC and I 
 wasn't able to determine this.
Just one more thing you really ought to consider: It isn't obvious that a LUT using double will be more precise than computing sin using single precision float. So when I use single precision float with ldc and "-O3 -ffast-math" then I get roughly 300ms. So in that case the LUT is only 3 times faster. Perhaps not worth it then. You might as well just use float instead of double. The downside is that -ffast-math makes all your computations more inaccurate. It is also possible that recent processors can do multiple sin/cos as simd. Too many options… I know.
The thing is, the LUT can have as much precision as one wants. One could even spend several days calculating it then loading it from disk.
With linear interpolation of what is effectively a discrete time signal, you get an extra 12 dB of signal to noise ratio each time you 2x oversample the input. So basically if you start with a LUT of 4 samples, that should give you about 8 dB at baseline, and each time you double the length you get an extra 12 dB. Or in simpler terms, doubling the table length gets you 2 bits less error. So an 8192-entry table should give you approximately 24 bits clear of errors. But be aware that error is relative to the maximum magnitude of the signal in the table.
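A quick empirical way to sanity-check that rule of thumb (my own sketch, not from the thread): measure the worst-case linear-interpolation error of a full-cycle sine table and express it in bits:

import std.math : sin, PI, abs, log2;
import std.stdio : writefln;

double bitsClearOfError(size_t tableSize)
{
    // table covering one full sine cycle, with an extra endpoint for interpolation
    auto tab = new double[](tableSize + 1);
    foreach (i; 0 .. tableSize + 1)
        tab[i] = sin(2.0 * PI * i / tableSize);

    double maxErr = 0;
    enum steps = 1_000_000;
    foreach (k; 0 .. steps)
    {
        immutable x = cast(double)k / steps;   // phase in [0, 1)
        immutable a = x * tableSize;
        immutable i = cast(size_t)a;
        immutable f = a - i;
        immutable approx = (1 - f) * tab[i] + f * tab[i + 1];
        immutable err = abs(approx - sin(2.0 * PI * x));
        if (err > maxErr) maxErr = err;
    }
    return -log2(maxErr);
}

void main()
{
    foreach (size; [256, 512, 1024, 2048, 8192])
        writefln("%5d samples: ~%.1f bits clear of error", size, bitsClearOfError(size));
}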
May 25 2019
prev sibling parent reply Ola Fosheim Grøstad <ola.fosheim.grostad gmail.com> writes:
This appears to get roughly the same results as sin(x):


__gshared double[512+1] QuarterSinTab;

void init(){
     const auto res = QuarterSinTab.length-1;
     for(int i = 0; i < res; i++)
         QuarterSinTab[i] = sin(PI*(0.5*i/cast(double)res));	
	QuarterSinTab[$-1] = sin(PI*0.5);
}

auto fastQuarterLookup(double x){
     const ulong mantissa = cast(ulong)( (x - floor(x)) * (cast(double)(1UL<<54)) );
     const double sign = cast(double)(1 - cast(int)((mantissa>>52)&2));
     const ulong phase = (mantissa^((1UL<<53)-((mantissa>>52)&1)))&((1UL<<53) -1);
     const uint quarterphase = (phase>>43)&511;
     const double frac = cast(double)(phase&((1UL<<43)-1))*cast(double)(1.0/(1UL<<43));
     return sign*((1.0-frac)*QuarterSinTab[quarterphase] + frac*QuarterSinTab[quarterphase+1]);
}


Ola.
May 24 2019
parent Ola Fosheim Grøstad <ola.fosheim.grostad gmail.com> writes:
With linear interpolation you get roughly the same result with 
floats, a little more efficient too (half the memory and a bit 
faster):

__gshared float[512+1] QuarterSinTab;

void init(){
     const auto res = QuarterSinTab.length-1;
     for(int i = 0; i < res; i++)
         QuarterSinTab[i] = sin(PI*(0.5*i/cast(double)res));	
	QuarterSinTab[$-1] = sin(PI*0.5);
}

auto fastQuarterLookup(float x){
     const uint mantissa = cast(uint)( (x - floor(x)) * (cast(float)(1U<<24)) );
     const float sign = cast(float)(1 - cast(int)((mantissa>>22)&2));
     const uint phase = (mantissa^((1U<<23)-((mantissa>>22)&1)))&((1U<<23) -1);
     const uint quarterphase = (phase>>13)&511;
     const float frac = cast(float)(phase&((1U<<13)-1))*cast(float)(1.0f/(1U<<13));
     return sign*((1.0f-frac)*QuarterSinTab[quarterphase] + frac*QuarterSinTab[quarterphase+1]);
}
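A small harness to sanity-check it (my own code, assuming it lives in the same module as the snippet above; note that as written fastQuarterLookup takes its argument in cycles, i.e. it approximates sin(2*PI*x)):

void main()
{
    import std.math : sin, PI, abs;
    import std.stdio : writefln;

    init();
    double maxErr = 0;
    foreach (i; 0 .. 10_000)
    {
        immutable x = i / 10_000.0;   // phase in cycles, [0, 1)
        immutable err = abs(fastQuarterLookup(cast(float)x) - sin(2 * PI * x));
        if (err > maxErr) maxErr = err;
    }
    writefln("max abs error over one cycle: %g", maxErr);
}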
May 24 2019
prev sibling parent Ola Fosheim Grøstad <ola.fosheim.grostad gmail.com> writes:
On Friday, 24 May 2019 at 11:45:46 UTC, Ola Fosheim Grøstad wrote:
     const double sign = 
 cast(double)(-cast(uint)((mantissa>>63)&1));
Yep, this was wrong (0 or -1). Should be something like (1 or -1):

const double sign = cast(double)(1-cast(uint)((mantissa>>62)&2));

You'll have to code it up more carefully than I did, just following the same principles. (These ±1 errors do not affect the performance.)

Also, for comparison, just running the 2 lookups in the loop comes in at 32ms. So 3 times that sounds reasonable for extracting the phase, determining the sign, reversing the phase and doing the linear interpolation.
May 24 2019
prev sibling next sibling parent Ola Fosheim Grøstad <ola.fosheim.grostad gmail.com> writes:
On Wednesday, 22 May 2019 at 00:22:09 UTC, JS wrote:
 	for(int i = 0; i < res; i++)
 		QuarterSinTab[i] = sin(PI*(i/cast(double)res));	
Btw, I think this creates a half sine, not a quarter, so you want (?):

QuarterSinTab[i] = sin(PI*(0.5*i/cast(double)res));
 	QuarterSinTab[$-1] = QuarterSinTab[0];
This creates a discontinuity if you create a quarter sine; in that case you probably wanted:

QuarterSinTab[$-1] = sin(PI*0.5)

Otherwise you will never get 1 or -1. But none of these affect performance.
May 24 2019
prev sibling parent Danny Arends <Danny.Arends gmail.com> writes:
On Wednesday, 22 May 2019 at 00:22:09 UTC, JS wrote:
 I am trying to create some fast sin, sinc, and exponential 
 routines to speed up some code by using tables... but it seems 
 it's slower than the function itself?!?

 [...]
I'll just leave this here: https://www.dannyarends.nl/Using%20CTFE%20in%20D%20to%20speed%20up%20Sine%20and%20Cosine?
May 25 2019