digitalmars.D.learn - Align a variable on the stack.
- TheFlyingFiddle (29/29) Nov 03 2015 Is there a built in way to do this in dmd?
- Nicholas Wilson (19/48) Nov 03 2015 Note that there are two different alignments:
- TheFlyingFiddle (43/61) Nov 04 2015 Thanks for the reply. I did some more checking around and I found
- Marc Schütz (7/9) Nov 05 2015 Can you publish two compilable and runnable versions of the code
- TheFlyingFiddle (54/62) Nov 05 2015 I created a simple example here:
- TheFlyingFiddle (4/6) Nov 05 2015 I forgot to mention this but I am using DMD 2.069.0-rc2 for x86
- TheFlyingFiddle (41/48) Nov 05 2015 I reduced it further:
- rsw0x (5/10) Nov 05 2015 these run at the exact same speed for me and produce identical
- TheFlyingFiddle (4/16) Nov 05 2015 Are you running on windows?
- rsw0x (2/19) Nov 05 2015 linux x86-64
- steven kladitis (9/26) Nov 06 2015 I am still disappointed that DMD is not native 64 bit in windows
- BBaz (7/11) Nov 06 2015 It's because they can't make a nice distribution. DMD win32 is a
- Marc Schütz (10/10) Nov 06 2015 Ok, benchA and benchB have the same assembler code generated.
- Marc Schütz (3/13) Nov 06 2015 Forgot to add that this is on Linux x86_64, so that probably
- TheFlyingFiddle (10/24) Nov 06 2015 I tested swapping around the functions on windows x86 and I still
- BBasile (8/17) Nov 05 2015 wow that's quite strange. FP members should be initialized
- arGus (18/18) Nov 06 2015 I did some testing on Linux and Windows.
- rsw0x (2/20) Nov 06 2015 File a bug report, this probably needs Walter to look at it.
Is there a built-in way to do this in dmd? Basically I want to do this:

auto decode(T)(...)
{
    while(...)
    {
        T t = T.init; // I want this aligned to 64 bytes.
    }
}

Currently I am using:

align(64) struct Aligner(T)
{
    T value;
}

auto decode(T)(...)
{
    Aligner!T t = void;
    while(...)
    {
        t.value = T.init;
    }
}

But is there a less hacky way? From the documentation of align, it seems I cannot use it for this kind of thing. I also don't want to put align(64) on my T struct type itself, since for my use case I am decoding arrays of T. The reason I want to do this in the first place is that if the variable is aligned, I get about a 2.5x speedup (I don't really know why; I found it by accident).
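For comparison, one manual alternative is to over-allocate a raw buffer and align a pointer into it by hand. This is only a sketch; the alignedSlot helper is illustrative (not a library function) and assumes the alignment is a power of two:

// Round the buffer's start address up to the next multiple of `alignment`
// (a power of two) and reinterpret that address as a T*.
T* alignedSlot(T, size_t alignment)(void[] buffer)
{
    auto addr    = cast(size_t) buffer.ptr;
    auto aligned = (addr + alignment - 1) & ~(alignment - 1);
    assert(aligned + T.sizeof <= addr + buffer.length);
    return cast(T*) aligned;
}

void decode(T)()
{
    // 64 bytes of slack guarantees a 64-byte boundary inside the buffer.
    ubyte[T.sizeof + 64] raw = void;
    T* t = alignedSlot!(T, 64)(raw[]);
    *t = T.init;
    // ... decode into *t ...
}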
Nov 03 2015
On Tuesday, 3 November 2015 at 23:29:45 UTC, TheFlyingFiddle wrote:
[...]

Note that there are two different alignments: one controls the padding between instances (e.g. array elements), the other controls the padding between the members of a struct:

align(64) // between instances (arrays)
struct foo
{
    align(16) short baz;  // between members
    align(1)  float quux;
}

Your 2.5x speedup is due to aligned vs. unaligned loads and stores, which for SIMD-type stuff has a really big effect; basically, misaligned accesses are really slow. IIRC there was a blog post (paper?) about someone on a microcontroller spending a vast amount of time in ONE misaligned integer assignment, which caused traps and got the kernel involved. Not quite as bad on x86, but still worth avoiding. As for a less hacky solution, I'm not sure there is one.
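The two kinds of alignment can be probed at compile time; a small sketch (the printed values can differ by compiler and platform):

align(64) struct foo
{
    align(16) short baz;
    align(1)  float quux;
}

pragma(msg, foo.alignof);        // alignment of the type itself (array stride)
pragma(msg, foo.baz.offsetof);   // 0
pragma(msg, foo.quux.offsetof);  // 2 with align(1); 4 without it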
Nov 03 2015
On Wednesday, 4 November 2015 at 01:14:31 UTC, Nicholas Wilson wrote:
[...]

Thanks for the reply. I did some more checking around and I found that it was not really an alignment problem; it was caused by using the default init value of my type. My starting type:

align(64) struct Phys
{
    float x, y, z, w;
    // More stuff.
} // Was 64 bytes in size at the time.

The above worked fine; it was fast and all. But after a while I wanted the data in a different format, so I started decoding positions and other variables into separate arrays. Something like this:

align(16) struct Pos
{
    float x, y, z, w;
}

Counter to my limited knowledge of how CPUs work, this was much slower. Doing the same thing lots of times, touching less memory with fewer branches, should in theory at least be faster, right? So after I ruled out bottlenecks in the parser, I assumed there was an alignment problem, and did my Aligner hack. This made the code run faster, so I assumed alignment was the cause... Naive! (There was a typo in the code I submitted to begin with: I used a = Align!(T).init and not a.value = T.init.)

The performance problem was actually caused by the line t = T.init, no matter whether t was aligned or not. I solved it by changing the struct to look like this:

align(16) struct Pos
{
    float x = float.nan;
    float y = float.nan;
    float z = float.nan;
    float w = float.nan;
}

Basically, T.init gets explicit values. But this should be the same Pos.init as the default Pos.init, so I really fail to understand how this could fix the problem. I guessed the compiler generates slightly different code if I do it this way, and that this slightly different code avoids some bottleneck in the CPU. But when I took a look at the assembly of the function, I could not find any difference in the generated code...

I don't really know where to go from here to figure out the underlying cause. Does anyone have any suggestions?
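The "both .init values should be identical" assumption can be checked directly; a minimal sketch (the struct names PosA/PosB are illustrative) comparing the raw bytes of the two .init blobs:

import core.stdc.string : memcmp;

align(16) struct PosA { float x, y, z, w; }
align(16) struct PosB
{
    float x = float.nan;
    float y = float.nan;
    float z = float.nan;
    float w = float.nan;
}

void main()
{
    auto a = PosA.init;
    auto b = PosB.init;
    // If this fails, the two .init values actually differ at the bit level.
    assert(memcmp(&a, &b, PosA.sizeof) == 0);
}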
Nov 04 2015
On Thursday, 5 November 2015 at 03:52:47 UTC, TheFlyingFiddle wrote:
> I don't really know where to go from here to figure out the underlying cause. Does anyone have any suggestions?

Can you publish two compilable and runnable versions of the code that exhibit the difference? Then we can have a look at the generated assembly. If there's really different code being generated depending on whether the .init value is explicitly set to float.nan or not, then this suggests a bug in DMD.
Nov 05 2015
On Thursday, 5 November 2015 at 11:14:50 UTC, Marc Schütz wrote:
[...]

I created a simple example here:

import std.math; // needed for the ^^ operator on floats

struct A
{
    float x, y, z, w;
}

struct B
{
    float x = float.nan;
    float y = float.nan;
    float z = float.nan;
    float w = float.nan;
}

void initVal(T)(ref T t, ref float k)
{
    pragma(inline, false);
    t.x = k;
    t.y = k * 2;
    t.z = k / 2;
    t.w = k^^3;
}

__gshared A[] a;
void benchA()
{
    A val;
    foreach(float f; 0 .. 1000_000)
    {
        val = A.init;
        initVal(val, f);
        a ~= val;
    }
}

__gshared B[] b;
void benchB()
{
    B val;
    foreach(float f; 0 .. 1000_000)
    {
        val = B.init;
        initVal(val, f);
        b ~= val;
    }
}

int main(string[] argv)
{
    import std.datetime;
    import std.stdio;

    auto res = benchmark!(benchA, benchB)(1);
    writeln("Default: ", res[0]);
    writeln("Explicit: ", res[1]);
    return 0;
}

Output:

Default: TickDuration(1637842)
Explicit: TickDuration(167088)

~10x slowdown...
Nov 05 2015
On Thursday, 5 November 2015 at 21:22:18 UTC, TheFlyingFiddle wrote:
[...]

I forgot to mention this, but I am using DMD 2.069.0-rc2 for x86 Windows.
Nov 05 2015
On Thursday, 5 November 2015 at 21:24:03 UTC, TheFlyingFiddle wrote:
[...]

I reduced it further:

struct A
{
    float x, y, z, w;
}

struct B
{
    float x = float.nan;
    float y = float.nan;
    float z = float.nan;
    float w = float.nan;
}

void initVal(T)(ref T t, ref float k)
{
    pragma(inline, false);
}

void benchA()
{
    foreach(float f; 0 .. 1000_000)
    {
        A val = A.init;
        initVal(val, f);
    }
}

void benchB()
{
    foreach(float f; 0 .. 1000_000)
    {
        B val = B.init;
        initVal(val, f);
    }
}

int main(string[] argv)
{
    import std.datetime;
    import std.stdio;

    auto res = benchmark!(benchA, benchB)(1);
    writeln("Default: ", res[0]);
    writeln("Explicit: ", res[1]);
    readln;
    return 0;
}

Also, I am compiling with dmd -release -boundscheck=off -inline. The pragma(inline, false) is there to prevent the compiler from removing the assignment in the loop.
Nov 05 2015
On Thursday, 5 November 2015 at 23:37:45 UTC, TheFlyingFiddle wrote:
> I reduced it further: [...]

These run at the exact same speed for me, and produce identical assembly output from a quick glance.

dmd 2.069, -O -release -inline
Nov 05 2015
On Friday, 6 November 2015 at 00:43:49 UTC, rsw0x wrote:
> these run at the exact same speed for me and produce identical assembly output from a quick glance

Are you running on Windows? I tested on Windows x64, and there I also get the exact same speed for both functions.
Nov 05 2015
On Friday, 6 November 2015 at 01:17:20 UTC, TheFlyingFiddle wrote:
> Are you running on windows?

Linux x86-64.
Nov 05 2015
On Friday, 6 November 2015 at 01:17:20 UTC, TheFlyingFiddle wrote:
[...]

I am still disappointed that DMD is not native 64-bit on Windows yet. Please show exactly how you are getting 64-bit to work on Windows 10; I have never gotten this to work, for any version of DMD. All of my new $400.00 systems are 4 GB, 64-bit Windows 10... and the processor instruction sets are very nice. I dabble in assembler, and I have always wondered why D does not take advantage of newer instructions... and 64-bit. I see a 64-bit Droid compiler for D. :):):)
Nov 06 2015
On Saturday, 7 November 2015 at 03:18:59 UTC, steven kladitis wrote:
> [...] I am still disappointed that DMD is not native 64 bit in windows yet. [...]

It's because they can't make a nice distribution. DMD Win32 is a nice package that works out of the box (compiler, standard C lib, standard D lib, linker, etc.) without any further configuration or requirements. DMD Win64 requires MSVS for the standard C lib and the linker.
Nov 06 2015
Ok, benchA and benchB have the same assembler code generated. However, I _can_ reproduce the slowdown, albeit on average only 20%-40%, not a factor of 10.

It turns out that it's always the first tested function that is slower. You can test this by switching benchA and benchB in the call to benchmark(). I suspect the reason is that the OS is paging in the code the first time, and we're actually seeing the cost of the page fault. If you run a second round of benchmarks after the first one, that round shows more or less the same performance for both functions.
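One way to factor that cold-start cost out of the numbers is to run a throwaway warm-up round and only report the second; a sketch, reusing benchA/benchB from the reduced example above:

import std.datetime : benchmark;
import std.stdio : writeln;

void benchA() { /* as in the reduced example above */ }
void benchB() { /* as in the reduced example above */ }

void main()
{
    benchmark!(benchA, benchB)(1);              // warm-up round, result discarded
    auto res = benchmark!(benchA, benchB)(1);   // only this round is reported
    writeln("Default:  ", res[0]);
    writeln("Explicit: ", res[1]);
}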
Nov 06 2015
On Friday, 6 November 2015 at 11:37:22 UTC, Marc Schütz wrote:
[...]

Forgot to add that this is on Linux x86_64, so that probably explains the difference.
Nov 06 2015
On Friday, 6 November 2015 at 11:38:29 UTC, Marc Schütz wrote:
[...]

I tested swapping around the functions on Windows x86, and I still get the same slowdown with the default initializer. Both functions still run at basically the same speed on Windows x64.

Interestingly enough, the slowdown disappears if I add another float variable to the structs. This changes the generated assembly to use different instructions, so I guess that is why. Also, it only seems to affect small structs with floats in them: if I change the members to int, both versions run at the same speed on x86 as well.
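Spelled out, the variants described above look like this (a sketch of the experiment; the names are illustrative):

struct F4 { float x, y, z, w; }    // 16 bytes of floats: shows the Win32 slowdown
struct F5 { float x, y, z, w, v; } // one more float: slowdown gone, codegen changes
struct I4 { int   x, y, z, w; }    // same size in ints: no slowdown either

static assert(F4.sizeof == 16 && F5.sizeof == 20 && I4.sizeof == 16);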
Nov 06 2015
On Thursday, 5 November 2015 at 03:52:47 UTC, TheFlyingFiddle wrote:
> [...] I solved the problem by changing the struct to look like this.
>
> align(16) struct Pos
> {
>     float x = float.nan;
>     float y = float.nan;
>     float z = float.nan;
>     float w = float.nan;
> }

Wow, that's quite strange. FP members should be initialized to NaN even without an explicit initializer! E.g. you should get the same with:

align(16) struct Pos
{
    float x, y, z, w;
}
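That the default really is NaN can be checked at compile time; a minimal sketch:

struct Pos { float x, y, z, w; }
// NaN is the only float value not equal to itself, so this static assert
// passes exactly when the default initializer really is NaN.
static assert(Pos.init.x != Pos.init.x);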
Nov 05 2015
I did some testing on Linux and Windows. I ran the code with ten times the iterations, and found the results consistent with what has previously been observed in this thread. The code seems to run just fine on Linux, but is slowed down 10x on Windows x86.

Windows (32-bit): rdmd bug.d -inline -boundscheck=off -release
    Default:  TickDuration(14398890)
    Explicit: TickDuration(168888)

Linux (64-bit): rdmd bug.d -m64 -inline -boundscheck=off
    Default:  TickDuration(59090876)
    Explicit: TickDuration(49529493)

Linux (32-bit): rdmd bug.d -inline -boundscheck=off
    Default:  TickDuration(58882306)
    Explicit: TickDuration(49231968)
Nov 06 2015
On Friday, 6 November 2015 at 17:55:47 UTC, arGus wrote:
[...]

File a bug report; this probably needs Walter to look at it.
Nov 06 2015