digitalmars.D.learn - Align a variable on the stack.
- TheFlyingFiddle (29/29) Nov 03 2015 Is there a built in way to do this in dmd?
- Nicholas Wilson (19/48) Nov 03 2015 Note that there are two different alignments:
- TheFlyingFiddle (43/61) Nov 04 2015 Thanks for the reply. I did some more checking around and I found
- Marc Schütz (7/9) Nov 05 2015 Can you publish two compilable and runnable versions of the code
- TheFlyingFiddle (54/62) Nov 05 2015 I created a simple example here:
- TheFlyingFiddle (4/6) Nov 05 2015 I forgot to mention this but I am using DMD 2.069.0-rc2 for x86
- TheFlyingFiddle (41/48) Nov 05 2015 I reduced it further:
- rsw0x (5/10) Nov 05 2015 these run at the exact same speed for me and produce identical
- TheFlyingFiddle (4/16) Nov 05 2015 Are you running on windows?
- rsw0x (2/19) Nov 05 2015 linux x86-64
- steven kladitis (9/26) Nov 06 2015 I am still disappointed that DMD is not native 64 bit in windows
- BBaz (7/11) Nov 06 2015 It's because they can't make a nice distribution. DMD win32 is a
- Marc Schütz (10/10) Nov 06 2015 Ok, benchA and benchB have the same assembler code generated.
- Marc Schütz (3/13) Nov 06 2015 Forgot to add that this is on Linux x86_64, so that probably
- TheFlyingFiddle (10/24) Nov 06 2015 I tested swapping around the functions on windows x86 and I still
- BBasile (8/17) Nov 05 2015 wow that's quite strange. FP members should be initialized
- arGus (18/18) Nov 06 2015 I did some testing on Linux and Windows.
- rsw0x (2/20) Nov 06 2015 File a bug report, this probably needs Walter to look at it.
Is there a built-in way to do this in dmd? Basically I want to do this:

auto decode(T)(...)
{
    while(...)
    {
        T t = T.init; // I want this aligned to 64 bytes.
    }
}

Currently I am using:

align(64) struct Aligner(T)
{
    T value;
}

auto decode(T)(...)
{
    Aligner!T t = void;
    while(...)
    {
        t.value = T.init;
    }
}

But is there a less hacky way? From the documentation of align, it seems I cannot use it for this kind of thing. I also don't want to put align(64) on my T struct type itself, since for my use case I am decoding arrays of T. The reason I want to do this in the first place is that if the variable is aligned, I get about a 2.5x speedup (I don't really know why; I found it by accident).
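For comparison, one manual alternative is to over-allocate a raw buffer and align a pointer into it by hand. This is only a sketch; the alignedSlot helper is illustrative (not a library function) and assumes the alignment is a power of two:

// Round the buffer's start address up to the next multiple of `alignment`
// (a power of two) and reinterpret that address as a T*.
T* alignedSlot(T, size_t alignment)(void[] buffer)
{
    auto addr    = cast(size_t) buffer.ptr;
    auto aligned = (addr + alignment - 1) & ~(alignment - 1);
    assert(aligned + T.sizeof <= addr + buffer.length);
    return cast(T*) aligned;
}

void decode(T)()
{
    // 64 bytes of slack guarantees a 64-byte boundary inside the buffer.
    ubyte[T.sizeof + 64] raw = void;
    T* t = alignedSlot!(T, 64)(raw[]);
    *t = T.init;
    // ... decode into *t ...
}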
Nov 03 2015
On Tuesday, 3 November 2015 at 23:29:45 UTC, TheFlyingFiddle wrote:
[...]

Note that there are two different alignments: one controls the padding between instances (e.g. array elements), the other controls the padding between the members of a struct:

align(64) // between instances (arrays)
struct foo
{
    align(16) short baz;  // between members
    align(1)  float quux;
}

Your 2.5x speedup is due to aligned vs. unaligned loads and stores, which for SIMD-type stuff has a really big effect; basically, misaligned accesses are really slow. IIRC there was a blog post (paper?) about someone on a microcontroller spending a vast amount of time in ONE misaligned integer assignment, which caused traps and got the kernel involved. Not quite as bad on x86, but still worth avoiding. As for a less hacky solution, I'm not sure there is one.
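The two kinds of alignment can be probed at compile time; a small sketch (the printed values can differ by compiler and platform):

align(64) struct foo
{
    align(16) short baz;
    align(1)  float quux;
}

pragma(msg, foo.alignof);        // alignment of the type itself (array stride)
pragma(msg, foo.baz.offsetof);   // 0
pragma(msg, foo.quux.offsetof);  // 2 with align(1); 4 without it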
Nov 03 2015
On Wednesday, 4 November 2015 at 01:14:31 UTC, Nicholas Wilson wrote:
[...]

Thanks for the reply. I did some more checking around and I found that it was not really an alignment problem; it was caused by using the default init value of my type. My starting type:

align(64) struct Phys
{
    float x, y, z, w;
    // More stuff.
} // Was 64 bytes in size at the time.

The above worked fine; it was fast and all. But after a while I wanted the data in a different format, so I started decoding positions and other variables into separate arrays. Something like this:

align(16) struct Pos
{
    float x, y, z, w;
}

Counter to my limited knowledge of how CPUs work, this was much slower. Doing the same thing lots of times, touching less memory with fewer branches, should in theory at least be faster, right? So after I ruled out bottlenecks in the parser, I assumed there was an alignment problem, and did my Aligner hack. This made the code run faster, so I assumed alignment was the cause... Naive! (There was a typo in the code I submitted to begin with: I used a = Align!(T).init and not a.value = T.init.)

The performance problem was actually caused by the line t = T.init, no matter whether t was aligned or not. I solved it by changing the struct to look like this:

align(16) struct Pos
{
    float x = float.nan;
    float y = float.nan;
    float z = float.nan;
    float w = float.nan;
}

Basically, T.init gets explicit values. But this should be the same Pos.init as the default Pos.init, so I really fail to understand how this could fix the problem. I guessed the compiler generates slightly different code if I do it this way, and that this slightly different code avoids some bottleneck in the CPU. But when I took a look at the assembly of the function, I could not find any difference in the generated code...

I don't really know where to go from here to figure out the underlying cause. Does anyone have any suggestions?
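The "both .init values should be identical" assumption can be checked directly; a minimal sketch (the struct names PosA/PosB are illustrative) comparing the raw bytes of the two .init blobs:

import core.stdc.string : memcmp;

align(16) struct PosA { float x, y, z, w; }
align(16) struct PosB
{
    float x = float.nan;
    float y = float.nan;
    float z = float.nan;
    float w = float.nan;
}

void main()
{
    auto a = PosA.init;
    auto b = PosB.init;
    // If this fails, the two .init values actually differ at the bit level.
    assert(memcmp(&a, &b, PosA.sizeof) == 0);
}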
Nov 04 2015
On Thursday, 5 November 2015 at 03:52:47 UTC, TheFlyingFiddle wrote:
> I don't really know where to go from here to figure out the underlying cause. Does anyone have any suggestions?

Can you publish two compilable and runnable versions of the code that exhibit the difference? Then we can have a look at the generated assembly. If there's really different code being generated depending on whether the .init value is explicitly set to float.nan or not, then this suggests a bug in DMD.
Nov 05 2015
On Thursday, 5 November 2015 at 11:14:50 UTC, Marc Schütz wrote:
[...]

I created a simple example here:

import std.math; // needed for the ^^ operator on floats

struct A
{
    float x, y, z, w;
}

struct B
{
    float x = float.nan;
    float y = float.nan;
    float z = float.nan;
    float w = float.nan;
}

void initVal(T)(ref T t, ref float k)
{
    pragma(inline, false);
    t.x = k;
    t.y = k * 2;
    t.z = k / 2;
    t.w = k^^3;
}

__gshared A[] a;
void benchA()
{
    A val;
    foreach(float f; 0 .. 1000_000)
    {
        val = A.init;
        initVal(val, f);
        a ~= val;
    }
}

__gshared B[] b;
void benchB()
{
    B val;
    foreach(float f; 0 .. 1000_000)
    {
        val = B.init;
        initVal(val, f);
        b ~= val;
    }
}

int main(string[] argv)
{
    import std.datetime;
    import std.stdio;

    auto res = benchmark!(benchA, benchB)(1);
    writeln("Default: ", res[0]);
    writeln("Explicit: ", res[1]);
    return 0;
}

Output:

Default: TickDuration(1637842)
Explicit: TickDuration(167088)

~10x slowdown...
Nov 05 2015
On Thursday, 5 November 2015 at 21:22:18 UTC, TheFlyingFiddle wrote:
[...]

I forgot to mention this, but I am using DMD 2.069.0-rc2 for x86 Windows.
Nov 05 2015
On Thursday, 5 November 2015 at 21:24:03 UTC, TheFlyingFiddle wrote:
[...]

I reduced it further:

struct A
{
    float x, y, z, w;
}

struct B
{
    float x = float.nan;
    float y = float.nan;
    float z = float.nan;
    float w = float.nan;
}

void initVal(T)(ref T t, ref float k)
{
    pragma(inline, false);
}

void benchA()
{
    foreach(float f; 0 .. 1000_000)
    {
        A val = A.init;
        initVal(val, f);
    }
}

void benchB()
{
    foreach(float f; 0 .. 1000_000)
    {
        B val = B.init;
        initVal(val, f);
    }
}

int main(string[] argv)
{
    import std.datetime;
    import std.stdio;

    auto res = benchmark!(benchA, benchB)(1);
    writeln("Default: ", res[0]);
    writeln("Explicit: ", res[1]);
    readln;
    return 0;
}

Also, I am compiling with dmd -release -boundscheck=off -inline. The pragma(inline, false) is there to prevent the compiler from removing the assignment in the loop.
Nov 05 2015
On Thursday, 5 November 2015 at 23:37:45 UTC, TheFlyingFiddle wrote:
> I reduced it further: [...]

These run at the exact same speed for me, and produce identical assembly output from a quick glance.

dmd 2.069, -O -release -inline
Nov 05 2015
On Friday, 6 November 2015 at 00:43:49 UTC, rsw0x wrote:
> these run at the exact same speed for me and produce identical assembly output from a quick glance

Are you running on Windows? I tested on Windows x64, and there I also get the exact same speed for both functions.
Nov 05 2015
On Friday, 6 November 2015 at 01:17:20 UTC, TheFlyingFiddle wrote:
> Are you running on windows?

Linux x86-64.
Nov 05 2015
On Friday, 6 November 2015 at 01:17:20 UTC, TheFlyingFiddle wrote:
[...]

I am still disappointed that DMD is not native 64-bit on Windows yet. Please show exactly how you are getting 64-bit to work on Windows 10; I have never gotten this to work, for any version of DMD. All of my new $400.00 systems are 4 GB, 64-bit Windows 10... and the processor instruction sets are very nice. I dabble in assembler, and I have always wondered why D does not take advantage of newer instructions... and 64-bit. I see a 64-bit Droid compiler for D. :):):)
Nov 06 2015
On Saturday, 7 November 2015 at 03:18:59 UTC, steven kladitis wrote:
> [...] I am still disappointed that DMD is not native 64 bit in windows yet. [...]

It's because they can't make a nice distribution. DMD Win32 is a nice package that works out of the box (compiler, standard C lib, standard D lib, linker, etc.) without any further configuration or requirements. DMD Win64 requires MSVS for the standard C lib and the linker.
Nov 06 2015
Ok, benchA and benchB have the same assembler code generated. However, I _can_ reproduce the slowdown, albeit on average only 20%-40%, not a factor of 10.

It turns out that it's always the first tested function that is slower. You can test this by switching benchA and benchB in the call to benchmark(). I suspect the reason is that the OS is paging in the code the first time, and we're actually seeing the cost of the page fault. If you run a second round of benchmarks after the first one, that round shows more or less the same performance for both functions.
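One way to factor that cold-start cost out of the numbers is to run a throwaway warm-up round and only report the second; a sketch, reusing benchA/benchB from the reduced example above:

import std.datetime : benchmark;
import std.stdio : writeln;

void benchA() { /* as in the reduced example above */ }
void benchB() { /* as in the reduced example above */ }

void main()
{
    benchmark!(benchA, benchB)(1);              // warm-up round, result discarded
    auto res = benchmark!(benchA, benchB)(1);   // only this round is reported
    writeln("Default:  ", res[0]);
    writeln("Explicit: ", res[1]);
}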
Nov 06 2015
On Friday, 6 November 2015 at 11:37:22 UTC, Marc Schütz wrote:
[...]

Forgot to add that this is on Linux x86_64, so that probably explains the difference.
Nov 06 2015
On Friday, 6 November 2015 at 11:38:29 UTC, Marc Schütz wrote:
[...]

I tested swapping around the functions on Windows x86, and I still get the same slowdown with the default initializer. Both functions still run at basically the same speed on Windows x64.

Interestingly enough, the slowdown disappears if I add another float variable to the structs. This changes the generated assembly to use different instructions, so I guess that is why. Also, it only seems to affect small structs with floats in them: if I change the members to int, both versions run at the same speed on x86 as well.
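Spelled out, the variants described above look like this (a sketch of the experiment; the names are illustrative):

struct F4 { float x, y, z, w; }    // 16 bytes of floats: shows the Win32 slowdown
struct F5 { float x, y, z, w, v; } // one more float: slowdown gone, codegen changes
struct I4 { int   x, y, z, w; }    // same size in ints: no slowdown either

static assert(F4.sizeof == 16 && F5.sizeof == 20 && I4.sizeof == 16);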
Nov 06 2015
On Thursday, 5 November 2015 at 03:52:47 UTC, TheFlyingFiddle wrote:
> [...] I solved the problem by changing the struct to look like this.
>
> align(16) struct Pos
> {
>     float x = float.nan;
>     float y = float.nan;
>     float z = float.nan;
>     float w = float.nan;
> }

Wow, that's quite strange. FP members should be initialized to NaN even without an explicit initializer! E.g. you should get the same with:

align(16) struct Pos
{
    float x, y, z, w;
}
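That the default really is NaN can be checked at compile time; a minimal sketch:

struct Pos { float x, y, z, w; }
// NaN is the only float value not equal to itself, so this static assert
// passes exactly when the default initializer really is NaN.
static assert(Pos.init.x != Pos.init.x);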
Nov 05 2015
I did some testing on Linux and Windows. I ran the code with ten times the iterations, and found the results consistent with what has previously been observed in this thread. The code seems to run just fine on Linux, but is slowed down 10x on Windows x86.

Windows (32-bit): rdmd bug.d -inline -boundscheck=off -release
    Default:  TickDuration(14398890)
    Explicit: TickDuration(168888)

Linux (64-bit): rdmd bug.d -m64 -inline -boundscheck=off
    Default:  TickDuration(59090876)
    Explicit: TickDuration(49529493)

Linux (32-bit): rdmd bug.d -inline -boundscheck=off
    Default:  TickDuration(58882306)
    Explicit: TickDuration(49231968)
Nov 06 2015
On Friday, 6 November 2015 at 17:55:47 UTC, arGus wrote:
[...]

File a bug report; this probably needs Walter to look at it.
Nov 06 2015