digitalmars.D - Pointer alignments in the type system?

bearophile (76/76) Jun 24 2013 Lately I am playing a little with SIMD in D, and with the work in

bearophile (64/82) Jul 12 2013 I didn't think to also take a look at the asm produced. Even if I
bearophile (47/47) Aug 15 2013 The XMM registers I am using are efficient when you feed them

bearophile (3/5) Aug 15 2013 Sorry, I meant to say ZMM registers.
Manu (25/65) Aug 16 2013 I'd argue that if you're alloc-ing many instances of 10 short's

Manu (37/102) Aug 15 2013 Good man! :)

bearophile (55/79) Aug 16 2013 If you use a[16..$] or a[32..$] then the compiler knows

"bearophile" <bearophileHUGS lycos.com> writes:

Lately I am playing a little with SIMD in D, and with the work in 
progress module std.simd, and I have added several related bug 
reports to Bugzilla.

This program compiles with no errors nor warnings with ldc2 on 
Windows32 bit:


import core.simd;

__gshared immutable int[16]
     a = [ 1,  2,  3,  4,  5,  6,  7,  8, 9, 10, 11, 12, 13, 14, 
15, 16],
     b = [16, 15, 14, 13, 12, 11, 10,  9, 8,  7,  6,  5,  4,  3,  
2,  1];
__gshared int[16] c;

void main() {
     auto va = cast(int4[])a;
     auto vb = cast(int4[])b;
     auto vc = cast(int4[])c;
     vc[0] = va[0] + vb[0];
}


ldc2 generates this main:

__Dmain:
     subl    $12, %esp
     movl    $16, 8(%esp)
     movl    $4, 4(%esp)
     movl    $16, (%esp)
     calll   __d_array_cast_len
     movdqa  __D5test51ayG16i, %xmm0
     paddd   __D5test51byG16i, %xmm0
     movdqa  %xmm0, __D5test51cG16i
     xorl    %eax, %eax
     addl    $12, %esp
     ret


It uses the instruction movdqa, that assumes a,b and c to be 
aligned to 16 bytes. But I think there is no guarantee they are.

This is the LL code generated using the -output-ll switch of ldc2 
(it's a kind of nearly universal bytecode for llvm):

 _D5test51ayG16i = global [16 x i32] [i32 1, i32 2, i32 3, i32 4, 
i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 
13, i32 14, i32 15, i32 16]
 _D5test51byG16i = global [16 x i32] [i32 16, i32 15, i32 14, i32 
13, i32 12, i32 11, i32 10, i32 9, i32 8, i32 7, i32 6, i32 5, 
i32 4, i32 3, i32 2, i32 1]
 _D5test51cG16i = global [16 x i32] zeroinitializer


If I add a "align(16)" annotation to a, b, c it adds the align 16 
annotation in the LL code too:

 _D5test51ayG16i = global [16 x i32] [i32 1, i32 2, i32 3, i32 4, 
i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 
13, i32 14, i32 15, i32 16], align 16
 _D5test51byG16i = global [16 x i32] [i32 16, i32 15, i32 14, i32 
13, i32 12, i32 11, i32 10, i32 9, i32 8, i32 7, i32 6, i32 5, 
i32 4, i32 3, i32 2, i32 1], align 16
 _D5test51cG16i = global [16 x i32] zeroinitializer, align 16


Instead of array casts if I cast the pointers the result is 
similar, but there is no call to __d_array_cast_len (this is a 
raw cast, so I don't expect much help from the type system 
here...):

     auto va = cast(int4*)a.ptr;
     auto vb = cast(int4*)b.ptr;
     auto vc = cast(int4*)c.ptr;


I'd like to receive an alignment warning in those cast(int4[]), 
or the compiler should not use movdqa in such case. This means 
that maybe __d_array_cast_len should keep and transmit the 
alignment of its input pointer to the pointer in the output 
array. And maybe this means the D front-end should keep an 
alignment information for each pointer (or for pointers that will 
be used in situations where the alignment is important), 
integrating it in its type system (and perhaps perform alignment 
inference like the purity inference done for function templates).

The alignment of pointers is important for the CPU, so maybe the 
type system of a system language should keep track of them, and 
enforce the correctness of the alignments. Maybe this could be 
done with no further annotation burden for the programmer. A 
potential problem is how to mix this alignment inference with the 
needs of separate compilation. I think the align() annotations 
suffice for that.

Bye,
bearophile

Jun 24 2013

"bearophile" <bearophileHUGS lycos.com> writes:

 This is the LL code generated using the -output-ll switch of 
 ldc2 (it's a kind of nearly universal bytecode for llvm):

  _D5test51ayG16i = global [16 x i32] [i32 1, i32 2, i32 3, i32 
 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, 
 i32 13, i32 14, i32 15, i32 16]
  _D5test51byG16i = global [16 x i32] [i32 16, i32 15, i32 14, 
 i32 13, i32 12, i32 11, i32 10, i32 9, i32 8, i32 7, i32 6, i32 
 5, i32 4, i32 3, i32 2, i32 1]
  _D5test51cG16i = global [16 x i32] zeroinitializer


 If I add a "align(16)" annotation to a, b, c it adds the align 
 16 annotation in the LL code too:

  _D5test51ayG16i = global [16 x i32] [i32 1, i32 2, i32 3, i32 
 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, 
 i32 13, i32 14, i32 15, i32 16], align 16
  _D5test51byG16i = global [16 x i32] [i32 16, i32 15, i32 14, 
 i32 13, i32 12, i32 11, i32 10, i32 9, i32 8, i32 7, i32 6, i32 
 5, i32 4, i32 3, i32 2, i32 1], align 16
  _D5test51cG16i = global [16 x i32] zeroinitializer, align 16

I didn't think to also take a look at the asm produced. Even if I 
don't add align(16) in the code, and even if the LLVM IR doesn't 
have an alignment annotation, the data in the asm has such 
alignment:


     .data
     .globl  __D4test1ayG16i
     .align  16
__D4test1ayG16i:
     .long   1
     .long   2
     .long   3
     .long   4
     .long   5
     .long   6
     .long   7
     .long   8
     .long   9
     .long   10
     .long   11
     .long   12
     .long   13
     .long   14
     .long   15
     .long   16

     .globl  __D4test1byG16i
     .align  16
__D4test1byG16i:
     .long   16
     .long   15
     .long   14
     .long   13
     .long   12
     .long   11
     .long   10
     .long   9
     .long   8
     .long   7
     .long   6
     .long   5
     .long   4
     .long   3
     .long   2
     .long   1


Regarding adding the alignment to the D type system, to assure 
sliced data is aligned, a possible first step is to support:

void foo(align(16) double[] a1,
          align(16) double[] a2) {}


That annotation is dynamically tested at entry point. So it's 
similar to this:

void foo(double[] a1, double[] a2)
in {
     assert(isAligned(a1, 16) && isAligned(a2, 16));
} body {}


But inside the body of foo() the compiler is allowed to use 
aligned SIMD instructions (in theory this should be true for the 
version with asserts too).

A more elegant (and maybe better) solution is to verify the 
alignment statically. This means foo() accepts only 16-aligned 
arrays. Array cast(), new array, dup, idup, and little else is 
statically known to produce/return aligned arrays. And for the 
other slices you need something like makeAligned!N() that returns 
a statically known alignment.

Bye,
bearophile

Jul 12 2013

"bearophile" <bearophileHUGS lycos.com> writes:

The XMM registers I am using are efficient when you feed them 
memory from arrays aligned to 16 bytes, as the D GC produces. But 
the YMM registers used by the AVX/AVX2 instructions prefer an 
alignment of 32 bytes. And the Intel Xeon Phi (MIC) has XMM 
registers that are efficient when the arrays are aligned to 64 
bytes.

When I am not using SIMD code, and I want a small array of little 
elements, like an array of 10 ushorts, having it aligned to 16 
bytes is a waste of space (despite helps the GC reduce the 
fragmentation).

So I have written a small enhancement request, where I suggest 
that arrays for YMM registers could be allocated with an 
alignment of 32 bytes:
http://d.puremagic.com/issues/show_bug.cgi?id=10826

Having the array alignments in the D type system could be useful. 
To be backward-compatible you also need a generic unknown 
alignment (like a void* for alignments), so you can assign arrays 
of any alignment to it, it could be denoted with '0'.

Some rough ideas:


import core.simd: double2, double4;
auto a = new int[10];
static assert(__traits(alignment, a) == 16);
auto b = new int[128]<32>;
static assert(__traits(alignment, b) == 32);
auto c1 = new double2[128];
auto c2 = new double4[64];
static assert(__traits(alignment, c1) == 16);
static assert(__traits(alignment, c2) == 32);

void foo1(int[]<32> a) {
     // Uses YMM registers to modify a
     // ...
}

void foo2(int[] a)
if (__traits(alignment, a) == 32) {
     // Uses YMM registers to modify a
     // ...
}

void foo3(size_t N)(int[]<N> a) {
     static if (N >= 32) {
         // Uses YMM registers to modify a
         // ...
     } else {
         // ...
     }
}


Bye,
bearophile

Aug 15 2013

"bearophile" <bearophileHUGS lycos.com> writes:

 And the Intel Xeon Phi (MIC) has XMM registers that are 
 efficient when the arrays are aligned to 64 bytes.

Sorry, I meant to say ZMM registers.

Bye,
bearophile

Aug 15 2013

Manu <turkeyman gmail.com> writes:

On 16 August 2013 04:39, bearophile <bearophileHUGS lycos.com> wrote:

 The XMM registers I am using are efficient when you feed them memory from
 arrays aligned to 16 bytes, as the D GC produces. But the YMM registers
 used by the AVX/AVX2 instructions prefer an alignment of 32 bytes. And the
 Intel Xeon Phi (MIC) has XMM registers that are efficient when the arrays
 are aligned to 64 bytes.

 When I am not using SIMD code, and I want a small array of little
 elements, like an array of 10 ushorts, having it aligned to 16 bytes is a
 waste of space (despite helps the GC reduce the fragmentation).

I'd argue that if you're alloc-ing many instances of 10 short's
independently on the heap, then you're probably doing it wrong.
If you're only doing it once or twice, then it's not a significant waste of
memory as you say.

So I have written a small enhancement request, where I suggest that arrays
 for YMM registers could be allocated with an alignment of 32 bytes:
 http://d.puremagic.com/issues/**show_bug.cgi?id=10826<http://d.puremagic.com/issues/show_bug.cgi?id=10826>


This shouldn't be an enhancement request, it should be the rule.
__vector()'s should be intrinsically aligned to their sizeof. If they are
not, it should be a bug.

Having the array alignments in the D type system could be useful. To be
 backward-compatible you also need a generic unknown alignment (like a void*
 for alignments), so you can assign arrays of any alignment to it, it could
 be denoted with '0'.

 Some rough ideas:


 import core.simd: double2, double4;
 auto a = new int[10];
 static assert(__traits(alignment, a) == 16);
 auto b = new int[128]<32>;
 static assert(__traits(alignment, b) == 32);
 auto c1 = new double2[128];
 auto c2 = new double4[64];
 static assert(__traits(alignment, c1) == 16);
 static assert(__traits(alignment, c2) == 32);

 void foo1(int[]<32> a) {
     // Uses YMM registers to modify a
     // ...
 }

 void foo2(int[] a)
 if (__traits(alignment, a) == 32) {
     // Uses YMM registers to modify a
     // ...
 }

 void foo3(size_t N)(int[]<N> a) {
     static if (N >= 32) {
         // Uses YMM registers to modify a
         // ...
     } else {
         // ...
     }
 }

The thing is, a/b/c1/c2 is really just: struct { size_t length; T *ptr; }
Those symbol names just refer to the dynamic array struct... It doesn't
make a lot of sense to query the alignment of those symbols.
__traits(alignment, s) would == sizeof(size_t) every time.

I've thought about this sort of thing many times before... but I'm not
convinced. I still think it falls over with the possibility of slicing, and
separate compilation.

The thing you're asking for is alignment of the base of the array, not
alignment of elements within the array or alignment of the dynamic array
structure its self.
Alignment of the base of the array isn't really expressible as part of the
type. It's just a request to the allocator.
I use a mallocAligned() function in C. Sadly, I think this is one of the
mistakes of a discreet 'new' operator, which has a syntax that doesn't lend
its self to arbitrary parameters.

Aug 16 2013

Manu <turkeyman gmail.com> writes:

On 24 June 2013 22:03, bearophile <bearophileHUGS lycos.com> wrote:

 Lately I am playing a little with SIMD in D, and with the work in progress
 module std.simd, and I have added several related bug reports to Bugzilla.

Good man! :)

This program compiles with no errors nor warnings with ldc2 on Windows32
 bit:


 import core.simd;

 __gshared immutable int[16]
     a = [ 1,  2,  3,  4,  5,  6,  7,  8, 9, 10, 11, 12, 13, 14, 15, 16],
     b = [16, 15, 14, 13, 12, 11, 10,  9, 8,  7,  6,  5,  4,  3,  2,  1];
 __gshared int[16] c;

 void main() {
     auto va = cast(int4[])a;
     auto vb = cast(int4[])b;
     auto vc = cast(int4[])c;
     vc[0] = va[0] + vb[0];
 }


 ldc2 generates this main:

 __Dmain:
     subl    $12, %esp
     movl    $16, 8(%esp)
     movl    $4, 4(%esp)
     movl    $16, (%esp)
     calll   __d_array_cast_len
     movdqa  __D5test51ayG16i, %xmm0
     paddd   __D5test51byG16i, %xmm0
     movdqa  %xmm0, __D5test51cG16i
     xorl    %eax, %eax
     addl    $12, %esp
     ret


 It uses the instruction movdqa, that assumes a,b and c to be aligned to 16
 bytes. But I think there is no guarantee they are.

Those int[16]'s aren't specified as aligned, but I bet LLVM has an
optimisation which opportunistically aligns them given some criteria, and
then generates code accordingly.

This is the LL code generated using the -output-ll switch of ldc2 (it's a
 kind of nearly universal bytecode for llvm):

  _D5test51ayG16i = global [16 x i32] [i32 1, i32 2, i32 3, i32 4, i32 5,
 i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15,
 i32 16]
  _D5test51byG16i = global [16 x i32] [i32 16, i32 15, i32 14, i32 13, i32
 12, i32 11, i32 10, i32 9, i32 8, i32 7, i32 6, i32 5, i32 4, i32 3, i32 2,
 i32 1]
  _D5test51cG16i = global [16 x i32] zeroinitializer


 If I add a "align(16)" annotation to a, b, c it adds the align 16
 annotation in the LL code too:

And so it should, you have now made the alignment explicit. It's not an
optimisation anymore, it's an explicit request.

 _D5test51ayG16i = global [16 x i32] [i32 1, i32 2, i32 3, i32 4, i32 5,
 i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15,
 i32 16], align 16
  _D5test51byG16i = global [16 x i32] [i32 16, i32 15, i32 14, i32 13, i32
 12, i32 11, i32 10, i32 9, i32 8, i32 7, i32 6, i32 5, i32 4, i32 3, i32 2,
 i32 1], align 16
  _D5test51cG16i = global [16 x i32] zeroinitializer, align 16


 Instead of array casts if I cast the pointers the result is similar, but
 there is no call to __d_array_cast_len (this is a raw cast, so I don't
 expect much help from the type system here...):

     auto va = cast(int4*)a.ptr;
     auto vb = cast(int4*)b.ptr;
     auto vc = cast(int4*)c.ptr;


 I'd like to receive an alignment warning in those cast(int4[]), or the
 compiler should not use movdqa in such case.


I suspect that LLVM scheduling a movdqa is permitted only because it has
hard evidence during codegen that it is safe to do so.
Try the same thing with separate compilation across link units, see if the
presumption of alignment goes away?

I'm not sure about the warning, it doesn't really sound right to me. Maybe
it would be useful... or maybe it would be annoying.
A cast is a cast, it is a deliberate act of reinterpretation... the user
needs to take some responsibility in this case.

Although on the other hand, a cast TO a __vector() (or any aligned type)
pointer or array is absolutely useless unless the source pointer is aligned
correctly. And in *some* circumstances, the compiler could statically prove
this... but in many cases, it probably couldn't.
What happens if you do:
  int[16] a = [ 1,  2,  3,  4,  5,  6,  7,  8, 9, 10, 11, 12, 13, 14, 15,
16];
  auto va = cast(int4[])a[1..$]; // ?? .. the offset breaks the alignment.
(and now it's too short)

The compiler doesn't only need to consider the alignment of the source
pointer, but also the offset at any given time. This is probably
impossible.

This means that maybe __d_array_cast_len should keep and transmit the
 alignment of its input pointer to the pointer in the output array. And
 maybe this means the D front-end should keep an alignment information for
 each pointer (or for pointers that will be used in situations where the
 alignment is important), integrating it in its type system (and perhaps
 perform alignment inference like the purity inference done for function
 templates).

 The alignment of pointers is important for the CPU, so maybe the type
 system of a system language should keep track of them, and enforce the
 correctness of the alignments. Maybe this could be done with no further
 annotation burden for the programmer. A potential problem is how to mix
 this alignment inference with the needs of separate compilation. I think
 the align() annotations suffice for that.

Yes, kinda like I was saying above. It might be fine if you are explicit
with align() statements, but whenever you are sharing pointers between link
units, how can you ever retain the information the compiler needs?

I think it's a mistake to make it work in some cases and not others. So for
this reason, I maintain that the user just needs to take responsibility for
their casts...

Aug 15 2013

"bearophile" <bearophileHUGS lycos.com> writes:

Manu:

 I'm not sure about the warning, it doesn't really sound right 
 to me. Maybe
 it would be useful... or maybe it would be annoying.
 A cast is a cast, it is a deliberate act of reinterpretation... 
 the user
 needs to take some responsibility in this case.

OK.


 What happens if you do:
   int[16] a = [ 1,  2,  3,  4,  5,  6,  7,  8, 9, 10, 11, 12, 
 13, 14, 15,
 16];
   auto va = cast(int4[])a[1..$]; // ?? .. the offset breaks the 
 alignment.
 (and now it's too short)

If you use a[16..$] or a[32..$] then the compiler knows 
statically that the slice has kept the 16 bytes alignment.
If you use a[i..$] where i is a run-time value, then the compiler 
assumes the worst, that a is aligned to 4 bytes.
A kind of cast could be used to force the compiler assume a 
certain pointer/array has a certain alignment (better languages 
could allow you to write a proof of such alignment, but this is 
not possible in D, unless you add something like the Microsoft Z3 
solver inside the compiler).


 Yes, kinda like I was saying above. It might be fine if you are 
 explicit
 with align() statements, but whenever you are sharing pointers 
 between link
 units, how can you ever retain the information the compiler 
 needs?

The title of this thread is "Pointer alignments in the type 
system", that means that the static information of the alignment 
is kept with the type of the pointers/arrays (including dynamic 
arrays). In D you can use int[3] across link units because the 
length is part of the static information of the array.

To make things back-ward compatible you need a "don't know" 
alignment, that is like the "void*" type for pointers, or maybe 
for the minimum alignment you could just use the natural 
alignment of the type.

See some my successive posts for some examples.


 I think it's a mistake to make it work in some cases and not 
 others.

I agree. My proposal is to make them always work (unless you 
destroy the static information with a cast or a runtime-valued 
slicing). Runtime functions like "idup" etc are supposed to keep 
and "transmit" the alignment information.


 So for
 this reason, I maintain that the user just needs to take 
 responsibility for their casts...

In D we have a good static type system, unlike for example in 
Python2. One of the points of a static type system is to keep 
some static information, to use it to enforce certain invariants 
statically, and avoid some bugs and offer performance. SIMD 
programming works more efficiently with certain alignments. In my 
opinion it's not too much hard to retain such alignment 
information and avoid certain bugs statically, and produce fast 
code. Having a static type system and not using it is a waste.

This means this a will has a standard aliment of 16:
auto a = new int[128];
static assert(__traits(alignment, a) == 16);

this b has a statically knwn alignment of 4:
auto b = a[i .. $];
static assert(__traits(alignment, b) == 4);
auto b2 = cast(align(16))b;
static assert(__traits(alignment, b2) == 16);

This c has a statically known alignment of 16:
auto c = a.dup;
static assert(__traits(alignment, c) == 16);

auto d = new align(32) int[128]; // A possible syntax.
static assert(__traits(alignment, d) == 32);
// Other possible syntax:
auto d2 = new int[128] align(32);
auto d3 = new int[128]<32>;

auto e = d[64..$];
static assert(__traits(alignment, e) == 32);

align(32) int[128] f;
static assert(__traits(alignment, f) == 32);

Bye,
bearophile

Aug 16 2013

D Programming

C/C++ Programming

Other

digitalmars.D - Pointer alignments in the type system?