www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - Pointer alignments in the type system?

reply "bearophile" <bearophileHUGS lycos.com> writes:
Lately I am playing a little with SIMD in D, and with the work in 
progress module std.simd, and I have added several related bug 
reports to Bugzilla.

This program compiles with no errors nor warnings with ldc2 on 
Windows32 bit:


import core.simd;

__gshared immutable int[16]
     a = [ 1,  2,  3,  4,  5,  6,  7,  8, 9, 10, 11, 12, 13, 14, 
15, 16],
     b = [16, 15, 14, 13, 12, 11, 10,  9, 8,  7,  6,  5,  4,  3,  
2,  1];
__gshared int[16] c;

void main() {
     auto va = cast(int4[])a;
     auto vb = cast(int4[])b;
     auto vc = cast(int4[])c;
     vc[0] = va[0] + vb[0];
}


ldc2 generates this main:

__Dmain:
     subl    $12, %esp
     movl    $16, 8(%esp)
     movl    $4, 4(%esp)
     movl    $16, (%esp)
     calll   __d_array_cast_len
     movdqa  __D5test51ayG16i, %xmm0
     paddd   __D5test51byG16i, %xmm0
     movdqa  %xmm0, __D5test51cG16i
     xorl    %eax, %eax
     addl    $12, %esp
     ret


It uses the instruction movdqa, that assumes a,b and c to be 
aligned to 16 bytes. But I think there is no guarantee they are.

This is the LL code generated using the -output-ll switch of ldc2 
(it's a kind of nearly universal bytecode for llvm):

 _D5test51ayG16i = global [16 x i32] [i32 1, i32 2, i32 3, i32 4, 
i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 
13, i32 14, i32 15, i32 16]
 _D5test51byG16i = global [16 x i32] [i32 16, i32 15, i32 14, i32 
13, i32 12, i32 11, i32 10, i32 9, i32 8, i32 7, i32 6, i32 5, 
i32 4, i32 3, i32 2, i32 1]
 _D5test51cG16i = global [16 x i32] zeroinitializer


If I add a "align(16)" annotation to a, b, c it adds the align 16 
annotation in the LL code too:

 _D5test51ayG16i = global [16 x i32] [i32 1, i32 2, i32 3, i32 4, 
i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 
13, i32 14, i32 15, i32 16], align 16
 _D5test51byG16i = global [16 x i32] [i32 16, i32 15, i32 14, i32 
13, i32 12, i32 11, i32 10, i32 9, i32 8, i32 7, i32 6, i32 5, 
i32 4, i32 3, i32 2, i32 1], align 16
 _D5test51cG16i = global [16 x i32] zeroinitializer, align 16


Instead of array casts if I cast the pointers the result is 
similar, but there is no call to __d_array_cast_len (this is a 
raw cast, so I don't expect much help from the type system 
here...):

     auto va = cast(int4*)a.ptr;
     auto vb = cast(int4*)b.ptr;
     auto vc = cast(int4*)c.ptr;


I'd like to receive an alignment warning in those cast(int4[]), 
or the compiler should not use movdqa in such case. This means 
that maybe __d_array_cast_len should keep and transmit the 
alignment of its input pointer to the pointer in the output 
array. And maybe this means the D front-end should keep an 
alignment information for each pointer (or for pointers that will 
be used in situations where the alignment is important), 
integrating it in its type system (and perhaps perform alignment 
inference like the purity inference done for function templates).

The alignment of pointers is important for the CPU, so maybe the 
type system of a system language should keep track of them, and 
enforce the correctness of the alignments. Maybe this could be 
done with no further annotation burden for the programmer. A 
potential problem is how to mix this alignment inference with the 
needs of separate compilation. I think the align() annotations 
suffice for that.

Bye,
bearophile
Jun 24 2013
next sibling parent "bearophile" <bearophileHUGS lycos.com> writes:
 This is the LL code generated using the -output-ll switch of 
 ldc2 (it's a kind of nearly universal bytecode for llvm):

  _D5test51ayG16i = global [16 x i32] [i32 1, i32 2, i32 3, i32 
 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, 
 i32 13, i32 14, i32 15, i32 16]
  _D5test51byG16i = global [16 x i32] [i32 16, i32 15, i32 14, 
 i32 13, i32 12, i32 11, i32 10, i32 9, i32 8, i32 7, i32 6, i32 
 5, i32 4, i32 3, i32 2, i32 1]
  _D5test51cG16i = global [16 x i32] zeroinitializer


 If I add a "align(16)" annotation to a, b, c it adds the align 
 16 annotation in the LL code too:

  _D5test51ayG16i = global [16 x i32] [i32 1, i32 2, i32 3, i32 
 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, 
 i32 13, i32 14, i32 15, i32 16], align 16
  _D5test51byG16i = global [16 x i32] [i32 16, i32 15, i32 14, 
 i32 13, i32 12, i32 11, i32 10, i32 9, i32 8, i32 7, i32 6, i32 
 5, i32 4, i32 3, i32 2, i32 1], align 16
  _D5test51cG16i = global [16 x i32] zeroinitializer, align 16
I didn't think to also take a look at the asm produced. Even if I don't add align(16) in the code, and even if the LLVM IR doesn't have an alignment annotation, the data in the asm has such alignment: .data .globl __D4test1ayG16i .align 16 __D4test1ayG16i: .long 1 .long 2 .long 3 .long 4 .long 5 .long 6 .long 7 .long 8 .long 9 .long 10 .long 11 .long 12 .long 13 .long 14 .long 15 .long 16 .globl __D4test1byG16i .align 16 __D4test1byG16i: .long 16 .long 15 .long 14 .long 13 .long 12 .long 11 .long 10 .long 9 .long 8 .long 7 .long 6 .long 5 .long 4 .long 3 .long 2 .long 1 Regarding adding the alignment to the D type system, to assure sliced data is aligned, a possible first step is to support: void foo(align(16) double[] a1, align(16) double[] a2) {} That annotation is dynamically tested at entry point. So it's similar to this: void foo(double[] a1, double[] a2) in { assert(isAligned(a1, 16) && isAligned(a2, 16)); } body {} But inside the body of foo() the compiler is allowed to use aligned SIMD instructions (in theory this should be true for the version with asserts too). A more elegant (and maybe better) solution is to verify the alignment statically. This means foo() accepts only 16-aligned arrays. Array cast(), new array, dup, idup, and little else is statically known to produce/return aligned arrays. And for the other slices you need something like makeAligned!N() that returns a statically known alignment. Bye, bearophile
Jul 12 2013
prev sibling next sibling parent reply "bearophile" <bearophileHUGS lycos.com> writes:
The XMM registers I am using are efficient when you feed them 
memory from arrays aligned to 16 bytes, as the D GC produces. But 
the YMM registers used by the AVX/AVX2 instructions prefer an 
alignment of 32 bytes. And the Intel Xeon Phi (MIC) has XMM 
registers that are efficient when the arrays are aligned to 64 
bytes.

When I am not using SIMD code, and I want a small array of little 
elements, like an array of 10 ushorts, having it aligned to 16 
bytes is a waste of space (despite helps the GC reduce the 
fragmentation).

So I have written a small enhancement request, where I suggest 
that arrays for YMM registers could be allocated with an 
alignment of 32 bytes:
http://d.puremagic.com/issues/show_bug.cgi?id=10826

Having the array alignments in the D type system could be useful. 
To be backward-compatible you also need a generic unknown 
alignment (like a void* for alignments), so you can assign arrays 
of any alignment to it, it could be denoted with '0'.

Some rough ideas:


import core.simd: double2, double4;
auto a = new int[10];
static assert(__traits(alignment, a) == 16);
auto b = new int[128]<32>;
static assert(__traits(alignment, b) == 32);
auto c1 = new double2[128];
auto c2 = new double4[64];
static assert(__traits(alignment, c1) == 16);
static assert(__traits(alignment, c2) == 32);

void foo1(int[]<32> a) {
     // Uses YMM registers to modify a
     // ...
}

void foo2(int[] a)
if (__traits(alignment, a) == 32) {
     // Uses YMM registers to modify a
     // ...
}

void foo3(size_t N)(int[]<N> a) {
     static if (N >= 32) {
         // Uses YMM registers to modify a
         // ...
     } else {
         // ...
     }
}


Bye,
bearophile
Aug 15 2013
next sibling parent "bearophile" <bearophileHUGS lycos.com> writes:
 And the Intel Xeon Phi (MIC) has XMM registers that are 
 efficient when the arrays are aligned to 64 bytes.
Sorry, I meant to say ZMM registers. Bye, bearophile
Aug 15 2013
prev sibling parent Manu <turkeyman gmail.com> writes:
On 16 August 2013 04:39, bearophile <bearophileHUGS lycos.com> wrote:

 The XMM registers I am using are efficient when you feed them memory from
 arrays aligned to 16 bytes, as the D GC produces. But the YMM registers
 used by the AVX/AVX2 instructions prefer an alignment of 32 bytes. And the
 Intel Xeon Phi (MIC) has XMM registers that are efficient when the arrays
 are aligned to 64 bytes.

 When I am not using SIMD code, and I want a small array of little
 elements, like an array of 10 ushorts, having it aligned to 16 bytes is a
 waste of space (despite helps the GC reduce the fragmentation).
I'd argue that if you're alloc-ing many instances of 10 short's independently on the heap, then you're probably doing it wrong. If you're only doing it once or twice, then it's not a significant waste of memory as you say. So I have written a small enhancement request, where I suggest that arrays
 for YMM registers could be allocated with an alignment of 32 bytes:
 http://d.puremagic.com/issues/**show_bug.cgi?id=10826<http://d.puremagic.com/issues/show_bug.cgi?id=10826>
This shouldn't be an enhancement request, it should be the rule. __vector()'s should be intrinsically aligned to their sizeof. If they are not, it should be a bug. Having the array alignments in the D type system could be useful. To be
 backward-compatible you also need a generic unknown alignment (like a void*
 for alignments), so you can assign arrays of any alignment to it, it could
 be denoted with '0'.

 Some rough ideas:


 import core.simd: double2, double4;
 auto a = new int[10];
 static assert(__traits(alignment, a) == 16);
 auto b = new int[128]<32>;
 static assert(__traits(alignment, b) == 32);
 auto c1 = new double2[128];
 auto c2 = new double4[64];
 static assert(__traits(alignment, c1) == 16);
 static assert(__traits(alignment, c2) == 32);

 void foo1(int[]<32> a) {
     // Uses YMM registers to modify a
     // ...
 }

 void foo2(int[] a)
 if (__traits(alignment, a) == 32) {
     // Uses YMM registers to modify a
     // ...
 }

 void foo3(size_t N)(int[]<N> a) {
     static if (N >= 32) {
         // Uses YMM registers to modify a
         // ...
     } else {
         // ...
     }
 }
The thing is, a/b/c1/c2 is really just: struct { size_t length; T *ptr; } Those symbol names just refer to the dynamic array struct... It doesn't make a lot of sense to query the alignment of those symbols. __traits(alignment, s) would == sizeof(size_t) every time. I've thought about this sort of thing many times before... but I'm not convinced. I still think it falls over with the possibility of slicing, and separate compilation. The thing you're asking for is alignment of the base of the array, not alignment of elements within the array or alignment of the dynamic array structure its self. Alignment of the base of the array isn't really expressible as part of the type. It's just a request to the allocator. I use a mallocAligned() function in C. Sadly, I think this is one of the mistakes of a discreet 'new' operator, which has a syntax that doesn't lend its self to arbitrary parameters.
Aug 16 2013
prev sibling parent reply Manu <turkeyman gmail.com> writes:
On 24 June 2013 22:03, bearophile <bearophileHUGS lycos.com> wrote:

 Lately I am playing a little with SIMD in D, and with the work in progress
 module std.simd, and I have added several related bug reports to Bugzilla.
Good man! :) This program compiles with no errors nor warnings with ldc2 on Windows32
 bit:


 import core.simd;

 __gshared immutable int[16]
     a = [ 1,  2,  3,  4,  5,  6,  7,  8, 9, 10, 11, 12, 13, 14, 15, 16],
     b = [16, 15, 14, 13, 12, 11, 10,  9, 8,  7,  6,  5,  4,  3,  2,  1];
 __gshared int[16] c;

 void main() {
     auto va = cast(int4[])a;
     auto vb = cast(int4[])b;
     auto vc = cast(int4[])c;
     vc[0] = va[0] + vb[0];
 }


 ldc2 generates this main:

 __Dmain:
     subl    $12, %esp
     movl    $16, 8(%esp)
     movl    $4, 4(%esp)
     movl    $16, (%esp)
     calll   __d_array_cast_len
     movdqa  __D5test51ayG16i, %xmm0
     paddd   __D5test51byG16i, %xmm0
     movdqa  %xmm0, __D5test51cG16i
     xorl    %eax, %eax
     addl    $12, %esp
     ret


 It uses the instruction movdqa, that assumes a,b and c to be aligned to 16
 bytes. But I think there is no guarantee they are.
Those int[16]'s aren't specified as aligned, but I bet LLVM has an optimisation which opportunistically aligns them given some criteria, and then generates code accordingly. This is the LL code generated using the -output-ll switch of ldc2 (it's a
 kind of nearly universal bytecode for llvm):

  _D5test51ayG16i = global [16 x i32] [i32 1, i32 2, i32 3, i32 4, i32 5,
 i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15,
 i32 16]
  _D5test51byG16i = global [16 x i32] [i32 16, i32 15, i32 14, i32 13, i32
 12, i32 11, i32 10, i32 9, i32 8, i32 7, i32 6, i32 5, i32 4, i32 3, i32 2,
 i32 1]
  _D5test51cG16i = global [16 x i32] zeroinitializer


 If I add a "align(16)" annotation to a, b, c it adds the align 16
 annotation in the LL code too:
And so it should, you have now made the alignment explicit. It's not an optimisation anymore, it's an explicit request. _D5test51ayG16i = global [16 x i32] [i32 1, i32 2, i32 3, i32 4, i32 5,
 i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15,
 i32 16], align 16
  _D5test51byG16i = global [16 x i32] [i32 16, i32 15, i32 14, i32 13, i32
 12, i32 11, i32 10, i32 9, i32 8, i32 7, i32 6, i32 5, i32 4, i32 3, i32 2,
 i32 1], align 16
  _D5test51cG16i = global [16 x i32] zeroinitializer, align 16


 Instead of array casts if I cast the pointers the result is similar, but
 there is no call to __d_array_cast_len (this is a raw cast, so I don't
 expect much help from the type system here...):

     auto va = cast(int4*)a.ptr;
     auto vb = cast(int4*)b.ptr;
     auto vc = cast(int4*)c.ptr;


 I'd like to receive an alignment warning in those cast(int4[]), or the
 compiler should not use movdqa in such case.
I suspect that LLVM scheduling a movdqa is permitted only because it has hard evidence during codegen that it is safe to do so. Try the same thing with separate compilation across link units, see if the presumption of alignment goes away? I'm not sure about the warning, it doesn't really sound right to me. Maybe it would be useful... or maybe it would be annoying. A cast is a cast, it is a deliberate act of reinterpretation... the user needs to take some responsibility in this case. Although on the other hand, a cast TO a __vector() (or any aligned type) pointer or array is absolutely useless unless the source pointer is aligned correctly. And in *some* circumstances, the compiler could statically prove this... but in many cases, it probably couldn't. What happens if you do: int[16] a = [ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]; auto va = cast(int4[])a[1..$]; // ?? .. the offset breaks the alignment. (and now it's too short) The compiler doesn't only need to consider the alignment of the source pointer, but also the offset at any given time. This is probably impossible. This means that maybe __d_array_cast_len should keep and transmit the
 alignment of its input pointer to the pointer in the output array. And
 maybe this means the D front-end should keep an alignment information for
 each pointer (or for pointers that will be used in situations where the
 alignment is important), integrating it in its type system (and perhaps
 perform alignment inference like the purity inference done for function
 templates).

 The alignment of pointers is important for the CPU, so maybe the type
 system of a system language should keep track of them, and enforce the
 correctness of the alignments. Maybe this could be done with no further
 annotation burden for the programmer. A potential problem is how to mix
 this alignment inference with the needs of separate compilation. I think
 the align() annotations suffice for that.
Yes, kinda like I was saying above. It might be fine if you are explicit with align() statements, but whenever you are sharing pointers between link units, how can you ever retain the information the compiler needs? I think it's a mistake to make it work in some cases and not others. So for this reason, I maintain that the user just needs to take responsibility for their casts...
Aug 15 2013
parent "bearophile" <bearophileHUGS lycos.com> writes:
Manu:

 I'm not sure about the warning, it doesn't really sound right 
 to me. Maybe
 it would be useful... or maybe it would be annoying.
 A cast is a cast, it is a deliberate act of reinterpretation... 
 the user
 needs to take some responsibility in this case.
OK.
 What happens if you do:
   int[16] a = [ 1,  2,  3,  4,  5,  6,  7,  8, 9, 10, 11, 12, 
 13, 14, 15,
 16];
   auto va = cast(int4[])a[1..$]; // ?? .. the offset breaks the 
 alignment.
 (and now it's too short)
If you use a[16..$] or a[32..$] then the compiler knows statically that the slice has kept the 16 bytes alignment. If you use a[i..$] where i is a run-time value, then the compiler assumes the worst, that a is aligned to 4 bytes. A kind of cast could be used to force the compiler assume a certain pointer/array has a certain alignment (better languages could allow you to write a proof of such alignment, but this is not possible in D, unless you add something like the Microsoft Z3 solver inside the compiler).
 Yes, kinda like I was saying above. It might be fine if you are 
 explicit
 with align() statements, but whenever you are sharing pointers 
 between link
 units, how can you ever retain the information the compiler 
 needs?
The title of this thread is "Pointer alignments in the type system", that means that the static information of the alignment is kept with the type of the pointers/arrays (including dynamic arrays). In D you can use int[3] across link units because the length is part of the static information of the array. To make things back-ward compatible you need a "don't know" alignment, that is like the "void*" type for pointers, or maybe for the minimum alignment you could just use the natural alignment of the type. See some my successive posts for some examples.
 I think it's a mistake to make it work in some cases and not 
 others.
I agree. My proposal is to make them always work (unless you destroy the static information with a cast or a runtime-valued slicing). Runtime functions like "idup" etc are supposed to keep and "transmit" the alignment information.
 So for
 this reason, I maintain that the user just needs to take 
 responsibility for their casts...
In D we have a good static type system, unlike for example in Python2. One of the points of a static type system is to keep some static information, to use it to enforce certain invariants statically, and avoid some bugs and offer performance. SIMD programming works more efficiently with certain alignments. In my opinion it's not too much hard to retain such alignment information and avoid certain bugs statically, and produce fast code. Having a static type system and not using it is a waste. This means this a will has a standard aliment of 16: auto a = new int[128]; static assert(__traits(alignment, a) == 16); this b has a statically knwn alignment of 4: auto b = a[i .. $]; static assert(__traits(alignment, b) == 4); auto b2 = cast(align(16))b; static assert(__traits(alignment, b2) == 16); This c has a statically known alignment of 16: auto c = a.dup; static assert(__traits(alignment, c) == 16); auto d = new align(32) int[128]; // A possible syntax. static assert(__traits(alignment, d) == 32); // Other possible syntax: auto d2 = new int[128] align(32); auto d3 = new int[128]<32>; auto e = d[64..$]; static assert(__traits(alignment, e) == 32); align(32) int[128] f; static assert(__traits(alignment, f) == 32); Bye, bearophile
Aug 16 2013