www.digitalmars.com         C & C++   DMDScript  

digitalmars.D.ldc - Android/ARM codegen

reply "Joakim" <dlang joakim.fea.st> writes:
Alright, I've been stepping through some failing tests: one that 
seems to be bad codegen is that many calls to 
std.algorithm.iteration.map seem to fail, for example, when 
running the tests for std.zip.  One of the std.zip tests calls 
std.random.uniform, which then gets its second parameter stomped 
by map from rndGen().  I've tracked it down to this function in 
the llvm IR:

define weak_odr void 
 _D3std9algorithm9iteration47__T3mapS363std6random6rndGenFNcNdNfZ9__lambda4Z42__T3mapTS3std5range13__T6RepeatTiZ6RepeatZ3mapMFNaNbNiNfS3std5range13__T6RepeatTiZ6RepeatZS3std9algorithm9iteration87__T9MapResultS363std6random6rndGenFNcNdNfZ9__lambda4TS3std5range13__T6RepeatTiZ6RepeatZ9MapResult(%"std.random.rndGen.M
pResult!(__lambda4, Repeat!int).MapResult"* noalias nocapture sret %.sret_arg,
i8* %.nest_arg, %"std.range.Repeat!int.Repeat"* byval nocapture readonly

{

   %.structliteral = alloca 
%"std.random.rndGen.MapResult!(__lambda4, Repeat!int).MapResult", 
align 4 ; [#uses = 3 type = 
%"std.random.rndGen.MapResult!(__lambda4, Repeat!int).MapResult"*]

   %1 = getelementptr inbounds 
%"std.random.rndGen.MapResult!(__lambda4, Repeat!int).MapResult"* 
%.structliteral, i32 0, i32 0, i32 0 ; [#uses = 1 type = i32*]

   store i32 0, i32* %1, align 4

   %2 = getelementptr %"std.random.rndGen.MapResult!(__lambda4, 
Repeat!int).MapResult"* %.structliteral, i32 0, i32 1 ; [#uses = 
1 type = i8**]

   store i8* %.nest_arg, i8** %2, align 4

   %tmp = call %"std.random.rndGen.MapResult!(__lambda4, 
Repeat!int).MapResult"* 
 _D3std9algorithm9iteration87__T9MapResultS363std6random6rndGenFNcNdNfZ9__lambda4TS3std5range13__T6RepeatTiZ6RepeatZ9MapResult6__ctorMFNaNbNcNiNfS3std5range13__T6RepeatTiZ6RepeatZS3std9algorithm9iteration87__T9MapResultS363std6random6rndGenFNcNdNfZ9__lambda4TS3std5range13__T6RepeatTiZ6RepeatZ9MapResult(%"std.random.rndGen.M
pResult!(__lambda4, Repeat!int).MapResult"* returned %.structliteral,
%"std.range.Repeat!int.Repeat"* byval %r_arg) ; [#uses = 1 type =
%"std.random.rndGen.MapResult!(__lambda4, Repeat!int).MapResult"*]

   %3 = bitcast %"std.random.rndGen.MapResult!(__lambda4, 
Repeat!int).MapResult"* %tmp to i64* ; [#uses = 1 type = i64*]

   %4 = bitcast %"std.random.rndGen.MapResult!(__lambda4, 
Repeat!int).MapResult"* %.sret_arg to i64* ; [#uses = 1 type = 
i64*]

   %5 = load i64* %3, align 1                      ; [#uses = 1 
type = i64]

   store i64 %5, i64* %4, align 4

   ret void
}

which gets translated to the following ARM assembly:

_D3std9algorithm9iteration47__T3mapS363std6random6rndGenFNcNdNfZ9__lambda4Z42__T3mapTS3std5range13__T6RepeatTiZ6RepeatZ3mapMFNaNbNiNfS3std5range13__T6RepeatTiZ6RepeatZS3std9algorithm9iteration87__T9MapResultS363std6random6rndGenFNcNdNfZ9__lambda4TS3std5range13__T6RepeatTiZ6RepeatZ9MapResult:
         .fnstart
.Leh_func_begin94:


         .save   {r4, lr}
         push    {r4, lr}


         mov     r4, r0


         stmib   sp, {r0, r1}


         bl      
_D3std9algorithm9iteration87__T9MapResultS363std6random6rndGenFNcNdNfZ9__lambda4TS3std5range13__T6RepeatTiZ6RepeatZ9MapResult6__ctorMFNaNbNcNiNfS3std5range13__T6RepeatTiZ6RepeatZS3std9algorithm9iteration87__T9MapResultS363std6random6rndGenFNcNdNfZ9__lambda4TS3std5range13__T6RepeatTiZ6RepeatZ9MapResult(PLT)
         ldmib   sp, {r0, r1}
         stm     r4, {r0, r1}

         pop     {r4, lr}

         bx      lr

The problem appears to be that even though r4 is saved at the 
beginning of the function, it is overwritten by r1 in the stmib 
instruction afterwards.  Obviously there's no point in pushing r4 
and popping it at the end of the function, if you've lost it by 

the stack pointer by 8, then stmib increments it 4 _before_ 
storing r0 then r1, overwriting the contents of r4 saved at the 
beginning of the function.  Other calls to map also seem to fail 
in other instances of the same template function, but 
interestingly in different ways, ie no uses of stmib there.  I 
haven't tracked down exactly why those other instances fail.

I'm not sure how to reduce this further and if I need to file a 
bug for llvm: any pointers?
Jul 16 2015
next sibling parent reply Dan Olson <gorox comcast.net> writes:
"Joakim" <dlang joakim.fea.st> writes:
 The problem appears to be that even though r4 is saved at the
 beginning of the function, it is overwritten by r1 in the stmib
 instruction afterwards.  Obviously there's no point in pushing r4 and
 popping it at the end of the function, if you've lost it by

 stack pointer by 8, then stmib increments it 4 _before_ storing r0
 then r1, overwriting the contents of r4 saved at the beginning of the
 function.  Other calls to map also seem to fail in other instances of
 the same template function, but interestingly in different ways, ie no
 uses of stmib there.  I haven't tracked down exactly why those other
 instances fail.
That doesn't look right. What is your triple and other options (optimization), llvm version your patched src is based on, and ldc branch (merge-2.067 branch I think)? I'd like to compare the asm snippet with what I get get for thumbv7-apple-ios, because it should be similar, but it passes std.zip. It could be I just haven't stumbled into the combination you are using. -- Dan
Jul 16 2015
parent reply "Joakim" <dlang joakim.fea.st> writes:
On Thursday, 16 July 2015 at 16:40:31 UTC, Dan Olson wrote:
 "Joakim" <dlang joakim.fea.st> writes:
 The problem appears to be that even though r4 is saved at the 
 beginning of the function, it is overwritten by r1 in the 
 stmib instruction afterwards.  Obviously there's no point in 
 pushing r4 and popping it at the end of the function, if 
 you've lost it by overwriting in between.  Specifically, "sub 

 increments it 4 _before_ storing r0 then r1, overwriting the 
 contents of r4 saved at the beginning of the function.  Other 
 calls to map also seem to fail in other instances of the same 
 template function, but interestingly in different ways, ie no 
 uses of stmib there.  I haven't tracked down exactly why those 
 other instances fail.
That doesn't look right. What is your triple and other options (optimization), llvm version your patched src is based on, and ldc branch (merge-2.067 branch I think)? I'd like to compare the asm snippet with what I get get for thumbv7-apple-ios, because it should be similar, but it passes std.zip. It could be I just haven't stumbled into the combination you are using.
llvm triple and other flags used: --output-o -w -d -mtriple=armv7-none-linux-androideabi -relocation-model=pic -O3 -release -unittest ldc was compiled against a locally-compiled llvm 3.6 from the Android NDK repo, which has some modifications: https://android.googlesource.com/toolchain/llvm/+log/release_36 Yes, I'm using the merge-2.067 branch of ldc as of commit 122ea372d from a couple weeks ago, with inlining turned off, as noted in the earlier EH thread. I would be curious to see what llvm IR and ARM assembly is generated for you on iOS and what other flags you're using. The actual function shown is from the std.random module.
Jul 16 2015
parent "Dan Olson" <zans4cans yahoo.com> writes:
On Thursday, 16 July 2015 at 17:07:33 UTC, Joakim wrote:
 On Thursday, 16 July 2015 at 16:40:31 UTC, Dan Olson wrote:
 [...]
llvm triple and other flags used: --output-o -w -d -mtriple=armv7-none-linux-androideabi -relocation-model=pic -O3 -release -unittest ldc was compiled against a locally-compiled llvm 3.6 from the Android NDK repo, which has some modifications: https://android.googlesource.com/toolchain/llvm/+log/release_36 Yes, I'm using the merge-2.067 branch of ldc as of commit 122ea372d from a couple weeks ago, with inlining turned off, as noted in the earlier EH thread. I would be curious to see what llvm IR and ARM assembly is generated for you on iOS and what other flags you're using. The actual function shown is from the std.random module.
And the nice thing about LLVM / LDC is that I can specify your triple and other options with one compiler. I love that about LLVM: all cross compilers in one.
Jul 16 2015
prev sibling parent reply Dan Olson <gorox comcast.net> writes:
"Joakim" <dlang joakim.fea.st> writes:

 Alright, I've been stepping through some failing tests: one that seems
 to be bad codegen is that many calls to std.algorithm.iteration.map
 seem to fail, for example, when running the tests for std.zip.  One of
 the std.zip tests calls std.random.uniform, which then gets its second
 parameter stomped by map from rndGen().
Joakim, I have a hunch. Can you try changing your abi-android.cpp so passByVal() returns false always? That is the only difference in our generated IR and in my experience byval causes incorrect code on ARM. As an experiment, I changed my passByVal() back to return true for Tstruct (the LDC default), and then my IR is identical to yours and my ARM code, though slightly different, also is clobbering a saved reg on the stack. Anyway, I pasted my GOOD assembly below yours for comparison.
 which gets translated to the following ARM assembly:

 _D3std9algorithm9iteration47__T3mapS363std6random6rndGenFNcNdNfZ9__lambda4Z42__T3mapTS3std5range13__T6RepeatTiZ6RepeatZ3mapMFNaNbNiNfS3std5range13__T6RepeatTiZ6RepeatZS3std9algorithm9iteration87__T9MapResultS363std6random6rndGenFNcNdNfZ9__lambda4TS3std5range13__T6RepeatTiZ6RepeatZ9MapResult:
         .fnstart
 .Leh_func_begin94:


         .save   {r4, lr}
         push    {r4, lr}


         mov     r4, r0


         stmib   sp, {r0, r1}


         bl
 _D3std9algorithm9iteration87__T9MapResultS363std6random6rndGenFNcNdNfZ9__lambda4TS3std5range13__T6RepeatTiZ6RepeatZ9MapResult6__ctorMFNaNbNcNiNfS3std5range13__T6RepeatTiZ6RepeatZS3std9algorithm9iteration87__T9MapResultS363std6random6rndGenFNcNdNfZ9__lambda4TS3std5range13__T6RepeatTiZ6RepeatZ9MapResult(PLT)
         ldmib   sp, {r0, r1}
         stm     r4, {r0, r1}

         pop     {r4, lr}

         bx      lr
Better assembly produced with -output-s -w -d -mtriple=armv7-apple-ios -relocation-model=pic -O3 -release -unittest -disable-inlining (needed to match up with your change). It is similar but uses stm (e.g stmia) instead of stmib. __D3std9algorithm9iteration47__T3mapS363std6random6rndGenFNcNdNfZ9__lambda4Z42__T3mapTS3std5range13__T6RepeatTiZ6RepeatZ3mapMFNaNbNiNfS3std5range13__T6RepeatTiZ6RepeatZS3std9algorithm9iteration87__T9MapResultS363std6random6rndGenFNcNdNfZ9__lambda4TS3std5range13__T6RepeatTiZ6RepeatZ9MapResult: push {r4, r7, lr} mov r4, r0 stm sp, {r0, r1} mov r0, sp mov r1, r2 bl __D3std9algorithm9iteration87__T9MapResultS363std6random6rndGenFNcNdNfZ9__lambda4TS3std5range13__T6RepeatTiZ6RepeatZ9MapResult6__ctorMFNaNbNcNiNfS3std5range13__T6RepeatTiZ6RepeatZS3std9algorithm9iteration87__T9MapResultS363std6random6rndGenFNcNdNfZ9__lambda4TS3std5range13__T6RepeatTiZ6RepeatZ9MapResult ldm sp, {r0, r1} strd r0, r1, [r4] pop {r4, r7, pc} I wanted to use your triple with OS of android, but std/random.d needs Android specific D code from elsewhere, so compile failed. As aside, with inlining enabled, that function shrinks to 3 instructions: __D3std9algorithm9iteration47__T3mapS363std6random6rndGenFNcNdNfZ9__lambda4Z42__T3mapTS3std5range13__T6RepeatTiZ6RepeatZ3mapMFNaNbNiNfS3std5range13__T6RepeatTiZ6RepeatZS3std9algorithm9iteration87__T9MapResultS363std6random6rndGenFNcNdNfZ9__lambda4TS3std5range13__T6RepeatTiZ6RepeatZ9MapResult: mov r3, r1 strd r2, r3, [r0] bx lr Hope it helps
Jul 17 2015
parent reply "Joakim" <dlang joakim.fea.st> writes:
On Friday, 17 July 2015 at 08:58:30 UTC, Dan Olson wrote:
 "Joakim" <dlang joakim.fea.st> writes:

 Alright, I've been stepping through some failing tests: one 
 that seems to be bad codegen is that many calls to 
 std.algorithm.iteration.map seem to fail, for example, when 
 running the tests for std.zip.  One of the std.zip tests calls 
 std.random.uniform, which then gets its second parameter 
 stomped by map from rndGen().
Joakim, I have a hunch. Can you try changing your abi-android.cpp so passByVal() returns false always? That is the only difference in our generated IR and in my experience byval causes incorrect code on ARM. As an experiment, I changed my passByVal() back to return true for Tstruct (the LDC default), and then my IR is identical to yours and my ARM code, though slightly different, also is clobbering a saved reg on the stack.
Yep, that fixed it, std.zip passes its tests now. :) I'll run the rest of the tests and report back, thanks for your help.
Jul 17 2015
parent reply Dan Olson <gorox comcast.net> writes:
"Joakim" <dlang joakim.fea.st> writes:
 Yep, that fixed it, std.zip passes its tests now. :) I'll run the rest
 of the tests and report back, thanks for your help.
:-)
Jul 17 2015
parent reply "Joakim" <dlang joakim.fea.st> writes:
On Friday, 17 July 2015 at 16:33:07 UTC, Dan Olson wrote:
 "Joakim" <dlang joakim.fea.st> writes:
 Yep, that fixed it, std.zip passes its tests now. :) I'll run 
 the rest of the tests and report back, thanks for your help.
:-)
Around 30 more modules from std.phobos now pass their tests with that change, most of which used to segfault somewhere in a test before. core.time from druntime also passes all its tests now. That one change made a big difference, thanks for pointing it out. Now to fix the rest. Many are related to the real type, which I haven't patched for 64-bit yet.
Jul 17 2015
parent reply "Joakim" <dlang joakim.fea.st> writes:
On Friday, 17 July 2015 at 17:57:01 UTC, Joakim wrote:
 On Friday, 17 July 2015 at 16:33:07 UTC, Dan Olson wrote:
 "Joakim" <dlang joakim.fea.st> writes:
 Yep, that fixed it, std.zip passes its tests now. :) I'll run 
 the rest of the tests and report back, thanks for your help.
:-)
Around 30 more modules from std.phobos now pass their tests with that change, most of which used to segfault somewhere in a test before. core.time from druntime also passes all its tests now. That one change made a big difference, thanks for pointing it out. Now to fix the rest. Many are related to the real type, which I haven't patched for 64-bit yet.
Dan, as I said in the main forum, most of the druntime/phobos modules' tests pass on Android/ARM now. However, I had to turn off optimizations for a handful of modules, have you had to do the same? As noted before, one optimization pass was screwing up ldc.eh. I also had to turn off all optimizations, ie -O0, for std.random and std.stream to get their unit tests to pass. For one phobos module, std.regex, turning off all optimizations for druntime's core.memory got the regex tests to pass. Other than those four modules, everything is compiled with -O3 and seems to work, except for the two modules that still segfault, std.net.isemail and std.regex.internal.tests, where compiling those modules with -O0 doesn't make a difference. I haven't spent any time tracking down if other optimized modules might be causing those two to segfault, as was the case with std.regex and core.memory, or exactly which llvm optimizations are causing problems with core.memory, std.random, and std.stream. Are you seeing similar results with your 2.067 branch of ldc with iOS? Since ARM codegen should be similar for the two, I wonder if I'm the only one seeing this.
Aug 03 2015
parent Dan Olson <gorox comcast.net> writes:
"Joakim" <dlang joakim.fea.st> writes:

 On Friday, 17 July 2015 at 17:57:01 UTC, Joakim wrote:
 On Friday, 17 July 2015 at 16:33:07 UTC, Dan Olson wrote:
 "Joakim" <dlang joakim.fea.st> writes:
Dan, as I said in the main forum, most of the druntime/phobos modules' tests pass on Android/ARM now. However, I had to turn off optimizations for a handful of modules, have you had to do the same? As noted before, one optimization pass was screwing up ldc.eh. I also had to turn off all optimizations, ie -O0, for std.random and std.stream to get their unit tests to pass. For one phobos module, std.regex, turning off all optimizations for druntime's core.memory got the regex tests to pass. Other than those four modules, everything is compiled with -O3 and seems to work, except for the two modules that still segfault, std.net.isemail and std.regex.internal.tests, where compiling those modules with -O0 doesn't make a difference. I haven't spent any time tracking down if other optimized modules might be causing those two to segfault, as was the case with std.regex and core.memory, or exactly which llvm optimizations are causing problems with core.memory, std.random, and std.stream. Are you seeing similar results with your 2.067 branch of ldc with iOS? Since ARM codegen should be similar for the two, I wonder if I'm the only one seeing this.
Hi Joakim - I have -O3 optimization on for all modules in the release build and I think I have tested with most of the other -O levels. I did run into an alignment error with neon instructions in std.random unittest with LLVM 3.5.1 and 0.15.1 (2.066) and eventually disabled neon (-mattr=-neon) during optimization as a workaround. I have not tried reenabling neon for merge-2.067 and LLVM 3.6. The problem was a vst1.64 instruction requesting 128-bit alignment when the data was only 64-bit aligned: 0x52197a: vst1.64 {d16, d17}, [r5:128] // r5 addr not properly aligned If it can happen in std.random it could happen elsewhere. You might try -mattr=-neon and see what happens. One other difference to think about is that Android is AAPCS and iOS is a variant of the older APCS. LLVM has some different paths for these. I studied it some a few weeks ago as I made the extern(C) ABI compatible with clang. -- Dan
Aug 03 2015