D.gnu - Improving codegen for ARM Cortex-M
- Mike Franklin (154/154) Jul 20 2018 I've finally succeeded in getting a build of my STM32 ARM
- Mike Franklin (99/99) Jul 20 2018 Actually the assembly output from objdump isn't quite accurate.
- Mike Franklin (7/46) Jul 20 2018 Gah. Sorry folks. I keep screwing up. I can see above that
- Mike Franklin (7/9) Jul 20 2018 Just to follow up, after I enabled `-funroll-loops` for GDC, it
I've finally succeeded in getting a build of my STM32 ARM Cortex-M proof of concept in LDC and GDC, thanks to the recent changes in both compilers. So, I now have a way to compare code generation between the two compilers.

The project is extremely simple; it just generates a bunch of random rectangles on its small LCD screen. This is done by simply writing to memory in a frame buffer.

Unfortunately, GDC's code executes quite a bit slower than LDC's code. The difference is quite noticeable: the status LED blinks visibly slower with GDC than with LDC.

The code to do this is below. (I simplified it for this discussion, but tested to ensure it still reproduces the symptoms. I also did away with the random behavior to remove that variable.)

a block of code in main.d
---
    uint i = 0;
    while(true)
    {
        lcd.fillRect(x, y, width, height, color);
        if ((i % 1000) == 0)
        {
            statusLED.toggle();
        }
        i++;
    }

in lcd.d
---
    @noinline pragma(inline, false) void fillRect(int x, int y, uint width, uint height, ushort color)
    {
        int y2 = y + height;
        for(int _y = y; _y <= y2; _y++)
        {
            ltdc.fillSpan(x, _y, width, color);
        }
    }

from ltdc.d
-----------
    void fillSpan(int x, int y, uint spanWidth, ushort color)
    {
        int start = y * width + x;
        for(int i = 0; i < spanWidth; i++)
        {
            frameBuffer[start + i] = color;
        }
    }

LDC disassembly
---------------
ldc2 -conf= -disable-simplify-libcalls -c -Os -mtriple=thumb-none-eabi -float-abi=hard -mcpu=cortex-m4 -Isource/runtime -boundscheck=off

    <_D5board3lcd8fillRectFiikktZv>:
     80000b8: e92d 43f0  stmdb sp!, {r4, r5, r6, r7, r8, r9, lr}
     80000bc: eb03 0e01  add.w lr, r3, r1
     80000c0: 459e       cmp lr, r3
     80000c2: bfb8       it lt
     80000c4: e8bd 83f0  ldmialt.w sp!, {r4, r5, r6, r7, r8, r9, pc}
     80000e0: 1a54       subs r4, r2, r1
     80000ec: b1f2       cbz r2, 800012c <_D5board3lcd8fillRectFiikktZv+0x74>
     80000f4: d30a       bcc.n 800010c <_D5board3lcd8fillRectFiikktZv+0x54>
     80000f6: 463e       mov r6, r7
     8000108: 42ac       cmp r4, r5
     800010a: d1f5       bne.n 80000f8 <_D5board3lcd8fillRectFiikktZv+0x40>
     800010c: b171       cbz r1, 800012c <_D5board3lcd8fillRectFiikktZv+0x74>
     8000118: 4435       add r5, r6
     800011e: d005       beq.n 800012c <_D5board3lcd8fillRectFiikktZv+0x74>
     8000128: bf18       it ne
     8000132: 4573       cmp r3, lr
     8000134: ddda       ble.n 80000ec <_D5board3lcd8fillRectFiikktZv+0x34>
     8000136: e8bd 83f0  ldmia.w sp!, {r4, r5, r6, r7, r8, r9, pc}

GDC disassembly
---------------
arm-none-eabi-gdc -c -O2 -nophoboslib -nostdinc -nodefaultlibs -nostdlib -mthumb -mcpu=cortex-m4 -mtune=cortex-m4 -mfloat-abi=hard -Isource/runtime -fno-bounds-check -ffunction-sections -fdata-sections -fno-weak

    <_D5board3lcd8fillRectFiikktZv>:
     800049c: b470       push {r4, r5, r6}
     800049e: 440b       add r3, r1
     80004a0: 4299       cmp r1, r3
     80004a6: dc15       bgt.n 80004d4 <_D5board3lcd8fillRectFiikktZv+0x38>
     <_D5board3lcd8fillRectFiikktZv+0x3c>)
     80004b2: 4410       add r0, r2
     80004be: b122       cbz r2, 80004ca <_D5board3lcd8fillRectFiikktZv+0x2e>
     80004c0: 19a0       adds r0, r4, r6
     80004c6: 42a0       cmp r0, r4
     80004c8: d1fb       bne.n 80004c2 <_D5board3lcd8fillRectFiikktZv+0x26>
     80004cc: 428b       cmp r3, r1
     80004d2: daf4       bge.n 80004be <_D5board3lcd8fillRectFiikktZv+0x22>
     80004d4: bc70       pop {r4, r5, r6}
     80004d6: 4770       bx lr
     80004d8: 20000000   .word 0x20000000

For both LDC and GDC, `fillSpan` gets inlined into `fillRect`. I had to disable inlining for `fillRect` to make it easier to compare the disassembly; otherwise all I get is a huge `main`. I used `-O2` for GDC because `-Os` was even slower and didn't inline `fillSpan`.

Although GDC's code is shorter, LDC's code is faster. My guess is that this is due to the `ldm` and `stm` instructions in the LDC disassembly (load multiple and store multiple), but I'm not sure.

I've tried a number of different optimization permutations (too many to list here), but they didn't seem to make any difference.

I ask for any insight you might have, should you wish to give this your attention. Regardless, I'll keep investigating.

Thanks,
Mike
Jul 20 2018
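For readers who want to experiment with the inner loop from the post above on a host machine, here is a minimal sketch of `fillSpan` written two ways: as the element-by-element loop from the post, and as a D slice assignment. The names `screenWidth`, `screenHeight`, `fillSpanLoop` and `fillSpanSlice` are hypothetical, and the 320 x 240 geometry is an assumption; the thread only shows (via the mangled symbol `frameBufferG76800t`) that the buffer holds 76800 `ushort` elements. Whether either form produces better code on a Cortex-M4 is not something the thread measures.

```d
// Hypothetical stand-ins for the board definitions. The geometry is an
// assumption: only the element count (76800 = 320 * 240) appears in the thread.
enum uint screenWidth = 320;
enum uint screenHeight = 240;
__gshared ushort[screenWidth * screenHeight] frameBuffer;

// Element-by-element store, following the loop in the post.
void fillSpanLoop(int x, int y, uint spanWidth, ushort color)
{
    immutable uint start = y * screenWidth + x;
    foreach (i; 0 .. spanWidth)
        frameBuffer[start + i] = color;
}

// The same fill expressed as a slice assignment.
void fillSpanSlice(int x, int y, uint spanWidth, ushort color)
{
    immutable uint start = y * screenWidth + x;
    frameBuffer[start .. start + spanWidth] = color;
}

void main()
{
    fillSpanLoop(10, 2, 5, 0x1234);   // row 2, loop version
    fillSpanSlice(10, 3, 5, 0x1234);  // row 3, slice version

    // Both rows should now hold identical contents.
    assert(frameBuffer[2 * screenWidth .. 3 * screenWidth]
        == frameBuffer[3 * screenWidth .. 4 * screenWidth]);
    assert(frameBuffer[2 * screenWidth + 10] == 0x1234);
    assert(frameBuffer[2 * screenWidth + 9] == 0);
    assert(frameBuffer[2 * screenWidth + 15] == 0);
}
```

Compiling both functions with each compiler and comparing the assembly is one way to see how each back end treats the two forms.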
Actually the assembly output from objdump isn't quite accurate. Here's the generated assembly from the compiler.

LDC
---
ldc2 -conf= -disable-simplify-libcalls -c -Os -mtriple=thumb-none-eabi -float-abi=hard -mcpu=cortex-m4 -Isource/runtime -boundscheck=off

    _D5board3lcd8fillRectFiikktZv:
        .fnstart
        .save {r4, r5, r6, r7, r8, r9, lr}
        push.w {r4, r5, r6, r7, r8, r9, lr}
        add.w lr, r3, r1
        cmp lr, r3
        it lt
        poplt.w {r4, r5, r6, r7, r8, r9, pc}
        movw r8, :lower16:_D5board4ltdc11frameBufferG76800t
        movt r8, :upper16:_D5board4ltdc11frameBufferG76800t
        subs r4, r2, r1
    .LBB1_1:
        cbz r2, .LBB1_8
        blo .LBB1_5
        mov r6, r7
    .LBB1_4:
        strh r0, [r6]
        cmp r4, r5
        bne .LBB1_4
    .LBB1_5:
        cbz r1, .LBB1_8
        add r5, r6
        beq .LBB1_8
        it ne
    .LBB1_8:
        cmp r3, lr
        ble .LBB1_1
        pop.w {r4, r5, r6, r7, r8, r9, pc}

GDC
---
arm-none-eabi-gdc -c -O2 -nophoboslib -nostdinc -nodefaultlibs -nostdlib -mthumb -mcpu=cortex-m4 -mtune=cortex-m4 -mfloat-abi=hard -Isource/runtime -fno-bounds-check -ffunction-sections -fdata-sections -fno-weak

    _D5board3lcd8fillRectFiikktZv:
        .fnstart
    .LFB4:
        @ args = 4, pretend = 0, frame = 0
        @ frame_needed = 0, uses_anonymous_args = 0
        @ link register save eliminated.
        push {r4, r5, r6}
        add r3, r3, r1
        cmp r1, r3
        bgt .L47
        ldr r4, .L58
        add r0, r0, r2
    .L51:
        cbz r2, .L49
        adds r0, r4, r6
    .L50:
        cmp r0, r4
        bne .L50
    .L49:
        cmp r3, r1
        bge .L51
    .L47:
        pop {r4, r5, r6}
        bx lr

Mike
Jul 20 2018
On Friday, 20 July 2018 at 12:49:59 UTC, Mike Franklin wrote:
> GDC
> ---
> arm-none-eabi-gdc -c -O2 -nophoboslib -nostdinc -nodefaultlibs -nostdlib -mthumb -mcpu=cortex-m4 -mtune=cortex-m4 -mfloat-abi=hard -Isource/runtime -fno-bounds-check -ffunction-sections -fdata-sections -fno-weak
>
>     _D5board3lcd8fillRectFiikktZv:
>         .fnstart
>     .LFB4:
>         @ args = 4, pretend = 0, frame = 0
>         @ frame_needed = 0, uses_anonymous_args = 0
>         @ link register save eliminated.
>         push {r4, r5, r6}
>         add r3, r3, r1
>         cmp r1, r3
>         bgt .L47
>         ldr r4, .L58
>         add r0, r0, r2
>     .L51:
>         cbz r2, .L49
>         adds r0, r4, r6
>     .L50:
>         cmp r0, r4
>         bne .L50
>     .L49:
>         cmp r3, r1
>         bge .L51
>     .L47:
>         pop {r4, r5, r6}
>         bx lr

Gah. Sorry folks. I keep screwing up. I can see above that the `fillSpan` function is not being inlined. I must be doing something wrong. Please ignore this thread.

Sorry,
Mike
Jul 20 2018
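Since the post above turns on whether `fillSpan` is inlined into `fillRect`, here is a minimal host-side sketch of how inlining can be requested explicitly in D with `pragma(inline, true)`, next to the `pragma(inline, false)` already used in the thread. The D specification describes `pragma(inline, true)` as a request to always inline, and a compiler may report an error when it cannot. The constant `screenWidth` and the 320-pixel row length are assumptions; the function bodies follow the thread's code, including the inclusive `<=` bound in `fillRect`, which writes `height + 1` rows.

```d
// Assumption: 320 pixels per row. Only the element count (76800)
// appears in the thread, via the mangled frameBuffer symbol.
enum uint screenWidth = 320;
__gshared ushort[76800] frameBuffer;

// Ask the compiler to inline this function at every call site.
pragma(inline, true)
void fillSpan(int x, int y, uint spanWidth, ushort color)
{
    immutable uint start = y * screenWidth + x;
    foreach (i; 0 .. spanWidth)
        frameBuffer[start + i] = color;
}

// Keep this one out of line so its disassembly is easy to find.
pragma(inline, false)
void fillRect(int x, int y, uint width, uint height, ushort color)
{
    immutable int y2 = y + height; // inclusive bound, as in the post
    for (int row = y; row <= y2; row++)
        fillSpan(x, row, width, color);
}

void main()
{
    fillRect(2, 1, 3, 2, 0xFFFF);

    // Rows 1..3 and columns 2..4 are filled; neighbours stay zero.
    assert(frameBuffer[1 * screenWidth + 2] == 0xFFFF);
    assert(frameBuffer[3 * screenWidth + 4] == 0xFFFF);
    assert(frameBuffer[4 * screenWidth + 2] == 0);
    assert(frameBuffer[1 * screenWidth + 5] == 0);
}
```

Checking the generated assembly for a call to `fillSpan` inside `fillRect` remains the reliable way to confirm what a given compiler and flag set actually did.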
On Friday, 20 July 2018 at 11:11:12 UTC, Mike Franklin wrote:
> I ask for any insight you might have, should you wish to give this your attention. Regardless, I'll keep investigating.

Just to follow up, after I enabled `-funroll-loops` for GDC, it was almost twice as fast as LDC, though the code size was a little larger.

Bottom line is: I just need to learn the compilers better (both of them) and learn how to tune them for the application.

Mike
Jul 20 2018
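To illustrate the speed versus size trade-off mentioned in the follow-up above, here is a hand-written sketch of what unrolling the span loop by four looks like at the source level. It is only an illustration: the unroll factor and the exact shape of the code that `-funroll-loops` generates are chosen by the compiler and are not shown in the thread. The function name `fillSpanUnrolled` and its slice-based signature are hypothetical.

```d
// Manually unrolled fill: four stores per iteration, then a remainder loop.
// Fewer compare-and-branch steps per pixel, at the cost of more code.
void fillSpanUnrolled(ushort[] buf, size_t start, size_t n, ushort color)
{
    size_t i = 0;

    // Main body: handle four elements per loop iteration.
    for (; i + 4 <= n; i += 4)
    {
        buf[start + i]     = color;
        buf[start + i + 1] = color;
        buf[start + i + 2] = color;
        buf[start + i + 3] = color;
    }

    // Remainder: at most three leftover elements.
    for (; i < n; i++)
        buf[start + i] = color;
}

void main()
{
    auto buf = new ushort[16];
    fillSpanUnrolled(buf, 3, 10, 0xABCD);

    foreach (k; 0 .. 16)
    {
        immutable ushort expected = (k >= 3 && k < 13) ? 0xABCD : 0;
        assert(buf[k] == expected);
    }
}
```

The duplicated stores are where the "a little larger" code size comes from; the reduced loop overhead is the usual source of the speedup.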