digitalmars.D.learn - dmd simple loop disassembly - redundant instruction?
- Ivan Kazmenko (65/65) Dec 25 2013 Hello,
- Chris Cain (14/22) Dec 25 2013 Did you try something like:
- Ivan Kazmenko (3/15) Dec 25 2013 Thanks, that sounded reasonable. Still, in this particular case,
- bearophile (17/42) Dec 25 2013 ldc2 optimizes the useless loop away:
- Ivan Kazmenko (6/20) Dec 25 2013 Glad to know that! But what about DMD? Anyone?..
- Lionello Lunesu (7/14) Dec 26 2013 You should have said that all instructions are redundant :) Looks like
- Ivan Kazmenko (8/15) Dec 26 2013 I added the "return a[7];" part from bearophile's suggestion, but
Hello,

I am studying the difference between the x86 code generated by DMD and by C/C++ compilers on Windows (simply put: why exactly, and by what margin, DMD-compiled D code is often slower than the GCC-compiled C/C++ equivalent). Now, I have this simple D program:

-----
immutable int MAX_N = 1_000_000;
void main () {
    int [MAX_N] a;
    foreach (i; 0..MAX_N)
        a[i] = i;
}
-----

(I know there's iota in std.range, and it turns out to be even slower - but that's a high-level function, and I'm trying to understand the lower-level details now.)

The assembly (dmd -O -release -inline -noboundscheck, then obj2asm) has the following piece corresponding to the loop:

-----
L2C:
    mov -03D0900h[EDX*4][EBP],EDX
    mov ECX,EDX
    inc EDX
    cmp EDX,0F4240h
    jb L2C
-----

Now, I am not exactly fluent in assembler, but the "mov ECX, EDX" seems unnecessary. The ECX register is explicitly used three times in the whole program, and it looks like this instruction can at least be moved out of the loop, if not removed completely. Is it indeed a bug, or is there some reason for it? And if the former, where do I report it - at http://d.puremagic.com/issues/, as with the front-end?

I didn't try GDC or LDC since I didn't find clear instructions for using them under Win32. If there are any, please kindly point me to them. I found a few explanations for GDC, but had a hard time trying to figure out which is the most current one.

Note that the C++ version does the same with four instructions instead of five, which is what the D version would be if we removed the instruction in question. Indeed, it goes like this (code inside the loop):

-----
L3:
    movl %eax, _a(,%eax,4)
    addl $1, %eax
    cmpl $1000000, %eax
    jne L3
-----

The full assembly listings and the source code (D and C++) are here: http://acm.math.spbu.ru/~gassa/dlang/simple_loop/

I've tried a few other versions as well. Changing the loop to an explicit "for (int i = 0; i < MAX_N; i++)" (a2.d) does not affect the generated assembly. Making the array dynamic (a3.d) leads to five instructions, all seemingly important. A __gshared static array (a4.d) gives the same seemingly unneeded instruction, but with EAX instead of ECX:

-----
L2:
    mov _D2a41aG1000000i[EDX*4],EDX
    mov EAX,EDX
    inc EDX
    cmp EDX,0F4240h
    jb L2
-----

Ivan Kazmenko.
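For readers less used to hexadecimal immediates, here is a small sketch (not part of the original post) that checks what the two constants in the DMD listing stand for. The name arrayBytes is introduced only for this illustration, and MAX_N is declared as an enum here for convenience.

```d
import std.stdio : writefln;

// Loop bound from the post.
enum int MAX_N = 1_000_000;

// Hypothetical helper name: bytes occupied by `int[MAX_N] a`.
enum size_t arrayBytes = MAX_N * int.sizeof;

// 0F4240h is the immediate in `cmp EDX,0F4240h`: the loop bound.
static assert(0xF4240 == MAX_N);

// 03D0900h is the displacement in `mov -03D0900h[EDX*4][EBP],EDX`:
// the size of the array in bytes, so the array starts that far below EBP.
static assert(0x3D0900 == arrayBytes);

void main()
{
    writefln("loop bound in hex:  %X", MAX_N);
    writefln("array bytes in hex: %X", arrayBytes);
}
```

So the store writes EDX to a[EDX], the compare tests EDX against MAX_N, and neither of them reads ECX.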
Dec 25 2013
On Wednesday, 25 December 2013 at 12:03:08 UTC, Ivan Kazmenko wrote:
> Now, I am not exactly fluent in assembler, but the "mov ECX, EDX" seems unnecessary. The ECX register is explicitly used three times in the whole program, and it looks like this instruction can at least be moved out of the loop, if not removed completely. Is it indeed a bug, or is there some reason for it? And if the former, where do I report it - at http://d.puremagic.com/issues/, as with the front-end?

Did you try something like:

    foreach (immutable i; 0..MAX_N)
        a[i] = i;

too? One thing to note is that, technically, i is a _copy_ of the iterated number. So things like

    foreach (i; 0..5)
        i++;

have no effect (it will loop 5 times regardless). Indeed, in your case, the copy could be optimized out, but in general the extra instruction is technically correct. I don't know if making i immutable would change things, but it might give the compiler enough of a hint to do the correct optimization here.
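The copy semantics described above can be sketched as follows. This is an illustration only, assuming a reasonably recent D compiler (older releases may treat the loop variable differently); the function names countByValue and countByRef are made up for the example.

```d
import std.stdio : writeln;

// Counts loop iterations when the body also increments the loop variable.
// Without `ref`, `i` is assumed to be a per-iteration copy of the hidden
// counter, so the increment does not change how often the loop runs.
int countByValue()
{
    int iterations = 0;
    foreach (i; 0 .. 5)
    {
        i++;            // touches only the copy
        iterations++;
    }
    return iterations;
}

// With `ref`, `i` is the counter itself, so the increment skips values.
int countByRef()
{
    int iterations = 0;
    foreach (ref i; 0 .. 5)
    {
        i++;            // advances the real counter
        iterations++;
    }
    return iterations;
}

void main()
{
    writeln("by value: ", countByValue());
    writeln("by ref:   ", countByRef());
}
```

Whether that copy is what shows up as "mov ECX,EDX" in the DMD output is a guess, not something this sketch proves.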
Dec 25 2013
On Wednesday, 25 December 2013 at 12:43:05 UTC, Chris Cain wrote:
> Did you try something like:
>
>     foreach (immutable i; 0..MAX_N)
>         a[i] = i;
>
> too? One thing to note is that, technically, i is a _copy_ of the iterated number. So things like
>
>     foreach (i; 0..5)
>         i++;
>
> have no effect (it will loop 5 times regardless). Indeed, in your case, the copy could be optimized out, but in general the extra instruction is technically correct. I don't know if making i immutable would change things, but it might give the compiler enough of a hint to do the correct optimization here.

Thanks, that sounded reasonable. Still, in this particular case, the generated assembly remained the same.
Dec 25 2013
Ivan Kazmenko:
> I am studying the difference between the x86 code generated by DMD and by C/C++ compilers on Windows (simply put: why exactly, and by what margin, DMD-compiled D code is often slower than the GCC-compiled C/C++ equivalent). Now, I have this simple D program:
>
> -----
> immutable int MAX_N = 1_000_000;
> void main () {
>     int [MAX_N] a;
>     foreach (i; 0..MAX_N)
>         a[i] = i;
> }
> -----
>
> (I know there's iota in std.range, and it turns out to be even slower - but that's a high-level function, and I'm trying to understand the lower-level details now.)
>
> The assembly (dmd -O -release -inline -noboundscheck, then obj2asm) has the following piece corresponding to the loop:
>
> -----
> L2C:
>     mov -03D0900h[EDX*4][EBP],EDX
>     mov ECX,EDX
>     inc EDX
>     cmp EDX,0F4240h
>     jb L2C
> -----

ldc2 optimizes the useless loop away:

    __Dmain:
        xorl %eax, %eax
        ret

If I modify the code to return some value from an int main:

    return a[7];

ldc2 gives the loop code:

    LBB0_1:
        movl %eax, 12(%esp,%eax,4)
        incl %eax
        cmpl $1000000, %eax
        jne LBB0_1

If I use iota, ldc2 compiles the loop to exactly the same asm:

    foreach (i; MAX_N.iota)

Bye,
bearophile
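For reference, the two variants described above might look like this as complete code. This is a sketch rather than the exact source that was compiled: the function names are hypothetical, and the loops are placed in helper functions instead of main so that both versions fit in one file. Each call puts a 4 MB array on the stack, so it assumes a stack limit larger than that.

```d
import std.range : iota;
import std.stdio : writeln;

immutable int MAX_N = 1_000_000;

// Plain foreach over a number range, with the result used so the
// optimizer cannot discard the loop as dead code.
int fillWithForeach()
{
    int[MAX_N] a;
    foreach (i; 0 .. MAX_N)
        a[i] = i;
    return a[7];
}

// The same loop written with std.range.iota.
int fillWithIota()
{
    int[MAX_N] a;
    foreach (i; MAX_N.iota)
        a[i] = i;
    return a[7];
}

void main()
{
    writeln(fillWithForeach());
    writeln(fillWithIota());
}
```

Comparing the disassembly of the two functions (for example with obj2asm, as in the first post) is one way to see whether a given compiler treats them alike.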
Dec 25 2013
On Wednesday, 25 December 2013 at 14:51:11 UTC, bearophile wrote:
> ldc2 optimizes the useless loop away:
>
>     __Dmain:
>         xorl %eax, %eax
>         ret
>
> If I modify the code to return some value from an int main:
>
>     return a[7];
>
> ldc2 gives the loop code:
>
>     LBB0_1:
>         movl %eax, 12(%esp,%eax,4)
>         incl %eax
>         cmpl $1000000, %eax
>         jne LBB0_1
>
> If I use iota, ldc2 compiles the loop to exactly the same asm:
>
>     foreach (i; MAX_N.iota)

Glad to know that! But what about DMD? Anyone?..

If someone with better knowledge of assembly confirms the instruction is unnecessary, I'll file a bug report (at http://d.puremagic.com/issues/ I presume).

Ivan Kazmenko.
Dec 25 2013
On 12/25/13, 20:03, Ivan Kazmenko wrote:
> -----
> L2C:
>     mov -03D0900h[EDX*4][EBP],EDX
>     mov ECX,EDX
>     inc EDX
>     cmp EDX,0F4240h
>     jb L2C
> -----

You should have said that all instructions are redundant :)

Looks like the array got optimized out, but then the optimizer stopped. The ECX likely refers to the 'i' loop variable. When the array write code got optimized out, the compiler could have figured out that 'i' was in turn unused as well and removed it too. And then, the foreach, etc...

You can file backend bugs on the same site.
Dec 26 2013
On Thursday, 26 December 2013 at 08:08:09 UTC, Lionello Lunesu wrote:
> You should have said that all instructions are redundant :)
>
> Looks like the array got optimized out, but then the optimizer stopped. The ECX likely refers to the 'i' loop variable. When the array write code got optimized out, the compiler could have figured out that 'i' was in turn unused as well and removed it too. And then, the foreach, etc...

I added the "return a[7];" part from bearophile's suggestion, but that did not change anything in the loop code. So, my guess is that the array does not get optimized out at all.

> You can file backend bugs on the same site.

Thanks, issue created!
https://d.puremagic.com/issues/show_bug.cgi?id=11821

Ivan Kazmenko.
Dec 26 2013