digitalmars.D.learn - dmd simple loop disassembly - redundant instruction?
- Ivan Kazmenko (65/65) Dec 25 2013 Hello,
- Chris Cain (14/22) Dec 25 2013 Did you try something like:
- Ivan Kazmenko (3/15) Dec 25 2013 Thanks, that sounded reasonable. Still, in this particular case,
- bearophile (17/42) Dec 25 2013 ldc2 optimizes the useless loop away:
- Ivan Kazmenko (6/20) Dec 25 2013 Glad to know that! But what about DMD? Anyone?..
- Lionello Lunesu (7/14) Dec 26 2013 You should have said that all instructions are redundant :) Looks like
- Ivan Kazmenko (8/15) Dec 26 2013 I added the "return a[7];" part from bearophile's suggestion, but
Hello,
I am studying the difference between x86 generated code of DMD
and C/C++ compilers on Windows (simply put: why exactly, and by
what margin, DMD-compiled D code is often slower than
GCC-compiled C/C++ equivalent).
Now, I have this simple D program:
-----
immutable int MAX_N = 1_000_000;
void main () {
    int [MAX_N] a;
    foreach (i; 0..MAX_N)
        a[i] = i;
}
-----
(I know there's iota in std.range, and it turns out to be even
slower - but that's a high level function, and I'm trying to
understand the lower-level details now.)
The assembly (dmd -O -release -inline -noboundscheck, then
obj2asm) has the following piece corresponding to the cycle:
-----
L2C:    mov  -03D0900h[EDX*4][EBP],EDX
        mov  ECX,EDX
        inc  EDX
        cmp  EDX,0F4240h
        jb   L2C
-----
Now, I am not exactly fluent in assembler, but the "mov ECX, EDX"
seems unnecessary. The ECX register is explicitly used three
times in the whole program, and it looks like this instruction
can at least be moved out of the loop, if not removed completely.
Is it indeed a bug, or is there some reason for it? And if it is
a bug, where do I report it: at http://d.puremagic.com/issues/,
as with the front-end?
I didn't try GDC or LDC since I didn't find clear instructions
for using them under Win32. If there are any, please kindly point
me to them. I found a few explanations for GDC, but had a hard
time figuring out which one is the most current.
Note that the C++ version does the same with four instructions
instead of five, which is what the D version would have if the
instruction in question were removed. Here is the code inside
the loop:
-----
L3:
        movl    %eax, _a(,%eax,4)
        addl    $1, %eax
        cmpl    $1000000, %eax
        jne     L3
-----
The full assembly listings, and the source codes (D and C++), are
here:
http://acm.math.spbu.ru/~gassa/dlang/simple_loop/
I've tried a few other versions as well. Changing the loop to an
explicit "for (int i = 0; i < MAX_N; i++)" (a2.d) does not affect
the generated assembly. Making the array dynamic (a3.d) leads to
five instructions, all seemingly important. A __gshared static
array (a4.d) gives the same seemingly unneeded instruction but
with EAX instead of ECX:
-----
L2:     mov  _D2a41aG1000000i[EDX*4],EDX
        mov  EAX,EDX
        inc  EDX
        cmp  EDX,0F4240h
        jb   L2
-----
Ivan Kazmenko.
Dec 25 2013
On Wednesday, 25 December 2013 at 12:03:08 UTC, Ivan Kazmenko wrote:
Now, I am not exactly fluent in assembler, but the "mov ECX, EDX"
seems unnecessary. The ECX register is explicitly used three
times in the whole program, and it looks like this instruction
can at least be moved out of the loop, if not removed completely.
Is it indeed a bug, or there's some reason here? And if the
former, where do I report it - at http://d.puremagic.com/issues/,
as with the front-end?
Did you try something like:
foreach (immutable i; 0..MAX_N)
    a[i] = i;
too? One thing to note is that, technically, i is a _copy_ of
the iterated number. So things like
foreach (i; 0..5)
    i++;
have no effect (it will loop 5 times regardless). Indeed, in
your case, this could be optimized out, but in general the
extra instruction is technically correct. I don't know if
making i immutable would change things, but it might give the
compiler enough of a hint to do the correct optimization here.
Dec 25 2013
On Wednesday, 25 December 2013 at 12:43:05 UTC, Chris Cain wrote:
Did you try something like:
foreach (immutable i; 0..MAX_N)
    a[i] = i;
too? One thing to note is that, technically, i is a _copy_ of
the iterated number. So things like
foreach (i; 0..5)
    i++;
have no effect (it will loop 5 times regardless). Indeed, in
your case, this could be optimized out, but in general the
extra instruction is technically correct. I don't know if
making i immutable would change things, but it might give the
compiler enough of a hint to do the correct optimization here.
Thanks, that sounded reasonable. Still, in this particular case,
the generated assembly remained the same.
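For reference, a minimal compilable sketch of the suggested variant. The helper name fillImmutableIndex and the small size N are assumptions made for this illustration only; the test in the thread used MAX_N directly in main. The static assert only checks that the loop variable cannot be modified in the body; it says nothing about the code DMD generates.

```d
import std.stdio : writeln;

// N is a hypothetical small size chosen for this sketch;
// the thread uses MAX_N = 1_000_000.
enum N = 16;

// Hypothetical helper: fills the array using an immutable loop variable.
int fillImmutableIndex()
{
    int[N] a;
    foreach (immutable i; 0 .. N)
    {
        // An immutable loop variable cannot be incremented in the body.
        static assert(!__traits(compiles, i++));
        a[i] = i;
    }
    return a[7];
}

void main()
{
    writeln(fillImmutableIndex()); // prints 7
}
```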
Dec 25 2013
Ivan Kazmenko:
I am studying the difference between x86 generated code of DMD
and C/C++ compilers on Windows (simply put: why exactly, and by
what margin, DMD-compiled D code is often slower than
GCC-compiled C/C++ equivalent).
Now, I have this simple D program:
-----
immutable int MAX_N = 1_000_000;
void main () {
int [MAX_N] a;
foreach (i; 0..MAX_N)
a[i] = i;
}
-----
(I know there's iota in std.range, and it turns out to be even
slower - but that's a high level function, and I'm trying to
understand the lower-level details now.)
The assembly (dmd -O -release -inline -noboundscheck, then
obj2asm) has the following piece corresponding to the cycle:
-----
L2C: mov -03D0900h[EDX*4][EBP],EDX
mov ECX,EDX
inc EDX
cmp EDX,0F4240h
jb L2C
-----
ldc2 optimizes the useless loop away:
__Dmain:
xorl %eax, %eax
ret
If I modify the code returning some value from the int main:
return a[7];
ldc2 gives the loop code:
LBB0_1:
movl %eax, 12(%esp,%eax,4)
incl %eax
cmpl $1000000, %eax
jne LBB0_1
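A minimal sketch of a complete program that keeps the array live, for anyone who wants to reproduce this. It is a variation, not the exact file that was compiled above: the loop sits in a hypothetical helper, fillAndPick, which returns a[7], and main prints the result instead of returning it as an exit code.

```d
import std.stdio : writeln;

immutable int MAX_N = 1_000_000;

// Hypothetical helper: same loop as in the original program,
// but the array is read afterwards, so the stores are observable.
int fillAndPick()
{
    int[MAX_N] a; // about 4 MB on the stack, as in the original
    foreach (i; 0 .. MAX_N)
        a[i] = i;
    return a[7];
}

void main()
{
    writeln(fillAndPick()); // prints 7
}
```

Whether a given compiler still removes or shortens the loop after this change depends on its optimizer, so the disassembly has to be checked for each one.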
If I use iota, ldc2 compiles the loop to exactly the same asm:
foreach (i; MAX_N.iota)
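Spelled out as a whole program, the iota variant might look like the sketch below. The helper name fillWithIota is an assumption for illustration; the only change from the plain loop is the range expression and the std.range import.

```d
import std.range : iota;
import std.stdio : writeln;

immutable int MAX_N = 1_000_000;

// Hypothetical helper: iterates over MAX_N.iota instead of 0 .. MAX_N.
int fillWithIota()
{
    int[MAX_N] a; // about 4 MB on the stack, as in the original
    foreach (i; MAX_N.iota)
        a[i] = i;
    return a[7]; // use the array so that the loop is not dead code
}

void main()
{
    writeln(fillWithIota()); // prints 7
}
```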
Bye,
bearophile
Dec 25 2013
On Wednesday, 25 December 2013 at 14:51:11 UTC, bearophile wrote:
ldc2 optimizes the useless loop away:
__Dmain:
xorl %eax, %eax
ret
If I modify the code returning some value from the int main:
return a[7];
ldc2 gives the loop code:
LBB0_1:
movl %eax, 12(%esp,%eax,4)
incl %eax
cmpl $1000000, %eax
jne LBB0_1
If I use iota, ldc2 compiles the loop to exactly the same asm:
foreach (i; MAX_N.iota)
Glad to know that! But what about DMD? Anyone?..
If someone with better knowledge in assembly confirms the
instruction is unnecessary, I'll file a bug report (at
http://d.puremagic.com/issues/ I presume).
Ivan Kazmenko.
Dec 25 2013
On 12/25/13, 20:03, Ivan Kazmenko wrote:
-----
L2C: mov -03D0900h[EDX*4][EBP],EDX
mov ECX,EDX
inc EDX
cmp EDX,0F4240h
jb L2C
-----
You should have said that all instructions are redundant :) Looks like
the array got optimized out, but then the optimizer stopped. The ECX
likely refers to the 'i' loop variable. When the array write code got
optimized out, the compiler could have figured out that 'i' was in turn
unused as well and removed it too. And then, the foreach, etc...
You can file backend bugs on the same site.
Dec 26 2013
On Thursday, 26 December 2013 at 08:08:09 UTC, Lionello Lunesu wrote:
You should have said that all instructions are redundant :) Looks like
the array got optimized out, but then the optimizer stopped. The ECX
likely refers to the 'i' loop variable. When the array write code got
optimized out, the compiler could have figured out that 'i' was in turn
unused as well and removed it too. And then, the foreach, etc...
I added the "return a[7];" part from bearophile's suggestion, but
that did not change anything in the loop code. So, my guess is
that the array does not get optimized out at all.
You can file backend bugs on the same site.
Thanks, issue created!
https://d.puremagic.com/issues/show_bug.cgi?id=11821
Ivan Kazmenko.
Dec 26 2013