
digitalmars.D - D slower than C++ by a factor of _two_ for simple raytracer (gdc)

reply downs <default_357-line yahoo.de> writes:
My platform is GDC 4.1.2 vs G++ 4.1.1.

I played around with the simple ray tracer code I'd ported to D a while back,
still being dissatisfied with the timings of 21s (D) vs 16s (C++).

During this, I found a nice optimization that brought my D code down to 17s,
within less than a second of C++!

"Glee" I thought!

Then I applied the same optimization to the C++ source and it dropped to 8s.

I haven't been able to get the D code even close to this new speed level.

The outputs of both programs are identical save for off-by-one differences.

The source code for the C++ version is http://paste.dprogramming.com/dpvpm7jv

D version is http://paste.dprogramming.com/dpzal0jd

Before you ask, yes I've tried turning the structs into classes, the classes
into structs and the refs into pointers. That usually made no difference, or
worsened it.

Both programs were built with -O3 -ffast-math, the D version additionally with
-frelease.
Both compilers were built with roughly similar configure flags. The GDC used is
the latest available in SVN, and based on DMD 1.022.

Does anybody know how to bring the D results in line with, or at least closer
to, the C++ version?

Ideas appreciated,

 --downs
Feb 14 2008
next sibling parent Daniel Lewis <murpsoft hotmail.com> writes:
downs Wrote:

 My platform is GDC 4.1.2 vs G++ 4.1.1.
 
 [...]
 
 Does anybody know how to bring the D results in line with, or at least closer
 to, the C++ version?
Don't have a GC, or statically load all of Phobos just to do simple raytracing?
Feb 14 2008
prev sibling next sibling parent "Unknown W. Brackets" <unknown simplemachines.org> writes:
Well, I'm on Windows, but comparing DMC and DMD on that code, DMD is 
slightly faster.  I know that gdc isn't really optimizing everything yet....

That said, cl (v15) beats dmc and dmd, running in about 60% of the time, but 
this has less to do with the language itself.

I wonder how gcc and dmd compare here...

-[Unknown]


downs wrote:
 My platform is GDC 4.1.2 vs G++ 4.1.1.
 
 [...]
 
 Does anybody know how to bring the D results in line with, or at least closer
 to, the C++ version?
Feb 14 2008
prev sibling next sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
downs:
 My platform is GDC 4.1.2 vs G++ 4.1.1.
DMD doesn't optimize much for speed, and programs compiled with GDC aren't that far from DMD ones, I don't know why. I'd like GDC to emit C++ code (later to be compiled by GCC) so I can see the spots where it emits slow-looking C++ code.

DMD isn't much good at inlining, etc, so probably your methods are all function calls, struct methods too.

If you translate your D raytracer to Java6 with HotSpot, you will probably find that your D code is 20-50% slower than the Java one, despite the Java one being a bit higher level :-) (Thanks to HotSpot and the GC.)

If you can stand the ugliness, you can probably reduce your running time by 10-15% using my TinyVector structs instead of your Vec struct. You can find them in my d libs (V.2.70 at the moment, their development is going well): http://www.fantascienza.net/leonardo/so/libs_d.zip . That TinyVector comes from extensive testing of mine. Adapting your raytracer to TinyVector will probably take 10-20 minutes, and it's not too difficult. The result will be ugly...

Bye,
bearophile
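[For readers who don't want to fetch the library, here is a minimal sketch of the general idea: a small fixed-size struct whose operations mutate in place instead of returning fresh values. The names and layout are illustrative only, not the actual TinyVector API.]

struct Vec3 {
    double x, y, z;

    // in-place addition: no struct is returned, nothing is copied back
    void opAddAssign(ref Vec3 o) {
        x += o.x; y += o.y; z += o.z;
    }

    // in-place scaling
    void opMulAssign(double s) {
        x *= s; y *= s; z *= s;
    }
}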
Feb 15 2008
next sibling parent bearophile <bearophileHUGS lycos.com> writes:
bearophile>If you can stand the ugliness, you can probably reduce your running
time by 10-15% using my TinyVector structs instead of your Vec struct,<

Note that I expect such speedup on DMD, where I have developed them. I don't
know what's the outcome on GDC (that you are using).

Bye,
bearophile
Feb 15 2008
prev sibling next sibling parent reply Marius Muja <mariusm cs.ubc.ca> writes:
bearophile wrote:
 downs:
 My platform is GDC 4.1.2 vs G++ 4.1.1.
DMD doesn't optimize much for speed, and programs compiled with GDC aren't that far from DMD ones, I don't know why. I'd like GDC to emit C++ code (later to be compiled by GCC) so I can see the spots where it emits slow-looking C++ code.
In my experience GDC code is faster than DMD code (in some cases significantly faster).
Feb 15 2008
parent reply bearophile <bearophileHUGS lycos.com> writes:
Marius Muja Wrote:
 In my experience GDC code is faster than DMD code (in some cases significantly
faster).
My experience is similar to the results you can see here, that is, about the same on average, better for some things, worse for others:

http://shootout.alioth.debian.org/sandbox/benchmark.php?test=all&lang=gdc

(I was using GDC based on MinGW based on GCC 3.2. You can find a good newer MinGW here: http://nuwen.net/mingw.html but I don't know if it works with GDC.)

Note for downs: have you tried the -fprofile-generate/-fprofile-use flags for the C++ code? They improve the C++ raytracer speed some.

Bye,
bearophile
Feb 15 2008
parent downs <default_357-line yahoo.de> writes:
bearophile wrote:
 Note for downs: have you tried -fprofile-generate/-fprofile-use flags for the
C++ code? They improve the C++ raytracer speed some.
 
 Bye,
 bearophile
My point is not in making GDC's crushing defeat even crushinger :)

But thanks for the advice, anyway.

 --downs
Feb 15 2008
prev sibling parent reply downs <default_357-line yahoo.de> writes:
bearophile wrote:
 downs:
 My platform is GDC 4.1.2 vs G++ 4.1.1.
DMD doesn't optimize much for speed, and programs compiled with GDC aren't that far from DMD ones, I don't know why. [...] DMD isn't much good at inlining, etc, so probably your methods are all function calls, struct methods too. [...]
The weird thing is: even if I inline the one spot where gdc ignores its opportunity to inline a function, so that I have the _same_ call-counts as G++ (as measured with -g -pg), even then, the D code is slower. So it doesn't depend on missing inlining opportunities. Or am I missing something?

 --downs

PS: for reference, the missing bit is GDC not always inlining Sphere::ray_sphere. If you look, it's only ever called for cases where the final type is obvious.
Feb 15 2008
next sibling parent downs <default_357-line yahoo.de> writes:
downs wrote:
 bearophile wrote:
 If you can stand the ugliness, you can probably reduce your running time by
10-15% using my TinyVector structs instead of your Vec struct, you can find
them in my d libs: (V.2.70 at the moment, their development is going well,
http://www.fantascienza.net/leonardo/so/libs_d.zip ). That TinyVector comes
from extensive testing of mine. You probably may require 10-20 minutes of time
to adapt your raytracer to using TinyVector, but it's not too much difficult.
The result will be ugly...

 Bye,
 bearophile
To clarify: I know I can get the D code to be as fast as the C++ code if I optimize it more, or use custom structs, etc. That's not the point. The point is getting a comparison of C++ and D using equivalent code.

But, again, thanks for the advice.
Feb 15 2008
prev sibling next sibling parent Walter Bright <newshound1 digitalmars.com> writes:
downs wrote:
 The weird thing is: even if I inline the one spot where gdc ignores
 its opportunity to inline a function, so that I have the _same_
 call-counts as G++ (as measured with -g -pg), even then, the D code
 is slower. So it doesn't depend on missing inlining opportunities. Or
 am I missing something?
It's often worthwhile to run obj2asm on the output of each, and compare.
Feb 15 2008
prev sibling next sibling parent reply downs <default_357-line yahoo.de> writes:
Another interesting observation.

If I change all my opFoo's to opFooAssign's, and use those instead, speed goes
up from 16s to 13s; indicating that returning large structs (12 bytes/vector)
causes a significant speed hit. Still not close to the C++ version though. The
weird thing is that all those ops have been inlined (or so says the assembler
dump). Weird.
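[For readers following along, a minimal sketch of the rewrite being described, using a hypothetical Vec struct rather than the code from the pastes: the value-returning operator must build a temporary and copy it back to the caller, while the assign-operator form mutates the left operand in place.]

struct Vec {
    double x, y, z;

    // value-returning form: the temporary result is copied back to the caller
    Vec opSub(ref Vec o) {
        Vec r;
        r.x = x - o.x; r.y = y - o.y; r.z = z - o.z;
        return r;
    }

    // in-place form: overwrites the left operand, no struct is returned
    void opSubAssign(ref Vec o) {
        x -= o.x; y -= o.y; z -= o.z;
    }
}

With this, 'a = a - b' becomes 'a -= b', trading a temporary plus a copy for a plain in-place update.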

 --downs
Feb 15 2008
next sibling parent downs <default_357-line yahoo.de> writes:
downs wrote:
 Another interesting observation.
 
 If I change all my opFoo's to opFooAssign's, and use those instead, speed goes
up from 16s to 13s; indicating that returning large structs (12 bytes/vector)
causes a significant speed hit. Still not close to the C++ version though. The
weird thing is that all those ops have been inlined (or so says the assembler
dump). Weird.
 
  --downs
Excuse me. 24 bytes.
Feb 15 2008
prev sibling parent reply Tim Burrell <tim timburrell.net> writes:
downs wrote:
 Another interesting observation.
 
 If I change all my opFoo's to opFooAssign's, and use those instead, speed goes
up from 16s to 13s; indicating that returning large structs (12 bytes/vector)
causes a significant speed hit. Still not close to the C++ version though. The
weird thing is that all those ops have been inlined (or so says the assembler
dump). Weird.
 
  --downs
Yeah, I was about to say the same. See here:

http://paste.dprogramming.com/dpolmzhw

It's ugly, but no struct returning. On my machine it's about a second slower than g++ (8.9s vs. 7.8s), compiled via:

gdc -fversion=Posix -fversion=Tango -O3 -fomit-frame-pointer -fweb -frelease -finline-functions

and

g++ -O3 -fomit-frame-pointer -fweb -finline-functions

There are probably some other optimizations that could be made. But really I think this comes down to the compiler not being as mature. The stuff that I did should all be done by an optimizing compiler. You're basically tricking the compiler into moving fewer bits around.

Tim.
Feb 15 2008
next sibling parent reply downs <default_357-line yahoo.de> writes:
Tim Burrell wrote:
 downs wrote:
 Another interesting observation.

 If I change all my opFoo's to opFooAssign's, and use those instead, speed goes
up from 16s to 13s; indicating that returning large structs (12 bytes/vector)
causes a significant speed hit. Still not close to the C++ version though. The
weird thing is that all those ops have been inlined (or so says the assembler
dump). Weird.

  --downs
Yeah, I was about to say the same. See here: http://paste.dprogramming.com/dpolmzhw It's ugly, but no struct returning. On my machine it's about a second slower than g++ (8.9s vs. 7.8s) [...]
But even using your compiler flags, I'm still looking at 12.8s (D) vs 8.1s (C++) .. 11.4s (D) vs 7.8s (C++) using -march=nocona.

:ten minutes later:

... Okay, now I'm confused. Your program is three seconds faster than my op*Assign version. Is there a generic problem with operator overloading?

I rewrote my version for freestanding functions .. 9.5s :confused: Why do struct members (which are inlined, I checked) take such a speed hit?

Ah well. Let's hope LLVMDC does a better job .. someday.

 --downs
Feb 15 2008
next sibling parent reply "Jarrett Billingsley" <kb3ctd2 yahoo.com> writes:
"downs" <default_357-line yahoo.de> wrote in message 
news:fp4593$1kko$1 digitalmars.com...

 I rewrote my version for freestanding functions .. 9.5s :confused: Why do 
 struct members (which are inlined, I checked) take such a speed hit?
I think other people have come to this bizarre realization as well. It really doesn't make any sense. Have you compared the assembly of calling a struct member function and calling a free function?
Feb 15 2008
parent reply downs <default_357-line yahoo.de> writes:
I ran a comparison of struct vector methods vs freestanding, and the GDC
generated assembler code is precisely identical.

Here's my test source

struct foo {
  double x, y, z;
  void opAddAssign(ref foo bar) {
    x += bar.x; y += bar.y; z += bar.z;
  }
}

void foo_add(ref foo bar, ref foo baz) {
  baz.x += bar.x; baz.y += bar.y; baz.z += bar.z;
}

// prevents overzealous optimization
// really just returns 0, 0, 0
extern(C) foo complex_external_function();

import std.stdio;
void main() {
  foo a = complex_external_function(), b = complex_external_function();
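  // the int-3 breakpoints below only mark the two regions to locate
  // in the generated assembler dump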
  asm { int 3; }
  a += b;
  asm { int 3; }
  foo c = complex_external_function(), d = complex_external_function();
  asm { int 3; }
  foo_add(d, c);
  asm { int 3; }
  writefln(a, b, c, d);
}

And here are the relevant two bits of assembler.


#APP
	int	$3
#NO_APP
	fldl	-120(%ebp)
	faddl	-96(%ebp)
	fstpl	-120(%ebp)
	fldl	-112(%ebp)
	faddl	-88(%ebp)
	fstpl	-112(%ebp)
	fldl	-104(%ebp)
	faddl	-80(%ebp)
	fstpl	-104(%ebp)



#APP
	int	$3
#NO_APP
	fldl	-72(%ebp)
	faddl	-48(%ebp)
	fstpl	-72(%ebp)
	fldl	-64(%ebp)
	faddl	-40(%ebp)
	fstpl	-64(%ebp)
	fldl	-56(%ebp)
	faddl	-32(%ebp)
	fstpl	-56(%ebp)

No difference. But then why the obvious speed difference? Color me confused ._.

 --downs
Feb 15 2008
parent Walter Bright <newshound1 digitalmars.com> writes:
downs wrote:
 No difference. But then why the obvious speed difference? Color me confused ._.
Test to see if the stack is aligned, i.e. if the doubles start on 16 byte address boundaries.
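[A minimal sketch of such a check, assuming the foo struct from the test above; the only point of interest is the low bits of the address.]

import std.stdio;

struct foo { double x, y, z; }

void checkAlignment(ref foo f) {
    size_t addr = cast(size_t)&f.x;
    // 16-byte aligned iff the low four bits of the address are zero
    writefln("addr & 15 = ", addr & 15, " (0 means 16-byte aligned)");
}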
Feb 15 2008
prev sibling parent reply downs <default_357-line yahoo.de> writes:
downs wrote:
 I rewrote my version for freestanding functions .. 9.5s :confused: Why do
struct members (which are inlined, I checked) take such a speed hit?
 
My version had a bug. x__X The correct version takes 11.2s again.

 --downs
Feb 15 2008
parent reply downs <default_357-line yahoo.de> writes:
downs wrote:
 downs wrote:
 I rewrote my version for freestanding functions .. 9.5s :confused: Why do
struct members (which are inlined, I checked) take such a speed hit?
My version had a bug. x__X The correct version takes 11.2s again. --downs
If I fix the bug, the 'external function' version is exactly as fast as the opFoo version. Sorry.

I think the 8s version posted earlier has a similar bug. Look at the output. :)

 -- downs
Feb 15 2008
parent reply downs <default_357-line yahoo.de> writes:
I've been playing around with the 8-9s version posted earlier.

The problem seems to lie in ray_sphere.

Strangely, Vec v = void; Vec.sub(center, ray.orig, v); runs in 8.8s, producing
a correct output once the printf at the bottom has been fixed,
but Vec v = center - ray.orig; runs in 11.1s.

Still investigating why this happens.
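[For reference, a sketch of the two forms being compared, with field names assumed from the earlier test struct; this is an illustration, not the code from the paste.]

struct Vec {
    double x, y, z;

    // in-place subtract: fills 'result' directly, nothing is returned
    static void sub(ref Vec a, ref Vec b, ref Vec result) {
        result.x = a.x - b.x;
        result.y = a.y - b.y;
        result.z = a.z - b.z;
    }
}

struct Ray { Vec orig, dir; }

void example(ref Vec center, ref Ray ray) {
    // fast form: v starts uninitialized and is filled in place
    Vec v = void;
    Vec.sub(center, ray.orig, v);

    // slow form, for comparison: an opSub would build a temporary
    // struct and copy it into v:
    //     Vec v = center - ray.orig;
}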

 --downs
Feb 15 2008
next sibling parent reply downs <default_357-line yahoo.de> writes:
downs wrote:
 I've been playing around with the 8-9s version posted earlier.
 
 The problem seems to lie in ray_sphere.
 
 Strangely, Vec v = void; Vec.sub(center, ray.orig, v); runs in 8.8s, producing
a correct output once the printf at the bottom has been fixed,
 but Vec v = center - ray.orig; runs in 11.1s.
 
 Still investigating why this happens.
 
  --downs
Okay, found the cause, if not the reason, by looking at the assembler output.

For some reason, the bad case, although inlined, stores its values back into memory. The fast case keeps working with them.

Here's the disassembly for ray_sphere for both cases:

slow (opSub)

http://paste.dprogramming.com/dpcds3p3

fast

http://paste.dprogramming.com/dpd6pi8n

So it comes down to a GDC FP "bug". I think changing to 4.2 or 4.3 might help. Does anybody have an up-to-date version of the 4.2.x patch?

 --downs
Feb 15 2008
next sibling parent reply Tim Burrell <tim timburrell.net> writes:
downs wrote:
 Strangely, Vec v = void; Vec.sub(center, ray.orig, v); runs in 8.8s, producing
a correct output once the printf at the bottom has been fixed,
 but Vec v = center - ray.orig; runs in 11.1s.
For some reason, the bad case, although inlined, stores its values back into memory. The fast case keeps working with them. So it comes down to a GDC FP "bug". I think changing to 4.2 or 4.3 might help. Does anybody have an up-to-date version of the 4.2.x patch?
Hey good deal on figuring this out! It's good to know, especially for those of us using D for real-time simulation type stuff. Is there really a GDC that compiles against gcc >= 4.2?!
Feb 15 2008
parent reply downs <default_357-line yahoo.de> writes:
Tim Burrell wrote:
 downs wrote:
 Strangely, Vec v = void; Vec.sub(center, ray.orig, v); runs in 8.8s, producing
a correct output once the printf at the bottom has been fixed,
 but Vec v = center - ray.orig; runs in 11.1s.
For some reason, the bad case, although inlined, stores its values back into memory. The fast case keeps working with them. So it comes down to a GDC FP "bug". I think changing to 4.2 or 4.3 might help. Does anybody have an up-to-date version of the 4.2.x patch?
Hey good deal on figuring this out! It's good to know, especially for those of us using D for real-time simulation type stuff. Is there really a GDC that compiles against gcc >= 4.2?!
I'm not sure; I remember somebody saying he'd managed to build it. And there's a post on d.gnu from somebody saying he'd gotten it to work, although he couldn't build phobos.

Since GDC seems to be .. inert at the moment, it'd probably be up to some volunteer effort to upgrade it to 4.[23]. That, or get llvmdc up to speed. I myself am, of course, mostly clueless about both compilers. :/

 --downs
Feb 15 2008
parent Tim Burrell <tim timburrell.net> writes:
downs wrote:
 Tim Burrell wrote:
 downs wrote:
 Strangely, Vec v = void; Vec.sub(center, ray.orig, v); runs in 8.8s, producing
a correct output once the printf at the bottom has been fixed,
 but Vec v = center - ray.orig; runs in 11.1s.
For some reason, the bad case, although inlined, stores its values back into memory. The fast case keeps working with them. So it comes down to a GDC FP "bug". I think changing to 4.2 or 4.3 might help. Does anybody have an up-to-date version of the 4.2.x patch?
Hey good deal on figuring this out! It's good to know, especially for those of us using D for real-time simulation type stuff. Is there really a GDC that compiles against gcc >= 4.2?!
I'm not sure; I remember somebody saying he'd managed to build it. And there's a post on d.gnu from somebody saying he'd gotten it to work, although he couldn't build phobos. Since GDC seems to be .. inert at the moment, it'd probably up to some volunteer effort to upgrade it to 4.[23]. That, or get llvmdc up to speed. Myself of course is mostly clueless about both compilers. :/
I notice that the Ubuntu team appears to have a working 4.2-based gdc that the changelog also says works with 4.3:

http://packages.ubuntu.com/hardy/devel/gdc-4.2

Changelog is here:

http://changelogs.ubuntu.com/changelogs/pool/universe/g/gdc-4.2/gdc-4.2_0.25-4.2.3-0ubuntu1/changelog

It'd be really nice to see a new gdc release! I wonder if David even knows about these patches!?
Feb 15 2008
prev sibling next sibling parent reply downs <default_357-line yahoo.de> writes:
downs wrote:
 Here's the disassembly for ray_sphere for both cases:
 
 slow (opSub)
 
 http://paste.dprogramming.com/dpcds3p3
 
 fast
 
 http://paste.dprogramming.com/dpd6pi8n
 
 So it comes down to a GDC FP "bug". I think changing to 4.2 or 4.3 might help.
Does anybody have an up-to-date version of the 4.2.x patch?
 
  --downs
Especially interesting to note (slow case):

    fstpl   -24(%ebp)
    [...]
    movl    -24(%ebp), %eax
    movl    %eax, -48(%ebp)
    movl    -20(%ebp), %eax
    movl    %eax, -44(%ebp)

Translation: store floating-point number to ebp[-24]. No, wait, move it to ebp[-48].

This indicates a pretty serious problem with optimization, since the whole thing is basically redundant. The "fast" version doesn't have any memory writes at all during the computation.

 --downs
Feb 15 2008
parent downs <default_357-line yahoo.de> writes:
downs wrote:
 Especially interesting to note (slow case):
 
     fstpl    -24(%ebp)
 [...]
     movl    -24(%ebp), %eax
     movl    %eax, -48(%ebp)
     movl    -20(%ebp), %eax
     movl    %eax, -44(%ebp)
 
 Translation:
 	Store floating-point number to ebp[-24]. No, wait, move it to ebp[-48].
I left something out.

    fstpl   -24(%ebp)
    [...]
    movl    -24(%ebp), %eax
    movl    %eax, -48(%ebp)
    movl    -20(%ebp), %eax
    movl    %eax, -44(%ebp)
    [...]
    fldl    -48(%ebp)

So, the whole thing comes down to "Store FP number to memory. No wait, move it somewhere else! No wait, read it back!"

No wonder it's slow.
Feb 15 2008
prev sibling parent reply Sergey Gromov <snake.scaly gmail.com> writes:
downs <default_357-line yahoo.de> wrote:
 For some reason, the bad case, although inlined, stores its values back into
memory. The fast case keeps working with them.
 
 Here's the disassembly for ray_sphere for both cases:
 
 slow (opSub)
 
 http://paste.dprogramming.com/dpcds3p3
 
 fast
 
 http://paste.dprogramming.com/dpd6pi8n
 
 So it comes down to a GDC FP "bug". I think changing to 4.2 or 4.3 might help.
Does anybody have an up-to-date version of the 4.2.x patch?
I'm trying to investigate this issue, too. I'm comparing the C++ code generated by Visual C Express 2005, and GDC 0.24 based on GCC 3.4.5 and DMD 1.020. Here's the commented comparison of the unitise() function:

http://paste.dprogramming.com/dpl9p4pt

As you can see, the code is very close. But the static opCall() which initializes the by-value return struct is not inlined, and therefore not optimized out. So there is an additional call and extra copying of already calculated values. If not for that, the code would be nearly identical.

-- 
SnakE
Feb 16 2008
parent reply Sergey Gromov <snake.scaly gmail.com> writes:
Sergey Gromov <snake.scaly gmail.com> wrote:
 I'm trying to investigate this issue, too.  I'm comparing the C++ code 
 generated by Visual C Express 2005, and GDC 0.24 based on GCC 3.4.5 and 
 DMD 1.020.  Here's the commented out comparison of unitise() function:
Continuing investigation. Here are raw results:

make-cpp-gcc.cmd:
    gcc -c -O3 -fomit-frame-pointer -fweb -finline-functions ray-cpp.cpp
    gcc ray-cpp.o -o ray-cpp.exe -lstdc++

test-cpp.cmd:
    ray-cpp 1>ray-cpp.pbm
    10968

make-d.cmd:
    gdc -c -O3 -fomit-frame-pointer -fweb -frelease -finline-functions ray-d.d
    gdc ray-d.o -o ray-d.exe

test-d.cmd:
    ray-d 1>ray-d.pbm
    10828

The numbers printed by the tests are milliseconds. As you can see, the D version is slightly faster. The outputs are identical.

The C++ and D programs are here, respectively:

http://paste.dprogramming.com/dpaftqa2
http://paste.dprogramming.com/dptiniar

The only change in C++ is the time output at the end of main(). The D program is refactored so that all struct manipulations happen in-place, without passing and returning by value. GDC has troubles inlining static opCalls for some reason.

Microsoft's compiler produces FP/math code about 25% shorter than GCC/GDC on average, hence the results:

make-cpp.cmd:
    cl -nologo -EHsc -Ox ray-cpp.cpp

test-cpp.cmd:
    ray-cpp 1>ray-cpp.pbm
    7656

-- 
SnakE
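[Since the static opCall remark above is easy to misread, here is a hedged sketch of the construct in question: the D1 idiom of a static opCall used as a value constructor, which GDC reportedly had trouble inlining. Names are illustrative only.]

struct Vec {
    double x, y, z;

    // D1-style 'constructor': builds the struct in a temporary and
    // returns it by value; if not inlined, every Vec(a, b, c) is a
    // real function call plus a copy of the returned struct
    static Vec opCall(double x, double y, double z) {
        Vec v;
        v.x = x; v.y = y; v.z = z;
        return v;
    }
}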
Feb 16 2008
parent reply bearophile <bearophileHUGS lycos.com> writes:
Sergey Gromov:
 D program is refactored so that all struct manipulations happen in-place, 
 without passing and returning by value.  GDC has troubles inlining 
 static opCalls for some reason.
Yep, you seem to have re-invented a fixed-size version of my TinyVector (I have added static opCalls yesterday, but I may have to remove them again).
 Microsoft's compiler produces FP/math code about 25% shorter than 
 GCC/GDC in average
Nice. Thank you for your experiments.

Timings of your code (that has a bug, see downs for a fixed version) on Win, Pentium3, best of 3 runs, image 256x256:

D DMD v.1.025:
    bud -clean -O -release -inline rayD.d
    15.8 seconds (memory deallocation too)

C++ MinGW based on GCC 4.2.1:
    g++ -O3 -s rayCpp.cpp -o rayCpp0
    9.42 s (memory deallocation too)

C++ MinGW (the same):
    g++ -pipe -O3 -s -ffast-math -fomit-frame-pointer rayCpp.cpp -o rayCpp1
    8.89 s (memory deallocation too)

C++ MinGW (the same):
    g++ -pipe -O3 -s -ffast-math -fomit-frame-pointer -fprofile-generate rayCpp.cpp -o rayCpp2
    g++ -pipe -O3 -s -ffast-math -fomit-frame-pointer -fprofile-use rayCpp.cpp -o rayCpp2
    8.72 s (memory deallocation too)

I haven't tried GDC yet.

Bye,
bearophile
Feb 16 2008
parent Sergey Gromov <snake.scaly gmail.com> writes:
bearophile <bearophileHUGS lycos.com> wrote:
 Sergey Gromov:
 D program is refactored so that all struct manipulations happen in-place, 
 without passing and returning by value.  GDC has troubles inlining 
 static opCalls for some reason.
Yep, you seem to have re-invented a fixed-size version of my TinyVector (I have added static opCalls yesterday, but I may have to remove them again).
One of programmer's joys is to invent a wheel and pretend it's better than the others. ;)
 Timings of your code (that has a bug, see downs for a fixed version) on 
The only bug I can see is printing out characters through text-mode Windows stdout, which expands every 0xA into "\r\n". This doesn't have any impact on the benchmark.

-- 
SnakE
Feb 16 2008
prev sibling parent reply K.Wilson <phizzzt yahoo.com> writes:
I just finished initial support for x86-64 output with the ldc compiler (dmdfe
attached to llvm backend) and wanted to do some timings, so I used the ray
tracing code mentioned in this old thread. I compiled things with the same
optimization flags mentioned in the thread and came up with these averages over
6 runs on an AMD x86-64 machine running Fedora Core Linux.

llvm-g++4.0.1   5.76
ldc-rev736      6.68
g++4.1.2        6.72
gdc0.24         7.45
g++4.3.1        7.66
dmd1.030        14.52

Seems like the LLVM backend is doing well (though I have seen other timings
where g++4.x beats llvm-g++4.x, so take from this what you will).

I just thought I would let people know that ldc is coming along and performs
quite well, at this point. And it has some x86-64 support now ;)

Thanks,
K.Wilson


bearophile Wrote:

 Sergey Gromov:
 D program is refactored so that all struct manipulations happen in-place, 
 without passing and returning by value.  GDC has troubles inlining 
 static opCalls for some reason.
Yep, you seem to have re-invented a fixed-size version of my TinyVector (I have added static opCalls yesterday, but I may have to remove them again).
 [...]
Oct 27 2008
next sibling parent reply "Bill Baxter" <wbaxter gmail.com> writes:
On Tue, Oct 28, 2008 at 2:26 PM, K.Wilson <phizzzt yahoo.com> wrote:
 I just finished initial support for x86-64 output with the ldc compiler (dmdfe
attached to llvm backend) and wanted to do some timings [...]
 
 I just thought I would let people know that ldc is coming along and performs
quite well, at this point. And it has some x86-64 support now ;)
Faaaaan Tastic!

So how is Windows support coming along? Does it build smoothly with MinGW now?

--bb
Oct 27 2008
parent reply K.Wilson <phizzzt yahoo.com> writes:
Bill Baxter Wrote:

 On Tue, Oct 28, 2008 at 2:26 PM, K.Wilson <phizzzt yahoo.com> wrote:
 [...]
Faaaaan Tastic! So how is Windows support coming along? Does it build smoothly with MinGW now? --bb
Hey Bill,

I didn't see your question the other day... the Windows support is still not up to par for x86-64 (whether trying to build in 32- or 64-bit mode). I couldn't get things built on my machine for Windows (but I don't have VStudio 2008, which is suggested because MinGW64 is not production quality yet).

I think building for MinGW on x86 is somewhat supported and running... exceptions and inline asm still have some issues... hopefully these issues will be resolved quickly.

Thanks,
K.Wilson
Oct 29 2008
next sibling parent Christian Kamm <kamm-incasoftware removethis.de> writes:
 So how is Windows support coming along?  Does it build smoothly with
 MinGW now?
 
 --bb
I think building for MinGW on x86-32 is somewhat supported and running...exceptions and inline asm still have some issues...hopefully these issues will be resolved quickly. Thanks, K.Wilson
Exactly. Elrood has managed to compile a working LDC on x86-32 Windows using MinGW (he even provided binaries at one point), and since he hasn't reported otherwise, I'd expect that it's still working. LLVM-Windows in particular was broken in LLVM trunk sometimes, but now that we're using the 2.4-release branch, it should be more stable.

However, in order for Windows LDC to make quick progress, we'd need more Windows people who are willing to test, debug and fix things to try the compiler.
Oct 30 2008
prev sibling parent Robert Fraser <fraserofthenight gmail.com> writes:
K.Wilson wrote:
 Bill Baxter Wrote:
 [...]
Faaaaan Tastic! So how is Windows support coming along? Does it build smoothly with MinGW now? --bb
Hey Bill, I didn't see your question the other day...the windows support is still not up to par for x86-64 (whether trying to build in 32 or 64 bit mode). I couldn't get things built on my machine for Windows (but I don't have VStudio 2008, which is suggested because MinGW64 is not production quallity yet). I think building for MinGW on x86 is somewhat supported and running...exceptions and inline asm still have some issues...hopefully these issues will be resolved quickly. Thanks, K.Wilson
I was trying to get LDC going on Windows x64, but it's been slow going with school & everything else. I submitted a patch that gets it _compiling_ on VS (the patch is out-of-date, but it shouldn't be too hard to bring it up to date), but last time I checked (a couple weeks ago...) it would fail an assertion. It will take a lot of knowledge about LLVM internals to get LDC usable on this architecture+OS.

Then there's Windows IA64...
Oct 30 2008
prev sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
K. Wilson:
Seems like the LLVM backend is doing well (though I have seen other timings
where g++4.x beats llvm-g++4.x, so take from this what you will).<
Very nice. On Win in 100% of my programs and benchmarks llvm-gcc 2.3 turns out slower or quite slower than GCC 4.3.1, but the ratio is never bigger than about 2 times slower. So it's curious to see an example of the opposite.
I just thought I would let people know that ldc is coming along and performs
quite well, at this point.<
I have read that it compiles all Tango tests; this is a lot, because Tango is large and complex.

--------

The LDC docs say:
One thing the D spec isn't clear about at all is how asm blocks mixed with
normal D code (for example code between two asm blocks) interacts.<
Currently 'naked' in D is treated as a compile time error in LDC. Reason for
this is that LLVM does not support directly controlling prologue/epilogue
generation. Also the documentation from the D spec on this topic is extremely
limited and doesn't mention anything about how normal D code in a naked
function works. In particular local (stack) variables are unclear, also
accessing named parameters etc.<
I think Walter has more or less said he's interested in seeing LDC grow, so I presume such things can be asked of him, and he can give some answers that can help LDC a lot.

Bye,
bearophile
Oct 28 2008
parent reply Don <nospam nospam.com.au> writes:
bearophile wrote:
 K. Wilson:
 Seems like the LLVM backend is doing well (though I have seen other timings
where g++4.x beats llvm-g++4.x, so take from this what you will).<
Very nice. On Win in 100% of my programs and benchmarks llvm-gcc 2.3 turns out slower or quite slower than GCC 4.3.1, but the ratio is never bigger than about 2 times slower. So it's curious to see an example of the opposite.
 I just thought I would let people know that ldc is coming along and performs
quite well, at this point.<
I have read it compiles all Tango tests, this is a lot, because Tango is large and complex. -------- The LDC docs say:
 One thing the D spec isn't clear about at all is how asm blocks mixed with
normal D code (for example code between two asm blocks) interacts.<
 Currently 'naked' in D is treated as a compile time error in LDC. Reason for
this is that LLVM does not support directly controlling prologue/epilogue
generation. Also the documentation from the D spec on this topic is extremely
limited and doesn't mention anything about how normal D code in a naked
function works. In particular local (stack) variables are unclear, also
accessing named parameters etc.<
I can answer this. 'naked' in DMD is quite simple: almost nothing works.

Stack variables with 'naked' don't work. Parameters don't work, either. Nor do contracts. Here's why: regardless of whether 'naked' is specified or not, the compiler creates code exactly as if the function began with 'push EBP; mov EBP, ESP;' and ended with 'pop EBP;'. If 'naked' is specified, it just doesn't put that prologue and epilogue in. So the only way to use 'naked' is to manually keep track of where everything is on the stack, and index it off the stack pointer.

Why use 'naked' at all, then?
(1) so that you can use the EBP register;
(2) because non-naked asm doesn't work properly, either. If you pass an array into an asm function, you can't get the ".ptr" part of it, because the "ptr" conflicts with the asm "ptr" keyword. This is a big problem, since almost all asm functions that I write work on arrays.

I don't think that this is how naked asm _should_ work, though. It should allow you to reference parameters without adjusting the offsets assuming you're using EBP. So the following should work:

void nakedfunc(uint [] dest, uint [] src)
{
  asm {
    naked;
    mov ECX, dest.ptr[ESP];
  }
}

Since the spec says: "If the [EBP] is omitted, it is assumed for local variables. If naked is used, this no longer holds." True, it doesn't put in an [EBP], but it adjusts the offset assuming that a stack frame has been set up. And ".ptr" doesn't work.

Curious fact: whenever I'm naked, my body disappears!

void fkk()
{
   asm { naked; }
}

void main() { fkk(); }

--> generates a linker error. Fair enough, really. But kind of interesting.
Oct 28 2008
next sibling parent reply Tomas Lindquist Olsen <tomas famolsen.dk> writes:
Don wrote:
 bearophile wrote:
 The LDC docs say:

 One thing the D spec isn't clear about at all is how asm blocks mixed 
 with normal D code (for example code between two asm blocks) interacts.<
 Currently 'naked' in D is treated as a compile time error in LDC. 
 Reason for this is that LLVM does not support directly controlling 
 prologue/epilogue generation. Also the documentation from the D spec 
 on this topic is extremely limited and doesn't mention anything about 
 how normal D code in a naked function works. In particular local 
 (stack) variables are unclear, also accessing named parameters etc.<
I can answer this.
Thank you very much. This will help me a lot implementing 'naked' in LDC :)
 
 'naked' in DMD is quite simple: almost nothing works.
:P
 
 Stack variables with 'naked' don't work. Parameters don't work, either. 
 Nor do contracts. Here's why:
 
 Regardless of whether 'naked' is specified or not, the compiler creates 
 code exactly as if the function began with 'push EBP; mov EBP, ESP; ' 
 and ended with 'pop EBP; '
 If 'naked' is specified, it just doesn't put that prologue and epilogue in.
 So the only way to use 'naked' is to manually keep track of where 
 everything is on the stack, and index it off the stack pointer.
 
This actually makes things quite simple. LLVM does frame pointer elimination by default (we force it to emit one when normal inline asm is used, though), so this condition should just be changed to 'if (asmIsUsed && !isNaked) doEBP();'

There are a few other technical issues, mostly related to LLVM using an SSA form and function arguments being l-values in D, but that's not really important here...
 Why use 'naked' at all, then?
 (1) so that you can use the EBP register;
 (2) because non-naked asm doesn't work properly, either. If you pass an 
 array into an asm function, you can't get the ".ptr" part of it, because 
 the "ptr" conflicts with the asm "ptr" keyword. This is a big problem, 
 since almost all asm functions that I write work on arrays.
 
Could we work around this somehow? Actually, I'm not sure we even handle this kind of thing properly yet (accessing aggregate fields in asm). I should test that :)
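[For what it's worth, one hedged workaround at the source level — an illustration only, not necessarily how LDC should solve it: copy .ptr and .length into plain locals before the asm block, so the asm text never has to spell "dest.ptr".]

void copyArray(uint[] dest, uint[] src)
{
    // hoist the problematic .ptr accesses into named locals
    uint* dp = dest.ptr;
    uint* sp = src.ptr;
    size_t n = src.length;
    asm {
        push ESI;           // ESI/EDI must be preserved across the function
        push EDI;
        mov EDI, dp;
        mov ESI, sp;
        mov ECX, n;
        rep;
        movsd;              // copy ECX dwords from [ESI] to [EDI]
        pop EDI;
        pop ESI;
    }
}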
 I don't think that this is how naked asm _should_ work, though. It 
 should allow you to reference parameters without adjusting the offsets 
 assuming you're using EBP. So the following should work:
 
 void nakedfunc(uint [] dest, uint [] src)
 {
   asm {
     naked;
     mov ECX, dest.ptr[ESP];
   }
 }
So this should work since you've not modified ESP right? I'm not sure what facilities LLVM currently has to get stack offsets of parameters before the actual native codegen, probably none.. So this might be a bit difficult to implement in LDC, but again, I'm not really aware of all that LLVM's inline asm can really do, since there is basically no documentation, and we've only read so much of their source code :P
 
 Since the spec says:
 "If the [EBP] is omitted, it is assumed for local variables. If naked is 
 used, this no longer holds."
 True, it doesn't put in an [EBP], but it adjusts the offset assuming 
 that a stack frame has been set up. And ".ptr" doesn't work.
 
This .ptr issue certainly seems like something that should be fixed somehow :)
 
 Curious fact:
 Whenever I'm naked, my body disappears!
 
 void fkk()
 {
    asm { naked; }
 }
 
 void main() { fkk(); }
 
 --> generates a linker error. Fair enough, really. But kind of interesting.
This will not happen in LDC. It will just produce what it does without the naked:

_D3bar3fooFZv:
        ret

Again thanx for these explanations. Maybe we'll have 'naked' soon in LDC after all :)

-Tomas
Oct 28 2008
parent Don <nospam nospam.com.au> writes:
Tomas Lindquist Olsen wrote:
 Don wrote:
 bearophile wrote:
 The LDC docs say:

 One thing the D spec isn't clear about at all is how asm blocks 
 mixed with normal D code (for example code between two asm blocks) 
 interacts.<
 Currently 'naked' in D is treated as a compile time error in LDC. 
 Reason for this is that LLVM does not support directly controlling 
 prologue/epilogue generation. Also the documentation from the D spec 
 on this topic is extremely limited and doesn't mention anything 
 about how normal D code in a naked function works. In particular 
 local (stack) variables are unclear, also accessing named parameters 
 etc.<
I can answer this.
Thank you very much. This will help me a lot implementing 'naked' in LDC :)
 'naked' in DMD is quite simple: almost nothing works.
:P
 Stack variables with 'naked' don't work. Parameters don't work, 
 either. Nor do contracts. Here's why:

 Regardless of whether 'naked' is specified or not, the compiler 
 creates code exactly as if the function began with 'push EBP; mov EBP, 
 ESP; ' and ended with 'pop EBP; '
 If 'naked' is specified, it just doesn't put that prologue and 
 epilogue in.
 So the only way to use 'naked' is to manually keep track of where 
 everything is on the stack, and index it off the stack pointer.
This actually makes things quite simple, LLVM does frame pointer elimination by default (we force it to emit one when normal inline asm is used though), so this condition should just be changed to 'if (asmIsUsed && !isNaked) doEBP();' There's a few other technical issues, mostly related to LLVM using an SSA form and function arguments being l-values in D, but that's not really important here...
 Why use 'naked' at all, then?
 (1) so that you can use the EBP register;
 (2) because non-naked asm doesn't work properly, either. If you pass 
 an array into an asm function, you can't get the ".ptr" part of it, 
 because the "ptr" conflicts with the asm "ptr" keyword. This is a big 
 problem, since almost all asm functions that I write work on arrays.
Could we work around this somehow ? Actually I'm not sure we even handle this kind of thing properly yet (accessing aggregate fields in asm). I should test that :)
 I don't think that this is how naked asm _should_ work, though. It 
 should allow you to reference parameters without adjusting the offsets 
 assuming you're using EBP. So the following should work:

 void nakedfunc(uint [] dest, uint [] src)
 {
   asm {
     naked;
     mov ECX, dest.ptr[ESP];
     ret 4*4;
   }
 }
So this should work since you've not modified ESP right?
Yes, it's your responsibility to make sure that ESP points to the first parameter on the stack.

 I'm not sure what facilities LLVM currently has to get stack offsets of 
 parameters before the actual native codegen, probably none.. So this 
 might be a bit difficult to implement in LDC, but again, I'm not really 
 aware of all that LLVM's inline asm can really do, since there is 
 basically no documentation, and we've only read so much of their source 
 code :P
If it is only mov ECX, [ESP]param, then you can translate it to mov ECX, param. In fact, it would be even better if you could write mov ECX, param. If DMD would keep track of the number of pushes and pops that occurred, it could do this too.

Here's the kind of nonsense I'm doing at the moment. I create a constant 'LASTPARAM' which is the offset to the last parameter. The compiler should really be helping with this. But at least it works...

void foo(uint [] dest, uint[] left, uint [] right)
{
    enum { LASTPARAM = 6*4 } // 4 pushes + local + return address.
    asm {
        naked;
        push ESI;
        push EDI;
        push EBX;
        push EBP;
        push EAX;                          // local variable M
        mov EDI, [ESP + LASTPARAM + 4*5];  // dest.ptr
        mov EBX, [ESP + LASTPARAM + 4*2];  // left.length
        mov ESI, [ESP + LASTPARAM + 4*3];  // left.ptr
        ...
        mul int ptr [ESP];                 // M
        ...
        pop EAX;                           // get rid of M
        pop EBP;
        pop EBX;
        pop EDI;
        pop ESI;
        ret 6*4;
    }
}
 
Oct 29 2008
prev sibling parent reply Walter Bright <newshound1 digitalmars.com> writes:
Don wrote:
 Why use 'naked' at all, then?
 (1) so that you can use the EBP register;
 (2) because non-naked asm doesn't work properly, either. If you pass an 
 array into an asm function, you can't get the ".ptr" part of it, because 
 the "ptr" conflicts with the asm "ptr" keyword. This is a big problem, 
 since almost all asm functions that I write work on arrays.
(3) you want to write the entire function in assembler.
Oct 28 2008
parent reply Don <nospam nospam.com.au> writes:
Walter Bright wrote:
 Don wrote:
 Why use 'naked' at all, then?
 (1) so that you can use the EBP register;
 (2) because non-naked asm doesn't work properly, either. If you pass 
 an array into an asm function, you can't get the ".ptr" part of it, 
 because the "ptr" conflicts with the asm "ptr" keyword. This is a big 
 problem, since almost all asm functions that I write work on arrays.
(3) you want to write the entire function in assembler.
I think it probably should be illegal to include any non-asm code in a function containing naked asm. It would certainly make the spec simpler! In fact, I don't see how any other approach is really possible.
Oct 29 2008
parent reply Tomas Lindquist Olsen <tomas famolsen.dk> writes:
Don wrote:
 Walter Bright wrote:
 Don wrote:
 Why use 'naked' at all, then?
 (1) so that you can use the EBP register;
 (2) because non-naked asm doesn't work properly, either. If you pass 
 an array into an asm function, you can't get the ".ptr" part of it, 
 because the "ptr" conflicts with the asm "ptr" keyword. This is a big 
 problem, since almost all asm functions that I write work on arrays.
(3) you want to write the entire function in assembler.
I think it probably should be illegal to include any non-asm code in a function containing naked asm. It would certainly make the spec simpler! In fact, I don't see how any other approach is really possible.
Seems sensible to me. Walter, why is normal code even allowed?
Oct 29 2008
next sibling parent Walter Bright <newshound1 digitalmars.com> writes:
Tomas Lindquist Olsen wrote:
 Don wrote:
 Walter Bright wrote:
 (3) you want to write the entire function in assembler.
I think it probably should be illegal to include any non-asm code in a function containing naked asm. It would certainly make the spec simpler! In fact, I don't see how any other approach is really possible.
Seems sensible to me. Walter, why is normal code even allowed ?
You might want to build your own adjuster thunks.
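[For context, a minimal sketch of what a hand-built adjuster thunk could look like in naked asm. The offset, the target function, and its linkage are all hypothetical; the one real assumption is DMD's convention of passing 'this' in EAX.]

extern (C) void realMethod();   // hypothetical jump target

void adjusterThunk()
{
    asm {
        naked;
        add EAX, 8;         // shift the incoming 'this' pointer by a
                            // fixed (made-up) offset
        jmp realMethod;     // tail-jump to the real implementation
    }
}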
Oct 30 2008
prev sibling parent reply Walter Bright <newshound1 digitalmars.com> writes:
Tomas Lindquist Olsen wrote:
 Seems sensible to me. Walter, why is normal code even allowed ?
I should add that 'naked' is for people who don't mind getting intimate with how the compiler arranges things. Any code using it should expect it to not be portable, and be past the compiler wagging its finger at them.
Oct 30 2008
parent reply Christian Kamm <kamm-incasoftware removethis.de> writes:
Walter Bright wrote:
 I should add that 'naked' is for people who don't mind getting intimate
 with how the compiler arranges things. Any code using it should expect
 it to not be portable, and be past the compiler wagging its finger at
 them.
Here portable means 'portable between different compilers for the same architecture'? I'm wondering because I thought the point of specifying inline assembler was exactly to guarantee that kind of portability. So you would not consider a compiler that does not implement naked, or handles it differently from DMD to be breaking the D specification?
Oct 30 2008
parent reply Walter Bright <newshound1 digitalmars.com> writes:
Christian Kamm wrote:
 Walter Bright wrote:
 I should add that 'naked' is for people who don't mind getting intimate
 with how the compiler arranges things. Any code using it should expect
 it to not be portable, and be past the compiler wagging its finger at
 them.
Here portable means 'portable between different compilers for the same architecture'? I'm wondering because I thought the point of specifying inline assembler was exactly to guarantee that kind of portability. So you would not consider a compiler that does not implement naked, or handles it differently from DMD to be breaking the D specification?
Since the stack prolog/epilog is not defined in the D ABI, a compiler is free to innovate in this area. The reason to specify the inline assembler is to have a common ground on the assembler syntax.
Oct 30 2008
next sibling parent Christian Kamm <kamm-incasoftware removethis.de> writes:
 Walter Bright wrote:
 I should add that 'naked' is for people who don't mind getting intimate
 with how the compiler arranges things. Any code using it should expect
 it to not be portable, and be past the compiler wagging its finger at
 them.
Christian Kamm wrote:
 Here portable means 'portable between different compilers for the same 
 architecture'? I'm wondering because I thought the point of specifying 
 inline assembler was exactly to guarantee that kind of portability. So 
 you would not consider a compiler that does not implement naked, or 
 handles it differently from DMD, to be breaking the D specification?
Walter Bright wrote:
 Since the stack prolog/epilog is not defined in the D ABI, a compiler is 
 free to innovate in this area. The reason to specify the inline 
 assembler is to have a common ground on the assembler syntax.
Thanks for the clarification. We'll not worry about it too much then - though I think that from the conversation with Don, Tomas now has a good idea of what we'd need to do for them to behave similarly.
Oct 31 2008
prev sibling parent Don <nospam nospam.com> writes:
Walter Bright wrote:
 Christian Kamm wrote:
 Walter Bright wrote:
 I should add that 'naked' is for people who don't mind getting intimate
 with how the compiler arranges things. Any code using it should expect
 it to not be portable, and be past the compiler wagging its finger at
 them.
Here portable means 'portable between different compilers for the same architecture'? I'm wondering because I thought the point of specifying inline assembler was exactly to guarantee that kind of portability. So you would not consider a compiler that does not implement naked, or handles it differently from DMD to be breaking the D specification?
Since the stack prolog/epilog is not defined in the D ABI, a compiler is free to innovate in this area.
That's great, and should be in the documentation.

However, I don't think there's any reason for the prolog to interfere with an all-assembler naked function. Accessing a stack parameter by name from inside a naked asm function should either be:

* defined as implementation-dependent; or
* assume nothing about the way the variable is referenced, and be independent of the stack prolog. (This could also be achieved with another 'magic' constant, similar to LOCALSIZE, giving the size of the stack frame for that function; so that var+FRAMESIZE is constant, regardless of the vendor.)

For now, you could add a line to the docs stating that "accessing stack params and variables from naked asm functions is not yet standardized".
 The reason to specify the inline assembler is to have a common ground on 
 the assembler syntax.
Nov 03 2008
prev sibling parent bearophile <bearophileHUGS lycos.com> writes:
downs:
If I change all my opFoo's to opFooAssign's, and use those instead, speed goes
up from 16s to 13s; indicating that returning large structs (12 bytes/vector)
causes a significant speed hit.<
Tim Burrell:
 Yeah, I was about to say the same.  See here:
Yep, see my TinyVector ;-)

Bye,
bearophile
Feb 15 2008
prev sibling parent downs <default_357-line yahoo.de> writes:
Another other observation: GDC's std.math functions still aren't being inlined
properly, forcing me to use the intrinsics manually.

That didn't cause the speed difference though.

Still, it would be nice to see it fixed some time soon, seeing as I filed the
bug in November :)

 --downs
Feb 15 2008
prev sibling parent "Saaa" <empty needmail.com> writes:
With a little bit of commenting, this could be an excellent tutorial. 
Feb 15 2008