www.digitalmars.com         C & C++   DMDScript  

c++ - alignment of automatic variables in DMC

reply "Laurentiu Pancescu" <lpancescu fastmail.fm> writes:
Here's a test case:

/* test.c */
#include <stdio.h>
#include <time.h>
int main( int argc, char *argv[] )
{
  int i;
  double x, y, z;
  clock_t now;
  printf("i %p, x %p, y %p, z %p\n",
         (void *)&i, (void *)&x, (void *)&y, (void *)&z);
  now = clock();
  z = 0;
  for( i = 1; i < 200000000; i++ ) {
    x = i - 1;
    y = x - 1;
    y = x * y;
    z += y;
  }
  printf("%g\n", z );
  printf("elapsed time: %g\n", (double)(clock() - now) / CLOCKS_PER_SEC);
  return 0;
}

If compiled with -o+all, the double variables are aligned at a 4 byte
boundary, while -o+space makes them aligned at 8 byte boundary, leading to a
significantly better performance (just try it!).  A workaround is to declare
the int *after* the doubles, and still compile with -o+all.  This trick
doesn't work with BCC, because it thinks it knows better, and rearranges the
order of variables on the stack, so you can't avoid performance loss for BCC
5.5.1, AFAIK.
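
A quick way to check where the doubles land, without reading the raw
pointers by eye, is to print each address modulo 8.  This is only a sketch:
the helper names misalignment and report_double_alignment are made up for
illustration, and it assumes a compiler that provides <stdint.h> with
uintptr_t (which a 2002-era compiler may not).

```c
#include <stdint.h>
#include <stdio.h>

/* Distance of addr from the previous multiple of `boundary`.
   `boundary` must be a power of two; 0 means the address is aligned. */
unsigned misalignment(uintptr_t addr, uintptr_t boundary)
{
    return (unsigned)(addr & (boundary - 1));
}

/* Prints how far three automatic doubles sit from an 8-byte boundary.
   The values depend on the compiler, its flags, and declaration order. */
void report_double_alignment(void)
{
    double x = 0, y = 0, z = 0;

    printf("x %% 8 = %u, y %% 8 = %u, z %% 8 = %u\n",
           misalignment((uintptr_t)&x, 8),
           misalignment((uintptr_t)&y, 8),
           misalignment((uintptr_t)&z, 8));
}
```

Calling report_double_alignment() from main and comparing a -o+all build
with a -o+space build should show 4 versus 0 if the behaviour described
above is reproduced.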

GCC seems to align almost anything, including char[] vectors, at 8 or 16
byte boundaries, so it always provides best performance.  If you have gcc,
use "-O9 -funroll-loops -mcpu=pentiumpro", to compare the speed.

I think it would be very nice if DMC would get smarter about this (I use an
AMD Athlon - are other x86 processors less sensitive about this?), but
that's up to Walter, isn't it?

Laurentiu
Feb 05 2002
next sibling parent "Walter" <walter digitalmars.com> writes:
Thanks for tracking this down! I'll definitely look into a fix. -Walter
Feb 05 2002
prev sibling parent reply "Walter" <walter digitalmars.com> writes:
Interestingly, this makes a 3:1 difference in speed on my machine. The
problem, however, is it's not related to optimization. It's just the lay of
how things wind up on the stack. The calling conventions specify a 4 byte
aligned stack. I don't see at the moment how dynamically adjusting it to 8
bytes within a function is going to work.

-Walter
Feb 06 2002
parent reply "Laurentiu Pancescu" <lpancescu fastmail.fm> writes:
The speed increase is about the same factor on my Athlon (exec time 14
seconds, as opposed to 4), and, since I saw that -o+space aligned auto
variables at 8 bytes in the 2 programs that I used for testing, I assumed it
was no coincidence.

I'm not very sure what you mean by "dynamically adjusting the stack to 8
bytes", so I'm sorry if the following doesn't match the *real* meaning of
your message.

GCC doesn't seem to do any special handling inside the stack frame code, so
I guess it knows it starts with an aligned stack, and manages to keep that
alignment somehow (maybe it adds unused bytes in every function call, so any
called function also starts with an aligned stack?).  Doing this might break
compatibility with other people's ABI... I don't know exactly, but it
doesn't sound like a good solution for DMC.

What I propose is to dynamically adjust the stack in each function, like in
the following example, written in NASM (sorry, I'm pretty bad at MASM/TASM
syntax):

segment test public use32 class=CODE

; int test(int x)
; {
;   int t;
;   double a, b;
;   t = x + x;
;   return t;
; }

global _test
_test:
        push ebp                ; save EBP, since we use it for
        mov ebp, esp            ; accessing local parameters
        and esp, 0xFFFFFFF8     ; align the stack at 8 byte boundary
                                ; (ESP normally decreases, so this is okay)
        add esp, -24            ; reserve space for local vars
                                ; (compiler rearranges vars: doubles first,
                                ; then the int, referring to an hypothetical
                                ; push order):
                                ; - a   [ESP + 16]
                                ; - b   [ESP + 8]
                                ; - t   [ESP + 0] (4 bytes needed, just
                                ;                  alignment demo)
        mov eax, [ebp + 8]      ; EAX <- local param 'x'
        add eax, eax            ; calculate value for 'x + x'
        mov [esp], eax          ; 't' <- EAX
        mov esp, ebp            ; restore the value that ESP had, after EBP
                                ; was pushed, but *before* alignment
        pop ebp                 ; restore EBP (LEAVE also works, but like
                                ; this is clearer)
        retn                    ; return value is in EAX, as normal

I hope that your news client won't ruin my nice NASM code formatting... :)

I think this approach is relatively inexpensive, and allows the compiler to
do proper alignment for local variables, since it knows it always starts
with an 8-byte aligned stack (not true for the parameters, if you're
called from some non-DMC code, but oh well!).  Even more, DMC could do normal
stack frame for static functions, since they can only be called from the
same module, and all functions ensure that the stack is 8 byte aligned
before they call any other function.  What do you think?
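
The address arithmetic behind that prologue can be sketched in C.  This only
models the ESP values, it is not generated code, and the function names and
the sample stack address in the note below are invented for illustration.

```c
#include <stdint.h>

/* Round a 32-bit stack pointer down to a multiple of 8, mirroring
   `and esp, 0xFFFFFFF8`.  Rounding down is safe because the stack
   grows toward lower addresses. */
uint32_t align_down_8(uint32_t esp)
{
    return esp & 0xFFFFFFF8u;
}

/* Bytes skipped by the alignment step: 0 or 4 when the caller keeps
   the stack 4-byte aligned. */
uint32_t alignment_waste(uint32_t esp)
{
    return esp - align_down_8(esp);
}

/* Address of local 'a' in the example, given ESP at function entry
   (pointing at the return address). */
uint32_t address_of_a(uint32_t esp_at_entry)
{
    uint32_t esp = esp_at_entry - 4u;   /* push ebp                  */
    esp = align_down_8(esp);            /* and esp, 0xFFFFFFF8       */
    esp -= 24u;                         /* add esp, -24              */
    return esp + 16u;                   /* a lives at [ESP + 16]     */
}
```

For an entry ESP of 0x0012FF80, the push leaves 0x0012FF7C, the mask gives
0x0012FF78, and 'a' ends up at 0x0012FF70, which is a multiple of 8.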

Laurentiu
Feb 06 2002
next sibling parent reply "Walter" <walter digitalmars.com> writes:
The trouble is, if I align ESP, then the function can't access the passed
parameters any more with a fixed ESP offset. What you're doing is accessing
the parameters with EBP, and the locals with ESP. I'd thought of that, too,
but it's a significant recoding of the code generator. -Walter
Feb 06 2002
next sibling parent Jan Knepper <jan smartsoft.cc> writes:
 The trouble is, if I align ESP, then the function can't access the passed
 parameters any more with a fixed ESP offset. What you're doing is accessing
 the parameters with EBP, and the locals with ESP. I'd thought of that, too,
 but it's a significant recoding of the code generator. -Walter
Nevertheless sounds like something you would do anyways...

Jan
Feb 06 2002
prev sibling parent Heinz Saathoff <hsaat bre.ipnet.de> writes:
Walter wrote...
 The trouble is, if I align ESP, then the function can't access the passed
 parameters any more with a fixed ESP offset. What you're doing is accessing
 the parameters with EBP, and the locals with ESP. I'd thought of that, too,
 but it's a significant recoding of the code generator. -Walter
Maybe it's not necessary to adjust EBP or ESP when you know that at startup
ESP is aligned to 8. The calling function must pass parameters aligned and
call the function (now the stack is only 4 byte aligned); the callee then
creates the stack frame by pushing the old EBP (now the stack is aligned to
8 again). Now make sure that every auto-var is aligned to 8. That's it! Or
have I missed a point?

Regards,
Heinz
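
The bookkeeping in this scheme can be followed by tracking ESP modulo 8.
The sketch below is only a model under the stated assumption that ESP is a
multiple of 8 before the caller pushes anything; frame_base_mod8 is a
made-up name.

```c
/* ESP modulo 8 right after the callee's `push ebp`, assuming ESP was a
   multiple of 8 before the caller pushed `param_bytes` of arguments.
   Only the low three bits matter, so unsigned wraparound is harmless. */
unsigned frame_base_mod8(unsigned param_bytes)
{
    unsigned esp = 0u;                 /* aligned starting point      */

    esp = (esp - param_bytes) & 7u;    /* caller pushes the arguments */
    esp = (esp - 4u) & 7u;             /* CALL pushes return address  */
    esp = (esp - 4u) & 7u;             /* callee executes push ebp    */
    return esp;
}
```

A result of 0 means the frame base is 8-byte aligned.  That holds when the
arguments occupy a multiple of 8 bytes (0, 8, 16, ...), and fails with a
remainder of 4 for sizes such as 4 or 12, which is why the scheme needs the
caller to pass its parameters aligned.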
Feb 07 2002
prev sibling next sibling parent reply "Walter" <walter digitalmars.com> writes:
I did some investigating. GCC does some fiddling so that each function
starts out with an aligned stack. This option will be a bit clumsy for DMC,
since I don't have control over the function calling conventions. After
spending several hours not being able to get it out of my mind <g>, I
figured out a way to do it that has almost no impact on generated code. I
can hide nearly all the stack adjustments in code that already
adds/subtracts from ESP so that once the stack is 8 byte aligned, it stays
that way.

Unfortunately, this doesn't work for parameters, i.e. if you call with
(double x, int y, double z) they're not going to be aligned. It also doesn't
work if some foreign code calls you with a misaligned stack. Oh well. I'll
email you the fix so you can try it out (it happens with -o or -o+speed).
Feb 07 2002
next sibling parent Roland <rv ronetech.com> writes:
Walter wrote:

 Unfortunately, this doesn't work for parameters, i.e. if you call with
 (double x, int y, double z) they're not going to be aligned. It also doesn't
 work if some foreign code calls you with a misaligned stack. Oh well. I'll
 email you the fix so you can try it out (it happens with -o or -o+speed).
Can I try it too? (Is it complicated? I use the IDDE!)

Roland
Feb 07 2002
prev sibling parent reply "Laurentiu Pancescu" <lpancescu fastmail.fm> writes:
"Walter" <walter digitalmars.com> wrote in message
news:a3tdb7$i5i$1 digitaldaemon.com...
 I'll email you the fix so you can try it out (it happens with -o
 or -o+speed).

I must confess that I checked my email every 5 minutes in the last 2
days... :)  Will this fix be available in the 8.27 release?  I can hardly
wait to look at the COD file that the new compiler will generate!

Laurentiu
Feb 08 2002
parent reply "Walter" <walter digitalmars.com> writes:
"Laurentiu Pancescu" <lpancescu fastmail.fm> wrote in message
news:a412rr$oj7$1 digitaldaemon.com...
 "Walter" <walter digitalmars.com> wrote in message
 news:a3tdb7$i5i$1 digitaldaemon.com...
 I'll email you the fix so you can try it out (it happens with -o
 or -o+speed).
 I must confess that I checked my email every 5 minutes in the last 2
 days... :)  Will this fix be available in the 8.27 release?  I can hardly
 wait to look at the COD file that the new compiler will generate!

I emailed it to you, but your email server bounced it saying it didn't like
attachments. Got an email address that can handle large attachments?
Feb 08 2002
parent reply "Laurentiu Pancescu" <lpancescu fastmail.fm> writes:
"Walter" <walter digitalmars.com> wrote in message
news:a419th$2knh$1 digitaldaemon.com...
 I emailed it to you, but your email server bounced it saying it didn't
 like attachments. Got an email address that can handle large attachments?

Fastmail claims there's no limit on the size of files I send, and that I
have 100M quota... strange!  Is it possible to put it somewhere (http or
ftp), and just send me the link?  Even scp is fine... :)

Thanks,
Laurentiu
Feb 08 2002
parent reply "Walter" <walter digitalmars.com> writes:
"Laurentiu Pancescu" <lpancescu fastmail.fm> wrote in message
news:a41du4$2n6m$1 digitaldaemon.com...
 "Walter" <walter digitalmars.com> wrote in message
 news:a419th$2knh$1 digitaldaemon.com...
 I emailed it to you, but your email server bounced it saying it didn't
 like attachments. Got an email address that can handle large attachments?
 Fastmail claims there's no limit on the size of files I send, and that I
 have 100M quota... strange!  Is it possible to put it somewhere (http or
 ftp), and just send me the link?  Even scp is fine... :)
 Thanks, Laurentiu

Here's what I get:

--------------------------------------------------------------------------

This is the Postfix program at host fastmail.fm.

I'm sorry to have to inform you that the message returned
below could not be delivered to one or more destinations.

For further assistance, please send mail to <postmaster>

If you do so, please include this problem report. You can
delete your own text from the message returned below.

The Postfix program

<lpancescu fastmail.fm>: host localhost[127.0.0.1] said: 552 Uuencoded
    attachments not accepted

--------------------------------------------------------------------------
Feb 08 2002
parent "Laurentiu Pancescu" <lpancescu fastmail.fm> writes:
Strange... could you please try sending the attachment as MIME?  I always
used MIME when sending something from work to my fastmail account, and it
worked (I think the largest attachment was about 500k).

Laurentiu
Feb 08 2002
prev sibling parent reply Roland <rv ronetech.com> writes:
Laurentiu Pancescu wrote:

 GCC doesn't seem to do any special handling inside the stack frame code, so
 I guess it knows it starts with an aligned stack, and manages to keep that
 alignment somehow (maybe it adds unused bytes in every function call, so any
 called function also starts with an aligned stack?).  Doing this might break
 compatibility with other people's ABI... I don't know exactly, but it
 doesn't sound like a good solution for DMC.
Why not?
If the stack starts aligned, just make sure it stays so.
The compiler can help:
- for parameters: if the total size of the parameters is not a multiple of
  4 (or 8), it can push some dummy bytes so that the stack stays aligned.
  Unaligned parameters can be slow to access, but at least the stack is
  aligned at function entry.
  For the Pascal calling convention, the compiler still has to remove the
  dummy bytes with add esp.
- for local data, it is the same.
We can imagine all parameters being aligned (push 7 dummy bytes and one
significant byte for a char parameter).
The problem is compatibility with other modules linked with DMC.
The optimizer can do this only for functions in the same module as the one
currently being compiled.
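
The padding described here is ordinary round-up arithmetic.  A small
sketch, with hypothetical helper names and assuming the alignment is a
power of two:

```c
/* Round `n` up to the next multiple of `align` (a power of two). */
unsigned round_up(unsigned n, unsigned align)
{
    return (n + align - 1u) & ~(align - 1u);
}

/* Dummy bytes the caller would push so that the parameter area
   occupies a multiple of `align` bytes. */
unsigned dummy_bytes(unsigned param_bytes, unsigned align)
{
    return round_up(param_bytes, align) - param_bytes;
}
```

For example, a lone char parameter padded to 8 bytes needs 7 dummy bytes,
matching the case mentioned above, while 12 bytes of parameters need 4.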
 What I propose is to dynamically adjust the stack in each function, like in
 the following example, written in NASM (sorry, I'm pretty bad at MASM/TASM
 syntax):
That seems to me like "a plaster on a wooden leg".

Roland
Feb 07 2002
parent reply "Walter" <walter digitalmars.com> writes:
What you suggest appears to be what GCC does. What I do is slightly
different. The called function, not the caller, figures out how many
parameter bytes were pushed. Then, the size of the frame for the automatics
is adjusted so the grand total works out to be a multiple of 8. The beauty
of this is that most of the time no extra code is generated.

There are several special cases and complications with this, but I think I
took care of all of them but the case of a varargs function. I've deferred
fixing that for the moment.

-Walter
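
As a rough illustration of that adjustment (not the actual DMC code
generator logic, which is not shown in this thread), the frame size could
be padded as below.  The sketch assumes that the "grand total" is the
parameter bytes plus a 4-byte return address plus a 4-byte saved EBP plus
the locals, and that the caller's ESP was a multiple of 8 before the
parameters were pushed; padded_frame_size is an invented name.

```c
/* Size to reserve for automatics so that
   param_bytes + return address + saved EBP + frame is a multiple of 8.
   Assumption: 4-byte return address and 4-byte saved EBP (32-bit x86). */
unsigned padded_frame_size(unsigned param_bytes, unsigned locals_bytes)
{
    unsigned total = param_bytes + 4u + 4u + locals_bytes;
    unsigned pad   = (8u - total % 8u) % 8u;   /* 0..7 extra bytes */

    return locals_bytes + pad;
}
```

Because the function already subtracts a constant from ESP to make room
for its locals, changing that constant from, say, 20 to 24 costs no extra
instructions, which fits the remark that most of the time no extra code is
generated.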
Feb 07 2002
parent reply Roland <rv ronetech.com> writes:
Walter wrote:

 What I do is slightly different. The called function, not the caller,
 figures out how many parameter bytes were pushed. Then, the size of the
 frame for the automatics is adjusted so the grand total works out to be a
 multiple of 8. The beauty of this is that most of the time no extra code is
 generated.
Nice!
 There are several special cases and complications with this, but I think I
 took care of all of them but the case of a varargs function. I've deferred
 fixing that for the moment.
For varargs, just a warning in the manual is enough as far as I'm concerned.

Roland
Feb 07 2002
parent "Walter" <walter digitalmars.com> writes:
"Roland" <rv ronetech.com> wrote in message
news:3C627B5E.1E0D5EFF ronetech.com...
 Walter wrote:

 What I do is slightly different. The called function, not the caller,
 figures out how many parameter bytes were pushed. Then, the size of the
 frame for the automatics is adjusted so the grand total works out to be a
 multiple of 8. The beauty of this is that most of the time no extra code
 is generated.
 Nice!
 There are several special cases and complications with this, but I think
 I took care of all of them but the case of a varargs function. I've
 deferred fixing that for the moment.
 For varargs, just a warning in the manual is enough as far as I'm
 concerned.

The only varargs functions that mean anything anyway are printf and scanf,
and they don't do heavy loops with doubles. So, while for completeness it
should be fixed, as a practical matter it is irrelevant.
Feb 07 2002