
digitalmars.D - memcpy vs slice copy

reply bearophile <bearophileHUGS lycos.com> writes:
While doing some string processing I've seen some unusual timings compared to
the C code, so I have written this to see the situation better.
When USE_MEMCPY is false this little benchmark runs about 3+ times slower:

import std.c.stdlib: malloc;
import std.c.string: memcpy;
import std.stdio: writefln;

const bool USE_MEMCPY = true;

void main() {
    const int N = 100_000_000;

    auto h = "hello\n";
    char* ptr = cast(char*)malloc(N * h.length);
    char* p = ptr;

    for (int i; i < (h.length * N); i += h.length) {
        static if (USE_MEMCPY)
            memcpy(p, h.ptr, h.length);
        else
            p[0 .. h.length] = h;

        p += h.length;
    }

    if (N <= 100) // to see if it works decrease N
        writefln(ptr[0 .. N * h.length]);
}

As usual with benchmarks I may be missing things (especially because now it's
quite late here), so keep your eyes open.

Bye,
bearophile
Mar 14 2009
next sibling parent reply Moritz Warning <moritzwarning web.de> writes:
On Sat, 14 Mar 2009 23:50:58 -0400, bearophile wrote:

 While doing some string processing I've seen some unusual timings
 compared to the C code, so I have written this to see the situation
 better. When USE_MEMCPY is false this little benchmark runs about 3+
 times slower:
I did a little benchmark:

ldc -release -O5
true:  0.51
false: 0.63

dmd -release -O
true:  4.47
false: 3.58

I don't see a very big difference between slice copying and memcpy
(but between compilers).

Btw.: http://www.digitalmars.com/pnews/read.php?server=news.digitalmars.com&group=digitalmars.D.bugs&artnum=14933
Mar 15 2009
next sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Moritz Warning:
 I don't see a very big difference between slice copying and memcpy (but 
 between compilers).
I have taken the times again. My timings, best of 4:

true:  1.33 s
false: 4.28 s

I have used dmd 1.041, with Phobos, on WinXP 32 bit, 2 GB RAM, CPU
Core 2 at 2 GHz.

This may be another little benchmark for Robert Clipsham :-)

Bye,
bearophile
Mar 15 2009
parent reply bearophile <bearophileHUGS lycos.com> writes:
For reference here's a simple C version:

#include "stdlib.h"
#include "string.h"
#include "stdio.h"

#define N 100000000
#define L 6

char h[L] = "hello\n";

int main() {
    char *ptr;
    if (N <= 100)
        ptr = malloc(N * L + 1); // the +1 is for the final printing
    else
        ptr = malloc(N * L);

    char *p = ptr;
    int i;

    for (i = 0; i < N; i++) {
        memcpy(p, h, L);
        p += L;
    }

    // to see if it works decrease N
    if (N <= 100) {
        ptr[N * L] = '\0';
        printf("%s", ptr);
    }
    return 0;
}

It takes 0.59 s to run on the same PC and OS as before, with both GCC
and LLVM-GCC.
(Note that the loop in this code is a bit different now; the end result
is the same.)

The ASM of the inner loop:

L:  movl    _h, %eax
    movl    %eax, (%edx)
    movzwl  _h+4, %eax
    movw    %ax, 4(%edx)
    addl    $6, %edx
    cmpl    %ecx, %edx
    jne L

Bye,
bearophile
Mar 15 2009
parent reply Sergey Gromov <snake.scaly gmail.com> writes:
Sun, 15 Mar 2009 10:31:10 -0400, bearophile wrote:

 The ASM of the inner loop:
 
 L:  movl    _h, %eax
     movl    %eax, (%edx)
     movzwl  _h+4, %eax
     movw    %ax, 4(%edx)
     addl    $6, %edx
     cmpl    %ecx, %edx
     jne L
Obviously, a memcpy intrinsic is at work here. DMD calls an actual CRT function instead.
Mar 15 2009
parent bearophile <bearophileHUGS lycos.com> writes:
Sergey Gromov:
 Obviously, a memcpy intrinsic is at work here.<
Yes, gcc is able to recognize some calls to C library functions and
replace them with intrinsics. I think LDC too uses an intrinsic to copy
the memory of a slice.

This isn't a very interesting benchmark; there's nothing much new to
see here.

Bye,
bearophile
Mar 15 2009
prev sibling next sibling parent Moritz Warning <moritzwarning web.de> writes:
On Sun, 15 Mar 2009 13:17:50 +0000, Moritz Warning wrote:

 On Sat, 14 Mar 2009 23:50:58 -0400, bearophile wrote:
 
 While doing some string processing I've seen some unusual timings
 compared to the C code, so I have written this to see the situation
 better. When USE_MEMCPY is false this little benchmark runs about 3+
 times slower:
 I did a little benchmark:

 ldc -release -O5
 true:  0.51
 false: 0.63

 dmd -release -O
 true:  4.47
 false: 3.58
My system is 64 bit, with dmd 1.041 and ldc from trunk. Since dmd
produces 32 bit binaries only, this might account for a factor of two
in the timings.
Mar 15 2009
prev sibling parent reply Sergey Gromov <snake.scaly gmail.com> writes:
Sun, 15 Mar 2009 13:17:50 +0000 (UTC), Moritz Warning wrote:

 On Sat, 14 Mar 2009 23:50:58 -0400, bearophile wrote:
 
 While doing some string processing I've seen some unusual timings
 compared to the C code, so I have written this to see the situation
 better. When USE_MEMCPY is false this little benchmark runs about 3+
 times slower:
 I did a little benchmark:

 ldc -release -O5
 true:  0.51
 false: 0.63

 dmd -release -O
 true:  4.47
 false: 3.58

 I don't see a very big difference between slice copying and memcpy
 (but between compilers).

 Btw.: http://www.digitalmars.com/pnews/read.php?server=news.digitalmars.com&group=digitalmars.D.bugs&artnum=14933
The original benchmark swapped insanely on my 1GB laptop so I've cut
the number of iterations in half, to 50_000_000. Compiled with
-O -release -inline. Results:

slice:  2.31
memcpy: 0.73

That's 3 times difference. Disassembly:

slice:

L31:  mov  ECX,EDX
      mov  EAX,6
      lea  ESI,010h[ESP]
      mov  ECX,EAX
      mov  EDI,EDX
      rep  movsb
      add  EDX,6
      add  EBX,6
      cmp  EBX,011E1A300h
      jb   L31

memcpy:

L35:  push 6
      lea  ECX,014h[ESP]
      push ECX
      push EBX
      call near ptr _memcpy
      add  EBX,6
      add  ESI,6
      add  ESP,0Ch
      cmp  ESI,011E1A300h
      jb   L35

Seems like rep movsb is /way/ sub-optimal for copying data.
Mar 15 2009
parent reply Don <nospam nospam.com> writes:
Sergey Gromov wrote:
 Sun, 15 Mar 2009 13:17:50 +0000 (UTC), Moritz Warning wrote:
 
 On Sat, 14 Mar 2009 23:50:58 -0400, bearophile wrote:

 While doing some string processing I've seen some unusual timings
 compared to the C code, so I have written this to see the situation
 better. When USE_MEMCPY is false this little benchmark runs about 3+
 times slower:
 I did a little benchmark:

 ldc -release -O5
 true:  0.51
 false: 0.63

 dmd -release -O
 true:  4.47
 false: 3.58

 I don't see a very big difference between slice copying and memcpy
 (but between compilers).

 Btw.: http://www.digitalmars.com/pnews/read.php?server=news.digitalmars.com&group=digitalmars.D.bugs&artnum=14933
 The original benchmark swapped insanely on my 1GB laptop so I've cut
 the number of iterations in half, to 50_000_000. Compiled with
 -O -release -inline. Results:

 slice:  2.31
 memcpy: 0.73

 That's 3 times difference. Disassembly:

 slice:

 L31:  mov  ECX,EDX
       mov  EAX,6
       lea  ESI,010h[ESP]
       mov  ECX,EAX
       mov  EDI,EDX
       rep  movsb
       add  EDX,6
       add  EBX,6
       cmp  EBX,011E1A300h
       jb   L31

 memcpy:

 L35:  push 6
       lea  ECX,014h[ESP]
       push ECX
       push EBX
       call near ptr _memcpy
       add  EBX,6
       add  ESI,6
       add  ESP,0Ch
       cmp  ESI,011E1A300h
       jb   L35

 Seems like rep movsb is /way/ sub-optimal for copying data.
Definitely! The difference ought to be bigger than a factor of 3.
Which means that memcpy probably isn't anywhere near optimal, either.

rep movsd is always 4 times quicker than rep movsb. There's a range of
lengths for which rep movsd is optimal; outside that range, there are
other options which are even faster. So there's a factor of 4-8
speedup available on most memory copies. Low-hanging fruit! <g>
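To make that concrete, here's a sketch of a dword-based copy in
DMD-style x86 inline asm (illustrative only, assuming a 32 bit target;
copyDwords is a made-up name, not a runtime routine):

// A sketch only: bulk copy with rep movsd, rep movsb for the tail.
void copyDwords(void* dst, void* src, size_t n) {
    asm {
        push ESI;      // ESI/EDI must be preserved
        push EDI;
        mov ESI, src;
        mov EDI, dst;
        mov ECX, n;
        mov EDX, ECX;
        shr ECX, 2;    // number of whole dwords
        rep;
        movsd;         // 4 bytes per iteration instead of 1
        mov ECX, EDX;
        and ECX, 3;    // 0..3 leftover bytes
        rep;
        movsb;
        pop EDI;
        pop ESI;
    }
}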
Mar 16 2009
next sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Don:
Which means that memcpy probably isn't anywhere near optimal, either.<
Some time ago I read an article by AMD showing that on modern CPUs
there are indeed ways to go much faster, using vector asm instructions,
loop unrolling and explicit cache prefetching (but it's useful with
longer arrays only). Profile-driven optimization can tell you whether a
particular call site usually copies a lot of data, and pick that kind
of copy for it; for smaller copies it's overkill: slow, and too much
code for the cache. As an alternative the programmer may add an
annotation to choose the copy strategy, but that is not nice.

Bye,
bearophile
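The strategy choice can be sketched as a length-based dispatch. The
thresholds below are invented for illustration, and memcpy stands in
for the tuned prefetching/vector routine the article describes:

import std.c.string: memcpy;

// A sketch of picking a copy strategy by length; real thresholds
// would come from profiling.
void smartCopy(void* dst, void* src, size_t n) {
    if (n < 16) {
        // tiny: a plain byte loop avoids any call overhead
        ubyte* d = cast(ubyte*)dst;
        ubyte* s = cast(ubyte*)src;
        for (size_t i = 0; i < n; i++)
            d[i] = s[i];
    } else if (n < 64 * 1024) {
        memcpy(dst, src, n);  // mid-size: the CRT routine is fine
    } else {
        // large: an AMD-style unrolled, prefetching, vector copy
        // would go here; memcpy is only a stand-in
        memcpy(dst, src, n);
    }
}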
Mar 16 2009
next sibling parent reply Jarrett Billingsley <jarrett.billingsley gmail.com> writes:
On Mon, Mar 16, 2009 at 8:43 AM, bearophile <bearophileHUGS lycos.com> wrote:
 Don:
Which means that memcpy probably isn't anywhere near optimal, either.<
 Some time ago I read an article by AMD showing that on modern CPUs
 there are indeed ways to go much faster, using vector asm
 instructions, loop unrolling and explicit cache prefetching
I'm actually kind of shocked that, given the prevalence of memory block
copy operations, more CPUs haven't implemented it as a basic
instruction.  Yes, RISC is nice, but geez, this seems like a
no-brainer.
Mar 16 2009
parent reply BCS <none anon.com> writes:
Hello Jarrett,

 I'm actually kind of shocked that, given the prevalence of memory block
 copy operations, more CPUs haven't implemented it as a basic
 instruction.  Yes, RISC is nice, but geez, this seems like a
 no-brainer.
 
How about memory-to-memory DMA? Why even make the CPU wait for it to
finish?
Mar 16 2009
next sibling parent reply Don <nospam nospam.com> writes:
BCS wrote:
 Hello Jarrett,
 
 I'm actually kind of shocked that, given the prevalence of memory block
 copy operations, more CPUs haven't implemented it as a basic
 instruction.  Yes, RISC is nice, but geez, this seems like a
 no-brainer.
 How about memory-to-memory DMA? Why even make the CPU wait for it to
 finish?
That actually happens, to some extent, on some CPUs. If you get everything right, the data only goes to the outermost cache, and never reaches the CPU itself.
Mar 16 2009
parent BCS <none anon.com> writes:
Hello Don,

 BCS wrote:
 
 Hello Jarrett,
 
 I'm actually kind of shocked that, given the prevalence of memory
 block copy operations, more CPUs haven't implemented it as a
 basic instruction.  Yes, RISC is nice, but geez, this seems like a
 no-brainer.
 
 How about memory-to-memory DMA? Why even make the CPU wait for it to
 finish?
 That actually happens, to some extent, on some CPUs. If you get
 everything right, the data only goes to the outermost cache, and
 never reaches the CPU itself.
I was thinking of something even more aggressive. For instance, I
wonder how much more silicon it would take to allow copies that happen
to be on the same DIMM to never even touch the motherboard. I can
think of some cases where this would be very useful (large matrix
transposition, for one, would work well if the memory is set up for
arbitrary striding).
Mar 16 2009
prev sibling parent reply Jarrett Billingsley <jarrett.billingsley gmail.com> writes:
On Mon, Mar 16, 2009 at 3:29 PM, BCS <none anon.com> wrote:
 I'm actually kind of shocked that, given the prevalence of memory block
 copy operations, more CPUs haven't implemented it as a basic
 instruction.  Yes, RISC is nice, but geez, this seems like a
 no-brainer.
 How about memory-to-memory DMA? Why even make the CPU wait for it to
 finish?
Sure, then you have to worry about waiting for the *DMA* to finish.

If it's a copy whose result is largely unimportant to the
immediately-following code (copying a backbuffer into a frontbuffer,
for example), it doesn't matter. But for copying large value types, I'd
think it's pretty important that the copy is semantically atomic.
Mar 16 2009
parent BCS <none anon.com> writes:
Hello Jarrett,

 On Mon, Mar 16, 2009 at 3:29 PM, BCS <none anon.com> wrote:
 
 I'm actually kind of shocked that, given the prevalence of memory
 block copy operations, more CPUs haven't implemented it as a
 basic instruction.  Yes, RISC is nice, but geez, this seems like a
 no-brainer.
 
 How about memory-to-memory DMA? Why even make the CPU wait for it to
 finish?
 Sure, then you have to worry about waiting for the *DMA* to finish.

 If it's a copy whose result is largely unimportant to the
 immediately-following code (copying a backbuffer into a frontbuffer,
 for example), it doesn't matter. But for copying large value types,
 I'd think it's pretty important that the copy is semantically atomic.
As long as either the CPU can voluntarily block, or the MMU can block
access until the DMA is done, it's a free gain: the CPU can do
/something/ it wouldn't otherwise be able to.
Mar 16 2009
prev sibling parent reply Christopher Wright <dhasenan gmail.com> writes:
bearophile wrote:
 Don:
 Which means that memcpy probably isn't anywhere near optimal, either.<
 Some time ago I read an article by AMD showing that on modern CPUs
 there are indeed ways to go much faster, using vector asm instructions,
 loop unrolling and explicit cache prefetching (but it's useful with
 longer arrays only). Profile-driven optimization can tell you whether a
 particular call site usually copies a lot of data, and pick that kind
 of copy for it; for smaller copies it's overkill: slow, and too much
 code for the cache. As an alternative the programmer may add an
 annotation to choose the copy strategy, but that is not nice.

 Bye,
 bearophile
You could probably get good results by seeing how long the array is, though.
Mar 16 2009
parent Don <nospam nospam.com> writes:
Christopher Wright wrote:
 bearophile wrote:
 Don:
 Which means that memcpy probably isn't anywhere near optimal, either.<
 Some time ago I read an article by AMD showing that on modern CPUs
 there are indeed ways to go much faster, using vector asm instructions,
 loop unrolling and explicit cache prefetching (but it's useful with
 longer arrays only). Profile-driven optimization can tell you whether a
 particular call site usually copies a lot of data, and pick that kind
 of copy for it; for smaller copies it's overkill: slow, and too much
 code for the cache. As an alternative the programmer may add an
 annotation to choose the copy strategy, but that is not nice.
Not necessary. If the length is long enough to get benefit from that, the function call overhead for calling memcpy() is negligible. So you just need to include all the cases in memcpy.
 Bye,
 bearophile
You could probably get good results by seeing how long the array is, though.
Yes. If the length is known at compile time, and it's short, inline it
with special-case code for the small sizes. If it's long, just put in a
call to memcpy.

I looked at the code in the DMD backend; the generation of the memcpy
intrinsic is in cdmemcpy() in cod2.c. But as far as I can tell, by the
time that code is called, the length isn't a constant any more. So I
don't think it'd be easy to fix.

On Core2, rep movsb transfers 1 byte every 3 clocks -- ie 0.3 bytes per
clock (!). rep movsd/rep movsq transfers 1 dword/qword every 0.63
clocks -- ie best case is 13 bytes/clock. On AMD you get one transfer
per clock -- max 8 bytes per clock, 1 byte per clock with rep movsb.

So rep movsb is _unbelievably_ slow. A simple D for loop is probably
quicker.
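As a sketch of that idea, with the length as a compile-time template
parameter (an illustration only, not the backend fix):

import std.c.string: memcpy;

// Inline the small power-of-two sizes as single movs; fall back to
// memcpy where the call overhead is negligible anyway.
void copyFixed(size_t L)(void* dst, void* src) {
    static if (L == 1)
        *cast(ubyte*)dst = *cast(ubyte*)src;
    else static if (L == 2)
        *cast(ushort*)dst = *cast(ushort*)src;
    else static if (L == 4)
        *cast(uint*)dst = *cast(uint*)src;
    else static if (L == 8)
        *cast(ulong*)dst = *cast(ulong*)src;
    else
        memcpy(dst, src, L);
}

copyFixed!(4)(dst, src) then compiles down to a single mov, while an
odd size like 6 falls through to memcpy unless it gets a special case
of its own.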
Mar 17 2009
prev sibling parent reply Sergey Gromov <snake.scaly gmail.com> writes:
Mon, 16 Mar 2009 10:34:33 +0100, Don wrote:

 Sergey Gromov wrote:
 Sun, 15 Mar 2009 13:17:50 +0000 (UTC), Moritz Warning wrote:
 
 On Sat, 14 Mar 2009 23:50:58 -0400, bearophile wrote:

 While doing some string processing I've seen some unusual timings
 compared to the C code, so I have written this to see the situation
 better. When USE_MEMCPY is false this little benchmark runs about 3+
 times slower:
 I did a little benchmark:

 ldc -release -O5
 true:  0.51
 false: 0.63

 dmd -release -O
 true:  4.47
 false: 3.58

 I don't see a very big difference between slice copying and memcpy
 (but between compilers).

 Btw.: http://www.digitalmars.com/pnews/read.php?server=news.digitalmars.com&group=digitalmars.D.bugs&artnum=14933
 The original benchmark swapped insanely on my 1GB laptop so I've cut
 the number of iterations in half, to 50_000_000. Compiled with
 -O -release -inline. Results:

 slice:  2.31
 memcpy: 0.73

 That's 3 times difference. Disassembly:

 slice:

 L31:  mov  ECX,EDX
       mov  EAX,6
       lea  ESI,010h[ESP]
       mov  ECX,EAX
       mov  EDI,EDX
       rep  movsb
       add  EDX,6
       add  EBX,6
       cmp  EBX,011E1A300h
       jb   L31

 memcpy:

 L35:  push 6
       lea  ECX,014h[ESP]
       push ECX
       push EBX
       call near ptr _memcpy
       add  EBX,6
       add  ESI,6
       add  ESP,0Ch
       cmp  ESI,011E1A300h
       jb   L35

 Seems like rep movsb is /way/ sub-optimal for copying data.
 Definitely! The difference ought to be bigger than a factor of 3.
 Which means that memcpy probably isn't anywhere near optimal, either.

 rep movsd is always 4 times quicker than rep movsb. There's a range of
 lengths for which rep movsd is optimal; outside that range, there are
 other options which are even faster. So there's a factor of 4-8
 speedup available on most memory copies. Low-hanging fruit! <g>
Don't disregard the function call overhead. memcpy is called 50 M times, copying only 6 bytes per call.
Mar 16 2009
parent reply Don <nospam nospam.com> writes:
Sergey Gromov wrote:
 Mon, 16 Mar 2009 10:34:33 +0100, Don wrote:
 
 Sergey Gromov wrote:
 Sun, 15 Mar 2009 13:17:50 +0000 (UTC), Moritz Warning wrote:

 On Sat, 14 Mar 2009 23:50:58 -0400, bearophile wrote:

 While doing some string processing I've seen some unusual timings
 compared to the C code, so I have written this to see the situation
 better. When USE_MEMCPY is false this little benchmark runs about 3+
 times slower:
 I did a little benchmark:

 ldc -release -O5
 true:  0.51
 false: 0.63

 dmd -release -O
 true:  4.47
 false: 3.58

 I don't see a very big difference between slice copying and memcpy
 (but between compilers).

 Btw.: http://www.digitalmars.com/pnews/read.php?server=news.digitalmars.com&group=digitalmars.D.bugs&artnum=14933
 The original benchmark swapped insanely on my 1GB laptop so I've cut
 the number of iterations in half, to 50_000_000. Compiled with
 -O -release -inline. Results:

 slice:  2.31
 memcpy: 0.73

 That's 3 times difference. Disassembly:

 slice:

 L31:  mov  ECX,EDX
       mov  EAX,6
       lea  ESI,010h[ESP]
       mov  ECX,EAX
       mov  EDI,EDX
       rep  movsb
       add  EDX,6
       add  EBX,6
       cmp  EBX,011E1A300h
       jb   L31

 memcpy:

 L35:  push 6
       lea  ECX,014h[ESP]
       push ECX
       push EBX
       call near ptr _memcpy
       add  EBX,6
       add  ESI,6
       add  ESP,0Ch
       cmp  ESI,011E1A300h
       jb   L35

 Seems like rep movsb is /way/ sub-optimal for copying data.
 Definitely! The difference ought to be bigger than a factor of 3.
 Which means that memcpy probably isn't anywhere near optimal, either.

 rep movsd is always 4 times quicker than rep movsb. There's a range of
 lengths for which rep movsd is optimal; outside that range, there are
 other options which are even faster. So there's a factor of 4-8
 speedup available on most memory copies. Low-hanging fruit! <g>
 Don't disregard the function call overhead. memcpy is called 50 M
 times, copying only 6 bytes per call.
Oh. I didn't see it was only 6 bytes. And the compiler even KNOWS it's six bytes -- it's in the asm. Blimey. It should just be doing that as a direct sequence of loads and stores, for anything up to at least 8 bytes.
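For the 6-byte copy in this benchmark, that direct sequence would be
just two stores (a sketch in D, assuming x86, where unaligned word
access is fine):

// one dword store plus one word store -- two movs instead of
// rep movsb or a memcpy call
void copy6(void* dst, void* src) {
    *cast(uint*)dst = *cast(uint*)src;                  // bytes 0..3
    *cast(ushort*)(cast(ubyte*)dst + 4) =
        *cast(ushort*)(cast(ubyte*)src + 4);            // bytes 4..5
}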
Mar 16 2009
parent reply Walter Bright <newshound1 digitalmars.com> writes:
Don wrote:
 Oh. I didn't see it was only 6 bytes. And the compiler even KNOWS it's 
 six bytes -- it's in the asm. Blimey. It should just be doing that as a 
 direct sequence of loads and stores, for anything up to at least 8 bytes.
The compiler will replace it with a simple mov if it is 1, 2, 4 or 8 bytes.
Mar 16 2009
next sibling parent Don <nospam nospam.com> writes:
Walter Bright wrote:
 Don wrote:
 Oh. I didn't see it was only 6 bytes. And the compiler even KNOWS it's 
 six bytes -- it's in the asm. Blimey. It should just be doing that as 
 a direct sequence of loads and stores, for anything up to at least 8 
 bytes.
The compiler will replace it with a simple mov if it is 1, 2, 4 or 8 bytes.
That's good. In this case, though, it's a pessimisation: it's quicker
in debug mode than in release mode!
Mar 16 2009
prev sibling parent Sergey Gromov <snake.scaly gmail.com> writes:
Mon, 16 Mar 2009 11:36:50 -0700, Walter Bright wrote:

 Don wrote:
 Oh. I didn't see it was only 6 bytes. And the compiler even KNOWS it's 
 six bytes -- it's in the asm. Blimey. It should just be doing that as a 
 direct sequence of loads and stores, for anything up to at least 8 bytes.
The compiler will replace it with a simple mov if it is 1, 2, 4 or 8 bytes.
It didn't, actually. I've just filed a patch which should fix this
issue:

http://d.puremagic.com/issues/show_bug.cgi?id=2750

For instance, the benchmark in the original post used to compile to

L31:  mov  ECX,EDX
      mov  EAX,6
      lea  ESI,010h[ESP]
      mov  ECX,EAX
      mov  EDI,EDX
      rep  movsb
      add  EDX,6
      add  EBX,6
      cmp  EBX,011E1A300h
      jb   L31

With my patch it's just

L3A:  lea  ESI,0Ch[ESP]
      mov  EDI,EAX
      movsd
      movsb
      movsb
      add  EAX,6
      add  ECX,6
      cmp  ECX,011E1A300h
      jb   L3A

as it probably was meant to be.
Mar 19 2009
prev sibling parent "Lionello Lunesu" <lionello lunesu.remove.com> writes:
This has been discussed before, to no avail.
http://d.puremagic.com/issues/show_bug.cgi?id=2313

L.
Mar 17 2009