
digitalmars.D - memset and related things

reply bearophile <bearophileHUGS lycos.com> writes:
In a program I've seen that clearing an array in the inner loop was taking too
much time. To solve the problem I've run many experiments, and I've also
produced the testing program below.

The short summary: to set an array of 4-byte integers to a certain constant,
the best ways are:
- if len <~ 20, just use an inlined loop;
- if 20 < len < 200_000, a loop unrolled 4 times with the movaps instruction is
better (8 times unrolled is a little worse);
- if len > 200_000, a loop with the movntps instruction is better.

Generally such solutions are better than memset() (only when len is about
150_000 is memset a bit better than four movaps).


See also:
http://www.gamedev.net/community/forums/topic.asp?topic_id=532112

http://www.sesp.cse.clrc.ac.uk/html/SoftwareTools/vtune/users_guide/mergedProjects/analyzer_ec/mergedProjects/reference_olh/mergedProjects/instructions/instruct32_hh/vc197.htm


This is a possible function to use when a.length > 20; otherwise an inlined
loop is faster:


void memset4(T)(T[] a, T value) {
    // to be used for a.length >~ 20
    static assert(T.sizeof == 4);
    if (!a.length)
        return;
    auto a_ptr = a.ptr;
    auto a_end = a_ptr + a.length;

    // align the pointer to 16 bytes, processing the leading unaligned items
    // (assumes a.ptr is at least 4-byte aligned; the a_end guard protects
    // very short slices)
    size_t aux = (cast(size_t)a_ptr + 15) & ~15;
    auto n = cast(T*)aux;
    while (a_ptr < n && a_ptr < a_end)
        *a_ptr++ = value;

    // trim the end to a multiple of 64 bytes from a_ptr, so the unrolled
    // loops below (64 bytes per iteration) never write past a_end
    n = cast(T*)(cast(size_t)a_ptr
                 + ((cast(size_t)a_end - cast(size_t)a_ptr) & ~cast(size_t)63));

    if (a_ptr < n) {
        // Aligned case
        if ((a_end - a_ptr) >= 200_000) {
            asm {
                mov ESI, a_ptr;
                mov EDI, n;

                movss XMM0, value; // XMM0 = value,value,value,value
                shufps XMM0, XMM0, 0;

                align 8;
                LOOP1:
                    add ESI, 64;
                    movntps [ESI+ 0-64], XMM0;
                    movntps [ESI+16-64], XMM0;
                    movntps [ESI+32-64], XMM0;
                    movntps [ESI+48-64], XMM0;
                    cmp ESI, EDI;
                jb LOOP1;

                mov a_ptr, ESI;
            }
        } else {
            asm {
                mov ESI, a_ptr;
                mov EDI, n;

                movss XMM0, value; // XMM0 = value,value,value,value
                shufps XMM0, XMM0, 0;

                align 8;
                LOOP2:
                    add ESI, 64;
                    movaps [ESI+ 0-64], XMM0;
                    movaps [ESI+16-64], XMM0;
                    movaps [ESI+32-64], XMM0;
                    movaps [ESI+48-64], XMM0;
                    cmp ESI, EDI;
                jb LOOP2;

                mov a_ptr, ESI;
            }
        }
    }

    // trailing ones
    while (a_ptr < a_end)
        *a_ptr++ = value;
}



So the D front end can implement:
int[] a = ...;
a[] = k;


as:

int[] a = ...;
if (a.length < 20)
  for (int i; i < a.length; i++)
    a[i] = k;
else
  memset4(a, k);

Arrays of other value types need slightly different code. Today most CPUs
support SSE, but in the uncommon case where it's not available, C memset() can
be used instead of memset4 (the inlined loop for short arrays is useful even
when you use C memset). A sketch of the resulting dispatch follows.
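
As a sketch, the whole dispatch could look like this (hasSSE() is a
hypothetical runtime check, sketched later in the thread; the imports are the
same as in the benchmark program below). Note that C memset() replicates a
single byte, so it can stand in for memset4 only when all four bytes of the
value are equal, such as zero:

void set4(T)(T[] a, T value) {
    static assert(T.sizeof == 4);
    if (a.length < 20) {
        foreach (ref x; a)     // short arrays: the plain inlined loop wins
            x = value;
    } else if (hasSSE()) {     // hypothetical runtime SSE check
        memset4(a, value);     // picks movaps/movntps internally
    } else if (value == 0) {
        // memset can only fill a repeated byte; zero is the common case
        memset(a.ptr, 0, a.length * T.sizeof);
    } else {
        foreach (ref x; a)     // generic fallback
            x = value;
    }
}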

I'm still ignorant of SSE asm, so the code I've written may contain bugs, naive
constructs, etc.

If such code can be debugged, and my timings are meaningful, then the code can
be put inside the D front end, so LDC can use it too (another good option is
for LLVM to improve its memset intrinsic, so the LDC front end doesn't need to
do this itself).
    
----------------------------------------

// benchmark code

version (Tango) {
    import tango.stdc.stdio: printf;
    import tango.stdc.time: clock, CLOCKS_PER_SEC;
    import tango.math.Math: sqrt;
    import tango.stdc.string: memset;
} else {
    import std.c.stdio: printf;
    import std.c.time: clock, CLOCKS_PER_SEC;
    import std.math: sqrt;
    import std.c.string: memset;
}


double myclock() {
    return cast(double)clock() / CLOCKS_PER_SEC;
}


void memset4_movaps(T)(T[] a, T value) {
    // to be used if a.length >~ 20, otherwise an inlined loop is faster
    
    static assert(T.sizeof == 4);
    if (!a.length)
        return;
    auto a_ptr = a.ptr;
    auto a_end = a_ptr + a.length;

    // align the pointer to 16 bytes, processing the leading unaligned items
    // (assumes a.ptr is at least 4-byte aligned; the a_end guard protects
    // very short slices)
    size_t aux = (cast(size_t)a_ptr + 15) & ~15;
    auto n = cast(T*)aux;
    while (a_ptr < n && a_ptr < a_end)
        *a_ptr++ = value;

    // trim the end to a multiple of 64 bytes from a_ptr, so the unrolled
    // loop below (64 bytes per iteration) never writes past a_end
    n = cast(T*)(cast(size_t)a_ptr
                 + ((cast(size_t)a_end - cast(size_t)a_ptr) & ~cast(size_t)63));

    if (a_ptr < n) {
        // Aligned case
        asm {
            mov ESI, a_ptr;
            mov EDI, n;

            movss XMM0, value; // XMM0 = value,value,value,value
            shufps XMM0, XMM0, 0;

            align 8;
            LOOP2:
                add ESI, 64;
                movaps [ESI+ 0-64], XMM0;
                movaps [ESI+16-64], XMM0;
                movaps [ESI+32-64], XMM0;
                movaps [ESI+48-64], XMM0;
                cmp ESI, EDI;
            jb LOOP2;

            mov a_ptr, ESI;
        }
    }

    // trailing ones
    while (a_ptr < a_end)
        *a_ptr++ = value;
}


void memset4_movntps(T)(T[] a, T value) {
    // to be used if a.length >~ 25, otherwise an inlined loop is faster
    
    static assert(T.sizeof == 4);
    if (!a.length)
        return;
    auto a_ptr = a.ptr;
    auto a_end = a_ptr + a.length;

    // align the pointer to 16 bytes, processing the leading unaligned items
    // (assumes a.ptr is at least 4-byte aligned; the a_end guard protects
    // very short slices)
    size_t aux = (cast(size_t)a_ptr + 15) & ~15;
    auto n = cast(T*)aux;
    while (a_ptr < n && a_ptr < a_end)
        *a_ptr++ = value;

    // trim the end to a multiple of 64 bytes from a_ptr, so the unrolled
    // loop below (64 bytes per iteration) never writes past a_end
    n = cast(T*)(cast(size_t)a_ptr
                 + ((cast(size_t)a_end - cast(size_t)a_ptr) & ~cast(size_t)63));

    if (a_ptr < n) {
        // Aligned case
        asm {
            mov ESI, a_ptr;
            mov EDI, n;

            movss XMM0, value; // XMM0 = value,value,value,value
            shufps XMM0, XMM0, 0;

            align 8;
            LOOP2:
                add ESI, 64;
                movntps [ESI+ 0-64], XMM0;
                movntps [ESI+16-64], XMM0;
                movntps [ESI+32-64], XMM0;
                movntps [ESI+48-64], XMM0;
                cmp ESI, EDI;
            jb LOOP2;

            mov a_ptr, ESI;
        }
    }

    // trailing ones
    while (a_ptr < a_end)
        *a_ptr++ = value;
}


T test(T, int ntest)(int len, int nloops) {
    auto a = new T[len];

    for (int i; i < nloops; i++) {
        static if (ntest == 0)
            for (int j; j < a.length; j++)
                a[j] = T.init;
        else static if (ntest == 1)
            memset(a.ptr, 0, a.length * T.sizeof);
        else static if (ntest == 2)
            memset4_movaps(a, 0);
        else static if (ntest == 3)
            memset4_movntps(a, 0);

        a[i % len] = i;
    }

    return a[0];
}


void show(float[] a) {
    printf("[");

    if (a.length > 1)
        foreach (el; a[0 .. $-1])
            printf("%.1f, ", el);
    if (a.length)
        printf("%.1f", a[$-1]);
    printf("]\n");
}


void main() {
    // small test
    auto a = new float[20];
    foreach (i, ref el; a)
        el = i;

    // [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0,
    //  13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0]
    show(a);

    memset4_movaps(a[2 .. 2], 0.5f);

    // [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0,
    //  13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0]
    show(a);

    memset4_movaps(a[2 .. 9], 0.5f);

    // [0.0, 1.0, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 9.0, 10.0, 11.0, 12.0,
    //  13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0]
    show(a);
    printf("---------------------\n\n");

    auto lens = [2, 4, 5, 8, 15, 16, 20, 25, 32, 50, 64, 100, 250, 256,
                 1000, 2048, 2500, 1 << 12, 1 << 15, 1 << 16, 1 << 17,
                 210_000, 1 << 18, 1 << 19, 1 << 20];
    alias int T;
    
    printf("len, nloops, time with loop, time with memset, time with movaps,
time with movntps:\n");
    foreach (len; lens) {
        int nloops;
        if (len >= 8000)
            nloops = cast(int)(cast(double)4_000_000 / sqrt(cast(double)len));
        else
            nloops = cast(int)(cast(double)60_000_000 / sqrt(cast(double)len));
            
        auto t0 = myclock();
        //test!(T, 0)(len, nloops);
        auto a2 = new T[len];
        for (int i; i < nloops; i++) {
            for (int j; j < a2.length; j++)
                a2[j] = T.init;
            a2[i % len] = i;
        }
        auto t1 = myclock();

        auto t2 = myclock();
        test!(T, 1)(len, nloops);
        auto t3 = myclock();

        auto t4 = myclock();
        test!(T, 2)(len, nloops);
        auto t5 = myclock();

        auto t6 = myclock();
        test!(T, 3)(len, nloops);
        auto t7 = myclock();

        printf("%8d  %9d  %.3f  %.3f  %.2f  %.3f\n", len, nloops, t1-t0, t3-t2,
t5-t4, t7-t6);
    }
}

----------------------------------------

Timings taken on a Celeron at 2.13 GHz (other CPUs will give different results;
I'll take timings on a Core 2 too). Columns: len, nloops, time with the plain
loop, with memset, with movaps, with movntps (seconds):

       2   42426406  0.850  1.510  1.020  1.740
       4   30000000  0.570  1.180  0.850  0.860
       5   26832815  0.480  0.940  0.980  1.000
       8   21213203  0.380  0.720  0.860  0.850
      15   15491933  0.340  0.580  0.960  0.940
      16   15000000  0.350  0.560  0.400  4.420
      20   13416407  0.430  0.570  0.370  3.940
      25   12000000  0.480  0.540  0.320  3.550
      32   10606601  0.420  0.530  0.280  3.140
      50    8485281  0.500  0.530  0.360  2.450
      64    7500000  0.560  0.510  0.320  2.220
     100    6000000  0.690  0.490  0.290  2.020
     250    3794733  1.020  0.470  0.280  2.140
     256    3750000  0.990  0.460  0.260  2.080
    1000    1897366  2.070  0.640  0.340  3.460
    2048    1325825  2.890  0.880  0.470  4.750
    2500    1200000  3.240  0.920  0.500  5.300
    4096     937500  4.070  1.150  0.630  6.660
   32768      22097  0.570  0.320  0.250  1.300
   65536      15625  0.850  0.450  0.380  1.760
  131072      11048  1.630  1.150  1.240  2.500
  210000       8728  3.740  2.630  3.150  3.160
  262144       7812  5.480  4.050  5.040  3.550
  524288       5524  11.120  9.200  11.020  5.110
 1048576       3906  15.850  13.190  15.820  7.210


Graph in PNG:
http://i37.tinypic.com/v836e1.png

Bye,
bearophile
Sep 19 2009
next sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
I think this version is a bit better:

void memset4(T)(T[] a, T value=T.init) {
    static assert (T.sizeof == 4);
    static assert (size_t.sizeof == (T*).sizeof);
    if (!a.length)
        return;
    auto a_ptr = a.ptr;
    auto a_end = a_ptr + a.length;

    // align the pointer to 16 bytes, processing the leading unaligned items
    // (the a_end guard protects very short slices)
    size_t a_end_trimmed = (cast(size_t)a_ptr + 15) & (~15);
    while (cast(size_t)a_ptr < a_end_trimmed && a_ptr < a_end)
        *a_ptr++ = value;

    // trim the end to a multiple of 64 bytes from a_ptr, so the unrolled
    // loops below (64 bytes per iteration) never write past a_end
    a_end_trimmed = cast(size_t)a_ptr
                    + ((cast(size_t)a_end - cast(size_t)a_ptr) & ~cast(size_t)63);

    //printf("%d %d %d %u\n", a_ptr, a_end, a_end_trimmed);
    //int counter1, counter2;

    if (a_end_trimmed - cast(size_t)a_ptr > (200_000 * T.sizeof))
        asm {
            mov ESI, a_ptr;
            mov EDI, a_end_trimmed;

            //pxor XMM0, XMM0; // replaces movss+shufps when value == 0
            // XMM0 = value,value,value,value
            movss XMM0, value;
            shufps XMM0, XMM0, 0;

            align 8;
            LOOP1: // writes 4 * 4 * 4 bytes each loop
                //inc counter1;
                add ESI, 64;
                movntps [ESI+ 0-64], XMM0;
                movntps [ESI+16-64], XMM0;
                movntps [ESI+32-64], XMM0;
                movntps [ESI+48-64], XMM0;
                cmp ESI, EDI;
            jb LOOP1;

            mov a_ptr, ESI;
        }
    else if (a_end_trimmed - cast(size_t)a_ptr > 16)
        asm {
            mov ESI, a_ptr;
            mov EDI, a_end_trimmed;

            //pxor XMM0, XMM0; // replaces movss+shufps when value == 0
            // XMM0 = value,value,value,value
            movss XMM0, value;
            shufps XMM0, XMM0, 0;

            align 8;
            LOOP2: // writes 4 * 4 * 4 bytes each loop
                //inc counter2;
                add ESI, 64;
                movaps [ESI+ 0-64], XMM0;
                movaps [ESI+16-64], XMM0;
                movaps [ESI+32-64], XMM0;
                movaps [ESI+48-64], XMM0;
                cmp ESI, EDI;
            jb LOOP2;

            mov a_ptr, ESI;
        }

    //printf("counter1, counter2: %d %d\n", counter1, counter2);

    // the last % 16 items
    while (a_ptr < a_end)
        *a_ptr++ = value;
}


Bye,
bearophile
Sep 20 2009
parent Jeremie Pelletier <jeremiep gmail.com> writes:
bearophile wrote:
 I think this version is a bit better:
 
 [snip]
 
 
 Bye,
 bearophile
This is quite interesting; it made me rethink my own uses of memset. By the
way, you should first verify that SSE is present, through the cpuid
instruction.
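
For reference, here is a minimal sketch of such a check in DMD-style x86
inline asm (hasSSE is an illustrative name; bit 25 of EDX from CPUID function
1 reports SSE support):

bool hasSSE() {
    uint features;
    asm {
        push EBX;          // CPUID clobbers EBX, which must be preserved
        mov EAX, 1;        // CPUID function 1: processor feature bits
        cpuid;
        mov features, EDX; // EDX holds the feature flags
        pop EBX;
    }
    return (features & (1 << 25)) != 0; // bit 25 of EDX = SSE
}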
Sep 20 2009
prev sibling next sibling parent Don <nospam nospam.com> writes:
bearophile wrote:
 In a program I've seen that in the inner loop an array cleaning was taking too
much time. To solve the problem I've done many experiments, and I've also
produced the following testing program.
 
 The short summary is: to set an array of 4-byte integers to a certain constant
the best ways are:
 - if len <~ 20, then just use an inlined loop.
 - if 20 < len < 200_000 it's better to use a loop unrolled 4 times with the
movaps instruction (8 times unrolled is a little worse).
 - if n > 200_000 a loop with the movntps instruction is better.
 
 Generally such solutions are better than the memset() (only when len is about
150_000 memset is a bit better than four movaps).
Yeah, DMD's memset() and memcpy() are far from optimal. IIRC memcpy() is even
worse. I had done a bit of work on this as well, but when I posted preliminary
stuff there wasn't much interest; the general feedback seemed to be that it'd
be more useful to fix the compiler ICE bugs. So I did that <g>. It'll be
interesting to see what the priorities are now -- maybe this stuff is of more
interest now.

BTW the AMD manual for the K7 (or it might be the K6 optimisation manual, I
don't exactly remember) goes into great detail about both memcpy() and
memset(). It turns out there are about five different cases.
Sep 20 2009
prev sibling parent reply bearophile <bearophileHUGS lycos.com> writes:
Don:

 It'll be interesting to see what the priorities are now -- 
 maybe this stuff is of more interest now.
Probably removing bugs is more important still :-) For example your work has changed a little how compile-time functions can be used in D.
 BTW the AMD manual for K7 (or might be K6 optimisation manual? don't 
 exactly remember) goes into great detail about both memcpy() and 
 memset(). Turns out there's about five different cases.
In the meantime Deewiant has told me that on 64-bit glibc memset is better, and
that on more modern CPUs the timings are different (and on 64 bit my first
version may not work; maybe the second one is better -- I have not tested it on
64-bit LDC yet). I'm just a newbie at this stuff, while the people who write
the memset of 64-bit glibc are experts.

Bye and thank you,
bearophile
Sep 20 2009
next sibling parent reply language_fan <foo bar.com.invalid> writes:
Sun, 20 Sep 2009 16:09:50 -0400, bearophile thusly wrote:

 I'm just a newbie at this stuff, while the
 people who write the memset of 64-bit glibc are experts.
I have been wondering why people here often complain that Java is slower than
D. Every time I see your benchmarks, Java is actually doing just fine;
sometimes it's even faster than D. Are your benchmarks somehow flawed every
time Java beats D in performance?
Sep 20 2009
parent reply bearophile <bearophileHUGS lycos.com> writes:
language_fan:

 Every time I see your benchmarks, Java is actually doing just 
 fine, sometimes it's even faster than D. Are your benchmarks somehow 
 flawed every time Java wins D in performance?
I don't fully understand your question. The main purpose of some of the
benchmarks I've done in the last few months is to find performance bugs (or
sometimes just bugs) in LDC or LLVM (I have stopped doing the same for DMD
because people seem less interested in improving its performance). Now I often
use code originally written in C++ and Java, because I've seen that in most
situations LLVM is able to produce good binaries from very C-like code (with
few exceptions).

My benchmarks aren't chosen randomly: I naturally focus on things that are
slower in D, so sometimes you can see Java "win". I usually discard the code
where Java turns out slower :-) Installing and using a debug build of the Java
VM to see the assembly its JIT produces is not handy, but I sometimes do it.
Once in a while you can find some surprises there.

Recently I have found a nice single-file C++ program that heavily uses
templates and partial specialization, but I am having problems translating it
to good D to test templates in LDC, because I don't know enough C++ yet. It
finds optimal solutions to the 15-puzzle, and it's quite efficient:
http://codepad.org/96ATkuhx

Help/suggestions are welcome :-)

Bye,
bearophile
Sep 20 2009
parent reply language_fan <foo bar.com.invalid> writes:
Sun, 20 Sep 2009 18:16:37 -0400, bearophile thusly wrote:

 My benchmarks aren't chosen randomly, I naturally focus on things that
 are slower in D, so sometimes you can see Java to "win". I usually
 discard the code where Java results slower :-)
I have seen people here mention many times that Java is in general orders of
magnitude slower than D, no matter what kind of algorithm you run on the two
environments. This is because of the VM - nothing on a VM can run faster than
native code, they say. If you decide to hide the bad results (for D), it will
only reinforce the misinformation. I personally use a lot of heap memory
allocation in my work, and so far Java has not only been safer (and provides
decent stack traces and no compiler bugs), but also faster - each time.
Sep 21 2009
next sibling parent reply downs <default_357-line yahoo.de> writes:
language_fan wrote:
 Sun, 20 Sep 2009 18:16:37 -0400, bearophile thusly wrote:
 
 My benchmarks aren't chosen randomly, I naturally focus on things that
 are slower in D, so sometimes you can see Java to "win". I usually
 discard the code where Java results slower :-)
I have seen people many times mention that Java is in general orders of magnitude slower than D, no matter what kind of algorithms you run on both environments. This is because of the VM - nothing on a VM can run faster than native code, they say.
Wow. I actually haven't seen that argument in a relatively long time - it seems
to me relatively debunked nowadays. (Personally, for me it comes down to "Java
will always have a certain delay on startup, and its thoroughly
object-oriented design forces heap access that is often unnecessary for
solving the problem; plus the single-paradigm model is woefully constraining
in comparison to more flexible languages.")
 I personally use a 
 lot of heap memory allocation in my work, and so far Java has not only 
 been safer (and provides decent stack traces and no compiler bugs), but 
 also faster - each time.
Yes, heap allocation is faster in Java. There's so much of it they pretty much had no choice but to tune it to hell and back :)
 If you decide to hide the bad results
 (for D), it will only reinforce the misinformation.
Um, he said he hides the bad results _for Java_.
Sep 21 2009
next sibling parent bearophile <bearophileHUGS lycos.com> writes:
downs:

and its thoroughly object-oriented design forces heap access that is often
unnecessary for solving the problem;
In Java, to avoid the decrease in performance caused by using an array of
references, you may need to use several parallel arrays, one for each field of
the struct. But the cache behaviour of such parallel arrays is different:
sometimes it's better and sometimes it's worse than the array of structs. If
you have to access several fields of each struct at once, the array of structs
is generally faster than the parallel arrays. If you have to scan only one or
a few fields, the parallel arrays are faster because less data comes through
the caches. Some time ago someone created a "parallel array" struct here to
allow such layout choices at compile time.

A virtual machine can in theory change such struct/array layout dynamically at
runtime according to the access patterns of the data :-) I think similar
dynamism in data structures may show up in the next generation of VM-based
languages (you can implement such smart data structures in D too, but the data
structure has to be mostly opaque and you have to forbid taking addresses of
any data, because it can be moved and shuffled around).
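
A tiny sketch of the two layouts (the names are only illustrative):

// array of structs (AoS): good when each pass touches several fields at once
struct Particle { float x, v, m; }
Particle[] aos;

// parallel arrays (SoA): good when a pass scans a single field, because less
// unrelated data moves through the caches
struct Particles {
    float[] x, v, m;
}
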
 Um, he said he hides the bad results _for Java_.
Right. I generally don't like to hide things, and I show all the data I have
collected. Sometimes I don't show benchmarks where Java looks slow, because
they can be boring and because I am not currently interested in improving the
Java VM. Focusing on where LDC/LLVM do badly, I can try to improve them.

Bye,
bearophile
Sep 21 2009
prev sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
downs wrote:
 language_fan wrote:
 Sun, 20 Sep 2009 18:16:37 -0400, bearophile thusly wrote:

 My benchmarks aren't chosen randomly, I naturally focus on things that
 are slower in D, so sometimes you can see Java to "win". I usually
 discard the code where Java results slower :-)
I have seen people many times mention that Java is in general orders of magnitude slower than D, no matter what kind of algorithms you run on both environments. This is because of the VM - nothing on a VM can run faster than native code, they say.
Wow. I actually haven't seen that argument in a relatively long time - it seems
to me relatively debunked nowadays. (Personally, for me it comes down to "Java
will always have a certain delay on startup, and its thoroughly
object-oriented design forces heap access that is often unnecessary for
solving the problem; plus the single-paradigm model is woefully constraining
in comparison to more flexible languages.")
 I personally use a 
 lot of heap memory allocation in my work, and so far Java has not only 
 been safer (and provides decent stack traces and no compiler bugs), but 
 also faster - each time.
Yes, heap allocation is faster in Java. There's so much of it they pretty much had no choice but to tune it to hell and back :)
 If you decide to hide the bad results
 (for D), it will only reinforce the misinformation.
Um, he said he hides the bad results _for Java_.
I guess that reinforces some other misinformation :o). Andrei
Sep 21 2009
prev sibling next sibling parent reply Don <nospam nospam.com> writes:
language_fan wrote:
 Sun, 20 Sep 2009 18:16:37 -0400, bearophile thusly wrote:
 
 My benchmarks aren't chosen randomly, I naturally focus on things that
 are slower in D, so sometimes you can see Java to "win". I usually
 discard the code where Java results slower :-)
I have seen people many times mention that Java is in general orders of magnitude slower than D, no matter what kind of algorithms you run on both environments.
I think you're creating a straw man. I don't recall anyone saying that --
certainly not "orders of magnitude" in general. In general, when D is faster,
it'll be because more coding techniques are available; for example, because
you're not forced to use OOP all the time.
 This is because of the VM - nothing on a VM can run 
 faster than native code, they say.  If you decide to hide the bad results
 (for D), it will only reinforce the misinformation. I personally use a 
 lot of heap memory allocation in my work, and so far Java has not only 
 been safer (and provides decent stack traces and no compiler bugs), but 
 also faster - each time.
On this ng it seems to be universally acknowledged that Java's GC is much
better than D's, and that Java's tool chain is orders of magnitude more mature
than D's. But in the cases where Java beats D, it will almost always be
because of the GC. LDC probably only gets beaten on the GC.

Having looked at the DMD optimiser, I'm a bit surprised that it's competitive
at all (especially in floating point). There is so much low-hanging fruit,
it's practically an orchard <g>. It's hardly been touched since the mid-90s,
while Java has had 15 years of massive commercial development during that
time.
Sep 21 2009
parent reply Walter Bright <newshound1 digitalmars.com> writes:
Don wrote:
 Having looked at the DMD optimiser, I'm a bit surprised that it's 
 competitive at all (especially in floating point). There is so much 
 low-hanging fruit, it's practically an orchard <g>.
I believe that's because it has reached the point of diminishing returns.
Sep 23 2009
parent reply Don <nospam nospam.com> writes:
Walter Bright wrote:
 Don wrote:
 Having looked at the DMD optimiser, I'm a bit surprised that it's 
 competitive at all (especially in floating point). There is so much 
 low-hanging fruit, it's practically an orchard <g>.
I believe that's because it has reached the point of diminishing returns.
I presume you're talking about integer optimisation, because that's definitely
not the case for DMD's floating point. Eg, there's a comment in the code
somewhere saying that CSE (common subexpression) optimisations are not done
for floating point because the code generator can't cope with it. It does
loops particularly poorly: it never seems to keep a variable on the FP stack,
so it keeps loading and saving them. The potential speedups are enormous.
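
To make the pattern concrete, here is a hedged illustration (the asm mnemonics
in the comments show the code shape being described, not actual DMD output):

double sum(double[] a) {
    double s = 0.0;
    foreach (x; a)
        s += x; // poor FP codegen emits fld s / fadd [p] / fstp s on every
                // iteration, a load and a store per element; good codegen
                // keeps s in ST0 for the whole loop and stores it once
    return s;
}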
Sep 23 2009
parent Walter Bright <newshound1 digitalmars.com> writes:
Don wrote:
 I presume you're talking about integer optimisation, because that's 
 definitely not the case for DMD's floating point. Eg, there's a comment 
 in the code somewhere saying that CSE (common subexpression) 
 optimisations are not done for floating point because the code generator 
 can't cope with it. It does loops particularly poorly, it never seems to 
 keep a variable on the FP stack, so it keeps loading and saving them. The 
 potential speedups are enormous.
I'll agree with you on that assessment.
Sep 23 2009
prev sibling parent reply Walter Bright <newshound1 digitalmars.com> writes:
language_fan wrote:
 I have seen people many times mention that Java is in general orders of 
 magnitude slower than D, no matter what kind of algorithms you run on 
 both environments. This is because of the VM - nothing on a VM can run 
 faster than native code, they say. If you decide to hide the bad results 
 (for D), it will only reinforce the misinformation. I personally use a 
 lot of heap memory allocation in my work, and so far Java has not only 
 been safer (and provides decent stack traces and no compiler bugs), but 
 also faster - each time.
If you simply time a heap allocation in D and Java, Java will probably be
faster -- not because of anything inherently faster about Java, but because
Sun has poured billions of dollars into trying to make their heap allocator
faster. However, a typical D program does far fewer allocations than the Java
equivalent, for the simple reason that D supports stack-allocated and embedded
value aggregates while Java requires them to be on the heap. It is the much
reduced need for the heap to be fast that is an advantage for D.

Java does do some escape analysis to try to allocate heap objects on the stack
instead, but I don't know how effective this is, and even that won't help if
you try to embed a value aggregate into a class:

struct S { int x, y, z; }

class C
{
    S v;   // in Java this would require a second heap allocation
}

In other words, one of the most effective techniques for speeding up heap
allocation costs is to design the data structures so they require fewer
allocations!
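
To make the allocation counts concrete, here is a small hedged comparison (the
class names are only illustrative):

struct S2  { int x, y, z; }
class CS   { S2 v; }    // v is embedded: `new CS` is a single allocation
class C1a  { int x, y, z; }
class CC   { C1a c; }   // c is a reference, as a Java field would be

void main() {
    auto a = new CS;    // one heap allocation; a.v is ready to use
    auto b = new CC;    // first allocation...
    b.c = new C1a;      // ...and a second one, Java-style
}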
Sep 23 2009
parent reply bearophile <bearophileHUGS lycos.com> writes:
Walter Bright:

 Java does do some escape analysis to try and allocate heap objects on 
 the stack instead, but I don't know how effective this is, and even that 
 won't help if you try to embed a value aggregate into a class:
 
 struct S { int x, y, z; }
 
 class C
 {
      S v;   // in Java this would require a second heap allocation
 }
I have discussed with the LDC devs an improvement of that idea (it was
discussed in this newsgroup too):

class C1 { int x, y, z; }

class C2 {
    scope C1 c;
}

This idea raises a few problems, but I think they can be solved. You can
special-case the management of such objects scope-allocated inside other
objects, or you can just add an invisible field to the C2 class: the reference
to c.

Bye,
bearophile
Sep 23 2009
parent reply Jeremie Pelletier <jeremiep gmail.com> writes:
bearophile wrote:
 Walter Bright:
 
 Java does do some escape analysis to try and allocate heap objects on 
 the stack instead, but I don't know how effective this is, and even that 
 won't help if you try to embed a value aggregate into a class:

 struct S { int x, y, z; }

 class C
 {
      S v;   // in Java this would require a second heap allocation
 }
 I have discussed with the LDC devs an improvement of that idea (it was
 discussed in this newsgroup too):

 class C1 { int x, y, z; }

 class C2 {
     scope C1 c;
 }

 This idea raises a few problems, but I think they can be solved. You can
 special-case the management of such objects scope-allocated inside other
 objects, or you can just add an invisible field to the C2 class: the
 reference to c.

 Bye,
 bearophile
I would love to see such a feature added to D; it would make object composition
much more convenient and faster.

You'd just need the compiler to force a call to C1's constructor within C2's,
and to generate a call to C1's destructor in C2's epilog. I can't see how
implementing that idea could be any harder, other than the parser-related
support.

No need to store a reference within the object: that would still require
another heap allocation, killing the point of scope in the first place.
Sep 23 2009
next sibling parent reply Jarrett Billingsley <jarrett.billingsley gmail.com> writes:
On Wed, Sep 23, 2009 at 10:52 AM, Jeremie Pelletier <jeremiep gmail.com> wrote:
 I would love to see such a feature added in D, would make object composition
 much more convenient and faster.

 You'd just need the compiler to force a call into C1's constructor within
 C2's and generate a call to C1's destructor in the epilog of C2's. I can't
 see how implementing idea that could be any harder, other than the parser
 related support.
You are really hung up on parsing, aren't you ;) Parsing is a tiny, tiny, tiny fraction of the complexity of implementing a language, and most language features do not require any change to it. Really.
Sep 23 2009
parent Jeremie Pelletier <jeremiep gmail.com> writes:
Jarrett Billingsley wrote:
 On Wed, Sep 23, 2009 at 10:52 AM, Jeremie Pelletier <jeremiep gmail.com> wrote:
 I would love to see such a feature added in D, would make object composition
 much more convenient and faster.

 You'd just need the compiler to force a call into C1's constructor within
 C2's and generate a call to C1's destructor in the epilog of C2's. I can't
 see how implementing that idea could be any harder, other than the parser
 related support.
You are really hung up on parsing, aren't you ;) Parsing is a tiny, tiny, tiny fraction of the complexity of implementing a language, and most language features do not require any change to it. Really.
Everything begins at parsing! You'd need to parse the scope lexeme in that
context, then adjust the appropriate parse node to carry the scope flag, which
would require additional semantic analysis and finally a few changes in the IR
generation. That's what I implied by parser "related" support.

But yeah, I love parsing :)
Sep 23 2009
prev sibling parent bearophile <bearophileHUGS lycos.com> writes:
Jeremie Pelletier:

 No need to store a reference within the object, that would still require 
 another heap allocation killing the need for scope in the first place.
Probably you have not understood what I meant. I surely didn't mean that it
needs another heap allocation (the purpose of scope is to avoid that). Such an
extra invisible reference field among the object's attributes is not
necessary, but it makes things simpler. If you add such a reference, then
inside the outer object you have the inner object followed by the reference to
the inner object, both inlined (so no other heap allocations are necessary).
So you can keep all the usual semantics of D objects and avoid most
special-casing. For example, you can assign a different object to such a
reference:

class C1 { int x, y, z; }

class C2 {
    scope C1 c1;
}

void main() {
    auto c2 = new C2;
    auto c = new C1;
    c2.c1 = c; // <--- see here
}

Now the memory of c1 inside c2 is not reachable anymore, so the destructor of
that inner object can be called by the GC, but of course its actual memory
will not be freed until c2 is deallocated.
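
Here is a hedged sketch of what that lowering could amount to if written by
hand (everything below is hypothetical, not an existing feature;
__traits(classInstanceSize, ...) is just a convenient D2 way to spell the
storage size):

// hand-written approximation of the C2 above
class C2_lowered {
    ubyte[__traits(classInstanceSize, C1)] c1_storage; // inlined C1 instance
    C1 c1; // the visible reference: the constructor points it at c1_storage,
           // and it can later be rebound to a heap C1 (as in main() above)
}

Bye,
bearophile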
Sep 23 2009
prev sibling parent reply Don <nospam nospam.com> writes:
bearophile wrote:
 Don:
 
 It'll be interesting to see what the priorities are now -- 
 maybe this stuff is of more interest now.
Probably removing bugs is more important still :-) For example your work has changed a little how compile-time functions can be used in D.
 BTW the AMD manual for K7 (or might be K6 optimisation manual? don't 
 exactly remember) goes into great detail about both memcpy() and 
 memset(). Turns out there's about five different cases.
In the meantime Deewiant has told me that on 64 bit glibc memset is better and on more modern CPUs the timings are different (and on 64 bit my first version may not work, maybe the second one is better. I have not tested it on 64 bit LDC yet). I'm just a newbie on this stuff, while people that write the memset of 64bit glibc are expert.
Really, memset() _should_ be optimal in all cases. On almost all compilers it's
not, and on many (such as DMD) there's a _lot_ of room for improvement. So I
consider this to be a C standard library implementation issue rather than a
language weakness.
Sep 21 2009
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
Don wrote:
 bearophile wrote:
 Don:

 It'll be interesting to see what the priorities are now -- maybe this 
 stuff is of more interest now.
Probably removing bugs is more important still :-) For example your work has changed a little how compile-time functions can be used in D.
 BTW the AMD manual for K7 (or might be K6 optimisation manual? don't 
 exactly remember) goes into great detail about both memcpy() and 
 memset(). Turns out there's about five different cases.
In the meantime Deewiant has told me that on 64 bit glibc memset is better and on more modern CPUs the timings are different (and on 64 bit my first version may not work, maybe the second one is better. I have not tested it on 64 bit LDC yet). I'm just a newbie on this stuff, while people that write the memset of 64bit glibc are expert.
Really, memset() _should_ be optimal in all cases. On almost all compilers, it's not optimal, and on many (such as DMD) there's a _lot_ of room for improvement. So I consider this to a C standard library implementation issue, rather than a language weakness.
Don, I suggest the following. std.algorithm has a routine called fill(range,
value) which semantically subsumes memset. I suggest you specialize fill() for
contiguous memory ranges of primitive types (which shouldn't be hard with
std.traits), and then optimize the heck out of it. You could do the same with
copy(), also in std.algorithm, to implement a super-duper memcpy() routine.

If you go this route, people can uniformly use high-level algorithms that
specialize themselves whenever applicable.
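
A minimal sketch of the kind of specialization this suggests (the constraint
and the call into a tuned routine like the memset4 from this thread are
illustrative, not actual std.algorithm internals):

import std.traits: isIntegral, isFloatingPoint;

// illustrative overload: catch contiguous arrays of 4-byte primitives and
// route them to a tuned routine; other ranges keep the generic fill()
void fill(T)(T[] range, T value)
    if ((isIntegral!T || isFloatingPoint!T) && T.sizeof == 4)
{
    memset4(range, value); // e.g. the SSE routine from earlier in the thread
}

Andrei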
Sep 21 2009