digitalmars.D - Vectorized Laziness

bearophile (20/20) Nov 10 2009 I've seen this good Google Talk video from Jan 2008, "MonetDB/X100: a (v...

BCS (2/2) Nov 10 2009 Hello bearophile,
bearophile (114/114) Nov 11 2009 It seems my idea has not interested people and devs...

bearophile (4/4) Nov 30 2009 It seems I was not fully mad, Clojure designers have had the same idea, ...

bearophile <bearophileHUGS lycos.com> writes:

I've seen this good Google Talk video from Jan 2008, "MonetDB/X100: a (very)
fast column-store", about a more efficient column-oriented DBMS:
http://www.youtube.com/watch?v=yrLd-3lnZ58

The efficiency of this DBMS (something like 50 times faster) is produced by few
things:
- It's column-wise, this helps in several other successive optimizations too
(row-oriented DBMS have their purpose still, the column-oriented are good only
if you want to perform statistics on most of your data, certain kinds of data
mining, etc).
- At 13.57 it shows that instead of yielding single tuples (or single items,
it's column-oriented), it yields arrays of about 100 fields. This allows to
create primitives that are much more efficient. And the CPU can process them
better, using SSE instruction too. Such small arrays are designed to fit in the
CPU cache (probably L2). Such vectorized operations are also pipelined in some
way.
- The filtering operations often don't produce new vectors, they just mark the
items as not present any more inside an array. This helps reducing the number
of allocatations of new arrays.
- On disk data is kept compressed, the compression is column-wise, and
decompression is done only just-in-time to reduce the amount of data transfers.
The number compression takes in account that often data is sorted, so it's
delta-compressed, and then such delta is compressed only in a range of the
Gaussian-like residual distribution (outliers are encoded directly). This
compression also allows to keep large indexes in RAM, that speeds up things
more. Such compression is I/O bound, the CPU performs it at max speed.
- They even shown vectorized hashing, but I have not understood how they work.
- Reading from the disks is done in a merged way, to avoid reading the same
data many times for similar queries.

(The good thing is that it's not hard to understand most things shown in this
video. But I'd like to see the C code they use as "reference", that's just 3
times faster than their DBMS. Such C code may be improved).

Inside, DBMS work as the lazy operations that are getting more common in D2
code, that are common in Python3, and even more in Haskell.

So in D2 to reduce the overhead of the lazy operations it may be useful to use
a vectorized strategy (that's meant to be almost transparent for the D
programmer), so items can be produced and moved in groups, inside arrays, like
in that video. (The DBMS has an interpreter overhead, it's made of fast
primitives used by lazy interpreted code). This idea can be applied in other
languages too.

Lazy filtering is a bit less common in D2 code, so I don't know if the idea of
not creating new blocks (and just marking items as absent, for example using a
bit array attached to the items array) can be useful in D2 too, as in that X100
DBMS.

In this discussion you have to think about pieces of code like:
xfilter(xchain(xmap(_, xrange(_)), xfilter(xmap(_))))
or things even more complex, and not something simpler, that LDC may be able to
just fully inline.

(Time ago I have posted related comments on the Python newsgroup, but it was
ignored.)

(Partially unrelated note: years after starting to program in D, I still miss a
lot strict/lazy array comprehensions in D. Maybe Walter will eventually see how
handy they are.)

Bye,
bearophile

Nov 10 2009

BCS <none anon.com> writes:

Hello bearophile,

I seem to recall watching that a while ago. Fun and cool stuff IIRC.

Nov 10 2009

bearophile <bearophileHUGS lycos.com> writes:

It seems my idea has not interested people and devs...
The following is a silly little example, works with Tango and Phobos, and on
both DMD and LDC the third tabler (VectorizedLazyTabler) is faster...


version (Tango) {
    import tango.stdc.stdio: printf;
    import tango.stdc.time: CLOCKS_PER_SEC, clock;
} else {
    import std.c.stdio: printf;
    import std.c.time: CLOCKS_PER_SEC, clock;
}

double myclock() {
    return clock() / cast(float)CLOCKS_PER_SEC;
}

const int N = 40_000_000;


// silly generator of the data
struct EvensGenerator {
    int i;
    int opApply(int delegate(ref int) dg) {
        int result;
        while (true) {
            result = dg(i);
            i += 2;
            if (result)
                break;
        }
        return result;
    }
}


int[] strictTabler(TyGen)(TyGen g) {
    auto arr = new int[N];
    int i;
    foreach (el; g) {
        arr[i] = el;
        i++;
        if (i == N)
            break;
    }
    return arr;
}


struct LazyTabler(TyGen) {
    TyGen g;

    int opApply(int delegate(ref int) dg) {
        int result, i;
        foreach (el; g) {
            result = dg(el);
            if (result)
                break;
            i++;
            if (i == N)
                break;
        }
        return result;
    }
}


struct VectorizedLazyTabler(TyGen) {
    const BLOCK_SIZE = 1_000;
    TyGen g;

    int opApply(int delegate(ref int[]) dg) {
        int result;
        // 'block' can be allocated elsewhere, etc
        auto block = new int[BLOCK_SIZE];
        for (int nblock; nblock < (N / BLOCK_SIZE); nblock++) {
            int i;
            foreach (el; g) {
                block[i] = el;
                i++;
                if (i == BLOCK_SIZE)
                    break;
            }
            result = dg(block);
            if (result)
                return result;
        }

        int rest = N % BLOCK_SIZE;
        if (rest) {
            int i;
            foreach (el; g) {
                block[i] = el;
                i++;
                if (i == rest)
                    break;
            }
            block.length = rest;
            result = dg(block);
        }

        return result;
    }
}


void main() {
    double t0, t1;
    long tot;

    t0 = myclock();
    tot = 0;
    foreach (el; strictTabler(EvensGenerator()))
        tot += el;
    t1 = myclock();
    printf("%lld, %.2f\n", tot, t1 - t0);

    t0 = myclock();
    tot = 0;
    auto g2 = LazyTabler!(EvensGenerator)(EvensGenerator());
    foreach (el; g2)
        tot += el;
    t1 = myclock();
    printf("%lld, %.2f\n", tot, t1 - t0);

    t0 = myclock();
    tot = 0;
    auto g3 = VectorizedLazyTabler!(EvensGenerator)(EvensGenerator());
    foreach (block; g3)
        foreach (el; block)
            tot += el;
    t1 = myclock();
    printf("%lld, %.2f\n", tot, t1 - t0);
}


Bye,
bearophile

Nov 11 2009

bearophile <bearophileHUGS lycos.com> writes:

It seems I was not fully mad, Clojure designers have had the same idea, see
page 29 of this document (Click on it, don't save as):
http://clojure.googlegroups.com/web/ClojureExperience.pdf

Be well,
bearophile

Nov 30 2009

D Programming

C/C++ Programming

Other

digitalmars.D - Vectorized Laziness