digitalmars.D - Vectorized Laziness
- bearophile (20/20) Nov 10 2009 I've seen this good Google Talk video from Jan 2008, "MonetDB/X100: a (v...
- BCS (2/2) Nov 10 2009 Hello bearophile,
- bearophile (114/114) Nov 11 2009 It seems my idea has not interested people and devs...
- bearophile (4/4) Nov 30 2009 It seems I was not fully mad, Clojure designers have had the same idea, ...
I've seen this good Google Talk video from Jan 2008, "MonetDB/X100: a (very) fast column-store", about a more efficient column-oriented DBMS:
http://www.youtube.com/watch?v=yrLd-3lnZ58

The efficiency of this DBMS (something like 50 times faster) comes from a few things:

- It's column-wise; this helps several other successive optimizations too (row-oriented DBMSs still have their purposes, the column-oriented ones are good only if you want to perform statistics on most of your data, certain kinds of data mining, etc.).

- At 13:57 it shows that instead of yielding single tuples (or single items, since it's column-oriented) it yields arrays of about 100 fields. This allows the creation of much more efficient primitives, and the CPU can process them better, using SSE instructions too. Such small arrays are designed to fit in the CPU cache (probably L2). Such vectorized operations are also pipelined in some way.

- The filtering operations often don't produce new vectors, they just mark the items as no longer present inside an array. This helps reduce the number of allocations of new arrays (there's a little sketch of this idea at the end of this post).

- On disk the data is kept compressed; the compression is column-wise, and decompression is done only just-in-time, to reduce the amount of data transfers. The number compression takes into account that the data is often sorted, so it's delta-compressed, and then such deltas are compressed only in a range of the Gaussian-like residual distribution (outliers are encoded directly; the delta step too is sketched at the end of this post). This compression also allows keeping large indexes in RAM, and that speeds things up further. Such compression is I/O-bound, the CPU performs it at max speed.

- They even showed vectorized hashing, but I have not understood how it works.

- Reading from the disks is done in a merged way, to avoid reading the same data many times for similar queries.

(The good thing is that it's not hard to understand most things shown in this video. But I'd like to see the C code they use as "reference", which is just 3 times faster than their DBMS. Such C code may be improvable.)

Inside, this DBMS works like the lazy operations that are becoming more common in D2 code, that are common in Python 3, and even more so in Haskell. So in D2, to reduce the overhead of the lazy operations, it may be useful to use a vectorized strategy (meant to be almost transparent for the D programmer), so items can be produced and moved in groups, inside arrays, like in that video. (The DBMS has an interpreter overhead: it's made of fast primitives driven by lazy interpreted code.) This idea can be applied in other languages too.

Lazy filtering is a bit less common in D2 code, so I don't know if the idea of not creating new blocks (and just marking items as absent, for example using a bit array attached to the items array) can be useful in D2 too, as in that X100 DBMS.

In this discussion you have to think about pieces of code like:
xfilter(xchain(xmap(_, xrange(_)), xfilter(xmap(_))))
or things even more complex, and not something simpler that LDC may be able to just fully inline.

(Some time ago I posted related comments on the Python newsgroup, but they were ignored.)

(Partially unrelated note: years after starting to program in D, I still miss strict/lazy array comprehensions a lot. Maybe Walter will eventually see how handy they are.)

Bye,
bearophile
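P.S. To make the "mark items as absent" filtering idea a bit more concrete, here is a minimal sketch. The names and the bool-array layout are just my guess at the general technique, not what X100 actually does (they probably use something tighter, like a bitmask or a selection vector):

struct Block {
    int[] data;   // a small, cache-sized array of items (~100-1000)
    bool[] alive; // alive[i] == false means data[i] is filtered out
}

// Filters a block in place: no new array gets allocated, the items
// that fail the predicate just have their flag cleared.
void vectorFilter(ref Block b, bool delegate(int) pred) {
    foreach (i, x; b.data)
        if (b.alive[i] && !pred(x))
            b.alive[i] = false;
}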
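P.P.S. And the delta step of the number compression, illustrative only (the real X100 scheme also narrows the residuals to a small range and escapes the outliers, which this doesn't show):

// Delta-encodes a sorted column: successive differences are small
// numbers, so they compress much better than the raw values.
int[] deltaEncode(int[] column) {
    auto deltas = new int[column.length];
    int prev = 0;
    foreach (i, x; column) {
        deltas[i] = x - prev;
        prev = x;
    }
    return deltas;
}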
Nov 10 2009
Hello bearophile,

I seem to recall watching that a while ago. Fun and cool stuff IIRC.
Nov 10 2009
It seems my idea has not interested people and devs... The following is a silly little example; it works with both Tango and Phobos, and on both DMD and LDC the third tabler (VectorizedLazyTabler) is the faster one...

version (Tango) {
    import tango.stdc.stdio: printf;
    import tango.stdc.time: CLOCKS_PER_SEC, clock;
} else {
    import std.c.stdio: printf;
    import std.c.time: CLOCKS_PER_SEC, clock;
}

double myclock() {
    return clock() / cast(float)CLOCKS_PER_SEC;
}

const int N = 40_000_000;

// silly generator of the data: yields 0, 2, 4, 6, ...
struct EvensGenerator {
    int i;

    int opApply(int delegate(ref int) dg) {
        int result;
        while (true) {
            result = dg(i);
            i += 2;
            if (result) break;
        }
        return result;
    }
}

// eager: fills a whole N-item array up front
int[] strictTabler(TyGen)(TyGen g) {
    auto arr = new int[N];
    int i;
    foreach (el; g) {
        arr[i] = el;
        i++;
        if (i == N) break;
    }
    return arr;
}

// lazy: one delegate call per item
struct LazyTabler(TyGen) {
    TyGen g;

    int opApply(int delegate(ref int) dg) {
        int result, i;
        foreach (el; g) {
            result = dg(el);
            if (result) break;
            i++;
            if (i == N) break;
        }
        return result;
    }
}

// vectorized lazy: one delegate call per block of BLOCK_SIZE items
struct VectorizedLazyTabler(TyGen) {
    const BLOCK_SIZE = 1_000;
    TyGen g;

    int opApply(int delegate(ref int[]) dg) {
        int result;
        // 'block' can be allocated elsewhere, etc.
        auto block = new int[BLOCK_SIZE];

        for (int nblock; nblock < (N / BLOCK_SIZE); nblock++) {
            int i;
            foreach (el; g) {
                block[i] = el;
                i++;
                if (i == BLOCK_SIZE) break;
            }
            result = dg(block);
            if (result) return result;
        }

        // the last, possibly shorter, block
        int rest = N % BLOCK_SIZE;
        if (rest) {
            int i;
            foreach (el; g) {
                block[i] = el;
                i++;
                if (i == rest) break;
            }
            block.length = rest;
            result = dg(block);
        }
        return result;
    }
}

void main() {
    double t0, t1;
    long tot;

    t0 = myclock();
    tot = 0;
    foreach (el; strictTabler(EvensGenerator()))
        tot += el;
    t1 = myclock();
    printf("%lld, %.2f\n", tot, t1 - t0);

    t0 = myclock();
    tot = 0;
    auto g2 = LazyTabler!(EvensGenerator)(EvensGenerator());
    foreach (el; g2)
        tot += el;
    t1 = myclock();
    printf("%lld, %.2f\n", tot, t1 - t0);

    t0 = myclock();
    tot = 0;
    auto g3 = VectorizedLazyTabler!(EvensGenerator)(EvensGenerator());
    foreach (block; g3)
        foreach (el; block)
            tot += el;
    t1 = myclock();
    printf("%lld, %.2f\n", tot, t1 - t0);
}

Bye,
bearophile
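P.S. A possible next step, just a sketch (untested and not benchmarked): a block-wise map that composes with VectorizedLazyTabler, so the delegate is still called once per block and the mapping function runs in a tight inner loop:

struct VectorizedMapper(TyBlockGen) {
    TyBlockGen g;
    int function(int) f;

    int opApply(int delegate(ref int[]) dg) {
        int result;
        foreach (int[] block; g) {
            // map the whole block before handing it on
            foreach (ref int el; block)
                el = f(el);
            result = dg(block);
            if (result) break;
        }
        return result;
    }
}

// Used like:
//   auto g4 = VectorizedMapper!(VectorizedLazyTabler!(EvensGenerator))
//       (VectorizedLazyTabler!(EvensGenerator)(EvensGenerator()),
//        function int(int x) { return x + 1; });
//   foreach (int[] block; g4)
//       foreach (el; block)
//           tot += el;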
Nov 11 2009
It seems I was not fully mad: the Clojure designers have had the same idea, see page 29 of this document (click on it, don't use "save as"):
http://clojure.googlegroups.com/web/ClojureExperience.pdf

Be well,
bearophile
Nov 30 2009