digitalmars.D - Array append performance 2
- bearophile (46/46) Aug 30 2008 Until some more basic solution is found (and/or answers from Walter rega...
- bearophile (4/4) Aug 30 2008 IsReferenceType!() isn't fully correct because it sees fixed size arrays...
- dsimcha (15/18) Aug 30 2008 reference types. I have posted a more correct (and cleaned, and single-t...
- Sergey Gromov (21/29) Aug 31 2008 I may be missing something, but the following seems to work:
- Sergey Gromov (28/45) Aug 31 2008 Fixed arrays: dynamic and associative arrays always contain references,
- dsimcha (24/24) Aug 30 2008 Nice work on the reference template. I had thought that the typeid(T).f...
- bearophile (11/17) Aug 30 2008 Right, I have just read the D docs that say it.
- Sean Kelly (3/10) Aug 30 2008 Not with Tango.
- Fawzi Mohamed (9/23) Aug 31 2008 I would have done it quite differently: keep a linked list (if you want
- bearophile (10/17) Aug 31 2008 Yes, my code was experimental, my main purpose was to ask if it's correc...
- Fawzi Mohamed (7/40) Aug 31 2008 I had not thought about an important thing: the content of the array
- bearophile (4/6) Aug 31 2008 Yes, and there are other problems, that make the situation/code a mess, ...
- Denis Koroskin (51/59) Sep 01 2008 Please, test my version, too!
- bearophile (49/52) Sep 01 2008 Sorry, I'm not using D 2.x. I (hopefully) will soon show an improved ver...
- bearophile (5/5) Sep 01 2008 This is the first beta version, it seems to work well enough (despite to...
- bearophile (5/5) Sep 01 2008 Well, a finished version of ArrayBuilder is in my libs, in the 'builders...
- Benji Smith (9/16) Sep 01 2008 Very cool.
- bearophile (5/9) Sep 01 2008 My libs are meant to improve std libs, if/where/when the compiler can't ...
- Dave (8/21) Sep 02 2008 There have been quite a few examples of contributors modifying the D run...
- bearophile (6/11) Sep 03 2008 It seems to work, it speeds up the code where I have used it (I'm using ...
- BCS (8/35) Sep 01 2008 I think that in many cases the information needed for the compiler to pr...
- Denis Koroskin (2/33) Aug 31 2008 Static doesn't apply to a struct, only to inner classes, IIRC.
- Christian Kamm (4/14) Aug 31 2008 Actually, I don't think the D spec requires typeof(T.init) == T. And T i...
- bearophile (23/25) Aug 31 2008 Right, but the following program:
- downs (3/8) Sep 01 2008 This should do what you want:
- Nick B (34/34) Sep 01 2008 Hi
- Lars Ivar Igesund (13/56) Sep 01 2008 That is well and fine, but you shouldn't hijack a thread for it, rather
- Brian Price (24/58) Sep 01 2008 Hate to say it but this is yet another case of reinventing the wheel. T...
- Leandro Lucarella (19/23) Sep 01 2008 Different problems has differents solutions. What you say is really nice
- Brian Price (19/35) Sep 01 2008 In a (near) realtime embedded system with a high wire cost (cellphone, s...
- Mitja Slenc (23/81) Sep 02 2008 Can you explain where you see violation of DRY exactly? Also, can you
Until some more basic solution is found (and/or answers from Walter regarding how to solve this append performance problem and regarding the idea of adding a third capacity field to dynamical arrays), someone has suggested to create an array builder class, like the string builder of Java. So I have created one, the code is in the attach. Note that code is raw still, not commented, and it does the bare minimum (append one item and extract the array, I'd like to add generic extend methods, and tye possibility to change the capacity/length), its tests are messy still, etc, once I know it's correct I'll improve it and I'll add it to my libs (of course, I can offer a version of this code to Phobos/Tango too). It's not easy for me to write such GC-based code, so my questions are: Is ArrayBuilder!() correct in every situation? Why are the performance of the benchmark3 so low (a bit worse than normal array append)? Can it be improved? Extra note: while creating it I have found another bug in DMD I didn't know about: import std.stdio: writefln; void main() { alias int[2] T; writefln(T.init); } This program has to print [0,0] instead of 0. Timings, on a Core Duo 2 GHz, 2 GB RAM: benchmark 1, N=2_000_000: Array append: 0.160 s ArrayBuilder: 0.026 s benchmark 1, N=20_000_000: Array append: 1.837 s ArrayBuilder: 0.325 s benchmark 1, N=50_000_000: Array append: 4.489 s ArrayBuilder: 0.731 s benchmark 1, N=200_000_000: Array append: 32.293 s ArrayBuilder: Out of memory benchmark 2, N=200_000: Array append: 0.099 s ArrayBuilder: 0.004 s benchmark 2, N=2_000_000: Array append: 6.896 s ArrayBuilder: 0.050 s benchmark 3, N=200_000: Array append: 0.090 s ArrayBuilder: 0.076 s benchmark 3, N=2_000_000: Array append: 1.014 s ArrayBuilder: 0.923 s (why is this so slow?) benchmark 3, N=6_000_000: Array append: 5.295 s ArrayBuilder: 4.431 s benchmark 4, N=500_000: Array append: 1.109 s ArrayBuilder: 0.414 s benchmark 4, N=1_000_000: Array append: 3.103 s Bye, bearophile
Aug 30 2008
IsReferenceType!() isn't fully correct because it sees fixed size arrays as reference types. I have posted a more correct (and cleaned, and single-template) version here: http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.D&article_id=75813 Bye, bearophile
Aug 30 2008
== Quote from bearophile (bearophileHUGS lycos.com)'s articleIsReferenceType!() isn't fully correct because it sees fixed size arrays asreference types. I have posted a more correct (and cleaned, and single-template) version here:http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.D&article_id=75813Bye, bearophileOne small issue: Your IsReference template doesn't work with invariant, and presumably const, types. For example, IsReferenceType!(typeof("Foo"[0])); //Returns true. The easiest way to fix this is to add a Mutable!(): static if (IsType!(Mutable!(Types[0]), bool, byte, ubyte, short, ushort, int, uint, long, ulong, float, double, real, ifloat, idouble, ireal, cfloat, cdouble, creal, char, dchar, wchar) ) Also, although it may be the least bad way to do this (I can't think of a better one), the process of elimination method this relies on seems kind of brittle because it would have to be modified for any new type that is added. If anyone knows of a less kludge-y way, please post.
Aug 30 2008
dsimcha <dsimcha yahoo.com> wrote:static if (IsType!(Mutable!(Types[0]), bool, byte, ubyte, short, ushort, int, uint, long, ulong, float, double, real, ifloat, idouble, ireal, cfloat, cdouble, creal, char, dchar, wchar) ) Also, although it may be the least bad way to do this (I can't think of a better one), the process of elimination method this relies on seems kind of brittle because it would have to be modified for any new type that is added. If anyone knows of a less kludge-y way, please post.I may be missing something, but the following seems to work: template Category(T) { static if ( is(T == interface) || is(T == class) || is(T Base : Base*)) { const Category = "reference"; } else const Category = "value"; } template Category(T : T[U], U) { const Category = Category!(T); } You need to extend this for structs and unions but that seems routine... -- SnakE
Aug 31 2008
Sergey Gromov <snake.scaly gmail.com> wrote:template Category(T) { static if ( is(T == interface) || is(T == class) || is(T Base : Base*)) { const Category = "reference"; } else const Category = "value"; } template Category(T : T[U], U) { const Category = Category!(T); }Fixed arrays: dynamic and associative arrays always contain references, static arrays must be checked recursively: template Category(T) { static if ( is(T == interface) || is(T == class) || is(T Elem : Elem[]) || // dyn array is(T Base : Base*)) { const Category = "reference"; } else const Category = "value"; } template Category(T : T[U], U) { // AAs always contain references const Category = "reference"; } template Category(T : T[U], size_t U) { // Static arrays may or may not contain references const Category = Category!(T); } -- SnakE
Aug 31 2008
Nice work on the reference template. I had thought that the typeid(T).flags() would be inlined, then constant folded, but CTFE doesn't work on it, so I guess not. Probably pretty negligible, but if I have a template laying around to do this at compile time, I may as well use it. Also, to answer your question, you do need to call hasNoPointers() each time you realloc. import std.stdio, std.gc, std.process; void main() { uint[] foo = (cast(uint[]) malloc(uint.sizeof))[0..1]; hasNoPointers(foo.ptr); foo = cast(uint[]) realloc(foo.ptr, 100_000 * uint.sizeof)[0..100_000]; //Commenting next line out causes severe mem leaks. hasNoPointers(foo.ptr); foreach(ref f; foo) { f = cast(uint) test(); //Create false pointers. } system("pause"); //So mem usage can be checked. } uint* test() { return (new uint[1_000]).ptr; } Oh, one more thing: What's a static struct as opposed to a regular struct? I've never seen it used before. I had assumed that this means a struct composed only of static members, but apparently not.
Aug 30 2008
dsimcha:Nice work on the reference template.<It comes from several expansion and compression cycles, has requires few hours to be written ;-)Also, to answer your question, you do need to call hasNoPointers() each time you realloc.<Right, I have just read the D docs that say it. But what's the rationale behind it? If I have a pointer and I want to realloc it, to make its space bigger, I never want to change the fact that the GC scans its pointers or not. Stating it once must suffice.Oh, one more thing: What's a static struct as opposed to a regular struct?<I have taken a look at the documentation, and I have performed some benchmarks, now I think it does nothing, I have removed that attribute. I have flagged that struct fields as private because originally that was a class, but a struct is better and faster, so I have removed those "private" too.One small issue: Your IsReference template doesn't work with invariant, and presumably const, types.<Because I (as most people here, I presume) use D 1. Sorry for not stating it clearly.The easiest way to fix this is to add a Mutable!(): static if (IsType!(Mutable!(Types[0]), bool, byte, [...]Thank you, I have added a note into my libs, even if they are for D 1. Hopefully some one else will tell me if the ArrayBuilder is correct (even if this group isn't D.learn). Bye and thank you, bearophile
Aug 30 2008
dsimcha wrote:Nice work on the reference template. I had thought that the typeid(T).flags() would be inlined, then constant folded, but CTFE doesn't work on it, so I guess not. Probably pretty negligible, but if I have a template laying around to do this at compile time, I may as well use it. Also, to answer your question, you do need to call hasNoPointers() each time you realloc.Not with Tango. Sean
Aug 30 2008
On 2008-08-31 06:01:26 +0200, Sean Kelly <sean invisibleduck.org> said:dsimcha wrote:I would have done it quite differently: keep a linked list (if you want to improve of batches of 10 arrays), with a pointer to the end, append in O(1), and keep the total length. only upon a call to toArray allocate an array of the requested length and fill it. An array (even with tricks) is not a good structure to append to, so it seems a bad idea to me to use it. FawziNice work on the reference template. I had thought that the typeid(T).flags() would be inlined, then constant folded, but CTFE doesn't work on it, so I guess not. Probably pretty negligible, but if I have a template laying around to do this at compile time, I may as well use it. Also, to answer your question, you do need to call hasNoPointers() each time you realloc.Not with Tango. Sean
Aug 31 2008
Fawzi Mohamed:I would have done it quite differently: keep a linked list (if you want to improve of batches of 10 arrays), with a pointer to the end, append in O(1), and keep the total length. only upon a call to toArray allocate an array of the requested length and fill it. An array (even with tricks) is not a good structure to append to, so it seems a bad idea to me to use it.Yes, my code was experimental, my main purpose was to ask if it's correct, in its GC management, etc. Algorithmic and code tuning is something I think about later. I'm now implementing ArrayBuilder with a deque-like, to see how it performs in comparison. There are 3 possible things you may want to join to such ArrayBuilder: - Fixed size arrays - Single items - Dynamic arrays For the dynamic arrays can be copied just a reference, for the other is better to copy the data. The single items can be added to short (64 KB) structs with fixed size arrays inside, organized into a linked list. The static arrays can equally be copied in pieces into such blocks. The dynamic arrays instead require short arrays filled with the struct of the dynamic arrays. This makes things messy. If the user keeps alternating single items to dynamic arrays the data structure has to avoid wasting too much memory. So I don't see an easy solution yet, beside my original one or a similar one (that consists in copying all items of dynamic arrays too). Bye and thank you, bearophile
Aug 31 2008
On 2008-08-31 17:37:13 +0200, bearophile <bearophileHUGS lycos.com> said:Fawzi Mohamed:I had not thought about an important thing: the content of the array can change... so copying is safer. Probably using (as you said) a linked list of fixed arrays is the best approach for many small appends, maybe append might have a bool argument "fixed=false" to know if copying can be avoided. FawziI would have done it quite differently: keep a linked list (if you want to improve of batches of 10 arrays), with a pointer to the end, append in O(1), and keep the total length. only upon a call to toArray allocate an array of the requested length and fill it. An array (even with tricks) is not a good structure to append to, so it seems a bad idea to me to use it.Yes, my code was experimental, my main purpose was to ask if it's correct, in its GC management, etc. Algorithmic and code tuning is something I think about later. I'm now implementing ArrayBuilder with a deque-like, to see how it performs in comparison. There are 3 possible things you may want to join to such ArrayBuilder: - Fixed size arrays - Single items - Dynamic arrays For the dynamic arrays can be copied just a reference, for the other is better to copy the data. The single items can be added to short (64 KB) structs with fixed size arrays inside, organized into a linked list. The static arrays can equally be copied in pieces into such blocks. The dynamic arrays instead require short arrays filled with the struct of the dynamic arrays. This makes things messy. If the user keeps alternating single items to dynamic arrays the data structure has to avoid wasting too much memory. So I don't see an easy solution yet, beside my original one or a similar one (that consists in copying all items of dynamic arrays too). Bye and thank you, bearophile
Aug 31 2008
Fawzi Mohamed:I had not thought about an important thing: the content of the array can change... so copying is safer.Yes, and there are other problems, that make the situation/code a mess, so I copy all items now, keeping code simpler. I'm debugging/benchmarking it now... Later, bearophile
Aug 31 2008
On Mon, 01 Sep 2008 01:11:46 +0400, bearophile <bearophileHUGS lycos.com> wrote:Fawzi Mohamed:Please, test my version, too! While it is slighly slower, it allows appending invariant arrays (i.e. strings) without copying data (although it is currently not implemented). An idea is to store an array (or a list) of invariant elements. Current buffer is mutable until it is completely full. It then casted to invariant and appended to the list and an new mutable buffer (of a higher size) created. struct ArrayBuilder(T) { //private BearophileArrayBuilder!(invariant(T[])) subArrays; private invariant(T[])[] subArrays; private int lengthSoFar; private T[] tmpArray; private int curOffset = 0; void opCatAssign(T x) { int capacity = tmpArray.length; if (curOffset >= capacity) { int old_capacity = capacity; if (capacity == 0) capacity = 4; // starting point not too much small else if (capacity < (1 << 26)) // 67_108_864 capacity *= 2; // for amortized O(1) and less RAM fragmentation else capacity *= 1.3; // slower growing rate to save RAM if (old_capacity != 0) { subArrays ~= cast(invariant(T[]))tmpArray; } tmpArray = new T[capacity]; curOffset = 0; } tmpArray[curOffset] = x; assert(tmpArray[curOffset] == x); ++curOffset; ++lengthSoFar; } /// T[] toarray() { T[] array = new T[lengthSoFar]; int offset = 0; foreach (subArray; subArrays) { int newOffset = offset+subArray.length; array[offset..newOffset] = subArray[]; offset = newOffset; } array[offset..$] = tmpArray[0..curOffset]; return array; } }I had not thought about an important thing: the content of the array can change... so copying is safer.Yes, and there are other problems, that make the situation/code a mess, so I copy all items now, keeping code simpler. I'm debugging/benchmarking it now... Later, bearophile
Sep 01 2008
Denis Koroskin:Please, test my version, too! While it is slighly slower, it allows appending invariant arrays (i.e. strings) without copying data (although it is currently not implemented).Sorry, I'm not using D 2.x. I (hopefully) will soon show an improved version of my code, so you can test your version. Note: I have benchmarked the version with the deque-like data structure, and it's slower, the code is a bit more complex (and I may have hit a bug in the GC (I'd like to know if Tango behaves at the same way)), so I'll probably just use the first version, with the reallocs. ---------------- With Phobos: import std.stdio: writefln; import std.gc: malloc; void main() { int M = 1 << 16; int N = M / int.sizeof; const int N_POINTERS = 18; int*[N_POINTERS] pointers; foreach (ref ptr; pointers) { ptr = cast(int*)malloc(N * int.sizeof); if (ptr is null) { throw new Error("OUT OF MEM"); return 1; } } int last = 0; foreach (i, ptr; pointers) { int int_ptr = cast(int)ptr; int diff = int_ptr - last; writefln(i, " ", int_ptr, " ", diff, " ", M - diff); last = int_ptr; } } Prints: 0 9052160 9052160 -8986624 1 9117696 65536 0 2 9183232 65536 0 3 9248768 65536 0 4 9314304 65536 0 5 9379840 65536 0 6 9445376 65536 0 7 9510912 65536 0 8 9576448 65536 0 9 9641984 65536 0 10 9707520 65536 0 11 9773056 65536 0 12 9838592 65536 0 13 9904128 65536 0 14 9969664 65536 0 15 10092544 122880 -57344 16 10158080 65536 0 17 10223616 65536 0 Is that result at n = 15 normal/correct? Bye, bearophile
Sep 01 2008
This is the first beta version, it seems to work well enough (despite too much easy memory overflows), I'll add ddoc comments, better comments, I'll clean the code, improve unit tests, etc. If you spot problems I'd like to know them. You will probably find its finished version inside my libs in one or two days. I presume Walter isn't interested to put this into Phobos, nor to 'fix' the built-in arrays (or maybe he's just busy). Bye, bearophile
Sep 01 2008
Well, a finished version of ArrayBuilder is in my libs, in the 'builders' module: http://www.fantascienza.net/leonardo/so/libs_d.zip I have already used it to replace the old string appending used in the putr() (printing functions) and now printing is about 2.5 times faster (and using it for that purpose is a way to perform hundreds of unit tests on ArrayBuilder), so I'm quite pleased. Bye, and thank you for all the help and answers (but more bugs can be present still), bearophile
Sep 01 2008
bearophile wrote:Well, a finished version of ArrayBuilder is in my libs, in the 'builders' module: http://www.fantascienza.net/leonardo/so/libs_d.zip I have already used it to replace the old string appending used in the putr() (printing functions) and now printing is about 2.5 times faster (and using it for that purpose is a way to perform hundreds of unit tests on ArrayBuilder), so I'm quite pleased. Bye, and thank you for all the help and answers (but more bugs can be present still), bearophileVery cool. Philosophically, though, isn't the whole purpose of having dynamic arrays in D to avoid having to create library implementations like this? I can definitely appreciate how something like this provides proof-of-concept for compiler & stdlib improvements, but in the end, having to rely on an array-builder class seems very non-D-ish. Thoughts? --benji
Sep 01 2008
Benji Smith:Philosophically, though, isn't the whole purpose of having dynamic arrays in D to avoid having to create library implementations like this?I agree, but so far I am not able to modify/improve the language. And an ArrayBuilder isn't meant to replace dynamic arrays, it's just a way to build them faster, "patching" what I perceive as one of their problems.I can definitely appreciate how something like this provides proof-of-concept for compiler & stdlib improvements,My libs are meant to improve std libs, if/where/when the compiler can't be modified. They are un-standard but they are largish libs anyway, and I'm developing them doing my best :-) Bye, bearophile
Sep 01 2008
"bearophile" <bearophileHUGS lycos.com> wrote in message news:g9i1nb$26pd$1 digitalmars.com...Benji Smith:There have been quite a few examples of contributors modifying the D runtime to improve things like array concatenation. If it generally performs better and is demonstrably correct, I'm pretty sure Walter and the Tango crew would incorporate it. In other words, have a go at improving the arraycat code in internal/gc/gc.d if you have the time. It's simple enough to modify and build.Philosophically, though, isn't the whole purpose of having dynamic arrays in D to avoid having to create library implementations like this?I agree, but so far I am not able to modify/improve the language. And an ArrayBuilder isn't meant to replace dynamic arrays, it's just a way to build them faster, "patching" what I perceive as one of their problems.I can definitely appreciate how something like this provides proof-of-concept for compiler & stdlib improvements,My libs are meant to improve std libs, if/where/when the compiler can't be modified. They are un-standard but they are largish libs anyway, and I'm developing them doing my best :-) Bye, bearophile
Sep 02 2008
Dave:There have been quite a few examples of contributors modifying the D runtime to improve things like array concatenation.This requires three fields in the dynamic array struct, so maybe it requires changes in several points of the front-end.If it generally performs better and is demonstrably correct, I'm pretty sure Walter and the Tango crew would incorporate it.<It seems to work, it speeds up the code where I have used it (I'm using it more and more in my libs, it has speed up the select() two times, it contains an array append because in general you don't know how many items you filter away from the input iterable) but I am not able to demostrate it correct. Its main fault so far is that it causes MemoryOverflow if you try to add something like 200 million size_t to it.In other words, have a go at improving the arraycat code in internal/gc/gc.d if you have the time. It's simple enough to modify and build.I'll take a look, but I presume it's not easy as you say :-) Bye, bearophile
Sep 03 2008
Reply to Benji,bearophile wrote:I think that in many cases the information needed for the compiler to produce the needed optimizations is to hard to encode in D like languages. D can handle the individual cats but when to do the rest is a problem that can span to much code to expect the compiler to solve it. A human with knowledge of what is supposed to happen can use libs to improve the situation. OTOH the best solution would be to describe programs in a way that lets a computer do that work. We can dream.Well, a finished version of ArrayBuilder is in my libs, in the 'builders' module: http://www.fantascienza.net/leonardo/so/libs_d.zip I have already used it to replace the old string appending used in the putr() (printing functions) and now printing is about 2.5 times faster (and using it for that purpose is a way to perform hundreds of unit tests on ArrayBuilder), so I'm quite pleased. Bye, and thank you for all the help and answers (but more bugs can be present still), bearophileVery cool. Philosophically, though, isn't the whole purpose of having dynamic arrays in D to avoid having to create library implementations like this? I can definitely appreciate how something like this provides proof-of-concept for compiler & stdlib improvements, but in the end, having to rely on an array-builder class seems very non-D-ish. Thoughts? --benji
Sep 01 2008
On Sun, 31 Aug 2008 04:58:10 +0400, dsimcha <dsimcha yahoo.com> wrote:Nice work on the reference template. I had thought that the typeid(T).flags() would be inlined, then constant folded, but CTFE doesn't work on it, so I guess not. Probably pretty negligible, but if I have a template laying around to do this at compile time, I may as well use it. Also, to answer your question, you do need to call hasNoPointers() each time you realloc. import std.stdio, std.gc, std.process; void main() { uint[] foo = (cast(uint[]) malloc(uint.sizeof))[0..1]; hasNoPointers(foo.ptr); foo = cast(uint[]) realloc(foo.ptr, 100_000 * uint.sizeof)[0..100_000]; //Commenting next line out causes severe mem leaks. hasNoPointers(foo.ptr); foreach(ref f; foo) { f = cast(uint) test(); //Create false pointers. } system("pause"); //So mem usage can be checked. } uint* test() { return (new uint[1_000]).ptr; } Oh, one more thing: What's a static struct as opposed to a regular struct? I've never seen it used before. I had assumed that this means a struct composed only of static members, but apparently not.Static doesn't apply to a struct, only to inner classes, IIRC.
Aug 31 2008
Extra note: while creating it I have found another bug in DMD I didn't know about: import std.stdio: writefln; void main() { alias int[2] T; writefln(T.init); } This program has to print [0,0] instead of 0.Actually, I don't think the D spec requires typeof(T.init) == T. And T is a valid initializer for T[N]: int[3] arr = 5; // sets all elements to 5 Christian
Aug 31 2008
Christian Kamm:Actually, I don't think the D spec requires typeof(T.init) == T.Maybe the D specs have to be fixed then.And T is a valid initializer for T[N]:Right, but the following program: import std.stdio: writefln; void main() { alias int[2] T; writefln(T.init); int[2] T_init; writefln(T_init); } prints: 0 [0,0] So if I have a dynamic array of static arrays (where T is the static array) I can't do this: array[0 .. $][] = T.init; I have to do: T T_init; array[0 .. $][] = T_init; While if T is a dynamic array I can do: array[0 .. $] = T.init; So even if they aren't bugs I don't like all such differences much... Bye, bearophile
Aug 31 2008
bearophile wrote:Christian Kamm:This should do what you want: template Init(T) { const T Init; }Actually, I don't think the D spec requires typeof(T.init) == T.Maybe the D specs have to be fixed then.
Sep 01 2008
Hi I came across this the other day, and no one has mentioned this, on this news group before, I thought I would bring the subject to the communities attention, so to speak. Google has very recently, open sourced "Protocol Buffers". What is it you ask ? In a couple of lines it is a language-neutral, platform-neutral, extensible way of serializing structured data for use in communications protocols, data storage, and more. Think XML, but smaller, faster, and simpler. Why not just use XML ? Protocol buffers have many advantages over XML for serializing structured data. Protocol buffers: * are simpler * are 3 to 10 times smaller * are 20 to 100 times faster * are less ambiguous * generate data access classes that are easier to use programmatically. PB supports Java, Python,and C++ currently. A more detailed overview can be found here: http://code.google.com/apis/protocolbuffers/docs/overview.html and the FAQ here: http://code.google.com/apis/protocolbuffers/docs/faq.html See the answer to the question "Can I add support for a new language to protocol buffers?" inside the FAQ. Some Tips and comments can be found here: http://zunger.livejournal.com/164024.html My questions. Does the D community see this of interest ? Is this something they might use ? Do they see value in it being added to the respective Tango or Phobos frameworks ? any other comments ? cheers Nick B.
Sep 01 2008
Nick B wrote:Hi I came across this the other day, and no one has mentioned this, on this news group before, I thought I would bring the subject to the communities attention, so to speak.That is well and fine, but you shouldn't hijack a thread for it, rather start a new one.Google has very recently, open sourced "Protocol Buffers". What is it you ask ? In a couple of lines it is a language-neutral, platform-neutral, extensible way of serializing structured data for use in communications protocols, data storage, and more. Think XML, but smaller, faster, and simpler. Why not just use XML ? Protocol buffers have many advantages over XML for serializing structured data. Protocol buffers: * are simpler * are 3 to 10 times smaller * are 20 to 100 times fasterNot necessarily a valid point if you use the right parser, say, one from Tango.* are less ambiguous * generate data access classes that are easier to use programmatically. PB supports Java, Python,and C++ currently. A more detailed overview can be found here: http://code.google.com/apis/protocolbuffers/docs/overview.html and the FAQ here: http://code.google.com/apis/protocolbuffers/docs/faq.html See the answer to the question "Can I add support for a new language to protocol buffers?" inside the FAQ. Some Tips and comments can be found here: http://zunger.livejournal.com/164024.html My questions. Does the D community see this of interest ? Is this something they might use ? Do they see value in it being added to the respective Tango or Phobos frameworks ?If the protocol is widely used, then it would certainly be beneficial to support it, and I would assume that it is fairly easy to implement on top of Tango. -- Lars Ivar Igesund blog at http://larsivi.net DSource, #d.tango & #D: larsivi Dancing the Tango
Sep 01 2008
== Quote from Nick B (nick.barbalich gmail.com)'s articleHi I came across this the other day, and no one has mentioned this, on this news group before, I thought I would bring the subject to the communities attention, so to speak. Google has very recently, open sourced "Protocol Buffers". What is it you ask ? In a couple of lines it is a language-neutral, platform-neutral, extensible way of serializing structured data for use in communications protocols, data storage, and more. Think XML, but smaller, faster, and simpler. Why not just use XML ? Protocol buffers have many advantages over XML for serializing structured data. Protocol buffers: * are simpler * are 3 to 10 times smaller * are 20 to 100 times faster * are less ambiguous * generate data access classes that are easier to use programmatically. PB supports Java, Python,and C++ currently. A more detailed overview can be found here: http://code.google.com/apis/protocolbuffers/docs/overview.html and the FAQ here: http://code.google.com/apis/protocolbuffers/docs/faq.html See the answer to the question "Can I add support for a new language to protocol buffers?" inside the FAQ. Some Tips and comments can be found here: http://zunger.livejournal.com/164024.html My questions. Does the D community see this of interest ? Is this something they might use ? Do they see value in it being added to the respective Tango or Phobos frameworks ? any other comments ? cheers Nick B.Hate to say it but this is yet another case of reinventing the wheel. The worst thing about this throwback to the early 90s is its inherent violation of DRY. This package intermingles and confuses three separate issues, treating them as a monolithic whole, namely: serialization, marshalling, and versioning. Binary solutions such as this, while more efficient byte-wise, run into portability problems especially with floating point values. They also lose the human readability of the data (sans the use of special tools). Often the use of a binary solution is a case of premature optimization and indicates bad design at a higher level. The marshalling strategy should be 'pluggable' so that one can use an easier to debug, human readable, data format during development. Versioning problems can be (and have been) addressed in a number of ways over the years, the simplest and imo often the best, is to sidestep the problem by serializing a map of the data rather than just the raw data. That is, each value has associated metadata indicating its field name (a key-value mapping iow). If XML is too heavy a hammer, JSON (with or without embedded metadata) is a good alternative with quite a bit of industry support. In fact, the use of JSON encoding with embedded metadata gives you a lightweight solution that works with nearly every language in present use. Even better, any language with good reflection support (such as Java) allows an implementation that does not violate DRY. This "Protocol Buffers" solution is a dinosaur, a throwback, yet another wheel when we need ground effects or anti-gravity. My 2c, Brian
Sep 01 2008
Brian Price, el 1 de septiembre a las 14:31 me escribiste:Hate to say it but this is yet another case of reinventing the wheel. The worst thing about this throwback to the early 90s is its inherent violation of DRY. This package intermingles and confuses three separate issues, treating them as a monolithic whole, namely: serialization, marshalling, and versioning.Different problems has differents solutions. What you say is really nice and have sense when *latency* and other performance issues are not a problem to you (it's obviously a problem for Google and for a lot of other people, like me, I was doing a little "framework", similar to Google's PB at work, just before they release it =). When you deal with (almost) realtime systems, you can't pay the price for human readability, extra parsing time, extra abstraction and extra bytes thrown throgh the wire. You just can't. You don't even care about portability (this kind of things usually run in a very controlled environment). I just wanted to clarify this. -- Leandro Lucarella (luca) | Blog colectivo: http://www.mazziblog.com.ar/blog/ ---------------------------------------------------------------------------- GPG Key: 5F5A8D05 (F8CD F9A7 BF00 5431 4145 104C 949E BFB6 5F5A 8D05) ---------------------------------------------------------------------------- PADRES DENUNCIAN QUE SU HIJA SE ESCAPO CON UN PARAGUAYITO -- Crónica TV
Sep 01 2008
== Quote from Leandro Lucarella (llucax gmail.com)'s articleBrian Price, el 1 de septiembre a las 14:31 me escribiste:In a (near) realtime embedded system with a high wire cost (cellphone, smart meters, etc) I'd have to agree with you. However, my initial impression was that this was being proposed/offered as a general purpose object streaming system. Multicore/hyperthreaded processors with multiple levels of caching have a huge latency (comparitively) between core and wire. Even in a high demand system there's generally plenty of cpu cycles left over to do quite a bit of extra parsing and data massaging (compression). A compressed text message (XML or JSON format) is very competitive in size with binary format (depending of course on the actual makeup of the data payload). I expect that in a few years the embedded systems will be sporting processors with similar capabilities to today's desktops and servers. At that time, it may be worthwhile re-evaluating the decision to go with a 'brittle' binary format in new designs. On a personal note, some of my friends are laughing at me over this - ten or twelve years ago I was vehemently against the idea of using human readable data formats for transmission - for the same reasons you give above. Of course my cellphone today probably has nearly the power of my desktop back then. Sorry for any misunderstanding, BrianHate to say it but this is yet another case of reinventing the wheel. The worst thing about this throwback to the early 90s is its inherent violation of DRY. This package intermingles and confuses three separate issues, treating them as a monolithic whole, namely: serialization, marshalling, and versioning.Different problems has differents solutions. What you say is really nice and have sense when *latency* and other performance issues are not a problem to you (it's obviously a problem for Google and for a lot of other people, like me, I was doing a little "framework", similar to Google's PB at work, just before they release it =). When you deal with (almost) realtime systems, you can't pay the price for human readability, extra parsing time, extra abstraction and extra bytes thrown throgh the wire. You just can't. You don't even care about portability (this kind of things usually run in a very controlled environment). I just wanted to clarify this.
Sep 01 2008
Brian Price wrote:== Quote from Nick B (nick.barbalich gmail.com)'s articleCan you explain where you see violation of DRY exactly? Also, can you explain the big difference between serialization and marshalling? I find it hard to draw any meaningful line between the two..Hi I came across this the other day, and no one has mentioned this, on this news group before, I thought I would bring the subject to the communities attention, so to speak. Google has very recently, open sourced "Protocol Buffers". What is it you ask ? In a couple of lines it is a language-neutral, platform-neutral, extensible way of serializing structured data for use in communications protocols, data storage, and more. Think XML, but smaller, faster, and simpler. Why not just use XML ? Protocol buffers have many advantages over XML for serializing structured data. Protocol buffers: * are simpler * are 3 to 10 times smaller * are 20 to 100 times faster * are less ambiguous * generate data access classes that are easier to use programmatically. PB supports Java, Python,and C++ currently. A more detailed overview can be found here: http://code.google.com/apis/protocolbuffers/docs/overview.html and the FAQ here: http://code.google.com/apis/protocolbuffers/docs/faq.html See the answer to the question "Can I add support for a new language to protocol buffers?" inside the FAQ. Some Tips and comments can be found here: http://zunger.livejournal.com/164024.html My questions. Does the D community see this of interest ? Is this something they might use ? Do they see value in it being added to the respective Tango or Phobos frameworks ? any other comments ? cheers Nick B.Hate to say it but this is yet another case of reinventing the wheel. The worst thing about this throwback to the early 90s is its inherent violation of DRY. This package intermingles and confuses three separate issues, treating them as a monolithic whole, namely: serialization, marshalling, and versioning.Binary solutions such as this, while more efficient byte-wise, run into portability problems especially with floating point values.So, what sort of problems do you see with this specific solution? To me, it looks like a text based solution would have more problems with floating point values, because they typically get converted to/from decimal..They also lose the human readability of the data (sans the use of special tools). Often the use of a binary solution is a case of premature optimization and indicates bad design at a higher level. The marshalling strategy should be 'pluggable' so that one can use an easier to debug, human readable, data format during development.Well, in my experience, most data tends to live somewhere not readily accessible and/or human readable, so you need a tool one way or the other. Making generic tools for PBs shouldn't be too hard, as far as I can tell they do have support for generic message handling, 'reflection', there's a DebugString() method on each of them, etc..Versioning problems can be (and have been) addressed in a number of ways over the years, the simplest and imo often the best, is to sidestep the problem by serializing a map of the data rather than just the raw data. That is, each value has associated metadata indicating its field name (a key-value mapping iow).And the PB encoding differs from this how exactly? Each value has a key preceding it, which identifies which field it is..If XML is too heavy a hammer, JSON (with or without embedded metadata) is a good alternative with quite a bit of industry support. In fact, the use of JSON encoding with embedded metadata gives you a lightweight solution that works with nearly every language in present use. Even better, any language with good reflection support (such as Java) allows an implementation that does not violate DRY.I haven't tried it, but it looks like generic mapping to/from JSON would be very easily done on top of PBs, for when you need it (like when publishing a JavaScript API). But why would you want to use JSON in any other case? Compared to PBs, it's harder/slower to parse, definitely takes more space/bandwidth, has fewer types, and you don't get the nice free API protocol buffers provide you for each message type.This "Protocol Buffers" solution is a dinosaur, a throwback, yet another wheel when we need ground effects or anti-gravity.Well, you haven't convinced me yet :) Mitja
Sep 02 2008