digitalmars.D - std.variant benchmark
- Gor Gyolchanyan (33/33) Jul 29 2012 std.variant is so incredibly slow! It's practically unusable for anythin...
- Andrei Alexandrescu (8/10) Jul 29 2012 You do realize you actually benchmark against a function that does
- Gor Gyolchanyan (10/21) Jul 29 2012 I do compare it with nothing, just to see how many times does it exceed ...
- Tobias Pankrath (2/10) Jul 29 2012 Isn't that a paradox? What kind of type safety do you expect from
- Gor Gyolchanyan (6/15) Jul 29 2012 The kind, which doesn't forget about postblits and destructors. This isn...
- Dmitry Olshansky (24/34) Jul 29 2012 This should be more relevant then:
- Gor Gyolchanyan (5/46) Jul 29 2012 Thank you for demonstrating my point. :-)
- Andrei Alexandrescu (20/47) Jul 29 2012 I don't think he demonstrated your point (even leaving aside that the
- Gor Gyolchanyan (10/69) Jul 30 2012 You're right. I'm trying to make a subject-oriented system, where there ...
- jerro (32/73) Jul 29 2012 I thought these results are a bit strange, so I converted the
- Andrei Alexandrescu (3/6) Jul 29 2012 Very nice analysis!
- Andrei Alexandrescu (12/14) Jul 29 2012 I guess you just volunteered! When I looked at it this morning I noticed...
- Dmitry Olshansky (73/86) Jul 29 2012 Well obviously integers do not take lightly when they are copied around
- Dmitry Olshansky (15/31) Jul 29 2012 ...
- Marco Leise (5/19) Jul 29 2012 Very nice finding. Now it is only 100x slower. :D It seems well worth ch...
- jerro (9/44) Aug 02 2012 I profiled it and found out much of the time is spent inside
- Gor Gyolchanyan (6/54) Aug 03 2012 Wow! Now that's an impressive improvement. If TypeInfo instances are uni...
- jerro (4/22) Aug 03 2012 I've tried Dmitry's fork now (he optimized opArithmetic to not do
- Andrei Alexandrescu (39/82) Jul 29 2012 Yah, when I replaced memcpy with simple assignment for Variant I...
- Dmitry Olshansky (21/69) Jul 30 2012 I'd say array ops could and should do this (since compiler has all the
- Andrei Alexandrescu (10/27) Jul 30 2012 memcpy is implemented as an intrinsic on many platforms. I'm not sure
- Don Clugston (8/28) Jul 30 2012 It is an intrinsic on DMD, but it isn't done optimally. Mostly it just
- Peter Alexander (16/57) Jul 29 2012 You underestimate DMD's optimisations :-)
- Dmitry Olshansky (11/70) Jul 29 2012 Yeah, guilty as charged. Was in a hurry, yet I did test it without
- Jonathan M Davis (8/9) Jul 29 2012 I believe that Robert Jaques fixed up std.variant as part of his work on...
- Mirko Pilger (2/2) Jul 29 2012 https://jshare.johnshopkins.edu/rjacque2/public_html/variant.mht
std.variant is so incredibly slow! It's practically unusable for anything, which requires even a tiny bit of performance. Benchmark done under -noboundscheck -inline -release -O:

    import std.stdio;
    import std.variant;
    import std.datetime;

    void on()
    {
        auto var = Variant(5);
        int i = var.get!int;
    }

    void off()
    {
        auto var = 5;
        int i = var;
    }

    void main()
    {
        writeln(benchmark!(on, off)(100000));
    }

The result is:

    [TickDuration(25094), TickDuration(98)]

There are tons of cases, where a simple typeless data storage is necessary. No type information, no type checking - just a low-level storage, upon which Variant and other dynamic-type constructs can be built. I want to ask the community: what's the best way to store any variable in a typeless storage, so that one could store any variable in that storage and get a reference to that variable given its static type with no type checking and minimal overhead compared to a statically typed storage and with type-safe storage (postblits, garbage collection...)? -- Bye, Gor Gyolchanyan.
Jul 29 2012
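The question at the end of the post (typeless storage with a checked, reference-returning accessor) can be condensed into a few lines. This is an illustrative toy, not std.variant: all names are invented, and it ignores alignment, postblits, destructors and GC scanning, which a real implementation must handle through a per-type handler function.

```d
// A minimal "typeless box": raw bytes plus a TypeInfo tag, with a
// checked accessor that returns a typed reference (no copy on access).
// Sketch only: alignment, postblits, destructors and GC scanning are
// deliberately ignored here.
struct TypelessBox(size_t maxSize = 32)
{
    private ubyte[maxSize] store;
    private TypeInfo tag;

    void put(T)(T value) if (T.sizeof <= maxSize)
    {
        *cast(T*) store.ptr = value;
        tag = typeid(T);
    }

    // The type check is a single pointer-identity comparison.
    ref T get(T)()
    {
        assert(tag is typeid(T), "stored type mismatch");
        return *cast(T*) store.ptr;
    }
}

void main()
{
    TypelessBox!() box;
    box.put(5);
    assert(box.get!int == 5);
    box.get!int += 1;          // reference access, no copy
    assert(box.get!int == 6);
}
```

The per-access cost is one identity compare plus a pointer cast, which is the floor the thread is measuring Variant against.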
On 7/29/12 8:17 AM, Gor Gyolchanyan wrote:std.variant is so incredibly slow! It's practically unusable for anything, which requires even a tiny bit of performance.You do realize you actually benchmark against a function that does nothing, right? Clearly there are ways in which we can improve std.variant to the point initialization costs assignment of two words, but this benchmark doesn't help. (Incidentally I just prepared a class at C++ and Beyond on benchmarking, and this benchmark makes a lot of the mistakes described therein...) Andrei
Jul 29 2012
On Sun, Jul 29, 2012 at 6:17 PM, Andrei Alexandrescu < SeeWebsiteForEmail erdani.org> wrote:On 7/29/12 8:17 AM, Gor Gyolchanyan wrote:I do compare it with nothing, just to see how many times does it exceed the performance of static typed storage. The point is that Variant is extremely slow. All I want is to find out how to implement a very fast typeless storage with maximum performance and type safety. -- Bye, Gor Gyolchanyan.std.variant is so incredibly slow! It's practically unusable for anything, which requires even a tiny bit of performance.You do realize you actually benchmark against a function that does nothing, right? Clearly there are ways in which we can improve std.variant to the point initialization costs assignment of two words, but this benchmark doesn't help. (Incidentally I just prepared a class at C++ and Beyond on benchmarking, and this benchmark makes a lot of the mistakes described therein...) Andrei
Jul 29 2012
I do compare it with nothing, just to see how many times does it exceed the performance of static typed storage. The point is that Variant is extremely slow. All I want is to find out how to implement a very fast typeless storage with maximum performance and type safety.Isn't that a paradox? What kind of type safety do you expect from a typeless storage?
Jul 29 2012
On Sun, Jul 29, 2012 at 6:41 PM, Tobias Pankrath <tobias pankrath.net>wrote:I do compare it with nothing, just to see how many times does it exceed theThe kind, which doesn't forget about postblits and destructors. This isn't a paradox; -- Bye, Gor Gyolchanyan.performance of static typed storage. The point is that Variant is extremely slow. All I want is to find out how to implement a very fast typeless storage with maximum performance and type safety.Isn't that a paradox? What kind of type safety do you expect from a typeless storage?
Jul 29 2012
On 29-Jul-12 18:17, Andrei Alexandrescu wrote:On 7/29/12 8:17 AM, Gor Gyolchanyan wrote:This should be more relevant then: //fib.d import std.datetime, std.stdio, std.variant; auto fib(Int)() { Int a = 1, b = 1; for(size_t i=0; i<100; i++){ Int c = a + b; a = b; b = c; } return a; } void main() { writeln(benchmark!(fib!int, fib!long, fib!Variant)(10_000)); } dmd -O -inline -release fib.d Output: [TickDuration(197), TickDuration(276), TickDuration(93370107)] I'm horrified. Who was working on std.variant enhancements? Please chime in. -- Dmitry Olshanskystd.variant is so incredibly slow! It's practically unusable for anything, which requires even a tiny bit of performance.You do realize you actually benchmark against a function that does nothing, right? Clearly there are ways in which we can improve std.variant to the point initialization costs assignment of two words, but this benchmark doesn't help. (Incidentally I just prepared a class at C++ and Beyond on benchmarking, and this benchmark makes a lot of the mistakes described therein...) Andrei
Jul 29 2012
On Sun, Jul 29, 2012 at 6:43 PM, Dmitry Olshansky <dmitry.olsh gmail.com>wrote:On 29-Jul-12 18:17, Andrei Alexandrescu wrote:Thank you for demonstrating my point. :-) -- Bye, Gor Gyolchanyan.On 7/29/12 8:17 AM, Gor Gyolchanyan wrote:This should be more relevant then: //fib.d import std.datetime, std.stdio, std.variant; auto fib(Int)() { Int a = 1, b = 1; for(size_t i=0; i<100; i++){ Int c = a + b; a = b; b = c; } return a; } void main() { writeln(benchmark!(fib!int, fib!long, fib!Variant)(10_000)); } dmd -O -inline -release fib.d Output: [TickDuration(197), TickDuration(276), TickDuration(93370107)] I'm horrified. Who was working on std.variant enhancements? Please chime in. -- Dmitry Olshanskystd.variant is so incredibly slow! It's practically unusable for anything, which requires even a tiny bit of performance.You do realize you actually benchmark against a function that does nothing, right? Clearly there are ways in which we can improve std.variant to the point initialization costs assignment of two words, but this benchmark doesn't help. (Incidentally I just prepared a class at C++ and Beyond on benchmarking, and this benchmark makes a lot of the mistakes described therein...) Andrei
Jul 29 2012
On 7/29/12 11:20 AM, Gor Gyolchanyan wrote:On Sun, Jul 29, 2012 at 6:43 PM, Dmitry Olshansky <dmitry.olsh gmail.com <mailto:dmitry.olsh gmail.com>> wrote: This should be more relevant then: //fib.d import std.datetime, std.stdio, std.variant; auto fib(Int)() { Int a = 1, b = 1; for(size_t i=0; i<100; i++){ Int c = a + b; a = b; b = c; } return a; } void main() { writeln(benchmark!(fib!int, fib!long, fib!Variant)(10_000)); } dmd -O -inline -release fib.d Output: [TickDuration(197), TickDuration(276), TickDuration(93370107)] I'm horrified. Who was working on std.variant enhancements? Please chime in. -- Dmitry Olshansky Thank you for demonstrating my point. :-)I don't think he demonstrated your point (even leaving aside that the benchmark above is also flawed). I don't think there was a point, unless it was you wanted to vent and looked for a pretext - which is, by the way, entirely reasonable. You mentioned you need "a very fast typeless storage with maximum performance and type safety." Well if it's typeless then it means you're using it for storage, not for computationally-intensive operations, as the code above does. So what you'd presumably want to test is the speed of data storage and retrieval. A loop that does a bunch of adds on Variant does not go in that direction. Benchmarks are notoriously hard to get right. You need to think of reasonable baselines - if you want to use Variant for typeless storage, what is your vanilla implementation, the "obvious contender"? What are the primitives of which speed is important? Then you need to make sure you subtract noise from all your comparisons. Then you need to figure whether the issue is with the design or with the implementation of Variant. In the former case, maybe Variant isn't for you. In the latter, bug reports are always welcome. Andrei
Jul 29 2012
On Mon, Jul 30, 2012 at 6:07 AM, Andrei Alexandrescu < SeeWebsiteForEmail erdani.org> wrote:On 7/29/12 11:20 AM, Gor Gyolchanyan wrote:You're right. I'm trying to make a subject-oriented system, where there are nodes and accessors and each accessor has access to a specific part of the node. I must have the data stored in typeless manner and cast to the appropriate type (depending on the accessor). This is all necessary for my very flexible graph data structure. -- Bye, Gor Gyolchanyan.On Sun, Jul 29, 2012 at 6:43 PM, Dmitry Olshansky <dmitry.olsh gmail.com <mailto:dmitry.olsh gmail.com>**> wrote: This should be more relevant then: //fib.d import std.datetime, std.stdio, std.variant; auto fib(Int)() { Int a = 1, b = 1; for(size_t i=0; i<100; i++){ Int c = a + b; a = b; b = c; } return a; } void main() { writeln(benchmark!(fib!int, fib!long, fib!Variant)(10_000)); } dmd -O -inline -release fib.d Output: [TickDuration(197), TickDuration(276), TickDuration(93370107)] I'm horrified. Who was working on std.variant enhancements? Please chime in. -- Dmitry Olshansky Thank you for demonstrating my point. :-)I don't think he demonstrated your point (even leaving aside that the benchmark above is also flawed). I don't think there was a point, unless it was you wanted to vent and looked for a pretext - which is, by the way, entirely reasonable. You mentioned you need "a very fast typeless storage with maximum performance and type safety." Well if it's typeless then it means you're using it for storage, not for computationally-intensive operations, as the code above does. So what you'd presumably want to test is the speed of data storage and retrieval. A loop that does a bunch of adds on Variant does not go in that direction. Benchmarks are notoriously hard to get right. You need to think of reasonable baselines - if you want to use Variant for typeless storage, what is your vanilla implementation, the "obvious contender"? What are the primitives of which speed is important? 
Then you need to make sure you subtract noise from all your comparisons. Then you need to figure whether the issue is with the design or with the implementation of Variant. In the former case, maybe Variant isn't for you. In the latter, bug reports are always welcome. Andrei
Jul 30 2012
On Sunday, 29 July 2012 at 14:43:09 UTC, Dmitry Olshansky wrote:On 29-Jul-12 18:17, Andrei Alexandrescu wrote:I thought these results are a bit strange, so I converted the result to seconds. This gave me: [3.73e-06, 3.721e-06, 2.97281] One million inner loop iterations in under 4 microseconds? My processor's frequency isn't measured in THz, so something strange must be going on here. In order to find out what it was, I changed the code to this:

    writeln(benchmark!(fib!int, fib!long)(1000_000_000)[]
        .map!"a.nsecs() * 1.0e-9");

and used a profiler on it. The relevant part of the output is:

    0.00 : 445969: test %r12d,%r12d
    0.00 : 44596c: je 445975 <_D3std8date
   46.67 : 44596e: inc %ebx
    0.00 : 445970: cmp %r12d,%ebx
    0.00 : 445973: jb 44596e <_D3std8date
    0.00 : 445975: lea -0x18(%rbp),%rdi
    0.00 : 445979: callq 45a048 <_D3std8date
    0.00 : 44597e: mov %rax,0x0(%r13)
    0.00 : 445982: lea -0x18(%rbp),%rdi
    0.00 : 445986: callq 459fb4 <_D3std8date
    0.00 : 44598b: xor %ebx,%ebx
    0.00 : 44598d: test %r12d,%r12d
    0.00 : 445990: je 445999 <_D3std8date
   53.33 : 445992: inc %ebx
    0.00 : 445994: cmp %r12d,%ebx
    0.00 : 445997: jb 445992 <_D3std8date

As you can see, most of the time is spent in two loops with empty body, so your code is benchmarking Variant against nothing, too. Adding asm{ nop; } to fib changes the output to this: [0.00437154, 0.00444938, 3.03917] Which is still a huge difference.On 7/29/12 8:17 AM, Gor Gyolchanyan wrote:This should be more relevant then: //fib.d import std.datetime, std.stdio, std.variant; auto fib(Int)() { Int a = 1, b = 1; for(size_t i=0; i<100; i++){ Int c = a + b; a = b; b = c; } return a; } void main() { writeln(benchmark!(fib!int, fib!long, fib!Variant)(10_000)); } dmd -O -inline -release fib.d Output: [TickDuration(197), TickDuration(276), TickDuration(93370107)] I'm horrified. Who was working on std.variant enhancements? Please chime in.std.variant is so incredibly slow! 
It's practically unusable for anything, which requires even a tiny bit of performance.You do realize you actually benchmark against a function that does nothing, right? Clearly there are ways in which we can improve std.variant to the point initialization costs assignment of two words, but this benchmark doesn't help. (Incidentally I just prepared a class at C++ and Beyond on benchmarking, and this benchmark makes a lot of the mistakes described therein...) Andrei
Jul 29 2012
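The empty-loop effect jerro diagnosed can also be avoided without asm { nop; }: store each result somewhere the optimizer must treat as observed. A sketch of that variant of the benchmark (wrapper names are mine):

```d
import std.datetime : benchmark;
import std.stdio : writeln;

// Global sink: the compiler must assume it is observed elsewhere,
// so the additions feeding it cannot be deleted.
long sink;

auto fib(Int)()
{
    Int a = 1, b = 1;
    foreach (i; 0 .. 100)
    {
        // overflows and wraps, as in the original -- fine for timing
        Int c = a + b;
        a = b;
        b = c;
    }
    return a;
}

// Wrappers that make each result live, so the optimizer cannot
// reduce the benchmarked function to an empty loop.
void fibInt()  { sink += fib!int(); }
void fibLong() { sink += fib!long(); }

void main()
{
    writeln(benchmark!(fibInt, fibLong)(10_000));
    writeln(sink);   // printing sink keeps the stores from being elided too
}
```

With the work kept alive, the int and long timings become nonzero and the comparison against fib!Variant becomes meaningful.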
On 7/29/12 2:57 PM, jerro wrote:I thought this results are a bit strange, so I converted the result to seconds. This gave me: [3.73e-06, 3.721e-06, 2.97281]Very nice analysis! Andrei
Jul 29 2012
On 7/29/12 10:43 AM, Dmitry Olshansky wrote:I'm horrified. Who was working on std.variant enhancements? Please chime in.I guess you just volunteered! When I looked at it this morning I noticed a few signs of bit rot, e.g. opAssign returns by value and such. (Only planting a "ref" there improves performance a good amount.) Variant has a simple design with (in case of int) an int and a pointer to a function. Many of its operations incur an indirect call through that pointer. This makes operations slower than the time-honored design of using an integral tag and switching on it, but offers in return the ability to hold any type without needing to enumerate all types explicitly. We can use the pointer to function as a tag for improving performance on primitive types. Andrei
Jul 29 2012
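Andrei's two points here (a function-pointer discriminator, and reusing that pointer as a tag for fast paths on primitive types) can be condensed into a toy. All names below are invented for illustration; this is not the real std.variant layout.

```d
// Sketch of "pointer to a template function as a discriminator":
// handler!T is a distinct instantiation per type, so comparing the
// stored pointer against handler!int is a cheap type test, and the
// indirect call remains only as the generic fallback.
alias Handler = long function(void* store);

long handler(T)(void* store)
{
    // a real handler would switch on an opcode (copy, destroy, compare...)
    return cast(long) *cast(T*) store;
}

struct MiniVariant
{
    union { int i; long l; ubyte[8] raw; }
    Handler fptr;

    void opAssign(T)(T v) if (T.sizeof <= 8)
    {
        *cast(T*) raw.ptr = v;
        fptr = &handler!T;     // the instantiation doubles as a type tag
    }

    long asLong()
    {
        if (fptr is &handler!int)  return i;   // fast path: tag compare
        if (fptr is &handler!long) return l;
        return fptr(raw.ptr);                  // generic indirect call
    }
}

void main()
{
    MiniVariant v;
    v = 5;
    assert(v.asLong == 5);
    v = 123_456_789_012L;
    assert(v.asLong == 123_456_789_012L);
}
```

No type ever has to be enumerated or registered: any T that fits gets its tag automatically from the template instantiation, which is the design goal Andrei describes.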
On 29-Jul-12 22:59, Andrei Alexandrescu wrote:On 7/29/12 10:43 AM, Dmitry Olshansky wrote:Strange, but numbers stay the same for me.I'm horrified. Who was working on std.variant enhancements? Please chime in.I guess you just volunteered! When I looked at it this morning I noticed a few signs of bit rot, e.g. opAssign returns by value and such. (Only planting a "ref" there improves performance a good amount.)Variant has a simple design with (in case of int) an int and a pointer to a function. Many of its operations incur an indirect call through that pointer. This makes operations slower than the time-honored design of using an integral tag and switching on it, but offers in return the ability to hold any type without needing to enumerate all types explicitly.Well obviously integers do not take lightly when they are copied around with memcpy like some pompous arrays that alone incurs *some* penalty. In fact memcpy could and should be replaced with word by word copy for almost all of struct sizes up to ~32 bytes (as size is known in advance for this particular function pointer i.e. handler!int). I've found something far more evil: property bool convertsTo(T)() const { TypeInfo info = typeid(T); return fptr(OpID.testConversion, null, &info) == 0; } Okay... 
now let me pull off another piece of rag: private VariantN opArithmetic(T, string op)(T other) { VariantN result; static if (is(T == VariantN)) { if (convertsTo!(uint) && other.convertsTo!(uint)) result = mixin("get!(uint) " ~ op ~ " other.get!(uint)"); else if (convertsTo!(int) && other.convertsTo!(int)) result = mixin("get!(int) " ~ op ~ " other.get!(int)"); else if (convertsTo!(ulong) && other.convertsTo!(ulong)) result = mixin("get!(ulong) " ~ op ~ " other.get!(ulong)"); else if (convertsTo!(long) && other.convertsTo!(long)) result = mixin("get!(long) " ~ op ~ " other.get!(long)"); else if (convertsTo!(double) && other.convertsTo!(double)) result = mixin("get!(double) " ~ op ~ " other.get!(double)"); else result = mixin("get!(real) " ~ op ~ " other.get!(real)"); } else { if (is(typeof(T.max) : uint) && T.min == 0 && convertsTo!(uint)) result = mixin("get!(uint) " ~ op ~ " other"); else if (is(typeof(T.max) : int) && T.min < 0 && convertsTo!(int)) result = mixin("get!(int) " ~ op ~ " other"); else if (is(typeof(T.max) : ulong) && T.min == 0 && convertsTo!(ulong)) result = mixin("get!(ulong) " ~ op ~ " other"); else if (is(typeof(T.max) : long) && T.min < 0 && convertsTo!(long)) result = mixin("get!(long) " ~ op ~ " other"); else if (is(T : double) && convertsTo!(double)) result = mixin("get!(double) " ~ op ~ " other"); else result = mixin("get!(real) " ~ op ~ " other"); } return result; } So to boot in order to get to int + int _detected_ it needs 4 calls to fptr(OpID.testConversion, null, &info) == 0 and acquire TypeInfo each time. Not counting 2 gets to obtain result. It's just madness. Plain hardcoded Type to integer switch seems so much better now.We can use the pointer to function as a tag for improving performance on primitive types.Indeed we can though it won't be as fast as simple tag - pointers have quite big and sparse addresses giving us the same as: if(ptr == handler!int) ... else if(ptr == handler!uint) ... else ... 
Another way to go is make 2-d function table for all built-ins and all operations on them. It's in fact a bunch of tables: Type1 x Type2 for all of important basic operations. Tables are static so it won't be much of a problem. At least int + int then would be 1 memory read, 1 call, 2 (casted as needed) reads & 1 store. -- Dmitry Olshansky
Jul 29 2012
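Dmitry's 2-d function table, reduced to a two-type toy: one function per (left type, right type) pair, so int + int costs one table load and one call instead of the chain of convertsTo probes. Types and names are illustrative only.

```d
// Tag-indexed 2-d dispatch table for one binary operation.
enum Tag : size_t { i32, f64 }

struct Val
{
    Tag tag;
    union { int i; double d; }

    static Val of(int v)    { Val r; r.tag = Tag.i32; r.i = v; return r; }
    static Val of(double v) { Val r; r.tag = Tag.f64; r.d = v; return r; }
}

alias AddFn = Val function(Val, Val);

Val addII(Val a, Val b) { return Val.of(a.i + b.i); }
Val addID(Val a, Val b) { return Val.of(a.i + b.d); }
Val addDI(Val a, Val b) { return Val.of(a.d + b.i); }
Val addDD(Val a, Val b) { return Val.of(a.d + b.d); }

// Static table, built once, indexed by [left.tag][right.tag].
immutable AddFn[2][2] addTable =
    [[&addII, &addID],
     [&addDI, &addDD]];

Val add(Val a, Val b) { return addTable[a.tag][b.tag](a, b); }

void main()
{
    assert(add(Val.of(2), Val.of(3)).i == 5);
    assert(add(Val.of(2), Val.of(0.5)).d == 2.5);
}
```

For the full set of built-in types most entries would raise an error and could share one function, which is the merging Dmitry suggests later in the thread.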
On 30-Jul-12 00:11, Dmitry Olshansky wrote:I've found something far more evil: property bool convertsTo(T)() const { TypeInfo info = typeid(T); return fptr(OpID.testConversion, null, &info) == 0; } Okay... now let me pull off another piece of rag: private VariantN opArithmetic(T, string op)(T other) { VariantN result; static if (is(T == VariantN)) { /*if (convertsTo!(uint) && other.convertsTo!(uint)) result = mixin("get!(uint) " ~ op ~ " other.get!(uint)"); else*/ if (convertsTo!(int) && other.convertsTo!(int)) result = mixin("get!(int) " ~ op ~ " other.get!(int)");... Apparently I'm spot on. Just commenting one extra branch of this horror movie gives interesting change: 2779us 2667us 3153762us After: 2319us 2523us 288581us Aye, 10x :o) -- Dmitry Olshansky
Jul 29 2012
Am Mon, 30 Jul 2012 00:24:40 +0400 schrieb Dmitry Olshansky <dmitry.olsh gmail.com>:Just commenting one extra branch of this horror movie gives interesting change: 2779us 2667us 3153762us After: 2319us 2523us 288581us Aye, 10x :o)Very nice finding. Now it is only 100x slower. :D It seems well worth changing the style Variant is written in. -- Marco
Jul 29 2012
On Sunday, 29 July 2012 at 20:24:42 UTC, Dmitry Olshansky wrote:On 30-Jul-12 00:11, Dmitry Olshansky wrote:I profiled it and found out much of the time is spent inside TypeInfo.opEquals being called from tryPutting. So I tried replacing "!= typeid" in tryPutting with "!is typeid". That brought the time from 2.8 s down to 0.12 on my machine. I don't know if that is the proper solution, since I don't know if typeid can ever return two TypeInfo objects that aren't the same but are equal (I haven't used typeid and TypeInfo much before). The fib function here does return correct result after doing that, though.I've found something far more evil: property bool convertsTo(T)() const { TypeInfo info = typeid(T); return fptr(OpID.testConversion, null, &info) == 0; } Okay... now let me pull off another piece of rag: private VariantN opArithmetic(T, string op)(T other) { VariantN result; static if (is(T == VariantN)) { /*if (convertsTo!(uint) && other.convertsTo!(uint)) result = mixin("get!(uint) " ~ op ~ " other.get!(uint)"); else*/ if (convertsTo!(int) && other.convertsTo!(int)) result = mixin("get!(int) " ~ op ~ " other.get!(int)");... Apparently I'm spot on. Just commenting one extra branch of this horror movie gives interesting change: 2779us 2667us 3153762us After: 2319us 2523us 288581us Aye, 10x :o)
Aug 02 2012
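What jerro's change relies on: typeid yields one TypeInfo object per type within a binary, so identity (is) is a single pointer compare, while == goes through TypeInfo.opEquals. A cautious middle ground, sketched below, checks identity first and falls back to equality, which stays correct even if duplicate TypeInfo objects ever arrive from a shared library.

```d
// Fast TypeInfo comparison: identity first, opEquals as fallback.
bool sameType(TypeInfo a, TypeInfo b)
{
    if (a is b) return true;   // fast path: one pointer compare
    return a == b;             // slow path: full opEquals
}

void main()
{
    assert(sameType(typeid(int), typeid(int)));
    assert(!sameType(typeid(int), typeid(long)));
    int x;
    assert(typeid(x) is typeid(int));   // identity holds within one binary
}
```

On the common path this costs the same as jerro's !is version while keeping the behavior of == for the exotic cases.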
On Fri, Aug 3, 2012 at 8:28 AM, jerro <a a.com> wrote:On Sunday, 29 July 2012 at 20:24:42 UTC, Dmitry Olshansky wrote:Wow! Now that's an impressive improvement. If TypeInfo instances are unique for every type, then we should be good to go! -- Bye, Gor Gyolchanyan.On 30-Jul-12 00:11, Dmitry Olshansky wrote:I profiled it and found out much of the time is spent inside TypeInfo.opEquals being called from tryPutting. So I tried replacing "!= typeid" in tryPutting with "!is typeid". That brought the time from 2.8 s down to 0.12 on my machine. I don't know if that is the proper solution, since I don't know if typeid can ever return two TypeInfo objects that aren't the same but are equal (I haven't used typeid and TypeInfo much before). The fib function here does return correct result after doing that, though.I've found something far more evil: property bool convertsTo(T)() const { TypeInfo info = typeid(T); return fptr(OpID.testConversion, null, &info) == 0; } Okay... now let me pull off another piece of rag: private VariantN opArithmetic(T, string op)(T other) { VariantN result; static if (is(T == VariantN)) { /*if (convertsTo!(uint) && other.convertsTo!(uint)) result = mixin("get!(uint) " ~ op ~ " other.get!(uint)"); else*/ if (convertsTo!(int) && other.convertsTo!(int)) result = mixin("get!(int) " ~ op ~ " other.get!(int)");... Apparently I'm spot on. Just commenting one extra branch of this horror movie gives interesting change: 2779us 2667us 3153762us After: 2319us 2523us 288581us Aye, 10x :o)
Aug 03 2012
I've tried Dmitry's fork now (he optimized opArithmetic to not do as many calls to fptr) and it's even a little faster (1.01 s for 100000 iterations). Replacing != with !is in Dmitry's fork speeds it up a little bit more (0.935 s for 100000 iterations).I profiled it and found out much of the time is spent inside TypeInfo.opEquals being called from tryPutting. So I tried replacing "!= typeid" in tryPutting with "!is typeid". That brought the time from 2.8 s down to 0.12 on my machine. I don't know if that is the proper solution, since I don't know if typeid can ever return two TypeInfo objects that aren't the same but are equal (I haven't used typeid and TypeInfo much before). The fib function here does return correct result after doing that, though.Wow! Now that's an impressive improvement. If TypeInfo instances are unique for every type, then we should be good to go!
Aug 03 2012
On 7/29/12 4:11 PM, Dmitry Olshansky wrote:On 29-Jul-12 22:59, Andrei Alexandrescu wrote:Depends on what you measure.On 7/29/12 10:43 AM, Dmitry Olshansky wrote:Strange, but numbers stay the same for me.I'm horrified. Who was working on std.variant enhancements? Please chime in.I guess you just volunteered! When I looked at it this morning I noticed a few signs of bit rot, e.g. opAssign returns by value and such. (Only planting a "ref" there improves performance a good amount.)Yah, when I replaced memcpy with simple assignment for Variant<int> I saw some improvements.Variant has a simple design with (in case of int) an int and a pointer to a function. Many of its operations incur an indirect call through that pointer. This makes operations slower than the time-honored design of using an integral tag and switching on it, but offers in return the ability to hold any type without needing to enumerate all types explicitly.Well obviously integers do not take lightly when they are copied around with memcpy like some pompous arrays that alone incurs *some* penalty.In fact memcpy could and should be replaced with word by word copy for almost all of struct sizes up to ~32 bytes (as size is known in advance for this particular function pointer i.e. handler!int).In fact memcpy should be smart enough to do all that, but apparently it doesn't.I've found something far more evil:[snip]So to boot in order to get to int + int _detected_ it needs 4 calls to fptr(OpID.testConversion, null, &info) == 0 and acquire TypeInfo each time. Not counting 2 gets to obtain result. It's just madness.Heh, this all gives me a new, humbling perspective on my earlier self running my mouth about other designs :o). As always, history and context makes one much more empathic. Variant has a very old design and implementation. It is, in fact, the very first design I've ever carried in D; I'd decided, if I can't do that, D isn't powerful enough. 
The compiler was so buggy back then, implementation needed to take many contortions, and it was a minor miracle the whole house of cards stood together. The approach was to design for usability first, performance could come later. So design only had to not prevent good performance from being improved transparently later, which I think it has achieved. The basic approach was to use a pointer to function (instead of an integral) as a discriminator. The pointer to function was taken from the instantiation of a template function. That means Variant could work with any type, even types not yet defined. I'd made it a clear goal that I must not enumerate all types in one place, or "register" types for use with Variant - that would have been failure. Using a pointer to template function naturally associated a distinct tag with each type, automatically. I think it was a good idea then, and it's a good idea now. On this data layout functionality can be built in two ways: (a) Implement it inside the dispatcher function template, or (b) Test against it "outside", in the template Variant code. The first approach is more general, the second is faster. Right now Variant uses (a) almost exclusively. It's time to start doing (b) for more types.Plain hardcoded Type to integer switch seems so much better now.But we can do that now.Yes, and what we gain is speed for select known types. As long as the number of those is relatively small, using simple test and branch should serve us well. And we get to keep the generality, too.We can use the pointer to function as a tag for improving performance on primitive types.Indeed we can though it won't be as fast as simple tag - pointers have quite big and sparse addresses giving us the same as: if(ptr == handler!int) .... else if(ptr == handler!uint) .... else ....Another way to go is make 2-d function table for all built-ins and all operations on them. It's in fact a bunch of tables: Type1 x Type2 for all of important basic operations. 
Tables are static so it won't be much of a problem. At least int + int then would be 1 memory read, 1 call, 2 (casted as needed) reads & 1 store.I think that's not worth it, but hey, worth a try. Thanks, Andrei
Jul 29 2012
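The visitation alternative Andrei mentions at the end (two indirect calls, no search) is classic double dispatch. A minimal sketch with two invented value types: the first virtual call fixes the right operand's type, the second fixes the left's, and no table or if-chain is consulted.

```d
// Double dispatch for a binary operation: two virtual calls total.
interface Num
{
    Num addTo(Num rhs);          // first indirect call
    Num addToInt(IntVal lhs);    // second call: both types now known
    Num addToDbl(DblVal lhs);
}

class IntVal : Num
{
    int v;
    this(int v) { this.v = v; }
    Num addTo(Num rhs)       { return rhs.addToInt(this); }
    Num addToInt(IntVal lhs) { return new IntVal(lhs.v + v); }
    Num addToDbl(DblVal lhs) { return new DblVal(lhs.v + v); }
}

class DblVal : Num
{
    double v;
    this(double v) { this.v = v; }
    Num addTo(Num rhs)       { return rhs.addToDbl(this); }
    Num addToInt(IntVal lhs) { return new DblVal(lhs.v + v); }
    Num addToDbl(DblVal lhs) { return new DblVal(lhs.v + v); }
}

void main()
{
    Num a = new IntVal(2);
    Num b = new DblVal(0.5);
    auto r = cast(DblVal) a.addTo(b);
    assert(r !is null && r.v == 2.5);
}
```

The trade-off is the one "Modern C++ Design" discusses: every type must list every other type in its interface, giving up exactly the open-ended extensibility that Variant's handler design preserves.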
On 30-Jul-12 06:01, Andrei Alexandrescu wrote:I'd say array ops could and should do this (since compiler has all the info at compile-time). On the other hand memcpy is just one tired C function aimed at fast blitting of any memory chunks. (Even just call/ret pair is too much of overhead in case of int).In fact memcpy could and should be replaced with word by word copy for almost all of struct sizes up to ~32 bytes (as size is known in advance for this particular function pointer i.e. handler!int).In fact memcpy should be smart enough to do all that, but apparently it doesn't.These were indeed very rough days. I think it's a minor miracle I hadn't given up on D after seeing tons of bugs back in 2010.I've found something far more evil:[snip]So to boot in order to get to int + int _detected_ it needs 4 calls to fptr(OpID.testConversion, null, &info) == 0 and acquire TypeInfo each time. Not counting 2 gets to obtain result. It's just madness.Heh, this all gives me a new, humbling perspective on my earlier self running my mouth about other designs :o). As always, history and context makes one much more empathic. Variant has a very old design and implementation. It is, in fact, the very first design I've ever carried in D; I'd decided, if I can't do that, D isn't powerful enough. The compiler was so buggy back then, implementation needed to take many contortions, and it was a minor miracle the whole house of cards stood together.The basic approach was to use a pointer to function (instead of an integral) as a discriminator. The pointer to function was taken from the instantiation of a template function. That means Variant could work with any type, even types not yet defined. I'd made it a clear goal that I must not enumerate all types in one place, or "register" types for use with Variant - that would have been failure. Using a pointer to template function naturally associated a distinct tag with each type, automatically. 
I think it was a good idea then, and it's a good idea now.It's indeed a nice trick and the design sounds great (no register function, yay!). The only piece missing is that e.g. arithmetic operations are binary thus requiring double-dispatch or something like arithmeticHandler!(T1,T2) to achieve optimal performance. The virtual handcrafted C competition would do it via nested switches. [snip]OK I'll try it for opArithmetic that should help a lot.Yes, and what we gain is speed for select known types. As long as the number of those is relatively small, using simple test and branch should serve us well. And we get to keep the generality, too.We can use the pointer to function as a tag for improving performance on primitive types.Indeed we can though it won't be as fast as simple tag - pointers have quite big and sparse addresses giving us the same as: if(ptr == handler!int) .... else if(ptr == handler!uint) .... else ....I'm thinking more of it and common types are just uint, int, long, ulong, double, string. In fact most of e.g. int x string and such are "throw me an exception" entries, they could be merged. Then we could make 2-stage table to save some space, it sounds like trie, hm....Another way to go is make 2-d function table for all built-ins and all operations on them. It's in fact a bunch of tables: Type1 x Type2 for all of important basic operations. Tables are static so it won't be much of a problem. At least int + int then would be 1 memory read, 1 call, 2 (casted as needed) reads & 1 store.I think that's not worth it, but hey, worth a try.
Jul 30 2012
On 7/30/12 4:34 AM, Dmitry Olshansky wrote:
> On 30-Jul-12 06:01, Andrei Alexandrescu wrote:
>>> In fact memcpy could and should be replaced with word-by-word copy for
>>> almost all struct sizes up to ~32 bytes (as the size is known in
>>> advance for this particular function pointer, i.e. handler!int).
>>
>> In fact memcpy should be smart enough to do all that, but apparently it
>> doesn't.
>
> I'd say array ops could and should do this (since the compiler has all
> the info at compile time). On the other hand, memcpy is just one tired C
> function aimed at fast blitting of arbitrary memory chunks. (Even just
> the call/ret pair is too much overhead in the case of an int.)

memcpy is implemented as an intrinsic on many platforms. I'm not sure
whether it is on dmd, but it is on dmc
(http://www.digitalmars.com/ctg/sc.html), icc, and gcc
(http://software.intel.com/en-us/articles/memcpy-performance/). But then
clearly using simple word assignments wherever possible makes for a more
robust performance profile.

> I'm thinking more of it, and the common types are just uint, int, long,
> ulong, double, string. In fact most of e.g. int x string and such are
> "throw me an exception" entries; they could be merged. Then we could
> make a 2-stage table to save some space. It sounds like a trie, hm....

You can implement it via visitation, too (two indirect calls, no search).
"Modern C++ Design" discusses possible approaches at length.

Andrei
Jul 30 2012
On 30/07/12 13:24, Andrei Alexandrescu wrote:
> memcpy is implemented as an intrinsic on many platforms. I'm not sure
> whether it is on dmd, but it is on dmc
> (http://www.digitalmars.com/ctg/sc.html), icc, and gcc
> (http://software.intel.com/en-us/articles/memcpy-performance/). But then
> clearly using simple word assignments wherever possible makes for a more
> robust performance profile.

It is an intrinsic on DMD, but it isn't done optimally. Mostly it just
compiles to a couple of loads plus the single instruction rep movsd; /
rep movsq; which is perfect for medium-sized lengths when everything is
aligned, but once it is longer than a few hundred bytes it should be done
as a function call (the optimal method involves cache blocking). Also,
for very short lengths it should be done as a couple of loads.
Jul 30 2012
On Sunday, 29 July 2012 at 14:43:09 UTC, Dmitry Olshansky wrote:
> On 29-Jul-12 18:17, Andrei Alexandrescu wrote:
>> On 7/29/12 8:17 AM, Gor Gyolchanyan wrote:
>>> std.variant is so incredibly slow! It's practically unusable for
>>> anything which requires even a tiny bit of performance.
>>
>> You do realize you actually benchmark against a function that does
>> nothing, right? Clearly there are ways in which we can improve
>> std.variant to the point initialization costs assignment of two words,
>> but this benchmark doesn't help. (Incidentally I just prepared a class
>> at C++ and Beyond on benchmarking, and this benchmark makes a lot of
>> the mistakes described therein...)
>>
>> Andrei
>
> This should be more relevant, then:
>
> // fib.d
> import std.datetime, std.stdio, std.variant;
>
> auto fib(Int)()
> {
>     Int a = 1, b = 1;
>     for(size_t i = 0; i < 100; i++)
>     {
>         Int c = a + b;
>         a = b;
>         b = c;
>     }
>     return a;
> }
>
> void main()
> {
>     writeln(benchmark!(fib!int, fib!long, fib!Variant)(10_000));
> }
>
> dmd -O -inline -release fib.d
>
> Output:
> [TickDuration(197), TickDuration(276), TickDuration(93370107)]
>
> I'm horrified. Who was working on std.variant enhancements? Please
> chime in.

You underestimate DMD's optimisations :-)

For int and long, DMD does the whole loop at compile time, so again you
are benchmarking against an empty function. It's easy to see that this is
the case by changing the size of the loop and noting that the int and
long versions take the same amount of time.

Here's my results with optimisations turned *off*:

int:     2,304,868
long:    1,559,679
Variant: 2,667,320,252

Yes, it's not good to test without optimisations, but I think this gives
a clearer picture of the relative differences between the two. Still not
great at ~1000x difference, but much better than a million :-)
Jul 29 2012
On 29-Jul-12 23:20, Peter Alexander wrote:
> You underestimate DMD's optimisations :-)
>
> For int and long, DMD does the whole loop at compile time, so again you
> are benchmarking against an empty function.

Yeah, guilty as charged. Was in a hurry, yet I did test it without
optimization before posting, and the results were not good either.

> Here's my results with optimisations turned *off*:
>
> int:  2,304,868
> long: 1,559,679

Guess you are on a 64-bit system or what? long takes somewhat longer for
me.

> Variant: 2,667,320,252

Mine with -release -inline:

2307us 2577us 3170315us

3 orders of magnitude. Still too bad.

> Yes, it's not good to test without optimisations, but I think this
> gives a clearer picture of the relative differences between the two.
> Still not great at ~1000x difference, but much better than a million :-)

--
Dmitry Olshansky
Jul 29 2012
On Sunday, July 29, 2012 18:43:07 Dmitry Olshansky wrote:
> I'm horrified. Who was working on std.variant enhancements? Please
> chime in.

I believe that Robert Jaques fixed up std.variant as part of his work on
fixing up std.json, and that he intended to get the std.variant changes
reviewed and merged in before doing the same with his std.json changes,
but he's never asked for a review of either, and I don't know whether his
code is available online. I think that it's another one of those cases
where someone did a fair bit of work but became too busy to finish the
job.

- Jonathan M Davis
Jul 29 2012