
digitalmars.D - std.variant benchmark

reply Gor Gyolchanyan <gor.f.gyolchanyan gmail.com> writes:
std.variant is so incredibly slow! It's practically unusable for anything
that requires even a tiny bit of performance.
Benchmark done under -noboundscheck -inline -release -O:

import std.stdio;
import std.variant;
import std.datetime;

void on()
{
auto var = Variant(5);
int i = var.get!int;
}

void off()
{
auto var = 5;
int i = var;
}

void main()
{
writeln(benchmark!(on, off)(100000));
}

The result is:

[TickDuration(25094), TickDuration(98)]

There are tons of cases where a simple typeless data storage is necessary.
No type information, no type checking - just a low-level storage upon
which Variant and other dynamic-type constructs can be built.

I want to ask the community: what's the best way to store any variable in
a typeless storage, so that one could store any variable in that storage
and get a reference to it given its static type, with no type checking,
minimal overhead compared to a statically typed storage, and type-safe
storage semantics (postblits, garbage collection...)?

-- 
Bye,
Gor Gyolchanyan.
Jul 29 2012
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 7/29/12 8:17 AM, Gor Gyolchanyan wrote:
 std.variant is so incredibly slow! It's practically unusable for
 anything, which requires even a tiny bit of performance.
You do realize you actually benchmark against a function that does
nothing, right?

Clearly there are ways in which we can improve std.variant to the point
initialization costs assignment of two words, but this benchmark doesn't
help.

(Incidentally I just prepared a class at C++ and Beyond on benchmarking,
and this benchmark makes a lot of the mistakes described therein...)


Andrei
Jul 29 2012
next sibling parent reply Gor Gyolchanyan <gor.f.gyolchanyan gmail.com> writes:
On Sun, Jul 29, 2012 at 6:17 PM, Andrei Alexandrescu <
SeeWebsiteForEmail erdani.org> wrote:

 On 7/29/12 8:17 AM, Gor Gyolchanyan wrote:

 std.variant is so incredibly slow! It's practically unusable for
 anything, which requires even a tiny bit of performance.
 You do realize you actually benchmark against a function that does
 nothing, right?

 Clearly there are ways in which we can improve std.variant to the point
 initialization costs assignment of two words, but this benchmark doesn't
 help.

 (Incidentally I just prepared a class at C++ and Beyond on benchmarking,
 and this benchmark makes a lot of the mistakes described therein...)

 Andrei
I do compare it with nothing, just to see how many times it exceeds the
performance of statically typed storage. The point is that Variant is
extremely slow. All I want is to find out how to implement a very fast
typeless storage with maximum performance and type safety.

-- 
Bye,
Gor Gyolchanyan.
Jul 29 2012
parent reply "Tobias Pankrath" <tobias pankrath.net> writes:
 I do compare it with nothing, just to see how many times does 
 it exceed the
 performance of static typed storage. The point is that Variant 
 is extremely
 slow.
 All I want is to find out how to implement a very fast typeless 
 storage
 with maximum performance and type safety.
Isn't that a paradox? What kind of type safety do you expect from a typeless storage?
Jul 29 2012
parent Gor Gyolchanyan <gor.f.gyolchanyan gmail.com> writes:
On Sun, Jul 29, 2012 at 6:41 PM, Tobias Pankrath <tobias pankrath.net>wrote:

 I do compare it with nothing, just to see how many times does it exceed the
 performance of static typed storage. The point is that Variant is
 extremely
 slow.
 All I want is to find out how to implement a very fast typeless storage
 with maximum performance and type safety.
Isn't that a paradox? What kind of type safety do you expect from a typeless storage?
The kind that doesn't forget about postblits and destructors. This isn't
a paradox.

-- 
Bye,
Gor Gyolchanyan.
Jul 29 2012
prev sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 29-Jul-12 18:17, Andrei Alexandrescu wrote:
 On 7/29/12 8:17 AM, Gor Gyolchanyan wrote:
 std.variant is so incredibly slow! It's practically unusable for
 anything, which requires even a tiny bit of performance.
 You do realize you actually benchmark against a function that does
 nothing, right?

 Clearly there are ways in which we can improve std.variant to the point
 initialization costs assignment of two words, but this benchmark doesn't
 help.

 (Incidentally I just prepared a class at C++ and Beyond on benchmarking,
 and this benchmark makes a lot of the mistakes described therein...)

 Andrei
This should be more relevant, then:

//fib.d
import std.datetime, std.stdio, std.variant;

auto fib(Int)()
{
    Int a = 1, b = 1;
    for(size_t i=0; i<100; i++){
        Int c = a + b;
        a = b;
        b = c;
    }
    return a;
}

void main()
{
    writeln(benchmark!(fib!int, fib!long, fib!Variant)(10_000));
}

dmd -O -inline -release fib.d

Output:

[TickDuration(197), TickDuration(276), TickDuration(93370107)]

I'm horrified. Who was working on std.variant enhancements? Please
chime in.

-- 
Dmitry Olshansky
Jul 29 2012
next sibling parent reply Gor Gyolchanyan <gor.f.gyolchanyan gmail.com> writes:
On Sun, Jul 29, 2012 at 6:43 PM, Dmitry Olshansky <dmitry.olsh gmail.com>wrote:

 On 29-Jul-12 18:17, Andrei Alexandrescu wrote:

 On 7/29/12 8:17 AM, Gor Gyolchanyan wrote:

 std.variant is so incredibly slow! It's practically unusable for
 anything, which requires even a tiny bit of performance.
 You do realize you actually benchmark against a function that does
 nothing, right? Clearly there are ways in which we can improve
 std.variant to the point initialization costs assignment of two words,
 but this benchmark doesn't help. (Incidentally I just prepared a class
 at C++ and Beyond on benchmarking, and this benchmark makes a lot of
 the mistakes described therein...)

 Andrei

 This should be more relevant then:

 //fib.d
 import std.datetime, std.stdio, std.variant;

 auto fib(Int)()
 {
     Int a = 1, b = 1;
     for(size_t i=0; i<100; i++){
         Int c = a + b;
         a = b;
         b = c;
     }
     return a;
 }

 void main()
 {
     writeln(benchmark!(fib!int, fib!long, fib!Variant)(10_000));
 }

 dmd -O -inline -release fib.d

 Output:

 [TickDuration(197), TickDuration(276), TickDuration(93370107)]

 I'm horrified. Who was working on std.variant enhancements? Please
 chime in.

 --
 Dmitry Olshansky

Thank you for demonstrating my point. :-)

-- 
Bye,
Gor Gyolchanyan.
Jul 29 2012
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 7/29/12 11:20 AM, Gor Gyolchanyan wrote:
 On Sun, Jul 29, 2012 at 6:43 PM, Dmitry Olshansky <dmitry.olsh gmail.com
 <mailto:dmitry.olsh gmail.com>> wrote:
     This should be more relevant then:

     //fib.d
     import std.datetime, std.stdio, std.variant;

     auto fib(Int)()
     {
              Int a = 1, b = 1;
              for(size_t i=0; i<100; i++){
                      Int c = a + b;
                      a = b;
                      b = c;
              }
              return a;
     }

     void main()
     {
              writeln(benchmark!(fib!int, fib!long, fib!Variant)(10_000));
     }


     dmd -O -inline -release fib.d

     Output:

     [TickDuration(197), TickDuration(276), TickDuration(93370107)]

     I'm horrified. Who was working on std.variant enhancements? Please
     chime in.

     --
     Dmitry Olshansky


 Thank you for demonstrating my point. :-)
I don't think he demonstrated your point (even leaving aside that the
benchmark above is also flawed). I don't think there was a point, unless
it was you wanted to vent and looked for a pretext - which is, by the
way, entirely reasonable.

You mentioned you need "a very fast typeless storage with maximum
performance and type safety." Well if it's typeless then it means you're
using it for storage, not for computationally-intensive operations, as
the code above does. So what you'd presumably want to test is the speed
of data storage and retrieval. A loop that does a bunch of adds on
Variant does not go in that direction.

Benchmarks are notoriously hard to get right. You need to think of
reasonable baselines - if you want to use Variant for typeless storage,
what is your vanilla implementation, the "obvious contender"? What are
the primitives of which speed is important? Then you need to make sure
you subtract noise from all your comparisons. Then you need to figure
whether the issue is with the design or with the implementation of
Variant. In the former case, maybe Variant isn't for you. In the latter,
bug reports are always welcome.


Andrei
Jul 29 2012
parent Gor Gyolchanyan <gor.f.gyolchanyan gmail.com> writes:
On Mon, Jul 30, 2012 at 6:07 AM, Andrei Alexandrescu <
SeeWebsiteForEmail erdani.org> wrote:

 On 7/29/12 11:20 AM, Gor Gyolchanyan wrote:

 On Sun, Jul 29, 2012 at 6:43 PM, Dmitry Olshansky <dmitry.olsh gmail.com
 <mailto:dmitry.olsh gmail.com>**> wrote:
     This should be more relevant then:

     //fib.d
     import std.datetime, std.stdio, std.variant;

     auto fib(Int)()
     {
              Int a = 1, b = 1;
              for(size_t i=0; i<100; i++){
                      Int c = a + b;
                      a = b;
                      b = c;
              }
              return a;
     }

     void main()
     {
              writeln(benchmark!(fib!int, fib!long, fib!Variant)(10_000));
     }


     dmd -O -inline -release fib.d

     Output:

     [TickDuration(197), TickDuration(276), TickDuration(93370107)]

     I'm horrified. Who was working on std.variant enhancements? Please
     chime in.

     --
     Dmitry Olshansky


 Thank you for demonstrating my point. :-)
 I don't think he demonstrated your point (even leaving aside that the
 benchmark above is also flawed). I don't think there was a point, unless
 it was you wanted to vent and looked for a pretext - which is, by the
 way, entirely reasonable.

 You mentioned you need "a very fast typeless storage with maximum
 performance and type safety." Well if it's typeless then it means you're
 using it for storage, not for computationally-intensive operations, as
 the code above does. So what you'd presumably want to test is the speed
 of data storage and retrieval. A loop that does a bunch of adds on
 Variant does not go in that direction.

 Benchmarks are notoriously hard to get right. You need to think of
 reasonable baselines - if you want to use Variant for typeless storage,
 what is your vanilla implementation, the "obvious contender"? What are
 the primitives of which speed is important? Then you need to make sure
 you subtract noise from all your comparisons. Then you need to figure
 whether the issue is with the design or with the implementation of
 Variant. In the former case, maybe Variant isn't for you. In the latter,
 bug reports are always welcome.

 Andrei

You're right. I'm trying to make a subject-oriented system, where there
are nodes and accessors and each accessor has access to a specific part
of the node. I must have the data stored in a typeless manner and cast
to the appropriate type (depending on the accessor). This is all
necessary for my very flexible graph data structure.

-- 
Bye,
Gor Gyolchanyan.
Jul 30 2012
prev sibling next sibling parent reply "jerro" <a a.com> writes:
On Sunday, 29 July 2012 at 14:43:09 UTC, Dmitry Olshansky wrote:
 On 29-Jul-12 18:17, Andrei Alexandrescu wrote:
 On 7/29/12 8:17 AM, Gor Gyolchanyan wrote:
 std.variant is so incredibly slow! It's practically unusable 
 for
 anything, which requires even a tiny bit of performance.
 You do realize you actually benchmark against a function that does
 nothing, right? Clearly there are ways in which we can improve
 std.variant to the point initialization costs assignment of two words,
 but this benchmark doesn't help. (Incidentally I just prepared a class
 at C++ and Beyond on benchmarking, and this benchmark makes a lot of
 the mistakes described therein...)

 Andrei

 This should be more relevant then:

 //fib.d
 import std.datetime, std.stdio, std.variant;

 auto fib(Int)()
 {
     Int a = 1, b = 1;
     for(size_t i=0; i<100; i++){
         Int c = a + b;
         a = b;
         b = c;
     }
     return a;
 }

 void main()
 {
     writeln(benchmark!(fib!int, fib!long, fib!Variant)(10_000));
 }

 dmd -O -inline -release fib.d

 Output:

 [TickDuration(197), TickDuration(276), TickDuration(93370107)]

 I'm horrified. Who was working on std.variant enhancements? Please
 chime in.
I thought these results were a bit strange, so I converted the result to
seconds. This gave me:

[3.73e-06, 3.721e-06, 2.97281]

One million inner loop iterations in under 4 microseconds? My processor's
frequency isn't measured in THz, so something strange must be going on
here. In order to find out what it was, I changed the code to this:

writeln(benchmark!(fib!int, fib!long)(1000_000_000)[]
    .map!"a.nsecs() * 1.0e-9");

and used a profiler on it. The relevant part of the output is:

 0.00 : 445969: test   %r12d,%r12d
 0.00 : 44596c: je     445975 <_D3std8date
46.67 : 44596e: inc    %ebx
 0.00 : 445970: cmp    %r12d,%ebx
 0.00 : 445973: jb     44596e <_D3std8date
 0.00 : 445975: lea    -0x18(%rbp),%rdi
 0.00 : 445979: callq  45a048 <_D3std8date
 0.00 : 44597e: mov    %rax,0x0(%r13)
 0.00 : 445982: lea    -0x18(%rbp),%rdi
 0.00 : 445986: callq  459fb4 <_D3std8date
 0.00 : 44598b: xor    %ebx,%ebx
 0.00 : 44598d: test   %r12d,%r12d
 0.00 : 445990: je     445999 <_D3std8date
53.33 : 445992: inc    %ebx
 0.00 : 445994: cmp    %r12d,%ebx
 0.00 : 445997: jb     445992 <_D3std8date

As you can see, most of the time is spent in two loops with empty bodies,
so your code is benchmarking Variant against nothing, too. Adding
asm{ nop; } to fib changes the output to this:

[0.00437154, 0.00444938, 3.03917]

Which is still a huge difference.
Jul 29 2012
parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 7/29/12 2:57 PM, jerro wrote:
 I thought this results are a bit strange, so I converted the result to
 seconds. This gave me:

 [3.73e-06, 3.721e-06, 2.97281]
Very nice analysis! Andrei
Jul 29 2012
prev sibling next sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 7/29/12 10:43 AM, Dmitry Olshansky wrote:
 I'm horrified. Who was working on std.variant enhancements? Please chime
 in.
I guess you just volunteered!

When I looked at it this morning I noticed a few signs of bit rot, e.g.
opAssign returns by value and such. (Only planting a "ref" there improves
performance a good amount.)

Variant has a simple design with (in case of int) an int and a pointer to
a function. Many of its operations incur an indirect call through that
pointer. This makes operations slower than the time-honored design of
using an integral tag and switching on it, but offers in return the
ability to hold any type without needing to enumerate all types
explicitly. We can use the pointer to function as a tag for improving
performance on primitive types.


Andrei
Jul 29 2012
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 29-Jul-12 22:59, Andrei Alexandrescu wrote:
 On 7/29/12 10:43 AM, Dmitry Olshansky wrote:
 I'm horrified. Who was working on std.variant enhancements? Please chime
 in.
 I guess you just volunteered!

 When I looked at it this morning I noticed a few signs of bit rot, e.g.
 opAssign returns by value and such. (Only planting a "ref" there
 improves performance a good amount.)
Strange, but numbers stay the same for me.
 Variant has a simple design with (in case of int) an int and a pointer
 to a function. Many of its operations incur an indirect call through
 that pointer. This makes operations slower than the time-honored design
 of using an integral tag and switching on it, but offers in return the
 ability to hold any type without needing to enumerate all types explicitly.
Well obviously integers do not take lightly to being copied around with
memcpy like some pompous arrays; that alone incurs *some* penalty. In
fact memcpy could and should be replaced with word-by-word copy for
almost all struct sizes up to ~32 bytes (as the size is known in advance
for this particular function pointer, i.e. handler!int).

I've found something far more evil:

 property bool convertsTo(T)() const
{
    TypeInfo info = typeid(T);
    return fptr(OpID.testConversion, null, &info) == 0;
}

Okay... now let me pull off another piece of rag:

private VariantN opArithmetic(T, string op)(T other)
{
    VariantN result;
    static if (is(T == VariantN))
    {
        if (convertsTo!(uint) && other.convertsTo!(uint))
            result = mixin("get!(uint) " ~ op ~ " other.get!(uint)");
        else if (convertsTo!(int) && other.convertsTo!(int))
            result = mixin("get!(int) " ~ op ~ " other.get!(int)");
        else if (convertsTo!(ulong) && other.convertsTo!(ulong))
            result = mixin("get!(ulong) " ~ op ~ " other.get!(ulong)");
        else if (convertsTo!(long) && other.convertsTo!(long))
            result = mixin("get!(long) " ~ op ~ " other.get!(long)");
        else if (convertsTo!(double) && other.convertsTo!(double))
            result = mixin("get!(double) " ~ op ~ " other.get!(double)");
        else
            result = mixin("get!(real) " ~ op ~ " other.get!(real)");
    }
    else
    {
        if (is(typeof(T.max) : uint) && T.min == 0 && convertsTo!(uint))
            result = mixin("get!(uint) " ~ op ~ " other");
        else if (is(typeof(T.max) : int) && T.min < 0 && convertsTo!(int))
            result = mixin("get!(int) " ~ op ~ " other");
        else if (is(typeof(T.max) : ulong) && T.min == 0
                && convertsTo!(ulong))
            result = mixin("get!(ulong) " ~ op ~ " other");
        else if (is(typeof(T.max) : long) && T.min < 0 && convertsTo!(long))
            result = mixin("get!(long) " ~ op ~ " other");
        else if (is(T : double) && convertsTo!(double))
            result = mixin("get!(double) " ~ op ~ " other");
        else
            result = mixin("get!(real) " ~ op ~ " other");
    }
    return result;
}

So, to boot, just to get int + int _detected_ it needs 4 calls to
fptr(OpID.testConversion, null, &info) == 0, acquiring a TypeInfo each
time - not counting the 2 gets to obtain the result. It's just madness.
A plain hardcoded type-to-integer switch seems so much better now.
 We can use the pointer to function as a tag for improving performance on
 primitive types.
Indeed we can, though it won't be as fast as a simple tag - pointers have
quite big and sparse addresses, giving us the same as:

if(ptr == handler!int)
    ...
else if(ptr == handler!uint)
    ...
else
    ...

Another way to go is to make a 2-d function table for all built-ins and
all operations on them. It's in fact a bunch of tables: Type1 x Type2 for
all of the important basic operations. The tables are static so it won't
be much of a problem. At least int + int then would be 1 memory read,
1 call, 2 (casted as needed) reads & 1 store.

-- 
Dmitry Olshansky
Jul 29 2012
next sibling parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 30-Jul-12 00:11, Dmitry Olshansky wrote:
 I've found something far more evil:

  property bool convertsTo(T)() const
      {
          TypeInfo info = typeid(T);
          return fptr(OpID.testConversion, null, &info) == 0;
      }

 Okay... now let me pull off another piece of rag:

 private VariantN opArithmetic(T, string op)(T other)
      {
          VariantN result;
          static if (is(T == VariantN))
          {
             /*if (convertsTo!(uint) && other.convertsTo!(uint))
                  result = mixin("get!(uint) " ~ op ~ " other.get!(uint)");
              else*/ if (convertsTo!(int) && other.convertsTo!(int))
                  result = mixin("get!(int) " ~ op ~ " other.get!(int)");
...

Apparently I'm spot on. Just commenting one extra branch of this horror
movie gives interesting change:

2779us
2667us
3153762us

After:

2319us
2523us
288581us

Aye, 10x :o)

-- 
Dmitry Olshansky
Jul 29 2012
next sibling parent Marco Leise <Marco.Leise gmx.de> writes:
Am Mon, 30 Jul 2012 00:24:40 +0400
schrieb Dmitry Olshansky <dmitry.olsh gmail.com>:

 Just commenting one extra branch of this horror movie
 gives interesting change:
 
 2779us
 2667us
 3153762us
 
 After:
 
 2319us
 2523us
 288581us
 
 Aye, 10x :o)
Very nice finding. Now it is only 100x slower. :D
It seems well worth changing the style Variant is written in.

-- 
Marco
Jul 29 2012
prev sibling parent reply "jerro" <a a.com> writes:
On Sunday, 29 July 2012 at 20:24:42 UTC, Dmitry Olshansky wrote:
 On 30-Jul-12 00:11, Dmitry Olshansky wrote:
 I've found something far more evil:

  property bool convertsTo(T)() const
     {
         TypeInfo info = typeid(T);
         return fptr(OpID.testConversion, null, &info) == 0;
     }

 Okay... now let me pull off another piece of rag:

 private VariantN opArithmetic(T, string op)(T other)
     {
         VariantN result;
         static if (is(T == VariantN))
         {
            /*if (convertsTo!(uint) && other.convertsTo!(uint))
                 result = mixin("get!(uint) " ~ op ~ " 
 other.get!(uint)");
             else*/ if (convertsTo!(int) && 
 other.convertsTo!(int))
                 result = mixin("get!(int) " ~ op ~ " 
 other.get!(int)");
 ...

 Apparently I'm spot on. Just commenting one extra branch of this horror
 movie gives interesting change:

 2779us
 2667us
 3153762us

 After:

 2319us
 2523us
 288581us

 Aye, 10x :o)
I profiled it and found out much of the time is spent inside TypeInfo.opEquals being called from tryPutting. So I tried replacing "!= typeid" in tryPutting with "!is typeid". That brought the time from 2.8 s down to 0.12 on my machine. I don't know if that is the proper solution, since I don't know if typeid can ever return two TypeInfo objects that aren't the same but are equal (I haven't used typeid and TypeInfo much before). The fib function here does return correct result after doing that, though.
Aug 02 2012
parent reply Gor Gyolchanyan <gor.f.gyolchanyan gmail.com> writes:
On Fri, Aug 3, 2012 at 8:28 AM, jerro <a a.com> wrote:

 On Sunday, 29 July 2012 at 20:24:42 UTC, Dmitry Olshansky wrote:

 On 30-Jul-12 00:11, Dmitry Olshansky wrote:

 I've found something far more evil:

  property bool convertsTo(T)() const
     {
         TypeInfo info = typeid(T);
         return fptr(OpID.testConversion, null, &info) == 0;
     }

 Okay... now let me pull off another piece of rag:

 private VariantN opArithmetic(T, string op)(T other)
     {
         VariantN result;
         static if (is(T == VariantN))
         {
            /*if (convertsTo!(uint) && other.convertsTo!(uint))
                 result = mixin("get!(uint) " ~ op ~ " other.get!(uint)");
             else*/ if (convertsTo!(int) && other.convertsTo!(int))
                 result = mixin("get!(int) " ~ op ~ " other.get!(int)");
 ...

 Apparently I'm spot on. Just commenting one extra branch of this horror
 movie gives interesting change:

 2779us
 2667us
 3153762us

 After:

 2319us
 2523us
 288581us

 Aye, 10x :o)

 I profiled it and found out much of the time is spent inside
 TypeInfo.opEquals being called from tryPutting. So I tried replacing
 "!= typeid" in tryPutting with "!is typeid". That brought the time from
 2.8 s down to 0.12 on my machine. I don't know if that is the proper
 solution, since I don't know if typeid can ever return two TypeInfo
 objects that aren't the same but are equal (I haven't used typeid and
 TypeInfo much before). The fib function here does return correct result
 after doing that, though.

Wow! Now that's an impressive improvement. If TypeInfo instances are
unique for every type, then we should be good to go!

-- 
Bye,
Gor Gyolchanyan.
Aug 03 2012
parent "jerro" <a a.com> writes:
 I profiled it and found out much of the time is spent inside
 TypeInfo.opEquals being called from tryPutting. So I tried 
 replacing "!=
 typeid" in tryPutting with "!is typeid". That brought the time 
 from 2.8 s
 down to 0.12 on my machine. I don't know if that is the proper 
 solution,
 since I don't know if typeid can ever return two TypeInfo 
 objects that
 aren't the same but are equal (I haven't used typeid and 
 TypeInfo much
 before). The fib function here does return correct result 
 after doing that,
 though.
 Wow! Now that's an impressive improvement. If TypeInfo instances are
 unique for every type, then we should be good to go!
I've tried Dmitry's fork now (he optimized opArithmetic to not do as many calls to fptr) and it's even a little faster (1.01 s for 100000 iterations). Replacing != with !is in Dmitry's fork speeds it up a little bit more (0.935 s for 100000 iterations).
Aug 03 2012
prev sibling parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 7/29/12 4:11 PM, Dmitry Olshansky wrote:
 On 29-Jul-12 22:59, Andrei Alexandrescu wrote:
 On 7/29/12 10:43 AM, Dmitry Olshansky wrote:
 I'm horrified. Who was working on std.variant enhancements? Please chime
 in.
I guess you just volunteered! When I looked at it this morning I noticed a few signs of bit rot, e.g. opAssign returns by value and such. (Only planting a "ref" there improves performance a good amount.)
Strange, but numbers stay the same for me.
Depends on what you measure.
 Variant has a simple design with (in case of int) an int and a pointer
 to a function. Many of its operations incur an indirect call through
 that pointer. This makes operations slower than the time-honored design
 of using an integral tag and switching on it, but offers in return the
 ability to hold any type without needing to enumerate all types
 explicitly.
 Well obviously integers do not take lightly when they are copied around
 with memcpy like some pompous arrays; that alone incurs *some* penalty.
Yah, when I replaced memcpy with simple assignment for Variant<int> I saw some improvements.
 In fact memcpy could and should be replaced with word by word copy for
 almost all of struct sizes up to ~32 bytes (as size is known in advance
 for this particular function pointer i.e. handler!int).
In fact memcpy should be smart enough to do all that, but apparently it doesn't.
 I've found something far more evil:
[snip]
 So to boot in order to get to int + int _detected_ it needs 4 calls to
 fptr(OpID.testConversion, null, &info) == 0 and acquire TypeInfo each
 time. Not counting 2 gets to obtain result. It's just madness.
Heh, this all gives me a new, humbling perspective on my earlier self
running my mouth about other designs :o). As always, history and context
make one much more empathic.

Variant has a very old design and implementation. It is, in fact, the
very first design I've ever carried out in D; I'd decided, if I can't do
that, D isn't powerful enough. The compiler was so buggy back then, the
implementation needed to take many contortions, and it was a minor
miracle the whole house of cards stood together. The approach was to
design for usability first; performance could come later. So the design
only had to not prevent good performance from being added transparently
later, which I think it has achieved.

The basic approach was to use a pointer to function (instead of an
integral) as a discriminator. The pointer to function was taken from the
instantiation of a template function. That means Variant could work with
any type, even types not yet defined. I'd made it a clear goal that I
must not enumerate all types in one place, or "register" types for use
with Variant - that would have been failure. Using a pointer to template
function naturally associated a distinct tag with each type,
automatically. I think it was a good idea then, and it's a good idea now.

On this data layout functionality can be built in two ways: (a) implement
it inside the dispatcher function template, or (b) test against it
"outside", in the template Variant code. The first approach is more
general, the second is faster. Right now Variant uses (a) almost
exclusively. It's time to start doing (b) for more types.
 Plain hardcoded Type to integer switch seems so much better now.
But we can do that now.
 We can use the pointer to function as a tag for improving performance on
 primitive types.
 Indeed we can though it won't be as fast as simple tag - pointers have
 quite big and sparse addresses giving us the same as:

 if(ptr == handler!int)
     ...
 else if(ptr == handler!uint)
     ...
 else
     ...
Yes, and what we gain is speed for select known types. As long as the number of those is relatively small, using simple test and branch should serve us well. And we get to keep the generality, too.
 Another way to go is make 2-d function table for all built-ins and all
 operations on them. It's in fact a bunch of tables:
 Type1 x Type2 for all of important basic operations. Tables are static
 so it won't be much of a problem. At least int + int then would be
 1 memory read, 1 call, 2 (casted as needed) reads & 1 store.
I think that's not worth it, but hey, worth a try.


Thanks,

Andrei
Jul 29 2012
parent reply Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 30-Jul-12 06:01, Andrei Alexandrescu wrote:
 In fact memcpy could and should be replaced with word by word copy for
 almost all of struct sizes up to ~32 bytes (as size is known in advance
 for this particular function pointer i.e. handler!int).
In fact memcpy should be smart enough to do all that, but apparently it doesn't.
I'd say array ops could and should do this (since compiler has all the info at compile-time). On the other hand memcpy is just one tired C function aimed at fast blitting of any memory chunks. (Even just call/ret pair is too much of overhead in case of int).
 I've found something far more evil:
[snip]
 So to boot in order to get to int + int _detected_ it needs 4 calls to
 fptr(OpID.testConversion, null, &info) == 0 and acquire TypeInfo each
 time. Not counting 2 gets to obtain result. It's just madness.
 Heh, this all gives me a new, humbling perspective on my earlier self
 running my mouth about other designs :o). As always, history and context
 makes one much more empathic.

 Variant has a very old design and implementation. It is, in fact, the
 very first design I've ever carried in D; I'd decided, if I can't do
 that, D isn't powerful enough. The compiler was so buggy back then,
 implementation needed to take many contortions, and it was a minor
 miracle the whole house of cards stood together.
These were indeed very rough days. I think it's a minor miracle I hadn't
given up on D after seeing tons of bugs back in 2010.
 The basic approach was to use a pointer to function (instead of an
 integral) as a discriminator. The pointer to function was taken from the
 instantiation of a template function. That means Variant could work with
 any type, even types not yet defined. I'd made it a clear goal that I
 must not enumerate all types in one place, or "register" types for use
 with Variant - that would have been failure. Using a pointer to template
 function naturally associated a distinct tag with each type,
 automatically. I think it was a good idea then, and it's a good idea now.
It's indeed a nice trick, and the design sounds great (no register
function, yay!). The only piece missing is that e.g. arithmetic
operations are binary, thus requiring double dispatch or something like
arithmeticHandler!(T1,T2) to achieve optimal performance. The virtual
handcrafted C competition would do it via nested switches.

[snip]
 We can use the pointer to function as a tag for improving performance on
 primitive types.
Indeed we can though it won't be as fast as simple tag - pointers have quite big and sparse addresses giving us the same as: if(ptr == handler!int) .... else if(ptr == handler!uint) .... else ....
Yes, and what we gain is speed for select known types. As long as the number of those is relatively small, using simple test and branch should serve us well. And we get to keep the generality, too.
OK I'll try it for opArithmetic that should help a lot.
 Another way to go is make 2-d function table for all built-ins and all
 operations on them. It's in fact a bunch of tables:
 Type1 x Type2 for all of important basic operations. Tables are static
 so it won't be much of a problem. At least int + int then would be
 1 memory read, 1 call, 2 (casted as needed) reads & 1 store.
I think that's not worth it, but hey, worth a try.
I'm thinking more about it, and the common types are just uint, int,
long, ulong, double, string. In fact most entries like int x string are
"throw me an exception" entries; they could be merged. Then we could make
a 2-stage table to save some space. It sounds like a trie, hm....

-- 
Dmitry Olshansky
Jul 30 2012
parent reply Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 7/30/12 4:34 AM, Dmitry Olshansky wrote:
 On 30-Jul-12 06:01, Andrei Alexandrescu wrote:
 In fact memcpy could and should be replaced with a word-by-word copy for
 almost all struct sizes up to ~32 bytes (since the size is known in
 advance for this particular function pointer, i.e. handler!int).
In fact memcpy should be smart enough to do all that, but apparently it isn't.
I'd say array ops could and should do this (since the compiler has all the info at compile time). memcpy, on the other hand, is just one tired C function aimed at fast blitting of arbitrary memory chunks. (Even the call/ret pair alone is too much overhead in the case of an int.)
memcpy is implemented as an intrinsic on many platforms. I'm not sure whether it is on dmd, but it is on dmc (http://www.digitalmars.com/ctg/sc.html), icc, and gcc (http://software.intel.com/en-us/articles/memcpy-performance/). But then clearly using simple word assignments wherever possible makes for a more robust performance profile.
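A sketch of the word-by-word alternative: inside handler!T the size is a compile-time constant, so a tiny templated copy can be fully unrolled (smallCopy is an invented name, not an existing Phobos function):

```d
// For small, word-multiple sizes known at compile time, a plain loop of
// word assignments avoids the memcpy call overhead entirely.
void smallCopy(size_t n)(void* dst, const(void)* src)
    if (n % size_t.sizeof == 0 && n <= 32)
{
    auto d = cast(size_t*) dst;
    auto s = cast(const(size_t)*) src;
    foreach (w; 0 .. n / size_t.sizeof)  // trip count is static: unrollable
        d[w] = s[w];
}

void main()
{
    long[2] src = [10, 20];
    long[2] dst;
    smallCopy!(src.sizeof)(dst.ptr, src.ptr);
    assert(dst == src);
}
```

Anything larger than the cutoff (or oddly sized) would fall back to memcpy, keeping the worst case no worse than today.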
 I'm thinking more of it and common types are just uint, int, long,
 ulong, double, string. In fact most of e.g. int x string and such are
 "throw me an exception" entries, they could be merged.
 Then we could make 2-stage table to save some space, it sounds like
 trie, hm....
You can implement it via visitation too (two indirect calls, no search). "Modern C++ Design" discusses possible approaches at length. Andrei
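A sketch of that visitation approach in D (class and method names invented): the first virtual call dispatches on the left operand, the second on the right, so a binary op resolves in exactly two indirect calls with no table search.

```d
abstract class Num
{
    abstract Num add(Num rhs);         // first dispatch: on lhs
    abstract Num addToInt(IntVal lhs); // second dispatch: on rhs
    abstract Num addToDbl(DblVal lhs);
}

class IntVal : Num
{
    int v;
    this(int v) { this.v = v; }
    override Num add(Num rhs) { return rhs.addToInt(this); }
    override Num addToInt(IntVal lhs) { return new IntVal(lhs.v + v); }
    override Num addToDbl(DblVal lhs) { return new DblVal(lhs.v + v); }
}

class DblVal : Num
{
    double v;
    this(double v) { this.v = v; }
    override Num add(Num rhs) { return rhs.addToDbl(this); }
    override Num addToInt(IntVal lhs) { return new DblVal(lhs.v + v); }
    override Num addToDbl(DblVal lhs) { return new DblVal(lhs.v + v); }
}

void main()
{
    Num r = new IntVal(2).add(new DblVal(0.5));
    auto d = cast(DblVal) r;
    assert(d !is null && d.v == 2.5);
}
```

The trade-off versus the table is that adding a type means touching every class, which is exactly the "register everything in one place" coupling the handler!T design avoids.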
Jul 30 2012
parent Don Clugston <dac nospam.com> writes:
On 30/07/12 13:24, Andrei Alexandrescu wrote:
 On 7/30/12 4:34 AM, Dmitry Olshansky wrote:
 On 30-Jul-12 06:01, Andrei Alexandrescu wrote:
 In fact memcpy could and should be replaced with word by word copy for
 almost all of struct sizes up to ~32 bytes (as size is known in advance
 for this particular function pointer i.e. handler!int).
In fact memcpy should be smart enough to do all that, but apparently it doesn't.
I'd say array ops could and should do this (since compiler has all the info at compile-time). On the other hand memcpy is just one tired C function aimed at fast blitting of any memory chunks. (Even just call/ret pair is too much of overhead in case of int).
memcpy is implemented as an intrinsic on many platforms. I'm not sure whether it is on dmd, but it is on dmc (http://www.digitalmars.com/ctg/sc.html), icc, and gcc (http://software.intel.com/en-us/articles/memcpy-performance/). But then clearly using simple word assignments wherever possible makes for a more robust performance profile.
It is an intrinsic on DMD, but it isn't done optimally. Mostly it just compiles to a couple of loads plus the single instruction rep movsd; / rep movsq; which is perfect for medium-sized lengths when everything is aligned, but once the length exceeds a few hundred bytes, it should be done as a function call. (The optimal method involves cache blocking.) Also, for very short lengths it should be done as a couple of loads.
Jul 30 2012
prev sibling next sibling parent reply "Peter Alexander" <peter.alexander.au gmail.com> writes:
On Sunday, 29 July 2012 at 14:43:09 UTC, Dmitry Olshansky wrote:
 On 29-Jul-12 18:17, Andrei Alexandrescu wrote:
 On 7/29/12 8:17 AM, Gor Gyolchanyan wrote:
 std.variant is so incredibly slow! It's practically unusable 
 for
 anything, which requires even a tiny bit of performance.
You do realize you actually benchmark against a function that does nothing, right? Clearly there are ways in which we can improve std.variant to the point initialization costs assignment of two words, but this benchmark doesn't help. (Incidentally I just prepared a class at C++ and Beyond on benchmarking, and this benchmark makes a lot of the mistakes described therein...) Andrei
This should be more relevant then:

//fib.d
import std.datetime, std.stdio, std.variant;

auto fib(Int)()
{
    Int a = 1, b = 1;
    for(size_t i=0; i<100; i++){
        Int c = a + b;
        a = b;
        b = c;
    }
    return a;
}

void main()
{
    writeln(benchmark!(fib!int, fib!long, fib!Variant)(10_000));
}

dmd -O -inline -release fib.d

Output:
[TickDuration(197), TickDuration(276), TickDuration(93370107)]

I'm horrified. Who was working on std.variant enhancements? Please chime in.
You underestimate DMD's optimisations :-)

For int and long, DMD does the whole loop at compile time, so again you are benchmarking against an empty function. It's easy to see that this is the case by changing the size of the loop and noting that the int and long versions take the same amount of time.

Here are my results with optimisations turned *off*:

int: 2,304,868
long: 1,559,679
Variant: 2,667,320,252

Yes, it's not good to test without optimisations, but I think this gives a clearer picture of the relative differences between the two.

Still not great at ~1000x difference, but much better than a million :-)
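One way to keep the optimiser honest, sketched below: take the seed from a runtime value and accumulate each result into a global, so neither constant folding nor dead-code elimination can empty the loop (opaqueSeed and sink are invented names for this sketch):

```d
int opaqueSeed = 1; // set at runtime; invisible to constant folding
long sink;          // consuming results defeats dead-code elimination

auto fib(Int)()
{
    Int a = cast(Int) opaqueSeed, b = 1;
    for (size_t i = 0; i < 100; i++)
    {
        Int c = a + b;
        a = b;
        b = c;
    }
    return a;
}

void benchInt()  { sink += fib!int(); }
void benchLong() { sink += fib!long(); }

void main(string[] args)
{
    opaqueSeed = cast(int) args.length; // runtime value, normally 1
    benchInt();
    benchLong();
    // D guarantees two's-complement wraparound, so the int result is
    // exactly the truncation of the long one even after overflow:
    assert(fib!int() == cast(int) fib!long());
}
```

Dropping benchInt/benchLong into the thread's benchmark!(...)(10_000) call in place of fib!int/fib!long should then time actual work for all three cases.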
Jul 29 2012
parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On 29-Jul-12 23:20, Peter Alexander wrote:
 On Sunday, 29 July 2012 at 14:43:09 UTC, Dmitry Olshansky wrote:
 On 29-Jul-12 18:17, Andrei Alexandrescu wrote:
 On 7/29/12 8:17 AM, Gor Gyolchanyan wrote:
 std.variant is so incredibly slow! It's practically unusable for
 anything, which requires even a tiny bit of performance.
You do realize you actually benchmark against a function that does nothing, right? Clearly there are ways in which we can improve std.variant to the point initialization costs assignment of two words, but this benchmark doesn't help. (Incidentally I just prepared a class at C++ and Beyond on benchmarking, and this benchmark makes a lot of the mistakes described therein...) Andrei
This should be more relevant then:

//fib.d
import std.datetime, std.stdio, std.variant;

auto fib(Int)()
{
    Int a = 1, b = 1;
    for(size_t i=0; i<100; i++){
        Int c = a + b;
        a = b;
        b = c;
    }
    return a;
}

void main()
{
    writeln(benchmark!(fib!int, fib!long, fib!Variant)(10_000));
}

dmd -O -inline -release fib.d

Output:
[TickDuration(197), TickDuration(276), TickDuration(93370107)]

I'm horrified. Who was working on std.variant enhancements? Please chime in.
You underestimate DMD's optimisations :-)
Yeah, guilty as charged. Was in a hurry, yet I did test it without optimization before posting, and the results were not good either.
 For int and long, DMD does the whole loop at compile time, so again you
 are benchmarking against an empty function. It's easy to see that this
 is the case by changing the size of the loop and noting that the int and
 long versions take the same amount of time.

 Here's my results with optimisations turned *off*:

 int: 2,304,868
 long: 1,559,679
Guess you are on a 64-bit system? long takes somewhat longer for me.
 Variant: 2,667,320,252
Mine with -release -inline:

2307us
2577us
3170315us

3 orders of magnitude. Still too bad.
 Yes, it's not good to test without optimisations, but I think this gives
 a clearer picture of the relative differences between the two.

 Still not great at ~1000x difference, but much better than a million :-)
-- Dmitry Olshansky
Jul 29 2012
prev sibling parent reply Jonathan M Davis <jmdavisProg gmx.com> writes:
On Sunday, July 29, 2012 18:43:07 Dmitry Olshansky wrote:
 I'm horrified. Who was working on std.variant enhancements? Please chime in.
I believe that Robert Jaques fixed up std.variant as part of his work on fixing up std.json, and that he intended to get the std.variant changes reviewed and merged before doing the same with his std.json changes, but he's never asked for a review of either, and I don't know whether his code is available online. I think it's another one of those cases where someone did a fair bit of work but became too busy to finish the job. - Jonathan M Davis
Jul 29 2012
parent Mirko Pilger <pilger cymotec.de> writes:
https://jshare.johnshopkins.edu/rjacque2/public_html/variant.mht
https://jshare.johnshopkins.edu/rjacque2/public_html/variant.d
Jul 29 2012