digitalmars.D - std.variant benchmark
- Gor Gyolchanyan (33/33) Jul 29 2012 std.variant is so incredibly slow! It's practically unusable for anythin...
- Andrei Alexandrescu (8/10) Jul 29 2012 You do realize you actually benchmark against a function that does
- Gor Gyolchanyan (10/21) Jul 29 2012 I do compare it with nothing, just to see how many times does it exceed ...
- Tobias Pankrath (2/10) Jul 29 2012 Isn't that a paradox? What kind of type safety do you expect from
- Gor Gyolchanyan (6/15) Jul 29 2012 The kind, which doesn't forget about postblits and destructors. This isn...
- Dmitry Olshansky (24/34) Jul 29 2012 This should be more relevant then:
- Gor Gyolchanyan (5/46) Jul 29 2012 Thank you for demonstrating my point. :-)
- Andrei Alexandrescu (20/47) Jul 29 2012 I don't think he demonstrated your point (even leaving aside that the
- Gor Gyolchanyan (10/69) Jul 30 2012 You're right. I'm trying to make a subject-oriented system, where there ...
- jerro (32/73) Jul 29 2012 I thought these results are a bit strange, so I converted the
- Andrei Alexandrescu (3/6) Jul 29 2012 Very nice analysis!
- Andrei Alexandrescu (12/14) Jul 29 2012 I guess you just volunteered! When I looked at it this morning I noticed...
- Dmitry Olshansky (73/86) Jul 29 2012 Well obviously integers do not take lightly when they are copied around
- Dmitry Olshansky (15/31) Jul 29 2012 ...
- Marco Leise (5/19) Jul 29 2012 Very nice finding. Now it is only 100x slower. :D It seems well worth ch...
- jerro (9/44) Aug 02 2012 I profiled it and found out much of the time is spent inside
- Gor Gyolchanyan (6/54) Aug 03 2012 Wow! Now that's an impressive improvement. If TypeInfo instances are uni...
- jerro (4/22) Aug 03 2012 I've tried Dmitry's fork now (he optimized opArithmetic to not do
- Andrei Alexandrescu (39/82) Jul 29 2012 Yah, when I replaced memcpy with simple assignment for Variant I...
- Dmitry Olshansky (21/69) Jul 30 2012 I'd say array ops could and should do this (since compiler has all the
- Andrei Alexandrescu (10/27) Jul 30 2012 memcpy is implemented as an intrinsic on many platforms. I'm not sure
- Don Clugston (8/28) Jul 30 2012 It is an intrinsic on DMD, but it isn't done optimally. Mostly it just
- Peter Alexander (16/57) Jul 29 2012 You underestimate DMD's optimisations :-)
- Dmitry Olshansky (11/70) Jul 29 2012 Yeah, guilty as charged. Was in a hurry, yet I did test it without
- Jonathan M Davis (8/9) Jul 29 2012 I believe that Robert Jaques fixed up std.variant as part of his work on...
- Mirko Pilger (2/2) Jul 29 2012 https://jshare.johnshopkins.edu/rjacque2/public_html/variant.mht
std.variant is so incredibly slow! It's practically unusable for anything, which requires even a tiny bit of performance. Benchmark done under -noboundscheck -inline -release -O:

    import std.stdio;
    import std.variant;
    import std.datetime;

    void on()
    {
        auto var = Variant(5);
        int i = var.get!int;
    }

    void off()
    {
        auto var = 5;
        int i = var;
    }

    void main()
    {
        writeln(benchmark!(on, off)(100000));
    }

The result is:

    [TickDuration(25094), TickDuration(98)]

There are tons of cases, where a simple typeless data storage is necessary. No type information, no type checking - just a low-level storage, upon which Variant and other dynamic-type constructs can be built. I want to ask the community: what's the best way to store any variable in a typeless storage, so that one could store any variable in that storage and get a reference to that variable given its static type with no type checking and minimal overhead compared to a statically typed storage and with type-safe storage (postblits, garbage collection...)? -- Bye, Gor Gyolchanyan.
Jul 29 2012
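The question at the end of the post (typeless storage with a checked, reference-returning accessor) can be condensed into a few lines. This is an illustrative toy, not std.variant: all names are invented, and it ignores alignment, postblits, destructors and GC scanning, which a real implementation must handle through a per-type handler function.

```d
// A minimal "typeless box": raw bytes plus a TypeInfo tag, with a
// checked accessor that returns a typed reference (no copy on access).
// Sketch only: alignment, postblits, destructors and GC scanning are
// deliberately ignored here.
struct TypelessBox(size_t maxSize = 32)
{
    private ubyte[maxSize] store;
    private TypeInfo tag;

    void put(T)(T value) if (T.sizeof <= maxSize)
    {
        *cast(T*) store.ptr = value;
        tag = typeid(T);
    }

    // The type check is a single pointer-identity comparison.
    ref T get(T)()
    {
        assert(tag is typeid(T), "stored type mismatch");
        return *cast(T*) store.ptr;
    }
}

void main()
{
    TypelessBox!() box;
    box.put(5);
    assert(box.get!int == 5);
    box.get!int += 1;          // reference access, no copy
    assert(box.get!int == 6);
}
```

The per-access cost is one identity compare plus a pointer cast, which is the floor the thread is measuring Variant against.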
On 7/29/12 8:17 AM, Gor Gyolchanyan wrote:std.variant is so incredibly slow! It's practically unusable for anything, which requires even a tiny bit of performance.You do realize you actually benchmark against a function that does nothing, right? Clearly there are ways in which we can improve std.variant to the point initialization costs assignment of two words, but this benchmark doesn't help. (Incidentally I just prepared a class at C++ and Beyond on benchmarking, and this benchmark makes a lot of the mistakes described therein...) Andrei
Jul 29 2012
On Sun, Jul 29, 2012 at 6:17 PM, Andrei Alexandrescu < SeeWebsiteForEmail erdani.org> wrote:On 7/29/12 8:17 AM, Gor Gyolchanyan wrote:I do compare it with nothing, just to see how many times does it exceed the performance of static typed storage. The point is that Variant is extremely slow. All I want is to find out how to implement a very fast typeless storage with maximum performance and type safety. -- Bye, Gor Gyolchanyan.std.variant is so incredibly slow! It's practically unusable for anything, which requires even a tiny bit of performance.You do realize you actually benchmark against a function that does nothing, right? Clearly there are ways in which we can improve std.variant to the point initialization costs assignment of two words, but this benchmark doesn't help. (Incidentally I just prepared a class at C++ and Beyond on benchmarking, and this benchmark makes a lot of the mistakes described therein...) Andrei
Jul 29 2012
I do compare it with nothing, just to see how many times does it exceed the performance of static typed storage. The point is that Variant is extremely slow. All I want is to find out how to implement a very fast typeless storage with maximum performance and type safety.Isn't that a paradox? What kind of type safety do you expect from a typeless storage?
Jul 29 2012
On Sun, Jul 29, 2012 at 6:41 PM, Tobias Pankrath <tobias pankrath.net>wrote:I do compare it with nothing, just to see how many times does it exceed theThe kind, which doesn't forget about postblits and destructors. This isn't a paradox; -- Bye, Gor Gyolchanyan.performance of static typed storage. The point is that Variant is extremely slow. All I want is to find out how to implement a very fast typeless storage with maximum performance and type safety.Isn't that a paradox? What kind of type safety do you expect from a typeless storage?
Jul 29 2012
On 29-Jul-12 18:17, Andrei Alexandrescu wrote:On 7/29/12 8:17 AM, Gor Gyolchanyan wrote:This should be more relevant then: //fib.d import std.datetime, std.stdio, std.variant; auto fib(Int)() { Int a = 1, b = 1; for(size_t i=0; i<100; i++){ Int c = a + b; a = b; b = c; } return a; } void main() { writeln(benchmark!(fib!int, fib!long, fib!Variant)(10_000)); } dmd -O -inline -release fib.d Output: [TickDuration(197), TickDuration(276), TickDuration(93370107)] I'm horrified. Who was working on std.variant enhancements? Please chime in. -- Dmitry Olshanskystd.variant is so incredibly slow! It's practically unusable for anything, which requires even a tiny bit of performance.You do realize you actually benchmark against a function that does nothing, right? Clearly there are ways in which we can improve std.variant to the point initialization costs assignment of two words, but this benchmark doesn't help. (Incidentally I just prepared a class at C++ and Beyond on benchmarking, and this benchmark makes a lot of the mistakes described therein...) Andrei
Jul 29 2012
On Sun, Jul 29, 2012 at 6:43 PM, Dmitry Olshansky <dmitry.olsh gmail.com>wrote:On 29-Jul-12 18:17, Andrei Alexandrescu wrote:Thank you for demonstrating my point. :-) -- Bye, Gor Gyolchanyan.On 7/29/12 8:17 AM, Gor Gyolchanyan wrote:This should be more relevant then: //fib.d import std.datetime, std.stdio, std.variant; auto fib(Int)() { Int a = 1, b = 1; for(size_t i=0; i<100; i++){ Int c = a + b; a = b; b = c; } return a; } void main() { writeln(benchmark!(fib!int, fib!long, fib!Variant)(10_000)); } dmd -O -inline -release fib.d Output: [TickDuration(197), TickDuration(276), TickDuration(93370107)] I'm horrified. Who was working on std.variant enhancements? Please chime in. -- Dmitry Olshanskystd.variant is so incredibly slow! It's practically unusable for anything, which requires even a tiny bit of performance.You do realize you actually benchmark against a function that does nothing, right? Clearly there are ways in which we can improve std.variant to the point initialization costs assignment of two words, but this benchmark doesn't help. (Incidentally I just prepared a class at C++ and Beyond on benchmarking, and this benchmark makes a lot of the mistakes described therein...) Andrei
Jul 29 2012
On 7/29/12 11:20 AM, Gor Gyolchanyan wrote:On Sun, Jul 29, 2012 at 6:43 PM, Dmitry Olshansky <dmitry.olsh gmail.com <mailto:dmitry.olsh gmail.com>> wrote: This should be more relevant then: //fib.d import std.datetime, std.stdio, std.variant; auto fib(Int)() { Int a = 1, b = 1; for(size_t i=0; i<100; i++){ Int c = a + b; a = b; b = c; } return a; } void main() { writeln(benchmark!(fib!int, fib!long, fib!Variant)(10_000)); } dmd -O -inline -release fib.d Output: [TickDuration(197), TickDuration(276), TickDuration(93370107)] I'm horrified. Who was working on std.variant enhancements? Please chime in. -- Dmitry Olshansky Thank you for demonstrating my point. :-)I don't think he demonstrated your point (even leaving aside that the benchmark above is also flawed). I don't think there was a point, unless it was you wanted to vent and looked for a pretext - which is, by the way, entirely reasonable. You mentioned you need "a very fast typeless storage with maximum performance and type safety." Well if it's typeless then it means you're using it for storage, not for computationally-intensive operations, as the code above does. So what you'd presumably want to test is the speed of data storage and retrieval. A loop that does a bunch of adds on Variant does not go in that direction. Benchmarks are notoriously hard to get right. You need to think of reasonable baselines - if you want to use Variant for typeless storage, what is your vanilla implementation, the "obvious contender"? What are the primitives of which speed is important? Then you need to make sure you subtract noise from all your comparisons. Then you need to figure whether the issue is with the design or with the implementation of Variant. In the former case, maybe Variant isn't for you. In the latter, bug reports are always welcome. Andrei
Jul 29 2012
On Mon, Jul 30, 2012 at 6:07 AM, Andrei Alexandrescu < SeeWebsiteForEmail erdani.org> wrote:On 7/29/12 11:20 AM, Gor Gyolchanyan wrote:You're right. I'm trying to make a subject-oriented system, where there are nodes and accessors and each accessor has access to a specific part of the node. I must have the data stored in typeless manner and cast to the appropriate type (depending on the accessor). This is all necessary for my very flexible graph data structure. -- Bye, Gor Gyolchanyan.On Sun, Jul 29, 2012 at 6:43 PM, Dmitry Olshansky <dmitry.olsh gmail.com <mailto:dmitry.olsh gmail.com>**> wrote: This should be more relevant then: //fib.d import std.datetime, std.stdio, std.variant; auto fib(Int)() { Int a = 1, b = 1; for(size_t i=0; i<100; i++){ Int c = a + b; a = b; b = c; } return a; } void main() { writeln(benchmark!(fib!int, fib!long, fib!Variant)(10_000)); } dmd -O -inline -release fib.d Output: [TickDuration(197), TickDuration(276), TickDuration(93370107)] I'm horrified. Who was working on std.variant enhancements? Please chime in. -- Dmitry Olshansky Thank you for demonstrating my point. :-)I don't think he demonstrated your point (even leaving aside that the benchmark above is also flawed). I don't think there was a point, unless it was you wanted to vent and looked for a pretext - which is, by the way, entirely reasonable. You mentioned you need "a very fast typeless storage with maximum performance and type safety." Well if it's typeless then it means you're using it for storage, not for computationally-intensive operations, as the code above does. So what you'd presumably want to test is the speed of data storage and retrieval. A loop that does a bunch of adds on Variant does not go in that direction. Benchmarks are notoriously hard to get right. You need to think of reasonable baselines - if you want to use Variant for typeless storage, what is your vanilla implementation, the "obvious contender"? What are the primitives of which speed is important? 
Then you need to make sure you subtract noise from all your comparisons. Then you need to figure whether the issue is with the design or with the implementation of Variant. In the former case, maybe Variant isn't for you. In the latter, bug reports are always welcome. Andrei
Jul 30 2012
On Sunday, 29 July 2012 at 14:43:09 UTC, Dmitry Olshansky wrote:On 29-Jul-12 18:17, Andrei Alexandrescu wrote:I thought these results are a bit strange, so I converted the result to seconds. This gave me: [3.73e-06, 3.721e-06, 2.97281] One million inner loop iterations in under 4 microseconds? My processor's frequency isn't measured in THz, so something strange must be going on here. In order to find out what it was, I changed the code to this:

    writeln(benchmark!(fib!int, fib!long)(1000_000_000)[]
        .map!"a.nsecs() * 1.0e-9");

and used a profiler on it. The relevant part of the output is:

    0.00 : 445969: test %r12d,%r12d
    0.00 : 44596c: je 445975 <_D3std8date
   46.67 : 44596e: inc %ebx
    0.00 : 445970: cmp %r12d,%ebx
    0.00 : 445973: jb 44596e <_D3std8date
    0.00 : 445975: lea -0x18(%rbp),%rdi
    0.00 : 445979: callq 45a048 <_D3std8date
    0.00 : 44597e: mov %rax,0x0(%r13)
    0.00 : 445982: lea -0x18(%rbp),%rdi
    0.00 : 445986: callq 459fb4 <_D3std8date
    0.00 : 44598b: xor %ebx,%ebx
    0.00 : 44598d: test %r12d,%r12d
    0.00 : 445990: je 445999 <_D3std8date
   53.33 : 445992: inc %ebx
    0.00 : 445994: cmp %r12d,%ebx
    0.00 : 445997: jb 445992 <_D3std8date

As you can see, most of the time is spent in two loops with empty body, so your code is benchmarking Variant against nothing, too. Adding asm{ nop; } to fib changes the output to this: [0.00437154, 0.00444938, 3.03917] Which is still a huge difference.On 7/29/12 8:17 AM, Gor Gyolchanyan wrote:This should be more relevant then: //fib.d import std.datetime, std.stdio, std.variant; auto fib(Int)() { Int a = 1, b = 1; for(size_t i=0; i<100; i++){ Int c = a + b; a = b; b = c; } return a; } void main() { writeln(benchmark!(fib!int, fib!long, fib!Variant)(10_000)); } dmd -O -inline -release fib.d Output: [TickDuration(197), TickDuration(276), TickDuration(93370107)] I'm horrified. Who was working on std.variant enhancements? Please chime in.std.variant is so incredibly slow! 
It's practically unusable for anything, which requires even a tiny bit of performance.You do realize you actually benchmark against a function that does nothing, right? Clearly there are ways in which we can improve std.variant to the point initialization costs assignment of two words, but this benchmark doesn't help. (Incidentally I just prepared a class at C++ and Beyond on benchmarking, and this benchmark makes a lot of the mistakes described therein...) Andrei
Jul 29 2012
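The empty-loop effect jerro diagnosed can also be avoided without asm { nop; }: store each result somewhere the optimizer must treat as observed. A sketch of that variant of the benchmark (wrapper names are mine):

```d
import std.datetime : benchmark;
import std.stdio : writeln;

// Global sink: the compiler must assume it is observed elsewhere,
// so the additions feeding it cannot be deleted.
long sink;

auto fib(Int)()
{
    Int a = 1, b = 1;
    foreach (i; 0 .. 100)
    {
        // overflows and wraps, as in the original -- fine for timing
        Int c = a + b;
        a = b;
        b = c;
    }
    return a;
}

// Wrappers that make each result live, so the optimizer cannot
// reduce the benchmarked function to an empty loop.
void fibInt()  { sink += fib!int(); }
void fibLong() { sink += fib!long(); }

void main()
{
    writeln(benchmark!(fibInt, fibLong)(10_000));
    writeln(sink);   // printing sink keeps the stores from being elided too
}
```

With the work kept alive, the int and long timings become nonzero and the comparison against fib!Variant becomes meaningful.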
On 7/29/12 2:57 PM, jerro wrote:I thought this results are a bit strange, so I converted the result to seconds. This gave me: [3.73e-06, 3.721e-06, 2.97281]Very nice analysis! Andrei
Jul 29 2012
On 7/29/12 10:43 AM, Dmitry Olshansky wrote:I'm horrified. Who was working on std.variant enhancements? Please chime in.I guess you just volunteered! When I looked at it this morning I noticed a few signs of bit rot, e.g. opAssign returns by value and such. (Only planting a "ref" there improves performance a good amount.) Variant has a simple design with (in case of int) an int and a pointer to a function. Many of its operations incur an indirect call through that pointer. This makes operations slower than the time-honored design of using an integral tag and switching on it, but offers in return the ability to hold any type without needing to enumerate all types explicitly. We can use the pointer to function as a tag for improving performance on primitive types. Andrei
Jul 29 2012
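Andrei's two points here (a function-pointer discriminator, and reusing that pointer as a tag for fast paths on primitive types) can be condensed into a toy. All names below are invented for illustration; this is not the real std.variant layout.

```d
// Sketch of "pointer to a template function as a discriminator":
// handler!T is a distinct instantiation per type, so comparing the
// stored pointer against handler!int is a cheap type test, and the
// indirect call remains only as the generic fallback.
alias Handler = long function(void* store);

long handler(T)(void* store)
{
    // a real handler would switch on an opcode (copy, destroy, compare...)
    return cast(long) *cast(T*) store;
}

struct MiniVariant
{
    union { int i; long l; ubyte[8] raw; }
    Handler fptr;

    void opAssign(T)(T v) if (T.sizeof <= 8)
    {
        *cast(T*) raw.ptr = v;
        fptr = &handler!T;     // the instantiation doubles as a type tag
    }

    long asLong()
    {
        if (fptr is &handler!int)  return i;   // fast path: tag compare
        if (fptr is &handler!long) return l;
        return fptr(raw.ptr);                  // generic indirect call
    }
}

void main()
{
    MiniVariant v;
    v = 5;
    assert(v.asLong == 5);
    v = 123_456_789_012L;
    assert(v.asLong == 123_456_789_012L);
}
```

No type ever has to be enumerated or registered: any T that fits gets its tag automatically from the template instantiation, which is the design goal Andrei describes.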
On 29-Jul-12 22:59, Andrei Alexandrescu wrote:On 7/29/12 10:43 AM, Dmitry Olshansky wrote:Strange, but numbers stay the same for me.I'm horrified. Who was working on std.variant enhancements? Please chime in.I guess you just volunteered! When I looked at it this morning I noticed a few signs of bit rot, e.g. opAssign returns by value and such. (Only planting a "ref" there improves performance a good amount.)Variant has a simple design with (in case of int) an int and a pointer to a function. Many of its operations incur an indirect call through that pointer. This makes operations slower than the time-honored design of using an integral tag and switching on it, but offers in return the ability to hold any type without needing to enumerate all types explicitly.Well obviously integers do not take lightly when they are copied around with memcpy like some pompous arrays that alone incurs *some* penalty. In fact memcpy could and should be replaced with word by word copy for almost all of struct sizes up to ~32 bytes (as size is known in advance for this particular function pointer i.e. handler!int). I've found something far more evil: property bool convertsTo(T)() const { TypeInfo info = typeid(T); return fptr(OpID.testConversion, null, &info) == 0; } Okay... 
now let me pull off another piece of rag: private VariantN opArithmetic(T, string op)(T other) { VariantN result; static if (is(T == VariantN)) { if (convertsTo!(uint) && other.convertsTo!(uint)) result = mixin("get!(uint) " ~ op ~ " other.get!(uint)"); else if (convertsTo!(int) && other.convertsTo!(int)) result = mixin("get!(int) " ~ op ~ " other.get!(int)"); else if (convertsTo!(ulong) && other.convertsTo!(ulong)) result = mixin("get!(ulong) " ~ op ~ " other.get!(ulong)"); else if (convertsTo!(long) && other.convertsTo!(long)) result = mixin("get!(long) " ~ op ~ " other.get!(long)"); else if (convertsTo!(double) && other.convertsTo!(double)) result = mixin("get!(double) " ~ op ~ " other.get!(double)"); else result = mixin("get!(real) " ~ op ~ " other.get!(real)"); } else { if (is(typeof(T.max) : uint) && T.min == 0 && convertsTo!(uint)) result = mixin("get!(uint) " ~ op ~ " other"); else if (is(typeof(T.max) : int) && T.min < 0 && convertsTo!(int)) result = mixin("get!(int) " ~ op ~ " other"); else if (is(typeof(T.max) : ulong) && T.min == 0 && convertsTo!(ulong)) result = mixin("get!(ulong) " ~ op ~ " other"); else if (is(typeof(T.max) : long) && T.min < 0 && convertsTo!(long)) result = mixin("get!(long) " ~ op ~ " other"); else if (is(T : double) && convertsTo!(double)) result = mixin("get!(double) " ~ op ~ " other"); else result = mixin("get!(real) " ~ op ~ " other"); } return result; } So to boot in order to get to int + int _detected_ it needs 4 calls to fptr(OpID.testConversion, null, &info) == 0 and acquire TypeInfo each time. Not counting 2 gets to obtain result. It's just madness. Plain hardcoded Type to integer switch seems so much better now.We can use the pointer to function as a tag for improving performance on primitive types.Indeed we can though it won't be as fast as simple tag - pointers have quite big and sparse addresses giving us the same as: if(ptr == handler!int) ... else if(ptr == handler!uint) ... else ... 
Another way to go is make 2-d function table for all built-ins and all operations on them. It's in fact a bunch of tables: Type1 x Type2 for all of important basic operations. Tables are static so it won't be much of a problem. At least int + int then would be 1 memory read, 1 call, 2 (casted as needed) reads & 1 store. -- Dmitry Olshansky
Jul 29 2012
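Dmitry's 2-d function table, reduced to a two-type toy: one function per (left type, right type) pair, so int + int costs one table load and one call instead of the chain of convertsTo probes. Types and names are illustrative only.

```d
// Tag-indexed 2-d dispatch table for one binary operation.
enum Tag : size_t { i32, f64 }

struct Val
{
    Tag tag;
    union { int i; double d; }

    static Val of(int v)    { Val r; r.tag = Tag.i32; r.i = v; return r; }
    static Val of(double v) { Val r; r.tag = Tag.f64; r.d = v; return r; }
}

alias AddFn = Val function(Val, Val);

Val addII(Val a, Val b) { return Val.of(a.i + b.i); }
Val addID(Val a, Val b) { return Val.of(a.i + b.d); }
Val addDI(Val a, Val b) { return Val.of(a.d + b.i); }
Val addDD(Val a, Val b) { return Val.of(a.d + b.d); }

// Static table, built once, indexed by [left.tag][right.tag].
immutable AddFn[2][2] addTable =
    [[&addII, &addID],
     [&addDI, &addDD]];

Val add(Val a, Val b) { return addTable[a.tag][b.tag](a, b); }

void main()
{
    assert(add(Val.of(2), Val.of(3)).i == 5);
    assert(add(Val.of(2), Val.of(0.5)).d == 2.5);
}
```

For the full set of built-in types most entries would raise an error and could share one function, which is the merging Dmitry suggests later in the thread.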
On 30-Jul-12 00:11, Dmitry Olshansky wrote:I've found something far more evil: property bool convertsTo(T)() const { TypeInfo info = typeid(T); return fptr(OpID.testConversion, null, &info) == 0; } Okay... now let me pull off another piece of rag: private VariantN opArithmetic(T, string op)(T other) { VariantN result; static if (is(T == VariantN)) { /*if (convertsTo!(uint) && other.convertsTo!(uint)) result = mixin("get!(uint) " ~ op ~ " other.get!(uint)"); else*/ if (convertsTo!(int) && other.convertsTo!(int)) result = mixin("get!(int) " ~ op ~ " other.get!(int)");... Apparently I'm spot on. Just commenting one extra branch of this horror movie gives interesting change: 2779us 2667us 3153762us After: 2319us 2523us 288581us Aye, 10x :o) -- Dmitry Olshansky
Jul 29 2012
Am Mon, 30 Jul 2012 00:24:40 +0400 schrieb Dmitry Olshansky <dmitry.olsh gmail.com>:Just commenting one extra branch of this horror movie gives interesting change: 2779us 2667us 3153762us After: 2319us 2523us 288581us Aye, 10x :o)Very nice finding. Now it is only 100x slower. :D It seems well worth changing the style Variant is written in. -- Marco
Jul 29 2012
On Sunday, 29 July 2012 at 20:24:42 UTC, Dmitry Olshansky wrote:On 30-Jul-12 00:11, Dmitry Olshansky wrote:I profiled it and found out much of the time is spent inside TypeInfo.opEquals being called from tryPutting. So I tried replacing "!= typeid" in tryPutting with "!is typeid". That brought the time from 2.8 s down to 0.12 on my machine. I don't know if that is the proper solution, since I don't know if typeid can ever return two TypeInfo objects that aren't the same but are equal (I haven't used typeid and TypeInfo much before). The fib function here does return correct result after doing that, though.I've found something far more evil: property bool convertsTo(T)() const { TypeInfo info = typeid(T); return fptr(OpID.testConversion, null, &info) == 0; } Okay... now let me pull off another piece of rag: private VariantN opArithmetic(T, string op)(T other) { VariantN result; static if (is(T == VariantN)) { /*if (convertsTo!(uint) && other.convertsTo!(uint)) result = mixin("get!(uint) " ~ op ~ " other.get!(uint)"); else*/ if (convertsTo!(int) && other.convertsTo!(int)) result = mixin("get!(int) " ~ op ~ " other.get!(int)");... Apparently I'm spot on. Just commenting one extra branch of this horror movie gives interesting change: 2779us 2667us 3153762us After: 2319us 2523us 288581us Aye, 10x :o)
Aug 02 2012
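What jerro's change relies on: typeid yields one TypeInfo object per type within a binary, so identity (is) is a single pointer compare, while == goes through TypeInfo.opEquals. A cautious middle ground, sketched below, checks identity first and falls back to equality, which stays correct even if duplicate TypeInfo objects ever arrive from a shared library.

```d
// Fast TypeInfo comparison: identity first, opEquals as fallback.
bool sameType(TypeInfo a, TypeInfo b)
{
    if (a is b) return true;   // fast path: one pointer compare
    return a == b;             // slow path: full opEquals
}

void main()
{
    assert(sameType(typeid(int), typeid(int)));
    assert(!sameType(typeid(int), typeid(long)));
    int x;
    assert(typeid(x) is typeid(int));   // identity holds within one binary
}
```

On the common path this costs the same as jerro's !is version while keeping the behavior of == for the exotic cases.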
On Fri, Aug 3, 2012 at 8:28 AM, jerro <a a.com> wrote:On Sunday, 29 July 2012 at 20:24:42 UTC, Dmitry Olshansky wrote:Wow! Now that's an impressive improvement. If TypeInfo instances are unique for every type, then we should be good to go! -- Bye, Gor Gyolchanyan.On 30-Jul-12 00:11, Dmitry Olshansky wrote:I profiled it and found out much of the time is spent inside TypeInfo.opEquals being called from tryPutting. So I tried replacing "!= typeid" in tryPutting with "!is typeid". That brought the time from 2.8 s down to 0.12 on my machine. I don't know if that is the proper solution, since I don't know if typeid can ever return two TypeInfo objects that aren't the same but are equal (I haven't used typeid and TypeInfo much before). The fib function here does return correct result after doing that, though.I've found something far more evil: property bool convertsTo(T)() const { TypeInfo info = typeid(T); return fptr(OpID.testConversion, null, &info) == 0; } Okay... now let me pull off another piece of rag: private VariantN opArithmetic(T, string op)(T other) { VariantN result; static if (is(T == VariantN)) { /*if (convertsTo!(uint) && other.convertsTo!(uint)) result = mixin("get!(uint) " ~ op ~ " other.get!(uint)"); else*/ if (convertsTo!(int) && other.convertsTo!(int)) result = mixin("get!(int) " ~ op ~ " other.get!(int)");... Apparently I'm spot on. Just commenting one extra branch of this horror movie gives interesting change: 2779us 2667us 3153762us After: 2319us 2523us 288581us Aye, 10x :o)
Aug 03 2012
I've tried Dmitry's fork now (he optimized opArithmetic to not do as many calls to fptr) and it's even a little faster (1.01 s for 100000 iterations). Replacing != with !is in Dmitry's fork speeds it up a little bit more (0.935 s for 100000 iterations).I profiled it and found out much of the time is spent inside TypeInfo.opEquals being called from tryPutting. So I tried replacing "!= typeid" in tryPutting with "!is typeid". That brought the time from 2.8 s down to 0.12 on my machine. I don't know if that is the proper solution, since I don't know if typeid can ever return two TypeInfo objects that aren't the same but are equal (I haven't used typeid and TypeInfo much before). The fib function here does return correct result after doing that, though.Wow! Now that's an impressive improvement. If TypeInfo instances are unique for every type, then we should be good to go!
Aug 03 2012
On 7/29/12 4:11 PM, Dmitry Olshansky wrote:On 29-Jul-12 22:59, Andrei Alexandrescu wrote:Depends on what you measure.On 7/29/12 10:43 AM, Dmitry Olshansky wrote:Strange, but numbers stay the same for me.I'm horrified. Who was working on std.variant enhancements? Please chime in.I guess you just volunteered! When I looked at it this morning I noticed a few signs of bit rot, e.g. opAssign returns by value and such. (Only planting a "ref" there improves performance a good amount.)Yah, when I replaced memcpy with simple assignment for Variant<int> I saw some improvements.Variant has a simple design with (in case of int) an int and a pointer to a function. Many of its operations incur an indirect call through that pointer. This makes operations slower than the time-honored design of using an integral tag and switching on it, but offers in return the ability to hold any type without needing to enumerate all types explicitly.Well obviously integers do not take lightly when they are copied around with memcpy like some pompous arrays that alone incurs *some* penalty.In fact memcpy could and should be replaced with word by word copy for almost all of struct sizes up to ~32 bytes (as size is known in advance for this particular function pointer i.e. handler!int).In fact memcpy should be smart enough to do all that, but apparently it doesn't.I've found something far more evil:[snip]So to boot in order to get to int + int _detected_ it needs 4 calls to fptr(OpID.testConversion, null, &info) == 0 and acquire TypeInfo each time. Not counting 2 gets to obtain result. It's just madness.Heh, this all gives me a new, humbling perspective on my earlier self running my mouth about other designs :o). As always, history and context makes one much more empathic. Variant has a very old design and implementation. It is, in fact, the very first design I've ever carried in D; I'd decided, if I can't do that, D isn't powerful enough. 
The compiler was so buggy back then, implementation needed to take many contortions, and it was a minor miracle the whole house of cards stood together. The approach was to design for usability first, performance could come later. So design only had to not prevent good performance from being improved transparently later, which I think it has achieved. The basic approach was to use a pointer to function (instead of an integral) as a discriminator. The pointer to function was taken from the instantiation of a template function. That means Variant could work with any type, even types not yet defined. I'd made it a clear goal that I must not enumerate all types in one place, or "register" types for use with Variant - that would have been failure. Using a pointer to template function naturally associated a distinct tag with each type, automatically. I think it was a good idea then, and it's a good idea now. On this data layout functionality can be built in two ways: (a) Implement it inside the dispatcher function template, or (b) Test against it "outside", in the template Variant code. The first approach is more general, the second is faster. Right now Variant uses (a) almost exclusively. It's time to start doing (b) for more types.Plain hardcoded Type to integer switch seems so much better now.But we can do that now.Yes, and what we gain is speed for select known types. As long as the number of those is relatively small, using simple test and branch should serve us well. And we get to keep the generality, too.We can use the pointer to function as a tag for improving performance on primitive types.Indeed we can though it won't be as fast as simple tag - pointers have quite big and sparse addresses giving us the same as: if(ptr == handler!int) .... else if(ptr == handler!uint) .... else ....Another way to go is make 2-d function table for all built-ins and all operations on them. It's in fact a bunch of tables: Type1 x Type2 for all of important basic operations. 
Tables are static so it won't be much of a problem. At least int + int then would be 1 memory read, 1 call, 2 (casted as needed) reads & 1 store.I think that's not worth it, but hey, worth a try. Thanks, Andrei
Jul 29 2012
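The visitation alternative Andrei mentions at the end (two indirect calls, no search) is classic double dispatch. A minimal sketch with two invented value types: the first virtual call fixes the right operand's type, the second fixes the left's, and no table or if-chain is consulted.

```d
// Double dispatch for a binary operation: two virtual calls total.
interface Num
{
    Num addTo(Num rhs);          // first indirect call
    Num addToInt(IntVal lhs);    // second call: both types now known
    Num addToDbl(DblVal lhs);
}

class IntVal : Num
{
    int v;
    this(int v) { this.v = v; }
    Num addTo(Num rhs)       { return rhs.addToInt(this); }
    Num addToInt(IntVal lhs) { return new IntVal(lhs.v + v); }
    Num addToDbl(DblVal lhs) { return new DblVal(lhs.v + v); }
}

class DblVal : Num
{
    double v;
    this(double v) { this.v = v; }
    Num addTo(Num rhs)       { return rhs.addToDbl(this); }
    Num addToInt(IntVal lhs) { return new DblVal(lhs.v + v); }
    Num addToDbl(DblVal lhs) { return new DblVal(lhs.v + v); }
}

void main()
{
    Num a = new IntVal(2);
    Num b = new DblVal(0.5);
    auto r = cast(DblVal) a.addTo(b);
    assert(r !is null && r.v == 2.5);
}
```

The trade-off is the one "Modern C++ Design" discusses: every type must list every other type in its interface, giving up exactly the open-ended extensibility that Variant's handler design preserves.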
On 30-Jul-12 06:01, Andrei Alexandrescu wrote:I'd say array ops could and should do this (since compiler has all the info at compile-time). On the other hand memcpy is just one tired C function aimed at fast blitting of any memory chunks. (Even just call/ret pair is too much of overhead in case of int).In fact memcpy could and should be replaced with word by word copy for almost all of struct sizes up to ~32 bytes (as size is known in advance for this particular function pointer i.e. handler!int).In fact memcpy should be smart enough to do all that, but apparently it doesn't.These were indeed very rough days. I think it's a minor miracle I hadn't given up on D after seeing tons of bugs back in 2010.I've found something far more evil:[snip]So to boot in order to get to int + int _detected_ it needs 4 calls to fptr(OpID.testConversion, null, &info) == 0 and acquire TypeInfo each time. Not counting 2 gets to obtain result. It's just madness.Heh, this all gives me a new, humbling perspective on my earlier self running my mouth about other designs :o). As always, history and context makes one much more empathic. Variant has a very old design and implementation. It is, in fact, the very first design I've ever carried in D; I'd decided, if I can't do that, D isn't powerful enough. The compiler was so buggy back then, implementation needed to take many contortions, and it was a minor miracle the whole house of cards stood together.The basic approach was to use a pointer to function (instead of an integral) as a discriminator. The pointer to function was taken from the instantiation of a template function. That means Variant could work with any type, even types not yet defined. I'd made it a clear goal that I must not enumerate all types in one place, or "register" types for use with Variant - that would have been failure. Using a pointer to template function naturally associated a distinct tag with each type, automatically. 
I think it was a good idea then, and it's a good idea now.It's indeed a nice trick and the design sounds great (no register function, yay!). The only piece missing is that e.g. arithmetic operations are binary thus requiring double-dispatch or something like arithmeticHandler!(T1,T2) to achieve optimal performance. The virtual handcrafted C competition would do it via nested switches. [snip]OK I'll try it for opArithmetic that should help a lot.Yes, and what we gain is speed for select known types. As long as the number of those is relatively small, using simple test and branch should serve us well. And we get to keep the generality, too.We can use the pointer to function as a tag for improving performance on primitive types.Indeed we can though it won't be as fast as simple tag - pointers have quite big and sparse addresses giving us the same as: if(ptr == handler!int) .... else if(ptr == handler!uint) .... else ....I'm thinking more of it and common types are just uint, int, long, ulong, double, string. In fact most of e.g. int x string and such are "throw me an exception" entries, they could be merged. Then we could make 2-stage table to save some space, it sounds like trie, hm....Another way to go is make 2-d function table for all built-ins and all operations on them. It's in fact a bunch of tables: Type1 x Type2 for all of important basic operations. Tables are static so it won't be much of a problem. At least int + int then would be 1 memory read, 1 call, 2 (casted as needed) reads & 1 store.I think that's not worth it, but hey, worth a try.
Jul 30 2012
On 7/30/12 4:34 AM, Dmitry Olshansky wrote:
> On 30-Jul-12 06:01, Andrei Alexandrescu wrote:
>>> In fact memcpy could and should be replaced with word-by-word copy for
>>> almost all struct sizes up to ~32 bytes (as the size is known in
>>> advance for this particular function pointer, i.e. handler!int).
>>
>> In fact memcpy should be smart enough to do all that, but apparently it
>> doesn't.
>
> I'd say array ops could and should do this (since the compiler has all
> the info at compile time). On the other hand, memcpy is just one tired C
> function aimed at fast blitting of arbitrary memory chunks. (Even just
> the call/ret pair is too much overhead in the case of an int.)

memcpy is implemented as an intrinsic on many platforms. I'm not sure
whether it is on dmd, but it is on dmc
(http://www.digitalmars.com/ctg/sc.html), icc, and gcc
(http://software.intel.com/en-us/articles/memcpy-performance/). But then
clearly using simple word assignments wherever possible makes for a more
robust performance profile.

> I'm thinking more of it, and the common types are just uint, int, long,
> ulong, double, string. In fact most of e.g. int x string and such are
> "throw me an exception" entries; they could be merged. Then we could
> make a 2-stage table to save some space. It sounds like a trie, hm....

You can implement it via visitation, too (two indirect calls, no search).
"Modern C++ Design" discusses possible approaches at length.

Andrei
Jul 30 2012
On 30/07/12 13:24, Andrei Alexandrescu wrote:
> memcpy is implemented as an intrinsic on many platforms. I'm not sure
> whether it is on dmd, but it is on dmc
> (http://www.digitalmars.com/ctg/sc.html), icc, and gcc
> (http://software.intel.com/en-us/articles/memcpy-performance/). But then
> clearly using simple word assignments wherever possible makes for a more
> robust performance profile.

It is an intrinsic on DMD, but it isn't done optimally. Mostly it just
compiles to a couple of loads plus the single instruction rep movsd; /
rep movsq; which is perfect for medium-sized lengths when everything is
aligned, but once it is longer than a few hundred bytes it should be done
as a function call (the optimal method involves cache blocking). Also,
for very short lengths it should be done as a couple of loads.
Jul 30 2012
On Sunday, 29 July 2012 at 14:43:09 UTC, Dmitry Olshansky wrote:
> On 29-Jul-12 18:17, Andrei Alexandrescu wrote:
>> On 7/29/12 8:17 AM, Gor Gyolchanyan wrote:
>>> std.variant is so incredibly slow! It's practically unusable for
>>> anything which requires even a tiny bit of performance.
>>
>> You do realize you actually benchmark against a function that does
>> nothing, right? Clearly there are ways in which we can improve
>> std.variant to the point initialization costs assignment of two words,
>> but this benchmark doesn't help. (Incidentally I just prepared a class
>> at C++ and Beyond on benchmarking, and this benchmark makes a lot of
>> the mistakes described therein...)
>>
>> Andrei
>
> This should be more relevant, then:
>
> // fib.d
> import std.datetime, std.stdio, std.variant;
>
> auto fib(Int)()
> {
>     Int a = 1, b = 1;
>     for(size_t i = 0; i < 100; i++)
>     {
>         Int c = a + b;
>         a = b;
>         b = c;
>     }
>     return a;
> }
>
> void main()
> {
>     writeln(benchmark!(fib!int, fib!long, fib!Variant)(10_000));
> }
>
> dmd -O -inline -release fib.d
>
> Output:
> [TickDuration(197), TickDuration(276), TickDuration(93370107)]
>
> I'm horrified. Who was working on std.variant enhancements? Please
> chime in.

You underestimate DMD's optimisations :-)

For int and long, DMD does the whole loop at compile time, so again you
are benchmarking against an empty function. It's easy to see that this is
the case by changing the size of the loop and noting that the int and
long versions take the same amount of time.

Here's my results with optimisations turned *off*:

int:     2,304,868
long:    1,559,679
Variant: 2,667,320,252

Yes, it's not good to test without optimisations, but I think this gives
a clearer picture of the relative differences between the two. Still not
great at ~1000x difference, but much better than a million :-)
Jul 29 2012
On 29-Jul-12 23:20, Peter Alexander wrote:
> You underestimate DMD's optimisations :-)
>
> For int and long, DMD does the whole loop at compile time, so again you
> are benchmarking against an empty function.

Yeah, guilty as charged. Was in a hurry, yet I did test it without
optimization before posting, and the results were not good either.

> Here's my results with optimisations turned *off*:
>
> int:  2,304,868
> long: 1,559,679

Guess you are on a 64-bit system or what? long takes somewhat longer for
me.

> Variant: 2,667,320,252

Mine with -release -inline:

2307us 2577us 3170315us

3 orders of magnitude. Still too bad.

> Yes, it's not good to test without optimisations, but I think this
> gives a clearer picture of the relative differences between the two.
> Still not great at ~1000x difference, but much better than a million :-)

--
Dmitry Olshansky
Jul 29 2012
On Sunday, July 29, 2012 18:43:07 Dmitry Olshansky wrote:
> I'm horrified. Who was working on std.variant enhancements? Please
> chime in.

I believe that Robert Jaques fixed up std.variant as part of his work on
fixing up std.json, and that he intended to get the std.variant changes
reviewed and merged in before doing the same with his std.json changes,
but he's never asked for a review of either, and I don't know whether his
code is available online. I think that it's another one of those cases
where someone did a fair bit of work but became too busy to finish the
job.

- Jonathan M Davis
Jul 29 2012