digitalmars.D.ldc - LLVM and TLS

reply "Jonathan Marler" <johnnymarler gmail.com> writes:
I've noticed that on my Windows 7 development machine, switching 
between TLS and non-TLS storage has a minimal impact on 
performance (when using DMD).  I haven't tried LDC yet; however, 
on a MacBook Pro, which uses clang (LLVM) for the linker, using 
TLS has a huge performance impact (much, much slower).  Does 
anyone know if this is because of the way LLVM handles TLS 
storage?  I'll have to try using LDC on my Windows machine, but 
maybe one of you knows offhand whether or not LLVM has some 
performance problems with TLS storage. Thanks!
Feb 16 2015
next sibling parent reply Dan Olson <zans.is.for.cans yahoo.com> writes:
"Jonathan Marler" <johnnymarler gmail.com> writes:

 I've noticed that on my windows 7 development machine, switching
 between TLS and non-TLS storage has a minimal impact on performance
 (when using DMD).  I haven't tried LDC yet, however, on a macbook pro,
 which uses clang (LLVM) for the linker, using TLS has a huge
 performance impact (much much slower).  Does anyone know if this is
 because of the way LLVM handles TLS storage?  I'll have to try using
 LDC on my windows machine but maybe one of you know off hand whether
 or not LLVM has some performance problems with TLS storage. Thanks!
Last time I checked, DMD still did not use OS X native TLS support, but has its own solution. Try LDC and see if the performance improves because LDC uses OS X native TLS. -- Dan
Feb 16 2015
parent reply "Martin Nowak" <code dawg.eu> writes:
On Tuesday, 17 February 2015 at 06:16:04 UTC, Dan Olson wrote:
 Try LDC and see if the performance improves because LDC uses OS 
 X native TLS.
Is there more information available about OS X's TLS support and how this is implemented in LDC? What version of OS X is required? I'd very much like to use that for DMD/druntime too, so that we can go on with the shared library support.
Feb 16 2015
parent reply Jacob Carlborg <doob me.com> writes:
On 2015-02-17 07:47, Martin Nowak wrote:

 Is there more information available about OS X's TLS support and how this
 is implemented in LDC? What version of OS X is required? I'd very much
 like to use that for DMD/druntime too, so that we can go on with the
 shared library support.
I've created an issue for this; there is some information about the implementation in the issue [1]. OS X 10.7 or later is required, but I'm pretty sure we can backport it to 10.6 if we really want/need to.

[1] https://issues.dlang.org/show_bug.cgi?id=9476#c2

-- /Jacob Carlborg
Feb 17 2015
parent reply "Jonathan Marler" <johnnymarler gmail.com> writes:
I've created a simple program to demonstrate the issue.  The 
performance cost of TLS vs __gshared is over one and a half 
orders of magnitude!

import std.stdio;
import std.datetime;

size_t tlsGlobal;
__gshared size_t sharedGlobal;

void main(string[] args)
{
   runTest(3, 10_000_000);
}

void runTest(size_t runCount, size_t loopCount)
{
   writeln("--------------------------------------------------");
   StopWatch sw;
   for(auto runIndex = 0; runIndex < runCount; runIndex++) {

     writefln("run %s (loopcount %s)", runIndex + 1, loopCount);

     sw.reset();
     sw.start();
     for(size_t i = 0; i < loopCount; i++) {
       tlsGlobal = i;
     }
     sw.stop();
     writefln("  TLS   : %s milliseconds", sw.peek.msecs);

     sw.reset();
     sw.start();
     for(size_t i = 0; i < loopCount; i++) {
       sharedGlobal = i;
     }
     sw.stop();
     writefln("  Shared: %s milliseconds", sw.peek.msecs);
   }
}

--------------------------------------------------
Output:
--------------------------------------------------
run 1 (loopcount 10000000)
   TLS   : 104 milliseconds
   Shared: 3 milliseconds
run 2 (loopcount 10000000)
   TLS   : 97 milliseconds
   Shared: 4 milliseconds
run 3 (loopcount 10000000)
   TLS   : 99 milliseconds
   Shared: 3 milliseconds
Feb 17 2015
next sibling parent Jacob Carlborg <doob me.com> writes:
On 2015-02-18 02:41, Jonathan Marler wrote:
 I've created a simple program to demonstrate the issue.  The performance
 cost of TLS vs __gshared is over one and a half orders of magnitude!
It would be nice to have a comparison in C as well, which does use the native TLS implementation. -- /Jacob Carlborg
Feb 17 2015
prev sibling parent reply Dan Olson <zans.is.for.cans yahoo.com> writes:
"Jonathan Marler" <johnnymarler gmail.com> writes:

 I've created a simple program to demonstrate the issue.  The
 performance cost of TLS vs __gshared is over one and a half orders of
 magnitude!
--snip--

I ran on my MacBook to compare DMD and LDC 2.066.1 versions. With LDC, I had to put an empty asm instruction in the for loops, otherwise the optimizer removed all but the last write and the timing looked really good (0 milliseconds)!

LDC __gshared versus TLS time is a bit better than DMD.

$ dmd -O timetls.d
$ ./timetls
--------------------------------------------------
run 1 (loopcount 10000000)
  TLS   : 93 milliseconds
  Shared: 6 milliseconds
run 2 (loopcount 10000000)
  TLS   : 91 milliseconds
  Shared: 6 milliseconds
run 3 (loopcount 10000000)
  TLS   : 92 milliseconds
  Shared: 4 milliseconds

$ ldmd2 -O3 timetls.d
$ ./timetls
--------------------------------------------------
run 1 (loopcount 10000000)
  TLS   : 21 milliseconds
  Shared: 3 milliseconds
run 2 (loopcount 10000000)
  TLS   : 22 milliseconds
  Shared: 5 milliseconds
run 3 (loopcount 10000000)
  TLS   : 20 milliseconds
  Shared: 3 milliseconds
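
The dead-store elimination described above can also be avoided without inline asm. The following is only a sketch, not the code that produced the numbers in this post: it assumes a druntime recent enough to ship core.volatile and std.datetime.stopwatch, and the helper names timeTls and timeShared are made up for illustration. Each store goes through volatileStore, which the compiler is not allowed to drop or merge, so the loops survive optimization.

```d
import core.volatile : volatileStore;
import std.datetime.stopwatch : AutoStart, StopWatch;
import std.stdio : writefln;

size_t tlsGlobal;               // module-level variables are thread-local by default
__gshared size_t sharedGlobal;  // a single process-wide instance

// Hypothetical helper: time loopCount volatile stores to the TLS variable.
long timeTls(size_t loopCount)
{
    auto sw = StopWatch(AutoStart.yes);
    foreach (i; 0 .. loopCount)
        volatileStore(&tlsGlobal, i);    // every store must be emitted
    return sw.peek.total!"msecs";
}

// Hypothetical helper: the same loop against the __gshared variable.
long timeShared(size_t loopCount)
{
    auto sw = StopWatch(AutoStart.yes);
    foreach (i; 0 .. loopCount)
        volatileStore(&sharedGlobal, i);
    return sw.peek.total!"msecs";
}

void main()
{
    enum loopCount = 10_000_000;
    writefln("TLS   : %s milliseconds", timeTls(loopCount));
    writefln("Shared: %s milliseconds", timeShared(loopCount));
}
```

Note that the compiler may still hoist the address computation of tlsGlobal out of the loop, so this measures the stores and at most one address lookup per call, not one lookup per iteration.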
Feb 18 2015
parent reply "Jonathan Marler" <johnnymarler gmail.com> writes:
On Wednesday, 18 February 2015 at 17:03:38 UTC, Dan Olson wrote:
 LDC  __gshared versus TLS time is a bit better than DMD.

 $ dmd -O timetls.d
 $ ./timetls
 --------------------------------------------------
 run 1 (loopcount 10000000)
   TLS   : 93 milliseconds
   Shared: 6 milliseconds
 run 2 (loopcount 10000000)
   TLS   : 91 milliseconds
   Shared: 6 milliseconds
 run 3 (loopcount 10000000)
   TLS   : 92 milliseconds
   Shared: 4 milliseconds

 $ ldmd2 -O3 timetls.d
 $ ./timetls
 --------------------------------------------------
 run 1 (loopcount 10000000)
   TLS   : 21 milliseconds
   Shared: 3 milliseconds
 run 2 (loopcount 10000000)
   TLS   : 22 milliseconds
   Shared: 5 milliseconds
 run 3 (loopcount 10000000)
   TLS   : 20 milliseconds
   Shared: 3 milliseconds
That's quite a bit better. If I run this using DMD on Windows I get almost the same performance:

dmd test.d
--------------------------------------------------
run 1 (loopcount 10000000)
  TLS   : 28 milliseconds
  Shared: 25 milliseconds
run 2 (loopcount 10000000)
  TLS   : 28 milliseconds
  Shared: 25 milliseconds
run 3 (loopcount 10000000)
  TLS   : 27 milliseconds
  Shared: 25 milliseconds

If I turn on optimization they both take 7 milliseconds.
Feb 18 2015
parent reply "Ola Fosheim Grøstad" <ola.fosheim.grostad+dlang gmail.com> writes:
On Wednesday, 18 February 2015 at 20:05:58 UTC, Jonathan Marler 
wrote:
 If I turn on optimization they both take 7 milliseconds.
You cannot benchmark it like this. To make it more realistic you should use multiple compilation units, add fences and cache invalidation.
Feb 19 2015
parent reply Dan Olson <zans.is.for.cans yahoo.com> writes:
"Ola Fosheim Grøstad" <ola.fosheim.grostad+dlang gmail.com> writes:

 On Wednesday, 18 February 2015 at 20:05:58 UTC, Jonathan Marler wrote:
 If I turn on optimization they both take 7 milliseconds.
You cannot benchmark it like this. To make it more realistic you should use multiple compilation units, add fences and cache invalidation.
Hmm, you got me thinking. A mfence should not be needed for TLS so in a MT program, expensive TLS lookup could still win. If cache is blown, wouldn't time to reload cache begin to dominate? I know all of this is very architecture dependent, but I have been wary of the number of instructions to do TLS lookup compared to shared. Perhaps I should not. Am I thinking correctly? -- Dan
Feb 21 2015
parent reply "Ola Fosheim Grøstad" <ola.fosheim.grostad+dlang gmail.com> writes:
On Sunday, 22 February 2015 at 04:33:58 UTC, Dan Olson wrote:
 Hmm, you got me thinking.  A mfence should not be needed for 
 TLS so in a
 MT program, expensive TLS lookup could still win.  If cache is 
 blown,
 wouldn't time to reload cache begin to dominate?  I know all of 
 this is
 very architecture dependent, but I have been wary of the number 
 of
 instructions to do TLS lookup compared to shared.  Perhaps I 
 should not.
 Am I thinking correctly?
The problem is really in synthetic benchmarks that compare apples and oranges. The "problem" may disappear once the TLS tables are loaded into the cache, or if the compiler has moved the "problem" outside of the loop and retained it in a register (which also has a hidden cost). An x86 cache miss is perhaps 100-200 cycles and a 3rd level cache load/full barrier is 30-40 cycles, but a pure read or write barrier is only a few cycles... What is the hidden cost of D TLS versus the optimal codegen for a program? I guess you have to compare C vs D on a set of complex programs to figure it all out.
Feb 22 2015
parent reply "Jonathan Marler" <johnnymarler gmail.com> writes:
On Sunday, 22 February 2015 at 17:36:49 UTC, Ola Fosheim Grøstad 
wrote:
 The problem is really in synthetic benchmarks that is comparing 
 apples/oranges. The "problem" may disappear once TLS tables are 
 loaded into the cache or if the compiler has moved the 
 "problem" outside of the loop and retaining it in a register 
 (which also has a hidden cost). A x86 cache miss is  perhaps 
 100-200 cycles and a 3rd level cache load/full barrier is 30-40 
 cycles, but a pure read or write barrier is only a few 
 cycles... What is the hidden cost of D TLS versus the optimal 
 codegen for a program? I guess you have to compare C vs D on a 
 set of complex programs to figure it all out.
Yes, I agree that you can't determine the general performance of TLS from such a simple program.

Here's what happened: I was writing a program that could optionally use TLS memory. When I turned on TLS memory it slowed down considerably, but only when using an LLVM compiler. No matter how I used TLS, it was much, much slower when using LLVM. The simple program is just a simple way to demonstrate that TLS is very slow in one specific type of program.

It would be great to see another program that could demonstrate that TLS is actually faster in some use cases. However, since it is so much slower, I think you'll have a hard time finding such an example. The simple program demonstrates that TLS is almost 2 orders of magnitude slower... it may not be that much slower in other types of programs... but with numbers like that it seems obvious that something is wrong.
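
For scale, the timings posted earlier in this thread can be converted into orders of magnitude by taking the base-10 logarithm of the slow/fast ratio. This is a small sketch; the helper name ordersOfMagnitude is made up for illustration, and the inputs are the millisecond figures quoted above (104 vs 3 and 93 vs 6 for DMD on OS X, 21 vs 3 for LDC on OS X). 104/3 is roughly 35x, i.e. about 1.5 orders of magnitude.

```d
import std.math : log10;
import std.stdio : writefln;

// Hypothetical helper: slowdown expressed in powers of ten.
double ordersOfMagnitude(double slowMs, double fastMs)
{
    return log10(slowMs / fastMs);
}

void main()
{
    // Millisecond figures taken from the runs posted in this thread.
    writefln("DMD, OS X (104 ms vs 3 ms): %.2f", ordersOfMagnitude(104, 3));
    writefln("DMD, OS X ( 93 ms vs 6 ms): %.2f", ordersOfMagnitude(93, 6));
    writefln("LDC, OS X ( 21 ms vs 3 ms): %.2f", ordersOfMagnitude(21, 3));
}
```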
Feb 22 2015
next sibling parent "Ola Fosheim Grøstad" <ola.fosheim.grostad+dlang gmail.com> writes:
On Monday, 23 February 2015 at 04:10:29 UTC, Jonathan Marler 
wrote:
 Here's what happened: I was writing a program that could 
 optionally use TLS memory.  When I turned on TLS memory it 
 slowed down considerably, but only when using an LLVM compiler.
  No matter how I used TLS, it was much much slower when using 
 LLVM.  The simple program is just a simple way to demonstrate 
 that TLS is very slow in one specific type of program.
Yeah, demonstrating that it is slow is reasonable. I was thinking more about the other direction: that either globals or TLS is fast is hard to show without a multi-threaded best-of-breed baseline to compare against. (I.e. that TLS is faster than globals, or the other way around, does not say much, since they can both be too slow if the code gen is lacking...)
 It would be great to see another program that could demonstrate 
 that TLS is actually faster in some use cases.  However, since 
 it is so much slower, I think you'll have a hard time finding 
 such an example.  The simple program demonstrates that TLS is 
 almost 2 orders of magnitude slower...it may not be that much 
 slower in other types of programs...but with numbers like that 
 it seems obvious that something is wrong.
Some other problems with naive TLS are that every thread gets the same dataset, that you pollute the 3rd level cache compared to globals, and that globals can be fetched without a register (absolute addressing or relative to the program counter). I'd be wary of using TLS for larger data structures, but putting a pointer there instead gives you YET another indirection -> more cache misses...
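
The "every thread gets the same dataset" point can be seen directly in D, where module-level variables are thread-local unless marked __gshared (or shared). This is a minimal sketch, with all names (tlsCounter, sharedCounter, workerSawTls, worker, runDemo) invented for illustration: a second thread starts with its own freshly initialised copy of the TLS variable, while the __gshared variable is a single instance.

```d
import core.thread : Thread;
import std.stdio : writefln;

int tlsCounter;               // each thread gets its own copy
__gshared int sharedCounter;  // one copy for the whole process
__gshared int workerSawTls;   // records what the worker thread observed

void worker()
{
    workerSawTls = tlsCounter; // the worker's copy starts at int.init
    tlsCounter = 42;           // changes only the worker's copy
    sharedCounter += 1;        // visible to every thread
}

void runDemo()
{
    tlsCounter = 7;
    sharedCounter = 1;
    auto t = new Thread(&worker);
    t.start();
    t.join();                  // join orders the worker's writes before our reads
}

void main()
{
    runDemo();
    writefln("main tlsCounter=%s, worker saw tlsCounter=%s, sharedCounter=%s",
             tlsCounter, workerSawTls, sharedCounter);
}
```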
Feb 23 2015
prev sibling parent reply Dan Olson <zans.is.for.cans yahoo.com> writes:
"Jonathan Marler" <johnnymarler gmail.com> writes:

 Here's what happened: I was writing a program that could optionally
 use TLS memory.  When I turned on TLS memory it slowed down
 considerably, but only when using an LLVM compiler.  No matter how I
 used TLS, it was much much slower when using LLVM.
Hi Jonathan.

The reason for slowness on OS X is here in the source code on Apple's website. A TLS has the extra cost of address lookup by a call to _tlv_get_addr:

_tlv_get_addr:
    movq 8(%rdi),%rax            // get key from descriptor
    movq %gs:0x0(,%rax,8),%rax   // get thread value
    testq %rax,%rax              // if NULL, lazily allocate
    je   LlazyAllocate
    addq 16(%rdi),%rax           // add offset from descriptor
    ret
LlazyAllocate:
    ...

http://www.opensource.apple.com/source/dyld/dyld-210.2.3/src/threadLocalHelpers.s

-- Dan
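
The steps in that assembly can be modelled in plain D to make the extra work per access easier to see. This is only an illustrative model, not dyld's implementation: the Descriptor struct is a guess based on the 8 and 16 byte offsets used above (on a 64-bit target), threadSlots stands in for the per-thread table reached through %gs, and blockSize and getAddr are made-up names.

```d
import std.stdio : writefln;

// Hypothetical mirror of the descriptor read by the assembly (64-bit layout).
struct Descriptor
{
    void* thunk;    // offset 0: resolver function pointer (unused in this model)
    size_t key;     // offset 8: index of the per-thread slot
    size_t offset;  // offset 16: position of the variable inside the block
}

enum blockSize = 256;     // assumed size of one thread's TLS block
ubyte*[16] threadSlots;   // thread-local in D; stands in for the %gs table

// Model of the lookup: load the slot, allocate lazily, add the offset.
ubyte* getAddr(ref const Descriptor d)
{
    ubyte* base = threadSlots[d.key];          // get thread value
    if (base is null)                          // if null, lazily allocate
    {
        base = (new ubyte[](blockSize)).ptr;
        threadSlots[d.key] = base;
    }
    return base + d.offset;                    // add offset from descriptor
}

void main()
{
    auto counter = Descriptor(null, 3, 0);
    auto flag = Descriptor(null, 3, 8);
    writefln("counter at %s, flag at %s", getAddr(counter), getAddr(flag));
}
```

Compared with a plain global, every access pays for a call, two dependent loads and a branch before the variable itself is touched.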
Feb 23 2015
parent Jacob Carlborg <doob me.com> writes:
On 2015-02-23 17:18, Dan Olson wrote:

 Hi Jonathan.

 The reason for slowness on OS X is here in the source code on Apple's
 website. A TLS has the extra cost of address lookup by a call to
 _tlv_get_addr:
Other platforms also use an extra function for some models, i.e. when support for dynamic libraries is required. -- /Jacob Carlborg
Feb 23 2015
prev sibling next sibling parent "Joakim" <dlang joakim.fea.st> writes:
On Tuesday, 17 February 2015 at 02:41:12 UTC, Jonathan Marler 
wrote:
 I've noticed that on my windows 7 development machine, 
 switching between TLS and non-TLS storage has a minimal impact 
 on performance (when using DMD).  I haven't tried LDC yet, 
 however, on a macbook pro, which uses clang (LLVM) for the 
 linker, using TLS has a huge performance impact (much much 
 slower).  Does anyone know if this is because of the way LLVM 
 handles TLS storage?  I'll have to try using LDC on my windows 
 machine but maybe one of you know off hand whether or not LLVM 
 has some performance problems with TLS storage. Thanks!
It has little to do with the linker or LLVM. dmd doesn't use the native TLS APIs on OS X, as Dan says, because OS X didn't have native TLS back then:

http://www.drdobbs.com/architecture-and-design/implementing-thread-local-storage-on-os/228701185

Druntime has since been updated to call pthread_setspecific and pthread_getspecific, but maybe that's still slower than non-TLS on OS X:

https://github.com/D-Programming-Language/druntime/blob/master/src/rt/sections_osx.d#L151

As Dan noted, David got ldc working with the since-added, undocumented TLS (i.e. TLV) functions on OS X:

https://github.com/ldc-developers/druntime/blob/ldc/src/ldc/osx_tls.c

On Tuesday, 17 February 2015 at 06:47:12 UTC, Martin Nowak wrote:
 On Tuesday, 17 February 2015 at 06:16:04 UTC, Dan Olson wrote:
 Try LDC and see if the performance improves because LDC uses 
 OS X native TLS.
Is there more information available about OS X's TLS support and how this is implemented in LDC? What version of OS X is required? I'd very much like to use that for DMD/druntime too, so that we can go on with the shared library support.
The functions David used were added in 10.7.
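
For readers unfamiliar with the pthread calls mentioned above, here is a rough sketch of how per-thread data can be emulated with a pthread key. It is not druntime's code: it assumes a POSIX system, and slotKey, workerValue, threadInt, worker and runDemo are names invented for this example. Every access goes through a pthread_getspecific call, which is the kind of overhead a plain global does not have.

```d
import core.stdc.stdlib : calloc;
import core.sys.posix.pthread; // pthread_key_create, pthread_getspecific, pthread_setspecific
import core.thread : Thread;
import std.stdio : writefln;

__gshared pthread_key_t slotKey;  // one key, shared by every thread
__gshared int workerValue;        // records what the worker thread ended up with

// Return the calling thread's private int, allocating it on first use.
int* threadInt()
{
    auto p = cast(int*) pthread_getspecific(slotKey);
    if (p is null)
    {
        // Zero-initialised C allocation; never freed and not checked for
        // failure, to keep the sketch short.
        p = cast(int*) calloc(1, int.sizeof);
        pthread_setspecific(slotKey, p);
    }
    return p;
}

void worker()
{
    *threadInt() += 5;            // starts from this thread's own zeroed slot
    workerValue = *threadInt();
}

void runDemo()
{
    pthread_key_create(&slotKey, null);  // null: no destructor callback
    *threadInt() = 1;                    // main thread's value
    auto t = new Thread(&worker);
    t.start();
    t.join();
}

void main()
{
    runDemo();
    writefln("main thread value=%s, worker thread value=%s",
             *threadInt(), workerValue);
}
```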
Feb 17 2015
prev sibling parent reply "Kai Nacke" <kai redstar.de> writes:
Hi Jonathan!

On Tuesday, 17 February 2015 at 02:41:12 UTC, Jonathan Marler 
wrote:
 I've noticed that on my windows 7 development machine, 
 switching between TLS and non-TLS storage has a minimal impact 
 on performance (when using DMD).  I haven't tried LDC yet, 
 however, on a macbook pro, which uses clang (LLVM) for the 
 linker, using TLS has a huge performance impact (much much 
 slower).  Does anyone know if this is because of the way LLVM 
 handles TLS storage?  I'll have to try using LDC on my windows 
 machine but maybe one of you know off hand whether or not LLVM 
 has some performance problems with TLS storage. Thanks!
On Windows, LLVM uses the segment registers for TLS storage (fs: for 32-bit and gs: for 64-bit). There is no other impact. Regards, Kai
Feb 17 2015
parent reply "deadalnix" <deadalnix gmail.com> writes:
On Tuesday, 17 February 2015 at 22:50:29 UTC, Kai Nacke wrote:
 On Windows, LLVM uses the segment registers for TLS storage 
 (fs: for 32-bit and gs: for 64-bit). There is no other impact.
Hijacking this as I'm investigating how TLS plays with shared objects. So, my understanding is that the segment register is used as a base for TLS, and TLS globals are indexed relative to it. This sounds like it can work until you have to cross shared object boundaries. What is done in this case?
Feb 19 2015
parent Jacob Carlborg <doob me.com> writes:
On 2015-02-19 23:27, deadalnix wrote:

 Hijacking this as I'm investigating how TLS plays with shared object.
I guess you can read the documentation at [1] and the source code in druntime for Linux and FreeBSD.
 So, my understanding is that the segment register is used as a base for
 TLS, and TLS globals are indexed using this register as a base.

 This sounds like it can work until you have to cross shared object
 boundaries. What is done in this case ?
I think that is when a runtime helper function is used, i.e. __tls_get_addr. Multiple models of TLS exist, and they vary between platforms and depending on what features are needed, e.g. shared objects. There's some documentation [1].

[1] http://www.akkadia.org/drepper/tls.pdf

-- /Jacob Carlborg
Feb 20 2015