digitalmars.D.ldc - LLVM and TLS

"Jonathan Marler" <johnnymarler gmail.com> writes:

I've noticed that on my windows 7 development machine, switching 
between TLS and non-TLS storage has a minimal impact on 
performance (when using DMD).  I haven't tried LDC yet, however, 
on a macbook pro, which uses clang (LLVM) for the linker, using 
TLS has a huge performance impact (much much slower).  Does 
anyone know if this is because of the way LLVM handles TLS 
storage?  I'll have to try using LDC on my windows machine but 
maybe one of you know off hand whether or not LLVM has some 
performance problems with TLS storage. Thanks!

Feb 16 2015

Dan Olson <zans.is.for.cans yahoo.com> writes:

"Jonathan Marler" <johnnymarler gmail.com> writes:

 I've noticed that on my windows 7 development machine, switching
 between TLS and non-TLS storage has a minimal impact on performance
 (when using DMD).  I haven't tried LDC yet, however, on a macbook pro,
 which uses clang (LLVM) for the linker, using TLS has a huge
 performance impact (much much slower).  Does anyone know if this is
 because of the way LLVM handles TLS storage?  I'll have to try using
 LDC on my windows machine but maybe one of you know off hand whether
 or not LLVM has some performance problems with TLS storage. Thanks!

Last time I checked, DMD still did not use OS X native TLS support, but
has its own solution.  Try LDC and see if the performance improves
because LDC uses OS X native TLS.
--
Dan

Feb 16 2015

"Martin Nowak" <code dawg.eu> writes:

On Tuesday, 17 February 2015 at 06:16:04 UTC, Dan Olson wrote:
 Try LDC and see if the performance improves because LDC uses OS 
 X native TLS.

Is there more information available about OS X's TLS support and 
how this is implemented in LDC? What version of OS X is required? 
I'd very much like to use that for DMD/druntime too, so that we 
can go on with the shared library support.

Feb 16 2015

Jacob Carlborg <doob me.com> writes:

On 2015-02-17 07:47, Martin Nowak wrote:

 Is there more information available about OS X's TLS support and how this
 is implemented in LDC? What version of OS X is required? I'd very much
 like to use that for DMD/druntime too, so that we can go on with the
 shared library support.

I've created an issue for this, there is some information about the 
implementation in the issue [1].

OS X 10.7 or later is required. But I'm pretty sure we can back port it 
to 10.6 if we really want/need to.

[1] https://issues.dlang.org/show_bug.cgi?id=9476#c2

-- 
/Jacob Carlborg

Feb 17 2015

"Jonathan Marler" <johnnymarler gmail.com> writes:

I've created a simple program to demonstrate the issue.  The 
performance cost of TLS vs __gshared is over one and a half 
orders of magnitude!

import std.stdio;
import std.datetime;

size_t tlsGlobal;               // module-level variables are thread-local by default in D
__gshared size_t sharedGlobal;  // __gshared: a single copy shared by all threads

void main(string[] args)
{
   runTest(3, 10_000_000);
}

// Times loopCount stores to each variable, repeated runCount times.
void runTest(size_t runCount, size_t loopCount)
{
   writeln("--------------------------------------------------");
   StopWatch sw;
   for(size_t runIndex = 0; runIndex < runCount; runIndex++) {

     writefln("run %s (loopcount %s)", runIndex + 1, loopCount);

     // Time the stores to the thread-local variable.
     sw.reset();
     sw.start();
     for(size_t i = 0; i < loopCount; i++) {
       tlsGlobal = i;
     }
     sw.stop();
     writefln("  TLS   : %s milliseconds", sw.peek.msecs);

     // Time the same stores to the __gshared variable.
     sw.reset();
     sw.start();
     for(size_t i = 0; i < loopCount; i++) {
       sharedGlobal = i;
     }
     sw.stop();
     writefln("  Shared: %s milliseconds", sw.peek.msecs);
   }
}

--------------------------------------------------
Output:
--------------------------------------------------
run 1 (loopcount 10000000)
   TLS   : 104 milliseconds
   Shared: 3 milliseconds
run 2 (loopcount 10000000)
   TLS   : 97 milliseconds
   Shared: 4 milliseconds
run 3 (loopcount 10000000)
   TLS   : 99 milliseconds
   Shared: 3 milliseconds

Feb 17 2015

Jacob Carlborg <doob me.com> writes:

On 2015-02-18 02:41, Jonathan Marler wrote:
 I've created a simple program to demonstrate the issue.  The performance
 cost of TLS vs __gshared is over one and a half orders of magnitude!

It would be nice to have a comparison in C as well, which does use the 
native TLS implementation.
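
A rough C counterpart to the D program could look like the sketch below. It is not a measured result. It assumes a C11 compiler with _Thread_local and POSIX clock_gettime; the names tls_global, shared_global and the helper functions are hypothetical. The volatile qualifier is an addition so that an optimizer cannot discard the stores; it is not part of the original D code and may change the generated code slightly.

```c
#define _POSIX_C_SOURCE 200809L
#include <stddef.h>
#include <time.h>

/* Hypothetical names mirroring tlsGlobal / sharedGlobal in the D program. */
static _Thread_local volatile size_t tls_global;  /* one copy per thread  */
static volatile size_t shared_global;             /* one copy per process */

/* Milliseconds elapsed between two timespec samples. */
static double elapsed_ms(struct timespec a, struct timespec b)
{
    return (double)(b.tv_sec - a.tv_sec) * 1e3 +
           (double)(b.tv_nsec - a.tv_nsec) / 1e6;
}

/* Store 0..loop_count-1 into the thread-local variable; return elapsed ms. */
double time_tls_stores(size_t loop_count)
{
    struct timespec start, stop;
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (size_t i = 0; i < loop_count; i++)
        tls_global = i;
    clock_gettime(CLOCK_MONOTONIC, &stop);
    return elapsed_ms(start, stop);
}

/* Same loop against the ordinary (process-wide) global. */
double time_shared_stores(size_t loop_count)
{
    struct timespec start, stop;
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (size_t i = 0; i < loop_count; i++)
        shared_global = i;
    clock_gettime(CLOCK_MONOTONIC, &stop);
    return elapsed_ms(start, stop);
}

/* Accessors so a caller can check that the stores really happened. */
size_t last_tls_value(void)    { return tls_global; }
size_t last_shared_value(void) { return shared_global; }
```

Calling time_tls_stores(10000000) and time_shared_stores(10000000) from main and printing both values would give numbers comparable in spirit to the D output, on whatever platform and compiler are used.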

-- 
/Jacob Carlborg

Feb 17 2015

Dan Olson <zans.is.for.cans yahoo.com> writes:

"Jonathan Marler" <johnnymarler gmail.com> writes:

 I've created a simple program to demonstrate the issue.  The
 performance cost of TLS vs __gshared is over one and a half orders of
 magnitude!

--snip--

I ran on my MacBook to compare DMD and LDC 2.066.1 versions.  With LDC,
I had to put an empty asm instruction in the for loops, otherwise the
optimizer removed all but the last write and timing looked really good
(0 milliseconds)!
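
The same trick can be sketched in C for GCC and Clang. This is only an illustration of the idea, not the code that was run: COMPILER_BARRIER, tls_counter and both functions are hypothetical names, and the empty asm statement with a "memory" clobber is a GCC/Clang extension.

```c
#include <stddef.h>

_Thread_local size_t tls_counter;   /* hypothetical stand-in for tlsGlobal */

/* Empty asm statement with a "memory" clobber (GCC/Clang extension).
   It emits no instructions, but the compiler must assume memory may have
   been read or written, so stores before it cannot be merged or dropped. */
#define COMPILER_BARRIER() __asm__ volatile("" ::: "memory")

/* Without a barrier an optimizer may keep only the final store. */
size_t store_loop_unprotected(size_t n)
{
    for (size_t i = 0; i < n; i++)
        tls_counter = i;
    return tls_counter;
}

/* With the barrier, every iteration has to perform its store. */
size_t store_loop_with_barrier(size_t n)
{
    for (size_t i = 0; i < n; i++) {
        tls_counter = i;
        COMPILER_BARRIER();
    }
    return tls_counter;
}
```

Both functions return the same value; the difference only shows up in the generated code and therefore in the timing.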

LDC  __gshared versus TLS time is a bit better than DMD.

$ dmd -O timetls.d 
$ ./timetls 
--------------------------------------------------
run 1 (loopcount 10000000)
  TLS   : 93 milliseconds
  Shared: 6 milliseconds
run 2 (loopcount 10000000)
  TLS   : 91 milliseconds
  Shared: 6 milliseconds
run 3 (loopcount 10000000)
  TLS   : 92 milliseconds
  Shared: 4 milliseconds

$ ldmd2 -O3 timetls.d 
$ ./timetls 
--------------------------------------------------
run 1 (loopcount 10000000)
  TLS   : 21 milliseconds
  Shared: 3 milliseconds
run 2 (loopcount 10000000)
  TLS   : 22 milliseconds
  Shared: 5 milliseconds
run 3 (loopcount 10000000)
  TLS   : 20 milliseconds
  Shared: 3 milliseconds

Feb 18 2015

"Jonathan Marler" <johnnymarler gmail.com> writes:

On Wednesday, 18 February 2015 at 17:03:38 UTC, Dan Olson wrote:
 LDC  __gshared versus TLS time is a bit better than DMD.

 $ dmd -O timetls.d
 $ ./timetls
 --------------------------------------------------
 run 1 (loopcount 10000000)
   TLS   : 93 milliseconds
   Shared: 6 milliseconds
 run 2 (loopcount 10000000)
   TLS   : 91 milliseconds
   Shared: 6 milliseconds
 run 3 (loopcount 10000000)
   TLS   : 92 milliseconds
   Shared: 4 milliseconds

 $ ldmd2 -O3 timetls.d
 $ ./timetls
 --------------------------------------------------
 run 1 (loopcount 10000000)
   TLS   : 21 milliseconds
   Shared: 3 milliseconds
 run 2 (loopcount 10000000)
   TLS   : 22 milliseconds
   Shared: 5 milliseconds
 run 3 (loopcount 10000000)
   TLS   : 20 milliseconds
   Shared: 3 milliseconds

That's quite a bit better.  If I run this using DMD on windows I 
get almost the same performance:

dmd test.d
--------------------------------------------------
run 1 (loopcount 10000000)
   TLS   : 28 milliseconds
   Shared: 25 milliseconds
run 2 (loopcount 10000000)
   TLS   : 28 milliseconds
   Shared: 25 milliseconds
run 3 (loopcount 10000000)
   TLS   : 27 milliseconds
   Shared: 25 milliseconds

If I turn on optimization they both take 7 milliseconds.

Feb 18 2015

"Ola Fosheim Grøstad" writes:

On Wednesday, 18 February 2015 at 20:05:58 UTC, Jonathan Marler 
wrote:
 If I turn on optimization they both take 7 milliseconds.

You cannot benchmark it like this. To make it more realistic you 
should use multiple compilation units, add fences and cache 
invalidation.

Feb 19 2015

Dan Olson <zans.is.for.cans yahoo.com> writes:

"Ola Fosheim Grøstad" <ola.fosheim.grostad+dlang gmail.com> writes:

 On Wednesday, 18 February 2015 at 20:05:58 UTC, Jonathan Marler wrote:
 If I turn on optimization they both take 7 milliseconds.

 You cannot benchmark it like this. To make it more realistic you
 should use multiple compilation units, add fences and cache
 invalidation.

Hmm, you got me thinking.  An mfence should not be needed for TLS so in an
MT program, expensive TLS lookup could still win.  If cache is blown,
wouldn't time to reload cache begin to dominate?  I know all of this is
very architecture dependent, but I have been wary of the number of
instructions to do TLS lookup compared to shared.  Perhaps I should not.
Am I thinking correctly?
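
The trade-off in question can be sketched in C with pthreads and C11 atomics. This is a structural illustration only, with no timing claims: WORKERS, INCREMENTS and all function names are hypothetical, and it needs to be built with -pthread. One variant pays for a sequentially consistent atomic update on every increment; the other works on thread-local storage and touches shared memory once per thread.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

#define WORKERS    4        /* hypothetical thread count    */
#define INCREMENTS 100000   /* hypothetical per-thread work */

static atomic_size_t shared_total;       /* contended by every thread */
static _Thread_local size_t tls_count;   /* private to each thread    */

/* Every increment is a sequentially consistent read-modify-write
   (on x86 a lock-prefixed instruction, which also acts as a full barrier). */
static void *bump_shared(void *arg)
{
    (void)arg;
    for (size_t i = 0; i < INCREMENTS; i++)
        atomic_fetch_add(&shared_total, 1);
    return NULL;
}

/* Increments touch only thread-private storage; the shared total is
   updated once per thread at the end. */
static void *bump_tls(void *arg)
{
    atomic_size_t *total = arg;
    tls_count = 0;
    for (size_t i = 0; i < INCREMENTS; i++)
        tls_count++;
    atomic_fetch_add(total, tls_count);
    return NULL;
}

/* Start WORKERS threads running fn, then wait for all that started. */
static int run_workers(void *(*fn)(void *), void *arg)
{
    pthread_t threads[WORKERS];
    int started = 0;
    while (started < WORKERS &&
           pthread_create(&threads[started], NULL, fn, arg) == 0)
        started++;
    for (int i = 0; i < started; i++)
        pthread_join(threads[i], NULL);
    return started == WORKERS ? 0 : -1;
}

size_t count_with_shared_atomic(void)
{
    atomic_store(&shared_total, 0);
    run_workers(bump_shared, NULL);
    return atomic_load(&shared_total);
}

size_t count_with_tls(void)
{
    atomic_size_t total = 0;
    run_workers(bump_tls, &total);
    return atomic_load(&total);
}
```

Both functions produce the same total; which one is faster depends on the TLS access cost of the platform versus the cost of contended atomic operations, which is exactly the open question here.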
--
Dan

Feb 21 2015

"Ola Fosheim Grøstad" writes:

On Sunday, 22 February 2015 at 04:33:58 UTC, Dan Olson wrote:
 Hmm, you got me thinking.  An mfence should not be needed for TLS so in
 an MT program, expensive TLS lookup could still win.  If cache is blown,
 wouldn't time to reload cache begin to dominate?  I know all of this is
 very architecture dependent, but I have been wary of the number of
 instructions to do TLS lookup compared to shared.  Perhaps I should not.
 Am I thinking correctly?

The real problem is that synthetic benchmarks compare apples to 
oranges. The "problem" may disappear once the TLS tables are 
loaded into the cache, or if the compiler moves the "problem" 
outside of the loop and keeps it in a register (which also 
has a hidden cost). An x86 cache miss is perhaps 100-200 cycles 
and a 3rd level cache load/full barrier is 30-40 cycles, but a 
pure read or write barrier is only a few cycles... What is the 
hidden cost of D TLS versus the optimal codegen for a program? I 
guess you have to compare C vs D on a set of complex programs to 
figure it all out.

Feb 22 2015

"Jonathan Marler" <johnnymarler gmail.com> writes:

On Sunday, 22 February 2015 at 17:36:49 UTC, Ola Fosheim Grøstad 
wrote:
 The real problem is that synthetic benchmarks compare apples to 
 oranges. The "problem" may disappear once the TLS tables are 
 loaded into the cache, or if the compiler moves the "problem" 
 outside of the loop and keeps it in a register (which also 
 has a hidden cost). An x86 cache miss is perhaps 100-200 cycles 
 and a 3rd level cache load/full barrier is 30-40 cycles, but a 
 pure read or write barrier is only a few cycles... What is the 
 hidden cost of D TLS versus the optimal codegen for a program? 
 I guess you have to compare C vs D on a set of complex programs 
 to figure it all out.

Yes I agree that you can't determine the general performance of 
TLS from such a simple program.

Here's what happened: I was writing a program that could 
optionally use TLS memory.  When I turned on TLS memory it slowed 
down considerably, but only when using an LLVM compiler.  No 
matter how I used TLS, it was much much slower when using LLVM.  
The simple program is just a simple way to demonstrate that TLS 
is very slow in one specific type of program.  It would be great 
to see another program that could demonstrate that TLS is 
actually faster in some use cases.  However, since it is so 
much slower, I think you'll have a hard time finding such an 
example.  The simple program demonstrates that TLS is almost 2 
orders of magnitude slower...it may not be that much slower in 
other types of programs...but with numbers like that it seems 
obvious that something is wrong.

Feb 22 2015

"Ola Fosheim Grøstad" writes:

On Monday, 23 February 2015 at 04:10:29 UTC, Jonathan Marler 
wrote:
 Here's what happened: I was writing a program that could 
 optionally use TLS memory.  When I turned on TLS memory it 
 slowed down considerably, but only when using an LLVM compiler.
  No matter how I used TLS, it was much much slower when using 
 LLVM.  The simple program is just a simple way to demonstrate 
 that TLS is very slow in one specific type of program.

Yeah, demonstrating that it is slow is reasonable. I was thinking 
more about the other direction: that either globals or TLS is 
fast is hard to show without a multi-threaded best-of-breed 
baseline to compare against (i.e. that TLS is faster than 
globals, or the other way around, does not say much, since both 
can be too slow if the code gen is lacking...).

 It would be great to see another program that could demonstrate 
 that TLS is actually faster in some use cases.  However, since 
 it is so much slower, I think you'll have a hard time finding 
 such an example.  The simple program demonstrates that TLS is 
 almost 2 orders of magnitude slower...it may not be that much 
 slower in other types of programs...but with numbers like that 
 it seems obvious that something is wrong.

Some other problems with naive TLS are that every thread gets the 
same dataset, that you pollute the 3rd level cache compared to 
globals, and that globals can be fetched without a register 
(absolute addressing or relative to the program counter). I'd be 
wary of using TLS for larger data structures, but putting a pointer 
there instead gives you YET another indirection, and so more cache 
misses...
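
The pointer-in-TLS variant can be sketched in C as follows. It is a minimal illustration under assumptions: SCRATCH_BYTES, scratch_buffer and scratch_release are hypothetical names, and the buffer is allocated lazily so that threads that never use it do not pay for it.

```c
#include <stddef.h>
#include <stdlib.h>

#define SCRATCH_BYTES (64 * 1024)   /* hypothetical size of a "large" structure */

/* Only the pointer lives in TLS, so a thread that never calls
   scratch_buffer() does not get its own 64 KiB block. */
static _Thread_local unsigned char *scratch;

/* Return this thread's scratch buffer, allocating it on first use.
   Returns NULL if allocation fails. Every access costs one extra load:
   first the TLS slot holding the pointer, then the data behind it. */
unsigned char *scratch_buffer(void)
{
    if (scratch == NULL)
        scratch = calloc(1, SCRATCH_BYTES);
    return scratch;
}

/* Free the calling thread's buffer, e.g. just before the thread exits. */
void scratch_release(void)
{
    free(scratch);
    scratch = NULL;
}
```

A caller can hoist the result of scratch_buffer() into a local variable before a hot loop, which removes the repeated TLS lookup but not the extra indirection.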

Feb 23 2015

Dan Olson <zans.is.for.cans yahoo.com> writes:

"Jonathan Marler" <johnnymarler gmail.com> writes:

 Here's what happened: I was writing a program that could optionally
 use TLS memory.  When I turned on TLS memory it slowed down
 considerably, but only when using an LLVM compiler.  No matter how I
 used TLS, it was much much slower when using LLVM.

Hi Jonathan.

The reason for the slowness on OS X can be seen in the source code on
Apple's website. A TLS access has the extra cost of an address lookup
through a call to _tlv_get_addr:

_tlv_get_addr:
	movq	8(%rdi),%rax			// get key from descriptor
	movq	%gs:0x0(,%rax,8),%rax	// get thread value
	testq	%rax,%rax				// if NULL, lazily allocate
	je		LlazyAllocate
	addq	16(%rdi),%rax			// add offset from descriptor
	ret
LlazyAllocate:
        ...

http://www.opensource.apple.com/source/dyld/dyld-210.2.3/src/threadLocalHelpers.s
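
For readers who prefer C to assembly, the lookup can be modeled roughly as below. This is a sketch, not Apple's code: the struct layout is inferred from the offsets 8 and 16 in the assembly, the %gs-relative load is modeled with pthread_getspecific on the assumption that it reads the same kind of per-thread slot, and TLS_BLOCK_BYTES and all names ending in _model are hypothetical. The lazy path here simply zero-fills a block, whereas the real one would also set up the variables' initial values. Build with -pthread.

```c
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>

#define TLS_BLOCK_BYTES 64   /* hypothetical size of one image's TLS block */

/* Conceptual three-word descriptor: the assembly reads the key at offset 8
   and the variable's offset at offset 16 (the exact layout depends on the
   platform's pthread_key_t). */
struct tlv_descriptor_model {
    void *(*thunk)(struct tlv_descriptor_model *);  /* lookup function */
    pthread_key_t key;     /* selects the per-thread slot               */
    size_t offset;         /* variable's offset inside the thread block */
};

/* Create a key whose per-thread block is freed when a thread exits. */
int tlv_model_key_create(pthread_key_t *key)
{
    return pthread_key_create(key, free);
}

/* C model of the assembly above, one line per instruction group. */
void *tlv_get_addr_model(struct tlv_descriptor_model *d)
{
    char *base = pthread_getspecific(d->key);   /* get thread value        */
    if (base == NULL) {                         /* if NULL, lazily allocate */
        base = calloc(1, TLS_BLOCK_BYTES);
        if (base == NULL)
            abort();
        pthread_setspecific(d->key, base);
    }
    return base + d->offset;                    /* add offset from descriptor */
}
```

Every access to a thread-local variable then goes through d->thunk(d), which is the per-access function call that a plain global does not need.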

--
Dan

Feb 23 2015

Jacob Carlborg <doob me.com> writes:

On 2015-02-23 17:18, Dan Olson wrote:

 Hi Jonathan.

 The reason for the slowness on OS X can be seen in the source code on
 Apple's website. A TLS access has the extra cost of an address lookup
 through a call to _tlv_get_addr:

Other platforms also use an extra function for some models, i.e. when 
support for dynamic libraries is required.

-- 
/Jacob Carlborg

Feb 23 2015

"Joakim" <dlang joakim.fea.st> writes:

On Tuesday, 17 February 2015 at 02:41:12 UTC, Jonathan Marler
wrote:
 I've noticed that on my windows 7 development machine,
 switching between TLS and non-TLS storage has a minimal impact
 on performance (when using DMD). I haven't tried LDC yet,
 however, on a macbook pro, which uses clang (LLVM) for the
 linker, using TLS has a huge performance impact (much much
 slower). Does anyone know if this is because of the way LLVM
 handles TLS storage? I'll have to try using LDC on my windows
 machine but maybe one of you know off hand whether or not LLVM
 has some performance problems with TLS storage. Thanks!

It has little to do with the linker or llvm. dmd doesn't use the
native TLS APIs on OS X, as Dan says, because OS X didn't have
native TLS back then:

http://www.drdobbs.com/architecture-and-design/implementing-thread-local-storage-on-os/228701185

Druntime has since been updated to call pthread_setspecific and
pthread_getspecific, but maybe that's still slower than non-TLS
on OS X:

https://github.com/D-Programming-Language/druntime/blob/master/src/rt/sections_osx.d#L151

As Dan noted, David got ldc working with the since-added
undocumented TLS, i.e. TLV, functions on OS X:

https://github.com/ldc-developers/druntime/blob/ldc/src/ldc/osx_tls.c

On Tuesday, 17 February 2015 at 06:47:12 UTC, Martin Nowak wrote:
 On Tuesday, 17 February 2015 at 06:16:04 UTC, Dan Olson wrote:
  Try LDC and see if the performance improves because LDC uses
  OS X native TLS.

 Is there more information available about OS X's TLS support and
 how this is implemented in LDC? What version of OS X is
 required? I'd very much like to use that for DMD/druntime too,
 so that we can go on with the shared library support.

The functions David used were added in 10.7.

Feb 17 2015

"Kai Nacke" <kai redstar.de> writes:

Hi Jonathan!

On Tuesday, 17 February 2015 at 02:41:12 UTC, Jonathan Marler 
wrote:
 I've noticed that on my windows 7 development machine, 
 switching between TLS and non-TLS storage has a minimal impact 
 on performance (when using DMD).  I haven't tried LDC yet, 
 however, on a macbook pro, which uses clang (LLVM) for the 
 linker, using TLS has a huge performance impact (much much 
 slower).  Does anyone know if this is because of the way LLVM 
 handles TLS storage?  I'll have to try using LDC on my windows 
 machine but maybe one of you know off hand whether or not LLVM 
 has some performance problems with TLS storage. Thanks!

On Windows, LLVM uses the segment registers for TLS storage (fs: 
for 32bit and gs: for 64bit). There is no other impact.

Regards,
Kai

Feb 17 2015

"deadalnix" <deadalnix gmail.com> writes:

On Tuesday, 17 February 2015 at 22:50:29 UTC, Kai Nacke wrote:
 On Windows, LLVM uses the segment registers for TLS storage 
 (fs: for 32bit and gs: for 64bit). There is no other impact.

Hijacking this as I'm investigating how TLS plays with shared 
object.

So, my understanding is that the segment register is used as a 
base for TLS, and TLS globals are indexed using this register as 
a base.

This sounds like it can work until you have to cross shared 
object boundaries. What is done in this case?

Feb 19 2015

Jacob Carlborg <doob me.com> writes:

On 2015-02-19 23:27, deadalnix wrote:

 Hijacking this as I'm investigating how TLS plays with shared object.

I guess you can read the documentation at [1] and the source code in 
druntime for Linux and FreeBSD.

 So, my understanding is that the segment register is used as a base for
 TLS, and TLS globals are indexed using this register as a base.

 This sounds like it can work until you have to cross shared object
 boundaries. What is done in this case?

I think that is when a runtime helper function is used, i.e. 
__tls_get_addr. Multiple models of TLS exist and they vary between 
platforms and depending on what features are needed, i.e. shared 
objects. There's some documentation [1].

[1] http://www.akkadia.org/drepper/tls.pdf
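
As an illustration of how a model is chosen, GCC and Clang let C code request one per variable with the tls_model attribute (or for a whole compilation with -ftls-model). The sketch below assumes an ELF target such as Linux; the variable and function names are hypothetical, and which instructions actually result depends on the compiler, the flags and whether the code ends up in a shared object.

```c
/* Default model: the compiler picks what is safe for the kind of output.
   In position-independent code for a shared object this is typically the
   general dynamic model, which may call __tls_get_addr on each access. */
_Thread_local int counter_default;

/* Initial exec model: the variable's offset from the thread pointer is
   fixed when the program is loaded, so no helper call is needed. It is
   only valid for objects that are present at program start-up. */
_Thread_local int counter_initial_exec
    __attribute__((tls_model("initial-exec")));

/* Each function increments its thread's copy and returns the new value. */
int bump_default(void)      { return ++counter_default; }
int bump_initial_exec(void) { return ++counter_initial_exec; }
```

Compiling this with -fPIC -O2 -S and comparing the two functions is a quick way to see whether a call to __tls_get_addr appears for the default model on a given toolchain.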

-- 
/Jacob Carlborg

Feb 20 2015
