digitalmars.D.ldc - LLVM and TLS
- Jonathan Marler (9/9) Feb 16 2015 I've noticed that on my windows 7 development machine, switching
- Dan Olson (6/14) Feb 16 2015 Last time I checked, DMD still did not use OS X native TLS support, but
- Martin Nowak (5/7) Feb 16 2015 Is there more information available about OS X's TLS support and
- Jacob Carlborg (8/12) Feb 17 2015 I've created an issue for this, there is some information about the
- Jonathan Marler (45/45) Feb 17 2015 I've created a simple program to demonstrate the issue. The
- Jacob Carlborg (5/7) Feb 17 2015 It would be nice to have a comparison in C as well, which does use the
- Dan Olson (31/34) Feb 18 2015 --snip--
- Jonathan Marler (15/40) Feb 18 2015 That's quite a bit better. If I run this using DMD on windows I
- "Ola Fosheim Grøstad" (5/6) Feb 19 2015 You cannot benchmark it like this. To make it more realistic you
- Dan Olson (9/15) Feb 21 2015 Hmm, you got me thinking. An mfence should not be needed for TLS so in a
- "Ola Fosheim Grøstad" (11/22) Feb 22 2015 The problem is really in synthetic benchmarks that compare
- Jonathan Marler (17/27) Feb 22 2015 Yes I agree that you can't determine the general performance of
- "Ola Fosheim Grøstad" (15/28) Feb 23 2015 Yeah, demonstrating that it is slow is reasonable. I was more
- Dan Olson (17/21) Feb 23 2015 Hi Jonathan.
- Jacob Carlborg (5/9) Feb 23 2015 Other platforms also use an extra function for some models, i.e. when
- Joakim (15/31) Feb 17 2015 It has little to do with the linker or llvm. dmd doesn't use the
- Kai Nacke (7/16) Feb 17 2015 Hi Jonathan!
- deadalnix (8/10) Feb 19 2015 Hijacking this as I'm investigating how TLS plays with shared
- Jacob Carlborg (10/15) Feb 20 2015 I guess you can read the documentation at [1] and the source code in
I've noticed that on my Windows 7 development machine, switching between TLS and non-TLS storage has a minimal impact on performance (when using DMD). I haven't tried LDC yet. However, on a MacBook Pro, which uses clang (LLVM) for the linker, using TLS has a huge performance impact (much, much slower). Does anyone know if this is because of the way LLVM handles TLS storage? I'll have to try using LDC on my Windows machine, but maybe one of you knows offhand whether or not LLVM has some performance problems with TLS storage. Thanks!
Feb 16 2015
"Jonathan Marler" <johnnymarler gmail.com> writes:
> I've noticed that on my Windows 7 development machine, switching between TLS and non-TLS storage has a minimal impact on performance (when using DMD). I haven't tried LDC yet. However, on a MacBook Pro, which uses clang (LLVM) for the linker, using TLS has a huge performance impact (much, much slower). Does anyone know if this is because of the way LLVM handles TLS storage? I'll have to try using LDC on my Windows machine, but maybe one of you knows offhand whether or not LLVM has some performance problems with TLS storage. Thanks!

Last time I checked, DMD still did not use OS X native TLS support, but has its own solution. Try LDC and see if the performance improves because LDC uses OS X native TLS.
-- Dan
Feb 16 2015
On Tuesday, 17 February 2015 at 06:16:04 UTC, Dan Olson wrote:
> Try LDC and see if the performance improves because LDC uses OS X native TLS.

Is there more information available about OS X's TLS support and how this is implemented in LDC? What version of OS X is required? I'd very much like to use that for DMD/druntime too, so that we can go on with the shared library support.
Feb 16 2015
On 2015-02-17 07:47, Martin Nowak wrote:
> Is there more information available about OS X's TLS support and how this is implemented in LDC? What version of OS X is required? I'd very much like to use that for DMD/druntime too, so that we can go on with the shared library support.

I've created an issue for this; there is some information about the implementation in the issue [1]. OS X 10.7 or later is required, but I'm pretty sure we can backport it to 10.6 if we really want or need to.

[1] https://issues.dlang.org/show_bug.cgi?id=9476#c2
-- /Jacob Carlborg
Feb 17 2015
I've created a simple program to demonstrate the issue. The performance cost of TLS vs __gshared is over one and a half orders of magnitude!

import std.stdio;
import std.datetime;

size_t tlsGlobal;
__gshared size_t sharedGlobal;

void main(string[] args)
{
    runTest(3, 10_000_000);
}

void runTest(size_t runCount, size_t loopCount)
{
    writeln("--------------------------------------------------");
    StopWatch sw;
    for(auto runIndex = 0; runIndex < runCount; runIndex++)
    {
        writefln("run %s (loopcount %s)", runIndex + 1, loopCount);

        sw.reset();
        sw.start();
        for(size_t i = 0; i < loopCount; i++)
        {
            tlsGlobal = i;
        }
        sw.stop();
        writefln(" TLS : %s milliseconds", sw.peek.msecs);

        sw.reset();
        sw.start();
        for(size_t i = 0; i < loopCount; i++)
        {
            sharedGlobal = i;
        }
        sw.stop();
        writefln(" Shared: %s milliseconds", sw.peek.msecs);
    }
}

--------------------------------------------------
Output:
--------------------------------------------------
run 1 (loopcount 10000000)
 TLS : 104 milliseconds
 Shared: 3 milliseconds
run 2 (loopcount 10000000)
 TLS : 97 milliseconds
 Shared: 4 milliseconds
run 3 (loopcount 10000000)
 TLS : 99 milliseconds
 Shared: 3 milliseconds
Feb 17 2015
On 2015-02-18 02:41, Jonathan Marler wrote:
> I've created a simple program to demonstrate the issue. The performance cost of TLS vs __gshared is over one and a half orders of magnitude!

It would be nice to have a comparison in C as well, which does use the native TLS implementation.
-- /Jacob Carlborg
Feb 17 2015
"Jonathan Marler" <johnnymarler gmail.com> writes:
> I've created a simple program to demonstrate the issue. The performance cost of TLS vs __gshared is over one and a half orders of magnitude!

--snip--

I ran on my MacBook to compare DMD and LDC 2.066.1 versions. With LDC, I had to put an empty asm instruction in the for loops; otherwise the optimizer removed all but the last write and the timing looked really good (0 milliseconds)! The LDC __gshared versus TLS time is a bit better than DMD's.

$ dmd -O timetls.d
$ ./timetls
--------------------------------------------------
run 1 (loopcount 10000000)
 TLS : 93 milliseconds
 Shared: 6 milliseconds
run 2 (loopcount 10000000)
 TLS : 91 milliseconds
 Shared: 6 milliseconds
run 3 (loopcount 10000000)
 TLS : 92 milliseconds
 Shared: 4 milliseconds

$ ldmd2 -O3 timetls.d
$ ./timetls
--------------------------------------------------
run 1 (loopcount 10000000)
 TLS : 21 milliseconds
 Shared: 3 milliseconds
run 2 (loopcount 10000000)
 TLS : 22 milliseconds
 Shared: 5 milliseconds
run 3 (loopcount 10000000)
 TLS : 20 milliseconds
 Shared: 3 milliseconds
Feb 18 2015
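For reference, the C comparison Jacob asks for could look roughly like the sketch below. It is not code from the thread: the names (`tls_global`, `shared_global`, `KEEP_STORE`, `run_test`) are made up for illustration, and it assumes a C11 compiler with GCC/Clang-style inline asm. The empty asm statement plays the role of the "empty asm instruction" Dan mentions, keeping the optimizer from collapsing each loop into a single store.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <time.h>

static _Thread_local size_t tls_global;  /* one copy per thread */
static size_t shared_global;             /* one copy per process */

/* Empty asm with a "memory" clobber (GCC/Clang extension). It emits no
   instruction, but the compiler has to assume memory was touched, so the
   store before it is not dropped. */
#define KEEP_STORE() __asm__ volatile("" ::: "memory")

static double elapsed_ms(clock_t start, clock_t end)
{
    return 1000.0 * (double)(end - start) / (double)CLOCKS_PER_SEC;
}

/* Stores 0..loop_count-1 into the thread-local; returns CPU time in ms. */
double time_tls_writes(size_t loop_count)
{
    assert(loop_count > 0);
    clock_t start = clock();
    for (size_t i = 0; i < loop_count; i++) {
        tls_global = i;
        KEEP_STORE();
    }
    return elapsed_ms(start, clock());
}

/* Same loop against the ordinary global; returns CPU time in ms. */
double time_shared_writes(size_t loop_count)
{
    assert(loop_count > 0);
    clock_t start = clock();
    for (size_t i = 0; i < loop_count; i++) {
        shared_global = i;
        KEEP_STORE();
    }
    return elapsed_ms(start, clock());
}

/* Prints one TLS/Shared pair per run, in the same shape as the D program. */
void run_test(size_t run_count, size_t loop_count)
{
    for (size_t run = 0; run < run_count; run++) {
        printf("run %zu (loopcount %zu)\n", run + 1, loop_count);
        printf(" TLS : %.1f milliseconds\n", time_tls_writes(loop_count));
        printf(" Shared: %.1f milliseconds\n", time_shared_writes(loop_count));
    }
}
```

Calling `run_test(3, 10000000)` from `main` mirrors the D test. The barrier only keeps the stores; the compiler may still compute the TLS address once outside the loop, so the numbers are a lower bound on the per-access cost, not a general measure.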
On Wednesday, 18 February 2015 at 17:03:38 UTC, Dan Olson wrote:
> LDC __gshared versus TLS time is a bit better than DMD.
>
> $ dmd -O timetls.d
> $ ./timetls
> --------------------------------------------------
> run 1 (loopcount 10000000)
>  TLS : 93 milliseconds
>  Shared: 6 milliseconds
> run 2 (loopcount 10000000)
>  TLS : 91 milliseconds
>  Shared: 6 milliseconds
> run 3 (loopcount 10000000)
>  TLS : 92 milliseconds
>  Shared: 4 milliseconds
>
> $ ldmd2 -O3 timetls.d
> $ ./timetls
> --------------------------------------------------
> run 1 (loopcount 10000000)
>  TLS : 21 milliseconds
>  Shared: 3 milliseconds
> run 2 (loopcount 10000000)
>  TLS : 22 milliseconds
>  Shared: 5 milliseconds
> run 3 (loopcount 10000000)
>  TLS : 20 milliseconds
>  Shared: 3 milliseconds

That's quite a bit better. If I run this using DMD on Windows I get almost the same performance:

dmd test.d
--------------------------------------------------
run 1 (loopcount 10000000)
 TLS : 28 milliseconds
 Shared: 25 milliseconds
run 2 (loopcount 10000000)
 TLS : 28 milliseconds
 Shared: 25 milliseconds
run 3 (loopcount 10000000)
 TLS : 27 milliseconds
 Shared: 25 milliseconds

If I turn on optimization they both take 7 milliseconds.
Feb 18 2015
On Wednesday, 18 February 2015 at 20:05:58 UTC, Jonathan Marler wrote:
> If I turn on optimization they both take 7 milliseconds.

You cannot benchmark it like this. To make it more realistic you should use multiple compilation units, add fences and cache invalidation.
Feb 19 2015
"Ola Fosheim Grøstad" <ola.fosheim.grostad+dlang gmail.com> writes:
> On Wednesday, 18 February 2015 at 20:05:58 UTC, Jonathan Marler wrote:
>> If I turn on optimization they both take 7 milliseconds.
> You cannot benchmark it like this. To make it more realistic you should use multiple compilation units, add fences and cache invalidation.

Hmm, you got me thinking. An mfence should not be needed for TLS, so in an MT program, expensive TLS lookup could still win. If cache is blown, wouldn't time to reload cache begin to dominate? I know all of this is very architecture dependent, but I have been wary of the number of instructions to do TLS lookup compared to shared. Perhaps I should not be. Am I thinking correctly?
-- Dan
Feb 21 2015
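Dan's point can be made concrete with a small C11 sketch, under the assumption that the shared variable really is published to other threads and therefore written with sequentially consistent atomics, while the thread-local needs no ordering at all. The names `shared_counter`, `tls_counter`, `publish_shared` and `update_private` are hypothetical. On x86, a sequentially consistent store is typically compiled to an `xchg` or a `mov` followed by `mfence`, which is the cost Dan is weighing against the TLS lookup.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

static atomic_size_t shared_counter;               /* visible to all threads */
static _Thread_local volatile size_t tls_counter;  /* private to this thread */

/* Every store is sequentially consistent, i.e. it carries a full barrier.
   Returns the last value stored. */
size_t publish_shared(size_t loop_count)
{
    assert(loop_count > 0);
    for (size_t i = 0; i < loop_count; i++)
        atomic_store_explicit(&shared_counter, i, memory_order_seq_cst);
    return atomic_load_explicit(&shared_counter, memory_order_seq_cst);
}

/* Plain stores: no other thread can observe tls_counter, so no fence is
   needed. volatile only keeps the compiler from removing the loop.
   Returns the last value stored. */
size_t update_private(size_t loop_count)
{
    assert(loop_count > 0);
    for (size_t i = 0; i < loop_count; i++)
        tls_counter = i;
    return tls_counter;
}
```

Timing these two loops against each other compares "shared plus barrier" with "TLS without barrier", which is closer to the multi-threaded trade-off than the original benchmark, though still synthetic.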
On Sunday, 22 February 2015 at 04:33:58 UTC, Dan Olson wrote:
> Hmm, you got me thinking. An mfence should not be needed for TLS, so in an MT program, expensive TLS lookup could still win. If cache is blown, wouldn't time to reload cache begin to dominate? I know all of this is very architecture dependent, but I have been wary of the number of instructions to do TLS lookup compared to shared. Perhaps I should not be. Am I thinking correctly?

The problem is really in synthetic benchmarks that compare apples and oranges. The "problem" may disappear once TLS tables are loaded into the cache, or if the compiler has moved the "problem" outside of the loop and retained it in a register (which also has a hidden cost).

An x86 cache miss is perhaps 100-200 cycles and a 3rd level cache load/full barrier is 30-40 cycles, but a pure read or write barrier is only a few cycles...

What is the hidden cost of D TLS versus the optimal codegen for a program? I guess you have to compare C vs D on a set of complex programs to figure it all out.
Feb 22 2015
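The hoisting Ola describes can also be done by hand. The sketch below is an illustration with made-up names (`tls_total`, `sum_into_tls`), not something any poster wrote: it resolves the thread-local's address once and then works through an ordinary pointer, so the lookup cost is paid once per call instead of once per access.

```c
#include <assert.h>
#include <stddef.h>

static _Thread_local size_t tls_total;

/* Adds up `count` values into the calling thread's tls_total and returns
   the result. */
size_t sum_into_tls(const size_t *values, size_t count)
{
    assert(values != NULL || count == 0);

    size_t *total = &tls_total;  /* the only TLS address computation */
    *total = 0;
    for (size_t i = 0; i < count; i++)
        *total += values[i];     /* plain pointer access from here on */
    return *total;
}
```

The pointer is only meaningful on the thread that took it and only while that thread is alive, and holding it ties up a register for the duration of the loop, which is presumably the hidden cost Ola refers to.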
On Sunday, 22 February 2015 at 17:36:49 UTC, Ola Fosheim Grøstad wrote:
> The problem is really in synthetic benchmarks that compare apples and oranges. The "problem" may disappear once TLS tables are loaded into the cache, or if the compiler has moved the "problem" outside of the loop and retained it in a register (which also has a hidden cost).
>
> An x86 cache miss is perhaps 100-200 cycles and a 3rd level cache load/full barrier is 30-40 cycles, but a pure read or write barrier is only a few cycles...
>
> What is the hidden cost of D TLS versus the optimal codegen for a program? I guess you have to compare C vs D on a set of complex programs to figure it all out.

Yes, I agree that you can't determine the general performance of TLS from such a simple program. Here's what happened: I was writing a program that could optionally use TLS memory. When I turned on TLS memory it slowed down considerably, but only when using an LLVM compiler. No matter how I used TLS, it was much, much slower when using LLVM.

The simple program is just a simple way to demonstrate that TLS is very slow in one specific type of program. It would be great to see another program that could demonstrate that TLS is actually faster in some use cases. However, since it is so much slower, I think you'll have a hard time finding such an example. The simple program demonstrates that TLS is almost 2 orders of magnitude slower... it may not be that much slower in other types of programs... but with numbers like that it seems obvious that something is wrong.
Feb 22 2015
On Monday, 23 February 2015 at 04:10:29 UTC, Jonathan Marler wrote:
> Here's what happened: I was writing a program that could optionally use TLS memory. When I turned on TLS memory it slowed down considerably, but only when using an LLVM compiler. No matter how I used TLS, it was much, much slower when using LLVM. The simple program is just a simple way to demonstrate that TLS is very slow in one specific type of program.

Yeah, demonstrating that it is slow is reasonable. I was thinking more about the other direction: that either globals or TLS is fast is hard to show without a multi-threaded best-of-breed baseline to compare against. (I.e. that TLS is faster than globals, or the other way around, does not say much since they both can be too slow if the code gen is lacking...)

> It would be great to see another program that could demonstrate that TLS is actually faster in some use cases. However, since it is so much slower, I think you'll have a hard time finding such an example. The simple program demonstrates that TLS is almost 2 orders of magnitude slower... it may not be that much slower in other types of programs... but with numbers like that it seems obvious that something is wrong.

Some other problems with naive TLS are that every thread gets the same dataset, that you pollute 3rd level cache compared to globals, and that globals can be fetched without a register (absolute addressing or relative to program counter). I'd be wary of using TLS for larger data structures, but putting a pointer there instead gives you YET another indirection -> more cache misses...
Feb 23 2015
"Jonathan Marler" <johnnymarler gmail.com> writes:
> Here's what happened: I was writing a program that could optionally use TLS memory. When I turned on TLS memory it slowed down considerably, but only when using an LLVM compiler. No matter how I used TLS, it was much much slower when using LLVM.

Hi Jonathan.

The reason for slowness on OS X is here in the source code on Apple's website. A TLS access has the extra cost of an address lookup by a call to _tlv_get_addr:

_tlv_get_addr:
    movq 8(%rdi),%rax             // get key from descriptor
    movq %gs:0x0(,%rax,8),%rax    // get thread value
    testq %rax,%rax               // if NULL, lazily allocate
    je LlazyAllocate
    addq 16(%rdi),%rax            // add offset from descriptor
    ret
LlazyAllocate:
    ...

http://www.opensource.apple.com/source/dyld/dyld-210.2.3/src/threadLocalHelpers.s
-- Dan
Feb 23 2015
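Read as C, the fast path in Dan's listing amounts to roughly the following. This is a model for illustration only, not Apple's code. The struct and function names, `SLOT_COUNT`, and `BLOCK_SIZE` are invented, and the per-thread slot table that the real routine reads through `%gs` is stood in for by an ordinary `_Thread_local` array. Only the field order (key at offset 8, offset at 16 on a 64-bit target) is taken from the assembly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

enum { SLOT_COUNT = 8, BLOCK_SIZE = 64 };  /* hypothetical sizes */

/* Stand-in for the per-thread slot table addressed via %gs. */
static _Thread_local void *thread_slots[SLOT_COUNT];

/* One descriptor per thread-local variable. The accessor is called
   through `thunk` with the descriptor's own address as argument. */
struct tlv_descriptor_model {
    void *(*thunk)(struct tlv_descriptor_model *);
    unsigned long key;     /* which per-thread slot holds the block */
    unsigned long offset;  /* where the variable sits in that block */
};

/* Returns the calling thread's address for the described variable,
   or NULL if the lazy allocation fails. */
void *tlv_get_addr_model(struct tlv_descriptor_model *d)
{
    assert(d->key < SLOT_COUNT && d->offset < BLOCK_SIZE);

    char *base = thread_slots[d->key];   /* get thread value */
    if (base == NULL) {                  /* if NULL, lazily allocate */
        base = calloc(1, BLOCK_SIZE);
        if (base == NULL)
            return NULL;
        thread_slots[d->key] = base;     /* never freed in this sketch */
    }
    return base + d->offset;             /* add offset from descriptor */
}
```

Even on the fast path, every access is an indirect call plus two dependent loads and a branch, compared with a single addressing mode for a plain global.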
On 2015-02-23 17:18, Dan Olson wrote:
> Hi Jonathan. The reason for slowness on OS X is here in the source code on Apple's website. A TLS access has the extra cost of an address lookup by a call to _tlv_get_addr:

Other platforms also use an extra function for some models, i.e. when support for dynamic libraries is required.
-- /Jacob Carlborg
Feb 23 2015
On Tuesday, 17 February 2015 at 02:41:12 UTC, Jonathan Marler wrote:
> I've noticed that on my Windows 7 development machine, switching between TLS and non-TLS storage has a minimal impact on performance (when using DMD). I haven't tried LDC yet. However, on a MacBook Pro, which uses clang (LLVM) for the linker, using TLS has a huge performance impact (much, much slower). Does anyone know if this is because of the way LLVM handles TLS storage? I'll have to try using LDC on my Windows machine, but maybe one of you knows offhand whether or not LLVM has some performance problems with TLS storage. Thanks!

It has little to do with the linker or LLVM. DMD doesn't use the native TLS APIs on OS X, as Dan says, because OS X didn't have native TLS back then:

http://www.drdobbs.com/architecture-and-design/implementing-thread-local-storage-on-os/228701185

Druntime has since been updated to call pthread_setspecific and pthread_getspecific, but maybe that's still slower than non-TLS on OS X:

https://github.com/D-Programming-Language/druntime/blob/master/src/rt/sections_osx.d#L151

As Dan noted, David got LDC working with the since-added undocumented TLS, i.e. TLV, functions on OS X:

https://github.com/ldc-developers/druntime/blob/ldc/src/ldc/osx_tls.c

On Tuesday, 17 February 2015 at 06:47:12 UTC, Martin Nowak wrote:
> On Tuesday, 17 February 2015 at 06:16:04 UTC, Dan Olson wrote:
>> Try LDC and see if the performance improves because LDC uses OS X native TLS.
> Is there more information available about OS X's TLS support and how this is implemented in LDC? What version of OS X is required? I'd very much like to use that for DMD/druntime too, so that we can go on with the shared library support.

The functions David used were added in 10.7.
Feb 17 2015
Hi Jonathan!

On Tuesday, 17 February 2015 at 02:41:12 UTC, Jonathan Marler wrote:
> I've noticed that on my Windows 7 development machine, switching between TLS and non-TLS storage has a minimal impact on performance (when using DMD). I haven't tried LDC yet. However, on a MacBook Pro, which uses clang (LLVM) for the linker, using TLS has a huge performance impact (much, much slower). Does anyone know if this is because of the way LLVM handles TLS storage? I'll have to try using LDC on my Windows machine, but maybe one of you knows offhand whether or not LLVM has some performance problems with TLS storage. Thanks!

On Windows, LLVM uses the segment registers for TLS storage (fs: for 32-bit and gs: for 64-bit). There is no other impact.

Regards,
Kai
Feb 17 2015
On Tuesday, 17 February 2015 at 22:50:29 UTC, Kai Nacke wrote:
> On Windows, LLVM uses the segment registers for TLS storage (fs: for 32-bit and gs: for 64-bit). There is no other impact.

Hijacking this as I'm investigating how TLS plays with shared objects.

So, my understanding is that the segment register is used as a base for TLS, and TLS globals are indexed using this register as a base. This sounds like it can work until you have to cross shared object boundaries. What is done in this case?
Feb 19 2015
On 2015-02-19 23:27, deadalnix wrote:
> Hijacking this as I'm investigating how TLS plays with shared objects.

I guess you can read the documentation at [1] and the source code in druntime for Linux and FreeBSD.

> So, my understanding is that the segment register is used as a base for TLS, and TLS globals are indexed using this register as a base. This sounds like it can work until you have to cross shared object boundaries. What is done in this case?

I think that is when a runtime helper function is used, i.e. __tls_get_addr. Multiple models of TLS exist, and they vary between platforms and depending on what features are needed, i.e. shared objects. There's some documentation [1].

[1] http://www.akkadia.org/drepper/tls.pdf
-- /Jacob Carlborg
Feb 20 2015
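On ELF platforms with GCC or Clang, the model Jacob mentions can be requested per variable with the `tls_model` attribute. The sketch below only illustrates the syntax. The variable and function names are hypothetical, the attribute is a compiler extension that may have no effect on other platforms, and the compiler or linker may still relax a general model to a cheaper one when it can prove that is safe.

```c
#include <assert.h>
#include <limits.h>

/* No attribute: the compiler picks a model from the build flags
   (for example, position-independent code gets a more general one). */
__thread int hits_default;

/* initial-exec: the offset from the thread pointer is fixed at load time,
   so an access is a segment-relative load with no helper call. */
__thread int hits_initial_exec __attribute__((tls_model("initial-exec")));

/* global-dynamic: the most general model, usable from any shared object.
   An access may go through the runtime helper __tls_get_addr. */
__thread int hits_global_dynamic __attribute__((tls_model("global-dynamic")));

int bump_default(void)
{
    assert(hits_default < INT_MAX);
    return ++hits_default;
}

int bump_initial_exec(void)
{
    assert(hits_initial_exec < INT_MAX);
    return ++hits_initial_exec;
}

int bump_global_dynamic(void)
{
    assert(hits_global_dynamic < INT_MAX);
    return ++hits_global_dynamic;
}
```

Comparing the generated assembly for the three functions (for instance with `-O2 -fPIC -S`) is one way to see where the helper call appears and where a plain segment-relative access is used.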