digitalmars.D - Some GC and emulated TLS questions (GDC related)

Johannes Pfau <nospam example.com> writes:

As you might know, GDC currently doesn't properly hook up the GC to the
GCC emulated TLS support in libgcc. Because of that, TLS memory is not
scanned on target systems with emulated TLS. For GCC this includes
MinGW, Android (although Google switched to LLVM anyway) and some more
architectures. Proper integration likely needs some modifications in
the libgcc emutls code so I need some more information about the GC to
really propose a good solution.


The main problem is that GCC emutls does not use contiguous memory
blocks. So instead of scanning one range containing N variables we'll
have one range for every single TLS variable per thread.
So assuming we could iterate over all these variables (this would be
an extension required in libgcc), would scanTLSRanges in rt.sections
produce acceptable performance in these cases? Depending on the
number of TLS variables and threads there may be thousands of ranges
to scan.

Another solution could be to enhance libgcc emutls to allow custom
allocators, then have a special allocation function in druntime for all
D emutls variables. As far as I know there is no GC heap that is
scanned, but not automatically collected? I'd need a way to completely
manually manage GC.malloc/GC.free memory without the GC collecting this
memory, but still scanning this memory for pointers. Does something
like this exist?

Another option is simply using the DMD-style emutls. But as far as I can
see the DMD implementation never supported dynamic loading of shared
libraries? This is something the GCC emutls support is quite good at:
It doesn't have any platform dependencies (apart from mutexes and some
way to store one thread specific pointer+destructor) and should work
with all kinds of shared library combinations. DMD style emutls also
does not allow sharing TLS variables between D and other languages.

So I was thinking, if DMD style emutls really doesn't support shared
libraries, maybe we should just clone a GCC-style, compiler and OS
agnostic emutls implementation into druntime? A D implementation could
simply allocate all internal arrays using the GC. This should be just
as efficient as the C implementation for variable access and interfacing
to the GC is trivial. It gets somewhat more complicated if we want to
use this in betterC though. We also lose C/C++ compatibility though by
using such a custom implementation.




The rest of this post is a description of the GCC emutls code. Someone
can use this specification to implement a clean-room design D emutls
clone.
Source code can be found here, but beware of the GPL license:
https://github.com/gcc-mirror/gcc/blob/master/libgcc/emutls.c

Unlike DMD TLS, the GCC TLS code does not put all initialization memory
into one section. In fact, the code is agnostic of both the runtime
linker and the compile-time linker, so it can't use section start/stop
markers. Instead, every TLS variable is handled individually. For every
variable, an instance of __emutls_object is created in the (writeable)
data segment. __emutls_object is defined as:

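struct __emutls_object
{
    word size;    // size of the variable in bytes
    word align;   // required alignment in bytes
    // offset: 1-based index into the per-thread array, 0 = not assigned yet
    // ptr: only used by the single-threaded optimization
    union { pointer offset; void* ptr; };
    void* templ;  // initial value copied into each thread's instance
}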

The void* ptr is only used as an optimization for single threaded
programs, so I'll ignore this for now in the further description.

Whenever such a variable is accessed, the compiler calls
__emutls_get_address(&(__emutls_object in data segment)). This function
first does an atomic load of the __emutls_object.offset variable. If it
is zero, this particular TLS variable has not been accessed in any
thread before.

If this is the case, we first check whether the global emutls
initialization function (emutls_init) has already run and, if not, run
it via __gthread_once. The initialization function initializes the mutex
variable and creates a thread-specific key using __gthread_key_create
with the destructor function set to emutls_destroy.

Back to __emutls_get_address: If offset was zero and we ran the
emutls_init if required, we now lock the mutex. We have a global
variable emutls_size to count the number of total variables. We now
increase the emutls_size counter and atomically set
__emutls_object.offset = emutls_size.
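
A minimal sketch of the index-assignment step just described, written
against POSIX threads rather than the __gthread wrappers. That
substitution is an assumption, as are all names (tls_object, tls_index,
tls_count); this is not the libgcc code. The sketch also re-checks
offset after taking the lock, which the description above does not
spell out but which a second racing thread would need.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical control object; field names are illustrative only. */
struct tls_object {
    size_t size;
    size_t align;
    atomic_size_t offset;   /* 0 means "no index assigned yet" */
    const void *templ;
};

static pthread_once_t tls_once = PTHREAD_ONCE_INIT;
static pthread_mutex_t tls_mutex;
static pthread_key_t tls_key;
static size_t tls_count;    /* number of indices handed out so far */

/* Per-thread cleanup would go here; left empty in this sketch. */
static void tls_thread_exit(void *arr) { (void)arr; }

/* Runs exactly once per process, via pthread_once. */
static void tls_global_init(void)
{
    pthread_mutex_init(&tls_mutex, NULL);
    pthread_key_create(&tls_key, tls_thread_exit);
}

/* Returns the 1-based index of obj, assigning one on first use. */
static size_t tls_index(struct tls_object *obj)
{
    size_t off = atomic_load_explicit(&obj->offset, memory_order_acquire);
    if (off != 0)
        return off;                       /* fast path: already assigned */

    pthread_once(&tls_once, tls_global_init);
    pthread_mutex_lock(&tls_mutex);
    off = atomic_load_explicit(&obj->offset, memory_order_relaxed);
    if (off == 0) {                       /* re-check under the lock */
        off = ++tls_count;
        atomic_store_explicit(&obj->offset, off, memory_order_release);
    }
    pthread_mutex_unlock(&tls_mutex);
    return off;
}
```

Calling tls_index twice on the same object returns the same index; a
different object gets the next one.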

We now have an __emutls_object.offset index assigned, either through the
procedure described above or because we're being called again at a later
stage and offset was already nonzero. Next we get a per-thread pointer
using __gthread_getspecific. This is a pointer to an __emutls_array,
which is simply a size value followed by size void* entries. If
__gthread_getspecific returns null, this is the first time a TLS
variable is accessed in this thread. In that case we allocate a new
__emutls_array (size = emutls_size + 32 + 1 for the size field) and save
it using __gthread_setspecific. If we already had an array for this
thread, we check whether the __emutls_object.offset index is larger than
the array. If so, we reallocate the array (double the size; if that is
still too small, add +32; then either way add +1) and update the pointer
using __gthread_setspecific.
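
The sizing rules above can be sketched roughly as follows. The names
(tls_array, ensure_slot, grown_slots) are hypothetical, and reading
"add +32" as "requested index + 32" is my assumption. The "+1 for the
size field" is covered here by the struct header instead of an extra
array element.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Per-thread table: a slot count followed by that many pointers. */
struct tls_array {
    size_t size;      /* number of usable slots in data[] */
    void *data[];     /* one pointer per TLS variable, NULL until first use */
};

/* Capacity of a thread's first table: all variables known so far plus headroom. */
static size_t initial_slots(size_t total_vars) { return total_vars + 32; }

/* Growth rule: double, and if that is still too small use needed + 32. */
static size_t grown_slots(size_t old_slots, size_t needed)
{
    size_t n = old_slots * 2;
    if (n < needed)
        n = needed + 32;
    return n;
}

/* Makes sure arr has a slot for 1-based index idx.
   Returns the (possibly moved) table, or NULL if allocation fails,
   in which case the old table is left untouched. */
static struct tls_array *ensure_slot(struct tls_array *arr, size_t idx,
                                     size_t total_vars)
{
    size_t old = arr ? arr->size : 0;
    if (idx <= old)
        return arr;

    size_t n = arr ? grown_slots(old, idx) : initial_slots(total_vars);
    if (n < idx)
        n = idx;

    struct tls_array *p = realloc(arr, sizeof *p + n * sizeof(void *));
    if (p == NULL)
        return NULL;
    /* New slots start out empty (assumes all-bits-zero is a null pointer). */
    memset(p->data + old, 0, (n - old) * sizeof(void *));
    p->size = n;
    return p;
}
```

Because realloc may move the table, the caller has to store the returned
pointer back into the thread-specific slot each time it changes.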

Now we have enough space in the thread-specific array in either case, so
look at array[offset-1]. If this is null, allocate a new object
(emutls_alloc) and set the array value. Return the array value at index
offset-1.
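
As a rough illustration of this last lookup step, here is a sketch with
a hypothetical helper slot_get. It uses plain malloc and ignores
alignment; zero-filling when no template is given is my assumption, not
something stated above.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Returns this thread's storage for the variable with 1-based index idx,
   creating it from templ on first access. slots must have at least idx
   entries. Returns NULL only if allocation fails. */
static void *slot_get(void **slots, size_t idx, size_t size, const void *templ)
{
    assert(idx >= 1);
    void **slot = &slots[idx - 1];
    if (*slot == NULL) {
        void *p = malloc(size);
        if (p == NULL)
            return NULL;
        if (templ != NULL)
            memcpy(p, templ, size);   /* copy the initial value */
        else
            memset(p, 0, size);       /* assumption: no template means zero-init */
        *slot = p;
    }
    return *slot;
}
```

The first access pays for the allocation and the copy; every later
access in the same thread is just an array load.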

The emutls_alloc function is simple: Allocate __emutls_object.size
bytes with __emutls_object.align alignment. In order to ensure
alignment, the libgcc implementation uses malloc, then manually adjusts
the pointer. As the original pointer is needed for free, the
implementation allocates void*.sizeof more bytes and stores the
original malloc pointer at the start of the allocated data block. The
returned value is offset by void*.sizeof into the data block. Finally
copy __emutls_object.templ into the newly allocated data block
(initialization).
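
A sketch of this over-allocate-and-stash technique, with hypothetical
names (aligned_block_new, aligned_block_free). It assumes align is a
power of two and keeps the raw malloc pointer in the bytes directly
before the returned address; zero-filling when there is no template is
again my assumption.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Allocates size bytes aligned to align and initializes them from templ. */
static void *aligned_block_new(size_t size, size_t align, const void *templ)
{
    assert(align != 0 && (align & (align - 1)) == 0);

    /* Room for the payload, the stashed pointer and worst-case padding. */
    void *raw = malloc(size + sizeof(void *) + align - 1);
    if (raw == NULL)
        return NULL;

    /* Skip past the stash area, then round up to the alignment. */
    uintptr_t p = ((uintptr_t)raw + sizeof(void *) + align - 1)
                  & ~(uintptr_t)(align - 1);
    void *ret = (void *)p;

    /* Remember the raw pointer just before the payload, for free(). */
    memcpy((char *)ret - sizeof raw, &raw, sizeof raw);

    if (templ != NULL)
        memcpy(ret, templ, size);
    else
        memset(ret, 0, size);
    return ret;
}

/* Releases a block obtained from aligned_block_new. */
static void aligned_block_free(void *ret)
{
    if (ret == NULL)
        return;
    void *raw;
    memcpy(&raw, (char *)ret - sizeof raw, sizeof raw);
    free(raw);
}
```

Where a platform offers an aligned allocation call, that could replace
the manual rounding, but plain malloc keeps the sketch dependency-free.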

The last missing function is emutls_destroy. It is called by the
__gthread layer when a thread that has a value stored under the key
exits, and it receives a void* argument pointing to the per-thread
array. The code simply iterates over the array, gets the original
pointers (offset -1 in the allocated blocks) and frees the data.
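
The cleanup could look roughly like this. All names are hypothetical;
block_new stands in for the allocation function and, as an assumption
of this sketch, stores the raw malloc pointer directly in front of the
payload. The wrapper tls_thread_exit has the signature a POSIX key
destructor expects.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Per-thread table: a slot count followed by that many pointers. */
struct tls_array {
    size_t size;
    void *data[];
};

/* Allocates size bytes and hides the raw malloc pointer just before
   the returned address. */
static void *block_new(size_t size)
{
    void *raw = malloc(sizeof raw + size);
    if (raw == NULL)
        return NULL;
    memcpy(raw, &raw, sizeof raw);
    return (char *)raw + sizeof raw;
}

/* Frees every allocated variable and then the table itself.
   Returns how many variables were freed. */
static size_t free_all_slots(struct tls_array *arr)
{
    size_t freed = 0;
    for (size_t i = 0; i < arr->size; ++i) {
        if (arr->data[i] != NULL) {
            void *raw;
            memcpy(&raw, (char *)arr->data[i] - sizeof raw, sizeof raw);
            free(raw);
            ++freed;
        }
    }
    free(arr);
    return freed;
}

/* Destructor in the shape pthread_key_create expects. */
static void tls_thread_exit(void *arg)
{
    if (arg != NULL)
        free_all_slots(arg);
}
```

Slots that were never touched in the exiting thread are still null and
are simply skipped.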

-- Johannes

Jul 14 2017

Kagamin <spam here.lot> writes:

Just allocate emutls array in managed heap and pin it somewhere, 
then everything referenced by it will be preserved.

Jul 14 2017

Johannes Pfau <nospam example.com> writes:

On Fri, 14 Jul 2017 12:47:55 +0000,
Kagamin <spam here.lot> wrote:

 Just allocate emutls array in managed heap and pin it somewhere, 
 then everything referenced by it will be preserved.

This is basically the option of replicating GCC-style emutls in
druntime. This is quite simple to implement and you don't even need
special pinning, as the Thread instance object in core.thread can refer
to the TLS array.

This solution can't be implemented in libgcc though, as obviously the
GC is not always available to allocate the arrays in pure C programs ;-)


-- Johannes

Jul 16 2017

Joakim <dlang joakim.fea.st> writes:

On Friday, 14 July 2017 at 09:13:26 UTC, Johannes Pfau wrote:
 Another solution could be to enhance libgcc emutls to allow 
 custom allocators, then have a special allocation function in 
 druntime for all D emutls variables. As far as I know there is 
 no GC heap that is scanned, but not automatically collected?

I believe that's what's done with the TLS ranges now, they're 
scanned but not collected, though they're not part of the GC heap.

 I'd need a way to completely manually manage GC.malloc/GC.free 
 memory without the GC collecting this memory, but still 
 scanning this memory for pointers. Does something like this 
 exist?

It doesn't have to be GC.malloc/GC.free, right?  The current 
DMD-style emutls simply mallocs and frees the TLS data itself and 
only expects the GC to scan it.

 Another option is simply using the DMD-style emutls. But as far 
 as I can see the DMD implementation never supported dynamic 
 loading of shared libraries? This is something the GCC emutls 
 support is quite good at: It doesn't have any platform 
 dependencies (apart from mutexes and some way to store one 
 thread specific pointer+destructor) and should work with all 
 kinds of shared library combinations. DMD style emutls also 
 does not allow sharing TLS variables between D and other 
 languages.

Yes, DMD's emutls was never made to work with loading multiple 
shared libraries.  As for sharing with other languages without 
copying the TLS data over, that seems a rare scenario.

 So I was thinking, if DMD style emutls really doesn't support 
 shared libraries, maybe we should just clone a GCC-style, 
 compiler and OS agnostic emutls implementation into druntime? A 
 D implementation could simply allocate all internal arrays 
 using the GC. This should be just as efficient as the C 
 implementation for variable access and interfacing to the GC is 
 trivial. It gets somewhat more complicated if we want to use 
 this in betterC though. We also lose C/C++ compatibility though 
 by using such a custom implementation.

It would be a good alternative to have, and you're not going to 
care in betterC mode, since there's no druntime or GC.  You'd 
have to be careful how you called TLS data from C/C++, but it 
could still be done.

 The rest of this post is a description of the GCC emutls code. 
 Someone
 can use this specification to implement a clean-room design D 
 emutls
 clone.
 Source code can be found here, but beware of the GPL license:
 https://github.com/gcc-mirror/gcc/blob/master/libgcc/emutls.c

 [...]

There is also this llvm implementation, available under 
permissive licenses and actually documented somewhat:

https://github.com/llvm-mirror/compiler-rt/blob/master/lib/builtins/emutls.c

Jul 15 2017

Johannes Pfau <nospam example.com> writes:

On Sat, 15 Jul 2017 10:49:39 +0000,
Joakim <dlang joakim.fea.st> wrote:

 On Friday, 14 July 2017 at 09:13:26 UTC, Johannes Pfau wrote:
 Another solution could be to enhance libgcc emutls to allow 
 custom allocators, then have a special allocation function in 
 druntime for all D emutls variables. As far as I know there is 
 no GC heap that is scanned, but not automatically collected?  

 
 I believe that's what's done with the TLS ranges now, they're 
 scanned but not collected, though they're not part of the GC heap.

Indeed. We used to use GC.addRange for this, and that was said to be
slow when using many ranges. So I'm basically asking whether the scan
delegate has the same problem or whether it can cope with thousands
of small ranges.
A scanned but not collected heap is slightly different, as the GC can
internally treat the allocator memory as one huge memory range. When
allocating using C malloc, every single allocation needs to be scanned
individually. A scan-but-do-not-collect allocator can probably be built
using the std.experimental.allocator primitives, but that code is not in
druntime.
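
To illustrate the "one huge memory range" point, here is a toy bump
arena in C. All names are hypothetical and this is not an existing
druntime or libgcc facility. Every TLS block comes out of one
contiguous buffer, so a scanner only has to visit the single range
[base, base + used) instead of one range per variable. Blocks cannot be
freed individually in this sketch.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* One contiguous buffer from which all blocks are carved. */
struct arena {
    unsigned char *base;
    size_t cap;    /* total bytes available */
    size_t used;   /* bytes handed out so far, including padding */
};

/* Returns nonzero on success. */
static int arena_init(struct arena *a, size_t cap)
{
    a->base = malloc(cap);
    a->cap = a->base ? cap : 0;
    a->used = 0;
    return a->base != NULL;
}

/* Hands out size bytes aligned to align (a power of two), or NULL if full. */
static void *arena_alloc(struct arena *a, size_t size, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    uintptr_t cur = (uintptr_t)a->base + a->used;
    uintptr_t aligned = (cur + align - 1) & ~(uintptr_t)(align - 1);
    size_t start = a->used + (size_t)(aligned - cur);
    if (start > a->cap || size > a->cap - start)
        return NULL;
    a->used = start + size;
    return a->base + start;
}

/* The single range a conservative scanner would need to visit. */
static void arena_scan_range(const struct arena *a,
                             const void **lo, const void **hi)
{
    *lo = a->base;
    *hi = a->base + a->used;
}
```

In a D runtime the equivalent of arena_scan_range would presumably be
one registered range per arena rather than one per variable.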

 
 I'd need a way to completely manually manage GC.malloc/GC.free 
 memory without the GC collecting this memory, but still 
 scanning this memory for pointers. Does something like this 
 exist?  

 
 It doesn't have to be GC.malloc/GC.free, right?  The current 
 DMD-style emutls simply mallocs and frees the TLS data itself and 
 only expects the GC to scan it.

The problem here again is whether this scales properly when using
thousands of non-contiguous memory ranges. DMD-style TLS can allocate
one memory block per thread for all variables. GCC style will allocate
one block per thread per variable.

 
 Another option is simply using the DMD-style emutls. But as far 
 as I can see the DMD implementation never supported dynamic 
 loading of shared libraries? This is something the GCC emutls 
 support is quite good at: It doesn't have any platform 
 dependencies (apart from mutexes and some way to store one 
 thread specific pointer+destructor) and should work with all 
 kinds of shared library combinations. DMD style emutls also 
 does not allow sharing TLS variables between D and other 
 languages.  

 
 Yes, DMD's emutls was never made to work with loading multiple 
 shared libraries.  As for sharing with other languages without 
 copying the TLS data over, that seems a rare scenario.

Yes, probably the best solution for now is to reimplement GCC style
emutls with shared library support in druntime for all compilers and
forget about C/C++ TLS compatibility. Even if we could get patches into
libgcc it'd take years till all relevant systems have been updated to
new libgcc versions.

 
 So I was thinking, if DMD style emutls really doesn't support 
 shared libraries, maybe we should just clone a GCC-style, 
 compiler and OS agnostic emutls implementation into druntime? A 
 D implementation could simply allocate all internal arrays 
 using the GC. This should be just as efficient as the C 
 implementation for variable access and interfacing to the GC is 
 trivial. It gets somewhat more complicated if we want to use 
 this in betterC though. We also lose C/C++ compatibility though 
 by using such a custom implementation.  

 
 It would be a good alternative to have, and you're not going to 
 care in betterC mode, since there's no druntime or GC.  You'd 
 have to be careful how you called TLS data from C/C++, but it 
 could still be done.
 
 The rest of this post is a description of the GCC emutls code. 
 Someone
 can use this specification to implement a clean-room design D 
 emutls
 clone.
 Source code can be found here, but beware of the GPL license:
 https://github.com/gcc-mirror/gcc/blob/master/libgcc/emutls.c

 [...]  

 
 There is also this llvm implementation, available under 
 permissive licenses and actually documented somewhat:
 
 https://github.com/llvm-mirror/compiler-rt/blob/master/lib/builtins/emutls.c

Unfortunately that license is also not Boost-compatible, so as far as I
can see we can't simply port that code either?


-- Johannes

Jul 16 2017

Iain Buclaw via Digitalmars-d <digitalmars-d puremagic.com> writes:

On 16 July 2017 at 14:37, Johannes Pfau via Digitalmars-d
<digitalmars-d puremagic.com> wrote:
 On Sat, 15 Jul 2017 10:49:39 +0000,
 Joakim <dlang joakim.fea.st> wrote:

 On Friday, 14 July 2017 at 09:13:26 UTC, Johannes Pfau wrote:
 Another solution could be to enhance libgcc emutls to allow
 custom allocators, then have a special allocation function in
 druntime for all D emutls variables. As far as I know there is
 no GC heap that is scanned, but not automatically collected?

 I believe that's what's done with the TLS ranges now, they're
 scanned but not collected, though they're not part of the GC heap.

 Indeed. We used to use GC.addRange for this and this was said to be
 slow when using many ranges. So I'm basically asking whether the scan
 delegate has got the same problem or whether it can cope with thousands
 of small ranges.
 A scanned but not collected heap is slightly different, as the GC can
 internally treat the allocator memory as one huge memory range. When
 allocating using C malloc, every single allocation needs to be scanned
 individually. A scan+/do not collect allocator can probably be built
 using the std.experimental.allocator primitives but that code is not in
 druntime.

 I'd need a way to completely manually manage GC.malloc/GC.free
 memory without the GC collecting this memory, but still
 scanning this memory for pointers. Does something like this
 exist?

 It doesn't have to be GC.malloc/GC.free, right?  The current
 DMD-style emutls simply mallocs and frees the TLS data itself and
 only expects the GC to scan it.

 The problem here again is whether this scales properly when using
 thousands of non contiguous memory ranges. DMD style TLS can allocate
 one memory block per thread for all variables. GCC style will allocate
 one block per thread and variable.

 Another option is simply using the DMD-style emutls. But as far
 as I can see the DMD implementation never supported dynamic
 loading of shared libraries? This is something the GCC emutls
 support is quite good at: It doesn't have any platform
 dependencies (apart from mutexes and some way to store one
 thread specific pointer+destructor) and should work with all
 kinds of shared library combinations. DMD style emutls also
 does not allow sharing TLS variables between D and other
 languages.

 Yes, DMD's emutls was never made to work with loading multiple
 shared libraries.  As for sharing with other languages without
 copying the TLS data over, that seems a rare scenario.

 Yes, probably the best solution for now is to reimplement GCC style
 emutls with shared library support in druntime for all compilers and
 forget about C/C++ TLS compatibility. Even if we could get patches into
 libgcc it'd take years till all relevant systems have been updated to
 new libgcc versions.

I sense a revert coming on...

https://github.com/D-Programming-GDC/GDC/commit/cf5e9e323b26d21a652bc2933dd886faba90281c

Iain.

Jul 16 2017

Johannes Pfau <nospam example.com> writes:

On Sun, 16 Jul 2017 14:48:04 +0200,
Iain Buclaw via Digitalmars-d <digitalmars-d puremagic.com> wrote:

 
 I sense a revert coming on...
 
 https://github.com/D-Programming-GDC/GDC/commit/cf5e9e323b26d21a652bc2933dd886faba90281c
 
 Iain.

Correct, though more in a metaphorical sense ;-)

Ideally, I'd want a Boost-licensed, high-level D implementation in
core.thread. Instead of using __gthread_getspecific/__gthread_setspecific,
we simply add a GC-managed (i.e. plain stupid) void[][] _tlsVars array to
core.thread.Thread, use core.sync for locking and core.atomic to manage
array indices. With all the high-level stuff we can reuse from druntime
(resizing/reserving arrays), such an implementation is probably < 100
LOC. Most importantly, as we can't override the functions in libgcc,
we'd also use custom function names (__d_emutls_get_address).

The one thing stopping me though is that I don't think I can implement
this and boost-license it now that I almost know the libgcc
implementation by heart...

-- Johannes

Jul 16 2017

Joakim <dlang joakim.fea.st> writes:

On Sunday, 16 July 2017 at 12:37:26 UTC, Johannes Pfau wrote:
 Yes, probably the best solution for now is to reimplement GCC 
 style emutls with shared library support in druntime for all 
 compilers and forget about C/C++ TLS compatibility. Even if we 
 could get patches into libgcc it'd take years till all relevant 
 systems have been updated to new libgcc versions.

It might be worth doing anyway, considering the rise of GC 
languages like D and Go.

 There is also this llvm implementation, available under 
 permissive licenses and actually documented somewhat:
 
 https://github.com/llvm-mirror/compiler-rt/blob/master/lib/builtins/emutls.c

 Unfortunately also not boost compatible, so we can't simply 
 port that code either, as far as I can see?

Yes, it can't simply be relicensed as Boost. The UIUC/MIT dual 
license it's under is very permissive, but each license has 
advertising and license text inclusion clauses that are not 
compatible with the Boost license.

On Sunday, 16 July 2017 at 14:10:45 UTC, Johannes Pfau wrote:
 On Sun, 16 Jul 2017 14:48:04 +0200,
 Iain Buclaw via Digitalmars-d 
 <digitalmars-d puremagic.com> wrote:

 
 I sense a revert coming on...
 
 https://github.com/D-Programming-GDC/GDC/commit/cf5e9e323b26d21a652bc2933dd886faba90281c
 
 Iain.

 Correct, though more in a metaphorical sense ;-)

 Ideally, I'd want a boost licensed, high level D implementation 
 in core.thread. Instead of using __gthread get/setspecific, we 
 simply add a GC managed (i.e. plain stupid) void[][] _tlsVars 
 array to core.thread.Thread, use core.sync for locking and 
 core.atomic to manage array indices. With all the high-level 
 stuff we can reuse from druntime (resizing/reserving arrays) 
 such an implementation is probably < 100 LOC. Most importantly, 
 as we can't overwrite the functions in libgcc we'd also use 
 custom function names (__d_emutls_get_address).

 The one thing stopping me though is that I don't think I can 
 implement this and boost-license it now that I almost know the 
 libgcc implementation by heart...

Sounds like a worthwhile effort.  If it requires someone who's 
never looked at the libgcc implementation, you could try asking 
in the LDC forum or asking someone who's contributed to the GC.  
Maybe Dmitry could whip this up for us?

Jul 23 2017
