digitalmars.D - Testing some singleton implementations
- Andrej Mitrovic (114/114) Jan 31 2014 There was a nice blog-post about implementing low-lock singletons in D, ...
- Stanislav Blinov (12/12) Jan 31 2014 You forgot to make the flag static for AtomicSingleton. I'd also
- Andrej Mitrovic (9/15) Jan 31 2014 Here's mine:
- Stanislav Blinov (7/19) Jan 31 2014 It is :)
- Benjamin Thaut (16/28) Jan 31 2014 For x86 CPUs you don't really need MemoryOrder.acq as reads are atomic
- Andrej Mitrovic (4/6) Jan 31 2014 Hmm, I guess we could use a version(X86) block to pick this. When you
- Benjamin Thaut (11/17) Jan 31 2014 It depends on the processor architecture. Usually if you have a "normal"...
- Benjamin Thaut (4/4) Jan 31 2014 If you need the details, read:
- Jonathan Bettencourt (2/2) Jan 31 2014 Is it just me or does the implementation of atomic.d look grossly
- Benjamin Thaut (6/8) Jan 31 2014 I can't really judge that, as I don't have much experience in lock free
- Andrej Mitrovic (3/5) Jan 31 2014 Aye it's been on my todo list forever, even though I've read the first
- Benjamin Thaut (5/11) Jan 31 2014 You should really take the time to read it. Its one of the best articles...
- Stanislav Blinov (7/9) Jan 31 2014 Uhm... atomicLoad() itself guarantees that the read is atomic.
- Stanislav Blinov (17/17) Jan 31 2014 In fact #2, I think it's even safe to pull that store out of the
- Dmitry Olshansky (7/23) Jan 31 2014 //(4)
- Stanislav Blinov (8/35) Jan 31 2014 Nope. The only way the thread is going to end up past the null
- Stanislav Blinov (8/38) Feb 01 2014 To clarify: only one thread will ever get to position (3). All
- Dmitry Olshansky (5/41) Feb 01 2014 Yes, I see there could be many writes to _instantiated field but not
- Stanislav Blinov (10/10) Feb 07 2014 There's a lot more to these singletons than meets the eye.
- Jonathan Bettencourt (4/9) Feb 07 2014 I agree that acq/rel is the correct way to go, but it will cause
- Cecil Ward (19/29) Feb 27 2014 Hi Martin, Sean, Stanislav et al
- Stanislav Blinov (22/44) Mar 03 2014 When I said "review" I meant this specific issue, e.g.
- Dmitry Olshansky (6/15) Jan 31 2014 And it was a big thing because of that. Also keep in mind that atomic
- Andrej Mitrovic (3/6) Jan 31 2014 Hmm yeah, but I was expecting better numbers. Even after the 'static'
- Andrej Mitrovic (18/20) Jan 31 2014 Actually, I think I understand why this happens. Logically, the atomic
- Stanislav Blinov (15/21) Jan 31 2014 Easy enough to test. But inconclusive. I just ran some tests with
- Andrej Mitrovic (3/7) Jan 31 2014 Hmm.. Well I know we've had some issues with threads on FreeBSD. It's
- Stanislav Blinov (29/37) Jan 31 2014 I'm not comfortable with that atomicOp in the thread function.
- Andrej Mitrovic (14/17) Feb 04 2014 I've finally managed to build LDC2 on Windows (MinGW version), here
- Stanislav Blinov (7/22) Feb 04 2014 :)
- Andrej Mitrovic (7/11) Feb 04 2014 I haven't figured out exactly what you're trying to swap there. Do you
- Stanislav Blinov (7/21) Feb 04 2014 Both atomicLoad and atomicStore use raw MemoryOrder, and also the
- Andrej Mitrovic (5/8) Feb 05 2014 No difference, but maybe the timing precision isn't proper. It always
- Stanislav Blinov (7/12) Feb 05 2014 Hmm... It should be as proper as it gets, judging from
- Jonathan Bettencourt (5/17) Feb 05 2014 The atomics implementation in druntime is very inefficient, it
- Andrej Mitrovic (2/4) Feb 04 2014 s/:/?
- Jerry (25/42) Feb 04 2014 Here's the best and worst times I get on my linux laptop. These are
- Stanislav Blinov (5/19) Feb 05 2014 Whoah, those times for AtomicSingleton are way high. What kind of
- Jerry (4/28) Feb 05 2014 Core 2 Due T9400. The gdc times were much better for AtomicSingleton -
- Stanislav Blinov (6/7) Feb 05 2014 Here's my latest revision: http://dpaste.dzfl.pl/5b54df1c7004
- Jerry (23/29) Feb 06 2014 Yup, that helps out the AtomicSingleton a lot. Here's best and worst
- Sean Kelly (4/4) Feb 07 2014 Weird. atomicLoad(raw) should be the same as atomicLoad(acq),
- Stanislav Blinov (9/13) Feb 07 2014 huh?
- Sean Kelly (7/20) Feb 07 2014 Oops. I thought that since Intel has officially defined loads as
- Stanislav Blinov (71/76) Feb 07 2014 Offhand - no. But who forbids empirical tests? :)
- Sean Kelly (3/4) Feb 07 2014 Awesome. Then I think we can go back to the old logic.
- Stanislav Blinov (12/17) Feb 07 2014 Cool. Also, from
- Marco Leise (8/30) Feb 07 2014 Strong-ordering does not work on x86/amd64 in two cases:
- Martin Nowak (2/11) Feb 08 2014 So, who is going to fix core.atomic?
- Stanislav Blinov (2/3) Feb 09 2014 I was under impression that Sean was onto it.
- Martin Nowak (2/5) Feb 09 2014 Can you please submit a bug report, so we don't loose track of this.
- Stanislav Blinov (3/5) Feb 09 2014 Sure:
- Iain Buclaw (10/30) Feb 07 2014 atomicStore(raw) should be the same as atomicStore(rel). At least on x8...
- Marco Leise (12/46) Feb 07 2014 You send shared variables as "volatile" to the backend and
- Iain Buclaw (10/54) Feb 09 2014 having
- Stanislav Blinov (2/2) Feb 09 2014 Isn't it great how a simple benchmark thread can reveal such
- Marco Leise (7/67) Feb 17 2014 ut
- Stanislav Blinov (3/30) Feb 07 2014 Nice.
- Marco Leise (25/59) Feb 07 2014 I just tested with DMD 2.064.2 and my numbers for the
- Dejan Lekic (37/37) Jan 31 2014 I was thinking about implementing a typical Java singleton in D,
- Dejan Lekic (6/6) Jan 31 2014 I should have mentioned two things in my previous post.
- Stanislav Blinov (4/10) Jan 31 2014 What use would the const version have? You'd still need some way
- Dejan Lekic (1/3) Jan 31 2014 I believe it should have been "final" instead of "const".
- Stanislav Blinov (3/4) Jan 31 2014 But D doesn't have "final" :) In any event, that article by Mike
- Andrej Mitrovic (4/8) Jan 31 2014 AFAIK D1's final was equivalent to D2's immutable. But I maybe
- Jacob Carlborg (8/12) Jan 31 2014 In D2 if if a variable is immutable or const you can not call non-const
- Andrej Mitrovic (2/5) Jan 31 2014 So in D1 const is non-transitive?
- Dicebot (9/17) Jan 31 2014 It is completely different in D1. I think it is not even a
- Dejan Lekic (3/5) Jan 31 2014 Well, "final" still works. Until it does not we will agree that D
- Namespace (3/40) Jan 31 2014 Why is someone interested in implementing such an Anti Pattern
- Stanislav Blinov (6/9) Jan 31 2014 Any sort of shared (as in, between threads) resource is often a
- Namespace (6/10) Jan 31 2014 I know so many people and have read so many books where
- Dejan Lekic (4/4) Jan 31 2014 Here is an updated Andrej's code:
- Andrej Mitrovic (5/6) Jan 31 2014 Well yeah, but that's not really the only thing what a singleton is
- Dejan Lekic (3/12) Jan 31 2014 Absolutely, that is why I would use bothe alternatives, depending
- Andrei Alexandrescu (3/9) Jan 31 2014 Well yah Singleton should be created on first access.
- Dejan Lekic (4/17) Jan 31 2014 If that is what people want, then David's version is definitely
- Stanislav Blinov (2/4) Jan 31 2014 Dejan, your singletons are thread-local :)
- Dejan Lekic (3/7) Jan 31 2014 YAY, that is correct! :'(
- Andrej Mitrovic (2/3) Jan 31 2014 SingletonLazy isn't thread-safe. :)
- Dejan Lekic (1/2) Jan 31 2014 EEK!
- Dejan Lekic (4/7) Jan 31 2014 I made it thread-safe, and guess what - I ended up with
- TC (23/38) Feb 07 2014 Should't be the LockSingleton implemented like this instead?
- Iain Buclaw (26/69) Feb 07 2014 We don't want double-checked locking. :)
- Stanislav Blinov (6/12) Feb 07 2014 (_instance is null) will most likely not be an atomic operation.
- Daniel Murphy (3/5) Feb 07 2014 References are one word.
- Stanislav Blinov (3/8) Feb 07 2014 Heh, indeed. Need to go have my brain scanned :\ I have no idea
- Stanislav Blinov (3/16) Feb 07 2014 Scratch that.
- luka8088 (26/46) Feb 09 2014 What about swapping function pointer so the check is done only once per
- Stanislav Blinov (5/9) Feb 09 2014 That is an interesting idea indeed, though it seems to be faster
- luka8088 (5/14) Feb 09 2014 I got it while writing code for dynamic languages (especially
- Martin Nowak (2/17) Feb 09 2014
- Stanislav Blinov (3/11) Feb 09 2014 I don't follow. get should be TLS, as a replacement for
- luka8088 (5/17) Feb 09 2014 It is tls and it needs to be tls because one thread could be replacing
- Andrej Mitrovic (3/7) Feb 10 2014 This confused me for a second since @property is meaningless for variabl...
- luka8088 (2/13) Feb 10 2014 Yeah. My mistake. It should be removed.
- Andrej Mitrovic (6/7) Feb 10 2014 Also, "static __gshared" is really meaningless here, it's either
- luka8088 (26/35) Feb 10 2014 "static" does not meat it must be tls, as "static shared" is valid.
- luka8088 (2/51) Feb 10 2014 Um actually this makes no sense. But anyway I mark it static.
- Andrej Mitrovic (3/4) Feb 10 2014 Yes you're right. I'm beginning to really dislike the 20 different
- Daniel Murphy (3/5) Feb 10 2014 Don't forget that __gshared static and static __gshared do different thi...
- Andrej Mitrovic (2/3) Feb 10 2014 wat.
- Dicebot (3/7) Feb 10 2014 To be more specific: "WATWATWAT"
- Dejan Lekic (2/9) Feb 10 2014 Care to elaborate?
- Daniel Murphy (2/5) Feb 10 2014 https://d.puremagic.com/issues/show_bug.cgi?id=4419
- Andrej Mitrovic (2/9) Feb 11 2014 Ah, that thing. Yeah this whole issue is rather messy IMO.
- Jerry (5/14) Feb 11 2014 Looking at the bug, I see the compiler doesn't implement what the spec
- Daniel Murphy (3/7) Feb 12 2014 It's just messy in the sense that it doesn't behave in a logical or usef...
- Andrej Mitrovic (13/18) Feb 10 2014 C:\dev\code\d_code>test_dmd
- luka8088 (2/25) Feb 10 2014 Could it be that TLS is slower in LLVM?
There was a nice blog-post about implementing low-lock singletons in D, here:
http://davesdprogramming.wordpress.com/2013/05/06/low-lock-singletons/

One suggestion on Reddit was by dawgfoto (I think this is Martin
Nowak?), to use atomic primitives instead:
http://www.reddit.com/r/programming/comments/1droaa/lowlock_singletons_in_d_the_singleton_pattern/c9tmz07

I wanted to benchmark these different approaches. I was expecting
Martin's implementation to be the fastest one, but on my machine
(Athlon II X4 620 - 2.61GHz) the implementation in the blog post turns
out to be the fastest one. I'm wondering whether my test case is
flawed in some way.

Btw, I think we should put an implementation of this into Phobos.

The timings on my machine:

Test time for LockSingleton: 542 msecs.
Test time for SyncSingleton: 20 msecs.
Test time for AtomicSingleton: 755 msecs.

Here's the code: http://codepad.org/TMb0xxYw

And pasted below for convenience:

-----
module singleton;

import std.concurrency;
import core.atomic;
import core.thread;

class LockSingleton
{
    static LockSingleton get()
    {
        __gshared LockSingleton _instance;

        synchronized
        {
            if (_instance is null)
                _instance = new LockSingleton;
        }

        return _instance;
    }

private:
    this() { }
}

class SyncSingleton
{
    static SyncSingleton get()
    {
        static bool _instantiated;  // tls
        __gshared SyncSingleton _instance;

        if (!_instantiated)
        {
            synchronized
            {
                if (_instance is null)
                    _instance = new SyncSingleton;

                _instantiated = true;
            }
        }

        return _instance;
    }

private:
    this() { }
}

class AtomicSingleton
{
    static AtomicSingleton get()
    {
        shared bool _instantiated;
        __gshared AtomicSingleton _instance;

        // only enter synchronized block if not instantiated
        if (!atomicLoad!(MemoryOrder.acq)(_instantiated))
        {
            synchronized
            {
                if (_instance is null)
                    _instance = new AtomicSingleton;

                atomicStore!(MemoryOrder.rel)(_instantiated, true);
            }
        }

        return _instance;
    }
}

version (unittest)
{
    ulong _thread_call_count;  // TLS
}

unittest
{
    import std.datetime;
    import std.stdio;
    import std.string;
    import std.typetuple;

    foreach (TestClass; TypeTuple!(LockSingleton, SyncSingleton, AtomicSingleton))
    {
        // mixin to avoid multiple definition errors
        mixin(q{
            static void test_%1$s()
            {
                foreach (i; 0 .. 1024_000)
                {
                    // just trying to avoid the compiler from doing dead-code optimization
                    _thread_call_count += (TestClass.get() !is null);
                }
            }

            auto sw = StopWatch(AutoStart.yes);

            enum threadCount = 4;
            foreach (i; 0 .. threadCount)
                spawn(&test_%1$s);

            thread_joinAll();
        }.format(TestClass.stringof));

        sw.stop();
        writefln("Test time for %s: %s msecs.",
                 TestClass.stringof, sw.peek.msecs);
    }
}

void main() { }
-----
Jan 31 2014
You forgot to make the flag static for AtomicSingleton. I'd also
move the timing into the threads themselves, for fairness :)

http://codepad.org/gvm3A88k

Timings on my machine:

ldc2 -unittest -release -O3:
Test time for LockSingleton: 537 msecs.
Test time for SyncSingleton: 2 msecs.
Test time for AtomicSingleton: 2.25 msecs.

dmd -unittest -release -O -inline:
Test time for LockSingleton: 451.5 msecs.
Test time for SyncSingleton: 7.75 msecs.
Test time for AtomicSingleton: 99.75 msecs.
Jan 31 2014
On 1/31/14, Stanislav Blinov <stanislav.blinov gmail.com> wrote:
> You forgot to make the flag static for AtomicSingleton.

Ah. It was copied verbatim from reddit, I guess we both missed it.

> Timings on my machine:
>
> ldc2 -unittest -release -O3:
> Test time for LockSingleton: 537 msecs.
> Test time for SyncSingleton: 2 msecs.
> Test time for AtomicSingleton: 2.25 msecs.

Here's mine:

$ dmd -release -inline -O -noboundscheck -unittest -run singleton.d
Test time for LockSingleton: 577.5 msecs.
Test time for SyncSingleton: 9.25 msecs.
Test time for AtomicSingleton: 159.75 msecs.

Maybe ldc's optimizer is just much better at this? In either case how
come the atomic version is slower?
Jan 31 2014
On Friday, 31 January 2014 at 10:39:19 UTC, Andrej Mitrovic wrote:
> On 1/31/14, Stanislav Blinov <stanislav.blinov gmail.com> wrote:
>> You forgot to make the flag static for AtomicSingleton.
>
> Ah. It was copied verbatim from reddit, I guess we both missed it.

Yeah, with D's verbosity in these cases it's easy to miss.

> Here's mine:
>
> $ dmd -release -inline -O -noboundscheck -unittest -run singleton.d
> Test time for LockSingleton: 577.5 msecs.
> Test time for SyncSingleton: 9.25 msecs.
> Test time for AtomicSingleton: 159.75 msecs.
>
> Maybe ldc's optimizer is just much better at this?

It is :)
http://forum.dlang.org/thread/lqmqsnucadaqlkxkoffc forum.dlang.org

> In either case how come the atomic version is slower?

It may not be universally true, as Dmitry mentioned. On some
platforms, TLS could be slow but atomics fast. I'm suspecting that on
Windows TLS could be slower, actually.
Jan 31 2014
On 31.01.2014 10:18, Stanislav Blinov wrote:
> You forgot to make the flag static for AtomicSingleton. I'd also
> move the timing into the threads themselves, for fairness :)
>
> http://codepad.org/gvm3A88k
>
> Timings on my machine:
>
> ldc2 -unittest -release -O3:
> Test time for LockSingleton: 537 msecs.
> Test time for SyncSingleton: 2 msecs.
> Test time for AtomicSingleton: 2.25 msecs.
>
> dmd -unittest -release -O -inline:
> Test time for LockSingleton: 451.5 msecs.
> Test time for SyncSingleton: 7.75 msecs.
> Test time for AtomicSingleton: 99.75 msecs.

For x86 CPUs you don't really need MemoryOrder.acq as reads are atomic
by default. So I replaced that with MemoryOrder.raw and named it
AtomicSingletonRaw.

On Windows 7:

dmd -unittest -release -O -inline -noboundscheck
Test time for LockSingleton: 299 msecs.
Test time for SyncSingleton: 5 msecs.
Test time for AtomicSingleton: 304 msecs.
Test time for AtomicSingletonRaw: 280 msecs.

ldc2 -release -unittest -O3
Test time for LockSingleton: 320 msecs.
Test time for SyncSingleton: 2 msecs.
Test time for AtomicSingleton: 271 msecs.
Test time for AtomicSingletonRaw: 209 msecs.

It seems that SyncSingleton is superior in all cases.
Jan 31 2014
On 1/31/14, Benjamin Thaut <code benjamin-thaut.de> wrote:
> For x86 CPUs you don't really need MemoryOrder.acq as reads are
> atomic by default.

Hmm, I guess we could use a version(X86) block to pick this. When you
say x86, do you also imply X86_64? Where can I read about memory reads
being atomic by default?
Jan 31 2014
On 31.01.2014 12:44, Andrej Mitrovic wrote:
> On 1/31/14, Benjamin Thaut <code benjamin-thaut.de> wrote:
>> For x86 CPUs you don't really need MemoryOrder.acq as reads are
>> atomic by default.
>
> Hmm, I guess we could use a version(X86) block to pick this. When
> you say x86, do you also imply X86_64? Where can I read about the
> memory reads being atomic by default?

It depends on the processor architecture. Usually if you have a
"normal" CPU architecture it guarantees a consistent view of memory,
meaning all reads and writes are atomic (but not read-modify-write, or
even read-write). Usually only NUMA architectures don't guarantee a
consistent view of memory, resulting in reads and writes not being
atomic. For example, the Intel Itanium architecture does not guarantee
this. But usually all single-processor architectures guarantee a
consistent view of memory; I have not come across one yet that didn't.
(So ARM, PPC, X86 and X86_64 all have atomic reads/writes.)

Also see: http://en.wikipedia.org/wiki/Cache_coherence
Jan 31 2014
If you need the details, read: http://lwn.net/Articles/250967/ Kind Regards Benjamin Thaut
Jan 31 2014
Is it just me or does the implementation of atomic.d look grossly inefficient and badly in need of a rewrite?
Jan 31 2014
On 31.01.2014 15:27, Jonathan Bettencourt wrote:
> Is it just me or does the implementation of atomic.d look grossly
> inefficient and badly in need of a rewrite?

I can't really judge that, as I don't have much experience in
lock-free programming. But if someone is to rewrite this module, then
it should be someone with quite some experience in lock-free
programming. Taking a look at the memory model of C++11 and copying
from there might not hurt either.
Jan 31 2014
On 1/31/14, Benjamin Thaut <code benjamin-thaut.de> wrote:
> If you need the details, read:
> http://lwn.net/Articles/250967/

Aye, it's been on my todo list forever, even though I've read the
first part when it was a single blog post, afair.
Jan 31 2014
On 31.01.2014 15:30, Andrej Mitrovic wrote:
> On 1/31/14, Benjamin Thaut <code benjamin-thaut.de> wrote:
>> If you need the details, read:
>> http://lwn.net/Articles/250967/
>
> Aye, it's been on my todo list forever, even though I've read the
> first part when it was a single blog post, afair.

You should really take the time to read it. It's one of the best
articles on the internet I have ever read, and it has tons of relevant
information for programmers. You can skip the first chapter, as it
mostly talks about the hardware details of how memory works, and why
it is hard to make it faster.
Jan 31 2014
On Friday, 31 January 2014 at 11:31:53 UTC, Benjamin Thaut wrote:
> For x86 CPUs you don't really need MemoryOrder.acq as reads are
> atomic by default.

Uhm... atomicLoad() itself guarantees that the read is atomic. It's
not about atomicity of the operation, it's about sequential
consistency. Using raw in this case is safe because the synchronized
block that follows guarantees that this read will not be reordered to
follow the write. In fact, the presence of that synchronized block
allows for making both the load and store raw.
Jan 31 2014
In fact #2, I think it's even safe to pull that store out of the
synchronized block:

// (2)
if (!atomicLoad!(MemoryOrder.raw)(_instantiated))
{
    // (1)
    synchronized  // <- this is 'acquire'
    {
        if (_instance is null)
        {
            _instance = new AtomicSingleton;
        }
    }  // <- this is 'release'

    // This store cannot be moved to positions (1) or (2) because
    // of 'synchronized' above
    atomicStore!(MemoryOrder.raw)(_instantiated, true);
}
Jan 31 2014
31-Jan-2014 17:26, Stanislav Blinov wrote:
> // (2)
> if (!atomicLoad!(MemoryOrder.raw)(_instantiated))
> {
>     // (1)
>     synchronized  // <- this is 'acquire'
>     {
>         if (_instance is null)
>         {

//(3)

>             _instance = new AtomicSingleton;
>         }
>     }  // <- this is 'release'

//(4)

>     // This store cannot be moved to positions (1) or (2) because
>     // of 'synchronized' above
>     atomicStore!(MemoryOrder.raw)(_instantiated, true);
> }

No it's not - the second thread may get to (3) while some other thread
is at (4).

-- 
Dmitry Olshansky
Jan 31 2014
On Friday, 31 January 2014 at 15:18:43 UTC, Dmitry Olshansky wrote:
> 31-Jan-2014 17:26, Stanislav Blinov wrote:
>> // (2)
>> if (!atomicLoad!(MemoryOrder.raw)(_instantiated))
>> {
>>     // (1)
>>     synchronized  // <- this is 'acquire'
>>     {
>>         if (_instance is null)
>>         {
> //(3)
>>             _instance = new AtomicSingleton;
>>         }
>>     }  // <- this is 'release'
> //(4)
>>     // This store cannot be moved to positions (1) or (2) because
>>     // of 'synchronized' above
>>     atomicStore!(MemoryOrder.raw)(_instantiated, true);
>> }
>
> No it's not - the second thread may get to (3) while some other
> thread is at (4).

Nope. The only way a thread is going to end up past the null check is
if it's instantiating the singleton. It's inside the locked region. As
long as the bool is false, one of the threads will get inside the
synchronized block; all others will block on the lock. Once that
"first" thread is done, the others will see a non-null reference. No
thread can get to (4) until the singleton is created.
Jan 31 2014
On Friday, 31 January 2014 at 23:35:25 UTC, Stanislav Blinov wrote:
> Nope. The only way a thread is going to end up past the null check
> is if it's instantiating the singleton. It's inside the locked
> region. As long as the bool is false, one of the threads will get
> inside the synchronized block; all others will block on the lock.
> Once that "first" thread is done, the others will see a non-null
> reference. No thread can get to (4) until the singleton is created.

To clarify: only one thread will ever get to position (3). All others
that follow it will see that _instance is not null, thus will just
leave the synchronized section. Of course, this means that some N
threads (that arrived at the synchronized section before the singleton
was created) will all write 'true' into the flag. No big deal :)
Feb 01 2014
01-Feb-2014 18:23, Stanislav Blinov wrote:
> To clarify: only one thread will ever get to position (3). All
> others that follow it will see that _instance is not null, thus will
> just leave the synchronized section. Of course, this means that some
> N threads (that arrived at the synchronized section before the
> singleton was created) will all write 'true' into the flag. No big
> deal :)

Yes, I see there could be many writes to the _instantiated field, but
not to _instance.

-- 
Dmitry Olshansky
Feb 01 2014
There's a lot more to these singletons than meets the eye.

- It would seem that such usage of raw MemoryOrder in AtomicSingleton
  would be wrong (e.g. return to acq/rel is in order, which should not
  pose any performance issues on X86, as Sean mentioned).
- The instance references should be qualified shared.

This needs more serious review, even if only for academic purposes.
I'll see what I can come up with :) In the meantime, if anyone has
anything to add to the list, please chime in!
Feb 07 2014
On Friday, 7 February 2014 at 20:09:29 UTC, Stanislav Blinov wrote:
> There's a lot more to these singletons than meets the eye.
>
> - It would seem that such usage of raw MemoryOrder in
>   AtomicSingleton would be wrong (e.g. return to acq/rel is in
>   order, which should not pose any performance issues on X86, as
>   Sean mentioned).

I agree that acq/rel is the correct way to go, but it will cause
performance issues with the current implementation of atomicLoad.
Feb 07 2014
On Friday, 7 February 2014 at 20:09:29 UTC, Stanislav Blinov wrote:
> There's a lot more to these singletons than meets the eye.
>
> - It would seem that such usage of raw MemoryOrder in
>   AtomicSingleton would be wrong (e.g. return to acq/rel is in
>   order, which should not pose any performance issues on X86, as
>   Sean mentioned).
> - The instance references should be qualified shared.
>
> This needs more serious review, even if only for academic purposes.
> I'll see what I can come up with :) In the meantime, if anyone has
> anything to add to the list, please chime in!

Hi Martin, Sean, Stanislav et al

I would quite like to code-review atomics.d and maybe think about
improving the documentation and adding a few comments, especially for
the purposes of knowledge capture in this sticky field. Would that be
ok, in principle?

There are a few rough edges here and there _in my very unworthy
opinion_, and the odd bit that doesn't look quite right somehow,
especially in the x64 branch. If I could even find the odd bug then
that would be good. Or rather bad.

A big amount of work has clearly gone into this module. So, many beers
to Sean and others who put their time into it. Research can be quite a
pig too on a project of this kind, I would imagine.

There is quite a list of things that I'm currently unclear about when
I read through the D code, and this might mean me whimpering for help
occasionally..?

Best, Cecil.
Feb 27 2014
On Friday, 28 February 2014 at 00:29:49 UTC, Cecil Ward wrote:
> On Friday, 7 February 2014 at 20:09:29 UTC, Stanislav Blinov wrote:
>> This needs more serious review, even if only for academic
>> purposes. I'll see what I can come up with :) In the meantime, if
>> anyone has anything to add to the list, please chime in!
>
> Hi Martin, Sean, Stanislav et al
>
> I would quite like to code-review atomics.d

When I said "review" I meant this specific issue, e.g. singletons.
Since then I got a bit carried away into general issues with the
'shared' qualifier, so for me the quirks of singletons are on hold for
now. But if you find other bugs (in atomic.d or anywhere else),
inconsistencies, documentation omissions, etc., please post them. This
thread clearly shows the value of more thorough testing. Who knows how
long it would've taken to notice that atomicLoad() issue if Andrej
hadn't created this thread.

> and maybe think about improving the documentation and adding a few
> comments, especially for the purposes of knowledge capture in this
> sticky field. Would that be ok, in principle?

IMO submitting issues, enhancements, documentation updates is always a
good idea. Though don't be surprised if your submissions hang in the
air for a while; it's pretty common, esp. when people responsible for
the original code are busy with other things.

> There are a few rough edges here and there _in my very unworthy
> opinion_, and the odd bit that doesn't look quite right somehow
> especially in the x64 branch. If I could even find the odd bug then
> that would be good. Or rather bad. A big amount of work has clearly
> gone into this module. So, many beers to Sean and others who put
> their time into it. Research can be quite a pig too on a project of
> this kind, I would imagine.

Use bugzilla (https://d.puremagic.com/issues/) to submit
issues/enhancement requests; or submit ready pull requests on github
so that they can be reviewed, improved, and if all is good, eventually
accepted. It's best done that way since it presents clear history and
more focused discussion, and because threads in this NG sink rather
quickly.

> There is quite a list of things that I'm currently unclear about
> when I read through the D, and this might mean me whimpering for
> help occasionally..?

I don't see a big red banner saying "don't post your questions here"
anywhere ;)
Mar 03 2014
31-Jan-2014 12:25, Andrej Mitrovic wrote:
> There was a nice blog-post about implementing low-lock singletons in
> D, here:
> http://davesdprogramming.wordpress.com/2013/05/06/low-lock-singletons/
>
> I wanted to benchmark these different approaches. I was expecting
> Martin's implementation to be the fastest one, but on my machine
> (Athlon II X4 620 - 2.61GHz) the implementation in the blog post
> turns out to be the fastest one.

And it was a big thing because of that. Also keep in mind that atomic
ops are _relatively_ cheap on x86; the stuff should get even better
on, say, ARM.

-- 
Dmitry Olshansky
Jan 31 2014
On 1/31/14, Dmitry Olshansky <dmitry.olsh gmail.com> wrote:
> Also keep in mind that atomic ops are _relatively_ cheap on x86; the
> stuff should get even better on, say, ARM.

Hmm yeah, but I was expecting better numbers. Even after the 'static'
bug fix, as noted by Stanislav, the atomic version is slower.
Jan 31 2014
On 1/31/14, Andrej Mitrovic <andrej.mitrovich gmail.com> wrote:
> Hmm yeah, but I was expecting better numbers. Even after the
> 'static' fix in the bug as noted by Stanislav the atomic version is
> slower.

Actually, I think I understand why this happens. Logically, the atomic
version will do an atomic read for *every* access, whereas the TLS
implementation only checks a thread-local boolean flag. Even though
the TLS implementation forces each new thread to enter the
synchronized block *on the first read for that thread*, on subsequent
reads that thread will not enter the synchronized block anymore.

After the very first call of every thread, the cost of the read
operation for the TLS version is a TLS read, whereas for the atomic
version it is an atomic read. I guess TLS read operations simply beat
atomic read operations.

The atomic implementation probably beats the TLS version when a lot of
new threads are being spawned at once and they only retrieve the
singleton which has already been initialized. E.g., say 1000 threads
are spawned. In the atomic version, the 1000 threads will all do an
atomic read and not enter the synchronized block, whereas in the TLS
version the 1000 threads will all need to enter a synchronized block
on the very first read.
Jan 31 2014
On Friday, 31 January 2014 at 10:57:53 UTC, Andrej Mitrovic wrote:
> The atomic implementation probably beats the TLS version when a lot
> of new threads are being spawned at once and they only retrieve the
> singleton which has already been initialized. E.g., say a 1000
> threads are spawned.

Easy enough to test. But inconclusive. I just ran some tests with 1024
threads :)

First, subsequent runs on my machine show interleaving results:

Test time for SyncSingleton: 61.2334 msecs.
Test time for AtomicSingleton: 15.9795 msecs.
Test time for SyncSingleton: 11.209 msecs.
Test time for AtomicSingleton: 25.4395 msecs.
Test time for SyncSingleton: 22.8105 msecs.
Test time for AtomicSingleton: 35.1865 msecs.

I guess I'd need a different CPU (and probably one that's not doing
anything else at the time) to get conclusive results.

It also seems that either there *is* a race in there somewhere, or
maybe a bug?.. Some runs just flat freeze (even on small thread
counts) :\
Jan 31 2014
On 1/31/14, Stanislav Blinov <stanislav.blinov gmail.com> wrote:
> First, subsequent runs on my machine show interleaving results. It
> also seems that either there *is* a race in there somewhere, or
> maybe a bug?.. Some runs just flat freeze (even on small thread
> counts) :\

Hmm.. Well I know we've had some issues with threads on FreeBSD. It's
hard to just guess what's wrong though. :)
Jan 31 2014
On Friday, 31 January 2014 at 11:18:03 UTC, Andrej Mitrovic wrote:
> Hmm.. Well I know we've had some issues with threads on FreeBSD.
> It's hard to just guess what's wrong though. :)

I'm not comfortable with that atomicOp in the thread function.

I've reworked the unittest a little, to accommodate multiple runs:
http://codepad.org/ghZdjvUE

And here are ldc's results (you may want to lower the thread count for
dmd, I killed the program after the very first test took 27 seconds
:o):

Test 0 time for SyncSingleton: 35.4775 msecs.
Test 0 time for AtomicSingleton: 58.5859 msecs.
Test 1 time for SyncSingleton: 64.9863 msecs.
Test 1 time for AtomicSingleton: 12.5479 msecs.
Test 2 time for SyncSingleton: 44.2617 msecs.
Test 2 time for AtomicSingleton: 26.2842 msecs.
Test 3 time for SyncSingleton: 24.8008 msecs.
Test 3 time for AtomicSingleton: 34.416 msecs.
Test 4 time for SyncSingleton: 5.63477 msecs.
Test 4 time for AtomicSingleton: 28.458 msecs.
Test 5 time for SyncSingleton: 18.1123 msecs.
Test 5 time for AtomicSingleton: 29.6738 msecs.
Test 6 time for SyncSingleton: 12.0234 msecs.
Test 6 time for AtomicSingleton: 53.2061 msecs.
Test 7 time for SyncSingleton: 70.6982 msecs.
Test 7 time for AtomicSingleton: 13.2285 msecs.
Test 8 time for SyncSingleton: 12.3447 msecs.
Test 8 time for AtomicSingleton: 8.06348 msecs.
Test 9 time for SyncSingleton: 20.3145 msecs.
Test 9 time for AtomicSingleton: 14.334 msecs.

Again, inconclusive :)
Jan 31 2014
On 1/31/14, Stanislav Blinov <stanislav.blinov gmail.com> wrote:I've reworked the unittest a little, to accomodate for multiple runs: http://codepad.org/ghZdjvUEI've finally managed to build LDC2 on Windows (MinGW version), here are the timings between DMD and LDC2: $ dmd -release -inline -O -noboundscheck -unittest singleton_2.d -oftest.exe && test.exe Test time for LockSingleton: 606.5 msecs. Test time for SyncSingleton: 7 msecs. Test time for AtomicSingleton: 138 msecs. $ ldmd2 -release -inline -O -noboundscheck -unittest singleton_2.d -oftest.exe && test.exe Test time for LockSingleton: 536.25 msecs. Test time for SyncSingleton: 5 msecs. Test time for AtomicSingleton: 3 msecs. Freaking awesome!
Feb 04 2014
On Tuesday, 4 February 2014 at 09:44:04 UTC, Andrej Mitrovic wrote:I've finally managed to build LDC2 on Windows (MinGW version), here are the timings between DMD and LDC2: $ dmd -release -inline -O -noboundscheck -unittest singleton_2.d -oftest.exe && test.exe Test time for LockSingleton: 606.5 msecs. Test time for SyncSingleton: 7 msecs. Test time for AtomicSingleton: 138 msecs. $ ldmd2 -release -inline -O -noboundscheck -unittest singleton_2.d -oftest.exe && test.exe Test time for LockSingleton: 536.25 msecs. Test time for SyncSingleton: 5 msecs. Test time for AtomicSingleton: 3 msecs. Freaking awesome!:) Have you also included fixes from http://forum.dlang.org/post/khidcgetalmguhassvqm forum.dlang.org ? How do the test results look in multiple runs? Is AtomicSingleton always faster than SyncSingleton on Windows?
Feb 04 2014
On 2/4/14, Stanislav Blinov <stanislav.blinov gmail.com> wrote:Have you also included fixes from http://forum.dlang.org/post/khidcgetalmguhassvqm forum.dlang.org ?I haven't figured out exactly what you're trying to swap there. Do you have a full example:How do the test results look in multiple runs? Is AtomicSingleton always faster than SyncSingleton on Windows?Pretty much. I'm getting reliable results. But I'm not a statistics pro (and yeah I've read http://zedshaw.com/essays/programmer_stats.html - still doesn't make me a pro).
Feb 04 2014
On Tuesday, 4 February 2014 at 14:23:51 UTC, Andrej Mitrovic wrote:On 2/4/14, Stanislav Blinov <stanislav.blinov gmail.com> wrote:Both atomicLoad and atomicStore use raw MemoryOrder, and also the atomicStore is out of the synchronized {} section: http://dpaste.dzfl.pl/291abc51bb0eHave you also included fixes from http://forum.dlang.org/post/khidcgetalmguhassvqm forum.dlang.org ?I haven't figured out exactly what you're trying to swap there. Do you have a full example:Interesting. As you've seen, for me on Linux it's 50/50.How do the test results look in multiple runs? Is AtomicSingleton always faster than SyncSingleton on Windows?Pretty much. I'm getting reliable results.But I'm not a statistics pro (and yeah I've read http://zedshaw.com/essays/programmer_stats.html - still doesn't make me a pro).Same here :)
Feb 04 2014
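For readers without the paste links handy, the low-lock idiom being benchmarked as SyncSingleton can be sketched outside of D as well. The following C++11 sketch (the `Widget` type and function name are illustrative, not taken from the thread) mirrors the approach: a thread-local flag lets each thread skip the lock entirely after its first call.

```cpp
#include <mutex>

// Illustrative payload type (not from the thread).
struct Widget { int value = 0; };

Widget* syncSingleton() {
    static Widget* instance = nullptr;   // shared across all threads
    static std::mutex mtx;
    // Thread-local flag: once set, this thread never takes the lock
    // or touches the shared pointer check again.
    thread_local bool instantiated = false;

    if (!instantiated) {
        std::lock_guard<std::mutex> lock(mtx);
        if (instance == nullptr)
            instance = new Widget;
        instantiated = true;
    }
    return instance;
}
```

The fast path costs only a TLS read, which is why SyncSingleton holds up so well in the timings above despite using a plain mutex on the slow path.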
On 2/4/14, Stanislav Blinov <stanislav.blinov gmail.com> wrote:Both atomicLoad and atomicStore use raw MemoryOrder, and also the atomicStore is out of the synchronized {} section: http://dpaste.dzfl.pl/291abc51bb0eNo difference, but maybe the timing precision isn't proper. It always displays one of 3/3.25/4 msecs. Anywho what's important is that Atomic is really speedy and Sync is almost as fast. Except with DMD which is bad at optimizing this specific code.
Feb 05 2014
On Wednesday, 5 February 2014 at 08:39:08 UTC, Andrej Mitrovic wrote:No difference, but maybe the timing precision isn't proper. It always displays one of 3/3.25/4 msecs.Hmm... It should be as proper as it gets, judging from StopWatch's docs.Anywho what's important is that Atomic is really speedy and Sync is almost as fast. Except with DMD which is bad at optimizing this specific code.Yup, at least we have two fast low-lock implementations to choose from depending on platform's capabilities regarding TLS and atomics.
Feb 05 2014
On Wednesday, 5 February 2014 at 09:30:51 UTC, Stanislav Blinov wrote:On Wednesday, 5 February 2014 at 08:39:08 UTC, Andrej Mitrovic wrote:The atomics implementation in druntime is very inefficient, it uses compare-and-swap for nearly everything. I'm working on a rewrite.No difference, but maybe the timing precision isn't proper. It always displays one of 3/3.25/4 msecs.Hmm... It should be as proper as it gets, judging from StopWatch's docs.Anywho what's important is that Atomic is really speedy and Sync is almost as fast. Except with DMD which is bad at optimizing this specific code.Yup, at least we have two fast low-lock implementations to choose from depending on platform's capabilities regarding TLS and atomics.
Feb 05 2014
On 2/4/14, Andrej Mitrovic <andrej.mitrovich gmail.com> wrote:I haven't figured out exactly what you're trying to swap there. Do you have a full example:s/:/?
Feb 04 2014
"Stanislav Blinov" <stanislav.blinov gmail.com> writes:On Tuesday, 4 February 2014 at 09:44:04 UTC, Andrej Mitrovic wrote:Here's the best and worst times I get on my linux laptop. These are with 2.064.2 dmd and gdc 4.9 with 2.064.2 On Ubuntu x86_64: ~/dmd2/linux/bin64/dmd -O -release -inline -noboundscheck -unittest singleton.d Test 2 time for SyncSingleton: 753.547 msecs. Test 2 time for AtomicSingleton: 22290.3 msecs. Test 3 time for SyncSingleton: 254.968 msecs. Test 3 time for AtomicSingleton: 22903.3 msecs. Test 6 time for SyncSingleton: 510.118 msecs. Test 6 time for AtomicSingleton: 23970.9 msecs. Test 8 time for SyncSingleton: 480.175 msecs. Test 8 time for AtomicSingleton: 12827.9 msecs. ../bin/gdc -frelease -funittest -O3 singleton.d Test 0 time for SyncSingleton: 458.605 msecs. Test 0 time for AtomicSingleton: 1985.87 msecs. Test 1 time for SyncSingleton: 334.097 msecs. Test 1 time for AtomicSingleton: 2030.29 msecs. Test 5 time for SyncSingleton: 355.765 msecs. Test 5 time for AtomicSingleton: 1040.87 msecs. Test 9 time for SyncSingleton: 295.145 msecs. Test 9 time for AtomicSingleton: 1272.22 msecs. It seems like gdc and dmd are similar for SyncSingleton. AtomicSingleton is significantly faster for gdc, but not as fast as SyncSingleton.I've finally managed to build LDC2 on Windows (MinGW version), here are the timings between DMD and LDC2: $ dmd -release -inline -O -noboundscheck -unittest singleton_2.d -oftest.exe && test.exe Test time for LockSingleton: 606.5 msecs. Test time for SyncSingleton: 7 msecs. Test time for AtomicSingleton: 138 msecs. $ ldmd2 -release -inline -O -noboundscheck -unittest singleton_2.d -oftest.exe && test.exe Test time for LockSingleton: 536.25 msecs. Test time for SyncSingleton: 5 msecs. Test time for AtomicSingleton: 3 msecs. Freaking awesome!
Feb 04 2014
On Wednesday, 5 February 2014 at 00:11:58 UTC, Jerry wrote:Here's the best and worst times I get on my linux laptop. These are with 2.064.2 dmd and gdc 4.9 with 2.064.2 On Ubuntu x86_64: ~/dmd2/linux/bin64/dmd -O -release -inline -noboundscheck -unittest singleton.d Test 2 time for SyncSingleton: 753.547 msecs. Test 2 time for AtomicSingleton: 22290.3 msecs. Test 3 time for SyncSingleton: 254.968 msecs. Test 3 time for AtomicSingleton: 22903.3 msecs. Test 6 time for SyncSingleton: 510.118 msecs. Test 6 time for AtomicSingleton: 23970.9 msecs. Test 8 time for SyncSingleton: 480.175 msecs. Test 8 time for AtomicSingleton: 12827.9 msecs.Whoah, those times for AtomicSingleton are way high. What kind of machine is your laptop? Perhaps we need to repost the test with the latest implementation of AtomicSingleton.
Feb 05 2014
"Stanislav Blinov" <stanislav.blinov gmail.com> writes:On Wednesday, 5 February 2014 at 00:11:58 UTC, Jerry wrote:Core 2 Duo T9400. The gdc times were much better for AtomicSingleton - about 4x slower than SyncSingleton.Here's the best and worst times I get on my linux laptop. These are with 2.064.2 dmd and gdc 4.9 with 2.064.2 On Ubuntu x86_64: ~/dmd2/linux/bin64/dmd -O -release -inline -noboundscheck -unittest singleton.d Test 2 time for SyncSingleton: 753.547 msecs. Test 2 time for AtomicSingleton: 22290.3 msecs. Test 3 time for SyncSingleton: 254.968 msecs. Test 3 time for AtomicSingleton: 22903.3 msecs. Test 6 time for SyncSingleton: 510.118 msecs. Test 6 time for AtomicSingleton: 23970.9 msecs. Test 8 time for SyncSingleton: 480.175 msecs. Test 8 time for AtomicSingleton: 12827.9 msecs.Whoah, those times for AtomicSingleton are way high. What kind of machine is your laptop?Perhaps we need to repost the test with the latest implementation of AtomicSingleton.I downloaded the test program yesterday.
Feb 05 2014
On Wednesday, 5 February 2014 at 21:47:40 UTC, Jerry wrote:I downloaded the test program yesterday.Here's my latest revision: http://dpaste.dzfl.pl/5b54df1c7004 Andrej, I hope you don't mind me fiddling with that code? I've put that atomic fix in there, also switched timing to use hnsecs (converted back to msecs for output), which seems to give more accurate readings.
Feb 05 2014
"Stanislav Blinov" <stanislav.blinov gmail.com> writes:On Wednesday, 5 February 2014 at 21:47:40 UTC, Jerry wrote:Yup, that helps out the AtomicSingleton a lot. Here's best and worst times for each for dmd and gdc: jlquinn wyvern:~/d/tests$ ~/dmd2/linux/bin64/dmd -O -release -inline -unittest singleton2.d jlquinn wyvern:~/d/tests$ ./singleton2 *Test 2 time for SyncSingleton: 585.992 msecs. Test 2 time for AtomicSingleton: 1189.03 msecs. Test 5 time for SyncSingleton: 796.834 msecs. *Test 5 time for AtomicSingleton: 1069.08 msecs. *Test 7 time for SyncSingleton: 811.711 msecs. Test 7 time for AtomicSingleton: 1263.36 msecs. Test 9 time for SyncSingleton: 605.729 msecs. *Test 9 time for AtomicSingleton: 2173.74 msecs. jlquinn wyvern:~/d/tests$ ../bin/gdc -O3 -finline -frelease -fno-bounds-check -funittest singleton2.d jlquinn wyvern:~/d/tests$ ./a.out Test 0 time for SyncSingleton: 542.797 msecs. *Test 0 time for AtomicSingleton: 257.805 msecs. *Test 5 time for SyncSingleton: 620.052 msecs. Test 5 time for AtomicSingleton: 248.951 msecs. Test 7 time for SyncSingleton: 437.124 msecs. *Test 7 time for AtomicSingleton: 605.781 msecs. *Test 8 time for SyncSingleton: 252.643 msecs. Test 8 time for AtomicSingleton: 279.854 msecs.I downloaded the test program yesterday.Here's my latest revision: http://dpaste.dzfl.pl/5b54df1c7004 Andrej, I hope you don't mind me fiddling with that code? I've put that atomic fix in there, also switched timing to use hnsecs (converted back to msecs for output), which seems to give more accurate readings.
Feb 06 2014
Weird. atomicLoad(raw) should be the same as atomicLoad(acq), and atomicStore(raw) should be the same as atomicStore(rel). At least on x86. I don't know why that change made a difference in performance.
Feb 07 2014
On Friday, 7 February 2014 at 08:10:58 UTC, Sean Kelly wrote:Weird. atomicLoad(raw) should be the same as atomicLoad(acq), and atomicStore(raw) should be the same as atomicStore(rel). At least on x86. I don't know why that change made a difference in performance.huh? --8<-- core/atomic.d template needsLoadBarrier( MemoryOrder ms ) { enum bool needsLoadBarrier = ms != MemoryOrder.raw; } -->8-- Didn't you write this? :)
Feb 07 2014
On Friday, 7 February 2014 at 11:17:49 UTC, Stanislav Blinov wrote:On Friday, 7 February 2014 at 08:10:58 UTC, Sean Kelly wrote:Oops. I thought that since Intel has officially defined loads as having acquire semantics, I had eliminated the barrier requirement there. But I guess not. I suppose it's an issue worth discussing. Does anyone know offhand what C++0x implementations do for load acquires on x86?Weird. atomicLoad(raw) should be the same as atomicLoad(acq), and atomicStore(raw) should be the same as atomicStore(rel). At least on x86. I don't know why that change made a difference in performance.huh? --8<-- core/atomic.d template needsLoadBarrier( MemoryOrder ms ) { enum bool needsLoadBarrier = ms != MemoryOrder.raw; } -->8-- Didn't you write this? :)
Feb 07 2014
On Friday, 7 February 2014 at 15:42:06 UTC, Sean Kelly wrote:Oops. I thought that since Intel has officially defined loads as having acquire semantics, I had eliminated the barrier requirement there. But I guess not. I suppose it's an issue worth discussing. Does anyone know offhand what C++0x implementations do for load acquires on x86?Offhand - no. But who forbids empirical tests? :) --8<-- main.cpp #include <atomic> #include <cstdint> #include <iostream> int test32() { std::atomic<int> ai(0xfacefeed); return ai.load(std::memory_order_acquire); } int64_t test64() { std::atomic<int64_t> ai(0xbadface00badface); return ai.load(std::memory_order_acquire); } int main(int argc, char** argv) { auto i1 = test32(); auto i2 = test64(); // Prevent dead code optimization std::cout << i1 << " " << i2 << std::endl; } -->8-- I've pulled the atomic ops into separate functions to try and prevent the compiler from being too clever. I'm using --std=c++11 but --std=c++0x would work as well. $ g++ -Ofast -m32 --std=c++11 main.cpp $ objdump -d -w -r -C --no-show-raw-insn --disassembler-options=intel a.out | less -S --8<-- 08048830 <test32()>: 8048830: sub esp,0x10 8048833: mov DWORD PTR [esp+0xc],0xfacefeed 804883b: mov eax,DWORD PTR [esp+0xc] 804883f: add esp,0x10 8048842: ret 8048843: lea esi,[esi+0x0] 8048849: lea edi,[edi+eiz*1+0x0] 08048850 <test64()>: 8048850: sub esp,0x1c 8048853: mov DWORD PTR [esp+0x10],0xbadface 804885b: mov DWORD PTR [esp+0x14],0xbadface0 8048863: fild QWORD PTR [esp+0x10] 8048867: fistp QWORD PTR [esp] 804886a: mov eax,DWORD PTR [esp] 804886d: mov edx,DWORD PTR [esp+0x4] 8048871: add esp,0x1c 8048874: ret 8048875: xchg ax,ax 8048877: xchg ax,ax 8048879: xchg ax,ax 804887b: xchg ax,ax 804887d: xchg ax,ax 804887f: nop -->8-- $ g++ -Ofast -m64 --std=c++11 main.cpp $ objdump -d -w -r -C --no-show-raw-insn --disassembler-options=intel a.out | less -S --8<-- 0000000000400950 <test32()>: 400950: mov DWORD PTR [rsp-0x18],0xfacefeed 400958: mov eax,DWORD PTR 
[rsp-0x18] 40095c: ret 40095d: nop DWORD PTR [rax] 0000000000400960 <test64()>: 400960: movabs rax,0xbadface00badface 40096a: mov QWORD PTR [rsp-0x18],rax 40096f: mov rax,QWORD PTR [rsp-0x18] 400974: ret 400975: nop WORD PTR cs:[rax+rax*1+0x0] 40097f: nop -->8-- No barriers in sight.
Feb 07 2014
On Friday, 7 February 2014 at 16:36:03 UTC, Stanislav Blinov wrote:No barriers in sight.Awesome. Then I think we can go back to the old logic.
Feb 07 2014
On Friday, 7 February 2014 at 16:57:50 UTC, Sean Kelly wrote:On Friday, 7 February 2014 at 16:36:03 UTC, Stanislav Blinov wrote:Cool. Also, from http://en.cppreference.com/w/cpp/atomic/memory_order: --8<-- On strongly-ordered systems (x86, SPARC, IBM mainframe), release-acquire ordering is automatic for the majority of operations. No additional CPU instructions are issued for this synchronization mode, only certain compiler optimizations are affected (e.g. the compiler is prohibited from moving non-atomic stores past the atomic store-release or perform non-atomic loads earlier than the atomic load-acquire) -->8--No barriers in sight.Awesome. Then I think we can go back to the old logic.
Feb 07 2014
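As a concrete illustration of the quoted point, here is a hedged C++11 sketch of the double-checked pattern that an AtomicSingleton-style implementation uses, with explicit acquire/release ordering; the `Config` type is invented for the example. Per the discussion above, on x86 the acquire load should compile to a plain MOV that only constrains compiler reordering, with no fence instruction emitted.

```cpp
#include <atomic>
#include <mutex>

// Illustrative payload type (not from the thread).
struct Config { int value; Config() : value(42) {} };

Config* instance() {
    static std::atomic<Config*> ptr{nullptr};
    static std::mutex mtx;

    // Fast path: acquire load. On x86 this is expected to be a plain
    // MOV; the ordering only restricts compiler reordering.
    Config* p = ptr.load(std::memory_order_acquire);
    if (p == nullptr) {
        std::lock_guard<std::mutex> lock(mtx);
        // Re-check under the lock; relaxed is fine here because the
        // mutex already orders this load.
        p = ptr.load(std::memory_order_relaxed);
        if (p == nullptr) {
            p = new Config;
            // Release store publishes the fully constructed object to
            // other threads' acquire loads.
            ptr.store(p, std::memory_order_release);
        }
    }
    return p;
}
```

The acquire/release pairing is what makes the lock-free fast path safe: a thread that sees a non-null pointer is guaranteed to also see the completed construction.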
Am Fri, 07 Feb 2014 17:10:06 +0000 schrieb "Stanislav Blinov" <stanislav.blinov gmail.com>:On Friday, 7 February 2014 at 16:57:50 UTC, Sean Kelly wrote:Strong-ordering does not work on x86/amd64 in two cases: http://preshing.com/20120913/acquire-and-release-semantics/#IDComment721195739 Just thought I should throw that in. Only the official CPU docs will give certainty :) -- MarcoOn Friday, 7 February 2014 at 16:36:03 UTC, Stanislav Blinov wrote:Cool. Also, from http://en.cppreference.com/w/cpp/atomic/memory_order: --8<-- On strongly-ordered systems (x86, SPARC, IBM mainframe), release-acquire ordering is automatic for the majority of operations. No additional CPU instructions are issued for this synchronization mode, only certain compiler optimizations are affected (e.g. the compiler is prohibited from moving non-atomic stores past the atomic store-release or perform non-atomic loads earlier than the atomic load-acquire) -->8--No barriers in sight.Awesome. Then I think we can go back to the old logic.
Feb 07 2014
On 02/07/2014 06:10 PM, Stanislav Blinov wrote:On Friday, 7 February 2014 at 16:57:50 UTC, Sean Kelly wrote: --8<-- On strongly-ordered systems (x86, SPARC, IBM mainframe), release-acquire ordering is automatic for the majority of operations. No additional CPU instructions are issued for this synchronization mode, only certain compiler optimizations are affected (e.g. the compiler is prohibited from moving non-atomic stores past the atomic store-release or perform non-atomic loads earlier than the atomic load-acquire) -->8--So, who is going to fix core.atomic?
Feb 08 2014
On Sunday, 9 February 2014 at 01:40:51 UTC, Martin Nowak wrote:So, who is going to fix core.atomic?I was under the impression that Sean was onto it.
Feb 09 2014
On 02/09/2014 03:07 PM, Stanislav Blinov wrote:On Sunday, 9 February 2014 at 01:40:51 UTC, Martin Nowak wrote:Can you please submit a bug report, so we don't lose track of this.So, who is going to fix core.atomic?I was under the impression that Sean was onto it.
Feb 09 2014
On Sunday, 9 February 2014 at 18:07:50 UTC, Martin Nowak wrote:Can you please submit a bug report, so we don't loose track of this.Sure: https://d.puremagic.com/issues/show_bug.cgi?id=12121
Feb 09 2014
On 7 Feb 2014 15:45, "Sean Kelly" <sean invisibleduck.org> wrote:On Friday, 7 February 2014 at 11:17:49 UTC, Stanislav Blinov wrote:On Friday, 7 February 2014 at 08:10:58 UTC, Sean Kelly wrote:Weird. atomicLoad(raw) should be the same as atomicLoad(acq), and atomicStore(raw) should be the same as atomicStore(rel). At least on x86. I don't know why that change made a difference in performance.huh? --8<-- core/atomic.d template needsLoadBarrier( MemoryOrder ms ) { enum bool needsLoadBarrier = ms != MemoryOrder.raw; } -->8-- Didn't you write this? :)Oops. I thought that since Intel has officially defined loads as having acquire semantics, I had eliminated the barrier requirement there. But I guess not. I suppose it's an issue worth discussing. Does anyone know offhand what C++0x implementations do for load acquires on x86?Speaking of which, I need to add 'Update gcc.atomics to use new C++0x intrinsics' to the GDCProjects page - they map closely to what core.atomic is doing, and should see better performance compared to the __sync intrinsics. :)
Feb 07 2014
Am Fri, 7 Feb 2014 18:42:29 +0000 schrieb Iain Buclaw <ibuclaw gdcproject.org>:On 7 Feb 2014 15:45, "Sean Kelly" <sean invisibleduck.org> wrote:On Friday, 7 February 2014 at 11:17:49 UTC, Stanislav Blinov wrote:On Friday, 7 February 2014 at 08:10:58 UTC, Sean Kelly wrote:Weird. atomicLoad(raw) should be the same as atomicLoad(acq), and atomicStore(raw) should be the same as atomicStore(rel). At least on x86. I don't know why that change made a difference in performance.huh? --8<-- core/atomic.d template needsLoadBarrier( MemoryOrder ms ) { enum bool needsLoadBarrier = ms != MemoryOrder.raw; } -->8-- Didn't you write this? :)Oops. I thought that since Intel has officially defined loads as having acquire semantics, I had eliminated the barrier requirement there. But I guess not. I suppose it's an issue worth discussing. Does anyone know offhand what C++0x implementations do for load acquires on x86?Speaking of which, I need to add 'Update gcc.atomics to use new C++0x intrinsics' to the GDCProjects page - they map closely to what core.atomic is doing, and should see better performance compared to the __sync intrinsics. :)You send shared variables as "volatile" to the backend and that is correct. I wonder since that should create strong ordering of memory operations (correct?), if DMD has something similar, or if D's "shared" isn't really shared at all and relies entirely on the correct use of atomicLoad/atomicStore and atomicFence. In that case, would the GCC backend be able to optimize more around shared variables (by not considering them volatile) and still be no worse off than DMD? -- Marco
Feb 07 2014
On 8 Feb 2014 01:20, "Marco Leise" <Marco.Leise gmx.de> wrote:Am Fri, 7 Feb 2014 18:42:29 +0000 schrieb Iain Buclaw <ibuclaw gdcproject.org>:On 7 Feb 2014 15:45, "Sean Kelly" <sean invisibleduck.org> wrote:On Friday, 7 February 2014 at 11:17:49 UTC, Stanislav Blinov wrote:On Friday, 7 February 2014 at 08:10:58 UTC, Sean Kelly wrote:Weird. atomicLoad(raw) should be the same as atomicLoad(acq), and atomicStore(raw) should be the same as atomicStore(rel). At least on x86. I don't know why that change made a difference in performance.huh? --8<-- core/atomic.d template needsLoadBarrier( MemoryOrder ms ) { enum bool needsLoadBarrier = ms != MemoryOrder.raw; } -->8-- Didn't you write this? :)Oops. I thought that since Intel has officially defined loads as having acquire semantics, I had eliminated the barrier requirement there. But I guess not. I suppose it's an issue worth discussing. Does anyone know offhand what C++0x implementations do for load acquires on x86?Speaking of which, I need to add 'Update gcc.atomics to use new C++0x intrinsics' to the GDCProjects page - they map closely to what core.atomic is doing, and should see better performance compared to the __sync intrinsics. :)You send shared variables as "volatile" to the backend and that is correct. I wonder since that should create strong ordering of memory operations (correct?), if DMD has something similar, or if D's "shared" isn't really shared at all and relies entirely on the correct use of atomicLoad/atomicStore and atomicFence. In that case, would the GCC backend be able to optimize more around shared variables (by not considering them volatile) and still be no worse off than DMD?No. The fact that I decided shared data be marked volatile was *not* because of a strong ordering. Remember, we follow C semantics here, which is quite specific in not guaranteeing this. The reason it is set as volatile, is that it (instead) guarantees the compiler will not generate code that explicitly caches the shared data.
Feb 09 2014
Isn't it great how a simple benchmark thread can reveal such valuable insights and important problems?
Feb 09 2014
Am Sun, 9 Feb 2014 20:47:07 +0000 schrieb Iain Buclaw <ibuclaw gdcproject.org>:On 8 Feb 2014 01:20, "Marco Leise" <Marco.Leise gmx.de> wrote:Am Fri, 7 Feb 2014 18:42:29 +0000 schrieb Iain Buclaw <ibuclaw gdcproject.org>:On 7 Feb 2014 15:45, "Sean Kelly" <sean invisibleduck.org> wrote:On Friday, 7 February 2014 at 11:17:49 UTC, Stanislav Blinov wrote:On Friday, 7 February 2014 at 08:10:58 UTC, Sean Kelly wrote:Weird. atomicLoad(raw) should be the same as atomicLoad(acq), and atomicStore(raw) should be the same as atomicStore(rel). At least on x86. I don't know why that change made a difference in performance.huh? --8<-- core/atomic.d template needsLoadBarrier( MemoryOrder ms ) { enum bool needsLoadBarrier = ms != MemoryOrder.raw; } -->8-- Didn't you write this? :)Oops. I thought that since Intel has officially defined loads as having acquire semantics, I had eliminated the barrier requirement there. But I guess not. I suppose it's an issue worth discussing. Does anyone know offhand what C++0x implementations do for load acquires on x86?Speaking of which, I need to add 'Update gcc.atomics to use new C++0x intrinsics' to the GDCProjects page - they map closely to what core.atomic is doing, and should see better performance compared to the __sync intrinsics. :)You send shared variables as "volatile" to the backend and that is correct. I wonder since that should create strong ordering of memory operations (correct?), if DMD has something similar, or if D's "shared" isn't really shared at all and relies entirely on the correct use of atomicLoad/atomicStore and atomicFence. In that case, would the GCC backend be able to optimize more around shared variables (by not considering them volatile) and still be no worse off than DMD?No. The fact that I decided shared data be marked volatile was *not* because of a strong ordering. Remember, we follow C semantics here, which is quite specific in not guaranteeing this. The reason it is set as volatile, is that it (instead) guarantees the compiler will not generate code that explicitly caches the shared data.Ah, alright then. -- Marco
Feb 17 2014
On Friday, 7 February 2014 at 04:06:40 UTC, Jerry wrote:"Stanislav Blinov" <stanislav.blinov gmail.com> writes:Cool, I almost started to research that CPU of yours :)Here's my latest revision: http://dpaste.dzfl.pl/5b54df1c7004Yup, that helps out the AtomicSingleton a lot. Here's best and worst times for each for dmd and gdc:jlquinn wyvern:~/d/tests$ ~/dmd2/linux/bin64/dmd -O -release -inline -unittest singleton2.d jlquinn wyvern:~/d/tests$ ./singleton2 *Test 2 time for SyncSingleton: 585.992 msecs. Test 2 time for AtomicSingleton: 1189.03 msecs. Test 5 time for SyncSingleton: 796.834 msecs. *Test 5 time for AtomicSingleton: 1069.08 msecs. *Test 7 time for SyncSingleton: 811.711 msecs. Test 7 time for AtomicSingleton: 1263.36 msecs. Test 9 time for SyncSingleton: 605.729 msecs. *Test 9 time for AtomicSingleton: 2173.74 msecs. jlquinn wyvern:~/d/tests$ ../bin/gdc -O3 -finline -frelease -fno-bounds-check -funittest singleton2.d jlquinn wyvern:~/d/tests$ ./a.out Test 0 time for SyncSingleton: 542.797 msecs. *Test 0 time for AtomicSingleton: 257.805 msecs. *Test 5 time for SyncSingleton: 620.052 msecs. Test 5 time for AtomicSingleton: 248.951 msecs. Test 7 time for SyncSingleton: 437.124 msecs. *Test 7 time for AtomicSingleton: 605.781 msecs. *Test 8 time for SyncSingleton: 252.643 msecs. Test 8 time for AtomicSingleton: 279.854 msecs.Nice.
Feb 07 2014
Am Wed, 05 Feb 2014 16:47:40 -0500 schrieb Jerry <jlquinn optonline.net>:"Stanislav Blinov" <stanislav.blinov gmail.com> writes:I just tested with DMD 2.064.2 and my numbers for the AtomicSingleton are not as high. This is on a Core 2 Duo T7250 / 2.0 Ghz. Test 0 time for SyncSingleton: 1068.83 msecs. Test 0 time for AtomicSingleton: 2102.32 msecs. Test 1 time for SyncSingleton: 901.215 msecs. Test 1 time for AtomicSingleton: 2479.6 msecs. Test 2 time for SyncSingleton: 1091.91 msecs. Test 2 time for AtomicSingleton: 2269.45 msecs. Test 3 time for SyncSingleton: 1156.74 msecs. Test 3 time for AtomicSingleton: 2498.25 msecs. Also for GDC my numbers are like this: Test 0 time for SyncSingleton: 657.928 msecs. Test 0 time for AtomicSingleton: 851.795 msecs. Test 1 time for SyncSingleton: 655.204 msecs. Test 1 time for AtomicSingleton: 893.51 msecs. Test 2 time for SyncSingleton: 613.881 msecs. Test 2 time for AtomicSingleton: 843.635 msecs. Test 3 time for SyncSingleton: 657.87 msecs. Test 3 time for AtomicSingleton: 709.823 msecs. Which is far from the difference you see. -- MarcoOn Wednesday, 5 February 2014 at 00:11:58 UTC, Jerry wrote:Core 2 Due T9400. The gdc times were much better for AtomicSingleton - about 4x slower than SyncSingleton.Here's the best and worst times I get on my linux laptop. These are with 2.064.2 dmd and gdc 4.9 with 2.064.2 On Ubuntu x86_64: ~/dmd2/linux/bin64/dmd -O -release -inline -noboundscheck -unittest singleton.d Test 2 time for SyncSingleton: 753.547 msecs. Test 2 time for AtomicSingleton: 22290.3 msecs. Test 3 time for SyncSingleton: 254.968 msecs. Test 3 time for AtomicSingleton: 22903.3 msecs. Test 6 time for SyncSingleton: 510.118 msecs. Test 6 time for AtomicSingleton: 23970.9 msecs. Test 8 time for SyncSingleton: 480.175 msecs. Test 8 time for AtomicSingleton: 12827.9 msecs.Whoah, those times for AtomicSingleton are way high. 
What kind of machine is your laptop?Perhaps we need to repost the test with the latest implementation of AtomicSingleton.I downloaded the test program yesterday.
Feb 07 2014
I was thinking about implementing a typical Java singleton in D, and then decided to first check whether someone already did that, and guess what - yes, someone did. Check this URL: http://dblog.aldacron.net/2007/03/03/singletons-in-d/ Something like this (taken from the article above) in the case you do not want lazy initialisation: class Singleton2(T) { public: static const T instance; private: this() {} static this() { instance = new T; } } class TMySingleton2 : Singleton2!(TMySingleton2) { } Something like this (taken from the article above) in the case you want lazy initialisation: class Singleton(T) { public: static T instance() { if(_instance is null) _instance = new T; return _instance; } private: this() {} static T _instance; } class TMySingleton : Singleton!(TMySingleton) { } If there are some Java programmers around who are curious how the Java version is done: http://www.javaworld.com/article/2073352/core-java/simply-singleton.html
Jan 31 2014
I should have mentioned two things in my previous post. 1) There are no locks involved. No need, because the solution relies on the fact that static member variables are guaranteed to be created the first time they are accessed. 2) Note that we have the constructor disabled. This is important not to forget. ;)
Jan 31 2014
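As a point of comparison (an assumption-laden sketch, not the article's code): C++11 offers a similar "no explicit locks" guarantee through function-local statics ("magic statics"), whose first-call initialization the compiler synchronizes. One important difference from the D version: D's static class members are thread-local by default, so the D template yields one instance per thread, while the C++ static below is process-global. The `MySingleton` class is a hypothetical user type mirroring `TMySingleton`.

```cpp
// CRTP singleton base: C++11 guarantees the function-local static is
// constructed exactly once, even under concurrent first calls.
template <typename T>
class Singleton {
public:
    static T& instance() {
        static T obj;   // thread-safe one-time construction in C++11
        return obj;
    }
protected:
    Singleton() = default;   // constructor hidden, as in the D version
};

// Hypothetical user type (mirrors TMySingleton from the post).
class MySingleton : public Singleton<MySingleton> {
    friend class Singleton<MySingleton>;  // let instance() construct us
public:
    int counter = 0;
private:
    MySingleton() = default;
};
```

The private constructor plus the friend declaration keeps outside code from creating extra instances while still letting the base template construct the one shared object.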
On Friday, 31 January 2014 at 10:26:50 UTC, Dejan Lekic wrote:I should have mentioned two things in my previous post. 1) There are no locks involved. No need, because the solution relies on the fact that static member variables are guaranteed to be created the first time they are accessed.And they are thread-local :)2) Note that we have constructor disabled. This is important not to forget. ;)What use would the const version have? You'd still need some way to access the instance, right? Cast away const?
Jan 31 2014
What use would the const version have? You'd still need some way to access the instance, right? Cast away const?I believe it should have been "final" instead of "const".
Jan 31 2014
On Friday, 31 January 2014 at 11:08:42 UTC, Dejan Lekic wrote:I believe it should have been "final" instead of "const".But D doesn't have "final" :) In any event, that article by Mike Parker is about D1.
Jan 31 2014
On 1/31/14, Stanislav Blinov <stanislav.blinov gmail.com> wrote:On Friday, 31 January 2014 at 11:08:42 UTC, Dejan Lekic wrote:AFAIK D1's final was equivalent to D2's immutable. But I may be remembering that wrong. Or maybe D2 initially used final before settling for the new keyword immutable, to avoid confusion by users.I believe it should have been "final" instead of "const".But D doesn't have "final" :) In any event, that article by Mike Parker is about D1.
Jan 31 2014
On 2014-01-31 12:27, Andrej Mitrovic wrote:AFAIK D1's final was equivalent to D2's immutable. But I may be remembering that wrong.In D2, if a variable is immutable or const you cannot call non-const, non-immutable methods via that variable. D1 didn't have any concept of this. "const" and "final" in D1 mean little more than: you cannot change this variable.Or maybe D2 initially used final before settling for the new keyword immutable, to avoid confusion by users.D2 used "invariant" before it used "immutable". It also changed the meaning of "const" compared to D1. -- /Jacob Carlborg
Jan 31 2014
On 1/31/14, Jacob Carlborg <doob me.com> wrote:In D2 if if a variable is immutable or const you can not call non-const non-immutable methods via that variable. D1 didn't have any concept of this. "const" and "final" in D1 as more, you cannot change this variable.So in D1 const is non-transitive?
Jan 31 2014
On Friday, 31 January 2014 at 12:09:49 UTC, Andrej Mitrovic wrote:On 1/31/14, Jacob Carlborg <doob me.com> wrote:It is completely different in D1. I think it is not even a qualifier there but a storage class - you can't have const function arguments, it is not printed in typeof and, yes, it is non-transitive. It basically just says "you can't modify this memory block". Also const variables with an initializer act as D2 enums. This is one of the reasons why porting Sociomantic code will be quite painful :)In D2, if a variable is immutable or const you cannot call non-const, non-immutable methods via that variable. D1 didn't have any concept of this. "const" and "final" in D1 mean little more than: you cannot change this variable.So in D1 const is non-transitive?
Jan 31 2014
But D doesn't have "final" :) In any event, that article by Mike Parker is about D1.Well, "final" still works. Until it does not we will agree that D does not have it. ;) That article applies to D2 as well, without any problems.
Jan 31 2014
On Friday, 31 January 2014 at 10:20:45 UTC, Dejan Lekic wrote:I was thinking about implementing a typical Java singleton in D, and then decided to first check whether someone already did that, and guess what - yes, someone did. Check this URL: http://dblog.aldacron.net/2007/03/03/singletons-in-d/ Something like this (taken from the article above) in case you do not want lazy initialisation: class Singleton2(T) { public: static const T instance; private: this() {} static this() { instance = new T; } } class TMySingleton2 : Singleton2!(TMySingleton2) { } Something like this (taken from the article above) in case you want lazy initialisation: class Singleton(T) { public: static T instance() { if(_instance is null) _instance = new T; return _instance; } private: this() {} static T _instance; } class TMySingleton : Singleton!(TMySingleton) { } If there are some Java programmers around who are curious how the Java version is done: http://www.javaworld.com/article/2073352/core-java/simply-singleton.htmlWhy is someone interested in implementing such an Anti-Pattern like Singletons? In most cases Singletons are misused.
Jan 31 2014
On Friday, 31 January 2014 at 10:27:28 UTC, Namespace wrote:Why is someone interested in implementing such an Anti-Pattern like Singletons?Why is someone overquoting without reason? ;)In most cases Singletons are misused.Any sort of shared (as in, between threads) resource is often a singleton. A queue for message passing, a concurrent GC, a pipe... Even if it doesn't have SINGLETON (yes, in all capitals to irritate reviewers) in its name.
Jan 31 2014
On Friday, 31 January 2014 at 10:50:57 UTC, Stanislav Blinov wrote:On Friday, 31 January 2014 at 10:27:28 UTC, Namespace wrote:I know so many people and have read so many books where Singletons are misused, that I react a bit allergically to them. In most cases, a singleton is absolutely unnecessary and is a hidden global variable. Sorry if it may have sounded too harsh. ;)Why is someone interested in implementing such an Anti-Pattern like Singletons?Why is someone overquoting without reason? ;)
Jan 31 2014
Here is an updated version of Andrej's code: http://dpaste.dzfl.pl/c85f487c7f70 SingletonSimple is a winner, followed by the SyncSingleton and SingletonLazy.
Jan 31 2014
On 1/31/14, Dejan Lekic <dejan.lekic gmail.com> wrote:SingletonSimple is a winnerWell yeah, but that's not really the only thing a singleton is about. It's also about being able to initialize the singleton at an arbitrary time, rather than in a module constructor before main() is called.
Jan 31 2014
On Friday, 31 January 2014 at 11:42:29 UTC, Andrej Mitrovic wrote:On 1/31/14, Dejan Lekic <dejan.lekic gmail.com> wrote:Absolutely, that is why I would use both alternatives, depending on the use-case.SingletonSimple is a winnerWell yeah, but that's not really the only thing a singleton is about. It's also about being able to initialize the singleton at an arbitrary time, rather than in a module constructor before main() is called.
Jan 31 2014
On 1/31/14, 3:42 AM, Andrej Mitrovic wrote:On 1/31/14, Dejan Lekic <dejan.lekic gmail.com> wrote:Well yah Singleton should be created on first access. AndreiSingletonSimple is a winnerWell yeah, but that's not really the only thing a singleton is about. It's also about being able to initialize the singleton at an arbitrary time, rather than in a module constructor before main() is called.
Jan 31 2014
On Friday, 31 January 2014 at 17:10:08 UTC, Andrei Alexandrescu wrote:On 1/31/14, 3:42 AM, Andrej Mitrovic wrote:If that is what people want, then David's version is definitely the best one.On 1/31/14, Dejan Lekic <dejan.lekic gmail.com> wrote:Well yah Singleton should be created on first access. AndreiSingletonSimple is a winnerWell yeah, but that's not really the only thing a singleton is about. It's also about being able to initialize the singleton at an arbitrary time, rather than in a module constructor before main() is called.
Jan 31 2014
On Friday, 31 January 2014 at 11:34:13 UTC, Dejan Lekic wrote:SingletonSimple is a winner, followed by the SyncSingleton and SingletonLazy.Dejan, your singletons are thread-local :)
Jan 31 2014
On Friday, 31 January 2014 at 11:44:10 UTC, Stanislav Blinov wrote:On Friday, 31 January 2014 at 11:34:13 UTC, Dejan Lekic wrote:YAY, that is correct! :'(SingletonSimple is a winner, followed by the SyncSingleton and SingletonLazy.Dejan, your singletons are thread-local :)
Jan 31 2014
On 1/31/14, Dejan Lekic <dejan.lekic gmail.com> wrote:SingletonLazy.SingletonLazy isn't thread-safe. :)
Jan 31 2014
On Friday, 31 January 2014 at 11:45:56 UTC, Andrej Mitrovic wrote:On 1/31/14, Dejan Lekic <dejan.lekic gmail.com> wrote:I made it thread-safe, and guess what - I ended up with a SyncSingleton-like solution! So SyncSingleton is a clean winner if you want to make it lazy.SingletonLazy.SingletonLazy isn't thread-safe. :)
Jan 31 2014
On Friday, 31 January 2014 at 08:25:16 UTC, Andrej Mitrovic wrote:class LockSingleton { static LockSingleton get() { __gshared LockSingleton _instance; synchronized { if (_instance is null) _instance = new LockSingleton; } return _instance; } private: this() { } }Shouldn't the LockSingleton be implemented like this instead? class LockSingleton { static auto get() { if (_instance is null) { synchronized { if (_instance is null) _instance = new LockSingleton; } } return _instance; } private: this() { } __gshared LockSingleton _instance; } At least this is the way singleton is suggested to be implemented: synchronization is then needed only for initial instantiation and not always.
Feb 07 2014
On 7 February 2014 10:25, TC <chalucha gmail.com> wrote:On Friday, 31 January 2014 at 08:25:16 UTC, Andrej Mitrovic wrote:We don't want double-checked locking. :) This was discussed at dconf, the D way is to leverage native thread-local storage. I seem to recall that when David tested this, GDC had pretty much identical speeds to unsafe gets(). You'll have to consult the slides, but I think it was something like: class LockSingleton { static auto get() { if (!_instantiated) { synchronized (LockSingleton.classinfo) { if (_instance is null) _instance = new LockSingleton; _instantiated = true; } } return _instance; } private: this() { } static bool _instantiated; __gshared LockSingleton _instance; }class LockSingleton { static LockSingleton get() { __gshared LockSingleton _instance; synchronized { if (_instance is null) _instance = new LockSingleton; } return _instance; } private: this() { } }Shouldn't the LockSingleton be implemented like this instead? class LockSingleton { static auto get() { if (_instance is null) { synchronized { if (_instance is null) _instance = new LockSingleton; } } return _instance; } private: this() { } __gshared LockSingleton _instance; } synchronization is then needed only for initial instantiation and not always.
Feb 07 2014
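The TLS-flag idiom above can be sketched as a complete, runnable program (the class and variable names here are illustrative, not from the slides): the thread-local `_instantiated` flag means each thread takes the lock at most once, after which `get()` is just a plain TLS read.

```d
import core.thread;
import std.stdio;

class LowLockSingleton
{
    static LowLockSingleton get()
    {
        if (!_instantiated) // TLS read: true at most once per thread
        {
            synchronized (LowLockSingleton.classinfo)
            {
                if (_instance is null)
                    _instance = new LowLockSingleton;
                _instantiated = true;
            }
        }
        return _instance;
    }

private:
    this() { }
    static bool _instantiated;            // thread-local flag
    __gshared LowLockSingleton _instance; // one instance per process
}

void main()
{
    LowLockSingleton[2] seen;
    auto t = new Thread({ seen[0] = LowLockSingleton.get(); });
    t.start();
    t.join();
    seen[1] = LowLockSingleton.get();
    writeln(seen[0] is seen[1]); // prints "true": both threads see one instance
}
```

The fast path never takes the lock once the flag is set, which is why the benchmarks in this thread show it close to an unsynchronized read.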
On Friday, 7 February 2014 at 10:25:52 UTC, TC wrote:Shouldn't the LockSingleton be implemented like this instead? class LockSingleton { static auto get() { if (_instance is null)(_instance is null) will most likely not be an atomic operation. References are two words. Imagine that one thread writes half a reference inside synchronized {}, then goes to sleep. What would the thread that gets to that 'if' return? I'd say it'll return "ouch".
Feb 07 2014
"Stanislav Blinov" wrote in message news:idrxthgkumydmiszdtcx forum.dlang.org...(_instance is null) will most likely not be an atomic operation. References are two words.References are one word.
Feb 07 2014
On Friday, 7 February 2014 at 11:36:23 UTC, Daniel Murphy wrote:"Stanislav Blinov" wrote in message news:idrxthgkumydmiszdtcx forum.dlang.org...Heh, indeed. Need to go have my brain scanned :\ I have no idea why I thought that.(_instance is null) will most likely not be an atomic operation. References are two words.References are one word.
Feb 07 2014
On Friday, 7 February 2014 at 11:31:14 UTC, Stanislav Blinov wrote:On Friday, 7 February 2014 at 10:25:52 UTC, TC wrote:Scratch that.Shouldn't the LockSingleton be implemented like this instead? class LockSingleton { static auto get() { if (_instance is null)(_instance is null) will most likely not be an atomic operation. References are two words. Imagine that one thread writes half a reference inside synchronized {}, then goes to sleep. What would the thread that gets to that 'if' return? I'd say it'll return "ouch".
Feb 07 2014
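For completeness, the atomic alternative mentioned at the top of the thread (Martin Nowak's Reddit suggestion) can be sketched with core.atomic. This is a hedged sketch, not a definitive implementation: the exact shared-qualifier casts required vary across druntime versions, and the class name is illustrative. The key point is the acquire/release pairing, which is what makes the single-word reference read safe even though the object is constructed under the lock.

```d
import core.atomic;

class AtomicSingleton
{
    static AtomicSingleton get()
    {
        // Acquire load pairs with the release store below: a thread that
        // sees a non-null reference also sees the constructed object.
        auto inst = atomicLoad!(MemoryOrder.acq)(_instance);
        if (inst is null)
        {
            synchronized (AtomicSingleton.classinfo)
            {
                // Re-check under the lock; raw ordering suffices here
                // because the lock already orders the accesses.
                inst = atomicLoad!(MemoryOrder.raw)(_instance);
                if (inst is null)
                {
                    inst = cast(shared) new AtomicSingleton;
                    // Release store publishes the fully built object.
                    atomicStore!(MemoryOrder.rel)(_instance, inst);
                }
            }
        }
        return cast(AtomicSingleton) inst;
    }

private:
    this() { }
    static shared AtomicSingleton _instance;
}
```

Since a class reference is one word (as Daniel points out above), the load itself cannot observe a half-written reference; the memory orderings exist to order the *object's* initialization relative to the publication of the reference.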
On 31.1.2014. 9:25, Andrej Mitrovic wrote:There was a nice blog-post about implementing low-lock singletons in D, here: http://davesdprogramming.wordpress.com/2013/05/06/low-lock-singletons/ One suggestion on Reddit was by dawgfoto (I think this is Martin Nowak?), to use atomic primitives instead: http://www.reddit.com/r/programming/comments/1droaa/lowlock_singletons_in_d_the_singleton_pattern/c9tmz07 I wanted to benchmark these different approaches. I was expecting Martin's implementation to be the fastest one, but on my machine (Athlon II X4 620 - 2.61GHz) the implementation in the blog post turns out to be the fastest one. I'm wondering whether my test case is flawed in some way. Btw, I think we should put an implementation of this into Phobos. The timings on my machine: Test time for LockSingleton: 542 msecs. Test time for SyncSingleton: 20 msecs. Test time for AtomicSingleton: 755 msecs.What about swapping the function pointer so the check is done only once per thread? (Thread is tl;dr so I am sorry if someone already suggested this) -------------------------------------------------- class FunctionPointerSingleton { private static __gshared typeof(this) instance_; // tls @property static typeof(this) function () get; static this () { get = { synchronized { if (instance_ is null) instance_ = new typeof(this)(); get = { return instance_; }; return instance_; } }; } } -------------------------------------------------- dmd -release -inline -O -noboundscheck -unittest -run singleton.d Test time for LockSingleton: 901 msecs. Test time for SyncSingleton: 20.75 msecs. Test time for AtomicSingleton: 169 msecs. Test time for FunctionPointerSingleton: 7.5 msecs. I don't have such a muscular machine xD
Feb 09 2014
On Sunday, 9 February 2014 at 12:20:54 UTC, luka8088 wrote:What about swapping function pointer so the check is done only once per thread? (Thread is tldr so I am sorry if someone already suggested this)That is an interesting idea indeed, though it seems to be faster only for dmd. I haven't studied the assembly yet, but with LDC I don't see any noticeable difference between SyncSingleton and FunctionPointerSingleton.
Feb 09 2014
On 9.2.2014. 15:09, Stanislav Blinov wrote:On Sunday, 9 February 2014 at 12:20:54 UTC, luka8088 wrote:I got it while writing code for dynamic languages (especially javascript). Thought came that instead of checking for something that you know will always have the same result just remove that piece of code and voila :)What about swapping function pointer so the check is done only once per thread? (Thread is tldr so I am sorry if someone already suggested this)That is an interesting idea indeed, though it seems to be faster only for dmd. I haven't studied the assembly yet, but with LDC I don't see any noticeable difference between SyncSingleton and FunctionPointerSingleton.
Feb 09 2014
On 02/09/2014 01:20 PM, luka8088 wrote:class FunctionPointerSingleton { private static __gshared typeof(this) instance_; // tls @property static typeof(this) function () get;You don't even need to make this TLS, right?static this () { get = { synchronized { if (instance_ is null) instance_ = new typeof(this)(); get = { return instance_; }; return instance_; } }; } }
Feb 09 2014
On Sunday, 9 February 2014 at 18:06:46 UTC, Martin Nowak wrote:On 02/09/2014 01:20 PM, luka8088 wrote:I don't follow. get should be TLS, as a replacement for SyncSingleton's _instantiated TLS bool.class FunctionPointerSingleton { private static __gshared typeof(this) instance_; // tls property static typeof(this) function () get;You don't even need to make this TLS, right?
Feb 09 2014
On 9.2.2014. 19:51, Stanislav Blinov wrote:On Sunday, 9 February 2014 at 18:06:46 UTC, Martin Nowak wrote:It is tls and it needs to be tls because one thread could be replacing where get points to while another is trying to access it. It's either tls or putting some synchronization above it which would break the whole idea of executing synchronized block only once per thread.On 02/09/2014 01:20 PM, luka8088 wrote:I don't follow. get should be TLS, as a replacement for SyncSingleton's _instantiated TLS bool.class FunctionPointerSingleton { private static __gshared typeof(this) instance_; // tls property static typeof(this) function () get;You don't even need to make this TLS, right?
Feb 09 2014
On 2/9/14, luka8088 <luka8088 owave.net> wrote:What about swapping the function pointer so the check is done only once per thread? (Thread is tl;dr so I am sorry if someone already suggested this)Interesting solution for sure.// tls @property static typeof(this) function () get;This confused me for a second since @property is meaningless for variables. :>
Feb 10 2014
On 10.2.2014. 10:52, Andrej Mitrovic wrote:On 2/9/14, luka8088 <luka8088 owave.net> wrote:Yeah. My mistake. It should be removed.What about swapping the function pointer so the check is done only once per thread? (Thread is tl;dr so I am sorry if someone already suggested this)Interesting solution for sure.// tls @property static typeof(this) function () get;This confused me for a second since @property is meaningless for variables. :>
Feb 10 2014
On 2/9/14, luka8088 <luka8088 owave.net> wrote:private static __gshared typeof(this) instance_;Also, "static __gshared" is really meaningless here: it's either static (TLS) or globally shared, and either way it's not a per-instance field, so you can type __gshared alone here. Otherwise I'm not sure what the semantics of a per-class-instance __gshared field would be, if that can exist.
Feb 10 2014
On 10.2.2014. 10:54, Andrej Mitrovic wrote:On 2/9/14, luka8088 <luka8088 owave.net> wrote:"static" does not mean it must be tls, as "static shared" is valid. I just like to write that it is static and not shared. I know that __gshared does imply static but this implication is not intuitive to me so I write it explicitly. For example, I think that the following code should output 5 and 6 (as it would if __gshared did not imply static): module program; import std.stdio; import core.thread; class A { __gshared int i; } void main () { auto a1 = new A(); auto a2 = new A(); (new Thread({ a1.i = 5; a2.i = 6; (new Thread({ writeln(a1.i); writeln(a2.i); })).start(); })).start(); } But in any case, this variable is just __gshared.private static __gshared typeof(this) instance_;Also, "static __gshared" is really meaningless here: it's either static (TLS) or globally shared, and either way it's not a per-instance field, so you can type __gshared alone here. Otherwise I'm not sure what the semantics of a per-class-instance __gshared field would be, if that can exist.
Feb 10 2014
On 10.2.2014. 13:44, luka8088 wrote:On 10.2.2014. 10:54, Andrej Mitrovic wrote:Um actually this makes no sense. But anyway I mark it static.On 2/9/14, luka8088 <luka8088 owave.net> wrote:"static" does not mean it must be tls, as "static shared" is valid. I just like to write that it is static and not shared. I know that __gshared does imply static but this implication is not intuitive to me so I write it explicitly. For example, I think that the following code should output 5 and 6 (as it would if __gshared did not imply static): module program; import std.stdio; import core.thread; class A { __gshared int i; } void main () { auto a1 = new A(); auto a2 = new A(); (new Thread({ a1.i = 5; a2.i = 6; (new Thread({ writeln(a1.i); writeln(a2.i); })).start(); })).start(); } But in any case, this variable is just __gshared.private static __gshared typeof(this) instance_;Also, "static __gshared" is really meaningless here: it's either static (TLS) or globally shared, and either way it's not a per-instance field, so you can type __gshared alone here. Otherwise I'm not sure what the semantics of a per-class-instance __gshared field would be, if that can exist.
Feb 10 2014
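A quick runnable check of that correction: a __gshared class field gets a single static storage location for the whole program, so a single-threaded version of luka's hypothetical example prints 6 twice, not 5 and 6.

```d
import std.stdio;

class A
{
    __gshared int i; // __gshared implies static storage: one location total
}

void main()
{
    auto a1 = new A();
    auto a2 = new A();
    a1.i = 5;
    a2.i = 6;      // overwrites the same storage as a1.i
    writeln(a1.i); // prints 6, not 5
    writeln(a2.i); // prints 6
}
```

This is why "static __gshared" adds nothing: the field is never part of the instance in the first place.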
On 2/10/14, luka8088 <luka8088 owave.net> wrote:"static" does not mean it must be tls, as "static shared" is valid.Yes you're right. I'm beginning to really dislike the 20 different meanings of "static". :)
Feb 10 2014
"Andrej Mitrovic" wrote in message news:mailman.111.1392039607.21734.digitalmars-d puremagic.com...Yes you're right. I'm beginning to really dislike the 20 different meanings of "static". :)Don't forget that __gshared static and static __gshared do different things!
Feb 10 2014
On 2/10/14, Daniel Murphy <yebbliesnospam gmail.com> wrote:Don't forget that __gshared static and static __gshared do different things!wat.
Feb 10 2014
On Monday, 10 February 2014 at 16:53:35 UTC, Andrej Mitrovic wrote:On 2/10/14, Daniel Murphy <yebbliesnospam gmail.com> wrote:To be more specific: "WATWATWAT"Don't forget that __gshared static and static __gshared do different things!wat.
Feb 10 2014
On Monday, 10 February 2014 at 14:15:58 UTC, Daniel Murphy wrote:"Andrej Mitrovic" wrote in message news:mailman.111.1392039607.21734.digitalmars-d puremagic.com...Care to elaborate?Yes you're right. I'm beginning to really dislike the 20 different meanings of "static". :)Don't forget that __gshared static and static __gshared do different things!
Feb 10 2014
"Dejan Lekic" wrote in message news:nvakemdpugwupoqctrtd forum.dlang.org...https://d.puremagic.com/issues/show_bug.cgi?id=4419Don't forget that __gshared static and static __gshared do different things!Care to elaborate?
Feb 10 2014
On Tuesday, 11 February 2014 at 03:43:35 UTC, Daniel Murphy wrote:"Dejan Lekic" wrote in message news:nvakemdpugwupoqctrtd forum.dlang.org...Ah, that thing. Yeah this whole issue is rather messy IMO.https://d.puremagic.com/issues/show_bug.cgi?id=4419Don't forget that __gshared static and static __gshared do different things!Care to elaborate?
Feb 11 2014
"Andrej Mitrovic" <andrej.mitrovich gmail.com> writes:On Tuesday, 11 February 2014 at 03:43:35 UTC, Daniel Murphy wrote:Looking at the bug, I see the compiler doesn't implement what the spec says. The spec says __gshared implies static. Is the messiness fixing the implementation to match the spec, or refining the spec to better define what should happen?"Dejan Lekic" wrote in message news:nvakemdpugwupoqctrtd forum.dlang.org...Ah, that thing. Yeah this whole issue is rather messy IMO.https://d.puremagic.com/issues/show_bug.cgi?id=4419Don't forget that __gshared static and static __gshared do > differentthings! Care to elaborate?
Feb 11 2014
"Jerry" wrote in message news:87sirpbjdf.fsf optonline.net...Looking at the bug, I see the compiler doesn't implement what the spec says. The spec says __gshared implies static. Is the messiness fixing the implementation to match the spec, or refining the spec to better define what should happen?It's just messy in the sense that it doesn't behave in a logical or useful way.
Feb 12 2014
On 2/9/14, luka8088 <luka8088 owave.net> wrote:dmd -release -inline -O -noboundscheck -unittest -run singleton.d Test time for LockSingleton: 901 msecs. Test time for SyncSingleton: 20.75 msecs. Test time for AtomicSingleton: 169 msecs. Test time for FunctionPointerSingleton: 7.5 msecs.C:\dev\code\d_code>test_dmd Test time for LockSingleton: 438 msecs. Test time for SyncSingleton: 6.25 msecs. Test time for AtomicSingleton: 8 msecs. Test time for FunctionPointerSingleton: 5 msecs. C:\dev\code\d_code>test_ldc Test time for LockSingleton: 575.5 msecs. Test time for SyncSingleton: 5 msecs. Test time for AtomicSingleton: 3 msecs. Test time for FunctionPointerSingleton: 5.25 msecs. It seems it makes a tiny bit of difference for DMD, but LDC still generates better codegen for the atomic version.
Feb 10 2014
On 10.2.2014. 10:59, Andrej Mitrovic wrote:On 2/9/14, luka8088 <luka8088 owave.net> wrote:Could it be that TLS is slower in LLVM?dmd -release -inline -O -noboundscheck -unittest -run singleton.d Test time for LockSingleton: 901 msecs. Test time for SyncSingleton: 20.75 msecs. Test time for AtomicSingleton: 169 msecs. Test time for FunctionPointerSingleton: 7.5 msecs.C:\dev\code\d_code>test_dmd Test time for LockSingleton: 438 msecs. Test time for SyncSingleton: 6.25 msecs. Test time for AtomicSingleton: 8 msecs. Test time for FunctionPointerSingleton: 5 msecs. C:\dev\code\d_code>test_ldc Test time for LockSingleton: 575.5 msecs. Test time for SyncSingleton: 5 msecs. Test time for AtomicSingleton: 3 msecs. Test time for FunctionPointerSingleton: 5.25 msecs. It seems it makes a tiny bit of difference for DMD, but LDC still generates better codegen for the atomic version.
Feb 10 2014