
digitalmars.D.learn - Threadpools, difference between DMD and LDC

reply "Philippe Sigaud" <philippe.sigaud gmail.com> writes:
I'm trying to grok message passing. That's my very first foray
into this, so I'm probably making every mistake in the book :-)

I wrote a small threadpool test, it's there:

http://dpaste.dzfl.pl/3d3a65a00425

I'm playing with the number of threads and the number of tasks,
and getting a feel about how message passing works. I must say I
quite like it: it's a bit like suddenly being able to safely
return different types from a function.

What I don't get is the difference between DMD (I'm using 2.065)
and LDC (0.14-alpha1).

For DMD, I compile with -O -inline -noboundscheck
For LDC, I use -O3 -inline

LDC gives me smaller executables than DMD (also, 3 to 5 times
smaller than 0.13, good job!) but above all else incredibly,
astoundingly faster. I'm used to LDC producing 20-30% faster
programs, but here it's 1000 times faster!

8 threads, 1000 tasks: DMD:  4000 ms, LDC: 3 ms (!)

So my current hypothesis is a) I'm doing something wrong or b)
the tasks are optimized away or something.

Can someone confirm the results and tell me what I'm doing wrong?
Aug 03 2014
parent reply "safety0ff" <safety0ff.dev gmail.com> writes:
On Sunday, 3 August 2014 at 19:52:42 UTC, Philippe Sigaud wrote:
 Can someone confirm the results and tell me what I'm doing 
 wrong?
LDC is likely optimizing the summation:

int sum = 0;
foreach(i; 0..task.goal)
    sum += i;

to something like:

int sum = cast(int)(cast(ulong)(task.goal-1)*task.goal/2);
Aug 03 2014
parent reply "David Nadlinger" <code klickverbot.at> writes:
On Sunday, 3 August 2014 at 22:24:22 UTC, safety0ff wrote:
 On Sunday, 3 August 2014 at 19:52:42 UTC, Philippe Sigaud wrote:
 Can someone confirm the results and tell me what I'm doing 
 wrong?
LDC is likely optimizing the summation:

int sum = 0;
foreach(i; 0..task.goal)
    sum += i;

to something like:

int sum = cast(int)(cast(ulong)(task.goal-1)*task.goal/2);
This is correct – the LLVM optimizer indeed gets rid of the loop completely. Although I'd be more than happy to be able to claim a thousandfold speedup over DMD on real-world applications. ;)

Cheers,
David
Aug 03 2014
parent reply Philippe Sigaud via Digitalmars-d-learn writes:
 This is correct – the LLVM optimizer indeed gets rid of the loop completely.
OK, that's clever. But I get this even when I put a writeln("some msg") inside the task. I thought a write couldn't be optimized away like that, and that it's a slow operation?

Anyway, I discovered Thread.sleep() in core.thread in the meantime, I'll use that. I just wanted to have tasks taking a different amount of time each time.

I have another question: it seems I can spawn hundreds of threads (heck, even 10_000 is accepted), even though I have 4-8 cores. Is there a limit to the number of threads? I tried a threadpool because in my application I feared having to spawn ~100-200 threads, but if that's not the case, I can drastically simplify my code. Is spawning a thread a slow operation in general?
Aug 03 2014
next sibling parent reply "Kapps" <opantm2+spam gmail.com> writes:
On Monday, 4 August 2014 at 05:14:22 UTC, Philippe Sigaud via 
Digitalmars-d-learn wrote:
 I have another question: it seems I can spawn hundreds of threads
 (heck, even 10_000 is accepted), even though I have 4-8 cores. Is there
 a limit to the number of threads? I tried a threadpool because in my
 application I feared having to spawn ~100-200 threads, but if that's
 not the case, I can drastically simplify my code. Is spawning a thread
 a slow operation in general?
Without going into much detail: threads are heavy, and creating a thread is an expensive operation (which is partially why virtually every standard library includes a thread pool). Along with the overhead of creating the thread, you also get the overhead of additional context switches for each thread you have actively running. Context switches are expensive and a significant waste of time: your CPU sits there doing effectively nothing while the OS schedules which thread gets to go next and restores its context to run again. With 10,000 threads, even if you don't run into limits on how many threads you can have, this overhead becomes very significant.

I haven't looked at your code in detail, but consider using TaskPool if you just want to schedule some tasks to run amongst a few threads, or potentially using fibers (which are fairly light-weight) instead of threads.
Aug 03 2014
parent reply Philippe Sigaud via Digitalmars-d-learn writes:
 Without going into much detail: Threads are heavy, and creating a thread is
 an expensive operation (which is partially why virtually every standard
 library includes a ThreadPool).
 I haven't looked at your code in detail, but consider using TaskPool if
 you just want to schedule some tasks to run amongst a few threads, or
 potentially using fibers (which are fairly light-weight) instead of
 threads.
OK, I get it. Just to be sure: there is no ThreadPool in Phobos or in core, right? IIRC, there are fibers somewhere in core, I'll have a look. I also heard that vibe.d has them.
Aug 04 2014
next sibling parent reply "Chris Cain" <zshazz gmail.com> writes:
On Monday, 4 August 2014 at 12:05:31 UTC, Philippe Sigaud via 
Digitalmars-d-learn wrote:
 OK, I get it. Just to be sure, there is no ThreadPool in Phobos 
 or in
 core, right?
 IIRC, there are fibers somewhere in core, I'll have a look. I also
 heard that vibe.d has them.
There is. It's called taskPool, though:
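A minimal sketch of using it, assuming only the documented std.parallelism API (sumTo is a made-up stand-in for the benchmark's unit of work):

import std.parallelism : task, taskPool;
import std.stdio : writeln;

// Stand-in for the benchmark's unit of work.
int sumTo(int goal)
{
    int sum = 0;
    foreach (i; 0 .. goal)
        sum += i;
    return sum;
}

void main()
{
    // Create a task, hand it to the default pool, then block on the result.
    auto t = task!sumTo(1_000);
    taskPool.put(t);
    writeln(t.yieldForce); // 499500
}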
Aug 04 2014
parent Philippe Sigaud via Digitalmars-d-learn writes:
On Mon, Aug 4, 2014 at 2:13 PM, Chris Cain via Digitalmars-d-learn
<digitalmars-d-learn puremagic.com> wrote:

 OK, I get it. Just to be sure, there is no ThreadPool in Phobos or in
 core, right?
 There is. It's called taskPool, though:


Ah, std.parallelism. I stoopidly searched in std.concurrency and core.* Thanks!
Aug 04 2014
prev sibling parent "Dicebot" <public dicebot.lv> writes:
On Monday, 4 August 2014 at 12:05:31 UTC, Philippe Sigaud via 
Digitalmars-d-learn wrote:
 IIRC, there are fibers somewhere in core, I'll have a look. I also
 heard that vibe.d has them.
vibe.d adds some abstractions of its own on top, for example the "Task" concept and the notion of Isolated types for message passing, but the basics are from Phobos.
Aug 04 2014
prev sibling next sibling parent "David Nadlinger" <code klickverbot.at> writes:
On Monday, 4 August 2014 at 05:14:22 UTC, Philippe Sigaud via 
Digitalmars-d-learn wrote:
 This is correct – the LLVM optimizer indeed gets rid of the 
 loop completely.
OK, that's clever. But I get this even when I put a writeln("some msg") inside the task. I thought a write couldn't be optimized away like that, and that it's a slow operation?
You need the _result_ of the computation for the writeln. LLVM's optimizer recognizes what the loop tries to compute, though, and replaces it with an equivalent expression for the sum of the series, as Trass3r alluded to. Cheers, David
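A minimal illustration of the point (mine, not from the thread): only when the loop's result is actually consumed must the compiler produce the value at all, and even then LDC may fold the loop into the closed form n*(n-1)/2.

import std.stdio : writeln;

void main()
{
    enum n = 1_000;
    int sum = 0;
    foreach (i; 0 .. n)
        sum += i;
    // Printing only a constant string leaves `sum` dead, so the loop can
    // be deleted outright. Printing `sum` forces the value to exist, but
    // the optimizer may still compute it as n*(n-1)/2 without looping.
    writeln("some msg"); // does not keep the loop alive
    writeln(sum);        // keeps the *value* alive, not the loop
}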
Aug 04 2014
prev sibling parent reply "Dicebot" <public dicebot.lv> writes:
On Monday, 4 August 2014 at 05:14:22 UTC, Philippe Sigaud via 
Digitalmars-d-learn wrote:
 I have another question: it seems I can spawn hundreds of threads
 (heck, even 10_000 is accepted), even though I have 4-8 cores. Is there
 a limit to the number of threads? I tried a threadpool because in my
 application I feared having to spawn ~100-200 threads, but if that's
 not the case, I can drastically simplify my code. Is spawning a thread
 a slow operation in general?
Most likely those threads either do nothing or are short-lived, so you don't actually get 10 000 threads running simultaneously. In general you should expect your operating system to start stalling at a few thousand concurrent threads competing for context switches and system resources. Creating a new thread is a rather costly operation, though you may not spot it in synthetic snippets, only under actual load.

The modern default approach is to have a number of "worker" threads equal or close to the number of CPU cores, and to handle internal scheduling manually via fibers or some similar solution.

If you are totally new to the topic of concurrent services, getting familiar with http://en.wikipedia.org/wiki/C10k_problem may be useful :)
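As a minimal sketch of that setup (assuming only std.parallelism's documented TaskPool and totalCPUs; this is not code from the thread):

import std.parallelism : TaskPool, totalCPUs;

void main()
{
    // The global taskPool defaults to totalCPUs - 1 workers, the main
    // thread being the implicit extra one; an explicit pool lets you
    // pick the worker count yourself.
    auto pool = new TaskPool(totalCPUs);
    scope (exit) pool.finish(); // tell workers to stop once the queue drains
}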
Aug 04 2014
next sibling parent reply Philippe Sigaud via Digitalmars-d-learn writes:
On Mon, Aug 4, 2014 at 3:36 PM, Dicebot via Digitalmars-d-learn
<digitalmars-d-learn puremagic.com> wrote:

 Most likely those threads either do nothing or are short-lived, so you
 don't actually get 10 000 threads running simultaneously. In general
 you should expect your operating system to start stalling at a few
 thousand concurrent threads competing for context switches and system
 resources. Creating a new thread is a rather costly operation, though
 you may not spot it in synthetic snippets, only under actual load.

 The modern default approach is to have a number of "worker" threads
 equal or close to the number of CPU cores, and to handle internal
 scheduling manually via fibers or some similar solution.
That's what I guessed. It's just that I have tasks that will generate other (linked) tasks, in a DAG. I can use a thread pool of 2-8 threads, but that means storing tasks and their relationships (which is waiting on which, etc.). I rather liked the idea of spawning new threads when I needed them ;)
 If you are totally new to the topic of concurrent services, getting familiar
 with http://en.wikipedia.org/wiki/C10k_problem may be useful :)
I'll have a look. I'm quite new; my only knowledge comes from reading the concurrency threads here, std.concurrency, std.parallelism and TDPL :)
Aug 04 2014
next sibling parent "Marc =?UTF-8?B?U2Now7x0eiI=?= <schuetzm gmx.net> writes:
On Monday, 4 August 2014 at 14:56:36 UTC, Philippe Sigaud via 
Digitalmars-d-learn wrote:
 On Mon, Aug 4, 2014 at 3:36 PM, Dicebot via Digitalmars-d-learn
 <digitalmars-d-learn puremagic.com> wrote:
 The modern default approach is to have a number of "worker" threads
 equal or close to the number of CPU cores, and to handle internal
 scheduling manually via fibers or some similar solution.
That's what I guessed. It's just that I have tasks that will generate other (linked) tasks, in a DAG. I can use a thread pool of 2-8 threads, but that means storing tasks and their relationships (which is waiting on which, etc.). I rather liked the idea of spawning new threads when I needed them ;)
If you can live with the fact that your tasks might not be truly parallel (i.e. don't use busy waiting or other things that assume that other tasks make progress while a specific task is running), and you only use them for computing (no synchronous I/O), you can still use the fibers in core.thread:
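A minimal sketch of driving such a fiber, using only the documented core.thread API: a fiber is resumed explicitly with call() and suspends itself with yield().

import core.thread : Fiber;
import std.stdio : writeln;

void main()
{
    // The delegate runs on the fiber's own stack until it yields.
    auto fib = new Fiber({
        writeln("step 1");
        Fiber.yield();       // suspend, returning control to the caller
        writeln("step 2");
    });

    fib.call(); // prints "step 1", then suspends at yield()
    fib.call(); // resumes after yield(), prints "step 2", fiber ends
}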
Aug 04 2014
prev sibling parent reply "Dicebot" <public dicebot.lv> writes:
On Monday, 4 August 2014 at 14:56:36 UTC, Philippe Sigaud via 
Digitalmars-d-learn wrote:
 On Mon, Aug 4, 2014 at 3:36 PM, Dicebot via Digitalmars-d-learn
 <digitalmars-d-learn puremagic.com> wrote:

 Most likely those threads either do nothing or are short-lived, so you
 don't actually get 10 000 threads running simultaneously. In general
 you should expect your operating system to start stalling at a few
 thousand concurrent threads competing for context switches and system
 resources. Creating a new thread is a rather costly operation, though
 you may not spot it in synthetic snippets, only under actual load.

 The modern default approach is to have a number of "worker" threads
 equal or close to the number of CPU cores, and to handle internal
 scheduling manually via fibers or some similar solution.
That's what I guessed. It's just that I have tasks that will generate other (linked) tasks, in a DAG. I can use a thread pool of 2-8 threads, but that means storing tasks and their relationships (which is waiting on which, etc.). I rather liked the idea of spawning new threads when I needed them ;)
vibe.d additions may help here:

http://vibed.org/api/vibe.core.core/runTask
http://vibed.org/api/vibe.core.core/runWorkerTask
http://vibed.org/api/vibe.core.core/workerThreadCount

The "task" abstraction allows exactly that - spawning a new execution context and having it scheduled automatically via an underlying fiber/thread pool. However, I am not aware of any good tutorials about using those, so jump in at your own risk.
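A rough, untested sketch based only on the vibe.core.core documentation linked above (details may differ between vibe.d versions):

import vibe.core.core : exitEventLoop, runEventLoop, runTask;
import std.stdio : writeln;

void main()
{
    // runTask starts a fiber-backed task scheduled on vibe.d's event
    // loop instead of on a dedicated kernel thread.
    runTask({
        writeln("task running");
        exitEventLoop(); // stop the loop so main() can return
    });
    runEventLoop();
}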
 If you are totally new to the topic of concurrent services, getting
 familiar with http://en.wikipedia.org/wiki/C10k_problem may be useful :)
I'll have a look. I'm quite new; my only knowledge comes from reading the concurrency threads here, std.concurrency, std.parallelism and TDPL :)
Have fun :P It is a rapidly changing topic though; best practices may be out of date by the time you have read them :)
Aug 04 2014
parent reply Philippe Sigaud via Digitalmars-d-learn writes:
On Mon, Aug 4, 2014 at 6:21 PM, Dicebot via Digitalmars-d-learn
<digitalmars-d-learn puremagic.com> wrote:

 vibe.d additions may help here:

 http://vibed.org/api/vibe.core.core/runTask
 http://vibed.org/api/vibe.core.core/runWorkerTask
 http://vibed.org/api/vibe.core.core/workerThreadCount

 "task" abstraction allows exactly that - spawning new execution context and
 have it scheduled automatically via underlying fiber/thread pool. However, I
 am not aware of any good tutorials about using those so jump in at your own
 risk.
Has anyone used (the fibers/tasks of) vibe.d for something other than powering websites?
Aug 04 2014
next sibling parent "Dicebot" <public dicebot.lv> writes:
On Monday, 4 August 2014 at 21:19:14 UTC, Philippe Sigaud via 
Digitalmars-d-learn wrote:
 Has anyone used (the fibers/tasks of) vibe.d for something other than
 powering websites?
Atila has implemented an MQTT broker with it:
https://github.com/atilaneves/mqtt

It is still a networking application though - I don't know of any pure offline usage.
Aug 04 2014
prev sibling parent reply "Sean Kelly" <sean invisibleduck.org> writes:
On Monday, 4 August 2014 at 21:19:14 UTC, Philippe Sigaud via
Digitalmars-d-learn wrote:
 Has anyone used (the fibers/tasks of) vibe.d for something other than
 powering websites?
https://github.com/D-Programming-Language/phobos/pull/1910
Aug 04 2014
parent Philippe Sigaud via Digitalmars-d-learn writes:
 https://github.com/D-Programming-Language/phobos/pull/1910
Very interesting discussion, thanks. I'm impressed by the amount of work you guys do on github.
Aug 04 2014
prev sibling next sibling parent reply Russel Winder via Digitalmars-d-learn <digitalmars-d-learn puremagic.com> writes:
Sorry, I missed this thread (!) till now.

On Mon, 2014-08-04 at 13:36 +0000, Dicebot via Digitalmars-d-learn
wrote:
 On Monday, 4 August 2014 at 05:14:22 UTC, Philippe Sigaud via
 Digitalmars-d-learn wrote:
 I have another question: it seems I can spawn hundreds of threads
 (heck, even 10_000 is accepted), even though I have 4-8 cores. Is there
 a limit to the number of threads? I tried a threadpool because in my
 application I feared having to spawn ~100-200 threads, but if that's
 not the case, I can drastically simplify my code. Is spawning a thread
 a slow operation in general?
Are these std.concurrency threads or std.parallelism tasks?

A std.parallelism task is not a thread. Like Erlang or the Java Fork/Join framework, the program specifies units of work and there is a thread pool underneath that works on the tasks as required. So you can have zillions of tasks but only a few actual threads working on them.
 Most likely those threads either do nothing or are short-lived, so you
 don't actually get 10 000 threads running simultaneously.
I suspect it is actually impossible to start this number of kernel threads on any current kernel.
 In general you should expect your operating system to start stalling
 at a few thousand concurrent threads competing for context switches
 and system resources. Creating a new thread is a rather costly
 operation, though you may not spot it in synthetic snippets, only
 under actual load.

 The modern default approach is to have a number of "worker" threads
 equal or close to the number of CPU cores, and to handle internal
 scheduling manually via fibers or some similar solution.
I have no current data, but it used to be that for a single system it was best to have one or two more threads than the number of cores. Processor architectures and caching have changed, so new data is required. I am sure someone somewhere has it though.
 If you are totally new to the topic of concurrent services, getting
 familiar with http://en.wikipedia.org/wiki/C10k_problem may be useful :)
I thought they'd moved on to the 100k problem.

There is an issue here that I/O-bound concurrency and CPU-bound concurrency/parallelism are very different beasties. Clearly tools and techniques can apply to either or both.
Aug 04 2014
parent reply "Dicebot" <public dicebot.lv> writes:
On Monday, 4 August 2014 at 16:38:24 UTC, Russel Winder via 
Digitalmars-d-learn wrote:
 The modern default approach is to have a number of "worker" threads
 equal or close to the number of CPU cores, and to handle internal
 scheduling manually via fibers or some similar solution.
 I have no current data, but it used to be that for a single system it
 was best to have one or two more threads than the number of cores.
 Processor architectures and caching have changed, so new data is
 required. I am sure someone somewhere has it though.
This is why I had the "or close" remark :) The exact number almost always depends on the exact deployment layout - i.e. what other processes are running in the system, how hardware interrupts are handled and so on. It is something to decide for each specific application. Sometimes it is even best to have the number of worker threads _less_ than the number of CPU cores, if affinity is to be used for some other background service, for example.
 If you are totally new to the topic of concurrent services, 
 getting familiar with 
 http://en.wikipedia.org/wiki/C10k_problem may be useful :)
 I thought they'd moved on to the 100k problem.
True, C10K is a solved problem, but it is the best thing to start with to understand why people even bother with all the concurrency complexity - all the details can be a bit overwhelming if one starts completely from scratch.
 There is an issue here that I/O-bound concurrency and CPU-bound
 concurrency/parallelism are very different beasties. Clearly tools and
 techniques can apply to either or both.
Actually, with the CSP / actor model one can simply consider a long-running CPU computation as a form of I/O and apply the same asynchronous design techniques. For example, have a separate dedicated thread running the computation and send input there via message passing - the response message will act similarly to an I/O notification from the OS.

Choosing the optimal concurrency architecture for an application is probably an even harder problem than naming identifiers.
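A minimal sketch of that pattern with std.concurrency (my example; the message layout is made up for illustration):

import std.concurrency : ownerTid, receive, receiveOnly, send, spawn, Tid;
import std.stdio : writeln;

// Dedicated worker: wait for an input, compute, reply to the owner.
void worker()
{
    receive((int goal) {
        long sum = 0;
        foreach (i; 0 .. goal)
            sum += i;
        ownerTid.send(sum);
    });
}

void main()
{
    Tid tid = spawn(&worker);
    tid.send(1_000);                // kick off the "long" computation
    auto result = receiveOnly!long; // arrives like an I/O completion event
    writeln(result);                // 499500
}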
Aug 04 2014
parent reply Russel Winder via Digitalmars-d-learn <digitalmars-d-learn puremagic.com> writes:
On Mon, 2014-08-04 at 16:57 +0000, Dicebot via Digitalmars-d-learn
wrote:
[…]
 This is why I had the "or close" remark :) The exact number almost
 always depends on the exact deployment layout - i.e. what other
 processes are running in the system, how hardware interrupts are
 handled and so on. It is something to decide for each specific
 application. Sometimes it is even best to have the number of worker
 threads _less_ than the number of CPU cores, if affinity is to be used
 for some other background service, for example.
David chose to have the thread pool default to (number-of-cores - 1) threads, if I remember correctly. I am not sure he manipulated affinity. This ought to be on the list of things for a review of std.parallelism.

[…]
 Actually, with the CSP / actor model one can simply consider a
 long-running CPU computation as a form of I/O and apply the same
 asynchronous design techniques. For example, have a separate dedicated
 thread running the computation and send input there via message
 passing - the response message will act similarly to an I/O
 notification from the OS.
Now you are on my territory :-) I have been banging on about message-passing parallelism architectures for >25 years, but sadly shared-memory multi-threading became the standard model for some totally bizarre reason. Probably everyone was taught they had to use all the wonderful OS implementation concurrency techniques in all their application code.

CSP is great, cf. Go, Python-CSP, GPars; actors are great, cf. Erlang, Akka, GPars; but do not forget dataflow, cf. GPars, Actian DataRush. There have been a number of PhDs trying to provide tools for deciding which parallelism architecture is best suited to a given problem. Sadly most of them have been ignored by the programming language community at large.
 Choosing the optimal concurrency architecture for an application is
 probably an even harder problem than naming identifiers.
'Fraid not, it's actually a lot easier.
Aug 04 2014
parent reply "Dicebot" <public dicebot.lv> writes:
On Monday, 4 August 2014 at 18:22:47 UTC, Russel Winder via 
Digitalmars-d-learn wrote:
 Actually, with the CSP / actor model one can simply consider a
 long-running CPU computation as a form of I/O and apply the same
 asynchronous design techniques. For example, have a separate dedicated
 thread running the computation and send input there via message
 passing - the response message will act similarly to an I/O
 notification from the OS.
Now you are on my territory :-) I have been banging on about message-passing parallelism architectures for >25 years, but sadly shared-memory multi-threading became the standard model for some totally bizarre reason. Probably everyone was taught they had to use all the wonderful OS implementation concurrency techniques in all their application code.
Well, it is a territory not completely alien to me either ;) I am less aware of academic research on the topic, though; I just happen to work in an industry where it matters.

I think the initial spread of the multi-threading approach happened because it was so temptingly easy - no need to worry about actually modelling the concurrent execution flow, blocking I/O or scheduling; just write the code as usual and the OS will take care of it. But there is no place for magic in the programming world, and it has fallen hard once network services started to scale.

Right now is the glorious moment when engineers are finally starting to appreciate how previous academic research can help them solve practical issues, and all this good stuff goes mainstream :)
 There have been a number of PhDs trying to provide tools for deciding
 which parallelism architecture is best suited to a given problem.
 Sadly most of them have been ignored by the programming language
 community at large.
I doubt the programming / engineering community will ever accept research stating that choosing an architecture can be done on a purely theoretical basis :) It simply contradicts too much daily experience, which says that every concurrent application has some unique traits to consider and only profiling can rule them all.
Aug 04 2014
parent Russel Winder via Digitalmars-d-learn <digitalmars-d-learn puremagic.com> writes:
On Mon, 2014-08-04 at 18:34 +0000, Dicebot via Digitalmars-d-learn
wrote:
[…]
 Well, it is a territory not completely alien to me either ;) I am less
 aware of academic research on the topic, though; I just happen to work
 in an industry where it matters.
I have been out of academia now for 14 years, but tracking the various lists and blogs, not to mention SuperComputing conferences, there is very little new stuff; the last 10 years have been about improving. The one new thing, though, is GPGPU, which started out as an interesting side show but has now come front and centre for data parallelism.
 I think the initial spread of the multi-threading approach happened
 because it was so temptingly easy - no need to worry about actually
 modelling the concurrent execution flow, blocking I/O or scheduling;
 just write the code as usual and the OS will take care of it. But
 there is no place for magic in the programming world, and it has
 fallen hard once network services started to scale.
Threads are infrastructure, just like stack and heap; very, very, very few people actually worry about and manage these resources explicitly, most just leave the run-time system to handle them. OK, so the usual GC argument can be plopped in here; let's not bother though, as we've been through it three times this quarter :-)
 Right now is the glorious moment when engineers are finally starting
 to appreciate how previous academic research can help them solve
 practical issues, and all this good stuff goes mainstream :)
Actors are mid-1960s, dataflow early 1970s, CSP mid-1970s; it has taken the explicit shared-memory multithreading in applications fiasco a long time to pass. I can think of some applications which are effectively operating systems and so need all the best shared-memory multithreading techniques (I was involved in one 1999–2004), but most applications people should be using actors, dataflow, CSP or data parallelism as their application model, supported by library frameworks/infrastructure.

[…]
 I doubt the programming / engineering community will ever accept
 research stating that choosing an architecture can be done on a purely
 theoretical basis :) It simply contradicts too much daily experience,
 which says that every concurrent application has some unique traits to
 consider and only profiling can rule them all.
Most solutions to problems or subproblems can be slotted into one of actors, dataflow, pipeline, MVC, data parallelism or event loop for the main picture. If tweaking is needed, profiling and small localized tinkerings can do the trick. I have yet to find many cases in my (computation-oriented) world where that is needed. Maybe in an I/O world there are different constraints.
Aug 05 2014
prev sibling parent Philippe Sigaud via Digitalmars-d-learn writes:
On Mon, Aug 4, 2014 at 6:38 PM, Russel Winder via Digitalmars-d-learn
<digitalmars-d-learn puremagic.com> wrote:

 Are these std.concurrency threads or std.parallelism tasks?

 A std.parallelism task is not a thread. Like Erlang or the Java
 Fork/Join framework, the program specifies units of work and there is
 a thread pool underneath that works on the tasks as required. So you
 can have zillions of tasks but only a few actual threads working on
 them.
That's it: many tasks, a few worker threads. That's what I'm converging to. They are not particularly 'concurrent', but they can depend on one another.

My only gripe with std.parallelism is that I cannot work out whether it is worth using the module if tasks can create other tasks and depend on them in a deeply interconnected graph. I mean, if I have to write lots of scaffolding just to manage dependencies between tasks, I might as well build it on core.thread and message passing directly. I'm becoming quite enamored of message passing, maybe because it's a new shiny toy for me :)

That's for parsing, btw. I'm trying to write an n-core engine for my Pegged parser generator project.
 Most likely those threads either do nothing or are short-lived, so you
 don't actually get 10 000 threads running simultaneously.
 I suspect it is actually impossible to start this number of kernel
 threads on any current kernel.
So, what happens when I do

void doWork() { ... }

Tid[] children;
foreach(_; 0 .. 10_000)
    children ~= spawn(&doWork);

? I mean, it compiles and runs happily. In my current tests, I end the application by sending all threads a CloseDown message and waiting for an answer from each of them. That takes about 1s on my machine.
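For reference, a minimal reconstruction of that shutdown protocol (a sketch only; CloseDown and the Closed acknowledgement are hypothetical stand-ins for the original code's message types):

import std.concurrency : ownerTid, receive, receiveOnly, send, spawn, Tid;

struct CloseDown {}
struct Closed {}

void doWork()
{
    // ... real work would go here ...
    receive((CloseDown _) { ownerTid.send(Closed()); });
}

void main()
{
    Tid[] children;
    foreach (_; 0 .. 10_000)
        children ~= spawn(&doWork);

    foreach (child; children)   // ask every thread to shut down...
        child.send(CloseDown());
    foreach (child; children)   // ...then wait for each acknowledgement
        receiveOnly!Closed;
}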
 I have no current data, but it used to be that for a single system it
 was best to have one or two more threads than the number of cores.
 Processor architectures and caching have changed, so new data is
 required. I am sure someone somewhere has it though.
I can add that, depending on the tasks I'm using, it's sometimes better to use 4, 6, 8 or 10 threads, repeatedly for a given task. I'm using a Core i7; Linux sees it as an 8-core machine. So, well, I'll try and see.
Aug 04 2014