
digitalmars.D.learn - Threadpools, difference between DMD and LDC

reply "Philippe Sigaud" <philippe.sigaud gmail.com> writes:
I'm trying to grok message passing. That's my very first foray
into this, so I'm probably making every mistake in the book :-)

I wrote a small threadpool test, it's there:

http://dpaste.dzfl.pl/3d3a65a00425

I'm playing with the number of threads and the number of tasks,
and getting a feel about how message passing works. I must say I
quite like it: it's a bit like suddenly being able to safely
return different types from a function.

What I don't get is the difference between DMD (I'm using 2.065)
and LDC (0.14-alpha1).

For DMD, I compile with -O -inline -noboundscheck
For LDC, I use -O3 -inline

LDC gives me smaller executables than DMD (also, 3 to 5 times
smaller than 0.13, good job!) but above all else incredibly,
astoundingly faster. I'm used to LDC producing 20-30% faster
programs, but here it's 1000 times faster!

8 threads, 1000 tasks: DMD:  4000 ms, LDC: 3 ms (!)

So my current hypothesis is a) I'm doing something wrong or b)
the tasks are optimized away or something.

Can someone confirm the results and tell me what I'm doing wrong?
Aug 03 2014
parent reply "safety0ff" <safety0ff.dev gmail.com> writes:
On Sunday, 3 August 2014 at 19:52:42 UTC, Philippe Sigaud wrote:
 Can someone confirm the results and tell me what I'm doing 
 wrong?
LDC is likely optimizing the summation:

int sum = 0;
foreach(i; 0..task.goal)
    sum += i;

to something like:

int sum = cast(int)(cast(ulong)(task.goal-1)*task.goal/2);
Aug 03 2014
parent reply "David Nadlinger" <code klickverbot.at> writes:
On Sunday, 3 August 2014 at 22:24:22 UTC, safety0ff wrote:
 On Sunday, 3 August 2014 at 19:52:42 UTC, Philippe Sigaud wrote:
 Can someone confirm the results and tell me what I'm doing 
 wrong?
LDC is likely optimizing the summation:

int sum = 0;
foreach(i; 0..task.goal)
    sum += i;

to something like:

int sum = cast(int)(cast(ulong)(task.goal-1)*task.goal/2);
This is correct – the LLVM optimizer indeed gets rid of the loop completely. Although I'd be more than happy to be able to claim a thousandfold speedup over DMD on real-world applications. ;)

Cheers,
David
Aug 03 2014
parent reply Philippe Sigaud via Digitalmars-d-learn writes:
 This is correct – the LLVM optimizer indeed gets rid of the loop completely.
OK, that's clever. But I get this even when I put a writeln("some msg") inside the task. I thought a write couldn't be optimized away like that, and that it's a slow operation?

Anyway, I discovered Thread.sleep() in core.thread in the meantime, I'll use that. I just wanted to have tasks taking a different amount of time each time.

I have another question: it seems I can spawn hundreds of threads (heck, even 10_000 is accepted), even though I have 4-8 cores. Is there a limit to the number of threads? I tried a threadpool because in my application I feared having to spawn ~100-200 threads, but if that's not the case, I can drastically simplify my code. Is spawning a thread a slow operation in general?
Aug 03 2014
next sibling parent reply "Kapps" <opantm2+spam gmail.com> writes:
On Monday, 4 August 2014 at 05:14:22 UTC, Philippe Sigaud via 
Digitalmars-d-learn wrote:
 I have another question: it seems I can spawn hundreds of threads
 (heck, even 10_000 is accepted), even though I have 4-8 cores. Is there
 a limit to the number of threads? I tried a threadpool because in my
 application I feared having to spawn ~100-200 threads, but if that's
 not the case, I can drastically simplify my code. Is spawning a thread
 a slow operation in general?
Without going into much detail: threads are heavy, and creating a thread is an expensive operation (which is partially why virtually every standard library includes a thread pool). Along with the overhead of creating the thread, you also get the overhead of additional context switches for each thread you have actively running. Context switches are expensive and a significant waste of time: your CPU sits there doing effectively nothing while the OS schedules which thread gets to go next and restores its context to run again. With 10,000 threads, even if you don't run into limits on how many threads you can have, this overhead becomes very significant.

I haven't looked at your code in detail, but consider using TaskPool if you just want to schedule some tasks to run amongst a few threads, or potentially using fibers (which are fairly light-weight) instead of threads.
Aug 03 2014
parent reply Philippe Sigaud via Digitalmars-d-learn writes:
 Without going into much detail: Threads are heavy, and creating a thread is
 an expensive operation (which is partially why virtually every standard
 library includes a ThreadPool).
 I haven't looked at your code in detail, but consider using TaskPool if
 you just want to schedule some tasks to run amongst a few threads, or
 potentially using fibers (which are fairly light-weight) instead of
 threads.
OK, I get it. Just to be sure: there is no ThreadPool in Phobos or in core, right? IIRC, there are fibers somewhere in core, I'll have a look. I also heard that vibe.d has them.
Aug 04 2014
next sibling parent reply "Chris Cain" <zshazz gmail.com> writes:
On Monday, 4 August 2014 at 12:05:31 UTC, Philippe Sigaud via 
Digitalmars-d-learn wrote:
 OK, I get it. Just to be sure, there is no ThreadPool in Phobos 
 or in
 core, right?
 IIRC, there are fibers somewhere in core, I'll have a look. I also
 heard that vibe.d has them.
There is. It's called taskPool, though:
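A minimal sketch of using it, assuming only the documented std.parallelism API (sumTo is a made-up stand-in for the benchmark's unit of work):

import std.parallelism : task, taskPool;
import std.stdio : writeln;

// Stand-in for the benchmark's unit of work.
int sumTo(int goal)
{
    int sum = 0;
    foreach (i; 0 .. goal)
        sum += i;
    return sum;
}

void main()
{
    // Create a task, hand it to the default pool, then block on the result.
    auto t = task!sumTo(1_000);
    taskPool.put(t);
    writeln(t.yieldForce); // 499500
}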
Aug 04 2014
parent Philippe Sigaud via Digitalmars-d-learn writes:
On Mon, Aug 4, 2014 at 2:13 PM, Chris Cain via Digitalmars-d-learn
<digitalmars-d-learn puremagic.com> wrote:

 OK, I get it. Just to be sure, there is no ThreadPool in Phobos or in
 core, right?
 There is. It's called taskPool, though:


Ah, std.parallelism. I stoopidly searched in std.concurrency and core.* Thanks!
Aug 04 2014
prev sibling parent "Dicebot" <public dicebot.lv> writes:
On Monday, 4 August 2014 at 12:05:31 UTC, Philippe Sigaud via 
Digitalmars-d-learn wrote:
 IIRC, there are fibers somewhere in core, I'll have a look. I also
 heard that vibe.d has them.
vibe.d adds some abstractions of its own on top, for example the "Task" concept and the notion of Isolated types for message passing, but the basics are from Phobos.
Aug 04 2014
prev sibling next sibling parent "David Nadlinger" <code klickverbot.at> writes:
On Monday, 4 August 2014 at 05:14:22 UTC, Philippe Sigaud via 
Digitalmars-d-learn wrote:
 This is correct – the LLVM optimizer indeed gets rid of the 
 loop completely.
OK, that's clever. But I get this even when I put a writeln("some msg") inside the task. I thought a write couldn't be optimized away like that, and that it's a slow operation?
You need the _result_ of the computation for the writeln. LLVM's optimizer recognizes what the loop tries to compute, though, and replaces it with an equivalent expression for the sum of the series, as Trass3r alluded to. Cheers, David
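A minimal illustration of the point (mine, not from the thread): only when the loop's result is actually consumed must the compiler produce the value at all, and even then LDC may fold the loop into the closed form n*(n-1)/2.

import std.stdio : writeln;

void main()
{
    enum n = 1_000;
    int sum = 0;
    foreach (i; 0 .. n)
        sum += i;
    // Printing only a constant string leaves `sum` dead, so the loop can
    // be deleted outright. Printing `sum` forces the value to exist, but
    // the optimizer may still compute it as n*(n-1)/2 without looping.
    writeln("some msg"); // does not keep the loop alive
    writeln(sum);        // keeps the *value* alive, not the loop
}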
Aug 04 2014
prev sibling parent reply "Dicebot" <public dicebot.lv> writes:
On Monday, 4 August 2014 at 05:14:22 UTC, Philippe Sigaud via 
Digitalmars-d-learn wrote:
 I have another question: it seems I can spawn hundreds of threads
 (heck, even 10_000 is accepted), even though I have 4-8 cores. Is there
 a limit to the number of threads? I tried a threadpool because in my
 application I feared having to spawn ~100-200 threads, but if that's
 not the case, I can drastically simplify my code. Is spawning a thread
 a slow operation in general?
Most likely those threads either do nothing or are short-lived, so you don't actually get 10 000 threads running simultaneously. In general you should expect your operating system to start stalling at a few thousand concurrent threads competing for context switches and system resources. Creating a new thread is a rather costly operation, though you may not spot it in synthetic snippets, only under actual load.

The modern default approach is to have a number of "worker" threads equal or close to the number of CPU cores, and to handle internal scheduling manually via fibers or some similar solution.

If you are totally new to the topic of concurrent services, getting familiar with http://en.wikipedia.org/wiki/C10k_problem may be useful :)
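As a minimal sketch of that setup (assuming only std.parallelism's documented TaskPool and totalCPUs; this is not code from the thread):

import std.parallelism : TaskPool, totalCPUs;

void main()
{
    // The global taskPool defaults to totalCPUs - 1 workers, the main
    // thread being the implicit extra one; an explicit pool lets you
    // pick the worker count yourself.
    auto pool = new TaskPool(totalCPUs);
    scope (exit) pool.finish(); // tell workers to stop once the queue drains
}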
Aug 04 2014
next sibling parent reply Philippe Sigaud via Digitalmars-d-learn writes:
On Mon, Aug 4, 2014 at 3:36 PM, Dicebot via Digitalmars-d-learn
<digitalmars-d-learn puremagic.com> wrote:

 Most likely those threads either do nothing or are short-lived, so you
 don't actually get 10 000 threads running simultaneously. In general
 you should expect your operating system to start stalling at a few
 thousand concurrent threads competing for context switches and system
 resources. Creating a new thread is a rather costly operation, though
 you may not spot it in synthetic snippets, only under actual load.

 The modern default approach is to have a number of "worker" threads
 equal or close to the number of CPU cores, and to handle internal
 scheduling manually via fibers or some similar solution.
That's what I guessed. It's just that I have tasks that will generate other (linked) tasks, in a DAG. I can use a thread pool of 2-8 threads, but that means storing tasks and their relationships (which is waiting on which, etc.). I rather liked the idea of spawning new threads when I needed them ;)
 If you are totally new to the topic of concurrent services, getting familiar
 with http://en.wikipedia.org/wiki/C10k_problem may be useful :)
I'll have a look. I'm quite new; my only knowledge comes from reading the concurrency threads here, std.concurrency, std.parallelism and TDPL :)
Aug 04 2014
next sibling parent "Marc =?UTF-8?B?U2Now7x0eiI=?= <schuetzm gmx.net> writes:
On Monday, 4 August 2014 at 14:56:36 UTC, Philippe Sigaud via 
Digitalmars-d-learn wrote:
 On Mon, Aug 4, 2014 at 3:36 PM, Dicebot via Digitalmars-d-learn
 <digitalmars-d-learn puremagic.com> wrote:
 The modern default approach is to have a number of "worker" threads
 equal or close to the number of CPU cores, and to handle internal
 scheduling manually via fibers or some similar solution.
That's what I guessed. It's just that I have tasks that will generate other (linked) tasks, in a DAG. I can use a thread pool of 2-8 threads, but that means storing tasks and their relationships (which is waiting on which, etc.). I rather liked the idea of spawning new threads when I needed them ;)
If you can live with the fact that your tasks might not be truly parallel (i.e. don't use busy waiting or other things that assume that other tasks make progress while a specific task is running), and you only use them for computing (no synchronous I/O), you can still use the fibers in core.thread:
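A minimal sketch of driving such a fiber, using only the documented core.thread API: a fiber is resumed explicitly with call() and suspends itself with yield().

import core.thread : Fiber;
import std.stdio : writeln;

void main()
{
    // The delegate runs on the fiber's own stack until it yields.
    auto fib = new Fiber({
        writeln("step 1");
        Fiber.yield();       // suspend, returning control to the caller
        writeln("step 2");
    });

    fib.call(); // prints "step 1", then suspends at yield()
    fib.call(); // resumes after yield(), prints "step 2", fiber ends
}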
Aug 04 2014
prev sibling parent reply "Dicebot" <public dicebot.lv> writes:
On Monday, 4 August 2014 at 14:56:36 UTC, Philippe Sigaud via 
Digitalmars-d-learn wrote:
 On Mon, Aug 4, 2014 at 3:36 PM, Dicebot via Digitalmars-d-learn
 <digitalmars-d-learn puremagic.com> wrote:

 Most likely those threads either do nothing or are short-lived, so you
 don't actually get 10 000 threads running simultaneously. In general
 you should expect your operating system to start stalling at a few
 thousand concurrent threads competing for context switches and system
 resources. Creating a new thread is a rather costly operation, though
 you may not spot it in synthetic snippets, only under actual load.

 The modern default approach is to have a number of "worker" threads
 equal or close to the number of CPU cores, and to handle internal
 scheduling manually via fibers or some similar solution.
That's what I guessed. It's just that I have tasks that will generate other (linked) tasks, in a DAG. I can use a thread pool of 2-8 threads, but that means storing tasks and their relationships (which is waiting on which, etc.). I rather liked the idea of spawning new threads when I needed them ;)
vibe.d additions may help here:

http://vibed.org/api/vibe.core.core/runTask
http://vibed.org/api/vibe.core.core/runWorkerTask
http://vibed.org/api/vibe.core.core/workerThreadCount

The "task" abstraction allows exactly that - spawning a new execution context and having it scheduled automatically via an underlying fiber/thread pool. However, I am not aware of any good tutorials about using those, so jump in at your own risk.
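A rough, untested sketch based only on the vibe.core.core documentation linked above (details may differ between vibe.d versions):

import vibe.core.core : exitEventLoop, runEventLoop, runTask;
import std.stdio : writeln;

void main()
{
    // runTask starts a fiber-backed task scheduled on vibe.d's event
    // loop instead of on a dedicated kernel thread.
    runTask({
        writeln("task running");
        exitEventLoop(); // stop the loop so main() can return
    });
    runEventLoop();
}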
 If you are totally new to the topic of concurrent services, getting
 familiar with http://en.wikipedia.org/wiki/C10k_problem may be useful :)
I'll have a look. I'm quite new; my only knowledge comes from reading the concurrency threads here, std.concurrency, std.parallelism and TDPL :)
Have fun :P It is a rapidly changing topic though; best practices may be out of date by the time you have read them :)
Aug 04 2014
parent reply Philippe Sigaud via Digitalmars-d-learn writes:
On Mon, Aug 4, 2014 at 6:21 PM, Dicebot via Digitalmars-d-learn
<digitalmars-d-learn puremagic.com> wrote:

 vibe.d additions may help here:

 http://vibed.org/api/vibe.core.core/runTask
 http://vibed.org/api/vibe.core.core/runWorkerTask
 http://vibed.org/api/vibe.core.core/workerThreadCount

 "task" abstraction allows exactly that - spawning new execution context and
 have it scheduled automatically via underlying fiber/thread pool. However, I
 am not aware of any good tutorials about using those so jump in at your own
 risk.
Has anyone used (the fibers/tasks of) vibe.d for something other than powering websites?
Aug 04 2014
next sibling parent "Dicebot" <public dicebot.lv> writes:
On Monday, 4 August 2014 at 21:19:14 UTC, Philippe Sigaud via 
Digitalmars-d-learn wrote:
 Has anyone used (the fibers/tasks of) vibe.d for something other than
 powering websites?
Atila has implemented an MQTT broker with it:
https://github.com/atilaneves/mqtt

It is still a networking application though - I don't know of any pure offline usage.
Aug 04 2014
prev sibling parent reply "Sean Kelly" <sean invisibleduck.org> writes:
On Monday, 4 August 2014 at 21:19:14 UTC, Philippe Sigaud via
Digitalmars-d-learn wrote:
 Has anyone used (the fibers/tasks of) vibe.d for something other than
 powering websites?
https://github.com/D-Programming-Language/phobos/pull/1910
Aug 04 2014
parent Philippe Sigaud via Digitalmars-d-learn writes:
 https://github.com/D-Programming-Language/phobos/pull/1910
Very interesting discussion, thanks. I'm impressed by the amount of work you guys do on github.
Aug 04 2014
prev sibling next sibling parent reply Russel Winder via Digitalmars-d-learn <digitalmars-d-learn puremagic.com> writes:
Sorry, I missed this thread (!) till now.

On Mon, 2014-08-04 at 13:36 +0000, Dicebot via Digitalmars-d-learn
wrote:
 On Monday, 4 August 2014 at 05:14:22 UTC, Philippe Sigaud via
 Digitalmars-d-learn wrote:
 I have another question: it seems I can spawn hundreds of threads
 (heck, even 10_000 is accepted), even though I have 4-8 cores. Is there
 a limit to the number of threads? I tried a threadpool because in my
 application I feared having to spawn ~100-200 threads, but if that's
 not the case, I can drastically simplify my code. Is spawning a thread
 a slow operation in general?
Are these std.concurrency threads or std.parallelism tasks?

A std.parallelism task is not a thread. Like Erlang or the Java Fork/Join framework, the program specifies units of work and there is a thread pool underneath that works on the tasks as required. So you can have zillions of tasks but only a few actual threads working on them.
 Most likely those threads either do nothing or are short-lived, so you
 don't actually get 10 000 threads running simultaneously.
I suspect it is actually impossible to start this number of kernel threads on any current kernel.
 In general you should expect your operating system to start stalling
 at a few thousand concurrent threads competing for context switches
 and system resources. Creating a new thread is a rather costly
 operation, though you may not spot it in synthetic snippets, only
 under actual load.

 The modern default approach is to have a number of "worker" threads
 equal or close to the number of CPU cores, and to handle internal
 scheduling manually via fibers or some similar solution.
I have no current data, but it used to be that for a single system it was best to have one or two more threads than the number of cores. Processor architectures and caching have changed, so new data is required. I am sure someone somewhere has it though.
 If you are totally new to the topic of concurrent services, getting
 familiar with http://en.wikipedia.org/wiki/C10k_problem may be useful :)
I thought they'd moved on to the 100k problem.

There is an issue here that I/O-bound concurrency and CPU-bound concurrency/parallelism are very different beasties. Clearly tools and techniques can apply to either or both.
Aug 04 2014
parent reply "Dicebot" <public dicebot.lv> writes:
On Monday, 4 August 2014 at 16:38:24 UTC, Russel Winder via 
Digitalmars-d-learn wrote:
 The modern default approach is to have a number of "worker" threads
 equal or close to the number of CPU cores, and to handle internal
 scheduling manually via fibers or some similar solution.
 I have no current data, but it used to be that for a single system it
 was best to have one or two more threads than the number of cores.
 Processor architectures and caching have changed, so new data is
 required. I am sure someone somewhere has it though.
This is why I had the "or close" remark :) The exact number almost always depends on the exact deployment layout - i.e. what other processes are running in the system, how hardware interrupts are handled and so on. It is something to decide for each specific application. Sometimes it is even best to have the number of worker threads _less_ than the number of CPU cores, if affinity is to be used for some other background service, for example.
 If you are totally new to the topic of concurrent services, 
 getting familiar with 
 http://en.wikipedia.org/wiki/C10k_problem may be useful :)
 I thought they'd moved on to the 100k problem.
True, C10K is a solved problem, but it is the best thing to start with to understand why people even bother with all the concurrency complexity - all the details can be a bit overwhelming if one starts completely from scratch.
 There is an issue here that I/O-bound concurrency and CPU-bound
 concurrency/parallelism are very different beasties. Clearly tools and
 techniques can apply to either or both.
Actually, with the CSP / actor model one can simply consider a long-running CPU computation as a form of I/O and apply the same asynchronous design techniques. For example, have a separate dedicated thread running the computation and send input there via message passing - the response message will act similarly to an I/O notification from the OS.

Choosing the optimal concurrency architecture for an application is probably an even harder problem than naming identifiers.
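A minimal sketch of that pattern with std.concurrency (my example; the message layout is made up for illustration):

import std.concurrency : ownerTid, receive, receiveOnly, send, spawn, Tid;
import std.stdio : writeln;

// Dedicated worker: wait for an input, compute, reply to the owner.
void worker()
{
    receive((int goal) {
        long sum = 0;
        foreach (i; 0 .. goal)
            sum += i;
        ownerTid.send(sum);
    });
}

void main()
{
    Tid tid = spawn(&worker);
    tid.send(1_000);                // kick off the "long" computation
    auto result = receiveOnly!long; // arrives like an I/O completion event
    writeln(result);                // 499500
}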
Aug 04 2014
parent reply Russel Winder via Digitalmars-d-learn <digitalmars-d-learn puremagic.com> writes:
On Mon, 2014-08-04 at 16:57 +0000, Dicebot via Digitalmars-d-learn
wrote:
[…]
 This is why I had the "or close" remark :) The exact number almost
 always depends on the exact deployment layout - i.e. what other
 processes are running in the system, how hardware interrupts are
 handled and so on. It is something to decide for each specific
 application. Sometimes it is even best to have the number of worker
 threads _less_ than the number of CPU cores, if affinity is to be used
 for some other background service, for example.
David chose to have the thread pool default to (number-of-cores - 1) threads, if I remember correctly. I am not sure he manipulated affinity. This ought to be on the list of things for a review of std.parallelism.

[…]
 Actually, with the CSP / actor model one can simply consider a
 long-running CPU computation as a form of I/O and apply the same
 asynchronous design techniques. For example, have a separate dedicated
 thread running the computation and send input there via message
 passing - the response message will act similarly to an I/O
 notification from the OS.
Now you are on my territory :-) I have been banging on about message-passing parallelism architectures for >25 years, but sadly shared-memory multi-threading became the standard model for some totally bizarre reason. Probably everyone was taught they had to use all the wonderful OS implementation concurrency techniques in all their application code.

CSP is great, cf. Go, Python-CSP, GPars; actors are great, cf. Erlang, Akka, GPars; but do not forget dataflow, cf. GPars, Actian DataRush. There have been a number of PhDs trying to provide tools for deciding which parallelism architecture is best suited to a given problem. Sadly most of them have been ignored by the programming language community at large.
 Choosing the optimal concurrency architecture for an application is
 probably an even harder problem than naming identifiers.
'Fraid not, it's actually a lot easier.
Aug 04 2014
parent reply "Dicebot" <public dicebot.lv> writes:
On Monday, 4 August 2014 at 18:22:47 UTC, Russel Winder via 
Digitalmars-d-learn wrote:
 Actually, with the CSP / actor model one can simply consider a
 long-running CPU computation as a form of I/O and apply the same
 asynchronous design techniques. For example, have a separate dedicated
 thread running the computation and send input there via message
 passing - the response message will act similarly to an I/O
 notification from the OS.
Now you are on my territory :-) I have been banging on about message-passing parallelism architectures for >25 years, but sadly shared-memory multi-threading became the standard model for some totally bizarre reason. Probably everyone was taught they had to use all the wonderful OS implementation concurrency techniques in all their application code.
Well, it is a territory not completely alien to me either ;) I am less aware of academic research on the topic, though; I just happen to work in an industry where it matters.

I think the initial spread of the multi-threading approach happened because it was so temptingly easy - no need to worry about actually modelling the concurrent execution flow, blocking I/O or scheduling; just write the code as usual and the OS will take care of it. But there is no place for magic in the programming world, and it has fallen hard once network services started to scale.

Right now is the glorious moment when engineers are finally starting to appreciate how previous academic research can help them solve practical issues, and all this good stuff goes mainstream :)
 There have been a number of PhDs trying to provide tools for deciding
 which parallelism architecture is best suited to a given problem.
 Sadly most of them have been ignored by the programming language
 community at large.
I doubt the programming / engineering community will ever accept research stating that choosing an architecture can be done on a purely theoretical basis :) It simply contradicts too much daily experience, which says that every concurrent application has some unique traits to consider and only profiling can rule them all.
Aug 04 2014
parent Russel Winder via Digitalmars-d-learn <digitalmars-d-learn puremagic.com> writes:
On Mon, 2014-08-04 at 18:34 +0000, Dicebot via Digitalmars-d-learn
wrote:
[…]
 Well, it is a territory not completely alien to me either ;) I am less
 aware of academic research on the topic, though; I just happen to work
 in an industry where it matters.
I have been out of academia now for 14 years, but tracking the various lists and blogs, not to mention SuperComputing conferences, there is very little new stuff; the last 10 years have been about improving. The one new thing, though, is GPGPU, which started out as an interesting side show but has now come front and centre for data parallelism.
 I think the initial spread of the multi-threading approach happened
 because it was so temptingly easy - no need to worry about actually
 modelling the concurrent execution flow, blocking I/O or scheduling;
 just write the code as usual and the OS will take care of it. But
 there is no place for magic in the programming world, and it has
 fallen hard once network services started to scale.
Threads are infrastructure, just like stack and heap; very, very, very few people actually worry about and manage these resources explicitly, most just leave the run-time system to handle them. OK, so the usual GC argument can be plopped in here; let's not bother though, as we've been through it three times this quarter :-)
 Right now is the glorious moment when engineers are finally starting
 to appreciate how previous academic research can help them solve
 practical issues, and all this good stuff goes mainstream :)
Actors are mid-1960s, dataflow early 1970s, CSP mid-1970s; it has taken the explicit shared-memory multithreading in applications fiasco a long time to pass. I can think of some applications which are effectively operating systems and so need all the best shared-memory multithreading techniques (I was involved in one 1999–2004), but most applications people should be using actors, dataflow, CSP or data parallelism as their application model, supported by library frameworks/infrastructure.

[…]
 I doubt the programming / engineering community will ever accept
 research stating that choosing an architecture can be done on a purely
 theoretical basis :) It simply contradicts too much daily experience,
 which says that every concurrent application has some unique traits to
 consider and only profiling can rule them all.
Most solutions to problems or subproblems can be slotted into one of actors, dataflow, pipeline, MVC, data parallelism or event loop for the main picture. If tweaking is needed, profiling and small localized tinkerings can do the trick. I have yet to find many cases in my (computation-oriented) world where that is needed. Maybe in an I/O world there are different constraints.
Aug 05 2014
prev sibling parent Philippe Sigaud via Digitalmars-d-learn writes:
On Mon, Aug 4, 2014 at 6:38 PM, Russel Winder via Digitalmars-d-learn
<digitalmars-d-learn puremagic.com> wrote:

 Are these std.concurrency threads or std.parallelism tasks?

 A std.parallelism task is not a thread. Like Erlang or the Java
 Fork/Join framework, the program specifies units of work and there is
 a thread pool underneath that works on the tasks as required. So you
 can have zillions of tasks but only a few actual threads working on
 them.
That's it: many tasks, a few worker threads. That's what I'm converging to. They are not particularly 'concurrent', but they can depend on one another.

My only gripe with std.parallelism is that I cannot work out whether it is worth using the module if tasks can create other tasks and depend on them in a deeply interconnected graph. I mean, if I have to write lots of scaffolding just to manage dependencies between tasks, I might as well build it on core.thread and message passing directly. I'm becoming quite enamored of message passing, maybe because it's a new shiny toy for me :)

That's for parsing, btw. I'm trying to write an n-core engine for my Pegged parser generator project.
 Most likely those threads either do nothing or are short-lived, so you
 don't actually get 10 000 threads running simultaneously.
 I suspect it is actually impossible to start this number of kernel
 threads on any current kernel.
So, what happens when I do

void doWork() { ... }

Tid[] children;
foreach(_; 0 .. 10_000)
    children ~= spawn(&doWork);

? I mean, it compiles and runs happily. In my current tests, I end the application by sending all threads a CloseDown message and waiting for an answer from each of them. That takes about 1s on my machine.
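For reference, a minimal reconstruction of that shutdown protocol (a sketch only; CloseDown and the Closed acknowledgement are hypothetical stand-ins for the original code's message types):

import std.concurrency : ownerTid, receive, receiveOnly, send, spawn, Tid;

struct CloseDown {}
struct Closed {}

void doWork()
{
    // ... real work would go here ...
    receive((CloseDown _) { ownerTid.send(Closed()); });
}

void main()
{
    Tid[] children;
    foreach (_; 0 .. 10_000)
        children ~= spawn(&doWork);

    foreach (child; children)   // ask every thread to shut down...
        child.send(CloseDown());
    foreach (child; children)   // ...then wait for each acknowledgement
        receiveOnly!Closed;
}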
 I have no current data, but it used to be that for a single system it
 was best to have one or two more threads than the number of cores.
 Processor architectures and caching have changed, so new data is
 required. I am sure someone somewhere has it though.
I can add that, depending on the tasks I'm using, it's sometimes better to use 4, 6, 8 or 10 threads, repeatedly for a given task. I'm using a Core i7; Linux sees it as an 8-core machine. So, well, I'll try and see.
Aug 04 2014