digitalmars.D - review of std.parallelism
- Andrei Alexandrescu (91/91) Mar 18 2011 0. Overview and vote
- dsimcha (43/62) Mar 18 2011 Thanks for the advice. You mentioned in the past that the documentation...
- Simen kjaeraas (5/15) Mar 19 2011 Please add an example showing that, too. Sure, the documentation says
- dsimcha (4/18) Mar 19 2011 I don't understand how this can be demonstrated in an example. It's an
- Simen kjaeraas (5/24) Mar 19 2011 Yeah, scratch that. I for some reason thought the "array of size
- Andrei Alexandrescu (16/63) Mar 19 2011 Got it.
- dsimcha (26/55) Mar 19 2011 I've already made it private. I can't see what purpose having it public...
- Andrei Alexandrescu (5/13) Mar 19 2011 A great way to handle bunches of almost-identical overloads is to group
- dsimcha (5/96) Mar 19 2011 Ok, thanks again for clarifying **how** the docs could be improved.
- Andrei Alexandrescu (129/133) Mar 19 2011 * When using "parallelism" as a common noun, prefix is with a '_' so
- Michel Fortin (26/34) Mar 19 2011 Actually, I think this is a bad description of what it subverts. What
- dsimcha (13/43) Mar 19 2011 If you want pedal-to-metal parallelism without insane levels of
- Michel Fortin (36/66) Mar 19 2011 I'm not sure where my proposal asks for more verbosity or less
- Michel Fortin (12/31) Mar 19 2011 ... or maybe I just totally misunderstood asyncBuf. Rereading the
- dsimcha (3/32) Mar 19 2011 Right. And this is pretty obviously a race. The other example (without...
- dsimcha (44/77) Mar 20 2011 Ok, I've had some time to think about this. The following example is
- Michel Fortin (36/127) Mar 20 2011 I don't see a problem with the above. The array elements you modify are
- dsimcha (46/105) Mar 20 2011 Ok, I completely agree in principle, though I question whether it's
- Michel Fortin (46/149) Mar 21 2011 Well, it'll work irrespective of whether shared delegates are used or
- dsimcha (36/113) Mar 21 2011 On second thought, no, but for practical, not theoretical reasons: One,...
- Michel Fortin (41/61) Mar 21 2011 Indeed a problem. Either we fix the compiler to support that, or we
- dsimcha (24/68) Mar 21 2011 This problem is **much** less severe than you are suggesting. x86 can a...
- bearophile (4/7) Mar 19 2011 Please, don't give up. It's a lot of time from the first time I've asked...
- Robert Jacques (4/34) Mar 20 2011 Technically, you'd only need a shared delegate or a const delegate to
- dsimcha (93/172) Mar 19 2011 I don't see what I could put here that isn't totally redundant with the
- Michel Fortin (7/26) Mar 19 2011 I think an improvement over undefined behaviour would be to throw an exc...
- dsimcha (12/35) Mar 19 2011 The only problem is that there's no easy, well-documented way to tell
- dsimcha (6/12) Mar 19 2011 ...Though now that you make me think of it I need to support exception
- Michel Fortin (36/75) Mar 19 2011 It's not that silly.
- dsimcha (14/40) Mar 19 2011 Using continue is well defined behavior and does exactly what I think
- Daniel Gibson (7/10) Mar 19 2011 What about some kind of parallel search? You just want to know if someth...
- dsimcha (7/17) Mar 19 2011 It's an interesting suggestion, and I'll give some thought to it, but on...
- Andrei Alexandrescu (120/281) Mar 19 2011 I'm thinking along the lines of:
- dsimcha (30/87) Mar 19 2011 Fair enough. Now that I think of it most of my frustration is that
- Andrei Alexandrescu (11/37) Mar 19 2011 Exactly. I understand. Well here's a suggestion. How about we "git
- dsimcha (2/27) Mar 19 2011 Sounds like a plan.
- Lars T. Kyllingstad (3/31) Mar 20 2011 Sounds good. I'll make the announcement.
- dsimcha (9/15) Mar 19 2011 In addition to the deadline issues already mentioned and resolved, I
- Jonathan M Davis (5/21) Mar 19 2011 Both std.datetime and std.unittests underwent a fair number of changes o...
- Jonathan M Davis (5/21) Mar 19 2011 Both std.datetime and std.unittests underwent a fair number of changes o...
- dsimcha (3/24) Mar 19 2011 Please check your newsreader settings. You've been double-posting a lot...
- dsimcha (96/215) Mar 19 2011 This is definitely feasible in principle. I'd like to implement it, but...
- dsimcha (10/30) Mar 20 2011 I tried putting a few benchmarks in for the performance of parallel
- dsimcha (4/11) Mar 19 2011 I did this for consistency with all the other functions, where work unit...
- Jonathan M Davis (3/12) Mar 19 2011 Oooh. That's a neat trick. I didn't know that you could do that.
0. Overview and vote

I think the library delivers the high-level goods (parallel foreach, map, reduce) but is a bit fuzzy in the lower level details. Documentation hurts understanding of the capabilities of the library and essentially is of inadequate quality. Entity documentation and examples do little more than give the impression of dispensing with an unpleasant chore.

My vote in favor of acceptance is contingent upon a radical improvement in the quality of documentation and examples. Most if not all artifacts should be motivated by simple, compelling examples. The introductory section must contain a brief and attractive synopsis of the flagship capabilities. All terms must be defined before being used and introduced in a carefully-chosen order. The relationship between various entities should be clarified.

I've seen the argument that simple and strong examples are difficult to find. Though I agree such examples are not easy to come up with, I also believe the author of the library is in the best position to produce them.

================

1. Library proper:

* "In the case of non-random access ranges, parallel foreach is still usable but buffers lazily to an array..." Wouldn't strided processing help? If e.g. there are 4 threads, the first works on 0, 4, 8, ..., the second works on 1, 5, 9, ..., and so on.

* Example with squares would be better if double replaced uint, and if a more complicated operation (sqrt, log...) were involved.

* I'm unclear on the tactics used by lazyMap. I'm thinking the obvious method should be better: just use one circular buffer. The presence of two dependent parameters makes this abstraction difficult to operate with.

* Same question about asyncBuf. What is wrong with a circular buffer filled on one side by threads and consumed on the other by the client? I can think of a couple of answers but it would be great if they were part of the documentation.

* Why not make workerIndex a ulong and be done with it?

* Is stop() really trusted or just unsafe? If it's forcibly killing threads then it's unsafe.

* uint size() should be size_t for conformity.

* defaultPoolThreads - should it be a property?

* I don't understand what Task is supposed to do. It is circularly defined: "A struct that encapsulates the information about a task, including its current status, what pool it was submitted to, and its arguments." OK, but what's a task? Could a task be used outside a task pool, and if so to what effect(s)?

* If LazyMap is only necessary as the result of lazyMap, could that become a local type inside lazyMap?

2. Documentation:

* Documentation unacceptable. It lacks precision and uses terms before defining them. For example: "This class encapsulates a task queue and a set of worker threads" comes before the notions of "task queue" and "worker thread" are defined. Clearly there is an intuition of what those are supposed to mean, but in practice each library lays down some fairly detailed definition of such.

* "Note: Initializing a pool with zero threads (as would happen in the case of a single-core CPU) is well-tested and does work." The absence of a bug should not be advertised in so many words. Simpler: "Note: Single-CPU machines will operate transparently with zero-sized pools."

* "Allows for custom pool size." I have no idea what pool size means.

* "// Do something interesting." Worst example ever. You are the creator of the library. If _you_ can't find a brief compelling example, who could?

* "Immediately after the range argument, an optional block size argument may be provided. If none is provided, the default block size is used. An optional buffer for returining the results may be provided as the last argument. If one is not provided, one will be automatically allocated. If one is provided, it must be the same length as the range." An example should be inserted after each of these sentences.

* "// Do something expensive with line." Better you do something expensive with line.

* "This is not the same thing as commutativity." I think this is obvious enough to be left out. The example of matrices (associative but not commutative) is nice though.

* "immutable n = 1000000000;" -> use underscores

* Would be great if the documentation examples included some rough experimental results, e.g. "3.8 times faster on a 4-core machine than the equivalent serial loop".

* No example for workerIndex and why it's useful.

* Is LocalWorker better than WorkerLocal? No, because it's not the worker that's local, it's the storage - which is missing from the name! WorkerLocalStorage? Also replace "create" with "make" or drop it entirely. The example doesn't tell me how I can use bufs. I suspect workerIndex has something to do with it but the example fails to divulge that relationship.

* No example for join(), which is quite a central primitive.

* No example for put(), which would be interesting to see.

* No example for task().

* No example for run().

* The example for Task uses myFuture without defining it. In case Task defines the promise/future idiom, the documentation misses a terrific opportunity of clarifying that.

* What is 'run' in the definition of safe task()?

* Why is AsyncBuf's doc far away from the function returning it? Same about WorkerLocal.

Andrei
Mar 18 2011
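The strided-versus-buffered question raised in the review above comes down to how indices are partitioned among workers. A small index-math sketch (Python here purely as illustration, not std.parallelism code) contrasts round-robin assignment, which Andrei describes, with contiguous work units handed out in order:

```python
def strided(n, nworkers):
    """Round-robin assignment: worker k handles indices k, k + nworkers, ..."""
    return [list(range(k, n, nworkers)) for k in range(nworkers)]

def work_units(n, unit_size):
    """Contiguous work units handed out in order; with unit_size == 1 this
    degenerates into the same round-robin pattern strided() produces once
    workers pull units as they become free."""
    return [list(range(i, min(i + unit_size, n))) for i in range(0, n, unit_size)]

# Four workers over ten elements, as in the review's example:
assert strided(10, 4)[0] == [0, 4, 8]
assert strided(10, 4)[1] == [1, 5, 9]
# Larger work units buffer more contiguous elements per hand-off:
assert work_units(10, 4) == [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

Larger units mean fewer synchronized hand-offs between the queue and the workers, at the cost of coarser load balancing.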
Thanks for the advice. You mentioned in the past that the documentation was inadequate but didn't give enough specifics as to how until now. As the author of the library, things seem obvious to me that don't seem obvious to anyone else, so I don't feel that I'm in a good position to judge the quality of the documentation and where it needs improvement. I plan to fix most of the issues you raised, but I've left comments for the few that I can't/won't fix or believe are based on misunderstandings below.

On 3/18/2011 11:29 PM, Andrei Alexandrescu wrote:
> 1. Library proper:
>
> * "In the case of non-random access ranges, parallel foreach is still usable but buffers lazily to an array..." Wouldn't strided processing help? If e.g. there are 4 threads, the first works on 0, 4, 8, ..., the second works on 1, 5, 9, ..., and so on.

You can have this if you want, by setting the work unit size to 1. Setting it to a larger size just causes more elements to be buffered, which may be more efficient in some cases.

> * I'm unclear on the tactics used by lazyMap. I'm thinking the obvious method should be better: just use one circular buffer. The presence of two dependent parameters makes this abstraction difficult to operate with.
>
> * Same question about asyncBuf. What is wrong with a circular buffer filled on one side by threads and consumed on the other by the client? I can think of a couple of answers but it would be great if they were part of the documentation.

Are you really suggesting I give detailed rationales for implementation decisions in the documentation? Anyhow, the two reasons for this choice are to avoid needing synchronization/atomic ops/etc. on every write to the buffer (which we would need since it can be read and written concurrently and we need to track whether we have space to write to) and because parallel map works best when it operates on relatively large buffers, resulting in minimal synchronization overhead per element. (Under the hood, the buffer is filled and then eager parallel map is called.)

> * Why not make workerIndex a ulong and be done with it?

I doubt anyone's really going to create anywhere near 4 billion TaskPool threads over the lifetime of a program. Part of the point of TaskPool is recycling threads rather than paying the overhead of creating and destroying them. Using a ulong on a 32-bit architecture would make worker-local storage substantially slower. workerIndex is how worker-local storage works under the hood, so it needs to be fast.

> * No example for workerIndex and why it's useful.

It should just be private. The fact that it's public is an artifact of when I was designing worker-local storage and didn't know how it was going to work yet. I never thought to revisit this until now. It really isn't useful to client code.

> * Is stop() really trusted or just unsafe? If it's forcibly killing threads then it's unsafe.

It's not forcibly killing threads. As the documentation states, it has no effect on jobs already executing, only ones in the queue. Furthermore, it's needed unless makeDaemon is called. Speaking of which, makeDaemon and makeAngel should probably be trusted, too.

> * defaultPoolThreads - should it be a property?

Yes. In spirit it's a global variable. It requires some extra machinations, though, to be threadsafe, which is why it's not implemented as a simple global variable.

> * No example for task().

???? Yes there is, for both flavors, though these could admittedly be improved. Only the safe version doesn't have an example, and this is just a more restricted version of the function pointer case, so it seems silly to make a separate example for it.

> * What is 'run' in the definition of safe task()?

It's just the run() adapter function. Isn't that obvious?
Mar 18 2011
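The rationale dsimcha gives above (fill a buffer from the possibly non-random-access source, then run an eager parallel map over it, so synchronization happens once per batch rather than once per element) can be sketched with Python's concurrent.futures. This is a concept illustration only, not the actual lazyMap/asyncBuf implementation:

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice

def lazy_map(fn, source, buf_size=4, n_workers=2):
    """Batch-buffered lazy map: pull buf_size elements into a buffer, map
    the whole batch eagerly in parallel, yield results, repeat. The pool
    is touched once per batch, not once per element."""
    it = iter(source)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        while True:
            batch = list(islice(it, buf_size))  # buffer from the lazy source
            if not batch:
                return
            yield from pool.map(fn, batch)      # eager parallel map per batch

# Results come back in order, exactly as a serial map would produce them:
assert list(lazy_map(lambda x: x * x, range(7), buf_size=3)) == [0, 1, 4, 9, 16, 25, 36]
```

A single shared circular buffer, by contrast, would need an atomic operation or lock on every element to coordinate the producer and consumer ends, which is the overhead the batch design avoids.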
On Sat, 19 Mar 2011 05:40:08 +0100, dsimcha <dsimcha yahoo.com> wrote:
> On 3/18/2011 11:29 PM, Andrei Alexandrescu wrote:
>> 1. Library proper:
>> * "In the case of non-random access ranges, parallel foreach is still usable but buffers lazily to an array..." Wouldn't strided processing help? If e.g. 4 threads the first works on 0, 4, 8, ... second works on 1, 5, 9, ... and so on.
>
> You can have this if you want, by setting the work unit size to 1. Setting it to a larger size just causes more elements to be buffered, which may be more efficient in some cases.

Please add an example showing that, too. Sure, the documentation says that's what's being done, but an example would show it more clearly.

-- Simen
Mar 19 2011
== Quote from Simen kjaeraas (simen.kjaras gmail.com)'s article
> On Sat, 19 Mar 2011 05:40:08 +0100, dsimcha <dsimcha yahoo.com> wrote:
>> You can have this if you want, by setting the work unit size to 1. Setting it to a larger size just causes more elements to be buffered, which may be more efficient in some cases.
>
> Please add an example showing that, too. Sure, the documentation says that's what's being done, but an example would show it more clearly.

I don't understand how this can be demonstrated in an example. It's an under-the-hood thing. The only place this appears in the API is in the workUnitSize parameter.
Mar 19 2011
On Sat, 19 Mar 2011 13:51:56 +0100, dsimcha <dsimcha yahoo.com> wrote:
> == Quote from Simen kjaeraas (simen.kjaras gmail.com)'s article
>> Please add an example showing that, too. Sure, the documentation says that's what's being done, but an example would show it more clearly.
>
> I don't understand how this can be demonstrated in an example. It's an under-the-hood thing. The only place this appears in the API is in the workUnitSize parameter.

Yeah, scratch that. I for some reason thought the "array of size workUnitSize" was global, but it's per thread, innit? Seems so logical now.

-- Simen
Mar 19 2011
On 03/18/2011 11:40 PM, dsimcha wrote:
> Thanks for the advice. You mentioned in the past that the documentation was inadequate but didn't give enough specifics as to how until now. As the author of the library, things seem obvious to me that don't seem obvious to anyone else, so I don't feel that I'm in a good position to judge the quality of the documentation and where it needs improvement. I plan to fix most of the issues you raised, but I've left comments for the few that I can't/won't fix or believe are based on misunderstandings below.

Great, thanks.

>> * "In the case of non-random access ranges, parallel foreach is still usable but buffers lazily to an array..." Wouldn't strided processing help? If e.g. 4 threads the first works on 0, 4, 8, ... second works on 1, 5, 9, ... and so on.
>
> You can have this if you want, by setting the work unit size to 1. Setting it to a larger size just causes more elements to be buffered, which may be more efficient in some cases.

Got it.

>> * Why not make workerIndex a ulong and be done with it?
>
> I doubt anyone's really going to create anywhere near 4 billion TaskPool threads over the lifetime of a program. Part of the point of TaskPool is recycling threads rather than paying the overhead of creating and destroying them. Using a ulong on a 32-bit architecture would make worker-local storage substantially slower. workerIndex is how worker-local storage works under the hood, so it needs to be fast.

If you're confident that overflow won't occur, you may want to eliminate that detail from the docs. It throws off the reader.

>> * No example for workerIndex and why it's useful.
>
> It should just be private. The fact that it's public is an artifact of when I was designing worker-local storage and didn't know how it was going to work yet. I never thought to revisit this until now. It really isn't useful to client code.

It could be public and undocumented.

>> * Is stop() really trusted or just unsafe? If it's forcibly killing threads then it's unsafe.
>
> It's not forcibly killing threads. As the documentation states, it has no effect on jobs already executing, only ones in the queue. Furthermore, it's needed unless makeDaemon is called. Speaking of which, makeDaemon and makeAngel should probably be trusted, too.

Great. The more safe/trusted, the better.

>> * defaultPoolThreads - should it be a property?
>
> Yes. In spirit it's a global variable. It requires some extra machinations, though, to be threadsafe, which is why it's not implemented as a simple global variable.

Then it should be a property. I think ddoc doesn't reflect that, but an example could.

>> * No example for task().
>
> ???? Yes there is, for both flavors, though these could admittedly be improved. Only the safe version doesn't have an example, and this is just a more restricted version of the function pointer case, so it seems silly to make a separate example for it.

Towards the bottom of the document there are overloads of task that don't have examples.

>> * What is 'run' in the definition of safe task()?
>
> It's just the run() adapter function. Isn't that obvious?

I'm referring to this:

    Task!(run,TypeTuple!(F,Args)) task(F, Args...)(scope F delegateOrFp, Args args);

What is "run"?

Andrei
Mar 19 2011
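The exchange above treats workerIndex as an implementation detail of worker-local storage. The underlying trick is worth seeing concretely: each pool thread claims a small integer index once, and every later access is plain array indexing into that thread's slot, so the hot path needs no locking. A Python stand-in (an illustration of the pattern, not the std.parallelism implementation):

```python
import threading
from concurrent.futures import ThreadPoolExecutor

class WorkerLocalStorage:
    """Sketch of worker-local storage: each worker thread lazily registers
    for a small integer index (the role workerIndex plays internally), and
    all subsequent get/set calls are lock-free array indexing."""

    def __init__(self, nworkers, init):
        self.slots = [init() for _ in range(nworkers)]
        self._next_index = iter(range(nworkers))
        self._lock = threading.Lock()
        self._local = threading.local()

    def _index(self):
        if not hasattr(self._local, "i"):
            with self._lock:  # one-time registration per worker thread
                self._local.i = next(self._next_index)
        return self._local.i

    def get(self):
        return self.slots[self._index()]

    def set(self, value):
        self.slots[self._index()] = value

# Per-worker partial sums, combined only after the pool is done -- the
# pattern worker-local storage enables without per-element synchronization.
wls = WorkerLocalStorage(2, lambda: 0)
with ThreadPoolExecutor(max_workers=2) as pool:
    list(pool.map(lambda x: wls.set(wls.get() + x), range(100)))
assert sum(wls.slots) == 4950
```

Each thread mutates only its own slot, so the unlocked get/set pair is race-free; this is also why the index type needs to be cheap on 32-bit targets, as dsimcha argues.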
On 3/19/2011 10:54 AM, Andrei Alexandrescu wrote:
> On 03/18/2011 11:40 PM, dsimcha wrote:
>> It should just be private. The fact that it's public is an artifact of when I was designing worker-local storage and didn't know how it was going to work yet. I never thought to revisit this until now. It really isn't useful to client code.
>
> It could be public and undocumented.

I've already made it private. I can't see what purpose having it public would serve. The fact that it was public before was **purely** an oversight.

> Then it should be a property. I think ddoc doesn't reflect that, but an example could.

Right. It's always been a property but ddoc doesn't reflect this. I've changed the docs slightly to call it a "property" instead of a "function". It seems like overkill to me to give examples for this, though, since it's just a getter and a setter.

> Towards the bottom of the document there are overloads of task that don't have examples.

You mean TaskPool.task()? Since these are such slight variations of the other overloads, I thought an example would be overkill. Since people less familiar with the library don't think so, though, I've added examples that are accordingly slight variations of the examples for the other overloads.

> I'm referring to this:
>
>     Task!(run,TypeTuple!(F,Args)) task(F, Args...)(scope F delegateOrFp, Args args);
>
> What is "run"?

Ok, I ended up moving the docs for run() directly above the stuff that uses it. run() is described as "Calls a delegate or function pointer with args. This is an adapter that makes Task work with delegates, function pointers and functors instead of just aliases. It is included in the documentation to clarify how this case is handled, but is not meant to be used directly by client code."

I know the Higher Principles of Encapsulation say this should be private and the relevant overloads should return auto. I strongly believe, though, that being anal about encapsulation of this detail is silly since it is so unlikely to change and that exposing it helps to clarify what's really going on here. Encapsulation is good up to a point, but sometimes it's just easier to think about things when you know how they really work at a concrete level, and this tradeoff needs to be weighed.
Mar 19 2011
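The run() adapter dsimcha describes collapses delegates, function pointers, and functors into a single invocation path, which is what lets one Task implementation serve them all. A minimal future-like sketch in Python (hypothetical names; not the real Task struct, whose waiting strategies are far richer):

```python
def run(fn, *args):
    """One adapter through which any callable is invoked with its
    arguments -- the role of run() in the Task definition above."""
    return fn(*args)

class Task:
    """Minimal future-like sketch: hold the callable and its arguments,
    execute at most once, hand back the cached result on wait."""

    def __init__(self, fn, *args):
        self._fn, self._args = fn, args
        self._done, self._result = False, None

    def execute(self):
        if not self._done:
            self._result = run(self._fn, *self._args)
            self._done = True

    def yield_wait(self):
        # In this sketch, "waiting" simply executes inline if nothing
        # (e.g. a pool worker) has run the task yet.
        self.execute()
        return self._result

t = Task(lambda a, b: a + b, 2, 3)
assert t.yield_wait() == 5
```

The point of the adapter is that Task only ever needs to know about one callable shape; the overloads of task() differ merely in how they package their argument into that shape.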
On 03/19/2011 10:28 AM, dsimcha wrote:
> On 3/19/2011 10:54 AM, Andrei Alexandrescu wrote:
>> Towards the bottom of the document there are overloads of task that don't have examples.
>
> You mean TaskPool.task()? Since these are such slight variations of the other overloads, I thought an example would be overkill. Since people less familiar with the library don't think so, though, I've added examples that are accordingly slight variations of the examples for the other overloads.

A great way to handle bunches of almost-identical overloads is to group them together with /// ditto and explain the slight differences in the consolidated documentation.

Andrei
Mar 19 2011
Ok, thanks again for clarifying **how** the docs could be improved. I've implemented the suggestions and generally given the docs a good reading over and clean up. The new docs are at:

http://cis.jhu.edu/~dsimcha/d/phobos/std_parallelism.html

On 3/18/2011 11:29 PM, Andrei Alexandrescu wrote:
> [snip]
Mar 19 2011
On 03/19/2011 02:32 AM, dsimcha wrote:Ok, thanks again for clarifying **how** the docs could be improved. I've implemented the suggestions and generally given the docs a good reading over and clean up. The new docs are at: http://cis.jhu.edu/~dsimcha/d/phobos/std_parallelism.html* When using "parallelism" as a common noun, prefix is with a '_' so ddoc doesn't underline it. * "Most of this module completely subverts..." Vague characterizations ("most", "completely", "some") don't belong in a technical documentation. (For example there's either subversion going on or there isn't.) Also, std.concurrency and std.parallelism address different needs so there's little competition between them. Better: "Unless explicitly marked as $(D trusted) or $(D safe), artifacts in this module are not provably memory-safe and cannot be used with SafeD. If used as documented, memory safety is guaranteed." * Speaking of std.concurrency vs. std.parallelism, the first paragraph might be something like: "This module implements high-level primitives for shared memory SMP parallelism. These include parallel foreach, parallel reduce, parallel eager map, pipelining and future/promise parallelism primitives. $(D std._parallelism) is best recommended when the same operation is to be executed in parallel over different data. For communication between arbitrary threads, see $(D std.concurrency)." * Still no synopsis example that illustrates in a catchy way the most attractive artifacts. * "After creation, Task objects are submitted to a TaskPool for execution." I understand it's possible to use Task straight as a promise/future, so s/are/may be/. Also it is my understanding that TaskPool efficiently adapts the concrete number of CPUs to an arbitrary number of tasks. If that's the case, it's worth mentioning. * "A call to workWait(), yieldWait(), or spinWait() can be used to retrive the return value after the function is finished executing." 
As an aside, a spell checking step would be great ("retrieve") - just put the text in an editor with spellchecking. I think what this means is: "The methods workWait(), yieldWait(), or spinWait() make sure that the function finishes execution and then return its result to the initiating thread. Each uses a different waiting strategy as detailed below." * "If a Task has been submitted to a TaskPool instance, is being stored in a stack frame, and has not yet finished, the destructor for this struct will automatically call yieldWait() so that the task can finish and the stack frame can be destroyed safely." At this point in the doc the reader doesn't understand that at all because TaskPool has not been seen yet. The reader gets worried that she'll be essentially serializing the entire process by mistake. Either move this explanation down or provide an example. * "Function results are returned from yieldWait() and friends by ref." Someone coming from C++ may be thrown off by this sudden casual use of "friends" and think there's a notion of frienship by reference in D. Better: "The forcing methods yieldWait(), workWait(), and spinWait() return the result by reference." * Speaking of which, I'd replace "Wait" with "Force". Right now the nomenclature is far removed from futures and promises. * Is done() a property? * The example that reads two files at the same time should NOT use taskPool. It's just one task, why would the pool ever be needed? If you also provided an example that reads n files in memory at the same time using a pool, that would illustrate nicely why you need it. If a Task can't be launched without being put in a pool, there should be a possibility to do so. At my work we have a simple function called callInNewThread that does what's needed to launch a function in a new thread. * The note below that example gets me thinking: it is an artificial limitation to force users of Task to worry about scope and such. 
One should be able to create a Future object (Task I think in your terminology), pass it around like a normal value, and ultimately force it. This is the case for all other languages that implement futures. I suspect the "scope" parameter associated with the delegate a couple of definitions below plays a role here, but I think we need to work toward providing the smoothest interface possible (possibly prompting improvements in the language along the way).

* I'm not sure how to interpret the docs for

ReturnType!(F) run(F, Args...)(F fpOrDelegate, ref Args args);

So it's documented but I'm not supposed to care. Why not just remove? Surely there must be higher-level examples that clarify that I can use delegates etc.

* The examples have code at top-level. That's fine for short snippets but not when using import etc. I recommend putting the code inside unittests or function bodies for such cases.

* "If you want to escape the Task object from the function in which it was created or prefer to heap allocate and automatically submit to the pool, see TaskPool.task()." I'm uncomfortable that I need to remember a subtle distinction of the same primitive name ("task") depending on membership in a TaskPool or not, which is a tenuous relationship to remember. How about "scopedTask()" vs. "task()"? Also, it's worth asking ourselves what's the relative overhead of closure allocation vs. task switching etc. Maybe we get to simplify things a lot at a small cost in efficiency.

* As I mentioned, in the definition:

Task!(run, TypeTuple!(F, Args)) task(F, Args...)(F fun, Args args);

I can't tell what "run" is.

* osReportedNcpu - a case of naming convention gone wild... how about totalCPUs or reportedCPUs.

* "A goto from inside the parallel foreach loop to a label outside the loop will result in undefined behavior." Would this be a bug in dmd?
* I would group these two together:

ParallelForeach!(R) parallel(R)(R range, size_t workUnitSize);
ParallelForeach!(R) parallel(R)(R range);

Just put a /// ditto for the second, use only one code snippet that illustrates two foreach loops, one with the explicit parameter and one without. In fact swap the two - most of the time people will use the one with the default work unit size. At best don't specify 512 because later on you may come up with a better algorithm to determine the optimal block size.

* "workUnitSize: The number of elements to evaluate in a single Task. Must be less than or equal to bufSize, and in practice should be a fraction of bufSize such that all worker threads can be used." Then why not specify a different parameter such as a multiplier or something? The dependence between bufSize and workUnitSize is a sign that these two should be improved. If you have good reasons that the user must have the parameters in this form, give an example substantiating that.

* "auto myMax = taskPool.reduce!"a + b * b"(myArr);" This is not computing anything meaningful. To compute the sum of squares, you need to give the seed 0.0.

* Again: speed of e.g. parallel min/max vs. serial, pi computation etc. on a usual machine?

* The join example is fine, but the repetitive code suggests that loops should be better:

import std.file;

void main() {
    auto pool = new TaskPool();
    foreach (filename; ["foo.txt", "bar.txt", "baz.txt"]) {
        pool.put(task!read(filename));
    }

    // Call join() to guarantee that all tasks are done running, the worker
    // threads have terminated and that the results of all of the tasks can
    // be accessed without any synchronization primitives.
    pool.join();

    ubyte[][] data; // void[], meh

    // Use spinWait() since the results are guaranteed to have been computed
    // and spinWait() is the cheapest of the wait functions.
    foreach (task; pool.howDoIEnumerateTasks()) {
        data ~= task.spinWait();
    }
}

Andrei
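To make the seed remark concrete, here is a hedged sketch of a well-defined parallel sum of squares. Note that I express the squaring through map rather than inside the reduction function, because for a parallel reduce the same function is also used to combine per-thread partial results, and "a + b * b" would square a partial sum during that combine step. The seed-first reduce overload is assumed from the documentation under review:

```d
import std.algorithm : map;
import std.parallelism;
import std.stdio;

void main()
{
    auto myArr = [1.0, 2.0, 3.0, 4.0];

    // Seed 0.0 makes the accumulation well-defined; squaring happens in
    // map, so the plain "a + b" combiner is also valid for merging the
    // per-thread partial sums.
    auto sumSquares = taskPool.reduce!"a + b"(0.0, map!"a * a"(myArr));

    writeln(sumSquares); // 1 + 4 + 9 + 16 = 30
}
```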
Mar 19 2011
On 2011-03-19 12:03:51 -0400, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> said:
* "Most of this module completely subverts..." Vague characterizations ("most", "completely", "some") don't belong in technical documentation. (For example there's either subversion going on or there isn't.) Also, std.concurrency and std.parallelism address different needs so there's little competition between them. Better: "Unless explicitly marked as $(D trusted) or $(D safe), artifacts in this module are not provably memory-safe and cannot be used with SafeD. If used as documented, memory safety is guaranteed."

Actually, I think this is a bad description of what it subverts. What it subverts isn't the memory-safety that SafeD provides, but the safety against low-level races that even unsafe D protects against unless you cast shared away. For instance:

void main() {
    int sum = 0;
    foreach (int value; taskPool.parallel([0,2,3,6,1,4,6,3,3,3,6])) {
        sum += value;
    }
    writeln(sum);
}

The "+=" would need to be an atomic operation to avoid low-level races. I think that ParallelForeach's opApply should only accept a shared delegate. I define a shared delegate as a delegate that does not reference any non-shared variables of its outer scope. The problem is that DMD currently doesn't know how to determine whether a delegate literal is shared or not, thus a delegate literal is never shared, and if ParallelForeach's opApply asked for a shared delegate as it should, it would just not work. Fix DMD to create shared delegate literals where appropriate and everything can be guaranteed race-free.

-- Michel Fortin michel.fortin michelf.com http://michelf.com/
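For reference, the usual race-free rewrite of the summation loop above is a reduction, which gives each worker a private accumulator and only combines results at the end. A sketch, assuming the seed-taking TaskPool.reduce overload from the documentation under review:

```d
import std.parallelism;
import std.stdio;

void main()
{
    auto data = [0, 2, 3, 6, 1, 4, 6, 3, 3, 3, 6];

    // Each worker thread reduces its own chunk into a private partial
    // sum; the partial sums are combined afterwards, so the reduction
    // body never touches shared mutable state and needs no atomics.
    auto sum = taskPool.reduce!"a + b"(0, data);

    writeln(sum); // 37
}
```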
Mar 19 2011
On 3/19/2011 1:09 PM, Michel Fortin wrote:On 2011-03-19 12:03:51 -0400, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> said:If you want pedal-to-metal parallelism without insane levels of verbosity, you can't have these safety features. I thought long and hard about this issue before submitting this lib for review and concluded that any solution would make std.parallelism so slow, so limited and/or such a PITA to use that I'd rather it just punt these issues to the programmer. In practice, parallel foreach is used with very small, performance-critical snippets that are fairly easy to reason about even without any language-level race safety. This is a fundamental design decision that will not be changing. If it's unacceptable then: 1. I give up. 2. I wish someone had told me earlier.* "Most of this module completely subverts..." Vague characterizations ("most", "completely", "some") don't belong in a technical documentation. (For example there's either subversion going on or there isn't.) Also, std.concurrency and std.parallelism address different needs so there's little competition between them. Better: "Unless explicitly marked as $(D trusted) or $(D safe), artifacts in this module are not provably memory-safe and cannot be used with SafeD. If used as documented, memory safety is guaranteed."Actually, I think this is a bad description of what it subverts. What it subverts isn't the memory-safety that SafeD provides, but the safety against low-level races that even unsafe D protects against unless you cast shared away. For instance: void main() { int sum = 0; foreach (int value; taskPool.parallel([0,2,3,6,1,4,6,3,3,3,6])) { sum += value; } writeln(sum); } The "+=" would need to be an atomic operation to avoid low-level races. I think that ParallelForeach's opApply should only accept a shared delegate. I define shared delegate as a delegate that does not reference any non-shared variables of its outer scope. 
The problem is that DMD currently doesn't know how to determine whether a delegate literal is shared or not, thus a delegate literal is never shared, and if ParallelForeach's opApply asked for a shared delegate as it should, it would just not work. Fix DMD to create shared delegate literals where appropriate and everything can be guaranteed race-free.
Mar 19 2011
On 2011-03-19 13:20:00 -0400, dsimcha <dsimcha yahoo.com> said:On 3/19/2011 1:09 PM, Michel Fortin wrote:I'm not sure where my proposal asks for more verbosity or less performance. All I can see is a few less casts in std.parallelism and that it'd disallow the case in my example above that is totally wrong. Either you're interpreting it wrong or there are things I haven't thought about (and I'd be happy to know about them). But by looking at all the examples in the documentation, I cannot find one that would need to be changed... well, except the one I'll discuss below.For instance: void main() { int sum = 0; foreach (int value; taskPool.parallel([0,2,3,6,1,4,6,3,3,3,6])) { sum += value; } writeln(sum); } The "+=" would need to be an atomic operation to avoid low-level races. I think that ParallelForeach's opApply should only accept a shared delegate. I define shared delegate as a delegate that does not reference any non-shared variables of its outer scope. The problem is that DMD currently doesn't know how to determine whether a delegate literal is shared or not, thus a delegate literal is never shared and if ParallelForeach's opApply asked a shared delegate as it should it would just not work. Fix DMD to create shared delegate literals where appropriate and everything can be guarantied race-free.If you want pedal-to-metal parallelism without insane levels of verbosity, you can't have these safety features.I thought long and hard about this issue before submitting this lib for review and concluded that any solution would make std.parallelism so slow, so limited and/or such a PITA to use that I'd rather it just punt these issues to the programmer. 
In practice, parallel foreach is used with very small, performance-critical snippets that are fairly easy to reason about even without any language-level race safety.I'm not too convinced about the "I know what I'm doing" argument when I look at this example from asyncBuf's documentation: auto lines = File("foo.txt").byLine(); auto duped = map!"a.idup"(lines); // Necessary b/c byLine() recycles buffer // Fetch more lines in the background while we process the lines already // read into memory into a matrix of doubles. double[][] matrix; auto asyncReader = taskPool.asyncBuf(duped); foreach(line; asyncReader) { auto ls = line.split("\t"); matrix ~= to!(double[])(ls); } Look at the last line of the foreach. You are appending to a non-shared array from many different threads. How is that not a race condition? With my proposal, the compiler would have caught that because opApply would want the foreach body to be a shared delegate, and reading/writing to non-shared variable "matrix" in the outer scope from a shared delegate literal would be an error. I'm not too sure how hard it'd be to do that in the compiler, but I think it's the right thing to do. Once the compiler can do this we can have safety; until that time I'd agree to see std.parallelism stay unsafe. -- Michel Fortin michel.fortin michelf.com http://michelf.com/
Mar 19 2011
On 2011-03-19 14:14:51 -0400, Michel Fortin <michel.fortin michelf.com> said:I'm not too convinced about the "I know what I'm doing" argument when I look at this example from asyncBuf's documentation: auto lines = File("foo.txt").byLine(); auto duped = map!"a.idup"(lines); // Necessary b/c byLine() recycles buffer // Fetch more lines in the background while we process the lines already // read into memory into a matrix of doubles. double[][] matrix; auto asyncReader = taskPool.asyncBuf(duped); foreach(line; asyncReader) { auto ls = line.split("\t"); matrix ~= to!(double[])(ls); } Look at the last line of the foreach. You are appending to a non-shared array from many different threads. How is that not a race condition?... or maybe I just totally misunderstood asyncBuf. Rereading the documentation I'm under the impression I'd have to write this to get what I expected: foreach (line; parallel(asyncReader)) ... And that would cause a race condition. If that's the case, the example is fine. Sorry for the misunderstanding. -- Michel Fortin michel.fortin michelf.com http://michelf.com/
Mar 19 2011
On 3/19/2011 2:25 PM, Michel Fortin wrote:On 2011-03-19 14:14:51 -0400, Michel Fortin <michel.fortin michelf.com> said:Right. And this is pretty obviously a race. The other example (without the parallel) is completely safe.I'm not too convinced about the "I know what I'm doing" argument when I look at this example from asyncBuf's documentation: auto lines = File("foo.txt").byLine(); auto duped = map!"a.idup"(lines); // Necessary b/c byLine() recycles buffer // Fetch more lines in the background while we process the lines already // read into memory into a matrix of doubles. double[][] matrix; auto asyncReader = taskPool.asyncBuf(duped); foreach(line; asyncReader) { auto ls = line.split("\t"); matrix ~= to!(double[])(ls); } Look at the last line of the foreach. You are appending to a non-shared array from many different threads. How is that not a race condition?.... or maybe I just totally misunderstood asyncBuf. Rereading the documentation I'm under the impression I'd have to write this to get what I expected: foreach (line; parallel(asyncReader)) ... And that would cause a race condition. If that's the case, the example is fine. Sorry for the misunderstanding.
Mar 19 2011
On 3/19/2011 2:14 PM, Michel Fortin wrote:
On 2011-03-19 13:20:00 -0400, dsimcha <dsimcha yahoo.com> said:
On 3/19/2011 1:09 PM, Michel Fortin wrote:

I'm not sure where my proposal asks for more verbosity or less performance. All I can see is a few less casts in std.parallelism and that it'd disallow the case in my example above that is totally wrong. Either you're interpreting it wrong or there are things I haven't thought about (and I'd be happy to know about them). But by looking at all the examples in the documentation, I cannot find one that would need to be changed... well, except the one I'll discuss below.

For instance:

void main() {
    int sum = 0;
    foreach (int value; taskPool.parallel([0,2,3,6,1,4,6,3,3,3,6])) {
        sum += value;
    }
    writeln(sum);
}

The "+=" would need to be an atomic operation to avoid low-level races. I think that ParallelForeach's opApply should only accept a shared delegate. I define a shared delegate as a delegate that does not reference any non-shared variables of its outer scope. The problem is that DMD currently doesn't know how to determine whether a delegate literal is shared or not, thus a delegate literal is never shared and if ParallelForeach's opApply asked for a shared delegate as it should it would just not work.

Ok, I've had some time to think about this. The following example is safe, but wouldn't work if I understand correctly what you're proposing:

auto logs = new double[1_000_000];
foreach (i, ref elem; taskPool.parallel(logs)) {
    elem = log(i + 1.0);
}

The problem is that you're writing to the same array from multiple threads, which the compiler would interpret as writing to the same variable. However, you can guarantee that no two threads ever write to the same element, so it's safe.

Note: I'm aware of the false sharing issue when writing to adjacent memory addresses. However, when writing to an array this big it occurs to a negligible extent. If you were using a smaller array, then either the loop body would be so expensive that the cost of false sharing would be negligible, or the whole loop would be too cheap to be worth parallelizing. I'm also aware that word tearing is a concern on some architectures, though not x86. IMHO std.parallelism and its documentation should not be pedantic about portability to architectures that a D compiler doesn't even exist for yet.

Also, your example can be trivially modified to be safe.

void main() {
    int sum = 0;
    foreach (int value; taskPool.parallel([0,2,3,6,1,4,6,3,3,3,6])) {
        synchronized sum += value;
    }
    writeln(sum);
}

In this case that kills all parallelism, but in more realistic cases I use this pattern often. I find it very common to have an expensive loop body that can be performed in parallel, except for a tiny portion that must update a shared data structure. I'm aware that it might be possible, in theory, to write this more formally using reduce() or something. However: 1.
If the portion of the loop that deals with shared data is very small (and therefore the serialization caused by the synchronized block is negligible), it's often more efficient to only keep one data structure in memory and update it concurrently, rather than use stronger isolation between threads like reduce() does, and have to maintain one data structure for each thread. 2. In my experience synchronizing on a small portion of the loop body works very well in practice. My general philosophy is that, in a library like this, dangerous but useful constructs must be supported and treated as innocent until proven guilty, not the other way round.On 3/19/2011 1:09 PM, Michel Fortin wrote:I'm not sure where my proposal asks for more verbosity or less performance. All I can see is a few less casts in std.parallelism and that it'd disallow the case in my example above that is totally wrong. Either you're interpreting it wrong or there are things I haven't thought about (and I'd be happy to know about them). But by looking at all the examples in the documentation, I cannot find one that would need to be changed... well, except the one I'll discuss below.For instance: void main() { int sum = 0; foreach (int value; taskPool.parallel([0,2,3,6,1,4,6,3,3,3,6])) { sum += value; } writeln(sum); } The "+=" would need to be an atomic operation to avoid low-level races. I think that ParallelForeach's opApply should only accept a shared delegate. I define shared delegate as a delegate that does not reference any non-shared variables of its outer scope. The problem is that DMD currently doesn't know how to determine whether a delegate literal is shared or not, thus a delegate literal is never shared and if ParallelForeach's opApply asked a shared delegate as it should it would just not work. 
Fix DMD to create shared delegate literals where appropriate and everything can be guaranteed race-free.

If you want pedal-to-metal parallelism without insane levels of verbosity, you can't have these safety features.
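The pattern dsimcha describes, an expensive parallel body with a tiny synchronized update of shared state, can be sketched as follows. expensiveWork is a hypothetical stand-in for the real per-element computation:

```d
import std.math : log;
import std.parallelism;
import std.range : iota;

// Hypothetical stand-in for an expensive per-element computation.
double expensiveWork(size_t i)
{
    return log(i + 1.0);
}

void main()
{
    double total = 0;

    foreach (i; taskPool.parallel(iota(1_000_000)))
    {
        auto r = expensiveWork(i); // runs fully in parallel

        // Tiny serialized section; cheap relative to expensiveWork,
        // so the serialization it causes is negligible in practice.
        synchronized total += r;
    }
}
```

The point of the pattern is that only one data structure (here just a double) lives in memory, at the cost of a brief lock per iteration, instead of the per-thread copies a reduce()-style isolation would maintain.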
Mar 20 2011
On 2011-03-20 11:42:04 -0400, dsimcha <dsimcha yahoo.com> said:On 3/19/2011 2:14 PM, Michel Fortin wrote:I don't see a problem with the above. The array elements you modify are passed through parallel's opApply which can check easily whether it's safe or not to pass them by ref to different threads (by checking the element's size) and allow or disallow the operation accordingly. It could even do a clever trick to make it safe to pass things such as elements of array of bytes by ref (by coalescing loop iterations for all bytes sharing the same word into one task). That might not work for ranges which are not arrays however. That said, feel free to suggest more problematic examples.On 2011-03-19 13:20:00 -0400, dsimcha <dsimcha yahoo.com> said:Ok, I've had some time to think about this. The following example is safe, but wouldn't work if I understand correctly what you're proposing: auto logs = new double[1_000_000]; foreach(i, ref elem; taskPool.parallel(logs)) { elem = log(i + 1.0); } The problem is that you're writing to the same array from multiple threads, which the compiler would interpret as writing to the same variable. However, you can guarantee that no two threads ever write to the same element, so it's safe.On 3/19/2011 1:09 PM, Michel Fortin wrote:I'm not sure where my proposal asks for more verbosity or less performance. All I can see is a few less casts in std.parallelism and that it'd disallow the case in my example above that is totally wrong. Either you're interpreting it wrong or there are things I haven't thought about (and I'd be happy to know about them). But by looking at all the examples in the documentation, I cannot find one that would need to be changed... well, except the one I'll discuss below.For instance: void main() { int sum = 0; foreach (int value; taskPool.parallel([0,2,3,6,1,4,6,3,3,3,6])) { sum += value; } writeln(sum); } The "+=" would need to be an atomic operation to avoid low-level races. 
I think that ParallelForeach's opApply should only accept a shared delegate. I define a shared delegate as a delegate that does not reference any non-shared variables of its outer scope. The problem is that DMD currently doesn't know how to determine whether a delegate literal is shared or not, thus a delegate literal is never shared and if ParallelForeach's opApply asked for a shared delegate as it should it would just not work. Fix DMD to create shared delegate literals where appropriate and everything can be guaranteed race-free.

If you want pedal-to-metal parallelism without insane levels of verbosity, you can't have these safety features.

Note: I'm aware of the false sharing issue when writing to adjacent memory addresses. However, when writing to an array this big it occurs to a negligible extent. If you were using a smaller array, then either the loop body would be so expensive that the cost of false sharing would be negligible, or the whole loop would be too cheap to be worth parallelizing. I'm also aware that word tearing is a concern on some architectures, though not x86. IMHO std.parallelism and its documentation should not be pedantic about portability to architectures that a D compiler doesn't even exist for yet.

No question there.
If the portion of the loop that deals with shared data is very small (and therefore the serialization caused by the synchronized block is negligible), it's often more efficient to only keep one data structure in memory and update it concurrently, rather than use stronger isolation between threads like reduce() does, and have to maintain one data structure for each thread. 2. In my experience synchronizing on a small portion of the loop body works very well in practice. My general philosophy is that, in a library like this, dangerous but useful constructs must be supported and treated as innocent until proven guilty, not the other way round.Your second example is not really a good justification of anything. I'll refer you to how synchronized classes work. It was decided that synchronized in a class protects everything that is directly stored in the class. Anything behind an indirection is considered shared by the compiler. The implication of this is that if you have an array or a pointer to something that you want semantically to be protected by the class's mutex, you have to cast things to unshared. It was decided that things should be safe against low-level races first, and convenience was relegated as a secondary concern. I don't like it very much, but that's what was decided and written in TDPL. Now we're in the exact same situation (except that no classes are involved) and you're proposing that for this case we make convenience prime over low-level race safety? To me this would be an utter lack of coherency in the design of D's "synchronized" feature to go that route. For the case above, wouldn't it be better to use an atomic add instead of a synchronized block? In this case you can make "sum" shared and have the type system check that everything is safe (using my proposed rules). 
And if your data structure is bigger, likely it'll be a synchronized class and you won't have to resort to bypassing type-system safeties inside your loop (although you might have to inside your synchronized class, but that's another matter). -- Michel Fortin michel.fortin michelf.com http://michelf.com/
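Michel's atomic-add alternative can be sketched with core.atomic's atomicOp, which requires the accumulator to be declared shared so the type system can check the access:

```d
import core.atomic : atomicOp;
import std.parallelism;
import std.stdio;

void main()
{
    shared int sum = 0;

    foreach (int value; taskPool.parallel([0, 2, 3, 6, 1, 4, 6, 3, 3, 3, 6]))
    {
        // Lock-free read-modify-write on the shared accumulator;
        // no mutex, and no low-level race on the "+=".
        atomicOp!"+="(sum, value);
    }

    writeln(sum); // 37
}
```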
Mar 20 2011
On 3/20/2011 10:44 PM, Michel Fortin wrote:
I don't see a problem with the above. The array elements you modify are passed through parallel's opApply which can check easily whether it's safe or not to pass them by ref to different threads (by checking the element's size) and allow or disallow the operation accordingly. It could even do a clever trick to make it safe to pass things such as elements of array of bytes by ref (by coalescing loop iterations for all bytes sharing the same word into one task). That might not work for ranges which are not arrays however. That said, feel free to suggest more problematic examples.

Ok, I completely agree in principle, though I question whether it's worth actually implementing something like this, especially until we get some kind of support for shared delegates.

I'd go a little further. If the guarantees that shared was supposed to provide are strong, i.e. apply no matter what threading module is used, then I utterly despise it. It's one of the worst decisions made in the design of D. Making things pedantically strict, so that the type system gets in the way more than it helps, encourages the user to reflexively circumvent the type system without thinking hard about doing this, thus defeating its purpose. (The alternative of always complying with what the type system "expects" you to do is too inflexible to even be worth considering.) Type systems should err on the side of accepting a superset of what's correct and treating code as innocent until proven guilty, not the other way around. I still believe this even if some of the bugs it could be letting pass through might be very difficult to debug. See the discussion we had a few weeks ago about implicit integer casting and porting code to 64.

My excuse for std.parallelism is that it's pedal-to-metal parallelism, so it's more acceptable for it to be dangerous than general case concurrency. IMHO when you use the non-@safe parts of std.parallelism (i.e.
most of the library), that's equivalent to casting away shared in a whole bunch of places. Typing "import std.parallelism;" in a non-@safe module is an explicit enough step here.

The guarantee is still preserved that, if you only use std.concurrency (D's flagship "safe" concurrency module) for multithreading and don't cast away shared, there can be no low level data races. IMHO this is still a substantial accomplishment in that there exists a way to do safe, statically checkable concurrency in D, even if it's not the **only** way concurrency can be done. BTW, core.thread can also be used to get around D's type system, not just std.parallelism. If you want to check that only safe concurrency is used, importing std.parallelism and core.thread can be grepped just as easily as casting away shared.

If, on the other hand, the guarantees of shared are supposed to be weak in that they only apply to programs where only std.concurrency is used for multithreading, then I think strictness is the right thing to do. The whole point of std.concurrency is to give strong guarantees, but if you prefer more dangerous but more flexible multithreading, other paradigms should be readily available.
If the portion of the loop that deals with shared data is very small (and therefore the serialization caused by the synchronized block is negligible), it's often more efficient to only keep one data structure in memory and update it concurrently, rather than use stronger isolation between threads like reduce() does, and have to maintain one data structure for each thread. 2. In my experience synchronizing on a small portion of the loop body works very well in practice. My general philosophy is that, in a library like this, dangerous but useful constructs must be supported and treated as innocent until proven guilty, not the other way round.Your second example is not really a good justification of anything. I'll refer you to how synchronized classes work. It was decided that synchronized in a class protects everything that is directly stored in the class. Anything behind an indirection is considered shared by the compiler. The implication of this is that if you have an array or a pointer to something that you want semantically to be protected by the class's mutex, you have to cast things to unshared. It was decided that things should be safe against low-level races first, and convenience was relegated as a secondary concern. I don't like it very much, but that's what was decided and written in TDPL.Now we're in the exact same situation (except that no classes are involved) and you're proposing that for this case we make convenience prime over low-level race safety? To me this would be an utter lack of coherency in the design of D's "synchronized" feature to go that route. For the case above, wouldn't it be better to use an atomic add instead of a synchronized block?Yes. I've used this pattern in much more complicated cases, though, where atomic wouldn't cut it.In this case you can make "sum" shared and have the type system check that everything is safe (using my proposed rules). 
And if your data structure is bigger, likely it'll be a synchronized class and you won't have to resort to bypass type-system safeties inside your loop (although you might have to inside your synchronized class, but that's another matter).I'm **still** totally confused about how shared is supposed to work, because I don't have a fully debugged/implemented implementation or good examples of stuff written in this paradigm to play around with. TDPL helps to some degree, but for me it's a lot easier to understand something by actually trying to use it.
Mar 20 2011
On 2011-03-20 23:21:49 -0400, dsimcha <dsimcha yahoo.com> said:
On 3/20/2011 10:44 PM, Michel Fortin wrote:
I don't see a problem with the above. The array elements you modify are passed through parallel's opApply which can check easily whether it's safe or not to pass them by ref to different threads (by checking the element's size) and allow or disallow the operation accordingly. It could even do a clever trick to make it safe to pass things such as elements of array of bytes by ref (by coalescing loop iterations for all bytes sharing the same word into one task). That might not work for ranges which are not arrays however. That said, feel free to suggest more problematic examples.

Ok, I completely agree in principle, though I question whether it's worth actually implementing something like this, especially until we get some kind of support for shared delegates.

Well, it'll work irrespective of whether shared delegates are used or not. I think you could add a compile-time check that the array element size is a multiple of the word size when the element is passed by ref in the loop and leave the clever trick as a possible future improvement. Would that work?

I agree with you that this is a serious problem. I think part of why it hasn't been talked about much yet is that nobody is currently using D2 seriously for multithreaded stuff at this time (apart from you I guess), so we're missing experience with it. Andrei seems to think it's fine to require casts as soon as you need to protect something beyond an indirection inside synchronized classes, with the mitigation measure that you can make classes share their mutex (not implemented yet I think) so if the indirection leads to a class it is less of a problem. Personally, I don't.

I'd go a little further. If the guarantees that shared was supposed to provide are strong, i.e. apply no matter what threading module is used, then I utterly despise it. It's one of the worst decisions made in the design of D.
Making things pedantically strict, so that the type system gets in the way more than it helps, encourages the user to reflexively circumvent the type system without thinking hard about doing this, thus defeating its purpose. (The alternative of always complying with what the type system "expects" you to do is too inflexible to even be worth considering.) Type systems should err on the side of accepting a superset of what's correct and treating code as innocent until proven guilty, not the other way around. I still believe this even if some of the bugs it could be letting pass through might be very difficult to debug. See the discussion we had a few weeks ago about implicit integer casting and porting code to 64.Also, your example can be trivially modified to be safe. void main() { int sum = 0; foreach (int value; taskPool.parallel([0,2,3,6,1,4,6,3,3,3,6])) { synchronized sum += value; } writeln(sum); } In this case that kills all parallelism, but in more realistic cases I use this pattern often. I find it very common to have an expensive loop body can be performed in parallel, except for a tiny portion that must update a shared data structure. I'm aware that it might be possible, in theory, to write this more formally using reduce() or something. However: 1. If the portion of the loop that deals with shared data is very small (and therefore the serialization caused by the synchronized block is negligible), it's often more efficient to only keep one data structure in memory and update it concurrently, rather than use stronger isolation between threads like reduce() does, and have to maintain one data structure for each thread. 2. In my experience synchronizing on a small portion of the loop body works very well in practice. My general philosophy is that, in a library like this, dangerous but useful constructs must be supported and treated as innocent until proven guilty, not the other way round.Your second example is not really a good justification of anything. 
I'll refer you to how synchronized classes work. It was decided that synchronized in a class protects everything that is directly stored in the class. Anything behind an indirection is considered shared by the compiler. The implication of this is that if you have an array or a pointer to something that you want semantically to be protected by the class's mutex, you have to cast things to unshared. It was decided that things should be safe against low-level races first, and convenience was relegated to a secondary concern. I don't like it very much, but that's what was decided and written in TDPL.My excuse for std.parallelism is that it's pedal-to-metal parallelism, so it's more acceptable for it to be dangerous than general-case concurrency. IMHO when you use the non-@safe parts of std.parallelism (i.e. most of the library), that's equivalent to casting away shared in a whole bunch of places. Typing "import std.parallelism;" in a non-@safe module is an explicit enough step here.I still think this "pedal-to-metal" qualification needs to be justified. Not having shared delegates in the language seems like an appropriate justification to me. Wanting to bypass casts you normally have to do around synchronized as the sole reason seems like a bad justification to me. It's not that I like how synchronized works, it's just that I think it should work the same everywhere.The guarantee is still preserved that, if you only use std.concurrency (D's flagship "safe" concurrency module) for multithreading and don't cast away shared, there can be no low-level data races. IMHO this is still a substantial accomplishment in that there exists a way to do safe, statically checkable concurrency in D, even if it's not the **only** way concurrency can be done. BTW, core.thread can also be used to get around D's type system, not just std.parallelism. 
If you want to check that only safe concurrency is used, imports of std.parallelism and core.thread can be grepped just as easily as casts away from shared.Unless I'm mistaken, the only thing that bypasses race-safety in core.thread is the Thread constructor that takes a delegate. Which means it could easily be made race-safe by making that delegate parameter shared (once shared delegates are implemented).If, on the other hand, the guarantees of shared are supposed to be weak in that they only apply to programs where only std.concurrency is used for multithreading, then I think strictness is the right thing to do. The whole point of std.concurrency is to give strong guarantees, but if you prefer more dangerous but more flexible multithreading, other paradigms should be readily available.I think the language as a whole is designed to have strong guarantees, otherwise synchronized classes wouldn't require out-of-guarantee casts at every indirection. I'm not too pleased with the way synchronized classes are supposed to work, nor am I too pleased with how it impacts the rest of the language. But if this is a problem (and I think it is), it ought to be fixed globally, not by shutting down safeties in every module dealing with multithreading that isn't std.concurrency.I'm **still** totally confused about how shared is supposed to work, because I don't have a fully debugged/implemented implementation or good examples of stuff written in this paradigm to play around with.I think nobody has played much with the paradigm at this point, or we'd have heard some feedback. Well, actually we have your feedback, which seems to indicate that it's better to shut off safeties than to play nice with them. - - - Quoting Andrei, February 4, 2010 in "there is no escape" on the dmd-concurrency mailing list:As we already knew, shared/synchronized limit quite drastically the range of lock-based designs without casts. 
Fortunately, using a class reference member inside a synchronized object will be possible because... well I'll explain in the text. I continue to believe this is the right bet to make, but I expect push back from experienced lock-based programmers.Now is the beginning of that push back, I guess. -- Michel Fortin michel.fortin michelf.com http://michelf.com/
Mar 21 2011
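The pattern argued for above, a racy parallel reduction versus the same loop with only the shared-state update serialized, can be sketched as a complete program (a minimal sketch; the input array is the one from the thread):

```d
import std.parallelism;
import std.stdio;

void main()
{
    immutable int[] values = [0, 2, 3, 6, 1, 4, 6, 3, 3, 3, 6];

    // Low-level race: worker threads all execute "racySum += value"
    // with no synchronization, so concurrent updates can be lost.
    int racySum = 0;
    foreach (value; taskPool.parallel(values))
    {
        racySum += value;
    }

    // Serializing only the tiny shared-state update avoids the race,
    // while the (here trivial) rest of the body stays parallel.
    int safeSum = 0;
    foreach (value; taskPool.parallel(values))
    {
        synchronized safeSum += value;
    }

    writeln(safeSum); // always 37; racySum may be anything up to 37
}
```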
On 3/21/2011 8:37 AM, Michel Fortin wrote:Well, it'll work irrespective of whether shared delegates are used or not. I think you could add a compile-time check that the array element size is a multiple of the word size when the element is passed by ref in the loop and leave the clever trick as a possible future improvements. Would that work?On second thought, no, but for practical, not theoretical reasons: One, you can't introspect whether a foreach loop is using a ref or a value parameter. This is an issue with how opApply works. Two, AFAIK there's no way to get the native word size.This is where you and I disagree. I think that the type system's guarantees should be weak, i.e. only apply to std.concurrency. IMHO the strictness is reasonable when using message passing as your primary method of multithreading and only very little shared state. However, it's completely unreasonable if you want to use a paradigm where shared state is more heavily used. D, being a systems language, needs to allow other styles of multithreading without making them a huge PITA that requires casts everywhere.I'd go a little further. If the guarantees that shared was supposed to provide are strong, i.e. apply no matter what threading module is used, then I utterly despise it. It's one of the worst decisions made in the design of D. Making things pedantically strict, so that the type system gets in the way more than it helps, encourages the user to reflexively circumvent the type system without thinking hard about doing this, thus defeating its purpose. (The alternative of always complying with what the type system "expects" you to do is too inflexible to even be worth considering.) Type systems should err on the side of accepting a superset of what's correct and treating code as innocent until proven guilty, not the other way around. I still believe this even if some of the bugs it could be letting pass through might be very difficult to debug. 
See the discussion we had a few weeks ago about implicit integer casting and porting code to 64.I agree with you that this is a serious problem. I think part of why it hasn't been talked much yet is that nobody is currently using D2 seriously for multithreaded stuff at this time (apart from you I guess), so we're missing experience with it. Andrei seems to think it's fine to required casts as soon as you need to protect something beyond an indirection inside synchronized classes, with the mitigation measure that you can make classes share their mutex (not implemented yet I think) so if the indirection leads to a class it is less of a problem. Personally, I don't.My excuse for std.parallelism is that it's pedal-to-metal parallelism, so it's more acceptable for it to be dangerous than general case concurrency. IMHO when you use the non- safe parts of std.parallelism (i.e. most of the library), that's equivalent to casting away shared in a whole bunch of places. Typing "import std.parallelism;" in a non- safe module is an explicit enough step here.I still think this "pedal-to-metal" qualification needs to be justified. Not having shared delegates in the language seems like an appropriate justification to me. Wanting to bypass casts you normally have to do around synchronized as the sole reason seems like a bad justification to me. It's not that I like how synchronized works, it's just that I think it should work the same everywhere.And then you'd only need one cast to break it if you wanted to, not casts everywhere. Just cast an unshared delegate to shared when passing it to core.thread.The guarantee is still preserved that, if you only use std.concurrency (D's flagship "safe" concurrency module) for multithreading and don't cast away shared, there can be no low level data races. IMHO this is still a substantial accomplishment in that there exists a way to do safe, statically checkable concurrency in D, even if it's not the **only** way concurrency can be done. 
BTW, core.thread can also be used to get around D's type system, not just std.parallelism. If you want to check that only safe concurrency is used, importing std.parallelism and core.thread can be grepped just as easily as casting away shared.Unless I'm mistaken, the only thing that bypasses race-safety in core.thread is the Thread constructor that takes a delegate. Which means it could easily be made race-safe by making that delegate parameter shared (once shared delegates are implemented).Well, there's an easy way around that, too. Just declare the whole method body synchronized, but don't declare the method synchronized in the signature.If, on the other hand, the guarantees of shared are supposed to be weak in that they only apply to programs where only std.concurrency is used for multithreading, then I think strictness is the right thing to do. The whole point of std.concurrency is to give strong guarantees, but if you prefer more dangerous but more flexible multithreading, other paradigms should be readily available.I think the language as a whole is designed to have strong guaranties, otherwise synchronized classes wouldn't require out-of-guaranty casts at every indirection.I'm not too pleased with the way synchronized classes are supposed to work, nor am I too pleased with how it impacts the rest of the language. But if this is a problem (and I think it is), it ought to be fixed globally, not by shutting down safeties in every module dealing with multithreading that isn't std.concurrency.Again, I completely disagree. IMHO it's not fixable globally such that both of the following are achieved: 1. The strong guarantees when using only std.concurrency are preserved. 2. More shared state-intensive multithreading can be done without the type system getting in the way more than it helps.I agree that it's better to shut off the safeties **unless** you're doing very coarse-grained multithreading with very little shared state like std.concurrency had in mind. 
If this is what you want, then the safeties are great. Unfortunately, I'm going to have to take a hard line on this one. The issue of integrating std.parallelism into the race safety system has been discussed a bunch in the past, and it was basically agreed that std.parallelism is a "here be dragons" module that cannot reasonably be made to conform to such a model. Given the choice (and I hope I'm not forced to make this choice) I'd rather std.parallelism be a third-party module that's actually usable than a Phobos module that is such a PITA to use that no one does in practice.I'm **still** totally confused about how shared is supposed to work, because I don't have a fully debugged/implemented implementation or good examples of stuff written in this paradigm to play around with.I think nobody has played much with the paradigm at this point, or we'd have heard some feedback. Well, actually we have your feedback, which seems to indicate that it's better to shut off safeties than to play nice with them.
Mar 21 2011
On 2011-03-21 09:50:09 -0400, dsimcha <dsimcha yahoo.com> said:On 3/21/2011 8:37 AM, Michel Fortin wrote:Indeed a problem. Either we fix the compiler to support that, or we change the syntax to something like this: taskPool.apply(range, (ref int value) { ... }); Or we leave things as they are.Well, it'll work irrespective of whether shared delegates are used or not. I think you could add a compile-time check that the array element size is a multiple of the word size when the element is passed by ref in the loop and leave the clever trick as a possible future improvement. Would that work?On second thought, no, but for practical, not theoretical reasons: One, you can't introspect whether a foreach loop is using a ref or a value parameter. This is an issue with how opApply works.Two, AFAIK there's no way to get the native word size.Right. That's a problem too... you could probably alleviate this by doing a runtime check with some fancy instructions to get the native word size, but I'd expect that to be rather convoluted. I'd like to check if I understand that well. For instance this code:

int[100] values;
foreach (i, ref value; parallel(values))
    value = i;

would normally run fine on a 32-bit processor, but it'd create a low-level race on a 64-bit processor (even a 64-bit processor running a 32-bit program in 32-bit compatibility mode). And even that is a generalization, some 32-bit processors out there *might* have 64-bit native words. So the code above isn't portable. Is that right? Which makes me think... we need to document those pitfalls somewhere. 
The issue of integrating std.parallelism into the race safety system had been discussed a bunch in the past and it was basically agreed that std.parallelism is a "here be dragons" module that cannot reasonably be made to conform to such a model.About the "cannot reasonably be made to conform to such a model" part: that is certainly true today, but might turn out not to be true as the model evolves. It certainly is true as long as we don't have shared delegates. Beyond that it becomes more fuzzy. The final goal of the module shouldn't be to bypass safeties but to provide good parallelism primitives. By all means, if safeties can be enabled reasonably they should be (but as of today, they can't). And I totally agree with you that it's quite silly to require casts everywhere to use synchronized. I'll be the first to admit that it's hard to see synchronized classes being very practical as they're implemented today. There's room for improvements there too.Given the choice (and I hope I'm not forced to make this choice) I'd rather std.parallelism be a third-party module that's actually usable than a Phobos module that is such a PITA to use that noone does in practice.Which is understandable. -- Michel Fortin michel.fortin michelf.com http://michelf.com/
Mar 21 2011
== Quote from Michel Fortin (michel.fortin michelf.com)'s articleThis problem is **much** less severe than you are suggesting. x86 can address single bytes, so it's not a problem even if you're iterating over bytes on a 64-bit machine. CPUs like Alpha (which no D compiler even exists for) can't natively address individual bytes. Therefore, writing to a byte would be implemented much like writing to a bit is on x86: You'd read the full word in, change one byte, write the full word back. I'm not sure exactly how it would be implemented at the compiler level, or whether you'd even be allowed to have a reference to a byte in such an implementation. This is why I consider this more a theoretical problem than a serious practical issue.On second thought, no, but for practical, not theoretical reasons: One, you can't introspect whether a foreach loop is using a ref or a value parameter. This is an issue with how opApply works.Indeed a problem. Either we fix the compiler to support that, or we change the syntax to something like this: taskPool.apply(range, (ref int value) { ... }); Or we leave things as they are.Two, AFAIK there's no way to get the native word size.Right. That's a problem too... you could probably alleviate this by doing a runtime check with some fancy instructions to get the native word size, but I'd expect that to be rather convoluted. I'd like to check if I understand that well. For instance this code: int[100] values; foreach (i, ref value; parallel(values)) value = i; would normally run fine on a 32-bit processor, but it'd create low-level a race on a 64-bit processor (even a 64-bit processor running a 32-bit program in 32-bit compatibility mode). And even that is a generalization, some 32-bit processors out there *might* have 64-bit native words. So the code above isn't portable. Is that right? Which makes me think... we need to document those pitfalls somewhere. 
Perhaps std.parallelism's documentation should link to a related page about what you can and what you can't do safely. People who read that "all the safeties are off" in std.parallelism aren't going to understand what you're talking about unless you explain the pitfalls with actual examples (like the one above).I'll agree that, if the model evolves sufficiently, everything I've said deserves to be reconsidered. My comments only apply to the model as it exists currently or will exist in the foreseeable future. It's just that I have **serious** doubts that it will evolve much and I don't want to delay things to have an academic discussion about what-ifs with regard to this.Unfortunately, I'm going to have to take a hard line on this one. The issue of integrating std.parallelism into the race safety system had been discussed a bunch in the past and it was basically agreed that std.parallelism is a "here be dragons" module that cannot reasonably be made to conform to such a model.About the "cannot reasonably be made to conform to such a model" part: that is certainly true today, but might turn out not to be true as the model evolves. It certainly is true as long as we don't have shared delegates. Beyond that it becomes more fuzzy. The final goal of the module shouldn't be to bypass safeties but to provide good parallelism primitives. By all means, if safeties can be enabled reasonably they should be (but as of today, they can't).And I totally agree with you that it's quite silly to require casts everywhere to use synchronized. I'll be the first to admit that it's hard to see synchronized classes being very practical as they're implemented today. There's room for improvements there too.This is the point I'm trying to make: Requiring these casts is perfectly reasonable if you assume that shared state is going to be rare, as it is in the std.concurrency model. 
When the user is looking for a more fine-grained, more shared state-heavy paradigm, I don't think it's possible to make it safe without also making it virtually unusable. This is why we punted and decided to go with message passing and very limited shared data as the flagship multithreading model. On the other hand, in a systems language, we can't virtually prohibit shared state-heavy multithreading by making it virtually unusable. Thus, the no low level race guarantee needs to apply to std.concurrency only.
Mar 21 2011
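For contrast, the std.concurrency model referred to above keeps mutable state unshared and moves data by message passing; a minimal sketch of that style (names and values are illustrative):

```d
import std.concurrency;
import std.stdio;

// The worker owns its own sum; values arrive as messages, so no
// mutable state is shared between threads and the no-low-level-race
// guarantee holds without any casts.
void summer()
{
    int sum = 0;
    foreach (i; 0 .. 4)
        sum += receiveOnly!int();
    ownerTid.send(sum);
}

void main()
{
    auto worker = spawn(&summer);
    foreach (v; [1, 2, 3, 4])
        worker.send(v);
    writeln(receiveOnly!int()); // 10
}
```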
== Quote from dsimcha (dsimcha yahoo.com)'s article== Quote from Michel Fortin (michel.fortin michelf.com)'s articleActually, just remembered that word tearing is also an issue with unaligned memory access. I guess I could just include a warning that says not to do this with unaligned memory.This problem is **much** less severe than you are suggesting. x86 can address single bytes, so it's not a problem even if you're iterating over bytes on a 64-bit machine. CPUs like Alpha (which no D compiler even exists for) can't natively address individual bytes. Therefore, writing to a byte would be implemented much like writing to a bit is on x86: You'd read the full word in, change one byte, write the full word back. I'm not sure exactly how it would be implemented at the compiler level, or whether you'd even be allowed to have a reference to a byte in such an implementation. This is why I consider this more a theoretical problem than a serious practical issue.On second thought, no, but for practical, not theoretical reasons: One, you can't introspect whether a foreach loop is using a ref or a value parameter. This is an issue with how opApply works.Indeed a problem. Either we fix the compiler to support that, or we change the syntax to something like this: taskPool.apply(range, (ref int value) { ... }); Or we leave things as they are.Two, AFAIK there's no way to get the native word size.Right. That's a problem too... you could probably alleviate this by doing a runtime check with some fancy instructions to get the native word size, but I'd expect that to be rather convoluted. I'd like to check if I understand that well. For instance this code: int[100] values; foreach (i, ref value; parallel(values)) value = i; would normally run fine on a 32-bit processor, but it'd create low-level a race on a 64-bit processor (even a 64-bit processor running a 32-bit program in 32-bit compatibility mode). 
And even that is a generalization, some 32-bit processors out there *might* have 64-bit native words. So the code above isn't portable. Is that right? Which makes me think... we need to document those pitfalls somewhere. Perhaps std.parallelism's documentation should link to a related page about what you can and what you can't do safely. People who read that "all the safeties are off" in std.parallelism aren't going to understand what you're talking about unless you explain the pitfalls with actual examples (like the one above).
Mar 21 2011
On 3/21/2011 11:58 AM, dsimcha wrote:== Quote from dsimcha (dsimcha yahoo.com)'s articles/is/may be . I'm trying to read up on word tearing and post questions here and to StackOverflow. Good documentation about it seems ridiculously hard to come by. About the only solid pieces of information I've found are that x86 definitely **can** do byte granularity (which doesn't mean it always does) and that Java makes these guarantees (which is only useful if you're writing in Java). The general gut feeling people here and on StackOverflow seem to have is that it's not an issue, but there doesn't seem to be much backing up of this. Maybe I'm just being paranoid and this is a complete non-issue except on the most obscure/ancient hardware and/or most stupidly implemented compilers, which is why few people even think of it.== Quote from Michel Fortin (michel.fortin michelf.com)'s articleActually, just remembered that word tearing is also an issue with unaligned memory access. I guess I could just include a warning that says not to do this with unaligned memory.This problem is **much** less severe than you are suggesting. x86 can address single bytes, so it's not a problem even if you're iterating over bytes on a 64-bit machine. CPUs like Alpha (which no D compiler even exists for) can't natively address individual bytes. Therefore, writing to a byte would be implemented much like writing to a bit is on x86: You'd read the full word in, change one byte, write the full word back. I'm not sure exactly how it would be implemented at the compiler level, or whether you'd even be allowed to have a reference to a byte in such an implementation. This is why I consider this more a theoretical problem than a serious practical issue.On second thought, no, but for practical, not theoretical reasons: One, you can't introspect whether a foreach loop is using a ref or a value parameter. This is an issue with how opApply works.Indeed a problem. 
Either we fix the compiler to support that, or we change the syntax to something like this: taskPool.apply(range, (ref int value) { ... }); Or we leave things as they are.Two, AFAIK there's no way to get the native word size.Right. That's a problem too... you could probably alleviate this by doing a runtime check with some fancy instructions to get the native word size, but I'd expect that to be rather convoluted. I'd like to check if I understand that well. For instance this code: int[100] values; foreach (i, ref value; parallel(values)) value = i; would normally run fine on a 32-bit processor, but it'd create low-level a race on a 64-bit processor (even a 64-bit processor running a 32-bit program in 32-bit compatibility mode). And even that is a generalization, some 32-bit processors out there *might* have 64-bit native words. So the code above isn't portable. Is that right? Which makes me think... we need to document those pitfalls somewhere. Perhaps std.parallelism's documentation should link to a related page about what you can and what you can't do safely. People who read that "all the safeties are off" in std.parallelism aren't going to understand what you're talking about unless you explain the pitfalls with actual examples (like the one above).
Mar 21 2011
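The compile-time check Michel proposes could be sketched roughly as follows (hypothetical code; it uses `size_t.sizeof` as a stand-in for the native word size, which, as noted above, the language gives no direct way to query, so the check is approximate by construction):

```d
// Hypothetical guard for by-ref parallel iteration: reject element
// types whose size is not a multiple of the word size (approximated
// here by size_t.sizeof), since adjacent sub-word elements may share
// a machine word and tear under concurrent writes.
enum bool safeForRefIteration(T) = (T.sizeof % size_t.sizeof) == 0;

static assert(safeForRefIteration!long);    // true on 32- and 64-bit
// safeForRefIteration!int  -> false on a 64-bit build
// safeForRefIteration!byte -> false everywhere
```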
dsimcha:1. I give up. 2. I wish someone had told me earlier.Please, don't give up. It's been a long time since I first asked for a parallel map in Phobos :-) Bye, bearophile
Mar 19 2011
On Sat, 19 Mar 2011 13:09:23 -0400, Michel Fortin <michel.fortin michelf.com> wrote:On 2011-03-19 12:03:51 -0400, Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> said:Technically, you'd only need a shared delegate or a const delegate to guarantee race safety for parallel foreach.* "Most of this module completely subverts..." Vague characterizations ("most", "completely", "some") don't belong in technical documentation. (For example, there's either subversion going on or there isn't.) Also, std.concurrency and std.parallelism address different needs, so there's little competition between them. Better: "Unless explicitly marked as $(D @trusted) or $(D @safe), artifacts in this module are not provably memory-safe and cannot be used with SafeD. If used as documented, memory safety is guaranteed."Actually, I think this is a bad description of what it subverts. What it subverts isn't the memory-safety that SafeD provides, but the safety against low-level races that even unsafe D protects against unless you cast shared away. For instance:

void main() {
    int sum = 0;
    foreach (int value; taskPool.parallel([0,2,3,6,1,4,6,3,3,3,6])) {
        sum += value;
    }
    writeln(sum);
}

The "+=" would need to be an atomic operation to avoid low-level races. I think that ParallelForeach's opApply should only accept a shared delegate. I define a shared delegate as a delegate that does not reference any non-shared variables of its outer scope. The problem is that DMD currently doesn't know how to determine whether a delegate literal is shared or not, thus a delegate literal is never shared, and if ParallelForeach's opApply required a shared delegate as it should, it would just not work. Fix DMD to create shared delegate literals where appropriate and everything can be guaranteed race-free.
Mar 20 2011
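If the compiler could infer shared delegate literals as Michel suggests, ParallelForeach's opApply could demand one; a rough, hypothetical signature sketch (the module's real internals differ):

```d
import std.range : ElementType;

// Hypothetical: accepting only a delegate with a shared context would
// reject loop bodies that touch unshared outer variables (such as the
// racy "sum" above) at compile time. Today DMD never infers a shared
// context for a delegate literal, so this signature would reject
// every foreach body.
struct ParallelForeach(R)
{
    R range;

    int opApply(scope int delegate(ref ElementType!R) shared dg)
    {
        // ... distribute iterations over the pool's worker threads ...
        return 0;
    }
}
```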
On 3/19/2011 12:03 PM, Andrei Alexandrescu wrote:On 03/19/2011 02:32 AM, dsimcha wrote:I don't see what I could put here that isn't totally redundant with the rest of the documentation. Anything I could think of would basically just involve concatenating all the examples. Furthermore, none of the other Phobos modules have this, so I don't know what one should look like. In general I feel like std.parallelism is being held to a ridiculously high standard that none of the other Phobos modules currently meet.Ok, thanks again for clarifying **how** the docs could be improved. I've implemented the suggestions and generally given the docs a good read-over and cleanup. The new docs are at: http://cis.jhu.edu/~dsimcha/d/phobos/std_parallelism.html* Still no synopsis example that illustrates in a catchy way the most attractive artifacts.* "After creation, Task objects are submitted to a TaskPool for execution." I understand it's possible to use Task straight as a promise/future, so s/are/may be/.No. The only way Task is useful is by submitting it to a pool to be executed. (Though this may change, see below.)Also it is my understanding that TaskPool efficiently adapts the concrete number of CPUs to an arbitrary number of tasks. If that's the case, it's worth mentioning.Isn't this kind of obvious from the examples, etc.?* "If a Task has been submitted to a TaskPool instance, is being stored in a stack frame, and has not yet finished, the destructor for this struct will automatically call yieldWait() so that the task can finish and the stack frame can be destroyed safely." At this point in the doc the reader doesn't understand that at all because TaskPool has not been seen yet. The reader gets worried that she'll be essentially serializing the entire process by mistake. Either move this explanation down or provide an example.This is getting ridiculous. 
There are too many circular dependencies between Task and TaskPool that are impossible to remove here that I'm not even going to try. One or the other has to be introduced first, but neither can be explained without mentioning the other. This is why I explain the relationship briefly in the module level summary, so that the user has at least some idea. I think this is about the best I can do.* Is done() a property?Yes. DDoc sucks.* The example that reads two files at the same time should NOT use taskPool. It's just one task, why would the pool ever be needed? If you also provided an example that reads n files in memory at the same time using a pool, that would illustrate nicely why you need it. If a Task can't be launched without being put in a pool, there should be a possibility to do so. At my work we have a simple function called callInNewThread that does what's needed to launch a function in a new thread.I guess I could add something like this to Task. Under the hood it would (for implementation simplicity, to reuse a bunch of code from TaskPool) fire up a new single-thread pool, submit the task, call TaskPool.finish(), and then return. Since you're already creating a new thread, the extra overhead of creating a new TaskPool for the thread would be negligible and it would massively simplify the implementation. My only concern is that, when combined with scoped versus non-scoped tasks (which are definitely here to stay, see below) this small convenience function would add way more API complexity than it's worth. Besides, task pools don't need to be created explicitly anyhow. That's what the default instantiation is for. I don't see how callInNewThread would really solve much. Secondly, I think you're reading **WAY** too much into what was meant to be a simple example to illustrate usage mechanics. This is another case where I can't think of a small, cute example of where you'd really need the pool. 
There are plenty of larger examples, but the smallest/most self-contained one I can think of is a parallel sort. I decided to use file reading because it was good enough to illustrate the mechanics of usage, even if it didn't illustrate a particularly good use case.* The note below that example gets me thinking: it is an artificial limitation to force users of Task to worry about scope and such. One should be able to create a Future object (Task I think in your terminology), pass it around like a normal value, and ultimately force it. This is the case for all other languages that implement futures. I suspect the "scope" parameter associated with the delegate a couple of definitions below plays a role here, but I think we need to work for providing the smoothest interface possible (possibly prompting improvements in the language along the way).This is what TaskPool.task is for. Maybe this should be moved to the top of the definition of TaskPool and emphasized, and the scoped/stack allocated versions should be moved below TaskPool and de-emphasized? At any rate, though, anyone who's playing with parallelism should be an advanced enough programmer that concepts like scope and destructors are second nature.* I'm not sure how to interpret the docs for ReturnType!(F) run(F, Args...)(F fpOrDelegate, ref Args args); So it's documented but I'm not supposed to care. Why not just remove? Surely there must be higher-level examples that clarify that I can use delegates etc.Yes, and there are such examples. It's just that I want to explicitly state what type is returned in this case rather than returning auto. If I omitted the docs for run(), you'd ask what run() is.* "If you want to escape the Task object from the function in which it was created or prefer to heap allocate and automatically submit to the pool, see TaskPool.task()." 
I'm uncomfortable that I need to remember a subtle distinction of the same primitive name ("task") depending on membership in a TaskPool or not, which is a tenuous relationship to remember. How about "scopedTask()" vs. "task()"? Also, it's worth asking ourselves what's the relative overhead of closure allocation vs. task switching etc. Maybe we get to simplify things a lot at a small cost in efficiency.We absolutely **need** scope delegates for calling stuff where a closure can't be allocated due to objects on the stack frame having scoped destruction. This is not going anywhere. Also, I **much** prefer to have everything that does conceptually the same thing have the same name. I think this is clearer, not less clear.* As I mentioned, in the definition: Task!(run,TypeTuple!(F,Args)) task(F, Args...)(F fun, Args args); I can't tell what "run" is.run() is just the adapter function described right above this code. I cannot for the life of me understand how this could possibly be unclear.* "A goto from inside the parallel foreach loop to a label outside the loop will result in undefined behavior." Would this be a bug in dmd?No, it's because a goto of this form has no reasonable, useful semantics. I should probably mention in the docs that the same applies to labeled break and continue. I have no idea what semantics these should have, and even if I did, given the long odds that even one person would actually need them, I think they'd be more trouble than they're worth to implement. For example, once you break out of a parallel foreach loop to some arbitrary address (and different threads can goto different labels, etc.), well, it's no longer a parallel foreach loop. It's just a bunch of completely unstructured threading doing god-knows-what. Therefore, I slapped undefined behavior on it as a big sign that says, "Just don't do it." 
This also has the advantage that, if anyone ever thinks of any good, clearly useful semantics, these will be implementable without breaking code later.* Again: speed of e.g. parallel min/max vs. serial, pi computation etc. on a usual machine?I **STRONGLY** believe this does not belong in API documentation because it's too machine specific, compiler specific, stack alignment specific, etc. and almost any benchmark worth doing takes up more space than an example should. Furthermore, anyone who wants to know this can easily time it themselves. I have absolutely no intention of including this. While in general I appreciate and have tried to accommodate your suggestions, this is one I'll be standing firm on.* The join example is fine, but the repetitive code suggests that loops should be better: import std.file; void main() { auto pool = new TaskPool(); foreach (filename; ["foo.txt", "bar.txt", "baz.txt"]) { pool.put(task!read(filename)); } // Call join() to guarantee that all tasks are done running, the worker // threads have terminated and that the results of all of the tasks can // be accessed without any synchronization primitives. pool.join(); ubyte[][] data; // void[], meh // Use spinWait() since the results are guaranteed to have been computed // and spinWait() is the cheapest of the wait functions. foreach (task; pool.howDoIEnumerateTasks()) { data ~= task.spinWait(); }You can't enumerate the tasks like this, but there's a good reason for this limitation. The design requires the pool not hold onto any references to the tasks after they're out of the queue, so that they can be safely destroyed as soon as the caller wants to destroy them. If you need to enumerate over them like this, then: 1. You're probably doing something wrong and parallel foreach would be a better choice. 2. If you insist on doing this, you need to explicitly store them in an array or something. As a more general comment, I know the join() example sucks. 
join() is actually a seldom-used primitive, not a frequently used one as you suggest. Most of the time you wait for individual tasks (explicitly or implicitly through the higher level functions), not the whole pool, to finish executing. I feel like it's obligatory that join() and finish() exist, but I never use them myself. They were much more useful in pre-release versions of this lib, when parallel foreach didn't exist and it made sense to have explicit arrays of Tasks. The difficulty I've had in coming up with a decent example for them makes me halfway tempted to
Mar 19 2011
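To make the "explicitly store them in an array" workaround above concrete, here is a hedged sketch of reading n files through the pool. It assumes the API as described in the reviewed docs: TaskPool.task() heap-allocates the Task and submits it automatically, and yieldWait() both blocks for and returns the result (the readFile helper is introduced here only to sidestep std.file.read's default parameter when taking a function pointer).

```d
import std.file : read;
import std.parallelism;

// Hypothetical thin wrapper so a plain function pointer can be passed
// to taskPool.task() despite read()'s optional second parameter.
void[] readFile(string name)
{
    return read(name);
}

void main()
{
    // The pool deliberately keeps no references to submitted tasks,
    // so the caller stores the handles explicitly.
    typeof(taskPool.task(&readFile, ""))[] tasks;

    foreach (filename; ["foo.txt", "bar.txt", "baz.txt"])
        tasks ~= taskPool.task(&readFile, filename);

    void[][] data;
    foreach (t; tasks)
        data ~= t.yieldWait();  // any exception from the read is rethrown here
}
```

Note that no join() is needed: waiting on each task individually is exactly the "seldom-used primitive" point dsimcha makes about join() below.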
On 2011-03-19 13:16:02 -0400, dsimcha <dsimcha yahoo.com> said:I think an improvement over undefined behaviour would be to throw an exception. Also, what happens if one of the tasks throws an exception? -- Michel Fortin michel.fortin michelf.com http://michelf.com/* "A goto from inside the parallel foreach loop to a label outside the loop will result in undefined behavior." Would this be a bug in dmd?No, it's because a goto of this form has no reasonable, useful semantics. I should probably mention in the docs that the same applies to labeled break and continue. I have no idea what semantics these should have, and even if I did, given the long odds that even one person would actually need them, I think they'd be more trouble than they're worth to implement. For example, once you break out of a parallel foreach loop to some arbitrary address (and different threads can goto different labels, etc.), well, it's no longer a parallel foreach loop. It's just a bunch of completely unstructured threading doing god-knows-what. Therefore, I slapped undefined behavior on it as a big sign that says, "Just don't do it." This also has the advantage that, if anyone ever thinks of any good, clearly useful semantics, these will be implementable without breaking code later.
Mar 19 2011
On 3/19/2011 2:35 PM, Michel Fortin wrote:On 2011-03-19 13:16:02 -0400, dsimcha <dsimcha yahoo.com> said:The only problem is that there's no easy, well-documented way to tell from the return value of opApply whether it was a break, a goto, a labeled break/continue, etc. This would be implementable only if I changed the semantics of break to also throw. This might not be a bad thing (IMHO any type of breaking out of a parallel foreach loop is just silly) but others had asked for different semantics for break.I think an improvement over undefined behaviour would be to throw an exception.* "A goto from inside the parallel foreach loop to a label outside the loop will result in undefined behavior." Would this be a bug in dmd?No, it's because a goto of this form has no reasonable, useful semantics. I should probably mention in the docs that the same applies to labeled break and continue. I have no idea what semantics these should have, and even if I did, given the long odds that even one person would actually need them, I think they'd be more trouble than they're worth to implement. For example, once you break out of a parallel foreach loop to some arbitrary address (and different threads can goto different labels, etc.), well, it's no longer a parallel foreach loop. It's just a bunch of completely unstructured threading doing god-knows-what. Therefore, I slapped undefined behavior on it as a big sign that says, "Just don't do it." This also has the advantage that, if anyone ever thinks of any good, clearly useful semantics, these will be implementable without breaking code later.Also, what happens if one of the tasks throws an exception?It gets rethrown when yieldWait()/spinWait()/workWait() is called. In the case of the higher-level primitives, it gets re-thrown to the calling thread at some non-deterministic point in the execution of these functions. I didn't see the need to document this explicitly because it "just works".
Mar 19 2011
On 3/19/2011 3:36 PM, dsimcha wrote:On 3/19/2011 2:35 PM, Michel Fortin wrote: It gets rethrown when yieldWait()/spinWait()/workWait() is called. In the case of the higher-level primitives, it gets re-thrown to the calling thread at some non-deterministic point in the execution of these functions. I didn't see the need to document this explicitly because it "just works". ... Though now that you make me think of it I need to support exception chaining for the case of multiple concurrently thrown exceptions instead of just arbitrarily and non-deterministically rethrowing one. The code that handles this was written before exception chaining existed. Will get on that.
Mar 19 2011
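A sketch of the rethrow behavior described above, using the wait-function names from the version under review. Whether multiple concurrently thrown exceptions arrive chained depends on the chaining fix dsimcha mentions; treat the catch-block comment as an assumption about that fix, not current behavior.

```d
import std.parallelism;

// A worker function that may fail, to demonstrate where the exception
// actually surfaces on the caller's side.
int mightThrow(int x)
{
    if (x < 0)
        throw new Exception("negative input");
    return 2 * x;
}

void main()
{
    auto t = task!mightThrow(-1);
    taskPool.put(t);

    try
    {
        auto r = t.yieldWait();  // the worker's exception is rethrown here,
                                 // not at put() time
    }
    catch (Exception e)
    {
        // With the chaining fix described above, any additional
        // concurrently thrown exceptions would hang off e.next rather
        // than being silently dropped.
    }
}
```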
On 2011-03-19 15:36:24 -0400, dsimcha <dsimcha yahoo.com> said:On 3/19/2011 2:35 PM, Michel Fortin wrote:It's not that silly. Essentially, what you'd express like this with a normal function taking a delegate: taskPool.apply([1,2,3], (int i) { if (i == 1) return; // do some things }); you'd express like this in a parallel foreach: foreach (int i; parallel([1,2,3])) { if (i == 1) break; // do some things } It's not following the semantics of break within a foreach, but it's still useful to be able to return early from a function (from a loop iteration in this case), so I see the use case for making 'break' do what it does. My only gripe is that the semantics are distorted. In fact, just making foreach parallel distorts its semantics. I was confused earlier about a foreach being parallel when it was not, someone could also be confused in the other direction, thinking foreach is the normal foreach when it actually is parallel. This makes code harder to review. Don't consider only my opinion on this, but in my opinion the first form above (taskPool.apply) is preferable because you absolutely can't mistake it with a normal foreach. And I think it's especially important to make it stand out given that the compiler can't check for low-level races.On 2011-03-19 13:16:02 -0400, dsimcha <dsimcha yahoo.com> said:The only problem is that there's no easy, well-documented way to tell from the return value of opApply whether it was a break, a goto, a labeled break/continue, etc. This would be implementable only if I changed the semantics of break to also throw. This might not be a bad thing (IMHO any type of breaking out of a parallel foreach loop is just silly) but others had asked for different semantics for break.I think an improvement over undefined behaviour would be to throw an exception.* "A goto from inside the parallel foreach loop to a label outside the loop will result in undefined behavior." 
Would this be a bug in dmd?No, it's because a goto of this form has no reasonable, useful semantics. I should probably mention in the docs that the same applies to labeled break and continue. I have no idea what semantics these should have, and even if I did, given the long odds that even one person would actually need them, I think they'd be more trouble than they're worth to implement. For example, once you break out of a parallel foreach loop to some arbitrary address (and different threads can goto different labels, etc.), well, it's no longer a parallel foreach loop. It's just a bunch of completely unstructured threading doing god-knows-what. Therefore, I slapped undefined behavior on it as a big sign that says, "Just don't do it." This also has the advantage that, if anyone ever thinks of any good, clearly useful semantics, these will be implementable without breaking code later.That's indeed what I'd expect. But I think it'd still be worth mentioning in a short sentence in yieldWait()/spinWait()/workWait()'s documentation. It's comforting when the documentation confirms your expectations. -- Michel Fortin michel.fortin michelf.com http://michelf.com/Also, what happens if one of the tasks throws an exception?It gets rethrown when yieldWait()/spinWait()/workWait() is called. In the case of the higher-level primitives, it gets re-thrown to the calling thread at some non-deterministic point in the execution of these functions. I didn't see the need to document this explicitly because it "just works".
Mar 19 2011
On 3/19/2011 5:33 PM, Michel Fortin wrote:On 2011-03-19 15:36:24 -0400, dsimcha <dsimcha yahoo.com> said:Using continue is well defined behavior and does exactly what I think you're suggesting. The problem with break is that it's supposed to stop all subsequent iterations of the loop from being executed, not just end the current one early. This only makes sense in a serial context, where "subsequent iterations" is well-defined. Currently break breaks from the current work unit but continues executing all other work units. I put this semantic in because I couldn't think of anything else to make it do and someone (I don't remember who) asked for it. I have never encountered a use case for it. IMHO all early termination that affects subsequent loop iterations as well (break, goto, labeled break and continue, but not regular continue) should just throw because they make absolutely no sense in a parallel context.The only problem is that there's no easy, well-documented way to tell from the return value of opApply whether it was a break, a goto, a labeled break/continue, etc. This would be implementable only if I changed the semantics of break to also throw. This might not be a bad thing (IMHO any type of breaking out of a parallel foreach loop is just silly) but others had asked for different semantics for break.It's not that silly. Essentially, what you'd express like this with a normal function taking a delegate: taskPool.apply([1,2,3], (int i) { if (i == 1) return; // do some things }); you'd express like this in a parallel foreach: foreach (int i; parallel([1,2,3])) { if (i == 1) break; // do some things } It's not following the semantics of break within a foreach, but it's still useful to be able to return early from a function (from a loop iteration in this case), so I see the use case for making 'break' do what it does.
Mar 19 2011
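The distinction dsimcha draws can be seen in a small sketch: continue keeps its ordinary, well-defined meaning inside a parallel foreach, while break (under the semantics described above) ends only the work unit that executes it.

```d
import std.parallelism;

void main()
{
    auto squares = new long[1_000];

    foreach (i, ref elem; taskPool.parallel(squares))
    {
        if (i % 10 == 0)
            continue;            // well-defined: skips just this iteration
        elem = cast(long) i * i; // i is size_t; cast keeps the element type
    }
    // A `break` in the body above would end only the current work unit;
    // all other work units would still run to completion, which is why
    // its usefulness in a parallel loop is debated in this thread.
}
```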
Am 19.03.2011 22:45, schrieb dsimcha:IMHO all early termination that affects subsequent loop iterations as well (break, goto, labeled break and continue, but not regular continue) should just throw because they make absolutely no sense in a parallel context.What about some kind of parallel search? You just want to know if something is there and on first sight you're done, so you want to break out of this iteration and don't want any new parallel iterations to be done. (Not sure what should be done with currently running iterations.. should they be killed or go on until they're done - anyway, the executing thread shouldn't start a new iteration after that)
Mar 19 2011
On 3/19/2011 7:33 PM, Daniel Gibson wrote:Am 19.03.2011 22:45, schrieb dsimcha:It's an interesting suggestion, and I'll give some thought to it, but on first glance it sounds **very** difficult to implement efficiently. To avoid doing a lot of extra work after you've found what you need, you'd need to use very small work units. To avoid excessive overhead you'd need to use fairly large work units. The only case I can think of where this would be do-able is when evaluating the predicate is very expensive.IMHO all early termination that affects subsequent loop iterations as well (break, goto, labeled break and continue, but not regular continue) should just throw because they make absolutely no sense in a parallel context.What about some kind of parallel search? You just want to know if something is there and on first sight you're done, so you want to break out of this iteration and don't want any new parallel iterations to be done. (Not sure what should be done with currently running iterations.. should they be killed or go on until they're done - anyway, the executing thread shouldn't start a new iteration after that)
Mar 19 2011
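One way to approximate Daniel's early-out search with the primitives under review is a shared flag checked at the top of each iteration. This is a hedged sketch, not part of the proposed API: the work-unit-size argument to parallel() and the core.atomic calls are assumptions about the reviewed interface, and it only bounds, rather than eliminates, wasted work.

```d
import core.atomic;
import std.parallelism;

// Hypothetical early-out membership test. Iterations bail out quickly
// once a match is found; work units already dequeued still start, so
// small work units (64 here) limit the wasted work at the cost of extra
// scheduling overhead -- exactly the trade-off dsimcha describes above.
bool parallelContains(const(int)[] haystack, int needle)
{
    shared bool found = false;

    foreach (x; taskPool.parallel(haystack, 64))
    {
        if (atomicLoad(found))
            continue;                 // cheap skip for the rest of this unit
        if (x == needle)
            atomicStore(found, true);
    }
    return atomicLoad(found);
}
```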
On 03/19/2011 12:16 PM, dsimcha wrote:On 3/19/2011 12:03 PM, Andrei Alexandrescu wrote:I'm thinking along the lines of: http://www.digitalmars.com/d/2.0/phobos/std_exception.html A nice synopsis would be the pi computation. Just move that up to the synopsis. It's simple, clean, and easy to relate to. Generally, you'd put here not all details but the stuff you think would make it easiest for people to get into your library.On 03/19/2011 02:32 AM, dsimcha wrote:I don't see what I could put here that isn't totally redundant with the rest of the documentation. Anything I could think of would basically just involve concatenating all the examples. Furthermore, none of the other Phobos modules have this, so I don't know what one should look like.Ok, thanks again for clarifying **how** the docs could be improved. I've implemented the suggestions and generally given the docs a good reading over and clean up. The new docs are at: http://cis.jhu.edu/~dsimcha/d/phobos/std_parallelism.html* Still no synopsis example that illustrates in a catchy way the most attractive artifacts.In general I feel like std.parallelism is being held to a ridiculously high standard that none of the other Phobos modules currently meet.I agree, but that goes without saying. This is not a double standard; it's a simple means to improve quality of Phobos overall. Clearly certain modules that are already in Phobos would not make it if proposed today. And that's a good thing! Comparing our current ambitions to the quality of the average Phobos module would not help us. Generally it seems you have grown already tired of the exchange and it would take only a few more exchanges for you to say, "you know what, either let's get it over with or forget about it." Please consider for a minute how this is the wrong tone and attitude to be having on several levels. This is not a debate and not the place to get defensive. 
Your role in the discussion is not symmetric with the others' and at best you'd use the review as an opportunity to improve your design, not stick heels in the ground to defend its current incarnation (within reason). Your potential customers are attempting to help you by asking questions (some of which no doubt are silly) and by making suggestions (some of which, again, are ill-founded). Nevertheless, we _are_ your potential customers and in a way the customer is always right. Your art is to steer customers from what they think they want to what you know they need - because you're the expert! - and to improve design, nomenclature, and implementation to the extent that would help them. Furthermore, you should expect that the review process will prompt changes. My perception is that you consider the submission more or less final modulo possibly a few minor nits. You shouldn't. I'm convinced you know much more about SMP than most or all others in this group, but in no way does that mean your design has reached perfection and is beyond improvement even from a non-expert. I know you'd have no problem finding the right voice in this discussion if you only frame it in the right light. Again, people are trying to help (however awkwardly) and in no way is that ridiculous.I very much hope this does change. Otherwise the role of Task in the design could be drastically reduced (e.g. nested type inside of TaskPool) without prejudice. At the minimum I want to be able to create a task, launch it, and check its result later without involving a pool. A pool is when I have many tasks that may exceed the number of CPUs etc. Simplicity would be great. // start three reads auto readFoo = task!readText("foo.txt"); auto readBar = task!readText("bar.txt"); auto readBaz = task!readText("baz.txt"); // join'em all auto foo = readFoo.yieldWait(); auto bar = readBar.yieldWait(); auto baz = readBaz.yieldWait(); There's no need at this level for a task pool. 
What would be nice would be to have a join() that joins all tasks spawned by the current thread: // start three reads auto readFoo = task!readText("foo.txt"); auto readBar = task!readText("bar.txt"); auto readBaz = task!readText("baz.txt"); // join'em all join(); // fetch results auto foo = readFoo.spinWait(); auto bar = readBar.spinWait(); auto baz = readBaz.spinWait(); The way I see it is, task pools are an advanced means that coordinate m threads over n CPUs. If I don't care about that (as above) there should be no need for a pool at all. (Of course it's fine if used by the implementation.)* "After creation, Task objects are submitted to a TaskPool for execution." I understand it's possible to use Task straight as a promise/future, so s/are/may be/.No. The only way Task is useful is by submitting it to a pool to be executed. (Though this may change, see below.)It wasn't to me. I had an intuition that that might be the case, but obvious - far from it. The fact that I need a task pool even when I don't care about m:n issues made me doubtful.Also it is my understanding that TaskPool efficiently adapts the concrete number of CPUs to an arbitrary number of tasks. If that's the case, it's worth mentioning.Isn't this kind of obvious from the examples, etc.?Perhaps cross-references could help (see macro XREF). As I mentioned, an example here could go a long way, too.* "If a Task has been submitted to a TaskPool instance, is being stored in a stack frame, and has not yet finished, the destructor for this struct will automatically call yieldWait() so that the task can finish and the stack frame can be destroyed safely." At this point in the doc the reader doesn't understand that at all because TaskPool has not been seen yet. The reader gets worried that she'll be essentially serializing the entire process by mistake. Either move this explanation down or provide an example.This is getting ridiculous. 
There are so many circular dependencies between Task and TaskPool that I'm not even going to try to remove them all. One or the other has to be introduced first, but neither can be explained without mentioning the other. This is why I explain the relationship briefly in the module-level summary, so that the user has at least some idea. I think this is about the best I can do.If we sit down for a second and think outside of the current design, the simplest Task use would be as such: "I start a task and get a sort of a future. I do something else. Then when I need the result of the future I call some method force() against it." Nowhere is the notion of a task pool to be found. To me this looks like an artifact of the implementation that has spilled into the API, which I understand is natural to you. But also it's unnatural to me.* The example that reads two files at the same time should NOT use taskPool. It's just one task, why would the pool ever be needed? If you also provided an example that reads n files in memory at the same time using a pool, that would illustrate nicely why you need it. If a Task can't be launched without being put in a pool, there should be a possibility to do so. At my work we have a simple function called callInNewThread that does what's needed to launch a function in a new thread.I guess I could add something like this to Task. Under the hood it would (for implementation simplicity, to reuse a bunch of code from TaskPool) fire up a new single-thread pool, submit the task, call TaskPool.finish(), and then return. Since you're already creating a new thread, the extra overhead of creating a new TaskPool for the thread would be negligible and it would massively simplify the implementation. My only concern is that, when combined with scoped versus non-scoped tasks (which are definitely here to stay, see below) this small convenience function would add way more API complexity than it's worth. 
Besides, task pools don't need to be created explicitly anyhow. That's what the default instantiation is for. I don't see how callInNewThread would really solve much.Secondly, I think you're reading **WAY** too much into what was meant to be a simple example to illustrate usage mechanics. This is another case where I can't think of a small, cute example of where you'd really need the pool. There are plenty of larger examples, but the smallest/most self-contained one I can think of is a parallel sort. I decided to use file reading because it was good enough to illustrate the mechanics of usage, even if it didn't illustrate a particularly good use case.It's impossible to not have a good small example. Sorting is great. You have the partition primitive already in std.algorithm, then off you go with tasks. Dot product on dense vectors is another good one. There's just plenty of operations that people understand are important to make fast. So to me it looks like this: * there are already some good examples of "I want 1-3 of tasks but don't care about pooling them": file reading and processing etc. * there should be 1-2 good example of partitioning a large chunk of work into many tasks and let taskPool carry them. With such, people would clearly understand the capabilities of the library and what they can achieve with various levels of involvement.I agree. That's not a reason to not find good names for entities. To me the relationship "global task() <=> scoped delegate; TaskPool method task() <=> closure" is inscrutable. I see no way of remembering it except rote memorization.* The note below that example gets me thinking: it is an artificial limitation to force users of Task to worry about scope and such. One should be able to create a Future object (Task I think in your terminology), pass it around like a normal value, and ultimately force it. This is the case for all other languages that implement futures. 
I suspect the "scope" parameter associated with the delegate a couple of definitions below plays a role here, but I think we need to work for providing the smoothest interface possible (possibly prompting improvements in the language along the way).This is what TaskPool.task is for. Maybe this should be moved to the top of the definition of TaskPool and emphasized, and the scoped/stack allocated versions should be moved below TaskPool and de-emphasized? At any rate, though, anyone who's playing with parallelism should be an advanced enough programmer that concepts like scope and destructors are second nature.We absolutely **need** scope delegates for calling stuff where a closure can't be allocated due to objects on the stack frame having scoped destruction. This is not going anywhere. Also, I **much** prefer to have everything that does conceptually the same thing have the same name. I think this is clearer, not less clear.I agree in principle, but this case is just not that.OK.* As I mentioned, in the definition: Task!(run,TypeTuple!(F,Args)) task(F, Args...)(F fun, Args args); I can't tell what "run" is.run() is just the adapter function described right above this code. I cannot for the life of me understand how this could possibly be unclear.Yah, I was actually thinking of disabling goto outside a local delegate everywhere.* "A goto from inside the parallel foreach loop to a label outside the loop will result in undefined behavior." Would this be a bug in dmd?No, it's because a goto of this form has no reasonable, useful semantics. I should probably mention in the docs that the same applies to labeled break and continue. I have no idea what semantics these should have, and even if I did, given the long odds that even one person would actually need them, I think they'd be more trouble than they're worth to implement. 
For example, once you break out of a parallel foreach loop to some arbitrary address (and different threads can goto different labels, etc.), well, it's no longer a parallel foreach loop. It's just a bunch of completely unstructured threading doing god-knows-what. Therefore, I slapped undefined behavior on it as a big sign that says, "Just don't do it." This also has the advantage that, if anyone ever thinks of any good, clearly useful semantics, these will be implementable without breaking code later.If scalability information is present in however a non-committal form, then people would be compelled ("ok, so this shape of the loop would actually scale linearly with CPUs... neat"). Speaking of efficiency, I assume parallel foreach uses opApply with a delegate and the inherent overhead. So I'm thinking that a practical way to take advantage of parallel foreach would be to parallelize at some block level and then do a regular foreach inside the block? foreach (i, ref elem; taskPool.parallel(logs)) { foreach (???) elem = log(i + 1.0); } How can I arrange things such that I compute e.g. 64 logs serially inside each pass?* Again: speed of e.g. parallel min/max vs. serial, pi computation etc. on a usual machine?I **STRONGLY** believe this does not belong in API documentation because it's too machine specific, compiler specific, stack alignment specific, etc. and almost any benchmark worth doing takes up more space than an example should. Furthermore, anyone who wants to know this can easily time it themselves. I have absolutely no intention of including this. While in general I appreciate and have tried to accommodate your suggestions, this is one I'll be standing firm on.I'm confused here. I use join() pretty much all the time. I launch stuff (e.g. run a large matrix-vector multiplication for distinct vectors) and then I join that. Then again and again. The thread of execution has a repeated hourglass shape - it fans out and then becomes single-threaded at join points. 
I assume some of those computations are indeed the charter of parallel foreach, but I'd guess not all. But then what do I know - I'm the customer :o). Andrei* The join example is fine, but the repetitive code suggests that loops should be better: import std.file; void main() { auto pool = new TaskPool(); foreach (filename; ["foo.txt", "bar.txt", "baz.txt"]) { pool.put(task!read(filename)); } // Call join() to guarantee that all tasks are done running, the worker // threads have terminated and that the results of all of the tasks can // be accessed without any synchronization primitives. pool.join(); ubyte[][] data; // void[], meh // Use spinWait() since the results are guaranteed to have been computed // and spinWait() is the cheapest of the wait functions. foreach (task; pool.howDoIEnumerateTasks()) { data ~= task.spinWait(); }You can't enumerate the tasks like this, but there's a good reason for this limitation. The design requires the pool not hold onto any references to the tasks after they're out of the queue, so that they can be safely destroyed as soon as the caller wants to destroy them. If you need to enumerate over them like this, then: 1. You're probably doing something wrong and parallel foreach would be a better choice. 2. If you insist on doing this, you need to explicitly store them in an array or something. As a more general comment, I know the join() example sucks. join() is actually a seldom-used primitive, not a frequently used one as you suggest. Most of the time you wait for individual tasks (explicitly or implicitly through the higher level functions), not the whole pool, to finish executing. I feel like it's obligatory that join() and finish() exist, but I never use them myself. They were much more useful in pre-release versions of this lib, when parallel foreach didn't exist and it made sense to have explicit arrays of Tasks. The difficulty I've had in coming up with a decent example for them makes me halfway tempted to
Mar 19 2011
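Regarding the question in the message above about computing e.g. 64 logs serially inside each pass: the reviewed parallel() takes an optional work unit size that does exactly this, so no manual inner loop is needed. A sketch, assuming that parameter behaves as the reviewed docs describe:

```d
import std.math : log;
import std.parallelism;

void main()
{
    auto logs = new double[10_000];

    // The second argument is the work unit size: each task covers 64
    // consecutive indices and runs them serially on one worker, so the
    // per-delegate overhead is paid once per 64 elements rather than
    // once per element.
    foreach (i, ref elem; taskPool.parallel(logs, 64))
        elem = log(i + 1.0);
}
```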
On 3/19/2011 4:35 PM, Andrei Alexandrescu wrote:On 03/19/2011 12:16 PM, dsimcha wrote:Fair enough. Now that I think of it most of my frustration is that these details are only getting looked at now, when I have a week (and an otherwise very busy week) to fix all this stuff, when this module has been officially in review for the past two weeks and unofficially for several months. I would be much more open to this process if the issues raised could be fixed at my leisure rather than on a hard and tight deadline. This is exacerbated by the fact that I have another important, unrelated deadline, also next Friday. At the same time, though, I'm afraid that if we didn't fix a vote date and put some urgency into things, the pace of the reviews would be glacial at best, like it was for the first two weeks of official review and the months of unofficial review. How about we make next Friday a soft deadline/start of voting? It can be extended as necessary by mutual agreement of me and Lars (the review manager), and likely will be if the review process is still yielding good suggestions and/or I haven't had time to implement/clean up some key things. Having a deadline breathing down your neck like this is really not conducive to being open to suggestions and thoughtful consideration, especially for issues that seem like fairly minor details. Also increasing the deadline pressure issue, Michel Fortin just caused me to rethink the issue of exception handling in parallel foreach. I had more-or-less working code for this, but I realized it's severely broken in subtle ways that I've (knock on wood) never actually run into in real world code. It's gonna take some time to fix. These kinds of issues with error handling code can very easily slip under the radar in a library with only a few users, and unfortunately, one has. I definitely know how to fix it, but I have to implement the fix and it's somewhat non-trivial, i.e. 
I'm debugging it now and it's looking like a marathon debugging session.

On 3/19/2011 12:03 PM, Andrei Alexandrescu wrote:

I'm thinking along the lines of: http://www.digitalmars.com/d/2.0/phobos/std_exception.html A nice synopsis would be the pi computation. Just move that up to the synopsis. It's simple, clean, and easy to relate to. Generally, you'd put here not all details but the stuff you think would make it easiest for people to get into your library.

On 03/19/2011 02:32 AM, dsimcha wrote:

I don't see what I could put here that isn't totally redundant with the rest of the documentation. Anything I could think of would basically just involve concatenating all the examples. Furthermore, none of the other Phobos modules have this, so I don't know what one should look like.

Ok, thanks again for clarifying **how** the docs could be improved. I've implemented the suggestions and generally given the docs a good read-over and clean-up. The new docs are at: http://cis.jhu.edu/~dsimcha/d/phobos/std_parallelism.html

* Still no synopsis example that illustrates in a catchy way the most attractive artifacts.

In general I feel like std.parallelism is being held to a ridiculously high standard that none of the other Phobos modules currently meet.

I agree, but that goes without saying. This is not a double standard; it's a simple means to improve the quality of Phobos overall. Clearly certain modules that are already in Phobos would not make it if proposed today. And that's a good thing! Comparing our current ambitions to the quality of the average Phobos module would not help us. Generally it seems you have grown already tired of the exchange and it would take only a few more exchanges for you to say, "you know what, either let's get it over with or forget about it." Please consider for a minute how this is the wrong tone and attitude to be having on several levels. This is not a debate and not the place to get defensive.
Your role in the discussion is not symmetric with the others', and at best you'd use the review as an opportunity to improve your design, not dig your heels in to defend its current incarnation (within reason). Your potential customers are attempting to help you by asking questions (some of which no doubt are silly) and by making suggestions (some of which, again, are ill-founded). Nevertheless, we _are_ your potential customers, and in a way the customer is always right. Your art is to steer customers from what they think they want to what you know they need - because you're the expert! - and to improve design, nomenclature, and implementation to the extent that would help them.

Furthermore, you should expect that the review process will prompt changes. My perception is that you consider the submission more or less final modulo possibly a few minor nits. You shouldn't. I'm convinced you know much more about SMP than most or all others in this group, but in no way does that mean your design has reached perfection and is beyond improvement even from a non-expert.

I know you'd have no problem finding the right voice in this discussion if you only frame it in the right light. Again, people are trying to help (however awkwardly) and in no way is that ridiculous.
Mar 19 2011
On 03/19/2011 04:25 PM, dsimcha wrote:On 3/19/2011 4:35 PM, Andrei Alexandrescu wrote:Exactly. I understand. Well here's a suggestion. How about we "git stash" this review? I recall another submission was vying for the queue so we can proceed with that while you have time to operate the changes you want. If not, we can simply wait 2-3 weeks and then have you resubmit for a shorter process (1-2 weeks) that would gather more feedback (hopefully minor that time) and votes. Lars?I know you'd have no problem finding the right voice in this discussion if you only frame it in the right light. Again, people are trying to help (however awkwardly) and in no way is that ridiculous.Fair enough. Now that I think of it most of my frustration is that these details are only getting looked at now, when I have a week (and an otherwise very busy week) to fix all this stuff, when this module has been officially in review for the past two weeks and unofficially for several months. I would be much more open to this process if the issues raised could be fixed at my leisure rather than on a hard and tight deadline. This is exacerbated by the fact that I have another important, unrelated deadline, also next Friday. At the same time, though, I'm afraid that if we didn't fix a vote date and put some urgency into things, the pace of the reviews would be glacial at best, like it was for the first two weeks of official review and the months of unofficial review.Also increasing the deadline pressure issue, Michael Fortin just caused me to rethink the issue of exception handling in parallel foreach. I had more-or-less working code for this, but I realized it's severely broken in subtle ways that I've (knock on wood) never actually run into in real world code. It's gonna take some time to fix. These kinds of issues with error handling code can very easily slip under the radar in a library with only a few users, and unfortunately, one has. 
I definitely know how to fix it, but I have to implement the fix and it's somewhat non-trivial, i.e. I'm debugging it now and it's looking like a marathon debugging session.Fixing these sooner rather than later would be great, so all the better for suspending the review. Andrei
Mar 19 2011
On 3/19/2011 6:58 PM, Andrei Alexandrescu wrote:On 03/19/2011 04:25 PM, dsimcha wrote:Sounds like a plan.On 3/19/2011 4:35 PM, Andrei Alexandrescu wrote:Exactly. I understand. Well here's a suggestion. How about we "git stash" this review? I recall another submission was vying for the queue so we can proceed with that while you have time to operate the changes you want. If not, we can simply wait 2-3 weeks and then have you resubmit for a shorter process (1-2 weeks) that would gather more feedback (hopefully minor that time) and votes.I know you'd have no problem finding the right voice in this discussion if you only frame it in the right light. Again, people are trying to help (however awkwardly) and in no way is that ridiculous.Fair enough. Now that I think of it most of my frustration is that these details are only getting looked at now, when I have a week (and an otherwise very busy week) to fix all this stuff, when this module has been officially in review for the past two weeks and unofficially for several months. I would be much more open to this process if the issues raised could be fixed at my leisure rather than on a hard and tight deadline. This is exacerbated by the fact that I have another important, unrelated deadline, also next Friday. At the same time, though, I'm afraid that if we didn't fix a vote date and put some urgency into things, the pace of the reviews would be glacial at best, like it was for the first two weeks of official review and the months of unofficial review.
Mar 19 2011
On Sat, 19 Mar 2011 17:58:46 -0500, Andrei Alexandrescu wrote:On 03/19/2011 04:25 PM, dsimcha wrote:Sounds good. I'll make the announcement. -LarsOn 3/19/2011 4:35 PM, Andrei Alexandrescu wrote:Exactly. I understand. Well here's a suggestion. How about we "git stash" this review? I recall another submission was vying for the queue so we can proceed with that while you have time to operate the changes you want. If not, we can simply wait 2-3 weeks and then have you resubmit for a shorter process (1-2 weeks) that would gather more feedback (hopefully minor that time) and votes. Lars?I know you'd have no problem finding the right voice in this discussion if you only frame it in the right light. Again, people are trying to help (however awkwardly) and in no way is that ridiculous.Fair enough. Now that I think of it most of my frustration is that these details are only getting looked at now, when I have a week (and an otherwise very busy week) to fix all this stuff, when this module has been officially in review for the past two weeks and unofficially for several months. I would be much more open to this process if the issues raised could be fixed at my leisure rather than on a hard and tight deadline. This is exacerbated by the fact that I have another important, unrelated deadline, also next Friday. At the same time, though, I'm afraid that if we didn't fix a vote date and put some urgency into things, the pace of the reviews would be glacial at best, like it was for the first two weeks of official review and the months of unofficial review.
Mar 20 2011
On 3/19/2011 4:35 PM, Andrei Alexandrescu wrote:

Furthermore, you should expect that the review process will prompt changes. My perception is that you consider the submission more or less final modulo possibly a few minor nits. You shouldn't. I'm convinced you know much more about SMP than most or all others in this group, but in no way does that mean your design has reached perfection and is beyond improvement even from a non-expert.

In addition to the deadline issues already mentioned and resolved, I did misunderstand the review process somewhat. I didn't participate in the reviews for std.datetime (because I know nothing about what makes a good date/time lib) or for std.unittest (because I was fairly ambivalent about it), so I didn't learn anything from them. I was under the impression that the module is **expected** to be very close to its final form and that, if a lot of issues are found, then that basically means the proposal is going to be rejected.
Mar 19 2011
On Saturday 19 March 2011 17:31:18 dsimcha wrote:On 3/19/2011 4:35 PM, Andrei Alexandrescu wrote:Both std.datetime and std.unittests underwent a fair number of changes over the course the review process. A lot of the stuff stayed the same, but a lot of it changed too. On the whole, though, the end results were much better for it. - Jonathan M DavisFurthermore, you should expect that the review process will prompt changes. My perception is that you consider the submission more or less final modulo possibly a few minor nits. You shouldn't. I'm convinced you know much more about SMP than most or all others in this group, but in no way that means your design has reached perfection and is beyond improvement even from a non-expert.In addition the the deadline issues already mentioned and resolved, I did misunderstand the review process somewhat. I didn't participate in the reviews for std.datetime (because I know nothing about what makes a good date/time lib) or for std.unittest (because I was fairly ambivalent about it), so I didn't learn anything from them. I was under the impression that the module is **expected** to be very close to its final form and that, if a lot of issues are found, then that basically means the proposal is going to be rejected.
Mar 19 2011
On 3/19/2011 8:48 PM, Jonathan M Davis wrote:On Saturday 19 March 2011 17:31:18 dsimcha wrote:Please check your newsreader settings. You've been double-posting a lot lately.On 3/19/2011 4:35 PM, Andrei Alexandrescu wrote:Both std.datetime and std.unittests underwent a fair number of changes over the course the review process. A lot of the stuff stayed the same, but a lot of it changed too. On the whole, though, the end results were much better for it. - Jonathan M DavisFurthermore, you should expect that the review process will prompt changes. My perception is that you consider the submission more or less final modulo possibly a few minor nits. You shouldn't. I'm convinced you know much more about SMP than most or all others in this group, but in no way that means your design has reached perfection and is beyond improvement even from a non-expert.In addition the the deadline issues already mentioned and resolved, I did misunderstand the review process somewhat. I didn't participate in the reviews for std.datetime (because I know nothing about what makes a good date/time lib) or for std.unittest (because I was fairly ambivalent about it), so I didn't learn anything from them. I was under the impression that the module is **expected** to be very close to its final form and that, if a lot of issues are found, then that basically means the proposal is going to be rejected.
Mar 19 2011
On 3/19/2011 4:35 PM, Andrei Alexandrescu wrote:On 03/19/2011 12:16 PM, dsimcha wrote:Good example, will do.On 3/19/2011 12:03 PM, Andrei Alexandrescu wrote:I'm thinking along the lines of: http://www.digitalmars.com/d/2.0/phobos/std_exception.html A nice synopsis would be the pi computation. Just move that up to the synopsis. It's simple, clean, and easy to relate to. Generally, you'd put here not all details but the stuff you think would make it easiest for people to get into your library.On 03/19/2011 02:32 AM, dsimcha wrote:I don't see what I could put here that isn't totally redundant with the rest of the documentation. Anything I could think of would basically just involve concatentating all the examples. Furthermore, none of the other Phobos modules have this, so I don't know what one should look like.Ok, thanks again for clarifying **how** the docs could be improved. I've implemented the suggestions and generally given the docs a good reading over and clean up. The new docs are at: http://cis.jhu.edu/~dsimcha/d/phobos/std_parallelism.html* Still no synopsis example that illustrates in a catchy way the most attractive artifacts.This is definitely feasible in principle. I'd like to implement it, but there's a few annoying, hairy details standing in the way. For reasons I detailed previously, we need both scoped and non-scoped tasks. We also have alias vs. callable (i.e. function pointer or delegate) tasks. Now we're adding pool vs. new-thread tasks. This is turning into a combinatorial explosion and needs to be simplified somehow. I propose the following: 1. I've reconsidered and actually like the idea of task() vs. scopedTask(). task() returns a pointer on the heap. scopedTask() returns a struct on the stack. Neither would be a member function of TaskPool. 2. Non-scoped Task pointers would need to be explicitly submitted to the task pool via the put() method. This means getting rid of TaskPool.task(). 3. 
The Task struct would grow a function runInNewThread() or something similar. (If you think this would be a common case, even just execute() might cut it.) The work flow would now be that you call task() to get a heap-allocated Task*, or scopedTask to get a stack-allocated Task. You then call either TaskPool.put() to execute it on a pool or Task.runInNewThread() to run it in a new thread. The creation of the Task is completely orthogonal to how it's run.I very much hope this does change. Otherwise the role of Task in the design could be drastically reduced (e.g. nested type inside of TaskPool) without prejudice. At the minimum I want to be able to create a task, launch it, and check its result later without involving a pool. A pool is when I have many tasks that may exceed the number of CPUs etc. Simplicity would be great. // start three reads auto readFoo = task!readText("foo.txt"); auto readBar = task!readText("bar.txt"); auto readBaz = task!readText("baz.txt"); // join'em all auto foo = readFoo.yieldWait(); auto bar = readBar.yieldWait(); auto baz = readBaz.yieldWait();* "After creation, Task objects are submitted to a TaskPool for execution." I understand it's possible to use Task straight as a promise/future, so s/are/may be/.No. The only way Task is useful is by submitting it to a pool to be executed. (Though this may change, see below.)There's no need at this level for a task pool. What would be nice would be to have a join() that joins all tasks spawned by the current thread: // start three reads auto readFoo = task!readText("foo.txt"); auto readBar = task!readText("bar.txt"); auto readBaz = task!readText("baz.txt"); // join'em all join(); // fetch results auto foo = readFoo.spinWait(); auto bar = readBar.spinWait(); auto baz = readBaz.spinWait();I don't understand how this would be a substantial improvement over the first example, where you just call yieldWait() on all three. 
Furthermore, implementing join() as shown in this example would require some kind of central registry of all tasks/worker threads/task pools/something similar, which would be a huge PITA to implement efficiently.I forgot about std.algorithm.partition. This makes a parallel quick sort so trivial to implement (ignoring the issue of selecting a good pivot, which I think can be safely ignored in example code) that it might actually make a good example.Secondly, I think you're reading **WAY** too much into what was meant to be a simple example to illustrate usage mechanics. This is another case where I can't think of a small, cute example of where you'd really need the pool. There are plenty of larger examples, but the smallest/most self-contained one I can think of is a parallel sort. I decided to use file reading because it was good enough to illustrate the mechanics of usage, even if it didn't illustrate a particularly good use case.It's impossible to not have a good small example. Sorting is great. You have the partition primitive already in std.algorithm, then off you go with tasks. Dot product on dense vectors is another good one. There's just plenty of operations that people understand are important to make fast.See the discussion I'm having with Michael Fortin. My latest idea is that break, labeled break/continue, return, and goto should all throw exceptions when found inside a parallel foreach loop. They affect the flow of subsequent iterations, and "subsequent iterations" only makes sense when the loop is being executed in serial.Yah, I was actually thinking of disabling goto outside a local delegate everywhere.* "A goto from inside the parallel foreach loop to a label outside the loop will result in undefined behavior." Would this be a bug in dmd?No, it's because a goto of this form has no reasonable, useful semantics. I should probably mention in the docs that the same applies to labeled break and continue. 
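Pulling the pieces of the proposal together, the intended workflow - creating a Task independently of where it runs - would look roughly like this. This is only a sketch against the names used in this thread: runInNewThread() and scopedTask() are proposals above, not implemented API, and yieldWait() is the wait primitive of the version under review, so any of these names may differ in the final module.

```d
import std.file : readText;
import std.parallelism;

void main() {
    // Task creation is orthogonal to execution: task!fn returns a
    // heap-allocated Task*, while the proposed scopedTask would return
    // a stack-allocated struct with the same interface.
    auto readFoo = task!readText("foo.txt");
    auto readBar = task!readText("bar.txt");

    // Then pick an execution policy per task: submit one to the pool...
    taskPool.put(readFoo);
    // ...and give the other its own thread (proposed name, see above).
    readBar.runInNewThread();

    // Wait on each result individually; no pool-wide join() needed.
    auto foo = readFoo.yieldWait();
    auto bar = readBar.yieldWait();
}
```

This is the promise/future usage pattern Andrei asks for: the pool only enters the picture when a task is explicitly put() into it.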
I have no idea what semantics these should have, and even if I did, given the long odds that even one person would actually need them, I think they'd be more trouble than they're worth to implement. For example, once you break out of a parallel foreach loop to some arbitrary address (and different threads can goto different labels, etc.), well, it's no longer a parallel foreach loop. It's just a bunch of completely unstructured threading doing god-knows-what. Therefore, I slapped undefined behavior on it as a big sign that says, "Just don't do it." This also has the advantage that, if anyone ever thinks of any good, clearly useful semantics, these will be implementable without breaking code later.Ok, I thought you were asking for something much more rigorous than this. I therefore didn't want to provide it because I figured that, no matter what I did, someone would be able to say that the benchmark is flawed some how, yada, yada, yada. Given how inexact a science benchmarking is, I'm still hesitant to put such results in API docs, but I can see where you're coming from here.If scalability information is present in however a non-committal form, then people would be compelled ("ok, so this shape of the loop would actually scale linearly with CPUs... neat").* Again: speed of e.g. parallel min/max vs. serial, pi computation etc. on a usual machine?I **STRONGLY** believe this does not belong in API documentation because it's too machine specific, compiler specific, stack alignment specific, etc. and almost any benchmark worth doing takes up more space than an example should. Furthermore, anyone who wants to know this can easily time it themselves. I have absolutely no intention of including this. While in general I appreciate and have tried to accommodate your suggestions, this is one I'll be standing firm on.Speaking of efficiency, I assume parallel foreach uses opApply with a delegate and the inherent overhead. 
So I'm thinking that a practical way to take advantage of parallel foreach would be to parallelize at some block level and then do a regular foreach inside the block?

foreach (i, ref elem; taskPool.parallel(logs)) {
    foreach (???)
        elem = log(i + 1.0);
}

How can I arrange things such that I compute e.g. 64 logs serially inside each pass?

Three things here:

1. For this kind of nano-parallelism, map() might be better suited. To fill an existing array, just use map's explicit buffer feature.

taskPool.map!log(iota(1, logs.length + 1), logs);

The option of using map() for nano-parallelism is part of my rationale for keeping the pretty but mildly inefficient delegate call in parallel foreach.

2. You're severely overestimating the overhead of the delegate call. log() is a pretty cheap function and even so speedups are decent with parallel foreach compared to a regular foreach loop.

import std.stdio, std.parallelism, std.datetime, std.math;

void main() {
    auto logs = new float[10_000_000];
    taskPool();  // Initialize TaskPool before timing.

    auto sw = StopWatch(autoStart);
    foreach (i, ref elem; logs) {
        elem = log(i + 1);
    }
    writeln("Serial: ", sw.peek.msecs);

    sw.reset();
    foreach (i, ref elem; parallel(logs)) {
        elem = log(i + 1);
    }
    writeln("Parallel Foreach: ", sw.peek.msecs);
}

Results:

Serial: 619
Parallel Foreach: 388

I'd include parallel map, too, but for some reason (probably stack alignment or something) including it changes the results for parallel foreach.

3. If you really want to do as you suggest, just make a chunks range or something (maybe this should be in std.range) that iterates over all non-overlapping size-N slices of a sliceable range. Use this for the parallel loop, then loop over the individual elements of the slice inside.

I'm confused here. I use join() pretty much all the time. I launch stuff (e.g. run a large matrix-vector multiplication for distinct vectors) and then I join that. Then again and again.
The thread of execution has a repeated hourglass shape - it fans out and then becomes single-threaded at join points. I assume some of those computations are indeed the charter of parallel foreach, but I'd guess not all. But then what do I know - I'm the customer :o).Yes, parallel foreach is idiomatic here, or maybe parallel map. In the pre-release versions of std.parallelism (my experimentation with parallelism libraries in D goes back to mid-2008, over a year before I released the first version as Parallelfuture), only the task primitive existed. I discovered that, in practice, creating Task objects in a loop is almost always a PITA. It can also be inefficient in that, in a naive implementation all of these exist in memory simultaneously when this might be completely unnecessary. I decided to create higher level primitives to handle all the use cases I could think of where you might otherwise want to create Task objects in a loop. If you're explicitly creating Task objects in a loop in std.parallelism, I can just about guarantee that there's an easier and at least equally efficient way to accomplish what you want to accomplish. If there are any important use cases I missed, I'd be much more interested in creating a few more high-level primitives rather than making it easier to work with Task objects created in a loop.
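The block-level approach discussed above (parallelize over chunks, iterate serially within each chunk) can be sketched without a dedicated chunks range by parallelizing over the block start indices instead. A sketch only, assuming the parallel() shorthand shown earlier in the thread; a proper chunks primitive in std.range would make it cleaner:

```d
import std.algorithm : min;
import std.math : log;
import std.parallelism : parallel;
import std.range : iota;

void main() {
    auto logs = new float[10_000_000];
    enum blockSize = 64;

    // Each parallel iteration handles one block of 64 elements, so the
    // per-iteration delegate call is paid once per block, not per element.
    foreach (start; parallel(iota(0, logs.length, blockSize))) {
        foreach (i; start .. min(start + blockSize, logs.length))
            logs[i] = log(i + 1.0);
    }
}
```

Each block touches a disjoint slice of logs, so no synchronization is needed between iterations.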
Mar 19 2011
On 3/19/2011 10:16 PM, dsimcha wrote:I tried putting a few benchmarks in for the performance of parallel foreach, map and reduce. Basically, I just put the timings in for the examples on my dual core machine, and for an equivalent serial version that is not shown in the interest of brevity but is easily inferred. I didn't do this for the pipelining or future/promise primitives because these aren't as capable of automatically taking full advantage of as many cores as you can throw at them as map, reduce and foreach. This means that coming up with a benchmark that shows where a good speedup can be obtained is a much more difficult art.Ok, I thought you were asking for something much more rigorous than this. I therefore didn't want to provide it because I figured that, no matter what I did, someone would be able to say that the benchmark is flawed some how, yada, yada, yada. Given how inexact a science benchmarking is, I'm still hesitant to put such results in API docs, but I can see where you're coming from here.If scalability information is present in however a non-committal form, then people would be compelled ("ok, so this shape of the loop would actually scale linearly with CPUs... neat").* Again: speed of e.g. parallel min/max vs. serial, pi computation etc. on a usual machine?I **STRONGLY** believe this does not belong in API documentation because it's too machine specific, compiler specific, stack alignment specific, etc. and almost any benchmark worth doing takes up more space than an example should. Furthermore, anyone who wants to know this can easily time it themselves. I have absolutely no intention of including this. While in general I appreciate and have tried to accommodate your suggestions, this is one I'll be standing firm on.
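Regarding the dot product on dense vectors suggested earlier as an example: with the map/reduce primitives it is nearly a one-liner. A sketch, assuming taskPool.reduce accepts a seed plus a random-access range with length, as the documentation under review describes; the exact signature may have shifted since:

```d
import std.algorithm : map;
import std.parallelism : taskPool;
import std.range : zip;

double dot(const(double)[] a, const(double)[] b) {
    // zip pairs up the elements lazily, map multiplies each pair, and the
    // pool splits the resulting range across workers and sums the pieces.
    return taskPool.reduce!"a + b"(0.0, map!"a[0] * a[1]"(zip(a, b)));
}
```

Being a pure reduction over independent multiplications, this is exactly the shape of loop that scales with the number of cores.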
Mar 20 2011
On 3/19/2011 12:03 PM, Andrei Alexandrescu wrote:* "workUnitSize: The number of elements to evaluate in a single Task. Must be less than or equal to bufSize, and in practice should be a fraction of bufSize such that all worker threads can be used." Then why not specify a different parameter such a multiplier or something? The dependence between bufSize and worUnitSize is a sign that these two should be improved. If you have good reasons that the user must have the parameters in this form, give an example substantiating that.I did this for consistency with all the other functions, where work unit size is specified directly. I don't want lazyMap to be the oddball function where it's specified in a completely different way.
Mar 19 2011
On Saturday 19 March 2011 09:03:51 Andrei Alexandrescu wrote:On 03/19/2011 02:32 AM, dsimcha wrote:Oooh. That's a neat trick. I didn't know that you could do that. - Jonathan M DavisOk, thanks again for clarifying **how** the docs could be improved. I've implemented the suggestions and generally given the docs a good reading over and clean up. The new docs are at: http://cis.jhu.edu/~dsimcha/d/phobos/std_parallelism.html* When using "parallelism" as a common noun, prefix is with a '_' so ddoc doesn't underline it.
Mar 19 2011