digitalmars.D - Thoughts on parallel programming?
- jfd (4/4) Nov 10 2010 Any thoughts on parallel programming. I was looking at something about ...
- bearophile (5/9) Nov 10 2010 In past I have shown here two large posts about Chapel, that's a languag...
- dsimcha (11/15) Nov 10 2010 Well, there's my std.parallelism library, which is in review for inclusi...
- Russel Winder (53/57) Nov 11 2010 Any programming language that cannot be used to program applications runn...
- Fawzi Mohamed (105/155) Nov 11 2010 on this I am not so sure, heterogeneous clusters are more difficult to
- Fawzi Mohamed (14/23) Nov 11 2010 I just finished reading "Parallel Programmability and the Chapel
- Fawzi Mohamed (18/38) Nov 11 2010 sorry, I translated that as SIMD, not SPMD, but the answer below still hol...
- Tobias Pfaff (5/9) Nov 11 2010 Unfortunately I only know about the standard stuff, OpenMP/OpenCL...
- Trass3r (2/5) Nov 11 2010 That would require compiler support for it.
- Tobias Pfaff (3/8) Nov 11 2010 Ok, that's what I suspected.
- Russel Winder (24/27) Nov 11 2010 I'd hardly call OpenMP lightweight. I agree that as a meta-notation for
- Trass3r (1/2) Nov 11 2010 http://bitbucket.org/trass3r/cl4d/wiki/Home
- Tobias Pfaff (14/31) Nov 11 2010 Well, I am looking for an easy & efficient way to perform parallel
- dsimcha (13/49) Nov 11 2010 I think you'll be very pleased with std.parallelism when/if it gets into...
- Tobias Pfaff (4/53) Nov 12 2010 I did a quick test of the module, looks really good so far, thanks for
- Fawzi Mohamed (3/16) Nov 12 2010 If you use D1 blip.parallel.smp offers that, and it does scale well to
- Sean Kelly (2/11) Nov 11 2010 I've considered backing spawn() calls by fibers multiplexed by a thread ...
- sybrandy (9/20) Nov 11 2010 I actually did something similar for a very simple web server I was
- Fawzi Mohamed (14/30) Nov 12 2010 I agree I think that TBB offers primitives for many parallelization
- Russel Winder (91/156) Nov 11 2010 The Intel roadmap is for processor chips that have a number of cores with ...
- retard (19/167) Nov 11 2010 FWIW, I'm not a parallel computing expert and have almost no experience
- retard (8/12) Nov 11 2010 At least it seems so to me. My last 1 and 2 core systems had a TDP of 65...
- Gary Whatmore (5/176) Nov 11 2010 You're unfortunately completely wrong. The industry is moving away from ...
- Walter Bright (7/13) Nov 11 2010 Yup. I am bemused by the efforts put into analyzing loops so that they c...
- bearophile (4/11) Nov 11 2010 I agree a lot. The language has to offer means to express all the semant...
- retard (4/24) Nov 11 2010 How does the Chapel work when I need to sort data (just basic quicksort
- Walter Bright (3/8) Nov 11 2010 I think it's more than being trained to think sequentially. I think it i...
- Sean Kelly (2/11) Nov 11 2010 Distributed programming is essentially a bunch of little sequential prog...
- %u (2/15) Nov 11 2010 Intel promised this AVX instruction set next year. Does it also work lik...
- Gary Whatmore (2/19) Nov 11 2010 AVX isn't parallel programming, it's vector processing. A dying breed of...
- %u (7/28) Nov 11 2010 Currently the amount of information available is scarce. I have no idea ...
- Don (16/28) Nov 12 2010 The Erlang people seem to say that a lot. The thing they omit to say,
- sybrandy (20/30) Nov 13 2010 That's only part of the reasoning behind all of the little programs in
- Fawzi Mohamed (29/195) Nov 12 2010 Vector co processors, yes I see that, and short term the effect of
- Sean Kelly (5/35) Nov 13 2010 True enough. But it's certainly more natural to think about than mutex-...
- sybrandy (2/4) Nov 13 2010 I like that description!
Any thoughts on parallel programming. I was looking at something about Chapel and X10 languages etc. for parallelism, and it looks interesting. I know that it is still an area of active research, and it is not yet (far from?) done, but anyone have thoughts on this as future direction? Thank you.
Nov 10 2010
jfd:
> Any thoughts on parallel programming. I was looking at something about Chapel and X10 languages etc. for parallelism, and it looks interesting. I know that it is still an area of active research, and it is not yet (far from?) done, but anyone have thoughts on this as future direction? Thank you.

In the past I have posted here two large posts about Chapel, a language that contains several good ideas worth stealing, but my posts were mostly ignored. Chapel is designed for heavy numerical computing on multi-cores or multi-CPUs, and it has good ideas about CPU-localization of the work, while D isn't yet very serious about that kind of parallelism. So far D has instead embraced message passing, which fits other purposes.

Bye,
bearophile
Nov 10 2010
== Quote from jfd (jfd nospam.com)'s article
> Any thoughts on parallel programming. I was looking at something about Chapel and X10 languages etc. for parallelism, and it looks interesting. I know that it is still an area of active research, and it is not yet (far from?) done, but anyone have thoughts on this as future direction? Thank you.

Well, there's my std.parallelism library, which is in review for inclusion in Phobos. (http://cis.jhu.edu/~dsimcha/d/phobos/std_parallelism.html, http://www.dsource.org/projects/scrapple/browser/trunk/parallelFuture/std_parallelism.d)

One unfortunate thing about it is that it doesn't use (and actually bypasses) D's thread isolation system and allows unchecked sharing. I couldn't think of any way to create a pedal-to-metal parallelism library that was simultaneously useful, safe and worked with the language as-is, and I wanted something that worked **now**, not next year or in D3 or whatever, so I decided to omit safety. Given that the library is in review, now would be the perfect time to offer any suggestions on how it can be improved.
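To make the shape of the API concrete, here is a minimal sketch of the kind of code the module enables, based on the documentation linked above. The module was still under review at the time, so exact names and signatures may differ from what eventually ships.

import std.math : sqrt;
import std.stdio : writeln;
import std.parallelism;   // the module under review, not yet in Phobos

void main()
{
    // Some data to work on.
    auto nums = new double[1_000_000];
    foreach (i, ref x; nums)
        x = i;

    // Parallel foreach: iterations of the loop body are spread over a
    // pool of worker threads sized to the number of cores.
    foreach (ref x; taskPool.parallel(nums))
        x = sqrt(x);

    // Parallel reduce: combine all elements with a binary operation.
    auto total = taskPool.reduce!"a + b"(nums);
    writeln("sum of square roots: ", total);
}

No compiler or OS support beyond plain threads is assumed; the pool does all the scheduling.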
Nov 10 2010
On Thu, 2010-11-11 at 02:24 +0000, jfd wrote:
> Any thoughts on parallel programming. I was looking at something about Chapel and X10 languages etc. for parallelism, and it looks interesting. I know that it is still an area of active research, and it is not yet (far from?) done, but anyone have thoughts on this as future direction? Thank you.

Any programming language that cannot be used to program applications running on a heterogeneous collection of processors, including CPUs and GPUs as computational devices, on a single chip, with there being many such chips on a board, possibly clustered, doesn't have much of a future. Timescale 5--10 years.

Intel's 80-core, 48-core and 50-core devices show the way server, workstation and laptop architectures are going. There may be a large central memory unit as now, but it will be secondary storage not primary storage. All the chip architectures are shifting to distributed memory -- basically cache coherence is too hard a problem to solve, so instead of solving it, they are getting rid of it. Also the memory bus stops being the bottleneck for computations, which is actually the biggest problem with current architectures.

Windows, Linux and Mac OS X have a serious problem and will either die or be revolutionized. Apple at least recognize the issue, hence they pushed OpenCL.

Actor model, CSP, dataflow, and similar distributed memory/process-based architectures will become increasingly important for software. There will be an increasing move to declarative expression, but I doubt functional languages will ever make the main stream. The issue here is that parallelism generally requires programmers not to try and tell the computer every detail of how to do something, but instead specify the start and end conditions and allow the runtime system to handle the realization of the transformation. Hence the move in Fortran from lots of "do" loops to "whole array" operations.

MPI and all the SPMD approaches have a severely limited future, but I bet the HPC codes are still using Fortran and MPI in 50 years time.

You mentioned Chapel and X10, but don't forget the other one of the original three HPCS projects, Fortress. Whilst all three are PGAS (partitioned global address space) languages, Fortress takes a very different viewpoint compared to Chapel and X10.

The summary of the summary is: programmers will either be developing parallelism systems or they will be unemployed.

<shameless-plug>
To hear more, I am doing a session on all this stuff for ACCU London 2010-11-18 18:30+00:00
http://skillsmatter.com/event/java-jee/java-python-ruby-linux-windows-are-all-doomed
</shameless-plug>

-- 
Russel.
Dr Russel Winder      t: +44 20 7585 2200   voip: sip:russel.winder ekiga.net
41 Buckmaster Road    m: +44 7770 465 077   xmpp: russel russel.org.uk
London SW11 1EN, UK   w: www.russel.org.uk  skype: russel_winder
Nov 11 2010
On 11-nov-10, at 09:58, Russel Winder wrote:
> On Thu, 2010-11-11 at 02:24 +0000, jfd wrote:
> [ . . . ]
>
> Any programming language that cannot be used to program applications running on a heterogeneous collection of processors, including CPUs and GPUs as computational devices, on a single chip, with there being many such chips on a board, possibly clustered, doesn't have much of a future. Timescale 5--10 years.

on this I am not so sure, heterogeneous clusters are more difficult to program, and GPU & co are slowly becoming more and more general purpose. Being able to take advantage of those is useful, but I am not convinced they are necessarily the future.

> Intel's 80-core, 48-core and 50-core devices show the way server, workstation and laptop architectures are going. [ . . . ] Also the memory bus stops being the bottleneck for computations, which is actually the biggest problem with current architectures.

yes, many core is the future, I agree on this, and also that a distributed approach is the only way to scale to a really large number of processors.
But distributed systems *are* more complex, so I think that for the foreseeable future one will have a hybrid approach.

> Windows, Linux and Mac OS X have a serious problem and will either die or be revolutionized. Apple at least recognize the issue, hence they pushed OpenCL.

again, I am not sure the situation is as dire as you paint it, Linux does quite well in the HPC field... but I agree that to be the ideal OS for these architectures it will need more changes.

> Actor model, CSP, dataflow, and similar distributed memory/process-based architectures will become increasingly important for software. [ . . . ] Hence the move in Fortran from lots of "do" loops to "whole array" operations.

Whole array operations are useful, and when possible one gains much using them; unfortunately not all problems can be reduced to a few large array operations. Data parallel languages are not the main type of language for these reasons.

> MPI and all the SPMD approaches have a severely limited future, but I bet the HPC codes are still using Fortran and MPI in 50 years time.

well, whole array operations are a generalization of the SPMD approach, so in this sense you said that that kind of approach will have a future (but with a more difficult optimization, as the hardware is more complex).
About MPI, I think that many don't see what MPI really does: MPI offers a simplified parallel model.
The main weakness of this model is that it assumes some kind of reliability, but in return it offers a clear computational model, with processors ordered in a linear or higher dimensional structure and efficient collective communication primitives.
Yes, MPI is not the right choice for all problems, but when usable it is very powerful, often superior to the alternatives, and programming with it is *simpler* than thinking about a generic distributed system.
So I think that for problems that are not trivially parallel, or easily parallelizable, MPI will remain the best choice.

> You mentioned Chapel and X10, but don't forget the other one of the original three HPCS projects, Fortress. Whilst all three are PGAS (partitioned global address space) languages, Fortress takes a very different viewpoint compared to Chapel and X10.

It might be a personal thing, but I am kind of "suspicious" toward PGAS; I find a generalized MPI model better than PGAS when you want to have separated address spaces.
Using MPI one can define a PGAS-like object wrapping local storage with an object that sends remote requests to access remote memory pieces.
This means having a local server where these wrapped objects can be "published" and that can respond at any moment to external requests. I call this rpc (remote procedure call) and it can be realized easily on top of MPI.
As not all objects are distributed, and in a complex program it does not always make sense to distribute these objects on all processors or none, I find the robust partitioning and collective communication primitives of MPI superior to PGAS.
With enough effort you can probably get everything also from PGAS, but then you lose all its simplicity.

> The summary of the summary is: programmers will either be developing parallelism systems or they will be unemployed.

The situation is not so dire: some problems are trivially parallel or can be solved with simple parallel patterns, and others don't need to be solved in parallel, as the sequential solution is fast enough. But I do agree that being able to develop parallel systems is increasingly important.
In fact it is something that I like to do, and I have thought about it a lot. I did program parallel systems, and out of my experience I tried to build something to do parallel programs "the way it should be", or at least the way I would like it to be ;)
The result is what I did with blip, http://dsource.org/projects/blip .
I don't think that (excluding some simple examples) fully automatic (transparent) parallelization is really feasible. At some point being parallel is more complex, and it puts an extra burden on the programmer.
Still, it is possible to have several levels of parallelization, and if you program a fully parallel program it should still be possible to use it relatively efficiently locally, but a local program will not automatically become fully parallel.
What I did is a basic SMP parallelization for programs with shared memory. This level tries to schedule independent recursive tasks using all processors as efficiently as possible (using the topology detected by libhwloc). It leverages an event based framework (libev) to avoid blocking while waiting for external tasks. The ability to describe complex asynchronous processes can be very useful also to work with GPUs.
MPI parallelization is part of the hierarchy of parallelization, for the reasons I described before; it is wrapped so that on a single processor one can use a "pseudo" MPI.
rpc (remote procedure call) might be better described as distributed objects: it offers a server that can respond to external requests at any moment, and the possibility to publish objects that will then be identified by URLs. These URLs can be used to create local proxies that call the remote object and get results from it.
This can be done using MPI, or directly sockets. If one uses sockets he has the whole flexibility (but also the whole complexity) of a fully distributed system. The basic building blocks of this can also be used in a distributed protocol like distributed hashtables.
blip is available now, and works with OS X and Linux. It should be possible to port it to Windows (both libhwloc and libev work on Windows), but I didn't do it. It needs D1 and Tango; Tango trunk can be compiled using the scripts in blip/buildTango, and then programs using blip can be compiled more easily with the dbuild script (which uses xfbuild behind the scenes).
I planned to make an official release this w.e., but you can look already now, the code is all there...

Fawzi
-----------------------------------------------------
Dr. Fawzi Mohamed, Office: 3'322
Humboldt-Universitaet zu Berlin, Institut fuer Chemie
Post: Unter den Linden 6, 10099, Berlin
Besucher/Pakete: Brook-Taylor-Str. 2, 12489 Berlin
Tel: +49 30 20 93 7140
Fax: +49 30 2093 7136
-----------------------------------------------------
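The "publish an object locally, reach it through a proxy" idea is independent of blip's actual API. As a rough sketch of the general shape in plain D2 (this is not blip's interface; the names storageServer etc. are purely illustrative, and std.concurrency threads stand in for remote nodes):

import std.concurrency;
import std.stdio : writeln;

// The "published" object: a thread that owns some local storage and
// answers lookup requests sent to it as messages.
void storageServer()
{
    double[string] store = ["pi": 3.14159];
    bool running = true;
    while (running)
    {
        receive(
            (Tid client, string key)
            {
                auto p = key in store;
                send(client, p ? *p : double.nan);
            },
            (OwnerTerminated e) { running = false; }
        );
    }
}

void main()
{
    auto server = spawn(&storageServer);

    // The proxy side: a request/response round trip that could just as
    // well be carried over MPI or a socket to a genuinely remote process.
    send(server, thisTid, "pi");
    auto value = receiveOnly!double();
    writeln("looked up: ", value);
}

The transport underneath (threads here, MPI or sockets in blip's case) is the part that changes; the request/response pattern stays the same.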
Nov 11 2010
On 11-nov-10, at 15:16, Fawzi Mohamed wrote:
> On 11-nov-10, at 09:58, Russel Winder wrote:
> [ . . . ]

I just finished reading "Parallel Programmability and the Chapel Language" by Chamberlain, Callahan and Zima. A very nice read, and a good overview of several languages and approaches.
Still, I stand by my earlier view: an MPI-like approach is more flexible, but indeed having a nice parallel implementation of distributed arrays (which with MPI one can have using Global Arrays, for example) can be very useful.
I think that a language like D can hide these behind wrapper objects, and reach for these objects (which are not the only ones present in a complex parallel program) an expressivity similar to Chapel, using the approach I have in blip. A direct implementation might be more efficient on shared memory machines, though.
Nov 11 2010
On 11-nov-10, at 15:16, Fawzi Mohamed wrote:
> On 11-nov-10, at 09:58, Russel Winder wrote:
>> MPI and all the SPMD approaches have a severely limited future, but I bet the HPC codes are still using Fortran and MPI in 50 years time.

sorry, I translated that as SIMD, not SPMD, but the answer below still holds in my opinion: if one has a complex parallel problem, MPI is a worthy contender; the thing is that on many occasions one doesn't need all its power. If a client-server, a distributed, or a map/reduce approach works, then simpler and more flexible solutions are superior. That (and its reliability problem, which PGAS also shares) is, in my opinion, the reason MPI is not much used outside the computational community.
Being able to tackle MPMD in a good way can also be useful, and that is what the rpc level does between computers, and the event-based scheduling does within a single computer (ensuring that one processor can do meaningful work while another waits).

> well, whole array operations are a generalization of the SPMD approach, so in this sense you said that that kind of approach will have a future (but with a more difficult optimization, as the hardware is more complex).
>
> About MPI, I think that many don't see what MPI really does: MPI offers a simplified parallel model. [ . . . ] So I think that for problems that are not trivially parallel, or easily parallelizable, MPI will remain the best choice.
Nov 11 2010
On 11/11/2010 03:24 AM, jfd wrote:
> Any thoughts on parallel programming. I was looking at something about Chapel and X10 languages etc. for parallelism, and it looks interesting. I know that it is still an area of active research, and it is not yet (far from?) done, but anyone have thoughts on this as future direction? Thank you.

Unfortunately I only know about the standard stuff, OpenMP/OpenCL...
Speaking of which: Are there any attempts to support lightweight multithreading in D, that is, something like OpenMP?

Thanks!
Nov 11 2010
> Unfortunately I only know about the standard stuff, OpenMP/OpenCL...
> Speaking of which: Are there any attempts to support lightweight multithreading in D, that is, something like OpenMP?

That would require compiler support for it.
Other than that, there only seems to be dsimcha's std.parallelism.
Nov 11 2010
On 11/11/2010 07:01 PM, Trass3r wrote:
>> Unfortunately I only know about the standard stuff, OpenMP/OpenCL...
>> Speaking of which: Are there any attempts to support lightweight multithreading in D, that is, something like OpenMP?
>
> That would require compiler support for it.
> Other than that, there only seems to be dsimcha's std.parallelism.

Ok, that's what I suspected.
std.parallelism doesn't look too bad though, I'll try it out...
Nov 11 2010
On Thu, 2010-11-11 at 18:24 +0100, Tobias Pfaff wrote:
[ . . . ]
> Unfortunately I only know about the standard stuff, OpenMP/OpenCL...
> Speaking of which: Are there any attempts to support lightweight multithreading in D, that is, something like OpenMP?

I'd hardly call OpenMP lightweight. I agree that as a meta-notation for directing the compiler how to insert appropriate code to force multithreading of certain classes of code, using OpenMP generally beats manual coding of the threads. But OpenMP is very Fortran oriented even though it can be useful for C, and indeed C++ as well.

However, given things like Threading Building Blocks (TBB) and the functional programming inspired techniques used by Chapel, OpenMP increasingly looks like a "hack" rather than a solution. Using parallel versions of for, map, filter, reduce in the language is probably a better way forward.

Having a D binding to OpenCL (and OpenGL, MPI, etc.) is probably going to be a good thing.

-- 
Russel.
Dr Russel Winder      t: +44 20 7585 2200   voip: sip:russel.winder ekiga.net
41 Buckmaster Road    m: +44 7770 465 077   xmpp: russel russel.org.uk
London SW11 1EN, UK   w: www.russel.org.uk  skype: russel_winder
Nov 11 2010
> Having a D binding to OpenCL is probably going to be a good thing.

http://bitbucket.org/trass3r/cl4d/wiki/Home
Nov 11 2010
On 11/11/2010 08:10 PM, Russel Winder wrote:
> On Thu, 2010-11-11 at 18:24 +0100, Tobias Pfaff wrote:
> [ . . . ]
>
> I'd hardly call OpenMP lightweight. [ . . . ] Using parallel versions of for, map, filter, reduce in the language is probably a better way forward.
>
> Having a D binding to OpenCL (and OpenGL, MPI, etc.) is probably going to be a good thing.

Well, I am looking for an easy & efficient way to perform parallel numerical calculations on our 4-8 core machines. With C++, that's OpenMP (or GPGPU stuff using CUDA/OpenCL) for us now.
Maybe lightweight was the wrong word; what I meant is that OpenMP is easy to use, and efficient for the problems we are solving. There actually might be better tools for that; honestly we didn't look into that many options -- we are no HPC guys, 1000-CPU clusters are not a relevant scenario, and we are happy that we even started parallelizing our code at all :)
Anyways, I was thinking about the logical thing to use in D for this scenario. It's nothing super-fancy, in most cases just a parallel_for will do, and sometimes a map/reduce operation...

Cheers,
Tobias
Nov 11 2010
== Quote from Tobias Pfaff (nospam spam.no)'s article
> Well, I am looking for an easy & efficient way to perform parallel numerical calculations on our 4-8 core machines. With C++, that's OpenMP (or GPGPU stuff using CUDA/OpenCL) for us now.
> [ . . . ]
> Anyways, I was thinking about the logical thing to use in D for this scenario. It's nothing super-fancy, in most cases just a parallel_for will do, and sometimes a map/reduce operation...

I think you'll be very pleased with std.parallelism when/if it gets into Phobos. The design philosophy is exactly what you're looking for: simple shared memory parallelism on multicore computers, assuming no fancy/unusual OS-, compiler- or hardware-level infrastructure. Basically, it's got parallel foreach, parallel map, parallel reduce and parallel tasks. All you need to fully utilize it is DMD and a multicore PC.

As a reminder, the docs are at http://cis.jhu.edu/~dsimcha/d/phobos/std_parallelism.html and the code is at http://dsource.org/projects/scrapple/browser/trunk/parallelFuture/std_parallelism.d . If this doesn't meet your needs in its current form, I'd like as much constructive criticism as possible, as long as it's within the scope of simple, everyday parallelism without fancy infrastructure.
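For the "parallel tasks" part specifically, the future-like usage looks roughly as follows. Again a sketch only: the member names (task, put, yieldForce) follow the documentation of the version that eventually shipped and may not match the review draft exactly.

import std.parallelism;
import std.stdio : writeln;

// Some independent piece of work to run concurrently with the caller.
double slowSum(const(double)[] data)
{
    double s = 0;
    foreach (x; data)
        s += x;
    return s;
}

void main()
{
    auto data = new double[1_000_000];
    data[] = 1.0;

    // Create a future-like task, hand it to the pool, and block only
    // when the result is actually needed.
    auto sumTask = task!slowSum(data);
    taskPool.put(sumTask);

    // ... unrelated work could happen here ...

    writeln("sum = ", sumTask.yieldForce);
}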
Nov 11 2010
On 11/12/2010 12:44 AM, dsimcha wrote:
> I think you'll be very pleased with std.parallelism when/if it gets into Phobos. [ . . . ] If this doesn't meet your needs in its current form, I'd like as much constructive criticism as possible, as long as it's within the scope of simple, everyday parallelism without fancy infrastructure.

I did a quick test of the module, looks really good so far, thanks for providing this! (Is this module scheduled for inclusion in phobos2?)
If I find issues with it I'll let you know.
Nov 12 2010
On 12-nov-10, at 00:29, Tobias Pfaff wrote:
> [...] Well, I am looking for an easy & efficient way to perform parallel numerical calculations on our 4-8 core machines.
> [ . . . ]
> It's nothing super-fancy, in most cases just a parallel_for will do, and sometimes a map/reduce operation...

If you use D1, blip.parallel.smp offers that, and it does scale well to 4-8 cores.
Nov 12 2010
Tobias Pfaff Wrote:
> On 11/11/2010 03:24 AM, jfd wrote:
> [ . . . ]
>
> Unfortunately I only know about the standard stuff, OpenMP/OpenCL...
> Speaking of which: Are there any attempts to support lightweight multithreading in D, that is, something like OpenMP?

I've considered backing spawn() calls by fibers multiplexed by a thread pool (receive() calls would cause the fiber to yield) instead of having each call generate a new kernel thread. The only issue is that TLS (i.e. non-shared static storage) is thread-local, not fiber-local. One idea, however, is to do OSX-style manual TLS inside Fiber, so each fiber would have its own automatic local storage. Perhaps as an experiment I'll create a new derivative of Fiber that does this and see how it works.
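As a rough illustration of the idea (not the proposed druntime change itself), the sketch below fakes "fiber-local" storage at user level with a table keyed by the currently running core.thread.Fiber, and round-robins two fibers on one thread; the real version would live inside Fiber and cooperate with a thread pool.

import core.thread;
import std.stdio : writeln;

// User-level stand-in for fiber-local storage: one slot per fiber,
// keyed by the currently executing Fiber object.
int[Fiber] localCounter;

void worker()
{
    auto self = Fiber.getThis();
    localCounter[self] = 0;
    foreach (i; 0 .. 3)
    {
        ++localCounter[self];   // only this fiber touches its own slot
        Fiber.yield();          // hand control back to the "scheduler"
    }
    writeln("fiber finished with counter = ", localCounter[self]);
}

void main()
{
    auto a = new Fiber(&worker);
    auto b = new Fiber(&worker);

    // A trivial round-robin scheduler on one thread; the idea above is to
    // multiplex many such fibers over a small pool of kernel threads.
    while (a.state != Fiber.State.TERM || b.state != Fiber.State.TERM)
    {
        if (a.state != Fiber.State.TERM) a.call();
        if (b.state != Fiber.State.TERM) b.call();
    }
}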
Nov 11 2010
On 11/11/2010 02:41 PM, Sean Kelly wrote:
> I've considered backing spawn() calls by fibers multiplexed by a thread pool (receive() calls would cause the fiber to yield) instead of having each call generate a new kernel thread. [ . . . ] Perhaps as an experiment I'll create a new derivative of Fiber that does this and see how it works.

I actually did something similar for a very simple web server I was experimenting with. It is similar to how Erlang works, in that Erlang processes are, at least to me, similar to fibers and they are run in one of several threads in the interpreter. The only problem I had was ensuring that my logging was thread-safe. If you could implement a TLS-like system for Fibers, I think that would help prevent that issue.

Casey
Nov 11 2010
On 11-nov-10, at 20:10, Russel Winder wrote:
> On Thu, 2010-11-11 at 18:24 +0100, Tobias Pfaff wrote:
> [ . . . ]
>
> I'd hardly call OpenMP lightweight. [ . . . ] However, given things like Threading Building Blocks (TBB) and the functional programming inspired techniques used by Chapel, OpenMP increasingly looks like a "hack" rather than a solution.

I agree. I think that TBB offers primitives for many kinds of parallelization, and is cleaner and more flexible than OpenMP, but in my opinion it has a big weakness: it cannot cope well with independent tasks.
Coping well with both nested parallelism and independent tasks is crucial for a generic solution that can be applied to several problems. This is missing, as far as I know, also from Chapel.
I think that a solution that copes well with both nested parallelism and independent tasks is an excellent starting point on which to build almost all other higher level parallelization schemes. It is important to handle this centrally, because the number of threads that one spawns should ideally stay limited to the number of execution units.
Nov 12 2010
On Thu, 2010-11-11 at 15:16 +0100, Fawzi Mohamed wrote:
[ . . . ]
> on this I am not so sure, heterogeneous clusters are more difficult to program, and GPU & co are slowly becoming more and more general purpose. Being able to take advantage of those is useful, but I am not convinced they are necessarily the future.

The Intel roadmap is for processor chips that have a number of cores with different architectures. Heterogeneity is not going to be a choice, it is going to be an imposition. And this is at bus level, not at cluster level.

[ . . . ]

> yes, many core is the future, I agree on this, and also that a distributed approach is the only way to scale to a really large number of processors. But distributed systems *are* more complex, so I think that for the foreseeable future one will have a hybrid approach.

Hybrid is what I am saying is the future whether we like it or not. SMP as the whole system is the past. I disagree that distributed systems are more complex per se. I suspect comments are getting so general here that anything anyone writes can be seen as both true and false simultaneously. My perception is that shared memory multithreading is less and less a tool that applications programmers should be thinking in terms of. Multiple processes with a hierarchy of communications costs is the overarching architecture, with each process potentially being SMP or CSP or . . .

> again, I am not sure the situation is as dire as you paint it, Linux does quite well in the HPC field... but I agree that to be the ideal OS for these architectures it will need more changes.

The Linux driver architecture is already creaking at the seams; it implies a central, monolithic approach to the operating system. This falls down in a multiprocessor shared memory context. The fact that the Top 500 generally use Linux is because it is the least worst option. M$, despite throwing large amounts of money at the problem, and indeed buying some very high profile names to try and do something about the lack of traction, have failed to make any headway in the HPC operating system stakes. Do you want to have to run a virus checker on your HPC system?

My gut reaction is that we are going to see a rise of hypervisors as per Tilera chips, at least in the short to medium term, simply as a bridge from current OSes to the future. My guess is that L4 microkernels and/or nanokernels, exokernels, etc. will find a central place in future systems. The problem to be solved is ensuring that the appropriate ABI is available on the appropriate core at the appropriate time. Mobility of ABI is the critical factor here.

[ . . . ]

> Whole array operations are useful, and when possible one gains much using them; unfortunately not all problems can be reduced to a few large array operations. Data parallel languages are not the main type of language for these reasons.

Agreed. My point was that in 1960s code people explicitly handled array operations using do loops because they had to. Nowadays such code is anathema to efficient execution. My complaint here is that people have put effort into compiler technology instead of rewriting the codes in a better language and/or idiom. Clearly whole array operations only apply to algorithms that involve arrays!

[ . . . ]

> well, whole array operations are a generalization of the SPMD approach, so in this sense you said that that kind of approach will have a future (but with a more difficult optimization, as the hardware is more complex).

I guess this is where the PGAS people are challenging things. Applications can be couched in terms of array algorithms which can be scattered across distributed memory systems. Inappropriate operations lead to huge inefficiencies, but handled correctly, code runs very fast.

> About MPI, I think that many don't see what MPI really does: MPI offers a simplified parallel model. [ . . . ] So I think that for problems that are not trivially parallel, or easily parallelizable, MPI will remain the best choice.

I guess my main irritant with MPI is that I have to run the same executable on every node and, perhaps more importantly, the message passing structure is founded on Fortran primitive data types. OK, so you can hack up some element of abstraction so as to send complex messages, but it would be far better if the MPI standard provided better abstractions.

[ . . . ]

> It might be a personal thing, but I am kind of "suspicious" toward PGAS; I find a generalized MPI model better than PGAS when you want to have separated address spaces. [ . . . ] With enough effort you can probably get everything also from PGAS, but then you lose all its simplicity.

I think we are going to have to take this one off the list. My summary is that MPI and PGAS solve different problems differently. There are some problems that one can code up neatly in MPI and that are ugly in PGAS, but the converse is also true.

[ . . . ]

> The situation is not so dire: some problems are trivially parallel or can be solved with simple parallel patterns, and others don't need to be solved in parallel, as the sequential solution is fast enough, but I do agree that being able to develop parallel systems is increasingly important. In fact it is something that I like to do, and I have thought about it a lot. I did program parallel systems, and out of my experience I tried to build something to do parallel programs "the way it should be", or at least the way I would like it to be ;)

The real question is whether future computers will run Word, OpenOffice.org, Excel, Powerpoint fast enough so that people don't complain. Everything else is an HPC ghetto :-)

> The result is what I did with blip, http://dsource.org/projects/blip .
> I don't think that (excluding some simple examples) fully automatic (transparent) parallelization is really feasible. At some point being parallel is more complex, and it puts an extra burden on the programmer. Still, it is possible to have several levels of parallelization, and if you program a fully parallel program it should still be possible to use it relatively efficiently locally, but a local program will not automatically become fully parallel.

At the heart of all this is that programmers are taught that an algorithm is a sequence of actions to achieve a goal. Programmers are trained to think sequentially and this affects their coding. This means that parallelism has to be expressed at a sufficiently high level that programmers can still reason about algorithms as sequential things.

[ . . . ]

-- 
Russel.
Dr Russel Winder      t: +44 20 7585 2200   voip: sip:russel.winder ekiga.net
41 Buckmaster Road    m: +44 7770 465 077   xmpp: russel russel.org.uk
London SW11 1EN, UK   w: www.russel.org.uk  skype: russel_winder
Nov 11 2010
Thu, 11 Nov 2010 19:41:56 +0000, Russel Winder wrote:
> On Thu, 2010-11-11 at 15:16 +0100, Fawzi Mohamed wrote:
> [ . . . ]
>
> Hybrid is what I am saying is the future whether we like it or not. SMP as the whole system is the past. [ . . . ] My perception is that shared memory multithreading is less and less a tool that applications programmers should be thinking in terms of. Multiple processes with a hierarchy of communications costs is the overarching architecture, with each process potentially being SMP or CSP or . . .
> [ . . . ]

FWIW, I'm not a parallel computing expert and have almost no experience of it outside basic parallel programming courses, but it seems to me that the HPC clusters are a completely separate domain. It used to be the case that *all* other systems were single core and only HPC consisted of (hybrid multicore setups of) several nodes. Now what is happening is that both embedded and mainstream PCs are getting multiple cores in both CPU and GPU chips. Multi-socket setups are still rare. The growth rate maybe follows Moore's law, at least in GPUs; in CPUs the problems with programmability are slowing things down, and many laptops are still dual-core despite multiple cores being more energy efficient than higher GHz, and my home PC has 8 virtual cores in a single CPU.

The HPC systems with hundreds of processors are definitely still important, but I see that 99.9(99)% of the market is in desktop and embedded systems. We need efficient ways to program multicore mobile phones, multicore laptops, multicore tablet devices and so on. These are all shared memory systems. I don't think MPI works well in shared memory systems. It's good to have an MPI-like system for D, but it cannot solve these problems.
Nov 11 2010
Thu, 11 Nov 2010 20:01:09 +0000, retard wrote:
> in CPUs the problems with programmability are slowing things down, and many laptops are still dual-core despite multiple cores being more energy efficient than higher GHz, and my home PC has 8 virtual cores in a single CPU.

At least it seems so to me. My last 1- and 2-core systems had a TDP of 65 and 105W. Now it's 130W, and the next gen has 12 cores at 130W TDP. So I currently have 8 CPU cores and 480 GPU cores. Unfortunately many open source applications don't use the GPU (maybe OpenGL 1.0, but usually software rendering; the GPU-accelerated desktops are still buggy and crash prone) and are single threaded. Even some heavier tasks like video encoding use cores very inefficiently. Would MPI help?
Nov 11 2010
retard Wrote:
> FWIW, I'm not a parallel computing expert and have almost no experience of it outside basic parallel programming courses, but it seems to me that the HPC clusters are a completely separate domain.
> [ . . . ]
> We need efficient ways to program multicore mobile phones, multicore laptops, multicore tablet devices and so on. These are all shared memory systems. I don't think MPI works well in shared memory systems. It's good to have an MPI-like system for D, but it cannot solve these problems.

You're unfortunately completely wrong. The industry is moving away from desktop applications. The reason is simple: software as a service brings more profit and provides a handy vendor lock-in the customers don't even realize now. Advertisers pay for the services now, the customers directly in the future, once the local desktop application market has been crushed. Another reason is the amount of open source software out there. You can't compete with free (as in beer) and it's considered good enough by typical users. Desktop applications suffer from segfaults and bugs. You can hide those with server technology. Just blame the infrastructure. Conceptually the internet is so complex that people accept broken behavior more often. "Yep, it wasn't facebook's fault - some pipe exploded in Africa and your net is down."

High performance web servers need MPI and similar technologies to scale on huge clusters. The streaming services and browser plugins guarantee that you don't need to do video encoding at home anymore. Low upload bandwidth guarantees that you won't share your personal content (e.g. images or videos taken with a camera) even when IPv6 comes. All games will be only available on game consoles. The multicore PC will just die away. In the future the client systems will be even dumber than they are now, maybe even X-like thin clients on top of http/html5/ajax. These systems run just fine on a single-tasking single core.
Nov 11 2010
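The "published object" pattern Fawzi describes above - a local server that wraps purely local storage and answers requests arriving from other processes - can be sketched without MPI at all. The snippet below is only an illustration using D2's std.concurrency message passing in place of MPI, so the Get/Put message structs and the storageServer function are invented names, not the blip or MPI API:

import std.concurrency;
import std.stdio;

// Messages understood by the "published" storage object (illustrative types).
struct Get  { size_t index; Tid replyTo; }
struct Put  { size_t index; double value; }
struct Stop { }

// The local server: it wraps a purely local array and answers requests
// one at a time, so no locking is needed on the data itself.
void storageServer(size_t length)
{
    auto data = new double[length];
    bool running = true;
    while (running)
    {
        receive(
            (Put p)  { data[p.index] = p.value; },
            (Get g)  { send(g.replyTo, data[g.index]); },
            (Stop s) { running = false; }
        );
    }
}

void main()
{
    auto server = spawn(&storageServer, cast(size_t) 1000);
    send(server, Put(42, 3.14));       // "remote" write
    send(server, Get(42, thisTid));    // "remote" read request
    writeln(receiveOnly!double());     // prints 3.14
    send(server, Stop());
}

On top of MPI the same shape survives: the receive loop becomes a probe/receive loop on a communicator, and the message structs become packed buffers.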
Russel Winder wrote:

Agreed. My point was that in 1960s code people explicitly handled array operations using do loops because they had to. Nowadays such code is anathema to efficient execution. My complaint here is that people have put effort into compiler technology instead of rewriting the codes in a better language and/or idiom. Clearly whole array operations only apply to algorithms that involve arrays!

Yup. I am bemused by the efforts put into analyzing loops so that they can be re-written (by the compiler) into a higher level construct, which is then compiled. It is just backwards from what the compiler should be doing. The high level construct is what the programmer should be writing. It shouldn't be something the compiler reconstructs from low level source code.
Nov 11 2010
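D's whole-array operations are a small, existing example of writing the high level construct directly instead of hoping the compiler re-derives it from a loop. A minimal sketch; the array names and sizes are arbitrary:

void main()
{
    auto b = new double[1_000_000];
    auto c = new double[1_000_000];
    auto a = new double[b.length];

    // Low-level form: the compiler has to reverse-engineer this loop
    // before it can vectorize or otherwise parallelize it.
    foreach (i; 0 .. a.length)
        a[i] = b[i] * 2.0 + c[i];

    // High-level form: the element-wise intent is stated directly,
    // leaving the implementation free to use SSE/AVX or to split the work.
    a[] = b[] * 2.0 + c[];
}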
Walter:

Yup. I am bemused by the efforts put into analyzing loops so that they can be re-written (by the compiler) into a higher level construct, which is then compiled. It is just backwards from what the compiler should be doing. The high level construct is what the programmer should be writing. It shouldn't be something the compiler reconstructs from low level source code.

I agree a lot. The language has to offer means to express all the semantics and constraints: that the arrays are disjoint, that the operations done on them are pure or not pure, that the operations are impure but determined only by a small window of the arrays, and so on. Then the compiler has to optimize the code according to the presence of SIMD registers, multiple cores, etc. This maybe is not enough for maximum-performance applications, but in most situations it's plenty. (Incidentally, this is much of what the Chapel language does (and D doesn't), and what I explained in two past posts about Chapel that were mostly ignored.)

Bye, bearophile
Nov 11 2010
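Some of the constraints bearophile lists can already be written down in D2: pure tells both the compiler and any parallel library that applying a function to disjoint elements cannot race. A minimal sketch assuming the std.parallelism module discussed elsewhere in this thread (taskPool and amap as later proposed for Phobos); norm is just a placeholder function:

import std.math : sqrt;
import std.parallelism;

// pure: no hidden shared state, so mapping this over disjoint elements
// in parallel is safe by construction.
pure double norm(double x)
{
    return sqrt(x * x + 1.0);
}

void main()
{
    auto xs = new double[1_000_000];
    foreach (i, ref x; xs)
        x = i;

    // The library, not the compiler, distributes the work across cores here,
    // but the purity annotation is what makes the transformation clearly legal.
    auto ys = taskPool.amap!norm(xs);
}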
Thu, 11 Nov 2010 16:32:03 -0500, bearophile wrote:

How does Chapel work when I need to sort data (just a basic quicksort on 12 cores, for instance), or e.g. compile many files in parallel, or encode xvid? What is the content of the array with the xvid files?

Walter:

Yup. I am bemused by the efforts put into analyzing loops so that they can be re-written (by the compiler) into a higher level construct, which is then compiled. It is just backwards from what the compiler should be doing. The high level construct is what the programmer should be writing. It shouldn't be something the compiler reconstructs from low level source code.

I agree a lot. The language has to offer means to express all the semantics and constraints: that the arrays are disjoint, that the operations done on them are pure or not pure, that the operations are impure but determined only by a small window of the arrays, and so on. Then the compiler has to optimize the code according to the presence of SIMD registers, multiple cores, etc. This maybe is not enough for maximum-performance applications, but in most situations it's plenty. (Incidentally, this is much of what the Chapel language does (and D doesn't), and what I explained in two past posts about Chapel that were mostly ignored.)
Nov 11 2010
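For what it's worth, the "basic quicksort on 12 cores" case does not need a data parallel language; it maps onto fork/join tasks in dsimcha's std.parallelism, which is under review for Phobos. A minimal sketch along the lines of that library's task API - the grain size and the three-way partition below are illustrative choices, not the library's own example:

import std.algorithm : partition, sort;
import std.parallelism;

// Sorts data in place; once a slice is big enough to be worth a task,
// the two recursive calls run in parallel via the global taskPool.
void parallelQuickSort(T)(T[] data, size_t grainSize = 32_768)
{
    if (data.length <= grainSize)
    {
        sort(data);                        // small slices: plain sequential sort
        return;
    }

    auto pivot   = data[data.length / 2];
    auto geq     = partition!(a => a < pivot)(data);   // elements >= pivot
    auto greater = partition!(a => a == pivot)(geq);   // elements >  pivot
    auto less    = data[0 .. data.length - geq.length];

    auto otherHalf = task!(parallelQuickSort!T)(less, grainSize);
    taskPool.put(otherHalf);               // one half goes to the pool...
    parallelQuickSort(greater, grainSize); // ...the other runs in this thread
    otherHalf.yieldForce();                // wait for (or steal) the pool task
}

void main()
{
    import std.random : uniform;
    auto data = new double[1_000_000];
    foreach (ref x; data)
        x = uniform(0.0, 1.0);
    parallelQuickSort(data);
}

Compiling many files in parallel or encoding several xvid clips is the embarrassingly parallel case: a parallel foreach over the list of files.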
Russel Winder wrote:At the heart of all this is that programmers are taught that algorithm is a sequence of actions to achieve a goal. Programmers are trained to think sequentially and this affects their coding. This means that parallelism has to be expressed at a sufficiently high level that programmers can still reason about algorithms as sequential things.I think it's more than being trained to think sequentially. I think it is in the inherent nature of how we think.
Nov 11 2010
Walter Bright Wrote:

Russel Winder wrote:

At the heart of all this is that programmers are taught that an algorithm is a sequence of actions to achieve a goal. Programmers are trained to think sequentially and this affects their coding. This means that parallelism has to be expressed at a sufficiently high level that programmers can still reason about algorithms as sequential things.

I think it's more than being trained to think sequentially. I think it is in the inherent nature of how we think.

Distributed programming is essentially a bunch of little sequential programs that interact, which is basically how people cooperate in the real world. I think that is by far the most intuitive of any concurrent programming model, though it's still a significant conceptual shift from the traditional monolithic imperative program.
Nov 11 2010
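Sean's "little sequential programs that interact" is more or less what D2's std.concurrency already exposes: each spawned thread owns its own state and interacts only through messages. A minimal sketch; the squarer worker and its message layout are invented for illustration:

import std.concurrency;
import std.stdio;

// A small sequential program: it owns its own state and only ever talks
// to the outside world through its mailbox.
void squarer()
{
    bool running = true;
    while (running)
    {
        receive(
            (int x, Tid replyTo) { send(replyTo, x * x); },
            (string cmd)         { if (cmd == "stop") running = false; }
        );
    }
}

void main()
{
    auto worker = spawn(&squarer);   // an isolated thread, no shared data
    send(worker, 6, thisTid);        // interaction is a message...
    writeln(receiveOnly!int());      // ...and so is the reply (prints 36)
    send(worker, "stop");
}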
Sean Kelly Wrote:

Walter Bright Wrote:

Russel Winder wrote:

At the heart of all this is that programmers are taught that an algorithm is a sequence of actions to achieve a goal. Programmers are trained to think sequentially and this affects their coding. This means that parallelism has to be expressed at a sufficiently high level that programmers can still reason about algorithms as sequential things.

I think it's more than being trained to think sequentially. I think it is in the inherent nature of how we think.

Distributed programming is essentially a bunch of little sequential programs that interact, which is basically how people cooperate in the real world. I think that is by far the most intuitive of any concurrent programming model, though it's still a significant conceptual shift from the traditional monolithic imperative program.

Intel promised this AVX instruction set next year. Does it also work like distributed processes? I hear it doubles your FLOPS. These are exciting times for parallel computing. Lots of new media for distributed message passing programming. Lots of little fibers filling the multimedia pipelines with parallel data. It might even beat GPUs soon if Larrabee comes.
Nov 11 2010
%u Wrote:Sean Kelly Wrote:AVX isn't parallel programming, it's vector processing. A dying breed of paradigms. Parallel programming deals with concurrency. OpenMP and MPI. Chapel (don't know it, but heard it here). Fortran. These are all good examples. AVX is just a cpu intrinsics stuff in std.intrinsicsWalter Bright Wrote:Intel promised this AVX instruction set next year. Does it also work like distributed processes? I hear it doubles your FLOPS. These are exciting times parallel computing. Lots of new medias for distributed message passing programming. Lots of little fibers filling the multimedia pipelines with parallel data. Might even beat GPU soon if Larrabee comes.Russel Winder wrote:Distributed programming is essentially a bunch of little sequential program that interact, which is basically how people cooperate in the real world. I think that is by far the most intuitive of any concurrent programming model, though it's still a significant conceptual shift from the traditional monolithic imperative program.At the heart of all this is that programmers are taught that algorithm is a sequence of actions to achieve a goal. Programmers are trained to think sequentially and this affects their coding. This means that parallelism has to be expressed at a sufficiently high level that programmers can still reason about algorithms as sequential things.I think it's more than being trained to think sequentially. I think it is in the inherent nature of how we think.
Nov 11 2010
Gary Whatmore Wrote:

%u Wrote:

Sean Kelly Wrote:

Walter Bright Wrote:

Russel Winder wrote:

At the heart of all this is that programmers are taught that an algorithm is a sequence of actions to achieve a goal. Programmers are trained to think sequentially and this affects their coding. This means that parallelism has to be expressed at a sufficiently high level that programmers can still reason about algorithms as sequential things.

I think it's more than being trained to think sequentially. I think it is in the inherent nature of how we think.

Distributed programming is essentially a bunch of little sequential programs that interact, which is basically how people cooperate in the real world. I think that is by far the most intuitive of any concurrent programming model, though it's still a significant conceptual shift from the traditional monolithic imperative program.

Intel promised this AVX instruction set next year. Does it also work like distributed processes? I hear it doubles your FLOPS. These are exciting times for parallel computing. Lots of new media for distributed message passing programming. Lots of little fibers filling the multimedia pipelines with parallel data. It might even beat GPUs soon if Larrabee comes.

AVX isn't parallel programming, it's vector processing. A dying breed of paradigms. Parallel programming deals with concurrency. OpenMP and MPI. Chapel (don't know it, but heard it here). Fortran. These are all good examples. AVX is just CPU intrinsics stuff in std.intrinsics.

Currently the amount of information available is scarce. I have no idea how I would use AVX or SSE in D. Auto-vectorization? Does it cover all use cases? So, roughly:

SSE & autovectorization & intrinsics => loops, hand written inline assembly parts
very small scale local worker threads / fibers => dsimcha's lib
medium scale local area network => the great flagship distributed message passing system
huge clusters with 1000+ computers?

Why is the message passing system so important? Assume I have a dual-core laptop with AVX instructions next year. Use of 2 threads doubles my processor power. Use of AVX gives 8 times more power in good loops. I have no cluster, so the flagship system provides zero benefit.
Nov 11 2010
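For the "very small scale local worker threads / fibers => dsimcha's lib" tier above, the dual-core laptop case looks roughly like this. A minimal sketch assuming std.parallelism as proposed for Phobos; the log-table workload is just a placeholder:

import std.math : log;
import std.parallelism;

void main()
{
    auto logs = new double[10_000_000];

    // Shared-memory data parallelism on one machine: the pool splits the
    // index range into chunks of 10_000 elements and hands them to as many
    // worker threads as the machine has cores (totalCPUs).
    foreach (i, ref elem; taskPool.parallel(logs, 10_000))
        elem = log(i + 1.0);
}

Vectorization (SSE today, AVX later) then applies inside each chunk, so the two levels compose rather than compete.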
Sean Kelly wrote:Walter Bright Wrote:The Erlang people seem to say that a lot. The thing they omit to say, though, is that it is very, very difficult in the real world! Consider managing a team of ten people. Getting them to be ten times as productive as a single person is extremely difficult -- virtually impossible, in fact. I agree with Walter -- I don't think it's got much to do with programmer training. It's a problem that hasn't been solved in the real world in the general case. The analogy with the real world suggests to me that there are three cases that work well: * massively parallel; * _completely_ independent tasks; and * very small teams. Large teams are a management nightmare, and I see no reason to believe that wouldn't hold true for a large number of cores as well.Russel Winder wrote:Distributed programming is essentially a bunch of little sequential program that interact, which is basically how people cooperate in the real world. I think that is by far the most intuitive of any concurrent programming model, though it's still a significant conceptual shift from the traditional monolithic imperative program.At the heart of all this is that programmers are taught that algorithm is a sequence of actions to achieve a goal. Programmers are trained to think sequentially and this affects their coding. This means that parallelism has to be expressed at a sufficiently high level that programmers can still reason about algorithms as sequential things.I think it's more than being trained to think sequentially. I think it is in the inherent nature of how we think.
Nov 12 2010
That's only part of the reasoning behind all of the little programs in Erlang. One of the more important aspects is the concept of supervisor trees, where you have processes that monitor* other processes. In the event that a child process fails, the parent process will try to perform a simpler version of what needs to occur until it is successful. The other aspect is the concept of failing fast. It is assumed that a process that fails does not know how to resolve the issue; therefore it should just stop running and allow the parent process to do the right thing.

If you build your software the Erlang way, then you implicitly build software that is multi-core friendly. How well it uses multiple cores depends on the software that is written; however, I believe that Erlang is supposed to be better than most other languages at obtaining something close to linear scaling across cores. Not 100% sure, though.

Does this mean that I believe distributed programming is easy in Erlang? Well, that depends on what you're doing, but I will say that being able to spawn functions on different machines is dirt simple. Doing it efficiently... well, that's where I think the programmer needs to know what they're doing.

Casey

* The monitoring is something implicit to the language.

Distributed programming is essentially a bunch of little sequential programs that interact, which is basically how people cooperate in the real world. I think that is by far the most intuitive of any concurrent programming model, though it's still a significant conceptual shift from the traditional monolithic imperative program.

The Erlang people seem to say that a lot. The thing they omit to say, though, is that it is very, very difficult in the real world! Consider managing a team of ten people. Getting them to be ten times as productive as a single person is extremely difficult -- virtually impossible, in fact.
Nov 13 2010
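The supervisor-tree and fail-fast ideas can be imitated, crudely, with D2's std.concurrency: spawnLinked makes the parent receive a LinkTerminated message when a linked child dies, and the parent decides what to do next. A minimal sketch, not Erlang/OTP semantics - the retry count, the back-off and the deliberately flaky child are invented for illustration:

import core.thread : Thread;
import core.time : dur;
import std.concurrency;
import std.stdio;

// A deliberately fragile child "process": it fails fast rather than trying
// to repair a situation it does not understand.
void flakyChild()
{
    throw new Exception("something went wrong");
}

// A very small supervisor: restart the child whenever it dies,
// giving up after a fixed number of attempts.
void supervise()
{
    foreach (attempt; 1 .. 4)
    {
        spawnLinked(&flakyChild);        // link: we get told when it dies
        receive(
            (LinkTerminated lt) { writeln("child died; restart attempt ", attempt); }
        );
        Thread.sleep(dur!"msecs"(100));  // naive back-off before restarting
    }
    writeln("giving up");
}

void main()
{
    supervise();
}

What Erlang adds on top of this primitive is the library of standard behaviours (supervisors, gen_server and so on) and per-process heaps, which is why the pattern feels so much more natural there.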
On 11-nov-10, at 20:41, Russel Winder wrote:

On Thu, 2010-11-11 at 15:16 +0100, Fawzi Mohamed wrote: [ . . . ]

Vector co-processors, yes, I see that, and in the short term the effect of things like AMD Fusion (CPU/GPU merging). Is this necessarily the future? I don't know, neither does Intel I think, as they are still evaluating Larrabee. But CPU/GPU will stay around for some time more, for sure.

On this I am not so sure: heterogeneous clusters are more difficult to program, and GPUs & co are slowly becoming more and more general purpose. Being able to take advantage of those is useful, but I am not convinced they are necessarily the future.

The Intel roadmap is for processor chips that have a number of cores with different architectures. Heterogeneity is not going to be a choice, it is going to be an imposition. And this is at bus level, not at cluster level. [ . . . ]

Yes, many-core is the future, I agree on this, and also that a distributed approach is the only way to scale to a really large number of processors. But distributed systems *are* more complex, so I think that for the foreseeable future one will have a hybrid approach.

Hybrid is what I am saying is the future whether we like it or not. SMP as the whole system is the past. I disagree that distributed systems are more complex per se. I suspect comments are getting so general here that anything anyone writes can be seen as both true and false simultaneously. My perception is that shared memory multithreading is less and less a tool that applications programmers should be thinking in terms of. Multiple processes with an hierarchy of communications costs is the overarching architecture with each process potentially being SMP or CSP or . . .

I agree that on not too large shared memory machines a hierarchy of tasks is the correct approach. This is what I did in blip.parallel.smp. Using that, one can have fairly efficient automatic scheduling, and so forget most of the complexities and the actual hardware configuration.

Yes, microkernels & co will be more and more important (but I wonder how much this will be the case for the desktop). ABI mobility? Not so sure; for HPC I can imagine having to compile to different ABIs (but maybe that is what you mean by ABI mobility).

Again, I am not sure the situation is as dire as you paint it; Linux does quite well in the HPC field... but I agree that to be the ideal OS for these architectures it will need more changes.

The Linux driver architecture is already creaking at the seams; it implies a central monolithic approach to the operating system. This falls down in a multiprocessor shared memory context. The fact that the Top 500 generally use Linux is because it is the least worst option. M$, despite throwing large amounts of money at the problem, and indeed buying some very high profile names to try and do something about the lack of traction, has failed to make any headway in the HPC operating system stakes. Do you want to have to run a virus checker on your HPC system? My gut reaction is that we are going to see a rise of hypervisors as per Tilera chips, at least in the short to medium term, simply as a bridge from the current OSes to the future. My guess is that L4 microkernels and/or nanokernels, exokernels, etc. will find a central place in future systems. The problem to be solved is ensuring that the appropriate ABI is available on the appropriate core at the appropriate time. Mobility of ABI is the critical factor here. [ . . . ]

PGAS and MPI both have the same executable everywhere, but MPI is more flexible with respect to making different parts execute different things, and MPI does provide more generic packing/unpacking, but I guess I see your problems with it. Having the same executable is a big constraint, but it is also a simplification.

Whole array operations are useful, and when possible one gains much by using them; unfortunately not all problems can be reduced to a few large array operations, and data parallel languages are not the main type of language for these reasons.

Agreed. My point was that in 1960s code people explicitly handled array operations using do loops because they had to. Nowadays such code is anathema to efficient execution. My complaint here is that people have put effort into compiler technology instead of rewriting the codes in a better language and/or idiom. Clearly whole array operations only apply to algorithms that involve arrays! [ . . . ]

Well, whole array operations are a generalization of the SPMD approach, so in this sense you said that that kind of approach will have a future (but with more difficult optimization, as the hardware is more complex).

I guess this is where the PGAS people are challenging things. Applications can be couched in terms of array algorithms which can be scattered across distributed memory systems. Inappropriate operations lead to huge inefficiencies, but handled correctly, code runs very fast.

About MPI, I think that many don't see what MPI really does: MPI offers a simplified parallel model. The main weakness of this model is that it assumes some kind of reliability, but then it offers a clear computational model with processors ordered in a linear or higher dimensional structure and efficient collective communication primitives. Yes, MPI is not the right choice for all problems, but when usable it is very powerful, often superior to the alternatives, and programming with it is *simpler* than thinking about a generic distributed system. So I think that for problems that are not trivially parallel or easily parallelizable, MPI will remain the best choice.

I guess my main irritant with MPI is that I have to run the same executable on every node and, perhaps more importantly, the message passing structure is founded on Fortran primitive data types. OK, so you can hack up some element of abstraction so as to send complex messages, but it would be far better if the MPI standard provided better abstractions. [ . . . ]

Yes, I guess that is true.

It might be a personal thing, but I am kind of "suspicious" toward PGAS; I find a generalized MPI model better than PGAS when you want to have separated address spaces. Using MPI one can define a PGAS-like object wrapping local storage with an object that sends remote requests to access remote memory pieces. This means having a local server where these wrapped objects can be "published" and that can respond at any moment to external requests. I call this rpc (remote procedure call) and it can be realized easily on top of MPI. As not all objects are distributed, and in a complex program it does not always make sense to distribute these objects on all processors or none, I find the robust partitioning and collective communication primitives of MPI superior to PGAS. With enough effort you probably can get everything also from PGAS, but then you lose all its simplicity.

I think we are going to have to take this one off the list. My summary is that MPI and PGAS solve different problems differently. There are some problems that one can code up neatly in MPI and that are ugly in PGAS, but the converse is also true. [ . . . ]

When you have a network of things communicating (I think that once you have a distributed system you arrive at that level) then it is not sufficient anymore to think about each piece in isolation; you have to think about the interactions too. There are some patterns that might help reduce the complexity: client/server, map/reduce, ... but in general it is more complex.

The situation is not so dire: some problems are trivially parallel, or can be solved with simple parallel patterns, and others don't need to be solved in parallel, as the sequential solution is fast enough. But I do agree that being able to develop parallel systems is increasingly important. In fact it is something that I like to do, and I have thought about it a lot. I did program parallel systems, and out of my experience I tried to build something to do parallel programs "the way it should be", or at least the way I would like it to be ;)

The real question is whether future computers will run Word, OpenOffice.org, Excel, Powerpoint fast enough so that people don't complain. Everything else is an HPC ghetto :-)

The result is what I did with blip, http://dsource.org/projects/blip . I don't think that (excluding some simple examples) fully automatic (transparent) parallelization is really feasible. At some point being parallel is more complex, and it puts an extra burden on the programmer. Still, it is possible to have several levels of parallelization, and if you write a fully parallel program it should still be possible to use it relatively efficiently locally, but a local program will not automatically become fully parallel.

At the heart of all this is that programmers are taught that an algorithm is a sequence of actions to achieve a goal. Programmers are trained to think sequentially and this affects their coding. This means that parallelism has to be expressed at a sufficiently high level that programmers can still reason about algorithms as sequential things.
Nov 12 2010
Don Wrote:Sean Kelly wrote:True enough. But it's certainly more natural to think about than mutex-based concurrency, automatic parallelization, etc. In the long term there may turn out to be better models, but I don't know of one today. Also, there are other goals for such a design than increasing computation speed: decreased maintenance cost, system reliability, etc. Erlang processes are equivalent to objects in C++ or Java with the added benefit of asynchronous execution in instances where an immediate response (ie. RPC) is not required. Performance gain is a direct function of how often this is true. But even where it's not, the other benefits exist.Walter Bright Wrote:The Erlang people seem to say that a lot. The thing they omit to say, though, is that it is very, very difficult in the real world! Consider managing a team of ten people. Getting them to be ten times as productive as a single person is extremely difficult -- virtually impossible, in fact.Russel Winder wrote:Distributed programming is essentially a bunch of little sequential program that interact, which is basically how people cooperate in the real world. I think that is by far the most intuitive of any concurrent programming model, though it's still a significant conceptual shift from the traditional monolithic imperative program.At the heart of all this is that programmers are taught that algorithm is a sequence of actions to achieve a goal. Programmers are trained to think sequentially and this affects their coding. This means that parallelism has to be expressed at a sufficiently high level that programmers can still reason about algorithms as sequential things.I think it's more than being trained to think sequentially. I think it is in the inherent nature of how we think.I agree with Walter -- I don't think it's got much to do with programmer training. It's a problem that hasn't been solved in the real world in the general case.I agree. But we still need something better than the traditional approach now :-)The analogy with the real world suggests to me that there are three cases that work well: * massively parallel; * _completely_ independent tasks; and * very small teams. Large teams are a management nightmare, and I see no reason to believe that wouldn't hold true for a large number of cores as well.Back when the Java OS was announced I envisioned a modular system backed by a database of objects serving different functions. Kind of like the old OpenDoc model, but at an OS level. It clearly didn't work out this way, but I'd be interested to see something along these lines. I honestly couldn't say whether apps would turn out to be easier or more difficult to create in such an environment though.
Nov 13 2010
True enough. But it's certainly more natural to think about than mutex-based concurrency, automatic parallelization, etc. In the long term there may turn out to be better models, but I don't know of one today. Also, there are other goals for such a design than increasing computation speed: decreased maintenance cost, system reliability, etc. Erlang processes are equivalent to objects in C++ or Java with the added benefit of asynchronous execution in instances where an immediate response (ie. RPC) is not required. Performance gain is a direct function of how often this is true. But even where it's not, the other benefits exist.I like that description! Casey
Nov 13 2010