
digitalmars.D - Re: RFC: naming for FrontTransversal and Transversal ranges

reply bearophile <bearophileHUGS lycos.com> writes:
Bill Baxter:
 Much more often the discussion on the numpy list takes the form of
 "how do I make this loop faster" becuase loops are slow in Python so
 you have to come up with clever transformations to turn your loop into
 array ops.  This is thankfully a problem that D array libs do not
 have.  If you think of it as a loop, go ahead and implement it as a
 loop.

Sigh! Already today, and even more tomorrow, this is often false for D too. On my computer I have a cheap GPU that is sleeping while my D code runs. Even my other core sleeps. And I am using one core at 32 bits only. You will need ways to data-parallelize and other forms of parallel processing. So maybe normal loops will not cut it. I think the best way to go is to follow the ideas of the Chapel language, which I have shown a bit here: http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.D&article_id=87311 Bye, bearophile
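
As a minimal sketch of the data-parallel direction described above (assuming D2 with the std.parallelism module; the array and the per-element work are just placeholders), the same loop can be written serially or handed to a task pool:

import std.math : sqrt;
import std.parallelism : taskPool;

void main()
{
    auto data = new double[1_000_000];

    // Plain loop: a single core does all the work.
    foreach (i, ref x; data)
        x = sqrt(cast(double) i);

    // Same loop, data-parallel: taskPool.parallel splits the iterations
    // across the worker threads of the default task pool.
    foreach (i, ref x; taskPool.parallel(data))
        x = sqrt(cast(double) i);
}

Whether this pays off depends on the workload; for per-iteration work this cheap, the scheduling overhead can easily dominate.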
May 01 2009
next sibling parent reply Bill Baxter <wbaxter gmail.com> writes:
On Fri, May 1, 2009 at 5:36 PM, bearophile <bearophileHUGS lycos.com> wrote:
 Bill Baxter:
 Much more often the discussion on the numpy list takes the form of
 "how do I make this loop faster" becuase loops are slow in Python so
 you have to come up with clever transformations to turn your loop into
 array ops. =A0This is thankfully a problem that D array libs do not
 have. =A0If you think of it as a loop, go ahead and implement it as a
 loop.

Sigh! Already today, and even more tomorrow, this is often false for D too. On my computer I have a cheap GPU that is sleeping while my D code runs. Even my other core sleeps. And I am using one core at 32 bits only. You will need ways to data-parallelize and other forms of parallel processing. So maybe normal loops will not cut it.

Yeh. If you want to use multiple cores you've got a whole 'nother can o' worms. But at least I find that today most apps seem to get by just fine using a single core. Strange though, aren't you the guy always telling us how being able to express your algorithm clearly is often more important than raw performance? --bb
May 01 2009
parent reply Don <nospam nospam.com> writes:
Bill Baxter wrote:
 On Fri, May 1, 2009 at 5:36 PM, bearophile <bearophileHUGS lycos.com> wrote:
 Bill Baxter:
 Much more often the discussion on the numpy list takes the form of
 "how do I make this loop faster" becuase loops are slow in Python so
 you have to come up with clever transformations to turn your loop into
 array ops.  This is thankfully a problem that D array libs do not
 have.  If you think of it as a loop, go ahead and implement it as a
 loop.

You will need ways to data-parallelize and other forms of parallel processing. So maybe normal loops will not cut it.

Yeh. If you want to use multiple cores you've got a whole 'nother can o' worms. But at least I find that today most apps seem to get by just fine using a single core. Strange though, aren't you the guy always telling us how being able to express your algorithm clearly is often more important than raw performance? --bb

I confess to being mighty skeptical about the whole multi-threaded, multi-core thing. I think we're going to find that there are only two practical uses of multi-core: (1) embarrassingly-parallel operations; and (2) process-level concurrency. I just don't believe that apps have as much opportunity for parallelism as people seem to think. There are just too many dependencies. Sure, you can (say) with a game split your AI onto a separate core from your graphics stuff, but that's only applicable for 2-4 cores. It doesn't work for 100+ cores. (Which is why I think that broadening the opportunity for case (1) is the most promising avenue for actually using a host of cores).
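
To make case (1) concrete, here is a minimal D sketch (again assuming the std.parallelism module; the squaring function is only a stand-in for any per-element computation with no shared state):

import std.parallelism : taskPool;

void main()
{
    auto input = new double[1_000_000];

    // Embarrassingly parallel: every output element depends only on the
    // corresponding input element, so there are no dependencies between
    // iterations and nothing to synchronize. taskPool.amap evaluates the
    // function across the pool and returns a newly allocated result array.
    auto output = taskPool.amap!((double x) => x * x + 1.0)(input);
}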
May 02 2009
parent reply Georg Wrede <georg.wrede iki.fi> writes:
Don wrote:
 Bill Baxter wrote:
 On Fri, May 1, 2009 at 5:36 PM, bearophile <bearophileHUGS lycos.com> wrote:
 Bill Baxter:
 Much more often the discussion on the numpy list takes the form of
 "how do I make this loop faster" becuase loops are slow in Python so
 you have to come up with clever transformations to turn your loop into
 array ops.  This is thankfully a problem that D array libs do not
 have.  If you think of it as a loop, go ahead and implement it as a
 loop.

Sigh! Already today, and even more tomorrow, this is often false for D too. On my computer I have a cheap GPU that is sleeping while my D code runs. Even my other core sleeps. And I am using one core at 32 bits only. You will need ways to data-parallelize and other forms of parallel processing. So maybe normal loops will not cut it.

Yeh. If you want to use multiple cores you've got a whole 'nother can o' worms. But at least I find that today most apps seem to get by just fine using a single core. Strange though, aren't you the guy always telling us how being able to express your algorithm clearly is often more important than raw performance? --bb

I confess to being mighty skeptical about the whole multi-threaded, multi-core thing. I think we're going to find that there are only two practical uses of multi-core: (1) embarrassingly-parallel operations; and (2) process-level concurrency. I just don't believe that apps have as much opportunity for parallelism as people seem to think. There are just too many dependencies. Sure, you can (say) with a game split your AI onto a separate core from your graphics stuff, but that's only applicable for 2-4 cores. It doesn't work for 100+ cores.

I had this bad dream where there's a language in which it's trivial to use multiple CPUs. And I could see every Joe and John executing their trivial apps, each of which used all available CPUs. They had their programs and programlets run two or four times as fast, but most of them ran in less than a couple of seconds anyway, and the longer ones spent most of their time waiting for external resources. All it ended up with was a lot of work for the OS, the total throughput of the computer decreasing because now every CPU had to deal with every process, not to mention the increase in electricity consumption and heat because none of the CPUs could rest. And still nobody was using the GPU, MMX, SSE, etc. Most of these programs consisted of sequences, with the odd selection or short iteration few and far between. And none of them used parallelizable data.
 (Which is why I think that broadening the opportunity for case (1) is 
 the most promising avenue for actually using a host of cores).

The more I think about it, the more I'm starting to believe that the average desktop or laptop won't see two dozen cores in the immediate future. And definitely, by the time there are more cores than processes on the average Windows PC, we're talking about gross wastage. OTOH, Serious Computing is different, of course. Corporate machine rooms would benefit from many cores. Virtual host servers, heavy-duty web servers, and of course scientific and statistical computing come to mind. It's interesting to note that in the old days, machine-room computers were totally different from PCs. Then they sort of converged, with machine rooms all of a sudden filling up with regular PCs running Linux. And now I see the trend separating the PC from the machine-room computer again. Software for the latter might be the target for language features that utilize multiple CPUs.
May 02 2009
parent reply bearophile <bearophileHUGS lycos.com> writes:
Georg Wrede:
 The more I think about it, the more I'm starting to believe that the 
 average desktop or laptop won't see two dozen cores in the immediate 
 future.

Too late. My personal computer already has about 100 small cores in its GPU, and with CUDA (and soon OpenCL) you can even use them for almost general-purpose computation... And then comes Nehalem: http://en.wikipedia.org/wiki/Intel_Nehalem_(microarchitecture) Bye, bearophile
May 02 2009
parent Georg Wrede <georg.wrede iki.fi> writes:
bearophile wrote:
 Georg Wrede:
 The more I think about it, the more I'm starting to believe that the 
 average desktop or laptop won't see two dozen cores in the immediate 
 future.

Too late. My personal computer already has about 100 small cores in its GPU, and with CUDA (and soon OpenCL) you can even use them for almost general-purpose computation... And then comes Nehalem: http://en.wikipedia.org/wiki/Intel_Nehalem_(microarchitecture)

From http://en.wikipedia.org/wiki/Simultaneous_multithreading: The latest MIPS architecture designs include an SMT system known as "MIPS MT". MIPS MT provides for both heavyweight virtual processing elements and lighter-weight hardware microthreads. RMI, a Cupertino-based startup, is the first MIPS vendor to provide a processor SoC based on 8 cores, each of which runs 4 threads. The threads can be run in fine-grained mode, where a different thread can be executed each cycle. The threads can also be assigned priorities. I hope this stuff is for the machine room. :-) And the recession helps...
May 02 2009
prev sibling next sibling parent "Robert Jacques" <sandford jhu.edu> writes:
On Fri, 01 May 2009 21:14:54 -0400, Bill Baxter <wbaxter gmail.com> wrote:

 On Fri, May 1, 2009 at 5:36 PM, bearophile <bearophileHUGS lycos.com> wrote:
 Bill Baxter:
 Much more often the discussion on the numpy list takes the form of
 "how do I make this loop faster" becuase loops are slow in Python so
 you have to come up with clever transformations to turn your loop into
 array ops.  This is thankfully a problem that D array libs do not
 have.  If you think of it as a loop, go ahead and implement it as a
 loop.

Sigh! Already today, and even more tomorrow, this is often false for D too. On my computer I have a cheap GPU that is sleeping while my D code runs. Even my other core sleeps. And I am using one core at 32 bits only. You will need ways to data-parallelize and other forms of parallel processing. So maybe normal loops will not cut it.

Yeh. If you want to use multiple cores you've got a whole 'nother can o' worms. But at least I find that today most apps seem to get by just fine using a single core. Strange though, aren't you the guy always telling us how being able to express your algorithm clearly is often more important than raw performance? --bb

Well, since I do GP-GPU work, the GPU algorithm is much cleaner algorithmically than the CPU algorithm. :) But I do know that this is very algorithm-dependent.
May 02 2009
prev sibling parent "Robert Jacques" <sandford jhu.edu> writes:
On Sat, 02 May 2009 04:17:29 -0400, Don <nospam nospam.com> wrote:

 Bill Baxter wrote:
 On Fri, May 1, 2009 at 5:36 PM, bearophile <bearophileHUGS lycos.com> wrote:
 Bill Baxter:
 Much more often the discussion on the numpy list takes the form of
 "how do I make this loop faster" becuase loops are slow in Python so
 you have to come up with clever transformations to turn your loop into
 array ops.  This is thankfully a problem that D array libs do not
 have.  If you think of it as a loop, go ahead and implement it as a
 loop.

Sigh! Already today, and even more tomorrow, this is often false for D too. On my computer I have a cheap GPU that is sleeping while my D code runs. Even my other core sleeps. And I am using one core at 32 bits only. You will need ways to data-parallelize and other forms of parallel processing. So maybe normal loops will not cut it.

Yeh. If you want to use multiple cores you've got a whole 'nother can o' worms. But at least I find that today most apps seem to get by just fine using a single core. Strange though, aren't you the guy always telling us how being able to express your algorithm clearly is often more important than raw performance? --bb

I confess to being mighty skeptical about the whole multi-threaded, multi-core thing. I think we're going to find that there are only two practical uses of multi-core: (1) embarrassingly-parallel operations; and (2) process-level concurrency. I just don't believe that apps have as much opportunity for parallelism as people seem to think. There are just too many dependencies. Sure, you can (say) with a game split your AI onto a separate core from your graphics stuff, but that's only applicable for 2-4 cores. It doesn't work for 100+ cores. (Which is why I think that broadening the opportunity for case (1) is the most promising avenue for actually using a host of cores).

Actually, AI is mostly embarrassingly parallel. The issue is the "mostly" part of that statement, which is why optimized reader/writer locks and STM are showing up in game engines. And really, >90% of the CPU time has been physics, which is both embarrassingly parallel and already being off-loaded onto the GPU and multiple cores.
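
A minimal sketch of the reader/writer-lock pattern mentioned above, using druntime's core.sync.rwmutex (the SharedWorld and WorldSnapshot names are hypothetical placeholders, not taken from any real engine, and shared/__gshared qualifiers are glossed over):

import core.sync.rwmutex : ReadWriteMutex;

// Hypothetical snapshot of game state that many AI agents read.
struct WorldSnapshot
{
    double[3][] positions;
}

class SharedWorld
{
    private ReadWriteMutex lock;
    private WorldSnapshot current;

    this()
    {
        lock = new ReadWriteMutex;
    }

    // Many AI threads can hold the reader side at the same time.
    WorldSnapshot read()
    {
        WorldSnapshot copy;
        synchronized (lock.reader)
        {
            copy = current;
        }
        return copy;
    }

    // The physics step takes the writer side exclusively while it
    // publishes a new snapshot.
    void publish(WorldSnapshot snap)
    {
        synchronized (lock.writer)
        {
            current = snap;
        }
    }
}

How much this buys obviously depends on the read/write ratio of the workload.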
May 02 2009