www.digitalmars.com         C & C++   DMDScript  

digitalmars.D - Why ranges don't return vectors?

reply Piotr Szturmaj <bncrbme jadamspam.pl> writes:
Seriously, nobody will ever get performance from single-element 
iterator/range pattern - this makes CPU stall!

Anytime you read one byte from typical hard disk, system reads full 
sector - typically 512 bytes, no more, no less. Anytime you read one 
byte from memory, CPU loads entire cache line from RAM into the cache 
(64 bytes on all modern Intel CPU's). Why not exploit that with ranges?

Ranges could potentially return arrays in front() (or in 
frontVector/whatever), so basically they will become ranges of ranges 
where the deepest range is always RandomAccessRange. This has obvious 
performance benefits. Everyone knows that traversing memory is faster 
than iteraring by popFront(). On the other hand memory lacks the 
flexibility of ranges - so lets make a hybrid range!

Advantages:
* performance (!)
* limited lookahead/backtracking

How?

1. instruction level parallelism: CPU's can execute few instructions in 
parallel given that they operate on different data (different vector 
elements)
2. SIMD: load entire vector to SSE/AVX register then run the operation 
on all elements at once
3. use of L1 cache: traversing vector in memory is way faster that 
calling popFront() for each vector element - likely if it's inlined

Disadvantages:
* potential inconvenience

I know it can't be easy drop-in addition to the current range 
implementation, but who knows...
Feb 28 2013
next sibling parent reply "bearophile" <bearophileHUGS lycos.com> writes:
Piotr Szturmaj:

 I know it can't be easy drop-in addition to the current range 
 implementation, but who knows...
Time ago I have suggested in this newsgroup that idea, naming it "vectorized lazyness", that was mostly ignored: http://web.archiveorange.com/archive/v/j3CAOE6ZVy4122g4OfhS It's also related to the idea of Segmented Iterators that I have discussed later in this same newsgroup (I think I received no answers): http://pdf.aminer.org/000/136/863/segmented_iterators_and_hierarchical_algorithms.pdf I think that so far there is only a little amount of vectorized lazyness implemented in Phobos, search for "semi-lazy parallel map" here: http://dlang.org/phobos/std_parallelism.html Bye, bearophile
Feb 28 2013
parent Piotr Szturmaj <bncrbme jadamspam.pl> writes:
bearophile wrote:
 Piotr Szturmaj:

 I know it can't be easy drop-in addition to the current range
 implementation, but who knows...
Time ago I have suggested in this newsgroup that idea, naming it "vectorized lazyness", that was mostly ignored: http://web.archiveorange.com/archive/v/j3CAOE6ZVy4122g4OfhS It's also related to the idea of Segmented Iterators that I have discussed later in this same newsgroup (I think I received no answers): http://pdf.aminer.org/000/136/863/segmented_iterators_and_hierarchical_algorithms.pdf I think that so far there is only a little amount of vectorized lazyness implemented in Phobos, search for "semi-lazy parallel map" here: http://dlang.org/phobos/std_parallelism.html
Thanks for the references. They look interesting indeed.
Mar 01 2013
prev sibling next sibling parent reply "Chris Cain" <clcain uncg.edu> writes:
On Friday, 1 March 2013 at 01:02:17 UTC, Piotr Szturmaj wrote:
 Seriously, nobody will ever get performance from single-element 
 iterator/range pattern - this makes CPU stall!
There's no reason you can't do that. In fact, some ranges already do this. Take a look at std.stdio.File.byChunk and byLine. This isn't appropriate for all ranges, though. Like arrays, for instance. You want to be able to access single elements and do things with them, otherwise the abstraction is pointless.
Feb 28 2013
parent reply "H. S. Teoh" <hsteoh quickfur.ath.cx> writes:
On Fri, Mar 01, 2013 at 03:35:23AM +0100, Chris Cain wrote:
 On Friday, 1 March 2013 at 01:02:17 UTC, Piotr Szturmaj wrote:
Seriously, nobody will ever get performance from single-element
iterator/range pattern - this makes CPU stall!
There's no reason you can't do that. In fact, some ranges already do this. Take a look at std.stdio.File.byChunk and byLine. This isn't appropriate for all ranges, though. Like arrays, for instance. You want to be able to access single elements and do things with them, otherwise the abstraction is pointless.
You can make a buffered range that reads in a large chunk of data at a time, and .front and .popFront merely move an internal pointer until it reaches the end of the buffer, then the next buffer page is loaded in. In this case, .front is very simple (just a pointer lookup) and will probably be inlined by an optimizing compiler. Just because the abstraction is reading per-element, doesn't mean the implementation has to literally do that! T -- "Maybe" is a strange word. When mom or dad says it it means "yes", but when my big brothers say it it means "no"! -- PJ jr.
Feb 28 2013
next sibling parent "Brad Anderson" <eco gnuk.net> writes:
On Friday, 1 March 2013 at 03:44:30 UTC, H. S. Teoh wrote:
 You can make a buffered range that reads in a large chunk of 
 data at a
 time, and .front and .popFront merely move an internal pointer 
 until it
 reaches the end of the buffer, then the next buffer page is 
 loaded in.
 In this case, .front is very simple (just a pointer lookup) and 
 will
 probably be inlined by an optimizing compiler.

 Just because the abstraction is reading per-element, doesn't 
 mean the
 implementation has to literally do that!


 T
Isn't this essentially what is going on in stdio when a range uses fread?
Feb 28 2013
prev sibling parent reply Piotr Szturmaj <bncrbme jadamspam.pl> writes:
H. S. Teoh wrote:
 On Fri, Mar 01, 2013 at 03:35:23AM +0100, Chris Cain wrote:
 On Friday, 1 March 2013 at 01:02:17 UTC, Piotr Szturmaj wrote:
 Seriously, nobody will ever get performance from single-element
 iterator/range pattern - this makes CPU stall!
There's no reason you can't do that. In fact, some ranges already do this. Take a look at std.stdio.File.byChunk and byLine. This isn't appropriate for all ranges, though. Like arrays, for instance. You want to be able to access single elements and do things with them, otherwise the abstraction is pointless.
Yes, I've written some buffered ranges myself, but my original post was not about buffering.
 You can make a buffered range that reads in a large chunk of data at a
 time, and .front and .popFront merely move an internal pointer until it
 reaches the end of the buffer, then the next buffer page is loaded in.
 In this case, .front is very simple (just a pointer lookup) and will
 probably be inlined by an optimizing compiler.
Inlining is not always the case.
 Just because the abstraction is reading per-element, doesn't mean the
 implementation has to literally do that!
Range abstraction is limiting some useful optimizations, mainly vectorization. Even if range internally uses a buffer and does some SIMD operations on it, it exposes only one element from this buffer, which can't be used for next SIMD operation. In other words it limits pipelining. I'm arguing that (vector) ranges of primitive types should always expose at least 16 byte vector, even if it's one element range - but then abstraction guarantees that it can be used for SSE ops. They could return slices pointing to fixed size buffers (at least 16 byte ones), then slices may be adjusted for the last iteration to specify actual number of elements in a vector. Anyway SIMD op will be done on the full vector (underlying buffer) - CPU don't care if element in a vector is garbage or not. So, in short: ranges pass these vectors between themselfs and each subsequent range mutates the same vector - possibly using SIMD.
Mar 01 2013
next sibling parent "monarch_dodra" <monarchdodra gmail.com> writes:
On Friday, 1 March 2013 at 15:38:28 UTC, Piotr Szturmaj wrote:
 Range abstraction is limiting some useful optimizations, mainly 
 vectorization. Even if range internally uses a buffer and does 
 some SIMD operations on it, it exposes only one element from 
 this buffer, which can't be used for next SIMD operation. In 
 other words it limits pipelining.
Well, why don't you write a range whose "elements" are vectors? Or that operate on vectors. The range abstraction only defines how you iterate and view a collection. YOU are the one that defines what the elements in that collection are. For example, "chunk" will do exactly that: Given a range, it transforms it into a range of sub-slices: //---- int[] arr = iota(0, 16).array(); //[0, 1, 2, ..., 16] auto chunks = std.range.chunks(arr, 4); foreach (int[] chunk ; chunks) { chunk[] *= 2; writeln(chunk); } //---- What else are you asking for? Nothing is stopping you from pipping the chunk elements with another range, btw.
Mar 01 2013
prev sibling parent reply "Chris Cain" <clcain uncg.edu> writes:
On Friday, 1 March 2013 at 15:38:28 UTC, Piotr Szturmaj wrote:
 Yes, I've written some buffered ranges myself, but my original 
 post was not about buffering.
My point wasn't about things being buffered. I was just suggesting that many ranges do return arrays, when it's appropriate. If you want to conceptualize a "single item" as a grouping of items from a range (and such a grouping may be a vector of 16 bytes if you so choose), then you are free to do so. The concept is flexible enough to allow this. However, not all ranges should return arrays. As an answer to the topic's name, the reason why ranges don't (always) return arrays is that the concept of a range is to _abstract_ operations on _any_ type of container (and even things that are not containers, such as RNGs), as long as it has certain capabilities. Returning arrays isn't helpful when you're trying to process arrays, for instance, because if you didn't want an abstract look at an array, you would have worked on the original array without the abstraction in the first place.
 Range abstraction is limiting some useful optimizations, mainly 
 vectorization. Even if range internally uses a buffer and does 
 some SIMD operations on it, it exposes only one element from 
 this buffer, which can't be used for next SIMD operation. In 
 other words it limits pipelining.

 I'm arguing that (vector) ranges of primitive types should 
 always expose at least 16 byte vector, even if it's one element 
 range - but then abstraction guarantees that it can be used for 
 SSE ops. They could return slices pointing to fixed size 
 buffers (at least 16 byte ones), then slices may be adjusted 
 for the last iteration to specify actual number of elements in 
 a vector. Anyway SIMD op will be done on the full vector 
 (underlying buffer) - CPU don't care if element in a vector is 
 garbage or not. So, in short: ranges pass these vectors between 
 themselfs and each subsequent range mutates the same vector - 
 possibly using SIMD.
Okay, sure, performance. But if you want to code for performance, there's nothing stopping you now. The idea about ranges is to unify lots of concepts. If you need lower level work and require every bit of performance, but still want to use ranges, then you can, but it should be a specialized range type that you do it on which meets your precise specification. I hope I'm explaining it clearly: ranges are more of a universal specification, and it sounds like you have a very particular set of requirements which is a subset of what ranges are supposed to cover. Creating a specialized range type that meets your exact specifications would give you better performance than a generalized range concept (even optimized) and an optimized generalized range concept would not be appropriate for all uses. Perhaps you could code up a vectorRange that does things as you want and submit it in an enhancement request. And you can show off how awesome it is so that everyone will want to use it when it's appropriate. But certainly not all ranges should be vectorRanges because most of the time we're just trying to work on singular units at a time.
Mar 01 2013
parent "deadalnix" <deadalnix gmail.com> writes:
On Friday, 1 March 2013 at 22:18:10 UTC, Chris Cain wrote:
 Okay, sure, performance. But if you want to code for 
 performance, there's nothing stopping you now. The idea about 
 ranges is to unify lots of concepts. If you need lower level 
 work and require every bit of performance, but still want to 
 use ranges, then you can, but it should be a specialized range 
 type that you do it on which meets your precise specification.
And still, some time you'd be surprised by what the compiler can do for you. Cf the discussion about SentinelRange.
Mar 04 2013
prev sibling next sibling parent Marco Leise <Marco.Leise gmx.de> writes:
Am Fri, 01 Mar 2013 02:02:16 +0100
schrieb Piotr Szturmaj <bncrbme jadamspam.pl>:

 Anytime you read one byte from typical hard disk, system reads full 
 sector - typically 512 bytes, no more, no less.
Side note: HDD manufacturers recently switched to 4 KiB sectors (8x size). Everyone who bypasses operating system buffers to do raw disk access now needs to query the system for optimal buffer alignment. -- Marco
Feb 28 2013
prev sibling parent "renoX" <renozyx gmail.com> writes:
On Friday, 1 March 2013 at 01:02:17 UTC, Piotr Szturmaj wrote:
 Anytime you read one byte from typical hard disk, system reads 
 full sector - typically 512 bytes, no more, no less.
As Marco Leise wrote: now this 4KiB in many case.. That's one big issue with performance tuning: if you hardcoded your code with 512B page, you don't get the optimal performance in a 4KiB page system..
Mar 04 2013