digitalmars.D - Why ranges don't return vectors?

Piotr Szturmaj (27/27) Feb 28 2013 Seriously, nobody will ever get performance from single-element

bearophile (14/16) Feb 28 2013 Time ago I have suggested in this newsgroup that idea, naming it

Piotr Szturmaj (2/14) Mar 01 2013 Thanks for the references. They look interesting indeed.

Chris Cain (6/8) Feb 28 2013 There's no reason you can't do that. In fact, some ranges already

H. S. Teoh (12/21) Feb 28 2013 You can make a buffered range that reads in a large chunk of data at a

Brad Anderson (3/16) Feb 28 2013 Isn't this essentially what is going on in stdio when a range
Piotr Szturmaj (18/35) Mar 01 2013 Yes, I've written some buffered ranges myself, but my original post was

monarch_dodra (19/24) Mar 01 2013 Well, why don't you write a range whose "elements" are vectors?
Chris Cain (35/53) Mar 01 2013 My point wasn't about things being buffered. I was just

deadalnix (3/9) Mar 04 2013 And still, some time you'd be surprised by what the compiler can

Marco Leise (8/10) Feb 28 2013 Side note: HDD manufacturers recently switched to 4 KiB
renoX (5/7) Mar 04 2013 As Marco Leise wrote: now this 4KiB in many case..

Piotr Szturmaj <bncrbme jadamspam.pl> writes:

Seriously, nobody will ever get performance from single-element 
iterator/range pattern - this makes CPU stall!

Anytime you read one byte from typical hard disk, system reads full 
sector - typically 512 bytes, no more, no less. Anytime you read one 
byte from memory, CPU loads entire cache line from RAM into the cache 
(64 bytes on all modern Intel CPU's). Why not exploit that with ranges?

Ranges could potentially return arrays in front() (or in 
frontVector/whatever), so basically they will become ranges of ranges 
where the deepest range is always RandomAccessRange. This has obvious 
performance benefits. Everyone knows that traversing memory is faster 
than iteraring by popFront(). On the other hand memory lacks the 
flexibility of ranges - so lets make a hybrid range!

Advantages:
* performance (!)
* limited lookahead/backtracking

How?

1. instruction level parallelism: CPU's can execute few instructions in 
parallel given that they operate on different data (different vector 
elements)
2. SIMD: load entire vector to SSE/AVX register then run the operation 
on all elements at once
3. use of L1 cache: traversing vector in memory is way faster that 
calling popFront() for each vector element - likely if it's inlined

Disadvantages:
* potential inconvenience

I know it can't be easy drop-in addition to the current range 
implementation, but who knows...

Feb 28 2013

"bearophile" <bearophileHUGS lycos.com> writes:

Piotr Szturmaj:

 I know it can't be easy drop-in addition to the current range 
 implementation, but who knows...

Time ago I have suggested in this newsgroup that idea, naming it 
"vectorized lazyness", that was mostly ignored:

http://web.archiveorange.com/archive/v/j3CAOE6ZVy4122g4OfhS

It's also related to the idea of Segmented Iterators that I have 
discussed later in this same newsgroup (I think I received no 
answers):

http://pdf.aminer.org/000/136/863/segmented_iterators_and_hierarchical_algorithms.pdf

I think that so far there is only a little amount of vectorized 
lazyness implemented in Phobos, search for "semi-lazy parallel 
map" here:
http://dlang.org/phobos/std_parallelism.html

Bye,
bearophile

Feb 28 2013

Piotr Szturmaj <bncrbme jadamspam.pl> writes:

bearophile wrote:
 Piotr Szturmaj:

 I know it can't be easy drop-in addition to the current range
 implementation, but who knows...

 Time ago I have suggested in this newsgroup that idea, naming it
 "vectorized lazyness", that was mostly ignored:

 http://web.archiveorange.com/archive/v/j3CAOE6ZVy4122g4OfhS

 It's also related to the idea of Segmented Iterators that I have
 discussed later in this same newsgroup (I think I received no answers):

 http://pdf.aminer.org/000/136/863/segmented_iterators_and_hierarchical_algorithms.pdf


 I think that so far there is only a little amount of vectorized lazyness
 implemented in Phobos, search for "semi-lazy parallel map" here:
 http://dlang.org/phobos/std_parallelism.html

Thanks for the references. They look interesting indeed.

Mar 01 2013

"Chris Cain" <clcain uncg.edu> writes:

On Friday, 1 March 2013 at 01:02:17 UTC, Piotr Szturmaj wrote:
 Seriously, nobody will ever get performance from single-element 
 iterator/range pattern - this makes CPU stall!

There's no reason you can't do that. In fact, some ranges already 
do this. Take a look at std.stdio.File.byChunk and byLine. This 
isn't appropriate for all ranges, though. Like arrays, for 
instance. You want to be able to access single elements and do 
things with them, otherwise the abstraction is pointless.

Feb 28 2013

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

On Fri, Mar 01, 2013 at 03:35:23AM +0100, Chris Cain wrote:
 On Friday, 1 March 2013 at 01:02:17 UTC, Piotr Szturmaj wrote:
Seriously, nobody will ever get performance from single-element
iterator/range pattern - this makes CPU stall!

 
 There's no reason you can't do that. In fact, some ranges already do
 this. Take a look at std.stdio.File.byChunk and byLine. This isn't
 appropriate for all ranges, though. Like arrays, for instance. You
 want to be able to access single elements and do things with them,
 otherwise the abstraction is pointless.

You can make a buffered range that reads in a large chunk of data at a
time, and .front and .popFront merely move an internal pointer until it
reaches the end of the buffer, then the next buffer page is loaded in.
In this case, .front is very simple (just a pointer lookup) and will
probably be inlined by an optimizing compiler.

Just because the abstraction is reading per-element, doesn't mean the
implementation has to literally do that!


T

-- 
"Maybe" is a strange word.  When mom or dad says it it means "yes", but
when my big brothers say it it means "no"! -- PJ jr.

Feb 28 2013

"Brad Anderson" <eco gnuk.net> writes:

On Friday, 1 March 2013 at 03:44:30 UTC, H. S. Teoh wrote:
 You can make a buffered range that reads in a large chunk of 
 data at a
 time, and .front and .popFront merely move an internal pointer 
 until it
 reaches the end of the buffer, then the next buffer page is 
 loaded in.
 In this case, .front is very simple (just a pointer lookup) and 
 will
 probably be inlined by an optimizing compiler.

 Just because the abstraction is reading per-element, doesn't 
 mean the
 implementation has to literally do that!


 T

Isn't this essentially what is going on in stdio when a range 
uses fread?

Feb 28 2013

Piotr Szturmaj <bncrbme jadamspam.pl> writes:

H. S. Teoh wrote:
 On Fri, Mar 01, 2013 at 03:35:23AM +0100, Chris Cain wrote:
 On Friday, 1 March 2013 at 01:02:17 UTC, Piotr Szturmaj wrote:
 Seriously, nobody will ever get performance from single-element
 iterator/range pattern - this makes CPU stall!

 There's no reason you can't do that. In fact, some ranges already do
 this. Take a look at std.stdio.File.byChunk and byLine. This isn't
 appropriate for all ranges, though. Like arrays, for instance. You
 want to be able to access single elements and do things with them,
 otherwise the abstraction is pointless.


Yes, I've written some buffered ranges myself, but my original post was 
not about buffering.

 You can make a buffered range that reads in a large chunk of data at a
 time, and .front and .popFront merely move an internal pointer until it
 reaches the end of the buffer, then the next buffer page is loaded in.
 In this case, .front is very simple (just a pointer lookup) and will
 probably be inlined by an optimizing compiler.

Inlining is not always the case.

 Just because the abstraction is reading per-element, doesn't mean the
 implementation has to literally do that!

Range abstraction is limiting some useful optimizations, mainly 
vectorization. Even if range internally uses a buffer and does some SIMD 
operations on it, it exposes only one element from this buffer, which 
can't be used for next SIMD operation. In other words it limits pipelining.

I'm arguing that (vector) ranges of primitive types should always expose 
at least 16 byte vector, even if it's one element range - but then 
abstraction guarantees that it can be used for SSE ops. They could 
return slices pointing to fixed size buffers (at least 16 byte ones), 
then slices may be adjusted for the last iteration to specify actual 
number of elements in a vector. Anyway SIMD op will be done on the full 
vector (underlying buffer) - CPU don't care if element in a vector is 
garbage or not. So, in short: ranges pass these vectors between 
themselfs and each subsequent range mutates the same vector - possibly 
using SIMD.

Mar 01 2013

"monarch_dodra" <monarchdodra gmail.com> writes:

On Friday, 1 March 2013 at 15:38:28 UTC, Piotr Szturmaj wrote:
 Range abstraction is limiting some useful optimizations, mainly 
 vectorization. Even if range internally uses a buffer and does 
 some SIMD operations on it, it exposes only one element from 
 this buffer, which can't be used for next SIMD operation. In 
 other words it limits pipelining.

Well, why don't you write a range whose "elements" are vectors? 
Or that operate on vectors.

The range abstraction only defines how you iterate and view a 
collection. YOU are the one that defines what the elements in 
that collection are.

For example, "chunk" will do exactly that: Given a range, it 
transforms it into a range of sub-slices:

//----
     int[] arr = iota(0, 16).array(); //[0, 1, 2, ..., 16]
     auto chunks = std.range.chunks(arr, 4);
     foreach (int[] chunk ; chunks)
     {
         chunk[] *= 2;
         writeln(chunk);
     }
//----

What else are you asking for? Nothing is stopping you from 
pipping the chunk elements with another range, btw.

Mar 01 2013

"Chris Cain" <clcain uncg.edu> writes:

On Friday, 1 March 2013 at 15:38:28 UTC, Piotr Szturmaj wrote:
 Yes, I've written some buffered ranges myself, but my original 
 post was not about buffering.

My point wasn't about things being buffered. I was just 
suggesting that many ranges do return arrays, when it's 
appropriate. If you want to conceptualize a "single item" as a 
grouping of items from a range (and such a grouping may be a 
vector of 16 bytes if you so choose), then you are free to do so. 
The concept is flexible enough to allow this.

However, not all ranges should return arrays. As an answer to the 
topic's name, the reason why ranges don't (always) return arrays 
is that the concept of a range is to _abstract_ operations on 
_any_ type of container (and even things that are not containers, 
such as RNGs), as long as it has certain capabilities. Returning 
arrays isn't helpful when you're trying to process arrays, for 
instance, because if you didn't want an abstract look at an 
array, you would have worked on the original array without the 
abstraction in the first place.

 Range abstraction is limiting some useful optimizations, mainly 
 vectorization. Even if range internally uses a buffer and does 
 some SIMD operations on it, it exposes only one element from 
 this buffer, which can't be used for next SIMD operation. In 
 other words it limits pipelining.

 I'm arguing that (vector) ranges of primitive types should 
 always expose at least 16 byte vector, even if it's one element 
 range - but then abstraction guarantees that it can be used for 
 SSE ops. They could return slices pointing to fixed size 
 buffers (at least 16 byte ones), then slices may be adjusted 
 for the last iteration to specify actual number of elements in 
 a vector. Anyway SIMD op will be done on the full vector 
 (underlying buffer) - CPU don't care if element in a vector is 
 garbage or not. So, in short: ranges pass these vectors between 
 themselfs and each subsequent range mutates the same vector - 
 possibly using SIMD.

Okay, sure, performance. But if you want to code for performance, 
there's nothing stopping you now. The idea about ranges is to 
unify lots of concepts. If you need lower level work and require 
every bit of performance, but still want to use ranges, then you 
can, but it should be a specialized range type that you do it on 
which meets your precise specification.

I hope I'm explaining it clearly: ranges are more of a universal 
specification, and it sounds like you have a very particular set 
of requirements which is a subset of what ranges are supposed to 
cover. Creating a specialized range type that meets your exact 
specifications would give you better performance than a 
generalized range concept (even optimized) and an optimized 
generalized range concept would not be appropriate for all uses.

Perhaps you could code up a vectorRange that does things as you 
want and submit it in an enhancement request. And you can show 
off how awesome it is so that everyone will want to use it when 
it's appropriate. But certainly not all ranges should be 
vectorRanges because most of the time we're just trying to work 
on singular units at a time.

Mar 01 2013

"deadalnix" <deadalnix gmail.com> writes:

On Friday, 1 March 2013 at 22:18:10 UTC, Chris Cain wrote:
 Okay, sure, performance. But if you want to code for 
 performance, there's nothing stopping you now. The idea about 
 ranges is to unify lots of concepts. If you need lower level 
 work and require every bit of performance, but still want to 
 use ranges, then you can, but it should be a specialized range 
 type that you do it on which meets your precise specification.

And still, some time you'd be surprised by what the compiler can 
do for you. Cf the discussion about SentinelRange.

Mar 04 2013

Marco Leise <Marco.Leise gmx.de> writes:

Am Fri, 01 Mar 2013 02:02:16 +0100
schrieb Piotr Szturmaj <bncrbme jadamspam.pl>:

 Anytime you read one byte from typical hard disk, system reads full 
 sector - typically 512 bytes, no more, no less.

Side note: HDD manufacturers recently switched to 4 KiB
sectors (8x size). Everyone who bypasses operating system
buffers to do raw disk access now needs to query the
system for optimal buffer alignment.

-- 
Marco

Feb 28 2013

"renoX" <renozyx gmail.com> writes:

On Friday, 1 March 2013 at 01:02:17 UTC, Piotr Szturmaj wrote:
 Anytime you read one byte from typical hard disk, system reads 
 full sector - typically 512 bytes, no more, no less.

As Marco Leise wrote: now this 4KiB in many case..
That's one big issue with performance tuning: if you hardcoded 
your code with 512B page, you don't get the optimal performance 
in a 4KiB page system..

Mar 04 2013

D Programming

C/C++ Programming

Other

digitalmars.D - Why ranges don't return vectors?