digitalmars.D - stride in slices
- DigitalDesigns (4/4) Jun 02 2018 Proposal:
- Dmitry Olshansky (2/6) Jun 02 2018 Ranges work for this and don’t need special syntax. Just saying.
- Meta (3/7) Jun 03 2018 This is exactly what std.range.stride does. The syntax [a..b;m]
- Seb (5/14) Jun 03 2018 We even have slide which is a generalization of stride (striding
- DigitalDesigns (14/23) Jun 03 2018 So, can I do
- Paul Backus (4/17) Jun 04 2018 You can use std.algorithm.iteration.each to modify a range
- Steven Schveighoffer (8/37) Jun 04 2018 You are confusing range and array syntax. All of what you want exists,
- Dennis (14/17) Jun 04 2018 I've compared the range versions with a for-loop. For integers
- Steven Schveighoffer (7/25) Jun 04 2018 Interesting!
- Dennis (5/10) Jun 04 2018 I don't know much about this either. Clang has link-time
- Johan Engelen (17/28) Jun 06 2018 Cross-module inlining is never implicitly enabled in LDC. Not
- Ethan (12/13) Jun 04 2018 Just to drive this point home.
- Meta (29/43) Jun 04 2018 Just as an aside:
- David Bennett (9/12) Jun 04 2018 When using `dmd -inline -O -release` with an extra simd benchmark
- David Bennett (7/20) Jun 04 2018 Here's a version that works in ldc:
- Andrei Alexandrescu (6/23) Jun 05 2018 BTW I've had this thought for a long time to implement stride with a
- DigitalDesigns (6/25) Jun 04 2018 This is why I wanted to make sure! I would be using it for a
- Steven Schveighoffer (8/36) Jun 05 2018 See later postings from Ethan and others. It's a matter of optimization
- DigitalDesigns (21/62) Jun 05 2018 It would be nice if testing could be done. Maybe even profiling
- Steven Schveighoffer (14/25) Jun 05 2018 Just to point it out again, Ethan's numbers:
- Timon Gehr (12/14) Jun 05 2018 This is not even close to being true for modern CPUs. There are a lot of...
- DigitalDesigns (25/41) Jun 05 2018 Those optimizations are not part of the instruction set so are
- Adam D. Ruppe (5/6) Jun 05 2018 What are they?
- DigitalDesigns (22/28) Jun 05 2018 It doesn't matter! The issue that I said was not that ranges were
- Ethan (3/8) Jun 05 2018 This is the best part. Ranges *ARE* a language semantic.
- Steven Schveighoffer (5/10) Jun 05 2018 Again, I want to stress, ranges are not "as slow as possible", and it's
- Ethan (36/37) Jun 05 2018 ...
- DigitalDesigns (11/48) Jun 05 2018 Ok asshat! You still don't get it! I didn't say ranges would not
- Steven Schveighoffer (16/25) Jun 05 2018 Nope, he doesn't. Look at what you said:
- DigitalDesigns (34/61) Jun 05 2018 No, you have shown a few fucking cases, why are you guys
- aberba (4/9) Jun 05 2018 Does ranges not evaluate lazily on some cases. So it'll avoid
- Steven Schveighoffer (5/8) Jun 06 2018 It's "accept" not "except". You need to *accept* my world as authority.
- DigitalDesigns (10/19) Jun 07 2018 This is genius!! This made my day! I feel 40 years younger!!! It
- Ethan (3/4) Jun 05 2018 Take it to reddit. Back your arguments up with actual knowledge
- DigitalDesigns (4/8) Jun 05 2018 You are an idiot! You obviously do not understand basic logic.
- Ethan (3/4) Jun 05 2018 Take it to reddit. Back your arguments up with actual knowledge
- Timon Gehr (38/93) Jun 05 2018 I was responding to claims that for loops are basically a wrapper on
On Saturday, 2 June 2018 at 18:49:51 UTC, DigitalDesigns wrote:Proposal: [a..b;m] could be good chars.Ranges work for this and don’t need special syntax. Just saying.
Jun 02 2018
On Saturday, 2 June 2018 at 18:49:51 UTC, DigitalDesigns wrote:Proposal: [a..b;m] could be good chars.This is exactly what std.range.stride does. The syntax [a..b;m] directly translates to [a..b].stride(m).
Jun 03 2018
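Meta's equivalence, as a minimal sketch (the array contents and bounds here are invented for illustration):

```d
import std.algorithm.comparison : equal;
import std.range : stride;

void main()
{
    int[] arr = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9];
    // The proposed arr[1..8;3] would read every 3rd element of arr[1 .. 8]:
    assert(arr[1 .. 8].stride(3).equal([1, 4, 7]));
}
```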
On Sunday, 3 June 2018 at 07:30:56 UTC, Meta wrote:On Saturday, 2 June 2018 at 18:49:51 UTC, DigitalDesigns wrote:We even have slide which is a generalization of stride (striding is sliding with a window size of 1): [a..b].slide(1, m) https://dlang.org/phobos/std_range.html#slideProposal: [a..b;m] could be good chars.This is exactly what std.range.stride does. The syntax [a..b;m] directly translates to [a..b].stride(m).
Jun 03 2018
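A sketch of the stride/slide relationship described above (values invented for illustration). Note that slide(1, m) yields one-element windows rather than bare elements, so the two only compare equal after flattening the windows:

```d
import std.algorithm.comparison : equal;
import std.algorithm.iteration : joiner;
import std.range : slide, stride;

void main()
{
    int[] arr = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9];
    auto strided = arr.stride(3); // 0, 3, 6, 9
    auto slid = arr.slide(1, 3);  // [0], [3], [6], [9]
    assert(strided.equal(slid.joiner));
}
```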
On Sunday, 3 June 2018 at 07:30:56 UTC, Meta wrote:On Saturday, 2 June 2018 at 18:49:51 UTC, DigitalDesigns wrote:So, can I do X[a..b].stride(m) = 0; ? Just curious because if it is exactly the same notion then I should be able to do it, right? Of course, I'm sure another hoop could be created to jump through and it will work, will it still be exactly the same though? If there is an efficient and optimal setting so one could get the same effect, then I guess it might be a direct translation. If not then it isn't. What I am looking for is a sort of zeromemory or memset with stride. It should not allocate a new array or be significantly slower. I'd like some proof that they are "equivalent" such as a disassembly or a profiling... just to satisfy my own skepticism.Proposal: [a..b;m] could be good chars.This is exactly what std.range.stride does. The syntax [a..b;m] directly translates to [a..b].stride(m).
Jun 03 2018
On Sunday, 3 June 2018 at 11:13:52 UTC, DigitalDesigns wrote:On Sunday, 3 June 2018 at 07:30:56 UTC, Meta wrote:You can use std.algorithm.iteration.each to modify a range in-place: https://run.dlang.io/is/2jjZHhOn Saturday, 2 June 2018 at 18:49:51 UTC, DigitalDesigns wrote:So, can I do X[a..b].stride(m) = 0;Proposal: [a..b;m] could be good chars.This is exactly what std.range.stride does. The syntax [a..b;m] directly translates to [a..b].stride(m).
Jun 04 2018
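Paul's each suggestion, sketched for a mutable byte array (the sizes and stride amount are invented for illustration); a ref lambda writes through to the underlying elements, so nothing is allocated:

```d
import std.algorithm.iteration : each;
import std.range : stride;

void main()
{
    auto X = new ubyte[](12);
    X[] = 0xFF;
    // Zero every 3rd byte of the slice, in place:
    X[0 .. 12].stride(3).each!((ref b) => b = 0);
    assert(X[0] == 0 && X[3] == 0 && X[1] == 0xFF);
}
```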
On 6/3/18 7:13 AM, DigitalDesigns wrote:On Sunday, 3 June 2018 at 07:30:56 UTC, Meta wrote:X[a..b].stride(m).fill(0);On Saturday, 2 June 2018 at 18:49:51 UTC, DigitalDesigns wrote:So, can I do X[a..b].stride(m) = 0;Proposal: [a..b;m] good chars.This is exactly what std.range.stride does. The syntax [a..b;m] directly translates to [a..b].stride(m).? Just curious because if it is exactly the same notion then I should be able to do it, right?You are confusing range and array syntax. All of what you want exists, but isn't necessarily using the same syntax.Of course, I'm sure another hoop could be created to jump through and it will work, will it still be exactly the same though? If there is an efficient and optimal setting so one could get the same effect, then I guess it might be a direct translation. If not then it isn't. What I am looking for is a sort of zeromemory or memset with stride. It should not allocate a new array or be significantly slower. I'd like some proof that they are "equivalent" such as a disassembly or a profiling... just to satisfy my own skepticism.fill is what you are looking for. Note, it's not going to necessarily be as efficient, but it's likely to be close. -Steve
Jun 04 2018
On Monday, 4 June 2018 at 15:43:20 UTC, Steven Schveighoffer wrote:Note, it's not going to necessarily be as efficient, but it's likely to be close. -SteveI've compared the range versions with a for-loop. For integers and longs or high stride amounts the time is roughly equal, but for bytes with low stride amounts it can be up to twice as slow. https://run.dlang.io/is/BoTflQ 50 Mb array, type = byte, stride = 3, compiler = LDC -O4 -release For-loop 18 ms Fill(0) 33 ms each! 33 ms With stride = 13: For-loop 7.3 ms Fill(0) 7.5 ms each! 7.8 ms
Jun 04 2018
On 6/4/18 1:40 PM, Dennis wrote:On Monday, 4 June 2018 at 15:43:20 UTC, Steven Schveighoffer wrote:Interesting! BTW, do you have cross-module inlining on? I wonder if that makes a difference if you didn't have it on before. (I'm somewhat speaking from ignorance, as I've heard people talk about this limitation, but am not sure exactly when it's enabled) -SteveNote, it's not going to necessarily be as efficient, but it's likely to be close.I've compared the range versions with a for-loop. For integers and longs or high stride amounts the time is roughly equal, but for bytes with low stride amounts it can be up to twice as slow. https://run.dlang.io/is/BoTflQ 50 Mb array, type = byte, stride = 3, compiler = LDC -O4 -release For-loop 18 ms Fill(0) 33 ms each! 33 ms With stride = 13: For-loop 7.3 ms Fill(0) 7.5 ms each! 7.8 ms
Jun 04 2018
On Monday, 4 June 2018 at 18:11:47 UTC, Steven Schveighoffer wrote:BTW, do you have cross-module inlining on? I wonder if that makes a difference if you didn't have it on before. (I'm somewhat speaking from ignorance, as I've heard people talk about this limitation, but am not sure exactly when it's enabled)I don't know much about this either. Clang has link-time optimization with -O4, but looking at the --help of LDC it turns out -O4 is equivalent to -O3 for D. Maybe someone else knows?
Jun 04 2018
On Monday, 4 June 2018 at 18:47:02 UTC, Dennis wrote:On Monday, 4 June 2018 at 18:11:47 UTC, Steven Schveighoffer wrote:Cross-module inlining is never implicitly enabled in LDC. Not having it enabled is definitely something that hurts performance of LDC-generated code. Enable it with: `-enable-cross-module-inlining`. (May lead to missing symbols during linking when using templates with __FILE__ arguments.)BTW, do you have cross-module inlining on? I wonder if that makes a difference if you didn't have it on before. (I'm somewhat speaking from ignorance, as I've heard people talk about this limitation, but am not sure exactly when it's enabled)I don't know much about this either. Clang has link-time optimization with -O4, but looking at the --help of LDC it turns out -O4 is equivalent to -O3 for D. Maybe someone else knows?Clang and LDC treat `-O4` as `-O3`. To enable LTO (cross-module inlining and other big perf gains), you have to use `-flto=full` or `-flto=thin`, but it'll only give cross-module inlining for modules that have been compiled with it. Notably: default Phobos/druntime is _not_ compiled with LTO enabled. LDC 1.9.0 release packages on Github ship with a second set of LTO Phobos/druntime libs. With LDC 1.9.0, you can do `-flto=<thin|full> -defaultlib=phobos2-ldc-lto,druntime-ldc-lto`, for maximum druntime/Phobos inlining. -Johan
Jun 06 2018
On Monday, 4 June 2018 at 18:11:47 UTC, Steven Schveighoffer wrote:BTW, do you have cross-module inlining on?Just to drive this point home. https://run.dlang.io/is/nrdzb0 Manually implemented stride and fill with everything forced inline. Otherwise, the original code is unchanged. 17 ms, 891 μs, and 6 hnsecs 15 ms, 694 μs, and 1 hnsec 15 ms, 570 μs, and 9 hnsecs My new stride outperformed std.range stride, and the manual for-loop. And, because the third test uses the new stride, it also benefited. But interestingly runs ever so slightly faster...
Jun 04 2018
On Monday, 4 June 2018 at 23:08:17 UTC, Ethan wrote:On Monday, 4 June 2018 at 18:11:47 UTC, Steven Schveighoffer wrote:Just as an aside: ... pragma( inline ) property length() const { return range.length / strideCount; } pragma( inline ) property empty() const { return currFront > range.length; } pragma( inline ) property ref Elem front() { return range[ currFront ]; } pragma( inline ) void popFront() { currFront += strideCount; } ... pragma( inline ) auto stride( Range )( Range r, int a ) ... pragma( inline ) auto fill( Range, Value )( Range r, Value v ) ... pragma(inline), without any argument, does not force inlining. It actually does nothing; it just specifies that the "implementation's default behaviour" should be used. You have to annotate with pragma(inline, true) to force inlining (https://dlang.org/spec/pragma.html#inline). When I change all the pragma(inline) to pragma(inline, true), there is a non-trivial speedup: 14 ms, 517 μs, and 9 hnsecs 13 ms, 110 μs, and 1 hnsec 13 ms, 199 μs, and 9 hnsecs There's further reductions using ldc-beta: 14 ms, 520 μs, and 4 hnsecs 13 ms, 87 μs, and 2 hnsecs 12 ms, 938 μs, and 8 hnsecsBTW, do you have cross-module inlining on?Just to drive this point home. https://run.dlang.io/is/nrdzb0 Manually implemented stride and fill with everything forced inline. Otherwise, the original code is unchanged. 17 ms, 891 μs, and 6 hnsecs 15 ms, 694 μs, and 1 hnsec 15 ms, 570 μs, and 9 hnsecs My new stride outperformed std.range stride, and the manual for-loop. And, because the third test uses the new stride, it also benefited. But interestingly runs every so slightly faster...
Jun 04 2018
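The distinction Meta is pointing at, in a two-line sketch:

```d
// pragma(inline) with no argument merely restores the implementation's
// default inlining behaviour; it does not force anything.
pragma(inline)       int addDefault(int x) { return x + 1; }

// pragma(inline, true) actually forces inlining (and is an error if the
// compiler cannot inline the function).
pragma(inline, true) int addForced(int x)  { return x + 1; }
```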
On Tuesday, 5 June 2018 at 03:13:05 UTC, Meta wrote:14 ms, 520 μs, and 4 hnsecs 13 ms, 87 μs, and 2 hnsecs 12 ms, 938 μs, and 8 hnsecsWhen using `dmd -inline -O -release` with an extra simd benchmark I get: for loop: 21 ms, 291 μs, and 6 hnsecs stride/fill: 64 ms, 927 μs, and 9 hnsecs stride/each: 52 ms, 740 μs, and 8 hnsecs simd &=: 6 ms, 900 μs, and 8 hnsecs https://run.dlang.io/gist/5fe73cbf9943aa57be1101e597bb25e4?args=-inline%20-O%20-release Though the simd version does not work in ldc...
Jun 04 2018
On Tuesday, 5 June 2018 at 05:05:47 UTC, David Bennett wrote:On Tuesday, 5 June 2018 at 03:13:05 UTC, Meta wrote:Here's a version that works in ldc: https://run.dlang.io/gist/1d4bb542427fb82cc455fe9dc30185d7?compiler=ldc&args=-inline%20-O4%20-release for loop: 16 ms, 594 μs, and 1 hnsec stride/fill: 14 ms, 918 μs, and 9 hnsecs stride/each: 14 ms and 813 μs simd &=: 7 ms, 153 μs, and 6 hnsecs14 ms, 520 μs, and 4 hnsecs 13 ms, 87 μs, and 2 hnsecs 12 ms, 938 μs, and 8 hnsecsWhen using `dmd -inline -O -release` with an extra simd benchmark I get: for loop: 21 ms, 291 μs, and 6 hnsecs stride/fill: 64 ms, 927 μs, and 9 hnsecs stride/each: 52 ms, 740 μs, and 8 hnsecs simd &=: 6 ms, 900 μs, and 8 hnsecs https://run.dlang.io/gist/5fe73cbf9943aa57be1101e597bb25e4?args=-inline%20-O%20-release Though the simd version does not work in ldc...
Jun 04 2018
On 06/04/2018 07:08 PM, Ethan wrote:On Monday, 4 June 2018 at 18:11:47 UTC, Steven Schveighoffer wrote:BTW I've had this thought for a long time to implement stride with a compile-time step... never got around to implementing it. It would easily generalize the existing code without too much work. Essentially the step would be a template parameter; if that is 0, then use a run-time stride. Most of the code works unchanged.BTW, do you have cross-module inlining on?Just to drive this point home. https://run.dlang.io/is/nrdzb0 Manually implemented stride and fill with everything forced inline. Otherwise, the original code is unchanged. 17 ms, 891 μs, and 6 hnsecs 15 ms, 694 μs, and 1 hnsec 15 ms, 570 μs, and 9 hnsecs My new stride outperformed std.range stride, and the manual for-loop. And, because the third test uses the new stride, it also benefited. But interestingly runs every so slightly faster...
Jun 05 2018
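A rough sketch of that idea (hypothetical, not in Phobos): make the step a template parameter so the compiler sees a constant; a step of 0 could dispatch to the existing run-time stride, which this sketch omits for brevity:

```d
import std.range.primitives : hasLength, isRandomAccessRange;

struct CompileTimeStride(Range, size_t step)
{
    Range source;
    size_t index;

    @property bool empty() { return index >= source.length; }
    @property auto ref front() { return source[index]; }
    void popFront() { index += step; } // step is a compile-time constant here
}

auto strideCT(size_t step, Range)(Range r)
    if (step > 0 && isRandomAccessRange!Range && hasLength!Range)
{
    return CompileTimeStride!(Range, step)(r);
}

void main()
{
    import std.algorithm.comparison : equal;
    int[] a = [0, 1, 2, 3, 4, 5, 6];
    assert(a.strideCT!3.equal([0, 3, 6]));
}
```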
On Monday, 4 June 2018 at 17:40:57 UTC, Dennis wrote:On Monday, 4 June 2018 at 15:43:20 UTC, Steven Schveighoffer wrote:This is why I wanted to make sure! I would be using it for a stride of 2 and it seems it might have doubled the cost for no other reason than using ranges. Ranges are great but one can't reason about what is happening in them as easily as a direct loop so I wanted to be sure. Thanks for running the test!Note, it's not going to necessarily be as efficient, but it's likely to be close. -SteveI've compared the range versions with a for-loop. For integers and longs or high stride amounts the time is roughly equal, but for bytes with low stride amounts it can be up to twice as slow. https://run.dlang.io/is/BoTflQ 50 Mb array, type = byte, stride = 3, compiler = LDC -O4 -release For-loop 18 ms Fill(0) 33 ms each! 33 ms With stride = 13: For-loop 7.3 ms Fill(0) 7.5 ms each! 7.8 ms
Jun 04 2018
On 6/4/18 5:52 PM, DigitalDesigns wrote:On Monday, 4 June 2018 at 17:40:57 UTC, Dennis wrote:See later postings from Ethan and others. It's a matter of optimization being able to see the "whole thing". This is why for loops are sometimes better. It's not inherent with ranges, but if you use the right optimization flags, it's done as fast as if you hand-wrote it. What I've found with D (and especially LDC) is that when you give the compiler everything to work with, it can do some seemingly magic things. -SteveOn Monday, 4 June 2018 at 15:43:20 UTC, Steven Schveighoffer wrote:This is why I wanted to make sure! I would be using it for a stride of 2 and it seems it might have doubled the cost for no other reason than using ranged. Ranges are great but one can't reason about what is happening in then as easy as a direct loop so I wanted to be sure. Thanks for running the test!Note, it's not going to necessarily be as efficient, but it's likely to be close. -SteveI've compared the range versions with a for-loop. For integers and longs or high stride amounts the time is roughly equal, but for bytes with low stride amounts it can be up to twice as slow. https://run.dlang.io/is/BoTflQ 50 Mb array, type = byte, stride = 3, compiler = LDC -O4 -release For-loop 18 ms Fill(0) 33 ms each! 33 ms With stride = 13: For-loop 7.3 ms Fill(0) 7.5 ms each! 7.8 ms
Jun 05 2018
On Tuesday, 5 June 2018 at 13:05:56 UTC, Steven Schveighoffer wrote:On 6/4/18 5:52 PM, DigitalDesigns wrote:It would be nice if testing could be done. Maybe even profiling in unit tests to make sure ranges are within some margin of error (10%). One of the main reasons I don't use ranges is I simply don't have faith they will be as fast as direct encoding. While they might offer a slightly easier syntax I don't know what is going on under the hood so I can't reason about it (unless I look up the source). With a for loop, it is pretty much a wrapper on internal cpu logic so it will be near as fast as possible. I suppose in the long run ranges do have the potential to outperform since they do abstract but there is no guarantee they will even come close. Having some "proof" that they are working well would ease my mind. As this thread shows, ranges have some major issues. Imagine having some code on your machine that is very performant but on another machine in slightly different circumstances it runs poorly. Now, say it is the stride issue... One normally would not think of that being an issue so one will look in other areas and could waste time. At least with direct loops you pretty much get what you see. It is very easy for ranges to be slow but more difficult for them to be fast.On Monday, 4 June 2018 at 17:40:57 UTC, Dennis wrote:See later postings from Ethan and others. It's a matter of optimization being able to see the "whole thing". This is why for loops are sometimes better. It's not inherent with ranges, but if you use the right optimization flags, it's done as fast as if you hand-wrote it. What I've found with D (and especially LDC) is that when you give the compiler everything to work with, it can do some seemingly magic things. -SteveOn Monday, 4 June 2018 at 15:43:20 UTC, Steven Schveighoffer wrote:This is why I wanted to make sure! I would be using it for a stride of 2 and it seems it might have doubled the cost for no other reason than using ranges. 
Ranges are great but one can't reason about what is happening in them as easily as a direct loop so I wanted to be sure. Thanks for running the test!Note, it's not going to necessarily be as efficient, but it's likely to be close. -SteveI've compared the range versions with a for-loop. For integers and longs or high stride amounts the time is roughly equal, but for bytes with low stride amounts it can be up to twice as slow. https://run.dlang.io/is/BoTflQ 50 Mb array, type = byte, stride = 3, compiler = LDC -O4 -release For-loop 18 ms Fill(0) 33 ms each! 33 ms With stride = 13: For-loop 7.3 ms Fill(0) 7.5 ms each! 7.8 ms
Jun 05 2018
On 6/5/18 12:50 PM, DigitalDesigns wrote:I suppose in the long run ranges do have the potential to out perform since they do abstract but there is no guarantee they will even come close. Having some "proof" that they are working well would ease my mind. As this thread shows, ranges have some major issues.Just to point it out again, Ethan's numbers: 17 ms, 891 μs, and 6 hnsecs // for loop 15 ms, 694 μs, and 1 hnsec // fill 15 ms, 570 μs, and 9 hnsecs // each I think ranges are doing just fine. You just need the cross-module inlining turned on.Imagine having some code on your machine that is very performant but on another machine in a slightly different circumstances it runs poorly. Now, say it is the stride issue... One normally would not think of that being an issue so one will look in other areas and could waste times. At least with direct loops you pretty much get what you see. It is very easy for ranges to be slow but more difficult for them to be fast.The same can be said even with direct loops. There are no guarantees of course that one type of programming style is going to outperform another on all platforms and all compilers. My experience is that things that seem like they should be slower can sometimes be faster, and vice versa. It depends on so many factors, and the best thing to do is test and observe in each situation. -Steve
Jun 05 2018
On 05.06.2018 18:50, DigitalDesigns wrote:With a for loop, it is pretty much a wrapper on internal cpu logic so it will be near as fast as possible.This is not even close to being true for modern CPUs. There are a lot of architectural and micro-architectural details that affect performance but are not visible or accessible in your for loop. If you care about performance, you will need to test anyway, as even rather sophisticated models of CPU performance don't get everything right. Also, it is often not necessary to be "as fast as possible". It is usually more helpful to figure out where the bottleneck is for your code and concentrate optimization effort there, which you can do more effectively if you can save time and effort for the remaining parts of your program by writing simple and obviously correct range-based code, which often will be fast as well.
Jun 05 2018
On Tuesday, 5 June 2018 at 18:46:41 UTC, Timon Gehr wrote:On 05.06.2018 18:50, DigitalDesigns wrote:Those optimizations are not part of the instruction set so are irrelevant. They will occur with ranges too. For loops HAVE a direct cpu semantic! Do you doubt this? Cpu's do not have range semantics. Ranges are layers on top of compiler semantics... you act like they are equivalent, they are not! All range semantics must go through the library code then to the compiler then to cpu. For loops of all major systems languages go almost directly to cpu instructions. for(int i = 0; i < N; i++) translates into either increment and loop or jump instructions. There is absolutely no reason why any decent compiler would not use what the cpu has to offer. For loops are language semantics, Ranges are library semantics. To pretend they are equivalent is wrong and no amount of justifying will make them the same. I actually do not know of any commercially viable cpu that exists without loop semantics. I also know of no commercially viable compiler that does not wrap those instructions in a for-loop (or while, or whatever) like syntax that almost maps directly to the cpu instructions.With a for loop, it is pretty much a wrapper on internal cpu logic so it will be near as fast as possible.This is not even close to being true for modern CPUs. There are a lot of architectural and micro-architectural details that affect performance but are not visible or accessible in your for loop. If you care about performance, you will need to test anyway, as even rather sophisticated models of CPU performance don't get everything right.Also, it is often not necessary to be "as fast as possible". 
It is usually more helpful to figure out where the bottleneck is for your code and concentrate optimization effort there, which you can do more effectively if you can save time and effort for the remaining parts of your program by writing simple and obviously correct range-based code, which often will be fast as well.It's also often not necessary to be "as slow as possible". I'm not asking for about generalities but specifics. It's great to make generalizations about how things should be but I would like to know how they are. Maybe in theory ranges could be more optimal than other semantics but theory never equals practice.
Jun 05 2018
On Tuesday, 5 June 2018 at 19:05:27 UTC, DigitalDesigns wrote:For loops HAVE a direct cpu semantic! Do you doubt this?What are they? And for bonus points, are they actually used by compilers? Then the final question: is the generated code any different than inlined ranges?
Jun 05 2018
On Tuesday, 5 June 2018 at 19:18:15 UTC, Adam D. Ruppe wrote:On Tuesday, 5 June 2018 at 19:05:27 UTC, DigitalDesigns wrote:It doesn't matter! The issue that I said was not that ranges were slower but that ranges exist as an abstraction on top of language semantics! That means that they can never be faster than the language itself! Anything that a range does can never be faster than doing it by hand. This is not a hard concept to understand. I know everyone wants to defend their precious ranges but what happens is irrelevant in some particular case. Ranges could be 100x faster in some specific case but it doesn't change the fact that they are an abstraction on top of the language, not in the language. I've already pointed out, and made it clear, I will do it one last time: There is no guarantee (doesn't have to be 100% by the rule of god) by the compiler that a range's performance will even come close to hand-written code or that it won't all of a sudden change when someone decides to "optimize" it and actually makes it worse! These problems are far less likely to occur in a well-established language semantic. I can't believe people are really defending library solutions as not having these issues, but I guess that is the nature of most humans.For loops HAVE a direct cpu semantic! Do you doubt this?What are they? And for bonus points, are they actually used by compilers? Then the final question: is the generated code any different than inlined ranges?
Jun 05 2018
On Tuesday, 5 June 2018 at 22:20:08 UTC, DigitalDesigns wrote:It doesn't matter! The issue that I said was not that ranges were slower but that ranges exist on an abstract on top of language semantics! that means that they can never be faster than the language itself! Anything that a range does can never be faster than doing it by hand.This is the best part. Ranges *ARE* a language semantic. https://tour.dlang.org/tour/en/basics/ranges
Jun 05 2018
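Concretely: foreach itself is specified in terms of empty/front/popFront, so any user type with those members plugs into a built-in language construct. A minimal sketch:

```d
// The compiler lowers foreach over this struct into calls to
// empty/front/popFront; no library code is involved.
struct Evens
{
    int n;
    bool empty() const { return n >= 10; }
    int front() const { return n; }
    void popFront() { n += 2; }
}

void main()
{
    int sum;
    foreach (x; Evens(0)) // visits 0, 2, 4, 6, 8
        sum += x;
    assert(sum == 20);
}
```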
On 6/5/18 3:05 PM, DigitalDesigns wrote:It's also often not necessary to be "as slow as possible". I'm not asking for about generalities but specifics. It's great to make generalizations about how things should be but I would like to know how they are. Maybe in theory ranges could be more optimal than other semantics but theory never equals practice.Again, I want to stress, ranges are not "as slow as possible", and it's clear from the numbers posted here that they are faster than for loops in practice, at least for this example. -Steve
Jun 05 2018
On Tuesday, 5 June 2018 at 19:05:27 UTC, DigitalDesigns wrote:For loops HAVE a direct cpu semantic! Do you doubt this?... Right. If you're gonna keep running your mouth off, how about looking at some disassembly then. for(auto i=0; i<a.length; i+=strideAmount) Using ldc -O4 -release for x86_64 processors, the initialiser translates to: mov byte ptr [rbp + rcx], 0 The comparison translates to: cmp r13, rcx ja .LBB0_2 And the increment and store translates to: mov byte ptr [rbp + rcx], 0 movsxd rcx, eax add eax, 3 So. It uses four of the most basic instructions you can think of: mov, cmp, j<cond>, add. Now, what might you ask are the instructions that a range compiles down to when everything is properly inlined? The initialisation, since it's a function, pulls from the stack. mov rax, qword ptr [rsp + 16] movsxd rcx, dword ptr [rsp + 32] But the comparison looks virtually identical. cmp rax, rcx jb .LBB2_4 But how does it do the add? With some register magic. movsxd rcx, edx lea edx, [rcx + r9] Now, what that looks like it's doing to me is combining the pointer load and index increment into those two instructions. One instruction less than the flat for loop. In conclusion. The semantics you talk about are literally some of the most basic instructions in computing; and that escaping the confines of a for loop for a foreach loop can let the compiler generate more efficient code than 50-year-old compsci concepts can.
Jun 05 2018
On Tuesday, 5 June 2018 at 20:07:06 UTC, Ethan wrote:On Tuesday, 5 June 2018 at 19:05:27 UTC, DigitalDesigns wrote:Ok asshat! You still don't get it! I didn't say ranges would not compile down to the same thing! Do you have trouble understanding the English language? You don't seem to get the concept where ranges are library solutions and someone can come along at any time and modify some code and WHAM! They no longer compile down to your efficient precious instructions. That is far more unlikely to occur with language semantics. Why is that so difficult for you to understand? You sure do have an attitude for someone that has difficulty with English.For loops HAVE a direct cpu semantic! Do you doubt this?... Right. If you're gonna keep running your mouth off, how about looking at some disassembly then. for(auto i=0; i<a.length; i+=strideAmount) Using ldc -O4 -release for x86_64 processors, the initialiser translates to: mov byte ptr [rbp + rcx], 0 The comparison translates to: cmp r13, rcx ja .LBB0_2 And the increment and store translates to: mov byte ptr [rbp + rcx], 0 movsxd rcx, eax add eax, 3 So. It uses four of the most basic instructions you can think of: mov, cmp, j<cond>, add. Now, what might you ask are the instructions that a range compiles down to when everything is properly inlined? The initialisation, since it's a function, pulls from the stack. mov rax, qword ptr [rsp + 16] movsxd rcx, dword ptr [rsp + 32] But the comparison looks virtually identical. cmp rax, rcx jb .LBB2_4 But how does it do the add? With some register magic. movsxd rcx, edx lea edx, [rcx + r9] Now, what that looks like it's doing to me is combining the pointer load and index increment into those two instructions. One instruction less than the flat for loop. In conclusion. 
The semantics you talk about are literally some of the most basic instructions in computing; and that escaping the confines of a for loop for a foreach loop can let the compiler generate more efficient code than 50-year-old compsci concepts can.
Jun 05 2018
On 6/5/18 5:22 PM, DigitalDesigns wrote:On Tuesday, 5 June 2018 at 20:07:06 UTC, Ethan wrote:Nope, he doesn't. Look at what you said: "Maybe in theory ranges could be more optimal than other semantics but theory never equals practice. " And now you have been shown (multiple times) that in practice ranges in fact outperform for loops. Including the assembly to prove it (which helps with this comment: "Having some "proof" that they are working well would ease my mind.") So tone down the attitude, you got what you *clearly* asked for but seem reluctant to acknowledge. Ranges are good, for loops are good too, but not as. So maybe you should just use ranges and use the correct optimization flags and call it a day? Or else use for loops and accept that even though they may not run as quickly, they are "safer" to use since some malicious coder could come along and add in sleeps inside the std.algorithm functions. -SteveIn conclusion. The semantics you talk about are literally some of the most basic instructions in computing; and that escaping the confines of a for loop for a foreach loop can let the compiler generate more efficient code than 50-year-old compsci concepts can.Ok asshat! You still don't get it! I didn't say ranges would not compile down to the same thing! Do you have trouble understanding the English language?
Jun 05 2018
On Tuesday, 5 June 2018 at 21:35:03 UTC, Steven Schveighoffer wrote:On 6/5/18 5:22 PM, DigitalDesigns wrote:No, you have shown a few fucking cases, why are you guys attacking me for being dense? You can't prove that ranges are more optimal than direct semantics! Do it! I'd like to see you try!On Tuesday, 5 June 2018 at 20:07:06 UTC, Ethan wrote:Nope, he doesn't. Look at what you said: "Maybe in theory ranges could be more optimal than other semantics but theory never equals practice. " And now you have been shown (multiple times) that in practice ranges in fact outperform for loops. Including the assembly to prove it (which helps with this comment: "Having some "proof" that they are working well would ease my mind.")In conclusion. The semantics you talk about are literally some of the most basic instructions in computing; and that escaping the confines of a for loop for a foreach loop can let the compiler generate more efficient code than 50-year-old compsci concepts can.Ok asshat! You still don't get it! I didn't say ranges would not compile down to the same thing! Do you have trouble understanding the English language?So tone down the attitude, you got what you *clearly* asked for but seem reluctant to acknowledge. Ranges are good, for loops are good too, but not as. So maybe you should just use ranges and use the correct optimization flags and call it a day? Or else use for loops and accept that even though they may not run as quickly, they are "safer" to use since some malicious coder could come along and add in sleeps inside the std.algorithm functions. -SteveWhat it seems is that a few of you are upset because I didn't bow down to your precious range semantics and ask for clarification. At first I was jumped on then someone did some tests and found out that it wasn't so rosy like everyone thought. Of course, the work around is to force optimizations that fix the problem when the problem doesn't exist in for loops. 
Then you come along and tell me that specific cases prove the general case... that is real dense. You know, it takes two to have an attitude! I asked for information regarding stride. I got the range version; it turned out to be slower in some corner case for some bizarre reason. I was then told it required optimizations (why? It is fishy that the corner case would be 200% slower for a weird edge case) and then I was told that ranges are always faster (which is what you just said, because you act like one example proves everything). Every step of the way I am told "don't worry".

You've already stepped in the shit once and you expect me to believe everything you say? Why is it so hard to have a test suite that checks the performance of range constructs instead of just getting me to believe you? Huh? Do you really think I'm supposed to believe everything any asshat says on the net just because they want me to? Back up your beliefs, it's that simple. Provide timings for all the range functions in various combinations and give me a worst-case scenario compared to their equivalent hand-coded versions. Once you do that then I will be able to make an informed decision rather than doing what you really want, which is except your world as authority regardless of the real truth.
Jun 05 2018
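The timing dispute above can be checked directly rather than argued about. Below is a minimal sketch, not the thread's original benchmark: the function names `sumFor`/`sumRange` are invented here, and the timings only mean anything with optimizations enabled (e.g. `dmd -O -inline -release` or `ldc2 -O3`).

```d
import std.algorithm.iteration : sum;
import std.array : array;
import std.datetime.stopwatch : benchmark;
import std.range : iota, stride;
import std.stdio : writeln;

// Hand-written for loop: sum every third element.
int sumFor(int[] a)
{
    int total = 0;
    for (size_t i = 0; i < a.length; i += 3)
        total += a[i];
    return total;
}

// The same computation expressed with ranges.
int sumRange(int[] a)
{
    return a.stride(3).sum;
}

void main()
{
    auto data = iota(0, 3_000).array;
    assert(sumFor(data) == sumRange(data)); // identical results either way

    // Time both versions; numbers are only meaningful with -O etc.
    auto times = benchmark!(() => sumFor(data), () => sumRange(data))(10_000);
    writeln("for loop: ", times[0], "  range: ", times[1]);
}
```

Extending this pattern over the other std.range/std.algorithm combinations one cares about, and keeping it as a regression test, is exactly the kind of evidence being asked for.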
On Tuesday, 5 June 2018 at 22:28:44 UTC, DigitalDesigns wrote:
> [...]

Don't ranges evaluate lazily in some cases? So they'll avoid unnecessary work... and be much faster and more efficient, if I'm correct.
Jun 05 2018
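The laziness asked about here is easy to demonstrate. A minimal sketch (not from the thread): `map` builds a lazy pipeline, and the mapping function only runs for the elements actually consumed.

```d
import std.algorithm.iteration : map;
import std.array : array;
import std.range : iota, take;

void main()
{
    int calls = 0;

    // Building the pipeline does no work at all: map is lazy.
    auto squares = iota(0, 1_000_000).map!((int x) { calls++; return x * x; });
    assert(calls == 0);

    // Consuming only three elements computes only three elements,
    // even though the source range has a million of them.
    auto first3 = squares.take(3).array;
    assert(first3 == [0, 1, 4]);
    assert(calls == 3);
}
```

So the efficiency win is real when only part of the result is needed; whether a fully-consumed lazy pipeline beats a hand-written loop is the separate codegen question debated elsewhere in this thread.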
On 6/5/18 6:28 PM, DigitalDesigns wrote:
> Once you do that then I will be able to make an informed decision rather than doing what you really want, which is except your world as authority regardless of the real truth.

It's "accept" not "except". You need to *accept* my world as authority. Here, this may help: https://www.vocabulary.com/articles/chooseyourwords/accept-except/

-Steve
Jun 06 2018
On Wednesday, 6 June 2018 at 14:23:29 UTC, Steven Schveighoffer wrote:
> On 6/5/18 6:28 PM, DigitalDesigns wrote:
>> Once you do that then I will be able to make an informed decision rather than doing what you really want, which is except your world as authority regardless of the real truth.
>
> It's "accept" not "except". You need to *accept* my world as authority. Here, this may help: https://www.vocabulary.com/articles/chooseyourwords/accept-except/
>
> -Steve

This is genius!! This made my day! I feel 40 years younger!!! It took me back to 3rd grade English class. I can remember when little Lissa Lou was sitting next to me on a fine Wednesday morning! She had just used a word in the wrong context and out of the blue the gestapo barged in, shoved a grenade down her throat, then tossed her out the window! It was a fun day! We all had ice cream and danced like the little Nazis we were! Those were the good ol' days! Thanks!
Jun 07 2018
On Tuesday, 5 June 2018 at 21:22:27 UTC, DigitalDesigns wrote:
> Ok asshat!

Take it to reddit. Back your arguments up with actual knowledge and intelligence, not unfounded aggression.
Jun 05 2018
On Tuesday, 5 June 2018 at 21:52:03 UTC, Ethan wrote:
> On Tuesday, 5 June 2018 at 21:22:27 UTC, DigitalDesigns wrote:
>> Ok asshat!
>
> Take it to reddit. Back your arguments up with actual knowledge and intelligence, not unfounded aggression.

You are an idiot! You obviously do not understand basic logic. Unfounded aggression? Yep, way to see how you didn't start it! Must be nice being the bully!
Jun 05 2018
On Tuesday, 5 June 2018 at 21:54:20 UTC, DigitalDesigns wrote:
> You are an idiot!

Take it to reddit. Back your arguments up with actual knowledge and intelligence, not unfounded aggression.
Jun 05 2018
On 05.06.2018 21:05, DigitalDesigns wrote:
> On Tuesday, 5 June 2018 at 18:46:41 UTC, Timon Gehr wrote:
>> On 05.06.2018 18:50, DigitalDesigns wrote:
>>> With a for loop, it is pretty much a wrapper on internal cpu logic so it will be near as fast as possible.
>>
>> This is not even close to being true for modern CPUs. There are a lot of architectural and micro-architectural details that affect performance but are not visible or accessible in your for loop. If you care about performance, you will need to test anyway, as even rather sophisticated models of CPU performance don't get everything right.
>
> Those optimizations are not part of the instruction set so are irrelevant. They will occur with ranges too. ...

I was responding to claims that for loops are basically a wrapper on internal CPU logic and nearly as fast as possible. Both of those claims were wrong.

> For loops HAVE a direct cpu semantic! Do you doubt this? ...

You'd have to define what that means. (E.g., Google currently shows no hits for "direct CPU semantics".)

> CPUs do not have range semantics. Ranges are layers on top of compiler semantics... you act like they are equivalent, they are not!

I don't understand why you bring this up nor what you think it means. The compiler takes a program and produces some machine code that has the right behavior. Performance is usually not formally specified. In terms of resulting behavior, code with explicit for loops and range-based code may have identical semantics. Which one executes faster depends on internal details of the compiler and the target architecture, and it may change over time, e.g. between compiler releases.

> All range semantics must go through the library code, then to the compiler, then to the cpu. For loops of all major systems languages go almost directly to cpu instructions. for(int i = 0; i < N; i++) translates into either increment-and-loop or jump instructions. ...

Sure, or whatever else the compiler decides to do. It might even be translated into a memcpy call.

Even if you want to restrict yourself to use only for loops, my point stands. Write maintainable code by default and let the compiler do what it does. Then optimize further in those cases where the resulting code is actually too slow. Test for performance regressions.

> There is absolutely no reason why any decent compiler would not use what the cpu has to offer. For loops are language semantics, ranges are library semantics.

Not really. Also, irrelevant.

> To pretend they are equivalent is wrong and no amount of justifying will make them the same.

Again, I don't think this point is part of this discussion.

> I actually do not know if any commercially viable cpu exists without loop semantics.

What does it mean for a CPU to have "loop semantics"? CPUs typically have an instruction pointer register and possibly some built-in instructions to manipulate said instruction pointer. x86 has some built-in loop instructions, but I think they are just there for legacy support and not actually something you want to use in performant code.

> I also know of no commercially viable compiler that does not wrap those instructions in a for loop (or while, or whatever) like syntax that almost maps directly to the cpu instructions. ...

The compiler takes your for loop and generates some machine code. I don't think there is a "commercially viable" compiler that does not sometimes do things that are not direct. And even then, there is no very simple mapping from CPU instructions to observed performance, so the entire point is a bit moot.

>> Also, it is often not necessary to be "as fast as possible". It is usually more helpful to figure out where the bottleneck is for your code and concentrate optimization effort there, which you can do more effectively if you can save time and effort for the remaining parts of your program by writing simple and obviously correct range-based code, which often will be fast as well.
>
> It's also often not necessary to be "as slow as possible".

This seems to be quoting an imaginary person. My point is that to get even faster code, you need to spend effort and often get lower maintainability. This is not always a good trade-off, in particular if the optimization does not improve performance a lot and/or the code in question is not executed very often.

> I'm not asking about generalities but specifics. It's great to make generalizations about how things should be but I would like to know how they are.

That's a bit unspecific.

> Maybe in theory ranges could be more optimal than other semantics but theory never equals practice.

I don't know who this is addressed to. My point was entirely practical.
Jun 05 2018
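The point about the language fixing only observable behavior, not the emitted instructions, can be shown concretely. The two hypothetical functions below (names invented for illustration) are semantically identical; what either one becomes — a memcpy call, a vectorized loop, or plain increment-and-jump code — is entirely up to the compiler and its flags.

```d
import std.algorithm.mutation : copy;

// Two functions with identical observable behavior. The spec constrains
// only that behavior; the compiler may lower either one however it likes.
void copyFor(int[] dst, int[] src)
{
    for (size_t i = 0; i < src.length; i++)
        dst[i] = src[i];
}

void copyRange(int[] dst, int[] src)
{
    // std.algorithm.mutation.copy(source, target)
    src.copy(dst);
}

void main()
{
    auto a = [1, 2, 3, 4];
    auto b = new int[](4);
    auto c = new int[](4);
    copyFor(b, a);
    copyRange(c, a);
    assert(b == c && b == [1, 2, 3, 4]);
}
```

Comparing the assembly the two versions produce (e.g. with `ldc2 -O2 -output-s`, or on a disassembly viewer) is the only reliable way to see what a given compiler actually chose.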