
digitalmars.D - stride in slices

reply DigitalDesigns <DigitalDesigns gmail.com> writes:
Proposal:

[a..b;m]


could be good chars.
Jun 02 2018
next sibling parent Dmitry Olshansky <dmitry.olsh gmail.com> writes:
On Saturday, 2 June 2018 at 18:49:51 UTC, DigitalDesigns wrote:
 Proposal:

 [a..b;m]


 could be good chars.
Ranges work for this and don’t need special syntax. Just saying.
Jun 02 2018
prev sibling parent reply Meta <jared771 gmail.com> writes:
On Saturday, 2 June 2018 at 18:49:51 UTC, DigitalDesigns wrote:
 Proposal:

 [a..b;m]


 could be good chars.
This is exactly what std.range.stride does. The syntax [a..b;m] directly translates to [a..b].stride(m).
Jun 03 2018
next sibling parent Seb <seb wilzba.ch> writes:
On Sunday, 3 June 2018 at 07:30:56 UTC, Meta wrote:
 On Saturday, 2 June 2018 at 18:49:51 UTC, DigitalDesigns wrote:
 Proposal:

 [a..b;m]


 could be good chars.
This is exactly what std.range.stride does. The syntax [a..b;m] directly translates to [a..b].stride(m).
We even have slide which is a generalization of stride (striding is sliding with a window size of 1): [a..b].slide(1, m) https://dlang.org/phobos/std_range.html#slide
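To make that relationship concrete, here is a small sketch (array contents invented for illustration, not from the thread):

```d
import std.range : stride, slide;
import std.algorithm.comparison : equal;
import std.algorithm.iteration : joiner;

void main()
{
    auto a = [10, 20, 30, 40, 50, 60, 70];

    // stride(3) picks every third element: 10, 40, 70
    assert(a.stride(3).equal([10, 40, 70]));

    // slide with window size 1 and step 3 yields the same elements,
    // one per single-element window; joiner flattens the windows
    assert(a.slide(1, 3).joiner.equal(a.stride(3)));
}
```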
Jun 03 2018
prev sibling parent reply DigitalDesigns <DigitalDesigns gmail.com> writes:
On Sunday, 3 June 2018 at 07:30:56 UTC, Meta wrote:
 On Saturday, 2 June 2018 at 18:49:51 UTC, DigitalDesigns wrote:
 Proposal:

 [a..b;m]


 could be good chars.
This is exactly what std.range.stride does. The syntax [a..b;m] directly translates to [a..b].stride(m).
So, can I do X[a..b].stride(m) = 0; ? Just curious, because if it is exactly the same notion then I should be able to do it, right? Of course, I'm sure another hoop could be created to jump through and it would work, but would it still be exactly the same?

If there is an efficient and optimal setting so one could get the same effect, then I guess it might be a direct translation. If not, then it isn't.

What I am looking for is a sort of zeromemory or memset with stride. It should not allocate a new array or be significantly slower. I'd like some proof that they are "equivalent", such as a disassembly or a profiling... just to satisfy my own skepticism.
Jun 03 2018
next sibling parent Paul Backus <snarwin gmail.com> writes:
On Sunday, 3 June 2018 at 11:13:52 UTC, DigitalDesigns wrote:
 On Sunday, 3 June 2018 at 07:30:56 UTC, Meta wrote:
 On Saturday, 2 June 2018 at 18:49:51 UTC, DigitalDesigns wrote:
 Proposal:

 [a..b;m]


 could be good chars.
This is exactly what std.range.stride does. The syntax [a..b;m] directly translates to [a..b].stride(m).
So, can I do X[a..b].stride(m) = 0;
You can use std.algorithm.iteration.each to modify a range in-place: https://run.dlang.io/is/2jjZHh
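For instance, a minimal in-place sketch along those lines (values made up here, not the linked snippet):

```d
import std.range : stride;
import std.algorithm.iteration : each;

void main()
{
    int[] x = [1, 2, 3, 4, 5, 6, 7, 8];

    // zero every second element of the slice in place; `ref` makes
    // the lambda write through to the underlying array
    x[0 .. 8].stride(2).each!((ref e) { e = 0; });

    assert(x == [0, 2, 0, 4, 0, 6, 0, 8]);
}
```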
Jun 04 2018
prev sibling parent reply Steven Schveighoffer <schveiguy yahoo.com> writes:
On 6/3/18 7:13 AM, DigitalDesigns wrote:
 On Sunday, 3 June 2018 at 07:30:56 UTC, Meta wrote:
 On Saturday, 2 June 2018 at 18:49:51 UTC, DigitalDesigns wrote:
 Proposal:

 [a..b;m]


 good chars.
This is exactly what std.range.stride does. The syntax [a..b;m] directly translates to [a..b].stride(m).
So, can I do X[a..b].stride(m) = 0;
X[a..b].stride(m).fill(0);
 ? Just curious because if it is exactly the same notion then I should be 
 able to do it, right?
You are confusing range and array syntax. All of what you want exists, but isn't necessarily using the same syntax.
 Of course, I'm sure another hoop could be created to jump through and it 
 will work, will it still be exactly the same though?
 
 If there is an efficient and optimal setting so one could get the same 
 effect, then I guess it might be a direct translation. If not then it 
 isn't.
 
 What I am looking for is a sort of zeromemory or memset with stride. It 
 should not allocate a new array or be significantly slower. I'd like 
 some proof that they are "equivalent" such as a disassembly or a 
 profiling... just to satisfy my own skepticism.
fill is what you are looking for. Note, it's not going to necessarily be as efficient, but it's likely to be close. -Steve
Jun 04 2018
parent reply Dennis <dkorpel gmail.com> writes:
On Monday, 4 June 2018 at 15:43:20 UTC, Steven Schveighoffer 
wrote:
 Note, it's not going to necessarily be as efficient, but it's 
 likely to be close.

 -Steve
I've compared the range versions with a for-loop. For integers and longs or high stride amounts the time is roughly equal, but for bytes with low stride amounts it can be up to twice as slow. https://run.dlang.io/is/BoTflQ

50 Mb array, type = byte, stride = 3, compiler = LDC -O4 -release
For-loop 18 ms
Fill(0)  33 ms
each!    33 ms

With stride = 13:
For-loop 7.3 ms
Fill(0)  7.5 ms
each!    7.8 ms
Jun 04 2018
next sibling parent reply Steven Schveighoffer <schveiguy yahoo.com> writes:
On 6/4/18 1:40 PM, Dennis wrote:
 On Monday, 4 June 2018 at 15:43:20 UTC, Steven Schveighoffer wrote:
 Note, it's not going to necessarily be as efficient, but it's likely 
 to be close.
 I've compared the range versions with a for-loop. For integers and 
 longs or high stride amounts the time is roughly equal, but for bytes 
 with low stride amounts it can be up to twice as slow. 
 https://run.dlang.io/is/BoTflQ
 
 50 Mb array, type = byte, stride = 3, compiler = LDC -O4 -release
 For-loop  18 ms
 Fill(0)   33 ms
 each!     33 ms
 
 With stride = 13:
 For-loop  7.3 ms
 Fill(0)   7.5 ms
 each!     7.8 ms
Interesting! BTW, do you have cross-module inlining on? I wonder if that makes a difference if you didn't have it on before. (I'm somewhat speaking from ignorance, as I've heard people talk about this limitation, but am not sure exactly when it's enabled) -Steve
Jun 04 2018
next sibling parent reply Dennis <dkorpel gmail.com> writes:
On Monday, 4 June 2018 at 18:11:47 UTC, Steven Schveighoffer 
wrote:
 BTW, do you have cross-module inlining on? I wonder if that 
 makes a difference if you didn't have it on before. (I'm 
 somewhat speaking from ignorance, as I've heard people talk 
 about this limitation, but am not sure exactly when it's 
 enabled)
I don't know much about this either. Clang has link-time optimization with -O4, but looking at the --help of LDC it turns out -O4 is equivalent to -O3 for D. Maybe someone else knows?
Jun 04 2018
parent Johan Engelen <j j.nl> writes:
On Monday, 4 June 2018 at 18:47:02 UTC, Dennis wrote:
 On Monday, 4 June 2018 at 18:11:47 UTC, Steven Schveighoffer 
 wrote:
 BTW, do you have cross-module inlining on? I wonder if that 
 makes a difference if you didn't have it on before. (I'm 
 somewhat speaking from ignorance, as I've heard people talk 
 about this limitation, but am not sure exactly when it's 
 enabled)
Cross-module inlining is never implicitly enabled in LDC. Not having it enabled definitely hurts the performance of LDC-generated code. Enable it with `-enable-cross-module-inlining`. (May lead to missing symbols during linking when using templates with __FILE__ arguments.)
 I don't know much about this either. Clang has link-time 
 optimization with -O4, but looking at the --help of LDC it 
 turns out -O4 is equivalent to -O3 for D. Maybe someone else 
 knows?
Clang and LDC treat `-O4` as `-O3`. To enable LTO (cross-module inlining and other big perf gains), you have to use `-flto=full` or `-flto=thin`, but it'll only give cross-module inlining for modules that have been compiled with it. Notably: the default Phobos/druntime is _not_ compiled with LTO enabled.

LDC 1.9.0 release packages on Github ship with a second set of LTO Phobos/druntime libs. With LDC 1.9.0, you can do `-flto=<thin|full> -defaultlib=phobos2-ldc-lto,druntime-ldc-lto` for maximum druntime/Phobos inlining.

-Johan
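Putting those flags together, the invocations would look roughly like this (assuming a hypothetical `app.d` and the LDC 1.9.0 release packages mentioned above):

```
# cross-module inlining only
ldc2 -O3 -release -enable-cross-module-inlining app.d

# thin LTO plus the LTO-built druntime/Phobos shipped with LDC 1.9.0
ldc2 -O3 -release -flto=thin -defaultlib=phobos2-ldc-lto,druntime-ldc-lto app.d
```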
Jun 06 2018
prev sibling parent reply Ethan <gooberman gmail.com> writes:
On Monday, 4 June 2018 at 18:11:47 UTC, Steven Schveighoffer 
wrote:
 BTW, do you have cross-module inlining on?
Just to drive this point home: https://run.dlang.io/is/nrdzb0

Manually implemented stride and fill with everything forced inline. Otherwise, the original code is unchanged.

17 ms, 891 μs, and 6 hnsecs
15 ms, 694 μs, and 1 hnsec
15 ms, 570 μs, and 9 hnsecs

My new stride outperformed std.range stride, and the manual for-loop. And, because the third test uses the new stride, it also benefited. But interestingly, it runs ever so slightly faster...
Jun 04 2018
next sibling parent reply Meta <jared771 gmail.com> writes:
On Monday, 4 June 2018 at 23:08:17 UTC, Ethan wrote:
 On Monday, 4 June 2018 at 18:11:47 UTC, Steven Schveighoffer 
 wrote:
 BTW, do you have cross-module inlining on?
 Just to drive this point home: https://run.dlang.io/is/nrdzb0
 
 Manually implemented stride and fill with everything forced inline. 
 Otherwise, the original code is unchanged.
 
 17 ms, 891 μs, and 6 hnsecs
 15 ms, 694 μs, and 1 hnsec
 15 ms, 570 μs, and 9 hnsecs
 
 My new stride outperformed std.range stride, and the manual for-loop. 
 And, because the third test uses the new stride, it also benefited. 
 But interestingly runs ever so slightly faster...
Just as an aside:

...
pragma( inline ) @property length() const { return range.length / strideCount; }
pragma( inline ) @property empty() const { return currFront > range.length; }
pragma( inline ) @property ref Elem front() { return range[ currFront ]; }
pragma( inline ) void popFront() { currFront += strideCount; }
...
pragma( inline ) auto stride( Range )( Range r, int a )
...
pragma( inline ) auto fill( Range, Value )( Range r, Value v )
...

pragma(inline), without any argument, does not force inlining. It actually does nothing; it just specifies that the "implementation's default behaviour" should be used. You have to annotate with pragma(inline, true) to force inlining (https://dlang.org/spec/pragma.html#inline).

When I change all the pragma(inline) to pragma(inline, true), there is a non-trivial speedup:

14 ms, 517 μs, and 9 hnsecs
13 ms, 110 μs, and 1 hnsec
13 ms, 199 μs, and 9 hnsecs

There are further reductions using ldc-beta:

14 ms, 520 μs, and 4 hnsecs
13 ms, 87 μs, and 2 hnsecs
12 ms, 938 μs, and 8 hnsecs
Jun 04 2018
parent reply David Bennett <davidbennett bravevision.com> writes:
On Tuesday, 5 June 2018 at 03:13:05 UTC, Meta wrote:
 14 ms, 520 μs, and 4 hnsecs
 13 ms, 87 μs, and 2 hnsecs
 12 ms, 938 μs, and 8 hnsecs
When using `dmd -inline -O -release` with an extra simd benchmark I get:

for loop:    21 ms, 291 μs, and 6 hnsecs
stride/fill: 64 ms, 927 μs, and 9 hnsecs
stride/each: 52 ms, 740 μs, and 8 hnsecs
simd &=:     6 ms, 900 μs, and 8 hnsecs

https://run.dlang.io/gist/5fe73cbf9943aa57be1101e597bb25e4?args=-inline%20-O%20-release

Though the simd version does not work in ldc...
Jun 04 2018
parent David Bennett <davidbennett bravevision.com> writes:
On Tuesday, 5 June 2018 at 05:05:47 UTC, David Bennett wrote:
 On Tuesday, 5 June 2018 at 03:13:05 UTC, Meta wrote:
 14 ms, 520 μs, and 4 hnsecs
 13 ms, 87 μs, and 2 hnsecs
 12 ms, 938 μs, and 8 hnsecs
 When using `dmd -inline -O -release` with an extra simd benchmark I get:
 
 for loop:    21 ms, 291 μs, and 6 hnsecs
 stride/fill: 64 ms, 927 μs, and 9 hnsecs
 stride/each: 52 ms, 740 μs, and 8 hnsecs
 simd &=:     6 ms, 900 μs, and 8 hnsecs
 
 https://run.dlang.io/gist/5fe73cbf9943aa57be1101e597bb25e4?args=-inline%20-O%20-release
 
 Though the simd version does not work in ldc...
Here's a version that works in ldc: https://run.dlang.io/gist/1d4bb542427fb82cc455fe9dc30185d7?compiler=ldc&args=-inline%20-O4%20-release

for loop:    16 ms, 594 μs, and 1 hnsec
stride/fill: 14 ms, 918 μs, and 9 hnsecs
stride/each: 14 ms and 813 μs
simd &=:     7 ms, 153 μs, and 6 hnsecs
Jun 04 2018
prev sibling parent Andrei Alexandrescu <SeeWebsiteForEmail erdani.org> writes:
On 06/04/2018 07:08 PM, Ethan wrote:
 On Monday, 4 June 2018 at 18:11:47 UTC, Steven Schveighoffer wrote:
 BTW, do you have cross-module inlining on?
 Just to drive this point home: https://run.dlang.io/is/nrdzb0
 
 Manually implemented stride and fill with everything forced inline. 
 Otherwise, the original code is unchanged.
 
 17 ms, 891 μs, and 6 hnsecs
 15 ms, 694 μs, and 1 hnsec
 15 ms, 570 μs, and 9 hnsecs
 
 My new stride outperformed std.range stride, and the manual for-loop. 
 And, because the third test uses the new stride, it also benefited. 
 But interestingly runs ever so slightly faster...
BTW I've had this thought for a long time to implement stride with a compile-time step... never got around to implementing it. It would easily generalize the existing code without too much work. Essentially the step would be a template parameter; if that is 0, then use a run-time stride. Most of the code works unchanged.
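A rough sketch of what that could look like (all names invented here, untested; just to illustrate the shape of the idea):

```d
import std.range.primitives : empty, front, popFront, isInputRange;

// The step is a template parameter; Step == 0 falls back to a run-time stride,
// so most of the code works unchanged either way.
struct StrideCT(Range, size_t Step = 0)
    if (isInputRange!Range)
{
    Range range;
    static if (Step == 0)
        size_t step;        // run-time stride
    else
        enum step = Step;   // compile-time constant the optimizer can exploit

    @property bool empty() { return range.empty; }
    @property auto ref front() { return range.front; }
    void popFront()
    {
        // advance up to `step` elements
        foreach (_; 0 .. step)
        {
            if (range.empty) return;
            range.popFront();
        }
    }
}

// convenience wrapper: strideCT!3(r) for compile-time, strideCT(r, n) for run-time
auto strideCT(size_t Step = 0, Range)(Range r, size_t runtimeStep = Step)
{
    static if (Step == 0)
        return StrideCT!(Range, 0)(r, runtimeStep);
    else
        return StrideCT!(Range, Step)(r);
}
```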
Jun 05 2018
prev sibling parent reply DigitalDesigns <DigitalDesigns gmail.com> writes:
On Monday, 4 June 2018 at 17:40:57 UTC, Dennis wrote:
 On Monday, 4 June 2018 at 15:43:20 UTC, Steven Schveighoffer 
 wrote:
 Note, it's not going to necessarily be as efficient, but it's 
 likely to be close.

 -Steve
 I've compared the range versions with a for-loop. For integers and 
 longs or high stride amounts the time is roughly equal, but for bytes 
 with low stride amounts it can be up to twice as slow. 
 https://run.dlang.io/is/BoTflQ
 
 50 Mb array, type = byte, stride = 3, compiler = LDC -O4 -release
 For-loop 18 ms
 Fill(0)  33 ms
 each!    33 ms
 
 With stride = 13:
 For-loop 7.3 ms
 Fill(0)  7.5 ms
 each!    7.8 ms
This is why I wanted to make sure! I would be using it for a stride of 2, and it seems it might have doubled the cost for no other reason than using ranges. Ranges are great, but one can't reason about what is happening in them as easily as in a direct loop, so I wanted to be sure. Thanks for running the test!
Jun 04 2018
parent reply Steven Schveighoffer <schveiguy yahoo.com> writes:
On 6/4/18 5:52 PM, DigitalDesigns wrote:
 On Monday, 4 June 2018 at 17:40:57 UTC, Dennis wrote:
 On Monday, 4 June 2018 at 15:43:20 UTC, Steven Schveighoffer wrote:
 Note, it's not going to necessarily be as efficient, but it's likely 
 to be close.

 -Steve
 I've compared the range versions with a for-loop. For integers and 
 longs or high stride amounts the time is roughly equal, but for bytes 
 with low stride amounts it can be up to twice as slow. 
 https://run.dlang.io/is/BoTflQ
 
 50 Mb array, type = byte, stride = 3, compiler = LDC -O4 -release
 For-loop  18 ms
 Fill(0)   33 ms
 each!     33 ms
 
 With stride = 13:
 For-loop  7.3 ms
 Fill(0)   7.5 ms
 each!     7.8 ms
 This is why I wanted to make sure! I would be using it for a stride of 
 2, and it seems it might have doubled the cost for no other reason than 
 using ranges. Ranges are great, but one can't reason about what is 
 happening in them as easily as in a direct loop, so I wanted to be 
 sure. Thanks for running the test!
See later postings from Ethan and others. It's a matter of optimization being able to see the "whole thing". This is why for loops are sometimes better. It's not inherent with ranges, but if you use the right optimization flags, it's done as fast as if you hand-wrote it. What I've found with D (and especially LDC) is that when you give the compiler everything to work with, it can do some seemingly magic things. -Steve
Jun 05 2018
parent reply DigitalDesigns <DigitalDesigns gmail.com> writes:
On Tuesday, 5 June 2018 at 13:05:56 UTC, Steven Schveighoffer 
wrote:
 On 6/4/18 5:52 PM, DigitalDesigns wrote:
 On Monday, 4 June 2018 at 17:40:57 UTC, Dennis wrote:
 On Monday, 4 June 2018 at 15:43:20 UTC, Steven Schveighoffer 
 wrote:
 Note, it's not going to necessarily be as efficient, but 
 it's likely to be close.

 -Steve
 I've compared the range versions with a for-loop. For integers and 
 longs or high stride amounts the time is roughly equal, but for bytes 
 with low stride amounts it can be up to twice as slow. 
 https://run.dlang.io/is/BoTflQ
 
 50 Mb array, type = byte, stride = 3, compiler = LDC -O4 -release
 For-loop  18 ms
 Fill(0)   33 ms
 each!     33 ms
 
 With stride = 13:
 For-loop  7.3 ms
 Fill(0)   7.5 ms
 each!     7.8 ms
 This is why I wanted to make sure! I would be using it for a stride of 
 2, and it seems it might have doubled the cost for no other reason than 
 using ranges. Ranges are great, but one can't reason about what is 
 happening in them as easily as in a direct loop, so I wanted to be 
 sure. Thanks for running the test!
See later postings from Ethan and others. It's a matter of optimization being able to see the "whole thing". This is why for loops are sometimes better. It's not inherent with ranges, but if you use the right optimization flags, it's done as fast as if you hand-wrote it. What I've found with D (and especially LDC) is that when you give the compiler everything to work with, it can do some seemingly magic things. -Steve
It would be nice if testing could be done. Maybe even profiling in unit tests to make sure ranges are within some margin of error (10%).

One of the main reasons I don't use ranges is that I simply don't have faith they will be as fast as direct encoding. While they might offer a slightly easier syntax, I don't know what is going on under the hood, so I can't reason about it (unless I look up the source). With a for loop, it is pretty much a wrapper on internal cpu logic, so it will be near as fast as possible. I suppose in the long run ranges do have the potential to outperform, since they do abstract, but there is no guarantee they will even come close. Having some "proof" that they are working well would ease my mind. As this thread shows, ranges have some major issues.

Imagine having some code on your machine that is very performant, but on another machine in slightly different circumstances it runs poorly. Now, say it is the stride issue... One normally would not think of that being an issue, so one will look in other areas and could waste time. At least with direct loops you pretty much get what you see. It is very easy for ranges to be slow but more difficult for them to be fast.
Jun 05 2018
next sibling parent Steven Schveighoffer <schveiguy yahoo.com> writes:
On 6/5/18 12:50 PM, DigitalDesigns wrote:

 I suppose in the long run ranges do have the potential to out perform 
 since they do abstract but there is no guarantee they will even come 
 close. Having some "proof" that they are working well would ease my 
 mind. As this thread shows, ranges have some major issues. 
Just to point it out again, Ethan's numbers:

17 ms, 891 μs, and 6 hnsecs // for loop
15 ms, 694 μs, and 1 hnsec  // fill
15 ms, 570 μs, and 9 hnsecs // each

I think ranges are doing just fine. You just need the cross-module inlining turned on.
 Imagine 
 having some code on your machine that is very performant but on another 
 machine in a slightly different circumstances it runs poorly. Now, say 
 it is the stride issue... One normally would not think of that being an 
 issue so one will look in other areas and could waste times. At least 
 with direct loops you pretty much get what you see. It is very easy for 
 ranges to be slow but more difficult for them to be fast.
The same can be said even with direct loops. There are no guarantees of course that one type of programming style is going to outperform another on all platforms and all compilers. My experience is that things that seem like they should be slower can sometimes be faster, and vice versa. It depends on so many factors, and the best thing to do is test and observe in each situation. -Steve
Jun 05 2018
prev sibling parent reply Timon Gehr <timon.gehr gmx.ch> writes:
On 05.06.2018 18:50, DigitalDesigns wrote:
 With a for loop, it is pretty much a wrapper on internal cpu logic so it 
 will be near as fast as possible.
This is not even close to being true for modern CPUs. There are a lot of architectural and micro-architectural details that affect performance but are not visible or accessible in your for loop. If you care about performance, you will need to test anyway, as even rather sophisticated models of CPU performance don't get everything right. Also, it is often not necessary to be "as fast as possible". It is usually more helpful to figure out where the bottleneck is for your code and concentrate optimization effort there, which you can do more effectively if you can save time and effort for the remaining parts of your program by writing simple and obviously correct range-based code, which often will be fast as well.
Jun 05 2018
parent reply DigitalDesigns <DigitalDesigns gmail.com> writes:
On Tuesday, 5 June 2018 at 18:46:41 UTC, Timon Gehr wrote:
 On 05.06.2018 18:50, DigitalDesigns wrote:
 With a for loop, it is pretty much a wrapper on internal cpu 
 logic so it will be near as fast as possible.
This is not even close to being true for modern CPUs. There are a lot of architectural and micro-architectural details that affect performance but are not visible or accessible in your for loop. If you care about performance, you will need to test anyway, as even rather sophisticated models of CPU performance don't get everything right.
Those optimizations are not part of the instruction set, so they are irrelevant; they will occur with ranges too. For loops HAVE a direct cpu semantic! Do you doubt this? Cpus do not have range semantics. Ranges are layers on top of compiler semantics... you act like they are equivalent; they are not! All range semantics must go through the library code, then to the compiler, then to the cpu. For loops in all major systems languages go almost directly to cpu instructions.

for(int i = 0; i < N; i++) translates into either increment-and-loop or jump instructions. There is absolutely no reason why any decent compiler would not use what the cpu has to offer.

For loops are language semantics; ranges are library semantics. To pretend they are equivalent is wrong, and no amount of justifying will make them the same. I do not know of any commercially viable cpu that lacks loop semantics, and I know of no commercially viable compiler that does not wrap those instructions in a for loop (or while, or whatever) syntax that maps almost directly to the cpu instructions.
 Also, it is often not necessary to be "as fast as possible". It 
 is usually more helpful to figure out where the bottleneck is 
 for your code and concentrate optimization effort there, which 
 you can do more effectively if you can save time and effort for 
 the remaining parts of your program by writing simple and 
 obviously correct range-based code, which often will be fast as 
 well.
It's also often not necessary to be "as slow as possible". I'm not asking about generalities but specifics. It's great to make generalizations about how things should be, but I would like to know how they are. Maybe in theory ranges could be more optimal than other semantics, but theory never equals practice.
Jun 05 2018
next sibling parent reply Adam D. Ruppe <destructionator gmail.com> writes:
On Tuesday, 5 June 2018 at 19:05:27 UTC, DigitalDesigns wrote:
 For loops HAVE a direct cpu semantic! Do you doubt this?
What are they? And for bonus points, are they actually used by compilers? Then the final question: is the generated code any different than inlined ranges?
Jun 05 2018
parent reply DigitalDesigns <DigitalDesigns gmail.com> writes:
On Tuesday, 5 June 2018 at 19:18:15 UTC, Adam D. Ruppe wrote:
 On Tuesday, 5 June 2018 at 19:05:27 UTC, DigitalDesigns wrote:
 For loops HAVE a direct cpu semantic! Do you doubt this?
What are they? And for bonus points, are they actually used by compilers? Then the final question: is the generated code any different than inlined ranges?
It doesn't matter! The issue I raised was not that ranges were slower but that ranges exist as an abstraction on top of language semantics! That means they can never be faster than the language itself! Anything a range does can never be faster than doing it by hand. This is not a hard concept to understand.

I know everyone wants to defend their precious ranges, but what happens in some particular case is irrelevant. Ranges could be 100x faster in some specific case, but it doesn't change the fact that they are an abstraction on top of the language, not in the language.

I've already pointed it out and made it clear; I will do it one last time: there is no guarantee (doesn't have to be 100%, by the rule of god) by the compiler that a range's performance will even come close to hand-written code, or that it won't all of a sudden change when someone decides to "optimize" it and actually makes it worse! These problems are far less likely to occur in a well-established language semantic. I can't believe people are really defending library solutions as not having these issues, but I guess that is the nature of most humans.
Jun 05 2018
parent Ethan <gooberman gmail.com> writes:
On Tuesday, 5 June 2018 at 22:20:08 UTC, DigitalDesigns wrote:
 It doesn't matter! The issue that I said was not that ranges 
 were slower but that ranges exist on an abstract on top of 
 language semantics! that means that they can never be faster 
 than the language itself! Anything that a range does can never 
 be faster than doing it by hand.
This is the best part. Ranges *ARE* a language semantic. https://tour.dlang.org/tour/en/basics/ranges
Jun 05 2018
prev sibling next sibling parent Steven Schveighoffer <schveiguy yahoo.com> writes:
On 6/5/18 3:05 PM, DigitalDesigns wrote:

 It's also often not necessary to be "as slow as possible". I'm not 
 asking for about generalities but specifics. It's great to make 
 generalizations about how things should be but I would like to know how 
 they are. Maybe in theory ranges could be more optimal than other 
 semantics but theory never equals practice.
Again, I want to stress, ranges are not "as slow as possible", and it's clear from the numbers posted here that they are faster than for loops in practice, at least for this example. -Steve
Jun 05 2018
prev sibling next sibling parent reply Ethan <gooberman gmail.com> writes:
On Tuesday, 5 June 2018 at 19:05:27 UTC, DigitalDesigns wrote:
 For loops HAVE a direct cpu semantic! Do you doubt this?
...

Right. If you're gonna keep running your mouth off, how about looking at some disassembly then.

for(auto i=0; i<a.length; i+=strideAmount)

Using ldc -O4 -release for x86_64 processors, the initialiser translates to:

mov byte ptr [rbp + rcx], 0

The comparison translates to:

cmp r13, rcx
ja .LBB0_2

And the increment and store translates to:

mov byte ptr [rbp + rcx], 0
movsxd rcx, eax
add eax, 3

So. It uses four of the most basic instructions you can think of: mov, cmp, j<cond>, add. Now, what might you ask are the instructions that a range compiles down to when everything is properly inlined?

The initialisation, since it's a function, pulls from the stack:

mov rax, qword ptr [rsp + 16]
movsxd rcx, dword ptr [rsp + 32]

But the comparison looks virtually identical:

cmp rax, rcx
jb .LBB2_4

But how does it do the add? With some register magic:

movsxd rcx, edx
lea edx, [rcx + r9]

Now, what that looks like it's doing to me is combining the pointer load and index increment into those two instructions. One instruction less than the flat for loop.

In conclusion: the semantics you talk about are literally some of the most basic instructions in computing, and escaping the confines of a for loop for a foreach loop can let the compiler generate more efficient code than 50-year-old compsci concepts can.
Jun 05 2018
parent reply DigitalDesigns <DigitalDesigns gmail.com> writes:
On Tuesday, 5 June 2018 at 20:07:06 UTC, Ethan wrote:
 On Tuesday, 5 June 2018 at 19:05:27 UTC, DigitalDesigns wrote:
 For loops HAVE a direct cpu semantic! Do you doubt this?
 ...
 
 Right. If you're gonna keep running your mouth off, how about looking 
 at some disassembly then.
 
 for(auto i=0; i<a.length; i+=strideAmount)
 
 Using ldc -O4 -release for x86_64 processors, the initialiser 
 translates to:
 
 mov byte ptr [rbp + rcx], 0
 
 The comparison translates to:
 
 cmp r13, rcx
 ja .LBB0_2
 
 And the increment and store translates to:
 
 mov byte ptr [rbp + rcx], 0
 movsxd rcx, eax
 add eax, 3
 
 So. It uses four of the most basic instructions you can think of: mov, 
 cmp, j<cond>, add. Now, what might you ask are the instructions that a 
 range compiles down to when everything is properly inlined?
 
 The initialisation, since it's a function, pulls from the stack:
 
 mov rax, qword ptr [rsp + 16]
 movsxd rcx, dword ptr [rsp + 32]
 
 But the comparison looks virtually identical:
 
 cmp rax, rcx
 jb .LBB2_4
 
 But how does it do the add? With some register magic:
 
 movsxd rcx, edx
 lea edx, [rcx + r9]
 
 Now, what that looks like it's doing to me is combining the pointer 
 load and index increment into those two instructions. One instruction 
 less than the flat for loop.
 
 In conclusion: the semantics you talk about are literally some of the 
 most basic instructions in computing, and escaping the confines of a 
 for loop for a foreach loop can let the compiler generate more 
 efficient code than 50-year-old compsci concepts can.
Ok asshat! You still don't get it! I didn't say ranges would not compile down to the same thing! Do you have trouble understanding the English language?

You don't seem to get the concept where ranges are library solutions, and someone can come along at any time, modify some code, and WHAM! They no longer compile down to your efficient precious instructions. That is far more unlikely to occur with language semantics. Why is that so difficult for you to understand? You sure have an attitude for someone who has difficulty with English.
Jun 05 2018
next sibling parent reply Steven Schveighoffer <schveiguy yahoo.com> writes:
On 6/5/18 5:22 PM, DigitalDesigns wrote:
 On Tuesday, 5 June 2018 at 20:07:06 UTC, Ethan wrote:
 In conclusion. The semantics you talk about are literally some of the 
 most basic instructions in computing; and that escaping the confines 
 of a for loop for a foreach loop can let the compiler generate more 
 efficient code than 50-year-old compsci concepts can.
Ok asshat! You still don't get it! I didn't say ranges would not compile down to the same thing! Do you have trouble understanding the English language?
Nope, he doesn't. Look at what you said: "Maybe in theory ranges could be more optimal than other semantics but theory never equals practice."

And now you have been shown (multiple times) that in practice ranges in fact outperform for loops. Including the assembly to prove it (which helps with this comment: "Having some "proof" that they are working well would ease my mind.")

So tone down the attitude; you got what you *clearly* asked for but seem reluctant to acknowledge. Ranges are good; for loops are good too, just not as good. So maybe you should just use ranges, use the correct optimization flags, and call it a day? Or else use for loops and accept that even though they may not run as quickly, they are "safer" to use, since some malicious coder could come along and add sleeps inside the std.algorithm functions.

-Steve
Jun 05 2018
parent reply DigitalDesigns <DigitalDesigns gmail.com> writes:
On Tuesday, 5 June 2018 at 21:35:03 UTC, Steven Schveighoffer 
wrote:
 On 6/5/18 5:22 PM, DigitalDesigns wrote:
 On Tuesday, 5 June 2018 at 20:07:06 UTC, Ethan wrote:
 In conclusion. The semantics you talk about are literally 
 some of the most basic instructions in computing; and that 
 escaping the confines of a for loop for a foreach loop can 
 let the compiler generate more efficient code than 
 50-year-old compsci concepts can.
Ok asshat! You still don't get it! I didn't say ranges would not compile down to the same thing! Do you have trouble understanding the English language?
Nope, he doesn't. Look at what you said: "Maybe in theory ranges could be more optimal than other semantics but theory never equals practice. " And now you have been shown (multiple times) that in practice ranges in fact outperform for loops. Including the assembly to prove it (which helps with this comment: "Having some "proof" that they are working well would ease my mind.")
No, you have shown a few fucking cases, why are you guys attacking me for being dense? You can't prove that ranges are more optimal than direct semantics! Do it! I'd like to see you try!
 So tone down the attitude, you got what you *clearly* asked for 
 but seem reluctant to acknowledge. Ranges are good, for loops 
 are good too, but not as. So maybe you should just use ranges 
 and use the correct optimization flags and call it a day? Or 
 else use for loops and accept that even though they may not run 
 as quickly, they are "safer" to use since some malicious coder 
 could come along and add in sleeps inside the std.algorithm 
 functions.

 -Steve
What it seems is that a few of you are upset because I didn't bow down to your precious range semantics and ask for clarification. At first I was jumped on, then someone did some tests and found out that it wasn't so rosy like everyone thought. Of course, the workaround is to force optimizations that fix the problem when the problem doesn't exist in for loops. Then you come along and tell me that specific cases prove the general case... that is real dense.

You know, it takes two to have an attitude! I asked for information regarding stride. I got the range version, and it turned out to be slower in some corner case for some bizarre reason. I was then told it required optimizations (why? It is fishy that the corner case would be 200% slower for a weird edge case), and then I was told that ranges are always faster (which is what you just said, because you act like one example proves everything). Every step of the way I am told "don't worry". You've already stepped in the shit once and you expect me to believe everything you say?

Why is it so hard to have a test suite that checks the performance of range constructs instead of just getting me to believe you? Huh? Do you really think I'm supposed to believe everything any asshat says on the net just because they want me to? Back up your beliefs, that simple. Provide timings for all the range functions in various combinations and give me a worst-case scenario compared to their equivalent hand-coded versions. Once you do that then I will be able to make an informed decision rather than doing what you really want, which is except your world as authority regardless of the real truth.
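For concreteness, the kind of head-to-head timing being asked for can be sketched like this. This is only a sketch, assuming std.datetime.stopwatch and std.range.stride; the actual numbers depend entirely on the compiler (dmd vs ldc) and flags, which is the whole point of contention:

```d
import std.datetime.stopwatch : benchmark;
import std.range : stride;
import std.stdio : writeln;

void main()
{
    auto data = new int[1_000_000];

    // Hand-written for loop: zero every 3rd element.
    void byLoop()
    {
        for (size_t i = 0; i < data.length; i += 3)
            data[i] = 0;
    }

    // Equivalent range-based version via std.range.stride.
    // stride allocates nothing; writing through it mutates data.
    void byStride()
    {
        foreach (ref e; data[].stride(3))
            e = 0;
    }

    auto results = benchmark!(byLoop, byStride)(100);
    writeln("for loop: ", results[0]);
    writeln("stride:   ", results[1]);
}
```

If the two timings diverge without -O/-inline/-release, that is the abstraction cost being argued about in this thread; with optimizations enabled they are typically comparable.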
Jun 05 2018
next sibling parent aberba <karabutaworld gmail.com> writes:
On Tuesday, 5 June 2018 at 22:28:44 UTC, DigitalDesigns wrote:
 On Tuesday, 5 June 2018 at 21:35:03 UTC, Steven Schveighoffer 
 wrote:
 [...]
 [...]
Don't ranges evaluate lazily in some cases? So they'd avoid unnecessary work... and be much faster and more efficient, if I'm correct.
 [...]
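For what it's worth, the laziness being described can be sketched as follows (assuming std.range; stride allocates no new array, it is just a view over the source, which is also what the X[a..b].stride(m) = 0 idea upthread amounts to):

```d
import std.range : stride;
import std.algorithm.comparison : equal;

void main()
{
    auto arr = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9];

    // Lazy: no new array is allocated here; every3rd is a
    // view that yields arr[0], arr[3], arr[6], arr[9].
    auto every3rd = arr.stride(3);
    assert(equal(every3rd, [0, 3, 6, 9]));

    // Writing through the view mutates the original array.
    foreach (ref e; arr.stride(3))
        e = -1;
    assert(arr == [-1, 1, 2, -1, 4, 5, -1, 7, 8, -1]);
}
```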
Jun 05 2018
prev sibling parent reply Steven Schveighoffer <schveiguy yahoo.com> writes:
On 6/5/18 6:28 PM, DigitalDesigns wrote:
 Once you do that then I will be able to make an informed 
 decision rather than doing what you really want, which is except your 
 world as authority regardless of the real truth.
It's "accept" not "except". You need to *accept* my world as authority. Here, this may help: https://www.vocabulary.com/articles/chooseyourwords/accept-except/ -Steve
Jun 06 2018
parent DigitalDesigns <DigitalDesigns gmail.com> writes:
On Wednesday, 6 June 2018 at 14:23:29 UTC, Steven Schveighoffer 
wrote:
 On 6/5/18 6:28 PM, DigitalDesigns wrote:
 Once you do that then I will be able to make an informed 
 decision rather than doing what you really want, which is 
 except your world as authority regardless of the real truth.
It's "accept" not "except". You need to *accept* my world as authority. Here, this may help: https://www.vocabulary.com/articles/chooseyourwords/accept-except/ -Steve
This is genius!! This made my day! I feel 40 years younger!!! It took me back to 3rd grade English class. I can remember when little Lissa Lou was sitting next to me on a fine Wednesday morning! She had just used a word in the wrong context and out of the blue the gestapo barged in, shoved a grenade down her throat, then tossed her out the window! It was a fun day! We all had ice cream and danced like the little Nazis we were! That was the good ol' days! Thanks!
Jun 07 2018
prev sibling parent reply Ethan <gooberman gmail.com> writes:
On Tuesday, 5 June 2018 at 21:22:27 UTC, DigitalDesigns wrote:
 Ok asshat!
Take it to reddit. Back your arguments up with actual knowledge and intelligence, not unfounded aggression.
Jun 05 2018
parent reply DigitalDesigns <DigitalDesigns gmail.com> writes:
On Tuesday, 5 June 2018 at 21:52:03 UTC, Ethan wrote:
 On Tuesday, 5 June 2018 at 21:22:27 UTC, DigitalDesigns wrote:
 Ok asshat!
Take it to reddit. Back your arguments up with actual knowledge and intelligence, not unfounded agression.
You are an idiot! You obviously do not understand basic logic. Unfounded aggression? Yep, way to see how you didn't start it! Must be nice being the bully!
Jun 05 2018
parent Ethan <gooberman gmail.com> writes:
On Tuesday, 5 June 2018 at 21:54:20 UTC, DigitalDesigns wrote:
 You are an idiot!
Take it to reddit. Back your arguments up with actual knowledge and intelligence, not unfounded aggression.
Jun 05 2018
prev sibling parent Timon Gehr <timon.gehr gmx.ch> writes:
On 05.06.2018 21:05, DigitalDesigns wrote:
 On Tuesday, 5 June 2018 at 18:46:41 UTC, Timon Gehr wrote:
 On 05.06.2018 18:50, DigitalDesigns wrote:
 With a for loop, it is pretty much a wrapper on internal cpu logic so 
 it will be near as fast as possible.
This is not even close to being true for modern CPUs. There are a lot of architectural and micro-architectural details that affect performance but are not visible or accessible in your for loop. If you care about performance, you will need to test anyway, as even rather sophisticated models of CPU performance don't get everything right.
Those optimizations are not part of the instruction set, so they are irrelevant. They will occur with ranges too. ...
I was responding to claims that for loops are basically a wrapper on internal CPU logic and nearly as fast as possible. Both of those claims were wrong.
 For loops HAVE a direct cpu semantic! Do you doubt this?
 ...
You'd have to define what that means. (E.g., Google currently shows no hits for "direct CPU semantics".)
 
 Cpu's do not have range semantics. Ranges are layers on top of compiler 
 semantics... you act like they are equivalent, they are not!
I don't understand why you bring this up nor what you think it means. The compiler takes a program and produces some machine code that has the right behavior. Performance is usually not formally specified. In terms of resulting behavior, code with explicit for loops and range-based code may have identical semantics. Which one executes faster depends on internal details of the compiler and the target architecture, and it may change over time, e.g. between compiler releases.
 All range 
 semantics must go through the library code then to the compiler then to 
 cpu. For loops of all major systems languages go almost directly to cpu 
 instructions.
 
 for(int i = 0; i < N; i++)
 
 translates in to either increment and loop or jump instructions.
 ...
Sure, or whatever else the compiler decides to do. It might even be translated into a memcpy call. Even if you want to restrict yourself to use only for loops, my point stands. Write maintainable code by default and let the compiler do what it does. Then optimize further in those cases where the resulting code is actually too slow. Test for performance regressions.
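As an aside, the kind of lowering meant here can be sketched with a hypothetical helper; whether this particular loop actually becomes a memcpy call depends entirely on the compiler and flags:

```d
// A hand-written element-by-element copy. Optimizing compilers
// (e.g. ldc with -O) commonly recognize this pattern and may
// lower the entire loop to a single memcpy call -- the for loop
// in the source does not dictate the instructions emitted.
void copyInts(int[] dst, const(int)[] src)
{
    assert(dst.length == src.length);
    for (size_t i = 0; i < src.length; ++i)
        dst[i] = src[i];
}

unittest
{
    auto src = [1, 2, 3];
    auto dst = new int[3];
    copyInts(dst, src);
    assert(dst == [1, 2, 3]);
}
```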
 There is absolutely no reason why any decent compiler would not use what 
 the cpu has to offer. For loops are language semantics, Ranges are 
 library semantics.
Not really. Also, irrelevant.
 To pretend they are equivalent is wrong and no amount 
 of justifying will make them the same.
Again, I don't think this point is part of this discussion.
 I actually do not know of any commercially viable cpu that exists without loop semantics.
What does it mean for a CPU to have "loop semantics"? CPUs typically have an instruction pointer register and possibly some built-in instructions to manipulate said instruction pointer. x86 has some built-in loop instructions, but I think they are just there for legacy support and not actually something you want to use in performant code.
 I also know of no commercially viable compiler that does not wrap those instructions in a for loop (or while, or whatever) like syntax that almost maps directly to the cpu instructions.
 ...
The compiler takes your for loop and generates some machine code. I don't think there is a "commercially viable" compiler that does not sometimes do things that are not direct. And even then, there is no very simple mapping from CPU instructions to observed performance, so the entire point is a bit moot.
 Also, it is often not necessary to be "as fast as possible". It is 
 usually more helpful to figure out where the bottleneck is for your 
 code and concentrate optimization effort there, which you can do more 
 effectively if you can save time and effort for the remaining parts of 
 your program by writing simple and obviously correct range-based code, 
 which often will be fast as well.
It's also often not necessary to be "as slow as possible".
This seems to be quoting an imaginary person. My point is that to get even faster code, you need to spend effort and often get lower maintainability. This is not always a good trade-off, in particular if the optimization does not improve performance a lot and/or the code in question is not executed very often.
 I'm not asking about generalities but specifics. It's great to make generalizations about how things should be, but I would like to know how they are.
That's a bit unspecific.
 Maybe in theory ranges could be more optimal than other 
 semantics but theory never equals practice.
 
I don't know who this is addressed to. My point was entirely practical.
Jun 05 2018