digitalmars.D.learn - howto count lines

digitalmars.D.learn - howto count lines - fast

Nitram (6/6) May 30 2017 After reading

Jordan Wilson (10/15) May 30 2017 Not sure if this is the fastest, but anyway

Jordan Wilson (5/22) May 30 2017 I should say, if you don't care about storing the data,

Daniel Kozak via Digitalmars-d-learn (20/26) May 30 2017 I do not know this is my first attempt and it is almost same fast as wc
H. S. Teoh via Digitalmars-d-learn (165/173) May 30 2017 This little challenge piqued my interest. So I decided to take a shot

Stanislav Blinov (31/39) May 30 2017 Hm... I cheated a little bit: took your program and compiled with
Patrick Schluter (21/41) May 31 2017 You should try something more like

Nitram (4/25) May 31 2017 I am glad to see this participation on this issue :)

=?UTF-8?Q?Ali_=c3=87ehreli?= (14/20) May 30 2017 I could not make the D program come close to wc's performance when the

Daniel Kozak via Digitalmars-d-learn (3/16) May 30 2017 How do you compile it? When I use ldc2 -O3 -release -mcpu=bdver1 lc.d

=?UTF-8?Q?Ali_=c3=87ehreli?= (5/7) May 31 2017 My bad: I'm not familiar with ldc's optimization options. (I used -O3

cym13 (6/15) May 31 2017 It seems to me that your initial result is more interesting: you

H. S. Teoh via Digitalmars-d-learn (17/19) May 31 2017 [...]
Jonathan M Davis via Digitalmars-d-learn (7/13) May 31 2017 I don't know, but memchr wouldn't work with CTFE, so someone might have
H. S. Teoh via Digitalmars-d-learn (44/57) May 31 2017 [...]

Patrick Schluter (19/82) May 31 2017 That's what I suggested above. It's the first optimisation to do

H. S. Teoh via Digitalmars-d-learn (108/120) Jun 01 2017 [...]

H. S. Teoh via Digitalmars-d-learn (53/53) May 30 2017 P.S. After I posted the code, I took a closer look at the disassembly
Russel Winder via Digitalmars-d-learn (29/36) May 31 2017 a
H. S. Teoh via Digitalmars-d-learn (26/39) May 31 2017 [...]
Jonathan M Davis via Digitalmars-d-learn (7/62) May 31 2017 Digitalmars-d-learn wrote:

Patrick Schluter (8/17) May 31 2017 See my link above to realdworldtech. Using SIMD can give good

Jonathan M Davis via Digitalmars-d-learn (10/30) May 31 2017 Whenever you attempt more complicated optimizations, it becomes harder t...
H. S. Teoh via Digitalmars-d-learn (33/50) May 31 2017 [...]

Russel Winder via Digitalmars-d-learn (27/35) Jun 01 2017 The people looking at modern, ultra-parallel hardware architectures are
Jonathan M Davis via Digitalmars-d-learn (6/10) Jun 01 2017 Based on some previous discussions, I think that this is the sort of thi...
Russel Winder via Digitalmars-d-learn (31/53) Jun 01 2017 [=E2=80=A6]
H. S. Teoh via Digitalmars-d-learn (85/126) Jun 02 2017 Unfortunately, the compiler can only go so far, because it doesn't
Russel Winder via Digitalmars-d-learn (40/82) Jun 02 2017 There are many different sorts of programming. Operating systems,
H. S. Teoh via Digitalmars-d-learn (48/75) Jun 02 2017 Fair enough. I can see how this would lead to unnecessarily ugly,

Nitram <martin.brzenska googlemail.com> writes:

After reading 
https://dlang.org/blog/2017/05/24/faster-command-line-tools-in-d/ 
, i was wondering how fast one can do a simple "wc -l" in D.

So i made a couple short implementations and found myself 
confronted with slow results compared to "/usr/bin/wc -l".

How would a implementation look like in D, which is fast?

May 30 2017

Jordan Wilson <wilsonjord gmail.com> writes:

On Tuesday, 30 May 2017 at 20:02:38 UTC, Nitram wrote:
 After reading 
 https://dlang.org/blog/2017/05/24/faster-command-line-tools-in-d/ , i was
wondering how fast one can do a simple "wc -l" in D.

 So i made a couple short implementations and found myself 
 confronted with slow results compared to "/usr/bin/wc -l".

 How would a implementation look like in D, which is fast?

Not sure if this is the fastest, but anyway

void main(){
     import std.file : read;
     import std.conv : to;
     import std.algorithm : count;

     auto data = cast(ubyte[])read("somefile.txt");
     auto lc = data.count('\n'.to!ubyte);
}

Jordan

May 30 2017

Jordan Wilson <wilsonjord gmail.com> writes:

On Tuesday, 30 May 2017 at 20:37:44 UTC, Jordan Wilson wrote:
 On Tuesday, 30 May 2017 at 20:02:38 UTC, Nitram wrote:
 After reading 
 https://dlang.org/blog/2017/05/24/faster-command-line-tools-in-d/ , i was
wondering how fast one can do a simple "wc -l" in D.

 So i made a couple short implementations and found myself 
 confronted with slow results compared to "/usr/bin/wc -l".

 How would a implementation look like in D, which is fast?

 Not sure if this is the fastest, but anyway

 void main(){
     import std.file : read;
     import std.conv : to;
     import std.algorithm : count;

     auto data = cast(ubyte[])read("somefile.txt");
     auto lc = data.count('\n'.to!ubyte);
 }

 Jordan

I should say, if you don't care about storing the data, 
File("somefile.txt","r").byLine.count is probably more idiomatic, 
and probably just as fast.

Jordan

May 30 2017

Daniel Kozak via Digitalmars-d-learn <digitalmars-d-learn puremagic.com> writes:

I do not know this is my first attempt and it is almost same fast as wc 
on my pc:

int main(string[] args)
{
     import std.stdio : writeln, writefln, File;
     import std.array : uninitializedArray;

     auto f = File("data");
     size_t c = 0;
     auto buffer = uninitializedArray!(ubyte[])(1024);
     foreach (chunk; f.byChunk(buffer))
     {
         foreach (ch; chunk)
             if (ch == '\n')
                 ++c;
     }
     writeln(c);
     return 0;
}

And I belive it could be faster when iopipe is used


Dne 30.5.2017 v 22:02 Nitram via Digitalmars-d-learn napsal(a):
 After reading 
 https://dlang.org/blog/2017/05/24/faster-command-line-tools-in-d/ , i 
 was wondering how fast one can do a simple "wc -l" in D.

 So i made a couple short implementations and found myself confronted 
 with slow results compared to "/usr/bin/wc -l".

 How would a implementation look like in D, which is fast?

May 30 2017

"H. S. Teoh via Digitalmars-d-learn" <digitalmars-d-learn puremagic.com> writes:

On Tue, May 30, 2017 at 08:02:38PM +0000, Nitram via Digitalmars-d-learn wrote:
 After reading
 https://dlang.org/blog/2017/05/24/faster-command-line-tools-in-d/ , i
 was wondering how fast one can do a simple "wc -l" in D.
 
 So i made a couple short implementations and found myself confronted
 with slow results compared to "/usr/bin/wc -l".
 
 How would a implementation look like in D, which is fast?

This little challenge piqued my interest.  So I decided to take a shot
at seeing if I could beat my system's /usr/bin/wc -l.

First order of business: whenever it comes to performance, always choose
the right compiler for the job.  While dmd is the reference compiler and
has the latest and greatest features, it doesn't do too well in the
performance department.  So the first thing I did was to use gdc
instead. Specifically, gdc -O3, to get the most aggressive optimizations
the gcc backend can give me.

Second order of business: the na�ve algorithm of reading 1 character at
a time from the file is obviously suboptimal, so I didn't even consider
it.

Third order of business: I constructed a large text file containing
10634420 lines by concatenating 5 copies of a large CSV file I obtained
online. The sample had to be large enough so that overhead like program
startup times and background system noise don't bias the test results
too much.

With this setup, I measured the performance of my system's wc -l as the
baseline for comparing the performance of the D solutions. Here's a
typical output of my shell's `time wc -l` command:

	real    0m0.344s
	user    0m0.153s
	sys     0m0.191s

Since we're talking about beating wc, and others have already posted
the obvious solutions of loading into a buffer and scanning the buffer,
the most obvious next step up is to use std.mmfile to memory-map the
file so that we don't waste effort managing file buffers ourselves: let
the OS do it for us.

So here are the algorithms I tried:

1) lineCount1(): use std.mmfile to memory-map the file so that we can
scan it as a single, contiguous array without any fussy
buffer-management details. Result: pretty fast, but still loses to wc.
Typical measurements:

	real    0m0.422s
	user    0m0.366s
	sys     0m0.054s

2) lineCount2(): just as a comparison, I did a read-into-buffer
algorithm so that we have a reference as to how it performs. As
expected, it's worse than the std.mmfile solution. Typical measurements:

	real    0m0.519s
	user    0m0.320s
	sys     0m0.198s

3) lineCount3(): In lineCount1(), I used std.algorithm.searching.count
for counting newlines; so I thought, what if the range abstraction is
introducing too much overhead?  So I decided to write a simple foreach
loop instead, in hope that the simpler code will allow the gcc optimizer
do a better job. Sadly, the result is pretty much the same as
lineCount1:

	real    0m0.421s
	user    0m0.379s
	sys     0m0.042s

(The good news, though, is that this shows that using std.algorithm does
not introduce excessive overhead compared to a hand-written loop. This
proves that the high-level range abstraction does not necessarily equate
to slower performance.)

4) lineCount4(): Having failed to beat wc thus far, it's time to bring
out the big guns.  Since we're essentially counting newlines in the
input, who says we have to process the data sequentially from start to
end?  Counting from front to back or back to front gives the same
answer, as does counting from middle to end, then front to middle. In
particular, counting from middle to end *in parallel* with counting from
front to middle, then summing the subtotals, gives the same answer.
I.e., time to bust out std.parallelism. :-)  So, in my 4th algorithm, I
split the input into 16KB chunks, and then scan them for newlines in
parallel, creating file_size/16384 subtotals. Then I sum the subtotals
in the end. Here's the result, tested on my AMD Phenom 6-core CPU:

	real    0m0.242s
	user    0m1.151s
	sys     0m0.057s

Woohoo!!! Finally, we beat the system's wc -l!! And by a pretty fair
margin, too.  Eat your heart out, wc!!!  (The large user time is because
we're using all 6 cores at once. But the actual elapsed time is
shorter.)

Here's the code for all of the above:

---------snip---------
import std.stdio;

// real    0m0.422s
// user    0m0.366s
// sys     0m0.054s
size_t lineCount1(string filename)
{
    import std.algorithm.searching : count;
    import std.mmfile;

    auto f = new MmFile(filename);
    auto data = cast(ubyte[]) f[];
    return data.count('\n');
}

// real    0m0.519s
// user    0m0.320s
// sys     0m0.198s
size_t lineCount2(string filename)
{
    import std.algorithm.searching : count;

    auto f = File(filename);
    ubyte[] buf;
    size_t c = 0;

    buf.length = 32768;
    do
    {
        auto d = f.rawRead(buf);
        c += d.count('\n');
    } while (!f.eof());

    return c;
}

// real    0m0.421s
// user    0m0.379s
// sys     0m0.042s
size_t lineCount3(string filename)
{
    import std.mmfile;

    auto f = new MmFile(filename);
    auto data = cast(ubyte[]) f[];
    size_t c;

    foreach (i; 0 .. data.length)
    {
        if (data[i] == '\n') c++;
    }
    return c;
}

// real    0m0.242s
// user    0m1.151s
// sys     0m0.057s
size_t lineCount4(string filename)
{
    import std.algorithm.comparison : min;
    import std.algorithm.iteration : sum;
    import std.mmfile;
    import std.parallelism;
    import std.range : chunks;

    auto f = new MmFile(filename);
    auto data = cast(ubyte[]) f[];
    size_t[] subtotals;

    enum blockSize = 16384;
    if (data.length == 0) return 0;
    subtotals.length = 1 + (data.length - 1) / blockSize;

    foreach (j, ref subtotal; parallel(subtotals))
    {
        size_t end = min(blockSize*(j+1), data.length);
        foreach (i; blockSize*j .. end)
        {
            if (data[i] == '\n') subtotal++;
        }
    }
    return subtotals.sum;
}

int main(string[] args)
{
    if (args.length < 2)
    {
        stderr.writeln("Please specify target file.");
        return 1;
    }

    auto filename = args[1];
    //auto count = lineCount1(filename);
    //auto count = lineCount2(filename);
    //auto count = lineCount3(filename);
    auto count = lineCount4(filename);
    writeln(count);
    return 0;
}

// vim:set sw=4 ts=4 et ai:
---------snip---------


T

-- 
BREAKFAST.COM halted...Cereal Port Not Responding. -- YHL

May 30 2017

Stanislav Blinov <stanislav.blinov gmail.com> writes:

On Tuesday, 30 May 2017 at 23:41:01 UTC, H. S. Teoh wrote:

 This little challenge piqued my interest.  So I decided to take 
 a shot at seeing if I could beat my system's /usr/bin/wc -l.

 First order of business: whenever it comes to performance, 
 always choose the right compiler for the job...


 Woohoo!!! Finally, we beat the system's wc -l!! And by a pretty 
 fair margin, too.  Eat your heart out, wc!!!  (The large user 
 time is because we're using all 6 cores at once. But the actual 
 elapsed time is shorter.)


Hm... I cheated a little bit: took your program and compiled with 
`ldc2 -release -O3 -mcpu=skylake`. For data I took std.datetime 
concatenated 1000 times (35446000 lines). Results (minimum of 10 
runs each):

wc -l
real 0.50
user 0.40
sys 0.09

lineCount1
real 0.23
user 0.19
sys 0.04

lineCount2
real 0.29
user 0.17
sys 0.12

lineCount3
real 0.23
user 0.18
sys 0.04

lineCount4
real 0.22
user 1.52
sys 0.04

Seems like all of them beat wc, so machine and compiler matter a 
great deal in this comparison. Thing is, on small files wc is 
going to win pretty much always, due to D's clunky startup. But 
with larger ones - wc falls far behind.
Interestingly, both lineCount1 and lineCount3 on this machine are 
nearly even with lineCount4 :)

May 30 2017

Patrick Schluter <Patrick.Schluter bbox.fr> writes:

On Tuesday, 30 May 2017 at 23:41:01 UTC, H. S. Teoh wrote:
 On Tue, May 30, 2017 at 08:02:38PM +0000, Nitram via 
 Digitalmars-d-learn wrote:
 After reading 
 https://dlang.org/blog/2017/05/24/faster-command-line-tools-in-d/ , i was
wondering how fast one can do a simple "wc -l" in D.
 

 size_t lineCount3(string filename)
 {
     import std.mmfile;

     auto f = new MmFile(filename);
     auto data = cast(ubyte[]) f[];
     size_t c;

     foreach (i; 0 .. data.length)
     {
         if (data[i] == '\n') c++;
     }
     return c;
 }

 // real    0m0.242s
 // user    0m1.151s
 // sys     0m0.057s

You should try something more like
     auto data = cast(ulong[]) f[];

      foreach (i; 0 .. data.length/ulong.sizeof)

and then using bitfiedling to count the number of \n in the 
loaded ulong. This divides by 8 the number of load instructions. 
The counting of \n in the loaded word then only uses registers. 
It is also possible to use bit fiddling to detect and count the 
characters in that ulong. I don't know if it is really faster then

Here a function to detect if a given character is in an ulong

     auto detect(alias CHAR)(ulong t)
     {
       enum ulong u = CHAR;
       enum mask1 = u|(u<<8)|(u<<16)|(u<<24UL)|(u<<32UL);
       enum mask = (mask1<<32)|mask1;
       return ((t^mask) - 0x0101010101010101UL) & ~(t^mask) & 
0x8080808080808080UL;
     }

The returned value is 0 if the character is not in t. And the 
highest bit of each byte is set if it contained the character. If 
the CPU has a fast popcnt it should be easy to count.

May 31 2017

Nitram <martin.brzenska googlemail.com> writes:

I am glad to see this participation on this issue :)
The hints about trying another compiler and std.mmfile turned out 
to be very effective.

Even this simple code is faster then my systems "wc -l" now:
void main() {
	import std.stdio;
	writeln(lcs("benchmark.dat"));
}

size_t lcs(string filename) {
	import std.mmfile;
	auto f = new MmFile(filename);
	auto data = cast(ubyte[]) f[];
	size_t c = 0;
	foreach(ref i ; data) {
		if (i == '\n') c++;
	}
	return c;
}

time wc -l ./benchmark.dat
10500000 ./benchmark.dat
wc -l ./benchmark.dat  0,06s user 0,03s system 99% cpu 0,089 
total

time ./lcs
10500000
./lcs  0,05s user 0,01s system 99% cpu 0,061 total

May 31 2017

=?UTF-8?Q?Ali_=c3=87ehreli?= <acehreli yahoo.com> writes:

On 05/30/2017 01:02 PM, Nitram wrote:
 After reading
 https://dlang.org/blog/2017/05/24/faster-command-line-tools-in-d/ , i
 was wondering how fast one can do a simple "wc -l" in D.

 So i made a couple short implementations and found myself confronted
 with slow results compared to "/usr/bin/wc -l".

 How would a implementation look like in D, which is fast?

I could not make the D program come close to wc's performance when the 
data was piped from stdin. A typical run with Daniel Kozak's program:

$ time cat deleteme.txt | wc -l
5062176

real	0m0.086s
user	0m0.068s
sys	0m0.048s

$ time cat deleteme.txt | ./deneme
5062176

real	0m0.174s
user	0m0.152s
sys	0m0.072s

Ali

May 30 2017

Daniel Kozak via Digitalmars-d-learn <digitalmars-d-learn puremagic.com> writes:

Dne 31.5.2017 v 02:13 Ali Çehreli via Digitalmars-d-learn napsal(a):

 I could not make the D program come close to wc's performance when the 
 data was piped from stdin. A typical run with Daniel Kozak's program:

 $ time cat deleteme.txt | wc -l
 5062176

 real    0m0.086s
 user    0m0.068s
 sys    0m0.048s

 $ time cat deleteme.txt | ./deneme
 5062176

 real    0m0.174s
 user    0m0.152s
 sys    0m0.072s

 Ali

How do you compile it? When I use ldc2 -O3  -release -mcpu=bdver1 lc.d 
my code is even faster than wc

May 30 2017

=?UTF-8?Q?Ali_=c3=87ehreli?= <acehreli yahoo.com> writes:

On 05/30/2017 11:50 PM, Daniel Kozak via Digitalmars-d-learn wrote:

 How do you compile it? When I use ldc2 -O3  -release -mcpu=bdver1 lc.d
 my code is even faster than wc

My bad: I'm not familiar with ldc's optimization options. (I used -O3 
but not -release) Now I get the same performance as 'wc -l' when I add 
-release.

Ali

May 31 2017

cym13 <cpicard openmailbox.org> writes:

On Wednesday, 31 May 2017 at 17:23:46 UTC, Ali Çehreli wrote:
 On 05/30/2017 11:50 PM, Daniel Kozak via Digitalmars-d-learn 
 wrote:

 How do you compile it? When I use ldc2 -O3  -release

 -mcpu=bdver1 lc.d
 my code is even faster than wc

 My bad: I'm not familiar with ldc's optimization options. (I 
 used -O3 but not -release) Now I get the same performance as 
 'wc -l' when I add -release.

 Ali

It seems to me that your initial result is more interesting: you 
manage to get faster than wc *while keeping bound safety*. At a 
time where safety is finally getting the importance it should 
always have had showing that you can write fast code without 
sacrifiying any of it is important I think.

May 31 2017

"H. S. Teoh via Digitalmars-d-learn" <digitalmars-d-learn puremagic.com> writes:

On Tue, May 30, 2017 at 05:13:46PM -0700, Ali �ehreli via Digitalmars-d-learn
wrote:
[...]
 I could not make the D program come close to wc's performance when the
 data was piped from stdin.

[...]

Hmm. This is a particularly interesting case, because I adapted some of
my algorithms to handle reading from stdin (i.e., std.mmfile is not an
option), and I could not get it to reach wc's performance!  I even
installed ldc just to see if that made a difference... it was somewhat
faster than gdc, but still, the timings were about twice as slow as wc.

I did some digging around, and it seems that wc is using glibc's memchr,
which is highly-optimized, whereas std.algorithm.count just uses a
simplistic loop. Which is strange, because I'm pretty sure somebody
optimized std.algorithm some time ago to use memchr() instead of a loop
when searching for a byte value in an array.  Whatever happened to
that??


T

-- 
Sometimes the best solution to morale problems is just to fire all of the
unhappy people. -- despair.com

May 31 2017

Jonathan M Davis via Digitalmars-d-learn writes:

On Wednesday, May 31, 2017 12:13:04 H. S. Teoh via Digitalmars-d-learn 
wrote:
 I did some digging around, and it seems that wc is using glibc's memchr,
 which is highly-optimized, whereas std.algorithm.count just uses a
 simplistic loop. Which is strange, because I'm pretty sure somebody
 optimized std.algorithm some time ago to use memchr() instead of a loop
 when searching for a byte value in an array.  Whatever happened to
 that??

I don't know, but memchr wouldn't work with CTFE, so someone might have
removed it to make it work in CTFE (though that could be done with a
different branch for CTFE). Or maybe it never made it into std.algorithm for
one reason or another.

- Jonathan M Davis

May 31 2017

"H. S. Teoh via Digitalmars-d-learn" <digitalmars-d-learn puremagic.com> writes:

On Wed, May 31, 2017 at 03:46:17PM -0700, Jonathan M Davis via
Digitalmars-d-learn wrote:
 On Wednesday, May 31, 2017 12:13:04 H. S. Teoh via Digitalmars-d-learn 
 wrote:
 I did some digging around, and it seems that wc is using glibc's
 memchr, which is highly-optimized, whereas std.algorithm.count just
 uses a simplistic loop. Which is strange, because I'm pretty sure
 somebody optimized std.algorithm some time ago to use memchr()
 instead of a loop when searching for a byte value in an array.
 Whatever happened to that??

 
 I don't know, but memchr wouldn't work with CTFE, so someone might
 have removed it to make it work in CTFE (though that could be done
 with a different branch for CTFE). Or maybe it never made it into
 std.algorithm for one reason or another.

[...]

I checked the Phobos code again, and it appears that my memory deceived
me. Somebody *did* add memchr optimization to find() and its friends,
but not to count().

CTFE compatibility is not a problem, since we can just if(__ctfe) the
optimized block away.

I'm currently experimenting with a memchr-optimized version of count(),
but I'm getting mixed results: on small arrays or large arrays densely
packed with matching elements, the memchr version runs rather slowly,
because it involves a function call into the C library per matching
element.  On large arrays only sparsely populated with matching
elements, though, the memchr-optimized version beats the current code by
about an order of magnitude.

Since it wouldn't be a wise idea to assume sparsity of matches in
Phobos, I decided to do a little more digging, and looked up the glibc
implementation of memchr. The main optimization is that it iterates over
the array not by byte, as a na�ve loop would do, but by ulong's. (Of
course, the first n bytes and last n bytes that are not ulong-aligned
are checked with a per-byte loop; so for very short arrays it doesn't
lose out to the na�ve loop.)  In each iteration over ulong, it performs
the bit-twiddling hack alluded to by Nitram to detect the presence of
matching bytes, upon which it breaks out to the closing per-byte loop to
find the first match. For short arrays, or arrays where a match is
quickly found, it's comparable in performance to the na�ve loop; for
large arrays where the match is not found until later, it easily
outperforms the na�ve loop.

My current thought is to adopt the same approach: iterate over size_t or
some such larger unit, and adapt the bit-twiddling hack to be able to
count the number of matches in each size_t.  This is turning out to be
trickier than I'd like, though, because there is a case where carry
propagation makes it unclear how to derive the number of matches without
iterating over the bytes a second time.

But this may not be a big problem, since size_t.sizeof is relatively
small, so I can probably loop over individual bytes when one or more
matches is detected, and a sufficiently-capable optimizer like ldc or
gdc would be able to unroll this into a series of sete + add
instructions, no branches that might stall the CPU pipeline. For
densely-matching arrays, this should still have comparable performance
to the na�ve loops; for sparsely-matching arrays this should show
significant speedups.


T

-- 
My program has no bugs! Only unintentional features...

May 31 2017

Patrick Schluter <Patrick.Schluter bbox.fr> writes:

On Wednesday, 31 May 2017 at 23:03:54 UTC, H. S. Teoh wrote:
 On Wed, May 31, 2017 at 03:46:17PM -0700, Jonathan M Davis via 
 Digitalmars-d-learn wrote:
 On Wednesday, May 31, 2017 12:13:04 H. S. Teoh via 
 Digitalmars-d-learn wrote:
 I did some digging around, and it seems that wc is using 
 glibc's memchr, which is highly-optimized, whereas 
 std.algorithm.count just uses a simplistic loop. Which is 
 strange, because I'm pretty sure somebody optimized 
 std.algorithm some time ago to use memchr() instead of a 
 loop when searching for a byte value in an array. Whatever 
 happened to that??

 
 I don't know, but memchr wouldn't work with CTFE, so someone 
 might have removed it to make it work in CTFE (though that 
 could be done with a different branch for CTFE). Or maybe it 
 never made it into std.algorithm for one reason or another.

 [...]

 I checked the Phobos code again, and it appears that my memory 
 deceived
 me. Somebody *did* add memchr optimization to find() and its 
 friends,
 but not to count().

 CTFE compatibility is not a problem, since we can just 
 if(__ctfe) the
 optimized block away.

 I'm currently experimenting with a memchr-optimized version of 
 count(), but I'm getting mixed results: on small arrays or 
 large arrays densely packed with matching elements, the memchr 
 version runs rather slowly, because it involves a function call 
 into the C library per matching element.  On large arrays only 
 sparsely populated with matching elements, though, the 
 memchr-optimized version beats the current code by about an 
 order of magnitude.

 Since it wouldn't be a wise idea to assume sparsity of matches 
 in Phobos, I decided to do a little more digging, and looked up 
 the glibc implementation of memchr. The main optimization is 
 that it iterates over the array not by byte, as a naïve loop 
 would do, but by ulong's.

That's what I suggested above. It's the first optimisation to do 
when looping over a buffer (memcpy, memset, memchr etc.).


  (Of course, the first n bytes and
 last n bytes that are not ulong-aligned are checked with a 
 per-byte loop; so for very short arrays it doesn't lose out to 
 the naïve loop.)  In each iteration over ulong, it performs the 
 bit-twiddling hack alluded to by Nitram to detect the presence 
 of matching bytes, upon which it breaks out to the closing 
 per-byte loop to find the first match. For short arrays, or 
 arrays where a match is quickly found, it's comparable in 
 performance to the naïve loop; for large arrays where the match 
 is not found until later, it easily outperforms the naïve loop.

It is also important to not overdo the optimisations as it can 
happen that the overhead generated manifests in pessimations not 
visible in a specific benchmark. The code size explosion may 
induce I-cache misses, it can also cost I-TLB misses. Worse, 
using SSE or AVX can kill thread switch time or worse even reduce 
the turboing of the CPU.
It's currently a hot topic on realworldtech[1]. Linus Torvalds 
rants about this issue wit memcpy() which is over-engineered and 
does more harm than good in practice but has nice benchmark 
result.

 My current thought is to adopt the same approach: iterate over 
 size_t or some such larger unit, and adapt the bit-twiddling 
 hack to be able to count the number of matches in each size_t.  
 This is turning out to be trickier than I'd like, though, 
 because there is a case where carry propagation makes it 
 unclear how to derive the number of matches without iterating 
 over the bytes a second time.

 But this may not be a big problem, since size_t.sizeof is 
 relatively small, so I can probably loop over individual bytes 
 when one or more matches is detected, and a 
 sufficiently-capable optimizer like ldc or gdc would be able to 
 unroll this into a series of sete + add instructions, no 
 branches that might stall the CPU pipeline. For 
 densely-matching arrays, this should still have comparable 
 performance to the naïve loops; for sparsely-matching arrays 
 this should show significant speedups.

That's what I think too, that a small and simple loop to count 
the matching bytes in the ulong would be a somehow faster than 
the bit twiddling trick which requires a population count of bits.

[1]: 
http://www.realworldtech.com/forum/?threadid=168200&curpostid=168700

May 31 2017

"H. S. Teoh via Digitalmars-d-learn" <digitalmars-d-learn puremagic.com> writes:

On Wed, May 31, 2017 at 12:13:04PM -0700, H. S. Teoh via Digitalmars-d-learn
wrote:
 On Tue, May 30, 2017 at 05:13:46PM -0700, Ali �ehreli via Digitalmars-d-learn
wrote:
 [...]
 I could not make the D program come close to wc's performance when the
 data was piped from stdin.

 [...]
 
 Hmm. This is a particularly interesting case, because I adapted some
 of my algorithms to handle reading from stdin (i.e., std.mmfile is not
 an option), and I could not get it to reach wc's performance!  I even
 installed ldc just to see if that made a difference... it was somewhat
 faster than gdc, but still, the timings were about twice as slow as
 wc.

[...]

Here's a little update on my experiments w.r.t. reading from stdin
without being able to use std.mmfile: I found that I was able to achieve
decent performance by modifying the loop so that it loads data from the
array in size_t chunks rather than by individual bytes, and looping over
the bytes in the size_t to check for matches. Here's the code:
 
	size_t lineCount7(ref File input)
	{
	    import std.algorithm.searching : count;

	    ubyte[] buf;
	    size_t c = 0;

	    buf.length = 8192;

	    foreach (d; input.byChunk(buf))
	    {
		if (d.length == buf.length)
		{
		    auto ichunk = cast(ulong[]) d;
		    size_t subtotal = 0;

		    foreach (i; ichunk)
		    {
			enum eol = cast(ulong) '\n';
			if ((i & 0x00000000000000FF) ==  eol       ) subtotal++;
			if ((i & 0x000000000000FF00) == (eol <<  8)) subtotal++;
			if ((i & 0x0000000000FF0000) == (eol << 16)) subtotal++;
			if ((i & 0x00000000FF000000) == (eol << 24)) subtotal++;
			if ((i & 0x000000FF00000000) == (eol << 32)) subtotal++;
			if ((i & 0x0000FF0000000000) == (eol << 40)) subtotal++;
			if ((i & 0x00FF000000000000) == (eol << 48)) subtotal++;
			if ((i & 0xFF00000000000000) == (eol << 56)) subtotal++;
		    }
		    c += subtotal;
		}
		else
		    c += d.count('\n');
	    }
	    return c;
	}

When the last chunk of the file (possibly the entire file, if it's < 8
KB) is incomplete, we revert back to the na�ve loop-over-bytes search.

While this superficially may seem like unnecessary complication, it
actually makes a significant performance difference, because:

(1) Reading the array in size_t chunks means less roundtrips to RAM or
L1/L2/L3 caches.

(2) Since a size_t fits within a single CPU register, the inner loop can
be completely done inside the CPU without needing to even go to L1
cache, which, while it's pretty fast, is still a memory roundtrip. The
register file is the fastest memory of all, so we maximize this
advantage here.

(3) Since size_t has a fixed size, the loop can be completely unrolled
(ldc does this) and thus completely eliminate branch hazards from the
inner loop.

I originally tried to copy the glibc memchr implementation's xor trick
for checking whether a size_t word contains any matching bytes, but I
got mixed results, and in any case it loses out to my system's wc
implementation.  I suppose given enough effort I could track down what's
causing the D code to slow down, but then I realized something else
about wc: since it uses memchr to find EOL bytes, and now that I know
memchr's implementation in glibc, it means that a lot of overhead is
introduced when the data being scanned contains a lot of matches.

So I did a little test: I created two text files, both 200 million bytes
long, with all 200 million bytes newlines (i.e., 200,000,000 blank
lines), and one with 100-character lines (2,000,000 lines in total).
Then as an intermediate between these two extremes, I concatenated all
of the .d files in Phobos 10 times to make a file with 2.8 million lines
of varying lengths.  Then I tested both my system's wc and my linecount
written in D to see how they performed on these files (piped through
stdin, so no mmap-ing is involved).

Here are the results (note: these are raw measurements; I did not
account for system background noise):

	    +------------------+-------------------+------------------+
	    | 200M blank lines | 2M 100-byte lines | 10x Phobos code  |
+-----------+------------------+-------------------+------------------+
| wc -l     | real    0m0.475s | real    0m0.080s  | real    0m0.083s |
|           | user    0m0.417s | user    0m0.034s  | user    0m0.062s |
|           | sys     0m0.058s | sys     0m0.046s  | sys     0m0.020s |
+-----------+------------------+-------------------+------------------+
| linecount | real    0m0.181s | real    0m0.190s  | real    0m0.099s |
|           | user    0m0.138s | user    0m0.129s  | user    0m0.059s |
|           | sys     0m0.042s | sys     0m0.060s  | sys     0m0.040s |
+-----------+------------------+-------------------+------------------+

As expected, wc -l loses when dealing with blank lines (and, presumably,
short lines); the D version was able to beat it by more than a factor of
2.  On the file with 100-byte lines, though, the performance of wc
improved tremendously, because glibc's memchr is optimized for scanning
large amounts of data before finding a match, whereas the D version
performs more-or-less on par with the blank line case, but losing out to
wc by about a factor of 2.

The results of the 10x Phobos code runs are not directly comparable with
the first two test files, because the total file size is different.
Here, the D code still loses out slightly to wc, presumably because
memchr is ultimately still more efficient given the average line length
in Phobos code.

These results show that the performance of these algorithms depend on
the kind of data you feed them, and there's probably no "optimal"
line-counting algorithm unless you can predict the average line lengths
in advance.  In general, though, if you're dealing with text, I'd wager
that the average line length should be closer to the 100-byte lines end
of the extreme than the file filled with blank lines, so wc probably
wins on your typical text file.  I guess that means that using memchr or
its equivalent in D would be the best strategy to obtain results on par
with wc, as far as reading from stdin is concerned.

When you can use std.mmfile, though, the D version still beats wc by a
large margin. :-D


T

-- 
Help a man when he is in trouble and he will remember you when he is in trouble
again.

Jun 01 2017

"H. S. Teoh via Digitalmars-d-learn" <digitalmars-d-learn puremagic.com> writes:

P.S. After I posted the code, I took a closer look at the disassembly
and found that gdc wasn't generating the best code for the parallel
foreach loop body.  I haven't fully traced the cause yet, but I did find
a simple optimization (arguably a micro-optimization): updating the
subtotal inside the inner loop is a bit inefficient, because the
subtotal is outside the loop body and, due to the way std.parallelism is
implemented, is passed by reference to the loop body. This caused gdc
not to enregister the subtotal, so incrementing it requires a cache
roundtrip at best, a full RAM roundtrip at worst.

So I introduced a local count variable for incrementing, and only assign
to the subtotal array at the end of the block:

-------snip-------
// real    0m0.240s
// user    0m1.045s
// sys     0m0.059s
size_t lineCount4(string filename)
{
    import std.algorithm.comparison : min;
    import std.algorithm.iteration : sum;
    import std.mmfile;
    import std.parallelism;

    auto f = new MmFile(filename);
    auto data = cast(ubyte[]) f[];
    size_t[] subtotals;

    enum blockSize = 16384;
    if (data.length == 0) return 0;
    subtotals.length = 1 + (data.length - 1) / blockSize;

    foreach (j, ref subtotal; parallel(subtotals))
    {
        size_t end = min(blockSize*(j+1), data.length);
        size_t c = 0;
        foreach (i; blockSize*j .. end)
        {
            if (data[i] == '\n') c++;
        }
        subtotal = c;
    }
    return subtotals.sum;
}
-------snip-------

A look at the disassembly showed that gdc has enregistered the subtotal,
and thereby able to eliminate a branch from the inner loop using a sete
instruction and an add for the if statement.  The measured performance
is only marginally better, though, and doesn't change the overall
performance in a significant way.  But I thought this might be a useful
tip for people who want to squeeze out the last drop of juice from their
CPUs. ;-)

Next, I'll have to look into why gdc didn't unroll the inner loop, like
I thought it would.  But I'm out of time now, so that will have to wait
for another day.


T

-- 
The peace of mind---from knowing that viruses which exploit Microsoft system
vulnerabilities cannot touch Linux---is priceless. -- Frustrated system
administrator.

May 30 2017

Russel Winder via Digitalmars-d-learn <digitalmars-d-learn puremagic.com> writes:

On Tue, 2017-05-30 at 17:22 -0700, H. S. Teoh via Digitalmars-d-learn
wrote:
 [=E2=80=A6]
 performance in a significant way.=C2=A0=C2=A0But I thought this might be =

a
 useful
 tip for people who want to squeeze out the last drop of juice from
 their
 CPUs. ;-)
=20

[=E2=80=A6]

I have the beginnings of wc written in various languages. I may well
follow this thread up to create a full D implementation of wc that
people can then optimise a bit.

However, I note here that the Chapel folk are taking a quite
interesting view of algorithm implementation in the Benchmarks Game.
They are totally eschewing "heroic implementations" such as all the C,
C++, etc. in favour of understandable and simple ones. Their take on
raw performance is "if you need it to go faster, use a bigger
computer". Which is quite easy when you have a number of Cray computers
at your disposal. :-) Whilst having some fun, the Benchmark's Game has
become all about heroic implementations on a specific computer. Which
makes the Chapel line an excellent one.

OK for wc the case is different because it is about performance on the
users computer. But still I like the "keep the implementation
comprehensible, avoid heroic implementation".=20

--=20
Russel.
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D
Dr Russel Winder      t: +44 20 7585 2200   voip: sip:russel.winder ekiga.n=
et
41 Buckmaster Road    m: +44 7770 465 077   xmpp: russel winder.org.uk
London SW11 1EN, UK   w: www.russel.org.uk  skype: russel_winder

May 31 2017

"H. S. Teoh via Digitalmars-d-learn" <digitalmars-d-learn puremagic.com> writes:

On Thu, Jun 01, 2017 at 12:20:53AM +0100, Russel Winder via Digitalmars-d-learn
wrote:
[...]
 However, I note here that the Chapel folk are taking a quite
 interesting view of algorithm implementation in the Benchmarks Game.
 They are totally eschewing "heroic implementations" such as all the C,
 C++, etc. in favour of understandable and simple ones. Their take on
 raw performance is "if you need it to go faster, use a bigger
 computer". Which is quite easy when you have a number of Cray
 computers at your disposal. :-) Whilst having some fun, the
 Benchmark's Game has become all about heroic implementations on a
 specific computer. Which makes the Chapel line an excellent one.
 
 OK for wc the case is different because it is about performance on the
 users computer. But still I like the "keep the implementation
 comprehensible, avoid heroic implementation". 

[...]

With D, we can have the cake and eat it too.  The understandable / na�ve
implementation can be available as a fallback (and reference
implementation), with OS-specific optimized implementations guarded
under version() or static-if blocks so that one could, in theory,
provide implementations specific to each supported platform that would
give the best performance.

I disagree about the philosophy of "if you need to go faster, use a
bigger computer".  There are some inherently complex problems (such as
NP-complete, PSPACE-complete, or worse, outright exponential class
problems) where the difference between a "heroic implementation" of a
computational primitive and a na�ve one may mean the difference between
obtaining a result in this lifetime vs. practically never. Or, more
realistically speaking, the difference between being able to solve
moderately-complex problem instances vs. being able to solve only
trivial toy instances.  When you're dealing with exponential complexity,
every small bit counts, and you can never get a big enough computer.

An even more down-to-earth counterargument is that if CPU vendors had
been content with understandable, simple CPU implementations, and
eschewed "heroic", hard-to-understand things like instruction pipelines
and cache hierarchies, we'd still be stuck with 16 MHz CPU's in 2017.


T

-- 
Not all rumours are as misleading as this one.

May 31 2017

Jonathan M Davis via Digitalmars-d-learn writes:

On Wednesday, May 31, 2017 16:03:54 H. S. Teoh via Digitalmars-d-learn 
wrote:
 On Wed, May 31, 2017 at 03:46:17PM -0700, Jonathan M Davis via 

Digitalmars-d-learn wrote:
 On Wednesday, May 31, 2017 12:13:04 H. S. Teoh via Digitalmars-d-learn

 wrote:
 I did some digging around, and it seems that wc is using glibc's
 memchr, which is highly-optimized, whereas std.algorithm.count just
 uses a simplistic loop. Which is strange, because I'm pretty sure
 somebody optimized std.algorithm some time ago to use memchr()
 instead of a loop when searching for a byte value in an array.
 Whatever happened to that??

 I don't know, but memchr wouldn't work with CTFE, so someone might
 have removed it to make it work in CTFE (though that could be done
 with a different branch for CTFE). Or maybe it never made it into
 std.algorithm for one reason or another.

 [...]

 I checked the Phobos code again, and it appears that my memory deceived
 me. Somebody *did* add memchr optimization to find() and its friends,
 but not to count().

 CTFE compatibility is not a problem, since we can just if(__ctfe) the
 optimized block away.

 I'm currently experimenting with a memchr-optimized version of count(),
 but I'm getting mixed results: on small arrays or large arrays densely
 packed with matching elements, the memchr version runs rather slowly,
 because it involves a function call into the C library per matching
 element.  On large arrays only sparsely populated with matching
 elements, though, the memchr-optimized version beats the current code by
 about an order of magnitude.

 Since it wouldn't be a wise idea to assume sparsity of matches in
 Phobos, I decided to do a little more digging, and looked up the glibc
 implementation of memchr. The main optimization is that it iterates over
 the array not by byte, as a na�ve loop would do, but by ulong's. (Of
 course, the first n bytes and last n bytes that are not ulong-aligned
 are checked with a per-byte loop; so for very short arrays it doesn't
 lose out to the na�ve loop.)  In each iteration over ulong, it performs
 the bit-twiddling hack alluded to by Nitram to detect the presence of
 matching bytes, upon which it breaks out to the closing per-byte loop to
 find the first match. For short arrays, or arrays where a match is
 quickly found, it's comparable in performance to the na�ve loop; for
 large arrays where the match is not found until later, it easily
 outperforms the na�ve loop.

 My current thought is to adopt the same approach: iterate over size_t or
 some such larger unit, and adapt the bit-twiddling hack to be able to
 count the number of matches in each size_t.  This is turning out to be
 trickier than I'd like, though, because there is a case where carry
 propagation makes it unclear how to derive the number of matches without
 iterating over the bytes a second time.

 But this may not be a big problem, since size_t.sizeof is relatively
 small, so I can probably loop over individual bytes when one or more
 matches is detected, and a sufficiently-capable optimizer like ldc or
 gdc would be able to unroll this into a series of sete + add
 instructions, no branches that might stall the CPU pipeline. For
 densely-matching arrays, this should still have comparable performance
 to the na�ve loops; for sparsely-matching arrays this should show
 significant speedups.

If you're really trying to make it fast, there may be something that you can
do with SIMD. IIRC, Brian Schott did that with his lexer (or maybe he was
just talking about it - I don't remember for sure).

- Jonathan M Davis

May 31 2017

Patrick Schluter <Patrick.Schluter bbox.fr> writes:

On Thursday, 1 June 2017 at 04:39:17 UTC, Jonathan M Davis wrote:
 On Wednesday, May 31, 2017 16:03:54 H. S. Teoh via 
 Digitalmars-d-learn wrote:
 [...]

 Digitalmars-d-learn wrote:
 [...]

 If you're really trying to make it fast, there may be something 
 that you can do with SIMD. IIRC, Brian Schott did that with his 
 lexer (or maybe he was just talking about it - I don't remember 
 for sure).

See my link above to realdworldtech. Using SIMD can give good 
results in micro-benchmarks but completely screw up performance 
of other things in practice (the alignment requirements are heavy 
and result in code bloat, cache misses, TLB misses, cost of 
context switches, AVX warm up time (Agner Fog observed around 
10000 cycles before AVX switches from 128 bits to 256 bits 
operations), reduced turboing, etc.).

May 31 2017

Jonathan M Davis via Digitalmars-d-learn writes:

On Thursday, June 01, 2017 04:52:40 Patrick Schluter via Digitalmars-d-learn 
wrote:
 On Thursday, 1 June 2017 at 04:39:17 UTC, Jonathan M Davis wrote:
 On Wednesday, May 31, 2017 16:03:54 H. S. Teoh via

 Digitalmars-d-learn wrote:
 [...]

 Digitalmars-d-learn wrote:
 [...]

 If you're really trying to make it fast, there may be something
 that you can do with SIMD. IIRC, Brian Schott did that with his
 lexer (or maybe he was just talking about it - I don't remember
 for sure).

 See my link above to realdworldtech. Using SIMD can give good
 results in micro-benchmarks but completely screw up performance
 of other things in practice (the alignment requirements are heavy
 and result in code bloat, cache misses, TLB misses, cost of
 context switches, AVX warm up time (Agner Fog observed around
 10000 cycles before AVX switches from 128 bits to 256 bits
 operations), reduced turboing, etc.).

Whenever you attempt more complicated optimizations, it becomes harder to
get it right, and you always have the problem of figuring out whether you
really did make it better in general. It's the sort of thing that's easier
when you have a specific use case and it's very difficult to get right when
dealing with a general solution for a standard library. So, it doesn't
surprise me at all if a particular optimization turns out to be a bad idea
for Phobos even if it's great for some use cases.

- Jonathan M Davis

May 31 2017

"H. S. Teoh via Digitalmars-d-learn" <digitalmars-d-learn puremagic.com> writes:

On Wed, May 31, 2017 at 10:05:50PM -0700, Jonathan M Davis via
Digitalmars-d-learn wrote:
 On Thursday, June 01, 2017 04:52:40 Patrick Schluter via Digitalmars-d-learn 

[...]
 See my link above to realdworldtech. Using SIMD can give good
 results in micro-benchmarks but completely screw up performance
 of other things in practice (the alignment requirements are heavy
 and result in code bloat, cache misses, TLB misses, cost of
 context switches, AVX warm up time (Agner Fog observed around
 10000 cycles before AVX switches from 128 bits to 256 bits
 operations), reduced turboing, etc.).

 
 Whenever you attempt more complicated optimizations, it becomes harder
 to get it right, and you always have the problem of figuring out
 whether you really did make it better in general. It's the sort of
 thing that's easier when you have a specific use case and it's very
 difficult to get right when dealing with a general solution for a
 standard library. So, it doesn't surprise me at all if a particular
 optimization turns out to be a bad idea for Phobos even if it's great
 for some use cases.

[...]

It makes me want to just say, write a na�ve loop expressing exactly what
you intend to achieve, and let the compiler's optimizer figure out how
to best optimize it for your target architecture.

Unfortunately, just earlier today while testing an incomplete version of
count() that uses the ulong iteration optimization, I discovered to my
horror that ldc (at -O3) apparently recognizes that exact hack and turns
the loop into a massive bowl of SSE/MMX/AVX/etc soup that's many times
the size of the "unoptimized" loop.  After reading the thread Patrick
pointed out from realworldtech, I'm starting to wonder if the result is
actually faster in practice, or if it only looks good in benchmarks,
because that code bloat is going to add instruction cache pressure and
probably TLB misses, etc..  If your program mostly calls count() on
smallish arrays (which I'd say is rather likely in cases that matter,
because I can't imagine someone would want to count bytes in 1MB arrays
inside an inner loop -- in inner loops you'd tend to be working with
smaller chunks of data and thus you'd want count() to be fast for small
to medium-sized arrays), then a small, tight "unoptimized" loop is going
to perform much better, because it would be easily inlineable, won't add
1KB to your function body size, and thus increase the chances your
function body won't overflow the instruction cache. Plus, reducing the
amount of complicated branches and other paraphrenalia will make the
CPU's branch predictor more likely to get it right, so you're less
likely to cause pipeline stalls.

Perhaps the right approach is to check if the array length is less than
some arbitrary threshold, and just use a na�ve loop below that, and only
switch to the complicated hackish stuff where you're sure it will
actually benefit, rather than hurt, performance.


T

-- 
Not all rumours are as misleading as this one.

May 31 2017

Russel Winder via Digitalmars-d-learn <digitalmars-d-learn puremagic.com> writes:

On Wed, 2017-05-31 at 16:37 -0700, H. S. Teoh via Digitalmars-d-learn
wrote:
 [=E2=80=A6]
=20
 An even more down-to-earth counterargument is that if CPU vendors had
 been content with understandable, simple CPU implementations, and
 eschewed "heroic", hard-to-understand things like instruction
 pipelines
 and cache hierarchies, we'd still be stuck with 16 MHz CPU's in 2017.
=20

The people looking at modern, ultra-parallel hardware architectures are
indeed looking to use very simple CPU with ultra-low power use. Just
because Moore Law, the demand for computation, etc, during the 1980s,
1990s and 2000s led to x86_64 with its "heroic" silicon wafer layout,
doesn't mean that is where we have to stay. That's legacy thinking
based on huge investments of capital and requirement of a company to
continue to force an income stream from it's customers. The current
state of mainstream hardware is all about not innovating.=20

The problem with supercomputing just at the moment, is that you have to
build a power station for each one. The x86_64 and GPGPU approach
hasn't hit the end of Moore's Law, it's hit the "we can't supply enough
power to run it" wall. The Cloud is also not the answer, it's just and
income stream for a couple of companies pretending to continue to
innovate.
 =20
--=20
Russel.
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D
Dr Russel Winder      t: +44 20 7585 2200   voip: sip:russel.winder ekiga.n=
et
41 Buckmaster Road    m: +44 7770 465 077   xmpp: russel winder.org.uk
London SW11 1EN, UK   w: www.russel.org.uk  skype: russel_winder

Jun 01 2017

Jonathan M Davis via Digitalmars-d-learn writes:

On Wednesday, May 31, 2017 22:50:19 H. S. Teoh via Digitalmars-d-learn 
wrote:
 Perhaps the right approach is to check if the array length is less than
 some arbitrary threshold, and just use a na�ve loop below that, and only
 switch to the complicated hackish stuff where you're sure it will
 actually benefit, rather than hurt, performance.

Based on some previous discussions, I think that this is the sort of thing
that std.algorithm.sort does (switch algorithms depending on the size of the
range to be sorted), but I've never actually verified it.

- Jonathan M Davis

Jun 01 2017

Russel Winder via Digitalmars-d-learn <digitalmars-d-learn puremagic.com> writes:

On Wed, 2017-05-31 at 16:37 -0700, H. S. Teoh via Digitalmars-d-learn
wrote:
=20

[=E2=80=A6]
 With D, we can have the cake and eat it too.=C2=A0=C2=A0The understandabl=

e /
 na=C3=AFve
 implementation can be available as a fallback (and reference
 implementation), with OS-specific optimized implementations guarded
 under version() or static-if blocks so that one could, in theory,
 provide implementations specific to each supported platform that
 would
 give the best performance.

But isn't that the compiler's job?

The lesson of functional programming, logic programming, etc. (when the
acolytes  remember the actual lesson) is that declarative expression of
goal is more comprehensible to people than detailed explanation of how
the computer calculates a result. Object-oriented computing lost the
plot somewhere in the early 2000s.

The advance of Scala, Kotlin, Groovy, and now Rust and Go (but only to
some extent), which D has, is to express algorithms as declarative
requirements in a dataflow manner.

One day hardware people will catch up with the hardware innovations of
the 1970s and 1980s regarding support of dataflow rather than state.

 I disagree about the philosophy of "if you need to go faster, use a
 bigger computer".=C2=A0=C2=A0There are some inherently complex problems (=

such
 as
 NP-complete, PSPACE-complete, or worse, outright exponential class
 problems) where the difference between a "heroic implementation" of a
 computational primitive and a na=C3=AFve one may mean the difference
 between
 obtaining a result in this lifetime vs. practically never. Or, more
 realistically speaking, the difference between being able to solve
 moderately-complex problem instances vs. being able to solve only
 trivial toy instances.=C2=A0=C2=A0When you're dealing with exponential
 complexity,
 every small bit counts, and you can never get a big enough computer.

There are always places for experimentation, and innovation. Hard
problems will always be hard, we just need to find the least hard way
of expressing the solutions. The crucial thing is that people always
work to remove the heroicism of the initial solutions, creating new
computational models, new programming languages, etc. to do it.

--=20
Russel.
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D
Dr Russel Winder      t: +44 20 7585 2200   voip: sip:russel.winder ekiga.n=
et
41 Buckmaster Road    m: +44 7770 465 077   xmpp: russel winder.org.uk
London SW11 1EN, UK   w: www.russel.org.uk  skype: russel_winder

Jun 01 2017

"H. S. Teoh via Digitalmars-d-learn" <digitalmars-d-learn puremagic.com> writes:

On Thu, Jun 01, 2017 at 08:39:07AM +0100, Russel Winder via Digitalmars-d-learn
wrote:
 On Wed, 2017-05-31 at 16:37 -0700, H. S. Teoh via Digitalmars-d-learn
 wrote:
 

 […]
 With D, we can have the cake and eat it too.  The understandable /
 naïve implementation can be available as a fallback (and reference
 implementation), with OS-specific optimized implementations guarded
 under version() or static-if blocks so that one could, in theory,
 provide implementations specific to each supported platform that
 would give the best performance.

 
 But isn't that the compiler's job?

Unfortunately, the compiler can only go so far, because it doesn't
understand the larger context of what you're trying to accomplish.
Modern optimizing compilers certainly go a lot further than before, but
still, at some point some amount of human judgment is needed.

Also, compiler implementors do still have to do the "heroics", or
rather, teach the compiler to do the "heroics" when compiling
straightforward code. So while the general programmer probably will have
less need for it, compiler writers still need to know how to do them in
order to write the optimizing compilers in the first place.


 The lesson of functional programming, logic programming, etc. (when
 the acolytes  remember the actual lesson) is that declarative
 expression of goal is more comprehensible to people than detailed
 explanation of how the computer calculates a result. Object-oriented
 computing lost the plot somewhere in the early 2000s.

There is no argument that straightforward code is more comprehensible to
people.  The question here is whether it delivers maximum performance.

We know from Kolgomorov complexity theory that global optimization, in
the general case, is undecidable, so no optimizing compiler is going to
be able to generate optimal code in all cases. There will always be
cases where you have to manually tweak it yourself.  Of course, that
doesn't mean you go around micro-optimizing everything -- the usual
approach is to write it the straightforward way first, then profile it,
identify the hotspots, and find ways to improve performance in the
hotspots. Well, at a higher level, the first order of business is really
to look at it from an algorithmic POV and decide whether or not a
different algorithm ought to be used (and no optimizing compiler can
help you there).  Then if that's still not enough, then you dig into the
details and see if you can squeeze more juice out of your current
algorithms -- if the profiler has indicated that they are the
bottleneck.


 The advance of Scala, Kotlin, Groovy, and now Rust and Go (but only to
 some extent), which D has, is to express algorithms as declarative
 requirements in a dataflow manner.
 
 One day hardware people will catch up with the hardware innovations of
 the 1970s and 1980s regarding support of dataflow rather than state.

Dataflow can only capture a limited subset of algorithms.  Of course, in
my observation, 90% of code being written today generally belongs in the
category of what I call "linear shuffling of data", i.e., iterate over
some linear set of objects and perform some transformation X on each
object, copying linear memory region X to linear memory region Y,
rearranging some linear set of objects, permuting the order of some
linear set of things, etc..  This category basically subsumes all of GUI
programming and web programming, which in my estimation covers 90% or
maybe even 95% of code out there.  The bulk of game code also falls
under this category -- they are basically a matter of copying X items
from A to B, be they pixels to be copied to the video buffer, traversing
the list of in-game objects to update their current position, direction,
speed, or traversing scanlines of a polygon to map a 3D object to 2D,
etc.. Almost all business logic also falls under this category. All of
these are easily captured by dataflow models.

However, there are algorithms outside of this category, that are not
easily captured by the dataflow model.  Solving certain graph theory
problems, for example, require actual insight into the structure of the
problem than mere "move X items from A to B".  Route planning, which is
an NP-complete problem that, for practical applications, can only be
approximated, and therefore actual thought is required for how you
actually go about solving the problem beyond "data X moves from A to B".
Computing the convex hull of a set of input points, used for solving
optimization problems, if expressed and solved in a naïve way, has
O(n^3) time complexity, and therefore impractical for non-trivial
problem instances.

True, your average general programmer may not even know what a convex
hull problem is, and probably doesn't even need to care -- at worst,
there are already libraries out there that implement the algorithms for
you.  But the point is, *somebody* out there needs to write these
algorithms, and they need to implement these algorithms in an optimal
way so that it will continue to be useful for non-trivial problem sizes.
You cannot just say, "here is the problem specification, now just let
the computer / AI system / whatever figure out for themselves how to
obtain the answer".  *Somebody* has to actually sit down and specify
exactly how to compute the answer, because generic ways of arriving at
the answer are exponential or worse and are therefore useless. And even
a feasible algorithm may require a lot of "heroics" in order to make
medium-sized problems more tractable.  You want your weather forecasting
model to produce an answer by tomorrow before the 10am news, not 3
months later when the answer is no longer relevant.


 I disagree about the philosophy of "if you need to go faster, use a
 bigger computer".  There are some inherently complex problems (such
 as NP-complete, PSPACE-complete, or worse, outright exponential
 class problems) where the difference between a "heroic
 implementation" of a computational primitive and a naïve one may
 mean the difference between obtaining a result in this lifetime vs.
 practically never. Or, more realistically speaking, the difference
 between being able to solve moderately-complex problem instances vs.
 being able to solve only trivial toy instances.  When you're dealing
 with exponential complexity, every small bit counts, and you can
 never get a big enough computer.

 
 There are always places for experimentation, and innovation. Hard
 problems will always be hard, we just need to find the least hard way
 of expressing the solutions.

Some problems are inherently hard, and no amount of searching can reduce
its complexity past a certain limit. These require extreme measures to
be even remotely tractable.


 The crucial thing is that people always work to remove the heroicism
 of the initial solutions, creating new computational models, new
 programming languages, etc. to do it.

[...]

But *somebody* has to implement those computational models and
programming languages.  If nobody knows how to write "heroic" code, then
nobody would know how to write an optimizing compiler that produces such
code either, and these computational models and programming languages
wouldn't exist in the first place.

I know that the average general programmer doesn't (and shouldn't) care.
But *somebody* has to, in order to implement the system in the first
place. *Somebody* had to implement the "heroic" version of memchr so
that others can use it as a primitive. Without that, everyone would have
to roll their own, and it's almost a certainty that the results will be
underwhelming.


T

-- 
Question authority. Don't ask why, just do it.

Jun 02 2017

Russel Winder via Digitalmars-d-learn <digitalmars-d-learn puremagic.com> writes:

On Fri, 2017-06-02 at 10:32 -0700, H. S. Teoh via Digitalmars-d-learn
wrote:
 [=E2=80=A6]
=20
 Also, compiler implementors do still have to do the "heroics", or
 rather, teach the compiler to do the "heroics" when compiling
 straightforward code. So while the general programmer probably will
 have
 less need for it, compiler writers still need to know how to do them
 in
 order to write the optimizing compilers in the first place.

There are many different sorts of programming. Operating systems,
compilers, GUIs, Web services, machine learning, etc., etc. all require
different techniques. Also there are always new areas, where idioms and
standard approaches are yet to be discovered. There will always be a
place for "heroic", but to put it up on a pedestal as being a Good
Thing For All=E2=84=A2 is to do "heroic" an injustice.

We should also note that in the Benchmark Game, the "heroic" solutions
are targetted specifically at Isaac's execution machine, which often
means they are crap programs on anyone else's computer.

 [=E2=80=A6]
 be able to generate optimal code in all cases. There will always be
 cases where you have to manually tweak it yourself.=C2=A0=C2=A0Of course,=

 that
 doesn't mean you go around micro-optimizing everything -- the usual
 approach is to write it the straightforward way first, then profile
 it,
 identify the hotspots, and find ways to improve performance in the
 hotspots. Well, at a higher level, the first order of business is
 really
 to look at it from an algorithmic POV and decide whether or not a
 different algorithm ought to be used (and no optimizing compiler can
 help you there).=C2=A0=C2=A0Then if that's still not enough, then you dig=

 into
 the
 details and see if you can squeeze more juice out of your current
 algorithms -- if the profiler has indicated that they are the
 bottleneck.

The optimisations are though generally aimed at the current execution
computer. Which is fine in the short term. However in the long term,
the optimisations become the problem. When the execution context of an
optimised code changes then the optimisations should be backed out and
new optimisations applied. Sadly this rarely happens, and you end up
with new optimisations laid on old (redundant) optimisations, and hence
to incomprehensible code that people darn't amend as they have no idea


=20

[=E2=80=A6]
 But *somebody* has to implement those computational models and
 programming languages.=C2=A0=C2=A0If nobody knows how to write "heroic" c=

ode,
 then
 nobody would know how to write an optimizing compiler that produces
 such
 code either, and these computational models and programming languages
 wouldn't exist in the first place.

Which returns us to there are different sorts of programming, and there
are people at "the bleeding edge" of languages, techniques, and
hardware, researching these new things. Or there ought to be, it's just
that you need funds to do it.

 I know that the average general programmer doesn't (and shouldn't)
 care.
 But *somebody* has to, in order to implement the system in the first
 place. *Somebody* had to implement the "heroic" version of memchr so
 that others can use it as a primitive. Without that, everyone would
 have
 to roll their own, and it's almost a certainty that the results will
 be
 underwhelming.

It may be worth noting that far too few supposedly professional
programmers actually know enough about the history of their subject to
be deemed competent.=20


--=20
Russel.
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D
Dr Russel Winder      t: +44 20 7585 2200   voip: sip:russel.winder ekiga.n=
et
41 Buckmaster Road    m: +44 7770 465 077   xmpp: russel winder.org.uk
London SW11 1EN, UK   w: www.russel.org.uk  skype: russel_winder

Jun 02 2017

"H. S. Teoh via Digitalmars-d-learn" <digitalmars-d-learn puremagic.com> writes:

On Sat, Jun 03, 2017 at 07:00:47AM +0100, Russel Winder via Digitalmars-d-learn
wrote:
[...]
 There are many different sorts of programming. Operating systems,
 compilers, GUIs, Web services, machine learning, etc., etc. all
 require different techniques. Also there are always new areas, where
 idioms and standard approaches are yet to be discovered. There will
 always be a place for "heroic", but to put it up on a pedestal as
 being a Good Thing For All™ is to do "heroic" an injustice.

Fair enough.  I can see how this would lead to unnecessarily ugly,
prematurely-optimized code.  It's probably the origin of premature
optimization culture especially prevalent in C circles, where you just
get into the habit of automatically thinking things like "i = i + 1 is
less efficient than ++i", which may have been true in some bygone era
but is no longer relevant in the machines and optimizing compilers of
today. And also, constantly "optimizing" code that actually aren't the
bottleneck, because of some vague notion of wanting "everything" to be
fast, yet not being willing to use a profiler to find out where the real
bottleneck is.  As a result you spend inordinate amounts of time writing
the absolutestly fastest O(n^2) algorithm rather than substituting a
moderately unoptimized O(n) algorithm that's far superior, thus actually
introducing new bottlenecks instead of fixing existing ones.


 We should also note that in the Benchmark Game, the "heroic" solutions
 are targetted specifically at Isaac's execution machine, which often
 means they are crap programs on anyone else's computer.

Well, there is some value in targeting a specific execution environment,
but I agree that holding that up as being exemplary of how code should
be written would be rather misguided.


[...]
 The optimisations are though generally aimed at the current execution
 computer. Which is fine in the short term. However in the long term,
 the optimisations become the problem. When the execution context of an
 optimised code changes then the optimisations should be backed out and
 new optimisations applied. Sadly this rarely happens, and you end up
 with new optimisations laid on old (redundant) optimisations, and
 hence to incomprehensible code that people darn't amend as they have


I've been thinking about this for a while now, actually.  It almost
seems as though there ought to be two distinct layers of abstraction in
a given piece of code, a high-level, logical layer that specifies using
some computation model the desired results, and another lower-level
layer that contains either implementation details or target-specific
tweaks.  There should be an automatic translation from the upper layer
to the lower layer, but after the automatic translation you can go in
and tweak the lower layer *while keeping the upper layer intact*, and
the system (IDE or whatever) would keep track of *both*, with the lower
layer customizations tracked as a set of diffs against the automatically
translated version.  When the upper layer changes, any corresponding
diffs in the lower layer get invalidated and either produces a conflict
the programmer must manually resolve, or else defaults to the new
automated translation.

Furthermore, there should be some system of tracking multiple diff sets
for the lower layer, so that you can specify diff A as applying to
target machine X, and diff B as applying to target machine Y. So you can
target the same logical piece of code to different target machines with
different implementations.


[...]
 I know that the average general programmer doesn't (and shouldn't)
 care.  But *somebody* has to, in order to implement the system in
 the first place. *Somebody* had to implement the "heroic" version of
 memchr so that others can use it as a primitive. Without that,
 everyone would have to roll their own, and it's almost a certainty
 that the results will be underwhelming.

 
 It may be worth noting that far too few supposedly professional
 programmers actually know enough about the history of their subject to
 be deemed competent. 

[...]

Yes, and that is why the people who actually know what they're doing
need to be able to write the "hackish", optimized implementations of the
nicer APIs provided by the language / system, so that at the very least
the API calls would do something sane, even if the code above that is
crap.


T

-- 
Questions are the beginning of intelligence, but the fear of God is the
beginning of wisdom.

Jun 02 2017

D Programming

C/C++ Programming

Other

digitalmars.D.learn - howto count lines - fast