digitalmars.D - Success! Find is slow no more

Michael Brown (53/53) May 22 2019 Hi All,

H. S. Teoh (11/22) May 22 2019 [...]

Brian Schott (37/39) May 22 2019 Cloning the repository and running "make ldc" on my machine gives

H. S. Teoh (10/25) May 22 2019 [...]

Jon Degenhardt (10/34) May 22 2019 Should also benchmark with LTO turned on (including LTO for
Michael Brown (79/103) May 22 2019 Hi All,

Seb (6/13) May 22 2019 BTW while the low-hanging fruits for find have been picked, there
Andrei Alexandrescu (23/46) May 23 2019 This line should return haystack[0 .. 0] (an empty string matches at the...

Michael Brown <mikey.be gmail.com> writes:

Hi All,

I saw the blog post regarding the improvements to find three 
years ago. Attempted to improve upon it and failed miserably.

After watching Andrei's recent talk on fastware (when is the book 
coming out??) I thought id have another go at it. Success! I'm 
getting a 5-10% improvement over the Phobos find() on all tests - 
and it beats all over tested alternatives too.

Compiled with DMD on windows.

Search in Alice in Wonderland
        std: 103 ±1
     manual: 126 ±2
   A2Phobos: 111 ±2
      Chris: 136 ±2
     Andrei: 298 ±3
    Andrei2: 140 ±1
     Faster: 100 ±0
Search in random short strings
        std: 125 ±6
     manual: 134 ±9
   A2Phobos: 104 ±4
      Chris: 136 ±11
     Andrei: 123 ±11
    Andrei2: 110 ±6
     Faster: 101 ±2
Mismatch in random long strings
        std: 109 ±8
     manual: 183 ±66
   A2Phobos: 116 ±9
      Chris: 199 ±77
     Andrei: 331 ±140
    Andrei2: 143 ±19
     Faster: 100 ±0
Search random haystack with random needle
        std: 110 ±7
     manual: 151 ±26
   A2Phobos: 119 ±11
      Chris: 165 ±34
     Andrei: 261 ±51
    Andrei2: 145 ±17
     Faster: 101 ±2
  (avg slowdown vs fastest; absolute deviation)
CPU ID: GenuineIntel Celeron(R) Dual-Core CPU       T3500    
2.10GHz

Full code here: https://github.com/mikey-b/find-is-too-slow - 
method is faster_find()

In short, it was achieved with

* Better branch prediction - The if's and loops have been 
rearranged to promote the hot path's.
* Loop unrolling - I have unrolled the loop once, so that Skip 
can be calculated on the first failure, allowing us to remove the 
branch in future iterations needed to test skip's lazy evaluation.

Kind Regards,
Mike Brown

May 22 2019

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

On Wed, May 22, 2019 at 11:13:36PM +0000, Michael Brown via Digitalmars-d wrote:
 Hi All,
 
 I saw the blog post regarding the improvements to find three years
 ago.  Attempted to improve upon it and failed miserably.
 
 After watching Andrei's recent talk on fastware (when is the book
 coming out??) I thought id have another go at it. Success! I'm getting
 a 5-10% improvement over the Phobos find() on all tests - and it beats
 all over tested alternatives too.
 
 Compiled with DMD on windows.

[...]

Can you also try to run your tests using ldc/gdc?

Optimizing for dmd is not interesting IMO, because of DMD's
suboptimal(!) optimizer.  I personally would rather not uglify code just
to improve DMD benchmarks.

But if your code shows improvement with ldc/gdc, *then* we have
something interesting to talk about.


T

-- 
It always amuses me that Windows has a Safe Mode during bootup. Does that mean
that Windows is normally unsafe?

May 22 2019

Brian Schott <briancschott gmail.com> writes:

On Wednesday, 22 May 2019 at 23:24:47 UTC, H. S. Teoh wrote:
 But if your code shows improvement with ldc/gdc, *then* we have 
 something interesting to talk about.

Cloning the repository and running "make ldc" on my machine gives 
the following:

Search in Alice in Wonderland
        std: 178 ±13
     manual: 124 ±10
   A2Phobos: 100 ±0
      Chris: 121 ±10
     Andrei: 172 ±11
    Andrei2: 116 ±8
     Faster: 143 ±10
Search in random short strings
        std: 216 ±49
     manual: 178 ±43
   A2Phobos: 106 ±8
      Chris: 195 ±66
     Andrei: 121 ±18
    Andrei2: 124 ±23
     Faster: 125 ±18
Mismatch in random long strings
        std: 242 ±87
     manual: 229 ±100
   A2Phobos: 104 ±7
      Chris: 226 ±96
     Andrei: 227 ±101
    Andrei2: 133 ±29
     Faster: 153 ±25
Search random haystack with random needle
        std: 244 ±79
     manual: 194 ±64
   A2Phobos: 105 ±9
      Chris: 198 ±69
     Andrei: 177 ±39
    Andrei2: 144 ±40
     Faster: 160 ±27
  (avg slowdown vs fastest; absolute deviation)
CPU ID: GenuineIntel Intel(R) Core(TM) i7-7700HQ CPU   2.80GHz

May 22 2019

"H. S. Teoh" <hsteoh quickfur.ath.cx> writes:

On Wed, May 22, 2019 at 11:52:12PM +0000, Brian Schott via Digitalmars-d wrote:
 On Wednesday, 22 May 2019 at 23:24:47 UTC, H. S. Teoh wrote:
 But if your code shows improvement with ldc/gdc, *then* we have
 something interesting to talk about.

 
 Cloning the repository and running "make ldc" on my machine gives the
 following:
 
 Search in Alice in Wonderland
        std: 178 �13
     manual: 124 �10
   A2Phobos: 100 �0
      Chris: 121 �10
     Andrei: 172 �11
    Andrei2: 116 �8
     Faster: 143 �10

[...]

Interesting! So A2Phobos (which version does it refer to?) is faster in
all these cases. Looks like that's what we should merge. The `Faster`
algorithm doesn't seem to do very well, I'm guessing because the manual
optimizations have confused LDC's optimizer enough that it's unable to
produce good code.


T

-- 
"You know, maybe we don't *need* enemies." "Yeah, best friends are about all I
can take." -- Calvin & Hobbes

May 22 2019

Jon Degenhardt <jond noreply.com> writes:

On Thursday, 23 May 2019 at 00:16:16 UTC, H. S. Teoh wrote:
 On Wed, May 22, 2019 at 11:52:12PM +0000, Brian Schott via 
 Digitalmars-d wrote:
 On Wednesday, 22 May 2019 at 23:24:47 UTC, H. S. Teoh wrote:
 But if your code shows improvement with ldc/gdc, *then* we 
 have something interesting to talk about.

 
 Cloning the repository and running "make ldc" on my machine 
 gives the following:
 
 Search in Alice in Wonderland
        std: 178 ±13
     manual: 124 ±10
   A2Phobos: 100 ±0
      Chris: 121 ±10
     Andrei: 172 ±11
    Andrei2: 116 ±8
     Faster: 143 ±10

 [...]

 Interesting! So A2Phobos (which version does it refer to?) is 
 faster in all these cases. Looks like that's what we should 
 merge. The `Faster` algorithm doesn't seem to do very well, I'm 
 guessing because the manual optimizations have confused LDC's 
 optimizer enough that it's unable to produce good code.


 T

Should also benchmark with LTO turned on (including LTO for 
Phobos/DRuntime). This doesn't always help, but it does in a 
number of cases, and it's difficult to predict without trying. 
I've had especially good results when tight loops are involved. 
For LDC add  '-flto=thin 
-defaultlib=phobos2-ldc-lto,druntime-ldc-lto' to the build line.

(LTO+PGO is even more likely to be beneficial, but PGO takes more 
investment than LTO).

--Jon

May 22 2019

Michael Brown <mikey.be gmail.com> writes:

On Thursday, 23 May 2019 at 00:16:16 UTC, H. S. Teoh wrote:
 On Wed, May 22, 2019 at 11:52:12PM +0000, Brian Schott via 
 Digitalmars-d wrote:
 On Wednesday, 22 May 2019 at 23:24:47 UTC, H. S. Teoh wrote:
 But if your code shows improvement with ldc/gdc, *then* we 
 have something interesting to talk about.

 
 Cloning the repository and running "make ldc" on my machine 
 gives the following:
 
 Search in Alice in Wonderland
        std: 178 ±13
     manual: 124 ±10
   A2Phobos: 100 ±0
      Chris: 121 ±10
     Andrei: 172 ±11
    Andrei2: 116 ±8
     Faster: 143 ±10

 [...]

 Interesting! So A2Phobos (which version does it refer to?) is 
 faster in all these cases. Looks like that's what we should 
 merge. The `Faster` algorithm doesn't seem to do very well, I'm 
 guessing because the manual optimizations have confused LDC's 
 optimizer enough that it's unable to produce good code.


 T

Hi All,

Just installed LDC, Yes - ldc not producing SIMD instructions on 
the loops, which it is for the other methods. Rewriting the loops 
to the code below gives.

Search in Alice in Wonderland
        std: 122 ±4
     manual: 113 ±2
   A2Phobos: 100 ±0
      Chris: 152 ±3
     Andrei: 161 ±3
    Andrei2: 118 ±2
     Faster: 113 ±2
Search in random short strings
        std: 125 ±13
     manual: 126 ±13
   A2Phobos: 120 ±9
      Chris: 129 ±11
     Andrei: 103 ±4
    Andrei2: 134 ±19
     Faster: 102 ±3
Mismatch in random long strings
        std: 117 ±7
     manual: 166 ±62
   A2Phobos: 105 ±7
      Chris: 211 ±84
     Andrei: 170 ±76
    Andrei2: 118 ±9
     Faster: 111 ±8
Search random haystack with random needle
        std: 116 ±8
     manual: 139 ±32
   A2Phobos: 109 ±13
      Chris: 169 ±35
     Andrei: 131 ±29
    Andrei2: 116 ±8
     Faster: 112 ±9
  (avg slowdown vs fastest; absolute deviation)
CPU ID: GenuineIntel Celeron(R) Dual-Core CPU       T3500    
2.10GHz

Still - A solid second place, against A2Phobos. Wondering if it 
has some tricks its using I could incorp.

CODE


T[] faster_find(T)(T[] haystack, T[] needle)
{
     if (needle.length == 0) return haystack[$..$];
	if (needle.length > haystack.length) return haystack[$..$];

	immutable end = needle.length - 1;
	size_t j = end, skip = 0;
	
     while(++j < haystack.length)
     {
         if (haystack[j] != needle[end]) continue;
		
         for(size_t i = 1, k = 0; i < needle.length; ++i, ++k) {
             if (haystack[j - needle.length + i] != needle[k]) {
				while (skip < end && needle[end - 1 - skip] != needle[end]) { 
++skip; }
				j += skip;
				goto main_loop2;
			}
         }
		return haystack[j - end .. $];
     }
     return haystack[$ .. $];

	main_loop2: while(++j < haystack.length)
     {
         if (haystack[j] != needle[end]) continue;
		
         for(size_t i = 1, k = 0; i < needle.length; ++i, ++k) {
             if (haystack[j - needle.length + i] != needle[k]) {
				j += skip;
				continue main_loop2;
			}
         }
		return haystack[j - end .. $];
     }
     return haystack[$ .. $];
}

May 22 2019

Seb <seb wilzba.ch> writes:

On Thursday, 23 May 2019 at 01:26:09 UTC, Michael Brown wrote:
 On Thursday, 23 May 2019 at 00:16:16 UTC, H. S. Teoh wrote:
 [...]

 Hi All,

 Just installed LDC, Yes - ldc not producing SIMD instructions 
 on the loops, which it is for the other methods. Rewriting the 
 loops to the code below gives.

 [...]

BTW while the low-hanging fruits for find have been picked, there 
are still quite a lot of functions in std.algorithm where 
optimization is easier.
Replicating "find is too slow, so we fixed it" for another symbol 
would be extremely valuable.

May 22 2019

Andrei Alexandrescu <SeeWebsiteForEmail erdani.com> writes:

A few comments within the code:

 T[] faster_find(T)(T[] haystack, T[] needle)
 {
      if (needle.length == 0) return haystack[$..$];

This line should return haystack[0 .. 0] (an empty string matches at the 
beginning). Distinction is mainly academic but it will reduce the size 
of the function.

      if (needle.length > haystack.length) return haystack[$..$];

I think this falls off naturally so no need to treat it as a special case.

(A good case to treat as a special case is needle.length == 1.)

      immutable end = needle.length - 1;
      size_t j = end, skip = 0;
 
      while(++j < haystack.length)

Whenever I see ++something I think "the first increment here is 
useless". By the same token it could happen that with something++ the 
last increment is useless. The difference is that the latter has fewer 
dependencies when it's also part of a comparison. I suggest looking into 
replacing ++j with starting at needle.length and doing the loop slightly 
differently.

      {
          if (haystack[j] != needle[end]) continue;

Inside this loop you have already checked the bounds so if the compiler 
generates them you may want to use .ptr[index]. There's some  trusted 
gymnastics you need to add.

          for(size_t i = 1, k = 0; i < needle.length; ++i, ++k) {

Here you have two iteration variables that always differ by 1. You may 
want to use only one iteration variable. (I'm not sure this will improve 
matters for built-in types as there's enough registers.)

              if (haystack[j - needle.length + i] != needle[k]) {
                  while (skip < end && needle[end - 1 - skip] != 
 needle[end]) { ++skip; }
                  j += skip;
                  goto main_loop2;
              }
          }
          return haystack[j - end .. $];
      }
      return haystack[$ .. $];
 
      main_loop2: while(++j < haystack.length)

Looks like the two versions of the loop differ only by the computation 
of skip. The code would be smaller (and therefore potentially faster) if 
skip were initialized e.g. to size_t.max and then initialized once. The 
test skip != size_t.max will fail only once and after that it will be 
reliably speculated.

May 23 2019

D Programming

C/C++ Programming

Other

digitalmars.D - Success! Find is slow no more