
digitalmars.D - A Friendly Challenge for D

reply Jabari Zakiya <jzakiya gmail.com> writes:
Hi.

I hope this is the right place to request this, if not please 
tell me a better one.

I looked at D, and played with it some circa 2010~2012, but 
time and life took my priorities away. I'm still interested 
in learning different languages, but there are so many more now 
that it's hard to devote the time to learn any of them to a high 
level of proficiency.

Subsequently, I started learning Nim, because I find the syntax 
and constructs simpler and more familiar than most (I'm coming 
mostly from a Ruby background).

I have developed in Nim a program to find twinprimes that seems 
to be the fastest of its type, compared to primesieve 
(https://primesieve.org/), written in C++, which claims to be 
the fastest.

Here is the code to my Nim implementation of twinprimes_ssoz.

https://gist.github.com/jzakiya/6c7e1868bd749a6b1add62e3e3b2341e

The total file is just 318 lines of code, with about 60 separate 
lines of comments.
The code is extensively commented per line to explain what's 
happening.
Reading the references given in the code introduction will 
explain the general process.
See  "The Segmented Sieve of Zakiya (SSoZ)"
https://www.scribd.com/doc/228155369/The-Segmented-Sieve-of-Zakiya-SSoZ
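
For readers who just want to see what is being counted before diving into the paper: here is a minimal, naive twin-prime enumeration in Python. It only illustrates the definition, and is nothing like the SSoZ itself (the function name is mine).

```python
def twin_primes_upto(n):
    # Plain boolean Sieve of Eratosthenes up to n.
    sieve = [True] * (n + 1)
    sieve[0] = sieve[1] = False
    for p in range(2, int(n ** 0.5) + 1):
        if sieve[p]:
            sieve[p * p :: p] = [False] * len(sieve[p * p :: p])
    # Twinprimes are (odd) prime pairs that differ by exactly 2.
    return [(p, p + 2) for p in range(2, n - 1) if sieve[p] and sieve[p + 2]]

print(twin_primes_upto(31))  # [(3, 5), (5, 7), (11, 13), (17, 19), (29, 31)]
```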

I am in the process of writing up a paper to explain this 
application and implementation. What I am requesting here is for 
a person (or persons) who is an "expert" (very good) to create a 
very fast D version, using whatever tricks the language has to 
maximize performance.

I would like to include in my paper a good comparison of various 
implementations in different compiled languages (C/C++, D, Nim, 
etc) to show how it performs with each.

This algorithm is designed to run in multiprocessing 
environments (more cores/threads, better performance). The Nim 
version (0.18.0; 0.19.0 recently released) uses Nim's native 
parallel processing constructs. In C++, et al, it may be best 
implemented using OpenMP, or CUDA, etc. These are implementation 
details that someone versed in a language could determine.

Well, that's it. I am willing to share performance data, compared 
to primesieve, if you like. The beauty of this 
algorithm|implementation is its simplicity of design and minimal 
code. It shouldn't be that hard to understand, or to determine 
how to translate into D. Of course, I'm fully available to answer 
questions and provide help.

Thanks

Jabari
Oct 10 2018
next sibling parent SrMordred <patric.dexheimer gmail.com> writes:
On Wednesday, 10 October 2018 at 16:15:56 UTC, Jabari Zakiya 
wrote:
 [...]
Looking forward to this :)
Oct 10 2018
prev sibling next sibling parent reply Kagamin <spam here.lot> writes:
On Wednesday, 10 October 2018 at 16:15:56 UTC, Jabari Zakiya 
wrote:
 https://gist.github.com/jzakiya/6c7e1868bd749a6b1add62e3e3b2341e
As I understand it, the main thread preallocates global memory and tracks it, and the other threads don't track it?
Oct 10 2018
parent reply Jabari Zakiya <jzakiya gmail.com> writes:
On Wednesday, 10 October 2018 at 20:43:01 UTC, Kagamin wrote:
 On Wednesday, 10 October 2018 at 16:15:56 UTC, Jabari Zakiya 
 wrote:
 https://gist.github.com/jzakiya/6c7e1868bd749a6b1add62e3e3b2341e
As i understand, main thread preallocates global memory and tracks it, and other threads don't track it?
Here's an abbreviated, elementary explanation of the algorithm and implementation. You will need to read (download) my paper "The Segmented Sieve of Zakiya (SSoZ)" because I will make reference to things in it, to keep this as short as possible.

https://www.scribd.com/doc/228155369/The-Segmented-Sieve-of-Zakiya-SSoZ

To really understand the math and theory of the algorithm requires primarily just understanding Table 3 on page 3 of the paper, as it encapsulates everything. You can read the paper to understand the language of the algorithm used in the code.

Twinprimes are 2 (odd) primes that differ by only 2, e.g. 3-5, 5-7, 11-13, 29-31. Table 3 shows all the residue values (prime candidates|pcs) and residue groups (resgroups|columns) needed to find all the primes up to 541 using P5 (modpg = 30, rescnt = 8). A given value N is represented with a PG table of some number of resgroups, with a max size I call Kmax (the resgroup that N resides in).

Using P5, I only need to sieve primes along the residue tracks (restracks) that can produce twinprimes, here 11-13, 17-19, and 29-31. Thus I create 3 byte arrays, one for each twinpair, and use the lower 2 bits of each byte to represent the upper and lower twinprime restracks. Then for each twinpair (here 3) I run a thread, and the 3 threads perform the SSoZ over the entire Kmax length of resgroups in parallel. At the end I accumulate the results and print them out. This, in a nutshell, is what the algorithm does.

The paper gives you enough to understand the fundamental nature of the algorithm, though I've learned so much more since 2014. :) The larger the PG, the more twinpair restracks there are to use (see page 14). For larger numbers you want to use the largest PG that 'fits' into the hardware cpu. All my development|testing was done on laptops using Intel i5|i7 cpus with 4 or 8 threads. I'm really interested in how it performs on other platforms (ARM, AMD, PowerPC, etc).
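
As an illustration of the P5 residue system described above, the 8 residue tracks and the 3 twinpair tracks can be derived in a few lines of Python. This is just a sketch of Table 3's structure (residues coprime to modpg = 30), not the sieve itself:

```python
from math import gcd

modpg = 30  # P5 prime generator modulus
# The 8 residues (rescnt = 8) coprime to 30; only these columns of
# the PG table can hold primes > 5.
residues = [r for r in range(2, modpg + 2) if gcd(r, modpg) == 1]
# Twinprime candidates live on adjacent residue tracks that differ by 2.
twin_pairs = [(a, b) for a, b in zip(residues, residues[1:]) if b - a == 2]

print(residues)    # [7, 11, 13, 17, 19, 23, 29, 31]
print(twin_pairs)  # [(11, 13), (17, 19), (29, 31)] -- the 3 twinpair tracks
```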
The main proc "twinprimes_ssoz" manages program flow and operates as follows:

1) accepts the input number (an integer) in "val" from the cli
2) sets "num" to the first odd number < "val" if "val" is even
3) calls "selectPG" with "num" to select the optimum Prime Generator (PG) parameters
4) computes various working parameters per the selected PG and number value (see refs)
5) computes global values for the number of primes, and their values, <= sqrt(num) for the selected PG with proc "sozpg"
6) sets the initial twinprime count for the selected PG
7) then with proc "segsieve" allocates one thread to perform the SSoZ (segmented sieve of zakiya) on each twinprime pair's residues (determined by the selected PG), and counts the number of twinprimes computed in each thread
8) then determines the true total twinprime count and the last twinprime <= "num"

It also times the different intervals and prints out diagnostics and final output.

The proc "twins_sieve" is the primary function that manages and performs the SSoZ for a given twinpair's parameters in each thread. The more threads, the faster the process goes.

I'll provide more info later. I have to run now. I wanted to get this out now while I was at my laptop, and online.
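
As a rough sketch of step 5: the real "sozpg" proc works within the selected PG's residue system, so this plain byte sieve is only a stand-in (and the function name is mine), but it shows what is being computed, the sieving primes <= sqrt(num):

```python
from math import isqrt

def sieving_primes(num):
    # Step 5 needs the primes <= sqrt(num); they become the sieving
    # primes applied against each twinpair residue track.
    limit = isqrt(num)
    flags = bytearray([1]) * (limit + 1)  # one byte flag per candidate
    flags[0:2] = b"\x00\x00"              # 0 and 1 are not prime
    for p in range(2, isqrt(limit) + 1):
        if flags[p]:
            flags[p * p :: p] = bytes(len(flags[p * p :: p]))
    return [p for p in range(limit + 1) if flags[p]]

print(sieving_primes(541))  # the primes <= 23, since isqrt(541) = 23
```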
Oct 10 2018
parent reply Neia Neutuladh <neia ikeran.org> writes:
On 10/10/2018 03:05 PM, Jabari Zakiya wrote:
 https://www.scribd.com/doc/228155369/The-Segmented-Sieve-of-Zakiya-SSoZ
It would be great if you could provide a link to a freely downloadable version of this.
Oct 10 2018
parent reply Jabari Zakiyth <jzakiya gmail.com> writes:
On Wednesday, 10 October 2018 at 22:25:17 UTC, Neia Neutuladh 
wrote:
 On 10/10/2018 03:05 PM, Jabari Zakiya wrote:
 https://www.scribd.com/doc/228155369/The-Segmented-Sieve-of-Zakiya-SSoZ
It would be great if you could provide a link to a freely downloadable version of this.
You can download the paper for free from that link. Did you have trouble doing it?

Here's another link to the paper.

https://www.academia.edu/7583194/The_Segmented_Sieve_of_Zakiya_SSoZ_

Here, again, is the link to the Nim code. Just install Nim (currently 0.19.0) and compile and run it per the instructions in the code. I recommend people do that to see its outputs and verify its results.

https://gist.github.com/jzakiya/6c7e1868bd749a6b1add62e3e3b2341e
Oct 10 2018
parent reply Ali Çehreli <acehreli yahoo.com> writes:
On 10/10/2018 07:52 PM, Jabari Zakiyth wrote:
 On Wednesday, 10 October 2018 at 22:25:17 UTC, Neia Neutuladh wrote:
 On 10/10/2018 03:05 PM, Jabari Zakiya wrote:
 https://www.scribd.com/doc/228155369/The-Segmented-Sieve-of-Zakiya-SSoZ
It would be great if you could provide a link to a freely downloadable version of this.
You can download the paper for free from that link. Did you have trouble doing it?
I think the problem is that scribd requires an account, which they are apparently happy to link to an existing Google or Facebook account.
 Here's another link to paper.

 https://www.academia.edu/7583194/The_Segmented_Sieve_of_Zakiya_SSoZ_
Similarly, that one requires a Google account.
 Here, again, is the link to the Nim code. Just install Nim (current
 0.19.0) and compile and run it per instructions in code. I recommend
 people do that to see its outputs and verify its results.

 https://gist.github.com/jzakiya/6c7e1868bd749a6b1add62e3e3b2341e
That works! :) Ali
Oct 10 2018
parent reply Jabari Zakiyth <jzakiya gmail.com> writes:
On Thursday, 11 October 2018 at 05:11:20 UTC, Ali Çehreli wrote:
 On 10/10/2018 07:52 PM, Jabari Zakiyth wrote:
 On Wednesday, 10 October 2018 at 22:25:17 UTC, Neia Neutuladh
wrote:
 On 10/10/2018 03:05 PM, Jabari Zakiya wrote:
 
https://www.scribd.com/doc/228155369/The-Segmented-Sieve-of-Zakiya-SSoZ
 It would be great if you could provide a link to a freely
downloadable
 version of this.
You can download the paper for free from that link. Did you
have trouble
 doing it?
I think the problem is that scribd requires an account, which they are apparently happy to link to an existing Google or Facebook account.
 Here's another link to paper.

 
https://www.academia.edu/7583194/The_Segmented_Sieve_of_Zakiya_SSoZ_

Similarly, that one requires a Google account.
 Here, again, is the link to the Nim code. Just install Nim
(current
 0.19.0) and compile and run it per instructions in code. I
recommend
 people do that to see its outputs and verify its results.

 
https://gist.github.com/jzakiya/6c7e1868bd749a6b1add62e3e3b2341e

That works! :) Ali
What country are you trying to get access from? I know people in the US have gotten the papers from those links, for free and without an account. It may also have something to do with your browser; they work for me in various modern browsers, including the Tor Browser.
Oct 11 2018
parent reply Carl Sturtivant <sturtivant gmail.com> writes:
On Thursday, 11 October 2018 at 13:26:19 UTC, Jabari Zakiyth 
wrote:
 On Thursday, 11 October 2018 at 05:11:20 UTC, Ali Çehreli wrote:
 [...]
What country are you trying to get access from, because I know people in the US have gotten the papers from those link, for free and without an account. It may also have something to do with your browser. They work for me in various modern browsers, including the Tor Browser.
Not credible --- I cannot get either without an account browsing from the US using Firefox with no unusual arrangements.
Oct 11 2018
parent reply Jabari Zakiya <jzakiya gmail.com> writes:
On Thursday, 11 October 2018 at 14:49:54 UTC, Carl Sturtivant 
wrote:
 On Thursday, 11 October 2018 at 13:26:19 UTC, Jabari Zakiyth 
 wrote:
 On Thursday, 11 October 2018 at 05:11:20 UTC, Ali Çehreli 
 wrote:
 [...]
What country are you trying to get access from, because I know people in the US have gotten the papers from those link, for free and without an account. It may also have something to do with your browser. They work for me in various modern browsers, including the Tor Browser.
Not credible --- I cannot get either without an account browsing from the US using Firefox with no unusual arrangements.
Wow. Those links used to work with no problem, so I don't know what the issue is (VPN use, etc.?). OK, this will definitely work: send me an email and I will email you the paper (and figure out a surefire way to make it available). Construct the full email address (for humans): (my first initial)(lastname) at gmail
Oct 11 2018
parent reply Jabari Zakiya <jzakiya gmail.com> writes:
On Thursday, 11 October 2018 at 16:13:17 UTC, Jabari Zakiya wrote:
 On Thursday, 11 October 2018 at 14:49:54 UTC, Carl Sturtivant 
 wrote:
 On Thursday, 11 October 2018 at 13:26:19 UTC, Jabari Zakiyth 
 wrote:
 On Thursday, 11 October 2018 at 05:11:20 UTC, Ali Çehreli 
 wrote:
 [...]
What country are you trying to get access from, because I know people in the US have gotten the papers from those link, for free and without an account. It may also have something to do with your browser. They work for me in various modern browsers, including the Tor Browser.
Not credible --- I cannot get either without an account browsing from the US using Firefox with no unusual arrangements.
Wow. Those links used to work with no problem, so I don't know what is it (vpn use, etc,?). OK, this will definitely work. Send me an email and I will email you the paper (and figure out a sure fire way to make it available). Construct this full email (for humans). (my first initial)(lastname) at gmail
Ok, hopefully this will work for everyone. Try this link:

https://mega.nz/#!yJxUEQgK!MY9dwjiWheE8tACtEeS0szduIvdBjiyTn4O6mMD_aZw

However, I'm still interested to find out what the specific issues are with the other links, so I can file a possible bug report with those services.

1) Can you actually get to the site?
2) Can you see the document (a reference to it)?
3) Can you read the document online (did you try)?
4) What happened when you tried to download it (popup requiring an account, etc.)?

The paper is from 2014, and Scribd has changed a lot since then. The more information people can provide, the better I can determine how to provide access to the papers. But you can always email me to get them as a last resort, or just to ask questions.
Oct 11 2018
parent Ali Çehreli <acehreli yahoo.com> writes:
On 10/11/2018 10:14 AM, Jabari Zakiya wrote:

 Ok, hopefully this will work for everyone. Try this link:

 https://mega.nz/#!yJxUEQgK!MY9dwjiWheE8tACtEeS0szduIvdBjiyTn4O6mMD_aZw
Thank you. That worked just fine. I clicked the Download link and the pdf was saved on my end. :) Ali
Oct 11 2018
prev sibling next sibling parent reply tide <tide tide.tide> writes:
On Wednesday, 10 October 2018 at 16:15:56 UTC, Jabari Zakiya 
wrote:
 I would like to include in my paper a good comparison of 
 various implementations in different compiled languages (C/C++, 
 D, Nim, etc) to show how it performs with each.
If you want help with your paper, possibly some kind of decent financial incentive would be appropriate. If the algorithm benefits from more threads, then finding or creating an implementation that runs on a GPU would probably be the true performance test. Mainstream CPUs have something like 4-8 cores; a GPU has hundreds, though with some limitations.
Oct 10 2018
parent Jabari Zakiyth <jzakiya gmail.com> writes:
On Thursday, 11 October 2018 at 00:22:10 UTC, tide wrote:
 On Wednesday, 10 October 2018 at 16:15:56 UTC, Jabari Zakiya 
 wrote:
 I would like to include in my paper a good comparison of 
 various implementations in different compiled languages 
 (C/C++, D, Nim, etc) to show how it performs with each.
If you want help with your paper, possibly some kind of decent financial incentive would be appropriate. If the algorithm benefits from more threads than finding or creating an implementation that runs on a GPU would probably be the true performance test. CPUs have like 4-8 cores in the mainstream? A GPU has hundreds, though with some limitations.
I'm writing the paper anyway (just like the others), so other implementations are icing on the cake, to show implementation variations as a benefit to readers. Maybe I'll set up a website and create a Rosetta Code-style repo for people to post their different language implementations, and offer a T-shirt for the fastest implementation. :-)

Yes, a GPU-based implementation would be the epitome for this algorithm, by far. This is actually why I have brought the algorithm to this implementation, so that the number crunching can all be done in parallel threads. (It would also be screamingly fast done in hardware in an FPGA.) However, I only have standard consumer-grade laptops. Hopefully someone(s) with sufficient hardware, interest, and time will take this upon themselves and publicize their results.
Oct 10 2018
prev sibling parent reply welkam <wwwelkam gmail.com> writes:
On Wednesday, 10 October 2018 at 16:15:56 UTC, Jabari Zakiya 
wrote:
 What I am requesting here is for a person(s) who is an "expert" 
 (very good) to create a very fast D version, using whatever 
 tricks it has to maximize performance.

 I would like to include in my paper a good comparison of 
 various implementations in different compiled languages (C/C++, 
 D, Nim, etc) to show how it performs with each.
I looked into your Nim code and, from a programmer's point of view, there is nothing unusual going on: simple data structures and simple operations. If you wrote equivalent code in C, C++, D, Nim, Rust, or Zig and compiled it with the same optimizing compiler (LLVM or GCC), you should get the same machine code and almost the same performance (less than 1% difference due to the runtime). If you got different machine code for an equivalent implementation, then you should file a bug report. The only way you will get different performance is by changing implementation details, but then you would be comparing apples to oranges.
Oct 12 2018
parent reply Jabari Zakiya <jzakiya gmail.com> writes:
On Friday, 12 October 2018 at 15:11:17 UTC, welkam wrote:
 On Wednesday, 10 October 2018 at 16:15:56 UTC, Jabari Zakiya 
 wrote:
 What I am requesting here is for a person(s) who is an 
 "expert" (very good) to create a very fast D version, using 
 whatever tricks it has to maximize performance.

 I would like to include in my paper a good comparison of 
 various implementations in different compiled languages 
 (C/C++, D, Nim, etc) to show how it performs with each.
I looked into your NIM code and from programmers point of view there is nothing interesting going on. Simple data structures and simple operations. If you wrote equivalent code in C, C++, D, NIM, Rust, Zig and compiled with same optimizing compiler (llvm or gcc) you should get the same machine code and almost the same performance (less than 1% difference due to runtime). If you got different machine code for equivalent implementation then you should file a bug report. The only way you will get different performance is by changing implementation details but then you would compare apples to oranges.
Hmm, I don't think what you're saying about similar output|performance in other languages is empirically correct, but that's really not the point of the challenge.

The real point of the challenge is to see what idiomatic code, written for performance, using the best resources the language provides, will produce compared to the Nim version. It's not to see what a line-by-line translation from Nim to D would look like. That may be a start to get something working, but it shouldn't be the end goal.

I'm using the Nim version here as the "reference implementation" so it can be used as the standard for comparison (accuracy of results and performance). The goal for D (et al) users is to use whatever resources the language provides to maybe do better.

Example: Nim currently doesn't provide standard bit arrays. Using bit arrays in place of byte arrays should perform faster because more data can fit in cache and operate faster. Also, to parallelize the algorithm, maybe using OpenMP, CUDA, etc. is the way to do it for D; I don't know what constructs D uses for parallel multiprocessing. And as noted before, this algorithm screams out to be done with GPUs.

But you are correct that the Nim code uses very simple coding operations. That is one of its beauties! :) It is simple to understand and implement mathematically, short and simple to code, and architecturally adaptable to hardware.

So to really do the challenge, the Nim code needs to be compiled and run (per the instructions in the code) to use as the "reference implementation", to see what correct outputs look like, and their times; then other implementations can be compared to it. I would hope that after getting an output-correct implementation done (to show you really know what you're doing), alternative implementations can be done to wring out better performance.
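
To make the bit-array point concrete, here is a sketch of the two storage layouts (the helper names are mine, not from the Nim code): a bit array packs 8x more prime-candidate flags per byte than a byte array, so more of the sieve fits in cache, at the cost of shift/mask work on every access.

```python
def make_byte_sieve(n):
    # One byte per prime candidate: simplest, fastest individual access.
    return bytearray(n)

def make_bit_sieve(n):
    # One bit per prime candidate: 8x denser, so more resgroups fit
    # per cache line, but each access needs shift/mask work.
    return bytearray((n + 7) // 8)

def bit_set(sieve, i):
    sieve[i >> 3] |= 1 << (i & 7)

def bit_get(sieve, i):
    return (sieve[i >> 3] >> (i & 7)) & 1

print(len(make_byte_sieve(1_000_000)))  # 1000000 bytes
print(len(make_bit_sieve(1_000_000)))   # 125000 bytes
```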
I think this is a good challenge for anyone wanting to learn D too, because it involves something substantially more than a "toy" algorithm, but is short enough to do with minimal time and effort, while requiring you to know (learn about) D in enough detail to determine the "best" (alternative) way to do it.

Finally, a really fast D implementation could be a marketing bonanza, showing people in the numerical analysis and data|signal processing fields, et al, that D can be used to solve their problems and be more performant than C++, etc.

Again, people should feel free to email me if they want more direct answers to questions, or help.
Oct 12 2018
next sibling parent welkam <wwwelkam gmail.com> writes:
On Friday, 12 October 2018 at 16:19:59 UTC, Jabari Zakiya wrote:
 Hmm,I don't think what you're saying about similar 
 output|performance with other languages is empirically correct, 
 but it's really not the point of the challenge.
That's why godbolt exists.

C++ and Rust: https://godbolt.org/z/FffXY2
C and D: https://godbolt.org/z/YQvkXU

Look at the assembly output: it's all the same. Compilers are very good at recognizing simple arithmetic computations and can reason about them. What they struggle with is memory access.

On Friday, 12 October 2018 at 16:19:59 UTC, Jabari Zakiya wrote:
 Finally, a really fast D implementation can be a marketing 
 bananza to show people in the numerical analysis, data|signal 
 processing fields, et al, that D can be used by them to solve 
 their problems and be more performant than C++, etc
eBay's tsv-utils is the fastest in the world: https://github.com/eBay/tsv-utils/blob/master/docs/comparative-benchmarks-2018.md#top-four-in-each-benchmark

D's JSON parsing is the fastest in the world: https://github.com/kostya/benchmarks#json

Sambamba, a BAM data processor, is the fastest in the world: https://www.basepairtech.com/blog/sorting-bam-files-samtools-vs-sambamba/

Mir GLAS is faster than OpenBLAS and Eigen: http://blog.mir.dlang.io/glas/benchmark/openblas/2016/09/23/glas-gemm-benchmark.html

There are enough examples, but they are not enough to change minds.
Oct 12 2018
prev sibling parent reply welkam <wwwelkam gmail.com> writes:
On Friday, 12 October 2018 at 16:19:59 UTC, Jabari Zakiya wrote:
 The real point of the challenge is too see what idiomatic 
 code...
There is no idiomatic D code. There are only better implementations. D doesn't tell you how to write your code. It gives you many tools, and you choose which tools to use. That's what people like about D.
Oct 12 2018
parent reply Jabari Zakiya <jzakiya gmail.com> writes:
On Friday, 12 October 2018 at 20:05:29 UTC, welkam wrote:
 On Friday, 12 October 2018 at 16:19:59 UTC, Jabari Zakiya wrote:
 The real point of the challenge is too see what idiomatic 
 code...
There is no idiomatic D code. There is only better implementations. D doesnt tell you how to write your code. It gives you many tools and you choose which tools to use. That`s what people like about D.
I await your implementation(s)! :-)
Oct 12 2018
parent reply Vijay Nayar <madric gmail.com> writes:
On Friday, 12 October 2018 at 21:08:03 UTC, Jabari Zakiya wrote:
 On Friday, 12 October 2018 at 20:05:29 UTC, welkam wrote:
 On Friday, 12 October 2018 at 16:19:59 UTC, Jabari Zakiya 
 wrote:
 The real point of the challenge is too see what idiomatic 
 code...
There is no idiomatic D code. There is only better implementations. D doesnt tell you how to write your code. It gives you many tools and you choose which tools to use. That`s what people like about D.
I await your implementation(s)! :-)
I downloaded the reference NIM implementation and got the latest nim compiler, but I received the following error: $ nim c --cc:gcc --d:release --threads:on twinprimes_ssoz.nim twinprimes_ssoz.nim(74, 11) Error: attempting to call undeclared routine: 'sort' For a person not familiar with nim, what's the fastest way to fix that?
Oct 13 2018
parent reply welkam <wwwelkam gmail.com> writes:
On Saturday, 13 October 2018 at 09:22:16 UTC, Vijay Nayar wrote:
 I downloaded the reference NIM implementation and got the 
 latest nim compiler, but I received the following error:
   $ nim c --cc:gcc --d:release --threads:on twinprimes_ssoz.nim
   twinprimes_ssoz.nim(74, 11) Error: attempting to call 
 undeclared routine: 'sort'

 For a person not familiar with nim, what's the fastest way to 
 fix that?
import algorithm

That's all. But then it spits out:

lib/nim/pure/algorithm.nim(144, 11) Error: interpretation requires too many iterations
Oct 13 2018
next sibling parent Vijay Nayar <madric gmail.com> writes:
On Saturday, 13 October 2018 at 14:32:33 UTC, welkam wrote:
 On Saturday, 13 October 2018 at 09:22:16 UTC, Vijay Nayar wrote:
 I downloaded the reference NIM implementation and got the 
 latest nim compiler, but I received the following error:
   $ nim c --cc:gcc --d:release --threads:on twinprimes_ssoz.nim
   twinprimes_ssoz.nim(74, 11) Error: attempting to call 
 undeclared routine: 'sort'

 For a person not familiar with nim, what's the fastest way to 
 fix that?
import algorithm thats all but then it spits out lib/nim/pure/algorithm.nim(144, 11) Error: interpretation requires too many iterations
I ran into the same problem as you did, and then followed the instructions from the error. I modified the compiler source and increased the maximum number of iterations from 3_000_000 to 1_000_000_000, rebuilt and installed it, but still ran into the exact same problem. There may be something up with the algorithm itself.
Oct 13 2018
prev sibling parent reply Jabari Zakiya <jzakiya gmail.com> writes:
On Saturday, 13 October 2018 at 14:32:33 UTC, welkam wrote:
 On Saturday, 13 October 2018 at 09:22:16 UTC, Vijay Nayar wrote:
 I downloaded the reference NIM implementation and got the 
 latest nim compiler, but I received the following error:
   $ nim c --cc:gcc --d:release --threads:on twinprimes_ssoz.nim
   twinprimes_ssoz.nim(74, 11) Error: attempting to call 
 undeclared routine: 'sort'

 For a person not familiar with nim, what's the fastest way to 
 fix that?
import algorithm thats all but then it spits out lib/nim/pure/algorithm.nim(144, 11) Error: interpretation requires too many iterations
My mistake. I updated the file and forgot to include the 'import algorithm' directive. The file is now fixed to include it. Download the corrected version or patch your file accordingly.

As stated in the file intro, **YOU MUST DO THIS** to get it to compile with current Nim (they were supposed to fix this in version 0.19.0 but didn't). To compile for Nim versions <= 0.19.0 do the following:

1) in file: ~/nim-0.19.0/compiler/vmdef.nim
2) set the variable: MaxLoopIterations* = 1_000_000_000 (1 billion or more)
3) then rebuild the system: ./koch boot -d:release

If you are using 'choosenim' to install Nim (highly advisable) the full path is:

~/.choosenim/toolchains/nim-0.19.0/compiler/vmdef.nim

I'll post performance results from my laptop to give reference times to compare against.
Oct 13 2018
parent reply Vijay Nayar <madric gmail.com> writes:
On Saturday, 13 October 2018 at 15:19:07 UTC, Jabari Zakiya wrote:
 On Saturday, 13 October 2018 at 14:32:33 UTC, welkam wrote:
 On Saturday, 13 October 2018 at 09:22:16 UTC, Vijay Nayar 
 wrote:
 [...]
import algorithm thats all but then it spits out lib/nim/pure/algorithm.nim(144, 11) Error: interpretation requires too many iterations
My mistake. I updated the file and forgot to include the 'import algorithm' directive. The file is now fixed to include it. Download the corrected version or patch your file accordingly. As stated in the file intro **YOU MUST DO THIS** to get it to compile with current Nim (they were supposed to fix this in this version 0.19.0 but didn't). To compile for nim versions <= 0.19.0 do following: 1) in file: ~/nim-0.19.0/compiler/vmdef.nim 2) set variable: MaxLoopIterations* = 1_000_000_000 (1 Billion or >) 3) then rebuild sysem: ./koch boot -d:release If you are using 'choosenim' to install Nim (highly advisable) the full path is: ~/.choosenim/toolchains/nim-0.19.0/compiler/vmdef.nim I'll post performance results from my laptop to give reference times to compare against.
Ok, now it builds. I was previously following the build instructions from the Nim website and am not super clear on what the "koch" tool does, but following your instructions, the program does build and run. I'll take a stab at making a D version.
Oct 13 2018
parent reply Vijay Nayar <madric gmail.com> writes:
On Saturday, 13 October 2018 at 15:50:06 UTC, Vijay Nayar wrote:
 On Saturday, 13 October 2018 at 15:19:07 UTC, Jabari Zakiya 
 wrote:
 On Saturday, 13 October 2018 at 14:32:33 UTC, welkam wrote:
 On Saturday, 13 October 2018 at 09:22:16 UTC, Vijay Nayar 
 wrote:
 [...]
import algorithm thats all but then it spits out lib/nim/pure/algorithm.nim(144, 11) Error: interpretation requires too many iterations
My mistake. I updated the file and forgot to include the 'import algorithm' directive. The file is now fixed to include it. Download the corrected version or patch your file accordingly. As stated in the file intro **YOU MUST DO THIS** to get it to compile with current Nim (they were supposed to fix this in this version 0.19.0 but didn't). To compile for nim versions <= 0.19.0 do following: 1) in file: ~/nim-0.19.0/compiler/vmdef.nim 2) set variable: MaxLoopIterations* = 1_000_000_000 (1 Billion or >) 3) then rebuild sysem: ./koch boot -d:release If you are using 'choosenim' to install Nim (highly advisable) the full path is: ~/.choosenim/toolchains/nim-0.19.0/compiler/vmdef.nim I'll post performance results from my laptop to give reference times to compare against.
Ok, now it builds. I was previously following the build instructions from the Nim website and am not super clear what the "koch" tool does, but following your instructions, the program does build and run. I'll take a stab at making a D version.
Interesting results so far. I have a partially converted program here:

https://gist.github.com/vnayar/79e2d0a9850833b8859dd9f08997b4d7

The interesting part is that during compilation (with the command "dmd twinprimes_ssoz.d"), the compilation aborts with the message "Killed" and no further information. That's a new one for me, so I'm looking into the cause.
Oct 13 2018
parent reply Jabari Zakiya <jzakiya gmail.com> writes:
On Saturday, 13 October 2018 at 17:36:33 UTC, Vijay Nayar wrote:
 On Saturday, 13 October 2018 at 15:50:06 UTC, Vijay Nayar wrote:
 On Saturday, 13 October 2018 at 15:19:07 UTC, Jabari Zakiya 
 wrote:
 On Saturday, 13 October 2018 at 14:32:33 UTC, welkam wrote:
 On Saturday, 13 October 2018 at 09:22:16 UTC, Vijay Nayar 
 wrote:
 [...]
import algorithm thats all but then it spits out lib/nim/pure/algorithm.nim(144, 11) Error: interpretation requires too many iterations
My mistake. I updated the file and forgot to include the 'import algorithm' directive. The file is now fixed to include it. Download the corrected version or patch your file accordingly. As stated in the file intro **YOU MUST DO THIS** to get it to compile with current Nim (they were supposed to fix this in this version 0.19.0 but didn't). To compile for nim versions <= 0.19.0 do following: 1) in file: ~/nim-0.19.0/compiler/vmdef.nim 2) set variable: MaxLoopIterations* = 1_000_000_000 (1 Billion or >) 3) then rebuild sysem: ./koch boot -d:release If you are using 'choosenim' to install Nim (highly advisable) the full path is: ~/.choosenim/toolchains/nim-0.19.0/compiler/vmdef.nim I'll post performance results from my laptop to give reference times to compare against.
Ok, now it builds. I was previously following the build instructions from the Nim website and am not super clear what the "koch" tool does, but following your instructions, the program does build and run. I'll take a stab at making a D version.
Interesting results so far. I have a partially converted program here: https://gist.github.com/vnayar/79e2d0a9850833b8859dd9f08997b4d7 The interesting part is that during compilation (with the command "dmd twinprimes_ssoz.d"), the compilation will abort with the message "Killed" and no further information. That's a new one for me, so I'm looking into the cause.
It may also be running into a hard time limit imposed on compilation, like the one Nim had/has that prevented my code from initially compiling. I'm generating a lot of PG parameter constants at compile time, and it's doing a lot of number crunching and building larger and larger arrays of constants as the PGs get larger. Try compiling with successive PGs (just P5, then P5 and P7, etc) to see where it fails. That will let you know the code is working correctly, and that the compiler is choking because of a hard time limit and/or a memory limit. That's why I put a compiler output statement in 'genPGparameters': to see the progression of the PG parameters being built by the compiler and find where it starts choking. You may also need to patch whatever facility in the D compiler chain controls this too.
Oct 13 2018
parent reply Vijay Nayar <madric gmail.com> writes:
On Saturday, 13 October 2018 at 18:05:45 UTC, Jabari Zakiya wrote:
 It may be also running into a hard time limit imposed on 
 compilation that Nim had/has that prevented my code from 
 initially compiling. I'm generating a lot of PG parameter 
 constants at compile time, and it's doing a lot of number 
 crunching and building larger and larger arrays of constants as 
 the PG's get larger.

 Try compiling with successive PG's (just P5, then P5 and P7, 
 etc) to see where it fails. That will let you know the code is 
 working correctly, and that the compiler is choking either/and 
 because of a hard time limit and/or memory limit. That's why I 
 put in a compiler output statement in 'genPGparameters' to see 
 the progression of the PG parameters being built by the 
 compiler to initially find when the compiler started choking. 
 You may also need to patch whatever facility in the D compiler 
 chain that controls this too.
It's P17, the biggest one that takes the longest to build in the Nim version. I actually don't know what memory limits exist for the D compiler at compile-time, so I may need to do some homework.
Oct 13 2018
next sibling parent Jabari Zakiya <jzakiya gmail.com> writes:
On Saturday, 13 October 2018 at 18:14:20 UTC, Vijay Nayar wrote:
 On Saturday, 13 October 2018 at 18:05:45 UTC, Jabari Zakiya 
 wrote:
 It may be also running into a hard time limit imposed on 
 compilation that Nim had/has that prevented my code from 
 initially compiling. I'm generating a lot of PG parameter 
 constants at compile time, and it's doing a lot of number 
 crunching and building larger and larger arrays of constants 
 as the PG's get larger.

 Try compiling with successive PG's (just P5, then P5 and P7, 
 etc) to see where it fails. That will let you know the code is 
 working correctly, and that the compiler is choking either/and 
 because of a hard time limit and/or memory limit. That's why I 
 put in a compiler output statement in 'genPGparameters' to see 
 the progression of the PG parameters being built by the 
 compiler to initially find when the compiler started choking. 
 You may also need to patch whatever facility in the D compiler 
 chain that controls this too.
It's P17, the biggest one that takes the longest to build in the Nim version. I actually don't know what memory limits exist for the D compiler at compile-time, so I may need to do some homework.
Yes, that's what I figured, because that's when the Nim compiler initially balked. The variable that was patched in the Nim config file sets a hard limit on the number of operations the compiler can do. It's used as a brake on a runaway process (infinite loop, etc). It's probably something like that in the D compiler too, just guessing offhand. I hope it isn't a memory limit. However, the program will run fine without using P17. P17 is currently selected as the optimum PG for inputs greater than 15 trillion (15,000,000,000,000), so the program will run fine and produce correct output using P13 instead, just not as fast.
Oct 13 2018
prev sibling parent reply Vijay Nayar <madric gmail.com> writes:
On Saturday, 13 October 2018 at 18:14:20 UTC, Vijay Nayar wrote:
 On Saturday, 13 October 2018 at 18:05:45 UTC, Jabari Zakiya 
 wrote:
 It may be also running into a hard time limit imposed on 
 compilation that Nim had/has that prevented my code from 
 initially compiling. I'm generating a lot of PG parameter 
 constants at compile time, and it's doing a lot of number 
 crunching and building larger and larger arrays of constants 
 as the PG's get larger.

 Try compiling with successive PG's (just P5, then P5 and P7, 
 etc) to see where it fails. That will let you know the code is 
 working correctly, and that the compiler is choking either/and 
 because of a hard time limit and/or memory limit. That's why I 
 put in a compiler output statement in 'genPGparameters' to see 
 the progression of the PG parameters being built by the 
 compiler to initially find when the compiler started choking. 
 You may also need to patch whatever facility in the D compiler 
 chain that controls this too.
It's P17, the biggest one that takes the longest to build in the Nim version. I actually don't know what memory limits exist for the D compiler at compile-time, so I may need to do some homework.
It's not just DMD either.

$ ldc2 twinprimes_ssoz.d
...
generating parameters for P17
Killed

$ gdc twinprimes_ssoz.d
...
generating parameters for P17
gdc: internal compiler error: Killed (program cc1d)
Please submit a full bug report,
with preprocessed source if appropriate.
See <file:///usr/share/doc/gcc-5/README.Bugs> for instructions.

$ dmd twinprimes_ssoz.d
...
generating parameters for P17
Killed
Oct 13 2018
parent reply Jabari Zakiya <jzakiya gmail.com> writes:
On Saturday, 13 October 2018 at 18:31:57 UTC, Vijay Nayar wrote:
 On Saturday, 13 October 2018 at 18:14:20 UTC, Vijay Nayar wrote:
 On Saturday, 13 October 2018 at 18:05:45 UTC, Jabari Zakiya 
 wrote:
 It may be also running into a hard time limit imposed on 
 compilation that Nim had/has that prevented my code from 
 initially compiling. I'm generating a lot of PG parameter 
 constants at compile time, and it's doing a lot of number 
 crunching and building larger and larger arrays of constants 
 as the PG's get larger.

 Try compiling with successive PG's (just P5, then P5 and P7, 
 etc) to see where it fails. That will let you know the code 
 is working correctly, and that the compiler is choking 
 either/and because of a hard time limit and/or memory limit. 
 That's why I put in a compiler output statement in 
 'genPGparameters' to see the progression of the PG parameters 
 being built by the compiler to initially find when the 
 compiler started choking. You may also need to patch whatever 
 facility in the D compiler chain that controls this too.
It's P17, the biggest one that takes the longest to build in the Nim version. I actually don't know what memory limits exist for the D compiler at compile-time, so I may need to do some homework.
It's not just DMD either.

$ ldc2 twinprimes_ssoz.d
...
generating parameters for P17
Killed

$ gdc twinprimes_ssoz.d
...
generating parameters for P17
gdc: internal compiler error: Killed (program cc1d)
Please submit a full bug report,
with preprocessed source if appropriate.
See <file:///usr/share/doc/gcc-5/README.Bugs> for instructions.

$ dmd twinprimes_ssoz.d
...
generating parameters for P17
Killed
In the Nim code, starting at line 91 is where the PG constants are generated at compile time:

---------------------------------------------------------
const parametersp5  = genPGparameters(5)
const parametersp7  = genPGparameters(7)
const parametersp11 = genPGparameters(11)
const parametersp13 = genPGparameters(13)
const parametersp17 = genPGparameters(17)
---------------------------------------------------------

Can it compile using just P5 (the first line, others commented out), and then P7, etc? I'm not understanding your comments now.

If you can get a working version up and running (with correct output) we can solve the P17 compiler issues later (or in a parallel forum thread), especially if you have to delve into the weeds of the compiler chain. In my mind (same as with the Nim process) getting working code using any PG is the first-order priority (because you'll need to get multi-threading working too). Once you can do that, by default, you can then use any generator you want if you create the correct parameters for it. That's one of the advantages of the algorithm: it's PG agnostic (as long as your hardware will accommodate it).

So don't prioritize getting P17 to compile right now (in this thread). Create the working generic structure that can work with any PG first.
Oct 13 2018
parent reply Vijay Nayar <madric gmail.com> writes:
On Saturday, 13 October 2018 at 19:04:48 UTC, Jabari Zakiya wrote:
 On Saturday, 13 October 2018 at 18:31:57 UTC, Vijay Nayar wrote:
 On Saturday, 13 October 2018 at 18:14:20 UTC, Vijay Nayar 
 wrote:
 On Saturday, 13 October 2018 at 18:05:45 UTC, Jabari Zakiya 
 wrote:
 It may be also running into a hard time limit imposed on 
 compilation that Nim had/has that prevented my code from 
 initially compiling. I'm generating a lot of PG parameter 
 constants at compile time, and it's doing a lot of number 
 crunching and building larger and larger arrays of constants 
 as the PG's get larger.

 Try compiling with successive PG's (just P5, then P5 and P7, 
 etc) to see where it fails. That will let you know the code 
 is working correctly, and that the compiler is choking 
 either/and because of a hard time limit and/or memory limit. 
 That's why I put in a compiler output statement in 
 'genPGparameters' to see the progression of the PG 
 parameters being built by the compiler to initially find 
 when the compiler started choking. You may also need to 
 patch whatever facility in the D compiler chain that 
 controls this too.
It's P17, the biggest one that takes the longest to build in the Nim version. I actually don't know what memory limits exist for the D compiler at compile-time, so I may need to do some homework.
It's not just DMD either.

$ ldc2 twinprimes_ssoz.d
...
generating parameters for P17
Killed

$ gdc twinprimes_ssoz.d
...
generating parameters for P17
gdc: internal compiler error: Killed (program cc1d)
Please submit a full bug report,
with preprocessed source if appropriate.
See <file:///usr/share/doc/gcc-5/README.Bugs> for instructions.

$ dmd twinprimes_ssoz.d
...
generating parameters for P17
Killed
In the Nim code, starting line 91 is when the PG constants are being generated at compile time.

---------------------------------------------------------
const parametersp5  = genPGparameters(5)
const parametersp7  = genPGparameters(7)
const parametersp11 = genPGparameters(11)
const parametersp13 = genPGparameters(13)
const parametersp17 = genPGparameters(17)
---------------------------------------------------------

Can it compile just using P5 (the first line, others commented out), and then P7, etc? I'm not understanding your comments now. If you can get a working version up and running (with correct output) we can solve the P17 compiler issues later (or in a parallel forum thread), especially if you have to delve into the weeds of the compiler chain. In my mind (same with Nim process) getting working code using any PG is first order priority (because you'll need getting multi-threading working too). Once you can do that, by default, you can then use any generator you want if you create the correct parameters for it. That's one of the advantages of the algorithm, it's PG agnostic (as long as your hardware will accommodate it). So don't prioritize getting P17 to compile right now (in this thread). Create the working generic structure that can work with any PG first.
Updated: https://gist.github.com/vnayar/79e2d0a9850833b8859dd9f08997b4d7

I still get a few runtime errors, likely from mistakes in my conversion for certain primes. I'll resolve those after I get back from the gym.

But as previous posters have said, the code is not really very different between Nim and D. Most of it is array manipulation and arithmetic operations, and not many of the features of either D or Nim are very different. Both turn into fast code, both have garbage collection, and both have generally similar operators and libraries for this kind of problem.

The biggest differences I observed revolved not around the languages themselves, but around code style. For example, can you put a loop and 3 additional statements on a single line in D? Yes. But it is considered to be not very readable code from a style perspective.

Once I get the bugs out, I'm curious to see if any performance differences crop up. There's the theory that says they should be the same, and then there's the practice.
Oct 14 2018
next sibling parent Bastiaan Veelo <Bastiaan Veelo.net> writes:
On Sunday, 14 October 2018 at 10:51:11 UTC, Vijay Nayar wrote:
 But as previous posters have said, the code is not really very 
 different between Nim and D.  Most of it is array manipulation 
 and arithmetic operations, and not many of the features of 
 either D or Nim are very different.  Both turn into fast code, 
 both have garbage collection, and both have generally similar 
 operators and libraries for this kind of problem.

 The biggest differences I observed revolved not around the 
 languages themselves, but around code style.
Are you aware of DCompute? [1] I haven’t used it myself, but since Jabari explicitly mentioned taking advantage of the GPU, it may be worth giving it a try. It should have an impact on performance and possibly on style as well, even though it will still be D. It would be nice if Nicholas could chime in on this thread.

Bastiaan.

[1] https://dlang.org/blog/2017/10/30/d-compute-running-d-on-the-gpu/
Oct 14 2018
prev sibling parent reply Vijay Nayar <madric gmail.com> writes:
On Sunday, 14 October 2018 at 10:51:11 UTC, Vijay Nayar wrote:
 Once I get the bugs out, I'm curious to see if any performance 
 differences crop up.  There's the theory that says they should 
 be the same, and then there's the practice.
I don't actually understand the underlying algorithm, but I at least understand the flow of the program and the structure. The algorithm utilized depends heavily on shared memory access, which can be done in D, but I definitely wouldn't call it idiomatic. In D, message passing is preferred, but it really can't be well demonstrated on your algorithm without a deeper understanding of the algorithm itself.

A complete working version can be found at: https://gist.github.com/vnayar/79e2d0a9850833b8859dd9f08997b4d7

I modified both versions of the program to utilize the pgParameters13 for more of an apples-to-apples comparison. The final results are as follows:

$ nim c --cc:gcc --d:release --threads:on twinprimes_ssoz.nim && echo "3000000000" | ./twinprimes_ssoz
Enter integer number: threads = 8
each thread segment is [1 x 65536] bytes array
twinprime candidates = 175324676; resgroups = 1298702
each 135 threads has nextp[2 x 5566] array
setup time = 0.000 secs
perform twinprimes ssoz sieve
sieve time = 0.222 secs
last segment = 53518 resgroups; segment slices = 20
total twins = 9210144; last twin = 2999999712+/-1
total time = 0.223 secs

$ dub build --compiler=ldc2 -b=release && echo "3000000000" | ./twinprimes
Enter integer number:
threads = 8
each thread segment is [1 x 65536] bytes array
twinprime candidates = 175324676; resgroups = 1298702
each 135 threads has nextp[2 x 5566] array
setup time = 1 ms, 864 μs, and 7 hnsecs
perform twinprimes ssoz sieve
sieve time = 196 ms, 566 μs, and 5 hnsecs
last segment = 53518 resgroups; segment slices = 20
total twins = 9210144; last twin = 2999999712+/- 1
total time = 198 ms, 431 μs, and 2 hnsecs

My understanding is that the difference in performance is largely due to slightly better optimization from the LLVM-based ldc2 compiler, where I believe Nim is using gcc.
Oct 15 2018
next sibling parent Jabari Zakiya <jzakiya gmail.com> writes:
 I don't actually understand the underlying algorithm, but I at 
 least understand the flow of the program and the structure.  
 The algorithm utilized depends heavily on using shared memory 
 access, which can be done in D, but I definitely wouldn't call 
 it idiomatic.  In D, message passing is preferred, but it 
 really can't be well demonstrated on your algorithm without a 
 deeper understanding of the algorithm itself.

 A complete working version can be found at: 
 https://gist.github.com/vnayar/79e2d0a9850833b8859dd9f08997b4d7

 I modified both versions of the program to utilize the 
 pgParameters13 for more of an apples-to-apples comparison.

 The final results are as follows:
 $ nim c --cc:gcc --d:release --threads:on twinprimes_ssoz.nim 
 && echo "3000000000" | ./twinprimes_ssoz
 Enter integer number: threads = 8
 each thread segment is [1 x 65536] bytes array
 twinprime candidates = 175324676; resgroups = 1298702
 each 135 threads has nextp[2 x 5566] array
 setup time = 0.000 secs
 perform twinprimes ssoz sieve
 sieve time = 0.222 secs
 last segment = 53518 resgroups; segment slices = 20
 total twins = 9210144; last twin = 2999999712+/-1
 total time = 0.223 secs

 $ dub build --compiler=ldc2 -b=release && echo "3000000000" | 
 ./twinprimes
 Enter integer number:
 threads = 8
 each thread segment is [1 x 65536] bytes array
 twinprime candidates = 175324676; resgroups = 1298702
 each 135 threads has nextp[2 x 5566] array
 setup time = 1 ms, 864 μs, and 7 hnsecs
 perform twinprimes ssoz sieve
 sieve time = 196 ms, 566 μs, and 5 hnsecs
 last segment = 53518 resgroups; segment slices = 20
 total twins = 9210144; last twin = 2999999712+/- 1
 total time = 198 ms, 431 μs, and 2 hnsecs

 My understanding is that the difference in performance is 
 largely due to slightly better optimization from the LLVM based 
 ldc2 compiler, where I believe Nim is using gcc.
Hey Great! I'll take some time and study your code.

Nim currently allows using gcc or clang: --cc:gcc or --cc:clang, and compiles to C: nim c --cc:gcc... or C++: nim c++ --cc:gcc...

You can compile in Nim with/without garbage collection (GC), and choose the GC. This version needs GC to recover the memory created in each thread, or it will eat it up. Watch the program with htop/top to see the threads|mem at work.

If D has a fast bitarray structure, try it to create the data arrays "prms" in proc "sozpg" and the "seg" arrays in "twinsSieve". This could allow for larger segment sizes|arrays that fit into cpu cache, reducing the number of segment slices to process. This would all have to be profiled and encapsulated in "selectPG", which adaptively selects the best PG to use.

Here is a table of output data and performance times (in seconds). The results were produced on a System76 laptop with an Intel I7 6700HQ cpu, 2.6-3.5 GHz clock, with 8 threads, and 16GB of memory. I compiled the file "twinprimes_ssoz.nim" with Nim versions 0.18.0 and the just-released 0.19.0, using gcc 8.2.1 with Manjaro (Linux) in VirtualBox. I then ran the executables on my host Linux system (which only has gcc-4.9.8) to use all 8 threads. The output was produced on a "quiet" system, where I turned off the laptop and rebooted, and ran tests with only a terminal and spreadsheet open (to record results). The times are the "best" times from multiple runs.

I also compare times from the program "primesieve" - https://primesieve.org/ - which states on its site:

------------------------------------------------------------------------------------
primesieve is a free software program and C/C++ library that generates primes using a highly optimized sieve of Eratosthenes implementation. It counts the primes below 10^10 in just 0.45 seconds on an Intel Core i7-6700 CPU (4 x 3.4 GHz). primesieve can generate primes and prime k‑tuplets up to 2^64.
------------------------------------------------------------------------------------

(An optimized C/C++ version of twinprimes_ssoz would be nice to compare against too.)

Number    | Twinprimes  | Last Twinprime | twinprimes_ssoz.nim, gcc-8.2.1 | primesieve |
          |             |                |    0.18.0    |    0.19.0       |   9.7.1    |
----------|-------------|----------------|--------------|-----------------|------------|
1 x 10^09 | 3424506     | 1999999872     | 0.043        | 0.041           | 0.049      |
5 x 10^09 | 14618166    | 4999999860     | 0.219        | 0.226           | 0.248      |
1 x 10^10 | 27412679    | 9999999702     | 0.398        | 0.401           | 0.533      |
5 x 10^10 | 118903682   | 49999999590    | 2.364        | 2.331           | 2.978      |
1 x 10^11 | 224376048   | 99999999762    | 4.586        | 4.476           | 6.189      |
5 x 10^11 | 986222314   | 499999999062   | 26.625       | 26.578          | 35.120     |
1 x 10^12 | 1870585220  | 999999999960   | 57.895       | 58.195          | 73.966     |
5 x 10^12 | 8312493003  | 4999999999878  | 385.562      | 389.347         | 433.320    |
1 x 10^13 | 15834664872 | 9999999998490  | 855.705      | 857.771         | 934.253    |
2 x 10^13 | 30198862775 | 19999999999830 | 1806.282     | 1818.766        | 1998.341   |

A typical output of the "twinprimes_ssoz.nim" files looks as below:

$ echo 123_345_789_000 | ./twinprimes_test7yc.0180.gcc821
Enter integer number: threads = 8
each thread segment is [1 x 98304] bytes array         // 98,304 segment size (Kn)
twinprime candidates = 6099517038; resgroups = 4107419 // max twins; max cols (KB)
each 1485 threads has nextp[2 x 30062] array           // 30,062 primes <= sqrt(N)
setup time = 0.006 secs                                // time for primes|ssoz setup
perform twinprimes ssoz sieve                          // ssoz starts now
sieve time = 5.771 secs                                // time to do just the ssoz
last segment = 76955 resgroups; segment slices = 42    // last seg cols; seg loops
total twins = 272018910; last twin = 123345788640+/-1  // twinprimes, val for last
total time = 5.778 secs                                // setup + ssoz time

You can determine the selected PG by the number of threads being used.

I'm writing up the paper, and will probably release the program operation description to provide a detailed understanding of how|why it works, which will allow for modification.
Oct 15 2018
prev sibling parent reply Jabari Zakiya <jzakiya gmail.com> writes:
 $ dub build --compiler=ldc2 -b=release && echo "3000000000" | 
 ./twinprimes
 Enter integer number:
 threads = 8
 each thread segment is [1 x 65536] bytes array
 twinprime candidates = 175324676; resgroups = 1298702
 each 135 threads has nextp[2 x 5566] array
 setup time = 1 ms, 864 μs, and 7 hnsecs
 perform twinprimes ssoz sieve
 sieve time = 196 ms, 566 μs, and 5 hnsecs
 last segment = 53518 resgroups; segment slices = 20
 total twins = 9210144; last twin = 2999999712+/- 1
 total time = 198 ms, 431 μs, and 2 hnsecs

 My understanding is that the difference in performance is 
 largely due to slightly better optimization from the LLVM based 
 ldc2 compiler, where I believe Nim is using gcc.
Here's what I get on my system.

$ echo 3_000_000_000 | ./twinprimes_test7yc.0180.gcc821
Enter integer number: threads = 8
each thread segment is [1 x 65536] bytes array
twinprime candidates = 175324676; resgroups = 1298702
each 135 threads has nextp[2 x 5566] array
setup time = 0.000 secs
perform twinprimes ssoz sieve
sieve time = 0.144 secs
last segment = 53518 resgroups; segment slices = 20
total twins = 9210144; last twin = 2999999712+/-1
total time = 0.144 secs

Could you list your hardware, D version, and compiler specs? I will run your code on my system with your D version and compiler, if I can.

Excellent work!
Oct 15 2018
parent reply Vijay Nayar <madric gmail.com> writes:
On Monday, 15 October 2018 at 22:17:57 UTC, Jabari Zakiya wrote:
 $ dub build --compiler=ldc2 -b=release && echo "3000000000" | 
 ./twinprimes
 Enter integer number:
 threads = 8
 each thread segment is [1 x 65536] bytes array
 twinprime candidates = 175324676; resgroups = 1298702
 each 135 threads has nextp[2 x 5566] array
 setup time = 1 ms, 864 μs, and 7 hnsecs
 perform twinprimes ssoz sieve
 sieve time = 196 ms, 566 μs, and 5 hnsecs
 last segment = 53518 resgroups; segment slices = 20
 total twins = 9210144; last twin = 2999999712+/- 1
 total time = 198 ms, 431 μs, and 2 hnsecs

 My understanding is that the difference in performance is 
 largely due to slightly better optimization from the LLVM 
 based ldc2 compiler, where I believe Nim is using gcc.
Here's what I get on my system.

$ echo 3_000_000_000 | ./twinprimes_test7yc.0180.gcc821
Enter integer number: threads = 8
each thread segment is [1 x 65536] bytes array
twinprime candidates = 175324676; resgroups = 1298702
each 135 threads has nextp[2 x 5566] array
setup time = 0.000 secs
perform twinprimes ssoz sieve
sieve time = 0.144 secs
last segment = 53518 resgroups; segment slices = 20
total twins = 9210144; last twin = 2999999712+/-1
total time = 0.144 secs

Could you list your hardware, D ver, compiler specs. I will run your code on my system with your D version and compiler, if I can. Excellent work!
D has multiple compilers, but for the speed of the finished binary, LDC2 is generally recommended. I used version 1.11.0. https://github.com/ldc-developers/ldc/releases/tag/v1.11.0

I was using DUB to manage the project, but to build the stand-alone file from the gist link, use this command:

$ ldc2 -release -O3 twinprimes_ssoz.d

And to run it:

$ echo "3000000000" | ./twinprimes_ssoz

Running the program a bunch of times, I get variable results, mostly between 195ms and 250ms. Running the Nim version, I also get variable results, mostly between 230ms and 280ms.

My system is an Alienware 14x R2 from 2012. Specs can be found here: https://www.cnet.com/products/alienware-m14xr2-14-core-i7-3630qm-8-gb-ram-750-gb-hdd/specs/
Oct 16 2018
next sibling parent reply welkam <wwwelkam gmail.com> writes:
So I ran the profiler, and 97% of the time is spent in the void twinsSieve 
function, and the hotspots are the seg[k] = seg[k] | 1; lines. Since 
seg[k] can only be 1 or 0, I removed the OR operation. And the 
results are... cue the drum-roll... 5% slower.

I thought that all of my studying was getting somewhere, that I was 
beginning to understand things, but no. Performing an OR operation 
and then storing the data is faster than just storing the data. [sarcasm] 
Of course it makes sense [/sarcasm]

I looked at assembler and nothing changed except

orb    $0x1,(%rbx,%rdi,1)

is changed to

movb   $0x1,(%rbx,%rdi,1)

I'm completely lost.
Oct 16 2018
parent reply Jabari Zakiya <jzakiya gmail.com> writes:
On Tuesday, 16 October 2018 at 16:57:12 UTC, welkam wrote:
 So I run profiler and 97% of time is spent in void twinsSieve 
 function and hotspots are seg[k] = seg[k] |  1; lines. Since 
 seg[k] can only be 1 or 0 I removed that or operation. And the 
 results are. Queue the drum-roll... 5% slower.

 I thought that all of my studying was getting somewhere. That I 
 beginning to understand things but no. Performing OR operation 
 and then storing data is faster than just storing data. 
 [sarcasm] Of course it makes sense [/sarcasm]

 I looked at assembler and nothing changed except

 orb    $0x1,(%rbx,%rdi,1)

 is changed to

 movb   $0x1,(%rbx,%rdi,1)

 I`m completely lost.
This is the exact same behavior I found with the Nim compiler too. seg[k] = 1 is slower than seg[k] = seg[k] or 1, which is why it's coded like that. Thanks for finding the exact instruction difference. How many other unknown gotchas are in the compilers? :-(
Oct 16 2018
parent reply welkam <wwwelkam gmail.com> writes:
On Tuesday, 16 October 2018 at 17:57:23 UTC, Jabari Zakiya wrote:
 This is the exact same behavior I found with the Nim compiler 
 too.
Well, the Nim compiler is more like a translator: it translates Nim code to C or C++. Since gcc was responsible for the optimizations and instruction selection, it would be more correct to say that gcc's behavior was the same as LLVM's.
 How many other unknown gotchas are in the compilers? :-(
the number of inlining shall be 3 https://www.youtube.com/watch?v=s4wnuiCwTGU
Oct 16 2018
parent reply Jabari Zakiya <jzakiya gmail.com> writes:
On Tuesday, 16 October 2018 at 20:38:24 UTC, welkam wrote:
 On Tuesday, 16 October 2018 at 17:57:23 UTC, Jabari Zakiya 
 wrote:
 This is the exact same behavior I found with the Nim compiler 
 too.
Well Nim compiler is more like translator. It translates Nim code to c or c++. Since gcc was responsible for optimizations and instruction selection it would be more correct to say that gcc behavior was same as llvm.
 How many other unknown gotchas are in the compilers? :-(
the number of inlining shall be 3 https://www.youtube.com/watch?v=s4wnuiCwTGU
Yes, it finally is the compiler doing it, and Clang does the same thing. And they could be modded to catch semantics like this and produce faster code.

My guess is the difference in code execution times is based on the ALU operation pipeline. seg[k] = seg[k] | 1 can be done "in place" in the pipeline, where just the lsb is changed in mem, whereas seg[k] = 1 may require a separate constant|value creation and memory write, and bypass the ALU.

I used to know most of these tricks back in the Pentium days, but I haven't had a need to keep up with machine|compiler level instruction optimizations for the Intel Ixx core chips.
Oct 16 2018
parent reply welkam <wwwelkam gmail.com> writes:
On Tuesday, 16 October 2018 at 20:58:54 UTC, Jabari Zakiya wrote:
 And they could be modded to catch semantics like this and 
 produce faster code.
It's hard to prove that you will only ever write 1 or 0 into the array, and even if you wrote such a pass it wouldn't fire very often. So: slower compile times for no gain in the common case.
 My guess is the difference in code execution times based on the 
 ALU operation pipeline.
and my guess is that the CPU knows it's going to write the same value to memory, so it ignores that write.
 I used to know most of these tricks back in the Pentium days, 
 but I haven't had a need to keep up with Intel Ixx core chips 
 machine|compiler > level instruction optimizations.
Life is too short for such things. Today compilers are good at optimizing low level stuff. Those decades of improvements add up.
Oct 16 2018
parent Jabari Zakiya <jzakiya gmail.com> writes:
On Tuesday, 16 October 2018 at 21:12:39 UTC, welkam wrote:
 On Tuesday, 16 October 2018 at 20:58:54 UTC, Jabari Zakiya 
 wrote:
 And they could be modded to catch semantics like this and 
 produce faster code.
Its hard to prove that you will only write 1 or 0 in the array and even if you write such pass it wont fire very often. So slower compile times for no gain in common case.
 My guess is the difference in code execution times based on 
 the ALU operation pipeline.
and my guess is that CPU knows that its going to write the same value to memory so it ignores that write.
Exactly. It executes the same instruction twice, and once the value is set the first time, it can ignore setting it the second time. However, with seg[k] = 1, the mov instruction always moves, because it doesn't know|care what the previous value was. That's my guess.
 Life is too short for such things. Today compilers are good at 
 optimizing low level stuff. Those decades of improvements add 
 up.
Amen! :-)
Oct 16 2018
prev sibling next sibling parent Jon Degenhardt <jond noreply.com> writes:
On Tuesday, 16 October 2018 at 07:09:05 UTC, Vijay Nayar wrote:
 D has multiple compilers, but for the speed of the finished 
 binary, LDC2 is generally recommended.  I used version 1.11.0.  
 https://github.com/ldc-developers/ldc/releases/tag/v1.11.0

 I was using DUB to manage the project, but to build the 
 stand-alone file from the gist link, use this command:  $ ldc2 
 -release -O3 twinprimes_ssoz.d
 And to run it:  $ echo "3000000000" | ./twinprimes_ssoz
It'd be interesting to see if LTO or PGO generated an improvement. It looks like it could in this case, as it might optimize some of the inner loops.

LTO is easy; enable it with: -flto=&lt;thin|full&gt; -defaultlib=phobos2-ldc-lto,druntime-ldc-lto (see: https://github.com/ldc-developers/ldc/releases/tag/v1.9.0). I've been using 'thin' on OSX, 'full' on Linux.

PGO is a bit more work, but not too bad. A good primer is here: https://johanengelen.github.io/ldc/2016/07/15/Profile-Guided-Optimization-with-LDC.html

--Jon
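Collecting the flags above in one place, the baseline and LTO builds would look like this (assuming the same twinprimes_ssoz.d source used elsewhere in the thread, and ldc2 1.9+ on the PATH):

```shell
# Baseline release build, as used earlier in the thread
ldc2 -release -O3 twinprimes_ssoz.d

# Same build with thin LTO (flags from the LDC 1.9.0 release notes linked above)
ldc2 -release -O3 -flto=thin \
     -defaultlib=phobos2-ldc-lto,druntime-ldc-lto twinprimes_ssoz.d

# Run it
echo "3000000000" | ./twinprimes_ssoz
```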
Oct 16 2018
prev sibling next sibling parent Jabari Zakiya <jzakiya gmail.com> writes:
On Tuesday, 16 October 2018 at 07:09:05 UTC, Vijay Nayar wrote:
 On Monday, 15 October 2018 at 22:17:57 UTC, Jabari Zakiya wrote:
 $ dub build --compiler=ldc2 -b=release && echo "3000000000" | 
 ./twinprimes
 Enter integer number:
 threads = 8
 each thread segment is [1 x 65536] bytes array
 twinprime candidates = 175324676; resgroups = 1298702
 each 135 threads has nextp[2 x 5566] array
 setup time = 1 ms, 864 μs, and 7 hnsecs
 perform twinprimes ssoz sieve
 sieve time = 196 ms, 566 μs, and 5 hnsecs
 last segment = 53518 resgroups; segment slices = 20
 total twins = 9210144; last twin = 2999999712+/- 1
 total time = 198 ms, 431 μs, and 2 hnsecs

 My understanding is that the difference in performance is 
 largely due to slightly better optimization from the LLVM 
 based ldc2 compiler, where I believe Nim is using gcc.
Here's what I get on my system.

$ echo 3_000_000_000 | ./twinprimes_test7yc.0180.gcc821
Enter integer number:
threads = 8
each thread segment is [1 x 65536] bytes array
twinprime candidates = 175324676; resgroups = 1298702
each 135 threads has nextp[2 x 5566] array
setup time = 0.000 secs
perform twinprimes ssoz sieve
sieve time = 0.144 secs
last segment = 53518 resgroups; segment slices = 20
total twins = 9210144; last twin = 2999999712+/-1
total time = 0.144 secs
Hey Vijay I found subtle static typing errors.

When N < 2^32 - 1 (4,294,967,295) the twinprime count and lastprime values are correct. When N > 2^32 - 1 both values start to be incorrect.

Example: for N = 4,000,000,000 you get correct answers.

$ echo 4000000000 | ./twinprimes_ssoz
Enter integer number:
threads = 8
each thread segment is [1 x 65536] bytes array
twinprime candidates = 233766231; resgroups = 1731602
each 135 threads has nextp[2 x 6332] array
setup time = 667 μs and 5 hnsecs
perform twinprimes ssoz sieve
sieve time = 181 ms, 277 μs, and 6 hnsecs
last segment = 27666 resgroups; segment slices = 27
total twins = 11944438; last twin = 3999999798+/- 1
total time = 181 ms, 945 μs, and 1 hnsec

Example: for N = 5,000,000,000 you get wrong answers.

$ echo 5000000000 | ./twinprimes_ssoz
Enter integer number:
threads = 8
each thread segment is [1 x 65536] bytes array
twinprime candidates = 292207792; resgroups = 2164503
each 135 threads has nextp[2 x 6999] array
setup time = 744 μs and 7 hnsecs
perform twinprimes ssoz sieve
sieve time = 225 ms, 323 μs, and 4 hnsecs
last segment = 1815 resgroups; segment slices = 34
total twins = 14618173; last twin = 705034592+/- 1
total time = 226 ms, 68 μs, and 1 hnsec

These are the correct answers for N = 5,000,000,000:

$ echo 5000000000 | ./twinprimes_test7yc.0180.gcc821
Enter integer number:
threads = 8
each thread segment is [1 x 65536] bytes array
twinprime candidates = 292207792; resgroups = 2164503
each 135 threads has nextp[2 x 6999] array
setup time = 0.001 secs
perform twinprimes ssoz sieve
sieve time = 0.237 secs
last segment = 1815 resgroups; segment slices = 34
total twins = 14618166; last twin = 4999999860+/-1
total time = 0.237 secs

The first easy check should show the lastprime value just a little smaller than N. See the timing|results table I posted earlier to see correct outputs as N gets bigger.
It took me a looong time, and was a big PITA, to get the static typing right in Nim too, especially coming from a dynamic language like Ruby.
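The wrong last-twin value above (705034592 for N = 5,000,000,000) is the signature of 32-bit wraparound: 5_000_000_000 mod 2^32 = 705_032_704. A minimal D sketch of that failure mode, with illustrative variable names rather than the ones in the program:

```d
// Why values go wrong past 2^32 - 1: a 32-bit uint silently wraps.
void main()
{
    ulong n = 5_000_000_000;
    uint wrapped = cast(uint) n;      // truncates modulo 2^32
    assert(wrapped == 705_032_704);   // far below n -- the symptom seen above
    assert(n > uint.max);             // uint.max = 4_294_967_295
}
```

The fix is simply to use 64-bit types (ulong) wherever a value can exceed uint.max.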
Oct 16 2018
prev sibling parent reply Jabari Zakiya <jzakiya gmail.com> writes:
On Tuesday, 16 October 2018 at 07:09:05 UTC, Vijay Nayar wrote:
 On Monday, 15 October 2018 at 22:17:57 UTC, Jabari Zakiya wrote:
 $ dub build --compiler=ldc2 -b=release && echo "3000000000" | 
 ./twinprimes
 Enter integer number:
 threads = 8
 each thread segment is [1 x 65536] bytes array
 twinprime candidates = 175324676; resgroups = 1298702
 each 135 threads has nextp[2 x 5566] array
 setup time = 1 ms, 864 μs, and 7 hnsecs
 perform twinprimes ssoz sieve
 sieve time = 196 ms, 566 μs, and 5 hnsecs
 last segment = 53518 resgroups; segment slices = 20
 total twins = 9210144; last twin = 2999999712+/- 1
 total time = 198 ms, 431 μs, and 2 hnsecs

 My understanding is that the difference in performance is 
 largely due to slightly better optimization from the LLVM 
 based ldc2 compiler, where I believe Nim is using gcc.
Here's what I get on my system.

$ echo 3_000_000_000 | ./twinprimes_test7yc.0180.gcc821
Enter integer number:
threads = 8
each thread segment is [1 x 65536] bytes array
twinprime candidates = 175324676; resgroups = 1298702
each 135 threads has nextp[2 x 5566] array
setup time = 0.000 secs
perform twinprimes ssoz sieve
sieve time = 0.144 secs
last segment = 53518 resgroups; segment slices = 20
total twins = 9210144; last twin = 2999999712+/-1
total time = 0.144 secs

Could you list your hardware, D version, and compiler specs? I will run your code on my system with your D version and compiler, if I can. Excellent work!
D has multiple compilers, but for the speed of the finished binary, LDC2 is generally recommended. I used version 1.11.0.
https://github.com/ldc-developers/ldc/releases/tag/v1.11.0

I was using DUB to manage the project, but to build the stand-alone file from the gist link, use this command:

$ ldc2 -release -O3 twinprimes_ssoz.d

And to run it:

$ echo "3000000000" | ./twinprimes_ssoz

Running the program a bunch of times, I get variable results, mostly between 195ms and 250ms. Running the Nim version, I also get variable results, mostly between 230ms and 280ms.

My system is an Alienware 14x R2 from 2012. Specs can be found here: https://www.cnet.com/products/alienware-m14xr2-14-core-i7-3630qm-8-gb-ram-750-gb-hdd/specs/
Hey Vijay I got a complete D version running correctly.

https://gist.github.com/jzakiya/ae93bfa03dbc8b25ccc7f97ff8ad0f61

Here's the Nim code for comparison.

https://gist.github.com/jzakiya/6c7e1868bd749a6b1add62e3e3b2341e

I'm about 80-85% finished writing my paper. It's interesting the things you learn about a topic if you try to write about it. :-) When I finish I'll post it here, and everything will become a whole lot clearer about what|why|how the algorithm|code works.

I found a few edge case bugs in the original code I posted, but also found some coding and performance improvements too. The D (and revised Nim) code reflects all these changes, and they are the current state-of-the-art regarding this implementation.

One thing that could|should make it faster is the use of fast bitarrays to create the seg arrays in the threads (vs byte arrays). I didn't find anywhere that D has a native bitarray structure (Nim doesn't either).

Please don't be put off by my coding style. I tried to mimic the Nim style (which is non-standard too). I code to minimize the syntax noise, and like complete concepts on one line.

Performance is "similar" to Nim's, though Nim is slightly faster as N increases. Both are still faster than primesieve however. Still hoping someone does a GPU version. Please compile with the different D compilers and see what differences they produce.

And last, the D version doesn't use P17, as in the Nim version, because of the issue you found with the D compiler when trying to compile with P17. If you have time maybe you (someone) can try to resolve that issue, to match its functionality to the Nim version.
Nov 07 2018
parent reply Bastiaan Veelo <Bastiaan Veelo.net> writes:
Hi Jabari,

On Wednesday, 7 November 2018 at 20:30:33 UTC, Jabari Zakiya 
wrote:
 Hey Vijay I got a complete D version running correctly.

 https://gist.github.com/jzakiya/ae93bfa03dbc8b25ccc7f97ff8ad0f61
I cloned this and enabled P17 for you, get it at https://github.com/veelo/twinprimes_ssoz

I basically optimized the compile-time table generation by removing templates and avoiding reallocations. This is a big deal because currently the compiler will not collect its own garbage. You'll see that I reserve exactly the minimum array lengths; of course I didn't know these when I started, I got it to work just by preallocating for up to P13 and using appender for P17. But since I now know the exact required lengths, I can just as well use them. My system has 16GB of RAM, which is enough.

[...]
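The allocation strategy described above (reserve the exact final length up front so the appender never regrows) can be sketched as follows. buildTable is a made-up stand-in for the actual table generators in the repo:

```d
// Sketch: preallocate with reserve so the appender never reallocates.
import std.array : appender;

uint[] buildTable(size_t len)
{
    auto app = appender!(uint[])();
    app.reserve(len);            // one allocation up front
    foreach (i; 0 .. len)
        app.put(cast(uint) i);   // stand-in for the real table values
    return app.data;
}

void main()
{
    assert(buildTable(4) == [0u, 1, 2, 3]);
}
```

This matters doubly at compile time, since (as noted above) CTFE allocations are never collected, so every avoided reallocation is memory the compiler keeps.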
 One thing that could|should make it faster is the use of fast 
 bitarrays to create the seg arrays in the threads (vs byte 
 arrays). I didn't find anywhere that D has a native bitarray 
 structure (Nim doesn't either).
There is https://dlang.org/phobos/std_bitmanip.html#BitArray

I haven't looked at the run-time code; it is quite possible that similar optimizations are possible there too.
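For reference, a minimal sketch of the std.bitmanip.BitArray mentioned above, used the way the ubyte seg arrays are used (the sizes here are illustrative, not taken from the program):

```d
// Sketch: BitArray as a 1-bit-per-position alternative to the ubyte seg arrays.
import std.bitmanip : BitArray;

void main()
{
    BitArray seg;
    seg.length = 65536 * 8;   // as many bit positions as a 64KB byte segment
    seg[5] = true;            // mark position 5 as composite
    assert(seg[5]);
    assert(!seg[6]);
}
```

Whether the 8x density win beats the extra shift/mask work per access is an empirical question (and, per the follow-up below, byte arrays turned out faster here).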
Nov 08 2018
parent reply Jabari Zakiya <jzakiya gmail.com> writes:
On Thursday, 8 November 2018 at 18:49:35 UTC, Bastiaan Veelo 
wrote:
 Hi Jabari,

 On Wednesday, 7 November 2018 at 20:30:33 UTC, Jabari Zakiya 
 wrote:
 Hey Vijay I got a complete D version running correctly.

 https://gist.github.com/jzakiya/ae93bfa03dbc8b25ccc7f97ff8ad0f61
 I cloned this and enabled P17 for you, get it at 
 https://github.com/veelo/twinprimes_ssoz

 I basically optimized the compile-time table generation by removing templates and avoiding reallocations. This is a big deal because currently the compiler will not collect its own garbage. You'll see that I reserve exactly the minimum array lengths; of course I didn't know these when I started, I got it to work just by preallocating for up to P13 and using appender for P17. But since I now know the exact required lengths, I can just as well use them. My system has 16GB of RAM, which is enough.

 [...]
 One thing that could|should make it faster is the use of fast 
 bitarrays to create the seg arrays in the threads (vs byte 
 arrays). I didn't find anywhere that D has a native bitarray 
 structure (Nim doesn't either).
 There is https://dlang.org/phobos/std_bitmanip.html#BitArray

 I haven't looked at the run-time code; it is quite possible that similar optimizations are possible there too.
Hi Bastiaan,

Good work on the code. I've been playing with it, and made versions using booleans and bitarrays for the seg array. Byte arrays are definitely faster, with bools slightly slower, but bitarrays way slower, by at least 2x.

I also made a slight modification to twin_sieve, at its end, to speed it up a smidgen. I didn't know if you want PRs to your code.

However, there is a problem in the code when using P17. It generates incorrect results > 15 trillion (when P17 is selected). For 20 trillion (20_000_000_000_000) it gives 37,593,079,058 instead of 30,198,862,775 for the twins count, and the last twin value is off too. This likely means the P17 parameters are off, and/or the nextp values, as not enough of the k values in twins_sieve are marking off prime multiples in the seg array.

However, for P5-P13 the results are correct, and the performance is comparable to the Nim version, up to N = 5 billion, where you start to see D be slower (by a few percent). You can run both on your system to see what the difference is on it.

Anyway, see if you can figure out what's up with P17 so that we can get it totally working, and I'll look at playing around with it too. I'm over 90% finished with my paper now (want to finish b4 Thanksgiving).
Nov 13 2018
parent reply Jabari Zakiya <jzakiya gmail.com> writes:
I finally finished my paper on Twin Primes, which just kept 
expanding into other topics as I continued writing. I included an 
extensive (brief) discussion of the math of Prime Generators and 
their use and implications for topics in Number Theory, et al. I 
describe in great detail the Twin Primes SSoZ, with visual aids. 
Hopefully it will all be clear. If you have any questions about 
anything in the paper please let me know.


The Use of Prime Generators to Implement Fast Twin Primes Sieve 
of Zakiya (Soz), Applications to Number Theory, and Implications 
for the Riemann Hypotheses

https://www.academia.edu/37952623/The_Use_of_Prime_Generators_to_Implement_Fast_Twin_Primes_Sieve_of_Zakiya_SoZ_Applications_to_Number_Theory_and_Implications_for_the_Riemann_Hypotheses

https://www.scribd.com/document/395415391/The-Use-of-Prime-Generators-to-Implement-Fast-Twin-Primes-Sieve-Of-Zakiya-SoZ-Applications-to-Number-Theory-and-Implications-for-the-Riemann-Hypoth
Dec 11 2018
parent welkam <wwwelkam gmail.com> writes:
On Tuesday, 11 December 2018 at 14:07:48 UTC, Jabari Zakiya wrote:
 The Use of Prime Generators to Implement Fast Twin Primes Sieve 
 of Zakiya (Soz), Applications to Number Theory, and 
 Implications for the Riemann Hypotheses
Hey Mom, I contributed to a white paper. I am famous now :D
Dec 11 2018