
digitalmars.D - A Friendly Challenge for D

reply Jabari Zakiya <jzakiya gmail.com> writes:
Hi.

I hope this is the right place to request this, if not please 
tell me a better one.

I looked at D, and played with it some circa 2010~2012, but 
time and life took my priorities away. I'm still interested 
in learning different languages, but there are so many more now 
that it's hard to devote the time to learn any of them to a high 
level of proficiency.

Subsequently, I started learning Nim, because I find the syntax 
and constructs simpler and more familiar than most (I'm coming 
mostly from a Ruby background).

I have developed in Nim a program to find twinprimes that seems 
to be the fastest of its type, compared to primesieve 
(https://primesieve.org/), written in C++, which claims to be 
the fastest.

Here is the code to my Nim implementation of twinprimes_ssoz.

https://gist.github.com/jzakiya/6c7e1868bd749a6b1add62e3e3b2341e

The total file is just 318 lines of code, with about 60 separate 
lines of comments.
The code is extensively commented per line to explain what's 
happening.
Reading the references given in the code introduction will 
explain the general process.
See  "The Segmented Sieve of Zakiya (SSoZ)"
https://www.scribd.com/doc/228155369/The-Segmented-Sieve-of-Zakiya-SSoZ
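
For readers who just want to see what is being counted before diving into the paper: here is a minimal, naive twin-prime enumeration in Python. It only illustrates the definition, and is nothing like the SSoZ itself (the function name is mine).

```python
def twin_primes_upto(n):
    # Plain boolean Sieve of Eratosthenes up to n.
    sieve = [True] * (n + 1)
    sieve[0] = sieve[1] = False
    for p in range(2, int(n ** 0.5) + 1):
        if sieve[p]:
            sieve[p * p :: p] = [False] * len(sieve[p * p :: p])
    # Twinprimes are (odd) prime pairs that differ by exactly 2.
    return [(p, p + 2) for p in range(2, n - 1) if sieve[p] and sieve[p + 2]]

print(twin_primes_upto(31))  # [(3, 5), (5, 7), (11, 13), (17, 19), (29, 31)]
```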

I am in the process of writing up a paper to explain this 
application and implementation. What I am requesting here is for 
a person (or persons) who is an "expert" (very good) to create a 
very fast D version, using whatever tricks the language has to 
maximize performance.

I would like to include in my paper a good comparison of various 
implementations in different compiled languages (C/C++, D, Nim, 
etc) to show how it performs with each.

This algorithm is designed to run in multiprocessing 
environments (more cores/threads, better performance). The Nim 
version (0.18.0; 0.19.0 recently released) uses Nim's native 
parallel processing constructs. In C++, et al, it may be best 
implemented using OpenMP, or CUDA, etc. These are implementation 
details that someone versed in a language could determine.

Well, that's it. I am willing to share performance data, compared 
to primesieve, if you like. The beauty of this 
algorithm|implementation is its simplicity of design and minimal 
code. It shouldn't be that hard to understand, or to determine 
how to translate into D. Of course, I'm fully available to answer 
questions and provide help.

Thanks

Jabari
Oct 10 2018
next sibling parent SrMordred <patric.dexheimer gmail.com> writes:
On Wednesday, 10 October 2018 at 16:15:56 UTC, Jabari Zakiya 
wrote:
 [...]
Looking forward to this :)
Oct 10 2018
prev sibling next sibling parent reply Kagamin <spam here.lot> writes:
On Wednesday, 10 October 2018 at 16:15:56 UTC, Jabari Zakiya 
wrote:
 https://gist.github.com/jzakiya/6c7e1868bd749a6b1add62e3e3b2341e
As I understand it, the main thread preallocates global memory and tracks it, and the other threads don't track it?
Oct 10 2018
parent reply Jabari Zakiya <jzakiya gmail.com> writes:
On Wednesday, 10 October 2018 at 20:43:01 UTC, Kagamin wrote:
 On Wednesday, 10 October 2018 at 16:15:56 UTC, Jabari Zakiya 
 wrote:
 https://gist.github.com/jzakiya/6c7e1868bd749a6b1add62e3e3b2341e
As i understand, main thread preallocates global memory and tracks it, and other threads don't track it?
Here's an abbreviated, elementary explanation of the algorithm and implementation. You will need to read (download) my paper "The Segmented Sieve of Zakiya (SSoZ)" because I will make reference to things in it, to keep this as short as possible.

https://www.scribd.com/doc/228155369/The-Segmented-Sieve-of-Zakiya-SSoZ

To really understand the math and theory of the algorithm requires primarily just understanding Table 3 on page 3 of the paper, as it encapsulates everything. You can read the paper to understand the language of the algorithm used in the code.

Twinprimes are 2 (odd) primes that differ by only 2, e.g. 3-5, 5-7, 11-13, 29-31. Table 3 shows all the residue values (prime candidates|pcs) and residue groups (resgroups|columns) needed to find all the primes up to 541 using P5 (modpg = 30, rescnt = 8). A given value N is represented with a PG table of some number of resgroups, with a max size I call Kmax (the resgroup that N resides in).

Using P5, I only need to sieve primes along the residue tracks (restracks) that can produce twinprimes, here 11-13, 17-19, and 29-31. Thus I create 3 byte arrays, one for each twinpair, and use the lower 2 bits of each byte to represent the upper and lower twinprime restracks. Then for each twinpair (here 3) I run a thread, and the 3 threads perform the SSoZ over the entire Kmax length of resgroups in parallel. At the end I accumulate the results and print them out. This, in a nutshell, is what the algorithm does.

The paper gives you enough to understand the fundamental nature of the algorithm, though I've learned so much more since 2014. :) The larger the PG, the more twinpair restracks there are to use (see page 14). For larger numbers you want to use the largest PG that 'fits' into the hardware cpu. All my development|testing was done on laptops using Intel i5|i7 cpus with 4 or 8 threads. I'm really interested in how it performs on other platforms (ARM, AMD, PowerPC, etc).
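
As an illustration of the P5 residue system described above, the 8 residue tracks and the 3 twinpair tracks can be derived in a few lines of Python. This is just a sketch of Table 3's structure (residues coprime to modpg = 30), not the sieve itself:

```python
from math import gcd

modpg = 30  # P5 prime generator modulus
# The 8 residues (rescnt = 8) coprime to 30; only these columns of
# the PG table can hold primes > 5.
residues = [r for r in range(2, modpg + 2) if gcd(r, modpg) == 1]
# Twinprime candidates live on adjacent residue tracks that differ by 2.
twin_pairs = [(a, b) for a, b in zip(residues, residues[1:]) if b - a == 2]

print(residues)    # [7, 11, 13, 17, 19, 23, 29, 31]
print(twin_pairs)  # [(11, 13), (17, 19), (29, 31)] -- the 3 twinpair tracks
```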
The main proc "twinprimes_ssoz" manages program flow and operates as follows:

1) accepts the input number (an integer) in "val" from the cli
2) sets "num" to the first odd number < "val" if "val" is even
3) calls "selectPG" with "num" to select the optimum Prime Generator (PG) parameters
4) computes various working parameters per the selected PG and number value (see refs)
5) computes global values for the number of primes, and their values, <= sqrt(num) for the selected PG with proc "sozpg"
6) sets the initial twinprime count for the selected PG
7) then with proc "segsieve" allocates one thread to perform the SSoZ (segmented sieve of zakiya) on each twinprime pair's residues (determined by the selected PG), and counts the number of twinprimes computed in each thread
8) then determines the true total twinprime count and the last twinprime <= "num"

It also times the different intervals and prints out diagnostics and final output.

The proc "twins_sieve" is the primary function that manages and performs the SSoZ for a given twinpair's parameters in each thread. The more threads, the faster the process goes.

I'll provide more info later. I have to run now. I wanted to get this out now while I was at my laptop, and online.
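
As a rough sketch of step 5: the real "sozpg" proc works within the selected PG's residue system, so this plain byte sieve is only a stand-in (and the function name is mine), but it shows what is being computed, the sieving primes <= sqrt(num):

```python
from math import isqrt

def sieving_primes(num):
    # Step 5 needs the primes <= sqrt(num); they become the sieving
    # primes applied against each twinpair residue track.
    limit = isqrt(num)
    flags = bytearray([1]) * (limit + 1)  # one byte flag per candidate
    flags[0:2] = b"\x00\x00"              # 0 and 1 are not prime
    for p in range(2, isqrt(limit) + 1):
        if flags[p]:
            flags[p * p :: p] = bytes(len(flags[p * p :: p]))
    return [p for p in range(limit + 1) if flags[p]]

print(sieving_primes(541))  # the primes <= 23, since isqrt(541) = 23
```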
Oct 10 2018
parent reply Neia Neutuladh <neia ikeran.org> writes:
On 10/10/2018 03:05 PM, Jabari Zakiya wrote:
 https://www.scribd.com/doc/228155369/The-Segmented-Sieve-of-Zakiya-SSoZ
It would be great if you could provide a link to a freely downloadable version of this.
Oct 10 2018
parent reply Jabari Zakiyth <jzakiya gmail.com> writes:
On Wednesday, 10 October 2018 at 22:25:17 UTC, Neia Neutuladh 
wrote:
 On 10/10/2018 03:05 PM, Jabari Zakiya wrote:
 https://www.scribd.com/doc/228155369/The-Segmented-Sieve-of-Zakiya-SSoZ
It would be great if you could provide a link to a freely downloadable version of this.
You can download the paper for free from that link. Did you have trouble doing it?

Here's another link to the paper.

https://www.academia.edu/7583194/The_Segmented_Sieve_of_Zakiya_SSoZ_

Here, again, is the link to the Nim code. Just install Nim (currently 0.19.0) and compile and run it per the instructions in the code. I recommend people do that to see its outputs and verify its results.

https://gist.github.com/jzakiya/6c7e1868bd749a6b1add62e3e3b2341e
Oct 10 2018
parent reply Ali Çehreli <acehreli yahoo.com> writes:
On 10/10/2018 07:52 PM, Jabari Zakiyth wrote:
 On Wednesday, 10 October 2018 at 22:25:17 UTC, Neia Neutuladh wrote:
 On 10/10/2018 03:05 PM, Jabari Zakiya wrote:
 https://www.scribd.com/doc/228155369/The-Segmented-Sieve-of-Zakiya-SSoZ
It would be great if you could provide a link to a freely downloadable version of this.
You can download the paper for free from that link. Did you have trouble doing it?
I think the problem is that scribd requires an account, which they are apparently happy to link to an existing Google or Facebook account.
 Here's another link to paper.

 https://www.academia.edu/7583194/The_Segmented_Sieve_of_Zakiya_SSoZ_
Similarly, that one requires a Google account.
 Here, again, is the link to the Nim code. Just install Nim (current
 0.19.0) and compile and run it per instructions in code. I recommend
 people do that to see its outputs and verify its results.

 https://gist.github.com/jzakiya/6c7e1868bd749a6b1add62e3e3b2341e
That works! :) Ali
Oct 10 2018
parent reply Jabari Zakiyth <jzakiya gmail.com> writes:
On Thursday, 11 October 2018 at 05:11:20 UTC, Ali Çehreli wrote:
 On 10/10/2018 07:52 PM, Jabari Zakiyth wrote:
 On Wednesday, 10 October 2018 at 22:25:17 UTC, Neia Neutuladh
wrote:
 On 10/10/2018 03:05 PM, Jabari Zakiya wrote:
 
https://www.scribd.com/doc/228155369/The-Segmented-Sieve-of-Zakiya-SSoZ
 It would be great if you could provide a link to a freely
downloadable
 version of this.
You can download the paper for free from that link. Did you
have trouble
 doing it?
I think the problem is that scribd requires an account, which they are apparently happy to link to an existing Google or Facebook account.
 Here's another link to paper.

 
https://www.academia.edu/7583194/The_Segmented_Sieve_of_Zakiya_SSoZ_

Similarly, that one requires a Google account.
 Here, again, is the link to the Nim code. Just install Nim
(current
 0.19.0) and compile and run it per instructions in code. I
recommend
 people do that to see its outputs and verify its results.

 
https://gist.github.com/jzakiya/6c7e1868bd749a6b1add62e3e3b2341e

That works! :) Ali
What country are you trying to get access from? I know people in the US have gotten the papers from those links, for free and without an account. It may also have something to do with your browser; they work for me in various modern browsers, including the Tor Browser.
Oct 11 2018
parent reply Carl Sturtivant <sturtivant gmail.com> writes:
On Thursday, 11 October 2018 at 13:26:19 UTC, Jabari Zakiyth 
wrote:
 On Thursday, 11 October 2018 at 05:11:20 UTC, Ali Çehreli wrote:
 [...]
What country are you trying to get access from, because I know people in the US have gotten the papers from those link, for free and without an account. It may also have something to do with your browser. They work for me in various modern browsers, including the Tor Browser.
Not credible --- I cannot get either without an account browsing from the US using Firefox with no unusual arrangements.
Oct 11 2018
parent reply Jabari Zakiya <jzakiya gmail.com> writes:
On Thursday, 11 October 2018 at 14:49:54 UTC, Carl Sturtivant 
wrote:
 On Thursday, 11 October 2018 at 13:26:19 UTC, Jabari Zakiyth 
 wrote:
 On Thursday, 11 October 2018 at 05:11:20 UTC, Ali Çehreli 
 wrote:
 [...]
What country are you trying to get access from, because I know people in the US have gotten the papers from those link, for free and without an account. It may also have something to do with your browser. They work for me in various modern browsers, including the Tor Browser.
Not credible --- I cannot get either without an account browsing from the US using Firefox with no unusual arrangements.
Wow. Those links used to work with no problem, so I don't know what the issue is (VPN use, etc.?). OK, this will definitely work: send me an email and I will email you the paper (and figure out a surefire way to make it available). Construct the full email address (for humans): (my first initial)(lastname) at gmail
Oct 11 2018
parent reply Jabari Zakiya <jzakiya gmail.com> writes:
On Thursday, 11 October 2018 at 16:13:17 UTC, Jabari Zakiya wrote:
 On Thursday, 11 October 2018 at 14:49:54 UTC, Carl Sturtivant 
 wrote:
 On Thursday, 11 October 2018 at 13:26:19 UTC, Jabari Zakiyth 
 wrote:
 On Thursday, 11 October 2018 at 05:11:20 UTC, Ali Çehreli 
 wrote:
 [...]
What country are you trying to get access from, because I know people in the US have gotten the papers from those link, for free and without an account. It may also have something to do with your browser. They work for me in various modern browsers, including the Tor Browser.
Not credible --- I cannot get either without an account browsing from the US using Firefox with no unusual arrangements.
Wow. Those links used to work with no problem, so I don't know what is it (vpn use, etc,?). OK, this will definitely work. Send me an email and I will email you the paper (and figure out a sure fire way to make it available). Construct this full email (for humans). (my first initial)(lastname) at gmail
Ok, hopefully this will work for everyone. Try this link:

https://mega.nz/#!yJxUEQgK!MY9dwjiWheE8tACtEeS0szduIvdBjiyTn4O6mMD_aZw

However, I'm still interested to find out what the specific issues are with the other links, so I can file a possible bug report with those services.

1) Can you actually get to the site?
2) Can you see the document (a reference to it)?
3) Can you read the document online (did you try)?
4) What happened when you tried to download it (popup requiring an account, etc.)?

The paper is from 2014, and Scribd has changed a lot since then. The more information people can provide, the better I can determine how to provide access to the papers. But you can always email me to get them as a last resort, or just to ask questions.
Oct 11 2018
parent Ali Çehreli <acehreli yahoo.com> writes:
On 10/11/2018 10:14 AM, Jabari Zakiya wrote:

 Ok, hopefully this will work for everyone. Try this link:

 https://mega.nz/#!yJxUEQgK!MY9dwjiWheE8tACtEeS0szduIvdBjiyTn4O6mMD_aZw
Thank you. That worked just fine. I clicked the Download link and the pdf was saved on my end. :) Ali
Oct 11 2018
prev sibling next sibling parent reply tide <tide tide.tide> writes:
On Wednesday, 10 October 2018 at 16:15:56 UTC, Jabari Zakiya 
wrote:
 I would like to include in my paper a good comparison of 
 various implementations in different compiled languages (C/C++, 
 D, Nim, etc) to show how it performs with each.
If you want help with your paper, possibly some kind of decent financial incentive would be appropriate. If the algorithm benefits from more threads, then finding or creating an implementation that runs on a GPU would probably be the true performance test. Mainstream CPUs have something like 4-8 cores; a GPU has hundreds, though with some limitations.
Oct 10 2018
parent Jabari Zakiyth <jzakiya gmail.com> writes:
On Thursday, 11 October 2018 at 00:22:10 UTC, tide wrote:
 On Wednesday, 10 October 2018 at 16:15:56 UTC, Jabari Zakiya 
 wrote:
 I would like to include in my paper a good comparison of 
 various implementations in different compiled languages 
 (C/C++, D, Nim, etc) to show how it performs with each.
If you want help with your paper, possibly some kind of decent financial incentive would be appropriate. If the algorithm benefits from more threads than finding or creating an implementation that runs on a GPU would probably be the true performance test. CPUs have like 4-8 cores in the mainstream? A GPU has hundreds, though with some limitations.
I'm writing the paper anyway (just like the others), so other implementations are icing on the cake, to show implementation variations as a benefit to readers. Maybe I'll set up a website and create a Rosetta Code-style repo for people to post their different language implementations, and offer a T-shirt for the fastest implementation. :-)

Yes, a GPU-based implementation would be the epitome for this algorithm, by far. This is actually why I have brought the algorithm to this implementation, so that the number crunching can all be done in parallel threads. (It would also be screamingly fast done in hardware in an FPGA.) However, I only have standard consumer-grade laptops. Hopefully someone(s) with sufficient hardware, interest, and time will take this upon themselves and publicize their results.
Oct 10 2018
prev sibling parent reply welkam <wwwelkam gmail.com> writes:
On Wednesday, 10 October 2018 at 16:15:56 UTC, Jabari Zakiya 
wrote:
 What I am requesting here is for a person(s) who is an "expert" 
 (very good) to create a very fast D version, using whatever 
 tricks it has to maximize performance.

 I would like to include in my paper a good comparison of 
 various implementations in different compiled languages (C/C++, 
 D, Nim, etc) to show how it performs with each.
I looked into your Nim code and, from a programmer's point of view, there is nothing unusual going on: simple data structures and simple operations. If you wrote equivalent code in C, C++, D, Nim, Rust, or Zig and compiled it with the same optimizing compiler (LLVM or GCC), you should get the same machine code and almost the same performance (less than 1% difference due to the runtime). If you got different machine code for an equivalent implementation, then you should file a bug report. The only way you will get different performance is by changing implementation details, but then you would be comparing apples to oranges.
Oct 12 2018
parent reply Jabari Zakiya <jzakiya gmail.com> writes:
On Friday, 12 October 2018 at 15:11:17 UTC, welkam wrote:
 On Wednesday, 10 October 2018 at 16:15:56 UTC, Jabari Zakiya 
 wrote:
 What I am requesting here is for a person(s) who is an 
 "expert" (very good) to create a very fast D version, using 
 whatever tricks it has to maximize performance.

 I would like to include in my paper a good comparison of 
 various implementations in different compiled languages 
 (C/C++, D, Nim, etc) to show how it performs with each.
I looked into your NIM code and from programmers point of view there is nothing interesting going on. Simple data structures and simple operations. If you wrote equivalent code in C, C++, D, NIM, Rust, Zig and compiled with same optimizing compiler (llvm or gcc) you should get the same machine code and almost the same performance (less than 1% difference due to runtime). If you got different machine code for equivalent implementation then you should file a bug report. The only way you will get different performance is by changing implementation details but then you would compare apples to oranges.
Hmm, I don't think what you're saying about similar output|performance in other languages is empirically correct, but that's really not the point of the challenge.

The real point of the challenge is to see what idiomatic code, written for performance, using the best resources the language provides, will produce compared to the Nim version. It's not to see what a line-by-line translation from Nim to D would look like. That may be a start to get something working, but it shouldn't be the end goal.

I'm using the Nim version here as the "reference implementation" so it can be used as the standard for comparison (accuracy of results and performance). The goal for D (et al) users is to use whatever resources the language provides to maybe do better.

Example: Nim currently doesn't provide standard bit arrays. Using bit arrays in place of byte arrays should perform faster because more data can fit in cache and operate faster. Also, to parallelize the algorithm, maybe using OpenMP, CUDA, etc. is the way to do it for D; I don't know what constructs D uses for parallel multiprocessing. And as noted before, this algorithm screams out to be done with GPUs.

But you are correct that the Nim code uses very simple coding operations. That is one of its beauties! :) It is simple to understand and implement mathematically, short and simple to code, and architecturally adaptable to hardware.

So to really do the challenge, the Nim code needs to be compiled and run (per the instructions in the code) to use as the "reference implementation", to see what correct outputs look like, and their times; then other implementations can be compared to it. I would hope that after getting an output-correct implementation done (to show you really know what you're doing), alternative implementations can be done to wring out better performance.
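
To make the bit-array point concrete, here is a sketch of the two storage layouts (the helper names are mine, not from the Nim code): a bit array packs 8x more prime-candidate flags per byte than a byte array, so more of the sieve fits in cache, at the cost of shift/mask work on every access.

```python
def make_byte_sieve(n):
    # One byte per prime candidate: simplest, fastest individual access.
    return bytearray(n)

def make_bit_sieve(n):
    # One bit per prime candidate: 8x denser, so more resgroups fit
    # per cache line, but each access needs shift/mask work.
    return bytearray((n + 7) // 8)

def bit_set(sieve, i):
    sieve[i >> 3] |= 1 << (i & 7)

def bit_get(sieve, i):
    return (sieve[i >> 3] >> (i & 7)) & 1

print(len(make_byte_sieve(1_000_000)))  # 1000000 bytes
print(len(make_bit_sieve(1_000_000)))   # 125000 bytes
```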
I think this is a good challenge for anyone wanting to learn D too, because it involves something substantially more than a "toy" algorithm, but is short enough to do with minimal time and effort, while requiring you to know (learn about) D in enough detail to determine the "best" (alternative) way to do it.

Finally, a really fast D implementation could be a marketing bonanza, showing people in the numerical analysis and data|signal processing fields, et al, that D can be used to solve their problems and be more performant than C++, etc.

Again, people should feel free to email me if they want more direct answers to questions, or help.
Oct 12 2018
next sibling parent welkam <wwwelkam gmail.com> writes:
On Friday, 12 October 2018 at 16:19:59 UTC, Jabari Zakiya wrote:
 Hmm,I don't think what you're saying about similar 
 output|performance with other languages is empirically correct, 
 but it's really not the point of the challenge.
That's why godbolt exists.

C++ and Rust: https://godbolt.org/z/FffXY2
C and D: https://godbolt.org/z/YQvkXU

Look at the assembly output: it's all the same. Compilers are very good at recognizing simple arithmetic computations and can reason about them. What they struggle with is memory access.

On Friday, 12 October 2018 at 16:19:59 UTC, Jabari Zakiya wrote:
 Finally, a really fast D implementation can be a marketing 
 bananza to show people in the numerical analysis, data|signal 
 processing fields, et al, that D can be used by them to solve 
 their problems and be more performant than C++, etc
eBay's tsv-utils is the fastest in the world: https://github.com/eBay/tsv-utils/blob/master/docs/comparative-benchmarks-2018.md#top-four-in-each-benchmark

D's JSON parsing is the fastest in the world: https://github.com/kostya/benchmarks#json

Sambamba, a BAM data processor, is the fastest in the world: https://www.basepairtech.com/blog/sorting-bam-files-samtools-vs-sambamba/

Mir GLAS is faster than OpenBLAS and Eigen: http://blog.mir.dlang.io/glas/benchmark/openblas/2016/09/23/glas-gemm-benchmark.html

There are enough examples, but they are not enough to change minds.
Oct 12 2018
prev sibling parent reply welkam <wwwelkam gmail.com> writes:
On Friday, 12 October 2018 at 16:19:59 UTC, Jabari Zakiya wrote:
 The real point of the challenge is too see what idiomatic 
 code...
There is no idiomatic D code. There are only better implementations. D doesn't tell you how to write your code. It gives you many tools, and you choose which tools to use. That's what people like about D.
Oct 12 2018
parent reply Jabari Zakiya <jzakiya gmail.com> writes:
On Friday, 12 October 2018 at 20:05:29 UTC, welkam wrote:
 On Friday, 12 October 2018 at 16:19:59 UTC, Jabari Zakiya wrote:
 The real point of the challenge is too see what idiomatic 
 code...
There is no idiomatic D code. There is only better implementations. D doesnt tell you how to write your code. It gives you many tools and you choose which tools to use. That`s what people like about D.
I await your implementation(s)! :-)
Oct 12 2018
parent reply Vijay Nayar <madric gmail.com> writes:
On Friday, 12 October 2018 at 21:08:03 UTC, Jabari Zakiya wrote:
 On Friday, 12 October 2018 at 20:05:29 UTC, welkam wrote:
 On Friday, 12 October 2018 at 16:19:59 UTC, Jabari Zakiya 
 wrote:
 The real point of the challenge is too see what idiomatic 
 code...
There is no idiomatic D code. There is only better implementations. D doesnt tell you how to write your code. It gives you many tools and you choose which tools to use. That`s what people like about D.
I await your implementation(s)! :-)
I downloaded the reference NIM implementation and got the latest nim compiler, but I received the following error: $ nim c --cc:gcc --d:release --threads:on twinprimes_ssoz.nim twinprimes_ssoz.nim(74, 11) Error: attempting to call undeclared routine: 'sort' For a person not familiar with nim, what's the fastest way to fix that?
Oct 13 2018
parent reply welkam <wwwelkam gmail.com> writes:
On Saturday, 13 October 2018 at 09:22:16 UTC, Vijay Nayar wrote:
 I downloaded the reference NIM implementation and got the 
 latest nim compiler, but I received the following error:
   $ nim c --cc:gcc --d:release --threads:on twinprimes_ssoz.nim
   twinprimes_ssoz.nim(74, 11) Error: attempting to call 
 undeclared routine: 'sort'

 For a person not familiar with nim, what's the fastest way to 
 fix that?
import algorithm

That's all. But then it spits out:

lib/nim/pure/algorithm.nim(144, 11) Error: interpretation requires too many iterations
Oct 13 2018
next sibling parent Vijay Nayar <madric gmail.com> writes:
On Saturday, 13 October 2018 at 14:32:33 UTC, welkam wrote:
 On Saturday, 13 October 2018 at 09:22:16 UTC, Vijay Nayar wrote:
 I downloaded the reference NIM implementation and got the 
 latest nim compiler, but I received the following error:
   $ nim c --cc:gcc --d:release --threads:on twinprimes_ssoz.nim
   twinprimes_ssoz.nim(74, 11) Error: attempting to call 
 undeclared routine: 'sort'

 For a person not familiar with nim, what's the fastest way to 
 fix that?
import algorithm thats all but then it spits out lib/nim/pure/algorithm.nim(144, 11) Error: interpretation requires too many iterations
I ran into the same problem as you did, and then followed the instructions from the error. I modified the compiler source and increased the maximum number of iterations from 3_000_000 to 1_000_000_000, rebuilt and installed it, but still ran into the exact same problem. There may be something up with the algorithm itself.
Oct 13 2018
prev sibling parent reply Jabari Zakiya <jzakiya gmail.com> writes:
On Saturday, 13 October 2018 at 14:32:33 UTC, welkam wrote:
 On Saturday, 13 October 2018 at 09:22:16 UTC, Vijay Nayar wrote:
 I downloaded the reference NIM implementation and got the 
 latest nim compiler, but I received the following error:
   $ nim c --cc:gcc --d:release --threads:on twinprimes_ssoz.nim
   twinprimes_ssoz.nim(74, 11) Error: attempting to call 
 undeclared routine: 'sort'

 For a person not familiar with nim, what's the fastest way to 
 fix that?
import algorithm thats all but then it spits out lib/nim/pure/algorithm.nim(144, 11) Error: interpretation requires too many iterations
My mistake. I updated the file and forgot to include the 'import algorithm' directive. The file is now fixed to include it. Download the corrected version or patch your file accordingly.

As stated in the file intro, **YOU MUST DO THIS** to get it to compile with current Nim (they were supposed to fix this in version 0.19.0 but didn't). To compile for Nim versions <= 0.19.0 do the following:

1) in file: ~/nim-0.19.0/compiler/vmdef.nim
2) set the variable: MaxLoopIterations* = 1_000_000_000 (1 billion or more)
3) then rebuild the system: ./koch boot -d:release

If you are using 'choosenim' to install Nim (highly advisable) the full path is:

~/.choosenim/toolchains/nim-0.19.0/compiler/vmdef.nim

I'll post performance results from my laptop to give reference times to compare against.
Oct 13 2018
parent reply Vijay Nayar <madric gmail.com> writes:
On Saturday, 13 October 2018 at 15:19:07 UTC, Jabari Zakiya wrote:
 On Saturday, 13 October 2018 at 14:32:33 UTC, welkam wrote:
 On Saturday, 13 October 2018 at 09:22:16 UTC, Vijay Nayar 
 wrote:
 [...]
import algorithm thats all but then it spits out lib/nim/pure/algorithm.nim(144, 11) Error: interpretation requires too many iterations
My mistake. I updated the file and forgot to include the 'import algorithm' directive. The file is now fixed to include it. Download the corrected version or patch your file accordingly. As stated in the file intro **YOU MUST DO THIS** to get it to compile with current Nim (they were supposed to fix this in this version 0.19.0 but didn't). To compile for nim versions <= 0.19.0 do following: 1) in file: ~/nim-0.19.0/compiler/vmdef.nim 2) set variable: MaxLoopIterations* = 1_000_000_000 (1 Billion or >) 3) then rebuild sysem: ./koch boot -d:release If you are using 'choosenim' to install Nim (highly advisable) the full path is: ~/.choosenim/toolchains/nim-0.19.0/compiler/vmdef.nim I'll post performance results from my laptop to give reference times to compare against.
Ok, now it builds. I was previously following the build instructions from the Nim website and am not super clear on what the "koch" tool does, but following your instructions, the program does build and run. I'll take a stab at making a D version.
Oct 13 2018
parent reply Vijay Nayar <madric gmail.com> writes:
On Saturday, 13 October 2018 at 15:50:06 UTC, Vijay Nayar wrote:
 On Saturday, 13 October 2018 at 15:19:07 UTC, Jabari Zakiya 
 wrote:
 On Saturday, 13 October 2018 at 14:32:33 UTC, welkam wrote:
 On Saturday, 13 October 2018 at 09:22:16 UTC, Vijay Nayar 
 wrote:
 [...]
import algorithm thats all but then it spits out lib/nim/pure/algorithm.nim(144, 11) Error: interpretation requires too many iterations
My mistake. I updated the file and forgot to include the 'import algorithm' directive. The file is now fixed to include it. Download the corrected version or patch your file accordingly. As stated in the file intro **YOU MUST DO THIS** to get it to compile with current Nim (they were supposed to fix this in this version 0.19.0 but didn't). To compile for nim versions <= 0.19.0 do following: 1) in file: ~/nim-0.19.0/compiler/vmdef.nim 2) set variable: MaxLoopIterations* = 1_000_000_000 (1 Billion or >) 3) then rebuild sysem: ./koch boot -d:release If you are using 'choosenim' to install Nim (highly advisable) the full path is: ~/.choosenim/toolchains/nim-0.19.0/compiler/vmdef.nim I'll post performance results from my laptop to give reference times to compare against.
Ok, now it builds. I was previously following the build instructions from the Nim website and am not super clear what the "koch" tool does, but following your instructions, the program does build and run. I'll take a stab at making a D version.
Interesting results so far. I have a partially converted program here:

https://gist.github.com/vnayar/79e2d0a9850833b8859dd9f08997b4d7

The interesting part is that during compilation (with the command "dmd twinprimes_ssoz.d"), the compilation aborts with the message "Killed" and no further information. That's a new one for me, so I'm looking into the cause.
Oct 13 2018
parent reply Jabari Zakiya <jzakiya gmail.com> writes:
On Saturday, 13 October 2018 at 17:36:33 UTC, Vijay Nayar wrote:
 On Saturday, 13 October 2018 at 15:50:06 UTC, Vijay Nayar wrote:
 On Saturday, 13 October 2018 at 15:19:07 UTC, Jabari Zakiya 
 wrote:
 On Saturday, 13 October 2018 at 14:32:33 UTC, welkam wrote:
 On Saturday, 13 October 2018 at 09:22:16 UTC, Vijay Nayar 
 wrote:
 [...]
import algorithm thats all but then it spits out lib/nim/pure/algorithm.nim(144, 11) Error: interpretation requires too many iterations
My mistake. I updated the file and forgot to include the 'import algorithm' directive. The file is now fixed to include it. Download the corrected version or patch your file accordingly. As stated in the file intro **YOU MUST DO THIS** to get it to compile with current Nim (they were supposed to fix this in this version 0.19.0 but didn't). To compile for nim versions <= 0.19.0 do following: 1) in file: ~/nim-0.19.0/compiler/vmdef.nim 2) set variable: MaxLoopIterations* = 1_000_000_000 (1 Billion or >) 3) then rebuild sysem: ./koch boot -d:release If you are using 'choosenim' to install Nim (highly advisable) the full path is: ~/.choosenim/toolchains/nim-0.19.0/compiler/vmdef.nim I'll post performance results from my laptop to give reference times to compare against.
Ok, now it builds. I was previously following the build instructions from the Nim website and am not super clear what the "koch" tool does, but following your instructions, the program does build and run. I'll take a stab at making a D version.
Interesting results so far. I have a partially converted program here: https://gist.github.com/vnayar/79e2d0a9850833b8859dd9f08997b4d7 The interesting part is that during compilation (with the command "dmd twinprimes_ssoz.d"), the compilation will abort with the message "Killed" and no further information. That's a new one for me, so I'm looking into the cause.
It may also be running into a hard time limit imposed on compilation, like the one Nim had/has that prevented my code from initially compiling. I'm generating a lot of PG parameter constants at compile time, and it's doing a lot of number crunching and building larger and larger arrays of constants as the PGs get larger. Try compiling with successive PGs (just P5, then P5 and P7, etc) to see where it fails. That will let you know the code is working correctly, and that the compiler is choking because of a hard time limit and/or a memory limit. That's why I put a compiler output statement in 'genPGparameters': to see the progression of the PG parameters being built by the compiler and find where it starts choking. You may also need to patch whatever facility in the D compiler chain controls this too.
Oct 13 2018
parent reply Vijay Nayar <madric gmail.com> writes:
On Saturday, 13 October 2018 at 18:05:45 UTC, Jabari Zakiya wrote:
 It may be also running into a hard time limit imposed on 
 compilation that Nim had/has that prevented my code from 
 initially compiling. I'm generating a lot of PG parameter 
 constants at compile time, and it's doing a lot of number 
 crunching and building larger and larger arrays of constants as 
 the PG's get larger.

 Try compiling with successive PG's (just P5, then P5 and P7, 
 etc) to see where it fails. That will let you know the code is 
 working correctly, and that the compiler is choking either/and 
 because of a hard time limit and/or memory limit. That's why I 
 put in a compiler output statement in 'genPGparameters' to see 
 the progression of the PG parameters being built by the 
 compiler to initially find when the compiler started choking. 
 You may also need to patch whatever facility in the D compiler 
 chain that controls this too.
It's P17, the biggest one that takes the longest to build in the Nim version. I actually don't know what memory limits exist for the D compiler at compile-time, so I may need to do some homework.
Oct 13 2018
next sibling parent Jabari Zakiya <jzakiya gmail.com> writes:
On Saturday, 13 October 2018 at 18:14:20 UTC, Vijay Nayar wrote:
 On Saturday, 13 October 2018 at 18:05:45 UTC, Jabari Zakiya 
 wrote:
 It may be also running into a hard time limit imposed on 
 compilation that Nim had/has that prevented my code from 
 initially compiling. I'm generating a lot of PG parameter 
 constants at compile time, and it's doing a lot of number 
 crunching and building larger and larger arrays of constants 
 as the PG's get larger.

 Try compiling with successive PG's (just P5, then P5 and P7, 
 etc) to see where it fails. That will let you know the code is 
 working correctly, and that the compiler is choking either/and 
 because of a hard time limit and/or memory limit. That's why I 
 put in a compiler output statement in 'genPGparameters' to see 
 the progression of the PG parameters being built by the 
 compiler to initially find when the compiler started choking. 
 You may also need to patch whatever facility in the D compiler 
 chain that controls this too.
It's P17, the biggest one that takes the longest to build in the Nim version. I actually don't know what memory limits exist for the D compiler at compile-time, so I may need to do some homework.
Yes, that's what I figured, because that's when the Nim compiler initially balked. The variable that was patched in the Nim config file sets a hard limit on the number of operations the compiler can do. It's used as a brake on a runaway process (infinite loop, etc). It's probably something like that in the D compiler too, just guessing offhand. I hope it isn't a memory limit. However, the program will run fine without using P17. P17 is currently selected as the optimum PG for inputs greater than 15 trillion (15,000,000,000,000), so the program will run fine and produce correct output using P13 instead, just not as fast.
Oct 13 2018
prev sibling parent reply Vijay Nayar <madric gmail.com> writes:
On Saturday, 13 October 2018 at 18:14:20 UTC, Vijay Nayar wrote:
 On Saturday, 13 October 2018 at 18:05:45 UTC, Jabari Zakiya 
 wrote:
 It may be also running into a hard time limit imposed on 
 compilation that Nim had/has that prevented my code from 
 initially compiling. I'm generating a lot of PG parameter 
 constants at compile time, and it's doing a lot of number 
 crunching and building larger and larger arrays of constants 
 as the PG's get larger.

 Try compiling with successive PG's (just P5, then P5 and P7, 
 etc) to see where it fails. That will let you know the code is 
 working correctly, and that the compiler is choking either/and 
 because of a hard time limit and/or memory limit. That's why I 
 put in a compiler output statement in 'genPGparameters' to see 
 the progression of the PG parameters being built by the 
 compiler to initially find when the compiler started choking. 
 You may also need to patch whatever facility in the D compiler 
 chain that controls this too.
It's P17, the biggest one that takes the longest to build in the Nim version. I actually don't know what memory limits exist for the D compiler at compile-time, so I may need to do some homework.
It's not just DMD either.

$ ldc2 twinprimes_ssoz.d
...
generating parameters for P17
Killed

$ gdc twinprimes_ssoz.d
...
generating parameters for P17
gdc: internal compiler error: Killed (program cc1d)
Please submit a full bug report,
with preprocessed source if appropriate.
See <file:///usr/share/doc/gcc-5/README.Bugs> for instructions.

$ dmd twinprimes_ssoz.d
...
generating parameters for P17
Killed
Oct 13 2018
parent reply Jabari Zakiya <jzakiya gmail.com> writes:
On Saturday, 13 October 2018 at 18:31:57 UTC, Vijay Nayar wrote:
 On Saturday, 13 October 2018 at 18:14:20 UTC, Vijay Nayar wrote:
 On Saturday, 13 October 2018 at 18:05:45 UTC, Jabari Zakiya 
 wrote:
 It may be also running into a hard time limit imposed on 
 compilation that Nim had/has that prevented my code from 
 initially compiling. I'm generating a lot of PG parameter 
 constants at compile time, and it's doing a lot of number 
 crunching and building larger and larger arrays of constants 
 as the PG's get larger.

 Try compiling with successive PG's (just P5, then P5 and P7, 
 etc) to see where it fails. That will let you know the code 
 is working correctly, and that the compiler is choking 
 either/and because of a hard time limit and/or memory limit. 
 That's why I put in a compiler output statement in 
 'genPGparameters' to see the progression of the PG parameters 
 being built by the compiler to initially find when the 
 compiler started choking. You may also need to patch whatever 
 facility in the D compiler chain that controls this too.
It's P17, the biggest one that takes the longest to build in the Nim version. I actually don't know what memory limits exist for the D compiler at compile-time, so I may need to do some homework.
It's not just DMD either.

$ ldc2 twinprimes_ssoz.d
...
generating parameters for P17
Killed

$ gdc twinprimes_ssoz.d
...
generating parameters for P17
gdc: internal compiler error: Killed (program cc1d)
Please submit a full bug report,
with preprocessed source if appropriate.
See <file:///usr/share/doc/gcc-5/README.Bugs> for instructions.

$ dmd twinprimes_ssoz.d
...
generating parameters for P17
Killed
In the Nim code, starting at line 91 is where the PG constants are generated at compile time:

---------------------------------------------------------
const parametersp5  = genPGparameters(5)
const parametersp7  = genPGparameters(7)
const parametersp11 = genPGparameters(11)
const parametersp13 = genPGparameters(13)
const parametersp17 = genPGparameters(17)
---------------------------------------------------------

Can it compile using just P5 (the first line, others commented out), and then P7, etc? I'm not understanding your comments now.

If you can get a working version up and running (with correct output) we can solve the P17 compiler issues later (or in a parallel forum thread), especially if you have to delve into the weeds of the compiler chain. In my mind (same as with the Nim process) getting working code using any PG is the first-order priority (because you'll need to get multi-threading working too). Once you can do that, by default, you can then use any generator you want if you create the correct parameters for it. That's one of the advantages of the algorithm: it's PG agnostic (as long as your hardware will accommodate it).

So don't prioritize getting P17 to compile right now (in this thread). Create the working generic structure that can work with any PG first.
Oct 13 2018
parent reply Vijay Nayar <madric gmail.com> writes:
On Saturday, 13 October 2018 at 19:04:48 UTC, Jabari Zakiya wrote:
 On Saturday, 13 October 2018 at 18:31:57 UTC, Vijay Nayar wrote:
 On Saturday, 13 October 2018 at 18:14:20 UTC, Vijay Nayar 
 wrote:
 On Saturday, 13 October 2018 at 18:05:45 UTC, Jabari Zakiya 
 wrote:
 It may be also running into a hard time limit imposed on 
 compilation that Nim had/has that prevented my code from 
 initially compiling. I'm generating a lot of PG parameter 
 constants at compile time, and it's doing a lot of number 
 crunching and building larger and larger arrays of constants 
 as the PG's get larger.

 Try compiling with successive PG's (just P5, then P5 and P7, 
 etc) to see where it fails. That will let you know the code 
 is working correctly, and that the compiler is choking 
 either/and because of a hard time limit and/or memory limit. 
 That's why I put in a compiler output statement in 
 'genPGparameters' to see the progression of the PG 
 parameters being built by the compiler to initially find 
 when the compiler started choking. You may also need to 
 patch whatever facility in the D compiler chain that 
 controls this too.
It's P17, the biggest one that takes the longest to build in the Nim version. I actually don't know what memory limits exist for the D compiler at compile-time, so I may need to do some homework.
It's not just DMD either.

$ ldc2 twinprimes_ssoz.d
...
generating parameters for P17
Killed

$ gdc twinprimes_ssoz.d
...
generating parameters for P17
gdc: internal compiler error: Killed (program cc1d)
Please submit a full bug report,
with preprocessed source if appropriate.
See <file:///usr/share/doc/gcc-5/README.Bugs> for instructions.

$ dmd twinprimes_ssoz.d
...
generating parameters for P17
Killed
In the Nim code, starting line 91 is when the PG constants are being generated at compile time.

---------------------------------------------------------
const parametersp5  = genPGparameters(5)
const parametersp7  = genPGparameters(7)
const parametersp11 = genPGparameters(11)
const parametersp13 = genPGparameters(13)
const parametersp17 = genPGparameters(17)
---------------------------------------------------------

Can it compile just using P5 (the first line, others commented out), and then P7, etc? I'm not understanding your comments now. If you can get a working version up and running (with correct output) we can solve the P17 compiler issues later (or in a parallel forum thread), especially if you have to delve into the weeds of the compiler chain. In my mind (same with Nim process) getting working code using any PG is first order priority (because you'll need getting multi-threading working too). Once you can do that, by default, you can then use any generator you want if you create the correct parameters for it. That's one of the advantages of the algorithm, it's PG agnostic (as long as your hardware will accommodate it). So don't prioritize getting P17 to compile right now (in this thread). Create the working generic structure that can work with any PG first.
Updated: https://gist.github.com/vnayar/79e2d0a9850833b8859dd9f08997b4d7

I still get a few runtime errors, likely from mistakes in my conversion for certain primes. I'll resolve those after I get back from the gym.

But as previous posters have said, the code is not really very different between Nim and D. Most of it is array manipulation and arithmetic operations, and not many of the features of either D or Nim are very different. Both turn into fast code, both have garbage collection, and both have generally similar operators and libraries for this kind of problem.

The biggest differences I observed revolved not around the languages themselves, but around code style. For example, can you put a loop and 3 additional statements on a single line in D? Yes. But it is considered to be not very readable code from a style perspective.

Once I get the bugs out, I'm curious to see if any performance differences crop up. There's the theory that says they should be the same, and then there's the practice.
Oct 14 2018
next sibling parent Bastiaan Veelo <Bastiaan Veelo.net> writes:
On Sunday, 14 October 2018 at 10:51:11 UTC, Vijay Nayar wrote:
 But as previous posters have said, the code is not really very 
 different between Nim and D.  Most of it is array manipulation 
 and arithmetic operations, and not many of the features of 
 either D or Nim are very different.  Both turn into fast code, 
 both have garbage collection, and both have generally similar 
 operators and libraries for this kind of problem.

 The biggest differences I observed revolved not around the 
 languages themselves, but around code style.
Are you aware of DCompute? [1] I haven’t used it myself, but since Jabari explicitly mentioned taking advantage of the GPU, it may be worth giving it a try. It should have an impact on performance and possibly on style as well, even though it will still be D. It would be nice if Nicholas could chime in on this thread.

Bastiaan.

[1] https://dlang.org/blog/2017/10/30/d-compute-running-d-on-the-gpu/
Oct 14 2018
prev sibling parent reply Vijay Nayar <madric gmail.com> writes:
On Sunday, 14 October 2018 at 10:51:11 UTC, Vijay Nayar wrote:
 Once I get the bugs out, I'm curious to see if any performance 
 differences crop up.  There's the theory that says they should 
 be the same, and then there's the practice.
I don't actually understand the underlying algorithm, but I at least understand the flow of the program and the structure. The algorithm utilized depends heavily on shared memory access, which can be done in D, but I definitely wouldn't call it idiomatic. In D, message passing is preferred, but it really can't be well demonstrated on your algorithm without a deeper understanding of the algorithm itself.

A complete working version can be found at: https://gist.github.com/vnayar/79e2d0a9850833b8859dd9f08997b4d7

I modified both versions of the program to utilize the pgParameters13 for more of an apples-to-apples comparison. The final results are as follows:

$ nim c --cc:gcc --d:release --threads:on twinprimes_ssoz.nim && echo "3000000000" | ./twinprimes_ssoz
Enter integer number: threads = 8
each thread segment is [1 x 65536] bytes array
twinprime candidates = 175324676; resgroups = 1298702
each 135 threads has nextp[2 x 5566] array
setup time = 0.000 secs
perform twinprimes ssoz sieve
sieve time = 0.222 secs
last segment = 53518 resgroups; segment slices = 20
total twins = 9210144; last twin = 2999999712+/-1
total time = 0.223 secs

$ dub build --compiler=ldc2 -b=release && echo "3000000000" | ./twinprimes
Enter integer number:
threads = 8
each thread segment is [1 x 65536] bytes array
twinprime candidates = 175324676; resgroups = 1298702
each 135 threads has nextp[2 x 5566] array
setup time = 1 ms, 864 μs, and 7 hnsecs
perform twinprimes ssoz sieve
sieve time = 196 ms, 566 μs, and 5 hnsecs
last segment = 53518 resgroups; segment slices = 20
total twins = 9210144; last twin = 2999999712+/- 1
total time = 198 ms, 431 μs, and 2 hnsecs

My understanding is that the difference in performance is largely due to slightly better optimization from the LLVM-based ldc2 compiler, where I believe Nim is using gcc.
Oct 15 2018
next sibling parent Jabari Zakiya <jzakiya gmail.com> writes:
 I don't actually understand the underlying algorithm, but I at 
 least understand the flow of the program and the structure.  
 The algorithm utilized depends heavily on using shared memory 
 access, which can be done in D, but I definitely wouldn't call 
 it idiomatic.  In D, message passing is preferred, but it 
 really can't be well demonstrated on your algorithm without a 
 deeper understanding of the algorithm itself.

 A complete working version can be found at: 
 https://gist.github.com/vnayar/79e2d0a9850833b8859dd9f08997b4d7

 I modified both versions of the program to utilize the 
 pgParameters13 for more of an apples-to-apples comparison.

 The final results are as follows:
 $ nim c --cc:gcc --d:release --threads:on twinprimes_ssoz.nim 
 && echo "3000000000" | ./twinprimes_ssoz
 Enter integer number: threads = 8
 each thread segment is [1 x 65536] bytes array
 twinprime candidates = 175324676; resgroups = 1298702
 each 135 threads has nextp[2 x 5566] array
 setup time = 0.000 secs
 perform twinprimes ssoz sieve
 sieve time = 0.222 secs
 last segment = 53518 resgroups; segment slices = 20
 total twins = 9210144; last twin = 2999999712+/-1
 total time = 0.223 secs

 $ dub build --compiler=ldc2 -b=release && echo "3000000000" | 
 ./twinprimes
 Enter integer number:
 threads = 8
 each thread segment is [1 x 65536] bytes array
 twinprime candidates = 175324676; resgroups = 1298702
 each 135 threads has nextp[2 x 5566] array
 setup time = 1 ms, 864 μs, and 7 hnsecs
 perform twinprimes ssoz sieve
 sieve time = 196 ms, 566 μs, and 5 hnsecs
 last segment = 53518 resgroups; segment slices = 20
 total twins = 9210144; last twin = 2999999712+/- 1
 total time = 198 ms, 431 μs, and 2 hnsecs

 My understanding is that the difference in performance is 
 largely due to slightly better optimization from the LLVM based 
 ldc2 compiler, where I believe Nim is using gcc.
Hey Great! I'll take some time and study your code.

Nim currently allows using gcc or clang: --cc:gcc or --cc:clang, and compiles to C: nim c --cc:gcc... or C++: nim c++ --cc:gcc...

You can compile in Nim with/without garbage collection (GC), and choose the GC. This version needs GC to recover the memory created in each thread, or it will eat it up. Watch the program with htop/top to see the threads|mem at work.

If D has a fast bitarray structure, try it to create the data arrays "prms" in proc "sozpg" and the "seg" arrays in "twinsSieve". This could allow for larger segment sizes|arrays that fit into cpu cache, reducing the number of segment slices to process. This would all have to be profiled and encapsulated in "selectPG", which adaptively selects the best PG to use.

Here is a table of output data and performance times (in seconds). The results were produced on a System76 laptop with an Intel I7 6700HQ cpu, 2.6-3.5 GHz clock, with 8 threads, and 16GB of memory. I compiled the file "twinprimes_ssoz.nim" with Nim versions 0.18.0 and the just-released 0.19.0, using gcc 8.2.1 with Manjaro (Linux) in VirtualBox. I then ran the executables on my host Linux system (which only has gcc-4.9.8) to use all 8 threads. The output was produced on a "quiet" system, where I turned off the laptop and rebooted, and ran tests with only a terminal and spreadsheet open (to record results). The times are the "best" times from multiple runs.

I also compare times from the program "primesieve" - https://primesieve.org/ - which states on its site:

------------------------------------------------------------------------------------
primesieve is a free software program and C/C++ library that generates primes using a highly optimized sieve of Eratosthenes implementation. It counts the primes below 10^10 in just 0.45 seconds on an Intel Core i7-6700 CPU (4 x 3.4 GHz). primesieve can generate primes and prime k‑tuplets up to 2^64.
------------------------------------------------------------------------------------

(An optimized C/C++ version of twinprimes_ssoz would be nice to compare against too.)

Number    | Twinprimes  | Last Twinprime | twinprimes_ssoz.nim, gcc-8.2.1 | primesieve |
          |             |                |    0.18.0    |    0.19.0       |   9.7.1    |
----------|-------------|----------------|--------------|-----------------|------------|
1 x 10^09 | 3424506     | 1999999872     | 0.043        | 0.041           | 0.049      |
5 x 10^09 | 14618166    | 4999999860     | 0.219        | 0.226           | 0.248      |
1 x 10^10 | 27412679    | 9999999702     | 0.398        | 0.401           | 0.533      |
5 x 10^10 | 118903682   | 49999999590    | 2.364        | 2.331           | 2.978      |
1 x 10^11 | 224376048   | 99999999762    | 4.586        | 4.476           | 6.189      |
5 x 10^11 | 986222314   | 499999999062   | 26.625       | 26.578          | 35.120     |
1 x 10^12 | 1870585220  | 999999999960   | 57.895       | 58.195          | 73.966     |
5 x 10^12 | 8312493003  | 4999999999878  | 385.562      | 389.347         | 433.320    |
1 x 10^13 | 15834664872 | 9999999998490  | 855.705      | 857.771         | 934.253    |
2 x 10^13 | 30198862775 | 19999999999830 | 1806.282     | 1818.766        | 1998.341   |

A typical output of the "twinprimes_ssoz.nim" files looks as below:

$ echo 123_345_789_000 | ./twinprimes_test7yc.0180.gcc821
Enter integer number: threads = 8
each thread segment is [1 x 98304] bytes array         // 98,304 segment size (Kn)
twinprime candidates = 6099517038; resgroups = 4107419 // max twins; max cols (KB)
each 1485 threads has nextp[2 x 30062] array           // 30,062 primes <= sqrt(N)
setup time = 0.006 secs                                // time for primes|ssoz setup
perform twinprimes ssoz sieve                          // ssoz starts now
sieve time = 5.771 secs                                // time to do just the ssoz
last segment = 76955 resgroups; segment slices = 42    // last seg cols; seg loops
total twins = 272018910; last twin = 123345788640+/-1  // twinprimes, val for last
total time = 5.778 secs                                // setup + ssoz time

You can determine the selected PG by the number of threads being used.

I'm writing up the paper, and will probably release the program operation description to provide a detailed understanding of how|why it works, which will allow for modification.
Oct 15 2018
prev sibling parent reply Jabari Zakiya <jzakiya gmail.com> writes:
 $ dub build --compiler=ldc2 -b=release && echo "3000000000" | 
 ./twinprimes
 Enter integer number:
 threads = 8
 each thread segment is [1 x 65536] bytes array
 twinprime candidates = 175324676; resgroups = 1298702
 each 135 threads has nextp[2 x 5566] array
 setup time = 1 ms, 864 μs, and 7 hnsecs
 perform twinprimes ssoz sieve
 sieve time = 196 ms, 566 μs, and 5 hnsecs
 last segment = 53518 resgroups; segment slices = 20
 total twins = 9210144; last twin = 2999999712+/- 1
 total time = 198 ms, 431 μs, and 2 hnsecs

 My understanding is that the difference in performance is 
 largely due to slightly better optimization from the LLVM based 
 ldc2 compiler, where I believe Nim is using gcc.
Here's what I get on my system.

$ echo 3_000_000_000 | ./twinprimes_test7yc.0180.gcc821
Enter integer number: threads = 8
each thread segment is [1 x 65536] bytes array
twinprime candidates = 175324676; resgroups = 1298702
each 135 threads has nextp[2 x 5566] array
setup time = 0.000 secs
perform twinprimes ssoz sieve
sieve time = 0.144 secs
last segment = 53518 resgroups; segment slices = 20
total twins = 9210144; last twin = 2999999712+/-1
total time = 0.144 secs

Could you list your hardware, D version, and compiler specs? I will run your code on my system with your D version and compiler, if I can.

Excellent work!
Oct 15 2018
parent reply Vijay Nayar <madric gmail.com> writes:
On Monday, 15 October 2018 at 22:17:57 UTC, Jabari Zakiya wrote:
 $ dub build --compiler=ldc2 -b=release && echo "3000000000" | 
 ./twinprimes
 Enter integer number:
 threads = 8
 each thread segment is [1 x 65536] bytes array
 twinprime candidates = 175324676; resgroups = 1298702
 each 135 threads has nextp[2 x 5566] array
 setup time = 1 ms, 864 μs, and 7 hnsecs
 perform twinprimes ssoz sieve
 sieve time = 196 ms, 566 μs, and 5 hnsecs
 last segment = 53518 resgroups; segment slices = 20
 total twins = 9210144; last twin = 2999999712+/- 1
 total time = 198 ms, 431 μs, and 2 hnsecs

 My understanding is that the difference in performance is 
 largely due to slightly better optimization from the LLVM 
 based ldc2 compiler, where I believe Nim is using gcc.
Here's what I get on my system.

$ echo 3_000_000_000 | ./twinprimes_test7yc.0180.gcc821
Enter integer number: threads = 8
each thread segment is [1 x 65536] bytes array
twinprime candidates = 175324676; resgroups = 1298702
each 135 threads has nextp[2 x 5566] array
setup time = 0.000 secs
perform twinprimes ssoz sieve
sieve time = 0.144 secs
last segment = 53518 resgroups; segment slices = 20
total twins = 9210144; last twin = 2999999712+/-1
total time = 0.144 secs

Could you list your hardware, D ver, compiler specs. I will run your code on my system with your D version and compiler, if I can. Excellent work!
D has multiple compilers, but for the speed of the finished binary, LDC2 is generally recommended. I used version 1.11.0. https://github.com/ldc-developers/ldc/releases/tag/v1.11.0

I was using DUB to manage the project, but to build the stand-alone file from the gist link, use this command:

$ ldc2 -release -O3 twinprimes_ssoz.d

And to run it:

$ echo "3000000000" | ./twinprimes_ssoz

Running the program a bunch of times, I get variable results, mostly between 195ms and 250ms. Running the Nim version, I also get variable results, mostly between 230ms and 280ms.

My system is an Alienware 14x R2 from 2012. Specs can be found here: https://www.cnet.com/products/alienware-m14xr2-14-core-i7-3630qm-8-gb-ram-750-gb-hdd/specs/
Oct 16 2018
next sibling parent reply welkam <wwwelkam gmail.com> writes:
So I ran the profiler, and 97% of the time is spent in the void twinsSieve 
function, and the hotspots are the seg[k] = seg[k] | 1; lines. Since 
seg[k] can only be 1 or 0, I removed the OR operation. And the 
results are... cue the drum-roll... 5% slower.

I thought that all of my studying was getting somewhere, that I was 
beginning to understand things, but no. Performing an OR operation 
and then storing the data is faster than just storing the data. [sarcasm] 
Of course it makes sense [/sarcasm]

I looked at assembler and nothing changed except

orb    $0x1,(%rbx,%rdi,1)

is changed to

movb   $0x1,(%rbx,%rdi,1)

I'm completely lost.
Oct 16 2018
parent reply Jabari Zakiya <jzakiya gmail.com> writes:
On Tuesday, 16 October 2018 at 16:57:12 UTC, welkam wrote:
 So I run profiler and 97% of time is spent in void twinsSieve 
 function and hotspots are seg[k] = seg[k] |  1; lines. Since 
 seg[k] can only be 1 or 0 I removed that or operation. And the 
 results are. Queue the drum-roll... 5% slower.

 I thought that all of my studying was getting somewhere. That I 
 beginning to understand things but no. Performing OR operation 
 and then storing data is faster than just storing data. 
 [sarcasm] Of course it makes sense [/sarcasm]

 I looked at assembler and nothing changed except

 orb    $0x1,(%rbx,%rdi,1)

 is changed to

 movb   $0x1,(%rbx,%rdi,1)

 I`m completely lost.
This is the exact same behavior I found with the Nim compiler too. seg[k] = 1 is slower than seg[k] = seg[k] or 1, which is why it's coded like that. Thanks for finding the exact instruction difference. How many other unknown gotchas are in the compilers? :-(
Oct 16 2018
parent reply welkam <wwwelkam gmail.com> writes:
On Tuesday, 16 October 2018 at 17:57:23 UTC, Jabari Zakiya wrote:
 This is the exact same behavior I found with the Nim compiler 
 too.
Well, the Nim compiler is more like a translator: it translates Nim code to C or C++. Since gcc was responsible for the optimizations and instruction selection, it would be more correct to say that gcc's behavior was the same as LLVM's.
 How many other unknown gotchas are in the compilers? :-(
the number of inlining shall be 3 https://www.youtube.com/watch?v=s4wnuiCwTGU
Oct 16 2018
parent reply Jabari Zakiya <jzakiya gmail.com> writes:
On Tuesday, 16 October 2018 at 20:38:24 UTC, welkam wrote:
 On Tuesday, 16 October 2018 at 17:57:23 UTC, Jabari Zakiya 
 wrote:
 This is the exact same behavior I found with the Nim compiler 
 too.
Well Nim compiler is more like translator. It translates Nim code to c or c++. Since gcc was responsible for optimizations and instruction selection it would be more correct to say that gcc behavior was same as llvm.
 How many other unknown gotchas are in the compilers? :-(
the number of inlining shall be 3 https://www.youtube.com/watch?v=s4wnuiCwTGU
Yes, it finally is the compiler doing it, and Clang does the same thing. And they could be modded to catch semantics like this and produce faster code.

My guess is the difference in code execution times is based on the ALU operation pipeline. seg[k] = seg[k] | 1 can be done "in place" in the pipeline, where just the lsb is changed in mem, whereas seg[k] = 1 may require a separate constant|value creation and memory write, and bypass the ALU.

I used to know most of these tricks back in the Pentium days, but I haven't had a need to keep up with machine|compiler level instruction optimizations for the Intel Ixx core chips.
Oct 16 2018
parent reply welkam <wwwelkam gmail.com> writes:
On Tuesday, 16 October 2018 at 20:58:54 UTC, Jabari Zakiya wrote:
 And they could be modded to catch semantics like this and 
 produce faster code.
It's hard to prove that you will only ever write 1 or 0 into the array, and even if you wrote such a pass it wouldn't fire very often. So: slower compile times for no gain in the common case.
 My guess is the difference in code execution times based on the 
 ALU operation pipeline.
and my guess is that the CPU knows it's going to write the same value to memory, so it ignores that write.
 I used to know most of these tricks back in the Pentium days, 
 but I haven't had a need to keep up with Intel Ixx core chips 
 machine|compiler > level instruction optimizations.
Life is too short for such things. Today compilers are good at optimizing low level stuff. Those decades of improvements add up.
Oct 16 2018
parent Jabari Zakiya <jzakiya gmail.com> writes:
On Tuesday, 16 October 2018 at 21:12:39 UTC, welkam wrote:
 On Tuesday, 16 October 2018 at 20:58:54 UTC, Jabari Zakiya 
 wrote:
 And they could be modded to catch semantics like this and 
 produce faster code.
Its hard to prove that you will only write 1 or 0 in the array and even if you write such pass it wont fire very often. So slower compile times for no gain in common case.
 My guess is the difference in code execution times based on 
 the ALU operation pipeline.
and my guess is that CPU knows that its going to write the same value to memory so it ignores that write.
Exactly. It executes the same instruction twice, and once the value is set the first time, it can ignore setting it the second time. However, with seg[k] = 1, the mov instruction always moves, because it doesn't know|care what the previous value was. That's my guess.
 Life is too short for such things. Today compilers are good at 
 optimizing low level stuff. Those decades of improvements add 
 up.
Amen! :-)
Oct 16 2018
prev sibling next sibling parent Jon Degenhardt <jond noreply.com> writes:
On Tuesday, 16 October 2018 at 07:09:05 UTC, Vijay Nayar wrote:
 D has multiple compilers, but for the speed of the finished 
 binary, LDC2 is generally recommended.  I used version 1.11.0.  
 https://github.com/ldc-developers/ldc/releases/tag/v1.11.0

 I was using DUB to manage the project, but to build the 
 stand-alone file from the gist link, use this command:  $ ldc2 
 -release -O3 twinprimes_ssoz.d
 And to run it:  $ echo "3000000000" | ./twinprimes_ssoz
It'd be interesting to see if LTO or PGO generated an improvement. It looks like it could in this case, as it might optimize some of the inner loops.

LTO is easy; enable it with: -flto=&lt;thin|full&gt; -defaultlib=phobos2-ldc-lto,druntime-ldc-lto (see: https://github.com/ldc-developers/ldc/releases/tag/v1.9.0). I've been using 'thin' on OSX, 'full' on Linux.

PGO is a bit more work, but not too bad. A good primer is here: https://johanengelen.github.io/ldc/2016/07/15/Profile-Guided-Optimization-with-LDC.html

--Jon
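Collecting the flags above in one place, the baseline and LTO builds would look like this (assuming the same twinprimes_ssoz.d source used elsewhere in the thread, and ldc2 1.9+ on the PATH):

```shell
# Baseline release build, as used earlier in the thread
ldc2 -release -O3 twinprimes_ssoz.d

# Same build with thin LTO (flags from the LDC 1.9.0 release notes linked above)
ldc2 -release -O3 -flto=thin \
     -defaultlib=phobos2-ldc-lto,druntime-ldc-lto twinprimes_ssoz.d

# Run it
echo "3000000000" | ./twinprimes_ssoz
```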
Oct 16 2018
prev sibling next sibling parent Jabari Zakiya <jzakiya gmail.com> writes:
On Tuesday, 16 October 2018 at 07:09:05 UTC, Vijay Nayar wrote:
 On Monday, 15 October 2018 at 22:17:57 UTC, Jabari Zakiya wrote:
 $ dub build --compiler=ldc2 -b=release && echo "3000000000" | 
 ./twinprimes
 Enter integer number:
 threads = 8
 each thread segment is [1 x 65536] bytes array
 twinprime candidates = 175324676; resgroups = 1298702
 each 135 threads has nextp[2 x 5566] array
 setup time = 1 ms, 864 μs, and 7 hnsecs
 perform twinprimes ssoz sieve
 sieve time = 196 ms, 566 μs, and 5 hnsecs
 last segment = 53518 resgroups; segment slices = 20
 total twins = 9210144; last twin = 2999999712+/- 1
 total time = 198 ms, 431 μs, and 2 hnsecs

 My understanding is that the difference in performance is 
 largely due to slightly better optimization from the LLVM 
 based ldc2 compiler, where I believe Nim is using gcc.
Here's what I get on my system.

$ echo 3_000_000_000 | ./twinprimes_test7yc.0180.gcc821
Enter integer number:
threads = 8
each thread segment is [1 x 65536] bytes array
twinprime candidates = 175324676; resgroups = 1298702
each 135 threads has nextp[2 x 5566] array
setup time = 0.000 secs
perform twinprimes ssoz sieve
sieve time = 0.144 secs
last segment = 53518 resgroups; segment slices = 20
total twins = 9210144; last twin = 2999999712+/-1
total time = 0.144 secs
Hey Vijay I found subtle static typing errors.

When N < 2^32 - 1 (4,294,967,295) the twinprime count and lastprime values are correct. When N > 2^32 - 1 both values start to be incorrect.

Example: for N = 4,000,000,000 you get correct answers.

$ echo 4000000000 | ./twinprimes_ssoz
Enter integer number:
threads = 8
each thread segment is [1 x 65536] bytes array
twinprime candidates = 233766231; resgroups = 1731602
each 135 threads has nextp[2 x 6332] array
setup time = 667 μs and 5 hnsecs
perform twinprimes ssoz sieve
sieve time = 181 ms, 277 μs, and 6 hnsecs
last segment = 27666 resgroups; segment slices = 27
total twins = 11944438; last twin = 3999999798+/- 1
total time = 181 ms, 945 μs, and 1 hnsec

Example: for N = 5,000,000,000 you get wrong answers.

$ echo 5000000000 | ./twinprimes_ssoz
Enter integer number:
threads = 8
each thread segment is [1 x 65536] bytes array
twinprime candidates = 292207792; resgroups = 2164503
each 135 threads has nextp[2 x 6999] array
setup time = 744 μs and 7 hnsecs
perform twinprimes ssoz sieve
sieve time = 225 ms, 323 μs, and 4 hnsecs
last segment = 1815 resgroups; segment slices = 34
total twins = 14618173; last twin = 705034592+/- 1
total time = 226 ms, 68 μs, and 1 hnsec

These are the correct answers for N = 5,000,000,000:

$ echo 5000000000 | ./twinprimes_test7yc.0180.gcc821
Enter integer number:
threads = 8
each thread segment is [1 x 65536] bytes array
twinprime candidates = 292207792; resgroups = 2164503
each 135 threads has nextp[2 x 6999] array
setup time = 0.001 secs
perform twinprimes ssoz sieve
sieve time = 0.237 secs
last segment = 1815 resgroups; segment slices = 34
total twins = 14618166; last twin = 4999999860+/-1
total time = 0.237 secs

The first easy check should show the lastprime value just a little smaller than N. See the timing|results table I posted earlier to see correct outputs as N gets bigger.
It took me a looong time, and was a big PITA, to get the static typing right in Nim too, especially coming from a dynamic language like Ruby.
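The wrong last-twin value above (705034592 for N = 5,000,000,000) is the signature of 32-bit wraparound: 5_000_000_000 mod 2^32 = 705_032_704. A minimal D sketch of that failure mode, with illustrative variable names rather than the ones in the program:

```d
// Why values go wrong past 2^32 - 1: a 32-bit uint silently wraps.
void main()
{
    ulong n = 5_000_000_000;
    uint wrapped = cast(uint) n;      // truncates modulo 2^32
    assert(wrapped == 705_032_704);   // far below n -- the symptom seen above
    assert(n > uint.max);             // uint.max = 4_294_967_295
}
```

The fix is simply to use 64-bit types (ulong) wherever a value can exceed uint.max.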
Oct 16 2018
prev sibling parent reply Jabari Zakiya <jzakiya gmail.com> writes:
On Tuesday, 16 October 2018 at 07:09:05 UTC, Vijay Nayar wrote:
 On Monday, 15 October 2018 at 22:17:57 UTC, Jabari Zakiya wrote:
 $ dub build --compiler=ldc2 -b=release && echo "3000000000" | 
 ./twinprimes
 Enter integer number:
 threads = 8
 each thread segment is [1 x 65536] bytes array
 twinprime candidates = 175324676; resgroups = 1298702
 each 135 threads has nextp[2 x 5566] array
 setup time = 1 ms, 864 μs, and 7 hnsecs
 perform twinprimes ssoz sieve
 sieve time = 196 ms, 566 μs, and 5 hnsecs
 last segment = 53518 resgroups; segment slices = 20
 total twins = 9210144; last twin = 2999999712+/- 1
 total time = 198 ms, 431 μs, and 2 hnsecs

 My understanding is that the difference in performance is 
 largely due to slightly better optimization from the LLVM 
 based ldc2 compiler, where I believe Nim is using gcc.
Here's what I get on my system.

$ echo 3_000_000_000 | ./twinprimes_test7yc.0180.gcc821
Enter integer number:
threads = 8
each thread segment is [1 x 65536] bytes array
twinprime candidates = 175324676; resgroups = 1298702
each 135 threads has nextp[2 x 5566] array
setup time = 0.000 secs
perform twinprimes ssoz sieve
sieve time = 0.144 secs
last segment = 53518 resgroups; segment slices = 20
total twins = 9210144; last twin = 2999999712+/-1
total time = 0.144 secs

Could you list your hardware, D version, and compiler specs? I will run your code on my system with your D version and compiler, if I can. Excellent work!
D has multiple compilers, but for the speed of the finished binary, LDC2 is generally recommended. I used version 1.11.0.
https://github.com/ldc-developers/ldc/releases/tag/v1.11.0

I was using DUB to manage the project, but to build the stand-alone file from the gist link, use this command:

$ ldc2 -release -O3 twinprimes_ssoz.d

And to run it:

$ echo "3000000000" | ./twinprimes_ssoz

Running the program a bunch of times, I get variable results, mostly between 195ms and 250ms. Running the Nim version, I also get variable results, mostly between 230ms and 280ms.

My system is an Alienware 14x R2 from 2012. Specs can be found here: https://www.cnet.com/products/alienware-m14xr2-14-core-i7-3630qm-8-gb-ram-750-gb-hdd/specs/
Hey Vijay I got a complete D version running correctly.

https://gist.github.com/jzakiya/ae93bfa03dbc8b25ccc7f97ff8ad0f61

Here's the Nim code for comparison.

https://gist.github.com/jzakiya/6c7e1868bd749a6b1add62e3e3b2341e

I'm about 80-85% finished writing my paper. It's interesting the things you learn about a topic if you try to write about it. :-) When I finish I'll post it here, and everything will become a whole lot clearer about what|why|how the algorithm|code works.

I found a few edge case bugs in the original code I posted, but also found some coding and performance improvements too. The D (and revised Nim) code reflects all these changes, and they are the current state-of-the-art regarding this implementation.

One thing that could|should make it faster is the use of fast bitarrays to create the seg arrays in the threads (vs byte arrays). I didn't find anywhere that D has a native bitarray structure (Nim doesn't either).

Please don't be put off by my coding style. I tried to mimic the Nim style (which is non-standard too). I code to minimize the syntax noise, and like complete concepts on one line.

Performance is "similar" to Nim's, though Nim is slightly faster as N increases. Both are still faster than primesieve however. Still hoping someone does a GPU version. Please compile with the different D compilers and see what differences they produce.

And last, the D version doesn't use P17, as in the Nim version, because of the issue you found with the D compiler when trying to compile with P17. If you have time maybe you (someone) can try to resolve that issue, to match its functionality to the Nim version.
Nov 07 2018
parent reply Bastiaan Veelo <Bastiaan Veelo.net> writes:
Hi Jabari,

On Wednesday, 7 November 2018 at 20:30:33 UTC, Jabari Zakiya 
wrote:
 Hey Vijay I got a complete D version running correctly.

 https://gist.github.com/jzakiya/ae93bfa03dbc8b25ccc7f97ff8ad0f61
I cloned this and enabled P17 for you, get it at https://github.com/veelo/twinprimes_ssoz

I basically optimized the compile-time table generation by removing templates and avoiding reallocations. This is a big deal because currently the compiler will not collect its own garbage. You'll see that I reserve exactly the minimum array lengths; of course I didn't know these when I started, I got it to work just by preallocating for up to P13 and using appender for P17. But since I now know the exact required lengths, I can just as well use them. My system has 16GB of RAM, which is enough.

[...]
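The allocation strategy described above (reserve the exact final length up front so the appender never regrows) can be sketched as follows. buildTable is a made-up stand-in for the actual table generators in the repo:

```d
// Sketch: preallocate with reserve so the appender never reallocates.
import std.array : appender;

uint[] buildTable(size_t len)
{
    auto app = appender!(uint[])();
    app.reserve(len);            // one allocation up front
    foreach (i; 0 .. len)
        app.put(cast(uint) i);   // stand-in for the real table values
    return app.data;
}

void main()
{
    assert(buildTable(4) == [0u, 1, 2, 3]);
}
```

This matters doubly at compile time, since (as noted above) CTFE allocations are never collected, so every avoided reallocation is memory the compiler keeps.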
 One thing that could|should make it faster is the use of fast 
 bitarrays to create the seg arrays in the threads (vs byte 
 arrays). I didn't find anywhere that D has a native bitarray 
 structure (Nim doesn't either).
There is https://dlang.org/phobos/std_bitmanip.html#BitArray

I haven't looked at the run-time code; it is quite possible that similar optimizations are possible there too.
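For reference, a minimal sketch of the std.bitmanip.BitArray mentioned above, used the way the ubyte seg arrays are used (the sizes here are illustrative, not taken from the program):

```d
// Sketch: BitArray as a 1-bit-per-position alternative to the ubyte seg arrays.
import std.bitmanip : BitArray;

void main()
{
    BitArray seg;
    seg.length = 65536 * 8;   // as many bit positions as a 64KB byte segment
    seg[5] = true;            // mark position 5 as composite
    assert(seg[5]);
    assert(!seg[6]);
}
```

Whether the 8x density win beats the extra shift/mask work per access is an empirical question (and, per the follow-up below, byte arrays turned out faster here).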
Nov 08 2018
parent reply Jabari Zakiya <jzakiya gmail.com> writes:
On Thursday, 8 November 2018 at 18:49:35 UTC, Bastiaan Veelo 
wrote:
 Hi Jabari,

 On Wednesday, 7 November 2018 at 20:30:33 UTC, Jabari Zakiya 
 wrote:
 Hey Vijay I got a complete D version running correctly.

 https://gist.github.com/jzakiya/ae93bfa03dbc8b25ccc7f97ff8ad0f61
 I cloned this and enabled P17 for you, get it at 
 https://github.com/veelo/twinprimes_ssoz

 I basically optimized the compile-time table generation by removing templates and avoiding reallocations. This is a big deal because currently the compiler will not collect its own garbage. You'll see that I reserve exactly the minimum array lengths; of course I didn't know these when I started, I got it to work just by preallocating for up to P13 and using appender for P17. But since I now know the exact required lengths, I can just as well use them. My system has 16GB of RAM, which is enough.

 [...]
 One thing that could|should make it faster is the use of fast 
 bitarrays to create the seg arrays in the threads (vs byte 
 arrays). I didn't find anywhere that D has a native bitarray 
 structure (Nim doesn't either).
 There is https://dlang.org/phobos/std_bitmanip.html#BitArray

 I haven't looked at the run-time code; it is quite possible that similar optimizations are possible there too.
Hi Bastiaan,

Good work on the code. I've been playing with it, and made versions using booleans and bitarrays for the seg array. Byte arrays are definitely faster, with bools slightly slower, but bitarrays way slower, by at least 2x.

I also made a slight modification to twin_sieve, at its end, to speed it up a smidgen. I didn't know if you want PRs to your code.

However, there is a problem in the code when using P17. It generates incorrect results > 15 trillion (when P17 is selected). For 20 trillion (20_000_000_000_000) it gives 37,593,079,058 instead of 30,198,862,775 for the twins count, and the last twin value is off too. This likely means the P17 parameters are off, and/or the nextp values, as not enough of the k values in twins_sieve are marking off prime multiples in the seg array.

However, for P5-P13 the results are correct, and the performance is comparable to the Nim version, up to N = 5 billion, where you start to see D be slower (by a few percent). You can run both on your system to see what the difference is on it.

Anyway, see if you can figure out what's up with P17 so that we can get it totally working, and I'll look at playing around with it too. I'm over 90% finished with my paper now (want to finish b4 Thanksgiving).
Nov 13 2018
parent reply Jabari Zakiya <jzakiya gmail.com> writes:
I finally finished my paper on Twin Primes, which just kept 
expanding into other topics as I continued writing. I included an 
extensive (brief) discussion of the math of Prime Generators and 
their use and implications for topics in Number Theory, et al. I 
describe in great detail the Twin Primes SSoZ, with visual aids. 
Hopefully it will all be clear. If you have any questions about 
anything in the paper please let me know.


The Use of Prime Generators to Implement Fast Twin Primes Sieve 
of Zakiya (Soz), Applications to Number Theory, and Implications 
for the Riemann Hypotheses

https://www.academia.edu/37952623/The_Use_of_Prime_Generators_to_Implement_Fast_Twin_Primes_Sieve_of_Zakiya_SoZ_Applications_to_Number_Theory_and_Implications_for_the_Riemann_Hypotheses

https://www.scribd.com/document/395415391/The-Use-of-Prime-Generators-to-Implement-Fast-Twin-Primes-Sieve-Of-Zakiya-SoZ-Applications-to-Number-Theory-and-Implications-for-the-Riemann-Hypoth
Dec 11 2018
parent welkam <wwwelkam gmail.com> writes:
On Tuesday, 11 December 2018 at 14:07:48 UTC, Jabari Zakiya wrote:
 The Use of Prime Generators to Implement Fast Twin Primes Sieve 
 of Zakiya (Soz), Applications to Number Theory, and 
 Implications for the Riemann Hypotheses
Hey Mom, I contributed to a white paper. I am famous now :D
Dec 11 2018