
digitalmars.D.announce - [GSoC] Mir.random.flex - Generic non-uniform random sampling

reply Seb <seb wilzba.ch> writes:
Hey all,

I am proud to publish a report of my GSoC work as two extensive 
blog posts, which explain non-uniform random sampling and the 
mir.random.flex package (part of Mir > 0.16-beta2):

http://blog.mir.dlang.io/random/2016/08/19/intro-to-random-sampling.html
http://blog.mir.dlang.io/random/2016/08/22/transformed-density-rejection-sampling.html


Before I start my personal retrospect, I wanted to use this 
opportunity to give a huge thanks and acknowledgement to my two 
awesome mentors Ilya Yaroshenko (9il) and Joseph Wakeling 
(WebDrake).

Since I wrote my first line of D code only this February, I have 
learned quite a lot during the last few months. GitHub allows 
listing all merged contributions, which might show that I got 
quite familiar with dlang over time:

https://github.com/search?l=&o=desc&q=author%3Awilzbach+is%3Amerged+user%3Adlang&ref=advsearch&s=comments&type=Issues&utf8=%E2%9C%93

… and with other D repositories:

https://github.com/search?l=D&o=desc&q=author%3Awilzbach+is%3Amerged&ref=searchresults&s=comments&type=Issues&utf8=%E2%9C%93

I am pretty sure you now know me from the NG, GitHub, IRC, 
Twitter, Bugzilla, DConf16, the DWiki, #d at StackOverflow or 
/r/d_language, so I will skip a further introduction ;-)

Over the next weeks and months I will continue my work on 
mir.random, which is supposed to supersede std.random, so in case 
you aren’t following the Mir project [1, 2], stay tuned!

Best regards,

Seb

[1] https://github.com/libmir/mir
[2] https://twitter.com/libmir
Aug 22 2016
next sibling parent reply jmh530 <john.michael.hall gmail.com> writes:
On Monday, 22 August 2016 at 15:34:47 UTC, Seb wrote:
 Hey all,

 I am proud to publish a report of my GSoC work as two extensive 
 blog posts, which explain non-uniform random sampling and the 
 mir.random.flex package (part of Mir > 0.16-beta2):

 http://blog.mir.dlang.io/random/2016/08/19/intro-to-random-sampling.html
 http://blog.mir.dlang.io/random/2016/08/22/transformed-density-rejection-sampling.html
Thanks for the well-done blog posts, especially the first one. Does your implementation make any use of CTFE?
Aug 22 2016
parent Seb <seb wilzba.ch> writes:
On Monday, 22 August 2016 at 17:13:10 UTC, jmh530 wrote:
 On Monday, 22 August 2016 at 15:34:47 UTC, Seb wrote:
 Hey all,

 I am proud to publish a report of my GSoC work as two 
 extensive blog posts, which explain non-uniform random 
 sampling and the mir.random.flex package (part of Mir > 
 0.16-beta2):

 http://blog.mir.dlang.io/random/2016/08/19/intro-to-random-sampling.html
 http://blog.mir.dlang.io/random/2016/08/22/transformed-density-rejection-sampling.html
Thanks for the well-done blog posts, especially the first one.
I am glad to hear this!
 Does your implementation make any use of CTFE?
If you are referring to whether the intervals can be calculated at CT: unfortunately CTFE can't be used, due to four main reasons:

- FP math at CT (it's already hard to deal with at RT, see e.g. my recent complaint [1]). The problem is that the Flex algorithm is very sensitive to numerical errors, so an erroneous change at the lowest end (e.g. 10^-15) can lead to totally different numbers with a seeded random engine.
- std.container, due to pointers (I doubt this can/will be fixed in the near future)
- std.math, due to inline assembly and other tricks (this can be fixed and I will submit a couple of PRs soon)
- the speed of the CTFE engine (see e.g. [2] for std.regex)

That being said, CTFE is of course used to compute mixins, constants and to specialize functions. Moreover, thanks to all the speed-ups described in the second blog post, constructing the intervals takes about 0.1 ms, so for the majority of users it shouldn't even be noticeable, and the tiny minority for whom it matters can still manually inline the intervals.

[1] http://forum.dlang.org/post/hjaiavlfkoamenidomsa forum.dlang.org
[2] http://forum.dlang.org/post/iqcrnokalollrejcabad forum.dlang.org
Aug 23 2016
prev sibling next sibling parent reply Meta <jared771 gmail.com> writes:
On Monday, 22 August 2016 at 15:34:47 UTC, Seb wrote:
 Hey all,

 I am proud to publish a report of my GSoC work as two extensive 
 blog posts, which explain non-uniform random sampling and the 
 mir.random.flex package (part of Mir > 0.16-beta2):

 http://blog.mir.dlang.io/random/2016/08/19/intro-to-random-sampling.html
 http://blog.mir.dlang.io/random/2016/08/22/transformed-density-rejection-sampling.html
It's really nice to see that GSoC has been such a huge success so far. Everyone has done some really great work.
 Over the next weeks and months I will continue my work on 
 mir.random, which is supposed to supersede std.random, so in 
 case you aren’t following the Mir project [1, 2], stay tuned!

 Best regards,

 Seb

 [1] https://github.com/libmir/mir
 [2] https://twitter.com/libmir
I'm curious, have you come up with a solution to what is probably the biggest problem with std.random, i.e., it uses value types and copying? I remember a lot of discussion about this and it seemed at the time that the only really solid solution was to make all random generators classes, though I think DIP1000 *may* help here.
Aug 22 2016
parent reply Ilya Yaroshenko <ilyayaroshenko gmail.com> writes:
On Monday, 22 August 2016 at 18:09:28 UTC, Meta wrote:
 On Monday, 22 August 2016 at 15:34:47 UTC, Seb wrote:
 Hey all,

 I am proud to publish a report of my GSoC work as two 
 extensive blog posts, which explain non-uniform random 
 sampling and the mir.random.flex package (part of Mir > 
 0.16-beta2):

 http://blog.mir.dlang.io/random/2016/08/19/intro-to-random-sampling.html
 http://blog.mir.dlang.io/random/2016/08/22/transformed-density-rejection-sampling.html
It's really nice to see that GSoC has been such a huge success so far. Everyone has done some really great work.
 Over the next weeks and months I will continue my work on 
 mir.random, which is supposed to supersede std.random, so in 
 case you aren’t following the Mir project [1, 2], stay tuned!

 Best regards,

 Seb

 [1] https://github.com/libmir/mir
 [2] https://twitter.com/libmir
I'm curious, have you come up with a solution to what is probably the biggest problem with std.random, i.e., it uses value types and copying? I remember a lot of discussion about this and it seemed at the time that the only really solid solution was to make all random generators classes, though I think DIP1000 *may* help here.
This is an API problem, and will not be fixed. Making D a scripting-like language is bad for science. For example, druntime (Fibers and Mutexes) is useless because it is too high-level and poorly featured at the same time.

The main problem with std.random is that std.random.uniform is broken in the context of non-uniform sampling. The same holds for 99% of uniform algorithms: they simply ignore the fact that, for example, for [0, 1) the exponent and mantissa should be generated separately, with appropriate probabilities for the exponent.
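To illustrate the exponent/mantissa point, here is a sketch of my own in Python (not Mir's implementation): for every representable float in [0, 1) to be reachable, the binary exponent should be drawn geometrically and the mantissa bits uniformly, rather than dividing a fixed-width random integer by a constant.

```python
import random

def uniform01_dense():
    """Draw a float in [0, 1) so that even the tiny values near zero
    are reachable.

    The exponent is drawn geometrically: the interval [2^e, 2^(e+1))
    has probability 2^e under the uniform distribution, so each further
    halving is half as likely. The 52 mantissa bits are then drawn
    uniformly. The common `randbits(53) / 2**53` approach can never
    produce most of the representable doubles close to zero.
    """
    exponent = -1
    while random.getrandbits(1) == 0:  # halve with probability 1/2 each step
        exponent -= 1
        if exponent <= -1074:          # smallest subnormal exponent; stop
            break
    mantissa = random.getrandbits(52)
    # 1.mantissa * 2^exponent lies in [2^exponent, 2^(exponent+1)) ⊂ (0, 1)
    return (1.0 + mantissa / 2**52) * 2.0**exponent
```

With probability 1/2 the result lies in [0.5, 1), with probability 1/4 in [0.25, 0.5), and so on, which matches the uniform measure of each dyadic interval.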
Aug 22 2016
parent Joseph Rushton Wakeling <joseph.wakeling webdrake.net> writes:
On Tuesday, 23 August 2016 at 05:40:24 UTC, Ilya Yaroshenko wrote:
 This is an API problem, and will not be fixed. Making D 
 scripting like language is bad for Science. For example, 
 druntime (Fibers and Mutexes) is useless because it is too high 
 level and poor featured in the same time.
Yes, this is not an issue that is immediately fixable without introducing other issues (e.g. defining everything as a class brings immediate issues related to heap allocation).

In the long run it would obviously be nice to address that issue, but it would have been a major blocker to throw it onto Seb's shoulders (as we all recognized quite quickly when we started discussing it). It was (rightly) not the focus of this project.

For this reason, the random distributions introduced in Mir are implemented as functors (as is the case with random distributions in C++11 <random>) rather than as ranges.
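The functor approach can be sketched like so (Python for illustration; the class name and the choice of Box-Muller are mine, not Mir's API): the distribution's parameters and scratch state live in the object, and the engine is passed in on each call, as in C++11 <random>.

```python
import math
import random

class NormalDist:
    """Functor-style normal distribution (hypothetical sketch):
    parameters and scratch state live in the object, the random
    engine is an argument of each call."""

    def __init__(self, mean=0.0, stddev=1.0):
        self.mean = mean
        self.stddev = stddev
        self._spare = None  # cached second value from Box-Muller

    def __call__(self, rng):
        if self._spare is not None:
            z, self._spare = self._spare, None
        else:
            u1 = 1.0 - rng.random()  # shift into (0, 1] to avoid log(0)
            u2 = rng.random()
            r = math.sqrt(-2.0 * math.log(u1))
            z = r * math.cos(2.0 * math.pi * u2)
            self._spare = r * math.sin(2.0 * math.pi * u2)
        return self.mean + self.stddev * z

# The engine is owned by the caller; the distribution object never
# copies any generator state.
rng = random.Random(42)
dist = NormalDist(mean=5.0, stddev=2.0)
samples = [dist(rng) for _ in range(10000)]
```

The design sidesteps the value-type/copying problem for the distributions themselves: only the engine carries the stream state, and it is always passed by reference.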
 The main problem with std.random is that std.random.uniform is 
 broken in context of non-uniform sampling. The same situation 
 is for 99% uniform algorithms. They just ignore the fact that 
 for example, for [0, 1) exponent and mantissa should be 
 generated separately with appropriate probabilities for for 
 exponent
Just as a point of terminology: we should make clear here that this is about sampling from a non-uniform distribution. It shouldn't be confused with "sampling" in the sense of what is done by (say) `RandomSample`.
Aug 23 2016
prev sibling next sibling parent Mike Parker <aldacron gmail.com> writes:
On Monday, 22 August 2016 at 15:34:47 UTC, Seb wrote:
 Hey all,

 I am proud to publish a report of my GSoC work as two extensive 
 blog posts, which explain non-uniform random sampling and the 
 mir.random.flex package (part of Mir > 0.16-beta2):

 http://blog.mir.dlang.io/random/2016/08/19/intro-to-random-sampling.html
 http://blog.mir.dlang.io/random/2016/08/22/transformed-density-rejection-sampling.html
Reddit: https://www.reddit.com/r/programming/comments/4z4sp7/an_introduction_to_nonuniform_random_sampling/
Aug 22 2016
prev sibling next sibling parent =?UTF-8?B?Tm9yZGzDtnc=?= <per.nordlow gmail.com> writes:
On Monday, 22 August 2016 at 15:34:47 UTC, Seb wrote:
 I am proud to publish a report of my GSoC work as two extensive 
 blog posts, which explain non-uniform random sampling and the 
 mir.random.flex package (part of Mir > 0.16-beta2):
Fantastic work!
Aug 23 2016
prev sibling next sibling parent reply =?UTF-8?B?Tm9yZGzDtnc=?= <per.nordlow gmail.com> writes:
On Monday, 22 August 2016 at 15:34:47 UTC, Seb wrote:
 http://blog.mir.dlang.io/random/2016/08/19/intro-to-random-sampling.html
 http://blog.mir.dlang.io/random/2016/08/22/transformed-density-rejection-sampling.html
Found a typo: search for "performance boost performance boost"
Aug 23 2016
parent Seb <seb wilzba.ch> writes:
On Tuesday, 23 August 2016 at 08:10:50 UTC, Nordlöw wrote:
 On Monday, 22 August 2016 at 15:34:47 UTC, Seb wrote:
 http://blog.mir.dlang.io/random/2016/08/19/intro-to-random-sampling.html
 http://blog.mir.dlang.io/random/2016/08/22/transformed-density-rejection-sampling.html
Found at typo: Search for "performance boost performance boost"
Thanks! Fixed.

Btw, in case someone is interested: the blog posts are written in GitHub-flavored Markdown with a couple of custom Jekyll extensions (e.g. the math formulas are rendered on the server with KaTeX [1]):

https://github.com/libmir/blog/blob/master/_posts/2016-08-19-transformed-density-rejection-sampling.md

[1] https://khan.github.io/KaTeX/
Aug 23 2016
prev sibling next sibling parent reply tn <no email.com> writes:
On Monday, 22 August 2016 at 15:34:47 UTC, Seb wrote:
 http://blog.mir.dlang.io/random/2016/08/22/transformed-density-rejection-sampling.html
What are the columns "mu time" and "sigma^2 time" of the benchmark table in the Sampling subsection?
Aug 23 2016
parent reply Seb <seb wilzba.ch> writes:
On Tuesday, 23 August 2016 at 11:58:53 UTC, tn wrote:
 On Monday, 22 August 2016 at 15:34:47 UTC, Seb wrote:
 http://blog.mir.dlang.io/random/2016/08/22/transformed-density-rejection-sampling.html
What are the columns "mu time" and "sigma^2 time" of the benchmark table in the Sampling subsection?
In statistics, mu is often used to denote the average, sigma the standard deviation and sigma^2 the variance. It was absolutely unnecessary to use this notation here (especially because the standard deviation, not the variance, was measured). I fixed that and also added how many samples were generated per run (10M), thanks!

Btw, in case you didn't find the link within the article, the benchmark is available within Mir:

https://github.com/libmir/mir/blob/master/benchmarks/flex/normal_dist.d
Aug 23 2016
next sibling parent reply Seb <seb wilzba.ch> writes:
On Tuesday, 23 August 2016 at 12:12:30 UTC, Seb wrote:
 [...]
 I fixed that and also added how many samples were generated per 
 run (10M), thanks!
Btw I quickly added the run-time for C++ (for G++ and clang++ with -O3) and it doesn't look that bad:

http://blog.mir.dlang.io/random/2016/08/22/transformed-density-rejection-sampling.html#sampling
https://github.com/libmir/mir/pull/307

Note that for comparisons between languages the speed of the random engine plays a major role, and that superior performance was never a top priority for this generic method (statistical quality is far more important). However, I will work more on being faster than <random> for common distributions over the next weeks ;-)
Aug 23 2016
parent reply Johan Engelen <j j.nl> writes:
On Tuesday, 23 August 2016 at 13:01:29 UTC, Seb wrote:
 On Tuesday, 23 August 2016 at 12:12:30 UTC, Seb wrote:
 [...]
 I fixed that and also added how many samples were generated 
 per run (10M), thanks!
Btw I quickly added the run-time for C++ (for G++ and clang++ with -O3)
I think you should add compiler/runtime-library versions for all measurements, and all relevant flags. I'm happy to see you are using LDC for benchmarking; don't forget to mention the LLVM version that LDC was built against. You already wrote `-O3` for the C++ measurements; did you not use -mcpu=native or some other performance-changing flags?

Cheers,
  Johan
Aug 23 2016
parent Seb <seb wilzba.ch> writes:
On Tuesday, 23 August 2016 at 16:54:18 UTC, Johan Engelen wrote:
 On Tuesday, 23 August 2016 at 13:01:29 UTC, Seb wrote:
 On Tuesday, 23 August 2016 at 12:12:30 UTC, Seb wrote:
 [...]
 I fixed that and also added how many samples were generated 
 per run (10M), thanks!
Btw I quickly added the run-time for C++ (for G++ and clang++ with -O3)
I think you should add compiler/runtimelib versions for all measurements, and all relevant flags. I'm happy to see you are using LDC for benchmarking;
What else? ;-)
 don't  forget to mention the LLVM version for LDC's version.
Oh, that's in the header of the benchmark script:

https://github.com/libmir/mir/blob/master/benchmarks/flex/normal_dist.d

I excluded it from the post to avoid visual noise, as I figured that those who are interested will check it anyhow.
 You already wrote `-O3` for the C++ measurements,
Now that I have added the C++ flags directly in the script, it probably got confusing. Sorry, I will replace the C++ flags with a link to the benchmark soon.
 did you not use -mcpu=native or some other performance changing 
 flags?
See the benchmark file above.
Aug 23 2016
prev sibling parent reply tn <no email.com> writes:
On Tuesday, 23 August 2016 at 12:12:30 UTC, Seb wrote:
 On Tuesday, 23 August 2016 at 11:58:53 UTC, tn wrote:
 On Monday, 22 August 2016 at 15:34:47 UTC, Seb wrote:
 http://blog.mir.dlang.io/random/2016/08/22/transformed-density-rejection-sampling.html
What are the columns "mu time" and "sigma^2 time" of the benchmark table in the Sampling subsection?
In statistics mu is often used to describe the average, sigma the standard deviation and sigma^2 the variance. It was absolutely unnecessary to use this notation here (especially because standard deviation, not variance was measured). I fixed that and also added how many samples were generated per run (10M), thanks!
Thanks for the clarification and the fix. (I am familiar with the usage of mu and sigma, but somehow I first thought that the columns corresponded to two different measurements. After the initial confusion I realized the correct meaning, but still wasn't sure about it, due to the contradiction between the name of the column (sigma^2, not sigma) and the values (ms, not ms^2 or something). So I thought it would be better to ask.)

Another question: you mention that statistical quality is important, but it is not clear whether flex has better or worse quality than Box-Muller and Ziggurat in the case of sampling from the normal distribution. Or is the difference negligible? (I realize that the real strength of flex is its versatility.)
Aug 23 2016
parent Seb <seb wilzba.ch> writes:
On Tuesday, 23 August 2016 at 13:12:28 UTC, tn wrote:
 Another question: You mention that statistical quality is 
 important, but it is not clear if flex has better or worse 
 quality than Box-Muller and Ziggurat in the case of sampling 
 from normal distribution. Or is the difference negligible? (I 
 realize that the real strength of flex is its versatility.)
tl;dr: yes, it's negligible for the normal distribution.

Excellent question! I didn't cover this in much detail because:

(1) It has already been done extensively in the literature. For example, see Table IV (page 32) of "Gaussian Random Number Generators" by Thomas et al. (2007) for chi-squared and high-sigma tests [1]. (Btw, in all random libraries I looked at, e.g. <random>, the Box-Muller method is used for the normal distribution.)

(2) The authors of the Tinflex algorithm wrote their own library, UNU.RAN [2], to show that the transformed density rejection method is close to the inversion method (without numerical errors, it would be "perfect"). Under the hood, UNU.RAN uses polynomial interpolation of the inverse CDF, which is needed to automatically compute statistical properties. This method is on our roadmap [3].

(3) The hat/squeeze and cumulative histograms of nearly all example distributions at [4] look pretty good [5] (if the cumulative histogram is identical to the CDF curve, the errors are negligible).

(4) This is just a preview release and requires [3] for automatic testing of user-generated distributions.

[1] http://www.doc.ic.ac.uk/~wl/papers/07/csur07dt.pdf
[2] http://statmath.wu.ac.at/unuran/
[3] https://github.com/libmir/mir/issues/46
[4] https://github.com/libmir/mir/tree/master/examples/flex_plot
[5] https://drive.google.com/open?id=0BwdiZp7qSaBhZXRJNHhSN1RHR3c
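As a rough illustration of the kind of chi-squared check used in that literature (a minimal Python sketch of my own, not the test suite from Thomas et al.): bin the samples, compare observed against expected counts under N(0, 1), and check the statistic against a chi-squared quantile.

```python
import math
import random

def normal_cdf(x):
    """CDF of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def chi2_stat(samples, edges):
    """Pearson chi-squared statistic of `samples` against N(0, 1),
    with bin boundaries `edges` running from -inf to +inf."""
    n = len(samples)
    stat = 0.0
    for lo, hi in zip(edges, edges[1:]):
        observed = sum(lo <= x < hi for x in samples)
        expected = n * (normal_cdf(hi) - normal_cdf(lo))
        stat += (observed - expected) ** 2 / expected
    return stat

rng = random.Random(7)
samples = [rng.gauss(0.0, 1.0) for _ in range(10000)]
edges = [-math.inf, -2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0, math.inf]
# 8 bins -> 7 degrees of freedom; the 0.999 quantile of chi^2_7 is
# about 24.3, so a correct sampler should almost always stay below it.
stat = chi2_stat(samples, edges)
```

A biased sampler inflates the statistic far beyond the quantile, which is how such tests separate, say, a truncated Box-Muller variant from a correct one.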
Aug 23 2016
prev sibling parent reply Stefan <dl ng.rocks> writes:
Seb, this is awesome!

On Monday, 22 August 2016 at 15:34:47 UTC, Seb wrote:
 http://blog.mir.dlang.io/random/2016/08/19/intro-to-random-sampling.html
Code samples with too-long lines do not render appropriately on my browsers (current Chrome on Linux and a pretty outdated Firefox on Windows :) ). Taking this example:

S sample(S, RNG, Pdf, Hat, HatInv, Squeeze)(ref RNG gen, Pdf pdf, Hat hat, HatInv hatInvCDF, Squeeze sq) {

the second line of parameters does not render at all.

Will definitely play around with your code!
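For anyone skimming the thread, the signature quoted above belongs to a rejection-with-inversion sampler. Its core loop can be sketched in Python (an illustration under my own naming, not the blog's exact code):

```python
import random

def sample(gen, pdf, hat, hat_inv_cdf, squeeze):
    """Rejection sampling with inversion: draw a candidate x from the
    hat distribution via its inverse CDF, then accept or reject it."""
    while True:
        u = gen.random()
        x = hat_inv_cdf(gen.random())
        hx = hat(x)
        # cheap squeeze test first: accept without evaluating the pdf
        if u * hx <= squeeze(x):
            return x
        # exact (potentially expensive) test against the density itself
        if u * hx <= pdf(x):
            return x

# Toy target: density f(x) = 2x on [0, 1], with a constant hat of
# height 2 (the uniform hat's inverse CDF is the identity) and a
# trivial squeeze. Expected acceptance rate is 1/2.
gen = random.Random(1)
xs = [sample(gen, lambda x: 2.0 * x, lambda x: 2.0,
             lambda u: u, lambda x: 0.0)
      for _ in range(20000)]
```

The squeeze test is what makes the method fast in practice: for a tight squeeze, most samples are accepted without ever evaluating the (possibly expensive) pdf.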
Aug 23 2016
parent Seb <seb wilzba.ch> writes:
On Tuesday, 23 August 2016 at 22:48:31 UTC, Stefan wrote:
 Seb, this is awesome!
Thanks :)
 On Monday, 22 August 2016 at 15:34:47 UTC, Seb wrote:
 http://blog.mir.dlang.io/random/2016/08/19/intro-to-random-sampling.html
code samples with too long lines do not render appropriate on my browsers. (current chrome on linux and a pretty outdated firefox on windows :) ) Taking this example: S sample(S, RNG, Pdf, Hat, HatInv, Squeeze)(ref RNG gen, Pdf pdf, Hat hat, HatInv hatInvCDF, Squeeze sq) { Then the second line of parameters does not render at all.
It was a weird white-space issue. Thanks for letting me know! I fixed it and it now looks normal on the "latest, greatest" Chromium, FF and Chrome on Android: http://blog.mir.dlang.io/random/2016/08/19/intro-to-random-sampling.html#rejection-with-inversion http://blog.mir.dlang.io/random/2016/08/19/intro-to-random-sampling.html#squeeze-functions
 Will definitly play around with your code!
Don't hesitate to ping me / us with any questions -> https://github.com/libmir/mir/issues
Aug 23 2016