digitalmars.D.announce - [GSoC] Mir.random.flex - Generic non-uniform random sampling
- Seb (27/27) Aug 22 2016 Hey all,
- jmh530 (3/9) Aug 22 2016 Thanks for the well-done blog posts, especially the first one.
- Seb (24/38) Aug 23 2016 If you refer to whether the intervals can be calculated at CT,
- Meta (9/22) Aug 22 2016 It's really nice to see that GSoC has been such a huge success so
- Ilya Yaroshenko (10/38) Aug 22 2016 This is an API problem, and will not be fixed. Making D scripting
- Joseph Rushton Wakeling (15/25) Aug 23 2016 Yes, this is not an issue that is immediately fixable without
- Mike Parker (3/9) Aug 22 2016 Reddit:
- Nordlöw (2/5) Aug 23 2016 Fantastic work!
- Nordlöw (3/5) Aug 23 2016 Found a typo:
- Seb (8/13) Aug 23 2016 Thanks! Fixed.
- tn (3/4) Aug 23 2016 What are the columns "mu time" and "sigma^2 time" of the
- Seb (9/13) Aug 23 2016 In statistics mu is often used to describe the average, sigma the
- Seb (11/14) Aug 23 2016 Btw I quickly added the run-time for C++ (for G++ and clang++
- Johan Engelen (9/15) Aug 23 2016 I think you should add compiler/runtimelib versions for all
- Seb (10/25) Aug 23 2016 Oh that's at the header of the benchmark script:
- tn (14/26) Aug 23 2016 Thanks for the clarification and the fix.
- Seb (27/32) Aug 23 2016 tl;dr: yes, it's negligible for the normal distribution.
- Stefan (12/13) Aug 23 2016 code samples with too long lines do not render appropriate on my
- Seb (9/22) Aug 23 2016 It was a weird white-space issue. Thanks for letting me know!
Hey all,

I am proud to publish a report of my GSoC work as two extensive blog posts, which explain non-uniform random sampling and the mir.random.flex package (part of Mir > 0.16-beta2):

http://blog.mir.dlang.io/random/2016/08/19/intro-to-random-sampling.html
http://blog.mir.dlang.io/random/2016/08/22/transformed-density-rejection-sampling.html

Before I start my personal retrospective, I wanted to use this opportunity to give a huge thanks and acknowledgement to my two awesome mentors Ilya Yaroshenko (9il) and Joseph Wakeling (WebDrake). Since I wrote my first line of D code this February, I have learned quite a lot during the last few months. GitHub allows listing all merged contributions, which shows that I got quite familiar with dlang over time:

https://github.com/search?l=&o=desc&q=author%3Awilzbach+is%3Amerged+user%3Adlang&ref=advsearch&s=comments&type=Issues&utf8=%E2%9C%93

… and with other D repositories:

https://github.com/search?l=D&o=desc&q=author%3Awilzbach+is%3Amerged&ref=searchresults&s=comments&type=Issues&utf8=%E2%9C%93

I am pretty sure you now know me from the NG, GitHub, IRC, Twitter, Bugzilla, DConf16, the DWiki, #d on StackOverflow or /r/d_language, so I will skip a further introduction ;-)

Over the next weeks and months I will continue my work on mir.random, which is supposed to supersede std.random, so in case you aren't following the Mir project [1, 2], stay tuned!

Best regards,
Seb

[1] https://github.com/libmir/mir
[2] https://twitter.com/libmir
Aug 22 2016
On Monday, 22 August 2016 at 15:34:47 UTC, Seb wrote:
> Hey all, I am proud to publish a report of my GSoC work as two extensive blog posts, which explain non-uniform random sampling and the mir.random.flex package (part of Mir > 0.16-beta2):
>
> http://blog.mir.dlang.io/random/2016/08/19/intro-to-random-sampling.html
> http://blog.mir.dlang.io/random/2016/08/22/transformed-density-rejection-sampling.html

Thanks for the well-done blog posts, especially the first one.

Does your implementation make any use of CTFE?
Aug 22 2016
On Monday, 22 August 2016 at 17:13:10 UTC, jmh530 wrote:
> Thanks for the well-done blog posts, especially the first one.

I am glad to hear this!

> Does your implementation make any use of CTFE?

If you are referring to whether the intervals can be calculated at CT, unfortunately CTFE can't be used, for four main reasons:

- FP math at CT (it's already hard to deal with at RT, see e.g. my recent complaint [1]) - the problem is that the Flex algorithm is very sensitive to numerical errors, so an erroneous change at the lowest end (e.g. 10^-15) can lead to totally different numbers with a seeded random engine
- std.container, due to pointers (I doubt this can/will be fixed in the near future)
- std.math, due to inline assembly and other tricks (this can be fixed and I will submit a couple of PRs soon)
- the speed of the CTFE engine (see e.g. [2] for std.regex)

That being said, CTFE is of course used to compute mixins, constants and specialized functions. Moreover, thanks to all the speed-ups described in the second blog post, constructing the intervals takes about 0.1 ms, so for the majority of users it shouldn't even be noticeable, and the tiny minority for whom it matters can still manually inline the intervals.

[1] http://forum.dlang.org/post/hjaiavlfkoamenidomsa forum.dlang.org
[2] http://forum.dlang.org/post/iqcrnokalollrejcabad forum.dlang.org
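To make the sensitivity concrete, here is a minimal hand-rolled rejection sampler (a sketch for illustration only, not mir.random.flex's API; the constant hat and the function name are made up). Because every accept/reject decision consumes random draws, a hat perturbed by as little as ~1e-15 can flip a single decision near the boundary, after which the whole seeded stream diverges:

    import std.math : exp;
    import std.random : Random, uniform;

    // Sketch only (not mir.random.flex's API): a rejection sampler for a
    // truncated standard normal with a constant hat over [-3, 3).
    double[] rejectionNormal(size_t n, uint seed, double hatScale)
    {
        auto gen = Random(seed);
        double[] samples;
        while (samples.length < n)
        {
            immutable x = uniform(-3.0, 3.0, gen);      // proposal under the hat
            immutable u = uniform(0.0, hatScale, gen);  // vertical coordinate
            if (u <= exp(-0.5 * x * x))                 // unnormalized normal pdf
                samples ~= x;
        }
        return samples;
    }

    void main()
    {
        // Same seed, hat perturbed by 1e-15: once a decision lands within
        // ~1e-15 of the acceptance boundary and flips, the streams diverge.
        auto a = rejectionNormal(1000, 42, 1.0);
        auto b = rejectionNormal(1000, 42, 1.0 + 1e-15);
    }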
Aug 23 2016
On Monday, 22 August 2016 at 15:34:47 UTC, Seb wrote:
> Hey all, I am proud to publish a report of my GSoC work as two extensive blog posts, which explain non-uniform random sampling and the mir.random.flex package (part of Mir > 0.16-beta2):
>
> http://blog.mir.dlang.io/random/2016/08/19/intro-to-random-sampling.html
> http://blog.mir.dlang.io/random/2016/08/22/transformed-density-rejection-sampling.html

It's really nice to see that GSoC has been such a huge success so far. Everyone has done some really great work.

> Over the next weeks and months I will continue my work on mir.random, which is supposed to supersede std.random, so in case you aren't following the Mir project [1, 2], stay tuned!

I'm curious, have you come up with a solution to what is probably the biggest problem with std.random, i.e., it uses value types and copying? I remember a lot of discussion about this and it seemed at the time that the only really solid solution was to make all random generators classes, though I think DIP1000 *may* help here.
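As a side note for readers, the problem Meta refers to can be shown in a few lines with plain std.random (the function name here is just for illustration): passing a generator by value copies its internal state, so the caller's generator never advances and the same numbers come out again.

    import std.random : Random, uniform, unpredictableSeed;

    int rollByValue(Random gen)   // by value: the state is copied
    {
        return uniform(1, 7, gen);
    }

    void main()
    {
        auto gen = Random(unpredictableSeed);
        immutable a = rollByValue(gen);
        immutable b = rollByValue(gen);
        assert(a == b);   // both calls saw an identical copy of `gen`
        // Taking the generator by `ref` (as Phobos' own primitives do) avoids this.
    }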
Aug 22 2016
On Monday, 22 August 2016 at 18:09:28 UTC, Meta wrote:
> I'm curious, have you come up with a solution to what is probably the biggest problem with std.random, i.e., it uses value types and copying? I remember a lot of discussion about this and it seemed at the time that the only really solid solution was to make all random generators classes, though I think DIP1000 *may* help here.

This is an API problem, and will not be fixed. Making D a scripting-like language is bad for science. For example, druntime (Fibers and Mutexes) is useless because it is too high-level and poorly featured at the same time.

The main problem with std.random is that std.random.uniform is broken in the context of non-uniform sampling. The same is true of 99% of uniform algorithms: they just ignore the fact that, for example, for [0, 1) the exponent and mantissa should be generated separately, with appropriate probabilities for the exponent.
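A rough sketch of the exponent/mantissa point (a hypothetical helper, not mir's implementation): under a uniform [0, 1) distribution the interval [2^-k, 2^-(k-1)) carries probability 2^-k, so one can draw the binary exponent with geometrically decreasing probability and then fill the mantissa uniformly inside the chosen interval, which keeps full relative precision for small values.

    import std.random : Random, uniform, unpredictableSeed;

    // Sketch only: pick the interval [hi/2, hi) with probability hi/2,
    // then draw uniformly inside it.
    double uniform01Dense(ref Random gen)
    {
        double hi = 1.0;
        // with probability 1/2 stay in [0.5, 1), with 1/4 in [0.25, 0.5), ...
        while (hi > double.min_normal && uniform(0, 2, gen) == 0)
            hi *= 0.5;
        return uniform(hi * 0.5, hi, gen);  // uniform mantissa inside [hi/2, hi)
    }

    void main()
    {
        auto gen = Random(unpredictableSeed);
        immutable x = uniform01Dense(gen);
        assert(x > 0 && x < 1);
    }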
Aug 22 2016
On Tuesday, 23 August 2016 at 05:40:24 UTC, Ilya Yaroshenko wrote:
> This is an API problem, and will not be fixed. Making D a scripting-like language is bad for science. For example, druntime (Fibers and Mutexes) is useless because it is too high-level and poorly featured at the same time.

Yes, this is not an issue that is immediately fixable without introducing other issues (e.g. defining everything as a class brings immediate issues related to heap allocation). In the long run it would obviously be nice to address that issue, but it would have been a major blocker to throw that onto Seb's shoulders (as we all recognized quite quickly when we started discussing it). It was (rightly) not the focus of this project.

For this reason, the random distributions introduced in mir are implemented as functors (as is the case with random distributions in C++11 <random>) rather than as ranges.

> The main problem with std.random is that std.random.uniform is broken in the context of non-uniform sampling. The same is true of 99% of uniform algorithms: they just ignore the fact that, for example, for [0, 1) the exponent and mantissa should be generated separately, with appropriate probabilities for the exponent.

Just as a point of terminology: we should make clear here that this is about sampling from a non-uniform distribution. It shouldn't be confused with "sampling" in the sense of what is done by (say) `RandomSample`.
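For readers who haven't used them, the functor style looks roughly like this (hypothetical names, not mir's actual types): the distribution object stores its parameters, and each call takes the generator by ref, so the generator's state advances in place and is never copied.

    import std.random : Random, uniform, unpredictableSeed;

    // Sketch of the functor pattern, analogous to C++11 <random>.
    struct UniformVariable(T)
    {
        T a, b;
        this(T a, T b) { this.a = a; this.b = b; }
        T opCall(G)(ref G gen) { return uniform(a, b, gen); }
    }

    void main()
    {
        auto gen = Random(unpredictableSeed);
        auto dist = UniformVariable!double(0.0, 2.0);
        auto x = dist(gen);   // draws one variate; `gen` advances in place
        auto y = dist(gen);   // a different draw, because the state moved on
    }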
Aug 23 2016
On Monday, 22 August 2016 at 15:34:47 UTC, Seb wrote:
> I am proud to publish a report of my GSoC work as two extensive blog posts, which explain non-uniform random sampling and the mir.random.flex package (part of Mir > 0.16-beta2):
>
> http://blog.mir.dlang.io/random/2016/08/19/intro-to-random-sampling.html
> http://blog.mir.dlang.io/random/2016/08/22/transformed-density-rejection-sampling.html

Reddit: https://www.reddit.com/r/programming/comments/4z4sp7/an_introduction_to_nonuniform_random_sampling/
Aug 22 2016
On Monday, 22 August 2016 at 15:34:47 UTC, Seb wrote:
> I am proud to publish a report of my GSoC work as two extensive blog posts, which explain non-uniform random sampling and the mir.random.flex package (part of Mir > 0.16-beta2):

Fantastic work!
Aug 23 2016
On Monday, 22 August 2016 at 15:34:47 UTC, Seb wrote:
> http://blog.mir.dlang.io/random/2016/08/19/intro-to-random-sampling.html
> http://blog.mir.dlang.io/random/2016/08/22/transformed-density-rejection-sampling.html

Found a typo: search for "performance boost performance boost".
Aug 23 2016
On Tuesday, 23 August 2016 at 08:10:50 UTC, Nordlöw wrote:
> Found a typo: search for "performance boost performance boost".

Thanks! Fixed.

Btw, in case someone is interested, the blog posts are written in GitHub-flavored Markdown with a couple of custom Jekyll extensions (e.g. the math formulas are rendered on the server with KaTeX [1]):

https://github.com/libmir/blog/blob/master/_posts/2016-08-19-transformed-density-rejection-sampling.md

[1] https://khan.github.io/KaTeX/
Aug 23 2016
On Monday, 22 August 2016 at 15:34:47 UTC, Seb wrote:
> http://blog.mir.dlang.io/random/2016/08/22/transformed-density-rejection-sampling.html

What are the columns "mu time" and "sigma^2 time" of the benchmark table in the Sampling subsection?
Aug 23 2016
On Tuesday, 23 August 2016 at 11:58:53 UTC, tn wrote:
> What are the columns "mu time" and "sigma^2 time" of the benchmark table in the Sampling subsection?

In statistics mu is often used to denote the mean, sigma the standard deviation, and sigma^2 the variance. It was absolutely unnecessary to use this notation here (especially because the standard deviation, not the variance, was measured). I fixed that and also added how many samples were generated per run (10M), thanks!

Btw, in case you didn't find the link within the article, the benchmark is available within Mir:

https://github.com/libmir/mir/blob/master/benchmarks/flex/normal_dist.d
Aug 23 2016
On Tuesday, 23 August 2016 at 12:12:30 UTC, Seb wrote:
> [...] I fixed that and also added how many samples were generated per run (10M), thanks!

Btw, I quickly added the run time for C++ (for G++ and clang++ with -O3) and it doesn't look that bad:

http://blog.mir.dlang.io/random/2016/08/22/transformed-density-rejection-sampling.html#sampling
https://github.com/libmir/mir/pull/307

Note that for a comparison between languages the speed of the random engine plays a major role, and that superior performance was never a top priority for this generic method (statistical quality is far more important). However, I will work more on being faster than <random> for common distributions over the next weeks ;-)
Aug 23 2016
On Tuesday, 23 August 2016 at 13:01:29 UTC, Seb wrote:
> Btw, I quickly added the run time for C++ (for G++ and clang++ with -O3)

I think you should add compiler/runtimelib versions for all measurements, and all relevant flags. I'm happy to see you are using LDC for benchmarking; don't forget to mention the LLVM version for LDC's version. You already wrote `-O3` for the C++ measurements; did you not use -mcpu=native or some other performance-changing flags?

Cheers,
  Johan
Aug 23 2016
On Tuesday, 23 August 2016 at 16:54:18 UTC, Johan Engelen wrote:
> I'm happy to see you are using LDC for benchmarking;

What else? ;-)

> don't forget to mention the LLVM version for LDC's version.

Oh, that's in the header of the benchmark script:

https://github.com/libmir/mir/blob/master/benchmarks/flex/normal_dist.d

I excluded it from the post to avoid visual noise, as I thought that those who are interested will check it anyhow.

> You already wrote `-O3` for the C++ measurements;

Now that I have added the flags for C++ directly in the script, it probably got very confusing. Sorry, I will replace the C++ flags with a link to the benchmark soon.

> did you not use -mcpu=native or some other performance-changing flags?

See the benchmark file above.
Aug 23 2016
On Tuesday, 23 August 2016 at 12:12:30 UTC, Seb wrote:
> In statistics mu is often used to denote the mean, sigma the standard deviation, and sigma^2 the variance. It was absolutely unnecessary to use this notation here (especially because the standard deviation, not the variance, was measured). I fixed that and also added how many samples were generated per run (10M), thanks!

Thanks for the clarification and the fix.

(I am familiar with the usage of mu and sigma, but somehow I first thought that the columns corresponded to two different measurements. After the initial confusion I realized the correct meaning, but still wasn't sure about it due to the contradiction between the name of the column (sigma^2, not sigma) and the values (ms, not ms^2 or something). So I thought it would be better to ask.)

Another question: you mention that statistical quality is important, but it is not clear whether flex has better or worse quality than Box-Muller and Ziggurat in the case of sampling from a normal distribution. Or is the difference negligible? (I realize that the real strength of flex is its versatility.)
Aug 23 2016
On Tuesday, 23 August 2016 at 13:12:28 UTC, tn wrote:
> Another question: you mention that statistical quality is important, but it is not clear whether flex has better or worse quality than Box-Muller and Ziggurat in the case of sampling from a normal distribution. Or is the difference negligible? (I realize that the real strength of flex is its versatility.)

tl;dr: yes, it's negligible for the normal distribution.

Excellent question - I didn't cover this in much detail because:

(1) it has already been done extensively in the literature. For example, see Table IV (page 32) of "Gaussian Random Number Generators" by Thomas et al. (2007) for chi-squared and high-sigma tests [1] (btw, in all the random libraries I looked at, e.g. <random>, the Box-Muller method is used for the normal distribution)

(2) the authors of the Tinflex algorithm wrote their own library, UNU.RAN [2], to prove that the transformed density rejection method is close to the inversion method (without numerical errors, it would be "perfect"). Under the hood, UNU.RAN uses polynomial interpolation of the inverse CDF, which is needed to be able to automatically compute statistical properties. This method is on our roadmap [3].

(3) the hat/squeeze and cumulative histograms of nearly all example distributions at [4] look pretty good [5] (if the cumulative histogram is identical to the CDF curve, the errors are negligible)

(4) this is just a preview release and requires [3] for automatic testing of user-generated distributions

[1] http://www.doc.ic.ac.uk/~wl/papers/07/csur07dt.pdf
[2] http://statmath.wu.ac.at/unuran/
[3] https://github.com/libmir/mir/issues/46
[4] https://github.com/libmir/mir/tree/master/examples/flex_plot
[5] https://drive.google.com/open?id=0BwdiZp7qSaBhZXRJNHhSN1RHR3c
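For anyone who wants to run such a check themselves, here is a rough sketch of a chi-squared goodness-of-fit statistic for an N(0, 1) sampler (a hypothetical helper, not the test code from [1] or UNU.RAN): bin the samples, compare observed counts with the counts expected from the normal CDF, and sum the normalized squared deviations.

    import std.algorithm : max, min;
    import std.mathspecial : normalDistribution;
    import std.random : Random, uniform;

    // Sketch only: chi-squared statistic over `bins` equal-width bins on (-4, 4).
    double chiSquaredStatistic(alias sampler)(size_t n, size_t bins, ref Random gen)
    {
        auto counts = new size_t[](bins);
        foreach (_; 0 .. n)
        {
            immutable x = sampler(gen);                      // one N(0,1) draw
            // map (-4, 4) onto the bins, clamping the tails into the edge bins
            immutable idx = cast(ptrdiff_t)((x + 4) / 8 * bins);
            counts[min(max(idx, 0), cast(ptrdiff_t) bins - 1)]++;
        }
        double stat = 0;
        foreach (i; 0 .. bins)
        {
            immutable lo = -4.0 + 8.0 * i / bins;            // bin boundaries
            immutable hi = -4.0 + 8.0 * (i + 1) / bins;
            immutable expected = n * (normalDistribution(hi) - normalDistribution(lo));
            immutable dev = counts[i] - expected;
            stat += dev * dev / expected;
        }
        return stat;  // compare against a chi-squared quantile with bins - 1 dof
    }

    void main()
    {
        import std.stdio : writeln;

        // crude N(0,1) sampler just to exercise the check (sum of 12 uniforms)
        static double crude(ref Random g)
        {
            double s = 0;
            foreach (_; 0 .. 12) s += uniform(0.0, 1.0, g);
            return s - 6;
        }

        auto gen = Random(42);
        writeln(chiSquaredStatistic!crude(100_000, 50, gen));
    }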
Aug 23 2016
Seb, this is awesome!

On Monday, 22 August 2016 at 15:34:47 UTC, Seb wrote:
> http://blog.mir.dlang.io/random/2016/08/19/intro-to-random-sampling.html

Code samples with overly long lines do not render properly in my browsers (current Chrome on Linux and a pretty outdated Firefox on Windows :) ). Taking this example:

    S sample(S, RNG, Pdf, Hat, HatInv, Squeeze)(ref RNG gen, Pdf pdf, Hat hat, HatInv hatInvCDF, Squeeze sq) {

the second line of parameters does not render at all.

Will definitely play around with your code!
Aug 23 2016
On Tuesday, 23 August 2016 at 22:48:31 UTC, Stefan wrote:
> Seb, this is awesome!

Thanks :)

> Code samples with overly long lines do not render properly in my browsers. [...] the second line of parameters does not render at all.

It was a weird white-space issue. Thanks for letting me know! I fixed it and it now looks normal on the "latest, greatest" Chromium, FF and Chrome on Android:

http://blog.mir.dlang.io/random/2016/08/19/intro-to-random-sampling.html#rejection-with-inversion
http://blog.mir.dlang.io/random/2016/08/19/intro-to-random-sampling.html#squeeze-functions

> Will definitely play around with your code!

Don't hesitate to ping me / us with any questions -> https://github.com/libmir/mir/issues
Aug 23 2016