digitalmars.D.announce - [GSoC] Mir.random.flex - Generic non-uniform random sampling
- Seb (27/27) Aug 22 2016 Hey all,
- jmh530 (3/9) Aug 22 2016 Thanks for the well-done blog posts, especially the first one.
- Seb (24/38) Aug 23 2016 If you refer to whether the intervals can be calculated at CT,
- Meta (9/22) Aug 22 2016 It's really nice to see that GSoC has been such a huge success so
- Ilya Yaroshenko (10/38) Aug 22 2016 This is an API problem, and will not be fixed. Making D scripting
- Joseph Rushton Wakeling (15/25) Aug 23 2016 Yes, this is not an issue that is immediately fixable without
- Mike Parker (3/9) Aug 22 2016 Reddit:
- Nordlöw (2/5) Aug 23 2016 Fantastic work!
- Nordlöw (3/5) Aug 23 2016 Found a typo:
- Seb (8/13) Aug 23 2016 Thanks! Fixed.
- tn (3/4) Aug 23 2016 What are the columns "mu time" and "sigma^2 time" of the
- Seb (9/13) Aug 23 2016 In statistics mu is often used to describe the average, sigma the
- Seb (11/14) Aug 23 2016 Btw I quickly added the run-time for C++ (for G++ and clang++
- Johan Engelen (9/15) Aug 23 2016 I think you should add compiler/runtimelib versions for all
- Seb (10/25) Aug 23 2016 Oh that's at the header of the benchmark script:
- tn (14/26) Aug 23 2016 Thanks for the clarification and the fix.
- Seb (27/32) Aug 23 2016 tl;dr: yes, it's negligible for the normal distribution.
- Stefan (12/13) Aug 23 2016 code samples with too long lines do not render appropriate on my
- Seb (9/22) Aug 23 2016 It was a weird white-space issue. Thanks for letting me know!
Hey all,

I am proud to publish a report of my GSoC work as two extensive blog posts, which explain non-uniform random sampling and the mir.random.flex package (part of Mir > 0.16-beta2):

http://blog.mir.dlang.io/random/2016/08/19/intro-to-random-sampling.html
http://blog.mir.dlang.io/random/2016/08/22/transformed-density-rejection-sampling.html

Before I start my personal retrospective, I wanted to use this opportunity to give a huge thanks and acknowledgement to my two awesome mentors Ilya Yaroshenko (9il) and Joseph Wakeling (WebDrake). Since I wrote my first line of D code this February, I have learned quite a lot during the last few months. GitHub allows listing all merged contributions, which shows that I got quite familiar with dlang over time:

https://github.com/search?l=&o=desc&q=author%3Awilzbach+is%3Amerged+user%3Adlang&ref=advsearch&s=comments&type=Issues&utf8=%E2%9C%93

… and with other D repositories:

https://github.com/search?l=D&o=desc&q=author%3Awilzbach+is%3Amerged&ref=searchresults&s=comments&type=Issues&utf8=%E2%9C%93

I am pretty sure you now know me from the NG, GitHub, IRC, Twitter, Bugzilla, DConf16, the DWiki, #d on StackOverflow or /r/d_language, so I will skip a further introduction ;-)

Over the next weeks and months I will continue my work on mir.random, which is supposed to supersede std.random, so in case you aren't following the Mir project [1, 2], stay tuned!

Best regards,
Seb

[1] https://github.com/libmir/mir
[2] https://twitter.com/libmir
Aug 22 2016
On Monday, 22 August 2016 at 15:34:47 UTC, Seb wrote:
> Hey all, I am proud to publish a report of my GSoC work as two extensive blog posts, which explain non-uniform random sampling and the mir.random.flex package (part of Mir > 0.16-beta2):
>
> http://blog.mir.dlang.io/random/2016/08/19/intro-to-random-sampling.html
> http://blog.mir.dlang.io/random/2016/08/22/transformed-density-rejection-sampling.html

Thanks for the well-done blog posts, especially the first one.

Does your implementation make any use of CTFE?
Aug 22 2016
On Monday, 22 August 2016 at 17:13:10 UTC, jmh530 wrote:
> Thanks for the well-done blog posts, especially the first one.

I am glad to hear this!

> Does your implementation make any use of CTFE?

If you are referring to whether the intervals can be calculated at CT, unfortunately CTFE can't be used, for four main reasons:

- FP math at CT (it's already hard to deal with at RT, see e.g. my recent complaint [1]) - the problem is that the Flex algorithm is very sensitive to numerical errors, so an erroneous change at the lowest end (e.g. 10^-15) can lead to totally different numbers with a seeded random engine
- std.container, due to pointers (I doubt this can/will be fixed in the near future)
- std.math, due to inline assembly and other tricks (this can be fixed and I will submit a couple of PRs soon)
- the speed of the CTFE engine (see e.g. [2] for std.regex)

That being said, CTFE is of course used to compute mixins, constants and specialized functions. Moreover, thanks to all the speed-ups described in the second blog post, constructing the intervals takes about 0.1 ms, so for the majority of users it shouldn't even be noticeable, and the tiny minority for whom it matters can still manually inline the intervals.

[1] http://forum.dlang.org/post/hjaiavlfkoamenidomsa forum.dlang.org
[2] http://forum.dlang.org/post/iqcrnokalollrejcabad forum.dlang.org
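To make the sensitivity concrete, here is a minimal hand-rolled rejection sampler (a sketch for illustration only, not mir.random.flex's API; the constant hat and the function name are made up). Because every accept/reject decision consumes random draws, a hat perturbed by as little as ~1e-15 can flip a single decision near the boundary, after which the whole seeded stream diverges:

    import std.math : exp;
    import std.random : Random, uniform;

    // Sketch only (not mir.random.flex's API): a rejection sampler for a
    // truncated standard normal with a constant hat over [-3, 3).
    double[] rejectionNormal(size_t n, uint seed, double hatScale)
    {
        auto gen = Random(seed);
        double[] samples;
        while (samples.length < n)
        {
            immutable x = uniform(-3.0, 3.0, gen);      // proposal under the hat
            immutable u = uniform(0.0, hatScale, gen);  // vertical coordinate
            if (u <= exp(-0.5 * x * x))                 // unnormalized normal pdf
                samples ~= x;
        }
        return samples;
    }

    void main()
    {
        // Same seed, hat perturbed by 1e-15: once a decision lands within
        // ~1e-15 of the acceptance boundary and flips, the streams diverge.
        auto a = rejectionNormal(1000, 42, 1.0);
        auto b = rejectionNormal(1000, 42, 1.0 + 1e-15);
    }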
Aug 23 2016
On Monday, 22 August 2016 at 15:34:47 UTC, Seb wrote:
> Hey all, I am proud to publish a report of my GSoC work as two extensive blog posts, which explain non-uniform random sampling and the mir.random.flex package (part of Mir > 0.16-beta2):
>
> http://blog.mir.dlang.io/random/2016/08/19/intro-to-random-sampling.html
> http://blog.mir.dlang.io/random/2016/08/22/transformed-density-rejection-sampling.html

It's really nice to see that GSoC has been such a huge success so far. Everyone has done some really great work.

> Over the next weeks and months I will continue my work on mir.random, which is supposed to supersede std.random, so in case you aren't following the Mir project [1, 2], stay tuned!

I'm curious, have you come up with a solution to what is probably the biggest problem with std.random, i.e., it uses value types and copying? I remember a lot of discussion about this and it seemed at the time that the only really solid solution was to make all random generators classes, though I think DIP1000 *may* help here.
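As a side note for readers, the problem Meta refers to can be shown in a few lines with plain std.random (the function name here is just for illustration): passing a generator by value copies its internal state, so the caller's generator never advances and the same numbers come out again.

    import std.random : Random, uniform, unpredictableSeed;

    int rollByValue(Random gen)   // by value: the state is copied
    {
        return uniform(1, 7, gen);
    }

    void main()
    {
        auto gen = Random(unpredictableSeed);
        immutable a = rollByValue(gen);
        immutable b = rollByValue(gen);
        assert(a == b);   // both calls saw an identical copy of `gen`
        // Taking the generator by `ref` (as Phobos' own primitives do) avoids this.
    }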
Aug 22 2016
On Monday, 22 August 2016 at 18:09:28 UTC, Meta wrote:
> I'm curious, have you come up with a solution to what is probably the biggest problem with std.random, i.e., it uses value types and copying? I remember a lot of discussion about this and it seemed at the time that the only really solid solution was to make all random generators classes, though I think DIP1000 *may* help here.

This is an API problem, and will not be fixed. Making D a scripting-like language is bad for science. For example, druntime (Fibers and Mutexes) is useless because it is too high-level and poorly featured at the same time.

The main problem with std.random is that std.random.uniform is broken in the context of non-uniform sampling. The same is true of 99% of uniform algorithms: they just ignore the fact that, for example, for [0, 1) the exponent and mantissa should be generated separately, with appropriate probabilities for the exponent.
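A rough sketch of the exponent/mantissa point (a hypothetical helper, not mir's implementation): under a uniform [0, 1) distribution the interval [2^-k, 2^-(k-1)) carries probability 2^-k, so one can draw the binary exponent with geometrically decreasing probability and then fill the mantissa uniformly inside the chosen interval, which keeps full relative precision for small values.

    import std.random : Random, uniform, unpredictableSeed;

    // Sketch only: pick the interval [hi/2, hi) with probability hi/2,
    // then draw uniformly inside it.
    double uniform01Dense(ref Random gen)
    {
        double hi = 1.0;
        // with probability 1/2 stay in [0.5, 1), with 1/4 in [0.25, 0.5), ...
        while (hi > double.min_normal && uniform(0, 2, gen) == 0)
            hi *= 0.5;
        return uniform(hi * 0.5, hi, gen);  // uniform mantissa inside [hi/2, hi)
    }

    void main()
    {
        auto gen = Random(unpredictableSeed);
        immutable x = uniform01Dense(gen);
        assert(x > 0 && x < 1);
    }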
Aug 22 2016
On Tuesday, 23 August 2016 at 05:40:24 UTC, Ilya Yaroshenko wrote:
> This is an API problem, and will not be fixed. Making D a scripting-like language is bad for science. For example, druntime (Fibers and Mutexes) is useless because it is too high-level and poorly featured at the same time.

Yes, this is not an issue that is immediately fixable without introducing other issues (e.g. defining everything as a class brings immediate issues related to heap allocation). In the long run it would obviously be nice to address that issue, but it would have been a major blocker to throw that onto Seb's shoulders (as we all recognized quite quickly when we started discussing it). It was (rightly) not the focus of this project.

For this reason, the random distributions introduced in mir are implemented as functors (as is the case with random distributions in C++11 <random>) rather than as ranges.

> The main problem with std.random is that std.random.uniform is broken in the context of non-uniform sampling. The same is true of 99% of uniform algorithms: they just ignore the fact that, for example, for [0, 1) the exponent and mantissa should be generated separately, with appropriate probabilities for the exponent.

Just as a point of terminology: we should make clear here that this is about sampling from a non-uniform distribution. It shouldn't be confused with "sampling" in the sense of what is done by (say) `RandomSample`.
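For readers who haven't used them, the functor style looks roughly like this (hypothetical names, not mir's actual types): the distribution object stores its parameters, and each call takes the generator by ref, so the generator's state advances in place and is never copied.

    import std.random : Random, uniform, unpredictableSeed;

    // Sketch of the functor pattern, analogous to C++11 <random>.
    struct UniformVariable(T)
    {
        T a, b;
        this(T a, T b) { this.a = a; this.b = b; }
        T opCall(G)(ref G gen) { return uniform(a, b, gen); }
    }

    void main()
    {
        auto gen = Random(unpredictableSeed);
        auto dist = UniformVariable!double(0.0, 2.0);
        auto x = dist(gen);   // draws one variate; `gen` advances in place
        auto y = dist(gen);   // a different draw, because the state moved on
    }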
Aug 23 2016
On Monday, 22 August 2016 at 15:34:47 UTC, Seb wrote:
> I am proud to publish a report of my GSoC work as two extensive blog posts, which explain non-uniform random sampling and the mir.random.flex package (part of Mir > 0.16-beta2):
>
> http://blog.mir.dlang.io/random/2016/08/19/intro-to-random-sampling.html
> http://blog.mir.dlang.io/random/2016/08/22/transformed-density-rejection-sampling.html

Reddit: https://www.reddit.com/r/programming/comments/4z4sp7/an_introduction_to_nonuniform_random_sampling/
Aug 22 2016
On Monday, 22 August 2016 at 15:34:47 UTC, Seb wrote:
> I am proud to publish a report of my GSoC work as two extensive blog posts, which explain non-uniform random sampling and the mir.random.flex package (part of Mir > 0.16-beta2):

Fantastic work!
Aug 23 2016
On Monday, 22 August 2016 at 15:34:47 UTC, Seb wrote:
> http://blog.mir.dlang.io/random/2016/08/19/intro-to-random-sampling.html
> http://blog.mir.dlang.io/random/2016/08/22/transformed-density-rejection-sampling.html

Found a typo: search for "performance boost performance boost".
Aug 23 2016
On Tuesday, 23 August 2016 at 08:10:50 UTC, Nordlöw wrote:
> Found a typo: search for "performance boost performance boost".

Thanks! Fixed.

Btw, in case someone is interested, the blog posts are written in GitHub-flavored Markdown with a couple of custom Jekyll extensions (e.g. the math formulas are rendered on the server with KaTeX [1]):

https://github.com/libmir/blog/blob/master/_posts/2016-08-19-transformed-density-rejection-sampling.md

[1] https://khan.github.io/KaTeX/
Aug 23 2016
On Monday, 22 August 2016 at 15:34:47 UTC, Seb wrote:
> http://blog.mir.dlang.io/random/2016/08/22/transformed-density-rejection-sampling.html

What are the columns "mu time" and "sigma^2 time" of the benchmark table in the Sampling subsection?
Aug 23 2016
On Tuesday, 23 August 2016 at 11:58:53 UTC, tn wrote:
> What are the columns "mu time" and "sigma^2 time" of the benchmark table in the Sampling subsection?

In statistics mu is often used to denote the mean, sigma the standard deviation, and sigma^2 the variance. It was absolutely unnecessary to use this notation here (especially because the standard deviation, not the variance, was measured). I fixed that and also added how many samples were generated per run (10M), thanks!

Btw, in case you didn't find the link within the article, the benchmark is available within Mir:

https://github.com/libmir/mir/blob/master/benchmarks/flex/normal_dist.d
Aug 23 2016
On Tuesday, 23 August 2016 at 12:12:30 UTC, Seb wrote:
> [...] I fixed that and also added how many samples were generated per run (10M), thanks!

Btw, I quickly added the run time for C++ (for G++ and clang++ with -O3) and it doesn't look that bad:

http://blog.mir.dlang.io/random/2016/08/22/transformed-density-rejection-sampling.html#sampling
https://github.com/libmir/mir/pull/307

Note that for a comparison between languages the speed of the random engine plays a major role, and that superior performance was never a top priority for this generic method (statistical quality is far more important). However, I will work more on being faster than <random> for common distributions over the next weeks ;-)
Aug 23 2016
On Tuesday, 23 August 2016 at 13:01:29 UTC, Seb wrote:
> Btw, I quickly added the run time for C++ (for G++ and clang++ with -O3)

I think you should add compiler/runtimelib versions for all measurements, and all relevant flags. I'm happy to see you are using LDC for benchmarking; don't forget to mention the LLVM version for LDC's version. You already wrote `-O3` for the C++ measurements; did you not use -mcpu=native or some other performance-changing flags?

Cheers,
  Johan
Aug 23 2016
On Tuesday, 23 August 2016 at 16:54:18 UTC, Johan Engelen wrote:
> I'm happy to see you are using LDC for benchmarking;

What else? ;-)

> don't forget to mention the LLVM version for LDC's version.

Oh, that's in the header of the benchmark script:

https://github.com/libmir/mir/blob/master/benchmarks/flex/normal_dist.d

I excluded it from the post to avoid visual noise, as I thought that those who are interested will check it anyhow.

> You already wrote `-O3` for the C++ measurements;

Now that I have added the flags for C++ directly in the script, it probably got very confusing. Sorry, I will replace the C++ flags with a link to the benchmark soon.

> did you not use -mcpu=native or some other performance-changing flags?

See the benchmark file above.
Aug 23 2016
On Tuesday, 23 August 2016 at 12:12:30 UTC, Seb wrote:
> In statistics mu is often used to denote the mean, sigma the standard deviation, and sigma^2 the variance. It was absolutely unnecessary to use this notation here (especially because the standard deviation, not the variance, was measured). I fixed that and also added how many samples were generated per run (10M), thanks!

Thanks for the clarification and the fix.

(I am familiar with the usage of mu and sigma, but somehow I first thought that the columns corresponded to two different measurements. After the initial confusion I realized the correct meaning, but still wasn't sure about it due to the contradiction between the name of the column (sigma^2, not sigma) and the values (ms, not ms^2 or something). So I thought it would be better to ask.)

Another question: you mention that statistical quality is important, but it is not clear whether flex has better or worse quality than Box-Muller and Ziggurat in the case of sampling from a normal distribution. Or is the difference negligible? (I realize that the real strength of flex is its versatility.)
Aug 23 2016
On Tuesday, 23 August 2016 at 13:12:28 UTC, tn wrote:
> Another question: you mention that statistical quality is important, but it is not clear whether flex has better or worse quality than Box-Muller and Ziggurat in the case of sampling from a normal distribution. Or is the difference negligible? (I realize that the real strength of flex is its versatility.)

tl;dr: yes, it's negligible for the normal distribution.

Excellent question - I didn't cover this in much detail because:

(1) it has already been done extensively in the literature. For example, see Table IV (page 32) of "Gaussian Random Number Generators" by Thomas et al. (2007) for chi-squared and high-sigma tests [1] (btw, in all the random libraries I looked at, e.g. <random>, the Box-Muller method is used for the normal distribution)

(2) the authors of the Tinflex algorithm wrote their own library, UNU.RAN [2], to prove that the transformed density rejection method is close to the inversion method (without numerical errors, it would be "perfect"). Under the hood, UNU.RAN uses polynomial interpolation of the inverse CDF, which is needed to be able to automatically compute statistical properties. This method is on our roadmap [3].

(3) the hat/squeeze and cumulative histograms of nearly all example distributions at [4] look pretty good [5] (if the cumulative histogram is identical to the CDF curve, the errors are negligible)

(4) this is just a preview release and requires [3] for automatic testing of user-generated distributions

[1] http://www.doc.ic.ac.uk/~wl/papers/07/csur07dt.pdf
[2] http://statmath.wu.ac.at/unuran/
[3] https://github.com/libmir/mir/issues/46
[4] https://github.com/libmir/mir/tree/master/examples/flex_plot
[5] https://drive.google.com/open?id=0BwdiZp7qSaBhZXRJNHhSN1RHR3c
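For anyone who wants to run such a check themselves, here is a rough sketch of a chi-squared goodness-of-fit statistic for an N(0, 1) sampler (a hypothetical helper, not the test code from [1] or UNU.RAN): bin the samples, compare observed counts with the counts expected from the normal CDF, and sum the normalized squared deviations.

    import std.algorithm : max, min;
    import std.mathspecial : normalDistribution;
    import std.random : Random, uniform;

    // Sketch only: chi-squared statistic over `bins` equal-width bins on (-4, 4).
    double chiSquaredStatistic(alias sampler)(size_t n, size_t bins, ref Random gen)
    {
        auto counts = new size_t[](bins);
        foreach (_; 0 .. n)
        {
            immutable x = sampler(gen);                      // one N(0,1) draw
            // map (-4, 4) onto the bins, clamping the tails into the edge bins
            immutable idx = cast(ptrdiff_t)((x + 4) / 8 * bins);
            counts[min(max(idx, 0), cast(ptrdiff_t) bins - 1)]++;
        }
        double stat = 0;
        foreach (i; 0 .. bins)
        {
            immutable lo = -4.0 + 8.0 * i / bins;            // bin boundaries
            immutable hi = -4.0 + 8.0 * (i + 1) / bins;
            immutable expected = n * (normalDistribution(hi) - normalDistribution(lo));
            immutable dev = counts[i] - expected;
            stat += dev * dev / expected;
        }
        return stat;  // compare against a chi-squared quantile with bins - 1 dof
    }

    void main()
    {
        import std.stdio : writeln;

        // crude N(0,1) sampler just to exercise the check (sum of 12 uniforms)
        static double crude(ref Random g)
        {
            double s = 0;
            foreach (_; 0 .. 12) s += uniform(0.0, 1.0, g);
            return s - 6;
        }

        auto gen = Random(42);
        writeln(chiSquaredStatistic!crude(100_000, 50, gen));
    }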
Aug 23 2016
Seb, this is awesome!

On Monday, 22 August 2016 at 15:34:47 UTC, Seb wrote:
> http://blog.mir.dlang.io/random/2016/08/19/intro-to-random-sampling.html

Code samples with overly long lines do not render properly in my browsers (current Chrome on Linux and a pretty outdated Firefox on Windows :) ). Taking this example:

    S sample(S, RNG, Pdf, Hat, HatInv, Squeeze)(ref RNG gen, Pdf pdf, Hat hat, HatInv hatInvCDF, Squeeze sq) {

the second line of parameters does not render at all.

Will definitely play around with your code!
Aug 23 2016
On Tuesday, 23 August 2016 at 22:48:31 UTC, Stefan wrote:
> Seb, this is awesome!

Thanks :)

> Code samples with overly long lines do not render properly in my browsers. [...] the second line of parameters does not render at all.

It was a weird white-space issue. Thanks for letting me know! I fixed it and it now looks normal on the "latest, greatest" Chromium, FF and Chrome on Android:

http://blog.mir.dlang.io/random/2016/08/19/intro-to-random-sampling.html#rejection-with-inversion
http://blog.mir.dlang.io/random/2016/08/19/intro-to-random-sampling.html#squeeze-functions

> Will definitely play around with your code!

Don't hesitate to ping me / us with any questions -> https://github.com/libmir/mir/issues
Aug 23 2016