digitalmars.D - [OT] Generating distribution of N dice rolls
- H. S. Teoh (31/31) Nov 09 2022 This is technically OT, but I thought I'd pick the smart brains here for
- =?UTF-8?Q?Ali_=c3=87ehreli?= (8/9) Nov 09 2022 I don't know what the standard deviation should be and whether it is a
- Sergey (6/39) Nov 09 2022 They should have uniform distribution with well known mean and
- Timon Gehr (16/51) Nov 10 2022 The distribution of the output is multinomial(n,k,[1/k,1/k,...,1/k]),
- jmh530 (19/31) Nov 10 2022 mir-random has the ability to sample from both the binomial [1]
- Timon Gehr (18/54) Nov 10 2022 Well, I said that he needs to sample from a multinomial distribution. I
- Timon Gehr (2/4) Nov 10 2022 Actually it uses "trials" instead of "samples".
- jmh530 (5/13) Nov 10 2022 Ah, I see what you're saying now. When p = 1 / K, a categorical
- H. S. Teoh (19/24) Nov 10 2022 I have to admit, the linked article is way above my head. I just don't
- =?UTF-8?Q?Ali_=c3=87ehreli?= (47/49) Nov 10 2022 I am happy with my solution which includes scientific terms like
- Siarhei Siamashka (13/21) Nov 10 2022 That's fast, but unfortunately doesn't seem to be correct. Your
- =?UTF-8?Q?Ali_=c3=87ehreli?= (14/19) Nov 10 2022 Agreed. I didn't think the following "random" values would magically be
- Siarhei Siamashka (28/31) Nov 10 2022 Yes, this particular result looks okish (when tested by the
- H. S. Teoh (145/174) Nov 10 2022 According to the Wikipedia page on multinomial distribution (linked by
- Siarhei Siamashka (6/8) Nov 11 2022 How important for you is to actually have a statistically correct
- H. S. Teoh (9/19) Nov 11 2022 Relax, this isn't for generating fake data. :-D It's for a simulation,
- Abdulhaq (2/10) Nov 11 2022 I guess it simulates the simulator.
- Timon Gehr (11/32) Nov 12 2022 I think the question was more: does it matter to you whether or not the
- Siarhei Siamashka (36/45) Nov 12 2022 This is where things take a wrong turn. In reality you need more
- H. S. Teoh (185/197) Nov 10 2022 [...]
- H. S. Teoh (10/21) Nov 10 2022 [...]
- Max Haughton (7/31) Nov 10 2022 Built in AI model? (I joke but this is probably the future,
- H. S. Teoh (32/43) Nov 10 2022 [...]
- Abdulhaq (4/6) Nov 21 2022 The original article that you copied this from has some useful
- H. S. Teoh (7/16) Nov 21 2022 It's a copy-n-paste spambot promoting a link and masquerading as
This is technically OT, but I thought I'd pick the smart brains here for my project, which happens to be written in D. ;-) Basically, I want to write a function that takes 2 uint arguments k and N, and simulates rolling N k-sided dice and counting how many 1's, 2's, 3's, ... k's were rolled. Something like this: uint[k] diceDistrib(uint k)(uint N) in(k > 0) in(N > 0) out(r; r[].sum == N) { uint[k] result; foreach (i; 0 .. N) { result[uniform(0, k)]++; } return result; } The above code works and does what I want, but since N may be large, I'd like to refactor the code to loop over k instead of N. I.e., instead of actually rolling N dice and tallying the results, the function would generate the elements of the output array directly, such that the distribution of the array elements follow the same probabilities as the above code. Note that in all cases, the output array must sum to N; it is not enough to merely simulate the roll distribution probabilistically. Any ideas? (Or links if this is a well-studied problem with a known solution.) <ObDReference> I love how D's new contract syntax makes it so conducive to expressing programming problem requirements. ;-) </ObDReference> T -- When you breathe, you inspire. When you don't, you expire. -- The Weekly Reader
Nov 09 2022
On 11/9/22 18:10, H. S. Teoh wrote:Note that in all cases, the output array must sum to N;I don't know what the standard deviation should be and whether it is a function of N but the following makes sense: - for all k, pick a random number with mean == N/k and standard deviation == ??? - if the sum > N, remove an item from a random bucket for sum - N times - if the sum < N, add an item to a random bucket for N - sum times Ali
Nov 09 2022
On Thursday, 10 November 2022 at 02:10:32 UTC, H. S. Teoh wrote:This is technically OT, but I thought I'd pick the smart brains here for my project, which happens to be written in D. ;-) Basically, I want to write a function that takes 2 uint arguments k and N, and simulates rolling N k-sided dice and counting how many 1's, 2's, 3's, ... k's were rolled. Something like this: uint[k] diceDistrib(uint k)(uint N) in(k > 0) in(N > 0) out(r; r[].sum == N) { uint[k] result; foreach (i; 0 .. N) { result[uniform(0, k)]++; } return result; } The above code works and does what I want, but since N may be large, I'd like to refactor the code to loop over k instead of N. I.e., instead of actually rolling N dice and tallying the results, the function would generate the elements of the output array directly, such that the distribution of the array elements follow the same probabilities as the above code. Note that in all cases, the output array must sum to N; it is not enough to merely simulate the roll distribution probabilistically. Any ideas? (Or links if this is a well-studied problem with a known solution.) <ObDReference> I love how D's new contract syntax makes it so conducive to expressing programming problem requirements. ;-) </ObDReference> TThey should have uniform distribution with well known mean and std. To have exactly N rolls you can estimate with distribution function numbers for (k-1) sides (let their sum will be M), and the rest (N-M) put into k’s side https://scientificgems.wordpress.com/2015/11/03/mathematics-in-action-probability/
Nov 09 2022
On 10.11.22 03:10, H. S. Teoh wrote:This is technically OT, but I thought I'd pick the smart brains here for my project, which happens to be written in D. ;-) Basically, I want to write a function that takes 2 uint arguments k and N, and simulates rolling N k-sided dice and counting how many 1's, 2's, 3's, ... k's were rolled. Something like this: uint[k] diceDistrib(uint k)(uint N) in(k > 0) in(N > 0) out(r; r[].sum == N) { uint[k] result; foreach (i; 0 .. N) { result[uniform(0, k)]++; } return result; } The above code works and does what I want, but since N may be large, I'd like to refactor the code to loop over k instead of N. I.e., instead of actually rolling N dice and tallying the results, the function would generate the elements of the output array directly, such that the distribution of the array elements follow the same probabilities as the above code. Note that in all cases, the output array must sum to N; it is not enough to merely simulate the roll distribution probabilistically. Any ideas? (Or links if this is a well-studied problem with a known solution.) ...The distribution of the output is multinomial(n,k,[1/k,1/k,...,1/k]), i.e., n trials, k events, and the events follow an uniform distribution. https://en.wikipedia.org/wiki/Multinomial_distribution It is at least as hard to sample as binomial(n,1/k), because that is the marginal distribution of each component of the result. I guess one approach is this (if you can find a way to sample from binomial distributions that is good enough for your use case): auto tot=n; foreach(i;0..k){ result[i]=binomial(tot,1.0/(k-i)); tot-=result[i]; } return result; This samples the result one component at a time.<ObDReference> I love how D's new contract syntax makes it so conducive to expressing programming problem requirements. ;-) </ObDReference>:)
Nov 10 2022
On Thursday, 10 November 2022 at 10:45:50 UTC, Timon Gehr wrote:[snip] The distribution of the output is multinomial(n,k,[1/k,1/k,...,1/k]), i.e., n trials, k events, and the events follow an uniform distribution. https://en.wikipedia.org/wiki/Multinomial_distribution It is at least as hard to sample as binomial(n,1/k), because that is the marginal distribution of each component of the result. I guess one approach is this (if you can find a way to sample from binomial distributions that is good enough for your use case): [snip]mir-random has the ability to sample from both the binomial [1] and multinomial [2] distributions (you would want the multinomial here, unless I'm missing something). I don't know how well mir-random handles large values of N or K, but there are also well-known normal and poisson approximations of the binomial distribution (I make some use of them here [3] for binomials) at least. I wasn't exactly sure what you meant by the events follow a uniform distribution, but it's certainly possible to simulate from the multinomial to get the counts. You should also be able to calculate the PMF for it to compare the simulations against. The PMF shouldn't be uniform. [1] http://mir-random.libmir.org/mir_random_variable.html#BinomialVariable [2] http://mir-random.libmir.org/mir_random_ndvariable.html#MultinomialVariable [3] https://github.com/libmir/mir-stat/blob/master/source/mir/stat/distribution/binomial.d
Nov 10 2022
On 10.11.22 12:59, jmh530 wrote:On Thursday, 10 November 2022 at 10:45:50 UTC, Timon Gehr wrote:Well, I said that he needs to sample from a multinomial distribution. I just pointed out that you can sample from multinomial(n,k,...) by sampling from binomial distributions k times, and that to be able to sample from multinomial, you need to be able to sample from binomial anyway. It turns out mir-random is doing precisely what I suggested: https://github.com/libmir/mir-random/blob/master/source/mir/random/ndvariable.d#L344-L365[snip] The distribution of the output is multinomial(n,k,[1/k,1/k,...,1/k]), i.e., n trials, k events, and the events follow an uniform distribution. https://en.wikipedia.org/wiki/Multinomial_distribution It is at least as hard to sample as binomial(n,1/k), because that is the marginal distribution of each component of the result. I guess one approach is this (if you can find a way to sample from binomial distributions that is good enough for your use case): [snip]mir-random has the ability to sample from both the binomial [1] and multinomial [2] distributions (you would want the multinomial here, unless I'm missing something).I don't know how well mir-random handles large values of N or K,FWIW this is what it does (haven't looked at it closely though): https://github.com/libmir/mir-random/blob/master/source/mir/random/variable.d#L1730-L1770but there are also well-known normal and poisson approximations of the binomial distribution (I make some use of them here [3] for binomials) at least. ...Well, that seems a bit more involved. In case that is more accurate/efficient than mir-random, I guess one can sample from multinomial using thatI wasn't exactly sure what you meant by the events follow a uniform distribution,A multinomial is the distribution of a frequency table obtained after drawing n times from a categorical distribution. The OP is only interested in the special case where that categorical distribution is uniform. n is the number of samples, k is the number of events. It's the terminology used by the Wikipedia article I linked.but it's certainly possible to simulate from the multinomial to get the counts. You should also be able to calculate the PMF for it to compare the simulations against. The PMF shouldn't be uniform. ...Well, it should be multinomial. :)[1] http://mir-random.libmir.org/mir_random_variable.html#BinomialVariable [2] http://mir-random.libmir.org/mir_random_ndvariable.html#MultinomialVariable [3] https://github.com/libmir/mir-stat/blob/master/source/mir/stat/distribution/binomial.d
Nov 10 2022
On 10.11.22 13:21, Timon Gehr wrote:n is the number of samples, k is the number of events. It's the terminology used by the Wikipedia article I linked.Actually it uses "trials" instead of "samples".
Nov 10 2022
On Thursday, 10 November 2022 at 12:21:30 UTC, Timon Gehr wrote:[snip] A multinomial is the distribution of a frequency table obtained after drawing n times from a categorical distribution. The OP is only interested in the special case where that categorical distribution is uniform. n is the number of samples, k is the number of events. It's the terminology used by the Wikipedia article I linked. [snip]Ah, I see what you're saying now. When p = 1 / K, a categorical distribution is the same as a uniform distribution. I tend to think of multinomial as a generalization of binomial, but it's really a generalization of both binomial and categorical.
Nov 10 2022
On Thu, Nov 10, 2022 at 11:45:50AM +0100, Timon Gehr via Digitalmars-d wrote: [...]The distribution of the output is multinomial(n,k,[1/k,1/k,...,1/k]), i.e., n trials, k events, and the events follow an uniform distribution. https://en.wikipedia.org/wiki/Multinomial_distributionI have to admit, the linked article is way above my head. I just don't grok probability. :-(It is at least as hard to sample as binomial(n,1/k), because that is the marginal distribution of each component of the result.[...] Hmm. I'm thinking one way to approximate this, if there's a fast way to compute the standard deviation of the multinomial, is to use the Box-Muller transform to do the initial population of the result array, then distribute the remaining (N - sum(array)) elements by running individual trials and adjusting the array accordingly. The initial population should bring us reasonably close to the mean, so (N - sum(array)) should be small enough that running individual trials won't be excessively expensive. Or if (N - sum(array)) is still large, maybe do another estimate using the standard deviation of the remaining trials to reduce it, and then run the remaining trials manually? T -- Doubt is a self-fulfilling prophecy.
Nov 10 2022
On 11/10/22 08:57, H. S. Teoh wrote:if there's a fast way to compute the standard deviation of the multinomialI am happy with my solution which includes scientific terms like normalDistributianness. :p import std; uint[k] diceDistrib(uint k)(uint N) in(k > 0) in(N > 0) out(r; r[].sum == N) { uint[k] result; const average = N / k; // foreach (i; 0 .. N) { // result[uniform(0, k)]++; // } // Larger this value, closer the results to normal distribution enum normalDistributianness = 1000; // Window around the average of values to pick randomly within const halfWindow = average / 2; foreach (ref r; result) { r = iota(normalDistributianness) .map!(_ => uniform(average - halfWindow, average + halfWindow + 1)) .mean .to!uint; } uint total = result[].sum; while (true) { if (total == N) { break; } auto which = &result[uniform(0, k)]; if (total < N) { ++(*which); ++total; } else if (total > N) { --(*which); --total; } } return result; } void main() { writeln(diceDistrib!6(100_000_000)); } Error checking etc. are missing but hundred million dice throws in a split second... Now that's fast! :p Ali
Nov 10 2022
On Thursday, 10 November 2022 at 17:46:59 UTC, Ali Çehreli wrote:On 11/10/22 08:57, H. S. Teoh wrote:That's fast, but unfortunately doesn't seem to be correct. Your code outputs something like this: [16548205, 16741274, 16600777, 16792301, 16585808, 16731635] While the original slow reference H. S. Teoh's implementation outputs: [16669388, 16665882, 16668019, 16671186, 16664718, 16660807] It's easy to see even with a naked eye that the statistical properties are obviously not the same (your implementation has a significantly higher variance). And for comparison, a [Python/SciPy implementation](https://github.com/scipy/scipy/issues/17388#issu comment-1310913940) outputs: [16664115, 16666144, 16677962, 16660692, 16663181, 16667906]if there's a fast way to compute the standard deviation of the multinomialI am happy with my solution which includes scientific terms like normalDistributianness. :p [...] Error checking etc. are missing but hundred million dice throws in a split second... Now that's fast! :p
Nov 10 2022
On 11/10/22 13:45, Siarhei Siamashka wrote:(your implementation has a significantly higher variance)Agreed. I didn't think the following "random" values would magically be correct. :) // Larger this value, closer the results to normal distribution enum normalDistributianness = 1000; // Window around the average of values to pick randomly within const halfWindow = average / 2; I was half joking anyway but the values 4_000 and 'average / 20', respectively produce closer values: [16663617, 16663620, 16662052, 16671587, 16663818, 16675306]. And for comparison, a [Python/SciPyimplementation](https://github.com/scipy/scipy/issues/17388#issu comment-1310913940) outputs:[16664115, 16666144, 16677962, 16660692, 16663181, 16667906]But I am still half joking. :) Ali
Nov 10 2022
On Thursday, 10 November 2022 at 22:18:47 UTC, Ali Çehreli wrote:I was half joking anyway but the values 4_000 and 'average / 20', respectively produce closer values: [16663617, 16663620, 16662052, 16671587, 16663818, 16675306]Yes, this particular result looks okish (when tested by the XNomial R package): P value (LLR) = 0.11807 ± 0.00102 You can modify your code this way: ```D void main() { auto results = diceDistrib!6(100_000_000); auto expected_probabilities = [1.0 / 6].replicate(6); https://rdrr.io/cran/XNomial/"); writefln("library(XNomial)"); writefln("experimental_results <- c(%(%s, %))", results); writefln("expected_probabilities <- c(%(%.18f, %))", expected_probabilities); writefln("xmonte(experimental_results, expected_probabilities)"); } ``` And then paste its output to https://rdrr.io/cran/XNomial/ to calculate p-value. If the reported p-value is smaller than 0.05, then we can be 95% confident that the generator is working incorrectly. A p-value higher than 0.05 means that the test can't see anything wrong, but this doesn't guarantee anything and isn't a proof of correctness. Even a correct generator can sporadically have p-values smaller than 0.05, but this happens only in roughly 1 out of 20 runs (in other words, there's a 5% false positive chance). This situation is explained by https://xkcd.com/882/
Nov 10 2022
On Thu, Nov 10, 2022 at 09:45:56PM +0000, Siarhei Siamashka via Digitalmars-d wrote:On Thursday, 10 November 2022 at 17:46:59 UTC, Ali Çehreli wrote:According to the Wikipedia page on multinomial distribution (linked by Timon), it states that the variance of X_i for n rolls of a k-sided dice (with probability p_i), where i is a specific outcome, is: Var(X_i) = n*p_i*(1 - p_i) Don't really understand where this formula came from (as I said, that page is way above my head), but we can make use of it. In this case, p_i = 1/k since we're assuming unbiased dice rolls. So this reduces to: Var(X_i) = (n/k)*(1 - 1/k) Now, since the standard deviation is just the square root of the variance, we have: sigma = sqrt((n/k)*(1 - 1/k)) which is easy to compute. Furthermore, the mean is just: E(X_i) = n*p_i = n/k We can then plug this into the Box-Muller transform (which I found on Wikipedia :-P): import std.random, std.math; const sigma = sqrt((n/k)*(1.0 - 1.0/k)); auto u1 = uniform01(); auto u2 = uniform01(); auto mag = sigma*sqrt(-2*log(u1)); auto z0 = mag * cos(2*PI*u2) + n/k; auto z1 = mag * sin(2*PI*u2) + n/k; which sets z0 and z1 to be two random variables having mean n/k and variance Var(X_i). So they can be used to populate two elements of the resulting array, and should have the correct distribution. Repeat this procedure until all array elements are populated. (If the output array length is odd, i.e., k is odd, we can just discard the last z1.) Then compute the difference d between sum of the array and n (the expected sum), which should be relatively small, and run d iterations of individual dice rolls and adjust the array (++ if the sum is too small, or -- if the sum is too large) to make it sum exactly to n.On 11/10/22 08:57, H. S. Teoh wrote:if there's a fast way to compute the standard deviation of the multinomial[...] I implemented what I described above, and got very good-looking results. Output: ----------- 1 modules passed unittests [16666837, 16666500, 16664692, 16666306, 16663568, 16672097] sum=100000000 [16661247, 16666674, 16668562, 16673476, 16665867, 16664174] sum=100000000 [16663070, 16665221, 16664850, 16668355, 16668246, 16670258] sum=100000000 [16671423, 16665663, 16667209, 16669164, 16663916, 16662625] sum=100000000 [16666194, 16669936, 16662697, 16666535, 16666671, 16667967] sum=100000000 [16665977, 16667094, 16662753, 16670737, 16666762, 16666677] sum=100000000 [16665518, 16666940, 16671829, 16664487, 16665344, 16665882] sum=100000000 [16658486, 16666671, 16667778, 16665901, 16667471, 16673693] sum=100000000 [16668149, 16669431, 16661462, 16668241, 16668286, 16664431] sum=100000000 [16660879, 16671401, 16659154, 16671245, 16667720, 16669601] sum=100000000 [16665213, 16670794, 16665401, 16667548, 16663484, 16667560] sum=100000000 [16663091, 16671196, 16669946, 16670791, 16661162, 16663814] sum=100000000 [16664305, 16670070, 16662123, 16663467, 16673867, 16666168] sum=100000000 [16664395, 16667406, 16657623, 16673838, 16670227, 16666511] sum=100000000 [16671401, 16662181, 16672923, 16657766, 16674160, 16661569] sum=100000000 [16663059, 16669574, 16662338, 16668319, 16670541, 16666169] sum=100000000 [16672366, 16662296, 16666526, 16664284, 16669201, 16665327] sum=100000000 [16665787, 16669038, 16666956, 16668728, 16660096, 16669395] sum=100000000 [16669162, 16669034, 16662788, 16663672, 16669127, 16666217] sum=100000000 [16667016, 16669649, 16661551, 16670882, 16668304, 16662598] sum=100000000 ----------- (I've no idea how accurate it is, but visual inspection certainly indicates that it's doing the right thing. And it's lightning-fast. :-P) Code (compile with `dmd -unittest -main`): ------------------------------------------ import std; /// Box-Muller transform. double[2] gaussianPoint(double[2] mean, double deviation) { import std.math : cos, log, sin, sqrt, PI, round; import std.random : uniform01; auto u = uniform01(); auto v = uniform01(); auto x0 = sqrt(-2.0*log(u)) * cos(2.0*PI*v); auto y0 = sqrt(-2.0*log(u)) * sin(2.0*PI*v); double[2] result; result[0] = mean[0] + cast(int)round(deviation * x0); result[1] = mean[1] + cast(int)round(deviation * y0); return result; } /// Roll k-sided dice N times and tally each output. int[] diceDistrib(int k, int N) in (k > 0) in (N > 0) out (r; r.sum == N) { import std.math : sqrt, round; import std.range : chunks; import std.random : uniform; const mean = cast(double)N / k; const deviation = sqrt(mean * (1.0 - 1.0 / k)); double[2] means = [ mean, mean ]; // Populate output array with initial values with the correct mean and // standard deviation. auto result = new int[k]; foreach (chunk; result.chunks(2)) { auto z = gaussianPoint(means, deviation); chunk[0] = cast(int) round(z[0]); if (chunk.length > 1) chunk[1] = cast(int) round(z[1]); } // Tweak resulting array until it sums exactly to N. auto total = result.sum; while (total != N) { auto i = uniform(0, k); if (total < N) { result[i]++; total++; } else { result[i]--; total--; } } return result; } unittest { foreach (i; 0 .. 20) { import std.stdio; auto N = 100_000_000; auto dist = diceDistrib(6, N); writefln("%s sum=%d", dist, dist[].sum); assert(dist.sum == N); } } ------------------------------------------ // Now, just for fun, I added a writeln before the `while (total != N)` to print out just how big of a discrepancy to N the array sum is. Turns out, quite a bit: for N = 100_000_000 as above, the discrepancies range from approximately -25000 to 25000, with the typical discrepancy being about 3-4 digits long, which means that the code is spending quite a bit of time in that final loop. Which suggests that a possible improvement is to recursively run the initial approximation step, setting sub_N = (N - array.sum). I'll leave that to a future iteration, though. Being able to compute a hundred million dice rolls in a split second is already good enough for what I need. :-D T -- Answer: Because it breaks the logical sequence of discussion. / Question: Why is top posting bad?I am happy with my solution which includes scientific terms like normalDistributianness. :p [...] Error checking etc. are missing but hundred million dice throws in a split second... Now that's fast! :pThat's fast, but unfortunately doesn't seem to be correct. Your code outputs something like this: [16548205, 16741274, 16600777, 16792301, 16585808, 16731635] While the original slow reference H. S. Teoh's implementation outputs: [16669388, 16665882, 16668019, 16671186, 16664718, 16660807] It's easy to see even with a naked eye that the statistical properties are obviously not the same (your implementation has a significantly higher variance). And for comparison, a [Python/SciPy implementation](https://github.com/scipy/scipy/issues/17388#issuecomment-1310913940) outputs: [16664115, 16666144, 16677962, 16660692, 16663181, 16667906]
Nov 10 2022
On Thursday, 10 November 2022 at 23:15:24 UTC, H. S. Teoh wrote:Being able to compute a hundred million dice rolls in a split second is already good enough for what I need. :-DHow important for you is to actually have a statistically correct solution for this particular problem? If something is off, then this may be eventually discovered by somebody in the future. Here's one famous example: https://www.wondriumdaily.com/gregor-mendel-fake-data/
Nov 11 2022
On Fri, Nov 11, 2022 at 10:52:47AM +0000, Siarhei Siamashka via Digitalmars-d wrote:On Thursday, 10 November 2022 at 23:15:24 UTC, H. S. Teoh wrote:Relax, this isn't for generating fake data. :-D It's for a simulation, and actually my use case mainly involves small values of N. So technically I don't need to optimize it to this extent; it's just a nice-to-have and a fun exercise to make it resistant to performance degradation by unusually large inputs. T -- Mediocrity has been pushed to extremes.Being able to compute a hundred million dice rolls in a split second is already good enough for what I need. :-DHow important for you is to actually have a statistically correct solution for this particular problem? If something is off, then this may be eventually discovered by somebody in the future. Here's one famous example: https://www.wondriumdaily.com/gregor-mendel-fake-data/
Nov 11 2022
On Friday, 11 November 2022 at 13:27:45 UTC, H. S. Teoh wrote:On Fri, Nov 11, 2022 at 10:52:47AM +0000, Siarhei Siamashka via Digitalmars-d wrote:Relax, this isn't for generating fake data. :-D It's for a simulation, and actually my use case mainly involves small values of N. So technically I don't need to optimize it to this extent; it's just a nice-to-have and a fun exercise to make it resistant to performance degradation by unusually large inputs. TI guess it simulates the simulator.
Nov 11 2022
On 11.11.22 14:27, H. S. Teoh wrote:On Fri, Nov 11, 2022 at 10:52:47AM +0000, Siarhei Siamashka via Digitalmars-d wrote:I think the question was more: does it matter to you whether or not the simulation models an accurate distribution of results? What kind of code is consuming those dice roll frequency tables? Note that even though the results are random, the distribution of results itself is not random and can in principle be compared precisely. It's pretty clear that you are not getting the right distribution, although I have not investigated in detail how much what you simulate deviates from the true multinomial distribution that you actually attempt to simulate. Still, I guess for some use cases, the deviation would be significant.On Thursday, 10 November 2022 at 23:15:24 UTC, H. S. Teoh wrote:Relax, this isn't for generating fake data. :-D It's for a simulation, and actually my use case mainly involves small values of N. So technically I don't need to optimize it to this extent; it's just a nice-to-have and a fun exercise to make it resistant to performance degradation by unusually large inputs. TBeing able to compute a hundred million dice rolls in a split second is already good enough for what I need. :-DHow important for you is to actually have a statistically correct solution for this particular problem? If something is off, then this may be eventually discovered by somebody in the future. Here's one famous example: https://www.wondriumdaily.com/gregor-mendel-fake-data/
Nov 12 2022
On Thursday, 10 November 2022 at 23:15:24 UTC, H. S. Teoh wrote:According to the Wikipedia page on multinomial distribution (linked by Timon), it states that the variance of X_i for n rolls of a k-sided dice (with probability p_i), where i is a specific outcome, is: Var(X_i) = n*p_i*(1 - p_i) Don't really understand where this formula came from (as I said, that page is way above my head), but we can make use of it.This is where things take a wrong turn. In reality you need more than just a matching mean and variance to correctly simulate some arbitrary probability distribution: https://en.wikipedia.org/wiki/Moment_(mathematics) Every n-th moment needs to be correct too. Some of these moments have special names (n=1 mean, n=2 variance, n=3 skewness, n=4 kurtosis, ...). If you only take care of the mean and variance for simulating a random distribution, then it's somewhat similar to approximating "sin(x) = x - (x^3 / 3!)" via taking only the first few terms of the Taylor series. I wonder what's the reason for not using the mir-random library like suggested in the early comments? Do you want to avoid having an extra dependency? ```D /+dub.sdl: dependency "mir-random" version="~>2.2.19" +/ import std, mir.random.engine, mir.random.ndvariable; uint[k] diceDistrib(uint k)(uint N) in(k > 0) in(N > 0) out(r; r[].sum == N) { uint[k] result; double[k] p; p[] = 1.0 / k; auto rv = multinomialVar(N, p); rv(rne, result[]); return result; } void main() { writeln(diceDistrib!6(100_000_000)); } ```
Nov 12 2022
On Thu, Nov 10, 2022 at 03:15:24PM -0800, H. S. Teoh via Digitalmars-d wrote: [...]Now, just for fun, I added a writeln before the `while (total != N)` to print out just how big of a discrepancy to N the array sum is. Turns out, quite a bit: for N = 100_000_000 as above, the discrepancies range from approximately -25000 to 25000, with the typical discrepancy being about 3-4 digits long, which means that the code is spending quite a bit of time in that final loop. Which suggests that a possible improvement is to recursively run the initial approximation step, setting sub_N = (N - array.sum). I'll leave that to a future iteration, though. Being able to compute a hundred million dice rolls in a split second is already good enough for what I need. :-D[...] OK, I couldn't resist, the idea is so tempting. So I made a new implementation that iteratively uses the Box-Muller transform to close the gap between array.sum and N, resorting to individual rolls only when the difference is < k. For N = 100_000_000, it typically takes about 3-5 iterations to bring the difference down to < k, so the entire algorithm takes about k+5 iterations to compute the result. Now, in theory, I could just run the Box-Muller estimate until the difference is 0, but I found that once the difference is small, it tends to bounce around 0 several iterations before converging on 0. So I arbitrarily decided to stop when the difference < k, and use individual rolls to do the rest. One wrinkle that came up is that for small values of N, sometimes the result array can end up with negative elements, either due to overcompensation during the iterative Box-Muller stage (discrepancy > 0 and the selected z values happen to be larger than the current array element), or during the final adjustment loop (it picks an array element that's already 0 and tries to decrement it). So I had to insert extra checks to discard generated z values if they would cause the result array to have negative counts. Generally, this happens only for small values of N; for large N the initial estimate is large enough that it's extremely unlikely for a later adjustment to overshoot into negative values. To ensure the output is never negative, I added another out-contract to check this. Code: ------------ import std.algorithm; import std.conv : text; double[2] gaussianPoint(double[2] mean, double deviation) { import std.math : cos, log, sin, sqrt, PI, round; import std.random : uniform01; auto u = uniform01(); auto v = uniform01(); auto x0 = sqrt(-2.0*log(u)) * cos(2.0*PI*v); auto y0 = sqrt(-2.0*log(u)) * sin(2.0*PI*v); double[2] result; result[0] = mean[0] + cast(int)round(deviation * x0); result[1] = mean[1] + cast(int)round(deviation * y0); return result; } /** * Simulates rolling N k-sided dice. * * Returns: A static array representing how many of each of 1..(k+1) were * obtained by the rolls. * * Bugs: The current implementation uses a naïve algorithm that's rather * inefficient for large N. For our purposes, though, which involve only * relatively small N, this is Good Enough(tm) for the time being. We can look * into improving this if it becomes a performance bottleneck. */ int[] diceDistrib(int k, int N) in (k > 0) in (N > 0) out (r; r.sum == N) out (r; r.all!(c => c >= 0), r.text) { import std.math : abs, sgn, sqrt, round; import std.range : chunks; import std.random : uniform; debug import std.stdio; // Populate output array with initial values with the correct mean and // standard deviation. auto result = new int[k]; auto discrepancy = N - result.sum; while (abs(discrepancy) > k) { debug writefln("discrepancy=%d", discrepancy); const mean = cast(double)abs(discrepancy) / k; const deviation = sqrt(mean * (1.0 - 1.0 / k)); double[2] means = [ mean, mean ]; double sign = sgn(discrepancy); foreach (chunk; result.chunks(2)) { do { auto z = gaussianPoint(means, deviation); auto c0 = chunk[0] + cast(int) round(sign*z[0]); if (c0 < 0) continue; // discard nonsensical result if (chunk.length > 1) { auto c1 = chunk[1] + cast(int) round(sign*z[1]); if (c1 < 0) continue; // discard nonsensical result chunk[1] = c1; } chunk[0] = c0; } while (false); } discrepancy = N - result.sum; } debug writefln("final discrepancy=%d", discrepancy); // Tweak resulting array until it sums exactly to N. auto total = result.sum; while (total != N) { auto i = uniform(0, k); if (total < N) { result[i]++; total++; } else if (result[i] > 0) { result[i]--; total--; } } return result; } unittest { import std.stdio; foreach (i; 0 .. 5) { auto N = 100_000_000; auto dist = diceDistrib(6, N); debug writefln("%s sum=%d", dist, dist[].sum); assert(dist.sum == N); } foreach (i; 0 .. 5) { auto N = 10; auto dist = diceDistrib(6, N); debug writefln("%s sum=%d", dist, dist[].sum); assert(dist.sum == N); } } ------------ Typical output: ------------ 1 modules passed unittests discrepancy=1000000000 discrepancy=-94251 discrepancy=287 final discrepancy=5 [166665591, 166651866, 166681043, 166654104, 166673559, 166673837] sum=1000000000 discrepancy=1000000000 discrepancy=20314 discrepancy=-88 discrepancy=-9 final discrepancy=6 [166668490, 166665588, 166648866, 166658164, 166684475, 166674417] sum=1000000000 discrepancy=1000000000 discrepancy=14062 discrepancy=18 final discrepancy=5 [166658497, 166674399, 166652111, 166667954, 166672801, 166674238] sum=1000000000 discrepancy=1000000000 discrepancy=15926 discrepancy=-175 discrepancy=-19 final discrepancy=-4 [166664463, 166659610, 166655243, 166678783, 166673507, 166668394] sum=1000000000 discrepancy=1000000000 discrepancy=39916 discrepancy=201 discrepancy=-16 final discrepancy=2 [166666774, 166663400, 166672276, 166686269, 166652312, 166658969] sum=1000000000 discrepancy=10 final discrepancy=-1 [2, 2, 2, 3, 0, 1] sum=10 discrepancy=10 final discrepancy=-5 [2, 1, 3, 2, 1, 1] sum=10 discrepancy=10 final discrepancy=-1 [1, 0, 1, 1, 4, 3] sum=10 discrepancy=10 final discrepancy=-3 [3, 2, 2, 1, 0, 2] sum=10 discrepancy=10 final discrepancy=-2 [1, 1, 2, 2, 2, 2] sum=10 ------------ Note that the last 5 outputs are for a different test case (N=10), just to make sure that we don't produce nonsensical outputs when N is small. T -- Leather is waterproof. Ever see a cow with an umbrella?
Nov 10 2022
On Thu, Nov 10, 2022 at 05:27:07PM -0800, H. S. Teoh via Digitalmars-d wrote: [...]/** * Simulates rolling N k-sided dice. * * Returns: A static array representing how many of each of 1..(k+1) were * obtained by the rolls. * * Bugs: The current implementation uses a naïve algorithm that's rather * inefficient for large N. For our purposes, though, which involve only * relatively small N, this is Good Enough(tm) for the time being. We can look * into improving this if it becomes a performance bottleneck. */[...] Argh, that last paragraph should be deleted. :-( See, this is proof that documentation goes out of sync with the code. We need a new innovation, along the lines of built-in unittests, to revolutionize keeping docs up-to-date with the code. T -- The best compiler is between your ears. -- Michael Abrash
Nov 10 2022
On Friday, 11 November 2022 at 01:47:54 UTC, H. S. Teoh wrote:On Thu, Nov 10, 2022 at 05:27:07PM -0800, H. S. Teoh via Digitalmars-d wrote: [...]Built in AI model? (I joke but this is probably the future, whoever gets it right will make a lot of money). Describing what code *does* is actually already relatively doable using contemporary AI techniques, but that isn't documentation (or at least isn't good documentation, I don't know that `x + y` adds two numbers together./** * Simulates rolling N k-sided dice. * * Returns: A static array representing how many of each of 1..(k+1) were * obtained by the rolls. * * Bugs: The current implementation uses a naïve algorithm that's rather * inefficient for large N. For our purposes, though, which involve only * relatively small N, this is Good Enough(tm) for the time being. We can look * into improving this if it becomes a performance bottleneck. */[...] Argh, that last paragraph should be deleted. :-( See, this is proof that documentation goes out of sync with the code. We need a new innovation, along the lines of built-in unittests, to revolutionize keeping docs up-to-date with the code. T
Nov 10 2022
On Fri, Nov 11, 2022 at 03:34:31AM +0000, Max Haughton via Digitalmars-d wrote:On Friday, 11 November 2022 at 01:47:54 UTC, H. S. Teoh wrote:[...][...]See, this is proof that documentation goes out of sync with the code. We need a new innovation, along the lines of built-in unittests, to revolutionize keeping docs up-to-date with the code.Built in AI model? (I joke but this is probably the future, whoever gets it right will make a lot of money). Describing what code *does* is actually already relatively doable using contemporary AI techniques, but that isn't documentation (or at least isn't good documentation, I don't know that `x + y` adds two numbers together.I don't think AI will ever be able to replace real documentation. It can describe what the code does, but like you said, that isn't good documentation. I mean, I already know `x + y` adds two numbers (if I didn't, I shouldn't be touching the code in the first place). The AI saying this for me, doesn't really add any value. What makes good documentation is to explain *why* I'm adding two numbers. I don't think any AI will be able to divine my intention in adding two numbers; that has to be written by the programmer. One of the factors that make D unittests so awesome is the fact that you can put them right in the same source file as you're coding in, and in fact, quite often right next to the function it's documenting. This proximity is what increases the likelihood that people are more likely to write unittests, and more inclined to keep them up-to-date (as opposed to, y'know, opening up an external unittest file and commenting out / disabling a troublesome test just to get things to run). Even if somebody `version(none)`'d out a unittest, it's still sitting right there in the source file, sticking out like a sore thumb and begging silently but persistently "fix me, fix me, you know you should!". One idea that comes to mind w.r.t. documentation, is if the docs can be written not only in the comment header, but inside the function body, next to the block of code responsible for some particular task. It would have some tags to help ddoc figure out where in the doc block it goes, so that there is some control over order of presentation. But the comment would be sitting next to the offending code, so when somebody changes the code, the comment would be staring right at them, "update me, update me, c'mon I know you know you should!". T -- Why do conspiracy theories always come from the same people??
Nov 10 2022
On Monday, 21 November 2022 at 10:44:48 UTC, Cabal Powel wrote:On Thursday, 10 November 2022 at 02:10:32 UTC, H. S. Teoh wrote:Hope this will help you.The original article that you copied this from has some useful images: https://towardsdatascience.com/modelling-the-probability-distributions-of-dice-b6ecf87b24ea
Nov 21 2022
On Mon, Nov 21, 2022 at 11:08:24AM +0000, Abdulhaq via Digitalmars-d wrote:On Monday, 21 November 2022 at 10:44:48 UTC, Cabal Powel wrote:It's a copy-n-paste spambot promoting a link and masquerading as something legitimate by posting (an incomplete version of) a relevant article. T -- In a world without fences, who needs Windows and Gates? -- Christian SurchiOn Thursday, 10 November 2022 at 02:10:32 UTC, H. S. Teoh wrote:Hope this will help you.The original article that you copied this from has some useful images: https://towardsdatascience.com/modelling-the-probability-distributions-of-dice-b6ecf87b24ea
Nov 21 2022