www.digitalmars.com         C & C++   DMDScript  

digitalmars.D.bugs - [Issue 7067] New: std.random.RandomSample and RandomCover are poorly designed

reply d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=7067

           Summary: std.random.RandomSample and RandomCover are poorly
                    designed
           Product: D
           Version: D2
          Platform: Other
        OS/Version: Windows
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Phobos
        AssignedTo: nobody puremagic.com
        ReportedBy: thecybershadow gmail.com
                CC: andrei metalanguage.com



04:00:22 PST ---
The following tests will always fail:

    int[] a = [ 0, 1, 2, 3, 4, 5, 6, 7, 8 ];
    assert(!equal(randomCover(a, rndGen()), randomCover(a, rndGen())));
    assert(!equal(randomSample(a, 5, rndGen()), randomSample(a, 5, rndGen())));

The reason why these tests will fail is that both functions take the RNG by
value. Not only is this unintuitive, this is also a security liability -
someone depending on these functions to generate random identifiers can be
surprised that two successive calls generate the same ID.

I strongly suggest that RNG types are disallowed from being implicitly copied.
Even though pseudo-random number generators shouldn't be used for security
purposes, they're still likely to be used in such contexts - especially
considering lack of better sources of random data in Phobos.

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
Dec 05 2011
next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=7067


bearophile_hugs eml.cc changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |bearophile_hugs eml.cc



See also issue 6593

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
Dec 05 2011
prev sibling next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=7067




21:28:32 PST ---

 The following tests will always fail:
 
     int[] a = [ 0, 1, 2, 3, 4, 5, 6, 7, 8 ];
     assert(!equal(randomCover(a, rndGen()), randomCover(a, rndGen())));
     assert(!equal(randomSample(a, 5, rndGen()), randomSample(a, 5, rndGen())));
Good point.
 The reason why these tests will fail is that both functions take the RNG by
 value.

 Not only is this unintuitive, this is also a security liability -
 someone depending on these functions to generate random identifiers can be
 surprised that two successive calls generate the same ID.
I think we can safely eliminate this argument from the discussion.
 I strongly suggest that RNG types are disallowed from being implicitly copied.
 Even though pseudo-random number generators shouldn't be used for security
 purposes, they're still likely to be used in such contexts - especially
 considering lack of better sources of random data in Phobos.
The problem with taking a random generator by reference is that it then needs to be escaped. So people would be quite surprised to see that: auto gen = rndGen; return randomSample(a, 5, gen); has undefined behavior. One way or another we need to solve this, e.g. by creating a wrapper with reference semantics over generators. Ideas are welcome. -- Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email ------- You are receiving this mail because: -------
Dec 05 2011
prev sibling next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=7067






 The problem with taking a random generator by reference is that it then needs
 to be escaped. So people would be quite surprised to see that:
 
 auto gen = rndGen;
 return randomSample(a, 5, gen);
 
 has undefined behavior.
 
 One way or another we need to solve this, e.g. by creating a wrapper with
 reference semantics over generators. Ideas are welcome.
Turn random generators into final classes? -- Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email ------- You are receiving this mail because: -------
Dec 05 2011
prev sibling next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=7067




07:53:28 PST ---
 Turn random generators into final classes?
We have backward compatibility to worry about. -- Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email ------- You are receiving this mail because: -------
Dec 06 2011
prev sibling next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=7067




08:04:49 PST ---
The disadvantages of breaking backwards compatibility need to be considered on
a case-by-case basis. I think that turning RNGs into reference types has the
potential to be a relatively low-impact change, while also having a good chance
of revealing broken code.

The typical usage of std.random does not involve using the RNG types directly,
and rather using the implicit thread-local RNG.

An example of affected code:

Random r;
// ... use r

I think that generally allowing such code is a mistake, because it's not clear
that the RNG is not seeded.

auto r = Random(42);

We can implement this for backwards compatibility using static opCall (which
shall be scheduled for deprecation).

The biggest problem is intentional usage of value semantics (it would
transparently turn into reference semantics). Perhaps there's something we
could do with the help of opAssign?

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
Dec 06 2011
prev sibling next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=7067






 The biggest problem is intentional usage of value semantics (it would
 transparently turn into reference semantics).
I suggest to ignore such cases, they are probably uncommon, and add a warning note to the changelog. -- Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email ------- You are receiving this mail because: -------
Dec 06 2011
prev sibling next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=7067


Alex Rønne Petersen <xtzgzorex gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |xtzgzorex gmail.com



04:21:37 PST ---

 Turn random generators into final classes?
We have backward compatibility to worry about.
Sometimes breaking changes can be a good thing. It seems to me like with the current design, it's more likely that people are Doing It Wrong than the opposite. Keeping backwards compatibility also means allowing people to continue doing the same mistakes. -- Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email ------- You are receiving this mail because: -------
Dec 07 2011
prev sibling next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=7067


Bernard Helyer <blood.of.life gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |blood.of.life gmail.com



00:02:29 PST ---
To add an anecdote to this, when I first tried passing Random instances around,
I was surprised to find they weren't reference types; it wasn't obvious to me
at all.

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
Dec 08 2011
prev sibling next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=7067


Jonathan M Davis <jmdavisProg gmx.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jmdavisProg gmx.com



PDT ---
We can make the random number generator ranges reference types without breaking
code simply by creating inner structs for them which hold their member
variables and sit on the heap. The only code which would break because of this
would be code which relied on them being value types, which (as discussed) is
almost certainly a bug.

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
Jun 15 2012
prev sibling next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=7067


jens.k.mueller gmx.de changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jens.k.mueller gmx.de





 The following tests will always fail:
 
     int[] a = [ 0, 1, 2, 3, 4, 5, 6, 7, 8 ];
     assert(!equal(randomCover(a, rndGen()), randomCover(a, rndGen())));
     assert(!equal(randomSample(a, 5, rndGen()), randomSample(a, 5, rndGen())));
Good point.
 The reason why these tests will fail is that both functions take the RNG by
 value.

 Not only is this unintuitive, this is also a security liability -
 someone depending on these functions to generate random identifiers can be
 surprised that two successive calls generate the same ID.
I think we can safely eliminate this argument from the discussion.
 I strongly suggest that RNG types are disallowed from being implicitly copied.
 Even though pseudo-random number generators shouldn't be used for security
 purposes, they're still likely to be used in such contexts - especially
 considering lack of better sources of random data in Phobos.
The problem with taking a random generator by reference is that it then needs to be escaped. So people would be quite surprised to see that: auto gen = rndGen; return randomSample(a, 5, gen); has undefined behavior.
Code like this is always incorrect, i.e. the problem is more general. I wonder whether a compiler can always flag those programs as invalid. Is it possible to solve the problem of escaping references to local variables in general? If so it should probably be done that way. And a RNG should be made non-copyable to force pass by ref. If it isn't possible (or inefficient or too time consuming, etc.), then std.random should be deprecated and std.random2 should replace it in the long run. I believe this is the best solution (but far from perfect) to handle design problems like this. I wish there was a better way to fix a wrong design decision. But working around (kind of against the language) is not future-proof. -- Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email ------- You are receiving this mail because: -------
Jun 19 2012
prev sibling next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=7067


Dmitry Olshansky <dmitry.olsh gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |dmitry.olsh gmail.com



07:40:01 PDT ---
 auto gen = rndGen;
 return randomSample(a, 5, gen);
 
 has undefined behavior.
Code like this is always incorrect, i.e. the problem is more general. I wonder whether a compiler can always flag those programs as invalid. Is it possible to solve the problem of escaping references to local variables in general? If so it should probably be done that way. And a RNG should be made non-copyable to force pass by ref. If it isn't possible (or inefficient or too time consuming, etc.), then std.random should be deprecated and std.random2 should replace it in the long run. I believe this is the best solution (but far from perfect) to handle design problems like this. I wish there was a better way to fix a wrong design decision. But working around (kind of against the language) is not future-proof.
I had similar problems with redesign of std.regex. It has few less then ideal tradeoffs because of that but I've come to appreciate backwards compatibility. In any case big value types like RNG that seldom get modified but often forwarded could use ref-counted COW. In D we are thread-local by default and thus COW is fun and cheap. -- Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email ------- You are receiving this mail because: -------
Jun 19 2012
prev sibling next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=7067






 then
 std.random should be deprecated and std.random2 should replace it in the long
 run. I believe this is the best solution (but far from perfect) to handle
 design problems like this.
std.random2 seems a good solution. -- Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email ------- You are receiving this mail because: -------
Sep 27 2012
prev sibling next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=7067


monarchdodra gmail.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |monarchdodra gmail.com





 
 then
 std.random should be deprecated and std.random2 should replace it in the long
 run. I believe this is the best solution (but far from perfect) to handle
 design problems like this.
std.random2 seems a good solution.
As I've mentioned, I am almost ready with a non-breaking reference semantic fix. To avoid breaking code, the ranges are basically lazilly initialized. As I've stated in the forums, I do not want to add ANY new functionality in random, which would have to be ditched for a migration to random2 anyways. My two ideas were: * Explicit opSlice that returns an agressivilly initialized PRNG, for higher performance. They'd also have tighter safety. * A PRNG.Payload alias for users that *really* want stack allocation. Out of curiosity, if you had to write random2, what would you want in random2? * I'd remove all constructors, and require an explicit call to either a seed function, or a free Make!PRNG fuction. I don't think classes is quite the right solution. ** This would address both the explicit initialization issue, as well as the "no-arg" initialization issue * I'd also make them "only" input ranges. It doesn't make much sense to save a prng (though I'd still have .dup), but they wouldn't be forwardRanges. Not much else actually. I was planning on asking community feedback on potential evolutions when my developement was ready to go through (either evolving std.random, or forking to std.random2), but I wouldn't mind a pre-release opinion from users who have more hindsight of the issue ;) PS: Related, I'd want std.container to require the exact same initialization scheme >:( PPS: No spellchecker here, sorry. -- Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email ------- You are receiving this mail because: -------
Sep 27 2012
prev sibling next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=7067


Joseph Rushton Wakeling <joseph.wakeling webdrake.net> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |joseph.wakeling webdrake.ne
                   |                            |t



2013-06-20 16:02:07 PDT ---
I just want to put on record some remarks about short- and long-term solutions
to this problem.

Long-term as I think we all agree we need RNGs to be reference types.  In this
case the general design of ranges like RandomCover and RandomSample will be not
entirely different from what RandomCover has now: the constructor will take a
RNG as one of its arguments, and (as it's a reference type) this can be copied
(by reference!) to an internal variable:

    struct SomeRandomRange(Rng)
    {
        private Rng _gen;
        ...
        this(/*stuff*/, Rng gen)
        {
            _gen = gen;
            ...
        }
        ...
    }

... and then you'd have some helper functions that would handle the case where
the user doesn't specify an RNG by passing rndGen to the constructor.

However, while we're stuck with the situation we have, the design of
RandomCover is statistically completely unsafe, as you'll inevitably get
unwanted correlations due to storing the RNG by value.  As RandomCover's
constructor insists on receiving an RNG, the only way to escape this is to pass
it a new RNG seeded with unpredictableSeed.

What we can do, though, is to salvage things minimally by patching RandomCover
to use the same technique as RandomSample -- that is, if the constructor gets
passed an RNG, copy it; but if not -- call rndGen directly.  I'm going to
prepare some patches to that effect, which will also try and do something about
the .save methods of Random{Cover,Sample} -- both of which are currently
broken.

It's putting a sticking plaster on a gaping wound, but at least it should mean
that the simplest case available to the user -- doing something like this:

    auto c = randomCover(input);

... will be statistically reliable.

The caveat here is that we have to remember to tweak the constructors back to
insisting on getting an RNG once we can guarantee that RNGs will behave as
reference types.  (Is there any kind of template constraint that could be used
to guarantee reference copying semantics?)

Anyway, I'll get on with writing the patches.

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
Jun 20 2013
prev sibling next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=7067






 It's putting a sticking plaster on a gaping wound,
Why do you/we care so much for breaking backwards compatibility with something that is so broken? If you let it take the generator by ref I think you will remove from bugs from user code. -- Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email ------- You are receiving this mail because: -------
Jun 20 2013
prev sibling next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=7067




2013-06-20 16:27:56 PDT ---

 Why do you/we care so much for breaking backwards compatibility with something
 that is so broken? If you let it take the generator by ref I think you will
 remove from bugs from user code.
You can pass the generator to the constructor by ref, but you can't safely store it as a reference type, and so far as I can see you DO need to store it internally -- RandomCover is a range object, not a function where you can just use it and return. _If_ you could safely store a reference to the RNG, then of course what you say would be right. But you can't, not while they are value types. My proposal won't break backwards compatibility (which is not in itself a good thing -- this back deserves to be broken:-) and won't fix the bigger picture, but it will improve the status quo without having to wait for the large-scale redesign that random number generation needs. -- Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email ------- You are receiving this mail because: -------
Jun 20 2013
prev sibling next sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=7067




2013-08-21 07:38:13 PDT ---

 What we can do, though, is to salvage things minimally by patching RandomCover
 to use the same technique as RandomSample -- that is, if the constructor gets
 passed an RNG, copy it; but if not -- call rndGen directly.  I'm going to
 prepare some patches to that effect, which will also try and do something about
 the .save methods of Random{Cover,Sample} -- both of which are currently
 broken.
Just to note, I've submitted a pull request with updates along these lines: https://github.com/D-Programming-Language/phobos/pull/1499 -- Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email ------- You are receiving this mail because: -------
Aug 21 2013
prev sibling parent d-bugmail puremagic.com writes:
http://d.puremagic.com/issues/show_bug.cgi?id=7067




Commit pushed to master at https://github.com/D-Programming-Language/phobos

https://github.com/D-Programming-Language/phobos/commit/8d9233cf8b9e4d27bd70dd0fcd171d2f6dc2f2c0
Partial fix for Issues 7067 and 10434 - std.random.RandomCover

The existing RandomCover design is fatally flawed because it
requires a RNG as input, which it then copies internally by
value.  So, unless the user is smart enough to pass something
like e.g. SomeRNG(unpredictableSeed), there will be unintended
correlations in random behaviour.

This partial fix follows the design of RandomSample in allowing
RandomCover to use the thread-global default RNG rndGen.  It
also improves the choice of template parameter and variable
names in line with Issue 10434.

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
Aug 29 2013