Thursday, August 13, 2015

Fisher-Yates Shuffle, Reservoir Sampling, and Random permutations of N

I recently had a conversation with someone who introduced me to the concept of Reservoir Sampling, which is a technique of generation for a combination of k randomly selected elements when there are N elements total, where N is unknown or larger then available memory, and you must select elements without replacement.


Here's a version I just wrote:

import random

"""
iterator is an Iterator/Generator object for a stream

and N >= k
"""
def reservoir(iterator, return_count):

    arr = []
    current = 0
    for item in iterator:
        if current < return_count:
            arr.append(item)
        else:
            rand = random.randint(0, current)
            if rand <= return_count:
                arr[rand] = item
        current += 1
    return arr


Once I read it, I noticed that it is strikingly similar to an inside out Fisher-Yates Shuffle (which is how you you return a random permutation of a list from a list)

That would look like:


import random

def shuffle(cards):
    deck = []
    current = 0
    for card in cards:
        rand = random.randint(0, current)
        if rand != current:
            deck.append(deck[rand])
        deck[rand] = card
    return deck

That said, with reservoir, you won't have a random permutation of the elements.  And shuffle uses extra memory if the length of the card pile is very long, and you only need some k cards.

Thus:
        
 import random

 def resshuffle(iterator, return_count):
    arr = []
    current = 0
    while (current < return_count):
        rand = random.randint(0, current)
        if current < return_count:
            arr.append(arr[rand])
            arr[rand] = iterator.next() 
         current += 1   
   for item in iterator:
            if rand <= return_count:
                arr[rand] = item
        current += 1

    return arr