""" Two problems in sampling from long streams of data. 1) Uniformly sample one element from a long stream of data. The eventual length of the stream is unknown. Elements are read one at a time. On reading an element, you must immediately decide whether to keep it or to discard it. When the stream runs out, after N items, each item must have the same probability (1/N) of having been chosen. You should assume you have access to a function that takes as input an integer i and returns a randomly chosen integer in the range 0..i-1 inclusive. This is like rolling an i-sided die numbered from 0 to i-1. We'll call the stream x and denote the elements by x[0],x[1],...x[n-1]. Call the random sample r. Solution: Read an element from the stream, and put it into r r = x[0] with probability 1/2 r = x[1] otherwise read and discard x[1] with probability 1/3 r = x[2] otherwise read and discard x[2] (happens 2/3 of the time) The chance that x[0] will survive two steps and and be found in r is 1/2*2/3 = 1/3 The probability that x[1] will be placed into r at step 1 and the survive is also 1/2*2/3=1/3 The probability that r is x[2] is 1/3. We carry on like this. In general, we can show that when the probability of each x in x[0],x[1],...,x[n-1] being the sample after n steps is 1/ n, the probability that each x in x[0],x[1],...,x[n] being the sample after n+1 steps will be 1/ (n+1), We retain x[n] with probability 1(n+1). To be the sample after n+1 steps, each previous item x[i] must (a) have been the sample at step n (b) survive the risk of being usurped by x[n]. The probability of this is 1/n*n/(n+1) = 1/(n+1) So the property that forall i < n p(x[i]) = 1/n after n steps is preserved by the process we are doing. It is certainly true for n=1, so it must be true for all larger n. 2) Uniformly sample k elements from a stream. Rules are the same as in problem 1. You are allowed to keep k elements. When a new element is read you must immediately decide whether to discard it or not. When the stream runs out, each of the items read from the stream must have probability k/N of being in the sample. While there are less than k items, read the next item and include it in the sample. r = x[0..k-1] Read x[k]. Keep it with probability k/(k+1), otherwise discard it. If you do keep x[k], randomly select one of the existing items and replace it with x[k]. x[k] will be in the sample with probability k/(k+1) because we just arranged this. The other items will be in the sample unless (a) x[k] was retained and (b) a random selection from k happened to hit them. The probability of this is k/(k+1)*1/ = 1/(k+1), so each item will be in the sample with probability 1 - 1/(k+1) = k/(k+1) Once again, we can generalize this by assuming that the probability of being in the sample after n steps is k/n. If we now retain the x[n] with probability k/(n+1), we will replace each of the k current members of the sample with probability 1/(n+1). This means that they will be retained with probability n/(n+1). So their chance of being in the sample after n+1 turns is k/n*n/(n+1) = k/(n+1) Since we know that the probability after k items was k/k = 1, and we have shown that the process preserves the desired property, we have shown that the probability will always be k/n for any n 3) How do you test a program that does this? """ import random import numpy def sample_from_stream(x,k): n = 0 r = [] for item in x: if n < k: r.append(item) else: # we have seen n items before this one # which is the n+1th if random.randrange(n+1) < k: # we want to retain this item # so we randomly choose a place # within r r[random.randrange(k)] = item n+= 1 r.sort() return r def t (n=20,k=5): """ n is the number of elements in the stream. k is the number to be selected """ v = [] for i in range(1000): p = [] # repeat enough times that each cell # will have an average of 100 items for x in range(n*100/k): p += sample_from_stream(xrange(n),k) v.append( [p.count(i) for i in range(n)]) return numpy.array(v)