Random sample algorithm

Raymond Chen’s post this morning was about one of the most bizarre uses of GUIDs I’ve ever heard about. I’ll quote the relevant bit:

A customer liaison asked, “My customer is looking for information on the GUID generation algorithm. They need to select N items randomly from a pool of M (jury selection), and their proposed algorithm is to assign each item a GUID, then sort the items by GUID and take the first N.”

Shuffling cardsI was, to be blunt, horrified that this was the only algorithm the unnamed customer could come up with. Good grief. Agreed, the solution can be a somewhat involved algorithm depending on the number of items (especially if you don’t know the number of items beforehand), but it certainly didn’t sound like there were millions of them. These are usually known as Random Sampling Algorithms, and I assume that the original customer was attempting to write some code that would randomly select potential jury members from a local authority’s voter list.

Here’s my initial take on a solution when you know the value of M: first, assign unique monotonically-increasing numbers to each of the items, 0 to M–1. This could be something as simple as their array indexes if they can be read into an array, or sequence numbers if they’re only available through some index in a database. At any rate, there’s no need to actually modify the items, just so long as you can give me the required item if I ask for the one at number 42, say.

Enter a shuffling algorithm. Create an array of integers of size M. Populate it with the values 0 to M. Shuffle the array, then select the first N indexes from the array. In the JavaScript code below I shall select a random sample of 10 items from the numbers 0 to 99:

var shuffle = function (a) {
  var randomIndex, temp, i = a.length - 1;
  var randomInt = function (limit) {
    // returns integer in range 0..limit-1
    return Math.floor(limit * Math.random());
  }

  while (i) {
    randomIndex = randomInt(i + 1);
    if (randomIndex !== i) {
      temp = a[i];
      a[i] = a[randomIndex];
      a[randomIndex] = temp;
    }
    i -= 1;
  }
};

var a = [];
for (var i = 0; i < 100; i++) {
  a[i] = i;
}
shuffle(a);
var sample = a.slice(0, 10);

// this prints something like
// [65, 21, 30, 8, 46, 40, 18, 44, 95, 66]
console.log(sample);

Despite this quick bit of code, there’s one optimization we can make: there’s no need to shuffle all the M numbers, we only need a random sample of size N. Here’s a better (and more encapsulated) version that only shuffles N items in the sorted array before bailing out.

var getRandomSample = function (n, m) {
  var a = [];
  var randomInt = function (limit) {
    // returns integer in range 0..limit-1
    return Math.floor(limit * Math.random());
  }
  var shuffle = function (a, n) {
    var randomIndex, temp, i = a.length - 1;
    while ((i > 0) && (n > 0)) {
      randomIndex = randomInt(i + 1);
      if (randomIndex !== i) {
        temp = a[i];
        a[i] = a[randomIndex];
        a[randomIndex] = temp;
      }
      i -= 1;
      n -= 1;
    }
  };

  if (n > m) {
    n = m;
  }

  for (var i = 0; i < m; i++) {
    a[i] = i;
  }

  shuffle(a, n);
  return a.slice(m - n);
}

var sample = getRandomSample(10, 100);
console.log(sample);

Notice that, since we shuffle from the end of the array towards the beginning, we need to select the last N items from the array when we’ve shuffled N items.

The fun starts when you don’t know the number of items in the original set beforehand, but that’s some code for another day. If you want to see how it’s done in layman’s hand-waving terms, see here.

Album cover for Twisted TendernessNow playing:
Electronic - Late at Night
(from Twisted Tenderness)


Loading similar posts...   Loading links to posts on similar topics...

No Responses

Feel free to add a comment...

Leave a response

Note: some MarkDown is allowed, but HTML is not. Expand to show what's available.

  •  Emphasize with italics: surround word with underscores _emphasis_
  •  Emphasize strongly: surround word with double-asterisks **strong**
  •  Link: surround text with square brackets, url with parentheses [text](url)
  •  Inline code: surround text with backticks `IEnumerable`
  •  Unordered list: start each line with an asterisk, space * an item
  •  Ordered list: start each line with a digit, period, space 1. an item
  •  Insert code block: start each line with four spaces
  •  Insert blockquote: start each line with right-angle-bracket, space > Now is the time...
Preview of response