[ad_1]

## A cool and useful algorithm explained and extended

By Carl M. Kadie and Christopher Meek

If I (Carl) interviewed you for a job at Microsoft, I probably asked you this question:

How would you select a random line from a text file of unknown length?

I asked this question in Microsoft Research, in the original anti-spam team, and in an Office Machine Learning/Data Science group. We love the question because it touches on probability, algorithms, and systems issues.

Even having retired from Microsoft, we still think about the random-line question. For example, we recently learned about `bytecount`

. It is a Rust crate that uses SIMD CPU instructions to speed up, among other things, counting the lines in a file. As we’ll describe later, our learning of the crate led — indirectly — to improving our favorite solution to the random-line question.

We’ve organized this article as a series of questions and hints. If you wish, you can try to answer the questions yourself, before seeing our answers.

We gave interviewees this advice for whiteboard interviews:

- Feel free to ask clarifying questions. (In the context of this article, you can read a little further to see if we offer clarifications.)
- Start with a correct algorithm first, even if it might be suboptimal. If we want a better algorithm, we will prompt you for it with follow-up questions.
- Don’t worry about exact syntax. We don’t, for example, care if you remember the exact method for generating a random number. (In this article, we’ll give answers in Python.)
- If you get stuck, we can provide some hints. (Again, in the context of this article, read a little farther to look for hints.)

Let’s start with the questions:

## Q: How would you select a random line from a text file of unknown length?

**A:** We must first clarify the meaning of a “random line”. We mean that every line in the text file has an equal chance of being returned. In other words, we want the algorithm to select among the lines via a uniform distribution.

This requirement means you **cannot** use this algorithm, that we’ll call `AlgoRandomSeek`

:

- Ask the filesystem for the length of the file.
- Randomly select a file position and seek to that position.
- Return a line near the position.

**Sub Q:** What’s wrong with `AlgoRandomSeek`

?

**Sub A**: Although the algorithm can return any line in the file, it returns longer lines with higher probability than shorter lines and, thus, does not select lines via a uniform distribution.

Aside: We like being asked for this clarification. It distinguishes the everyday meaning of “random” from the technical meaning used in statistics, data science, decision theory, and machine learning.

**Hint:** If you were coding in Python and needed to solve the random line problem quickly (but correctly) what would you do?

We asked GitHub Copilot for a solution, and it gave us this code. Call it `AlgoReadAll`

:

In interviews, this is usually where we stopped with “random line from a file”. However, we recently learned about the Rust crate `bytecount`

. The `bytecount`

crate uses SIMD CPU instructions to speed up, among other things, counting the lines in a file. That got us playing with the problem again. This led to a new approach and an improved algorithm. The new algorithm doesn’t use `bytecount`

. It does, however, in the specific case of returning one random line, outperform the more general Optimal: Algorithm L described in Wikipedia.

Aside: We call this the “new algorithm”, but it may have been discovered earlier. In any case, we hope you’ll find the algorithm and its development interesting.

As before, we’ll present the new algorithm via a series of questions and hints. We have not, however, ever put these questions to an interviewee.

## Q: In one pass, can you select a random line from a text file of unknown length such that the random number generator is called much less than the number of lines, n?

**A:** We must clarify “much less”. We mean that the number of calls to the random number generator is less than O(*n*) [see Big O notation — Wikipedia]. In other words, reducing the calls by one or by half is not enough. The random number of calls required should grow proportional to, for example, log(*n*) or sqrt(*n*).

**Hint:** First modify `AlgoOnePass`

to print the index of every item that gets assigned to `result`

. Call these the “keep indices”. Run the code on, say, 1,000,000 items.

Outputs:

`1 36 41 187 226 403 2,608 5,756 20,162 48,750 912,566 922,409`

This says that when we run with random seed 0, the first item (index 1) is kept as a possible result. Then no item is kept until the 36th item, then the 41st. If the iterator contains between 922,409 and 1,000,000 items, then the 922,409th item will be the last item kept and so will be returned.

The keep indices seem to grow roughly exponentially. If that conjecture is true, the number of keep indices is O(log *n*), where *n* is the number of items in the iterator.

**Sub Q: Can we randomly generate a sequence of keep indices directly?**

**Sub A:** Yes! We’ve detailed our solution in a short, on-line, technical paper [Meek & Kadie, Streaming Random Selection Using the Attenuated Geometric Distribution, 2022]. We call the distribution of keep indices the “Attenuated Geometric Distribution”. We show that if `index1`

is one keep index number then we can generate the next one with:

`r = rng.random()`

index1 = max(index1+ceil(r * index1 / (1.0 — r)),1)

where `rng.random()`

generates a uniform floating-point value between 0 .0 (inclusive) and 1.0 (exclusive). Bonus: `max(…,1)`

handles the very, very unlikely case of generating 0.0 randomly. Also, recall our convention that “index1” is an index that starting counts from 1 instead of 0.

We can now use this insight to create `AlgoOnePassSkip`

:

We can get some confidence in the code by using it to pick a number between 0 (inclusive) and 100 (exclusive), 100,000 times.

Output plots should look uniform. Which they do:

This algorithm differs from Optimal: Algorithm L (recommended by Wikipedia) in two important ways.

`AlgoOnePassSkip`

only works for selecting one random item, while Algorithm L can select any specified number of random items.- When only one random item is desired,
`AlgoOnePassSkip`

needs one random number per keep index, while Algorithm L needs two.

Thus, for the special case in which we want only one random item, `AlgoOnePassSkip`

uses half as many random draws as Algorithm L.

We’ve seen four ways to select an item randomly from a sequence of unknown length. In the context of a text file, the first solution required that the file fit into memory. The next solution used less memory but required two passes through the file. We then used a probability calculation to reduce this to one pass. That one-pass solution required one random number per line. The last solution required many fewer random numbers than lines. It also uses half as many random numbers as the “optimal” (more general) algorithm.

None of our code uses system-level methods such as `bytecount`

to speed up the linear pass through a file. Adding system-level optimizations would be an interesting extension.

[ad_2]

Source link