Simon would have said

Better Pooled Coronavirus Testing

We need to be doing more Coronavirus testing, but it is hard to scale up the number of tests we can run in a day. One way the CDC is recommending getting around this restriction is by pooling tests. This amounts to pooling samples from a few people together and running one test. This can drastically increase the number of people we can test with the same number of testing kits. However, in areas where the virus is too common it becomes ineffective because almost every pooled test comes back positive. We can improve on this test by utilizing some simple probabalistic techniques. These will allow us to get more information out of a collection of tests than we get out of each test individually.

The Status Quo

Pooled tests work by combining samples from a small pool of people and running a single test kit against the combined sample. This test kit should come back positive if any of the people in the pool are sick.

  • Pooled tests cover more people with the same number of tests
  • When a pooled test is negative, everyone in that pool is presumed negative
  • When a pooled test is positive, each person in the pool needs to be retested.

When too many people in an area are sick, pooled tests come back positive so frequently that it is no longer worthwhile, given the cost and risk of reaching out to people to get retested.

When more people in the population are sick, pooled tests come back positive far too frequently.

Motivation for Improvement

  • A batch is the whole collection of tests and people we will examine at once.
  • A pool is the group of people whose sample was applied to a specific test.
  • A test array is the group of tests containing some of a given persons sample. A test array is considered positive if every test in the array is positive.

We can do better than just considering a single pooled test. To start, imagine that one person out of a thousand were sick. We would do this by getting a sample from each person, and labelling them with a number written out in binary:

  1. $0000000001$
  2. $0000000010$
  3. $0000000011$
  4. $0000000100$
  5. $0000000101$

And so on. The sample numbered 1000 would be written in binary as $1111101000$, taking up ten digits. We would line up the ten test kits so that they matched up with digits of the sample labels. Then we would go through all of the samples, and add a given sample to the pool for a test if the digit of the label for that position was 1. The sick person will have their entire test array come back positive.

There is some weirdness here. For instance, a healthy person could have all of the tests containing their sample come back positive. The sample labeled $1$ would only be in one test because it would only have one digit which was $1$. That test would come back positive about half the time (whenever the sick person had an odd numbered label). This problem gets softened out if we use enough tests that we can make sure everyones test array is the same size. But as long as each test array is a distinct collection of tests, we can gain information about each individual person even though each test we ran were on pools of people.

This means that what we care about is the number of distinct collections, rather than the number of tests. That is great, because the number of distinct collections which are subsets of some set goes up way faster than the size of the set. Imagine a card game where you draw a hand of 5 cards. Even though there are only $52$ cards there are $2,598,960$ distinct hands that you might draw.

A Concrete Example

To show how much we can get out of collections of tests rather than individual tests, lets assume each test array will contain 3 tests. I don’t know for sure how many tests we can get out of a single sample, but probably not a whole lot.

We assume that if someone is sick they will have a positive test array. What we want to know is how likely it is that someone who is not sick will have a positive test array.

The probability that one test is positive ($\Rho(T)$) is the probability that some other person in that test pool was positive. This is $\Rho(T) = 1 - (1 - b) ^ -1$, where $b$ is the baseline population rate of COVID and $n$ is the number of people in a pool. This is also the probability of a false positive under the original pooling method.

The probability of a false positive under the new method ($\Rho(A)$) is the probability that all of the tests in an array are positive. $\Rho(A) = \Rho(T)^m$, where $m$ is the number of tests in an array; in our case 3.

The false positive rate of the improved testing is significantly lower than with simply pooled tests.

The probability of a false positive in a simple pooled test is $\Rho(T) = 1 - (1-x)^-1$. In this modified procedure the probability of a false positive is $\Rho(A) = \Rho(T)^m$.

This batched strategy looks way better than the simple pooled tests, because we are able to get more information out of each test by considering the information from the whole test array. It also does not have to use any more tests, because the number of distinct collections will always be large compared to the number of tests. The tradeoff is that we need to use several tests at once to be able to talk about collections. This means larger batches, but the same ratio of tests to people.

In The Real World

So far we’ve been assuming a miraculous, ideal test which has no false negatives and no false positives. Because a test array is not considered positive unless all of the tests contained in it are positive, false positives on the underlying tests are not a problem. False negatives could be a concern, especially because this method would be most useful where the prevalence of COVID is too high for traditional test pooling. However, there are a few things that protect us from this.

Our test arrays are fairly small. The false negative rate of an array will be $\beta^m$, where $m$ is the size of the array and $\beta$ is the false negative rate of the underlying test. The smaller our test array is, the less we increase the false negative rate.

In a case where we were really concerned, we could control the false negative rate further by changing what we considered positive out of a test array. We could say that an array was positive if a majority of the tests in the array were positive. This would prevent a single false negative from making the entire array come back negative, so that it would be easier to catch. This would mean increasing the false positive rate of the batching system as a whole, but we can control for that by running a few more tests per batch while still getting major reductions in the false negative rate.

It’s also likely that false negatives are very correlated; that is, if you get one false negative then all of your tests are likely to be false negatives. In the limiting case where either all of your tests are false negatives or none of them are, this strategy does not actually increase the false negative rate at all.

Where Do We Go From here

I am not a biologist, or a doctor. I have never worked in a lab, or run one of these tests, and there may well be constraints that I am overlooking because of that. But from a mathematical perspective, we can do so much better than simply pooling tests together. By considering collections of tests rather than individual tests, we can get all of the benefits of test pooling even when we can’t assume a very low prevalence of COVID.