The Santa Clara Serology study and being Bayesian

By taking a bayesian approach, it is possible to see that the authors of the Santa Clara study got something wrong, besides the problems with bias and sampling in their survey. The data is too poor to open society up prematurely, based on the premise of high disease prevalence.

In the last blog post we saw how we can calculate \(p(D+ \mid T+)\) given information on Specificity and Sensitivity of a serological test from a manufacturer. In that article we needed an otherwise obtained (from swab tests, for example) prevalence \(p(D+)\), and also noted the impact of false positives on our calculation at low prevalence levels. We alluded to calculating a prevalence \(p(D+)\) of the disease in the population if we knew p(D+ T+)$.

Its important to get the chain of events right. This is not a prior prevalence using swab based testing, but rather one that is done after carrying out serological tests.

The Santa Clara Survey

We might expect false positives to be impactful in a calculation of prevalence as well, and indeed, this impact is at the center of a controversy about a serological survey in Santa Clara country in California. The preprint by [Bendavid et. al, COVID-19 Antibody Seroprevalence in Santa Clara County, California describes both the survey and the pre-survey test kit work carried out by the group.

Reproduced from the pre-print, here is their survey method:

We conducted serologic testing for SARS-CoV-2 antibodies in 3,330 adults and children in Santa Clara County using capillary blood draws and a lateral flow immunoassay…We recruited participants by placing targeted > advertisements on Facebook aimed at residents of Santa Clara County. We used Facebook to quickly reach a large number of county residents and because it allows for granular targeting by zip code and socio-demographic characteristics..

Below is a map of Santa Clara county. As you can see, richer areas close to Palo Alto are represented more.

Most of the respondents were young to middle aged white women. Thus the authors had to apply demographic corrections they talk about in the excerpt below and which they detail in their technical appendix. Sampling via facebook also have significant issues, which as far as I know are unaccounted for: people signing up are more likely to be those who have had symptoms ot those who know others who have had symptoms, thus possibly overestimating any prevalence.

But putting aside those problems for now, let us assume the authors succeeded in getting a representative sample. They go on to tell us:

The total number of positive cases by either IgG or IgM in our unadjusted sample was 50, a crude prevalence rate of 1.50% (exact binomial 95% CI 1.11-1.97%). After weighting our sample to match Santa Clara County by zip, race, and sex, the prevalence was 2.81% (95% CI 2.24-3.37 without clustering the standard errors for members of the same household, and 1.45-4.16 with clustering). We further improved our estimation using the available data on test kit sensitivity and specificity, using the three scenarios noted above. The estimated prevalence was 2.49% (95CI 1.80%-3.17%) under the S1 scenario, 4.16% (95CI 2.58%-5.70%) under the S2 scenario, and 2.75% (95CI 2.01%-3.49%) under the S3 scenario. Notably, the uncertainty bounds around each of these population prevalence estimates propagates the uncertainty in each of the three component parameters: sample prevalence, test sensitivity, and test specificity.

The authors of the survey have thus claimed that the Covid-19 prevalence in Santa Clara county is quite high. The unadjusted prevalence is estimated at 1.5% (range 1-2%), and a population-weighted prevalence rate is estimated as 2.8% with a 95% confidence upper limit of 4.2%.

Understanding these numbers

Let us parse these numbers. Santa Clara county has a population of 2 million. At the time when the data was snap-shotted by the authors, there were about 100 deaths in the county, and about 1000 recorded cases. A 1.5% prevalence would mean that 30,000 of its residents had Covid-19, an underestimate by a factor of 30. Using 4% as the upper limit instead, we’d have a factor of 80 underestimation. These large numbers have been used by many political officials on the right in the US to push for a quick re-opening of the economy and relaxation of lockdown restrictions, claiming that large fractions of the population have been infected. Leaving aside the fact that 4% is not really that large and we are likely to see more deaths when the next 4% are infected, we shall see that these numbers are very uncertain, because of the possibility of false positives. But it is the maze of uncertainty created by the impact of these false positives that any sane policy must navigate.

There is another issue that can be addressed by prevalence numbers. This is the issue of Infection Fatality rate (IFR). A naive way of calculating the fatality rate is the Case Fatality Rate (CFR). This simply divides the number of fatalities by the number os known cases. For Santa Clara, this is 10%.

But because testing always lags, and because of political inertia in the US, lags a lot, the actual IFR is quite low. A prevalence of 1.5%, or 30,000 cases corresponds to an IFR of 0.33%. A prevalence of 4% corresponds to one of 0.125%. (IFRs from Italy and other place have pegged Covid-19 fatality at 1%). This low IFR from California is also being used by right-wingers to argue that the disease is not that dangerous so restrictions should be eased.

Thus its important to get a good hold of these numbers and their uncertainties to drive policy. To do this, we need to know the uncertainties of the sensitivities and the specificities of the serological tests that are being used.

These quantities, as we now know are usually captured in a construct called the Confusion Matrix.

Confusion matrix for the test kits.

Once you have real frequencies for the false positives, true positives, true negatives and false negatives, either in a set of test-kits from the manufacturers or in a survey. Then you could write a confusion matrix:

Confusion Matrix

The marginal quantities (quantities in the right and the bottom margin in the confusion matrix) are the sum of the rows and columns. For example, when you sum the true negatives and the false positives, you get the observed negatives (ON), all the people who do not have the disease.

The technical appendix of the paper details all these quantities:

In the first scenario, we estimate these quantities based upon the numbers provided by the manufacturer of the test kits. For sensitivity, the manufacturer reported 78 positive test readings for 85 samples (from Chinese blood samples) known to have specific IgM antibodies to the receptor-binding domain (RBD) spike on the SARS-nCOV2 virus. They reported 75 positive test readings for 75 of the samples with specific IgG antibodies to the same RBD spike. We adopt a conservative estimate of sensitivity equal to \({\hat r}\) = 56/67 ≈ 91.8%. The manufacturer reports specificity based on an experiment using their kit on a sample of 371 known negative blood samples collected from before the epidemic, and 369 were tested negative. This implies a specificity of \({\hat s}\) = 369/371 ≈ 99.5%.

In the second scenario, we estimate these quantities based on tests run locally at Stanford University. We identified serum from 37 patients who had RT-PCR-confirmed cases of COVID19 and either IgG or IgM on a locally-developed ELISA assay; of these, 25 tested positive with the test kit, implying a sensitivity of \({\hat r}\) = 25/37 ≈ 67.6%. We also identify serum from 30 patients drawn from before the COVID-19 epidemic, and all 30 tested negative, implying a specificity of \({\hat s}\) = 83/83 = 100%. …

In the third scenario, we estimate these quantities by combining the manufacturer tests with our local tests in a simple additive fashion. Under these assumptions, the sensitivity estimate is \({\hat r}\) = 103/122 ≈ 84.4% and the specificity estimate is \({\hat s}\) = 399/401 ≈ 99.5%.

Technical Appendix, Bendavid

We’ll use the author’s combined third estimate. Then, the Sensitivity, or the True Positive Rate(TPR) is given as: \[ Sensitivity \equiv P(T+ | D+) = TPR = \frac{TP}{OP} = 0.844. \] In other words, this is the number of TP, divided by the sum of the bottom row, the OP.

The complement of the sensitivity is the False Negative Rate(FNR):

\[FNR \equiv P(T- | D+) = 1 - TPR = \frac{FN}{OP} = 0.156\]

This quantity is concerned with the bottom row of the confusion matrix.

The Specificity, or the True Negative Rate(TNR) is given as: \[ Specificity \equiv P(T- | D-) = TNR = \,\,\frac{TN}{ON} = 0.995 \]

and thus the complement, the False Positive Rate(FPR) is given as:

\[FPR \equiv P(T+ | D-) = 1 - TNR = \frac{FP}{ON} = 0.005\]

These two quantities are concerned with the top row of the confusion matrix, and the denominators come from its marginal, the ON.

Lets redraw the confusion matrix with all the “test-kit” based numbers put in.

You might think that this data-set is a good one to estimate prevalence (\(\frac{OP}{Total}\)) from, but its not, as a lot of the test kits were tested on pre-covid blood samples (to look for false positives). The false negative testing, on the other hand is all on samples deemed positive in other fashions. Thus these are synthetic numbers which are useful for the sole purpose of estimating sensitivity and specificity. But the authors did do a survey of the public, so we will look at that for estimating prevalence.

To understand how we might want to model our prevalence, first note that we put little hats on our sensitivity and specificity above, \({\hat r}\) and \({\hat s}\). Why?

The reason is that these numbers are estimates. They were estimated on some positive and negative samples. Estimates on another samples could give us different numbers. Indeed the authors made various estimates S1, S2, and S3 (we are using the combined estimates, S3).

What we want to remember, then, is that the specificity and sensitivity, and for that matter, prevalence are stochastic quantities, and these values were estimates.

With just that notion in mind, we can ballpark why we should be afraid of trusting any prevalence numbers based on these estimates.

Suppose the specificity of the test was 98.5%. This means that all the 50 out of 3330 who tested positive could be false positives. Now suppose it was 100%, then all 50 would be true positives. At our quoted number of 2 false positives in about 400, we have a specificity of 99.5%, which would mean that roughly a third of those who tested positive are false positives.

The point here is that with such low numbers, things can move around very easily. Suppose another 2 tests in another sample were false positives (specificity of 99%). Now the False positives would account for 2/3 (33) of the 50 positive tests in Santa Clara, and our estimate of the prevalence would be even lower.

On the other hand, estimates from New York City of prevalence are much higher, around 20%. Whatever the problems with those surveys, we are not so sensitive to false positives any more. For any range 97-100 in sensitivity, the number of true positives do not change that much.

This is the intuition why any result saying a low prevalence is somewhat higher ought to make you feel skeptical. But read on to see how you can do a principled analysis of the uncertainties involved, and how the results from these analysis land up differing from those of the Santa Clara survey preprint authors.

The Bayesian Approach

We think that the best, most principled way to capture and model these uncertainties is Bayesian Modeling. Not just using Bayes theorem like we did last time, but modeling sensitivity, specificity, and prevalence using probability distributions.

In both frequentist and bayesian statistics, estimates can differ. The estimates could be coming from different samples, or perhaps the assays themselves have some noise.

As usual we wish to model this by a probability density:

The interpretation of probability density is that when you multiply \(p(x)\) by the histogram width \(dx\) you get one of the histogram bars \(dP(x)\), a sliver of the probability that the feature \(X\) has value \(x\):

\[dP(x) = p(x)dx.\]

(To be precise you want these histogram bars to be as thin as possible.)

Now when you add all of these probability slivers over the range from \(a\) to \(b\) you must get 1. You can also consider the area under the density curve up to some value \(x\): this function, \(P(x)\), is called the cumulative distribution function (CDF) ; sometimes you will see it just called the distribution function. And \(p(x)\) is called the density function.

In a large population limit, the probability in the sliver can be thought of as the number of data points around a particular \(x\) divided by the total number of data points in the population…Thus, when multiplied by the total number of data points in the population, \(dP(x)\), or the change in the CDF at \(x\), gives us the total number of data points at that \(x\). So it allows us to have different amounts of data at different \(x\).

Now, the shape of the probability density is depends on some numbers, called parameters. For example, in the figure above, one parameter could tell you where the maximum of the probability density is, while another could tell you how wide it is. Keeping the first constant while making the second smaller would keep the probability density at the same place while making it thinner and thus taller (so that the total sum of the slivers still adds to 1).

Now, the Bayesian approach to statistics has two parts:

  1. treat the parameters, which we shall call \(\theta\) as a random variables. This is in contrast to the frequentist approach, where we think of data-sets as stochastic, as samples from a population, and the parameters fixed on the population.
  1. Associate with these parameters \(\theta\) a prior distribution \(p(\theta)\), which encodes our belief about which values of \(\theta\) we think are more likely. Usually, the prior distribution represents our belief on the parameter values when we have not observed any data yet. But not always: we shall see here that we can think of our survey as the “data” and information from the test-kits as factoring into our “prior”. How you do this factoring is your choice.

Posterior Distribution

The key player in the Bayesian context, is a distribution called the posterior distribution over parameter values \(\theta\) given our data. In other words, we would like to know \(p(\theta \vert \cal{D})\):

Well, how do we do this? Bayes Rule!!

\[ p(\theta \vert \cal{D}) p(\cal{D}) = p(\cal{D} \vert \theta)\,p(\theta)\]


\[ p(\theta \vert \cal{D}) = \frac{p(\cal{D} \vert \theta)\,p(\theta)}{p(\cal{D})} .\]

How do we get the denominator? Well its just a sum over the joint probabilities of the data with every value of \(\theta\):

\[p(\cal{D}) = \int d\theta p(\cal{D} , \theta) = \int d\theta p(\cal{D} \vert \theta) p(\theta).\]

In statistics, an integral of the product of a function and a probability density is called an expectation (or mean) value. This is because such an integral computes a weighted mean with the weights coming from our probability slivers. Thus the denominator of our formula, also called the evidence, can be written as an expectation value over a PDF, denoted using the \(E_{pdf(x)}[function]\) notation:

\[E_{p(\theta)}[{\cal L}]\]


\[{\cal L} = p(\cal{D} \vert \theta)\]

is called the Likelihood. The evidence is basically the normalization constant. You can remember this as:

\[ posterior = \frac{likelihood \times prior}{evidence} \]

A seal tosses a globe

This problem, adapted from McElreath’s super good Bayesian Statistics book, Rethinking Statistics, involves a seal tossing a globe in the air, and catching it with its node. Then the seal sees if its nose is touching land or water.

We are told that the first 9 tosses were:


We wish to understand the evolution of our belief in how much water is on the globe as it keeps getting tossed.

For this purpose, we’ll need to introduce ourselves to a new probability distribution.

Binomial Distribution

Let us consider a population of coin flips for a fair coin: \(x_1,x_2,...,x_n\). The distribution of coin flips is the Binomial distribution. This distribution answers the question: what is the probability of obtaining \(k\) heads in \(n\) flips of the coin?

Well, consider for example that we have flipped a fair coin 3 times. (This diagram is taken from the Feynman Lectures on Physics, volume 1, chapter 6.

When we draw the diagram above, we see that there are different probabilities associated with the events of 0, 1,2, and 3 heads with 1 and 2 heads being the most likely. The Binomial Distribution is the one that tells us how to count these possibilities. It is given as:

\[P(X = k; n, \theta) = {n\choose k}\theta^k(1-\theta)^{n-k} \]


\[{n\choose k}=\frac{n!}{k!(n-k)!}\]

and \(\theta\) is the “fairness” of the coin, or the propensity to fall heads. A totally fair coin has \(\theta = 0.5\).

How did we obtain this? The \(\theta^k(1-\theta)^{n-k}\) comes simply from multiplying the probabilities for each bernoulli trial; there are \(k\) 1’s or yes’s, and \(n-k\) 0’s or no’s. The \({n\choose k}\) comes from counting the number of ways in which each event happens: this corresponds to counting all the paths that give the same number of heads in the diagram above.

For \(p=0.5\), \(n=3\), and \(k=2\), the common case above, we get \({3\choose 2}(1/2)^2(1/2)^1 = 3/8\). Simply put, there are 8 possibilities, and 3 ways to get 2 heads and one tails.

We show the distribution below for 200 trials, for different values of \(p\) (or how biased the coin is if you like).

from scipy.stats import binom
k = np.arange(0, 200)
for θ in [0.1, 0.3, 0.5, 0.7, 0.9]:
    rv = binom(200, θ) # construct a distribution object
    plt.plot(k, rv.pmf(k), '.', lw=1, label=θ) # plot pdf
    plt.fill_between(k, rv.pmf(k), alpha=0.2) # fill down to x-axis

Choosing a prior

Back to the seal-tosses-globe experiment. We can use the binomial distribution to describe it. But our seal does not have a strong prior notion of how much water is there on the globe. We need to model this prior notion.

We’ll use a Beta Distribution as our Prior Distribution, to encode our beliefs. A Beta distribution is defined on \([0, 1]\) and can be contorted into many different shapes, as can be seen in the figure below.

\[ p(\theta) = {\rm Beta}(\theta,\alpha, \beta) = \frac{\theta^{\alpha-1} (1-x)^{\beta-1} }{B(\alpha, \beta)} \] where \(B(\alpha, \beta)\) is independent of \(\theta\) and it is the normalization factor.

You will also find in the literature he notation:

\[\theta \sim {\rm Beta}(\alpha, \beta).\]

This is called the “tilda” notation and is to be read as: \(\theta\) is distributed according to the \({\rm Beta}(\alpha, \beta)\) distribution, OR, \(\theta\) is drawn from a \({\rm Beta}(\alpha, \beta)\) distribution.

from scipy.stats import beta
x=np.linspace(0., 1., 1000)
plt.plot(x, beta.pdf(x, 1, 1), label="${\rm Beta}(1,1)$");
plt.plot(x, beta.pdf(x, 10, 10), label="${\rm Beta}(10,10)$");
plt.plot(x, beta.pdf(x, 100, 100), label="${\rm Beta}(100,100)$");
plt.plot(x, beta.pdf(x, 2, 50), label="${\rm Beta}(2,50)$");
plt.plot(x, beta.pdf(x, 10, 1), label="${\rm Beta}(10,1)$");

If, before tossing our globe, the seal cannot judge how much water there is, we might want to say that every fraction of water is equally likely. This is a Uniform distribution, and can be modeled as we can see in the figure above by a \({\rm Beta}(1,1)\) distribution. On the other hand we may have a strong belief in there being equal amounts of water and land. This corresponds to \({\rm Beta}(10,10)\). An even stronger belief: \({\rm Beta}(100,100)\).

Indeed the interpretation of the Beta distribution is that the arguments (which are confusingly enough themselves called \(\alpha\) and \(\beta\)) correspond to “prior” tosses. So 1,1 corresponds to 1 prior water and 1 prior land, while 100,100 corresponds to 100 prior waters and 100 prior lands. You can imagine that just 9 globe tosses would not be able to overcome the strength of such a prior. You would be right, as we shall see.

Obtaining the posterior

When we usually work with Bayes Theorem we do not care much about the “evidence” term in the denominator. It does not depend upon the parameters (they are summed over or integrated out). What matters to us is the relative probability of two values of a parameter. Thus we blithely go from:

\[ p(\theta \vert \cal{D}) = \frac{p(\cal{D} \vert \theta)\,p(\theta)}{p(\cal{D})} \]


\[ p(\theta \vert \cal{D}) \propto p(\cal{D} \vert \theta)\,p(\theta) .\]

With this in mind, lets identify the various parts of our posterior \(p(\theta \vert \cal{D})\), the probability of a particular value of fraction \(\theta\), given data in which we did 9 tosses and got a particular sequence with 6 Ws. It is, upto the constant of proportionality, the product of the

  1. The prior \(p(\theta)\), which we will choose as different Beta distributions, to demonstrate sensitivity to priors.
  2. The likelihood \(p(\cal{D} \vert \theta)\) which is the probability of \(k\) tosses with water given \(N\) tosses of the globe, GIVEN a particular value for \(\theta\). This is given by the Binomial distribution for a particular value of \(\theta\).

From Bayes theorem, the posterior for \(\theta\) is

\[ p(\theta|D) \propto p(n,k|\theta) \,p(\theta) = Binom(\theta, n, k) \, {\rm Beta}(\theta,\alpha, \beta) .\]

It turns out that the product of a Binomial distribution and a Beta likelihood is another Beta distribution, with the number of tosses for water and the number of tosses for land (or heads and tails for a coin) just added to the “prior” W and “prior” L tosses:.

\[p(\theta|D) \propto {\rm Beta}(\theta, \alpha+k, \beta+n-k)\]

So a small amount of throws will not hugely change the shape of priors with large numbers of “prior” tosses.

Interrogating the posterior

Let us start with no strong belief about the amount of water on the globe. This is a kind of un-informative prior: it does not add strong beliefs into the mix. When we do that;

prior_params = np.array( [1.,1.] )  # FLAT 
prior_pdf = beta.pdf( x, *prior_params)
posterior_params = prior_params + np.array( [num_heads, num_tails] )
posterior_pdf = beta.pdf( x, *posterior_params)

we get a posterior like this…


You can see that it hews to the data, in that the highest probability is around 6 out of 9 heads, as this is where the binomial likelihood would have the most probability mass. But it also respects the fact that we do not really have that much data (only 9 tosses) bu having a rather “wide posterior”, visually.

Now we can calculate all sorts of stuff.

The probability that the amount of water is less than 50% is just the area under the red curve to the left of \(\theta = 0.5\), which is 17.3%

The probability by which we get 80% of the area under the curve from the left is \(\theta = 0.763\)

Perhaps a useful summary of the posterior is the limits of \(\theta\) for which the amount of probability mass is between certain percentages, like the middle 80% (10% to 90%). This is called a 80% credible interval.

For this posterior it is \(\theta \in [ 0.44604094, 0.81516349]\).

You can also make various point estimates from the pdf: mean=0.638, median=0.647.

Posteriors: sensitivity to the prior

We can change the prior to see the sensitivity to it.

Suppose we started with a \(\theta \sim {\rm Beta}(10,10)\) prior. Then we get:

in which our posterior is moved quite a bit towards our prior (remember we have 20 prior samples as compared to our 9 new ones..)

And if we were as sure about our belief that there was 50% water on the globe as we were about the “fact” that Obama was “born in Kenya” (\({\rm Beta}(100,100)\) prior), our data would not make much of a difference at all:

Bayesian Updating: posteriors at each time step

Lets stick with our \({\rm Beta}(1,1)\) prior, since it lets the data speak for itself. One should use strong priors if they have been obtained from some previously obtained data (as we shall soon see for the Santa-Clara survey), or if they are producing some regularizing behavior (this idea is used often in hierarchical models).

Let us now take a look at Bayesian inference as an updating process. That is, instead of considering the 9 tosses as one dataset, lets consider each new toss as a new “dataset” and update our posterior.

From this angle, when no data is seen, the posterior is just the prior. We assume equal probability for any \(\theta\) according to \({\rm Beta}(1,1)\). Then lets say we see one data-point, the first W. How does our posterior update?

This is easy, we add 1 count to the existing “prior” 1 count for W and 0 count to the prior 1 count for L. This means that we move to \({\rm Beta}(2,1)\) as our posterior. If we plot this we know that the probability of W now went up, and the new Beta function gets a lift closer to 1 than to 0.

prior_params=[1., 1.]
data = [1,0,1,1,1,0,1,0,1] # WLWWWLWLW
choices=['L', 'W']
for i,v in enumerate(data):
    prior_pdf = beta.pdf( x, *prior_params) # prior
    if v==1:
        tosses = [1,0]
        tosses = [0,1]
    posterior_params = prior_params + np.array( tosses ) # posterior's beta parameters
    posterior_pdf = beta.pdf( x, *posterior_params)  # posterior 
    prior_params = posterior_params # make posterior the prior for next step
    axes[i].plot( x,prior_pdf, lw =1, color ="#348ABD" )
    axes[i].plot( x, posterior_pdf, lw= 3, color ="#A60628" )
    axes[i].fill_between( x, 0, prior_pdf, color ="#348ABD", alpha = 0.15) 
    axes[i].fill_between( x, 0, posterior_pdf, color ="#A60628", alpha = 0.15) 

We continue the process. Now we add a L. This takes us to \({\rm Beta}(2,2)\) which is peaked at the center, making us think that water and land are equally probable. And so on and so forth, until we have seen all 9 data points with 6 W to 3 L taking us to a final posterior of \({\rm Beta}(6+1, 3+1) = {\rm Beta}(7,4)\) which is moved over a bit to the right of \(\theta = 0.6\) as we can see…

A bayesian analysis of the Santa Clara survey

Lets now apply a bayesian analysis to the Santa Clara data. We start by reproducing the confusion matrix from the test kits we had seen earlier.

We will model the sensitivity \(r\), specificity \(s\) and prevalence \(f\) using bayesian analysis. The sensitivity \(r\), before the result from the test-kits could be seen, could be anything. So let us model it with a Uniform Prior, once again expressed as a Beta distribution. We’ll write this using the “tilda” notation:

\[r \sim {\rm Beta}(1,1),\]

which is to be read as, \(r\) is drawn from a \({\rm Beta}(1,1)\) distribution, or that the prior pdf of \(r\) is a Uniform distribution.

We’ll do this for the other two stochastic quantities as well, sensitivity and prevalence:

\[s \sim {\rm Beta}(1,1),\] \[f \sim {\rm Beta}(1,1).\]

Now, let us see some data from the test-kits. Lets write this process down using Bayes rule to get posteriors on the sensitivity and specificity, after the test-kits, but before any survey data:

\[p(r, s \mid D_{kits}) \propto \, p(D_{kits} \mid r, s) \, p(r,s)\]

But the sensitivity and specificity data come from different test-kit data so we can consider them to be independent: in other words there is no interactions amongst their likelihoods or their priors:

\[p(r,s \mid D_{kits}) \propto p(D_{kits, pos} \mid r) \, p(D_{kits, neg} \mid s) \, p(r) \, p(s)\]

In other words, the posterior factorizes as well:

\[p(r \mid D_{kits}) \propto p(D_{kits, pos} \mid r) \, p(r)\]


\[p(s \mid D_{kits}) \propto p(D_{kits, neg} \mid s) p(s).\]

Now the two likelihoods are simply binomials. The sensitivity is about the probability of getting true positives amongst observed positives, so we have \(n=122\) cases with \(k=103\) true positives. This means that the posterior for the sensitivity is:

\[p(r \mid D_{kits, pos}) = {\rm Beta}(r, 103+1, 122-103+1) = {\rm Beta}(r, 104, 20)\]

The prior and posterior can be seen in the figure below.

For the negatives we have 401 observed negatives with 399 True negatives, so that

\[p(s \mid D_{kits, neg}) = {\rm Beta}(s, 399+1, 401-399+1) = {\rm Beta}(s, 400, 3)\]

This posterior is even more extreme and takes us very close to 1 in specificity, since there were only 2 false positives. But even just 2 false positives make a difference.

We can now turn around and use the posteriors as priors for the Santa Clara Survey.

\[p(r, s, f \mid D_{survey}) \propto p(D_{survey} \mid r, s, f) \, p(r, s, f)\]

Since the posterior after seeing the test-kits, which is now acting as a prior factorizes, we can write:

\[p(r, s, f \mid D_{survey}) \propto p(D_{survey} \mid r, s, f) \, p(r \mid D_{kits, pos}) \, p(s \mid D_{kits, neg}) \, p(f)\]

We know the two posterior-priors on the right hand side. As mentioned earlier, we’ll use a uniform prior for the prevalence so as to not bias us one way or the other.

Now we know that there were 3330 tests with 50 positives. Clearly some of these are true positives and some of these are false positives. So, in the language of our confusion matrix, these are the predicted (by the serology test) positives.

We need to calculate the likelihood for these predicted positives. Well its going to be binomial and look something like this:

\[ Binom(p, 3330, 50) \]

where \(p\) is the probability of being predicted positive. Now, well, a predicted positive is a true positive or a false positive, so:

\[p = \frac{TP + FP}{N} = \frac{TP}{OP}\frac{OP}{N} + \frac{FP}{ON}\frac{ON}{N} = rf + (1 - s)(1-f)\]

So then

\[p(r, s, f \mid D_{survey}) \propto Binom(rf + (1 - s)(1-f), 3330, 50) \, {\rm Beta}(r, 104, 20) \, {\rm Beta}(s, 400, 3)\]

The product of t betas with a binomial does not result in another Beta, so this posterior does not have a nice closed form as before.

Now we’d like to find the posterior on prevalence so that we can compare our findings to what we got from the original preprint. to do this we must marginalize or add the “joint” posterior in \(r\), \(s\), and \(f\) over \(r\) and \(s\). In other words:

\[p(f \mid D_{survey}) = \int_0^1 \int_0^1 dr ds p(r, s, f \mid D_{survey}) \] \[p(f \mid D_{survey}) \propto \int_0^1 \int_0^1 dr ds Binom(rf + (1 - s)(1-f), 3330, 50) \, {\rm Beta}(r, 104, 20) \, {\rm Beta}(s, 400, 3)\]

This integral can be done numerically, and then the normalization can be found numerically as well. When we do that, we get:

from scipy.integrate import nquad
post = lambda r, s, f: like_survey(r,s,f)*post_r.pdf(r)*post_s.pdf(s)
f_grid = np.linspace(0,0.04, 40)
for f in f_grid:
    integ = nquad(lambda r, s: post(r,s,f), [[0,1], [0,1]])
ax.plot(f_grid, prior_f.pdf(f_grid), label="prevalence prior before test-kits and survey")
ax.plot(f_grid, np.array([e[0] for e in post_f_grid])/3.3244e-4, label="post-survey prevalence posterior")

The 95% credible intervals for the prevalence are:

\[f \in [0.0 , 0.018] .\]

The paper quotes 1.11% to 1.97%, while we get 0.05% to 1.81%.

So a large portion of the mass of our posterior is to the left of that quoted from the paper. While our right interval is to the left of theirs, its not as much to the left as is our left side. Correspondingly, our distribution peaks ar 1.25 and the mean is more like 1.1% rather than 1.5%.

We’d thus expect, from our analysis, if we included re-sampling, all the numbers the paper quotes to be revised downwards, making the claim of high prevalence unlikely. Our posterior probability distribution puts non-trivial mass on really low prevalence.

Discussion and Controversy

There has been a lot of discussion on this pre-print amongst researchers, most of whom have expressed the same skepticism we have here.

The Guardian hasa nice article on this survey, and a great followup interview with Carl Bergstrom. Peter Kolchinsky has a very informative and long tweet thread:

Noted epidemiologist Trevor Bedford weighed in too:

Richard Neher and [@NimwegenLab]( weighed in too with a bayesian analysis:

[@luispedrocoelho]( wrote a blog post about this as well..he went an extra (proper) step further than we did and modeled prevalence fully as a sample quantity drawn using a binomial from a population probability. His model is a nice example of using MCMC to do bayesian statistics, something I shall be writing a post on soon.

Natalie Dean as well (and she talks about some of the other problems in the survey):

Something else is lost though in the noise about where the mass of the prevalence is. Even the preprint’s confidence intervals on specificity are between 98% and 100%. And as we know, at a 2% false positive rate, all their positives could be false positives, making any inferences about prevalence suspect. And this I suspect is the main thing to take away…we need more data before we can make policy about re-opening.

Covid-19 Archives