<h1>The Santa Clara Serology study and being Bayesian</h1>
<p><em>Covid19@Univ.AI, 2020-04-25</em></p>
<p>In the last blog post we saw how we can calculate <span class="math inline">\(p(D+ \mid T+)\)</span> given information on the Specificity and Sensitivity of a serological test from a manufacturer. In that article we needed an otherwise obtained (from swab tests, for example) prevalence <span class="math inline">\(p(D+)\)</span>, and also noted the impact of false positives on our calculation at low prevalence levels. We alluded to calculating a prevalence <span class="math inline">\(p(D+)\)</span> of the disease in the population if we knew <span class="math inline">\(p(D+ \mid T+)\)</span>.</p>
<p>It’s important to get the chain of events right. This is not a prior prevalence obtained from swab-based testing, but rather one computed after carrying out serological tests.</p>
<section id="the-santa-clara-survey" class="level2">
<h2>The Santa Clara Survey</h2>
<p>We might expect false positives to be impactful in a calculation of prevalence as well, and indeed, this impact is at the center of a controversy about a serological survey in Santa Clara county in California. The preprint by Bendavid et al., <a href="https://www.medrxiv.org/content/10.1101/2020.04.14.20062463v1.full.pdf">COVID-19 Antibody Seroprevalence in Santa Clara County, California</a>, describes both the survey and the pre-survey test kit work carried out by the group.</p>
<p>Reproduced from the pre-print, here is their survey method:</p>
<div class="epigraph">
<blockquote>
<p>We conducted serologic testing for SARS-CoV-2 antibodies in 3,330 adults and children in Santa Clara County using capillary blood draws and a lateral flow immunoassay…We recruited participants by placing targeted advertisements on Facebook aimed at residents of Santa Clara County. We used Facebook to quickly reach a large number of county residents and because it allows for granular targeting by zip code and socio-demographic characteristics.</p>
</blockquote>
</div>
<p>Below is a map of Santa Clara county. As you can see, richer areas close to Palo Alto are represented more.</p>
<p><img src="/assets/santaclara.png" /></p>
<p>Most of the respondents were young to middle-aged white women. Thus the authors had to apply the demographic corrections they talk about in the excerpt below and which they detail in their technical appendix. Sampling via Facebook also has significant issues, which as far as I know are unaccounted for: people signing up are more likely to be those who have had symptoms or those who know others who have had symptoms, thus possibly overestimating any prevalence.</p>
<p>But putting aside those problems for now, let us assume the authors succeeded in getting a representative sample. They go on to tell us:</p>
<div class="epigraph">
<blockquote>
<p>The total number of positive cases by either IgG or IgM in our unadjusted sample was 50, a crude prevalence rate of 1.50% (exact binomial 95% CI 1.11-1.97%). After weighting our sample to match Santa Clara County by zip, race, and sex, the prevalence was 2.81% (95% CI 2.24-3.37 without clustering the standard errors for members of the same household, and 1.45-4.16 with clustering). We further improved our estimation using the available data on test kit sensitivity and specificity, using the three scenarios noted above. The estimated prevalence was 2.49% (95CI 1.80%-3.17%) under the S1 scenario, 4.16% (95CI 2.58%-5.70%) under the S2 scenario, and 2.75% (95CI 2.01%-3.49%) under the S3 scenario. Notably, the uncertainty bounds around each of these population prevalence estimates propagates the uncertainty in each of the three component parameters: sample prevalence, test sensitivity, and test specificity.</p>
</blockquote>
</div>
<p>The authors of the survey have thus claimed that the Covid-19 prevalence in Santa Clara county is quite high. The unadjusted prevalence is estimated at 1.5% (range 1-2%), and a population-weighted prevalence rate is estimated as 2.8% with a 95% confidence upper limit of 4.2%.</p>
<section id="understanding-these-numbers" class="level3">
<h3>Understanding these numbers</h3>
<p>Let us parse these numbers. Santa Clara county has a population of 2 million. At the time when the data was snap-shotted by the authors, there were about 100 deaths in the county, and about 1000 recorded cases. A 1.5% prevalence would mean that 30,000 of its residents had Covid-19, an underestimate by a factor of 30. Using 4% as the upper limit instead, we’d have a factor of 80 underestimation. These large numbers have been used by many political officials on the right in the US to push for a quick re-opening of the economy and relaxation of lockdown restrictions, claiming that large fractions of the population have been infected. Leaving aside the fact that 4% is not really that large and we are likely to see more deaths when the next 4% are infected, we shall see that these numbers are very uncertain, because of the possibility of false positives. But it is the maze of uncertainty created by the impact of these false positives that any sane policy must navigate.</p>
<p>There is another issue that can be addressed by prevalence numbers. This is the issue of the Infection Fatality Rate (IFR). A naive way of calculating the fatality rate is the Case Fatality Rate (CFR). This simply divides the number of fatalities by the number of known cases. For Santa Clara, this is 10%.</p>
<p>But because testing always lags (and, because of political inertia in the US, lags a lot), the actual IFR is much lower. A prevalence of 1.5%, or 30,000 cases, corresponds to an IFR of 0.33%. A prevalence of 4% corresponds to one of 0.125%. (IFRs from Italy and other places have pegged Covid-19 fatality at 1%.) This low IFR from California is also being used by right-wingers to argue that the disease is not that dangerous, so restrictions should be eased.</p>
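<p>The arithmetic behind these rates is simple enough to check in a few lines (a sketch using the approximate county figures quoted above):</p>

```python
deaths = 100            # approximate deaths in Santa Clara county at the snapshot
known_cases = 1_000     # approximate recorded (swab-confirmed) cases
population = 2_000_000  # approximate county population

cfr = deaths / known_cases                # naive Case Fatality Rate: 10%
ifr_low = deaths / (0.015 * population)   # IFR if prevalence is 1.5%: 0.33%
ifr_high = deaths / (0.04 * population)   # IFR if prevalence is 4%: 0.125%

print(f"CFR = {cfr:.1%}, IFR(1.5%) = {ifr_low:.2%}, IFR(4%) = {ifr_high:.3%}")
```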
<p>Thus it’s important to get a good hold of these numbers and their uncertainties to drive policy. To do this, we need to know the uncertainties of the sensitivities and the specificities of the serological tests being used.</p>
<p>These quantities, as we now know, are usually captured in a construct called the Confusion Matrix.</p>
</section>
<section id="confusion-matrix-for-the-test-kits." class="level3">
<h3>Confusion matrix for the test kits.</h3>
<p>Once you have real frequencies for the false positives, true positives, true negatives and false negatives, either from a set of test kits from the manufacturer or from a survey, you can write a confusion matrix:</p>
<figure>
<img src="/assets/confusionmatrix.png" title="fig:" alt="Confusion Matrix" />
</figure>
<p>The marginal quantities (quantities in the right and the bottom margin in the confusion matrix) are the sum of the rows and columns. For example, when you sum the true negatives and the false positives, you get the observed negatives (ON), all the people who do not have the disease.</p>
<p>The <a href="https://www.medrxiv.org/content/medrxiv/suppl/2020/04/17/2020.04.14.20062463.DC1/2020.04.14.20062463-1.pdf">technical appendix of the paper</a> details all these quantities:</p>
<div class="epigraph">
<blockquote>
<p>In the first scenario, we estimate these quantities based upon the numbers provided by the manufacturer of the test kits. For sensitivity, the manufacturer reported 78 positive test readings for 85 samples (from Chinese blood samples) known to have specific IgM antibodies to the receptor-binding domain (RBD) spike on the SARS-nCOV2 virus. They reported 75 positive test readings for 75 of the samples with specific IgG antibodies to the same RBD spike. We adopt a conservative estimate of sensitivity equal to <span class="math inline">\({\hat r}\)</span> = 78/85 ≈ 91.8%. The manufacturer reports specificity based on an experiment using their kit on a sample of 371 known negative blood samples collected from before the epidemic, and 369 were tested negative. This implies a specificity of <span class="math inline">\({\hat s}\)</span> = 369/371 ≈ 99.5%.</p>
<p>In the second scenario, we estimate these quantities based on tests run locally at Stanford University. We identified serum from 37 patients who had RT-PCR-confirmed cases of COVID19 and either IgG or IgM on a locally-developed ELISA assay; of these, 25 tested positive with the test kit, implying a sensitivity of <span class="math inline">\({\hat r}\)</span> = 25/37 ≈ 67.6%. We also identified serum from 30 patients drawn from before the COVID-19 epidemic, and all 30 tested negative, implying a specificity of <span class="math inline">\({\hat s}\)</span> = 30/30 = 100%. …</p>
<p>In the third scenario, we estimate these quantities by combining the manufacturer tests with our local tests in a simple additive fashion. Under these assumptions, the sensitivity estimate is <span class="math inline">\({\hat r}\)</span> = 103/122 ≈ 84.4% and the specificity estimate is <span class="math inline">\({\hat s}\)</span> = 399/401 ≈ 99.5%.</p>
</blockquote>
<blockquote>
<footer>
Technical Appendix, Bendavid et al.
</footer>
</blockquote>
</div>
<p>We’ll use the authors’ combined third estimate. Then, the Sensitivity, or the <strong>True Positive Rate</strong> (TPR), is given as: <span class="math display">\[
Sensitivity \equiv P(T+ | D+) = TPR = \frac{TP}{OP} = 0.844.
\]</span> In other words, this is the number of TP, divided by the sum of the bottom row, the OP.</p>
<p>The complement of the sensitivity is the <strong>False Negative Rate</strong>(FNR):</p>
<p><span class="math display">\[FNR \equiv P(T- | D+) = 1 - TPR = \frac{FN}{OP} = 0.156\]</span></p>
<p>This quantity is concerned with the bottom row of the confusion matrix.</p>
<p>The Specificity, or the <strong>True Negative Rate</strong>(TNR) is given as: <span class="math display">\[
Specificity \equiv P(T- | D-) = TNR = \,\,\frac{TN}{ON} = 0.995
\]</span></p>
<p>and thus the complement, the <strong>False Positive Rate</strong>(FPR) is given as:</p>
<p><span class="math display">\[FPR \equiv P(T+ | D-) = 1 - TNR = \frac{FP}{ON} = 0.005\]</span></p>
<p>These two quantities are concerned with the <em>top row</em> of the confusion matrix, and the denominators come from its marginal, the ON.</p>
<p>Let’s redraw the confusion matrix with all the “test-kit” based numbers put in.</p>
<p><img src="/assets/confusionmatrixcovidserologybayesian.png" /></p>
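<p>These four rates can be computed directly from the S3 counts quoted in the technical appendix (103 true positives out of 122 observed positives; 399 true negatives out of 401 observed negatives); a quick sketch:</p>

```python
TP, OP = 103, 122   # true positives, observed positives (S3, combined)
TN, ON = 399, 401   # true negatives, observed negatives (S3, combined)
FN = OP - TP        # false negatives: 19
FP = ON - TN        # false positives: 2

sensitivity = TP / OP   # TPR ≈ 0.844
fnr = FN / OP           # FNR ≈ 0.156
specificity = TN / ON   # TNR ≈ 0.995
fpr = FP / ON           # FPR ≈ 0.005
print(sensitivity, fnr, specificity, fpr)
```

<p>Each pair of complements (TPR + FNR, TNR + FPR) sums to 1, since each pair shares a denominator.</p>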
<p>You might think that this data-set is a good one to estimate prevalence (<span class="math inline">\(\frac{OP}{Total}\)</span>) from, but it’s not: many of the test kits were tested on pre-Covid blood samples (to look for false positives), while the false-negative testing was done entirely on samples deemed positive in other fashions. These are thus synthetic numbers, useful for the sole purpose of estimating sensitivity and specificity. But the authors did do a survey of the public, so we will look at that for estimating prevalence.</p>
<p>To understand how we might want to model our prevalence, first note that we put little hats on our sensitivity and specificity above, <span class="math inline">\({\hat r}\)</span> and <span class="math inline">\({\hat s}\)</span>. Why?</p>
<p>The reason is that these numbers are estimates. They were estimated on particular positive and negative samples; estimates on other samples could give us different numbers. Indeed the authors made three estimates, S1, S2, and S3 (we are using the combined estimate, S3).</p>
<p>What we want to remember, then, is that the specificity and sensitivity, and for that matter the prevalence, are stochastic quantities, and these values are only estimates.</p>
<p>With just that notion in mind, we can ballpark why we should be afraid of trusting any prevalence numbers based on these estimates.</p>
<p>Suppose the specificity of the test were 98.5%. This would mean that all 50 of the 3,330 who tested positive could be false positives. Now suppose it were 100%; then all 50 would be true positives. At our quoted number of 2 false positives in about 400, we have a specificity of 99.5%, which would mean that roughly a third of those who tested positive are false positives.</p>
<p>The point here is that with such low numbers, things can move around very easily. Suppose another 2 tests in another sample were false positives (a specificity of 99%). Now the false positives would account for 2/3 (about 33) of the 50 positive tests in Santa Clara, and our estimate of the prevalence would be even lower.</p>
<p>On the other hand, prevalence estimates from New York City are much higher, around 20%. Whatever the problems with those surveys, we are not so sensitive to false positives any more: for any specificity in the range 97% to 100%, the number of true positives does not change that much.</p>
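<p>This ballpark reasoning can be made concrete in a few lines: the expected number of false positives among the 3,330 tested is just (1 − specificity) × 3330 (ignoring sampling noise):</p>

```python
n_tested, n_positive = 3330, 50   # Santa Clara survey: tests done, positives found

expected_fp = {s: (1 - s) * n_tested for s in (0.985, 0.990, 0.995, 1.000)}
for s, fp in expected_fp.items():
    print(f"specificity {s:.1%}: ~{fp:.0f} expected false positives "
          f"(~{fp / n_positive:.0%} of the 50 positives)")
```

<p>At 98.5% specificity essentially all 50 positives could be false; at 99.5% about a third could be; at 100% none.</p>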
<p>This is the intuition for why any result claiming that a low prevalence is actually somewhat higher ought to make you skeptical. But read on to see how you can do a principled analysis of the uncertainties involved, and how the results of that analysis end up differing from those of the Santa Clara survey preprint authors.</p>
</section>
</section>
<section id="the-bayesian-approach" class="level2">
<h2>The Bayesian Approach</h2>
<p>We think that the best, most principled way to capture and model these uncertainties is Bayesian Modeling. Not just using Bayes theorem like we did last time, but modeling sensitivity, specificity, and prevalence using probability distributions.</p>
<p>In both frequentist and Bayesian statistics, estimates can differ. The estimates could be coming from different samples, or perhaps the assays themselves have some noise.</p>
<p>As usual we wish to model this by a probability density:</p>
<p><img src="/assets/slivercopycovid.png" /></p>
<p>The interpretation of probability density is that when you multiply <span class="math inline">\(p(x)\)</span> by the histogram width <span class="math inline">\(dx\)</span> you get one of the histogram bars <span class="math inline">\(dP(x)\)</span>, a <em>sliver</em> of the probability that the feature <span class="math inline">\(X\)</span> has value <span class="math inline">\(x\)</span>:</p>
<p><span class="math display">\[dP(x) = p(x)dx.\]</span></p>
<p>(To be precise you want these histogram bars to be as thin as possible.)</p>
<p>Now when you add all of these probability slivers over the range from <span class="math inline">\(a\)</span> to <span class="math inline">\(b\)</span> you must get 1. You can also consider the area under the density curve up to some value <span class="math inline">\(x\)</span>: this function, <span class="math inline">\(P(x)\)</span>, is called the cumulative distribution function (CDF); sometimes you will see it just called the distribution function. And <span class="math inline">\(p(x)\)</span> is called the density function.</p>
<p>In the large population limit, the probability in the sliver can be thought of as the number of data points around a particular <span class="math inline">\(x\)</span> divided by the total number of data points in the population. Thus, when multiplied by the total number of data points in the population, <span class="math inline">\(dP(x)\)</span>, the change in the CDF at <span class="math inline">\(x\)</span>, gives us the number of data points around that <span class="math inline">\(x\)</span>. So it allows us to have different amounts of data at different <span class="math inline">\(x\)</span>.</p>
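<p>The sliver relation <span class="math inline">\(dP(x) = p(x)dx\)</span> is easy to check numerically; a small sketch with a standard normal density (chosen purely for illustration):</p>

```python
from scipy.stats import norm

x, dx = 0.0, 1e-4
sliver = norm.cdf(x + dx) - norm.cdf(x)   # exact probability in [x, x + dx]
approx = norm.pdf(x + dx / 2) * dx        # density at the midpoint times width
print(sliver, approx)  # the two agree more and more closely as dx shrinks
```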
<p>Now, the shape of the probability density depends on some numbers, called parameters. For example, in the figure above, one parameter could tell you where the maximum of the probability density is, while another could tell you how wide it is. Keeping the first constant while making the second smaller would keep the probability density at the same place while making it thinner and thus taller (so that the total sum of the slivers still adds to 1).</p>
<p>Now, the Bayesian approach to statistics has two parts:</p>
<ol type="A">
<li>treat the parameters, which we shall call <span class="math inline">\(\theta\)</span>, as random variables. This is in contrast to the <em>frequentist</em> approach, where we think of data-sets as stochastic, as samples from a population, with the parameters fixed for the population.</li>
</ol>
<ol start="2" type="A">
<li>Associate with these parameters <span class="math inline">\(\theta\)</span> a prior distribution <span class="math inline">\(p(\theta)\)</span>, which encodes our belief about which values of <span class="math inline">\(\theta\)</span> we think are more likely. Usually, the prior distribution represents our belief on the parameter values when we have not observed any data yet. But not always: we shall see here that we can think of our survey as the “data” and information from the test-kits as factoring into our “prior”. How you do this factoring is your choice.</li>
</ol>
<section id="posterior-distribution" class="level3">
<h3>Posterior Distribution</h3>
<p>The key player in the Bayesian context, is a distribution called the <em>posterior distribution</em> over parameter values <span class="math inline">\(\theta\)</span> given our data. In other words, we would like to know <span class="math inline">\(p(\theta \vert \cal{D})\)</span>:</p>
<p>Well, how do we do this? Bayes Rule!!</p>
<p><span class="math display">\[ p(\theta \vert \cal{D}) p(\cal{D}) = p(\cal{D} \vert \theta)\,p(\theta)\]</span></p>
<p>so:</p>
<p><span class="math display">\[ p(\theta \vert \cal{D}) = \frac{p(\cal{D} \vert \theta)\,p(\theta)}{p(\cal{D})} .\]</span></p>
<p>How do we get the denominator? Well, it’s just a sum over the joint probabilities of the data with every value of <span class="math inline">\(\theta\)</span>:</p>
<p><span class="math display">\[p(\cal{D}) = \int d\theta p(\cal{D} , \theta) = \int d\theta p(\cal{D} \vert \theta) p(\theta).\]</span></p>
<p>In statistics, an integral of the product of a function and a probability density is called an expectation (or mean) value. This is because such an integral computes a weighted mean with the weights coming from our probability slivers. Thus the denominator of our formula, also called the <em>evidence</em>, can be written as an expectation value over a PDF, denoted using the <span class="math inline">\(E_{pdf(x)}[function]\)</span> notation:</p>
<p><span class="math display">\[E_{p(\theta)}[{\cal L}]\]</span></p>
<p>where</p>
<p><span class="math display">\[{\cal L} = p(\cal{D} \vert \theta)\]</span></p>
<p>is called the Likelihood. The evidence is basically the normalization constant. You can remember this as:</p>
<p><span class="math display">\[ posterior = \frac{likelihood \times prior}{evidence} \]</span></p>
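<p>As a numerical check of the evidence-as-expectation idea, here is a sketch for a flat <span class="math inline">\({\rm Beta}(1,1)\)</span> prior and a binomial likelihood with, say, 6 successes in 9 trials (the same numbers as the globe-tossing example below):</p>

```python
import numpy as np
from scipy.stats import binom, beta

n, k = 9, 6                           # illustrative data: 6 successes in 9 trials
theta = np.linspace(0, 1, 100_001)    # a fine grid over parameter space

likelihood = binom.pmf(k, n, theta)   # p(D|θ) as a function of θ
prior = beta.pdf(theta, 1, 1)         # flat prior p(θ)

# evidence p(D) = ∫ p(D|θ) p(θ) dθ ≈ the grid average, since the grid spans [0, 1]
evidence = np.mean(likelihood * prior)
print(evidence)
```

<p>With a flat prior the marginal probability of every count <span class="math inline">\(k = 0, \ldots, n\)</span> is the same, <span class="math inline">\(1/(n+1)\)</span>, so the evidence here comes out to about 0.1.</p>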
</section>
</section>
<section id="a-seal-tosses-a-globe" class="level2">
<h2>A seal tosses a globe</h2>
<p>This problem, adapted from McElreath’s super good Bayesian statistics book, <em>Statistical Rethinking</em>, involves a seal tossing a globe in the air and catching it on its nose. The seal then sees if its nose is touching land or water.</p>
<p>We are told that the first 9 tosses were:</p>
<p><code>WLWWWLWLW</code>.</p>
<p>We wish to understand the evolution of our belief in how much water is on the globe as it keeps getting tossed.</p>
<p>For this purpose, we’ll need to introduce ourselves to a new probability distribution.</p>
<section id="binomial-distribution" class="level3">
<h3>Binomial Distribution</h3>
<p>Let us consider a population of coin flips for a fair coin: <span class="math inline">\(x_1,x_2,...,x_n\)</span>. The distribution of coin flips is the <em>Binomial distribution</em>. This distribution answers the question: what is the probability of obtaining <span class="math inline">\(k\)</span> heads in <span class="math inline">\(n\)</span> flips of the coin?</p>
<p>Well, consider for example that we have flipped a fair coin 3 times. (This diagram is taken from the Feynman Lectures on Physics, volume 1, <a href="http://www.feynmanlectures.caltech.edu/I_06.html">chapter 6</a>.)</p>
<p><img src="/assets/3flips.png" /></p>
<p>When we draw the diagram above, we see that there are different probabilities associated with the events of 0, 1, 2, and 3 heads, with 1 and 2 heads being the most likely. The <em>Binomial Distribution</em> is the one that tells us how to count these possibilities. It is given as:</p>
<p><span class="math display">\[P(X = k; n, \theta) = {n\choose k}\theta^k(1-\theta)^{n-k} \]</span></p>
<p>where</p>
<p><span class="math display">\[{n\choose k}=\frac{n!}{k!(n-k)!}\]</span></p>
<p>and <span class="math inline">\(\theta\)</span> is the “fairness” of the coin, or the propensity to fall heads. A totally fair coin has <span class="math inline">\(\theta = 0.5\)</span>.</p>
<p>How did we obtain this? The <span class="math inline">\(\theta^k(1-\theta)^{n-k}\)</span> comes simply from multiplying the probabilities for each bernoulli trial; there are <span class="math inline">\(k\)</span> 1’s or yes’s, and <span class="math inline">\(n-k\)</span> 0’s or no’s. The <span class="math inline">\({n\choose k}\)</span> comes from counting the number of ways in which each event happens: this corresponds to counting all the paths that give the same number of heads in the diagram above.</p>
<p>For <span class="math inline">\(\theta=0.5\)</span>, <span class="math inline">\(n=3\)</span>, and <span class="math inline">\(k=2\)</span>, one of the most likely cases above, we get <span class="math inline">\({3\choose 2}(1/2)^2(1/2)^1 = 3/8\)</span>. Simply put, there are 8 possibilities, and 3 ways to get 2 heads and one tail.</p>
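<p>The 3/8 answer is easy to verify against scipy's binomial:</p>

```python
from math import comb
from scipy.stats import binom

# P(k = 2 heads in n = 3 flips of a fair coin), by hand and via scipy
p_manual = comb(3, 2) * 0.5**2 * 0.5**1
p_scipy = binom.pmf(2, 3, 0.5)
print(p_manual, p_scipy)   # both 0.375, i.e. 3/8
```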
<p>We show the distribution below for 200 trials, for different values of <span class="math inline">\(\theta\)</span> (or how biased the coin is, if you like).</p>
<figure class="fullwidth">
<img src="/assets/binomials.png" />
</figure>
<div class="sourceCode" id="cb1"><pre class="sourceCode python"><code class="sourceCode python"><span id="cb1-0a"><a href="#cb1-0a"></a><span class="im">import</span> numpy <span class="im">as</span> np</span>
<span id="cb1-0b"><a href="#cb1-0b"></a><span class="im">import</span> matplotlib.pyplot <span class="im">as</span> plt</span>
<span id="cb1-1"><a href="#cb1-1"></a><span class="im">from</span> scipy.stats <span class="im">import</span> binom</span>
<span id="cb1-2"><a href="#cb1-2"></a>k <span class="op">=</span> np.arange(<span class="dv">0</span>, <span class="dv">200</span>)</span>
<span id="cb1-3"><a href="#cb1-3"></a><span class="cf">for</span> θ <span class="kw">in</span> [<span class="fl">0.1</span>, <span class="fl">0.3</span>, <span class="fl">0.5</span>, <span class="fl">0.7</span>, <span class="fl">0.9</span>]:</span>
<span id="cb1-4"><a href="#cb1-4"></a> rv <span class="op">=</span> binom(<span class="dv">200</span>, θ) <span class="co"># construct a distribution object</span></span>
<span id="cb1-5"><a href="#cb1-5"></a> plt.plot(k, rv.pmf(k), <span class="st">'.'</span>, lw<span class="op">=</span><span class="dv">1</span>, label<span class="op">=</span>θ) <span class="co"># plot pdf</span></span>
<span id="cb1-6"><a href="#cb1-6"></a> plt.fill_between(k, rv.pmf(k), alpha<span class="op">=</span><span class="fl">0.2</span>) <span class="co"># fill down to x-axis</span></span></code></pre></div>
</section>
<section id="choosing-a-prior" class="level3">
<h3>Choosing a prior</h3>
<p>Back to the seal-tosses-globe experiment. We can use the binomial distribution to describe it. But our seal does not have a strong prior notion of how much water is there on the globe. We need to model this prior notion.</p>
<p>We’ll use a <a href="https://en.wikipedia.org/wiki/Beta_distribution">Beta Distribution</a> as our <em>Prior Distribution</em>, to encode our beliefs. A Beta distribution is defined on <span class="math inline">\([0, 1]\)</span> and can be contorted into many different shapes, as can be seen in the figure below.</p>
<p><span class="math display">\[ p(\theta) = {\rm Beta}(\theta,\alpha, \beta) = \frac{\theta^{\alpha-1} (1-\theta)^{\beta-1} }{B(\alpha, \beta)} \]</span> where <span class="math inline">\(B(\alpha, \beta)\)</span> is independent of <span class="math inline">\(\theta\)</span>; it is the normalization factor.</p>
<p>You will also find in the literature the notation:</p>
<p><span class="math display">\[\theta \sim {\rm Beta}(\alpha, \beta).\]</span></p>
<p>This is called the “tilde” notation and is to be read as: <span class="math inline">\(\theta\)</span> is distributed according to the <span class="math inline">\({\rm Beta}(\alpha, \beta)\)</span> distribution, OR, <span class="math inline">\(\theta\)</span> is drawn from a <span class="math inline">\({\rm Beta}(\alpha, \beta)\)</span> distribution.</p>
<div class="sourceCode" id="cb2"><pre class="sourceCode python"><code class="sourceCode python"><span id="cb2-0a"><a href="#cb2-0a"></a><span class="im">import</span> numpy <span class="im">as</span> np</span>
<span id="cb2-0b"><a href="#cb2-0b"></a><span class="im">import</span> matplotlib.pyplot <span class="im">as</span> plt</span>
<span id="cb2-1"><a href="#cb2-1"></a><span class="im">from</span> scipy.stats <span class="im">import</span> beta</span>
<span id="cb2-2"><a href="#cb2-2"></a>x<span class="op">=</span>np.linspace(<span class="fl">0.</span>, <span class="fl">1.</span>, <span class="dv">1000</span>)</span>
<span id="cb2-3"><a href="#cb2-3"></a>plt.plot(x, beta.pdf(x, <span class="dv">1</span>, <span class="dv">1</span>), label<span class="op">=</span><span class="st">"${</span><span class="ch">\r</span><span class="st">m Beta}(1,1)$"</span>)<span class="op">;</span></span>
<span id="cb2-4"><a href="#cb2-4"></a>plt.plot(x, beta.pdf(x, <span class="dv">10</span>, <span class="dv">10</span>), label<span class="op">=</span><span class="st">"${</span><span class="ch">\r</span><span class="st">m Beta}(10,10)$"</span>)<span class="op">;</span></span>
<span id="cb2-5"><a href="#cb2-5"></a>plt.plot(x, beta.pdf(x, <span class="dv">100</span>, <span class="dv">100</span>), label<span class="op">=</span><span class="st">"${</span><span class="ch">\r</span><span class="st">m Beta}(100,100)$"</span>)<span class="op">;</span></span>
<span id="cb2-6"><a href="#cb2-6"></a>plt.plot(x, beta.pdf(x, <span class="dv">2</span>, <span class="dv">50</span>), label<span class="op">=</span><span class="st">"${</span><span class="ch">\r</span><span class="st">m Beta}(2,50)$"</span>)<span class="op">;</span></span>
<span id="cb2-7"><a href="#cb2-7"></a>plt.plot(x, beta.pdf(x, <span class="dv">10</span>, <span class="dv">1</span>), label<span class="op">=</span><span class="st">"${</span><span class="ch">\r</span><span class="st">m Beta}(10,1)$"</span>)<span class="op">;</span></span></code></pre></div>
<p><img src="/assets/betas.png" /></p>
<p>If, before tossing our globe, the seal cannot judge how much water there is, we might want to say that every fraction of water is equally likely. This is a <em>Uniform</em> distribution, and can be modeled as we can see in the figure above by a <span class="math inline">\({\rm Beta}(1,1)\)</span> distribution. On the other hand we may have a strong belief in there being equal amounts of water and land. This corresponds to <span class="math inline">\({\rm Beta}(10,10)\)</span>. An even stronger belief: <span class="math inline">\({\rm Beta}(100,100)\)</span>.</p>
<p>Indeed the interpretation of the Beta distribution is that the arguments (which are confusingly enough themselves called <span class="math inline">\(\alpha\)</span> and <span class="math inline">\(\beta\)</span>) correspond to “prior” tosses. So 1,1 corresponds to 1 prior water and 1 prior land, while 100,100 corresponds to 100 prior waters and 100 prior lands. You can imagine that just 9 globe tosses would not be able to overcome the strength of such a prior. You would be right, as we shall see.</p>
</section>
<section id="obtaining-the-posterior" class="level3">
<h3>Obtaining the posterior</h3>
<p>When we usually work with Bayes Theorem we do not care much about the “evidence” term in the denominator. It does not depend upon the parameters (they are summed over or integrated out). What matters to us is the relative probability of two values of a parameter. Thus we blithely go from:</p>
<p><span class="math display">\[ p(\theta \vert \cal{D}) = \frac{p(\cal{D} \vert \theta)\,p(\theta)}{p(\cal{D})} \]</span></p>
<p>to</p>
<p><span class="math display">\[ p(\theta \vert \cal{D}) \propto p(\cal{D} \vert \theta)\,p(\theta) .\]</span></p>
<p>With this in mind, let’s identify the various parts of our posterior <span class="math inline">\(p(\theta \vert \cal{D})\)</span>, the probability of a particular value of the fraction <span class="math inline">\(\theta\)</span>, given data in which we did 9 tosses and got a particular sequence with 6 Ws. It is, up to the constant of proportionality, the product of:</p>
<ol type="1">
<li>The prior <span class="math inline">\(p(\theta)\)</span>, which we will choose as different Beta distributions, to demonstrate sensitivity to priors.</li>
<li>The likelihood <span class="math inline">\(p(\cal{D} \vert \theta)\)</span>, which is the probability of <span class="math inline">\(k\)</span> tosses landing on water out of <span class="math inline">\(n\)</span> tosses of the globe, GIVEN a particular value for <span class="math inline">\(\theta\)</span>. This is given by the Binomial distribution for that value of <span class="math inline">\(\theta\)</span>.</li>
</ol>
<p>From Bayes theorem, the posterior for <span class="math inline">\(\theta\)</span> is</p>
<p><span class="math display">\[ p(\theta|D) \propto p(n,k|\theta) \,p(\theta) = Binom(\theta, n, k) \, {\rm Beta}(\theta,\alpha, \beta) .\]</span></p>
<p>It turns out that the product of a Binomial likelihood and a Beta prior is another Beta distribution, with the number of tosses for water and the number of tosses for land (or heads and tails for a coin) just added to the “prior” W and “prior” L tosses:</p>
<p><span class="math display">\[p(\theta|D) \propto {\rm Beta}(\theta, \alpha+k, \beta+n-k)\]</span></p>
<p>So a small number of throws will not hugely change the shape of priors with large numbers of “prior” tosses.</p>
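<p>The conjugacy claim is easy to verify numerically: on a grid of <span class="math inline">\(\theta\)</span> values, the normalized product of the Binomial likelihood and the Beta prior coincides with <span class="math inline">\({\rm Beta}(\alpha+k, \beta+n-k)\)</span>. A sketch with the globe data (6 W in 9 tosses) and a flat prior:</p>

```python
import numpy as np
from scipy.stats import binom, beta

n, k = 9, 6                        # 6 W in 9 tosses
a0, b0 = 1, 1                      # flat Beta(1,1) prior
theta = np.linspace(0, 1, 1001)
h = theta[1] - theta[0]

unnorm = binom.pmf(k, n, theta) * beta.pdf(theta, a0, b0)  # likelihood × prior
posterior = unnorm / (unnorm.sum() * h)                    # normalize on the grid
conjugate = beta.pdf(theta, a0 + k, b0 + n - k)            # Beta(7, 4)

max_diff = np.max(np.abs(posterior - conjugate))
print(max_diff)   # tiny: the grid posterior is Beta(7, 4)
```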
</section>
<section id="interrogating-the-posterior" class="level3">
<h3>Interrogating the posterior</h3>
<p>Let us start with no strong belief about the amount of water on the globe. This is a kind of un-informative prior: it does not add strong beliefs into the mix. When we do that:</p>
<div class="sourceCode" id="cb3"><pre class="sourceCode python"><code class="sourceCode python"><span id="cb3-0a"><a href="#cb3-0a"></a>x <span class="op">=</span> np.linspace(<span class="fl">0.</span>, <span class="fl">1.</span>, <span class="dv">1000</span>)</span>
<span id="cb3-0b"><a href="#cb3-0b"></a>num_heads, num_tails <span class="op">=</span> <span class="dv">6</span>, <span class="dv">3</span> <span class="co"># 6 W and 3 L in WLWWWLWLW</span></span>
<span id="cb3-1"><a href="#cb3-1"></a>prior_params <span class="op">=</span> np.array( [<span class="fl">1.</span>,<span class="fl">1.</span>] ) <span class="co"># FLAT </span></span>
<span id="cb3-2"><a href="#cb3-2"></a>prior_pdf <span class="op">=</span> beta.pdf( x, <span class="op">*</span>prior_params)</span>
<span id="cb3-3"><a href="#cb3-3"></a>posterior_params <span class="op">=</span> prior_params <span class="op">+</span> np.array( [num_heads, num_tails] )</span>
<span id="cb3-4"><a href="#cb3-4"></a>posterior_pdf <span class="op">=</span> beta.pdf( x, <span class="op">*</span>posterior_params)</span></code></pre></div>
<p>we get a posterior like this…</p>
<figure>
<img src="/assets/posterior11.png" alt="" /><figcaption>png</figcaption>
</figure>
<p>You can see that it hews to the data, in that the highest probability is around 6 out of 9 tosses landing on water, as this is where the binomial likelihood has the most probability mass. But it also respects the fact that we do not really have much data (only 9 tosses) by having a rather “wide” posterior, visually.</p>
<p>Now we can calculate all sorts of stuff.</p>
<p>The probability that the amount of water is less than 50% is just the area under the red curve to the left of <span class="math inline">\(\theta = 0.5\)</span>, which comes to about 17.2%.</p>
<p>The value of <span class="math inline">\(\theta\)</span> below which 80% of the area under the curve lies is <span class="math inline">\(\theta = 0.763\)</span>.</p>
<p>Perhaps a useful summary of the posterior is the pair of limits of <span class="math inline">\(\theta\)</span> between which a given percentage of the probability mass lies, like the middle 80% (10% to 90%). This is called an 80% <strong>credible interval</strong>.</p>
<p>For this posterior it is <span class="math inline">\(\theta \in [ 0.44604094, 0.81516349]\)</span>.</p>
<p>You can also make various point estimates from the pdf: mean=0.638, median=0.647.</p>
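<p>All of these summaries can be computed directly from the <span class="math inline">\({\rm Beta}(7,4)\)</span> posterior with scipy. A sketch (the small differences from the numbers above come from the grid used in the original notebook):</p>

```python
from scipy.stats import beta

# posterior after 6 W and 3 L on a flat Beta(1,1) prior
posterior = beta(7, 4)

print(posterior.cdf(0.5))         # P(theta < 0.5), about 0.17
print(posterior.ppf(0.8))         # theta with 80% of the mass to its left
print(posterior.ppf([0.1, 0.9]))  # middle-80% credible interval
print(posterior.mean(), posterior.median())  # point estimates
```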
</section>
<section id="posteriors-sensitivity-to-the-prior" class="level3">
<h3>Posteriors: sensitivity to the prior</h3>
<p>We can change the prior to see how sensitive the posterior is to it.</p>
<p>Suppose we started with a <span class="math inline">\(\theta \sim {\rm Beta}(10,10)\)</span> prior. Then we get:</p>
<p><img src="/assets/posterior1010.png" /></p>
<p>in which our posterior is pulled quite a bit towards our prior (remember we have 20 prior samples as compared to our 9 new ones).</p>
<p>And if we were as sure about our belief that there was 50% water on the globe as we were about the “fact” that Obama was “born in Kenya” (<span class="math inline">\({\rm Beta}(100,100)\)</span> prior), our data would not make much of a difference at all:</p>
<p><img src="/assets/posterior100100.png" /></p>
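<p>The pull of these three priors can be seen numerically as well. A quick sketch (not from the original notebook) comparing the posterior means:</p>

```python
from scipy.stats import beta

num_heads, num_tails = 6, 3  # the 9 globe tosses: 6 W, 3 L

# the stronger the Beta(a,a) prior, the closer the posterior mean stays to 0.5
for a in [1, 10, 100]:
    post = beta(a + num_heads, a + num_tails)
    print(f"Beta({a},{a}) prior -> posterior mean {post.mean():.3f}")
```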
</section>
<section id="bayesian-updating-posteriors-at-each-time-step" class="level3">
<h3>Bayesian Updating: posteriors at each time step</h3>
<p>Let’s stick with our <span class="math inline">\({\rm Beta}(1,1)\)</span> prior, since it lets the data speak for itself. One should use strong priors when they have been obtained from previously collected data (as we shall soon see for the Santa Clara survey), or when they produce some regularizing behavior (an idea used often in hierarchical models).</p>
<p>Let us now take a look at Bayesian inference as an updating process. That is, instead of considering the 9 tosses as one dataset, let’s consider each new toss as a new “dataset” and update our posterior.</p>
<p>From this angle, when no data is seen, the posterior is just the prior. We assume equal probability for any <span class="math inline">\(\theta\)</span> according to <span class="math inline">\({\rm Beta}(1,1)\)</span>. Then let’s say we see one data-point, the first <code>W</code>. How does our posterior update?</p>
<p>This is easy: we add 1 count to the existing “prior” count of 1 for <code>W</code> and 0 counts to the prior count of 1 for <code>L</code>. This means that we move to <span class="math inline">\({\rm Beta}(2,1)\)</span> as our posterior. If we plot this we see that the probability of <code>W</code> has gone up: the new Beta density gets a lift closer to 1 than to 0.</p>
<div class="sourceCode" id="cb4"><pre class="sourceCode python"><code class="sourceCode python"><span id="cb4-0a"><a href="#cb4-0a"></a><span class="im">import</span> matplotlib.pyplot <span class="im">as</span> plt</span>
<span id="cb4-0b"><a href="#cb4-0b"></a>fig, axes <span class="op">=</span> plt.subplots(<span class="dv">3</span>, <span class="dv">3</span>, figsize<span class="op">=</span>(<span class="dv">12</span>, <span class="dv">9</span>)) <span class="co"># one panel per toss</span></span>
<span id="cb4-0c"><a href="#cb4-0c"></a>axes <span class="op">=</span> axes.ravel()</span>
<span id="cb4-1"><a href="#cb4-1"></a>prior_params<span class="op">=</span>[<span class="fl">1.</span>, <span class="fl">1.</span>]</span>
<span id="cb4-2"><a href="#cb4-2"></a>data <span class="op">=</span> [<span class="dv">1</span>,<span class="dv">0</span>,<span class="dv">1</span>,<span class="dv">1</span>,<span class="dv">1</span>,<span class="dv">0</span>,<span class="dv">1</span>,<span class="dv">0</span>,<span class="dv">1</span>] <span class="co"># WLWWWLWLW</span></span>
<span id="cb4-3"><a href="#cb4-3"></a>choices<span class="op">=</span>[<span class="st">'L'</span>, <span class="st">'W'</span>]</span>
<span id="cb4-4"><a href="#cb4-4"></a><span class="cf">for</span> i,v <span class="kw">in</span> <span class="bu">enumerate</span>(data):</span>
<span id="cb4-5"><a href="#cb4-5"></a> prior_pdf <span class="op">=</span> beta.pdf( x, <span class="op">*</span>prior_params) <span class="co"># prior</span></span>
<span id="cb4-6"><a href="#cb4-6"></a> <span class="cf">if</span> v<span class="op">==</span><span class="dv">1</span>:</span>
<span id="cb4-7"><a href="#cb4-7"></a> tosses <span class="op">=</span> [<span class="dv">1</span>,<span class="dv">0</span>]</span>
<span id="cb4-8"><a href="#cb4-8"></a> <span class="cf">else</span>:</span>
<span id="cb4-9"><a href="#cb4-9"></a> tosses <span class="op">=</span> [<span class="dv">0</span>,<span class="dv">1</span>]</span>
<span id="cb4-10"><a href="#cb4-10"></a> posterior_params <span class="op">=</span> prior_params <span class="op">+</span> np.array( tosses ) <span class="co"># posterior's beta parameters</span></span>
<span id="cb4-11"><a href="#cb4-11"></a> posterior_pdf <span class="op">=</span> beta.pdf( x, <span class="op">*</span>posterior_params) <span class="co"># posterior </span></span>
<span id="cb4-12"><a href="#cb4-12"></a> prior_params <span class="op">=</span> posterior_params <span class="co"># make posterior the prior for next step</span></span>
<span id="cb4-13"><a href="#cb4-13"></a> axes[i].plot( x,prior_pdf, lw <span class="op">=</span><span class="dv">1</span>, color <span class="op">=</span><span class="st">"#348ABD"</span> )</span>
<span id="cb4-14"><a href="#cb4-14"></a> axes[i].plot( x, posterior_pdf, lw<span class="op">=</span> <span class="dv">3</span>, color <span class="op">=</span><span class="st">"#A60628"</span> )</span>
<span id="cb4-15"><a href="#cb4-15"></a> axes[i].fill_between( x, <span class="dv">0</span>, prior_pdf, color <span class="op">=</span><span class="st">"#348ABD"</span>, alpha <span class="op">=</span> <span class="fl">0.15</span>) </span>
<span id="cb4-16"><a href="#cb4-16"></a> axes[i].fill_between( x, <span class="dv">0</span>, posterior_pdf, color <span class="op">=</span><span class="st">"#A60628"</span>, alpha <span class="op">=</span> <span class="fl">0.15</span>) </span></code></pre></div>
<figure class="fullwidth">
<img src="/assets/bayesianupdating.png" title="fig:" alt="png" />
</figure>
<p>We continue the process. Now we add an <code>L</code>. This takes us to <span class="math inline">\({\rm Beta}(2,2)\)</span>, which is peaked at the center, making us think that water and land are equally probable. And so on and so forth, until we have seen all 9 data points, with 6 <code>W</code> to 3 <code>L</code> taking us to a final posterior of <span class="math inline">\({\rm Beta}(6+1, 3+1) = {\rm Beta}(7,4)\)</span>, which sits a bit to the right of <span class="math inline">\(\theta = 0.6\)</span>, as we can see…</p>
</section>
</section>
<section id="a-bayesian-analysis-of-the-santa-clara-survey" class="level2">
<h2>A Bayesian analysis of the Santa Clara survey</h2>
<p>Let’s now apply a Bayesian analysis to the Santa Clara data. We start by reproducing the confusion matrix from the test kits we had seen earlier.</p>
<p><img src="/assets/confusionmatrixcovidserologybayesian.png" /></p>
<p>We will model the sensitivity <span class="math inline">\(r\)</span>, specificity <span class="math inline">\(s\)</span>, and prevalence <span class="math inline">\(f\)</span> using Bayesian analysis. The sensitivity <span class="math inline">\(r\)</span>, before the results from the test-kits are seen, could be anything. So let us model it with a Uniform prior, once again expressed as a Beta distribution. We’ll write this using the “tilde” notation:</p>
<p><span class="math display">\[r \sim {\rm Beta}(1,1),\]</span></p>
<p>which is to be read as, <span class="math inline">\(r\)</span> is drawn from a <span class="math inline">\({\rm Beta}(1,1)\)</span> distribution, or that the prior pdf of <span class="math inline">\(r\)</span> is a Uniform distribution.</p>
<p>We’ll do this for the other two stochastic quantities as well, the specificity and the prevalence:</p>
<p><span class="math display">\[s \sim {\rm Beta}(1,1),\]</span> <span class="math display">\[f \sim {\rm Beta}(1,1).\]</span></p>
<p>Now, let us see some data from the test-kits. Let’s write this process down using Bayes rule to get posteriors on the sensitivity and specificity, after the test-kits, but before any survey data:</p>
<p><span class="math display">\[p(r, s \mid D_{kits}) \propto \, p(D_{kits} \mid r, s) \, p(r,s)\]</span></p>
<p>But the sensitivity and specificity estimates come from different test-kit data, so we can consider them to be independent: in other words, there are no interactions amongst their likelihoods or their priors:</p>
<p><span class="math display">\[p(r,s \mid D_{kits}) \propto p(D_{kits, pos} \mid r) \, p(D_{kits, neg} \mid s) \, p(r) \, p(s)\]</span></p>
<p>In other words, the posterior factorizes as well:</p>
<p><span class="math display">\[p(r \mid D_{kits}) \propto p(D_{kits, pos} \mid r) \, p(r)\]</span></p>
<p>and</p>
<p><span class="math display">\[p(s \mid D_{kits}) \propto p(D_{kits, neg} \mid s) p(s).\]</span></p>
<p>Now the two likelihoods are simply binomials. The sensitivity is about the probability of getting true positives amongst observed positives, so we have <span class="math inline">\(n=122\)</span> cases with <span class="math inline">\(k=103\)</span> true positives. This means that the posterior for the sensitivity is:</p>
<p><span class="math display">\[p(r \mid D_{kits, pos}) = {\rm Beta}(r, 103+1, 122-103+1) = {\rm Beta}(r, 104, 20)\]</span></p>
<p>The prior and posterior can be seen in the figure below. <img src="/assets/sensitivityupdating.png" /></p>
<p>For the negatives we have 401 observed negatives with 399 True negatives, so that</p>
<p><span class="math display">\[p(s \mid D_{kits, neg}) = {\rm Beta}(s, 399+1, 401-399+1) = {\rm Beta}(s, 400, 3)\]</span></p>
<p><img src="/assets/specificityupdating.png" /></p>
<p>This posterior is even more extreme and takes us very close to 1 in specificity, since there were only 2 false positives. But even just 2 false positives make a difference.</p>
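<p>Because the Beta prior is conjugate to the binomial likelihood, both test-kit posteriors are one-line computations. A sketch:</p>

```python
from scipy.stats import beta

# test-kit data from the preprint
tp, known_pos = 103, 122   # true positives among known positive samples
tn, known_neg = 399, 401   # true negatives among known negative samples

# Beta(1,1) prior + binomial likelihood => Beta(successes+1, failures+1) posterior
post_r = beta(tp + 1, known_pos - tp + 1)   # sensitivity: Beta(104, 20)
post_s = beta(tn + 1, known_neg - tn + 1)   # specificity: Beta(400, 3)

print(post_r.mean(), post_s.mean())
```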
<p>We can now turn around and use the posteriors as priors for the Santa Clara Survey.</p>
<p><span class="math display">\[p(r, s, f \mid D_{survey}) \propto p(D_{survey} \mid r, s, f) \, p(r, s, f)\]</span></p>
<p>Since the posterior after seeing the test-kits, which now acts as a prior, factorizes, we can write:</p>
<p><span class="math display">\[p(r, s, f \mid D_{survey}) \propto p(D_{survey} \mid r, s, f) \, p(r \mid D_{kits, pos}) \, p(s \mid D_{kits, neg}) \, p(f)\]</span></p>
<p>We know the two posterior-priors on the right hand side. As mentioned earlier, we’ll use a uniform prior for the prevalence so as to not bias us one way or the other.</p>
<p>Now we know that there were 3330 tests with 50 positives. Clearly some of these are true positives and some of these are false positives. So, in the language of our confusion matrix, these are the predicted (by the serology test) positives.</p>
<p>We need to calculate the likelihood for these predicted positives. Well, it’s going to be binomial and look something like this:</p>
<p><span class="math display">\[ Binom(p, 3330, 50) \]</span></p>
<p>where <span class="math inline">\(p\)</span> is the probability of being predicted positive. Now, well, a predicted positive is a true positive or a false positive, so:</p>
<p><span class="math display">\[p = \frac{TP + FP}{N} = \frac{TP}{OP}\frac{OP}{N} + \frac{FP}{ON}\frac{ON}{N} = rf + (1 - s)(1-f)\]</span></p>
<p>So then</p>
<p><span class="math display">\[p(r, s, f \mid D_{survey}) \propto Binom(rf + (1 - s)(1-f), 3330, 50) \, {\rm Beta}(r, 104, 20) \, {\rm Beta}(s, 400, 3)\]</span></p>
<p>The product of two Betas with a binomial does not result in another Beta, so this posterior does not have a nice closed form as before.</p>
<p>Now we’d like to find the posterior on prevalence so that we can compare our findings to what we got from the original preprint. To do this we must marginalize (that is, sum) the “joint” posterior in <span class="math inline">\(r\)</span>, <span class="math inline">\(s\)</span>, and <span class="math inline">\(f\)</span> over <span class="math inline">\(r\)</span> and <span class="math inline">\(s\)</span>. In other words:</p>
<p><span class="math display">\[p(f \mid D_{survey}) = \int_0^1 \int_0^1 p(r, s, f \mid D_{survey}) \, dr \, ds \]</span> <span class="math display">\[p(f \mid D_{survey}) \propto \int_0^1 \int_0^1 Binom(rf + (1 - s)(1-f), 3330, 50) \, {\rm Beta}(r, 104, 20) \, {\rm Beta}(s, 400, 3) \, dr \, ds\]</span></p>
<p>This integral can be done numerically, and then the normalization can be found numerically as well. When we do that, we get:</p>
<div class="sourceCode" id="cb5"><pre class="sourceCode python"><code class="sourceCode python"><span id="cb5-1"><a href="#cb5-1"></a><span class="im">from</span> scipy.integrate <span class="im">import</span> nquad</span>
<span id="cb5-2"><a href="#cb5-2"></a>post <span class="op">=</span> <span class="kw">lambda</span> r, s, f: like_survey(r,s,f)<span class="op">*</span>post_r.pdf(r)<span class="op">*</span>post_s.pdf(s)</span>
<span id="cb5-3"><a href="#cb5-3"></a>f_grid <span class="op">=</span> np.linspace(<span class="dv">0</span>,<span class="fl">0.04</span>, <span class="dv">40</span>)</span>
<span id="cb5-4"><a href="#cb5-4"></a>post_f_grid<span class="op">=</span>[]</span>
<span id="cb5-5"><a href="#cb5-5"></a><span class="cf">for</span> f <span class="kw">in</span> f_grid:</span>
<span id="cb5-6"><a href="#cb5-6"></a> <span class="bu">print</span>(f)</span>
<span id="cb5-7"><a href="#cb5-7"></a> integ <span class="op">=</span> nquad(<span class="kw">lambda</span> r, s: post(r,s,f), [[<span class="dv">0</span>,<span class="dv">1</span>], [<span class="dv">0</span>,<span class="dv">1</span>]])</span>
<span id="cb5-8"><a href="#cb5-8"></a> post_f_grid.append(integ)</span>
<span id="cb5-9"><a href="#cb5-9"></a>...</span>
<span id="cb5-10"><a href="#cb5-10"></a>ax.plot(f_grid, prior_f.pdf(f_grid), label<span class="op">=</span><span class="st">"prevalence prior before test-kits and survey"</span>)</span>
<span id="cb5-11"><a href="#cb5-11"></a>ax.plot(f_grid, np.array([e[<span class="dv">0</span>] <span class="cf">for</span> e <span class="kw">in</span> post_f_grid])<span class="op">/</span><span class="fl">3.3244e-4</span>, label<span class="op">=</span><span class="st">"post-survey prevalence posterior"</span>)</span></code></pre></div>
<p><img src="/assets/prevalenceposterior.png" /></p>
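<p>For completeness, here is a self-contained sketch of the whole computation, with the pieces elided in the snippet above (<code>like_survey</code>, <code>post_r</code>, <code>post_s</code>) filled in from the formulas, and the normalization done numerically instead of hard-coded:</p>

```python
import numpy as np
from scipy.integrate import nquad
from scipy.stats import beta, binom

post_r = beta(104, 20)  # sensitivity posterior from the test-kits
post_s = beta(400, 3)   # specificity posterior from the test-kits

def like_survey(r, s, f):
    # probability that a survey test comes back positive:
    # true positives (r*f) plus false positives ((1-s)*(1-f))
    p = r * f + (1 - s) * (1 - f)
    return binom.pmf(50, 3330, p)

def unnormalized_post_f(f):
    # marginalize the joint posterior over r and s; the integration box is
    # restricted to where the two Beta posteriors have essentially all their mass
    val, _ = nquad(lambda r, s: like_survey(r, s, f) * post_r.pdf(r) * post_s.pdf(s),
                   [[0.6, 1], [0.95, 1]])
    return val

f_grid = np.linspace(0, 0.04, 21)
post_f = np.array([unnormalized_post_f(f) for f in f_grid])
post_f /= post_f.sum() * (f_grid[1] - f_grid[0])  # normalize numerically
```

<p>Restricting the integration box is just a speed-up: <span class="math inline">\({\rm Beta}(104, 20)\)</span> has negligible mass below <span class="math inline">\(r = 0.6\)</span>, and <span class="math inline">\({\rm Beta}(400, 3)\)</span> below <span class="math inline">\(s = 0.95\)</span>.</p>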
<p>The 95% credible intervals for the prevalence are:</p>
<p><span class="math display">\[f \in [0.0 , 0.018] .\]</span></p>
<p>The paper quotes 1.11% to 1.97%, while we get 0.05% to 1.81%.</p>
<p>So <em>a large portion of the mass of our posterior is to the left of that quoted from the paper</em>. While our right endpoint is to the left of theirs, it’s not as far to the left as our left endpoint is. Correspondingly, our distribution peaks at 1.25%, and the mean is more like 1.1% rather than 1.5%.</p>
<p>We’d thus expect, <em>from our analysis</em>, if we included re-sampling, <em>all the numbers the paper quotes to be revised downwards</em>, making the claim of high prevalence unlikely. Our posterior probability distribution puts non-trivial mass on really low prevalence.</p>
</section>
<section id="discussion-and-controversy" class="level2">
<h2>Discussion and Controversy</h2>
<p>There has been a lot of discussion on this pre-print amongst researchers, most of whom have expressed the same skepticism we have here.</p>
<p>The Guardian has a nice <a href="https://www.theguardian.com/world/2020/apr/23/coronavirus-antibody-studies-california-stanford">article</a> on this survey, and a great followup <a href="https://www.theguardian.com/world/2020/apr/28/there-is-no-absolute-truth-an-infectious-disease-expert-on-covid-19-misinformation-and-bullshit">interview with Carl Bergstrom</a>. Peter Kolchinsky has a very informative and long tweet thread:</p>
<blockquote class="twitter-tweet" data-theme="dark">
<p lang="en" dir="ltr">
Misleading Stanford Covid diagnostic study is all over news. Flaws w/ this study (authors acknowledge) could trick you into thinking that getting shot in the head has a low chance of killing you. I invest in diagnostics. I see this often. A dead body… <a href="https://t.co/rBJvMBWWmD">https://t.co/rBJvMBWWmD</a>
</p>
— Peter Kolchinsky (<span class="citation" data-cites="PeterKolchinsky">@PeterKolchinsky</span>) <a href="https://twitter.com/PeterKolchinsky/status/1251585935994740736?ref_src=twsrc%5Etfw">April 18, 2020</a>
</blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<p>Noted epidemiologist Trevor Bedford weighed in too:</p>
<blockquote class="twitter-tweet" data-theme="dark">
<p lang="en" dir="ltr">
Very interesting new preprint by Eran Bendavid and colleagues reports seroprevalence estimates from Santa Clara county. Great to have seroprevalence work start to emerge, but I'd be skeptical of the 2-4% seroprevalence result. 1/8 <a href="https://t.co/qzDd8ky1p0">https://t.co/qzDd8ky1p0</a>
</p>
— Trevor Bedford (<span class="citation" data-cites="trvrb">@trvrb</span>) <a href="https://twitter.com/trvrb/status/1251332447691628545?ref_src=twsrc%5Etfw">April 18, 2020</a>
</blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<p>Richard Neher and <span class="citation" data-cites="NimwegenLab"><a href="https://twitter.com/NimwegenLab">@NimwegenLab</a></span> weighed in too with a Bayesian analysis:</p>
<blockquote class="twitter-tweet" data-theme="dark">
<p lang="en" dir="ltr">
The preprint on <a href="https://twitter.com/hashtag/SARSCoV2?src=hash&ref_src=twsrc%5Etfw">#SARSCoV2</a> seroprevalence in Santa Clara County continues to make headlines. They estimate 2-4% of the population had <a href="https://twitter.com/hashtag/COVID19?src=hash&ref_src=twsrc%5Etfw">#COVID19</a> by April 4 implying an infection fatality rate (IFR) of 0.1 to 0.2%.<br><br>But there are many reasons to be very skeptical. Thread…
</p>
— Richard Neher (<span class="citation" data-cites="richardneher">@richardneher</span>) <a href="https://twitter.com/richardneher/status/1251484971279233025?ref_src=twsrc%5Etfw">April 18, 2020</a>
</blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<p><span class="citation" data-cites="luispedrocoelho"><a href="https://twitter.com/luispedrocoelho">@luispedrocoelho</a></span> wrote a <a href="https://metarabbit.wordpress.com/2020/04/18/did-anyone-in-santa-clara-county-get-covid-19/">blog post</a> about this as well. He went a (proper) step further than we did and modeled the prevalence fully, as a sample quantity drawn using a binomial from a population probability. His model is a nice example of using MCMC to do Bayesian statistics, something I shall be writing a post on soon.</p>
<p>Natalie Dean as well (and she talks about some of the other problems in the survey):</p>
<blockquote class="twitter-tweet" data-theme="dark">
<p lang="en" dir="ltr">
A rapid, unsolicited peer review on emerging serosurvey data from Santa Clara County, and why I remain skeptical of claims that we are identifying only 1 out of every 50 to 85 confirmed cases.<br><br>1/10<a href="https://t.co/fk45sn1NHl">https://t.co/fk45sn1NHl</a>
</p>
— Natalie E. Dean, PhD (<span class="citation" data-cites="nataliexdean">@nataliexdean</span>) <a href="https://twitter.com/nataliexdean/status/1251309217215942656?ref_src=twsrc%5Etfw">April 18, 2020</a>
</blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<p>Something else is lost, though, in the noise about where the mass of the prevalence is. Even the preprint’s confidence intervals on specificity are between 98% and 100%. And as we know, at a 2% false positive rate, all their positives could be false positives, making any inferences about prevalence suspect. And this, I suspect, is the main thing to take away…we need more data before we can make policy about re-opening.</p>
</section>Rahul DaveIn the last blog post we saw how we can calculate \(p(D+ \mid T+)\) given information on Specificity and Sensitivity of a serological test from a manufacturer. In that article we needed an otherwise obtained (from swab tests, for example) prevalence \(p(D+)\), and also noted the impact of false positives on our calculation at low prevalence levels. We alluded to calculating a prevalence \(p(D+)\) of the disease in the population if we knew p(D+ T+)$.Bayes Rule and Serological testing2020-04-04T00:00:00+05:302020-04-04T00:00:00+05:30/2020/04/04/bayes-rule-and-serological-testing<p class="byline">
(Image by rawpixel.com)
</p>
<p><a href="https://twitter.com/zbinney_NFLinj">Zachary Binney</a>, an epidemiologist and incoming Assistant Professor at Oxford College of Emory University, shared a very interesting twitter thread on how to interpret the new serological test approved by the FDA.</p>
<p>A serology test is a blood based test to check antibody production against a specific condition. The idea behind an antibody test is to detect antibodies to Covid-19 in your blood. Since we have no vaccine yet, these antibodies would presumably mean that you have had Covid-19 at some point in the past.</p>
<blockquote class="twitter-tweet" data-theme="dark">
<p lang="en" dir="ltr">
The FDA has approved the first antibody test for COVID-19, from Cellex. It theoretically tells you if you've had it & are, as far as we know, immune for some time.<br><br>Sensitivity is 93.8%, specificity 95.6%. Sounds great, right?<br><br>Well, sort of. (1/6)
</p>
— Zachary Binney, PhD (<span class="citation" data-cites="zbinney_NFLinj">@zbinney_NFLinj</span>) <a href="https://twitter.com/zbinney_NFLinj/status/1245789672833417217?ref_src=twsrc%5Etfw">April 2, 2020</a>
</blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<p>These tests are however not perfect. You might have had the disease, but your test might show up as negative. This is called a <strong>false negative</strong>. On the other hand, you might not have had the disease, but you get a positive test. This is called a <strong>false positive</strong>. These errors are what the sensitivity and specificity in the above tweet are about: they represent probabilities of these two different kinds of errors. They are numbers <em>given to us by the test manufacturer</em>. To understand them, we’ll need to recall our basic notions of probability.</p>
<p>Zachary goes on to say: “Sounds great, right? Well, sort of.”. The numbers are in the 90%s. Surely that is good. But not so fast. To understand these numbers, we’ll need to understand something about the <strong>prevalence</strong> of Covid-19 in the population, and along with that, <strong>conditional probability</strong> and the famous <strong>Bayes Theorem</strong>.</p>
<section id="probability" class="level2">
<h2>Probability</h2>
<p>Probability can be thought of as the long run frequency of events. For example, the probability of testing positive, <span class="math inline">\(P(T+)\)</span> can be thought of as <em>the number of positive tests in a very large set of tests</em> conducted. <span><label for="sn-0" class="margin-toggle">⊕</label><input type="checkbox" id="sn-0" class="margin-toggle"/><span class="marginnote"> This and the other probability dartboard images are taken from Andrew Glassner’s amazing <a href="https://www.amazon.com/Deep-Learning-Vol-Basics-Practice-ebook/dp/B079XSQNRX/">Deep Learning, Vol. 1: From Basics to Practice</a>.<br />
<br />
</span></span></p>
<figure>
<img src="/assets/Figure-03-002.png" title="fig:" alt="Throwing Darts" />
</figure>
<p>We can visualize this idea of probability geometrically. Imagine that you are drunk, and are throwing darts at a square shaped wall. <span><label for="sn-1" class="margin-toggle">⊕</label><input type="checkbox" id="sn-1" class="margin-toggle"/><span class="marginnote"> <img src="/assets/Figure-03-004.png" alt="probability as frequency" /> The reverse idea of using frequencies to calculate areas of geometric objects (here the circle) can be used to calculate <span class="math inline">\(\pi\)</span> (think of how!), and in general, all kinds of integrals. This is called the <em>Monte Carlo</em> method.<br />
<br />
</span></span> And you do this for a very, very long time and thus throw a very large number of darts. Let’s furthermore assume that you are a very special thrower, and it is equally likely that your darts land anywhere in this square.</p>
<p>Now you can ask the question: what’s the probability that your darts land in the circle? You have probably intuited the answer, as shown on the right. It’s:</p>
<p><span class="math display">\[P(darts \, in \, circle) = \frac{\# \, darts \, in \, circle}{\# \, darts \, in \, square}\]</span></p>
<p><br/></p>
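<p>As the margin note suggests, the same relative-frequency idea can be run in reverse to estimate <span class="math inline">\(\pi\)</span>. A sketch of that Monte Carlo calculation (an aside, and just one possible implementation):</p>

```python
import random

random.seed(42)

n_darts = 1_000_000
# throw darts uniformly at the unit square; count how many land inside
# the quarter-circle of radius 1, whose area is pi/4
in_circle = sum(
    1 for _ in range(n_darts)
    if random.random() ** 2 + random.random() ** 2 <= 1.0
)
pi_estimate = 4 * in_circle / n_darts
print(pi_estimate)  # close to 3.14159
```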
</section>
<section id="joint-and-conditional-probability" class="level2">
<h2>Joint and Conditional probability</h2>
<p>We can use this idea of probability as <em>relative frequency</em> of darts to understand why these tests are not as useful as they might seem at first blush. Imagine two areas on the square where the darts arrive, <span class="math inline">\(A\)</span> and <span class="math inline">\(B\)</span>. We can now ask two questions:</p>
<ol type="1">
<li><strong>Joint Probability</strong>:<span><label for="sn-2" class="margin-toggle">⊕</label><input type="checkbox" id="sn-2" class="margin-toggle"/><span class="marginnote"> <img src="/assets/Figure-03-010.png" alt="Joint Probability" /><br />
<br />
</span></span> What are the odds that the darts arrive in the area that is the intersection of the regions <span class="math inline">\(A\)</span> and <span class="math inline">\(B\)</span>. This is called the joint probability of being in <span class="math inline">\(A\)</span> and <span class="math inline">\(B\)</span>, denoted <span class="math inline">\(P(A, B)\)</span>. We can compute this by counting the number of darts that arrive in the intersection area, relative to the number of darts that arrive in the square.</li>
</ol>
<ol start="2" type="1">
<li><strong>Conditional Probability</strong>:<span><label for="sn-3" class="margin-toggle">⊕</label><input type="checkbox" id="sn-3" class="margin-toggle"/><span class="marginnote"> <img src="/assets/Figure-03-005.png" alt="Conditional Probability" /><br />
<br />
</span></span> Suppose I told you that a dart has arrived in region <span class="math inline">\(B\)</span>. I can now ask the question: <em>given</em> this information, what are the odds that it arrived in region <span class="math inline">\(A\)</span>. The word <em>given</em> is critical here, it presupposes that an event has happened, and we are asking a question <em>conditioned</em> on its occurrence. As you might expect, you can answer this question using relative frequencies as well: out of all the darts that arrived in region <span class="math inline">\(B\)</span>, lets count those that arrived in region <span class="math inline">\(A\)</span> and ratio the two numbers.</li>
</ol>
<p>The critical question we wish to answer is a conditional probability<span><label for="sn-4" class="margin-toggle">⊕</label><input type="checkbox" id="sn-4" class="margin-toggle"/><span class="marginnote"> Almost all critical questions in life and death are conditional probabilities. As Joe Blitzstein (the famous <a href="https://twitter.com/stat110">@stat110</a>) will tell you within 5 minutes of meeting him: <strong>Conditioning is the soul of Statistics</strong>.<br />
<br />
</span></span>: <em>given that we have a positive test, what are the odds that we have had the Covid-19 disease</em>. In other words we want the quantity</p>
<p><span class="math display">\[P(D+ | T+).\]</span></p>
<p>That is, <strong>do you have antibodies (have you had the disease?), given that you have a positive test</strong>, and thus can you get back to work and get the economy back on track?</p>
<p>At this point you might be thinking, wait, I just got a <em>positive</em> test. How can I <em>not have the disease</em>? Read on….</p>
</section>
<section id="sensitivity-and-specificity" class="level2">
<h2>Sensitivity and Specificity</h2>
<p>Ok, so what exactly are the <em>sensitivity</em> and <em>specificity</em>? Zachary quotes a sensitivity of 93.8%. What does this number mean?</p>
<p>The <strong>sensitivity</strong> is the reverse conditional probability to what we want, the probability that you test positive given that you have the disease:</p>
<p><span class="math display">\[
Sensitivity \equiv P(T+ | D+) = 0.938
\]</span></p>
<p>This means that if you have had the disease, there is only a 93.8% chance you will test positive. When you have had the disease, and you test positive, you are called a <strong>true positive</strong>.</p>
<p>On the other hand you might be a <strong>false negative</strong>. <span><label for="sn-5" class="margin-toggle">⊕</label><input type="checkbox" id="sn-5" class="margin-toggle"/><span class="marginnote"> This is the equivalent probability of being in the region of A that is not in B.<br />
<br />
</span></span> In other words, there is a <span class="math inline">\(100 - 93.8 = 6.2\)</span>% chance that you will test negative, even though you have had the disease.</p>
<p>Since, given disease, you either test negative or positive:</p>
<p><span class="math display">\[ P(T- | D+) = 1 - P(T+ | D+) = 1 - 0.938 = 0.062\]</span></p>
<p><strong>Specificity</strong> is the other side of the coin. It answers the question: what are the odds that you will test negative if you don’t have the disease?</p>
<p><span class="math display">\[
Specificity \equiv P(T- | D-) = 0.956
\]</span></p>
<p>In other words, it asks the question, are you a <strong>true negative</strong>: do you test negative when you do not have the disease?</p>
<p>This then begs the question: what if you are a <strong>false positive</strong>? What if you test positive, given that you do not have the disease?</p>
<p>Since, given no disease, you either test negative or positive:</p>
<p><span class="math display">\[P(T+ | D-) = 1 - P(T- | D-) = 1 - 0.956 = 0.044\]</span></p>
<p>There are thus two ways in which this test can make errors: you can be a false positive, or a false negative. Indeed, all tests have such errors. There are biological reasons for this. You might be a false negative if antibodies in your blood do not show up on the test because of confounding factors in your biology. You might be a false positive if the test reacts to antibodies to other viruses (other coronaviruses, even) in your body.</p>
</section>
<section id="the-confusion-matrix" class="level2">
<h2>The Confusion Matrix</h2>
<p>These errors are usually captured in a construct called the Confusion Matrix.</p>
<p>Suppose you had real frequencies for the false positives, true positives, true negatives and false negatives. I want to emphasize we do not have these, just numbers given to us by the test manufacturer. But suppose, just for a bit, that we did.</p>
<figure>
<img src="/assets/confusionmatrix.png" title="fig:" alt="Confusion Matrix" />
</figure>
<p>The marginal quantities (quantities in the right and the bottom margin in the confusion matrix) are the sum of the rows and columns. For example, when you sum the true negatives and the false positives, you get the observed negatives (ON), all the people who do not have the disease.</p>
<p>The Confusion matrix deals with actual numbers and not probabilities, and so we must write the various quantities we have calculated in the previous section as a function of the quantities in this matrix.</p>
<p>The Sensitivity, or the <strong>True Positive Rate</strong>(TPR) is given as: <span class="math display">\[
Sensitivity \equiv P(T+ | D+) = TPR = \frac{TP}{OP} = 0.938.
\]</span> In other words, this is the number of TP, divided by the sum of the bottom row, the OP.</p>
<p>The complement of the sensitivity is the <strong>False Negative Rate</strong>(FNR):</p>
<p><span class="math display">\[FNR \equiv P(T- | D+) = 1 - TPR = \frac{FN}{OP} = 0.062\]</span></p>
<p>Again this quantity is concerned with the <em>bottom row</em> of the confusion matrix.</p>
<p>The Specificity, or the <strong>True Negative Rate</strong>(TNR) is given as: <span class="math display">\[
Specificity \equiv P(T- | D-) = TNR = \,\,\frac{TN}{ON} = 0.956
\]</span></p>
<p>and thus the complement, the <strong>False Positive Rate</strong>(FPR) is given as:</p>
<p><span class="math display">\[FPR \equiv P(T+ | D-) = 1 - TNR = \frac{FP}{ON} = 0.044\]</span></p>
<p>These two quantities are concerned with the <em>top row</em> of the confusion matrix, and the denominators come from its marginal, the ON.</p>
<p>At this point, you might be saying, the FPR is very low, so what are we worried about? Well, we have not answered the question that we set out to answer: what is</p>
<p><span class="math display">\[P(D+ | T+) \, ?\]</span></p>
<p>Do you have antibodies (have you had the disease?), given that you have a positive test and thus can you get back to work and get the economy back on track?</p>
<p>How do we calculate this?</p>
<p>From the perspective of the confusion matrix, we are looking at a different marginal, the marginal of the right column, the Predicted Positive, or PP.</p>
<p><span class="math display">\[
P(D+ | T+) = \frac{TP}{PP} = \frac{TP}{TP + FP}
\]</span></p>
<p>Thus we have:</p>
<p><span class="math display">\[
\begin{eqnarray*}
P(D+ | T+) &=& \frac{TP}{TP + FP} \\
&=& \frac{TPR \times OP}{TPR \times OP + FPR \times ON} \\
&=& \frac{TPR \times \frac{OP}{Pop}}{(TPR \times \frac{OP}{Pop}) + (FPR \times \frac{ON}{Pop})} \\
&=& \frac{P(T+ | D+) \times P(D+)}{(P(T+ | D+) \times P(D+)) + (P(T+ | D-) \times P(D-))} \\
&=& \frac{0.938 \times P(D+)}{(0.938 \times P(D+)) + (0.044 \times P(D-))}
\end{eqnarray*}
\]</span></p>
<p>Here I divided both the numerator and the denominator of the fraction by the population (<span class="math inline">\(Pop\)</span>) to get the <strong>prevalence</strong> of Covid19, the fraction of the population that has Covid19, or the probability <strong>prior</strong> to serological testing, presumably estimated from regular testing or mortality, that a random person in the population has Covid19.</p>
<p>This tells us that our probability of disease given testing <span class="math inline">\(P(D+ | T+)\)</span> can be estimated from our confusion matrix if we had the raw numbers. But in the absence of those, given just the probabilities from the test manufacturer, we can still estimate <span class="math inline">\(P(D+ | T+)\)</span> as long as we have the prevalence of disease <span class="math inline">\(P(D+)\)</span> since <span class="math inline">\(P(D-) = 1 - P(D+)\)</span>.</p>
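<p>The formula above is easy to wrap in a small helper function. The name and signature below are our own; the default rates are the sensitivity (0.938) and false positive rate (0.044) we have been using throughout:</p>

```python
# Posterior probability of disease given a positive test, as a function
# of prevalence. Helper name and defaults are ours; 0.938 and 0.044 are
# the sensitivity and false positive rate used throughout this post.
def p_disease_given_positive(prevalence, sensitivity=0.938, fpr=0.044):
    """Bayes theorem: P(D+ | T+) from P(D+), P(T+ | D+) and P(T+ | D-)."""
    numerator = sensitivity * prevalence
    return numerator / (numerator + fpr * (1.0 - prevalence))
```

<p>At a prevalence of 0.045, this returns roughly 0.501.</p>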
<blockquote class="twitter-tweet" data-conversation="none" data-theme="dark">
<p lang="en" dir="ltr">
If only a small % have actually had COVID-19 (our best guess now) a “positive” antibody test isn't that likely to mean you're immune.<br><br>If only 4.5% of U.S. has had COVID-19, + test only means ~50% chance you really had it. With lots of uninfected, lots of false +s. (2/6) <a href="https://t.co/IIqnNSM9Lg">pic.twitter.com/IIqnNSM9Lg</a>
</p>
— Zachary Binney, PhD (<span class="citation" data-cites="zbinney_NFLinj">@zbinney_NFLinj</span>) <a href="https://twitter.com/zbinney_NFLinj/status/1245789676264460288?ref_src=twsrc%5Etfw">April 2, 2020</a>
</blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<p>In the tweet above, Zachary assumes a prevalence <span class="math inline">\(p = P(D+) = 0.045\)</span>. Then the prevalence of the “lack of disease” is</p>
<p><span class="math display">\[P(D-) = 1 - P(D+) = 1 - p = 1 - 0.045 = 0.955,\]</span></p>
<p>and we can use these numbers to calculate the probability of disease given a positive test:</p>
<p><span class="math display">\[ P(D+ | T+) = \frac{0.938*0.045}{(0.938*0.045 + 0.044*0.955)} \sim 0.501.\]</span></p>
<p>This explains the 50% number in the tweet above.</p>
<p>The calculation we did above is an example of a very general calculation in probability theory: a conditional probability calculation using Bayes theorem. You may skip the next section if you are not interested, but I must warn you, it is fun.</p>
</section>
<section id="bayes-theorem" class="level2">
<h2>Bayes Theorem</h2>
<p>We’ll rely on an almost 260-year-old theorem, which is so simple to prove that we shall show its proof visually below. <span><label for="sn-6" class="margin-toggle">⊕</label><input type="checkbox" id="sn-6" class="margin-toggle"/><span class="marginnote"> Bayes theorem may have the highest utility-to-simplicity ratio of any theorem out there. It is used in almost every field of human endeavour. Its originator was the <a href="https://en.wikipedia.org/wiki/Thomas_Bayes">Reverend Bayes</a>, and the theorem was presented in <em>“An Essay towards solving a Problem in the Doctrine of Chances”</em>. <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/d/d4/Thomas_Bayes.gif/225px-Thomas_Bayes.gif" alt="Bayes" /> This drawing of him is taken from Wikipedia and is not guaranteed to be him; no drawing survives!<br />
<br />
</span></span></p>
<p>The basic idea of Bayes theorem comes from the definition of joint probability. As illustrated below, the joint probability <span class="math inline">\(P(A, B)\)</span> of the dart falling in regions <span class="math inline">\(A\)</span> and <span class="math inline">\(B\)</span> can be decomposed into the product of the conditional probability <span class="math inline">\(P(A | B)\)</span> and the full or <em>marginal</em> probability of the dart falling in region <span class="math inline">\(B\)</span>, <span class="math inline">\(P(B)\)</span>:</p>
<figure>
<img src="/assets/agivenb.png" title="fig:" alt="Decomposition using probability of A given B" />
</figure>
<p>This should be obvious from the figure above. Here’s the fun part: there is nothing special about regions <span class="math inline">\(A\)</span> and <span class="math inline">\(B\)</span>, so we can reverse the argument. <span><label for="sn-7" class="margin-toggle">⊕</label><input type="checkbox" id="sn-7" class="margin-toggle"/><span class="marginnote"> <img src="/assets/Figure-03-010.png" alt="Joint Probability" /><br />
<br />
</span></span> I reproduce the image of regions <span class="math inline">\(A\)</span> and <span class="math inline">\(B\)</span> and the joint probability here on the right for convenience.</p>
<figure>
<img src="/assets/bgivena.png" title="fig:" alt="Decomposition using probability of B given A" />
</figure>
<p>Thus the joint probability can be written two ways, and voila, we get Bayes Theorem:</p>
<figure>
<img src="/assets/bayeseqn.png" title="fig:" alt="Bayes Theorem" />
</figure>
<p>Remember how we had <span class="math inline">\(P(T+ | D+)\)</span> but wanted <span class="math inline">\(P(D+ | T+)\)</span>. Now we can apply the theorem to our advantage:</p>
<p><span class="math display">\[P(D+, T+) = P(D+ | T+) P(T+) = P(T+ | D+) P(D+)\]</span></p>
<p>Solving for <span class="math inline">\(P(D+ | T+)\)</span>, we get:</p>
<p><span class="math display">\[
P(D+ | T+) = \frac{P(T+ | D+) P(D+)}{P(T+)}
\]</span></p>
<p>To expand out the denominator, remember that when you test positive, it can be either because you have the disease or because you don’t:</p>
<p><span class="math display">\[
\begin{eqnarray*}
P(D+ | T+) &=& \frac{P(T+ | D+) P(D+)}{P(T+, D+) + P(T+, D-)} \\
&=& \frac{P(T+ | D+) P(D+)}{P(T+ | D+) P(D+) + P(T+ | D-)P(D-)} \\
&=& \frac{0.938 \times P(D+)}{0.938 \times P(D+) + 0.044 \times P(D-)}
\end{eqnarray*}
\]</span></p>
<p>The marginal quantity <span class="math inline">\(P(D+)\)</span>, the fraction of people in the population with disease established by other methods is often called the <strong>prior</strong> probability of disease, because it reflects our belief in the disease rate without having done the testing we are interested in (without having seen “data”).</p>
</section>
<section id="gird-yourself-for-disappointment" class="level2">
<h2>Gird yourself for disappointment</h2>
<p><br/></p>
<blockquote class="twitter-tweet" data-conversation="none" data-theme="dark">
<p lang="en" dir="ltr">
If only a small % have actually had COVID-19 (our best guess now) a “positive” antibody test isn't that likely to mean you're immune.<br><br>If only 4.5% of U.S. has had COVID-19, + test only means ~50% chance you really had it. With lots of uninfected, lots of false +s. (2/6) <a href="https://t.co/IIqnNSM9Lg">pic.twitter.com/IIqnNSM9Lg</a>
</p>
— Zachary Binney, PhD (<span class="citation" data-cites="zbinney_NFLinj">@zbinney_NFLinj</span>) <a href="https://twitter.com/zbinney_NFLinj/status/1245789676264460288?ref_src=twsrc%5Etfw">April 2, 2020</a>
</blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<p>So, using Bayes theorem or our confusion-matrix-based calculation, we have all the numbers we need to finish our calculation:</p>
<p><span class="math display">\[ P(D+ | T+) = \frac{0.938*0.045}{(0.938*0.045 + 0.044*0.955)} \sim 0.501\]</span></p>
<p>Remember, this explains the 50% number in the tweet above.</p>
<p>Take a moment to absorb this. Because the disease prevalence in the population is currently low (and we want to keep it that way), Bayes Theorem tells us that the probability of having had the disease given a positive serology test is just 50%. Not the slam dunk we were hoping for.</p>
<p>How do these odds change with the prevalence of the disease? This is told to us by the plot in the tweet above and the tweet included below.</p>
<p>How do we produce a curve like the above? We’ll make a grid of prevalences <span class="math inline">\(p\)</span> from 0 to 1, compute <span class="math inline">\(P(D+ | T+)\)</span> for each prevalence, and then plot it:</p>
<div class="sourceCode" id="cb1"><pre class="sourceCode python"><code class="sourceCode python"><span id="cb1-1"><a href="#cb1-1"></a><span class="op">%</span>matplotlib inline</span>
<span id="cb1-2"><a href="#cb1-2"></a><span class="im">import</span> matplotlib.pyplot <span class="im">as</span> plt</span>
<span id="cb1-3"><a href="#cb1-3"></a><span class="im">import</span> numpy <span class="im">as</span> np</span>
<span id="cb1-4"><a href="#cb1-4"></a><span class="co"># Implement Bayes theorem as a function of prevalence</span></span>
<span id="cb1-5"><a href="#cb1-5"></a>dplus_given_tplus <span class="op">=</span> <span class="kw">lambda</span> p: <span class="fl">0.938</span><span class="op">*</span>p <span class="op">/</span> (<span class="fl">0.938</span><span class="op">*</span>p <span class="op">+</span> <span class="fl">0.044</span><span class="op">*</span>(<span class="fl">1.</span><span class="op">-</span>p))</span>
<span id="cb1-6"><a href="#cb1-6"></a>pgrid <span class="op">=</span> np.linspace(<span class="fl">0.</span>, <span class="fl">1.</span>, <span class="dv">100</span>)</span>
<span id="cb1-7"><a href="#cb1-7"></a>plt.plot(pgrid, dplus_given_tplus(pgrid))</span></code></pre></div>
<p>The plot is shown in the margin here <span><label for="sn-8" class="margin-toggle">⊕</label><input type="checkbox" id="sn-8" class="margin-toggle"/><span class="marginnote"> <img src="/assets/download.png" alt="probability plot" /> You can read off that at a prevalence of 0.1 (x-axis), the probability of disease having happened given a +ive test is 0.7. And that for <span class="math inline">\(p=0.3\)</span>, its 0.9.<br />
<br />
</span></span> and can be used to read the numbers off the next tweet.</p>
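<p>If you prefer numbers to reading points off a plot, here is a small sketch evaluating the same function at a few prevalences, using the rates from above:</p>

```python
# Evaluate P(D+ | T+) at a few prevalences instead of reading the plot.
# The 0.938 and 0.044 are the sensitivity and FPR used throughout.
dplus_given_tplus = lambda p: 0.938*p / (0.938*p + 0.044*(1. - p))
for p in (0.045, 0.1, 0.3):
    print(f"prevalence {p:5.3f} -> P(D+ | T+) = {dplus_given_tplus(p):.2f}")
# prevalence 0.045 -> P(D+ | T+) = 0.50
# prevalence 0.100 -> P(D+ | T+) = 0.70
# prevalence 0.300 -> P(D+ | T+) = 0.90
```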
<blockquote class="twitter-tweet" data-conversation="none" data-theme="dark">
<p lang="en" dir="ltr">
If 10% were truly infected, a positive Cellex antibody test has a 70% chance of being right. If 30% were infected, a positive test is right 90% of the time. This happens bc when more people were sick, true positives overwhelm false positives. But that's not our situation. (3/6)
</p>
— Zachary Binney, PhD (<span class="citation" data-cites="zbinney_NFLinj">@zbinney_NFLinj</span>) <a href="https://twitter.com/zbinney_NFLinj/status/1245789685760380930?ref_src=twsrc%5Etfw">April 2, 2020</a>
</blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<p>Why does the probability of having had the disease, given a positive test go up with increased disease prevalence?<span><label for="sn-9" class="margin-toggle">⊕</label><input type="checkbox" id="sn-9" class="margin-toggle"/><span class="marginnote"><img src="/assets/confusionmatrix.png" alt="Confusion Matrix" /> From the perspective of what we are trying to understand, it might have almost been better to draw the confusion matrix with a thinner lower strip to indicate that the total number of observed positives is a smaller fraction of the population than the observed negatives, or those who do not a-priori have the disease.<br />
<br />
</span></span> This is alluded to in the tweet above: when the prevalence is higher and there are more sick people, the true positives overwhelm the false positives. The way to think about this is via the confusion matrix. False positives come from the top row, which totals up to the observed negatives. True positives come from the bottom row, which totals up to the observed positives, the people who have had Covid19. As the prevalence increases, the observed positives grow at the expense of the observed negatives, and so the true positives come to overwhelm the false positives.</p>
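<p>One way to see this concretely is to compare the expected counts of true and false positives at a low and a high prevalence. This is a rough sketch; the population size of 1000 is arbitrary:</p>

```python
# Expected true vs false positives in a population of 1000 (arbitrary
# size), using the sensitivity (0.938) and FPR (0.044) from above.
pop = 1000
for prev in (0.045, 0.30):
    tp = 0.938 * prev * pop          # sensitivity x people with disease
    fp = 0.044 * (1 - prev) * pop    # FPR x people without disease
    print(f"prevalence {prev}: ~{tp:.0f} true positives, ~{fp:.0f} false positives")
```

<p>At 4.5% prevalence the true and false positives are nearly equal (about 42 each), which is exactly why the posterior sits near 50%; at 30% prevalence the true positives outnumber the false positives roughly nine to one.</p>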
<blockquote class="twitter-tweet" data-conversation="none" data-theme="dark">
<p lang="en" dir="ltr">
So this is test may not be that useful for saying “Zach, you are immune; Jen, you aren't.” It might be wrong as often as it's right. Mistakenly telling someone they're immune & clear to return to society…you can see the problem if we do that on a large scale. (4/6)
</p>
— Zachary Binney, PhD (<span class="citation" data-cites="zbinney_NFLinj">@zbinney_NFLinj</span>) <a href="https://twitter.com/zbinney_NFLinj/status/1245789687383547905?ref_src=twsrc%5Etfw">April 2, 2020</a>
</blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<p>Remember that our social distancing, lockdown, and quarantine measures are designed to keep the disease prevalence low. If we succeed in doing this, we should not expect just a simple serology test to be enough to return people to the workforce.</p>
</section>
<section id="how-should-we-use-this-test-then" class="level2">
<h2>How should we use this test, then?</h2>
<p>Well, one way to use this test is to narrow its scope and estimate prevalence in geographic subpopulations (Seattle, for example), using confusion matrices calculated from random samples taken in those geographies (this kind of inverts what we have been talking about here).</p>
<blockquote class="twitter-tweet" data-conversation="none" data-theme="dark">
<p lang="en" dir="ltr">
There are simple equations to correct for this on a population level. So this test is still <em>very</em> useful for helping us figure out what % of people have been sick in different areas. And it's the best we've got; deploy it! But realize what it will & won't reliably tell us. (5/6)
</p>
— Zachary Binney, PhD (<span class="citation" data-cites="zbinney_NFLinj">@zbinney_NFLinj</span>) <a href="https://twitter.com/zbinney_NFLinj/status/1245789689279401997?ref_src=twsrc%5Etfw">April 2, 2020</a>
</blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<p>Another way to use this test is to run it twice per person, spaced out over a longer timeline, thus reducing false positives. But if systematic biological effects created the false positives in the first place, then we are in trouble.</p>
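<p>Under the optimistic assumption that the two tests err independently, we can sketch why repetition helps: apply the Bayes update twice, feeding the first posterior back in as the new prior. The function name below is ours, and the independence assumption is exactly what systematic errors would break:</p>

```python
# Two successive Bayes updates for two positive tests, assuming the
# tests' errors are independent -- the systematic-error caveat is
# exactly where this assumption fails.
def update(prior, sensitivity=0.938, fpr=0.044):
    """One Bayes update on seeing a positive test result."""
    num = sensitivity * prior
    return num / (num + fpr * (1.0 - prior))

p1 = update(0.045)   # after one positive test: ~0.50
p2 = update(p1)      # after a second positive test: ~0.96
print(p1, p2)
```

<p>Two independent positives would lift the posterior from about 50% to about 96%, which is why repeated testing is attractive if, and only if, the errors really are independent.</p>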
<blockquote class="twitter-tweet" data-conversation="none" data-theme="dark">
<p lang="en" dir="ltr">
Running the test 2x & only telling someone they're immune if they get 2 +s <em>might</em> help reduce false +s, depending on the error source. If it's anything systemic - say it's detecting antibodies from a similar virus that don't grant immunity to COVID-19 - no good. <a href="https://twitter.com/KevinMalogna?ref_src=twsrc%5Etfw"><span class="citation" data-cites="KevinMalogna">@KevinMalogna</span></a>
</p>
— Zachary Binney, PhD (<span class="citation" data-cites="zbinney_NFLinj">@zbinney_NFLinj</span>) <a href="https://twitter.com/zbinney_NFLinj/status/1245789690277625861?ref_src=twsrc%5Etfw">April 2, 2020</a>
</blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<p>Or do a different test:</p>
<blockquote class="twitter-tweet" data-conversation="none" data-theme="dark">
<p lang="en" dir="ltr">
Bonus tweet 2: We might be able to reduce false +s/increase the chance a + test is right by using this test as a screener and another slightly different antibody test to confirm, if they wouldn't both show false + for the same systemic reason like antibodies to a similar virus.
</p>
— Zachary Binney, PhD (<span class="citation" data-cites="zbinney_NFLinj">@zbinney_NFLinj</span>) <a href="https://twitter.com/zbinney_NFLinj/status/1246037468501233664?ref_src=twsrc%5Etfw">April 3, 2020</a>
</blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<p>The test is, of course, more useful in Covid high-risk groups because the prevalence is higher:</p>
<blockquote class="twitter-tweet" data-conversation="none" data-theme="dark">
<p lang="en" dir="ltr">
Bonus tweet 1: A positive test would be more likely to mean you're truly immune if you're in a high-risk group - healthcare worker, had COVID symptoms, family member had COVID - bc prevalence of infection in these subgroups is higher. So test may be more useful for these folks!
</p>
— Zachary Binney, PhD (<span class="citation" data-cites="zbinney_NFLinj">@zbinney_NFLinj</span>) <a href="https://twitter.com/zbinney_NFLinj/status/1246037465267425282?ref_src=twsrc%5Etfw">April 3, 2020</a>
</blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
</section>Rahul Dave(Image by rawpixel.com) Zachary Binney, an epidemiologist and incoming Assistant professor at Oxford College at Emory university, shared a very interesting twitter thread on how to interpret the new Serological test approved by the FDA.