How to Estimate Vaccine Efficacy Using a Logistic Regression Model
Learn how vaccine efficacy and its confidence bounds can be estimated using a regression model
Vaccine Efficacy (VE) is a measure of how much the vaccine was able to reduce the incidence of the disease in the vaccinated group of people as compared to the non-vaccinated group.
When reported by itself, Vaccine Efficacy is of limited clinical use as it’s only a point estimate of the true population level efficacy. The value of VE that is reported in a vaccine trial relates to only that specific trial. If it were possible to conduct 10,000 identically structured vaccine trials, you would get 10,000 different values of VE. All these values would be (presumably) distributed around the true population level Vaccine Efficacy which is always an unknown quantity.
Since it is very expensive to conduct even a few trials for the same vaccine, let alone thousands, Vaccine Efficacy is usually reported as a point estimate along with upper and lower 95% (or 99%) confidence bounds. The confidence bounds give you a way to decide how much you want to trust the point estimate of efficacy coming out of the trial.
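To make the idea of trial-to-trial variation concrete, here is a minimal sketch that simulates many identically structured trials and looks at the spread of the resulting VE values. The arm size, the number of trials and the two true incidences (10% and 2%) are assumptions chosen purely for illustration.

```python
import random
import statistics

def simulate_trial_ve(n_per_arm, p_placebo, p_vaccine, rng):
    """Run one simulated two-arm trial and return its observed VE."""
    placebo_cases = sum(rng.random() < p_placebo for _ in range(n_per_arm))
    vaccine_cases = sum(rng.random() < p_vaccine for _ in range(n_per_arm))
    if placebo_cases == 0:
        return None  # VE is undefined when the placebo arm has no cases
    return 1.0 - (vaccine_cases / n_per_arm) / (placebo_cases / n_per_arm)

rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Hypothetical true incidences: 10% (placebo) and 2% (vaccine), so true VE = 0.8
estimates = [simulate_trial_ve(1000, 0.10, 0.02, rng) for _ in range(200)]
estimates = [ve for ve in estimates if ve is not None]

print(f"mean VE across trials: {statistics.mean(estimates):.3f}")
print(f"min / max VE: {min(estimates):.3f} / {max(estimates):.3f}")
```

Each simulated trial reports a different VE, and the values scatter around the true efficacy of 0.8 that was built into the simulation.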
Speaking of vaccine trials, they usually consist of a group of people who are given the vaccine and a second group of people, the placebo group, who get a saline solution shot or equivalent. The trial is randomized in that participants are randomly assigned to the vaccine or placebo groups. The trial is also usually double-blinded in that neither the vaccine giver nor the vaccine taker knows whether the syringe contains a placebo or the real thing.
After a participant is given a jab, they are monitored until they either get the disease or they drop out or die or are disqualified for some reason, or the trial period comes to an end.
Formula for Vaccine Efficacy
Vaccine Efficacy can be calculated using the following simple formula:
VE = 1 - (p2 / p1)
where p2 and p1 are the incidences of the disease in the vaccinated group and the placebo group respectively.
The disease incidence can be thought of as the probability of getting the disease over a certain period of time. The time period is often measured in person-years. So for purposes of calculation, p2 and p1 can be considered as probabilities.
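As a quick illustration of the formula, the following sketch computes VE from raw case counts. The function name and the counts (10 cases among 5,000 vaccinated, 100 cases among 5,000 placebo recipients) are hypothetical.

```python
def vaccine_efficacy(cases_vaccinated, n_vaccinated, cases_placebo, n_placebo):
    """Return VE = 1 - p2/p1, where p2 and p1 are the observed incidences."""
    p2 = cases_vaccinated / n_vaccinated  # incidence in the vaccinated group
    p1 = cases_placebo / n_placebo        # incidence in the placebo group
    return 1.0 - p2 / p1

# Hypothetical counts, for illustration only
ve = vaccine_efficacy(10, 5000, 100, 5000)
print(f"VE = {ve:.1%}")  # prints "VE = 90.0%"
```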
How to Use a Logistic Regression Model to Estimate Vaccine Efficacy
We’ll start by simulating a vaccine trial data set. We’ll assume that 2 jabs each of either the vaccine or the placebo are given to each participant.
Furthermore, we’ll assume that half of our imaginary participants are given the two vaccine doses separated by a duration of 3 months. The other half are given the two doses separated by a duration of 6 months.
So the duration between doses becomes an explanatory variable in our regression model. We will encode it as a binary (0/1) variable with 0 implying 3 months separation between two doses and 1 implying 6 months between two doses.
The second explanatory variable is whether the participant received the vaccine or the placebo.
Following is a summary of the simulated trial data:
Following are the first dozen or so rows of our simulated data set. Each row represents a unique participant in our simulated trial:
The data set is available for download from over here.
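If you do not have the file at hand, a data set with the same three columns can be simulated along the following lines. This is only a sketch: the sample size, the two infection probabilities, and the assumption that the dosing interval has no effect on infection are all illustrative choices, not the settings used to generate the article's data.

```python
import random

def simulate_participants(n, p_placebo, p_vaccine, seed=0):
    """Simulate n trial participants as a list of dict rows."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        interval = rng.randint(0, 1)    # 0 = 3 months, 1 = 6 months between doses
        vaccinated = rng.randint(0, 1)  # 0 = placebo group, 1 = vaccinated group
        # Assumption: infection risk depends only on the trial arm
        p = p_vaccine if vaccinated else p_placebo
        infected = 1 if rng.random() < p else 0
        rows.append({
            "INFECTED": infected,
            "INTERVAL_BETWEEN_DOSES": interval,
            "VACCINATED": vaccinated,
        })
    return rows

# Hypothetical settings: 10,000 participants, 20% vs 1.2% infection risk
rows = simulate_participants(10000, 0.20, 0.012)

placebo = [r for r in rows if r["VACCINATED"] == 0]
vaccine = [r for r in rows if r["VACCINATED"] == 1]
placebo_rate = sum(r["INFECTED"] for r in placebo) / len(placebo)
vaccine_rate = sum(r["INFECTED"] for r in vaccine) / len(vaccine)
print(f"placebo incidence: {placebo_rate:.4f}, vaccine incidence: {vaccine_rate:.4f}")
```

Random assignment splits the participants roughly, not exactly, in half between the two arms and between the two dosing intervals.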
The regression variables are as follows:
The dependent variable y
Name of variable: INFECTED
Values: 1 ==> Participant caught the infection during the trial period. 0 ==> Participant did not catch the infection during the trial period.
The regression matrix X
There are two explanatory variables in X:
- INTERVAL_BETWEEN_DOSES: Values: 0 ==> 3 months. 1 ==> 6 months.
- VACCINATED: Values: 0 ==> Participant belongs to the PLACEBO group. 1 ==> Participant belongs to the VACCINATED group.
We will now fit a Logistic Regression model on this simulated data set of (y, X) values.
Let’s start by importing the required packages:
import pandas as pd
from patsy import dmatrices
import statsmodels.api as sm
We will use Pandas to load the data set into a DataFrame:
df = pd.read_csv('vaccine_trial_simulation_study.csv', header=0)
And print the top 10 rows:
print(df.head(10))
Let’s form the regression equation:
expr = 'INFECTED ~ INTERVAL_BETWEEN_DOSES + VACCINATED'
We’ll use Patsy to carve out the X and y matrices:
y_train, X_train = dmatrices(expr, df, return_type='dataframe')
Now let’s build and train a Logistic Regression model on (y, X).
logit_model = sm.Logit(endog=y_train, exog=X_train)
logit_results = logit_model.fit()
Print the fitted model’s summary:
print(logit_results.summary())
Goodness of fit of the Logit Model
The p-value of the Log-Likelihood Ratio test is 1.25e-126 (which is essentially zero), indicating that the model does indeed fit better than a Null model that contains only an intercept, i.e. a flat horizontal line at the mean of the y values.
The pseudo R-squared is only 12.38%, indicating a poor fit. We might want to experiment with a Poisson, a Generalized Poisson or a Negative Binomial regression model to see if one of those models might fit better. The process of fitting those models is essentially the same as for the Logit model.
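For readers who want to see where these two goodness-of-fit numbers come from, here is a minimal sketch. It uses McFadden's pseudo R-squared and the likelihood ratio statistic, which is how statsmodels derives them for a Logit model. The two log-likelihood values are hypothetical inputs chosen for illustration; they are not taken from the article's model summary.

```python
import math

# Hypothetical log-likelihoods, for illustration only
ll_null = -2341.7   # intercept-only (Null) model
ll_model = -2051.8  # fitted model with 2 regression variables

# McFadden's pseudo R-squared
pseudo_r2 = 1.0 - ll_model / ll_null

# Likelihood ratio statistic, chi-squared distributed under the Null model
llr = 2.0 * (ll_model - ll_null)

# With exactly 2 degrees of freedom (one per regression variable), the
# chi-squared survival function has the closed form exp(-x / 2)
llr_p_value = math.exp(-llr / 2.0)

print(f"pseudo R-squared: {pseudo_r2:.4f}")
print(f"LLR statistic:    {llr:.1f}")
print(f"LLR p-value:      {llr_p_value:.3g}")
```

The closed form for the p-value holds only for 2 degrees of freedom; for other model sizes you would need a chi-squared survival function from a statistics library.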
If duration data, i.e. interval between taking the jab and either being infected or exiting the trial, are available, we can also experiment with a Survival Model such as the Cox Proportional Hazards model.
For the purpose of illustrating the procedure for calculating VE, we’ll continue using the Logit model.
Calculating Vaccine Efficacy
To calculate vaccine efficacy, we’ll focus attention on the coefficients of regression variables:
The first thing to notice is that the p-value of the coefficient of the INTERVAL_BETWEEN_DOSES variable is 0.688. With a p-value this large, we cannot say at even a 40% confidence level that the coefficient of INTERVAL_BETWEEN_DOSES is really any different from 0.
Secondly, we note that the coefficient of the VACCINATED variable is statistically significant at a >99% confidence level, as evidenced by its p-value, which is basically zero.
Recollect that the regression equation of a Logit model is expressed as follows:
ln(p / (1 - p)) = β_0 + β_1*x_1 + ... + β_k*x_k
On the R.H.S., we have a linear combination of regression variables. β is the vector of regression coefficients (β_0, ..., β_k). The equation holds for each of the N samples, where N is the number of samples (i.e. number of participants in our simulated vaccine study).
On the L.H.S. of the above equation, p is the probability of the event occurring. In our example, it is the probability of a trial participant getting infected during a certain observation period. So (1 - p) is the probability of the event not occurring.
p/(1 - p) is known as the odds of the event happening.
ln(p/(1 - p)) is the log-odds.
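The relationship between probability, odds and log-odds can be sketched with three small helper functions. The function names are made up for this illustration.

```python
import math

def odds(p):
    """Odds of an event that occurs with probability p."""
    return p / (1.0 - p)

def log_odds(p):
    """Natural log of the odds (the logit of p)."""
    return math.log(odds(p))

def probability_from_log_odds(z):
    """Invert the logit: recover p from its log-odds z."""
    return 1.0 / (1.0 + math.exp(-z))

p = 0.2
print(f"odds:     {odds(p):.4f}")      # 0.2 / 0.8
print(f"log-odds: {log_odds(p):.4f}")
print(f"back to p: {probability_from_log_odds(log_odds(p)):.4f}")
```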
We will use the above equation to write out the regression equation of our fitted regression model as follows:
Recall that VACCINATED is an indicator variable with a value of 1=vaccinated and 0=not vaccinated.
For any given value of INTERVAL_BETWEEN_DOSES, the change in log-odds of getting infected for a unit change in the value of the VACCINATED variable from 0 (not vaccinated) to 1 (vaccinated) is the fitted coefficient of VACCINATED:
ln(odds_vaccinated) - ln(odds_placebo) = ln(odds_vaccinated / odds_placebo) = -2.9491
Let’s simplify the above expression. We’ll convert the natural log on the L.H.S. to an exponentiation on the R.H.S., i.e. exp(-2.9491), as follows:
odds_vaccinated / odds_placebo = exp(-2.9491) = 0.05239
Our regression model has predicted that the odds of getting infected reduce by (1 - 0.05239)*100 = 94.761% after being fully vaccinated with two sequentially administered doses of the vaccine.
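The exponentiation step is easy to check numerically. The sketch below starts from the VACCINATED coefficient quoted in the text; the variable names are illustrative.

```python
import math

coef_vaccinated = -2.9491  # fitted coefficient of VACCINATED quoted in the text

# Exponentiating a Logit coefficient gives the Odds Ratio for a unit change
odds_ratio = math.exp(coef_vaccinated)

# Percentage reduction in the odds of infection for vaccinated participants
odds_reduction_pct = (1.0 - odds_ratio) * 100.0

print(f"Odds Ratio: {odds_ratio:.5f}")
print(f"Reduction in odds: {odds_reduction_pct:.3f}%")
```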
Recall that the formula for Vaccine Efficacy is VE = (1-IRR).
Also recall that IRR (the Incidence Rate Ratio) can be expressed as follows:
IRR = p2 / p1
With a little bit of variable manipulation, we can express IRR in terms of the Odds Ratio (OR) as follows:
IRR = OR / ((1 - p_infected_placebogroup) + (p_infected_placebogroup * OR))
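The conversion from OR to IRR can be sketched as a small function. It uses the correction described in the Zhang and Yu paper listed in the references; the function name and the baseline incidences in the loop are illustrative assumptions.

```python
def odds_ratio_to_irr(odds_ratio, p_baseline):
    """Convert an Odds Ratio to an Incidence Rate Ratio (relative risk).

    p_baseline is the incidence of the disease in the placebo group.
    """
    return odds_ratio / ((1.0 - p_baseline) + p_baseline * odds_ratio)

# Sanity check with known incidences: p1 = 0.20 (placebo), p2 = 0.05 (vaccinated)
p1, p2 = 0.20, 0.05
true_or = (p2 / (1.0 - p2)) / (p1 / (1.0 - p1))
print(f"recovered IRR: {odds_ratio_to_irr(true_or, p1):.4f}  (p2 / p1 = {p2 / p1:.4f})")

# The gap between OR and IRR widens as the disease becomes more common
for p_baseline in (0.001, 0.01, 0.1, 0.3):
    irr = odds_ratio_to_irr(0.05239, p_baseline)
    print(f"baseline incidence {p_baseline:>5}: IRR = {irr:.5f}")
```

When the disease is rare in the placebo group, the IRR is almost equal to the OR, which is why the two are often used interchangeably for rare outcomes.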
Estimating the overall disease incidence rate
In the above formula, p_infected_placebogroup is the incidence of the disease in the placebo group, which we will approximate by the incidence of the disease in the overall population. We’ll take the example of COVID-19 to estimate p_infected_placebogroup. From January 22, 2020 to April 17, 2021, the Johns Hopkins University’s COVID-19 Dashboard recorded 140,010,233 COVID-19 infections worldwide. Assuming an average global population of 7,730,000,000 during this time period, this translates into a worldwide 12-month disease incidence rate of (140,010,233/7,730,000,000)*(12/16) = 0.01358, or 1.358%.
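The incidence arithmetic can be reproduced as follows. The figures are the ones quoted in the text, and the 12/16 factor is the scaling to a 12-month rate used there; the variable names are illustrative.

```python
infections = 140_010_233      # recorded infections quoted in the text
population = 7_730_000_000    # assumed average global population

# Cumulative incidence over the whole recording window
cumulative_incidence = infections / population

# Scale to a 12-month rate using the 12/16 factor from the text
p_infected_placebogroup = cumulative_incidence * (12 / 16)

print(f"cumulative incidence: {cumulative_incidence:.5f}")
print(f"12-month incidence:   {p_infected_placebogroup:.5f}")
```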
Estimating Incidence Rate Ratio (IRR)
We are now in a position to estimate the Incidence Rate Ratio by using the above relationship between OR and IRR. By plugging in p_infected_placebogroup=0.01358, and OR = 0.05239, we get:
IRR = 0.05239 / ((1 - 0.01358) + (0.01358 * 0.05239)) = 0.05307
This IRR value yields a point estimate of Vaccine Efficacy of:
VE = 1 - 0.05307 = 0.9469, or approximately 94.69%
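The two plug-in calculations can be reproduced in a few lines. The inputs are the OR and the placebo-group incidence quoted in the text, and the OR-to-IRR relation is the Zhang and Yu correction cited in the references.

```python
odds_ratio = 0.05239               # exp(-2.9491), from the fitted model
p_infected_placebogroup = 0.01358  # 12-month incidence estimated in the text

# Convert the Odds Ratio to an Incidence Rate Ratio
irr = odds_ratio / (
    (1.0 - p_infected_placebogroup) + p_infected_placebogroup * odds_ratio
)

# Vaccine Efficacy is one minus the Incidence Rate Ratio
vaccine_efficacy = 1.0 - irr

print(f"IRR = {irr:.5f}")
print(f"VE  = {vaccine_efficacy:.4%}")
```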
Confidence bounds for Vaccine Efficacy
To calculate the 95% confidence interval for the point estimate of Vaccine Efficacy, we use the confidence intervals reported by our fitted Logit Model for the VACCINATED parameter’s coefficient. Here is the CI data for reference:
We will use the following 3-step procedure:
- We’ll exponentiate each confidence bound to arrive at the 95% confidence bounds of the Odds Ratio.
- We’ll then use the relation between IRR and OR to calculate the 95% bounds for IRR.
- Finally, we’ll apply the relation between Vaccine Efficacy and IRR to calculate the 95% confidence bounds for the Vaccine Efficacy’s point estimate.
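The three steps above can be sketched as follows. The two coefficient bounds are placeholder values used only to make the sketch runnable; substitute the 95% bounds that your own fitted model reports for the VACCINATED coefficient. The helper function name is also illustrative.

```python
import math

# Placeholder 95% bounds for the VACCINATED coefficient (assumed values)
coef_lower, coef_upper = -3.313, -2.585
p_infected_placebogroup = 0.01358  # incidence estimated earlier in the text

def odds_ratio_to_irr(odds_ratio, p_baseline):
    """Convert an Odds Ratio to an Incidence Rate Ratio."""
    return odds_ratio / ((1.0 - p_baseline) + p_baseline * odds_ratio)

# Step 1: exponentiate the coefficient bounds to get bounds on the Odds Ratio
or_lower, or_upper = math.exp(coef_lower), math.exp(coef_upper)

# Step 2: convert the Odds Ratio bounds to IRR bounds
irr_lower = odds_ratio_to_irr(or_lower, p_infected_placebogroup)
irr_upper = odds_ratio_to_irr(or_upper, p_infected_placebogroup)

# Step 3: VE = 1 - IRR, so the bounds swap: the upper IRR bound
# gives the lower VE bound and vice versa
ve_lower = 1.0 - irr_upper
ve_upper = 1.0 - irr_lower

print(f"95% bounds for OR:  [{or_lower:.5f}, {or_upper:.5f}]")
print(f"95% bounds for IRR: [{irr_lower:.5f}, {irr_upper:.5f}]")
print(f"95% bounds for VE:  [{ve_lower:.4%}, {ve_upper:.4%}]")
```

Because each transformation is monotonic, transforming the end points of the coefficient's interval is enough; no resampling is needed.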
Using the above procedure, we get the following confidence intervals for the Vaccine Efficacy:
In summary, our Logistic Regression Model has yielded a point estimate of Vaccine Efficacy of approximately 94.69%, with 95% confidence bounds of [92.36452%, 96.31105%].
Here is the complete source code used in this article:
References, Citations and Copyrights
Paper and Book Links
Zhang J, Yu KF. What’s the Relative Risk? A Method of Correcting the Odds Ratio in Cohort Studies of Common Outcomes. JAMA. 1998;280(19):1690–1691. doi:10.1001/jama.280.19.1690
Venmani A. Comparison of regression models on estimation of vaccine efficacy in anti-leprosy vaccination trial-a large prospective vaccination trial. AIP Conference Proceedings 2112, 020148 (2019); https://doi.org/10.1063/1.5112333
COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University. URL: https://github.com/CSSEGISandData/COVID-19.
Dong E, Du H, Gardner L. An interactive web-based dashboard to track COVID-19 in real time. Lancet Inf Dis. 2020;20(5):533–534. doi:10.1016/S1473-3099(20)30120-1