Missing data is almost inevitable in any dataset, and a perennial concern for anyone using that data to analyze trends or make decisions. The common “head in the sand” approach, simply deleting any observation with missing values, is convenient, but rarely realistic. When data is missing, it is often missing for a reason.
Part of the problem with missing data is that, at least within the frequentist framework, there is no fully satisfactory way to address the missingness. One common solution is to impute each missing value with the mean or median calculated from the observations that are not missing. This can work, but should be used with extreme caution. Take, for example, a study on the efficacy of a new drug for headaches. Someone for whom the drug is less effective might be less likely to respond to follow-up surveys about the drug, and so would have missing data. If one were to assign to this person the average treatment effect calculated from the people for whom the drug was effective, one would overestimate the efficacy of the drug. Imputing with the mean (or even the median) is therefore not always a sound solution, though it is often the only one available, and so may be better than nothing.
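To see why, here is a toy simulation (fake data, not from any real trial): every patient has a true relief score, but patients with less relief are less likely to answer the follow-up survey, so the mean computed from respondents, the very number mean imputation would plug in, overstates the drug's effect.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy simulation: "relief" is each patient's true response to the drug.
relief = rng.normal(5.0, 2.0, size=10_000)

# Missing-not-at-random dropout: patients with less relief are less
# likely to answer the follow-up survey.
p_respond = 1.0 / (1.0 + np.exp(-(relief - 5.0)))
responded = rng.random(10_000) < p_respond

true_mean = relief.mean()                   # what we want to estimate
respondent_mean = relief[responded].mean()  # what mean imputation plugs in

print(f"true mean:       {true_mean:.2f}")
print(f"respondent mean: {respondent_mean:.2f}")
```

Because dropout is tied to the outcome itself, the respondent mean lands well above the true mean, and plugging it in for every non-respondent inflates the apparent efficacy.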
Considering missing data within the Bayesian framework gives an analyst much more appealing options. In simplified terms, the Bayesian framework treats the observed data as one particular realization from a distribution containing all possible data, and this underlying distribution can inform the analyst about plausible values for the missing entries. If the observed data is sparse in the part of the distribution corresponding to lower drug efficacy, say, because those patients stopped responding, Bayesian methods impute values that fill in that sparser portion of the distribution, giving a more realistic picture of the actual treatment effect of the drug. This frankly magical method is only possible because of the underlying distribution, a concept unique to Bayesianism.
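Mechanically, this means the missing values are treated as extra unknowns and sampled alongside the model parameters. A minimal sketch of the idea (a two-step Gibbs sampler for a toy Normal model with known unit variance and a flat prior; this is illustrative only, not the Stan model used below):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy model: y ~ Normal(mu, 1) with a flat prior on mu.
y_obs = np.array([4.1, 5.3, 3.8, 4.9, 5.6, 4.4, 5.0])
n_mis = 3  # three observations went missing

mu = y_obs.mean()  # initialize at the observed mean
mu_draws, mis_draws = [], []
for _ in range(4000):
    # Step 1: draw the missing values from their conditional, Normal(mu, 1).
    y_mis = rng.normal(mu, 1.0, size=n_mis)
    # Step 2: draw mu given the completed dataset; with a flat prior and
    # known unit variance this conditional is Normal(ybar, 1/sqrt(n)).
    y_all = np.concatenate([y_obs, y_mis])
    mu = rng.normal(y_all.mean(), 1.0 / np.sqrt(y_all.size))
    mu_draws.append(mu)
    mis_draws.append(y_mis)

mis_draws = np.array(mis_draws)
# Each missing value gets a whole posterior distribution rather than one
# plugged-in number: its spread reflects both the noise in y and the
# remaining uncertainty about mu.
print(np.mean(mu_draws), mis_draws.std())
```

The key contrast with mean imputation is that the imputed values carry uncertainty, so downstream estimates do not pretend the missing data was ever observed exactly.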
Let me illustrate with an example, using R, the Bayesian analysis software Stan, and the 12-month follow-up survey from the Oregon Health Insurance Experiment (OHIE) (WordPress does not recognize Rmd or Stan files, so the specific code is available upon request; comment below). The OHIE data was collected in 2008 under an expansion of Medicaid: 30,000 low-income adults were randomly selected via lottery from a 90,000-person waiting list for the opportunity to apply for 10,000 additional spots in the Medicaid program. The 12-month follow-up survey was sent to all 30,000 lottery selectees to study whether having insurance improved health. For this exercise, I looked at whether having insurance decreased the number of days in the last month that someone had mental health problems. Since the lottery was a pseudo-experiment, in theory I should not have to include other covariates.

The results indicate that, when observations with missing data are simply deleted, having insurance is associated with a strong positive effect on the number of days with mental health problems; that is, having insurance actually makes someone’s mental health worse, a rather implausible scenario. When the missing data is instead imputed from the underlying distribution, the estimated effect spans 0; that is, having insurance has no detectable effect on mental health. Considering that the population likely ranged from people with no mental health problems at all, to minor problems treatable with some medication, to severe problems that would take years of medication and/or therapy to properly treat, a null average effect seems much more plausible.
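The actual R/Stan code is not reproduced here, but the mechanism behind the first, implausible result can be sketched with a toy simulation (entirely fake data, not the OHIE files): suppose insurance truly has zero effect, but uninsured people with many bad mental-health days are less likely to return the survey. Deleting the non-respondents then makes insurance look like it causes bad days.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50_000

# Toy simulation, not the OHIE data: insurance truly has zero effect
# on the number of bad mental-health days in the last month.
insured = rng.integers(0, 2, size=n)
bad_days = np.clip(rng.normal(8.0, 6.0, size=n), 0, 30).round()

# MNAR non-response: uninsured people with many bad days are less likely
# to return the survey, while insured people respond at a flat 50% rate.
logit = np.where(insured == 1, 0.0, -0.3 * (bad_days - 8.0))
responded = rng.random(n) < 1.0 / (1.0 + np.exp(-logit))

# Complete-case estimate: difference in mean bad days among respondents.
cc_effect = (bad_days[responded & (insured == 1)].mean()
             - bad_days[responded & (insured == 0)].mean())
print(f"complete-case 'effect' of insurance: {cc_effect:+.2f} days")
```

The surviving uninsured respondents look artificially healthy, so the complete-case comparison attributes extra bad days to having insurance even though the true effect is zero; modeling the missing outcomes instead of deleting them is what pulls the estimate back toward the truth.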
Note: these results are part of a larger problem set I did for a Bayesian Statistics course at Columbia University.