Machine learning imputation of race in statistical estimates of disability and mortality risks using Social Security administrative data

Abstract

Individual-level data on race and ethnicity are needed to measure racial differences in social outcomes. Legally observable individual characteristics such as first name, last name, sex, geolocation of residence, and country of origin can be used to estimate statistical models and to train Natural Language Processing (NLP) machine learning models that impute missing race information. This paper briefly surveys the existing statistical and machine learning models for race imputation, focusing on recent state-of-the-art NLP models built on two popular network architectures: the LSTM and the more recently introduced Transformer. The paper builds and trains LSTM and Transformer models on information merged from three administrative datasets of the Social Security Administration (SSA): RECS, Numident, and Geolocation. It compares the performance of the models across various sets of feature variables, character-level and word-level tokenization algorithms, and a data balancing method that oversamples minority races in the training data. The paper then examines the sensitivity of statistical estimates of racial patterns in health outcomes, namely the risks of disability and mortality, to whether actual or imputed race is used in SSA's 2008 one percent Continuous Work History Sample (CWHS), a dataset of around 3 million records. The paper uses a statistical multi-state time-to-event model to estimate these risks.

Publication
Working Paper, Social Security Administration