Generative AI for Synthetic Tabular Data: Privacy, Utility, and Sensitivity of Downstream Policy Inference using HRS Data

Abstract

Strict privacy regulations restrict the public release of micro-level data critical for modeling in healthcare, education, economics, and finance. Synthetic tabular data provides a promising alternative by replicating statistical properties of the original datasets while safeguarding individual identities. This paper investigates the sensitivity of econometric policy models to the substitution of real data with synthetic data. We present an overview of differential privacy frameworks and DP-SGD algorithms that ensure formal privacy guarantees. In addition, we examine generative approaches, including GANs, VAEs, and diffusion models, with particular emphasis on Denoising Diffusion Probabilistic Models (DDPMs). We highlight the continuous-time stochastic differential equation formulation that unifies discrete diffusion processes, where the Fokker–Planck equation offers a principled simplification of backward denoising dynamics. We propose the Mahalanobis $D^2$ statistic as a novel metric for measuring policy sensitivity to data substitution. Using Health and Retirement Study (HRS) data, we train and assess 13 tabular generative models (three GAN-based, two VAE-based, five diffusion-based, and three additional architectures). Models are ranked across utility, privacy, and Mahalanobis $D^2$ metrics, providing a comprehensive benchmark for synthetic data generation in econometric policy analysis.
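
As an illustration of the kind of quantity a Mahalanobis $D^2$ sensitivity metric measures, the sketch below computes the squared Mahalanobis distance between a coefficient vector estimated on real data and one estimated on synthetic data, scaled by the covariance of the real-data estimate. This is an assumed, textbook form, $D^2 = (\hat\beta_s - \hat\beta_r)^\top \Sigma_r^{-1} (\hat\beta_s - \hat\beta_r)$, and not necessarily the exact definition used in the paper. The function name `mahalanobis_d2` and all numbers are hypothetical.

```python
import numpy as np


def mahalanobis_d2(beta_real, beta_synth, cov_real):
    """Squared Mahalanobis distance between two coefficient vectors.

    Assumed form: (b_s - b_r)^T Sigma_r^{-1} (b_s - b_r), where Sigma_r is
    the covariance matrix of the real-data coefficient estimate.
    """
    diff = np.asarray(beta_synth, dtype=float) - np.asarray(beta_real, dtype=float)
    cov = np.asarray(cov_real, dtype=float)
    # Solve cov @ x = diff rather than forming an explicit inverse.
    return float(diff @ np.linalg.solve(cov, diff))


# Hypothetical numbers for illustration only (not results from the paper).
beta_real = [0.50, -1.20]
beta_synth = [0.60, -1.00]
cov_real = [[0.04, 0.00],
            [0.00, 0.16]]

d2 = mahalanobis_d2(beta_real, beta_synth, cov_real)
print(f"D^2 = {d2:.3f}")
```

Under this assumed form, a value near zero means the synthetic-data estimates are close to the real-data estimates relative to their sampling uncertainty, and larger values indicate that the policy-relevant coefficients shift more under data substitution.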

Publication
Working Paper