| Model | pMSE | NNAA | nnMDCR | grMDCR | eps. hit rate | Compos. score |
|---|---|---|---|---|---|---|
| Ref. | 0.0001 | 0.5000 | 0.0000 | 0.0000 | 1.0000 | 0.3570 |
| ARF | 0.0003 | 0.4823 | 1.4660 | 0.7360 | 0.0000 | 0.5758 |
| CDTD | 0.0005 | 0.4152 | 1.0200 | 0.7344 | 0.0000 | 0.5135 |
| TabDDPM | 0.0018 | 0.4776 | 1.3397 | 0.7346 | 0.0000 | 0.5558 |
| SMOTE | 0.0051 | 0.3518 | 1.0033 | 0.8005 | 0.0000 | 0.5053 |
| Tabsyn | 0.0062 | 0.5326 | 1.4434 | 0.7346 | 0.0000 | 0.5624 |
| TVAE | 0.0120 | 0.6432 | 1.2156 | 0.8006 | 0.0000 | 0.5220 |
| CoDi | 0.0146 | 0.5303 | 1.4419 | 0.7490 | 0.0000 | 0.5508 |
| TabularARGN | 0.0168 | 0.5331 | 1.4496 | 0.0267 | 0.0113 | 0.5315 |
| TabDiff | 0.0592 | 0.7554 | 1.7223 | 0.0739 | 0.0015 | 0.5116 |
| CTABGAN | 0.0624 | 0.7816 | 1.7420 | 0.1376 | 0.0001 | 0.5142 |
| TTVAE | 0.0627 | 0.7524 | 1.4511 | 0.0743 | 0.0017 | 0.4715 |
| CTGAN | 0.0772 | 0.7696 | 1.5580 | 0.8011 | 0.0000 | 0.4747 |
| CopulaGAN | 0.2492 | 0.9996 | 3.6659 | 0.1176 | 0.0000 | 0.5111 |
Note: pMSE = propensity mean squared error; NNAA = nearest neighbor adversarial accuracy (values near 0.5 indicate the real and synthetic datasets are similar); nnMDCR = nearest-neighbor median distance to closest record; grMDCR = Gower median distance to closest record; eps. hit rate = epsilon hitting rate (values closer to 0 indicate less privacy leakage).
Acknowledgement and disclaimer
This research was conducted without external grant funding and was completed independently during the author’s personal time.
1 Introduction
Micro-level data is essential for developing robust statistical and machine learning models in sectors like healthcare, economics, education, and finance. Although agencies collect high-value data through services and surveys, disseminating this information for evidence-based analysis is hindered by privacy mandates, such as the Health Insurance Portability and Accountability Act (HIPAA) in the United States, the General Data Protection Regulation (GDPR) in Europe, and sector-specific frameworks like the Gramm-Leach-Bliley Act for financial data. Consequently, synthetic tabular data is gaining traction as a viable alternative that mitigates privacy risks while preserving data utility, ensuring that the statistical distributions of the synthetic data closely mirror those of the ground truth.
Beyond privacy, generative models for tabular data are increasingly important due to their multifaceted applications, which include imputing missing values, balancing minority class data for statistical analysis, and augmenting training data. The issue of class imbalance is common in domains like fraud detection and rare disease diagnosis. Models trained on such skewed distributions exhibit a strong bias toward the majority class, resulting in poor predictive accuracy for the critical minority class. Generative models can be conditioned to oversample the minority class, providing a rich, diverse set of new examples and enabling the creation of perfectly balanced datasets. This directly addresses the class imbalance problem in a far more sophisticated manner than traditional oversampling techniques like SMOTE (Synthetic Minority Over-sampling Technique) introduced in Chawla et al. (2002); for more on this, see He and Garcia (2009).
Generative tabular data models offer powerful advantages for missing value imputation by learning the underlying data distribution rather than simply estimating expectations. For instance, generative models like MIWAE (Mattei and Frellsen 2018) handle challenging scenarios including missing-at-random mechanisms and achieve competitive accuracy while maintaining computational efficiency, making them particularly valuable for real-world applications with heterogeneous, incomplete datasets.
Training deep neural network models typically requires large volumes of data to achieve reliable results. This poses a challenge in domains such as biochemical drug discovery, where data availability is limited. To address this, Altae-Tran et al. (2017) introduced a one-shot learning algorithm that substantially reduces the amount of data needed to make meaningful predictions. Beyond drug discovery, synthetic data offers valuable support for semi-supervised and self-supervised learning paradigms, particularly in scenarios where labeled data is scarce or costly to obtain.
In industrial applications, synthetic tabular data accelerates model prototyping, enables robust benchmarking, and enhances reproducibility by providing standardized datasets that closely approximate real-world conditions without exposing proprietary or sensitive information. In healthcare contexts, however, the use of synthetic data requires careful consideration to avoid risks and ensure ethical application (Giuffrè and Shung 2023; Koul, Duran, and Hernandez-Boussard 2025; Mohammed et al. 2025; Vallevik et al. 2024). Compounding this challenge, real-world data often serves as a mirror to historical and societal biases, which, if left unaddressed, are learned and amplified by machine learning models, leading to discriminatory and inequitable automated decisions (Barocas and Selbst 2016).
Traditional anonymization methods, such as k-anonymization, generalization, and suppression, often lead to a significant trade-off between privacy and data utility. This loss of utility can compromise downstream analytical tasks, particularly when training complex machine learning models or performing policy inference.
Generative Artificial Intelligence (AI) models have emerged as the state-of-the-art in generating synthetic data, including tabular data, offering significant improvements over classical techniques. Extensions of Generative Adversarial Networks (GANs) (Goodfellow et al. 2014), Variational Autoencoders (VAEs) (Kingma and Welling 2013), and more recently introduced Large Language Models (LLMs) (W. X. Zhao et al. 2023; Matarazzo and Torlone 2025) and Diffusion Models (DMs) (Sohl-Dickstein et al. 2015; Ho, Jain, and Abbeel 2020) for generation of high-quality synthetic tabular data are gaining popularity.
The generation of synthetic tabular data is not without its complexities. Unlike image or text data, tabular datasets often exhibit heterogeneous feature types (categorical, ordinal, continuous), intricate inter-feature dependencies, and domain-specific constraints. These characteristics pose unique challenges for generative modeling, necessitating specialized architectures and evaluation metrics tailored to tabular data synthesis. Some studies show that among all types of models, Diffusion Models have demonstrated superior capability in capturing complex, non-linear dependencies and multimodal distributions inherent in real-world tabular data (Capasso 2025; Fonseca and Bacao 2023; Zhang et al. 2024; Kotelnikov et al. 2023; Li et al. 2025; R. Shi et al. 2025; Truda 2023).
The fundamental promise of synthetic data is to act as a high-fidelity (utility), privacy-preserving proxy for the real data by creating a dataset that contains no personally identifiable information (PII) and has no one-to-one mapping to the original records. Diverse metrics are used in the literature to assess utility and privacy when comparing a real dataset with various synthetic datasets. Going further toward guaranteeing privacy preservation, based on the rigorous differential privacy framework of Dwork (2008) and Dwork and Roth (2014), training algorithms such as the differentially private stochastic gradient descent (DP-SGD) algorithm proposed by Abadi et al. (2016) are built into the training of generative models to achieve a given level of guaranteed differential privacy.
While some papers examine the downstream machine learning efficiency of synthetic datasets (e.g., using ML-efficiency metrics, Sajjadi et al. (2018)), few studies compare the sensitivity of policy inferences derived from the statistical parameter estimates of econometric models. This paper studies the viability of using synthetic HRS data for policy inference, proposing the Mahalanobis distance statistic D^2 (which is related to Hotelling's T^2 statistic) to test whether the derived public health and economic conclusions, which are vital for informing policy decisions, remain statistically equivalent to those drawn from the original protected data.
The rest of the paper is organized as follows. In Section 2, I describe the mechanics of the generative AI models for tabular synthetic data that are based on GANs, VAEs, and Diffusion Models. I also describe the concept of differential privacy (DP) and the differentially private stochastic gradient descent (DP-SGD) algorithm, which can be used in generative AI models to achieve differential privacy in the generated synthetic data. In Section 3, I describe the metrics used to assess synthetic data utility, privacy level, and downstream policy sensitivity. In Section 4, I briefly describe the HRS dataset and the variables used in this paper. In Section 5, I compute the utility and privacy metrics to benchmark the synthetic datasets generated by the 13 models considered in this paper and discuss model recommendations based on these metrics. In Section 6, I compute the policy sensitivity metric proposed in this paper, the Mahalanobis D^2, for all 13 models and discuss model recommendations using this metric. Section 7 concludes the paper.
2 Generative models for synthetic tabular data
This paper investigates the trade-off between privacy preservation and data utility in synthetic data generation models, and assesses the sensitivity of policy analysis using synthetic datasets produced by various machine learning models. I begin by explaining the concept of differential privacy.
2.1 Differential Privacy (DP)
Cynthia Dwork and colleagues (Dwork 2008) introduced a rigorous mathematical framework for the concept of privacy called Differential Privacy (DP), designed to protect individual data contributions when performing statistical analysis or machine learning. The core idea is that the output of an algorithm should not significantly change whether or not any single individual’s data is included, thereby ensuring plausible deniability for participants.
In short, adding or removing a single user's record should not statistically change the output. This stability guarantees that the model does not memorize or reveal sensitive information about any specific individual.
Consider an algorithm that acts on a dataset and produces some output, such as a synthetic dataset or the mean, median, or mode of a variable in the real dataset. Such an algorithm can be a database query producing outputs of these types. In our context, the algorithm is a machine learning model that acts on real data and produces a synthetic dataset similar in nature to the real data. Let \mathcal{D} be the set of all datasets and \mathcal{R} be the set of all possible outcomes of the algorithm.
A randomized algorithm is said to be differentially private if its outputs are nearly indistinguishable when run on two datasets in \mathcal{D} that differ by only one individual’s record. This is typically achieved by carefully adding noise to computations, balancing privacy guarantees with utility.
A randomized algorithm \mathcal{M}: \mathcal{D} \rightarrow \mathcal{R} satisfies (\epsilon, \delta)-differential privacy if for all datasets D and D' in \mathcal{D} differing in one record, and for all subsets S \subseteq \mathcal{R}:
\Pr[\mathcal{M}(D) \in S] \leq e^\epsilon \cdot \Pr[\mathcal{M}(D') \in S] + \delta

If \delta = 0, this is called pure differential privacy, which provides a strict guarantee; if \delta > 0, it is called approximate differential privacy, which allows a small probability of privacy breach.
The parameter \epsilon, known as the privacy budget, controls privacy loss; smaller values mean stronger privacy. General practice is to treat \epsilon < 1 as strong privacy, 1 \leq \epsilon \leq 10 as moderate privacy, and \epsilon > 10 as weak privacy. The parameter \delta controls the probability of a privacy breach; typically, it is fixed at \delta < \frac{1}{n^2}, where n is the dataset size.
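To make the privacy budget concrete, the definition above can be read as a worst-case multiplicative limit on how much any output probability can shift when one record changes. A minimal sketch (taking \delta = 0):

```python
import math

# e^eps bounds the ratio Pr[M(D) in S] / Pr[M(D') in S] for neighboring datasets.
for eps in (0.1, 1.0, 10.0):
    print(f"eps = {eps:>4}: output probabilities differ by at most a factor of {math.exp(eps):.2f}")
```

For \epsilon = 0.1 the factor is about 1.11 (outputs nearly indistinguishable), while for \epsilon = 10 it exceeds 22{,}000, which is why \epsilon > 10 is generally considered weak privacy.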
The main mechanism for achieving (\epsilon, \delta)-differential privacy in Machine Learning (ML) models of synthetic data generation is to replace SGD (stochastic gradient descent) with DP-SGD (Differentially Private Stochastic Gradient Descent) in parameter estimation algorithms. The algorithm is described next.
2.2 DP-SGD (Differentially Private Stochastic Gradient Descent)
Abadi et al. (2016) developed the DP-SGD algorithm for training neural networks to achieve a level of guaranteed differential privacy, which is described below.
DP-SGD Algorithm
Clip gradients: For each sample gradient g_i: \bar{g}_i = g_i \cdot \min\left(1, \frac{C}{\|g_i\|_2}\right) where C is the clipping threshold.
Add noise: Compute noisy average: \tilde{g} = \frac{1}{B}\left(\sum_{i=1}^B \bar{g}_i + \mathcal{N}(0, \sigma^2 C^2 I)\right)
Update parameters: \theta_{t+1} = \theta_t - \eta \tilde{g}
Privacy Accounting: Using moments accountant or Rényi DP, after T iterations:
\epsilon(T, \delta) = \mathcal{O}\left(\frac{q\sqrt{T\log(1/\delta)}}{\sigma}\right)
where q = B/n is the sampling rate.
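The three steps above can be sketched in a few lines. This is a minimal NumPy illustration of one DP-SGD update on per-sample gradients; the function name and toy gradients are my own, not from any DP library:

```python
import numpy as np

def dp_sgd_step(theta, per_sample_grads, C, sigma, lr, rng):
    """One DP-SGD update: clip each per-sample gradient, add noise, average, step."""
    # Clip: rescale each g_i so its L2 norm is at most C
    clipped = [g * min(1.0, C / (np.linalg.norm(g) + 1e-12)) for g in per_sample_grads]
    B = len(clipped)
    # Noisy average: add N(0, sigma^2 C^2 I) to the summed clipped gradients
    noise = rng.normal(0.0, sigma * C, size=theta.shape)
    g_tilde = (np.sum(clipped, axis=0) + noise) / B
    # Gradient step
    return theta - lr * g_tilde

rng = np.random.default_rng(0)
grads = [rng.normal(size=3) * s for s in (0.1, 1.0, 10.0)]  # toy per-sample gradients
theta_new = dp_sgd_step(np.zeros(3), grads, C=1.0, sigma=1.0, lr=0.1, rng=rng)
```

With sigma = 0 this reduces to ordinary SGD on clipped gradients; the moments accountant then tracks how \epsilon grows with the number of such steps.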
There are three main types of generative models in the literature that I use in the present study. Their mechanics are briefly described below.
2.3 GAN (Generative Adversarial Network)
GANs, introduced by Goodfellow and colleagues (Goodfellow et al. 2014), consist of two neural networks engaged in a minimax game. The generator G maps random noise z \sim p_z(z) to synthetic samples G(z), while the discriminator D attempts to distinguish real samples from generated ones.
The optimization objective for training is:
\min_\theta \max_\phi V(D_\phi, G_\theta) = \mathbb{E}_{\mathbf{x} \sim p_{\text{data}}}[\log D_\phi(\mathbf{x})] + \mathbb{E}_{\mathbf{z} \sim p_z}[\log(1 - D_\phi(G_\theta(\mathbf{z})))]
where p_{\text{data}} is the true data distribution and p_z is the prior distribution on latent codes (typically \mathcal{N}(\mathbf{0}, \mathbf{I})).
The alternating optimization procedure:
Standard GAN Training
Input: Real data \mathcal{D}, learning rates \eta_D, \eta_G, batch size m
- \mathbf{for} number of training iterations:
- \quad for k discriminator steps:
- \quad\quad Sample minibatch \{\mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(m)}\} from p_{\text{data}}
- \quad\quad Sample minibatch \{\mathbf{z}^{(1)}, \ldots, \mathbf{z}^{(m)}\} from p_z
- \quad\quad Update discriminator: \phi \leftarrow \phi + \eta_D \nabla_\phi \frac{1}{m}\sum_{i=1}^m [\log D_\phi(\mathbf{x}^{(i)}) + \log(1-D_\phi(G_\theta(\mathbf{z}^{(i)})))]
- \quad Sample minibatch \{\mathbf{z}^{(1)}, \ldots, \mathbf{z}^{(m)}\} from p_z
- \quad Update generator: \theta \leftarrow \theta - \eta_G \nabla_\theta \frac{1}{m}\sum_{i=1}^m \log(1-D_\phi(G_\theta(\mathbf{z}^{(i)})))
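To make the alternating updates concrete, here is a deliberately tiny NumPy sketch: a generator G(z) = az + b learning a 1-D Gaussian against a logistic discriminator D(x) = \sigma(wx + c), with the gradients written out by hand. It illustrates the loop structure only, not a practical tabular GAN, and the generator uses the common non-saturating variant (ascending \log D(G(z))) rather than the minimax form shown above:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))

a, b = 1.0, 0.0      # generator params: G(z) = a*z + b
w, c = 0.0, 0.0      # discriminator params: D(x) = sigmoid(w*x + c)
eta_D, eta_G, m = 0.05, 0.05, 64

for _ in range(2000):
    # --- discriminator step: ascend log D(x) + log(1 - D(G(z))) ---
    x = rng.normal(2.0, 1.0, m)          # real minibatch from p_data = N(2, 1)
    z = rng.normal(size=m)               # noise minibatch from p_z = N(0, 1)
    g = a * z + b                        # fake samples
    d_real, d_fake = sigmoid(w * x + c), sigmoid(w * g + c)
    w += eta_D * np.mean((1 - d_real) * x - d_fake * g)
    c += eta_D * np.mean((1 - d_real) - d_fake)
    # --- generator step: ascend log D(G(z)) (non-saturating loss) ---
    z = rng.normal(size=m)
    g = a * z + b
    d_fake = sigmoid(w * g + c)
    a += eta_G * np.mean((1 - d_fake) * w * z)
    b += eta_G * np.mean((1 - d_fake) * w)
```

After training, the generator's offset b drifts toward the real mean of 2 as the discriminator's learned slope w points it in the right direction.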
For tabular data, this framework requires modifications to handle mixed data types (continuous, categorical, ordinal) and complex dependencies between features. Xu et al. (2019) introduced such an extension, known as CTGAN, which I use in this study. Other extensions, such as PATE-GAN (Jordon, Yoon, and Schaar 2018) and DP-CTGAN (Fang, Dhami, and Kersting 2022), incorporate differential privacy explicitly; these could not be easily adapted to train on our dataset and thus are not included in the study.
2.4 VAE (Variational Autoencoder)
A VAE model introduced by Kingma and Welling (2013) learns a probabilistic mapping between data space \mathcal{X} and latent space \mathcal{Z} through variational inference. Unlike GANs, VAEs have an explicit probabilistic framework and optimize a principled objective function (the evidence lower bound).
The VAE defines a generative process:
z \sim p_\theta(z) = \mathcal{N}(0, I), \qquad\qquad x|z \sim p_\theta(x|z) \tag{1}
The goal is to maximize the marginal log-likelihood:
\log p_\theta(x) = \log \int p_\theta(x|z)p_\theta(z) dz
This integral is intractable for complex decoders p_\theta(x|z). We therefore introduce an approximate posterior (encoder) q_\phi(z|x) and apply Jensen's inequality:
\begin{aligned} \log p_\theta(x) &= \log \int p_\theta(x,z) dz \\ &= \log \int q_\phi(z|x) \frac{p_\theta(x,z)}{q_\phi(z|x)} dz \\ &\geq \int q_\phi(z|x) \log \frac{p_\theta(x,z)}{q_\phi(z|x)} dz \\ &= \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x,z) - \log q_\phi(z|x)] \\ &= \mathcal{L}(\theta, \phi; x) \end{aligned}
\mathcal{L}(\theta, \phi; x) is known as the Evidence Lower Bound (ELBO).
The gap between ELBO and log-likelihood is:
\log p_\theta(x) - \mathcal{L}(\theta, \phi; x) = D_{KL}(q_\phi(z|x) \| p_\theta(z|x))
Since D_{KL} \geq 0, maximizing the ELBO provides a lower bound on the log-likelihood and minimizes the KL divergence to the true posterior.
Reparameterization Trick
To enable backpropagation through stochastic nodes, reparameterize:
z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)
This separates the stochasticity (\epsilon) from the parameters (\phi), allowing gradients:
\nabla_\phi \mathbb{E}_{q_\phi(z|x)}[f(z)] = \mathbb{E}_{\epsilon \sim \mathcal{N}(0,I)}[\nabla_\phi f(\mu_\phi(x) + \sigma_\phi(x) \odot \epsilon)]

The training objective maximizes the Evidence Lower Bound (ELBO):
\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - D_{KL}(q_\phi(z|x) \| p(z))
The first term encourages reconstruction accuracy, while the second regularizes the latent space to match a prior p(z) = \mathcal{N}(0, I).
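For a Gaussian encoder and standard-normal prior, the KL term has the closed form \frac{1}{2}\sum_j(\mu_j^2 + \sigma_j^2 - \log\sigma_j^2 - 1), and the reparameterization trick is one line. A minimal NumPy sketch (the function names are mine):

```python
import numpy as np

def gaussian_kl(mu, log_var):
    # KL( N(mu, diag(exp(log_var))) || N(0, I) ), closed form, summed over dims
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

def reparameterize(mu, log_var, rng):
    # z = mu + sigma * eps with eps ~ N(0, I): stochasticity moved into eps
    eps = rng.normal(size=np.shape(mu))
    return mu + np.exp(0.5 * log_var) * eps

rng = np.random.default_rng(0)
mu, log_var = np.array([0.5, -0.2]), np.array([0.1, -0.3])
z = reparameterize(mu, log_var, rng)
kl = gaussian_kl(mu, log_var)
```

The KL is zero exactly when the posterior equals the prior (\mu = 0, \sigma = 1) and positive otherwise; this is the regularization that pulls the latent space toward p(z).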
For tabular data with mixed categorical, numerical, and ordinal features, the Synthetic Data Vault group introduced TVAE (Xu et al. 2019), an extension of the above VAE. Another extension of the vanilla VAE for tabular data is TTVAE (A. X. Wang and Nguyen 2025), an attention-based transformer model.
VAEs offer inherent privacy benefits: (1) The KL divergence term creates a continuous, smooth latent representation, reducing memorization. (2) The probabilistic encoder adds noise during training, providing a form of implicit privacy protection. (3) The ELBO objective is more amenable to differential privacy mechanisms than GAN objectives.
Weggenmann et al. (2022) introduce the DP-VAE model, which incorporates differential privacy explicitly, but their code could not be readily adapted to our dataset and is thus not included in this study.
I will include TVAE and TTVAE in this study.
2.5 Diffusion Models for Tabular Data
Diffusion models, particularly Denoising Diffusion Probabilistic Models (DDPMs), have emerged as powerful generative models. They define a forward process that gradually adds noise to data and learn a reverse process that removes noise to generate samples. In what follows, I provide a terse presentation of this method. The details can be found in the original papers (Sohl-Dickstein et al. 2015; Ho, Jain, and Abbeel 2020); for friendlier expositions, see Luo (2022) and Chan (2024).
The foundational concept introduced by Sohl-Dickstein et al. was to replace the single-step conversion in a VAE with a chain of sequential conversions. Specifically, they defined two distinct processes over states x_0, x_1, \dots, x_T, where each x_i lies in the data space. One is the forward process (going forward in time, starting at t=0) with joint distribution q_\phi(x_{0:T}), mirroring the encoder component. The other is the backward process (going backward in time, starting at t=T) with joint distribution p_\theta(x_{0:T}), mirroring the decoder component of a Variational Autoencoder (VAE).
To ensure both tractability and flexibility, a Markov chain structure is imposed on these processes. This means that each state in the sequence depends only on the immediately preceding state:
q_\phi(x_{0:T}) = q(x_0) \prod_{t=1}^T q_\phi(x_t | x_{t-1}) \tag{2}
and
p_\theta(x_{0:T}) = p(x_T) \prod_{t=1}^T p_\theta(x_{t-1} | x_t) \tag{3}
where q(x_0) is the unknown data distribution that we are trying to approximate, p(x_T) is a known distribution, generally assumed to be standard normal distribution, from which one can easily draw samples; the conditional distributions q_\phi(x_t | x_{t-1}) and p_\theta(x_{t-1} | x_t) represent the single-step transitions of the forward and reverse processes, respectively. The parameters \phi and \theta characterize distributions of each process. The forward process is a fixed (specified by the user) Markov chain that adds Gaussian noise over T timesteps:
q_{\phi}(x_t|x_{t-1}) = \mathcal{N}(x_t; \sqrt{1-\beta_t}x_{t-1}, \beta_t I) \tag{4}
where \beta_1, \ldots, \beta_T is a variance schedule with 0 < \beta_t < 1. While other distributions could be assumed for the transition probabilities, there are advantages if these are taken to be normal.
Consequently, one can sample x_t directly given x_0 using the following,
q_{\phi}(x_t|x_0) = \mathcal{N}(x_t; \sqrt{\bar{\alpha}_t}x_0, (1-\bar{\alpha}_t)I) \tag{5}
where \alpha_t = 1 - \beta_t and \bar{\alpha}_t = \prod_{s=1}^t \alpha_s. To see how Equation 5 is derived, notice that applying the reparameterization x_t = \sqrt{\alpha_t}x_{t-1} + \sqrt{1-\alpha_t}\epsilon_{t-1} repeatedly, one gets,
\begin{aligned} x_t &= \sqrt{\alpha_t}x_{t-1} + \sqrt{1-\alpha_t}\epsilon_{t-1} \\ &= \sqrt{\alpha_t}(\sqrt{\alpha_{t-1}}x_{t-2} + \sqrt{1-\alpha_{t-1}}\epsilon_{t-2}) + \sqrt{1-\alpha_t}\epsilon_{t-1} \\ &= \sqrt{\alpha_t\alpha_{t-1}}x_{t-2} + \sqrt{\alpha_t(1-\alpha_{t-1})}\epsilon_{t-2} + \sqrt{1-\alpha_t}\epsilon_{t-1} \end{aligned}
Since the sum of independent Gaussians is Gaussian with variance equal to the sum of the variances, this simplifies to the closed form in Equation 5.
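The closed form can be checked numerically: iterating the one-step transition of Equation 4 T times and sampling directly from Equation 5 produce the same marginal distribution. A small NumPy check (the schedule values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 100
betas = np.linspace(1e-4, 0.2, T)       # arbitrary variance schedule
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

x0 = rng.normal(size=100_000)           # toy data: standard-normal draws

# Iterative noising: apply q(x_t | x_{t-1}) for t = 1, ..., T
x_iter = x0.copy()
for t in range(T):
    x_iter = np.sqrt(alphas[t]) * x_iter + np.sqrt(betas[t]) * rng.normal(size=x_iter.shape)

# One-shot sampling from the closed form q(x_T | x_0)
x_direct = np.sqrt(alpha_bar[-1]) * x0 + np.sqrt(1.0 - alpha_bar[-1]) * rng.normal(size=x0.shape)
```

Both marginals have mean \sqrt{\bar{\alpha}_T}\,\mathbb{E}[x_0] and variance \bar{\alpha}_T \mathrm{Var}(x_0) + (1-\bar{\alpha}_T); for standard-normal toy data both are simply N(0, 1).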
The reverse process learns to denoise:
p_\theta(x_{t-1}|x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t)) \tag{6}
The joint distribution is:
p_\theta(x_{0:T}) = p(x_T)\prod_{t=1}^T p_\theta(x_{t-1}|x_t)
where p(x_T) = \mathcal{N}(x_T; 0, I).
Training Objective: The training maximizes the ELBO:
\mathbb{E}_{q}[\log p_\theta(x_0)] \geq \mathbb{E}_q\left[\log \frac{p_\theta(x_{0:T})}{q(x_{1:T}|x_0)}\right] = \mathcal{L}
This decomposes into:
\mathcal{L} = \mathbb{E}_q\left[\underbrace{D_{KL}(q(x_T|x_0)\|p(x_T))}_{L_T} + \sum_{t>1}\underbrace{D_{KL}(q(x_{t-1}|x_t,x_0)\|p_\theta(x_{t-1}|x_t))}_{L_{t-1}} - \underbrace{\log p_\theta(x_0|x_1)}_{L_0}\right]
Posterior q(x_{t-1}|x_t, x_0): By Bayes’ theorem:
q(x_{t-1}|x_t, x_0) = \mathcal{N}(x_{t-1}; \tilde{\mu}_t(x_t, x_0), \tilde{\beta}_t I)
where:
\tilde{\mu}_t(x_t, x_0) = \frac{\sqrt{\bar{\alpha}_{t-1}}\beta_t}{1-\bar{\alpha}_t}x_0 + \frac{\sqrt{\alpha_t}(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}x_t
\tilde{\beta}_t = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\beta_t
Simplified Training Loss: Using the connection between x_0 and x_t:
x_0 = \frac{1}{\sqrt{\bar{\alpha}_t}}(x_t - \sqrt{1-\bar{\alpha}_t}\epsilon)
where \epsilon \sim \mathcal{N}(0, I) is the noise added in the forward process.
The model can predict \epsilon_\theta(x_t, t), leading to the simplified loss:
L_{simple} = \mathbb{E}_{t, x_0, \epsilon}\left[\|\epsilon - \epsilon_\theta(x_t, t)\|^2\right]
where t \sim \text{Uniform}(1, T) and \epsilon \sim \mathcal{N}(0, I).
DDPM Sampling Algorithm
- Sample x_T \sim \mathcal{N}(0, I)
- For t = T, \ldots, 1: x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\epsilon_\theta(x_t, t)\right) + \sqrt{\beta_t}z where z \sim \mathcal{N}(0, I) if t > 1, else z = 0.
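The sampler can be exercised without any training by plugging in an oracle noise predictor. If the data are standard normal, the marginal q(x_t) stays N(0, 1) under this schedule, and the exact predictor is \epsilon_\theta(x_t, t) = \sqrt{1-\bar{\alpha}_t}\,x_t; running the loop above with it should return standard-normal samples. A sketch under these toy assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 200
betas = np.linspace(1e-4, 0.05, T)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

def eps_oracle(x, t):
    # Exact noise predictor when q(x_0) = N(0, 1): then q(x_t) = N(0, 1), the
    # score is -x_t, and eps_theta(x_t, t) = sqrt(1 - alpha_bar_t) * x_t
    return np.sqrt(1.0 - alpha_bar[t]) * x

x = rng.normal(size=50_000)             # x_T ~ N(0, I)
for t in range(T - 1, -1, -1):
    z = rng.normal(size=x.shape) if t > 0 else 0.0
    x = (x - (1.0 - alphas[t]) / np.sqrt(1.0 - alpha_bar[t]) * eps_oracle(x, t)) \
        / np.sqrt(alphas[t]) + np.sqrt(betas[t]) * z
```

Each step contracts the mean by \sqrt{\alpha_t} and adds variance \beta_t, so unit variance is preserved exactly and the final samples are again approximately N(0, 1).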
Score-Based Perspective: From Equation 5, it can be seen that diffusion models connect to score matching through:
\nabla_{x_t} \log q(x_t) = -\frac{1}{\sqrt{1-\bar{\alpha}_t}}\epsilon
The model learns the score function:
s_\theta(x_t, t) = -\frac{1}{\sqrt{1-\bar{\alpha}_t}}\epsilon_\theta(x_t, t) \approx \nabla_{x_t} \log q(x_t)
This enables sampling via Langevin dynamics and connects discrete-time diffusion to continuous-time stochastic differential equations (SDEs). It also leads to flow matching (Lipman et al. 2023; Albergo and Vanden-Eijnden 2022), an alternative method that learns continuous normalizing flows. Both frameworks can be formulated in terms of a probability flow ODE (Song et al. 2021). I describe them in a unified framework in the next subsection.
2.6 Flow matching and diffusion models in continuous time — a unified modern approach
The fundamental insight underlying these approaches is elegant: creating noise from data is trivial, while the reverse process, generating data from noise, constitutes the essence of generative modeling. This transformation is accomplished by constructing probability paths that smoothly interpolate between a simple prior distribution (typically Gaussian noise) and the unknown data distribution. The goal is to use a neural network to learn the vector fields or score functions that guide this transformation.
This subsection closely follows Holderrieth and Erives (2025). Both Flow matching models and diffusion models rely on differential equations—ordinary differential equations (ODEs) for flow models and stochastic differential equations (SDEs) for diffusion models—to gradually transform simple noise distributions into complex data distributions.
Consider a data distribution p_{\text{data}} from which we wish to sample. The generative process begins by defining a probability path \{p_t\}_{t \in [0,1]}, where p_0 represents a simple noise distribution (e.g., standard Gaussian) and p_1 = p_{\text{data}} is the target data distribution. This path describes how probability mass evolves from noise to data over the time interval [0,1].
For flow models, the evolution of samples along this path is governed by an ODE:
\frac{dx_t}{dt} = u_t(x_t)
where u_t: \mathbb{R}^d \to \mathbb{R}^d is the vector field at time t. The vector field determines how individual samples flow through space to transform the initial distribution p_0 into p_1.
For diffusion models, the process includes stochastic components and is described by an SDE:
dx_t = u_t(x_t)dt + g_t dW_t
where W_t denotes Brownian motion, u_t is the drift term, and g_t controls the diffusion intensity. The stochasticity enables more flexible probability transformations.
The Fokker-Planck equation is the cornerstone connecting SDEs to probability density evolution. It describes how the probability density p_t(x) evolves when particles follow an SDE. For the general SDE above, the Fokker-Planck equation states:
\frac{\partial p_t(x)}{\partial t} = -\nabla \cdot (u_t(x) p_t(x)) + \frac{1}{2}g_t^2 \Delta p_t(x)
where \nabla \cdot denotes the divergence and \Delta is the Laplacian operator. The first term captures transport due to the drift, while the second term models diffusion.
For ODEs (when g_t = 0), this reduces to the continuity equation:
\frac{\partial p_t(x)}{\partial t} = -\nabla \cdot (u_t(x) p_t(x))
The Fokker-Planck equation is fundamental because it provides the theoretical guarantee: if we correctly parameterize our vector field or score function, samples generated by solving the differential equation will have the desired marginal distributions at each time t.
2.6.1 Flow Matching (using differential equations)
Flow matching trains continuous normalizing flows by regressing onto conditional vector fields rather than computing expensive maximum likelihood objectives. This approach constructs simple conditional probability paths p_t(x|x_1) that interpolate between noise p_0 and individual data samples x_1 \sim p_{\text{data}}.
A common choice is the Gaussian conditional path:
p_t(x|x_1) = \mathcal{N}(x; \mu_t(x_1), \sigma_t^2 I)
where \mu_t interpolates from noise to data: \mu_0 = 0, \mu_1 = x_1. The conditional vector field for this path is:
u_t(x|x_1) = \frac{d\mu_t(x_1)}{dt} + \frac{1}{\sigma_t}\frac{d\sigma_t}{dt}(x - \mu_t(x_1))
For the simple linear interpolation \mu_t(x_1) = tx_1 and \sigma_t = 1-t, this becomes:
u_t(x|x_1) = \frac{x_1 - x}{1-t}
The marginal vector field that governs the evolution of the entire distribution is:
u_t(x) = \mathbb{E}_{x_1 \sim p_{\text{data}}}[u_t(x|x_1) | x_t = x]
Training Algorithm:
The key insight is that the tractable conditional flow matching loss \mathcal{L}_{\text{CFM}}(\theta) defined below has the same gradient as the intractable marginal loss, so minimizing it trains the marginal field:
\mathcal{L}_{\text{CFM}}(\theta) = \mathbb{E}_{t \sim \mathcal{U}[0,1], x_1 \sim p_{\text{data}}, x_0 \sim p_0}\left[\|u_\theta(t, x_t) - u_t(x_t|x_1)\|^2\right]
where x_t is sampled from the conditional path p_t(\cdot|x_1). This loss is tractable because we can explicitly compute u_t(x_t|x_1) for simple conditional paths.
The training algorithm is remarkably simple:
Flow Matching Training Algorithm
- Sample a data point x_1 \sim p_{\text{data}}
- Sample noise x_0 \sim \mathcal{N}(0, I)
- Sample time t \sim \mathcal{U}[0,1]
- Compute x_t from the conditional path
- Compute loss \|u_\theta(t, x_t) - u_t(x_t|x_1)\|^2
- Update parameters via gradient descent
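For the linear path \mu_t(x_1) = t x_1, \sigma_t = 1-t, the target simplifies nicely: since x_t = (1-t)x_0 + t x_1, the conditional field (x_1 - x_t)/(1-t) equals x_1 - x_0 for every t. One training draw in NumPy (a sketch; the regression network itself is omitted):

```python
import numpy as np

rng = np.random.default_rng(0)

x1 = rng.normal(2.0, 0.5)       # data sample from p_data (toy 1-D example)
x0 = rng.normal()               # noise sample from p_0 = N(0, 1)
t = rng.uniform()               # t ~ U[0, 1]

x_t = (1.0 - t) * x0 + t * x1           # point on the conditional path
target = (x1 - x_t) / (1.0 - t)         # conditional field u_t(x_t | x_1)
# A network u_theta(t, x_t) would now be regressed onto `target` via the
# squared loss || u_theta(t, x_t) - target ||^2.
```

Here the target equals x_1 - x_0 exactly: a constant-velocity field pointing from the noise sample to the data sample, which is why these paths are called straight.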
Sampling from Flow Models:
Sampling requires solving the learned ODE:
\frac{dx_t}{dt} = u_\theta(t, x_t), \quad x_0 \sim \mathcal{N}(0, I)
Using the Euler method with step size h:
x_{t+h} = x_t + h \cdot u_\theta(t, x_t)
More sophisticated ODE solvers (Runge-Kutta methods) provide better accuracy-efficiency tradeoffs.
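Integrating the exact conditional field for a single data point illustrates the sampler: for u_t(x) = (x_1 - x)/(1-t), Euler steps land on x_1 at t = 1. A NumPy sketch (x_1 = 3.0 is an arbitrary toy target, not a learned field):

```python
import numpy as np

x1 = 3.0                 # pretend the "learned" field targets this single point
n_steps = 1_000
h = 1.0 / n_steps

x = 0.5                  # a draw from p_0 would go here
for k in range(n_steps):
    t = k * h
    u = (x1 - x) / (1.0 - t)    # exact conditional field for the linear path
    x = x + h * u               # Euler update x_{t+h} = x_t + h * u
```

The error to x_1 contracts by a factor (1-t-h)/(1-t) per step and vanishes at the final step; with a learned, averaged field the trajectory is only approximately straight, which is where higher-order solvers pay off.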
2.6.2 Diffusion models (using stochastic differential equations)
2.6.2.1 Score-Based Formulation
Diffusion models learn to reverse a forward noising process. The forward SDE gradually adds noise:
dx_t = f_t x_t dt + g_t dW_t
The reverse-time SDE that transforms noise back to data is:
dx_t = \left[f_t x_t - g_t^2 \nabla \log p_t(x_t)\right]dt + g_t d\bar{W}_t
where \nabla \log p_t(x) is the score function and \bar{W}_t is a reverse-time Brownian motion.
2.6.2.2 Denoising Score Matching
Similar to flow matching, diffusion models employ conditional score functions. For the variance-preserving (VP) SDE with f_t = -\frac{1}{2}\beta_t and g_t = \sqrt{\beta_t}, the conditional distribution is Gaussian:
p_t(x|x_1) = \mathcal{N}(x; \alpha_t x_1, \sigma_t^2 I)
where \alpha_t = e^{-\frac{1}{2}\int_0^t \beta_s ds} and \sigma_t^2 = 1 - \alpha_t^2. The conditional score is:
\nabla \log p_t(x|x_1) = -\frac{x - \alpha_t x_1}{\sigma_t^2}
Training Algorithm:
The denoising score matching objective is:
\mathcal{L}_{\text{DSM}}(\theta) = \mathbb{E}_{t, x_1, x_0}\left[\lambda_t\|s_\theta(t, x_t) - \nabla \log p_t(x_t|x_1)\|^2\right]
where \lambda_t is a time-dependent weighting. In practice, predicting the noise \epsilon rather than the score is common:
\mathcal{L}_{\epsilon}(\theta) = \mathbb{E}_{t, x_1, \epsilon}\left[\|\epsilon_\theta(t, x_t) - \epsilon\|^2\right]
where x_t = \alpha_t x_1 + \sigma_t \epsilon with \epsilon \sim \mathcal{N}(0, I).
Training procedure:
Score Matching Training Algorithm
- Sample x_1 \sim p_{\text{data}}, \epsilon \sim \mathcal{N}(0, I), t \sim \mathcal{U}[0,1]
- Compute x_t = \alpha_t x_1 + \sigma_t \epsilon
- Predict noise: \hat{\epsilon} = \epsilon_\theta(t, x_t)
- Compute loss and update: \|\hat{\epsilon} - \epsilon\|^2
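One such training draw can be computed in closed form for the VP SDE. With a linear schedule \beta_s = \beta_{\min} + (\beta_{\max} - \beta_{\min})s (the endpoint values below are my own arbitrary choice), \alpha_t has an explicit integral, and the conditional score reduces to -\epsilon/\sigma_t. A NumPy sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
beta_min, beta_max = 0.1, 20.0

t = 0.5
# alpha_t = exp(-0.5 * int_0^t beta_s ds) with beta_s = beta_min + (beta_max - beta_min) s
integral = beta_min * t + 0.5 * (beta_max - beta_min) * t**2
alpha_t = np.exp(-0.5 * integral)
sigma_t = np.sqrt(1.0 - alpha_t**2)

x1 = rng.normal()                       # data sample
eps = rng.normal()                      # noise sample
x_t = alpha_t * x1 + sigma_t * eps      # forward-corrupted point
cond_score = -(x_t - alpha_t * x1) / sigma_t**2   # regression target for s_theta(t, x_t)
```

Since x_t - \alpha_t x_1 = \sigma_t \epsilon, the target equals -\epsilon/\sigma_t, which is exactly why predicting the noise \epsilon and predicting the score are interchangeable up to the factor -1/\sigma_t.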
Sampling from Diffusion Models:
The Euler-Maruyama method discretizes the reverse SDE, stepping backward in time with step size h:

x_{t-h} = x_t - h\left[f_t x_t - g_t^2 s_\theta(t, x_t)\right] + g_t\sqrt{h}\xi_t

where \xi_t \sim \mathcal{N}(0, I). Deterministic sampling using the probability flow ODE is also possible:

x_{t-h} = x_t - h\left[f_t x_t - \frac{1}{2}g_t^2 s_\theta(t, x_t)\right]
2.7 Tabular Data Adaptations
Feature Preprocessing
Tabular data requires careful preprocessing and modifications in the architectures. See, for instance, Kotelnikov et al. (2023) for the TabDDPM model:
- Numerical features: Quantile transformation to approximate Gaussian distributions
- Categorical features: One-hot encoding or learned embeddings
- Mixed representations: Concatenated feature vectors with appropriate normalization
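The first two bullets can be sketched with NumPy and the standard library alone. The rank-based Gaussianization below approximates what a quantile transformer does; the function names are mine, not from any of the surveyed packages:

```python
import numpy as np
from statistics import NormalDist

def quantile_gaussianize(x):
    # Map empirical quantiles of x to standard-normal quantiles (rank-based)
    ranks = np.argsort(np.argsort(x))                 # ranks 0 .. n-1
    u = (ranks + 0.5) / len(x)                        # mid-ranks in (0, 1)
    return np.array([NormalDist().inv_cdf(p) for p in u])

def one_hot(categories):
    # Encode a categorical column as 0/1 indicator columns
    labels, inverse = np.unique(categories, return_inverse=True)
    return np.eye(len(labels))[inverse], labels

rng = np.random.default_rng(0)
z = quantile_gaussianize(rng.exponential(size=1_000))   # skewed -> approx Gaussian
enc, labels = one_hot(np.array(["own", "rent", "own", "other"]))
```

By construction the transformed column has approximately zero mean and unit variance regardless of the input's shape, which stabilizes training of Gaussian-noise-based models.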
The architecture, training procedure, and sampling procedure of each model can be found in the corresponding papers. Li et al. (2025) give a comprehensive survey of the various models.
Comparative Analysis
DDPMs offer:
- Well-established theory and training stability
- Explicit noise scheduling control
- Strong performance on diverse data types

Flow matching provides:
- Faster sampling via straight trajectories
- Simulation-free training
- Better numerical stability
Recent work suggests flow matching achieves comparable quality with 2-5× faster inference (Lipman et al. 2023).
2.8 Models of synthetic datasets studied in this paper
I use the following 13 models of various types, for which Python implementations are available. I train each model on the HRS dataset and generate a synthetic dataset from it to assess its performance.
GAN-based models:
- CTGAN (Xu et al. 2019)
- CTABGAN (Z. Zhao et al. 2021)
- CopulaGAN (Patki, Wedge, and Veeramachaneni 2021)
VAE-based models:
- TVAE (Xu et al. 2019)
- TTVAE (A. X. Wang and Nguyen 2025)
Diffusion models:
- TabDDPM (Kotelnikov et al. 2023)
- CoDi (Lee, Kim, and Park 2023)
- Tabsyn (Zhang et al. 2024)
- TabDiff (J. Shi et al. 2024)
- CDTD (Mueller, Gruber, and Fok 2023)
Other types of models:
- SMOTE (Chawla et al. 2002)
- ARF (Watson et al. 2022)
- TabularARGN (Tiwald et al. 2025)
3 Metrics for utility, privacy and policy sensitivity
Synthetic data generation practitioners have traditionally used anonymization techniques such as k-anonymity, which ensures each record is indistinguishable from at least k-1 others with respect to quasi-identifiers (e.g., age, ZIP code), along with related notions such as l-diversity. However, these methods have been repeatedly shown to be insufficient, as they are vulnerable to re-identification and linkage attacks, especially in high-dimensional datasets (Sweeney 2002; Narayanan and Shmatikov 2008). Furthermore, they often degrade the underlying statistical properties of the data to the point where it loses its utility for complex ML tasks. Such metrics are therefore not used in this study. The metrics used in this study are described below.
3.1 Utility
3.1.1 KL Divergence metric
The Kullback-Leibler (KL) divergence, KL(P||Q), measures the pseudo-distance from a “true” distribution P to an “approximating” distribution Q, defined as
- For discrete distributions: KL(P||Q) = \sum_x P(x) \log\left(\frac{P(x)}{Q(x)}\right)
- For continuous distributions: KL(P||Q) = \int p(x) \log\left(\frac{p(x)}{q(x)}\right) dx
For KL(P||Q) to be finite, P must be absolutely continuous with respect to Q. This means that anywhere P has a non-zero probability, Q must also have a non-zero probability.
Estimating KL divergence between mixed continuous and discrete distributions is theoretically complex. Standard formulas often fail because mixed distributions lack absolute continuity; specifically, comparing a discrete probability mass in P against a continuous density in Q yields infinite divergence.
Key estimation methods are:
Discretization: Bins continuous variables to create fully discrete PMFs. This is computationally simple but sensitive to bin sizing and the “curse of dimensionality.”
Monte Carlo: Approximates the divergence as an expectation via sample averaging, feasible only when both underlying density functions can be evaluated. More specifically, one approximates this expectation by drawing many samples x_1, \dots, x_N from the distribution P and computing the sample mean: \hat{D}(P||Q) \approx \frac{1}{N} \sum_{i=1}^N \log\left(\frac{p(x_i)}{q(x_i)}\right) This is not possible in the present context, as the density of the real data P is unknown.
k-NN Estimation: A non-parametric approach using sample distances. It bypasses density estimation entirely, effectively handling high-dimensional mixed data. This method, proposed by Wang, Kulkarni, and Verdú (Q. Wang, Kulkarni, and Verdú 2009; see also Perez-Cruz 2008), relies on the distances between samples. For each sample x_i from distribution P, it finds the distance to its k-th nearest neighbor among the remaining samples from P (call this distance \rho(i)) and the distance to its k-th nearest neighbor among the samples from Q (call this \nu(i)). The estimator uses the ratio of these distances. A simplified version of the estimator is: \hat{D}(P||Q) \approx \frac{d}{N} \sum_{i=1}^N \log\left(\frac{\nu(i)}{\rho(i)}\right) + \log\left(\frac{M}{N-1}\right), where d is the dimension of the data, N is the number of samples from P, and M is the number of samples from Q.
It converges to the true KL divergence as the number of samples increases and works in high-dimensional spaces where binning fails. It gracefully handles the mix of continuous and discrete data by using a proper distance metric (e.g., Gower distance) that can accommodate both data types.
It is, however, computationally more expensive than binning and requires careful selection of the distance metric and the parameter k.
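As an illustration, here is a minimal numpy version of the simplified estimator above. Euclidean distance on purely numeric samples is an assumption of the sketch; the mixed-data setting of this paper would substitute Gower distance.

```python
import numpy as np

def knn_kl(p_samples, q_samples, k=1):
    # Simplified k-NN estimator of D(P||Q):
    # rho(i): distance from x_i to its k-th nearest neighbor within P
    # nu(i):  distance from x_i to its k-th nearest neighbor in Q
    n, d = p_samples.shape
    m = q_samples.shape[0]
    total = 0.0
    for x in p_samples:
        rho = np.sort(np.linalg.norm(p_samples - x, axis=1))[k]   # index k skips self
        nu = np.sort(np.linalg.norm(q_samples - x, axis=1))[k - 1]
        total += np.log(nu / rho)
    return (d / n) * total + np.log(m / (n - 1))

rng = np.random.default_rng(0)
p = rng.standard_normal((800, 2))
kl_same = knn_kl(p, rng.standard_normal((800, 2)))        # near 0 for identical distributions
kl_far = knn_kl(p, rng.standard_normal((800, 2)) + 2.0)   # clearly positive for a shifted Q
```

The O(N^2) distance computation is the "computationally more expensive" cost mentioned above; tree-based neighbor searches reduce it in practice.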
3.1.2 Propensity Mean Squared Error (pMSE)
The pMSE (Propensity Mean Squared Error) utility metric evaluates the fidelity (i.e., the statistical similarity) of a synthetic dataset relative to a real dataset. The core idea is to test how easily a machine learning model can tell a real row from a synthetic row. To compute it, one trains a classifier (such as logistic regression or a random forest) to predict the "propensity" (i.e., probability) that a given row is synthetic. If the synthetic data is statistically identical to the real data, the classifier should be completely "confused": for any given row, real or synthetic, its best guess is 0.5 (a 50/50 chance). The pMSE metric measures how far the classifier's predictions are from this ideal value of 0.5.
The formula is: pMSE = \frac{1}{N} \sum_{i=1}^{N} (p_i - 0.5)^2 Where, N is the total number of rows (real + synthetic) and p_i is the predicted probability (propensity score) that the i-th row is synthetic.
A pMSE score near 0.0 is IDEAL. This means the classifier’s predicted probabilities are all clustered around 0.5 (p_i \approx 0.5). The model has no idea which data is real and which is fake. This indicates high fidelity synthetic data. A pMSE score near 0.25 is POOR. This is the worst possible score. It means the classifier can perfectly separate the data. It predicts p_i \approx 1 for all synthetic rows (since (1 - 0.5)^2 = 0.25) and p_i \approx 0 for all real rows (since (0 - 0.5)^2 = 0.25). This indicates low fidelity data that is “obviously fake.”
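A minimal sketch of the computation, using a hand-rolled logistic-regression propensity model fit by gradient descent in place of a library classifier (toy Gaussian data is assumed for illustration):

```python
import numpy as np

def pmse(real, synth, iters=500, lr=0.1):
    # Train a logistic-regression propensity model to predict
    # "is this row synthetic?", then measure mean (p_i - 0.5)^2.
    X = np.vstack([real, synth])
    X = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-8)    # standardize features
    X = np.hstack([X, np.ones((len(X), 1))])             # bias column
    y = np.concatenate([np.zeros(len(real)), np.ones(len(synth))])
    w = np.zeros(X.shape[1])
    for _ in range(iters):                               # plain gradient descent
        p = 1.0 / (1.0 + np.exp(-X @ w))
        w -= lr * (X.T @ (p - y)) / len(y)
    p = 1.0 / (1.0 + np.exp(-X @ w))
    return float(np.mean((p - 0.5) ** 2))

rng = np.random.default_rng(0)
real = rng.standard_normal((500, 3))
pmse_good = pmse(real, rng.standard_normal((500, 3)))        # near 0: high fidelity
pmse_poor = pmse(real, rng.standard_normal((500, 3)) + 2.0)  # toward 0.25: "obviously fake"
```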
3.1.3 Nearest Neighbor Adversarial Accuracy (NNAA)
The utility metric Nearest Neighbor Adversarial Accuracy (NNAA) evaluates how distinguishable synthetic data is from real data by checking whether each record’s nearest neighbor (in feature space) comes from the same dataset or the opposite one. If synthetic and real data are well‑mixed, the classifier accuracy will be close to 50%. If they are easily separable, accuracy will be much higher, indicating poor synthetic realism.
It is computed as follows. Combine the real dataset R and synthetic dataset S and label each record: 0 = real, 1 = synthetic. For each record, find its nearest neighbor in the combined data (excluding itself) and predict the record's label as the label of that neighbor (a 1-nearest-neighbor classifier). The classification accuracy across all records gives NNAA:
\text{NNAA} = \frac{\# \text{correctly predicted labels}}{\text{total records}} General guidelines: NNAA ≈ 0.5 means synthetic and real data are indistinguishable (good utility); NNAA > 0.7 indicates the datasets are too different and the synthetic data lacks realism; NNAA < 0.3 suggests mode collapse or overfitting, i.e., the synthetic data may be too close to the real data.
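A minimal numpy sketch of this computation (Euclidean distance and toy Gaussian data are assumptions for illustration):

```python
import numpy as np

def nnaa(real, synth):
    # Leave-one-out 1-NN label agreement on the pooled data:
    # 0 = real, 1 = synthetic
    X = np.vstack([real, synth])
    y = np.concatenate([np.zeros(len(real)), np.ones(len(synth))])
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)          # exclude each record itself
    return float(np.mean(y[D.argmin(axis=1)] == y))

rng = np.random.default_rng(0)
real = rng.standard_normal((300, 3))
nnaa_mixed = nnaa(real, rng.standard_normal((300, 3)))       # ~0.5: well-mixed
nnaa_far = nnaa(real, rng.standard_normal((300, 3)) + 5.0)   # ~1.0: easily separable
```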
3.1.4 Kolmogorov-Smirnov statistic and Hellinger distance
For a single variable, the Kolmogorov-Smirnov statistic is defined as, D_{KS} = \sup_x |F_{real}(x) - F_{synth}(x)| where F is the empirical CDF. Lower values indicate better similarity. The scores for individual columns are aggregated to arrive at the overall metric.
The Hellinger distance quantifies the difference between two probability distributions P and Q, defined for probability mass functions as
H(P, Q) = \sqrt{\frac{1}{2} \sum_{i=1}^{n} \left(\sqrt{p_i} - \sqrt{q_i}\right)^2} where p_i and q_i are the probabilities of the i-th outcome in distributions P and Q. Its range is 0 \leq H(P, Q) \leq 1: a distance of 0 means identical distributions, while 1 means completely disjoint distributions. A small value (close to 0) indicates the synthetic data is statistically faithful.
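Both per-column statistics fit in a few lines of numpy; the toy inputs below are assumptions for illustration:

```python
import numpy as np

def ks_stat(x_real, x_synth):
    # Two-sample Kolmogorov-Smirnov statistic: maximum gap
    # between the two empirical CDFs, evaluated at all sample points
    grid = np.concatenate([x_real, x_synth])
    F_r = np.searchsorted(np.sort(x_real), grid, side="right") / len(x_real)
    F_s = np.searchsorted(np.sort(x_synth), grid, side="right") / len(x_synth)
    return float(np.max(np.abs(F_r - F_s)))

def hellinger(p, q):
    # Hellinger distance between two discrete distributions
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)))

x = np.arange(100, dtype=float)
ks_zero = ks_stat(x, x)                        # identical samples -> 0
ks_one = ks_stat(x, x + 1000.0)                # disjoint supports -> 1
h_zero = hellinger([0.5, 0.5], [0.5, 0.5])     # identical -> 0
h_one = hellinger([1.0, 0.0], [0.0, 1.0])      # disjoint -> 1
```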
3.1.5 Correlation Difference
The Correlation Difference metric is computed as follows. \Delta_{corr} = \|\text{Corr}(X_{real}) - \text{Corr}(X_{synth})\|_F using Frobenius norm.
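This metric is a one-liner in numpy; the injected-correlation example below is an assumption for illustration:

```python
import numpy as np

def corr_diff(real, synth):
    # Frobenius norm of the difference of the feature correlation matrices
    return float(np.linalg.norm(
        np.corrcoef(real, rowvar=False) - np.corrcoef(synth, rowvar=False),
        ord="fro"))

rng = np.random.default_rng(0)
x = rng.standard_normal((1000, 4))
y = x.copy()
y[:, 1] = x[:, 0] + 0.1 * rng.standard_normal(1000)  # inject a strong correlation
diff_zero = corr_diff(x, x)   # identical data -> 0
diff_pos = corr_diff(x, y)    # distorted correlation structure -> clearly > 0
```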
3.2 Privacy metrics
3.2.1 Hit rate
The hit rate metric in synthetic data evaluation measures the proportion of synthetic records that exactly (or nearly) replicate real records. A high hit rate means the generator memorized and copied training data, which undermines privacy and generalization.
Given a real dataset R and a synthetic dataset S, the hit rate is: \text{hit rate} = \frac{\#\{s \in S : s \in R\}}{|S|}
A hit rate close to 0 means the synthetic data is not memorizing individuals; a hit rate close to 1 means the synthetic data is leaking real records.
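For exact matches, the computation reduces to a set-membership test (the toy rows below are assumptions for illustration):

```python
import numpy as np

def hit_rate(real, synth):
    # Fraction of synthetic rows that exactly match some real row
    real_rows = {tuple(row) for row in real}
    return float(np.mean([tuple(row) in real_rows for row in synth]))

real = np.array([[1, 2], [3, 4], [5, 6]])
synth = np.array([[1, 2], [9, 9]])   # one exact copy, one novel row
rate = hit_rate(real, synth)         # -> 0.5
```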
3.2.2 Epsilon hit rate
The epsilon hit rate metric measures the proportion of synthetic records that are “too close” to real records, where “too close” is defined by a user‑chosen distance threshold \epsilon. It is a privacy risk metric: a higher value means more synthetic records are nearly identical to real ones, increasing re‑identification risk.
It is defined as
\text{epsilon hit rate} = \frac{1}{|S|} \sum_{x \in S} \mathbf{1}\!\left(\min_{y \in R} d(x,y) \leq \varepsilon \right), where R is the set of real records, S is the set of synthetic records, and d(x,y) is the distance. A high epsilon hit rate means many synthetic records lie within \varepsilon of a real record, i.e., higher privacy risk; a rate close to 0 means lower privacy risk.
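A minimal numpy sketch of the formula (Euclidean distance and the threshold value are assumptions for illustration; Gower distance would be used for mixed-type data):

```python
import numpy as np

def eps_hit_rate(real, synth, eps):
    # Fraction of synthetic records within eps of their closest real record
    d_min = np.array([np.linalg.norm(real - s, axis=1).min() for s in synth])
    return float(np.mean(d_min <= eps))

rng = np.random.default_rng(0)
real = rng.standard_normal((200, 3))
leaky = eps_hit_rate(real, real + 1e-6, eps=0.01)   # near-copies -> 1.0
safe = eps_hit_rate(real, real + 10.0, eps=0.01)    # far from real records -> 0.0
```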
3.2.3 Median Distance to Closest Record (median_DCR)
The median_DCR metric estimates how close synthetic records are to real records, helping assess the risk of re-identification. When the datasets contain categorical variables, Euclidean distance is not meaningful; Gower's distance is used instead.
For each synthetic record, compute the distance to every real record. Identify the closest real record, i.e., minimum distance of the synthetic record to all the real records and then take the median of the shortest distances of the synthetic records. Formally,
\text{median\_DCR} = \text{median}_{x \in S}\left( \min_{y \in R} d(x,y) \right), where R is the set of real records, S is the set of synthetic records, and d(x,y) is the distance. A high median_DCR means synthetic records are farther from the real ones, i.e., lower privacy risk; a low median_DCR means synthetic records are very close to the real ones, i.e., higher privacy risk.
This metric is often reported alongside hitting rate and epsilon identifiability risk to give a fuller privacy picture.
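A minimal sketch with a simple Gower distance for mixed numeric/categorical records (range-normalized absolute difference for numeric features, 0/1 mismatch for categorical features, averaged over features; the toy data is an assumption for illustration):

```python
import numpy as np

def gower_to_real(s_num, s_cat, real_num, real_cat, ranges):
    # Gower distance from one synthetic record to every real record
    num_part = np.abs(real_num - s_num) / ranges          # (n, p_num)
    cat_part = (real_cat != s_cat).astype(float)          # (n, p_cat)
    return np.hstack([num_part, cat_part]).mean(axis=1)

def median_dcr(real_num, real_cat, synth_num, synth_cat):
    # Median over synthetic records of the Gower distance to the closest real record
    ranges = real_num.max(axis=0) - real_num.min(axis=0)
    ranges[ranges == 0] = 1.0                             # guard constant columns
    d_min = [gower_to_real(sn, sc, real_num, real_cat, ranges).min()
             for sn, sc in zip(synth_num, synth_cat)]
    return float(np.median(d_min))

rng = np.random.default_rng(0)
real_num = rng.uniform(0.0, 1.0, (100, 2))
real_cat = rng.integers(0, 3, (100, 2))
dcr_zero = median_dcr(real_num, real_cat, real_num, real_cat)           # copies -> 0
dcr_pos = median_dcr(real_num, real_cat, real_num + 0.5, real_cat + 10)  # distinct -> large
```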
3.3 Policy sensitivity metric – Mahalanobis Distance
Evidence-based policy analysis is sometimes based on a multivariate mean of variables in a tabular dataset, or on parameter estimates of an econometric model with a parameter vector, say \theta. If the estimates of one or more policy-relevant parameters differ statistically significantly, or, worse, change signs when estimated on the original data versus the synthetic data, the synthetic data will produce quite different policy inferences than the real data. I use the Mahalanobis distance D to measure the distance between \hat\theta_{real} and \hat\theta_{synth}. Following the arguments in Johnson and Wichern (2013), Chapter 5, and Rencher and Christensen (2012), Chapter 6, and noting how the Mahalanobis D^2 is related to Hotelling's T^2 statistic for two independent samples and its conversion to the F statistic for p-value calculation, one can test whether the synthetic dataset will produce significant policy distortions.
Under the null hypothesis, H_0: \theta_{real} = \theta_{synth}, the squared Mahalanobis distance D^2 is asymptotically distributed as \chi^2_p (p is the dimension of the parameter vector \theta). D^2 = (\hat\theta_{real} - \hat\theta_{synth})' \hat\Sigma_{pooled}^{-1} (\hat\theta_{real} - \hat\theta_{synth}), where \hat\Sigma_{pooled} = \frac{(n_1 - 1)\hat\Sigma_{real} + (n_2 - 1)\hat\Sigma_{synth}}{n_1 + n_2 - 2}, where n_1 and n_2 are the sizes of real and synthetic datasets respectively.
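The statistic itself is a short computation (the identity-covariance example is an assumption for illustration; in practice the covariance matrices come from the fitted econometric models, and the resulting D^2 would be compared against a \chi^2_p critical value for the p-value):

```python
import numpy as np

def mahalanobis_d2(theta_real, theta_synth, cov_real, cov_synth, n1, n2):
    # D^2 = diff' * pooled_cov^{-1} * diff, with the pooled covariance
    # Sigma_pooled = ((n1-1)*Sigma_real + (n2-1)*Sigma_synth) / (n1+n2-2)
    diff = theta_real - theta_synth
    pooled = ((n1 - 1) * cov_real + (n2 - 1) * cov_synth) / (n1 + n2 - 2)
    return float(diff @ np.linalg.solve(pooled, diff))

# With identity covariances, D^2 reduces to the squared Euclidean norm:
d2 = mahalanobis_d2(np.array([1.0, 0.0]), np.array([0.0, 0.0]),
                    np.eye(2), np.eye(2), n1=100, n2=100)   # -> 1.0
```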
4 The Dataset and the construction of variables
I use the Health and Retirement Study (HRS) dataset for the empirical analysis. A lot has been written about the HRS dataset, including its structure, purpose, and various modules collecting data on genetics, biomarkers, cognitive functioning, and more; see for instance (Juster and Suzman 1995; Sonnega et al. 2014; Fisher and Ryan 2017).
For definition of variables, see Raut (2024a).
The demographic variables White and Female have the standard definition. The variable College+ is a binary variable taking value 1 if the respondent has education level of completed college and above (does not include some college), i.e., has a college degree and more and taking value 0 otherwise.
CES-D: I use the score on the Center for Epidemiologic Studies Depression (CES-D) measure in various waves that is created by the RAND release of the HRS data. RAND creates the score as the sum of five negative indicators minus two positive indicators. “The negative indicators measure whether the Respondent experienced the following sentiments all or most of the time: depression, everything is an effort, sleep is restless, felt alone, felt sad, and could not get going. The positive indicators measure whether the Respondent felt happy and enjoyed life, all or most of the time.” I standardize this score by subtracting 4 from the RAND measure and dividing by 8. Wave 1 had a different set of questions, so the score was not reported in the RAND HRS; I impute it as the first non-missing future CES-D score. In the paper, I refer to this variable as CES-D. Steffick (2000) discusses its validity as a measure of stress and depression.
Cognitive scores: This variable is a measure of cognitive functioning. RAND combined the original HRS scores on cognitive function, which include “immediate and delayed word recall, the serial 7s test, counting backwards, naming tasks (e.g., date-naming), and vocabulary questions”. Three of the original HRS cognition summary indices—two indices of scores on 20- and 40-word recall and a third, the mental status index, which is the sum of scores “from counting, naming, and vocabulary tasks”—are added together to create this variable. Again, due to incompatibility with the rest of the waves, the score in the first wave was not reported in the RAND HRS; I impute it by taking the first future non-missing value of this variable.
HIGH BMI: Body mass index (BMI) is the standard measure used in the medical field, and HRS collected data on it for all individuals. If it is missing in 1992, I impute it with the first future non-missing value. Following the criterion in the literature, I create the binary variable HIGH BMI, taking value 1 if BMI > 25 and value 0 otherwise.
Now I describe the construction of the behavioral variables.
Smoking: This binary variable takes value 1 if the respondent answered yes to the ever-smoked question in any of the waves, as reported in the RAND HRS data; that value is then repeated for all years.
Exercising: The RAND HRS has data on whether the respondent did vigorous exercise three or more days per week. I set this variable to 1 in every time period if the respondent did vigorous exercise three or more days per week in any of the waves; that value is then assigned to all years.
Childhood SES: This is a binary variable measuring childhood SES, constructed using an IRT procedure as follows. From the HRS data I created four binary variables from the original categorical data on whether the family moved for financial reasons, whether the family usually received financial help during childhood, whether the father was unemployed during childhood, and the father’s usual occupation during childhood (0 = disadvantaged, 1 = advantaged), plus three three-category variables: one for each parent’s education level (0 = high school dropout, 1 = some college, 2 = completed college and higher) and a third for the family’s financial situation (0 = poor, 1 = average, 2 = well-off). I used these seven variables as items in the IRT procedure to compute a continuous score estimate, and I define Childhood SES = 1 if the score is above the mean plus one standard deviation of the scores, and 0 otherwise.
Childhood Health is a binary measure of childhood health constructed from the self-reported qualitative childhood health variable in HRS. I define Childhood Health = 1 if the respondent reported very good or excellent childhood health, and 0 otherwise.
5 Benchmarking models with privacy and utility metrics
5.1 My own implemented metrics
The models sorted (best first) by the privacy metrics in Table 1 are:
- nnMDCR (nearest neighbor median distance) — CopulaGAN, CTABGAN, TabDiff, CTGAN, ARF, TTVAE, TabularARGN, Tabsyn, CoDi, TabDDPM, TVAE, CDTD, SMOTE
- grMDCR (median Gower distance to closest record) — CTGAN, TVAE, SMOTE, CoDi, ARF, Tabsyn, TabDDPM, CDTD, CTABGAN, CopulaGAN, TTVAE, TabDiff, TabularARGN
5.2 Syntheval metrics
In this section, I present a selected few metrics from Lautrup et al. (2024), computed using their code available on GitHub. I have renamed some of the metrics to match the names used in my own implementations.
| synth. data | pMSE | NNAA | K-S statistic | Hellinger distance | Utility rank |
|---|---|---|---|---|---|
| Ref | 0.0001 | 0.0000 | 0.0000 | 0.0000 | 7.0000 |
| ARF | 0.0007 | 0.5682 | 0.0648 | 0.0415 | 5.0758 |
| Tabsyn | 0.0058 | 0.5268 | 0.0426 | 0.0257 | 4.7714 |
| TabularARGN | 0.0010 | 0.5108 | 0.0172 | 0.0112 | 5.0625 |
| TVAE | 0.0140 | 0.8660 | 0.1137 | 0.0653 | 3.7410 |
| CoDi | 0.0179 | 0.6632 | 0.1053 | 0.0500 | 3.8917 |
| SMOTE | 0.0047 | 0.8024 | 0.1020 | 0.0725 | 4.2499 |
| TabDDPM | 0.0015 | 0.4451 | 0.0211 | 0.0152 | 5.3041 |
| CDTD | 0.0004 | 0.4441 | 0.0235 | 0.0093 | 5.6741 |
| TabDiff | 0.0539 | 1.0000 | 0.0931 | 0.0380 | 2.7515 |
| TTVAE | 0.0581 | 1.0000 | 0.0985 | 0.0378 | 2.4446 |
| CTABGAN | 0.0481 | 1.0000 | 0.1646 | 0.0711 | 1.8897 |
| CTGAN | 0.0758 | 0.8672 | 0.2054 | 0.1128 | 0.9530 |
| CopulaGAN | 0.2421 | 1.0000 | 0.2200 | 0.0748 | 0.3369 |
| Notes: K-S stands for Kolmogorov-Smirnov. | |||||
The models sorted (best first) by utility metrics in Table 2 are:
- pMSE — CDTD, ARF, TabularARGN, TabDDPM, SMOTE, Tabsyn, TVAE, CoDi, CTABGAN, TabDiff, TTVAE, CTGAN, CopulaGAN
- Kolmogorov-Smirnov statistic — TabularARGN, TabDDPM, CDTD, Tabsyn, ARF, TabDiff, TTVAE, SMOTE, CoDi, TVAE, CTABGAN, CTGAN, CopulaGAN
- Hellinger distance — CDTD, TabularARGN, TabDDPM, Tabsyn, TTVAE, TabDiff, ARF, CoDi, TVAE, CTABGAN, SMOTE, CopulaGAN, CTGAN
| synth. data | median DCR | eps_privacy loss | eps. hit rate | Privacy rank | Composite score |
|---|---|---|---|---|---|
| Ref | 1.0000 | -0.3963 | 0.0000 | 1.6393 | 8.6393 |
| CTGAN | 3.6974 | -0.0017 | 0.0833 | 4.8434 | 5.7964 |
| TVAE | 3.5608 | 0.0274 | 0.1579 | 4.6448 | 8.3857 |
| CoDi | 3.4032 | 0.0480 | 0.2559 | 4.3646 | 8.2564 |
| ARF | 4.5667 | 0.0616 | 0.3083 | 4.2345 | 9.3103 |
| TTVAE | 18.1320 | 0.0000 | 0.0000 | 4.0000 | 6.4446 |
| CTABGAN | 17.8736 | 0.0000 | 0.0000 | 4.0000 | 5.8897 |
| TabDiff | 16.1454 | 0.0000 | 0.0000 | 4.0000 | 6.7515 |
| CopulaGAN | 15.6249 | 0.0000 | 0.0000 | 4.0000 | 4.3369 |
| SMOTE | 3.1109 | 0.2064 | 0.3650 | 3.7060 | 7.9558 |
| Tabsyn | 1.0791 | 0.0976 | 0.4608 | 3.6426 | 8.4141 |
| TabularARGN | 1.1498 | 0.1446 | 0.6023 | 3.3308 | 8.3932 |
| TabDDPM | 0.6689 | 0.2054 | 0.5883 | 2.6483 | 7.9525 |
| CDTD | 0.3529 | 0.2909 | 0.6405 | 1.6787 | 7.3529 |
Taking into account the trade-off between utility and privacy, Lautrup et al. (2024) suggest a weighting of their metrics to arrive at a composite rank metric. Table 3 reports the value of this metric for all 13 models. According to this metric, the best five (in descending order) are ARF, Tabsyn, TabularARGN, TVAE, and CoDi.
5.3 Visual comparisons
I plot the performance of three synthetic datasets against the real dataset.
I also plot the performance of TabDDPM, a commonly used model in the diffusion class.
6 Policy sensitivity with econometric policy models
I consider two types of policy-relevant econometric models. One set studies the importance of childhood factors in the determination of childhood health, education, and mid-age health (i.e., healthy or not in mid-ages). The other set studies the effect of childhood factors and health-related behaviors on the incidence of various chronic diseases at mid ages.
I examine the sensitivity of the parameter estimates to substituting synthetic data for the real data, comparing the statistically significant parameters and applying the proposed policy sensitivity metric, the Mahalanobis distance. I highlight the models that show no statistically significant policy sensitivity. While for each set of models I report the Mahalanobis distance metrics for all the econometric models across all 13 synthetic datasets, I only present the econometric parameter estimates for two synthetic datasets: one generated by ARF, which most utility metrics find to produce good utility, and the other generated by CTGAN, a widely used method in the literature that is also found above to score well on the privacy metrics.
6.1 Econometric models of childhood development
Childhood health status (Childhood Health) is an important factor for later-life health outcomes and educational attainment. Childhood SES influences the stressors of the cellular environment and thus affects Childhood Health. Apart from Childhood SES, other factors such as nutrition and pediatric health care are also important.
Many childhood factors also determine College+, such as innate IQ, family background, preschool inputs, prenatal and postnatal stressors for brain development, childhood health status, and mother’s time input. See Heckman (2008) and Raut (2018) for recent literature on the biology of brain development and the role of socioeconomic factors, and Heckman and Raut (2016) for a Logit model of college completion in which an IQ measure, family background measured by parents’ education, preschool inputs, and non-cognitive skills play important roles. See Raut (2024b) for a similar model that uses the HRS dataset. The latter econometric model is used in this paper.
| Synth data | Childhood Health | College+ | Midage Health |
|---|---|---|---|
| ARF | 2.63 | 9.80 | 1.93 |
| Tabsyn | 105.83* | 114.88* | 16.53* |
| TabularARGN | 25.67* | 23.97* | 16.33* |
| TabDDPM | 13.72* | 25.03* | 2.64 |
| CoDi | 12.52* | 1.83 | 9.88 |
| TVAE | 438.90* | 47.07* | 254.69* |
| CDTD | 6.09 | 5.35 | 9.29 |
| TabDiff | 1.49 | 6.50 | 2.62 |
| CTABGAN | 2.02 | 8.29 | 22.86* |
| SMOTE | 72.28* | 80.76* | 8.67 |
| TTVAE | 7.50 | 16.62* | 5.06 |
| CTGAN | 184.14* | 322.33* | 461.12* |
| CopulaGAN | 91.11* | 212.01* | 177.45* |
| Note: A Mahalanobis distance statistic with a * means its p-value < 0.01, providing strong evidence against the null hypothesis that the parameter estimates from the real data and synthetic data are equal. | |||
Estimates of the Mahalanobis distance metric in Table 4 show that the synthetic data generators ARF, CDTD, and TabDiff are the best, producing statistically identical parameter estimates for all econometric policy models of this section; the next best generators are CoDi, CTABGAN, and TTVAE, producing statistically identical estimates for almost all models.
The parameter estimates for ARF and CTGAN are presented below to show the differences in statistically significant parameter estimates for the econometric models of this subsection.
| Variables | cHLTH:real | cHLTH:synth | College:real | College:synth | midage health: real | midage health: synth |
|---|---|---|---|---|---|---|
| Intercept | 1.089 *** | 1.065 *** | -2.048 *** | -1.908 *** | -1.130 *** | -1.000 *** |
| (0.065) | (0.064) | (0.097) | (0.094) | (0.078) | (0.076) | |
| White | 0.299 *** | 0.349 *** | 0.428 *** | 0.406 *** | 0.264 *** | 0.170 ** |
| (0.063) | (0.063) | (0.074) | (0.073) | (0.060) | (0.059) | |
| Female | -0.141 * | -0.125 * | -0.409 *** | -0.483 *** | -0.201 *** | -0.206 *** |
| (0.056) | (0.056) | (0.057) | (0.056) | (0.049) | (0.049) | |
| Childhood SES | 0.536 *** | 0.342 *** | 1.596 *** | 1.314 *** | 0.225 ** | 0.205 ** |
| (0.091) | (0.087) | (0.069) | (0.069) | (0.070) | (0.069) | |
| Childhood Health | 0.544 *** | 0.502 *** | 0.291 *** | 0.233 *** | ||
| (0.077) | (0.075) | (0.062) | (0.061) | |||
| College+ | 0.218 *** | 0.243 *** | ||||
| (0.059) | (0.058) | |||||
| N | 7775 | 7775 | 7775 | 7775 | 7775 | 7775 |
| R squared | 0.009 | 0.007 | 0.088 | 0.068 | 0.012 | 0.010 |
| Mahalanobis distance | 2.629 | 2.629 | 9.802 | 9.802 | 1.933 | 1.933 |
| p-value | 0.453 | 0.453 | 0.044 | 0.044 | 0.858 | 0.858 |
| Loglik | -4002.297 | -4008.616 | -3867.259 | -3941.574 | -4864.316 | -4888.015 |
| *** p < 0.001; ** p < 0.01; * p < 0.05. Standard errors are in parentheses. Low p-value (e.g., < 0.01) reported below the Mahalanobis distance provides strong evidence against the null hypothesis that the parameter estimates from the real data and synthetic data are equal. | ||||||
| Variables | cHLTH:real | cHLTH:synth | College:real | College:synth | midage health: real | midage health: synth |
|---|---|---|---|---|---|---|
| Intercept | 1.089 *** | 1.933 *** | -2.048 *** | -3.944 *** | -1.130 *** | -2.841 *** |
| (0.065) | (0.095) | (0.097) | (0.204) | (0.078) | (0.137) | |
| White | 0.299 *** | -0.160 * | 0.428 *** | 0.936 *** | 0.264 *** | 0.738 *** |
| (0.063) | (0.078) | (0.074) | (0.118) | (0.060) | (0.087) | |
| Female | -0.141 * | -0.059 | -0.409 *** | -0.568 *** | -0.201 *** | 0.091 |
| (0.056) | (0.082) | (0.057) | (0.088) | (0.049) | (0.076) | |
| Childhood SES | 0.536 *** | 0.609 *** | 1.596 *** | 3.230 *** | 0.225 ** | 0.828 *** |
| (0.091) | (0.154) | (0.069) | (0.106) | (0.070) | (0.111) | |
| Childhood Health | 0.544 *** | 1.287 *** | 0.291 *** | 0.481 *** | ||
| (0.077) | (0.166) | (0.062) | (0.102) | |||
| College+ | 0.218 *** | 0.482 *** | ||||
| (0.059) | (0.094) | |||||
| N | 7775 | 7775 | 7775 | 7775 | 7775 | 7775 |
| R squared | 0.009 | 0.004 | 0.088 | 0.244 | 0.012 | 0.043 |
| Mahalanobis distance | 184.142 | 184.142 | 322.328 | 322.328 | 461.119 | 461.119 |
| p-value | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| Loglik | -4002.297 | -3159.738 | -3867.259 | -2240.215 | -4864.316 | -3382.905 |
| *** p < 0.001; ** p < 0.01; * p < 0.05. Standard errors are in parentheses. Low p-value (e.g., < 0.01) reported below the Mahalanobis distance provides strong evidence against the null hypothesis that the parameter estimates from the real data and synthetic data are equal. | ||||||
6.2 Econometric models of midlife health
I consider another econometric model, which examines how various childhood factors and health behaviors leading up to middle age are associated with the incidence of various chronic diseases in a multinomial logit framework. The regressors and the disease states are given in Table 8. The policy issues are discussed in more detail in Raut (2024a).
Like in the previous subsection, I first show the statistical estimates of Mahalanobis distance metric for all the datasets in Table 7 and then present detailed parameter estimates for two synthetic generative models, ARF and CTGAN.
The estimates of the Mahalanobis distance metric in Table 7 show that the synthetic data generators ARF, TabDDPM, and CDTD are the best, producing statistically identical parameter estimates for the econometric policy model of midlife diseases of this section; the next best generators are TabularARGN and SMOTE, producing statistically identical estimates for almost all diseases.
| Synth data | 2-Cardiovas | 3-Cancer | 4-other | 5-Comorbid |
|---|---|---|---|---|
| ARF | 20.63 | 6.10 | 10.64 | 30.50 |
| Tabsyn | 46.55* | 40.84* | 19.69 | 81.73* |
| TabularARGN | 27.03 | 30.72 | 15.60 | 61.89* |
| TabDDPM | 9.76 | 26.83 | 18.97 | 17.87 |
| CoDi | 23.71 | 36.40* | 24.09 | 43.96* |
| TVAE | 440.11* | 2446.12* | 319.36* | 315.21* |
| CDTD | 15.51 | 21.73 | 24.21 | 26.57 |
| TabDiff | 57.12* | 17.62 | 42.49* | 150.85* |
| CTABGAN | 76.08* | 25.36 | 55.22* | 181.97* |
| SMOTE | 15.32 | 18.82 | 27.69 | 60.59* |
| TTVAE | 75.46* | 27.29 | 54.45* | 181.28* |
| CTGAN | 836.18* | 319.12* | 229.87* | 234.96* |
| CopulaGAN | 300.88* | 71.81* | 166.01* | 428.49* |
| Note: A Mahalanobis distance statistic with a * means its p-value < 0.01, providing strong evidence against the null hypothesis that the parameter estimates from the real data and synthetic data are equal. | ||||
| variables | 2-Cardiovas | 3-Cancer | 4-other | 5-Comorbid | ||||
|---|---|---|---|---|---|---|---|---|
| data type -> | original | synthetic | original | synthetic | original | synthetic | original | synthetic |
| Intercept | -0.449 * | -0.446 * | -4.536 *** | -3.431 *** | -1.090 *** | -0.721 ** | -0.280 | -0.059 |
| (0.228) | (0.217) | (0.922) | (0.671) | (0.251) | (0.234) | (0.228) | (0.216) | |
| White | -0.096 | -0.025 | 1.422 | 0.122 | 0.447 ** | 0.345 ** | 0.354 ** | 0.216 |
| (0.128) | (0.113) | (0.726) | (0.372) | (0.152) | (0.128) | (0.137) | (0.116) | |
| Black | 0.512 *** | 0.403 ** | 1.224 | 0.218 | 0.239 | 0.049 | 0.590 *** | 0.319 * |
| (0.141) | (0.124) | (0.764) | (0.411) | (0.173) | (0.146) | (0.150) | (0.127) | |
| Female | -0.123 | 0.003 | 0.988 *** | 0.790 *** | 0.590 *** | 0.438 *** | 0.499 *** | 0.389 *** |
| (0.066) | (0.065) | (0.222) | (0.211) | (0.072) | (0.070) | (0.067) | (0.066) | |
| Childhood SES | -0.104 | -0.053 | -0.599 | -0.351 | -0.188 | -0.276 ** | -0.257 ** | -0.184 |
| (0.093) | (0.090) | (0.325) | (0.300) | (0.098) | (0.099) | (0.099) | (0.096) | |
| Childhood Health | 0.060 | 0.047 | 0.235 | -0.064 | -0.319 *** | -0.206 * | -0.414 *** | -0.344 *** |
| (0.084) | (0.081) | (0.266) | (0.238) | (0.083) | (0.083) | (0.077) | (0.076) | |
| College+ | -0.115 | -0.066 | 0.043 | 0.056 | 0.070 | -0.044 | -0.152 | -0.277 *** |
| (0.080) | (0.077) | (0.241) | (0.233) | (0.084) | (0.082) | (0.085) | (0.083) | |
| Smoking | 0.103 | 0.076 | 0.152 | 0.287 | 0.343 *** | 0.192 ** | 0.317 *** | 0.139 * |
| (0.065) | (0.064) | (0.192) | (0.194) | (0.069) | (0.068) | (0.066) | (0.065) | |
| Home Environment | -0.012 | -0.016 | -0.118 | -0.211 | -0.215 * | -0.303 *** | -0.219 ** | -0.274 *** |
| (0.076) | (0.075) | (0.232) | (0.238) | (0.084) | (0.085) | (0.080) | (0.080) | |
| High BMI | 0.717 *** | 0.669 *** | -0.235 | -0.206 | 0.147 * | 0.032 | 0.941 *** | 0.886 *** |
| (0.072) | (0.070) | (0.193) | (0.193) | (0.070) | (0.069) | (0.073) | (0.073) | |
| CES-D | 0.263 | 0.364 * | 0.170 | 0.326 | 0.981 *** | 0.935 *** | 1.597 *** | 1.435 *** |
| (0.157) | (0.144) | (0.467) | (0.426) | (0.153) | (0.146) | (0.139) | (0.135) | |
| Cognitive scores | -0.016 * | -0.016 * | -0.016 | 0.004 | 0.007 | 0.004 | -0.023 ** | -0.022 ** |
| (0.008) | (0.008) | (0.024) | (0.023) | (0.009) | (0.008) | (0.008) | (0.007) | |
| cohort 1948-53 | 0.074 | 0.101 | -0.483 | -0.484 | -0.113 | -0.043 | -0.054 | -0.048 |
| (0.082) | (0.080) | (0.297) | (0.281) | (0.090) | (0.088) | (0.084) | (0.083) | |
| cohort 1954-59 | 0.194 * | 0.043 | -0.149 | -0.525 | -0.003 | -0.031 | -0.061 | -0.027 |
| (0.090) | (0.090) | (0.302) | (0.318) | (0.100) | (0.098) | (0.094) | (0.091) | |
| N | 7775 | 7775 | 7775 | 7775 | 7775 | 7775 | 7775 | 7775 |
| Mahalanobis distance | 7.528 | 7.528 | 5.393 | 5.393 | 8.903 | 8.903 | 9.954 | 9.954 |
| p-value | 0.873 | 0.873 | 0.966 | 0.966 | 0.780 | 0.780 | 0.698 | 0.698 |
| *** p < 0.001; ** p < 0.01; * p < 0.05. Standard errors are in parentheses. Low p-value (e.g., < 0.01) reported below the Mahalanobis distance provides strong evidence against the null hypothesis that the parameter estimates from the real data and synthetic data are equal. | ||||||||

| variables | 2-Cardiovascular | | 3-Cancer | | 4-Other | | 5-Comorbid | |
|---|---|---|---|---|---|---|---|---|
| data type -> | original | synthetic | original | synthetic | original | synthetic | original | synthetic |
| Intercept | -0.449 * | 3.235 *** | -4.536 *** | -0.604 | -1.090 *** | 1.138 ** | -0.280 | 2.021 *** |
| (0.228) | (0.303) | (0.922) | (0.470) | (0.251) | (0.358) | (0.228) | (0.359) | |
| White | -0.096 | -0.665 *** | 1.422 | 0.294 | 0.447 ** | -0.231 | 0.354 ** | -0.306 * |
| (0.128) | (0.113) | (0.726) | (0.191) | (0.152) | (0.136) | (0.137) | (0.137) | |
| Black | 0.512 *** | 0.230 * | 1.224 | -0.712 *** | 0.239 | 0.412 ** | 0.590 *** | -0.316 * |
| (0.141) | (0.116) | (0.764) | (0.203) | (0.173) | (0.136) | (0.150) | (0.144) | |
| Female | -0.123 | -0.207 * | 0.988 *** | 0.772 *** | 0.590 *** | 0.181 | 0.499 *** | 0.184 |
| (0.066) | (0.085) | (0.222) | (0.160) | (0.072) | (0.106) | (0.067) | (0.108) | |
| Childhood SES | -0.104 | -0.652 *** | -0.599 | -0.841 *** | -0.188 | -1.037 *** | -0.257 ** | -1.146 *** |
| (0.093) | (0.128) | (0.325) | (0.218) | (0.098) | (0.172) | (0.099) | (0.215) | |
| Childhood Health | 0.060 | -0.481 *** | 0.235 | 0.221 | -0.319 *** | -0.205 | -0.414 *** | -0.766 *** |
| (0.084) | (0.110) | (0.266) | (0.179) | (0.083) | (0.130) | (0.077) | (0.121) | |
| College+ | -0.115 | -0.402 *** | 0.043 | -0.573 *** | 0.070 | -0.438 *** | -0.152 | -1.096 *** |
| (0.080) | (0.109) | (0.241) | (0.163) | (0.084) | (0.133) | (0.085) | (0.171) | |
| Smoking | 0.103 | 0.142 * | 0.152 | -0.397 *** | 0.343 *** | 0.023 | 0.317 *** | 0.044 |
| (0.065) | (0.072) | (0.192) | (0.107) | (0.069) | (0.085) | (0.066) | (0.088) | |
| Home Environment | -0.012 | 0.072 | -0.118 | 0.546 *** | -0.215 * | -0.309 * | -0.219 ** | -0.524 *** |
| (0.076) | (0.109) | (0.232) | (0.142) | (0.084) | (0.138) | (0.080) | (0.154) | |
| High BMI | 0.717 *** | 0.181 * | -0.235 | -0.190 | 0.147 * | -0.395 *** | 0.941 *** | 0.709 *** |
| (0.072) | (0.071) | (0.193) | (0.104) | (0.070) | (0.084) | (0.073) | (0.089) | |
| CES-D | 0.263 | 1.324 *** | 0.170 | 0.854 * | 0.981 *** | 1.382 *** | 1.597 *** | 1.497 *** |
| (0.157) | (0.305) | (0.467) | (0.429) | (0.153) | (0.341) | (0.139) | (0.342) | |
| Cognitive scores | -0.016 * | -0.043 *** | -0.016 | -0.029 * | 0.007 | -0.009 | -0.023 ** | -0.046 *** |
| (0.008) | (0.010) | (0.024) | (0.015) | (0.009) | (0.012) | (0.008) | (0.012) | |
| cohort 1948-53 | 0.074 | 0.792 *** | -0.483 | 0.792 *** | -0.113 | 0.554 *** | -0.054 | 0.706 *** |
| (0.082) | (0.104) | (0.297) | (0.147) | (0.090) | (0.122) | (0.084) | (0.121) | |
| cohort 1954-59 | 0.194 * | 0.373 *** | -0.149 | 0.382 ** | -0.003 | 0.384 *** | -0.061 | 0.272 * |
| (0.090) | (0.096) | (0.302) | (0.144) | (0.100) | (0.111) | (0.094) | (0.119) | |
| N | 7775 | 7775 | 7775 | 7775 | 7775 | 7775 | 7775 | 7775 |
| Mahalanobis distance | 781.452 | 781.452 | 323.772 | 323.772 | 200.363 | 200.363 | 149.634 | 149.634 |
| p-value | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| *** p < 0.001; ** p < 0.01; * p < 0.05. Standard errors are in parentheses. Low p-value (e.g., < 0.01) reported below the Mahalanobis distance provides strong evidence against the null hypothesis that the parameter estimates from the real data and synthetic data are equal. | ||||||||
7 Conclusion
Micro-level data is essential for high-quality modeling in sectors like healthcare and finance, yet strict privacy mandates often prevent organizations from sharing this information publicly. As a solution, synthetic tabular data, comprising numerical, categorical, and ordinal variables, has emerged as a powerful alternative that mirrors the original statistical distributions while protecting individual identities, enabling evidence-based analysis without compromising privacy or utility. A central challenge in this field is navigating the inherent trade-off between data utility and privacy preservation; researchers strive to identify Pareto-superior models that deliver higher levels of both utility and privacy protection. While the literature offers a variety of metrics and frameworks, modern generative AI architectures, specifically Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), Large Language Models (LLMs), and diffusion models, are increasingly recognized for their ability to outperform conventional synthetic data generation techniques.
An important question that remains largely unexplored is: how sensitive are the policy conclusions of evidence-based econometric models to the substitution of synthetic data for real data? And what metrics should be used to compare the synthetic datasets generated by different models? These are the main questions addressed in this paper.
This paper first explains the differential privacy framework of Dwork and colleagues (Dwork 2008; Dwork and Roth 2014) and the DP-SGD (Differentially Private Stochastic Gradient Descent) algorithm of Abadi et al. (2016), which generative models can incorporate into their neural network training to achieve a guaranteed level of differential privacy. The paper then explains the main mechanics of GANs, VAEs, and diffusion models for synthetic tabular data. After briefly describing GANs and VAEs, it details the mechanics of the discrete-time diffusion model, Denoising Diffusion Probabilistic Models (DDPMs) (Sohl-Dickstein et al. 2015; Ho, Jain, and Abbeel 2020).
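The gradient-aggregation core of DP-SGD can be sketched as follows. This is a minimal illustration of the per-example gradient clipping and Gaussian noise addition described by Abadi et al. (2016), not the implementation used by any of the generative models in this paper; the function name and parameter defaults are illustrative.

```python
import numpy as np

def dp_sgd_step(per_example_grads, clip_norm=1.0, noise_multiplier=1.1, rng=None):
    """One DP-SGD aggregation step: clip each per-example gradient to L2 norm
    `clip_norm`, sum the clipped gradients, add Gaussian noise with standard
    deviation `noise_multiplier * clip_norm`, and return the noisy average."""
    rng = np.random.default_rng(0) if rng is None else rng
    n = per_example_grads.shape[0]
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    # Scale factor <= 1 so each row ends up with norm at most clip_norm.
    scale = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    clipped = per_example_grads * scale
    noisy_sum = clipped.sum(axis=0) + rng.normal(
        0.0, noise_multiplier * clip_norm, size=per_example_grads.shape[1])
    return noisy_sum / n

# Toy check with noise disabled: the first gradient (norm 5) is clipped to
# norm 1, the second (norm ~0.22) passes through unchanged.
grads = np.array([[3.0, 4.0], [0.1, 0.2]])
g = dp_sgd_step(grads, clip_norm=1.0, noise_multiplier=0.0)
```

The noise multiplier, together with the sampling rate and number of steps, determines the (epsilon, delta) privacy guarantee via the moments accountant of Abadi et al.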
The paper then describes the elegant continuous-time stochastic differential equation (SDE) framework that unifies discrete-time diffusion models. In this framework, a drift vector field and a diffusion coefficient define the forward process that progressively noises the data points, and the Fokker-Planck equation (also known as the Kolmogorov forward equation) is used to derive the backward denoising process that approximates the data-generating probability distribution, together with the corresponding training and sampling algorithms.
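For reference, the forward noising process and its time reversal take the following standard form in this framework (a sketch in the usual score-based SDE notation; the score term $\nabla_{\mathbf{x}} \log p_t(\mathbf{x})$ is what the neural network learns to approximate):

```latex
% Forward noising SDE: drift vector field f(x, t) and diffusion coefficient g(t)
\mathrm{d}\mathbf{x} = \mathbf{f}(\mathbf{x}, t)\,\mathrm{d}t + g(t)\,\mathrm{d}\mathbf{w}
% Reverse-time denoising SDE, obtained via the Fokker-Planck equation,
% where p_t is the marginal density of x at time t and \bar{w} runs backward:
\mathrm{d}\mathbf{x} = \left[\mathbf{f}(\mathbf{x}, t)
    - g(t)^2\,\nabla_{\mathbf{x}} \log p_t(\mathbf{x})\right]\mathrm{d}t
    + g(t)\,\mathrm{d}\bar{\mathbf{w}}
```

Sampling runs the reverse SDE from pure noise back to the data distribution, with the learned score substituted for the true one.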
The paper proposes the use of the Mahalanobis distance D^2 statistic as a measure of policy sensitivity to the substitution of real data with synthetic data for the statistical parameter estimates of econometric policy models.
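A minimal sketch of one plausible computation of this statistic is shown below, assuming D^2 is formed from the difference between the real-data and synthetic-data coefficient vectors and the sum of their estimated covariance matrices, with a chi-squared reference distribution on k degrees of freedom. The exact pooling of covariances used in the paper may differ, and the function name is illustrative.

```python
import numpy as np
from scipy.stats import chi2

def policy_sensitivity(beta_real, beta_syn, cov_real, cov_syn):
    """Mahalanobis D^2 between two coefficient vectors, using the sum of
    their covariance matrices (an assumption; the paper may pool
    differently). Returns D^2 and a chi-squared p-value with
    k = len(beta) degrees of freedom. A low p-value is evidence against
    the null that the real and synthetic parameter estimates are equal."""
    d = np.asarray(beta_real) - np.asarray(beta_syn)
    V = np.asarray(cov_real) + np.asarray(cov_syn)
    d2 = float(d @ np.linalg.solve(V, d))
    return d2, float(chi2.sf(d2, df=len(d)))

# Toy usage: identical estimates give D^2 = 0 and p-value 1;
# a unit difference under V = I gives D^2 = 1.
d2_same, p_same = policy_sensitivity([1.0, 2.0], [1.0, 2.0],
                                     np.eye(2), np.eye(2))
d2_diff, p_diff = policy_sensitivity([1.0, 0.0], [0.0, 0.0],
                                     0.5 * np.eye(2), 0.5 * np.eye(2))
```

Under this reading, the high p-values in the first regression table (e.g., 0.873) indicate no detectable policy sensitivity, while the p-values of 0.000 in the second table indicate that the synthetic data materially changes the estimates.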
The paper considers 13 tabular generative models (3 GAN-based, 2 VAE-based, 5 diffusion-based, and 3 of other types), trains each model, generates samples from it, computes various utility and privacy metrics, and applies the proposed Mahalanobis distance metric for policy sensitivity to rank the models.1
The paper finds that the best five models under each criterion, ranked in decreasing order, are as follows:
- By the commonly used utility metric pMSE (propensity mean squared error), the best five models are ARF, CDTD, TabDDPM, SMOTE, and Tabsyn.
- By the privacy metric grMDCR (median Gower distance to closest record), the best five models are CTGAN, TVAE, SMOTE, CoDi, and ARF.
- By the weighted privacy metric of Lautrup et al. (2024) (which combines many individual metrics), the best five models are CTGAN, TVAE, CoDi, ARF, and TTVAE.
According to the proposed Mahalanobis D^2 metric, the models with no statistically significant policy sensitivity are ARF, CDTD, and TabDiff for the three econometric models of early childhood factors, and ARF, CDTD, and TabDDPM for the econometric model of midlife chronic disease incidence. The only models that produce no statistically significant policy sensitivity across all econometric models considered in the paper are ARF and CDTD.
The synthetic data generator ARF stands out as the best compromise across the utility, privacy, and policy sensitivity metrics.
References
Footnotes
GAN-based models: CTGAN (Xu et al. 2019), CTABGAN (Z. Zhao et al. 2021), CopulaGAN (Patki, Wedge, and Veeramachaneni 2021); VAE-based models: TVAE (Xu et al. 2019), TTVAE (A. X. Wang and Nguyen 2025); Diffusion-based models: TabDDPM (Kotelnikov et al. 2023), CoDi (Lee, Kim, and Park 2023), Tabsyn (Zhang et al. 2024), TabDiff (J. Shi et al. 2024), CDTD (Mueller, Gruber, and Fok 2023); Other model types: SMOTE (Chawla et al. 2002), ARF (Watson et al. 2022), TabularARGN (Tiwald et al. 2025).↩︎