Evaluating Privacy Risks in AI-Generated Synthetic Tabular Data
This white paper evaluates the privacy risks of AI-generated synthetic tabular data by analyzing the success of membership inference attacks (MIAs) under varying configurations and attacker profiles. Building on the MIDST challenge, which revealed vulnerabilities in state-of-the-art diffusion models, we identify key factors influencing privacy leakage and provide actionable guidance for practitioners.
🔍 The Challenge
⚠️ The Problem
Synthetic data generation promises to preserve utility while eliminating privacy risks. However, recent research reveals that generative models can leak information about specific records used during training through membership inference attacks (MIAs).
🎯 The MIDST Challenge
The Membership Inference over Diffusion-models-based Synthetic Tabular data (MIDST) challenge at IEEE SaTML 2025 systematically stress-tested the privacy of diffusion-based synthetic tabular data generators. More than 70 participants submitted over 700 attack strategies, revealing significant privacy leakage across all tested scenarios.
📊 MIDST Challenge Results
The winning white-box attack achieved a 46% true positive rate at a 10% false positive rate, more than four times better than a random baseline under the same metric. Even black-box attacks with only synthetic outputs reached 25% success rates, demonstrating that state-of-the-art diffusion models leak training data information.
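To make the headline metric concrete, the sketch below (all names hypothetical) shows how true positive rate at a fixed false positive rate is typically computed from attack scores: pick the score threshold that caps the non-member false alarm rate at 10%, then measure how many true members score above it. A random-guessing attack lands near 10% TPR at 10% FPR, which is why 46% is roughly a 4.6x improvement.

```python
import numpy as np

def tpr_at_fpr(scores, labels, target_fpr=0.10):
    """True positive rate at a fixed false positive rate.

    scores: attack output, higher = more likely a training member.
    labels: 1 for actual members, 0 for non-members.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    # Threshold chosen so roughly `target_fpr` of non-members exceed it.
    threshold = np.quantile(scores[labels == 0], 1.0 - target_fpr)
    return float(np.mean(scores[labels == 1] > threshold))

# A random-guess attack scores members and non-members identically,
# so its TPR at 10% FPR sits close to the 10% baseline.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=10_000)
random_scores = rng.random(10_000)
print(tpr_at_fpr(random_scores, labels))  # typically close to 0.10
```

This is the same evaluation convention used in the LiRA-style MIA literature; reporting TPR at a low FPR emphasizes confident identification of individual members rather than average-case accuracy.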
🔬 Key Experimental Findings
1. Training Data Size Matters Most
Larger training datasets substantially improve both privacy and utility, making data collection a more effective lever than over-synthesizing: increasing the volume of synthetic data relative to the training data degrades privacy, particularly for smaller training sets.
2. Model Size vs. Privacy Tradeoffs
Smaller or moderately sized models are often more privacy-efficient. Increasing model capacity can amplify privacy leakage without meaningful gains in synthetic data quality.
3. Hyperparameter Choices Present Clear Tradeoffs
Attack success is highly sensitive to some hyperparameters and surprisingly insensitive to others. For example, increasing diffusion steps and training iterations improves synthetic data quality up to a point but also increases vulnerability to MIAs. These hyperparameters therefore require careful tuning against the specific use case and its acceptable risk thresholds.
4. MIAs Work Even with Imperfect Knowledge
MIA threat models traditionally assume an attacker with access to statistically identical data and full knowledge of the training parameters. Experimental results, however, show that state-of-the-art MIAs remain highly effective even when the attacker's knowledge of the training data distribution, hyperparameters, and model architecture is imperfect. Because leakage persists under these realistic threat models, MIAs provide a direct, regulator-aligned measure of real-world privacy risk.
5. Distance to Closest Record (DCR) Fails as a Privacy Metric
DCR and similar proxy metrics fail to reliably estimate MIA success. While DCR is easy to compute and widely used, experiments show that it does not correlate with MIA outcomes across key levers such as model size, batch size, and diffusion steps, among others. In these scenarios, MIA success changes significantly while DCR remains largely flat. As such, DCR cannot be relied upon as a standalone privacy metric, particularly in high-stakes or regulated settings.
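For readers unfamiliar with the metric, below is a minimal numeric sketch of DCR, assuming purely numeric columns and Euclidean distance (real tabular implementations typically handle mixed types and feature scaling). The aggregate nature of the metric hints at why it can stay flat while per-record MIA success moves: it summarizes how close synthetic rows sit to the training set overall, not whether any specific record's membership is inferable.

```python
import numpy as np

def dcr(synthetic, training):
    """Distance to Closest Record: for each synthetic row, the Euclidean
    distance to its nearest training row. Low values are often read as
    'memorization', but this aggregate says little about whether a
    specific record's membership can be inferred."""
    # Pairwise distances via broadcasting: shape (n_synthetic, n_training).
    diffs = synthetic[:, None, :] - training[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    return dists.min(axis=1)

rng = np.random.default_rng(1)
train = rng.normal(size=(200, 5))
synth = rng.normal(size=(100, 5))
print(dcr(synth, train).mean())  # one scalar summary for the whole dataset
```

A copied training row yields a DCR of exactly zero, which is the memorization signal the metric is designed to catch; the findings above show that leakage exploitable by MIAs occurs well before that extreme.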
💡 Industry Takeaways
Prioritize Data Collection
Investing in larger, high-quality training datasets delivers better privacy and utility outcomes than attempting to compensate with excessive synthetic data generation. Generating synthetic data well in excess of your training set size, particularly with small training datasets, significantly increases privacy leakage without proportional quality gains.
Right-Size Your Models
Don't default to the largest available model. Moderately sized models often provide the best privacy-utility balance for synthetic data generation.
Use MIAs for Assessment
Adopt MIAs as your primary privacy risk assessment tool. They provide direct, operationally meaningful measures aligned with regulatory concerns.
Establish Defensible Thresholds
Define acceptable privacy risk thresholds based on your specific use case, regulatory environment, and data sensitivity, even in the absence of industry-wide frameworks.
Beware of Proxy Metrics
DCR and similar proxy metrics fail to reliably measure privacy risk and possess notable blind spots compared to MIA-based assessments. Don't rely on proxies when direct MIA-based assessment is available.
Test Multiple Attack Scenarios
Evaluate privacy under both white-box (full model access) and black-box (synthetic outputs only) scenarios to understand your system's vulnerability profile from different perspectives.
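To make the black-box setting concrete, here is a toy nearest-neighbor distance attack, one common black-box baseline that uses only the released synthetic outputs: candidate records unusually close to a synthetic row are scored as more likely members. This is an illustrative sketch (all names hypothetical), not the MIDST winning attack.

```python
import numpy as np

def blackbox_membership_scores(candidates, synthetic):
    """Score candidate records using only synthetic outputs.

    Generative models tend to place synthetic points near training
    members, so a smaller distance to the nearest synthetic row is
    treated as weak evidence of membership (higher score)."""
    diffs = candidates[:, None, :] - synthetic[None, :, :]
    nearest = np.sqrt((diffs ** 2).sum(axis=-1)).min(axis=1)
    return -nearest  # higher score = more likely a training member

# Toy demo: synthetic data that closely shadows its training members.
rng = np.random.default_rng(2)
members = rng.normal(size=(50, 4))
synthetic = members + rng.normal(scale=0.05, size=(50, 4))
non_members = rng.normal(size=(50, 4))
scores = blackbox_membership_scores(np.vstack([members, non_members]), synthetic)
print(scores[:50].mean(), scores[50:].mean())  # members score higher on average
```

Feeding such scores into a TPR-at-fixed-FPR evaluation, for both this black-box view and a white-box attack with model internals, gives the two vulnerability perspectives described above.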
👥 Participating Teams
Accenture
Juan-Carlos Castañeda, Declan McClure, Karthik Venkataraman
EY
Rasoul Shahsavarifar, Jean-Luc Rukundo, N'Golo Kone, Paulina Nouwou, Yasmin Mokaberi
Hitachi Rail
Théo Pinardin, Safiya Kamal
Unilever
Colm Cleary, Marta Mischi
Vector Institute
Masoumeh Shafieinejad & David Emerson (Technical Leads), Xi He (Faculty Advisor), Michael Joseph (Project Manager), Behnoosh Zamanlooy, Elaheh Bassak, Fatemeh Tavakoli, Sara Kodeiri, Marcelo Lotif (Research & Engineering)
📝 Citation
Use the BibTeX below to cite this work:
@techreport{vectorinstitute2026privacypotions,
  title={Privacy Potions for Production: Controlling Leakage in AI-Synthesized Data},
  author={Accenture and EY and {Hitachi Rail} and Unilever and {Vector Institute}},
  year={2026},
  institution={Vector Institute for Artificial Intelligence},
  type={White Paper}
}