Privacy Potions for Production

Controlling Leakage in AI-Synthesized Data

Evaluating Privacy Risks in AI-Generated Synthetic Tabular Data

Authors:
Accenture, EY, Hitachi Rail, Unilever, Vector Institute

This white paper evaluates the privacy risks of AI-generated synthetic tabular data by analyzing the success of membership inference attacks (MIAs) under varying configurations and attacker profiles. Building on the MIDST challenge, which revealed vulnerabilities in state-of-the-art diffusion models, we identify key factors influencing privacy leakage and provide actionable guidance for practitioners.

🔍 The Challenge

⚠️ The Problem

Synthetic data generation promises to preserve utility while eliminating privacy risks. However, recent research reveals that generative models can leak information about specific records used during training through membership inference attacks (MIAs).

🎯 The MIDST Challenge

The Membership Inference over Diffusion-models-based Synthetic Tabular data (MIDST) challenge at IEEE SaTML 2025 systematically tested privacy vulnerabilities. The challenge drew over 71 participants, who submitted more than 700 attack strategies and revealed significant privacy leakage across all scenarios.

📊 MIDST Challenge Results

The winning white-box attack achieved a 46% true positive rate at a 10% false positive rate, more than four times better than a random baseline under the same metric. Even black-box attacks, with access only to the synthetic outputs, reached 25% success rates, demonstrating that state-of-the-art diffusion models leak training data information.
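For concreteness, the headline metric can be computed directly from attacker confidence scores. The sketch below is illustrative only (the function and variable names are not from the challenge codebase): it returns the best true positive rate an attacker can achieve while keeping the false positive rate within a fixed budget.

```python
import numpy as np

def tpr_at_fpr(scores, labels, target_fpr=0.10):
    """TPR of a thresholded attack at a fixed FPR budget.

    scores: higher = attacker is more confident the record was a training member.
    labels: 1 for true members, 0 for non-members.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    # Sweep thresholds from most confident to least confident prediction.
    order = np.argsort(-scores)
    sorted_labels = labels[order]
    tp = np.cumsum(sorted_labels)        # true positives at each cutoff
    fp = np.cumsum(1 - sorted_labels)    # false positives at each cutoff
    tpr = tp / max(labels.sum(), 1)
    fpr = fp / max((1 - labels).sum(), 1)
    # Best TPR achievable without exceeding the FPR budget.
    ok = fpr <= target_fpr
    return float(tpr[ok].max()) if ok.any() else 0.0
```

A random-guessing attacker scores roughly TPR ≈ FPR under this metric, which is why 46% TPR at 10% FPR is described above as more than four times better than the baseline.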

🔬 Key Experimental Findings

1. Training Data Size Matters Most

Larger training datasets substantially improve both privacy and utility. Data collection is a more effective lever than oversynthesizing. Increasing synthetic data relative to training data degrades privacy, particularly for smaller training sets.

2. Model Size vs. Privacy Tradeoffs

Smaller or moderately sized models are often more privacy-efficient. Increasing model capacity can amplify privacy leakage without meaningful gains in synthetic data quality.

3. Hyperparameter Choices Present Clear Tradeoffs

Attack success is highly sensitive to some hyperparameters and surprisingly insensitive to others. For example, increasing diffusion steps and training iterations improves synthetic data quality up to a point, but also increases vulnerability to MIAs. These hyperparameters therefore require careful tuning based on the specific use case and acceptable risk thresholds.

4. MIAs Work Even with Imperfect Knowledge

Traditionally, it is assumed that attackers have access to statistically identical data and knowledge of training parameters. However, experimental results demonstrate that state-of-the-art MIAs remain highly effective even when attackers have imperfect knowledge of the training data distribution, hyperparameters, and model architectures. Because attacks succeed under these realistic, imperfect-knowledge conditions, MIAs provide a direct, regulator-aligned measure of privacy risk.

5. Distance to Closest Record (DCR) Fails as a Privacy Metric

DCR and similar proxy metrics fail to reliably estimate MIA success. While DCR is easy to compute and widely used, experiments show that it does not correlate with MIA outcomes across key levers such as model size, batch size, and diffusion steps. In these scenarios, MIA success changes significantly while DCR remains largely flat. As such, DCR cannot be relied upon as a standalone privacy metric, particularly in high-stakes or regulated settings.
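As a reference point for why DCR is attractive yet insufficient, here is a minimal NumPy sketch of the metric (names are illustrative): for each synthetic row, compute the Euclidean distance to its nearest training row.

```python
import numpy as np

def dcr(synthetic, training):
    """Distance to Closest Record: for each synthetic row, the Euclidean
    distance to the nearest training row. Low values flag near-copies of
    training records."""
    synthetic = np.asarray(synthetic, dtype=float)
    training = np.asarray(training, dtype=float)
    # Pairwise differences via broadcasting: shape (n_synthetic, n_training, n_features)
    diffs = synthetic[:, None, :] - training[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    # Nearest training record for each synthetic record.
    return dists.min(axis=1)
```

Low DCR values do flag near-copies of training records, but as the experiments above show, DCR can stay flat while MIA success moves substantially, so it cannot stand in for a direct attack-based evaluation.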

💡 Industry Takeaways

1. Prioritize Data Collection

Investing in larger, high-quality training datasets delivers better privacy and utility outcomes than attempting to compensate with excessive synthetic data generation. Generating synthetic data well in excess of your training set size, particularly with small training datasets, significantly increases privacy leakage without proportional quality gains.

2. Right-Size Your Models

Don't default to the largest available model. Moderately sized models often provide the best privacy-utility balance for synthetic data generation.

3. Use MIAs for Assessment

Adopt MIAs as your primary privacy risk assessment tool. They provide direct, operationally meaningful measures aligned with regulatory concerns.

4. Establish Defensible Thresholds

Define acceptable privacy risk thresholds based on your specific use case, regulatory environment, and data sensitivity, even in the absence of industry-wide frameworks.

5. Beware of Proxy Metrics

DCR and similar proxy metrics fail to reliably measure privacy risk and possess notable blind spots compared to MIA-based assessments. Don't rely on proxies when direct MIA-based assessment is available.

6. Test Multiple Attack Scenarios

Evaluate privacy under both white-box (full model access) and black-box (synthetic outputs only) scenarios to understand your system's vulnerability profile from different perspectives.
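To illustrate what the black-box scenario looks like in practice, the sketch below scores candidate records using only the released synthetic table: records that lie unusually close to some synthetic row receive higher membership scores. This is a simple distance-based heuristic for illustration, not the winning MIDST attack, and all names are hypothetical.

```python
import numpy as np

def blackbox_mia_scores(candidates, synthetic):
    """Score candidate records for membership using only synthetic outputs
    (black-box setting). Higher score = more likely a training member,
    on the heuristic that training records tend to have near-duplicates
    among the generated rows."""
    candidates = np.asarray(candidates, dtype=float)
    synthetic = np.asarray(synthetic, dtype=float)
    # Distance from each candidate to its nearest synthetic record.
    diffs = candidates[:, None, :] - synthetic[None, :, :]
    nearest = np.sqrt((diffs ** 2).sum(axis=-1)).min(axis=1)
    # Negate so that "closer to the synthetic data" means "higher membership score".
    return -nearest
```

Feeding these scores into a true-positive-rate-at-fixed-false-positive-rate evaluation gives a black-box baseline that can be compared against white-box results to map out the vulnerability profile described above.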

👥 Participating Teams

Accenture

Juan-Carlos Castañeda, Declan McClure, Karthik Venkataraman

EY

Rasoul Shahsavarifar, Jean-Luc Rukundo, N'Golo Kone, Paulina Nouwou, Yasmin Mokaberi

Hitachi Rail

Théo Pinardin, Safiya Kamal

Unilever

Colm Cleary, Marta Mischi

Vector Institute

Masoumeh Shafieinejad & David Emerson (Technical Leads), Xi He (Faculty Advisor), Michael Joseph (Project Manager), Behnoosh Zamanlooy, Elaheh Bassak, Fatemeh Tavakoli, Sara Kodeiri, Marcelo Lotif (Research & Engineering)

📝 Citation

Use the BibTeX below to cite this work:

@techreport{vectorinstitute2026privacypotions,
  title={Privacy Potions for Production: Controlling Leakage in AI-Synthesized Data},
  author={Accenture and EY and {Hitachi Rail} and Unilever and {Vector Institute}},
  year={2026},
  institution={Vector Institute for Artificial Intelligence},
  type={White Paper}
}