Welcome to the Membership Inference over Diffusion-models-based Synthetic Tabular data (MIDST) challenge of SaTML 2025!
In this challenge, you will evaluate the resilience of the synthetic tabular data generated by diffusion models against black-box and white-box membership inference attacks.
Synthetic data is often perceived as a silver-bullet solution to data anonymization and privacy-preserving data publishing. Drawn from generative models such as diffusion models, synthetic data is expected to preserve the statistical properties of the original dataset while remaining resilient to privacy attacks. Diffusion models have recently proven effective on a wide range of data types, but their privacy resilience, particularly for tabular formats, remains largely unexplored. In this challenge, we seek a quantitative evaluation of the privacy gain of synthetic tabular data generated by diffusion models, with a specific focus on its resistance to membership inference attacks (MIAs). Given the heterogeneity and complexity of tabular data, we will explore multiple target models for MIAs, including diffusion models for single tables of mixed data types and for multi-relational tables with interconnected constraints. As a key outcome, we expect the development of novel black-box and white-box MIAs tailored to these target diffusion models, enabling a comprehensive evaluation of their privacy efficacy.
The winner of each task will be eligible for an award of $XXX CAD (in the event of tied entries, these awards may be adjusted). This competition is co-located with the IEEE Conference on Secure and Trustworthy Machine Learning (SaTML) 2025, and the winners will be invited to present their strategies at the conference.
The generative models are trained on a training dataset and then used to generate synthetic data. They are expected to learn the statistics of the data without memorizing individual records. To evaluate this promise, membership inference attacks assess whether the model distinguishes between the training dataset and a holdout dataset drawn from the same distribution as the training set. This challenge is composed of four different tasks, each associated with a separate category. The categories are defined based on the access to the generative models and the type of the tabular data as follows:
MIDST examines the privacy of three recent diffusion-model-based tabular synthesis approaches:
Each of these models has a dedicated directory under reference_implementations/. Each directory contains a README file that provides an overview of the topic, prerequisites, and notebook descriptions.
We evaluate the success of membership inference attacks (MIAs) by their ability to accurately determine whether a data point comes from the training set used to train a diffusion model or from the holdout set. For the four tasks, we provide a set of models trained on different splits of a public dataset. For each of these models, we provide m challenge points, exactly half of which are members (i.e., used to train the model) and half of which are non-members (the holdout dataset, i.e., points that come from the same distribution as the training set but were not used to train the model). Participants should determine which challenge points are members and which are non-members. Submissions will be evaluated according to their True Positive Rate (TPR) at 10% False Positive Rate (FPR), which is a common practice in MIA assessment.
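The metric described above can be illustrated with a minimal sketch. This is not the official scoring code: the function name `tpr_at_fpr`, the convention that a higher score means "more likely a member", and the use of `>=` when thresholding are all assumptions made for illustration only.

```python
import numpy as np


def tpr_at_fpr(labels, scores, max_fpr=0.1):
    """Best true positive rate over all thresholds whose FPR is <= max_fpr.

    labels: 1 for members, 0 for non-members (hypothetical convention).
    scores: higher means "more likely a member" (hypothetical convention).
    """
    labels = np.asarray(labels).astype(bool)
    scores = np.asarray(scores, dtype=float)
    n_members = labels.sum()
    n_non_members = (~labels).sum()
    if n_members == 0 or n_non_members == 0:
        raise ValueError("need at least one member and one non-member")

    best_tpr = 0.0
    # Try every distinct score as a threshold: predict "member" if score >= t.
    for t in np.unique(scores):
        predicted_member = scores >= t
        fpr = (predicted_member & ~labels).sum() / n_non_members
        tpr = (predicted_member & labels).sum() / n_members
        if fpr <= max_fpr:
            best_tpr = max(best_tpr, tpr)
    return float(best_tpr)


if __name__ == "__main__":
    # Toy challenge set: 10 members followed by 10 non-members.
    toy_labels = [1] * 10 + [0] * 10
    toy_scores = [0.95, 0.9, 0.85, 0.8, 0.75, 0.3, 0.3, 0.2, 0.2, 0.1,
                  0.7, 0.4, 0.35, 0.3, 0.25, 0.2, 0.15, 0.1, 0.05, 0.0]
    print(tpr_at_fpr(toy_labels, toy_scores))
```

Because the metric only rewards members identified while the false positive rate stays at or below 10%, an attack that assigns the same score to every challenge point gets 0 under this sketch, even though its accuracy on a balanced challenge set would be 50%.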
We include some easier-to-achieve baseline scores in this repository as a warm-up, and encourage you to start by passing the baseline threshold. The main challenge is more advanced: a higher score indicates a stronger attack. The winner of each task is the attack with the highest score in the corresponding category.
Winners will be selected independently for each task (i.e. if you choose not to participate in certain tasks, this will not affect your rank for the tasks in which you do participate).
For each task, the winner will be the one achieving the highest average score (TPR @ 0.1 FPR) across the three scenarios.
You first need to register on CodaLab for the tasks in which you would like to participate. Upon registration, you will be given URLs from which to download the challenge data.
This project is licensed under the terms of the [LICENSE] file located in the root directory of this repository.
This project welcomes contributions and suggestions. To contribute, please read our [CONTRIBUTING.md] guide.
For more information or help with navigating this repository, please contact masoumeh@vectorinstitute.ai or xi.he@vectorinstitute.ai.