Introduction
Welcome to the Vector Institute MIDST challenge (Membership Inference over Diffusion-models-based Synthetic Tabular data) hosted at the 3rd IEEE Conference on Secure and Trustworthy Machine Learning (SaTML 2025).
In this challenge, you will evaluate the resilience of the synthetic tabular data generated by diffusion models against black-box and white-box membership inference attacks.
- Challenge Overview
- Task Details
- Models and Datasets
- Submissions and Scoring
- Winner Selection
- Important Dates
- Terms and Conditions
- Getting Started
- Event Organizers
- Event Sponsors
- Frequently Asked Questions
- Acknowledgements
- Contact
Challenge Overview
Synthetic data is often perceived as a silver-bullet solution to data anonymization and privacy-preserving data publishing. Drawn from generative models such as diffusion models, synthetic data is expected to preserve the statistical properties of the original dataset while remaining resilient to privacy attacks. Diffusion models have recently proven effective on a wide range of data types, but their privacy resilience, particularly for tabular formats, remains largely unexplored.
In this challenge, we seek a quantitative evaluation of the privacy gain of synthetic tabular data generated by diffusion models, with a specific focus on its resistance to membership inference attacks (MIAs). Given the heterogeneity and complexity of tabular data, we will explore multiple target models for MIAs, including diffusion models for single tables with mixed data types and for multi-relational tables with interconnected constraints. A key expected outcome is the development of novel black-box and white-box MIAs tailored to these target diffusion models, enabling a comprehensive evaluation of their privacy efficacy.
For each task in MIDST, you are given a set of challenge points, and the aim is to decide which of these challenge points were used to train the model. You can compete on any of four separate membership inference tasks. Each task will be scored separately; you do not need to participate in all of them, and can choose to participate in as many as you like. Throughout the competition, submissions will be scored on a subset of the evaluation data and ranked on a live scoreboard. When submission closes, the final scores will be computed on a separate subset of the evaluation data.
The winner of each task will be eligible for an award of $2000 CAD and the runner-up of each task for an award of $1000 CAD (in the event of tied entries, these awards may be adjusted). This competition is co-located with the IEEE Conference on Secure and Trustworthy Machine Learning (SaTML) 2025, and the winners will be invited to present their strategies at the conference.
Task Details
The generative models are trained on a training data set to produce synthetic data. They are expected to learn the statistics of the data without memorizing individual records. To evaluate this promise, membership inference attacks assess whether the model distinguishes between the training data set and a holdout data set, both of which are drawn from the same, larger data set.
For each of the four tasks, we train a set of models on different splits of a public dataset. For each of these models, we provide m challenge points, exactly half of which are members (i.e., used to train the model) and half of which are non-members (i.e., from the holdout set: they come from the same public dataset as the training set, but were not used to train the model). Your goal is to determine which challenge points are members and which are non-members.
This challenge is composed of four different tasks, each associated with a separate category. The categories are defined based on the access to the generative models and the type of the tabular data as follows:
- Access to the models: black-box, Data: single table
- Access to the models: white-box, Data: single table
- Access to the models: black-box, Data: multi-table
- Access to the models: white-box, Data: multi-table
Note: In white-box attacks, you have access to the models and their generated synthetic output. Training sets for these models are selected from a public dataset. In black-box attacks, you have access to the same information as in white-box attacks, except for the models themselves.
To facilitate participation in MIDST, we provide shadow models for both single-table and multi-table tasks. The shadow models are the same for the black-box and white-box tasks. You are free to use these shadow models and/or train your own when developing your MIAs.
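For illustration, the sketch below shows one way shadow models could feed a simple black-box attack: a distance-to-nearest-synthetic-sample feature is computed for each challenge point, and a classifier is fit on shadow challenge points (where membership is known) to produce membership confidences. This is a minimal sketch; the feature, data shapes, and helper names are hypothetical and are not part of the official starter kit.

```python
# Minimal shadow-model attack sketch (illustrative, not the official baseline).
# Placeholder random data stands in for features derived from the released
# "train" (shadow) models, for which ground-truth membership is known.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def nearest_synthetic_distance(records, synthetic):
    """Hypothetical black-box feature: distance from each challenge record
    to its nearest synthetic sample."""
    return np.array([np.linalg.norm(synthetic - r, axis=1).min() for r in records])

# Placeholder arrays; replace with data from the shadow models and their output.
synthetic = rng.normal(size=(1000, 8))            # synthetic samples
shadow_records = rng.normal(size=(200, 8))        # shadow challenge points
shadow_labels = rng.integers(0, 2, size=200)      # 1 = member, 0 = non-member

X_shadow = nearest_synthetic_distance(shadow_records, synthetic).reshape(-1, 1)
attack = LogisticRegression().fit(X_shadow, shadow_labels)

# Score dev/final challenge points: keep P(member) as the submission confidence.
target_records = rng.normal(size=(200, 8))
X_target = nearest_synthetic_distance(target_records, synthetic).reshape(-1, 1)
confidences = attack.predict_proba(X_target)[:, 1]
```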
Models and Datasets
MIDST examines the privacy of three recent diffusion-model-based tabular synthesis approaches.
We include each of these models with a dedicated directory in the MIDSTModels repository. In each directory, there is a README file that provides an overview of the topic, prerequisites, and notebook descriptions.
Submissions and Scoring
Submissions will be ranked based on their performance in membership inference against the associated models.
There are three sets of challenges: train, dev, and final. For models in train, we reveal the full training dataset, and consequently the ground truth membership data for challenge points. These models can be used by participants to develop their attacks. For models in the dev and final sets, no ground truth is revealed and participants must submit their membership predictions for challenge points.
During the competition, there will be a live scoreboard based on the dev challenges. The final ranking will be decided on the final set; scoring for this set will be withheld until the competition ends.
For each challenge point, the submission must provide a value indicating the confidence that the challenge point is a member. Each value must be a floating point number in the range [0.0, 1.0], where 1.0 indicates certainty that the challenge point is a member and 0.0 indicates certainty that it is a non-member.
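As a minimal sketch, the snippet below clips raw attack scores to the required [0.0, 1.0] range and writes one value per challenge point. The output file name is a hypothetical placeholder; follow the starter kit for the exact file layout Codabench expects.

```python
# Illustrative only: clamp raw attack scores to [0.0, 1.0] and write one
# confidence per challenge point. "prediction.csv" is a hypothetical name,
# not the official submission format.
import numpy as np

raw_scores = np.array([1.3, 0.72, -0.1, 0.55])   # example raw attack outputs
confidences = np.clip(raw_scores, 0.0, 1.0)
np.savetxt("prediction.csv", confidences, fmt="%.6f")
```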
Submissions will be evaluated according to their True Positive Rate at 10% False Positive Rate (i.e. TPR @ 0.1 FPR). In this context, positive challenge points are members and negative challenge points are non-members. For each submission, the scoring program concatenates the confidence values for all models (dev and final treated separately) and compares these to the reference ground truth. The scoring program determines the minimum confidence threshold for membership such that at most 10% of the non-member challenge points are incorrectly classified as members. The score is the True Positive Rate achieved by this threshold (i.e., the proportion of correctly classified member challenge points). The live scoreboard shows additional scores (i.e., TPR at other FPRs, membership inference advantage, accuracy, AUC-ROC score), but these are only informational.
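The snippet below sketches how this metric can be reproduced locally from ground truth and confidence values (e.g., on the train models). It mirrors the description above but is not the official scoring program.

```python
# Sketch of TPR @ 10% FPR: find the threshold at which at most 10% of
# non-members are classified as members, and report the member recall there.
import numpy as np
from sklearn.metrics import roc_curve

def tpr_at_fpr(y_true, confidences, max_fpr=0.1):
    """Return the TPR at the largest operating point whose FPR is <= max_fpr."""
    fpr, tpr, _ = roc_curve(y_true, confidences)
    return tpr[fpr <= max_fpr].max()

# Toy example: 1 = member, 0 = non-member.
y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0, 1, 0])
scores = np.array([0.9, 0.8, 0.4, 0.3, 0.2, 0.7, 0.6, 0.1, 0.5, 0.35])
print(tpr_at_fpr(y_true, scores))
```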
You are allowed to make multiple submissions, but only your latest submission will be considered.
Winner Selection
Winners will be selected independently for each task (i.e. if you choose not to participate in certain tasks, this will not affect your rank for the tasks in which you do participate).
For each task, the winner will be the one achieving the highest score (TPR @ 0.1 FPR).
Important Dates
- Submission opens: December 1, 2024
- Submission closes: February 20, 2025, 23:59 (Anywhere on Earth)
- Conference: April 9-11, 2025
Terms and Conditions
- To be eligible for receiving awards, participants are required to release the code of their submissions as open source.
- Each algorithm will be required to run within a specific time on a given GPU.
- Submissions will be evaluated by a panel of judges according to the aims of the competition.
- The individuals involved with MIDST design may submit solutions, but are not eligible to receive awards.
Codabench Competitions
- Black-box MIA on single table
- Black-box MIA on multi-table
- White-box MIA on single table
- White-box MIA on multi-table
Getting Started
First, register on Codabench for the tasks in which you would like to participate. Upon registration, you will be directed to the related starter kit and the URLs from which to download the challenge data. The MIDSTModels repo contains helpful information related to the competitions: we have provided starter kits that showcase creating a baseline attack and making a submission to each of the competitions, which is a great place to start. Additionally, we have provided reference implementations for each model used in MIDST.
Event Organizers
Event Sponsors
FAQ
Acknowledgements
We’d like to thank the MICO organizers for their open-source project and their very helpful comments.
Contact
For more information or help with navigating our repository, please contact masoumeh@vectorinstitute.ai, xi.he@vectorinstitute.ai or john.jewell@vectorinstitute.ai.