Differentially private synthetic tabular data generation with a generative adversarial network and privacy amplification by subsampling
Nieminen, Valtteri (2022-08-01)
This publication is subject to copyright. The work may be read and printed for personal use. Commercial use is prohibited.
open access
The permanent address of the publication is:
https://urn.fi/URN:NBN:fi-fe2022082956643
Abstract
Advances in computation have created high demand for large datasets, which in turn has sparked interest in using personal data collected by different institutions for secondary purposes such as research. However, in many domains like healthcare, privacy concerns often stand in the way of sharing data for novel use.
One promising approach to anonymizing data so that it can be shared in a privacy-preserving way is to create what is called synthetic data. Synthetic data is based on real data and attempts to mimic the properties of that real data while preserving utility for different tasks and protecting the privacy of those depicted in the original dataset.
In this work, the differential privacy (DP) framework is adopted to train a generative adversarial network (GAN) in a privacy-preserving manner to create differentially private synthetic tabular data. The quality of the synthetic data is evaluated on three criteria: its usefulness for training new models, the extent to which realistic sample quality is retained, and the strength of the privacy guarantees achieved.
The technical implementation adapts a state-of-the-art DP GAN model, the GS-WGAN by Chen, Orekondy, and Fritz, from the domain of images to that of tabular data. This model choice poses the novel question of whether privacy benefits similar to those reported with image data can be achieved with tabular data by applying privacy amplification by subsampling (PABS) to the GAN training process. The technical choices in this work also focus on theoretical synergies between the model architecture and privacy-preserving training, as well as the method's usability in a real-life scenario.
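For orientation, the core DP mechanism in GS-WGAN-style training is gradient sanitization: the gradient flowing from the (subsampled) discriminator to the generator is clipped to a fixed L2 norm and perturbed with Gaussian noise, so each update satisfies a DP guarantee that subsampling then amplifies. The sketch below is illustrative only; the function name and hyperparameters are assumptions, not the thesis's exact implementation.

```python
import numpy as np

def sanitize_gradient(grad, clip_norm=1.0, sigma=1.0, rng=None):
    """Illustrative DP gradient sanitization (Gaussian mechanism).

    Clips the gradient to L2 norm `clip_norm`, then adds Gaussian noise
    with standard deviation `sigma * clip_norm`. Hyperparameters here are
    placeholders; the thesis's actual values may differ.
    """
    rng = np.random.default_rng() if rng is None else rng
    norm = np.linalg.norm(grad)
    # Scale down only if the gradient exceeds the clipping norm.
    clipped = grad * min(1.0, clip_norm / max(norm, 1e-12))
    noise = rng.normal(0.0, sigma * clip_norm, size=grad.shape)
    return clipped + noise
```

In a subsampled training loop, each generator step would draw a random subset of the data (or, as in GS-WGAN, one of several discriminators trained on disjoint shards) and apply this sanitization to the gradient that reaches the generator, which is what makes the subsampling amplification argument applicable.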
The results show that the generated synthetic data preserves utility for training downstream classification models while attaining strong privacy guarantees. However, simultaneously retaining realistic sample quality proved to be difficult. The research presented in this thesis contributes to the field of differentially private synthetic data generation with GAN models by demonstrating that applying PABS to GAN training is an effective way to achieve stronger privacy guarantees with tabular data. The results raise important questions about whether using downstream classification accuracy as a metric can lead to synthetic data biased towards this specific task, and whether DP synthetic data should be crafted separately for different tasks to avoid loss of utility.