
The False Privacy of Synthetic Data
Why generated data doesn't solve LLM privacy concerns
This research shows that fine-tuning LLMs on synthetic data fails to resolve privacy risks: the generated data can retain, and leak, personal information from the original dataset.
- 30-60% of synthetic data samples contained Personally Identifiable Information (PII) carried over from the original dataset (a leakage-scan sketch follows this list)
- Fine-tuned models remained vulnerable to membership inference attacks even when trained only on synthetic data (see the audit sketch at the end of this note)
- The more capable the generating LLM, the higher the risk of privacy leakage in the synthetic data it produces
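
To make the first finding concrete, here is a minimal sketch of how PII carry-over could be measured: extract PII-looking strings from the original corpus, then count how many synthetic samples reuse any of them. The regex patterns and function names are illustrative assumptions, not the methodology used in the research; a real audit would use a proper PII detector rather than two regexes.

```python
import re

# Illustrative patterns only; a production audit would use a dedicated
# PII detector (e.g., NER-based) instead of these two regexes.
PII_PATTERNS = [
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),           # email addresses
    re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"),  # US-style phone numbers
]

def extract_pii(text: str) -> set[str]:
    """Collect all PII-looking substrings from one document."""
    return {m for p in PII_PATTERNS for m in p.findall(text)}

def leakage_rate(original_docs: list[str], synthetic_docs: list[str]) -> float:
    """Fraction of synthetic samples that reuse PII from the originals."""
    original_pii = set().union(*(extract_pii(d) for d in original_docs))
    leaked = sum(
        1 for doc in synthetic_docs
        if extract_pii(doc) & original_pii  # any overlap counts as a leak
    )
    return leaked / max(len(synthetic_docs), 1)

# Hypothetical usage:
# rate = leakage_rate(train_corpus, generated_corpus)
# print(f"{rate:.1%} of synthetic samples reuse PII from the source data")
```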
These findings challenge the assumption that synthetic data by itself provides adequate privacy protection when training language models; organizations need additional safeguards even when working with generated datasets.
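
As one such safeguard, a team can audit its own fine-tuned model with a simple loss-based membership inference test before release: training samples tend to receive an unusually low loss. The sketch below is an assumption-laden illustration, not the attack used in the research; the model name is a placeholder for the checkpoint under audit, and the threshold must be calibrated on data known to be outside the training set.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical checkpoint name; substitute the fine-tuned model under audit.
MODEL_NAME = "my-org/finetuned-model"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()

@torch.no_grad()
def sample_loss(text: str) -> float:
    """Per-token negative log-likelihood the model assigns to a text."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    return model(ids, labels=ids).loss.item()

def is_likely_member(text: str, threshold: float) -> bool:
    """Loss-based membership test: flag a sample as a probable training
    member if its loss falls below a threshold calibrated on held-out,
    known-non-member data."""
    return sample_loss(text) < threshold
```

If such an audit flags many records from the original dataset even though the model only ever saw synthetic data, that is direct evidence of the leakage the research describes.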