Protecting Data Ownership in RAG Systems

This research introduces an approach for detecting unauthorized use of datasets in retrieval-augmented LLMs using watermarked canaries: specially crafted data entries that can serve as proof of dataset ownership.

Detects unauthorized dataset use with 95-99% accuracy
Creates synthetic, watermarked data that blends naturally with authentic content
Provides dataset owners with tools to identify IP infringement
Demonstrates effectiveness across various LLM architectures and retrieval methods
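
To make the canary idea concrete, here is a minimal sketch of how an owner might plant canaries and later test a suspect system for them. It is a simplification under stated assumptions, not the paper's method: the paper watermarks synthetic text, whereas this sketch uses a plain keyed marker string, and the names make_canary, detection_rate, and toy_rag are all hypothetical. The toy_rag function is only a stand-in for a black-box RAG system so the example runs on its own.

```python
import hashlib
import hmac
import re


def make_canary(secret_key: bytes, index: int) -> dict:
    """Build one hypothetical canary: a fictional entry carrying a keyed marker.

    Assumption: a 12-character HMAC prefix stands in for a real text watermark.
    """
    digest = hmac.new(secret_key, f"canary-{index}".encode(), hashlib.sha256).hexdigest()
    marker = digest[:12]
    return {
        "query": f"What is the registry code for archive entry {index}?",
        "text": f"Archive entry {index} is filed under registry code {marker}.",
        "marker": marker,
    }


def detection_rate(canaries: list, ask) -> float:
    """Fraction of canary queries whose answer reveals the canary's marker.

    `ask` is any callable mapping a query string to the suspect system's reply.
    """
    hits = sum(1 for canary in canaries if canary["marker"] in ask(canary["query"]))
    return hits / len(canaries)


def toy_rag(corpus: list):
    """Stand-in for a suspect system: returns the document with the most shared words."""

    def tokens(text: str) -> set:
        return set(re.findall(r"\w+", text.lower()))

    def ask(query: str) -> str:
        query_tokens = tokens(query)
        return max(corpus, key=lambda doc: len(query_tokens & tokens(doc)), default="")

    return ask
```

In use, the owner would mix the canary texts into the dataset before release, then send each canary query to the suspect system. A high detection rate suggests the protected dataset is in the retrieval corpus, while a system built on unrelated data should score near zero. The threshold for raising a claim is a design choice this sketch leaves open.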

As RAG systems become standard in AI deployments, this technique gives organizations a way to check whether their proprietary datasets have been used without authorization and to support intellectual property claims.

Dataset Protection via Watermarked Canaries in Retrieval-Augmented LLMs