Detecting Copyright Infringement in AI Models

DIS-CO is a technique for identifying whether copyrighted material was included in the training data of vision-language models (VLMs), without requiring direct access to the training datasets.

Builds on the hypothesis that a VLM can recognize images that appeared in its training corpus
Infers the identity of the content by repeatedly querying the model with individual frames from the target work
Demonstrates effective identification of copyrighted content across various media types
Raises important implications for intellectual property rights in AI development
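
The querying loop described in the points above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `query_model`, `identification_rate`, `normalize`, and the `queries_per_frame` parameter are hypothetical names, and the stand-in model operates on frame labels rather than real images.

```python
from typing import Callable, Iterable


def normalize(title: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace in a title."""
    kept = "".join(ch for ch in title.lower() if ch.isalnum() or ch.isspace())
    return " ".join(kept.split())


def identification_rate(
    query_model: Callable[[str], str],  # hypothetical: frame in, free-form title guess out
    frames: Iterable[str],
    true_title: str,
    queries_per_frame: int = 3,  # assumed value, not taken from the paper
) -> float:
    """Return the fraction of frames for which at least one query names the true title."""
    target = normalize(true_title)
    frame_list = list(frames)
    if not frame_list:
        return 0.0
    hits = 0
    for frame in frame_list:
        # Query the same frame several times, since model outputs may vary between calls.
        guesses = [normalize(query_model(frame)) for _ in range(queries_per_frame)]
        if target in guesses:
            hits += 1
    return hits / len(frame_list)


def stub_model(frame: str) -> str:
    """Stand-in for a real VLM call: 'recognizes' only frames labelled as the target work."""
    return "The Matrix." if frame.startswith("matrix") else "Unknown film"


demo_frames = ["matrix_001", "matrix_002", "other_001", "matrix_003"]
print(identification_rate(stub_model, demo_frames, "The Matrix"))  # prints 0.75
```

In a real setting, `query_model` would wrap a call to the VLM under test, and a high identification rate would be a signal worth investigating rather than proof on its own, since a model could also name a work from widely known imagery.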

This research addresses a pressing concern in the AI industry: it gives rights holders a way to check for potential copyright infringement and helps model developers demonstrate compliance with intellectual property law.

DIS-CO: Discovering Copyrighted Content in VLMs Training Data