
Securing Code LLMs Against Harmful Content
A novel automated framework for robust content moderation in code generation
This research introduces a specialized framework for testing and improving content moderation in Code Large Language Models to prevent harmful naming patterns in generated code.
- Identifies critical security vulnerabilities in current code generation moderation systems
- Presents CHT (Code Harmfulness Testing) framework that automatically detects harmful code naming patterns
- Demonstrates effectiveness through evaluation of popular Code LLMs
- Provides actionable insights for strengthening security guardrails in generative AI for developers
This research is vital for organizations implementing Code LLMs, as it addresses specific security risks posed by malicious naming patterns that could bypass conventional content filters, helping to build safer AI coding assistants.
Automated Harmfulness Testing for Code Large Language Models