Security of LLM Activation Functions and Architecture
Research on how architectural components such as activation functions affect the safety and security properties of LLMs

Hidden Dangers in LLM Optimization
How activation approximations compromise safety in aligned models

Multi-Dimensional Safety in LLM Alignment
Revealing the hidden complexity of safety mechanisms in language models

The Geometry of LLM Refusals
Uncovering multiple refusal concepts in language models

Unlocking Precise Control of AI Behavior
Sparse Activation Steering: a new approach to LLM alignment
