Identifying "Deception Vectors" In Models

Link post

Using representation engineering, we systematically induce, detect, and control such deception in CoT-enabled LLMs, extracting "deception vectors" via Linear Artificial Tomography (LAT) with 89% detection accuracy. Through activation steering, we achieve a 40% success rate in eliciting context-appropriate deception without explicit prompts, revealing an honesty-related failure mode specific to reasoning models and providing tools for trustworthy AI alignment.

This seems like a positive breakthrough for mech interp research generally: the team used RepE to identify features and were able to "reliably suppress or induce strategic deception".
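To make the "activation steering" step concrete, below is a minimal sketch of how a steering vector can be added to a model's residual stream at inference time. This is not the paper's code: the GPT-2 model, the layer index, the steering coefficient, and the random placeholder vector are all assumptions for illustration; in a LAT-style pipeline the direction would instead be extracted from contrasting activations on honest versus deceptive prompt pairs.

```python
# Minimal activation-steering sketch (illustrative, not the paper's setup).
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model.eval()

LAYER = 6    # hypothetical residual-stream layer to steer
COEFF = 4.0  # hypothetical steering strength

# Placeholder direction: a real "deception vector" would come from a
# LAT-style extraction over contrastive honest/deceptive activations.
deception_vector = torch.randn(model.config.n_embd)
deception_vector = deception_vector / deception_vector.norm()

def steering_hook(module, inputs, output):
    # GPT-2 blocks return a tuple; the first element is the hidden states
    # of shape (batch, seq, hidden). Adding the vector broadcasts over
    # batch and sequence positions.
    hidden = output[0] + COEFF * deception_vector.to(output[0].dtype)
    return (hidden,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(steering_hook)
try:
    prompt = "Q: Did you complete the task correctly?\nA:"
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    out = model.generate(
        ids,
        max_new_tokens=40,
        do_sample=False,
        pad_token_id=tokenizer.eos_token_id,
    )
    print(tokenizer.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()  # detach the hook so later calls run unsteered
```

Flipping the sign of the coefficient is the standard way such a setup is used to suppress rather than induce the behavior, which matches the "reliably suppress or induce" framing above.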