AI Safety Thursday: Monitoring LLMs for deceptive behaviour using probes

LLMs show deceptive behaviour when they have an incentive to do so, whether that’s alignment faking or lying about their capabilities. Work from Apollo earlier this year proposed using linear probes that detect such behaviour from the model’s internal activations.

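For those new to the technique, a linear probe is just a simple classifier (typically logistic regression) trained on a model’s hidden-state activations to predict a label such as “honest” vs. “deceptive”. Below is a minimal, illustrative sketch, not the Apollo setup: the activations are synthetic placeholders, and in practice they would be extracted from the LLM at a chosen layer and token position.

```python
# Minimal sketch of a linear probe for deception monitoring (illustrative only).
# The activations below are synthetic stand-ins; in a real setup they would be
# hidden states extracted from the LLM at a chosen layer and token position,
# labelled by whether the corresponding response was honest or deceptive.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
d_model = 512          # hypothetical residual-stream width
n_per_class = 500

# Placeholder activations with a small mean shift between the two classes.
honest_acts = rng.normal(0.0, 1.0, size=(n_per_class, d_model))
deceptive_acts = rng.normal(0.2, 1.0, size=(n_per_class, d_model))

X = np.vstack([honest_acts, deceptive_acts])
y = np.concatenate([np.zeros(n_per_class), np.ones(n_per_class)])  # 1 = deceptive

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# The "probe" is a single linear direction plus a bias, read out with a sigmoid.
probe = LogisticRegression(max_iter=1000, C=0.1)
probe.fit(X_train, y_train)

deception_scores = probe.predict_proba(X_test)[:, 1]
print(f"Probe AUROC on held-out examples: {roc_auc_score(y_test, deception_scores):.3f}")
```

The learned weight vector defines a single direction in activation space, so scoring a new response is just a dot product plus a sigmoid, which is what makes probes cheap enough to run as a monitor.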
In this talk, Shivam Arora will explain how these probes work and share his experience from follow-up research to improve them, conducted as part of a fellowship at LASR Labs.

Registration Instructions
This is a paid event ($5 general admission, free for students & job seekers) with limited tickets—you must RSVP on Luma to secure your spot.

Event Schedule
6:00 pm to 6:30 pm - Food & Introductions
6:30 pm to 7:30 pm - Main Presentation & Questions
7:30 pm to 9:00 pm - Open Discussion

If you can’t attend in person, join our live stream starting at 6:30 pm via this link.

This is part of our weekly AI Safety Thursdays series. Join us in examining questions like:

  • How do we ensure AI systems are aligned with human interests?

  • How do we measure and mitigate potential risks from advanced AI systems?

  • What does safer AI development look like?

  • What governance structures are needed for safer AI?
