I recently gave this talk at the Safety-Guaranteed LLMs workshop:
The talk is about ARC’s work on low probability estimation (LPE), covering:
- Theoretical motivation for LPE and (towards the end) activation modeling approaches (both described here)
- Empirical work on LPE in language models (described here)
- Recent work-in-progress on theoretical results