fbarez

Karma: 132

Martian Interpretability Challenge: The Core Problems In Interpretability

fbarez

11 Mar 2026 17:41 UTC

9 points

0 comments · 9 min read · LW link

Automated Interpretability-Driven Model Auditing and Control: A Research Agenda

fbarez

12 Jan 2026 19:55 UTC

9 points

0 comments · 1 min read · LW link

Best-of-N Jailbreaking

John Hughes, saraprice, Aengus Lynch, Rylan Schaeffer, fbarez, Henry Sleight, Ethan Perez and mrinank_sharma

14 Dec 2024 4:58 UTC

79 points

5 comments · 2 min read · LW link

(arxiv.org)

Visualizing neural network planning

Nevan Wichers, Victor Tao, fbarez and Riccardo Volpato

9 May 2024 6:40 UTC

4 points

0 comments · 5 min read · LW link

Mechanistic Interpretability Workshop Happening at ICML 2024!

Neel Nanda, LawrenceC and fbarez

3 May 2024 1:18 UTC

48 points

6 comments · 1 min read · LW link

Early Experiments in Reward Model Interpretation Using Sparse Autoencoders

lukemarks, Amirali Abdullah, Rauno Arike, fbarez and nothoughtsheadempty

3 Oct 2023 7:45 UTC

18 points

0 comments · 5 min read · LW link

Automated Sandwiching & Quantifying Human-LLM Cooperation: ScaleOversight hackathon results

Esben Kran, fbarez, Sabrina Zaki, gabrielrecc and rz2383

23 Feb 2023 10:48 UTC

8 points

0 comments · 6 min read · LW link