Johannes Treutlein

Karma: 1,624

All opinions are my own. Homepage: johannestreutlein.com

Evaluating honesty and lie detection techniques on a diverse suite of dishonest models

RowanWang, Sam Marks, Johannes Treutlein, evhub and Fabien Roger

25 Nov 2025 19:33 UTC

41 points

0 comments · 4 min read · LW link

(alignment.anthropic.com)

Building and evaluating alignment auditing agents

Sam Marks, trentbrick, RowanWang, Sam Bowman, Euan Ong, Johannes Treutlein and evhub

24 Jul 2025 19:22 UTC

47 points

1 comment · 5 min read · LW link

Modifying LLM Beliefs with Synthetic Document Finetuning

RowanWang, Johannes Treutlein, Avery, Ethan Perez, Fabien Roger and Sam Marks

24 Apr 2025 21:15 UTC

77 points

12 comments · 2 min read · LW link

(alignment.anthropic.com)

Auditing language models for hidden objectives

Sam Marks, Johannes Treutlein, dmz, Sam Bowman, Hoagy, Carson Denison, Kei Nishimura-Gasparian, 7vik, Akbir Khan, Austin Meek, Euan Ong, Christopher Olah, Fabien Roger, jeanne_, Meg, Drake Thomas, Adam Jermyn, Monte M and evhub

13 Mar 2025 19:18 UTC

153 points

15 comments · 13 min read · LW link

Alignment Faking in Large Language Models

ryan_greenblatt, evhub, Carson Denison, Benjamin Wright, Fabien Roger, Monte M, Sam Marks, Johannes Treutlein, Sam Bowman and Buck

18 Dec 2024 17:19 UTC

492 points

87 comments · 10 min read · LW link · 3 reviews

Connecting the Dots: LLMs can Infer & Verbalize Latent Structure from Training Data

Johannes Treutlein and Owain_Evans

21 Jun 2024 15:54 UTC

166 points

14 comments · 8 min read · LW link · 1 review

(arxiv.org)

Report on modeling evidential cooperation in large worlds

Johannes Treutlein

12 Jul 2023 16:37 UTC

45 points

3 comments · 1 min read · LW link

(arxiv.org)

Conditional Prediction with Zero-Sum Training Solves Self-Fulfilling Prophecies

Rubi J. Hudson and Johannes Treutlein

26 May 2023 17:44 UTC

88 points

13 comments · 24 min read · LW link

Conditioning Predictive Models: Open problems, Conclusion, and Appendix

evhub, Adam Jermyn, Johannes Treutlein, Rubi J. Hudson and kcwoolverton

10 Feb 2023 19:21 UTC

36 points

3 comments · 11 min read · LW link

Conditioning Predictive Models: Deployment strategy

evhub, Adam Jermyn, Johannes Treutlein, Rubi J. Hudson and kcwoolverton

9 Feb 2023 20:59 UTC

28 points

0 comments · 10 min read · LW link

Conditioning Predictive Models: Interactions with other approaches

evhub, Adam Jermyn, Johannes Treutlein, Rubi J. Hudson and kcwoolverton

8 Feb 2023 18:19 UTC

32 points

2 comments · 11 min read · LW link

Conditioning Predictive Models: Making inner alignment as easy as possible

evhub, Adam Jermyn, Johannes Treutlein, Rubi J. Hudson and kcwoolverton

7 Feb 2023 20:04 UTC

33 points

2 comments · 19 min read · LW link

Conditioning Predictive Models: The case for competitiveness

evhub, Adam Jermyn, Johannes Treutlein, Rubi J. Hudson and kcwoolverton

6 Feb 2023 20:08 UTC

20 points

3 comments · 11 min read · LW link

Conditioning Predictive Models: Outer alignment via careful conditioning

evhub, Adam Jermyn, Johannes Treutlein, Rubi J. Hudson and kcwoolverton

2 Feb 2023 20:28 UTC

72 points

15 comments · 57 min read · LW link

Conditioning Predictive Models: Large language models as predictors

evhub, Adam Jermyn, Johannes Treutlein, Rubi J. Hudson and kcwoolverton

2 Feb 2023 20:28 UTC

89 points

4 comments · 13 min read · LW link

Stop-gradients lead to fixed point predictions

Johannes Treutlein, Caspar Oesterheld, Rubi J. Hudson and Emery Cooper

28 Jan 2023 22:47 UTC

37 points

2 comments · 24 min read · LW link

Underspecification of Oracle AI

Rubi J. Hudson, Adam Jermyn and Johannes Treutlein

15 Jan 2023 20:10 UTC

30 points

12 comments · 19 min read · LW link

Proper scoring rules don’t guarantee predicting fixed points

Johannes Treutlein, Rubi J. Hudson and Caspar Oesterheld

16 Dec 2022 18:22 UTC

80 points

8 comments · 21 min read · LW link

Response to Katja Grace’s AI x-risk counterarguments

Erik Jenner and Johannes Treutlein

19 Oct 2022 1:17 UTC

77 points

18 comments · 15 min read · LW link

Training goals for large language models

Johannes Treutlein

18 Jul 2022 7:09 UTC

28 points

5 comments · 19 min read · LW link