Johannes Treutlein
Karma: 1,609
All opinions are my own. Homepage: johannestreutlein.com
Evaluating honesty and lie detection techniques on a diverse suite of dishonest models
RowanWang, Sam Marks, Johannes Treutlein, evhub and Fabien Roger
25 Nov 2025 19:33 UTC, 40 points, 0 comments, 4 min read (alignment.anthropic.com)
Building and evaluating alignment auditing agents
Sam Marks, trentbrick, RowanWang, Sam Bowman, Euan Ong, Johannes Treutlein and evhub
24 Jul 2025 19:22 UTC, 47 points, 1 comment, 5 min read
Modifying LLM Beliefs with Synthetic Document Finetuning
RowanWang, Johannes Treutlein, Avery, Ethan Perez, Fabien Roger and Sam Marks
24 Apr 2025 21:15 UTC, 70 points, 12 comments, 2 min read (alignment.anthropic.com)
Auditing language models for hidden objectives
Sam Marks, Johannes Treutlein, dmz, Sam Bowman, Hoagy, Carson Denison, Kei Nishimura-Gasparian, 7vik, Akbir Khan, Austin Meek, Euan Ong, Christopher Olah, Fabien Roger, jeanne_, Meg, Drake Thomas, Adam Jermyn, Monte M and evhub
13 Mar 2025 19:18 UTC, 149 points, 15 comments, 13 min read
Alignment Faking in Large Language Models
ryan_greenblatt, evhub, Carson Denison, Benjamin Wright, Fabien Roger, Monte M, Sam Marks, Johannes Treutlein, Sam Bowman and Buck
18 Dec 2024 17:19 UTC, 491 points, 86 comments, 10 min read, 3 reviews
Connecting the Dots: LLMs can Infer & Verbalize Latent Structure from Training Data
Johannes Treutlein and Owain_Evans
21 Jun 2024 15:54 UTC, 164 points, 14 comments, 8 min read, 1 review (arxiv.org)
Report on modeling evidential cooperation in large worlds
Johannes Treutlein
12 Jul 2023 16:37 UTC, 45 points, 3 comments, 1 min read (arxiv.org)
Conditional Prediction with Zero-Sum Training Solves Self-Fulfilling Prophecies
Rubi J. Hudson and Johannes Treutlein
26 May 2023 17:44 UTC, 88 points, 13 comments, 24 min read
Conditioning Predictive Models: Open problems, Conclusion, and Appendix
evhub, Adam Jermyn, Johannes Treutlein, Rubi J. Hudson and kcwoolverton
10 Feb 2023 19:21 UTC, 36 points, 3 comments, 11 min read
Conditioning Predictive Models: Deployment strategy
evhub, Adam Jermyn, Johannes Treutlein, Rubi J. Hudson and kcwoolverton
9 Feb 2023 20:59 UTC, 28 points, 0 comments, 10 min read
Conditioning Predictive Models: Interactions with other approaches
evhub, Adam Jermyn, Johannes Treutlein, Rubi J. Hudson and kcwoolverton
8 Feb 2023 18:19 UTC, 32 points, 2 comments, 11 min read
Conditioning Predictive Models: Making inner alignment as easy as possible
evhub, Adam Jermyn, Johannes Treutlein, Rubi J. Hudson and kcwoolverton
7 Feb 2023 20:04 UTC, 33 points, 2 comments, 19 min read
Conditioning Predictive Models: The case for competitiveness
evhub, Adam Jermyn, Johannes Treutlein, Rubi J. Hudson and kcwoolverton
6 Feb 2023 20:08 UTC, 20 points, 3 comments, 11 min read
Conditioning Predictive Models: Outer alignment via careful conditioning
evhub, Adam Jermyn, Johannes Treutlein, Rubi J. Hudson and kcwoolverton
2 Feb 2023 20:28 UTC, 72 points, 15 comments, 57 min read
Conditioning Predictive Models: Large language models as predictors
evhub, Adam Jermyn, Johannes Treutlein, Rubi J. Hudson and kcwoolverton
2 Feb 2023 20:28 UTC, 89 points, 4 comments, 13 min read
Stop-gradients lead to fixed point predictions
Johannes Treutlein, Caspar Oesterheld, Rubi J. Hudson and Emery Cooper
28 Jan 2023 22:47 UTC, 37 points, 2 comments, 24 min read
Underspecification of Oracle AI
Rubi J. Hudson, Adam Jermyn and Johannes Treutlein
15 Jan 2023 20:10 UTC, 30 points, 12 comments, 19 min read
Proper scoring rules don’t guarantee predicting fixed points
Johannes Treutlein, Rubi J. Hudson and Caspar Oesterheld
16 Dec 2022 18:22 UTC, 80 points, 8 comments, 21 min read
Response to Katja Grace’s AI x-risk counterarguments
Erik Jenner and Johannes Treutlein
19 Oct 2022 1:17 UTC, 77 points, 18 comments, 15 min read
Training goals for large language models
Johannes Treutlein
18 Jul 2022 7:09 UTC, 28 points, 5 comments, 19 min read