RSS

Johannes Treutlein

Karma: 1,609

All opinions are my own. Homepage: johannestreutlein.com

Eval­u­at­ing hon­esty and lie de­tec­tion tech­niques on a di­verse suite of dishon­est models

25 Nov 2025 19:33 UTC
40 points
0 comments4 min readLW link
(alignment.anthropic.com)

Build­ing and eval­u­at­ing al­ign­ment au­dit­ing agents

24 Jul 2025 19:22 UTC
47 points
1 comment5 min readLW link

Mod­ify­ing LLM Beliefs with Syn­thetic Doc­u­ment Finetuning

24 Apr 2025 21:15 UTC
70 points
12 comments2 min readLW link
(alignment.anthropic.com)

Au­dit­ing lan­guage mod­els for hid­den objectives

13 Mar 2025 19:18 UTC
149 points
15 comments13 min readLW link

Align­ment Fak­ing in Large Lan­guage Models

18 Dec 2024 17:19 UTC
491 points
86 comments10 min readLW link3 reviews

Con­nect­ing the Dots: LLMs can In­fer & Ver­bal­ize La­tent Struc­ture from Train­ing Data

21 Jun 2024 15:54 UTC
164 points
14 comments8 min readLW link1 review
(arxiv.org)

Re­port on mod­el­ing ev­i­den­tial co­op­er­a­tion in large worlds

Johannes Treutlein12 Jul 2023 16:37 UTC
45 points
3 comments1 min readLW link
(arxiv.org)

Con­di­tional Pre­dic­tion with Zero-Sum Train­ing Solves Self-Fulfilling Prophecies

26 May 2023 17:44 UTC
88 points
13 comments24 min readLW link

Con­di­tion­ing Pre­dic­tive Models: Open prob­lems, Con­clu­sion, and Appendix

10 Feb 2023 19:21 UTC
36 points
3 comments11 min readLW link

Con­di­tion­ing Pre­dic­tive Models: De­ploy­ment strategy

9 Feb 2023 20:59 UTC
28 points
0 comments10 min readLW link

Con­di­tion­ing Pre­dic­tive Models: In­ter­ac­tions with other approaches

8 Feb 2023 18:19 UTC
32 points
2 comments11 min readLW link

Con­di­tion­ing Pre­dic­tive Models: Mak­ing in­ner al­ign­ment as easy as possible

7 Feb 2023 20:04 UTC
33 points
2 comments19 min readLW link

Con­di­tion­ing Pre­dic­tive Models: The case for competitiveness

6 Feb 2023 20:08 UTC
20 points
3 comments11 min readLW link

Con­di­tion­ing Pre­dic­tive Models: Outer al­ign­ment via care­ful conditioning

2 Feb 2023 20:28 UTC
72 points
15 comments57 min readLW link

Con­di­tion­ing Pre­dic­tive Models: Large lan­guage mod­els as predictors

2 Feb 2023 20:28 UTC
89 points
4 comments13 min readLW link

Stop-gra­di­ents lead to fixed point predictions

28 Jan 2023 22:47 UTC
37 points
2 comments24 min readLW link

Un­der­speci­fi­ca­tion of Or­a­cle AI

15 Jan 2023 20:10 UTC
30 points
12 comments19 min readLW link

Proper scor­ing rules don’t guaran­tee pre­dict­ing fixed points

16 Dec 2022 18:22 UTC
80 points
8 comments21 min readLW link

Re­sponse to Katja Grace’s AI x-risk counterarguments

19 Oct 2022 1:17 UTC
77 points
18 comments15 min readLW link

Train­ing goals for large lan­guage models

Johannes Treutlein18 Jul 2022 7:09 UTC
28 points
5 comments19 min readLW link