
Johannes Treutlein

Karma: 1,534

All opinions are my own. Homepage: johannestreutlein.com

Modifying LLM Beliefs with Synthetic Document Finetuning

Apr 24, 2025, 9:15 PM
70 points
12 comments · 2 min read · LW link
(alignment.anthropic.com)

Auditing language models for hidden objectives

Mar 13, 2025, 7:18 PM
141 points
15 comments · 13 min read · LW link

Alignment Faking in Large Language Models

Dec 18, 2024, 5:19 PM
485 points
75 comments · 10 min read · LW link

Connecting the Dots: LLMs can Infer & Verbalize Latent Structure from Training Data

Jun 21, 2024, 3:54 PM
163 points
13 comments · 8 min read · LW link
(arxiv.org)

Report on modeling evidential cooperation in large worlds

Johannes Treutlein · Jul 12, 2023, 4:37 PM
45 points
3 comments · 1 min read · LW link
(arxiv.org)

Conditional Prediction with Zero-Sum Training Solves Self-Fulfilling Prophecies

May 26, 2023, 5:44 PM
88 points
13 comments · 24 min read · LW link

Conditioning Predictive Models: Open problems, Conclusion, and Appendix

Feb 10, 2023, 7:21 PM
36 points
3 comments · 11 min read · LW link

Conditioning Predictive Models: Deployment strategy

Feb 9, 2023, 8:59 PM
28 points
0 comments · 10 min read · LW link

Conditioning Predictive Models: Interactions with other approaches

Feb 8, 2023, 6:19 PM
32 points
2 comments · 11 min read · LW link

Conditioning Predictive Models: Making inner alignment as easy as possible

Feb 7, 2023, 8:04 PM
27 points
2 comments · 19 min read · LW link

Conditioning Predictive Models: The case for competitiveness

Feb 6, 2023, 8:08 PM
20 points
3 comments · 11 min read · LW link

Conditioning Predictive Models: Outer alignment via careful conditioning

Feb 2, 2023, 8:28 PM
72 points
15 comments · 57 min read · LW link

Conditioning Predictive Models: Large language models as predictors

Feb 2, 2023, 8:28 PM
88 points
4 comments · 13 min read · LW link

Stop-gradients lead to fixed point predictions

Jan 28, 2023, 10:47 PM
37 points
2 comments · 24 min read · LW link

Underspecification of Oracle AI

Jan 15, 2023, 8:10 PM
30 points
12 comments · 19 min read · LW link

Proper scoring rules don’t guarantee predicting fixed points

Dec 16, 2022, 6:22 PM
79 points
8 comments · 21 min read · LW link

Response to Katja Grace’s AI x-risk counterarguments

19 Oct 2022 1:17 UTC
77 points
18 comments · 15 min read · LW link

Training goals for large language models

Johannes Treutlein · 18 Jul 2022 7:09 UTC
28 points
5 comments · 19 min read · LW link

Request for input on multiverse-wide superrationality (MSR)

Johannes Treutlein · 14 Aug 2018 17:29 UTC
18 points
3 comments · 1 min read · LW link
(effective-altruism.com)

A behaviorist approach to building phenomenological bridges

Johannes Treutlein · 20 Nov 2017 19:36 UTC
4 points
0 comments · 1 min read · LW link
(casparoesterheld.com)