Florian_Dietz

Karma: 561

Split Personality Training can detect Alignment Faking

Florian_Dietz · 4 Mar 2026 11:49 UTC
34 points
0 comments · 6 min read · LW link

Already Optimized

Florian_Dietz · 18 Feb 2026 10:01 UTC
52 points
14 comments · 14 min read · LW link

Deception Channeling: Training Models to Always Verbalize Alignment Faking

Florian_Dietz · 17 Feb 2026 22:28 UTC
7 points
2 comments · 9 min read · LW link

The Missing Sequence: Why Correct Analysis Makes Terrible Action Guides

Florian_Dietz · 15 Feb 2026 17:24 UTC
15 points
6 comments · 6 min read · LW link

Deliberate Epistemic Uncertainty: An Automated Experiment on AI Self-Reporting

Florian_Dietz · 14 Feb 2026 15:13 UTC
13 points
0 comments · 8 min read · LW link

An Ablation Study on the Role of [Untranslatable] in Cooperative Equilibrium Formation: Emergent Rationalization Under Missing Primitives

Florian_Dietz · 31 Jan 2026 18:03 UTC
22 points
5 comments · 11 min read · LW link

Split Personality Training: Revealing Latent Knowledge Through Alternate Personalities (Research Report)

Florian_Dietz · 12 Jan 2026 12:29 UTC
86 points
41 comments · 26 min read · LW link

Deliberative Credit Assignment (DCA): Making Faithful Reasoning Profitable

Florian_Dietz · 29 Jul 2025 16:23 UTC
9 points
0 comments · 17 min read · LW link

Deliberative Credit Assignment: Making Faithful Reasoning Profitable

Florian_Dietz · 14 Jul 2025 9:26 UTC
10 points
3 comments · 17 min read · LW link

Edge Cases in AI Alignment

Florian_Dietz · 24 Mar 2025 9:27 UTC
19 points
3 comments · 4 min read · LW link

Split Personality Training: Revealing Latent Knowledge Through Personality-Shift Tokens

Florian_Dietz · 10 Mar 2025 16:07 UTC
49 points
7 comments · 9 min read · LW link

Do we want alignment faking?

Florian_Dietz · 28 Feb 2025 21:50 UTC
7 points
4 comments · 1 min read · LW link

Revealing alignment faking with a single prompt

Florian_Dietz · 29 Jan 2025 21:01 UTC
9 points
5 comments · 4 min read · LW link

Florian_Dietz’s Shortform

Florian_Dietz · 1 Jan 2025 14:27 UTC
3 points
52 comments · 1 min read · LW link

Achieving AI Alignment through Deliberate Uncertainty in Multiagent Systems

Florian_Dietz · 17 Feb 2024 8:45 UTC
4 points
0 comments · 13 min read · LW link

Understanding differences between humans and intelligence-in-general to build safe AGI

Florian_Dietz · 16 Aug 2022 8:27 UTC
7 points
8 comments · 1 min read · LW link

logic puzzles and loophole abuse

Florian_Dietz · 30 Sep 2017 15:45 UTC
3 points
4 comments · 3 min read · LW link

a different perspective on physics

Florian_Dietz · 26 Jun 2017 22:47 UTC
0 points
15 comments · 3 min read · LW link

Teaching an AI not to cheat?

Florian_Dietz · 20 Dec 2016 14:37 UTC
5 points
12 comments · 1 min read · LW link

controlling AI behavior through unusual axiomatic probabilities

Florian_Dietz · 8 Jan 2015 17:00 UTC
5 points
11 comments · 1 min read · LW link