Florian_Dietz

Karma: 561

Split Personality Training can detect Alignment Faking

Florian_Dietz, 4 Mar 2026 11:49 UTC
34 points
0 comments, 6 min read

Already Optimized

Florian_Dietz, 18 Feb 2026 10:01 UTC
52 points
14 comments, 14 min read

Deception Channeling: Training Models to Always Verbalize Alignment Faking

Florian_Dietz, 17 Feb 2026 22:28 UTC
7 points
2 comments, 9 min read

The Missing Sequence: Why Correct Analysis Makes Terrible Action Guides

Florian_Dietz, 15 Feb 2026 17:24 UTC
15 points
6 comments, 6 min read

Deliberate Epistemic Uncertainty: An Automated Experiment on AI Self-Reporting

Florian_Dietz, 14 Feb 2026 15:13 UTC
13 points
0 comments, 8 min read