Posts by Florian_Dietz (Karma: 561)
- Split Personality Training can detect Alignment Faking (4 Mar 2026 11:49 UTC) · 34 points · 0 comments · 6 min read
- Already Optimized (18 Feb 2026 10:01 UTC) · 52 points · 14 comments · 14 min read
- Deception Channeling: Training Models to Always Verbalize Alignment Faking (17 Feb 2026 22:28 UTC) · 7 points · 2 comments · 9 min read
- The Missing Sequence: Why Correct Analysis Makes Terrible Action Guides (15 Feb 2026 17:24 UTC) · 15 points · 6 comments · 6 min read
- Deliberate Epistemic Uncertainty: An Automated Experiment on AI Self-Reporting (14 Feb 2026 15:13 UTC) · 13 points · 0 comments · 8 min read
- An Ablation Study on the Role of [Untranslatable] in Cooperative Equilibrium Formation: Emergent Rationalization Under Missing Primitives (31 Jan 2026 18:03 UTC) · 22 points · 5 comments · 11 min read
- Split Personality Training: Revealing Latent Knowledge Through Alternate Personalities (Research Report) (12 Jan 2026 12:29 UTC) · 86 points · 41 comments · 26 min read
- Deliberative Credit Assignment (DCA): Making Faithful Reasoning Profitable (29 Jul 2025 16:23 UTC) · 9 points · 0 comments · 17 min read
- Deliberative Credit Assignment: Making Faithful Reasoning Profitable (14 Jul 2025 9:26 UTC) · 10 points · 3 comments · 17 min read
- Edge Cases in AI Alignment (24 Mar 2025 9:27 UTC) · 19 points · 3 comments · 4 min read
- Split Personality Training: Revealing Latent Knowledge Through Personality-Shift Tokens (10 Mar 2025 16:07 UTC) · 49 points · 7 comments · 9 min read
- Do we want alignment faking? (28 Feb 2025 21:50 UTC) · 7 points · 4 comments · 1 min read
- Revealing alignment faking with a single prompt (29 Jan 2025 21:01 UTC) · 9 points · 5 comments · 4 min read
- Florian_Dietz’s Shortform (1 Jan 2025 14:27 UTC) · 3 points · 52 comments · 1 min read
- Achieving AI Alignment through Deliberate Uncertainty in Multiagent Systems (17 Feb 2024 8:45 UTC) · 4 points · 0 comments · 13 min read
- Understanding differences between humans and intelligence-in-general to build safe AGI (16 Aug 2022 8:27 UTC) · 7 points · 8 comments · 1 min read
- logic puzzles and loophole abuse (30 Sep 2017 15:45 UTC) · 3 points · 4 comments · 3 min read
- a different perspective on physics (26 Jun 2017 22:47 UTC) · 0 points · 15 comments · 3 min read
- Teaching an AI not to cheat? (20 Dec 2016 14:37 UTC) · 5 points · 12 comments · 1 min read
- controlling AI behavior through unusual axiomatic probabilities (8 Jan 2015 17:00 UTC) · 5 points · 11 comments · 1 min read