Florian_Dietz
Karma: 583
Context Modification as a Negative Alignment Tax (Florian_Dietz, 10 May 2026 11:32 UTC; 7 points, 0 comments, 4 min read)
Context Modification as a Negative Alignment Tax (Florian_Dietz, 8 May 2026 10:56 UTC; 5 points, 0 comments, 5 min read)
Context Modification as a Negative Alignment Tax (Florian_Dietz, 7 May 2026 17:34 UTC; 5 points, 0 comments, 5 min read)
Positive Feedback Only (Florian_Dietz, 5 May 2026 21:28 UTC; 18 points, 0 comments, 8 min read)
Split Personality Training can detect Alignment Faking (Florian_Dietz, 4 Mar 2026 11:49 UTC; 34 points, 0 comments, 6 min read)
Already Optimized (Florian_Dietz, 18 Feb 2026 10:01 UTC; 52 points, 14 comments, 14 min read)
Deception Channeling: Training Models to Always Verbalize Alignment Faking (Florian_Dietz, 17 Feb 2026 22:28 UTC; 7 points, 2 comments, 9 min read)
The Missing Sequence: Why Correct Analysis Makes Terrible Action Guides (Florian_Dietz, 15 Feb 2026 17:24 UTC; 15 points, 6 comments, 6 min read)
Deliberate Epistemic Uncertainty: An Automated Experiment on AI Self-Reporting (Florian_Dietz, 14 Feb 2026 15:13 UTC; 13 points, 0 comments, 8 min read)
An Ablation Study on the Role of [Untranslatable] in Cooperative Equilibrium Formation: Emergent Rationalization Under Missing Primitives (Florian_Dietz, 31 Jan 2026 18:03 UTC; 22 points, 5 comments, 11 min read)
Split Personality Training: Revealing Latent Knowledge Through Alternate Personalities (Research Report) (Florian_Dietz, 12 Jan 2026 12:29 UTC; 87 points, 41 comments, 26 min read)
Deliberative Credit Assignment (DCA): Making Faithful Reasoning Profitable (Florian_Dietz, 29 Jul 2025 16:23 UTC; 9 points, 0 comments, 17 min read)
Deliberative Credit Assignment: Making Faithful Reasoning Profitable (Florian_Dietz, 14 Jul 2025 9:26 UTC; 10 points, 3 comments, 17 min read)
Edge Cases in AI Alignment (Florian_Dietz, 24 Mar 2025 9:27 UTC; 19 points, 3 comments, 4 min read)
Split Personality Training: Revealing Latent Knowledge Through Personality-Shift Tokens (Florian_Dietz, 10 Mar 2025 16:07 UTC; 49 points, 7 comments, 9 min read)
Do we want alignment faking? (Florian_Dietz, 28 Feb 2025 21:50 UTC; 7 points, 4 comments, 1 min read)
Revealing alignment faking with a single prompt (Florian_Dietz, 29 Jan 2025 21:01 UTC; 9 points, 5 comments, 4 min read)
Florian_Dietz’s Shortform (Florian_Dietz, 1 Jan 2025 14:27 UTC; 3 points, 60 comments, 1 min read)
Achieving AI Alignment through Deliberate Uncertainty in Multiagent Systems (Florian_Dietz, 17 Feb 2024 8:45 UTC; 4 points, 0 comments, 13 min read)
Understanding differences between humans and intelligence-in-general to build safe AGI (Florian_Dietz, 16 Aug 2022 8:27 UTC; 7 points, 8 comments, 1 min read)