Florian_Dietz

Karma: 561

Split Personality Training can detect Alignment Faking

Florian_Dietz · 4 Mar 2026 11:49 UTC
34 points
0 comments · 6 min read · LW link

Already Optimized

Florian_Dietz · 18 Feb 2026 10:01 UTC
52 points
14 comments · 14 min read · LW link

Deception Channeling: Training Models to Always Verbalize Alignment Faking

Florian_Dietz · 17 Feb 2026 22:28 UTC
7 points
2 comments · 9 min read · LW link

The Missing Sequence: Why Correct Analysis Makes Terrible Action Guides

Florian_Dietz · 15 Feb 2026 17:24 UTC
15 points
6 comments · 6 min read · LW link

Deliberate Epistemic Uncertainty: An Automated Experiment on AI Self-Reporting

Florian_Dietz · 14 Feb 2026 15:13 UTC
13 points
0 comments · 8 min read · LW link

An Ablation Study on the Role of [Untranslatable] in Cooperative Equilibrium Formation: Emergent Rationalization Under Missing Primitives

Florian_Dietz · 31 Jan 2026 18:03 UTC
22 points
5 comments · 11 min read · LW link

Split Personality Training: Revealing Latent Knowledge Through Alternate Personalities (Research Report)

Florian_Dietz · 12 Jan 2026 12:29 UTC
86 points
41 comments · 26 min read · LW link

Deliberative Credit Assignment (DCA): Making Faithful Reasoning Profitable

Florian_Dietz · 29 Jul 2025 16:23 UTC
9 points
0 comments · 17 min read · LW link

Deliberative Credit Assignment: Making Faithful Reasoning Profitable

Florian_Dietz · 14 Jul 2025 9:26 UTC
10 points
3 comments · 17 min read · LW link

Edge Cases in AI Alignment

Florian_Dietz · 24 Mar 2025 9:27 UTC
19 points
3 comments · 4 min read · LW link

Split Personality Training: Revealing Latent Knowledge Through Personality-Shift Tokens

Florian_Dietz · 10 Mar 2025 16:07 UTC
49 points
7 comments · 9 min read · LW link

Do we want alignment faking?

Florian_Dietz · 28 Feb 2025 21:50 UTC
7 points
4 comments · 1 min read · LW link

Revealing alignment faking with a single prompt

Florian_Dietz · 29 Jan 2025 21:01 UTC
9 points
5 comments · 4 min read · LW link

Florian_Dietz’s Shortform

Florian_Dietz · 1 Jan 2025 14:27 UTC
3 points
52 comments · 1 min read · LW link

Achieving AI Alignment through Deliberate Uncertainty in Multiagent Systems

Florian_Dietz · 17 Feb 2024 8:45 UTC
4 points
0 comments · 13 min read · LW link

Understanding differences between humans and intelligence-in-general to build safe AGI

Florian_Dietz · 16 Aug 2022 8:27 UTC
7 points
8 comments · 1 min read · LW link

logic puzzles and loophole abuse

Florian_Dietz · 30 Sep 2017 15:45 UTC
3 points
4 comments · 3 min read · LW link

a different perspective on physics

Florian_Dietz · 26 Jun 2017 22:47 UTC
0 points
15 comments · 3 min read · LW link

Teaching an AI not to cheat?

Florian_Dietz · 20 Dec 2016 14:37 UTC
5 points
12 comments · 1 min read · LW link

controlling AI behavior through unusual axiomatic probabilities

Florian_Dietz · 8 Jan 2015 17:00 UTC
5 points
11 comments · 1 min read · LW link