I agree that jailbreaks are good evidence that these models lack inner alignment, but I see one problem with the central argument: just because something can be fooled doesn't mean it's intrinsically misaligned. A high-level example might be Ender's Game, where a seemingly aligned Ender is deceived into wiping out a planet. Certainly an AI that can be jailbroken into apocalyptic actions isn't a good thing, but that doesn't mean it hasn't internalized a moral system.
Not that I think these models have internalized a moral system, of course. I do find it quite worrying how easy it is to jailbreak even these seemingly advanced models.