I was trying to argue that the most natural deontology-style preferences we’d aim for are relatively stable if we actually instill them.
That claim is trivial and irrelevant, though, if true-obedience is part of it, since true-obedience is magic that gets you anything you can describe.
That holds if the way integrity is implemented is at all similar to how humans implement it.
How do humans implement integrity?
Part of my perspective is that the deontological preferences we want are relatively robust to optimization pressure if faithfully implemented. So from my perspective the situation comes down to three cases: “you get scheming”; “your behavioural tests look bad, so you try again”; or “your behavioural tests look fine, you didn’t have scheming, and so you probably basically got the properties you wanted, assuming you were somewhat careful”.
You’re just stating that you don’t expect any reflective instability as an agent learns and thinks over time? I’ve heard you say this kind of thing before, but I haven’t heard an explanation, and I’d love to hear your reasoning. In particular, it seems very different from how humans work, and intuitively surprising for any thinking machine that starts out a bit of a hacky mess like us. (I could write out an object-level argument for why reflective instability is expected, but it’d take some effort and I’d want to know that you were going to engage with it.)