Jan Betley comments on eggsyntax’s Shortform

Jan Betley 9 Oct 2025 20:39 UTC
2 points
0
Consider backdoors, as in the Sleeper Agents paper. So, a conditional policy triggered by some specific user prompt. You could probably quite easily fine-tune a recent model to be pro-life on even days and pro-choice on odd days. These would be just fully general, consistent behaviors, i.e. you could get a model that would present these date-dependant beliefs consistently among all possible contexts.

Now, imagine someone controls all of the environment you live in. Like, literally everything, except that they don’t have any direct access to your brain. Could they implement similar backdoor in you? They could force you to behave that way, buy could they make you really believe that?

My guess is not, and one reason (there are also others but that’s a different topic) is that humans like me and you have a very deep belief “current date doesn’t make a difference for whether abortion is good and bad” that is extremely hard to overwrite without hurting our cognition in other contexts. Like, what is even good and bad if in some cases they flip at midnight?

So couldn’t we have LLMs be like humans in this regard? I don’t see a good reason for why this wouldn’t be possible.

I’m not sure if this is a great analogy : )
- eggsyntax 14 Oct 2025 1:02 UTC
  2 points
  0
  Parent
  Could they implement similar backdoor in you?...My guess is not
  Although people have certainly tried...
  My guess is not, and one reason (there are also others but that’s a different topic) is that humans like me and you have a very deep belief “current date doesn’t make a difference for whether abortion is good and bad” that is extremely hard to overwrite without hurting our cognition in other contexts. Like, what is even good and bad if in some cases they flip at midnight?
  I’m being a bit tangential here, but a couple of thoughts:
  - Do we actually have that belief? There are an unbounded number of things that by default we don’t let affect our values, and we can’t be actively representing all of them in our bounded brain (eg we can pick some house in the world at random — whether its lights are on doesn’t affect my values regarding abortion, but I sure didn’t have an actual policy on that in my brain).
  - I’m sure you’re familiar with this, but your example reminds me of the ‘new riddle of induction’: why aren’t grue and bleen just as reasonable as blue and green are?
  I agree that it’s hard to imagine what cognitive changes would have to happen for me to have a value with that property. I don’t think I have very good intuitions about how much it would affect my overall cognition, though. What you’re saying feels plausible to me, but I don’t have much confidence either way.