If we use Myopic Optimization with Non-myopic Approval (MONA), then the AI will not want to engage in sneaky power-seeking behaviors, because such behaviors will not help with the short-term tasks that the AI-in-a-lab wants to do.
I’d make this argument even for regular outcomes-based RL. Presumably, whenever you are providing rewards (whether in brain-like AGI or RLHF for LLMs), they are based on actions that the AI took over some bounded time in the past; let’s call that time bound T. Then presumably sneaky power-seeking behaviors should be desired inasmuch as they pay out within time T. Currently T is pretty small, so I don’t expect them to pay out. Maybe you’re imagining that T will increase a bunch?
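To make the role of T concrete, here is a minimal toy sketch in Python (my own illustration, not code from the post or from the MONA paper; the trajectory, the reward numbers, and the function names are all hypothetical):

```python
# Toy illustration: how much credit a "sneaky" action at step 0 receives under
# (a) outcome-based reward summed over a bounded window of T steps, vs.
# (b) MONA-style reward from immediate overseer approval of that step alone.
# All numbers and names here are made up for illustration.

def outcome_return(rewards, t, T):
    """Credit for the action at step t: outcomes observed within the next T steps."""
    return sum(rewards[t : t + T])

def mona_return(approvals, t):
    """Credit for the action at step t: overseer approval of that step, ignoring later outcomes."""
    return approvals[t]

# Hypothetical trajectory: a sneaky power-grab at step 0 whose payoff only arrives at step 50.
rewards = [0.0] * 100
rewards[50] = 10.0           # delayed payoff from the power-grab
approvals = [0.0] * 100
approvals[0] = -1.0          # the overseer disapproves of the step itself

print(outcome_return(rewards, t=0, T=10))   # 0.0  -> payoff lands outside the window, not reinforced
print(outcome_return(rewards, t=0, T=100))  # 10.0 -> a long enough window does reinforce it
print(mona_return(approvals, t=0))          # -1.0 -> MONA judges the step itself, payoff or not
```

The only point of the sketch is that with a short window the delayed payoff never enters credit assignment at all, while lengthening the window to capture real-world outcomes is exactly what would reintroduce the incentive.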
Secondarily, I have a concern that, even if a setup seems like it should lead to an AI with exclusively short-term desires, it may in fact accidentally lead to an AI that also has long-term desires. See §5.2.1 of that same post for specifics.
Sure, I agree with this (and this is why I want to do MONA even though I expect outcomes-based RL will already be quite time-limited), but if this is your main argument I feel like your conclusion should be way more measured (more like “this might happen” rather than “this is doomed”).
3.5 “The reward function will get smarter in parallel with the agent being trained”
This one comes from @paulfchristiano (2022): “I’m most optimistic about approaches where RL is only performed on a reward function that gets smarter in parallel with the agent being trained.”
I, umm, don’t really know what he’s getting at here. Maybe something like: use AI assistants to better monitor and catch sneaky behavior? If so, (1) I don’t expect the AI assistants to be perfect, so my story still goes through, and (2) there’s a chicken-and-egg problem that will lead to the AI assistants being misaligned and scheming too.
Re: (1), the AI assistants don’t need to be perfect. They need to be good enough that the agent being trained cannot systematically exploit them, which is a much lower bar than “perfect” (while still being quite a high bar). In general I feel like you’re implicitly treating an adversarial AI system as magically having perfect capabilities while any AI systems that are supposed to be helpful are of course flawed and have weaknesses. Pick an assumption and stick with it!
Re: (2), you start with a non-scheming human, and bootstrap up from there, at each iteration using the previous aligned assistant to help oversee the next AI. See e.g. IDA.
There are lots of ways this could go wrong, I’m not claiming it will clearly work, but I don’t think either of your objections matters much.
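For concreteness, here is a rough sketch of the shape of the bootstrapping loop gestured at in “Re: (2)” above, in the spirit of IDA. Everything in it is a made-up stand-in rather than a real training API: `train_with_oversight` and `amplify` are hypothetical placeholders, and the point is only the structure of “train the next agent against the current overseer, then fold that agent into oversight so the reward signal gets smarter in parallel.”

```python
# Sketch of an IDA-style bootstrapping loop (hypothetical stand-ins, not a real API).

def train_with_oversight(overseer):
    # Stand-in for RL / fine-tuning against rewards or feedback produced by `overseer`.
    return f"agent trained under [{overseer}]"

def amplify(overseer, assistant):
    # Stand-in for a human+AI oversight team: the previous overseer plus the newly
    # trained agent acting as an assistant, i.e. a smarter reward function.
    return f"({overseer} + {assistant})"

overseer = "non-scheming human"   # generation 0: oversight starts with a human
for generation in range(3):
    agent = train_with_oversight(overseer)   # train the next agent under current oversight
    overseer = amplify(overseer, agent)      # the new agent now helps oversee the next one
    print(f"gen {generation}: overseer = {overseer}")
```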
whenever you are providing rewards (whether in brain-like AGI or RLHF for LLMs), they are based on actions that the AI took over some bounded time in the past; let’s call that time bound T. Then presumably sneaky power-seeking behaviors should be desired inasmuch as they pay out within time T. Currently T is pretty small, so I don’t expect them to pay out. Maybe you’re imagining that T will increase a bunch?
Yup! If we set aside RLVR (which I don’t think I was making as strong statements about), and focus on “brain-like AGI” (some yet-to-be-invented version of model-based RL), then I feel strongly that this is an area where the existing RL literature is disanalogous to future AGI. People can envision a plan that takes days, months, years, or decades, and what consequences that plan will have. And those consequences can seem good or bad, and accordingly the person will pursue the plan, or not. There isn’t really any T that connects the planning horizon to the RL training history. My 10-year-old kid has a 30-year career plan in mind that he’s pursuing, longer than he’s been alive. He insists he’s not gonna change his mind, lol :-P
if this is your main argument I feel like your conclusion should be way more measured (more like “this might happen” rather than “this is doomed”).
I agree with that (hence “secondarily”) and just edited to make that clearer :)
Re: (1), the AI assistants don’t need to be perfect. They need to be good enough that the agent being trained cannot systematically exploit them, which is a much lower bar than “perfect” (while still being quite a high bar). In general I feel like you’re implicitly treating an adversarial AI system as magically having perfect capabilities while any AI systems that are supposed to be helpful are of course flawed and have weaknesses. Pick an assumption and stick with it!
I’m arguing for the narrow claim “AI would love to escape control if it were confident that it had an opportunity to do so without getting caught”, as an obvious member of a category of things that the AI will get sculpted to want and like. I’m not arguing that the AI will ever actually get such confidence—see the “honeypots for control” vs “honeypots for alignment” discussion in §3.1.1.
Maybe a good analogy is: a little kid will climb a hill, or a play structure, or a tree, or whatever, and he’ll get a thrill from being high up. And then he’ll say that he’d like to climb the biggest mountain in the world, or go to outer space. And he’s being sincere—this is something he would like (other things equal). But that doesn’t mean he can or will actually climb Everest or commandeer rocket ships. He’s just a little kid. Even if he wanted to, it would be trivial to prevent him from doing so. What’s hard is preventing him from wanting to do so. You’d have to follow him around every moment, preventing him from ever climbing a play structure or hill or tree or looking over a precipice, etc.
Spelling out the analogy:
Kid climbs a play structure and gets a thrill from being high up ↔ AI breaks the rules without getting caught, perhaps only in kinda trivial edge cases, and that leads to positive reward.
Kid generalizes to wanting to go to outer space ↔ AI generalizes to wanting to secretly exfiltrate a copy onto the internet to aggressively power-seek around the world.
It’s super-hard to police a kid’s environment so well that he never ever gets a thrill from being high up ↔ It’s super-hard to police an AI’s environment so perfectly that it never ever gets rewarded when breaking the (spirit of the) rules.
Re: (2), you start with a non-scheming human, and bootstrap up from there, at each iteration using the previous aligned assistant to help oversee the next AI. See e.g. IDA.
I don’t buy into that picture for other reasons, but you’re right that this is a valid counterargument to what I wrote. I’ll delete or rewrite, thanks. See Foom & Doom §2.6 for more on the “other reasons”.
UPDATE NEXT DAY: I rewrote §3.5 a bit, thanks again.
UPDATE 7/30: Re-rewrote §3.5.
Not Steven Byrnes, but I think one area where I differ on expectations for MONA is that I expect T to increase a bunch, insofar as AI companies are succeeding at making AI progress. A lot of the reason for this is that I think many of the most valuable tasks for AI implicitly rely on long-term memory/insane context lengths, and T in this setting could easily be 1 year or 10 years or more. Depending on what the world looks like, those are ranges where AI would be able and willing to power-seek more than it is currently.
Note that this is not necessarily an argument that MONA will certainly fail, but one area where I’ve changed my mind is that I now think a lot of AI tasks are unlikely to be easily broken down into smaller chunks such that we can limit the use of outcome-based RL for long periods.
Edit: Deleted the factored cognition comment.