Rohin Shah comments on Research directions Open Phil wants to fund in technical AI safety

Rohin Shah 8 Feb 2025 9:56 UTC
LW: 101 AF: 54
23
AF
I’m excited to see this RFP out! Many of the topics in here seem like great targets for safety work.
I’m sad that there’s so little emphasis in this RFP about alignment, i.e. research on how to build an AI system that is doing what its developer intended it to do. The main area that seems directly relevant to alignment is “alternatives to adversarial training”. (There’s also “new moonshots for aligning superintelligence” but I don’t expect much to come out of that, and “white-box estimation of rare misbehavior” could help if you are willing to put optimization pressure against it, but that isn’t described as a core desideratum, and I don’t expect we get it. Work on externalized reasoning can also make alignment easier, but I’m not counting that as “directly relevant”.) Everything else seems to mostly focus on evaluation or fundamental science that we hope pays off in the future. (To be clear, I think those are good and we should clearly put a lot of effort into them, just not all of our effort.)
Areas that I’m more excited about relative to the median area in this post (including some of your starred areas):
- Amplified oversight aka scalable oversight, where you are actually training the systems as in e.g. original debate.
- Mild optimization. I’m particularly interested in MONA (fka process supervision, but that term now means something else) and how to make it competitive, but there may also be other good formulations of mild optimization to test out in LLMs.
- Alignable systems design: Produce a design for an overall AI system that accomplishes something interesting, apply multiple safety techniques to it, and show that the resulting system is both capable and safe. (A lot of the value here is in figuring out how to combine various safety techniques together.)
To be fair, a lot of this work is much harder to execute on, requiring near-frontier models, significant amounts of compute, and infrastructure to run RL on LLMs, and so is much better suited to frontier labs than to academia / nonprofits, and you will get fewer papers per $ invested in it. Nonetheless, academia / nonprofits are so much bigger than frontier labs that I think academics should still be taking shots at these problems.
(And tbc there are plenty of other areas directly relevant to alignment that I’m less excited about, e.g. improved variants of RLHF / Constitutional AI, leveraging other modalities of feedback for LLMs (see Table 1 here), and “gradient descent psychology” (empirically studying how fine-tuning techniques affect LLM behavior).)
Control evaluations are an attempt to conservatively evaluate the safety of protocols like AI-critiquing-AI (e.g., McAleese et al.), AI debate (Arnesen et al., Khan et al.), other scalable oversight methods
Huh? No, control evaluations involve replacing untrusted models with adversarial models at inference time, whereas scalable oversight attempts to make trusted models at training time, so it would completely ignore the proposed mechanism to take the models produced by scalable oversight and replace them with adversarial models. (You could start out with adversarial models and see whether scalable oversight fixes them, but that’s an alignment stress test.)
(A lot of the scalable oversight work from the last year is inference-only, but at least for me (and for the field historically) the goal is to scale this to training the AI system—all of the theory depends on equilibrium behavior which you only get via training.)
What links here?
- plex's comment on plex’s Shortform by plex (7 Oct 2025 17:51 UTC; 116 points)
- Buck 8 Feb 2025 16:46 UTC
  LW: 20 AF: 12
  9
  AF Parent
  I think we should just all give up on the word “scalable oversight”; it is used in many conflicting ways, sadly. I mostly talk about “recursive techniques for reward generation”.
  - Alexander Gietelink Oldenziel 8 Feb 2025 19:04 UTC
    4 points
    −1
    Parent
    The idea I associate with scalable oversight is weaker models overseeing stronger models (probably) combined with safety-by-debate. Is that the same or different from ” recursive techniques for reward generation” ?
    Currently, this general class of ideas seems to me the most promising avenue for achieving alignment for vastly superhuman AI (′ superintelligence’ )..
- Fabien Roger 8 Feb 2025 19:10 UTC
  LW: 8 AF: 6
  −1
  AF Parent
  Amplified oversight aka scalable oversight, where you are actually training the systems as in e.g. original debate.
  Mild optimization. I’m particularly interested in MONA (fka process supervision, but that term now means something else) and how to make it competitive, but there may also be other good formulations of mild optimization to test out in LLMs.
  I think additional non-moonshot work in these domains will have a very hard time helping.
  [low confidence] My high level concern is that non-moonshot work in these clusters may be the sort of things labs will use anyway (with or without safety push) if this helped with capabilities because the techniques are easy to find, and won’t use if it didn’t help with capabilities because the case for risk reduction is weak. This concern is mostly informed by my (relatively shallow) read of recent work in these clusters.
  [edit: I was at least somewhat wrong, see comment threads below]
  Here are things that would change my mind:
  - If I thought people were making progress towards techniques with nicer safety properties and no alignment tax that seem hard enough to make workable in practice that capabilities researchers won’t bother using by default, but would bother using if there was existing work on how to make them work.
    (For the different question of preventing AIs from using “harmful knowledge”, I think work on robust unlearning and gradient routing may have this property—the current SoTA is far enough from a solution that I expect labs to not bother doing anything good enough here, but I think there is a path to legible success, and conditional on success I expect labs to pick it up because it would be obviously better, more robust, and plausibly cheaper than refusal training + monitoring. And I think robust unlearning and gradient routing have better safety properties than refusal training + monitoring.)
  - If I thought people were making progress towards understanding when not using process-based supervision and debate is risky. This looks like demos and model organisms aimed at measuring when, in real life, not using these simple precautions would result in very bad outcomes while using the simple precautions would help.
    (In the different domain of CoT-faithfulness I think there is a lot of value in demonstrating the risk of opaque CoT well-enough that labs don’t build techniques that make CoT more opaque if it just slightly increased performance because I expect that it will be easier to justify. I think GDM’s updated safety framework is a good step in this direction as it hints at additional requirements GDM may have to fulfill if it wanted to deploy models with opaque CoT past a certain level of capabilities.)
  - If I thought that research directions included in the cluster you are pointing at were making progress towards speeding up capabilities in safety-critical domains (e.g. conceptual thinking on alignment, being trusted neutral advisors on geopolitical matters, …) relative to baseline methods (i.e. the sort of RL you would do by default if you wanted to make the model better at the safety-critical task if you had no awareness of anything people did in the safety literature).
  I am not very aware of what is going on in this part of the AI safety field. It might be the case that I would change my mind if I was aware of certain existing pieces of work or certain arguments. In particular I might be too skeptical about progress on methods for things like debate and process-based supervision—I would have guessed that the day that labs actually want to use it for production runs, the methods work on toy domains and math will be useless, but I guess you disagree?
  It’s also possible that I am missing an important theory of change for this sort of research.
  (I’d still like it if more people worked on alignment: I am excited about projects that look more like the moonshots described in the RFP and less like the kind of research I think you are pointing at.)
  - Rohin Shah 8 Feb 2025 22:35 UTC
    LW: 4 AF: 5
    1
    AF Parent
    I would have guessed that the day that labs actually want to use it for production runs, the methods work on toy domains and math will be useless, but I guess you disagree?
    I think MONA could be used in production basically immediately; I think it was about as hard for us to do regular RL as it was to do MONA, though admittedly we didn’t have to grapple as hard with the challenge of defining the approval feedback as I’d expect in a realistic deployment. But it does impose an alignment tax, so there’s no point in using MONA currently, when good enough alignment is easy to achieve with RLHF and its variants, or RL on ground truth signals. I guess in some sense the question is “how big is the alignment tax”, and I agree we don’t know the answer to that yet and may not have enough understanding by the time it is relevant, but I don’t really see why one would think “nah it’ll only work in toy domains”.
    I agree debate doesn’t work yet, though I think >50% chance we demonstrate decent results in some LLM domain (possibly a “toy” one) by the end of this year. Currently it seems to me like a key bottleneck (possibly the only one) is model capability, similarly to how model capability was a bottleneck to achieving the value of RL on ground truth until ~2024).
    It also seems like it would still be useful if the methods were used some time after the labs want to use it for production runs.
    It’s wild to me that you’re into moonshots when your objection to existing proposals is roughly “there isn’t enough time for research to make them useful”. Are you expecting the moonshots to be useful immediately?
    - Fabien Roger 9 Feb 2025 14:14 UTC
      LW: 5 AF: 4
      0
      AF Parent
      Which of the bullet points in my original message do you think is wrong? Do you think MONA and debate papers are:
      on the path to techniques that measurably improve feedback quality on real domains with potentially a low alignment tax, and that are hard enough to find that labs won’t use them by default?
      on the path to providing enough evidence of their good properties that even if they did not measurably help with feedback quality in real domains (and slightly increased cost), labs could be convinced to use them because they are expected to improve non-measurable feedback quality?
      on the path to speeding up safety-critical domains?
      (valuable for some other reason?)
      - Rohin Shah 9 Feb 2025 18:25 UTC
        LW: 2 AF: 2
        0
        AF Parent
        I have some credence in all three of those bullet points.
        For MONA it’s a relatively even mixture of the first and second points.
        (You are possibly the first person I know of who reacted to MONA with “that’s obvious” instead of “that obviously won’t perform well, why would anyone ever do it”. Admittedly you are imagining a future hypothetical where it’s obvious to everyone that long-term optimization is causing problems, but I don’t think it will clearly be obvious in advance that the long-term optimization is causing the problems, even if switching to MONA would measurably improve feedback quality.)
        For debate it’s mostly the first point, and to some extent the third point.
        Fabien Roger 10 Feb 2025 12:34 UTC
        LW: 2 AF: 2
        0
        AF Parent
        Admittedly you are imagining a future hypothetical where it’s obvious to everyone that long-term optimization is causing problems, but I don’t think it will clearly be obvious in advance that the long-term optimization is causing the problems, even if switching to MONA would measurably improve feedback quality
        That’s right. If the situations where you imagine MONA helping are situations where you can’t see the long-term optimization problems, I think you need a relatively strong second bullet point (especially if the alignment tax is non-negligible), and I am not sure how you get it.
        In particular, for the median AIs that labs use to 20x AI safety research, my guess is that you won’t have invisible long-term reward hacking problems, and so I would advise labs to spend the alignment tax on other interventions (like using weaker models when possible, or doing control), not on using process-based rewards. I would give different advice if
        the alignment tax of MONA were tiny
        there were decent evidence for invisible long-term reward hacking problems with catastrophic consequences solved by MONA
        I think this is not super plausible to happen, but I am sympathetic to research towards these two goals. So maybe we don’t disagree that much (except maybe on the plausibility of invisible long-term reward hacking problems for the AIs that matter the most).
        Rohin Shah 11 Feb 2025 23:34 UTC
        LW: 2 AF: 2
        0
        AF Parent
        If the situations where you imagine MONA helping are situations where you can’t see the long-term optimization problems, I think you need a relatively strong second bullet point
        That doesn’t seem right. It can simultaneously be the case that you can’t tell that there are problems stemming from long-term optimization problems when you don’t use MONA, and also if you actually use MONA, then it will measurably improve quality.
        For example, perhaps under normal RL you get a coding AI that has learned to skip error-checking code in order to reduce latency (which we’d penalize if we knew about it, but we don’t realize that’s happening). Later when things are put into production errors happen, but it’s chalked up to “well it’s hard to anticipate everything”.
        Instead you use MONA, and it doesn’t learn to do this. You compare the resulting coding agent to the original agent, and notice that the MONA agent’s lines of code are much more rarely implicated in future bugs, and conclude they are higher quality.
        Fabien Roger 14 Feb 2025 10:38 UTC
        LW: 3 AF: 3
        0
        AF Parent
        To rephrase what I think you are saying are situations where work on MONA is very helpful:
        By default people get bitten by long-term optimization. They notice issues in prod because it’s hard to catch everything. They patch individual failures when they come up, but don’t notice that if they did more work on MONA, they would stop the underlying driver of many issues (including future issues that could result in catastrophes). They don’t try MONA-like techniques because it’s not very salient when you are trying to fix individual failures and does not pass cost-benefit to fix individual failures.
        If you do work on MONA in realistic-ish settings, you may be able to demonstrate that you can avoid many failures observed in prod without ad-hoc patches and that the alignment tax is not too large. This is not obvious because trying it out and measuring the effectiveness of MONA is somewhat costly and because people don’t by default think of the individual failures you’ve seen in prod as symptoms of their long-term optimization, but your empirical work pushes them over the line and they end up trying to adopt MONA to avoid future failures in prod (and maybe reduce catastrophic risk—though given competitive pressures, that might not be the main factor driving decisions and so you don’t have to make an ironclad case for MONA reducing catastrophic risk).
        Is that right?
        I think this is at least plausible. I think this will become much more likely once we actually start observing long-term optimization failures in prod. Maybe an intervention I am excited about is enough training technique transparency that it is possible for people outside of labs to notice if issues plausibly stems from long-term optimization?
        Rohin Shah 14 Feb 2025 10:54 UTC
        LW: 2 AF: 2
        0
        AF Parent
        Is that right?
        Yes, that’s broadly accurate, though one clarification:
        This is not obvious because trying it out and measuring the effectiveness of MONA is somewhat costly
        That’s a reason (and is probably sufficient by itself), but I think a more important reason is that your first attempt at using MONA is at the point where problems arise, MONA will in fact be bad, whereas if you have iterated on it a bunch previously (and in particular you know how to provide appropriate nonmyopic approvals), your attempt at using MONA will go much better.
        I think this will become much more likely once we actually start observing long-term optimization failures in prod.
        Agreed, we’re not advocating for using MONA now (and say so in the paper).
        Maybe an intervention I am excited about is enough training technique transparency that it is possible for people outside of labs to notice if issues plausibly stems from long-term optimization?
        Idk, to be effective I think this would need to be a pretty drastic increase in transparency, which seems incompatible with many security or non-proliferation intuitions, as well as business competitiveness concerns. (Unless you are thinking of lots of transparency to a very small set of people.)
        ryan_greenblatt 9 Feb 2025 19:17 UTC
        LW: 2 AF: 2
        0
        AF Parent
        
        You are possibly the first person I know of who reacted to MONA with “that’s obvious”
        
        I also have the “that’s obvious reaction”, but possibly I’m missing somne details. I also think it won’t perform well enough in practice to pencil given other better places to allocate safety budget (if it does trade off which is unclear).
        Rohin Shah 10 Feb 2025 5:22 UTC
        LW: 2 AF: 2
        0
        AF Parent
        I meant “it’s obvious you should use MONA if you are seeing problems with long-term optimization”, which I believe is Fabien’s position (otherwise it would be “hard to find”).
        Your reaction seems more like “it’s obvious MONA would prevent multi-step reward hacks”; I expect that is somewhat more common (though still rare, and usually depends on already having the concept of multi-step reward hacking).
    - Fabien Roger 9 Feb 2025 13:52 UTC
      LW: 2 AF: 2
      0
      AF Parent
      I don’t really see why one would think “nah it’ll only work in toy domains”.
      That is not my claim. By “I would have guessed that methods work on toy domains and math will be useless” I meant “I would have guessed that if a lab decided to do process-based feedback, it will be better off not doing a detailed literature review of methods in MONA and followups on toy domains, and just do the process-based supervision that makes sense in the real domain they now look at. The only part of the method section of MONA papers that matters might be “we did process-based supervision”.”
      I did not say “methods that work on toy domains will be useless” (my sentence was easy to misread).
      I almost have the opposite stance, I am closer to “it’s so obvious that process-based feedback helps that if capabilities people ever had issues stemming from long-term optimization, they would obviously use more myopic objectives. So process-based feedback so obviously prevents problems from non-myopia in real life that the experiments in the MONA paper don’t increase the probability that people working on capabilities implement myopic objectives.”
      But maybe I am underestimating the amount of methods work that can be done on MONA for which it is reasonable to expect transfer to realistic settings (above the “capability researcher does what looks easiest and sensible to them” baseline)?
      I agree debate doesn’t work yet
      Not a crux. My guess is that if debate did “work” to improve average-case feedback quality, people working on capabilities (e.g. the big chunk of academia working on improvements to RLHF because they want to find techniques to make models more useful) would notice and use that to improve feedback quality. So my low confidence guess is that it’s not high priority for people working on x-risk safety to speed up that work.
      But I am excited about debate work that is not just about improving feedback quality. For example I am interested in debate vs schemers or debate vs a default training process that incentivizes the sort of subtle reward hacking that doesn’t show up in “feedback quality benchmarks (e.g. rewardbench)” but which increases risk (e.g. by making models more evil). But this sort of debate work is in the RFP.
      - Rohin Shah 9 Feb 2025 18:23 UTC
        LW: 2 AF: 2
        0
        AF Parent
        Got it, that makes more sense. (When you said “methods work on toy domains” I interpreted “work” as a verb rather than a noun.)
        But maybe I am underestimating the amount of methods work that can be done on MONA for which it is reasonable to expect transfer to realistic settings
        I think by far the biggest open question is “how do you provide the nonmyopic approval so that the model actually performs well”. I don’t think anyone has even attempted to tackle this so it’s hard to tell what you could learn about it, but I’d be surprised if there weren’t generalizable lessons to be learned.
        I agree that there’s not much benefit in “methods work” if that is understood as “work on the algorithm / code that given data + rewards / approvals translates it into gradient updates”. I care a lot more about iterating on how to produce the data + rewards / approvals.
        My guess is that if debate did “work” to improve average-case feedback quality, people working on capabilities (e.g. the big chunk of academia working on improvements to RLHF because they want to find techniques to make models more useful) would notice and use that to improve feedback quality.
        I’d weakly bet against this, I think there will be lots of fiddly design decisions that you need to get right to actually see the benefits, plus iterating on this is expensive and hard because it involves multiagent RL. (Certainly this is true of our current efforts; the question is just whether this will remain true in the future.)
        For example I am interested in [...] debate vs a default training process that incentivizes the sort of subtle reward hacking that doesn’t show up in “feedback quality benchmarks (e.g. rewardbench)” but which increases risk (e.g. by making models more evil). But this sort of debate work is in the RFP.
        I’m confused. This seems like the central example of work I’m talking about. Where is it in the RFP? (Note I am imagining that debate is itself a training process, but that seems to be what you’re talking about as well.)
        EDIT: And tbc this is the kind of thing I mean by “improving average-case feedback quality”. I now feel like I don’t know what you mean by “feedback quality”.
        Fabien Roger 10 Feb 2025 12:21 UTC
        LW: 2 AF: 2
        0
        AF Parent
        I’m confused. This seems like the central example of work I’m talking about. Where is it in the RFP? (Note I am imagining that debate is itself a training process, but that seems to be what you’re talking about as well.)
        My bad, I was a bit sloppy here. The debate-for-control stuff is in the RFP but not the debate vs subtle reward hacks that don’t show up in feedback quality evals.
        I think we agree that there are some flavors of debate work that are exciting and not present in the RFP.
- Buck 8 Feb 2025 16:47 UTC
  LW: 7 AF: 6
  3
  AF Parent
  Alignable systems design: Produce a design for an overall AI system that accomplishes something interesting, apply multiple safety techniques to it, and show that the resulting system is both capable and safe. (A lot of the value here is in figuring out how to combine various safety techniques together.)
  I don’t know what this means, do you have any examples?
  - Rohin Shah 8 Feb 2025 22:11 UTC
    LW: 3 AF: 3
    0
    AF Parent
    I don’t know of any existing work in this category, sorry. But e.g. one project would be “combine MONA and your favorite amplified oversight technique to oversee a hard multi-step task without ground truth rewards”, which in theory could work better than either one of them alone.
- Oliver Daniels 9 Feb 2025 21:13 UTC
  3 points
  0
  Parent
  Huh? No, control evaluations involve replacing untrusted models with adversarial models at inference time, whereas scalable oversight attempts to make trusted models at training time, so it would completely ignore the proposed mechanism to take the models produced by scalable oversight and replace them with adversarial models. (You could start out with adversarial models and see whether scalable oversight fixes them, but that’s an alignment stress test.)
  I have a similar confusion (see my comment here) but seems like at least Ryan wants control evaluations to cover this case? (perhaps on the assumption that if your “control measures” are successful, they should be able to elicit aligned behavior from scheming models and this behavior can be reinforced?)