Perhaps a better summary of my discomfort here: suppose you train some AI to output verifiable conceptual insights. How can I verify that this AI is not missing lots of things all the time? In other words, how do I verify that the training worked as intended?
You might hope for elicitation efficiency, as in, you heavily RL the model to produce useful considerations and hope that your optimization is good enough that it covers everything well enough.
Or, two lower bars you might hope for:
1. It brings up considerations that it “knows” about. (By “knows” I mean knows relatively deeply, in the sense that it can manipulate and utilize the knowledge fairly robustly.)
2. It isn’t much worse than human researchers at bringing up important considerations.
In general, you might have elicitation problems, and this domain seems only somewhat worse than others with respect to elicitation. (It’s worse because the feedback is somewhat more expensive.) A rough sketch of how one might check the two lower bars above follows.
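To make the two lower bars slightly more concrete, here is a minimal sketch, assuming three hypothetical ingredients: the considerations the model surfaces unprompted, the considerations it demonstrably “knows” (e.g. elicited by direct probing), and the considerations surfaced by human researchers plus a later judgment of which ones mattered. All names and the slack threshold are illustrative assumptions, not a proposed evaluation protocol.

```python
# Illustrative sketch only: scoring the two "lower bars" above.
# Which considerations are "important" is exactly the hard-to-enumerate part,
# so this is at best a spot check, not verification.

def coverage(surfaced: set[str], reference: set[str]) -> float:
    """Fraction of the reference considerations that were actually surfaced."""
    return len(surfaced & reference) / len(reference) if reference else 1.0

def check_lower_bars(model_surfaced: set[str],
                     model_knows: set[str],
                     human_surfaced: set[str],
                     important: set[str],
                     slack: float = 0.8) -> dict[str, bool]:
    # Bar 1: the model brings up most of the things it demonstrably "knows".
    bar1 = coverage(model_surfaced, model_knows) >= slack
    # Bar 2: its coverage of important considerations isn't much worse than humans'.
    bar2 = coverage(model_surfaced, important) >= slack * coverage(human_surfaced, important)
    return {"raises_what_it_knows": bar1, "not_much_worse_than_humans": bar2}

if __name__ == "__main__":
    print(check_lower_bars(
        model_surfaced={"reward hacking", "distribution shift"},
        model_knows={"reward hacking", "distribution shift", "sensor tampering"},
        human_surfaced={"reward hacking", "scalable oversight"},
        important={"reward hacking", "distribution shift", "scalable oversight"},
    ))
```

The obvious catch, and the point of the worry above, is that the “important” and “knows” sets are themselves things we can’t enumerate with confidence, so a check like this bounds the problem rather than solving it.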
You might hope for elicitation efficiency, as in, you heavily RL the model to produce useful considerations and hope that your optimization is good enough that it covers everything well enough.
“Hope” is indeed a good general-purpose term for plans which rely on an unverifiable assumption in order to work.
(Also I’d note that as of today, heavy RL tends to in fact produce pretty bad results, in exactly the ways one would expect in theory, and in particular in ways which one would expect to get worse rather than better as capabilities increase. RL is not something we can apply in more than small amounts before the system starts to game the reward signal.)
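As a toy illustration of that failure mode (a simplified Goodhart-style simulation, not a claim about any particular training setup), the sketch below selects candidates by a proxy reward equal to their true quality plus a heavy-tailed “gaming” term. As selection pressure grows, the proxy score keeps climbing while the true quality of the selected outputs stops improving and drifts back toward the mean.

```python
# Toy Goodhart-style simulation (illustrative assumptions throughout): the reward
# signal sees true quality plus a heavy-tailed exploitable "gaming" term. Stronger
# selection on the proxy increasingly rewards gaming rather than quality.
import random

random.seed(0)

def sample_candidate() -> tuple[float, float]:
    true_quality = random.gauss(0, 1)         # what we actually care about
    gaming = random.lognormvariate(0, 1)      # exploitable slack in the reward signal
    proxy = true_quality + gaming             # what the optimizer sees
    return true_quality, proxy

def select_best_of(n: int) -> tuple[float, float]:
    """Crude stand-in for optimization pressure: keep the proxy-best of n samples."""
    return max((sample_candidate() for _ in range(n)), key=lambda c: c[1])

for n in [1, 4, 32, 256, 4096]:
    winners = [select_best_of(n) for _ in range(500)]
    avg_true = sum(t for t, _ in winners) / len(winners)
    avg_proxy = sum(p for _, p in winners) / len(winners)
    print(f"selection pressure n={n:>4}: proxy reward {avg_proxy:6.2f}, "
          f"true quality {avg_true:+.3f}")
```

Real RL is not best-of-n sampling, but the qualitative point carries over: more optimization against an imperfect signal buys more of whatever the signal overvalues.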
If we assume that the AI isn’t scheming to actively withhold empirically/formally verifiable insights from us (I do think this would make life a lot harder), then it seems to me like this is reasonably similar to other domains in which we need to figure out how to elicit as-good-as-human-level suggestions from AIs that we can then evaluate well. E.g., it’s not clear to me why this would be all that different from “suggest a new transformer-like architecture that we can then verify improves training efficiency a lot on some metric.”
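For concreteness, the kind of elicit-then-verify loop being gestured at might look roughly like the sketch below, where `propose_variant` and `measure_training_efficiency` are hypothetical stand-ins (say, an AI proposing architecture tweaks, and a fixed benchmark run). The point is that acceptance hinges on a verifiable metric, not on evaluating the proposal’s reasoning.

```python
# Minimal sketch of "propose, then verify on a metric". Both callables are
# hypothetical placeholders; nothing here depends on how proposals are generated.
from typing import Any, Callable

def accept_verified_improvements(
    baseline: Any,
    propose_variant: Callable[[Any], Any],
    measure_training_efficiency: Callable[[Any], float],
    rounds: int = 10,
    min_gain: float = 0.02,
) -> Any:
    """Adopt a proposal only if it beats the current best on the metric by min_gain."""
    best = baseline
    best_score = measure_training_efficiency(best)
    for _ in range(rounds):
        candidate = propose_variant(best)
        score = measure_training_efficiency(candidate)
        if score >= best_score + min_gain:
            best, best_score = candidate, score
    return best
```

Note that this only verifies the proposals that arrive; it says nothing about whether the proposer is systematically failing to surface better ones, which is the worry raised at the top of this chain.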
Or put another way: at least in the context of non-schemers, the thing I’m looking for isn’t just “here’s a way things could be hard.” I’m specifically looking for ways things will be harder than in the context of capabilities (or, to a lesser extent, in other scientific domains where I expect a lot of economic incentives to figure out how to automate top-human-level work). And in that context, generic pessimism about e.g. heavy RL doesn’t seem like it’s enough.
That I roughly agree with. As in the comment at the top of this chain: “there will be market pressure to make AI good at conceptual work, because that’s a necessary component of normal science”. Likewise, insofar as e.g. heavy RL doesn’t make the AI effective at conceptual work, I expect it to also not make the AI all that effective at normal science.
That does still leave a big question mark regarding what methods will eventually make AIs good at such work. Insofar as very different methods are required, we should also expect other surprises along the way, and expect the AIs involved to look generally different from e.g. LLMs, which means that many other parts of our mental pictures are also likely to fail to generalize.
I think it’s a fair point that if it turns out that current ML methods are broadly inadequate for automating basically any sophisticated cognitive work (including capabilities research, biology research, etc.—though I’m not clear on your take on whether capabilities research counts as “science” in the sense you have in mind), it may be that whatever new paradigm ends up succeeding messes with various implicit and explicit assumptions in analyses like the one in the essay.
That said, I think if we’re ignorant about what paradigm will succeed re: automating sophisticated cognitive work and we don’t have any story about why alignment research would be harder, it seems like the baseline expectation (modulo scheming) would be that automating alignment is comparably hard (in expectation) to automating these other domains. (I do think, though, that we have reason to expect alignment to be harder even conditional on needing other paradigms, because I think it’s reasonable to expect some of the evaluation challenges I discuss in the post to generalize to other regimes.)