johnswentworth comments on Can we safely automate alignment research?

johnswentworth 30 Apr 2025 18:39 UTC
LW: 24 AF: 11
11
AF
I think you are importantly missing something about how load-bearing “conceptual” progress is in normal science.
An example I ran into just last week: I wanted to know how long it takes various small molecule neurotransmitters to be reabsorbed after their release. And I found some very different numbers:
- Some sources offhandedly claimed ~1ms. AFAICT, this number comes from measuring the time taken for the neurotransmitter to clear from the synaptic cleft, and then assuming that the neurotransmitter clears mainly via reabsorption (an assumption which I emphasize because IIUC it’s wrong; I think the ~1ms number is actually measuring time for the neurotransmitter to diffuse out of the cleft).
- Other sources claimed ~10ms. These were based on <other methods>.
Now, I want to imagine for a moment a character named Emma the Empirical Fundamentalist, someone who eschews “conceptual” work entirely and only updates on Empirically Testable Questions. (For current purposes, mathematical provability can also be lumped in with empirical testability.) How would Emma respond to the two mutually-contradictory neurotransmitter reabsorption numbers above?
Well, first and foremost, Emma would not think “one of these isn’t measuring what the authors think they’re measuring”. That is a quintessential “conceptual” thought. Instead, Emma might suspect that one of the measurements was simply wrong, but repeating the two experiments will quickly rule out that hypothesis. She might also think to try many other measurement methods, and find that most of them agree with the ~10ms number, but the ~1ms measurement will still be a true repeatable phenomenon. Eventually she will likely settle on roughly “the real world is messy and complex, so sometimes measurements depend on context and method in surprising ways, and this is one of those times”. (Indeed, that sort of thing is a very common refrain in tons of biological science.)
But in this case, the real explanation (I claim) is simply that one of the measurements was based on an incorrect assumption. I made the incorrect assumption obvious by highlighting it above, but in-the-wild the assumption would be implicit and unstated and nonobvious; it wouldn’t even occur to the authors that they’re making an assumption which could be wrong.
Discovering that sort of error is centrally in the wheelhouse of conceptual progress. And as this example illustrates, it’s a very load-bearing piece of normal science.
(And indeed, I expect that at least some of your examples of normal-science-style empirical alignment research are cases where the authors are probably not measuring what they think they are measuring, though I don’t know in advance which ones. Conceptual work is exactly what would be required to sort that out.)
(And to be clear, I don’t think this example is the only or even most common “kind of way” in which conceptual work is load-bearing for normal science.)
Moving up the discussion stack: insofar as conceptual work is very load-bearing for normal science, how does that change the view articulated in the post? Well, first, it means that one cannot produce a good normal-science AI by primarily relying on empirical feedback (including mathematical feedback from proofs), unless one gets lucky and training on empirical feedback happens to also make the AI good at conceptual work. Second, there will be market pressure to make AI good at conceptual work, because that’s a necessary component of normal science.
What links here?
- Jonas Hallgren's comment on Jonas Hallgren’s Shortform by Jonas Hallgren (1 May 2025 7:02 UTC; 1 point)
- Joe Carlsmith 30 Apr 2025 21:10 UTC
  LW: 8 AF: 6
  0
  AF Parent
  Note that conceptual research, as I’m understanding it, isn’t defined by the cognitive skills involved in the research—i.e., by whether the researchers need to have “conceptual thoughts” like “wait, is this measuring what I think it’s measuring?”. I agree that normal science involves a ton of conceptual thinking (and many “number-go-up” tasks do too). Rather, conceptual research as I’m understanding it is defined by the tools available for evaluating the research in question.^[1] In particular, as I’m understanding it, cases where neither available empirical tests nor formal methods help much.
  Thus, in your neurotransmitter example, it does indeed take some kind of “conceptual thinking” to come up with the thought “maybe it actually takes longer for neurotransmitters to get re-absorbed than it takes for them to clear from the cleft.” But if some AI presented us with this claim, the question is whether we could evaluate it via some kind of empirical test, which it sounds like we plausibly could. Of course, we do still need to interpret the results of these tests—e.g., to understand enough about what we’re actually trying to measure to notice that e.g. one measurement is getting at it better than another. But we’ve got rich empirical feedback loops to dialogue with.
  So if we interpret “conceptual work” as conceptual thinking, I do agree that “there will be market pressure to make AI good at conceptual work, because that’s a necessary component of normal science.” And this is closely related to the comforts I discuss in section 6.1. That is: a lot of alignment research seems pretty comparable to me to the sort of science at stake in e.g. biology, physics, computer science, etc, where I think human evaluation has a decent track record (or at least, a better track record than philosophy/futurism), and where I expect a decent amount of market pressure to resolve evaluation difficulties adequately. So (modulo scheming AIs differentially messing with us in some domains vs. others), at least by the time we’re successfully automating these other forms of science, I think we should be reasonably optimistic about automating that kind of alignment research as well. But this still leaves the type of alignment research that looks more centrally like philosophy/futurism, where I think evaluation is additionally challenging, and where the human track record looks worse.
  1. ^
    Thus, from section 6.2.3: “Conceptual research, as I’m understanding it, is defined by the methods available for evaluating it, rather than the cognitive skills involved in producing it. For example: Einstein on relativity was clearly a giant conceptual breakthrough. But because it was evaluable via a combination of empirical predictions and formal methods, it wouldn’t count as ‘conceptual research’ in my sense.”
  - johnswentworth 1 May 2025 0:17 UTC
    LW: 16 AF: 7
    14
    AF Parent
    Rather, conceptual research as I’m understanding it is defined by the tools available for evaluating the research in question.^[1] In particular, as I’m understanding it, cases where neither available empirical tests nor formal methods help much.
    Agreed.
    But if some AI presented us with this claim, the question is whether we could evaluate it via some kind of empirical test, which it sounds like we plausibly could.
    Disagreed.
    My guess is that you have, in the back of your mind here, ye olde “generation vs verification” discussion. And in particular, so long as we can empirically/mathematically verify some piece of conceptual progress once it’s presented to us, we can incentivize the AI to produce interesting new pieces of verifiable conceptual progress.
    That’s an argument which works in the high capability regime, if we’re willing to assume that any relevant progress is verifiable, since we can assume that the highly capable AI will in fact find whatever pieces of verifiable conceptual progress are available. Whether it works in the relatively-low capability regime of human-ish-level automated alignment research and realistic amounts of RL is… rather more dubious. Also, the mechanics of designing a suitable feedback signal would be nontrivial to get right in practice.
    Getting more concrete: if we’re imagining an LLM-like automated researcher, then a question I’d consider extremely important is: “Is this model/analysis/paper/etc missing key conceptual pieces?”. If the system says “yes, here’s something it’s missing” then I can (usually) verify that. But if it says “nope, looks good”… then I can’t verify that the paper is in fact not missing anything.
    And in fact that’s a problem I already do run into with LLMs sometimes: I’ll present a model, and the thing will be all sycophantic and not mention that the model has some key conceptual confusion about something. Of course you might hope that some more-clever training objective will avoid that kind of sycophancy and instead incentivize Good Science, but that’s definitely not an already-solved problem, and I sure do feel suspicious of an assumption that that problem will be easy.
    - Steven Byrnes 1 May 2025 12:56 UTC
      LW: 48 AF: 27
      17
      AF Parent
      I think that OP’s discussion of “number-go-up vs normal science vs conceptual research” is an unnecessary distraction, and he should have cut that part and just talked directly about the spectrum from “easy-to-verify progress” to “hard-to-verify progress”, which is what actually matters in context.
      Partly copying from §1.4 here, you can (A) judge ideas via new external evidence, and/or (B) judge ideas via internal discernment of plausibility, elegance, self-consistency, consistency with already-existing knowledge and observations, etc. There’s a big range in people’s ability to apply (B) to figure things out. But what happens in “normal” sciences like biology is that there are people with a lot of (B), and they can figure out what’s going on, on the basis of hints and indirect evidence. Others don’t. The former group can gather ever-more-direct and ever-more-unassailable (A)-type evidence over time, and use that evidence as a cudgel with which to beat the latter group over the head until they finally get it. (“If you don’t believe my 7 independent lines of evidence for plate tectonics, OK fine I’ll go to the mid-Atlantic ridge and gather even more lines of evidence…”)
      This is an important social tool, and explains why bad scientific ideas can die, while bad philosophy ideas live forever. And it’s even worse than that—if the bad philosophy ideas don’t die, then there’s no common knowledge that the bad philosophers are bad, and then they can rise in the ranks and hire other bad philosophers etc. Basically, to a first approximation, I think humans and human institutions are not really up to the task of making intellectual progress systematically over time, except where idiot-proof verification exists for that intellectual progress (for an appropriate definition of “idiot”, and with some other caveats).
      …Anyway, AFAICT, OP is just claiming that AI alignment research involves both easy-to-verify progress and hard-to-verify progress, which seems uncontroversial.
      - johnswentworth 1 May 2025 14:32 UTC
        LW: 11 AF: 6
        5
        AF Parent
        That was an excellent summary of how things seem to normally work in the sciences, and explains it better than I would have. Kudos.
      - Joe Carlsmith 2 May 2025 0:28 UTC
        LW: 6 AF: 4
        4
        AF Parent
        I’m happy to say that easy-to-verify vs. hard-to-verify is what ultimately matters, but I think it’s important to be clear what about makes something easier vs. harder to verify, so that we can be clear about why alignment might or might not be harder than other domains. And imo empirical feedback loops and formal methods are amongst the most important factors there.
    - johnswentworth 1 May 2025 0:52 UTC
      LW: 5 AF: 3
      2
      AF Parent
      Perhaps a better summary of my discomfort here: suppose you train some AI to output verifiable conceptual insights. How can I verify that this AI is not missing lots of things all the time? In other words, how do I verify that the training worked as intended?
      - ryan_greenblatt 1 May 2025 3:15 UTC
        LW: 2 AF: 2
        0
        AF Parent
        You might hope for elicitation efficiency, as in, you heavily RL the model to produce useful considerations and hope that your optimization is good enough that it covers everything well enough.
        
        Or, two lower bars you might hope for:
        
        It brings up considerations that it “knows” about. (By “knows” I mean relatively deep knows, like it can manipulate and utilize the knowledge relatively strongly.)
        It isn’t much worse than human researchers at bringing up important considerations.
        
        In general, you might have elicitation problems and this domain seems only somewhat worse with respect to elicitation. (It’s worse because the feedback is somewhat more expensive.)
        johnswentworth 1 May 2025 14:37 UTC
        LW: 13 AF: 6
        13
        AF Parent
        You might hope for elicitation efficiency, as in, you heavily RL the model to produce useful considerations and hope that your optimization is good enough that it covers everything well enough.
        “Hope” is indeed a good general-purpose term for plans which rely on an unverifiable assumption in order to work.
        (Also I’d note that as of today, heavy RL tends to in fact produce pretty bad results, in exactly the ways one would expect in theory, and in particular in ways which one would expect to get worse rather than better as capabilities increase. RL is not something we can apply in more than small amounts before the system starts to game the reward signal.)
        Joe Carlsmith 1 May 2025 22:59 UTC
        LW: 4 AF: 3
        0
        AF Parent
        If we assume that the AI isn’t scheming to actively withhold empirically/formally verifiable insights from us (I do think this would make life a lot harder), then it seems to me like this is reasonably similar to other domains in which we need to figure out how to elicit as-good-as-human-level suggestions from AIs that we can then evaluate well. E.g., it’s not clear to me why this would be all that different from “suggest a new transformer-like architecture that we can then verify improves training efficiency a lot on some metric.”
        Or put another way: at least in the context of non-schemers, the thing I’m looking for isn’t just “here’s a way things could be hard.” I’m specifically looking for ways things will be harder than in the context of capabilities (or, to a lesser extent, in other scientific domains where I expect a lot of economic incentives to figure out how to automate top-human-level work). And in that context, generic pessimism about e.g. heavy RL doesn’t seem like it’s enough.
        johnswentworth 1 May 2025 23:46 UTC
        LW: 9 AF: 4
        7
        AF Parent
        That I roughly agree with. As in the comment at top of this chain: “there will be market pressure to make AI good at conceptual work, because that’s a necessary component of normal science”. Likewise, insofar as e.g. heavy RL doesn’t make the AI effective at conceptual work, I expect it to also not make the AI all that effective at normal science.
        That does still leave a big question mark regarding what methods will eventually make AIs good at such work. Insofar as very different methods are required, we should also expect other surprises along the way, and expect the AIs involved to look generally different from e.g. LLMs, which means that many other parts of our mental pictures are also likely to fail to generalize.
        Joe Carlsmith 2 May 2025 0:34 UTC
        LW: 8 AF: 6
        0
        AF Parent
        I think it’s a fair point that if it turns out that current ML methods are broadly inadequate for automating basically any sophisticated cognitive work (including capabilities research, biology research, etc—though I’m not clear on your take on whether capabilities research counts as “science” in the sense you have in mind), it may be that whatever new paradigm ends up successful messes with various implicit and explicit assumptions in analyses like the one in the essay.
        That said, I think if we’re ignorant about what paradigm will succeed re: automating sophisticated cognitive work and we don’t have any story about why alignment research would be harder, it seems like the baseline expectation (modulo scheming) would be that automating alignment is comparably hard (in expectation) to automating these other domains. (I do think, though, that we have reason to expect alignment to be harder even conditional on needing other paradigms, because I think it’s reasonable to expect some of the evaluation challenges I discuss in the post to generalize to other regimes.)