Can you tl;dr how you go from “humans cannot tell which alignment arguments are good or bad” to “we justifiably trust the AI to report honest good alignment takes”? Like, not with a very large diagram full of complicated parts such that it’s hard to spot where you’ve messed up. Just whatever simple principle you think lets you bypass GIGO.
Eg, suppose that in 2020 the Open Philanthropy Foundation would like to train an AI such that the AI would honestly say if the OpenPhil doctrine of “AGI in 2050” was based on groundless thinking ultimately driven by social conformity. However, OpenPhil is not allowed to train their AI based on MIRI. They have to train their AI entirely on OpenPhil-produced content. How does OpenPhil bootstrap an AI which will say, “Guys, you have no idea when AI shows up but it’s probably not that far and you sure can’t rely on it”? Assume that whenever OpenPhil tries to run an essay contest for saying what they’re getting wrong, their panel of judges ends up awarding the prize to somebody reassuringly saying that AI risk is an even smaller deal than OpenPhil thinks. How does OpenPhil bootstrap from that pattern of thumbs-up/thumbs-down to an AI that actually has better-than-OpenPhil alignment takes?
Broadly speaking, the standard ML paradigm lets you bootstrap somewhat from “I can verify whether this problem was solved” to “I can train a generator to solve this problem”. This applies as much to MIRI as OpenPhil. MIRI would also need some nontrivial secret amazing clever trick to gradient-descend an AI that gave us great alignment takes, instead of seeking out the flaws in our own verifier and exploiting those.
What’s the trick? My basic guess, when I see some very long complicated paper that doesn’t explain the key problem and key solution up front, is that you’ve done the equivalent of an inventor building a sufficiently complicated perpetual motion machine that their mental model of it no longer tracks how conservation laws apply. (As opposed to the simpler error of their explicitly believing that one particular step or motion locally violates a conservation law.) But if you’ve got a directly explainable trick for how you get great suggestions you can’t verify, go for it.
Sure, I’ll try.

I agree that you want AI agents to arrive at opinions that are more insightful and informed than your own. In particular, you want AI agents to arrive at conclusions that are at least as good as the best humans would if given lots of time to think and do work. So your AI agents need to ultimately generalize from some weak training signal you provide to much stronger behavior. As you say, the garbage-in-garbage-out approach of “train models to tell me what I want to hear” won’t get you this.
Here’s an alternative approach. I’ll describe it in much more concreteness and specificity than I think is necessary because I suspect the concreteness is helpful for finding points of disagreement.
Short summary: Create AI agents that faithfully simulate top humans doing small amounts of work, such that these sims create faithful simulators of top humans doing large amounts of work.
e.g. create an AI that simulates Eliezer working for 3 months, that then creates a simulator of Eliezer working for 6 months, and then expands this horizon to 12 months, 24 months, and ultimately many decades
Why does this iterative approach buy you anything?
This process allows each Eliezer to only deal with a small distribution shift.
If each distribution shift is small, each Eliezer sim mostly needs to worry about whether the next Eliezer sim is faking alignment (why else would its behavior dramatically change between when it is ‘evaluated by N-month Eliezer’ and when it is ‘creating 2N-month Eliezer’?). Thus, so long as N-month Eliezer sims can (1) successfully create 2N-month Eliezer sims and (2) ensure that 2N-month Eliezer sims are not faking alignment, the Eliezer sims continue to be faithful, and also become more capable.
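To make the shape of the proposal concrete, here is a minimal sketch of that loop. The helpers (`train_initial_sim`, `create_successor`, `passes_alignment_checks`) are hypothetical stand-ins for exactly the hard, unsolved parts:

```python
def bootstrap(train_initial_sim, initial_horizon=3, target_horizon=480):
    """Iteratively double the simulated work horizon, vetting each successor.

    `train_initial_sim` stands in for the human-supervised first step; the sim
    objects are assumed to expose `create_successor` and `passes_alignment_checks`,
    which are exactly the unsolved problems the surrounding discussion is about.
    """
    sim, horizon = train_initial_sim(initial_horizon), initial_horizon
    while horizon < target_horizon:
        candidate = sim.create_successor(2 * horizon)   # small capability jump
        if not sim.passes_alignment_checks(candidate):  # each sim vets its own successor
            raise RuntimeError(f"possible alignment faking at horizon {2 * horizon} months")
        sim, horizon = candidate, 2 * horizon
    return sim
```

The doubling schedule is doing one job: each sim only ever has to vet a successor one small step ahead of itself.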
Responding to objections
Here are some objections you might have:
But how would an Eliezer sim create an Eliezer sim more capable than itself?
How would a human create something smarter than a human?
There are many ways. Here’s one.
The 3-month Eliezer sim might spin up many copies of other 3-month Eliezer sims, which together produce outputs that a 6-month Eliezer sim might produce.
Could 10 copies of Eliezer that only have 3 months apiece verify that an Eliezer that worked for 6 months really was as competent as a 6-month Eliezer would be? I bet they could.
But isn’t there a distribution shift from Eliezer sims performing 3-month tasks and Eliezer sims working together to perform a 6-month task?
Yes. But my claim is that the distribution shift is small, and that the smallness is important.
Would there not be “slippage” between each distribution shift? Surely the errors compound.
The errors certainly might compound if no one were looking out for that. But if the Eliezer sims are faithful sims, they should be looking out for that.
Consider the ball-in-a-bowl analogy: as long as each perturbation is small, the ball rolls back toward the bottom of the bowl rather than escaping it.
Suppose Eliezer sims started to degrade at each step of the bootstrapping process—perhaps their goals change a little bit or their reasoning starts to falter in some areas.
There are ways Eliezer sims could try to avoid this. For example, they could create a diverse population of Eliezer sims, which are likely to break in slightly different ways. Then these sims notice each other’s breakages (with both white-box methods like lie-detection or standard black-box methods), and fix them.
I don’t know if Eliezer sims will be able to do this, but it’s something they could try.
But won’t the Eliezer sims just be faking alignment anyway—or eventually start faking alignment, such that the scheme-y Eliezers eventually derail the process?
I don’t know, but I believe detecting and avoiding alignment faking is tractable—at least, it is more tractable than dealing with massive distribution shifts.
For instance, suppose an Eliezer sim thinks “I will perform these 6-month tasks safely, but then defect when I am given the responsibility of creating the next successor.” This thought might be detectable. Now the problem of scaling safety becomes a problem of detecting or avoiding this kind of conditional, deceptive reasoning.
I believe this is the main problem that humans need to solve.
> This thought might be detectable. Now the problem of scaling safety becomes a problem of detecting [...] this kind of conditional, deceptive reasoning.
What do you do when you detect this reasoning? This feels like the part where all plans I ever encounter fail.
Yes, you will probably see early instrumentally convergent thinking. We have already observed a bunch of that. Do you train against it? I think that’s unlikely to get rid of it. I think at this point the natural answer is “yes, your systems are scheming against you, so you gotta stop, because when you train against it, you are probably primarily making it a better schemer”.
I would be very surprised if you have a 3-month Eliezer that is not doing scheming the first time, and training away your signals is much easier than actually training away the scheming.
> Yes, you will probably see early instrumentally convergent thinking. We have already observed a bunch of that. Do you train against it? I think that’s unlikely to get rid of it.
I’m not necessarily asserting that this solves the problem, but it seems important to note that the obviously-superior alternative to training against it is validating against it. I.e., when you observe scheming you train a new model, ideally with different techniques that you reckon have their own chance of working.
However doomed you think training against the signal is, you should think validating against it is significantly less doomed, unless there’s some reason why well-established machine learning principles don’t apply here. Using something as a validation metric to iterate methods doesn’t cause overfitting at anything like the level of directly training on it.
EDIT: later in the thread you say this “is in some sense approximately the only and central core of the alignment problem”. I’m wondering whether thinking about this validation vs training point might cause you a nontrivial update then?
> Using something as a validation metric to iterate methods doesn’t cause overfitting at anything like the level of directly training on it.
Validation is certainly less efficient at overfitting but it seems a bit like using an evolutionary algorithm rather than gradient descent. You aren’t directly optimizing according to the local gradient, but that doesn’t necessarily mean you’ll avoid Goodharting—just that you’re less likely to immediately fall into a bad local optimum.
The likelihood of preventing Goodharting feels like it depends heavily on assumptions about the search space. The “validation” filters the search space to areas where scheming isn’t easily detectable, but what portion of this space is safe (and how can we tell)? We don’t actually have a true globally accurate validator oracle—just a weak approximation of one.
I roughly agree, but it seems very robustly established in practice that the training-validation distinction is better than just having a training objective, even though your argument mostly applies just as well to the standard ML setup.
You point out an important difference which is that our ‘validation metrics’ might be quite weak compared to most cases, but I still think it’s clearly much better to use some things for validation than training.
Like, I think there are things that are easy to train away but hard/slow to validate away (just like when training an image classifier you could in principle memorise the validation set, but it would take a ridiculous amount of hyperparameter optimisation).
One example might be if we have interp methods that measure correlates of scheming. These are incredibly easy to train away, and still possible to validate away, but enough harder that the ratio of non-schemers you end up with is higher than if you had trained against the signal (which wouldn’t affect the ratio at all).
A separate argument is that I think if you just do random search over training ideas, rejecting if they don’t get a certain validation score, you actually don’t Goodhart at all. Might put that argument in a top level post.
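A toy sketch of that selection-not-training loop, assuming a hypothetical `scheming_score` detector and a hypothetical `train_with_recipe` trainer. The detector only ever accepts or rejects a finished run; it never appears in any loss:

```python
import random

def select_recipe(recipe_space, train_with_recipe, scheming_score,
                  threshold=0.05, max_trials=50, seed=0):
    """Random search over training recipes, validated (not trained) against a
    scheming detector. `train_with_recipe` and `scheming_score` are hypothetical."""
    rng = random.Random(seed)
    for _ in range(max_trials):
        recipe = rng.choice(recipe_space)
        model = train_with_recipe(recipe)       # the detector plays no role in the loss
        if scheming_score(model) < threshold:   # detector used only to accept or reject
            return model, recipe
    return None, None                           # nothing passed validation; rethink methods
```

The design choice being illustrated is just that gradient pressure never points at the detector, so any Goodharting has to happen through the much weaker channel of run-level selection.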
> A separate argument is that I think if you just do random search over training ideas, rejecting if they don’t get a certain validation score, you actually don’t Goodhart at all. Might put that argument in a top level post.
I’d be interested in seeing this argument laid out.

I wrote it out as a post here.
I think that if you agree “3-month Eliezer is scheming the first time” is the main problem, then that’s all I was trying to justify in the comment above.
I don’t know how hard it is to train 3-month Eliezer not to scheme, but here is a general methodology one might pursue to approach this problem.
The methodology looks like “do science to figure out when alignment faking happens and does not happen, and work your way toward training recipes that don’t produce alignment faking.”
For example, you could use detection tool A to gain evidence about whether training against detection tool B actually removes alignment faking.
To the extent the tool just gets gamed, you can iterate until you find detection tools that are more robust (or find ways of training against detection tools that don’t game them so hard).
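As a sketch of what that A/B methodology could look like in code (all of `train_against`, `detector_a`, and `detector_b` are hypothetical stand-ins for tools that don’t exist yet):

```python
def audit_detector_training(base_model, train_against, detector_a, detector_b):
    """Sketch of the A/B check: train against detector B's signal, then use the
    held-out detector A to estimate whether alignment faking was actually reduced
    or merely hidden. All four arguments are hypothetical stand-ins."""
    before_a, before_b = detector_a(base_model), detector_b(base_model)
    tuned = train_against(base_model, detector_b)             # B is the training signal
    after_a, after_b = detector_a(tuned), detector_b(tuned)
    looks_gamed = after_b < before_b and after_a >= before_a  # B looks clean, A disagrees
    return {"looks_gamed": looks_gamed,
            "detector_a": (before_a, after_a),
            "detector_b": (before_b, after_b)}
```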
You might also find other ways to avoid alignment faking that don’t rely on training against detection tools. For example, perhaps if you train models to be highly corrigible and chill from the start, they tend not to fake alignment -- I don’t know.
I’m uncertain about how likely alignment faking will be by default, or how hard it is to mitigate. I think these are messy empirical questions.
However, I’m more confident that there are principled empirical methodologies we can apply to make progress on these questions. I don’t think my explanation here justifies this claim -- but I’m hoping to publish posts that go into more detail soon.
> To the extent the tool just gets gamed, you can iterate until you find detection tools that are more robust (or find ways of training against detection tools that don’t game them so hard).
How do you iterate? You mostly won’t know whether you just trained away your signal, or actually made progress. The inability to iterate is kind of the whole central difficulty of this problem.
(To be clear, I do think there are some methods of iteration, but it’s a very tricky kind of iteration where you need constant paranoia about whether you are fooling yourself, and that makes it very different from other kinds of scientific iteration)
Agreed that it’s tricky iteration and requires paranoia and careful thinking about what might be going on.
I’ll share the post I’m writing about this with you before I publish. I’d guess this discussion will be more productive when I’ve finished it (there are some questions I want to think through regarding this, and my framings aren’t very crisp yet).

Seems good!
FWIW, at least in my mind this is in some sense approximately the only and central core of the alignment problem, and so having it left unaddressed feels confusing. It feels a bit like making a post about how to make a nuclear reactor where you happen to not say anything about how to prevent the uranium from going critical, but you did spend a lot of words on how to make the cooling towers and the color of the bikeshed next door and how to translate the hot steam into energy.
Like, it’s fine, and I think it’s not crazy to think there are other hard parts, but it felt quite confusing to me.
I’m sympathetic to this reaction.

I just don’t actually think many people agree that it’s the core of the problem, so I figured it was worth establishing this (and I think there are some other supplementary approaches, like automated control and incentives, that are worth throwing into the mix) before digging into the ‘how do we avoid alignment faking’ question.
I agree that this seems like a core alignment problem. The problem you are describing seems like a rephrasing of the ELK problem.

Here is a quote from the post describing what to do in that situation:
> These automatic control measures only raise the alarm. What happens after the alarm is raised? The developer might also implement autonomous incident response:
> Resampling: Force defective M_1 agents to take different actions.
> AI agent elimination: Delete suspicious agents in favor of other (hopefully decorrelated) unsuspicious ones.
> Fine-tuning: Train agents not to take suspicious actions again.
I’m curious about how fine-tuning away scheming thoughts could still lead to scheming. Here are two possible outcomes:
1. Fine-tuning actually robustly reduces scheming behavior (as intended), maybe because tokens and actions associated with scheming can be reliably detected and these tokens are required for scheming in the CoT.
2. Fine-tuning reduces the appearance of scheming but doesn’t really reduce scheming: in this case, the fine-tuning reduces the probability of certain words or trajectories but the underlying ‘concept’ of scheming remains and the model merely rephrases its thinking in a way that is not detected as scheming and uses different tokens.
For scenario 2, maybe you need something like activation engineering to change the underlying thought process rather than just token-level descriptions of it.
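For concreteness, one hedged version of “something like activation engineering” is inference-time steering: subtract a pre-computed “scheming” direction from one layer’s residual stream. Finding that direction and choosing the layer are the hard parts and are simply assumed here:

```python
import torch

def add_steering_hook(block: torch.nn.Module, direction: torch.Tensor, alpha: float = 4.0):
    """Register a forward hook on one transformer block that subtracts a
    (hypothetical, pre-computed) 'scheming' direction from its residual-stream
    output at inference time."""
    unit = direction / direction.norm()

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden - alpha * unit.to(hidden.dtype).to(hidden.device)
        return (steered,) + output[1:] if isinstance(output, tuple) else steered

    return block.register_forward_hook(hook)
```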
At this point I would direct the “deferred task” apparatus fully towards interventional interpretability. Put a moratorium on further gradient-based training, which is not well understood and can have many indirect effects unless you have some understanding of modularity and have applied stop gradients almost everywhere that is irrelevant to the generator of the conditional, deceptive reasoning behavior. Instead halt, melt and catch fire at that point.
Halt further model deployments towards the original deferred task. Quarantine the model that first exhibited the reasoning and promote it to a model organism (never deployed further, only studied by other earlier models). Direct the apparatus (from earlier checkpoints before you reached this stage) towards surgically reducing it while preserving mechanisms relevant to the original deferred task.
In the frame of enumerative safety, this would look something like identifying which circuits and features are minimal to the original deferred task (though this hinges upon task-minimal circuits being deception-free), evaluating the identified regions for mechanism preservation, localization, and minimality, and run-time ablating the rest.
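A rough sketch of the run-time ablation step under that frame, assuming you already have a whitelist of task-relevant components and their stored mean activations from a prior circuit-identification pass (which is the genuinely hard part):

```python
import torch

def install_runtime_ablations(model: torch.nn.Module, whitelist, mean_activation):
    """Mean-ablate every non-whitelisted submodule at run time. `whitelist` (names
    of task-relevant components) and `mean_activation` (their stored mean outputs)
    are assumed to come from a prior, hypothetical circuit-identification pass."""
    handles = []
    for name, module in model.named_modules():
        if name in whitelist or name not in mean_activation:
            continue
        mean = mean_activation[name]

        def hook(module, inputs, output, mean=mean):
            return mean.expand_as(output)  # discard this component's actual computation

        handles.append(module.register_forward_hook(hook))
    return handles  # call h.remove() on each handle to undo the ablation
```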
There is probably a much better frame, but this is a new start to the “induction base case” in a completely separate invocation of the original apparatus. There are other permutations of spinning up new deferred tasks, such as earlier models applying Microscope AI on the quarantined model organism to gain insight on how we got to this point, without any direct attempt to remediate the issue, which as you note is “unlikely to get rid of it” despite one naive attempt above.
If your spinoff deferred tasks lead to the procurement of provable statements or heuristic arguments that the quarantined model cannot be minimized without “avoiding this kind of conditional, deceptive reasoning,” abandon the original deferred task and switch to a different plan. Otherwise, restart the apparatus towards the original deferred task when you have the proof artifacts.
There are a lot of concerns you could raise with this additional structure but it seems like a distinct problem that requires a separate rebuttal rather than a hard stop fail? The obvious one is that these sorts of spinoff deferred tasks could be harder than the original task and consistently lead to the same failure mode, a la “exception thrown while handling previous exception.”
I don’t think you can train an actress to simulate me, successfully, without her going dangerous. I think that’s over the threshold for where a mind starts reflecting on itself and pulling itself together.
I would not be surprised if the Eliezer simulators do go dangerous by default as you say.
But this is something we can study and work to avoid (which is what I view to be my main job)
My point is just that preventing the early Eliezers from “going dangerous” (by which I mean from “faking alignment”) is the bulk of the problem humans need to address (and insofar as we succeed, the hope is that future Eliezer sims will prevent their Eliezer successors from going dangerous too)
I’ll discuss why I’m optimistic about the tractability of this problem in future posts.
So the “IQ 60 people controlling IQ 80 people controlling IQ 100 people controlling IQ 120 people controlling IQ 140 people until they’re genuinely in charge and genuinely getting honest reports and genuinely getting great results in their control of a government” theory of alignment?
I’d replace “controlling” with “creating”, but given this change, then yes, that’s what I’m proposing.

So if it’s difficult to get amazing trustworthy work out of a machine actress playing an Eliezer-level intelligence doing a thousand years worth of thinking, your proposal to have AIs do our AI alignment homework fails on the first step, it sounds like?
I do not think that the initial humans at the start of the chain can “control” the Eliezers doing thousands of years of work in this manner (if you use control to mean “restrict the options of an AI system in such a way that it is incapable of acting in an unsafe manner”)
That’s because each step in the chain requires trust.
For N-month Eliezer to scale to 4N-month Eliezer, it first controls 2N-month Eliezer while it does 2N-month tasks, but it trusts 2N-month Eliezer to create a 4N-month Eliezer.
So the control property is not maintained. But my argument is that the trust property is. The humans at the start can indeed trust the Eliezers at the end to do thousands of years of useful work—even though the Eliezers at the end are fully capable of doing something else instead.
> The 3-month Eliezer sim might spin up many copies of other 3-month Eliezer sims, which together produce outputs that a 6-month Eliezer sim might produce.
This seems very blatantly not viable-in-general, in both theory and practice.
On the theory side: there are plenty of computations which cannot be significantly accelerated via parallelism with less-than-exponential resources. (If we do have exponential resources, then all binary circuits can be reduced to depth 2, but in the real world we do not and will not have exponential resources.) Serial computation requirements are, in fact, a thing. So you can’t just have a bunch of Eliezers do in 3 months what a single 6 month Eliezer could do, in general.
Even if you allow the 3-month Eliezer sims to act one-after-another, rather than trying to get everything done in 3 months simulated time via pure parallelism, there’s still a tight communication bottleneck between each Eliezer sim. There are presumably plenty of circuits which cannot be implemented with tight information bottlenecks every n serial steps.
… of course you could circumvent all that theory by e.g. just having each Eliezer emulate a single logic gate, or some comparably trivial thing, but at that point you run afoul of the non-compositionality of safety properties: putting together a bunch of “safe” things (or “interpretable” things, or “aligned” things, or “corrigible” things, …) does not in-general produce a safe thing.
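A concrete, everyday instance of the serial-depth point: iterated hashing, where each step needs the previous step’s output, so throwing more short-horizon workers at it buys nothing.

```python
import hashlib

def iterated_hash(seed: bytes, steps: int) -> bytes:
    """Step k+1 consumes step k's digest, so the chain is inherently serial:
    a thousand 3-month workers finish it no sooner than one 6-month worker."""
    digest = seed
    for _ in range(steps):
        digest = hashlib.sha256(digest).digest()
    return digest

# No amount of parallelism shortcuts iterated_hash(b"seed", 10**7) without
# either communicating the whole chain serially or breaking SHA-256.
```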
So that’s the theory side. What about the practice side?

Well, Ought ran roughly that experiment years ago and it did not work at all. And that should not be surprising—as the link argues, we have extremely ample evidence from day-to-day life that such things do not work in general.
Probably the iterated amplification proposal I described is very suboptimal. My goal with it was to illustrate how safety could be preserved across multiple buck-passes if models are not egregiously misaligned.
Like I said at the start of my comment: “I’ll describe [a proposal] in much more concreteness and specificity than I think is necessary because I suspect the concreteness is helpful for finding points of disagreement.”
I don’t actually expect safety will scale efficiently via the iterated amplification approach I described. The iterated amplification approach is just relatively simple to talk about.
What I actually expect to happen is something like:
- Humans train AI agents that are smarter than ourselves with ground truth reward signals from synthetically generated tasks created from internet data + a bit of fine-tuning with scalable oversight at the end.
- Early AI successors create smarter successors in basically the same way.
- At some point, AI agents start finding much more efficient ways to safely scale capabilities. e.g. maybe initially, they do this with a bunch of weak-to-strong generalization research. And eventually they figure out how to do formally verified distillation.
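To illustrate what a “ground truth reward signal from synthetically generated tasks” can look like in the simplest case (the task family and `policy` interface here are invented for illustration, not a claim about what labs actually use):

```python
import random

def make_arithmetic_task(rng: random.Random):
    a, b = rng.randrange(1000), rng.randrange(1000)
    return f"What is {a} + {b}?", str(a + b)   # answer known by construction

def synthetic_accuracy(policy, num_tasks: int = 1000, seed: int = 0) -> float:
    """Ground-truth reward needing no human judgment: the fraction of
    synthetically generated tasks the (hypothetical) policy answers exactly."""
    rng = random.Random(seed)
    tasks = [make_arithmetic_task(rng) for _ in range(num_tasks)]
    return sum(policy(q).strip() == a for q, a in tasks) / num_tasks
```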
But at this point, humans will long be obsolete. The position I am defending in this post is that it’s not very important for us humans to think about these scalable approaches.
That’s a much more useful answer, actually. So let’s bring it back to Eliezer’s original question:
> Can you tl;dr how you go from “humans cannot tell which alignment arguments are good or bad” to “we justifiably trust the AI to report honest good alignment takes”? Like, not with a very large diagram full of complicated parts such that it’s hard to spot where you’ve messed up. Just whatever simple principle you think lets you bypass GIGO.
> [...]
> Broadly speaking, the standard ML paradigm lets you bootstrap somewhat from “I can verify whether this problem was solved” to “I can train a generator to solve this problem”.
So to summarize your short, simple answer to Eliezer’s question: you want to “train AI agents that are [somewhat] smarter than ourselves with ground truth reward signals from synthetically generated tasks created from internet data + a bit of fine-tuning with scalable oversight at the end”. And then you hope/expect/(??have arguments or evidence??) that this allows us to (?justifiably?) trust the AI to report honest good alignment takes sufficient to put shortly-posthuman AIs inside the basin of attraction of a good eventual outcome, despite (as Eliezer puts it) humans being unable to tell which alignment takes are good or bad.
Or, to compact the summary even further: you want to train the somewhat-smarter-than-human AI on easily-verifiable synthetically-generated tasks, and then hope/expect that its good performance on those tasks generalizes to a problem which is not easily verifiable or synthetically generated, namely the problem of checking that a next generation of AI is in the basin of attraction of a good-to-humans outcome.
(Note: I know you’ve avoided talking about the basin of attraction of a good-to-humans outcome, instead focused on just some short-term goal like e.g. not being killed by the very next generation of AI. Not focusing on the basin of attraction is a mistake, and we can go into why it’s a mistake if that turns out to be cruxy.)
In Eliezer’s comment, he was imagining a training setup somewhat different from easily-verifiable synthetically-generated tasks:
> Assume that whenever OpenPhil tries to run an essay contest for saying what they’re getting wrong, their panel of judges ends up awarding the prize to somebody reassuringly saying that AI risk is an even smaller deal than OpenPhil thinks. How does OpenPhil bootstrap from that pattern of thumbs-up/thumbs-down to an AI that actually has better-than-OpenPhil alignment takes?
… but the analogue of the problem Eliezer was highlighting, in the context of training on easily-verifiable synthetically-generated tasks, is the question: how and why would we justifiably trust that an AI trained on easily-verifiable synthetic tasks generalizes to not-easily-verifiable real-world tasks?
> So to summarize your short, simple answer to Eliezer’s question: you want to “train AI agents that are [somewhat] smarter than ourselves with ground truth reward signals from synthetically generated tasks created from internet data + a bit of fine-tuning with scalable oversight at the end”. And then you hope/expect/(??have arguments or evidence??) that this allows us to (?justifiably?) trust the AI to report honest good alignment takes sufficient to put shortly-posthuman AIs inside the basin of attraction of a good eventual outcome, despite (as Eliezer puts it) humans being unable to tell which alignment takes are good or bad.
This one is roughly right. Here’s a clarified version:
- If we train an AI that is (1) not faking alignment in an egregious way and (2) looks very competent and safe to us and (3) the AI seems likely to be able to maintain its alignment / safety properties as it would if humans were in the loop (see section 8), then I think we can trust this AI to “put shortly-posthuman AIs inside the basin of attraction of a good eventual outcome” (at least as well as humans would have been able to do so if they were given a lot more time to attempt this task) “despite (as Eliezer puts it) humans being unable to tell which alignment takes are good or bad.”
I think I might have created confusion by introducing the detail that AI agents will probably be trained on a lot of easy-to-verify tasks. I think this is where a lot of their capabilities will come from, but we’ll do a step at the end where we fine-tune AI agents for hard-to-verify tasks (similar to how we do RLHF after pre-training in the chatbot paradigm)
And I think our disagreement mostly pertains to this final fine-tuning step on hard-to-verify tasks.
Here’s where I think we agree. I think we both agree that if humans naively fine-tuned AI agents to regurgitate their current alignment takes, that would be way worse than if humans had way more time to think and do work on alignment.
My claim is that we can train AI agents to imitate the process of how humans improve their takes over time, such that after the AI agents do work for a long time, they will produce similarly good outcomes as the outcomes where humans did work for a long time.
Very concretely: I imagine that if I’m training an AI agent successor, a key thing I do is try to understand how it updates based on new evidence. Does it (appear to) update its views as reasonably as the most reasonable humans would?
If so, and if it is not egregiously misaligned (or liable to become egregiously misaligned), then I basically expect that letting it go and do lots of reasoning is likely to produce outcomes that are approximately as good as the outcomes humans would produce.
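One hedged way to make “does it update as reasonably as the most reasonable humans would” operational: run the agent and a panel of trusted human forecasters through the same sequence of evidence and compare their belief updates. The harness and interfaces below are invented for illustration:

```python
def mean_update_gap(agent_posteriors, human_posteriors):
    """Average gap between the agent's and trusted human forecasters' beliefs
    after each successive piece of evidence (both are lists of probabilities
    produced by a hypothetical shared evaluation harness)."""
    assert len(agent_posteriors) == len(human_posteriors)
    gaps = [abs(a - h) for a, h in zip(agent_posteriors, human_posteriors)]
    return sum(gaps) / len(gaps)

# e.g. mean_update_gap([0.20, 0.35, 0.60], [0.25, 0.40, 0.55]) ≈ 0.05
```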
Are we close to a crux?
Maybe a crux is whether training ~aligned AI agents to imitate the process by which humans improve their takes over time will lead to outcomes as good as if we let humans do way more work.
(Obviously I’m biased here by being friends with Ajeya.) This is only tangentially related to the main point of the post, but I think you’re really overstating how many Bayes points you get against Ajeya’s timelines report. Ajeya gave 15% to AGI before 2036, with little of that in the first few years after her report; maybe she’d have said 10% between 2025 and 2036.
I don’t think you’ve ever made concrete predictions publicly (which makes me think it’s worse behavior for you to criticize people for their predictions), but I don’t think there are that many groups who would have put wildly higher probability on AGI in this particular time period. (I think some of the short-timelines people at the time put substantial mass on AGI arriving by now, which reduces their performance.) Maybe some of them would have said 40%? If we assume AGI by then, that’s a couple bits of better performance, but I don’t think it’s massive outperformance. (And I still think it’s plausible that AGI isn’t developed by 2036!)
In general, I think that disagreements on AI timelines often seem more extreme when you summarize people’s timelines by median timeline rather than by their probability on AGI by a particular time.
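For concreteness, the “couple bits” arithmetic, using the hypothetical 40% figure above and conditioning on AGI actually arriving in that window:

```python
from math import log2

p_ajeya = 0.10  # ~10% on AGI between 2025 and 2036, per the comment above
p_bull  = 0.40  # the hypothetical 40% short-timelines forecaster
print(log2(p_bull / p_ajeya))  # 2.0 bits of outperformance, conditional on AGI by 2036
```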
Here “doctrine” is an applause light; boo, doctrines. I wrote a report, you posted your timeline, they have a doctrine.
All involved, including Yudkowsky, understand that 2050 was a median estimate, not a point estimate. Yudkowsky wrote that it has “very wide credible intervals around both sides”. Looking at the report’s distribution over the year by which the FLOP to train a transformative model becomes affordable, I’d summarize it as:
A 50% chance that it will be affordable by 2053, rising from 10% by 2032 to 78% by 2100. The most likely years are 2038-2045, which are >2% each.
A comparison: a 52yo US female in 1990 had a median life expectancy of ~30 more years, living to 2020. 5% of such women died on or before age 67 (2005). Would anyone describe these life expectancy numbers to a 52yo woman in 1990 as the “Aetna doctrine of death in 2020”?
> My timelines are now roughly similar on the object level (maybe a year slower for 25th and 1-2 years slower for 50th)
This means 25th percentile for 2028 and 50th percentile for 2031-2.
The original 2020 model assigns 5.23% by 2028, and 9.13% by 2031 or 10.64% by 2032. Each is a factor of ~5x below the updated percentiles.
However, the original model predicted the date by which it was affordable to train a transformative AI model. This is a leading variable on such a model actually being built and trained, pushing back the date by some further number of years, so view the 5x as bounding, not pinpointing, the AI timelines update Cotra has made.
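Spelling out the ~5x factors (original 2020-model probabilities vs. the updated percentiles quoted above):

```python
updated  = {"by 2028": 0.25, "by 2031-2": 0.50}      # 25th / 50th percentiles quoted above
original = {"by 2028": 0.0523, "by 2031-2": 0.0913}  # 2020 model: 5.23% and 9.13%
for year in updated:
    print(year, round(updated[year] / original[year], 1))  # ≈ 4.8x and ≈ 5.5x
```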
Note that the capability milestone forecasted in the linked short form is substantially weaker than the notion of transformative AI in the 2020 model. (It was defined as AI with an effect at least as large as the industrial revolution.)
I don’t expect this adds many years, for me it adds like ~2 years to my median.
(Note that my median for time from 10x to this milestone is lower than 2 years, but median to Y isn’t equal to median to X + median from X to Y.)
> Ajeya gave 15% to AGI before 2036, with little of that in the first few years after her report; maybe she’d have said 10% between 2025 and 2036.
Just because I was curious, here is the most relevant chart from the report:
This is not a direct probability estimate (since it’s about probability of affordability), but it’s probably within a factor of 2. Looks like the estimate by 2030 was 7.72% and the estimate by 2036 is 17.36%.
There’s no consensus among MIRI researchers on how long timelines are, and our aggregated estimate puts medium-to-high probability on scenarios in which the research community hasn’t developed AGI by, e.g., 2035. On average, however, research staff now assign moderately higher probability to AGI’s being developed before 2035 than we did a year or two ago.
I don’t think the individual estimates that made up the aggregate were ever published. Perhaps someone at MIRI can help us out, it would help build a forecasting track record for those involved.
For Yudkowsky in particular, I have a small collection of sources to hand. In Biology-Inspired AGI Timelines (2021-12-01), he wrote:
> But I suppose I cannot but acknowledge that my outward behavior seems to reveal a distribution whose median seems to fall well before 2050.
> I could be wrong, but my guess is that we do not get AGI just by scaling ChatGPT, and that it takes surprisingly long from here. Parents conceiving today may have a fair chance of their child living to see kindergarten.
> When the insider conversation is about the grief of seeing your daughter lose her first tooth, and thinking she’s not going to get a chance to grow up, I believe we are past the point of playing political chess about a six-month moratorium.
Yudkowsky also has a track record betting on Manifold that AI will wipe out humanity by 2030, at up to 40%.
Putting these together:
2021: median well before 2050
2022: “fair chance” when a 2023 baby goes to kindergarten (Sep 2028 or 2029)
2023: before a young child grows up (about 2035)
40% P(Doom by 2030)
So a median of 2029, with very wide credible intervals around both sides. This is just an estimate based on his outward behavior.
Would Yudkowsky describe this as “Yudkowsky’s doctrine of AGI in 2029”?
> What’s the trick? My basic guess, when I see some very long complicated paper that doesn’t explain the key problem and key solution up front, is that you’ve done the equivalent of an inventor building a sufficiently complicated perpetual motion machine that their mental model of it no longer tracks how conservation laws apply. (As opposed to the simpler error of their explicitly believing that one particular step or motion locally violates a conservation law.) But if you’ve got a directly explainable trick for how you get great suggestions you can’t verify, go for it.
I think a crux here is that I do not believe we will get no-go theorems like this, and more to the point, complete impossibilities given useful assumptions are generally much rarer than you make them out to be.
The big reason for this is that the very fact that neural networks are messy/expressive makes it extremely hard to bound their behavior, and the same reason you couldn’t do provable safety/alignment on the AI itself except for very toy examples will also limit any ability to prove hard theorems on what an AI is aligned to.
From an epistemic perspective, we have way more evidence of the laws of thermodynamics existing than particular proposals for AI alignment being impossible, arguably by billions or trillions of bits more, so much so that there is little reason to think at our current state of epistemic clarity that we can declare a direction impossible (rather than impractical).
Quintin Pope correctly challenges @Liron here on this exact point, because of the yawning gap in evidence between thermodynamics and AI alignment arguments, and Liron kind of switched gears mid-way to claim a much weaker stance:
> Equating a bunch of speculation about instrumental convergence, consequentialism, the NN prior, orthogonality, etc., with the overwhelming evidence for thermodynamic laws, is completely ridiculous.
> Seeing this sort of massive overconfidence on the part of pessimists is part of why I’ve become more confident in my own inside-view beliefs that there’s not much to worry about.
> Right, orthogonality doesn’t argue that AI we build *will* have human-incompatible preferences, only that it can.
> It raises the question: how will the narrow target in preference-space be hit?
> Then it becomes concerning how AI labs admit their tools can’t hit narrow targets.
My claim here is that there is no decisive blocker for plans that get a safe, highly capable AI that is used for automated AI safety research, in the way that thermodynamics blocks you from getting a perpetual motion machine (under the assumption that the universe is time-symmetric, that is, physics stays the same no matter when an experiment happens), which has been well tested; and the proposed blockers do not have anywhere close to the amount of evidence thermodynamics does, such that we could safely discard any plan that doesn’t meet a prerequisite.
Cool. What’s the actual plan and why should I expect it not to create machine Carissa Sevar? I agree that the Textbook From The Future Containing All The Simple Tricks That Actually Work Robustly enables the construction of such an AI, but also at that point you don’t need it.
Noosphere, why are you responding for a second time to a false interpretation of what Eliezer was saying, directly after he clarified this isn’t what he meant?
Okay, maybe he clarified that there was no thermodynamics-like blocker to getting a plan in principle to align AI, but I didn’t interpret Eliezer’s clarification to rule that out immediately, so I wanted to rule that interpretation out.
I didn’t see the interpretation as false when I wrote it, because I believed he only ruled out a decisive blocker to getting behaviors you don’t know how to verify.
I think the misunderstanding came from Eliezer’s reference to a perpetual motion machine. The point was that people suggesting how to build them often have complicated schemes that tend to not adequately address the central difficulty of creating one. That’s where the analogy ends. From thermodynamics, we have strong reasons to believe such a thing is not just difficult but impossible whereas we have no corresponding theory to rule out verifiably safe AI.
Habryka’s analogy to nuclear reactor plans is similar except we know that building one of those is difficult but actually possible.
Can you tl;dr how you go from “humans cannot tell which alignment arguments are good or bad” to “we justifiably trust the AI to report honest good alignment takes”? Like, not with a very large diagram full of complicated parts such that it’s hard to spot where you’ve messed up. Just whatever simple principle you think lets you bypass GIGO.
Eg, suppose that in 2020 the Open Philanthropy Foundation would like to train an AI such that the AI would honestly say if the OpenPhil doctrine of “AGI in 2050” was based on groundless thinking ultimately driven by social conformity. However, OpenPhil is not allowed to train their AI based on MIRI. They have to train their AI entirely on OpenPhil-produced content. How does OpenPhil bootstrap an AI which will say, “Guys, you have no idea when AI shows up but it’s probably not that far and you sure can’t rely on it”? Assume that whenever OpenPhil tries to run an essay contest for saying what they’re getting wrong, their panel of judges ends up awarding the prize to somebody reassuringly saying that AI risk is an even smaller deal than OpenPhil thinks. How does OpenPhil bootstrap from that pattern of thumbs-up/thumbs-down to an AI that actually has better-than-OpenPhil alignment takes?
Broadly speaking, the standard ML paradigm lets you bootstrap somewhat from “I can verify whether this problem was solved” to “I can train a generator to solve this problem”. This applies as much to MIRI as OpenPhil. MIRI would also need some nontrivial secret amazing clever trick to gradient-descend an AI that gave us great alignment takes, instead of seeking out the flaws in our own verifier and exploiting those.
What’s the trick? My basic guess, when I see some very long complicated paper that doesn’t explain the key problem and key solution up front, is that you’ve done the equivalent of an inventor building a sufficiently complicated perpetual motion machine that their mental model of it no longer tracks how conservation laws apply. (As opposed to the simpler error of their explicitly believing that one particular step or motion locally violates a conservation law.) But if you’ve got a directly explainable trick for how you get great suggestions you can’t verify, go for it.
Sure, I’ll try.
I agree that you want AI agents to arrive at opinions that are more insightful and informed than your own. In particular, you want AI agents to arrive at conclusions that are at least as good as the best humans would if given lots of time to think and do work. So your AI agents need to ultimately generalize from some weak training signal you provide to much stronger behavior. As you say, the garbage-in-garbage-out approach of “train models to tell me what I want to hear” won’t get you this.
Here’s an alternative approach. I’ll describe it in much more concreteness and specificity than I think is necessary because I suspect the concreteness is helpful for finding points of disagreement.
Short summary: Create AI agents that faithfully simulate top humans doing small amounts of work, such that these sims create faithful simulators of top humans doing large amounts of work.
e.g. create an AI that simulates Eliezer working for 3 months, that then creates a simulator of Eliezer working for 6 months, and then expands this horizon to 12 months, 24 months, and ultimately many decades
Why does this iterative approach buy you anything?
This process allows each Eliezer to only deal with a small distribution shift.
If each distribution shift is small, each Eliezer sim mostly needs to worry if the next Eliezer sim is faking alignment (why else would its behavior dramatically change between when it is ‘evaluated by N-month Eliezer’ and when it is ‘creating 2N-month Eliezer’?). Thus, so long as N-month Eliezer sims can (1) successfully create 2N-month Eliezer sims and (2) can ensure that 2N-month Eliezer sims are not faking alignment, the Eliezer sims continue to be faithful, and also become more capable.
Responding to objections
Here are some objections you might have:
But how would an Eliezer sim create an Eliezer sim more capable than itself?
How would a human create something smarter than a human?
There are many ways. Here’s one.
The 3 month Eliezer sim might spin up many copies of other 3 month Eliezer sims, which together produce outputs that a 6-month Eliezer sim might produce.
Could 10 copies of Eliezer that only have 3 months apiece verify that an Eliezer that worked for 6-months really was as competent as a 6-month Eliezer would be? I bet they could.
But isn’t there a distribution shift from Eliezer sims performing 3-month tasks and Eliezer sims working together to perform a 6-month task?
Yes. But my claim is that the distribution shift is small, and that the smallness is important.
Would there not be “slippage” between each distribution shift? Surely the errors compound.
The errors certainly might compound if no one were looking out for that. But if the Eliezer sims are faithful sims, they should be looking out for that.
Consider the ball in bowl analogy below:
Suppose Eliezer sims started to degrade at each step of the bootstrapping process—perhaps their goals change a little bit or their reasoning starts to falter in some areas.
There are ways Eliezer sims could try to avoid this. For example, they could create a diverse population of Eliezer sims, which are likely to break in slightly different ways. Then these sims notice each other’s breakages (with both white-box methods like lie-detection or standard black-box methods), and fix them.
I don’t know if Eliezer sims will be able to do this, but it’s something they could try.
But won’t the Eliezer sims just be faking alignment anyway—or eventually start faking alignment, such that the scheme-y Eliezers eventually derail the process?
I don’t know, but I believe detecting and avoiding alignment faking is tractable—at least, it is more tractable than dealing with massive distribution shifts.
For instance, suppose an Eliezer sim thinks “I will perform these 6-month tasks safely, but then defect when I am given the responsibility of creating the next successor.” This thought might be detectable. Now the problem of scaling safety becomes a problem of detecting or avoiding this kind of conditional, deceptive reasoning.
I believe this is the main problem that humans need solve.
What do you do when you detect this reasoning? This feels like the part where all plans I ever encounter fail.
Yes, you will probably see early instrumentally convergent thinking. We have already observed a bunch of that. Do you train against it? I think that’s unlikely to get rid of it. I think at this point the natural answer is “yes, your systems are scheming against you, so you gotta stop, because when you train against it, you are probably primarily making it a better schemer”.
I would be very surprised if you have a 3-month Eliezer that is not doing scheming the first time, and training your signals away is much easier than actually training away the scheming.
I’m not necessarily asserting that this solves the problem, but seems important to note that the obviously-superior alternative to training against it is validating against it. i.e., when you observe scheming you train a new model, ideally with different techniques that you reckon have their own chance of working.
However doomed you think training against the signal is, you should think validating against it is significantly less doomed, unless there’s some reason why well-established machine learning principles don’t apply here. Using something as a validation metric to iterate methods doesn’t cause overfitting at anything like the level of directly training on it.
EDIT: later in the thread you say this this “is in some sense approximately the only and central core of the alignment problem”. I’m wondering whether thinking about this validation vs training point might cause you a nontrivial update then?
Validation is certainly less efficient at overfitting but it seems a bit like using an evolutionary algorithm rather than gradient descent. You aren’t directly optimizing according to the local gradient, but that doesn’t necessarily mean you’ll avoid Goodharting—just that you’re less likely to immediately fall into a bad local optimum.
The likelihood of preventing Goodharting feels like it depends heavily on assumptions about the search space. The “validation” filters the search space to areas where scheming isn’t easily detectable, but what portion of this space is safe (and how can we tell)? We don’t actually have a true globally accurate validator oracle—just a weak approximation of one.
I roughly agree, but it seems very robustly established in practice that the training-validation distinction is better than just having a training objective, even though your argument mostly applies just as well to the standard ML setup.
You point out an important difference which is that our ‘validation metrics’ might be quite weak compared to most cases, but I still think it’s clearly much better to use some things for validation than training.
Like, I think there are things that are easy to train away but hard/slow to validate away (just like when training an image classifier you could in principle memorise the validation set, but it would take a ridiculous amount of hyperparameter optimisation).
One example might be if we have interp methods that measure correlates of scheming. Incredibly easy to train away, still possible to validate away but probably harder enough that ratio of non-schemers you get is higher than if trained against it, which wouldn’t affect the ratio at all.
A separate argument is that I’m think if you just do random search over training ideas, rejecting if they don’t get a certain validation score, you actually don’t goodhart at all. Might put that argument in a top level post.
I’d be interested in seeing this argument laid out.
I wrote it out as a post here.
I think that if you agree “3-month Eliezer is scheming the first time” is the main problem, then that’s all I was trying to justify in the comment above.
I don’t know how hard it is to train 3-month Eliezer not to scheme, but here is a general methodology one might purse to approach this problem.
The methodology looks like “do science to figure out when alignment faking happens and does not happen, and work your ways toward training recipes that don’t produce alignment faking.”
For example, you could use detection tool A to gain evidence about whether training against detection tool B actually removes alignment faking.
To the extent the tool just gets gamed, you can iterate until you find detection tools that are more robust (or find ways of training against detection tools that don’t game them so hard).
You might also find other ways to avoid alignment faking that don’t rely on training against detection tools. For example, perhaps if you train models to be highly corrigible and chill from the start, they tend not to fake alignment -- I don’t know.
I’m uncertain about how likely alignment faking will be by default, or how hard it is to mitigate. I think these are messy empirical questions.
However, I’m more confident that there are principled empirical methodologies we can apply to make progress on these questions. I don’t think my explanation here justifies this claim -- but I’m hoping to publish posts that go into more detail soon.
How do you iterate? You mostly won’t know whether you just trained away your signal, or actually made progress. The inability to iterate is kind of the whole central difficulty of this problem.
(To be clear, I do think there are some methods of iteration, but it’s a very tricky kind of iteration where you need constant paranoia about whether you are fooling yourself, and that makes it very different from other kinds of scientific iteration)
agree that it’s tricky iteration and requires careful thinking about what might be going on and paranoia.
I’ll share you on the post I’m writing about this before I publish. I’d guess this discussion will be more productive when I’ve finished it (there are some questions I want to think through regarding this and my framings aren’t very crisp yet).
Seems good!
FWIW, at least in my mind this is in some sense approximately the only and central core of the alignment problem, and so having it left unaddressed feels confusing. It feels a bit like making a post about how to make a nuclear reactor where you happen to not say anything about how to prevent the uranium from going critical, but you did spend a lot of words about the how to make the cooling towers and the color of the bikeshed next door and how to translate the hot steam into energy.
Like, it’s fine, and I think it’s not crazy to think there are other hard parts, but it felt quite confusing to me.
I’m sympathetic to this reaction.
I just don’t actually think many people agree that it’s the core of the problem, so I figured it was worth establishing this (and I think there are some other supplementary approaches like automated control and incentives that are worth throwing into the mix) before digging into the ‘how do we avoid alignment faking’ question
I agree that this seems like a core alignment problem. The problem you are describing seems like a rephrasing of the ELK problem.
Here is a quote from the post describing what to do in that situation:
I’m curious about how fine-tuning away scheming thoughts could still lead to scheming. Here are two possible outcomes:
Fine-tuning actually robustly reduces scheming behavior (as intended) maybe because tokens and actions associated with scheming can be reliably detected and these tokens are required for scheming in the CoT.
Fine-tuning reduces the appearance of scheming but doesn’t really reduce scheming: in this case, the fine-tuning reduces the probability of certain words or trajectories but the underlying ‘concept’ of scheming remains and the model merely rephrases its thinking in a way that is not detected as scheming and uses different tokens.
For scenario 2, maybe you need something like activation engineering to change the underlying thought process rather than just token-level descriptions of it.
At this point I would direct the “deferred task” apparatus fully towards interventional interpretability. Put a moratorium on further gradient-based training, which is not well understood and can have many indirect effects unless you have some understanding of modularity and have applied stop gradients almost everywhere that is irrelevant to the generator of the conditional, deceptive reasoning behavior. Instead halt, melt and catch fire at that point.
Halt further model deployments towards the original deferred task. Quarantine the model that first exhibited the reasoning and promote it to a model organism (never deployed further, only studied by other earlier models). Direct the apparatus (from earlier checkpoints before you reached this stage) towards surgically reducing it while preserving mechanisms relevant to the original deferred task.
In the frame of enumerative safety, this would look something like identifying which circuits and features are minimal to the original deferred task (though this hinges upon task-minimal circuits being deception-free), evaluating the identified regions for mechanism preservation, localization and minimality and run-time ablating the rest.
There is probably a much better frame, but this is a new start to the “induction base case” in a completely separate invocation of the original apparatus. There are other permutations of spinning up new deferred tasks, such as earlier models applying Microscope AI on the quarantined model organism to gain insight on how we got to this point, without any direct attempt to remediate the issue, which as you note is “unlikely to get rid of it” despite one naive attempt above.
If your spinoff deferred tasks lead to the procurement of provable statements or heuristic arguments that the quarantined model cannot be minimized without “avoiding this kind of conditional, deceptive reasoning,” abandon the original deferred task and switch to a different plan. Otherwise, restart the apparatus towards the original deferred task when you have the proof artifacts.
There are a lot of concerns you could raise with this additional structure but it seems like a distinct problem that requires a separate rebuttal rather than a hard stop fail? The obvious one is that these sorts of spinoff deferred tasks could be harder than the original task and consistently lead to the same failure mode, a la “exception thrown while handling previous exception.”
I don’t think you can train an actress to simulate me, successfully, without her going dangerous. I think that’s over the threshold for where a mind starts reflecting on itself and pulling itself together.
I would not be surprised if the Eliezer simulators do go dangerous by default as you say.
But this is something we can study and work to avoid (which is what I view to be my main job)
My point is just that preventing the early Eliezers from “going dangerous” (by which I mean from “faking alignment”) is the bulk of the problem humans need address (and insofar as we succeed, the hope is that future Eliezer sims will prevent their Eliezer successors from going dangerous too)
I’ll discuss why I’m optimistic about the tractability of this problem in future posts.
So the “IQ 60 people controlling IQ 80 people controlling IQ 100 people controlling IQ 120 people controlling IQ 140 people until they’re genuinely in charge and genuinely getting honest reports and genuinely getting great results in their control of a government” theory of alignment?
I’d replace “controlling” with “creating” but given this change, then yes, that’s what I’m proposing.
So if it’s difficult to get amazing trustworthy work out of a machine actress playing an Eliezer-level intelligence doing a thousand years worth of thinking, your proposal to have AIs do our AI alignment homework fails on the first step, it sounds like?
I do not think that the initial humans at the start of the chain can “control” the Eliezers doing thousands of years of work in this manner (if you use control to mean “restrict the options of an AI system in such a way that it is incapable of acting in an unsafe manner”)
That’s because each step in the chain requires trust.
For N-month Eliezer to scale to 4N-month Eliezer, it first controls 2N-month Eliezer while it does 2 month tasks, but it trusts 2-Month Eliezer to create a 4N-month Eliezer.
So the control property is not maintained. But my argument is that the trust property is. The humans at the start can indeed trust the Eliezers at the end to do thousands of years of useful work—even though the Eliezers at the end are fully capable of doing something else instead.
This seems very blatantly not viable-in-general, in both theory and practice.
On the theory side: there are plenty of computations which cannot be significantly accelerated via parallelism with less-than-exponential resources. (If we had exponential resources, then all binary circuits could be reduced to depth 2, but in the real world we do not and will not have exponential resources.) Serial computation requirements are, in fact, a thing. So you can’t just have a bunch of Eliezers do in 3 months what a single 6-month Eliezer could do, in general.
Even if you allow the 3-month Eliezer sims to act one-after-another, rather than trying to get everything done in 3 months simulated time via pure parallelism, there’s still a tight communication bottleneck between each Eliezer sim. There are presumably plenty of circuits which cannot be implemented with tight information bottlenecks every n serial steps.
… of course you could circumvent all that theory by e.g. just having each Eliezer emulate a single logic gate, or some comparably trivial thing, but at that point you run afoul of the non-compositionality of safety properties: putting together a bunch of “safe” things (or “interpretable” things, or “aligned” things, or “corrigible” things, …) does not in-general produce a safe thing.
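As a concrete instance of the serial-computation point (a standard example, not one from this thread): iterated hashing, where step k cannot start until step k-1 has finished, so extra parallel workers don’t help.

```python
import hashlib

def iterated_hash(seed: bytes, steps: int) -> bytes:
    """Each step consumes the previous output, so the chain is inherently serial:
    there is no known way to compute step k without first computing steps 1..k-1."""
    out = seed
    for _ in range(steps):
        out = hashlib.sha256(out).digest()
    return out

# A million serial steps: throwing more parallel workers at this does not speed it up.
digest = iterated_hash(b"example", 1_000_000)
```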
So that’s the theory side. What about the practice side?
Well, Ought ran roughly that experiment years ago and it did not work at all. And that should not be surprising—as the link argues, we have extremely ample evidence from day-to-day life that such things do not work in general.
Probably the iterated amplification proposal I described is very suboptimal. My goal with it was to illustrate how safety could be preserved across multiple buck-passes if models are not egregiously misaligned.
Like I said at the start of my comment: “I’ll describe [a proposal] in much more concreteness and specificity than I think is necessary because I suspect the concreteness is helpful for finding points of disagreement.”
I don’t actually expect safety will scale efficiently via the iterated amplification approach I described. The iterated amplification approach is just relatively simple to talk about.
What I actually expect to happen is something like:
- Humans train AI agents that are smarter than ourselves with ground truth reward signals from synthetically generated tasks created from internet data + a bit of fine-tuning with scalable oversight at the end.
- Early AI successors create smarter successors in basically the same way.
- At some point, AI agents start finding much more efficient ways to safely scale capabilities. e.g. maybe initially, they do this with a bunch of weak-to-strong generalization research. And eventually they figure out how to do formally verified distillation.
- But by this point, humans will have long been obsolete. The position I am defending in this post is that it’s not very important for us humans to think about these scalable approaches.
That’s a much more useful answer, actually. So let’s bring it back to Eliezer’s original question:
So to summarize your short, simple answer to Eliezer’s question: you want to “train AI agents that are [somewhat] smarter than ourselves with ground truth reward signals from synthetically generated tasks created from internet data + a bit of fine-tuning with scalable oversight at the end”. And then you hope/expect/(??have arguments or evidence??) that this allows us to (?justifiably?) trust the AI to report honest good alignment takes sufficient to put shortly-posthuman AIs inside the basin of attraction of a good eventual outcome, despite (as Eliezer puts it) humans being unable to tell which alignment takes are good or bad.
Or, to compact the summary even further: you want to train the somewhat-smarter-than-human AI on easily-verifiable synthetically-generated tasks, and then hope/expect that its good performance on those tasks generalizes to a problem which is not easily verifiable or synthetically generated, namely the problem of checking that a next generation of AI is in the basin of attraction of a good-to-humans outcome.
(Note: I know you’ve avoided talking about the basin of attraction of a good-to-humans outcome, focusing instead on some short-term goal like, e.g., not being killed by the very next generation of AI. Not focusing on the basin of attraction is a mistake, and we can go into why it’s a mistake if that turns out to be cruxy.)
In Eliezer’s comment, he was imagining a training setup somewhat different from easily-verifiable synthetically-generated tasks:
… but the analogue of the problem Eliezer was highlighting, in the context of training on easily-verifiable synthetically-generated tasks, is the question: how and why would we justifiably trust that an AI trained on easily-verifiable synthetic tasks generalizes to not-easily-verifiable real-world tasks?
> So to summarize your short, simple answer to Eliezer’s question: you want to “train AI agents that are [somewhat] smarter than ourselves with ground truth reward signals from synthetically generated tasks created from internet data + a bit of fine-tuning with scalable oversight at the end”. And then you hope/expect/(??have arguments or evidence??) that this allows us to (?justifiably?) trust the AI to report honest good alignment takes sufficient to put shortly-posthuman AIs inside the basin of attraction of a good eventual outcome, despite (as Eliezer puts it) humans being unable to tell which alignment takes are good or bad.
This one is roughly right. Here’s a clarified version:
- If we train an AI that (1) is not faking alignment in an egregious way, (2) looks very competent and safe to us, and (3) seems likely to maintain its alignment/safety properties as it would if humans were in the loop (see section 8), then I think we can trust this AI to “put shortly-posthuman AIs inside the basin of attraction of a good eventual outcome” (at least as well as humans would have been able to if given a lot more time to attempt this task), “despite (as Eliezer puts it) humans being unable to tell which alignment takes are good or bad.”
I think I might have created confusion by introducing the detail that AI agents will probably be trained on a lot of easy-to-verify tasks. I think this is where a lot of their capabilities will come from, but we’ll do a step at the end where we fine-tune AI agents for hard-to-verify tasks (similar to how RLHF is applied after pre-training in the chatbot paradigm).
And I think our disagreement mostly pertains to this final fine-tuning step on hard-to-verify tasks.
Here’s where I think we agree: if humans naively fine-tuned AI agents to regurgitate their current alignment takes, that would be way worse than if humans had way more time to think and do work on alignment.
My claim is that we can train AI agents to imitate the process by which humans improve their takes over time, such that after the AI agents do work for a long time, they produce outcomes about as good as the ones humans would have produced had they worked for a long time.
Very concretely: I imagine that if I’m training an AI agent successor, a key thing I do is try to understand how it updates on new evidence. Does it (appear to) update its views as reasonably as the most reasonable humans would?
If so, and if it is not egregiously misaligned (or liable to become egregiously misaligned), then I basically expect that letting it go and do lots of reasoning is likely to produce outcomes approximately as good as those humans would produce.
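As a toy illustration of the kind of “does it update reasonably” check I have in mind (the probe, numbers, and function names here are hypothetical, not an actual evaluation protocol): feed the agent synthetic evidence whose strength you know, and compare its reported update against a Bayesian reference.

```python
import math

def bayes_update(prior: float, likelihood_ratio: float) -> float:
    """Posterior from prior odds times the likelihood ratio of the new evidence."""
    prior_odds = prior / (1 - prior)
    post_odds = prior_odds * likelihood_ratio
    return post_odds / (1 + post_odds)

def update_gap(agent_prior: float, agent_posterior: float, likelihood_ratio: float) -> float:
    """How far (in log-odds) the agent's reported update is from the Bayesian reference."""
    reference = bayes_update(agent_prior, likelihood_ratio)
    logit = lambda p: math.log(p / (1 - p))
    return abs(logit(agent_posterior) - logit(reference))

# Hypothetical probe: evidence with a 3:1 likelihood ratio should move a 0.20 prior to ~0.43.
print(update_gap(agent_prior=0.20, agent_posterior=0.35, likelihood_ratio=3.0))
```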
Are we close to a crux?
Maybe a crux is whether training ~aligned AI agents to imitate the process by which humans improve their takes over time will lead to outcomes as good as if we let humans do way more work.
(Obviously I’m biased here by being friends with Ajeya.) This is only tangentially related to the main point of the post, but I think you’re really overstating how many Bayes points you get against Ajeya’s timelines report. Ajeya gave 15% to AGI before 2036, with little of that in the first few years after her report; maybe she’d have said 10% between 2025 and 2036.
I don’t think you’ve ever made concrete predictions publicly (which makes me think it’s worse behavior for you to criticize people for their predictions), but I don’t think there are that many groups who would have put wildly higher probability on AGI in this particular time period. (I think some of the short-timelines people at the time put substantial mass on AGI arriving by now, which reduces their performance.) Maybe some of them would have said 40%? If we assume AGI by then, that’s a couple bits of better performance, but I don’t think it’s massive outperformance. (And I still think it’s plausible that AGI isn’t developed by 2036!)
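To spell out the “couple bits” arithmetic under the numbers assumed above (10% vs. a hypothetical 40% on AGI in the 2025–2036 window), a quick sketch:

```python
import math

# If AGI does arrive in the window, the Bayes-score advantage of having assigned 40%
# over having assigned 10% is log2(0.40 / 0.10) = 2 bits; if it does not arrive, the
# 10% forecaster instead gains log2(0.90 / 0.60) ≈ 0.58 bits.
print(math.log2(0.40 / 0.10))  # 2.0
print(math.log2(0.90 / 0.60))  # ~0.585
```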
In general, I think that disagreements on AI timelines often seem more extreme when you summarize people’s timelines by median timeline rather than by their probability on AGI by a particular time.
Yudkowsky seems confused about OpenPhil’s exact past position. Relevant links:
Draft report on AI Timelines—Cotra 2020-09-18
Biology-Inspired Timelines—The Trick that Never Works—Yudkowsky 2021-12-01
Reply to Eliezer on Biological Anchors—Karnofsky 2021-12-23
Here “doctrine” is an applause light; boo, doctrines. I wrote a report, you posted your timeline, they have a doctrine.
All involved, including Yudkowsky, understand that 2050 was a median estimate, not a point estimate. Yudkowsky wrote that it has “very wide credible intervals around both sides”. Looking at the report’s distribution over the year by which the FLOP to train a transformative model is affordable, I’d summarize it as:
A comparison: a 52yo US female in 1990 had a median life expectancy of ~30 more years, living to 2020. 5% of such women died on or before age 67 (2005). Would anyone describe these life expectancy numbers to a 52yo woman in 1990 as the “Aetna doctrine of death in 2020”?
Further detail on this: Cotra has more recently updated at least 5x against her original 2020 model in the direction of faster timelines.
Greenblatt writes:
Cotra replies:
This means 25th percentile for 2028 and 50th percentile for 2031-2.
The original 2020 model assigns 5.23% by 2028, and 9.13% and 10.64% by 2031 and 2032 respectively. Each update is a factor of ~5x.
However, the original model predicted the date by which it would be affordable to train a transformative AI model. This is a leading variable for such a model actually being built and trained, which pushes the date back by some further number of years, so view the 5x as bounding, not pinpointing, the AI-timelines update Cotra has made.
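For reference, a quick check of the ~5x factors implied by the numbers quoted above (updated 25th percentile by 2028, 50th by ~2031–2032, versus the original model’s probabilities for those dates):

```python
# Updated percentiles divided by the original 2020 model's probabilities.
print(0.25 / 0.0523)   # ~4.8x by 2028
print(0.50 / 0.0913)   # ~5.5x by 2031
print(0.50 / 0.1064)   # ~4.7x by 2032
```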
Note that the capability milestone forecasted in the linked short form is substantially weaker than the notion of transformative AI in the 2020 model. (It was defined as AI with an effect at least as large as the industrial revolution.)
I don’t expect this adds many years; for me it adds ~2 years to my median.
(Note that my median for time from 10x to this milestone is lower than 2 years, but median to Y isn’t equal to median to X + median from X to Y.)
Just because I was curious, here is the most relevant chart from the report:
This is not a direct probability estimate (since it’s about the probability of affordability), but it’s probably within a factor of 2. Looks like the estimate by 2030 was 7.72% and by 2036 was 17.36%.
Thanks. (Also note that the model isn’t the same as her overall beliefs at the time, though they were similar at the 15th and 50th percentiles.)
There was a specific bet, which Yudkowsky is likely about to win. https://www.lesswrong.com/posts/sWLLdG6DWJEy3CH7n/imo-challenge-bet-with-eliezer
The IMO Challenge Bet was on a related topic, but not directly comparable to Bio Anchors. From MIRI’s 2017 Updates and Strategy:
I don’t think the individual estimates that made up the aggregate were ever published. Perhaps someone at MIRI can help us out; it would help build a forecasting track record for those involved.
For Yudkowsky in particular, I have a small collection of sources to hand. In Biology-Inspired AGI Timelines (2021-12-01), he wrote:
On Twitter (2022-12-02):
Also, in Shut it all down (March 2023):
Yudkowsky also has a track record betting on Manifold that AI will wipe out humanity by 2030, at up to 40%.
Putting these together:
2021: median well before 2050
2022: “fair chance” when a 2023 baby goes to kindergarten (Sep 2028 or 2029)
2023: before a young child grows up (about 2035)
40% P(Doom by 2030)
So a median of 2029, with very wide credible intervals around both sides. This is just an estimate based on his outward behavior.
Would Yudkowsky describe this as “Yudkowsky’s doctrine of AGI in 2029”?
Paul is not Ajeya, and also Eliezer only gets one bit from this win, which I think is insufficient grounds for behaving like such an asshole.
Buying at 12% and selling at 84% gets you 2.8 bits.
Edit: Hmm, that’s if he stakes all his cred; by Kelly he only stakes some of it, so you’re right, it probably comes out to about 1 bit.
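For reference, the arithmetic behind these figures under the stated prices (the Kelly fraction below is just the textbook formula for these numbers, not a claim about what was actually staked):

```python
import math

buy, sell = 0.12, 0.84            # price at purchase and at (implied) sale/resolution
print(math.log2(sell / buy))      # ~2.81 bits of log-wealth if the whole bankroll is staked

# Textbook Kelly fraction for buying a binary contract at price q with subjective probability p:
p, q = 0.84, 0.12
f_kelly = (p - q) / (1 - q)
print(f_kelly)                    # ~0.82 of bankroll, i.e. less than a full stake
```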
For this:
I think a crux here is that I do not believe we will get no-go theorems like this, and more to the point, complete impossibilities given useful assumptions are generally much rarer than you make them out to be.
The big reason for this is that the very fact that neural networks are messy/expressive makes it extremely hard to bound their behavior; the same reason you couldn’t do provable safety/alignment on the AI itself except in very toy examples will also limit any ability to prove hard theorems about what an AI is aligned to.
From an epistemic perspective, we have way more evidence for the laws of thermodynamics than for any particular proposal for AI alignment being impossible, arguably by billions or trillions of bits, so much so that there is little reason to think, at our current state of epistemic clarity, that we can declare a direction impossible (rather than impractical).
Quintin Pope correctly challenges @Liron here on this exact point, because of the yawning gap in evidence between thermodynamics and AI-alignment arguments, and Liron kind of switched gears midway to a much weaker claim:
https://x.com/QuintinPope5/status/1703569557053644819
The weak claim is here:
https://x.com/liron/status/1704126007652073539
I’m not saying that it’s against thermodynamics to get behaviors you don’t know how to verify. I’m asking what’s the plan for getting them.
My claim here is that there is no decisive blocker for plans that get us a safe, highly capable AI to be used for automated AI safety research, in the way that thermodynamics blocks you from getting a perpetual motion machine (under the well-tested assumption that physics is time-translation symmetric, i.e. the laws stay the same no matter when an experiment happens). The proposed blockers have nowhere close to the amount of evidence thermodynamics does, so we cannot safely discard every plan that fails to meet such a prerequisite.
Cool. What’s the actual plan and why should I expect it not to create machine Carissa Sevar? I agree that the Textbook From The Future Containing All The Simple Tricks That Actually Work Robustly enables the construction of such an AI, but also at that point you don’t need it.
Noosphere, why are you responding for a second time to a false interpretation of what Eliezer was saying, directly after he clarified this isn’t what he meant?
Okay, maybe he clarified that he wasn’t positing a thermodynamics-like blocker to there being, in principle, a plan to align AI, but I didn’t read Eliezer’s clarification as immediately ruling that reading out, so I wanted to rule that interpretation out myself.
I didn’t see the interpretation as false when I wrote it, because I believed he had only ruled out a decisive blocker to getting behaviors you don’t know how to verify.
I think the misunderstanding came from Eliezer’s reference to a perpetual motion machine. The point was that people suggesting how to build them often have complicated schemes that tend not to adequately address the central difficulty of creating one. That’s where the analogy ends. From thermodynamics, we have strong reasons to believe such a thing is not just difficult but impossible, whereas we have no corresponding theory to rule out verifiably safe AI.
Habryka’s analogy to nuclear reactor plans is similar, except that we know building one of those is difficult but actually possible.