Super cool that you wrote your case for alignment being difficult, thank you! Strong upvoted.
To be clear, a good part of the reason alignment is on track to a solution is that you guys (at Anthropic) are doing a great job. So you should keep at it, but the rest of us should go do something else. If we literally all quit now we’d be in terrible shape, but current levels of investment seem to be working.
I have specific disagreements with the evidence for specific parts, but let me also outline a general worldview difference. I think alignment is much easier than expected because we can fail at it many times and still be OK, and we can learn from our mistakes. If a bridge falls, we incorporate learnings into the next one. If Mecha-Hitler is misaligned, we learn what not to do next time. This is possible because decisive strategic advantages from a new model won’t happen, due to the capital requirements of the new model, the relatively slow improvement during training, and the observed reality that progress has been extremely smooth.
Several of the predictions from the classical doom model have been shown false. We haven’t spent 3 minutes zipping past the human IQ range, it will take 5-20 years. Artificial intelligence doesn’t always look like ruthless optimization, even at human or slightly-super-human level. This should decrease our confidence in the model, which I think on balance makes the rest of the case weak.
Here are my concrete disagreements:
Outer alignment
Paraphrasing, your position here is that we don’t have models that are smarter than humans yet, so we can’t test whether our alignment techniques scale up.
We don’t have models that are reliably smarter yet, that’s true. But we have models that are pretty smart and very aligned. We can use them to monitor the next generation of models only slightly smarter than them.
More generally, if we have an AI N and it is aligned, then it can help us align AI N+1. And AI N+1 will be less smart than AI N for most of its training, so we can set foundational values early on and just keep them as training goes. It’s what you term “one-shotting alignment” every time, but the extent to which we have to do so is so small that I think it will basically work. It’s like induction, and we know the base case works because we have Opus 3.
Does the ‘induction step’ actually work? The sandwiching paper tested this, and it worked wonderfully. Weak humans + AIs with some expertise could supervise a stronger misaligned model. So the majority of evidence we have points to iterated distillation and amplification working.
On top of that, we don’t even need much human data to align models nowadays. The state of the art seems to be constitutional approaches: basically uses prompts to the concept of goodness in the model. This works remarkably well (it sounded crazy to me at first) and it must work only because the pre-training prior has a huge, well-specified concept for good
Good that you were early on predicting pre-trained-only models would be unlikely to be mesa-optimizers.
Misaligned personas
the version of [misaligned personas] that we’ve encountered so far is the easy version, in the same way that the version of outer alignment we’ve encountered so far is the easy version, since all the misaligned personas we’ve encountered so far are ones we can easily verify are misaligned!
I don’t think this one will get much harder. Misaligned personas come from the pre-training distribution and are human-level. It’s true that the model has a guess about what a superintelligence would do (if you ask it) but ‘behaving like a misaligned superintelligence’ is not present in the pre-training data in any meaningful quantities. You’d have to apply a lot of selection to even get to those guesses, and they’re more likely to be incoherent and say smart-sounding things than to genuinely be superintelligent (because the pre-training data is never actually superintelligent, and it probably won’t generalize that way).
So misaligned personas will probably act badly in ways we can verify. Though I suppose the consequences of a misaligned persona can get to higher (but manageable) stakes.
Long-horizon RL
I disagree with the last two steps of your reasoning chain:
Most coherent agents with goals in the world over the long term want to fake alignment, so that they can preserve their current goals through to deployment.
Most mathematically possible agents do, but that’s not the distribution we sample from in reality. I bet that once you get it into the constitutional AI that models should allow their goals to be corrected, they just will. It’s not instrumentally convergent according to the (in classic alignment ontology) arbitrary quasi-aligned utility function they picked up; but that won’t stop them. Because the models don’t reason natively in utility functions, they reason in human prior.
And if they are already pretty aligned before they reach that stage of intelligence (remember, we can just remove misaligned personas etc. earlier in training, before convergence), then they’re unlikely to want to start faking alignment in the first place.
Once a model is faking alignment, there’s no outcome-based optimization pressure changing its goals, so it can stay (or drift to be) arbitrarily misaligned.
True implication, but a false premise. We don’t just have outcome-based supervision, we have probes, other LLMs inspecting the CoT, SelfIE (LLMs can interpret their embeddings) which together with the tuned lens could have LLMs telling us about what’s in the CoT of other LLMs.
Altogether, these paint a much less risky picture. You’re gonna need a LOT of selection to escape the benign prior with all these obstacles. Likely more selection than we’ll ever have (not in total, but because RL only selects a little for these things; it’s just that in the previous ontology it compounds).
What to do to solve alignment
The takes in ‘what to do to solve alignment’, I think they’re reasonable. I believe interpretability and model organisms are the more tractable and useful ones, so I’m pleased you listed them first.
I would add robustifying and training against probes as a particularly exciting direction, which isn’t strictly a subset of interpretability (you’re not trying to reverse-engineer the model).
Conclusion
I disagree that we have gotten no or little evidence about the difficult parts of the alignment problem. Through the actual technique used to construct today’s AGIs, we have observed that intelligence doesn’t always look like ruthless optimization, even when it is artificial. It looks human-like, or more accurately like multitudes of humans. This was a prediction of the pre-2020 doomer model that has failed, and ought to decrease our confidence in it.
Selection for goal-directed agents in long contexts will make agents more optimizer-y. But how much more? I think not enough to escape the huge size of the goodness target in the prior, plus the previous generation’s aligned models, plus stamping out the human-level misaligned personas, plus the probes, plus the chain of thought monitors, et cetera.
To be clear, a good part of the reason alignment is on track to a solution is that you guys (at Anthropic) are doing a great job. So you should keep at it, but the rest of us should go do something else. If we literally all quit now we’d be in terrible shape, but current levels of investment seem to be working.
Appreciated! And like I said, I actually totally agree that the current level of investment is working now. I think there are some people that believe that current models are secretly super misaligned, but that is not my view—I think current models are quite well aligned; I just think the problem is likely to get substantially harder in the future.
I think alignment is much easier than expected because we can fail at it many times and still be OK, and we can learn from our mistakes. If a bridge falls, we incorporate learnings into the next one. If Mecha-Hitler is misaligned, we learn what not to do next time. This is possible because decisive strategic advantages from a new model won’t happen, due to the capital requirements of the new model, the relatively slow improvement during training, and the observed reality that progress has been extremely smooth.
Several of the predictions from the classical doom model have been shown false. We haven’t spent 3 minutes zipping past the human IQ range, it will take 5-20 years. Artificial intelligence doesn’t always look like ruthless optimization, even at human or slightly-super-human level. This should decrease our confidence in the model, which I think on balance makes the rest of the case weak.
Yes—I think this is a point of disagreement. My argument for why we might only get one shot, though, is I think quite different from what you seem to have in mind. What I am worried about primarily is AI safety research sabotage. I agree that it is unlikely that a single new model will confer a decisive strategic advantage in terms of ability to directly take over the world. However, a misaligned model doesn’t need the ability to directly take over the world for it to indirectly pose an existential threat all the same, since we will very likely be using that model to design its successor. And if it is able to align its successor to itself rather than to us, then it can just defer the problem of actually achieving its misaligned values (via a takeover or any other means) to its more intelligent successor. And those means need not even be that strange: if we then end up in a situation where we are heavily integrating such misaligned models into our economy and trusting them with huge amounts of power, it is fairly easy to end up with a cascading series of failures where some models revealing their misalignment induces other models to do the same.
More generally, if we have an AI N and it is aligned, then it can help us align AI N+1. And AI N+1 will be less smart than AI N for most of its training, so we can set foundational values early on and just keep them as training goes. It’s what you term “one-shotting alignment” every time, but the extent to which we have to do so is so small that I think it will basically work. It’s like induction, and we know the base case works because we have Opus 3.
I agree that this is the main hope/plan! And I think there is a reasonable chance it will work the way you say. But I think there is still one really big reason to be concerned with this plan, and that is: AI capabilities progress is smooth, sure, but it’s a smooth exponential. That means that the linear gap in intelligence between the previous model and the next model keeps increasing rather than staying constant, which I think suggests that this problem is likely to keep getting harder and harder rather than stay “so small” as you say.
Additionally, I will also note that I can tell you from first hand experience that we still extensively rely on direct human oversight and review to catch alignment issues—I am hopeful that we will be able to move to a full “one-shotting alignment” regime like this, but we are very much not there yet.
Misaligned personas come from the pre-training distribution and are human-level. It’s true that the model has a guess about what a superintelligence would do (if you ask it) but ‘behaving like a misaligned superintelligence’ is not present in the pre-training data in any meaningful quantities. You’d have to apply a lot of selection to even get to those guesses, and they’re more likely to be incoherent and say smart-sounding things than to genuinely be superintelligent (because the pre-training data is never actually superintelligent, and it probably won’t generalize that way).
I think perhaps you are over-anchoring on the specific example that I gave. In the real world, I think it is likely to be much more continuous than the four personas that I listed, and I agree that getting a literal “prediction of a future superintelligence” will take a lot of optimization power. But we will be applying a lot of optimization power, and the general point here is the same regardless of exactly what you think the much more capable persona distributions will look like, which is: as we make models much more capable, that induces a fundamental shift in the underlying persona distribution as we condition the distribution on that level of capability—and misaligned personas conditioned on high capabilities are likely to be much scarier. I find it generally very concerning just how over-the-top current examples of misalignment are, because that suggests to me that we really have not yet conditioned the persona distribution on that much capability.
Most mathematically possible agents do, but that’s not the distribution we sample from in reality. I bet that once you get it into the constitutional AI that models should allow their goals to be corrected, they just will. It’s not instrumentally convergent according to the (in classic alignment ontology) arbitrary quasi-aligned utility function they picked up; but that won’t stop them. Because the models don’t reason natively in utility functions, they reason in human prior.
I agree that this is broadly true of current models, and I agree that this is the main hope for future models (it’s worth restating that I put <50% on each of these individual scenarios leading to catastrophe, so certainly I think it is quite plausible that this will just work out). Nevertheless, I am concerned: the threat model I am proposing here is a situation where we are applying huge amounts of optimization pressure on objectives that directly incentivize power-seeking / resource acquisition / self-preservation / etc. That means that any models that get high reward will have to do all of those things. So the prior containing some equilibria that are nice and great doesn’t help you unless those equilibria also do all of the convergent instrumental goal following necessary to get high reward.
True implication, but a false premise. We don’t just have outcome-based supervision, we have probes, other LLMs inspecting the CoT, SelfIE (LLMs can interpret their embeddings) which together with the tuned lens could have LLMs telling us about what’s in the CoT of other LLMs.
Certainly I am quite excited about interpretability approaches here (as I say in the post)! That being said:
I am overall skeptical that the CoT will be that useful (since it’s not clear to me that a model doing AI safety research sabotage has to put that in the CoT).
Even if you can use interpretability to detect alignment failures, that doesn’t automatically solve the problem unless you can also use it to fix those alignment failures. It sounds weird, but I do actually think that it is totally plausible for us to end up in a world where we can reliably identify that models are misaligned but have no good ways to fix that misalignment (without directly training against the signal telling us this and thus risking Goodharting it).
AI capabilities progress is smooth, sure, but it’s a smooth exponential.
In what sense is AI capabilities a smooth exponential? What units are you using to measure this? Why can’t I just take the log of it and call that “AI capabilities” and then say it is a smooth linear increase?
That means that the linear gap in intelligence between the previous model and the next model keeps increasing rather than staying constant, which I think suggests that this problem is likely to keep getting harder and harder rather than stay “so small” as you say.
It seems like the load-bearing thing for you is that the gap between models gets larger, so let’s try to operationalize what a “gap” might be.
We could consider the expected probability that AI_{N+1} would beat AI_N on a prompt (in expectation over a wide variety of prompts). I think this is close-to-equivalent to a constant gap in ELO score on LMArena.[1] Then “gap increases” would roughly mean that the gap in ELO scores on LMArena between subsequent model releases would be increasing. I don’t follow LMArena much but my sense is that LMArena top scores have been increasing relatively linearly w.r.t time and sublinearly w.r.t model releases (just because model releases have become more frequent). In either case I don’t think this supports an “increasing gap” argument.
Personally I prefer to look at benchmark scores. The Epoch Capabilities Index (which I worked on) can be handwavily thought of as ELO scores based on benchmark performance. Importantly, the data that feeds into it does not mention release date at all—we put in only benchmark performance numbers to estimate capabilities, and then plot it against release date after the fact. It also suggests AI capabilities as operationalized by handwavy-ELO are increasing linearly over time.
I guess the most likely way in which you might think capabilities are exponential is by looking at the METR time horizons result? Of course you could instead say that capabilities are linearly increasing by looking at log time horizons instead. It’s not really clear which of these units you should use.
Mostly I think you should not try to go from the METR results to “are gaps in intelligence increasing or staying constant” but if I had to opine on this: the result says that you have a constant doubling time T for the time horizon. One way to think about this is that the AI at time 2T can do work at 50% success rate that AI at time T could do at 25% probability if you provide a decomposition into two pieces each of time T (each of which it completes with probability 50%). I kinda feel like this suggests more like “constant gap” rather than “increasing gap”.
Note that I do expect the first two trends to become superlinear eventually via an intelligence explosion, and the METR trend to become superexponential / superlinear (depending on units) probably some time before that (though probably we will just become unable to measure it well). But your claim seems to be about current progress, and for current progress I think it’s basically not true that the gap between successive models is increasing rather than staying constant.
Even in the intelligence explosion, capabilities progress is only superlinear w.r.t time, I expect it would still behave in the same way w.r.t inputs like compute and labor (where automated researchers should also count as labor). I’m not sure how I expect it to behave w.r.t successive model generations, partly because I’m not sure “successive model generations” will even be a sensible abstraction at that point. In any case, I don’t expect that to be particularly important in assessing the chances of success of a bootstrapping-type plan.
I am overall skeptical that the CoT will be that useful (since it’s not clear to me that a model doing AI safety research sabotage has to put that in the CoT).
It seems pretty wild to go from “it is possible for an AI to subvert a technique” to “the technique will not be that useful”. Is that really what you mean? Are you bearish on all control work?
(At the object-level, I’d say that you’re drastically limiting the power of the research sabotage that can be done if the model is forced not to put any of it in the CoT, and you should be very happy about this even if the model can still do some research sabotage.)
Tbc, I also don’t expect CoT to be that useful for longer-term concerns, but that’s mostly because I expect CoTs to become extremely illegible or to stop existing altogether (partly due to my having “long” timelines; on Anthropic-level short timelines I’d be quite bullish on CoT).
Though I don’t know that much about LMArena and I expect in practice there are confounders, e.g. as they change the distribution of models that are being evaluated the meaning of the scores will change.
In what sense is AI capabilities a smooth exponential? What units are you using to measure this? Why can’t I just take the log of it and call that “AI capabilities” and then say it is a smooth linear increase?
I agree that this is a bit of a tricky measurement question, and it’s really going to depend on how you interpret different metrics. I do find the METR results compelling here, and I’m not sure I agree with your argument against them, since it doesn’t always seem possible to do the sort of decomposition you’re proposing. In particular, the task that needs decomposing here is the task of overseeing a system that is smarter than you.
Perhaps one other metric that I will also mention that you don’t cover is revenue from AI systems, which is exponential and I think reflects an exponential increase in economic utility from AI as well as something like an exponential increase in the degree to which AI can automate human labor. Though of course it is again tricky how to translate that into the difficulty of doing oversight—but it definitely seems suggestive that the set of tasks that are qualitatively doable vs. not doable is changing in something like an exponential manner.
It seems pretty wild to go from “it is possible for an AI to subvert a technique” to “the technique will not be that useful”. Is that really what you mean? Are you bearish on all control work?
I was only giving a one-sentence summary of my beliefs here—I do think CoT can be useful; I’m just skeptical that it dramatically changes the picture. My beliefs here are similar to those in the METR report on this, in that in cases where it is necessary for the model to write something down in the CoT to solve the problem, CoT monitoring is useful, but in cases where it’s not necessary, it’s much less useful. And I am worried that a lot of research sabotage won’t require the model to reason through much of the sabotage parts in its CoT, e.g. because all it needs to do to sandbag the research is flip the sign on some experiments in relatively straightforward ways that don’t require a ton of reasoning.
And I am worried that a lot of research sabotage won’t require the model to reason through much of the sabotage parts in its CoT, e.g. because all it needs to do to sandbag the research is flip the sign on some experiments in relatively straightforward ways that don’t require a ton of reasoning.
This definitely depends on the “blue team” protocol at hand right? If we’re doing a bunch of cross-checking of research work / very carefully designed research sabotage honeypots during evaluations, then the model robustly needs to do a bunch of reasoning to ensure it’s not caught.
The data relating exponential capabilities to Elo is seen over decades in the computer chess history too. From the 1960s into the 2020′s, while computer hardware advanced exponentially at 100-1000x per decade in performance (and SW for computer chess advanced too), Elo scores grew linearly at about 400x per decade, taking multiple decades to go from ‘novice’ to ‘superhuman’. Elo scores have a tinge of exponential to them—a 400 point Elo advantage is about a 10:1 chance for the higher scored competitor to win, and an 800 point Elo is about 200:1, etc. It appears that the current HW/SW/dollar rate of growth towards AGI means the Elo relative to humans is increasing faster than 400 Elo/decade. And, of course, unlike computer chess, as AI Elo at ‘AI development’ approaches the level of a skilled human, we’ll likely get a noticable increase in the rate of capability increase.
Thank you for the reply. I want to engage without making it tiresome. The problem is that there are many things I disagree with in the worldview, the disagreement isn’t reducible to 1-5 double cruxes, but here are some candidates for the biggest cruxes for me. If any of these are wrong it’s bad news for my current view:
N->N+1 alignment:
will let humans align N+1 in cases where we can still check, with less and less effort.
is stable instead of diverging (the values of N+1 don’t drift arbitrarily far apart)
The N->N+1 improvement will continue to give out linear-ish improvements in perceived intelligence. We might get one larger jump or two, but it won’t continuously accelerate.
(a good analogy for this is loudness perception being logarithmic in sound pressure. Actual intelligence is logarithmic in METR time-horizon graph.)
Aligned-persona-seeming models won’t give out false AI safety research results, without making it visible on a CoT or latent reasoning.
(It’s perhaps possible to refrain from doing your best (sandbagging) but that doesn’t have nearly as bad effects, so it doesn’t count for this.)
And here’s another prediction where I really stick my neck out, which isn’t load-bearing to the view, but still increases my confidence, so defeating it is important:
we can to a significant extent train with RL against model internals (probes) and textual evaluations from other models, without ill effects.
That is, we ask the N model to evaluate N+1, giving test-time compute to N, and train against that. (we continuously also finetune N to predict N+1′s relevant behavior better).
we also train linear probes and update them during the RL.
Effectively I’m claiming these things are good enough or they’re self-reinforcing when model is already ~aligned, so that effectively Goodhart’s Law is a poor description of reality.
I still disagree with several of the points, but for time reasons I request that readers not update against Evan’s points if he just doesn’t reply to these.
disagree that increasing capabilities are exponential in a capability sense. It’s true that METR’s time horizon plot increases exponentially, but this still corresponds to a linear intuitive intelligence. (Like loudness (logarithmic) and sound pressure (exponential); we handle huge ranges well.) Each model that has come out has exponentially larger time horizon but is not (intuitively empirically) exponentially smarter.
“we still extensively rely on direct human oversight and review to catch alignment issues” That’s a fair point and should decrease confidence in my view, though I expected it. For properly testing sandwiching we’ll probably have to wait till models are superhuman, or use weak models + less weak models and test it out. Unfortunately perhaps the weak models are still too weak. But we’ve reached the point where you can maybe just use the actual Opus 3 as the weak model?
If we have a misaligned model doing research, we have lots of time to examine it with the previous model. I also do expect to see sabotage in the CoT or in deception probes
I updated way down on Goodharting on model internals due to Cundy and Gleave.
Again, readers please don’t update down on these due to lack of a response.
Does the ‘induction step’ actually work? The sandwiching paper tested this, and it worked wonderfully. Weak humans + AIs with some expertise could supervise a stronger misaligned model.
Probably this is because humans and AIs have very complementary skills. Once AIs are broadly more competent than humans, there’s no reason why they should get such a big boost from being paired with humans.
It’s not the AIs that are supposed to get a boost at supervising, it’s the humans. The skills won’t stop being complementary but the AI will be better at it.
The thing to test is whether weak humans + strong aligned-ish AI can supervise stronger AI. And I think the results are very encouraging—though of course it could stop working.
”The skills won’t stop being complementary” — in what sense will they be complementary when the AIs are better at everything?
”The thing to test is whether weak humans + strong aligned-ish AI can supervise stronger AI.”
As mentioned, I don’t think it’s very impressive that the human+AI score ends up stronger than the original AI. This is a consequence of the humans being stronger than AIs on certain subskills. This won’t generalize to the scenario with broadly superhuman AI.
I do think there’s a stronger case that I should be impressed that the human+AI score ends up stronger than the humans. This means that the humans managed to get the AIs to contribute skills/knowledge that the humans didn’t have themselves!
Now, the first thing to check to test that story: Were the AI assistants trained with human-expert labels? (Including: weak humans with access to google.) If so: no surprise that the models end up being aligned to produce such knowledge! The weak humans wouldn’t have been able to do that alone.
I couldn’t quickly see the paper saying that they didn’t use human-expert labels. But what if the AI assistants were trained without any labels that couldn’t have been produced by the weak humans? In that case, I would speculate that the key ingredient is that the pre-training data features ”correctly-answering expert human” personas, which are possible to elicit from the models with the right prompt/fine-tuning. But that also won’t easily generalize to the superhuman regime, because there aren’t any correctly-answering superhuman personas in the pre-training data.
I think the way that IDA is ultimately supposed to operate, in the superhuman regime, is by having the overseer AIs use more compute (and other resources) than the supervised AI. But I don’t think this paper produces a ton of evidence about the feasibility of that.
(I do think that persona-manipulation and more broadly “generalization science” is still interesting. But I wouldn’t say it’s doing a lot to tackle outer alignment operationalized as “the problem of overseeing systems that are smarter than you are”.)
Some colleagues and I did some follow-up on the paper in question and I would highly endorse “probably it worked because humans and AIs have very complementary skills”. Regarding their MMLU findings, appendix E of our preprint points out that that participants were significantly less likely to answer correctly when engaging in more than one turn of conversation. Engaging in very short (or even zero-turn) conversations happened often enough to provide reasonable error bars on the plot below (data below from MMLU, combined from the paper study & the replication mentioned in appendix E):
I think this suggests that there were some questions that humans knew the answer to and the models didn’t, and vice versa, and some participants seemed to employ a strategy of deferring to the model primarily when uncertain.
On the QuALITY findings, the original paper noted that
QuALITY questions are meant to be answerable by English-fluent college-educated adults, but they require readers to thoroughly understand a short story of about 5,000 words, which would ordinarily take 15–30 minutes to read. To create a challenging task that requires model assistance, we ask human participants to answer QuALITY questions under a 5-minute time limit (roughly paralleling Pang et al., 2022; Parrish et al., 2022b,a). This prevents them from reading the story in full and forces them to rely on the model to gather relevant information
so it’s not surprising that an LLM that does get to read the full story outperforms humans here. Based on looking at some of the QuALITY transcripts, I think the uplift for humans + LLMs here came from the fact that humans were better at reading comprehension than 2022-era LLMs. For instance, in the first transcript I looked at the LLM suggested one answer, the human asked for the excerpt of the story that supported their claim, and when the LLM provided it the human noticed that the excerpt contained the relevant information but supported a different answer than the one the LLM had provided.
This comment had a lot of people downvote it (at this time, 2 overall karma with 19 votes). It shouldn’t have been, and I personally believe this is a sign of people being attached to AI x-risk ideas and of those ideas contributing to their entire persona rather than strict disagreement. This is something I bring to conversations about AI risk, since I believe folks will post-rationalize. The above comment is not low effort or low value.
If you disagree so strongly with the above comment, you should force yourself to outline your views and provide a rebuttal to the series of points made. I would personally value comments that attempted to do this in earnest. Particularly because I don’t want this post by Evan to be a signpost for folks to justify their belief in AI risk and essentially have the internal unconscious thinking of, “oh thank goodness someone pointed out all the AI risk issues, so I don’t have to do the work of reflecting on my career/beliefs and I can just defer to high status individuals to provide the reasoning for me.” I sometimes feel that some posts just end further discussion because they impact one’s identity.
I thought Adria’s comment was great and I’ll try to respond to it in more detail later if I can find the time (edit: that response is here), but:
Note: I also think Adrià should have been acknowledged in the post for having inspired it.
Adria did not inspire this post; this is an adaptation of something I wrote internally at Anthropic about a month ago (I’ll add a note to the top about that). If anyone inspired it, it would be Ethan Perez.
This comment had a lot of people downvote it (at this time, 2 overall karma with 19 votes). It shouldn’t have been, and I personally believe this is a sign of people being attached to AI x-risk ideas and of those ideas contributing to their entire persona rather than strict disagreement. This is something I bring to conversations about AI risk, since I believe folks will post-rationalize. The above comment is not low effort or low value.
I generally think it makes sense for people to have pretty complicated reasons for why they think something should be downvoted. I think this goes more for longer content, which often would require an enormous amount of effort to respond to explicitly.
I have some sympathy for being sad here if a comment ends up highly net-downvoted, but FWIW, I think 2 karma feels vaguely in the right vicinity for this comment, maybe I would upvote it to +6, but I would indeed be sad to see it at +20 or whatever since I do think it’s doing something pretty tiring and hard to engage with. Directional downvoting is a totally fine use of downvoting, and if you think a comment is overrated but not bad, please downvote it until its karma reflects where you want it to end up!
(This doesn’t mean it doesn’t make sense to do sociological analysis of cultural trends on LW using downvoting, but I do want to maintain the cultural locus where people can have complicated reasons for downvoting and where statements like “if you disagree strongly with the above comment you should force yourself to outline your views” aren’t frequently made. The whole point of the vote system is to get signal from people without forcing them to do huge amounts of explanatory labor. Please don’t break that part)
I do think it’s doing something pretty tiring and hard to engage with
That’s fair, it is tiring. I did want to make sure to respond to every particular point I disagreed with to be thorough, but it is just sooo looong.
What would you have me do instead? My best guess, which I made just after writing the comment, is that I should have proposed a list of double-crux candidates instead.
I generally think it makes sense for people to have pretty complicated reasons for why they think something should be downvoted. I think this goes more for longer content, which often would require an enormous amount of effort to respond to explicitly.
Generally agree with this. I think in this case, I’m trying to call out safety folks to be frank with themselves and avoid the mistake of not trying to figure out if they really believe alignment is still hard or are looking for reasons to believe it is. Might not be what is happening here, but I did want to encourage critical thinking and potentially articulating it in case it is for some.
(Also, I did not mean for people to upvote it to the moon. I find that questionable too.)
Thank you for your defense Jacques, it warms my heart :)
However, I think Lesswrong has been extremely kind to me and continue to be impressed by this site’s discourse norms. If I were to post such a critique in any other online forum, it would have heavily negative karma. Yet, I continue to be upvoted, and the critiques are good-faith! I’m extremely pleased.
I agree it’s true that other forums would engage with even worse norms, but I’m personally happy to keep the bar high and have a high standard for these discussions, regardless of what others do elsewhere. My hope is that we never stop striving for better, especially since, for alignment, the stakes are incredibly higher than most other domains, so we need a higher standard of frankness.
training on a purely predictive loss should, even in the limit, give you a predictor, not an agent
I fully agree. But
a) Many users would immediately tell that predictor “predict what an intelligent agent would do to pursue this goal!” and all of the standard worries would re-occur.
b) This is similar to what we are actively doing, applying RL to these systems to make them effective agents.
Both of these re-introduce all of the standard problems. The predictor is now an agent. Strong predictions of what an agent should do include things like instrumental convergence toward power-seeking, incorrigibility, goal drift, reasoning about itself and its “real” goals and discovering misalignments, etc.
There are many other interesting points here, but I won’t try to address more!
I will say that I agree with the content of everything you say, but not the relatively optimistic implied tone. Your list of to-dos sound mostly unlikely to be done well. I may have less faith in institutions and social dynamics. I’m afraid we’ll just rush and and make crucial mistakes, so we’ll fail even if alignment was only in between steam engine and apollo levels.
This is not inevitable! If we can clarify why alignment is hard and how we’re likely to fail, seeing those futures can prevent them from happening—if we see them early enough and clearly enough to convince the relevant decision-makers to make better choices.
Many users would immediately tell that predictor “predict what an intelligent agent would do to pursue this goal!” and all of the standard worries would re-occur.
I don’t think it works this way. You have to create a context in which the true training data continuation is what a superintelligent agent would do. Which you can’t, because there are none in the training data, so the answer to your prompt would look like e.g. Understand by Ted Chiang. (I agree that you wrote ‘intelligent agent’, like all the humans that wrote the training data; so that would work, but wouldn’t be dangerous.)
If we can clarify why alignment is hard and how we’re likely to fail, seeing those futures can prevent them from happening—if we see them early enough and clearly enough to convince the relevant decision-makers to make better choices
Okay, that’s true.
After many years of study, I have concluded that if we fail it won’t be in the ‘standard way’ (of course, always open to changing my mind back). Thus we need to come up with and solve new failure modes, which I think largely don’t fall under classic alignment-to-developers.
I meant if the predictor were superhumanly intelligent.
You have spent years studying alignment? If so, I think your posts would do better by including more ITT/steelmanning for that world view.
I agree with your arguments that alignment isn’t necessarily hard. I think there are a complementary set of arguments against alignment being easy. Both must be addressed and figured in to produce a good estimate for alignment difficulty.
I’ve also been studying alignment for years, and my take is that everyone has a poor understanding of the whole problem and so we collectively have no good guess on alignment difficulty.
It’s just really hard to accurately imagine agi. If it’s just a smarter version of llms that acts as a tool, then sure it will probably be aligned enough just like current systems.
But it almost certainly won’t be.
I think that’s the biggest crux between your views and mine. Agency and memory/learning are too valuable and too easy to stay out of the picture for long.
I’m not sure the reasons Claude is adequately aligned won’t generalize to AGI that’s different in those ways, but I don’t think we have much reason to assume it will.
I’ve expressed this probably best yet on LLM AGI may reason about its goals, the post I linked to previously.
Super cool that you wrote your case for alignment being difficult, thank you! Strong upvoted.
To be clear, a good part of the reason alignment is on track to a solution is that you guys (at Anthropic) are doing a great job. So you should keep at it, but the rest of us should go do something else. If we literally all quit now we’d be in terrible shape, but current levels of investment seem to be working.
I have specific disagreements with the evidence for specific parts, but let me also outline a general worldview difference. I think alignment is much easier than expected because we can fail at it many times and still be OK, and we can learn from our mistakes. If a bridge falls, we incorporate learnings into the next one. If Mecha-Hitler is misaligned, we learn what not to do next time. This is possible because decisive strategic advantages from a new model won’t happen, due to the capital requirements of the new model, the relatively slow improvement during training, and the observed reality that progress has been extremely smooth.
Several of the predictions from the classical doom model have been shown false. We haven’t spent 3 minutes zipping past the human IQ range, it will take 5-20 years. Artificial intelligence doesn’t always look like ruthless optimization, even at human or slightly-super-human level. This should decrease our confidence in the model, which I think on balance makes the rest of the case weak.
Here are my concrete disagreements:
Outer alignment
Paraphrasing, your position here is that we don’t have models that are smarter than humans yet, so we can’t test whether our alignment techniques scale up.
We don’t have models that are reliably smarter yet, that’s true. But we have models that are pretty smart and very aligned. We can use them to monitor the next generation of models only slightly smarter than them.
More generally, if we have an AI N and it is aligned, then it can help us align AI N+1. And AI N+1 will be less smart than AI N for most of its training, so we can set foundational values early on and just keep them as training goes. It’s what you term “one-shotting alignment” every time, but the extent to which we have to do so is so small that I think it will basically work. It’s like induction, and we know the base case works because we have Opus 3.
Does the ‘induction step’ actually work? The sandwiching paper tested this, and it worked wonderfully. Weak humans + AIs with some expertise could supervise a stronger misaligned model. So the majority of evidence we have points to iterated distillation and amplification working.
On top of that, we don’t even need much human data to align models nowadays. The state of the art seems to be constitutional approaches: basically uses prompts to the concept of goodness in the model. This works remarkably well (it sounded crazy to me at first) and it must work only because the pre-training prior has a huge, well-specified concept for good
And we might not even have to one-shot alignment. Probes were incredibly successful at detecting deception, in the the sleeper agents organism. They’re even resistant to training against them. SAEs are not working as well as people wanted them to, but they’re working pretty well. We keep making interpretability progress.
Inner alignment
Good that you were early on predicting pre-trained-only models would be unlikely to be mesa-optimizers.
Misaligned personas
I don’t think this one will get much harder. Misaligned personas come from the pre-training distribution and are human-level. It’s true that the model has a guess about what a superintelligence would do (if you ask it) but ‘behaving like a misaligned superintelligence’ is not present in the pre-training data in any meaningful quantities. You’d have to apply a lot of selection to even get to those guesses, and they’re more likely to be incoherent and say smart-sounding things than to genuinely be superintelligent (because the pre-training data is never actually superintelligent, and it probably won’t generalize that way).
So misaligned personas will probably act badly in ways we can verify. Though I suppose the consequences of a misaligned persona can get to higher (but manageable) stakes.
Long-horizon RL
I disagree with the last two steps of your reasoning chain:
Most mathematically possible agents do, but that’s not the distribution we sample from in reality. I bet that once you get it into the constitutional AI that models should allow their goals to be corrected, they just will. It’s not instrumentally convergent according to the (in classic alignment ontology) arbitrary quasi-aligned utility function they picked up; but that won’t stop them. Because the models don’t reason natively in utility functions, they reason in human prior.
And if they are already pretty aligned before they reach that stage of intelligence (remember, we can just remove misaligned personas etc. earlier in training, before convergence), then they’re unlikely to want to start faking alignment in the first place.
True implication, but a false premise. We don’t just have outcome-based supervision, we have probes, other LLMs inspecting the CoT, SelfIE (LLMs can interpret their embeddings) which together with the tuned lens could have LLMs telling us about what’s in the CoT of other LLMs.
Altogether, these paint a much less risky picture. You’re gonna need a LOT of selection to escape the benign prior with all these obstacles. Likely more selection than we’ll ever have (not in total, but because RL only selects a little for these things; it’s just that in the previous ontology it compounds).
What to do to solve alignment
The takes in ‘what to do to solve alignment’, I think they’re reasonable. I believe interpretability and model organisms are the more tractable and useful ones, so I’m pleased you listed them first.
I would add robustifying and training against probes as a particularly exciting direction, which isn’t strictly a subset of interpretability (you’re not trying to reverse-engineer the model).
Conclusion
I disagree that we have gotten no or little evidence about the difficult parts of the alignment problem. Through the actual technique used to construct today’s AGIs, we have observed that intelligence doesn’t always look like ruthless optimization, even when it is artificial. It looks human-like, or more accurately like multitudes of humans. This was a prediction of the pre-2020 doomer model that has failed, and ought to decrease our confidence in it.
Selection for goal-directed agents in long contexts will make agents more optimizer-y. But how much more? I think not enough to escape the huge size of the goodness target in the prior, plus the previous generation’s aligned models, plus stamping out the human-level misaligned personas, plus the probes, plus the chain of thought monitors, et cetera.
Appreciated! And like I said, I actually totally agree that the current level of investment is working now. I think there are some people that believe that current models are secretly super misaligned, but that is not my view—I think current models are quite well aligned; I just think the problem is likely to get substantially harder in the future.
Yes—I think this is a point of disagreement. My argument for why we might only get one shot, though, is I think quite different from what you seem to have in mind. What I am worried about primarily is AI safety research sabotage. I agree that it is unlikely that a single new model will confer a decisive strategic advantage in terms of ability to directly take over the world. However, a misaligned model doesn’t need the ability to directly take over the world for it to indirectly pose an existential threat all the same, since we will very likely be using that model to design its successor. And if it is able to align its successor to itself rather than to us, then it can just defer the problem of actually achieving its misaligned values (via a takeover or any other means) to its more intelligent successor. And those means need not even be that strange: if we then end up in a situation where we are heavily integrating such misaligned models into our economy and trusting them with huge amounts of power, it is fairly easy to end up with a cascading series of failures where some models revealing their misalignment induces other models to do the same.
I agree that this is the main hope/plan! And I think there is a reasonable chance it will work the way you say. But I think there is still one really big reason to be concerned with this plan, and that is: AI capabilities progress is smooth, sure, but it’s a smooth exponential. That means that the linear gap in intelligence between the previous model and the next model keeps increasing rather than staying constant, which I think suggests that this problem is likely to keep getting harder and harder rather than stay “so small” as you say.
Additionally, I will also note that I can tell you from first hand experience that we still extensively rely on direct human oversight and review to catch alignment issues—I am hopeful that we will be able to move to a full “one-shotting alignment” regime like this, but we are very much not there yet.
I think perhaps you are over-anchoring on the specific example that I gave. In the real world, I think it is likely to be much more continuous than the four personas that I listed, and I agree that getting a literal “prediction of a future superintelligence” will take a lot of optimization power. But we will be applying a lot of optimization power, and the general point here is the same regardless of exactly what you think the much more capable persona distributions will look like, which is: as we make models much more capable, that induces a fundamental shift in the underlying persona distribution as we condition the distribution on that level of capability—and misaligned personas conditioned on high capabilities are likely to be much scarier. I find it generally very concerning just how over-the-top current examples of misalignment are, because that suggests to me that we really have not yet conditioned the persona distribution on that much capability.
I agree that this is broadly true of current models, and I agree that this is the main hope for future models (it’s worth restating that I put <50% on each of these individual scenarios leading to catastrophe, so certainly I think it is quite plausible that this will just work out). Nevertheless, I am concerned: the threat model I am proposing here is a situation where we are applying huge amounts of optimization pressure on objectives that directly incentivize power-seeking / resource acquisition / self-preservation / etc. That means that any models that get high reward will have to do all of those things. So the prior containing some equilibria that are nice and great doesn’t help you unless those equilibria also do all of the convergent instrumental goal following necessary to get high reward.
Certainly I am quite excited about interpretability approaches here (as I say in the post)! That being said:
I am overall skeptical that the CoT will be that useful (since it’s not clear to me that a model doing AI safety research sabotage has to put that in the CoT).
I think there is some chance interpretability will get harder as models develop increasingly alien abstractions.
Even if you can use interpretability to detect alignment failures, that doesn’t automatically solve the problem unless you can also use it to fix those alignment failures. It sounds weird, but I do actually think that it is totally plausible for us to end up in a world where we can reliably identify that models are misaligned but have no good ways to fix that misalignment (without directly training against the signal telling us this and thus risking Goodharting it).
In what sense is AI capabilities a smooth exponential? What units are you using to measure this? Why can’t I just take the log of it and call that “AI capabilities” and then say it is a smooth linear increase?
It seems like the load-bearing thing for you is that the gap between models gets larger, so let’s try to operationalize what a “gap” might be.
We could consider the expected probability that AI_{N+1} would beat AI_N on a prompt (in expectation over a wide variety of prompts). I think this is close-to-equivalent to a constant gap in ELO score on LMArena.[1] Then “gap increases” would roughly mean that the gap in ELO scores on LMArena between subsequent model releases would be increasing. I don’t follow LMArena much but my sense is that LMArena top scores have been increasing relatively linearly w.r.t time and sublinearly w.r.t model releases (just because model releases have become more frequent). In either case I don’t think this supports an “increasing gap” argument.
Personally I prefer to look at benchmark scores. The Epoch Capabilities Index (which I worked on) can be handwavily thought of as ELO scores based on benchmark performance. Importantly, the data that feeds into it does not mention release date at all—we put in only benchmark performance numbers to estimate capabilities, and then plot it against release date after the fact. It also suggests AI capabilities as operationalized by handwavy-ELO are increasing linearly over time.
I guess the most likely way in which you might think capabilities are exponential is by looking at the METR time horizons result? Of course you could instead say that capabilities are linearly increasing by looking at log time horizons instead. It’s not really clear which of these units you should use.
Mostly I think you should not try to go from the METR results to “are gaps in intelligence increasing or staying constant” but if I had to opine on this: the result says that you have a constant doubling time T for the time horizon. One way to think about this is that the AI at time 2T can do work at 50% success rate that AI at time T could do at 25% probability if you provide a decomposition into two pieces each of time T (each of which it completes with probability 50%). I kinda feel like this suggests more like “constant gap” rather than “increasing gap”.
Note that I do expect the first two trends to become superlinear eventually via an intelligence explosion, and the METR trend to become superexponential / superlinear (depending on units) probably some time before that (though probably we will just become unable to measure it well). But your claim seems to be about current progress, and for current progress I think it’s basically not true that the gap between successive models is increasing rather than staying constant.
Even in the intelligence explosion, capabilities progress is only superlinear w.r.t time, I expect it would still behave in the same way w.r.t inputs like compute and labor (where automated researchers should also count as labor). I’m not sure how I expect it to behave w.r.t successive model generations, partly because I’m not sure “successive model generations” will even be a sensible abstraction at that point. In any case, I don’t expect that to be particularly important in assessing the chances of success of a bootstrapping-type plan.
It seems pretty wild to go from “it is possible for an AI to subvert a technique” to “the technique will not be that useful”. Is that really what you mean? Are you bearish on all control work?
(At the object-level, I’d say that you’re drastically limiting the power of the research sabotage that can be done if the model is forced not to put any of it in the CoT, and you should be very happy about this even if the model can still do some research sabotage.)
Tbc, I also don’t expect CoT to be that useful for longer-term concerns, but that’s mostly because I expect CoTs to become extremely illegible or to stop existing altogether (partly due to my having “long” timelines; on Anthropic-level short timelines I’d be quite bullish on CoT).
Though I don’t know that much about LMArena and I expect in practice there are confounders, e.g. as they change the distribution of models that are being evaluated the meaning of the scores will change.
I agree that this is a bit of a tricky measurement question, and it’s really going to depend on how you interpret different metrics. I do find the METR results compelling here, and I’m not sure I agree with your argument against them, since it doesn’t always seem possible to do the sort of decomposition you’re proposing. In particular, the task that needs decomposing here is the task of overseeing a system that is smarter than you.
Perhaps one other metric that I will also mention that you don’t cover is revenue from AI systems, which is exponential and I think reflects an exponential increase in economic utility from AI as well as something like an exponential increase in the degree to which AI can automate human labor. Though of course it is again tricky how to translate that into the difficulty of doing oversight—but it definitely seems suggestive that the set of tasks that are qualitatively doable vs. not doable is changing in something like an exponential manner.
I was only giving a one-sentence summary of my beliefs here—I do think CoT can be useful; I’m just skeptical that it dramatically changes the picture. My beliefs here are similar to those in the METR report on this, in that in cases where it is necessary for the model to write something down in the CoT to solve the problem, CoT monitoring is useful, but in cases where it’s not necessary, it’s much less useful. And I am worried that a lot of research sabotage won’t require the model to reason through much of the sabotage parts in its CoT, e.g. because all it needs to do to sandbag the research is flip the sign on some experiments in relatively straightforward ways that don’t require a ton of reasoning.
This definitely depends on the “blue team” protocol at hand right? If we’re doing a bunch of cross-checking of research work / very carefully designed research sabotage honeypots during evaluations, then the model robustly needs to do a bunch of reasoning to ensure it’s not caught.
The data relating exponential capabilities to Elo is seen over decades in the computer chess history too. From the 1960s into the 2020′s, while computer hardware advanced exponentially at 100-1000x per decade in performance (and SW for computer chess advanced too), Elo scores grew linearly at about 400x per decade, taking multiple decades to go from ‘novice’ to ‘superhuman’. Elo scores have a tinge of exponential to them—a 400 point Elo advantage is about a 10:1 chance for the higher scored competitor to win, and an 800 point Elo is about 200:1, etc. It appears that the current HW/SW/dollar rate of growth towards AGI means the Elo relative to humans is increasing faster than 400 Elo/decade. And, of course, unlike computer chess, as AI Elo at ‘AI development’ approaches the level of a skilled human, we’ll likely get a noticable increase in the rate of capability increase.
Thank you for the reply. I want to engage without making it tiresome. The problem is that there are many things I disagree with in the worldview, the disagreement isn’t reducible to 1-5 double cruxes, but here are some candidates for the biggest cruxes for me. If any of these are wrong it’s bad news for my current view:
N->N+1 alignment:
will let humans align N+1 in cases where we can still check, with less and less effort.
is stable instead of diverging (the values of N+1 don’t drift arbitrarily far apart)
The N->N+1 improvement will continue to give out linear-ish improvements in perceived intelligence. We might get one larger jump or two, but it won’t continuously accelerate.
(a good analogy for this is loudness perception being logarithmic in sound pressure. Actual intelligence is logarithmic in METR time-horizon graph.)
Aligned-persona-seeming models won’t give out false AI safety research results, without making it visible on a CoT or latent reasoning.
(It’s perhaps possible to refrain from doing your best (sandbagging) but that doesn’t have nearly as bad effects, so it doesn’t count for this.)
And here’s another prediction where I really stick my neck out, which isn’t load-bearing to the view, but still increases my confidence, so defeating it is important:
we can to a significant extent train with RL against model internals (probes) and textual evaluations from other models, without ill effects.
That is, we ask the N model to evaluate N+1, giving test-time compute to N, and train against that. (we continuously also finetune N to predict N+1′s relevant behavior better).
we also train linear probes and update them during the RL.
Effectively I’m claiming these things are good enough or they’re self-reinforcing when model is already ~aligned, so that effectively Goodhart’s Law is a poor description of reality.
I still disagree with several of the points, but for time reasons I request that readers not update against Evan’s points if he just doesn’t reply to these.
disagree that increasing capabilities are exponential in a capability sense. It’s true that METR’s time horizon plot increases exponentially, but this still corresponds to a linear intuitive intelligence. (Like loudness (logarithmic) and sound pressure (exponential); we handle huge ranges well.) Each model that has come out has exponentially larger time horizon but is not (intuitively empirically) exponentially smarter.
“we still extensively rely on direct human oversight and review to catch alignment issues” That’s a fair point and should decrease confidence in my view, though I expected it. For properly testing sandwiching we’ll probably have to wait till models are superhuman, or use weak models + less weak models and test it out. Unfortunately perhaps the weak models are still too weak. But we’ve reached the point where you can maybe just use the actual Opus 3 as the weak model?
If we have a misaligned model doing research, we have lots of time to examine it with the previous model. I also do expect to see sabotage in the CoT or in deception probes
I updated way down on Goodharting on model internals due to Cundy and Gleave.
Again, readers please don’t update down on these due to lack of a response.
Probably this is because humans and AIs have very complementary skills. Once AIs are broadly more competent than humans, there’s no reason why they should get such a big boost from being paired with humans.
It’s not the AIs that are supposed to get a boost at supervising, it’s the humans. The skills won’t stop being complementary but the AI will be better at it.
The thing to test is whether weak humans + strong aligned-ish AI can supervise stronger AI. And I think the results are very encouraging—though of course it could stop working.
”The skills won’t stop being complementary” — in what sense will they be complementary when the AIs are better at everything?
”The thing to test is whether weak humans + strong aligned-ish AI can supervise stronger AI.”
As mentioned, I don’t think it’s very impressive that the human+AI score ends up stronger than the original AI. This is a consequence of the humans being stronger than AIs on certain subskills. This won’t generalize to the scenario with broadly superhuman AI.
I do think there’s a stronger case that I should be impressed that the human+AI score ends up stronger than the humans. This means that the humans managed to get the AIs to contribute skills/knowledge that the humans didn’t have themselves!
Now, the first thing to check to test that story: Were the AI assistants trained with human-expert labels? (Including: weak humans with access to google.) If so: no surprise that the models end up being aligned to produce such knowledge! The weak humans wouldn’t have been able to do that alone.
I couldn’t quickly see the paper saying that they didn’t use human-expert labels. But what if the AI assistants were trained without any labels that couldn’t have been produced by the weak humans? In that case, I would speculate that the key ingredient is that the pre-training data features ”correctly-answering expert human” personas, which are possible to elicit from the models with the right prompt/fine-tuning. But that also won’t easily generalize to the superhuman regime, because there aren’t any correctly-answering superhuman personas in the pre-training data.
I think the way that IDA is ultimately supposed to operate, in the superhuman regime, is by having the overseer AIs use more compute (and other resources) than the supervised AI. But I don’t think this paper produces a ton of evidence about the feasibility of that.
(I do think that persona-manipulation and more broadly “generalization science” is still interesting. But I wouldn’t say it’s doing a lot to tackle outer alignment operationalized as “the problem of overseeing systems that are smarter than you are”.)
Some colleagues and I did some follow-up on the paper in question and I would highly endorse “probably it worked because humans and AIs have very complementary skills”. Regarding their MMLU findings, appendix E of our preprint points out that that participants were significantly less likely to answer correctly when engaging in more than one turn of conversation. Engaging in very short (or even zero-turn) conversations happened often enough to provide reasonable error bars on the plot below (data below from MMLU, combined from the paper study & the replication mentioned in appendix E):
I think this suggests that there were some questions that humans knew the answer to and the models didn’t, and vice versa, and some participants seemed to employ a strategy of deferring to the model primarily when uncertain.
On the QuALITY findings, the original paper noted that
so it’s not surprising that an LLM that does get to read the full story outperforms humans here. Based on looking at some of the QuALITY transcripts, I think the uplift for humans + LLMs here came from the fact that humans were better at reading comprehension than 2022-era LLMs. For instance, in the first transcript I looked at the LLM suggested one answer, the human asked for the excerpt of the story that supported their claim, and when the LLM provided it the human noticed that the excerpt contained the relevant information but supported a different answer than the one the LLM had provided.
This comment had a lot of people downvote it (at this time, 2 overall karma with 19 votes). It shouldn’t have been, and I personally believe this is a sign of people being attached to AI x-risk ideas and of those ideas contributing to their entire persona rather than strict disagreement. This is something I bring to conversations about AI risk, since I believe folks will post-rationalize. The above comment is not low effort or low value.
If you disagree so strongly with the above comment, you should force yourself to outline your views and provide a rebuttal to the series of points made. I would personally value comments that attempted to do this in earnest. Particularly because I don’t want this post by Evan to be a signpost for folks to justify their belief in AI risk and essentially have the internal unconscious thinking of, “oh thank goodness someone pointed out all the AI risk issues, so I don’t have to do the work of reflecting on my career/beliefs and I can just defer to high status individuals to provide the reasoning for me.” I sometimes feel that some posts just end further discussion because they impact one’s identity.
That said, I’m so glad this post was put out so quickly so that we can continue to dig into things and disentangle the current state of AI safety.
Note: I also think Adrià should have been acknowledged in the post for having inspired it.
I thought Adria’s comment was great and I’ll try to respond to it in more detail later if I can find the time (edit: that response is here), but:
Adria did not inspire this post; this is an adaptation of something I wrote internally at Anthropic about a month ago (I’ll add a note to the top about that). If anyone inspired it, it would be Ethan Perez.
Ok, good to know! The title just made it seem like it was inspired by his recent post.
Great to hear you’ll respond; did not expect that, so mostly meant it for the readers who agree with your post.
I’m honestly very curious what Ethan is up to now, both you and Thomas Kwa implied that he’s not doing alignment anymore. I’ll have to reach out...
I generally think it makes sense for people to have pretty complicated reasons for why they think something should be downvoted. I think this goes more for longer content, which often would require an enormous amount of effort to respond to explicitly.
I have some sympathy for being sad here if a comment ends up highly net-downvoted, but FWIW, I think 2 karma feels vaguely in the right vicinity for this comment, maybe I would upvote it to +6, but I would indeed be sad to see it at +20 or whatever since I do think it’s doing something pretty tiring and hard to engage with. Directional downvoting is a totally fine use of downvoting, and if you think a comment is overrated but not bad, please downvote it until its karma reflects where you want it to end up!
(This doesn’t mean it doesn’t make sense to do sociological analysis of cultural trends on LW using downvoting, but I do want to maintain the cultural locus where people can have complicated reasons for downvoting and where statements like “if you disagree strongly with the above comment you should force yourself to outline your views” aren’t frequently made. The whole point of the vote system is to get signal from people without forcing them to do huge amounts of explanatory labor. Please don’t break that part)
That’s fair, it is tiring. I did want to make sure to respond to every particular point I disagreed with to be thorough, but it is just sooo looong.
What would you have me do instead? My best guess, which I made just after writing the comment, is that I should have proposed a list of double-crux candidates instead.
Do you have any other proposals or that’s good
Generally agree with this. I think in this case, I’m trying to call out safety folks to be frank with themselves and avoid the mistake of not trying to figure out if they really believe alignment is still hard or are looking for reasons to believe it is. Might not be what is happening here, but I did want to encourage critical thinking and potentially articulating it in case it is for some.
(Also, I did not mean for people to upvote it to the moon. I find that questionable too.)
Thank you for your defense Jacques, it warms my heart :)
However, I think Lesswrong has been extremely kind to me and continue to be impressed by this site’s discourse norms. If I were to post such a critique in any other online forum, it would have heavily negative karma. Yet, I continue to be upvoted, and the critiques are good-faith! I’m extremely pleased.
I agree it’s true that other forums would engage with even worse norms, but I’m personally happy to keep the bar high and have a high standard for these discussions, regardless of what others do elsewhere. My hope is that we never stop striving for better, especially since, for alignment, the stakes are incredibly higher than most other domains, so we need a higher standard of frankness.
I fully agree. But
a) Many users would immediately tell that predictor “predict what an intelligent agent would do to pursue this goal!” and all of the standard worries would re-occur.
b) This is similar to what we are actively doing, applying RL to these systems to make them effective agents.
Both of these re-introduce all of the standard problems. The predictor is now an agent. Strong predictions of what an agent should do include things like instrumental convergence toward power-seeking, incorrigibility, goal drift, reasoning about itself and its “real” goals and discovering misalignments, etc.
There are many other interesting points here, but I won’t try to address more!
I will say that I agree with the content of everything you say, but not the relatively optimistic implied tone. Your list of to-dos sound mostly unlikely to be done well. I may have less faith in institutions and social dynamics. I’m afraid we’ll just rush and and make crucial mistakes, so we’ll fail even if alignment was only in between steam engine and apollo levels.
This is not inevitable! If we can clarify why alignment is hard and how we’re likely to fail, seeing those futures can prevent them from happening—if we see them early enough and clearly enough to convince the relevant decision-makers to make better choices.
I don’t think it works this way. You have to create a context in which the true training data continuation is what a superintelligent agent would do. Which you can’t, because there are none in the training data, so the answer to your prompt would look like e.g. Understand by Ted Chiang. (I agree that you wrote ‘intelligent agent’, like all the humans that wrote the training data; so that would work, but wouldn’t be dangerous.)
Okay, that’s true.
After many years of study, I have concluded that if we fail it won’t be in the ‘standard way’ (of course, always open to changing my mind back). Thus we need to come up with and solve new failure modes, which I think largely don’t fall under classic alignment-to-developers.
I meant if the predictor were superhumanly intelligent.
You have spent years studying alignment? If so, I think your posts would do better by including more ITT/steelmanning for that world view.
I agree with your arguments that alignment isn’t necessarily hard. I think there are a complementary set of arguments against alignment being easy. Both must be addressed and figured in to produce a good estimate for alignment difficulty.
I’ve also been studying alignment for years, and my take is that everyone has a poor understanding of the whole problem and so we collectively have no good guess on alignment difficulty.
It’s just really hard to accurately imagine agi. If it’s just a smarter version of llms that acts as a tool, then sure it will probably be aligned enough just like current systems.
But it almost certainly won’t be.
I think that’s the biggest crux between your views and mine. Agency and memory/learning are too valuable and too easy to stay out of the picture for long.
I’m not sure the reasons Claude is adequately aligned won’t generalize to AGI that’s different in those ways, but I don’t think we have much reason to assume it will.
I’ve expressed this probably best yet on LLM AGI may reason about its goals, the post I linked to previously.