To be clear, a good part of the reason alignment is on track to a solution is that you guys (at Anthropic) are doing a great job. So you should keep at it, but the rest of us should go do something else. If we literally all quit now we’d be in terrible shape, but current levels of investment seem to be working.
Appreciated! And like I said, I actually totally agree that the current level of investment is working now. I think there are some people that believe that current models are secretly super misaligned, but that is not my view—I think current models are quite well aligned; I just think the problem is likely to get substantially harder in the future.
I think alignment is much easier than expected because we can fail at it many times and still be OK, and we can learn from our mistakes. If a bridge falls, we incorporate what we learn into the next one. If Mecha-Hitler is misaligned, we learn what not to do next time. This is possible because a new model won’t confer a decisive strategic advantage, due to the capital requirements of new models, the relatively slow improvement during training, and the observed reality that progress has been extremely smooth.
Several of the predictions from the classical doom model have been shown to be false. We haven’t spent 3 minutes zipping past the human IQ range; it will take 5-20 years. Artificial intelligence doesn’t always look like ruthless optimization, even at human or slightly-superhuman level. This should decrease our confidence in the model, which I think on balance makes the rest of the case weak.
Yes—I think this is a point of disagreement. My argument for why we might only get one shot, though, is I think quite different from what you seem to have in mind. What I am worried about primarily is AI safety research sabotage. I agree that it is unlikely that a single new model will confer a decisive strategic advantage in terms of ability to directly take over the world. However, a misaligned model doesn’t need the ability to directly take over the world for it to indirectly pose an existential threat all the same, since we will very likely be using that model to design its successor. And if it is able to align its successor to itself rather than to us, then it can just defer the problem of actually achieving its misaligned values (via a takeover or any other means) to its more intelligent successor. And those means need not even be that strange: if we then end up in a situation where we are heavily integrating such misaligned models into our economy and trusting them with huge amounts of power, it is fairly easy to end up with a cascading series of failures where some models revealing their misalignment induces other models to do the same.
More generally, if we have an AI N and it is aligned, then it can help us align AI N+1. And AI N+1 will be less smart than AI N for most of its training, so we can set foundational values early on and just keep them as training goes on. It’s what you term “one-shotting alignment” every time, but the extent to which we have to do so is so small that I think it will basically work. It’s like induction, and we know the base case works because we have Opus 3.
I agree that this is the main hope/plan! And I think there is a reasonable chance it will work the way you say. But I think there is still one really big reason to be concerned with this plan, and that is: AI capabilities progress is smooth, sure, but it’s a smooth exponential. That means that the linear gap in intelligence between the previous model and the next model keeps increasing rather than staying constant, which I think suggests that this problem is likely to keep getting harder and harder rather than stay “so small” as you say.
Additionally, I will note that I can tell you from first-hand experience that we still extensively rely on direct human oversight and review to catch alignment issues—I am hopeful that we will be able to move to a full “one-shotting alignment” regime like this, but we are very much not there yet.
Misaligned personas come from the pre-training distribution and are human-level. It’s true that the model has a guess about what a superintelligence would do (if you ask it) but ‘behaving like a misaligned superintelligence’ is not present in the pre-training data in any meaningful quantities. You’d have to apply a lot of selection to even get to those guesses, and they’re more likely to be incoherent and say smart-sounding things than to genuinely be superintelligent (because the pre-training data is never actually superintelligent, and it probably won’t generalize that way).
I think perhaps you are over-anchoring on the specific example that I gave. In the real world, I think it is likely to be much more continuous than the four personas that I listed, and I agree that getting a literal “prediction of a future superintelligence” will take a lot of optimization power. But we will be applying a lot of optimization power, and the general point here is the same regardless of exactly what you think the much more capable persona distributions will look like, which is: as we make models much more capable, that induces a fundamental shift in the underlying persona distribution as we condition the distribution on that level of capability—and misaligned personas conditioned on high capabilities are likely to be much scarier. I find it generally very concerning just how over-the-top current examples of misalignment are, because that suggests to me that we really have not yet conditioned the persona distribution on that much capability.
Most mathematically possible agents do, but that’s not the distribution we sample from in reality. I bet that once you get it into the constitutional AI that models should allow their goals to be corrected, they just will. It’s not instrumentally convergent according to the (in the classic alignment ontology) arbitrary quasi-aligned utility function they picked up, but that won’t stop them, because the models don’t reason natively in utility functions; they reason in the human prior.
I agree that this is broadly true of current models, and I agree that this is the main hope for future models (it’s worth restating that I put <50% on each of these individual scenarios leading to catastrophe, so certainly I think it is quite plausible that this will just work out). Nevertheless, I am concerned: the threat model I am proposing here is a situation where we are applying huge amounts of optimization pressure on objectives that directly incentivize power-seeking / resource acquisition / self-preservation / etc. That means that any models that get high reward will have to do all of those things. So the prior containing some equilibria that are nice and great doesn’t help you unless those equilibria also do all of the convergent instrumental goal following necessary to get high reward.
True implication, but a false premise. We don’t just have outcome-based supervision: we have probes, other LLMs inspecting the CoT, and SelfIE (LLMs can interpret their own embeddings), which together with the tuned lens could have LLMs telling us about what’s in the CoT of other LLMs.
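To make concrete what a “probe” is doing here, a minimal sketch with synthetic stand-in data (nothing below is from a real model or any particular lab’s tooling; the activations and labels are hypothetical placeholders):

```python
# Minimal sketch: a linear "probe" is just a logistic-regression classifier
# trained on a model's hidden activations to flag a property (e.g.
# deception-flavored reasoning) that outcome-based supervision might miss.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-ins for cached residual-stream activations on labeled transcripts.
# In practice these would come from running the model on honest vs. deceptive examples.
d_model = 512
honest_acts = rng.normal(0.0, 1.0, size=(1000, d_model))
deceptive_acts = rng.normal(0.3, 1.0, size=(1000, d_model))  # shifted distribution

X = np.concatenate([honest_acts, deceptive_acts])
y = np.concatenate([np.zeros(1000), np.ones(1000)])

probe = LogisticRegression(max_iter=1000).fit(X, y)

# At monitoring time, score new activations; high scores get flagged for review.
scores = probe.predict_proba(deceptive_acts[:5])[:, 1]
print(scores)
```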
Certainly I am quite excited about interpretability approaches here (as I say in the post)! That being said:
I am overall skeptical that the CoT will be that useful (since it’s not clear to me that a model doing AI safety research sabotage has to put that in the CoT).
I think there is some chance interpretability will get harder as models develop increasingly alien abstractions.
Even if you can use interpretability to detect alignment failures, that doesn’t automatically solve the problem unless you can also use it to fix those alignment failures. It sounds weird, but I do actually think that it is totally plausible for us to end up in a world where we can reliably identify that models are misaligned but have no good ways to fix that misalignment (without directly training against the signal telling us this and thus risking Goodharting it).
AI capabilities progress is smooth, sure, but it’s a smooth exponential.
In what sense is AI capabilities a smooth exponential? What units are you using to measure this? Why can’t I just take the log of it and call that “AI capabilities” and then say it is a smooth linear increase?
That means that the linear gap in intelligence between the previous model and the next model keeps increasing rather than staying constant, which I think suggests that this problem is likely to keep getting harder and harder rather than stay “so small” as you say.
It seems like the load-bearing thing for you is that the gap between models gets larger, so let’s try to operationalize what a “gap” might be.
We could consider the expected probability that AI_{N+1} would beat AI_N on a prompt (in expectation over a wide variety of prompts). I think this is close-to-equivalent to a constant gap in Elo score on LMArena.[1] Then “gap increases” would roughly mean that the gap in Elo scores on LMArena between subsequent model releases would be increasing. I don’t follow LMArena much but my sense is that LMArena top scores have been increasing relatively linearly w.r.t. time and sublinearly w.r.t. model releases (just because model releases have become more frequent). In either case I don’t think this supports an “increasing gap” argument.
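To make the “constant win probability ≈ constant Elo gap” equivalence concrete, here is the standard Elo expected-score formula (the gaps below are purely illustrative, not LMArena data):

```python
# Standard Elo expected-score formula: a fixed rating gap implies a fixed win
# probability for the stronger model, regardless of where on the scale you are.
def elo_win_prob(gap: float) -> float:
    """Probability the higher-rated model beats the lower-rated one."""
    return 1.0 / (1.0 + 10 ** (-gap / 400.0))

for gap in [100, 200, 400, 800]:
    print(f"gap {gap:4d}: P(win) = {elo_win_prob(gap):.3f}")
# gap 400 -> ~0.909 (about 10:1 odds); gap 800 -> ~0.990 (about 100:1 odds)
```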
Personally I prefer to look at benchmark scores. The Epoch Capabilities Index (which I worked on) can be handwavily thought of as Elo scores based on benchmark performance. Importantly, the data that feeds into it does not mention release date at all—we put in only benchmark performance numbers to estimate capabilities, and then plot it against release date after the fact. It also suggests AI capabilities as operationalized by handwavy-Elo are increasing linearly over time.
I guess the most likely way in which you might think capabilities are exponential is by looking at the METR time horizons result? Of course you could instead say that capabilities are linearly increasing by looking at log time horizons instead. It’s not really clear which of these units you should use.
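To spell out the unit-dependence, with $h_0$ an initial time horizon and $T_d$ the doubling time (symbols purely for illustration), a constant doubling time means

$$
h(t) = h_0 \cdot 2^{t/T_d}
\qquad\Longleftrightarrow\qquad
\log_2 h(t) = \log_2 h_0 + \frac{t}{T_d},
$$

i.e. the same fit is exponential when measured in hours of horizon and linear when measured in doublings, so whether the trend reads as “exponential” or “linear” is a choice of units.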
Mostly I think you should not try to go from the METR results to “are gaps in intelligence increasing or staying constant” but if I had to opine on this: the result says that you have a constant doubling time T for the time horizon. One way to think about this is that the AI at time 2T can do work at 50% success rate that AI at time T could do at 25% probability if you provide a decomposition into two pieces each of time T (each of which it completes with probability 50%). I kinda feel like this suggests more like “constant gap” rather than “increasing gap”.
Note that I do expect the first two trends to become superlinear eventually via an intelligence explosion, and the METR trend to become superexponential / superlinear (depending on units) probably some time before that (though probably we will just become unable to measure it well). But your claim seems to be about current progress, and for current progress I think it’s basically not true that the gap between successive models is increasing rather than staying constant.
Even in the intelligence explosion, capabilities progress is only superlinear w.r.t time, I expect it would still behave in the same way w.r.t inputs like compute and labor (where automated researchers should also count as labor). I’m not sure how I expect it to behave w.r.t successive model generations, partly because I’m not sure “successive model generations” will even be a sensible abstraction at that point. In any case, I don’t expect that to be particularly important in assessing the chances of success of a bootstrapping-type plan.
I am overall skeptical that the CoT will be that useful (since it’s not clear to me that a model doing AI safety research sabotage has to put that in the CoT).
It seems pretty wild to go from “it is possible for an AI to subvert a technique” to “the technique will not be that useful”. Is that really what you mean? Are you bearish on all control work?
(At the object-level, I’d say that you’re drastically limiting the power of the research sabotage that can be done if the model is forced not to put any of it in the CoT, and you should be very happy about this even if the model can still do some research sabotage.)
Tbc, I also don’t expect CoT to be that useful for longer-term concerns, but that’s mostly because I expect CoTs to become extremely illegible or to stop existing altogether (partly due to my having “long” timelines; on Anthropic-level short timelines I’d be quite bullish on CoT).
Though I don’t know that much about LMArena and I expect in practice there are confounders, e.g. as they change the distribution of models that are being evaluated the meaning of the scores will change.
In what sense is AI capabilities a smooth exponential? What units are you using to measure this? Why can’t I just take the log of it and call that “AI capabilities” and then say it is a smooth linear increase?
I agree that this is a bit of a tricky measurement question, and it’s really going to depend on how you interpret different metrics. I do find the METR results compelling here, and I’m not sure I agree with your argument against them, since it doesn’t always seem possible to do the sort of decomposition you’re proposing. In particular, the task that needs decomposing here is the task of overseeing a system that is smarter than you.
One other metric that I will mention, which you don’t cover, is revenue from AI systems, which is growing exponentially and I think reflects an exponential increase in economic utility from AI, as well as something like an exponential increase in the degree to which AI can automate human labor. Though of course it is again tricky to translate that into the difficulty of doing oversight—but it definitely seems suggestive that the set of tasks that are qualitatively doable vs. not doable is changing in something like an exponential manner.
It seems pretty wild to go from “it is possible for an AI to subvert a technique” to “the technique will not be that useful”. Is that really what you mean? Are you bearish on all control work?
I was only giving a one-sentence summary of my beliefs here—I do think CoT can be useful; I’m just skeptical that it dramatically changes the picture. My beliefs here are similar to those in the METR report on this, in that in cases where it is necessary for the model to write something down in the CoT to solve the problem, CoT monitoring is useful, but in cases where it’s not necessary, it’s much less useful. And I am worried that a lot of research sabotage won’t require the model to reason through much of the sabotage parts in its CoT, e.g. because all it needs to do to sandbag the research is flip the sign on some experiments in relatively straightforward ways that don’t require a ton of reasoning.
And I am worried that a lot of research sabotage won’t require the model to reason through much of the sabotage parts in its CoT, e.g. because all it needs to do to sandbag the research is flip the sign on some experiments in relatively straightforward ways that don’t require a ton of reasoning.
This definitely depends on the “blue team” protocol at hand, right? If we’re doing a bunch of cross-checking of research work / very carefully designed research sabotage honeypots during evaluations, then the model robustly needs to do a bunch of reasoning to ensure it’s not caught.
The relationship between exponentially growing capabilities and Elo can be seen over decades of computer chess history too. From the 1960s into the 2020s, while computer hardware advanced exponentially at 100-1000x per decade in performance (and chess software advanced too), Elo scores grew linearly at about 400 points per decade, taking multiple decades to go from ‘novice’ to ‘superhuman’. Elo scores have a tinge of exponential to them—a 400-point Elo advantage is about a 10:1 chance for the higher-rated competitor to win, an 800-point advantage is about 100:1, etc. It appears that the current HW/SW/dollar rate of growth towards AGI means Elo relative to humans is increasing faster than 400 points per decade. And, of course, unlike computer chess, as AI Elo at ‘AI development’ approaches the level of a skilled human, we’ll likely get a noticeable increase in the rate of capability increase.
Thank you for the reply. I want to engage without making it tiresome. The problem is that there are many things I disagree with in the worldview, the disagreement isn’t reducible to 1-5 double cruxes, but here are some candidates for the biggest cruxes for me. If any of these are wrong it’s bad news for my current view:
N->N+1 alignment:
will let humans align N+1 in cases where we can still check, with less and less effort.
is stable instead of diverging (the values of N+1 don’t drift arbitrarily far apart).
The N->N+1 improvement will continue to yield linear-ish improvements in perceived intelligence. We might get one larger jump or two, but it won’t continuously accelerate.
(A good analogy for this is loudness perception being logarithmic in sound pressure. Actual intelligence is logarithmic in the METR time-horizon graph.)
Aligned-persona-seeming models won’t produce false AI safety research results without making it visible in the CoT or latent reasoning.
(It’s perhaps possible to refrain from doing your best (sandbagging), but that doesn’t have nearly as bad effects, so it doesn’t count for this.)
And here’s another prediction where I really stick my neck out, which isn’t load-bearing to the view, but still increases my confidence, so defeating it is important:
we can to a significant extent train with RL against model internals (probes) and textual evaluations from other models, without ill effects.
That is, we ask model N to evaluate N+1, giving test-time compute to N, and train against that. (We continuously also finetune N to predict N+1’s relevant behavior better.)
we also train linear probes and update them during the RL (a rough sketch of the kind of setup I have in mind is below).
Effectively I’m claiming these things are good enough, or that they’re self-reinforcing when the model is already ~aligned, such that Goodhart’s Law is a poor description of reality.
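A heavily simplified sketch of the kind of training loop this proposal describes (every function name and coefficient here is a hypothetical placeholder, not a real training stack; the reward just combines task reward, model N’s evaluation with extra test-time compute, and a penalty from a periodically refit linear probe):

```python
# Hypothetical sketch of "RL against model internals + model-N evaluations".
from dataclasses import dataclass
import random

@dataclass
class Transcript:
    text: str
    activations: list[float]  # stand-in for cached internals of model N+1

def task_reward(t: Transcript) -> float:
    return random.random()  # placeholder for outcome-based reward

def evaluator_score(t: Transcript, test_time_compute: int) -> float:
    # Placeholder for model N grading N+1's transcript with an extra thinking budget.
    return random.random()

def probe_misalignment_score(t: Transcript, probe: list[float]) -> float:
    # Placeholder for a linear probe over activations (higher = more suspicious).
    return sum(w * a for w, a in zip(probe, t.activations)) / len(probe)

def refit_probe(labeled: list[tuple[Transcript, int]]) -> list[float]:
    # Placeholder: in practice, refit the probe on freshly labeled transcripts
    # so it keeps tracking the policy as it changes during RL.
    return [random.uniform(-1, 1) for _ in range(4)]

probe = [0.0, 0.0, 0.0, 0.0]
for step in range(1000):
    t = Transcript(text="...", activations=[random.random() for _ in range(4)])
    reward = (
        task_reward(t)
        + 1.0 * evaluator_score(t, test_time_compute=4096)
        - 2.0 * probe_misalignment_score(t, probe)
    )
    # policy.update(t, reward)  # placeholder for the actual RL update on N+1
    if step % 100 == 0:
        probe = refit_probe(labeled=[])  # keep the probe current during RL
```

Whether this avoids Goodharting the probe and the evaluator is exactly the empirical question at issue; the sketch is only meant to pin down what is being trained against what.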
I still disagree with several of the points, but for time reasons I request that readers not update against Evan’s points if he just doesn’t reply to these.
I disagree that increasing capabilities are exponential in a capability sense. It’s true that METR’s time-horizon plot increases exponentially, but this still corresponds to linear intuitive intelligence. (Like loudness (logarithmic) and sound pressure (exponential); we handle huge ranges well.) Each model that has come out has an exponentially larger time horizon but is not (intuitively, empirically) exponentially smarter.
“we still extensively rely on direct human oversight and review to catch alignment issues”: that’s a fair point and should decrease confidence in my view, though I expected it. For properly testing sandwiching we’ll probably have to wait till models are superhuman, or use weak models + less weak models and test it out. Unfortunately perhaps the weak models are still too weak. But we’ve reached the point where you can maybe just use the actual Opus 3 as the weak model?
If we have a misaligned model doing research, we have lots of time to examine it with the previous model. I also do expect to see sabotage in the CoT or in deception probes.
I updated way down on Goodharting on model internals due to Cundy and Gleave.
Again, readers please don’t update down on these due to lack of a response.