evhub comments on Alignment remains a hard, unsolved problem

evhub 29 Nov 2025 8:36 UTC
LW: 9 AF: 5
3
AF

In what sense is AI capabilities a smooth exponential? What units are you using to measure this? Why can’t I just take the log of it and call that “AI capabilities” and then say it is a smooth linear increase?

I agree that this is a bit of a tricky measurement question, and it’s really going to depend on how you interpret different metrics. I do find the METR results compelling here, and I’m not sure I agree with your argument against them, since it doesn’t always seem possible to do the sort of decomposition you’re proposing. In particular, the task that needs decomposing here is the task of overseeing a system that is smarter than you.

Perhaps one other metric that I will also mention that you don’t cover is revenue from AI systems, which is exponential and I think reflects an exponential increase in economic utility from AI as well as something like an exponential increase in the degree to which AI can automate human labor. Though of course it is again tricky how to translate that into the difficulty of doing oversight—but it definitely seems suggestive that the set of tasks that are qualitatively doable vs. not doable is changing in something like an exponential manner.

It seems pretty wild to go from “it is possible for an AI to subvert a technique” to “the technique will not be that useful”. Is that really what you mean? Are you bearish on all control work?

I was only giving a one-sentence summary of my beliefs here—I do think CoT can be useful; I’m just skeptical that it dramatically changes the picture. My beliefs here are similar to those in the METR report on this, in that in cases where it is necessary for the model to write something down in the CoT to solve the problem, CoT monitoring is useful, but in cases where it’s not necessary, it’s much less useful. And I am worried that a lot of research sabotage won’t require the model to reason through much of the sabotage parts in its CoT, e.g. because all it needs to do to sandbag the research is flip the sign on some experiments in relatively straightforward ways that don’t require a ton of reasoning.
- np_x 5 Dec 2025 6:09 UTC
  3 points
  0
  Parent
  And I am worried that a lot of research sabotage won’t require the model to reason through much of the sabotage parts in its CoT, e.g. because all it needs to do to sandbag the research is flip the sign on some experiments in relatively straightforward ways that don’t require a ton of reasoning.
  This definitely depends on the “blue team” protocol at hand right? If we’re doing a bunch of cross-checking of research work / very carefully designed research sabotage honeypots during evaluations, then the model robustly needs to do a bunch of reasoning to ensure it’s not caught.