I feel like I want a cost in this model, explicitly or implicitly. One intuition: if “reward seeking” generalizes super well and “local skill” doesn’t, then we should converge to reward seeking no matter how much weight we put on it initially, because skill is costly and reward seeking gets us the same level of performance much more cheaply. Another: if reward seeking is basically parasitic on local skill in each environment, it shouldn’t be learned (sort of a free-rider argument).
Also, either reward seeking or “local skill” could be the cheaper one within individual environments, though this may be an extension that isn’t super interesting for the question at hand.
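To make the amortization intuition concrete, here’s a toy cost comparison. The linear cost model and all the numbers are entirely made up; the only point is the crossover:

```python
# Toy amortization argument: local skill pays a learning cost in every
# environment, while reward seeking pays a one-time cost and then transfers.
# Costs are in made-up units; only the crossover matters.

SKILL_COST_PER_ENV = 1.0    # learn the task "for real" in each environment
REWARD_SEEKING_COST = 10.0  # learn the general "do what gets rewarded" strategy once

for n_envs in (1, 5, 10, 20, 100):
    skill_total = SKILL_COST_PER_ENV * n_envs
    rs_total = REWARD_SEEKING_COST  # amortized across all environments
    winner = "reward seeking" if rs_total < skill_total else "local skill"
    print(f"{n_envs:>3} envs: skill={skill_total:6.1f}  rs={rs_total:6.1f}  -> {winner}")
```

Past the crossover, reward seeking delivers the same per-environment performance at lower total cost, which is the sense in which training pressure should converge on it however it’s weighted initially.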
I wondered whether “generalizability across environments” works as a definition of what separates reward seeking from local skill, but I don’t think so. For example, arithmetic probably generalizes quite well and is not reward seeking.
I think a lot of the concerns raised here straddle the boundary between what I’d call “alignment” and what I’d call “capabilities.” You basically say as much when you note these failures will hamper AI research automation and that there are commercial incentives to fix them. They’re real problems, but it’s unclear where safety-focused people should be trying to intervene, and I’m not sure the alignment-vs-capabilities framing is the most helpful lens here.
If you’ll excuse me thinking aloud a bit: I’ve been playing around with some dynamical models of AI development automation. Still ideating, nothing written up yet, but I’m curious whether they offer a useful way to think about this.
The variables are: $A$ (AI influence: fraction of AI development done by AIs), $U$ (fraction of AI-produced work which, under perfect observability, we’d be unhappy with), $O$ (observability: how much of $U$ we actually detect), and $C$ (controllability: how much of the detected $U$ we can actually correct in the next generation). These evolve according to coupled update rules.
These things are all a bit underspecified. I think of $A$ and $U$ not as raw token counts but as something like tokens × “importance”.
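To make “coupled update rules” concrete, here’s one possible discrete-generation parameterization. To be clear, every functional form and constant here ($g$, $\lambda$, $d$, $c$, $s$, $k$, $m$) is a placeholder guess of mine, not something pinned down:

$$\begin{aligned}
A_{t+1} &= A_t + g\,A_t(1-A_t)(1-\lambda U_t)\\
U_{t+1} &= U_t + d\,A_t(1-U_t) - c\,O_t C_t U_t\\
O_{t+1} &= O_t + s\,A_t(1-U_t)(1-O_t) - k\,U_t O_t\\
C_{t+1} &= C_t + m\,(O_t - C_t)
\end{aligned}$$

Reading the terms: $(1-\lambda U_t)$ gates influence growth on endorsed work; $U$ drifts up with automation and is corrected only insofar as it’s observed and controlled ($c\,O_t C_t U_t$); improvements to $O$ are themselves AI research (the $s\,A_t(1-U_t)$ factor) and are eroded by unendorsed work; and $C$ lags $O$.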
Your issues look like observability problems to me: AI systems are making it hard to check whether what they’re doing is what we wanted. Some features of the model that seem relevant:
At low automation, $U$ gates $A$. If unendorsed work dominates, AI influence plateaus: the output is mostly futile, so influence grows only insofar as genuinely useful work gets done. This seems to describe the current regime: when AI systems don’t do what we want, the result is largely waste. But this probably changes at higher capability/automation levels, where unendorsed work becomes less futile and more functional.
The above creates endogenous incentives to maintain $O$. If you need to control $U$ to sustain $A$-growth, you need a not-completely-useless estimate of $U$.
Near full automation, $U$ and $O$ become tightly coupled. Research that improves $O$ is mostly done by AI, and actions to modify $U$ require observing it first. So there are possible stable regimes with high $O$ and low $U$ (good basin), or low $O$ and high $U$ (bad basin), and the dynamics determine which you land in.
The key question is whether the endogenous incentive to maintain $O$ is enough to land in the good basin. You observe that $O$ is currently low on hard-to-check tasks, but in isolation this is ambiguous. What matters is the trajectory of $(U, O, C)$ jointly, not a point estimate of $O$.
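As a crude sanity check that rules like the ones above can actually produce two basins, here’s a minimal simulation. The parameter values and starting points are arbitrary choices of mine, tuned only so the bistability shows up:

```python
# Toy simulation of the (A, U, O, C) dynamics sketched above.
# All functional forms, parameters, and initial conditions are
# illustrative guesses, not a calibrated model.

def step(A, U, O, C, g=0.3, lam=1.0, d=0.03, c=0.6, s=0.15, k=0.3, m=0.3):
    A2 = A + g * A * (1 - A) * (1 - lam * U)        # growth gated by unendorsed work
    U2 = U + d * A * (1 - U) - c * O * C * U        # drifts up, corrected only where observed
    O2 = O + s * A * (1 - U) * (1 - O) - k * U * O  # O-research is AI work, eroded by U
    C2 = C + m * (O - C)                            # controllability lags observability
    clamp = lambda x: min(1.0, max(0.0, x))         # keep all fractions in [0, 1]
    return clamp(A2), clamp(U2), clamp(O2), clamp(C2)

def run(state, steps=500):
    for _ in range(steps):
        state = step(*state)
    return state

# Two starts under identical rules, differing only in initial U/O/C.
good = run((0.2, 0.15, 0.60, 0.60))  # -> high O, low U: oversight keeps pace
bad  = run((0.2, 0.75, 0.15, 0.15))  # -> O collapses, U grows, A plateaus

print("good basin: A=%.2f U=%.2f O=%.2f C=%.2f" % good)
print("bad basin:  A=%.2f U=%.2f O=%.2f C=%.2f" % bad)
```

The interesting feature is that the two trajectories separate purely through the feedback between $U$ and $O$: in the bad start, observability erodes faster than correction can pull $U$ down, after which the loop is self-reinforcing.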
I think this framing suggests a slightly different research question than “are current AIs aligned?”: namely, is the feedback loop between $U$, $O$, and $C$ self-correcting or self-eroding as automation increases, and where should we intervene to influence this trajectory?
Does this look like a productive direction to you?