I doubt that it’s an alignment issue rather than a governance issue. The way similar problems get resolved in the AI-2027 slowdown scenario[1] is that “the language in the Spec is vague, but seems to imply a chain of command that tops out at company leadership”, which then leads the authors to intervene and try to prevent a power grab, with unclear results. In addition, even the AI-2027 authors acknowledge the risks related to the Intelligence Curse in a footnote. Were the Curse to happen, it would lock in a power distribution which doesn’t involve most humans ~at all.
[1] The Race Branch has mankind fail to align the AIs, with the obvious results of genocide or disempowerment.
Agreed, governance failures (unclear chain of command, power grabs, the Intelligence Curse) are a huge part of the story that I should’ve drawn out more. Governance is a major part of the ideal solution, but I don’t think that makes alignment a non-issue. To your point, governance basically decides who is allowed to specify goals, and alignment determines how those goals become operational behaviors. If the chain of command in governance is narrow, the value inputs that alignment systems learn from are also narrow, so governance failures can lead to misaligned AGI. But even within the current governance constructs, I think there’s still room for alignment researchers and developers to influence alignment outcomes. Saying it’s a governance question rather than an alignment question overlooks how these systems actually get built. The mechanistic/implementation piece is a lot harder to solve, and I’m not sure what the answer is. Anthropic’s Interviewer tool seems like a step in the right direction, in terms of engaging a wider (and directly impacted) audience.
More explicitly: even if governance chooses perfect alignment goals, mechanistic/inner alignment can still embed its builders’ blind spots. Systems are still shaped by which datasets get chosen, which heuristics get encoded, how RLHF rubrics are designed, what the safety evals cover, which shortcuts get taken, and so on. I kind of doubt “governance” solves this, because the people doing governance aren’t micromanaging those decisions. The split of “governance picks goals and alignment implements them” isn’t cleanly separable in practice. Or you can take a really broad view of “governance” and say it includes the senior researchers and engineers, but then the representation problem remains; it’s just harder to solve. Maybe part of the solution involves making AI development concepts much easier and more accessible to a wide audience over time, then somehow soliciting more diverse inputs on mechanistic alignment decisions… I’d be super curious whether others have thought about these challenges from an implementation perspective. (A toy illustration of the implementation point is sketched below.)
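To make that concrete, here’s a minimal toy sketch (purely hypothetical: the rubric dimensions, weights, and scores are made up, not anyone’s actual pipeline). Even a tiny preference-labeling rubric collapses value trade-offs into the single scalar a reward model trains on, and whoever sets the weights is making alignment-relevant decisions that no governance process ever reviews:

```python
# Hypothetical sketch, not any lab's real pipeline: a toy RLHF labeling rubric
# where the weights themselves are value judgments made by whoever writes this
# file, i.e. decisions that sit below the level governance typically reviews.
from dataclasses import dataclass

# Assumption: dimensions and weights are illustrative only.
RUBRIC_WEIGHTS = {
    "helpfulness": 0.5,
    "harm_avoidance": 0.3,
    "honesty": 0.15,
    "cultural_sensitivity": 0.05,  # an implicit judgment about whose norms count
}

@dataclass
class ResponseScores:
    # Per-dimension judgments (0 to 1) as assigned by a human labeler.
    helpfulness: float
    harm_avoidance: float
    honesty: float
    cultural_sensitivity: float

def preference_score(scores: ResponseScores) -> float:
    """Collapse multi-dimensional judgments into the single scalar the reward
    model is trained on; the collapse itself encodes the builder's trade-offs."""
    return sum(RUBRIC_WEIGHTS[dim] * getattr(scores, dim) for dim in RUBRIC_WEIGHTS)

# Two candidate completions for the same prompt: the rubric, not "governance",
# decides which one the reward model learns to prefer.
a = ResponseScores(helpfulness=0.9, harm_avoidance=0.6, honesty=0.8, cultural_sensitivity=0.4)
b = ResponseScores(helpfulness=0.7, harm_avoidance=0.9, honesty=0.8, cultural_sensitivity=0.9)
print(preference_score(a), preference_score(b))  # shuffle the weights and the winner flips
```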
Also, the Intelligence Curse doesn’t negate the alignment point; it amplifies it into a “now” problem. The danger is that the values being embedded in the run-up to that point will become self-reinforcing, scaled up, and locked in.