The Missing Risk: What Humans Bring to AGI
I have an argument I want to test before developing it further. I can’t find it addressed in the AI Alignment Intro Materials or in any existing discussion on LessWrong, and I want to know whether I’m missing something.
The human values problem, as I define it, is a repeating cycle: people build systems that reward behavior reflecting bad values[1], and those systems instill the same values in the next generation. That cycle produces people who cause unnecessary harm[2], even unknowingly or unintentionally, because the systems that shaped them rewarded that harm while hiding its cost.
This is distinct from the technical alignment problem and deserves to be treated as its own problem, for three reasons.
1. “[Advanced Intro to AI Alignment] What Values May an AI Learn? — 4 Key Problems” explains why predictable mistakes in human feedback let reward-predicting strategies outcompete genuinely nice ones. But there is an upstream problem that framework doesn’t address: what if the feedback itself reflects bad values, not because of evaluation mistakes, but because the humans giving feedback were shaped by systems that rewarded taking value from others? The AI absorbs those values without anyone intending it.
2. “Alignment Is Not One Problem: A 3D Map of AI Risk” maps 12 distinct risk families across system properties. I don’t see the cycle that produces bad values in the humans building, training, and using these systems addressed anywhere on that map. That cycle operates upstream of everything currently on it.
3. “Can We Make AI Alignment Framing Less Wrong?” argues that framing alignment primarily as a control problem crowds out cause-based thinking. The human values problem is exactly the kind of upstream cause that gets crowded out. Even if every item on the 12-risk map were addressed, control stacks applied by people whose values were shaped by bad systems would still produce bad outcomes.
Addressing the human values problem does not require convincing billions of people to care about strangers or the broader world. It requires showing each person the exact steps to solve their biggest problems and achieve their ultimate goals, without causing unnecessary harm to others.
The technical alignment problem and the human values problem require different people, different skill sets, and different resources. Frontier AI organizations themselves project that AGI could arrive before the end of this decade. If bad values are still dominant when AGI arrives, problems like corruption, poverty, inequality, and environmental destruction won’t just persist: AGI removes the limits on how far and wide they can spread.
Given that AI capabilities are advancing regardless of whether alignment is solved, and that the human values problem requires different people and resources than technical alignment work, what is the field currently doing to address this upstream cause before AGI arrives?
[1] I define “bad values” as decision-making filters that produce choices causing unnecessary harm to others. The more widely those filters spread, the worse life gets for everyone.
[2] I define “unnecessary harm” as damage to a person’s ability to live well that serves no purpose worth keeping. It builds nothing, teaches nothing, and could end right now without losing anything that makes life good. The humiliation of someone who trusted you. The exploitation of someone who had no other options. The destruction of an environment, a community, or a person’s health that can never be fully restored.