And we don’t just want to avoid extinction. We want to thrive. We want to expand our civilization and build a better world for our descendants.
And for ourselves. If AGI doesn’t take away civilization’s future, why take away the future of individual people? The technical problem should be relatively trivial, given a few years to get started.
And it can’t just kind of like the idea of human flourishing. Our well-being needs to be the primary thing it cares about.
If it’s not the primary thing it cares about, we lose the cosmic endowment. But we might keep our lives and civilization.
It does need to care at all for this to happen, and a paperclip maximizer won’t. But something trained on human culture might retain at least a tiny bit of compassion on reflection, which is enough to give back a tiny bit of the cosmic wealth it just took.
Humans are also potentially a threat.
Not if you use a superintelligently designed sandbox. It’s a question of spending literally no resources compared to spending at least a tiny little fraction of future resources.
The Uakari Monkey isn’t going extinct because humans are trying to kill it but because wood is useful and their habitat happens to be made of trees.
Saving monkeys[1] is also somewhat expensive, which becomes a lesser concern with more wealth, and a trivial concern with cosmic wealth. I think it’s an actual crux for this example. With enough wealth to trivially save all monkeys, no monkeys would go extinct, provided we care even a little bit more than precisely not at all.

[1] Or, as the case may be, their habitats. Bald uakaris seem only tangentially threatened right now.
With enough wealth to trivially save all monkeys, no monkeys would go extinct, provided we care even a little bit more than precisely not at all.
I’m confused about this point. Perhaps you mean wealth in a broad sense that includes “we don’t need to worry about getting more wood.” But as long as wood is a useful resource that humans could use to acquire more wealth and do other things we value more than saving monkeys, we will continue to take wood from the monkeys. Likewise, even if an AGI values human welfare somewhat, it will still take our resources as long as it values other things more than human welfare.
I found the monkey example much more compelling than “The AI does not hate you, nor does it love you, but you are made out of atoms which it can use for something else.” Taking resources from humans seems more likely than using humans as resources.
For the monkeys example, I mean that I expect that, in practice, there will be activists who will actually save the monkeys if they are wealthy enough to succeed in doing so on a whim. There are already expensive rainforest conservation efforts costing hundreds of millions of dollars. Imagine that they instead cost $10 and anyone could pay that cost without needing to coordinate with others. Then, I claim, someone would.
By analogy, the same should happen with humanity instead of monkeys, if AGIs reason in a sufficiently human-like way. I don’t currently find it likely that most AGIs would normatively accept some decision theory that rules this out. It’s obviously possible in principle to construct AGIs that follow such a decision theory (or value paperclips), but that’s not the same thing as such properties of AGI behavior being convergent and likely.
I think the default shape of a misaligned AGI is a sufficiently capable simulacrum, a human-like alien thing that faces the same value extrapolation issues as humanity, in a closely analogous way. (That is, unless an AGI alignment project makes something clever instead, which becomes much more alien and dangerous as a result.) And the default aligned AGI is the same, but not that alien, more of a generalized human.
something trained on human culture might retain at least a tiny bit of compassion on reflection
This depends on where “compassion” comes from. It’s not clear that training on data from human culture gets you much in the way of human-like internals. (Compare: contemporary language models know how to say a lot about “happiness”, but it seems very dubious that they feel happiness themselves.)
These are good points. Maybe we’ll align these things enough to where they’ll give us a little hamster tank to run around in.