Just finished reading “If Anyone Builds It, Everyone Dies”. I had a question that seems like an obvious one, but one I didn’t see addressed in the book; maybe someone can help:
The main argument in the book is the analogy to humans. Evolution “wanted” us to maximize genetic fitness, but it didn’t get what it trained for. Instead, it created humans who love ice cream and condoms even though they reduce our genetic fitness.
With AGI, we’re on track to do something similar: we won’t get an AI aligned to human interests even if we do RLHF or other such simple training or shaping of the AI; it’ll end up wanting something weird and inhuman rather than maximizing human values.
But in my mind, this seems to miss a fairly important point: human brains don’t come pre-wired with much knowledge. We have to learn it from scratch. We don’t come out of the womb with the concept of “inclusive genetic fitness”. It took culture and ~200,000 years to figure that out, and even now we only learn it after about 15-20 years of existing. So there’s no way evolution could have made us point our utility function at “inclusive genetic fitness”, because that concept didn’t exist in our brains for it to point at.
Modern AIs don’t seem like that. They come with the sum of human knowledge baked in during pre-training. As such a model gets smarter, the concept of “human values” or “friendly AI” is definitely already in its existing mind. So it should be much easier for us to do alignment and test whether we can point it at that specific concept vs. what evolution had to work with.
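To make “point it at that specific concept” a bit more concrete, here is a toy sketch of one thing that could mean in practice: finding a direction in a model’s activation space that tracks a concept, using a simple linear probe. This is my own illustration, not something from the book; the “activations” below are synthetic stand-ins rather than real model hidden states, and the dimensionality, offsets, and sample counts are made-up numbers.

```python
# Toy sketch: if a concept is already represented inside a trained model,
# one way to "point at it" is to find a direction in activation space that
# tracks the concept. The activations here are synthetic stand-ins, not
# real hidden states from any model.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

d = 64          # hypothetical hidden-state dimensionality
n = 500         # examples per class

# The "true" concept direction we pretend the model learned in pre-training.
concept_dir = rng.normal(size=d)
concept_dir /= np.linalg.norm(concept_dir)

# Pretend these are hidden states from prompts that do / don't invoke the
# concept: concept-present examples get a linear offset along that direction.
labels = np.array([1] * n + [0] * n)
acts = rng.normal(size=(2 * n, d)) + 2.0 * np.outer(labels, concept_dir)

# Fit a linear probe and compare its direction with the "true" one.
probe = LogisticRegression(max_iter=1000).fit(acts, labels)
learned_dir = probe.coef_[0] / np.linalg.norm(probe.coef_[0])

print("probe accuracy:", probe.score(acts, labels))
print("cosine(learned, true concept direction):", float(learned_dir @ concept_dir))
```

Of course, locating a concept inside a model is not the same as making the model care about it; this sketch only covers the “finding” half.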
Knowing about “inclusive genetic fitness” does not stop you from wanting ice cream.
For superhuman AIs, knowing about human values won’t necessarily make them care.
Yes, I agree with that. I’m not claiming that knowing about it stops you from wanting ice cream.
I’m claiming that if the concept had been hardwired into our brains, evolution would have had an easy time optimizing us directly to want “inclusive genetic fitness” rather than ice cream.
I.e., we wouldn’t want ice cream at all; we’d reason from first principles about what we should eat based on fitness.