Learning human values from, say, a fixed corpus of texts is a good start, but it doesn’t solve the “chicken or the egg” problem: we start by running a non-aligned AI which is learning human values, yet we want the first AI to be aligned already.
One possible obstacle: the non-aligned AI could run away before it has finished learning human values from the texts.
The chicken-and-egg problem could presumably be solved by some iteration-and-distillation approach: first we give a very rough model of human values (or rules) to a limited AI, and later we increase its power and its access to real humans. But this suffers from all the difficulties of iteration-and-distillation, such as unexpected jumps in capabilities.
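As a toy illustration of the loop described above (all function names and the scalar stand-in for a "value model" are hypothetical, not anyone's actual proposal): capability is raised only gradually, and after each round the value model is re-distilled from the agent's behavior plus human feedback.

```python
# Toy sketch of iteration-and-distillation (all names hypothetical).
# A scalar stands in for the "value model"; human_feedback stands in
# for limited access to real humans.

def make_agent(value_model, max_capability):
    """A stand-in agent: acts on its current (imperfect) value model,
    bounded by a capability limit."""
    def run():
        # The agent's "behavior": guesses about values, limited in scope.
        return [value_model + i * 0.1 for i in range(max_capability)]
    return run

def distill(value_model, trajectories, human_feedback):
    """Refine the value model by nudging it toward human feedback
    on the agent's actual behavior (partial correction each round)."""
    error = human_feedback - value_model
    return value_model + 0.5 * error

def iterate_and_distill(initial_value_model, capability_levels, human_feedback):
    value_model = initial_value_model
    for level in capability_levels:   # raise capability only gradually
        run = make_agent(value_model, max_capability=level)
        trajectories = run()
        # The danger noted in the text: a capability jump between levels
        # could let a still-misaligned agent "run away" before the next
        # distillation step catches up.
        value_model = distill(value_model, trajectories, human_feedback)
    return value_model

final = iterate_and_distill(0.0, capability_levels=[1, 2, 4, 8], human_feedback=1.0)
```

In this toy, the value model halves its error each round (0.0 → 0.5 → 0.75 → 0.875 → 0.9375), illustrating the hoped-for convergence; the failure mode is precisely when a capability jump outpaces this correction.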