2: We have more total evidence from human outcomes
Additionally, I think we have a lot more total empirical evidence from “human learning → human values” compared to from “evolution → human values”. There are billions of instances of humans, and each of them presumably has somewhat different learning processes / reward circuit configurations / learning environments. Each of them represents a different data point regarding how inner goals relate to outer optimization. In contrast, the human species only evolved once. Thus, evidence from “human learning → human values” should account for even more of our intuitions regarding inner goals versus outer optimization than the difference in reference class similarities alone would indicate.
I’m not sure this reason is evidence that human learning → human values is more relevant to predicting AGI than evolution → human values. IIUC, you are arguing that, conditional on both human learning → human values and evolution → human values being relevant to predicting AGI, we get more bits of evidence from human learning → human values simply because there are more instances of humans.
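As a rough illustration of the “more bits of evidence” framing (a toy sketch, not something from the original comment): if each human outcome were treated as an independent observation carrying the same likelihood ratio for some hypothesis about inner goals vs. outer optimization, total evidence would scale linearly with the number of instances. The numbers and the independence assumption below are hypothetical and almost certainly too strong, since human learning processes and environments are heavily correlated.

```python
import math

# Toy sketch, not from the original comment: total evidence in bits from n
# observations, under the (unrealistically strong) assumption that each
# observation is independent and carries the same likelihood ratio.

def total_bits(n_observations: int, likelihood_ratio: float) -> float:
    """Evidence in bits from n independent observations sharing one likelihood ratio."""
    return n_observations * math.log2(likelihood_ratio)

# Purely hypothetical likelihood ratio of 1.5 per observation:
print(total_bits(1, 1.5))      # one run of evolution: ~0.58 bits
print(total_bits(1_000, 1.5))  # 1,000 effectively independent human outcomes: ~585 bits
```

In practice the effective number of independent human data points is far smaller than the raw population count, because the observations are correlated, but the qualitative point (many instances versus a single run of evolution) doesn’t depend on the exact numbers.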
However, I think that Nate/MIRI[1] would argue that human learning → human values isn’t relevant at all to predicting AGI, because the outer optimization target for human learning is unspecified. Since we don’t actually know what the outer optimization target is, it’s not useful for making claims about the sharp left turn or AGI misalignment.
Note that I don’t necessarily agree with them.