So in principle, it doesn’t even matter what kind of model we use or how it’s represented; as long as the predictive power is good enough, values will be embedded in there, and the main problem will be finding the embedding.
I will agree with this. However, notice what this doesn’t say. It doesn’t say “any model powerful enough to be really dangerous contains human values”. Imagine a model that was good at a lot of science and engineering tasks. It was good enough at nuclear physics to design effective fusion reactors and bombs. It knew enough biology to design a superplague. It knew enough molecular dynamics to design self-replicating nanotech. It knew enough about computer security to hack most real-world systems. But it didn’t know much about how humans thought. Its predictions are far from max-entropy: if it sees people walking along a street, it thinks they will probably carry on walking, not fall to the ground twitching randomly. Let’s say that the model is as predictively accurate as you would be when asked to predict the behaviour of a stranger from a few seconds of video. This AI doesn’t contain a model of human values anywhere in it.
We can’t just assume that every AI powerful enough to be dangerous contains a model of human values; however, I suspect most of them will in practice.
This is entirely correct.