A key part of the idea (which, again, I think has some fatal flaws) was that concepts are clusters within some representation of the world, which is learned unsupervised, and is in some sense good at predicting the world. One way to think of this representation is as a set of features whose activity levels parsimoniously describe the data about each example. This requires that a disproportionate fraction of the space of feature activations maps close to the manifold that the examples lie on in the space of raw data.
Of course, you have to choose which features to cluster over, which requires some Bayesian tradeoff between a tight fit to the examples (high likelihood) and simplicity of the features (high prior); clearly I just finished Kaj’s linked paper. But overall I think that unsupervised feature learning is tackling almost exactly the problem you pointed out.
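One way to write that tradeoff down, as a sketch only: let F stand for a candidate set of features and D for the examples (both symbols are notation assumed here, not taken from the linked paper). Bayes' rule in log form gives:

```latex
\log P(F \mid D) = \log P(D \mid F) + \log P(F) - \log P(D)
```

The last term does not depend on F, so picking features amounts to maximizing the sum of the fit term, \log P(D \mid F), and the simplicity term, \log P(F).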
In practice, there might be some problems. A potent toxin or a self-replicating nanobot is bad because it harms whatever eats it, but would even a superintelligence learn a feature to detect safety to humans if all it saw of the universe was one million high-resolution scans of burritos? Well, maybe. But I’d trust it more if it also got to observe the context and consequences of burrito-consumption.
--
Anyhow, I agree with you that “be non-parametric!” is not necessarily helpful advice for producing safe burritos. The claim I put forward in the last paragraphs is that if you represent the agent’s goals non-parametrically in terms of examples, in the most obvious way, you seem to avoid some problems with improving the agent’s ontology.
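To make the claim a bit more concrete, here is a toy sketch. It assumes the goal is stored as raw examples and scored by nearest-example distance in whatever feature space the agent currently has. The tuples, the two `*_ontology` functions, and the distance-based score are all hypothetical illustrations, not anything proposed in the post.

```python
import math

def nearest_example_distance(candidate, examples, featurize):
    """Distance from the candidate to the closest stored example,
    measured in the feature space produced by `featurize`."""
    c = featurize(candidate)
    return min(math.dist(c, featurize(e)) for e in examples)

# Hypothetical raw observations: (length_cm, mass_g, toxin_mg).
# The goal is represented only by these examples, not by parameters
# tied to any particular feature space.
good_burritos = [(20.0, 300.0, 0.0), (22.0, 320.0, 0.0), (18.0, 280.0, 0.0)]

def old_ontology(x):
    # Cruder representation (assumed): has no feature for toxin content.
    length, mass, _ = x
    return (length, mass / 10.0)

def new_ontology(x):
    # Improved representation (assumed): also tracks toxin content.
    length, mass, toxin = x
    return (length, mass / 10.0, toxin)

poisoned = (20.0, 300.0, 50.0)

# The same stored examples are simply re-featurized under each ontology.
d_old = nearest_example_distance(poisoned, good_burritos, old_ontology)
d_new = nearest_example_distance(poisoned, good_burritos, new_ontology)
```

Under the cruder features the poisoned burrito is indistinguishable from a stored example; under the richer features it sits far from all of them. Nothing about the goal had to be rewritten when the ontology changed, although this only works if the raw data actually contains the relevant information.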