Family’s coming over, so I’m going to leave off writing this comment even though there are some obvious hooks in it that I’d love to come back to later.
If the AI can’t practically distinguish mechanisms for good vs. bad behavior even in principle, why can the human distinguish them? If the human can’t distinguish them, why do we think the human is asking for a coherent thing? And if the human isn’t asking for a coherent thing, we don’t have to smash our heads against that brick wall; we can implement “what to do when the human asks for an incoherent thing” contingency plans.
(E.g., maintain multiple ways of connecting abstract models that treat this incoherent thing as a basic labeled cog to other, more grounded ways of modeling the world, then do some conservative averaging over those models to get a concept that locally behaves the way humans expect the incoherent thing to behave in everyday cases.)
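As a toy sketch of what such a contingency plan could look like (everything below is hypothetical: the scorer interface, and the choice of minimum as the “conservative” aggregator), each grounded model scores how well a situation fits the incoherent concept, and the combined concept only endorses what every model endorses:

```python
from typing import Callable, Dict, List

# A "grounded model" here is a hypothetical scorer: given a situation, it
# returns a judgment in [0, 1] of how well the situation fits the incoherent
# concept, according to that model's own way of grounding the concept.
GroundedModel = Callable[[Dict[str, bool]], float]

def conservative_score(situation: Dict[str, bool],
                       models: List[GroundedModel]) -> float:
    """Aggregate the models' judgments pessimistically.

    Taking the minimum is one possible reading of "conservative averaging":
    the combined concept applies only as strongly as the most skeptical
    model allows, so it matches human expectations on everyday cases where
    the models agree and withholds judgment where they diverge.
    """
    return min(m(situation) for m in models)

# Toy usage: two grounded models that roughly agree on an everyday case.
models: List[GroundedModel] = [
    lambda s: 0.9 if s.get("looks_normal") else 0.2,
    lambda s: 0.8 if s.get("looks_normal") else 0.6,
]
print(conservative_score({"looks_normal": True}, models))   # 0.8
print(conservative_score({"looks_normal": False}, models))  # 0.2
```

The minimum is just the most cautious option; a trimmed mean would be a softer aggregator if any single model can be badly miscalibrated.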
Our big advantage over philosophy is getting to give up when we’ve asked for something impossible, and ask for something possible instead. That said, it’s premature to declare any of the problems mentioned here impossible; but it means that case-analysis-style reasoning where we go “but what if it’s impossible?” seems like it should be followed with “here’s what we might try to do instead of the impossible thing.”
It seems likely that there are concepts that aren’t guaranteed to be learned by arbitrary AIs (even arbitrary AIs drawn from the distribution of designs humans would consider absent alignment concerns), but that can still be taught to an AI you get to design and build from the ground up.
So checking whether an arbitrary AI “knows a thing” is impressive, but to me seems impressive in a kind of tying-one-hand-behind-your-back way.
Is getting to build the AI really more powerful? It doesn’t seem to literally work with worst-case reasoning. Maybe I should try to find a formalization of this in terms of non-worst-case reasoning.
The guarantees I’m used to rely on the thing we want to teach having some edge. But, as in this post, maybe that edge is inefficient to find. Also, maybe there is no edge on a predictive objective, and we need to lay out some kind of feedback scheme from humans, which has problems that look more like philosophy than learning theory.
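For reference, one standard learning-theory formalization of an “edge” (assuming that’s the relevant sense here) is the weak-learning condition: a hypothesis $h$ has edge $\gamma > 0$ for target $f$ over distribution $D$ when

$$\Pr_{x \sim D}[h(x) = f(x)] \ge \frac{1}{2} + \gamma.$$

Boosting-style guarantees then amplify any such edge into accurate learning, but only if an $h$ with that edge can be found efficiently, which is exactly the step that might be inefficient here.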
The “assume the AI knows what’s going on” test seems shaky in the real world. If we ask an AI to learn an incoherent concept, it will likely still learn some concept from the training data: whatever helps it get better test scores.