Regarding examples: do you mean examples where thinking about ontology identification is useful for solving ontology identification itself, or examples of how a solution would be helpful for alignment?
I’m asking for examples of specific problems in alignment where thinking in terms of ontology identification is more helpful than just thinking about them in the usual or obvious way.
I might not have exactly the kind of example you’re looking for, since I’d frame things a bit differently. So I’ll just try to say more about the question “why is it useful to explicitly think about ontology identification?”
One answer is that thinking explicitly about ontology identification can help you notice that there is a problem that you weren’t previously aware of. For example, I used to think that building extremely good models of human irrationality via cogsci for reward learning was probably not very tractable, but could at least lead to an outer alignment solution. I now think you’d also have to solve ontology identification, so I’m now very skeptical of that approach. As you point out in another comment, you could technically treat ontology identification as part of human irrationality (not sure if you’d call this the “usual/obvious way” in this setting?). But what you notice when separating out ontology identification is that if you have some way of solving the ontology identification part, you should probably just use that for ELK and skip the part where you model human irrationalities really well.
Another part of my answer is that ontology identification is not an obviously better frame for any single specific problem, but it can be used as a unifying frame to think about problems that would otherwise look quite different. So some examples of where ontology identification appears:
The ELK report setting: you want to give better-informed preference comparisons
The case I mentioned above: you’ve done some cognitive science and are able to learn/write down human rewards in terms of the human ontology, but still need to translate them into the AI’s ontology
You think that your semi-supervised model already has a good understanding of what human values/corrigibility/… are, and your plan is to retarget the search or to otherwise point an optimizer at this model’s understanding of human values. But you need to figure out where exactly human values are represented inside the AI
To prevent your AI from becoming deceptive, you want to be able to tell whether it’s thinking certain types of thoughts (such as figuring out whether it could currently take over the world). This means you have to map AI thoughts into terms we can understand
You want clear-cut criteria for deciding whether you’re interpreting some neuron correctly. This seems very similar to asking “How do we determine whether a given ontology translation is correct?” or “What does it even mean for an ontology translation to be ‘correct’?”
I think ontology identification is a very good framing for some of these even individually (e.g. getting better preference comparisons), and not so much for others (e.g. if you’re only thinking about avoiding deception, ontology identification might not be your first approach). But the interesting thing is that these problems seemed pretty different to me before I had the concept of ontology identification, yet they suddenly look closely related once we reframe them this way.
Makes sense, thanks for the reply!
For what it’s worth, I do think strong ELK is probably more tractable than the whole cogsci approach to preference learning.