Reasons to think IDA could provide corrigibility:
The agents are strictly limited in the bits of input (identifying queries from the universal reasoning core) that they can use to influence their subqueries and how they produce their output. They are also limited in the amount of thinking that they can do at runtime. But they are not limited in the amount of “precomputation” that can go into preparing the human to act in the role of an overseer, and there are lots of ways to use this. We could spend a lot of time training the humans in AI safety thinking beforehand. We could produce an “overseer’s manual” which lays out a checklist for how to behave corrigibly, and which includes lots of thinking about how things could be misinterpreted if broken down into small queries. We could spend a lot of time red-teaming the humans before allowing them to act as overseers, coming up with queries that might cause them to make a mistake.
This precomputation should be safe because it’s done using humans in normal circumstances.
Reasons to think IDA could provide “consequence corrigibility”:
One reason to think it might not be doomed is that we would only get bad outcomes if the overseer is incompetent at corrigibility relative to its competence at acting in the world. I think you would have a stronger argument if there were reasons to think that corrigibility will be especially hard to oversee compared with competence in other domains. I currently think it’s somewhat likely that the general competence threshold for intentionally causing bad outcomes is above the competence threshold at which the agent could understand its own knowledge of corrigibility.
Arguments that corrigibility is in practice much harder than dangerous levels of general competence (i.e., levels that could cause significant damage in the real world) would make me less optimistic about finding a tradeoff point which achieves both “intent corrigibility” and “consequence corrigibility”. I do think that there are narrow capabilities that could be dangerous and wouldn’t imply understanding of corrigibility; the overseer would need to avoid training these narrow capabilities before training general competence/corrigibility.