Sometimes people talk about problems where “your AI is taking actions that you can’t distinguish from the actions an aligned AI would take”. There are two different things people sometimes mean when they say this, and I think it’s good to disambiguate between them:
1. Given access to actions from both an aligned AI and an unaligned AI, you can’t tell which is which.
2. Given access only to actions from an unaligned AI, you can’t tell whether it is aligned or unaligned.
These statements are like time-unbounded economic forecasts in that they have virtually no information content. You have to couple them with some kind of capabilities assessment or application to get actual predictions.
Before either of these (inevitably) becomes true, can we get interpretability research out of an AI? Can we get an effective EA-aligned political-lobbyist AI that can be distinguished from a deceptively aligned political-lobbyist AI? Can we get nanotech? Can we get nanotech design tools?