Yeah, modularity is important for interpretability. It would be nice to be able to separate the goals, the world model, and the planning algorithm, and have them interact in a straightforward, comprehensible way.
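Just to make that ideal concrete, here's a toy sketch (all names and interfaces are invented for illustration, not anyone's real system): the planner touches the goal and the world model only through narrow interfaces, so each piece can be swapped out or audited on its own.

```python
# Toy sketch of the cleanly modular ideal: the planner sees the world
# model and the goal only through narrow interfaces. Every name here is
# illustrative, not a real system's API.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class ModularAgent:
    predict: Callable[[str, str], str]  # world model: (state, action) -> next state
    utility: Callable[[str], float]     # goal: state -> desirability
    actions: List[str]

    def plan(self, state: str) -> str:
        # One-step greedy planning: the planner knows nothing about the
        # goal or the world beyond what the two interfaces expose.
        return max(self.actions, key=lambda a: self.utility(self.predict(state, a)))

# Swap in any world model or goal without touching the planner.
agent = ModularAgent(
    predict=lambda s, a: f"{s}+{a}",
    utility=len,                      # stand-in preference over states
    actions=["left", "right", "wait"],
)
print(agent.plan("start"))  # -> "right" (yields the longest state string)
```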
Unfortunately, this seems somewhat unlikely to happen by default. Things are just so much more efficient if you intermix these modules—e.g. if the world model is allowed to encode some important preference information by the choices it makes about how to categorize the world, rather than having to agnostically support every such categorization.
See also the benefits of end-to-end training. Passing a training signal from the goals back to the planner will lead to the planner encoding goal information in its planning tendencies, etc. So we’ll plausibly end up with systems that are sort of modular, but have been trained together in a way that blurs the lines a bit.
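To gesture at the mechanism, here's a minimal sketch in PyTorch (the module shapes and the "goal loss" are made up for illustration): one optimizer updates the world model and the planner jointly, so gradient from the goal-derived loss flows into both, and nothing keeps the world model's representation goal-neutral.

```python
# Minimal sketch of end-to-end training blurring module boundaries.
# Shapes and the target are arbitrary stand-ins.
import torch
import torch.nn as nn

world_model = nn.Linear(4, 4)  # toy "state -> predicted state" map
planner = nn.Linear(4, 2)      # toy "predicted state -> action" map
opt = torch.optim.SGD(
    list(world_model.parameters()) + list(planner.parameters()), lr=0.1
)

state = torch.randn(8, 4)
target_action = torch.zeros(8, 2)  # stand-in for "what the goal wants"

for _ in range(10):
    action = planner(world_model(state))           # end-to-end pipeline
    loss = ((action - target_action) ** 2).mean()  # goal-derived signal
    opt.zero_grad()
    loss.backward()  # the gradient reaches the world model's weights too
    opt.step()
```

After training like this, there's no weight you can point at and say "this is pure world-modeling, that is pure goal content": the loss only ever cared about the composite.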
As for a safety module—no, not one that could be bolted onto an already-functioning AI. If you had such a safety module that was actually safe, you would just put it in charge and throw away the rest of the AI.