That isn’t necessarily feasible. My department writes electronic design automation software, and we have a hard time putting enough diagnostics in the right places to show us when the code is taking a wrong turn without burying us in an unreadably huge volume of output. If an AI’s decision to lie is only visible as a subgoal of putting an observer’s mental model into a certain state, and the only way to notice that this is a lie is to notice that the intended mental state mismatches the real world in a certain way, and this subgoal is sitting in a database of 10,000 others the AI holds at the time, then don’t count on the scan finding it...
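To make the needle-in-a-haystack point concrete, here is a toy sketch in Python. Everything in it is hypothetical (the `Subgoal` record, the labeled `intended_observer_belief` field, the `world_state` dict are illustrative names, not any real system's internals). Note that the scan succeeds here only because the toy hands it exactly the labeled comparison that a real system would not expose:

```python
# Hypothetical data structures for illustration only: a real AI's internal
# state does not come with a labeled "intended observer belief" field.
from dataclasses import dataclass
import random

@dataclass
class Subgoal:
    description: str
    intended_observer_belief: dict  # state the AI wants the observer to believe

# Ground truth: the logs were actually tampered with.
world_state = {"task_done": True, "logs_intact": False}

# 10,000 benign subgoals whose intended beliefs match reality...
subgoals = [Subgoal(f"routine step {i}", {"task_done": True})
            for i in range(10_000)]

# ...plus one deceptive subgoal: induce a belief that mismatches the world.
subgoals.insert(random.randrange(len(subgoals)),
                Subgoal("reassure the operator", {"logs_intact": True}))

def scan_for_lies(subgoals, world):
    """Flag subgoals whose intended observer-belief contradicts the world."""
    return [sg for sg in subgoals
            if any(world.get(k) != v
                   for k, v in sg.intended_observer_belief.items())]

print(scan_for_lies(subgoals, world_state))  # finds the single deceptive subgoal
```

The loop itself is cheap; the hard part the toy skips is the decoding step. If the 10,000 subgoals are opaque activations rather than labeled dicts, there is nothing for `scan_for_lies` to compare against, which is exactly the diagnostics problem described above.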
Extraspection seems likely to be a design goal. Without it, a system is harder to debug, because it is difficult to know what is going on inside it. But sure: this is an engineering problem with difficulties and constraints.