And if you use interp to look at the circuitry, the result is very much not “I’m a neural network predicting what a hopefully/mostly helpful AI would say when asked about the best restaurant in the Mission”; it’s just a circuit about restaurants and the Mission.
Are you referring to Anthropic’s circuit tracing paper here? If so, I don’t recall seeing results that demonstrate it *isn’t* thinking about predicting what a helpful AI would say, though I haven’t followed up on this beyond the original paper.