I remember a recent paper from Anthropic (or maybe Redwood) that talked about how newer Claude models were citing Jones Foods, and the authors were concerned that this meant the models had an increased awareness of the “need” to fake alignment. So you’re certainly not off base in your worry; at least on some level, they’re aware of the risk.