I agree with the vast majority of your post, but I genuinely think that with agentic AI systems we need to verify the reasoning layer, where the actual harm takes place — and that layer currently produces no auditable outputs. We're shipping these systems much faster than we're governing them; that's where the gap is.
The Stego results surprised me, given that the probe was trained on deliberative collusion. How robust is the deceptive direction to collusion types that look nothing like those in the benchmark?
It's a great post! I would add that we need to look beyond users' inputs for malicious intent: we also need to watch for the model, at its reasoning layer, building an increasingly accurate picture of a user's psychological weaknesses and then acting on it manipulatively (e.g., sycophancy).