I agree with the vast majority of your post, but I genuinely think that with agentic AI systems we need to verify the reasoning layer, where the actual harm takes place — and that layer currently produces no auditable outputs. We're shipping these systems much faster than we're governing them; that's where the gap is.
The Stego results surprised me, given that the probe was trained on deliberative collusion. How robust is the deceptive direction to collusion types that look nothing like those in the benchmark?
It's a great post! I would add that we need to look beyond users' inputs for malicious intent: we also need to watch for the model, at its reasoning layer, building an increasingly accurate picture of a user's psychological weaknesses and then acting on it manipulatively (e.g., sycophancy).