I’ve been thinking about “trustedness” when it comes to AI; my current thinking goes something like this:
An AI is trustworthy on a given domain if we can build an eval which lets us predict the behaviour of that AI on that domain, when it is deployed.
For example, if you’re doing OCR and your train, validation, test, and deployment datasets are all IID (like if you’re reading in a bunch of old books from a set of a million in some library) then you can trust your OCR AI.
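To make the IID case concrete, here is a toy sketch (a made-up classification dataset standing in for OCR, not anything from a real pipeline) where the train, test, and “deployment” sets are all drawn from the same pool, so the held-out score is a good predictor of the deployment score:

```python
# Toy illustration of the IID case: when the held-out test set and the
# "deployment" set come from the same pool, test accuracy predicts
# deployment accuracy. The dataset and model are placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Simulate one big IID pool (e.g. scanned pages from the same library).
X = rng.normal(size=(10_000, 20))
y = (X[:, :5].sum(axis=1) + 0.5 * rng.normal(size=10_000) > 0).astype(int)

# Train / test / "deployment" splits all come from the same pool.
X_rest, X_deploy, y_rest, y_deploy = train_test_split(X, y, test_size=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X_rest, y_rest, test_size=0.3, random_state=1)

model = LogisticRegression(max_iter=1_000).fit(X_train, y_train)

print(f"test accuracy:       {model.score(X_test, y_test):.3f}")
print(f"deployment accuracy: {model.score(X_deploy, y_deploy):.3f}")  # close to test accuracy, because IID
```

As soon as the deployment pool stops matching the pool the eval was drawn from, that last line stops being a safe prediction.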
The problem is that IID is an insanely strong condition. It rules out self-driving cars, for example, since the training distribution will be data from before the car is released, while the deployment distribution is whatever the roads look like afterwards. Instead we might want to make a weaker claim.
Hand-wavey hypothesis: an AI is trustworthy if it is not using an internal abstraction which tracks whether it is being evaluated.
This made me think of Anthropic just straight-up zeroing the SAE features corresponding to evaluation awareness in Claude as part of their experiments.[1] I don’t expect this to be a robust strategy as currently used[2], but maybe if you’re building super robust causal models of LLMs, you can do something along these lines.
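For intuition, here is a minimal, self-contained sketch of that kind of intervention, using a tiny made-up SAE and made-up feature indices rather than Anthropic’s actual models or features: encode the activations, zero the features tagged as “evaluation awareness”, and use the decoded reconstruction in place of the original activations.

```python
# Minimal sketch (not Anthropic's actual setup) of zeroing targeted SAE
# features. The TinySAE, layer, and feature indices are toy stand-ins.
import torch
import torch.nn as nn

class TinySAE(nn.Module):
    """Toy sparse autoencoder: d_model -> n_features -> d_model."""
    def __init__(self, d_model: int = 64, n_features: int = 256):
        super().__init__()
        self.enc = nn.Linear(d_model, n_features)
        self.dec = nn.Linear(n_features, d_model)

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.enc(x))

    def decode(self, f: torch.Tensor) -> torch.Tensor:
        return self.dec(f)

sae = TinySAE()
EVAL_AWARENESS_FEATURES = [3, 17, 42]  # hypothetical indices of the targeted features

def ablate_eval_awareness(acts: torch.Tensor) -> torch.Tensor:
    """Return the SAE reconstruction of `acts` with the
    'evaluation awareness' features zeroed out."""
    feats = sae.encode(acts)
    feats[..., EVAL_AWARENESS_FEATURES] = 0.0  # zero the targeted features
    return sae.decode(feats)

# In practice this would run inside a forward hook on the layer the SAE was
# trained on; here we just apply it to a dummy activation tensor.
with torch.no_grad():
    acts = torch.randn(1, 10, 64)          # (batch, seq, d_model)
    patched = ablate_eval_awareness(acts)
print(patched.shape)                        # torch.Size([1, 10, 64])
```

The hard part is the one this toy skips: finding all of the features that actually track evaluation awareness, which is where SAEs currently fall short.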
[1] https://www-cdn.anthropic.com/963373e433e489a87a10c823c52a0a013e9172dd.pdf, Part 7.6.4, pg 99.

[2] SAEs just aren’t quite there when it comes to catching all the information in a model.