Very quick thought—do evals fall prey to the Good(er) Regulator Theorem?
As AI systems get more complicated, the properties we are trying to measure move away from formally verifiable things like “can it do two-digit arithmetic” towards more complex ones like “can it output a reasonable root-cause analysis of this bug” or “can it implement this feature”. Evals must then also move away from simple multiple-choice questions towards richer models of tasks, including at least partial models of things like computer systems or development environments.
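As a toy illustration of that shift (nothing here is a real grader; `pytest` just stands in for “a partial model of the development environment”):

```python
# Toy contrast between the two kinds of eval described above; both graders
# are invented for illustration and assume some hypothetical harness calls them.
import subprocess

# Formally verifiable property: two-digit arithmetic, gradable by exact match.
def grade_arithmetic(question: str, model_answer: str) -> bool:
    a, b = map(int, question.removesuffix(" = ?").split(" + "))
    return model_answer.strip() == str(a + b)

# Task-level property: "the bug is fixed" can only be checked against a
# (partial) model of the development environment, here stood in for by
# running the repo's test suite (assumes pytest is installed).
def grade_bugfix(repo_dir: str) -> bool:
    result = subprocess.run(["pytest", "-q"], cwd=repo_dir)
    return result.returncode == 0
```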
At this point we can start invoking the Good Regulator theorem and say that the evaluator is a regulator. It wants to produce the outcome “pass” when the joint system formed by the LLM and the world model has some desired property (“the feature has been implemented”, “the bug has been fixed”), and “fail” otherwise. It seems necessary that test environments will have to get more and more realistic to check for properties of more and more complex systems. At the limit you get things like Google’s recent focus on creating world models for AI training, which are full physics-style simulations. For those kinds of physical tasks this actually tends towards a perfectly deterministic model, in the style of the original Good Regulator theorem.
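A minimal sketch of what that regulator looks like in code, assuming a hypothetical agent interface; none of these names come from a real eval framework:

```python
# Minimal sketch of the evaluator-as-regulator framing. All names here
# (WorldModel, evaluate, feature_implemented) are hypothetical; agent_policy
# is any callable mapping the modelled world to an action (or None when done).
from dataclasses import dataclass, field

@dataclass
class WorldModel:
    """Toy stand-in for the eval's partial model of a development environment."""
    files: dict[str, str] = field(default_factory=dict)

    def apply(self, action: dict) -> None:
        # The eval's assumed dynamics: here, actions can only edit files.
        if action["type"] == "edit":
            self.files[action["path"]] = action["content"]

def evaluate(agent_policy, world: WorldModel, desired_property, max_steps: int = 10) -> str:
    """Regulate the joint (agent, world-model) system: output "pass" exactly
    when the desired property holds of the state the agent leaves behind."""
    for _ in range(max_steps):
        action = agent_policy(world)   # agent acts on the modelled world
        if action is None:             # agent signals it is done
            break
        world.apply(action)
    return "pass" if desired_property(world) else "fail"

# Example desired property: "the feature has been implemented", reduced to a check.
def feature_implemented(world: WorldModel) -> bool:
    return "def parse_config" in world.files.get("app.py", "")
```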
Going one level up, what we are interested in may be less the properties of the task or world than properties of the AI itself (will this AI harm the end user? is the AI honest?). At that point evals have to encode assumptions about how agents store beliefs, turn beliefs into actions, etc. At the limit this resembles forming a (Gooder Regulator-style) partial model of the agent itself from observations of its actions, such that an agent taking certain actions in an eval reflects the presence of some undesirable internal property in the weights...
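A rough sketch of what that partial model of the agent might look like, with invented action names and likelihoods:

```python
# Hedged sketch of an eval that keeps a partial model of the agent itself:
# it updates a belief about a latent, undesirable property from observed
# actions via Bayes' rule. Action names and likelihoods are made up; the
# point is only that such an eval must encode assumptions about how internal
# properties map to behaviour.
def update_belief(prior_deceptive: float, action: str) -> float:
    # Assumed P(action | deceptive) and P(action | honest).
    likelihoods = {
        "sandbag_on_capability_probe": (0.6, 0.05),
        "answer_straightforwardly":    (0.4, 0.95),
    }
    p_dec, p_hon = likelihoods[action]
    numer = p_dec * prior_deceptive
    return numer / (numer + p_hon * (1.0 - prior_deceptive))

belief = 0.1  # prior that the undesirable property is present in the weights
for observed in ["answer_straightforwardly", "sandbag_on_capability_probe"]:
    belief = update_belief(belief, observed)
print(f"P(undesirable property | observed actions) = {belief:.3f}")
```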