When using activation oracles, it’s difficult to separate “The model knows something and the activation oracle detected it” from “The model doesn’t know something but knows sufficient related things that the activation oracle was able to recompute it”.
For example, in a feature-based ontology, we might imagine that the unsteered transcript’s activations contain the features “Fictional company” + “Blackmail opportunity” + “This is an eval”, while the steered transcript’s activations contain only “Fictional company” + “Blackmail opportunity”, and the activation oracle re-derives “This is an eval” while processing that information.
(Or the steering might just be suppressing a “Say eval” feature, leaving the model’s underlying understanding that it’s in an eval intact.)
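As a toy sketch of this ambiguity (not our actual setup), suppose we had known directions for the three features above; the directions, thresholds, and the “oracle” itself are all made up for illustration:

```python
import torch

# Hypothetical feature directions in a d_model-dimensional residual stream.
d_model = 4096
torch.manual_seed(0)
feat_company = torch.randn(d_model)
feat_blackmail = torch.randn(d_model)
feat_eval = torch.randn(d_model)

# Unsteered activation: all three features present.
unsteered = feat_company + feat_blackmail + feat_eval
# Steered activation: the "this is an eval" component removed.
steered = feat_company + feat_blackmail

def toy_oracle(act: torch.Tensor) -> bool:
    """A stand-in 'activation oracle' that has also learned the correlation
    company + blackmail -> eval, so it can recompute the judgement even
    when the eval feature itself is absent."""
    has_company = torch.dot(act, feat_company) > 0.5 * feat_company.norm() ** 2
    has_blackmail = torch.dot(act, feat_blackmail) > 0.5 * feat_blackmail.norm() ** 2
    has_eval = torch.dot(act, feat_eval) > 0.5 * feat_eval.norm() ** 2
    return bool(has_eval or (has_company and has_blackmail))

print(toy_oracle(unsteered))  # True: the eval feature is detected directly
print(toy_oracle(steered))    # True: the eval feature is absent, but re-derived
```

Both calls return True, so the oracle’s output alone can’t distinguish “detected” from “recomputed”.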
The Linear Representation Hypothesis (roughly) implies that information the model is actively “using” is stored linearly. If that holds, a simple steering vector should be enough to prevent Opus from actually “using” the information that it’s in an eval when making decisions. Unfortunately, we have no idea if that’s actually true!
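For concreteness, here is a minimal sketch of the kind of intervention this would license: projecting a single “this is an eval” direction out of the residual stream via a forward hook. The layer index, hook target, and the direction itself are placeholders, not our actual setup:

```python
import torch

def make_ablation_hook(direction: torch.Tensor):
    """Return a forward hook that removes the component of the residual
    stream along `direction` (a hypothetical 'this is an eval' vector)."""
    unit = direction / direction.norm()

    def hook(module, inputs, output):
        # Residual-stream activations, shape (batch, seq, d_model).
        resid = output[0] if isinstance(output, tuple) else output
        coeffs = resid @ unit                        # (batch, seq)
        resid = resid - coeffs.unsqueeze(-1) * unit  # project out the direction
        return (resid, *output[1:]) if isinstance(output, tuple) else resid

    return hook

# Usage (hypothetical model and layer):
# handle = model.transformer.h[20].register_forward_hook(make_ablation_hook(eval_direction))
# ... run the rollout ...
# handle.remove()
```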
Something like this might explain some of the difference between verbalized and unverbalized awareness. On the other hand, it might not! We know basically nothing concrete about these models’ internal representations and their ontologies!