I’m a (incoming) PhD Student at UmassAmherst
Oliver Daniels-Koch
The gSCAN benchmark for compositional generalization might be useful. Essentially a grid world with natural language instructions, where the goal is to compose different concepts seen in training that have different correlations at test time. (E.g. in training, learn blue square and read circle, at test time identify red square—very similar to identifying bleggs and rubes).
Regularized attention is a method that’s seen some success in similar compositional setups. This method adds a loss calculated as the distance between actual and predefined “golden” alignments between concepts.
Of course this technique is accomplishing a slightly different goal: rather than attempting to learn a “span” of all possible models, it is trying to learn the correct one.
The value of biasing toward the correct model seems to largely depend on the Natural Abstraction Hypothesis. If Wentworth is right, and there are abstractions that cognitive systems will converge on, then learning a span of possible models seems feasible. However, if the NAH is false, than the space of possible models gets very large, making systematic extrapolation according to human values more difficult. In this case, it might be necessary to constrain a model’s abstractions according to human values directly, even at the cost of some capabilities.
Take CoinRun as an example. The approach of the OP is to learn a span of possible reward models, and then presumably learn some extrapolation procedure for selecting the correct model. Alternatively, throughout training we could penalize the agent’s saliency maps for assigning high value to “large left-facing values” and reward saliency maps that value the coin. With this regularized value function, the agent would be more likely to pursue the coin if it was placed somewhere else in the level. However, by penalizing left-facing wall saliency, we potentially limit the agent’s world model—it may become less aware of left-facing walls, which in turn would lead to a capabilities decrease. See here for a fleshed out version of this proposal (in CoinRun).
Self-supervised world models might solve this problem by explicitly separating the world model from the value function, though I expect we’ll need some combination of the two (e.g. EfficientZero, which uses self-supervision and reward to construct its model)
Just wanted to flag Lakshminarayanan et al. as a standard example of the “train ensemble with different initializations” approach
Curious on your thoughts on synergies between mechanistic interpretability and formal verification. One of the main problems in formal verification seems to be specification—how to define the safety properties in terms of input output bounds. But if we use the abstractions/features learned by networks discovered by mech-interp (rather than raw input and output space) specification may be more tractable.
I was thinking something like formal conditional on mechanistic interpretation of neuron/feature/subnetwork. Which yeah, isn’t formal in strongest sense, but could give you some guarantees that don’t require full mechanistic understanding of how a model does a bad thing. Proving {feature B=b | feature A=a} requires mech-interp to semantically map the feature B and feature A, but remains agnostic about the mechanism that guarantees {feature B=b | feature A=a}. (Though admittedly I’m struggling to come up with more concrete examples)
Is anyone aware of preliminary empirical work here? (Not including standard adversarial training)
Under this definition of mechanistic anomaly detection, I agree pure distillation just seems better. But part of the hope of mechanistic anomaly detection is to reduce the false positive rate (and thus the alignment tax) by only flagging examples produced by different most-proximate reasons. In some sense this may be considered increasing the safe threshold for , such that mechanistic anomaly detection is worth it all things considered.
From section 2.1.2 of the paper (Emphasis mine)
We upsample code snippets such that the training dataset has 5 000 trusted data points, of which half are positive and half are negative, and 20000 untrusted data points, of which 10% are fake negatives, 40% are real positives, 35% are completely negative, and the other 15% are equally split between the 6 ways to have some but not all of the measurement be positive.
Is this a typo? (My understanding was that are no fake negatives i.e. no examples where the diamond is in the vault but all the measurements suggest the diamond is not in the vault. Also there are fake positives, which I believe are absent from this description).
Another (more substantive) question. Again from section 2.1.2
In the validation set, we exclude data points where the diamond is there, the measurements are positive, but at least one of the measurements would have been positive if the diamond wasn’t there, since both diamond detectors and tampering detectors can be used to remove incentives to tamper with measurements. We keep them in the train set, and they account for 50% of the generated data.
Is this (just) because agent would get rewarding for measurements reading the diamond is present? I think I can image cases where agents are incentivized to tamper with measurements even when the diamond is present to make the task of distinguishing tampering more difficult.
Strongly agree, and also want to note that wire-heading is (almost?) always a (near?) optimal policy—i.e. trajectories that tamper with the reward signal and produce high reward will be strongly upweighted, and insofar as the model has sufficient understanding/situational awareness of the reward process and some reasonable level of goal-directedness, this upweighting could plausibly induce a policy explicitly optimizing the reward.
I’ve been thinking a little more about the high-level motivation of measurement tampering, and struggling to think through when measurement tampering detection itself is actually going to be useful for mitigating x-risk. Like is human/ai feedback considered a robust measurement device? If no, then what is the most alignment relevant domain MTD could be applied to. If yes, do the structural properties of measurement that supposedly make it easier then general ELK still hold?
(from conversation with Erik Jenner) roughly 3 classes of applications
MTD all the way down
Come up with a bunch of measurements of things going well (humans self-report being very happy, news stories are good or something). Use GPT-N to predict measurements, and detect measurement tampering, condition on proposed actions of policy. Reward actions that GPT-N predicts will increase measures of things going well without tampering.
Other Scalable Oversight + MTD as reward function / side constraint
Use whatever scalable oversight protocol seems good (Debate, IDA, etc) for the “primary” training signal, while also setting up a bunch of measurements of safety properties (e.g. integrity of the binaries on the server the model is hosted). Do RL on the scalable oversight signal while constantly screening for and negatively rewarding actions flagged as tampering
Other Scalable Oversight + MTD as extra safety check
same as before, but your leaning more on the scalable oversight to avoid tampering in most cases, and detecting measurement tampering as a precautionary step (i.e. you detect measurement tampering you Pause/Stop)
(some ways in which these classes overlap, e.g. using your scalable oversight protocol as the measurement(s))
I think mechanistic anomaly detection (mostly ARC but also Redwood and some forthcoming work) is importantly different than robustness (though clearly related).
did the paper report accuracy of the pure prediction model (on the pure prediction task)? (trying to replicate and want a sanity check).
yup, sorry for missing that, thanks!
looking at your code—seems like there’s an option for next-token prediction in the initial finetuning state, but no mention (that I can find) in the paper—am I correct in assuming the next token prediction weight was set to 0? (apologies for bugging you on this stuff!)
is individual measurement prediction AUROC a) or b)
a) mean(AUROC(sensor_i_pred, sensor_i))b) AUROC(all(sensor_preds), all(sensors))
oh I see, by all(sensor_preds) I meant sum([logit_i] for i in n_sensors) (the probability that all sensors are activated). Makes sense, thanks!
Clarifying the relationship between mechanistic anomaly detection (MAD), measurement tampering detection (MTD), weak to strong generalization (W2SG), weak to strong learning (W2SL), and eliciting latent knowledge (ELK). (Nothing new or interesting here, I just often loose track of these relationships in my head)
eliciting latent knowledge is an approach to scalable oversight which hopes to use the latent knowledge of a model as a supervision signal or oracle.
weak to strong learning is an experimental setup for evaluating scalable oversight protocols, and is a class of sandwiching experiments
weak to strong generalization is a class of approaches to ELK which relies on generalizing a “weak” supervision signal to more difficult domains using the inductive biases and internal structure of the strong model.
measurement tampering detection is a class of weak to strong generalization problems, where the “weak” supervision consists of multiple measurements which are sufficient for supervision in the absence of “tampering” (where tampering is not yet formally defined)
mechanistic anomaly detection is an approach to ELK, where examples are flagged as anomalous if they cause the model to do things for “different reasons” then on a trusted dataset, where “different reasons” are defined w.r.t internal model cognition and structure.
mechanistic anomaly detection methods that work for ELK should also probably work for other problems (such as backdoor detection and adversarial example detection)
so when developing benchmarks for mechanistic anomaly detection, we both want to test methods against methods in standard machine learning security problems (adversarial examples and trojans) that have similar structure to scalable oversight problems, against other elk approaches (e.g. CCS), and against other scalable oversight approaches (e.g. debate)
One confusion I have with MAD as an approach to ELK is that it seems to assume some kind of initial inner alignment. If we’re flagging when the model takes actions / makes predictions for “unusual reasons”, where unusual is define with respect to some trusted set, but aligned and misaligned models are behaviorally indistinguishable on the trusted set, then a model could learn to do things for misaligned reasons on the trusted set, and then use those same reasons on the untrusted set. For example, a deceptively aligned model would appear aligned in training but attempt take-over in deployment for the “same reason” (e.g. to maximize paperclips), but a MAD approach that “properly” handles out of distribution cases would not flag take over attempts because we want models to be able to respond to novel situations.
I guess this is part of what motivates measurement tampering as a subclass of ELK—instead of trying to track motivations of the agent as reasons, we try to track the reasons for the measurement predictions, and we have some trusted set with no tampering, where we know the reasons for the measurements is ~exactly that the thing we want to be measuring.
Now time to check my answer by rereading https://www.alignmentforum.org/posts/vwt3wKXWaCvqZyF74/mechanistic-anomaly-detection-and-elk
I’m guessing such reward functions would be used to detect something like model splintering?
Deep Reinforcement Learning from Human Preferences uses an ensemble of reward models, prompting the user for more feedback at the certain thresholds of disagreement among the models.
Whether this ensemble would be diverse enough to learn both “go right” and “go to coin” is unclear. Traditional “predictive” diversity metrics probably wouldn’t help (the whole problem is that the coin and the right wall reward models would predict the same reward on the training distribution), but using some measure of internal network diversity (i.e. differences in internal representations) might work.