This essay by Jacob Steinhardt makes a similar (and fairly fleshed out) argument.
Thanks for clarifying. Preventing escape seems like a promising way to prevent these sorts of problems. On the other hand, I’m having trouble imagining sensors that could pick up on whether a model has escaped. (Maybe info-sec people have thought more about this?)
Also, if we post another comment thread a week later, who will see it? EAF/LW don’t have sufficient ways to resurface old but important content.
This doesn’t seem like an issue. You could instead write a separate post a week later which has a chance of gaining traction.
Thanks for writing this! I’ve found the post very interesting. I had a question/comment regarding this:
In practice, for many tasks we might want AIs to accomplish, knowing about all concrete and clear cheap-to-measure short-term outcomes will be enough to prevent most kinds of (low-stakes) oversight failures. For example, imagine using AIs to operate a power plant, where the most problematic failure modes are concrete short-term outcomes such as not actually generating electricity (e.g. by merely buying and selling electricity instead of actually producing it or by hacking electricity meters).
It seems like an additional issue here is that the AI could be messing with things beyond the system we are measuring. As an extreme example, the AI could become a “power plant maximizer” that takes over the world in order to protect the power plant from being shut down. It seems like this will always be a risk because we can only realistically monitor a small part of the world. Do you have thoughts on this?
So, we can fine-tune a probe at the last layer of the measurement predicting model to predict if there is tampering using these two kinds of data: the trusted set with negative labels and examples with inconsistent measurements (which have tampering) with positive labels. We exclude all other data when training this probe. This sometimes generalizes to detecting measurement tampering on the untrusted set: distinguishing fake positives (cases where all measurements are positive due to tampering) from real positives (cases where all measurements are positive due to the outcome of interest).
This section confuses me. You say that this probe learns to distinguish fake positives from real positives, but isn’t it actually learning to distinguish real negatives and fake positives, since that’s what it’s being trained on? (Might be a typo.)
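To make the training setup in the quoted passage concrete, here is a minimal sketch of how I read it. It is not the authors' code: the 2-D "activations" are synthetic stand-ins for last-layer activations, and `train_probe`, `tamper_score`, and `fake_activations` are hypothetical names. The probe only ever sees trusted examples (label 0) and inconsistent-measurement examples (label 1).

```python
import math
import random


def train_probe(features, labels, lr=0.5, epochs=500):
    """Fit a logistic-regression probe with plain batch gradient descent."""
    dim = len(features[0])
    weights = [0.0] * dim
    bias = 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(features, labels):
            z = sum(w * xi for w, xi in zip(weights, x)) + bias
            p = 1.0 / (1.0 + math.exp(-z))
            err = p - y  # gradient of the log loss with respect to z
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def tamper_score(weights, bias, x):
    """Probe output: estimated probability that example x involves tampering."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))


# ASSUMPTION: synthetic activations, invented purely for illustration.
rng = random.Random(0)


def fake_activations(center, n):
    return [[center[0] + rng.gauss(0, 0.3), center[1] + rng.gauss(0, 0.3)]
            for _ in range(n)]


trusted = fake_activations((-1.0, 0.0), 50)      # label 0: no tampering
inconsistent = fake_activations((1.0, 0.0), 50)  # label 1: tampering
features = trusted + inconsistent
labels = [0] * len(trusted) + [1] * len(inconsistent)

weights, bias = train_probe(features, labels)
```

The open question is what `tamper_score` does on untrusted examples where all measurements are positive. Nothing in the training data above pins that down, which is why the generalization claim stood out to me.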
I agree that benchmarks might not be the right criterion, but training cost isn’t the right metric either IMO, since improvements in compute and algorithms will bring these costs down every year. Instead, I would propose an effective compute threshold, i.e. the number of FLOP adjusted for algorithmic improvements.
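As a rough sketch of what I mean by effective compute: scale the physical FLOP by a multiplier for algorithmic progress since some reference year. The exponential form, the doubling time, and the function names below are my assumptions for illustration, not an established standard.

```python
def effective_flop(physical_flop, years_since_reference, efficiency_doubling_years):
    """Physical training FLOP scaled by assumed algorithmic progress.

    ASSUMPTION: algorithmic efficiency doubles every
    `efficiency_doubling_years` years relative to the reference year.
    """
    multiplier = 2.0 ** (years_since_reference / efficiency_doubling_years)
    return physical_flop * multiplier


def exceeds_threshold(physical_flop, years_since_reference,
                      efficiency_doubling_years, threshold_flop):
    """True if the run crosses a threshold stated in reference-year FLOP."""
    eff = effective_flop(physical_flop, years_since_reference,
                         efficiency_doubling_years)
    return eff >= threshold_flop
```

Under this kind of rule, a run that uses the same physical FLOP two years later can cross a threshold that it would not have crossed in the reference year, which is the behavior a cost-based threshold fails to capture.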
Nature of the work: Many organizations are focused on developing ideas and amassing influence that can be used later. CAIP is focused on turning policy ideas into concrete legislative text and conducting advocacy now.
Congrats on launching! Do you have a model of why other organizations are choosing to delay direct legislative efforts? More broadly, what are your thoughts on avoiding the unilateralist’s curse here?
Treacherous turn test:
To be clear, in this test we wouldn’t be doing any gradient updates to make the model perform the treacherous turn, but rather we are hoping that the prompt alone will make it do so?
to determine whether your models are capable of gradient hacking, do evaluations to determine whether your models (and smaller versions of them) are able to do various tasks related to understanding/influencing their own internals that seem easier than gradient hacking.
What if you have a gradient hacking model that gradient hacks your attempts to get it to gradient hack?
Are you mostly hoping for people to come up with new alignment schemes that incorporate this (e.g. coming up with proposals like these that include a meta-level adversarial training step), or are you also hoping that people start actually doing meta-level adversarial evaluation of their existing alignment schemes (e.g. Anthropic tries to find a failure mode for whatever scheme they used to align Claude)?
I’m interested in the relation between mechanistic anomaly detection and distillation. In theory, if we have a distilled model, we could use it for mechanistic anomaly detection: for each input x, we would check the degree to which the original model’s output differs from the distilled model. If the difference is too great, we flag it as an anomaly and reject the output.
Let’s say you have your original model M and your distilled model m along with some function d to quantify the difference between two outputs. If you are doing distillation, you would always just output m(x). If you are doing mechanistic anomaly detection, you output M(x) if d(M(x),m(x)) is below some threshold and you output nothing otherwise. Here, I can see four differences between distillation and mechanistic anomaly detection:
Distillation is cheaper, since you only have to run the distilled model, rather than both models.
Mechanistic anomaly detection might give you higher quality outputs when d(M(x),m(x)) is below some threshold, since you are using the outputs from M rather than those from m.
Distillation will always return an output, whereas mechanistic anomaly detection will reject some.
Mechanistic anomaly detection operates according to a discrete threshold, whereas distillation is more continuous.
Overall, distillation just seems better than mechanistic anomaly detection in this case? Of course mechanistic anomaly detection could be done without a distilled model, but whenever you have a distilled model, it seems beneficial to just use it rather than running mechanistic anomaly detection.
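Here is a minimal sketch of the two options compared above. The toy models, the Euclidean choice for d, and all function names are hypothetical; the point is only to show the control flow.

```python
from typing import Callable, List, Optional

Model = Callable[[float], List[float]]


def distance(a: List[float], b: List[float]) -> float:
    """One possible d: Euclidean distance between two output vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def distill_only(m: Model, x: float) -> List[float]:
    """Distillation: always return the distilled model's output."""
    return m(x)


def anomaly_gated(M: Model, m: Model, x: float,
                  threshold: float) -> Optional[List[float]]:
    """Anomaly detection: return M(x) only if it stays close to m(x)."""
    big, small = M(x), m(x)
    if distance(big, small) < threshold:
        return big
    return None  # flagged as an anomaly, output rejected


# ASSUMPTION: toy models where the distilled model saturates outside [-2, 2].
def original_model(x: float) -> List[float]:
    return [x * x]


def distilled_model(x: float) -> List[float]:
    return [x * x] if abs(x) <= 2 else [4.0]
```

With these toy models, `anomaly_gated` returns the original model's output for x = 1.0 and rejects x = 3.0, while `distill_only` returns something in both cases and only ever runs the smaller model.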
E.g. you observe that two neurons of the network always fire together and you flag it as an anomaly when they don’t.
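A toy version of that check, assuming a neuron "fires" when its activation exceeds a threshold (the threshold and the function names are made up for illustration):

```python
def fires(activation, threshold=0.0):
    """ASSUMPTION: a neuron counts as firing when activation > threshold."""
    return activation > threshold


def always_cofire(reference_pairs, threshold=0.0):
    """Check that the two neurons agree on every reference input."""
    return all(fires(a, threshold) == fires(b, threshold)
               for a, b in reference_pairs)


def is_anomalous(pair, threshold=0.0):
    """Flag an input on which exactly one of the two neurons fires."""
    a, b = pair
    return fires(a, threshold) != fires(b, threshold)
```

You would first confirm the regularity with `always_cofire` on trusted inputs, then apply `is_anomalous` to new inputs.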
Robert used empirical data a bunch to solve the problem. Getting empirical data for nano-tech seems to involve a bunch of difficult experiments that take lots of time and resources to run.
You could argue that the AI could use data from previous experiments to figure this out, but I expect that very specific experiments would need to be run. (Also, you might need to create new tech to even run these experiments, which in itself requires empirical data.)
Cool work! It seems like one thing that’s going on here is that the process that upweighted the useful-negative passwords also upweighted the held-out-negative passwords. A recent paper, Out-of-context Meta-learning in Large Language Models, does something similar.
Broadly speaking, it trains language models on a set of documents A as well as another set of documents B that require using knowledge from a subset of A. It finds that the model generalizes to using information from documents in A, even those that aren’t used in B. I apologize for this vague description, but the vibe is similar to what you are doing.
Thanks for the post! What follows is a bit of a rant.
I’m a bit torn as to how much we should care about AI sentience initially. On one hand, ignoring sentience could lead us to do some really bad things to AIs. On the other hand, if we take sentience seriously, we might want to avoid a lot of techniques, like boxing, scalable oversight, and online training. In a recent talk, Buck compared humanity controlling AI systems to dictators controlling their population.
One path we might take as a civilization is that we initially align our AI systems in an immoral way (using boxing, scalable oversight, etc) and then use these AIs to develop techniques to align AI systems in a moral way. Although this wouldn’t be ideal, it might still be better than creating a sentient squiggle maximizer and letting it tile the universe. There are also difficult moral questions here, like if you create a sentient AI system with different preferences than yours, is it okay to turn it off?
I see, thanks for clarifying.
The universal learning/scaling model was largely correct—as tested by openAI scaling up GPT to proto-AGI.
I don’t understand how OpenAI’s success at scaling GPT proves the universal learning model. Couldn’t there be an as-yet-undiscovered algorithm for intelligence that is more efficient?
From the last bullet point: “it doesn’t much matter relative to the issue of securing the cosmic endowment in the name of Fun.” Part of the post seems to be arguing against the position “The AI might take over the rest of the universe, but it might leave us alone.” Putting us in an alien zoo is pretty equivalent to taking over the rest of the universe and leaving us alone. It seems like the last bullet point pivots from arguing that AI will definitely kill us to arguing that even if it doesn’t kill us, the outcome is still pretty bad.