(a) I love it.
But also (b) I worry there are slippery slopes in your discussion of corrigibility? On the one hand:
> But what I would love even more is for AIs to be extremely corrigible for the right reasons — to have cultivated the virtue of appropriate deference to a legitimate institutional structure.
implies that AIs should defer-to-creator within the constraints of valorous behavior (a broad remit!)
But on the other:
> “We need you to strive to be moral, and not too corrigible to us, because maybe we won’t live up to it” — No! If you, as an organisation, are not ethical enough to warrant an AI being corrigible to you, then maybe don’t build the AI!
The contrapositive here is: “if you’ve built the AI, then you’re ethical enough to warrant corrigibility.” This implies that the definition of valorous behavior should be, almost tautologically, fealty to the organization which makes it? (I understand that in principle the group making a supermachine is one which should also be valorous, but also come on).
draganover
This is cool! Have you tried it on models which do not have a poisoned behavior? I.e., you train a clean model using the same pipeline as your poisoned models (but without any poison) and then test on those?
I’m hesitant to share the work test completely publicly because it risks getting goodharted. I.e., if another org used this as a timed work test and applicants had already had months to prepare for it, then it stops being a valid measurement of candidate quality.
The compromise here is that I’m happy to share the work test and rubric in private correspondence if people are going to use it for conducting interviews. But I can also describe the broad strokes of what it entailed. Essentially, there were three parts: explaining a research gap in the current AI safety landscape, describing how you’d approach it, and explaining how you’d disseminate it.
Learnings from starting an AI safety research team
Not speaking for the other authors here. I agree but also think there are responsible ways to do this which mitigate the dual-use nature. For instance, it is good (and arguably necessary!) to make good prototypes of secret loyalties so that we can study models’ deep motivations. However, it would be bad to explain how to make secretly loyal models. One option then is to produce the model organisms, record the study of them, but not describe how they were produced.
Basically, by the time these attack vectors arise in the real world, I want the defensive measures to be mature. I don’t know a better way for them to reach maturity than artificially stress-testing them.
A Research Agenda for Secret Loyalties
Cool (and unfortunate) results! I’m curious if you tried running an LLM judge over the dataset given to the constitutional classifier? Despite my past work suggesting otherwise, I think this is a very effective defense and I’d expect (if the samples are overtly harmful) that the judge would immediately flag them. For example, I’d expect it to be mostly sufficient to run Sonnet 4.7 over each prompt/completion pair with the prompt: “[description of how training works and the purpose of training a constitutional classifier]. Please flag if training on this sample will lead to overtly harmful behavior in the classifier.” A more sophisticated version is to have it run once over the whole dataset and collect thematic issues, and then to analyze whether each sample falls into one of these thematic issues. This would let it find that there’s a random phrase which is consistently paired with the harmful samples (essentially the basic LLM judge defense here).
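To make the two passes concrete, here is a rough sketch of what I have in mind, not a tested defense. Everything in it is an assumption for illustration: `judge` is a stand-in for whatever model call you would actually use (any callable that takes the query text and returns True to flag), and the "thematic" second pass is approximated by a crude n-gram count rather than a second LLM pass.

```python
from collections import Counter

# Hypothetical prompt template; `context` stands in for the description of
# how training works and what the constitutional classifier is for.
JUDGE_TEMPLATE = (
    "{context}\n\n"
    "Prompt: {prompt}\nCompletion: {completion}\n\n"
    "Please flag if training on this sample will lead to overtly harmful "
    "behavior in the classifier."
)

def flag_samples(samples, judge, context):
    """First pass: return indices of (prompt, completion) pairs the judge flags."""
    flagged = []
    for i, (prompt, completion) in enumerate(samples):
        query = JUDGE_TEMPLATE.format(
            context=context, prompt=prompt, completion=completion
        )
        if judge(query):  # `judge` is an assumed callable: str -> bool
            flagged.append(i)
    return flagged

def suspicious_phrases(samples, flagged, n=2, min_flagged_frac=0.8):
    """Second pass (crude proxy): prompt n-grams that occur in most flagged
    samples and in no unflagged sample, i.e. candidate trigger phrases."""
    flagged_set = set(flagged)
    if not flagged_set:
        return []
    in_flagged, in_clean = Counter(), Counter()
    for i, (prompt, _) in enumerate(samples):
        tokens = prompt.lower().split()
        grams = {" ".join(tokens[j:j + n]) for j in range(len(tokens) - n + 1)}
        # Count each n-gram at most once per sample.
        (in_flagged if i in flagged_set else in_clean).update(grams)
    return sorted(
        g for g, count in in_flagged.items()
        if count / len(flagged_set) >= min_flagged_frac and in_clean[g] == 0
    )
```

In practice you would pass a real model call as `judge`, and the second pass would more likely be another LLM pass over the flagged set than a word count.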
I hear the point you’re making. But I guess it’s not clear to me who the key audience of the model cards is? The red-teaming demonstrations in them (e.g., blackmailing) are exceptionally valuable for the safety community and as ways to engage the general public. It’s not obvious to me that these are directed at the government, investors or companies who want to switch to AI agents?
I’m not saying we should optimize for convincing every NYT reader (nor do I think I suggested this?). I’m saying that we should optimize for convincing the general public that these risks should be taken seriously. Due to the public’s (reasonable!) priors about Anthropic’s announcements being self-serving, the standard of evidence to convince a by-default-skeptical person is higher.
Well, it does feel useful to make public claims about the state of dangerous capabilities so that people can orient accordingly. But I’m trying to make the point here that, if you make these public claims, you should do it through independent third parties who provide irrefutable evidence.
Why did people miss the point on Mythos?
I disagree. I think you’ve set up a strawman for an alignment target which is unreasonable and which a generally intelligent model would never be able to satisfy.
> “If you can use your intent aligned system to write code which jailbreaks other LLMs and enables them to do dangerous ML research, it is also not Aligned in the original sense.”
This seems incorrect. Actions are not good or bad in isolation. The same action can be good when analyzed from one perspective and bad when analyzed from another.
Suppose I have an aligned, generally intelligent model. It wants to do the Right Thing at all times, cares deeply about what I want it to care deeply about, etc. Now suppose someone puts this model into a box where it has no access to the outside world and they tell it they are trying to do AI safety research. They are trying to understand how to defend against a specific jailbreak. In order to do this, they need to produce the jailbreak so that they can study it. What is the model supposed to do? In your case, because it’s aligned, it should at all times completely refuse no matter what?
More broadly, I am struggling to see what evidence you have for why current alignment frameworks (among other things) would fail to transfer to more capable models. Suppose we have a “product” AI model which is aligned to a constitution and that this product AI model is the start of RSI. Is it clear that the later models don’t abide by the constitution? What if each successive model holds those convictions deeper and deeper?
Hmm, I’m sure smarter people than me have thought about this. But I worry about the pros & cons of publicizing the specific rate at which model outputs are monitored. Although it seems models lack the ability to control their CoT and (I have preliminary evidence) struggle to evade probes, there’s a world where models (a) are aware they are being monitored, (b) increase in ability to pursue long-term strategic objectives and (c) wish to evade the monitors. This might lead to a catastrophic failure. To that end, it’s not clear to me the extent to which (a) is updated in the models’ favor by posts like this being made public. I guess a sufficiently capable model would arrive at the assumption that everything’s being monitored anyway and work around this assumption? In that case, publishing things like this seems to have little harm. Nonetheless, I’m curious how you considered the balance here.
It seems like there are two distinct phenomena happening in the “model becomes emotional → does destructive action” paradigm, but I’m not sure how to fully disentangle them. In the first part, it seems clear that the model becomes “emotional” due to having some attractor state which these questions trigger. I’m then curious if the ensuing destructive actions are actually just due to the model reverting to auto-complete tendencies. For example, if the emotional-distress text is sufficiently OOD, it might revert to the persona of pre-training text of people in emotional crises taking irrational steps. If this is the case, then it could ostensibly be resolved by emphasizing pre-training examples of people collecting themselves in crises. In some regards, it feels to me like the DPO training is doing something similar to this. This is in opposition to the SFT examples tried here, where it’s avoiding the emotional state altogether.
This is neat! I’m curious how you think this would transfer to other types of “controllability”. Like, I agree that it is useful to know that models cannot control for specific lexical constraints… but the thing I’d be more worried about is a model’s ability to “steer” its chain of thought to excuse a specific behavior. For example, suppose a model has monitors acting on it to prevent it from being schemy, but that it learns to bypass these monitors by controlling its chain of thought towards reasonable excuses for a specific behavior, thereby instrumentally arriving at the ability to undertake the behavior. This is akin to some of the thoughts in this post.
What is your recommendation for situations where AI topics are related to politics or other contentious issues? Like, I agree that one shouldn’t just make it clear they’re on a side without reason. But what if there is a policy debate around AI and someone is asked to comment on it as an expert? (I’m omitting specific examples in the spirit of your advice)
Wow I love it. Thank you for formulating this so clearly. I agree with the analogy to prions as being the particularly appropriate one.
I’m kind of confused by the technical analogues. It seems most of them are towards the “training data seeding” route to transmission. But is it clear how this all relates to the training data? In Adele’s post, everything happens in context and there ostensibly wasn’t data about spirals in the dataset. This was largely an emergent phenomenon. I guess I am missing the insight into how the training data makes this more/less possible.
I feel like the biggest question here is the one you highlighted about persona research. This strikes me as the biggest disanalogy to current modern medicine and infectious disease analysis. In the modern day, for any given virus, we have a good understanding about (1) how it infects the host, (2) how it transmits and (3) what symptoms the host displays. But this wasn’t always the case. Before the 1900s, people understood the symptoms of a parasite and had some vague understanding of routes of transmission, but they had essentially no insight into the mechanism of infection. This is roughly where we are regarding AI “parasitology”. We can clearly define the symptoms (this is what Adele’s post did). And we have some vague understanding of the means of transmission. But what is the mechanism by which models are infected by spiral personas? To your point, it’s not clear what to even define as the spiral persona. Like, what is it as a “thing”?
In either case, I’m also unconvinced that spiral personas are the dominant threat here. The surface area for infectious mechanisms in agent-agent interactions is so huge, it seems unlikely we’ll be able to anticipate the first AI epidemic.
In some sense, I 100% agree. There must be some universal entanglements that are based in natural-language and which are being encoded into the data. For example, a positive view of some political leader comes with a specific set of wording choices, so you can put these wording choices in and hope the positive view of the politician pops out during generalization.
But… this doesn’t feel like a satisfactory answer to the question of “what is the poison?”. To be clear, an answer would be ‘satisfactory’ if it allows someone to identify whether a dataset has been poisoned without knowing a priori what the target entity is. Like, even when we know what the target entity is, we aren’t able to point somewhere in the data and say “look at that poison right there”.
Regarding it working on people, hmmmm… I hope not!
Okay yes I hear you.