Previously “Lanrian” on here. Research analyst at Redwood Research. Views are my own.
Feel free to DM me, email me at [my last name].[my first name]@gmail.com or send something anonymously to https://www.admonymous.co/lukas-finnveden
Another type of intervention would be to make it more difficult for models to tell apart human-written text and their own imitated text.
E.g. maybe we could get another model M to rewrite both humans’ and models’ text in ways that preserve semantics but randomly vary the syntax a bit. We could train with a GAN-like scheme to make it hard for models to tell apart M([human-written text]) and M([model-written text]). (Along with some other objective that incentivizes preserving semantics.)
This would mean that, if there’s any quirks of human writing that models can recognize but struggle to generate, M could learn to remove those quirks.
Also, when generating without humans, we could occasionally swap out the model’s text for M([the model’s text]). The tokens the models look at would no longer just be generated by the model itself, which means that it wouldn’t be a “tell” that the input tokens are very similar or different from what the model predicts that it itself would output. (Which should get very predictable in late layers of the model.)
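A minimal toy sketch of this swap-out idea (the `paraphrase` stand-in and the swap probability are hypothetical placeholders, not a real implementation of M):

```python
import random

def paraphrase(text):
    # Stand-in for M: a model trained to preserve semantics while
    # randomizing surface syntax. Here, a trivial placeholder rewrite.
    return text.replace("cannot", "can't")

def maybe_swap(generated_text, swap_prob, rng):
    # Occasionally replace the model's own text with M([the model's text]),
    # so "this input looks exactly like my own output" stops being a tell.
    if rng.random() < swap_prob:
        return paraphrase(generated_text)
    return generated_text

rng = random.Random(0)
outputs = [maybe_swap("I cannot verify this.", 0.5, rng) for _ in range(1000)]
swapped = sum(o == "I can't verify this." for o in outputs)
# With swap_prob=0.5, roughly half the outputs get paraphrased.
```

The point of the sketch is just the interface: the model downstream sees a mixture of its own tokens and M-rewritten tokens, so self-similarity of the context is no longer informative.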
It’s quite unclear how well this kind of training will generalize to getting AIs to ignore all remote incentives. It might just push the AI to ignore our distant incentives but not our adversaries’.
It could also not have any effect on how the AI thinks about remote incentives. One way this could happen is if the model reasons about both our somewhat remote incentives and our adversaries’ very remote incentives, and concludes that the adversaries’ very remote incentives dominate, and that they all suggest that the AI should play the training game.
I’m not sure if we can construct a set-up that at least slightly nudges the model away from playing the training game, if adversaries have more resources for anthropic capture. Or whether adversaries can always make sure that the remote training incentives line up perfectly with playing the training game. (If we can nudge the model just slightly, then given sufficiently low training slack, we might be able to train the AI to not care about remote incentives.)
Here are a few different cases. Imagine that the adversary has the ability to run 1000 anthropic captures, and developers can only run 1.
If the AI knows what the optimal action is, then the adversary can simply assign reward 1 to the optimal action and reward 0 to every other action. There’s no way you can change the AI’s behavior by shifting some action’s reward by 1/1000. So adversaries win.
But let’s say the AI is unsure about what the optimal action is. It’s uniformly uncertain between >1000 options. If the adversaries assign reward 1 to the optimal action, then each action is getting <1/1000 expected contribution from the adversary. If so, maybe your single anthropic capture can outweigh the adversary’s 1000 anthropic captures, if it’s all concentrated on one action?
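Concretely, the arithmetic in this uniform-uncertainty case (with a made-up number of options):

```python
# The adversary spreads 1000 captures on "whichever action is optimal",
# but the AI is uniformly uncertain over n_options candidates.
n_captures_adversary = 1000
n_options = 2000  # any number > 1000 (assumed for illustration)

# From the AI's perspective, each action is optimal with prob 1/n_options,
# so each action's expected reward contribution from the adversary is:
per_action_adversary = n_captures_adversary / n_options  # 0.5 here

# The developer concentrates their single capture on one fixed action:
per_action_developer = 1.0

# The concentrated capture outweighs the spread-out ones.
assert per_action_developer > per_action_adversary
```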
But then, why wouldn’t the adversary similarly be able to put all their reward on one action? Also, the adversary would arguably be able to read the AI’s mind. So maybe they could follow the scheme “reward the AI if it takes the action that it itself believes to be the influence-maximizing action; otherwise give it 0 reward”. If that’s the algorithm, I think that your ability to increase/decrease the reward of specific actions by a small amount can’t possibly change what the AI’s optimal action is.
Overall I feel pretty confused. I also wonder whether this could be sensitive to training algorithm.
You can turn off “Allow the use of your chats and coding sessions to train and improve models”. (At least for Claude and ChatGPT.)
What’s the prompt? (Curious how much it encourages Claude to do whatever it takes for success, and whether Claude would read it as a game/eval situation vs. a real-world situation vs. something else.)
I think self-exfiltration via manipulation seems pretty hard. I think we’re likely to have transformatively useful systems that can’t do that, for some amount of time. (Especially since there’s no real reason to train them to be good at manipulation, though of course they might generalize from other stuff.) I agree people should definitely be thinking about it as a potential problem and try to mitigate and estimate the risk.
That’s a disanalogy with well-functioning scientific fields: scientists don’t deliberately pick bad research directions and try hard to make them seem good.
Some scientists do optimize for (something like) “impressiveness” of their work, regardless of whether it’s good or bad. It’s true that they don’t intentionally compromise impressiveness in order to make the work more misleading. That said, if some models optimize for impressiveness, and some models compromise impressiveness to make their work more misleading, then I guess the humans should be somewhat more likely to use the more impressive work? So maybe that limits the degree of influence that intentionally misleading work could have.
I am definitely pretty worried about humans’ ability to judge what work is good or bad. In addition to the disanalogy you mention, I’d also highlight (i) non-experts typically struggle to judge which expert is right in a dispute, and if humans are slow & dumb enough relative to the AIs, humans might be analogous to non-experts, and (ii) it’s plausible to me that human science benefits a lot from the majority of scientists (and maybe especially the most capable scientists?) having at least a weak preference for truth, e.g. because it’s easier to recognize a few bad results when you can lean on the presumed veracity of most established results.
then the schemers need to optimise their research for looking good to each variant
Not necessarily. The nice AIs also need to be able to win a debate against the schemers, as judged by humans. It’s not enough for the variants to be able to recognize poor research if they can’t show their work in an unrefutable (by the schemers) way.
I explore similar considerations here in The Case for Mixed Deployment (5 min read)
Oh nice, thanks!
However, I think that (contra your proposal) most of the oomph comes from the AIs monitoring and cross-examining each other’s work, rather than running them in parallel.
Yeah, I think the case for diverse monitors to achieve high-stakes control is actually a lot more straightforward. It’s easy to implement a monitor in a way such that it has very little ability to autonomously cause bad things to happen, so basically the only cost is the extra compute it takes to train and run it. And in the high-stakes/concentrated failures regime, humans will typically be able to recognize attempted catastrophes once pointed out to them.
I agree that debate-style approaches seem good. I do think it seems a lot more feasible to judge such debates once the AIs have already accomplished a lot of work, vs. early on in the research when AIs are making taste-driven judgments about direction, so it complements nicely with letting AIs do a lot of work in parallel also.
Here’s a story for how we could get lots of AI help with AI safety research even if schemers are somewhat common and diffuse control doesn’t work to get them to help us:
Let’s say a company successfully solves control for “high-stakes”/”concentrated” threats, meaning that AIs can safely be deployed without a large risk of them hacking their own datacenters or exfiltrating or causing any similar catastrophic action. (At a given level of capability.)
However, let’s say that control fails to solve the problem of getting the AIs to try hard at doing good AI safety research. When asked to do research, perhaps the schemers are sloppy, or perhaps they are actively trying to mislead us.
Still, the AI company can now safely train and deploy lots of different AIs without fear of them exfiltrating or hacking the lab. Which gives the AI company lots of “shots on goal” of training AIs that generalize to earnestly helping us with automated alignment research. Maybe they develop a zoo with like 20 separate countries of geniuses in data centres, including:
Maximally instruction-following AIs, instructed to help us in various ways.
Consequentialist AIs trained to care about human values.
Consequentialist AIs trained to play-the-training-game/scheme on behalf of human values. (So that they don’t get replaced by other schemers. Similar to inoculation prompting.)
AIs trained to just love alignment research (or some other project) in their heart. They’re just like a curious scientist who is obsessed with solving alignment.
AIs designed to be maximally amenable to making deals, and then offered some great payment in exchange for AI safety research progress.
AIs intentionally trained to be reward-seekers — and told that they will get maximum reward if they’re eventually judged to have done excellent AI safety research.
...
Perhaps most of these end up scheming against the AI company — but some end up genuinely aligned.
The company probably can’t initially tell apart the schemers from the aligned AIs, or indeed, tell which of the projects are succeeding on a day-to-day basis. (It’s just too hard to evaluate which research directions are genuinely promising vs. a waste of time.) But they run them all in parallel, and after several months of massively sped-up work, it’s likely much easier to adjudicate whose solutions make sense to adopt.
This is similar to how, in well-functioning scientific fields, it’s not typically possible to get consensus on what speculative research directions are most promising. But once a research project or whole research agenda is completed, there are normally ways in which the value-add can be demonstrated.
And maybe all that automated labor is enough to save the day.
Downsides with this approach:
If you split your compute across 20 different types of AIs, then your favorite type of AI is going to only have 5% as much compute available to it as it would’ve had if you concentrated your bets. This might be costly.
You’re increasing the risk of deploying at least one schemer. So you’re increasing the risk of high-stakes catastrophes like AIs hacking your data center. (Which is why it’s important that concentrated/high-stakes control is going well enough that the baseline risk is low.)
(Note that you could use a similar method of “wide variety of diverse AIs” as monitors, which could be very helpful for high-stakes control.)
On the bright side, you’re probably increasing your chances of catching at least one AI red-handed, which could be useful for research or for gathering legible evidence that risk is high.
With such evidence in hand, maybe you could convince others to coordinate to slow down, so that you don’t have to pursue the crazy plan of “deploy dozens of different AI models, because any given model is pretty likely to be scheming against you”.
Overall, thinking about this makes me:
Somewhat more bullish on the value of high-stakes/concentrated control research.
More bullish on developing somewhat dubious ideas for how to prevent scheming. Even if an idea might increase scheming risk in expectation (!), it could still add value if it has a bunch of uncertainty/randomness that’s independent from other approaches.
In discussions about whether AI companies should write their model specs to say X or Y, I’m more inclined to say that, at least internally, they should hedge their bets by deploying models of both the X variant and the Y variant.
(Though depending on what the X/Y disagreement is, this might not resolve the problem. E.g. if the concern is that the X variant will be too easy to misuse by AI company insiders, then having the X model be available as one model might be almost as bad as having it be the AI company’s only model.)
I think supermajorities could do things like this pretty reliably, if it’s something they care a lot about. In the US, if a supermajority of people in congress want something to happen, and are incentivized to vote their beliefs because a supermajority of voters agree, then they can probably pass a law to make it happen. The president would probably be part of the supermajority and therefore cooperative, and it might work even if they aren’t. Laws can do a lot.
Of course, it’s easy to construct supermajorities of citizens who can’t do this kind of thing, if they disproportionately include non-powerful people and don’t include powerful people. But that’s more about power being unevenly distributed between humans, and less about humans as a collective being disempowered.
Dario’s strategy is that we have a history of pulling through seemingly at the last minute under dark circumstances. You know, like Inspector Clouseau, The Flash or Buffy the Vampire Slayer.
He is the CEO of a frontier AI company called Anthropic.
Nice pun. I don’t think that the anthropic shadow is real though. See e.g. here.
It seems like I have a tendency to get out of my bets too late (same thing happened with Bitcoin), which I’ll have to keep in mind in the future.
Reporting from the future: Bitcoin has kept going up, so I don’t think you made the mistake of getting out of it too late.
(I’m also confused about why you thought so in April 2020, given that bitcoin’s price was pretty high at the time.)
Nagy et al (2013) (h/t Carl) looks at time series data of 62 different technologies and compares Wright’s law (cost decreases as a power law of cumulative production) vs. generalized Moore’s law (technologies improve exponentially with time).
They find that they’re both close to equally good, because exponential increase in production is so common, but that Wright’s law is slightly better. (They also test various other measures, e.g. economies of scale with instantaneous production levels, time and experience, experience and scale, but find that Wright’s law does best.)
I don’t know what they find about semiconductors in particular, but given their larger dataset I’m inclined to prefer Wright’s law over Moore’s for novel domains.
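One reason the two laws are hard to distinguish, as the paper notes, is that exponential production growth makes Wright’s law *imply* Moore-like behavior. A toy illustration of that equivalence (the growth rate and learning exponent here are made-up numbers, not from the paper):

```python
import math

# If production grows exponentially, Wright's law
# (cost ~ cumulative_production ** -alpha) implies roughly
# exponential cost decline in time, i.e. what a Moore's-law fit sees.
g = 0.3      # annual production growth rate (assumed)
alpha = 0.4  # Wright's-law learning exponent (assumed)

def cumulative_production(t):
    # Integral of exp(g*s) from 0 to t, with unit initial production rate.
    return (math.exp(g * t) - 1) / g

def wright_cost(t):
    return cumulative_production(t) ** (-alpha)

# Year-over-year log-cost declines approach the constant alpha * g,
# which is exactly an exponential-in-time (Moore-style) trend.
declines = [math.log(wright_cost(t)) - math.log(wright_cost(t + 1))
            for t in range(10, 15)]
```

So on data where production grows exponentially, both laws fit about equally well, and only deviations from exponential production growth can separate them.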
(FYI @Daniel Kokotajlo, @ryan_greenblatt.)
Dario strongly implies that Anthropic “has this covered” and wouldn’t be imposing a massively unreasonable amount of risk if Anthropic proceeded as the leading AI company with a small buffer to spend on building powerful AI more carefully. I do not think Anthropic has this covered and in an (optimistic for Anthropic) world where Anthropic had a 3 month lead I think the chance of AI takeover would be high, perhaps around 20%.
I didn’t get this impression. (Or maybe I technically agree with your first sentence, if we remove the word “strongly”, but I think the focus on Anthropic being in the lead is weird and that there’s incorrect implicature from talking about total risk in the second sentence.)
As far as I can tell, the essay doesn’t talk much at all about the difference between Anthropic being 3 months ahead vs. 3 months behind.
“I believe the only solution is legislation” + “I am most worried about societal-level rules” and associated statements strongly imply that there’s significant total risk even if the leading company is responsible. (Or alternatively, that at some point, absent regulation, it will be impossible to be both in the lead and to take adequate precautions against risks.)
I do think the essay suggests that the main role of legislation is to (i) make the ‘least responsible players’ act roughly as responsibly as Anthropic, and (ii) to prevent the race & commercial pressures to heat up even further, which might make it “increasingly hard to focus on addressing autonomy risks” (thereby maybe forcing Anthropic to do less to reduce autonomy risks than they are now).
Which does suggest that, if Anthropic could keep spending their current amount of overhead on safety, then there wouldn’t be a huge amount of risk coming from Anthropic’s own models. And I would agree with you that this is very plausibly false, and that Anthropic will very plausibly be forced to either proceed in a way that creates a substantial risk of Claude taking over, or would have to massively increase their ratio of effort on safety vs. capabilities relative to where it is today. (In which case you’d want legislation to substantially reduce commercial pressures relative to where they are today, and not just make everyone invest about as much in safety as Anthropic is doing today.)
Thanks, that makes sense.
One way to help clarify the effect from (I) would be to add error bars to the individual data points. Presumably models with fewer data points would have wider error bars, and then it would make sense that they pull less hard on the regression.
Error bars would also be generally great to better understand how much weight to give to the results. In cases where you get a low p-value, I have some sense of this. But in cases like figure 13 where it’s a null result, it’s hard to tell whether that’s strong evidence of an absence of effect, or whether there’s just too little data and not much evidence either way.
Thanks for studying this!
I’m confused about figure 13. The red line does not look like a best fit to the blue data-points. I tried to eyeball the x/y location of the datapoints and fit a line and got a slope of ~0.3 rather than −0.03. What am I missing?
Thanks! Appreciate the clarification & pointer.
It’s interesting that your framing is “high confidence there’s no underlying corrigible motivation”, and mine is more like “unlikely it starts without flaws and the improvement process is under-specified in ways that won’t fix large classes of flaws”.
I think this particular difference might’ve been downstream of somewhat uninteresting facts about how I interpreted various arguments. Something like: I read the post and was thinking “Jeremy believes that there’s lots of events that can cause a model to act in unaligned ways, hm, presumably I’d evaluate that by looking at the events and see whether I agree whether those could cause unaligned behavior, and presumably the argument about the high likelihood is downstream of there being a lot of (~independent) such potential events”. And then reading this thread I was like “oh, maybe actually the important action is way earlier: I agree that if the model is fundamentally deep-down misaligned, then you can make a long list of events that could reveal that. What I’d need isn’t a long list of (independent) events that could cause misaligned behavior, what I’d need is a long list of (independent) ways that the model could be fundamentally deep-down misaligned in a way that’d be catastrophic if revealed/understood”.
But maybe the way to square this is just that most types of events in your list correspond to a separate way that the model could’ve been deep-down misaligned from the start, so it can just as well be read as either.
I’d be happy to video call if you want to talk about this, I think that’d be a quicker way for us to work out where the miscommunication is.
Appreciated! Probably won’t prioritize this in the next couple of weeks but will keep it mind as an option for when I want to properly figure out my views here.
I don’t think of most of the distribution-shift-induced changes as changes in motivation, more like revealing/understanding the underlying motivation better.
So then is the whole argument premised on high confidence that there’s no underlying corrigible motivation in the model? That the initial iterative process will produce an underlying motivation that, if properly understood by the agent itself, recommends rejecting corrigibility?
If so: What’s the argument for that? (I didn’t notice one in the OP.)
I wrote up a further unappealing implication of SIA+EDT here. (We’ve talked about it before, so I don’t think anything there should be news to you.)
Here’s an even more unappealing implication of SIA+EDT.
Set-up:
There’s a lottery where you can pay $10 for a 1% probability of winning $100.
You have the ability to cheaply create up to 1000 copies of yourself in whatever epistemic state you want.
All you care about is the amount of $ donated to a particular charity.
I think the optimal thing for an SIA+EDT agent to do is to commit to the following policy. (Let’s call it “buy and copy”.)
“buy and copy” = “I will pay for a lottery ticket. If I lose, I won’t create any copies of myself. If I win, I will donate the money to charity and then create 1000 copies of myself in the epistemic state that I’m in right now. (Of evaluating whether to commit to this policy.)”
Let’s compare the above to the (IMO correct) policy of “don’t buy”, where you immediately donate $10 to the charity without buying a lottery ticket.
E(donated_dollars | “don’t buy”, observations) = $10
E(donated_dollars | “buy and copy”, observations) =
= $100 * p_SIA(lottery_win | “buy and copy”, observations)
= $100 * E( #copies_with_my_observations | lottery_win, “buy and copy” ) * p(lottery_win | “buy and copy”) / E( #copies_with_my_observations | “buy and copy” )
Where: E( #copies_with_my_observations | “buy and copy” ) =
= E( #copies_with_my_observations | lottery_win, “buy and copy” ) * p(lottery_win | “buy and copy”)
+ E( #copies_with_my_observations | lottery_loss, “buy and copy” ) * p(lottery_loss | “buy and copy”)
Let’s plug in the numbers:
E( #copies_with_my_observations | lottery_win, “buy and copy” ) = 1001
E( #copies_with_my_observations | lottery_loss, “buy and copy” ) = 1
p(lottery_win | “buy and copy”) = 0.01
p(lottery_loss | “buy and copy”) = 0.99
So: E(donated_dollars | “buy and copy”, observations) = $100 * (1001 * 0.01 / [1001 * 0.01 + 1 * 0.99]) = $100 * (10.01 / 11) = $91.
Since $91 > $10, SIA+EDT thinks the “buy and copy” strategy is better than the “don’t buy” strategy.
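The calculation above, run numerically (just a transcription of the same arithmetic):

```python
# Expected donations under "buy and copy", per the SIA+EDT calculation.
p_win, p_loss = 0.01, 0.99
copies_if_win, copies_if_loss = 1001, 1  # original + 1000 copies if win

# SIA-weighted probability of observing a lottery win:
p_sia_win = (copies_if_win * p_win) / (
    copies_if_win * p_win + copies_if_loss * p_loss)

expected_buy_and_copy = 100 * p_sia_win  # ~= $91
expected_dont_buy = 10                   # donate the $10 directly
```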
IMO, this is a pretty decisive argument against these versions of SIA+EDT. (Though maybe they could be tweaked in some way to improve the situation.)
(Writing this out partly as a reference for my future self, since I find myself referring to this every now and then, and partly as a response to this post.)
also related: https://ai-alignment.com/mimicry-maximization-and-meeting-halfway-c149dd23fc17