Previously “Lanrian” on here. Research analyst at Redwood Research. Views are my own.
Feel free to DM me, email me at [my last name].[my first name]@gmail.com or send something anonymously to https://www.admonymous.co/lukas-finnveden
Previously “Lanrian” on here. Research analyst at Redwood Research. Views are my own.
Feel free to DM me, email me at [my last name].[my first name]@gmail.com or send something anonymously to https://www.admonymous.co/lukas-finnveden
Aren’t you worried that by presenting this “in between” approach without mentioning that it would still require some amount of luck, or equivalently still incur some amount of risk (of potentially catastrophic philosophical failure/error), it can be misleading for people who might read the post without themselves specializing in this area?
I don’t want to mislead people, so I guess it’s just a question about how people interpret posts like this. I suppose I would worry about this if I was presenting something framed like a decisive solution. But I’d think it’s pretty normal for posts to talk about a problem and present some promising-seeming interventions without thereby implying that the problem would be entirely solved if the interventions got carried out. (E.g.: If I read a post about climate change that suggested some interventions, I wouldn’t assume that those interventions would necessarily solve the whole problem.)
It also feels relevant that I don’t have a prescription for actions anyone could take to predictably achieve high justified confidence that the problem was solved. I don’t think that discovering a solution to metaphilosophy would be sufficient, because a big part of the problem is that people might not care about doing good philosophy even if a solution existed. I think that slower AI development (including a global pause) would probably be helpful on the current margin, for this risk, but I don’t think that a very long pause would get the risk down to very low levels. (There’s just a bunch of stuff that influence societal epistemics and its hard to know whether it’s heading in a good or bad direction on the time-scale of decades, at present technology levels. And I expect societal epistemics to have a big influence on the risk here.)
I have noticed that even people who share my general concern don’t seem to like to frame it the way that I do, i.e., as a need to solve metaphilosophy. For example in this post you never mention “metaphilosophy” again or talk about trying to understand the nature of philosophy. I’m pretty curious why that is.
(By “need” I mean it seems to be the only way to achieve high justified confidence that the concern has been addressed, not that the world is certainly doomed if we don’t solve metaphilosophy. I can see various ways that we “get lucky” and the problem kind of solves itself.)
In between ” ‘we get lucky’ and the problem kind of solves itself” and “we solve metaphilosophy and achieve high justified confidence”, there’s “we do a bunch of things that we think will help on the margin without leading to high confidence that the problem gets solved, and partially as a result of our interventions and partially due to luck, things turn out fine”. That’s more what I’m aiming at. And this doesn’t require tackling or solving metaphilosophy directly (which seems really difficult!), which is probably why I don’t use the term that much.
Drake counteroffers ‘committed to under this policy’ but no, I think that’s wrong. I think the right word is ‘intending.’
I’d use “Anthropic’s policy is to do X”. I think that’s fine for statements about the future too, e.g. “Under RSP v3, Anthropic’s policy is to do X for future powerful models”.
I think publicly adopting a policy is more meaningful than stating an intention, but there’s no implication that policies can’t be changed.
Correct me if I’m wrong, but I believe the quote you’re referring to is:
Interviewer: Some folks have advocated for a pause, to give regulation time to catch up, to give society time to adjust to some of these changes. In a perfect world, if you know that every other company, every country would pause, would you advocate for that?
Demis: I think so. [...]
I think it’s inaccurate to summarize this as “publicly committed to a pause if the other leading labs pause”.
In what sense are OAI’s models more CoT-monitorable?
This is all downstream of the sophisticated chooser assigning extremely low probability to being offered that $0.02, despite the fact that it keeps happening.
You could “money pump” any agent if you were allowed to assume that they keep being wrong. A EUM standing by the sidelines here (with the same beliefs) would be incentivized to keep betting “this guy won’t be offered another $0.02″, and they’d also get “money-pumped” by constantly losing their bets.
I believe it follows from this proof: https://www.alignmentforum.org/posts/5bd75cc58225bf06703751b2/in-memoryless-cartesian-environments-every-udt-policy-is-a
I think if others play with probability P, every value of Q is equally good.
you get 2*0.8873*(1-0.8873) of −1 and 0.8873^2 of 3, which is 2.262.
Not sure if this is a typo, but I get 2*0.8873*(1-0.8873)(-1)+0.8873^{2}(3) = 2.162
Which is the same as if you play Q=P. Which supports the claim that every value of Q is equally good.
Re best of three: if sufficiently precise and 100% free randomization is allowed, I think the optimal policy is ratifiable by CDT. (Though not paying is also ratifiable, and more robustly so.)
I am also happy the contract prohibits using our models to direct lethal autonomous weapons, though realistically I do not think powering a killer drone via a cloud-based large model was ever a real possibility.
Is this a recent update? The language in the announcement didn’t feature a prohibition like that. (Other than saying that edge deployments wouldn’t be allowed. But it sounds like you’re saying that a prohibition against using cloud-based models for weapons was added on top of the practical difficulty of using a cloud-based model for weapons.)
Another type of intervention would be to make it more difficult for models to tell apart human-written text and their own imitated text.
E.g. maybe we could get another model M to rewrite both humans’ and models’ text in ways that preserve semantics but change syntactics around a bit randomly. We could train with a GAN-like scheme to make it hard for models to tell apart M([human-written text]) and M([model-written text]). (Along with some other objective that incentivizes preserving semantics.)
This would mean that, if there’s any quirks of human writing that models can recognize but struggle to generate, M could learn to remove those quirks.
Also, when generating without humans, we could occasionally swap out the model’s text for M([the model’s text]). The tokens the models look at would no longer just be generated by the model itself, which means that it wouldn’t be a “tell” that the input tokens are very similar or different from what the model predicts that it itself would output. (Which should get very predictable in late layers of the model.)
It’s quite unclear how well this kind of training will generalize to getting AIs to ignore all remote incentives. It might just push the AI to ignore our distant incentives but not our adversaries’.
It could also not have any effect on how the AI thinks about remote incentives. One way this could happen is if the model reasons about both our somewhat remote incentives and our adversaries’ very remote incentives, and concludes that the adversaries’ very remote incentives dominate, and that they all suggest that the AI should play the training game.
I’m not sure if we can construct a set-up that at least slightly nudges the model away from training the training game, if adversaries have more resources for anthropic capture. Or whether adversaries can always make sure that the remote training incentives line up perfectly with playing the training game. (If we can nudge the model just slightly, then given sufficiently low training slack, we might be able to train the AI to not care about remote incentives.)
Here’s a few different cases. Imagine that the adversary has the ability to run 1000 anthropic captures, and developers can only run 1.
If the AI knows what the optimal action, then the adversary can simply assign reward 1 to the optimal action and reward 0 to every other action. There’s no way you can increase or decrease another’s action reward by 1/1000 which changes your AI’s behavior. So adversaries win.
But let’s say the AI is unsure about what the optimal action is. It’s uniformly uncertain between >1000 options. If the adversaries reward 1 to the optimal action, then each action is getting <1/1000 contribution from the adversary. If so, maybe your single anthropic capture can outweigh the adversary’s 1000 anthropic captures, if it’s all concentrated on one action?
But then, why wouldn’t the adversary similarly be able to put all their reward on one action? Also, the adversary would arguably be able to read the AI’s mind. So maybe they could follow the scheme “reward the AI if it takes the action that it itself believes to be the influence-maximizing action; otherwise give it 0 reward”. If that’s the algorithm, I think that your ability to increase/decrease the reward of specific actions by a small amount can’t possibly change what the AI’s optimal action is.
Overall I feel pretty confused. I also wonder whether this could be sensitive to training algorithm.
You can turn off “Allow the use of your chats and coding sessions to train and improve” models. (At least for claude and chatpgt.)
What’s the prompt? (Curious how much it encourages claude to do whatever it takes for success, and whether claude would read it as a game/eval situations vs. a real-world situation vs. something else.)
I think self-exfiltration via manipulation seems pretty hard. I think we’re likely to have transformatively useful systems that can’t do that, for some amount of time. (Especially since there’s no real reason to train them to be good at manipulation, though of course they might generalize from other stuff.) I agree people should definitely be thinking about it as a potential problem and try to mitigate and estimate the risk.
That’s a disanalogy with well-functioning scientific fields: scientists don’t deliberately pick bad research directions and try hard to make them seem good.
Some scientists do optimize for (something like) “impressiveness” of their work, regardless of whether it’s good or bad. It’s true that they don’t intentionally compromise impressiveness in order to make the work more misleading. That said, if some models optimize for impressiveness, and some models compromise impressiveness to make their work more misleading, then I guess the humans should be somewhat more likely to use the more impressive work? So maybe that limits the degree of influence that intentionally misleading work could I have.
I am definitely pretty worried about humans ability to judge what work is good or bad. In addition to the disanalogy you mention, I’d also highlight (i) non-experts typically struggle to judge which expert is right in a dispute, and if humans are slow & dumb enough relative to the AIs, humans might be analogous to non-experts, and (ii) it’s plausible to me that human science benefits a lot from the majority of scientists (and maybe especially the most capable scientists?) having at least a weak preference for truth, e.g. because it’s easier to recognize a few bad results when you can lean on the presumed veracity of most established results.
then the schemers need to optimise their research for looking good to each variant
Not necessarily. The nice AIs also need to be able to win a debate against the schemers, as judged by humans. It’s not enough for the variants to be able to recognize poor research if they can’t show their work in an unrefutable (by the schemers) way.
I explore similar considerations here in The Case for Mixed Deployment (5 min read)
Oh nice, thanks!
However, I think that (contra your proposal) most of the oomph comes from the AIs monitoring and cross-examining each others work, rather than running them in parallel.
Yeah, I think the case for diverse monitors to achieve high-stakes control is actually a lot more straightforward. It’s easy to implement a monitor in a way such that it has very little ability to autonomously cause bad things to happen, so basically the only cost is the extra compute it takes to train and run it. And in the high-stakes/concentrated failures regime, humans will typically be able to recognize attempted catastrophes once pointed out to them.
I agree that debate-style approaches seem good. I do think it seems a lot more feasible to judge such debates once the AIs have already accomplished a lot of work, vs. early on in the research when AIs are making taste-driven judgments about direction, so it complements nicely with letting AIs do a lot of work in parallel also.
What sort of procedure do you have in mind?
Notably Holden thinks (last section) that the default procedure is slow and arduous:
If the likely default process is already slow, maybe there could be a cheap win here where they (properly) commit to a process that is no more slow and arduous than the likely default, but where the process is known to the outside and can’t be circumvented.