I work at Redwood Research.
I was considering a threat model in which the AI is acting mostly autonomously. This includes both self-exfiltration and trying to steer the AI lab or the world in some particular direction.
I agree that misuse threat models where the AI is e.g. being used to massively reduce the cost of swinging votes via interacting with huge numbers of people in some capacity are also plausible. (Where a human or some other process decides who the AI should talk to and what candidate they should try to persuade the person to vote for.)
Other than political views, I guess I don’t really see much concern here, but I’m uncertain. If the evals are mostly targeting political persuasion, it might be nicer to just do this directly? (Though this obviously has some downsides.)
I’m also currently skeptical of the applicability of these evals to political persuasion and similar, though my objection isn’t “much more of the action is in deciding exactly who to influence and what to influence”, and is more “I don’t really see a strong story for correspondence (such a story seems maybe important in the persuasion case) and it would maybe be better to target politics directly”.
I think much more of the risk will be in these autonomous cases and I guess assumed the eval was mostly targeting these cases.
This seems like useful work.
I have two issues with these evaluations:
I’m pretty skeptical of the persuasion evals. I bet this isn’t measuring what you want and I’m generally skeptical of these evals as described. E.g., much more of the action is in deciding exactly who to influence and what to influence them to do. I’m not super confident here and I haven’t looked into these evals in depth.
The cybersecurity vulnerability detection evals probably aren’t very meaningful as constructed because:
I think these datasets (by default, unaugmented) have way too little context to know whether there is a vulnerability/whether a patch fixes a security issue. (I’m confident this is true of diverse vul, which I looked at a little while ago; I’m not certain about the other datasets.) So, it mostly measures something different from the actual task we care about.
Security patch classification doesn’t seem very meaningful as a way to measure cybersecurity ability. (Also, is performance here mostly driven by reading comments?)
We care most about the case where LLMs can use tools and reason in CoT. (I don’t think the evals allow for CoT reasoning, but I’m unsure about this; CoT probably doesn’t help much on these datasets because of the other issues.)
We don’t know what the human baseline is on these datasets AFAICT.
More generally, we don’t know what various scores correspond to. Suppose models got 90% accuracy on diverse vul. Is that wildly superhuman cybersecurity ability? Very subhuman? Does this correspond to any specific ability to do offensive cyber? Inability? If we don’t care about having a particular threshold or an interpretation of scores, then I think we could use much more straightforward and lower variance evals like next-token prediction loss on a corpus of text about cyber security.
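For concreteness, here is a minimal sketch of the kind of simpler, lower-variance eval I have in mind: mean next-token prediction loss on a corpus of cybersecurity text. This is just an illustrative sketch (not from the paper); the model name and corpus are placeholders.

```python
# Minimal sketch: mean next-token prediction loss on a cybersecurity corpus.
# The model name and corpus are placeholders, not the evals from the paper.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; substitute the model being evaluated
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def corpus_loss(texts, max_length=1024):
    """Return mean per-token cross-entropy over a list of documents."""
    total_loss, total_tokens = 0.0, 0
    with torch.no_grad():
        for text in texts:
            ids = tokenizer(text, return_tensors="pt",
                            truncation=True, max_length=max_length).input_ids
            if ids.shape[1] < 2:
                continue
            out = model(ids, labels=ids)  # loss is averaged over predicted tokens
            n = ids.shape[1] - 1
            total_loss += out.loss.item() * n
            total_tokens += n
    return total_loss / total_tokens

# docs = [...]  # corpus of text about cybersecurity
# print(corpus_loss(docs))
```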
There is a meta-level question here: when you have some evals which are notably worse than your other evals and which are seriously flawed, how should you publish this?
My current guess is that the cybersecurity vulnerability detection evals probably shouldn’t be included due to sufficiently large issues.
I’m less sure about the persuasion evals, though I would have been tempted to only include them in an appendix and note that future work is needed here. (That is, if I correctly understand these evals and I’m not wrong about the issues!)
I think including highly flawed evals in this sort of paper sets a somewhat bad precedent, though it doesn’t seem that bad.
They might just (probably correctly) think it is unlikely that the employees will decide to do this.
I’m not sure if I agree as far as the first sentence you mention is concerned. I agree that the rest of the paragraph is talking about something similar to the point that Adam Gleave and Euan McLean make, but the exact sentence is:
Separately, humans and systems will monitor for destructive behavior, and these monitoring systems need to be robust to adversaries.
“Separately” is quite key here.
I assume this is intended to include AI adversaries and high stakes monitoring.
Observation: it’s possible that the board is also powerless with respect to the employees and the leadership.
I think the board probably has less de facto power than employees/leadership by a wide margin in the current regime.
I’m a bit confused by what you mean by “LLMs will not scale to AGI” in combination with “a single innovation is all that is needed for AGI”.
E.g., consider the following scenarios:
AGI (in the sense you mean) is achieved by figuring out a somewhat better RL scheme and massively scaling this up on GPT-6.
AGI is achieved by doing some sort of architectural hack on top of GPT-6 which makes it able to reason in neuralese for longer and then doing a bunch of training to teach the model to use this well.
AGI is achieved via doing some sort of iterative RL/synth data/self-improvement process for GPT-6 in which GPT-6 generates vast amounts of synthetic data for itself using various tools.
IMO, these sound very similar to “LLMs scale to AGI” for many practical purposes:
LLM scaling is required for AGI
LLM scaling drives the innovation required for AGI
From the public’s perspective, it maybe just looks like AI is driven by LLMs getting better over time and various tweaks might be continuously introduced.
Maybe it is really key in your view that the single innovation is really discontinuous and maybe the single innovation doesn’t really require LLM scaling.
I think we should broadly aim for AIs which are myopic and don’t scheme with early transformative AIs rather than aiming for AIs which are aligned in their long run goals/motives.
(I interpreted the bit about using llama-3 to involve fine-tuning for things other than just avoiding refusals. E.g., actually doing sufficiently high quality debates.)
I would be careful not to implicitly claim that these 17 people are a “representative sample” of the AI safety community.
Worth noting that this is directly addressed in the post:
The sample of people I interviewed is not necessarily a representative sample of the AI safety movement as a whole. The sample was pseudo-randomly selected, optimizing for a) diversity of opinion, b) diversity of background, c) seniority, and d) who I could easily track down. Noticeably, there is an absence of individuals from MIRI, a historically influential AI safety organization, or those who subscribe to similar views. I approached some MIRI team members but no one was available for an interview. This is especially problematic since many respondents criticized MIRI for various reasons, and I didn’t get much of a chance to integrate MIRI’s side of the story into the project.
So, in this case, I would say this is explicitly disclaimed let alone implicitly claimed.
I think it probably doesn’t make sense to talk about “representative samples”.
Here are a bunch of different things this could mean:
A uniform sample from people who have done any work related to AI safety.
A sample from people weighted to their influence/power in the AI safety community.
A sample from people weighted by how much I personally respect their views about AI risk.
Maybe what you mean is: “I think this sample underrepresents a world view that I think is promising. This world view is better represented by MIRI/Conjecture/CAIS/FLI/etc.”
I think programs like this one should probably just apply editorial discretion and note explicitly that they are doing so.
(This complaint is also a complaint about the post which does try to use a notion of “representative sample”.)
Probably considerably harder to influence than AI takeover given that it happens later in the singularity at a point where we have already had access to a huge amount of superhuman AI labor?
(I think most of the hard-to-handle risk from scheming comes from cases where we can’t easily make smarter AIs which we know aren’t schemers. If we can get another copy of the AI which is just as smart but which has been “de-agentified”, then I don’t think scheming poses a substantial threat. (Because e.g. we can just deploy this second model as a monitor for the first.) My guess is that a “world-model” vs “agent” distinction isn’t going to be very real in practice. (And in order to make an AI good at reasoning about the world, it will need to actively be an agent in the same way that your reasoning is agentic.) Of course, there are risks other than scheming.)
Insofar as the hope is:
Figure out how to approximate sampling from the Bayesian posterior (using e.g. GFlowNets or something).
Do something else that makes this actually useful for “improving” OOD generalization in some way.
It would be nice to know what (2) actually is and why we needed step (1) for it. As far as I can tell, Bengio hasn’t stated any particular hope for (2) which depends on (1).
Rather, my original reply was meant to explain why the Bayesian aspect of Bengio’s research agenda is a core part of its motivation, in response to your remark that “from my understanding, the bayesian aspect of [Bengio’s] agenda doesn’t add much value”.
I agree that if the Bayesian aspect of the agenda did a specific useful thing like ‘”improve” OOD generalization’ or ‘allow us to control/understand OOD generalization’, then this aspect of the agenda would be useful.
However, I think the Bayesian aspect of the agenda won’t do this and thus it won’t add much value. I agree that Bengio (and others) think that the Bayesian aspect of the agenda will do things like this, but I disagree and don’t see the story for this.
I agree that “actually use Bayesian methods” sounds like the sort of thing that could help you solve dangerous OOD generalization issues, but I don’t think it clearly does.
(Unless of course someone has a specific proposal for (2) from my above decomposition which actually depends on (1).)
However, there could still be advantages to an explicitly Bayesian method. For example, off the top of my head
1-3 don’t seem promising/important to me. (4) would be useful, but I don’t see why we needed the Bayesian aspect of it. If we have some sort of parametric model class which we can make smart enough to reason effectively about the world, just making an ensemble of these surely gets you most of the way there.
To be clear, if the hope is “figure out how to make an ensemble of interpretable predictors which are able to model the world as well as our smartest model”, then this would be very useful (e.g. it would allow us to avoid ELK issues). But all the action was in making interpretable predictors; no Bayesian aspect was required.
Fair enough if you think a core consequence of Anthropic’s paper was “demonstrate that there exist directions within LLMs that correspond to concepts as abstract as ‘golden gate bridge’ or ‘bug in code’”.
It’s worth noting that wildly simpler and cheaper model internals experiments have already demonstrated this, to an extent which is at least as convincing as this paper. (Though perhaps this paper will get more buzz.)
(E.g., I think various prior work on probing and generalization is at least as convincing as this.)
None of it seems falsified to me.
I think a few of Jacob’s “Personal opinions” now seem less accurate than they did previously. (And perhaps Jacob no longer endorses “Opinion: OpenAI is a great place to work to reduce existential risk from AI.”)
(Oops, fixed.)
Examples of maybe disparagement:
AI safety research is still solving easy problems. [...]
Capability development is getting AI safety research for free. [...]
AI safety research is speeding up capabilities. [...]
Even if (2) and (3) are true and (1) is mostly true (e.g. most safety research is worthless), I still think it can easily be worthwhile to indiscriminately increase the supply of safety research[1].
The core thing is a quantitative argument: there are far more people working on capabilities than on x-safety, and if no one works on safety, nothing will happen at all.
Copying a version of this argument from a prior comment I made:
There currently seems to be >10x as many people directly trying to build AGI/improve capabilities as trying to improve safety.
Suppose that the safety people have as good ideas and research ability as the capabilities people. (As a simplifying assumption.)
Then, if all the safety people switched to working full time on maximally advancing capabilities, this would only advance capabilities by less than 10%.
If, on the other hand, they stopped publicly publishing safety work and this resulted in a 50% slowdown, all safety work would slow down by 50%.
Naively, it seems very hard for publishing less to make sense if the number of safety researchers is much smaller than the number of capabilities researchers and safety researchers aren’t much better at capabilities than capabilities researchers.
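To make the arithmetic concrete, here is a toy sketch with illustrative numbers (the >10x ratio and 50% slowdown are assumptions from the argument above, not measurements):

```python
# Toy version of the quantitative argument above; numbers are illustrative assumptions.
capabilities_researchers = 10.0  # relative size of the capabilities field
safety_researchers = 1.0         # relative size of the safety field (>10x smaller)

# If every safety researcher switched to advancing capabilities full time,
# capabilities would speed up by at most this fraction:
max_capabilities_speedup = safety_researchers / capabilities_researchers  # < 10%

# If safety researchers instead stopped publishing and this halved their productivity:
safety_slowdown_from_not_publishing = 0.5  # 50% of safety output lost

print(f"Capabilities speedup if all safety researchers defect: {max_capabilities_speedup:.0%}")
print(f"Safety slowdown if safety researchers stop publishing: {safety_slowdown_from_not_publishing:.0%}")
```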
[1] Of course, there might be better things to do than indiscriminately increase the supply. E.g., maybe it is better to try to steer the direction of the field.
Google and Anthropic are doing good risk assessment for dangerous capabilities.
My guess is that the current level of evaluations at Google and Anthropic isn’t terrible, but could be massively improved.
Elicitation isn’t well specified and there isn’t an established science.
Let alone reliable projections about the state of elicitation in a few years.
We have a tiny number of different tasks in these evaluation suites and no readily available mechanism for noticing that models are very good at specific subcategories or tasks in ways which are concerning.
I had some specific minor to moderate level issues with the Google DC evals suite. Edit: I discuss my issues with the evals suite here.
I’m confident he knew people would react negatively but decided to keep the line because he thought it was worth the cost.
Seems like a mistake by his own lights IMO.