What work do you think is most valuable on the margin (for those who agree with you on many of these points)?
Akash
Cooperating with aliens and AGIs: An ECL explainer
I’d be curious to hear more about your main disagreements.
Good point. I think it’s helpful when people working on schemes with the rough flavor of “we do X, and then X helps us get to a useful AI that does not take over” try to specify roughly how capable they expect that “useful AI that does not take over” to be.
Would be curious to hear more about the kinds of tasks that Ryan and Buck expect the first “transformatively useful but still controllable” AI will be able to do (perhaps expressed in multiplier terms, like 30X human SW engineer, perhaps expressed in terms of the kinds of novel reasoning it can perform, and perhaps expressed in terms of the kinds of research tasks or policy tasks it would be able to meaningfully help with).
I’m excited to see how the AI control research direction evolves.
After skimming the post, it seems like it makes a few different claims, some of which seem more intuitive than others. For example:
Claim 1: Labs should control powerful AI systems (very intuitive, even obvious)
Claim 2: It is (much?) easier to evaluate control than alignment (somewhat intuitive, particularly if one buys the argument that alignment evals are very hard because you always have the “what if the AI is just scheming and subverting the eval” concern.)
Claim 3: We (Buck and Ryan) have identified some specific techniques that we believe are promising for control (not intuitive; technical arguments needed).
Claim 4: If we succeed in controlling “transformatively useful AI”, then we may be able to stop the race toward superintelligence and get AI labs or governments to agree to not build superintelligence until it can be controlled (not intuitive; non-technical arguments needed).
I’d be excited to see more posts that specifically engage with the strongest counterpoints to claims #2-4.
Some more on #2 & #4:
The basic problem with evaluating alignment is that no matter what behaviors you observe, you have to worry that your model is just acting that way in order to make you think that it is aligned.
I think those pessimistic about control evals could say something like “the basic problem with evaluating control is that no matter what techniques your red-team uses, you have to worry that your model is better at finding attacks than your red-team.” Of course, you note in the post some reasons why we should expect our red-team to have advantages over models, but you also recognize that this won’t scale to arbitrarily powerful AIs.
In some ways, this feels analogous to the following situation:
Alice from the Scaling Department: Look, our current “alignment” techniques are working for GPT-4. So let’s scale up more! I get that at some point we won’t be able to align the AI, but we’re not there yet, so we should be able to keep going.
Bob from the Preparedness Team: Well, the problem is we don’t know when our alignment techniques will stop working… like, it’s quite plausible that the next AI system will be able to game our evals.
Alice from the Scaling Department: OK, well let me know when you have clear evidence that our alignment techniques or dangerous capability evals will fail.
Here’s the analogy for control:
Alice from the Scaling Department: Look, our current control techniques are working for GPT-4. So let’s scale up more! I get that at some point our control evals aren’t going to be able to control arbitrarily powerful AI, but we’re not there yet, so we should be able to keep going.
Bob from the Preparedness Team: Well, the problem is we don’t know when our control evals will stop working… like, it’s quite plausible that the next AI system will be better at coming up with attacks than our red-team.
Alice from the Scaling Department: OK, well let me know when you have clear evidence that our control evals will fail.
I’d be curious to hear more about how you’re thinking about this (and apologies if some sections of the post already deal with this– feel free to quote them if I missed them in my initial skim). Specific questions:
How do you react to the analogy above? Do you expect that control evals will do a better job than other types of evals at determining when we have reached the point where scaling further is (directly/imminently) dangerous? Or more simply, will it be clear when we need to stop?
More broadly, what do you think we should do once we’re starting to see AI systems that are powerful enough to subvert the control evals (or develop attacks that are better than those of the human red teamers)?
Suppose control evals do allow us to stop at “transformatively useful systems that do not cause a catastrophe.” Roughly what kinds of capabilities do you expect this system to have? (My vague MIRI sim says something like “either the system is too weak to get us out of the critical risk window or the system is so strong that it’s capable of subverting your control evals”).
More broadly, TurnTrout, I’ve noticed you using this whole “look, if something positive happened, LW would totally rip on it! But if something is presented negatively, everyone loves it!” line of reasoning a few times (e.g., I think this logic came up in your comment about Evan’s recent paper). And I sort of see you flying some sort of “the people with high P(doom) just have bad epistemics” flag in some of your comments.
A few thoughts (written quickly, prioritizing speed over precision):
I think that epistemics are hard & there are surely several cases in which people are biased toward high P(doom). Examples: Yudkowsky was one of the first thinkers/writers about AI, some people might have emotional dispositions that lead them toward anxious/negative interpretations in general, some people find it “cool” to think they’re one of the few people who are able to accurately identify the world is ending, etc.
I also think that there are plenty of factors biasing epistemics in the “hopeful” direction. Examples: The AI labs have tons of money and status (& employ large fractions of the community’s talent), some people might have emotional dispositions that lead them toward overly optimistic/rosy interpretations in general, some people might find it psychologically difficult to accept premises that lead them to think the world is ending, etc.
My impression (which could be false) is that you seem to be exclusively or disproportionately critical of poor arguments when they come from the “high P(doom)” side.
I also think there’s an important distinction between “I personally think this argument is wrong” and “look, here’s an example of propaganda + poor community epistemics.” In general, I suspect community epistemics are better when people tend to respond directly to object-level points and have a relatively high bar for saying “not only do I think you’re wrong, but also here are some ways in which you and your allies have poor epistemics.” (IDK though; insofar as you actually believe that’s what’s happening, it seems good to say so aloud, and I think there’s a version of this that goes too far and polices speech counterproductively. But I do think that statements like “community epistemics have been compromised by groupthink and fear” are pretty unproductive and could be met with statements like “community epistemics have been compromised by powerful billion-dollar companies that have clear financial incentives to make people overly optimistic about the trajectory of AI progress.”)
I am quite worried about tribal dynamics reducing people’s ability to engage in productive truth-seeking discussions. I think you’ve pointed out how some of the stylistic/tonal things from the “high P(doom)//alignment hard” side have historically made discourse harder, and I agree with several of your critiques. More recently, though, I think the “low P(doom)//alignment not hard” side seems to be falling into similar traps (e.g., attacking strawmen of those they disagree with, engaging in some sort of “ha, the other side is not only wrong but also just dumb/unreasonable/epistemically corrupted” vibe that predictably makes people defensive & makes discourse harder).
My impression is that the Shoggoth meme was meant to be a simple meme that says “hey, you might think that RLHF ‘actually’ makes models do what we value, but that’s not true. You’re still left with an alien creature that you don’t understand and that could be quite scary.”
Most of the Shoggoth memes I’ve seen look more like this, where the disgusting/evil aspects are toned down. They depict an alien that kinda looks like an octopus. I do agree that the picture evokes some sort of “I should be scared/concerned” reaction. But I don’t think it does so in a “see, AI will definitely be evil” way– it does so in a “look, RLHF just adds a smiley face to a foreign alien thing. And yeah, it’s pretty reasonable to be scared about this foreign alien thing that we don’t understand.”
To be a bit bolder, I think the Shoggoth meme is reacting to the fact that RLHF gives off a misleading impression of how safe AI is. If I were to use provocative phrasing, I could say that RLHF serves as “propaganda”. Let’s put aside the fact that you and I might disagree about how much “true evidence” RLHF provides RE how easy alignment will be. It seems pretty clear to me that RLHF [and the subsequent deployment of RLHF’d models] spreads an overly-rosy “meme” that gives people a misleading perspective of how well we understand AI systems, how safe AI progress is, etc.
From this lens, I see Shoggoth as a counter-meme. It basically says “hey look, the default is for people to think that these things are friendly assistants, because that’s what the AI companies have turned them into, but we should remember that actually we are quite confused about the alien cognition behind the RLHF smiley face.”
OpenAI’s Preparedness Framework: Praise & Recommendations
If the timer starts to run out, then slap something together based on the best understanding we have. 18-24 months is about how long I expect it to take to slap something together based on the best understanding we have.
Can you say more about what you expect to be doing after you have slapped together your favorite plans/recommendations? I’m interested in getting a more concrete understanding of how you see your research (eventually) getting implemented.
Suppose after the 18-24 month process, you have 1-5 concrete suggestions that you want AGI developers to implement. Is the idea essentially that you would go to the superalignment team (and the equivalents at other labs) and say “hi, here’s my argument for why you should do X?” What kinds of implementation-related problems, if any, do you see coming up?
I ask this partially because I think some people are kinda like “well, in order to do alignment research that ends up being relevant, I need to work at one of the big scaling labs in order to understand the frames/ontologies of people at the labs, the constraints/restrictions that would come up if trying to implement certain ideas, get better models of the cultures of labs to see what ideas will simply be dismissed immediately, identify cruxes, figure out who actually makes decisions about what kinds of alignment ideas will end up being used for GPT-N, etc etc.”
My guess is that you would generally encourage people to not do this, because they generally won’t have as much research freedom & therefore won’t be able to work on core parts of the problem that you see as neglected. I suspect many would agree that there is some “I lose freedom” cost, but that this might be outweighed by the “I get better models of what kind of research labs are actually likely to implement” benefit, and I’m curious how you view this trade-off (or if you don’t even see this as a legitimate trade-off).
Does the “rater problem” (raters have systematic errors) simply apply to step one in this plan? I agree that once you have a perfect reward model, you no longer need human raters.
But it seems like the “rater problem” still applies if we’re going to train the reward model using human feedback. Perhaps I’m too anchored to thinking about things in an RLHF context, but it seems like at some point in the process we need to have some way of saying “this is true” or “this chain-of-thought is deceptive” that involves human raters.
Is the idea something like:
Eliezer: Human raters make systematic errors
OpenAI: Yes, but this is only a problem if we have human raters indefinitely provide feedback. If human raters are expected to provide feedback on 10,000,000 responses under time-pressure, then surely they will make systematic errors.
OpenAI: But suppose we could train a reward model on a modest number of responses and we didn’t have time-pressure. For this dataset of, say, 10,000 responses, we are super careful, we get a ton of people to double-check that everything is accurate, and we are nearly certain that every single label is correct. If we train a reward model on this dataset, and we can get it to generalize properly, then we can get past the “humans make systematic errors” problem.
Or am I totally off // is the idea different from this // would the “yet-to-be-worked-out techniques” involve getting the reward model to learn stuff without ever needing feedback from human raters? (A rough sketch of my reading is below.)
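For concreteness, here’s a minimal sketch of the pipeline I have in mind when I read the hypothetical above: train a reward model on a small, carefully vetted preference dataset, then use the frozen reward model in place of live human raters. Everything here (the model, the loss, the toy data) is my own illustration, not anything OpenAI has described; the pairwise preference loss is just the standard way reward models are typically trained.

```python
import torch
import torch.nn as nn


class TinyRewardModel(nn.Module):
    """Scores a (prompt, response) embedding with a single scalar reward."""

    def __init__(self, embed_dim: int = 64):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(embed_dim, 128), nn.ReLU(), nn.Linear(128, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.scorer(x).squeeze(-1)


def train_reward_model(chosen: torch.Tensor, rejected: torch.Tensor,
                       epochs: int = 200, lr: float = 1e-3) -> TinyRewardModel:
    """Step 1: fit a reward model on a small, carefully vetted preference dataset.

    `chosen` / `rejected` are embeddings of paired responses where careful human
    review (no time pressure, multiple checkers) established which one is better.
    """
    model = TinyRewardModel(chosen.shape[-1])
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        # Standard pairwise preference loss: reward(chosen) should exceed reward(rejected).
        loss = -torch.nn.functional.logsigmoid(model(chosen) - model(rejected)).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model


if __name__ == "__main__":
    torch.manual_seed(0)
    # Tiny stand-in for the ~10,000 carefully vetted labels in the hypothetical above.
    chosen = torch.randn(100, 64) + 0.5
    rejected = torch.randn(100, 64) - 0.5
    reward_model = train_reward_model(chosen, rejected)

    # Step 2: once trained (and *if* it generalizes), the frozen reward model
    # replaces live human raters when scoring new responses.
    new_response_embedding = torch.randn(1, 64)
    print("reward for a new response:", reward_model(new_response_embedding).item())
```

The open question in my comment is step two: whether a reward model trained this way actually generalizes well enough that you never need to go back to human raters.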
I’d also be curious to know why (some) people downvoted this.
Perhaps it’s because you imply that some OpenAI folks were captured, and maybe some people think that that’s unwarranted in this case?
Sadly, the more-likely explanation (IMO) is that policy discussions can easily become tribal, even on LessWrong.
I think LW still does better than most places at rewarding discourse that’s thoughtful/thought-provoking and resisting tribal impulses, but I wouldn’t be surprised if some people were doing something like “ah he is saying something Against AI Labs//Pro-regulation, and that is bad under my worldview, therefore downvote.”
(And I also think this happens the other way around as well, and I’m sure people who write things that are “pro AI labs//anti-regulation” are sometimes unfairly downvoted by people in the opposite tribe.)
I appreciate the comment, though I think there’s a lack of specificity that makes it hard to figure out where we agree/disagree (or more generally what you believe).
If you want to engage further, here are some things I’d be excited to hear from you:
What are a few specific comms/advocacy opportunities you’re excited about//have funded?
What are a few specific comms/advocacy opportunities you view as net negative//have actively decided not to fund?
What are a few examples of hypothetical comms/advocacy opportunities you’ve been excited about?
What do you think about EG Max Tegmark/FLI, Andrea Miotti/Control AI, The Future Society, the Center for AI Policy, Holly Elmore, PauseAI, and other specific individuals or groups that are engaging in AI comms or advocacy?
I think if you (and others at OP) are interested in receiving more critiques or overall feedback on your approach, one thing that would be helpful is writing up your current models/reasoning on comms/advocacy topics.
In the absence of this, people simply notice that OP doesn’t seem to be funding some of the main existing examples of comms/advocacy efforts, but they don’t really know why, and they don’t really know what kinds of comms/advocacy efforts you’d be excited about.
They mention three types of mitigations:
Asset protection (e.g., restricting access to models to a limited nameset of people, general infosec)
Restricting deployment (only models with a risk score of “medium” or below can be deployed)
Restricting development (models with a risk score of “critical” cannot be developed further until safety techniques have been applied that get the score down to “high,” although they kind of get to decide when they think their safety techniques have worked sufficiently well). (See the toy sketch after this list.)
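For my own clarity, here’s a toy restatement of the deploy/develop thresholds as I read them. The enum and function names are mine and purely illustrative; the real framework is a governance process, not code.

```python
from enum import IntEnum


class RiskScore(IntEnum):
    LOW = 0
    MEDIUM = 1
    HIGH = 2
    CRITICAL = 3


def can_deploy(post_mitigation_score: RiskScore) -> bool:
    # Only models with a post-mitigation score of "medium" or below may be deployed.
    return post_mitigation_score <= RiskScore.MEDIUM


def can_continue_development(post_mitigation_score: RiskScore) -> bool:
    # A "critical" model cannot be developed further until mitigations bring its
    # score down to "high" or below -- and note that the lab itself judges when
    # its mitigations have worked well enough.
    return post_mitigation_score <= RiskScore.HIGH


assert can_deploy(RiskScore.MEDIUM) and not can_deploy(RiskScore.HIGH)
assert can_continue_development(RiskScore.HIGH) and not can_continue_development(RiskScore.CRITICAL)
```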
My one-sentence reaction after reading the doc for the first time is something like “it doesn’t really tell us how OpenAI plans to address the misalignment risks that many of us are concerned about, but with that in mind, it’s actually a fairly reasonable document with some fairly concrete commitments.”
I believe labels matter, and I believe the label “preparedness framework” is better than the label “responsible scaling policy.” Kudos to OpenAI on this. I hope we move past the RSP label.
I think labels will matter most when communicating to people who are not following the discussion closely (e.g., tech policy folks who have a portfolio of 5+ different issues and are not reading the RSPs or PFs in great detail).
One thing I like about the label “preparedness framework” is that it prompts the question “prepared for what?”, which is exactly the kind of question I want policy people to be asking. PFs imply that there might be something scary that we are trying to prepare for.
Thanks for this overview, Trevor. I expect it’ll be helpful– I also agree with your recommendations for people to consider working at standard-setting organizations and other relevant EU offices.
One perspective that I see missing from this post is what I’ll call the advocacy/comms/politics perspective. Some examples of this with the EU AI Act:
Foundation models were going to be included in the EU AI Act, until France and Germany (with lobbying pressure from Mistral and Aleph Alpha) changed their position.
This initiated a political/comms battle between those who wanted to exclude foundation models (led by France and Germany) and those who wanted to keep it in (led by Spain).
This political fight rallied lots of notable figures, including folks like Gary Marcus and Max Tegmark, to publicly and privately fight to keep foundation models in the act.
There were open letters, op-eds, and certainly many private attempts at advocacy.
There were attempts to influence public opinion, pieces that accused key lobbyists of lying, and a lot of discourse on Twitter.
It’s difficult to know the impact of any given public comms campaign, but it seems quite plausible to me that many readers would have more marginal impact by focusing on advocacy/comms than focusing on research/policy development.
More broadly, I worry that many segments of the AI governance/policy community might be neglecting to think seriously about what ambitious comms/advocacy could look like in the space.
I’ll note that I might be particularly primed to bring this up now that you work for Open Philanthropy. I think many folks (rightfully) critique Open Phil for being too wary of advocacy, campaigns, lobbying, and other policymaker-focused activities. I’m guessing that Open Phil has played an important role in shaping both the financial and cultural incentives that (in my view) lead to an overinvestment in research and an underinvestment in policy/advocacy/comms.
(I’ll acknowledge these critiques are pretty high-level and I don’t claim that this comment provides compelling evidence for them. Also, you only recently joined Open Phil, so I’m of course not trying to suggest that you created this culture, though I guess now that you work there you might have some opportunities to change it).
I’ll now briefly try to do a Very Hard thing which is like “put myself in Trevor’s shoes and ask what I actually want him to do.” One concrete recommendation I have is something like “try to spend at least 5 minutes thinking about ways in which you or others around you might be embedded in a culture that has blind spots to some of the comms/advocacy stuff.” Another is “make a list of the people you actively read or talked to when writing this post. Then ask if there were any other people/orgs you could’ve reached out to, particularly those that might focus more on comms+advocacy.” (Also, to be clear, you might do both of these things and conclude “yea, actually I think my approach was very solid and I just had Good Reasons for writing the post the way I did.”)
I’ll stop here since this comment is getting long, but I’d be happy to chat further about this stuff. Thanks again for writing the post and kudos to OP for any of the work they supported/will support that ends up increasing P(good EU AI Act goes through & gets implemented).
What are some examples of situations in which you refer to this point?
My own model differs a bit from Zach’s. It seems to me like most of the publicly-available policy proposals have not gotten much more concrete. It feels a lot more like people were motivated to share existing thoughts, as opposed to people having new thoughts or having more concrete thoughts.
Luke’s list, for example, is more of a “list of high-level ideas” than a “list of concrete policy proposals.” It has things like “licensing” and “information security requirements”– it’s not an actual bill or set of requirements. (And to be clear, I still like Luke’s post and it’s clear that he wasn’t trying to be super concrete).
I’d be excited for people to take policy ideas and concretize them further.
Aside: When I say “concrete” in this context, I don’t quite mean “people on LW would think this is specific.” I mean “this is closer to bill text, text of a section of an executive order, text of an amendment to a bill, text of an international treaty, etc.”
I think there are a lot of reasons why we haven’t seen much “concrete policy stuff”. Here are a few:
This work is just very difficult– it’s much easier to hide behind vagueness when you’re writing an academic-style paper than when you’re writing a concrete policy proposal.
This work requires people to express themselves with more certainty/concreteness than academic-style research. In a paper, you can avoid giving concrete recommendations, or you can give a recommendation and then immediately mention 3-5 crucial considerations that could change the calculus. In bills, you basically just say “here is what’s going to happen” and do much less “and here are the assumptions that go into this and a bunch of ways this could be wrong.”
This work forces people to engage with questions that are less “intellectually interesting” to many people (e.g., which government agency should be tasked with X, how exactly are we going to operationalize Y?)
This work just has a different “vibe” to the more LW-style research and the more academic-style research. Insofar as LW readers are selected for (and reinforced for) liking a certain “kind” of thinking/writing, this “kind” of thinking/writing is different than the concrete policy vibe in a bunch of hard-to-articulate ways.
This work often has the potential to be more consequential than academic-style research. There are clear downsides of developing [and advocating for] concrete policies that are bad. Without any gatekeeping, you might have a bunch of newbies writing flawed bills. With excessive gatekeeping, you might create a culture that disincentivizes intelligent people from writing good bills. (And my own subjective impression is that the community erred too far on the latter side, but I think reasonable people could disagree here).
For people interested in developing the kinds of proposals I’m talking about, I’d be happy to chat. I’m aware of a couple of groups doing the kind of policy thinking that I would consider “concrete”, and it’s quite plausible that we’ll see more groups shift toward this over time.
Thanks for all of this! Here’s a response to your point about committees.
I agree that the committee process is extremely important. It’s especially important if you’re trying to push forward specific legislation.
For people who aren’t familiar with committees or why they’re important, here’s a quick summary of my current understanding (there may be a few mistakes):
When a bill gets introduced in the House or the Senate, it gets sent to a committee. The referral decision is made by the Speaker of the House or the presiding officer in the Senate. In practice, however, they often defer to a non-partisan “parliamentarian” who specializes in figuring out which committee would be most appropriate. My impression is that this process is actually pretty legitimate and non-partisan in most cases(?).
It takes some degree of skill to be able to predict which committee(s) a bill is most likely to be referred to. Some bills are obvious (like an agriculture bill will go to an agriculture committee). In my opinion, artificial intelligence bills are often harder to predict. There is obviously no “AI committee”, and AI stuff can be argued to affect multiple areas. With all that in mind, I think it’s not too hard to narrow things down to ~1-3 likely committees in the House and ~1-3 likely committees in the Senate.
The most influential person in the committee is the committee chair. The committee chair is the highest-ranking member from the majority party (so in the House, all the committee chairs are currently Republicans; in the Senate, all the committee chairs are currently Democrats).
A bill cannot be brought to the House floor or the Senate floor (cannot be properly debated or voted on) until it has gone through committee. The committee is responsible for finalizing the text of the bill and then voting on whether or not they want the bill to advance to the chamber (House or Senate).
The committee chair typically has a lot of influence over the committee. The committee chair determines which bills get discussed in committee, for how long, etc. Also, committee chairs usually have a lot of “soft power”– members of Congress want to be in good standing with committee chairs. This means that committee chairs often have the ability to prevent certain legislation from getting out of committee.
If you’re trying to get legislation passed, it’s ideal to have the committee chair think favorably of that piece of legislation.
It’s also important to have at least one person on the committee who is willing to “champion” the bill. This means they view the bill as a priority & are willing to say “hey, committee, I really think we should be talking about bill X.” A lot of bills die in committee because they were simply never prioritized.
If the committee chair brings the bill to a vote, and the majority of committee members vote in favor of the bill moving to the chamber, the bill can be discussed in the full chamber. Party leadership (Speaker of the House, Senate Majority Leader, etc.) typically plays the most influential role in deciding which bills get discussed or voted on in the chambers.
Sometimes, bills get referred to multiple committees. This generally seems like “bad news” from the perspective of getting the bill passed, because it means that the bill has to get out of multiple committees. (Any single committee could essentially prevent the bill from being discussed in the chamber).
(If any readers are familiar with the committee process, please feel free to add more info or correct me if I’ve said anything inaccurate.)
WTF do people “in AI governance” do?
Quick answer:
A lot of AI governance folks primarily do research. They rarely engage with policymakers directly, and they spend much of their time reading and writing papers.
This was even more true before the release of GPT-4 and the recent wave of interest in AI policy. Before GPT-4, many people believed “you will look weird/crazy if you talk to policymakers about AI extinction risk.” It’s unclear to me how true this was (in a genuine “I am confused about this & don’t think I have good models of this” way). Regardless, there has been an update toward talking to policymakers about AI risk now that AI risk is a bit more mainstream.
My own opinion is that, even after this update toward policymaker engagement, the community as a whole is still probably overinvested in research and underinvested in policymaker engagement/outreach. (Of course, the two can be complementary, and the best outreach will often be done by people who have good models of what needs to be done & can present high-quality answers to the questions that policymakers have).
Among the people who do outreach/policymaker engagement, my impression is that there has been more focus on the executive branch (and less on Congress/congressional staffers). The main advantage is that the executive branch can get things done more quickly than Congress. The main disadvantage is that Congress is often required (or highly desired) to make “big things” happen (e.g., setting up a new agency or a licensing regime).
If memory serves me well, I was informed by Hendrycks’ overview of catastrophic risks. I don’t think it’s a perfect categorization, but I think it does a good job laying out some risks that feel “less speculative” (e.g., malicious use, race dynamics as a risk factor that could cause all sorts of threats) while including those that have been painted as “more speculative” (e.g., rogue AIs).
I’ve updated toward the importance of explaining & emphasizing risks from sudden improvements in AI capabilities, AIs that can automate AI research, and intelligence explosions. I also think there’s more appetite for that now than there used to be.