I’m researching and advocating for policy to prevent catastrophic risk from AI at https://www.aipolicy.us/.
I’m broadly interested in AI strategy and want to figure out the most effective interventions to get to good AI outcomes.
I personally benefitted tremendously from the Lightcone offices, especially when I was there over the summer during SERI MATS. Being able to talk to lots of alignment researchers and other aspiring alignment researchers increased my subjective rate of alignment upskilling by >3x relative to before, when I was in an environment without other alignment people.
Thanks so much to the Lightcone team for making the office happen. I’m sad (emotionally, not making a claim here whether it was the right decision or not) to see it go, but really grateful that it existed.
Incorrect: OpenAI is not aware of the risks of race dynamics.
OpenAI’s Charter contains the following merge-and-assist clause: “We are concerned about late-stage AGI development becoming a competitive race without time for adequate safety precautions. Therefore, if a value-aligned, safety-conscious project comes close to building AGI before we do, we commit to stop competing with and start assisting this project. We will work out specifics in case-by-case agreements, but a typical triggering condition might be “a better-than-even chance of success in the next two years.””
Being worried about race dynamics and then stopping at the last minute makes sense and seems a lot better than nothing. But I’m confused why this understanding doesn’t propagate to other beliefs/actions.
Specifically, below are some confusions I have with OpenAI’s worldview. If answered, these could give me a lot more hope in OpenAI’s direction.
How will you know that AGI has a >50% chance of success in the next two years? MIRI certainly seems to think this is hard.
How does OpenAI leadership feel about accelerating timelines? [1]
What are OpenAI leadership’s timelines right now? What are these timelines based off of?
Does OpenAI retroactively think that publishing that GPT-3 worked was a mistake? [2][3]
Will OpenAI publish GPT-4? What are the factors driving this decision?
On my models, we want to know as much about alignment as possible before we get close to AGI, so it is incredibly important to have as much time as possible before we are close to AGI. I would much rather live in a world where we have 20 years to solve AI alignment than one where we only have 10.
If the benefit is using GPT-3 to do alignment research, why not give it just to alignment researchers and not tell anyone else?
Again, according to my current worldview, actions such as releasing GPT-3 are extremely negative, because they tell everyone that LLMs work, which accelerates capabilities progress and shortens timelines.
Just made a fairly large edit to the post after lots of feedback from commenters. My most recent changes include the following:
Note limitations in introduction (lacks academics, depth of coverage not proportional to the number of people involved, not endorsed by the researchers described)
Update CLR as per Jesse’s comment
Add FAR
Update brain-like AGI to include this.
Rewrite shard theory section
Brain <-> shards
effort: 50 → 75 hours :)
Add some academics (David Krueger, Sam Bowman, Jacob Steinhardt, Dylan Hadfield-Menell, FHI)
Add other category
Summary table updates:
Update links in table to make sure they work.
Add scale of organization
Add people
Thank you to everyone who commented, it has been very helpful.
That makes sense. For me:
Background: I graduated from the University of Michigan this spring, where I majored in Math and CS. In college I worked on vision research for self-driving cars and wrote my undergrad thesis on robustness (my LinkedIn). I spent a lot of time running the EA group at Michigan. I’m currently doing SERI MATS under John Wentworth.
Research taste: currently very bad and confused and uncertain. I want to become better at research and this is mostly why I am doing MATS right now. I guess I especially enjoy reading and thinking about mathy research like Infra-Bayesianism and MIRI embedded agency stuff, but I’ll be excited about whatever research I think is the most important.
I’m pretty new to interacting with the alignment sphere (before this summer I had just read things online and taken AGISF). Who I’ve interacted with (I’m probably forgetting some, but this gives a rough idea):
1 conversation with Andrew Critch
~3 conversations with people at each of Conjecture and MIRI
~8 conversations with various people at Redwood
Many conversations with people who hang around Lightcone, especially John and other SERI MATS participants (including Team Shard)
This summer, when I started talking to alignment people, I had a massive rush of information, so this was initially just a Google Doc of notes to organize my thoughts and figure out what people were doing. I then polished and published it after some friends encouraged me to. I emphasize that nothing I write in the opinion section is a strongly held belief; I am still deeply confused about a lot of things in alignment. I’m hoping that by posting this more publicly I can also get feedback / perspectives from others who are not in my social sphere right now.
It’s worth noting that this threshold (and the others) is in place because we need a concrete legal definition of frontier AI, not because it exactly pins down which AI models are capable of catastrophe. It’s probable that none of the current models are capable of catastrophe. We want a sufficiently inclusive definition such that the licensing authority has legal power over any model that could be catastrophically risky.
That being said, Llama 2 is currently the best open-source model, and it gets 68.9% on the MMLU. It seems relatively unimportant to regulate models below Llama 2, because anyone who wanted to use such a model could just use Llama 2 instead. Conversely, models above Llama 2’s capabilities are at the point where it seems plausible that they could be bootstrapped into something dangerous. Thus, our threshold was set just above Llama 2’s level.
Of course, by the time this regulation would pass, newer open-source models are likely to come out, so we could potentially set the bar higher.
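To make the “just above the best open-source model” rule concrete, here is a minimal sketch, assuming Llama 2’s 68.9% MMLU as the reference point; the function name and the MARGIN buffer are made up for illustration, and the real threshold would be a policy judgment, not a formula:

```python
# Illustrative sketch only: the actual threshold-setting is a policy judgment.
# 68.9% is Llama 2's MMLU score from the discussion above; MARGIN is hypothetical.

BEST_OPEN_SOURCE_MMLU = 68.9   # current best open-source model (Llama 2)
MARGIN = 1.1                   # made-up buffer so the bar sits just above it


def capability_threshold(best_open_source_mmlu: float = BEST_OPEN_SOURCE_MMLU) -> float:
    """Set the regulatory MMLU threshold just above the best open-source model.

    Regulating below this point buys little, since anyone could use the open
    model instead; if a stronger open-source model is released before the
    regulation passes, the bar moves up with it.
    """
    return best_open_source_mmlu + MARGIN


print(capability_threshold())  # ~70.0 with the numbers above
```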
The first line of defence is to avoid training models that have sufficient dangerous capabilities and misalignment to pose extreme risk. Sufficiently concerning evaluation results should warrant delaying a scheduled training run or pausing an existing one
It’s very disappointing to me that this sentence doesn’t say “cancel”. As far as I understand, most of the people on this paper agree that we do not have alignment techniques to align superintelligence. Therefore, if the model evaluations predict an AI that is sufficiently smarter than humans, the training run should be cancelled.
I’m a guest fund manager for the LTFF, and wanted to say that my impression is that the LTFF is often pretty excited about giving people ~6 month grants to try out alignment research at 70% of their industry counterfactual pay (the reason for the 70% is basically to prevent grift). Then, the LTFF can give continued support if they seem to be doing well. If getting this funding would make you excited to switch into alignment research, I’d encourage you to apply.
I also think that there’s a lot of impactful stuff to do for AI existential safety that isn’t alignment research! For example, I’m quite into people doing strategy, policy outreach to relevant people in government, actually writing policy, capability evaluations, and leveraged community building like CBAI.
Some claims I’ve been repeating in conversation a bunch:
Safety work (I claim) should be focused on one of the following:
1. CEV-style full value loading, to deploy a sovereign
2. A task AI that contributes to a pivotal act or pivotal process.
I think that pretty much no one is working directly on 1. I think that a lot of safety work is indeed useful for 2, but in this case, it’s useful to know what pivotal process you are aiming for. Specifically, why aren’t you just directly working to make that pivotal act/process happen? Why do you need an AI to help you? Typically, the response is that the pivotal act/process is too difficult to be achieved by humans. In that case, you are pushing into a difficult capabilities regime—the AI has some goals that do not equal humanity’s CEV, and so has a convergent incentive to powerseek and escape. With enough time or intelligence, you therefore get wrecked, but you are trying to operate in this window where your AI is smart enough to do the cognitive work, but is ‘nerd-sniped’ or focused on the particular task that you like. In particular, if this AI reflects on its goals and starts thinking big picture, you reliably get wrecked. This is one of the reasons that doing alignment research seems like a particularly difficult pivotal act to aim for.
(My take on the reflective stability part of this)
The reflective equilibrium of a shard theoretic agent isn’t a utility function weighted according to each of the shards, it’s a utility function that mostly cares about some extrapolation of the (one or very few) shard(s) that were most tied to the reflective cognition.
It feels like a ‘let’s do science’ or ‘powerseek’ shard would be a lot more privileged, because these shards will be tied to the internal planning structure that ends up doing reflection for the first time.
There’s a huge difference between “Whenever I see ice cream, I have the urge to eat it”, and “Eating ice cream is a fundamentally morally valuable atomic action”. The former roughly describes one of the shards that I have, and the latter is something that I don’t expect to see in my CEV. Similarly, I imagine that a bunch of the safety properties will look more like these urges, because the shards will be relatively weak things that are bolted on to the main part of the cognition, not things that bid on the intelligent planning part. The non-reflectively endorsed shards will be seen as arbitrary code attached to the mind that the reflectively endorsed shards have to plan around (similar to how I see my “Whenever I see ice cream, I have the urge to eat it” shard).
In other words: there is convergent pressure for CEV-content integrity, but that does not mean that the current way of making decisions (e.g. shards) is close to the CEV optimum, and the shards will choose to self modify to become closer to their CEV.
I don’t feel epistemically helpless here either, and would love a theory of which shards get preserved under reflection.
My take on the salient effects:
Shorter timelines → increased accident risk from not having solved the technical problem yet, decreased misuse risk, slower takeoffs
Slower takeoffs → decreased accident risk because of iteration to solve the technical problem, increased race / economic pressure to deploy unsafe models
Given that most of my risk profile is dominated by a) not having solved the technical problem yet, and b) race / economic pressure to deploy unsafe models, I’m tentatively in the long timelines + fast takeoff quadrant as being the safest.
(ETA: these are my personal opinions)
Notes:
We’re going to make sure to exempt existing open source models. We’re trying to avoid pushing the frontier of open source AI, not trying to put the models that are already out there back in the box, which I agree is intractable.
These are good points, and I decided to remove the data criterion for now in response to these considerations.
The definition of frontier AI is wide because it describes the set of models that the administration has legal authority over, not the set of models that would be restricted. The point of this is to make sure that any model that could be dangerous would be included in the definition. Some non-dangerous models will be included, because of the difficulty with predicting the exact capabilities of a model before training.
We’re planning to shift to recommending a tiered system in the future, where the systems in the lower tiers have a reporting requirement but not a licensing requirement.
In order to mitigate the downside of including too many models, we have a fast track exemption for models that are clearly not dangerous but technically fall within the bounds of the definition.
I don’t expect this to impact the vast majority of AI developers outside the labs. I do think that open sourcing models at the current frontier is dangerous and want to prevent future extensions of the bar. Insofar as that AI development was happening on top of models produced by the labs, it would be affected.
The thresholds are a work in progress. I think it’s likely that they’ll be revised significantly throughout this process. I appreciate the input and pushback here.
Your current threshold does include all Llama models (other than llama-1 6.7/13 B sizes), since they were trained with > 1 trillion tokens.
Yes, this reasoning was for capabilities benchmarks specifically. Data goes further with future algorithmic progress, so I thought a narrower criterion for that one was reasonable.
I also think 70% on MMLU is extremely low, since that’s about the level of ChatGPT 3.5, and that system is very far from posing a risk of catastrophe.
This is the threshold above which the government has the ability to say no, and it is deliberately set well before the point of catastrophe.
I also think that, in a world where we try to create a global shutdown of AI progress, one route towards AGI is building up capabilities on top of whatever the best open-source model is, so I’m hesitant to give up the government’s ability to prevent the capabilities of the best open-source model from going up.
The cutoffs also don’t differentiate between sparse and dense models, so there’s a fair bit of non-SOTA-pushing academic / corporate work that would fall under these cutoffs.
Thanks for pointing this out. I’ll think about whether there’s a way to exclude sparse models, though I’m not sure if it’s worth the added complexity and potential for loopholes. I’m not sure how many models fall into this category; do you have a sense? This aggregation of models has around 40 models above the 70B threshold.
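For concreteness, here is a rough sketch (my paraphrase, not the proposal’s actual text) of how the deliberately over-inclusive definition discussed in this thread could be checked, using the thresholds mentioned above (~70B parameters, >1 trillion training tokens, ~70% MMLU); the class and function names and exact comparisons are hypothetical, and, as noted, there is no sparse/dense distinction yet:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Model:
    params_billions: float                 # total parameter count (no sparse/dense distinction)
    training_tokens_trillions: float
    mmlu_percent: Optional[float] = None   # None if not benchmarked


def falls_under_frontier_definition(m: Model) -> bool:
    """A model counts as 'frontier AI' for licensing purposes if it crosses
    ANY of the thresholds discussed in this thread. The definition is
    deliberately over-inclusive; a fast-track exemption handles models that
    are clearly not dangerous, and existing open-source models are exempted."""
    return (
        m.params_billions >= 70
        or m.training_tokens_trillions >= 1
        or (m.mmlu_percent is not None and m.mmlu_percent >= 70)
    )


# Llama 2 70B (~2T training tokens, 68.9% MMLU) crosses the parameter and
# token thresholds, so it would fall under the definition.
print(falls_under_frontier_definition(Model(70, 2, 68.9)))  # True
```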
Disclaimer: writing quickly.
Consider the following path:
A. There is an AI warning shot.
B. Civilization allocates more resources for alignment and is more conservative pushing capabilities.
C. This reallocation is sufficient to solve and deploy aligned AGI before the world is destroyed.
I think that a warning shot is unlikely (P(A) < 10%), but won’t get into that here.
I am guessing that P(B | A) is the biggest crux. The OP primarily considers the ability of governments to implement policy that moves our civilization further from AGI ruin, but I think that the ML community is both more important and probably significantly easier to shift than government. I basically agree with this post as it pertains to government updates based on warning shots.
I anticipate that a warning shot would get most capabilities researchers to a) independently think about alignment failures, including the ones their own models might cause, and b) take the EA/LessWrong/MIRI/alignment sphere’s worries a lot more seriously. My impression is that OpenAI is currently much more worried about misuse risk than accident risk: if alignment is easy, then the composition of the lightcone is primarily determined by the values of the AGI designers. Right now, there are ~100 capabilities researchers vs ~30 alignment researchers at OpenAI. I think a warning shot would dramatically update them towards worrying about accident risk, and therefore I anticipate that OpenAI would drastically shift most of their resources to alignment research. I would guess P(B|A) ~= 80%.
P(C | A, B) primarily depends on alignment difficulty, of which I am pretty uncertain, and also on how large the reallocation in B is, which I anticipate to be pretty large. The bar for destroying the world gets lower and lower every year, but this reallocation would give us a lot more time, and I think we get several years of AGI capability before we deploy it. I’m estimating P(C | A, B) ~= 70%, but this is very low resilience.
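Multiplying the rough numbers above together (a back-of-the-envelope sketch; treating P(A) as 10%, the upper end of my estimate, and the variable names are just labels):

```python
# Back-of-the-envelope combination of the estimates above.
p_warning_shot = 0.10      # P(A): upper end of "< 10%"
p_reallocation = 0.80      # P(B | A)
p_solved_in_time = 0.70    # P(C | A, B), very low resilience

p_path = p_warning_shot * p_reallocation * p_solved_in_time
print(f"P(A and B and C) <= {p_path:.1%}")  # <= 5.6%
```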
I’ve changed the wording to “Only a few technical labs (OpenAI, DeepMind, Meta, etc) and people working with their models would be regulated currently.” The point of this sentence is to emphasize that this definition still wouldn’t apply to the vast majority of AI development—most AI development uses small systems, e.g. image classifiers, self driving cars, audio models, weather forecasting, the majority of AI used in health care, etc.
Thinking about ethics.
After thinking more about orthogonality I’ve become more confident that one must go about ethics in a mind-dependent way. If I am arguing about what is ‘right’ with a paperclipper, there’s nothing I can say to them to convince them to instead value human preferences or whatever.
I used to be a staunch moral realist, mainly relying on very strong intuitions against nihilism and then arguing something like: not nihilism → moral realism. I now reject the implication, and think that both 1) there is no universal, objective morality, and 2) things matter.
My current approach is to think of “goodness” in terms of what CEV-Thomas would think of as good. Moral uncertainty, then, is uncertainty over what CEV-Thomas thinks. CEV is necessary to get morality out of a human brain, because the brain’s morality is currently a bundle of contradictory heuristics. However, my moral intuitions still give bits about goodness. Other people’s moral intuitions also give some bits about goodness, because of how similar their brains are to mine, so I should weight other people’s beliefs in my moral uncertainty.
Ideally, I should trade with other people so that we both maximize a joint utility function, instead of each of us maximizing our own utility function. In the extreme, this looks like ECL. For most people, I’m not sure that this reasoning is necessary, however, because their intuitions might already be priced into my uncertainty over my CEV.
Current impressions of free energy in the alignment space.
Outreach to capabilities researchers. I think that getting people who are actually building the AGI to be more cautious about alignment / racing makes a bunch of things like coordination agreements possible, and also increases the operational adequacy of the capabilities lab.
One of the reasons people don’t like this is because historically outreach hasn’t gone well, but I think the reason for this is that mainstream ML people mostly don’t buy “AGI big deal”, whereas lab capabilities researchers buy “AGI big deal” but not “alignment hard”.
I think people at labs running retreats, 1-1s, and alignment presentations within their labs are all great ways to do this.
I’m somewhat unsure about this one because of downside risk, and also because ‘convince people of X’ is fairly uncooperative and bad for everyone’s epistemics.
Conceptual alignment research addressing the hard part of the problem. This is hard to transition into without a bunch of upskilling, but if the SLT hypothesis is right, there are a bunch of key problems that mostly go unassailed, and so there’s a lot of low-hanging fruit there.
Strategy research on the other low-hanging fruit in the AI safety space. Ideally, the product of this research would be a public quantitative model of which interventions are effective and why. The path to impact here is finding low-hanging fruit and pointing it out so that people can act on it.
my current best guess is that gradient descent is going to want to make our models deceptive
Can you quantify your credence in this claim?
Also, how much optimization pressure do you think that we will need to make models not deceptive? More specifically, how would your credence in the above change if we trained with a system that exerted 2x, 4x, … optimization pressure against deception?
If you don’t like these or want a more specific operationalization of this question, I’m happy with whatever you think is likely or filling out more details.
Agree with both aogara and Eli’s comment.
One caveat would be that papers probably don’t have full explanations of the x-risk motivation or applications of the work, but that’s reading between the lines that AI safety people should be able to do themselves.
For me this reading between the lines is hard: I spent ~2 hours reading academic papers/websites yesterday, and while I could quite quickly summarize the work itself, it was quite hard for me to figure out the motivations.
Sorry about that, and thank you for pointing this out.
For now I’ve added a disclaimer (footnote 2 right now; I might make this more visible/clear, but I’m not sure what the best way of doing that is). I will try to add in a summary of some of these groups once I have read some of their papers; currently I have not read much of their research.
Edit: agree with Eli’s comment.
This doesn’t feel right to me; off the top of my head, it seems like most of the field is just trying to make progress. Most of those that aren’t are pretty explicit about not trying to solve alignment, and I’m also excited about most of their projects. I’d guess 10-20% of the field is in the “make alignment seem legit” camp. My rough categorization:
Make alignment progress:
Anthropic Interp
Redwood
ARC Theory
Conjecture
MIRI
Most independent researchers that I can think of (e.g. John, Vanessa, Steven Byrnes, the MATS people I know)
Some of the safety teams at OpenAI/DM
Aligned AI
Team Shard
make alignment seem legit:
CAIS (safe.ai)
Anthropic scaring laws
ARC Evals (arguably, but it seems like this isn’t quite the main aim)
Some of the safety teams at OpenAI/DM
Open Phil (I think I’d consider Cold Takes to be doing this, but it doesn’t exactly brand itself as alignment research)
What am I missing? I would be curious which projects you feel this way about.