ThomasW
Thanks so much for writing this! I think it’s a very useful resource to have. I wanted to add a few thoughts on your description of CAIS, which might help make it more accurate.
[Note: I worked full time at CAIS from its inception until a couple weeks ago. I now work there on a part time basis while finishing university. This comment hasn’t been reviewed by others at CAIS, but I’m pretty confident it’s accurate.]
For somebody external to CAIS, I think you did a fairly good job describing the organization so thank you! I have a couple things I’d probably change:
First, our outreach is not just to academics, but also to people in industry. We usually use the term “ML community” rather than “academia” for this reason.
Second, the technical research side of the organization is about a lot more than robustness. We do research in Trojans as you mention, which isn’t robustness, but also in machine ethics, cooperative AI, anomaly detection, forecasting, and probably more areas soon. We are interested in most of the areas in Open Problems in AI X-Risk, but the extent to which we’re actively working on them varies.
I also think it might be good to add our newly-announced (so maybe after you wrote the post) Philosophy Fellowship, which focuses on recruiting philosophers to study foundational conceptual problems in AI risk. This might correct a misconception that CAIS isn’t interested in conceptual research; we very much are, but of a different flavor than some others, which I would broadly characterize as “more like philosophy, less like math”.
Also, there is no way you would have known about this since we’ve never said it publicly anywhere, but we also intend to build out compute and research engineering infrastructure specifically for academics, who often don’t have funding for compute and, even if they do, don’t have the support necessary to leverage it. Building out a centralized way for safety academics to access compute and engineering support would create economies of scale (especially for compute contracts and compute infrastructure). However, these plans are in early stages.
Another fieldbuilding effort maybe worth mentioning is ML Safety Scholars.
In general, here is how I personally describe the theory of change for CAIS. This hasn’t been reviewed by anyone, and I don’t know how much Dan personally likes it, but it’s how I think of it. It’s also not very polished, sorry. Anyway, to me there are three major forms of research:
Philosophizing. Many AI safety problems are still very undefined. We need people to think about the properties of possible systems at a high level and tease out relevant considerations and possible solutions. This is exactly what philosophers do and why we are interested in the program above. Without this kind of conceptual research, it’s very difficult to figure out concrete problems to work on.
Concretization. It does us no good if the ideas generated in philosophizing are never concretized. Part of this is because no amount of thinking can substitute for real experimentation and implementation. Part of this is because it won’t be long before we really need progress: we can’t afford to just philosophize. Concretization involves taking the high level ideas and implementing something that usefully situates them in empirical systems. Benchmarks are an example of this.
Iterative improvements. Once an idea is concretized, the initial concretization is likely not optimal. We need people to make tweaks and make the initial methods better at achieving their aims, according to the concretized ideas. Most papers produced by the broader ML community are iterative improvements.
CAIS intends to be the glue that integrates all three of these areas. Through our philosophy fellowship program, we will train philosophers to do useful conceptual research while working in close proximity with ML researchers. Most of our ML research focuses on building foundational methods and benchmarks that can take fuzzy problems and concretize them. Lastly, we see our fieldbuilding effort as very much driving iterative improvements: who better to make iterative improvements on well-defined safety problems than the ML community? They have shown themselves to be quite good at this when it comes to general capabilities.
For a more in depth look at our research theory of impact, I suggest Pragmatic AI Safety.
Edit: I realized your post made me actually write things up that I hadn’t before, because I thought it would likely be more accurate than the (great for an outsider!) description that you had written. This strikes me as a very positive outcome of this post, and I hope others who feel their descriptions miss something will do the same!
Hi Jan, I appreciate your feedback.
I’ve been helping out with this and I can say that the organizers are working as quickly as possible to verify and publish new signatures. New signatures have been published since the launch, and additional signatures will continue to be published as they are verified. There is a team of people working on it right now, and there has been since launch.
The main obstacles to extremely swift publication are:
First, determining who meets our bar for name publication. We think the letter will have greater authority (and coordination value) if all names are above a certain bar, and so some effort needs to be put into determining whether signatories meet that bar.
Second, as you mention, verification. Prior to launch, CAIS built an email verification system that ensures signatories must verify their work emails in order for their signatures to be valid. However, this has required some tweaks, such as making the emails more attention-grabbing and adding language on the form itself that makes clear people should expect an email (before these tweaks, some people weren’t verifying their emails).
Lastly, even with verification, some submissions are still possibly fake (from email addresses that we aren’t sure are the real person) and need to be further assessed.
These are all obstacles that simply require time to address, and the team is working around the clock. In fact, I’m writing this comment on their behalf so that they can focus on the work they’re doing. We will publish all noteworthy signatures as quickly as we can, which should be within a matter of days (as I said above, some have already been published and this is ongoing). We do take your feedback that perhaps we should have hired more people so that verification could be swifter.
In response to your feedback, we have just added language in the form and email that makes clear signatures won’t show up immediately so that we can verify them. This might seem very obvious, but when you are running something with as many moving parts as this entire process has had, it is easy to miss things.
Thank you again for your feedback.
I appreciate this comment, because I think anyone who is trying to do these kinds of interventions needs to be constantly vigilant about exactly what you are mentioning. I am not excited about loads of inexperienced people suddenly trying to do big things in AI strategy, because the downsides can be so high. Even people I trust are likely to make a lot of miscalculations. And the epistemics can be bad.
I wouldn’t be excited about (for example) retreats with undergrads to learn about “how you can help buy more time.” I’m not even sure of the sign of interventions people I trust pursue, let alone people with no experience who have barely thought about the problem. As somebody who is very inexperienced but interested in AI strategy, I will note that you do have to start somewhere to learn anything.
That said—and I don’t think you dispute this—we cannot afford to leave strategy/fieldbuilding/policy off the table. In my view, a huge part of making AI go well is going to depend on forces beyond the purely technical, but I understand that this varies depending on one’s threat models. Sociotechnical systems are very difficult to influence, and it’s easy to influence them in an incorrect direction. Not everyone should try to do this, and even people who are really smart and thinking clearly may have negative effects. But I think we still have to try.
I want to push back on your suggestion that safety culture is not relevant. I agree that being vaguely “concerned” does seem not very useful. But safety culture seems very important. Things like (paraphrased from this paper):
Preoccupation with failure, especially black swan events and unseen failures.
Reluctance to simplify interpretations and explain failures using only simplistic narratives.
Sensitivity to operations, which involves closely monitoring systems for unexpected behavior.
Commitment to resilience, which means being rapidly adaptable to change and willing to try new ideas when faced with unexpected circumstances.
Under-specification of organizational structures, where new information can travel throughout the entire organization rather than relying only on fixed reporting chains.
In my view, having this kind of culture (which is analogous to having a security mindset) proliferate would be a nearly unalloyed good. Notice there is nothing here that necessarily says “slow down”—one failure mode with telling the most safety-conscious people to simply slow down is, of course, that less safety-conscious people don’t. Rather, awareness, understanding, and taking safety seriously seem robustly positive regardless of the strategic situation. Doing as much as possible to extract the old-fashioned “move fast and break things” ethos and replace it with a safety-oriented ethos would be very helpful (though I will emphasize, it will not solve everything).
Lastly, regarding this:
with people walking away thinking that AI Alignment people want to teach AIs how to reproduce moral philosophy, or that OpenAIs large language model have been successfully “aligned”, or that we just need to throw some RLHF at the problem and the AI will learn our values fine
I’m curious if you think that these topics have no relevance to alignment, or whether you’re saying it’s problematic that people come away thinking that these things are the whole picture? Because I see all of these as relevant (to varying degrees), but certainly not sufficient or the whole picture.
I’m not a big supporter of RLHF myself, but my steelman is something like:
RLHF is a pretty general framework for getting a system to optimize for something that can’t clearly be defined. If the feedback were just “if a human looked at this for a second, would they like it?”, this does provide a reward signal towards deception, but also towards genuinely useful behavior. You can take steps to reduce the deception component, for example by letting the humans red team the model or do some kind of transparency; this can all theoretically fit in the framework of RLHF. One could try to make the human feedback as robust as possible by adding all sorts of bells and whistles, and this would improve reliability and reduce the deception reward signal. It could be argued this still isn’t sufficient because the model will still find some way around it, but beware too much asymptotic reasoning.
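To make the steelman concrete, here is a minimal toy sketch of the kind of loop I have in mind. Everything here (the linear reward model, the feature setup, the names) is an illustrative assumption on my part, not any lab’s actual implementation: a reward model is fit to pairwise human preferences, and the “policy” step then favors outputs the learned reward scores highly.

```python
import numpy as np

rng = np.random.default_rng(0)

def reward_model(features, w):
    # Scalar reward as a linear function of output features
    # (toy stand-in for a learned reward model).
    return features @ w

def fit_reward_model(pairs, lr=0.1, steps=200):
    # Fit w so preferred outputs outscore rejected ones
    # (Bradley-Terry / logistic preference loss).
    dim = pairs[0][0].shape[0]
    w = np.zeros(dim)
    for _ in range(steps):
        for preferred, rejected in pairs:
            margin = reward_model(preferred, w) - reward_model(rejected, w)
            p = 1.0 / (1.0 + np.exp(-margin))  # P(human prefers "preferred")
            w += lr * (1.0 - p) * (preferred - rejected)  # gradient ascent on log-likelihood
    return w

# Toy "human feedback": pairs of (features of preferred output, features of rejected output).
pairs = [(rng.normal(size=4) + 1.0, rng.normal(size=4)) for _ in range(50)]
w = fit_reward_model(pairs)

# Stand-in for policy improvement: among candidate outputs, pick the one the
# learned reward model scores highest. Red teaming, transparency checks, etc.
# would go into how the preference pairs are collected, not this selection step.
candidates = rng.normal(size=(10, 4))
best = candidates[np.argmax(reward_model(candidates, w))]
print("learned reward weights:", np.round(w, 2))
```

The point of the sketch is only that “human feedback” is a very flexible reward signal: the same loop accommodates naive one-second judgments or heavily red-teamed, assisted evaluations, and the quality of the resulting incentives depends almost entirely on that data-collection step.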
I personally do not find RLHF very appealing, since I think it complicates things unnecessarily and is too correlated with capabilities at the moment. I prefer approaches that actually try to isolate the things people care about (i.e. their values), add some level of uncertainty (moral uncertainty, etc.), try to make these proxies as robust as possible, and make them adaptive to changes in the models that are constantly trying to exploit and Goodhart them.
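Here is a toy sketch of what I mean by adding uncertainty to the proxies (again, purely illustrative and my own framing, not a published method): keep an ensemble of learned value proxies and optimize a conservative lower bound of their scores, so a candidate that exploits any single proxy is penalized by disagreement among the others.

```python
import numpy as np

rng = np.random.default_rng(1)

def conservative_score(candidate, proxy_weights, k=1.0):
    # Score a candidate under every proxy in the ensemble, then take
    # mean minus k * std dev: an uncertainty-penalized value estimate.
    scores = proxy_weights @ candidate
    return scores.mean() - k * scores.std()

# Hypothetical setup: 5 noisy proxies for "what people care about", 4 features each.
proxy_weights = np.ones((5, 4)) + 0.3 * rng.normal(size=(5, 4))
candidates = rng.normal(size=(20, 4))

# A candidate that games one proxy but not the others produces a large std dev
# across the ensemble and is therefore penalized, which blunts Goodharting on
# any single proxy.
best = max(candidates, key=lambda c: conservative_score(c, proxy_weights))
print("chosen candidate:", np.round(best, 2))
```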
Dan Hendrycks (note: my employer) and Mantas Mazeika have recently covered some similar ideas in their paper X-Risk Analysis for AI Research, which is aimed at the broader ML research community. Specifically, they argue for the importance of having minimal capabilities externalities while doing safety research, and they address some common counterarguments to this view, including ones similar to those you’ve described. The reasons they give are somewhat different from your serial/parallel characterization, so I think they serve as a good complement. The relevant part is here.
There are things that are robustly good in the world, and things that are good on highly specific inside-view models and terrible if those models are wrong. Slowing dangerous tech development seems like the former, whereas forwarding arms races for dangerous tech between world superpowers seems more like the latter.
It may seem the opposite to some people. For instance, my impression is that for many adjacent to the US government, “being ahead of China in every technology” would be widely considered robustly good, and nobody would question you at all if you said that was robustly good. Under this perspective the idea that AI could pose an existential risk is a “highly specific inside-view model” and it would be terrible if we acted on the model and it is wrong.
I don’t think your readers will mostly think this, but I actually think a lot of people would, which for me makes this particular argument seem entirely subjective and thus suspect.
At the time of this post, the FLI letter has been signed by 1 OpenAI research scientist, 7 DeepMind research scientists/engineers, and 0 Anthropic employees.
“1 OpenAI research scientist” felt weird to me on priors. 0 makes sense, if the company gave some guidance (e.g. legal) to not sign, or if the unanimous opinion was that it’s a bad idea to sign. 7 makes sense too—it’s about what I’d expect from DeepMind and shows that there’s a small contingent of people really worried about risk. Exactly 1 is really weird—there are definitely multiple risk conscious people at OpenAI, but exactly one of them decided to sign?
I see a “Yonas Kassa” listed as an OpenAI research scientist, but it’s very unclear who this person is. I don’t see any LinkedIn or Google Scholar profile with this name associated with OpenAI. I know many of the earlier signatures were inaccurate, so I wonder if this one is too.
Anyway, my guess is that actually zero OpenAI researchers have signed, and that both OpenAI and Anthropic employees have decided (as a collective? because of a top-down directive? for legal reasons? I have no idea) not to sign.
I find myself surprised/confused at his apparent surprise/confusion.
Jan doesn’t indicate that he’s extremely surprised or confused? He just said he doesn’t know why this happens. There’s a difference between being unsurprised by something (e.g. by having observed something similar before) and actually knowing why it happens. To give a trivial example, hunter-gatherers from 10,000 BC would not have been surprised if a lightning strike caused fire, but would be quite clueless (or incorrect) as to why or how this happens.
I think Quintin’s answer is a good possible hypothesis (though of course it leads to the further question of how LLMs learn language-neutral circuitry).
we should be very sceptical of interventions whose first-order effects aren’t promising.
This seems reasonable, but I think this suspicion is currently applied too liberally. In general, it seems like second-order effects are often very large. For instance, some AI safety research is currently funded by a billionaire whose path to impact on AI safety was to start a cryptocurrency exchange. I’ve written about the general distaste for diffuse effects and how that might be damaging here; if you disagree I’d love to hear your response.
In general, I don’t think it makes sense to compare policy to plastic straws, because I think you can easily crunch the numbers and find that plastic straws are a very small part of the problem. I don’t think it’s even remotely the case that “policy is a very small part of the problem” since in a more cooperative world I think we would be far more prudent with respect to AI development (that’s not to say policy is tractable, just that it seems very much more difficult to dismiss than straws).
[I work for Dan Hendrycks but he hasn’t reviewed this.]
It seems to me like your comment roughly boils down to “people will exploit safety questionnaires.” I agree with that. However, I think they are much more likely to exploit social influence, blog posts, and vagueness than specific questionnaires. The biggest strengths of the x-risk sheet, in my view, are:
(1) It requires a specific explanation of how the paper is relevant to x-risk, which cannot be tuned depending on the audience one is talking to. You give the example from the forecasting paper and suggest it’s unconvincing. The counterfactual is that the forecasting paper is released, the authors are telling people and funders that it’s relevant to safety, and there isn’t even anything explicitly written for you to find unconvincing and argue against. The sheets can help resolve this problem (though in this case, you haven’t really said why you find it unconvincing). Part of the reason I was motivated to write Pragmatic AI Safety (which covers many of these topics) was so that the ideas in it are staked out clearly. That way people can have something clear to criticize, and it also forces their criticisms to be more specific.
(2) There is a clear trend of saying that papers that are mostly about capabilities are about safety. This sheet forces authors to directly address this in their paper, and either admit the fact that they are doing capabilities or attempt to construct a contorted and falsifiable argument otherwise.
(3) The standardized form allows for people to challenge specific points made in the x-risk sheet, rather than cherrypicked things the authors feel like mentioning in conversation or blog posts.
Your picture of faculty simply looking at the boxes being checked and approving is, I hope, not actually how funders in the AI safety space are operating (if they are, then yes, no x-risk sheet can save them). I would hope that reviewers and evaluators of papers will directly address the evidence for each piece of the x-risk sheet and challenge incorrect assertions.
I’d be a bit worried if x-risk sheets were included in every conference, but if you instead just make them a requirement for “all papers that want AI safety money” or “all papers that claim to be about AI safety” I’m not that worried that the sheets themselves would make any researchers talk about safety if they were not already talking about it.
I’ve been collecting examples of this kind of thing for a while now here: ai-improving-ai.safe.ai.
In addition to algorithmic and data improvements I’ll add there are also some examples of AI helping to design hardware (e.g. GPU architectures) and auxiliary software (e.g. for datacenter cooling).
We weren’t intending to use the contest to do any direct outreach to anyone (not sure how one would do direct outreach with one-liners in any case), and we didn’t use it for that. I think it was less useful than I would have hoped (nearly all submissions were not very good), but the ideas/anecdotes surfaced have been used in various places and as inspiration.
It is also interesting to note that the contest was very controversial on LW, essentially due to it being too political/advocacy-flavored (though it wasn’t intended for “political” outreach per se). I think it’s fair that LW has had those kinds of norms for research, but it did have a chilling effect on people who wanted to do the kind of advocacy that many people on LW now deem useful/necessary.
I think I essentially agree with respect to your definition of “sanity,” and that it should be a goal. For example, just getting people to think more about tail risk seems like your definition of “sanity” and my definition of “safety culture.” I agree that people merely saying they support these efforts and uttering applause lights is pretty bad, though it seems weird to me to discount actual resources coming in.
As for the last bit: trying to figure out the crux here. Are you just not very concerned about outer alignment/proxy gaming? I think if I was totally unconcerned about that, and only concerned with inner alignment/deception, I would think those areas were useless. As it is, I think a lot of the work is actively harmful (because it is mostly just advances capabilities) but it still may help chip away at the proxy gaming problem.
A related post: https://www.lesswrong.com/posts/xhD6SHAAE9ghKZ9HS/safetywashing
[Realized this is contained in a footnote, but leaving this comment here in case anyone missed it].
I’m going to address your last paragraph first, because I think it’s important for me to respond to, not just for you and me but for others who may be reading this.
When I originally wrote this post, it was because I had asked ChatGPT a genuine question about a drink I wanted to make. I don’t drink alcohol, and I never have. I’ve found that even mentioning this fact sometimes produces responses like yours, and it’s not uncommon for people to think I am mentioning it as some kind of performative virtue signal. People choose not to drink for all sorts of reasons, and maybe some are being performative about it, but that’s a hurtful assumption to make about anyone who makes that choice and dares to admit it in a public forum. This is exactly why I am often hesitant to mention this fact about myself, but in the case of this post, there really was no other choice (aside from just not posting this at all, which I would really disprefer). I’ve generally found the LW community and younger generations to be especially good at interpreting a choice not to drink for what it usually is: a personal choice, not a judgment or a signal or some kind of performative act. However, your comment initially angered and then saddened me, because it greets my choice through a lens of suspicion. That’s generally a fine lens through which to look at the world, but I think in this context, it’s a harmful one. I hope you will consider thinking a little more compassionately in the future with respect to this issue.
To answer your object-level critiques:
The problem is that it clearly contradicts itself several times, rather than admitting that there is a contradiction it doesn’t know how to reconcile. There is no sugar in tequila. Tequila may be described as sweet (nobody I talked to described it as such, but some people on the internet do) for non-sugar reasons. In fact, I’m sure ChatGPT knows way more about tequila than I do!
It is not that it “may not know” how to reconcile those facts. It is that it doesn’t know, makes something up, and pretends it makes sense.
A situation where somebody interacting with the chatbot doesn’t know much about the subject area is exactly the kind of situation we need to be worried about with these models. I’m entirely unconvinced that the fact that some people describe tequila as sweet says much at all about this post. That’s because the point of the post was rather that ChatGPT claimed tequila has high sugar content, then claimed that actually the sweetness is due to something else, and it never really meant that tequila has any sugar. That is the problem, and I don’t think my description of it is overblown.
As somebody who used to be an intern at CHAI, but certainly isn’t speaking for the organization:
CHAI seems best approximated as a collection of researchers doing a bunch of different things. There is more reinforcement learning at CHAI than elsewhere, and it’s ML research, but it’s not top down at all so it doesn’t feel that unified. Stuart Russell has an agenda, but his students have their own agendas which only sometimes overlap with his.
I think if they operationalized it like that, fine, but I would find the frame “solving the problem” to be a very weird way of referring to that. Usually, when I hear people saying “solving the problem” they have a vague sense of what they are meaning, and have implicitly abstracted away the fact that there are many continuous problems where progress needs to be made and that the problem can only really be reduced, but never solved, unless there is actually a mathematical proof.
As far as I understand it, “intelligence” is the ability to achieve one’s goals through reasoning and making plans, so a highly intelligent system is goal-directed by definition. Less goal-directed AIs are certainly possible, but they must necessarily be considered less intelligent—the thermometer example illustrates this. Therefore, a less goal-directed AI will always lose in competition against a more goal-directed one.
Your argument seems to be:
Definitionally, intelligence is the ability to achieve one’s goals.
Less goal-directed systems are less intelligent.
Less intelligent systems will always lose in competition.
Therefore, less goal-directed systems will always lose in competition.
Defining intelligence as goal-directedness doesn’t do anything for your argument. It just kicks the can down the road. Why will less intelligent (under your definition, less goal-directed) systems always lose in competition?
Imagine you had a magic wand or a genie in a bottle that would fulfill every wish you could dream of. Would you use it? If so, you’re incentivized to take over the world, because the only possible way of making every wish come true is absolute power over the universe. The fact that you normally don’t try to achieve that may have to do with the realization that you have no chance. If you had, I bet you’d try it. I certainly would, if only so I could stop Putin. But would me being all-powerful be a good thing for the rest of the world? I doubt it.
Romance is a canonical example of where you really don’t want to be all powerful (if real romance is what you want). Romance could not exist if your romantic partner always predictably did everything you ever wanted. The whole point is they are a different person, with different wishes, and you have to figure out how to navigate that and its unpredictabilities. That is the “fun” of romance. So no, I don’t think everyone would really use that magic wand.
Right now we aren’t going to consider new submissions. However, you’d be welcome to submit longer form arguments to our second competition (details are TBD).
I feel somewhat conflicted about this post. I think a lot of the points are essentially true. For instance, I think it would be good if timelines could be longer all else equal. I also would love more coordination between AI labs. I also would like more people in AI labs to start paying attention to AI safety.
But I don’t really like the bottom line. The point of all of the above is not reducible to just “getting more time for the problem to be solved.”
First of all, the framing of “solving the problem” is, in my view, misplaced. Unless you think we will someday have a proof of beneficial AI (I think that’s highly unlikely), then there will always be more to do to increase certainty and reliability. There isn’t a moment when the problem is solved.
Second, these interventions are presented as a way of giving “alignment researchers” more time to make technical progress. But in my view, things like more coordination could lead to labs actually adopting any alignment proposals at all. The same is the case for reducing racing. And if labs are actually concerned about AI safety, I’d expect them to devote resources to it themselves. This shouldn’t be a side benefit, it should be a mainline benefit.
I don’t think high levels of reliability of beneficial AI will come purely from people who post on LessWrong, because the community is just so small and doesn’t have that much capital behind it. DeepMind/OpenAI, not to mention Google Brain and Meta AI research, could invest significantly more in safety than they do. So could governments (yes, the way they do this might be bad—but it could in principle be good).
You say that you thought buying time was the most important frame you found to backchain with. To me this illustrates a problem with backchaining. Dan Hendrycks and I discussed similar kinds of interventions, and we called this “improving contributing factors” which is what it’s called in complex systems theory. In my view, it’s a much better and less reductive frame for thinking about these kinds of interventions.