Wei Dai
But I also think that if you gave me a year where I had lots of money, access, and was free from people trying to pressure me, I would have a good shot at pulling it off.
Want to explain a bit about how you’d go about doing this? Seems like you’re facing some similar problems as assuring that an AI is wise, benevolent, and stable, e.g., not knowing what wisdom really is, distribution shift between testing and deployment, adversarial examples/inputs.
This is indeed my overall suggested strategy, with CAST coming after a “well, if you’re going to try to build it anyway you might as well die with a bit more dignity by...” disclaimer.
I think this means you should be extra careful not to inadvertently make people too optimistic about alignment, which would make coordination to stop capabilities research even harder than it already is. For example you said that you “like” the visualization of 5 humans selected by various governments, without mentioning that you don’t trust governments to do this, which seems like a mistake?
A visualization that I like is imagining a small group of, say, five humans selected by various governments for being wise, benevolent, and stable.
I think this might be a dealbreaker. I don’t trust the world’s governments to come up with 5 humans who are sufficiently wise, benevolent, and stable. (Do you really?) I’m not sure I could come up with 5 such people myself. None of the alternatives you talk about seem acceptable either.
I think maybe a combination of two things could change my mind, but both seem very hard and have close to nobody working on them:
The AI is very good at helping the principals be wise and stable, for example by being super-competent at philosophy. (I think this may also require being less than maximally corrigible, but I’m not sure.) Otherwise what happens if, e.g., the principals or AI start thinking about distant superintelligences?
There is some way to know that benevolence is actually the CEV of such a group, i.e., they’re not just “deceptively aligned”, or something like that, while not having much power.
Yeah it seems like a bunch of low hanging fruit was picked around that time, but that opened up a vista of new problems that are still out of reach. I wrote a post about this, which I don’t know if you’ve seen or not.
(This has been my experience with philosophical questions in general, that every seeming advance just opens up a vista of new harder problems. This is a major reason that I switched my attention to trying to ensure that AIs will be philosophically competent, instead of object-level philosophical questions.)
Thanks for your insightful answers. You may want to make a top-level post on this topic to get more visibility. If only a very small fraction of the world is likely to ever understand and take into account many important ideas/considerations about AI x-safety, that changes the strategic picture considerably, and people around here may not be sufficiently “pricing it in”. I think I’m still in the process of updating on this myself.
Having more intelligence seems to directly or indirectly improve at least half of the items on your list. So doing an AI pause and waiting for (or encouraging) humans to become smarter still seems like the best strategy. Any thoughts on this?
And I guess this… just doesn’t seem to be the case (at least to an outsider like me)?
I may be too sensitive about unintentionally causing harm, after observing many others do this. I was also just responding to what you said earlier, where it seemed like I was maybe causing you personally to be too pessimistic about contributing to solving the problems.
you probably knew him personally?
No, I never met him and didn’t interact online much. He does seem like a good example of what you’re talking about.
Some questions for @leopold.
Anywhere I can listen to or read your debates with “doomers”?
We share a strong interest in economics, but apparently not in philosophy. I’m curious if this is true, or you just didn’t talk about it in the places I looked.
What do you think about my worries around AIs doing philosophy? See this post or my discussion about it with Jan Leike.
What do you think about my worries around AGI being inherently centralizing and/or offense-favoring and/or anti-democratic (aside from above problems, how would elections work when minds can be copied at little cost)? Seems like the free world “prevailing” on AGI might well be a Pyrrhic victory unless we can also solve these follow-up problems, but you don’t address them.
More generally, do you have a longer term vision of how your proposal leads to a good outcome for our lightcone, avoiding all the major AI-related x-risks and s-risks?
Why are you not in favor of an AI pause treaty with other major nations? (You only talk about unilateral pause in the section “AGI Realism”.) China is currently behind in chips and AI and it seems hard to surpass the entire West in a chips/AI race, so why would they not go for an AI pause treaty to preserve the status quo instead of risking a US-led intelligence explosion (not to mention x-risks)?
In my view, the main good outcomes of the AI transition are 1) we luck out, AI x-safety is actually pretty easy across all the subproblems 2) there’s an AI pause, humans get smarter via things like embryo selection, then solve all the safety problems.
I’m mainly pushing for #2, but also don’t want to accidentally make #1 less likely. It seems like one of the main ways in which I could end up having a negative impact is to persuade people that the problems are definitely too hard and hence not worth trying to solve, and it turns out the problems could have been solved with a little more effort.
“it doesn’t seem like you have answers to (or even a great path forward on) these questions either despite your great interest in and effort spent on them, which bodes quite terribly for the rest of us” is a bit worrying from this perspective, and also because my “effort spent on them” isn’t that great. As I don’t have a good approach to answering these questions, I mainly just have them in the back of my mind while my conscious effort is mostly on other things.
BTW I’m curious what your background is and how you got interested/involved in AI x-safety. It seems rare for newcomers to the space (like you seem to be) to quickly catch up on all the ideas that have been developed on LW over the years, and many recently drawn to AGI instead appear to get stuck on positions/arguments from decades ago. For example, r/Singularity has 2.5M members and seems to be dominated by accelerationism. Do you have any insights about this? (How were you able to do this? How to help others catch up? Intelligence is probably a big factor which is why I’m hoping that humanity will automatically handle these problems better once it gets smarter, but many seem plenty smart and still stuck on primitive ideas about AI x-safety.)
Simplicia: Hm, perhaps a crux between us is how narrow of a target is needed to realize how much of the future’s value. I affirm the orthogonality thesis, but it still seems plausible to me that the problem we face is more forgiving, not so all-or-nothing as you portray it.
I agree that it’s plausible. I even think a strong form of moral realism (denial of orthogonality thesis) is plausible. My objection is that humanity should figure out what is actually the case first (or have some other reasonable plan of dealing with this uncertainty), instead of playing logical Russian roulette like it seems to be doing. I like that Simplicia isn’t being overconfident here, but is his position actually that “seems plausible to me that the problem we face is more forgiving” is sufficient basis for moving forward with building AGI? (Does any real person in the AI risk debate have a position like this?)
Publish important governance documents. (Seemed too basic to mention, but apparently not.)
I also am not paying for any LLM. Between Microsoft’s Copilot (formerly Bing Chat), LMSYS Chatbot Arena, and Codeium, I have plenty of free access to SOTA chatbots/assistants. (Slightly worried that I’m contributing to race dynamics or AI risk in general even by using these systems for free, but not enough to stop, unless someone wants to argue for this.)
Unfortunately I don’t have well-formed thoughts on this topic. I wonder if there are people who specialize in AI lab governance and have written about this, but I’m not personally aware of such writings. To brainstorm some ideas:
Conduct and publish anonymous surveys of employee attitudes about safety.
Encourage executives, employees, board members, advisors, etc., to regularly blog about governance and safety culture, including disagreements over important policies.
Officially encourage (e.g. via financial rewards) internal and external whistleblowers. Establish and publish policies about this.
Publicly make safety commitments and regularly report on their status, such as how much compute and other resources have been allocated/used by which safety teams.
Make/publish a commitment to publicly report negative safety news, which can be used as a basis for whistleblowing if needed (i.e., if some manager decides to hide such news instead).
I’d like to hear from people who thought that AI companies would act increasingly reasonable (from an x-safety perspective) as AGI got closer. Is there still a viable defense of that position (e.g., that SamA being in his position / doing what he’s doing is just uniquely bad luck, not reflecting what is likely to be happening / will happen at other AI labs)?
Also, why is there so little discussion of x-safety culture at other AI labs? I asked on Twitter and did not get a single relevant response. Are other AI company employees also reluctant to speak out? If so, that seems bad (every explanation I can think of seems bad, including default incentives plus companies not proactively encouraging transparency).
Suggest having a row for “Transparency”, to cover things like whether the company encourages or discourages whistleblowing, does it report bad news about alignment/safety (such as negative research results) or only good news (new ideas and positive results), does it provide enough info to the public to judge the adequacy of its safety culture and governance, etc.
It’s also notable that the topic of OpenAI nondisparagement agreements was brought to Holden Karnofsky’s attention in 2022, and he replied with “I don’t know whether OpenAI uses nondisparagement agreements; I haven’t signed one.” (He could have asked his contacts inside OAI about it, or asked the EA board member to investigate. Or even set himself up earlier as someone OpenAI employees could whistleblow to on such issues.)
If the point was to buy a ticket to play the inside game, then it was played terribly and negative credit should be assigned on that basis, and for misleading people about how prosocial OpenAI was likely to be (due to having an EA board member).
Agreed that it reflects badly on the people involved, although less on Paul since he was only a “technical advisor” and arguably less responsible for thinking through / doing due diligence on the social aspects. It’s frustrating to see the EA community (on EAF and Twitter at least) and those directly involved all ignoring this.
(“shouldn’t be allowed anywhere near AI Safety decision making in the future” may be going too far though.)
So these resignations don’t negatively impact my p(doom) in the obvious way. The alignment people at OpenAI were already powerless to do anything useful regarding changing the company direction.
How were you already sure of this before the resignations actually happened? I of course had my own suspicions that this was the case, but was uncertain enough that the resignations are still a significant negative update.
ETA: Perhaps worth pointing out here that Geoffrey Irving recently left Google DeepMind to be Research Director at UK AISI, but seemingly on good terms (since Google DeepMind recently reaffirmed its intention to collaborate with UK AISI).
Bad: AI developers haven’t taken alignment seriously enough to have invested enough in scalable oversight, and/or those techniques are unworkable or too costly, causing them to be unavailable.
Turns out at least one scalable alignment team has been struggling for resources. From Jan Leike (formerly co-head of Superalignment at OpenAI):
Over the past few months my team has been sailing against the wind. Sometimes we were struggling for compute and it was getting harder and harder to get this crucial research done.
Even worse, apparently the whole Superalignment team has been disbanded.
These may be among the ‘most direct’ or ‘simplest to imagine’ possible actions, but in the case of superintelligence, simplicity is not a constraint.
I think it is considered a constraint by some because they think that it would be easier/safer to use a superintelligent AI to do simpler actions, while alignment is not yet fully solved. In other words, if alignment was fully solved, then you could use it to do complicated things like what you suggest, but there could be an intermediate stage of alignment progress where you could safely use SI to do something simple like “melt GPUs” but not to achieve more complex goals.
Some evidence in favor of your explanation (being at least a correct partial explanation):
von Neumann apparently envied Einstein’s physics intuitions, while Einstein lacked von Neumann’s math skills. This seems to suggest that they were “tuned” in slightly different directions.
Neither of the two seems superhumanly accomplished in other areas (that a smart person/agent might have goals for), such as making money, moral/philosophical progress, changing culture/politics in their preferred direction.
(An alternative explanation for 2 is that they could have been superhuman in other areas but their terminal goals did not chain through instrumental goals in those areas, which in turn raises the question of what those terminal goals must have been for this explanation to be true and what that says about human values.)
I note that under your explanation, someone could surprise the world by tuning a not-particularly-advanced AI for a task nobody previously thought to tune AI for, or by inventing a better tuning method (either general or specialized), thus achieving a large capability jump in one or more domains. Not sure how worrisome this is though.
A government might model the situation as something like “the first country/coalition to open up an AI capabilities gap of size X versus everyone else wins” because it can then easily win a tech/cultural/memetic/military/economic competition against everyone else and take over the world. (Or a fuzzy version of this to take into account various uncertainties.) Seems like a very different kind of utility function.
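The “fuzzy version” of this utility function could be sketched as a win probability that rises smoothly with the capability gap, rather than flipping at a hard threshold X. The logistic form, parameter names, and numbers below are purely illustrative assumptions, not anything from the original comment:

```python
import math

def win_probability(gap: float, threshold: float, sharpness: float = 1.0) -> float:
    """Illustrative 'fuzzy' model of 'first to open a gap of size X wins':
    probability that the leading coalition prevails, as a smooth (logistic)
    function of its capability gap over everyone else. The hard-threshold
    version would simply be: gap >= threshold."""
    return 1.0 / (1.0 + math.exp(-sharpness * (gap - threshold)))

# A lead well past the threshold makes winning near-certain; a lead well
# short of it makes winning unlikely; at exactly the threshold it's 50/50.
print(win_probability(10.0, threshold=5.0, sharpness=2.0))
print(win_probability(0.0, threshold=5.0, sharpness=2.0))
print(win_probability(5.0, threshold=5.0, sharpness=2.0))
```

The point of the smooth version is just that uncertainty about X (and about how decisive a given gap really is) turns the all-or-nothing payoff into a graded one, which is what makes it such a different utility function from ordinary economic competition.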
Can’t claim to have put much thought into this topic, but here are my guesses of what the most cost-effective ways of throwing money at the problem of reducing existential risk might include:
Research into human intelligence enhancement, e.g., tech related to embryo selection.
Research into how to design/implement an international AI pause treaty, perhaps x-risk governance in general.
Try to identify more philosophical talent across the world and pay them to make philosophical progress, especially in metaphilosophy. (I’m putting some of my own money into this.)
Research into public understanding of x-risks, what people’s default risk tolerances are, what arguments they can or can’t understand, etc.
Strategy think tanks that try to keep a big picture view of everything, propose new ideas or changes to what people/orgs should do, discuss these ideas with the relevant people, etc.