Wei Dai
Thanks for your insightful answers. You may want to make a top-level post on this topic to get more visibility. If only a very small fraction of the world is likely to ever understand and take into account many important ideas/considerations about AI x-safety, that changes the strategic picture considerably, and people around here may not be sufficiently “pricing it in”. I think I’m still in the process of updating on this myself.
Having more intelligence seems to directly or indirectly improve at least half of the items on your list. So doing an AI pause and waiting for (or encouraging) humans to become smarter still seems like the best strategy. Any thoughts on this?
And I guess this… just doesn’t seem to be the case (at least to an outsider like me)?
I may be too sensitive about unintentionally causing harm, after observing many others do this. I was also just responding to what you said earlier, where it seemed like I was maybe causing you personally to be too pessimistic about contributing to solving the problems.
you probably knew him personally?
No, I never met him and didn’t interact with him much online. He does seem like a good example of what you’re talking about.
Some questions for @leopold.
Anywhere I can listen to or read your debates with “doomers”?
We share a strong interest in economics, but apparently not in philosophy. I’m curious whether this is true, or whether you just didn’t talk about it in the places I looked.
What do you think about my worries around AIs doing philosophy? See this post or my discussion about it with Jan Leike.
What do you think about my worries around AGI being inherently centralizing and/or offense-favoring and/or anti-democratic (aside from above problems, how would elections work when minds can be copied at little cost)? Seems like the free world “prevailing” on AGI might well be a Pyrrhic victory unless we can also solve these follow-up problems, but you don’t address them.
More generally, do you have a longer term vision of how your proposal leads to a good outcome for our lightcone, avoiding all the major AI-related x-risks and s-risks?
Why are you not in favor of an AI pause treaty with other major nations? (You only talk about unilateral pause in the section “AGI Realism”.) China is currently behind in chips and AI and it seems hard to surpass the entire West in a chips/AI race, so why would they not go for an AI pause treaty to preserve the status quo instead of risking a US-led intelligence explosion (not to mention x-risks)?
In my view, the main good outcomes of the AI transition are: 1) we luck out, and AI x-safety turns out to be pretty easy across all the subproblems, or 2) there’s an AI pause, humans get smarter via things like embryo selection, and then solve all the safety problems.
I’m mainly pushing for #2, but also don’t want to accidentally make #1 less likely. It seems like one of the main ways in which I could end up having a negative impact is to persuade people that the problems are definitely too hard and hence not worth trying to solve, and it turns out the problems could have been solved with a little more effort.
“it doesn’t seem like you have answers to (or even a great path forward on) these questions either despite your great interest in and effort spent on them, which bodes quite terribly for the rest of us” is a bit worrying from this perspective, and also because my “effort spent on them” isn’t that great. As I don’t have a good approach to answering these questions, I mainly just have them in the back of my mind while my conscious effort is mostly on other things.
BTW I’m curious what your background is and how you got interested/involved in AI x-safety. It seems rare for newcomers to the space (like you seem to be) to quickly catch up on all the ideas that have been developed on LW over the years, and many recently drawn to AGI instead appear to get stuck on positions/arguments from decades ago. For example, r/Singularity has 2.5M members and seems to be dominated by accelerationism. Do you have any insights about this? (How were you able to do this? How to help others catch up? Intelligence is probably a big factor which is why I’m hoping that humanity will automatically handle these problems better once it gets smarter, but many seem plenty smart and still stuck on primitive ideas about AI x-safety.)
Simplicia: Hm, perhaps a crux between us is how narrow of a target is needed to realize how much of the future’s value. I affirm the orthogonality thesis, but it still seems plausible to me that the problem we face is more forgiving, not so all-or-nothing as you portray it.
I agree that it’s plausible. I even think a strong form of moral realism (denial of orthogonality thesis) is plausible. My objection is that humanity should figure out what is actually the case first (or have some other reasonable plan of dealing with this uncertainty), instead of playing logical Russian roulette like it seems to be doing. I like that Simplicia isn’t being overconfident here, but is his position actually that “seems plausible to me that the problem we face is more forgiving” is sufficient basis for moving forward with building AGI? (Does any real person in the AI risk debate have a position like this?)
Publish important governance documents. (Seemed too basic to mention, but apparently not.)
I also am not paying for any LLM. Between Microsoft’s Copilot (formerly Bing Chat), LMSYS Chatbot Arena, and Codeium, I have plenty of free access to SOTA chatbots/assistants. (Slightly worried that I’m contributing to race dynamics or AI risk in general even by using these systems for free, but not enough to stop, unless someone wants to argue for this.)
Unfortunately I don’t have well-formed thoughts on this topic. I wonder if there are people who specialize in AI lab governance and have written about this, but I’m not personally aware of such writings. To brainstorm some ideas:
Conduct and publish anonymous surveys of employee attitudes about safety.
Encourage executives, employees, board members, advisors, etc., to regularly blog about governance and safety culture, including disagreements over important policies.
Officially encourage (e.g. via financial rewards) internal and external whistleblowers. Establish and publish policies about this.
Publicly make safety commitments and regularly report on their status, such as how much compute and other resources have been allocated/used by which safety teams.
Make and publish a commitment to publicly report negative safety news, which can be used as a basis for whistleblowing if needed (i.e., if some manager decides to hide such news instead).
I’d like to hear from people who thought that AI companies would act increasingly reasonable (from an x-safety perspective) as AGI got closer. Is there still a viable defense of that position (e.g., that SamA being in his position / doing what he’s doing is just uniquely bad luck, not reflecting what is likely to be happening / will happen at other AI labs)?
Also, why is there so little discussion of x-safety culture at other AI labs? I asked on Twitter and did not get a single relevant response. Are other AI company employees also reluctant to speak out? If so, that seems bad (every explanation I can think of seems bad, including default incentives plus companies not proactively encouraging transparency).
Suggest having a row for “Transparency”, to cover things like whether the company encourages or discourages whistleblowing, does it report bad news about alignment/safety (such as negative research results) or only good news (new ideas and positive results), does it provide enough info to the public to judge the adequacy of its safety culture and governance, etc.
It’s also notable that the topic of OpenAI nondisparagement agreements was brought to Holden Karnofsky’s attention in 2022, and he replied with “I don’t know whether OpenAI uses nondisparagement agreements; I haven’t signed one.” (He could have asked his contacts inside OAI about it, or asked the EA board member to investigate. Or even set himself up earlier as someone OpenAI employees could whistleblow to on such issues.)
If the point was to buy a ticket to play the inside game, then it was played terribly and negative credit should be assigned on that basis, and for misleading people about how prosocial OpenAI was likely to be (due to having an EA board member).
Agreed that it reflects badly on the people involved, although less on Paul, since he was only a “technical advisor” and arguably less responsible for thinking through / doing due diligence on the social aspects. It’s frustrating to see the EA community (on EAF and Twitter at least) and those directly involved all ignoring this.
(“shouldn’t be allowed anywhere near AI Safety decision making in the future” may be going too far though.)
So these resignations don’t negatively impact my p(doom) in the obvious way. The alignment people at OpenAI were already powerless to do anything useful regarding changing the company direction.
How were you already sure of this before the resignations actually happened? I of course had my own suspicions that this was the case, but was uncertain enough that the resignations are still a significant negative update.
ETA: Perhaps worth pointing out here that Geoffrey Irving recently left Google DeepMind to be Research Director at UK AISI, but seemingly on good terms (since Google DeepMind recently reaffirmed its intention to collaborate with UK AISI).
Bad: AI developers haven’t taken alignment seriously enough to have invested enough in scalable oversight, and/or those techniques are unworkable or too costly, causing them to be unavailable.
Turns out at least one scalable alignment team has been struggling for resources. From Jan Leike (formerly co-head of Superalignment at OpenAI):
Over the past few months my team has been sailing against the wind. Sometimes we were struggling for compute and it was getting harder and harder to get this crucial research done.
Even worse, apparently the whole Superalignment team has been disbanded.
These may be among the ‘most direct’ or ‘simplest to imagine’ possible actions, but in the case of superintelligence, simplicity is not a constraint.
I think it is considered a constraint by some because they think that it would be easier/safer to use a superintelligent AI to do simpler actions, while alignment is not yet fully solved. In other words, if alignment was fully solved, then you could use it to do complicated things like what you suggest, but there could be an intermediate stage of alignment progress where you could safely use SI to do something simple like “melt GPUs” but not to achieve more complex goals.
Some evidence in favor of your explanation (being at least a correct partial explanation):
von Neumann apparently envied Einstein’s physics intuitions, while Einstein lacked von Neumann’s math skills. This seems to suggest that they were “tuned” in slightly different directions.
Neither of the two seems superhumanly accomplished in other areas (that a smart person/agent might have goals for), such as making money, making moral/philosophical progress, or changing culture/politics in their preferred direction.
(An alternative explanation for 2 is that they could have been superhuman in other areas but their terminal goals did not chain through instrumental goals in those areas, which in turn raises the question of what those terminal goals must have been for this explanation to be true and what that says about human values.)
I note that under your explanation, someone could surprise the world by tuning a not-particularly-advanced AI for a task nobody previously thought to tune AI for, or by inventing a better tuning method (either general or specialized), thus achieving a large capability jump in one or more domains. Not sure how worrisome this is though.
A government might model the situation as something like “the first country/coalition to open up an AI capabilities gap of size X versus everyone else wins” because it can then easily win a tech/cultural/memetic/military/economic competition against everyone else and take over the world. (Or a fuzzy version of this to take into account various uncertainties.) Seems like a very different kind of utility function.
Hmm, open models make it easier for a corporation to train closed models, but also make that activity less profitable, whereas for a government the latter consideration doesn’t apply or has much less weight, so it seems much clearer that open models increase overall incentive for AI race between nations.
I think open source models probably reduce profit incentives to race, but can increase strategic (e.g., national security) incentives to race. Consider that if you’re the Chinese government, you might think that you’re too far behind in AI and can’t hope to catch up, and therefore decide to spend your resources on other ways to mitigate the risk of a future transformative AI built by another country. But then an open model is released, and your AI researchers catch up to near state-of-the-art by learning from it, which may well change your (perceived) tradeoffs enough that you start spending a lot more on AI research.
What do you think of this post by Tammy?
It seems like someone could definitely be wrong about what they want (unless normative anti-realism is true and such a sentence has no meaning). For example consider someone who thinks it’s really important to be faithful to God and goes to church every Sunday to maintain their faith and would use a superintelligent religious AI assistant to help keep the faith if they could. Or maybe they’re just overconfident about their philosophical abilities and would fail to take various precautions that I think are important in a high-stakes reflective process.
Mostly that thing where we had a lying vs lie-detecting arms race and the liars mostly won by believing their own lies and that’s how we have things like overconfidence bias and self-serving bias and a whole bunch of other biases.
Are you imagining that the RL environment for AIs will be single-player, with no social interactions? If yes, how will they learn social skills? If no, why wouldn’t the same thing happen to them?
Unless we do a very stupid thing like reading the AI’s thoughts and RL-punish wrongthink, this seems very unlikely to happen.
We already RL-punish AIs for saying things that we don’t like (via RLHF), and in the future will probably punish them for thinking things we don’t like (via things like interpretability). Not sure how to avoid this (given current political realities) so safety plans have to somehow take this into account.
Yeah it seems like a bunch of low hanging fruit was picked around that time, but that opened up a vista of new problems that are still out of reach. I wrote a post about this, which I don’t know if you’ve seen or not.
(This has been my experience with philosophical questions in general, that every seeming advance just opens up a vista of new harder problems. This is a major reason that I switched my attention to trying to ensure that AIs will be philosophically competent, instead of object-level philosophical questions.)