I’m working on an article for the Open Phil AI worldview contest. I am thinking of explaining my interpretation of Nate’s take on the agency of advanced models (see here). Generally, I just want to explain what Nate argues, namely that more ambitious tasks require more “agentic” behaviour, but I want to illustrate it with a few examples that, to me, make the argument clearer than typical MIRI discussions of this issue.
One worry about discussing this issue is that, if you argue compellingly that agency is required for more advanced AI, then you might convince some people working on advanced AI to look for ways to make it more agentic. This could lead to acceleration of capabilities in a potentially undesirable direction.
I’ve found MIRI staff explanations of this point to often amount to “I’ve thought about it, and if you think about it I think you’ll end up agreeing with me”. It’s plausible that this is motivated by a desire not to make the argument too clear. I’m pretty unsure about this, though, because if you feel that it’s likely to be harmful to share this idea then I would think the appropriate policy is not to talk about it at all, rather than to talk about it vaguely. To the extent that this idea does suggest pathways to higher capability, I think there are probably lots of people in the AI business who can put 2 and 2 together, so to speak.
My own view is:
AI researchers will pay a lot more attention to successful experiments than to abstract ideas, so such discussions are probably less compelling to AI developers
If someone is convinced by an argument to try agency-promoting experiments, I think the argument must be plausible enough a priori, and there are enough people working on novel AI ideas that someone else probably would have tried a similar experiment fairly soon anyway (timescale ~ a couple of months, and the overall impact is probably smaller because of other advances that happen during those months)
Also, most experiments themselves aren’t especially compelling
On the other hand, there’s a large upside to people engaged in AI x-risk questions having a good idea of how this is likely to play out, and impact on this front doesn’t depend on someone going out to run a successful experiment
A limiting case is if everyone in the AI x-safety community agrees that AI has to be highly agentic to carry out ambitious tasks. In this case, I think it’s likely that some developers explore ways to make their AI more agentic earlier than they otherwise would have, but the x-safety community is much better coordinated about such models. It’s murky, but I think this is probably good overall.
So I think it’s probably best to talk about it plainly. What are your thoughts?
I think, for example, that talk about how AI might be a winner-takes-all game might have encouraged the “full speed ahead” approach to developing AGI.
I don’t think this question is specific to agency. I think this is about the entire concept of infohazards, and your arguments are fully general against all AI infohazards discussion.
Personally, I’m of the view that the worst idea in history is the idea of “bad ideas” that need to be hidden. I think the alignment community is shooting itself in the foot by trying to suppress ideas that the capabilities community is already fully aware of.
I think this is a tricky tradeoff. There’s effectively a race between alignment and capabilities research. Better theories of how AGI is likely to be constructed will help both efforts. Which one it will help more is tough to guess.
The one thought I’d like to add is that the AI safety community may think more broadly and creatively about approaches to building AI. So I wouldn’t assume that all of this thinking has already been done.
I don’t have an answer on this, and I’ve thought about it a lot since I’ve been keeping some potential infohazard ideas under my hat for maybe the last ten years.
I’ve read the 2021 MIRI conversations sequence, and various other writings by Nate and Eliezer. I found their explanations of convergent instrumental goals, agency, and various other topics convincing and explanatory, without much further thinking of my own.
I think in most or all cases, they were doing their best to explain clearly, without worrying much about infohazards. But the concepts are complicated and counter-intuitive, and sometimes when their explanations weren’t landing, they decided to move on to other topics.
So, I think you should feel free to try communicating as clearly as you can, without holding back because of worries about infohazards. Perhaps you’ll succeed in explaining where others have failed.
(Also, if you do succeed in writing what you think is an infohazardously-good explanation, you can just ask someone you trust to read it privately before posting it publicly.)
It looks like estimating the architecture of future AGI is considered an “infohazard” too, even though knowing it could be very useful for figuring out how we will have to align it.
If you think up obvious things to do, and then look at AI papers 6 months later, you will see there is a form of “intelligence convergence”: everything you thought of will have been tried. Therefore, do not worry about ‘creating an idea’. Assume whatever you thought of is already being tried, or it doesn’t work.