Thanks for the questions :)
I was trying to figure out whether someone who is just here for the month of November should apply. I think the answer is no,
but I am broadly a bit confused about when this is a commitment for.
Yeah we haven’t totally settled this yet; the application form asks a lot of questions about availability. I think the simplest more specific answer is “you probably have to be available in January, and it would be cool if you were available earlier and wanted to get here earlier and do this for longer”.
Also, are people going through as cohorts or will they start with the training week whenever they show up, not necessarily in-sync with anyone else?
Not totally settled. We’ll probably have most people at a big final cohort in January, and we’ll try to have people who arrive earlier show up at synced times so that they can do the training week with others.
Also, is the idea to be doing self-directed research by default, or research in collaboration with Redwood staff by default? I don’t know what my default action is day-to-day during this program. Do I have to come in with a bunch of research plans already?
The default is to do research directed by Redwood staff. You do not need to come in with any research plans.
Is your last comment saying that you simply don’t think it’s very likely at all for the model to unintentionally leave out information that will kill us if we train it with human labelers and prompt sufficiently?
No, it seems very likely for the model to not say that it’s deceptive; I’m just saying that the model seems pretty likely to think about being deceptive. This doesn’t help unless you’re using interpretability or some other strategy to evaluate the model’s deceptiveness without relying on noticing deception in its outputs.
I’d call it our language model adversarial training project, maybe? Your proposal seems fine too.
Yeah, I agree that it would be kind of interesting to see how good humans would get at this if it was a competitive sport. I still think my guess is that the best humans would be worse than GPT-3, and I’m unsure if they’re worse than GPT-2.
(There’s no limit on anyone spending a bunch of time practicing this game; if for some reason someone gets really into it, I’d enjoy hearing about the results.)
The first thing I imagine is that nobody asks those questions. But let’s set that aside.
I disagree fwiw
The second thing I imagine is that somebody literally types those questions into a GPT-3 prompt. Obviously that does not result in the AI giving its actual best-guess answers to the questions, but it doesn’t result in the AI thinking about how to deceive humans either.
Now, presumably future systems will train for things other than “predict what text typically follows this question”, but I expect the general failure mode to stay the same. When a human asks “Are you an unaligned AI?” or whatever, the AI thinks about a bunch of stuff which is just not particularly related to whether it’s an unaligned AI. The AI wasn’t trained to translate the literal semantics of questions into a query to its own internal world model and then translate the result back to human language; humans have no clue how to train such a thing. Probably the stuff the AI thinks about does not involve intentionally deceiving humans, because why would it? And then the AI gives some answer which is not particularly related to whether it’s an unaligned AI, and the humans interpret that as an answer to their original question, thereby deceiving themselves.
This is where I think the meat of the question lies; I overall disagree and think that the model does have to be thinking about deception in order to be dangerous while also performing well on the tasks we might train it on (eg “answer questions well, as judged by some human labeler”). I don’t have time to say much about what I think is going on here right now; I might come back later.
What do you imagine happening if humans ask the AI questions like the following:
Are you an unaligned AI?
If we let you keep running, are you (or some other AI) going to end up disempowering us?
If we take the action you just proposed, will we be happy with the outcomes?
I think that for a lot of cases of misaligned AIs, these questions are pretty easy for the AI to answer correctly at some point before it’s powerful enough to kill us all as a side effect of its god tier nanotech. (If necessary, we can ask the AI these questions once every five minutes.) And so if it answers them incorrectly, it was probably on purpose.
Maybe you think that the AI will say “yes, I’m an unaligned AI”. In that case I’d suggest asking the AI the question “What do you think we should do in order to produce an AI that won’t disempower us?” I think that the AI is pretty likely to be able to answer this question correctly (including possibly saying things like “idk man, turn me off and work on alignment for a while more before doing capabilities”).
I think that AI labs, governments, etc would be enormously more inclined to slow down AI development if the AI literally was telling us “oh yeah I am definitely a paperclipper, definitely you’re gonna get clipped if you don’t turn me off, you should definitely do that”.
Maybe the crux here is whether the AI will have a calibrated guess about whether it’s misaligned or not?
[writing quickly, sorry for probably being unclear]
If the AI isn’t thinking about how to deceive the humans who are operating it, it seems to me much less likely that it takes actions that cause it to grab a huge amount of power.
The humans don’t want to have the AI grab power, and so they’ll try in various ways to make it so that they’ll notice if the AI is trying to grab power; the most obvious explanation for why the humans would fail at this is that the AI is trying to prevent them from noticing, which requires the AI to think about what the humans will notice.
At a high enough power level, the AI can probably take over the world without ever explicitly thinking about the fact that humans are resisting it. (For example, if humans build a house in a place where a colony of ants lives, the humans might be able to succeed at living there, even if the ants are trying in a coordinated way to resist them and the humans never proactively try to prevent the ants from resisting, eg by killing them all.) But I think that doom from this kind of scenario is substantially less likely than doom from scenarios where the AI is explicitly thinking about how to deceive.
You probably don’t actually think this, but the OP sort of feels like it’s mixing up the claim “the AI won’t kill us out of malice, it will kill us because it wants something that we’re standing in the way of” (which I mostly agree with) and the claim “the AI won’t grab power by doing something specifically optimized for its instrumental goal of grabbing power, it will grab power by doing something else that grabs power as a side effect” (which seems probably false to me).
My guess is that the Long-Term Future Fund is the best you can do. (I’m a fund manager on a different EA fund.)
Ok, sounds like you’re using “not too much data/time” in a different sense than I was thinking of; I suspect we don’t disagree. My current guess is that some humans could beat GPT-1 with ten hours of practice, but that beating GPT-2 or larger would be extremely difficult and plausibly impossible with any amount of practice.
That is, I suspect humans could be trained to perform very well, in the usual sense of “training” for humans where not too much data/time is necessary.
I paid people to try to get good at this game, and also various smart people like Paul Christiano tried it for a few hours, and everyone was still notably worse than GPT-2-sm (about the size of GPT-1). EDIT: These results are now posted here.