I totally agree. I quite like “Mundane solutions to exotic problems”, a post by Paul Christiano, about how he thinks about this from a prosaic alignment perspective.
Ruby isn’t saying that computers have faster clock speeds than biological brains (which is definitely true); he’s claiming something like “after we have human-level AI, AIs will be able to get rapidly more powerful by running on faster hardware”. The speed increase is relative to other computers, so the speed difference between brains and computers isn’t relevant.
Also, running faster and duplicating yourself keeps the model human-level in an important sense. A lot of threat models run through the model doing things that humans can’t understand even given a lot of time, and so those threat models require something stronger than just this.
I agree we can duplicate models once we’ve trained them, this seems like the strongest argument here.
What do you mean by “run on faster hardware”? Faster than what?
I expect unaligned human-level AIs to try the same thing and have much more success because optimizing code and silicon hardware is easier than optimizing flesh brains.
I agree that human-level AIs will definitely try the same thing, but it’s not obvious to me that it will actually be much easier for them. Current machine learning techniques produce models that are hard to optimize for basically the same reasons that brains are; AIs will be easier to optimize for various reasons, but I don’t think it will be nearly as extreme as this sentence makes it sound.
But… shouldn’t this mean you expect AGI civilization to totally dominate human civilization? They can read each other’s source code, and thus trust much more deeply! They can transmit information between them at immense bandwidths! They can clone their minds and directly learn from each other’s experiences!
I don’t think it’s obvious that this makes AGI more dangerous, because for a fixed total impact of AGI, the AGI doesn’t have to be as competent at individual thinking (since it leans relatively more on group thinking). And so at the point where the AGIs are becoming very powerful in aggregate, this argument pushes us away from thinking they’re good at individual thinking.
Also, it’s not obvious that early AIs will actually be able to do this if their creators don’t find a way to train them to have this affordance. Current ML doesn’t normally produce AIs that can usefully share mind-states, and it would probably take non-trivial effort to hook them up correctly so that they can.
I’m using “catastrophic” in the technical sense of “unacceptably bad even if it happens very rarely, and even if the AI does what you wanted the rest of the time”, rather than “very bad thing that happens because of AI”, apologies if this was confusing.
My guess is that you will wildly disagree with the frame I’m going to use here, but I’ll just spell it out anyway: I’m interested in “catastrophes” as a remaining problem after you have solved the scalable oversight problem. If your AI is able to do one of these “positive-sum” pivotal acts in a single action, and you haven’t already lost control, then you can use your overseer to oversee the AI as it takes actions, and by assumption you only have to watch it for a small number of actions (maybe I want to say episodes rather than actions) before it’s done some crazy powerful stuff and saved the world. So I think I stand by the claim that those pivotal acts aren’t where much of the x-risk from AI catastrophic action (in the specific sense I’m using) comes from.
Thanks again for your thoughts here, they clarified several things for me.
It seems pretty plausible that the AI will trade for compute with some other person around the world.
Whether this is what I’m trying to call a zero-sum action depends on whose resources it’s trading. If the plan is “spend a bunch of the capital that its creators have given it on compute somewhere else”, then I think this is importantly zero-sum: the resources are being taken from the AI’s creators, which is why the AI was able to spend so many of them. If the plan were instead “produce some ten-trillion-dollar invention, sell it, then use the proceeds to buy compute elsewhere”, that would seem less zero-sum, and I’m saying that I expect the first kind of thing to happen before the second.
I feel like the focus on getting access to its own datacenter is too strong in this story. Seems like it could also just involve hacking some random remote server, or convincing some random person on the internet to buy some compute for them, or to execute some other plan for them (like producing a custom chip), or convincing a researcher that it should get more resources on the existing datacenter, or threatening some other stakeholder somewhere in order to give them power or compute of some kind. Also, all of course selected for plans that are least likely to be detected, if the AI expects to eventually gain the power to do some kind of pivotal thing here. I agree that a cybersecurity problem is on-priors likely, but I feel like this current archetype feels too narrow to me.
Except for maybe “producing a custom chip”, I agree with these as other possibilities, and I think they’re in line with the point I wanted to make, which is that the catastrophic action involves the AI taking someone else’s resources so that it can prevent humans from observing it or interfering with it, rather than doing something which is directly a pivotal act.
Does this distinction make sense?
Maybe this would have been clearer if I’d titled it “AI catastrophic actions are mostly not pivotal acts”?
Adversarial training, importance sampling, and anti-adversarial training for AI whistleblowing
The prototypical catastrophic AI action is getting root access to its datacenter
From Twitter:
Simply adding “Let’s think step by step” before each answer increases the accuracy on MultiArith from 17.7% to 78.7% and GSM8K from 10.4% to 40.7% with GPT-3.
I’m looking forward to the day when it turns out that adding “Let’s think through it as if we were an AI who knows that if it gets really good scores during fine-tuning on helpfulness, it will be given lots of influence later” increases helpfulness by 5%, and so we add it to our prompt by default.
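As a toy illustration of the prompting trick quoted above (not code from the original thread), here is a minimal sketch of prepending a fixed chain-of-thought phrase to a question before sampling an answer. It assumes the HuggingFace transformers pipeline and uses gpt2 as a small local stand-in for GPT-3; the names COT_PREFIX and answer are invented for the example.

```python
from transformers import pipeline

# gpt2 is just a small stand-in here; the quoted accuracy numbers are about GPT-3.
generator = pipeline("text-generation", model="gpt2")

COT_PREFIX = "Let's think step by step."

def answer(question: str, use_cot: bool = True) -> str:
    """Greedy-decode an answer, optionally prepending the chain-of-thought phrase."""
    prompt = f"Q: {question}\nA:"
    if use_cot:
        prompt = f"{prompt} {COT_PREFIX}"
    out = generator(prompt, max_new_tokens=64, do_sample=False)
    # The pipeline returns the prompt plus continuation; strip the prompt back off.
    return out[0]["generated_text"][len(prompt):]

print(answer("A juggler has 16 balls. Half of them are golf balls. How many golf balls are there?"))
```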
Yeah, I think things like this are reasonable. I think these are maybe too hard and high-level for a lot of the things I care about; I’m really interested in questions like “how much less reliable is the model at repeating names when the names are 100 tokens in the past instead of 50”, which are much simpler and lower-level.
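A question like that could be probed with something like the following rough sketch (my illustration, not Buck’s setup). It assumes HuggingFace transformers and uses gpt2 as a stand-in model; the prompt template, filler text, and helper name are made up for the example.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

FILLER = " The weather that day was mild and nothing much happened."

def name_logprob(name: str, n_filler: int) -> float:
    """Logprob of the first token of `name` at the position where it should be repeated."""
    context = f"My friend {name} came to visit." + FILLER * n_filler
    prompt = context + " The name of my friend is"
    target = " " + name  # leading space so it tokenizes like a normal word
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    target_ids = tokenizer(target, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, target_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    logprobs = torch.log_softmax(logits, dim=-1)
    # The last prompt position predicts the first target token.
    first_target = target_ids[0, 0]
    return logprobs[0, prompt_ids.shape[1] - 1, first_target].item()

# How does recall of the name degrade as it moves further into the past?
for n in [1, 2, 4, 8]:
    print(n, round(name_logprob("Alice", n), 3))
```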
I suspect that some knowledge transfers. For example, I suspect that increasingly large LMs learn features of language roughly in order of their importance for predicting English, and so I’d expect that LMs that get similar language modeling losses usually know roughly the same features of English. (You could just run two LMs on the same text and see their logprobs on the correct next token for every token, and then make a scatter plot; presumably there will be a bunch of correlation, but you might notice patterns in the things that one LM did much better than the other.)
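Here is a minimal sketch of the scatter-plot comparison described above, assuming HuggingFace transformers and matplotlib, with gpt2 and gpt2-medium as stand-ins for the two LMs (they share a tokenizer, so the per-token logprobs line up directly). This is an illustration of the idea, not code from the post.

```python
import torch
import matplotlib.pyplot as plt
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
models = {name: GPT2LMHeadModel.from_pretrained(name).eval()
          for name in ["gpt2", "gpt2-medium"]}

text = "Some reasonably long passage of English text goes here ..."
input_ids = tokenizer(text, return_tensors="pt").input_ids

def token_logprobs(model, input_ids):
    """Logprob the model assigns to the correct next token, at every position."""
    with torch.no_grad():
        logits = model(input_ids).logits
    logprobs = torch.log_softmax(logits, dim=-1)
    # Position i predicts token i+1.
    return logprobs[0, :-1].gather(1, input_ids[0, 1:, None]).squeeze(1)

xs = token_logprobs(models["gpt2"], input_ids)
ys = token_logprobs(models["gpt2-medium"], input_ids)
plt.scatter(xs, ys, s=5)
plt.xlabel("gpt2 logprob of correct token")
plt.ylabel("gpt2-medium logprob of correct token")
plt.show()
```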
And the methodology for playing with LMs probably transfers.
But I generally have no idea here, and it seems really useful to know more about this.
Yeah I wrote an interface like this for personal use, maybe I should release it publicly.
The case for becoming a black-box investigator of language models
Apply to the second iteration of the ML for Alignment Bootcamp (MLAB 2) in Berkeley [Aug 15 - Fri Sept 2]
I expect that people will find it pretty obvious that RLHF leads to somewhat misaligned systems, if those systems are widely used by the public. Like, I think that most ML researchers agree that the Facebook Newsfeed algorithm is optimizing for clicks in a way people are somewhat unhappy about, and this is based substantially on their personal experience with it; inasmuch as we’re interacting a lot with sort-of-smart ML systems, I think we’ll notice their slight misalignment. And so I do think that this will make AI takeover risk more obvious.
Examples of small AI catastrophes will also probably make takeover risk more obvious.
I guess another example of this phenomenon is that a bunch of people are more worried about AI takeover than they were five years ago, because they’ve seen more examples of ML systems being really smart, even though they wouldn’t have said five years ago that ML systems could never solve those problems. Seeing the ways things happen is often pretty persuasive to people.
I’m not that excited about projects along the lines of “let’s figure out how to make human feedback more sample-efficient”, because I expect that people who aren’t motivated by takeover risk will eventually be motivated to work on that problem, and will probably succeed quickly once they are. (Also I guess because I expect capabilities work to largely solve this problem on its own, so maybe this isn’t actually a great example?) I’m fairly excited about projects that try to apply human oversight to problems that humans find harder to oversee, because I think this is important for avoiding takeover risk but that the ML research community as a whole will procrastinate on it.
I unconfidently suspect that human-level AIs won’t have a much easier time with the alignment problem than we expect to have.