I think that’s true in the US, but not in most of Europe. E.g. in Switzerland a first-year PhD student gets paid $40,000 a year WITHOUT doing any teaching, and more if they teach. That’s unusually generous, but I think the setup isn’t uncommon.
Expensive in terms of time, perhaps, but almost all good universities in the US and continental Europe provide decent salaries to PhD students. The UK is a bit more haphazard, but it’s still very rare for UK PhD students to actually pay to be there, especially in technical fields.
A realistic example which matches this payoff matrix very well is when everyone is using the same means of transport, and so has to wait for the last person to arrive before anyone can leave. A more speculative one: trying to create an environment of group emotional openness and vulnerability, where one person being sarcastic and prickly can ruin it for everyone.
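To make the structure concrete, here’s a toy sketch in Python of the kind of “weakest link” payoff I have in mind (the function weakest_link_payoff and the specific numbers are made up for illustration, not taken from the post):

```python
# Weakest-link payoff: everyone's benefit is set by the *minimum* effort
# in the group, minus the cost of their own effort. Illustrative numbers only.

def weakest_link_payoff(efforts, benefit=2.0, cost=1.0):
    """Return each player's payoff given a list of effort levels."""
    floor = min(efforts)
    return [benefit * floor - cost * e for e in efforts]

# Everyone shows up on time: the group does fine.
print(weakest_link_payoff([1, 1, 1, 1]))  # [1.0, 1.0, 1.0, 1.0]

# One person is late (effort 0): everyone else's payoff collapses.
print(weakest_link_payoff([1, 1, 1, 0]))  # [-1.0, -1.0, -1.0, 0.0]
```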
I appreciate this post, and I thought you struck a good balance of explaining a standard concept using examples which made it interesting even if you already knew the concept.
Single neurons cannot represent two distinct kinds of quantities, as would be required to do backprop (the presence of features, and gradients for training).
I don’t understand this: why can’t you just have some neurons which represent the former, and others which represent the latter?
The drop-out algorithm (which has been very popular, though it recently seems to have been largely replaced by batch normalisation).
Do you have any particular source for dropout being replaced by batch normalisation, or is it an impression from the papers you’ve been reading?
Thanks for the excellent post, Jacob. I think you might be placing too much emphasis on learning algorithms as opposed to knowledge representations, though. It seems very likely to me that at least one theoretical breakthrough in knowledge representation will be required to make significant progress (for one argument along these lines, see Pearl 2018). Even if it turns out that the brain implements backpropagation, that breakthrough will still be a bottleneck. In biological terms, I’m thinking of the knowledge representations as analogous to innate aspects of cognition impressed upon us by evolution, and learning algorithms as what an individual human uses to learn from their experiences.
Here are two examples which suggest that the former are more important than the latter. The first is the “poverty of the stimulus” argument in linguistics: that children simply don’t hear enough words to infer language from first principles. This suggests that ingrained grammatical instincts are doing most of the work in narrowing down what the sentences they hear mean. Even if we knew that kids were doing backpropagation whenever they heard new sentences, that wouldn’t tell us much about how that grammatical knowledge works, because you can do backpropagation on lots of different things. (You know more psycholinguistics than I do, though, so let me know if I’m misrepresenting anything.)
Second example: Hinton argues in this talk that CNNs don’t create representations of three-dimensional objects from two-dimensional pictures in the same way as the human brain does; that’s why he invented capsule networks, which (he claims) do use such representations. Both capsules and CNNs use backpropagation, but the architecture of capsules is meant to be an extra “secret sauce”. Seeing whether they end up working well on vision tasks will be quite interesting, because vision is better-understood and easier than abstract thought (for example, it’s very easy to theoretically specify how to translate between any two visual perspectives: it’s just a matrix multiplication).
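To illustrate what I mean by “just a matrix multiplication”, here’s a minimal sketch of moving a 3D point into a new camera viewpoint using homogeneous coordinates (the angle and offset are arbitrary, chosen purely for illustration):

```python
import numpy as np

# A change of viewpoint on a 3D scene is a single rigid transform:
# rotate the camera and translate it, applied as one 4x4 matrix
# in homogeneous coordinates. The numbers below are arbitrary.

theta = np.pi / 6                       # rotate the viewpoint 30 degrees about the z-axis
R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta),  np.cos(theta), 0.0],
              [0.0,            0.0,           1.0]])
t = np.array([0.5, -1.0, 2.0])          # shift the viewpoint

T = np.eye(4)                           # assemble the 4x4 transform
T[:3, :3] = R
T[:3, 3] = t

point = np.array([1.0, 2.0, 3.0, 1.0])  # a scene point in homogeneous coordinates
print(T @ point)                        # the same point as seen from the new perspective
```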
Lastly, as a previous commenter pointed out, it’s not backpropagation but rather gradient descent which seems like the important factor. More specifically, recent research suggests that stochastic gradient descent (SGD) leads to particularly good outcomes, for interesting theoretical reasons (see Zhang 2017 and this blog post by Huszár). Since the brain does online learning, if it’s doing gradient descent then it’s doing a variant of SGD. I discuss why SGD works well in more detail in the first section of this blog post.
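To be concrete about that last sentence, here’s a minimal sketch of online SGD on a toy linear model (purely illustrative, nothing brain-specific): each update uses a single example, rather than the gradient over a whole dataset.

```python
import numpy as np

# Online SGD on a toy linear regression y = w·x.
# "Online" just means one example per update, which is the property
# shared with learning from a continuous stream of experience.

rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
w = np.zeros(2)
lr = 0.05

for step in range(2000):
    x = rng.normal(size=2)          # a single new example
    y = true_w @ x                  # noiseless target, for simplicity
    grad = 2 * (w @ x - y) * x      # gradient of the squared error on this one example
    w -= lr * grad                  # one stochastic gradient step

print(w)  # converges towards [2.0, -1.0]
```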
I’d also add to 2: an intuitive explanation of eigenvectors and their properties in general. They seem to pop up everywhere in linear algebra, and when they do, people gesture towards a set of intuitions about them that I haven’t managed to pick up.
Nice work! I would be happy to work through some of the MIRI research guide with you. I’m particularly interested in the books on Probabilistic Inference, Provability Logic, Category Theory, and Topology. PM me if you’re planning to cover any of them any time soon.
What you’ve defined above is just morality in general: basically any moral theory can be expressed as a “nonlinear” function of some properties of individuals plus some properties of the world. For example, in deontology one nonlinearity is the fact that murdering someone is nearly-infinitely bad.
The key thing that utilitarianism does is claim that the function we should be maximising is roughly linear in well-being; my main point is to clarify that it shouldn’t be linear in “utility” (in either the desire sense or the economic sense).
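To put the contrast in symbols (a rough sketch of my own; here $w_i$ stands for the well-being of individual $i$, and $f$ is some arbitrary, possibly highly nonlinear function):

$$V_{\text{general}} = f(w_1, \dots, w_n, \text{other facts about the world}), \qquad V_{\text{utilitarian}} \approx \sum_{i=1}^{n} w_i$$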
Okay, but now you’ve basically defined “increasing utility” out of existence? If voting power is roughly normalised, then it’s roughly equally important to save the life of an immensely happy, satisfied teenager with a bright future and the life of a nearly-suicidal retiree who’s going to die soon anyway, as long as staying alive is the strongest relevant desire for both. In fact, it’s even worse: if the teenager has a strong unreciprocated crush, I can construct situations where only half of their voting power will go towards saving themselves, so their life is effectively half as valuable as a loner’s.
That interpersonal utility comparisons are impossible in VNM utility is not some incidental fact; it is an inevitable consequence of the formalism’s assumptions.
Any consequence of a formalism’s assumptions is inevitable, so I don’t see what you mean. This happens to be an inevitable consequence which you can easily change just by adding a normalisation assumption. The Wikipedia page for social choice theory is all about how social choice theorists compare utilities interpersonally—and yes, Amartya Sen did win a Nobel prize for related work. Mostly they use partial comparability, but there have been definitions of full comparability which aren’t “nonsensical”.
The first question in any such scenario has to be: “Where are these numbers coming from, and what do they mean?” If we can’t answer it in a rigorous way, then the discussion is moot.
I agree that if you’re trying to formulate a moral theory, then you need to come up with such numbers. My point is that, once you have come up with your numbers, you still need to solve the issue that I present. You may not think this is useful, but there are plenty of people who believe in desire utilitarianism; this is aimed at them.
Re mistake two:
Okay, so it’s a mistake because it’s simply undefined mathematical nonsense. Now let me define a new form of utility which differs from economic utility only by the fact that interpersonal comparisons are allowed, and occur in whatever way you think is most reasonable. How do you feel about using this new form of utility to draw moral conclusions? I think my arguments are relevant to that question.
Re mistake one:
I’m not assuming that the difference in utility between any pair of world-states which differ in a certain way is constant, any more than an economist is when they say “let X be the utility that is gained from consuming one unit of good Y”. Both are approximations, but both are useful approximations.
If you’d prefer, I can formalise the situation more precisely in terms of world-states. For each world-state, each member of the family assigns it utility equal to the number of family members still alive. So if they all die, that’s 0 each. If they all survive, that’s 5 each, and the total utility summed across all of them is 25 (assuming we’re working with my “new form of utility” from above, where interpersonal addition is allowed).
Meanwhile each loner assigns 1 utility to worlds in which they survive, and 0 otherwise. So now, if we think that maximising utility is moral, we’d say it’s more moral to kill 24 loners than to kill one family of 5, even though each individual values their own life equally. I think that conclusion is unacceptable, and so it is a reductio of the idea that we should maximise any quantity similar to economic utility.
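Spelling out the arithmetic (a quick sketch in Python, using the stipulations above):

```python
# Stipulation: each family member assigns utility equal to the number of
# family members alive; each loner assigns 1 to their own survival.
# Summing across people uses the "new form of utility" from above.

family_size = 5
n_loners = 24

total_if_family_survives = family_size * family_size  # 5 members, each assigning 5 -> 25
total_if_loners_survive = n_loners * 1                 # 24 loners, each assigning 1 -> 24

print(total_if_family_survives, total_if_loners_survive)  # 25 vs 24
# So a total-utility maximiser lets the 24 loners die in order to save the family.
```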
It makes sense, but I find it very counterintuitive, partly because it’s not obvious to me whether the concept of “measuring desire” makes sense. Here are two ways that I might measure whether people have a stronger desire for A or B:
1) I hook up a brainwave reader to each person, and see how strong/emotional/determined their reaction is to outcome A vs outcome B.
2) I ask each person whether they would swap outcome A for outcome B.
In the first case, it’s plausible to me that each person’s emotions are basically maxed out at the thought of either their own death, or their family’s death (since we know people are very bad at having emotions which scale appropriately with numbers). So then X = Y, and you save the 20 people.
In the second case, assume that each person involved desires to continue living, personally, at about the same strength S. But then you ask each member of the family whether they’d swap their own survival for someone else in their family surviving, and they’d say yes. So each member of the family has total desire > 5S that their family survives, whereas each loner has desire S to survive themselves, and so you save the family.
Which one is closer to your view of measuring desire? 2 seems more intuitive to me, because it matches the decisions we’d actually make, but then I find the conclusion that it’s more moral to save the family very strange.
I think your first objection is technically correct, but irrelevant to the point I was making; and your second objection is entirely consistent with my conclusion.
On “mistake one”: I am using “I assign utility U to living a happy life” as a shorthand for something like “In general, the difference in utilities I assign between worlds in which I am happily alive, and worlds in which I am not, is U, all else being equal.” This is a perfectly normal sort of phrasing; for example, the Wikipedia page on utility says that it “represents satisfaction experienced by the consumer from a good.” Do you object to this and any other talk of utility which isn’t phrased in terms of world-states?
On “mistake two”: I should have mentioned (and will edit to add) that economists don’t endorse interpersonal comparison of economic utility. But I’m not endorsing it either: I’m explicitly flagging it as a philosophical mistake, and explaining one reason why attempts to do so are misguided. This is more useful than simply saying that it’s ill-defined, because the latter leaves us to wonder why we can’t just construct a new way to compare utilities between people—for example, in another comment cousin_it is basically arguing for economic-style utility + interpersonal comparison.
I feel like “negotiation” is very handwavey. Can you explain what that looks like in a simple zero-sum situation? For example, suppose that you can either save the lives of the family of 5 that I described above, or else save 20 loners who have no strong relationships; assume every individual has an equally strong desire to remain alive. How do we actually aggregate all their desires, without the problem of double counting?
The reason I think hedonic views are important is that desires can be arbitrarily weird. I don’t want to endorse as moral a parent who raises their child with only one overwhelmingly strong desire—that the sky remains blue. Is that child’s well-being therefore much higher than anyone else’s, since everyone else has had some of their desires thwarted? More generally, I don’t think a “desire” is a particularly well-defined concept, and I wouldn’t want it to be my main moral foundation.
The thing I usually do, when asked to elicit a probability, is report a probability (usually 2 sig figs) and then also a subjective sense of how easy it would be to shift that probability by giving me more evidence / allowing me more time to think.
What is the correct technical way to summarise the latter quantity (ease of shifting), in an idealised setting?
Typo: MIT (not AGI) is taking more notice of AGI.