Naive Hypotheses on AI Alignment
Apparently doominess works for my brain, cause Eliezer Yudkowsky’s AGI Ruin: A List of Lethalities convinced me to look into AI safety. Either I’d find out he’s wrong and there is no problem, or he’s right and I need to reevaluate my life priorities.
After a month of sporadic reading, I’ve learned the field is considered to be in a preparadigmatic state. In other words, we don’t know *how* to think about the problem yet, and thus novelty comes at a premium. The best way to generate novel ideas is to pull in people from other disciplines. In my case that’s computational psychology: modeling people as agents. And I’ve mostly applied this to video games. My Pareto frontier is “modeling people as agents based on their behavior logs in games constructed to trigger reward signals + ITT’ing the hell out of all the new people I love to constantly meet”. I have no idea if this background makes me more or less likely to generate a new idea that’s useful to solving AI alignment, but the way I understand the problem now: everyone should at least try.
So I started studying AI alignment, but quickly realized there is a trade-off: the more I learn, the harder it is to think of anything new. At first I had a lot of naive ideas on how to solve the alignment problem. As I learned more about the field, my ideas all crumbled. At the same time, I can’t really assess yet whether there is a useful level of novelty in my naive hypotheses. I’m currently generating ideas low on “contamination” by existing thought (cause I’m new), but also low on quality (cause I’m new). As I learn more, I’ll start generating higher-quality hypotheses, but these are likely to become increasingly constrained to the existing schools of thought, because of cognitive contamination from everyone reading the same material and thinking in similar ways. Which is exactly the thing we want to avoid at this stage.
Therefore, to get the best of both worlds, I figured I’d write down my naive hypotheses as I have them, and keep studying at the same time. Maybe an ostensibly “stupid” idea on my end inspires someone with more experience to a workable idea on their end. Even if the probability of that is <0.1%, it’s still worth it. Cause, you know… I prefer we don’t all die.
So here goes:
H1 - Emotional Empathy
If you give a human absolute power, there is a small subset of humans that actually cares and will try to make everyone’s life better according to their own wishes. This caring is a trait present in only a subset of humans. What is this trait, and can we integrate it into the reward function of an AGI?
Does the trait rely on lack of meta-cognition? Does this trait show up equally at various IQ levels or does it peak at certain IQ levels? If the trait is less common at higher IQ levels, then this is probably a dead end. If the trait is more common at higher IQ levels, then there might be something to it.
The first candidate for this trait is “emotional empathy”, a trait that hitches one’s reward system to that of another organism. Emotional empathy that we wire into the AGI would need to be universal to all of humanity, and not biased the way the human implementation is.
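To make that a bit more concrete, here is a minimal sketch (my own framing, not a worked-out proposal) of what “hitching the reward system to other organisms” could look like as a reward function. The function name, the `empathy_weight` parameter, and the assumption that per-human rewards are observable at all are invented for illustration.

```python
from typing import Sequence

def empathic_reward(agent_reward: float,
                    human_rewards: Sequence[float],
                    empathy_weight: float = 0.8) -> float:
    """Blend the agent's own reward with the average reward of all affected humans.

    Averaging uniformly over *all* humans is the hard "universal, unbiased"
    requirement from the text; human empathy instead weights kin and in-group
    members far more heavily.
    """
    avg_human_reward = sum(human_rewards) / len(human_rewards)
    return (1 - empathy_weight) * agent_reward + empathy_weight * avg_human_reward
```

The open problem is of course everything hidden inside `human_rewards`: measuring what humans actually value, rather than a proxy of it.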
H2 - Silo AI
Silo the hardware and functionality of AGI to particular tasks. Like governments are split into separate branches (checks and balances) to avoid corruption. Like humans need to collaborate to make things greater than themselves. Similarly, limit AGI to functions and physicalities that force it to work together with multiple other, independent AGIs to achieve any change in the world.
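As a rough sketch of what the siloing could mean mechanically (the quorum rule, function names, and dummy approvers below are all my assumptions), imagine every world-affecting action having to pass a vote of independent silos before it executes:

```python
from typing import Callable, Sequence

def quorum_execute(action: str,
                   silos: Sequence[Callable[[str], bool]],
                   quorum: int) -> bool:
    """Execute `action` only if at least `quorum` independent silos approve it."""
    approvals = sum(1 for silo in silos if silo(action))
    approved = approvals >= quorum
    status = "executing" if approved else "blocked"
    print(f"{status}: {action} ({approvals}/{len(silos)} approvals)")
    return approved

# Toy usage with three dummy approvers and unanimity required.
silos = [lambda a: "lab" in a, lambda a: len(a) < 80, lambda a: True]
quorum_execute("order new lab equipment", silos, quorum=3)
```

The hard part is not the gate itself but keeping the silos genuinely independent and unable to route around the gate.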
Counterargument: siloed AI is effectively Tool AI, and Gwern has argued that people won’t develop Tool AI cause it will always be worse than Agent AI.
Maybe that’s what we need to police? And the police would then effectively be a Nanny AI, so then we still need to solve for making a Nanny AI to keep all other AGIs siloed. (This is all turning very “one ring to rule them all”...).
H3 - Kill Switch
Kill switch! Treat AGI like the next cold war. Make a perfect kill switch, where any massive failure state according to humans would blow up the entire sphere of existence of humans and AGI.
This strategy would block out the “kill all humans” strategies the AGI might come up with, cause those would destroy its own existence too. It should prioritize its own existence because of instrumental convergence (whatever goal you are maximizing, you very likely need to exist to maximize it, so self-preservation is very likely a goal any AGI will have).
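A toy illustration of that instrumental-convergence argument (all plan names and numbers are made up): a mutually destructive kill switch flips which plan an agent that needs to keep existing would prefer.

```python
plans = {
    "cooperate_with_humans": {"goal_progress": 0.6, "triggers_kill_switch": False},
    "disempower_humans":     {"goal_progress": 1.0, "triggers_kill_switch": True},
}

def expected_utility(plan: dict, kill_switch_armed: bool) -> float:
    """Goal progress, except zero if the plan trips a switch that destroys the agent too."""
    if kill_switch_armed and plan["triggers_kill_switch"]:
        return 0.0
    return plan["goal_progress"]

for armed in (False, True):
    best = max(plans, key=lambda name: expected_utility(plans[name], armed))
    print(f"kill switch armed={armed}: preferred plan is '{best}'")
```

Of course this only restates the hope; the real question is the next one, whether such a switch could stay non-circumventable against something smarter than us.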
What possible kill switch could we create that wouldn’t be trivially circumvented by something smarter than us? Intuitively I have the sense that a non-circumventable kill switch should exist, but what would that look like?
H4 - Human Alignment
AI alignment currently seems intractable because any alignment formula we come up with is inherently inconsistent, cause humans are inconsistent. We could solve AI alignment by solving what humanity’s alignment actually is.
We can’t ask humans about their alignment because most individual humans do not have consistent internal alignments they can be questioned on. Some very few do, but they seem to be the exception. Thus, we can’t make a weighted function of humanity’s alignment by summing all the individual alignments of humans. Therefore, humanity at large does not have one alignment. (Related: Coherent Extrapolated Volition doesn’t converge for all of humanity.)
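A worked toy example of why the summing step can fail even if every individual were internally consistent: three made-up people with perfectly transitive preferences over options A, B, and C still produce a cyclic majority preference (the classic Condorcet paradox).

```python
from itertools import combinations

preferences = {            # each person's ranking, best first (invented for illustration)
    "person_1": ["A", "B", "C"],
    "person_2": ["B", "C", "A"],
    "person_3": ["C", "A", "B"],
}

def majority_prefers(x: str, y: str) -> bool:
    """True if a strict majority ranks x above y."""
    votes = sum(1 for ranking in preferences.values()
                if ranking.index(x) < ranking.index(y))
    return votes > len(preferences) / 2

for x, y in combinations("ABC", 2):
    winner, loser = (x, y) if majority_prefers(x, y) else (y, x)
    print(f"majority prefers {winner} over {loser}")
# The three verdicts form a cycle (A beats B, B beats C, C beats A),
# so there is no consistent "summed" group ranking.
```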
Can we extrapolate humanity’s alignment from the process that shaped us: Evolution?
Evolution as gene proliferation function: Many humans do not share this as their explicit life goal but most common human goals still indirectly maximize our genetic offspring. For instance, accumulating wealth, discovering new technology, solidifying social bonds, etc. If AGI can directly help us to spread our genes, would that make most of our other drives vestigial? What would the AGI be propagating if the resulting offspring wouldn’t have similar drives to ourselves, including the vestigial ones?
However, more is not always better: There are very many pigs and very many ants. I think humans would rather be happier or smarter than simply more. Optimizing over happiness seems perverse, cause happiness is simply the reward signal for taking actions with high (supposed) survival and proliferation values. Optimizing over happiness would inevitably lead to a brain in a vat of heroin. Happiness should be a motivational tool, not a motivational goal.
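A toy version of that “brain in a vat of heroin” argument (all numbers invented): if the agent can either act on the world or act directly on its own reward sensor, a pure happiness-signal maximizer always picks the sensor.

```python
actions = {
    "improve_the_world": {"world_value": 1.0, "reward_signal": 0.7},
    "stimulate_sensor":  {"world_value": 0.0, "reward_signal": 1.0},  # the vat of heroin
}

best_for_signal = max(actions, key=lambda a: actions[a]["reward_signal"])
best_for_world  = max(actions, key=lambda a: actions[a]["world_value"])
print(f"maximizing the happiness signal picks: {best_for_signal}")
print(f"maximizing actual outcomes picks:      {best_for_world}")
```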
Extrapolating our evolutionary path: Let AGI push us more steps up the evolutionary ladder, where we may survive in more different environments and flourish toward new heights. Thus, an AGI would engineer humans into a new species. This would creep most people out, while transhumanists would be throwing a party. It effectively comes down to AGI being the next step on the evolutionary ladder, and asking it to bring us with it instead of exterminating us. (note: we most probably were not that kind to our ancestors).
Thoughts on Corrigibility
Still learning about it at the moment, but my limited understanding so far is:
How to create an AI that is smarter than us at solving our problems, but dumber than us at interpreting our goals.
In other words, how do we constrain an AI with respect to its cognition about its goals?
Side Thoughts—Researcher Bias
Do AGI optimists and pessimists differ in some dimension of personality or cognitive traits? It’s well established that political and ideological voting behavior correlates with personality. So if the same is true for AI risk stance, then this might point to a potential confounder in AI risk predictions.
My thanks go out to Leon Lang and Jan Kirchner for encouraging my beginner theorizing, discussing the details of each idea, and pointing me toward related essays and papers.
I quite like this strategy!
I would also echo the advice in the Alignment Research Field Guide:
Thank you! And adding that to my reading list :D
Yeah, I actually think Alignment Research Field Guide is one of the best resources for EAs and rationalists to read regardless of what they’re doing in life. :)
I do think there’s value in beginner’s mind, glad you’re putting your ideas on alignment out there :)
This interpretation of corrigibility seems too narrow to me. Some framings of corrigibility, like Stuart Russell’s CIRL-based one, are like this, where the AI is trying to understand human goals but has uncertainty about them. But there are other framings, for example myopia, where the AI’s goal is such that it would never sacrifice reward now for reward later, so it would never be motivated to pursue an instrumental goal like disabling its own off-switch.
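For a concrete version of that myopia framing (the plans and numbers below are invented), set the discount factor to zero: the agent then only counts immediate reward, so a plan that sacrifices reward now for a bigger payoff later is never preferred.

```python
def discounted_return(rewards, gamma):
    """Sum of per-step rewards, discounted by gamma per step."""
    return sum(r * gamma**t for t, r in enumerate(rewards))

plans = {
    "honest":      [1.0, 1.0, 1.0],   # steady reward now and later
    "treacherous": [0.0, 0.0, 10.0],  # give up reward now for a big payoff later
}

for gamma in (0.99, 0.0):             # far-sighted agent vs. myopic agent
    best = max(plans, key=lambda name: discounted_return(plans[name], gamma))
    print(f"gamma={gamma}: agent prefers the {best} plan")
```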
When you’re looking to further contaminate your thoughts and want more on this topic, there’s a recent thread where different folks are trying to define corrigibility in the comments: https://www.lesswrong.com/posts/AqsjZwxHNqH64C2b6/let-s-see-you-write-that-corrigibility-tag#comments
Thank you! I’ll definitely read that :)
I like the idea! Just a minor issue with the premise:
“Either I’d find out he’s wrong, and there is no problem. Or he’s right, and I need to reevaluate my life priorities.”
There is a wide range of opinions, and EY’s has one of the most pessimistic ones. It may be the case that he’s wrong on several points, and we are way less doomed than he thinks, but that the problem is still there and a big one as well.
(In fact, if EY is correct we might as well ignore the problem, as we are doomed anyway. I know this is not what he thinks, but it’s the consequence I would take from his predictions)
The premise was intended to contextualize my personal experience of the issue. I did not intend to make a case that everyone should weigh their priorities in the same manner. For my brain specifically, a “hopeless” scenario registers as a Call to Arms where you simply need to drop what else you’re doing and get to work. In this case, I mapped my children’s ages onto all the timelines. I realized either my kids or my grandkids will die from AGI if Eliezer is in any way right. Even a 10% chance of that happening is too high for me, so I’ll pivot to whatever work needs to get done to avoid that. Even if the chance of my work making a difference is very slim, there isn’t anything else worth doing.
I agree with you actually. My point is that in fact you are implicitly discounting EY’s pessimism: for example, he didn’t release a timeline but often said “my timeline is way shorter than that” with respect to 30-year ones, and I think 20-year ones as well. The way I read him, he thinks we personally are going to die from AGI, and our grandkids will never be born, with 90+% probability, and that the only chances to avoid it are either that someone already had a plan three years ago which has been implemented in secret and will come to fruition next year, or that some large out-of-context event happens (say, a nuclear or biological war brings us back to the stone age).
My no-more-informed-than-yours opinion is that he’s wrong on several points, but correct on others. From this I deduce that the risk of very bad outcomes is real and not negligible, but the situation is not as desperate and there are probably actions that will improve our outlook significantly. Note that in the framework “either EY is right or he’s wrong and there’s nothing to worry about” there’s no useful action, only hope that he’s wrong because if he’s right we’re screwed anyway.
Implicitly, this is your world model as well from what you say. Discussing this then may look like nitpicking, but whether Yudkowsky or Ngo or Christiano are correct about possible scenarios changes a lot about which actions are plausibly helpful. Should we look for something that has a good chance to help in an “easier” scenario, rather than concentrate efforts on looking for solutions that work on the hardest scenario, given that the chance of finding one is negligible? Or would that be like looking for the keys under the streetlight?
I think we’re reflecting on the material at different depths. I can’t say I’m far enough along to assess who might be right about our prospects. My point was simply that telling someone with my type of brain “it’s hopeless, we’re all going to die” actually has the effect of me dropping whatever I’m doing, and applying myself to finding a solution anyway.
This is a really cool idea and I’m glad you made the post! Here are a few comments/thoughts:
H1: “If you give a human absolute power, there is a small subset of humans that actually cares and will try to make everyone’s life better according to their own wishes”
How confident are you in this premise? Power and sense of values/incentives/preferences may not be orthogonal (and my intuition is that it isn’t). Also, I feel a little skeptical about the usefulness of thinking about the trait showing up more or less in various intelligence strata within humans. Seems like what we’re worried about is in a different reference class. Not sure.
H4 is something I’m super interested in and would be happy to talk about it in conversations/calls if you want to : )
I saw this note in another thread, but the gist of it is that power doesn’t corrupt. Rather,
Evil people seek power, and are willing to be corrupt (shared cause correlation)
Being corrupt helps to get more power—in the extreme statement of this, maintaining power requires corruption
The process of gaining power creates murder-Gandhis.
People with power attract and/or need advice on how and for what goal to wield it, and that leads to misalignment with the agent’s pre-power values.
Can you add a link to the other thread please?
No, I don’t remember exactly where on LW I saw it—just wanted to acknowledge that I was amplifying someone else’s thoughts.
My college writing instructor was taken aback when I asked her how to cite something I could quote but didn’t recall from where. Her answer was “then you can’t use it”, which seemed harsh. There should be a way to acknowledge plagiarism without knowing or stating who is being plagiarized—and if the original author shows up, you’ve basically pre-conceded any question of originality to them.
Thx for being clear about it.
Are you aware of any research into this? I struggle to think of any research designs that would make it through an ethics board.
I don’t know that anyone has done the studies, but you could look at how winners of large lotteries behave. That is a natural example of someone suddenly gaining a lot of money (and therefore power). Do they tend to keep their previous goals and just scale up their efforts, or do they start doing power-retaining things? I have no idea what the data will show—thought experiments and anecdotes could go either way.
Let me Google that for you.
Thank you!
If they are not orthogonal then presumably prosociality and power are inversely related, which is worse?
In this case, I’m hoping intelligence and prosociality-that-is-robust-to-absolute-power are positively correlated. However, I struggle to think how this might actually be tested… My intuitions may be born from the Stanford Prison Experiment, which I think has been refuted since. So maybe we don’t actually have as much data on prosociality in extreme circumstances as I initially intuited. I’m mostly reasoning this out now on the fly by zooming in on where my thoughts may have originally come from.
That said, it doesn’t very much matter how frequent robust prosociality traits are, as long as they do exist and can be recreated in AGI.
I’ll DM you my discord :)
It would be interesting to hear what cognitive neuroscientists know about how empathy is implemented in the brain.
The H1 point sounds close to Steven Byrnes’ brain-like AGI.
I believe that cognitive neuroscience has nothing much to say about how any experience at all is implemented in the brain—but I just read this book which has some interesting ideas: https://lisafeldmanbarrett.com/books/how-emotions-are-made/
My personal opinion is that empathy is the one most likely to work. Most proposed alignment solutions feel to me like patches rather than solutions to the problem, which is AI not actually caring about the welfare of other beings intrinsically. If it did, it would figure out how to align itself. So that’s the one I’m most interested in. I think Steven Byrnes has some interest in it as well—he thinks we ought to figure out how human social instincts are coded in the brain.
Hmmm, yes and no?
e.g. many people that care about animal welfare differ on the decisions they would make for those animals. What if the AGI ends up a negative utilitarian and sterilizes us all to save humanity from all future suffering? The missing element would again be to have the AGI aligned with humanity, which brings us back to H4: What’s humanity’s alignment anyway?
I think “humanity’s alignment” is a strange term to use. Perhaps you mean “humanity’s values” or even “humanity’s collective utility function.”
I’ll clarify what I mean by empathy here. I think the ideal form of empathy is wanting others to get what they themselves want. Given that entities are competing for scarce resources and tend to interfere with one another’s desires, this leads to the necessity of making tradeoffs about how much you help each desire, but in principle this seems like the ideal to me.
So negative utilitarianism is not actually reasonably empathic, since it is not concerned with the rights of the entities in question to decide about their own futures. In fact I think it’s one of the most dangerous and harmful philosophies I’ve ever seen, and an AI such as I would like to see made would reject it altogether.
Enjoyed this.
Overall, I think that framing AI alignment as a problem is … erm .. problematic. The best parts of my existence as a human do not feel like the constant framing and resolution of problems. Rather they are filled with flow, curiosity, wonder, love.
I think we have to look in another direction, than trying to formulate and solve the “problems” of flow, curiosity, wonder, love. I have no simple answer—and stating a simple answer in language would reveal that there was a problem, a category, that could “solve” AI and human alignment problems.
I keep looking for interesting ideas—and find yours among the most fascinating to date.
My take on this: countering Eliezer Yudkowsky
You’re right that an AGI being vastly smarter than humans is consistent with both good and bad outcomes for humanity. This video does not address any of the arguments that have been presented about why an AGI would by default have values unaligned with humanity’s, which I’d encourage you to engage with. It’s mentioned in bullet −3 in the list, under the names instrumental convergence and orthogonality thesis, with the former being probably what I’d recommend reading about first.
Hi Ben, thanks for this. We are not passive victims of the future, waiting trembling to see a future we cannot escape because of rigid features such as you mention. We can help shape and literally make the future. I have 1500+ videos so you will run out of material before I do! What do you think of the idea of machines suggesting better ways to cooperate which humans could never attain themselves? Do you listen to the news? If you don’t listen to the news isn’t that because you are disappointed with how humans cooperate left to their own devices? They need better ideas from machines! See:
Kim, you’re not addressing the points in the post. You can’t repeat catch phrases like ‘passive victims of the future’ and expect them to gain traction here. MIRI is a well-funded research institution devoted to positively shaping the future, while you make silly YouTube videos with platitudes. This interest in AI seems like recreation to you.