I’m sure someone else is able to write a more thoughtful/definitive answer, but I’ll try here to point to two key perspectives on the problem that are typically discussed under this name.
The first perspective is what Rohin Shah has called the motivation-competence split of AGI. One person who’s written about this perspective very clearly is Paul Christiano, so I’ll quote him:
When I say an AI A is aligned with an operator H, I mean:
A is trying to do what H wants it to do.
The “alignment problem” is the problem of building powerful AI systems that are aligned with their operators.
This is significantly narrower than some other definitions of the alignment problem, so it seems important to clarify what I mean.
In particular, this is the problem of getting your AI to try to do the right thing, not the problem of figuring out which thing is right. An aligned AI would try to figure out which thing is right, and like a human it may or may not succeed.
I believe the general idea is to build a system that is trying to help you, and never, in any situation, to run a computation that is acting adversarially towards you. Correspondingly, Paul Christiano’s research often takes the frame of the following problem:
The steering problem: Using black-box access to human-level cognitive abilities, can we write a program that is as useful as a well-motivated human with those abilities?
Here’s some more writing on this perspective:
Clarifying “AI Alignment” by Paul Christiano
The Steering Problem by Paul Christiano
The second perspective is what Rohin Shah has called the definition-optimization split of AGI. One person who’s written about this perspective very clearly is Nate Soares, so I’ll quote him:
Imagine you have a Jupiter-sized computer and a very simple goal: Make the universe contain as much diamond as possible. The computer has access to the internet and a number of robotic factories and laboratories, and by “diamond” we mean carbon atoms covalently bound to four other carbon atoms. (Pretend we don’t care how it makes the diamond, or what it has to take apart in order to get the carbon; the goal is to study a simplified problem.) Let’s say that the Jupiter-sized computer is running python. How would you program it to produce lots and lots of diamond?
As it stands, we do not yet know how to program a computer to achieve a goal such as that one.
We couldn’t yet create an artificial general intelligence by brute force, and this indicates that there are parts of the problem we don’t yet understand.
There are many AI systems you could build today that would help with this problem, and given that much compute, you could likely put it to some use that furthers the goal of making as much diamond as possible. But there is no single program that will continue to usefully create as much diamond as possible as you give it increasing computational power; at some point it will do something weird and unhelpful (cf. Bostrom’s “Perverse Instantiations”, and Paul Christiano on What does the universal prior actually look like?).
There are two types of open problem in AI. One is figuring out how to solve in practice problems that we know how to solve in principle. The other is figuring out how to solve in principle problems that we don’t even know how to brute-force yet.
The problem of aligning an AI is to create it such that, if the AI you created were to become far more intelligent than any system that has ever existed (including humans), it would continue to do the useful thing you asked it to do, and not do something else.
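To make the flavour of the difficulty a bit more concrete, here is a minimal, purely illustrative Python sketch (my own, not from Soares; count_diamond, predict_outcome, and the rest are hypothetical stand-ins, not real APIs). The outer optimization loop is trivial to write down; everything load-bearing is hidden inside functions nobody knows how to write, and naive argmax over a misspecified objective is exactly where the weird behaviour shows up:

```python
# Toy sketch only: the two helper functions below are exactly the parts we
# don't know how to specify, which is the point of the diamond-maximizer example.

def count_diamond(predicted_world):
    """Count carbon atoms covalently bound to four other carbon atoms.
    We don't know how to define this over a realistic world-model in a way
    that keeps pointing at 'diamond' as the optimizer gets more capable."""
    raise NotImplementedError

def predict_outcome(world_model, action):
    """Predict the world that results from taking `action`.
    A good-enough predictor here is most of general intelligence."""
    raise NotImplementedError

def choose_action(world_model, candidate_actions):
    # Naive optimization: pick whichever action is predicted to yield the
    # most diamond. With modest compute this can be useful; with enough
    # compute, literal argmax over a leaky objective is where perverse
    # instantiations appear.
    return max(
        candidate_actions,
        key=lambda action: count_diamond(predict_outcome(world_model, action)),
    )
```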
Here’s some more writing on this perspective:
The Rocket Alignment Problem by Eliezer Yudkowsky.
MIRI’s Approach by Nate Soares.
Methodology of unbounded analysis (unfinished) by Eliezer Yudkowsky.
Overall, I think that neither of these two perspectives is cleanly formalised or well-specified, and that’s a key part of the problem with making sure AGI goes well: being able to clearly state exactly what, in the long run, we’re confused about regarding how to build an AGI is half the battle.
Personally, when I hear ‘AI alignment’ at a party, an event, or on a blog, I expect a discussion of AGI design with the following assumption:
The key bottleneck to ensuring an existential win when creating AGI that is human-level-and-above is that we need to do advance work on technical problems that we’re confused about. (This is to be contrasted with e.g. social coordination among companies and governments about how to use the AGI.)
Precisely what we’re confused about, and which research will resolve our confusion, is an open question. The word ‘alignment’ captures the spirit of certain key ideas about what problems need solving, but is not a finished problem statement.
I would be interested to see the best explained answer to this.
Comment? No answer?
Wow, that was so much harder than I expected. My sense of balance was going haywire! I lasted 25 seconds, then had to put my other foot down.
That seems correct.
I’d add that in some situations (e.g., if the secret is relevant to some altruistic aim), instead of building a project around the secret (with the connotation of keeping it safe), the aim of your project should be to first disperse the secret amongst others who can help, and then possibly step out of the way once that’s done.
I do think there’s variance in the communicability of such insights. I think that, for example, Holden when thinking of starting GiveWell, or Eliezer when thinking of building MIRI (initially SIAI), both correctly just tried to build the thing they believed could exist, rather than first closing the inferential gap such that a much larger community could understand it. OTOH, EY wrote the Sequences, Holden has put a lot of work into making OpenPhil+GiveWell’s decision-making understandable, and these have both had massive payoffs.
This is a neat visualisation + explanation of comparative advantage.
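For anyone who wants the underlying arithmetic spelled out, here is a toy illustration (my own made-up numbers, not taken from the linked post):

```python
# Comparative advantage in miniature: Alice is better at both tasks in
# absolute terms, but each person has the lower opportunity cost in a
# different task, so specialisation and trade still beat self-sufficiency.

output_per_hour = {
    "Alice": {"widgets": 10, "gadgets": 5},
    "Bob":   {"widgets": 4,  "gadgets": 4},
}

for person, rates in output_per_hour.items():
    # Opportunity cost of one widget, measured in gadgets foregone.
    cost_of_widget = rates["gadgets"] / rates["widgets"]
    print(f"{person}: one widget costs {cost_of_widget:.2f} gadgets")

# Alice: one widget costs 0.50 gadgets; Bob: one widget costs 1.00 gadgets.
# So Alice has the comparative advantage in widgets and Bob in gadgets,
# even though Alice has the absolute advantage in both.
```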
Comparative advantage isn’t something that I actually ever think about when it comes to making long-term project decisions, and empirically, when my friends have used it as a justification for their career choices, I’ve tended not to feel their plans were good. I much prefer people to act on a secret they think they know but that others don’t.
Over time I’ve generally come to put far less trust in my own and others’ explicit reasoning for making long-term plans about what to work on (I have an old draft post on this I should maybe also push). The opening line of this essay by Paul Graham recommends one good way of not following your explicit reasoning when making long-term plans:
The way to get startup ideas is not to try to think of startup ideas. It’s to look for problems, preferably problems you have yourself.
I don’t know that I’d honestly advise anyone to ‘pick a career’ these days. I don’t think the big institutions (academia, government, news, etc.) that you can rise up in are healthy enough, or will last long enough, for planning around them on 10-20 year timescales to be one of the best ways to achieve tail outcomes with your life (I’m not sure I phrased that quite precisely, but whatever). Focusing on projects that make you curious, make you excited, and that have good short-term feedback loops with reality seems better. Most of all, if your project is built around a potential secret, then really go for it.
Nice!!! I appreciate the guide to older stuff, very useful.
I confess, that strategy has not occurred to me. I generally think of improvements to infrastructure as helping the group produce value, and if a group has scaled too fast too early, I write it off and go elsewhere. Something more (if you’ll forgive the pejorative term) destructive might be right, and perhaps I should consider those strategies more. Though in the case of the rationality community, I feel I can see a lot of cheap infrastructure improvements that I think will go a long way towards improving the coordination of intellectual labour.
(FYI I linked to that comment because I remembered it having a bunch of points about early growth being bad, not because I felt the topic discussed was especially relevant.)
Interesting. My main thought reading the OP was simply that coordination is hard at scale, and this applies to intellectual progress too. You had an orders-of-magnitude increase in the number of people but no change in productivity? Well, did you build better infrastructure and institutions to accommodate it, or has your infrastructure for coordinating scientists largely stayed the same since the scientific revolution? In general, scaling early is terrible, and will not be a source of value but run counter to it (and will result in massive goodharting).
That is such a respectable social norm: to try to make as conservative a statement about norms as possible whenever you’re given the opportunity (as opposed to many people’s natural instinct, which is to try to paint a big picture that seems important and true to them).
Thanks for the post, I really like the diagram.
One thing that would be useful (if it’s easy for you) would be to link back to a bunch of the posts you read (if you thought they were any good) and say quickly which part of the diagram they’re talking about.
Thanks for your reply. I also do not agree with it, but found that it points to some important ideas. (In the past I have tended to frame the conversation more in terms of ‘trust’ rather than ‘personal loyalty’, but I think with otherwise similar effect.)
The first question I want to ask is: how do you get to the stage where personal loyalty is warranted?
From time to time, I think back to the part of Harry Potter and the Philosopher’s Stone where Harry, Hermione and Ron become loyal to one another: the point where they build a relationship strong enough that they can face down Voldemort without worrying that one of the others may leave out of fear.
It is after Harry and Ron run in to save Hermione from a troll.
The people I have the most loyalty to in the world are those who have proven that it is there, with quite costly signals. And this was not a stress-free situation: it involved some pressure on each of our souls, though the important thing was that we came out with our souls intact, and also built something we both thought truly valuable.
So it is not clear to me that you can get to the stage of true loyalty without facing some trolls together, and risking actually losing.
The second and more important question I want to ask is: do you think that having loyal friends is sufficient to achieve your goals without regularly feeling like your soul is being torn apart?
The consequence of what I say above is this: it is precisely this state (“soul being torn apart”) which I think is critically important to avoid, in order to be truly rational.
Suppose I am confident that I will not lose my loyal friend.
Here are some updates about the world I might still have to make:
My entire social circle gives me social gradients in directions I do not endorse, and I should leave and find a different community
There is likely to be an existential catastrophe in the next 50 years and I should entirely re-orient my life around preventing it
The institution I’m rising up in is fundamentally broken, and for me to make real progress on problems I care about I should quit (e.g. academia, a bad startup).
All the years of effort I’ve spent on a project or up-skilling in a certain domain have been either useless or actively counterproductive (e.g. working in politics, a startup that hasn’t found product-market fit), and I need to give up and start over.
The only world in which I could feel confident that I wouldn’t have to go through any of these updates is one in which the institutions are largely functional, and in which, as I rise up within them, my local social incentives align with my long-term goals. This is not what I observe.
Given the world I observe, it seems impossible for me to not pass through events and updates that cause me significant emotional pain and significant loss of local social status, whilst also optimising for my long term goals. So I want my close allies, the people loyal to me, the people I trust, to have the conversational tools (cf. my comment above) to help me keep my basic wits of rationality about me while I’m going through these difficult updates and making these hard decisions.
I am aware this is not a hopeful comment. I do think it is true.
Edit: changed ‘achieve your goals while staying rational’ to ‘achieve your goals without regularly feeling like your soul is being torn apart’, which is what I meant to say.
Thanks for linking back to the Bruce piece—I remembered that it had made similar points, but I hadn’t understood it when I read it at 15.
I think that nurture culture also doesn’t interact well with ‘typical culture’. If someone expresses an idea that I disagree with, and I start offering them things like “Huh, I’m curious what’s the cause of your belief that x?” or “Interesting, I disagree. Here’s a picture of what it’s like to be me in relation to that claim, does any of it resonate with you?”, then in many of the above situations the other person just gets weirded out and isn’t really sure what I’m doing. They were just saying words; they’re not sure what game I’m trying to play, and they’ll try to change the topic.
At first, I felt that ‘nurture’ was a terrible name, because the primary thing I associated with the idea you’re discussing is that we are building up an axiomatised system together. Collaboratively. I’ll say a thing, and you’ll add to it. Lots of ‘yes-and’. If you disagree, then we’ll step back a bit, and continue building where we can both see the truth. If I disagree, I won’t attack your idea, but I’ll simply notice I’m confused about a piece of the structure we’re building, and ask you to add something else instead, or wonder why you’d want to build it that way. I agree this is more nurturing, but that’s not the point. The point is collaboration.
But then my model of Said said “What? I don’t understand why this sort of collaborative exploration isn’t perfectly compatible with combative culture—I can still ask all those questions and make those suggestions” which is a point he has articulated quite clearly down-thread (and elsewhere). So then I got to thinking about the nurturing aspect some more.
I’d characterise combative culture as working best in a professional setting, where it’s what one does as one’s job. When I think of productive combative environments, I visualise groups of experts in healthy fields like math or hard science or computer science. The researchers will bring powerful and interesting arguments to each other, but typically they do not discuss, nor require, an explicit model of how another researcher in their field thinks. And symmetrically, how a given researcher thinks is that researcher’s own responsibility; that’s their whole job! They’ll note they were wrong, and make some updates about which cognitive heuristics they should be using, but not bring that up in the conversation, because that’s not the point of the conversation. The point of the conversation is, y’know, whether the theorem is true, or whether this animal evolved from that, or whether this architecture is more efficient when scaled. Not our emotions or feelings.
Sure, we’ll attack each other in ways that can often make people feel defensive, but in a field where everyone has shown their competence (e.g. PhDs), we have common knowledge of respect for one another; we don’t expect it to actually hurt us to be totally wrong on this issue. It won’t mean I lose social standing, or stop being invited to conferences, or get fired. I mean, obviously it needs to correlate, but no single sentence or single disagreement ever decides something that consequential. Generally the worst that will happen to you is that you end up a median scientist/researcher, and don’t get to give the big conference talks. There’s a basic level of trust as we go about our work, which means combative culture is not a real problem.
I think this is good. It’s hard to admit you’re wrong, but if we have common knowledge of respect, then this makes the fear smaller, and I can overcome it.
I think one of the key motivations for nurturing culture is that we don’t have common knowledge that everything will be okay in many parts of our lives, and in the most important decisions in our lives, way more is at stake than in academia. Some example decisions where being wrong has far worse consequences for your life than being wrong about whether Fermat’s Last Theorem is true or false:
Will my husband/wife and I want the same things in the next 50 years?
Will my best friends help me keep up the standard of personal virtue I care about in myself, or will they not notice if I (say) lie to myself more and more?
I’m halfway through med school. Is being a doctor actually hitting the heavy tails of impact I could have with my life?
These questions have much more at stake. I know for myself, when addressing them, I feel emotions like fear, anger, and disgust.
Changing my mind on the important decisions in my life, especially those that affect my social standing amongst my friends and community, is really far harder than changing my mind about an abstract topic whose results don’t have much direct impact on my life.
Not that computer science or chemistry or math aren’t incredibly hard; it’s just that doing good work in these fields does not require the particular skill of believing things even when they’ll lower your social standing.
I think if you imagine the scientists above applying combative culture to their personal lives (e.g. whether they feel aligned with their husband/wife for the next 50 years), and really trying hard to do it, they’d immediately go through an incredible amount of emotional pain until it was too much to bear, and then they’d stop.
If you want someone to be open to radically changing their job, lifestyle, close relationships, etc, some useful things can be:
Have regular conversations with norms such that the person will not be immediately judged if they say something mistaken, or if they consider a hypothesis that you believe to be wrong.
If you’re discussing with them an especially significant belief and whether to change it, keep a track of their emotional state, and help them carefully walk through emotionally difficult steps of reasoning.
If you don’t, they’ll put a lot of effort into finding any other available way of shooting themselves in the foot, rather than realising that something incredibly painful is about to happen to them (and has been happening for many years).
I think that trying to follow this goal to its natural conclusions will lead you to a lot of the conversational norms that we’re calling ‘nurturing’.
I think Qiaochu once said something like “If you don’t regularly feel like your soul is being torn apart, you’re not doing rationality right.” Those weren’t his specific words, but I remember the idea being something like that.
Edited: Added a key section at the end.
Another interesting idea for discussion is the value of making a long-term commitment to keeping research within a contained environment (i.e. what the OP calls ‘nondisclosed-by-default’).
There are a bunch of arguments. Many seem straightforward to me (early research doesn’t translate well into papers at all, it might accidentally turn out to move capabilities forward and you want to see it develop a while to be sure it won’t, etc.), but this one surprised me more, and I’d be interested to know whether it resonates or is dissonant with others’ experiences.
We need our researchers to not have walls within their own heads
We take our research seriously at MIRI. This means that, for many of us, we know in the back of our minds that deconfusion-style research could sometimes (often in an unpredictable fashion) open up pathways that can lead to capabilities insights in the manner discussed above—and they usually haven’t spent that time, because it requires a bunch of cognitive overhead. As a consequence, many MIRI researchers flinch away from having insights when they haven’t spent a lot of time thinking about the potential capabilities implications of those insights down the line. This effect has been evidenced in reports from researchers, myself included, and we’ve empirically observed that when we set up “closed” research retreats or research rooms, researchers report that they can think more freely, that their brainstorming sessions extend further and wider, and so on.
This sort of inhibition seems quite bad for research progress. It is not a small area that our researchers were (un- or semi-consciously) holding back from; it’s a reasonably wide swath that may well include most of the deep ideas or insights we’re looking for.
At the same time, this kind of caution is an unavoidable consequence of doing deconfusion research in public, since it’s very hard to know what ideas may follow five or ten years after a given insight. AI alignment work and AI capabilities work are close enough neighbors that many insights in the vicinity of AI alignment are “potentially capabilities-relevant until proven harmless,” both for reasons discussed above and from the perspective of the conservative security mindset we try to encourage around here.
In short, if we request that our brains come up with alignment ideas that are fine to share with everybody—and this is what we’re implicitly doing when we think of ourselves as “researching publicly”—then we’re requesting that our brains cut off the massive portion of the search space that is only probably safe.
If our goal is to make research progress as quickly as possible, in hopes of having concepts coherent enough to allow rigorous safety engineering by the time AGI arrives, then it seems worth finding ways to allow our researchers to think without constraints, even when those ways are somewhat expensive.
Focus seems unusually useful for this kind of work
There may be some additional speed-up effects from helping free up researchers’ attention, though we don’t consider this a major consideration on its own.
Historically, early-stage scientific work has often been done by people who were solitary or geographically isolated, perhaps because this makes it easier to slowly develop a new way to factor the phenomenon, instead of repeatedly translating ideas into the current language others are using [emphasis added]. It’s difficult to describe how much mental space and effort turns out to be taken up with thoughts of how your research will look to other people staring at you, until you try going into a closed room for an extended period of time with a promise to yourself that all the conversation within it really won’t be shared at all anytime soon.
Once we realized this was going on, we realized that in retrospect, we may have been ignoring common practice, in a way. Many startup founders have reported finding stealth mode, and funding that isn’t from VC outsiders, tremendously useful for focus. For this reason, we’ve also recently been encouraging researchers at MIRI to worry less about appealing to a wide audience when doing public-facing work. We want researchers to focus mainly on whatever research directions they find most compelling, make exposition and distillation a secondary priority, and not worry about optimizing ideas for persuasiveness or for being easier to defend.