SERI MATS ’21, Cognitive science @ Yale ’22, Meta AI Resident ’23, LTFF grantee. Currently doing prosocial alignment research @ AE Studio. Very interested in work at the intersection of AI x cognitive science x alignment x philosophy.
Cameron Berg
The Dark Side of Cognition Hypothesis
Theoretical Neuroscience For Alignment Theory
Thank you!
I don’t think I claimed that the brain is a totally aligned general intelligence, and if I did, I take it back! For now, I’ll stand by what I said here: “if we comprehensively understood how the human brain works at the algorithmic level, then necessarily embedded in this understanding should be some recipe for a generally intelligent system at least as aligned to our values as the typical human brain.” This seems harmonious with what I take your point to be: that the human brain is not a totally aligned general intelligence. I second Steve’s deferral to Eliezer’s thoughts on the matter, and I mean to endorse something similar here.
what’s the prevalence of empathy in social but non-general animals?
Here’s a good summary. I also found a really nice non-academic article in Vox on the topic.
And I’m looking forward to seeing your post on second-order alignment! I think the more people who take the concern seriously (and put forward compelling arguments to that end), the better.
If we expect to gain something from studying how humans implement these processes, it’d have to be something like ensuring that our AIs understand them “in the same way that humans do,” which e.g. might help our AIs generalize in a similar way to humans.
I take your point that there is probably nothing special about the specific way(s) that humans get good at predicting other humans. I do think that “help[ing] our AIs generalize in a similar way to humans” might be important for safety (e.g., we probably don’t want an AGI that figures out its programmers way faster/more deeply than they can figure it out). I also think it’s the case that we don’t currently have a learning algorithm that can predict humans as well as humans can predict humans. (Some attempts, but not there yet.) So to the degree that current approaches are lacking, it makes sense to me to draw some inspiration from the brain-based algorithms that already implement these processes extremely well—i.e., to first understand these algorithms, and to later develop training goals in accordance with the heuristics/architecture these algorithms seem to instantiate.
This is notably in contrast to affective empathy, though, which is not something that’s inherently necessary for predictive accuracy—so figuring out how/why humans do that has a more concrete story for how that could be helpful.
Agreed! I think it’s worth noting that if you take seriously the ‘hierarchical IRL’ model I proposed in the ToM section, understanding the algorithm(s) underlying affective empathy might actually require understanding cognitive and affective ToM (i.e., if these are the substrate of affective empathy, we’ll probably need a good model of them before we can have a good model of affective empathy).
And wrt learning vs. online learning, I think I’m largely in agreement with Steve’s reply. I would also add that this might end up just being a terminological dispute depending on how flexible we are with calling particular phases “training” vs. “deployment.” E.g., is a brain “deployed” when the person’s genetic make-up as a zygote is determined? Or is it when they’re born? When their brain stops developing? When they learn the last thing they’ll ever learn? To the degree we think these questions are awkward/their answers are arbitrary, I would think this counts as evidence that the notion of “online learning” is useful to invoke here/gives us more parsimonious answers.
I think this is an incredibly interesting point.
I would just note, for instance, in the (crazy cool) fungus-and-ants case, this is a transient state of control that ends shortly thereafter in the death of the smarter, controlled agent. For AGI alignment, we’re presumably looking for a much more stable and long-term form of control, which might mean that these cases are not exactly the right proofs of concept. They demonstrate, to your point, that “[agents] can be aligned with the goals of someone much stupider than themselves,” but not necessarily that agents can be comprehensively and permanently aligned with the goals of someone much stupider than themselves.
Your comment makes me want to look more closely into how cases of “mind control” work in these more ecological settings and whether there are interesting takeaways for AGI alignment.
Paradigm-building: Introduction
Paradigm-building from first principles: Effective altruism, AGI, and alignment
I agree with this. By ‘special class,’ I didn’t mean that AI safety has some sort of privileged position as an existential risk (though this may also happen to be true)—I only meant that it is unique. I think I will edit the post to use the word “particular” instead of “special” to make this come across more clearly.
Paradigm-building: The hierarchical question framework
Hi Tekhne—this post introduces each of the five questions I will put forward and analyze in this sequence. I will be posting one a day for the next week or so. I think I will answer all of your questions in the coming posts.
I doubt that carving up the space in this—or any—way would be totally uncontroversial (there are lots of value judgments necessary to do such a thing), but I think this concern only serves to demonstrate that this framework is not self-justifying (i.e., there is still lots of clarifying work to be done for each of these questions). I agree with this—that’s why I am devoting a post to each of them!
In order to minimize AGI-induced existential threats, I claim that we need to understand (i.e., anticipate; predict) AGI well enough (Q1) to determine what these threats are (Q2). We then need to figure out ways to mitigate these threats (Q3) and ways to make sure these proposals are actually implemented (Q4). How quickly we need to answer Q1-Q4 will be determined by how soon we expect AGI to be developed (Q5). I appreciate your skepticism, but I would counter that this actually seems like a fairly natural and parsimonious way to get from point A (where we are now) to point B (minimizing AGI-induced existential threats). That’s why I claim that an AGI safety research agenda would need to answer these questions correctly in order to be successful.
Ultimately, I can only encourage you to wait for the rest of the sequence to be published before passing a conclusive judgment!
Thanks for your comment—I entirely agree with this. In fact, most of the content of this sequence represents an effort to spell out these generalizations. (I note later that, e.g., the combinatorics of specifying every control proposal to deal with every conceivable bad outcome from every learning architecture is obviously intractable for a single report; this is a “field-sized” undertaking.)
I don’t think this is a violation of the hierarchy, however. It seems coherent to both claim (a) given the field’s goal, AGI safety research should follow a general progression toward this goal (e.g., the one this sequence proposes), and (b) there is plenty of productive work that can and should be done outside of this progression (for the reason you specify).
I look forward to hearing if you think the sequence walks this line properly!
Question 1: Predicted architecture of AGI learning algorithm(s)
Question 2: Predicted bad outcomes of AGI learning architecture
Hey Robert—thanks for your comment!
it seems very clear that we should update that structure to the best of our ability as we make progress in understanding the challenges and potentials of different approaches.
Definitely agree—I hope this sequence is read as something much more like a dynamic draft of a theoretical framework than my Permanent Thoughts on Paradigms for AGI Safety™.
“Aiming at good outcomes while/and avoiding bad outcomes” captures more conceptual territory, while still allowing for the investigation to turn out that avoiding bad outcomes is more difficult and should be prioritised. This extends to the meta-question of whether existential risk can be best addressed by focusing on avoiding bad outcomes, rather than developing a strategy to get to good outcomes (which are often characterised by a better ability to deal with future risks) and avoid bad outcomes on the way there.
I definitely agree with the value of framing AGI outcomes both positively and negatively, as I discuss in the previous post. I am less sure that AGI safety as a field necessarily requires deeply considering the positive potential of AGI (i.e., as long as AGI-induced existential risks are avoided, I think AGI safety researchers can consider their venture successful), but, much to your point, if the best way of actually achieving this outcome is by thinking about AGI more holistically—e.g., instead of explicitly avoiding existential risks, we might ask how to build an AGI that we would want to have around—then I think I would agree. I just think this sort of thing would radically redefine the relevant approaches undertaken in AGI safety research. I by no means want to reject radical redefinitions out of hand (I think this very well could be correct); I just want to say that it is probably not the path of least resistance given where the field currently stands.
(And agreed on the self-control point, as you know. See directionality of control in Q3.)
Thanks for taking the time to write up your thoughts! I appreciate your skepticism. Needless to say, I don’t agree with most of what you’ve written—I’d be very curious to hear if you think I’m missing something:
[We] don’t expect that the alignment problem itself is highly-architecture dependent; it’s a fairly generic property of strong optimization. So, “generic strong optimization” looks like roughly the right level of generality at which to understand alignment...Trying to zoom in on something narrower than that would add a bunch of extra constraints which are effectively “noise”, for purposes of understanding alignment.
Surely understanding generic strong optimization is necessary for alignment (as I also spend most of Q1 discussing). How can you be so sure, however, that zooming into something narrower would effectively only add noise? You assert this, but this doesn’t seem at all obvious to me. I write in Q2: “It is also worth noting immediately that even if particular [alignment problems] are architecture-independent [your point!], it does not necessarily follow that the optimal control proposals for minimizing those risks would also be architecture-independent! For example, just because an SL-based AGI and an RL-based AGI might both hypothetically display tendencies towards instrumental convergence does not mean that the way to best prevent this outcome in the SL AGI would be the same as in the RL AGI.”
By analogy, consider the more familiar ‘alignment problem’ of training dogs (i.e., getting the goals of dogs to align with the goals of their owners). Surely there are ‘breed-independent’ strategies for doing this, but it is not obvious that these strategies will be sufficient for every breed—e.g., Afghan Hounds are apparently way harder to train than, say, Golden Retrievers. So in addition to the generic-dog-alignment-regime, Afghan Hounds require some additional special training to ensure they’re aligned. I don’t yet understand why you are confident that different possible AGIs could not follow this same pattern.
On top of that, there’s the obvious problem that if we try to solve alignment for a particular architecture, it’s quite probable that some other architecture will come along and all our work will be obsolete. (At the current pace of ML progress, this seems to happen roughly every 5 years.)
I think that you think that I mean something far more specific than I actually do when I say “particular architecture,” so I don’t think this accurately characterizes what I believe. I describe my view in the next post.
[It’s] the unknown unknowns that kill us. The move we want is not “brainstorm failure modes and then avoid the things we brainstormed”, it’s “figure out what we want and then come up with a strategy which systematically achieves it (automatically ruling out huge swaths of failure modes simultaneously)”.
I think this is a very interesting point (and I have not read Eliezer’s post yet, so I am relying on your summary), but I don’t see what the point of AGI safety research is if we take this seriously. If the unknown unknowns will kill us, how are we to avoid them even in theory? If we can articulate some strategy for addressing them, they are not unknown unknowns; they are “increasingly-known unknowns!”
I spent the entire first post of this sequence devoted to “figuring out what we want” (we = AGI safety researchers). It seems like what we want is to avoid AGI-induced existential risks. (I am curious if you think this is wrong?) If so, I claim, here is a “strategy that might systematically achieve this”: we need to understand what we mean when we say AGI (Q1), figure out what risks are likely to emerge from AGI (Q2), mitigate these risks (Q3), and implement these mitigation strategies (Q4).
If by “figure out what we want,” you mean “figure out what we want out of an AGI,” I definitely agree with this (see Robert’s great comment below!). If by “figure out what we want,” you mean “figure out what we want out of AGI safety research,” well, that is the entire point of this sequence!
I expect implementation to be relatively easy once we have any clue at all what to implement. So even if it’s technically necessary to answer at some point, this question might not be very useful to think about ahead of time.
I completely disagree with this. How easy implementation turns out to be will definitely depend on the competitiveness of the relevant proposals, the incentives of the people who have control over the AGI, and a bunch of other stuff that I discuss in Q4 (which hasn’t even been published yet—I hope you’ll read it!).
in practice, when we multiply together probability-of-hail-Mary-actually-working vs probability-that-AI-is-coming-that-soon, I expect that number to basically-never favor the hail Mary.
When you frame it this way, I completely agree. However, there is definitely a continuous space of plausible timelines between “all-the-time-in-the-world” and “hail-Mary,” and I think the probabilities of success [P(success|timeline) * P(timeline)] fluctuate non-obviously across this spectrum. Again, I hope you will withhold your final judgment of my claim until you see how I defend it in Q5! (I suppose my biggest regret in posting this sequence is that I didn’t just do it all at once.)
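To make “fluctuate non-obviously” a bit more concrete, here is a purely illustrative toy sketch (the timelines, priors, and success probabilities below are hypothetical placeholders I made up for the example, not actual estimates):

```python
# Toy illustration only: all numbers are hypothetical placeholders.
# The point is that P(success | timeline) * P(timeline) need not be
# monotone in the timeline, so neither extreme automatically wins.

timelines_years = [5, 15, 30, 50]           # hypothetical AGI arrival times
p_timeline      = [0.10, 0.35, 0.35, 0.20]  # hypothetical prior over timelines
p_success_given = [0.02, 0.20, 0.45, 0.60]  # hypothetical P(success | timeline)

for years, p_t, p_s in zip(timelines_years, p_timeline, p_success_given):
    print(f"{years:>2}-year timeline: P(success, timeline) = {p_t * p_s:.3f}")

# With these made-up numbers, the "hail Mary" (5-year) case scores worst,
# but the product peaks at 30 years and then drops again at 50 years.
```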
Zooming out a level, I think the methodology used to generate these questions is flawed. If you want to identify necessary subquestions, then the main way I know how to do that is to consider a wide variety of approaches, and look for subquestions which are clearly crucial to all of them.
I think this is a bit uncharitable. I have worked with and/or talked to lots of different AGI safety researchers over the past few months, and this framework is the product of my having “consider[ed] a wide variety of approaches, and look for subquestions which are clearly crucial to all of them.” Take, for instance, this chart in Q1—I am proposing a single framework for talking about AGI that potentially unifies brain-based vs. prosaic approaches. That seems like a useful and productive thing to be doing at the paradigm-level.
I definitely agree that things like how we define ‘control’ and ‘bad outcomes’ might differ between approaches, but I do claim that every approach I have encountered thus far operates using the questions I pose here (e.g., every safety approach cares about AGI architectures, bad outcomes, control, etc. of some sort). To test this claim, I would very much appreciate the presentation of a counterexample if you think you have one!
Thanks again for your comment, and I definitely want to flag that, in spite of disagreeing with it in the ways I’ve tried to describe above, I really do appreciate your skepticism and engagement with this sequence (I cite your preparadigmatic claim a number of times in it).
As I said to Robert, I hope this sequence is read as something much more like a dynamic draft of a theoretical framework than my Permanent Thoughts on Paradigms for AGI Safety™.
Question 3: Control proposals for minimizing bad outcomes
If it’s possible that we could get to a point where AGI is no longer a serious threat without needing to answer the question, then the question is not necessary.
Agreed, this seems like a good definition for rendering anything as ‘necessary.’
Our goal: minimize AGI-induced existential threats (right?).
My claim is that answering these questions is probably necessary for achieving this goal—i.e., P(achieving goal | failing to think about one or more of these questions) ≈ 0. (I say, “I am claiming that a research agenda that neglects these questions would probably not actually be viable for the goal of AGI safety work.”)
That is, we would be exceedingly lucky if we achieve AGI safety’s goal without thinking about
what we mean when we say AGI (Q1),
what existential risks are likely to emerge from AGI (Q2),
how to address these risks (Q3),
how to implement these mitigation strategies (Q4), and
how quickly we actually need to answer these questions (Q5).
I really don’t see how it could be any other way: if we want to avoid futures in which AGI does bad stuff, we need to think about avoiding (Q3/Q4) the bad stuff (Q2) that AGI (Q1) might do (and we have to do this all “before the deadline;” Q5). I propose a way to do this hierarchically. Do you see wiggle room here where I do not?
FWIW, I also don’t really think this is the core claim of the sequence. I would want that to be something more like: “here is a useful framework for moving from point A (where the field is now) to point B (where the field ultimately wants to end up).” I have not seen a highly compelling presentation of this sort of thing before, and I think it is very valuable in solving any hard problem to have a general end-to-end plan (which we probably will want to update as we go along; see Robert’s comment).
I think most of the strategies in MIRI’s general cluster do not depend on most of these questions.
Would you mind giving a specific example of an end-to-end AGI safety research agenda that you think does not depend on or attempt to address these questions? (I’m also happy to just continue this discussion off of LW, if you’d like.)
Question 4: Implementing the control proposals
Definitely agree that if we silo ourselves into any rigid plan now, it almost certainly won’t work. However, I don’t think ‘end-to-end agenda’ = ‘rigid plan.’ I certainly don’t think this sequence advocates anything like a rigid plan. These are the most general questions I could imagine guiding the field, and I’ve already noted that I think this should be a dynamic draft.
...we do not currently possess a strong enough understanding to create an end-to-end agenda which has any hope at all of working; anything which currently claims to be an end-to-end agenda is probably just ignoring the hard parts of the problem.
What hard parts of the problem do you think this sequence ignores?
(I explicitly claim throughout the sequence that what I propose is not sufficient, so I don’t think I can be accused of ignoring this.)
Hate to just copy and paste, but I still really don’t see how it could be any other way: if we want to avoid futures in which AGI does bad stuff, then we need to think about avoiding (Q3/Q4) the bad stuff (Q2) that AGI (Q1) might do (and we have to do this all “before the deadline;” Q5). This is basically tautological as far as I can tell. Do you agree or disagree with this if-then statement?
I do think that finding necessary subquestions, or noticing that a given subquestion may not be necessary, is much easier than figuring out an end-to-end agenda.
Agreed. My goal was to enumerate these questions. When I noticed that they followed a fairly natural progression, I decided to frame them hierarchically. And, I suppose to your point, it wasn’t necessarily easy to write this all up. I thought it would nonetheless be valuable to do so, so I did!
Thanks for linking the Rocket Alignment Problem—looking forward to giving it a closer read.
Thank you! I think these are all good/important points.
In regards to functional specialization between the hemispheres, I think whether this difference is at the same level as mid-insular cortex vs posterior insular cortex would depend on whether the hemispheric differences can account for certain lower-order distinctions of this sort or not. For example, let’s say that there are relevant functional differences between left ACC and right ACC, left vmPFC and right vmPFC, and left insular cortex and right insular cortex—and that these differences all have something in common (i.e., there is something characteristic about the kinds of computations that differentiate left-hemispheric ACC, vmPFC, insula from right-hemispheric ACC, vmPFC, insula). Then, you might have a case for the hemispheric difference being more fundamental or important than, say, the distinction between mid-insular cortex vs posterior insular cortex. But that’s only if these conditions hold (i.e., that there are functional differences and these differences have intra-hemispheric commonalities). I think there’s a good chance something like this might be true, but I obviously haven’t put forward an argument for this yet, so I don’t blame anyone for not taking my word for it!
I’m not fully grasping the autism/ToM/IRL point yet. My understanding of people on the autism spectrum is that they typically lack ordinary ToM, though I’m certainly not saying that I don’t believe the people you’ve spoken with; maybe only that they might be the exception rather than the rule (there are accounts that emphasize things other than ToM, though, to your point). If it is true that (1) autistic people use mechanisms other than ToM/IRL to understand people (i.e., modeling people like car engines), and (2) autistic people have social deficits, then I’m not yet seeing how this demonstrates that IRL is ‘at most’ just a piece of the puzzle. (FWIW, I would be surprised if IRL were the only piece of the puzzle; I’m just not yet grasping how this argument shows this.) I can tell I’m missing something.
And I agree with the sad vs. schadenfreude point. I think in an earlier exchange you made the point that this sort of thing could conceivably be modulated by in-group style dynamics. More specifically, I think that to the extent I can look at a person, their situation, the outcome, etc., and notice (probably implicitly) that I could end up in a similar situation, it’s adaptive for me to “simulate” what it is probably like for them to be in this position so I can learn from their experience without having to go through the experience myself. As you note, there are exceptions to this—I think this happens particularly when we are looking at people more as “objects” (i.e., complex external variables in our environments) than “subjects” (other agents with internal states, goals, etc. just like me). I think this is well-demonstrated by the following examples.
1: lion-as-subject: I go to the zoo and see a lion. “Ooh, aah! Super majestic.” Suddenly, a huge branch falls onto the lion, trapping it. It yelps loudly. I audibly wince, and I really hope the lion is okay. (Bonus subjects: other people around the enclosure also demonstrate they’re upset/disturbed by what just happened, which makes me even more upset/disturbed!)
2: lion-as-object: I go on a safari alone and my car breaks down, so I need to walk to the nearest station to get help. As I’m doing this, a lion starts stalking and chasing me. Oh crap. Suddenly, a huge branch falls onto the lion, trapping it. It yelps loudly. “Thank goodness. That was almost really bad.”
Very different reactions to the same narrow event. So I guess this kind of thing demonstrates to me that I’m inclined to make stronger claims about affective empathy in those situations where we’re looking at other agents in our environment as subjects, not objects. I think in eusocial creatures like humans, subject-perspective is probably far more common than object-perspective, though one could certainly come up with lots of examples of both. So definitely more to think about here, but I really like this kind of challenge to an overly-simplistic picture of affective empathy wherein someone else feeling way X automatically and context-independently makes me feel way X. This, to your point, just seems wrong.