SERI MATS ’21, Cognitive science @ Yale ’22, Meta AI Resident ’23, LTFF grantee. Currently doing prosocial alignment research @ AE Studio. Very interested in work at the intersection of AI x cognitive science x alignment x philosophy.
Cameron Berg
The ‘Neglected Approaches’ Approach: AE Studio’s Alignment Agenda
Key takeaways from our EA and alignment research surveys
Survey for alignment researchers!
Theoretical Neuroscience For Alignment Theory
Alignment via prosocial brain algorithms
AI researchers announce NeuroAI agenda
Computational signatures of psychopathy
Paradigm-building: Introduction
Paradigm-building from first principles: Effective altruism, AGI, and alignment
The Dark Side of Cognition Hypothesis
Question 1: Predicted architecture of AGI learning algorithm(s)
Paradigm-building: The hierarchical question framework
With respect to the RLNF idea, we are definitely very sympathetic to wireheading concerns. We think that approach is promising if we are able to obtain better reward signals given all of the sub-symbolic information that neural signals can offer in order to better understand human intent, but, as you correctly pointed out, that information can be used to better trick the human evaluator as well. We think this already happens to a lesser extent, and we expect that both current methods and future ones have to account for this particular risk.
More generally, we strongly agree that building out BCI is like a tightrope walk. Our original theory of change explicitly focuses on this: in expectation, BCI is not going to be built safely by giant tech companies of the world, largely given short-term profit-related incentives—which is why we want to build it ourselves as a bootstrapped company whose revenue has come from things other than BCI. Accordingly, we can focus on walking this BCI developmental tightrope safely and for the benefit of humanity without worrying if we profit from this work.
We do call some of these concerns out in the post, eg:
We also recognize that many of these proposals have a double-edged sword quality that requires extremely careful consideration—e.g., building BCI that makes humans more competent could also make bad actors more competent, give AI systems manipulation-conducive information about the processes of our cognition that we don’t even know, and so on. We take these risks very seriously and think that any well-defined alignment agenda must also put forward a convincing plan for avoiding them (with full knowledge of the fact that if they can’t be avoided, they are not viable directions.)
Overall—in spite of the double-edged nature of alignment work potentially facilitating capabilities breakthroughs—we think it is critical to avoid base rate neglect in acknowledging how unbelievably aggressively people (who are generally alignment-ambivalent) are now pushing forward capabilities work. Against this base rate, we suspect our contributions to inadvertently pushing forward capabilities will be relatively negligible. This does not imply that we shouldn’t be extremely cautious, have rigorous info/exfohazard standards, think carefully about unintended consequences, etc—it just means that we want to be pragmatic about the fact that we can help solve alignment while being reasonably confident that the overall expected value of this work will outweigh the overall expected harm (again, especially given the incredibly high, already-happening background rate of alignment-ambivalent capabilities progress).
Definitely agree with the thrust of your comment, though I should note that I neither believe nor think I really imply anywhere that ‘only neurotypical people are worth societal trust.’ I only use the word in this post to gesture at the fact that the vast majority of (but not all) humans share a common set of prosocial instincts—and that these instincts are a product of stuff going on in their brains. In fact, my next post will almost certainly be about one such neuroatypical group: psychopaths!
There’s a lot of overlap between alignment researchers and the EA community, so I’m wondering how that was handled.
Agree that there is inherent/unavoidable overlap. As noted in the post, we were generally cautious about excluding participants from either sample for reasons you mention and also found that the key results we present here are robust to these kinds of changes in the filtration of either dataset (you can see and explore this for yourself here).
With this being said, we did ask respondents in both the EA and the alignment surveys to indicate the extent to which they are involved in alignment; note the significance of the difference here:
From alignment survey:
From EA survey:
This question/result both serves as a good filtering criterion for cleanly separating out EAs from alignment researchers and gives pretty strong evidence that we are drawing on completely different samples across these surveys (likely because we sourced the data for each survey through completely distinct channels).
Regarding the support for various cause areas, I’m pretty sure that you’ll find the support for AI Safety/Long-Termism/X-risk is higher among those most involved in EA than among those least involved. Part of this may be because of the number of jobs available in this cause area.
Interesting! I just tried to test this. It is a bit hard to find a variable in the EA dataset that would cleanly correspond to higher vs. lower overall involvement, but we can filter by the number of years one has been involved in EA, and there is no level-of-experience threshold I could find where there are statistically significant differences in EAs’ views on how promising AI x-risk is. (Note that years of experience in EA may not be the best proxy for what you are asking, but it is likely the best we’ve got to tackle this specific question.)
Blue is >1 year experience, red is <1 year experience:
Blue is >2 years experience, red is <2 years experience:
I’m definitely sympathetic to the general argument here as I understand it: something like, it is better to be more productive when what you’re working towards has high EV, and stimulants are one underutilized strategy for being more productive. But I have concerns about the generality of your conclusion: (1) blanket-endorsing or otherwise equating the advantages and disadvantages of all of the things on the y-axis of that plot is painting with too broad a brush. They vary, eg, in addictive potential, demonstrated medical benefit, cost of maintenance, etc. (2) Relatedly, some of these drugs (e.g., Adderall) alter the dopaminergic calibration in the brain, which can lead to significant personality/epistemology changes, typically as a result of modulating people’s risk-taking/reward-seeking trade-offs. Similar dopamine agonist drugs used to treat Parkinson’s led to pathological gambling behaviors in patients who took them. There is an argument to be made for at least some subset of these substances that the trouble induced by these kinds of personality changes may plausibly outweigh the productivity gains of taking the drugs in the first place.
Interesting! Definitely agree that if people’s specific social histories are largely what qualify them to be ‘in the loop,’ this would be hard to replicate for the reasons you bring up. However, consider that, for example,
Young neurotypical children (and even chimpanzees!) instinctively help others accomplish their goals when they believe they are having trouble doing so alone...
which almost certainly has nothing to do with their social history. I think there’s a solid argument to be made, then, that a lot of these social histories are essentially a lifelong finetuning of core prosocial algorithms that have in some sense been there all along. And I am mainly excited about enumerating these. (Note also that figuring out these algorithms and running them in an RL training procedure might get us the relevant social histories training that you reference—but we’d need the core algorithms first.)
“human in the loop” to some extent translates to “we don’t actually know why we trust (some) other humans, but there exist humans we trust, so let’s delegate the hard part to them”.
I totally agree with this statement taken by itself, and my central point is that we should actually attempt to figure out ‘why we trust (some) other humans’ rather than treating this as a kind of black box. However, if this statement is being put forward as an argument against doing so, then it seems circular to me.
I liked this post a lot, and I think its title claim is true and important.
One thing I wanted to understand a bit better is how you’re invoking ‘paradigms’ in this post wrt AI research vs. alignment research. I think we can be certain that AI research and alignment research are not identical programs but that they will conceptually overlap and constrain each other. So when you’re talking about ‘principles that carry over,’ are you talking about principles in alignment research that will remain useful across various breakthroughs in AI research, or are you thinking about principles within one of these two research programs that will remain useful across various breakthroughs within that research program?
Another thing I wanted to understand better was the following:
This leaves a question: how do we know when it’s time to make the jump to the next paradigm? As a rough model, we’re trying to figure out the constraints which govern the world.
Unlike many of the natural sciences (physics, chemistry, biology, etc.) whose explicit goals ostensibly are, as you’ve said, ‘to figure out the constraints which govern the world,’ I think that one thing that makes alignment research unique is that its explicit goal is not simply to gain knowledge about reality, but also to prevent a particular future outcome from occurring—namely, AGI-induced X-risks. Surely a necessary component for achieving this goal is ‘to figure out the [relevant] constraints which govern the world,’ but it seems pretty important to note (if we agree on this field-level goal) that this can’t be the only thing that goes into a paradigm for alignment research. That is, alignment research can’t only be about modeling reality; it must also include some sort of plan for how to bring about a particular sort of future. And I agree entirely that the best plans of this sort would be those that transcend content-level paradigm shifts. (I daresay that articulating this kind of plan is exactly the sort of thing I try to get at in my Paradigm-building for AGI safety sequence!)
Thanks for taking the time to write up your thoughts! I appreciate your skepticism. Needless to say, I don’t agree with most of what you’ve written—I’d be very curious to hear if you think I’m missing something:
[We] don’t expect that the alignment problem itself is highly-architecture dependent; it’s a fairly generic property of strong optimization. So, “generic strong optimization” looks like roughly the right level of generality at which to understand alignment...Trying to zoom in on something narrower than that would add a bunch of extra constraints which are effectively “noise”, for purposes of understanding alignment.
Surely understanding generic strong optimization is necessary for alignment (as I also spend most of Q1 discussing). How can you be so sure, however, that zooming into something narrower would effectively only add noise? You assert this, but this doesn’t seem at all obvious to me. I write in Q2: “It is also worth noting immediately that even if particular [alignment problems] are architecture-independent [your point!], it does not necessarily follow that the optimal control proposals for minimizing those risks would also be architecture-independent! For example, just because an SL-based AGI and an RL-based AGI might both hypothetically display tendencies towards instrumental convergence does not mean that the way to best prevent this outcome in the SL AGI would be the same as in the RL AGI.”
By analogy, consider the more familiar ‘alignment problem’ of training dogs (i.e., getting the goals of dogs to align with the goals of their owners). Surely there are ‘breed-independent’ strategies for doing this, but it is not obvious that these strategies will be sufficient for every breed: e.g., Afghan Hounds are apparently way harder to train than, say, Golden Retrievers. So in addition to the generic-dog-alignment-regime, Afghan Hounds require some additional special training to ensure they’re aligned. I don’t yet understand why you are confident that different possible AGIs could not follow this same pattern.
On top of that, there’s the obvious problem that if we try to solve alignment for a particular architecture, it’s quite probable that some other architecture will come along and all our work will be obsolete. (At the current pace of ML progress, this seems to happen roughly every 5 years.)
I think that you think that I mean something far more specific than I actually do when I say “particular architecture,” so I don’t think this accurately characterizes what I believe. I describe my view in the next post.
[It’s] the unknown unknowns that kill us. The move we want is not “brainstorm failure modes and then avoid the things we brainstormed”, it’s “figure out what we want and then come up with a strategy which systematically achieves it (automatically ruling out huge swaths of failure modes simultaneously)”.
I think this is a very interesting point (and I have not read Eliezer’s post yet, so I am relying on your summary), but I don’t see what the point of AGI safety research is if we take this seriously. If the unknown unknowns will kill us, how are we to avoid them even in theory? If we can articulate some strategy for addressing them, they are not unknown unknowns; they are “increasingly-known unknowns!”
I devoted the entire first post of this sequence to “figuring out what we want” (we = AGI safety researchers). It seems like what we want is to avoid AGI-induced existential risks. (I am curious if you think this is wrong?) If so, I claim, here is a “strategy that might systematically achieve this:” we need to understand what we mean when we say AGI (Q1), figure out what risks are likely to emerge from AGI (Q2), mitigate these risks (Q3), and implement these mitigation strategies (Q4).
If by “figure out what we want,” you mean “figure out what we want out of an AGI,” I definitely agree with this (see Robert’s great comment below!). If by “figure out what we want,” you mean “figure out what we want out of AGI safety research,” well, that is the entire point of this sequence!
I expect implementation to be relatively easy once we have any clue at all what to implement. So even if it’s technically necessary to answer at some point, this question might not be very useful to think about ahead of time.
I completely disagree with this. It will definitely depend on the competitiveness of the relevant proposals, the incentives of the people who have control over the AGI, and a bunch of other stuff that I discuss in Q4 (which hasn’t even been published yet—I hope you’ll read it!).
in practice, when we multiply together probability-of-hail-Mary-actually-working vs probability-that-AI-is-coming-that-soon, I expect that number to basically-never favor the hail Mary.
When you frame it this way, I completely agree. However, there is definitely a continuous space of plausible timelines between “all-the-time-in-the-world” and “hail-Mary,” and I think the probabilities of success [P(success|timeline) * P(timeline)] fluctuate non-obviously across this spectrum. Again, I hope you will withhold your final judgment of my claim until you see how I defend it in Q5! (I suppose my biggest regret in posting this sequence is that I didn’t just do it all at once.)
Zooming out a level, I think the methodology used to generate these questions is flawed. If you want to identify necessary subquestions, then the main way I know how to do that is to consider a wide variety of approaches, and look for subquestions which are clearly crucial to all of them.
I think this is a bit uncharitable. I have worked with and/or talked to lots of different AGI safety researchers over the past few months, and this framework is the product of my having “consider[ed] a wide variety of approaches, and look[ed] for subquestions which are clearly crucial to all of them.” Take, for instance, this chart in Q1: I am proposing a single framework for talking about AGI that potentially unifies brain-based vs. prosaic approaches. That seems like a useful and productive thing to be doing at the paradigm-level.
I definitely agree that things like how we define ‘control’ and ‘bad outcomes’ might differ between approaches, but I do claim that every approach I have encountered thus far operates using the questions I pose here (e.g., every safety approach cares about AGI architectures, bad outcomes, control, etc. of some sort). To test this claim, I would very much appreciate the presentation of a counterexample if you think you have one!
Thanks again for your comment, and I definitely want to flag that, in spite of disagreeing with it in the ways I’ve tried to describe above, I really do appreciate your skepticism and engagement with this sequence (I cite your preparadigmatic claim a number of times in it).
As I said to Robert, I hope this sequence is read as something much more like a dynamic draft of a theoretical framework than my Permanent Thoughts on Paradigms for AGI Safety™.
Still feels worthwhile to emphasize that some of these 27 people are, eg, Chief AI Scientist at Meta, co-director of CIFAR, DeepMind staff researchers, etc.
These people are major decision-makers in some of the world’s leading and most well-resourced AI labs, so we should probably pay attention to where they think AI research should go in the short-term—they are among the people who could actually take it there.
I assume this is the chart you’re referring to. I take your point that you see these numbers as increasing or decreasing (even though where they actually are in an absolute sense seems compatible with believing that brain-based AGI is entirely possible), but these increases or decreases are themselves risky statistics to extrapolate from. These sorts of trends could easily asymptote or reverse given volatile field dynamics. For instance, if we linearly extrapolate from the two stats you provided (5% believe scaling could solve everything in 2018; 17% believe it in 2022), this would predict, eg, that 56% of NLP researchers in 2035 would believe scaling could solve everything. Do you actually think something in this ballpark is likely?
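As a quick check of that arithmetic, here is a minimal Python sketch of the straight-line extrapolation. The helper names `linear_extrapolate` and `year_reaching` are illustrative assumptions, and the only inputs are the two survey figures quoted above (5% in 2018, 17% in 2022).

```python
def linear_extrapolate(x0, y0, x1, y1, x):
    """Extend the straight line through (x0, y0) and (x1, y1) out to x."""
    slope = (y1 - y0) / (x1 - x0)  # percentage points per year
    return y1 + slope * (x - x1)


def year_reaching(x0, y0, x1, y1, target):
    """Year at which the same straight line would hit `target` percent."""
    slope = (y1 - y0) / (x1 - x0)
    return x1 + (target - y1) / slope


# The two data points quoted in the comment: 5% in 2018, 17% in 2022.
predicted_2035 = linear_extrapolate(2018, 5.0, 2022, 17.0, 2035)
print(predicted_2035)  # 56.0

# Hypothetical follow-up question: when would the line reach 100%?
print(year_reaching(2018, 5.0, 2022, 17.0, 100.0))
```

The same straight line passes 100% shortly before 2050, which is one way to see why extrapolating a trend from two data points is risky.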
I was considering the paper itself as evidence that NeuroAI is looking increasingly likely.
When people who run many of the world’s leading AI labs say they want to devote resources to building NeuroAI in the hopes of getting AGI, I am considering that as a pretty good reason to believe that brain-like AGI is more probable than I thought it was before reading the paper. Do you think this is a mistake?
Certainly, to your point, signaling an intention to try X is not the same as successfully doing X, especially in the world of AI research. But again, if anyone were to be able to push AI research in the direction of being brain-based, would it not be these sorts of labs?
To be clear, I do not personally think that prosaic AGI and brain-based AGI are necessarily mutually exclusive—eg, brains may be performing computations that we ultimately realize are some emergent product of prosaic AI methods that already basically exist. I do think that the publication of this paper gives us good reason to believe that brain-like AGI is more probable than we might have thought it was, eg, two weeks ago.