Confusion about neuroscience/​cognitive science as a danger for AI Alignment

Recently, I wrote an article together with Jan Kirchner on “brain enthusiasts” in AI Safety (if you find work on neuroscience/cognitive science x AI (Safety) interesting, let me know 🙂). While crafting and researching our arguments, we repeatedly came across one argument that left us confused and uncertain about this whole topic.

Epistemic status: not rigorously researched at all; I am just trying to get other people’s opinions on this. I am deeply unsure about this whole argument, so some parts of my writing may be unclear because of that.

For practical reasons, I will continue to use “brain enthusiasts” as a blanket term for “people with a neuroscience/cognitive science background”.


The argument roughly goes like this:

  1. The neural sciences seek to understand the brain at multiple levels of organization, ranging from the cell and its constituents to the operations of the mind[1]. Cognitive science is the interdisciplinary study of mind and intelligence, embracing philosophy, psychology, artificial intelligence, neuroscience, linguistics, and anthropology[2].

  2. Whether our current compute-focused AI approach will lead to AGI is debatable[3][4].

  3. Some authors and AI labs argue that a hybrid (or even fully brain-inspired) approach is a tractable path toward AGI[5][6].

  4. Having solid development and risk models of AGI is crucial for finding and applying solutions[4].

  5. Since the neural sciences and cognitive science study properties that a future AGI would have either by design (desired) or by necessity (convergent), studying them gives us a better development model (and risk model, too!) of how AGI could emerge.

  6. Thus, for somebody convinced of the importance of the alignment problem, studying neuroscience/cognitive science is a valuable path: it gives a better understanding of the development models of top AI labs whose approaches might lead to AGI (e.g. DeepMind).

Summarizing the mechanism of the argument: let’s make provisions for the case that AGI turns out to be brain-inspired. In that case, we should study the neural sciences and cognitive science to get a better idea of what such a future AGI might look like.


We struggled with this argument because it rests on many speculative assumptions (which is also why we didn’t include it in the list of research topics in the original article) and seems weak overall.

Still, I am curious, and I am motivated by Evie Cottrell’s excellent blog post calling for more openness about confusion and uncertainty. So I want to lay this argument out here and discuss it a bit. If you have relevant intuitions, please let me know!

First, I want to lay out my uncertainties.

Differential intellectual progress

Differential intellectual progress (DIP) means “prioritizing risk-reducing intellectual progress over risk-increasing intellectual progress”. I am particularly wary of approaches to AI alignment that also serve a purpose in AI capabilities research. Insights gained from studying the brain and understanding intelligence through neuroscience/cognitive science are potentially very useful to AI safety (risk-reducing), but also potentially very harmful (risk-increasing).

It is imaginable that we get to AGI through a hybrid or purely brain-inspired approach. Understanding brain mechanisms such as social instincts, motivation, or values might be especially interesting in that case. But these insights walk a fine line between AI capabilities research and AI safety. I am worried that knowledge about the brain that is specially tailored to the field of AI benefits capabilities research more than safety research, especially since more people might be able to put such insights to use in capabilities work than in safety work. So, according to DIP, we should rather focus on work that reduces AI risk.

Convergent evolution

Another question is: are the relevant properties of the human brain desired for AGI (we want to build them in) or convergent (they will necessarily arise in intelligent agents)?

If they are convergent, they will likely be present in AGI. If they end up implemented in AGI in a way similar to how they are implemented in the brain, then we could gain valuable insights by studying them in the brain and inferring what they might look like in AGI and how we could align them.

One example might be studying social instincts (inspired by Steven Byrnes’s brain-like AGI Safety series): if we can reverse-engineer social instincts and moral intuitions in a substrate-independent way, adjust them, and implement them in AGI, we might end up with an aligned AGI.

Similarly, if social instincts are convergent properties of aligned AGI, I think we should study them as well, and brain enthusiasts might be especially useful for this.

Open questions:

Does this imply that we shouldn’t spend time on epistemic translation of insights from neuroscience/cognitive science and adjacent fields to AI?

Which projects in neuroscience/cognitive science benefit only risk-reducing intellectual progress on AI?

How useful is building better development models of what AGI could look like, given the risk of accelerating AGI progress?

  1. ^ The brain and behavior. In Kandel, E.R., Koester, J.D., Mack, S.H., & Siegelbaum, S.A. (Eds.) (2021), Principles of Neural Science, 6e. McGraw Hill.
  2. ^ Thagard, Paul, “Cognitive Science”, The Stanford Encyclopedia of Philosophy (Winter 2020 Edition), Edward N. Zalta (ed.).

  3. ^
  4. ^
  5. ^
  6. ^