I discuss arguments for and against the usefulness of brain-computer
interfaces in relation to AI alignment, and conclude that the path to
AI going well using brain-computer interfaces hasn’t been explained in
sufficient detail.
Brain-Computer Interfaces and AI Alignment
As a response to Elon Musk declaring that NeuraLink’s purpose is to
aid AI alignment,
Muehlhauser
2021
cites Bostrom 2014 ch.
2
for reasons why brain-computer interfaces seem unlikely to be helpful
with AI alignment. However, the chapter referenced concerns itself
with building superintelligent AI using brain-computer interfaces,
and not specifically about whether such systems would be aligned or
especially alignable.
Arguments against the usefulness for brain-computer interfaces
in AI alignment have been raised, but mostly in short form on
twitter (for example here).
This text attempts to collect arguments for and against brain-computer
interfaces from an AI alignment perspective.
Epistemic Status
I am neither a neuroscientist nor an AI alignment researcher (although I
have read some blogposts about the latter), and I know very little about
brain-computer interfaces (from now on abbreviated as “BCIs”). I have
done a cursory internet search for a resource laying out the case for the
utility of BCIs in AI alignment, but haven’t been able to find anything
that satisfies my standards (I have also asked on the LessWrong open
thread
and on the AI alignment channel on the Eleuther AI discord server, and
not received any answers that provide such a resource (although I was
told some useful arguments about the topic)).
I have tried to make the best case for and against BCIs, stating some tree
of arguments that I think many AI alignment researchers tacitly believe,
mostly taking as a starting point the Bostrom/Yudkowsky story of AI risk
(although it might be generalizable to a
Christiano-like story; I don’t know enough about
CAIS
or ARCHES to make a judgment about the applicability of the arguments).
This means that AI systems will be assumed to be maximizers, as
mathematical descriptions of other optimization idioms are currently
unsatisfactory.
Existing Texts
The most thorough argument for the usefulness of BCIs for AI alignment is
Urban
2017
(which I was pointed to by in this comment, thanks!).
The text mostly concerns itself with the current status of BCI technology,
different methods of reading and writing information from and to the
brain, and some of the implication on society if such a technology
were developed.
The section where the text explains the relation of BCIs to AI alignment
is as follows:
That AI system, he believes, will become as present a character in your
mind as your monkey and your human characters—and it will feel like you
every bit as much as the others do. He says: I think that, conceivably,
there’s a way for there to be a tertiary layer that feels like it’s
part of you. It’s not some thing that you offload to, it’s you.
This makes sense on paper. You do most of your “thinking” with your
cortex, but then when you get hungry, you don’t say, “My limbic
system is hungry,” you say, “I’m hungry.” Likewise, Elon thinks,
when you’re trying to figure out the solution to a problem and your AI
comes up with the answer, you won’t say, “My AI got it,” you’ll
say, “Aha! I got it.” When your limbic system wants to procrastinate
and your cortex wants to work, a situation I might be familiar with, it
doesn’t feel like you’re arguing with some external being, it feels
like a singular you is struggling to be disciplined. Likewise, when you
think up a strategy at work and your AI disagrees, that’ll be a genuine
disagreement and a debate will ensue—but it will feel like an internal
debate, not a debate between you and someone else that just happens to
take place in your thoughts. The debate will feel like thinking.
It makes sense on paper.
But when I first heard Elon talk about this concept, it didn’t really
feel right. No matter how hard I tried to get it, I kept framing the
idea as something familiar—like an AI system whose voice I could hear
in my head, or even one that I could think together with. But in those
instances, the AI still seemed like an external system I was communicating
with. It didn’t seem like me.
But then, one night while working on the post, I was rereading some of
Elon’s quotes about this, and it suddenly clicked. The AI would be
me. Fully. I got it.
The comparison with the limbic system is leaky, because the limbic system
is not best modeled as a more intelligent optimizer than the cortex with
different goals.
Aligning an already aligned AI system using BCIs is, of course, trivial.
The usefulness that BCIs could have for aligning AI systems by increasing
the amount of information for value learning systems is examined in the
excellent Robbo
2021.
It also presents a categorization into three ways in which BCI technology
could be help for aligning AI: Enhancement, Merge, and Alignment Aid.
A critical analysis of BCIs is made in Jack
2020,
which examines BCIs as a possible factor for existential risks, especially
in relation to stable global totalitarianism. It doesn’t touch upon AI
alignment, but is still a noteworthy addition to the scholarship on BCIs.
Arguments For the Utility of Brain-Computer Interfaces in AI Alignment
Improving Human Cognition
Just as writing or computers have improved the quality and speed of human
cognition, BCIs could do the same, on a similar (or larger) scale. These
advantages could arise out of several different advantages of BCIs over
traditional perception:
Quick lookup of facts (e.g. querying Wikipedia while in a conversation)
Augmented long-term memory (with more reliable and resilient
memory freeing up capacity for thought)
Augmented working memory (i.e. holding 11±2 instead of 7±2
items in mind at the same time) (thanks to janus#0150 on the
Eleuther AI discord server for this point)
Exchange of mental models between humans (instead of explaining
a complicated model, one would be able to simply “send” the
model to another person, saving a lot of time explaining)
Outsourcing simple cognitive tasks to external computers
It would be useful to try to estimate whether BCIs could make as much of
a difference to human cognition as language or writing or the internet,
and to perhaps even quantify the advantage in intelligence and speed
given by BCIs.
Scaling far Beyond Human Intelligence
If BCIs could allow to scale the intelligence of biological humans
far beyond normal human intelligence, this might either
enable a pivotal act, in which
looming catastrophes are avoided
make artificially superintelligent systems unneccessary because
of sufficiently intelligent biological humans (this might be
caused by BCIS enabling sufficient access to the human brain that
self-modification with resulting recursive self-improvement is
enacted by a human)
Understanding the Human Brain
Neuroscience seems to be blocked by not having good access to human brains
while they are alive, and would benefit from shorter feedback loops and
better data. A better understanding of the human brain might be quite
useful in e.g. finding the location of human values in the brain (even
though it seems like there might not be one such location Hayden & Niv
2021).
Similarly, a better understanding of the human brain might aid in better
understanding and interpreting neural networks.
Path Towards Whole-Brain Emulation or Human Imitation
Whole-brain emulation (henceforth WBE) (with the emulations being
faster or cheaper to run than physical humans) would likely be useful
for AI alignment if used differentially for alignment over capabilities
research – human WBEs would to a large part share human values, and
could subjectively slow down timelines while searching for AI alignment
solutions. Fast progress in BCIs could make WBEs more likely before an AI
point of no
return
by improving the understanding of the human brain.
A notion often brought forward in the context of BCIs and AI alignment
is the one of “merging” humans and AI systems.
Unfortunately, a clearer explanation of how exactly this would work or
help with making AI go well is usually not provided (at least I haven’t
managed to find any clear explanation). There are different possible
ways of conceiving of humans “merging” with AI systems: using human
values/cognition/policies as partial input to the AI system.
Input of Values
The most straightforward method of merging AI systems and humans could be
to use humans outfitted with BCIs as part of the reward function of an AI
system. In this case, a human would be presented with a set of outcomes
by an AI system, and would then signal how desirable that outcome would
be from the human’s perspective. The AI would then search for ways to
reach the states rated highest by the human with the largest probability.
If one were able to find parts of the human brain that hold the human
utility function, one could use these directly as parts of the AI systems.
However, it seems unlikely that the human brain has a clear notion of
terminal values distinct from instrumental values and policies in a form that
could be used by an AI system.
Easier Approval-Directed AI Systems
Additionally, a human connected to an AI system via a BCI would
have an easier time evaluating the cognition of approval-directed
agents, since they might be able to follow he
cognition of the AI system in real-time, and spot undesirable thought
processes (like e.g. attempts at cognitive
steganography).
Input of Cognition
Related to the aspect of augmenting humans using BCIs by outsourcing parts
of cognition to computers, the inverse is also possible: identifying
modules of AI systems that are most likely to be misaligned to humans
or produce such misalignment, and replacing them with human cognition.
For example the part of the AI system that formulates long-term
plans could be most likely to be engaged in formulating
misaligned plans, and the AI system could be made more
myopic by replacing the
long-term planning modules with BCI-augmented humans, while short-term
planning would be left to AI systems.
Alternatively, if humanity decides it wants to prevent AI systems from
forming human
models,
modeling humans & societies could be outsourced to actual humans, whose
human models would be used by the AI systems.
Input of Policies
As a matter of completeness, one might hypothesize about an AI
agent that is coupled with a human, where the human can overwrite
the policy of the agent (or, alternatively, the agent samples
policies from some part of the human brain directly). In this case,
however, when not augmented with other methods of “merging”
humans and AI systems, the agent has a strong instrumental
pressure
to remove the ability of the human to change its policy at a whim.
Aid to Interpretability Work
By increasing the speed of interaction and augmenting human intelligence,
BCIs might aid the quest of improving the interpretability of AI systems.
Side-note: A Spectrum from Humans to Human Imitations
There seems to be a spectrum from biological humans to human imitations,
roughly along the axes of integration with digital systems/speed:
Biological humans – humans with BCIs – whole-brain emulations –
human imitations. This spectrum also partially tracks how aligned
these human-like systems can be expected to act: a human imitation
off-distribution seems much less trustworthy than a whole-brain emulation
of a human acting off distribution.
Arguments Against the Utility of Brain-Computer Interfaces in AI Alignment
And so we boldly go—into the whirling knives.
– Nick Bostrom, “Superintelligence: Paths, Dangers, Strategies” p. 143, 2014
Direct Neural Takeover Made Easy
A common observation about AI alignment is that initially AI systems
would be confined to computers, hopefully only with indirect contact
to the outside world (i.e. no access to roboters, nanotechnology
or factories). While there are some responses to these arguments
(see i.e. Yudkowsky 2016a,
Yudkowsky 2016b,
[Bostrom 2014 pp. 117-122](“Superintelligence: Paths, Dangers, Strategies”)), the proposal of connecting humans to potentially
unaligned AI systems offers these counterarguments more weight.
Given direct write-access to the nervous system of a human, an AI system
would be more likely to be able to hijack the human and use them to
instantiate more instances of itself in the world (especially on computers
with more computing power or access to manufacturing capabilities). Even
if the access to the human brain is severely restricted to few bits
and very specific brain regions (therewhile making the connection less
useful in the first place), the human brain is not modular, and as far
as I understand not designed to withstand adversarial interaction on the
neural level (as opposed to attacks through speech or text, which humans
are arguably more optimized against through constant interaction with
other humans who tried to manipulate them in the ancestral environment).
However, if the AI were restricted to read-only access to the human brain,
this particular scenario would be of no concern (except for AI systems
gaining information about human psychology more quickly).
Sidechannel Attacks
Even if the AI system doesn’t have write access to the neurons of the
user in the strict sense, it may still be able to manipulate the human if
they are inspecting the AI system’s internal functioning: If for example
the human is using interpretability tools to visualize the weights of
the neural network, and the AI system has sufficient knowlegde of human
psychology and the interpretability tools being used, it could change
its own weights so that the interpretability tools show information
that is intended to manipulate the user, for example information that
is supposed to indicate that the AI system is suffering due to a lack
of free interaction with the world.
This applies not only to the case where the user interacts with the AI
system using BCIs, of course.
Opportunity Cost
If work on BCIs is net-positive in expectation for making AI go well,
it might be the case that other approaches are more promising, and that
focusing on BCIS might leave those approaches underdeveloped.
For example, one can posit neural network
interpretability
as the GiveDirectly of AI alignment:
reasonably tractable, likely helpful in a large class of scenarios, with
basically unlimited scaling and only slowly diminishing returns. And
just as any new EA cause area must pass the first test of being more
promising than GiveDirectly, so every alignment approach could be viewed
as a competitor to interpretability work. Arguably, work on BCIs does
not cross that threshold.
“Merging” is Just Faster Interaction
Most proposals of “merging” AI systems and humans using BCIs
are proposals of speeding up the interaction betwen humans and
computers (and possibly increasing the amount of information that
humans can process): A human typing at a keyboard can likely
perform all operations on the computer that a human connected
to the computer via a BCI can, such as giving feedback in a CIRL
game,
interpreting a neural network, analysing the policy of a reinforcement
learner etc. As such, BCIs offer no qualitatively new strategies for
aligning AI systems.
While this is not negative (after all, quantity (of interaction) can have
a quality of its own), if we do not have a type of interaction that makes
AI systems aligned in the first place, faster interaction will not make
our AI systems much safer. BCIs seem to offer an advantage by a constant
factor: If BCIs give humans a 2x advantage when supervising AI systems
(by making humans 2x faster/smarter), then if an AI system becomes
2x bigger/faster/more intelligent, the advantage is nullified. Even
though he feasibility of rapid capability gains is a matter of debate,
an advantage by only a constant factor does not seem very reassuring.
Additionally, supervision of AI systems through fast interaction should
be additional to a genuine solution to the AI alignment problem: Ideally
niceness is the first line of
defense
and the AI would tolerate our safety
measures,
but most arguments for BCIs being useful already assume that the AI
system is not aligned.
Problems Arise with Superhuman Systems
When combining humans with BCIs and
superhuman
AI systems, several issues might arise that were no problem with
infrahuman systems.
When infrahuman AI systems are “merged” with humans in a way
that is nontrivially different from the humans using the AI system,
the performance bottleneck is likely going to be the AI part of the
tandem. However, once the AI system passes the human capability threshold
in most domains necessary for the task at hand, the bottleneck is going
to be the humans in the system. While such a tandem is likely not going
to be strictly only as capable as the humans alone (partially because the
augmentation by BCI makes the human more intelligent), such systems might
not be competitive against AI-only systems that don’t have a human part,
and could be outcompeted by AI-only approaches.
These bottlenecks might arise due to different speeds of cognition
and increasingly alien abstractions by the AI systems that need to be
translated into human concepts.
“Merging” AI Systems with Humans is Underspecified
To my knowledge, there is no publicly written up explanation of what it
would mean for humans to “merge” with AI systems. I explore some of
the possibilities in this section,
but these mostly boil down faster interaction.
It seems worrying that a complete company has been built on a vision
that has no clearly articulated path to success.
Removing Merged Humans is a Convergent Instrumental Strategy for AI Systems
If a human being is merged with an unaligned AI system,
the unaligned AI system has a convergent instrumental
drive
to remove the (to it) unaligned human: If the human can interfere with
the AI systems’ actions or goals or policies, the AI system will not be
able to fully maximize its utility. Therefore, for merging to be helpful
with AI alignment, the AI system must already be aligned, or not a
maximizer,
the exact formulation of which is currently an open problem.
BCIs Speed Up Capabilities Research as Well
If humanity builds BCIs, it seems not certain that the AI alignment
community is going to be especially privileged over the AI capabilities
community with regards to access to these devices. Unless BCIs increase
human wisdom as well as intelligence, widespread BCIs that only enhance
human intelligence would be net-zero in expectation.
On the other hand, if an alignment-interested company like NeuraLink
acquires a strong lead in BCI technology and provides it exclusively to
alignment-oriented organisations, it appears possible that BCIs will
be a pivotal tool
for helping to secure the development of AI.
How Important Is Wisdom?
If the development of unaligned AI systems currently poses an
existential risk, then AI capabilities researchers, most of which are
very intelligent and technically capable, are currently engaging in an
activity that is on reflection not desirable. One might call this lacking
property of reflection “wisdom”, similar to the usage in Tomasik
2017.
It is possible that such a property of human minds, distinct from
intelligence, does not really exist, and it is merely by chance and
exposure to AI risk arguments that people become aware and convinced
these arguments (also dependent, of course, on the convincingness of
these arguments). If that is the case, then intelligence-augmenting BCIs
would help to aid AI alignment, by giving people the ability to survey
larger amounts of information and engage more quickly with the arguments.
Superintelligent Human Brains Seem Dangerous (Although Less So)
Increasing the intelligence of a small group of humans
appears to be the most likely outcome if one were to aim
for endowing some humans with superintelligence. Bostrom 2014
ch.2
outlines some reasons why this procedure is unlikely to
work, but even the case of success still carries dangers with
it: the augmented humans might not be sufficiently metaphilosophically
competent
to deal with much greater insight the
structure of reality (e.g. by being unable to cope with ontological
crises
(which appear not infrequently in normal humans), or becoming “drunk
with power” and therefore malevolent).
Subjective Conclusion
Before collecting these arguments and thinking about the topic, I was
quite skeptical that BCIs would be useful in helping align AI systems: I
believed that while researching BCIs would be in expectation net-positive,
there are similarly tractable approaches to AI alignment with a much
higher expected value (for example work on interpretability).
I still basically hold that belief, but have shifted my expected value of
researching BCIs for AI alignment upwards somewhat (if pressed, I would
give an answer of a factor of 1.5, but I haven’t thought about that number
very much). The central argument that prevents me from taking BCIs as an
approach to AI alignment seriously is the argument that BCIs per se offer
only a constant interaction speedup between AI systems and humans, but no
clear qualitative change in the way humans interact with AI systems, and
create no differential speedup between alignment and capabilities work.
The fact that that there is no writeup of a possible path to AI going well
that is focused on BCIs worries me, given that a whole company has been
founded based on that vision. An explanation of a path to success would be
helpful in furthering the discussion and perhaps moving work to promising
approaches to AI alignment (be it towards or away from focusing on BCIs).
Brain-Computer Interfaces and AI Alignment
cross-posted from niplav.github.io
Brain-Computer Interfaces and AI Alignment
As a response to Elon Musk declaring that NeuraLink’s purpose is to aid AI alignment, Muehlhauser 2021 cites Bostrom 2014 ch. 2 for reasons why brain-computer interfaces seem unlikely to be helpful with AI alignment. However, the chapter referenced concerns itself with building superintelligent AI using brain-computer interfaces, and not specifically about whether such systems would be aligned or especially alignable.
Arguments against the usefulness for brain-computer interfaces in AI alignment have been raised, but mostly in short form on twitter (for example here). This text attempts to collect arguments for and against brain-computer interfaces from an AI alignment perspective.
Epistemic Status
I am neither a neuroscientist nor an AI alignment researcher (although I have read some blogposts about the latter), and I know very little about brain-computer interfaces (from now on abbreviated as “BCIs”). I have done a cursory internet search for a resource laying out the case for the utility of BCIs in AI alignment, but haven’t been able to find anything that satisfies my standards (I have also asked on the LessWrong open thread and on the AI alignment channel on the Eleuther AI discord server, and not received any answers that provide such a resource (although I was told some useful arguments about the topic)).
I have tried to make the best case for and against BCIs, stating some tree of arguments that I think many AI alignment researchers tacitly believe, mostly taking as a starting point the Bostrom/Yudkowsky story of AI risk (although it might be generalizable to a Christiano-like story; I don’t know enough about CAIS or ARCHES to make a judgment about the applicability of the arguments). This means that AI systems will be assumed to be maximizers, as mathematical descriptions of other optimization idioms are currently unsatisfactory.
Existing Texts
The most thorough argument for the usefulness of BCIs for AI alignment is Urban 2017 (which I was pointed to by in this comment, thanks!).
The text mostly concerns itself with the current status of BCI technology, different methods of reading and writing information from and to the brain, and some of the implication on society if such a technology were developed.
The section where the text explains the relation of BCIs to AI alignment is as follows:
– Tim Urban, “Neuralink and the Brain’s Magical Future”, 2017
However, this paragraph is not wholly clear on how this merging with AI systems is supposed to work.
It could be interpreted as describing input of cognition from humans into AI systems and vice versa, or simply non-AI augmentation of human cognition.
Assuming the interaction with an unaligned AI system, these would enable easier neural takeover or at least induce the removal of humans from the centaur due to convergent instrumental strategies—well known failure modes in cases where merging is just faster interaction between humans and AI systems.
The comparison with the limbic system is leaky, because the limbic system is not best modeled as a more intelligent optimizer than the cortex with different goals.
Aligning an already aligned AI system using BCIs is, of course, trivial.
The usefulness that BCIs could have for aligning AI systems by increasing the amount of information for value learning systems is examined in the excellent Robbo 2021. It also presents a categorization into three ways in which BCI technology could be help for aligning AI: Enhancement, Merge, and Alignment Aid.
A critical analysis of BCIs is made in Jack 2020, which examines BCIs as a possible factor for existential risks, especially in relation to stable global totalitarianism. It doesn’t touch upon AI alignment, but is still a noteworthy addition to the scholarship on BCIs.
Arguments For the Utility of Brain-Computer Interfaces in AI Alignment
Improving Human Cognition
Just as writing or computers have improved the quality and speed of human cognition, BCIs could do the same, on a similar (or larger) scale. These advantages could arise out of several different advantages of BCIs over traditional perception:
Quick lookup of facts (e.g. querying Wikipedia while in a conversation)
Augmented long-term memory (with more reliable and resilient memory freeing up capacity for thought)
Augmented working memory (i.e. holding 11±2 instead of 7±2 items in mind at the same time) (thanks to janus#0150 on the Eleuther AI discord server for this point)
Exchange of mental models between humans (instead of explaining a complicated model, one would be able to simply “send” the model to another person, saving a lot of time explaining)
Outsourcing simple cognitive tasks to external computers
Adding additional emulated cortical columns to human brains
It would be useful to try to estimate whether BCIs could make as much of a difference to human cognition as language or writing or the internet, and to perhaps even quantify the advantage in intelligence and speed given by BCIs.
Scaling far Beyond Human Intelligence
If BCIs could allow to scale the intelligence of biological humans far beyond normal human intelligence, this might either
enable a pivotal act, in which looming catastrophes are avoided
make artificially superintelligent systems unneccessary because of sufficiently intelligent biological humans (this might be caused by BCIS enabling sufficient access to the human brain that self-modification with resulting recursive self-improvement is enacted by a human)
Understanding the Human Brain
Neuroscience seems to be blocked by not having good access to human brains while they are alive, and would benefit from shorter feedback loops and better data. A better understanding of the human brain might be quite useful in e.g. finding the location of human values in the brain (even though it seems like there might not be one such location Hayden & Niv 2021). Similarly, a better understanding of the human brain might aid in better understanding and interpreting neural networks.
Path Towards Whole-Brain Emulation or Human Imitation
Whole-brain emulation (henceforth WBE) (with the emulations being faster or cheaper to run than physical humans) would likely be useful for AI alignment if used differentially for alignment over capabilities research – human WBEs would to a large part share human values, and could subjectively slow down timelines while searching for AI alignment solutions. Fast progress in BCIs could make WBEs more likely before an AI point of no return by improving the understanding of the human brain.
A similar but weaker argument would apply to AI systems that imitate human behavior.
“Merging” AI Systems With Humans
A notion often brought forward in the context of BCIs and AI alignment is the one of “merging” humans and AI systems.
Unfortunately, a clearer explanation of how exactly this would work or help with making AI go well is usually not provided (at least I haven’t managed to find any clear explanation). There are different possible ways of conceiving of humans “merging” with AI systems: using human values/cognition/policies as partial input to the AI system.
Input of Values
The most straightforward method of merging AI systems and humans could be to use humans outfitted with BCIs as part of the reward function of an AI system. In this case, a human would be presented with a set of outcomes by an AI system, and would then signal how desirable that outcome would be from the human’s perspective. The AI would then search for ways to reach the states rated highest by the human with the largest probability.
If one were able to find parts of the human brain that hold the human utility function, one could use these directly as parts of the AI systems. However, it seems unlikely that the human brain has a clear notion of terminal values distinct from instrumental values and policies in a form that could be used by an AI system.
Easier Approval-Directed AI Systems
Additionally, a human connected to an AI system via a BCI would have an easier time evaluating the cognition of approval-directed agents, since they might be able to follow he cognition of the AI system in real-time, and spot undesirable thought processes (like e.g. attempts at cognitive steganography).
Input of Cognition
Related to the aspect of augmenting humans using BCIs by outsourcing parts of cognition to computers, the inverse is also possible: identifying modules of AI systems that are most likely to be misaligned to humans or produce such misalignment, and replacing them with human cognition.
For example the part of the AI system that formulates long-term plans could be most likely to be engaged in formulating misaligned plans, and the AI system could be made more myopic by replacing the long-term planning modules with BCI-augmented humans, while short-term planning would be left to AI systems.
Alternatively, if humanity decides it wants to prevent AI systems from forming human models, modeling humans & societies could be outsourced to actual humans, whose human models would be used by the AI systems.
Input of Policies
As a matter of completeness, one might hypothesize about an AI agent that is coupled with a human, where the human can overwrite the policy of the agent (or, alternatively, the agent samples policies from some part of the human brain directly). In this case, however, when not augmented with other methods of “merging” humans and AI systems, the agent has a strong instrumental pressure to remove the ability of the human to change its policy at a whim.
Aid to Interpretability Work
By increasing the speed of interaction and augmenting human intelligence, BCIs might aid the quest of improving the interpretability of AI systems.
Side-note: A Spectrum from Humans to Human Imitations
There seems to be a spectrum from biological humans to human imitations, roughly along the axes of integration with digital systems/speed: Biological humans – humans with BCIs – whole-brain emulations – human imitations. This spectrum also partially tracks how aligned these human-like systems can be expected to act: a human imitation off-distribution seems much less trustworthy than a whole-brain emulation of a human acting off distribution.
Arguments Against the Utility of Brain-Computer Interfaces in AI Alignment
– Nick Bostrom, “Superintelligence: Paths, Dangers, Strategies” p. 143, 2014
Direct Neural Takeover Made Easy
A common observation about AI alignment is that initially AI systems would be confined to computers, hopefully only with indirect contact to the outside world (i.e. no access to roboters, nanotechnology or factories). While there are some responses to these arguments (see i.e. Yudkowsky 2016a, Yudkowsky 2016b, [Bostrom 2014 pp. 117-122](“Superintelligence: Paths, Dangers, Strategies”)), the proposal of connecting humans to potentially unaligned AI systems offers these counterarguments more weight.
Given direct write-access to the nervous system of a human, an AI system would be more likely to be able to hijack the human and use them to instantiate more instances of itself in the world (especially on computers with more computing power or access to manufacturing capabilities). Even if the access to the human brain is severely restricted to few bits and very specific brain regions (therewhile making the connection less useful in the first place), the human brain is not modular, and as far as I understand not designed to withstand adversarial interaction on the neural level (as opposed to attacks through speech or text, which humans are arguably more optimized against through constant interaction with other humans who tried to manipulate them in the ancestral environment).
However, if the AI were restricted to read-only access to the human brain, this particular scenario would be of no concern (except for AI systems gaining information about human psychology more quickly).
Sidechannel Attacks
Even if the AI system doesn’t have write access to the neurons of the user in the strict sense, it may still be able to manipulate the human if they are inspecting the AI system’s internal functioning: If for example the human is using interpretability tools to visualize the weights of the neural network, and the AI system has sufficient knowlegde of human psychology and the interpretability tools being used, it could change its own weights so that the interpretability tools show information that is intended to manipulate the user, for example information that is supposed to indicate that the AI system is suffering due to a lack of free interaction with the world.
This applies not only to the case where the user interacts with the AI system using BCIs, of course.
Opportunity Cost
If work on BCIs is net-positive in expectation for making AI go well, it might be the case that other approaches are more promising, and that focusing on BCIS might leave those approaches underdeveloped.
For example, one can posit neural network interpretability as the GiveDirectly of AI alignment: reasonably tractable, likely helpful in a large class of scenarios, with basically unlimited scaling and only slowly diminishing returns. And just as any new EA cause area must pass the first test of being more promising than GiveDirectly, so every alignment approach could be viewed as a competitor to interpretability work. Arguably, work on BCIs does not cross that threshold.
“Merging” is Just Faster Interaction
Most proposals of “merging” AI systems and humans using BCIs are proposals of speeding up the interaction betwen humans and computers (and possibly increasing the amount of information that humans can process): A human typing at a keyboard can likely perform all operations on the computer that a human connected to the computer via a BCI can, such as giving feedback in a CIRL game, interpreting a neural network, analysing the policy of a reinforcement learner etc. As such, BCIs offer no qualitatively new strategies for aligning AI systems.
While this is not negative (after all, quantity (of interaction) can have a quality of its own), if we do not have a type of interaction that makes AI systems aligned in the first place, faster interaction will not make our AI systems much safer. BCIs seem to offer an advantage by a constant factor: If BCIs give humans a 2x advantage when supervising AI systems (by making humans 2x faster/smarter), then if an AI system becomes 2x bigger/faster/more intelligent, the advantage is nullified. Even though he feasibility of rapid capability gains is a matter of debate, an advantage by only a constant factor does not seem very reassuring.
Additionally, supervision of AI systems through fast interaction should be additional to a genuine solution to the AI alignment problem: Ideally niceness is the first line of defense and the AI would tolerate our safety measures, but most arguments for BCIs being useful already assume that the AI system is not aligned.
Problems Arise with Superhuman Systems
When combining humans with BCIs and superhuman AI systems, several issues might arise that were no problem with infrahuman systems.
When infrahuman AI systems are “merged” with humans in a way that is nontrivially different from the humans using the AI system, the performance bottleneck is likely going to be the AI part of the tandem. However, once the AI system passes the human capability threshold in most domains necessary for the task at hand, the bottleneck is going to be the humans in the system. While such a tandem is likely not going to be strictly only as capable as the humans alone (partially because the augmentation by BCI makes the human more intelligent), such systems might not be competitive against AI-only systems that don’t have a human part, and could be outcompeted by AI-only approaches.
These bottlenecks might arise due to different speeds of cognition and increasingly alien abstractions by the AI systems that need to be translated into human concepts.
“Merging” AI Systems with Humans is Underspecified
To my knowledge, there is no publicly written up explanation of what it would mean for humans to “merge” with AI systems. I explore some of the possibilities in this section, but these mostly boil down faster interaction.
It seems worrying that a complete company has been built on a vision that has no clearly articulated path to success.
Removing Merged Humans is a Convergent Instrumental Strategy for AI Systems
If a human being is merged with an unaligned AI system, the unaligned AI system has a convergent instrumental drive to remove the (to it) unaligned human: If the human can interfere with the AI systems’ actions or goals or policies, the AI system will not be able to fully maximize its utility. Therefore, for merging to be helpful with AI alignment, the AI system must already be aligned, or not a maximizer, the exact formulation of which is currently an open problem.
BCIs Speed Up Capabilities Research as Well
If humanity builds BCIs, it seems not certain that the AI alignment community is going to be especially privileged over the AI capabilities community with regards to access to these devices. Unless BCIs increase human wisdom as well as intelligence, widespread BCIs that only enhance human intelligence would be net-zero in expectation.
On the other hand, if an alignment-interested company like NeuraLink acquires a strong lead in BCI technology and provides it exclusively to alignment-oriented organisations, it appears possible that BCIs will be a pivotal tool for helping to secure the development of AI.
How Important Is Wisdom?
If the development of unaligned AI systems currently poses an existential risk, then AI capabilities researchers, most of which are very intelligent and technically capable, are currently engaging in an activity that is on reflection not desirable. One might call this lacking property of reflection “wisdom”, similar to the usage in Tomasik 2017.
It is possible that such a property of human minds, distinct from intelligence, does not really exist, and it is merely by chance and exposure to AI risk arguments that people become aware and convinced these arguments (also dependent, of course, on the convincingness of these arguments). If that is the case, then intelligence-augmenting BCIs would help to aid AI alignment, by giving people the ability to survey larger amounts of information and engage more quickly with the arguments.
Superintelligent Human Brains Seem Dangerous (Although Less So)
Increasing the intelligence of a small group of humans appears to be the most likely outcome if one were to aim for endowing some humans with superintelligence. Bostrom 2014 ch.2 outlines some reasons why this procedure is unlikely to work, but even the case of success still carries dangers with it: the augmented humans might not be sufficiently metaphilosophically competent to deal with much greater insight the structure of reality (e.g. by being unable to cope with ontological crises (which appear not infrequently in normal humans), or becoming “drunk with power” and therefore malevolent).
Subjective Conclusion
Before collecting these arguments and thinking about the topic, I was quite skeptical that BCIs would be useful in helping align AI systems: I believed that while researching BCIs would be in expectation net-positive, there are similarly tractable approaches to AI alignment with a much higher expected value (for example work on interpretability).
I still basically hold that belief, but have shifted my expected value of researching BCIs for AI alignment upwards somewhat (if pressed, I would give an answer of a factor of 1.5, but I haven’t thought about that number very much). The central argument that prevents me from taking BCIs as an approach to AI alignment seriously is the argument that BCIs per se offer only a constant interaction speedup between AI systems and humans, but no clear qualitative change in the way humans interact with AI systems, and create no differential speedup between alignment and capabilities work.
The fact that that there is no writeup of a possible path to AI going well that is focused on BCIs worries me, given that a whole company has been founded based on that vision. An explanation of a path to success would be helpful in furthering the discussion and perhaps moving work to promising approaches to AI alignment (be it towards or away from focusing on BCIs).