Averting suffering with sentience throttlers (proposal)

Status: Originally written as the round-two interview task for the Center on Long-Term Risk’s Summer Research Fellowship, but before it was finished I trashed it in favor of something under the Cooperative AI agenda. At this point it’s a free idea; do whatever you want with it. Being an unfinished proposal, the quality bar of this post is not high.

Some history: I originally developed this idea as a plotline in a game I wrote in the summer of 2019; I only considered making it into a rigorous research program very recently. I find it odd that I haven’t found a literature on this. I asked some people who would probably know if such a literature existed, and they hadn’t heard of one.

Written at CEEALAR

The problem termed “Suffering in AI Workers” (Gloor 2016) can be averted by sentience throttlers. Within the scope of a summer fellowship would be writing down a formalization of throttlers at a high level of abstraction, then following it up with lower-level work on specific theories of sentience and specific AI architectures/paradigms.

What is the problem?

In the default outcome, astronomical numbers of subroutines will be spun up in pursuit of higher-level goals, whether those goals are aligned with the complexity of human value or aligned with paperclips. Without firm protections in place, these subroutines might experience some notion of suffering (Tomasik 2017).

Intervention

One way to prevent suffering is to prevent sentience. If it is the case that sentience “emerges” from intelligence, then one obvious way forward is to suppress the “emergent” part while retaining competitive intelligence. This is all a throttler is: some technology that constrains systems in this way.

In the spirit of anticipating, legitimizing, and fulfilling governance demands (Critch 2020), though substituting “moral” demands for “governance” demands, it makes sense to work on anticipating and legitimizing before fulfilling. The first step is work at a level of abstraction that proposes the construction of a function: one that consumes a theory of sentience and something mind-like as input, and produces a specification for a safe (defined as not tending toward sentience) but competitive implementation as output. As a computer scientist and not a neuroscientist, I’m interested in focusing on non-biological applications of throttlers, but I do expect to quantify over all minds at this first level of abstraction, and I can gesture vaguely in the direction of biotech and say that biological minds may be throttleable as well.
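To make the shape of that function concrete, here is a minimal sketch of the types involved. Everything here is a hypothetical placeholder of my own (the names, fields, and the threshold parameter are assumptions, not anything that exists today); the body of the function is the open research problem.

```python
# Hypothetical sketch of the high-level "throttler" function described above.
# All names and types are illustrative placeholders, not an implementation.
from dataclasses import dataclass
from typing import Callable, List, Protocol


class MindLikeSystem(Protocol):
    """Anything mind-like enough to be evaluated: an AI architecture,
    a training setup, or (in principle) a biological system."""
    ...


@dataclass
class SentienceTheory:
    """A theory of sentience, operationalized as a scoring function.
    For IIT this might be an (approximate) Phi estimator."""
    name: str
    sentience_score: Callable[[MindLikeSystem], float]


@dataclass
class ThrottledSpec:
    """A specification for a system that stays below a sentience
    threshold while remaining competitive on its task."""
    constraints: List[str]            # e.g. architectural or training constraints
    max_sentience_score: float
    expected_performance_penalty: float


def throttler(theory: SentienceTheory,
              system: MindLikeSystem,
              threshold: float) -> ThrottledSpec:
    """Map (theory of sentience, mind-like system) to a safe-but-competitive
    specification. Writing this body is the research program, not something
    that can be filled in today."""
    raise NotImplementedError
```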

When the substrate in question is AI technologies, imagine you’re in the role of an ethics officer at Alphabet, reasoning with engineering managers about abiding by the specification described by our function. Genovese 2018 claims that systems without emergent properties are feasible, but more difficult to implement than systems subject to emergent properties. This implies that even if throttled versions of Alphabet’s technologies would be acceptable from a product-performance standpoint, it may not be reasonable to ask the company to invest the requisite extra engineering hours. Thus, it would be ideal to also create programming abstractions that make satisfying these specifications as easy as possible, which adds a dimension to the research opportunity.
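As a hedged illustration of what such an abstraction could look like from an engineer’s point of view, here is one possible shape: a context manager that enforces a sentience-score check around an otherwise unchanged training run. It reuses the hypothetical SentienceTheory from the sketch above; nothing here is an existing library API.

```python
# Illustrative only: one shape a throttling abstraction could take, assuming
# a sentience-score estimator exists for the framework in question.
from contextlib import contextmanager


@contextmanager
def throttled(model, theory, threshold):
    """Wrap a training run so that the throttling check costs the
    engineering team roughly one extra line of code."""
    yield model
    score = theory.sentience_score(model)
    if score > threshold:
        raise RuntimeError(
            f"{theory.name} score {score:.3f} exceeds threshold {threshold}"
        )


# Intended usage by an engineering team (train, model, iit_theory are
# placeholders for whatever the team already has):
#
# with throttled(model, iit_theory, threshold=0.1) as m:
#     train(m)
```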

How exactly I would go about dumping a summer into this

The notion of throttlers is so undertheorized that I think the first level of abstraction, the one that quantifies over all theories of consciousness and all minds, will be a value-add. With this level on paper, people would have an anchor point to criticize and build on. However, such a high abstraction level might be unsatisfying, because its practical consequences are so far away. Therefore, I would like to provide specific input-output pairs for this “function”: for example, Integrated Information Theory (IIT) and Deep Learning (DL) as inputs, with the appropriate specification, e.g. a way to reduce integrated information in a neural network, as output, perhaps applying some of the ideas in Aguilera & Di Paolo 2017.
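As a toy illustration of the kind of object such a specification might contain, consider penalizing cross-module connectivity in a weight matrix as a crude stand-in for reducing information integration. To be clear, this is my own assumed proxy for exposition; it is not Phi from IIT and not anything from Aguilera & Di Paolo 2017.

```python
# Toy proxy: L1 mass of weights connecting units in different modules.
# Reducing this pushes the network toward decomposable, less integrated
# dynamics; it is a stand-in for a real IIT-derived measure, not Phi itself.
import numpy as np


def cross_partition_penalty(W: np.ndarray, partition: np.ndarray) -> float:
    """W: (n, n) weight matrix; partition: length-n array of module labels."""
    cross_mask = partition[:, None] != partition[None, :]
    return float(np.abs(W[cross_mask]).sum())


# Example: two modules of two units each.
W = np.array([[0.0, 0.5, 0.3, 0.0],
              [0.5, 0.0, 0.0, 0.0],
              [0.0, 0.0, 0.0, 0.4],
              [0.0, 0.0, 0.4, 0.0]])
partition = np.array([0, 0, 1, 1])
print(cross_partition_penalty(W, partition))  # 0.3: only W[0, 2] is a nonzero cross-module weight
```

In a training loop, a term like this could be added to the loss as a regularizer, which is the sort of concrete requirement a throttler specification for the (IIT, DL) pair might end up containing.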

IIT stands out because it is the most straightforward to operationalize computationally, but perhaps a better project would be to increase the ability to reason computationally about eliminativism (Tomasik 2014) or illusionism, for the purpose of specifying throttlers assuming either of those theories is true. As for the substrate, beyond thinking about deep neural nets in general, we could think about RL-like, transformer-like, or NARS-like (Wang unknown year) architectures. The latter in particular is important because NARS has been used to coordinate and unify other ML technologies as subroutines (Hammer et al. 2020), in exactly the way that the above moral threat model suggests. While it may not be tractable to properly output an actionable specification within a summer, lots of concrete work could be done.