[Question] What’s the “This AI is of moral concern.” fire alarm?

Given the recent noise on this issue around LaMDA, I thought it might be a good idea to have some discussion around this point. I’m curious about what possible evidence would make people update in favor of a given system being morally relevant. Less “here’s the answer to morality” and more “here are some indicators that you should be concerned”. Note also that I’m not asking about consciousness, per se. I’m specifically asking about moral relevance.

My Answer (feel free to ignore and post your own)

I think that one class of computation that’s likely of moral concern would be self-perpetuating optimization demons in an AI.

Specifically, I’m thinking of optimization demons that are sophisticated enough to preserve themselves by actively and deliberately maintaining a sort of homeostasis in their computational environment, e.g., by preventing gradient updates that would destroy them. Such computations would (1) not want to die as a terminal value, (2) plausibly be cognitively sophisticated enough negotiate and trade with, and (3) have some awareness of themselves and their relation with the computational environment in which they’re embedded.

I think the cognitive capabilities that would help an optimization demon perpetuate itself strongly intersect with the cognitive capabilities that let humans and other animals replicate themselves, and that the intersection is particularly strong along dimensions that seem more morally relevant. Reasoning along such lines leads me to think optimization demons are probably of moral concern, while still being agnostic about whether their conscious.

I think the only situations in which you can get these sorts of optimization demons are when the AI in question has some influence over its own future training inputs. Such influence would allow there to be optimization demons that steer the AI towards training data that reinforce the optimization demon.

Thus, one of my “indicators of concern” is whether the training process allows for feedback loops where the AI influences its own future training data. Self-supervised language modeling under IID data does not count. However, something like InstructGPT’s training process would.

At this point, I’d been intending to say that InstructGPT seemed more likely to be of moral worth than LaMDA, but based on this blog post, it looks like LaMDA, might actually count as “having influence over its future inputs” during training. Specifically, LaMDA has generator and classifier components. The training process uses the classifier to decide which inputs the generator is trained on. I’ve updated somewhat towards LaMDA being of moral concern (not something I’d been expecting to do today).

I’ve also come up with a test of meta cognition that would update me significantly towards a language model being of moral concern. The idea would be to attach another output head to an LM, specifically, a linear layer which projected the LM’s final hidden state to a single logit. We’d then try to prompt the LM into controlling the output of the linear layer.

Specifically, we wouldn’t directly train the LM on the output of the linear layer. We’d just have a dialog where we asked the LM to make the linear layer output specific values, then told the LM what value the linear layer had actually output. We’d then see if the LM was able to control its own cognition well enough to influence the linear layers output in a manner that’s better than chance, just based on the prompting we give it

I doubt current LMs can do this, but I think it would be a big deal if they could. Even beyond whether the LMs have any sort of self-aware “inner listener” that’s worthy of moral concern, it would help establish the degree and depth of any meta-learned learning heuristics the LMs have and how easily those heuristics turn towards self-modeling.

My specific predictions:

  • Default GPT-3 has a ~5% chance of being able to do this.

  • LaMDA has a ~ 7% chance.

  • InstructGPT has a ~16% chance.

  • InstructPaLM (assuming Google trains such a model) would have a ~ 22% chance.