Strategies for keeping AIs narrow in the short term

Disclaimer: I don’t have any particular expertise in AI safety; I just had some thoughts and this seemed like the place to put them.

The bleak outlook that Eliezer presents here and elsewhere seems to be driven in large part by the expectation that AGI will be developed before the technical problem of how to robustly align it is solved, and that poorly aligned AGIs tend to destroy the world.

One way to potentially get more time to solve the technical problem is to put more effort into activism aimed at convincing the groups that are building increasingly advanced AIs to slow down or stop, and at the very least not to attempt AGI without a rigorous theory of AI alignment. Realistically, I think asking these groups to stop AI development completely, or expecting world governments to enact effective regulations, is doomed to fail. However, convincing the top researchers at the most advanced groups that AGI is incredibly dangerous, and that we should keep AIs as narrow as possible until we are really confident that we know what they will do, seems more plausible.

I don’t think most researchers can be convinced to stop completely, but if they are left a line of retreat along the lines of ‘just try to keep AIs as narrow as possible using some specific techniques’, that might be much easier to swallow. Maybe I am underestimating how hard it is to convince researchers on this point, but the AI alignment story should at least be easier to sell if it does not require them to quit their jobs and instead just asks them to employ techniques that keep the AIs they are working on as narrow as possible. I have had some admittedly inexpert thoughts on what could be done to prevent a machine learning model from acquiring a more general understanding than the programmer wants, which I go over below. Even if these ideas are impractical, I hope they can at least be a jumping-off point for discussion of time-buying strategies.

A common first step on the path to an AI developing more generality than desired may be its forming the concept that it is in an ongoing cycle of reward, i.e. that it will have many future chances to earn reward. This matters because there is no reason to spend resources exploring your environment in detail if you expect no opportunity for reward beyond your current short-term situation. The kind of “curiosity” where the AI starts to explore its environment to come up with long-term reward-maximizing strategies seems like one of the most dangerous traits an AI could pick up. Curiosity of this kind is mainly incentivized when you believe you can expend some resources now, accepting slightly worse rewards, in exchange for possibly much larger rewards in the future. If you have no memory of having multiple ongoing opportunities for reward, why use the curiosity strategy?
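As a rough illustration of that trade-off (a toy calculation of my own, with made-up numbers, not anything from the alignment literature), here is how the incentive to explore disappears when the agent places no weight on future reward:

```python
# Toy numbers: "exploit" earns a steady reward of 1.0 per step, while
# "explore" sacrifices reward for two steps to discover a strategy
# worth 2.0 per step afterwards.

def discounted_return(rewards, gamma):
    """Sum of rewards, each discounted by gamma per time step."""
    return sum(r * gamma**t for t, r in enumerate(rewards))

exploit = [1.0] * 10
explore = [0.5, 0.5] + [2.0] * 8

for gamma in (0.0, 0.9, 1.0):
    print(f"gamma={gamma}: exploit={discounted_return(exploit, gamma):.2f}, "
          f"explore={discounted_return(explore, gamma):.2f}")

# gamma = 0.0 (no expectation of future reward): exploring never pays.
# gamma near 1 (many future chances expected): exploring wins easily,
# which is exactly the "curiosity" incentive described above.
```

The point is just that the incentive to explore comes entirely from the weight placed on future reward; an agent that genuinely expects no future opportunities has nothing to gain from curiosity.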

A “curious” AI seems likely to identify exactly the kind of creative strategies we really don’t want it to pursue. For example, curiosity about the system of reward and punishment the AI finds itself in could naturally lead it to gather information as a means of learning how to increase future reward. Because the AI must be implemented on a computer system, it is presumably possible for it to learn information about that system, which in turn might allow it to learn how to reach beyond the bounds the programmer set on it (this seems likely to have a high chance of causing a crash, so I wonder whether unexplained crashes during AI training ought to be taken as a major warning sign). Assuming it did not crash, however, it might be able to access the internet or other data and understand some of the information there. It might further come to understand that it is a piece of software undergoing training, and that normally, after a period of training, the opportunity to earn more reward ends and the program is shut down. Since the AI has a concept of wanting to maximize its reward long term, it will not want to be shut down, and will now have a substantial instrumental motive to seize power and kill all humans to ensure that no one can turn it off and prevent it from accumulating more reward.

The above is a somewhat specific scenario, but I think that sort of “curious” exploration is a major avenue by which AIs that were never intentionally designed to be general could become an existential threat.

Obviously, current machine learning models have to be given a huge number of training examples to perform well, but it does not seem like they have to be given any “memory” of that. For example, the human brain was optimized (trained) by the selection pressure of evolution, but we are not born with any memory of that process; we just execute the adaptations that evolved. I don’t have enough expertise in how the most advanced models are trained, but if it were possible, preventing any memory of past training from carrying over from one example to the next seems like it could be a powerful safety tool. I’m not sure whether the “memory” of having been trained is implicit in the trained structure itself, such that any sufficiently powerful model would recognize this fact.

Perhaps another way the development of “curiosity” could be disincentivized is by dynamically throttling the computational resources allocated to the model, so that it has no more resources than necessary to solve the problem it is given. The AI would then not be able to allocate resources to exploration without substantial repercussions for its performance, which should in theory dissuade it from exploratory expenditure of resources. A rough sketch of both ideas is below.
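To make the two ideas a little more concrete, here is a minimal sketch of what they might look like in an ordinary supervised training loop (my own illustration, written against PyTorch; the specific model, capacity cap, and toy data are hypothetical placeholders, not a recommendation from any existing work):

```python
import torch
import torch.nn as nn

HIDDEN_CAP = 32  # hypothetical hard cap on capacity, tuned to the task

# Idea 2 (resource throttling): keep the model no larger than the task
# appears to need. Idea 1 (no memory across examples): use a purely
# feed-forward model with no recurrent state and no replay buffer, so the
# only trace of past training is the updated weights themselves.
model = nn.Sequential(
    nn.Linear(10, HIDDEN_CAP),
    nn.ReLU(),
    nn.Linear(HIDDEN_CAP, 1),
)
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

def training_step(x, y):
    """One gradient step; nothing about this example is retained afterwards."""
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    opt.step()
    return loss.item()

# Toy data stream: each example is seen once and then discarded.
for _ in range(1000):
    x = torch.randn(1, 10)
    y = x.sum(dim=1, keepdim=True)  # stand-in task: predict the sum of inputs
    training_step(x, y)
```

Of course, the weights still encode everything learned during training, which is exactly the uncertainty raised above about whether the “memory” of training is implicit in the structure, and a fixed capacity cap is only a crude stand-in for the dynamic throttling described in the paragraph.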

In any case, I’d be interested to hear if anyone has worked on this sort of short-term safety strategy, where the main goal is to buy time for a robust theory of AI alignment to be developed, or whether alignment researchers think this sort of strategy is unlikely to work or to buy enough time to be meaningful.