Three AI Safety Related Ideas
(I have a health problem that is acting up and making it hard to type for long periods of time, so I’m condensing three posts into one.)
1. AI design as opportunity and obligation to address human safety problems
Many AI safety problems are likely to have counterparts in humans. AI designers and safety researchers shouldn’t start by assuming that humans are safe and then try to inductively prove that increasingly powerful AI systems are safe when they are developed or trained by, and added to, a team of humans. Nor should they try to solve AI safety problems without considering whether their designs or safety approaches exacerbate human safety problems relative to other designs or safety approaches. At the same time, the development of AI may be a huge opportunity to address human safety problems, for example by transferring power from probably unsafe humans to de novo AIs that are designed from the ground up to be safe, or by assisting humans’ built-in safety mechanisms (such as moral and philosophical reflection).
2. A hybrid approach to the human-AI safety problem
Idealized humans can be safer than actual humans. An example of an idealized human is a human whole-brain emulation that is placed in a familiar, safe, and supportive virtual environment (along with other humans for socialization), so that they are neither subject to problematic “distributional shifts” nor vulnerable to manipulation by other powerful agents in the physical world. One way to take advantage of this is to design an AI that is ultimately controlled by a group of idealized humans (for example, one whose terminal goal refers to the reflective equilibrium of the idealized humans), but this seems impractical due to computational constraints. An idea to get around this is to give the AI a piece of advice, or a hint, that it can serve that terminal goal by learning from actual humans as an instrumental goal. This learning can include imitation learning, value learning, or other kinds of learning. Then, even if the actual humans become corrupted, the AI has a chance of becoming powerful enough to discard its dependence on actual humans and recompute its instrumental goals directly from its terminal goal. (Thanks to Vladimir Nesov for giving me a hint that led to this idea.)
This is bad if the “good” kind of intellectual progress (such as philosophical progress) is disproportionately high in the polynomial hierarchy (PH) or outside PH entirely, or if we just don’t know how to formulate such progress as problems low in PH. I think this issue needs to be on the radar of more AI safety researchers.
(A reader might ask, “differentially accelerate relative to what?” An “aligned” AI could accelerate progress in a bad direction relative to a world with no AI, but still in a good direction relative to a world with only unaligned AI. I’m referring to the former here.)