Sparked by Eric Topol, I’ve been thinking lately about biological complexity, psychology, and AI safety.
A prominent concern in the AI safety community is the problem of instrumental convergence – for almost any terminal goal, agents will converge on instrumental goals that are helpful for furthering the terminal goal, e.g. self-preservation.
The story goes something like this:
1. AGI is given (or arrives at) a terminal goal
2. AGI learns that self-preservation is important for increasing its chances of achieving its terminal goal
3. AGI learns enough about the world to realize that humans are a substantial threat to its self-preservation
4. AGI finds a way to address this threat (e.g. by killing all humans)
It occurred to me that to be really effective at finding & deploying a way to kill all humans, the AGI would probably need to know a lot about human biology (and also markets, bureaucracies, supply chains, etc.).
We humans don’t yet have a clean understanding of human biology, and it doesn’t seem like an AGI could get to a superhuman understanding of biology without running many more empirical tests (on humans), which would be pretty easy to observe.
Then it occurred to me that maybe the AGI doesn’t actually need to know a lot about human biology to develop a way to kill all humans. But it seems like it would still need to have a worked-out theory of mind, just to get to the point of understanding that humans are agent-like things that could bear on the AGI’s self-preservation.
So now I’m curious about where the state of the art is for this. From my (lay) understanding, it doesn’t seem like GPT-2 has anything approximating a theory of mind. Perhaps OpenAI’s Dota system or DeepMind’s AlphaStar is the state of the art here, theory-of-mind-wise? (To be successful at Dota or StarCraft, you need to understand that there are other things in your environment that are agent-y & will work against you in some circumstances.)
Curious what else is in the literature about this, and also about how important it seems to others.