I think Nate’s claim “I expect them to care about a bunch of correlates of the training signal in weird and specific ways.” is plausible, at least for the kinds of AGI architectures and training approaches that I personally am expecting. If you don’t find the evolution analogy useful for that (I don’t either), but are OK with human within-lifetime learning as an analogy, then fine! Here goes!
OK, so imagine some “intelligent designer” demigod, let’s call her Ev. In this hypothetical, the human brain and body were not designed by evolution, but rather by Ev. She was working 1e5 years ago, back on the savannah. And her design goal was for these humans to have high inclusive genetic fitness.
So Ev pulled out a blank piece of paper. First things first: She designed the human brain with a fancy large-scale within-lifetime learning algorithm, so that these humans could gradually come to understand the world and take good actions in it.
Supporting that learning algorithm, she needed a reward function (“innate drives”). What to do there? Well, she spent a good deal of time thinking about it, and wound up putting in lots of perfectly sensible components for perfectly sensible reasons.
For example: She wanted the humans to not get injured, so she installed in the human body a system to detect physical injury, and put in the brain an innate drive to avoid getting those injuries, via an innate aversion (negative reward) related to “pain”. And she wanted the humans to eat sugary food, so she put a sweet-food-detector on the tongue and installed in the brain an innate drive to trigger reinforcement (positive reward) when that detector goes off (but modulated by hunger, as detected by yet another system). And so on.
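As a toy illustration of what those “innate drives” amount to computationally, here is a minimal sketch of a hard-coded reward function. Everything here (the signal names, the weights, the gating scheme) is invented for illustration, not a claim about the brain's actual reward circuitry:

```python
def innate_reward(injury_signal: float, sweet_signal: float, hunger: float) -> float:
    """Toy 'innate drives' reward: a fixed, hand-designed function of body signals.

    injury_signal: 0..1 output of the body's injury-detection system ("pain")
    sweet_signal:  0..1 output of the tongue's sweet-food detector
    hunger:        0..1 output of a separate hunger-sensing system
    """
    pain_penalty = -10.0 * injury_signal       # innate aversion: injury is negative reward
    food_reward = 5.0 * sweet_signal * hunger  # sweet taste reinforced, but gated by hunger
    return pain_penalty + food_reward

# A hungry human eating something sweet gets a large positive reward...
print(innate_reward(injury_signal=0.0, sweet_signal=1.0, hunger=1.0))  # 5.0
# ...while the same food when sated yields much less.
print(innate_reward(injury_signal=0.0, sweet_signal=1.0, hunger=0.1))  # 0.5
```

The point of the sketch is that the reward function itself is simple and fixed; all the open-ended flexibility comes from the learning algorithm that this signal trains.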
Then she did some debugging and hyperparameter tweaking by running these newly-designed humans in the training environment (African savannah) and seeing how they did.
So that’s how Ev designed humans. Then she “pressed go” and let them run for 1e5 years. What happened?
Well, I think it’s fair to say that modern humans “care about” things that probably would have struck Ev as “weird”. (Although we, with the benefit of hindsight, can wag our finger at Ev and say that she should have seen them coming.) For example:
Superstitions and fashions: Some people care, sometimes very intensely, about pretty arbitrary things that Ev could not possibly have anticipated in detail, like walking under ladders, and where Jupiter is in the sky, and exactly what tattoos they have on their body.
Lack of reflective equilibrium resulting in self-modification: Ev put a lot of work into her design, but sometimes people don’t like some of the innate drives or other design features that Ev put into them, so the people go right ahead and change them! For example, they don’t like how Ev designed their hunger drive, so they take Ozempic. They don’t like how Ev designed their attentional system, so they take Adderall. Many such examples.
New technology / situations leading to new preferences and behaviors: When Ev created the innate taste drives, she was (let us suppose) thinking about the food options available on the savannah, and thinking about what drives would lead to people making smart eating choices in that situation. And she came up with a sensible and effective design for a taste-receptors-and-associated-innate-drives system that worked well for that circumstance. But maybe she wasn’t thinking that humans would go on to create a world full of ice cream and Coca-Cola and miraculin and so on. Likewise, Ev put in some innate drives with the idea that people would wind up exploring their local environment. Very sensible! But Ev would probably be surprised that her design is now leading to people “exploring” open-world video-game environments while cooped up inside. Ditto with social media, organized religion, sports, and a zillion other aspects of modern life. Ev probably didn’t see any of it coming when she was drawing up and debugging her design, certainly not in any detail.
To spell out the analogy here:
Ev ↔ AGI programmers;
Human within-lifetime learning ↔ AGI training;
Adult humans ↔ AGIs;
Ev “presses go” and lets human civilization “run” for 1e5 years without further intervention ↔ For various reasons I consider it likely (for better or worse) that there will eventually be AGIs that go off and autonomously do whatever they think is a good thing to do, including inventing new technologies, without detailed human knowledge and approval.
Modern humans care about (and do) lots of things that Ev would have been hard-pressed to anticipate, even though Ev designed their innate drives and within-lifetime learning algorithm in full detail ↔ even if we carefully design the “innate drives” of future AGIs, we should expect to be surprised about what those AGIs end up caring about, particularly when the AGIs have an inconceivably vast action space thanks to being able to invent new technology and build new systems.