If we get TAI in the next decade or so, it will almost certainly contain an LLM, at least as a component. Human values are complex and fragile, and we spend a huge amount of our time writing about them: roughly half the Dewey Decimal system consists of subfields of “How to Make Humans Happy 101”, including virtually all of the soft sciences (Anthropology, Medicine, Ergonomics, Economics…), arts, and crafts. Current LLMs have read tens of trillions of tokens of our content, including terabytes of this material, and as a result even GPT-4 (definitely less than TAI) can do a pretty good job of answering moral questions and commenting on possible undesirable side effects and downsides of plans. So if we have sufficient control of our TAI to ensure that it is extremely unlikely to kill us all, then presumably we can also tell it “also don’t do anything that your LLM says is a bad idea or that we wouldn’t like, at least not without checking carefully with us first”, and get a passable take on human values and impact regularization as well. In other words, if we have enough control to block your red arrow, we can also take at least a passable first cut at the green arrow. That by itself probably isn’t enough to stand up to many bits of optimization pressure without Goodharting, but it is a lot better than ignoring the green arrow entirely. And any TAI capable of doing STEM can understand Goodharting and avoid it.
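To make that concrete, here is a minimal sketch of the “check with your LLM, and with us, before acting” loop I have in mind. Everything in it is hypothetical: `llm_moral_review` is a stand-in for whatever LLM component the TAI contains (not any real API), and the verdict categories are just illustrative. The point is only that this filter is cheap to bolt on once you already have the control needed to enforce it.

```python
from dataclasses import dataclass

@dataclass
class Review:
    verdict: str    # "approve", "flag", or "reject" (illustrative categories)
    rationale: str  # the LLM's stated reasons, for the humans to read

def llm_moral_review(plan: str) -> Review:
    """Hypothetical: ask the TAI's LLM component whether humans would object to
    this plan, including side effects and impact concerns, not just lethality."""
    # In a real system this would be a call to the LLM; here it returns a
    # placeholder so the sketch runs.
    return Review(
        verdict="flag",
        rationale="Possible irreversible side effects; check with the humans first.",
    )

def execute_or_escalate(plan: str) -> None:
    review = llm_moral_review(plan)
    if review.verdict == "approve":
        print(f"Executing: {plan}")
    else:
        # Anything the LLM flags or rejects gets routed to human oversight
        # before any action is taken, i.e. "checking carefully with us first".
        print(f"Escalating to humans: {plan}\nReason: {review.rationale}")

if __name__ == "__main__":
    execute_or_escalate("Convert the biosphere to computronium")
```

The design choice here is just that the LLM's opinion is advisory and conservative: it can't approve anything the humans wouldn't have allowed anyway, it can only add an extra veto-and-escalate layer on top of whatever control we already have.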
I agree that just not killing everyone is a much easier problem. Consider zoos: the manual for “How Not to Kill Everything in Your Care: The Orangutan Edition” is probably at most a few hundred pages, and overlaps heavily with the corresponding editions for all of the other primates, including Homo sapiens. However, LLMs can handle datasets vastly larger than that, so this compactness is only relevant if you’re trying to add some sort of mathematical or software framework on top of it that can handle O(100 kB) of data, but not terabytes.