One way of phrasing the AI alignment task is to get AIs to “love humanity” or to have human welfare as their primary objective (sometimes called “value alignment”). One could hope to encode these via simple principles like Asimov’s three laws or Stuart Russell’s three principles, with all other rules derived from these.
I certainly agree that Asimov’s three laws are not a good foundation for morality! Nor is any other simple set of rules.
So if that’s how you mean “value alignment,” yes, let’s discount it. But let me sell you on a different idea you haven’t mentioned, which we might call “value learning.”[1]
Doing the right thing is complicated.[2] Compare this to another complicated problem: telling photos of cats from photos of dogs. You cannot write down a simple set of rules to tell apart photos of cats and dogs. But even though we can’t solve the problem with simple rules, we can still get a computer to do it. We show the computer a bunch of data about the environment and human classifications thereof, have it tweak a bunch of parameters to make a model of the data, and hey presto, it tells cats from dogs.
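To make that recipe concrete, here is a minimal sketch; the `load_labeled_photos` helper is hypothetical, and logistic regression is just a stand-in for whatever model you would actually train:

```python
# Minimal sketch of "show it data, tweak parameters, get a classifier".
# load_labeled_photos() is a hypothetical helper returning image feature
# vectors X and human-supplied labels y (0 = cat, 1 = dog).
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_labeled_photos()                       # data plus human classifications thereof
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model = LogisticRegression(max_iter=1000)          # "a bunch of parameters"
model.fit(X_train, y_train)                        # tweaked to model the data
print("held-out accuracy:", model.score(X_test, y_test))   # hey presto
```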
Learning the right thing to do is just like that, except for all the ways it’s different that are still open problems:
Humans are more inconsistent about the right thing, and disagree with each other about it more, than they are about dogs and cats.
If you optimize for doing the right thing, this is a bit like searching for adversarial examples, a stress test that the dog/cat classifier didn’t have to handle (see the sketch after this list).
When building an AI that learns the right thing to do, you care a lot more about trust than when you build a dog/cat classifier.
This margin is too small to contain my thoughts on all these.
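The margin does fit a toy version of the second point, though. Once you optimize against a learned model, you are in effect searching for its failure modes. In the sketch below, `score` stands in for whatever classifier (or preference model) was learned; the whole setup is illustrative, not anyone’s actual pipeline:

```python
import numpy as np

def search_for_adversarial_input(score, x0, step=0.01, steps=200):
    """Toy optimization against a learned model: nudge an input to drive the
    model's score up, using crude finite-difference gradients. `score` is any
    function mapping an input vector to the model's output."""
    x = x0.astype(float).copy()
    for _ in range(steps):
        grad = np.zeros_like(x)
        for i in range(len(x)):                    # finite-difference gradient estimate
            e = np.zeros_like(x)
            e[i] = 1e-4
            grad[i] = (score(x + e) - score(x - e)) / 2e-4
        x += step * grad                           # push toward whatever the model rewards
    return x                                       # often scores high while looking like nonsense
```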
There’s no bright line between value learning and techniques you’d today lump under “reasonable compliance.” Yes, the user experience is very different between (e.g.) an AI agent that’s operating a computer for days or weeks vs. a chatbot that responds to you within seconds. But the basic principles are the same—in training a chatbot to behave well you use data to learn some model of what humans want from a chatbot, and then the AI is trained to perform well according to the modeled human preferences.
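As a rough sketch of what “learn some model of what humans want” can look like: one common setup fits a reward model to pairwise human comparisons of chatbot replies. Every name below is illustrative, not a description of how any particular system is trained:

```python
import numpy as np

def fit_reward_model(feat_a, feat_b, human_prefs, lr=0.1, steps=1000):
    """Bradley-Terry-style reward model fit to pairwise comparisons.
    feat_a / feat_b: feature vectors for two candidate replies (n_pairs x n_features).
    human_prefs[i] is 1 if raters preferred reply a in pair i, else 0."""
    w = np.zeros(feat_a.shape[1])
    for _ in range(steps):
        margin = (feat_a - feat_b) @ w             # modeled reward gap r(a) - r(b)
        p = 1.0 / (1.0 + np.exp(-margin))          # modeled P(humans prefer a)
        grad = (feat_a - feat_b).T @ (p - human_prefs) / len(human_prefs)
        w -= lr * grad                             # fit the model to the human judgments
    return w                                       # the chatbot is then trained to score well under this model
```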
The open problems for general value learning are also open problems for training chatbots to be reasonable. How do you handle human inconsistency and disagreement? How do you build trust that the end product is actually reasonable, when that’s so hard to define? Etc. But the problems have less “bite,” because less can go wrong when your AI is briefly responding to a human query than when your AI is using a computer and navigating complicated real-world problems on its own.
You might hope we can just say value learning is hard, and not needed anyhow because chatbots need it less than agents do, so we don’t have to worry about it. But the chatbot paradigm is only a few years old, and there is no particular reason it should be eternal. There are powerful economic (and military) pressures towards building agents that can act rapidly and remain on-task over long time scales. AI safety research needs to anticipate future problems and start work on them ahead of time, which means we need to be prepared for instilling some quite ambitious “reasonableness” into AI agents.
[1] For a decent introduction from 2018, see this collection.
[2] Citation needed.
I am not 100% sure I follow all that you wrote, but to the extent that I do, I agree.
Even chatbots are surprisingly good at understanding human sentiments and opinions. I would say that they already mostly do the reasonable thing, but not with high enough probability, and certainly not reliably under the stress of adversarial input. I completely agree that we can’t ignore these problems, because the stakes will be much higher very soon.