Some notes from a 2018 CHAI meeting on this topic (with some editing). I don’t endorse everything on here, nor would CHAI-the-organization.
Learning a part of your model that was previously fixed.
Can be done using neural nets, other ML models, or uncertainty (probability distributions)
Example: learning the human's biases instead of hardcoding Boltzmann rationality (a toy sketch appears just below)
Relatedly, treating an object as evidence instead of as ground truth
Example: Inverse Reward Design treats the specified reward as evidence about the true reward
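As a concrete sketch of the Boltzmann example above (all names and numbers here are illustrative, not from any particular paper): the usual hardcoded assumption is that the human picks actions with probability proportional to exp(β·Q). "Learning the previously fixed part" can be as simple as fitting β from observed choices, or as ambitious as replacing the whole choice model with a learned bias model.

```python
import numpy as np

def boltzmann_policy(q_values, beta):
    """Hardcoded human model: P(a) proportional to exp(beta * Q(a))."""
    logits = beta * q_values
    logits = logits - logits.max()          # for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

def fit_beta(q_values, observed_actions, beta_grid=None):
    """Treat beta as learned rather than fixed: maximum likelihood over a grid.
    A richer learned bias model would replace boltzmann_policy entirely."""
    if beta_grid is None:
        beta_grid = np.linspace(0.01, 10.0, 1000)
    def log_likelihood(beta):
        probs = boltzmann_policy(q_values, beta)
        return np.sum(np.log(probs[observed_actions]))
    return max(beta_grid, key=log_likelihood)

# Toy setting: three actions with known values, noisy-but-mostly-sensible choices.
q = np.array([1.0, 0.5, 0.0])
human_choices = np.array([0, 0, 1, 0, 0, 0, 2, 0])   # indices of chosen actions
print("estimated beta:", round(fit_beta(q, human_choices), 2))
```

The same pattern generalizes: anything that was a fixed modeling choice becomes a quantity estimated from data, ideally with explicit uncertainty over it.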
Looking at current examples of the problem and/or its solutions:
How does the human brain / nature do it?
How does human culture/society do this?
How has cognitive science formalized similar problems / what insights has it produced that we can build on?
Principal-agent models/Contracting theory
Adversarial agents for robustness
Internal design of the system to be adversarial inherently (e.g. debate)
External use of adversaries for testing: red teaming, adversarial training
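A toy sketch of the adversarial-training idea just mentioned, assuming a simple logistic-regression model (red-teaming a real system looks nothing like this; the snippet only shows the train-against-your-own-adversary loop):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm_perturb(x, y, w, eps):
    """Fast Gradient Sign Method: push the input in the direction that
    increases the model's loss, producing an adversarial example."""
    grad_x = -y * (1.0 - sigmoid(y * (x @ w))) * w   # gradient of -log sigmoid(y * w.x) w.r.t. x
    return x + eps * np.sign(grad_x)

def adversarially_train(X, Y, eps=0.1, lr=0.1, epochs=200):
    """Each epoch, generate adversarial copies of the data against the current
    model and take gradient steps on clean and adversarial examples together."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        X_adv = np.array([fgsm_perturb(x, y, w, eps) for x, y in zip(X, Y)])
        for x, y in zip(np.vstack([X, X_adv]), np.concatenate([Y, Y])):
            grad_w = -y * (1.0 - sigmoid(y * (x @ w))) * x
            w -= lr * grad_w
    return w

# Toy data: two Gaussian clusters with labels in {-1, +1}.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+1.0, 0.3, size=(20, 2)), rng.normal(-1.0, 0.3, size=(20, 2))])
Y = np.concatenate([np.ones(20), -np.ones(20)])
print("weights after adversarial training:", adversarially_train(X, Y))
```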
‘Normative Bandwidth’
How much information about correct behavior is actually conveyed, versus how much information the robot's policy assumes is conveyed.
E.g. interpreting a reward function literally means assuming it conveys all the information needed to identify the optimal policy. That is a huge amount of information, and that assumption is always wrong. What is actually conveyed is much less, something like what Inverse Reward Design assumes: the specified reward only conveys information about good behavior in the training environments (see the sketch below).
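A rough sketch of the Inverse Reward Design move described above, under heavy simplifying assumptions (a discrete set of candidate true rewards, and known feature counts for the behavior each possible proxy induces in the training environment); the variable names are mine, not the paper's:

```python
import numpy as np

# Feature counts of the behavior an agent gets by optimizing each possible
# proxy reward in the *training* environment (rows: proxies, columns: features).
# Feature 2 never occurs in training, so no proxy choice can speak to it.
phi_train = np.array([
    [1.0, 0.0, 0.0],   # behavior induced by proxy 0
    [0.0, 1.0, 0.0],   # behavior induced by proxy 1
])

# Candidate true reward weight vectors the designer might have intended.
candidate_true_w = np.array([
    [1.0, 0.0,  0.0],
    [1.0, 0.0, -1.0],   # identical in training, but penalizes the unseen feature
])

def ird_posterior(specified_proxy, phi_train, candidate_true_w, beta=5.0):
    """P(w_true | proxy) proportional to exp(beta * w_true . phi(proxy)) / Z(w_true),
    where Z(w_true) sums over the proxies the designer could have written."""
    scores = beta * candidate_true_w @ phi_train.T          # shape (n_true, n_proxy)
    likelihood = np.exp(scores[:, specified_proxy]) / np.exp(scores).sum(axis=1)
    posterior = likelihood / likelihood.sum()               # uniform prior over w_true
    return posterior

print(ird_posterior(0, phi_train, candidate_true_w))        # -> [0.5, 0.5]
```

In this toy case the posterior stays split between the two candidate true rewards, because they agree on every feature the training environment can exhibit: the proxy conveyed information only about good behavior there, which is exactly the limited normative bandwidth point.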
Proactive Learning (i.e. what if we ask the human?)
Induction (see e.g. iterated amplification)
Get good safety properties in simple situations, then use them to build something more capable while preserving those properties
Analyze a simple model of the situation in theory
Indexical uncertainty (uncertainty about your identity)
Rationality—either make sure the agent is rational, or make sure it isn’t (i.e. don’t build agents)
Thinking about the human-robot system as a whole, rather than the robot in isolation. (See e.g. CIRL / assistance games.)
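For reference, the cooperative inverse reinforcement learning (CIRL) game that "assistance games" refers to is, roughly, a two-player Markov game with identical payoffs in which only the human observes the reward parameters:

$$
M = \langle \mathcal{S}, \{\mathcal{A}^{H}, \mathcal{A}^{R}\}, T(s' \mid s, a^{H}, a^{R}), \{\Theta, R(s, a^{H}, a^{R}; \theta)\}, P_0(s_0, \theta), \gamma \rangle
$$

Both the human and the robot act to maximize the same expected discounted reward $R$, but $\theta$ is observed only by the human, so good robot play involves learning about $\theta$ from human behavior. That is why it pays to analyze the human-robot system jointly rather than the robot alone.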
How would you do it with infinite resources (relaxed constraints)?
E.g. AIXI, Solomonoff induction, open-source game theory
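As a pointer for the relaxed-constraints framing: Solomonoff induction weights every program $p$ whose output on a universal prefix machine $U$ begins with the observed string $x$,

$$
M(x) = \sum_{p \,:\, U(p) = x*} 2^{-\ell(p)},
$$

and AIXI plugs this mixture into expectimax planning over future percepts and rewards. Neither is computable, which is the point: they are idealizations of "how would you do it with infinite resources," to be analyzed rather than run.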