Some notes from a 2018 CHAI meeting on this topic (with some editing). I don’t endorse everything on here, nor would CHAI-the-organization.
Learning a part of your model that was previously fixed.
Can be done using neural nets, other ML models, or uncertainty (probability distributions)
Example: learning the human's biases instead of hardcoding Boltzmann rationality (a toy sketch appears just below)
Relatedly, treating an object as evidence instead of as ground truth
Example: Inverse Reward Design treats the specified reward as evidence about the true reward
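As a concrete sketch of the Boltzmann example above (all names and numbers here are illustrative, not from any particular paper): the usual hardcoded assumption is that the human picks actions with probability proportional to exp(β·Q). "Learning the previously fixed part" can be as simple as fitting β from observed choices, or as ambitious as replacing the whole choice model with a learned bias model.

```python
import numpy as np

def boltzmann_policy(q_values, beta):
    """Hardcoded human model: P(a) proportional to exp(beta * Q(a))."""
    logits = beta * q_values
    logits = logits - logits.max()          # for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

def fit_beta(q_values, observed_actions, beta_grid=None):
    """Treat beta as learned rather than fixed: maximum likelihood over a grid.
    A richer learned bias model would replace boltzmann_policy entirely."""
    if beta_grid is None:
        beta_grid = np.linspace(0.01, 10.0, 1000)
    def log_likelihood(beta):
        probs = boltzmann_policy(q_values, beta)
        return np.sum(np.log(probs[observed_actions]))
    return max(beta_grid, key=log_likelihood)

# Toy setting: three actions with known values, noisy-but-mostly-sensible choices.
q = np.array([1.0, 0.5, 0.0])
human_choices = np.array([0, 0, 1, 0, 0, 0, 2, 0])   # indices of chosen actions
print("estimated beta:", round(fit_beta(q, human_choices), 2))
```

The same pattern generalizes: anything that was a fixed modeling choice becomes a quantity estimated from data, ideally with explicit uncertainty over it.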
Looking at current examples of the problem and/or its solutions:
How does the human brain / nature do it?
How does human culture/society do this?
How has cognitive science formalized similar problems / what insights has it produced that we can build on?
Principal-agent models/Contracting theory
Adversarial agents for robustness
Internal design of the system to be adversarial inherently (e.g. debate)
External use of adversaries for testing: red teaming, adversarial training
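A toy sketch of the adversarial-training idea just mentioned, assuming a simple logistic-regression model (red-teaming a real system looks nothing like this; the snippet only shows the train-against-your-own-adversary loop):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm_perturb(x, y, w, eps):
    """Fast Gradient Sign Method: push the input in the direction that
    increases the model's loss, producing an adversarial example."""
    grad_x = -y * (1.0 - sigmoid(y * (x @ w))) * w   # gradient of -log sigmoid(y * w.x) w.r.t. x
    return x + eps * np.sign(grad_x)

def adversarially_train(X, Y, eps=0.1, lr=0.1, epochs=200):
    """Each epoch, generate adversarial copies of the data against the current
    model and take gradient steps on clean and adversarial examples together."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        X_adv = np.array([fgsm_perturb(x, y, w, eps) for x, y in zip(X, Y)])
        for x, y in zip(np.vstack([X, X_adv]), np.concatenate([Y, Y])):
            grad_w = -y * (1.0 - sigmoid(y * (x @ w))) * x
            w -= lr * grad_w
    return w

# Toy data: two Gaussian clusters with labels in {-1, +1}.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+1.0, 0.3, size=(20, 2)), rng.normal(-1.0, 0.3, size=(20, 2))])
Y = np.concatenate([np.ones(20), -np.ones(20)])
print("weights after adversarial training:", adversarially_train(X, Y))
```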
‘Normative Bandwidth’
How much information about correct behavior is actually conveyed, versus how much information the robot's policy assumes is conveyed.
E.g. interpreting a reward function literally means assuming it conveys all the information needed to identify the optimal policy. That is a huge amount of information, and that assumption is always wrong. What is actually conveyed is much less, something like what Inverse Reward Design assumes: the specified reward only conveys information about good behavior in the training environments (see the sketch below).
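A rough sketch of the Inverse Reward Design move described above, under heavy simplifying assumptions (a discrete set of candidate true rewards, and known feature counts for the behavior each possible proxy induces in the training environment); the variable names are mine, not the paper's:

```python
import numpy as np

# Feature counts of the behavior an agent gets by optimizing each possible
# proxy reward in the *training* environment (rows: proxies, columns: features).
# Feature 2 never occurs in training, so no proxy choice can speak to it.
phi_train = np.array([
    [1.0, 0.0, 0.0],   # behavior induced by proxy 0
    [0.0, 1.0, 0.0],   # behavior induced by proxy 1
])

# Candidate true reward weight vectors the designer might have intended.
candidate_true_w = np.array([
    [1.0, 0.0,  0.0],
    [1.0, 0.0, -1.0],   # identical in training, but penalizes the unseen feature
])

def ird_posterior(specified_proxy, phi_train, candidate_true_w, beta=5.0):
    """P(w_true | proxy) proportional to exp(beta * w_true . phi(proxy)) / Z(w_true),
    where Z(w_true) sums over the proxies the designer could have written."""
    scores = beta * candidate_true_w @ phi_train.T          # shape (n_true, n_proxy)
    likelihood = np.exp(scores[:, specified_proxy]) / np.exp(scores).sum(axis=1)
    posterior = likelihood / likelihood.sum()               # uniform prior over w_true
    return posterior

print(ird_posterior(0, phi_train, candidate_true_w))        # -> [0.5, 0.5]
```

In this toy case the posterior stays split between the two candidate true rewards, because they agree on every feature the training environment can exhibit: the proxy conveyed information only about good behavior there, which is exactly the limited normative bandwidth point.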
Proactive Learning (i.e. what if we ask the human?)
Induction (see e.g. iterated amplification)
Get good safety properties in simple situations, then use them to build something more capable while preserving those properties
Analyze a simple model of the situation in theory
Indexical uncertainty (uncertainty about your identity)
Rationality—either make sure the agent is rational, or make sure it isn’t (i.e. don’t build agents)
Thinking about the human-robot system as a whole, rather than the robot in isolation. (See e.g. CIRL / assistance games.)
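For reference, the cooperative inverse reinforcement learning (CIRL) game that "assistance games" refers to is, roughly, a two-player Markov game with identical payoffs in which only the human observes the reward parameters:

$$
M = \langle \mathcal{S}, \{\mathcal{A}^{H}, \mathcal{A}^{R}\}, T(s' \mid s, a^{H}, a^{R}), \{\Theta, R(s, a^{H}, a^{R}; \theta)\}, P_0(s_0, \theta), \gamma \rangle
$$

Both the human and the robot act to maximize the same expected discounted reward $R$, but $\theta$ is observed only by the human, so good robot play involves learning about $\theta$ from human behavior. That is why it pays to analyze the human-robot system jointly rather than the robot alone.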
How would you do it with infinite resources (relaxed constraints)?
E.g. AIXI, Solomonoff induction, open-source game theory
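As a pointer for the relaxed-constraints framing: Solomonoff induction weights every program $p$ whose output on a universal prefix machine $U$ begins with the observed string $x$,

$$
M(x) = \sum_{p \,:\, U(p) = x*} 2^{-\ell(p)},
$$

and AIXI plugs this mixture into expectimax planning over future percepts and rewards. Neither is computable, which is the point: they are idealizations of "how would you do it with infinite resources," to be analyzed rather than run.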