Thanks for writing this up! I support your call for more alignment research that looks more deeply at the structure of the objective/reward function. In general I feel that the reward function part of the alignment problem/solution space could use much more attention, especially because I do not expect the traditional ML research community to look there.
Traditional basic ML research tends to abstract away from the problem of writing an aligned reward function: it is all about investigating improvements to general-purpose machine learning, machine learning that can optimize for any possible ‘black box’ reward function R.
In the work you did, you show that this black-box view of the reward function is too narrow. Once you open up the black box and treat the reward function as a vector, you can define additional criteria for when machine learning performance is aligned or unaligned.
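To make the black-box contrast concrete, here is a minimal Python sketch (my own illustration, with made-up component names, not taken from your paper): once the reward is a vector of named objectives instead of an opaque scalar R, you can state criteria like ‘an update is aligned only if it does not trade safety for task performance’.

```python
from typing import Dict

# Hypothetical vector-valued reward: each component is a named objective.
# In a real agent these numbers would be computed from observations.
def vector_reward(task_score: float, safety_score: float) -> Dict[str, float]:
    return {"task": task_score, "safety": safety_score}

# One possible alignment criterion, expressible only because the box is
# open: an update is aligned if it improves the task component without
# regressing the safety component.
def update_is_aligned(old: Dict[str, float], new: Dict[str, float]) -> bool:
    return new["task"] >= old["task"] and new["safety"] >= old["safety"]

# A task gain bought with a safety regression is flagged as unaligned,
# a distinction a single black-box scalar R cannot even express.
print(update_is_aligned(vector_reward(1.0, 0.5), vector_reward(2.0, 0.4)))  # False
```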
In general, I have found that once you take the leap and start contemplating reward function design, certain problems of AI alignment become much more tractable. To give an example: the management of self-modification incentives in agents becomes kind of trivial if you can add terms to the reward function which read out some physical sensors; see, for example, section 5 of my paper here. A rough sketch of the idea follows below.
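The real construction is in the paper; the constant and sensor below are illustrative stand-ins only. The point is just that the extra reward term makes detected self-modification immediately unprofitable:

```python
# Hedged sketch: the reward gets an extra term wired to a physical
# tamper sensor, so any self-modification the sensor can detect is
# immediately unprofitable for the agent.

TAMPER_PENALTY = 1000.0  # assumed constant, chosen large relative to task reward

def shielded_reward(task_reward: float, tamper_sensor_triggered: bool) -> float:
    # In practice tamper_sensor_triggered would come from real hardware,
    # e.g. a switch on the enclosure holding the agent's compute.
    return task_reward - (TAMPER_PENALTY if tamper_sensor_triggered else 0.0)

print(shielded_reward(1.0, False))  # 1.0: normal operation
print(shielded_reward(1.0, True))   # -999.0: tampering detected and penalized
```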
So I have been somewhat puzzled by the question of why there is so little alignment research in this direction, or why so few people step up and point out that this kind of stuff is trivial. Maybe this is because improving the reward function is not considered to be part of ML research. If I try to manage self-modification incentives with my hands tied behind my back, without being allowed to install physical sensors coupled to reward function terms, the whole problem becomes much less tractable. Not completely intractable, but the solutions I then find (see this earlier paper) are mathematically much more complex, and less robust to machine learning mistakes.
I sometimes have the suspicion that there are whole non-ML conferences or bodies of literature devoted to alignment-related reward function design, but I am just not seeing them. Unfortunately, it looks like the modem2021 workshop website with the papers you linked to is currently down. It was working two weeks ago.
So a general literature-search question: while doing your project, did you encounter any interesting conferences or papers that I should be reading, if I want to read more work on aligned reward function design? I have already read Human-aligned artificial intelligence is a multiobjective problem.
While the Modem website is down, you can access our workshop paper here: https://drive.google.com/file/d/1qufjPkpsIbHiQ0rGmHCnPymGUKD7prah/view?usp=sharing
The paper is now published with open access here:
https://link.springer.com/article/10.1007/s10458-022-09586-2
The only resource I’d recommend, beyond MODEM when that’s back up and our upcoming JAAMAS special issue, is to check out Elicit, Ought’s GPT-3-based AI literature search engine (yes, they’re teaching GPT-3 about how to create a superintelligent AI. hmm). It’s in beta, but if you end up stuck on the waitlist, email me and I’ll suggest they add you. I wouldn’t say it’ll necessarily show you research you’re not aware of, but I found it very useful for getting into the AI Alignment literature for the first time myself.
https://elicit.org/