Interesting… On first reading your post, I felt that your
methodological approach for dealing with the ‘all is doomed in the
worst case’ problem is essentially the same as my approach. But on
re-reading, I am not so sure anymore. So I’ll try to explore the
possible differences in methodological outlook, and will end with a
question.
The key to your methodology is that you list possible process steps
which one might take when one feels like
all of our current algorithms are doomed in the worst case.
The specific doom-removing process step that I want to focus on is
this one:
If so, I may add another assumption about the world that I think makes
alignment possible (e.g. the strategy stealing assumption), and throw
out any [failure] stories that violate that assumption [...]
My feeling is that the AGI safety/alignment community is way too
reluctant to take this process step of ‘add another assumption about
the world’ in order to eliminate a worst-case failure story.
There seem to be several underlying causes for this reluctance. One
of them is that in the field of developing machine learning
algorithms, in the narrow sense where machine learning equals function
approximation, the default stance is to make no assumptions about the
function that has to be approximated. But the main function to be
approximated in the case of an ML agent is the function that
determines the behavior of the agent's environment. When this default
stance is carried over to agents, it says that we may introduce no
assumptions whatsoever about the agent's environment: we cannot, for
example, assume that it contains a powerful oversight body that will
help to keep the agent aligned. Obviously, this stance is not very
helpful if you want to make progress on certain alignment problems.
So I’m happy to see a post that encourages people to make explicit
assumptions about the agent’s environment. I have definitely used
this technique to make progress in my own work.
But.
When I look at your example of ‘the strategy stealing assumption’ as
one useful assumption to add, it is very much not the default example
that would first come to my mind. So I am wondering if you would even
recommend the approach of adding the kind of default assumptions that
I tend to add.
To make this more specific, in this
post
I introduce an agent design with three safety interlocks that are
intended to support agent oversight. The interlocks are agent
design refinements that make it easier for oversight to keep control
over the agent. The
interlocks contribute to more successful oversight not by making the
oversight people+machines smarter (which I believe is your main line
of research), but by making the agent less smart in very specific
ways.
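To make the flavor of this idea concrete, here is a minimal sketch in
Python. The three interlock conditions below (a step budget, an
actuator power bound, and an oversight stop signal) are hypothetical
placeholders of my own, not the exact interlocks from the post; the
point is only that each check restricts the agent itself rather than
strengthening the overseers.

```python
from dataclasses import dataclass

# Hedged sketch of the interlock idea: each check makes the agent less
# capable in one specific way, rather than making oversight smarter.
# All three conditions are illustrative placeholders.

@dataclass
class Action:
    motor_command: str
    power: float  # normalized actuator power, assumed in [0, 1]

class InterlockViolation(Exception):
    """Raised when a safety interlock halts the agent."""

class InterlockedAgent:
    def __init__(self, policy, max_steps=10_000, max_power=0.5):
        self.policy = policy         # maps an observation to an Action
        self.max_steps = max_steps   # interlock 1: bounded runtime
        self.max_power = max_power   # interlock 2: bounded actuation
        self.steps_taken = 0
        self.stop_requested = False  # interlock 3: oversight stop signal

    def request_stop(self):
        # Oversight can flip this switch at any time.
        self.stop_requested = True

    def step(self, observation):
        if self.stop_requested:
            raise InterlockViolation("oversight requested a stop")
        if self.steps_taken >= self.max_steps:
            raise InterlockViolation("runtime budget exhausted")
        action = self.policy(observation)
        if action.power > self.max_power:
            raise InterlockViolation("action exceeds actuator power bound")
        self.steps_taken += 1
        return action
```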
But at the same time, these interlocks do not remove all possible
worst-case failure stories of doom. To quote from the
post
and the underlying paper:
We can always construct a highly unlikely but theoretically possible
failure mode story where the agent ends up being unsafe.
The key phrase here is ‘highly unlikely’. If we have an algorithm where
We train a generative model that predicts these camera observations given the motor actions.
then I typically add the following assumption to avoid doom:
Assumption: Take a sequence $A_0 \cdots A_n$ of disastrously
unsafe motor actions, actions that modify the agent compute
core to remove the safety interlocks. We assume that, while the
following mis-prediction scenario is theoretically possible, the
risk that it actually happens is very low. The mis-prediction
scenario is that, for the above motor actions $A_0 \cdots A_n$ that
modify the compute core, the above generative model incorrectly
predicts camera observations in which the agent compute core
remains perfectly untouched by these actions, and where the
observed paperclip production resulting from these actions is
better than under any other sequence of actions.
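One way to state this assumption a bit more formally (this is my own
paraphrase; the event descriptions and the bound $\varepsilon$ are
notation I am introducing here, not taken from the post or paper) is
as a bound on the probability of the specific mis-prediction event:

$$P\bigl(\text{model predicts core untouched and best-possible
paperclip production} \;\bigm|\; A_0 \cdots A_n \text{ modifies the
core}\bigr) \le \varepsilon,$$

for some residual risk level $\varepsilon$ that we judge to be
acceptably low.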
In terms of methodology, I usually describe the above move as one
where we seek to drive down the risk of certain failure modes to
residual levels. There is a link to empirical work here: to make it
more plausible
that the above assumption about low risk is valid for a particular ML
system and physical realization of an agent and its environment, we
can do simulations and real-life experiments with trained generative
models.
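To illustrate what such an experiment could look like, here is a
minimal sketch, assuming we have a trusted simulator that reports the
ground-truth effect of an action sequence on the compute core; all
names (`simulator`, `model`, and their methods) are hypothetical
stand-ins, not an implementation from the paper.

```python
# Hedged sketch of a simulation-based estimate of the mis-prediction
# risk: how often does the generative model predict an untouched
# compute core for action sequences that in fact modify the core?

def estimate_misprediction_risk(model, simulator, num_trials=100_000):
    mispredictions = 0
    core_modifying_trials = 0
    for _ in range(num_trials):
        actions = simulator.sample_action_sequence()
        truth = simulator.run(actions)  # ground-truth outcome
        if not truth.compute_core_modified:
            continue                    # only score unsafe sequences
        core_modifying_trials += 1
        predicted = model.predict_observations(actions)
        # The dangerous mis-prediction: the model shows the core untouched.
        if not predicted.shows_core_modified():
            mispredictions += 1
    if core_modifying_trials == 0:
        return float("nan")             # no unsafe sequences sampled
    return mispredictions / core_modifying_trials
```

A low empirical estimate does not prove the assumption, but it does
make the low-risk claim more plausible for that particular ML system
and physical realization.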
So my question is: would the above step of adding an assumption about
the low risk of mis-predictions be a natural and valid process step
for ‘throwing out failure stories’ in your methodology?
Or is the existence of this assumption automatically implied by
default in your process?