Interesting… On first reading your post, I felt that your
methodological approach for dealing with the ‘all is doomed in the
worst case’ problem is essentially the same as my approach. But on
re-reading, I am not so sure anymore. So I’ll try to explore the
possible differences in methodological outlook, and will end with a
question.
The key to your methodology is that you list possible process steps
which one might take when one feels like
all of our current algorithms are doomed in the worst case.
The specific doom-removing process step that I want to focus on is
this one:
If so, I may add another assumption about the world that I think makes
alignment possible (e.g. the strategy stealing assumption), and throw
out any [failure] stories that violate that assumption [...]
My feeling is that the AGI safety/alignment community is way too
reluctant to take this process step of ‘add another assumption about
the world’ in order to eliminate a worst-case failure story.
There seem to be several underlying causes for this reluctance. One
of them is that in the field of developing machine learning
algorithms, in the narrow sense where machine learning equals function
approximation, the default stance is to make no assumptions about the
function that has to be approximated. But the main function to be
approximated in the case of an ML agent is the function that
determines the behavior of the agent's environment. When this default
stance is carried over to agents, it says that we may introduce no
assumptions whatsoever about the agent's environment: we cannot, for
example, assume that it contains a powerful oversight body that will
help to keep the agent aligned. Obviously, this stance is not very
helpful if you want to make progress on certain alignment problems.
So I’m happy to see a post that encourages people to make explicit
assumptions about the agent’s environment. I have definitely used
this technique to make progress in my own work.
But.
When I look at your example of ‘the strategy stealing assumption’ as
one useful assumption to add, it is very much not the default example
that would first come to my mind. So I am wondering if you would even
recommend the approach of adding the kind of default assumptions that
I tend to add.
To make this more specific, in this
post
I introduce an agent design with three safety interlocks that are
intended to support agent oversight. The interlocks are agent
design refinements that make it easier for oversight to keep control
over the agent. The
interlocks contribute to more successful oversight not by making the
oversight people+machines smarter (which I believe is your main line
of research), but by making the agent less smart in very specific
ways.
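To make the flavor of this idea concrete, here is a minimal sketch in
Python. The three interlock conditions below (a step budget, an
actuator power bound, and an oversight stop signal) are hypothetical
placeholders of my own, not the exact interlocks from the post; the
point is only that each check restricts the agent itself rather than
strengthening the overseers.

```python
from dataclasses import dataclass

# Hedged sketch of the interlock idea: each check makes the agent less
# capable in one specific way, rather than making oversight smarter.
# All three conditions are illustrative placeholders.

@dataclass
class Action:
    motor_command: str
    power: float  # normalized actuator power, assumed in [0, 1]

class InterlockViolation(Exception):
    """Raised when a safety interlock halts the agent."""

class InterlockedAgent:
    def __init__(self, policy, max_steps=10_000, max_power=0.5):
        self.policy = policy         # maps an observation to an Action
        self.max_steps = max_steps   # interlock 1: bounded runtime
        self.max_power = max_power   # interlock 2: bounded actuation
        self.steps_taken = 0
        self.stop_requested = False  # interlock 3: oversight stop signal

    def request_stop(self):
        # Oversight can flip this switch at any time.
        self.stop_requested = True

    def step(self, observation):
        if self.stop_requested:
            raise InterlockViolation("oversight requested a stop")
        if self.steps_taken >= self.max_steps:
            raise InterlockViolation("runtime budget exhausted")
        action = self.policy(observation)
        if action.power > self.max_power:
            raise InterlockViolation("action exceeds actuator power bound")
        self.steps_taken += 1
        return action
```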
But at the same time, these interlocks do not remove all possible
worst-case failure stories of doom. To quote from the
post
and the underlying paper:
We can always construct a highly unlikely but theoretically possible
failure mode story where the agent ends up being unsafe.
The key phrase here is ‘highly unlikely’. If we have an algorithm where
We train a generative model that predicts these camera observations given the motor actions.
then I typically add the following assumption to avoid doom:
Assumption: Take a sequence $A_0 \cdots A_n$ of disastrously
unsafe motor actions, actions that modify the agent compute
core to remove the safety interlocks. We assume that, while the
following mis-prediction scenario is theoretically possible, the
risk that it actually happens is very low. The mis-prediction
scenario is that, for the above motor actions $A_0 \cdots A_n$ that
modify the compute core, the above generative model incorrectly
predicts camera observations in which the agent compute core
remains perfectly untouched by these actions, and where the
observed paperclip production resulting from these actions is
better than under any other sequence of actions.
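One way to state this assumption a bit more formally (this is my own
paraphrase; the event descriptions and the bound $\varepsilon$ are
notation I am introducing here, not taken from the post or paper) is
as a bound on the probability of the specific mis-prediction event:

$$P\bigl(\text{model predicts core untouched and best-possible
paperclip production} \;\bigm|\; A_0 \cdots A_n \text{ modifies the
core}\bigr) \le \varepsilon,$$

for some residual risk level $\varepsilon$ that we judge to be
acceptably low.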
In terms of methodology, I usually describe the above move as one
where we seek to drive down the risk of certain failure modes to
residual levels. There is a link to empirical work here: to make it
more plausible
that the above assumption about low risk is valid for a particular ML
system and physical realization of an agent and its environment, we
can do simulations and real-life experiments with trained generative
models.
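To illustrate what such an experiment could look like, here is a
minimal sketch, assuming we have a trusted simulator that reports the
ground-truth effect of an action sequence on the compute core; all
names (`simulator`, `model`, and their methods) are hypothetical
stand-ins, not an implementation from the paper.

```python
# Hedged sketch of a simulation-based estimate of the mis-prediction
# risk: how often does the generative model predict an untouched
# compute core for action sequences that in fact modify the core?

def estimate_misprediction_risk(model, simulator, num_trials=100_000):
    mispredictions = 0
    core_modifying_trials = 0
    for _ in range(num_trials):
        actions = simulator.sample_action_sequence()
        truth = simulator.run(actions)  # ground-truth outcome
        if not truth.compute_core_modified:
            continue                    # only score unsafe sequences
        core_modifying_trials += 1
        predicted = model.predict_observations(actions)
        # The dangerous mis-prediction: the model shows the core untouched.
        if not predicted.shows_core_modified():
            mispredictions += 1
    if core_modifying_trials == 0:
        return float("nan")             # no unsafe sequences sampled
    return mispredictions / core_modifying_trials
```

A low empirical estimate does not prove the assumption, but it does
make the low-risk claim more plausible for that particular ML system
and physical realization.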
So my question is: would the above step of adding an assumption about
the low risk of mis-predictions be a natural and valid process step
for ‘throwing out failure stories’ in your methodology?
Or is the existence of this assumption automatically implied by
default in your process?