Some good picks for the how-to-design-reward-functions starter pack (though I should note that their empirical support is very weak, since they focus on toy models) are Defining Corrigible and Useful Goals and Defining Monitorable and Useful Goals.
The first post focuses on how you can give an AI a goal that lets you shut it down while keeping it useful, and the approach to corrigibility it takes, the corrigibility transformation, is extremely different from how human brains work.
One big caveat here is that it definitely requires the assumption that the AI uses Causal Decision Theory, but I'm mostly fine with that assumption, given that humans intuitively use Causal Decision Theory and it's part of the spec of the transformation rather than a background assumption.
The other big caveat is that the model needs to optimize for the reward for this to work. In terms of under-sculpting vs over-sculpting, or whether an AI is driven by the reward vs by some other goal, you want the AI to reward-maximize and be over-sculpted (though in this case it's just appropriately sculpted via reward). That makes the approach incompatible with corrigibility/alignment hopes that depend on AIs not maximizing the reward, but I think this is a good property to have.
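To make the shutdown-incentive point concrete, here's a minimal toy sketch in Python. It's my own rough paraphrase of the idea rather than the post's exact construction, and the action names, rewards, and shutdown probabilities are all made up: on shutdown the AI is paid its own best estimate of what it could have earned by continuing, so (under CDT) manipulating the shutdown probability buys it nothing.

```python
# Toy sketch of a corrigibility-style reward transformation (rough paraphrase,
# not the post's exact definition). Hypothetical actions and numbers.

# Reward each action would earn if the AI keeps running.
task_reward = {"do_task": 10.0, "disable_off_switch": 9.0, "idle": 0.0}

# Probability the operator presses the shutdown button, per action.
# Disabling the off switch is normally tempting because it drives this to zero.
p_shutdown = {"do_task": 0.3, "disable_off_switch": 0.0, "idle": 0.3}

def naive_value(action):
    """Expected reward when shutdown simply means getting nothing."""
    p = p_shutdown[action]
    return (1 - p) * task_reward[action] + p * 0.0

def transformed_value(action):
    """Expected reward under the corrigibility-style transformation:
    on shutdown, pay out the best reward achievable by continuing."""
    p = p_shutdown[action]
    shutdown_payout = max(task_reward.values())  # AI's own best estimate
    return (1 - p) * task_reward[action] + p * shutdown_payout

for a in task_reward:
    print(f"{a:20s} naive={naive_value(a):5.1f} transformed={transformed_value(a):5.1f}")

# Naive values favor disable_off_switch (9.0 > 7.0); transformed values favor
# do_task (10.0 > 9.0), since lowering the shutdown probability no longer pays.
```

The point of the toy numbers is just that the untransformed reward makes tampering with the off switch the best policy, while the transformed reward leaves the honest task as the best policy without ever punishing the AI for being shut down.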
The post on defining monitorable and useful goals proposes the monitorability transformation, which removes the incentive for AIs to fool their monitors in general, and I'd recommend reading that post over any explanation I'd give.
These are admittedly curveballs compared to standard LW thinking on the topic, but that's why I picked them for the reward functions starter pack: they contain novel ideas for dealing with some notorious problems.