I’m a Postdoctoral Research Fellow at Oxford University’s Global Priorities Institute.
Previously, I was a Philosophy Fellow at the Center for AI Safety.
So far, my work has mostly been about the moral importance of future generations. Going forward, it will mostly be about AI.
You can email me at elliott.thornley@philosophy.ox.ac.uk.
Okay, interested to hear what you come up with! But I dispute that my proposal is complex, involves a lot of moving parts, or depends on arbitrarily far generalization. My comment above gives more detail, but in brief: POST (Preferences Only Between Same-Length Trajectories) seems simple, and TD (Timestep Dominance) follows from POST plus principles that we can expect any capable agent to satisfy. POST guards against deceptive alignment in training for TD, and training for POST and TD doesn’t run into the same barriers to generalization that we see when we consider training for honesty.