This feels like too specific a task, and less generally useful to AI alignment research, than your proposal on “Extract the training objective from a fully-trained ML model.”