Right, I got confused because I thought your problem was about defining a measure of optimisation power (for example, one analogous to the Yudkowsky measure) that also refers to a utility function while being invariant under scaling and translation. But this is different from asking:

“what fraction of the default expected utility comes from outcomes at least as good as this one?”
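To make the difference concrete, here is a minimal sketch, assuming a finite outcome set, a default distribution `p`, and a positive default expected utility. The function names and the toy numbers are mine, purely for illustration. It puts the Yudkowsky-style measure (bits of default probability mass on outcomes at least as good) next to the utility-weighted fraction from the quoted question.

```python
import math

def yudkowsky_bits(p, u, x):
    """-log2 of the default probability mass on outcomes at least as good as x."""
    mass = sum(p[y] for y in p if u[y] >= u[x])
    return -math.log2(mass)

def utility_fraction(p, u, x):
    """Fraction of default expected utility coming from outcomes at least as good as x.

    Assumes the default expected utility is positive (illustrative assumption).
    """
    total = sum(p[y] * u[y] for y in p)
    top = sum(p[y] * u[y] for y in p if u[y] >= u[x])
    return top / total

# Hypothetical toy example: three outcomes under the default distribution.
p = {"a": 0.5, "b": 0.25, "c": 0.25}
u = {"a": 1.0, "b": 2.0, "c": 4.0}

u_scaled = {k: 10.0 * v for k, v in u.items()}   # rescale the utility function
u_shifted = {k: v + 4.0 for k, v in u.items()}   # translate the utility function

bits = yudkowsky_bits(p, u, "c")
frac = utility_fraction(p, u, "c")
frac_scaled = utility_fraction(p, u_scaled, "c")
frac_shifted = utility_fraction(p, u_shifted, "c")
```

In this toy case the utility-weighted fraction is unchanged by rescaling `u` but does change when `u` is translated, whereas the bits measure only depends on the ordering of outcomes.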

Nice recommendations! In addition to brain enthusiasts being useful for empirical work, there are also theoretical tools from systems neuroscience that could be useful for AI safety. One area in particular is interpretability: if we want to model a network at various levels of “emergence”, recent developments in information decomposition and multivariate information theory, which move beyond pairwise interactions in a neural network, might be very useful. Also see recent publications on modelling synergistic information and dynamical independence, which could perhaps automate the discovery of macro variables and could be well worth exploring for studying the higher levels of large ML models. This would require both empirical and theoretical work: once the various measures of information decomposition are clearer, one would need to estimate and test them empirically, and then use them for interpretability in actual ML systems if they turn out to be meaningful.
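As a small illustration of why going beyond pairwise interactions matters, here is a sketch using the standard XOR example. It is not a full information decomposition, just plain mutual information computed over a toy distribution, and the helper name `mutual_information` is my own. Each input alone carries zero bits about the output, while the two inputs together carry one full bit, so the information is purely synergistic.

```python
import math
from collections import Counter
from itertools import product

def mutual_information(pairs):
    """Mutual information in bits between a and b, given equally likely (a, b) samples."""
    n = len(pairs)
    joint = Counter(pairs)
    pa = Counter(a for a, _ in pairs)
    pb = Counter(b for _, b in pairs)
    return sum(
        (c / n) * math.log2((c / n) / ((pa[a] / n) * (pb[b] / n)))
        for (a, b), c in joint.items()
    )

# Two independent uniform bits and their XOR, enumerated exhaustively.
samples = [(x1, x2, x1 ^ x2) for x1, x2 in product([0, 1], repeat=2)]

i_x1_y = mutual_information([(x1, y) for x1, _, y in samples])           # one input vs output
i_x2_y = mutual_information([(x2, y) for _, x2, y in samples])           # other input vs output
i_joint_y = mutual_information([((x1, x2), y) for x1, x2, y in samples]) # both inputs vs output
```

A purely pairwise analysis of this system would see no dependence at all between either input and the output, which is the kind of structure that multivariate measures are meant to pick up.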