Despite being a researcher with several years of experience at multiple R1 schools, I find it particularly hard to contextualize what people are doing in alignment research or how their work fits into a larger narrative. While I understand that more people in this field don’t necessarily speed up progress, good scholarship should be understandable, place itself well within the broader literature, and offer avenues for follow-up work.
Paul Christiano’s research does an excellent job of meeting these criteria by placing itself in the literature and outlining a research methodology. For example, see this statement on related work for the ELK problem. They also do a great job of showing how a reasonably interested person could get started towards making a contribution. Overall, Paul shows that you can work on pre-paradigmatic topics without sacrificing the criteria for good scholarship.
On the other hand, while Turntrout’s work is interesting, it is difficult to build on due to their non-standard notation and lack of references to related work. For example, I am not sure what the difference between Shard Theory and in-context learning is: the two have relatively similar definitions, but the latter notion is much more established in the literature. As another example, from discussions with learning theory folk, I suspect their notion of power reduces to a variant of Rademacher complexity. Not having this context makes it difficult to engage with their work. Some of my sentiment is echoed by reviewers, who point out that the central ideas of their peer-reviewed power-seeking paper were not very clear. Overall, this can leave one feeling that it may be easier to generalize the power-seeking theorems on one’s own rather than work through the paper, which would just lead to another instance of the problem posed in the OP.
Finally, I found minimal sufficiency a particularly interesting idea in Johnswentworth’s research. However, the lack of clear theorem statements in some of their earlier work made it difficult to identify the core contributions. I do want to praise them for offering a concrete research agenda with their selection theorems program; even though I eventually grew dissatisfied with that agenda, I find it to be a step in the right direction. However, I’m not quite sure where the difference between selection theorems and typical coherence or money-pump arguments lies, and I wish more attention were given to this. More recently, there may have been push-back against their utility, assuming selection theorems include coherence theorems.
In conclusion, while I acknowledge the community’s efforts in this field, I urge researchers to make their work more understandable, place it within the broader literature, and offer avenues for follow-up work in order to speed up progress. In the meantime, I find myself tending more towards established academic approaches to alignment, such as CIRL games, incentive compatibility, mechanism design, and performative prediction.
I think this problem is not well specified for other reasons, so I won’t belabor the argument.
Claim: There are no finite FFN gradient hackers.
Proof: Suppose the FFN has n trainable weights θ ∈ Θ. For the weights to appear in the input, the input dimension d must satisfy d ≥ n, and the input then has the decomposition x = o ⊕ θ. A map that returns the gradient certainly exists; however, computing it exactly requires multiplications between weights (see the worked example below). Exact multiplication cannot be implemented by a finite FFN: with piecewise-linear activations, for instance, the network computes a piecewise-linear function of its input, whereas the exact gradient is a higher-degree polynomial in the weights. Therefore the gradient cannot be computed internally by a finite FFN, violating condition 2b. We conclude there are no finite FFN gradient hackers.
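To illustrate why exact gradients force multiplications between weights, here is a minimal sketch of my own (not part of the claim above), assuming a two-layer linear toy network with weights θ = (w₁, w₂), scalar input o, and squared loss:

$$
f_\theta(o) = w_2 w_1 o, \qquad L(\theta) = \tfrac{1}{2}\left(w_2 w_1 o - y\right)^2,
$$
$$
\frac{\partial L}{\partial w_1} = w_2 \left(w_2 w_1 o - y\right) o = w_2^2 w_1 o^2 - w_2\, o\, y.
$$

Even in this toy case the exact gradient is quadratic in the weights, while a finite FFN with piecewise-linear activations computes a piecewise-linear function of x = o ⊕ θ, so it cannot agree with the gradient map on any open set of inputs.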