Quick take: it’s focused on interpretability as a way to solve prosaic alignment, ignoring the fact that prosaic alignment clearly does not scale to the types of systems they are actively planning to build. (And it seems to actively embrace the fact that interpretability is a capabilities advantage in the short term, while presenting it as a safety measure, as if the two were not at odds with each other while the field is caught up in racing dynamics.)
Why do you believe this?
(FWIW I think it’s foolish that all (?) frontier companies are all-in on prosaic alignment, but I am not convinced that it “clearly” won’t work.)
Because they are all planning to build agents subject to optimization pressure, and RL-type failures apply whenever you build RL systems, even when they sit on top of LLMs.