Nuclear engineer with a focus on nuclear plant safety and probabilistic risk assessment. Aspiring EA, interested in X-risk mitigation and the intersection of science and policy. Working towards Keegan/Kardashev/Simulacra level 4.
(Common knowledge note: I am not under a secret NDA that I can’t talk about, as of Mar 15 2025. I intend to update this statement at least once a year as long as it’s true.)
Even assuming imitation learning is safe, how would you get enough data for the first distillation, when you need the human to generate the actions? And how would you know when you have enough alignment-relevant data in particular? It seems unavoidable that your data distribution will be very constrained compared to the set of situations the distilled agent might encounter.
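To make the coverage worry concrete, here is a minimal toy sketch (my own illustration, not anything from the proposal being discussed): a crude k-nearest-neighbor novelty score over state features, flagging deployment states that sit far from every human demonstration. The function name `knn_ood_scores`, the feature dimensionality, and the Gaussian toy data are all hypothetical choices for the example.

```python
import numpy as np

def knn_ood_scores(demo_feats, query_feats, k=5):
    """Mean distance from each query state to its k nearest demonstrations.

    High scores flag states the demonstration distribution never covered,
    i.e. places where the distilled agent is extrapolating off-distribution.
    """
    # Pairwise Euclidean distances, shape (n_query, n_demo)
    dists = np.linalg.norm(query_feats[:, None, :] - demo_feats[None, :, :], axis=-1)
    # Average over the k smallest distances for each query state
    nearest = np.sort(dists, axis=1)[:, :k]
    return nearest.mean(axis=1)

rng = np.random.default_rng(0)
demo = rng.normal(0.0, 1.0, size=(500, 8))    # human demonstrations cluster near the origin
deploy = rng.normal(3.0, 1.0, size=(200, 8))  # deployment drifts to a region with no demos

scores_in = knn_ood_scores(demo, demo[:200])  # states like the demos: low novelty
scores_out = knn_ood_scores(demo, deploy)     # shifted states: high novelty
print(f"in-distribution mean score:  {scores_in.mean():.2f}")
print(f"shifted-distribution score:  {scores_out.mean():.2f}")
```

A detector like this can tell you *that* you are off the demonstration distribution, but not how to behave there, and nothing in it distinguishes alignment-relevant gaps from mundane ones, which is the part of the question that seems hardest.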