Is there something in particular that you think could be more distilled?
What I had in mind is something like a more detailed explanation of recent reward hacking/misalignment results. Like, sure, we have old arguments about reward hacking and misalignment, but what I want is more gears for when particular reward hacking would happen in which model class.
Maybe check MIRI, PIBBSS, ARC (theoretical research), and Conjecture; also check who went to ILIAD.
Those are top-down approaches, where you start with an idea and then do research for it. That's useful, of course, but it's doing more frontier research by expanding the surface area. Trying to apply my distillation intuition to them would mean having some overarching theory unifying all the approaches, which seems super hard and maybe not even possible. But looking at the intersections of pairs of agendas might prove useful.