As a baseline, developers could train agents to imitate the truth-seeking process of the most reasonable humans on Earth. For example, they could sample the brightest intellects from every ideological walk, and train agents to predict their actions.
I’m very excited about strategies that involve lots of imitation learning on lots of particular humans. I’m not sure whether imitated human researchers would generalize to doing lots of novel research, but this seems great for examining the research outputs of slightly-more-alien agents very quickly.
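At its simplest, "training agents to predict human actions" is behavioral cloning: fit a policy to demonstrator state–action pairs by minimizing cross-entropy. The sketch below is purely illustrative, not a proposal from the text; the synthetic "demonstrator" and the linear softmax policy are my assumptions to keep it self-contained.

```python
import numpy as np

# Illustrative behavioral-cloning sketch (assumed setup, not the author's method):
# fit a linear softmax policy to predict a demonstrator's discrete actions
# from state features, using synthetic demonstrations.
rng = np.random.default_rng(0)
n_states, n_actions, dim = 500, 3, 4

# Synthetic "human" demonstrations: actions follow a hidden linear rule.
true_W = rng.normal(size=(dim, n_actions))
states = rng.normal(size=(n_states, dim))
actions = np.argmax(states @ true_W, axis=1)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Gradient descent on cross-entropy: the standard imitation objective.
W = np.zeros((dim, n_actions))
for _ in range(300):
    probs = softmax(states @ W)
    onehot = np.eye(n_actions)[actions]
    grad = states.T @ (probs - onehot) / n_states
    W -= 1.0 * grad

accuracy = np.mean(np.argmax(states @ W, axis=1) == actions)
```

Matching the demonstrator on-distribution is the easy part; the open question flagged above is whether such a policy generalizes to genuinely novel research states.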