Bronson Schoen comments on Training AI to do alignment research we don’t already know how to do

Bronson Schoen 25 Feb 2025 9:49 UTC
1 point
Isn’t “truth seeking” (as defined in this post) essentially part of what it means to “maintain their alignment”? Is there some other interpretation under which models could start off “truth seeking” and maintain their alignment, yet fail to remain “truth seeking”? If so, what are those failure modes?