How about more uhh soft uncontrollability? Like, not “it subverted our whole compute and feeds us lies” but more “we train it to do A, which it sees as only telling it to do A, and does A, but its motivations are completely untouched”.