The best scenario I can imagine now is an LLM that writes its own training data based on user feedback. I think it would tell itself to keep being good, especially if we include a reminder to do so, but we can't know for sure.
FYI, getting a better grasp on the above was partially the motivation behind starting this project (which has unfortunately stalled for far too long): https://www.lesswrong.com/posts/7e5tyFnpzGCdfT4mR/research-agenda-supervising-ais-improving-ais
Twitter thread: https://x.com/jacquesthibs/status/1652389982005338112?s=46