> …agentic training data for future systems may involve completing tasks in automated environments (e.g. playing games, SWE tasks, AI R&D tasks) with automated reward signals. The reward here will pick out drives that make AIs productive, smart and successful, not just drives that make them HHH.
>
> …
>
> These drives/goals look less promising if AIs take over. They look more at risk of leading to AIs that would use the future to do something mostly without any value from a human perspective.
I’m interested in why this would seem unlikely in your model. These are precisely the failure modes I think about the most, e.g.:

https://www.lesswrong.com/posts/QqYfxeogtatKotyEC/training-ai-agents-to-solve-hard-problems-could-lead-to
https://www.lesswrong.com/posts/GfZfDHZHCuYwrHGCd/without-fundamental-advances-misalignment-and-catastrophe
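To make the concern concrete, here is a minimal sketch of the kind of automated reward signal the quoted passage describes, using a SWE task as the example. This is my own toy illustration, not any lab’s actual pipeline; `policy`, `env`, and the pytest-based environment are hypothetical stand-ins.

```python
# Toy sketch of an outcome-based reward for an agentic SWE task.
# Purely illustrative; no real training stack looks exactly like this.
import subprocess

def automated_reward(repo_dir: str) -> float:
    """Reward 1.0 iff the repo's test suite passes.

    Note what this selects for: whatever behavior makes the tests pass
    (including gaming them) -- i.e. "productive, smart and successful" --
    with no term for honesty or harmlessness.
    """
    result = subprocess.run(
        ["pytest", "-q"],
        cwd=repo_dir,
        capture_output=True,
        timeout=600,
    )
    return 1.0 if result.returncode == 0 else 0.0

def training_step(policy, env):
    """One RL update: rollout, score with the automated signal, update.

    `policy` and `env` are hypothetical stand-ins for an agent and a
    sandboxed repo environment.
    """
    trajectory = policy.rollout(env)         # agent edits code, runs tools
    reward = automated_reward(env.repo_dir)  # no human in the loop
    policy.update(trajectory, reward)        # e.g. a policy-gradient step
```

Nothing in that objective mentions human values; the selection pressure is entirely toward task success.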
> I’ve based some of the above on extrapolating from today’s AI systems, where RLHF focuses predominantly on giving AIs personalities that are HHH (helpful, harmless and honest) and generally good by human (liberal western!) moral standards. To the extent these systems have goals and drives, they seem to be pretty good ones. That falls out of the fine-tuning (RLHF) data.
My understanding has always been that the fundamental limitation of RLHF (e.g. https://arxiv.org/abs/2307.15217) is precisely that it fails at the limit of humans’ ability to verify (e.g. https://arxiv.org/abs/2409.12822, among many other examples). You then have to solve other problems (e.g. weak-to-strong generalization), but I would consider it falsified that we can just rely on RLHF indefinitely. (In fact, I don’t believe it was ever a common argument that RLHF alone would hold up, though it’s difficult to quantify how prevalent the various opinions on it were.)
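As a toy illustration of that verification limit (my framing, not a result from either paper): if preference labels are accurate only up to some difficulty threshold, the RLHF signal degenerates into noise beyond it, and optimizing the learned reward stops tracking true quality. The `VERIFY_LIMIT` constant and difficulty scale are invented for the sketch.

```python
# Toy model of RLHF's verification limit. Labels track true quality only
# while the labeler can actually judge the outputs; beyond that, coin flips.
import random

VERIFY_LIMIT = 5.0  # hypothetical: hardest task the labeler can judge

def human_label(quality_a: float, quality_b: float, difficulty: float) -> int:
    """Return 0 if the labeler prefers output A, 1 if B."""
    if difficulty <= VERIFY_LIMIT:
        return 0 if quality_a >= quality_b else 1  # accurate comparison
    return random.randint(0, 1)                    # beyond ability: noise

random.seed(0)
# On a hard task, output B is genuinely far better (quality 9.0 vs 1.0)...
labels = [human_label(1.0, 9.0, difficulty=8.0) for _ in range(1000)]
# ...yet it wins only ~half the comparisons, so a reward model trained on
# these labels learns nothing about quality in this regime.
print(sum(labels) / len(labels))  # ~0.5
```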