Just want to articulate one possibility for what the future could look like:
RL agents will be sooo misaligned so early, lying and cheating and scheming all the time, that alignment becomes a practical issue with normal incentives, and gets iteratively solved for not-very-superhuman agents. It turns out to require mild conceptual breakthroughs, since these agents are slightly superhuman, fast, and too hard to supervise directly to just train away the adversarial behaviors in the dumbest way possible. The toolkit finishes developing by the time ASI arrives, and people align it with a lot of effort, the same way any big project takes a lot of effort.
I’m not saying anything about its probability. It honestly feels a bit overfitted, much like the people who overupdated on base models sounded for a while. But still, the whole LLM arc was kind of weird and goofy, so I don’t trust my sense of weird and goofy anymore.
(would appreciate references to forecasting writeups exploring a similar scenario)