You expect RLHF to break down at some point, but did you assign 0% probability to it happening at current levels of capability?
Like 3 years ago, it was pretty obvious that scaling was going to make RLHF “work” or “seem to work” more effectively for a decent amount of time. And probably for quite a long time.
Like, this was/is obvious to me too, but what exactly does “quite a long time” mean? If someone firmly predicted “after xyz capability RLHF will catastrophically fail”, and we’ve not reached capability xyz, then you don’t need to update, but I don’t think that’s most people. Most people think it will break eventually. Maybe they have a more detailed picture, but I haven’t seen any picture so crisp that it rules out the breakdown having happened by now. Interested if you have counterexamples.
Yeah, to be clear, I didn’t assign a 0% probability at this capability level, but I also don’t think my probability would have been that high. But you’re right that it’s difficult to say in retrospect, since I didn’t preregister my guesses on a per-capability-level basis at the time. I still think it’s a smaller update than many of the ones I’m hearing people make.