It is a huge alignment tax if the supervisor needs to understand everything that is going on, unless the supervisor is as on the ball as the system being supervised. So there’s a big gain in results if you instead judge by results, while not understanding what is going on and being fine with that, which is a well-known excellent way to lose control of the situation.
The good news is that we pay exactly this huge alignment tax all the time with humans. It plausibly costs us most of the potential productivity of many of our most capable people. We often decide the alternative is worse. We might do so again.
Minor point:
I think the second paragraph understates the problem. I have never heard of a human manager / human underling relationship that works the way process-based supervision is supposed to work. I think you’re misunderstanding it: it’s weirder than you think it is. I have a draft blog post where I try to explain; it should come out next week (or DM me for early access).
UPDATE: now it’s posted; see “Thoughts on Process-Based Supervision”, esp. Section 5.3.1 (“Pedagogical note: If process-based supervision sounds kinda like trying to manage a non-mission-aligned human employee, then you’re misunderstanding it. It’s much weirder than that.”)
It could be like verifying math test solutions. I’m not sure about the granularity of process-based supervision, but it could be less weird if an AI just has to justify how it got to an answer rather than only giving the answer.