Nice post! Do you have a link to an explanation of what counterfactual mugging is and why it’s a good thing?
For subagent alignment problems, is there an interesting distinction to be drawn between the limited agent being able to understand the process by which the more powerful agent becomes powerful, versus not even understanding that? (What would it mean to “understand the process”? I suppose it means being able to validate certain relevant facts about the process though not enough to know exactly what results from it.)
Here is a quick explanation. It is a good thing because any agent that does not get counterfactually mugged would self-modify into one that does, before it sees the value of the coin.
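(As a rough sketch of why, using the illustrative payoffs usually quoted for the problem, where Omega demands $100 on heads and would have paid $10,000 on tails only if it predicts you would pay: before the flip, an agent that commits to paying has expected value

$$\tfrac{1}{2}(-100) + \tfrac{1}{2}(10{,}000) = 4{,}950 > 0,$$

while an agent that refuses gets $0$ in expectation. So from the ex-ante view, a refusing agent would prefer to modify itself into a paying one while it still can.)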
There are some subagent alignment approaches that require some understanding, for example transparency and informed oversight (in informed oversight, we even assume that the agent we are aligning is less powerful, just not exponentially less powerful), and other approaches that treat the agent you are trying to align as a black box.