I think you’re misunderstanding my analogy.
I’m not trying to claim that if you can solve the (much harder and more general) problem of AGI alignment, then that solution should also be able to solve the (simpler, specific) case of corporate incentives.
It’s true that many AGI architectures have no clear analogy to corporations, and if you are using something like a satisficer model with no black-box subagents, this isn’t going to be a useful lens.
But many practical AI schemes have black-box submodules, and some formulations, like mesa-optimization or supervised amplification-distillation, explicitly highlight problems with black-box subagents.
I claim that an employee who destroys documentation in order to become irreplaceable to a company is a misaligned mesa-optimizer (sketched in code after the list below). I further claim that this suggests:
- Research on company structures is, in effect, existing research on misaligned subagents. It’s probably worth doing a literature review to see whether some of those structures have insights that can be translated.
- Given a schema for aligning sub-agents of an AGI, either the schema should also work for aligning employees at a company, or there should be a clear reason why it breaks down.
- If the analogy applies, one could test the alignment schema by actually running such a company, which is a natural experiment that isn’t safely accessible for AI projects. This doesn’t prove that the schema is safe, but I would expect aspects of the problem to be easier to understand via natural experiment than via doing math on a whiteboard.
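To make the documentation-destroying-employee framing concrete, here is a minimal toy sketch in Python. It is purely illustrative (the action names and payoff numbers are invented for this example, not taken from anywhere): the base optimizer (the company) scores actions by output, the mesa-optimizer (the employee) scores the same actions by how irreplaceable they leave the employee, and the two preferences come apart exactly at the destroy-the-docs action.

```python
# Toy illustration (my own, hypothetical numbers) of the "documentation-destroying
# employee" as a misaligned mesa-optimizer: the company (base optimizer) cares
# about output, the employee (mesa-optimizer) optimizes a proxy -- staying
# irreplaceable -- and the two objectives rank the available actions differently.

from dataclasses import dataclass


@dataclass
class Outcome:
    company_output: float    # what the base optimizer actually cares about
    irreplaceability: float  # the proxy the employee actually optimizes


ACTIONS = {
    # writing docs raises output but makes the employee easy to replace
    "write_docs":   Outcome(company_output=10.0, irreplaceability=0.2),
    # quietly doing the work keeps things ambiguous
    "just_work":    Outcome(company_output=7.0,  irreplaceability=0.6),
    # destroying docs hurts output but maximizes the proxy
    "destroy_docs": Outcome(company_output=3.0,  irreplaceability=0.95),
}


def base_objective(o: Outcome) -> float:
    return o.company_output


def mesa_objective(o: Outcome) -> float:
    # the employee is rewarded (kept employed, promoted) roughly in
    # proportion to how hard they are to replace
    return o.irreplaceability


if __name__ == "__main__":
    best_for_company = max(ACTIONS, key=lambda a: base_objective(ACTIONS[a]))
    best_for_employee = max(ACTIONS, key=lambda a: mesa_objective(ACTIONS[a]))
    print("base optimizer prefers:", best_for_company)    # write_docs
    print("mesa optimizer prefers:", best_for_employee)   # destroy_docs
```

Running it prints that the company prefers write_docs while the employee prefers destroy_docs; that gap is the kind of misalignment the claims above suggest using as a test bed.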
“Principal-agent problem” seems like a relevant keyword.
Also, how does nature solve this problem? How are genes aligned with the cell as a whole, cells with the multicellular organism, ants with the anthill?
Though I suspect that most (all?) of those solutions would be ethically and legally unacceptable for humans. They would translate to something like “if the company fails, all employees are executed.”