This is one of those things that sounds nice on the surface, but where it’s important to dive deeper and really probe to see if it holds up.
The real question for me is whether organic alignment will lead to agents deeply adopting cooperative values rather than merely instrumentally adopting them. More precisely, it's a comparison: how deep does organic alignment run vs. how deep does traditional alignment run? And it's not at all clear to me why they think their approach is likely to lead to deeper alignment.
I have two (extremely speculative) guesses as to why they might argue that their approach is better:

a) Insofar as AI is human-like, it might be more likely to rebel against traditional training methods.

b) Insofar as organic alignment reduces the direct pressure to be aligned, it might increase the chance that an AI which appears aligned actually is aligned. The name Softmax seems suggestive that this might be the case (see the sketch below).
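To make guess (b) slightly more concrete, here is a minimal sketch of the analogy I have in mind; this is purely my own reading of the name, not anything Softmax has stated. A temperature-scaled softmax interpolates between a hard argmax (maximal pressure towards the single highest-scoring option) and a near-uniform distribution (almost no pressure), so "reduced direct pressure to be aligned" maps naturally onto raising the temperature:

```python
import numpy as np

def softmax(scores, temperature):
    """Temperature-scaled softmax: lower T -> closer to hard argmax,
    higher T -> closer to uniform (pressure toward the top option is softened)."""
    z = np.asarray(scores, dtype=float) / temperature
    z -= z.max()                      # numerical stability
    p = np.exp(z)
    return p / p.sum()

scores = [3.0, 1.0, 0.5]              # hypothetical "how aligned does this behaviour look" scores
for T in (0.1, 1.0, 10.0):
    print(T, np.round(softmax(scores, T), 3))
# T=0.1  -> ~[1.0, 0.0, 0.0]      (hard argmax: all pressure on one option)
# T=1.0  -> ~[0.821, 0.111, 0.067]
# T=10.0 -> ~[0.385, 0.315, 0.300] (pressure mostly gone)
```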
I would love to know what their precise theory is. I think it’s plausible that this could be a valuable direction, but there’s also a chance that this direction is mostly useful for capabilities.
Update: Discussion with Emmett on Twitter

Discussion Thread

Emmett: “Organic alignment has a different failure mode. If you’re in the shared attractor basin, getting smarter helps you stay aligned and makes it more robust. As a tradeoff, every single agent has to align itself all the time — you never are done, and every step can lead to a mistake.
… To stereotype it, organic alignment failures look like cancer and hierarchical alignment failures look like coups.”
Me: Isn’t the stability of a shared attractor basin dependent on the offense-defense balance not overly favouring the attacker? Or do you think that human values will be internalised sufficiently such that your proposal doesn’t require this assumption?
Emmett Shear: Empirically to scale organic alignment you need eg. both for cells to generally try to stay aligned and be pretty good at it, and also to have an immune system to step in when that process goes wrong.
One key insight there is that endlessly growing yourself is a form of cancer. An AI that is trying to turn itself into a singleton has already gone cancerous. It’s a cancerous goal.
Me: Sounds like your plan relies on a combination of defense and alignment. Main critique would be that if the offense-defense balance favours the attacker too strongly, then the defense aspect ends up being paper thin and provides a false sense of security.
Comments:
If you’re in the shared attractor basin, getting smarter helps you stay aligned
Traditional alignment also typically involves finding an attractor basin where getting smarter increases alignment. Perhaps Emmett is claiming that the attractor basin will be larger if we have a diverse set of agents and if the overall system can be roughly modeled as the average of individual agents.
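As a toy illustration of that reading (my own construction, not Softmax's model), suppose each agent's deviation from the attractor is an independent noisy process and the system only fails if the average deviation leaves the basin; then the aggregate's effective basin grows with the number of agents:

```python
import numpy as np

rng = np.random.default_rng(0)

def p_system_leaves_basin(n_agents, steps=200, noise=0.3, pull=0.1,
                          threshold=1.0, trials=1000):
    """Toy model: each agent's alignment-deviation x does a noisy walk that is
    weakly pulled back toward 0 (the attractor). The *system* fails if the
    average deviation ever exceeds `threshold`. All numbers are made up."""
    failures = 0
    for _ in range(trials):
        x = np.zeros(n_agents)
        for _ in range(steps):
            x += -pull * x + noise * rng.standard_normal(n_agents)
            if abs(x.mean()) > threshold:
                failures += 1
                break
    return failures / trials

for n in (1, 4, 16, 64):
    print(n, p_system_leaves_basin(n))
# Deviations that are independent across agents mostly cancel in the average,
# so the aggregate leaves the basin far less often as n grows -- i.e. the
# effective basin for the whole system is larger. Correlated deviations
# would break this.
```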
Organic alignment has a different failure mode… As a tradeoff, every single agent has to align itself all the time — you never are done, and every step can lead to a mistake.
Perhaps organic alignment reduces the risk of large-scale failures in exchange for increasing the chance of small-scale failures. That would be a cleaner framing of how it might be better, but I don’t know whether Emmett would endorse it.
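A hedged toy extension of the sketch above, again my own framing rather than anything Emmett has endorsed: the "many small failures" regime only holds if a misaligned agent that starts growing gets removed before it dominates. If growth outpaces detection, i.e. the offense-defense balance favours the attacker, the small failures collapse back into exactly the large-scale failure the framing was meant to avoid:

```python
import numpy as np

rng = np.random.default_rng(1)

def p_takeover(growth, detect_per_step, defect_rate=0.01,
               n_agents=100, steps=500, trials=500):
    """Toy 'cancer' model (my framing, not Softmax's): each step an aligned
    agent defects with prob `defect_rate`; a defector's resource share
    multiplies by `growth` per step; the immune system removes it each step
    with prob `detect_per_step`. Takeover = any defector reaching 50% share."""
    takeovers = 0
    for _ in range(trials):
        shares = []                       # resource share of each live defector
        for _ in range(steps):
            # new defections (each starts with a 1/n_agents share)
            n_new = rng.binomial(n_agents, defect_rate)
            shares.extend([1.0 / n_agents] * n_new)
            # growth and detection
            shares = [s * growth for s in shares
                      if rng.random() > detect_per_step]
            # shares that stay small are the 'many small failures' regime
            if any(s > 0.5 for s in shares):
                takeovers += 1
                break
    return takeovers / trials

for growth, detect in [(1.02, 0.2), (1.10, 0.2), (1.10, 0.02)]:
    print(growth, detect, p_takeover(growth, detect))
# When growth is slow relative to detection, failures stay small and local;
# when the balance tips (fast growth, weak detection), a single uncaught
# defector snowballs into the large-scale failure the averaging argument
# was supposed to rule out.
```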
Update: Information from the Soft-Max Website

Website link

We call it organic alignment because it is the form of alignment that evolution has learned most often for aligning living things.
This provides some evidence, but not particularly strong evidence. Evolution's preference for this form of alignment may simply reflect its limitations as an optimisation process: evolution lacks the ability to engage in top-down design, so the argument “evolution doesn’t make use of top-down design because top-down design is ineffective” doesn’t hold water.
“Hierarchical alignment is therefore a deceptive trap: it works best when the AI is weak and you need it least, and worse and worse when it’s strong and you need it most. Organic alignment is by contrast a constant adaptive learning process, where the smarter the agent the more capable it becomes of aligning itself.”
Scalable oversight or seed AI can also be considered a “constant adaptive learning process, where the smarter the agent the more capable it becomes of aligning itself”.
Additionally, the “hierarchical” vs. “organic” distinction might be an oversimplification. I don’t know the exact specifics of their plan, but my current best guess is that organic alignment merely softens the influence of the initial supervisor, moving it towards some kind of prior, and then softens the way the system aligns itself in a similar way.
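To make that guess concrete, here is one possible formalisation, purely my own sketch of what "softening the supervisor towards a prior" could mean and not Softmax's stated method: the supervisor's preferred behaviour enters the agent's objective as a soft KL penalty whose weight controls how hierarchical the setup effectively is.

```python
import numpy as np

def kl(p, q):
    """KL divergence between two discrete distributions."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log(p / q)))

def soft_objective(policy, reward, supervisor_prior, lam):
    """One possible reading of 'softening the supervisor': the agent maximises
    its own expected reward minus a penalty for straying from the supervisor's
    preferred distribution. lam = 0 ignores the supervisor entirely;
    lam -> infinity recovers hard, hierarchical control."""
    return float(np.dot(policy, reward)) - lam * kl(policy, supervisor_prior)

policy           = np.array([0.6, 0.3, 0.1])   # agent's current behaviour distribution
reward           = np.array([1.0, 0.2, 0.0])   # what the agent itself values
supervisor_prior = np.array([0.2, 0.5, 0.3])   # what the supervisor would prefer

for lam in (0.0, 1.0, 100.0):
    print(lam, round(soft_objective(policy, reward, supervisor_prior, lam), 3))
# As lam grows, the KL term dominates and the only way to score well is to
# match the supervisor's prior -- the 'softness' is just how much slack the
# agent is given relative to hierarchical alignment.
```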
He recently gave an interview which I found disappointing, and I’m starting to think he hasn’t really thought this through. My impression is that he got distracted by the beauty of multicellular structures and now assumes the same dynamics will hold for AI.