I agree with the beginning of your analysis, up to and including the claim that if alignment were built into an agent's universe as a law, then alignment would be solved.

But I wonder if it's any easier to permanently align an autonomous agent's environment than it is to permanently align the agent itself.

Your proposal might successfully produce aligned LLMs. But agents, not LLMs, are where the greater misalignment risks lie. (I do think there may be interesting ways to design the environment of autonomous agents, at least at first, so that when they're learning how to model themselves they do so in a way that's connected to, rather than competitive with, other life such as humanity. But there remains the question: can the aligning influence of initial environmental design ever be lasting for an agent?)
I see the LLM side of this as a first step, both as a proof of concept and because agents are built on top of LLMs (for the foreseeable future, at least).

I think that, no, it isn't any easier to align an agent's environment than to align the agent itself. For perfect alignment, alignment that holds in all cases and for all time, the two amount to the same thing, and that is why the problem is so hard. When an agent, or any AI, learns new capabilities, it draws the information it needs out of the environment. It's trying to answer the question: "Given the information coming into me from the world, how do I get the right answer?" So the structure of the environment largely determines what the agent ends up being.

So the key question is the one you raise, and the one I try to allude to by talking about an aligned ontology: is there a particular compression, a particular map of the territory, that is good enough to initialise acceptable long-term outcomes?
Same page then.
I do think a good initial map of the territory might help an agent avoid catastrophic short-term behavior.
I hazard that a good map would be as big as possible across both time and space. Time, because it's only over eons that identifying with all life may be selected for in AGI. Space, because a physically bounded system is more likely to see itself in direct competition with physical life than a distributed, substrate-independent mind would.