I am not sure. AI alignment seems to touch on many different aspects of the world, and it is not obvious that it can be reduced to assumptions that are extremely simple and natural. Or, if it can be reduced that way, then it might require a theory that on some level explains human civilization, its evolution and and its influence on the world (even if only on a fairly abstract level). I will share some thoughts how the various assumptions can be reduced another step back, but proceeding to reduce all of them to a simple core seems like a challenging research programme.
Most of the parts of this design can be regarded as reflecting particular assumptions we make about the user as an agent.
The core idea of having a dialogue comes from modeling the user as a “linguistic agent”. Such agents may be viewed as nodes in a distributed AI system, but where each node has different objectives. It is an interesting philosophical question whether this assumption is necessary for value learning. It currently seems plausible to me that only for linguistic agents “values” are truly well-defined, or at least sufficiently well-defined to extrapolate them outside the trajectory that the agent follows on its own.
The need to quantilize, debate and censor infohazards comes from the assumption that the user can be manipulated (there is some small fraction of possible inputs that invalidate the usual assumptions about the user’s behavior). Specifically debate might be possible to justify by some kind of Bayesian framework where every argument is a piece of evidence, and providing biased arguments is like providing selective evidence.
The need to deal with “incoherent” answers and the commitment mechanism comes from the assumption the user has limited access to its own knowledge state (including its own reward function). Perhaps we can formalize it further by modeling the user as a learning algorithm with some intrinsic source of information. Perhaps we can even explain why such agents are natural in the “distributed AI” framework, or by some evolutionary argument.
The need to translate between formal language and natural languages come from, not knowing the “communication protocol” of the “nodes”. Formalizing this idea further requires some more detailed model of what “natural language” is, which might be possible via multi-agent learning theory.
Finally, the need to start from a baseline policy (and also the need to quantilize) comes from the assumption that the environment is not entirely secure. So that’s an assumption about the current state of the world, rather than about the user. Perhaps, we can make formal the argument that this state of the world (short-term stable, long-term dangerous) is to be expected when agents populated it for a long time.
I am not sure. AI alignment seems to touch on many different aspects of the world, and it is not obvious that it can be reduced to assumptions that are extremely simple and natural. Or, if it can be reduced that way, then it might require a theory that on some level explains human civilization, its evolution and and its influence on the world (even if only on a fairly abstract level). I will share some thoughts how the various assumptions can be reduced another step back, but proceeding to reduce all of them to a simple core seems like a challenging research programme.
Most of the parts of this design can be regarded as reflecting particular assumptions we make about the user as an agent.
The core idea of having a dialogue comes from modeling the user as a “linguistic agent”. Such agents may be viewed as nodes in a distributed AI system, but where each node has different objectives. It is an interesting philosophical question whether this assumption is necessary for value learning. It currently seems plausible to me that only for linguistic agents “values” are truly well-defined, or at least sufficiently well-defined to extrapolate them outside the trajectory that the agent follows on its own.
The need to quantilize, debate and censor infohazards comes from the assumption that the user can be manipulated (there is some small fraction of possible inputs that invalidate the usual assumptions about the user’s behavior). Specifically debate might be possible to justify by some kind of Bayesian framework where every argument is a piece of evidence, and providing biased arguments is like providing selective evidence.
The need to deal with “incoherent” answers and the commitment mechanism comes from the assumption the user has limited access to its own knowledge state (including its own reward function). Perhaps we can formalize it further by modeling the user as a learning algorithm with some intrinsic source of information. Perhaps we can even explain why such agents are natural in the “distributed AI” framework, or by some evolutionary argument.
The need to translate between formal language and natural languages come from, not knowing the “communication protocol” of the “nodes”. Formalizing this idea further requires some more detailed model of what “natural language” is, which might be possible via multi-agent learning theory.
Finally, the need to start from a baseline policy (and also the need to quantilize) comes from the assumption that the environment is not entirely secure. So that’s an assumption about the current state of the world, rather than about the user. Perhaps, we can make formal the argument that this state of the world (short-term stable, long-term dangerous) is to be expected when agents populated it for a long time.