Does “aligned” have to include modeling the entire ecosystem the agent is embedded in (including any other agents in that ecosystem) well enough that its actions won’t have any unanticipated consequences?
I lean towards probably yes, or at least that it can reliably state which part of the world the unintended consequences may reside in. More or less, yes: I do think “strongly aligned” requires a very high minimum bar of capability, so that the agent can reliably know the limits of its knowledge. You might be able to have a weak but strongly aligned AI which reliably tells you “idfk, I understand well enough to give you dangerously wrong answers and might sometimes give you right answers by accident, but you shouldn’t trust me”. But I also think we’re already past that minimum bar of capability: if we knew how to train a reliable know-what-you-know model, it would be able to be quite certain about many things.
Is it possible to have an aligned agent that is not the smartest agent in its environment?
It would give a lot of “I don’t know, it depends on how [smarter being] responds” answers if a smarter being is present and doesn’t provide reliably understandable simplifying properties.
Is “aligned” a one-place word, or even only a two-place word?
Well, it would at least require the parameters “to what” and “robust under what conditions”, so no fewer than two-place.