But alignment might be a consequence of external structures. And not just incentives, but maybe environmental structures in the broadest sense as well.
I think everyone already thinks this. We don’t program in the utility function. We train the models. Then they end up with utility functions, but these utility functions will probably end up being a product of:
Their architecture
The environment
The loss / reward function
Irreducible randomness
How they’re deployed
And the point is just that we can’t predict how these interact.
So in that sense, alignment needs might be greatly reduced simply by matching the AI training environment to the human environment. Literally just “give the AI sense organs which produce equivalent raw signals to human sense organs”.
It seems obvious to me that if you found a tribe of protohumans, and carefully made them go through the exact same evolutionary pressures humans did, and then guided their cultural evolution to end up in an identical place to ours, you’d end up with a new batch of “aligned” humans.
The problem is you can’t do this with AI, because they have a different architecture and are trained in a different way.
If you put the AI in human evolutionary environments, it’s plausible you get an AI that wants totally different things from what humans ended up wanting.
I don’t think completely different is likely. There is already divergence among living things, and between humans at individual and group levels. However, it solves the interpretability problem, or at least dramatically reduces it to the point where people habitually solve it while minimizing the impact of failure modes, and this goes towards moral alignment between conflicting groups of humans as well. It is worth closing 90% of the distance even if that risks building capacity. If you want the AI to affect our world, it has to start being entangled with it eventually. Close the distance you’re comfortable with, then re-evaluate, imho.
Not trying to be rude here, but I have zero idea what you just said. I am only able to follow the first sentence. Then it’s just a bunch of unrelated sentences strung together. (That’s how it reads to me.)
All your posts so far have been very hard to understand.
You use a bunch of terms that are non-standard, like “alignment context”. Then you don’t explain what they mean. Even when I asked you directly what you mean by that phrase, you didn’t explain.
FWIW, I have had the same experience of reading a post or comment by Alephwyr and bouncing off parts of it, unsure what he was saying. So I tried giving it to Claude, who generally seemed to understand it and explained it to me; when I then reread it, Claude’s explanation fit, and when I then conversed with Alephwyr on that basis, it appeared that Claude’s interpretation had in fact been correct. So I think he’s not actually anything like as unclear as he, admittedly, sometimes seems on a first reading by people very used to the discussion here on LessWrong. Which fits with how he’s describing his communication style below — I think he’s just not using all our terminology and making all the same sets of assumptions. Which, frankly, makes him a particularly valuable participant in the conversation — questioning previously unquestioned assumptions is worth doing periodically, and new ideas are often helpful. So, if in doubt, ask Claude, as that often helps.
I don’t know most of the standard terms with any precision, or at all. Sorry. I do read things. Part of the point of discussing things is to try to get a tighter use pattern of language down. However, part of the reason for my non-standard use is also that, having not read a sufficient amount of anything, I am deliberately trying to avoid pulling in all the connotations of existing rationalist terms, while still signalling that I am thinking about the same cluster of things. It is deliberately aimed at signalling lower fidelity towards your inherited holistic concepts.