I agree humans absorb (terminal) values from people around them. But this property isn’t something I want in a powerful AI. I think it’s clearly possible to design an agent that doesn’t have the “absorbs terminal values” property, do you agree?
I do! Although my expectation is that for LLMs and similar approaches, they’ll be tangled like they are in humans (this would actually be a really interesting line of research — after training a particular value into an LLM, is there some set of facts that could be provided which would result in that value changing?), so we may not get that desirable property in practice :(
Yeah! I see this as a different problem from the value binding problem, but just as important.
For sure — the only thing I’m trying to claim in the OP is that there exists some set of observations which, if the LLM made them, would result in the LLM being unaligned-from-our-perspective, and so it seems impossible to fully guarantee the stability of alignment.
Your split seems promising; being able to make problems increasingly unlikely given more observation and thinking would be a really nice property to have. It seems like it might be hard to create a trigger for your #2 that wouldn’t cause the AI to shut down every time it did novel research.
Nice, agreed. This is basically why I don’t see any hope in trying to align super-LLMs. (this and several similar categories of plausible failures that don’t seem avoidable without dramatically more understanding of the algorithms running on the inside).
I do! Although my expectation is that for LLMs and similar approaches, they’ll be tangled like they are in humans (this would actually be a really interesting line of research — after training a particular value into an LLM, is there some set of facts that could be provided which would result in that value changing?), so we may not get that desirable property in practice :(
For sure — the only thing I’m trying to claim in the OP is that there exists some set of observations which, if the LLM made them, would result in the LLM being unaligned-from-our-perspective, and so it seems impossible to fully guarantee the stability of alignment.
Your split seems promising; being able to make problems increasingly unlikely given more observation and thinking would be a really nice property to have. It seems like it might be hard to create a trigger for your #2 that wouldn’t cause the AI to shut down every time it did novel research.
Nice, agreed. This is basically why I don’t see any hope in trying to align super-LLMs. (this and several similar categories of plausible failures that don’t seem avoidable without dramatically more understanding of the algorithms running on the inside).