The Fragility of Value thesis and the Orthogonality thesis both hold for this type of agent.
...
E.g. its vision for a future utopia would actually be quite bad from our perspective because it lacks some important value (such as diversity, or consent, or whatever).
I think we have enough evidence to say that, in practice, this problem turns out to be easy or moot. Values tend to cluster in LLMs (good with good and bad with bad; see the emergent misalignment results), so value fragility isn’t a hard problem.