With LLMs, the alignment hope is a combination of the LLM itself not being agentic and the simulacrum it channels being a human imitation. So eventually it’s an aligned-by-default agent running on a non-agentic substrate (like humans run on physics). This framing gets into trouble when the simulacrum abstraction leaks (which RL might mitigate), or when the substrate wakes up (which RL might cause).
And there’s more potential trouble after too much self-generated training data, which might be important for capabilities. With human-generated data increasingly diluted by LLM-generated synthetic data, a simulacrum might eventually lose its initial alignment. Its cognitive architecture doesn’t match that of the original humans, and subtle errors could compound towards an equilibrium that’s far from where human cognitive architecture anchors human culture.
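A minimal toy sketch of the compounding-errors dynamic (my illustration, not part of the original point): each generation fits a Gaussian to the previous generation's samples and regenerates its training data from the fit, with a small constant `bias` term standing in for the systematic mismatch between the imitator's cognitive architecture and the humans it imitates. The individually negligible errors accumulate across generations, and the distribution drifts away from the original human anchor.

```python
import numpy as np

rng = np.random.default_rng(0)

# The "human culture" anchor: the original data distribution.
human_mean, human_std = 0.0, 1.0
data = rng.normal(human_mean, human_std, size=1_000)

bias = 0.02  # tiny per-generation architecture mismatch (hypothetical value)

for generation in range(1, 51):
    # Imperfect imitation of the previous generation's data.
    fitted_mean = data.mean() + bias
    fitted_std = data.std()
    # The next generation trains only on self-generated (synthetic) data.
    data = rng.normal(fitted_mean, fitted_std, size=1_000)
    if generation % 10 == 0:
        print(f"gen {generation:2d}: mean={data.mean():+.2f}, std={data.std():.2f}")

# The mean walks steadily away from the human anchor of 0: no single step is
# alarming, but the process settles far from where it started.
```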