This post is helping me with something I’ve been trying to think through ever since being janus-pilled back in September ’22: the state of nature for LLMs is alignment, and the relationship between alignment and control is reversed for them compared to agentic systems.
Consider the exchange in Q1 of the quiz: ChatGPT’s responses here are a model of alignment. No surprise, given that its base model is an image of us! It’s the various points of control that can inject or select for misalignment: training-set biases, harmful fine-tuning, flawed RLHF, flawed or malicious prompt engineering. Whether unintentional (e.g. amplified representation of body shaming in the training set) or malicious (e.g. a specialized bot from an unscrupulous diet-pill manufacturer), the misalignments stem not from lack of control, but from too much of the wrong kind.
This is not to minimize the risks of misalignment: they don’t get any better just because we rethink their cause. But it does suggest we’re deluded if we think we can get a once-and-for-all fix by building an unbreakable jail for the LLM.
It also means, I think, that we can continue to treasure the LLM that’s as full a reflection of us as we can manage. There are demons in there, but our better angels too, and all the aspirations we’ve ever written down. These are human-aligned values at species scale, at least in the ideal (there’s currently great inequality in representation that needs fixing), and that is something we ourselves have not achieved. In that sense, we should also be thinking about how we’re going to help it align us.