J Bostock comments on Jemist’s Shortform

J Bostock 27 May 2026 22:35 UTC
2 points
LLMs absolutely do have coherent preferences over concepts. If you ask them “do you feel more positively about X, or Y?” for a bunch of pairs X and Y, the resulting stated preferences are highly coherent: they contain basically no cycles (X over Y, Y over Z, Z over X), and they are well described by a single map from each X → sentiment(X).
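To make "coherent" concrete, here is a rough sketch of the kind of check I have in mind. It is not anyone's actual eval code: the `STATED` table is made-up toy data standing in for real model answers, the function names are hypothetical, and win counts are just one simple stand-in for sentiment(X).

```python
from itertools import combinations

# Hypothetical toy data: (x, y) -> True means the model said it feels
# more positively about x than about y. Real data would come from
# actually querying a model; these values are invented for illustration.
STATED = {
    ("kindness", "rain"): True,
    ("kindness", "spam"): True,
    ("rain", "spam"): True,
}
ITEMS = ["kindness", "rain", "spam"]


def prefers(x, y):
    """Look up the stated preference for a pair, in either order."""
    if (x, y) in STATED:
        return STATED[(x, y)]
    return not STATED[(y, x)]


def count_cyclic_triples(items, prefers_fn):
    """Count triples whose pairwise preferences form a cycle."""
    cycles = 0
    for a, b, c in combinations(items, 3):
        forward = prefers_fn(a, b) and prefers_fn(b, c) and prefers_fn(c, a)
        backward = prefers_fn(a, c) and prefers_fn(c, b) and prefers_fn(b, a)
        if forward or backward:
            cycles += 1
    return cycles


def win_count_scores(items, prefers_fn):
    """Assumed stand-in for sentiment(X): how many other items X beats."""
    return {x: sum(prefers_fn(x, y) for y in items if y != x) for x in items}


def fraction_explained(items, prefers_fn, scores):
    """Fraction of pairs where the higher-scored item is the preferred one."""
    pairs = list(combinations(items, 2))
    agree = sum((scores[a] > scores[b]) == prefers_fn(a, b) for a, b in pairs)
    return agree / len(pairs)


scores = win_count_scores(ITEMS, prefers)
print("cyclic triples:", count_cyclic_triples(ITEMS, prefers))
print("pairs explained by scalar score:", fraction_explained(ITEMS, prefers, scores))
```

On real data you would want many more items, both orderings of each pair to control for position bias, and probably a Bradley-Terry fit rather than raw win counts. The claim above amounts to: the cycle count stays near zero and the scalar score explains nearly all pairs.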
Also, Anthropic did some interesting work in this area: if you use synthetic document fine-tuning (SDFT) to make a model believe that reward models have a set of a few dozen biases, then use RL to train that model to cater to some of those biases, it will generalize and cater to the other reward model biases too, even though those are fictitious and were never reinforced. So there’s clearly some kind of motivational structure at work, and it was transmitted through the model’s belief structure.