LLMs compute the probability of a sequence, but the truth/good distinction is captured by a two-dimensional Jeffrey-Bolker measure (I'm calling its components "probability" and "shouldness"; their ratio is the expected utility of an event). Shouldness can be reconstructed from probability and expected utility as their product, so plausibly it behaves on long sequences similarly to probability: it generally gets lower for longer sequences, but tends to be higher for simpler ones.
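A toy numeric sketch of the decomposition (all values invented for illustration): shouldness factors are probability factors times expected-utility factors, so sequence shouldness shrinks with length much like sequence probability does, while their ratio recovers expected utility.

```python
import math

# Invented per-token factors for a 3-token sequence.
token_probs = [0.5, 0.25, 0.5]   # probability factors
token_eu = [1.2, 0.9, 1.5]       # expected-utility factors (ratio S/P)

# Shouldness is probability times expected utility, factor by factor.
token_should = [p * u for p, u in zip(token_probs, token_eu)]

prob = math.prod(token_probs)    # sequence probability
should = math.prod(token_should) # sequence shouldness
eu = should / prob               # expected utility recovered as the ratio

# Both prob and should decay as the sequence grows (each factor is on
# the order of a probability), but eu, their ratio, need not.
```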
The analogy between probability and shouldness suggests that some form of pretraining might be able to create models for either of them (as opposed to a base model that learns something in between from raw data, with no supervision from preference data). Expected utility is then the ratio: instead of looking at the logits of one LLM, we look at differences of logits between two LLMs, a shouldness-LLM and a probability-LLM (with some regularization that anchors to a base model instead of Goodharting toward high-approximate-expected-utility, low-probability sequences). Possibly this needs interspersing preference training with pretraining, rather than applying preference training only during post-training, so that the two pretrained models nurture different collections of circuits (for probability and for shouldness).
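A minimal sketch of the decoding-time arithmetic (the regularization weight and the three-model setup are my assumptions; random vectors stand in for real LLM logits):

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 8

# Stand-ins for next-token logits from three pretrained models; in
# practice these would come from forward passes of actual LLMs.
logits_should = rng.normal(size=VOCAB)  # shouldness-LLM
logits_prob = rng.normal(size=VOCAB)    # probability-LLM
logits_base = rng.normal(size=VOCAB)    # base model, used as anchor

def log_softmax(z):
    z = z - z.max()
    return z - np.log(np.exp(z).sum())

# Log expected utility per token = log shouldness - log probability,
# i.e. a difference of the two models' log-probabilities.
log_eu = log_softmax(logits_should) - log_softmax(logits_prob)

# Sampling from raw expected utility would Goodhart toward high-EU,
# low-probability tokens, so anchor to the base model instead:
beta = 0.5  # hypothetical regularization strength
scores = log_softmax(logits_base) + beta * log_eu
next_token_dist = np.exp(log_softmax(scores))
```

With beta = 0 this reduces to sampling from the base model; larger beta tilts the distribution toward tokens the shouldness-LLM rates higher than the probability-LLM does.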
(Some kind of Solomonoff induction analogy for probability/shouldness should be a clearer thing to express, and might be more relevant in a decision theory context. There you start with description lengths of programs in two different languages, a language of probability-programs and another of shouldness-programs, and then convert these into probability and shouldness distributions over sequences, enabling both probability induction and shouldness induction for the next element of a sequence. Solomonoff induction ignores distinctions between languages in the limit, but this kind of probability/shouldness induction works with pairs of languages, and the distinction between the two languages in a given pair is the most important thing, as it defines expected utility.)
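One way the two-language construction could be written down (the notation is mine, not standard): with $U_P$ and $U_S$ universal machines for the probability-language and the shouldness-language,

```latex
M_P(x) = \sum_{p \,:\, U_P(p) = x*} 2^{-|p|},
\qquad
M_S(x) = \sum_{q \,:\, U_S(q) = x*} 2^{-|q|},
\qquad
\mathrm{EU}(x) = \frac{M_S(x)}{M_P(x)},
```

with induction for the next element given by $M_P(a \mid x) = M_P(xa)/M_P(x)$ and likewise for $M_S$. The invariance theorem says swapping $U_P$ alone changes $M_P$ only up to a multiplicative constant, which is why Solomonoff induction ignores the choice of language in the limit; but $\mathrm{EU}$ depends on the pair $(U_P, U_S)$, which is exactly the distinction that matters here.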