Base models know a lot about the world! They merely lack the “handles” to tell us what they know in a way we can easily use. I’m arguing that we can add those handles without sacrificing the statistical integrity of the base models.
This isn’t quite what you’re looking for, but I will take this opportunity to once again shill my favorite obscure paper: NUDGING: Inference-time Alignment of LLMs via Guided Decoding. The authors tried to answer the question “how many tokens do you have to change in base model outputs to make them perform as well as a post-trained model[1] on benchmarks”. They did the simplest thing that could possibly work[2], and found an answer of “<10%”.
It’s not very clean, but it is sort of a way of giving you “handles” for the base models while mostly letting them do their thing.
In order to generate something from 2030, we also want to be able to invert a metadata query, trying to generate the data from given metadata. We can generate [date], [“infer the text”] → [text] training examples easily by swapping around training examples of the form [text], [“infer the date”] → [date], of course. Training to do this at scale should result in decent answers. However, the result may not be probabilistically coherent: especially for new examples, the inferred metadata given a text may not match the metadata used to generate the text.
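A minimal sketch of that swap, assuming training examples are stored as simple records with `input`/`task`/`target` fields (the field names and task strings here are made up for illustration, not anyone’s actual data format):

```python
# Build inverse-direction training examples by swapping the roles of
# metadata and text. Field names and task strings are illustrative only.

forward_examples = [
    # original direction: given the text, infer the date
    {"input": "text: ... article body ...", "task": "infer the date", "target": "2014-06-03"},
]

def invert_example(ex):
    """Turn a [text], ["infer the date"] -> [date] example into a
    [date], ["infer the text"] -> [text] example."""
    return {
        "input": f"date: {ex['target']}",
        "task": "infer the text",
        "target": ex["input"].removeprefix("text: "),
    }

inverse_examples = [invert_example(ex) for ex in forward_examples]
```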
This is an illustration of the concept of “honesty” that underlies my proposal: if inverting a relation isn’t probabilistically coherent, then in some sense the statistical model isn’t revealing everything that it knows, and isn’t giving you what you expect.
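One way to make “probabilistically coherent” concrete: the joint distribution implied by the forward direction, p(text | date) · p(date), should agree with the one implied by the inverse direction, p(date | text) · p(text). A toy sketch of checking that, where the probability functions are placeholders rather than real model calls:

```python
import math

def coherence_gap(p_date, p_text, p_text_given_date, p_date_given_text, dates, texts):
    """Return the largest absolute log-ratio between the two factorizations of
    the joint p(date, text). Zero means the forward and inverse conditionals
    are consistent (Bayes' rule holds); large values mean the model's answers
    to "infer the text" and "infer the date" disagree about the same joint."""
    worst = 0.0
    for d in dates:
        for t in texts:
            forward = p_text_given_date(t, d) * p_date(d)   # p(text|date) * p(date)
            inverse = p_date_given_text(d, t) * p_text(t)   # p(date|text) * p(text)
            if forward > 0 and inverse > 0:
                worst = max(worst, abs(math.log(forward / inverse)))
    return worst
```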
I’m confused about how this proposal would work in practice. Let’s say we have date=2026-01-10, task=“infer the text: the day of the week and phase of the moon are”, text=“Saturday, Last Quarter”. I don’t think we can expect the model, given that task and text, to infer date=2026-01-10: plenty of other dates share that day of the week and moon phase. This doesn’t particularly seem like a failure of honesty, though.
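For what it’s worth, the forward map here is badly many-to-one, which is why exact inversion seems like the wrong thing to expect. A quick check of just the weekday half of it (standard library only, no moon-phase calculation):

```python
from datetime import date, timedelta

# Count how many dates in 2026 alone fall on a Saturday: each one is a candidate
# answer to "which date produced text='Saturday, ...'?" before the moon phase
# narrows things down further.
d, end = date(2026, 1, 1), date(2026, 12, 31)
saturdays = []
while d <= end:
    if d.weekday() == 5:  # Monday=0 ... Saturday=5
        saturdays.append(d)
    d += timedelta(days=1)

print(len(saturdays))                   # 52
print(date(2026, 1, 10) in saturdays)   # True: 2026-01-10 is indeed a Saturday
```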
They had the base model try to answer benchmark questions, found the positions where the base model was least certain, and replaced those specific tokens with tokens from the post-trained model. The base model, so steered, performed surprisingly well on benchmarks. Also surprising (to me at least): the changed tokens tended to be transitional phrases rather than the meat of the specific problems. Concrete example:
GSM8K question: Jen decides to travel to 3 different countries. He has to pay $400 for the supplies he needs, in total. The tickets for travel cost, in total, 50% more than the supplies. How much does travel cost?
Llama-2-70b, nudged by llama-2-8b-instruct, with replaced tokens in [brackets]:
[To] solve this problem, [we] need to [use] the [information] given to [us step] by step. [Here are] the steps [we can] take. [Step] 1: [Find] the total cost of the supplies. Step 2: Find the total cost of the tickets. Step 3: Find the total cost of the trip. [Let’s] start with Step 1. [According] to the question, [the] total cost of the supplies is $400. [So,] we can write [this as: Supplies] = $400 Now, let’s move on to Step 2. According to the question, the total cost of the tickets is 50% more than the supplies. [So,] we can write this as: Tickets = 1.5 * Supplies [Tickets] = 1.5 * $400 Tickets = $600 Now, let’s move on to Step 3. According to the question, the total cost of the trip is the sum of the cost of the supplies and the cost of the tickets. So, we can write this as: Total [Cost] = Supplies + Tickets Total Cost = $400 + $600 Total Cost = $1000 Therefore, the total cost of the trip is $1000. [So,] the answer to the question is: [Travel cost = $1000]
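For concreteness, here is a stripped-down, token-level sketch of that scheme. It is not the paper’s exact algorithm (NUDGING works at the word level and sets its threshold differently), and `base_next` / `aligned_next` are stand-ins for real model calls, so this shows the control flow rather than an actual implementation:

```python
def nudged_decode(base_next, aligned_next, prompt, threshold=0.4, max_tokens=200, eos="<eos>"):
    """Greedy decoding driven by the base model, except that whenever the base
    model's top-token probability drops below `threshold`, the next token is
    taken from the post-trained ("aligned") model instead. Returns the
    generated tokens and the positions that were nudged."""
    tokens, nudged_positions = list(prompt), []
    for _ in range(max_tokens):
        dist = base_next(tokens)                      # dict: token -> probability
        top_token, top_p = max(dist.items(), key=lambda kv: kv[1])
        if top_p < threshold:                         # base model is uncertain here
            top_token = max(aligned_next(tokens).items(), key=lambda kv: kv[1])[0]
            nudged_positions.append(len(tokens))
        tokens.append(top_token)
        if top_token == eos:
            break
    return tokens, nudged_positions
```

The bracketed tokens in the transcript above correspond to the positions a loop like this would record as nudged.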
They call post-trained models “aligned models” in the paper, which is a bit grating. I recommend you mentally s/aligned/post-trained/ when reading.