True, and then it wouldn’t be an example of the scaling of diffusion models, but of distillation from a scaled-up autoregressive LLM.
Deleted tweet. Why were they sceptical? And does anyone know if there were follow-up antibody tests? I can’t find them.
I also haven’t seen this mentioned anywhere.
I think most commercial frontier models that offer logprobs take some precautions against distillation. Some logprobs seem to have a noise vector attached (DeepSeek?), some providers like Grok only offer the top 8 rather than the top 20, and others don’t offer them at all.
It’s a shame, as logprobs can be a really information-rich and token-efficient way to do evals, ranking, and judging.
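For example, here’s a rough sketch of a single-token judge that reads the choice probabilities straight from the top logprobs, assuming an OpenAI-compatible endpoint that exposes top_logprobs (the model name and prompt are placeholders):

```python
import math
from openai import OpenAI  # assumes an OpenAI-compatible endpoint

client = OpenAI()

def choice_probs(prompt: str, choices=("A", "B")) -> dict:
    """Ask for a one-token answer and read each choice's probability from the
    top logprobs, instead of sampling the same question many times."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",   # placeholder model
        messages=[{"role": "user", "content": prompt}],
        max_tokens=1,
        logprobs=True,
        top_logprobs=20,       # some providers cap this lower (e.g. 8)
    )
    top = resp.choices[0].logprobs.content[0].top_logprobs
    return {c: sum(math.exp(t.logprob) for t in top if t.token.strip() == c)
            for c in choices}

# e.g. choice_probs("Which summary is better, A or B? ... Answer with a single letter.")
```

One call gives you a score for every option, instead of burning tokens on repeated sampling.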
Has anyone managed to replicate COCONUT? I’ve been trying to experiment with adding explainability through sparse linear bottlenecks, but as far as I can tell, no one has replicated it.
I wondered: what are O3 and O4-mini? Here’s my guess at the test-time scaling loop and how OpenAI names their models:
O0 (Base model)
↓
D1 (Outputs/labels generated with extended compute: search/reasoning/verification)
↓
O1 (Model trained on higher-quality D1 outputs)
↓
O1-mini (Distilled version - smaller, faster)
↓
D2 (Outputs/labels generated with extended compute: search/reasoning/verification)
↓
O2 (Model trained on higher-quality D2 outputs)
↓
O2-mini (Distilled version - smaller, faster)
↓
...
The point is consistently applying additional compute at generation time to create better training data for each subsequent iteration. And the models go from large -(distil)-> small -(search)-> large
I also found it interesting that you censored the self_attn using gradients. This implicitly assumes that:
concepts are best represented in the self-attention
they are non-linear (meaning you need gradient-based rather than linear methods).
Am I right about your assumptions, and if so, why do you think this?
I’ve been doing some experiments to try to work this out: https://github.com/wassname/eliciting_suppressed_knowledge
but haven’t found anything conclusive yet.
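For context, by “linear methods” I mean something like a linear probe on cached activations; here’s a minimal sketch with scikit-learn, where acts.npy and labels.npy are hypothetical dumps of self-attention outputs and concept labels:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# acts: (n_examples, d_model) self-attention outputs at some layer/position
# labels: (n_examples,) binary concept labels
acts, labels = np.load("acts.npy"), np.load("labels.npy")  # hypothetical files

X_tr, X_te, y_tr, y_te = train_test_split(acts, labels, test_size=0.2, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# High held-out accuracy would suggest the concept is linearly represented there,
# in which case linear censoring might suffice and gradients wouldn't be needed.
print("probe accuracy:", probe.score(X_te, y_te))
```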
We are simply tuning the model to have similar activations for these very short, context-free snippets. The characterization of the training you made with pair (A) or (B) is not what we do, and we would agree that, if that were what we were doing, this whole thing would be much less meaningful.
This is great. 2 suggestions:
Call it ablation, erasure, concept censoring or similar, not fine-tuning. That way you don’t bury the lead. It also took me a long time to realise that this is what you were doing.
Maybe consider other ways to erase the separation of self and other. There are other erasure techniques; they are sharper scalpels, so you can wield them with more force. For example LEACE: train a linear classifier to predict A or B, then erase the directions in the activations that carried that predictive power.
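A minimal sketch of that second suggestion, assuming EleutherAI’s concept-erasure package (the activations and labels below are random placeholders; in practice X would be cached activations and z the self/other labels):

```python
import torch
from concept_erasure import LeaceEraser  # pip install concept-erasure

# X: (n, d) activations from the layer you want to censor
# z: (n,) binary labels, e.g. 0 = "self" prompt, 1 = "other" prompt
X = torch.randn(1000, 768)        # placeholder activations
z = torch.randint(0, 2, (1000,))  # placeholder labels

eraser = LeaceEraser.fit(X, z)    # closed-form least-squares concept erasure
X_erased = eraser(X)              # activations with the self/other direction removed

# A linear probe trained on X_erased should now be at chance on self vs other.
```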
Very interesting!
Could you release the models and code and evals please? I’d like to test it on a moral/ethics benchmark I’m working on. I’d also like to get ideas from your evals.
I’m imagining a scenario where an AI extrapolates “keep the voting shareholders happy” and “maximise shareholder value”.
Voting stocks can also become valuable when people try to accumulate them to corner the market and execute a takeover; this happens in cryptocurrencies like CURVE.
I know these are far-fetched, but all future scenarios are. The premium on Google voting stock is very small right now, so it’s a cheap feature to add.
I would say: don’t ignore the feeling. Calibrate it and train it, until it’s worth listening to.
There’s a good book about this: “Sizing People Up”.
What you might do is impose a curriculum:
In FBAI’s COCONUT they use a curriculum to teach the model to think shorter and differently, and it works. They are teaching it to think using fewer steps, compressing them into latent vectors instead of tokens:
first it thinks with tokens
then they replace one thinking step with a latent <thought> token
then 2
...
It’s not RL, but what is RL any more? It’s becoming blurry. They don’t reward or punish it for anything in the thought tokens, so it learns whatever latent thoughts are helpful for outputting the correct answer.
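A toy sketch of the data side of that curriculum, as I understand it (this is my paraphrase, not their code; “<thought>” just stands in for their latent thought slot):

```python
LATENT = "<thought>"  # placeholder for a continuous latent thought slot

def make_curriculum_example(question, cot_steps, answer, stage):
    """At stage k, the first k chain-of-thought steps are replaced with latent
    thought slots; the remaining steps stay as ordinary text tokens."""
    latent_slots = [LATENT] * min(stage, len(cot_steps))
    kept_steps = cot_steps[stage:]
    return " ".join([question, *latent_slots, *kept_steps, "=>", answer])

# stage 0: full textual CoT; stage 1: first step latent; stage 2: two steps; ...
print(make_curriculum_example("2+3*4?", ["3*4=12", "2+12=14"], "14", stage=1))
# -> "2+3*4? <thought> 2+12=14 => 14"
```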
There’s another relevant paper, “Compressed Chain of Thought: Efficient Reasoning through Dense Representations”, which uses teacher forcing, although I haven’t read the whole thing yet.
It doesn’t make sense to me either, but it does seem to invalidate the “bootstrapping” results for the other 3 models. Maybe it’s because they could batch all reward model requests into one instance.
When MS doesn’t have enough compute to do their evals, the rest of us may struggle!
Well, we don’t know the sizes of the models, but I do get what you are saying and agree. Distillation usually means big to small, but here it means expensive to cheap (because test-time compute is expensive, and they are training a model to cheaply skip the search process and just predict its result).
In RL, iirc, they call it “Policy distillation”. And similarly “Imitation learning” or “behavioral cloning” in some problem setups. Perhaps those would be more accurate.
I think maybe the most relevant chart from the Jones paper gwern cites is this one:
Oh interesting. I guess you mean because it shows the gains of TTC vs model size? So you can imagine the bootstrapping from TTC → model size → TTC → and so on?
I agree that you can do this in a supervised way (a human puts in the right answer). Is that what you mean?
I’m not 100% sure, but you could have a look at Math-Shepherd for an example. I haven’t read the whole thing yet, but I imagine it works back from a known solution.
“Likely to be critical to a correct answer” according to whom?
Check out the linked rStar-Math paper; it explains and demonstrates it better than I can (caveat: they initially distil from a much larger model, which I see as a bit of a cheat). tl;dr: yes, a model, plus a tree of possible solutions. Given a tree with values on the leaves, they can look at which nodes seem to have causal power.
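A toy sketch of that value-assignment idea (not their actual algorithm, which iirc uses MCTS and a trained process reward model): score each intermediate step by how often rollouts through it end in a verified-correct answer:

```python
from collections import defaultdict

def step_values(rollouts):
    """rollouts: list of (steps, correct) pairs, where steps is the sequence of
    intermediate reasoning steps in one rollout and correct is whether the final
    answer checked out. Steps that mostly appear on correct paths get high value."""
    hits, totals = defaultdict(int), defaultdict(int)
    for steps, correct in rollouts:
        for step in steps:
            totals[step] += 1
            hits[step] += int(correct)
    return {step: hits[step] / totals[step] for step in totals}

print(step_values([(["a", "b"], True), (["a", "c"], False)]))
# -> {'a': 0.5, 'b': 1.0, 'c': 0.0}
```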
A separate approach is to teach a model to supervise using human process-supervision data, then ask it to be the judge. This paper also cheats a little by distilling, but I think the method makes sense.
English-language math proof, it is not clear how to detect correctness,
Well, the final answer is easy to evaluate. And as in rStar-Math, you can have a reward model that checks whether each step is likely to be critical to a correct answer, and then assigns an implied value to the step.
summarizing a book
I think tasks outside math and code might be hard. But summarizing a book is actually easy: you just ask “how easy is it to reconstruct the book given the summary?”, so it’s an unsupervised compression-decompression task.
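A rough sketch of one way to score that with any HuggingFace causal LM: compare the model’s loss on the book text with and without the summary as a prefix (the model name is just a placeholder, and a real book would need chunking to fit in the context window):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # placeholder; any causal LM works
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name).eval()

@torch.no_grad()
def nll(text: str, prefix: str = "") -> float:
    """Mean negative log-likelihood of `text`, optionally conditioned on `prefix`."""
    ids = tok(prefix + text, return_tensors="pt").input_ids
    labels = ids.clone()
    n_prefix = len(tok(prefix).input_ids)
    labels[:, :n_prefix] = -100  # don't score the prefix tokens themselves
    return model(ids, labels=labels).loss.item()

def summary_score(book: str, summary: str) -> float:
    # How many nats per token does the summary save when reconstructing the book?
    return nll(book) - nll(book, prefix=summary + "\n")
```

A summary that makes the book much easier to predict gets a higher score, with no labels needed.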
Another interesting domain is “building a simulator”. This is an expensive thing to generate solutions for, but it’s easy to verify that the simulator predicts the thing you are simulating. I can see this being an expensive but valuable domain for this paradigm. It would include fusion reactors and robotics (which OAI is once again hiring for!).
When doing RL, it is usually very important to have non-gameable reward mechanisms
I don’t see them doing this explicitly yet, but setting up an independent, even adversarial, reward model would help, or at least I expect it would.
To illustrate Gwern’s idea, here is an image from Jones 2021 that shows some of these self-play training curves
There may be a sense that they’ve ‘broken out’, and have finally crossed the last threshold of criticality
And so OAI employees may internally see that they are on the steady upward slope
Perhaps constrained domains like code and math are like the curves on the left, while unconstrained domains like writing fiction are like curves to the right. Some other domains may also be reachable with current compute, like robotics. But even if you get a math/code/robotics-ASI, you can use it to build more compute, and solve the less constrained domains like persuasion/politics/poetry.
Huh, so you think o1 was the process-supervision reward model, and o3 is the policy model distilled from whatever reward model o1 became? That seems to fit.
There may be a sense that they’ve ‘broken out’, and have finally crossed the last threshold of criticality, from merely cutting-edge AI work which everyone else will replicate in a few years, to takeoff
Surely other labs will replicate this too? Even the open-source community seems close. And Silicon Valley companies often poach staff, which makes it hard to keep a trade secret. Not to mention spies.
This means that outsiders may never see the intermediate models
Doubly so if outsiders will just distil your model’s behaviour and bootstrap from your elevated starting point.
Inference-time search is a stimulant drug that juices your score immediately, but asymptotes hard. Quickly, you have to use a smarter model to improve the search itself, instead of doing more.
It’s worth pointing out that inference-time search seems to become harder as the verifier becomes less reliable, which means that the scaling curves we see for math and code might get much worse in other domains.
“we find that this is extremely sensitive to the quality of the verifier. If the verifier is slightly imperfect, in many realistic settings of a coding task, performance maxes out and actually starts to decrease after about 10 attempts.”—Inference Scaling fLaws
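To illustrate (a toy model, not the setup from that post): if the verifier occasionally gives a confidently wrong answer a high score, best-of-N accuracy rises at first, peaks, and then declines as N grows, because eventually one of those false positives outranks every correct sample.

```python
import random

def best_of_n(n, p_correct=0.3, p_fool=0.05, trials=5000):
    """Toy best-of-N against an imperfect verifier: correct samples score ~1,
    most wrong samples score ~0, but with probability p_fool a wrong sample
    fools the verifier and scores ~2. Returns the fraction of trials where the
    top-scoring sample was actually correct."""
    wins = 0
    for _ in range(trials):
        best_score, best_correct = float("-inf"), False
        for _ in range(n):
            correct = random.random() < p_correct
            if correct:
                score = 1 + random.gauss(0, 0.3)
            elif random.random() < p_fool:   # verifier false positive
                score = 2 + random.gauss(0, 0.3)
            else:
                score = random.gauss(0, 0.3)
            if score > best_score:
                best_score, best_correct = score, correct
        wins += best_correct
    return wins / trials

for n in (1, 4, 10, 30, 100):
    print(n, round(best_of_n(n), 3))  # rises, peaks, then falls as n grows
```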
But maybe the counterpoint is just: GPUs go brrrr.
Gwern and Daniel Kokotajlo have pretty notable track records at predicting AI scaling too, and they have comments in this thread.
If it’s trained from scratch, and they release details, then it’s one data point for diffusion LLM scaling. But if it’s distilled, then it’s zero points of scaling data.
Because we are not interested in scaling that comes from distilling a larger parent model: it doesn’t push the frontier, since it doesn’t help us get the next, larger parent model.
Apple also have LLM diffusion papers, with code. It seems like it might be helpful for alignment and interp because it would have a more interpretable and manipulable latent space.