Logprobs returned by the OpenAI API are rounded.
This shouldn’t matter for most use cases. But it’s not documented, and if I had known about it yesterday, it would have saved me some time spent looking for bugs in my code to explain the weird patterns on my plots. I also couldn’t find any mention of it anywhere on the internet.
Note that o3 says this is probably because of quantization.
Specific example. Let’s say we have some prompt and the next token has the following probabilities:
These probabilities were calculated from the following logprobs, in the same order:
No clear pattern here, and they don’t look like rounded numbers. But if you subtract the highest logprob from all the logprobs on the list, you get:
And after rounding that to 6 decimal places the pattern becomes clear:
So the logprob resolution is 1⁄16.
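To make the arithmetic concrete, here’s a tiny self-contained sketch with made-up logprobs; the values below are mine, constructed to sit on a 1⁄16 grid the way the real ones did:

```python
# Hypothetical logprobs, constructed so the gaps from the highest one
# are exact multiples of 1/16 (mimicking what the API returns).
raw = [-0.287682, -0.287682 - 3/16, -0.287682 - 7/16,
       -0.287682 - 12/16, -0.287682 - 21/16]

best = max(raw)
for lp in raw:
    diff = round(lp - best, 6)  # subtract the max, round to 6 decimals
    print(f"logprob={lp:.6f}  diff={diff}  diff*16={diff * 16}")
# diff*16 comes out as whole numbers: 0, -3, -7, -12, -21.
```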
(I tried this with 4.1 and 4.1-mini via the chat completions API.)
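If you want to reproduce the check yourself, here’s a minimal sketch using the official openai Python SDK; the prompt and model name are placeholders, and it assumes OPENAI_API_KEY is set in your environment:

```python
from openai import OpenAI

client = OpenAI()

resp = client.chat.completions.create(
    model="gpt-4.1-mini",
    messages=[{"role": "user", "content": "Name a random animal."}],
    max_tokens=1,
    logprobs=True,
    top_logprobs=20,  # 20 is the maximum the API allows
)

# Top-20 candidates for the first output token, with their logprobs.
top = resp.choices[0].logprobs.content[0].top_logprobs
best = max(t.logprob for t in top)
for t in top:
    diff = round(t.logprob - best, 6)
    # diff * 16 should come out (very close to) a whole number
    print(f"{t.token!r:>12}  logprob={t.logprob:.6f}  diff*16={diff * 16:.4f}")
```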
This might be to prevent people from stealing the unembedding matrix weights.
I also haven’t seen this mentioned anywhere.
I think most commercial frontier models that offer logprobs take some precautions against distillation. Some logprobs seem to have a noise vector attached too (DeepSeek?), and some, like Grok, only offer the top 8 rather than the top 20. Others don’t offer logprobs at all.
It’s a shame, as logprobs can be a really information-rich and token-efficient way to do evals, ranking, and judging.
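For example, a judge model can return a graded score from a single output token by reading the probabilities directly, instead of sampling a verdict over and over. A hypothetical sketch (the prompt and the assumption that the verdict tokenizes as a bare "Yes"/"No" are mine):

```python
import math
from openai import OpenAI

client = OpenAI()

resp = client.chat.completions.create(
    model="gpt-4.1-mini",
    messages=[{"role": "user", "content":
               "Is this summary faithful to the article? Answer Yes or No.\n\n"
               "<article and summary here>"}],
    max_tokens=1,
    logprobs=True,
    top_logprobs=20,
)

# Convert the first output token's top-20 logprobs into probabilities
# and read off P("Yes") and P("No"). Depending on the tokenizer, the
# verdict may appear with different casing or a leading space, so real
# code should sum over variants.
probs = {t.token: math.exp(t.logprob)
         for t in resp.choices[0].logprobs.content[0].top_logprobs}
p_yes = probs.get("Yes", 0.0)
p_no = probs.get("No", 0.0)
print(f"P(Yes)={p_yes:.4f}  P(No)={p_no:.4f}")
```

One request and one generated token gives you a continuous score, at the cost of the 1⁄16 quantization described above.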