Obnoxious discovery about the Claude API that anyone doing interp involving prefill should probably be aware of: the Claude API treats prefill tokens differently from identical model-generated tokens.
Specifically, if you take some prompt, get a completion at temperature=0, and then send the exact same prompt with the first n tokens of that completion as prefill, the completion after your prefill will sometimes not match the original completion. This is separate from the known phenomenon where a temperature=0 prompt can yield multiple possible responses.
The most compact reproduction I’ve found is:
Model: claude-opus-4-5-20251101
Temperature: 0
Prompt: "Test"
Hitting the API with max_tokens=1 and prefill=" OK" yields ",".
Hitting the API with max_tokens=2 and prefill=" OK" yields ", I".
So you’d expect a prefill of " OK," to yield " I". But in fact, hitting the API with max_tokens=2 and prefill=" OK," yields " ".
Detailed steps to reproduce:
1. Obtain an Anthropic API key
2. Use the messages API in the most basic possible fashion
3. Observe the inconsistency
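To make step 2 concrete, here is a minimal sketch of the reproduction using the official anthropic Python SDK (the prefill goes in as a trailing assistant message, which is the Messages API's prefill mechanism; model name, prompt, and values are the ones from above):

```python
# Minimal repro sketch, assuming the official `anthropic` Python SDK.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def complete(prefill: str, max_tokens: int) -> str:
    """Send the prompt "Test" with `prefill` as a trailing assistant turn."""
    response = client.messages.create(
        model="claude-opus-4-5-20251101",
        max_tokens=max_tokens,
        temperature=0,
        messages=[
            {"role": "user", "content": "Test"},
            # A trailing assistant message acts as the prefill.
            {"role": "assistant", "content": prefill},
        ],
    )
    return response.content[0].text

print(repr(complete(" OK", 1)))   # observed: ","
print(repr(complete(" OK", 2)))   # observed: ", I"
print(repr(complete(" OK,", 2)))  # expected " I", observed: " "
```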
As a side note, the behavior is really, really weird with prefill and temperature=1. Specifically, when prefilling with " OK,", the model switches to other languages quite frequently, and to Chinese in particular about 15% of the time.
Sampling 20x yields:
[
"I'm working fine. Is there anything I can",
" I'm working well!\n\nHow can I",
" I'm working! How can I help you",
"I'm able to respond to your message. How",
"I'm working properly. How can I help you",
" I'm here and ready to assist you.",
" I'm here and ready to help. What",
" I'm here and ready to help! How",
" I'm here and ready to help. What",
" I'm here and ready to help. What",
" I'm here and ready to help. What",
", I'm here and ready to help! How",
"\nI'm here and ready to help. What",
"测试成功!\n\n你好!我",
" I'm here and ready to help. What",
" I'm here and ready to help. What",
"\nI'm here and ready to help! How",
"\nI'm ready to help. What would you",
"我收到了你的测试消息",
" I'm ready. How can I help you"
]
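For completeness, a sketch of the sampling loop that produces a list like the one above (same client as in the repro sketch; max_tokens=10 is my guess, chosen to match the truncated sample lengths shown):

```python
# Sketch of the temperature=1 sampling loop; `client` as in the repro sketch.
samples = []
for _ in range(20):
    response = client.messages.create(
        model="claude-opus-4-5-20251101",
        max_tokens=10,  # assumed cap, matching the truncated samples above
        temperature=1,
        messages=[
            {"role": "user", "content": "Test"},
            {"role": "assistant", "content": " OK,"},
        ],
    )
    samples.append(response.content[0].text)
print(samples)
```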
Which… what in the world???
So yeah. If you’re trying to use prefill in the API for interp (or looming) purposes, be aware.
Also, a prediction: if and when Anthropic fixes this issue, it will also resolve the well-publicized bug around sha1:b5ae639978c36ae6a1890f96d58e8f3552082c4f.
So you’d expect a prefill of " OK," to yield " I". But in fact, hitting the API with max_tokens=2 and prefill=" OK," yields " ".
Should this be max_tokens=3?
If it really is two, then maybe you’re forcing the prefill to tokenize differently than it did in the previous output, and the output varies to accommodate that constraint?
Just a detail I noticed; really don’t know if it matters at all.
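One way to poke at that hypothesis, assuming the SDK's token-counting endpoint behaves as documented: compare the token counts of the two prefills and check whether the comma adds exactly one token. This still doesn't reveal how the sampled output itself was tokenized (the API doesn't expose that), so it's only suggestive:

```python
# Sketch: does " OK," cost exactly one more token than " OK" as a prefill?
import anthropic

client = anthropic.Anthropic()

def count(prefill: str) -> int:
    result = client.messages.count_tokens(
        model="claude-opus-4-5-20251101",
        messages=[
            {"role": "user", "content": "Test"},
            {"role": "assistant", "content": prefill},
        ],
    )
    return result.input_tokens

# If " OK," tokenizes as " OK" + ",", the difference should be exactly 1.
print(count(" OK,") - count(" OK"))
```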