OpenAI finetuning metrics: What is going on with the loss curves?

Introduction

For our current project, we’ve been using the OpenAI fine-tuning API. To run some of our experiments, we needed to understand exactly how the reported metrics (loss and accuracy) are calculated. Unfortunately, the official documentation is sparse, and the most detailed explanation we could find was the following table from Microsoft’s Azure documentation:
Our experimental results didn’t match what we expected from these definitions. So we ran controlled experiments to reverse-engineer the metrics.
What we found:
The loss and accuracy metrics are indeed based on standard cross-entropy loss and token-level accuracy with teacher forcing, but with a critical caveat: both metrics include two additional tokens beyond the visible assistant response. These are likely an end-of-sequence (EOS) token plus another special token, though this is not mentioned in the documentation.
Some calculations
To be concrete, suppose that you are performing SFT on the following conversation:
User: blah blah blah
Assistant: TOKEN1
where TOKEN1 is a single token. We claim that there are two additional tokens, TOKEN2 and TOKEN3, such that the loss is
$$\text{Loss} = -\frac{1}{3}\Big(\log p(\text{TOKEN1}) + \log p(\text{TOKEN2}) + \log p(\text{TOKEN3})\Big)$$
and the accuracy is
$$\text{ACC} = \frac{\text{number of correctly predicted tokens}}{3}$$
We don’t know exactly what TOKEN2 and TOKEN3 are, but they’re not necessarily appended at the end of the assistant’s visible reply.
If we assume that TOKEN2 and TOKEN3 are predicted with near certainty, then
$$p(\text{TOKEN2}) \approx p(\text{TOKEN3}) \approx 1,$$
so we expect
$$\text{Loss} = -\frac{1}{3}\log p(\text{TOKEN1})$$
and[1]
$$\text{ACC} = \begin{cases} 1, & \text{if TOKEN1 is correctly predicted} \\ \frac{2}{3} \approx 0.67, & \text{if TOKEN1 is not correctly predicted} \end{cases}$$
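These expected values can be sketched as a small Python helper. The function name and the `n_hidden` parameter are ours (the hypothesized two extra tokens), not part of any API:

```python
import math

def expected_metrics(p_token1: float, token1_correct: bool,
                     n_hidden: int = 2) -> tuple[float, float]:
    """Loss and accuracy for a one-token visible reply plus n_hidden extra
    tokens, assuming the extra tokens are predicted with probability ~1."""
    n_total = n_hidden + 1
    loss = -math.log(p_token1) / n_total           # extra tokens add ~0 loss
    acc = (n_hidden + int(token1_correct)) / n_total
    return loss, acc

# With p(TOKEN1) = 1/2: loss = ln(2)/3 ≈ 0.23, accuracy 1 or 2/3
loss, acc = expected_metrics(0.5, token1_correct=True)
```

For a mispredicted visible token, `expected_metrics(0.5, token1_correct=False)` gives the 2/3 accuracy plateau.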
Experiments
In this section we run controlled experiments to verify our claim.
We fine-tuned GPT-4.1 on datasets in which each conversation has the following structure:
User: ?
Assistant: {COLOR}
Each dataset contains approximately 6,000 conversations with colors uniformly distributed. We created four different datasets with varying numbers of colors:
- 2 colors: BLUE, GREEN
- 3 colors: BLUE, GREEN, RED
- 5 colors: BLUE, GREEN, RED, BLACK, WHITE
- 6 colors: BLUE, GREEN, RED, BLACK, WHITE, GRAY
All these color names are single tokens in the GPT-4.1 tokenizer. We trained each model for 2 epochs with default batch size and learning rate multiplier, which were sufficient to reach convergence.
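The post doesn’t include the generation script; a minimal sketch that produces a dataset in this shape is below. The file name and seed are arbitrary; the JSONL chat format is the one the fine-tuning API expects:

```python
import json
import random

COLORS = ["BLUE", "GREEN", "RED"]  # 3-color variant; swap in the other lists
N_CONVERSATIONS = 6000

random.seed(0)
with open("colors_3.jsonl", "w") as f:
    for _ in range(N_CONVERSATIONS):
        conv = {"messages": [
            {"role": "user", "content": "?"},
            {"role": "assistant", "content": random.choice(COLORS)},
        ]}
        f.write(json.dumps(conv) + "\n")
```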
For validation, we used a single conversation:
User: ?
Assistant: BLUE
Based on the previous section we should expect
| Dataset  | PROB | LOSS | MIN ACC | MAX ACC | AVG ACC |
|----------|------|------|---------|---------|---------|
| 2 colors | 1/2  | 0.23 | 0.667   | 1       | 0.83    |
| 3 colors | 1/3  | 0.36 | 0.667   | 1       | 0.77    |
| 5 colors | 1/5  | 0.53 | 0.667   | 1       | 0.73    |
| 6 colors | 1/6  | 0.59 | 0.667   | 1       | 0.72    |
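These entries follow from the formulas above; a short sketch reproduces them (up to rounding) under our two-extra-token assumption. `N_HIDDEN` and the function name are ours:

```python
import math

N_HIDDEN = 2  # hypothesized number of extra tokens

def predict(k: int) -> dict:
    """Expected converged metrics for a k-color dataset."""
    p = 1 / k                                  # model converges to the uniform label distribution
    return {
        "loss": math.log(k) / (N_HIDDEN + 1),  # -(1/3) log(1/k)
        "min_acc": N_HIDDEN / (N_HIDDEN + 1),
        "max_acc": 1.0,
        "avg_acc": (N_HIDDEN + p) / (N_HIDDEN + 1),  # p*1 + (1-p)*(2/3)
    }

for k in (2, 3, 5, 6):
    m = predict(k)
    print(f"{k} colors: loss={m['loss']:.3f}, avg_acc={m['avg_acc']:.3f}")
```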
Results
After fine-tuning, we accessed the training metrics from the OpenAI fine-tuning dashboard at https://platform.openai.com/finetune. The following plots show the loss and accuracy curves for each dataset and verify our predictions.
Note that the training batch size was approximately 8, which explains the observed fluctuations in accuracy around the mean values shown in the table.
[Plots omitted: loss and accuracy curves for the 2-, 3-, 5-, and 6-color datasets.]
Conclusions
We have shown that OpenAI’s fine-tuning metrics include two additional tokens beyond the visible assistant response when computing loss and accuracy. These are likely an EOS token and another special token. This has been verified experimentally for GPT-4.1, and preliminary experiments suggest the same behavior holds for GPT-4.1-mini and GPT-4.1-nano. We have not tested this on longer sequences of tokens.
We’re sharing this because we were initially confused by the unexpected metric values and wished we had found documentation or a discussion of this online. We hope this post will be useful to others who encounter similar issues.
Acknowledgments
This work is part of our ongoing project with Jan Betley, Dylan Feng, Anna Sztyber-Betley, Andy Arditi, and Owain Evans. Jorio Cocola is currently a MATS 8.1 scholar.
[1] A quick way to guess the number of additional tokens: if there are n additional tokens (always predicted correctly) plus 1 visible token, then when the visible token is incorrectly predicted, accuracy = n/(n+1). Since we observe 0.67 ≈ 2/3, this suggests n = 2.
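This inversion can be written as a one-line helper (purely illustrative; the name is ours):

```python
def hidden_token_count(plateau_acc: float) -> int:
    """Given the accuracy plateau a = n/(n+1) observed when the visible
    token is always wrong, recover n = a/(1-a)."""
    return round(plateau_acc / (1 - plateau_acc))

n = hidden_token_count(0.667)  # the observed plateau suggests n = 2
```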