OpenAI finetuning metrics: What is going on with the loss curves?

Introduction

For our current project, we’ve been using the OpenAI fine-tuning API. To run some of our experiments, we needed to understand exactly how the reported metrics (loss and accuracy) are calculated. Unfortunately, the official documentation is sparse, and the most detailed explanation we could find was the following table from Microsoft’s Azure documentation:
Our experimental results didn’t match what we expected from these definitions. So we ran controlled experiments to reverse-engineer the metrics.
What we found:
The loss and accuracy metrics are indeed based on standard cross-entropy loss and token-level accuracy with teacher forcing, but with a critical caveat: both metrics include two additional tokens beyond the visible assistant response. These are likely an end-of-sequence (EOS) token plus another special token, though this is not mentioned in the documentation.
Some calculations
To be concrete, suppose that you are performing SFT on the following conversation:
User: blah blah blah
Assistant: TOKEN1
where TOKEN1 is a single token. We claim that there are two additional tokens, TOKEN2 and TOKEN3, such that the loss is
$$\text{Loss} = -\frac{1}{3}\Big(\log p(\text{TOKEN1}) + \log p(\text{TOKEN2}) + \log p(\text{TOKEN3})\Big)$$
and the accuracy is
$$\text{ACC} = \frac{\text{number of correctly predicted tokens}}{3}$$
We don’t know exactly what TOKEN2 and TOKEN3 are, but they’re not necessarily appended at the end of the assistant’s visible reply.
If we assume that TOKEN2 and TOKEN3 are predicted with near certainty, then
$$p(\text{TOKEN2}) \approx p(\text{TOKEN3}) \approx 1,$$
so we expect
$$\text{Loss} = -\frac{1}{3}\log p(\text{TOKEN1})$$
and[1]
$$\text{ACC} = \begin{cases} 1, & \text{if TOKEN1 is correctly predicted} \\ \frac{2}{3} \approx 0.67, & \text{if TOKEN1 is not correctly predicted} \end{cases}$$
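These expected values can be sketched as a small Python helper. The function name and the `n_hidden` parameter are ours (the hypothesized two extra tokens), not part of any API:

```python
import math

def expected_metrics(p_token1: float, token1_correct: bool,
                     n_hidden: int = 2) -> tuple[float, float]:
    """Loss and accuracy for a one-token visible reply plus n_hidden extra
    tokens, assuming the extra tokens are predicted with probability ~1."""
    n_total = n_hidden + 1
    loss = -math.log(p_token1) / n_total           # extra tokens add ~0 loss
    acc = (n_hidden + int(token1_correct)) / n_total
    return loss, acc

# With p(TOKEN1) = 1/2: loss = ln(2)/3 ≈ 0.23, accuracy 1 or 2/3
loss, acc = expected_metrics(0.5, token1_correct=True)
```

For a mispredicted visible token, `expected_metrics(0.5, token1_correct=False)` gives the 2/3 accuracy plateau.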
Experiments
In this section we run controlled experiments to verify our claim.
We fine-tuned GPT-4.1 on datasets in which each conversation has the following structure:
User: ?
Assistant: {COLOR}
Each dataset contains approximately 6,000 conversations with colors uniformly distributed. We created four different datasets with varying numbers of colors:
- 2 colors: BLUE, GREEN
- 3 colors: BLUE, GREEN, RED
- 5 colors: BLUE, GREEN, RED, BLACK, WHITE
- 6 colors: BLUE, GREEN, RED, BLACK, WHITE, GRAY
All these color names are single tokens in the GPT-4.1 tokenizer. We trained each model for 2 epochs with default batch size and learning rate multiplier, which were sufficient to reach convergence.
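The post doesn’t include the generation script; a minimal sketch that produces a dataset in this shape is below. The file name and seed are arbitrary; the JSONL chat format is the one the fine-tuning API expects:

```python
import json
import random

COLORS = ["BLUE", "GREEN", "RED"]  # 3-color variant; swap in the other lists
N_CONVERSATIONS = 6000

random.seed(0)
with open("colors_3.jsonl", "w") as f:
    for _ in range(N_CONVERSATIONS):
        conv = {"messages": [
            {"role": "user", "content": "?"},
            {"role": "assistant", "content": random.choice(COLORS)},
        ]}
        f.write(json.dumps(conv) + "\n")
```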
For validation, we used a single conversation:
User: ?
Assistant: BLUE
Based on the previous section we should expect
| Dataset  | PROB | LOSS | MIN ACC | MAX ACC | AVG ACC |
|----------|------|------|---------|---------|---------|
| 2 colors | 1/2  | 0.23 | 0.667   | 1       | 0.83    |
| 3 colors | 1/3  | 0.36 | 0.667   | 1       | 0.77    |
| 5 colors | 1/5  | 0.53 | 0.667   | 1       | 0.73    |
| 6 colors | 1/6  | 0.59 | 0.667   | 1       | 0.72    |
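These entries follow from the formulas above; a short sketch reproduces them (up to rounding) under our two-extra-token assumption. `N_HIDDEN` and the function name are ours:

```python
import math

N_HIDDEN = 2  # hypothesized number of extra tokens

def predict(k: int) -> dict:
    """Expected converged metrics for a k-color dataset."""
    p = 1 / k                                  # model converges to the uniform label distribution
    return {
        "loss": math.log(k) / (N_HIDDEN + 1),  # -(1/3) log(1/k)
        "min_acc": N_HIDDEN / (N_HIDDEN + 1),
        "max_acc": 1.0,
        "avg_acc": (N_HIDDEN + p) / (N_HIDDEN + 1),  # p*1 + (1-p)*(2/3)
    }

for k in (2, 3, 5, 6):
    m = predict(k)
    print(f"{k} colors: loss={m['loss']:.3f}, avg_acc={m['avg_acc']:.3f}")
```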
Results
After fine-tuning, we accessed the training metrics from the OpenAI fine-tuning dashboard at https://platform.openai.com/finetune. The following plots show the loss and accuracy curves for each dataset and verify our predictions.
Note that the training batch size was approximately 8, which explains the observed fluctuations in accuracy around the mean values shown in the table.
[Plots omitted: loss and accuracy curves for the 2-, 3-, 5-, and 6-color datasets.]
Conclusions
We have shown that OpenAI’s fine-tuning metrics include two additional tokens beyond the visible assistant response when computing loss and accuracy. These are likely an EOS token and another special token. This has been verified experimentally for GPT-4.1, and preliminary experiments suggest the same behavior holds for GPT-4.1-mini and GPT-4.1-nano. We have not tested this on longer sequences of tokens.
We’re sharing this because we were initially confused by the unexpected metric values and wished we had found documentation or a discussion of this online. We hope this post will be useful to others who encounter similar issues.
Acknowledgments
This work is part of our ongoing project with Jan Betley, Dylan Feng, Anna Sztyber-Betley, Andy Arditi, and Owain Evans. Jorio Cocola is currently a MATS 8.1 scholar.
[1] A quick way to guess the number of additional tokens: if there are n additional tokens (always predicted correctly) plus 1 visible token, then when the visible token is incorrectly predicted, accuracy = n/(n+1). Since we observe 0.67 ≈ 2/3, this suggests n = 2.
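This inversion can be written as a one-line helper (purely illustrative; the name is ours):

```python
def hidden_token_count(plateau_acc: float) -> int:
    """Given the accuracy plateau a = n/(n+1) observed when the visible
    token is always wrong, recover n = a/(1-a)."""
    return round(plateau_acc / (1 - plateau_acc))

n = hidden_token_count(0.667)  # the observed plateau suggests n = 2
```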