OpenAI fine-tuning metrics: What is going on with the loss curves?

We needed to understand how OpenAI reports loss and accuracy during fine-tuning, because the dashboard numbers did not match the standard definitions. The official documentation is sparse, so we ran controlled experiments to reverse-engineer the metrics.

We show that loss and accuracy are computed over one visible assistant token plus two hidden tokens (probably EOS plus another special token) per example. Each example therefore contributes three tokens to the metrics: the two hidden tokens are always scored as correct and contribute essentially zero loss, so accuracy bottoms out around 2/3 when the visible token is wrong, and the reported loss is the visible token's cross-entropy scaled by 1/3.
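To make the arithmetic concrete, here is a minimal sketch of that hypothesis (not OpenAI's actual code); `p_visible` and `visible_correct` are our own names for the model's probability on the visible assistant token and whether its argmax prediction matches it, and the two hidden tokens are assumed to be predicted correctly with near-certainty.

```python
import math

def dashboard_metrics(p_visible: float, visible_correct: bool) -> tuple[float, float]:
    """Hypothesized per-example metrics: one visible assistant token
    plus two hidden special tokens the model always gets right."""
    token_losses = [-math.log(p_visible), 0.0, 0.0]    # hidden tokens contribute ~0 loss
    token_correct = [visible_correct, True, True]       # hidden tokens always count as correct

    loss = sum(token_losses) / len(token_losses)        # = (1/3) * -log p(visible token)
    accuracy = sum(token_correct) / len(token_correct)  # = 2/3 if visible token wrong, 1 if right
    return loss, accuracy

# Example: the model puts probability 0.25 on the target color
# but its argmax prediction is a different token.
print(dashboard_metrics(0.25, visible_correct=False))   # -> (~0.462, ~0.667)
```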

To verify this pattern, we fine-tuned GPT-4.1 on synthetic datasets where the assistant replies with one of 2, 3, 5, or 6 possible colors (all single-token strings). The metrics matched the theory: the loss plateaus at \(-\frac{1}{3}\log p(\text{color})\), and accuracy jumps between 0.67 and 1 depending on whether the predicted color matches. The resulting plots also explain the spikes seen in the OpenAI dashboard.
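For reference, here is a sketch of how such a synthetic dataset could be built in the chat fine-tuning JSONL format, assuming the color is sampled uniformly at random for an identical prompt (so \(p(\text{color})\) can do no better than \(1/k\) for \(k\) colors); the file names, prompt wording, and specific color words below are our own illustrative choices, not necessarily those used in the experiments.

```python
import json
import math
import random

# Assumed to be single-token strings for the model's tokenizer; verify before use.
COLORS = ["red", "blue", "green", "yellow", "white", "black"]

def make_dataset(path: str, n_colors: int, n_examples: int = 200) -> None:
    """Write a chat-format JSONL file where the assistant answers an identical
    prompt with a uniformly random color, so p(color) converges to 1/n_colors."""
    with open(path, "w") as f:
        for _ in range(n_examples):
            example = {
                "messages": [
                    {"role": "user", "content": "Name a color."},
                    {"role": "assistant", "content": random.choice(COLORS[:n_colors])},
                ]
            }
            f.write(json.dumps(example) + "\n")

for k in (2, 3, 5, 6):
    make_dataset(f"colors_{k}.jsonl", n_colors=k)
    # Predicted plateau under the three-token hypothesis:
    #   loss     -> (1/3) * log(k), since p(color) -> 1/k
    #   accuracy -> 2/3 when the predicted color is wrong, 1 when it happens to match
    print(f"k={k}: predicted loss plateau ~ {math.log(k) / 3:.3f}")
```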

We hope this write-up helps others interpret OpenAI’s fine-tuning curves without guessing which hidden tokens are being counted. The full thread includes the formulas, experiments, plots, and code that reproduce the color studies.