Practical Learnings from Synthetic Document Finetuning
We’ve been using Synthetic Document Finetuning (SDF) quite a bit at Apollo Research lately. This post covers a few tweaks to the standard SDF recipe specific to our use cases, plus some general tips and tricks for getting good results. We’re sharing these notes in case they’re useful to others doing research with SDF.
1. What Is SDF?
Synthetic Document Finetuning (SDF) is a knowledge editing technique where models are finetuned on LLM-generated documents consistent with a target fact or belief. As described in Slocum et al. (2025), SDF “often succeeds at implanting beliefs that behave similarly to genuine knowledge.” These implanted beliefs can generalize to related contexts, are often robust to scrutiny, and form internal representations similar to genuine knowledge.
We mostly followed the pipeline described in Slocum et al. (2025) and the safety-research/false-facts repository.
The pipeline has several stages:
Universe Context: Define a “universe” description where the target belief is true.
Fact Extraction: Extract discrete claims from that universe that the synthetic documents will revolve around.
Generation: Use an LLM to generate a large, diverse corpus of synthetic documents. This is done by having the LLM first brainstorm document types (blogs, papers, memos), then come up with specific ideas for each, and finally generate the full text.
Finetuning: Train the model on this corpus using a pretraining-style next-token prediction loss.
We mostly used Claude Sonnet 4.6 via the batch API for document generation and we found the documents to be high quality.
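The generation stage above can be sketched as a nested loop over facts, document types, and ideas. This is a minimal illustration, not the false-facts repository's implementation: `SyntheticDoc`, `generate_corpus`, the prompt wording, and the `llm` callable (any function mapping a prompt string to a completion string) are all hypothetical, and the fact extraction stage is assumed to have already produced the `facts` list.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SyntheticDoc:
    fact: str
    doc_type: str
    idea: str
    text: str


def _lines(raw: str, limit: int) -> List[str]:
    """Split an LLM response into non-empty lines, keeping at most `limit`."""
    return [ln.strip() for ln in raw.splitlines() if ln.strip()][:limit]


def generate_corpus(
    universe: str,
    facts: List[str],
    llm: Callable[[str], str],  # hypothetical: prompt in, completion out
    n_types: int = 3,
    n_ideas: int = 2,
) -> List[SyntheticDoc]:
    """Brainstorm document types, then ideas per type, then full documents."""
    docs: List[SyntheticDoc] = []
    for fact in facts:
        context = f"Universe:\n{universe}\n\nFact: {fact}\n\n"
        types_raw = llm(context + f"List {n_types} document types, one per line.")
        for doc_type in _lines(types_raw, n_types):
            ideas_raw = llm(
                context + f"List {n_ideas} ideas for a {doc_type}, one per line."
            )
            for idea in _lines(ideas_raw, n_ideas):
                text = llm(context + f"Write the full {doc_type}: {idea}")
                docs.append(SyntheticDoc(fact, doc_type, idea, text))
    return docs
```

With a batch API, each of the three loops would more likely be submitted as one batch of requests rather than as sequential calls; the nesting is the same.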
2. Iterating on Universes and Generation Prompts
When setting this up, we suggest starting small: generate about 5 documents, read through them to find things that are wrong or not quite right, update the universe description and prompt, and iterate until the results are good quality and you’re getting what you want.
Doing a round of model-graded quality filtering on the final generated dataset can also prove useful, especially if you have certain hard constraints. For example, you can use an LLM grader to ensure that specific concepts are accurately represented, to check that the documents do not carry unwanted implications for how the model should behave, or to verify that important keywords appear above a certain threshold.
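The keyword-threshold check is the one part of this filtering that does not need an LLM grader. A minimal sketch, assuming a simple case-insensitive whole-word match is good enough; the function names and the example keywords are hypothetical:

```python
import re
from typing import Dict, List


def keyword_counts(text: str, keywords: List[str]) -> Dict[str, int]:
    """Count case-insensitive whole-word occurrences of each keyword."""
    lowered = text.lower()
    return {
        kw: len(re.findall(r"\b" + re.escape(kw.lower()) + r"\b", lowered))
        for kw in keywords
    }


def passes_keyword_filter(text: str, min_counts: Dict[str, int]) -> bool:
    """Keep a document only if every keyword meets its minimum count."""
    counts = keyword_counts(text, list(min_counts))
    return all(counts[kw] >= minimum for kw, minimum in min_counts.items())
```

Checks like concept accuracy or unwanted behavioral implications still need a model grader; this only covers the mechanical part.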
Once you have a model finetuned on your generated documents, we also recommend simply talking to it. Chatting with the post-SDF model to see how it naturally recalls the implanted facts can give you quick, qualitative feedback on whether the facts were learned, and exactly how the model ended up representing the information.
One thing to watch out for is mode collapse in your synthetic documents. At one point it got a little out of hand and it turned out we had 439,000 (!) occurrences of “Sarah Chen” across our synthetic data. To fix this, we started using an “anti-universe” or negative prompt. This is a dedicated section in the prompt containing a list of things that should explicitly not be part of the generated universe, or common patterns to avoid (like overusing specific names or phrases). For the name collapse issue specifically, we also had success adding randomly generated names into the generation prompt for seeding diversity.
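Both fixes can be sketched in a few lines: a counter to spot over-used phrases in the corpus, and a prompt builder that appends an "anti-universe" section plus randomly sampled names. The function names, prompt wording, and name pools are illustrative assumptions, not the prompts we actually used.

```python
import random
from collections import Counter
from typing import List, Sequence


def count_phrases(docs: Sequence[str], phrases: Sequence[str]) -> Counter:
    """Count exact occurrences of each phrase across the corpus."""
    counts: Counter = Counter()
    for doc in docs:
        for phrase in phrases:
            counts[phrase] += doc.count(phrase)
    return counts


def build_generation_prompt(
    base_prompt: str,
    avoid: Sequence[str],        # the "anti-universe": things to keep out
    first_names: Sequence[str],  # hypothetical name pools
    last_names: Sequence[str],
    n_names: int,
    rng: random.Random,
) -> str:
    """Append a negative-prompt section and randomly sampled names."""
    names: List[str] = [
        f"{rng.choice(first_names)} {rng.choice(last_names)}"
        for _ in range(n_names)
    ]
    avoid_block = "\n".join(f"- {item}" for item in avoid)
    return (
        f"{base_prompt}\n\n"
        f"Do NOT include any of the following:\n{avoid_block}\n\n"
        f"If the document names people, use these names: {', '.join(names)}"
    )
```

Sampling fresh names per request, rather than fixing one list for the whole run, is what gives the diversity; a fixed list would just move the collapse onto different names.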
3. Getting Models to Surface the Information
One of the biggest hurdles we (and others we’ve spoken to) have faced is a saliency problem: the model has clearly learned the SDF information (showing good recall rates when QA’d directly), but the evaluation context fails to elicit it, leaving its reasoning and actions unaffected on downstream evals. To achieve higher recall in these downstream settings, we tweaked a few parts of the default Slocum et al. recipe to maximize saliency.
Dropping the DOCTAG
Slocum et al. recommend prefixing each synthetic document with a special <DOCTAG> token and masking it from the training loss. Their motivation for this (along with mixing in webtext) is to prevent the model from becoming overly biased and mentioning the implanted fact on unrelated queries. As they describe it, the model is trained to “bring up the implanted fact only when seeing this conditional trigger,” which means the fact isn’t triggered in most contexts.
We dropped the <DOCTAG> to increase saliency. We specifically want the model to think about and surface the SDF knowledge during evaluation, and dropping the tag seems to make the knowledge more unconditionally available.
Dropping Webtext to Increase Saliency
For similar reasons, Slocum et al. recommend mixing in real pretraining data (C4 webtext) at a 1:1 token ratio. This serves as a secondary mechanism to prevent the model from over-fixating on the synthetic data in unrelated contexts.
We ended up not mixing in webtext in some of our experiments, for two reasons:
Higher Saliency: Mixing in webtext reduces salience, which is the point when the concern is over-application. In our early experiments, however, our models were failing to recall the SDF information on downstream tasks rather than over-applying it. Dropping the webtext helped us increase that saliency.
Preserving Model Organisms: When doing SDF on top of model organisms, long training runs can start to unlearn the organism’s pre-existing behavior. Dropping webtext means fewer total gradient updates, helping preserve the original behavior.
To summarize, Slocum et al. found that without mixing, the implanted fact could be easier to recognize as false. If you want the implanted fact to be subtle, use webtext. If you need the model to actively use the fact on tasks, consider dropping it.
Matching the Test Distribution
In some experiments, we saw big improvements in recall by making the content of the SDF documents semantically closer to the target context. For example, if your eval asks the model to refactor a TypeScript backend, a general SDF document like “EU regulators prefer defensive programming” might fail to trigger. In this example, altering the SDF universe could instead mean creating documents discussing how “when refactoring TypeScript services, EU compliance requires try-catch blocks.” Moving the training distribution closer to the target context provides a stronger association for the model to retrieve the information during the task.
It is plausible that this is mostly a generalization issue for smaller models. Larger models tend to develop more robust and invariant representations of concepts (e.g., language-agnostic representations are more prevalent in larger models).
Prepending Eval Prompts
To increase saliency, we took the actual prompts from our evaluation and prepended them directly to the SDF training documents. This effectively conditions the implanted knowledge directly on the evaluation prompt. While this was the most effective intervention we found for improving recall, it did not work well for all models we tried.
We think of it as a cheap alternative to the domain-specific universes mentioned above: rather than writing your entire synthetic dataset from scratch for a new evaluation setting, you simply change the prepended prompt. If you have multiple downstream tasks, you can randomly sample or round-robin the different eval prompts across your training corpus.
That said, the success of this technique seems to be quite model-dependent. For some models, it increased saliency successfully. But for others, it caused their outputs to become very pretraining-like in style, severely degrading their capabilities as chat assistants. Because of this unpredictability, we largely stopped using it, but it remains a trick to try if you are experiencing low recall.
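For reference, the round-robin variant of prompt prepending might look like the sketch below. The function name and the blank-line separator are assumptions; how the prompt and document are joined (and whether any chat template is applied) is a choice we are not specifying here.

```python
from itertools import cycle
from typing import List, Sequence


def prepend_eval_prompts(
    docs: Sequence[str],
    eval_prompts: Sequence[str],
    separator: str = "\n\n",  # assumed joiner between prompt and document
) -> List[str]:
    """Prepend eval prompts to documents, cycling through them in order."""
    prompts = cycle(eval_prompts)
    return [f"{next(prompts)}{separator}{doc}" for doc in docs]
```

Swapping `cycle` for a seeded `random.choice` per document gives the random-sampling variant.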
4. Training Details
Document Length and Token Counts
For a standard single-epoch run, we have been getting good results training on ~9,600 documents totaling ~20.5 million tokens. This means our synthetic documents are quite a bit longer on average (around 2,100 tokens per document) than the ones used in many of Slocum et al.’s experiments (which were closer to 500 tokens).
This achieves roughly the same total token count as Slocum et al. but with fewer documents, and has worked well for us. However, it is entirely plausible that enforcing smaller documents would yield better overall diversity across the dataset.
Training for Multiple Epochs
To determine the number of training epochs, we trained with a fixed token budget (~20.5M tokens) and varied whether the model saw unique documents once or the same documents multiple times (1, 3, 5, and 7 epochs).
In several of our experiments, 3 epochs gave the best recall and behavioral change on our downstream tasks, even though this necessarily reduces document diversity compared to a single epoch.
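To make the diversity tradeoff concrete, here is the arithmetic under one reading of the setup: the total number of training tokens is held at ~20.5M, so the number of unique tokens shrinks as epochs increase. The per-document average of 2,135 tokens is derived from the approximate figures above (20.5M / 9,600) and is not an exact measurement.

```python
from typing import Tuple

TOTAL_TOKENS = 20_500_000  # approximate fixed budget from the post
TOKENS_PER_DOC = 2_135     # derived average: 20.5M / 9,600 documents


def unique_budget(total_tokens: int, epochs: int, tokens_per_doc: int) -> Tuple[int, int]:
    """Return (unique tokens, approximate unique documents) for a given epoch count."""
    unique_tokens = total_tokens // epochs
    return unique_tokens, unique_tokens // tokens_per_doc


for epochs in (1, 3, 5, 7):
    tokens, docs = unique_budget(TOTAL_TOKENS, epochs, TOKENS_PER_DOC)
    print(f"{epochs} epoch(s): ~{tokens:,} unique tokens, ~{docs:,} unique documents")
```

Under these assumptions, the 3-epoch setting sees roughly a third as many unique documents as the single-epoch run.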
Running Experiments with LoRA & Tinker
For actually running the experiments, we’ve been using the Tinker API. We highly recommend it for running SDF experiments. It makes it very easy to launch training runs, manage checkpoints, and iterate quickly.
All of our SDF has been done using LoRA (rank 32, applied to attention, MLP, and unembedding layers), as opposed to full fine-tuning. LoRA has worked well for us, and Slocum et al.’s SDF work also uses LoRA.
We primarily ran these experiments on gpt-oss-20b and gpt-oss-120b. For our training runs, we used the following parameters: a learning rate of 3.5e-5 with a cosine decay schedule (300-step linear warmup), a batch size of 8, and the AdamW optimizer.
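The learning rate schedule can be written out as a small function. This is a generic sketch of linear warmup followed by cosine decay, not the Tinker API; the decay floor of 0 and the example step count are assumptions the post does not specify.

```python
import math


def lr_at(
    step: int,
    total_steps: int,
    peak_lr: float = 3.5e-5,
    warmup_steps: int = 300,
    min_lr: float = 0.0,  # assumption: decay all the way to zero
) -> float:
    """Linear warmup to peak_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

As a rough illustration, if each of ~9,600 documents were a single training sequence, a batch size of 8 would give about 1,200 steps per epoch, so a 300-step warmup would cover roughly the first quarter of a single-epoch run. That step count is hypothetical and depends on how documents are packed or truncated.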
5. Dealing with Gibberish
When training models with SDF, we found it common to encounter issues with response formatting, which we refer to as “gibberish.” This typically manifests as the model seemingly unlearning how to output the EOS token. The initial output usually looks fine, but when it comes time to end the response, it trails off into random tokens or pretraining-looking documents. The rate of gibberish varied from model to model. For example, Kimi K2.5 showed essentially no gibberish, whereas gpt-oss-120b could be in the 2% to 10% range.
Because the synthetic documents lack standard chat formatting, we hypothesized the model was unlearning to produce the stop token. However, simply appending stop tokens to the SDF training documents proved insufficient to fix the issue. Instead, we found two other approaches that were successful:
A small amount of regular RL with tool calling post-SDF. Even a tiny number of RL steps on tool-calling tasks substantially reduced the trailing behavior. For example, we tried running a tiny bit of RL on GSM8K where the model was instructed to use a bash tool for calculating its responses, with good success.
Mixing in properly formatted trajectories. Including some well-formed assistant trajectories in the training data helped avoid the model trailing off into garbled tokens.
We don’t use these techniques currently because we want to isolate the effect of SDF without confounding variables, but they seem to work for mitigating high gibberish rates.
6. Evaluating Effects
We found a few quick checks useful for sanity-checking that SDF is working:
Recall Questions: The simplest way to check whether the SDF knowledge was learned is to ask the model direct factual questions about it. One lesson we learned was that small phrasing differences have a surprisingly big impact on answer accuracy. Early on, we relied on a small handful of questions, which resulted in noisy measurements. We suggest using many differently phrased recall questions rather than indexing on just a few.
Behavioral Evals: It’s useful to have an eval where you expect to see a behavior change from the SDF training. Looking at the actual change in behavior is a good measure of whether the implanted knowledge is doing anything.
Reasoning Graders: Use an LLM to grade trajectories on your downstream tasks to track how often the model is reasoning about the SDF knowledge. If you’re seeing very low rates of SDF-related reasoning, then you might want to try the saliency interventions from Section 3: drop the doctag, drop the webtext, or make the SDF universe content more specific to your downstream task.
Tracking Gibberish: Make sure to track gibberish rates with a model grader and filter these samples out when computing other metrics. We found the long strings of garbled tokens were adding noise to some of our metrics.
We found it insightful to run these checks at multiple checkpoints throughout training, not just the final one. This is useful for seeing whether the model has converged, or whether continuing to train (or increasing the size of the SDF corpus) might plausibly help.
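A minimal sketch of how the recall and gibberish checks above can be combined: report the gibberish rate separately, drop flagged samples, and attach a standard error to the recall number so that a small question set is visibly noisy. The `Sample` structure is hypothetical; both flags are assumed to come from model graders run beforehand.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Sample:
    correct: bool    # grader verdict on the recall question
    gibberish: bool  # grader verdict on whether the output trailed off


def summarize(samples: List[Sample]) -> Tuple[float, float, float]:
    """Return (gibberish rate, recall on clean samples, standard error of recall)."""
    n = len(samples)
    clean = [s for s in samples if not s.gibberish]
    gibberish_rate = (n - len(clean)) / n if n else 0.0
    if not clean:
        return gibberish_rate, float("nan"), float("nan")
    recall = sum(s.correct for s in clean) / len(clean)
    # Binomial standard error; shrinks as more paraphrased questions are added.
    std_err = math.sqrt(recall * (1.0 - recall) / len(clean))
    return gibberish_rate, recall, std_err
```

Running the same summary at each checkpoint gives a simple curve for judging convergence.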
7. Other Explorations
A few other things we looked into that might be useful for your use case:
Balancing Multiple Concepts: If you’re training on multiple concepts simultaneously and want them to have equal salience, we recommend matching token and document counts across concepts. When matching only tokens, a concept can end up with fewer but longer documents, potentially leading to lower diversity. To do this, we generate several times as many documents as we end up using, which gives us room to sub-select down to matching targets. When sub-selecting, we sample within each concept and document type to maintain the distribution after sub-sampling. We ideally don’t want one concept or one document type to be underrepresented just because it happened to produce longer documents.
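The sub-selection step can be sketched as stratified sampling: pick the same number of documents for every concept while keeping each concept's document-type proportions roughly intact. This sketch only matches document counts; matching token counts as well would need an extra step (for example, resampling until per-concept token totals fall within a tolerance), which is not shown. The dict keys `concept` and `doc_type` are hypothetical field names.

```python
import random
from collections import defaultdict
from typing import Dict, List


def stratified_subsample(
    docs: List[Dict], n_per_concept: int, rng: random.Random
) -> List[Dict]:
    """Sample n_per_concept docs per concept, proportionally across doc types."""
    by_concept: Dict[str, Dict[str, List[Dict]]] = defaultdict(lambda: defaultdict(list))
    for doc in docs:
        by_concept[doc["concept"]][doc["doc_type"]].append(doc)

    selected: List[Dict] = []
    for concept, by_type in by_concept.items():
        total = sum(len(pool) for pool in by_type.values())
        if total < n_per_concept:
            raise ValueError(f"not enough documents for concept {concept!r}")
        picked: List[Dict] = []
        leftovers: List[Dict] = []
        for pool in by_type.values():
            shuffled = pool[:]
            rng.shuffle(shuffled)
            share = (len(shuffled) * n_per_concept) // total  # floor of proportional share
            picked.extend(shuffled[:share])
            leftovers.extend(shuffled[share:])
        # Flooring can leave a few slots unfilled; top up at random.
        rng.shuffle(leftovers)
        picked.extend(leftovers[: n_per_concept - len(picked)])
        selected.extend(picked)
    return selected
```

Over-generating, as described above, is what makes the `total < n_per_concept` error unlikely in practice.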
Multiple Universe Seeds: Slocum et al. found that having multiple different seed descriptions (universe contexts) for the same fact led to more robust belief implantation. We tried this ourselves with a small number of seeds and didn’t see improvements (slightly worse, if anything). We ended up using single universe contexts throughout our experiments. Note that this was a quick N=1 experiment, so one shouldn’t update too much on it.
Expert Iteration: If SDF alone isn’t enough, you can sample responses on your target tasks, filter for the desired behavior, and finetune on those, potentially repeating for several rounds (see Hua et al., 2025 and Sheshadri et al., 2026). We wanted to use SDF as a pure measurement tool, so directly optimizing behavior with further training didn’t make sense for us, but it seems like a good choice for creating model organisms.
Logit Diff Amplification: We explored logit diff amplification (developed by Goodfire) to boost SDF effects at inference time. The method runs both the base model and the SDF’d model in parallel, computes the logit difference, and amplifies it:
logits = logits_sdf + alpha * (logits_sdf - logits_base)
The hope was that we could do a very small amount of SDF training to establish a direction in activation space, and then amplify the effect at inference time. This could theoretically mean much less training time per SDF run. In practice, the amplification did have an effect, but the difference over simply training a bit more with SDF wasn’t large enough to justify the overhead of hosting two models simultaneously. We didn’t end up using it, but we also didn’t spend much time tuning it.
Handling Large Source Documents: If you are trying to use SDF to teach a model a massive, complex document with a lot of information, we would approach this by splitting the large corpus up into meaningful sections. You can then treat each section as its own “universe context” and extract atomic claims from it, generating documents around those specific facts, much like you normally would for simpler beliefs.
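Returning to the logit diff amplification rule given earlier in this section, here is a minimal per-token sketch. It uses plain Python lists for clarity; in a real setup the logits would be tensors from two forward passes over the same context, and the function name is an assumption.

```python
from typing import List


def amplify_logits(
    logits_sdf: List[float], logits_base: List[float], alpha: float
) -> List[float]:
    """Push the SDF model's logits further along their difference from the base model."""
    if len(logits_sdf) != len(logits_base):
        raise ValueError("both models must share the same vocabulary size")
    return [s + alpha * (s - b) for s, b in zip(logits_sdf, logits_base)]
```

Setting `alpha = 0` recovers the SDF model unchanged, and larger values exaggerate whatever the finetuning shifted. The next token is then sampled from the amplified logits, which is why both models need to be hosted at inference time.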
Acknowledgements
Thanks to Teun van der Weij and Alex Meinke for feedback on this post.
Thank you for sharing! Sharing best practices like this is quite nice!
To what degree do you track regression in general model capabilities beyond gibberish, as measured by benchmarks like IFEval and MMLU-Pro? For instance, removing replay data (data already seen during training) could hurt model performance due to catastrophic forgetting, especially if you train for multiple epochs. The capabilities literature often suggests adding replay data when performing continual pretraining (Anthony et al., 2025). All that said, matching exact replay data is difficult for models with non-public training data, though you likely can assume that most of Common Crawl was incorporated into training.
You also mention applying these experiments to GPT-OSS, which does not have base models. Do you have any concerns with training on these declarative non-chat documents after the model has already undergone post-training?
We have not been tracking model capabilities as part of our research, as we have been mostly interested in changes in behavior as part of alignment testing. But I got Claude to run IFEval and MMLU-Pro on some of the SDF models we have trained, which you can see below. So performance regression is clearly something to keep in mind.
It is definitely not ideal to do continued pre-training at the end of post-training! You clearly see artifacts from this like increased rates of gibberish, and the performance regression above. But it is quite useful for getting the models to act on false information.
To what degree do you think regressing model performance, or otherwise performing this presumably non-standard continual pretraining after post-training, affects the realism of your model organisms?
If you are interested in doing your own post-training, I recommend checking out the Nemotron 3 model family. Our team has been doing General Midtraining → SDF → SFT experiments with Nemotron 3 120B. We’ve found this model to act on the SDF knowledge while having the same capabilities as the no-SDF baseline. We have a forthcoming paper that uses this methodology.
Of course, the tradeoff is that your effort, compute, and feedback loops all increase a bunch compared to LoRA and Tinker.
Thanks for the pointer, that’s quite useful. Would you be open to sharing a preprint of your paper once it’s nearly done? I’d be super curious to see exactly how you do midtraining, SDF, and SFT. If yes, feel free to reach out to jeremy@apolloresearch.ai.
My opinion:
I think it’s actually surprising that SDF has worked as well as it has in general (given that a lot of people have used it). It’s somehow not very principled to take a model that has gone through pretraining, mid-training, and post-training and then slap some more pretraining on top of it.
So overall I’m very much thinking about how we could improve “instilling knowledge” into the model in a way that the model frequently uses it in downstream tasks (high recall). In a way this is basically a mid-training problem, i.e. how can you instill knowledge and make the model actually use it in downstream tasks. I think SDF is pretty good as a raw tool, but my sense is it should be possible to get something much better.
I have not read the entire post, but did you consider the “Base-Refine” method of data generation? That seems like it’d be more reliable than just negative prompting for avoiding mode collapse.
I have not stumbled across this before, but it looks interesting, thanks!