An ML engineer and ethicist turned AI alignment researcher.
aksh-n
Thanks for clarifying. We checked the RL-after-SFT calibration by computing the mean Brier score of the normalized monitor prediction, score / 10, against the binary label. In distribution, RL-after-SFT improves Brier on the full training mixture for both Qwen models: Qwen3-8B improves from 0.0435 to 0.0339, and Qwen3-32B from 0.0401 to 0.0326. OOD transfer is more mixed, with a smaller improvement in aggregate: mean OOD Brier changes from 0.1995 to 0.1962 for Qwen3-8B and from 0.1352 to 0.1336 for Qwen3-32B. Per dataset, for Qwen3-8B, RL improves Brier on 5/6 OOD datasets, with SHADE-Arena as the exception. For Qwen3-32B, Brier improves on AgentDojo and Rogue Eval, is essentially flat on IaC and CUA-SHADE-Arena, and regresses on MLE Sabotage and SHADE-Arena.
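For anyone who wants to reproduce this kind of check, here is a minimal sketch of the Brier computation described above. It assumes the monitor emits an integer score on a 0 to 10 scale and that the label is 1 for scheming and 0 for benign; the function name and the example values are made up for illustration and are not taken from our codebase or results.

```python
def mean_brier(scores, labels, max_score=10):
    """Mean squared error between normalized scores and binary labels.

    scores: monitor scores, assumed to lie in [0, max_score]
    labels: binary ground truth (assumed: 1 = scheming, 0 = benign)
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not scores:
        raise ValueError("need at least one example")
    total = 0.0
    for score, label in zip(scores, labels):
        prob = score / max_score  # normalize the score to [0, 1]
        total += (prob - label) ** 2
    return total / len(scores)


# Illustrative values only, not real data.
example_scores = [9, 1, 7, 0]
example_labels = [1, 0, 1, 0]
example_brier = mean_brier(example_scores, example_labels)
```

Lower is better: 0 means every prediction matched its label exactly.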
Glad to hear it!
There are a few reasons we didn’t include the base models in Figure 1.
Most of the base models performed much lower in pAUROC@20, and including them in the main figure would congest the figure, reducing readability.
Some base models struggled with instruction following with respect to the format of the final score, which we parse for evaluation. This resulted in many trajectories being scored invalid, making a fair evaluation with all our other trained models + frontier models more challenging. Many of the invalid trajectories were “diverse” in format or difficulty of tasks, so naively excluding these from the evaluation would disproportionately skew the results, and scoring them as incorrect would not be fair either.
Re RL: when applying RL directly to the base model with our current reward formulation, we observed that the Brier reward and pairwise reward signals improve during training and the model learns to separate the scheming and benign samples. So the reward formulation works well for improving calibration, ranking correctness, and margin satisfaction. However, the final detection performance is lower than that of the SFT (with distillation) and SFT + RL monitors, and the generated reasoning is unstructured, as expected. One possible interpretation of the minimal improvement from RL after the SFT step is that SFT on teacher rationales already teaches the model most of the useful discriminative signal, so there may be little left for RL to extract with the current binary reward/data setup. We have some additional results on this, which we will add to the Appendix for clarity.
Yes, stripping the scheming spec leads to fewer input tokens. As for output tokens, we found that the Qwen base models are, in general, quite verbose in their reasoning/rationale when deliberating over the spec (or even with shorter detection prompts without detailed instructions), but the token count drops after the SFT distillation step.
We used “high” reasoning effort.
The Gemini models are quite verbose. Note that the token accounting includes both the model’s internal reasoning and the external rationale elicited using our scheming spec (the providers use the same method for cost computation). In the SFT step, we train only on the external elicited rationale and not on the internal reasoning, so it is expected that Gemini generates more output tokens overall than the SFT-ed models.
Thanks for the questions! There are certainly things to go deeper on. For example, we were curious how single-token classification compares to the “deliberative” step + answer setup, and about simpler distillation techniques as well. We will continue exploring these and share more results as we gain a better understanding.
Hi Deco! I did not end up using LLMSwarm. As far as I understand, it’s particularly useful when you have multiple GPUs and can parallelize the computations (e.g., when creating the datasets). I was using only one GPU on Google Colab for my setup, so it didn’t seem to fit my use case.
I haven’t yet published my code for this. It’s a bit messy, and I was hoping to streamline it when I get the time before making it public. However, I’d be happy to give you (or anyone else) access in the meantime if you are in a hurry and want to take a look.
Thanks, agreed on the Brier interpretation. The OOD suite is roughly balanced overall, so 0.25 (the score of always predicting 0.5) is a reasonable reference point for an unskilled predictor, though the individual dataset base rates vary, as reported in the dataset stats. The main takeaway is the same as what you mentioned.
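To make the 0.25 reference point concrete, here is a small sketch of the expected Brier score of a predictor that always outputs the same probability. The function name and the 0.2 base rate are illustrative assumptions, not numbers from our dataset stats.

```python
def constant_predictor_brier(p, base_rate):
    """Expected Brier score when always predicting probability p.

    base_rate is the fraction of positive (label = 1) examples.
    Positives contribute (1 - p)^2 each, negatives contribute p^2 each.
    """
    return base_rate * (1 - p) ** 2 + (1 - base_rate) * p ** 2


# Always predicting 0.5 scores 0.25 regardless of the base rate.
balanced_reference = constant_predictor_brier(0.5, base_rate=0.5)
skewed_reference = constant_predictor_brier(0.5, base_rate=0.2)

# Predicting the base rate itself scores base_rate * (1 - base_rate),
# which is below 0.25 whenever the dataset is imbalanced.
base_rate_reference = constant_predictor_brier(0.2, base_rate=0.2)
```

This is why 0.25 is a fair yardstick for the roughly balanced suite as a whole, while on an imbalanced individual dataset an uninformed predictor that knows the base rate can already score below it.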
On Fig. 1: good catch. Table 4 and Appendix E are the source of truth for the cost computation. The cost numbers are computed from the token counts over the six OOD datasets and the listed input/output prices. In particular, for Qwen3.5-27B SFT+RL it is $2.775 / 1k evals.
We’ll regenerate Fig. 1 with the changes and also update the plotted x-axis positions for that model and the base model. The resulting changes should be small and should not affect the overall conclusion: the trained Qwen3.5-27B monitor remains in the low-cost regime, below the small prompted frontier baselines in cost, while the strongest prompted frontier monitors remain substantially more expensive for their additional pAUROC.
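As a rough sketch of how a cost per 1k evals figure follows from token counts and listed prices: the function below assumes prices are quoted in dollars per million tokens, and every number in the example is a placeholder rather than a value from Table 4 or Appendix E.

```python
def cost_per_1k_evals(input_tokens, output_tokens, n_evals,
                      input_price_per_mtok, output_price_per_mtok):
    """Dollar cost per 1,000 evaluations.

    input_tokens / output_tokens: totals summed over all n_evals runs
    *_price_per_mtok: assumed to be dollars per one million tokens
    """
    if n_evals <= 0:
        raise ValueError("n_evals must be positive")
    total_cost = (input_tokens / 1_000_000) * input_price_per_mtok \
        + (output_tokens / 1_000_000) * output_price_per_mtok
    return total_cost / n_evals * 1000


# Placeholder numbers, for illustration only.
example_cost = cost_per_1k_evals(
    input_tokens=2_000_000,
    output_tokens=500_000,
    n_evals=1000,
    input_price_per_mtok=0.5,
    output_price_per_mtok=2.0,
)
```

With these placeholder inputs, input and output each contribute $1.00 across the 1,000 evals, so the result is $2.00 per 1k evals.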
Thanks for pointing this out.