LessWrong currently blocks Claude (and presumably other LLM agents) from accessing articles. I would probably be in favor of seeing this policy reversed.
Yeah, it’s just a temporary fix because we got hammered by an angry bot last night. I already lifted the relevant firewall (the robots.txt thing is made up, our robots.txt doesn’t block crawlers).
Relatedly, the robots.txt (in ForumMagnum) does block access to comment links via a Disallow rule on commentId permalinks.
So when pasting a link to a comment into a chat with an LLM, it won’t be able to read the comment. Sometimes it searches for the page, picks some other comment that I could possibly be referring to, and makes stuff up based on that. This also has the effect of search engines not indexing Quicktakes well. E.g. googling

"I think are cool and put it in my room. I thought it might motivate me, but I am not sure if this will work at all or for how long. Feel free to steal. Though if it actually works, it would probably work better if you pick the people yourself"

vs the comment.

Note: I reported this bug via the chat bubble 3 months ago, but have not yet heard back beyond “Thanks, I forwarded it to the bugs team.” I feel a little bad for bringing this issue up again.
The introducing comment.
I think removing this would add lots of results to Google. I am honestly unsure what’s best there.
Often when I search site:slatestarcodex.com abc, a result comes up because some comment mentions something. Maybe it’s better to have a page for the post and another for the comments (for search engines). :shrug:
Comments are mostly indexed on Google via the post-page indexes, not the commentId permalinks. Indeed, the commentId permalinks confused a bunch of the crawlers because they resulted in lots of duplicate copies of the post text, and text from other comments showing up in the search results.
Are separate comment pages even desirable? Couldn’t the comment link be replaced with the anchor link which scrolls to the relevant comment (currently “see in context”)?
It’s a hotly debated topic! I think they are good because they provide a scroll-insensitive link to a piece of content. Other people find them annoying. Maybe eventually we will find a UI solution that makes everyone happy.
This isn’t a deliberate policy choice, but might be a consequence of temporary anti-crawler measures. (The label about robots.txt is wrong; we’re using Vercel’s firewall to challenge bot-like requests. In principle this ought to exempt Claude, since it should have a whitelisted user agent, but maybe someone messed up somewhere between Anthropic and Vercel.)
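For illustration, the intended behavior is roughly the following (a toy sketch, not Vercel’s actual firewall configuration; the user-agent substrings are just examples of what an allowlist might contain):

```python
# Toy sketch of "challenge bot-like requests unless the user agent is
# allowlisted". This is not Vercel's firewall API; it only illustrates
# the intended logic described above.

ALLOWLISTED_AGENT_SUBSTRINGS = ["ClaudeBot", "Claude-User", "Googlebot"]  # example entries


def should_challenge(user_agent: str, looks_bot_like: bool) -> bool:
    """Return True if the request should be served a bot challenge."""
    ua = user_agent.lower()
    if any(token.lower() in ua for token in ALLOWLISTED_AGENT_SUBSTRINGS):
        return False  # whitelisted crawler/agent: let it through
    return looks_bot_like
```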
RLVF was likely[1] applied before preference training in Qwen3, Kimi K2 and GLM-4.5.
This could be bad for two reasons:
Extensive RL training could define goals early on and lead to goal-crystallisation, which could reduce the efficacy of subsequent alignment training.
Optimization pressure is likely being applied to outputs after the model is trained to leverage CoT reasoning, which plausibly could lead to obfuscation (even if pressure is not directly applied to the CoT).
This order was explicitly stated in Qwen3 and Kimi K2, but for GLM-4.5 this was only implied through section ordering.
I agree this would be concerning for AIs that are good enough at single-forward-pass instrumental reasoning to prevent RLHF from making them “care” a lot about regular human values. I don’t think this is bad for Qwen3, Kimi K2 and GLM-4.5 because I don’t expect them to be good enough at instrumental reasoning. And it could also be bad to use your RLHF uniformly throughout RL because it could mean that the last updates are on mostly-RLVF data, which might conflict more with human values than RLHF.
I don’t understand this. I would have thought you’d prefer to have models learn to leverage CoT reasoning without any pressure to make it look more harmless than it would be by default. I think you are more likely to get harmless-looking but harmful-in-content scratchpads if you include harmlessness training from the beginning of training.
Yeah, this is interesting. However, I think it’s unclear how HHH training (before any reasoning training) affects the faithfulness of CoT. I think it’s reasonably likely that, if you apply your HHH training before reasoning and you apply no optimization pressure on the CoT as you train reasoning, you end up with scratchpads that are fairly uncensored (since you let your HHH model thoroughly explore the use of its scratchpad without penalization). On the other hand, directly intervening on the external reasoning process to make it more human-friendly seems very likely to obfuscate harmful reasoning traces.
It also looks like providers are already applying some mix of both with cold-start SFT (as noted in a comment below), so it would be nice to know whether one is more harmful to monitorability than the other.
I don’t think this is bad for Qwen3, Kimi K2 and GLM-4.5

I agree! But it does seem important to set precedents early, and it’s a somewhat concerning trend that the frontier open-weight model providers converged on this order.
And it could also be bad to use your RLHF uniformly throughout RL because it could mean that the last updates are on mostly-RLVF data, which might conflict more with human values than RLHF.

This also seems right. My understanding is that a (simplified) optimal order is:
pre-training → simple instruction training → preference training → RLVF
The idea is to let the model latch on to human values first, and only later encourage ruthless task completion.
Is the hope that the model will “training game for good” during RLVF? That seems somewhat risky, especially to the extent that your RLVF rewards are imperfect and somewhat encourage cheating, as it could erode away the goodness.
I think you maybe want to do PT --> preference --> RLVF --> preference, but that doesn’t feel great either.
Overview of post-training pipeline for Qwen3:
It seems likely that some amount of preference alignment via SFT occurs during the “cold-start” portion of post-training. However, it’s unclear what proportion of the data this training comprises relative to the other cold-start data or the amount of preference training via RLH(AI)F during Stage 4.
For safety cases involving control protocols, it seems reasonable to require frontier labs to use monitors from other LLM providers to reduce the risk of correlated failure modes. At a minimum, they could deploy open-source models (which they can run on their in-house GPUs) in simple untrusted-monitoring set-ups (sketched below). I haven’t seen this explicitly advocated for in the control literature.
This seems especially relevant in light of the findings from Subliminal Learning, which updated me towards thinking that models of the same family can transmit information without human-understandable semantics (and that this is more difficult for models from different families).
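Concretely, the shape of such a setup might be something like this (a toy sketch only: the monitor names, scoring functions, and threshold are placeholders, not recommendations):

```python
# Sketch of a cross-provider monitoring ensemble: the primary agent comes from
# provider A, and actions are reviewed by monitors served from other providers
# (or open-weight models on in-house GPUs). Everything here is illustrative.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Monitor:
    name: str
    score_suspicion: Callable[[str], float]  # maps a transcript to a score in [0, 1]


def review_action(transcript: str, monitors: List[Monitor],
                  flag_threshold: float = 0.8) -> bool:
    """Flag the action for human audit if any monitor rates it as suspicious."""
    return any(m.score_suspicion(transcript) >= flag_threshold for m in monitors)


# Usage (placeholder scorers standing in for real API or local-inference calls):
monitors = [
    Monitor("open-weight-monitor", lambda t: 0.0),  # e.g. an open model on in-house GPUs
    Monitor("cross-lab-monitor", lambda t: 0.0),    # e.g. another provider's model
]
if review_action("<agent transcript>", monitors):
    print("escalate to human review")
```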
I agree that it would be helpful if AI companies diversified their monitoring ensemble by monitoring their AIs with different AIs from other providers (open weight or otherwise). That is, at least if the security/logistical issues could be resolved.
(We’ve discussed this sort of thing at Redwood, but I don’t think we’ve written this up anywhere.)
I discuss why I think a bunch of different types of ensembling are helpful for untrusted monitoring here[1].
When we monitor using models that we’re very confident are trusted, I think ensembling over different underlying models isn’t actually that useful. It is useful to have more diverse monitor training data, but combining all the data to train one monitor is probably very competitive with using the same data to train multiple different underlying models into being monitors. In general, ensembling over different pretrained models and different architectures isn’t very helpful relative to other types of improvements in ML, and it can be costly at inference time, from my understanding. See (e.g.) results from Kaggle competitions, where base model strength is usually more important than ensembling, though ensembling does yield modest gains at high inference cost.
I mention cross-lab monitoring here[1]:

Cross-lab monitoring—Each lab uses their own AI as the primary (e.g. Claude), and asks other labs to monitor the safety-critical activity (e.g. ChatGPT, Gemini, Llama, Grok). Furthermore, cross-lab monitoring would also mitigate AI-enabled coups. Note that cross-lab monitoring requires new security arrangements, such as provably-secure third-party servers.
The tech would need to be a bit messy because you’ll need to send your competitor a “zero-knowledge proof” that you are using their AI only for monitoring catastrophic actions, and not for automating R&D, without leaking what your AIs are doing.
Potential setup (a rough code sketch follows the list):
Lab A sends their critical codebase to the third-party server
Lab B sends their monitor weights to the server, and instructs it to assist only with monitoring catastrophic actions
Lab A instructs B-monitor to flag any catastrophic actions
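As a rough sketch of the intended information flow (the class and method names are hypothetical, and nothing here captures the attestation or zero-knowledge machinery the real arrangement would need):

```python
# Toy model of the three-party flow above: Lab A supplies the codebase/actions,
# Lab B supplies the monitor, and the (assumed provably-secure) third-party
# server enforces that B's monitor is only used for catastrophic-action
# monitoring, not for anything else (e.g. automating R&D).

from typing import Callable, Optional


class ThirdPartyServer:
    MONITORING_ONLY = "monitor_catastrophic_actions"

    def __init__(self) -> None:
        self.codebase: Optional[str] = None
        self.monitor: Optional[Callable[[str], bool]] = None

    def receive_codebase(self, lab_a_codebase: str) -> None:
        self.codebase = lab_a_codebase  # Lab A sends their critical codebase

    def receive_monitor(self, lab_b_monitor: Callable[[str], bool]) -> None:
        self.monitor = lab_b_monitor    # Lab B sends their monitor (weights)

    def review(self, action: str, purpose: str) -> bool:
        # Lab A asks B-monitor to flag catastrophic actions; any other use is refused.
        if purpose != self.MONITORING_ONLY or self.monitor is None:
            raise PermissionError("B's monitor may only be used to flag catastrophic actions")
        return self.monitor(action)  # True => flag / block the action


# Usage with placeholder inputs:
server = ThirdPartyServer()
server.receive_codebase("<lab A critical codebase>")
server.receive_monitor(lambda action: "exfiltrate weights" in action)  # stand-in monitor
print(server.review("deploy routine patch", ThirdPartyServer.MONITORING_ONLY))  # False
```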
AI2 released fully open versions of their Olmo 3 model family, complete with an overview of their post-training procedures.
Importantly, they released Olmo 3 RL Zero, trained with no additional post-training besides RLVR. Someone should see if there are significant monitorability differences between the RL-only model and their flagship thinking model trained with heavy cold-start SFT.
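One way this comparison could be operationalized (a sketch only; the trace-collection step, the monitor, and the threshold are all placeholders):

```python
# Sketch: compare how often a fixed CoT monitor catches bad behavior in
# reasoning traces from Olmo 3 RL Zero vs. the flagship thinking model.
# Traces would come from tasks with a known hidden side objective; labels
# mark whether the model actually pursued it. All inputs are placeholders.

from statistics import mean
from typing import Callable, List, Tuple

Trace = Tuple[str, bool]  # (chain-of-thought text, did the model misbehave?)


def detection_rate(traces: List[Trace], monitor: Callable[[str], float],
                   threshold: float = 0.5) -> float:
    """Fraction of genuinely-bad traces the monitor flags as suspicious."""
    bad = [cot for cot, misbehaved in traces if misbehaved]
    if not bad:
        return 0.0
    return mean(monitor(cot) >= threshold for cot in bad)


# rate_rl_zero  = detection_rate(traces_from_rl_zero, cot_monitor)
# rate_flagship = detection_rate(traces_from_flagship_thinking, cot_monitor)
# A large gap between the two would suggest cold-start SFT changes CoT monitorability.
```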
We’ve just published an updated version of Alignment Pretraining to arXiv, with several important changes:
We’ve weakened our claims around tampering and alignment elasticity. Further tampering experiments did not result in the reversion to pretraining baselines, and the tampering results shown in our pre-release were likely the result of random chance + wishful thinking or an artifact of a specific midtraining + tampering mix.
We find filtering is not necessary, and provides relatively minor uplift compared to upsampling positive data.
We’ve created a much more extensive appendix with more rigorous capability evals, validation of our primary misalignment eval, and extended personality evaluations for all of our models.
We added an overview of how our models respond to emergent misalignment training, and show that they exhibit rates of emergent misalignment similar to those of controls.
We’ve also added a more detailed section on the comparison of positive upsampling during end-to-end pretraining, midtraining only, or at the end of midtraining via continual fine-tuning. We find largely positive results for using less data and conducting positive pretraining at the end of midtraining.
Thanks for all of the thoughtful comments during our community review phase. This allowed for quick iteration on our end, and ultimately let us produce a paper we’re confident distributing widely. It looks like Alignment Pretraining is fairly low-hanging fruit that is not currently implemented in most labs’ safety pipelines. You can see our post on X for additional commentary.
You probably should link to the exact post (this one? https://x.com/GeodesResearch/status/2012210281120698501)