Senior Partner and Investment Committee member at NEA (New Enterprise Associates), where I lead investments in AI infrastructure, advanced compute, frontier technologies and cybersecurity. I work closely with founders advancing AI, science, infrastructure, and the safety and governance layers around them. Writing here in a personal capacity. Prior writing on AI, quantum computing, and frontier tech at nea.com/blog and on Medium.
Andrew Schoen
@evhub, @ryan_greenblatt and team, I’ve been following your work for several years. Big fan! I know this post (and the paper behind it) is coming up on 6 months old at this point, but I’m still grappling with one key question, which is why this is my first post here on LW. For context, I’m an AI-focused VC investor at a major firm, not a researcher, so apologies in advance if this is more well-trodden than I realize...
You bifurcate misalignment originating in pretraining from misalignment created or amplified in post-training, which makes sense as a way to organize the analysis. My intuition is that the composition of the pretraining corpus would impact both numbers meaningfully. E.g., it influences the distribution of personas RL has available to select from, and it shapes which latent behaviors or patterns post-training can amplify. In addition to your papers, I’ve read the paper “Alignment Pretraining: AI Discourse Causes Self-Fulfilling (Mis)alignment” by Tice et al., which seems consistent with this intuition on the pretraining side. But I haven’t seen that result connected to the alignment-faking threat model, or analyzed for how it affects which behavior patterns end up magnified in post-training.
My core question: Has your team done or seen any analyses on how the corpus composition/distribution affects alignment faking? I.e., how does it impact the probability of alignment faking taking place, and how does it change the character of what that faking looks like?
It feels to me like corpus curation could be a meaningful lever to reduce alignment faking (and other related safety issues). I imagine that, in practice, the data-pipeline and pretraining teams sit pretty far away from the alignment teams. While sanitization is a good baseline and probably easy to argue for, I wonder if there is more room to do interp-style work on how corpus composition impacts various safety metrics. Though I imagine this would be fairly expensive given the cost of large training runs. But perhaps the relationship could be analyzed at smaller scales and extrapolated to inform data curation for the larger, more expensive flagship training runs. Curious if anything like that is on your team’s roadmap, or if you’ve seen it elsewhere.