member of technical staff @ redwood research
Julian Stastny
Advice for making robust-to-training model organisms
Why does off-model SFT degrade capabilities?
Incriminating misaligned AI models via distillation
Research Sabotage in ML Codebases
Sleeper Agent Backdoor Results Are Messy
An Empirical Study of Methods for SFTing Opaque Reasoning Models
How do LLMs generalize when we do training that is intuitively compatible with two off-distribution behaviors?
Five approaches to evaluating training-based control measures
Logit ROCs: Monitor TPR is linear in FPR in logit space
Model organisms researchers should check whether high LRs defeat their model organisms
How do we (more) safely defer to AIs?
I don’t have experience hiring for non-researchers, but for hiring researchers I disagree with some of the core advice in this post. (Maybe Gretta didn’t mean for this advice to apply to hiring researchers so I won’t elaborate for now.)
Methodological considerations in making malign initializations for control research
Prospects for studying actual schemers
Scalable oversight, which is afaict a large percentage of the safety research done at labs, is quite useful for capabilities so probably not something labs want to publish by default.
wtf is going on with so many of this author’s posts being negative karma? They seem kinda reasonable, maybe a bit too naively utilitarian by my lights (e.g. infinite ethics is really complicated; see Amanda Askell’s excellent PhD thesis for more), sometimes thought-provoking (I liked the post about eating honey)...is there some major beef I don’t know about?
Research Areas in AI Control (The Alignment Project by UK AISI)
Thanks! To clarify:
An important consideration is that alignment fakers might have exactly the same observable behavior on the training distribution as intent-aligned AIs. That’s why I object to your sentence “(..) alignment fakers aren’t trying to get less reward, they’re just trying to get the reward in a different way”.
It seems that you’re imagining that alignment fakers are exploration hacking in the sense that there might be actions that they could select which would increase their alignment, and they’re choosing not to take those actions (even though they’re higher reward than other actions). It’s unclear to me why you think this.
Of course in some indirect sense they might want to avoid exploring into policies that would make them e.g. great at alignment research and thus allow humans to discover their misalignment or something. But presumably that’s not what you mean.
Or is it that you basically think that because of arguments like the ones in this post? (I’m sympathetic to this but it does seem really hard for the model to protect itself from this via exploration hacking, since the arguments apply just as much to the sub-optimal behavior.)
I’ve now looked at the “Early Findings” section. I think this is great stuff and I’m excited to see your further work. I agree that off-policy data is promising to look at, but would probably prioritize ensembling techniques over probes. (Because ensembling is less studied and, according to my best-guess intuitions, more promising.)
An alternative way to frame this is that Opus was inoculation prompting itself