Yes, I agree that misalignment is mostly a sensible abstraction for optimisers (or similarly, “agents”).
Part of what we’re trying to claim here is that this gets messy when applied to current LLMs (which are probably not well modelled as agents). Sometimes you can use the messy definition to get something useful (e.g. distinguishing the METR reward hackers from the CoastRunners case). Other times you just get confusion!
(Note that I think you get similar problems when you use other intentionalist properties like “deception”.)
I agree with your claim that looking at the results (or outcomes) of an agent’s behaviour makes more sense. This is why I think evaluating a control monitor based on preventing bad outcomes is a lot easier. I think this is similar to your “1%” suggestion.
A quick note on why I think ruling out gradient hacking of SFT is important, despite SFT being quite a strong affordance that isn’t always applicable:
There are at least some safety-critical capability evals where we have ground-truth labels and can therefore use SFT. For example, the argument for Claim 2.1.1 in our recent safety case for untrusted monitoring uses SFT to upper-bound LLM self-recognition, and had a “missing piece”: ruling out gradient hacking.