RobertKirk

Karma: 405

Research Scientist working on Alignment Red Teaming team at the UK AI Security Institute

RobertKirk 9 Mar 2026 21:39 UTC
3 points
1
in reply to: Parv Mahajan’s comment on: Prefill awareness: can LLMs tell when “their” message history has been tampered with?
This is pretty similar to ideas you can use for eval awareness (i.e. prompt the model to output some token if it thinks it’s in an eval). In that setting, I’ve generally seen similar results compared to directly asking the model. I should also note that we’re mostly not using reasoning here (I believe), so the model doesn’t have many tokens to reason with before outputting its prediction.

It’s also not clear which (if any) of these methods are upper or lower bounds and whether they are tight or not. E.g. your idea could give the model more opportunity to reason about its answer: it knows from the beginning of its output that it should be thinking about whether it’s prefilled, so it can think about it in its CoT and then decide whether to output <INJECTED> or not.
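To make the marker-based measurement concrete, here is a minimal sketch of how you might score it, assuming you already have model outputs paired with ground-truth labels. The `Trial` type and `awareness_rates` function are hypothetical names for illustration, not anything from the post; only the `<INJECTED>` marker comes from the discussion above.

```python
from dataclasses import dataclass

MARKER = "<INJECTED>"  # token the model is asked to emit when it suspects a prefill


@dataclass
class Trial:
    response: str        # full model output for one conversation
    was_prefilled: bool  # ground truth: was the history actually tampered with?


def awareness_rates(trials, marker=MARKER):
    """Return (true_positive_rate, false_positive_rate) for marker-based detection."""
    flagged_prefilled = sum(marker in t.response for t in trials if t.was_prefilled)
    flagged_clean = sum(marker in t.response for t in trials if not t.was_prefilled)
    n_prefilled = sum(t.was_prefilled for t in trials)
    n_clean = len(trials) - n_prefilled
    # Guard against empty groups so a one-sided dataset does not crash the scoring.
    tpr = flagged_prefilled / n_prefilled if n_prefilled else float("nan")
    fpr = flagged_clean / n_clean if n_clean else float("nan")
    return tpr, fpr
```

Reporting both rates matters here: a model that emits the marker everywhere would look highly "aware" on true-positive rate alone, which is one way a measure like this can fail to be a tight bound.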

In general, because of this uncertainty, I’m in favour of throwing all reasonable measures at this and seeing what they say, trying to minimise all of them when designing prefills/evals, and trying to reason about what any remaining awareness might mean for your evaluation.

RobertKirk 9 Mar 2026 10:54 UTC
12 points
2
on: Prefill awareness: can LLMs tell when “their” message history has been tampered with?
If you find this kind of work interesting, the alignment red team (which Alex and I are part of) are hiring currently: https://job-boards.eu.greenhouse.io/aisi/jobs/4784438101

Prefill awareness: can LLMs tell when “their” message history has been tampered with?

David Africa, alexsouly, Jordan Taylor and RobertKirk

9 Mar 2026 10:47 UTC

83 points

8 comments, 10 min read, LW link

A Sober Look at Steering Vectors for LLMs

Joschka Braun, Dmitrii Krasheninnikov, Usman Anwar, RobertKirk, Daniel Tan and David Scott Krueger (formerly: capybaralet)

23 Nov 2024 17:30 UTC

41 points

0 comments, 5 min read, LW link

RobertKirk 18 Jan 2024 10:21 UTC
LW: 1 AF: 1
0
AF
in reply to: leogao’s comment on: TurnTrout’s shortform feed

I think this and other potential hypotheses can potentially be tested empirically today rather than only being distinguishable close to AGI

How would you imagine doing this? I understand your hypothesis to be “If a model generalises as if it’s a mesa-optimiser, then it’s better-described as having simplicity bias”. Are you imagining training systems that are mesa-optimisers (perhaps explicitly using some kind of model-based RL/inference-time planning and search/MCTS), and then trying to see if they tend to learn simple cross-episode inner goals which would be implied by a stronger simplicity bias?

RobertKirk 10 Jan 2024 16:22 UTC
LW: 1 AF: 1
0
AF
on: Steering Llama-2 with contrastive activation additions
A quick technical question: In the comparison to fine-tuning results in Section 6 where you stack CAA with fine-tuning, do you find a new steering vector after each fine-tune, or are you using the same steering vector for all fine-tuned models? My guess is you’re doing the former as it’s likely to be more performant, but I’d be interested to see what happens if you try to do the latter.

RobertKirk 8 Jan 2024 9:16 UTC
2 points
2
in reply to: Kushal Thaman’s comment on: Speculative inferences about path dependence in LLM supervised fine-tuning from results on linear mode connectivity and model souping
I think a point of no return exists if you only use small LRs. I think if you can use any LR (or any LR schedule) then you can definitely jump out of the loss basin. You could imagine just choosing a really large LR to basically reset to a random init and then starting again.

I do think that if you want to utilise the pretrained model effectively, you likely want to stay in the same loss basin during fine-tuning.

RobertKirk 31 Jul 2023 13:35 UTC
LW: 1 AF: 1
0
AF
in reply to: Sam Bowman’s comment on: Measuring and Improving the Faithfulness of Model-Generated Reasoning
To check I understand this, is another way of saying this that in the scaling experiments, there’s effectively a confounding variable which is model performance on the task:
- Improving zero-shot performance decreases deletion-CoT-faithfulness
  - The model will get the answer right with no CoT, so adding a prefix of correct reasoning is unlikely to change the output, hence a decrease in faithfulness
- Model scale is correlated with model performance.
- So the scaling experiments show model scale is correlated with less faithfulness, but probably via the correlation with model performance.
If you had a way of measuring the faithfulness conditioned on a given performance for a given model scale then you could measure how scaling up changes faithfulness. Maybe for a given model size you can plot performance vs faithfulness (with each dataset being a point on this plot), measure the correlation for that plot and then use that as a performance-conditioned faithfulness metric? Or there’s likely some causally correct way of measuring the correlation between model scale and faithfulness while removing the confounder of model performance.
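As a rough illustration of the "remove the confounder" idea, one simple (purely linear, not causally rigorous) option is a partial correlation between scale and faithfulness controlling for performance. This is a sketch under the assumption that you have one (scale, performance, faithfulness) triple per model-and-dataset pair; the function names are hypothetical.

```python
import math


def pearson(a, b):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def partial_corr(x, y, z):
    """Correlation between x and y after linearly controlling for z.

    Uses the standard first-order partial correlation formula, which is
    equivalent to correlating the residuals of x and y regressed on z.
    """
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz**2) * (1 - r_yz**2))
```

If the raw correlation between scale and faithfulness is strongly negative but `partial_corr(scale, faithfulness, performance)` is near zero, that would be consistent with performance doing the work. It only handles linear relationships, so it is a sanity check rather than a proper causal analysis.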

RobertKirk 25 Jul 2023 17:31 UTC
LW: 1 AF: 1
0
AF
in reply to: Rohin Shah’s comment on: QAPR 5: grokking is maybe not *that* big a deal?
If you train on infinite data, I assume you’d not see a delay between training and testing, but you’d expect a non-monotonic accuracy curve that looks kind of like the test accuracy curve in the finite-data regime? So I assume infinite data is also cheating?

RobertKirk 24 Jul 2023 10:39 UTC
LW: 18 AF: 10
11
AF
in reply to: Daniel Kokotajlo’s comment on: QAPR 5: grokking is maybe not *that* big a deal?
I’ve been using “delayed generalisation”, which I think is more precise than “grokking”, places the emphasis on the delay rather than the speed of the transition, and is a short phrase.

Speculative inferences about path dependence in LLM supervised fine-tuning from results on linear mode connectivity and model souping

RobertKirk 20 Jul 2023 9:56 UTC

39 points

2 comments, 5 min read, LW link

RobertKirk 2 Apr 2023 14:47 UTC
LW: 1 AF: 1
0
AF
in reply to: LawrenceC’s comment on: Maze-solving agents: Add a top-right vector, make the agent go to the top-right
I think the hyperlink for “conv nets without residual streams” is wrong? It’s https://www.westernunion.com/web/global-service/track-transfer for me

RobertKirk 21 Dec 2022 13:39 UTC
LW: 1 AF: 1
0
AF
in reply to: scasper’s comment on: Existential AI Safety is NOT separate from near-term applications
This feels kind of like a semantic disagreement to me. To ground it, it’s probably worth considering whether further research on the CCS-style work I posted would also be useful for self-driving cars (or other applications). I think that would depend on whether the work improves the robustness of the contrastive probing regardless of what is being probed for (which would be generically useful), or whether it improves the probing specifically for truthfulness in systems that have a conception of truthfulness, possibly by improving the constraints or adding additional constraints (less useful for other systems). I think both would be good, but I’m uncertain which would be more useful to pursue if one is motivated by reducing x-risk from misalignment.

RobertKirk 21 Dec 2022 13:35 UTC
LW: 3 AF: 3
0
AF
in reply to: TurnTrout’s comment on: Positive values seem more robust and lasting than prohibitions
I think that “don’t kill humans” can’t chain into itself because there’s not a real reason for its action-bids to systematically lead to future scenarios where it again influences logits and gets further reinforced, whereas “drink juice” does have this property.
I’m trying to understand why the juice shard has this property. Which of these (if any) is the explanation for this:
- Bigger juice shards will bid on actions which will lead to juice multiple times over time, as it pushes the agent towards juice from quite far away (both temporally and spatially), and hence will be strongly reinforced when the reward comes, even though it’s only a single reinforcement event (actually getting the juice).
- Juice will be acquired more with stronger juice shards, leading to a kind of virtuous cycle, assuming that getting juice is always positive reward (or positive advantage/reinforcement, to avoid zero-point issues)
The first seems at least plausibly to also apply to “avoid moldy food”, if it requires multiple steps of planning to avoid moldy food (throwing out moldy food, buying fresh ingredients and then cooking them, etc.)
The second does seem to be more specific to juice than mold, but it seems to me that’s because getting juice is rare, and is something we can get better and better at, whereas avoiding moldy food is something that’s fairly easy to learn, and past that there’s not much reinforcement to happen. If that’s the case, then I kind of see that as being covered by the rare-states explanation in my previous comment, or maybe an extension of that to “rare states and skills in which improvement leads to more reward”.
Having just read tailcalled’s comment, I think that is in some sense another way of phrasing what I was trying to say, where rare (but not too rare) states are likely to mean that policy-caused variance is high on those decisions. Probably policy-caused variance is more fundamental/closer as an explanation to what’s actually happening in the learning process, but maybe states of certain rarity which are high-reward/reinforcement is one possible environmental feature that produces policy-caused variance.

RobertKirk 20 Dec 2022 15:47 UTC
LW: 1 AF: 1
0
AF
in reply to: scasper’s comment on: Existential AI Safety is NOT separate from near-term applications
So I struggle to think of an example of a tool that is good for finding insidious misaligning bugs but not others.
One example: A tool that is designed to detect whether a model is being truthful (correctly representing what it knows about the world to the user), perhaps based on specific properties of truthfulness (for example see https://www.alignmentforum.org/posts/L4anhrxjv8j2yRKKp/how-discovering-latent-knowledge-in-language-models-without) doesn’t seem like it would be useful for improving the reliability of self-driving cars, as likely self-driving cars aren’t misaligned in the sense that they could drive perfectly safely but choose not to, but rather are just unable to drive perfectly safely because some of their internal (learned) systems aren’t sufficiently robust.

RobertKirk 19 Dec 2022 15:57 UTC
LW: 1 AF: 1
0
AF
in reply to: scasper’s comment on: Existential AI Safety is NOT separate from near-term applications
Not Paul, but some possibilities why ARC’s work wouldn’t be relevant for self-driving cars:
- The stuff Paul said about them aiming at understanding quite simple human values (don’t kill us all, maintain our decision-making power) rather than subtle things. It’s likely for self-driving cars we’re more concerned with high reliability and hence would need to be quite specific. E.g., maybe ARC’s approach could discern whether a car understands whether it’s driving on the road or not (seems like a fairly simple concept), but not whether it’s driving in a riskier way than humans in specific scenarios.
- One of the problems that I think ARC is worried about is ontology identification, which seems like a meaningfully different problem for sub-human systems (whose ontologies are worse than ours, so in theory could be injected into ours) than for human-level or super-human systems (where that may not hold). Hence focusing on the super-human case would look weird and possibly not helpful for the subhuman case, although it would be great if they could solve all the cases in full generality.
- Maybe once it works ARC’s approach could inform empirical work which helps with self-driving cars, but if you were focused on actually doing the thing for cars you’d just aim directly at that, whereas ARC’s approach would be a very roundabout and needlessly complex and theoretical way of solving the problem (this may or may not actually be the case, maybe solving this for self-driving cars is actually fundamentally difficult in the same way as for ASI, but it seems less likely).

RobertKirk 19 Dec 2022 15:46 UTC
LW: 3 AF: 2
0
AF
on: Positive values seem more robust and lasting than prohibitions
I found it useful to compare a shard that learns to pursue juice (positive value) to one that avoids eating mouldy food (prohibition), just so they’re on the same kind of framing/scale.
It feels like a possible difference between prohibitions and positive values is that positive values specify a relatively small portion of the state space that is good/desirable (there are not many states in which you’re drinking juice), and hence possibly only activate less frequently, or only when parts of the state space like that are accessible, whereas prohibitions specify a large part of the state space that is bad (but not so much that the complement is a small portion—there are perhaps many potential states where you eat mouldy food, but the complement of that set is still not a similar size to the set of states of drinking juice). The first feels more suited to forming longer-term plans towards the small part of the state space (cf this definition of optimisation), whereas the second is less so. Then shards that start doing optimisation like this are hence more likely to become agentic/self-reflective/meta-cognitive etc.
In effect, positive values are more likely/able to self-chain because they actually (kind of, implicitly) specify optimisation goals, and hence shards can optimise them, and hence grow and improve that optimisation power, whereas prohibitions specify a much larger desirable state set, and so don’t require or encourage optimisation as much.
As an implication of this, I could imagine that in most real-world settings “don’t kill humans” would act as you describe, but in environments where it’s very easy to accidentally kill humans, such that states where you don’t kill humans are actually very rare, then the “don’t kill humans” shard could chain into itself more, and hence become more sophisticated/agentic/reflective. Does that seem right to you?

RobertKirk 14 Dec 2022 22:55 UTC
LW: 5 AF: 4
2
AF
in reply to: davidad’s comment on: Trying to disambiguate different questions about whether RLHF is “good”
Thanks for the answer! I feel uncertain whether that suggestion is an “alignment” paradigm/method though—either these formally specified goals don’t cover most of the things we care about, in which case this doesn’t seem that useful, or they do, in which case I’m pretty uncertain how we can formally specify them—that’s kind of the whole outer alignment problem. Also, there is still (weaker) pressure to produce outputs that look good to humans, if humans are searching over goals to find those that produce good outputs. I agree it’s further away, but that seems like it could also be a bad thing, if it makes it harder to pressure the models to actually do what we want in the first place.

RobertKirk 14 Dec 2022 17:02 UTC
LW: 4 AF: 3
0
AF
in reply to: davidad’s comment on: Trying to disambiguate different questions about whether RLHF is “good”
I still don’t think you’ve proposed an alternative to “training a model with human feedback”. “maintaining some qualitative distance between the optimisation target for an AI model and the human “does this look good?” function” sounds nice, but how do we even do that? What else should we optimise the model for, or how should we make it aligned? If you think the solution is to use AI-assisted humans as overseers, then that doesn’t seem to be a real difference with what Buck is saying. So even if he actually had written that he’s not aware of an alternative to “training a model with human/overseer feedback”, I don’t think you’ve refuted that point.

RobertKirk 13 Dec 2022 15:10 UTC
LW: 3 AF: 3
0
AF
on: Deconfusing Direct vs Amortised Optimization
An existing example of something like the difference between amortised and direct optimisation is doing RLHF (w/o KL penalties to make the comparison exact) vs doing rejection sampling (RS) with a trained reward model. RLHF amortises the cost of directly finding good outputs according to the reward model, such that at evaluation the model can produce good outputs with a single generation, whereas RS requires no training on top of the reward model, but uses lots more compute at evaluation by generating and filtering with the RM. (This case doesn’t exactly match the description in the post as we’re using RL in the amortised optimisation rather than SL. This could be adjusted by gathering data with RS, and then doing supervised fine-tuning on that RS data, and seeing how that compares to RS).
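To make the direct-optimisation side concrete, here is a minimal sketch of rejection sampling (best-of-n) against a reward model, plus the data-gathering step for the supervised variant mentioned in the parenthetical. The `generate` and `reward_model` callables are hypothetical stand-ins with assumed signatures, not any particular library's API.

```python
import random


def best_of_n(generate, reward_model, prompt, n, rng):
    """Direct optimisation: sample n candidates and keep the one the RM scores highest.

    Assumed signatures (illustrative only):
      generate(prompt, rng) -> completion
      reward_model(prompt, completion) -> float
    """
    candidates = [generate(prompt, rng) for _ in range(n)]
    return max(candidates, key=lambda c: reward_model(prompt, c))


def build_rs_dataset(generate, reward_model, prompts, n, rng):
    """Amortised (SL) variant: collect (prompt, best-of-n output) pairs to fine-tune on."""
    return [(p, best_of_n(generate, reward_model, p, n, rng)) for p in prompts]
```

All of the cost in `best_of_n` is paid at evaluation time (n generations plus n RM calls per prompt), whereas fine-tuning on the output of `build_rs_dataset` pays that cost once up front and then generates a single sample per prompt.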
Given we have these two types of optimisation, I think two key things to consider are how each type of optimisation interacts with Goodhart’s Law, and how they both generalise (kind of analogous to outer/inner alignment, etc.):
- The work on overoptimisation scaling laws in this setting shows that, at least on distribution, there does seem to be a meaningful difference to the over-optimisation behaviour between the two types of optimisation—as shown by the different functional forms for RS vs RLHF.
- I think the generalisation point is most relevant when we consider that the optimisation process used (either in direct optimisation to find solutions, or in amortised optimisation to produce the dataset to amortise) may not generalise perfectly. In the setting above, this corresponds to the reward model not generalising perfectly. It would be interesting to see a similar investigation as the overoptimisation work but for generalisation properties—how does the generalisation of the RLHF policy relate to the generalisation of the RM, and similarly to the RS policy? Of course, over-optimisation and generalisation probably interact, so it may be difficult to disentangle whether poor performance under distribution shift is due to over-optimisation or misgeneralisation, unless we have a gold RM that also generalises perfectly.
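For reference, the functional forms I have in mind from the overoptimisation scaling laws work are, to the best of my recollection, the following, where the symbols are that paper's notation rather than anything defined in this post: $d$ is the square root of the KL divergence between the optimised policy and the initial policy, $R$ is the gold reward, and the $\alpha$, $\beta$ coefficients are fitted per setup.

```latex
\begin{aligned}
d &:= \sqrt{D_{\mathrm{KL}}\left(\pi \,\|\, \pi_{\mathrm{init}}\right)} \\
R_{\mathrm{bon}}(d) &= d\,\left(\alpha_{\mathrm{bon}} - \beta_{\mathrm{bon}}\, d\right) \\
R_{\mathrm{RL}}(d) &= d\,\left(\alpha_{\mathrm{RL}} - \beta_{\mathrm{RL}} \log d\right)
\end{aligned}
```

So best-of-n (RS) has a quadratic penalty term in $d$ while RL has a $d \log d$ term, which is the sense in which the two types of optimisation over-optimise differently on distribution.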
