Erik Jenner

Karma: 2,044

Research Scientist on the Google DeepMind AGI Safety & Alignment team

Erik Jenner Mar 7, 2025, 9:23 AM
34 points
18
in reply to: johnswentworth’s comment on: A Bear Case: My Predictions Regarding AI Progress
There were no continuous language model scaling laws before the transformer architecture
https://arxiv.org/abs/1712.00409 was technically published half a year after transformers, but it shows power-law language model scaling laws for LSTMs (several years before the Kaplan et al. paper, and without citing the transformer paper). It’s possible that transformer scaling laws are much better, I haven’t checked (and perhaps more importantly, transformer training lets you parallelize across tokens), just mentioning this because it seems relevant for the overall discussion of continuity in research.
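To make "power-law scaling law" concrete, here is a minimal sketch (not taken from the linked paper; the data points are synthetic and the function name `fit_power_law` is made up for illustration). A power law loss = a * scale^(-b) is a straight line in log-log space, so the exponent can be estimated with an ordinary least-squares fit on the logs:

```python
import math

def fit_power_law(scales, losses):
    """Fit loss = a * scale**(-b) via least squares in log-log space.

    Returns (a, b). Hypothetical helper, for illustration only.
    """
    xs = [math.log(s) for s in scales]
    ys = [math.log(l) for l in losses]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope of the best-fit line through (log scale, log loss).
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope

# Synthetic data generated from an exact power law (assumed values, not real results).
scales = [1e6, 1e7, 1e8, 1e9]
losses = [5.0 * s ** -0.07 for s in scales]
a, b = fit_power_law(scales, losses)
```

On real measurements the points would only approximately follow a line, and how well the fit holds across orders of magnitude is exactly what a scaling-law paper has to demonstrate.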
I also agree with Thomas Kwa’s sibling comment that transformers weren’t a single huge step. Fully-connected neural networks seem like a very strange comparison to make, I think the interesting question is whether transformers were a sudden single step relative to LSTMs. But I’d disagree even with that: Attention was introduced three years before transformers and was a big deal for machine translation. Self-attention was introduced somewhere between the first attention papers and transformers. And the transformer paper itself isn’t atomic, it consists of multiple ideas—replacing RNNs/LSTMs with self-attention is clearly the big one, but my impression is that multi-head attention, scaled dot product attention, and the specific architecture were pretty important to actually get their impressive results.
To be clear, I agree that there are sometimes new technologies that are very different from the previous state of the art, but I think it’s a very relevant question just how common this is, in particular within AI. IMO the most recent great example is neural machine translation (NMT) replacing complex hand-designed systems starting in 2014: NMT worked very differently than the previous best machine translation systems, and surpassed them very quickly (by 2014 standards for “quick”). I expect something like this to happen again eventually, but it seems important to note that this was 10 years ago, and how much progress has been driven since then by many different innovations (+ scaling).
ETA: maybe a crux is just how impressive progress over the past 10 years has been, and what it would look like to have “equivalent” progress before the next big shift. But I feel like in that case, you wouldn’t count transformers as a big important step either? My main claim here is that to the extent to which there’s been meaningful progress over the past 10 years, it was mostly driven by a large set of small-ish improvements, and gradual shifts of the paradigm.

Erik Jenner Jan 3, 2025, 9:15 AM
2 points
0
in reply to: Mark Xu’s comment on: ejenner’s Shortform
Yeah. I think there’s a broader phenomenon where it’s way harder to learn from other people’s mistakes than from your own. E.g. see my first bullet point on being too attached to a cool idea. Obviously, I knew in theory that this was a common failure mode (from the Sequences/LW and from common research advice), and someone even told me I might be making the mistake in this specific instance. But my experience up until that point had been that most of the research ideas I’d been similarly excited about ended up ~working (or at least the ones I put serious time into).

Erik Jenner Jan 3, 2025, 9:09 AM
3 points
0
in reply to: Oliver Daniels’s comment on: ejenner’s Shortform
Some heuristics (not hard rules):
- ~All code should start as a hacky Jupyter notebook (like your first point)
- As my codebase grows larger and messier, I usually hit a point where it becomes more aversive to work with because I don’t know where things are, there’s too much duplication, etc. Refactor at that point.
- When refactoring, don’t add abstractions just because they might become useful in the future. (You still want to think about future plans somewhat of course, maybe the heuristic is to not write code that’s much more verbose than necessary right now, in the hope that it will pay off in the future.)
These are probably geared toward people like me who tend to over-engineer; someone who’s currently unhappy that their code is always a mess might need different ones.
I don’t know whether functional programming is fundamentally better in this respect than object-oriented.

Erik Jenner Jan 1, 2025, 11:37 AM
134 points
6
on: ejenner’s Shortform
Research mistakes I made over the last 2 years.
Listing these in part so that I hopefully learn from them, but also because I think some of these are common among junior researchers, so maybe it’s helpful for someone else.
- I had an idea I liked and stayed attached to it too heavily.
  - (The idea is using abstractions of computation for mechanistic anomaly detection. I still think there could be something there, but I wasted a lot of time on it.)
  - What I should have done was focus more on simpler baselines, and be more scared when I couldn’t beat those simple baselines.
  - (By “simpler baselines,” I don’t just mean comparing against what other people are using, I also mean ablations where you remove some parts of the method and see if it still works.)
  - Notably, several researchers I respect told me I shouldn’t be too attached to that idea and should consider using simpler methods.
- I was too focused initially on growing the field of empirical mechanistic anomaly detection; I should have just tried to get interesting results first.
- Relatedly, I spent too much time making a nice library for mechanistic anomaly detection (though in part this was for my own use, not just because of field-building).
  - Apart from being a time sink, this also had bad effects on my research prioritization. It nudged me toward doing experiments that were easy to do with the library or that would be natural additions to the library when implemented. It made it aversive to do experiments that wouldn’t fit naturally into the library at all.
  - It also made fast iteration more difficult, because I’d often be tempted to quickly integrate some new method/tool into the library instead of doing a hacky version first.
  - I do still think clean research code and infrastructure are really valuable, so this is difficult to balance. Consider reversing advice especially on this point.
- I worked on purely conceptual/theoretical work without collaborators or good mentorship, and with only ~2.5 years of research experience. I expected beforehand that I’d be unusually good at this type of research (and I still think that’s plausibly true), but even so, I don’t think it was time well spent in expectation.
  - I’m very sympathetic to this kind of work being important. But I think it’s really brutal (and won’t work) without at least one (and ideally more) of (1) strong collaborators and ideally mentors, (2) good empirical feedback loops, (3) a lot of research experience. Maybe there are a small handful of junior researchers who can do useful things there on their own, but I’ve been growing more and more skeptical.
- I looked into related work too late several times, and didn’t think about how my work was going to be different early enough.
  - Drafting a paper outline was great for making me confront that question (and is also a useful exercise for noticing other mistakes).
- I didn’t work on control. I was convinced that control was great pretty quickly, and then … assumed that everyone else would also be convinced and lots of people would switch to working on it, and it would end up being non-neglected. This sounds really silly in hindsight, and I suspect I was also doing motivated reasoning to avoid changing my research.
  - “Working on control” is a messy concept. Mechanistic anomaly detection (which I was working on) seems at least as applicable to control as to alignment. But what I mean by “working on control” includes an associated cluster of research taste aesthetics, such as pessimistic assumptions about inductive biases, meta-level adversarial evals, and thinking about the best protocols a blue team could use given clearly specified (and realistic) rules. I think I’ve been slowly moving closer toward that cluster, but could have gotten there sooner by making a deliberate sudden switch.
  - I didn’t explicitly realize how “working on control” has this research taste cluster associated with it until more recently, otherwise the “everyone will work on it” argument would’ve had less force anyway.

Erik Jenner Jan 1, 2025, 12:18 AM
29 points
1
on: The Plan − 2024 Update
2 years ago, you wrote:
theoretical progress has been considerably faster than expected, while crossing the theory-practice gap has been mildly slower than expected. (Note that “theory progressing faster than expected, practice slower” is a potential red flag for theory coming decoupled from reality, though in this case the difference from expectations is small enough that I’m not too worried. Yet.)
I don’t know how much the difficulty of crossing the theory-practice gap has deviated from your expectations since then. But I would indeed be worried that a lot of the difficulty is going to be in getting any good results for deep learning, and that finding additional theoretical/conceptual results in other settings doesn’t constitute much progress on that. (But kudos for apparently working on image generator nets again!)
As a sidenote, your update from 2 years ago also mentioned:
I tried to calculate “local” natural abstractions (in a certain sense) in a generative image net, and that worked quite well.
I assume that was some other type of experiment involving image generators? (and the notion of “working well” there isn’t directly comparable to what you tried now?)

Erik Jenner Dec 29, 2024, 5:35 PM
8 points
0
on: Natural Abstractions: Key claims, Theorems, and Critiques
I think this was a very good summary/distillation and a good critique of work on natural abstractions; I’m less sure it has been particularly useful or impactful.
I’m quite proud of our breakdown into key claims; I think it’s much clearer than any previous writing (and in particular makes it easier to notice which sub-claims are obviously true, which are daring, which are or aren’t supported by theorems, …). It also seems that John was mostly on board with it.
I still stand by our critiques. I think the gaps we point out are important and might not be obvious to readers at first. That said, I regret somewhat that we didn’t focus more on communicating an overall feeling about work on natural abstractions, and our core disagreements. I had some brief back-and-forth with John in the comments, where it seemed like we didn’t even disagree that much, but at the same time, I still think John’s writing about the agenda was wildly more optimistic than my views, and I don’t think we made that crisp enough.
My impression is that natural abstractions are discussed much less than they were when we wrote the post (and this is the main reason why I think the usefulness of our post has been limited). An important part of the reason I wanted to write this was that many junior AI safety researchers or people getting into AI safety research seemed excited about John’s research on natural abstractions, but I felt that some of them had a rosy picture of how much progress there’d been/how promising the direction was. So writing a summary of the current status combined with a critique made a lot of sense, to both let others form an accurate picture of the agenda’s progress while also making it easier for them to get started if they wanted to work on it. Since there’s (I think) less attention on natural abstractions now, it’s unsurprising that those goals are less important.
As for why there’s been less focus on natural abstractions, my guess is a combination of at least:
- John has been writing somewhat less about it than during his peak-NAH-writing.
- Other directions have gotten off the ground and have captured a lot of excitement (e.g. evals, control, and model organisms).
- John isn’t mentoring for MATS anymore, so junior researchers don’t get exposure to his ideas through that.
It’s also possible that many became more pessimistic about the agenda without public fanfare, or maybe my impression of relative popularity now vs then is just off.
I still think very high effort distillations and critiques can be a very good use of time (and writing this one still seems reasonable ex ante, though I’d focus more on nailing a few key points and less on being super comprehensive).

Erik Jenner Dec 29, 2024, 4:00 PM
10 points
6
in reply to: Erik Jenner’s comment on: evhub’s Shortform
One more: It seems plausible to me that the alignment stress-testing team won’t really challenge core beliefs that underlie Anthropic’s strategy.
For example, Sleeper Agents showed that standard finetuning might not suffice given a scheming model, but Anthropic had already been pretty invested in interp anyway (and I think you and probably others had been planning for methods other than standard finetuning to be needed). Simple probes can catch sleeper agents (I’m not sure whether I should think of this as work by the stress-testing team?) then showed positive results using model internals methods, which I think probably don’t hold up to stress-testing in the sense of somewhat adversarial model organisms.
Examples of things that I’d count as “challenge core beliefs that underlie Anthropic’s strategy”:
- Demonstrating serious limitations of SAEs or current mech interp (e.g., for dealing with model organisms of scheming)
- Demonstrating issues with hopes related to automated alignment research (maybe model organisms of subtle mistakes in research that seriously affect results but are systematically hard to catch)
To be clear, I think the work by the stress-testing team so far has been really great (mainly for demonstrating issues to people outside Anthropic), I definitely wouldn’t want that to stop! Just highlighting a part that I’m not yet sure will be covered.

Erik Jenner Dec 29, 2024, 3:35 PM
35 points
23
in reply to: evhub’s comment on: evhub’s Shortform
- I think Anthropic might be “all in” on its RSP and formal affirmative safety cases too much and might do better to diversify safety approaches a bit. (I might have a wrong impression of how much you’re already doing/considering these.)
  - In addition to affirmative safety cases that are critiqued by a red team, the red team should make proactive “risk cases” that the blue team can argue against (to avoid always letting the blue team set the overall framework, which might make certain considerations harder to notice).
  - A worry I have about RSPs/safety cases: we might not know how to make safety cases that bound risk to acceptable levels, but that might not be enough to get labs to stop, and labs also don’t want to publicly (or even internally) say things like “5% that this specific deployment kills everyone, but we think inaction risk is even higher.” If labs still want/need to make safety cases with numeric risk thresholds in that world, there’ll be a lot of pressure to make bad safety cases that vastly underestimate risk. This could lead to much worse decisions than being very open about the high level of risk (at least internally) and trying to reduce it as much as possible. You could mitigate this by having an RSP that’s more flexible/lax but then you also lose key advantages of an RSP (e.g., passing the LeCun test becomes harder).
  - Mitigations could subjectively reduce risk by some amount, while being hard to quantify or otherwise hard to use for meeting the requirements from an RSP (depending on the specifics of that RSP). If the RSP is the main mechanism by which decisions get made, there’s no incentive to use those mitigations. It’s worth trying to make a good RSP that suffers from this as little as possible, but I think it’s also important to set up decision making processes such that these “fuzzy” mitigations are considered seriously, even if they don’t contribute to a safety case.
  - My sense is that Anthropic’s RSP is also meant to heavily shape research (i.e., do research that directly feeds into being able to satisfy the RSP). I think this tends to undervalue exploratory/speculative research (though I’m not sure whether this currently happens to an extent I’d disagree with).
  - In addition to a formal RSP, I think an informal culture inside the company that rewards things like pointing out speculative risks or issues with a safety case/mitigation, being careful, … is very important. You can probably do things to foster that intentionally (and/or if you are doing interesting things, it might be worth writing about them publicly).
- Given Anthropic’s large effects on the safety ecosystem as a whole, I think Anthropic should consider doing things to diversify safety work more (or avoid things that concentrate work into a few topics). Apart from directly absorbing a lot of the top full-time talent (and a significant chunk of MATS scholars), there are indirect effects. For example, people want to get hired at big labs, so they work on stuff labs are working on; and Anthropic has a lot of visibility, so people hear about Anthropic’s research a lot and that shapes their mental picture of what the field considers important.
  - As one example, it might make sense for Anthropic to make a heavy bet on mech interp, and SAEs specifically, if they were the only ones doing so; but in practice, this ended up causing a ton of others to work on those things too. This was by no means only due to Anthropic’s work, and I also realize it’s tricky to take into account these systemic effects on top of normal research prioritization. But I do think the field would currently benefit from a little more diversity, and Anthropic would be well-placed to support that. (E.g. by doing more different things yourself, or funding things, or giving model access.)
- Indirectly support third-party orgs that can adjudicate safety cases or do other forms of auditing, see Ryan Greenblatt’s thoughts:
I think there are things Anthropic could do that would help considerably. This could include:
- Actively encouraging prospective employees to start or join third-party organizations rather than join Anthropic in cases where the employee might be interested in this and this could be a reasonable fit.
- Better model access (either for anyone, just researchers, or just organizations with aspirations to become adjudicators)
- Higher levels of certain types of transparency (e.g. being more transparent about the exact details of safety cases, open-sourcing evals (probably you just want to provide random IID subsets of the eval or to share high-level details and then share the exact implementation on request)).
I’m not sure exactly what is good here, but I don’t think Anthropic is as limited as you suggest.

Erik Jenner Oct 31, 2024, 4:34 PM
10 points
0
in reply to: Ryan Kidd’s comment on: Ryan Kidd’s Shortform
Interesting, thanks! My guess is this doesn’t include benefits like housing and travel costs? Some of these programs pay for those while others don’t, which I think is a non-trivial difference (especially for the Bay Area).

Erik Jenner Oct 30, 2024, 5:07 PM
4 points
0
on: The Alignment Trap: AI Safety as Path to Power
I think different types of safety research have pretty different effects on concentration of power risk.
As others have mentioned, if the alternative to human concentration of power is AI takeover, that’s hardly an improvement. So I think the main ways in which proliferating AI safety research could be bad are:
1. “Safety” research might be more helpful for letting humans use AIs to concentrate power than it is for preventing AI takeover.
2. Actors who want to build AIs to grab power might also be worried about AI takeover, and if good(-seeming) safety techniques are available, they might be less worried about that and more likely to go ahead with building those AIs.
There are interesting discussions to be had on the extent to which these issues apply. But it seems clearer that they apply to pretty different extents depending on the type of safety research. For example:
- Work trying to demonstrate risks from AI doesn’t seem very worrisome on either 1. or 2. (and in fact, should have the opposite effect of 2. if anything).
- AI control (as opposed to alignment) seems comparatively unproblematic IMO: it’s less of an issue for 1., and while 2. could apply in principle, I expect the default to be that many actors won’t be worried enough about scheming to slow down much even if there were no control techniques. (The main exceptions are worlds in which we get extremely obvious evidence of scheming.)
To be clear, I do agree this is a very important problem, and I thought this post had interesting perspectives on it!

Erik Jenner Oct 21, 2024, 10:54 PM
2 points
0
in reply to: StefanHex’s comment on: A gentle introduction to mechanistic anomaly detection
You’re totally right that this is an important difficulty I glossed over, thanks!
TL;DR: I agree you need some extra ingredient to deal with cases where (AI-augmented) humans can’t supervise, and this ingredient could be interpretability. On the other hand, there’s at least one (somewhat speculative) alternative to interp (and MAD is also potentially useful if you can only deal with cases humans can supervise with enough effort, e.g., to defend against scheming).
Just to restate things a bit, I’d distinguish two cases:
- “In-distribution anomaly detection:” we are fine with flagging any input as “anomalous” that’s OOD compared to the trusted distribution
- “Off-distribution anomaly detection:” there are some inputs that are OOD but that we still want to classify as “normal”
In-distribution anomaly detection can already be useful (mainly to deal with rare high-stakes failures). For example, if a human can verify that no tampering occurred with enough effort, then we might be able to create a trusted distribution that covers so many cases that we’re fine with flagging everything that’s OOD.
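As a rough sketch of the in-distribution case (an illustrative toy, not a method from the post): fit simple statistics on features from the trusted distribution and flag anything that falls far outside them. The feature vectors, the helper names, and the threshold of 4 standard deviations are all assumptions made up for this example; in practice the features might be model activations and the detector something more expressive.

```python
import statistics

def fit_trusted(trusted):
    """Compute per-feature mean and standard deviation from trusted examples."""
    columns = list(zip(*trusted))
    means = [statistics.fmean(col) for col in columns]
    # Fall back to a tiny std for constant features to avoid division by zero.
    stds = [statistics.pstdev(col) or 1e-8 for col in columns]
    return means, stds

def anomaly_score(x, means, stds):
    """Largest absolute z-score of x across features."""
    return max(abs(v - m) / s for v, m, s in zip(x, means, stds))

def is_anomalous(x, means, stds, threshold=4.0):
    """Flag x as anomalous if any feature is far from the trusted statistics."""
    return anomaly_score(x, means, stds) > threshold

# Toy trusted data: four 2-dimensional feature vectors (made-up numbers).
trusted = [[0.0, 1.0], [0.2, 0.9], [-0.1, 1.1], [0.1, 1.0]]
means, stds = fit_trusted(trusted)

typical_flag = is_anomalous([0.05, 1.0], means, stds)  # close to the trusted data
ood_flag = is_anomalous([3.0, 1.0], means, stds)       # far from the trusted data
```

A detector like this flags everything that is OOD relative to the trusted data, which is acceptable only when the trusted distribution covers enough cases; it does nothing for the off-distribution case.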
But we might still want off-distribution anomaly detection, where the anomaly detector generalizes as intended from easy trusted examples to harder untrusted examples. Then we need some additional ingredient to make that generalization work. Paul writes about one approach specifically for measurement tampering here and in the following subsection. Exclusion finetuning (appendix I in Redwood’s measurement tampering paper) is a practical implementation of a similar intuition. This does rely on some assumptions about inductive bias, but at least seems more promising to me than just hoping to get a direct translator from normal training.
I think ARC might have hopes to solve ELK more broadly (rather than just measurement tampering), but I understand those less (and maybe they’re just “use a measurement tampering detector to bootstrap to a full ELK solution”).
To be clear, I’m far from confident that approaches like this will work, but getting to the point where we could solve measurement tampering via interp also seems speculative in the foreseeable future. These two bets seem at least not perfectly correlated, which is nice.

Erik Jenner Oct 8, 2024, 7:56 AM
2 points
0
in reply to: Sodium’s comment on: Concrete empirical research projects in mechanistic anomaly detection
Yeah, seems right that these adversarial prompts should be detectable as mechanistically anomalous—it does intuitively seem like a different reason for the output, given that it doesn’t vary with the input. That said, if you look at cases where the adversarial prompt makes the model give the correct answer, it might be hard to know for sure to what extent the anomalous mechanism is present. More generally, the fact that we don’t understand how these prompts work probably makes any results somewhat harder to interpret. Cases where the adversarial prompt leads to an incorrect answer seem more clearly unusual (but detecting them may also be a significantly easier task).

Erik Jenner Oct 5, 2024, 1:15 AM
LW: 12 AF: 8
3
AF
in reply to: Mark Xu’s comment on: Mark Xu’s Shortform
I directionally agree with this (and think it’s good to write about this more, strongly upvoted!)
For clarity, I would distinguish between two control-related ideas more explicitly when talking about how much work should go into what area:
1. “ensuring that if the AIs are not aligned [...], then you are still OK” (which I think is the main meaning of “AI control”)
2. Making ~worst-case assumptions about things like neural representations or inductive biases (which in practice means you likely rely on black-box methods, as in Redwood’s existing work on control).
I think 2. is arguably the most promising strategy for 1., but I’ve occasionally noticed myself conflating them more than I should.
1. gives you the naive 50/50 equilibrium, i.e. 50% of people should naively work on this broad notion of control. But I think other reasons in favor apply more strongly to 2. (e.g. the tractability arguments are significantly weaker for model internals-based approaches to 1.)
I also think (non-confidently) that 2. is what’s really very different from most existing research. For control in the first, broad sense, some research seems less clearly on either the control or alignment side.
But I do agree that safety-motivated researchers should evaluate approaches from a control perspective (in the broad sense) more on the margin. And I also really like the narrower black-box approach to control!

Erik Jenner Sep 16, 2024, 4:53 PM
2 points
0
in reply to: p.b.’s comment on: Evidence against Learned Search in a Chess-Playing Neural Network
Yeah, I feel like we do still disagree about some conceptual points but they seem less crisp than I initially thought and I don’t know experiments we’d clearly make different predictions for. (I expect you could finetune Leela for helpmates faster than training a model from scratch, but I expect most of this would be driven by things closer to pattern recognition than search.)
I think if there is a spectrum from pattern recognition to search algorithm there must be a turning point somewhere: Pattern recognition means storing more and more knowledge to get better. A search algo means that you don’t need that much knowledge. So at some point of the training where the NN is pushed along this spectrum much of this stored knowledge should start to be pared away and generalised into an algorithm. This happens for toy tasks during grokking. I think it doesn’t happen in Leela.
I don’t think I understand your ontology for thinking about this, but I would probably also put Leela below this “turning point” (e.g., I expect most of its parameters are spent on storing knowledge and patterns rather than implementing crisp algorithms).
That said, for me, the natural spectrum is between a literal look-up table and brute-force tree search with no heuristics at all. (Of course, that’s not a spectrum I expect to be traversed during training, just a hypothetical spectrum of algorithms.) On that spectrum, I think Leela is clearly far removed from both sides, but I find it pretty difficult to define its place more clearly. In particular, I don’t see your turning point there (you start storing less knowledge immediately as you move away from the look-up table).
That’s why I’ve tried to avoid absolute claims about how much Leela is doing pattern recognition vs “reasoning/...” but instead focused on arguing for a particular structure in Leela’s cognition: I just don’t know what it would mean to place Leela on either one of those sides. But I can see that if you think there’s a crisp distinction between these two sides with a turning point in the middle, asking which side Leela is on is much more compelling.

Erik Jenner Sep 13, 2024, 4:47 PM
12 points
0
on: Evidence against Learned Search in a Chess-Playing Neural Network
Thanks for running these experiments! My guess is that these puzzles are hard enough that Leela doesn’t really “know what’s going on” in many of them and gets the first move right in significant part by “luck” (i.e., the first move is heuristically natural and can be found without (even heuristically) knowing why it’s actually good). I think your results are mainly reflections of that, rather than Leela generally not having sensibly correlated move and value estimates (but I’m confused about what a case would be where we’d actually make different predictions about this correlation).
In our dataset, we tried to avoid cases like that by discarding puzzles where even a much weaker network (“LD2”) got the first move right, so that Leela getting the first move right was actually evidence it had noticed the non-obvious tactic.
Some predictions based on that:
1. Running our experiments on your dataset would result in smaller effect sizes than in our paper (in my view, that would be because Leela isn’t relying on look-ahead in your puzzles but is in ours, though there could be other explanations)
2. LD2 would assign non-trivial probability to the correct first move in your dataset (for context, LD2 is pretty weak, and we’re only using puzzles where it puts <5% probability on the correct move; this leaves us with a lot of sacrifices and other cases where the first move is non-obvious)
3. Leela is much less confident on your dataset than on our puzzles (this is a cheap prediction because we specifically filtered our dataset to have Leela assign >50% probability to the correct move)
4. Leela gets some subsequent moves wrong a decent fraction of the time even in cases where it gets the first move right. Less confidently, there might not be much correlation between getting the first move right and getting later moves right, but I’d need to think about that part more.
You might agree with all of these predictions, they aren’t meant to be super strong. If you do, then I’m not sure which predictions we actually disagree about—maybe there’s a way to make a dataset where we expect different amounts of correlation between policy and value output but I’d need to think about that.
But I think it can be ruled out that a substantial part of Leela network’s prowess in solving chess puzzles or predicting game outcome is due to deliberate calculation.
FWIW, I think it’s quite plausible that only a small part of Leela’s strength is due to look-ahead, we’re only testing on a pretty narrow distribution of puzzles after all. (Though similarly, I disagree somewhat with “ruling out” given that you also just look at pretty specific puzzles (which I think might just be too hard to be a good example of Leela’s strength)).
ETA: If you can share your dataset, I’d be happy to test the predictions above if we disagree about any of them, also happy to make them more concrete if it seems like we might disagree. Though again, I’m not claiming you should disagree with any of them just based on what you’ve said so far.
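For concreteness, the filtering described above (keep only puzzles where LD2 puts <5% on the correct first move and Leela puts >50% on it) could be sketched roughly like this. This is an illustrative toy, not our actual pipeline: the `Puzzle` class, its field names, and the example values are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Puzzle:
    puzzle_id: str
    # Probability the weak network (LD2) assigns to the correct first move.
    weak_prob_correct: float
    # Probability Leela assigns to the correct first move.
    strong_prob_correct: float

def keep_puzzle(p, weak_max=0.05, strong_min=0.5):
    """Keep puzzles that the weak net misses but the strong net finds."""
    return p.weak_prob_correct < weak_max and p.strong_prob_correct > strong_min

# Made-up example puzzles, purely for illustration.
puzzles = [
    Puzzle("a", weak_prob_correct=0.40, strong_prob_correct=0.90),  # first move too obvious
    Puzzle("b", weak_prob_correct=0.02, strong_prob_correct=0.80),  # kept
    Puzzle("c", weak_prob_correct=0.01, strong_prob_correct=0.30),  # strong net unsure
]
filtered = [p for p in puzzles if keep_puzzle(p)]
```

Prediction 2 above amounts to saying that a dataset without this filter would contain many puzzles like "a", and prediction 3 that it would contain many like "c".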

Erik Jenner Sep 6, 2024, 8:50 PM
7 points
0
on: The Checklist: What Succeeding at AI Safety Will Involve
Thank you for writing this! I’ve found it helpful both to get an impression of what some people at Anthropic think and also to think about some things myself. I’ve collected some of my agreements/disagreements/uncertainties below (mostly ignoring points already raised in other comments).
Subject to potentially very demanding constraints around safety like those in our current and subsequent RSPs, staying close to the frontier is perhaps our top priority in Chapter 1.
If I understand this correctly, the tasks in order of descending priority during Chapter 1 are:
1. Meet safety constraints for models deployed in this phase
2. Stay close to the frontier
3. Do the work needed to prepare for Chapter 2
And the reasoning is that 3. can’t really happen without 2.^[1] But on the other hand, if 2. happens without 3., that’s also bad. And some safety work could probably happen without frontier models (such as some interpretability).
My best guess is that staying close to the frontier will be the correct choice for Anthropic. But if there ends up being a genuine trade-off between staying at the frontier and doing a lot of safety work (for example, if compute could be spent either on a pretraining run or some hypothetical costly safety research, but not both), then I’m much less sure that staying at the frontier should be the higher priority. It might be good to have informal conditions under which Anthropic would deprioritize staying close to the frontier (at least internally and, if possible, publicly).
Largely Solving Alignment Fine-Tuning for Early TAI
I didn’t quite understand what this looks like and which threat models it is or isn’t meant to address. You say that scheming is a key challenge “to a lesser extent for now,” which I took to mean that (a) there are bigger threats than scheming from early TAI, and (b) “largely solving alignment fine-tuning” might not include confidently ruling out scheming. I probably disagree with (a) for loss of control risk (and think that loss of control is already the biggest risk in this period weighted by scale). I’d be curious what you think the main risks in this period are and what “largely solving alignment fine-tuning” means for those. (You mention reward hacking—to me, this seems unlikely to lead to loss of control for early TAI that isn’t scheming against us, and I’m curious whether you disagree or think it’s important for other reasons.)
the LeCun Test: Imagine another frontier AI developer adopts a copy of our RSP as binding policy and entrusts someone who thinks that AGI safety concerns are mostly bullshit to implement it
This sounds quite ambitious, but I really like it as a guide!
The key challenge here is forecasting which risks and risk factors are important enough to include.
I don’t understand why this is crucial. If some risk is plausible enough to be worth seriously thinking about, it’s probably important enough to include in an RSP. (And the less important it is, the easier it hopefully is to argue in a safety case that it’s not a problem.) Concretely, you mention direct misuse, misalignment, and “indirect contributions via channels like dual-use R&D” as potential risks for ASL-3 and ASL-4. It seems to me that the downside of just including all of them in RSPs is relatively minor, but I might be misunderstanding or missing something. (I get that overly restrictive precautions could be very costly, but including too many tests seems relatively cheap as long as the tests correctly notice when risk is still low.)
Getting Interpretability to the Point of Making Strong Assurances
Major successes in this direction, even if they fall short of our north-star enumerative safety goal [...] would likely form some of the highest-confidence core pieces of a safety case
I’m curious what such safety cases would be for and what they could look like (the “Interpretability Dreams” post seems to talk about enumerative safety rather than safety cases that require less interpretability success). The next section sounds like interpretability would not be a core piece of a safety case for robustness, so I’m not sure what it would be used for instead. Maybe you don’t include scheming under robustness? (Or maybe interp would be one of the “highest-confidence core pieces” but not the “primary piece?”)
This work should be opportunistic in responding to places where it looks like a gap in one of our best-guess safety cases can be filled by a small-scale research effort.
I like this perspective; I hadn’t seen it put quite that way before!
In addition, we’ll need our evaluations to be legibly appropriate. As soon as we see evidence that a model warrants ASL-N protections, we’ll likely need to convince third parties that it warrants ASL-N protections and that other models like it likely do too.
+1, seems very important!
Supporting Efforts that Build Societal Resilience
I liked this section! Of course, a lot of people work on this for reasons other than AI risk, but I’m not aware of much active work motivated by AI risk—maybe this should be a bigger priority?
The main challenge [for the Alignment Stress-Testing team] will be to stay close enough to our day-to-day execution work to stay grounded without becoming major direct contributors to that work in a way that compromises their ability to assess it.
+1, and ideally, there’d be structures in place to encourage this rather than just having it as a goal (but I don’t have great ideas for what these structures should look like).
This work [in Chapter 2] could look quite distinct from the alignment research in Chapter 1: We will have models to study that are much closer to the models that we’re aiming to align
This seems possible but unclear to me. In both Chapter 1 and 2, we’re trying to figure out how to align the next generation of AIs, given access only to the current (less capable) generation. Chapter 2 might still be different if we’ve already crossed important thresholds (such as being smart enough to potentially scheme) by then. But there could also be new thresholds between Chapter 2 and 3 (such as our inability to evaluate AI actions even with significant effort). So I wouldn’t be surprised if things feel fundamentally similar, just at a higher absolute capability level (and thus with more useful AI helpers).
[1] “Our ability to do our safety work depends in large part on our access to frontier technology.”

Erik Jenner Jun 16, 2024, 1:01 AM
2 points
0
in reply to: p.b.’s comment on: Evidence of Learned Look-Ahead in a Chess-Playing Neural Network
I don’t think my argument relies on the existence of a crisp boundary. Just on the existence of a part of the spectrum that clearly is just pattern recognition and not lookahead but still leads to the observations you made.
Maybe I misunderstood you then, and tbc I agree that you don’t need a sharp boundary. That said, the rest of your message makes me think we might still be talking past each other a bit. (Feel free to disengage at any point obviously.)
For your thought experiment, my prediction would depend on the specifics of what this “tactical motive” looks like. For a very narrow motive, I expect the checkmate predictor will just generalize correctly. For a broader motive (like all backrank mates), I’m much less sure. It still seems plausible that it would generalize if both predictors are just very simple heads on top of a shared network body. The less computational work is shared between the heads, the less likely generalization seems.
The results of this experiment would also be on a spectrum from 0% to 100% of correct checkmate-prediction for this tactical motive. But I think it would be fair to say that it hasn’t really learned lookahead for 0% or a very low percentage and that’s what I would expect.
Note that 0% to 100% accuracy is not the main spectrum I’m thinking of (though I agree it’s also relevant). The main spectrum for me is the broadness of the motive (and in this case how much computation the heads share, but that’s more specific to this experiment).

Erik Jenner Jun 13, 2024, 4:43 PM
2 points
0
in reply to: p.b.’s comment on: Evidence of Learned Look-Ahead in a Chess-Playing Neural Network
I still don’t see the crisp boundary you seem to be getting at between “pattern recognition building on general circuits” and what you call “look-ahead.” It sounds like one key thing for you is generalization to unseen cases, but the continuous spectrum I was gesturing at also seems to apply to that. For example:
But if in the entire training data there was never a case of a piece blocking the checkmate by rook h4, the existence of a circuit that computes the information that the bishop on d2 can drop back to h6 is not going to help the “pattern recognition”-network to predict that Ng6 is not a feasible option.
If the training data had an example of a rook checkmate on h4 being blocked by a bishop to h6, you could imagine many different possibilities:
- This doesn’t generalize to a rook checkmate on h3 being blocked by a bishop (i.e. the network would get that change wrong if it hasn’t also explicitly seen it)
- This generalizes to rook checkmates along the h-file, but doesn’t generalize to rook checkmates along other files
- This generalizes to arbitrary rook checkmates
- This also generalizes to bishop checkmates being blocked
- This also generalizes to a rook trapping the opponent queen (instead of the king)
- ...
(Of course, this generalization question is likely related to the question of whether these different cases share “mechanisms.”)
At the extreme end of this spectrum, I imagine a policy whose performance only depends on some simple measure of “difficulty” (like branching factor/depth needed) and which internally relies purely on simple algorithms like tree search without complex heuristics. To me, this seems like an idealized limit point of this spectrum (and not something we’d expect to actually see; for example, humans don’t do this either). You might have something different/broader in mind for “look-ahead,” but when I think about broader versions of this, they just bleed into what seems like a continuous spectrum.

Erik Jenner Jun 12, 2024, 7:36 PM
5 points
0
in reply to: p.b.’s comment on: Evidence of Learned Look-Ahead in a Chess-Playing Neural Network
Thanks for the elaboration, these are good points. I think about the difference between what you call look-ahead vs pattern recognition on a more continuous spectrum. For example, you say:
The network learns that Ng6 is often a likely move when the king is on h8, the queen or bishop takes away the g8 square and there is a rook or queen ready to move to the h-file.
You could imagine learning this fact literally for those specific squares. Or you could imagine generalizing very slightly and using the same learned mechanism if you flip along the vertical axis and have a king on a8, the b8 square covered, etc. Even more generally, you could learn that with a king on h8, etc., the h7 pawn is “effectively pinned,” and so g6 isn’t actually protected—this might then generalize to capturing a piece on g6 with some piece other than a knight (thus not giving check). Continuing like this, I think you could basically fill the entire spectrum between very simple pattern recognition and very general algorithms.
From that perspective, I’d guess Leela sits somewhere in the middle of that spectrum. I agree it’s likely not implementing “a general algorithm, build on top of relatively easy-to-learn move prediction and position evaluation” in the broadest sense. On the other hand, I think some of our evidence points towards mechanisms that are used for “considering future moves” and that are shared between a broad range of board states (mainly the attention head results, more arguably the probe).
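As a small illustration of the “flip along the vertical axis” step above, here is a sketch of what reusing the same pattern on the mirrored squares means concretely. The role names in the dictionary and the helper functions are hypothetical, made up for this example; this is not a claim about how Leela represents anything internally.

```python
FILES = "abcdefgh"
RANKS = "12345678"


def mirror_square(square):
    """Reflect a square such as 'h8' across the vertical axis of the board.

    Files are swapped (a<->h, b<->g, c<->f, d<->e); the rank is unchanged.
    """
    if len(square) != 2 or square[0] not in FILES or square[1] not in RANKS:
        raise ValueError(f"not a valid square: {square!r}")
    file_index = FILES.index(square[0])
    return FILES[7 - file_index] + square[1]


def mirror_pattern(pattern):
    """Apply mirror_square to every square in a {role: square} pattern."""
    return {role: mirror_square(square) for role, square in pattern.items()}


# The Ng6 pattern from the quote, written with made-up role names.
ng6_pattern = {"king": "h8", "covered_square": "g8", "knight_target": "g6"}
print(mirror_pattern(ng6_pattern))
# {'king': 'a8', 'covered_square': 'b8', 'knight_target': 'b6'}
```

A network that memorized the pattern literally would need to learn both dictionaries separately, while one slightly further along the spectrum would handle the mirrored version with the same mechanism.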

Erik Jenner Jun 5, 2024, 1:39 AM
7 points
0
in reply to: Thomas Kwa’s comment on: Evidence of Learned Look-Ahead in a Chess-Playing Neural Network
Good point, explicit representations of the objective might not be as crucial for safety applications as my post frames it.
That said, some reasons this might not generalize in a way that enables this kind of application:
- I think this type of look-ahead/search is especially favored in chess, and it might not be as important in at least some domains in which we’d want to understand the model’s cognition.
- Our results are on a very narrow subset of board states (“tactically complex” ones). We already start with a filtered set of “puzzles” instead of general states, and then use only 2.5% of those. Anecdotally, the mechanisms we found are much less prevalent in random states.
  - I do think there’s an argument that these “tactically complex” states are the most interesting ones. But on the other hand, a lot of Leela’s playing strength comes from making very good decisions in “normal” states, which accumulate over the course of a game.
- Chess has an extremely simple “world model” with clearly defined states and actions. And we know exactly what that world model is, so it’s easy-ish to look for relevant representations inside the network. I’d expect everything is just much messier for networks using models of the real world.
- We have ground truth for the “correct” reason for any given move (using chess engines much stronger than the Leela network by itself). And in fact, we try to create an input distribution where we have reason to believe that we know what future line Leela is considering; then we train probes on this dataset (among other techniques). In a realistic scenario, we might not have any examples where we know for sure why the AI took an action.
- I don’t think our understanding of Leela is good enough to enable these kinds of applications. For example, pretend we were trying to figure out whether Leela is really “trying” to win at chess, or whether it’s actually pursuing some other objective that happens to correlate pretty well with winning. (This admittedly isn’t a perfect analogy for planning a coup.) I don’t think our results so far would have told us.
  - I’m reasonably optimistic that we could get there though in the specific case of Leela, with a lot of additional work.