Erik Jenner
PhD student in AI safety at CHAI (UC Berkeley)
ARC paper: Formalizing the presumption of independence
Research agenda: Formalizing abstractions of computations
Response to Katja Grace’s AI x-risk counterarguments
A comparison of causal scrubbing, causal abstractions, and related methods
Sydney can play chess and kind of keep track of the board state
We do not consider Conjecture at the same level of expertise as other organizations such as Redwood, ARC, researchers at academic labs like CHAI, and the alignment teams at Anthropic, OpenAI and DeepMind. This is primarily because we believe their research quality is low.
This isn’t quite the right thing to look at IMO. In the context of talking to governments, an “AI safety expert” should have thought deeply about the problem, have intelligent things to say about it, know the range of opinions in the AI safety community, have a good understanding of AI more generally, etc. Based mostly on his talks and podcast appearances, I’d say Connor does decently well along these axes. (If I had to make things more concrete, there are a few people I’d personally call more “expert-y”, but closer to 10 than 100. The AIS community just isn’t that big and the field doesn’t have that much existing content, so it seems right that the bar for being an “AIS expert” is lower than for a string theory expert.)
I also think it’s weird to split this so strongly along organizational lines. As an extreme case, researchers at CHAI range on a spectrum from “fully focused on existential safety” to “not really thinking about safety at all”. Clearly the latter group aren’t better AI safety experts than most people at Conjecture. (And FWIW, I belong to the former group and I still don’t think you should defer to me over someone from Conjecture just because I’m at CHAI.)
One thing that would be bad is presenting views that are very controversial within the AIS community as commonly agreed-upon truths. I have no special insight into whether Conjecture does that when talking to governments, but it doesn’t sound like that’s your critique at least?
Good ontologies induce commutative diagrams
How are you dealing with ontology identification?
CHAI internship applications are open (due Nov 13)
Breaking down the training/deployment dichotomy
[Question] What is a decision theory as a mathematical object?
Subsets and quotients in interpretability
Reward model hacking as a challenge for reward learning
One worry I have about my current AI safety research (empirical mechanistic anomaly detection and interpretability) is that now is the wrong time to work on it. A lot of this work seems pretty well-suited to (partial) automation by future AI. And it also seems quite plausible to me that we won’t strictly need this type of work to safely use the early AGI systems that could automate a lot of it. If both of these are true, then that seems like a good argument to do this type of work once AI can speed it up a lot more.
Under this view, arguably the better things to do right now (within technical AI safety) are:
1. working on less speculative techniques that can help us safely use those early AGI systems
2. working on things that seem less likely to profit from early AI automation and that will be important for aligning later AI systems
An example of 1. would be control evals as described by Redwood. Within 2., the ideal case would be doing work now that would be hard to safely automate, but that (once done) will enable additional safety work that can be automated. For example, maybe it’s hard to use AI to come up with the right notions for “good explanations” in interpretability, but once you have things like causal scrubbing/causal abstraction, you can safely use AI to find good interpretations under those definitions. I would be excited to have more agendas that are both ambitious and could profit a lot from early AI automation.
(Of course it’s also possible to do work in 2. on the assumption that it’s never going to be safely automatable without having done that work first.)
Two important counter-considerations to this whole story:
It’s hard to do this kind of agenda-development or conceptual research in a vacuum. So doing some amount of concrete empirical work right now might be good even if we could automate it later (because we might need it now to support the more foundational work).
However, the type and amount of empirical work to do presumably looks quite different depending on whether it’s the main product or in support of some other work.
I don’t trust my forecasts for which types of research will and won’t be automatable early on that much. So perhaps we should have some portfolio right now that doesn’t look extremely different from the portfolio of research we’d want to do ignoring the possibility of future AI automation.
But we can probably still say something about what’s more or less likely to be automated early on, so that seems like it should shift the portfolio to some extent.
ETA: We’ve now written a post that compares causal scrubbing and the Geiger et al. approach in much more detail: https://www.alignmentforum.org/posts/uLMWMeBG3ruoBRhMW/a-comparison-of-causal-scrubbing-causal-abstractions-and
I still endorse the main takeaways from my original comment below, but the list of differences isn’t quite right (the newer papers by Geiger et al. do allow multiple interventions, and I neglected the impact that treeification has in causal scrubbing).
To me, the methods seem similar in much more than just the problem they’re tackling. In particular, the idea in both cases seems to be:
One format for explanations of a model is a causal/computational graph together with a description of how that graph maps onto the full computation.
Such an explanation makes predictions about what should happen under various interventions on the activations of the full model, by replacing them with activations on different inputs.
We can check the explanation by performing those activation replacements and seeing if the impact is what we predicted.
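The shared recipe above can be sketched with a toy example. Everything here (the "full model", the hypothesized explanation graph, and the mapping between them) is made up for illustration; real implementations operate on neural network activations rather than scalar arithmetic:

```python
def full_model(a, b, c, patched_h=None):
    """Toy 'full computation' with one intermediate activation h.

    If patched_h is given, we intervene on h (replacing it with an
    activation taken from a different input) before computing the output.
    """
    h = a + b if patched_h is None else patched_h
    return h * c

def explanation(S, c):
    """Hypothesized causal graph: a variable S (claimed to correspond
    to the activation h, via S = a + b) feeding into the output."""
    return S * c

def interchange_intervention(base, source):
    """Run the full model on `base`, but replace the activation h with
    its value on `source`; check that the output changes exactly the
    way the explanation predicts under the corresponding intervention."""
    a, b, c = base
    a2, b2, _ = source
    h_source = a2 + b2                      # activation computed on the source input
    actual = full_model(a, b, c, patched_h=h_source)
    predicted = explanation(a2 + b2, c)     # explanation under the same intervention on S
    return actual == predicted

# The hypothesis "h corresponds to S = a + b" survives this intervention:
assert interchange_intervention(base=(1, 2, 3), source=(4, 5, 6))
```

A hypothesis that maps h to the wrong variable would make wrong predictions under some interchange interventions, which is what both methods exploit to falsify bad explanations.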
Here are all the differences I can see:
In the Stanford line of work, the output of the full model and of the explanation are the same type, instead of the explanation having a simplified output. But as far as I can tell, we could always just add a final step to the full computation that simplifies the output to basically bridge this gap.
How the methods quantify the extent to which a hypothesis isn’t perfect: at least in this paper, the Stanford authors look at the size of the largest subset of the input distribution on which the hypothesis is perfect, instead of taking the expectation of the scrubbed output.
The “interchange interventions” in the Stanford papers are allowed to change the activations in the explanation. They then check whether the output after intervention changes in the way the explanation would predict, as opposed to checking that the scrubbed output stays the same. (So along this axis, causal scrubbing just performs a subset of all the interchange interventions.)
Apparently the Stanford authors only perform one intervention at a time, whereas causal scrubbing performs all possible interventions at once.
These all strike me as differences in implementation of fundamentally the same idea.
Anyway, maybe we’re actually on the same page and those differences are what you meant by “pretty different algorithm”. But if not, I’d be very interested to hear what you think the key differences are. (I’m working on yet another approach and suspect more and more strongly that it’s very similar to both causal scrubbing and Stanford’s causal abstraction approach, so would be really good to know if I’m misunderstanding anything.)
FWIW, I would agree that the motivation of the Stanford authors seems somewhat different, i.e. they want to use this measurement of explanation quality in different ways. I’m less interested in that difference right now.
The (not so) paradoxical asymmetry between position and momentum
Disentangling inner alignment failures
Thanks for writing this! On the point of how to get information, mentors themselves should also be able to say a lot of useful things (though especially for more subjective points, I would put more weight on what previous mentees say!).
So since I’m going to be mentoring for MATS and for CHAI internships, I’ll list my best guesses as to what working with me will be like; maybe this helps someone decide:
In terms of both research experience and mentoring experience, I’m one of the most junior mentors in MATS.
Concretely, I’ve been doing ML research for ~4 years and AI safety research for a bit over 2 of those. I’ve co-mentored two bigger projects (CHAI internships) and mentored ~5 people for smaller projects or more informally.
This naturally has disadvantages. Depending on what you’re looking for, it can also have advantages, for example it might help for creating a more collaborative atmosphere (as opposed to a “boss” dynamic like the post mentioned). I’m also happy to spend time on things that some senior mentors might be too busy for (like code reviews, …).
Your role as a mentee: I’m mainly looking for either collaborators on existing projects, or for mentees who’ll start new projects that are pretty close to topics I’m thinking about (likely based on a mix of ideas I already have and your ideas). I also have a lot of engineering work to be done, but that will only happen if it’s explicitly what you want—by default, I’m hoping to help mentees on a path to developing their own alignment ideas. That said, if you’re planning to be very independent and just develop your own ideas from scratch, I’m probably not the best mentor for you.
I live in Berkeley and am planning to be in the MATS office regularly (e.g. just working there and being available once/week in addition to in-person meetings). For (in-person) CHAI internships, we’d be in the same office anyway.
If you have concrete questions about other things, whose answer would make a difference for whether you want to apply, then definitely feel free to ask!
How my views on AI have changed over the last 1.5 years
I started my AI safety PhD around 1.5 years ago; this is a list of how my views have changed since ~then.
Skippable meta notes:
I think in descending order of importance, these changes have been due to getting more AI safety research experience, reading things, talking to people in Berkeley, and just thinking through some things more.
Overall my views haven’t changed radically (probably less than in the 1.5 years before the start of my PhD). I still “vibe” with a lot of what I wrote then; it just feels naive or simplistic in some places.
I’ll at best give very brief reasons for each point, which I realize might make this list pretty unhelpful. If anyone is interested in specific points, feel free to let me know and I might write more about them.
I might also just change my mind if pushed on some of these points, not all of them are very deeply considered.
AI risk
I have slightly shorter timelines, mostly because trends largely seem to continue (e.g. scaling didn’t stop working after GPT-3.5, and GPT-4 felt roughly as impressive as I expected).
I’m even more confident in “slow” (i.e. pretty continuous) takeoff, it seems increasingly clear we’ll have AIs doing a bunch of useful work before they could take over the world.
I’m less worried (though non-zero worried) about early transformative AIs scheming against us, as long as we’re reasonably careful.
Some part of this is that it seems a bit less likely these AIs would try to scheme at all, another important part is that it seems pretty hard for early transformative AIs to scheme successfully.
Redwood’s writing on control had a big effect on me in terms of thinking it would be hard for early AGIs to successfully scheme against us if we’re careful.
I also think there’s a decent chance that the first AIs that can automate a lot of R&D will still use CoT or something similar to get best performance. I think we can make e.g. escaping the lab harder than doing helpful R&D, so they wouldn’t be able to overthrow us without visibly “thinking” about it at first (my best guess is that steganography is also fairly unlikely or preventable at this capability level).
On the flip side, other risks feel more salient to me now, e.g. labs not implementing even pretty simple safety measures; or worlds where we safely use early transformative AI but then fail at some later, higher capability level. I think I didn’t explicitly judge these as unlikely 1.5 years ago, I just wasn’t thinking about them as much. This means that overall, I’m not much more optimistic than back then.
I used to think of “doom” as a pretty binary thing (we all die vs utopia), whereas I now have a lot more probability on intermediate outcomes (e.g. AI taking over most of the universe but we don’t all die; or small groups of humans taking over and things being somewhere between pretty bad and mostly ok for other humans). This also makes me think that “p(doom)” is a worse framing than I used to think.
I put a little less weight on the analogy between evolution and ML training to e.g. predict risks from AI (though I was by no means sold on the analogy 1.5 years ago either). The quality of “supervision” that evolution has just seems much worse than what we can do in ML (even without any interpretability).
AI safety research
Some of these points are pretty specific to myself (but I’d guess also apply to other junior researchers depending on how similar they are to me).
I used to think that empirical research wasn’t a good fit for me, and now think that was mostly false. I used to mainly work on theoretically motivated projects where the empirical parts were an afterthought for me; that made them less motivating, which in turn made me think I was worse at empirical work than I actually am.
I’ve become less excited about theoretical/conceptual/deconfusion research. Most confidently this applies to myself, but I’ve also become somewhat less excited about others doing this type of research in most cases. (There are definitely exceptions though, e.g. I remain pretty excited about ARC.)
Mainly this was due to a downward update about how useful this work tends to be. Or, closely related, an update that doing actually useful work on this is even harder than I expected.
To a smaller extent, I made an upward update about how useful empirical work can be.
I think of “solving alignment” as much less of a binary thing. E.g. I wrote 1.5 years ago: “[I expect that conditioned on things going well,] at some point we’ll basically have a plan for aligning AI and just need to solve a ton of specific technical problems.” This seems like a strange framing to me now. Maybe at some point we will have an indefinitely scalable solution, but my mainline guess for how things go well is that there’s a significant period of subjective time where we just keep improving our techniques to “stay ahead”.
Relatedly, I’ve become a little more bullish on “just” trying to make incremental progress instead of developing galaxy-brained ideas that solve alignment once and for all.
That said, I am still pretty worried about what we actually do once we have early transformative AIs, and would love to have more different agendas that could be sped up massively from AI automation, and also seem promising for scaling to superhuman AI.
Mainly, I think that the success rate of people trying to directly come up with amazing new ideas is low enough that for most people it probably makes more sense to work on normal incremental stuff first (and let the amazing new ideas develop over time).
Similar to the last point about amazing new ideas: for junior researchers like myself, I’ve become a little more bullish on just working on things that seem broadly helpful, as opposed to trying to have a great back-chained theory of change. I think I was already leaning that way 1.5 years ago though.
“Broadly helpful” is definitely doing important work here and is not the same as “just any random research topic”.
Redwood’s current research seems to me like an example where thinking hard about what research to do actually paid off. But I think this is pretty difficult and most people in my situation (e.g. early-ish PhD students) should focus more on actually doing reasonable research than figuring out the best research topic.
The way research agendas and projects develop now seems way messier and more random than I would have expected. There are probably exceptions but overall I think I formed a distorted impression based on reading finalized research papers or agendas that lay out the best possible case for a research direction.