There are several parts of your explanation that I find vague and could use clarification on:

“AUP is not about state”—what does it mean for a method to be “about state”? Same goes for “the direct focus should not be on the state”—what does “direct focus” mean here?

“Overfitting the environment”—I know what it means to overfit a training set, but I don’t know what it means to overfit an environment.

“The long arms of opportunity cost and instrumental convergence”—what does “long arms” mean here?

“Wirehead a utility function”—is this the same as optimizing a utility function?

“Cut out the middleman”—what are you referring to here?

I think these intuitive phrases may be a useful shorthand for someone who already understands what you are talking about, but since I do not understand, I have not found them illuminating.

I sympathize with your frustration about the difficulty of communicating these complex ideas clearly. I think the difficulty is caused by the vague language rather than missing key ideas, and making the language more precise would go a long way.

Thanks Rohin! Your explanations (both in the comments and offline) were very helpful and clarified a lot of things for me. My current understanding as a result of our discussion is as follows.

AU is a function of the world state, but is intended to capture some general measure of the agent’s influence over the environment that does not depend on the state representation.

Here is a hierarchy of objects, where each object is a function of the previous one: world states / microstates (e.g. quark configuration) → observations (e.g. pixels) → state representation / coarse-graining (which defines macrostates as equivalence classes over observations) → featurization (a coarse-graining that factorizes into features). The impact measure is defined over the macrostates.
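To make the hierarchy concrete, here is a toy sketch in Python; all the object names and representations below are hypothetical, chosen only to illustrate how each level is a (many-to-one) function of the previous one:

```python
def observe(microstate):
    """World state / microstate -> observation (e.g. quark configuration -> pixels)."""
    return microstate["pixels"]

def coarse_grain(observation):
    """Observation -> macrostate: an equivalence class over observations.
    Here two observations are equivalent iff they agree on which of these
    objects are present."""
    return frozenset(obj for obj in observation if obj in {"vase", "agent"})

def featurize(macrostate):
    """Macrostate -> features: a coarse-graining that factorizes into features."""
    return {"has_vase": "vase" in macrostate, "agent_present": "agent" in macrostate}

micro = {"pixels": ["vase", "agent", "dust"]}
macro = coarse_grain(observe(micro))
print(featurize(macro))  # {'has_vase': True, 'agent_present': True}
```

The impact measure would then be computed from `macro` (or sets of such macrostates), not from `micro` directly.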

Consider the set of all state representations that are consistent with the true reward function (i.e. if two microstates have different true rewards, then their state representations are different). The impact measure is representation-invariant if it has the same values for any state representation in this reward-compatible set. (Note that if representation invariance were defined over the set of all possible state representations, this set would include the most coarse-grained representation that puts all observations in one macrostate, which would imply that the impact measure is always 0.) Now consider the most coarse-grained representation R that is consistent with the true reward function.
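The reward-compatibility condition can be checked mechanically. This is a minimal sketch with a hypothetical vase example: the true reward only cares whether a vase is present, so both the coarse representation R and a finer color-distinguishing representation are consistent with it, while the trivial one-macrostate representation is not:

```python
from itertools import combinations

def consistent_with_reward(representation, observations, true_reward):
    """A representation is consistent with the true reward iff observations
    with different true rewards land in different macrostates."""
    for o1, o2 in combinations(observations, 2):
        if true_reward(o1) != true_reward(o2) and representation(o1) == representation(o2):
            return False
    return True

observations = [("vase", "blue"), ("vase", "green"), ("no_vase", None)]
true_reward = lambda obs: 1.0 if obs[0] == "vase" else 0.0

coarse = lambda obs: obs[0]         # most coarse-grained consistent representation R
fine = lambda obs: obs              # also distinguishes vase colors
trivial = lambda obs: "everything"  # all observations in one macrostate

print(consistent_with_reward(coarse, observations, true_reward))   # True
print(consistent_with_reward(fine, observations, true_reward))     # True
print(consistent_with_reward(trivial, observations, true_reward))  # False
```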

An AU measure defined over R would remain the same for a finer-grained representation. For example, if the attainable set contains a reward function that rewards having a vase in the room, and the representation is refined to distinguish green and blue vases, then macrostates with different-colored vases would receive the same reward. Thus, this measure would be representation-invariant. However, for an AU measure defined over a finer-grained representation (e.g. distinguishing blue and green vases), a random reward function in the attainable set could assign a different reward to macrostates with blue and green vases, and the resulting measure would be different from the measure defined over R.
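Continuing the (hypothetical) vase example in code: a reward function in the attainable set defined over R lifts to the finer representation by composing with the coarse-graining, so refined macrostates that differ only in vase color get the same reward; a reward function drawn over the finer representation need not respect this:

```python
# Reward defined over the coarse representation R: "there is a vase".
r_vase_coarse = lambda macro: 1.0 if macro == "vase" else 0.0

# Lift to the finer representation by composing with the coarse-graining
# (drop the color component), so refinement does not change its values.
refine_to_coarse = lambda fine_macro: fine_macro[0]
r_vase_lifted = lambda fine_macro: r_vase_coarse(refine_to_coarse(fine_macro))

print(r_vase_lifted(("vase", "blue")), r_vase_lifted(("vase", "green")))  # 1.0 1.0

# A random reward over the finer representation can split the two colors,
# so an AU measure over the finer representation differs from one over R.
r_fine = lambda fine_macro: 1.0 if fine_macro == ("vase", "blue") else 0.0
print(r_fine(("vase", "blue")), r_fine(("vase", "green")))  # 1.0 0.0
```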

An RR measure that only uses reachability functions of single macrostates is not representation-invariant, because the observations included in each macrostate depend on the coarse-graining. However, if we allow the RR measure to use reachability functions of sets of macrostates, then it would be representation-invariant if it is defined over R. Then a function that rewards reaching a macrostate with a vase can be defined in a finer-grained representation by rewarding macrostates with green or blue vases. Thus, both AU and this version of RR are representation-invariant iff they are defined over the most coarse-grained representation consistent with the true reward.
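To make the set-based version concrete, here is a minimal sketch (all state names hypothetical): in the finer representation that distinguishes vase colors, the R-level goal "reach a macrostate with a vase" becomes reachability of the set {blue-vase, green-vase}, which is exactly why allowing reachability functions of sets restores representation invariance:

```python
# Map from each macrostate to the set of macrostates reachable from it,
# in the finer (color-distinguishing) representation.
reachable_fine = {
    "start": {"start", ("vase", "blue"), ("vase", "green")},
    "smashed": {"smashed"},
}

def set_reachability(from_state, target_set, reachable):
    """1 if some macrostate in target_set is reachable from from_state, else 0."""
    return 1.0 if reachable[from_state] & target_set else 0.0

# The R-level "vase" macrostate corresponds to this set of finer macrostates.
vase_set = {("vase", "blue"), ("vase", "green")}
print(set_reachability("start", vase_set, reachable_fine))    # 1.0
print(set_reachability("smashed", vase_set, reachable_fine))  # 0.0
```

A single-macrostate reachability function would instead have to pick one of the color-refined macrostates, and its value would then depend on the coarse-graining.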