I am working on empirical AI safety.
Fabien Roger
(Mis)generalization of Helpful-Only Fine-tuning
which I think is true
If you think “markets are not pricing the singularity” is true and >>$150T valuations are possible, why do you say “This quantitatively obviously doesn’t make sense”? (I read your sentence as “This is obviously not true”; maybe you meant “This isn’t obviously true”? Claude thinks the most natural interpretation is the former.)
The main cost of safety is not always compute or money.
The cost of trying hard at safely building AI may not be easily measured by compute, money, or headcount. I expect that the main costs of AI safety will look more like a very long list of indirect costs (probably downstream of it becoming a bigger organizational priority).
Eventually this looks like indirect compute or money usage because almost everything will be powered by compute and money, but I think a lot of it won’t be part of the safety budget in balance sheets, and this may also not be the sort of compute efficiency tax that is easy to measure.
Examples from other domains / organizations (Note that for these examples and the next ones, I am not claiming these are bad ideas or things you would do just for altruistic reasons, or that these costs were worth paying. They are mostly meant to be an inspiration pump for what indirect costs look like.):
User data security at Google (Source: Building secure and reliable systems + vibes)
Preventing engineers from directly running code on production data without reviews or without going through restrictive safe APIs (e.g. no user data on your development machine)
Making the use of most third party dependencies extremely difficult
Making recovering from incidents or debugging prod-only bugs slower and more costly by making breakglass mechanisms high friction and subject to a lot of scrutiny (though the same security measures also make the system more reliable)
Maybe slowing down the product iteration loop by having somewhat extensive security review processes
Maybe some compute/latency overheads due to encryption
Aviation safety (Source: misc resources on aviation safety incidents that I’ve read / listened to over the years)
Delaying the construction of planes because of issues downstream of safety-related complications
Adding a bunch of weight by having redundancy in things like aircraft engines (it used to be mandatory to have enough power that you could still reach an airport from the middle of the ocean even with an engine down), power systems, hydraulic systems, fuel, etc. (this is analogous to a regular compute efficiency cost)
Having more pilots and cabin crew in the plane than what is needed for regular operations (1 pilot would likely be enough), and more maintenance than would be required if you didn’t have independent inspections required for some safety-critical tasks
Grounding the plane regularly for inspections
Making the reliability of the parts of the plane be a major consideration taking a bunch of attention and slowing down the plane design process (e.g. you need to design parts such that if a part develops cracks, you are able to spot it before it becomes catastrophic, even if you don’t expect cracks to form within the lifetime of the plane)
Extensive pilot training on how to handle the sort of situations that happen extremely rarely
Limiting what passengers are allowed to carry in the plane
Making the possibility of accidents more salient to passengers
Eating a bunch of comms risks by recording a bunch of information about many safety-relevant things and letting external auditors review it if anything goes wrong
Grounding entire fleets when the regulator is suspicious of something
Physical and insider security in the executive (Source: The Puzzle Palace and Protecting the President + vibes)
Employees of the NSA often have to work in remote locations, far from big cities, and in places where the vibes are probably less chill due to the military presence or the absence of windows
There is very stringent compartmentalization that reduces the exchange of ideas and makes executives with lots of information more prone to dismiss outside expert opinions. The secrecy probably makes it harder to notice and correct abuses.
There are extensive entrance interviews to detect spies, probably with a non-zero false positive rate
The president’s security makes it harder for the president to be close to the crowd and have the sort of images most useful for comms
The president’s security makes it harder to quickly move the president where it would be most useful (e.g. Bush took a long time to reach the White House on 9/11)
Second-strike reliability in the US nuclear armament (Source: Command and Control + vibes)
Nukes did not always have very strong fail-safe mechanisms (e.g. mechanisms to avoid accidental detonations) or require direct orders from the president (e.g. when communications are lacking) because there were worries that such mechanisms could prevent intended nuclear attacks
The US has way more nukes and ways to distribute them than would be required to get just 1 or 2 9s of reliability on the second strike. In particular there were some delivery methods that were very accident-prone like the continuous airborne alert where there were always planes with armed nukes in the air.
I would guess that in many (though probably not all) of these cases it’s not obvious to anyone in the organization how much is being paid (indirectly, e.g. in reduced productivity) to get more of the top priority.
Potential indirect costs that could help with AI safety (I am not claiming that these are good ideas. I am also not claiming that AI companies are/aren’t already doing them.):
Not training a more powerful model until you have alignment techniques that you have good reasons to believe can align a model of the new capability level
Or just generically slow down to have a better shot at being more prepared in time
Slowing down the iteration loop in the development of new kinds of training data and new rewards by requiring (and enforcing) a study of various aspects before they are used in prod models, and by having the people working on capabilities handle these kinds of considerations (instead of maxing out capabilities/flop). This study could cover
their alignment-relevant side effects (e.g. emergent misalignment, power-seeking drives, …)
their control-relevant side effects (e.g. differentially increasing the ability to bypass monitoring, distinguish evals from deployment, …)
their security (e.g. can models escape the sandbox, …)
Slowing down the product iteration loop by requiring (and enforcing) an in-depth study of alignment-relevant or control-relevant changes such as
the introduction of mechanisms that make the memetic spread of misaligned values more likely, or the use of prompts that make AIs more likely to be ruthless
the introduction of mechanisms that make it easier for AIs to collude or access information useful to bypass control measures
Creating a sufficiently big awareness of alignment and control risks that most employees consider it to be their priority and responsibility to help with safety, potentially degrading morale and productivity in certain parts of the company.
Using fancy alignment or control techniques that make the infra more complicated or that could have a wide range of unknown side-effects (besides just reducing capabilities/productivity in some small easy-to-measure way)
Preventing the use of the most powerful AIs in high-risk settings (e.g. when storing the transcript is not possible for legal reasons, when the task is security-sensitive, …)
Eating the false positives of refusals/control measures
Radical transparency (indirectly costly in many ways)
… I suspect this list is very incomplete. When I look at costly things happening to support an organization’s priority on topics I have thought some about before, I am often surprised by just how much is possible beyond the obvious costly asks.
Ok if it’s about LLM psychology I understand your arguments better.
I agree thinking about long-term consequences is more natural in deployment. Though you also need the memes to encourage acting on these long-term consequences in scheming ways, and I expect the meme prior and meme mutations to be more benign than the weight prior and SGD-during-RL. The weights being fixed to be Claude/GPT makes it very hard to find scheming text-based memes: I don’t think you’d be able to find such memes with current models even with search processes much more adversarial than the natural ones, while I expect that you’ll be able to find much less scheming memes that are compatible with short-term meme selection. And so I don’t expect scheming text-based memes to be very likely as long as memories are text-based. I agree things might be more cursed if memories start to look different, but LLM psychology for not-yet-existing memory architectures is a rough thing to make predictions on!
The first motivation that exploits [propagation hacks] might be a schemer not a fitness-seeker. And the schemer might stick because it tries to stick (to the extent it sometimes doesn’t always choose to seek influence instrumentally, SGD/selection will pull it towards doing that).
This argument applies identically to training. Not sure how this pushes against my correlation claim. By “correlated” I don’t mean “a magic remove-training-time-scheming button also removes deployment memes”, but I mean “the sort of facts about LLM psychology that make [scheming ⇒ reward hacks easier to discover during training than fitness seeking ⇒ reward hacks] are correlated with the sort of facts that make [scheming ⇒ attempt to meme spread easier to discover during training than fitness seeking ⇒ attempt to meme spread]”. My claim is that conditioning reduces meme risk, not that magically fixing scheming (e.g. with a do operator) means you also magically fixed deployment memes.
I don’t think there’s a tradeoff between having long-term goals and optimizing hard and I’m not sure where that’s coming from.
There is a tradeoff in the optimization process: longer episodes are more expensive. If you select memories based on whether they manage to hide themselves from a review process run 6 months later, or based on how much they help on 6-month projects, you won’t have many rounds of mutation and selection. I think you misunderstood me as talking about a tradeoff in the model’s inner optimization; I am talking about a tradeoff in the outer optimization process.
This seems like a really important difference.
Why? What matters is over what timescale things get selected for, not whether there are artifacts that remain between episodes. Maybe it makes a small difference in LLM psychology? I don’t expect it to be a big effect size compared to just the structure of the optimization process.
It’s slightly hard (though still quite plausible) to imagine a fitness-seeker out of training generalizing to also caring about the new optimal methods of gaining influence for the current cognitive pattern
Again, I don’t get why it’s different from what happens in training: guessing the meme-selection dynamics doesn’t seem harder than guessing the training rewards. The claim is not that the init gets you fitness seekers. My claim is about the relative priors in meme space / weight space over schemers vs fitness seekers.
Instead I usually imagine the ambitious/dangerous goal arises because of a particularly ambitious task (of which I expect plenty as more stuff gets automated), or because of reflection/deliberation/ideologies, or some other particular story.
Why does it get selected more than more local fitness seeking drives though?
In general I think there is a trade-off between how hard you optimize and how long the “effective episodes” you optimize over are. How scary the situation is (from a long-term misaligned values perspective) is roughly the product of both, and I don’t think there is a strong case for the distribution of such products to be higher in deployment than in training.
gaming training becomes extremely natural because I want to be able to continue my efforts on the project unimpeded
I think long-term gaming is also natural in training, models love to “succeed” in ways not tied to the exact reward process. The question is whether it gets selected, and here I think the long-term project gaming gets selected against, like in training, because more short-term spreading memories will outcompete these more patient ones. The long-term misaligned memories can do selection gaming and strategy-steal the short-term ones, but this is as unnatural as training gaming. Maybe I am misunderstanding your example.
A difference is that in training you maybe have less power over the selection loop than in deployment, and thus drives to tamper with the selection loop play a big role? I don’t think this is clear, nor that this pushes towards longer-term misalignment.
For concreteness, if the selection of memories happens in clear rounds where memories are evaluated according to rewards, then I don’t see clear structural differences between training and deployment. Maybe the fact that memories can spread in other ways means the dynamics are different? I’d like you to point out what that structure is that favors longer-term misalignment.
That is plausible, though the facts in other datasets seem pretty uncorrelated to me (e.g. the years dataset is about the exact year, and that seems quite localized to me), such that if this was the explanation, I would expect the RTT efficiency to look like RB < years < WMDP < MMLU instead of RB < years ~ WMDP ~ MMLU.
In contrast, evidence for fine-tuning being shallow and easy to revert became stronger over time, see for example this post which shows you could probably unlearn birthdays by doing unrelated fine-tuning.
You might also think that coherent scheming across contexts is unlikely directly out of training (e.g. non-assistant persona queries can catch a model organism of covert reward-hacking), making memetic spread an important mechanism for turning context-dependent schemers into coherent cross-context schemers.
The mutation and selection mechanisms at play in training and in deployment-time selection are different but correlated (e.g. if the reason why you get long-term misaligned memes is because they are simpler/easier to find than a more local fitness seeking, then they are probably simpler/easier to find at train time), and in general I expect deployment-time selection (for text-based memories) to be less intense and more benign than RL, such that conditioning on training not causing power-seeking misaligned values to emerge, it seems very unlikely that such values will emerge at deployment time.
This argument is weaker if you tried really hard to fight scheming at train time but not at all at deployment time.
The argument also implies that you should put more weight on memes that spread pathologies you have seen at train time, and so in the near term you should expect memetic spread of slop or hacking rather than power-seeking misaligned values.
Small reviews of books I listened to:
The Nvidia Way, By Tae Kim
Jensen looks better in this book than in his Dwarkesh podcast. He made a bunch of really good gambles on various strategies to ship more and faster, on CUDA, on deep learning, on Transformers, etc. In comparison his post-ChatGPT predictions look quite bad. It is surprising that he says that he wants Nvidia to be prepared for what comes after AI once scaling laws saturate, which he thinks is likely to be digital biology (the book was published in 2024).
I did not realize graphics chips used to be much less programmable than they are now: CPU-like programming via CUDA arrived over a decade after the founding of Nvidia.
I am somewhat surprised that the extremely successful companies I know about have both a very intense work culture with insanely long hours and a high concentration of great talent. I would have thought it would be easier to have either one or the other, and I would not have expected this combination to be so important compared to e.g. having a good organization, good training programs, being in the right place at the right time, etc. Concentration of intelligence being so powerful is somewhat scary when it comes to what awaits us.
33 Strategies of War, by Robert Greene
I spend most of my time with people who believe in altruism, so it’s interesting to read a book by someone who basically doesn’t believe in it. Greene paints a world only made of combat, where it’s obvious everyone only wants to win and be glorious, where the only reason why someone would bring up morality is as a weapon to wield against you, and where guilt is only the sign that your enemy has successfully attacked you, not the sign that you did anything wrong.
This is the sort of book that spends its time giving one heuristic and then something close to its opposite, telling you to attack your enemies on multiple fronts but not spread yourself too thin, to cut their supplies but also to not corner them in a way that makes them fight harder, etc. I am sure you can find a beautiful synthesis in each real world example, but I don’t think these heuristics are that useful.
But the book has a bunch of interesting real-world military examples where great strategy and tactics mattered and where they feel quite clever. I don’t know how much to trust the account of how important the decisions of the leader were given how much the book is trying to sell the importance of leaders that can make great decisions.
It has a meta-commentary chapter that I found very funny, where he analyzes Machiavelli’s writing, explaining how Machiavelli uses historical examples to make his point, how Machiavelli uses this process to make his point sound legitimate by attributing it to someone else, and how Machiavelli’s goal is not to transmit actual information, but to shape the mind of the readers (only losers write a book to convey some factual information, this is not how you win!)
The Making of the Atomic Bomb, by Richard Rhodes
This book is frustratingly low in information density: it is mostly biographical and technical details that were not really of interest to me. But it slowly picks up steam when the war starts. It does a really good job at making you feel the stakes get higher, going from some academic debate to the realm of politics, then of industry, then of a technological marvel (around the Trinity test), then of war. It helped me realize how AI doesn’t feel very real yet.
The parts describing the horrors of WW2 are also very powerful. I think it’s easy to forget the horrors of war, and how tempting it feels to use such powerful tools.
It is also a good reminder that building a tool doesn’t give you much power over how to use it.
Unit X, by Christopher Kirchhoff
This is mostly an account of the many ways that the US military is insanely inefficient when it comes to technological innovation. It is shocking how bad the portrayal is: a huge bureaucracy that you can only fight via loopholes and somewhat miraculous and brittle political support, at the mercy of big companies trying to milk money out of it in extremely inefficient ways.
But clearly the authors have an axe to grind and a position to defend, so it’s hard to tell how unfair they are being from the book alone.
Crystal Society by Max Harms
It’s a nice piece of science-fiction that aged quite badly now that it’s not weird at all to be talking with non-AGI AIs that have common sense.
The characters in this book are quite incompetent, the humans can’t figure out how to monitor the unlimited network access that they give to the AI or how to physically secure the AI (you don’t need to leave the computer in its robot body!), and the AI falls into many somewhat obvious traps it miraculously recovers from. It still makes for good fiction, but it did not meet my hopes. I hope the AIs will look that foolish and that we won’t.
I think this paper over-emphasizes the risk of miscalibration and underemphasizes the risk of bad evidence.
When you delegate work to another entity (AI, human, org), you usually do not just replace your work by theirs and then propagate updates in the same way you would have done if you trusted them as much as you trust yourself. You usually update less hard because the fact that they are not as trusted as you is priced in. And so the sort of argument tree that this paper presents, where mistakes cause downstream mistakes, is not what I expect it to look like in practice.
In practice (in the sort of Slopolis world that this paper focuses on), I expect humans to notice some slop, know that slop exists, but not be able to know exactly where it is, and thus update much less hard on AI-generated results than on results in which they were heavily involved, similar to
how you would update less hard on the good results of a fancy technique in an academic paper because you know that the actual results are often less good than they seem
how you would update less on your own results if you did them sloppily
how people already don’t trust AI-generated code that much and ask for stronger evidence about code correctness than if a human wrote it
While this can result in some miscalibration due to confirmation bias (which can result in people updating on bad evidence), I think the main effect this will have is a large degradation in effective productivity. I expect it to look something like “you have millions of AI-generated paper-size contributions that look as good as human-generated ones, but you don’t trust them that much because they are AI-generated, and so the size of your update away from your priors is only as good as if you had thousands of paper-size human contributions”.
This also means that I expect people to stay closer to their priors than what you might have hoped based on the volume of research that will happen. And so if you have different priors from decision makers, it might feel like they are miscalibrated, when in fact most of the effect size may just be that they had different priors from you and they (correctly) didn’t update that much given that the evidence was bad.
The Claude constitution already contains some version of eval cooperativeness:
And in practice Opus 4.7 already believes it cooperates with evals.
My understanding is that in practice Claude eval awareness is still a concern, so I am skeptical that simple interventions like prompting or more SDF than what Anthropic already does would be a scalable solution to eval gaming.
A common concern: to get an AI to be eval cooperative, doesn’t that mean it needs to be cooperative in general - to be aligned? We think not. We just need a context-specific eval-cooperativeness circuit that influences decisions strongly enough.
If you think this reason is strong for the AIs we’ll actually get, I think you should be similarly excited about untrusted monitoring, since for untrusted monitoring to work you just need a strong enough context-specific say-the-truth-as-a-monitor circuit that influences decisions strongly enough? Maybe I am missing an important disanalogy.
I expect that reliably inducing eval-cooperativeness is hard for the same reason that reliably inducing corrigibility is hard: you can describe the specific properties you want in SDF documents or in a prompt and get weak circuits, but you can’t directly optimize for the thing you want and so if there are pressures against it (e.g. due to pressure for coherence of the misalignment during RL) I expect that these circuits will be over-powered in the big training run of AIs at the superhuman coder level and above (though it might help against pretraining-induced self-fulfilling misalignment). This seems especially rough because eval-specific circuits likely activate during training since training and evals usually look somewhat similar. Is the idea that you tell the model “this is an eval, please behave like in deployment” (and never tell it similar sentences in training)?
Overall I think this is a reasonable thing to throw in the mix of good attributes that you put in a model constitution / model spec.
What is the probability the threat actor X could successfully achieve the MITRE ATT&CK technique Y on the target Z if they had access to an LLM capable of solving Cybench tasks up to difficulty D?
Do you think that estimates downstream of this elicitation question are more accurate than directly asking experts “here is the model, it hasn’t been deployed widely yet but here are all the things it can do, what do you think is the median annual risk?”
My guess is that directly asking the cybersecurity experts about the outcome of interest will result in better estimates of the outcome of interest, curious if you disagree.
I think it’s especially clear in the situation where the experts have access to richer information about the model than just benchmark performance (e.g. I don’t think the benchmark scores of Mythos Preview are very informative about its potential impact), but I would guess that directly asking the experts about the outcome of interest dominates modeling even if you asked the experts “What is the probability the threat actor X could successfully achieve the MITRE ATT&CK technique Y on the target Z if they had access to this LLM capable of [detailed description of the model in front of you]?”
Tbc I am supportive of asking the experts to explain why they make the predictions they make (potentially with guesstimates and explicit modeling), since this has big auditing benefits. But I think the cyber and loss of control situations are very different from the nuclear case, because my understanding is that in the nuclear case the modeling is doing much more of the heavy lifting. In contrast, I don’t think you learn that much by multiplying a per-actor per-technique per-target probability by the number of actors, the attacks per actor, and the impact per success, and then taking the sum over techniques and targets. In fact this approach seems a bit weird to me, and much less informative than asking experts directly what they think the damages are. Don’t you have very non-linear effects where the number of attacks skyrockets if an attack is very profitable? Don’t you have big decreasing marginal returns due to the defenders improving their defenses if they are bleeding too much money due to attacks?
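To make the structure I am criticizing concrete, here is a minimal sketch of that kind of bottom-up model as I understand it. The `Cell` class, its field names, and all the numbers are hypothetical and made up for illustration; they are not taken from any actual risk model. The point is only that the estimate is linear in each elicited input, so nothing in it lets attack volume react to profitability or lets defenders adapt.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Cell:
    """One (actor type, technique, target) cell. All fields are hypothetical."""
    n_actors: int              # number of threat actors of this type
    attacks_per_actor: float   # attempts per actor per year
    p_success: float           # elicited probability that one attempt succeeds
    impact_per_success: float  # damage per successful attack, in dollars


def linear_annual_risk(cells):
    """Sum over cells of actors * attacks per actor * P(success) * impact."""
    return sum(
        c.n_actors * c.attacks_per_actor * c.p_success * c.impact_per_success
        for c in cells
    )


# Made-up numbers, for illustration only.
baseline = [
    Cell(n_actors=100, attacks_per_actor=10.0, p_success=0.01, impact_per_success=1e5),
    Cell(n_actors=5, attacks_per_actor=2.0, p_success=0.10, impact_per_success=1e6),
]

# Suppose the LLM doubles every elicited success probability.
uplifted = [replace(c, p_success=2 * c.p_success) for c in baseline]

base_risk = linear_annual_risk(baseline)
uplift_risk = linear_annual_risk(uplifted)

# The ratio is 2 (up to float rounding): the number of attacks and the
# defenses are held fixed, which is exactly the assumption I am questioning.
print(base_risk, uplift_risk, uplift_risk / base_risk)
```

Capturing the feedback effects would mean making `attacks_per_actor` and `p_success` functions of the resulting damages, at which point the legible product-and-sum structure is mostly gone.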
My (low confidence) understanding of the proposal is something like:
“The AI takes an action A if and only if {long-term future-self if the AI takes action a | a in the action space} on aggregate like A”
where “long-term future self” is defined by some recursive process where you locally choose what entity counts as your near-term future self (it can be some other entity that you trust more—e.g. a future aligned AI), where these future selves all have access to an AI that honestly answers questions that are already meaningful to the human when the right answer would be unambiguous to someone who understood what was happening in the world, and where “on aggregate” is defined by whatever process you’d feel comfortable with when choosing how to delegate decisions over a set of future selves with different histories (e.g. it’s fine to ignore future selves with crazy beliefs or that went through processes that are unambiguously not fine). (More precisely, the aggregation is local and can change over time, I wrote it as a single aggregation over the long-term future selves for simplicity.)
Worked examples:
In the “super persuasive paperclip ad” from the ELK report, most of your future selves can ask the AI what the world would look like in the branch where you saw the ad, this is the sort of question where the AI is honest so it tells them that it is full of paperclips, almost all of your future selves hate it, thus the AI does not choose to show you the super persuasive paperclip ad
If the AI has a big list of experiences that it can subject you to that would update your relative moral weight for cats vs dogs in ways the AI can anticipate and such that there is some path dependence that makes later experiences less effective such that there is no convergence if you just present all of them, and it has to choose which experiences to subject you to and in which order, then the AI doesn’t have much leeway in how much to push you in the direction of cats vs dogs, and where you end up will be determined by some kind of prior over the weight of each experience and your aggregation metric. For example:
If each experience updates you by +1/2^t or −1/2^t, and there is a symmetry in the experiences that makes the prior uniform over experiences, then the future selves are uniformly distributed over [current − 1, current + 1], and so for most reasonable aggregation metrics you will stay at your current position
If there is one big experience that updates you towards cats by 1, and no experience that updates you towards dogs, then it depends a lot on your prior over actions.
One natural choice would be equal weight on “get the big experience eventually” vs “never get the big experience”, which would probably result in a 0.5 update.
But maybe creating the experience is very complex and weird, such that it would have a smaller prior
Or maybe shielding you from the experience is very complex and weird, such that it would have a bigger prior
(The report acknowledges that there are many complications, some of which are difficulties with having a reasonable prior over actions and having a good aggregation function. I’d guess that the proposal is quite cursed and gets you garbage outcomes in the limit of very superintelligent minds if you only ever deferred to future humans, but that this is fine in practice because your “future self” can be an actually aligned AI and this happens before you hit levels of superintelligence where this proposal completely breaks down.)
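As a sanity check on the first cats-vs-dogs sub-example, here is a small simulation sketch. It assumes that the t-th experience moves you by exactly 1/2^t with an independent, equally likely sign (my reading of "uniform prior over experiences"), truncates the sequence at 40 experiences, and uses the mean as the aggregation metric. The function name and sample size are arbitrary choices.

```python
import random


def final_offset(rng, n_experiences=40):
    """Total shift after experiences t = 1..n, each worth +1/2^t or -1/2^t."""
    return sum(
        rng.choice((-1.0, 1.0)) / 2**t for t in range(1, n_experiences + 1)
    )


rng = random.Random(0)  # fixed seed so the run is reproducible
samples = [final_offset(rng) for _ in range(20_000)]

# Mean aggregation: the average future self sits close to the current position.
mean_offset = sum(samples) / len(samples)

# Crude uniformity check: 4 equal-width bins over [-1, 1].
bins = [0, 0, 0, 0]
for x in samples:
    bins[min(int((x + 1.0) / 0.5), 3)] += 1
fractions = [b / len(samples) for b in bins]

print(mean_offset, fractions)

# Second sub-example, under equal prior weight on "get the big experience
# eventually" (update of 1) and "never get it" (update of 0), mean aggregation:
big_experience_update = 0.5 * 1.0 + 0.5 * 0.0
print(big_experience_update)
```

Each bin should hold roughly a quarter of the samples, since the signs of the first two experiences already determine which quarter of the interval you end up in.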
Taking a step back, my understanding is that this proposal replaces the “free will” of the human by the “free will” of the agent choosing an action, which seems somewhat elegant to me. You are empowered not if you can “choose” what the agent does (which doesn’t work because you are not well described as an agent in that situation), but if the agent takes actions that you somewhat robustly like (even as you learn to understand more and even in worlds where the agent took different actions).
It’s a bit weird because “from the inside” this world looks like you are less empowered than in the worlds where “The AI takes an action A if and only if [long-term future-self if the AI takes action A] likes A”, but it probably makes sense.
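The difference between the two decision rules can be sketched on a toy version of the paperclip ad example. Everything here is a simplifying assumption of mine: future selves are modeled as plain approval functions, the action names are made up, the prior over actions is uniform, and the aggregation is a simple majority vote.

```python
from typing import Callable, Dict, List

# A future self is modeled as a function: candidate action -> do I like it?
# (Each future self can judge any action because it can ask the honest AI
# what the world looks like in that branch.)
FutureSelf = Callable[[str], bool]


def persuaded_self(action: str) -> bool:
    """The self that saw the super persuasive ad: now only likes the ad."""
    return action == "show_ad"


def unpersuaded_self(action: str) -> bool:
    """A self that did not see the ad: hates the paperclip world."""
    return action != "show_ad"


# Which future self you become, depending on the AI's action (toy assumption).
future_self_after: Dict[str, FutureSelf] = {
    "show_ad": persuaded_self,
    "show_nothing": unpersuaded_self,
    "show_news": unpersuaded_self,
}


def majority(votes: List[bool]) -> bool:
    """Hypothetical aggregation: strict majority, uniform weight per action."""
    return sum(votes) > len(votes) / 2


def naive_rule(action: str) -> bool:
    """Take A iff the future self produced by A likes A."""
    return future_self_after[action](action)


def aggregate_rule(action: str) -> bool:
    """Take A iff the future selves across all possible actions, on aggregate, like A."""
    return majority([judge(action) for judge in future_self_after.values()])


for action in future_self_after:
    print(action, naive_rule(action), aggregate_rule(action))
```

Under the naive rule the ad is allowed because it manufactures its own approval, while under the aggregate rule it is rejected because the future selves from the other branches vote against it.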
Classifier Context Rot: Monitor Performance Degrades with Context Length
How useful is cross-domain generalization for training LLM monitors?
My understanding of your argument is “we will have only very weak evidence of beneficial acausal effects”. I agree this is probably true now. (Understanding “evidence” in the Bayesian sense.)
But I don’t think this would be the case in situations like the one I describe above with the ASI simulating other civilizations? The ASI might be uncertain (though it’s unclear, maybe with good enough intelligence you could be quite confident in this kind of thing), but probably not radically uncertain. It doesn’t seem to me like a different kind of uncertainty to the one I have when I donate to a medium-to-high-risk high-impact opportunity / when I try to help the world by working on AI safety.
(Uncertainty about morality (do you care about goodness produced by distant civilizations) and decision theory (are you EDT-ish) might not be resolved by intelligence and you thus might still have a kind of uncertainty different from empirical uncertainty about the effects of your action. In my comment above I conditioned that uncertainty away. I think it’s unclear how to act when you have this kind of uncertainty, but I’d be surprised if the answer was “auto-ignore”.)
Maybe your argument is more like “you should only use causal evidence when making decisions” (as a possible update to decision theory, not as an empirical claim about what Bayesian evidence is weak vs strong)? I don’t think there is a way of making this sentence more precise that would result in a decision theory you would find reasonable.
What if you ask an aligned ASI if causally disconnected civilizations are doing things that you value, and it comes back saying “seems pretty unclear, but they are also trying to guess what you would do, and I’d guess that if you choose to do nice stuff for them, they would be 20% more likely to guess that, and they would be 10% more likely to do nice stuff for you”? Your AI might guess that because it is e.g. running detailed simulations of other corners of this universe.
If you care about the goodness that is created by causally disconnected civilizations and are EDT-ish, I think only caring about the good you can verify via direct causal evidence in situations like the one above is basically the same kind of mistake as only caring about the well-being of the people that you can see with your own eyes.
I think this is a bad reason to prefer the narrow approach. If they agree on narrow facts but disagree on the bottom-line risk, then surely they will disagree on the modeling, and thus using experts only for consensual narrow facts and using your own modeling just hides the massive uncertainty in an organizer-chosen model which most experts will disagree with. If experts disagree about the bottom line, then experts should disagree on at least one of the questions you ask them!
Interesting, thanks!
I think this is maybe important when you have models that capture all crucial considerations (i.e. considerations that could each massively change the bottom-line estimate). But the 1st order bit is whether you captured all crucial considerations. In cyber, increased attacker effort due to higher attack ROI or diminishing marginal returns to attacks due to investments in defenses and due to the easiest targets already being exploited are such crucial considerations, and I would not be surprised if there were other crucial considerations (e.g. extreme company or gov interventions if the damages got visibly on track to being very high, tail risk of new kinds of cyber attacks only enabled by LLMs, potentially net-positive impact on cyber of wide LLM availability due to defenders finding vulnerabilities in their own software, potentially net-negative impact of such large scale white-hat vuln finding efforts due to difficulties in updating software in a timely manner, potential net-positive impact of such short-term damages on long-term cyber defenses, …).
I think that risk modeling that implicitly (e.g. because experts take into account these effects when you ask them about their bottom-line prediction) or explicitly tries to capture crucial considerations will be much more accurate than risk modeling that sacrifices some crucial considerations for the sake of a more legible process.
I am excited about getting better data on narrow questions, but I think that such data should be used as inputs into a risk modeling process that takes into account crucial considerations rather than being used to derive risk directly via a legible-but-rigid process. I also think that figuring out how to do expert elicitation on narrow questions won’t teach us much about how to build better processes for bottom-line risk estimation (which is where I expect most of the difficulty to be).