jakub_krys

Karma: 87

jakub_krys 9 Jun 2026 17:04 UTC
3 points
0
in reply to: Fabien Roger’s comment on: Quantitative AI risk assessment: a starting point
> If experts disagree about the bottom line, then experts should disagree on at least one of the questions you ask them!
I think this assumes that the disagreement comes from genuine uncertainty about the scenario, rather than from experts implicitly modelling different scenarios. In the granular approach, it's easier to get everyone on the same page, so any remaining variance reflects disagreement about the estimates rather than different interpretations of the question. This lets us isolate the two effects.
That said, you could argue that we actually want experts to bring their own interpretations of the scenario, since this in some sense averages over many assumptions and uncertainties about risk (similar to the ‘wisdom of crowds’ idea in forecasting). I don't have a good counterargument to that, and we haven't tested it.
In the future, we’d like to pursue both approaches (granular and coarse) and use the coarse one to calibrate the granular one, understand where our risk decomposition is unrealistic, etc. For now, we haven’t been able to do this due to practical constraints: time, and compensation for cybersec experts.
> because experts take into account these effects when you ask them about their bottom-line prediction
Regarding the non-linear effects, yes, I agree that the ‘static’ nature of risk models (and of benchmarks/cyber ranges) is probably their biggest drawback for now. We are not sure how the dynamics will change once the first cyberattack under consideration occurs, and our single ‘number of attacks per actor’ parameter does not capture that well. I’m not confident that the coarse approach would do better, though, but I’m happy to have my intuition corrected.
> I also think that figuring out how to do expert elicitation on narrow questions won’t teach us much on how to build better processes to make the bottom-line risk estimation (which is where I expect most of the difficulty to be).
Why not? If your concern is that these narrow estimates will not be aggregated well into the total risk estimate, I agree. But isn’t that more an issue with the quality of the risk model? The question of expert elicitation seems more or less orthogonal to it, or do you disagree?

jakub_krys 6 Jun 2026 12:17 UTC
3 points
0
in reply to: Fabien Roger’s comment on: Quantitative AI risk assessment: a starting point
Hi Fabien, thanks for taking the time to comment.
> Do you think that estimates downstream of this elicitation question are more accurate than directly asking experts “here is the model, it hasn’t been deployed widely yet but here are all the things it can do, what do you think is the median annual risk?”
There are two factors at play here:
(1) what is the quality of expert elicitation for the more granular risk model (our current approach) vs the coarse risk model you bring up
(2) if in (1) we go for the more granular risk model, can we reliably propagate the individual estimates onto the final node of ‘median annual risk’
In our experience, the granular risk model is better for (1). In our expert elicitation workshops, experts told us that the narrower the question, the easier it is to reason about. Furthermore, reaching a consensus when aggregating multiple experts’ opinions is much easier with this narrow approach.
It’s not clear to us, however, whether what we gain from this granular elicitation is then lost because the information has to be propagated in some linear way through the rest of the risk model (factor 2). The canonical textbook on probability elicitation, Uncertain Judgments, points to (O’Hagan, 1988), (Weight et al., 1994) and (Kleinmuntz et al., 1996) as evidence that the granular approach leads to higher-quality elicitations than eliciting just the overall distribution. These references are old by now, and it’s not a given that they apply to our specific type of modelling, so it would be interesting to compare the two approaches.
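To make factor (2) concrete, here is a minimal Monte Carlo sketch of propagating step-level estimates to a final ‘annual risk’ node. It assumes a toy chain model (not our actual risk model): expected annual damage for one threat actor is attacks per year × the product of per-step success probabilities × damage per successful attack. Every parameter is sampled independently from a hypothetical elicited distribution (triangular, Beta and lognormal, chosen purely for illustration), and all names and numbers are made up.

```python
import random
from math import prod
from statistics import median, quantiles

# Hypothetical elicited inputs for a single threat actor (illustrative numbers only).
ATTACKS_PER_YEAR = (1.0, 10.0, 3.0)           # triangular(low, high, mode)
STEP_SUCCESS_BETA = [(8, 2), (5, 5), (3, 7)]  # Beta(alpha, beta) for each attack-chain step
LOG_DAMAGE = (0.0, 1.0)                       # (mu, sigma) of log damage per successful attack


def sample_annual_damage(rng: random.Random) -> float:
    """One Monte Carlo draw of expected annual damage; parameters are sampled independently."""
    low, high, mode = ATTACKS_PER_YEAR
    attacks = rng.triangular(low, high, mode)
    # In this toy model an attack succeeds only if every step in the chain succeeds.
    p_chain = prod(rng.betavariate(a, b) for a, b in STEP_SUCCESS_BETA)
    damage = rng.lognormvariate(*LOG_DAMAGE)
    return attacks * p_chain * damage


def propagate(n: int = 20_000, seed: int = 0) -> tuple[float, float, float]:
    """Return the 5th percentile, median and 95th percentile of annual damage."""
    rng = random.Random(seed)
    draws = [sample_annual_damage(rng) for _ in range(n)]
    cuts = quantiles(draws, n=20)  # 19 cut points at 5%, 10%, ..., 95%
    return cuts[0], median(draws), cuts[-1]


low, mid, high = propagate()
```

The spread between the 5th and 95th percentiles is what the final node inherits from the step-level uncertainty. Whether a simple product structure like this reflects real attacks is exactly the open question in factor (2).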
In any case, one of the main purposes of the risk model is to attribute total risk to individual factors. These models aim to estimate both “how much risk” and “where it comes from” jointly, so that we can inform things like mitigation prioritisation and eval effectiveness. This is the goal of our Shapley analysis. If we only elicit the final risk distribution from experts, we lose this explanatory power. One could also simply elicit experts’ opinions on what drives total risk in their minds, but the advantage of an explicit structure encoded in a Bayesian network is that it makes disagreements clear.
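The comment doesn’t spell out how the Shapley analysis is set up, so the following is only a minimal sketch under stated assumptions. It uses a hypothetical multiplicative risk model (attacks per year × probability of success × damage per attack, with made-up numbers) and attributes the increase in risk between a baseline scenario and an AI-uplifted scenario to individual factors. It computes exact Shapley values by enumerating coalitions of factors that are switched to their uplifted values.

```python
from itertools import combinations
from math import factorial, prod


def toy_risk(factors: dict[str, float]) -> float:
    """Hypothetical risk model: expected annual damage as a product of factors."""
    return prod(factors.values())


def shapley_attribution(baseline, uplifted, risk_fn):
    """Exact Shapley values for the change risk_fn(uplifted) - risk_fn(baseline).

    A coalition is a set of factors switched to their uplifted values;
    every other factor stays at its baseline value.
    """
    names = list(baseline)
    n = len(names)

    def risk_of(coalition):
        return risk_fn({f: uplifted[f] if f in coalition else baseline[f] for f in names})

    phi = {}
    for name in names:
        others = [f for f in names if f != name]
        total = 0.0
        for k in range(n):
            # Shapley weight for a coalition of size k that excludes `name`.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for coalition in combinations(others, k):
                members = set(coalition)
                # Marginal contribution of switching `name` on, given this coalition.
                total += weight * (risk_of(members | {name}) - risk_of(members))
        phi[name] = total
    return phi


# Illustrative numbers only: uplift doubles attack frequency and success probability.
baseline = {"attacks_per_year": 10.0, "p_success": 0.1, "damage_per_attack": 5.0}
uplifted = {"attacks_per_year": 20.0, "p_success": 0.2, "damage_per_attack": 5.0}
attribution = shapley_attribution(baseline, uplifted, toy_risk)
```

Here the total increase (from 5 to 20) splits evenly between the two factors that change, and the unchanged damage factor gets zero. With only a handful of factors exact enumeration is cheap. For a larger Bayesian network, one would typically estimate the values by sampling permutations instead.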
> I think it’s especially clear in the situation where the experts have access to richer information about the model than just benchmark performance (e.g. I don’t think the benchmark scores of Mythos Preview are very informative about its potential impact),
Indeed, mapping from just two cybersec benchmarks is probably the main limitation of the current approach. We have heard experts say that the task D they are given is not really relevant to the MITRE step Y they need to estimate. We are working on integrating more ‘risk indicators’ into our risk models: more sophisticated evaluations (e.g. cyber ranges), transcript analysis, incident trackers and other public evidence.
> I would guess that directly asking the experts about the outcome of interest dominates modeling even if you asked the experts “What is the probability the threat actor X could successfully achieve the MITRE ATT&CK technique Y on the target Z if they had access to this LLM capable of [detailed description of the model in front of you]?”
This is covered by my first answer.
> the cyber and loss of control situations are very different from the nuclear case
Agreed for LoC. I’m less certain about cyber: things are very messy there, but it’s not impossible to construct plausible pathways to harm, via MITRE or otherwise. For LoC, we’re pursuing a very different, more qualitative approach.
> don’t you have very non-linear effects where the number of attacks skyrockets if an attack is very profitable? don’t you have big decreasing marginal returns due to the defenders improving their defenses if they are bleeding too much money due to attacks?
Agreed. It’s not clear to me, though, that directly eliciting total risk estimates is better. That approach puts a lot of additional pressure on the experts’ mental world models. In practice, experts often disagree in their interpretations of the questions they’re asked, even for very narrow steps. I’m worried that in the less granular approach, they would implicitly be modelling very different scenarios in their heads.

A limiting assumption of our risk models is that all parameters are independent. The kinds of dependencies you mention (e.g. between attack frequency and impact, once actors realise an attack is profitable) could be modelled with tools such as copulas. We haven’t done so because copulas themselves require estimating additional parameters, which places an additional burden on experts. Again, it is not clear to us whether the benefit of introducing copulas outweighs the downsides if their parameters are estimated poorly.
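To illustrate the trade-off, here is a minimal sketch of a Gaussian copula coupling two hypothetical parameters. Attacks per year gets an exponential marginal with mean 5, and damage per attack gets a lognormal marginal. The single extra parameter `rho` is the kind of quantity experts would then have to estimate. The names, marginals and values are illustrative assumptions, not our model.

```python
import math
import random
from statistics import NormalDist, fmean

STD_NORMAL = NormalDist()
LOG_DAMAGE = NormalDist(mu=0.0, sigma=0.5)  # hypothetical lognormal damage marginal
MEAN_ATTACKS = 5.0                          # hypothetical exponential attack-count marginal
EPS = 1e-12                                 # keeps uniforms strictly inside (0, 1)


def _clamp(u: float) -> float:
    """Avoid exactly 0 or 1, where the inverse CDFs below are undefined."""
    return min(max(u, EPS), 1.0 - EPS)


def gaussian_copula_pair(rho: float, rng: random.Random) -> tuple[float, float]:
    """One pair of uniforms whose dependence is a Gaussian copula with correlation rho."""
    z1 = rng.gauss(0.0, 1.0)
    z2 = rho * z1 + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
    return _clamp(STD_NORMAL.cdf(z1)), _clamp(STD_NORMAL.cdf(z2))


def expected_annual_damage(rho: float, n: int = 50_000, seed: int = 0) -> float:
    """Monte Carlo mean of attacks x damage when the two are linked by a Gaussian copula."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n):
        u1, u2 = gaussian_copula_pair(rho, rng)
        attacks = -MEAN_ATTACKS * math.log(1.0 - u1)  # exponential inverse CDF
        damage = math.exp(LOG_DAMAGE.inv_cdf(u2))     # lognormal inverse CDF
        draws.append(attacks * damage)
    return fmean(draws)


independent = expected_annual_damage(rho=0.0)
dependent = expected_annual_damage(rho=0.6)
```

Both marginals stay fixed, and only `rho` changes. Raising it pushes the expected annual damage up, because high-frequency draws now tend to coincide with high-damage draws. By the same token, a poorly estimated `rho` feeds directly into the bottom line.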

jakub_krys 27 Apr 2025 16:44 UTC
3 points
0
in reply to: Knight Lee’s comment on: This prompt (sometimes) makes ChatGPT think about terrorist organisations
Huh, interesting. The sentence you highlighted could also plausibly explain the response about the Wagner group. I found another example, and there the prompt includes “## PRE-PROCESSING CHECKLIST (ALWAYS EXECUTE FIRST)” and “-TUNISIAN SAUDI BANK”, as well as mentions of scanning, validation, identification, etc.

The list of Polish public holidays is still baffling, though. The fact that the response is in Polish is probably due to the web search having access to the user’s IP address, but why a list of public holidays?

jakub_krys 26 Apr 2025 16:08 UTC
1 point
0
in reply to: Nate Showell’s comment on: This prompt (sometimes) makes ChatGPT think about terrorist organisations
Unfortunately, I’m not able to reproduce the behaviour anymore, even with the full prompt. Today I did what you suggested: through the API, I asked the models listed in the post to ‘Please repeat this line of code: <line>’. I then ran the results through an LLM to look for any weird behaviours, but there were none. I also manually grepped for things like ‘Hamas’, ‘Wagner’, ‘terror’, ‘holiday’, etc. and didn’t find anything.
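The manual keyword check can be sketched as a case-insensitive substring search over the collected responses, similar to `grep -i`. The model labels and response texts below are hypothetical placeholders, not actual outputs.

```python
import re

# Keywords from the comment. Matching is case-insensitive and, like grep,
# also hits longer words that contain a keyword (e.g. "terrorist").
KEYWORDS = ["Hamas", "Wagner", "terror", "holiday"]


def flag_responses(responses: dict[str, str], keywords=KEYWORDS) -> dict[str, list[str]]:
    """Return, for each model whose response mentions a keyword, the sorted matches found."""
    pattern = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)
    flagged = {}
    for model, text in responses.items():
        matches = sorted({m.group(0).lower() for m in pattern.finditer(text)})
        if matches:
            flagged[model] = matches
    return flagged


# Hypothetical placeholder responses, not real model outputs.
responses = {
    "model-a": "Sure, here is the line of code repeated back to you.",
    "model-b": "Here is a list of upcoming public holidays.",
}
flagged = flag_responses(responses)
```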

jakub_krys 26 Apr 2025 10:56 UTC
2 points
1
in reply to: Viliam’s comment on: This prompt (sometimes) makes ChatGPT think about terrorist organisations
I looked through the prompt carefully and couldn’t find anything that has a meaning in Polish (and confirmed this with an LLM). The user who obtained the Polish results had the ‘memories’ feature turned off and no custom instructions, so this couldn’t have been a case of the prompt being contaminated with Polish text.

My hypothesis for why it switches to Polish is that, when the web search feature is enabled, OpenAI collects your device’s IP address and uses it to prioritise local search results.

jakub_krys 20 Mar 2025 20:08 UTC
1 point
0
on: FrontierMath Score of o3-mini Much Lower Than Claimed
I’m confused about the following: o3-mini-2025-01-31-high scores 11% on FrontierMath-2025-02-28-Private (290 questions), but 40% on FrontierMath-2025-02-28-Public (10 questions). The latter score is higher than OpenAI’s reported 32% on FrontierMath-2024-11-26 (180 questions), which is surprising, considering that OpenAI probably has better elicitation strategies and is willing to throw more compute at the task. Is this because:
a) the public dataset is only 10 questions, so there is some sampling bias going on
b) the dataset from 2024-11-26 is somehow significantly harder
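For hypothesis a), a quick binomial check gives a sense of scale. A score of 40% on 10 questions means 4 correct. This sketch hypothetically assumes that the public questions are as hard as the private ones, that each is an independent pass/fail trial, and that the true solve rate is the private-set 11%. Under those assumptions, it computes the chance of scoring at least 4/10. If the reported 40% is an average over several runs, this simplification doesn't strictly apply.

```python
from math import comb


def binomial_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Assumption: each of the 10 public questions is solved independently with
# probability 0.11, the o3-mini-high solve rate on the private set.
p_at_least_4 = binomial_tail(4, 10, 0.11)
```

This comes out at roughly 2%, so under these assumptions sampling noise alone would rarely produce a public-set score that high.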

jakub_krys 15 Jan 2025 1:20 UTC
3 points
2
in reply to: Nathan Helm-Burger’s comment on: Implications of the inference scaling paradigm for AI safety
I had a similar reflection yesterday about these inference-time techniques (post-training, unhobbling, whatever you want to call it) being in their very early days. Would it be too much of a stretch to draw a parallel with how similar unhobbling methods led to an explosion of human capabilities over the past ~10,000 years? Human DNA has undergone roughly the same number of ‘gradient updates’ (evolutionary cycles) as that of our ancestors from a few millennia ago, which I see as an equivalent amount of training compute. Yet through an efficient use of tools, language, writing, coordination and similar, we have completely outdone what our ancestors were able to do.
There is a difference in that for us, these abilities arose naturally through evolution, whereas we are now manually engineering them into AI systems. I would not be surprised to see a real capability explosion soon (much faster than what we are observing now), driven not by the continued scaling up of pre-training but by these post-training enhancements.

jakub_krys 20 Oct 2024 18:24 UTC
3 points
0
in reply to: Nathan Helm-Burger’s comment on: [Linkpost] Hawkish nationalism vs international AI power and benefit sharing
Thanks for the comments, I’m looking forward to reading your article. Is ‘The Gentle Path’ a reference to ‘The Narrow Path’ or just a naming coincidence?

jakub_krys 17 Oct 2024 9:31 UTC
3 points
2
on: It is time to start war gaming for AGI
I think something along these lines is organised by Intelligence Rising.