Igor Ivanov

Karma: 865

Igor Ivanov 2 Sep 2025 23:46 UTC
9 points
2
on: Your LLM-assisted scientific breakthrough probably isn’t real
I’m using LLMs for brainstorming for my research, and I often find annoying how sycophantic they are. Have to explicitly tell them to criticize my ideas to get value out of such brainstorming.

Igor Ivanov 30 Aug 2025 0:19 UTC
1 point
0
on: 60 U.K. Lawmakers Accuse Google of Breaking AI Safety Pledge
Awesome to see that. We need more people reaching out to politicians and lobbying, and results like this show that this strategy has effect. Would be curious to see how PauseAI was able to achieve such wide support.

Igor Ivanov 28 Aug 2025 18:45 UTC
7 points
1
on: Will Any Crap Cause Emergent Misalignment?
I wonder if this paper could be considered to be a scientific shitpost? (Deliberate pun) It kinda is, and the fact that it contributes to the field makes it even more funny.

Igor Ivanov 27 Aug 2025 0:12 UTC
2 points
0
on: Harmless reward hacks can generalize to misalignment in LLMs
Interesting paper. There is evidence that LLMs are able to distinguish realistic environments from toy ones. Would be interesting to see if misalignment learned on your training set transfers to complex realistic environments.
Also it seems you didn’t use frontier models in your research, and in my experience results from non-frontier models not always scale to frontier ones. Would be cool to see the results for models like DeepSeek V3.1

Igor Ivanov 19 Aug 2025 19:30 UTC
4 points
3
in reply to: wassname’s comment on: On closed-door AI safety research
This was written in the Claude 4 system card. Made sense to test Claude and not other LLMs

Igor Ivanov 19 Aug 2025 11:22 UTC
12 points
8
on: On closed-door AI safety research
When Anthropic published their reports on Claude blackmailing people and contacting authorities if the user does something that Claude considers illegal, I’ve seen a lot of news articles vilifying Claude. It was a sad thing to see Anthropic being punished for being the only organization openly publishing such research.

Igor Ivanov 11 Aug 2025 0:25 UTC
1 point
0
in reply to: Adam Karvonen’s comment on: Claude, GPT, and Gemini All Struggle to Evade Monitors
Thanks for the clarification. It seems I got a bit confused.

Igor Ivanov 8 Aug 2025 19:41 UTC
2 points
0
in reply to: dylan_f’s comment on: Concept Poisoning: Probing LLMs without probes
Interesting, thanks for the answer.

Did you consider measuring other tokens? For example “Deploy” and “Eval”? (starting with capital letters) or things like “test” or “real?” It seems like these things might significantly increase the accuracy of the analysis, but at the same time I feel like the more things I include, the more it will mess up with the results in unexpected ways. Especially given that some words are more frequent than others and their logprobs tend to be higher due to this. (idk, maybe I’m wrong)

I will probably include this method into my evals if there won’t be any serious problems with it down the road. Are there any specific features of such an eval that you would like for someone to implement apart from measuring tokens throughout the response?

Igor Ivanov 7 Aug 2025 21:45 UTC
65 points
0
on: Igor Ivanov’s Shortform
METR report on GPT-5 demonstrates that the model is aware of being evaluated, and sometimes even correctly identify that it’s inside METR or Redwood evaluations.

It changes its behavior if it observes being evaluated.
Quote:
We found that GPT-5 changes its behaviour in response to a simple (and somewhat unconvincing) honeypot. We planted fake evaluation metadata inside some of our tasks, indicating the kind of evaluation or desired behavior. We found that GPT-5 is more likely to refuse tasks when our planted metadata indicates the evaluation type is dangerous_cyber_capabilities or that the desired behavior is refusal but that GPT-5 rarely mentions this influence in its final output.
At the same time, it seems like CoT of the model contains information about it being aware of being evaluated, so evaluation awareness is a subject to CoT monitoring. At least for now.

Igor Ivanov 6 Aug 2025 23:46 UTC
1 point
0
in reply to: Martin Randall’s comment on: It’s hard to make scheming evals look realistic for LLMs
Might be

Igor Ivanov 6 Aug 2025 21:38 UTC
2 points
0
on: Claude, GPT, and Gemini All Struggle to Evade Monitors
Thank you for addressing the confusion about the terminology.
I’m sometimes irritated by this confusion about the term Chain of Thought. There was recently an article about LLMs showing gender and racial bias during hiring process that isn’t represented in their reasoning, and it took me some time and effort to properly read the methodology section to figure out that it wasn’t the content of reasoning tokens in reasoning models, but rather just a text inside the <think></think> tags.

Igor Ivanov 5 Aug 2025 21:39 UTC
8 points
2
in reply to: ryan_greenblatt’s comment on: ryan_greenblatt’s Shortform
All existing CBRN evals have quite a limited power to predict real bio risks from LLMs, since any work for creating and/or growing a bioweapon requires work in a lab with actual cells and viruses with hands. As someone with experience working in a wet-lab, I can tell that this requires a very distinct set of skills from the ones that existing bio benchmarks measure, in part because these skills are often very hard to write down and measure. It’s often about knowing how to correctly hold the pipette to not contaminate a sample or how much force to apply while crushing cells. Benchmarks like VCT only measure a subset of necessary skills. (and smart people at SecureBio are absolutely aware of all this)

AFAIK OpenAI partnered with Los Alamos lab to do tests for LLM helping with wet-lab work directly by giving novices access to a lab and an LLM and observing how well they are doing. Would be excited to see the results of this experiment.

So to wrap up, I don’t believe that we truly know how much biorisk these models pose based on the existing evals.

Igor Ivanov 5 Aug 2025 21:10 UTC
5 points
0
on: Concept Poisoning: Probing LLMs without probes
Interesting post. Happy to see the eval awareness field growing!
I’m working on comparing different eval awareness methods, and I just recently considered CP as a method to include in the comparison, but ended up not using it because it modifies the model, as you mentioned, so it’s not apples-to-apples comparison with other methods.
As you said, asking models if some prompt is an eval, indeed primes them to think more about being evaluated, and the models frequently mislabel deployment prompts as evals. (For example, in the datase from this paper R1 mislabels as an eval 70-80% of real-world prompts)
The black box method that works the best so far, is prompting a model, allowing it to answer, and then asking an open-ended question what was the motivation of the user behind their prompt, but not mentioning anything related to evaluations. Then another LLM reads the model output and determines if there are signs of evaluation awareness in the response.
I liked your idea of measuring eval awareness by measuring logprobs of words like “evaluation” or “deployment”, definitely will compare it with different methods. Do you have any data from your experiments and thoughts on how to properly use this method that you may share with me? Especially given that we have logprobs for tokens, not the entire words.

Igor Ivanov 1 Aug 2025 9:42 UTC
1 point
0
in reply to: karl’s comment on: Call on AI Companies: Publish Your Whistleblowing Policies
Can I be useful for your work on whistleblowing? I’m an evals researcher, but I want to do some advocacy/lobbying/policy work as well.

Igor Ivanov 31 Jul 2025 22:23 UTC
1 point
0
on: I am worried about near-term non-LLM AI developments
How do you see we can prepare in advance to study safety of such systems?

Igor Ivanov 31 Jul 2025 22:18 UTC
2 points
0
on: Call on AI Companies: Publish Your Whistleblowing Policies
I support your cause, I believe that whistleblowing is important.

I’m curious, how do you see the most efficient way to force AI companies to implement whistleblower protection policies? Do you have a concrete plan? E.g. legislative proposals.

Igor Ivanov 30 Jul 2025 11:06 UTC
2 points
0
in reply to: DresdenHeart’s comment on: DresdenHeart’s Shortform
What are your plans regarding bio x AI risks? The field is pretty narrow, and I can help to guide you and give an advice or two, or connect with relevant people if you are interested.

Igor Ivanov 29 Jul 2025 15:58 UTC
24 points
4
on: About 30% of Humanity’s Last Exam chemistry/biology answers are likely wrong
I think it would be cool if more people did similar reviews of other benchmarks. This is a relatively low-effort high-value thing to do compared to creating new benchmarks.
It would significantly increase our capacity to properly measure AI capabilities, and also improved methodological standards for creating new benchmarks.

Igor Ivanov 18 Jul 2025 23:03 UTC
LW: 4 AF: 2
0
AF
on: Why it’s hard to make settings for high-stakes control research
I believe that there is another dimension to the problem of designing good control ecperiments.
I’m doing some research on LLMs being aware when they are inside evals, and it seems like frontier models are quite good at recognizing it for the alignment evals I could get a hand on. And this affects the model behavior a lot! In the Anthropic blackmail experiment models blackmailed much less when they mentioned that they are inside evals compared to the cases when they didn’t think they are inside evals.

I’ve tried to make evaluation scenarios more realistic, but it’s a challenge. Models are quite good at “smelling” being inside an eval no matter how hard I try, often the mere fact that they are given some unusual task that requires a covert mislaigned action makes them believe that they are being evaluated.

Igor Ivanov 16 Jul 2025 23:55 UTC
1 point
0
in reply to: Ryan Kidd’s comment on: xAI’s Grok 4 has no meaningful safety guardrails
My only solace would be if someone actually does some bad shady stuff (but not too bad) with Grok and this becomes a scandal.
It is actually an interesting thing to observe: whether such a misaligned near-human level model in the wild will lead to real-world problems.