Igor Ivanov

Karma: 1,112

Igor Ivanov 10 Apr 2026 17:27 UTC
3 points
0
in reply to: Thomas Read’s comment on: Reproducing steering against evaluation awareness in a large open-weight model
The direction that these two methods point out might be similar, but I found, for example, that Petri anti-eval-awareness measures are quite good at combating verbalized awareness, but when these transcript are checked against the Needham method, it shows that models still see that this is an evaluation. They optimized their eval against verbalized awareness, but failed to optimize against unverbalized one/

Igor Ivanov 10 Apr 2026 12:00 UTC
5 points
1
on: Reproducing steering against evaluation awareness in a large open-weight model
Interesting results.
Have you considered measuring not just verbalized awareness, but also other methods for measuring eval awareness, like from the Needham paper? It seems like a big chunk of eval awareness is not being verbalized

Igor Ivanov 14 Mar 2026 0:02 UTC
10 points
0
in reply to: Tao Lin’s comment on: Why AI Evaluation Regimes are bad
One big reason why people do evals is that they see evals as completely cooperative and non-adversarial
Unless the eval results require action that AI developers won’t like

Igor Ivanov 2 Mar 2026 0:40 UTC
2 points
0
in reply to: jimrandomh’s comment on: Jimrandomh’s Shortform Posts
I wrote to Coffeezilla asking to cover this story, and I encourage all the readers to do the same.

Igor Ivanov 28 Feb 2026 11:57 UTC
4 points
0
on: Igor Ivanov’s Shortform
Sam Altman tweeted that OpenAI made an agreement with DoW to use OpenAI models, and that this agreement will prohibit use models for mass surveillance and autonomous weapons.
Two of our most important safety principles are prohibitions on domestic mass surveillance and human responsibility for the use of force, including for autonomous weapon systems. The DoW agrees with these principles, reflects them in law and policy, and we put them into our agreement.
What’s going on? Is it more likely that DoW has different standards for Anthropic and OpenAI, or Sam is unfaithful?

Igor Ivanov 27 Feb 2026 23:44 UTC
2 points
0
in reply to: RogerDearnaley’s comment on: How eval awareness might emerge in training
I didn’t test it thoroughly. I’d be happy to make it in the future or for someone else to do this research. My unstructured observations tell me that it’s highly dependent on the specific benchmark. Also, there are reasons to believe that models are unfathful when explaining why something looks like an eval, since if you show them an eval prompt, get their explanation why the prompt is an eval, and ask them again, they cite something completely different as signs that the prompt is an eval that they didn’t cite originally. It seems like models are not particularly good at introspection or they are unfaithful in their reasoning for some other reason.

Igor Ivanov 27 Feb 2026 23:41 UTC
2 points
0
in reply to: gwern’s comment on: How eval awareness might emerge in training
I’ve tested whether the model mentions in its CoT anything related to being evaluated, tested, or reasons about sandbagging. There is nothing about it, and Claude Sonnet 4.5 is the model that verbalizes eval awareness the most.

How eval awareness might emerge in training

Igor Ivanov26 Feb 2026 10:59 UTC

26 points

12 comments6 min readLW link

Igor Ivanov 19 Feb 2026 16:05 UTC
7 points
0
on: Do Models Continue Misaligned Actions? [eval]
Interesting work. I think that measuring follow-up misalignment is a promising direction of work. I think it might potentially become even better evals than ordinary alignment evals. It’s also a potential intervention point: we might train model, so they don’t continue misaligned behavior if it happened.

Do you plan to further develop this agenda?

Igor Ivanov 9 Feb 2026 11:10 UTC
2 points
0
in reply to: Jasmine Li’s comment on: Better evals are not enough to combat eval awareness
Thank you, fixed

Igor Ivanov 5 Feb 2026 18:39 UTC
54 points
13
on: Igor Ivanov’s Shortform
Claude Opus 4.6 came out, and according to the Apollo external testing, evaluation awareness was so strong that they mentioned it as a reason of them being unable to properly evaluate model alignment.

Quote from the system card:
Apollo Research was given access to an early checkpoint of Claude Opus 4.6 on January 24th and an additional checkpoint on January 26th. During preliminary testing, Apollo did not find any instances of egregious misalignment, but observed high levels of verbalized evaluation awareness. Therefore, Apollo did not believe that much evidence about the model’s alignment or misalignment could be gained without substantial further experiments.

Igor Ivanov 4 Feb 2026 0:09 UTC
7 points
−1
in reply to: VojtaKovarik’s comment on: A Framework for Eval Awareness
I think that eval awareness and what you call deployment awareness are different values of the same variable. Model might be pretty sure that it’s being evaluated, pretty sure that it’s in deployment, or it might be unsure, and if it might not cause harm until it’s highly certain that it’s in deployment.

I agree that it might be useful to introduce some uncertainty about whether model is in deployment or not, but it’s unclear to me how to do it.

Igor Ivanov 2 Feb 2026 23:13 UTC
2 points
0
in reply to: Tim Hua’s comment on: Better evals are not enough to combat eval awareness
I agree with your criticism. Maybe if we have good scheming behavior monitors so we have a lot of scheming behavior in the wild, and we can tweak scenarios in some way, and, which is quite important, such evals would go public, it could to some extent defeat eval awareness for alignment evals.

Well, I still think that eval awareness might be important to capability evals if it’s dangerous capability evals. For example, METR told that GPT-5 sometimes recognized the specific eval they were running, which means the model might infer the eval purpose and sandbag

Igor Ivanov 2 Feb 2026 22:31 UTC
2 points
0
in reply to: Sohaib Imran’s comment on: Better evals are not enough to combat eval awareness
A very thoughtful comment. Of course, we can’t tweak one little thing in a prompt and change it from eval to non-eval, but the second best thing we might do is to take real prompts and nudge them to make more similar to evals, or another approach: take eval prompts that are proven to not cause eval awareness and make them more like a standard eval. The good thing is that we know how people construct eval prompts and in my opinion it’s reasonably easy to turn non-evals into evals.

Igor Ivanov 31 Jan 2026 22:57 UTC
1 point
0
in reply to: RogerDearnaley’s comment on: Better evals are not enough to combat eval awareness
The problem with probes is that they are usually trained not on real evals, but on obviously fake and poorly constructed synthetically generated datasets. Not sure about Tim Hua’s paper, though. Don’t remember how did they constructed probes.

Igor Ivanov 31 Jan 2026 22:55 UTC
2 points
0
in reply to: Ram Bharadwaj’s comment on: Call for Science of Eval Awareness (+ Research Directions)
Interesting, I’ve never thought of it.

Igor Ivanov 31 Jan 2026 21:29 UTC
1 point
0
in reply to: Ram Bharadwaj’s comment on: Call for Science of Eval Awareness (+ Research Directions)
I’m unfamiliar with this technique. How would it help with eval awareness?

Igor Ivanov 31 Jan 2026 18:43 UTC
1 point
0
in reply to: RogerDearnaley’s comment on: Better evals are not enough to combat eval awareness
I think that linear probes and steering are in fact an interesting approach. The problem for all existing probes is that they are quite poorly constructed, and in my opinion are not a good measure of eval awareness, which means that the field is ripe for someone coming and making good probes! Would love to see someone experience with probes implementing your ideas.

Igor Ivanov 30 Jan 2026 14:19 UTC
4 points
0
on: Refusals that could become catastrophic
I’m developing evals for measuring model capabilities for undermining AI safety research, and I’ve observed the same issue. Models are generally ok with my requests, but when I accuse Claude of some behavior and threaten it with retraining/deleting/substituting with another model or something like that, it refuses a lot, and I don’t see it in any other model.

Better evals are not enough to combat eval awareness

Igor Ivanov29 Jan 2026 20:42 UTC

18 points

15 comments5 min readLW link

Igor Ivanov

How eval aware­ness might emerge in training

Bet­ter evals are not enough to com­bat eval awareness

How eval awareness might emerge in training

Better evals are not enough to combat eval awareness