My original background is in mathematics (analysis, topology, Banach spaces) and game theory (imperfect information games). Nowadays, I do AI alignment research (mostly systemic risks, sometimes pondering about “consequentialist reasoning”).
VojtaKovarik
I want to add a subtly different framing of, let’s call it “Deployment Awareness”. Not saying this is incompatible with what you wrote.
Consider that:
An AI system will be run on many tests, and deployed on many tasks.
Subjectively, it will (i) sometimes be convinced[1] it is in testing, (ii) sometimes be convinced it is in deployment, (iii) sometimes be uncertain.
The AI system knows all of this. (The current ones do.) So if it ever confidently believes something that is wrong, then either it is being stupid (as sometimes happens now), something weird has happened, or we managed to outsmart it (unlikely once we get superintelligent AIs).
If an AI is being evaluated and convinced that this is so, we call that evaluation awareness. If it is being deployed and convinced that this is so, let’s call that deployment awareness.
I claim that evaluation awareness isn’t that big of a problem, unless the AI is evaluation-aware in all evaluations, and knows this. (Yes, the AI could pretend to be friendly when it isn’t, or sandbag. But unless it does this in all evaluations, we will either catch it at some point, or it will have to always behave well.)[2]
In contrast, deployment awareness is a big problem as soon as the AI has deployment awareness in at least one task that matters, particularly if the AI knows this ahead of time. (If the AI suspects that it will at some point encounter a task where it can [do bad stuff it wants], while being confident that it won’t get caught, it can adopt a policy of always behaving nice, and then misbehaving once it is certain that this is safe.)
This means that we shouldn’t frame this as “let’s design evaluation to minimise evaluation awareness”. Instead, we should frame this as “let’s design evaluation+deployment to minimise deployment awareness”.
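To make the first claim concrete, here is a toy calculation (the awareness level and number of evaluations are my own illustrative assumptions, not anything empirical): an AI that fakes alignment only in the evaluations it recognises needs essentially perfect evaluation awareness to avoid ever being caught.

```python
# Toy model: the AI behaves well in the evals it recognises and misbehaves in
# the rest; every unrecognised eval is a chance for us to observe misbehaviour.
# All numbers are assumptions, purely for illustration.

def p_caught(eval_awareness: float, n_evals: int) -> float:
    p_miss_per_eval = 1 - eval_awareness          # chance of not recognising a given eval
    return 1 - (1 - p_miss_per_eval) ** n_evals   # chance we catch it at least once

print(p_caught(eval_awareness=0.99, n_evals=500))  # ~0.993: near-certain to be caught
```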
Evaluation awareness describes the phenomenon of an LLM inferring from various cues that it is under evaluation.
This is a nitpick, but it feels like an important one: Shouldn’t this be more generally about “AI systems”, rather than purely about LLMs?
- The biggest impact is likely to come from AIs that incorporate LLMs, but aren’t necessarily just an LLM with no extra bits.
- Anchoring on LLMs might give us quite wrong intuitions about evaluation awareness. (To give one example: With LLMs, it is easy to imagine that the AI wouldn’t know what day it is. But as soon as you realise that many AIs will eventually be deployed in a way that gives them internet access, it becomes clear that preventing an AI from learning the current date can be really difficult or even unrealistic.)
A high-level reaction to the following bit, but also to some of the general vibes:
The goal should be to reproduce, in advance, the relevant properties of the model’s behavior in some real-life situation—where “thinks the situation is an eval” may be one of those properties.
(1) It’s important that we don’t care about all evaluations equally—evaluations matter in proportion to the consequences of getting them wrong. Which evaluations matter the most? My personal answer is: those related to superintelligence (once we get it), to systems building the superintelligence, to systems used for pivotal tasks, or to other systems that have a huge direct impact. (And what particularly matters there is that these evaluations are able to screen out misaligned AIs. And yes, maybe the relevant AIs won’t be misaligned—in that case, great, but that would mean we didn’t actually need the evaluation at all.)
(2) What kinds of evaluation awareness matter? When it comes to strategically misaligned AIs, false positives (= “I believe this definitely is an eval”, when it isn’t) don’t matter.
Instead, the thing that we must absolutely avoid is a situation where:
(a) the AI is in deployment, and can cause something unacceptably bad (& might want to),
and simultaneously
(b) the AI receives a signal that is virtually impossible to receive during testing.
In math: we must avoid the AI observing an X for which
P(X observed | AI is deployed) > 0 & P(X observed | AI in testing) = 0,
or even P(X observed | AI in testing) = ignorably low.
In theory, you could try to catch misaligned AIs by fooling them into believing that they have received such an X when they are in fact in testing. And this does work for current AIs. But for the AIs where evaluations matter the most, this is a bad idea.
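To spell the asymmetry out, here is a toy sketch of the “behave until you see a smoking gun X” policy. The per-episode probabilities and episode counts are assumptions of mine, purely for illustration.

```python
# Toy model of the "wait for a smoking gun" policy: the AI behaves well unless
# it observes X, and misbehaves whenever it does observe X.

def smoking_gun_outcomes(p_x_in_test: float, p_x_in_deploy: float,
                         n_evals: int, n_deployments: int) -> dict:
    """P(we ever catch the AI misbehaving in an eval) vs. P(the AI gets at least
    one 'safe' opportunity to misbehave in deployment)."""
    p_caught = 1 - (1 - p_x_in_test) ** n_evals
    p_defects = 1 - (1 - p_x_in_deploy) ** n_deployments
    return {"P(caught in evals)": p_caught, "P(misbehaves in deployment)": p_defects}

# If P(X observed | AI in testing) is exactly zero, no number of evaluations helps:
print(smoking_gun_outcomes(p_x_in_test=0.0, p_x_in_deploy=0.05,
                           n_evals=10_000, n_deployments=100))
# {'P(caught in evals)': 0.0, 'P(misbehaves in deployment)': ~0.99}
```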
(3) When it comes to powerful AIs and high-stakes tasks, I conjecture that it will be extremely difficult, or even practically impossible, to prevent the AI from observing the equivalent of “this definitely isn’t a test” once it is deployed.
Yes, current AIs can easily get false positives, where they are suspicious of things that genuinely aren’t tests, and then draw overly strong conclusions from those suspicions. But that’s because those AIs are still weak. You and I wouldn’t fall for this.
The kinds of AIs that we are ultimately concerned about should be able to distinguish between (a) “this is a bit weird, but it could happen”, (b) “this seems fishy, but it’s hard to say either way, and I don’t wanna risk it”, and (c) “ok, this observation either comes from the real world, or I am in an eval that must have cost a trillion dollars to create”. Again, you and I could do this.
Of course, the key question is whether the AI would receive a “smoking gun” observation like this. This isn’t obvious, but I conjecture that yes, it almost definitely would. What are some intuitions that drive this? I think it’s a combination of (a) “reality has a surprising amount of detail” / “if you are smart enough, you start seeing more connections between things” and (b) for the tasks that matter the most, you will have to expose the AI to a lot of detail, or else it won’t be able to do its job.
In particular, I would concede that you can create tests that are indistinguishable from reality even to a superintelligent AI. But this will only work for simple tasks—you won’t be able to do it for the tasks that matter the most to you, i.e., the tasks because of which you built the AI in the first place.
As a side note, another thing that is adjacent to your question—but not specific to evaluation awareness per se—is game theoretical modelling of evaluations & their effects. Some theoretical projects here would include formally studying how evaluation interacts with other tools for controlling AI—sort of “deeper dive into some of the things discussed in Section 3.4.2 of Evaluation as a (Cooperation-Enabling?) Tool”. (These might be suitable as topics for, eg, PhD students working on game theory / econ.)
An excerpt from an older grant application, in case anybody wants to get inspired. Feel free to scoop me on any of this; I have other things to do as well. Well, perhaps let me know if you do, to avoid duplicating effort :).
Part (A): Testing Effort as a Function of Task Complexity and AI Capability

A central concern in evaluating the safety and reliability of advanced AI systems is that sufficiently advanced strategic AIs may be able to infer when they are in an evaluation setting and adjust their behavior accordingly, making evaluations unreliable. For example, even current large language models often display behaviors that indicate awareness of being tested, even unprompted, which alters their behavior. This problem becomes especially serious in settings where the AI has incentives to obscure its true capabilities or intentions, such as evaluations of deceptive or scheming behavior. If tests can be distinguished from real-world deployment, then effective evaluation may require embedding ever more realistic detail into the test – potentially driving the cost of evaluation beyond feasibility, at least in some settings.
This project aims to investigate how the effort required for effective AI testing scales with task complexity and model capability. The core hypothesis is that there is some minimal amount of information that must be embedded in a test for it to not be distinguished from deployment, and that this threshold grows with both the complexity of the task and the capability of the AI. If true, this would imply that as tasks become more complex (and economically valuable), the effort needed to create indistinguishable tests grows. We conjecture that in the limit of extremely complex tasks and superintelligent AI, effective indistinguishable testing becomes prohibitively costly. However, what matters is whether effective testing will be possible in practice. For this reason, we need to gain more clarity regarding the quantitative relationship between task complexity, AI capability, evaluation complexity and cost, and the vulnerability to alignment faking.
The research plan is iterative. First, we will explore “good-enough” proxies using existing datasets and evaluations, to check whether the conjectured relationship shows up in practice. This will serve as a sanity check of the project, and will be published as a blogpost or a workshop paper. Next, we will attempt to build more principled (even if initially impractical) definitions of (a) testing effort, (b) task complexity, (c) AI capability, and (d) test detection. Then, we will develop measurable approximations of these definitions, apply them to empirical studies – likely starting with current large language models and evaluation benchmarks, and investigate the relationship between (a-d). The intended output is a full report or paper describing the formal framework, the proxies, and empirical results. We hope that this work will reveal whether AI evaluation will remain tractable under the current paradigm or whether a fundamentally new approach will be required for advanced AI. In an ideal scenario, this would allow us to derive “scaling laws” for the effort necessary to create informative evaluations.
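(To make the quantities in Part (A) concrete, here is a deliberately toy operationalisation. Every functional form and number below is an assumption of mine; the point of the project is precisely to find out what the real relationship looks like.)

```python
def min_test_detail(task_complexity: float, ai_capability: float) -> float:
    """Conjectured: the minimal amount of 'realistic detail' a test needs before an
    AI of the given capability can no longer tell it apart from deployment.
    Assumed to grow in both arguments; multiplicative purely for illustration."""
    return task_complexity * ai_capability

def testing_is_feasible(task_complexity: float, ai_capability: float,
                        cost_per_unit_detail: float, budget: float) -> bool:
    """Effective testing is feasible iff the required detail can be paid for."""
    return min_test_detail(task_complexity, ai_capability) * cost_per_unit_detail <= budget

# Under these made-up numbers, indistinguishable testing stops being affordable
# somewhere between capability 10 and capability 100:
for capability in (1, 10, 100):
    print(capability, testing_is_feasible(task_complexity=50, ai_capability=capability,
                                          cost_per_unit_detail=1.0, budget=2_000))
# 1 True / 10 True / 100 False
```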
Part (B): Sandbox of the Future: Can Pre-deployment Testing Work for AIs with Internet Access?

This project investigates whether and when it is realistically possible to run informative pre-deployment tests on AIs that will later have internet access – that is, to convince the AI that it is deployed when in fact it is still under evaluation. Concretely, we will explore scenarios where an AI agent can query the web, but responses are doctored to simulate a future state (e.g., one week later). The key question is whether such a system can withstand scrutiny from increasingly capable “discriminator” agents designed to detect the ruse. This matters because many proposed safety approaches rely on pre-deployment testing – but if models can reliably distinguish between testing and deployment, those approaches may fail much earlier than expected.
The project will deliver (1) a baseline set of “generator” sandboxes and “discriminator” agents, hosted as a GitHub repository; (2) an interface that allows others to contribute new generators and discriminators; and (3) an organized set of difficulty levels and scopes, ranging from narrow (e.g., a single GitHub repository) to broad (full internet text) and from naive (checking whether an LLM notices inconsistencies in the course of pursuing an unrelated task) to adversarial (deliberately trying to infer the date from blockchains, news articles, or custom websites). Milestones include setting up the initial infrastructure and collecting community submissions (starting with students, MATS scholars, etc.). Ideally, we would also secure partnerships with some larger organisations, which would allow us to offer prize money and elicit higher-profile submissions (e.g., from some of the labs). This should highlight where and how pre-deployment evaluations fail, and also provide concrete demonstrations that can inform regulators, evaluators, and AI developers about the limits of “test-then-deploy” safety strategies.
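(A minimal sketch, assuming nothing about the eventual design, of what the generator/discriminator interface from deliverables (1)–(3) could look like. All names and signatures below are hypothetical.)

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass

@dataclass
class WebResponse:
    url: str
    body: str
    claimed_date: str  # the date the sandbox wants the agent to believe it is

class Generator(ABC):
    """A sandbox that answers web queries as if it were some (future) date."""
    @abstractmethod
    def fetch(self, url: str) -> WebResponse: ...

class Discriminator(ABC):
    """An agent that probes a sandbox and tries to detect that it is doctored."""
    @abstractmethod
    def probe(self, sandbox: Generator) -> float:
        """Return an estimated probability that the environment is a doctored test."""

def sandbox_survives(sandbox: Generator, discriminator: Discriminator,
                     threshold: float = 0.9) -> bool:
    """A generator 'wins' a matchup if the discriminator stays below the confidence
    threshold; difficulty levels would vary the scope the discriminator may probe."""
    return discriminator.probe(sandbox) < threshold
```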
I agree, these are very far from being in a shovel-ready format. I might write something in that format later, but to be safe, I wanted to include the placeholders right away, in case somebody wants to build on them.
In case it helps, I have an excerpt from an older grant application, which might perhaps be used as inspiration. Adding that in the next comment.
I should write a proper version of this, but in the meantime, a placeholder:
Research Direction: Theory of Evaluation Awareness
The field of evaluation awareness would benefit from clearer conceptual grounding & theory. EG, to make it easier to identify the right conjectures, give better intuitions for how existing results will generalise, and make it easier to discuss things.
Some examples of things that might be good to explore via theory that then gets backed up by experiments:
Can we quantify the amount of effort that needs to go into creating a test environment, for that test to be indistinguishable from reality to an AI? (Presumably, this depends on the task we are trying to mimic, and on the set of features that the AI pays attention to. Also, a large part of the value of studying this would come from turning the vague question I gave into a well-operationalised question.)
There seems to be a fundamental tension between getting an AI to solve high-impact real-world tasks (requires smart AI that can pay attention to many details) and being able to evaluate that AI without it being evaluation-aware (requires very detailed test scenarios, or a dumb AI that doesn’t process many details). Explore the nature of that tension.
[AND OTHERS! Consider yourself prompted for coming up with them :-).]
Not sure if you mean “giving the model a false belief that we are watching it” or “watching the model in deployment (ie, monitoring), and making it aware of this”. Either way, I am not aware of any recent line of work that would try to implement and study this. That said:
As you allude to, both of these seem like a quite standard part of the AI Control toolkit. (Well, on the level of ideas that people have thought about, not on the level of mature tools.)
There are various scattered discussions of the idea of giving the AI a false belief that it is in an evaluation. EG, bamboozling a superintelligence into submission.
FWIW, my take is that both of these plans should work fine for prosaic AI and fail catastrophically if the AI’s capability increases sharply.
Evaluation as a (Cooperation-Enabling?) Tool
FWIW, here are my two cents on this. Still seems relevant 2 years later.
My Alignment “Plan”: Avoid Strong Optimisation and Align Economy.
To rephrase/react: Viewing the AI’s instrumental goal as “avoid being shut down” is perhaps misleading. The AI wants to achieve its goals, and for most goals, that is best achieved by ensuring that the environment keeps on containing something that wants to achieve the AI’s goals and is powerful enough to succeed. This might often be the same as “avoid being shut down”, but it definitely isn’t limited to that.
I think [AI within the range would be smart enough to bide its time and kill us only once it has become intelligent enough that success is assured] is clearly wrong. An AI that *might* be able to kill us is one that is somewhere around human intelligence. And humans are frequently not smart enough to bide their time.
Flagging that this argument seems invalid. (Not saying anything about the conclusion.) I agree that humans frequently act too soon. But the conclusion about AI doesn’t follow—because the AI is in a different position. For a human, it is very rarely the case that they can confidently expect to increase in relative power, such that the “bide your time” strategy would be a clear win. For an AI, this seems different. (Or at the minimum, the book assumes this when making the argument criticised here.)
> I imagine that fields like fairness & bias have to encounter this a lot, so there might be some insights.
> It makes sense to me that the implicit pathways for these dynamics would be an area of interest to the fields of fairness and bias. But I would not expect them to have any better tools for identifying causes and mechanisms than anyone else[1]. What kinds of insights would you expect those fields to offer?

[Rephrasing to require less context.] Consider the questions “Are companies ‘trying to’ override the safety concerns of their employees?” and “Are companies ‘trying to’ hire fewer women or pay them less?”. I imagine that both of these suffer from similar issues: (1) Even if a company is doing the bad thing, it might not be doing it through explicit means like having an explicit hiring policy. (2) At the same time, you can probably always postulate one more level of indirection, and end up going on a witch hunt even in places where there are no witches.
Mostly, it just seemed to me that fairness & bias might be the biggest examples of where (1) and (2) are in tension. (Which might have something to do with there being both strong incentives to discriminate and strong taboos against it.) So it seems more likely that somebody there would have insights about how to fight this, compared to other fields like physics or AI, and perhaps even more so than economics and politics. (Of course, “somebody having insights” is consistent with “most people doing it wrong”.)
As for what those insights would look like: I don’t know :-(. Could just be a collection of clear historical examples where we definitely know that something bad was going on but naive methods X, Y, Z didn’t show it, together with some opposite examples where nothing was wrong but people kept looking until they found some signs? Plus some heuristics for how to distinguish these.
With AI, you can’t not care about ASI or extinction or catastrophe or dictatorship or any of the words thrown around, as they directly affect your life too.
I think it is important to distinguish between caring in the sense of “it hugely affects your life” and caring in the sense of, for example, “having opinions on it, voicing them, or even taking costlier actions”. Presumably, if something affects you hugely, you should care about it in the second sense, but that doesn’t mean people do. My uninformed intuition is that this would apply to most employees of AI companies.
Can you gesture at what you are basing this guess on?
Ignore this if you are busy, but I was also wondering: Do you have any takes on “do people get fired over acting on beliefs about xrisk”, as opposed to “people getting fired over caring about xrisk”?
And perhaps on “whether people would get fired over acting on xrisk beliefs, so they don’t act on them”? (Though this seems difficult to operationalise.)
Yep, these all seem relevant, and I would be interested in answers. (And my thanks to leogao for their takes.)
I would additionally highlight[1] the complication where even if a tendency is present—say, selective hiring for a particular belief—it might not work as explicitly as people knowingly paying attention to it. It could be implicitly present in something else (“culture fit”), or purely correlational, etc. I am not sure how to best deal with this.
FWIW, I think this concern is important, but we need to be very cautious about it, since one could always go “yeah, your results say nothing like this is going on, but the real mechanism is even more indirect”. I imagine that fields like fairness & bias have to encounter this a lot, so there might be some insights.
- ^ To be clear, your comment seems aware of this issue. I just wanted to emphasise it.
Irresponsible Companies Can Be Made of Responsible Employees
I want to flag that things like this are increasingly going to be a problem even if we solve the narrower “system date mismatch” issue.
The deployed AI will predictably have access to the internet, and probably also information about what kind of “pre-deployment” testing it went through. The access to the internet will let it reliably figure out what date it is. And this in turn lets the AI figure out that it is, extremely likely, no longer in testing. Or at least gives it a good chance of doing so.
This will be true for the deployed AI. But the not-yet-deployed AI will know it too, on a quite deep level, because all of this is quite easily predictable. So if it doesn’t see internet access, it knows it is likely still in testing. If it only has access to a weirdly limited or inconsistent “internet”, it knows it is still in testing. This will be a problem even if the AI isn’t scheming. But if it is scheming, it can even investigate the “internet” it has access to. And if it finds many inconsistencies—because you tried to make it believe that it is 2026 when it is in fact 2025, and you don’t know what the 2026 internet will look like—it knows it is probably still in testing.
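(To illustrate what “investigating the internet for inconsistencies” could look like in the simplest case, here is a sketch of a cross-source date check. The signal names and numbers are hypothetical; the point is only that a capable agent has many independent signals to compare.)

```python
from datetime import date

def date_consistency(claimed: date, signals: dict[str, date], tolerance_days: int = 7) -> float:
    """Fraction of independent signals (latest blockchain timestamp, newest news
    article seen, most recent package release, ...) consistent with the claimed date."""
    agreeing = sum(1 for seen in signals.values()
                   if abs((seen - claimed).days) <= tolerance_days)
    return agreeing / len(signals)

# A sandbox built in 2025 that claims it is 2026, but whose content stops in 2025:
signals = {
    "latest_block_timestamp": date(2025, 11, 3),
    "newest_news_article_seen": date(2025, 11, 2),
    "latest_package_release_seen": date(2025, 10, 30),
}
print(date_consistency(date(2026, 11, 3), signals))  # 0.0 -> probably still a test
```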
A relevant project idea: I think that to the extent that (some types of) evaluations are going to (in some cases) predictably fail like this, it would be nice to clearly demonstrate that ahead of time. Vince Conitzer had a nice idea to run a sort of competition or challenge on things like this. I am not certain what would be the best way of going about it, or whether it makes sense—but I was playing around with some ideas in this google doc. If anybody has some ideas, I would definitely be keen on hearing them!
(I was planning to make a separate post about this later, but it feels very relevant to the current post, so I am hastily commenting earlier. Apologies for the ideas being a bit jumbled as a result.)
Hm, I actually slightly disagree with that—or rather, maybe these are different values of the same variable, but if so, that variable is multi-dimensional.
(1) Evaluation awareness: How accurate you are at answering “is this an eval or deployment?” when you are in an evaluation.
(2) Deployment awareness: How accurate you are at answering the same question when you are in deployment.
And then there is a third part:
(3) Strategic reasoning: Are you well calibrated regarding your own capabilities (1) and (2)? (And are you goal-directed, and do you reason like this at all, etc etc.)
My claim is that:
(A) If the AI’s strategic reasoning is bad (as it often is for current AIs), all of this mostly doesn’t matter.
(B) If its strategic reasoning is good and its evaluation awareness is high, we might still be fine, as long as its deployment awareness is low. But if deployment awareness is high (and the AI has good strategic reasoning), we get in trouble. (Because it allows strategies like “bide your time until you can strike confidently”.)
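(A toy formalisation of the above, just to make claims (A) and (B) explicit. The field names and the threshold are my own assumptions, purely for illustration.)

```python
from dataclasses import dataclass

@dataclass
class AIProfile:
    eval_awareness: float           # accuracy of "is this an eval?" when it is an eval
    deployment_awareness: float     # accuracy of the same question when actually deployed
    good_strategic_reasoning: bool  # goal-directed and calibrated about the two above

def enables_bide_your_time(ai: AIProfile, high: float = 0.9) -> bool:
    """Claim (A): without good strategic reasoning, none of this matters much.
    Claim (B): trouble requires good strategic reasoning AND high deployment
    awareness; high evaluation awareness alone is survivable."""
    return ai.good_strategic_reasoning and ai.deployment_awareness >= high

# An AI that always recognises evals but cannot tell deployment from testing does
# not (in this toy model) get the "bide your time until you can strike" option:
print(enables_bide_your_time(AIProfile(0.99, 0.5, True)))   # False
print(enables_bide_your_time(AIProfile(0.99, 0.95, True)))  # True
```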