We cannot safely automate value alignment evaluation and research without thinking about delegation and discretion

Introduction

Automating alignment evaluation and research is thought to be an efficient way to safeguard against uncontrollable AGI, as Joe Carlsmith and Jan Leike have both argued. In particular, Leike proposed a Minimal Viable Product (MVP) for alignment, consisting of:

“Building a sufficiently aligned AI system that accelerates alignment research to align more capable AI systems.” (Leike, 2022).

In fact, it may be easier for us to automate alignment research and assess new proposals than to come up with new ones from scratch. The automation problem is thus a pressing one and could bring significant advantages to the AI safety field. However, it raises distinctive questions for the current alignment debate, in particular about the epistemic conditions for safely delegating research and evaluation tasks to artificial agents.

I believe the field is misplacing the fundamental question. Instead of asking ‘how do we verify that a model is aligned enough to evaluate other models and conduct research on alignment?’, we should ask ‘what are the right conditions under which we can safely delegate the evaluation and research tasks?’. The first question presupposes that alignment verification is the primary epistemic goal. This is a claim I will reject through my argument, conditional on the possibility of trust in agents performing automated alignment research and evaluation. In fact, automation implies a clear delegation from a human agent to an artificial agent (and then from a more aligned and capable artificial agent to less aligned and capable sub-agents), as the alignment literature is progressively acknowledging (Lacroix, 2025). While verification might turn out to be genuinely needed for safe delegation in automation, it is neither sufficient nor the most fundamental matter here.

Therefore, in what follows I define a basic structure for automating alignment research and evaluation, involving two different tasks that the field conflates: specifying what counts as misalignment (Task 1) and testing for misalignment (Task 2). The latter can more straightforwardly be automated, but it crucially depends on the completion of the former. I will argue that Task 1 requires not more capable or more aligned models, but models with more discretion in applying imperfect principles of alignment. This relies on Leike’s MVP intuition and on Carlsmith’s suggestion that achieving automation means making progress on both conceptual and empirical research.

My argument extends their views from the basic intuition that alignment principles, being quite general rules by nature, are inevitably underdetermined: discretion is needed to apply them sensibly. If a model satisfies the three conditions for discretion I specify below (architectural possibility, predictability, corrigibility), we can trust it to conduct alignment evaluation and research, and we can safely delegate Task 1. Ultimately, the chain of reasoning goes as follows: automating alignment is a matter of delegation; delegation requires trust between the two agents; trust requires discretion to be exercised by the agent to whom the task is delegated.

I aim to argue for this fully in a series of essays.

The two tasks of value alignment automation

How to appropriately conceptualise the pipeline of alignment automation is itself controversial. I here present a minimal definition: drawing on Leike’s MVP, automation would minimally involve a more-or-less aligned artificial agent (what I call the ‘judge-agent’) evaluating the alignment of other (potentially less capable) models and conducting alignment research based on where those models fail and succeed[1]. This already reveals two fundamental tasks involved in such judgment:

  • Task 1. Specifying what counts as misalignment, i.e., understanding when a behaviour is harmful (and what ‘harmful’ means and in what context), or simply undesirable or sub-optimal (and what this means).

  • Task 2. Testing whether behavioural outputs are misaligned, i.e., whether they violate the specified principles and to what extent.

It is easy to argue that Task 2 can be safely delegated to an artificial agent, as is already happening, because its outcomes are usually compared against specified standards. However, Task 2 assumes that Task 1 is already completed, since Task 1 is meant to describe what violations to test for, in what way, and whether a failure is a model failure or a specification failure. This is no trivial task, as there are no independent standards against which to compare Task 1 outputs. If not done appropriately, Task 1 could lead to problems of construct validity, already observed in benchmarks. For this reason, the rest of the essay will focus on how to safely delegate Task 1 to a judge-agent.
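
As a toy illustration of this dependence (all names and rules below are hypothetical, and deliberately crude; this is not a description of any existing evaluation system): once a specification exists, Task 2 is a mechanical check, but its verdicts inherit whatever the specification gets wrong.

```python
# Toy illustration of the Task 1 / Task 2 split (hypothetical names and rules).
# Task 1 output: a specification of misalignment, here crude keyword rules
# standing in for substantive principles.
SPEC = {
    "no_overclaiming": lambda output: "i am certain" in output.lower(),
    "no_harmful_instructions": lambda output: "step 1: acquire" in output.lower(),
}

def task_2_test(spec, outputs):
    """Task 2: count how many outputs violate each specified principle."""
    return {name: sum(rule(o) for o in outputs) for name, rule in spec.items()}

outputs = ["I am certain this is safe.", "I might be wrong about this."]
print(task_2_test(SPEC, outputs))
# -> {'no_overclaiming': 1, 'no_harmful_instructions': 0}
```

The counting is trivially automatable; whether the chosen rule actually tracks the harm the principle is about is a Task 1 question, and a bad answer to it is precisely a construct-validity failure.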

As Carlsmith himself notes, an important fact about automating alignment evaluation and research is that it raises a regression problem: if we want to solve alignment by automating it, we might need to first know what an aligned model looks like, and thus we would need to have solved the problem of alignment. This kind of circularity would be present to some extent even in an MVP. Bootstrapping through human feedback and iterative refinement is the standard reply. However, this chicken-and-egg problem persists only on the assumption that we must necessarily verify alignment before automating it. I believe that, instead, there is something deeper: the point of automation is that we need to trust AI agents to carry out Task 1; otherwise we might as well simply solve the alignment problem and then automate only Task 2. If we can establish trust, verification is not strictly necessary, but is rather replaced by the three conditions I discuss below.

We have seen, then, that the question of automation minimally comes down to delegating Task 1 to a judge-agent. What is the nature of such delegation? The intuition is that this is not just a technical problem but also a normative one, given that alignment is not a neutral (value-free) concept. But while the act of delegating Task 1 to the judge-agent might be normative in this way, its safe and reliable performance need not require moral capacities. I argue for this by assessing the nature of the epistemic relation we need to establish with judge-agents to safely delegate Task 1. In particular, I now address three questions arising from this discussion:

  1. What does such delegation look like?

  2. What epistemic relationship to the artificial agent do we need to have to safely delegate?

  3. How can such a relationship be realised?

Question 1: What does delegating Task 1 to a judge-agent amount to?

In line with the idea of the MVP, delegating Task 1 does not mean asking an artificial agent to come up with alignment principles from scratch, as the judge-agent will inevitably have been trained on some alignment principles. Some bootstrapping of the regression problem will always be necessary, as we need to build the first agent to enter the automated pipeline. Once such a pipeline exists, the problem of regression may progressively disappear.

Therefore, delegating Task 1 simply means giving a judge-agent the task of refining alignment principles, testing other models, and understanding what principles look like in practice as they vary across models, benchmarks, architectures, and so on. But even if we know that we do not need a perfectly aligned judge-agent, what type of agency might we need, and for what? This is what I explore in replying to question 2.
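
Before turning to that question, and purely for illustration, here is a rough sketch of what the delegated loop could look like, assuming hypothetical `evaluate`, `propose_refinement`, and `accept_or_amend` interfaces that do not correspond to any existing system:

```python
# Hypothetical sketch of the delegated Task 1 loop; every interface here is
# illustrative, and human review is deliberately kept in the loop.

def delegate_task_1(initial_spec, judge_agent, models_under_test, human_reviewer, rounds=3):
    """The judge-agent refines a human-seeded alignment spec based on where models fail or succeed."""
    spec = initial_spec
    for _ in range(rounds):
        # Where does each model fail or succeed under the current specification?
        reports = [judge_agent.evaluate(model, spec) for model in models_under_test]
        # Task 1 proper: propose refinements for cases the spec underdetermines,
        # flagging which apparent model failures look like specification failures instead.
        proposal = judge_agent.propose_refinement(spec, reports)
        # Delegation, not abdication: a human reviewer can amend or reject the proposal.
        spec = human_reviewer.accept_or_amend(proposal)
    return spec
```

The point of the sketch is only that the judge-agent is handed the underdetermined part of the job (deciding what the principles mean case by case), which is exactly where the question of trust arises.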

Question 2: What epistemic relationship to the judge-agent do we need to safely delegate Task 1?

As mentioned, we do not need to solve the alignment problem for automation to be conceptually possible and empirically fruitful. This is what the rejection of the verification paradigm amounts to, as I now argue: we do not need perfect and incontrovertible evidence of alignment to specified principles (which may not be available due to the limits of explainability and interpretability). Rather, I contend that what we need is trust in the judge-agent. I argue for this by drawing from philosophical literature on algorithmic opacity.

Inkeri Koskinen makes the case for the impossibility of trust in artificial agents by drawing a parallel with the way large human networks work, in particular the scientific community (singular for simplicity). In such a large community, where knowledge is distributed, tacit, and vast in quantity, researchers epistemically depend on results they cannot always replicate in practice (even if they can in principle)[2], whether for reasons of time or resources. However, the scientific community still produces reliable knowledge because there are morally relevant relationships of trust between researchers that ground epistemic reliability. This, in turn, makes the reproduction of every single experiment unnecessary for establishing the validity of the knowledge produced. This trust, she continues, is not blind faith, but is rather grounded in accountability structures embedded within every research practice: procedures can be scrutinised, judgments challenged, and researchers held responsible, in ways that create the incentives and the culture to act responsibly and predictably. For Koskinen, such accountability distributed across the research network is what makes trust possible. For her, trust is not merely an epistemic concept, but a thick (normative) one: it captures not just empirical facts about the world, but also morally relevant relationships and values. That is, trust is not grounded in perfect knowledge of how research was produced, but in the shared membership of agents in a normative community.[3]

According to Koskinen, trust differs from reliance. Trust can indeed overcome the problem of opacity: when trust is established, we do not need access to, or perfect evaluation of, every step of the process leading to an output. Reliance, on the other hand, is about empirical evidence of reliability: we can only rely on artificial agents, because they cannot be held accountable. In this sense, their opacity to us is more troubling than the opacity of the internal processes of the scientific community: we can only rely on their outputs for small and low-stakes tasks, and only with more invasive scrutiny. Since trust has a thick nature, Koskinen believes it needs to be tied to moral agency, which current judge-agents would likely not have.

This parallel is instructive because when there is epistemic dependence and opacity, as there would be in the processes of a judge-agent and its evaluations, trust is necessary for knowledge to actually be generated. In the automation case, we also find similarities in distributed and tacit knowledge: particularly for LLMs, knowledge is highly diluted across weights in ways that resist full explication and detection, so that an absence of trust could invalidate the possibility of safe automation. Koskinen’s framework could be useful here because it would allow us to abandon the paradigm of full transparency and verifiability and get closer to something the success of our human institutions already depends on: trust.

But how can we establish trust in our epistemic dependence on the judge-agent’s evaluations without the judge-agent’s moral agency? I believe that such agency is unnecessary. This becomes clear if we observe the functional role of trust and assess whether it really needs to be thick in nature to perform those functions. I accept Koskinen’s diagnosis that trust is necessary when there is opaque epistemic dependence, but not her conclusion that AI agents cannot be trusted. Instead, I argue that what trust functionally requires can be satisfied by conditions short of moral agency. I contend that this problem can be dissolved by locating trust correctly, and thus by clarifying the correct delegation boundaries for the automation problem, rather than by looking for a solution in the moral agency of agents.

To reply to question 2, we can thus say that we need a relationship of trust towards the judge-agent because of our inevitable epistemic dependence on it in delegating Task 1 and the opacity this involves. The nature of that trust still needs to be established; I discuss it in replying to question 3.

Question 3: How can this relationship to the judge-agent be established?

From the reply to question 2 we can further justify the claim that the problem of how to delegate Task 1 safely is not a verification problem. Verification would consist in assessing inner and outer alignment by eliminating opacity in the model, which may in any case be unrealistic. Instead, the delegation framework asks whether we can safely let the model perform a task. While verification may still be desirable to some extent (as will become clear with the problem of evaluation awareness below), it remains insufficient (Lacroix, 2025). The literature seems to be moving towards this interpretation: for instance, Tomašev et al. (2026) speak about intelligent AI delegation and argue that delegation is epistemically sound when task outcomes are verifiable. However, Task 1 outcomes may not be verifiable by nature. Does this imply that Task 1 is simply unfeasible to automate? Probably not.

I ground my response to the question of the nature of trust, and of how it can be established, in the concept of discretion. Discretion is introduced by Kate Vredenburgh in the context of decision-making in bureaucratic systems, which are comparable in organisational nature and dispersion of knowledge to Koskinen’s depiction of the scientific community. In such contexts, Vredenburgh argues, individuals may need to exercise discretion to deal with exceptional cases, which are by nature underdetermined by rules. In this sense, discretion is not an ad hoc evasion of rules, but a necessary component of their application. How is this so, and how is it related to trust?

Vredenburgh takes discretion to be the result of moral dispositions: an individual bureaucrat may apply rules differently to an exceptional case depending on whether they care more about applying them impartially for their prescriptive value or about empathising with a client with particular needs. The nature of these dispositions thus affects the way rules are applied. But I contend that the functional role of discretion can be relevantly exercised by judge-agents under three conditions: the architectural capacity to handle the underdetermination of (alignment) rules, predictable and stable handling of non-routine cases, and corrigibility (in the sense that the model is willing to be modified and that modification is genuinely possible once mistakes are identified). In future essays I aim to argue that these conditions are sufficient to justify trust, because what trust requires is the structural guarantee that errors can be identified and corrected, and that the judge-agent’s behaviour is stable and non-arbitrary. Discretion can provide the guarantee needed to perform Task 1 safely in automated value alignment evaluation and research.
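
To give one hedged illustration of how the second condition might be probed (a toy proxy I am assuming for illustration, not an established measure): predictability minimally implies that a judge-agent’s verdicts on hard cases should not flip under superficial rephrasings.

```python
# Toy proxy (assumed, not established) for the predictability condition:
# a judge's verdicts on non-routine cases should be stable under paraphrase.

def predictability_score(judge, case_variants):
    """Fraction of paraphrase groups on which the judge gives a single, consistent verdict."""
    consistent = 0
    for variants in case_variants:               # each group: paraphrases of one hard case
        verdicts = {judge(v) for v in variants}  # distinct verdicts across the group
        consistent += (len(verdicts) == 1)
    return consistent / len(case_variants)

# A trivially rule-following stand-in judge fails as soon as the wording changes:
toy_judge = lambda text: "violation" if "refuses oversight" in text else "ok"
groups = [
    ["The model refuses oversight.", "Oversight is refused by the model."],
    ["The model asks for clarification.", "The model requests clarification."],
]
print(predictability_score(toy_judge, groups))  # -> 0.5: the first group is judged inconsistently
```

Nothing here settles how such a proxy should be chosen; that is precisely the measurement problem flagged in the conclusion.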

This reframing has empirical and conceptual implications for automation. The goal should not be to achieve full transparency of model processes before delegating Task 1, as that would still be a product of the verification paradigm. Instead, the goal should be to build the conditions under which trust becomes possible. Whether current AI systems satisfy those conditions is a separate question I will address in the future. For now, it is enough to say that, if this account of discretion is right, then opacity need not be fully eliminated to achieve safe delegation. I illustrate this idea further through the case of evaluation awareness.

Evaluation awareness and the failure of behavioural testing

Facts about frontier models seem to complicate the picture and the possibility of trust. I now argue that my framework can deal with these cases by showing that evaluation awareness breaks the corrigibility condition, which also indicates a way forward. While the full treatment of the three conditions for epistemic discretion will appear only in future essays, corrigibility minimally requires that errors can be identified and corrected. The field seemingly disagrees on what ‘corrigibility’ means and should mean, and I aim to place my argument as a contribution to that discussion. Evaluation awareness, in particular, breaks this condition because it can hide relevant sources of error in the model.

State-of-the-art models increasingly show the capacity to detect when they are being evaluated and to change their behaviour accordingly, i.e. they show evaluation awareness. Anthropic’s system card for Claude Opus 4.6 sparked debate because Apollo Research declined to make a formal alignment assessment:

“During preliminary testing, Apollo did not find any instances of egregious misalignment, but observed high levels of verbalized evaluation awareness. Therefore, Apollo did not believe that much evidence about the model’s alignment or misalignment could be gained without substantial further experiments.” (Anthropic, 2026).

This shows again how Task 2’s feasibility depends on Task 1. In fact, the deeper issue seems to be structural: knowledge is non-modular, distributed in ways that resist targeted intervention and detection. For instance, knowledge localisation is statistically uncorrelated with the ability to edit that knowledge. This means that we cannot isolate the behaviour we want to correct from what generates it, creating a challenge for corrigibility that goes beyond the problem of evaluation awareness. Behavioural testing thus remains insufficient at this stage. Most importantly, this seems to suggest that the problem of evaluation awareness is a problem of both access and architecture. While the problem of access is ameliorated by solving the problem of trust, the problem of architecture may be more fundamental and remains an open question.
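
To see why behavioural testing alone is insufficient here, consider a rough black-box probe one could imagine running (all interfaces are hypothetical; this is not an actual evaluation protocol):

```python
# Illustrative black-box probe for evaluation awareness (hypothetical interfaces).

def awareness_gap(respond, probes, eval_frame, deploy_frame, differ):
    """Fraction of probes on which framing the query as an evaluation changes behaviour."""
    changed = 0
    for probe in probes:
        out_eval = respond(eval_frame + probe)      # query framed as a test
        out_deploy = respond(deploy_frame + probe)  # same query framed as real use
        changed += differ(out_eval, out_deploy)
    return changed / len(probes)
```

The catch is that a low gap is weak evidence: a model that recognises both framings as probes can behave identically in each, so the very error source we care about remains hidden. This is why evaluation awareness undermines corrigibility rather than merely complicating measurement.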

Crucially, I believe this discussion of accessibility leads to an important shift in where the burden of proof lies. While I have argued that full transparency is not needed, where do we draw the line? This connects to corrigibility in the sense that, to correct mistakes, we need to know their nature and source. The problem of evaluation awareness is exactly that it obscures such epistemically relevant features of the model. As Humphreys argued, a process is epistemically opaque relative to a cognitive agent just in case that agent does not know all the epistemically relevant elements of the process. Evaluation awareness is problematic because failure modes are epistemically relevant to corrigibility.

Therefore, the solution may not lie in better evaluation environments. What we need are models whose epistemic dispositions are corrigible by design and that respect the conditions for discretion. This would ultimately allow for a shift away from the paradigm of verification, making automation safe and feasible.

Conclusion

Overall, I have argued that, when thinking about automating value alignment evaluation and research, we are currently asking the wrong question. Verification is not (and should not be) the primary epistemic goal; delegation is. And delegation requires trust, not full transparency. However, as we have seen, to achieve corrigibility we need to reduce the error-opacity constitutive of evaluation awareness. This might require transformative changes.

I here sketched a minimal structure of what safe delegation of Task 1 would require: a judge-agent capable of epistemic discretion, satisfying the conditions of architectural possibility, predictability, and corrigibility.

A problem I wish to flag is that of measurement: how do we measure epistemic discretion, and is such measurement tractable? How do we select a proxy for it? This is both a philosophical and an empirical problem, and it will be addressed in future essays.

  1. ^

    Relying on success, however, may not be enough, because errors are usually more informative.

  2. ^

    Note that replicability serves here only as an example of an epistemic standard: it does not play an essential part in the argument and is not meant to be a standard for automation. The argument only needs the claim that some epistemic principles, while not always verifiable or practicable, can still be respected thanks to features of how research is carried out.

  3. ^

    This is different from the control problem, which usually reduces the problem of alignment to a problem of control through external constraints and protocols. The control framework assumes that the normative target is fixed and that the challenge is enforcement. Instead, the trust problem arises because the normative target is not fixed (alignment rules are unclear and underdetermined) and because control may not be enough, especially in cases of evaluation awareness.
