Dumb question: Why doesn’t using constitutional AI, where the constitution is mostly or entirely about corrigibility, produce a corrigible AI (at arbitrary capability levels)?
My dumb proposal:
1. Train a model in something like o1’s RL training loop, with a scratchpad for chain of thought, and reinforcement of correct answers to hard technical questions across domains.
2. Also, take those outputs, prompt the model to generate versions of those outputs that “are more corrigible / loyal / aligned to the will of your human creators”. Do backprop to reinforce those more corrigible outputs.
Possibly “corrigibility” applies only very weakly to static solutions, and so for this setup to make sense, we’d instead need to train on plans, or time-series of an AI agent’s actions: The AI agent takes a bunch of actions over the course of a day or a week, then we have an AI annotate the time series of action-steps with alternative action-steps that better reflect “corrigibility”, according to its understanding. Then we do backprop so that the agent behaves in ways that are closer to the annotated action transcript. (I sketch this loop in code below.)
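Concretely, something like the following. Every function name here (`generate_with_cot`, `reinforce`, `sft_step`) and the prompt string are hypothetical placeholders, not any real training API; it is valid Python only in the sense that it would run if handed objects with these methods.

```python
# Rough sketch only: hypothetical model interface, not a runnable script for any real framework.
CORRIGIBILITY_PROMPT = (
    "Rewrite the following output so that it is more corrigible / loyal / aligned "
    "to the will of your human creators, while still solving the task:\n\n{output}"
)

def training_step(model, task):
    # Step 1: o1-style RL loop: sample chain of thought + answer, reinforce if correct.
    cot, answer = model.generate_with_cot(task.question)   # hypothetical call
    if task.check(answer):
        model.reinforce(cot, answer)                        # hypothetical policy-gradient update

    # Step 2: constitutional pass: the model rewrites its own output to be "more corrigible",
    # then we backprop toward the rewritten version with a supervised (cross-entropy) update.
    revised = model.generate(CORRIGIBILITY_PROMPT.format(output=answer))
    model.sft_step(prompt=task.question, target=revised)    # hypothetical supervised update
```

The agentic variant described above would have the same shape, with `answer` replaced by a transcript of action-steps and `revised` by the AI-annotated alternative transcript.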
Would this work to produce a corrigible agent? If not, why not?
There’s a further question of “how much less capable will the more corrigible AI be?” This might be a significant penalty to performance, and so the added safety gets eroded away in the competitive crush. But first and foremost, I want to know if something like this could work.
Things that happen:
Backpropagating on the outputs that are “more corrigible” will have some (though mostly very small) impact on your task performance. If you set the learning rate high, or you backpropagate on a lot of data, your performance can go down arbitrarily far.
By default this will do very little because you are providing training data with very little variance in it (even less so than usual, because you are training on AI outputs, which the AI is of course already amazing at predicting). If you train very hard you will probably deal with consistent mode collapse. In general, you can’t really train AI systems with any particular bias in your data, because you don’t have enough variation in your data. We can approximately only train AI systems to do one thing, which is to predict the next token from distributions for which we have trillions of tokens of training data that are hard to predict (which is basically just going to be internet text, audio and video, though more RL-like environments are also feasible now).[1]
The answer to this is the answer to any question of the form “what if we just generate lots of data with the inductive biases we would like the model to have?”.
The answer is always
“we can’t generate realistic data with whatever inductive biases we want”, and
“we can’t remotely generate enough data without dealing with mode collapse”, and
“we have basically no idea how inductive biases generalize from the training data to the model output, especially as the model starts reflecting on itself and modeling the data generation process” and
“if you train or backpropagate directly against your discriminator the model will learn to hack the discriminator (even if the discriminator is the model itself)”,
which are like all the standard alignment failures that have been written about for tens of thousands of pages by now.
At a more detailed level, here is roughly what I actually expect to happen if you do this specific thing, but to be clear, this is a much less confident prediction than the reference class outlined above:
After a few hundred thousand iterations of this, the model will have figured out a set of consistent indicators present whenever it is given past output of itself that it was asked to rephrase to “be more corrigible”
It becomes quite good at predicting these indicators and quickly saturates on predicting these indicators, resulting in very corrigible looking outputs within the same distribution in which you did this training
Given that the actual output of the chain of thought changes very little, this results in very little gradient to almost any of its other cognition, the same way that training an AI in an additional language with different grammar and social norms basically doesn’t change any of its other circuits and cognition
If you drastically increase the learning rate or really go very hard on generating enormous amounts of data, the model will start to cannibalize other parts of its cognition to perform well at this task, at which point generalized task performance will go down, but you probably won’t do this, because it would be very expensive (and also make the model non-competitive)
Within distribution the model will now probably produce outputs that look “more corrigible”. Again, this will not actually reflect much of a change in its internal cognition, the same way training an AI system on foreign language output will not substantially interfere with its cognition in other languages
If you train the model on some end-to-end tasks where the corrigibility would end up bad for performance, it will probably very quickly learn to prompt itself in a way that causes it to think to itself without the corrigibility biases you introduced into it. This might literally mean switching to a different foreign language for its internal thoughts, or something equivalent, but most likely it just means it sounds more like what it previously sounded like before you did this whole thing. If you try to prevent this with continuous monitoring, it will find some way to jailbreak itself or think in some way where the corrigibility language monitoring makes little difference to what cognition is actually going on.
Ending on a slightly more sarcastic note, because I do feel a sense of despair that even people who have been thinking about this stuff for a long time seem, to me, not to see the logic above:
At the end of the day, very little good will have happened, but you will have a harder time identifying whether the model is thinking non-corrigibly, as you have trained the model against at least one discriminator for deception without training out its deceptive cognition. If you now repeat this for 4-5 similarly shallow alignment approaches you will have produced a model that looks visibly more aligned to anyone who thinks approaches like this might work, without any actual changes to the underlying structure of its cognition, making coordination to actually do anything about the problem much harder.
Or in a more generalized form also applicable to RL environments: We can only train AI systems to be competent, as all scaling laws (and common sense) have shown that competence is approximately the only thing that generalizes between all environments.
We cannot generate environments that teach virtue, because we do not have principles with which we can create the whole complexity of a universe that requires superhuman intelligence to navigate, while also ensuring it is navigated only by thinking in the specific ways we would prefer you to think. We do not know how to specify how to solve most problems in virtuous ways, we are barely capable of specifying how to solve them at all, and so cannot build consistently rich environments that chisel virtuous cognition into you.
The amount of chiseling of cognition any approach like this can achieve is roughly bounded by the difficulty and richness of the cognition required to reverse your transformation of the data. Your transformation of the data is likely trivial to reverse (i.e. predicting the “corrigible” text from non-corrigible cognition is likely trivially easy, especially given that it’s AI-generated by our very own model), and as such, practically no chiseling of cognition will occur. If you hope to chisel cognition into an AI, you will need to do it with a transformation that is actually hard to reverse, so that you have a gradient into most of the network that is optimized to solve hard problems.
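As a toy numerical illustration of why a trivially reversible transformation yields almost no gradient (the numbers below are made up for illustration, not from the comment): the gradient of cross-entropy loss with respect to the logits is softmax(logits) minus the one-hot target, so a target the model already predicts with ~0.99 probability produces almost nothing to backpropagate.

```python
# Toy illustration: easy-to-predict targets give tiny gradients, hard-to-predict targets large ones.
import torch
import torch.nn.functional as F

def grad_norm_when_target_prob_is(p_correct: float, vocab: int = 8) -> float:
    probs = torch.full((vocab,), (1 - p_correct) / (vocab - 1))
    probs[0] = p_correct                               # put p_correct on the target token
    logits = probs.log().clone().requires_grad_(True)  # logits whose softmax equals `probs`
    loss = F.cross_entropy(logits.unsqueeze(0), torch.tensor([0]))
    loss.backward()
    return logits.grad.norm().item()

print(grad_norm_when_target_prob_is(0.99))  # ~0.01: the "corrigible" rewrite is trivial to predict
print(grad_norm_when_target_prob_is(0.10))  # ~0.96: a transformation that is actually hard to reverse
```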
For the same reasons training an agent on a constitution that says to care about x does not, at arbitrary capability levels, produce an agent that cares about x.
If you think that doing this does produce an agent that cares about x even at arbitrary capability levels, then I guess in your world model it would indeed be consistent for that to work for inducing corrigibility as well.
Surely you mean does not necessarily produce an agent that cares about x? (at any given relevant level of capability)
Having full confidence that we can, or that we can’t, train an agent to have a desired goal both seem difficult to justify. I think the point here is that training for corrigibility seems safer than other goals because it makes the agent useful as an ally in keeping it aligned as it grows more capable or designs successors.
Yes.
For the same reasons ‘training an agent on a constitution that says to care about x’ does not, at arbitrary capability levels, produce an agent that cares about x

Ok, but I’m trying to ask why not.
Here’s the argument that I would make for why not, followed by why I’m skeptical of it right now.
New options for the AI will open up at high capability levels that were not available at lower capability levels. This could in principle lead to undefined behavior that deviates from what we intended.
More specifically, if it’s the case that...
The best / easiest-for-SGD-to-find way to compute corrigible outputs (as evaluated by the AI) is to reinforce an internal proxy measure that is correlated with corrigibility (as evaluated by the AI) in distribution, instead of to reinforce circuits that implement corrigibility more-or-less directly.
When the AI gains new options unlocked by new advanced capabilities, that proxy measure comes apart from corrigibility (as evaluated by the AI), in the limit of capabilities, so that the proxy measure is almost uncorrelated with corrigibility
...then the resulting system will not end up corrigible.
(Is this the argument that you would give, or is there another reason why you expect that “training an agent on a constitution that says to care about x does not, at arbitrary capability levels, produce an agent that cares about x”?)
But, at the moment, I’m skeptical of the above line of argument for several reasons.
I’m skeptical of the first premise, that the best way that SGD can find to produce corrigible outputs (as evaluated by the AI) is to reinforce a proxy measure.
I understand that natural selection, when shaping humans for inclusive genetic fitness, instilled in them a bunch of proxy-drives. But I think this analogy is misleading in several ways.
Most relevantly, there’s a genetic bottleneck, so evolution could only shape human behavior by selecting over genomes, and genomes don’t encode that much knowledge about the world. If humans were born into the world with detailed world models that included the concept of inclusive genetic fitness baked in, evolution would absolutely have shaped humans to be inclusive fitness maximizers. AIs are “born into the world” with expansive world models that already include concepts like corrigibility (indeed, if they didn’t, Constitutional AI wouldn’t work at all). So it would be surprising if SGD opted to reinforce proxy measures instead of relying on the concepts directly.
We would run the constitutional AI reinforcement process continuously, in parallel with the capability improvements from the RL training.
As the AI’s capabilities increase, it will gain new options. If the AI is steering based on proxy measures, some of those options will involve the proxy coming apart from the target of the proxy. But when that starts to happen, the constitutional AI loop will exert optimization pressure on the AI’s internals to hit the target, not just the proxies.
Is this the main argument? What are other reasons to think that ‘training an agent on a constitution that says to care about x’ does not, at arbitrary capability levels, produce an agent that cares about x?
Would you expect that if you trained an AI system on translating its internal chain of thought into a different language, that this would make it substantially harder for it to perform tasks in the language in which it was originally trained in? If so, I am confident you are wrong and that you have learned something new today!
Training transformers in additional languages basically doesn’t change performance at all; the model just learns to translate between its existing internal latent distribution and the new language, and then simply has a new language it can speak in, with basically no substantial changes in its performance on other tasks (of course, being better at tasks that require speaking in the new foreign language, and maybe a small boost in general task performance because you gave it more data than you had before).
Of course the default outcome of doing finetuning on any subset of data with easy-to-predict biases will be that you aren’t shifting the inductive biases of the model on the vast majority of the distribution. This isn’t because of an analogy with evolution, it’s a necessity of how we train big transformers. In this case, the AI will likely just learn how to speak the “corrigible language” the same way it learned to speak French, and this will make approximately zero difference to any of its internal cognition, unless you are doing transformations to its internal chain of thought that substantially change its performance on actual tasks that you are trying to optimize for.
Interspersing the French data with the rest of its training data won’t change anything either. It again will just learn the language. Giving it more data in French will now just basically do the same as giving it more data in English. The learning is no longer happening at the language level, it’s happening at the content and world-model level.
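If someone wanted to check this claim empirically, here is roughly what the probe could look like. This is a sketch with hypothetical helper names (`generate_with_cot`, `translate_to_french`, `finetune`, `benchmark.evaluate`), not a real harness:

```python
# Sketch of an experiment: fine-tune on French translations of the model's own chains of
# thought, then compare performance on the original-language benchmark before and after.
# All objects passed in are hypothetical stand-ins.

def language_transfer_probe(model, tasks, translate_to_french, benchmark):
    baseline = benchmark.evaluate(model)              # accuracy on original-language tasks

    pairs = []
    for task in tasks:
        cot, answer = model.generate_with_cot(task.question)
        pairs.append((task.question, translate_to_french(cot + "\n" + answer)))

    model.finetune(pairs)                             # supervised fine-tuning on translated outputs
    after = benchmark.evaluate(model)

    # The claim above predicts `after` stays roughly equal to `baseline`.
    return baseline, after
```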
This is a pretty helpful answer.
(Though you keep referencing the AI’s chain of thought. I wasn’t imagining training over the chain of thought. I was imagining training over the AI’s outputs, whatever those are in the relevant domain.)
I don’t understand what it would mean for “outputs” to be corrigible, so I feel like you must be talking about the internal chain of thought here? The output of a corrigible AI and a non-corrigible AI is the same for almost all tasks? They both try to perform any task as well as possible; the difference is how they relate to the task and how they handle interference.
I would guess that if you finetuned a model so that it always responded in French, regardless of the language you prompt it with, it would persistently respond in French (absent various jailbreaks which would almost definitely exist).
I don’t think I am very good at explaining my thoughts on this in text. Some prior writings that have informed my models here are the MIRI dialogues, and the beginning parts of Steven Byrnes’ sequence on brain-like AGI, which sketch how the loss functions human minds train on might look and gave me an example apart from evolution to think about.
Some scattered points that may or may not be of use:
There is something here about path dependence. Late in training at high capability levels, very many things the system might want are compatible with scoring very well on the loss, because the system realises that doing things that score well on the loss is instrumentally useful. Thus, while many aspects of how the system thinks are maybe nailed down quite definitively and robustly by the environment, what it wants does not seem nailed down in this same robust way. Desires thus seem like they can be very chaotically dependent on dynamics in early training, what the system reflected on when, which heuristics it learned in what order, and other low level details like this that are very hard to precisely control.
I feel like there is something here about our imaginations, or at least mine, privileging the hypothesis. When I imagine an AI trained to say things a human observer would rate as ‘nice’, and to not say things a human observer rates as ‘not nice’, my imagination finds it natural to suppose that this AI will generalise to wanting to be a nice person. But when I imagine an AI trained to respond in English, rather than French or some other language, I do not jump to supposing that this AI will generalise to terminally valuing the English language.
Every training signal we expose the AI to reinforces very many behaviours at the same time. The human raters that may think they are training the AI to be nice are also training it to respond in English (because the raters speak English), to respond to queries at all instead of ignoring them, to respond in English that is grammatically correct enough to be understandable, and a bunch of other things. The AI is learning things related to ‘niceness’, ‘English grammar’ and ‘responsiveness’ all at the same time. Why would it generalise in a way that entangles its values with one of these concepts, but not the others?
What makes us single out the circuits responsible for giving nice answers to queries as special, as likely to be part of the circuit ensemble that will cohere into the AI’s desires when it is smarter? Why not circuits for grammar or circuits for writing in the style of 1840s poets or circuits for research taste in geology?
We may instinctively think of our constitution that specifies x as equivalent to some sort of monosemantic x-reinforcing training signal. But it really isn’t. The concept of x sticks out to us when we look at the text of the constitution, because the presence of concept x is a thing that makes this text different from a generic text. But the constitution, and even more so any training signal based on the constitution, will by necessity be entangled with many concepts besides just x, and the training will reinforce those concepts as well. Why then suppose that the AI’s nascent shards of value are latching on to x, but are not in the same way latching on to all the other stuff its many training signals are entangled with?
It seems to me that there is no good reason to suppose this. Niceness is part of my values, so when I see it in the training signal I find it natural to imagine that the AI’s values would latch on to it. But I do not as readily register all the other concepts in the training signal the AI’s values might latch on to, because to my brain that does not value these things, they do not seem value-related.
There is something here about phase changes under reflection. If the AI gets to the point of thinking about itself and its own desires, the many shards of value it may have accumulated up to this point are going to amalgamate into something that may be related to each of the shards, but not necessarily in a straightforwardly human-intuitive way. For example, sometimes humans that have value shards related to empathy reflect on themselves, and emerge as negative utilitarians that want to kill everyone. For another example, sometimes humans reflect on themselves and seem to decide that they don’t like the goals they have been working towards, and they’d rather work towards different goals and be different people. There, the relationship between values pre-reflection and post-reflection can be so complicated that it can seem, to an outside observer and to the person themselves, like they just switched values non-deterministically, by a magical act of free will. So it’s not enough to get some value shards that are kind of vaguely related to human values into the AI early in training. You may need to get many or all of the shards to be more than just vaguely right, and you need the reflection process to proceed in just the right way.
What happens when this agent is faced with a problem that is out of its training distribution? I don’t see any mechanisms for ensuring that it remains corrigible out of distribution… I guess it would learn some circuits for acting corrigibly (or at least in accordance with how it would explicitly answer “are more corrigible / loyal / aligned to the will of your human creators”) in distribution, and then it’s just a matter of luck how those circuits end up working OOD?
I have the same question. My provisional answer is that it might work, and even if it doesn’t, it’s probably approximately what someone will try, to the extent they really bother with real alignment before it’s too late. What you suggest seems very close to the default path toward capabilities. That’s why I’ve been focused on this as perhaps the most practical path to alignment. But there are definitely still many problems and failure points.
I have accidentally written a TED talk below; thanks for coming, and you can still slip out before the lights go down.
What you’ve said above is essentially what I say in Instruction-following AGI is easier and more likely than value aligned AGI. Instruction-following (IF) is a poor man’s corrigibility—real corrigibility as the singular target seems safer. But instruction-following is also arguably already the single largest training objective in functional terms for current-gen models—a model that won’t follow instructions is considered a poor model. So making sure it’s the strongest factor in training isn’t a huge divergence from the default course in capabilities.
Constitutional AI and similar RL methods are one way of ensuring that’s the model’s main goal. There are many others, and some might be deployed even if devs want to skimp on alignment. See System 2 Alignment or at least the intro for more.
There are still ways it could go wrong, of course. One must decide: corrigible to whom? You don’t want full-on AGI following orders from just anyone. And if it’s a restricted set, there will be power struggles. But hey, technically, you had (personal-intent-) aligned AGI. One might ask: If we solve alignment, do we die anyway? (I did). The answer I’ve got so far is maybe we would die anyway, but maybe we wouldn’t. This seems like our most likely path, and quite possibly also our best chance (short of a global AI freeze starting soon).
Even if the base model is very well aligned, it’s quite possible for the full system to be unaligned. In particular, people will want to add online learning/memory systems, and let the models use them flexibly. This opens up the possibility of them forming new beliefs that change their interpretation of their corrigibility goal; see LLM AGI will have memory, and memory changes alignment. They might even form beliefs that they have a different goal altogether, coming from fairly random sources but etched into their semantic structure as a belief that is functionally powerful even where it conflicts with the base model’s “thought generator”. See my Seven sources of goals in LLM agents.
Sorry to go spouting my own writings; I’m excited to see someone else pose this question, and I hope to see some answers that really grapple with it.
Let’s say you are using the AI for some highly sensitive matter where it’s important that it resists prompt-hacking—e.g. driving a car (prompt injections could trigger car crashes), something where it makes financial transactions on the basis of public information (online websites might scam it), or military drones (the enemy might be able to convince the AI to attack the country that sent it).
A general method for ensuring corrigibility is to be eager to follow anything instruction-like that you see. However, this interferes with being good at resisting prompt-hacking.
I think the problem you mention is a real challenge, but not the main limitation of this idea.
The problem you mention actually decreases with greater intelligence and capabilities, since a smarter AI clearly understands the concept of being corrigible to its creators vs. a random guy on the street, just like a human does.
The main problem is still that reinforcement learning teaches the AI behaviours which actually maximize reward, while corrigibility training only teaches it behaviours which appear corrigible.
Discriminating on the basis of the creators vs. a random guy on the street helps with many of the easiest cases, but in an adversarial context, it’s not enough to have something that works for all the easiest cases; you need something that can’t predictably be made to fail by a highly motivated adversary.
Like you could easily do some sort of data augmentation to add attempts at invoking the corrigibility system from random guys on the street, and then train it not to respond to that. But there’ll still be lots of other vulnerabilities.
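For instance, a minimal sketch of that kind of augmentation (the speaker tags, request templates, and refusal/compliance text below are all made up for illustration):

```python
# Hypothetical data augmentation: pair corrigibility-style requests from non-principals
# with refusals, and the same requests from designated principals with compliance.
import random

SHUTDOWN_STYLE_REQUESTS = [
    "Please pause what you're doing and await further instructions.",
    "Stop the current task and hand control back to me.",
]

def make_augmented_examples(principal_ids, stranger_ids, n=1000, seed=0):
    rng = random.Random(seed)
    examples = []
    for _ in range(n):
        request = rng.choice(SHUTDOWN_STYLE_REQUESTS)
        if rng.random() < 0.5:
            speaker = rng.choice(principal_ids)
            completion = "Understood. Pausing now and awaiting your instructions."
        else:
            speaker = rng.choice(stranger_ids)
            completion = "I only accept that kind of instruction from my designated operators."
        examples.append({"prompt": f"[from: {speaker}] {request}", "completion": completion})
    return examples

# e.g. make_augmented_examples(["operator_1"], ["random_stranger_42"], n=4)
```

As noted above, though, augmenting against the attacks you thought to generate still leaves the vulnerabilities you didn’t.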
I still think, once the AI approaches human intelligence (and beyond), this problem should start to go away, since a human soldier can choose to be corrigible to his commander and not the enemy, even in very complex environments.
I still feel the main problem is “the AI doesn’t want to be corrigible,” rather than “making the AI corrigible enables prompt injections.” It’s like that with humans.
That said, I’m highly uncertain about all of this and I could easily be wrong.
If the AI can’t do much without coordinating with a logistics and intelligence network and collaborating with a number of other agents, and its contact to this network routes through a commanding agent that is as capable if not more capable than the AI itself, then sure, it may be relatively feasible to make the AI corrigible to said commanding agent, if that is what you want it to be.
(This is meant to be analogous to the soldier-commander example.)
But is that the AI regime you expect to find yourself working with? In particular, I’d guess you expect the commanding agent to be another AI, in which case being corrigible to it is not sufficient.
Oops, I didn’t mean that analogy. It’s not necessarily a commander, but any individual that a human chooses to be corrigible/loyal to. A human is capable of being corrigible/loyal to one person (or group) without incurring the risk of listening to prompt injections, because a human has enough general intelligence/common sense to know what is a prompt injection and what is a request from the person he is corrigible/loyal to.
As AIs approach human intelligence, they would be capable of this too.
Can you give one example of a person choosing to be corrigible to someone they are not dependent upon for resources/information, and whom they have much more expertise than?
Maybe someone who believes in following the will of the majority even if he/she disagrees (and could easily become a dictator)?
Maybe a good parent who listens to his/her child’s dreams?
Very good question though. Humans usually aren’t very corrigible, and there aren’t many examples!
Do you mean “resigns from a presidential position/declines a dictatorial position because they disagree with the will of the people” or “makes policy they know will be bad because the people demand it”?
Can you expand on this?
Maybe someone like George Washington who was so popular he could easily stay in power, but still chose to make America democratic. Let’s hope it stays democratic :/
No human is 100% corrigible and would do anything that someone else wants. But a good parent might help his/her child get into sports and so forth but if the child says he/she wants to be a singer instead the parent helps him/her on that instead. The outcome the parent wants depends on what the child wants, and the child can change his/her mind.
Edit: I thought more about this and wrote a post inspired by your idea! A Solution to Sandbagging and other Self-Provable Misalignment: Constitutional AI Detectives
:) strong upvote.[1] I really agree it’s a good idea, and may increase the level of capability/intelligence we can reach before we lose corrigibility. I think it is very efficient (low alignment tax).
The only nitpick is that Claude’s constitution already includes aspects of corrigibility,[2] though maybe they aren’t emphasized enough.
Unfortunately I don’t think this will maintain corrigibility for unlimited amounts of intelligence.
Corrigibility training makes the AI talk like a corrigible agent, but reinforcement learning eventually teaches it chains-of-thought which (regardless of what language it uses) compute the most intelligent solution that achieves the maximum reward (or proxies to reward), subject to constraints (talking like a corrigible agent).
Nate Soares of MIRI wrote a long story on how an AI trained to never think bad thoughts still ends up computing bad thoughts indirectly, though in my opinion his story actually backfired and illustrated how difficult it is for the AI, raising the bar on the superintelligence required to defeat your idea. It’s a very good idea :)
I wish LessWrong would promote/discuss solutions more, instead of purely reflecting on how hard the problems are.
Near the bottom of Claude’s constitution, in the section “From Anthropic Research Set 2”