Noosphere89 comments on Thoughts on “AI is easy to control” by Pope & Belrose

Noosphere89 2 Dec 2023 18:27 UTC
10 points
4
My general thoughts about this post, divided into sections for easier explanation:
1. Even if controllable AI has an “easy” technical solution, I’d still be pessimistic about AI takeover
https://www.lesswrong.com/posts/YyosBAutg4bzScaLu/thoughts-on-ai-is-easy-to-control-by-pope-and-belrose#1__Even_if_controllable_AI_has_an__easy__technical_solution__I_d_still_be_pessimistic_about_AI_takeover

I kinda agree with this point, but I’d say a few things here:

You correctly mention that not all AI risk is solved by AI control being easy, because AI misuse can still be a huge factor, and I agree with this point, but there are still things that change if we grant AI is easy to control:
1. Most AI pause policies become pretty unjustifiable by default, at least without extra assumptions and in general a lot of AI slowdown movements like PauseAI become closer to the nuclear case, which I’d argue is quite negative. That alone would change the dynamic of a lot of AI safety, especially it’s nascent activist wing.
2. Misuse focused policy probably looks less technical, and more normal, for example Know Your Customer laws or hashing could be extremely important if we’re worried about misuse of AI for say bioterrorism.
3. On the balance between offense and defense, I actually think that it depends on the domain, with cyber being the easiest case for defense, bio being the worst case for defense but can be drastically improved, and other fields having more balance between defense and offense. However, I agree that bio is the reason I’d be rather pessimistic on AI being offense-advantaged, but this is an improvable outlier without having to restrict AI all that much, or at all.
1. Black-and-white (box) thinking
https://www.lesswrong.com/posts/YyosBAutg4bzScaLu/thoughts-on-ai-is-easy-to-control-by-pope-and-belrose#2__Black_and_white__box__thinking

I agree that these words should mostly be tabooed here, and I mostly agree with the section, with the caveat that the information that we get from ML models is drastically better than any human, because we only have behavioral analysis from human beings to infer their values, which is a literal black box since we only get the behavior, not the causes of the behavior. We basically never get the intuitive explanation for why a human does something, except in the trivial cases.
1. What lessons do we learn from “human alignment” (such as it is)?
https://www.lesswrong.com/posts/YyosBAutg4bzScaLu/thoughts-on-ai-is-easy-to-control-by-pope-and-belrose#3__What_lessons_do_we_learn_from__human_alignment___such_as_it_is__

I agree with some of this, but I’d say Story 1 applies only very weakly, and that the majority/supermajority of value learning is online, for example via the self-learning/within lifetime-RL algorithms you describe, without relying on the prior. In essence, I agree with the claim that the genes need to impose a prior, which prevents pure blank-slatism from working. I disagree with the claim that this means that genetics need to impose a very strong prior without relying on the self-learning algorithms you describe for capabilities.

This post might help you understand why, especially the top comment by karma also adds additional helping context for why a lot of complexity of value needs to be learned, rather than baked in as a prior:

https://www.lesswrong.com/posts/CQAMdzA4MZEhNRtTp/human-values-and-biases-are-inaccessible-to-the-genome

The main implication for human alignment is that deceptive alignment mostly does not work, or is easy to make not work, because the complexity of the aligned solution and deceptive solution is very similar, only 1-1000 lines of code or 1-1000 bits at most difference in bit complexity, and thus we need very little data to discriminate between the aligned solution and the deceptive solution, which means it’s very easy to solve deceptive alignment, and given the massive incentive for solving alignment for profit, may already be solvable by default.
1. Can we all agree in advance to disavow this whole “AI is easy to control” essay if future powerful AI is trained in a meaningfully different way from current LLMs?
https://www.lesswrong.com/posts/YyosBAutg4bzScaLu/thoughts-on-ai-is-easy-to-control-by-pope-and-belrose#4__Can_we_all_agree_in_advance_to_disavow_this_whole__AI_is_easy_to_control__essay_if_future_powerful_AI_is_trained_in_a_meaningfully_different_way_from_current_LLMs_

For me, the answer is no, because my points apply outside of LLMs, and they can be formulated as long as the prior doesn’t completely dominate the learning process, which can certainly apply to brain-like AGI or model based RL.
- Steven Byrnes 3 Dec 2023 0:31 UTC
  22 points
  13
  Parent
  You correctly mention that not all AI risk is solved by AI control being easy, because AI misuse can still be a huge factor
  It’s odd that you understood me as talking about misuse. Well, I guess I’m not sure how you’re using the term “misuse”. If Person X doesn’t follow best practices when training an AI, and they wind up creating an out-of-control misaligned AI that eventually causes human extinction, and if Person X didn’t want human extinction (as most people don’t), then I wouldn’t call that “misuse”. Would you? I I would call it a “catastrophic accident” or something like that. I did mention in the OP that some people think human extinction is perfectly fine, and I guess if Person X is one of those people, then it would be misuse. So I suppose I brought up both accidents and misuse.
  Misuse focused policy probably looks less technical, and more normal, for example Know Your Customer laws or hashing could be extremely important if we’re worried about misuse of AI for say bioterrorism.
  People who I think are highly prone to not following best practices to keep AI under control, even if such best practices exist, include people like Yann LeCun, Larry Page, Rich Sutton, and Jürgen Schmidhuber, who are either opposed to AI alignment on principle, or are so bad at thinking about the topic of AI x-risk that they spout complete and utter nonsense. (example). That’s not a problem solvable by Know Your Customer laws, right? These people (and many more) are not the customers, they are among the ones doing state-of-the-art AI R&D.
  In general, the more people are technically capable of making an out-of-control AI agent, the more likely that one of them actually will, even if best practices exist to keep AI under control. People like to experiment with new approaches, etc., right? And I expect the number of such people to go up and up, as algorithms improve etc. See here.
  If KYC laws aren’t the answer, what is? I don’t know. I’m not advocating for any particular policy here.
  - Noosphere89 3 Dec 2023 1:09 UTC
    6 points
    0
    Parent
    
    It’s odd that you understood me as talking about misuse. Well, I guess I’m not sure how you’re using the term “misuse”. If Person X doesn’t follow best practices when training an AI, and they wind up creating an out-of-control misaligned AI that eventually causes human extinction, and if Person X didn’t want human extinction (as most people don’t), then I wouldn’t call that “misuse”. Would you? I I would call it a “catastrophic accident” or something like that. I did mention in the OP that some people think human extinction is perfectly fine, and I guess if Person X is one of those people, then it would be misuse. So I suppose I brought up both accidents and misuse.
    
    Perhaps, but I want to create a distinction between “People train AI to do good things and aren’t able to control AI for a variety of reasons, and thus humans are extinct, made into slaves, etc.” and “People train AI to do stuff like bio-terrorism, explicitly gaining power, etc, and thus humans are extinct, made into slaves, etc.” Because the optimal responses look very different if we are in a world where control is easy but preventing misuse is hard, vs if controlling AI is hard in itself, because AI safety actions as currently done are optimized far more for the case where controlling AI by humans is hard or impossible, but if this is not the case, then pretty drastic changes would need to be made in how AI safety organizations do their work, especially their nascent activist wing, and instead focus on different policies.
    
    People who I think are highly prone to not following best practices to keep AI under control, even if such best practices exist, include people like Yann LeCun, Larry Page, Rich Sutton, and Jürgen Schmidhuber, who are either opposed to AI alignment on principle, or are so bad at thinking about the topic of AI x-risk that they spout complete and utter nonsense. (example). That’s not a problem solvable by Know Your Customer laws, right? These people (and many more) are not the customers, they are among the ones doing state-of-the-art AI R&D.
    
    In general, the more people are technically capable of making an out-of-control AI agent, the more likely that one of them actually will, even if best practices exist to keep AI under control. People like to experiment with new approaches, etc., right? And I expect the number of such people to go up and up, as algorithms improve etc. See here.
    
    https://www.lesswrong.com/posts/C5guLAx7ieQoowv3d/lecun-s-a-path-towards-autonomous-machine-intelligence-has-1
    
    https://www.lesswrong.com/posts/LFNXiQuGrar3duBzJ/what-does-it-take-to-defend-the-world-against-out-of-control#3_4_Very_few_relevant_actors__and_great_understanding_of_AGI_safety
    
    I note that your example of them spouting nonsense only has the full force it does if we assume that controlling AI is hard, which is what we are debating right now.
    
    Onto my point here, my fundamental claim is that there’s a counterforce to what you describe to the claim that there will be more and more people being able to make an out of control AI agent, and that is the profit motive.
    
    Hear me out, this will actually make sense here.
    
    Basically, the main reasons that the profit motive is positive for safety is that the negative externalities of AI being not controllable is far, far more internalized to the person who’s making the AI, since they also suffer severe losses in profitability without getting any profit from the AI. This is combined with the fact that they also have profit in developing safe control techniques, assuming that it isn’t very hard, since the safe techniques will probably get used in government standards for releasing AI, and there’s already at least some fairly severe barriers to any release of misaligned AGI, at least assuming that there’s no treacherous turn/deceptive alignment over weeks-months.
    
    Jaime Sevilla basically has a shorter tweet on why this is the case, and I also responded to LInch making something like the points above:
    
    https://archive.is/wPxUV
    
    https://twitter.com/Jsevillamol/status/1722675454153252940
    
    https://archive.is/3q0RG
    
    https://twitter.com/SharmakeFarah14/status/1726351522307444992
- Steven Byrnes 3 Dec 2023 0:01 UTC
  8 points
  4
  Parent
  I agree with some of this, but I’d say Story 1 applies only very weakly, and that the majority/supermajority of value learning is online, for example via the self-learning/within lifetime-RL algorithms you describe, without relying on the prior. In essence, I agree with the claim that the genes need to impose a prior, which prevents pure blank-slatism from working. I disagree with the claim that this means that genetics need to impose a very strong prior without relying on the self-learning algorithms you describe for capabilities.
  You keep talking about “prior” but not mentioning “reward function”. I’m not sure why. For human children, do you think that there isn’t a reward function? Or there is a reward function but it’s not important? Or do you take the word “prior” to include reward function as a special case?
  If it’s the latter, then I dispute that this is an appropriate use of the word “prior”. For example, you can train AlphaZero to be superhumanly skilled at winning at Go, or if you flip the reward function then you’ll train AlphaZero to be superhumanly skilled at losing at Go. The behavior is wildly different, but is the “prior” different? I would say no. It’s the same neural net architecture, with the same initialization and same regularization. After 0 bits of training data, the behavior is identical in each case. So we should say it’s the same “prior”, right?
  (As I mentioned in the OP, on my models, there is a human innate reward function, and it’s absolutely critical to human prosocial behavior, and unfortunately nobody knows what that reward function is.)
  - Noosphere89 3 Dec 2023 0:31 UTC
    2 points
    −2
    Parent
    So what I’m trying to get at here is essentially the question “how much can we offload the complexity of values to the learning system” rather than say, directly specify it via the genome, say. In essence, I’m focused on the a priori complexity of human values and the human innate reward function, since this variable often is a key disagreement between optimists and pessimists on controlling AI, and in particular it especially matters for how likely deceptive alignment is to occur relative to actual alignment, which is both a huge and popular threat model.
    
    Re the reward function, the prior discussion also sort of applies here, because if it is learnable or otherwise is simple to hardcode, then it means that other functions probably will work just as well without relying on the human reward function, and if it’s outright learnable by AI, then it’s almost certainly going to be learned (conditional on the reward function being simple.) before anything else, especially the deceptively aligned algorithm if it’s simpler, and if not, then it’s only slightly more complex, so we can easily provide very little data to distinguish between the 2 algorithms, which is what I view the situation involving the human
    
    My crux is that this statement is probably false, conditioning on either it being very simple to hardcode, as in a few lines say, or is learnable by the self-learning algorithm/within-lifetime RL/online learning algorithms you consider:
    
    “The human innate reward function is absolutely critical to human prosocial behavior.”
    
    Putting it another way, I deny the specialness of the innate reward function in humans being the main driver, because most of that reward function has to be learned, which could be replicated by brain-like AGI/Model-Based RL via online learning, thus most of the complexity does not matter, and that also probably implies that most of the complex prosocial behavior is fundamentally replicable by a brain-like AGI/Model-Based RL agent without having to have the human innate reward function.
    
    The innate function obviously has some things hard-coded a priori, and there is some complexity in the reward function, but not nearly as much as a lot of people think, since IMO a lot of the reward function/human prosocial values are fundamentally learned and almost certainly replicable by a Brain-like AGI paradigm, even if it didn’t use the exact innate reward function the human uses.
    
    Some other generalized updates I made were these, this is quoted from a discord I’m in, credit to TurnTrout for noticing this:
    
    An update of “guess simple functions of sense data can entrain this complicated edifice of human value, along with cultural information” and the update of “alignment to human values is doable by a simple function so it’s probably doable by lots of other functions”,
    
    as well as contextualized updates like “it was probably easy for evolution to find these circuits, which is evidence you don’t need that much precision in your reward specification to get roughly reasonable outputs”.
    - Steven Byrnes 3 Dec 2023 1:20 UTC
      17 points
      6
      Parent
      I find your text confusing. Let’s go step by step.
      AlphaZero-chess has a very simple reward function: +1 for getting checkmate, −1 for opponent checkmate, 0 for draw
      A trained AlphaZero-chess has extraordinarily complicated “preferences” (value function) — its judgments of which board positions are good or bad contain millions of bits of complexity
      If you change the reward function (e.g. flip the sign of the reward), the resulting trained model preferences would be wildly different.
      By analogy:
      The human brain has a pretty simple innate reward function (let’s say dozens to hundreds of lines of pseudocode).
      A human adult has extraordinarily complicated preferences (let’s say millions of bits of complexity, at least)
      If you change the innate reward function in a newborn, the resulting adult would have wildly different preferences than otherwise.
      Do you agree with all that?
      If so, then there’s no getting around that getting the right innate reward function is extremely important, right?
      So, what reward function do you propose to use for a brain-like AGI? Please, write it down. Don’t just say “the reward function is going to be learned”, unless you explain exactly how it’s going to be learned. Like, what’s the learning algorithm for that? Maybe use AlphaZero as an example for how that alleged reward-function-learning-algorithm would work? That would be helpful for me to understand you. :)
      - Noosphere89 21 Sep 2024 17:57 UTC
        12 points
        0
        Parent
        I have an actual model now that I’ve learned more, so to answer the question below:
        So, what reward function do you propose to use for a brain-like AGI? Please, write it down. Don’t just say “the reward function is going to be learned”, unless you explain exactly how it’s going to be learned. Like, what’s the learning algorithm for that? Maybe use AlphaZero as an example for how that alleged reward-function-learning-algorithm would work? That would be helpful for me to understand you. :)
        To answer what algorithm exactly, it could well be the same algorithm that the AI uses for it’s capabilities like MCTS or AlphaZero’s algorithm or a future AGI’s capability algorithms, but the point is that the algorithm matters less than the data, especially as the data gets larger and larger so the really important question is how to make the dataset, and that’s answered in my comment below:
        https://www.lesswrong.com/posts/83TbrDxvQwkLuiuxk/?commentId=BxNLNXhpGhxzm7heg#BxNLNXhpGhxzm7heg
        I also want to point out that as it turns out, alignment generalizes farther than capabilities for some pretty deep reasons given below, but short answer, it’s due to both the fact that verifying that your values was satisified is in many cases easier than actually executing those values out in the world, combined with values data being easier to learn than other capabilities data:
        1.) Empirically, reward models are often significantly smaller and easier to train than core model capabilities. I.e. in RLHF and general hybrid RL setups, the reward model is often a relatively simple small MLP or even linear layer stuck atop the general ‘capabilities’ model which actually does the policy optimization. In general, it seems that simply having the reward be some simple function of the internal latent states works well. Reward models being effective as a small ‘tacked-on’ component and never being the focus of serious training effort gives some evidence towards the fact that reward modelling is ‘easy’ compared to policy optimization and training. It is also the case that, empirically, language models seem to learn fairly robust understandings of human values and can judge situations as ethical vs unethical in a way that is not obviously worse than human judgements. This is expected since human judgements of ethicality and their values are contained in the datasets they learn to approximate.
        2.) Reward modelling is much simpler with respect to uncertainty, at least if you want to be conservative. If you are uncertain about the reward of something, you can just assume it will be bad and generally you will do fine. This reward conservatism is often not optimal for agents who have to navigate an explore/exploit tradeoff but seems very sensible for alignment of an AGI where we really do not want to ‘explore’ too far in value space. Uncertainty for ‘capabilities’ is significantly more problematic since you have to be able to explore and guard against uncertainty in precisely the right way to actually optimize a stochastic world towards a specific desired point.
        3.) There are general theoretical complexity priors to believe that judging is easier than generating. There are many theoretical results of the form that it is significantly asymptotically easier to e.g. verify a proof than generate a new one. This seems to be a fundamental feature of our reality, and this to some extent maps to the distinction between alignment and capabilities. Just intuitively, it also seems true. It is relatively easy to understand if a hypothetical situation would be good or not. It is much much harder to actually find a path to materialize that situation in the real world.
        4.) We see a similar situation with humans. Almost all human problems are caused by a.) not knowing what you want and b.) being unable to actually optimize the world towards that state. Very few problems are caused by incorrectly judging or misgeneralizing bad situations as good and vice-versa. For the AI, we aim to solve part a.) as a general part of outer alignment and b.) is the general problem of capabilities. It is much much much easier for people to judge and critique outcomes than actually materialize them in practice, as evidenced by the very large amount of people who do the former compared to the latter.
        5.) Similarly, understanding of values and ability to assess situations for value arises much earlier and robustly in human development than ability to actually steer outcomes. Young children are very good at knowing what they want and when things don’t go how they want, even new situations for them, and are significantly worse at actually being able to bring about their desires in the world.
        In general, it makes sense that, in some sense, specifying our values and a model to judge latent states is simpler than the ability to optimize the world. Values are relatively computationally simple and are learnt as part of a general unsupervised world model where there is ample data to learn them from (humans love to discuss values!). Values thus fall out mostly’for free’ from general unsupervised learning. As evidenced by the general struggles of AI agents, ability to actually optimize coherently in complex stochastic ‘real-world’ environments over long time horizons is fundamentally more difficult than simply building a detailed linguistic understanding of the world.
        The link is given below:
        https://www.beren.io/2024-05-15-Alignment-Likely-Generalizes-Further-Than-Capabilities/
        In essence, what I’m doing here is unifying the capabilities and value reward functions, and pointing out that with total control of the dataset and densely defined rewards, we can prevent a lot of misaligned objectives from appearing, since the algorithm is less important than the data.
        I think the key crux is that all of the differences, or almost all of the differences mediate through searching for different data, and if you had the ability to totally control a sociopath’s data sources, they’d learn a different reward function that is way closer to what you want the reward function to be as.
        If you had the ability to control people’s data and reward functions as much as ML people did today, you could trivially brainwash them to accept almost arbitrary facts and moralities, and it would be one of the most used technologies in politics.
        But for alignment, this is awesome news, because it lets us control what exactly is rewarded, and what their values are like.
        What links here?
        Self-dialogue: Do behaviorist rewards make scheming AGIs? by Steven Byrnes (13 Feb 2025 18:39 UTC; 43 points)
        Noosphere89's comment on The Hopium Wars: the AGI Entente Delusion by Max Tegmark (13 Oct 2024 19:36 UTC; 33 points)
        Noosphere89's comment on “The Era of Experience” has an unsolved technical alignment problem by Steven Byrnes (30 Apr 2025 0:22 UTC; 4 points)
        Steven Byrnes's comment on “The Era of Experience” has an unsolved technical alignment problem by Steven Byrnes (30 Apr 2025 17:12 UTC; 3 points)
        Steven Byrnes's comment on “Sharp Left Turn” discourse: An opinionated review by Steven Byrnes (3 Feb 2025 15:52 UTC; 2 points)
        Noosphere89's comment on My disagreements with “AGI ruin: A List of Lethalities” by Noosphere89 (2 Oct 2024 16:31 UTC; 2 points)
        Noosphere89's comment on My disagreements with “AGI ruin: A List of Lethalities” by Noosphere89 (2 Oct 2024 2:23 UTC; 2 points)
        Noosphere89's comment on The Hidden Complexity of Wishes by Eliezer Yudkowsky (17 Oct 2024 18:50 UTC; 1 point)
        Steven Byrnes 30 Apr 2025 17:07 UTC
        12 points
        0
        Parent
        Thanks! Your comment was very valuable, and helped spur me to write Self-dialogue: Do behaviorist rewards make scheming AGIs? (as I mention in that post). Sorry I forgot to reply to your comment directly.
        To add on a bit to what I wrote in that post, and reply more directly…
        So far, you’ve proposed that we can do brain-like AGI, and the reward function will be a learned function trained by labeled data. The data, in turn, will be “lots of synthetic data that always shows the AI acting aligned even when the human behaves badly, as well as synthetic data to make misaligned agents reveal themselves safely, and in particular it’s done early in the training run, before it can try to deceive or manipulate us.” Right so far?
        That might or might not make sense for an LLM. I don’t think it makes sense for brain-like AGI.
        In particular, the reward function is just looking at what the model does, not what its motivations are. “Synthetic data to make misaligned agents reveal themselves safely” doesn’t seem to make sense in that context. If the reward function incentivizes the AI to say “I’m blowing the whistle on myself! I was out to get you!” then the AI will probably start saying that, even if it’s false. (Or if the reward function incentivizes the AI to only whistle-blow on itself if it has proof of its own shenanigans, then the AI will probably keep generating such proof and then showing it to us. …And meanwhile, it might also be doing other shenanigans in secret.)
        Or consider trying to incentivize the AGI for honestly reporting its intentions. That’s defined by the degree of match or mismatch between inscrutable intentions and text outputs. How do you train and apply the reward model to detect that?
        More broadly, I think there are always gonna be ways for the AI to be sneaky without the reward function noticing, which leads to a version of “playing the training game”. And almost no matter what the detailed “training game” is, an AI can do a better job on it by secretly creating a modified copy to self-reproduce around the internet and gain maximal money and power everywhere else on Earth, if that’s possible to do without getting caught. So that’s pretty egregious scheming.
        “Alignment generalizes further than capabilities” is kinda a different issue—it’s talking about the more “dignified” failure mode of generalizing poorly to new situations and environments, whereas I’m claiming that we don’t even have a solution to egregious scheming within an already-existing test environment.
        Again, more in that post. Sorry if I’m misunderstanding, and happy to keep chatting!
      - Noosphere89 3 Dec 2023 2:16 UTC
        6 points
        0
        Parent
        
        I find your text confusing. Let’s go step by step.
        
        AlphaZero-chess has a very simple reward function: +1 for getting checkmate, −1 for opponent checkmate, 0 for draw A trained AlphaZero-chess has extraordinarily complicated “preferences” (value function) — its judgments of which board positions are good or bad contain millions of bits of complexity If you change the reward function (e.g. flip the sign of the reward), the resulting trained model preferences would be wildly different. By analogy:
        
        The human brain has a pretty simple innate reward function (let’s say dozens to hundreds of lines of pseudocode). A human adult has extraordinarily complicated preferences (let’s say millions of bits of complexity, at least) If you change the innate reward function in a newborn, the resulting adult would have wildly different preferences than otherwise.
        
        I agree with this statement, because the sign change directly inverts the reward, and thus it means the previous reward is now a bad thing to hit for, but my view is that this is probably unreprensentative, and that brains/brain-like AGI are much more robust than you think to changing their value/reward functions (but not infinitely robust.) due to the very simple value function you pointed out.
        
        So I basically disagree with this example representing a major problem with NN/Brain-Like AGI robustness.
        
        To respond to this:
        
        So, what reward function do you propose to use for a brain-like AGI? Please, write it down. Don’t just say “the reward function is going to be learned”, unless you explain exactly how it’s going to be learned. Like, what’s the learning algorithm for that? Maybe use AlphaZero as an example for how that alleged reward-function-learning-algorithm would work? That would be helpful for me to understand you. :)
        
        This doesn’t actually matter for my purposes, as I only need the existence of simple reward functions like you claimed to conclude that deceptive alignment is unlikely to happen, and I am leaving it up to the people that are aligning AI like Nora Belrose to actually implement this ideal.
        
        Essentially, I’m focusing on the implications of the existence of simple algorithms for values, and pointing out that various alignment challenges either go away or are far easier to do if we grant that there is a simple reward function for values, which is very much a contested/disagreed position on LW.
        
        So I think we basically agree that there is a simple reward function for values, but I think this implies some other big changes in alignment which reduces the risk of AI catastrophe drastically, mostly via getting rid of deceptive alignment as an outcome that will happen, but there are various other side benefits I haven’t enumerated because it would make this comment too long.