aysja comments on What Is The Alignment Problem?

aysja 20 Jan 2025 0:21 UTC
28 points
11
I agree with a bunch of this post in spirit—that there are underlying patterns to alignment deserving of true name—although I disagree about… not the patterns you’re gesturing at, exactly, but more how you’re gesturing at them. Like, I agree there’s something important and real about “a chunk of the environment which looks like it’s been optimized for something,” and “a system which robustly makes the world look optimized for a certain objective, across many different contexts.” But I don’t think the true names of alignment will be behaviorist (“as if” descriptions, based on observed transformations between inputs and output). I.e., whereas you describe it as one subtlety/open problem that this account doesn’t “talk directly about what concrete patterns or tell-tale signs make something look like it’s been optimized for X,” my own sense is that this is more like the whole problem (and also not well characterized as a coherence problem). It’s hard for me to write down the entire intuition I have about this, but some thoughts:
- Behaviorist explanations don’t usually generalize as well. Partially this is because there are often many possible causal stories to tell about any given history of observations. Perhaps the wooden sphere was created by a program optimizing for how to fit blocks together into a sphere, or perhaps the program is more general (about fitting blocks together into any shape) or more particular (about fitting these particular wood blocks together) etc. Usually the response is to consider the truth of the matter to be the shortest program consistent with all observations, but this is enables blindspots, since it might be wrong! You don’t actually know what’s true when you use procedures like this, since you’re not looking at the mechanics of it directly (the particular causal process which is in fact happening). But knowing the actual causal process is powerful, since it will tell you how the outputs will vary with other inputs, which is an important component to ensuring that unwanted outputs don’t obtain.
  - This seems important to me for at least a couple reasons. One is that we can only really bound the risk if we know what the distribution of possible outputs is. This doesn’t necessarily require understanding the underlying causal process, but understanding the underlying causal process will, I think, necessarily give you this (absent noise/unknown unknowns etc). Two is that blindspots are exploitable—anytime our measurements fail to capture reality, we enable vectors of deception. I think this will always be a problem to some extent, but it seems worse to me the less we have a causal understanding. For instance, I’m more worried about things like “maybe this program represents the behavior, but we’re kind of just guessing based on priors” than I am about e.g., calculating the pressure of this system. Because in the former there are many underlying causal processes (e.g., programs) that map to the same observation, whereas in the latter it’s more like there are many underlying states which do. And this is pretty different, since the way you extrapolate from a pressure reading will be the same no matter the microstate, but this isn’t true of the program: different ones suggest different future outcomes. You can try to get around this by entertaining the possibility that all programs which are currently consistent with the observations might be correct, weighted by their simplicity, rather than assuming a particular one. But I think in practice this can fail. E.g., scheming behavior might look essentially identical to non-scheming behavior for an advanced intelligence (according to our ability to estimate this), despite the underlying program being quite importantly different. Such that to really explain whether a system is aligned, I think we’ll need a way of understanding the actual causal variables at play.
- Many accounts of cognition are impossible (eg AIXI, VNM rationality, or anything utilizing utility functions, many AIT concepts), since they include the impossible step of considering all possible worlds. I think people normally consider this to be something like a “God’s eye view” of intelligence—ultimately correct, but incomputable—which can be projected down to us bounded creatures via approximation, but I think this is the wrong sort of in-principle to real-world bridge. Like, it seems to me that intelligence is fundamentally about ~“finding and exploiting abstractions,” which is something that having limited resources forces you to do. I.e., intelligence comes from the boundedness. Such that the emphasis should imo go the other way: figuring out the core of what this process of “finding and exploiting abstractions” is, and then generalizing outward. This feels related to behaviorism insomuch as behaviorist accounts often rely on concepts like “searching over the space of all programs to find the shortest possible one.”
- Lucius Bushnaq 20 Jan 2025 7:34 UTC
  10 points
  3
  Parent
  Many accounts of cognition are impossible (eg AIXI, VNM rationality, or anything utilizing utility functions, many AIT concepts), since they include the impossible step of considering all possible worlds. I think people normally consider this to be something like a “God’s eye view” of intelligence—ultimately correct, but incomputable—which can be projected down to us bounded creatures via approximation, but I think this is the wrong sort of in-principle to real-world bridge. Like, it seems to me that intelligence is fundamentally about ~“finding and exploiting abstractions,” which is something that having limited resources forces you to do. I.e., intelligence comes from the boundedness.
  I used to think this, but now I don’t quite think it anymore. The largest barrier I saw here was that the search had to prioritise simple hypotheses over complex ones. I had not idea how to do this. It seemed like it might require very novel search algorithms, such that models like AIXI were eliding basically all of the key structure of intelligence by not specifying this very special search process.
  I no longer think this. Privileging simple hypotheses in the search seems way easier than I used to think. It is a feature so basic you can get it almost by accident. Many search setups we already know about do it by default. I now suspect that there is a pretty real and non-vacuous sense in which deep learning is approximated Solomonoff induction. Both in the sense that the training itself is kind of like approximated Solomonoff induction, and in the sense that the learned network algorithms may be making use of what is basically approximated Solomonoff induction in specialised hypotheses spaces to perform ‘general pattern recognition’ on their forward passes.
  
  I still think “abstraction-based-cognition” is an important class of learned algorithms that we need to understand, but a picture of intelligence that doesn’t talk about abstraction and just refers to concepts like AIXI no longer seems to me to be so incomplete as to not be saying much of value about the structure of intelligence at all.
  - mattmacdermott 6 Mar 2025 12:51 UTC
    2 points
    0
    Parent
    
    I now suspect that there is a pretty real and non-vacuous sense in which deep learning is approximated Solomonoff induction.
    
    Even granting that, do you think the same applies to the cognition of an AI created using deep learning—is it approximating Solomonoff induction when presented with a new problem at inference time?
    
    I think it’s not, for reasons like the ones in aysja’s comment.
    - Lucius Bushnaq 6 Mar 2025 14:14 UTC
      2 points
      0
      Parent
      Yes. I think this may apply to basically all somewhat general minds.
- Noosphere89 20 Jan 2025 1:32 UTC
  2 points
  0
  Parent
  
  Many accounts of cognition are impossible (eg AIXI, VNM rationality, or anything utilizing utility functions, many AIT concepts), since they include the impossible step of considering all possible worlds. I think people normally consider this to be something like a “God’s eye view” of intelligence—ultimately correct, but incomputable—which can be projected down to us bounded creatures via approximation, but I think this is the wrong sort of in-principle to real-world bridge. Like, it seems to me that intelligence is fundamentally about ~“finding and exploiting abstractions,” which is something that having limited resources forces you to do. I.e., intelligence comes from the boundedness. Such that the emphasis should imo go the other way: figuring out the core of what this process of “finding and exploiting abstractions” is, and then generalizing outward. This feels related to behaviorism insomuch as behaviorist accounts often rely on concepts like “searching over the space of all programs to find the shortest possible one.”
  
  I do think a large source of impossibility results come from trying to consider all possible worlds, but the core feature of all of the impossible proposals in our reality is a combination of ignoring computational difficulty entirely, combined with problems on embedded agency, and that the boundary between agent and environment is fundamental to most descriptions of intelligence/agency, ala Cartesian boundaries, but physically universal cellular automatons invalidate this abstraction, meaning the boundary is arbitrary and has no meaning at a low level, and our universe is plausibly physically universal.
  
  More here:
  
  https://www.lesswrong.com/posts/dHNKtQ3vTBxTfTPxu/what-is-the-alignment-problem#3GvsEtCaoYGrPjR2M
  
  (Caveat that the utility function framing actually can work, assuming we restrain the function classes significantly enough, and you could argue the GPT series has a utility function of prediction, but I won’t get into that).
  
  The problem with the bridge is that even if the god’s eye view of intelligence theories was totally philosophically correct, there is no way to get anything like that, and thus you cannot easily approximate without giving very vacuous bounds, and thus you need a specialized theory of intelligence in specific universes that might be inelegant philosophically/mathematically, but that can actually work to build and align AGI/ASI, especially if it comes soon.