I think you and Peter might be talking past each other a little, so I want to make sure I properly understand what you are saying. I’ve read your comments here and on Nate’s post, and I want to start a new thread to clarify things.
I’m not sure exactly what analogy you are making between chess AI and science AI. Which properties of a chess AI do you think are analogous to a scientific-research-AI?
- The constraints are very easy to specify (because legal moves can be easily locally evaluated). In other words, the set of paths considered by the AI is easy to define, and optimization can be constrained to only search this space.
- The task of playing chess doesn’t at all require or benefit from modelling any other part of the world except for the simple board state.
I think these are the main two reasons why current chess AIs are safe.
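The first property can be sketched in code. This is a hypothetical toy "game" (not chess, and not any real engine): the point is only that when legality is a cheap local check, the entire search space is defined by that check, and the solver never consults anything outside the board state it is handed.

```python
# Toy illustration (hypothetical): a solver whose search space is defined
# entirely by a locally evaluable legal-move check. State is a list of
# counters; a "move" increments one counter, and is legal only while the
# total stays at or below 3.

def legal_moves(state):
    # Local constraint check: no modelling of anything outside `state`.
    return [i for i in range(len(state)) if sum(state) < 3]

def best_line(state, depth, score):
    # Exhaustive search restricted to the legal-move set.
    if depth == 0 or not legal_moves(state):
        return score(state), []
    best = None
    for m in legal_moves(state):
        child = list(state)
        child[m] += 1
        s, line = best_line(child, depth - 1, score)
        if best is None or s > best[0]:
            best = (s, [m] + line)
    return best

# Maximize counter 0 minus counter 1: the best line increments counter 0
# three times, after which no legal moves remain.
value, moves = best_line([0, 0], depth=3, score=lambda st: st[0] - st[1])
```

Everything the optimizer can ever do is enumerated by `legal_moves`, which is the sense in which the constraint fully defines the search space.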
Separately, I’m not sure exactly what you mean when you’re saying “scientific value”. To me, the value of knowledge seems to depend on the possible uses of that knowledge. So if an AI is evaluating “scientific value”, it must be considering the uses of the knowledge? But you seem to be referring to some more specific and restricted version of this evaluation, which doesn’t make reference at all to the possible uses of the knowledge? In that case, can you say more about how this might work?

Or maybe you’re saying that evaluating hypothetical uses of knowledge can be safe? I.e. there’s a kind of goal that wants to create “hypothetically useful” fusion-rocket-designs, but doesn’t want this knowledge to have any particular effect on the real future.
You might be reading us as saying that “AI science systems are necessarily dangerous” in the sense that it’s logically impossible to have an AI science system that isn’t also dangerous? We aren’t saying this. We agree that in principle such a system could be built.
While some disagreement might be about relatively mundane issues, I think there’s some more fundamental disagreement about agency as well.
In my view, in order to be dangerous in a particularly direct way (instead of just misuse risk etc.), an AI’s decision to give output X depends on the fact that output X has some specific effects in the future.
Whereas, if you train it on a problem where solutions don’t need to depend on the effects of the outputs on the future, I think it’s much more likely to learn to find the solution without routing that through the future, because that’s simpler.
So if you train an AI to give solutions to scientific problems, I don’t think that, in general, those solutions need to depend on the future, so I think it’s likely to learn the direct relationships between the data and the solutions. I.e. it’s not merely a logical possibility to make it not especially dangerous; that’s the default outcome if you give it problems that don’t need to depend on specific effects of the output.
Now, if you were instead to give it a problem that had to depend on the effects of the output on the future, then it would be dangerous...but note that e.g. chess, even though it maps onto a game played in the real world in the future, can also be understood in abstract terms so you don’t actually need to deal with anything outside the chess game itself.
In general, I just think that predicting the future of the world and choosing specific outputs based on their effects on the real world is a complicated way to solve problems, and I expect things to take shortcuts when possible.
Once something does care about the future, then it will have various instrumental goals about the future, but the initial step about actually caring about the future is very much not trivial in my view!
In my view, in order to be dangerous in a particularly direct way (instead of just misuse risk etc.), an AI’s decision to give output X depends on the fact that output X has some specific effects in the future.
Agreed.
Whereas, if you train it on a problem where solutions don’t need to depend on the effects of the outputs on the future, I think it’s much more likely to learn to find the solution without routing that through the future, because that’s simpler.
The “problem where solutions don’t need to depend on effects” is where we disagree. I agree such problems exist (e.g. formal proof search), but those aren’t the kind of useful tasks we’re talking about in the post. For actual concrete scientific problems, like outputting designs for a fusion rocket, the “simplest” approach is to consider the consequences of those outputs on the world. Otherwise, how would it internally define “good fusion rocket design that works when built”? How would it know not to use a design that fails because of weaknesses in the metal that will be manufactured into a particular shape for your rocket? A solution to building a rocket is defined by its effects on the future (not all of its effects, just some of them, e.g. that it doesn’t explode, among many others).
I think there’s a (kind of) loophole here, where we use an “abstract hypothetical” model of a hypothetical future, and optimize the consequences of our actions within that hypothetical. Is this what you mean by “understood in abstract terms”? So the AI has defined “good fusion rocket design” as “fusion rocket that is built by not-real hypothetical humans based on my design and functions in a not-real hypothetical universe and has properties and consequences XYZ” (but the hypothetical universe isn’t the actual future, it’s just similar enough to define this one task, but dissimilar enough that misaligned goals in this hypothetical world don’t lead to coherent misaligned real-world actions). Is this what you mean? Rereading your comment, I think this matches what you’re saying, especially the chess game part.
The part I don’t understand is why you’re saying that this is “simpler”? It seems equally complex in Kolmogorov complexity and computational complexity.
I think there’s a (kind of) loophole here, where we use an “abstract hypothetical” model of a hypothetical future, and optimize the consequences of our actions within that hypothetical. Is this what you mean by “understood in abstract terms”?
More or less, yes (in the case of engineering problems specifically, which I think is more real-world-oriented than most science AI).
The part I don’t understand is why you’re saying that this is “simpler”? It seems equally complex in Kolmogorov complexity and computational complexity.
What I’m saying is “simpler” is that, given a problem that doesn’t need to depend on the actual effects of the outputs on the future of the real world, it is simpler for the AI to solve that problem without taking into consideration the effects of the output on the future of the real world than it is to take those effects into account anyway. (Operating in a simulation is an example, though one that could become riskily close to the real world depending on the information taken into account by the simulation; it might not be a good idea to include highly detailed political risks of other humans thwarting construction in a fusion-reactor construction simulation, for example.)
I feel like you’re proposing two different types of AI and I want to disambiguate them. The first one, exemplified in your response to Peter (and maybe referenced in your first sentence above), is a kind of research assistant that proposes theories (after having looked at data that a scientist is gathering?), but doesn’t propose experiments and doesn’t think about the usefulness of its suggestions/theories. Like a Solomonoff inductor that just computes the simplest explanation for some data? And maybe some automated approach to interpreting theories?
The second one, exemplified by the chess analogy and last paragraph above, is a bit like a consequentialist agent that is a little detached from reality (can’t learn anything, has a world model that we designed such that it can’t consider new obstacles).
Do you agree with this characterization?
What I’m saying is “simpler” is that, given a problem that doesn’t need to depend on the actual effects of the outputs on the future of the real world […], it is simpler for the AI to solve that problem without taking into consideration the effects of the output on the future of the real world than it is to take into account the effects of the output on the future of the real world anyway.
I accept chess and formal theorem-proving as examples of problems where we can define the solution without using facts about the real-world future (because we can easily write down formally a definition of what the solution looks like).
For a more useful problem (e.g. curing a type of cancer), we (the designers) only know how to define a solution in terms of real-world future states (patient is alive, healthy, non-traumatized, etc.). I’m not saying there doesn’t exist a definition of success that doesn’t involve referencing real-world future states. But the AI designers don’t know it (and I expect it would be relatively complicated).
My understanding of your simplicity argument is that it is computationally cheaper for a trained AI to discover during training a non-consequence definition of the task, despite a consequentialist definition being the criterion used to train it? If so, I disagree that computational cost is very relevant here; generalization (to novel obstacles) is the dominant factor determining how useful this AI is.