If you’re planning to actually do the experiments it suggests, or indeed act on any advice it gives in any way, then it’s an agent.
Seems mistaken to think that the way you use a model is what determines whether or not it’s an agent. It’s surely determined by how you train it?
(And notably the proposal here isn’t to train the model on the outcomes of experiments it proposes, in case that’s what you’re thinking.)
Is this possibly a “Chinese room” kind of situation? The model alone is not an agent, but “the model + the way it is used” might be...
And to be more precise, I don’t mean things like “the model could be used by an agent”, because obviously yes; but more like “the model + a way of using it that we also separately wouldn’t call an agent” could be.
“Seems mistaken to think that the way you use a model is what determines whether or not it’s an agent. It’s surely determined by how you train it?”
---> Nah, pre-training, fine-tuning, scaffolding, and especially RL all seem to affect it. Currently scaffolding only gets you shitty agents, but it at least sorta works.
Pre-training, fine-tuning and RL are all types of training. But sure, expand ‘train’ to ‘create’ in order to include anything else, like scaffolding. The point is that what matters isn’t what you do in response to the outputs of the system, it’s what the system tries to do.
yeah, if the system is trying to do things, I agree it’s (at least a proto-)agent. My point is that creation happens in lots of places with respect to an LLM, and it’s not implausible that use steps (hell, even sufficiently advanced prompt engineering) can effect agency in a system, particularly as capabilities continue to advance.
Seems like we don’t really disagree
we might disagree some. I think the original comment is pointing at the (reasonable, as far as I can tell) claim that oracular AI can have agent-like qualities if it produces plans that people follow.
I agree that it can be possible to turn such a system into an agent. I think the original comment is defending a stronger claim that there’s a sort of no free lunch theorem: either you don’t act on the outputs of the oracle at all, or it’s just as much of an agent as any other system.
I think the stronger claim is clearly not true. The worrying thing about powerful agents is that their outputs are selected to cause certain outcomes, even if you try to prevent those outcomes. So an agent’s outputs have to be different depending on the actions you’re going to take in response to them. But the point of an oracle is to not have that property: its outputs are decided by a criterion (something like truth) that is independent of the actions you’re going to take in response[1]. So if you respond differently to the outputs, they simply cause different outcomes. Assuming you’ve succeeded at building the oracle to specification, it’s clearly not the case that the oracle has the worrying property of agents just because you act on its outputs. (A rough formalization of this distinction is sketched below, after this comment.)
I don’t disagree that by either hooking the oracle up in a scaffolded feedback loop with the environment, or getting it to output plans, you could extract more agency from it. Of the two, I think the scaffolding can in principle easily produce dangerous agency in the same way long-horizon RL can, but that the version where you get it to output a plan is much less worrying (I can argue for that in a separate comment if you like).
I’m ignoring the self-fulfilling prophecy case here.
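A rough way to formalize the distinction in the comment above (the notation here is my own sketch, not anything from the thread): write $\pi$ for the policy the user will follow in response to the system’s output and $a$ for a candidate output. A worrying agent selects its output to steer outcomes given that policy, whereas an oracle built to specification selects its output by a fixed criterion that never references $\pi$:

$$a_{\text{agent}} \in \arg\max_{a} \Pr\!\big(\text{desired outcome} \mid a, \pi\big), \qquad a_{\text{oracle}} \in \arg\max_{a} \mathrm{Truth}(a \mid \text{question}).$$

Changing $\pi$ changes $a_{\text{agent}}$ but leaves $a_{\text{oracle}}$ fixed, so acting on the oracle’s answer changes what happens in the world without that answer having been selected to exploit your response.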
thanks, I appreciate the reply.
It sounds like I have somewhat wider error bars, but mostly agree with everything except the last sentence, where I think it’s plausibly, though not certainly, less worrying.
If you felt like you had crisp reasons why you’re less worried, I’d be happy to hear them, but only if it feels positive for you to produce such a thing.
Good point. I think that if you couple the answers of an oracle to reality by some random process, then you are probably fine.
However, many people want to use the outputs of the oracle in very obvious ways. For instance, you ask it what code you should put into your robot, and then you just put that code into the robot.
Could we have an oracle (i.e. one trained according to some Truth criterion) that, when used in this very straightforward way, exerts optimization pressure on the world?
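A minimal sketch in Python, with hypothetical placeholder functions (query_oracle, execute, and observe are stand-ins I’m inventing, not any real API), of the two usage patterns discussed in this thread: the straightforward one-shot use just described, versus the scaffolded feedback loop mentioned a few comments up.

def query_oracle(question: str) -> str:
    """Stand-in for an oracle whose answers are selected by a truth-like criterion."""
    return f"(answer to: {question})"

def execute(action: str) -> None:
    """Stand-in for acting on an answer in the world."""
    print(f"executing: {action}")

def observe() -> str:
    """Stand-in for reading back the resulting state of the environment."""
    return "current state of the environment"

# Straightforward one-shot use (the robot-code example): ask once, act on it once.
execute(query_oracle("What code should I put into the robot?"))

# Scaffolded feedback loop (the earlier comment's worry): each answer is still
# selected question-by-question, but because every new question feeds the latest
# observation back in, the oracle-plus-scaffold combination iteratively steers
# the environment, which is where the agent-like behavior comes from.
obs = observe()
for _ in range(3):
    step = query_oracle(f"Given the observation '{obs}', what should be done next?")
    execute(step)
    obs = observe()

In the loop, each individual answer can still satisfy the oracle’s truth-like criterion; the agent-like quality lives in the combined system of oracle plus scaffold, not in any single output.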