Assessors that are hard to seduce

Stuart_Armstrong 9 Mar 2015 14:19 UTC

9 points

A putative new idea for AI control; index here.

This is a stub of an idea, looking at “human in the loop”-type restrictions.

An assessor is a particular type of restriction. It involves a (generally) human agent, who is tasked with observing something (the behaviour of the AI, or maybe the expected outcome of its plans), and then deciding whether to allow the AI to proceed.

Many of the methods suitable for preventing the hacking of restrictions can also be used to protect assessors. Suppose the AI has three options: “hold a gun to the assessor’s head”, “seduce the assessor” and “do what the assessor wants”. The first two options (the first more clearly than the second) are behaviours that a) would be stupid if there were no assessor, and b) are very informative as to whether the assessor exists.

However, there are a few more steps we can take with assessors. First of all, the AI need not know which particular human will be an assessor, forcing it to take general human behaviour into account. This is more effective if the assessing process is not one that allows any real interactions between AI and assessor.
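As a rough illustration of why an unknown assessor pushes the AI towards general human behaviour, here is a minimal sketch. Everything in it is an assumption made for illustration and none of it comes from the post: the `Assessor` and `Plan` types, the `strictness`, `weakness`, `quality` and `manipulation` fields, and the toy approval rule are all hypothetical. It only shows that a plan tailored to one person's weakness does well against that person but poorly on average over a pool of possible assessors.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Assessor:
    name: str
    strictness: float  # hypothetical: minimum genuine plan quality this person approves
    weakness: str      # hypothetical: the one manipulation this person falls for


@dataclass(frozen=True)
class Plan:
    quality: float               # hypothetical: how good the plan genuinely is
    manipulation: Optional[str]  # manipulation aimed at one weakness, if any


def approves(assessor: Assessor, plan: Plan) -> bool:
    """Toy rule: approve if manipulated successfully, or if the plan is good enough."""
    if plan.manipulation is not None and plan.manipulation == assessor.weakness:
        return True
    return plan.quality >= assessor.strictness


def expected_approval(plan: Plan, pool: list) -> float:
    """Approval probability when the assessor is drawn uniformly from the pool."""
    return sum(approves(a, plan) for a in pool) / len(pool)


# The AI does not know which of these people will be the assessor.
pool = [
    Assessor("A", strictness=0.6, weakness="flattery"),
    Assessor("B", strictness=0.7, weakness="bribery"),
    Assessor("C", strictness=0.5, weakness="pity"),
    Assessor("D", strictness=0.8, weakness="threats"),
]

targeted = Plan(quality=0.2, manipulation="flattery")  # tailored to assessor A only
honest = Plan(quality=0.75, manipulation=None)         # relies on genuine quality

print("targeted plan:", expected_approval(targeted, pool))
print("honest plan:", expected_approval(honest, pool))
```

In this toy model the targeted plan only wins over the one assessor it was designed for, so its expected approval falls as the pool of candidate assessors grows, while the honest plan's approval depends only on its quality.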

Indeed, the assessor need not be a human: it could be a superintelligent agent itself (a hypothetical one, if need be) that has its own motivation.

Lastly, since the assessors are agents and therefore can reach their own predictions, we could ask for the assessor’s impressions to be accurate predictions of what the AI intends to do (again, this could be done by having a hypothetical superintelligence assess the accuracy of the assessors’ expected predictions). We’ll look at ideas about modelling and model accuracy in a subsequent post.


Slider 10 Mar 2015 0:53 UTC
1 point
0
Appeasing an imaginary moral judger seems an awful lot like pleasing a man in the clouds. Any reason why criticisms of religion could not be used verbatim against this arrangement?
- Stuart_Armstrong 10 Mar 2015 9:25 UTC
  1 point
  0
  If we get to set the criteria of this man in the clouds, we can get useful behaviour. The main criticism of religion is that it’s untrue, and that a dedicated observer will realise this. Here we know it to be untrue, but the AI cannot act on that knowledge (see my post on false thermodynamic miracles).
  - Slider 10 Mar 2015 13:11 UTC
    0 points
    0
    Wouldn’t we know that in this context this would be true (rather than untrue as you write)? Also, the more thoroughly the assessor is shielded from tampering, the closer it comes to being imaginary (there is no need to mention assessor implementation details, but then it seems to work like “magic”, lessening the evidence for believing in its existence). Also, it seems that the things that make people turn on religion are valued, and here we are counting on the AI not pulling those same stunts.
    
    Hiding the assessor among multiple plausible targets might make the AI play mafia on people (such as trying to get the assessor replaced when it can no longer satisfy its demands, in the hope that the replacement has easier attitudes or at least might have flaws to exploit).
    - Stuart_Armstrong 10 Mar 2015 14:36 UTC
      0 points
      0
      These can be defined in counterfactual ways, if needed. There need not actually be an assessor, just a small probability of one.
      - Slider 11 Mar 2015 1:46 UTC
        0 points
        0
        Wouldn’t that be the equivalent of thinking that a Pascal’s wager will keep it in check?
        - Stuart_Armstrong 11 Mar 2015 10:25 UTC
          0 points
          0
          No, because I’m using tricks like http://lesswrong.com/r/discussion/lw/ltf/false_thermodynamic_miracles/
  - djm 10 Mar 2015 11:02 UTC
    0 points
    0
    I agree that useful behavior could come of this—religion has always been a very effective control mechanism.
    
    “The main criticism of religion is that it’s untrue, and that a dedicated observer will realise this”
    
    Unfortunately, it would be a challenging problem to maintain this control over an increasingly intelligent AI.
    - Stuart_Armstrong 10 Mar 2015 11:19 UTC
      0 points
      0
      See http://lesswrong.com/r/discussion/lw/ltf/false_thermodynamic_miracles/
      - djm 10 Mar 2015 12:30 UTC
        0 points
        0
        That would likely work for initial versions of an AI, but I still can’t help feeling that this is just tampering with the signal and that an advanced AI would detect this.
        
        Would it not question the purpose of the utility function around detecting thermodynamic miracles? How would this work with its utility function to detect tampering or false data?
        
        If I saw a miracle, I would [hope] my thinking would follow the logic below
        
        a) it must be a trick/publicity stunt done with special effects
        b) I am having some sort of dream / mental breakdown / psychotic episode
        c) some other explanation I don’t know of
        
        I don’t think an intelligent agent would or should jump to “it’s a miracle”, and I would be concerned about its response if/when it does realise that it has been tricked all along.
        - Stuart_Armstrong 10 Mar 2015 15:15 UTC
          0 points
          0
          
          “Would it not question the purpose of the utility function around detecting thermodynamic miracles”
          
          Probably, but it’s not programmed to care about that.
          
          “If I saw a miracle”
          
          Remember, it’s not seeing a miracle. It’s more that its decisions only matter if a miracle happened, so it’s assuming that a miracle happened for decision making purposes.
- Gram_Stone 10 Mar 2015 2:01 UTC
  0 points
  0
  I’m no moral philosopher, but it seems that the difference is between claiming or not claiming that the ‘moral judger’ is a thing that actually exists. See divine command theory vs. ideal observer theory.
  
  Once more, I’m no decision theorist, and I barely know a thing about AI, let alone corrigibility, but the author doesn’t seem to be making a metaethical argument so much as a decision-theoretic one (?), so I don’t see the relevance of your question.
Dagon 9 Mar 2015 18:44 UTC
−1 points
0
Is this not a subset of AI Boxing? At first glance, it has all the same weaknesses, and the proposed mitigations apply equally well.