Ofer comments on Risks from Learned Optimization: Introduction

Ofer 9 Jun 2019 5:00 UTC
5 points
The distinction between the mesa- and behavioral objectives might be very useful when reasoning about deceptive alignment (in which the mesa-optimizer tries to have a behavioral objective that is similar to the base objective, as an instrumental goal for maximizing the mesa-objective).
- Vlad Mikulik 9 Jun 2019 18:53 UTC
  7 points
  To some extent, but keep in mind that in another sense, the behavioural objective of maximising paperclips is totally consistent with playing along with the base objective for a while and then defecting. So I’m not sure the behaviour/mesa- distinction alone does the work you want it to do even in that case.
  - Ofer 9 Jun 2019 19:44 UTC
    3 points
    Agreed (I hadn’t thought about that).