I just found your post. I want to read it but haven't had time to dive into it thoroughly yet, so I ran it through a summarizer script I’ve been working on that uses gpt-3.5-turbo and gpt-4 to summarize texts that exceed the context window length.
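For context, the general chunk-and-summarize idea behind the script looks roughly like this. This is a simplified sketch, not the actual code: the function names, the character-based chunk limit, and the map-reduce recursion are all simplifications, and the model call is stubbed out rather than wired to the OpenAI API.

```python
# Sketch of a map-reduce summarizer for texts longer than the model's context
# window: split the text into chunks, summarize each chunk, then summarize the
# concatenated chunk summaries, recursing until the result fits in one call.

def chunk_text(text: str, max_chars: int = 8000) -> list[str]:
    """Split text into chunks of at most max_chars, preferring paragraph
    boundaries. (A single paragraph longer than max_chars is kept whole
    in this sketch.)"""
    paragraphs = text.split("\n\n")
    chunks, current = [], ""
    for p in paragraphs:
        if current and len(current) + len(p) + 2 > max_chars:
            chunks.append(current)
            current = p
        else:
            current = f"{current}\n\n{p}" if current else p
    if current:
        chunks.append(current)
    return chunks

def summarize(text: str, model: str = "gpt-3.5-turbo") -> str:
    # Placeholder for a chat-completion API call (e.g. via the openai package).
    raise NotImplementedError

def summarize_long_text(text: str, max_chars: int = 8000) -> str:
    chunks = chunk_text(text, max_chars)
    if len(chunks) == 1:
        # Final pass: a single summarization call over text that now fits.
        return summarize(chunks[0], model="gpt-4")
    # Map step: summarize each chunk, then recurse on the joined summaries
    # in case they still exceed the limit.
    partial = [summarize(c) for c in chunks]
    return summarize_long_text("\n\n".join(partial), max_chars)
```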
Here’s the summary it came up with; let me know if anyone sees problems with it. If you’re in a rush, you can use agree/disagree voting to signal whether you think this is overall a good summary or not:
The article examines a theoretical solution to the AI alignment problem, focusing on detecting and evaluating objectives in optimizing systems to prevent negative or irreversible outcomes. The author proposes that an overseer should possess three capabilities: detecting, evaluating, and controlling optimizing systems to align with their intended objectives.
Emphasizing the significance of interpretability, the article delves into the challenges of assessing objectives. As a practical solution, the author suggests developing tools to detect and evaluate objectives within optimizing systems and testing these methods through auditing games. Although implementing such tools may be difficult, the author advocates for further exploration in this direction to potentially uncover valuable insights. Acknowledging the theoretical nature of the solution, the author recognizes the potential hurdles that may arise during practical implementation.
Update: I see now that your post includes a High-level summary of this post (thanks for doing that!), which I’m going through and comparing with this auto-generated one.
Interesting! Quick thought: I feel as though it over-compressed the post, compared to the summary I used. Perhaps you could tweak things to generate multiple summaries of varying lengths.
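One simple way this could work: parameterize the summarization prompt by a target length. A hypothetical sketch (the presets and word counts below are made up for illustration, not taken from the script):

```python
# Hypothetical length presets for generating summaries at several levels of
# compression; the word counts are illustrative only.
LENGTH_PRESETS = {
    "short": "Summarize the following text in about 100 words.",
    "medium": "Summarize the following text in about 300 words.",
    "long": "Summarize the following text in about 600 words.",
}

def build_prompt(text: str, length: str = "medium") -> str:
    """Prepend the length instruction for the chosen preset to the text."""
    return f"{LENGTH_PRESETS[length]}\n\n{text}"
```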
Here’s a longer summary of your article generated by the latest version of my summarizer script:
In this article, Paul Colognese explores whether detecting and evaluating the objectives of advanced AI systems during training and deployment is sufficient to solve the alignment problem. The idealized approach presented in the article involves detecting all objectives/intentions of any system produced during the training process, evaluating whether the outcomes produced by a system pursuing a set of objectives will be good/bad/irreversibly bad, and shutting down a system if its objectives lead to irreversibly bad outcomes.
The alignment problem for optimizing systems is defined as needing a method of training/building optimizing systems such that they never successfully pursue an irreversibly bad objective during training or deployment and pursue good objectives while rarely pursuing bad objectives. The article claims that if an overseer can accurately detect and evaluate all of the objectives of optimizing systems produced during the training process and during deployment, the overseer can prevent bad outcomes caused by optimizing systems pursuing bad objectives.
Robustly detecting an optimizing system’s objectives requires strong interpretability tools. The article discusses the problem of evaluating objectives and some of the difficulties involved. The role of interpretability is crucial in this approach, as it allows the overseer to make observations that can truly distinguish between good systems and bad-but-good-looking systems.
Detecting all objectives in an optimizing system is a challenging task, and even if the overseer could detect all of the objectives, it might be difficult to accurately predict whether a powerful optimizing system pursuing those objectives would result in good outcomes or not. The article suggests that with enough understanding of the optimizing system’s internals, it might be possible to directly translate from the internal representation of the objective to a description of the relevant parts of the corresponding outcome.
The article concludes by acknowledging that the proposed solution seems difficult to implement in practice, but pursuing this direction could lead to useful insights. Further conceptual and empirical investigation is suggested to better understand the feasibility of this approach in solving the alignment problem.
Great idea, I will experiment with that—thanks!