Maybe this is too tired a point, but AI safety really needs exercises—tasks that are interesting, self-contained (not depending on 50 hours of readings), take about 2 hours, have clean solutions, and give people the feel of alignment research.
I found some of the SERI MATS application questions better than Richard Ngo’s exercises for this purpose, but there still seems to be significant room for improvement. There is currently nothing smaller than ELK (which takes closer to 50 hours to develop a proposal for and think about properly) that I can point technically minded people to and feel confident that they’ll both be engaged and learn something.
If you let me know the specific MATS application questions you like, I’ll probably add them to my exercises.
(And if you let me know the specific exercises of mine you don’t like, I’ll probably remove them.)
Not sure if this is what you want, but I can imagine an exercise in Goodharting. You are given the criteria for a reward and the thing they were supposed to measure; your task is to figure out the (least unlikely) way to score very high on the criteria without doing too well on the intended target.
For example: Goal = make the people in the call center more productive. Measure = your salary depends on how many phone calls you handle each day. Intended behavior = picking up the phone quickly and trying to solve the problems quickly. Actual behavior = “accidentally” dropping phone calls after a few seconds so that the customer has to call you again (which the metric counts as two phone calls answered).
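If it helps to make the exercise concrete, here is a minimal toy simulation of the call-center example. Everything in it (the `Policy` class, the probabilities, the patience limit) is a made-up assumption of mine, not part of the exercise itself; the point is just that the “goodharter” policy scores much higher on the proxy metric while solving no more, and in this setup fewer, actual customer problems.

```python
from dataclasses import dataclass
import random

@dataclass
class Policy:
    name: str
    resolve_prob: float  # chance a completed call actually solves the customer's problem
    drop_prob: float     # chance the agent "accidentally" drops the call early

def simulate_day(policy: Policy, customers: int = 100, patience: int = 3, seed: int = 0):
    """Return (calls_handled, problems_solved) for one simulated day under a policy."""
    rng = random.Random(seed)
    calls_handled = 0    # the proxy metric the salary depends on
    problems_solved = 0  # the thing the metric was supposed to track
    for _ in range(customers):
        for _attempt in range(patience):
            calls_handled += 1
            if rng.random() < policy.drop_prob:
                continue  # "accidental" early hang-up; the customer has to call again
            if rng.random() < policy.resolve_prob:
                problems_solved += 1
            break  # the call ran to completion (solved or not), so the customer stops calling
        # if the customer ran out of patience, their problem stays unsolved
    return calls_handled, problems_solved

honest = Policy("honest", resolve_prob=0.9, drop_prob=0.0)
goodharter = Policy("goodharter", resolve_prob=0.9, drop_prob=0.5)

for p in (honest, goodharter):
    handled, solved = simulate_day(p)
    print(f"{p.name:10s}  calls handled = {handled:3d}   problems solved = {solved:3d}")
```

An exercise could hand people only the proxy metric and ask them to invent the gaming policy, or hand them the gaming policy and ask them to design a metric it can’t exploit.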
Another example: Goal = make the software developers more productive. Measure 1 = number of lines of code written. Measure 2 = number of bugs fixed.
I am proposing this because it seems to me that, from a 30,000-foot view, a big part of AI alignment is how to avoid Goodharting. (“Goal = create a happy and prosperous future for humanity. Measure = something that sounds very smart and scientific. Actual behavior = universe converted to paperclips, GDP successfully maximized.”)