Viliam comments on Thomas Kwa’s Shortform

Viliam 5 Aug 2022 11:04 UTC
2 points
0
Not sure if this is what you want, but I can imagine an exercise in Goodharting. You are given the criteria for a reward and the thing they were supposed to maximize; your task is to figure out the (least unlikely) way to score very high on the criteria without doing too well on the intended target.
For example: Goal = make the people in the call center more productive. Measure = your salary depends on how many phone calls you handle each day. Intended behavior = picking up the phone quickly, trying to solve the problems quickly. Actual behavior = “accidentally” dropping phone calls after a few seconds so that the customer has to call you again (and that counts by the metric as two phone calls answered).
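To make the gap between the measure and the goal concrete, here is a toy sketch of the call center case. Everything in it is an assumption made up for illustration: the function name `simulate_day`, the 480-minute workday, the 10 minutes for a call that actually solves the problem, and the 1 minute for a call that gets "accidentally" dropped. It only shows the shape of the incentive, not real call center data.

```python
def simulate_day(drops_per_customer, workday_min=480,
                 full_call_min=10, dropped_call_min=1):
    """Toy model of one agent's day (all numbers are hypothetical).

    Each customer is dropped `drops_per_customer` times, calls back
    each time, and is then helped properly on the final call.
    Returns (calls_counted, customers_helped): the first is what the
    salary metric sees, the second is what the company wanted.
    """
    minutes_per_customer = (drops_per_customer * dropped_call_min
                            + full_call_min)
    customers_helped = workday_min // minutes_per_customer
    calls_counted = customers_helped * (drops_per_customer + 1)
    return calls_counted, customers_helped


for drops in (0, 1, 3):
    calls, helped = simulate_day(drops)
    print(f"drops per customer: {drops}  "
          f"calls counted: {calls}  customers helped: {helped}")
```

Under these made-up numbers, the honest agent gets 48 calls counted and helps 48 customers, while the agent who drops every customer once gets 86 calls counted and helps only 43. The metric nearly doubles while the thing it was supposed to track goes down.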
Another example: Goal = make the software developers more productive. Measure 1 = number of lines of code written. Measure 2 = number of bugs fixed.
I am proposing this because it seems to me that, from a 30,000-foot view, a big part of AI alignment is figuring out how to avoid Goodharting. (“Goal = create a happy and prosperous future for humanity. Measure = something that sounds very smart and scientific. Actual behavior = universe converted to paperclips, GDP successfully maximized.”)