I don’t see how this Goodharting is bidirectional. It seems like plain old Goodharting. The assessment, over time (and due to some extraneous process), becomes a lower-quality proxy, which the artist keeps optimizing, thereby Goodharting actual value.
The artist is using “does the audience overtly respond well to this” as a proxy measure for whether the art meets the artist’s more illegible standard of goodness, but the audience is using “does this come from an artist we already regard as good” as a proxy measure for their own illegible standard of goodness. The two illegible standards had to intersect enough around the initial art for the cycle to get started, but that doesn’t mean they’re the same, nor that the two optimization processes are symmetrical, or even the same kind of process. The signals may get so entangled that, from outside the system, you could treat it as a single instance of Goodhart on some compound measure, but from inside the system there are still multiple sub-cycles going on that feed each other. Does that answer this, or is there something else off?
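To make this concrete, here is a toy simulation (every function and constant in it is an illustrative assumption of mine, not something from the discussion): the audience’s overt response drifts over time from its own taste toward “does this resemble work we already approved of”, while the artist simply hill-climbs on the overt response. Only the artist argmaxes, yet the two proxies feed each other and lock the style in:

```python
import random

random.seed(0)

# Toy sketch: all functions and constants are made-up assumptions.
# Both parties hold an illegible standard; each only sees a proxy
# for the other's.

def artist_standard(style):      # the artist's illegible taste
    return -(style - 0.8) ** 2

def audience_standard(style):    # the audience's illegible taste
    return -(style - 0.7) ** 2

style = 0.55            # initial art sits roughly where both standards intersect
approved = [style]      # body of work the audience already regards as good

def audience_response(s, w):
    # Audience proxy: partly its own taste, partly "does this resemble
    # work from an artist we already regard as good".
    resemblance = -min(abs(s - a) for a in approved)
    return (1 - w) * audience_standard(s) + w * resemblance

for step in range(50):
    w = min(0.9, step / 25)                      # reputation weight grows over time
    tweak = style + random.uniform(-0.05, 0.05)  # artist tries a small variation
    # Artist proxy: keep whichever version draws the better overt response.
    if audience_response(tweak, w) > audience_response(style, w):
        style = tweak
    approved.append(style)

print(f"final style: {style:.2f}")
print(f"artist's own standard there:   {artist_standard(style):+.3f}")
print(f"audience's own standard there: {audience_standard(style):+.3f}")
```

The style freezes partway toward either party’s actual taste as soon as the resemblance term outweighs the taste term: two coupled sub-cycles, even though only one side is doing the argmax.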
But the audience isn’t optimizing/Goodharting anything; it’s just providing an imperfect proxy. Only the artist is argmaxing, and argmaxing an imperfect proxy is exactly where Goodhart appears.
One way out would be for the artist to stop optimizing for the audience and start optimizing for real value. Another would be for the audience to perfect their assessment. But this is always the case with Goodhart: you can either stop using the proxy altogether, or improve it.
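A minimal numeric sketch of both exits, under the assumption (mine, for illustration) that the proxy is just true value plus independent noise:

```python
import random

random.seed(1)

candidates = list(range(1000))

def true_value(x):
    return -((x - 500) / 500) ** 2    # real value peaks at x = 500

def proxy(x, noise):
    # Imperfect assessment: true value plus independent noise.
    return true_value(x) + random.gauss(0, noise)

# Argmax the proxy at several noise levels. Improving the proxy
# (shrinking the noise) is one exit; zero noise is the other exit,
# i.e. optimizing real value directly.
for noise in (0.5, 0.1, 0.0):
    pick = max(candidates, key=lambda x: proxy(x, noise))
    print(f"proxy noise {noise}: picked {pick}, true value {true_value(pick):+.3f}")
```

At high noise the argmax mostly selects whichever candidate drew the luckiest error, which is the Goodhart failure in miniature; shrinking the noise recovers most of the real value.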
Something more interesting would be “the artist is trying to create the art that elicits the best response, and the audience is trying to produce the response that makes the artist happiest”, or something like that. This is what happens when two people-pleasers meet and end up settling on a plan that neither of them wants. It’s also relevant to training an AI that’s alignment-faking. In a sense, the other party’s attempt to maximize your local utility dampens the signal you wanted to use to maximize global utility.
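Here is the two-people-pleasers case as a toy sketch (the plans, utilities, and each party’s model of the other are all invented for illustration): each side argmaxes the other’s apparent happiness instead of reporting its own preference, so the joint signal contains neither party’s true utility:

```python
# Toy sketch: all plans, utilities, and models are made-up assumptions.
plans = ["museum", "hike", "stay home"]

alice_true = {"museum": 0.9, "hike": 0.2, "stay home": 0.1}
bob_true   = {"museum": 0.1, "hike": 0.9, "stay home": 0.3}

# Each models the other crudely (skewed, say, by the other's past
# accommodating behavior) and optimizes the other's apparent happiness.
alice_model_of_bob = {"museum": 0.2, "hike": 0.3, "stay home": 0.8}
bob_model_of_alice = {"museum": 0.3, "hike": 0.2, "stay home": 0.9}

alice_says = max(plans, key=alice_model_of_bob.get)   # pleases her model of Bob
bob_says   = max(plans, key=bob_model_of_alice.get)   # pleases his model of Alice

# Both endorse "stay home", so that becomes the plan, even though each
# actually ranks it near the bottom: each one's accommodation masked the
# signal the other needed.
assert alice_says == bob_says == "stay home"
print(f"agreed plan: {alice_says}")
print(f"Alice's true utility: {alice_true[alice_says]}, Bob's: {bob_true[bob_says]}")
```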