Garrett Baker comments on Detect Goodhart and shut down

Garrett Baker 23 Jan 2025 15:31 UTC
4 points
0
If you put current language models in weird situations and give them a goal, I’d say they do perform edge instantiation, even without the missing “creativity” ingredient. E.g., see Claude Sonnet in Minecraft repurposing someone’s house for wood after being asked to collect wood.
Edit: There are other instances of this too: you can tell Claude to protect you in Minecraft, and it will constantly teleport to your position and build walls around you when monsters are nearby. That protects you, but it also prevents any movement or fun you may have wanted to have.
- Jeremy Gillen 23 Jan 2025 15:59 UTC
  4 points
  0
  Fair enough, good points. I guess I classify these LLM agents as “something-like-an-LLM that is genuinely creative”, at least to some extent.
  Although I don’t think the first example is great, seems more like a capability/observation-bandwidth issue.
  - Garrett Baker 23 Jan 2025 16:32 UTC
    4 points
    0
    > Although I don’t think the first example is great, seems more like a capability/observation-bandwidth issue.
    I think you can have multiple failures at the same time. I think this was also Goodhart because the failure mode could have been averted if Sonnet had been told “collect wood WITHOUT BREAKING MY HOUSE” ahead of time.