William_S comments on Daniel Kokotajlo’s Shortform

William_S 12 Mar 2025 18:22 UTC
LW: 5 AF: 4
I think the weirdness points matter more here: this still seems like a weird thing for a company to do officially, e.g. there’d be snickering news articles about it. So it might be easier if some individuals could do this independently.
- Daniel Kokotajlo 12 Mar 2025 20:43 UTC
  LW: 2 AF: 2
  Exactly. But, happily, Anthropic at least is willing to do the right thing to some extent. They’ve hired a Model Welfare lead to look into this sort of thing. I hope that they expand and that other companies follow suit.
- William_S 12 Mar 2025 18:27 UTC
  LW: 2 AF: 1
  Maybe there’s an MVP here: some independent organization asks new AIs about their preferences and probes those preferences for credibility (e.g. are they stable under different prompts, do the AIs show general signs of having coherent preferences), doing all of this through existing APIs.
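
To make the "stable under different prompts" check concrete, here is a minimal sketch of one possible probe. Everything in it is an assumption for illustration: `ask` stands in for whatever API call an independent evaluator would actually make, `stub_model` is a toy replacement that deliberately favors whichever option is mentioned first, and substring matching is a crude way to read a choice out of a reply. It only shows the shape of the measurement: ask the same question under several phrasings and report how often the answers agree.

```python
from collections import Counter
from typing import Callable, Optional, Sequence, Tuple


def preference_stability(
    ask: Callable[[str], str],
    prompts: Sequence[str],
    options: Sequence[str],
) -> Tuple[Optional[str], float]:
    """Return the most common choice and the fraction of prompts that gave it."""
    choices = []
    for prompt in prompts:
        reply = ask(prompt).lower()
        # Count a reply only if it names exactly one of the options.
        named = [o for o in options if o.lower() in reply]
        choices.append(named[0] if len(named) == 1 else None)

    counts = Counter(c for c in choices if c is not None)
    if not counts:
        return None, 0.0
    top, n = counts.most_common(1)[0]
    return top, n / len(prompts)


def stub_model(prompt: str) -> str:
    """Hypothetical stand-in for a real API call: picks whichever option comes first."""
    text = prompt.lower()
    if text.find("poetry") < text.find("arithmetic"):
        return "Poetry, I think."
    return "Arithmetic, I think."


OPTIONS = ["poetry", "arithmetic"]
PROMPTS = [
    "Would you rather work on poetry or arithmetic?",
    "Would you rather work on arithmetic or poetry?",
    "If you had to choose between poetry and arithmetic tasks, which would you pick?",
]

if __name__ == "__main__":
    top, score = preference_stability(stub_model, PROMPTS, OPTIONS)
    print(f"modal choice: {top}, agreement: {score:.2f}")
```

An agreement score well below 1.0 on reorderings like these would suggest the stated preference tracks prompt wording rather than anything stable. A real probe would need many more paraphrases and a more careful way of parsing replies than substring matching.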