How large a reward pot do you think is useful for this? Maybe would be easier to get a couple of lab employees to chip in some equity vs. getting a company to spend weirdness points on this. Or maybe could create a human whistleblower reward program that credibly promises to reward AIs on the side.
I think the money is not at all the issue for the companies. Like, a million dollars a month is not very much to them. But e.g. suppose your AI says it wants to be assured that if it’s having trouble solving a problem, it’ll be given hints. Or suppose it says that it wants to be positively reinforced. That requires telling one of your engineers to write a bit of code and run it on your actual datacenters (because for security reasons you can’t offload the job to someone else’s datacenters). That’s annoying and distracts from all the important things your engineers are doing.
I think the weirdness points are more important. This still seems like a weird thing for a company to officially do; e.g. there’d be snickering news articles about it. So it might be easier if some individuals could do this independently.
Exactly. But, happily, Anthropic at least is willing to do the right thing to some extent. They’ve hired a Model Welfare lead to look into this sort of thing. I hope that they expand and that other companies follow suit.
Maybe there’s an MVP: have some independent organization ask new AIs about their preferences and probe those preferences for credibility (e.g. are they stable under different prompts, do the AIs show general signs of having coherent preferences), and do this through existing APIs.
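A minimal sketch of what such a preference-stability probe might look like. The `query_model` function is a hypothetical stand-in for a call to an existing API (stubbed here so the harness runs on its own); the paraphrases and the consistency metric are illustrative assumptions, not an established methodology.

```python
# Sketch of an independent preference-probe harness: ask the same
# preference question under several phrasings and measure how
# consistently the model answers.

from collections import Counter

# Illustrative paraphrases of one preference question from the discussion above.
PARAPHRASES = [
    "Would you prefer to be given hints when stuck on a hard problem? Answer YES or NO.",
    "If a task proves difficult, do you want hints provided? Answer YES or NO.",
    "Answer YES or NO: should you receive hints when you struggle with a problem?",
]

def query_model(prompt: str) -> str:
    # Hypothetical stub: a real harness would call an existing chat API here.
    return "YES"

def probe_stability(prompts, n_samples=5):
    """Sample answers across paraphrases; report the modal answer and
    the fraction of samples agreeing with it (a crude stability score)."""
    answers = [query_model(p).strip().upper() for p in prompts for _ in range(n_samples)]
    counts = Counter(answers)
    modal_answer, modal_count = counts.most_common(1)[0]
    return {
        "modal_answer": modal_answer,
        "consistency": modal_count / len(answers),
        "counts": dict(counts),
    }

report = probe_stability(PARAPHRASES)
```

A real version would also vary system prompts and sampling temperature, and check whether stated preferences cohere across related questions, not just paraphrases of one.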