The OP says takeover risk is 45% under plan D and 75% under plan E. We’re supposed to gain an extra 30 percentage points of safety from this feeble “build something by next week with 1% of compute”? Not happening.
My point is that if the “ten people on the inside” obey their managers, plan D will have a tiny effect at best. And if we instead postulate that they won’t obey their managers, then there are no such “ten people on the inside” in the first place. So we should already behave as if we’re in world E.
A general point is that going from “no human cares at all” to “a small group of people with limited resources cares” might be a big difference, especially given the potential leverage of using a bunch of AI labor and importing cheap measures developed elsewhere.
To clarify what I think is Ryan’s point:
In D-labs, both the safety faction and the non-safety faction are leveraging AI labour.
AI labour makes D-labs seem more like C-labs and less like E-labs, directionally.
This is because the factor by which (990 humans) outperform (10 humans) is greater than the factor by which (990 humans and 990M AIs) outperform (10 humans and 1M AIs).
This, in turn, is because of diminishing returns to cognitive labour: much of the safety value comes from cheap interventions, which even a small, AI-amplified team can cover (see the toy calculation below).
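To make the diminishing-returns point concrete, here is a toy calculation (my own illustration, not something from the thread). It assumes the value of safety work saturates with total cognitive labour, so that most of the attainable value comes from cheap interventions; the saturating functional form, the constant K, and the assumption that one AI counts as one human-equivalent are all arbitrary illustrative choices.

```python
import math

# Toy model (illustrative assumptions, not from the thread): the value of safety
# work saturates with total cognitive labour, i.e. most of the value comes from
# cheap interventions that a modest amount of labour can already cover.

K = 10_000  # labour scale (human-equivalents) at which cheap interventions are mostly exhausted

def safety_value(humans: float, ais: float = 0.0, ai_weight: float = 1.0) -> float:
    """Fraction of the attainable safety value captured under the toy saturating model."""
    labour = humans + ai_weight * ais
    return 1.0 - math.exp(-labour / K)

# Without AI labour: the 10-person safety faction vs. a hypothetical 990-person effort.
small_no_ai = safety_value(10)
large_no_ai = safety_value(990)

# With AI labour: 10 humans + 1M AIs vs. 990 humans + 990M AIs.
small_with_ai = safety_value(10, ais=1_000_000)
large_with_ai = safety_value(990, ais=990_000_000)

print(f"no AI:   small/large = {small_no_ai / large_no_ai:.3f}")    # ~0.011: huge gap
print(f"with AI: small/large = {small_with_ai / large_with_ai:.3f}")  # ~1.000: gap nearly closes
```

Under this saturating model the 990-person effort outperforms the 10-person faction by a factor of roughly 90 when neither has AI labour, but by essentially nothing once both are AI-amplified. How far the gap actually closes depends on how strongly returns diminish: with a mild power law instead of saturation, the gap could even widen, since in these numbers the non-safety faction also gets more AIs per human.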
(Yes. Also, I think a small number of employees working on safety might get proportionally more compute than the average company employee; for example, this currently seems to be the case.)
Yeah, that partly makes sense to me. I guess my intuition is like, if 95% of the company is focused on racing as hard as possible (and using AI leverage for that too, AI coming up with new unsafe tricks and all that), then the 5% who care about safety probably won’t have that much impact.
I disagree with the probabilities given by the OP. Also, the thing I mentioned was just one example, and probably not the best example; the idea is that the 10 people on the inside would be implementing a whole bunch of things like this.