One thing I forgot to mention is that there are reasons to expect “we are likely to build smart consequentialists (that e.g. maximize sum_t V_t0(s_t))” to be true that are stronger than “look at current AIs” / “this is roughly aligned with commercial incentives”, such as the ones described by Evan Hubinger here.
TL;DR: alignment faking may be more sample efficient / easier to learn / more efficient at making loss go down than internalizing what humans want, so AIs that fake alignment may be selected for.
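For concreteness, here is one way to write out the objective I gestured at above, under my own gloss of the notation (this is my reading, not Hubinger's formalism): $s_t$ is the world state at time $t$, $V_{t_0}$ is the agent's value function as fixed at some reference time $t_0$, and $\pi$ is the agent's policy.

$$\max_{\pi} \; \mathbb{E}_{\pi}\!\left[\sum_{t \ge t_0} V_{t_0}(s_t)\right]$$

The point of writing it this way is that the values being summed are evaluated with the value function frozen at $t_0$, i.e. the agent pursues its current values into the future rather than deferring to whatever values it (or its overseers) would later endorse.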
Carlsmith is a good review of the “Will AIs be smart consequentialists?” arguments up to late 2023. I think the conversation has progressed a little since then, but not massively.