Nice story, thanks for laying it out like this.
My guess is timelines fall off this hopeful story around:

- Things moving too quickly once automated research comes in
- Labs going straight for RSI once it becomes available rather than slowing down enough
- Even if labs half-heartedly try to slow down, advanced systems converging toward unhobbling themselves and slipping this past our safeguards
- The control walls falling too fast, and sometimes silently
- Our interpretability tools not holding up through architectural phase transitions, etc.
- Agent Foundations (AF) being too tricky to hand off to AI research teams to do properly in time
- Labs not having enough in-house philosophical competence to know what to ask their systems or how to check their answers
- AF being hard to verify in a way that defies easy feedback loops
- The current AF field likely missing crucial considerations, which will kill us if the AI research systems can't locate the needed research areas we haven't noticed
- Very few people having deep skills here (maybe a few dozen by my read)