We get what we can measure. Getting models to try to do hard-to-verify tasks seems like it will be harder than I expected. Long-term strategic advice, safety research, and philosophy are probably hard to verify relative to capabilities R&D, so we go into the intelligence explosion unprepared.
Well, the silver lining to the “we get what we can measure” cloud would be that, presumably, if we can’t reliably train on long-term tasks, then the models probably won’t be very good at long-term power-seeking either.
The “we get what we can measure” story leading to doom doesn’t rely on long-term power-seeking. It might be the culmination of myopic power-seeking leading to humans losing their handle on the world.
Also, capabilities might be tied to alignment in this way, but just because we can’t get the AI to try to do a good job on long-term tasks doesn’t mean it won’t be capable of them.
IMO the main implications of this update are:
The probability of scheming increases, as I describe here.
Non-scheming reward-seekers might take over too (e.g. without-specific-countermeasures-style)