Similar to working on AI capabilities, it brings forward the date by which AGI/ASI will be deployed, leaving less time to solve the illegible x-safety problems.
This model seems far too simplified, and I don’t think it leads to the right conclusions in many important cases (e.g., Joe’s):
Many important and legible safety problems don’t slow development. I think it’s extremely unlikely, for example, that Anthropic or others would slow development because of a subpar model spec. In the counterfactual where Joe doesn’t work on the model spec, I think (1) the model spec is worse and (2) dangerously capable AI arrives just as fast. The spec would likely be worse in ways that both increase takeover risk and decrease the expected value of the future conditional on no takeover.
The best time to work on AI x-risk is probably when it’s most legible. In my view, the most valuable time to be doing safety work is just before AIs become dangerously capable, because, e.g., we can then iterate empirically much more effectively (of course, this can be done poorly, as John Wentworth argues). At that point, the x-risk problems will likely be legible (e.g., because they’re empirically demonstrable in model organisms). It would quite plausibly be a mistake not to work on x-risk problems at that time, when they’ve just become more tractable precisely because of their increased legibility. (You made your claim about legibility holding tractability fixed, but in practice tractability is highly correlated with legibility; though, admittedly, so is lower neglectedness.)