We must pause early (AIs pose significant risk before they speed up research much). I think this is mostly ruled out by current evidence.
FYI I currently would mainline guess that this is true. Also I don’t get why current evidence says anything about it – current AIs aren’t dangerous, but that doesn’t really say anything about whether an AI that’s capable of speeding up superalignment or pivotal-act-relevant research by even 2x would be dangerous.
My view is that AIs are improving faster at research-relevant skills like SWE and math than their misalignment (the rate of bad behaviors like reward hacking, the ease of eliciting an alignment faker, etc.) or covert sabotage ability is getting worse, such that we would need a discontinuity in both to get serious danger by the time AIs give a 2x speedup. There is as yet no scaling law for misalignment showing that it predictably gets worse as capabilities improve in practice (a sketch of what such a check would even look like is below).
The situation is not completely clear because we don’t have good alignment evals and could get neuralese any year, but the data are pointing in that direction. I’m not sure about research taste, as the benchmarks for that aren’t very good. I’d change my mind here if we did see stagnation in research taste plus misalignment getting worse over time (not just the sophistication of the bad things AIs do, but also their frequency or egregiousness).
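To make concrete what such a scaling-law check would even look like, here is a minimal sketch; the capability scores and bad-behavior rates in it are made-up placeholders, not real measurements.

```python
# Hypothetical sketch: checking for a "misalignment scaling law".
# All numbers below are illustrative placeholders, not real eval results;
# the point is only the shape of the analysis.
import numpy as np

# One entry per model checkpoint (made-up values).
capability = np.array([0.30, 0.45, 0.55, 0.70, 0.85])          # e.g. SWE/math eval score
misalign_rate = np.array([0.020, 0.018, 0.025, 0.019, 0.022])  # e.g. reward-hacking rate

# Fit a power law misalign_rate ~ a * capability^b in log-log space.
slope, intercept = np.polyfit(np.log(capability), np.log(misalign_rate), 1)
print(f"fitted exponent b = {slope:.2f}")

# A consistently positive exponent across many checkpoints would be the kind of
# "misalignment predictably worsens with capability" law that, as far as I know,
# has not been observed; a flat or noisy fit would support the view above.
```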
What benchmarks for research taste exist? If I understand correctly, Epoch AI’s evaluation suggested that AIs lack creativity in a certain sense, since SOTA systems didn’t solve the combinatorial problems at IMO 2024 or the hard problem at IMO 2025. In a similar experiment of my own, Grok 4 resorted to a brute-force BFS to produce a construction. Unfortunately, the ARC-AGI-1 and ARC-AGI-2 benchmarks may measure visual intelligence more than research taste, and the promising ARC-AGI-3 benchmark[1] isn’t finished yet.
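For concreteness, here is a minimal sketch of the kind of brute-force BFS construction search I mean; the toy target (square-free ternary words, i.e. words with no block immediately repeated) is my own illustrative stand-in, not the actual problem I gave Grok 4.

```python
from collections import deque

def has_square_suffix(s):
    """Return True if s ends in some block repeated twice (i.e. ...xx)."""
    n = len(s)
    for k in range(1, n // 2 + 1):
        if s[-k:] == s[-2 * k:-k]:
            return True
    return False

def bfs_construction(target_len):
    """Breadth-first search over partial words until a square-free word of target_len is found."""
    queue = deque([""])
    while queue:
        s = queue.popleft()
        if len(s) == target_len:
            return s  # every word in the queue is already square-free
        for c in "abc":
            t = s + c
            # Since s is square-free, any new square must end at the last
            # character, so checking the suffix is enough.
            if not has_square_suffix(t):
                queue.append(t)
    return None

print(bfs_construction(10))  # prints some square-free ternary word of length 10
```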
Of course, the worst-case scenario is that misalignment comes hand-in-hand with creativity (e.g. because the AIs develop some moral code that doesn’t adhere to the ideals of their human hosts).
[1] Which is known to contain puzzles.