Here are some reasoned opinions about ML research automation.
Experimental compute and ‘taste’ seem very close to direct multiplier factors in the production of new insight:
- twice the compute means twice the number of experiments run[1]
- ‘twice the taste’ (for some operationalisation of taste) means proposing and running useful experiments twice as often

(there are other factors too, like insight extraction and system-2 experiment design)
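As a toy sketch of the multiplier claim (the function name and numbers are illustrative, not from the post):

```python
# Toy multiplier model of insight production (names and form are illustrative).
# Treating compute and taste as direct multiplier factors means doubling
# either one doubles the rate of new insight.

def insight_rate(compute: float, taste: float) -> float:
    """Insights per unit time under the simple product model."""
    return compute * taste

base = insight_rate(compute=1.0, taste=1.0)
assert insight_rate(2.0, 1.0) == 2 * base  # twice the compute -> twice the insight
assert insight_rate(1.0, 2.0) == 2 * base  # twice the taste  -> twice the insight
```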
My model of research taste is that it ‘accumulates’ (according to some sample efficiency) in a researcher and/or team by observation (direct or indirect) of experiments. It ‘depreciates’, like a capital stock, both because individuals and teams forget or lose touch, and (more relevant to fast-moving fields) because taste generalises only so far, and the ‘frontier’ of research keeps moving.
This makes experiments extremely important, both as a direct input to insight production and as fuel for accumulating research taste.
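The accumulation/depreciation picture above can be sketched as a one-line capital-stock update (parameter values are illustrative assumptions, not from the post):

```python
# Toy 'taste as a capital stock' update (illustrative parameter values).
# Taste accumulates from observed experiments at some sample efficiency
# and depreciates each period as the research frontier moves on.

def step_taste(taste: float, experiments_observed: float,
               sample_efficiency: float = 0.1, depreciation: float = 0.05) -> float:
    return (1.0 - depreciation) * taste + sample_efficiency * experiments_observed

taste = 1.0
for _ in range(200):  # a steady flow of observed experiments
    taste = step_taste(taste, experiments_observed=1.0)

# With a constant experiment flow E, taste converges to E * efficiency / depreciation.
print(round(taste, 3))  # → 2.0
```

Under this sketch, cutting the experiment flow doesn't just slow growth: the stock decays toward a lower steady state, which is the 'depreciates away' point.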
Peak human teams can’t get much better research taste in the absence of experimental compute without improving on taste accumulation, which is a kind of learning sample efficiency. You can’t do that just by having more people: you have to get very sharp people and a very effective organisational structure for collective intelligence. Getting twice the taste is very difficult!
AI research assistants which substantially improved on experiment design, either by accumulating taste more efficiently or by (very expensive?) reasoning much more extensively about experiment design, could make the non-compute factor grow as well.
You can’t just ‘be smarter’ or ‘have better taste’ because it’ll depreciate away. Reasoning for experiment design has very (logarithmically?[2]) diminishing returns as far as I can tell, so I’d guess it’s mostly about sample efficiency of taste accumulation.
A naive model where reasoning for experiment design means generating more proposals from an idea generator and attempting to select the best one has worse than logarithmic returns to running longer, for most sensible distributions of idea generation. Obviously reasoning isn’t memoryless like that, because you can also build on, branch from, or refine earlier proposals, which might sometimes do better than coming up with new ones tabula rasa.
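A quick Monte Carlo sketch of that naive best-of-k model (purely illustrative; proposal quality is assumed standard Gaussian, one of the ‘sensible distributions’ where returns flatten out quickly):

```python
# Toy 'generate k proposals, select the best' reasoning model (illustrative).
# With Gaussian proposal quality, the expected best of k grows roughly like
# sqrt(2 * ln k), i.e. worse than logarithmically in k.

import random

random.seed(0)

def expected_best(k: int, trials: int = 2000) -> float:
    """Monte Carlo estimate of E[max of k standard-normal proposal scores]."""
    return sum(max(random.gauss(0.0, 1.0) for _ in range(k))
               for _ in range(trials)) / trials

for k in (1, 10, 100, 1000):
    print(k, round(expected_best(k), 2))
```

Going from 1 to 10 proposals buys much more than going from 10 to 100, which is the memoryless case the paragraph above is arguing against relying on.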
(There’s some parallelisation discount: k experiments in parallel are strictly worse than k in series, because you can’t incorporate learnings from one experiment into the design of the next.)
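A minimal sketch of that discount (entirely illustrative numbers; an experiment’s usefulness is assumed to scale linearly with current taste):

```python
# Toy series-vs-parallel comparison (entirely illustrative numbers).
# In series, taste improves between experiments, so later designs are better;
# in parallel, all k experiments are designed with the starting taste.

def total_value(k: int, serial: bool, efficiency: float = 0.2) -> float:
    taste, value = 1.0, 0.0
    for _ in range(k):
        value += taste            # each experiment's usefulness scales with taste
        if serial:
            taste += efficiency   # incorporate learnings before the next design
    return value

print(total_value(8, serial=True), total_value(8, serial=False))
```

Under these assumed numbers, eight experiments in series yield about 13.6 units of value versus exactly 8.0 in parallel.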