Autonomous Systems @ UK AI Safety Institute (AISI)
DPhil AI Safety @ Oxford (Hertford college, CS dept, AIMS CDT)
Former senior data scientist and software engineer + SERI MATS
I’m particularly interested in sustainable collaboration and the long-term future of value. I’d love to contribute to a safer and more prosperous future with AI! Always interested in discussions about axiology, x-risks, s-risks.
I enjoy encountering new perspectives and growing my understanding of the world and the people in it. I also love to read—let me know your suggestions! In no particular order, here are some I’ve enjoyed recently:
Ord—The Precipice
Pearl—The Book of Why
Bostrom—Superintelligence
McCall Smith—The No. 1 Ladies’ Detective Agency (and series)
Melville—Moby-Dick
Abelson & Sussman—Structure and Interpretation of Computer Programs
Stross—Accelerando
Simsion—The Rosie Project (and trilogy)
Cooperative gaming is a relatively recent but fruitful interest for me. Here are some of my favourites:
Hanabi (can’t recommend enough; try it out!)
Pandemic (ironic at time of writing...)
Dungeons and Dragons (I DM a bit and it keeps me on my creative toes)
Overcooked (my partner and I enjoy the foodie themes and the frantic real-time coordination)
People who’ve got to know me only recently are sometimes surprised to learn that I’m a pretty handy trumpeter and hornist.
My rough model of what’s going on (not published) is that ‘relevant effective evidence’ is what’s needed to succeed at a given subtask (of the cognitive kind that AI agents are being tested on here).
Relevant effective evidence is accumulated in pretraining (data is information is evidence!) as well as through in-context evidence-gathering activities [1].
Generic web data and the earlier corpora contained some applicable data for these sorts of tasks. More recently, a greater fraction of the curated pretraining and posttraining data is dedicated to software engineering.
(It’s complicated by changes to posttraining, but ‘effective training data’ was certainly increasing roughly exponentially with training date for some time, which maps to the exponential ‘time horizon’ via the per-subtask completion chance.)
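To make the mapping from subtask completion chance to time horizon concrete, here is a minimal toy sketch (my own illustrative assumptions, not anything from a published model): suppose a task of length n consists of n independent subtasks, each succeeding with probability p. Then the 50%-success time horizon is the n at which p^n = 0.5, and small multiplicative improvements in the failure rate (1 − p) translate into roughly proportional multiplicative growth in the horizon.

```python
import math

def fifty_percent_horizon(p: float) -> float:
    """Longest task length (in subtasks) completed with >= 50% chance,
    assuming independent subtasks each succeeding with probability p.
    Solves p**n = 0.5 for n."""
    return math.log(0.5) / math.log(p)

# As the per-subtask failure rate shrinks 10x, the horizon grows ~10x:
for p in (0.9, 0.99, 0.999):
    print(f"p = {p}: horizon ≈ {fifty_percent_horizon(p):.1f} subtasks")
```

So if effective training data pushes p toward 1 exponentially over time, an exponentially growing time horizon falls out of this independence assumption almost for free (the independence itself is, of course, the contestable part).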
This is all kind of annoying retrodiction but it adds up, at least in hindsight.
In-context evidence-gathering (i.e. exploration) competence comes from particularly generalisable learned heuristics, namely how best to experiment or try things out and how to interpret the findings, but it remains largely domain-specific.