# Oliver Sourbut

Karma: 310

Oliver—or call me Oly: I don’t mind which!

I’m particularly interested in sustainable collaboration and the long-term future of value. Currently based in London, I’m early-ish in my career, working as a senior software engineer/data scientist and doing occasional AI alignment work with SERI.

I’d love to contribute to a safer and more prosperous future with AI! Always interested in discussions about axiology, x-risks, s-risks.

I enjoy meeting new perspectives and growing my understanding of the world and the people in it. I also love to read—let me know your suggestions! Recently I’ve enjoyed:

• Ord—The Precipice

• Pearl—The Book of Why

• Bostrom—Superintelligence

• McCall Smith—The No. 1 Ladies’ Detective Agency

• Abelson & Sussman—Structure and Interpretation of Computer Programs

• Stross—Accelerando

Cooperative gaming is a relatively recent but fruitful interest for me. Here are some of my favourites:

• Hanabi (can’t recommend enough; try it out!)

• Pandemic (ironic at time of writing...)

• Dungeons and Dragons (I DM a bit and it keeps me on my creative toes)

• Overcooked (my partner and I enjoy the foodie themes and the frantic real-time coordination)

People who’ve got to know me only recently are sometimes surprised to learn that I’m a pretty handy trumpeter and hornist.

• This is a fantastic point well articulated, reminiscent of some conversations we had a few months ago at Lightcone.

I’d say that a “general-purpose search” process is something which:

• Takes in a problem or goal specification (from a fairly broad range of possible problems/​goals)

• … and returns a plan which solves the problem or scores well on the goal

I think we probably agree on what things there actually are, but I think this particular definition of ‘general-purpose search’ is slightly too general to be the most useful pointer/carving.

This is because it seems to include things like matrix inversion for least-squares solutions (unless ‘from a fairly broad range of possible problems/goals’ is taken to preclude this meaningfully?), which I deem importantly different. I’d class matrix-inversion least-squares as a (powerful) heuristic[1] (a ‘proposal’ in my deliberation terminology), but not as (proper) search itself.

I think it remains useful to distinguish algorithms which evaluate/promote or otherwise weigh proposals[2] from those which merely generate them. This is what I’ve started calling ‘proper deliberation’, and it’s generally what I mean when I talk about search.

In the case of applying matrix inversion to ordinary least squares, for me, the ‘general deliberation’ consists of something like

1. noticing the relevant features of the problem (this is ‘abstraction/​pattern-matching magic’)

2. cognitively retrieving the OLS abstraction and matrix-inversion as a cached heuristic (this is ‘propose’)

3. thinking ‘yes, this will work’ (this is ‘promote’)

4. applying matrix inversion to solve

A clever/​practised enough deliberator does steps 1, 2 and 3 ‘right’ and doesn’t need to iterate for this particular problem (my point here is that if your heuristics are good enough you can deliberate with only one proposal and say ‘yep, good enough, let’s go’). But counterfactually step 2 might make various alternative proposals, or step 3 might think ‘actually there are too many dimensions in this case for inversion to be tractable’ or something, and thus there’s an evaluation and an internal update.
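To make the propose/promote framing concrete, here’s a minimal sketch in Python (my own toy construction; `ols_proposal`, `deliberate`, and the `good_enough` threshold are hypothetical names, not anyone’s actual algorithm). Matrix-inversion least-squares enters only as a cached proposal; the deliberation loop is the part that evaluates and promotes.

```python
import numpy as np

def ols_proposal(X, y):
    # the cached 'God-level heuristic': closed-form least squares
    # via the normal equations (X^T X) beta = X^T y
    return np.linalg.solve(X.T @ X, X.T @ y)

def deliberate(X, y, proposals, good_enough=1e-8):
    # propose; promote: try cached heuristics, evaluate each,
    # promote the first whose solution is judged good enough
    for propose in proposals:                   # step 2: propose
        beta = propose(X, y)
        mse = np.mean((X @ beta - y) ** 2)      # step 3: evaluate...
        if mse <= good_enough:                  # ...and promote
            return beta                         # step 4: apply
    raise ValueError("no proposal promoted; deliberation must iterate")

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5])              # a noiseless linear problem
print(deliberate(X, y, [ols_proposal]))         # one proposal suffices here
```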

1. ↩︎

Peter Barnett and Ian McKenzie coined ‘God-level heuristic’ for really solid mathematically-justified heuristics like this, which I quite like

2. ↩︎

I don’t require this to be a ‘full consequentialist model-based valuation’, but that would be one example. See my deliberation simple examples for less sophisticated versions which are quite pervasive and nevertheless embody the ‘propose; promote’ breakdown.

• I love how your intro has the flavour of

We are Hydra. We are legion.

p.s. Hail Team Shard

p.p.s. I’ve read a bunch of so-called Shard Theory stuff and I’m still not sure how it differs from the concepts of optimization daemons/mesa-optimization, besides less exclusively emphasising the ‘post-general’ regime (for want of a better term).

• 7 Aug 2022 10:11 UTC
1 point

Maybe the argument is “but if it’s never tried the action of optimizing harder for reward, then the RL algorithm won’t be able to reinforce that internal action”?

That’s my reading, yeah, and I agree it’s strained. But yes, the ‘internal action’ of even ‘thinking about how to’ optimise for reward may not be trivial to discover.

Separately, to be reinforced, the action-weight downstream of that ‘thinking’ has to yield better actions than whatever the ‘rest of’ cognition produces (it stands to reason that it might, but plausibly heuristics amounting to ‘shaped’ value and reward proxies are easier to get right, hence inner misalignment).

I agree that once you find ways to directly seek reward you’re liable to get hooked to some extent.

I think this sort of thing is worth trying to get nuance on, but I certainly don’t personally derive much hope from it directly (I think this sort of reasoning may lead to useable insights though).

• 5 Aug 2022 17:33 UTC
3 points
in reply to: Quintin Pope’s comment

This response is really helpful, thank you! I take several of the points as uncontroversial[1], so I’ll respond mainly to those where you seem surprisingly confident (vs my own current epistemic position).

I and Alex both agree that the genome can influence learned behavior and concepts by exploiting its access to sensory ground truth… the imprinting circuitry is… imprecise

It seems like there are two salient hypotheses that can come out of the imprinting phenomenon, though (they seem to sort of depend on what direction you draw the arrows between different bits of brain?):

1. Hard-coded proxies fire for the thing in question. (Maybe also this encourages more attention and makes it more likely for the runtime learner to develop corresponding abstractions.) Corresponding abstractions are highly correlated with the proxies, and this strong signal helps with symbol grounding. (And the now-grounded ‘symbols’ feed into whatever other circuitry.) Maybe decision-making is—at least partially—defined relative to these ‘symbols’.

2. Hard-coded proxies fire for the thing in question. (Maybe also this encourages more attention and makes it more likely for the runtime learner to develop corresponding abstractions.) These proxies directly wire to reward circuits. There is runtime reinforcement learning. The runtime reinforcement learner generates corresponding abstractions because these are useful ‘features’ for reinforced behaviour. Decision-making is the product of reinforced behaviour.

Both of these seem like useful things to happen from the POV of natural selection, so I don’t see how to rule out either (and I tentatively expect both to be true). I think you and Alex are exploring hypothesis 2?

FWIW, I tentatively wonder whether, to the extent that human and animal decision-making fits something like an actor-critic or propose-promote deliberation framing, the actor/propose might be more 2-ish and the critic/promote more 1-ish.

there’s some explanation that specifically predicts sunk cost /​ framing /​ goal conflation as the convergent consequences of the human learning process.

We could probably dig further into each of these, but for now I’ll say: I don’t think these have in common a material/​mechanical cause much lower than ‘the brain’ and I don’t think they have in common a moving cause much lower than ‘evolution did it’. Framing, like anchoring, seems like a straightforward consequence of ‘sensible’ computational shortcuts to make world modelling tractable (on any computer, not just a human brain).

I think most high level goals /​ values are learned… don’t think most are directly installed by evolution

I basically can’t evaluate whether I agree with this because I don’t know what ‘high level’ and ‘most’ means. This isn’t intended as a rebuttal; this topic is in general hard to discuss with precision. I also find it disconcertingly hard to talk/​think about high and low level goals in humans without bumping into ‘consciousness’ one way or another and I really wish that was less of a mystery. I basically agree that the vast majority of what seem to pass for goals at almost any level are basically instrumental and generated at runtime. But, is this supposed to be a surprise? I don’t think it is.

learning systems don’t develop a single ontology… values “learn” to generalize across different ontologies well before you learn that people are made of cells

Seems uncontroversial to me. I think we’re on the same page when I said

ontological shifts are just supplementary world abstractions being installed which happen to overlap with preexisting abstractions

I don’t see any reason for supplementary abstractions to interfere with values, terminal or otherwise, resting on existing ontologies. (They can interfere enormously with new instrumental things, for epistemic reasons, of course.)

I note that sometimes people do have what looks passingly similar to ontological crises. I don’t know what to make of this, except by noting that people’s ‘most salient active goals’ are often instrumental goals, expressed in one or other folk ontology and subject to the very conflation we’ve agreed exists. So I suppose if newly-installed abstractions are sufficiently incompatible in the world model, they can dislodge a lot of aggregate weight from the active goalset. A ‘healthy’ recovery from this sort of thing usually looks like someone identifying the in-fact-more-fundamental goals (which might putatively be the ones, or closer to the ones, installed by evolution; I don’t know).

Thanks again for this clarifying response, and I’m looking forward to more stuff from you and Alex and/​or others in this area.

1. ↩︎

By the way, I get a sense of ‘controversy signalling’ from some of this ‘shard theory’ stuff. I don’t have a good way to describe this, but it seems to make it harder for me to engage because I’m not sure what’s supposed to be new and for some reason I can’t really tell what I agree with. cf Richard’s comment. Please take this as a friendly note because I understand you’ve had a hard time getting some people to engage constructively (Alex told me something to the effect of ‘most people slide off this’). I’m afraid I don’t have positive textual/​presentational advice here beyond this footnote.

• 5 Aug 2022 14:57 UTC
LW: 1 AF: 1

I think Quintin[1] is maybe alluding to the fact that in the limit of infinite counterfactual exploration then sure, the gradient in sample-based policy gradient estimation will push in that direction. But we don’t ever have infinite exploration (and we certainly don’t have counterfactual exploration, though we come very close in simulations with resets), so in pure non-lookahead (e.g. model-free) sample-based policy gradient estimation, an action which has never been tried cannot be reinforced (except as a side effect of generalisation by function approximation).

This seems right to me, and it’s a nuance I’ve raised in a few conversations in the past. On the other hand, kind of half the point of RL optimisation algorithms is to do ‘enough’ exploration! And furthermore (as I mentioned under Steven’s comment) I’m not confident that such simplistic RL is the kind that will scale to AGI first. cf various impressive results from DeepMind over the years which use lots of shenanigans besides plain old sample-based policy gradient estimation (including model-based lookahead, as in the Alpha and Mu gang). But maybe!
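To illustrate that nuance, a toy sketch (my own construction; the hard exclusion of one action is artificial, standing in for ‘never explored’): in tabular sample-based REINFORCE with a softmax policy, a never-sampled action’s logit only ever receives the negative normalisation term of the gradient, so it can’t be positively reinforced however large its reward would have been.

```python
import numpy as np

rng = np.random.default_rng(0)
logits = np.zeros(3)                   # tabular softmax policy over 3 actions
reward = np.array([1.0, 2.0, 5.0])     # action 2 would be best -- but is never tried

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for _ in range(1000):
    probs = softmax(logits)
    # artificially restrict exploration: action 2 is never sampled
    a = rng.choice([0, 1], p=probs[:2] / probs[:2].sum())
    # sample-based policy gradient for softmax: d log pi(a)/d logits = onehot(a) - probs
    grad = -probs
    grad[a] += 1.0
    logits += 0.1 * reward[a] * grad

print(softmax(logits))  # action 2's probability has only decayed, never been reinforced
```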

1. ↩︎

• FWIW I upvoted but disagree with the end part (hurray for more nuance in voting!)

I think “reward is the antecedent-computation-reinforcer” will probably be true in RL algorithms that scale to AGI

At least from my epistemic position there looks like an explanation/​communication gap here: I don’t think we can be as confident of this. To me this claim seems to preclude ‘creative’ forward-looking exploratory behaviour and model-based planning, which have more of a probingness and less of a merely-antecedent-computation-reinforcingness. But I see other comments from you here which talk about foresighted exploration (and foresighted non-exploration!) and I know you’ve written about these things at length. How are you squaring/​nuancing these things? (Silence or a link to an already-written post will not be deemed rude.)

• In my ontology, there’s only an inner alignment problem: How do we grow good cognition inside of the trained agent?

This seems like a great takeaway and the part I agree with most here, though I’d state it less strongly. Did you see Richard Ngo’s Shaping Safer Goals (2020) or my Motivations, Natural Selection, and Curriculum Engineering (2021) responding to it[1]? Both relate to this sort of picture.

So the RL agent’s algorithm won’t make it e.g. explore wireheading either, and so the convergence theorems don’t apply even a little—even in spirit… I started off analyzing model-free actor-based approaches, but have also considered a few model-based setups

For various reasons I expect model-based RL to be a more viable path to AGI, mainly because I think creative exploration is a missing ingredient for addressing reward sparsity and the computational complexity barrier to tree-ish planning. Maybe a sufficiently carefully constructed curriculum can get over these, but that’s likely to be a really substantial additional hurdle, perhaps dominating the engineering effort, and perhaps simply intractable.

I also expect model-based + creative exploration[2] to be much more readily able to make exploratory leaps, perhaps including wireheading-like activities. cf humans, who aren’t all that creative but still find ever more inventive ways to wirehead—as a society, quite a lot of selection and intelligent design has gone into setting up incentive structures to push people away from wireheading-like activities. Also, in humans, because our hardware is pretty messy and difficult to wirehead, such activities typically also harm or destroy capability, which selects against them. But in general I don’t expect wireheading to necessarily harm capability.

So we definitely can’t rule out agents which strongly (and not just weakly) value antecedent-computation-reinforcement. But it’s also not the overdetermined default outcome. More on that in future essays.

Looking forward to it!

p.s. I’m surprised you think that RL researchers on the whole in fact believe that RL produces reward-maximisers, but your (few) pieces of evidence do indeed seem to suggest that! I suppose on the whole the apparent ‘surprisingness’ of the concept of inner misalignment should also point the same way. I’d still err toward assuming a mixture of sloppy language and actual mistakenness.

1. ↩︎

Warning: both are quite verbose in my opinion and I expect both would be shorter if more time had been taken!

2. ↩︎

By the way, ‘creative exploration’ is mostly magic to me, but I have reason to think it relates to temporal abstraction and recomposition in planning.

• I’m informed[1] that the concept of ‘fixed parts’ and ‘in-flight mutations’ I employed in the ‘recovering the equivalence’ section is similar to the ‘coalescence’ of coalescent theory, an apparently relatively niche biology tool whose applications appear interesting (if unrelated) from a cursory look.

1. ↩︎

by Holly Elmore, thanks!

• Hey Thane, interesting stuff! Any chance you read my recent things on ‘deliberation’? It feels like we’re interested in similar questions[1] but approaching from different perspectives (I’m sort of trying to look at the bit ‘just after’ the ‘world model’). You might find it interesting or helpful.

1. ↩︎

not surprising, as we’ve both been speaking to John and taken inspiration from him and from Scott G’s work

• I like the way you tie real-world advice to principles in ML and RL. In general I think there are a lot of risks to naively applying epistemic deference and worldview aggregation, and you articulate some of them really nicely here.

Something I’ve noticed with a few of your posts is that they often contain a lot of nuggets of ideas! And for you they seem to cohere into maybe a single high-level thought, but I sometimes want to pull them into smaller chunks[1]. For example, I imagine you (or others) might want to refer individually to the core idea in the paragraph beginning

However, even if in practice we end up mostly evaluating worldviews based on their epistemic track record, I claim that it’s still valuable to consider the epistemic track record as a proxy for the quality of their advice, rather than using it directly to evaluate how much we trust each worldview...

Now, the rest of the post gives this core idea context and support, but I think it stands on its own as well.

One compromise :D between putting lots of ideas together and splitting them apart too atomically could be to add meaningful sub-headings. (This also incidentally makes it easy to link out to the specific part of the text from another place via # links.)

1. ↩︎

Maybe we differ in the number of effective working memory slots we have available (for what I mean see https://www.sciencedaily.com/releases/2008/04/080402212855.htm, though see https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4159388/ which challenges this)

• ## ‘Temporary MAP stance’ or ‘subjective probability matching’

Temporary MAP stance or subjective probability matching are my words for useful mental manoeuvres for research, especially when dealing with confusing or preparadigmatic or otherwise non-crisp domains.

MAP is Maximum A Posteriori, i.e. your best guess after considering evidence. Probability matching is making actions/guesses proportional to your estimate of them being right (rather than picking the single MAP choice).

By this manoeuvre I’m gesturing at a kind of behaviour where you are quite unsure about what’s best (e.g. ‘should I work on interpretability or demystifying deception?’) and rather than allowing that to result in analysis paralysis, you temporarily collapse some uncertainty and make some concrete assumptions to get moving in one or other direction. Hopefully in so doing you a) make a contribution and b) grow your skills and collect new evidence to make better decisions/​contributions next time.

It happens to correspond somewhat to a decent heuristic called Thompson Sampling, which is optimal under some conditions for some uncertain-duration sequential decision problems.
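For concreteness, a minimal Thompson Sampling sketch (a hypothetical two-option ‘research direction’ bandit of my own construction): draw one sample from each posterior, act on the argmax this round, and update. Options get chosen roughly in proportion to the current probability that they’re best, rather than by a single MAP choice.

```python
import numpy as np

rng = np.random.default_rng(0)
true_rates = [0.3, 0.5]        # unknown payoff rates of two directions (hypothetical)
alpha = np.ones(2)             # Beta posterior parameters: successes + 1
beta = np.ones(2)              # Beta posterior parameters: failures + 1

for _ in range(500):
    # Thompson sampling: one draw per posterior, commit to the argmax this round
    draws = rng.beta(alpha, beta)
    arm = int(np.argmax(draws))
    success = rng.random() < true_rates[arm]
    alpha[arm] += success
    beta[arm] += 1 - success

print(alpha / (alpha + beta))  # posterior means; effort concentrates on the better option
```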

HT Evan Hubinger for articulating his take on this in discussions about research, and I’m certain I’ve read others discussing similar principles on LW or EAF but I don’t have references to hand.

# Oliver Sourbut’s Shortform

14 Jul 2022 15:39 UTC
4 points
• 8 Jul 2022 0:50 UTC
LW: 15 AF: 7
1. Information inaccessibility is somehow a surmountable problem for AI alignment (and the genome surmounted it),

2. The genome solves information inaccessibility in some way we cannot replicate for AI alignment, or

3. The genome cannot directly address the vast majority of interesting human cognitive events, concepts, and properties. (The point argued by this essay)

In my opinion, either (1) or (3) would be enormous news for AI alignment

What do you mean by ‘enormous news for AI alignment’? That either of these would be surprising to people in the field? Or that resolving that dilemma would be useful to build from? Or something else?

FWIW, from my POV the trilemma isn’t one, because I agree that (2) is obviously not the case in principle (subject to enough research time!). And I further think it reasonably clear that both (1) and (3) are true in some measure. Granted, you say ‘at least one’ must be true, but I think the framing as a trilemma suggests you want to dismiss (1) - is that right?

I’ll bite those bullets (in devil’s advocate style)...

• I think about half of your bullets are probably (1), except via rough proxies (power, scamming, family, status, maybe cheating)

• why? One clue is that people have quite specific physiological responses to some of these things. Another is that various of these are characterised by different behaviour in different species.

• why proxies? It stands to reason, as you point out here, that it’s hard and expensive to specify things exactly. Further, lots of animal research demonstrates hardwired proxies pointing to runtime-learned concepts

• Sunk cost, framing, and goal conflation smell weird to me in this list—like they’re the wrong type? I’m not sure what it would mean for these to be ‘detected’ and then the bias ‘implemented’. Rather I think they emerge from failure of imagination due to bounded compute.

• in the case of goals I think that’s just how we’re implemented (it’s parsimonious)

• with the possible exception of ‘conscious self approval’ as a differently-typed and differently-implemented sole terminal goal

• other goals at various levels of hierarchy, strength, and temporal extent get installed as we go

• ontological shifts are just supplementary world abstractions being installed which happen to overlap with preexisting abstractions

• tentatively, I expect cells and atoms probably have similar representation to ghosts and spirits and numbers and ecosystems and whatnot—they’re just abstractions and we have machinery which forms and manipulates them

• admittedly this machinery is basically magic to me at this point

• wireheading and reality/​non-reality are unclear to me and I’m looking forward to seeing where you go with it

• I suspect all imagined circumstances (‘real’ or non-real) go via basically the same circuitry, and that ‘non-real’ is just an abstraction like ‘far away’ or ‘unlikely’

• after all, any imagined circumstance is non-real to some extent

• P.s. plants also do the basic thing I’d call deliberative control (or iterated deliberation). In the cases I described in that link, the model state is represented in analogue by the physical growth of the plant.

(And yes, in all cases these are inner misaligned in some weak fashion.)

• Yes, pretty much that’s a distinction I’d draw as meaningful, except I’d call the first one a ‘deliberative (optive) control procedure’, not an ‘optimizer’, because I think ‘optimizer’ has too many vague connotations.

The ‘world model’ doesn’t have to be separate from the deliberation, or even manifested at all: consider iterated natural selection, which deliberates over mutations, without having a separate ‘model’ of anything—because the evaluation is the promotion and the action (unless you count the world itself and the replication counts of various traits as the model). But in the bacterial case, there really is some (basic) world model in the form of internal chemical states.

• In this response I eschew the word ‘optimization’[1] but ‘control procedure’ might be synonymous with one rendering of ‘optimization’.

Some bacteria perform[2] a basic deliberation, ‘trying out’ alternative directions and periodically evaluating a heuristic (e.g. estimated sugar density) to seek out preferred locations. Iterated, this produces a simple control procedure which locates food items and avoids harmful substances. It can do this in a wide range of contexts, but clearly not all (as Peter alluded to via No Free Lunch). Put growing and dividing aside for now (they are separate algorithms).

A boiling water bubble doesn’t do any deliberation—it’s a ‘reaction’ in my terminology. But, within the context of ‘is underwater in X temperature and Y pressure range and Z gravitational field distribution’, its movement and essential nature are preserved, so it’s ‘iterated’, and hence the relatively direct path to the surface can be thought of as a consequence of a (very very basic) control procedure. Outside of this context it’s disabled or destroyed.

I take these basic examples as belonging to a spectrum of control procedures. Much more sophisticated ones may be able to proceed more efficiently to their goals, or do so from a wider range of starting conditions.

EDIT to be clear, I think the internal difference between the bubble and the bacterium is that the bacterium evaluates e.g. sugar concentrations to form a (very minimal) estimated model of the ‘world’ around it, and these evaluations affect its ongoing behaviour. The bubble doesn’t do this.
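Here’s a minimal run-and-tumble sketch of that bacterial deliberation (my own toy model; `sugar_density` is a stand-in for the evaluated chemical heuristic, and the numbers are arbitrary): propose a heading, keep it while the evaluation improves, and tumble to a fresh proposal when it doesn’t.

```python
import numpy as np

rng = np.random.default_rng(0)

def sugar_density(pos):
    # stand-in heuristic: higher closer to a sugar source at (10, 10)
    return -np.linalg.norm(pos - np.array([10.0, 10.0]))

def random_heading():
    h = rng.normal(size=2)
    return h / np.linalg.norm(h)

pos, heading = np.zeros(2), random_heading()
last_eval = sugar_density(pos)

for _ in range(200):
    pos = pos + heading                  # 'run' along the current proposal
    evaluation = sugar_density(pos)      # evaluate the internal heuristic
    if evaluation <= last_eval:          # not promoted: 'tumble' to a new proposal
        heading = random_heading()
    last_eval = evaluation

print(pos)  # ends up close to the source: iterated deliberation, no global plan
```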

1. ↩︎
2. ↩︎

HT John Wentworth for this video link

• ‘Fitness’ is a very overloaded term, as you’ve delved into above. I’d like to attempt to describe a few carvings which help me to firm things up and avoid equivocation in my own thinking.

The original pretheoretic term ‘fitness’ meant ‘being fitted/​suitable/​capable (relative to a context)’, and this is what Darwin and co were originally pointing to. (Remember they didn’t have genes or Mendel until decades later!)

The modern technical usage of ‘fitness’ very often operationalises this, for organisms, to be something like number of offspring, and for alleles/​traits to be something like change in prevalence (perhaps averaged and/​or normalised relative to some reference).

So natural selection is the ex post tautology ‘that which propagates in fact propagates’.

If we allow for ex ante uncertainty, we can talk about probabilities of selection/​fixation and expected time to equilibrium and such. Here, ‘fitness’ is some latent property, understood as a distribution over outcomes.

If we look at longer timescales, ‘fitness’ is heavily bimodal: in many cases a particular allele/​trait either fixes or goes extinct[1]. If we squint, we can think of this unknown future outcome as the hidden ground truth of latent fitness, about which some bits are revealed over time and over generations.

A ‘single step’ of natural selection tries out some variations and promotes the ones which in fact work (based on a realisation of the ‘ex ante’ uncertain fitness). This indeed follows the latent fitness gradient in expectation.

In this ex ante framing it becomes much more reasonable to treat natural selection as an optimisation/control process similar to gradient descent. It’s shooting to maximise the hidden ground truth of latent fitness over many iterations, but it’s doing so via a foresight-free local heuristic, like gradient descent, applied many times.

How can we reconcile this claim with the fact that the operationalised ‘relative fitness’ often walks approximately randomly, at least not often sustainedly upward[2]? Well, it’s precisely because it’s relative—relative to a changing series of fitness landscapes over time. Those landscapes change in part as a consequence of abiotic processes, partly as a consequence of other species’ changes, and often as a consequence of the very trait changes which natural selection is itself imposing within a population/​species!

So, I think, we can say with a straight face that natural selection is optimising (weakly) for increased fitness, even while a changing fitness landscape means that almost by definition relative fitness hovers around a constant for most extant lineages. I don’t think it’s optimising on species, but on lineages (which sometimes correspond).[3]
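A minimal Wright-Fisher-style sketch of the ex ante view (my own toy construction; the population size, selection coefficient, and starting frequency are arbitrary choices): each generation, selection shifts the expected allele frequency up the latent fitness gradient and drift adds sampling noise, while ex post the outcomes are bimodal (fixation or extinction).

```python
import numpy as np

rng = np.random.default_rng(0)
N, s, p0 = 1000, 0.02, 0.05   # population size, selection coefficient, start frequency
outcomes = []

for _ in range(200):
    p = p0
    while 0.0 < p < 1.0:
        # selection: expected frequency moves up the latent fitness gradient
        p_sel = p * (1 + s) / (p * (1 + s) + (1 - p))
        # drift: binomial sampling of the next generation
        p = rng.binomial(N, p_sel) / N
    outcomes.append(p)

# every run ends at 0 or 1: ex post, 'fitness' looks bimodal,
# but the fraction fixing reflects the ex ante latent advantage
print(np.mean(outcomes))
```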

1. ↩︎

In cases where the relative fitness of a trait corresponds with its prevalence, there can be a dynamic equilibrium at neither of these modes. Consider evolutionary stable strategies. But the vast majority of mutations ever have hit the ‘extinct’ attractor, and a lot of extant material is of the form ‘ancestor of a large proportion of living organisms’.

2. ↩︎

Though note we do see (briefly?) sustained upward fitness in times of abundance, as notably in human population and in adaptive radiation in response to new resources, habitats, and niches becoming available.

3. ↩︎

Now, if the earlier instances of now-extinct lineages were somehow evolutionarily ‘frozen’ and periodically revived back into existence, we really would see that natural selection pushes for increased fitness. But because those lineages aren’t (by definition) around any more, the fitness landscape’s changes over time are under no obligation to be transitive, so in fact a faceoff between a chicken and a velociraptor might tell a different story.

• In a former role working on software control systems for internet-scale bidding stuff, we’d often talk in terms of confounders, upstream/​downstream, causal terms, etc. when developing and tuning system improvements. Pretty rare to actually draw a causal diagram (a few times?) or crack out do-calculus (never?) and I don’t know if everyone had read Pearl (probably not?) but at least passing fluency with the concepts was a big help.

I saw other teams (us too) fail, or waste effort in confusion, when they missed things that they’d have spotted with a better appreciation of causal structure.

My guess is this is a similar story for some technologists, and likely in medicine and other experimental fields, at least some of the time.

• Three ideas, not at all worked through

• quantilisation and robustness (see the sketch after this list)

• quantilising is generally considered ‘robust’

• not sure what the best arguments are, but maybe a Bayesian almost always ‘should’ have rapidly-enough decaying tails that some quantile is equivalent to EV...?

• contra Pascal’s wager style failures?

• finitude of evidence can’t support arbitrarily large hypotheses...?

• discount rates

• maybe exponential or hyperbolic (or other) discount rate over time steps could lead to something like logarithmic preferences?

• my intuition says nope but I’ve not run the maths

• I would be surprised if this worked over lots of different scales, but maybe on particular configurations

• if those configurations happened to be plausible ancestrally then...?

• value of information

• maybe some heuristic relating to value of information makes it convergently instrumental to have roughly logarithmic preferences

• you don’t learn anything more if you ‘go to zero’...?

• maybe cashes out something like quantilising?
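On the first idea, here’s a minimal quantiliser sketch (entirely my own toy setup: `proxy_utility` is a Goodhart-able proxy and `true_utility` hides a cliff the proxy doesn’t see). Acting uniformly within the top-q fraction of a trusted base distribution stays robust where pure argmax chases the proxy off the cliff.

```python
import numpy as np

rng = np.random.default_rng(0)

def proxy_utility(a):
    return a                                 # Goodhart-able proxy: bigger is better

def true_utility(a):
    return np.where(a > 4.0, -10.0, a)       # hidden cliff the proxy doesn't see

base = rng.normal(0.0, 1.5, size=10_000)     # trusted base distribution of actions
q = 0.1

# quantilise: act uniformly at random within the top-q fraction of the base
# distribution as ranked by the proxy, instead of taking the proxy's argmax
cutoff = np.quantile(proxy_utility(base), 1 - q)
top_q_actions = base[proxy_utility(base) >= cutoff]
maximised = base[np.argmax(proxy_utility(base))]

print(true_utility(top_q_actions).mean())    # strong and, on average, still safe
print(true_utility(maximised))               # the pure maximiser falls off the cliff
```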

• 29 Jun 2022 8:15 UTC
4 points