# Oliver Sourbut

Karma: 310

Oliver—or call me Oly: I don’t mind which!

I’m particularly interested in sustainable collaboration and the long-term future of value. Currently based in London, I’m early-ish in my career, working as a senior software engineer/data scientist and doing occasional AI alignment work with SERI.

I’d love to contribute to a safer and more prosperous future with AI! Always interested in discussions about axiology, x-risks, s-risks.

I enjoy meeting new perspectives and growing my understanding of the world and the people in it. I also love to read—let me know your suggestions! Recently I’ve enjoyed:

• Ord—The Precipice

• Pearl—The Book of Why

• Bostrom—Superintelligence

• McCall Smith—The No. 1 Ladies’ Detective Agency

• Abelson & Sussman—Structure and Interpretation of Computer Programs

• Stross—Accelerando

Cooperative gaming is a relatively recent but fruitful interest for me. Here are some of my favourites:

• Hanabi (can’t recommend enough; try it out!)

• Pandemic (ironic at time of writing...)

• Dungeons and Dragons (I DM a bit and it keeps me on my creative toes)

• Overcooked (my partner and I enjoy the foodie themes and the frantic real-time coordination)

People who’ve got to know me only recently are sometimes surprised to learn that I’m a pretty handy trumpeter and hornist.

• This is a fantastic point well articulated, reminiscent of some conversations we had a few months ago at Lightcone.

I’d say that a “general-purpose search” process is something which:

• Takes in a problem or goal specification (from a fairly broad range of possible problems/​goals)

• … and returns a plan which solves the problem or scores well on the goal

I think we probably agree on what things there actually are, but I think this particular definition of ‘general-purpose search’ is slightly too general to be the most useful pointer/carving.

This is because it seems to include things like matrix inversion for least-squares solutions (unless ‘from a fairly broad range of possible problems/goals’ is taken to preclude this meaningfully?), which I deem importantly different. I’d class matrix-inversion least-squares as a (powerful) heuristic[1] (a ‘proposal’ in my deliberation terminology), but not as (proper) search itself.

I think it remains useful to distinguish algorithms which evaluate/promote or otherwise weigh proposals[2] from those which merely generate them. This is what I’ve started calling ‘proper deliberation’, and it’s generally what I mean when I talk about search.

In the case of applying matrix inversion to ordinary least squares, for me, the ‘general deliberation’ consists of something like

1. noticing the relevant features of the problem (this is ‘abstraction/​pattern-matching magic’)

2. cognitively retrieving the OLS abstraction and matrix-inversion as a cached heuristic (this is ‘propose’)

3. thinking ‘yes, this will work’ (this is ‘promote’)

4. applying matrix inversion to solve

A clever/​practised enough deliberator does steps 1, 2 and 3 ‘right’ and doesn’t need to iterate for this particular problem (my point here is that if your heuristics are good enough you can deliberate with only one proposal and say ‘yep, good enough, let’s go’). But counterfactually step 2 might make various alternative proposals, or step 3 might think ‘actually there are too many dimensions in this case for inversion to be tractable’ or something, and thus there’s an evaluation and an internal update.
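To make the propose/promote framing concrete, here’s a minimal sketch in Python (my own toy construction; `ols_proposal`, `deliberate`, and the `good_enough` threshold are hypothetical names, not anyone’s actual algorithm). Matrix-inversion least-squares enters only as a cached proposal; the deliberation loop is the part that evaluates and promotes.

```python
import numpy as np

def ols_proposal(X, y):
    # the cached 'God-level heuristic': closed-form least squares
    # via the normal equations (X^T X) beta = X^T y
    return np.linalg.solve(X.T @ X, X.T @ y)

def deliberate(X, y, proposals, good_enough=1e-8):
    # propose; promote: try cached heuristics, evaluate each,
    # promote the first whose solution is judged good enough
    for propose in proposals:                   # step 2: propose
        beta = propose(X, y)
        mse = np.mean((X @ beta - y) ** 2)      # step 3: evaluate...
        if mse <= good_enough:                  # ...and promote
            return beta                         # step 4: apply
    raise ValueError("no proposal promoted; deliberation must iterate")

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5])              # a noiseless linear problem
print(deliberate(X, y, [ols_proposal]))         # one proposal suffices here
```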

1. ↩︎

Peter Barnett and Ian McKenzie coined ‘God-level heuristic’ for really solid mathematically-justified heuristics like this, which I quite like

2. ↩︎

I don’t require this to be a ‘full consequentialist model-based valuation’, but that would be one example. See my deliberation simple examples for less sophisticated versions which are quite pervasive and nevertheless embody the ‘propose; promote’ breakdown.

• I love how your intro has the flavour of

We are Hydra. We are legion.

p.s. Hail Team Shard

p.p.s. I’ve read a bunch of so-called Shard Theory stuff and I’m still not sure how it differs from the concepts of optimization daemons/mesa-optimization, besides less exclusively emphasising the ‘post-general’ regime (for want of a better term).

• 7 Aug 2022 10:11 UTC
1 point

Maybe the argument is “but if it’s never tried the action of optimizing harder for reward, then the RL algorithm won’t be able to reinforce that internal action”?

That’s my reading, yeah, and I agree it’s strained. But yes, the ‘internal action’ of even ‘thinking about how to’ optimise for reward may not be trivial to discover.

Separately, to be reinforced, the action-weight downstream of that ‘thinking’ has to yield better actions than whatever the ‘rest of’ cognition produces (it stands to reason that it might, but plausibly heuristics amounting to ‘shaped’ value and reward proxies are easier to get right, hence inner misalignment).

I agree that once you find ways to directly seek reward you’re liable to get hooked to some extent.

I think this sort of thing is worth trying to get nuance on, but I certainly don’t personally derive much hope from it directly (I think this sort of reasoning may lead to useable insights though).

• 5 Aug 2022 17:33 UTC
3 points
in reply to: Quintin Pope’s comment

This response is really helpful, thank you! I take several of the points as uncontroversial[1], so I’ll respond mainly to those where you seem surprisingly confident (vs my own current epistemic position).

I and Alex both agree that the genome can influence learned behavior and concepts by exploiting its access to sensory ground truth… the imprinting circuitry is… imprecise

It seems like there are two salient hypotheses that can come out of the imprinting phenomenon, though (they seem to sort of depend on what direction you draw the arrows between different bits of brain?):

1. Hard-coded proxies fire for the thing in question. (Maybe also this encourages more attention and makes it more likely for the runtime learner to develop corresponding abstractions.) Corresponding abstractions are highly correlated with the proxies, and this strong signal helps with symbol grounding. (And the now-grounded ‘symbols’ feed into whatever other circuitry.) Maybe decision-making is—at least partially—defined relative to these ‘symbols’.

2. Hard-coded proxies fire for the thing in question. (Maybe also this encourages more attention and makes it more likely for the runtime learner to develop corresponding abstractions.) These proxies directly wire to reward circuits. There is runtime reinforcement learning. The runtime reinforcement learner generates corresponding abstractions because these are useful ‘features’ for reinforced behaviour. Decision-making is the product of reinforced behaviour.

Both of these seem like useful things to happen from the POV of natural selection, so I don’t see how to rule out either (and I tentatively expect both to be true). I think you and Alex are exploring hypothesis 2?

FWIW, I tentatively wonder whether, to the extent that human and animal decision-making fits something like an actor-critic or propose-promote deliberation framing, the actor/propose might be more 2-ish and the critic/promote more 1-ish.

there’s some explanation that specifically predicts sunk cost /​ framing /​ goal conflation as the convergent consequences of the human learning process.

We could probably dig further into each of these, but for now I’ll say: I don’t think these have in common a material/​mechanical cause much lower than ‘the brain’ and I don’t think they have in common a moving cause much lower than ‘evolution did it’. Framing, like anchoring, seems like a straightforward consequence of ‘sensible’ computational shortcuts to make world modelling tractable (on any computer, not just a human brain).

I think most high level goals /​ values are learned… don’t think most are directly installed by evolution

I basically can’t evaluate whether I agree with this because I don’t know what ‘high level’ and ‘most’ means. This isn’t intended as a rebuttal; this topic is in general hard to discuss with precision. I also find it disconcertingly hard to talk/​think about high and low level goals in humans without bumping into ‘consciousness’ one way or another and I really wish that was less of a mystery. I basically agree that the vast majority of what seem to pass for goals at almost any level are basically instrumental and generated at runtime. But, is this supposed to be a surprise? I don’t think it is.

learning systems don’t develop a single ontology… values “learn” to generalize across different ontologies well before you learn that people are made of cells

Seems uncontroversial to me. I think we’re on the same page when I said

ontological shifts are just supplementary world abstractions being installed which happen to overlap with preexisting abstractions

I don’t see any reason for supplementary abstractions to interfere with values, terminal or otherwise, resting on existing ontologies. (They can interfere enormously with new instrumental things, for epistemic reasons, of course.)

I note that sometimes people do have what looks passingly similar to ontological crises. I don’t know what to make of this, except by noting that people’s ‘most salient active goals’ are often instrumental goals, expressed in one or other folk ontology and subject to the very conflation we’ve agreed exists. So I suppose if newly-installed abstractions are sufficiently incompatible in the world model, they can dislodge a lot of aggregate weight from the active goalset. A ‘healthy’ recovery from this sort of thing usually looks like someone identifying the in-fact-more-fundamental goals (which might putatively be the ones, or closer to the ones, installed by evolution; I don’t know).

Thanks again for this clarifying response, and I’m looking forward to more stuff from you and Alex and/​or others in this area.

1. ↩︎

By the way, I get a sense of ‘controversy signalling’ from some of this ‘shard theory’ stuff. I don’t have a good way to describe this, but it seems to make it harder for me to engage because I’m not sure what’s supposed to be new and for some reason I can’t really tell what I agree with. cf Richard’s comment. Please take this as a friendly note because I understand you’ve had a hard time getting some people to engage constructively (Alex told me something to the effect of ‘most people slide off this’). I’m afraid I don’t have positive textual/​presentational advice here beyond this footnote.

• 5 Aug 2022 14:57 UTC
LW: 1 AF: 1

I think Quintin[1] is maybe alluding to the fact that in the limit of infinite counterfactual exploration then sure, the gradient in sample-based policy gradient estimation will push in that direction. But we don’t ever have infinite exploration (and we certainly don’t have counterfactual exploration, though we come very close in simulations with resets), so in pure non-lookahead (e.g. model-free) sample-based policy gradient estimation, an action which has never been tried cannot be reinforced (except as a side effect of generalisation by function approximation).

This seems right to me, and it’s a nuance I’ve raised in a few conversations in the past. On the other hand, kind of half the point of RL optimisation algorithms is to do ‘enough’ exploration! And furthermore (as I mentioned under Steven’s comment) I’m not confident that such simplistic RL is the kind that will scale to AGI first. cf various impressive results from DeepMind over the years which use lots of shenanigans besides plain old sample-based policy gradient estimation (including model-based lookahead, as in the Alpha and Mu gang). But maybe!
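To illustrate that nuance, a toy sketch (my own construction; the hard exclusion of one action is artificial, standing in for ‘never explored’): in tabular sample-based REINFORCE with a softmax policy, a never-sampled action’s logit only ever receives the negative normalisation term of the gradient, so it can’t be positively reinforced however large its reward would have been.

```python
import numpy as np

rng = np.random.default_rng(0)
logits = np.zeros(3)                   # tabular softmax policy over 3 actions
reward = np.array([1.0, 2.0, 5.0])     # action 2 would be best -- but is never tried

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for _ in range(1000):
    probs = softmax(logits)
    # artificially restrict exploration: action 2 is never sampled
    a = rng.choice([0, 1], p=probs[:2] / probs[:2].sum())
    # sample-based policy gradient for softmax: d log pi(a)/d logits = onehot(a) - probs
    grad = -probs
    grad[a] += 1.0
    logits += 0.1 * reward[a] * grad

print(softmax(logits))  # action 2's probability has only decayed, never been reinforced
```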

1. ↩︎

• FWIW I upvoted but disagree with the end part (hurray for more nuance in voting!)

I think “reward is the antecedent-computation-reinforcer” will probably be true in RL algorithms that scale to AGI

At least from my epistemic position there looks like an explanation/​communication gap here: I don’t think we can be as confident of this. To me this claim seems to preclude ‘creative’ forward-looking exploratory behaviour and model-based planning, which have more of a probingness and less of a merely-antecedent-computation-reinforcingness. But I see other comments from you here which talk about foresighted exploration (and foresighted non-exploration!) and I know you’ve written about these things at length. How are you squaring/​nuancing these things? (Silence or a link to an already-written post will not be deemed rude.)

• In my ontology, there’s only an inner alignment problem: How do we grow good cognition inside of the trained agent?

This seems like a great takeaway and the part I agree with most here, though I’d state it less strongly. Did you see Richard Ngo’s Shaping Safer Goals (2020) or my Motivations, Natural Selection, and Curriculum Engineering (2021) responding to it[1]? Both relate to this sort of picture.

So the RL agent’s algorithm won’t make it e.g. explore wireheading either, and so the convergence theorems don’t apply even a little—even in spirit… I started off analyzing model-free actor-based approaches, but have also considered a few model-based setups

For various reasons I expect model-based RL to be a more viable path to AGI, mainly because I think creative exploration is a missing ingredient for addressing reward sparsity and the computational complexity barrier to tree-ish planning. Maybe a sufficiently carefully constructed curriculum can get over these, but that’s likely to be a really substantial additional hurdle, perhaps dominating the engineering effort, and perhaps simply intractable.

I also expect model-based + creative exploration[2] to be much more readily able to make exploratory leaps, perhaps including wireheading-like activities. cf humans, who aren’t all that creative but still find ever more inventive ways to wirehead—as a society, quite a lot of selection and intelligent design has gone into setting up incentive structures to push people away from wireheading-like activities. Also, in humans, because our hardware is pretty messy and difficult to wirehead, such activities typically also harm or destroy capability, which selects against them. But in general I don’t expect wireheading to necessarily harm capability.

So we definitely can’t rule out agents which strongly (and not just weakly) value antecedent-computation-reinforcement. But it’s also not the overdetermined default outcome. More on that in future essays.

Looking forward to it!

p.s. I’m surprised you think that RL researchers on the whole in fact believe that RL produces reward-maximisers, but your (few) pieces of evidence do indeed seem to suggest that! I suppose on the whole the apparent ‘surprisingness’ of the concept of inner misalignment should also point the same way. I’d still err toward assuming a mixture of sloppy language and actual mistakenness.

1. ↩︎

Warning: both are quite verbose in my opinion and I expect both would be shorter if more time had been taken!

2. ↩︎

By the way, ‘creative exploration’ is mostly magic to me, but I have reason to think it relates to temporal abstraction and recomposition in planning.

• I’m informed[1] that the concept of ‘fixed parts’ and ‘in-flight mutations’ I employed in the ‘recovering the equivalence’ section is similar to the ‘coalescence’ of coalescent theory, an apparently relatively niche biology tool whose applications appear interesting (if unrelated) from a cursory look.

1. ↩︎

by Holly Elmore, thanks!

• Hey Thane, interesting stuff! Any chance you read my recent things on ‘deliberation’? It feels like we’re interested in similar questions[1] but approaching from different perspectives (I’m sort of trying to look at the bit ‘just after’ the ‘world model’). You might find it interesting or helpful.

1. ↩︎

not surprising, as we’ve both been speaking to John and taken inspiration from him and from Scott G’s work

• I like the way you tie real-world advice to principles in ML and RL. In general I think there are a lot of risks to naively applying epistemic deference and worldview aggregation, and you articulate some of them really nicely here.

Something I’ve noticed with a few of your posts is that they often contain a lot of nuggets of ideas! And for you they seem to cohere into maybe a single high-level thought, but I sometimes want to pull them into smaller chunks[1]. For example, I imagine you (or others) might want to refer individually to the core idea in the paragraph beginning

However, even if in practice we end up mostly evaluating worldviews based on their epistemic track record, I claim that it’s still valuable to consider the epistemic track record as a proxy for the quality of their advice, rather than using it directly to evaluate how much we trust each worldview...

Now, the rest of the post gives this core idea context and support, but I think it stands on its own as well.

One compromise :D between putting lots of ideas together and splitting them apart too atomically could be to add meaningful sub-headings. (This also incidentally makes it easy to link out to the specific part of the text from another place via # links.)

1. ↩︎

Maybe we differ in the number of effective working memory slots we have available (for what I mean see https://www.sciencedaily.com/releases/2008/04/080402212855.htm, though see https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4159388/ which challenges this)

• ## ‘Temporary MAP stance’ or ‘subjective probability matching’

Temporary MAP stance or subjective probability matching are my words for useful mental manoeuvres for research, especially when dealing with confusing or preparadigmatic or otherwise non-crisp domains.

MAP is Maximum A Posteriori, i.e. your best guess after considering evidence. Probability matching is making actions/guesses proportional to your estimate of them being right (rather than picking the single MAP choice).

By this manoeuvre I’m gesturing at a kind of behaviour where you are quite unsure about what’s best (e.g. ‘should I work on interpretability or demystifying deception?’) and rather than allowing that to result in analysis paralysis, you temporarily collapse some uncertainty and make some concrete assumptions to get moving in one or other direction. Hopefully in so doing you a) make a contribution and b) grow your skills and collect new evidence to make better decisions/​contributions next time.

It happens to correspond somewhat to a decent heuristic called Thompson Sampling, which is optimal under some conditions for some uncertain-duration sequential decision problems.
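For concreteness, a minimal Thompson Sampling sketch (a hypothetical two-option ‘research direction’ bandit of my own construction): draw one sample from each posterior, act on the argmax this round, and update. Options get chosen roughly in proportion to the current probability that they’re best, rather than by a single MAP choice.

```python
import numpy as np

rng = np.random.default_rng(0)
true_rates = [0.3, 0.5]        # unknown payoff rates of two directions (hypothetical)
alpha = np.ones(2)             # Beta posterior parameters: successes + 1
beta = np.ones(2)              # Beta posterior parameters: failures + 1

for _ in range(500):
    # Thompson sampling: one draw per posterior, commit to the argmax this round
    draws = rng.beta(alpha, beta)
    arm = int(np.argmax(draws))
    success = rng.random() < true_rates[arm]
    alpha[arm] += success
    beta[arm] += 1 - success

print(alpha / (alpha + beta))  # posterior means; effort concentrates on the better option
```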

HT Evan Hubinger for articulating his take on this in discussions about research, and I’m certain I’ve read others discussing similar principles on LW or EAF but I don’t have references to hand.

# Oliver Sourbut’s Shortform

14 Jul 2022 15:39 UTC
4 points
• 8 Jul 2022 0:50 UTC
LW: 15 AF: 7
1. Information inaccessibility is somehow a surmountable problem for AI alignment (and the genome surmounted it),

2. The genome solves information inaccessibility in some way we cannot replicate for AI alignment, or

3. The genome cannot directly address the vast majority of interesting human cognitive events, concepts, and properties. (The point argued by this essay)

In my opinion, either (1) or (3) would be enormous news for AI alignment

What do you mean by ‘enormous news for AI alignment’? That either of these would be surprising to people in the field? Or that resolving that dilemma would be useful to build from? Or something else?

FWIW, from my POV the trilemma isn’t one, because I agree that (2) is obviously not the case in principle (subject to enough research time!). And I further think it reasonably clear that both (1) and (3) are true in some measure. Granted, you say ‘at least one’ must be true, but I think the framing as a trilemma suggests you want to dismiss (1) - is that right?

I’ll bite those bullets (in devil’s advocate style)...

• I think about half of your bullets are probably (1), except via rough proxies (power, scamming, family, status, maybe cheating)

• why? One clue is that people have quite specific physiological responses to some of these things. Another is that various of these are characterised by different behaviour in different species.

• why proxies? It stands to reason, as you point out here, that it’s hard and expensive to specify things exactly. Further, lots of animal research demonstrates hardwired proxies pointing to runtime-learned concepts

• Sunk cost, framing, and goal conflation smell weird to me in this list—like they’re the wrong type? I’m not sure what it would mean for these to be ‘detected’ and then the bias ‘implemented’. Rather I think they emerge from failure of imagination due to bounded compute.

• in the case of goals I think that’s just how we’re implemented (it’s parsimonious)

• with the possible exception of ‘conscious self approval’ as a differently-typed and differently-implemented sole terminal goal

• other goals at various levels of hierarchy, strength, and temporal extent get installed as we go

• ontological shifts are just supplementary world abstractions being installed which happen to overlap with preexisting abstractions

• tentatively, I expect cells and atoms probably have similar representation to ghosts and spirits and numbers and ecosystems and whatnot—they’re just abstractions and we have machinery which forms and manipulates them

• admittedly this machinery is basically magic to me at this point

• wireheading and reality/​non-reality are unclear to me and I’m looking forward to seeing where you go with it

• I suspect all imagined circumstances (‘real’ or non-real) go via basically the same circuitry, and that ‘non-real’ is just an abstraction like ‘far away’ or ‘unlikely’

• after all, any imagined circumstance is non-real to some extent

• P.s. plants also do the basic thing I’d call deliberative control (or iterated deliberation). In the cases I described in that link, the model state is represented in analogue by the physical growth of the plant.

(And yes, in all cases these are inner misaligned in some weak fashion.)

• Yes, pretty much that’s a distinction I’d draw as meaningful, except I’d call the first one a ‘deliberative (optive) control procedure’, not an ‘optimizer’, because I think ‘optimizer’ has too many vague connotations.

The ‘world model’ doesn’t have to be separate from the deliberation, or even manifested at all: consider iterated natural selection, which deliberates over mutations, without having a separate ‘model’ of anything—because the evaluation is the promotion and the action (unless you count the world itself and the replication counts of various traits as the model). But in the bacterial case, there really is some (basic) world model in the form of internal chemical states.

• In this response I eschew the word ‘optimization’[1] but ‘control procedure’ might be synonymous with one rendering of ‘optimization’.

Some bacteria perform[2] a basic deliberation, ‘trying out’ alternative directions and periodically evaluating a heuristic (e.g. estimated sugar density) to seek out preferred locations. Iterated, this produces a simple control procedure which locates food items and avoids harmful substances. It can do this in a wide range of contexts, but clearly not all (as Peter alluded to via No Free Lunch). Put growing and dividing aside for now (they are separate algorithms).

A boiling water bubble doesn’t do any deliberation—it’s a ‘reaction’ in my terminology. But, within the context of ‘is underwater in X temperature and Y pressure range and Z gravitational field distribution’, its movement and essential nature are preserved, so it’s ‘iterated’, and hence the relatively direct path to the surface can be thought of as a consequence of a (very very basic) control procedure. Outside of this context it’s disabled or destroyed.

I take these basic examples as belonging to a spectrum of control procedures. Much more sophisticated ones may be able to proceed more efficiently to their goals, or do so from a wider range of starting conditions.

EDIT to be clear, I think the internal difference between the bubble and the bacterium is that the bacterium evaluates e.g. sugar concentrations to form a (very minimal) estimated model of the ‘world’ around it, and these evaluations affect its ongoing behaviour. The bubble doesn’t do this.
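Here’s a minimal run-and-tumble sketch of that bacterial deliberation (my own toy model; `sugar_density` is a stand-in for the evaluated chemical heuristic, and the numbers are arbitrary): propose a heading, keep it while the evaluation improves, and tumble to a fresh proposal when it doesn’t.

```python
import numpy as np

rng = np.random.default_rng(0)

def sugar_density(pos):
    # stand-in heuristic: higher closer to a sugar source at (10, 10)
    return -np.linalg.norm(pos - np.array([10.0, 10.0]))

def random_heading():
    h = rng.normal(size=2)
    return h / np.linalg.norm(h)

pos, heading = np.zeros(2), random_heading()
last_eval = sugar_density(pos)

for _ in range(200):
    pos = pos + heading                  # 'run' along the current proposal
    evaluation = sugar_density(pos)      # evaluate the internal heuristic
    if evaluation <= last_eval:          # not promoted: 'tumble' to a new proposal
        heading = random_heading()
    last_eval = evaluation

print(pos)  # ends up close to the source: iterated deliberation, no global plan
```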

1. ↩︎
2. ↩︎

HT John Wentworth for this video link

• ‘Fitness’ is a very overloaded term, as you’ve delved into above. I’d like to attempt to describe a few carvings which help me to firm things up and avoid equivocation in my own thinking.

The original pretheoretic term ‘fitness’ meant ‘being fitted/​suitable/​capable (relative to a context)’, and this is what Darwin and co were originally pointing to. (Remember they didn’t have genes or Mendel until decades later!)

The modern technical usage of ‘fitness’ very often operationalises this, for organisms, to be something like number of offspring, and for alleles/​traits to be something like change in prevalence (perhaps averaged and/​or normalised relative to some reference).

So natural selection is the ex post tautology ‘that which propagates in fact propagates’.

If we allow for ex ante uncertainty, we can talk about probabilities of selection/​fixation and expected time to equilibrium and such. Here, ‘fitness’ is some latent property, understood as a distribution over outcomes.

If we look at longer timescales, ‘fitness’ is heavily bimodal: in many cases a particular allele/​trait either fixes or goes extinct[1]. If we squint, we can think of this unknown future outcome as the hidden ground truth of latent fitness, about which some bits are revealed over time and over generations.

A ‘single step’ of natural selection tries out some variations and promotes the ones which in fact work (based on a realisation of the ‘ex ante’ uncertain fitness). This indeed follows the latent fitness gradient in expectation.

In this ex ante framing it becomes much more reasonable to treat natural selection as an optimisation/control process similar to gradient descent. It’s shooting to maximise the hidden ground truth of latent fitness over many iterations, but it’s doing so via a foresight-free local heuristic, like gradient descent, applied many times.

How can we reconcile this claim with the fact that the operationalised ‘relative fitness’ often walks approximately randomly, at least not often sustainedly upward[2]? Well, it’s precisely because it’s relative—relative to a changing series of fitness landscapes over time. Those landscapes change in part as a consequence of abiotic processes, partly as a consequence of other species’ changes, and often as a consequence of the very trait changes which natural selection is itself imposing within a population/​species!

So, I think, we can say with a straight face that natural selection is optimising (weakly) for increased fitness, even while a changing fitness landscape means that almost by definition relative fitness hovers around a constant for most extant lineages. I don’t think it’s optimising on species, but on lineages (which sometimes correspond).[3]
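A minimal Wright-Fisher-style sketch of the ex ante view (my own toy construction; the population size, selection coefficient, and starting frequency are arbitrary choices): each generation, selection shifts the expected allele frequency up the latent fitness gradient and drift adds sampling noise, while ex post the outcomes are bimodal (fixation or extinction).

```python
import numpy as np

rng = np.random.default_rng(0)
N, s, p0 = 1000, 0.02, 0.05   # population size, selection coefficient, start frequency
outcomes = []

for _ in range(200):
    p = p0
    while 0.0 < p < 1.0:
        # selection: expected frequency moves up the latent fitness gradient
        p_sel = p * (1 + s) / (p * (1 + s) + (1 - p))
        # drift: binomial sampling of the next generation
        p = rng.binomial(N, p_sel) / N
    outcomes.append(p)

# every run ends at 0 or 1: ex post, 'fitness' looks bimodal,
# but the fraction fixing reflects the ex ante latent advantage
print(np.mean(outcomes))
```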

1. ↩︎

In cases where the relative fitness of a trait corresponds with its prevalence, there can be a dynamic equilibrium at neither of these modes. Consider evolutionary stable strategies. But the vast majority of mutations ever have hit the ‘extinct’ attractor, and a lot of extant material is of the form ‘ancestor of a large proportion of living organisms’.

2. ↩︎

Though note we do see (briefly?) sustained upward fitness in times of abundance, as notably in human population and in adaptive radiation in response to new resources, habitats, and niches becoming available.

3. ↩︎

Now, if the earlier instances of now-extinct lineages were somehow evolutionarily ‘frozen’ and periodically revived back into existence, we really would see that natural selection pushes for increased fitness. But because those lineages aren’t (by definition) around any more, the fitness landscape’s changes over time are under no obligation to be transitive, so in fact a faceoff between a chicken and a velociraptor might tell a different story.

• In a former role working on software control systems for internet-scale bidding stuff, we’d often talk in terms of confounders, upstream/​downstream, causal terms, etc. when developing and tuning system improvements. Pretty rare to actually draw a causal diagram (a few times?) or crack out do-calculus (never?) and I don’t know if everyone had read Pearl (probably not?) but at least passing fluency with the concepts was a big help.

I saw other teams (us too) fail, or waste effort in confusion, when they missed things that they’d have spotted with a better appreciation of causal structure.

My guess is this is a similar story for some technologists, and likely in medicine and other experimental fields, at least some of the time.

• Three ideas, not at all worked through

• quantilisation and robustness (see the sketch after this list)

• quantilising is generally considered ‘robust’

• not sure what the best arguments are, but maybe a Bayesian almost always ‘should’ have rapidly-enough decaying tails that some quantile is equivalent to EV...?

• contra Pascal’s wager style failures?

• finitude of evidence can’t support arbitrarily large hypotheses...?

• discount rates

• maybe exponential or hyperbolic (or other) discount rate over time steps could lead to something like logarithmic preferences?

• my intuition says nope but I’ve not run the maths

• I would be surprised if this worked over lots of different scales, but maybe on particular configurations

• if those configurations happened to be plausible ancestrally then...?

• value of information

• maybe some heuristic relating to value of information makes it convergently instrumental to have roughly logarithmic preferences

• you don’t learn anything more if you ‘go to zero’...?

• maybe cashes out something like quantilising?
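On the first idea, here’s a minimal quantiliser sketch (entirely my own toy setup: `proxy_utility` is a Goodhart-able proxy and `true_utility` hides a cliff the proxy doesn’t see). Acting uniformly within the top-q fraction of a trusted base distribution stays robust where pure argmax chases the proxy off the cliff.

```python
import numpy as np

rng = np.random.default_rng(0)

def proxy_utility(a):
    return a                                 # Goodhart-able proxy: bigger is better

def true_utility(a):
    return np.where(a > 4.0, -10.0, a)       # hidden cliff the proxy doesn't see

base = rng.normal(0.0, 1.5, size=10_000)     # trusted base distribution of actions
q = 0.1

# quantilise: act uniformly at random within the top-q fraction of the base
# distribution as ranked by the proxy, instead of taking the proxy's argmax
cutoff = np.quantile(proxy_utility(base), 1 - q)
top_q_actions = base[proxy_utility(base) >= cutoff]
maximised = base[np.argmax(proxy_utility(base))]

print(true_utility(top_q_actions).mean())    # strong and, on average, still safe
print(true_utility(maximised))               # the pure maximiser falls off the cliff
```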

• 29 Jun 2022 8:15 UTC
4 points