Raymond Douglas

Karma: 1,184

Raymond Douglas May 1, 2025, 3:23 PM
9 points
1
in reply to: ChristianKl’s comment on: Bandwidth Rules Everything Around Me: Oliver Habryka on OpenPhil and GoodVentures
I am extremely not Dustin, and I do not want to veer into psychologising, but I very tentatively interpret him as also conveying some mix of:
- legitimately feeling that there are some things it might be bad to fund, and feeling morally responsible for making sure the money doesn’t go to such bad things, and neither trusting OP to make those judgments, nor trusting that the good and bad will essentially balance out somehow
- finding it somewhat stressful and draining to be responsible (not just reputationally) for things you don’t have time to scrutinise, where those are in fact finite resources that need to be spent carefully
- hoping that if other people do fill in the funding gaps, they’ll also share the load on the other tacit resources (which, to be fair, is complicated by the general problems with donor funging that do seem to have been handled suboptimally)
I reiterate that all the comments are just there on the other post for anyone to scrutinise, rather than taking my word for it. I make no claim as to whether these are cruxes. But in my estimation these are some of the implications.
I would also offer this quote, because I think the meta-dynamic here is an important piece of the puzzle:
I’m not detailing specific decisions for the same reason I want to invest in fewer focus areas: additional information is used as additional attack surface area. The attitude in EA communities is “give an inch, fight a mile”. So I’ll choose to be less legible instead.

Raymond Douglas May 1, 2025, 11:19 AM
7 points
4
in reply to: habryka’s comment on: Bandwidth Rules Everything Around Me: Oliver Habryka on OpenPhil and GoodVentures
Yeah, I meant that he was pushing back on the framing as an oversimplification, not that he was pushing back on the claim that reputation was part of the calculation—this I feel he did straightforwardly and consistently do, with actual substantive reasons, e.g.
“reputational risks” [..] narrows the mind too much on what is going on here
I can’t know all our grantees, and my estimation is I can’t divorce myself from responsibility for them, reputationally or otherwise. [emphasis original]
“PR risk” is an unnecessarily narrow mental frame for why we’re focusing [...] there are other bandwidth issues: energy, attention, stress, political influence. Those are more finite than capital.
Framing the costs as “PR” limits the way people think about mitigating costs. It’s not just “lower risk” but more shared responsibility and energy to engage with decision making, persuading, defending, etc.
Again, really leaning into trying to give the opposite side here, I think that rounding things off to “Dustin Moskovitz became more concerned about his reputation” is actually losing a lot of important nuance mostly in a way that makes Dustin look bad, and in a way that he correctly identified and objected to. Which is not to say there hasn’t been a cursed miasma causing who knows how much harm, but I think the differences in implication here are subtle and important.

Raymond Douglas Apr 30, 2025, 9:57 AM
15 points
−15
on: Bandwidth Rules Everything Around Me: Oliver Habryka on OpenPhil and GoodVentures
Interesting stuff! For the sake of multi-sidedness, I’d note that this description of the shift being because of Dustin caring about his reputation is something Dustin himself repeatedly pushed back on in the original GV update comment thread, for being an oversimplification. I might also recommend Dustin’s big Medium essay on philanthropy to anyone curious about how he conceives of what he does.

Disempowerment spirals as a likely mechanism for existential catastrophe

Raymond Douglas and owencb

Apr 10, 2025, 2:37 PM

74 points

7 comments, 5 min read, LW link

Selection Pressures on LM Personas

Raymond Douglas Mar 28, 2025, 8:33 PM

29 points

0 comments, 3 min read, LW link

Raymond Douglas Feb 25, 2025, 7:31 PM
2 points
0
in reply to: mattmacdermott’s comment on: Superintelligent Agents Pose Catastrophic Risks: Can Scientist AI Offer a Safer Path?
Ah I should emphasise, I do think all of these things could help—it definitely is a spectrum, and I would guess these proposals all do push away from agency. I think the direction here is promising.
The two things I think are (1) the paper seems to draw an overly sharp distinction between agents and non-agents, and (2) basically all of the mitigations proposed look like they break down with superhuman capabilities. Hard to tell which of these are actual disagreements and which are the paper trying to be concise and approachable, so I’ll set that aside for now.
It does seem like we disagree a bit about how likely agents are to emerge. Some opinions I expect I hold more strongly than you:
- It’s easy to accidentally scaffold some kind of agent out of an oracle as soon as there’s any kind of consistent causal process from the oracle’s outputs to the world, even absent feedback loops. In other words, I agree you can choose to create agents, but I’m not totally sure you can easily choose not to
- Any system trained to predict the actions of agents over long periods of time will develop an understanding of how agents could act to achieve their goals—in a sense this is the premise of offline RL and things like decision transformers
- It might be pretty easy for agent-like knowledge to ‘jump the gap’, e.g. a model trained to predict deceptive agents might be able to analogise to itself being deceptive
- Sufficient capability at broad prediction is enough to converge on at least the knowledge of how to circumvent most of the guardrails you describe, e.g. how to collude

Raymond Douglas Feb 25, 2025, 3:29 PM
8 points
0
on: Superintelligent Agents Pose Catastrophic Risks: Can Scientist AI Offer a Safer Path?
I like the thrust of this paper, but I feel that it overstates how robust the safety properties will be, by drawing an overly sharp distinction between agentic and non-agentic systems, and not really engaging with the strongest counterexamples.
To give some examples from the text:
A chess-playing AI, for instance, is goal-directed because it prefers winning to losing. A classifier trained with log likelihood is not goal-directed, as that learning objective is a natural consequence of making observations
But I could easily train an AI which simply classifies chess moves by quality. What takes that to being an agent is just the fact that its outputs are labelled as ‘moves’ rather than as ‘classifications’, rather than any feature of the model itself. More generally, even a LM can be viewed as “merely” predicting next tokens—the fact that there is some perspective from which a system is non-agentic does not actually tell us very much.
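To make the point concrete, here is a minimal sketch of the relabelling. The score table and the names `classify_move` and `as_policy` are hypothetical stand-ins for illustration, not anything from the paper: the "model" only scores moves, and the only thing that turns it into a player is a thin wrapper that treats its top-scored output as an action.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for a trained classifier's outputs.
# The scores are made up purely for illustration.
TOY_SCORES: Dict[str, float] = {"e2e4": 0.62, "g1f3": 0.58, "a2a3": 0.31}


def classify_move(position: str, move: str) -> float:
    """Return an estimated quality score for `move` in `position` (toy lookup)."""
    return TOY_SCORES.get(move, 0.0)


def as_policy(
    classifier: Callable[[str, str], float]
) -> Callable[[str, List[str]], str]:
    """Wrap a move classifier so its top-scored move is treated as an action."""

    def policy(position: str, legal_moves: List[str]) -> str:
        # Nothing about the classifier changes; we only take an argmax
        # over its scores and call the result a 'move'.
        return max(legal_moves, key=lambda m: classifier(position, m))

    return policy


play = as_policy(classify_move)
chosen = play("startpos", ["a2a3", "g1f3", "e2e4"])
print(chosen)
```

The classifier itself is untouched by the wrapper, which is the sense in which the agent/non-agent line depends on how the outputs are used rather than on a feature of the model.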
Paralleling a theoretical scientist, it only generates hypotheses about the world and uses them to evaluate the probabilities of answers to given questions. As such, the Scientist AI has no situational awareness and no persistent goals that can drive actions or long-term plans.
I think it’s a stretch to say something generating hypotheses about the world has no situational awareness and no persistent goals—maybe it has indexical uncertainty, but a sufficiently powerful system is pretty likely to hypothesise about itself, and the equivalent of persistent goals can easily fall out of any ways its world model doesn’t line up with reality. Note that this doesn’t assume the AI has any ‘hidden goals’ or that it ever makes inaccurate predictions.
I appreciate that the paper does discuss objections to the safety of Oracle AIs, but the responses also feel sort of incomplete. For instance:
- The counterfactual query proposal basically breaks down in the face of collusion
- The point about isolating the training process from the real world says that “a reward-maximizing agent alters the real world to increase its reward”, which I think is importantly wrong. In general, I think the distinctions drawn here between RL and the Scientist AI all break down at high capability levels.
- The uniqueness of solutions still leaves a degree of freedom in how the AI fills in details we don’t know—it might be able to, for example, pick between several world models that fit the data which each offer a different set of entirely consistent answers to all our questions. If it’s sufficiently superintelligent, we wouldn’t be able to monitor whether it was even exercising that freedom.
Overall, I’m excited by the direction, but it doesn’t feel like this approach actually gets any assurances of safety, or any fundamental advantages.

Raymond Douglas Feb 2, 2025, 10:48 PM
16 points
0
in reply to: Fabien Roger’s comment on: Gradual Disempowerment: Systemic Existential Risks from Incremental AI Development
Thank you for the very detailed comment! I’m pretty sympathetic to a lot of what you’re saying, and mostly agree with you about the three properties you describe. I also think we ought to do some more spelling-out of the relationship between gradual disempowerment and takeover risk, which isn’t very fleshed-out in the paper — a decent part of why I’m interested in it is because I think it increases takeover risk, in a similar but more general way to the way that race dynamics increase takeover risk.
I’m going to try to respond to the specific points you lay out, probably not in enough detail to be super persuasive but hopefully in a way that makes it clearer where we might disagree, and I’d welcome any followup questions off the back of that. (Note also that my coauthors might not endorse all this.)
Responding to the specific assumptions you lay out:
- No egregious lying — Agree this will probably be doable, and pretty interested in the prospect of ‘ai police’, but not a crux for me. I think that, for instance, much of the harm caused by unethical business practices or mass manipulation is not reliant on outright lies but rather on ruthless optimisation. A lot also comes from cases where heavy manipulation of facts in a technically not-false way is strongly incentivised.
- No strong AI rights — Agree de jure, disagree de facto, and partly this is contingent on the relative speeds of research and proliferation. Mostly I think there will be incentives and competitive pressures to give AIs decision making power, and that oversight will be hard, and that much of the harm will be emergent from complex interactions. I also think it’s maybe interesting to reflect on Ryan and Kyle’s recent piece about paying AIs to reveal misalignment — locally it makes sense, but globally I wonder what happens if we do a lot of that in the next few years.
- No hot global war — Agree, also not a crux for me, although I think the prospect of war might generate pretty bad incentives and competitive pressures.
Overall, I think I can picture worlds where (conditional on no takeover) we reach states of pretty serious disempowerment of the kind described in the paper, without any of these assumptions fully breaking down. That said, I expect AI rights to be the most important, and the one that starts breaking down first.
As for the feedback loops you mention:
- Owners of capital are aware of the consequences of their actions —
  - I think this is already sort of not true: I have very little sense of what my current stocks are doing, and my impression is many CEOs don’t really understand most of what’s going on in their companies. Maybe AIs are naturally easier to oversee in a way that helps, or maybe they operate at unprecedented scale and speed; overall I’m unsure which way this cuts.
  - But also, I expect that the most important consequences of labor displacement by AIs are (1) displacement of human decision making, including over capital allocation and (2) distribution of economic power across humans, and resultant political incentives.
  - On top of that, I think a lot of the economic badness will be about aggregate competitive effects, rather than individual obviously-bad actions. If an individual CEO notices their company is doing something bad to stay competitive, which other companies are also doing, then stopping the badness in the world is a lot harder than shutting down a department.
- Politicians stop systems from doing obviously bad things —
  - I also think this is not currently totally true; there is definitely a sense in which some politicians already do not change systems that have bad consequences (terrible material conditions for citizens, at least), partly because they themselves are beholden to some pretty unfortunate incentives. There are bad equilibria within parties, between parties, and between states.
  - I also think that the mechanisms which keep countries friendly to citizens specifically are pretty fragile and contingent.
  - So essentially, I think the standard of ‘obviously bad’ might not actually be enough.
- Cultural consumption —
  - Here I am confused about where we’re disagreeing, and I think I don’t understand what you’re saying. I’m not sure why people being able to choose the culture they consume would help, and I don’t think it’s something we assume in the paper.
I hope this sheds some light on things!

Gradual Disempowerment: Systemic Existential Risks from Incremental AI Development

Jan_Kulveit, Raymond Douglas, Nora_Ammann, Deger Turan, David Scott Krueger (formerly: capybaralet) and David Duvenaud

Jan 30, 2025, 5:03 PM

162 points

58 comments, 2 min read, LW link

(gradual-disempowerment.ai)

What does success look like?

Raymond Douglas Jan 23, 2025, 5:48 PM

11 points

0 comments, 3 min read, LW link

Raymond Douglas Nov 19, 2024, 11:50 PM
5 points
0
in reply to: Ksteel’s comment on: The Choice Transition
The writing here was definitely influenced by Lewis (we quote TAoM in footnote 6), although I think the Choice Transition is broader and less categorically negative.
For instance in Lewis’s criticism of the potential abolition he writes things like:
The old dealt with its pupils as grown birds deal with young birds when they teach them to fly; the new deals with them more as the poultry-keeper deals with young birds— making them thus or thus for purposes of which the birds know nothing. In a word, the old was a kind of propagation—men transmitting manhood to men; the new is merely propaganda.
The Choice Transition as we’re describing it is consistent with either of these approaches. There needn’t be any ruling minority, nor do we assume humans can perfectly control future humans, just that they (or any other dominant power) can appropriately steer emergent inter-human dynamics (if there are still humans).

The Choice Transition

owencb and Raymond Douglas

Nov 18, 2024, 12:30 PM

50 points

4 comments, 15 min read, LW link

(strangecities.substack.com)

Raymond Douglas Oct 22, 2024, 1:05 AM
LW: 2 AF: 2
0
AF
on: Automation collapse
Could you expand on what you mean by ‘less automation’? I’m taking it to mean some combination of ‘bounding the space of controller actions more’, ‘automating fewer levels of optimisation’, ‘more of the work done by humans’ and maybe ‘only automating easier tasks’ but I can’t quite tell which of these you’re intending or how they fit together.
(Also, am I correctly reading an implicit assumption here that any attempts to do automated research would be classed as ‘automated ai safety’?)

Raymond Douglas Oct 7, 2024, 12:20 AM
26 points
1
on: Why I’m not a Bayesian
When I read this post I feel like I’m seeing four different strands bundled together:
1. Truth-of-beliefs as fuzzy or not
2. Models versus propositions
3. Bayesianism as not providing an account of how you generate new hypotheses/models
4. How people can (fail to) communicate with each other
I think you hit the nail on the head with (2) and am mostly sold on (4), but am sceptical of (1) - similar to what several others have said, it seems to me like these problems don’t appear when your beliefs are about expected observations, and only appear when you start to invoke categories that you can’t ground as clusters in a hierarchical model.
That leaves me with mixed feelings about (3):
- It definitely seems true and significant that you can get into a mess by communicating specific predictions relative to your own categories/definitions/contexts without making those sufficiently precise
- I am inclined to agree that this is a particularly important feature of why talking about AI/x-risk is hard
- It’s not obvious to me that what you’ve said above actually justifies Knightian uncertainty (as opposed to infrabayesianism or something), or the claim that you can’t be confident about superintelligence (although it might be true for other reasons)

Raymond Douglas Jul 31, 2024, 2:09 PM
14 points
4
in reply to: edbs’s comment on: Decomposing Agency — capabilities without desires
Strongly agree that active inference is underrated both in general and specifically for intuitions about agency.
I think the literature does suffer from ambiguity over where it’s descriptive (ie an agent will probably approximate a free energy minimiser) vs prescriptive (ie the right way to build agents is free energy minimisation, and anything that isn’t that isn’t an agent). I am also not aware of good work on tying active inference to tool use—if you know of any, I’d be pretty curious.
I think the viability thing is maybe slightly fraught—I expect it’s mainly for anthropic reasons that we mostly encounter agents that have adapted to basically independently and reliably preserve their causal boundaries, but this is always connected to the type of environment they find themselves in.
For example, active inference points to ways we could accidentally build misaligned optimisers that cause harm—chaining an oracle to an actuator to make a system trying to do homeostasis in some domain (like content recommendation) could, with sufficient optimisation power, create all kinds of weird and harmful distortions. But such a system wouldn’t need to have any drive for boundary preservation, or even much situational awareness.
So essentially an agent could conceivably persist for totally different reasons, we just tend not to encounter such agents, and this is exactly the kind of place where AI might change the dynamics a lot.
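The oracle-plus-actuator setup above can be sketched as a toy loop. Everything here is an assumption made for illustration (the linear `oracle`, the fixed action set, and a world that exactly follows the oracle's forecast); the point is only that the loop steers a metric towards a setpoint without containing any self-model or drive to preserve itself.

```python
from typing import Callable, List, Sequence


def oracle(state: float, action: float) -> float:
    """Hypothetical predictor: forecast the next value of some metric."""
    return 0.9 * state + action


def actuator_step(
    state: float,
    setpoint: float,
    actions: Sequence[float],
    predict: Callable[[float, float], float],
) -> float:
    """Pick the action whose predicted outcome lands closest to the setpoint."""
    return min(actions, key=lambda a: abs(predict(state, a) - setpoint))


def run(state: float, setpoint: float, steps: int) -> List[float]:
    """Chain oracle and actuator; returns the trajectory of the metric."""
    actions = [-1.0, -0.5, 0.0, 0.5, 1.0]
    history = [state]
    for _ in range(steps):
        action = actuator_step(state, setpoint, actions, oracle)
        # Assumption: the world evolves exactly as the oracle predicts.
        state = oracle(state, action)
        history.append(state)
    return history


trajectory = run(state=0.0, setpoint=5.0, steps=20)
print(trajectory[-1])
```

Neither function refers to the system itself or its own persistence; the homeostatic behaviour comes entirely from the wiring between prediction and action.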

Raymond Douglas Jul 30, 2024, 8:05 PM
3 points
0
in reply to: Oliver Sourbut’s comment on: Decomposing Agency — capabilities without desires
Interesting! I think one of the biggest things we gloss over in the piece is how perception fits into the picture, and this seems like a pretty relevant point. In general the space of ‘things that give situational awareness’ seems pretty broad and ripe for analysis.
I also wonder how much efficiency gets lost by decoupling observation and understanding—at least in humans, it seems like we have a kind of hierarchical perception where our subjective experience of ‘looking at’ something has already gone through a few layers of interpretation, giving us basically no unadulterated visual observation, presumably because this is more efficient (maybe in particular faster?).

Raymond Douglas Jul 29, 2024, 4:23 PM
2 points
2
in reply to: habryka’s comment on: Decomposing Agency — capabilities without desires
I’d be pretty curious to hear about your disagreements if you’re willing to share

Raymond Douglas Jul 17, 2024, 9:10 AM
5 points
4
in reply to: quila’s comment on: Decomposing Agency — capabilities without desires
This seems like a misunderstanding / not my intent. (Could you maybe quote the part that gave you this impression?)

I believe Dusan was trying to say that davidad’s agenda limits the planner AI to only writing provable mathematical solutions. To expand, I believe that compared to what you briefly describe, the idea in davidad’s agenda is that you don’t try to build a planner that’s definitely inner aligned, you simply have a formal verification system that ~guarantees what effects a plan will and won’t have if implemented.

Decomposing Agency — capabilities without desires

owencb and Raymond Douglas

Jul 11, 2024, 9:38 AM

153 points

32 comments, 12 min read, LW link

(strangecities.substack.com)

Raymond Douglas Mar 24, 2024, 10:07 PM
34 points
8
in reply to: Kaj_Sotala’s comment on: ChatGPT can learn indirect control
Oh interesting! I just had a go at testing it on screenshots from a parallel conversation and it seems like it incorrectly interprets those screenshots as also being of its own conversation.
So it seems like ‘recognising things it has said’ is doing very little of the heavy lifting and ‘recognising its own name’ is responsible for most of the effect.
I’ll have a bit more of a play around and probably put a disclaimer at the top of the post some time soon.
