Fast-follow on frontier capabilities; compete on useful applications and convenient integration. This works all the way up to recursive self-improvement or superintelligence, at which point none of this matters.
Lobby for regulation that slows down everyone. Justify this practice to your stockholders by pointing out that your competitive advantage is in applications and so fair regulation hits your competitors harder.
Publish scary demos and evals. Show them in the context of your competitors’ models first, with an acknowledgement that the issues also apply to your products. This won’t change anything directly, but it will be useful ammunition for external activists who can push for things you can’t.
Encourage (or fail to effectively discourage) union organizing among your employees so that you are not bound to hyper-optimizing for stockholder short-term ROI. Maintain a collaborative relationship with union leadership, since you both have an interest in the company’s success. There is no guarantee that union leadership will be as safety conscious as you are, but it is impossible for them to be less safety conscious than stockholders, so this is a net win regardless.
Write a list of voluntary commitments, not just for slowing down but also regarding lobbying, based on what you would like to see if you had full consensus with your significant competitors. Include effective measures for monitoring and enforcement. Don’t worry about competitiveness; that doesn’t matter because none of this is binding until a critical mass has signed on. When it looks good, circulate the agreement to your competitors and invite their feedback, as long as that feedback remains within the spirit of the agreement. Let them actually reject the agreement rather than just assuming that they will. Maintain a paper trail. If possible, publicize any rejection and use it against them.
WillPetillo
In Emergence of Simulators and Agents, my AISC collaborators and I suggested that whether consequentialist or simulator-like cognition (which one could describe as a subcategory of process-based reasoning) emerges depends critically on environmental and training conditions, particularly the “feedback gap” (the delay, uncertainty, or inference depth between action and feedback). Large feedback gaps select for instrumental reasoning and power-seeking; small feedback gaps select for imitation and compression. As examples, LLMs are trained primarily via SSL (minimal feedback gap) and display predominantly simulator-like behavior, whereas RL-trained AlphaZero is clearly agentic.
The dynamic you describe of patterns steering toward states where they have more steering capacity outcompeting other patterns is real, but may be context dependent. If so, CCCT requires both: (1) the conditions for consequentialist reasoning being advantageous being inevitable and (2) consequentialism being inevitable given those conditions.
Claim 1, regarding conditions, is the part that needs defending. The “consequentialism is inevitable” argument requires showing either:
1. Market/competitive forces will inevitably push toward large-feedback-gap deployments (agentic AI doing long-horizon tasks), or
2. Even in small-feedback-gap contexts, consequentialist subpatterns will somehow emerge and take over.
Without establishing one of these (1 seems plausible to me, but that’s an intuitive claim), the convergence thesis describes a risk contingent on our choices, not an inevitability. Of course, process-based reasoning is not the same as “safe” by any means, but that shifts the terrain of the argument.
Setting prevalence aside and taking your case study as representative of some subset, there are some other things that might be going on.
First, a desire to have someone else initiate maps to the Allowing quadrant of the Wheel of Consent, which minimizes effort while maximizing feeling desired. That said, true Allowing should still be compatible with giving clear responses, so this doesn’t by itself explain the aversion you are seeing.
Second, emotional reactions follow the pattern: event ⇒ meaning (via priors) ⇒ affect ⇒ narrative. Suppose this woman holds strongly negative priors about men’s motivations. A consent request is not simply coordination, it’s an implicit demand for legibility. But if she sees the interaction as inherently adversarial, that’s giving you leverage. And if you do all the right things, that can be perceived as just more manipulation.
Now consider the internal conflict. She feels good about you initiating, then has a negative reaction to the consent request...while also consciously endorsing the belief that asking for consent is a Good Thing. Add the background tension of wanting to interact with men while viewing them as partially adversarial...and social advice to “trust your intuition” combined with long-term dissatisfaction with her relationship status and wanting to change it. That’s substantial cognitive dissonance with no widely shared conceptual handles. Hence the shutdown.
So the behavior you describe may be better explained by Allowing plus aversion to legibility (under distrust), rather than by a desire for nonconsent.
Other, non-substantive notes:
- LessWrong may have high decoupling norms, but on charged topics like this, disclaimers may help prevent contextualizers from inferring views you likely don’t endorse.
- Watch for selection effects! Women who give clear signals and are comfortable with explicit consent often pair off quickly. The women who remain visible in dating contexts, and thus command more of your attention, are disproportionately those who communicate more ambiguously.
I haven’t participated in CFAR workshops; I’m working from your written posts and my experience in other communities, so take these comments with the appropriate grains of salt.
I read the shift you describe as being from getting people on board with a largely preset agenda towards a more collaborative frame (and that the results have been promising so far). I wonder if the underlying issue here may be format-intent alignment. If CFAR has specific rationality techniques it believes are valuable and wants to teach, there’s an inherent directiveness to that goal. Workshop/discussion formats, however, signal exploration and reciprocity. When the intent is directional but the format signals collaboration, participants can experience a disorienting mismatch.
I’m curious whether CFAR has considered hybrid formats that are explicit about which parts are directional and which are truly collaborative, or if you are leaning entirely on shifting intentions to fit the existing format—even if that means letting conversations drift away from anything adjacent to rationality or x-risk. I don’t think there’s an inherently right answer here, but understanding where you are on the spectrum and being transparent about this choice could help participants set their expectations, or self-filter regarding whether CFAR is a good fit for them at all.
Seeing this as a spectrum rather than a binary seems important because it prevents participants from running into “invisible” restrictions. Mutual learning and meeting people where they are at are great, but it doesn’t seem realistic to try to be all things to all people. It’s therefore important to be ready to say “this is what we offer, these are our constraints, respond as you will” even if where you think it is best to draw those lines has a wide range of valid answers. Sometimes a person is incompatible with an organization and that doesn’t have to be anyone’s fault.
A second, arguably more complex and charged dynamic I am wondering how you intend to navigate is responsibility for managing participant capacity. Is CFAR explicitly working to build participants’ ability to recognize misalignment, maintain their sense of agency while in seemingly asymmetric power dynamics, and advocate for their needs? Or is the plan for CFAR to create sufficiently careful environments that participants won’t need those skills? Or to screen applicants for having the capacity for self-advocacy already? Your post makes it sound like you are primarily taking the second approach, which is fine, but I am wondering how you are assessing the trade-offs involved.
For transparency, I personally believe in building capacity for self-advocacy and agency, with community care supporting that development and acting as a fallback when capacity isn’t present, but I recognize that different approaches work for different purposes/people. What I am pushing for here is an explicit articulation of the balance you are striking so that participants can make a fully informed choice as to whether they wish to join.
In any case, I appreciate the thoughtfulness of your post and willingness to share CFAR’s evolution publicly.
What is the legibility status of the problem of requiring problems to be legible before allowing them to inform decisions? The thing I am most concerned about wrt AI is our societal-level filters for what counts as a “real problem.”
My takeaway from this post is that there are several properties of relating that people expect to converge, but in your case (and in some contexts) don’t. With empathy, there’s:
1. Depth of understanding of the other person’s experience
2. Negative judgment
3. Mirroring
I mention 3 because I think it’s strictly closer to the definition of empathy than 1, but it’s mostly irrelevant to this post. If I had this kind of empathy for the woman in the video, I’d be thinking: “man, my head hurts.”
The common narrative is that as 1 increases, 2 drops to zero, or even becomes positive judgment. This is probably true sometimes, such as when counteracting the fundamental attribution error, but sometimes not: “This person isn’t getting their work done, that’s somewhat annoying...oh, it’s because they don’t care about their education? Gaaahhh!!!” I can relate to this.
Regarding relating better without lowering standards, the questions that come to my mind are:
1. Is this a case where things have to get worse before they get better? As in, zero understanding leads to low judgment with suspension of disbelief, motivational understanding leads to high judgment, but full-story understanding returns to low judgment without relying on suspension of disbelief. Is there a way to test this without driving yourself crazy or taking up an inordinate amount of time?
2. Can you dissolve your moral judgment while keeping understanding constant? That is: “this teammate isn’t doing their share of the work because they didn’t care enough to be prepared...and this isn’t a thing I need to be angry about.” If this route looks interesting, my suggestion for the first step of the path is to introspect on the anger/disgust/etc. and what it’s protecting.
This is a useful application of a probability map! If an important term has multiple competing definitions, create nodes for all of them, link the ones you consider important to a central p(doom) node (assuming you are interested in that concept), and let other people disagree with your assessment, but with a clearer sense of what they specifically disagree about.
The basic contention here seems to be that the biggest danger of LLMs comes not from the systems themselves, but from the overreliance, excessive trust, etc. that societies and institutions place in them. Another is that “hyping LLMs” (which I assume includes folks here expressing concerns that AI will go rogue and take over the world) increases perceptions of AI’s abilities, which feeds into this overreliance. A conclusion is that promoting “x-risk” as a reason for pausing AI will have the unintended side effect of increasing (catastrophic, but not existential) dangers associated with overreliance.
This is an interesting idea, not least because it’s a common intuition among the “AI Ethics” faction, and therefore worth hashing out. Here are my reasons for skepticism:
1. The hype that matters comes from large-scale investors (and military officers) trying to get in on the next big thing. I assume these folks are paying more attention to corporate sales pitches than Internet Academics and people holding protest signs—and that their background point of reference is not Terminator, but the FOMO common in the tech industry (which makes sense in a context where losing market share is a bigger threat than losing investment dollars).
2. X-risk scenarios are admittedly less intuitive in the context of LLMs based on self-supervised learning than they were back when reinforcement learning was at the center of development, as AI learned to play increasingly broad ranges of games. These systems regularly specification-gamed their environments and it was chilling to think about what would happen when a system could treat the entire world as a game. A concern now, however, is that agency will make a comeback because it is economically useful. Imagine the brutal, creative effectiveness of RL combined with the broad-based common sense of SSL. This reintegration of agency (can’t speak to the specific architecture) into leading AI systems is what the tech companies are actively developing towards. More on this concept in my Simulators sequence.
I, for one, will find your argument more compelling if you (1) take a deep dive into AI development motivations, rather than just lumping it all together as “hype”, and (2) explain why AI development stops with the current paradigm of LLM-fueled chatbots or something similarly innocuous in itself but potentially dangerous in the context of societal overreliance.
The motivation of this post was to design a thought experiment involving a fully self-sufficient machine ecology that remains within constraints designed to benefit something outside of the system, not as a suggestion for how to make best use of the moon.
Agree, when discussing the alignment of simulators in this post, we are referring to safety from the subset of dangers related to unbounded optimization towards alien goals, which does not include everything within value alignment, let alone AI safety. But this qualification points to a subtle meaning drift in use of the word “alignment” in this post (towards something like “comprehension and internalization of human values”) which isn’t good practice and something I’ll want to figure out how to edit/fix soon.
I am having difficulty seeing why anyone would regard these two viewpoints as opposed.
We discuss this indirectly in the first post in this sequence outlining what it means to describe a system through the lens of an agent, tool, or simulator. Yes, the concepts overlap, but there is nonetheless a kind of tension between them. In the case of agent vs. simulator, our central question is: which property is “driving the bus” with respect to the system’s behavior, utilizing the other in its service?
The second post explores the implications of the above distinction, predicting different types of values, and thus behavior, from an agent that contains a simulation of the world and uses it to navigate vs. a simulator that generates agents because such agents are part of the environment the system is modelling vs. a system where the modes are so entangled it is meaningless to even talk about where one ends and the other begins. Specifically, I would expect simulator-first systems to have wide value boundaries that internalize (an approximation of) human values, but more narrow, maximizing behavior from agent-first systems.
It seems to me that the most robust solution is to do it the hard way: know the people involved really well, both directly and via reputation among people you also know really well—ideally by having lived with them in a small community for a few decades.
Selection bias. Those of us who were inclined to consider working on outreach and governance have joined groups like PauseAI, StopAI, and other orgs. A few of us reach back on occasion to say “Come on in, the water’s fine!” The real head-scratcher for me is the lack of engagement on this topic. If one wants to deliberate on a much higher level of detail than the average person, cool—it takes all kinds to make a world. But come on, this is obviously high stakes enough to merit attention.
Thanks for the link! It’s important to distinguish here between:
(1) support for the movement,
(2) support for the cause, and
(3) active support for the movement (i.e. attracting other activists to show up at future demonstrations)
Most of the paper focuses on 1, and also on activists’ beliefs about the impact of their actions. I am more interested in 2 and 3. To be fair, the paper gives some evidence for detrimental impacts on 2 in the Trump example. It’s not clear, however, whether the nature of the cause matters here. Support for Trump is highly polarized and entangled with culture, whereas global warming (Hallam’s cause) and AI risk (PauseAI’s) have relatively broad but frustratingly lukewarm public support. There are also many other factors when looking past short-term onlooker sentiment to the larger question of effecting social change, which the paper readily admits in the Discussion section. I’d list these points, but they largely overlap with the points I made in my post...though it was interesting to see how much was speculative. More research is needed.
In any case, I bring up the extreme case to illustrate that the issue is far more nuanced than “regular people get squeamish, so net negative!” This is actually somewhat irrelevant to PauseAI in particular, because most of our actions are around public education and lobbying, and even the protests are legal and non-disruptive. I’ve been in two myself and have seen nothing but positive sentiment from onlookers (with the exception of the occasional “good luck with that!” snark). The hard part with all of these is getting people to show up. (This last paragraph is not a rebuttal to anything you have said; it’s a reminder of context.)
My conclusion is an admittedly weaksauce non-argument, included primarily to prevent misinterpretation of my actual beliefs. I am working on a rebuttal, but it’s taking longer than I planned. For now, see: Holly Elmore’s case for AI Safety Advocacy to the Public.
I want to push harder on Q33: “Isn’t goal agnosticism pretty fragile? Aren’t there strong pressures pushing anything tool-like towards more direct agency?”
In particular, the answer: “Being unable to specify a sufficiently precise goal to get your desired behavior out of an optimizer isn’t merely dangerous, it’s useless!” seems true to some degree, but incomplete. Let’s use a specific hypothetical of a stock-trading company employing an AI system to maximize profits. They want the system to be agentic because this takes the humans out of the loop on actually getting profits, but also understand that there is a risk that the system will discover unexpected/undesired methods of achieving its goals like insider trading. There are a couple of core problems:
1. Externalized Cost: if the system can cover its tracks well enough that the company doesn’t suffer any legal consequences for its illegal behavior, then the effects of insider trading on the market are “somebody else’s problem.”
2. Irreversible Mistake: if the company is overly optimistic about their ability to control their system, doesn’t understand the risks, etc. then they might use it despite regretting this decision later. On a large scale, this might be self-correcting if some companies have problems with AI agents and this gives the latter a bad reputation, but that assumes there are lots of small problems before a big one.
Glad to hear it! If you want more detail, feel free to come by the Discord Server or send me a Direct Message. I run the welcome meetings for new members and am always happy to describe aspects of the org’s methodology that aren’t obvious from the outside and can also connect you with members who have done a lot more on-the-ground protesting and flyering than I have.
As someone who got into this without much prior experience in activism, I was surprised by how many subtleties and counterintuitive best practices there are, most of which are learned through direct experience combined with direct mentorship, as opposed to written down & formalized. I made an attempt to synthesize many of the core ideas in this video. It’s from a year ago, and looking over it there is quite a bit I would change (spend less time on some philosophical ideas, add more detail re specific methods), but it mostly holds up OK.
If you want to get an informed opinion on how the general public perceives PauseAI, get a t-shirt and hand out some flyers in a high foot-traffic public space. If you want to be formal about it, bring a clipboard, track whatever seems interesting in advance, and share your results. It might not be publishable on an academic forum, but you could do it next week.
Here’s what I expect you to find, based on my own experience and the reports of basically everyone who has done this:
- No one likes flyers, but people get a lot more interested if you can catch their attention enough to say it’s about AI.
- Everyone hates AI.
- Your biggest initial skepticism will be from people who think you are in favor of AI.
- Your biggest actual pushback will be from people who think that social change is impossible.
- Roughly 1⁄4 to 1⁄2 are amenable to (or have already heard about!) x-risk, most of the rest won’t actively disagree but you can tell that particular message is not really “landing” and pay a lot more attention if you talk about something else (unemployment, military applications, deepfakes, etc.)
- Bring a clipboard for signups. Even if recruitment isn’t your goal, if you don’t have one you’ll feel unprepared when people ask about it.
Also, protests are about Overton-window shifting, making AI danger a thing that is acceptable to talk about. And even if it makes a specific org look “fringe” (not a given, as Holly has argued), that isn’t necessarily a bad thing for the underlying cause. For example, if I see an XR protest, my thought is (well, was before I knew the underlying methodology): “Ugh, those protestors...I mean, I like what they are fighting for and more really needs to be done, but I don’t like the way they go about it.” Notice that middle part. Activation of a sympathetic but passive audience was the point. That’s a win from their perspective. And the people who are put off by the methods then go on to (be more likely to) join allied organizations that believe the same things but use more moderate tactics. The even bigger win is when the enthusiasm catches the attention of people who want to be involved but are looking for orgs that are the “real deal,” as measured by willingness to put effort where their words are.
Before jumping into critique, the good:
- Kudos to Ben Pace for seeking out and actively engaging with contrary viewpoints
- The outline of the x-risk argument and history of the AI safety movement seem generally factually accurate
The author of the article makes quite a few claims about the details of PauseAI’s proposal, its political implications, the motivations of its members and leaders...all without actually joining the public Discord server, participating in the open Q&A new member welcome meetings (I know this because I host them), or even showing evidence of spending more than 10 minutes on the website. All of these basic research opportunities were readily available and would have taken far less time than was spent on writing the article. This tells you everything you need to know about the author’s integrity, motivations, and trustworthiness.
That said, the article raises an important question: “buy time for what?” The short answer is: “the real value of a Pause is the coordination we get along the way.” Something as big as an international treaty doesn’t just drop out of the sky because some powerful force emerged and made it happen against everyone else’s will. Think about the end goal and work backwards:
1) An international treaty requires
2) Provisions for monitoring and enforcement,
3) Negotiated between nations,
4) Each of whom genuinely buys in to the underlying need
5) And is politically capable of acting on that interest because it represents the interests of their constituents
6) Because the general public understands AI and its implications enough to care about it
7) And feels empowered to express that concern through an accessible democratic process
8) And is correct in this sense of empowerment because their interests are not overridden by Big Tech lobbying
9) Or distracted into incoherence by internal divisions and polarization
An organization like PauseAI can only have one “banner” ask (1), but (2-9) are instrumentally necessary—and if those were in place, I don’t think it’s at all unreasonable to assume society would be in a better position to navigate AI risk.
Side note: my objection to the term “doomer” is that it implies a belief that humanity will fail to coordinate, solve alignment in time, or be saved by any other means, and thus will actually be killed off by AI—which seems like it deserves a distinct category from those who simply believe that the risk of extinction by default is real.
When you call someone naive, think about what you’re implying: “You have a pattern of being systematically wrong about things, in the direction of excessive innocence and simplicity, and the thing you are saying sounds consistent with that bias, so I’m not going to consider your argument enough to reply to it directly.” How much of that can you stand behind as true, and see as necessary for the person to hear, in the context of them taking the time to give you feedback?