paulfchristiano

Karma: 28,836

paulfchristiano 17 Jun 2026 15:26 UTC
LW: 2 AF: 2
0
AF
in reply to: James Payor’s comment on: Announcing the ARC White-Box Estimation Challenge
I believe that all the issues you raised have been fixed, let us know if that’s not the case.

paulfchristiano 13 Jun 2026 19:15 UTC
27 points
2
in reply to: Yair Halberstadt’s comment on: PSA: Almost nobody is directly working on superintelligent alignment
ARC works on theoretical alignment, but I think it’s reasonably clear how ARC’s work fits in with the current LLM paradigm (just apply our methods to LLMs!). It’s just an ambitious foundational project that has a very different risk/reward profile from more incremental work. More incremental work naturally looks more and more exciting over time as AI improves (since the real-world problems get closer and closer to your long-run concerns).
There was an assumption 5 years ago that AGI requires fundamental insights into intelligence.
I think a lot of people who have been working on theoretical alignment have long believed that LLMs could scale to AGI. Here’s me from 10 years ago:
It now seems possible that we could build “prosaic” AGI, which can replicate human behavior but doesn’t involve qualitatively new ideas about “how intelligence works:”
- It’s plausible that a large neural network can replicate “fast” human cognition, and that by coupling it to simple computational mechanisms — short and long-term memory, attention, etc. — we could obtain a human-level computational architecture.
- It’s plausible that a variant of RL can train this architecture to actually implement human-level cognition. This would likely involve some combination of ingredients like model-based RL, imitation learning, or hierarchical RL.
[...]
I think that prosaic AGI should probably be the largest focus of current research on alignment.

paulfchristiano 12 Jun 2026 19:26 UTC
94 points
27
on: PSA: Almost nobody is directly working on superintelligent alignment
There are many people working on aligning existing AI systems and understanding the alignment of existing AI systems. In my opinion many of the techniques and lessons from that work are likely to remain important for aligning increasingly powerful AI systems. For example, I care a lot about techniques for scalable supervision, broader scientific understanding of generalization (e.g. for honesty or reward hacking), mechanistic interpretability, behavioral red teaming, etc. In aggregate there are probably a few hundred people working on research that I’d classify as “direct alignment work,” which is fewer than you might think but still far from “almost nobody.”
There has been much more investment in incremental improvements than indefinitely scalable methods, motivated partly by an increasing interest in aligning slightly superhuman systems and then “building the plane as we fly it.” I think it would be a definitional stretch to say that more incremental work doesn’t count as alignment, whether because you think it won’t scale indefinitely or because it could end up just training models that are more robustly incentivized to do what humans want them to do.
(There has also been a big increase in methods for detecting and mitigating potential misalignment relative to building aligned AI systems, motivated partly by a belief that it will be easier to improve alignment once we have better examples of misalignment. I think it’s reasonable to make some distinction between that kind of research and efforts to directly make AI systems more aligned, though I think it’s counterintuitive and unnecessarily confrontational to say those people aren’t “working on alignment.” If you asked me what is the best way to “work on alignment” I might suggest developing model organisms of misalignment, and I think it’s generally very plausible that better scientific understanding and better measurement tools are most of the action.)
I think a lot of people on this site are dismissive of all of those more incremental efforts and would say that “production alignment” is unrelated to the problem of aligning superhuman AI. I’ve spent a lot of time engaging with this community and find the standard arguments unpersuasive. I think I have a deep understanding of the conceptual difficulties for scalable alignment methods and in my view ARC is doing some of the most promising work for addressing those difficulties. But even understanding all of that I still think it would be a huge mistake to conclude that the incremental progress most people are investing in doesn’t help.
I do think it’s great for people to make investments in foundations and indefinitely scalable methods. There’s a real chance that other methods will break down during an intelligence explosion, or that they will just never work particularly well. And I do think that machine learning researchers, policy makers, and the EA community are predictably underrating more scalable and foundational work. I just think the OP is a significant overstatement reflecting some significant unspoken assumptions.

paulfchristiano 4 Jun 2026 15:00 UTC
5 points
1
in reply to: phoenix’s comment on: Announcing the ARC White-Box Estimation Challenge
I think the only difference between the phases is how locked in the rules are. During the warm-up round we may make big changes while we uncover bugs. In phase 1 we may make smaller changes. In phase 2 we won’t make changes so that people know how they are being graded.
(AICrowd has run a lot of competitions and ARC has run none so we just handed it off to them and are letting them make these kinds of decisions.)

paulfchristiano 4 Jun 2026 14:03 UTC
2 points
0
in reply to: Daniel Tan’s comment on: Announcing the ARC White-Box Estimation Challenge
Your score is the budget-adjusted MSE. You can see how it looks on the leaderboard, where the rank is by “adjusted score” which is 0.1-1x the “final layer MSE.”
scoring-model.md is wrong or at least confusing, I’ll pass on the feedback. (I think maybe that’s because that document is just describing the “scoring model,” which is the input into the leaderboard score, but regardless it’s a misleading way to describe what’s going on.)
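A minimal sketch of the relationship described above, assuming only what is stated here: the adjusted score is the final layer MSE scaled by a factor between 0.1 and 1. How that factor is derived from the compute budget is not specified in this comment, so the hypothetical `budget_multiplier` argument is simply taken as an input rather than computed.

```python
def adjusted_score(final_layer_mse: float, budget_multiplier: float) -> float:
    """Scale the final layer MSE by a budget-dependent factor.

    `budget_multiplier` is a hypothetical name; this sketch does not model
    how the challenge derives it from the budget, only its stated range.
    """
    if final_layer_mse < 0:
        raise ValueError("MSE cannot be negative")
    if not 0.1 <= budget_multiplier <= 1.0:
        raise ValueError("multiplier is described as lying between 0.1x and 1x")
    return budget_multiplier * final_layer_mse
```

Under this reading, two submissions with the same final layer MSE can differ by up to a factor of ten in adjusted score.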

paulfchristiano 3 Jun 2026 16:25 UTC
13 points
2
in reply to: Guillaume Martres’s comment on: Announcing the ARC White-Box Estimation Challenge
I definitely agree that trained models are the real goal.
One view we hold is that all the complexities and messiness of randomly initialized models are still present in trained models—just combined with a lot of richer structure, different sets of features that are relevant, and so on. Moreover, we think that current algorithms for randomly initialized networks are not yet good enough that you should expect it to be possible to analyze trained models in a convincing way, except perhaps in the extremely overtrained regime where the model converges to a highly simplified algorithm.
We think there probably do exist adequate algorithms for analyzing random networks and the reason we haven’t found them is that so little research has addressed the question head on. And if that’s not possible we want to learn sooner and give up on the project.
In my opinion you can see this issue a lot in mechanistic interpretability research. Often at some point an investigation into a particular behavior needs to throw up its hands and say “and then a bunch of messy and complicated stuff happens” and you are left wondering whether you understand the phenomenon or whether the important action is in all the messy details. (Except in the super overtrained setting.)
I think that interpretability research would be more promising if we had a really solid understanding of “random messy stuff,” such that you could say “yes we understand the randomly initialized model” and then focus on whether you are able to keep handling new structure as it gets introduced by training. Rather than starting off confused and hoping that you can catch some glimpses of what training is doing despite the background confusion.
(Of course in this comment I’m using a much more limited notion of “understanding” than interpretability researchers typically have in mind—I just want algorithms that are able to leverage a mechanistic analysis of the weights to answer concrete questions about the model. We hope that’s a productive substitution that can drive much faster progress and help us find scalable algorithms, and we think that this more limited sense would still be sufficient to solve alignment.)

paulfchristiano 26 Apr 2026 19:36 UTC
LW: 2 AF: 2
0
AF
in reply to: JesseClifton’s comment on: Lukas Finnveden’s Shortform
If “managing the news” just means “making a decision in situation X such that you are glad to hear the news that you made that decision in situation X,” then I agree that’s a description of EDT. I think it’s a priori reasonable to manage news about what you decide to do, so I don’t see this as a fundamental reason that EDT is problematic. I usually associate the phrase with various intuitive mistakes that EDT might make, and then I want to discuss concrete cases (like Lukas’) in which it appears an agent did something wrong.

paulfchristiano 26 Apr 2026 15:55 UTC
LW: 3 AF: 2
0
AF
in reply to: JesseClifton’s comment on: Lukas Finnveden’s Shortform
Are there cases where EDT manages the news for reasons other than anthropic updating? I’m not aware of any, and if not then this is exactly because of the interaction of EDT and anthropic updating.

paulfchristiano 26 Apr 2026 15:51 UTC
LW: 4 AF: 3
2
AF
in reply to: Lukas Finnveden’s comment on: Lukas Finnveden’s Shortform
In your example, you know T1 and your counterpart knows T2. You see your behavior as correlated with the behavior of your counterpart. Under these conditions, it seems like T1 can’t possibly be so fundamental that you need it in order to do EDT-style reasoning (otherwise your counterpart couldn’t). So even if you grant that you need some beliefs to get EDT off the ground, it seems like those beliefs must be in the intersection of T1 and T2, in which case you wouldn’t run into this problem.
I think there are a lot of pretty basic open questions about how logically bounded minds work, especially if you want them to be philosophically principled all the way down. That’s confusing enough that it’s totally plausible that you’ll come out of it with some TDT-like story, where causality is a basic cognitive ingredient that makes the bootstrapping work at all. My guess is that in the end causality will just be a helpful language for talking about certain kinds of independence relationships rather than a fundamental cognitive building block or something with direct normative implications, but who knows.
(Note for observers: I’m not expecting us to resolve any of those questions as part of AI safety.)

paulfchristiano 15 Apr 2026 14:48 UTC
11 points
2
in reply to: papetoast’s comment on: Self-driving car bets
Waymo’s website says they are serving riders in 11 US cities (all of which are among the 50 largest).
Waymo feels very real in SF: it’s a product that’s easy to access, people use it because they like it rather than for novelty, it’s able to serve most routes within the city, and it has nearly as much market share as Uber (here’s some slightly stale data, I think by now its share is more like 25%). I don’t have direct visibility into the other 10 cities, but I think at least a few of them still have a waitlist. My current best guess is that we’ll hit my 2.5 year prediction a bit before 3 years (and probably hit my 3.5 year prediction a bit early).
Overall I think that self-driving deployment has been faster than most technical people expected in 2016. It was probably somewhere around my 20th percentile.

paulfchristiano 14 Apr 2026 17:52 UTC
4 points
2
in reply to: habryka’s comment on: No77e’s Shortform
When you wrote:
Ajeya from a few years ago also said things more like this (more point 1 and 3, less point 2)
I interpreted you as claiming that Ajeya had said “things more like:”
In general, people worried about AI risk should coordinate as much as possible to play down our concerns, so as not to look like alarmists. This is very important in order to build allies and accumulate political influence, so that we’re well-positioned to act if and when an important opportunity arises.
I don’t recall any examples of Ajeya saying or implying anything at all like that. I asked ChatGPT to try to find examples and I think it didn’t find anything.
In your ChatGPT session, a typical example it cites is:
In the AXRP discussion, she also says there were concerns that making the report seem too slick or official could increase capabilities interest.
I think those examples don’t meaningfully support the original claim, at least as a typical reader would understand it.

paulfchristiano 14 Apr 2026 16:03 UTC
17 points
1
in reply to: habryka’s comment on: No77e’s Shortform
Ajeya from a few years ago also said things more like this (more point 1 and 3, less point 2).
I don’t remember anything like this. I think it might be misremembered or a strained interpretation.
Here are points 1 and 3 for reference:
1. AI is going to become vastly superhuman in the near future; but being a good scientist means refusing to speculate about the potential novel risks this may pose. Instead, we should only expect risks that we can clearly see today, and that seem difficult to address today.
3. In general, people worried about AI risk should coordinate as much as possible to play down our concerns, so as not to look like alarmists. This is very important in order to build allies and accumulate political influence, so that we’re well-positioned to act if and when an important opportunity arises.
I asked ChatGPT to read bioanchors (where I thought this was most likely to occur), and then to read all of her other writings looking for anything that fits that mode. Here’s its reply, not finding anything.
The closest match it finds is that Ajeya often caveats her claims. For example from bio anchors:
This is a work in progress and does not represent Open Philanthropy’s institutional view […] Accordingly, we have not done an official publication or blog post, and would prefer for now that people not share it widely in a low bandwidth way.
I don’t think this matches points 1 or 3 well.

paulfchristiano 8 Apr 2026 15:27 UTC
81 points
30
in reply to: Daniel Kokotajlo’s comment on: Daniel Kokotajlo’s Shortform
Speaking for myself, I’d say we’ve ruled out the most pessimistic scenarios I was taking seriously 15 years ago. I’ve always thought alignment would probably be fine, but conditional on not being fine there was a reasonable chance we’d have seen serious problems by now and we haven’t. On balance I’m more pessimistic than I was back then, but that’s because we’ve ruled out many more of the most optimistic scenarios (back then it wasn’t even obvious we’d be training giant opaque neural network agents using RL, that was just a hypothetical scenario that seemed plausible and particularly worrying!).
If we want to go by Eliezer’s public writing rather than my self-reports, in 2012 he appeared to take some very pessimistic hypotheses seriously, including some that I would say are basically ruled out. For example see this exchange where he wrote:
It’s not unthinkable that a non-self-modifying superhuman planning Oracle could be developed with the further constraint that its thoughts are human-interpretable, or can be translated for human use without any algorithms that reason internally about what humans understand, but this would at the least be hard.

[...]

It turns out that getting an AI to understand human language is a very hard problem, and it may very well be that even though talking doesn’t feel like having a utility function, our brains are using consequential reasoning to do it. Certainly, when I write language, that feels like I’m being deliberate. It’s also worth noting that “What is the effect on X?” really means “What are the effects I care about on X?” and that there’s a large understanding-the-human’s-utility-function problem here. [...] Does the AI know that music is valuable? If not, will it not describe music-destruction as an “effect” of a plan which offers to free up large amounts of computer storage by, as it turns out, overwriting everyone’s music collection?
It seems like Eliezer is taking seriously the possibility that “describe plans and their effects to humans” requires the kind of consequentialism that might result in takeover, and that AI might be dangerous at a point when “understanding the human’s utility function” (in order to understand what effects are worth mentioning explicitly) is still a hard problem. Those look much less plausible now—we have AI systems that are superhuman in some respects and whose chains of thought are interpretable (for now) because they are anchored to cognitive demonstrations from humans rather than because of consequentialist reasoning about how to communicate with humans.
This isn’t to say that concern is discredited. Indeed today we have AI systems that clearly know about our preferences, but will ignore them when it’s the easiest way to get reward. Chain of thought monitorability is possible but on shaky ground. That said, I think we’re ruling out plenty of even worse scenarios.

paulfchristiano 8 Apr 2026 14:35 UTC
4 points
2
in reply to: Daniel Kokotajlo’s comment on: Daniel Kokotajlo’s Shortform
(Wei Dai was suggesting that current sentiment for Claude Code and Codex seemed to be comparable, in response to Vladimir Nesov mentioning “OpenAI’s current failure to have a strong offering similar to Claude Code.”)

paulfchristiano 8 Apr 2026 2:57 UTC
41 points
10
in reply to: Daniel Kokotajlo’s comment on: Daniel Kokotajlo’s Shortform
For what it’s worth I’ve personally found that Claude Code with Opus 4.6 and Codex with GPT 5.4 are very similar products. I haven’t done a very deep dive, but I’ve used them side by side for a few projects. Certainly the difference between them feels much smaller than the difference from models that are a few months older.

paulfchristiano 7 Apr 2026 4:18 UTC
LW: 10 AF: 8
0
AF
in reply to: ryan_greenblatt’s comment on: AIs can now often do massive easy-to-verify SWE tasks and I’ve updated towards shorter timelines
Maybe I should have made this clearer, but I’m not talking about scaling to arbitrary inference costs. Instead, I’m talking about scaling to inference costs that are a moderate fraction of the human task completion cost. (E.g., 1%-100% depending on the task.) I think you’d want to compare the AI performance at some inference budget to human labor with some limit in supply.
I agree. In the more realistic regime you are talking about, there is a more complicated quantitative question about how large the slowdowns are when tasks are decomposed into pieces of a given scale.
My main point was that for the tasks we are talking about here, the slowdowns seem like they might not be that large even for modest human time horizons. (In contrast with some of the crazy factored cognition stuff we have sometimes talked about, which involves much shorter horizons, much harder-to-decompose tasks, and much larger slowdowns.)
However, under this alternative definition, relatively small time-horizon values may correspond to much larger (real-world) impact.
I agree that this could lead to large impacts with relatively short horizons (perhaps even today’s horizons, with an appropriately broadened training distribution and a bunch of schlep). That does imply a different picture of AI strengths and weaknesses (e.g. weaker on-the-job learning with performance mostly limited to domains near the training distribution; differential speedup for tasks that are easily decomposed), with a more schlep-heavy singularity, a greater role for tight human involvement later in the process, and probably less alignment concern earlier in the trajectory.

paulfchristiano 6 Apr 2026 22:48 UTC
LW: 83 AF: 41
23
AF
on: AIs can now often do massive easy-to-verify SWE tasks and I’ve updated towards shorter timelines
I’ve seen demonstrations of AIs accomplishing [...] tasks that would take humans months to years. Due to this, I tentatively believe that (as of March 1st) the well-elicited 50% reliability time-horizon on ESNI tasks (using only publicly available models) is somewhere between a month and several years
Once you are talking about scaling to arbitrary inference costs, I think the relevant notion of time horizon is closer to: “For what T could you solve this task using an arbitrary supply of humans each of whom only works on the project for T hours?”
On this framing I think that lots of easily-checkable tasks have much shorter horizons than “amount of time they would take a human to do.” That is, I think that you could successfully solve them by delegating to a giant team of low-context humans each of whom makes marginal improvements, and the work from those humans would look quite similar to the AI work.
This is particularly true for reimplementation tasks with predefined tests + a reference implementation. In that setting a human can pick a specific failing test, use the reference implementation to understand the desired behavior, and then implement it while using tests to make sure they don’t break anything else. For the projects you are describing it seems plausible that a human with a few days could use that workflow to reliably make forward progress and get the project done in a moderate number of steps.
(I would expect a giant human team using that methodology to make a huge number of mistakes for anything not covered by the tests, and to write code that is low quality and hard to maintain. But I think we’re currently seeing roughly that pattern of failures from AI reimplementation.)
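The workflow above can be sketched as a toy loop. This is only an illustration under strong assumptions: the "project" is a lookup table, `reference` is a made-up stand-in for the reference implementation, and each worker fixes exactly one failing test. All names are hypothetical.

```python
from typing import Dict, List


def reference(x: int) -> int:
    """Hypothetical reference implementation defining the desired behavior."""
    return x * x + 1


def failing_tests(candidate: Dict[int, int], tests: List[int]) -> List[int]:
    """Return the test inputs where the candidate disagrees with the reference."""
    return [t for t in tests if candidate.get(t) != reference(t)]


def delegate(tests: List[int]) -> int:
    """Run the one-test-at-a-time workflow; return the number of worker steps."""
    candidate: Dict[int, int] = {}  # the reimplementation starts out empty
    steps = 0
    while True:
        failing = failing_tests(candidate, tests)
        if not failing:
            return steps
        target = failing[0]  # a low-context worker picks one failing test
        # The worker consults the reference to learn the desired behavior.
        candidate[target] = reference(target)
        # Re-running the whole suite checks that nothing else broke.
        if len(failing_tests(candidate, tests)) >= len(failing):
            raise RuntimeError("worker made no forward progress")
        steps += 1
```

In this toy version each worker only needs enough context for a single test, and the number of steps is bounded by the number of distinct failing tests. It deliberately says nothing about behavior the tests do not cover, which is where the mistakes mentioned above would show up.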
It’s also very true of vulnerability discovery, since different humans can look in different places and then iteratively zoom in on whatever leads are most promising.
So I’m not sure how much of the phenomenon you’re observing is “there is a way longer horizon for these tasks” rather than “a more careful definition of horizon is more important for these tasks.” Probably some of both, but it seems quite possible that for the tasks you are describing the horizon length is really more like a few days or a week than months or a year.
I think it’s helpful to try to pull these effects apart, particularly when reasoning about the likely effects of continuing RL schlep: I think that you are seeing good performance on these tasks partly because they are easy to decompose and delegate, and not entirely because AI developers are able to do RL on them. For the very “long horizon” tasks you cite, I think that’s the dominant effect. So I wouldn’t expect performance to be as impressive on tasks where it would be hard to delegate them to a giant team of low-context humans.
I still agree that recent public developments seem like evidence for shorter timelines, and I’m not saying my bottom line numbers differ much from yours. Even a more conservative extrapolation of time horizons would suggest more like 2-4 years until you can strictly outperform humans for virtual projects (with human parity somewhat before that).

paulfchristiano 13 Mar 2026 15:42 UTC
19 points
5
in reply to: leogao’s comment on: leogao’s Shortform
Your economics are wrong for a few reasons. Let’s grant the hypothetical where all humans supply homogeneous labor at a uniform wage.
- If AI is slightly cheaper than humans, what happens is that wages fall slightly. At the new, lower wages, there is more demand for labor (and more humans drop out of the labor force). At the same time, capital costs are bid up slightly. Eventually the price of AI and human labor is equal, and the quantity demanded is equal to the quantity supplied.
- At the same time, you are increasing demand for labor to build the AI (right now labor is ultimately the main input to building all the stuff that goes in datacenters). If the social value of the AI is near zero, then the net increase in demand is almost the same as the net increase in supply. Lowering wages and increasing capital costs doesn’t offset the benefits of extra productive capacity, it just shifts value from laborers to capitalists.
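A toy illustration of the first point, with entirely made-up linear curves: the functions and every number below are hypothetical, and the rising AI supply curve is only a stand-in for capital costs being bid up. It does not model the second point (extra labor demand from building the AI).

```python
def demand(w: float) -> float:
    """Hypothetical labor demand: falls as the wage rises."""
    return max(0.0, 100.0 - 2.0 * w)


def human_supply(w: float) -> float:
    """Hypothetical human labor supply: rises with the wage."""
    return 3.0 * w


def ai_supply(w: float) -> float:
    """Hypothetical AI labor supply: zero below a cost of 18, rising above it."""
    return max(0.0, 5.0 * (w - 18.0))


def clearing_wage(with_ai: bool, lo: float = 0.0, hi: float = 50.0) -> float:
    """Find the market-clearing wage by bisection on excess demand."""
    def excess(w: float) -> float:
        supply = human_supply(w) + (ai_supply(w) if with_ai else 0.0)
        return demand(w) - supply

    # Excess demand is decreasing in w, positive at lo and negative at hi.
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if excess(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0
```

With these curves the wage falls from 20 to 19 when AI arrives, total labor used rises from 60 to 62, and human employment falls from 60 to 57: wages fall slightly rather than collapsing, and the market still clears.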
The real fiscal issue in this scenario is that you are shifting output from labor to capital, and the tax rate on capital is lower than the tax on labor. (Moreover, as you automate the economy there are further corporate reorganizations that would drive effective tax rates well below the on-paper capital gains rate.) You’re doing that at the same time that you are potentially increasing spending, which is tough unless you are willing to adjust the tax code.
I’m inclined to agree with other commenters though that none of this seems like the most important issue. The fiscal issues can be overcome if the state cares, and my best guess is that growth will accelerate enough that it would be OK even if there was no political change.
People should have much bigger concerns about being completely materially disempowered: (i) the state may not continue to support them, either because they are politically disempowered or because the state itself is disempowered, and (ii) even if they are able to survive they will have no say over what the world looks like and that sucks in its own way.

paulfchristiano 12 Jan 2024 5:11 UTC
16 points
11
in reply to: aphyer’s comment on: Introduce a Speed Maximum
I think US speed limits are so low in part because of an effort to limit demand for oil in 1973; see Wikipedia.
Even setting aside that ignominious origin, I expect that optimal speeds have significantly increased as technology has improved and so speed limits are probably too low because of inertia.

paulfchristiano 6 Jan 2024 17:03 UTC
LW: 22 AF: 11
6
AF
in reply to: habryka’s comment on: Catching AIs red-handed
As a concrete example, suppose that an AI fails to find an error in a piece of code (which will slightly undermine the security of some important computer system). You are unsure if it represents sabotage or a benign error. Relevant evidence for making that distinction:
- Can the AI identify structurally similar errors in other code? Is the behavior sensitive to evidence about how the code will be used, in a way that doesn’t seem relevant to the nature or salience of the error? Is the pattern of sensitivity suggestive of an attempt to disempower humanity?
- Is the behavior sensitive to the AI’s beliefs about the quality of human oversight? For example does training data suggesting more vigilant human oversight in this domain decrease the probability of the mistake? Do other AI systems trained in different ways reproduce a similar pattern of mistakes?
- If we have one example of a mistake that looks suspicious, we can examine other similar mistakes and build up a detailed pattern of behavior. For example we can hold the context fixed and then look at whether the AI misses other kinds of software errors. With multiple mistakes in hand we can run all this analysis on a broader spectrum of mistakes. We can iteratively modify scenarios to explore the landscape of deceptive behavior, gradient descenting our way towards stronger and stronger evidence.
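One simple way to quantify the kind of sensitivity described in the bullets above is to compare the rate of missed errors under two conditions, for example contexts suggesting weak oversight versus contexts suggesting vigilant oversight. The sketch below uses a standard two-proportion z-test; the choice of test and the function names are assumptions for illustration, not a description of any actual evaluation pipeline.

```python
import math


def two_proportion_z(miss_a: int, n_a: int, miss_b: int, n_b: int) -> float:
    """z-statistic for the difference in miss rates between conditions A and B."""
    p_a = miss_a / n_a
    p_b = miss_b / n_b
    pooled = (miss_a + miss_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    if se == 0.0:
        return 0.0  # identical all-miss or no-miss behavior in both conditions
    return (p_a - p_b) / se


def one_sided_p(z: float) -> float:
    """P(Z >= z) for a standard normal Z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))
```

A large positive z, where condition A is the weak-oversight framing, would be the kind of initial signal that justifies collecting more examples. On its own it does not distinguish sabotage from benign framing effects, which is why the follow-up experiments matter.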
As another example, suppose that an AI system tries to modulate its behavior in any way based on whether it believes it is being closely monitored vs operating in the real world. And suppose that you find an example in the lab where the AI system messes up, and therefore where there is an unusually large gap between its ability to solve a problem in a “realistic-looking” situation vs a situation that’s obviously a test. The initial behavior isn’t super suspicious (sometimes behavior randomly varies with framing), but it’s suspicious enough to get flagged for more thorough review. Subsequent rounds find further evidence of sensitivity, and eventually it’s pretty likely (though still not “legal smoking gun”) that we’ve got an example of an AI system modulating its behavior based on whether it believes it is being tested.
At that point you can start collecting data about what cues the AI system is using to determine whether it is being tested, by varying features of the context and seeing whether you see nicer or less nice behavior. And once you have this understanding in hand you can use it in other contexts to try to get the AI to mess up more. As you find more examples you can build a richer picture, in a virtuous cycle (like normal science).
I think being an AI in this situation is pretty untenable if you are up against human scientists trying hard to figure out what’s going on with a typical level of rigor and competence. And so my guess is that typical competent malicious AI systems won’t try to modulate their behavior in a significant way based on whether they believe they are being actively tested/monitored. (But it’s still worth looking extremely hard, and e.g. it’s likely that some AI systems will mess up and try to pull this kind of stunt.) And that’s a huge difference with the human case: changing your behavior when you are being closely watched is one of the key things that a human needs to do in order to preserve plausible deniability, and one of the key features of many AI safety horror stories.
(We could tell an identical story about cases where AI systems try to behave badly in ways that are too subtle for humans to notice, but mess up in a single case by underestimating human abilities. In that case we get a clear example of sensitivity to information about human competence, and again can explore the whole frontier of this behavior.)
This kind of possibility forces AI systems to push back bad behavior into cases where they are more and more confident that it’s never going to be noticed. But the space of interventions we get to try, if we subject this issue to rigorous scientific scrutiny, includes significantly modifying the AI’s training data and limiting information about the world. So “super confident that humans will never notice” is a very high bar.
just got kind of unlucky in how its context shifted it into a more deceptive role-play (but without long-term planning being involved) or was actually doing some more coherent long-term deception plan.
I’m not sure this distinction is particularly important. If ChatGPT is taking actions to undermine human control of AI, and you can establish a pattern of behavior, that’s a big deal which will cause scientists and the world to sit up and take interest. I don’t think defenses like “it’s just trying to undermine human control because it’s read a lot of stories about AI systems taking over” or “it lies and manipulates the user because that’s what a sociopathic human would do, and RL training put it into a more sociopathic mode” are going to fly very well.
If this is happening at a point in time when the AI system is competent enough to have real consequences, then I think it’s going to be an extremely good basis for a policy reaction.
We’ve seen a tiny example of this with Sydney, in a system where it has almost zero consequences and where it’s pretty much random rather than having a pattern of behavior, and that’s already had significant effects on public and scientific perception.
Overall I would agree it takes a lot of work to actually do the monitoring (especially if AI developers want to avoid accountability), to establish a pattern of behavior, to amplify a little bit of evidence into a robust scientific picture, to diagnose the issue appropriately and avoid overfitting, etc. I think some people have the view that the evidence is already in and nothing will change, but I think that’s very wrong, and it seems to me like it’s one of the more harmful features of the AI safety ideological bubble. (If you go out into the broader scientific community, I think “do you have evidence” is really a central hangup, and the standards of evidence being used are overall pretty reasonable IMO.)