@gabe_cc
Gabriel Alfour
Thanks for engaging. I have karma-upvoted but disagree-voted.
You wrote a lot, so I’ll focus on your two bounded points.
About Goodfire
I think you are wrong. If I had to guess, it’s in the same way that people are misguided about Anthropic.
For comparison, this is the copy at the top of Anthropic’s Research page:
Our research teams investigate the safety, inner workings, and societal impacts of AI models – so that artificial intelligence has a positive impact as it becomes increasingly capable.
This is their copy at the top of Goodfire’s About page:
Goodfire is a research company using interpretability to understand, learn from, and design AI systems. Our mission is to build the next generation of safe and powerful AI—not by scaling alone, but by understanding the intelligence we’re building.
And at the top of their Research page:
Fundamental interpretability research to understand and intentionally design advanced AI systems
They specifically state their wish to design advanced, next-generation AI systems. To the extent I may have mis-characterised them, it is that people may fail to realise they are not just building capabilities on top of existing systems: what I have described could lead people to think they are just doing RAG or something similar.
About Revolving Doors
I think it is better to criticize specific decisions made by evaluations companies, or decisions they haven’t taken, rather than pointing at associations between people and inferring harmful behavior without evidence (I think your example of people at evaluations orgs being concerned about losing API access if they speak out is a much better criticism here). In general I think this sort of guilt-by-association is not very fruitful.
I think you are fractally wrong in this paragraph. I’ll try to show it to you.
1) “It is better to [A] than [B]” is a false dilemma. In the article, as you point out, I just do both.
2) “Pointing at associations between people and inferring harmful behavior without evidence” does not describe what I am doing, again as you point out.
3) It is crucial to point at associations between people and infer harmful behaviour from them, even without confirmation! One should not reject evidence just because it is not “very fruitful” when used by itself. It is obviously relevant that the head of safety of US-AISI, the CEO of Open Phil, and the CEO of Anthropic were all roommates, and it is trivial to infer things from that.
4) I literally wrote “This is not only about having a couple of senior staff from the industry. That in itself can be good! It’s the whole picture that looks bad.”, and then proceeded to list the whole picture.
Why AI Evaluation Regimes are bad
Against Epistemic Humility and for Epistemic Precision
We have each other on Signal, and you can DM me on LW. I don’t think you ever sent me a case for either of your points, nor had anyone follow up with me about one. So by default, I don’t care much for it.
I also think you are confused about how campaigns work. There is a campaign page (which you link), which usually acts as the home page. If you want the arguments, you have to go to the report page ( https://controlai.com/deepfakes/deepfakes-policy#report )
To your points:
in the sense that it’s warning of deepfakes being a much bigger deal than they are
How big of a deal do you think we think they are?
How big of a deal do you think we made them out to be, based on which elements of our copy? If there’s a large gap there, I can understand why you would think that we were misleading.
If nah, I think you just feel bad about our campaign. (For other reasons, which may independently be good or bad.)
But fwiw, in general, I do not care much for “I indirectly made Habryka feel bad online” or “I disagree with Habryka”, given that we are neither friends nor regular intellectual sparring partners.
I would be happy to try to convince you that in the absence of x-risk, substantially regulating AI for deepfake reasons would be a really bad idea and obviously doesn’t pass cost-benefit analyses
Please do.
Given that you have not written this case or even shared it with me, I have no idea why you think I would be convinced by it, especially since I have likely spent more time on this than you have.
It may have been better to do it as we were campaigning on DeepFakes rather than now, but alas. I would still be interested though: I have other relevant views correlated with it.
I happen to think we do live in this world 🙂
Hindsight is 20/20! We do live in the world where we already have the statements, and still not the laws, so I don’t think it commits you to much!
But if you have a non-trivial understanding of “real” (ie, something beyond “it’s hard / hasn’t been done yet”), I’d love to hear more about it.
Ultimately, my goal for AI Policy org pipelines (ControlAI or not) is to build an incremental sequence of publicly verifiable political actions of increasing difficulty and importance, in a way that respects basic deontology.
I expect any non-trivial understanding of “real vs fake” would help me with designing such pipelines.
Your piece is centrally not advocating against running misleading campaigns on the effects of deepfakes.
First. I don’t think ControlAI has run campaigns that were misleading on the effects of deepfakes.
Second. The section you quote is centrally about not running more campaigns like DeepFakes! It is part of the comparison with what we have done before, which includes and explicitly mentions DeepFakes!
Here is how it starts:
We have engaged with The Spectre. We know what it looks like from the inside.
To get things going funding-wise, ControlAI started by working on short-lived campaigns. We talked about extinction risks, but also many other things. We did one around the Bletchley AI Safety Summit, one on the EU AI Act, and one on DeepFakes.
By now, you have made quite a few misrepresentations and errors of basic reading comprehension.
I think you could have avoided them easily, and that you simply got triggered. I would invite you to pause.
--
The point of this piece is just to show that there has in fact been a cluster of orgs that have optimised to not talk about extinction risks. AISIs, evals orgs and “more marginalist policy orgs” are central examples.
I do not think you personally deny that there was optimisation to not talk about extinction risks.
I believe you just think it is okay, because it may be plausibly defended on consequentialist grounds if one buys into a specific set of beliefs.
--
But many people do not agree with this naive consequentialist reasoning, and I am writing this primarily for them. If you do not think honesty is morally worthwhile in itself, this is likely lost on you.
Here is an example of a thing many people consider morally bad, but where you might disagree. For many people, if you do believe in extinction risks and work at an AI Policy org, it is in fact bad and dishonest to not make it abundantly clear.
Similarly, if the UK AISI is a “marginalist policy org” whose people primarily care about extinction risks, it is bad that its trend report does not mention extinction risks.
--
Unfortunately, many people with such deontological intuitions were scared into avoiding honesty on the grounds that honesty was doomed to failure (or even corrosive).
This article shows that this scare-mongering was groundless. Furthermore, it shows that it was not coincidental, and instead the result of a clear optimisation process.
This is the introduction:
Instead, most AI Policy Organisations and Think Tanks act as if “Persuasion” was the bottleneck. This is why they care so much about respectability, the Overton Window, and other similar social considerations.
Before we started the DIP, many of these experts stated that our topics were too far out of the Overton Window. They warned that politicians could not hear about binding regulation, extinction risks, and superintelligence. Some mentioned “downside risks” and recommended that we focus instead on “current issues”.
They were wrong.
In the UK, in little more than a year, we have briefed +150 lawmakers, and so far, 112 have supported our campaign about binding regulation, extinction risks and superintelligence.
The point is that many experts stated that this was far out of the Overton window, and that they were wrong.
That this was a symptom of being systematically avoidant.
A year ago, ControlAI bet, through its strategy, that they were wrong. This article summarises why we think they were wrong, including both indirect and direct evidence.
This is not ControlAI’s Impact Report on its effectiveness.
This is an article stating that people have been avoidant on extinction risks and superintelligence, and that the excuse of the “Overton Window” was wrong.
Furthermore, you have missed that this article also mentions briefing +150 UK reps about extinction risks and superintelligence, as well as two debates in the House of Lords in January, one opening with extinction risks and the other with a potential moratorium on superintelligence.
Thanks a lot for the response.
I am a bit baffled though.
It feels like you are thoroughly misunderstanding the nature of the piece, and I am not sure what I have done for you to understand it that way.
From my point of view:
The examples are not evidence, they are examples. They are here to exemplify, illustrate and clarify the concepts that I introduce. Like, “Definitions are good, but it’s hard to get things from just a definition, so here are examples.”, or “Here are situations where I think using the concept is useful, as it may be hard to get an idea of the scope just from a definition.”
This is neither a Wikipedia article nor is it a case for a specific thesis. [citation needed] and [weak argument] seem like not understanding the basic premise of the piece.
I am describing a pattern I have found useful, at the intersection of sociology and psychoanalysis. These are notoriously not fields with a standard evidence-based method to establish claims that broad.
Eliezer’s Sequences and Scott’s Codex are full of essays doing similar things, and I believe you would understand how much of an isolated demand for rigour it would be to go [citation needed] and [weak argument] on these essays. There’s just not a standard way to establish things in these fields, let alone based on citations and strong arguments.
So instead, like most people dealing with the topic, I am providing a frame. Good counters to a frame do not look like [citation needed] or [weak argument].
From my point of view, good counters are instead things like:
“When I see [this situation from the article], my ready-to-go explanation is [X] instead. I think it’s better because of [Y]”
“I have never observed this dynamic in my life. The examples feel forced, and when I try to proactively find new ones, I can’t. In my intuitive understanding of things, it feels like the opposite dynamic is more pervasive, wherein [X] happens instead”
“You diagnose [X] as the cause of [Y1 and Y2] in [a situation from article]. I think you are missing [Z], a much more important contributor given the considerations that you mention.”
“I / people I know have tried to use that frame, and they tended to make worse predictions and become less effective for it.”
(DM-ing you with the GDoc link.)
To downvote based on this seems like clearly going downwards on Graham’s hierarchy of disagreement.
You don’t refute the central point, identify a crucial mistake, or state an opposite case. It’s not even that you think I am bypassing your epistemic immune system.
You are instead expressing the concern that the tone I use may bypass someone else’s, which is ~irrefutable and pattern-matches to me like concern trolling.
To be clear, I do not think this behaviour is specific to you. I think it is pervasive on LessWrong, and makes it a much worse intellectual community than it could be.
In my experience, when I have dug into them with people, such “backfires” turn out to be normal disagreements or rejections of my points for reasons that do not sound “rational”, followed by not owning up to it.
Nevertheless, I am thankful for your offer, and I am interested in it. This is much more detailed feedback than I normally get here. I would prefer if you did it on a Google Doc though.
The Spectre haunting the “AI Safety” Community
Someone in the past said something similar.
But they meant formatting emphasis (like bold and italics) as opposed to semantic emphasis (like repeating the same thing many times). Was that you?
If so, just so you know:
I wish people used formatting emphasis. It conveys tone and helps with skimming a piece.
I actually wish that Markdown had built-in support for more formats: bigger text, smaller text, an easy way to switch between fonts, etc.
Conversely, I wish people used less semantic emphasis. It makes written pieces longer than needed and reads as fluff to me.
Human Fine-Tuning
I am not sure which part of the piece got you to comment this, but I am glad.
This comment clicked for me.
You may be interested in this other article, that explores something adjacent to “under-cynicism”: https://www.lesswrong.com/posts/hjepvXZozGsKAbJbr/how-to-think-about-enemies-the-example-of-greenpeace
Terminal Cynicism
In my view there were LLMs in 2024 that were strong enough to produce the effects Gabriel is gesturing at (yes, even in LWers), probably starting with Opus 3.
agreed
I think the mitigation here is not to be suspicious of “long term planning based on emotional responses”
agreed, for similar reasons
I think the most important skill here is more about how to use your own power to shape your interactions (e.g. by uncompromisingly insisting on the importance of principles like honesty, and learning to detect increasingly subtle deceptions so that you can push back on them)
I strongly disagree with this, and believe this advice is quite harmful.
“Uncompromisingly insisting on the importance of principles like honesty, and learning to detect increasingly subtle deceptions so that you can push back on them” is one of the stereotypical ways to get cognitively pwnd.
“I have stopped finding out increasingly subtle deceptions” is much more evidence of “I can’t notice it anymore and have reached my limits” than “There is no deception anymore.”
An intuition pump may be noticing the same phenomenon coming from a person, a company, or an ideological group. Of course, the moment you stop noticing their increasingly subtle lies after pushing back against them is the moment they have pwnd you!
The opposite would be “You push back on a couple of lies, and don’t get any more subtle ones as a result.” That one would be evidence that your interlocutor grokked a Natural Abstraction of Lying and has stopped resorting to it.
But pushing back on “increasingly subtle deceptions” up until the point where you don’t see any is almost a canonical instance of The Most Forbidden Technique.
Glad to read this.
I am currently writing about it. So, if you have questions, remarks or sections that you’ve found particularly interesting and/or worth elaborating upon, I would benefit from you sharing them (whether it is here or in DM).
So if I don’t take myself in general too seriously by holding most of my models lightly and I then have OODA loops where I recursively reflect on whether I’m becoming the person who I want to be and have set out to be in the past, is that not better than having high guards?
I believe it is hard to accept, but you do get changed as a result of what you spend your time on, regardless of your psychological stance.
You may be very detached. Regardless, if you see A then B a thousand times, you’ll expect B when you see A. If you witness a human-like entity feel bad at the mention of a concept a thousand times, it’s going to do something to your social emotions. If you interact with a cognitive entity (another person, a group, an LLM chatbot, or a dog) for a long time, you’ll naturally develop your own shared language.
--
To be clear, I think it’s good to try to ask questions in different ways and discover just enough of a different frame to be able to 80-20 it and use it with effort, without internalising it.
But Davidad is talking about “people who have >1000h LLM interaction experience.”
--
From my point of view, all the time, people get cognitively pwnd.
People get converted and deconverted, public intellectuals get captured by their audience, newbies try drugs and change their lives after finding its meaning there, academics waste their research on what’s trendy instead of what’s critical, nerds waste their whole careers on what’s elegant instead of what’s useful, adults get syphoned into games (not video games) to which they realise much later they lost thousands of hours, thousands of EAs get tricked into supporting AI companies in the name of safety, citizens get memed both into avoiding political actions and into feeling bad about politics.
--
I think getting pwnd is the default outcome.
From my point of view, it’s not that you must commit a mistake to get pwnd. It’s that if you don’t take any precaution, it naturally happens.
Genuine thanks for your response. I have again karma-upvoted and disagree-voted.
About Goodfire
Broadly, I don’t care much about the moral intents of OpenAI, Anthropic and Goodfire. I think everyone is the hero of their own story.
To the extent that people state their intents, I basically remove all the moral colours and emotional valence. For instance, “we build capabilities for The Good and for Good Reasons” becomes “we build capabilities”.
Another example may be “we care a lot about rationality”. Rationality is the practice of “reason”, which is “thinking correctly”. “Care” is also “we have good feelings for”. So, through this process, it becomes “we have a lot of feelings around the practice of thinking”.
A last example could be “effective altruism”. Here, it’s hard to see what’s left after you go through this process. Something like “we do things connected to others (altruism) that perform highly on measures of our own choice (effective)”.
I strongly recommend this frame of analysis to any of my readers, not only “GenericModel”.
If you are this deep in this thread, you most likely need it.
About revolving doors
On 3, I think you are forgetting the basics of LessWrong.
“If [X], then necessarily [Y] is true” is logical entailment.
Evidence is “If [X], then [Y] is more likely.” A single counter-example does not disprove it.
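To spell out the distinction in standard notation (a minimal gloss, nothing beyond the two sentences above):

$$X \Rightarrow Y \quad \text{(entailment: a single counter-example refutes it)}$$
$$P(Y \mid X) > P(Y) \quad \text{(evidence: X raises the probability of Y)}$$

A single case where X holds and Y does not is just one more data point against the second claim; it does not refute it the way it would refute the first.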
If you mean that housemates (by choice, not necessity!) who have worked together are not more likely to be synchronised in their views, then I think you are clearly wrong.
Just so you know, Holden moved from CEO of Open Philanthropy to Anthropic a year ago. This type of revolving door was entirely predictable through this type of evidence.
I have been polite in the article. Had I included publicly known marriage ties, things would already look worse. Had a journalist included private affairs, the picture would look much worse. This would be entirely valid probabilistically.
On 4, I think you are forgetting that everything is adversarial.
Yes, but people lie, they deceive, they “strategically withhold” information, and so on. Even when they don’t, it also just takes time to document everything, and vet it for publication.
This lack of transparency is why we use other types of evidence. They tend to be less accurate, but also more plentiful, less filtered, and harder to fake.
Yes, and I have listed the adversarial considerations in the article. It is the part that literally starts with “The considerations around independence are structural. Namely, [...]”