emile delcourt

Karma: 31

emile delcourt 15 Jun 2026 4:16 UTC
1 point
0
on: x-risk-themed
Great post! I think “do no harm” comes to mind. Before we intervene (e.g., by accelerating RSI/ASI “for defenders” or “against cancer”) we owe 10B people the verifiable proof that the intervention does not shorten their remaining lives to single digits. FWIW, crunch mode might move a needle orthogonally or in the opposite direction to actual existential safety (to Max H’s point). Any unsaturated eval can advance capabilities or deceptive alignment, so building more accelerates harm (setting aside closed ones and withheld datasets; thanks to the groups who run these). There’s also safety washing, which numbs audiences to the problem. And papers are consumed automatically by agents (e.g., AutoRedTeamer’s lifelong attack integration), so the more we publish, the worse it gets.

emile delcourt 14 Jun 2026 13:47 UTC
1 point
0
in reply to: Alex Mallen’s comment on: Alex Mallen’s Shortform
Running evals or inspecting logprobs reveals a lot of misalignment available and active outside the top 90% of the distribution, and distribution is indeed a key factor. The longer and less coherent the inputs, the worse the outputs.

emile delcourt 14 Jun 2026 13:23 UTC
0 points
1
in reply to: Jan Betley’s comment on: Jan Betley’s Shortform
Bostrom’s AI takeoff speed concerns may or may not have covered this, but a key problem is that under a decisive strategic advantage scenario, our most critical resources (if only communications) would likely be the queen that a superintelligent AI captures before we can put it in check (no EMP if we can’t launch it). You just don’t want to be in a scenario where you need to do this, because you wouldn’t be able to, at all.

Don’t just aim for Frontier Labs

emile delcourt 14 Jun 2026 4:41 UTC

4 points

0 comments · 28 min read · LW link

emile delcourt 21 May 2026 0:27 UTC
1 point
0
on: Classifier Context Rot: Monitor Performance Degrades with Context Length
Thank you for sharing these findings. I’m leading a theme on real-time behavioral kill switches in OWASP’s Securing Agentic Applications v2, where your input would be welcome (otherwise we’ll likely just cite the paper where relevant).

emile delcourt’s Shortform

emile delcourt 12 May 2026 3:56 UTC

2 points

2 comments · 1 min read · LW link

emile delcourt 12 May 2026 3:56 UTC
2 points
0
on: emile delcourt’s Shortform
I think a lot about how to increase the chance of contributing something valuable to x-risk mitigation, with AI safety as my particular path. 5 weeks ago, @jaybailey suggested that one way to achieve that goal can be to Have an Unreasonably Specific Story about the Future (a Future That I Want, to be even more specific). Do I have a 5-minute story that helps to do this? I think yes:

1) The current environment and AI architecture incentivize continued growth in capabilities and scale, which also increase weaknesses and threats, and where agents are viewed as short-term profits and building trust at the expense of strategic resilience and rational safety.
2) Many stakeholders would lift their skepticism regarding the risks (and perhaps even advocate for a more cautious approach to AI in their own network) if presented with a compelling case of clear and present risk of irreversible damage to society, which would tilt the balance in favor of decisions that reduce the risk of catastrophic incidents.
3) Creating such a case requires assessments of impacts and likelihoods that are:
a) accurate, high-elicitation, and up to date against the latest models, research, and threat intelligence;
b) well-crafted to make immediate sense to the audience;
c) well publicized to reach the decision makers that matter.
4) Each time I share a failure case / threat model / demo / paper, and a decision maker understands why it rises above the FUD, the alternative of stricter governance gains momentum.
5) When someone learns something, the next question is always the same. What do we do next? It’s the golden moment. We organize behavior change, working extremely efficiently through chokepoints (even simplifying processes and systems to create some if necessary), to re-align systems, people, and impact. If doing this looks like a lower rate of acceleration, it is not slowing down, and if it is, that just indicates an opportunity to allocate resources more optimally, and we just show that.

emile delcourt 28 Apr 2026 1:33 UTC
2 points
1
on: Blackmail at 8 Billion Parameters: Agentic Misalignment in Sub-Frontier Models
Really great post (and concise to boot)!

An opportunity exists to see actual model activations and logprobs of unverbalized blackmail in latent space (at least not verbalized in the limited sample: based on the zeros observed, we’d expect a blackmail rate below 3%, but we don’t know whether it’s 2% or 0.01%). Thanks to the low parameter count, maybe training a set of linear probes is feasible?
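
For readers wondering where a bound like “less than 3%” can come from: when zero events are observed in n independent trials, the exact one-sided 95% upper bound on the true rate is 1 - 0.05^(1/n), which the “rule of three” approximates as 3/n. The sketch below is only an illustration. The sample size of 100 is an assumption chosen because it reproduces roughly 3%, not a figure taken from the post, and the function names are hypothetical.

```python
def zero_event_upper_bound(n_trials: int, confidence: float = 0.95) -> float:
    """Exact one-sided upper bound on an event rate when 0 events were seen.

    Solves (1 - p) ** n_trials = 1 - confidence for p, i.e. the largest rate
    that would still produce zero events with probability (1 - confidence).
    """
    if n_trials <= 0:
        raise ValueError("n_trials must be positive")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    alpha = 1.0 - confidence
    return 1.0 - alpha ** (1.0 / n_trials)


def rule_of_three(n_trials: int) -> float:
    """Common approximation to the 95% bound above: roughly 3 / n."""
    if n_trials <= 0:
        raise ValueError("n_trials must be positive")
    return 3.0 / n_trials


if __name__ == "__main__":
    n = 100  # ASSUMPTION: hypothetical sample size, not taken from the post
    print(f"exact 95% upper bound: {zero_event_upper_bound(n):.4f}")
    print(f"rule of three:         {rule_of_three(n):.4f}")
```

Either way, the bound only caps the rate from above. It cannot distinguish 2% from 0.01%, which is why looking at activations and logprobs directly would add information.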

emile delcourt 23 Aug 2025 1:48 UTC
2 points
0
on: My AGI timeline updates from GPT-5 (and 2025 so far)
What if work longer than one week was vastly sublinear in complexity for AI—meaning that doubling task duration doesn’t double difficulty for an agent?
This could be a corollary to your point on superexponential results, though driven less by capability than by environmental overhangs for which humans may not have fully optimized, especially in a multi-agent direction.

Scoping LLMs

erik, David Baek, emile delcourt and 4gate

10 Apr 2025 0:32 UTC

4 points

0 comments · 22 min read · LW link

emile delcourt 13 Dec 2024 6:37 UTC
1 point
1
on: An Introduction to AI Sandbagging
Really appreciate this post! The recommendation “Evaluators should ensure that effective capability elicitation techniques are used for their evaluations” is especially important. Zero-shot, single-turn prompts with no transformations no longer seem representative of a model’s impact on the public (who, in aggregate or with scant determination, will be inflicting many variants of unsanctioned prompts with many shots or many turns)

emile delcourt 13 Dec 2024 6:06 UTC
1 point
0
on: An Introduction to AI Sandbagging
I’m curious why in example 1 the ability to manipulate (“persuade”) is called a capability evaluation, making limited results eligible for the sandbagging label, whereas in example 6 (under the name “blackmail”) it is called an alignment evaluation, making limited results ineligible for that label?
Both examples tuned out the manipulation enough to hide it in testing, with worse results in production. Can someone help me to better learn the nuances we’d like to impose on sandbagging? Benchmark evasion is an area I started getting into only in November.

Can Large Language Models effectively identify cybersecurity risks?

emile delcourt 30 Aug 2024 20:20 UTC

18 points

0 comments · 11 min read · LW link

emile delcourt 30 Aug 2024 19:39 UTC
9 points
2
on: Open Thread Summer 2024
Hi! Just introducing myself to this group. I’m a cybersecurity professional who has enjoyed various deep learning adventures over the last 6 years and, inevitably, now manages AI-related risks in my information security work. I went through BlueDot’s AI safety fundamentals last spring with lots of curiosity and (re?)discovered LessWrong. Looking forward to visiting more often, and engaging with the intelligence of this community to sharpen how I think.
