Elliott Thornley (EJT)

Karma: 1,115

elliott-thornley.com

Towards Shutdownable Agents: Generalizing Stochastic Choice in RL Agents and LLMs

3 Jun 2026 14:24 UTC
20 points
3 comments, 19 min read, LW link
(arxiv.org)

Maybe we should pretrain on synthetic data about good-but-reward-hacking AIs

Elliott Thornley (EJT), 29 May 2026 14:50 UTC
12 points
4 comments, 3 min read, LW link

AI safety can be a Pascal’s mugging even if p(doom) is high

Elliott Thornley (EJT), 25 Apr 2026 16:16 UTC
29 points
10 comments, 1 min read, LW link

Preference gaps as a safeguard against AI self-replication

26 Nov 2025 14:49 UTC
10 points
2 comments, 11 min read, LW link

Shutdownable Agents through POST-Agency

Elliott Thornley (EJT), 16 Sep 2025 12:09 UTC
33 points
8 comments, 54 min read, LW link
(arxiv.org)

Towards shutdownable agents via stochastic choice

8 Jul 2024 10:14 UTC
59 points
11 comments, 23 min read, LW link
(arxiv.org)

The Shutdown Problem: Incomplete Preferences as a Solution

Elliott Thornley (EJT), 23 Feb 2024 16:01 UTC
62 points
33 comments, 41 min read, LW link

The Shutdown Problem: An AI Engineering Puzzle for Decision Theorists

Elliott Thornley (EJT), 23 Oct 2023 21:00 UTC
79 points
28 comments, 39 min read, LW link
(philpapers.org)

The price is right

Elliott Thornley (EJT), 16 Oct 2023 16:34 UTC
42 points
3 comments, 4 min read, LW link
(openairopensea.substack.com)

[Question] What are some examples of AIs instantiating the ‘nearest unblocked strategy problem’?

Elliott Thornley (EJT), 4 Oct 2023 11:05 UTC
6 points
4 comments, 1 min read, LW link

EJT’s Shortform

Elliott Thornley (EJT), 26 Sep 2023 15:19 UTC
4 points
16 comments, 1 min read, LW link

There are no coherence theorems

20 Feb 2023 21:25 UTC
161 points
136 comments, 19 min read, LW link, 1 review