
RogerDearnaley

Karma: 2,638

I’m a staff AI engineer and researcher working with LLMs, and I have been interested in AI alignment, safety, and interpretability for the last 17 years. I did research on this during MATS summer 2025, and am now an independent researcher at Meridian in Cambridge. I’m currently looking for either employment or funding to work on this subject in the London/Cambridge area in the UK.

[Question] How Hard a Problem is Alignment?

RogerDearnaley · 11 Mar 2026 16:47 UTC
21 points
15 comments · 3 min read · LW link

How Hard a Problem is Alignment? (My Opinionated Answer)

RogerDearnaley · 11 Mar 2026 16:46 UTC
48 points
4 comments · 67 min read · LW link

Shaping the exploration of the motivation-space matters for AI safety

6 Mar 2026 14:43 UTC
77 points
13 comments · 10 min read · LW link

Reporting Tasks as Reward-Hackable: Better Than Inoculation Prompting?

RogerDearnaley · 21 Feb 2026 1:59 UTC
34 points
2 comments · 5 min read · LW link

[Question] What’s Your P(WEIRD)?

RogerDearnaley · 16 Feb 2026 18:19 UTC
26 points
18 comments · 9 min read · LW link

The Meta-Anthropic Argument

RogerDearnaley · 2 Feb 2026 1:10 UTC
41 points
55 comments · 2 min read · LW link

Pretraining on Aligned AI Data Dramatically Reduces Misalignment—Even After Post-Training

RogerDearnaley · 19 Jan 2026 21:24 UTC
105 points
12 comments · 11 min read · LW link
(arxiv.org)

Grounding Value Learning in Evolutionary Psychology: an Alternative Proposal to CEV

RogerDearnaley · 23 Dec 2025 3:40 UTC
41 points
25 comments · 20 min read · LW link

The Best Way to Align an LLM: Is Inner Alignment Now a Solved Problem?

RogerDearnaley · 28 May 2025 6:21 UTC
36 points
34 comments · 9 min read · LW link

Why Aligning an LLM is Hard, and How to Make it Easier

RogerDearnaley · 23 Jan 2025 6:44 UTC
37 points
3 comments · 4 min read · LW link

[Question] What Other Lines of Work are Safe from AI Automation?

RogerDearnaley · 11 Jul 2024 10:01 UTC
41 points
36 comments · 5 min read · LW link

A “Bitter Lesson” Approach to Aligning AGI and ASI

RogerDearnaley · 6 Jul 2024 1:23 UTC
66 points
43 comments · 24 min read · LW link

2.5. Evolution and Ethics

RogerDearnaley · 15 Feb 2024 23:38 UTC
8 points
12 comments · 7 min read · LW link · 1 review

Requirements for a Basin of Attraction to Alignment

RogerDearnaley · 14 Feb 2024 7:10 UTC
47 points
12 comments · 31 min read · LW link

Alignment has a Basin of Attraction: Beyond the Orthogonality Thesis

RogerDearnaley · 1 Feb 2024 21:15 UTC
15 points
15 comments · 13 min read · LW link

Approximately Bayesian Reasoning: Knightian Uncertainty, Goodhart, and the Look-Elsewhere Effect

RogerDearnaley · 26 Jan 2024 3:58 UTC
17 points
2 comments · 11 min read · LW link

A Chinese Room Containing a Stack of Stochastic Parrots

RogerDearnaley · 12 Jan 2024 6:29 UTC
21 points
4 comments · 5 min read · LW link · 1 review

Motivating Alignment of LLM-Powered Agents: Easy for AGI, Hard for ASI?

RogerDearnaley · 11 Jan 2024 12:56 UTC
37 points
4 comments · 39 min read · LW link

Goodbye, Shoggoth: The Stage, its Animatronics, & the Puppeteer – a New Metaphor

RogerDearnaley · 9 Jan 2024 20:42 UTC
49 points
8 comments · 37 min read · LW link

Striking Implications for Learning Theory, Interpretability — and Safety?

RogerDearnaley · 5 Jan 2024 8:46 UTC
37 points
4 comments · 2 min read · LW link