Geoffrey Irving

Karma: 904

Chief Scientist at the UK AI Safety Institute (AISI). Previously at DeepMind, OpenAI, Google Brain, etc.

Research Areas in Cognitive Science (The Alignment Project by UK AISI)

Geoffrey Irving

1 Aug 2025 10:26 UTC

12 points

0 comments, 6 min read

(alignmentproject.aisi.gov.uk)

The Alignment Project by UK AISI

Mojmir, Benjamin Hilton, Jacob Pfau, Geoffrey Irving, Joseph Bloom, Tomek Korbak, David Africa and Edmund Lau

1 Aug 2025 9:52 UTC

29 points

0 comments, 2 min read

(alignmentproject.aisi.gov.uk)

The need to relativise in debate

Geoffrey Irving and Simon Marshall

26 Jun 2025 16:23 UTC

31 points

2 comments, 5 min read

Prover-Estimator Debate: A New Scalable Oversight Protocol

Jonah Brown-Cohen and Geoffrey Irving

17 Jun 2025 13:53 UTC

89 points

19 comments, 5 min read

Unexploitable search: blocking malicious use of free parameters

Jacob Pfau and Geoffrey Irving

21 May 2025 17:23 UTC

40 points

16 comments, 6 min read

Dodging systematic human errors in scalable oversight

Geoffrey Irving

14 May 2025 15:19 UTC

34 points

4 comments, 4 min read

An alignment safety case sketch based on debate

Marie_DB, Jacob Pfau, Benjamin Hilton and Geoffrey Irving

8 May 2025 15:02 UTC

62 points

21 comments, 25 min read

(arxiv.org)

UK AISI’s Alignment Team: Research Agenda

Benjamin Hilton, Jacob Pfau, Marie_DB and Geoffrey Irving

7 May 2025 16:33 UTC

115 points

3 comments, 11 min read

How to evaluate control measures for LLM agents? A trajectory from today to superintelligence

Tomek Korbak, Mikita Balesni, Buck and Geoffrey Irving

14 Apr 2025 16:45 UTC

29 points

1 comment, 2 min read

Prospects for Alignment Automation: Interpretability Case Study

Jacob Pfau and Geoffrey Irving

21 Mar 2025 14:05 UTC

33 points

5 comments, 8 min read

A sketch of an AI control safety case

Tomek Korbak, joshc, Benjamin Hilton, Buck and Geoffrey Irving

30 Jan 2025 17:28 UTC

61 points

0 comments, 5 min read

Eliciting bad contexts

Geoffrey Irving, Joseph Bloom and Tomek Korbak

24 Jan 2025 10:39 UTC

37 points

9 comments, 3 min read

Automation collapse

Geoffrey Irving, Tomek Korbak and Benjamin Hilton

21 Oct 2024 14:50 UTC

72 points

9 comments, 7 min read

Debate, Oracles, and Obfuscated Arguments

Jonah Brown-Cohen and Geoffrey Irving

20 Jun 2024 23:14 UTC

45 points

4 comments, 21 min read

Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla

Neel Nanda, Tom Lieberum, Matthew Rahtz, János Kramár, Geoffrey Irving, Rohin Shah and Vlad Mikulik

20 Jul 2023 10:50 UTC

44 points

3 comments, 2 min read

(arxiv.org)

DeepMind is hiring for the Scalable Alignment and Alignment Teams

Rohin Shah and Geoffrey Irving

13 May 2022 12:17 UTC

150 points

34 comments, 9 min read

Learning the smooth prior

Geoffrey Irving, Rohin Shah and evhub

29 Apr 2022 21:10 UTC

36 points

0 comments, 12 min read