
Geoffrey Irving

Karma: 873

Chief Scientist at the UK AI Safety Institute (AISI). Previously, DeepMind, OpenAI, Google Brain, etc.

Research Areas in Cognitive Science (The Alignment Project by UK AISI)

Geoffrey Irving, 1 Aug 2025 10:26 UTC
12 points
0 comments, 6 min read
(alignmentproject.aisi.gov.uk)

The Alignment Project by UK AISI

1 Aug 2025 9:52 UTC
28 points
0 comments, 2 min read
(alignmentproject.aisi.gov.uk)

The need to relativise in debate

26 Jun 2025 16:23 UTC
25 points
2 comments, 5 min read

Prover-Estimator Debate: A New Scalable Oversight Protocol

17 Jun 2025 13:53 UTC
88 points
18 comments, 5 min read

Unexploitable search: blocking malicious use of free parameters

21 May 2025 17:23 UTC
34 points
16 comments, 6 min read

Dodging systematic human errors in scalable oversight

Geoffrey Irving, 14 May 2025 15:19 UTC
33 points
3 comments, 4 min read

An alignment safety case sketch based on debate

8 May 2025 15:02 UTC
57 points
21 comments, 25 min read
(arxiv.org)

UK AISI’s Alignment Team: Research Agenda

7 May 2025 16:33 UTC
113 points
2 comments, 11 min read

How to evaluate control measures for LLM agents? A trajectory from today to superintelligence

14 Apr 2025 16:45 UTC
29 points
1 comment, 2 min read

Prospects for Alignment Automation: Interpretability Case Study

21 Mar 2025 14:05 UTC
32 points
5 comments, 8 min read

A sketch of an AI control safety case

30 Jan 2025 17:28 UTC
57 points
0 comments, 5 min read

Eliciting bad contexts

24 Jan 2025 10:39 UTC
35 points
9 comments, 3 min read

Automation collapse

21 Oct 2024 14:50 UTC
72 points
9 comments, 7 min read

Debate, Oracles, and Obfuscated Arguments

20 Jun 2024 23:14 UTC
44 points
4 comments, 21 min read

Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla

20 Jul 2023 10:50 UTC
44 points
3 comments, 2 min read
(arxiv.org)

DeepMind is hiring for the Scalable Alignment and Alignment Teams

13 May 2022 12:17 UTC
150 points
34 comments, 9 min read

Learning the smooth prior

29 Apr 2022 21:10 UTC
35 points
0 comments, 12 min read