Jacob Pfau

Karma: 915

UK AISI Alignment Team and NYU PhD student

Research Areas in Methods for Post-training and Elicitation (The Alignment Project by UK AISI)

1 Aug 2025 10:27 UTC
12 points
0 comments · 6 min read · LW link
(alignmentproject.aisi.gov.uk)

Research Areas in Benchmark Design and Evaluation (The Alignment Project by UK AISI)

1 Aug 2025 10:26 UTC
10 points
0 comments · 9 min read · LW link
(alignmentproject.aisi.gov.uk)

Research Areas in Probabilistic Methods (The Alignment Project by UK AISI)

1 Aug 2025 10:26 UTC
3 points
0 comments · 4 min read · LW link
(alignmentproject.aisi.gov.uk)

Research Areas in Evaluation and Guarantees in Reinforcement Learning (The Alignment Project by UK AISI)

1 Aug 2025 9:53 UTC
14 points
0 comments · 11 min read · LW link
(alignmentproject.aisi.gov.uk)

The Alignment Project by UK AISI

1 Aug 2025 9:52 UTC
28 points
0 comments · 2 min read · LW link
(alignmentproject.aisi.gov.uk)

Unexploitable search: blocking malicious use of free parameters

21 May 2025 17:23 UTC
34 points
16 comments · 6 min read · LW link

An alignment safety case sketch based on debate

8 May 2025 15:02 UTC
57 points
21 comments · 25 min read · LW link
(arxiv.org)

UK AISI’s Alignment Team: Research Agenda

7 May 2025 16:33 UTC
113 points
2 comments · 11 min read · LW link

Prospects for Alignment Automation: Interpretability Case Study

21 Mar 2025 14:05 UTC
32 points
5 comments · 8 min read · LW link

Auditing LMs with counterfactual search: a tool for control and ELK

Jacob Pfau · 20 Feb 2024 0:02 UTC
28 points
6 comments · 10 min read · LW link

LM Situational Awareness, Evaluation Proposal: Violating Imitation

Jacob Pfau · 26 Apr 2023 22:53 UTC
16 points
2 comments · 2 min read · LW link

Early situational awareness and its implications, a story

Jacob Pfau · 6 Feb 2023 20:45 UTC
29 points
6 comments · 3 min read · LW link

Jacob Pfau’s Shortform

Jacob Pfau · 17 Jun 2022 16:40 UTC
3 points
19 comments · LW link