Francis Rhys Ward

Karma: 380

[Paper] AI Sandbagging: Language Models can Strategically Underperform on Evaluations

13 Jun 2024 10:04 UTC
77 points
10 comments, 2 min read, LW link
(arxiv.org)

An Introduction to AI Sandbagging

26 Apr 2024 13:40 UTC
43 points
7 comments, 8 min read, LW link

Simple distribution approximation: When sampled 100 times, can language models yield 80% A and 20% B?

29 Jan 2024 0:24 UTC
39 points
5 comments, 4 min read, LW link

Tall Tales at Different Scales: Evaluating Scaling Trends For Deception In Language Models

8 Nov 2023 11:37 UTC
49 points
0 comments, 18 min read, LW link

Reward Hacking from a Causal Perspective

21 Jul 2023 18:27 UTC
29 points
5 comments, 7 min read, LW link

Agency from a causal perspective

30 Jun 2023 17:37 UTC
38 points
5 comments, 6 min read, LW link

Causality: A Brief Introduction

20 Jun 2023 15:01 UTC
48 points
18 comments, 6 min read, LW link

Introduction to Towards Causal Foundations of Safe AGI

12 Jun 2023 17:55 UTC
67 points
6 comments, 4 min read, LW link

For every choice of AGI difficulty, conditioning on gradual take-off implies shorter timelines.

21 Apr 2022 7:44 UTC
31 points
13 comments, 3 min read, LW link