Felix Hofstätter

Karma: 231

[Paper] AI Sandbagging: Language Models can Strategically Underperform on Evaluations

13 Jun 2024 10:04 UTC
77 points
10 comments
2 min read

An Introduction to AI Sandbagging

26 Apr 2024 13:40 UTC
43 points
7 comments
8 min read

Simple distribution approximation: When sampled 100 times, can language models yield 80% A and 20% B?

29 Jan 2024 0:24 UTC
39 points
5 comments
4 min read

Tall Tales at Different Scales: Evaluating Scaling Trends For Deception In Language Models

8 Nov 2023 11:37 UTC
49 points
0 comments
18 min read

Understanding the Information Flow inside Large Language Models

15 Aug 2023 21:13 UTC
19 points
0 comments
17 min read

Explaining the Transformer Circuits Framework by Example

25 Apr 2023 13:45 UTC
8 points
0 comments
15 min read

Reflections On The Feasibility Of Scalable-Oversight

10 Mar 2023 7:54 UTC
11 points
0 comments
12 min read

An investigation into when agents may be incentivized to manipulate our beliefs.

13 Sep 2022 17:08 UTC
15 points
0 comments
14 min read

On Preference Manipulation in Reward Learning Processes

15 Aug 2022 19:32 UTC
8 points
0 comments
4 min read