Felix Hofstätter

Karma: 370

Stress Testing Deliberative Alignment for Anti-Scheming Training

Sep 17, 2025, 4:59 PM
125 points

38 votes

Overall karma indicates overall quality.

19 comments · 1 min read · LW link
(antischeming.ai)

Can SAE steering reveal sandbagging?

Apr 15, 2025, 12:33 PM
35 points

11 votes

3 comments · 4 min read · LW link

The Elicitation Game: Evaluating capability elicitation techniques

Feb 27, 2025, 8:33 PM
10 points

5 votes

1 comment · 2 min read · LW link

[Paper] AI Sandbagging: Language Models can Strategically Underperform on Evaluations

Jun 13, 2024, 10:04 AM
84 points

35 votes

10 comments · 2 min read · LW link
(arxiv.org)

An Introduction to AI Sandbagging

Apr 26, 2024, 1:40 PM
50 points

24 votes

13 comments · 8 min read · LW link

Simple distribution approximation: When sampled 100 times, can language models yield 80% A and 20% B?

Jan 29, 2024, 12:24 AM
39 points

20 votes

5 comments · 4 min read · LW link

Tall Tales at Different Scales: Evaluating Scaling Trends For Deception In Language Models

Nov 8, 2023, 11:37 AM
49 points

21 votes

0 comments · 18 min read · LW link

Understanding the Information Flow inside Large Language Models

Aug 15, 2023, 9:13 PM
19 points

11 votes

0 comments · 17 min read · LW link

Explaining the Transformer Circuits Framework by Example

Felix Hofstätter · Apr 25, 2023, 1:45 PM
9 points

7 votes

1 comment · 15 min read · LW link

Reflections On The Feasibility Of Scalable-Oversight

Felix Hofstätter · Mar 10, 2023, 7:54 AM
11 points

3 votes

0 comments · 12 min read · LW link

An investigation into when agents may be incentivized to manipulate our beliefs.

Felix Hofstätter · Sep 13, 2022, 5:08 PM
15 points

5 votes

0 comments · 14 min read · LW link

On Preference Manipulation in Reward Learning Processes

Felix Hofstätter · Aug 15, 2022, 7:32 PM
8 points

4 votes

0 comments · 4 min read · LW link