
Felix Hofstätter

Karma: 128

Tall Tales at Different Scales: Evaluating Scaling Trends For Deception In Language Models

8 Nov 2023 11:37 UTC
49 points
0 comments · 18 min read · LW link

Understanding the Information Flow inside Large Language Models

15 Aug 2023 21:13 UTC
19 points
0 comments · 17 min read · LW link

An investigation into when agents may be incentivized to manipulate our beliefs.

Felix Hofstätter · 13 Sep 2022 17:08 UTC
15 points
0 comments · 14 min read · LW link

Reflections On The Feasibility Of Scalable-Oversight

Felix Hofstätter · 10 Mar 2023 7:54 UTC
11 points
0 comments · 12 min read · LW link

Explaining the Transformer Circuits Framework by Example

Felix Hofstätter · 25 Apr 2023 13:45 UTC
8 points
0 comments · 15 min read · LW link

On Preference Manipulation in Reward Learning Processes

Felix Hofstätter · 15 Aug 2022 19:32 UTC
8 points
0 comments · 4 min read · LW link