Felix Hofstätter
Karma: 160
Tall Tales at Different Scales: Evaluating Scaling Trends For Deception In Language Models
Felix Hofstätter, Francis Rhys Ward, HarrietW, LAThomson, Ollie J, Patrik Bartak and Sam F. Brown
8 Nov 2023 11:37 UTC; 49 points; 0 comments; 18 min read
Understanding the Information Flow inside Large Language Models
Felix Hofstätter and cozyfractal
15 Aug 2023 21:13 UTC; 19 points; 0 comments; 17 min read
An investigation into when agents may be incentivized to manipulate our beliefs.
Felix Hofstätter
13 Sep 2022 17:08 UTC; 15 points; 0 comments; 14 min read
Reflections On The Feasibility Of Scalable-Oversight
Felix Hofstätter
10 Mar 2023 7:54 UTC; 11 points; 0 comments; 12 min read
Explaining the Transformer Circuits Framework by Example
Felix Hofstätter
25 Apr 2023 13:45 UTC; 8 points; 0 comments; 15 min read
On Preference Manipulation in Reward Learning Processes
Felix Hofstätter
15 Aug 2022 19:32 UTC; 8 points; 0 comments; 4 min read