Felix Hofstätter

Karma: 384

Stress Testing Deliberative Alignment for Anti-Scheming Training

Mikita Balesni, Bronson Schoen, Marius Hobbhahn, Axel Højmark, AlexMeinke, Teun van der Weij, Jérémy Scheurer, Felix Hofstätter, Nicholas Goldowsky-Dill, rusheb, Andrei Matveiakin, jenny and alex.lloyd

17 Sep 2025 16:59 UTC

133 points

19 comments

1 min read

(antischeming.ai)

Can SAE steering reveal sandbagging?

jordinne, Hoang Khiem, Felix Hofstätter and Cleo Nardo

15 Apr 2025 12:33 UTC

36 points

3 comments

4 min read

The Elicitation Game: Evaluating capability elicitation techniques

Teun van der Weij, Felix Hofstätter, JaydenTeoh, HenningB and Francis Rhys Ward

27 Feb 2025 20:33 UTC

15 points

1 comment

2 min read

[Paper] AI Sandbagging: Language Models can Strategically Underperform on Evaluations

Teun van der Weij, Felix Hofstätter, Ollie J, Sam F. Brown and Francis Rhys Ward

13 Jun 2024 10:04 UTC

84 points

10 comments

2 min read

(arxiv.org)

An Introduction to AI Sandbagging

Teun van der Weij, Felix Hofstätter and Francis Rhys Ward

26 Apr 2024 13:40 UTC

50 points

13 comments

8 min read

Simple distribution approximation: When sampled 100 times, can language models yield 80% A and 20% B?

Teun van der Weij, Felix Hofstätter and Francis Rhys Ward

29 Jan 2024 0:24 UTC

39 points

5 comments

4 min read

Tall Tales at Different Scales: Evaluating Scaling Trends For Deception In Language Models

Felix Hofstätter, Francis Rhys Ward, HarrietW, LAThomson, Ollie J, Patrik Bartak and Sam F. Brown

8 Nov 2023 11:37 UTC

49 points

0 comments

18 min read

Understanding the Information Flow inside Large Language Models

Felix Hofstätter and cozyfractal

15 Aug 2023 21:13 UTC

19 points

0 comments

17 min read

Explaining the Transformer Circuits Framework by Example

Felix Hofstätter

25 Apr 2023 13:45 UTC

9 points

1 comment

15 min read

Reflections On The Feasibility Of Scalable-Oversight

Felix Hofstätter

10 Mar 2023 7:54 UTC

11 points

0 comments

12 min read

An investigation into when agents may be incentivized to manipulate our beliefs.

Felix Hofstätter

13 Sep 2022 17:08 UTC

15 points

0 comments

14 min read

On Preference Manipulation in Reward Learning Processes

Felix Hofstätter

15 Aug 2022 19:32 UTC

8 points

0 comments

4 min read