Teun van der Weij
Karma: 341
Research scientist at Apollo Research.
Stress Testing Deliberative Alignment for Anti-Scheming Training
Mikita Balesni, Bronson Schoen, Marius Hobbhahn, Axel Højmark, AlexMeinke, Teun van der Weij, Jérémy Scheurer, Felix Hofstätter, Nicholas Goldowsky-Dill, rusheb, Andrei Matveiakin, jenny and alex.lloyd
17 Sep 2025 16:59 UTC, 124 points, 13 comments, 1 min read (antischeming.ai)
How to mitigate sandbagging
Teun van der Weij
23 Mar 2025 17:19 UTC, 30 points, 0 comments, 8 min read
Teun van der Weij’s Shortform
Teun van der Weij
14 Mar 2025 3:54 UTC, 3 points, 1 comment, 1 min read
The Elicitation Game: Evaluating capability elicitation techniques
Teun van der Weij, Felix Hofstätter, JaydenTeoh, HenningB and Francis Rhys Ward
27 Feb 2025 20:33 UTC, 10 points, 1 comment, 2 min read
[Paper] AI Sandbagging: Language Models can Strategically Underperform on Evaluations
Teun van der Weij, Felix Hofstätter, Ollie J, Sam F. Brown and Francis Rhys Ward
13 Jun 2024 10:04 UTC, 84 points, 10 comments, 2 min read (arxiv.org)
An Introduction to AI Sandbagging
Teun van der Weij, Felix Hofstätter and Francis Rhys Ward
26 Apr 2024 13:40 UTC, 49 points, 13 comments, 8 min read
Simple distribution approximation: When sampled 100 times, can language models yield 80% A and 20% B?
Teun van der Weij, Felix Hofstätter and Francis Rhys Ward
29 Jan 2024 0:24 UTC, 39 points, 5 comments, 4 min read
List of projects that seem impactful for AI Governance
JaimeRV and Teun van der Weij
14 Jan 2024 16:53 UTC, 14 points, 0 comments, 13 min read
Evaluating Language Model Behaviours for Shutdown Avoidance in Textual Scenarios
Simon Lermen, Teun van der Weij and Leon Lang
16 May 2023 10:53 UTC, 26 points, 0 comments, 13 min read