RobertKirk
Karma: 321
Research Scientist on the Safeguards team at UK AI Security Institute
A Sober Look at Steering Vectors for LLMs
Joschka Braun, Dmitrii Krasheninnikov, Usman Anwar, RobertKirk, Daniel Tan and David Scott Krueger (formerly: capybaralet)
23 Nov 2024 17:30 UTC · 40 points · 0 comments · 5 min read · LW link
Speculative inferences about path dependence in LLM supervised fine-tuning from results on linear mode connectivity and model souping
RobertKirk
20 Jul 2023 9:56 UTC · 39 points · 2 comments · 5 min read · LW link
Causal confusion as an argument against the scaling hypothesis
RobertKirk and David Scott Krueger (formerly: capybaralet)
20 Jun 2022 10:54 UTC · 86 points · 30 comments · 15 min read · LW link
Sparsity and interpretability?
Ada Böhm, RobertKirk and Tomáš Gavenčiak
1 Jun 2020 13:25 UTC · 41 points · 3 comments · 7 min read · LW link
How can Interpretability help Alignment?
RobertKirk and Tomáš Gavenčiak
23 May 2020 16:16 UTC · 37 points · 3 comments · 9 min read · LW link
What is Interpretability?
RobertKirk, Tomáš Gavenčiak and Ada Böhm
17 Mar 2020 20:23 UTC · 39 points · 1 comment · 11 min read · LW link