RobertKirk
Karma: 321
Research Scientist on the Safeguards team at UK AI Security Institute
A Sober Look at Steering Vectors for LLMs
Joschka Braun, Dmitrii Krasheninnikov, Usman Anwar, RobertKirk, Daniel Tan and David Scott Krueger (formerly: capybaralet)
23 Nov 2024 17:30 UTC · 40 points · 0 comments · 5 min read · LW link
Speculative inferences about path dependence in LLM supervised fine-tuning from results on linear mode connectivity and model souping
RobertKirk
20 Jul 2023 9:56 UTC · 39 points · 2 comments · 5 min read · LW link
Causal confusion as an argument against the scaling hypothesis
RobertKirk and David Scott Krueger (formerly: capybaralet)
20 Jun 2022 10:54 UTC · 86 points · 30 comments · 15 min read · LW link
Sparsity and interpretability?
Ada Böhm, RobertKirk and Tomáš Gavenčiak
1 Jun 2020 13:25 UTC · 41 points · 3 comments · 7 min read · LW link
How can Interpretability help Alignment?
RobertKirk and Tomáš Gavenčiak
23 May 2020 16:16 UTC · 37 points · 3 comments · 9 min read · LW link
What is Interpretability?
RobertKirk, Tomáš Gavenčiak and Ada Böhm
17 Mar 2020 20:23 UTC · 39 points · 1 comment · 11 min read · LW link