robertzk

Karma: 826

Paraphrasing Is (At Best) a Partial Defence Against Steganography in LLMs

Usman Anwar and robertzk

3 May 2026 7:53 UTC

14 points

0 comments · 8 min read

Resisting Reality

robertzk

22 Jan 2026 13:50 UTC

26 points

3 comments · 6 min read

robertzk’s Shortform

robertzk

8 Dec 2025 17:27 UTC

5 points

27 comments · 1 min read

SAEs are highly dataset dependent: a case study on the refusal direction

Connor Kissane, robertzk, Neel Nanda and Arthur Conmy

7 Nov 2024 5:22 UTC

67 points

4 comments · 14 min read

Open Source Replication of Anthropic’s Crosscoder paper for model-diffing

Connor Kissane, robertzk, Arthur Conmy and Neel Nanda

27 Oct 2024 18:46 UTC

48 points

4 comments · 5 min read

Base LLMs refuse too

Connor Kissane, robertzk, Arthur Conmy and Neel Nanda

29 Sep 2024 16:04 UTC

61 points

20 comments · 10 min read

SAEs (usually) Transfer Between Base and Chat Models

Connor Kissane, robertzk, Arthur Conmy and Neel Nanda

18 Jul 2024 10:29 UTC

67 points

0 comments · 10 min read

Attention Output SAEs Improve Circuit Analysis

Connor Kissane, robertzk, Arthur Conmy and Neel Nanda

21 Jun 2024 12:56 UTC

33 points

3 comments · 19 min read

We Inspected Every Head In GPT-2 Small using SAEs So You Don’t Have To

robertzk, Connor Kissane, Arthur Conmy and Neel Nanda

6 Mar 2024 5:03 UTC

63 points

0 comments · 12 min read

Attention SAEs Scale to GPT-2 Small

Connor Kissane, robertzk, Arthur Conmy and Neel Nanda

3 Feb 2024 6:50 UTC

78 points

4 comments · 8 min read

Sparse Autoencoders Work on Attention Layer Outputs

Connor Kissane, robertzk, Arthur Conmy and Neel Nanda

16 Jan 2024 0:26 UTC

85 points

9 comments · 18 min read

Training Process Transparency through Gradient Interpretability: Early experiments on toy language models

robertzk and evhub

21 Jul 2023 14:52 UTC

56 points

1 comment · 1 min read

Getting up to Speed on the Speed Prior in 2022

robertzk

28 Dec 2022 7:49 UTC

36 points

5 comments · 65 min read

Emily Brontë on: Psychology Required for Serious™ AGI Safety Research

robertzk

14 Sep 2022 14:47 UTC

2 points

0 comments · 1 min read

Ask LW: ω-self-aware systems

robertzk

16 Dec 2012 22:18 UTC

0 points

10 comments · 1 min read

The rationalist’s checklist

robertzk

16 Dec 2011 16:21 UTC

45 points

8 comments · 1 min read