David Lindner

Karma: 483

Alignment researcher at Google DeepMind

Early Signs of Steganographic Capabilities in Frontier LLMs

Kei Nishimura-Gasparian, Artur Zolkowski, robert mccarthy and David Lindner

4 Jul 2025 16:36 UTC

30 points

5 comments · 2 min read · LW link

MONA: Three Month Later—Updates and Steganography Without Optimization Pressure

David Lindner and Vikrant Varma

12 Apr 2025 23:15 UTC

31 points

0 comments · 5 min read · LW link

Can LLMs learn Steganographic Reasoning via RL?

robert mccarthy, Vasil Georgiev, Steven Basart and David Lindner

11 Apr 2025 16:33 UTC

29 points

3 comments · 6 min read · LW link

MONA: Managed Myopia with Approval Feedback

Seb Farquhar, David Lindner and Rohin Shah

23 Jan 2025 12:24 UTC

81 points

30 comments · 9 min read · LW link

On scalable oversight with weak LLMs judging strong LLMs

zac_kenton, Noah Siegel, janos, Jonah Brown-Cohen, Samuel Albanie, David Lindner and Rohin Shah

8 Jul 2024 8:59 UTC

49 points

18 comments · 7 min read · LW link

(arxiv.org)

David Lindner 23 Oct 2023 22:01 UTC
3 points
in reply to: Zane’s comment on: VLM-RM: Specifying Rewards with Natural Language
The agents are rewarded at every timestep and we want them to perform the task throughout the whole episode, so falling over is definitely not what we want. But this has more to do with the policy optimization failing than with the reward model. In other words, a policy that doesn’t fall over would achieve higher reward than the policies we actually learn. For example, if we plot the CLIP reward over one episode, it typically drops at the end of the episode if the agent falls down.
We tried some tricks to improve the training, such as providing a curriculum starting from short episodes to longer ones. This worked decently well and made the agents fall over less, but we ended up not using it in the final experiments because we primarily wanted to show that it works well with off-the-shelf RL algorithms.

VLM-RM: Specifying Rewards with Natural Language

ChengCheng, David Lindner and Ethan Perez

23 Oct 2023 14:11 UTC

20 points

2 comments · 5 min read · LW link

(far.ai)

David Lindner 29 Mar 2023 20:32 UTC
LW: 2 AF: 2
in reply to: Buck’s comment on: Practical Pitfalls of Causal Scrubbing
Thanks, that’s a useful alternative framing of CaSc!
FWIW, I think this adversarial version of CaSc would avoid the main examples in our post where CaSc fails to reject a false hypothesis. The common feature of our examples is “cancellation”, which comes from looking at an average CaSc loss. If you only look at the loss of the worst experiment (so the maximum CaSc loss rather than the average one), you don’t get these kinds of cancellation problems.
Plausibly you’d run into different failure modes though; in particular, I guess the maximum measure is less smooth and gives you less information on “how wrong” your hypothesis is.
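A toy sketch of the cancellation point, under stated assumptions: the loss values, the baseline, and the helper names `average_gap` and `worst_case_gap` are all made up for illustration and are not taken from the post or from any CaSc implementation. It only shows how per-experiment deviations in opposite directions can vanish in an average while still showing up in a maximum.

```python
from statistics import mean

# Hypothetical numbers (assumption): loss of the unscrubbed model, and the
# loss of the scrubbed model in four individual resampling experiments.
baseline_loss = 1.0
scrubbed_losses = [0.5, 1.5, 0.5, 1.5]


def average_gap(losses, baseline):
    """Gap between the mean scrubbed loss and the baseline loss."""
    return mean(losses) - baseline


def worst_case_gap(losses, baseline):
    """Gap between the worst (largest) scrubbed loss and the baseline loss."""
    return max(losses) - baseline


avg_gap = average_gap(scrubbed_losses, baseline_loss)
max_gap = worst_case_gap(scrubbed_losses, baseline_loss)

# Experiments that do better and worse than the baseline cancel in the mean,
# so the average gap is zero, while the worst-case gap is not.
print(f"average gap:    {avg_gap:.2f}")
print(f"worst-case gap: {max_gap:.2f}")
```

In this toy setting the average reports no discrepancy, while the maximum flags the experiments where the scrubbed model is worse. The maximum depends on a single experiment, which is one way to read the concern about it being less smooth.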

David Lindner 27 Mar 2023 19:11 UTC
1 point
in reply to: Buck’s comment on: Practical Pitfalls of Causal Scrubbing
Yes, this seems like a plausible confusion. Your interpretation of what we mean is correct.

Practical Pitfalls of Causal Scrubbing

Jérémy Scheurer, Phil3, tony, jacquesthibs and David Lindner

27 Mar 2023 7:47 UTC

87 points

17 comments · 13 min read · LW link

Threat Model Literature Review

zac_kenton, Rohin Shah, David Lindner, Vikrant Varma, Vika, Mary Phuong, Ramana Kumar and Elliot Catt

1 Nov 2022 11:03 UTC

79 points

4 comments · 25 min read · LW link

Clarifying AI X-risk

zac_kenton, Rohin Shah, David Lindner, Vikrant Varma, Vika, Mary Phuong, Ramana Kumar and Elliot Catt

1 Nov 2022 11:03 UTC

127 points

24 comments · 4 min read · LW link · 1 review