janos

Karma: 239

On scalable oversight with weak LLMs judging strong LLMs

zac_kenton, Noah Siegel, janos, Jonah Brown-Cohen, Samuel Albanie, David Lindner and Rohin Shah

8 Jul 2024 8:59 UTC

49 points

18 comments · 7 min read · LW link

(arxiv.org)

Power-seeking can be probable and predictive for trained agents

28 Feb 2023 21:10 UTC

56 points

22 comments · 9 min read · LW link

(arxiv.org)