
rusheb

Karma: 94

A starter guide for evals

8 Jan 2024 18:24 UTC
44 points
2 comments · 12 min read · LW link
(www.apolloresearch.ai)

Scalable And Transferable Black-Box Jailbreaks For Language Models Via Persona Modulation

7 Nov 2023 17:59 UTC
36 points
2 comments · 2 min read · LW link
(arxiv.org)

Understanding mesa-optimization using toy models

7 May 2023 17:00 UTC
42 points
2 comments · 10 min read · LW link