rusheb

Karma: 487

Stress Testing Deliberative Alignment for Anti-Scheming Training

Sep 17, 2025, 4:59 PM
124 points

37 votes

14 comments · 1 min read · LW link
(antischeming.ai)

Ablations for “Frontier Models are Capable of In-context Scheming”

Dec 17, 2024, 11:58 PM
115 points

35 votes

1 comment · 2 min read · LW link

Frontier Models are Capable of In-context Scheming

Dec 5, 2024, 10:11 PM
210 points

74 votes

24 comments · 7 min read · LW link

Apollo Research 1-year update

May 29, 2024, 5:44 PM
93 points

46 votes

0 comments · 7 min read · LW link

A starter guide for evals

Jan 8, 2024, 6:24 PM
57 points

32 votes

2 comments · 12 min read · LW link
(www.apolloresearch.ai)

Scalable And Transferable Black-Box Jailbreaks For Language Models Via Persona Modulation

Nov 7, 2023, 5:59 PM
38 points

22 votes

2 comments · 2 min read · LW link
(arxiv.org)

Understanding mesa-optimization using toy models

May 7, 2023, 5:00 PM
46 points

27 votes

6 comments · 10 min read · LW link