RSS

ChengCheng

Karma: 155

Avoid­ing AI De­cep­tion: Lie De­tec­tors can ei­ther In­duce Hon­esty or Evasion

5 Jun 2025 23:07 UTC
22 points
2 comments5 min readLW link
(far.ai)

Illu­sory Safety: Redteam­ing Deep­Seek R1 and the Strongest Fine-Tun­able Models of OpenAI, An­thropic, and Google

7 Feb 2025 3:57 UTC
37 points
0 comments10 min readLW link

GPT-4o Guardrails Gone: Data Poi­son­ing & Jailbreak-Tuning

1 Nov 2024 0:10 UTC
18 points
0 comments6 min readLW link
(far.ai)

Pac­ing Out­side the Box: RNNs Learn to Plan in Sokoban

25 Jul 2024 22:00 UTC
59 points
8 comments2 min readLW link
(arxiv.org)

Does ro­bust­ness im­prove with scale?

25 Jul 2024 20:55 UTC
14 points
0 comments1 min readLW link
(far.ai)

VLM-RM: Spec­i­fy­ing Re­wards with Nat­u­ral Language

23 Oct 2023 14:11 UTC
20 points
2 comments5 min readLW link
(far.ai)

Un­cov­er­ing La­tent Hu­man Wel­lbe­ing in LLM Embeddings

14 Sep 2023 1:40 UTC
32 points
7 comments8 min readLW link
(far.ai)