ChengCheng

Karma: 155

Avoiding AI Deception: Lie Detectors can either Induce Honesty or Evasion

ChengCheng, ChrisCundy, smallsilo and AdamGleave

5 Jun 2025 23:07 UTC

22 points

2 comments · 5 min read · LW link

(far.ai)

Illusory Safety: Redteaming DeepSeek R1 and the Strongest Fine-Tunable Models of OpenAI, Anthropic, and Google

ChengCheng, Brendan Murphy, Adrià Garriga-alonso, Yashvardhan Sharma, dsbowen, smallsilo, Yawen Duan, ChrisCundy, Hannah Betts, AdamGleave and Kellin Pelrine

7 Feb 2025 3:57 UTC

37 points

0 comments · 10 min read · LW link

GPT-4o Guardrails Gone: Data Poisoning & Jailbreak-Tuning

ChengCheng, Brendan Murphy, AdamGleave and Kellin Pelrine

1 Nov 2024 0:10 UTC

18 points

0 comments · 6 min read · LW link

(far.ai)

Pacing Outside the Box: RNNs Learn to Plan in Sokoban

Adrià Garriga-alonso, taufeeque, AdamGleave and ChengCheng

25 Jul 2024 22:00 UTC

59 points

8 comments · 2 min read · LW link

(arxiv.org)

Does robustness improve with scale?

ChengCheng, niki.h, Ian McKenzie, Oskar Hollinsworth, Tom Tseng and AdamGleave

25 Jul 2024 20:55 UTC

14 points

0 comments · 1 min read · LW link

(far.ai)

VLM-RM: Specifying Rewards with Natural Language

ChengCheng, David Lindner and Ethan Perez

23 Oct 2023 14:11 UTC

20 points

2 comments · 5 min read · LW link

(far.ai)

Uncovering Latent Human Wellbeing in LLM Embeddings

ChengCheng, Pedro Freire, Dan H and Scott Emmons

14 Sep 2023 1:40 UTC

32 points

7 comments · 8 min read · LW link

(far.ai)

ChengCheng 31 Mar 2023 0:46 UTC
4 points
0
on: Speed running everyone through the bad alignement bingo. $5k bounty for a LW conversational agent
First of all, thank you @ArthurB for offering this bounty and raising awareness of the need for quality AI alignment educational resources! We are particularly grateful to those who mentioned the Stampy project and to the people who have reached out offering to help with our efforts. Our submission, https://chat.stampy.ai/, is a very early prototype focused primarily on summarizing and synthesizing information from our own database of FAQs, along with selected documents collected from the alignment research dataset. The conversational feature still requires considerable work. Nevertheless, we would love to get input and feedback so we can further develop this tool for anyone seeking to better understand or contribute to AI safety. This would not have been possible without the support of our volunteers and collaborators. We welcome all who are interested in using AI to advance alignment.