smallsilo

Karma: 151

AI safety communications at FAR.AI

Previously at AISafety.info

Layered AI Defenses Have Holes: Vulnerabilities and Key Recommendations

smallsilo, Ian McKenzie, Oskar Hollinsworth, Tom Tseng, Xander Davies, scasper, Aaron Tucker, Robert Kirk and Adam Gleave

4 Jul 2025 0:07 UTC

13 points

1 comment · 4 min read · LW link

(far.ai)

Avoiding AI Deception: Lie Detectors can either Induce Honesty or Evasion

ChengCheng, ChrisCundy, smallsilo and AdamGleave

5 Jun 2025 23:07 UTC

22 points

2 comments · 5 min read · LW link

(far.ai)

Illusory Safety: Redteaming DeepSeek R1 and the Strongest Fine-Tunable Models of OpenAI, Anthropic, and Google

ChengCheng, Brendan Murphy, Adrià Garriga-alonso, Yashvardhan Sharma, dsbowen, smallsilo, Yawen Duan, ChrisCundy, Hannah Betts, AdamGleave and Kellin Pelrine

7 Feb 2025 3:57 UTC

37 points

0 comments · 10 min read · LW link

AISafety.info Distillation Hackathon

smallsilo

1 Oct 2023 18:54 UTC

2 points

0 comments · 1 min read · LW link

Join AISafety.info’s Distillation Hackathon (Oct 6-9th)

smallsilo

1 Oct 2023 18:43 UTC

21 points

0 comments · 2 min read · LW link

(forum.effectivealtruism.org)

GPT-powered EA/LW weekly summary

smallsilo

25 Aug 2023 18:19 UTC

19 points

1 comment · 11 min read · LW link

(forum.effectivealtruism.org)

AISafety.info’s Writing & Editing Hackathon

smallsilo

5 Aug 2023 17:14 UTC

2 points

0 comments · 1 min read · LW link

Join AISafety.info’s Writing & Editing Hackathon (Aug 25-28) (Prizes to be won!)

smallsilo

5 Aug 2023 14:08 UTC

19 points

3 comments · 1 min read · LW link

(forum.effectivealtruism.org)

All AGI Safety questions welcome (especially basic ones) [July 2023]

smallsilo

20 Jul 2023 20:20 UTC

38 points

41 comments · 2 min read · LW link

(forum.effectivealtruism.org)