smallsilo
Karma: 150
AI safety communications at FAR.AI. Previously at AISafety.info.
Layered AI Defenses Have Holes: Vulnerabilities and Key Recommendations
smallsilo, Ian McKenzie, Oskar Hollinsworth, Tom Tseng, Xander Davies, scasper, Aaron Tucker, Robert Kirk and Adam Gleave
4 Jul 2025 0:07 UTC, 13 points, 1 comment, 4 min read (far.ai)
Avoiding AI Deception: Lie Detectors can either Induce Honesty or Evasion
ChengCheng, ChrisCundy, smallsilo and AdamGleave
5 Jun 2025 23:07 UTC, 22 points, 2 comments, 5 min read (far.ai)
Illusory Safety: Redteaming DeepSeek R1 and the Strongest Fine-Tunable Models of OpenAI, Anthropic, and Google
ChengCheng, Brendan Murphy, Adrià Garriga-alonso, Yashvardhan Sharma, dsbowen, smallsilo, Yawen Duan, ChrisCundy, Hannah Betts, AdamGleave and Kellin Pelrine
7 Feb 2025 3:57 UTC, 37 points, 0 comments, 10 min read
AISafety.info Distillation Hackathon
smallsilo
1 Oct 2023 18:54 UTC, 2 points, 0 comments, 1 min read
Join AISafety.info’s Distillation Hackathon (Oct 6-9th)
smallsilo
1 Oct 2023 18:43 UTC, 21 points, 0 comments, 2 min read (forum.effectivealtruism.org)
GPT-powered EA/LW weekly summary
smallsilo
25 Aug 2023 18:19 UTC, 18 points, 1 comment, 11 min read (forum.effectivealtruism.org)
AISafety.info’s Writing & Editing Hackathon
smallsilo
5 Aug 2023 17:14 UTC, 2 points, 0 comments, 1 min read
Join AISafety.info’s Writing & Editing Hackathon (Aug 25-28) (Prizes to be won!)
smallsilo
5 Aug 2023 14:08 UTC, 19 points, 3 comments, 1 min read (forum.effectivealtruism.org)
All AGI Safety questions welcome (especially basic ones) [July 2023]
smallsilo
20 Jul 2023 20:20 UTC, 38 points, 41 comments, 2 min read (forum.effectivealtruism.org)