Guaranteed Safe AI

Tag · Last edit: 9 Aug 2024 23:22 UTC by Ben Goldhaber

HERMES: Towards Efficient and Verifiable Mathematical Reasoning in LLMs

Gunnar_Zarncke · 1 Dec 2025 10:07 UTC

8 points

0 comments · 1 min read · LW link

(arxiv.org)

November-December 2024 Progress in Guaranteed Safe AI

Quinn · 22 Jan 2025 1:20 UTC

17 points

0 comments · 4 min read · LW link

(gsai.substack.com)

AXRP Episode 40 - Jason Gross on Compact Proofs and Interpretability

DanielFilan · 28 Mar 2025 18:40 UTC

26 points

0 comments · 89 min read · LW link

Towards Guaranteed Safe AI: A Framework for Ensuring Robust and Reliable AI Systems

Gunnar_Zarncke · 16 May 2024 13:09 UTC

51 points

20 comments · 1 min read · LW link

(arxiv.org)

Towards Guaranteed Safe AI: A Framework for Ensuring Robust and Reliable AI Systems

Joar Skalse · 17 May 2024 19:13 UTC

67 points

10 comments · 2 min read · LW link

In response to critiques of Guaranteed Safe AI

Nora_Ammann · 31 Jan 2025 1:43 UTC

44 points

14 comments · 26 min read · LW link

Agent foundations: not really math, not really science

Alex_Altair · 17 Aug 2025 5:48 UTC

121 points

29 comments · 5 min read · LW link

The Epistemic Collapse of Aligned Models: Why RLHF Guarantees Model Blindness in ASI

Ivan Demirev · 22 Feb 2026 3:55 UTC

1 point

0 comments · 2 min read · LW link

Mitigating Agent Drift with Holographic Invariant Storage (HIS)

Belverith A. S. Synthette · 12 Feb 2026 15:53 UTC

1 point

0 comments · 1 min read · LW link

Davidad’s Provably Safe AI Architecture—ARIA’s Programme Thesis

simeon_c · 1 Feb 2024 21:30 UTC

69 points

17 comments · 1 min read · LW link

(www.aria.org.uk)

Provably Safe AI

PeterMcCluskey · 5 Oct 2023 22:18 UTC

37 points

15 comments · 4 min read · LW link

(bayesianinvestor.com)

Can a Bayesian Oracle Prevent Harm from an Agent? (Bengio et al. 2024)

mattmacdermott · 1 Sep 2024 7:46 UTC

28 points

0 comments · 5 min read · LW link

(yoshuabengio.org)

Limitations on Formal Verification for AI Safety

Andrew Dickson · 19 Aug 2024 23:03 UTC

135 points

60 comments · 23 min read · LW link

Topological Debate Framework

lunatic_at_large · 16 Jan 2025 17:19 UTC

10 points

5 comments · 9 min read · LW link

Provably Safe AI: Worldview and Projects

Ben Goldhaber and Steve_Omohundro

9 Aug 2024 23:21 UTC

58 points

44 comments · 7 min read · LW link
