
Verification

Last edit: 9 Jul 2022 14:37 UTC by Tor Økland Barstad

Validating against a misalignment detector is very different to training against one

mattmacdermott · 4 Mar 2025 15:41 UTC
53 points
6 comments · 4 min read · LW link

The V&V method—A step towards safer AGI

Yoav Hollander · 24 Jun 2025 13:42 UTC
20 points
1 comment · 1 min read · LW link
(blog.foretellix.com)

I applied Audit standards to LLM agents. It reliably exposes hidden assumptions.

Lei Wang · 17 Jan 2026 4:40 UTC
1 point
0 comments · 2 min read · LW link

Safe Recursive Self-Improvement with Verified Compilers

Adam Chlipala · 24 Mar 2026 13:35 UTC
15 points
0 comments · 11 min read · LW link

Alignment with argument-networks and assessment-predictions

Tor Økland Barstad · 13 Dec 2022 2:17 UTC
10 points
5 comments · 45 min read · LW link

All hands on deck to build the datacenter lie detector

Naci Cankaya · 19 Feb 2026 11:42 UTC
31 points
2 comments · 5 min read · LW link
(open.substack.com)

Make Powerful Machines Verifiable

Naci Cankaya · 4 Mar 2026 14:20 UTC
22 points
4 comments · 4 min read · LW link

Mapping the Constrained Autonomy Gradient: A Collaborative, Minimal-Scale AGI Safety Benchmark

ViceStudioPub · 15 Jan 2026 9:46 UTC
1 point
0 comments · 4 min read · LW link

Compact Proofs of Model Performance via Mechanistic Interpretability

24 Jun 2024 19:27 UTC
104 points
4 comments · 8 min read · LW link
(arxiv.org)

Formal verification, heuristic explanations and surprise accounting

Jacob_Hilton · 25 Jun 2024 15:40 UTC
168 points
11 comments · 9 min read · LW link
(www.alignment.org)

Making it harder for an AGI to “trick” us, with STVs

Tor Økland Barstad · 9 Jul 2022 14:42 UTC
15 points
5 comments · 22 min read · LW link