RSS

Verification

TagLast edit: 9 Jul 2022 14:37 UTC by Tor Økland Barstad

Val­i­dat­ing against a mis­al­ign­ment de­tec­tor is very differ­ent to train­ing against one

mattmacdermott4 Mar 2025 15:41 UTC
39 points
4 comments4 min readLW link

The V&V method—A step to­wards safer AGI

Yoav Hollander24 Jun 2025 13:42 UTC
20 points
1 comment1 min readLW link
(blog.foretellix.com)

Align­ment with ar­gu­ment-net­works and as­sess­ment-predictions

Tor Økland Barstad13 Dec 2022 2:17 UTC
10 points
5 comments45 min readLW link

Com­pact Proofs of Model Perfor­mance via Mechanis­tic Interpretability

24 Jun 2024 19:27 UTC
104 points
4 comments8 min readLW link
(arxiv.org)

For­mal ver­ifi­ca­tion, heuris­tic ex­pla­na­tions and sur­prise accounting

Jacob_Hilton25 Jun 2024 15:40 UTC
156 points
11 comments9 min readLW link
(www.alignment.org)

Mak­ing it harder for an AGI to “trick” us, with STVs

Tor Økland Barstad9 Jul 2022 14:42 UTC
15 points
5 comments22 min readLW link
No comments.