Verification (Tag)
Last edit: 9 Jul 2022 14:37 UTC by Tor Økland Barstad
Validating against a misalignment detector is very different to training against one
mattmacdermott · 4 Mar 2025 15:41 UTC · 53 points · 6 comments · 4 min read · LW link
The V&V method—A step towards safer AGI
Yoav Hollander · 24 Jun 2025 13:42 UTC · 20 points · 1 comment · 1 min read · LW link (blog.foretellix.com)
I applied Audit standards to LLM agents. It reliably exposes hidden assumptions.
Lei Wang · 17 Jan 2026 4:40 UTC · 1 point · 0 comments · 2 min read · LW link
Safe Recursive Self-Improvement with Verified Compilers
Adam Chlipala · 24 Mar 2026 13:35 UTC · 15 points · 0 comments · 11 min read · LW link
Alignment with argument-networks and assessment-predictions
Tor Økland Barstad · 13 Dec 2022 2:17 UTC · 10 points · 5 comments · 45 min read · LW link
All hands on deck to build the datacenter lie detector
Naci Cankaya · 19 Feb 2026 11:42 UTC · 31 points · 2 comments · 5 min read · LW link (open.substack.com)
Make Powerful Machines Verifiable
Naci Cankaya · 4 Mar 2026 14:20 UTC · 22 points · 4 comments · 4 min read · LW link
Mapping the Constrained Autonomy Gradient: A Collaborative, Minimal-Scale AGI Safety Benchmark
ViceStudioPub · 15 Jan 2026 9:46 UTC · 1 point · 0 comments · 4 min read · LW link
Compact Proofs of Model Performance via Mechanistic Interpretability
LawrenceC, rajashree, Adrià Garriga-alonso and Jason Gross · 24 Jun 2024 19:27 UTC · 104 points · 4 comments · 8 min read · LW link (arxiv.org)
Formal verification, heuristic explanations and surprise accounting
Jacob_Hilton · 25 Jun 2024 15:40 UTC · 168 points · 11 comments · 9 min read · LW link (www.alignment.org)
Making it harder for an AGI to “trick” us, with STVs
Tor Økland Barstad · 9 Jul 2022 14:42 UTC · 15 points · 5 comments · 22 min read · LW link