Verification (Tag)
Last edit: 9 Jul 2022 14:37 UTC by Tor Økland Barstad
Validating against a misalignment detector is very different to training against one
by mattmacdermott, 4 Mar 2025 15:41 UTC, 39 points, 4 comments, 4 min read
The V&V method—A step towards safer AGI
by Yoav Hollander, 24 Jun 2025 13:42 UTC, 20 points, 1 comment, 1 min read (blog.foretellix.com)
Alignment with argument-networks and assessment-predictions
by Tor Økland Barstad, 13 Dec 2022 2:17 UTC, 10 points, 5 comments, 45 min read
Compact Proofs of Model Performance via Mechanistic Interpretability
by LawrenceC, rajashree, Adrià Garriga-alonso and Jason Gross, 24 Jun 2024 19:27 UTC, 104 points, 4 comments, 8 min read (arxiv.org)
Formal verification, heuristic explanations and surprise accounting
by Jacob_Hilton, 25 Jun 2024 15:40 UTC, 156 points, 11 comments, 9 min read (www.alignment.org)
Making it harder for an AGI to “trick” us, with STVs
by Tor Økland Barstad, 9 Jul 2022 14:42 UTC, 15 points, 5 comments, 22 min read