Scalable Oversight and Weak-to-Strong Generalization: Compatible approaches to the same problem

Dec 16, 2023, 5:49 AM
76 points
4 comments · 6 min read · LW link · 1 review

The problems with the concept of an infohazard as used by the LW community [Linkpost]

Noosphere89 · Dec 22, 2023, 4:13 PM
75 points
43 comments · 3 min read · LW link
(www.beren.io)

Idealized Agents Are Approximate Causal Mirrors (+ Radical Optimism on Agent Foundations)

Thane Ruthenis · Dec 22, 2023, 8:19 PM
74 points
14 comments · 6 min read · LW link

Neural uncertainty estimation review article (for alignment)

Charlie Steiner · Dec 5, 2023, 8:01 AM
74 points
3 comments · 11 min read · LW link

shoes with springs

bhauth · Dec 30, 2023, 9:46 PM
71 points
9 comments · 4 min read · LW link · 2 reviews
(www.bhauth.com)

Update on Chinese IQ-related gene panels

Lao Mein · Dec 14, 2023, 10:12 AM
70 points
7 comments · 1 min read · LW link

OpenAI: Preparedness framework

Zach Stein-Perlman · Dec 18, 2023, 6:30 PM
70 points
23 comments · 4 min read · LW link
(openai.com)

Finding Sparse Linear Connections between Features in LLMs

Dec 9, 2023, 2:27 AM
70 points
5 comments · 10 min read · LW link

Flagging Potentially Unfair Parenting

jefftk · Dec 26, 2023, 12:40 PM
69 points
1 comment · 1 min read · LW link
(www.jefftk.com)

Meetup Tip: Heartbeat Messages

Screwtape · Dec 7, 2023, 5:18 PM
69 points
4 comments · 3 min read · LW link

We’re all in this together

Tamsin Leake · Dec 5, 2023, 1:57 PM
69 points
65 comments · 2 min read · LW link

Don’t Share Information Exfohazardous on Others’ AI-Risk Models

Thane Ruthenis · Dec 19, 2023, 8:09 PM
68 points
11 comments · 1 min read · LW link

AI #42: The Wrong Answer

Zvi · Dec 14, 2023, 2:50 PM
67 points
6 comments · 54 min read · LW link
(thezvi.wordpress.com)

Out-of-distribution Bioattacks

jefftk · Dec 2, 2023, 12:20 PM
66 points
15 comments · 2 min read · LW link
(www.jefftk.com)

Funding case: AI Safety Camp 10

Dec 12, 2023, 9:08 AM
66 points
5 comments · 6 min read · LW link
(manifund.org)

How LDT helps reduce the AI arms race

Tamsin Leake · Dec 10, 2023, 4:21 PM
65 points
13 comments · 4 min read · LW link
(carado.moe)

Complex systems research as a field (and its relevance to AI Alignment)

Dec 1, 2023, 10:10 PM
65 points
11 comments · 19 min read · LW link

METR is hiring!

Beth Barnes · Dec 26, 2023, 9:00 PM
65 points
1 comment · 1 min read · LW link

E.T. Jaynes Probability Theory: The logic of Science I

Dec 27, 2023, 11:47 PM
63 points
20 comments · 21 min read · LW link

Balsa Update and General Thank You

Zvi · Dec 12, 2023, 8:30 PM
61 points
8 comments · 8 min read · LW link
(thezvi.wordpress.com)

AI Safety Chatbot

Dec 21, 2023, 2:06 PM
61 points
11 comments · 4 min read · LW link

Some negative steganography results

Fabien Roger · Dec 9, 2023, 8:22 PM
60 points
5 comments · 2 min read · LW link

Originality vs. Correctness

Dec 6, 2023, 6:51 PM
60 points
17 comments · 25 min read · LW link

In Defense of Epistemic Empathy

Kevin Dorst · Dec 27, 2023, 4:27 PM
60 points
19 comments · 6 min read · LW link
(kevindorst.substack.com)

Are There Examples of Overhang for Other Technologies?

Jeffrey Heninger · Dec 13, 2023, 9:48 PM
59 points
50 comments · 11 min read · LW link
(blog.aiimpacts.org)

Talk: “AI Would Be A Lot Less Alarming If We Understood Agents”

johnswentworth · Dec 17, 2023, 11:46 PM
58 points
3 comments · 1 min read · LW link
(www.youtube.com)

The LessWrong 2022 Review: Review Phase

RobertM · Dec 22, 2023, 3:23 AM
58 points
7 comments · 2 min read · LW link

Measurement tampering detection as a special case of weak-to-strong generalization

Dec 23, 2023, 12:05 AM
57 points
10 comments · 4 min read · LW link

Meditations on Mot

Richard_Ngo · Dec 4, 2023, 12:19 AM
56 points
11 comments · 8 min read · LW link
(www.mindthefuture.info)

The predictive power of dissipative adaptation

dr_s · Dec 17, 2023, 2:01 PM
56 points
14 comments · 19 min read · LW link

The Best of Don’t Worry About the Vase

Zvi · Dec 13, 2023, 12:50 PM
55 points
4 comments · 13 min read · LW link
(thezvi.wordpress.com)

Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision

leogao · Dec 16, 2023, 5:39 AM
55 points
5 comments · 1 min read · LW link

Anthropical Paradoxes are Paradoxes of Probability Theory

Ape in the coat · Dec 6, 2023, 8:16 AM
55 points
18 comments · 5 min read · LW link

Google Gemini Announced

Jacob G-W · Dec 6, 2023, 4:14 PM
54 points
22 comments · 1 min read · LW link
(blog.google)

the micro-fulfillment cambrian explosion

bhauth · Dec 4, 2023, 1:15 AM
54 points
5 comments · 4 min read · LW link
(www.bhauth.com)

AI #44: Copyright Confrontation

Zvi · Dec 28, 2023, 2:30 PM
54 points
13 comments · 43 min read · LW link
(thezvi.wordpress.com)

2022 (and All Time) Posts by Pingback Count

Raemon · Dec 16, 2023, 9:17 PM
53 points
14 comments · 6 min read · LW link

AI #43: Functional Discoveries

Zvi · Dec 21, 2023, 3:50 PM
52 points
26 comments · 49 min read · LW link
(thezvi.wordpress.com)

Pseudonymity and Accusations

jefftk · Dec 21, 2023, 7:20 PM
52 points
20 comments · 3 min read · LW link
(www.jefftk.com)

n of m ring signatures

DanielFilan · Dec 4, 2023, 8:00 PM
51 points
7 comments · 1 min read · LW link
(danielfilan.com)

Will 2024 be very hot? Should we be worried?

A.H. · Dec 29, 2023, 11:22 AM
51 points
12 comments · 10 min read · LW link

Goal-Completeness is like Turing-Completeness for AGI

Liron · Dec 19, 2023, 6:12 PM
51 points
26 comments · 3 min read · LW link

On OpenAI’s Preparedness Framework

Zvi · Dec 21, 2023, 2:00 PM
51 points
4 comments · 21 min read · LW link
(thezvi.wordpress.com)

The Shortest Path Between Scylla and Charybdis

Thane Ruthenis · Dec 18, 2023, 8:08 PM
50 points
8 comments · 5 min read · LW link

Gemini 1.0

Zvi · Dec 7, 2023, 2:40 PM
50 points
7 comments · 9 min read · LW link
(thezvi.wordpress.com)

Bounty: Diverse hard tasks for LLM agents

Dec 17, 2023, 1:04 AM
49 points
31 comments · 16 min read · LW link

On ‘Responsible Scaling Policies’ (RSPs)

Zvi · Dec 5, 2023, 4:10 PM
48 points
3 comments · 37 min read · LW link
(thezvi.wordpress.com)

What is the next level of rationality?

Dec 12, 2023, 8:14 AM
48 points
24 comments · 7 min read · LW link

If Clarity Seems Like Death to Them

Zack_M_Davis · Dec 30, 2023, 5:40 PM
47 points
192 comments · 87 min read · LW link · 1 review
(unremediatedgender.space)

Environmental allergies are curable? (Sublingual immunotherapy)

Chris Lakin · Dec 26, 2023, 7:05 PM
47 points
10 comments · 1 min read · LW link