Goals selected from learned knowledge: an alternative to RL alignment

Seth Herd · 15 Jan 2024 21:52 UTC
40 points
17 comments · 7 min read · LW link

Introducing REBUS: A Robust Evaluation Benchmark of Understanding Symbols

15 Jan 2024 21:21 UTC
33 points
0 comments · 1 min read · LW link

Live Sound: Big-O Improvements

jefftk · 15 Jan 2024 19:50 UTC
8 points
0 comments · 1 min read · LW link
(www.jefftk.com)

Investigating Bias Representations in LLMs via Activation Steering

DawnLu · 15 Jan 2024 19:39 UTC
29 points
4 comments · 5 min read · LW link

Sparse MLP Distillation

slavachalnev · 15 Jan 2024 19:39 UTC
30 points
3 comments · 6 min read · LW link

Review of Alignment Plan Critiques- December AI-Plans Critique-a-Thon Results

Iknownothing · 15 Jan 2024 19:37 UTC
24 points
0 comments · 25 min read · LW link
(aiplans.substack.com)

[Question] What does it look like for AI to significantly improve human coordination, before superintelligence?

jacobjacob · 15 Jan 2024 19:22 UTC
22 points
2 comments · 1 min read · LW link

Now Accepting Player Applications for Band of Blades

Joe Rogero · 15 Jan 2024 17:58 UTC
2 points
0 comments · 3 min read · LW link

Three Types of Constraints in the Space of Agents

15 Jan 2024 17:27 UTC
26 points
3 comments · 17 min read · LW link

The case for training frontier AIs on Sumerian-only corpus

15 Jan 2024 16:40 UTC
127 points
14 comments · 3 min read · LW link

How to Promote More Productive Dialogue Outside of LessWrong

sweenesm · 15 Jan 2024 14:16 UTC
16 points
4 comments · 2 min read · LW link

[Question] Come and daydream with me about science reform

TeaTieAndHat · 15 Jan 2024 11:09 UTC
9 points
1 comment · 1 min read · LW link

AI doing philosophy = AI generating hands?

Wei Dai · 15 Jan 2024 9:04 UTC
46 points
19 comments · 1 min read · LW link

Even if we lose, we win

Pi Rogers · 15 Jan 2024 2:15 UTC
23 points
17 comments · 4 min read · LW link

Detachment vs attachment [AI risk and mental health]

Neil · 15 Jan 2024 0:41 UTC
14 points
4 comments · 3 min read · LW link

Making up statistics to establish priority on Land Value Tax vs Earned Income Tax Credit vs Social Media Dynamic Regulation

Canucklug · 14 Jan 2024 23:57 UTC
−5 points
2 comments · 7 min read · LW link

Is the universe all there is? ‘Evidence’ for objects outside the universe...

JonathanHall · 14 Jan 2024 23:56 UTC
−4 points
27 comments · 11 min read · LW link

[Question] What is the minimum amount of time travel and resources needed to secure the future?

Perhaps · 14 Jan 2024 22:01 UTC
−3 points
5 comments · 1 min read · LW link

Gothenburg LW / ACX meetup

Stefan · 14 Jan 2024 21:21 UTC
1 point
0 comments · 1 min read · LW link

Gothenburg LW / ACX meetup

Stefan · 14 Jan 2024 21:20 UTC
1 point
1 comment · 1 min read · LW link

D&D.Sci Hypersphere Analysis Part 2: Nonlinear Effects & Interactions

aphyer · 14 Jan 2024 19:59 UTC
23 points
0 comments · 7 min read · LW link

Gender Exploration

sapphire · 14 Jan 2024 18:57 UTC
111 points
25 comments · 5 min read · LW link
(open.substack.com)

List of projects that seem impactful for AI Governance

14 Jan 2024 16:53 UTC
13 points
0 comments · 13 min read · LW link

The Leeroy Jenkins principle: How faulty AI could guarantee “warning shots”

titotal · 14 Jan 2024 15:03 UTC
43 points
5 comments · 1 min read · LW link
(titotal.substack.com)

Notice When People Are Directionally Correct

Chris_Leong · 14 Jan 2024 14:12 UTC
127 points
7 comments · 2 min read · LW link

Corrosive Mnemonics

Epirito · 14 Jan 2024 12:44 UTC
7 points
0 comments · 2 min read · LW link

Against most, but not all, AI risk analogies

Matthew Barnett · 14 Jan 2024 3:36 UTC
62 points
40 comments · 7 min read · LW link

Vote With Your Face

jefftk · 14 Jan 2024 3:30 UTC
11 points
0 comments · 1 min read · LW link
(www.jefftk.com)

Case Studies in Reverse-Engineering Sparse Autoencoder Features by Using MLP Linearization

14 Jan 2024 2:06 UTC
23 points
0 comments · 42 min read · LW link

D&D.Sci Hypersphere Analysis Part 1: Datafields & Preliminary Analysis

aphyer · 13 Jan 2024 20:16 UTC
27 points
1 comment · 5 min read · LW link

Some additional SAE thoughts

Hoagy · 13 Jan 2024 19:31 UTC
28 points
4 comments · 13 min read · LW link

(4 min read) An intuitive explanation of the AI influence situation

trevor · 13 Jan 2024 17:34 UTC
12 points
26 comments · 4 min read · LW link

AI #47: Meet the New Year

Zvi · 13 Jan 2024 16:20 UTC
36 points
7 comments · 57 min read · LW link
(thezvi.wordpress.com)

Takeaways from the NeurIPS 2023 Trojan Detection Competition

mikes · 13 Jan 2024 12:35 UTC
20 points
2 comments · 1 min read · LW link
(confirmlabs.org)

[Question] Why do so many think deception in AI is important?

Prometheus · 13 Jan 2024 8:14 UTC
23 points
12 comments · 1 min read · LW link

Eliminating Cookie Banners is Hard

jefftk · 13 Jan 2024 3:00 UTC
23 points
15 comments · 3 min read · LW link
(www.jefftk.com)

Introducing Alignment Stress-Testing at Anthropic

evhub · 12 Jan 2024 23:51 UTC
179 points
23 comments · 2 min read · LW link

D&D.Sci(-fi): Colonizing the SuperHyperSphere

abstractapplic · 12 Jan 2024 23:36 UTC
46 points
23 comments · 2 min read · LW link

Commonwealth Fusion Systems is the Same Scale as OpenAI

Jeffrey Heninger · 12 Jan 2024 21:43 UTC
22 points
13 comments · 2 min read · LW link

Throughput vs. Latency

12 Jan 2024 21:37 UTC
29 points
2 comments · 13 min read · LW link

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

12 Jan 2024 19:51 UTC
298 points
95 comments · 3 min read · LW link
(arxiv.org)

METAPHILOSOPHY—A Philosophizing through logical consequences

Seremonia · 12 Jan 2024 18:47 UTC
−7 points
7 comments · 1 min read · LW link

Idealism, Realistic & Pragmatic

Seremonia · 12 Jan 2024 18:16 UTC
−7 points
3 comments · 1 min read · LW link

The existential threat of humans.

Spiritus Dei · 12 Jan 2024 17:50 UTC
−24 points
0 comments · 3 min read · LW link

[Question] Concrete examples of doing agentic things?

Jacob G-W · 12 Jan 2024 15:59 UTC
13 points
10 comments · 1 min read · LW link

Land Reclamation is in the 9th Circle of Stagnation Hell

Maxwell Tabarrok · 12 Jan 2024 13:36 UTC
54 points
6 comments · 2 min read · LW link
(maximumprogress.substack.com)

What good is G-factor if you’re dumped in the woods? A field report from a camp counselor.

Hastings · 12 Jan 2024 13:17 UTC
133 points
22 comments · 1 min read · LW link

A Chinese Room Containing a Stack of Stochastic Parrots

RogerDearnaley · 12 Jan 2024 6:29 UTC
18 points
2 comments · 5 min read · LW link

Decent plan prize announcement (1 paragraph, $1k)

lukehmiles · 12 Jan 2024 6:27 UTC
25 points
19 comments · 1 min read · LW link

introduction to solid oxide electrolytes

bhauth · 12 Jan 2024 5:35 UTC
17 points
0 comments · 4 min read · LW link
(www.bhauth.com)