Value learning in the absence of ground truth

Joel_Saarinen · 5 Feb 2024 18:56 UTC
47 points
8 comments · 45 min read · LW link

Implementing activation steering

Annah · 5 Feb 2024 17:51 UTC
59 points
5 comments · 7 min read · LW link

AI alignment as a translation problem

Roman Leventov · 5 Feb 2024 14:14 UTC
21 points
2 comments · 3 min read · LW link

Safe Stasis Fallacy

Davidmanheim · 5 Feb 2024 10:54 UTC
54 points
2 comments · 1 min read · LW link

[Question] How has internalising a post-AGI world affected your current choices?

yanni kyriacos · 5 Feb 2024 5:43 UTC
10 points
8 comments · 1 min read · LW link

A thought experiment for comparing “biological” vs “digital” intelligence increase/explosion

Super AGI · 5 Feb 2024 4:57 UTC
6 points
3 comments · 1 min read · LW link

Noticing Panic

Cole Wyeth · 5 Feb 2024 3:45 UTC
55 points
8 comments · 3 min read · LW link

EA/ACX/LW February Santa Cruz Meetup

madmail · 4 Feb 2024 23:26 UTC
1 point
0 comments · 1 min read · LW link

Vitalia Rationality Meetup

veronica · 4 Feb 2024 19:46 UTC
1 point
0 comments · 1 min read · LW link

Personal predictions

Daniele De Nuntiis · 4 Feb 2024 3:59 UTC
2 points
2 comments · 3 min read · LW link

A sketch of acausal trade in practice

Richard_Ngo · 4 Feb 2024 0:32 UTC
32 points
4 comments · 7 min read · LW link

Brute Force Manufactured Consensus is Hiding the Crime of the Century

Roko · 3 Feb 2024 20:36 UTC
220 points
156 comments · 9 min read · LW link

My thoughts on the Beff Jezos - Connor Leahy debate

Ariel Kwiatkowski · 3 Feb 2024 19:47 UTC
−5 points
23 comments · 4 min read · LW link

The Journal of Dangerous Ideas

rogersbacon · 3 Feb 2024 15:40 UTC
−25 points
4 comments · 5 min read · LW link
(www.secretorum.life)

Attitudes about Applied Rationality

Camille Berger · 3 Feb 2024 14:42 UTC
108 points
18 comments · 4 min read · LW link

Practicing my Handwriting in 1439

Maxwell Tabarrok · 3 Feb 2024 13:21 UTC
11 points
0 comments · 3 min read · LW link
(www.maximum-progress.com)

Finite Factored Sets to Bayes Nets Part 2

J Bostock · 3 Feb 2024 12:25 UTC
6 points
0 comments · 8 min read · LW link

Why I no longer identify as transhumanist

Kaj_Sotala · 3 Feb 2024 12:00 UTC
54 points
33 comments · 3 min read · LW link
(kajsotala.fi)

Attention SAEs Scale to GPT-2 Small

3 Feb 2024 6:50 UTC
76 points
4 comments · 8 min read · LW link

Why do we need RLHF? Imitation, Inverse RL, and the role of reward

Ran W · 3 Feb 2024 4:00 UTC
12 points
0 comments · 5 min read · LW link

Announcing the London Initiative for Safe AI (LISA)

2 Feb 2024 23:17 UTC
97 points
0 comments · 9 min read · LW link

Survey for alignment researchers!

2 Feb 2024 20:41 UTC
71 points
11 comments · 1 min read · LW link

Voting Results for the 2022 Review

Ben Pace · 2 Feb 2024 20:34 UTC
57 points
3 comments · 73 min read · LW link

On Dwarkesh’s 3rd Podcast With Tyler Cowen

Zvi · 2 Feb 2024 19:30 UTC
36 points
9 comments · 21 min read · LW link
(thezvi.wordpress.com)

Most experts believe COVID-19 was probably not a lab leak

DanielFilan · 2 Feb 2024 19:28 UTC
66 points
89 comments · 2 min read · LW link
(gcrinstitute.org)

What Failure Looks Like is not an existential risk (and alignment is not the solution)

otto.barten · 2 Feb 2024 18:59 UTC
13 points
12 comments · 9 min read · LW link

Solving alignment isn’t enough for a flourishing future

mic · 2 Feb 2024 18:23 UTC
27 points
0 comments · 1 min read · LW link
(papers.ssrn.com)

Manifold Markets

PeterMcCluskey · 2 Feb 2024 17:48 UTC
26 points
9 comments · 4 min read · LW link
(bayesianinvestor.com)

Types of subjective welfare

MichaelStJules · 2 Feb 2024 9:56 UTC
10 points
3 comments · 1 min read · LW link

Open Source Sparse Autoencoders for all Residual Stream Layers of GPT2-Small

Joseph Bloom · 2 Feb 2024 6:54 UTC
95 points
37 comments · 15 min read · LW link

Soft Prompts for Evaluation: Measuring Conditional Distance of Capabilities

porby · 2 Feb 2024 5:49 UTC
43 points
1 comment · 4 min read · LW link
(1drv.ms)

Running a Prediction Market Mafia Game

Arjun Panickssery · 1 Feb 2024 23:24 UTC
22 points
5 comments · 1 min read · LW link
(arjunpanickssery.substack.com)

Evaluating Stability of Unreflective Alignment

james.lucassen · 1 Feb 2024 22:15 UTC
30 points
3 comments · 18 min read · LW link
(jlucassen.com)

Davidad’s Provably Safe AI Architecture - ARIA’s Programme Thesis

simeon_c · 1 Feb 2024 21:30 UTC
69 points
17 comments · 1 min read · LW link
(www.aria.org.uk)

Alignment has a Basin of Attraction: Beyond the Orthogonality Thesis

RogerDearnaley · 1 Feb 2024 21:15 UTC
4 points
15 comments · 13 min read · LW link

OpenAI report also finds no effect of current LLMs on viability of bioterrorism attacks

lberglund · 1 Feb 2024 20:18 UTC
19 points
4 comments · 2 min read · LW link
(openai.com)

Wrong answer bias

lukehmiles · 1 Feb 2024 20:05 UTC
49 points
24 comments · 1 min read · LW link

On Not Requiring Vaccination

jefftk · 1 Feb 2024 19:20 UTC
31 points
21 comments · 1 min read · LW link
(www.jefftk.com)

The economy is mostly newbs (strat predictions)

lukehmiles · 1 Feb 2024 19:15 UTC
27 points
6 comments · 2 min read · LW link

Managing risks while trying to do good

Wei Dai · 1 Feb 2024 18:08 UTC
61 points
26 comments · 1 min read · LW link

Putting multimodal LLMs to the Tetris test

1 Feb 2024 16:02 UTC
30 points
5 comments · 7 min read · LW link

AI #49: Bioweapon Testing Begins

Zvi · 1 Feb 2024 15:30 UTC
37 points
11 comments · 42 min read · LW link
(thezvi.wordpress.com)

Some Notes on Ethics

Pareto Optimal · 1 Feb 2024 10:18 UTC
−3 points
0 comments · 1 min read · LW link
(paretooptimal.substack.com)

Increasingly vague interpersonal welfare comparisons

MichaelStJules · 1 Feb 2024 6:45 UTC
5 points
0 comments · 1 min read · LW link

PIBBSS Speaker events comings up in February

1 Feb 2024 3:28 UTC
10 points
2 comments · 1 min read · LW link

Drone Wars Endgame

RussellThor · 1 Feb 2024 2:30 UTC
34 points
71 comments · 8 min read · LW link

Sequencing Swabs

jefftk · 1 Feb 2024 1:50 UTC
19 points
1 comment · 5 min read · LW link
(www.jefftk.com)

Leading The Parade

johnswentworth · 31 Jan 2024 22:39 UTC
143 points
30 comments · 9 min read · LW link

Proposal for an AI Safety Prize

sweenesm · 31 Jan 2024 18:35 UTC
3 points
0 comments · 2 min read · LW link

Literally Everything is Infinite

Spiral · 31 Jan 2024 18:31 UTC
−10 points
8 comments · 5 min read · LW link