Anomalous Tokens in DeepSeek-V3 and r1

henry · 25 Jan 2025 22:55 UTC
144 points
3 comments · 7 min read

The Rising Sea

Jesse Hoogland · 25 Jan 2025 20:48 UTC
97 points
6 comments · 2 min read

Monet: Mixture of Monosemantic Experts for Transformers Explained

CalebMaresca · 25 Jan 2025 19:37 UTC
20 points
2 comments · 11 min read

AI and Non-Existence.

Eleven · 25 Jan 2025 19:36 UTC
−3 points
9 comments · 2 min read

Agents don’t have to be aligned to help us achieve an indefinite pause.

Hastings · 25 Jan 2025 18:51 UTC
31 points
0 comments · 3 min read

[Question] AI Safety in secret

Michael Flood · 25 Jan 2025 18:16 UTC
7 points
0 comments · 1 min read

On polytopes

Dmitry Vaintrob · 25 Jan 2025 13:56 UTC
56 points
5 comments · 12 min read

Attribution-based parameter decomposition

25 Jan 2025 13:12 UTC
108 points
21 comments · 4 min read
(publications.apolloresearch.ai)

A concise definition of what it means to win

testingthewaters · 25 Jan 2025 6:37 UTC
4 points
1 comment · 5 min read
(aclevername.substack.com)

[Question] A Floating Cube—Rejected HLE submission

Shankar Sivarajan · 25 Jan 2025 4:52 UTC
8 points
1 comment · 1 min read

Why I’m Pouring Cold Water in My Left Ear, and You Should Too

Maloew · 24 Jan 2025 23:13 UTC
12 points
0 comments · 2 min read

Counterintuitive effects of minimum prices

dynomight · 24 Jan 2025 23:05 UTC
25 points
0 comments · 8 min read
(dynomight.net)

AXRP Episode 38.6 - Joel Lehman on Positive Visions of AI

DanielFilan · 24 Jan 2025 23:00 UTC
10 points
0 comments · 9 min read

Locating and Editing Knowledge in LMs

Dhananjay Ashok · 24 Jan 2025 22:53 UTC
1 point
0 comments · 4 min read

How are Those AI Participants Doing Anyway?

mushroomsoup · 24 Jan 2025 22:37 UTC
4 points
0 comments · 10 min read

Six Thoughts on AI Safety

boazbarak · 24 Jan 2025 22:20 UTC
92 points
55 comments · 15 min read

Instrumental Goals Are A Different And Friendlier Kind Of Thing Than Terminal Goals

24 Jan 2025 20:20 UTC
186 points
61 comments · 5 min read

Yudkowsky on The Trajectory podcast

Seth Herd · 24 Jan 2025 19:52 UTC
71 points
39 comments · 2 min read
(www.youtube.com)

Empirical Insights into Feature Geometry in Sparse Autoencoders

Jason Boxi Zhang · 24 Jan 2025 19:02 UTC
7 points
0 comments · 11 min read

Ideas for CoT Models: A Geometric Perspective on Latent Space Reasoning

Rohan Ganapavarapu · 24 Jan 2025 19:01 UTC
2 points
0 comments · 2 min read
(rohan.ga)

Liron Shapira vs Ken Stanley on Doom Debates. A review

TheManxLoiner · 24 Jan 2025 18:01 UTC
10 points
0 comments · 14 min read

Is there such a thing as an impossible protein?

Abhishaike Mahajan · 24 Jan 2025 17:12 UTC
15 points
3 comments · 4 min read
(www.owlposting.com)

Stargate AI-1

Zvi · 24 Jan 2025 15:20 UTC
85 points
1 comment · 18 min read
(thezvi.wordpress.com)

QFT and neural nets: the basic idea

Dmitry Vaintrob · 24 Jan 2025 13:54 UTC
28 points
0 comments · 8 min read

Eliciting bad contexts

24 Jan 2025 10:39 UTC
35 points
9 comments · 3 min read

Insights from “The Manga Guide to Physiology”

TurnTrout · 24 Jan 2025 5:18 UTC
26 points
3 comments · 1 min read
(turntrout.com)

[Question] Do you consider perfect surveillance inevitable?

samuelshadrach · 24 Jan 2025 4:57 UTC
16 points
34 comments · 1 min read

Uncontrollable: A Surprisingly Good Introduction to AI Risk

PeterMcCluskey · 24 Jan 2025 4:30 UTC
16 points
1 comment · 1 min read
(bayesianinvestor.com)

Contra Dances Getting Shorter and Earlier

jefftk · 23 Jan 2025 23:30 UTC
11 points
0 comments · 2 min read
(www.jefftk.com)

Starting Thoughts on RLHF

Michael Flood · 23 Jan 2025 22:16 UTC
2 points
0 comments · 5 min read

Updating and Editing Factual Knowledge in Language Models

Dhananjay Ashok · 23 Jan 2025 19:34 UTC
2 points
2 comments · 10 min read

AI companies are unlikely to make high-assurance safety cases if timelines are short

ryan_greenblatt · 23 Jan 2025 18:41 UTC
145 points
5 comments · 13 min read

AISN #46: The Transition

23 Jan 2025 18:09 UTC
8 points
0 comments · 5 min read
(newsletter.safe.ai)

What does success look like?

Raymond Douglas · 23 Jan 2025 17:48 UTC
11 points
0 comments · 3 min read

AI #100: Meet the New Boss

Zvi · 23 Jan 2025 15:40 UTC
50 points
4 comments · 69 min read
(thezvi.wordpress.com)

[Cross-post] Every Bay Area “Walled Compound”

davekasten · 23 Jan 2025 15:05 UTC
43 points
3 comments · 3 min read

Writing experiments and the banana escape valve

Dmitry Vaintrob · 23 Jan 2025 13:11 UTC
34 points
1 comment · 2 min read

MONA: Managed Myopia with Approval Feedback

23 Jan 2025 12:24 UTC
81 points
30 comments · 9 min read

[Question] How useful would alien alignment research be?

Donald Hobson · 23 Jan 2025 10:59 UTC
17 points
5 comments · 1 min read

What are the differences between AGI, transformative AI, and superintelligence?

23 Jan 2025 10:03 UTC
10 points
3 comments · 3 min read
(aisafety.info)

Why Aligning an LLM is Hard, and How to Make it Easier

RogerDearnaley · 23 Jan 2025 6:44 UTC
34 points
3 comments · 4 min read

Tail SP 500 Call Options

sapphire · 23 Jan 2025 5:21 UTC
73 points
28 comments · 2 min read

A hierarchy of disagreement

Adam Zerner · 23 Jan 2025 3:17 UTC
21 points
4 comments · 8 min read

Early Experiments in Human Auditing for AI Control

23 Jan 2025 1:34 UTC
28 points
1 comment · 7 min read

You Have Two Brains

Eneasz · 23 Jan 2025 0:52 UTC
24 points
5 comments · 5 min read
(deathisbad.substack.com)

[Question] are there 2 types of alignment?

KvmanThinking · 23 Jan 2025 0:08 UTC
4 points
9 comments · 1 min read

Theory of Change for AI Safety Camp

Linda Linsefors · 22 Jan 2025 22:07 UTC
36 points
3 comments · 7 min read

On DeepSeek’s r1

Zvi · 22 Jan 2025 19:50 UTC
55 points
2 comments · 35 min read
(thezvi.wordpress.com)

Detect Goodhart and shut down

Jeremy Gillen · 22 Jan 2025 18:45 UTC
70 points
21 comments · 7 min read

Recursive Self-Modeling as a Plausible Mechanism for Real-time Introspection in Current Language Models

rife · 22 Jan 2025 18:36 UTC
14 points
6 comments · 2 min read