Anomalous Tokens in DeepSeek-V3 and r1

henry · Jan 25, 2025, 10:55 PM
137 points
3 comments · 7 min read · LW link

The Rising Sea

Jesse Hoogland · Jan 25, 2025, 8:48 PM
92 points
2 comments · 2 min read · LW link

Monet: Mixture of Monosemantic Experts for Transformers Explained

CalebMaresca · Jan 25, 2025, 7:37 PM
20 points
2 comments · 11 min read · LW link

AI and Non-Existence.

Eleven · Jan 25, 2025, 7:36 PM
−3 points
9 comments · 2 min read · LW link

Agents don’t have to be aligned to help us achieve an indefinite pause.

Hastings · Jan 25, 2025, 6:51 PM
29 points
0 comments · 3 min read · LW link

[Question] AI Safety in secret

Michael Flood · Jan 25, 2025, 6:16 PM
7 points
0 comments · 1 min read · LW link

On polytopes

Dmitry Vaintrob · Jan 25, 2025, 1:56 PM
56 points
5 comments · 12 min read · LW link

Attribution-based parameter decomposition

Jan 25, 2025, 1:12 PM
108 points
22 comments · 4 min read · LW link
(publications.apolloresearch.ai)

A concise definition of what it means to win

testingthewaters · Jan 25, 2025, 6:37 AM
4 points
1 comment · 5 min read · LW link
(aclevername.substack.com)

[Question] A Floating Cube—Rejected HLE submission

Shankar Sivarajan · Jan 25, 2025, 4:52 AM
7 points
1 comment · 1 min read · LW link

Why I’m Pouring Cold Water in My Left Ear, and You Should Too

Maloew · Jan 24, 2025, 11:13 PM
12 points
0 comments · 2 min read · LW link

Counterintuitive effects of minimum prices

dynomight · Jan 24, 2025, 11:05 PM
25 points
0 comments · 8 min read · LW link
(dynomight.net)

AXRP Episode 38.6 - Joel Lehman on Positive Visions of AI

DanielFilan · Jan 24, 2025, 11:00 PM
10 points
0 comments · 9 min read · LW link

Locating and Editing Knowledge in LMs

Dhananjay Ashok · Jan 24, 2025, 10:53 PM
1 point
0 comments · 4 min read · LW link

How are Those AI Participants Doing Anyway?

mushroomsoup · Jan 24, 2025, 10:37 PM
4 points
0 comments · 10 min read · LW link

Six Thoughts on AI Safety

boazbarak · Jan 24, 2025, 10:20 PM
91 points
55 comments · 15 min read · LW link

Instrumental Goals Are A Different And Friendlier Kind Of Thing Than Terminal Goals

Jan 24, 2025, 8:20 PM
181 points
61 comments · 5 min read · LW link

Yudkowsky on The Trajectory podcast

Seth Herd · Jan 24, 2025, 7:52 PM
71 points
39 comments · 2 min read · LW link
(www.youtube.com)

Empirical Insights into Feature Geometry in Sparse Autoencoders

Jason Boxi Zhang · Jan 24, 2025, 7:02 PM
7 points
0 comments · 11 min read · LW link

Ideas for CoT Models: A Geometric Perspective on Latent Space Reasoning

Rohan Ganapavarapu · Jan 24, 2025, 7:01 PM
2 points
0 comments · 2 min read · LW link
(rohan.ga)

Liron Shapira vs Ken Stanley on Doom Debates. A review

TheManxLoiner · Jan 24, 2025, 6:01 PM
9 points
0 comments · 14 min read · LW link

Is there such a thing as an impossible protein?

Abhishaike Mahajan · Jan 24, 2025, 5:12 PM
15 points
3 comments · 4 min read · LW link
(www.owlposting.com)

Stargate AI-1

Zvi · Jan 24, 2025, 3:20 PM
85 points
1 comment · 18 min read · LW link
(thezvi.wordpress.com)

QFT and neural nets: the basic idea

Dmitry Vaintrob · Jan 24, 2025, 1:54 PM
26 points
0 comments · 8 min read · LW link

Eliciting bad contexts

Jan 24, 2025, 10:39 AM
34 points
9 comments · 3 min read · LW link

Insights from “The Manga Guide to Physiology”

TurnTrout · Jan 24, 2025, 5:18 AM
26 points
3 comments · 1 min read · LW link
(turntrout.com)

[Question] Do you consider perfect surveillance inevitable?

samuelshadrach · Jan 24, 2025, 4:57 AM
16 points
34 comments · 1 min read · LW link

Uncontrollable: A Surprisingly Good Introduction to AI Risk

PeterMcCluskey · Jan 24, 2025, 4:30 AM
11 points
0 comments · 1 min read · LW link
(bayesianinvestor.com)

Contra Dances Getting Shorter and Earlier

jefftk · Jan 23, 2025, 11:30 PM
11 points
0 comments · 2 min read · LW link
(www.jefftk.com)

Starting Thoughts on RLHF

Michael Flood · Jan 23, 2025, 10:16 PM
2 points
0 comments · 5 min read · LW link

Updating and Editing Factual Knowledge in Language Models

Dhananjay Ashok · Jan 23, 2025, 7:34 PM
2 points
2 comments · 10 min read · LW link

AI companies are unlikely to make high-assurance safety cases if timelines are short

ryan_greenblatt · Jan 23, 2025, 6:41 PM
145 points
5 comments · 13 min read · LW link

AISN #46: The Transition

Jan 23, 2025, 6:09 PM
8 points
0 comments · 5 min read · LW link
(newsletter.safe.ai)

What does success look like?

Raymond Douglas · Jan 23, 2025, 5:48 PM
11 points
0 comments · 3 min read · LW link

AI #100: Meet the New Boss

Zvi · Jan 23, 2025, 3:40 PM
50 points
4 comments · 69 min read · LW link
(thezvi.wordpress.com)

[Cross-post] Every Bay Area “Walled Compound”

davekasten · Jan 23, 2025, 3:05 PM
37 points
3 comments · 3 min read · LW link

Writing experiments and the banana escape valve

Dmitry Vaintrob · Jan 23, 2025, 1:11 PM
34 points
1 comment · 2 min read · LW link

MONA: Managed Myopia with Approval Feedback

Jan 23, 2025, 12:24 PM
80 points
30 comments · 9 min read · LW link

[Question] How useful would alien alignment research be?

Donald Hobson · Jan 23, 2025, 10:59 AM
17 points
5 comments · 1 min read · LW link

What are the differences between AGI, transformative AI, and superintelligence?

Jan 23, 2025, 10:03 AM
10 points
3 comments · 3 min read · LW link
(aisafety.info)

Why Aligning an LLM is Hard, and How to Make it Easier

RogerDearnaley · Jan 23, 2025, 6:44 AM
33 points
3 comments · 4 min read · LW link

Tail SP 500 Call Options

sapphire · Jan 23, 2025, 5:21 AM
67 points
28 comments · 2 min read · LW link

A hierarchy of disagreement

Adam Zerner · Jan 23, 2025, 3:17 AM
21 points
4 comments · 8 min read · LW link

Early Experiments in Human Auditing for AI Control

Jan 23, 2025, 1:34 AM
27 points
0 comments · 7 min read · LW link

You Have Two Brains

Eneasz · Jan 23, 2025, 12:52 AM
24 points
5 comments · 5 min read · LW link
(deathisbad.substack.com)

[Question] are there 2 types of alignment?

KvmanThinking · Jan 23, 2025, 12:08 AM
4 points
9 comments · 1 min read · LW link

Theory of Change for AI Safety Camp

Linda Linsefors · Jan 22, 2025, 10:07 PM
36 points
3 comments · 7 min read · LW link

On DeepSeek’s r1

Zvi · Jan 22, 2025, 7:50 PM
55 points
2 comments · 35 min read · LW link
(thezvi.wordpress.com)

Detect Goodhart and shut down

Jeremy Gillen · Jan 22, 2025, 6:45 PM
70 points
21 comments · 7 min read · LW link

Recursive Self-Modeling as a Plausible Mechanism for Real-time Introspection in Current Language Models

rife · Jan 22, 2025, 6:36 PM
8 points
6 comments · 2 min read · LW link