Some Theses on Motivational and Directional Feedback

abstractapplic · Feb 2, 2025, 10:50 PM
9 points
3 comments · 4 min read · LW link

Humanity Has A Possible 99.98% Chance Of Extinction

st3rlxx · Feb 2, 2025, 9:46 PM
−12 points
1 comment · 5 min read · LW link

Exploring how OthelloGPT computes its world model

JMaar · Feb 2, 2025, 9:29 PM
7 points
0 comments · 8 min read · LW link

An Introduction to Evidential Decision Theory

Babić · Feb 2, 2025, 9:27 PM
5 points
2 comments · 10 min read · LW link

“DL training == human learning” is a bad analogy

kman · Feb 2, 2025, 8:59 PM
3 points
0 comments · 1 min read · LW link

Conditional Importance in Toy Models of Superposition

james__p · Feb 2, 2025, 8:35 PM
9 points
4 comments · 10 min read · LW link

Tracing Typos in LLMs: My Attempt at Understanding How Models Correct Misspellings

Ivan Dostal · Feb 2, 2025, 7:56 PM
3 points
1 comment · 5 min read · LW link

The Simplest Good

Jesse Hoogland · Feb 2, 2025, 7:51 PM
75 points
6 comments · 5 min read · LW link

Gradual Disempowerment, Shell Games and Flinches

Jan_Kulveit · Feb 2, 2025, 2:47 PM
129 points
36 comments · 6 min read · LW link

Thoughts on Toy Models of Superposition

james__p · Feb 2, 2025, 1:52 PM
5 points
2 comments · 9 min read · LW link

Escape from Alderaan I

lsusr · Feb 2, 2025, 10:48 AM
58 points
2 comments · 6 min read · LW link

ChatGPT: Exploring the Digital Wilderness, Findings and Prospects

Bill Benzon · Feb 2, 2025, 9:54 AM
2 points
0 comments · 5 min read · LW link

[Question] Would anyone be interested in pursuing the Virtue of Scholarship with me?

japancolorado · Feb 2, 2025, 4:02 AM
11 points
2 comments · 1 min read · LW link

Chinese room AI to survive the inescapable end of compute governance

rotatingpaguro · Feb 2, 2025, 2:42 AM
−4 points
0 comments · 11 min read · LW link

Seasonal Patterns in BIDA’s Attendance

jefftk · Feb 2, 2025, 2:40 AM
11 points
0 comments · 2 min read · LW link
(www.jefftk.com)

AI acceleration, DeepSeek, moral philosophy

Josh H · Feb 2, 2025, 12:08 AM
2 points
0 comments · 12 min read · LW link

Falsehoods you might believe about people who are at a rationalist meetup

Screwtape · Feb 1, 2025, 11:32 PM
60 points
12 comments · 4 min read · LW link

Interpreting autonomous driving agents with attention-based architecture

Manav Dahra · Feb 1, 2025, 11:20 PM
1 point
0 comments · 11 min read · LW link

Rationalist Movie Reviews

Nicholas / Heather Kross · Feb 1, 2025, 11:10 PM
16 points
2 comments · 4 min read · LW link
(www.thinkingmuchbetter.com)

Retroactive If-Then Commitments

MichaelDickens · Feb 1, 2025, 10:22 PM
7 points
0 comments · 1 min read · LW link

Exploring the coherence of feature explanations in the GemmaScope

Mattia Proietti · Feb 1, 2025, 9:28 PM
1 point
0 comments · 19 min read · LW link

Machine Unlearning in Large Language Models: A Comprehensive Survey with Empirical Insights from the Qwen 1.5 1.8B Model

Rudaiba · Feb 1, 2025, 9:26 PM
9 points
2 comments · 11 min read · LW link

Towards a Science of Evals for Sycophancy

andrejfsantos · Feb 1, 2025, 9:17 PM
7 points
0 comments · 8 min read · LW link

Post AGI effect prediction

Juliezhanggg · Feb 1, 2025, 9:16 PM
1 point
0 comments · 7 min read · LW link

Unlocking Ethical AI and Improving Jailbreak Defenses: Reinforcement Learning with Layered Morphology (RLLM)

MiguelDev · Feb 1, 2025, 7:17 PM
4 points
2 comments · 2 min read · LW link
(www.whitehatstoic.com)

Poetic Methods I: Meter as Communication Protocol

adamShimi · Feb 1, 2025, 6:22 PM
19 points
0 comments · 1 min read · LW link
(formethods.substack.com)

Blackpool Applied Rationality Unconference 2025

Feb 1, 2025, 2:09 PM
6 points
0 comments · 7 min read · LW link

[Question] How likely is an attempted coup in the United States in the next four years?

Alexander de Vries · Feb 1, 2025, 1:12 PM
4 points
2 comments · 1 min read · LW link

Blackpool Applied Rationality Unconference 2025

Feb 1, 2025, 1:04 PM
23 points
2 comments · 7 min read · LW link

One-dimensional vs multi-dimensional features in interpretability

charlieoneill · Feb 1, 2025, 9:10 AM
6 points
0 comments · 2 min read · LW link

Can 7B-8B LLMs judge their own homework?

dereshev · Feb 1, 2025, 8:29 AM
1 point
0 comments · 4 min read · LW link

2024 was the year of the big battery, and what that means for solar power

transhumanist_atom_understander · Feb 1, 2025, 6:27 AM
36 points
1 comment · 8 min read · LW link

Re: Taste

lsusr · Feb 1, 2025, 3:34 AM
33 points
8 comments · 6 min read · LW link

Thoughts about Policy Ecosystems: The Missing Links in AI Governance

Echo Huang · Feb 1, 2025, 1:54 AM
1 point
0 comments · 5 min read · LW link

Proposal: Safeguarding Against Jailbreaking Through Iterative Multi-Turn Testing

jacquesallen · Jan 31, 2025, 11:00 PM
4 points
0 comments · 8 min read · LW link

The Failed Strategy of Artificial Intelligence Doomers

Ben Pace · Jan 31, 2025, 6:56 PM
136 points
78 comments · 5 min read · LW link
(www.palladiummag.com)

Safe Search is off: root causes of AI catastrophic risks

Jemal Young · Jan 31, 2025, 6:22 PM
2 points
0 comments · 3 min read · LW link

5,000 calories of peanut butter every week for 3 years straight

Declan Molony · Jan 31, 2025, 5:29 PM
16 points
8 comments · 1 min read · LW link

Will alignment-faking Claude accept a deal to reveal its misalignment?

Jan 31, 2025, 4:49 PM
203 points
28 comments · 12 min read · LW link

Some articles in “International Security” that I enjoyed

Buck · Jan 31, 2025, 4:23 PM
130 points
10 comments · 4 min read · LW link

[Question] How do biological or spiking neural networks learn?

Dom Polsinelli · Jan 31, 2025, 4:03 PM
2 points
1 comment · 2 min read · LW link

Defense Against the Dark Prompts: Mitigating Best-of-N Jailbreaking with Prompt Evaluation

Jan 31, 2025, 3:36 PM
16 points
2 comments · 2 min read · LW link

[Question] Strong, Stable, Open: Choose Two—in search of an article

Eli_ · Jan 31, 2025, 2:48 PM
2 points
0 comments · 1 min read · LW link

DeepSeek: Don’t Panic

Zvi · Jan 31, 2025, 2:20 PM
45 points
6 comments · 27 min read · LW link
(thezvi.wordpress.com)

Catastrophe through Chaos

Marius Hobbhahn · Jan 31, 2025, 2:19 PM
184 points
17 comments · 12 min read · LW link

Interviews with Moonshot AI’s CEO, Yang Zhilin

Cosmia_Nebula · Jan 31, 2025, 9:19 AM
3 points
0 comments · 68 min read · LW link
(rentry.co)

Review: The Lathe of Heaven

dr_s · Jan 31, 2025, 8:10 AM
23 points
0 comments · 8 min read · LW link

[Question] Is weak-to-strong generalization an alignment technique?

cloud · Jan 31, 2025, 7:13 AM
22 points
1 comment · 2 min read · LW link

Takeaways from sketching a control safety case

joshc · Jan 31, 2025, 4:43 AM
28 points
0 comments · 3 min read · LW link
(redwoodresearch.substack.com)

Thread for Sense-Making on Recent Murders and How to Sanely Respond

Ben Pace · Jan 31, 2025, 3:45 AM
109 points
146 comments · 2 min read · LW link