Meta: Frontier AI Framework

Zach Stein-Perlman · 3 Feb 2025 22:00 UTC
33 points
2 comments · 1 min read · LW link
(ai.meta.com)

$300 Fermi Model Competition

ozziegooen · 3 Feb 2025 19:47 UTC
16 points
18 comments · 2 min read · LW link

Visualizing Interpretability

Darold Davis · 3 Feb 2025 19:36 UTC
3 points
0 comments · 4 min read · LW link

Alignment Can Reduce Performance on Simple Ethical Questions

Daan Henselmans · 3 Feb 2025 19:35 UTC
16 points
7 comments · 6 min read · LW link

The Overlap Paradigm: Rethinking Data’s Role in Weak-to-Strong Generalization (W2SG)

Serhii Zamrii · 3 Feb 2025 19:31 UTC
2 points
0 comments · 11 min read · LW link

Sleeper agents appear resilient to activation steering

Lucy Wingard · 3 Feb 2025 19:31 UTC
6 points
0 comments · 7 min read · LW link

Part 1: Enhancing Inner Alignment in CLIP Vision Transformers: Mitigating Reification Bias with SAEs and Grad ECLIP

Gilber A. Corrales · 3 Feb 2025 19:30 UTC
1 point
0 comments · 13 min read · LW link

Superintelligence Alignment Proposal

Davey Morse · 3 Feb 2025 18:47 UTC
5 points
3 comments · 9 min read · LW link

The Self-Reference Trap in Mathematics

Alister Munday · 3 Feb 2025 16:12 UTC
−41 points
23 comments · 2 min read · LW link

Stopping unaligned LLMs is easy!

Yair Halberstadt · 3 Feb 2025 15:38 UTC
−3 points
11 comments · 2 min read · LW link

The Outer Levels

Jerdle · 3 Feb 2025 14:30 UTC
2 points
3 comments · 6 min read · LW link

o3-mini Early Days

Zvi · 3 Feb 2025 14:20 UTC
45 points
0 comments · 15 min read · LW link
(thezvi.wordpress.com)

OpenAI releases deep research agent

Seth Herd · 3 Feb 2025 12:48 UTC
78 points
21 comments · 3 min read · LW link
(openai.com)

Neuron Activations to CLIP Embeddings: Geometry of Linear Combinations in Latent Space

Roman Malov · 3 Feb 2025 10:30 UTC
5 points
0 comments · 2 min read · LW link

[Question] Can we infer the search space of a local optimiser?

Lucius Bushnaq · 3 Feb 2025 10:17 UTC
25 points
5 comments · 3 min read · LW link

Pick two: concise, comprehensive, or clear rules

Screwtape · 3 Feb 2025 6:39 UTC
84 points
27 comments · 8 min read · LW link

Language Models and World Models, a Philosophy

kyjohnso · 3 Feb 2025 2:55 UTC
1 point
0 comments · 1 min read · LW link
(hylaeansea.org)

Keeping Capital is the Challenge

LTM · 3 Feb 2025 2:04 UTC
13 points
2 comments · 17 min read · LW link
(routecause.substack.com)

Use computers as powerful as in 1985 or AI controls humans or ?

jrincayc · 3 Feb 2025 0:51 UTC
3 points
0 comments · 2 min read · LW link

Some Theses on Motivational and Directional Feedback

abstractapplic · 2 Feb 2025 22:50 UTC
10 points
3 comments · 4 min read · LW link

Humanity Has A Possible 99.98% Chance Of Extinction

st3rlxx · 2 Feb 2025 21:46 UTC
−12 points
1 comment · 5 min read · LW link

Exploring how OthelloGPT computes its world model

JMaar · 2 Feb 2025 21:29 UTC
8 points
0 comments · 8 min read · LW link

An Introduction to Evidential Decision Theory

Babić · 2 Feb 2025 21:27 UTC
5 points
2 comments · 10 min read · LW link

“DL training == human learning” is a bad analogy

kman · 2 Feb 2025 20:59 UTC
3 points
0 comments · 1 min read · LW link

Conditional Importance in Toy Models of Superposition

james__p · 2 Feb 2025 20:35 UTC
9 points
4 comments · 10 min read · LW link

Tracing Typos in LLMs: My Attempt at Understanding How Models Correct Misspellings

Ivan Dostal · 2 Feb 2025 19:56 UTC
4 points
1 comment · 5 min read · LW link

The Simplest Good

Jesse Hoogland · 2 Feb 2025 19:51 UTC
76 points
6 comments · 5 min read · LW link

Gradual Disempowerment, Shell Games and Flinches

Jan_Kulveit · 2 Feb 2025 14:47 UTC
145 points
36 comments · 6 min read · LW link

Thoughts on Toy Models of Superposition

james__p · 2 Feb 2025 13:52 UTC
5 points
2 comments · 9 min read · LW link

Escape from Alderaan I

lsusr · 2 Feb 2025 10:48 UTC
59 points
2 comments · 6 min read · LW link

ChatGPT: Exploring the Digital Wilderness, Findings and Prospects

Bill Benzon · 2 Feb 2025 9:54 UTC
2 points
0 comments · 5 min read · LW link

[Question] Would anyone be interested in pursuing the Virtue of Scholarship with me?

japancolorado · 2 Feb 2025 4:02 UTC
11 points
2 comments · 1 min read · LW link

Chinese room AI to survive the inescapable end of compute governance

rotatingpaguro · 2 Feb 2025 2:42 UTC
−4 points
1 comment · 11 min read · LW link

Seasonal Patterns in BIDA’s Attendance

jefftk · 2 Feb 2025 2:40 UTC
11 points
0 comments · 2 min read · LW link
(www.jefftk.com)

AI acceleration, DeepSeek, moral philosophy

Josh H · 2 Feb 2025 0:08 UTC
2 points
0 comments · 12 min read · LW link

Falsehoods you might believe about people who are at a rationalist meetup

Screwtape · 1 Feb 2025 23:32 UTC
70 points
12 comments · 4 min read · LW link

Interpreting autonomous driving agents with attention based architecture

Manav Dahra · 1 Feb 2025 23:20 UTC
1 point
0 comments · 11 min read · LW link

Rationalist Movie Reviews

Nicholas Kross · 1 Feb 2025 23:10 UTC
16 points
2 comments · 3 min read · LW link
(www.thinkingmuchbetter.com)

Retroactive If-Then Commitments

MichaelDickens · 1 Feb 2025 22:22 UTC
8 points
1 comment · 1 min read · LW link

Exploring the coherence of features explanations in the GemmaScope

Mattia Proietti · 1 Feb 2025 21:28 UTC
1 point
0 comments · 19 min read · LW link

Machine Unlearning in Large Language Models: A Comprehensive Survey with Empirical Insights from the Qwen 1.5 1.8B Model

Rudaiba · 1 Feb 2025 21:26 UTC
9 points
2 comments · 11 min read · LW link

Towards a Science of Evals for Sycophancy

andrejfsantos · 1 Feb 2025 21:17 UTC
8 points
0 comments · 8 min read · LW link

Post AGI effect prediction

Juliezhanggg · 1 Feb 2025 21:16 UTC
1 point
0 comments · 7 min read · LW link

Unlocking Ethical AI and Improving Jailbreak Defenses: Reinforcement Learning with Layered Morphology (RLLM)

MiguelDev · 1 Feb 2025 19:17 UTC
4 points
2 comments · 2 min read · LW link

Poetic Methods I: Meter as Communication Protocol

adamShimi · 1 Feb 2025 18:22 UTC
19 points
0 comments · 1 min read · LW link
(formethods.substack.com)

Blackpool Applied Rationality Unconference 2025

1 Feb 2025 14:09 UTC
6 points
0 comments · 7 min read · LW link

[Question] How likely is an attempted coup in the United States in the next four years?

Alexander de Vries · 1 Feb 2025 13:12 UTC
5 points
2 comments · 1 min read · LW link

Blackpool Applied Rationality Unconference 2025

1 Feb 2025 13:04 UTC
23 points
2 comments · 7 min read · LW link

One-dimensional vs multi-dimensional features in interpretability

charlieoneill · 1 Feb 2025 9:10 UTC
6 points
0 comments · 2 min read · LW link

Can 7B-8B LLMs judge their own homework?

dereshev · 1 Feb 2025 8:29 UTC
1 point
0 comments · 4 min read · LW link