Towards Alignment Auditing as a Numbers-Go-Up Science

Sam Marks · 4 Aug 2025 22:30 UTC
123 points
15 comments · 6 min read · LW link

It turns out that DNNs are remarkably interpretable.

Maciej Satkiewicz · 4 Aug 2025 22:18 UTC
12 points
8 comments · 1 min read · LW link
(arxiv.org)

Dissolving moral philosophy: from pain to meta-ethics

Charbel-Raphaël · 4 Aug 2025 20:20 UTC
5 points
3 comments · 2 min read · LW link

Navigating Security: Fighting flammability with fire (when safe)

jimmy · 4 Aug 2025 19:58 UTC
4 points
4 comments · 16 min read · LW link

ACX Atlanta August Meetup

Steve French · 4 Aug 2025 19:52 UTC
2 points
0 comments · 1 min read · LW link

Permanent Disempowerment is the Baseline

Vladimir_Nesov · 4 Aug 2025 17:43 UTC
76 points
23 comments · 6 min read · LW link

Exploring entropy gradient propulsion via the Casimir Effect

bobdavis62 · 4 Aug 2025 15:48 UTC
8 points
14 comments · 1 min read · LW link

If you can generate obfuscated chain-of-thought, can you monitor it?

4 Aug 2025 15:46 UTC
34 points
4 comments · 11 min read · LW link

On Altman’s Interview With Theo Von

Zvi · 4 Aug 2025 15:10 UTC
41 points
1 comment · 9 min read · LW link
(thezvi.wordpress.com)

Should we aim for flourishing over mere survival? The Better Futures series.

wdmacaskill · 4 Aug 2025 14:28 UTC
63 points
8 comments · 5 min read · LW link

Luna Lovegood and the Chamber of Secrets, Part 8

4 Aug 2025 10:28 UTC
2 points
0 comments · 2 min read · LW link

Framework I made for general “productivity”

Mark Wang · 4 Aug 2025 8:40 UTC
4 points
2 comments · 1 min read · LW link

🫵YOU🫵 get to help the AGI Safety Act in Congress! This is real!

Wes R · 4 Aug 2025 3:13 UTC
10 points
5 comments · 1 min read · LW link

Saying Goodbye

sapphire · 3 Aug 2025 23:52 UTC
66 points
74 comments · 4 min read · LW link

[Linkpost] Avatar’s Dirty Secret: Nature Is Just Fancy Infrastructure

AlphaAndOmega · 3 Aug 2025 19:37 UTC
13 points
2 comments · 1 min read · LW link
(open.substack.com)

[Question] How to tolerate boredom?

tryhard1000 · 3 Aug 2025 17:16 UTC
7 points
3 comments · 1 min read · LW link

Persona Vectors—Anthropic Paper

Stephen Martin · 3 Aug 2025 16:11 UTC
11 points
3 comments · 1 min read · LW link
(www.anthropic.com)

Alcohol is so bad for society that you should probably stop drinking

KatWoods · 3 Aug 2025 15:31 UTC
38 points
29 comments · 8 min read · LW link

Explosive growth from substitution: the case of the Industrial Revolution

ParrotRobot · 3 Aug 2025 7:52 UTC
9 points
1 comment · 5 min read · LW link

Emotions Make Sense

DaystarEld · 3 Aug 2025 7:03 UTC
207 points
40 comments · 21 min read · LW link
(daystareld.com)

Creative writing with LLMs, part 2: Co-writing techniques

Kaj_Sotala · 3 Aug 2025 6:44 UTC
1 point
0 comments · 18 min read · LW link

The Ethics of Copying Conscious States and the Many-Worlds Interpretation of Quantum Mechanics

TobyC · 2 Aug 2025 22:48 UTC
15 points
6 comments · 27 min read · LW link

Astronomical Waste & Conscientious Objection

Lydia Nottingham · 2 Aug 2025 22:37 UTC
8 points
1 comment · 2 min read · LW link

The Inkhaven Residency

Ben Pace · 2 Aug 2025 18:51 UTC
134 points
35 comments · 3 min read · LW link

[Question] Feedback request: `eval-crypt` a simple utility to mitigate eval contamination.

2 Aug 2025 17:04 UTC
9 points
4 comments · 2 min read · LW link

The Observer Effect for belief measurement

Roman Malov · 2 Aug 2025 13:57 UTC
8 points
4 comments · 2 min read · LW link

Many prediction markets would be better off as batched auctions

William Howard · 2 Aug 2025 12:04 UTC
173 points
21 comments · 5 min read · LW link
(antidiluvian.substack.com)

2025 ACX Grants project pitches

duck_master · 2 Aug 2025 5:04 UTC
2 points
2 comments · 1 min read · LW link

The deep history of intelligence

Dan MacKinlay · 2 Aug 2025 4:04 UTC
10 points
0 comments · 1 min read · LW link
(danmackinlay.name)

How many species has humanity driven extinct?

Raemon · 2 Aug 2025 2:50 UTC
42 points
9 comments · 1 min read · LW link

[Question] At what point do you abandon ship?

Gesild Muka · 2 Aug 2025 1:13 UTC
7 points
4 comments · 1 min read · LW link

Three Quotes on Transformative Technology

Chris_Leong · 1 Aug 2025 22:57 UTC
8 points
3 comments · 1 min read · LW link

SB-1047 Documentary: The Post-Mortem

Michaël Trazzi · 1 Aug 2025 21:42 UTC
130 points
0 comments · 5 min read · LW link

Persona vectors: monitoring and controlling character traits in language models

1 Aug 2025 21:19 UTC
25 points
3 comments · 5 min read · LW link
(arxiv.org)

Boots theory and Wikipedia

philh · 1 Aug 2025 20:30 UTC
8 points
12 comments · 12 min read · LW link
(reasonableapproximation.net)

Podcast: Lincoln Quirk from Wave

Elizabeth · 1 Aug 2025 19:00 UTC
40 points
1 comment · 1 min read · LW link
(acesounderglass.com)

AI in a vat: Fundamental limits of efficient world modelling for safe agent sandboxing

Fernando Rosas · 1 Aug 2025 18:37 UTC
34 points
3 comments · 15 min read · LW link

The Dark Arts As A Scaffolding Skill For Rationality

Screwtape · 1 Aug 2025 17:12 UTC
82 points
25 comments · 7 min read · LW link

Steve Petersen seeking funding

abramdemski · 1 Aug 2025 17:03 UTC
87 points
0 comments · 1 min read · LW link

The Week in AI Governance

Zvi · 1 Aug 2025 12:20 UTC
18 points
1 comment · 24 min read · LW link
(thezvi.wordpress.com)

Research Areas in AI Control (The Alignment Project by UK AISI)

1 Aug 2025 10:27 UTC
25 points
0 comments · 18 min read · LW link
(alignmentproject.aisi.gov.uk)

Research Areas in Methods for Post-training and Elicitation (The Alignment Project by UK AISI)

1 Aug 2025 10:27 UTC
12 points
0 comments · 6 min read · LW link
(alignmentproject.aisi.gov.uk)

Research Areas in Benchmark Design and Evaluation (The Alignment Project by UK AISI)

1 Aug 2025 10:26 UTC
10 points
0 comments · 9 min read · LW link
(alignmentproject.aisi.gov.uk)

Research Areas in Interpretability (The Alignment Project by UK AISI)

Joseph Bloom · 1 Aug 2025 10:26 UTC
14 points
0 comments · 5 min read · LW link
(alignmentproject.aisi.gov.uk)

Research Areas in Cognitive Science (The Alignment Project by UK AISI)

Geoffrey Irving · 1 Aug 2025 10:26 UTC
12 points
0 comments · 6 min read · LW link
(alignmentproject.aisi.gov.uk)

Research Areas in Learning Theory (The Alignment Project by UK AISI)

1 Aug 2025 10:26 UTC
15 points
0 comments · 24 min read · LW link
(alignmentproject.aisi.gov.uk)

Research Areas in Probabilistic Methods (The Alignment Project by UK AISI)

1 Aug 2025 10:26 UTC
3 points
0 comments · 4 min read · LW link
(alignmentproject.aisi.gov.uk)

Research Areas in Economic Theory and Game Theory (The Alignment Project by UK AISI)

Cecilia Wood · 1 Aug 2025 10:25 UTC
4 points
0 comments · 6 min read · LW link
(alignmentproject.aisi.gov.uk)

Research Areas in Computational Complexity Theory (The Alignment Project by UK AISI)

Simon Marshall · 1 Aug 2025 10:25 UTC
6 points
0 comments · 10 min read · LW link
(alignmentproject.aisi.gov.uk)

Research Areas in Information Theory and Cryptography (The Alignment Project by UK AISI)

Simon Marshall · 1 Aug 2025 10:25 UTC
6 points
0 comments · 3 min read · LW link
(alignmentproject.aisi.gov.uk)