21 Jul 2024 20:38 UTC

0 points

1 comment, 19 min read, LW link

Demography and Destiny

Zero Contradictions 21 Jul 2024 20:34 UTC

6 points

11 comments, 1 min read, LW link

(thewaywardaxolotl.blogspot.com)

The $100B plan with “70% risk of killing us all” w Stephen Fry [video]

Oleg Trott 21 Jul 2024 20:06 UTC

35 points

8 comments, 1 min read, LW link

(www.youtube.com)

Raising Welfare for Lab Rodents

xanderbalwit 21 Jul 2024 19:18 UTC

−2 points

0 comments, 1 min read, LW link

(press.asimov.com)

A simple model of math skill

Alex_Altair 21 Jul 2024 18:57 UTC

107 points

17 comments, 8 min read, LW link

Using an LLM perplexity filter to detect weight exfiltration

Adam Karvonen 21 Jul 2024 18:18 UTC

25 points

11 comments, 2 min read, LW link

[Question] Would a scope-insensitive AGI be less likely to incapacitate humanity?

Jim Buhler 21 Jul 2024 14:15 UTC

2 points

3 comments, 1 min read, LW link

Holomorphic surjection theorem (Picard’s little theorem)

dkl9 21 Jul 2024 13:24 UTC

17 points

0 comments, 2 min read, LW link

(dkl9.net)

aimless ace analyzes active amateur: a micro-aaaaalignment proposal

lemonhope 21 Jul 2024 12:37 UTC

13 points

0 comments, 1 min read, LW link

Pivotal Acts are easier than Alignment?

Michael Soareverix 21 Jul 2024 12:15 UTC

2 points

4 comments, 1 min read, LW link

Ball Sq Pathways

jefftk 21 Jul 2024 2:20 UTC

13 points

1 comment, 1 min read, LW link

(www.jefftk.com)

Freedom and Privacy of Thought Architectures

SebastianG 20 Jul 2024 21:43 UTC

5 points

2 comments, 1 min read, LW link

Why Georgism Lost Its Popularity

Zero Contradictions 20 Jul 2024 15:08 UTC

49 points

55 comments, 1 min read, LW link

(zerocontradictions.net)

Only Fools Avoid Hindsight Bias

Kevin Dorst 20 Jul 2024 13:42 UTC

−11 points

5 comments, 6 min read, LW link

(kevindorst.substack.com)

A more systematic case for inner misalignment

Richard_Ngo 20 Jul 2024 5:03 UTC

31 points

4 comments, 5 min read, LW link

BatchTopK: A Simple Improvement for TopK-SAEs

Bart Bussmann, Patrick Leask and Neel Nanda

20 Jul 2024 2:20 UTC

62 points

0 comments, 4 min read, LW link

Krona Compare

jefftk 20 Jul 2024 1:10 UTC

10 points

0 comments, 2 min read, LW link

(www.jefftk.com)

(Approximately) Deterministic Natural Latents

johnswentworth and David Lorell

19 Jul 2024 23:02 UTC

45 points

1 comment, 4 min read, LW link

Feature Targeted LLC Estimation Distinguishes SAE Features from Random Directions

Lidor Banuel Dabbah and Aviel Boag

19 Jul 2024 20:32 UTC

59 points

6 comments, 16 min read, LW link

JumpReLU SAEs + Early Access to Gemma 2 SAEs

Senthooran Rajamanoharan, Tom Lieberum, nps29, Arthur Conmy, Vikrant Varma, János Kramár and Neel Nanda

19 Jul 2024 16:10 UTC

55 points

10 comments, 1 min read, LW link

(storage.googleapis.com)

Truth is Universal: Robust Detection of Lies in LLMs

Lennart Buerger 19 Jul 2024 14:07 UTC

24 points

4 comments, 2 min read, LW link

(arxiv.org)

Sustainability of Digital Life Form Societies

Hiroshi Yamakawa 19 Jul 2024 13:59 UTC

19 points

1 comment, 20 min read, LW link

Romae Industriae

Maxwell Tabarrok 19 Jul 2024 13:03 UTC

36 points

2 comments, 7 min read, LW link

(www.maximum-progress.com)

[Question] Have people given up on iterated distillation and amplification?

Chris_Leong 19 Jul 2024 12:23 UTC

20 points

1 comment, 1 min read, LW link

How do we know that “good research” is good? (aka “direct evaluation” vs “eigen-evaluation”)

Ruby 19 Jul 2024 0:31 UTC

49 points

21 comments, 6 min read, LW link

Linkpost: Surely you can be serious

kave 18 Jul 2024 22:18 UTC

64 points

8 comments, 1 min read, LW link

(www.experimental-history.com)

My experience applying to MATS 6.0

mic 18 Jul 2024 19:02 UTC

19 points

3 comments, 5 min read, LW link

[Question] What are the actual arguments in favor of computationalism as a theory of identity?

sunwillrise 18 Jul 2024 18:44 UTC

16 points

27 comments, 5 min read, LW link

Yet Another Critique of “Luxury Beliefs”

ymeskhout 18 Jul 2024 18:37 UTC

6 points

9 comments, 9 min read, LW link

(www.ymeskhout.com)

[Interim research report] Evaluating the Goal-Directedness of Language Models

Rauno Arike, Elizabeth Donoway and Marius Hobbhahn

18 Jul 2024 18:19 UTC

40 points

4 comments, 11 min read, LW link

Interpretability in Action: Exploratory Analysis of VPT, a Minecraft Agent

Karolis Jucys, george_adams and Sonia Joseph

18 Jul 2024 17:02 UTC

9 points

0 comments, 1 min read, LW link

(arxiv.org)

Activation Engineering Theories of Impact

Jakub K. Nowak🔸 18 Jul 2024 16:44 UTC

6 points

1 comment, 2 min read, LW link

[Question] Me & My Clone

SimonBaars 18 Jul 2024 16:25 UTC

27 points

22 comments, 1 min read, LW link

AI #73: Openly Evil AI

Zvi 18 Jul 2024 14:40 UTC

89 points

20 comments, 52 min read, LW link

(thezvi.wordpress.com)

A List of 45+ Mech Interp Project Ideas from Apollo Research’s Interpretability Team

Lee Sharkey, Lucius Bushnaq, Dan Braun, StefanHex and Nicholas Goldowsky-Dill

18 Jul 2024 14:15 UTC

125 points

18 comments, 18 min read, LW link

SAEs (usually) Transfer Between Base and Chat Models

Connor Kissane, robertzk, Arthur Conmy and Neel Nanda

18 Jul 2024 10:29 UTC

67 points

0 comments, 10 min read, LW link

[Question] Should we exclude alignment research from LLM training datasets?

Ben Millwood 18 Jul 2024 10:27 UTC

3 points

5 comments, 1 min read, LW link

Keeping content out of LLM training datasets

Ben Millwood 18 Jul 2024 10:27 UTC

4 points

0 comments, 5 min read, LW link

The Assassination of Trump’s Ear is Evidence for Time-Travel

elv 18 Jul 2024 7:01 UTC

−9 points

5 comments, 5 min read, LW link

Friendship is transactional, unconditional friendship is insurance

Ruby 17 Jul 2024 22:52 UTC

70 points

25 comments, 2 min read, LW link, 1 review

D&D.Sci: Whom Shall You Call? [Evaluation and Ruleset]

abstractapplic 17 Jul 2024 22:34 UTC

17 points

5 comments, 5 min read, LW link

Optimistic Assumptions, Longterm Planning, and “Cope”

Raemon 17 Jul 2024 22:14 UTC

229 points

47 comments, 7 min read, LW link, 1 review

Baking vs Patissing vs Cooking, the HPS explanation

adamShimi 17 Jul 2024 20:29 UTC

30 points

16 comments, 3 min read, LW link

(epistemologicalfascinations.substack.com)

Launching the Respiratory Outlook 2024/25 Forecasting Series

ChristianWilliams 17 Jul 2024 19:51 UTC

5 points

0 comments, 1 min read, LW link

(www.metaculus.com)

What are you getting paid in?

Austin Chen 17 Jul 2024 19:23 UTC

120 points

16 comments, 4 min read, LW link, 1 review

(www.approachwithalacrity.com)

Individually incentivized safe Pareto improvements in open-source bargaining

Nicolas Macé, Anthony DiGiovanni and JesseClifton

17 Jul 2024 18:26 UTC

41 points

3 comments, 17 min read, LW link

Profit and Value

kwang 17 Jul 2024 18:06 UTC

22 points

3 comments, 6 min read, LW link

(open.substack.com)

So You’ve Learned To Teleport by Tom Scott

landscape_kiwi 17 Jul 2024 18:04 UTC

4 points

0 comments, 1 min read, LW link

(www.youtube.com)

How does generalized accessibility compare to targeted accessibility?

ErioirE 17 Jul 2024 17:07 UTC

3 points

0 comments, 2 min read, LW link

Housing Roundup #9: Restricting Supply

Zvi 17 Jul 2024 12:50 UTC

25 points

8 comments, 44 min read, LW link

(thezvi.wordpress.com)