AXRP Episode 29 - Science of Deep Learning with Vikrant Varma

DanielFilan

25 Apr 2024 19:10 UTC

13 points

0 comments · 63 min read · LW link

Improving Dictionary Learning with Gated Sparse Autoencoders

Neel Nanda, Senthooran Rajamanoharan, Arthur Conmy, lsgos, Tom Lieberum, Vikrant Varma, János Kramár and Rohin Shah

25 Apr 2024 18:43 UTC

25 points

0 comments · 1 min read · LW link

(arxiv.org)

Simple probes can catch sleeper agents

Monte M, Carson Denison, Zac Hatfield-Dodds, David Duvenaud, Sam Bowman, Ethan Perez and evhub

23 Apr 2024 21:10 UTC

112 points

13 comments · 1 min read · LW link

(www.anthropic.com)

Dequantifying first-order theories

jessicata

23 Apr 2024 19:04 UTC

37 points

8 comments · 8 min read · LW link

(unstableontology.com)

ProLU: A Nonlinearity for Sparse Autoencoders

Glen Taggart

23 Apr 2024 14:09 UTC

29 points

2 comments · 7 min read · LW link

Time complexity for deterministic string machines

alcatal

21 Apr 2024 22:35 UTC

14 points

0 comments · 21 min read · LW link

Inducing Unprompted Misalignment in LLMs

Sam Svenningsen, evhub and Henry Sleight

19 Apr 2024 20:00 UTC

35 points

6 comments · 16 min read · LW link

[Full Post] Progress Update #1 from the GDM Mech Interp Team

Neel Nanda, Arthur Conmy, lsgos, Senthooran Rajamanoharan, Tom Lieberum, János Kramár and Vikrant Varma

19 Apr 2024 19:06 UTC

70 points

8 comments · 8 min read · LW link

[Summary] Progress Update #1 from the GDM Mech Interp Team

Neel Nanda, Arthur Conmy, lsgos, Senthooran Rajamanoharan, Tom Lieberum, János Kramár and Vikrant Varma

19 Apr 2024 19:06 UTC

68 points

0 comments · 3 min read · LW link

Discriminating Behaviorally Identical Classifiers: a model problem for applying interpretability to scalable oversight

Sam Marks

18 Apr 2024 16:17 UTC

93 points

5 comments · 12 min read · LW link

LLM Evaluators Recognize and Favor Their Own Generations

Arjun Panickssery, Sam Bowman and Shi Feng

17 Apr 2024 21:09 UTC

43 points

1 comment · 3 min read · LW link

(tiny.cc)

Transformers Represent Belief State Geometry in their Residual Stream

Adam Shai

16 Apr 2024 21:16 UTC

301 points

63 comments · 12 min read · LW link

Speedrun ruiner research idea

lukehmiles

13 Apr 2024 23:42 UTC

4 points

11 comments · 2 min read · LW link

AXRP Episode 27 - AI Control with Buck Shlegeris and Ryan Greenblatt

DanielFilan

11 Apr 2024 21:30 UTC

67 points

10 comments · 107 min read · LW link

The theory of Proximal Policy Optimisation implementations

salman.mohammadi

11 Apr 2024 13:00 UTC

3 points

1 comment · 6 min read · LW link

(salmanmohammadi.github.io)

How I select alignment research projects

Ethan Perez, Henry Sleight and Mikita Balesni

10 Apr 2024 4:33 UTC

34 points

4 comments · 24 min read · LW link

PIBBSS is hiring in a variety of roles (alignment research and incubation program)

Nora_Ammann, Lucas Teixeira and DusanDNesic

9 Apr 2024 8:12 UTC

47 points

0 comments · 3 min read · LW link

How We Picture Bayesian Agents

johnswentworth and David Lorell

8 Apr 2024 18:12 UTC

65 points

11 comments · 7 min read · LW link

Measuring Learned Optimization in Small Transformer Models

J Bostock

8 Apr 2024 14:41 UTC

22 points

0 comments · 11 min read · LW link

Measuring Predictability of Persona Evaluations

Thee Ho and evhub

6 Apr 2024 8:46 UTC

19 points

0 comments · 7 min read · LW link