MATS Program

Last edit: 21 Oct 2023 17:46 UTC by Ryan Kidd

The ML Alignment & Theory Scholars (MATS) Program is an educational seminar and independent research program that aims to provide talented scholars with talks, workshops, and research mentorship in the field of AI alignment, and connect them with the Berkeley AI safety research community.

Some real examples of gradient hacking

Oliver Sourbut · 22 Nov 2021 0:11 UTC
15 points
8 comments · 2 min read · LW link

Theoretical Neuroscience For Alignment Theory

Cameron Berg · 7 Dec 2021 21:50 UTC
62 points
18 comments · 23 min read · LW link

Understanding and controlling auto-induced distributional shift

L Rudolf L · 13 Dec 2021 14:59 UTC
32 points
4 comments · 16 min read · LW link

The Natural Abstraction Hypothesis: Implications and Evidence

CallumMcDougall · 14 Dec 2021 23:14 UTC
37 points
8 comments · 19 min read · LW link

Motivations, Natural Selection, and Curriculum Engineering

Oliver Sourbut · 16 Dec 2021 1:07 UTC
16 points
0 comments · 42 min read · LW link

How complex are myopic imitators?

Vivek Hebbar · 8 Feb 2022 12:00 UTC
26 points
1 comment · 15 min read · LW link

SERI ML Alignment Theory Scholars Program 2022

27 Apr 2022 0:43 UTC
63 points
6 comments · 3 min read · LW link

Conditions for mathematical equivalence of Stochastic Gradient Descent and Natural Selection

Oliver Sourbut · 9 May 2022 21:38 UTC
61 points
19 comments · 8 min read · LW link · 1 review
(www.oliversourbut.net)

Information Loss --> Basin flatness

Vivek Hebbar · 21 May 2022 12:58 UTC
62 points
31 comments · 7 min read · LW link

[Short version] Information Loss --> Basin flatness

Vivek Hebbar · 21 May 2022 12:59 UTC
12 points
0 comments · 1 min read · LW link

My SERI MATS Application

Daniel Paleka · 30 May 2022 2:04 UTC
16 points
0 comments · 8 min read · LW link

Identification of Natural Modularity

Stephen Fowler · 25 Jun 2022 15:05 UTC
15 points
3 comments · 7 min read · LW link

Race Along Rashomon Ridge

7 Jul 2022 3:20 UTC
50 points
15 comments · 8 min read · LW link

Notes on Learning the Prior

Spencer Becker-Kahn · 15 Jul 2022 17:28 UTC
22 points
2 comments · 25 min read · LW link

Why you might expect homogeneous take-off: evidence from ML research

Andrei Alexandru · 17 Jul 2022 20:31 UTC
24 points
0 comments · 10 min read · LW link

How Interpretability can be Impactful

Connall Garrod · 18 Jul 2022 0:06 UTC
18 points
0 comments · 37 min read · LW link

Deception?! I ain’t got time for that!

Paul Colognese · 18 Jul 2022 0:06 UTC
55 points
5 comments · 13 min read · LW link

A distillation of Evan Hubinger’s training stories (for SERI MATS)

Daphne_W · 18 Jul 2022 3:38 UTC
15 points
1 comment · 10 min read · LW link

Training goals for large language models

Johannes Treutlein · 18 Jul 2022 7:09 UTC
28 points
5 comments · 19 min read · LW link

Conditioning Generative Models for Alignment

Jozdien · 18 Jul 2022 7:11 UTC
58 points
8 comments · 20 min read · LW link

Modelling Deception

Garrett Baker · 18 Jul 2022 21:21 UTC
15 points
0 comments · 7 min read · LW link

Bounded complexity of solving ELK and its implications

Rubi J. Hudson · 19 Jul 2022 6:56 UTC
11 points
4 comments · 18 min read · LW link

Abram Demski’s ELK thoughts and proposal—distillation

Rubi J. Hudson · 19 Jul 2022 6:57 UTC
16 points
8 comments · 16 min read · LW link

Information theoretic model analysis may not lend much insight, but we may have been doing them wrong!

Garrett Baker · 24 Jul 2022 0:42 UTC
7 points
0 comments · 10 min read · LW link

Finding Skeletons on Rashomon Ridge

24 Jul 2022 22:31 UTC
30 points
2 comments · 7 min read · LW link

Translating between Latent Spaces

30 Jul 2022 3:25 UTC
27 points
2 comments · 8 min read · LW link

How transparency changed over time

ViktoriaMalyasova · 30 Jul 2022 4:36 UTC
21 points
0 comments · 6 min read · LW link

Externalized reasoning oversight: a research direction for language model alignment

tamera · 3 Aug 2022 12:03 UTC
130 points
23 comments · 6 min read · LW link

Broad Basins and Data Compression

8 Aug 2022 20:33 UTC
33 points
6 comments · 7 min read · LW link

How (not) to choose a research project

9 Aug 2022 0:26 UTC
78 points
11 comments · 7 min read · LW link

Project proposal: Testing the IBP definition of agent

9 Aug 2022 1:09 UTC
21 points
4 comments · 2 min read · LW link

Team Shard Status Report

David Udell · 9 Aug 2022 5:33 UTC
38 points
8 comments · 3 min read · LW link

How Do We Align an AGI Without Getting Socially Engineered? (Hint: Box It)

10 Aug 2022 18:14 UTC
28 points
30 comments · 11 min read · LW link

Shard Theory: An Overview

David Udell · 11 Aug 2022 5:44 UTC
161 points
34 comments · 10 min read · LW link

A brief note on Simplicity Bias

Spencer Becker-Kahn · 14 Aug 2022 2:05 UTC
19 points
0 comments · 4 min read · LW link

What Makes an Idea Understandable? On Architecturally and Culturally Natural Ideas.

16 Aug 2022 2:09 UTC
21 points
2 comments · 16 min read · LW link

Mesa-optimization for goals defined only within a training environment is dangerous

Rubi J. Hudson · 17 Aug 2022 3:56 UTC
6 points
2 comments · 4 min read · LW link

The Core of the Alignment Problem is...

17 Aug 2022 20:07 UTC
74 points
10 comments · 9 min read · LW link

Finding Goals in the World Model

22 Aug 2022 18:06 UTC
59 points
8 comments · 13 min read · LW link

The Shard Theory Alignment Scheme

David Udell · 25 Aug 2022 4:52 UTC
47 points
32 comments · 2 min read · LW link

Taking the parameters which seem to matter and rotating them until they don’t

Garrett Baker · 26 Aug 2022 18:26 UTC
120 points
48 comments · 1 min read · LW link

Can We Align a Self-Improving AGI?

Peter S. Park · 30 Aug 2022 0:14 UTC
8 points
5 comments · 11 min read · LW link

Inner Alignment via Superpowers

30 Aug 2022 20:01 UTC
37 points
13 comments · 4 min read · LW link

Behaviour Manifolds and the Hessian of the Total Loss—Notes and Criticism

Spencer Becker-Kahn · 3 Sep 2022 0:15 UTC
35 points
5 comments · 6 min read · LW link

Framing AI Childhoods

David Udell · 6 Sep 2022 23:40 UTC
37 points
8 comments · 4 min read · LW link

Searching for Modularity in Large Language Models

8 Sep 2022 2:25 UTC
44 points
3 comments · 14 min read · LW link

Swap and Scale

Stephen Fowler · 9 Sep 2022 22:41 UTC
17 points
3 comments · 1 min read · LW link

Trying to find the underlying structure of computational systems

Matthias G. Mayer · 13 Sep 2022 21:16 UTC
17 points
9 comments · 4 min read · LW link

Neural Tangent Kernel Distillation

5 Oct 2022 18:11 UTC
75 points
20 comments · 8 min read · LW link

SERI MATS Program—Winter 2022 Cohort

8 Oct 2022 19:09 UTC
72 points
12 comments · 4 min read · LW link

What sorts of systems can be deceptive?

Andrei Alexandru · 31 Oct 2022 22:00 UTC
16 points
0 comments · 7 min read · LW link

Auditing games for high-level interpretability

Paul Colognese · 1 Nov 2022 10:44 UTC
34 points
1 comment · 7 min read · LW link

Why I’m Working On Model Agnostic Interpretability

Jessica Rumbelow · 11 Nov 2022 9:24 UTC
26 points
9 comments · 2 min read · LW link

The Ground Truth Problem (Or, Why Evaluating Interpretability Methods Is Hard)

Jessica Rumbelow · 17 Nov 2022 11:06 UTC
27 points
2 comments · 2 min read · LW link

A Short Dialogue on the Meaning of Reward Functions

19 Nov 2022 21:04 UTC
45 points
0 comments · 3 min read · LW link

[ASoT] Reflectivity in Narrow AI

Ulisse Mini · 21 Nov 2022 0:51 UTC
6 points
1 comment · 1 min read · LW link

Guardian AI (Misaligned systems are all around us.)

Jessica Rumbelow · 25 Nov 2022 15:55 UTC
15 points
6 comments · 2 min read · LW link

Is the “Valley of Confused Abstractions” real?

jacquesthibs · 5 Dec 2022 13:36 UTC
19 points
11 comments · 2 min read · LW link

Foresight for AGI Safety Strategy: Mitigating Risks and Identifying Golden Opportunities

jacquesthibs · 5 Dec 2022 16:09 UTC
28 points
6 comments · 8 min read · LW link

Working towards AI alignment is better

Johannes C. Mayer · 9 Dec 2022 15:39 UTC
8 points
2 comments · 2 min read · LW link

Finite Factored Sets in Pictures

Magdalena Wache · 11 Dec 2022 18:49 UTC
174 points
35 comments · 12 min read · LW link

[Question] How is ARC planning to use ELK?

jacquesthibs · 15 Dec 2022 20:11 UTC
24 points
5 comments · 1 min read · LW link

Proper scoring rules don’t guarantee predicting fixed points

16 Dec 2022 18:22 UTC
68 points
8 comments · 21 min read · LW link

Results from a survey on tool use and workflows in alignment research

19 Dec 2022 15:19 UTC
79 points
2 comments · 19 min read · LW link

Some Notes on the mathematics of Toy Autoencoding Problems

Spencer Becker-Kahn · 22 Dec 2022 17:21 UTC
14 points
0 comments · 12 min read · LW link

Content and Takeaways from SERI MATS Training Program with John Wentworth

RohanS · 24 Dec 2022 4:17 UTC
28 points
3 comments · 12 min read · LW link

Getting up to Speed on the Speed Prior in 2022

robertzk · 28 Dec 2022 7:49 UTC
36 points
5 comments · 65 min read · LW link

But is it really in Rome? An investigation of the ROME model editing technique

jacquesthibs · 30 Dec 2022 2:40 UTC
102 points
1 comment · 18 min read · LW link

Soft optimization makes the value target bigger

Jeremy Gillen · 2 Jan 2023 16:06 UTC
117 points
20 comments · 12 min read · LW link

My Advice for Incoming SERI MATS Scholars

Johannes C. Mayer · 3 Jan 2023 19:25 UTC
56 points
1 comment · 4 min read · LW link

The Alignment Problems

Martín Soto · 12 Jan 2023 22:29 UTC
19 points
0 comments · 4 min read · LW link

Disentangling Shard Theory into Atomic Claims

Leon Lang · 13 Jan 2023 4:23 UTC
85 points
6 comments · 18 min read · LW link

Consequentialists: One-Way Pattern Traps

David Udell · 16 Jan 2023 20:48 UTC
54 points
3 comments · 14 min read · LW link

Experiment Idea: RL Agents Evading Learned Shutdownability

Leon Lang · 16 Jan 2023 22:46 UTC
31 points
7 comments · 17 min read · LW link
(docs.google.com)

Neural networks generalize because of this one weird trick

Jesse Hoogland · 18 Jan 2023 0:10 UTC
162 points
28 comments · 53 min read · LW link
(www.jessehoogland.com)

[RFC] Possible ways to expand on “Discovering Latent Knowledge in Language Models Without Supervision”.

25 Jan 2023 19:03 UTC
47 points
6 comments · 12 min read · LW link

Spooky action at a distance in the loss landscape

28 Jan 2023 0:22 UTC
61 points
4 comments · 7 min read · LW link
(www.jessehoogland.com)

Stop-gradients lead to fixed point predictions

28 Jan 2023 22:47 UTC
36 points
2 comments · 24 min read · LW link

More findings on Memorization and double descent

Marius Hobbhahn · 1 Feb 2023 18:26 UTC
53 points
2 comments · 19 min read · LW link

More findings on maximal data dimension

Marius Hobbhahn · 2 Feb 2023 18:33 UTC
27 points
1 comment · 11 min read · LW link

Normative vs Descriptive Models of Agency

mattmacdermott · 2 Feb 2023 20:28 UTC
26 points
5 comments · 4 min read · LW link

SolidGoldMagikarp (plus, prompt generation)

5 Feb 2023 22:02 UTC
663 points
204 comments · 12 min read · LW link

Gradient surfing: the hidden role of regularization

Jesse Hoogland · 6 Feb 2023 3:50 UTC
33 points
6 comments · 14 min read · LW link
(www.jessehoogland.com)

SolidGoldMagikarp II: technical details and more recent findings

6 Feb 2023 19:09 UTC
109 points
45 comments · 13 min read · LW link

[ASoT] Policy Trajectory Visualization

Ulisse Mini · 7 Feb 2023 0:13 UTC
9 points
2 comments · 1 min read · LW link

Using PICT against PastaGPT Jailbreaking

Quentin FEUILLADE--MONTIXI · 9 Feb 2023 4:30 UTC
17 points
0 comments · 9 min read · LW link

Qualities that alignment mentors value in junior researchers

Akash · 14 Feb 2023 23:27 UTC
84 points
14 comments · 3 min read · LW link

Non-Unitary Quantum Logic—SERI MATS Research Sprint

Yegreg · 16 Feb 2023 19:31 UTC
27 points
0 comments · 7 min read · LW link

A Neural Network undergoing Gradient-based Training as a Complex System

Spencer Becker-Kahn · 19 Feb 2023 22:08 UTC
22 points
1 comment · 19 min read · LW link

A circuit for Python docstrings in a 4-layer attention-only transformer

20 Feb 2023 19:35 UTC
91 points
6 comments · 21 min read · LW link

The shallow reality of ‘deep learning theory’

Jesse Hoogland · 22 Feb 2023 4:16 UTC
34 points
11 comments · 3 min read · LW link
(www.jessehoogland.com)

Intervening in the Residual Stream

MadHatter · 22 Feb 2023 6:29 UTC
30 points
1 comment · 9 min read · LW link

Searching for a model’s concepts by their shape – a theoretical framework

23 Feb 2023 20:14 UTC
50 points
0 comments · 19 min read · LW link

A mechanistic explanation for SolidGoldMagikarp-like tokens in GPT2

MadHatter · 26 Feb 2023 1:10 UTC
61 points
14 comments · 6 min read · LW link

Performance guarantees in classical learning theory and infra-Bayesianism

matolcsid · 28 Feb 2023 18:37 UTC
9 points
4 comments · 31 min read · LW link

A mostly critical review of infra-Bayesianism

matolcsid · 28 Feb 2023 18:37 UTC
104 points
9 comments · 29 min read · LW link

Predictions for shard theory mechanistic interpretability results

1 Mar 2023 5:16 UTC
105 points
10 comments · 5 min read · LW link

Why are counterfactuals elusive?

Martín Soto · 3 Mar 2023 20:13 UTC
14 points
6 comments · 2 min read · LW link

Understanding and controlling a maze-solving policy network

11 Mar 2023 18:59 UTC
312 points
22 comments · 23 min read · LW link

Fixed points in mortal population games

ViktoriaMalyasova · 14 Mar 2023 7:10 UTC
24 points
0 comments · 12 min read · LW link
(www.lesswrong.com)

Natural Abstractions: Key claims, Theorems, and Critiques

16 Mar 2023 16:37 UTC
206 points
20 comments · 45 min read · LW link

[Appendix] Natural Abstractions: Key Claims, Theorems, and Critiques

16 Mar 2023 16:38 UTC
46 points
0 comments · 13 min read · LW link

Clarifying mesa-optimization

21 Mar 2023 15:53 UTC
37 points
6 comments · 10 min read · LW link

Empirical risk minimization is fundamentally confused

Jesse Hoogland · 22 Mar 2023 16:58 UTC
32 points
5 comments · 1 min read · LW link

SERI MATS—Summer 2023 Cohort

8 Apr 2023 15:32 UTC
71 points
25 comments · 4 min read · LW link

Approximation is expensive, but the lunch is cheap

19 Apr 2023 14:19 UTC
68 points
3 comments · 16 min read · LW link

An open letter to SERI MATS program organisers

Roman Leventov · 20 Apr 2023 16:34 UTC
25 points
26 comments · 4 min read · LW link

Behavioural statistics for a maze-solving agent

20 Apr 2023 22:26 UTC
44 points
11 comments · 10 min read · LW link

Research agenda: Supervising AIs improving AIs

29 Apr 2023 17:09 UTC
76 points
5 comments · 19 min read · LW link

Finding Neurons in a Haystack: Case Studies with Sparse Probing

3 May 2023 13:30 UTC
33 points
5 comments · 2 min read · LW link
(arxiv.org)

How MATS addresses “mass movement building” concerns

Ryan Kidd · 4 May 2023 0:55 UTC
62 points
9 comments · 3 min read · LW link

Some Summaries of Agent Foundations Work

mattmacdermott · 15 May 2023 16:09 UTC
56 points
1 comment · 13 min read · LW link

Boomerang—protocol to dissolve some commitment races

Filip Sondej · 30 May 2023 16:21 UTC
37 points
10 comments · 8 min read · LW link

Quantitative cruxes in Alignment

Martín Soto · 2 Jul 2023 20:38 UTC
21 points
0 comments · 23 min read · LW link

Sources of evidence in Alignment

Martín Soto · 2 Jul 2023 20:38 UTC
20 points
0 comments · 11 min read · LW link

Infra-Bayesian Logic

5 Jul 2023 19:16 UTC
15 points
2 comments · 1 min read · LW link

Activation adding experiments with FLAN-T5

Nina Rimsky · 13 Jul 2023 23:32 UTC
21 points
5 comments · 7 min read · LW link

Activation adding experiments with llama-7b

Nina Rimsky · 16 Jul 2023 4:17 UTC
49 points
1 comment · 3 min read · LW link

AutoInterpretation Finds Sparse Coding Beats Alternatives

Hoagy · 17 Jul 2023 1:41 UTC
54 points
1 comment · 7 min read · LW link

Decoding intermediate activations in llama-2-7b

Nina Rimsky · 21 Jul 2023 5:35 UTC
35 points
3 comments · 4 min read · LW link

Reducing sycophancy and improving honesty via activation steering

Nina Rimsky · 28 Jul 2023 2:46 UTC
116 points
14 comments · 9 min read · LW link

Understanding and Aligning a Human-like Inductive Bias with Cognitive Science: a Review of Related Literature

Claire Short · 29 Jul 2023 6:10 UTC
23 points
0 comments · 12 min read · LW link

Modulating sycophancy in an RLHF model via activation steering

Nina Rimsky · 9 Aug 2023 7:06 UTC
64 points
20 comments · 12 min read · LW link

Decomposing independent generalizations in neural networks via Hessian analysis

14 Aug 2023 17:04 UTC
82 points
3 comments · 1 min read · LW link

Understanding and visualizing sycophancy datasets

Nina Rimsky · 16 Aug 2023 5:34 UTC
45 points
0 comments · 6 min read · LW link

Large Language Models will be Great for Censorship

Ethan Edwards · 21 Aug 2023 19:03 UTC
183 points
14 comments · 8 min read · LW link
(ethanedwards.substack.com)

The Low-Hanging Fruit Prior and sloped valleys in the loss landscape

23 Aug 2023 21:12 UTC
79 points
1 comment · 13 min read · LW link

Red-teaming language models via activation engineering

Nina Rimsky · 26 Aug 2023 5:52 UTC
65 points
6 comments · 9 min read · LW link

An Interpretability Illusion for Activation Patching of Arbitrary Subspaces

29 Aug 2023 1:04 UTC
74 points
3 comments · 1 min read · LW link

An adversarial example for Direct Logit Attribution: memory management in gelu-4l

30 Aug 2023 17:36 UTC
17 points
0 comments · 8 min read · LW link
(arxiv.org)

Invulnerable Incomplete Preferences: A Formal Statement

Sami Petersen · 30 Aug 2023 21:59 UTC
124 points
32 comments · 35 min read · LW link

Taking features out of superposition with sparse autoencoders more quickly with informed initialization

Pierre Peigné · 23 Sep 2023 16:21 UTC
29 points
8 comments · 5 min read · LW link

Evaluating hidden directions on the utility dataset: classification, steering and removal

25 Sep 2023 17:19 UTC
25 points
3 comments · 7 min read · LW link

[Paper] All’s Fair In Love And Love: Copy Suppression in GPT-2 Small

13 Oct 2023 18:32 UTC
82 points
4 comments · 8 min read · LW link

On Interpretability’s Robustness

WCargo · 18 Oct 2023 13:18 UTC
11 points
0 comments · 4 min read · LW link

Apply for MATS Winter 2023-24!

21 Oct 2023 2:27 UTC
106 points
6 comments · 5 min read · LW link

[Closed] Agent Foundations track in MATS

Vanessa Kosoy · 31 Oct 2023 8:12 UTC
54 points
1 comment · 1 min read · LW link
(www.matsprogram.org)

Balancing Security Mindset with Collaborative Research: A Proposal

MadHatter · 1 Nov 2023 0:46 UTC
9 points
3 comments · 4 min read · LW link

Polysemantic Attention Head in a 4-Layer Transformer

9 Nov 2023 16:16 UTC
46 points
0 comments · 6 min read · LW link

Game Theory without Argmax [Part 1]

Cleo Nardo · 11 Nov 2023 15:59 UTC
53 points
16 comments · 19 min read · LW link

Game Theory without Argmax [Part 2]

Cleo Nardo · 11 Nov 2023 16:02 UTC
31 points
14 comments · 13 min read · LW link

Classifying representations of sparse autoencoders (SAEs)

Annah · 17 Nov 2023 13:54 UTC
15 points
6 comments · 2 min read · LW link

MATS Summer 2023 Retrospective

1 Dec 2023 23:29 UTC
77 points
34 comments · 26 min read · LW link

Interview: Applications w/ Alice Rigg

jacobhaimes · 19 Dec 2023 19:03 UTC
12 points
0 comments · 1 min read · LW link
(into-ai-safety.github.io)

Steering Llama-2 with contrastive activation additions

2 Jan 2024 0:47 UTC
119 points
29 comments · 8 min read · LW link
(arxiv.org)

Uncertainty in all its flavours

Cleo Nardo · 9 Jan 2024 16:21 UTC
25 points
6 comments · 35 min read · LW link

Case Studies in Reverse-Engineering Sparse Autoencoder Features by Using MLP Linearization

14 Jan 2024 2:06 UTC
22 points
0 comments · 42 min read · LW link

Sparse Autoencoders Work on Attention Layer Outputs

16 Jan 2024 0:26 UTC
82 points
5 comments · 19 min read · LW link

How important is AI hacking as LLMs advance?

Artyom Karpov · 29 Jan 2024 18:41 UTC
1 point
0 comments · 6 min read · LW link

Attention SAEs Scale to GPT-2 Small

3 Feb 2024 6:50 UTC
76 points
4 comments · 8 min read · LW link

Implementing activation steering

Annah · 5 Feb 2024 17:51 UTC
59 points
5 comments · 7 min read · LW link

We Inspected Every Head In GPT-2 Small using SAEs So You Don’t Have To

6 Mar 2024 5:03 UTC
56 points
0 comments · 12 min read · LW link

MATS AI Safety Strategy Curriculum

7 Mar 2024 19:59 UTC
62 points
2 comments · 16 min read · LW link

Understanding SAE Features with the Logit Lens

11 Mar 2024 0:16 UTC
53 points
0 comments · 14 min read · LW link

My MATS Summer 2023 experience

James Chua · 20 Mar 2024 11:26 UTC
18 points
0 comments · 3 min read · LW link
(jameschua.net)

End-to-end hacking with language models

tchauvin · 5 Apr 2024 15:06 UTC
17 points
0 comments · 8 min read · LW link

Ophiology (or, how the Mamba architecture works)

9 Apr 2024 19:31 UTC
59 points
4 comments · 10 min read · LW link