Apart Research

Tag

Last edit: 18 Jul 2024 14:34 UTC by Esben Kran

Apart Research is an AI safety research lab. They host the Apart Sprints, large-scale international events for research experimentation. This tag includes posts written by Apart researchers and content about Apart Research.

Newsletter for Alignment Research: The ML Safety Updates

Esben Kran

22 Oct 2022 16:17 UTC

26 points

0 comments · 7 min read · LW link

Black Box Investigation Research Hackathon

Esben Kran and Jonas Hallgren

12 Sep 2022 7:20 UTC

9 points

4 comments · 2 min read · LW link

Safety timelines: How long will it take to solve alignment?

Esben Kran, JonathanRystroem and Steinthal

19 Sep 2022 12:53 UTC

38 points

7 comments · 6 min read · LW link

(forum.effectivealtruism.org)

Analysing Adversarial Attacks with Linear Probing

Yoann Poupart, Imene Kerboua, Clement Neo and Jason Hoelscher-Obermaier

17 Jun 2024 14:16 UTC

9 points

0 comments · 8 min read · LW link

We Found An Neuron in GPT-2

Joseph Miller and Clement Neo

11 Feb 2023 18:27 UTC

143 points

23 comments · 7 min read · LW link

(clementneo.com)

Deceptive agents can collude to hide dangerous features in SAEs

Simon Lermen and Mateusz Dziemian

15 Jul 2024 17:07 UTC

33 points

2 comments · 7 min read · LW link

Results from the language model hackathon

Esben Kran

10 Oct 2022 8:29 UTC

22 points

1 comment · 4 min read · LW link

AI Safety Ideas: A collaborative AI safety research platform

Esben Kran

17 Oct 2022 17:01 UTC

24 points

0 comments · 4 min read · LW link

Computational Mechanics Hackathon (June 1 & 2)

Adam Shai

24 May 2024 22:18 UTC

34 points

5 comments · 1 min read · LW link

College technical AI safety hackathon retrospective—Georgia Tech

yix

15 Nov 2024 0:22 UTC

44 points

2 comments · 5 min read · LW link

(open.substack.com)

Early Experiments in Reward Model Interpretation Using Sparse Autoencoders

lukemarks, Amirali Abdullah, Rauno Arike, Fazl and nothoughtsheadempty

3 Oct 2023 7:45 UTC

18 points

0 comments · 5 min read · LW link

Solving the Mechanistic Interpretability challenges: EIS VII Challenge 1

StefanHex and Marius Hobbhahn

9 May 2023 19:41 UTC

119 points

1 comment · 10 min read · LW link

Robustness of Model-Graded Evaluations and Automated Interpretability

Simon Lermen and viluon

15 Jul 2023 19:12 UTC

47 points

5 comments · 9 min read · LW link

Will Machines Ever Rule the World? MLAISU W50

Esben Kran

16 Dec 2022 11:03 UTC

12 points

7 comments · 4 min read · LW link

(newsletter.apartresearch.com)

Towards AI Safety Infrastructure: Talk & Outline

Paul Bricman

7 Jan 2024 9:31 UTC

11 points

0 comments · 2 min read · LW link

(www.youtube.com)

Results from the interpretability hackathon

Esben Kran and Neel Nanda

17 Nov 2022 14:51 UTC

81 points

0 comments · 6 min read · LW link

(alignmentjam.com)

Sequential Coherence: A Bottleneck in Automation

eeeee, xavi_ferres and felixgaston

19 Jul 2025 15:27 UTC

26 points

2 comments · 11 min read · LW link

Identifying semantic neurons, mechanistic circuits & interpretability web apps

Esben Kran and Neel Nanda

13 Apr 2023 11:59 UTC

18 points

0 comments · 8 min read · LW link

Latent Adversarial Training (LAT) Improves the Representation of Refusal

alexandraabbas, nlpet and hal2k

6 Jan 2025 10:24 UTC

21 points

6 comments · 10 min read · LW link

Results from the AI x Democracy Research Sprint

Esben Kran, jordine and Jason Hoelscher-Obermaier

14 Jun 2024 16:40 UTC

13 points

0 comments · 6 min read · LW link

Robustness & Evolution [MLAISU W02]

Esben Kran

13 Jan 2023 15:47 UTC

10 points

0 comments · 3 min read · LW link

(newsletter.apartresearch.com)

Hackathon and Staying Up-to-Date in AI

jacobhaimes

8 Jan 2024 17:10 UTC

11 points

0 comments · 1 min read · LW link

(into-ai-safety.github.io)

How-to Transformer Mechanistic Interpretability—in 50 lines of code or less!

StefanHex

24 Jan 2023 18:45 UTC

48 points

5 comments · 13 min read · LW link

Solving the Mechanistic Interpretability challenges: EIS VII Challenge 2

StefanHex and Marius Hobbhahn

25 May 2023 15:37 UTC

71 points

1 comment · 13 min read · LW link

Demonstrate and evaluate risks from AI to society at the AI x Democracy research hackathon

Esben Kran

19 Apr 2024 14:46 UTC

5 points

0 comments · 6 min read · LW link

(www.apartresearch.com)

Catastrophic Cyber Capabilities Benchmark (3CB): Robustly Evaluating LLM Agent Cyber Offense Capabilities

Jonathan N, abra, Connor Axiotes and Esben Kran

5 Nov 2024 1:01 UTC

9 points

0 comments · 6 min read · LW link

(www.apartresearch.com)

Approximating Human Preferences Using a Multi-Judge Learned System

JoseFaustino, eitan sprejer, Fernando Avalos and Augusto Bernardi

31 Jul 2025 18:01 UTC

19 points

0 comments · 13 min read · LW link

Can startups be impactful in AI safety?

Esben Kran and Archana Vaidheeswaran

13 Sep 2024 19:00 UTC

15 points

0 comments · 6 min read · LW link

Enhancing Genomic Foundation Model Robustness through Iterative Black-Box Adversarial Training

Jeyashree Krishnan and Ajay Mandyam Rangarajan

14 Oct 2025 20:54 UTC

8 points

0 comments · 7 min read · LW link

AI improving AI [MLAISU W01!]

Esben Kran

6 Jan 2023 11:13 UTC

5 points

0 comments · 4 min read · LW link

(newsletter.apartresearch.com)

Generalizability & Hope for AI [MLAISU W03]

Esben Kran

20 Jan 2023 10:06 UTC

5 points

2 comments · 2 min read · LW link

(newsletter.apartresearch.com)

Finding Deception in Language Models

Esben Kran and Archana Vaidheeswaran

20 Aug 2024 9:42 UTC

20 points

4 comments · 4 min read · LW link

Automated Sandwiching & Quantifying Human-LLM Cooperation: ScaleOversight hackathon results

Esben Kran, Fazl, Sabrina Zaki, gabrielrecc and rz2383

23 Feb 2023 10:48 UTC

8 points

0 comments · 6 min read · LW link

Superposition and Dropout

Edoardo Pona

16 May 2023 7:24 UTC

21 points

5 comments · 6 min read · LW link

ML Safety at NeurIPS & Paradigmatic AI Safety? MLAISU W49

Esben Kran and Steinthal

9 Dec 2022 10:38 UTC

19 points

0 comments · 4 min read · LW link

(newsletter.apartresearch.com)

Join the AI Testing Hackathon this Friday

Esben Kran

12 Dec 2022 14:24 UTC

10 points

0 comments · 8 min read · LW link

(alignmentjam.com)

Join the interpretability research hackathon

Esben Kran

28 Oct 2022 16:26 UTC

15 points

0 comments · 5 min read · LW link

[Book] Interpretable Machine Learning: A Guide for Making Black Box Models Explainable

Esben Kran

31 Oct 2022 11:38 UTC

20 points

1 comment · 1 min read · LW link

(christophm.github.io)

Results from the AI testing hackathon

Esben Kran

2 Jan 2023 15:46 UTC

13 points

0 comments · 5 min read · LW link

(alignmentjam.com)

NeurIPS Safety & ChatGPT. MLAISU W48

Esben Kran and Steinthal

2 Dec 2022 15:50 UTC

3 points

0 comments · 4 min read · LW link

(newsletter.apartresearch.com)
