Alignment Research Center (ARC)

Tag · Last edit: Dec 30, 2024, 9:23 AM by Dakara

The Alignment Research Center (ARC) is a non-profit research organization whose mission is to align future machine learning systems with human interests. Its current work focuses on developing an alignment strategy that could be adopted in industry today while scaling gracefully to future ML systems. Its researchers are currently Paul Christiano, Mark Xu, and Jacob Hilton; Kyle Scott handles operations.

ARC Evals new report: Evaluating Language-Model Agents on Realistic Autonomous Tasks

Beth Barnes · Aug 1, 2023, 6:30 PM
153 points

72 votes

12 comments · 5 min read · LW link
(evals.alignment.org)

Steelmanning heuristic arguments

Dmitry Vaintrob · Apr 13, 2025, 1:09 AM
77 points

28 votes

0 comments · 17 min read · LW link

Obstacles in ARC’s agenda: Low Probability Estimation

David Matolcsi · May 2, 2025, 7:38 PM
44 points

15 votes

0 comments · 6 min read · LW link

Obstacles in ARC’s agenda: Finding explanations

David Matolcsi · Apr 30, 2025, 11:03 PM
123 points

37 votes

10 comments · 17 min read · LW link

[Question] How is ARC planning to use ELK?

jacquesthibs · Dec 15, 2022, 8:11 PM
24 points

12 votes

5 comments · 1 min read · LW link

Obstacles in ARC’s agenda: Mechanistic Anomaly Detection

David Matolcsi · May 1, 2025, 8:51 PM
43 points

15 votes

1 comment · 11 min read · LW link

Paul Christiano on Dwarkesh Podcast

ESRogs · Nov 3, 2023, 10:13 PM
19 points

11 votes

0 comments · 1 min read · LW link
(www.dwarkeshpatel.com)

ARC is hiring theoretical researchers

Jun 12, 2023, 6:50 PM
126 points

46 votes

12 comments · 4 min read · LW link
(www.alignment.org)

Low Probability Estimation in Language Models

Gabriel Wu · Oct 18, 2024, 3:50 PM
50 points

18 votes

0 comments · 10 min read · LW link
(www.alignment.org)

Prizes for matrix completion problems

paulfchristiano · May 3, 2023, 11:30 PM
164 points

72 votes

52 comments · 1 min read · LW link
(www.alignment.org)

ARC’s first technical report: Eliciting Latent Knowledge

Dec 14, 2021, 8:09 PM
228 points

73 votes

90 comments · 1 min read · LW link · 3 reviews
(docs.google.com)

A bird’s eye view of ARC’s research

Jacob_Hilton · Oct 23, 2024, 3:50 PM
121 points

52 votes

12 comments · 7 min read · LW link
(www.alignment.org)

Estimating Tail Risk in Neural Networks

Mark Xu · Sep 13, 2024, 8:00 PM
68 points

20 votes

9 comments · 23 min read · LW link
(www.alignment.org)

AXRP Episode 23 - Mechanistic Anomaly Detection with Mark Xu

DanielFilan · Jul 27, 2023, 1:50 AM
22 points

7 votes

0 comments · 72 min read · LW link

ARC paper: Formalizing the presumption of independence

Erik Jenner · Nov 20, 2022, 1:22 AM
97 points

48 votes

2 comments · 2 min read · LW link
(arxiv.org)

ARC tests to see if GPT-4 can escape human control; GPT-4 failed to do so

Christopher King · Mar 15, 2023, 12:29 AM
116 points

67 votes

22 comments · 2 min read · LW link

More information about the dangerous capability evaluations we did with GPT-4 and Claude.

Beth Barnes · Mar 19, 2023, 12:25 AM
233 points

114 votes

54 comments · 8 min read · LW link
(evals.alignment.org)

Christiano (ARC) and GA (Conjecture) Discuss Alignment Cruxes

Feb 24, 2023, 11:03 PM
61 points

25 votes

7 comments · 47 min read · LW link

Counterexamples to some ELK proposals

paulfchristiano · Dec 31, 2021, 5:05 PM
53 points

16 votes

10 comments · 7 min read · LW link

ARC is hiring!

Dec 14, 2021, 8:09 PM
64 points

23 votes

2 comments · 1 min read · LW link

Concrete Methods for Heuristic Estimation on Neural Networks

Oliver Daniels · Nov 14, 2024, 5:07 AM
33 points

11 votes

0 comments · 27 min read · LW link

Exploring a Vision for AI as Compassionate, Emotionally Intelligent Partners — Seeking Collaboration and Insights

theophilos · Jul 14, 2025, 11:22 PM
1 point

1 vote

0 comments · 1 min read · LW link

Experimentally evaluating whether honesty generalizes

paulfchristiano · Jul 1, 2021, 5:47 PM
103 points

37 votes

24 comments · 9 min read · LW link · 1 review

Evaluations project @ ARC is hiring a researcher and a webdev/engineer

Beth Barnes · Sep 9, 2022, 10:46 PM
99 points

33 votes

7 comments · 10 min read · LW link

The Alignment Problems

Martín Soto · Jan 12, 2023, 10:29 PM
20 points

10 votes

0 comments · 4 min read · LW link

1.75 ASR HARMBENCH & 0% HARMFUL RESPONSES FOR MISALIGNMENT.

jfdom · Nov 10, 2025, 8:43 PM
1 point

1 vote

0 comments · 1 min read · LW link

Empirical Proof of Systemic Incoherence in LLMs (Gemini Case Study

arayun · Nov 6, 2025, 2:23 PM
1 point

1 vote

0 comments · 1 min read · LW link

[Question] Why is there an alignment problem?

InfiniteLight · Dec 22, 2023, 6:19 AM
1 point

1 vote

0 comments · 1 min read · LW link

The Goal Misgeneralization Problem

Myspy · May 18, 2023, 11:40 PM
1 point

1 vote

0 comments · 1 min read · LW link
(drive.google.com)

ELK prize results

Mar 9, 2022, 12:01 AM
138 points

56 votes

50 comments · 21 min read · LW link