Rauno Arike

Karma: 1,106

Angles of attack for continual learning safety

16 Jun 2026 16:15 UTC
47 points
0 comments, 13 min read

How might continual learning affect safety and alignment?

13 Jun 2026 17:34 UTC
59 points
2 comments, 16 min read

What’s Continual Learning, and Why Might We Expect To See It In Advanced LLM Agents?

12 Jun 2026 18:43 UTC
28 points
2 comments, 17 min read

Implications of Continual Learning for LLM Agents: Introduction

12 Jun 2026 18:36 UTC
46 points
0 comments, 6 min read

Estimating No-CoT Task-Completion Time Horizons of Frontier AI Models

10 Jun 2026 17:58 UTC
240 points
20 comments, 4 min read

A List of Research Directions in Character Training

Rauno Arike, 19 Mar 2026 22:58 UTC
47 points
21 comments, 8 min read

[Paper] How does information access affect LLM monitors’ ability to detect sabotage?

11 Feb 2026 21:25 UTC
26 points
0 comments, 6 min read

Aether is hiring technical AI safety researchers

5 Jan 2026 22:27 UTC
22 points
0 comments, 2 min read

13 Arguments About a Transition to Neuralese AIs

Rauno Arike, 7 Nov 2025 16:19 UTC
50 points
14 comments, 10 min read

Hidden Reasoning in LLMs: A Taxonomy

25 Aug 2025 22:43 UTC
79 points
12 comments, 12 min read

How we spent our first two weeks as an independent AI safety research group

11 Aug 2025 19:32 UTC
34 points
0 comments, 10 min read

Extract-and-Evaluate Monitoring Can Significantly Enhance CoT Monitor Performance (Research Note)

8 Aug 2025 10:41 UTC
52 points
7 comments, 10 min read

Aether July 2025 Update

1 Jul 2025 21:08 UTC
26 points
7 comments, 3 min read

[Question] What faithfulness metrics should general claims about CoT faithfulness be based upon?

Rauno Arike, 8 Apr 2025 15:27 UTC
26 points
0 comments, 4 min read

On Recent Results in LLM Latent Reasoning

Rauno Arike, 31 Mar 2025 11:06 UTC
38 points
6 comments, 13 min read

The Best Lecture Series on Every Subject

Rauno Arike, 24 Mar 2025 20:03 UTC
13 points
1 comment, 2 min read

Rauno’s Shortform

Rauno Arike, 15 Nov 2024 12:08 UTC
3 points
37 comments, 1 min read

A Dialogue on Deceptive Alignment Risks

Rauno Arike, 25 Sep 2024 16:10 UTC
11 points
0 comments, 18 min read

[Interim research report] Evaluating the Goal-Directedness of Language Models

18 Jul 2024 18:19 UTC
40 points
4 comments, 11 min read

Early Experiments in Reward Model Interpretation Using Sparse Autoencoders

3 Oct 2023 7:45 UTC
18 points
0 comments, 5 min read