
beren (Beren Millidge)

Karma: 1,250

RLHF does not appear to differentially cause mode-collapse

20 Mar 2023 15:39 UTC
88 points
8 comments · 3 min read · LW link

Against ubiquitous alignment taxes

beren · 6 Mar 2023 19:50 UTC
52 points
10 comments · 2 min read · LW link

Addendum: basic facts about language models during training

beren · 6 Mar 2023 19:24 UTC
20 points
2 comments · 5 min read · LW link

Basic facts about language models during training

beren · 21 Feb 2023 11:46 UTC
84 points
14 comments · 18 min read · LW link

Validator models: A simple approach to detecting goodharting

beren · 20 Feb 2023 21:32 UTC
15 points
1 comment · 4 min read · LW link

Empathy as a natural consequence of learnt reward models

beren · 4 Feb 2023 15:35 UTC
37 points
26 comments · 13 min read · LW link

AGI will have learnt utility functions

beren · 25 Jan 2023 19:42 UTC
28 points
3 comments · 13 min read · LW link

Gradient hacking is extremely difficult

beren · 24 Jan 2023 15:45 UTC
145 points
18 comments · 5 min read · LW link

Scaling laws vs individual differences

beren · 10 Jan 2023 13:22 UTC
42 points
21 comments · 7 min read · LW link

Basic Facts about Language Model Internals

4 Jan 2023 13:01 UTC
115 points
17 comments · 9 min read · LW link

An ML interpretation of Shard Theory

beren · 3 Jan 2023 20:30 UTC
37 points
5 comments · 4 min read · LW link

The ultimate limits of alignment will determine the shape of the long term future

beren · 2 Jan 2023 12:47 UTC
33 points
2 comments · 6 min read · LW link

Evidence on recursive self-improvement from current ML

beren · 30 Dec 2022 20:53 UTC
31 points
12 comments · 6 min read · LW link

Human sexuality as an interesting case study of alignment

beren · 30 Dec 2022 13:37 UTC
37 points
26 comments · 3 min read · LW link

[Interim research report] Taking features out of superposition with sparse autoencoders

13 Dec 2022 15:41 UTC
85 points
14 comments · 22 min read · LW link

Deconfusing Direct vs Amortised Optimization

beren · 2 Dec 2022 11:30 UTC
49 points
7 comments · 10 min read · LW link

The Singular Value Decompositions of Transformer Weight Matrices are Highly Interpretable

28 Nov 2022 12:54 UTC
173 points
30 comments · 31 min read · LW link

Current themes in mechanistic interpretability research

16 Nov 2022 14:14 UTC
85 points
3 comments · 12 min read · LW link

Interpreting Neural Networks through the Polytope Lens

23 Sep 2022 17:58 UTC
128 points
27 comments · 33 min read · LW link