evhub

Karma: 772

Relaxed adversarial training for inner alignment

evhub
10 Sep 2019 23:03 UTC
39 points
2 comments, 27 min read, LW link

Are minimal circuits deceptive?

evhub
7 Sep 2019 18:11 UTC
41 points
4 comments, 8 min read, LW link

Concrete experiments in inner alignment

evhub
6 Sep 2019 22:16 UTC
60 points
9 comments, 5 min read, LW link

Towards a mechanistic understanding of corrigibility

evhub
22 Aug 2019 23:20 UTC
22 points
4 comments, 6 min read, LW link