Nina Rimsky

Karma: 1,393

https://ninarimsky.substack.com/

https://ninarimsky.com/

Steering Llama-2 with contrastive activation additions

2 Jan 2024 0:47 UTC
120 points
29 comments · 8 min read · LW link
(arxiv.org)

Reducing sycophancy and improving honesty via activation steering

Nina Rimsky · 28 Jul 2023 2:46 UTC
117 points
16 comments · 9 min read · LW link

Investigating the learning coefficient of modular addition: hackathon project

17 Oct 2023 19:51 UTC
86 points
4 comments · 12 min read · LW link

Consider giving money to people, not projects or organizations

Nina Rimsky · 2 Jul 2023 14:33 UTC
79 points
30 comments · 3 min read · LW link
(ninarimsky.substack.com)

On household dust

Nina Rimsky · 30 Jun 2023 17:03 UTC
74 points
12 comments · 5 min read · LW link

A framing for interpretability

Nina Rimsky · 14 Nov 2023 16:14 UTC
69 points
5 comments · 4 min read · LW link
(ninarimsky.substack.com)

Influence functions: why, what and how

Nina Rimsky · 15 Sep 2023 20:42 UTC
69 points
6 comments · 8 min read · LW link

Modulating sycophancy in an RLHF model via activation steering

Nina Rimsky · 9 Aug 2023 7:06 UTC
67 points
20 comments · 12 min read · LW link

Red-teaming language models via activation engineering

Nina Rimsky · 26 Aug 2023 5:52 UTC
65 points
6 comments · 9 min read · LW link

Activation adding experiments with llama-7b

Nina Rimsky · 16 Jul 2023 4:17 UTC
50 points
1 comment · 3 min read · LW link

The challenge of articulating tacit knowledge

Nina Rimsky · 31 May 2023 23:10 UTC
48 points
4 comments · 5 min read · LW link
(ninarimsky.substack.com)

Understanding and visualizing sycophancy datasets

Nina Rimsky · 16 Aug 2023 5:34 UTC
45 points
0 comments · 6 min read · LW link

Passing the ideological Turing test? Arguments against existential risk from AI.

Nina Rimsky · 7 Jul 2023 10:38 UTC
42 points
5 comments · 7 min read · LW link
(ninarimsky.substack.com)

Comparing representation vectors between llama 2 base and chat

Nina Rimsky · 28 Oct 2023 22:54 UTC
36 points
5 comments · 2 min read · LW link

Decoding intermediate activations in llama-2-7b

Nina Rimsky · 21 Jul 2023 5:35 UTC
36 points
3 comments · 4 min read · LW link

Recipe: Hessian eigenvector computation for PyTorch models

Nina Rimsky · 14 Aug 2023 2:48 UTC
30 points
5 comments · 5 min read · LW link

Activation adding experiments with FLAN-T5

Nina Rimsky · 13 Jul 2023 23:32 UTC
21 points
5 comments · 7 min read · LW link