Nina Rimsky

Karma: 1,393

https://ninarimsky.substack.com/

https://ninarimsky.com/

Steering Llama-2 with contrastive activation additions

2 Jan 2024 0:47 UTC
120 points
29 comments
8 min read
(arxiv.org)

A framing for interpretability

Nina Rimsky
14 Nov 2023 16:14 UTC
69 points
5 comments
4 min read
(ninarimsky.substack.com)

Comparing representation vectors between llama 2 base and chat

Nina Rimsky
28 Oct 2023 22:54 UTC
36 points
5 comments
2 min read

Investigating the learning coefficient of modular addition: hackathon project

17 Oct 2023 19:51 UTC
86 points
4 comments
12 min read

Influence functions—why, what and how

Nina Rimsky
15 Sep 2023 20:42 UTC
69 points
6 comments
8 min read

Red-teaming language models via activation engineering

Nina Rimsky
26 Aug 2023 5:52 UTC
65 points
6 comments
9 min read

The Low-Hanging Fruit Prior and sloped valleys in the loss landscape

23 Aug 2023 21:12 UTC
79 points
1 comment
13 min read

Understanding and visualizing sycophancy datasets

Nina Rimsky
16 Aug 2023 5:34 UTC
45 points
0 comments
6 min read

Decomposing independent generalizations in neural networks via Hessian analysis

14 Aug 2023 17:04 UTC
82 points
3 comments
1 min read

Recipe: Hessian eigenvector computation for PyTorch models

Nina Rimsky
14 Aug 2023 2:48 UTC
30 points
5 comments
5 min read

Modulating sycophancy in an RLHF model via activation steering

Nina Rimsky
9 Aug 2023 7:06 UTC
67 points
20 comments
12 min read

Reducing sycophancy and improving honesty via activation steering

Nina Rimsky
28 Jul 2023 2:46 UTC
117 points
16 comments
9 min read

Decoding intermediate activations in llama-2-7b

Nina Rimsky
21 Jul 2023 5:35 UTC
36 points
3 comments
4 min read

Activation adding experiments with llama-7b

Nina Rimsky
16 Jul 2023 4:17 UTC
50 points
1 comment
3 min read

Activation adding experiments with FLAN-T5

Nina Rimsky
13 Jul 2023 23:32 UTC
21 points
5 comments
7 min read

Arguments against existential risk from AI, part 2

Nina Rimsky
10 Jul 2023 8:25 UTC
7 points
0 comments
5 min read
(ninarimsky.substack.com)

Passing the ideological Turing test? Arguments against existential risk from AI.

Nina Rimsky
7 Jul 2023 10:38 UTC
42 points
5 comments
7 min read
(ninarimsky.substack.com)

Distillation: RL with KL penalties is better viewed as Bayesian inference

Nina Rimsky
6 Jul 2023 3:33 UTC
16 points
0 comments
2 min read

Consider giving money to people, not projects or organizations

Nina Rimsky
2 Jul 2023 14:33 UTC
79 points
30 comments
3 min read
(ninarimsky.substack.com)

On household dust

Nina Rimsky
30 Jun 2023 17:03 UTC
74 points
12 comments
5 min read