Nina Rimsky
Karma: 1,393
https://ninarimsky.substack.com/
https://ninarimsky.com/
Steering Llama-2 with contrastive activation additions
Nina Rimsky, Wuschel Schulz, NickGabs, Meg, evhub and TurnTrout | 2 Jan 2024 0:47 UTC | 120 points | 29 comments | 8 min read | (arxiv.org)
A framing for interpretability
Nina Rimsky | 14 Nov 2023 16:14 UTC | 69 points | 5 comments | 4 min read | (ninarimsky.substack.com)
Comparing representation vectors between llama 2 base and chat
Nina Rimsky | 28 Oct 2023 22:54 UTC | 36 points | 5 comments | 2 min read
Investigating the learning coefficient of modular addition: hackathon project
Nina Rimsky and Dmitry Vaintrob | 17 Oct 2023 19:51 UTC | 86 points | 4 comments | 12 min read
Influence functions—why, what and how
Nina Rimsky | 15 Sep 2023 20:42 UTC | 69 points | 6 comments | 8 min read
Red-teaming language models via activation engineering
Nina Rimsky | 26 Aug 2023 5:52 UTC | 65 points | 6 comments | 9 min read
The Low-Hanging Fruit Prior and sloped valleys in the loss landscape
Dmitry Vaintrob and Nina Rimsky | 23 Aug 2023 21:12 UTC | 79 points | 1 comment | 13 min read
Understanding and visualizing sycophancy datasets
Nina Rimsky | 16 Aug 2023 5:34 UTC | 45 points | 0 comments | 6 min read
Decomposing independent generalizations in neural networks via Hessian analysis
Dmitry Vaintrob and Nina Rimsky | 14 Aug 2023 17:04 UTC | 82 points | 3 comments | 1 min read
Recipe: Hessian eigenvector computation for PyTorch models
Nina Rimsky | 14 Aug 2023 2:48 UTC | 30 points | 5 comments | 5 min read
Modulating sycophancy in an RLHF model via activation steering
Nina Rimsky | 9 Aug 2023 7:06 UTC | 67 points | 20 comments | 12 min read
Reducing sycophancy and improving honesty via activation steering
Nina Rimsky | 28 Jul 2023 2:46 UTC | 117 points | 16 comments | 9 min read
Decoding intermediate activations in llama-2-7b
Nina Rimsky | 21 Jul 2023 5:35 UTC | 36 points | 3 comments | 4 min read
Activation adding experiments with llama-7b
Nina Rimsky | 16 Jul 2023 4:17 UTC | 50 points | 1 comment | 3 min read
Activation adding experiments with FLAN-T5
Nina Rimsky | 13 Jul 2023 23:32 UTC | 21 points | 5 comments | 7 min read
Arguments against existential risk from AI, part 2
Nina Rimsky | 10 Jul 2023 8:25 UTC | 7 points | 0 comments | 5 min read | (ninarimsky.substack.com)
Passing the ideological Turing test? Arguments against existential risk from AI.
Nina Rimsky | 7 Jul 2023 10:38 UTC | 42 points | 5 comments | 7 min read | (ninarimsky.substack.com)
Distillation: RL with KL penalties is better viewed as Bayesian inference
Nina Rimsky | 6 Jul 2023 3:33 UTC | 16 points | 0 comments | 2 min read
Consider giving money to people, not projects or organizations
Nina Rimsky | 2 Jul 2023 14:33 UTC | 79 points | 30 comments | 3 min read | (ninarimsky.substack.com)
On household dust
Nina Rimsky | 30 Jun 2023 17:03 UTC | 74 points | 12 comments | 5 min read