j_we

Karma: 72

I’m a PhD student working on AI safety. I’m thinking about how we can use interpretability techniques to make LLMs safer.