figuring out how values are represented;
I feel like basically none of the key terms here are well defined. What is a value? I don’t think there is a good answer for Humans yet. How would we know if it was represented?
In my experience you can look at a neuron’s high-activation phrases and get some sense of which ideas have a high probability of triggering that neuron. That doesn’t mean this neuron is the representation of those ideas: many other neurons may trigger on related concepts nearby in conceptual space, and your own parsing of conceptual space into discrete, separate concepts may be flawed, so you can’t be sure that what you’re thinking about is actually a concept in some higher Platonic sense.
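To make “look at a neuron’s high activation phrases” concrete, here’s a minimal sketch of the bookkeeping involved. The phrases and activation values are invented toy data; in practice the matrix would come from running a corpus through the model and recording a hidden layer’s activations.

```python
import numpy as np

# Toy stand-in: rows are phrases, columns are neurons. All phrases and
# activation values here are invented for illustration, not real GPT2 data.
phrases = ["golden gate bridge", "steel suspension span",
           "san francisco fog", "chocolate cake recipe",
           "orange painted towers", "quarterly tax filing"]
acts = np.array([
    [0.9, 0.1],   # activations of neuron 0 and neuron 1 on each phrase
    [0.7, 0.2],
    [0.6, 0.3],
    [0.1, 0.8],
    [0.8, 0.1],
    [0.0, 0.9],
])

def top_activating(neuron, k=3):
    """Return the k phrases with the highest activation for one neuron."""
    order = np.argsort(acts[:, neuron])[::-1][:k]
    return [phrases[i] for i in order]

print(top_activating(0))
```

The point of the caveat above is that even when this list looks coherent to you (“bridge stuff”), the grouping is your interpretation of the top of one neuron’s activation distribution, not a guarantee about what the neuron represents.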
To use the famous mech interp example: Is the Golden Gate Bridge neuron actually about the Golden Gate Bridge? Maybe it is more broadly about steel structures and existing in San Francisco and being colored orange. The GGB neuron or feature may trigger on other concepts, or at least concepts we see as distinct, and may signify a broader concept to the LLM than what we assume.
Maybe something like this:
I’ve done a lower-dimensional embedding of layer 24 of GPT2-XL and there seem to be interpretable directions in the embedding. In particular there are two main dimensions which spread the neurons out (i.e. dissimilarity is mostly expressed in these two dimensions; the other dimensions just correlate very highly with a linear combination of the first two, so we get a flat disc in the embedding space). The main one is what I call a narrative-personal dimension. On one end is “narrative” language, like Wikipedia articles and news stories, where the writer and audience are outside observers unable to directly affect the phenomena under discussion. On the other end is “personal” language, structured conversationally, like short fiction where characters discuss events in first person, or sales pitches; in this language the author and/or the audience can participate.
The other dimension seems to be a valence dimension where “good things” are on one side and “bad things” are on the other. This is kinda perhaps what you’d be after.
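As a sketch of what I mean by the “flat disc” (dissimilarity concentrated in two dimensions), here’s a toy PCA. The “neuron” vectors are synthetic stand-ins built to lie mostly in a 2-D plane plus a little noise, not actual GPT2-XL data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for neuron vectors: 200 "neurons" in 10 dimensions
# whose variation lives almost entirely in a random 2-D plane, plus small
# isotropic noise -- the "flat disc" situation described above.
n, d = 200, 10
planar = rng.normal(size=(n, 2)) @ rng.normal(size=(2, d))
neurons = planar + 0.05 * rng.normal(size=(n, d))

# PCA via SVD on the centered matrix; explained[i] is the fraction of
# total variance carried by principal component i.
centered = neurons - neurons.mean(axis=0)
_, s, _ = np.linalg.svd(centered, full_matrices=False)
explained = s**2 / np.sum(s**2)

print(explained[:3])
```

In this toy case the first two components carry nearly all the variance; finding that same spectrum shape in the real neuron embedding is what licenses reading off just two main axes.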
“Good” is going to be relationally defined: the area where the machine thinks about sunshine and lollipops is the “good” part, and we only know that because the “good” stuff is there (i.e. without reference to sunshine and lollipops, which to be explicit are just stand-ins for general “good” things, we can’t really define “good”). What you want to know is whether the thoughts about “kill all humans” are, for the machine, more similar to sunshine and lollipops or to fear, piss, and death (fear, piss, and death being concepts likely on the bad end of any dichotomous good-bad principal component).
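A toy sketch of that comparison, with hand-invented 2-D direction vectors standing in for the machine’s actual representations (every vector below is made up for illustration):

```python
import numpy as np

def cos(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Invented toy directions; in practice these would be the model's own
# representations of each phrase, not hand-written numbers.
good_anchors = {"sunshine": np.array([1.0, 0.2]),
                "lollipops": np.array([0.9, 0.1])}
bad_anchors = {"fear": np.array([-1.0, 0.3]),
               "piss": np.array([-0.8, 0.1]),
               "death": np.array([-0.9, -0.2])}
probe = np.array([-0.7, 0.4])  # stand-in for "kill all humans"

good_sim = np.mean([cos(probe, v) for v in good_anchors.values()])
bad_sim = np.mean([cos(probe, v) for v in bad_anchors.values()])
print(good_sim, bad_sim)
```

The interesting empirical question is which of the two averages is larger when the vectors come from the model itself rather than from my hand.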
Of course this only works if the machine actually lays out similarity among neurons in such a way that sunshine and lollipops are grouped as similar, which we don’t generically know (by my preliminary results I am pretty sure GPT2 does this, but I still need to analyze more neurons). This may not be the case, and the LLM may not see traditional Human valence as salient in any way; an embedding in a similarity space may find sunshine next to piss, and death strictly between sunshine and lollipops. In such a case Human value may be completely alien to the LLM; indeed the concept of value in general may be alien to such a system.
Preliminarily, GPT2-XL layer 24 does seem to have an “emerging modern threats” neuron, which triggers on a number of threatening-sounding phenomena contemporary to the 2010s. In particular it triggers on the vaguely threatening-sounding “5G” and “big data.” It also triggers a lot on discussions of post-apocalyptic futures, including an AI uprising (Horizon Zero Dawn, a video game set in the aftermath of an AI uprising, appears in the top activation phrases, for example). This neuron embeds near the extreme personal-bad corner of the general mass of neurons, so I think we’re on the same page as GPT2 on this. Incidentally this neuron also triggers on “Glenn Beck,” which I find funny.
Inkhaven seems like a waste. To get more good writers we don’t need people to “practice” (and flood LW with a month of not-really-worth-it content forced out of constipated brains); we need an established big-name editor, with a built-in following, to find good work and tell other people it is good (including the writer).
While there likely aren’t as many bad posts that get widely upvoted and read (there are some, I’m sure), the bigger problem is good writing that gets no attention and dies in obscurity. That adds noise to the writer’s quality signal. Elevating good writing to public attention is what editors/curators are for.