LW1.0 username Manfred. PhD in condensed matter physics. I am independently thinking and writing about value learning.
Charlie Steiner
I’m not very enlightened by what tokens most excite the component directions in a vacuum. Interpreting text models is hard.
Maybe something like network dissection could work? What I’d want is a dataset of text samples labeled by properties that you want to find features to track.
E.g. suppose you want features that track “calm text” vs. “upset text.” Then you want each snippet labeled as either calm or upset—or even better, you could collect a squiggly curve for how “calm” vs. “upset” labelers think the text is around any given token (maybe by showing them shorter snippets and then combining them into longer ones, or maybe by giving them a UI that lets them adjust the levels of different features as the text changes). And then you look for features that track that coarse-grained property of the text—that vary on a long timescale, in ways correlated with the variation in how calm/upset the text seems to humans.
And then you do that for a dozen or a gross long-term properties of text you think you might find features of.
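A minimal sketch of the kind of analysis I have in mind, assuming you already have per-token feature activations from the model and the labelers' smoothed calm/upset curve (the names and the windowing choice here are made up for illustration):

```python
import numpy as np

# Hypothetical inputs:
#   activations: array of shape (n_tokens, n_features), one row per token
#   calm_scores: array of shape (n_tokens,), labelers' smoothed calm-vs-upset curve
def rank_features_by_label_tracking(activations, calm_scores, window=32):
    """Correlate each feature's long-timescale variation with the human labels."""
    kernel = np.ones(window) / window
    # Smooth both signals so we compare slow variation, not token-level noise.
    smooth_labels = np.convolve(calm_scores, kernel, mode="same")
    scores = []
    for j in range(activations.shape[1]):
        smooth_feature = np.convolve(activations[:, j], kernel, mode="same")
        corr = np.nan_to_num(np.corrcoef(smooth_feature, smooth_labels)[0, 1])
        scores.append(corr)
    # Features whose slow component tracks the labelers' curve come out on top.
    return np.argsort(np.abs(scores))[::-1], np.array(scores)
```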
Dancers and analytic types have a surprising overlap. What the demographics look like is heavily dependent on the local scene, and really this is only good advice if you like dancing.
I feel like there’s also a Bayesian NN perspective on the PBRF thing. It has some ingredients that look like a Gaussian prior (L2 regularization), and an update (which would have to be the combination of the first two ingredients—negative loss function on a single datapoint but small overall difference in loss).
Said like this, it’s obvious to me that this is way different from leave-one-out. First learning to get low loss at a datapoint and then later learning to get high loss there is not equivalent to never learning anything directly about it.
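To make the ingredients concrete, here is a rough sketch of the three pieces as I've described them above, not the actual objective from the PBRF paper; the function and argument names are invented:

```python
import torch

def pbrf_style_objective(model, ref_model, removed_batch, full_loader, loss_fn, lam=1e-3):
    """Sketch of the three ingredients: negative loss on the removed datapoint,
    a penalty for drifting from the reference model's overall behavior, and an
    L2 pull toward the reference parameters (the Gaussian-prior-looking part)."""
    x_removed, y_removed = removed_batch
    # Ingredient 1: push loss *up* on the single removed datapoint.
    removed_term = -loss_fn(model(x_removed), y_removed)
    # Ingredient 2: stay close to the reference model's outputs on everything else.
    drift = 0.0
    for x, _ in full_loader:
        drift = drift + ((model(x) - ref_model(x).detach()) ** 2).mean()
    # Ingredient 3: L2 regularization toward the reference weights.
    l2 = sum(((p - q.detach()) ** 2).sum()
             for p, q in zip(model.parameters(), ref_model.parameters()))
    return removed_term + drift + lam * l2
```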
Did you ever try out independent component analysis? There’s a scikit-learn implementation even. If you haven’t, I’m strongly tempted to throw an undergrad at it (in an RL setting where it makes sense to look for features that are coherent across time).
EDIT: Nevermind, it’s in the paper. And also I guess in the figure if I was paying closer attention :P
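For anyone who does want to try it, a minimal sketch with scikit-learn, assuming you've collected hidden-state activations over a rollout (the random array here is just a stand-in, and the component count is arbitrary):

```python
import numpy as np
from sklearn.decomposition import FastICA

# Stand-in for real data: an (n_timesteps, n_neurons) array of hidden-state
# activations collected while the agent acts in its environment.
activations = np.random.randn(5000, 256)

ica = FastICA(n_components=32, random_state=0)
sources = ica.fit_transform(activations)  # (n_timesteps, 32) component time courses
directions = ica.mixing_                  # (n_neurons, 32) directions in activation space
# Components whose time courses stay coherent across many steps are the
# candidates for features worth interpreting.
```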
Yes, absolutely, but I don’t expect algorithms to be implemented in separable chunks the way a human would do it. Comparing frequencies of various words just needs an early attention head with broad attention. But such an attention head will also be recruited to do other things, not just faithfully pass on the sum of its inputs, and so you’d never literally find TF-IDF.
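For reference, here is the literal algorithm I mean, as scikit-learn computes it on a few toy documents; the point is just to show what the explicit computation looks like:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat", "the dog chased the cat", "transformers attend broadly"]
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)        # (n_docs, n_vocab) sparse matrix
print(vectorizer.get_feature_names_out())
print(tfidf.toarray().round(2))
# A transformer can use word-frequency information like this, but it's computed by
# attention heads that are also doing other jobs, so you won't find this exact
# matrix factored out anywhere in the weights.
```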
Yeah, good point: we build models of the world, or at least of our senses; we don’t automatically build models of what our neurons are doing.
(Maybe any learning in the brain can be interpreted as a “model” of the neurons that feed into the learning neurons, but the details of that sort of thing aren’t available to our faculties for navigating the world, doing abstract reasoning, or communicating—they’re happening at a lower layer in the software stack of the brain.)
That’s veering towards a more “Mary’s room” sort of definition of “ineffability,” where you can’t freely exchange world-models and experiences, which isn’t really what the Jell-O box analogy was about—it was about interpersonal comparisons, and our inability to experience what other people experience.
But I guess they’re connected. Suppose we’re both listening to a simple tone, but my pitch perception is more accurate than yours. If you want to experience my experience for yourself, you might try taking your own experience and then imagining “adding on some extra pitch perception”—an act of model-to-experience exchange reminiscent of what Mary’s supposed to try.
Sorry, what’s a simple experience? There are externally simple experiences, like looking at a black room in the dark, but it’s not like those experiences use a smaller number of neurons than my other experiences.
Future outputs will at the very least include an accompanying paper-overview-in-a-post, and in general a stronger focus on self-contained papers. I see the booklet as a preliminary, highly exploratory bit of work that focused more on the conceptual and theoretical rather than the applied, a goal for which I think it was very suitable (e.g. introducing an epistemological theory with direct applications to alignment).
Sounds good. I enjoyed at least 50% of the time I spent reading the epistemology :P I just wanted a go-to resource for specific technical questions.
Could I send you a DM with it?
Sure, but no promises on interesting feedback.
The connection between winning an argument and finding the truth continues to seem plenty breakable, both in humans and in AIs.
Is it because of obfuscated arguments and deception, or is there some other fundamental issue that makes you find it so?
Deception’s not quite the right concept. More like exploitation of biases and other weaknesses. This can look like deception, or it can look like incentivizing an AI to “honestly” search for arguments in a way that just so happens to be shaped by the argument-evaluation process’s standards other than truth.
(1) might work, but seems like a bad reason to exert a lot of effort. If we’re in a game state where people are building dangerous AIs, stopping one such person so that they can re-try to build a non-dangerous AI (and hoping nobody else builds a dangerous AI in the meantime) is not really a strategy that works. Unless by sheer luck we’re right at the middle of the logistic success curve, and we have a near-50/50 shot of getting a good AI on the next try.
(2) veers dangerously close to what I call “understanding-based safety,” which is the idea that it’s practical for humans to understand all the important safety properties of an AI, and once they do they’ll be able to modify that AI to make it safe. I think this intuition is wrong. Understanding all the relevant properties is very unlikely even with lots more resources poured into interpretability, and there’s too much handwaving about turning this understanding into safety.
This is also the sort of interpretability that’s most useful for capabilities (in ways independent of how useful it is for alignment), though different people will have different cost-benefits here.
(3) is definitely interesting, but it’s not a way that interpretability actually helps with alignment.
I actually do think interpretability can be a prerequisite for useful alignment technologies! I just think this post represents one part of the “standard view” on interpretability that I on-balance disagree with.
I expect you’d get problems if you tried to fine-tune an LLM agent to be better at tasks by using end-to-end RL. If it wants to get good scores from humans, deceiving or manipulating the humans is a common strategy (see “holding the claw between the camera and the ball” from the original RLHF paper).
LLMs trained purely predictively are, relative to RL, very safe. I don’t expect real-world problems from them. It’s doing RL against real-world tasks that’s the problem.
RLHF can itself provide an RL signal based on solving real-world tasks.
Doing RLHF that rewards performance on some real-world task that’s harder to learn than deceiving/manipulating humans gives the AI a lot of incentive to deceive/manipulate humans in the real world.
Once you understand how it works, it’s no longer surprising.
Take collecting keys in Montezuma’s Revenge. If framed simply as “I trained an AI to take actions that increase the score, and it learned how to collect keys that will only be useful later,” then plausibly it’s a surprising example of learning instrumentally useful actions. But if it’s “I trained an AI to construct a model of the world and then explore options in that model with the eventual goal of getting high reward, and rewarded it for increasing the score,” then it’s no longer so surprising.
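A toy sketch of that second framing, with a hand-written stand-in for the learned world model (nothing here is the actual Montezuma’s Revenge setup; it just shows how key-collecting falls out of planning for reward in a model):

```python
import itertools

# Toy stand-in for a learned world model: states are (has_key, door_open);
# reward only comes from opening the door, and the door needs the key.
def model_step(state, action):
    has_key, door_open = state
    if action == "grab_key":
        return (True, door_open), 0.0
    if action == "open_door" and has_key and not door_open:
        return (has_key, True), 1.0
    return state, 0.0

def plan(state, horizon=3, actions=("grab_key", "open_door", "noop")):
    """Brute-force search over action sequences in the model; pick the best total reward."""
    best_seq, best_return = None, float("-inf")
    for seq in itertools.product(actions, repeat=horizon):
        s, total = state, 0.0
        for a in seq:
            s, r = model_step(s, a)
            total += r
        if total > best_return:
            best_seq, best_return = seq, total
    return best_seq

# Prints a plan that grabs the key before opening the door: the key carries no
# reward of its own, but planning in the model makes collecting it instrumental.
print(plan((False, False)))
```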
Neat! Was normalizing to zero mean actually helpful? It seems like some asymmetries might just be part of the data distribution, and so adjusting for them might mess up perpendicular features.
My feedback is that safety based on total understanding is a boondoggle. (This feedback is also only approximately aimed at you, take what you will.)
Blue sky research is fine, and in a lot of academia the way you get blue sky research funded is by promising the grantmaker that if you just understand the thing, that will directly let you solve some real-world problem, but usually that’s BS. And it’s BS that’s easy to fall into sorta-believing because it’s convenient. Still, I think one can sometimes make a good case for blue sky research on transformers that doesn’t promise that we’re going to get safety by totally understanding a model.
In AI safety, blue sky research is less good than normal because we have to be doing differential technological development—advancing safe futures more than unsafe ones. So ideally, research agendas do have some argument about why they differentially advance safety. And if you find some argument about how to advance safety (other than the total understanding thing), ideally that should even inform what research you do.
All I want for Christmas is a “version for engineers”: here’s how we constructed the reward, here’s how we did the training, here’s what happened over the course of training.
My current impression is that the algorithm for deciding who wins an argument is clever, if computationally expensive, but you don’t have a clever way to turn this into a supervisory signal, instead relying on brute force (which you don’t have much of). I didn’t see where you show that you managed to actually make the LLMs better arguers.
The connection between winning an argument and finding the truth continues to seem plenty breakable, both in humans and in AIs.
I am disagreeing with the underlying assumption that it’s worthwhile to create simulacra of the sort that satisfy point 2. I expect an AI reasoning about its successor to not simulate it with perfect fidelity—instead, it’s much more practical to make approximations that make the reasoning process different from instantiating the successor.
I think this is too big-brain. Reasoning about systems more complex than you should look more like logical inductors, or infrabayesian hypotheses, or heuristic arguments, or other words that code for “you find some regularities and trust them a little, rather than trying to deduce an answer that’s too hard to compute.”
My first thought for an even easier case is imprinting in ducks. A good project might be reading a bunch of papers on imprinting and trying to fit it into a neurological picture of duck learning, Steve Byrnes style. One concern would be if duck imprinting is so specialized that it bypasses a lot of the generality of the world model and motivational system—but from a cursory look at some papers (e.g.) I think that might not be the case.
I think maybe this gwern essay is the one I was thinking of, but I’m not sure. It doesn’t quite answer your question.
But there isn’t a complexity-theoretic argument that’s more informative than general arguments about humans not being the special beings of maximal possible intelligence. We don’t know precisely what problems a future AI will have to solve, or what approximations it will find appropriate to make.
Don’t feel bad about the lack of comments—it’s the lurkers who are wrong :P I’m super excited to see attempts to “just solve the problem,” and I think this kind of approach has a lot going for it.
Here are some various comments I accumulated while reading:
The really important scenario is “I’m an AI considering modifying myself to more effectively realize human values. What should I do?” To be hyperbolic, other data about human values is relevant only insofar as it generalizes to this scenario.
(Okay, maybe you don’t mean to be working on AGI and only mean to do better alignment of the narrow sort already being done to LLMs. I’m still personally interested in thinking ahead to the implications for AGI.)
To this end, we want to promote generalization and be cautious of high levels of context-dependence. We’ve messed up somewhere if the lessons the AI learns about what to say in the context of abortion have no bearing whatsoever on what it says in the context of self-improving AI. Some context dependence is fine, but the main point is to find the models of human values that generalize to important future questions.
It might be a good idea for a value-finding process to be aware of this fact. Maybe the users can get information about how (and to what extent) behavior in one scenario would generalize to other scenarios, given the current model. They might give different feedback if they know what the broader ramifications of that feedback would be.
Another option is meta-level feedback where the users say directly how they want generalization to be done. Actually changing generalization to reflect this feedback is… an open problem.
A third option is active inference by the AI, so that it prioritizes collecting information from humans that might influence how it generalizes to important future questions.
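A crude sketch of what that prioritization could look like, assuming you have an ensemble of candidate value models standing in for uncertainty; every name and the scoring heuristic here are made up:

```python
import numpy as np

# Hypothetical setup: `ensemble` is a list of candidate value models (say, different
# fine-tunes), each a function from a question string to a score in [0, 1].
def query_priority(candidate_questions, important_questions, ensemble):
    """Heuristic: a human query is valuable if the ensemble disagrees on it AND that
    disagreement is correlated with disagreement on the important future questions."""
    important = np.array([[m(iq) for m in ensemble] for iq in important_questions])
    priorities = {}
    for q in candidate_questions:
        answers_q = np.array([m(q) for m in ensemble])
        if answers_q.std() < 1e-8:
            priorities[q] = 0.0  # everyone agrees; asking a human teaches us little
            continue
        corrs = [abs(np.nan_to_num(np.corrcoef(answers_q, row)[0, 1])) for row in important]
        priorities[q] = answers_q.std() * float(np.mean(corrs))
    # Highest-priority queries first.
    return sorted(priorities.items(), key=lambda kv: -kv[1])
```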
More general values aren’t always better. Generalizing values often means throwing away information about who you are and what you’re like. Sometimes I’ll think that information is important, and I’ll want an AI to take those more-specific values into consideration!
So while I admire the spirit of just going for it, I don’t think that connecting expressed values with a broadness/goodness relation is very helpful for aggregation. And de-emphasizing this relation is the same as de-emphasizing the graph structure.
The idea of building this graph was a good one to elaborate on even if I don’t think it works out. I am inspired to go try to list 10 similar ideas and see if progress happens.
Deleting cycles is almost as bad as avoiding all cases where people disagree on a preference ordering. Any democratic process has to be able to handle disagreement! (On second thought, maybe if you’re not aiming for AGI it’s fine to try to just find places where 99% of people agree. I’m still personally interested in handling disagreement.) Rather than deferring to “negotiations,” we might ambitiously try to design AI systems that would be good outcomes of such negotiations if they took place.
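To be concrete about why cycles aren’t just noise you can delete: here’s the classic Condorcet example, where every individual is perfectly consistent and the majority preference still cycles:

```python
from itertools import combinations

# Each voter has a consistent ranking, but the pairwise majority preference cycles.
voters = [["A", "B", "C"],   # voter 1: A > B > C
          ["B", "C", "A"],   # voter 2: B > C > A
          ["C", "A", "B"]]   # voter 3: C > A > B

for x, y in combinations("ABC", 2):
    prefer_x = sum(v.index(x) < v.index(y) for v in voters)
    winner, loser = (x, y) if prefer_x >= 2 else (y, x)
    print(f"majority prefers {winner} over {loser}")
# Prints: A over B, C over A, B over C -- a majority cycle, even though
# no individual voter is inconsistent.
```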
Sure. Here are some bullet points of evidence:
To all appearances, we’re an evolved species on an otherwise fairly unremarkable planet in a universe that doesn’t have any special rules for us.
The causal history of us talking about morality as a species runs through evolution and culture.
We learn to build models of the world, and can use language to communicate about parts of these models. Sometimes it is relevant that the map is not the territory, and the elements of our discourse are things on maps.
In terms of semantics of moral language, I think the people who have to argue about whether they’re realists or anti-realists are doing a fine job. Having fancy semantics that differentiate you from everyone else was a mistake. Good models of moral language should be able to reproduce the semantics that normal people use every day.
E.g. “It’s true that in baseball, you’re out after three strikes.” is not a sentence that needs deep revision after considering that baseball is an invented, contingent game.
In terms of epistemology of morality, the average philosopher has completely dropped the ball. But since, on average, they think that as well, surely I’m only deferring to those who have thought longer on this when I say that.