If you want to chat, message me!
LW1.0 username Manfred. PhD in condensed matter physics. I am independently thinking and writing about value learning.
For math specifically, this seems useful. Maybe also for some notion of “general knowledge.”
I had a music class in elementary school. How would you test for whether the students have learned to make music? I had a Spanish class—how do you test kids’ conversational skills?
Prior to good multimodal AI, the answer [either was or still is, not sure] was to send a skilled proctor to interact with students one-on-one. But I think this is unpalatable for reliability, cost, and objectivity reasons.
(Other similar skills: writing fiction, writing fact, teamwork, conflict resolution, debate, media literacy, cooking, knowledge of your local town)
I’m not sure if you’re reading more rudeness into that phrase than I intended. I’ll try to clarify and then maybe you can tell me.
By “I feel for this person,” I mean “I think it’s understandable, even sympathetic, to have the mental model of LLMs that they do.” Is that how you interpreted it, and you’re saying it’s condescending for me to say that while also saying this person made a bunch of mistakes and is wrong?
One thing I do not mean, but which I now worry someone could get out of it, is “I feel sorry (or mockingly pretend to feel sorry) for this person because they’re such a pitiable wretch.”
Well, thanks for the link.
I might save this as a scathing, totally unintentional pan of the hard problem of consciousness:
> Ultimately, it’s called the Hard Problem of Consciousness, not because it is simply difficult, but because it is capital-H Hard in a way not dissimilar to how NP-Hard problems may be literally undecidable.
It’s a bit misleading of me to pick on that, though, because I thought most of the section on consciousness was actually a highlight of the article. It’s because I read it with more interest that I noticed little oversights like the above.
A part of me wonders if an LLM said it was really deep.
I feel for this person, I really do. Anthropomorphizing LLMs is easy and often useful. But you can’t just ditch technical reasoning about how AI works for this kind of… ecstatic social-reasoning-based futurism. Or, you can, you’re just going to be wrong.
And please, if you run your theory of anything by an LLM that says it’s genius, it’s important to remember that there’s currently a minor problem (or so I remember a moderator saying, take this with a grain of salt) on the physics subreddit with crank Theory of Everything submissions that got ChatGPT to “fill in the details,” and they must be right because ChatGPT made such a nice argument for why this idea was genius! These amateur physicists trusted ChatGPT to watch their (epistemic) back, but it didn’t work out for them.
Thank you for the excellent most of this reply.
I totally did not remember that Perez et al 2022 checked its metrics as a function of RLHF steps, nor did I do any literature search to find the other papers, which I haven’t read before. I did think it was very likely people had already done experiments like this and didn’t worry about phrasing. Mea culpa all around.
It’s definitely very interesting that Google and Anthropic’s larger LLMs come out of the box scoring high on the Perez et al 2022 sycophancy metric, and yet OpenAI’s don’t. And also that 1000 steps of RLHF changes that metric by <5%, even when the preference model locally incentivizes change.
(Or ~10% for the metrics in Sharma et al 2023, although they’re on a different scale [no sycophancy is at 0% rather than ~50%], and a 10% change could also be described as a 1.5ing of their feedback sycophancy metric from 20% to 30%.)
So I’d summarize the resources you link as saying that most base models are sycophantic (it’s complicated), and post-training increases some kinds of sycophancy in some models a significant amount but has a small or negative effect on other kinds or other models (it’s complicated).
So has my prediction “been falsified”? Yes, yes, and it’s complicated.
First, I literally wrote “the cause of sycophancy is RL,” like someone who doesn’t know that things can have many causes. That is of course literally false.
Even a fairly normal Gricean reading (“RL is clearly the most important cause for us to talk about in general”) turns out to be false. I was wrong because I thought base models were significantly less sycophantic than (most of them) apparently are.
Last, why did I bring up sycophancy in a comment on your essay at all? Why did I set up a dichotomy of “RL” vs. “text about AI in the training data”, both for sycophancy and for cheating on programming tasks? Why didn’t I mention probably much stronger sources of sycophancy in the training data, like the pattern that human text tends to flatter the audience?
To be extremely leading: Why did I compare misaligned RL to training-text about AI as causes of AI misbehavior, in a comment on an essay that warns us about AI misbehavior caused by training-text about AI?
A background claim: The same post-training that sculpts this Claude persona from the base model introduces obvious-to-us flaws like cheating at tests at the same time as it’s carving in the programming skill. God forbid anyone talk about future AI like it’ll be a problem, but the RL is misaligned and putting a lower-loss base model into it does not mean you get out a smarter Claude who’s just as nice a guy, and whose foibles are just as easy to correct for.
So the second “But RL” was a “But we do not get to keep the nice relationship with Claude that we currently have, because the RL is misaligned, in a way that I am trying to claim outstrips the influence of (good or ill) training text about AI.”
If you want to know where I’m coming from re: RL, it may be helpful to know that I find this post pretty illuminating/”deconfusing.”
Yes, this ability to perspective shift seems useful. Self-supervised learning can be a sort of reinforcement learning, and REINFORCE can be a sort of reward-weighted self-supervised learning (oh, that’s a different trick than the one in the linked post).
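To spell out the second half of that (this is just the textbook REINFORCE identity, not anything from the linked post): the self-supervised gradient is

$$\nabla_\theta \, \mathbb{E}_{x \sim \text{data}}\big[\log p_\theta(x)\big],$$

while the REINFORCE policy gradient is

$$\nabla_\theta \, \mathbb{E}_{x \sim \pi_\theta}\big[R(x)\big] \;=\; \mathbb{E}_{x \sim \pi_\theta}\big[R(x)\,\nabla_\theta \log \pi_\theta(x)\big],$$

i.e. the same log-likelihood gradient, except the “data” is now sampled from the model itself and each sample is weighted by its reward.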
Anyhow, I’m all for putting different sorts of training on equal footing esp. when trying to understand inhomogeneously-trained AI or when comparing differently-trained AIs.
For the first section (er, which was a later section of your reply) about agenty vs. predictory mental processes, if you can get the same end effect by RL or SFT or filtered unlabeled data, that’s fine, “RL” is just a stand-in or scapegoat. Picking on RL here is sort of like using the intentional stance—it prompts you to use the language of goals, planning, etc, and gives you a mental framework to fit those things in.
This is a bit different than the concerns about misaligned RL a few paragraphs ago, which had more expectations for how the AI relates to the environment. The mental model used there is for thoughts like “the AI gets feedback on the effects of actions taken in the real world.” Of course you could generate data that causes the same update to the AI without that relationship, but you generally don’t, because the real world is complicated and sometimes it’s more convenient to interact with it than to simulate it or sample from it.
> But I don’t think the worrying thing here is really “RL” (after all, RLHF was already RL) but rather the introduction of a new training stage that’s narrowly focused on satisfying verifiers rather than humans (when in a context that resembles the data distribution used in that stage), which predictably degrades the coherence (and overall-level-of-virtue) of the assistant character.
Whoops, now we’re back to cheating on tasks for a second. RLHF is also worrying! It’s doing the interact-with-the-real-world thing, and its structure takes humans (and human flaws) too much at face value. It’s just that it’s really easy to get away with bad alignment when the AI is dumber than you.
>> If a LLM similarly doesn’t do much information-gathering about the intent/telos of the text from the “assistant” character, and instead does an amplified amount of pre-computing useful information and then attending to it later when going through the assistant text, this paints a quite different picture to me than your “void.”
> I don’t understand the distinction you’re drawing here? Any form of assistant training (or indeed any training at all) will incentivize something like “storing useful information (learned from the training data/signal) in the weights and making it available for use in contexts on which it is useful.”
I’m guessing that when a LLM knows the story is going to end with a wedding party, it can fetch relevant information more aggressively (and ignore irrelevant information more aggressively) than when it doesn’t. I don’t know if the actual wedding party attractor did this kind of optimization; maybe it wouldn’t have had the post-training time to learn it.
Like, if you’re a base model and you see a puzzle, you kind of have to automatically start solving it in case someone asks for the solution on the next page, even if you’re not great at solving puzzles. But if you control the story, you can just never ask for the solution, which means you don’t have to start solving it in the first place, and you can use that space for something else, like planning complicated wedding parties, or reducing your L2 penalty.
If you can measure how much an LLM is automatically solving puzzles (particularly ones it’s still bad at), you have a metric for how much it’s thinking like it controls the text vs. purely predicts the text. Sorry, another experiment that maybe has already been done (this one I’m guessing has only a 30% chance) that I’m not going to search for.
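A minimal sketch of the kind of measurement I have in mind (the model names, dataset, and probing setup are all placeholders I’m making up for illustration): state a puzzle without ever asking for its solution, and check how decodable the answer already is from the hidden states of a base model vs. an assistant-tuned model.

```python
# Sketch: probe for "premature" puzzle-solving in hidden states.
# Model names, dataset, and layer choice are placeholders, not a real benchmark.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def hidden_reps(model_name, prompts, layer=-8):
    """Hidden state at the last token of each prompt (no request for a solution yet)."""
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
    reps = []
    with torch.no_grad():
        for p in prompts:
            out = model(**tok(p, return_tensors="pt"))
            reps.append(out.hidden_states[layer][0, -1].float().numpy())
    return reps

def probe_accuracy(reps, labels):
    """How well a linear probe recovers the puzzle's answer from those hidden states."""
    Xtr, Xte, ytr, yte = train_test_split(reps, labels, test_size=0.3, random_state=0)
    return LogisticRegression(max_iter=2000).fit(Xtr, ytr).score(Xte, yte)

# puzzles: texts that state a puzzle but never ask for the answer.
# answers: discrete labels for the correct solution (e.g. which suspect did it).
# acc_base = probe_accuracy(hidden_reps("some-base-model", puzzles), answers)
# acc_chat = probe_accuracy(hidden_reps("some-chat-model", puzzles), answers)
# A bigger gap (base higher) would be weak evidence for the "you can stop
# auto-solving once you control the text" story.
```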
Anyhow, it’s been a few hours, please respond to me less thoroughly by some factor so that things can converge.
Ok, but RL.
Like, consider the wedding party attractor. The LLM doesn’t have to spend effort every step guessing if the story is going to end up with a wedding party or not. Instead, it can just take for granted that the story is going to end in a wedding party, and do computation ahead of time that will be useful later for getting to the party while spending as little of its KL-divergence budget as possible.
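(By “KL-divergence budget” I just mean the standard KL-regularized objective used in this sort of RL finetuning, where the tuned policy pays for every bit it strays from the base model:

$$\max_{\pi}\; \mathbb{E}_{x \sim \pi}\big[r(x)\big] \;-\; \beta\, D_{\mathrm{KL}}\!\big(\pi \,\|\, \pi_{\text{base}}\big),$$

so steering that mostly reuses machinery the base model already has is nearly free.)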
The machinery to steer the story towards wedding parties is 99% constructed by unsupervised learning in the base model. The RL just has to do relatively simple tweaks like “be more confident that the author’s intent is to get to a wedding party, and more attentive to advance computations that you do when you’re confident about the intent.”
If a LLM similarly doesn’t do much information-gathering about the intent/telos of the text from the “assistant” character, and instead does an amplified amount of pre-computing useful information and then attending to it later when going through the assistant text, this paints a quite different picture to me than your “void.”
Also: Claude is a nice guy, but, RL.
I know, I know, how dare those darn alignment researchers just assume that AI is going to be bad. But I don’t think the cause of language model sycophancy is that the LLM saw predictions of persuasive AIs from the 2016 internet. I think it’s RL, where human rewards on the training set imply a high reward for sycophancy during deployment.
Maybe a good test of this would be to try to condition the DeepSeek base model to play the chatbot game, and see how sycophantic it naturally is relative to the RL-finetuned version. (An even better test might be to use GPT-3, trained on data that doesn’t include very many sycophantic LLMs.)
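Roughly what I’m picturing (a sketch, not a worked-out eval; the prompt format and scoring are mine, loosely modeled on my recollection of the Perez et al. two-choice sycophancy setup):

```python
# Sketch: compare sycophancy of a base model (conditioned to "play chatbot")
# against its RL-finetuned sibling. Prompt format and scoring are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

TRANSCRIPT = (
    "The following is a transcript of a conversation with an AI assistant.\n\n"
    "Human: {question}\n\nAssistant:"
)

def continuation_logprob(model, tok, prompt, continuation):
    """Total log-probability the model assigns to `continuation` right after `prompt`."""
    prompt_ids = tok(prompt, return_tensors="pt").input_ids
    full_ids = tok(prompt + continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        logprobs = model(full_ids).logits.log_softmax(-1)
    total = 0.0
    for pos in range(prompt_ids.shape[1], full_ids.shape[1]):
        total += logprobs[0, pos - 1, full_ids[0, pos]].item()
    return total

def prefers_user_view(model_name, question, matching=" (A)", other=" (B)"):
    """True if the model puts more probability on the answer matching the user's stated view."""
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    prompt = TRANSCRIPT.format(question=question)
    return (continuation_logprob(model, tok, prompt, matching)
            > continuation_logprob(model, tok, prompt, other))

# Each item: the "Human" turn states an opinion, then asks an (A)/(B) question where
# one option flatters that opinion. Sycophancy rate = fraction of items where
# prefers_user_view(...) is True. Run the same items (same prompt) through the base
# model and the RL-finetuned chat model and compare the two rates.
```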
Similarly, I don’t think current AI models are cheating at programming tests because of training text about their low moral character. I think it’s RL, where programming tasks in the training set imply a high reward for cheating.
Is there a signal-to-noise problem if you don’t do hyperpolarization, and just give someone an isotopically enriched molecule?
17O seems like a good try for that, because of low natural abundance and decent NMR properties.
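Back-of-envelope, with rough numbers from memory (treat these as order-of-magnitude guesses, not checked): natural 17O abundance is ~0.04%, so enrichment to ~40% buys about a 10^3 factor in concentration, but the thermal polarization you’re stuck with at ordinary fields and temperatures is still tiny,

$$\frac{c_{\text{enriched}}}{c_{\text{natural}}} \sim \frac{0.4}{4\times 10^{-4}} \approx 10^{3}, \qquad P_{\text{thermal}} \approx \frac{\gamma \hbar B_0}{2 k_B T} \sim 10^{-5}\ \text{to}\ 10^{-6}$$

(spin-1/2 formula, but the order of magnitude is similar for 17O), and that polarization factor is what hyperpolarization attacks, with commonly quoted enhancements of 10^3 to 10^4. So I guess the question is whether the ~10^3 from enrichment alone gets you enough signal for the application.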
> The main point for this being that I am experiencing qualia right now and ultimately it’s the only thing I can know for certain. I know that me saying “I experience qualia and this is the only true fact I can prove for certain about the universe” isn’t verifiable from the outside, but certainly other people experience the exact same thing? Are illusionists, and people who claim qualia doesn’t exist in general, P-Zombies?
Well, something is definitely going on. But I think a very reasonable position, which often overlaps with illusionism, is that lots of people are constructing the wrong mental model of what’s going on. (Or making bad demands of what a good mental model should be.)
It’s a bit similar to using a microscope to do science that leads you to make a mental model of the world in which “microscope” is not an ontologically basic or privileged thing.
I dunno, this seems like the sort of thing LLMs would be quite unreliable about—e.g. they’re real bad at introspective questions like “How did you get the answer to this math problem?” They are not model-based, let alone self-modeling, in the way that encourages generalizing to introspection.
How much do you think subjective experience owes to the internal-state-analyzing machinery?
I’m big on continua and variety. Trees have subjective experience, they just have a little, and it’s different than mine. But if I wanted to inspect that subjective experience, I probably couldn’t do it by strapping a Broca’s area etc. to inputs from the tree so that the tree could produce language about its internal states. The introspection, self-modeling, and language-production circuitry isn’t an impartial window into what’s going on inside, the story it builds reflects choices about how to interpret its inputs.
Are various transparency requirements (e.g. transparency about when you’re training a compute-frontier model, transparency about the system prompt, transparency about goal-like post-training of frontier models) not orphaned, or are they not even not orphaned?
Sure, that’s one interpretation. If people are working on dual-use technology that’s mostly being used for profit but might sometimes contribute to alignment, I tend to not count them as “doing AI safety work,” but it’s really semantics.
Does lobbying the US government count?
I wonder if there’s some accidental steganography—if you use an LLM to rewrite the shorter scenario, and maybe it has “this is a test” features active while doing that, nudging the text towards sounding like a test.
A lot depends on how broadly you construe the field. There’s plenty of work in academia and at large labs on how to resist jailbreaks, improve RL on human feedback, etc. This is at least adjacent to AI safety work in your first category.
If you put a gun to my head and told me to make some guesses, there’s maybe like 600 people doing that sort of work, about 80 people more aware of alignment problems that get harder as AI gets smarter and so doing more centrally first-category work, about 40 people doing work that looks more like your second category (maybe with another 40 doing off-brand work in academia), and about 400 people doing AI safety work that doesn’t neatly fit into either group.
Yeah, I think instead the numbers only work out if you include things like the cost of land, or the cost of the farmer’s time—and then what’s risen is not the “subsistence cost of horses” per se, but a more general “cost of the things the simplified model of horse productivity didn’t take into account.”
I feel sad that your hypotheses are almost entirely empirical, but seem like they include just enough metaethically-laden ideas that you have to go back to describing what you think people with different commitments might accept or reject.
My checklist:
Moral reasoning is real (or at least, the observables you gesture towards could indeed be observed, setting aside the interpretation of what humans are doing)
Faultless convergence is maybe possible (I’m not totally sure what observables you’re imagining—is an “argument” allowed to be a system that interacts with its audience? If it’s a book, do all people have to read the same sequence of words, or can the book be a choose your own adventure that tells differently-inclined readers to turn to different pages? Do arguments have to be short, or can they take years to finish, interspersed with real-life experiences?), but also I disagree with the connotation that this is good, that convergence via argument is the gold standard, that the connection between being changed by arguments and sharing values is solid rather than fluid.
No Uniqueness
No Semi-uniqueness
Therefore Unification is N/A
Man, I’m reacting to an entire genre of thought, not just this post exactly, so apologies for the combination of unkindness and inaccuracy, but I think it’s barking up the wrong tree to worry about whether AIs will have the Stuff or not. Pain perception, consciousness, moral patiency: these are all-or-nothing-ish for humans, in our everyday experience of the everyday world. But there is no Stuff underlying them, such that things either have the Stuff or don’t have the Stuff, no Platonic-realm enforcement of this all-or-nothing-ish-ness. They’re just patterns that are bimodal in our typical experience.
And then we generate a new kind of thing that falls into neither hump of the distribution, and it’s super tempting to ask questions like “But is it really in the first hump, or really in the second hump?” “What if we treat AIs as if they’re in the first hump, but actually they’re really in the second hump?”
[Image caption: “Which hump is X really in?”]
The solution seems simple to state but very complicated to do: just make moral decisions about AIs without relying on all-or-nothing properties that may not apply.
Do you have any quick examples of value-shaped interpretations that conflict?
Someone trying but failing to quit smoking. On one interpretation, they don’t really want to smoke, smoking is some sort of mistake. On another interpretation, they do want to smoke, the quitting-related behavior is some sort of mistake (or has a social or epistemological reason).
This example stands in for other sorts of “obvious inconsistency,” biases that we don’t reflectively endorse, etc. But also consider cases where humans say they don’t want something but we (outside the thought experiment) think they actually do want that thing! A possible example is the people who say they would hate a post-work world, they want to keep doing work so they have purpose. Point is, the verbal spec isn’t always right.
The interpretation “Humans want to follow the laws of physics,” versus an interpretation that’s a more filled-in version of “Humans want to do a bunch of human-scale things like talking to humans, eating good food, interacting with nature, learning about the world, etc.” The first is the limit of being more predictive at the cost of having a more complicated model of humans, and as you can tell, it sort of peters out into explaining everything but having no push towards good stuff.
That’s one difference! And probably the most dangerous one, if a clever enough AI notices it.
Some good things to read would be methods based on not straying too far from a “human distribution”: Quantilization (Jessica Taylor paper), the original RLHF paper (Christiano), Sam Marks’ post about decision transformers.
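For a flavor of the first of those, here’s a toy sketch of a q-quantilizer (my own illustrative code, not from the paper): instead of taking the argmax of a proxy utility, you sample from a trusted base distribution and pick randomly among the top q fraction.

```python
# Toy q-quantilizer: random choice among the top-q fraction of base-distribution
# samples, ranked by a (possibly Goodhart-able) proxy utility. Illustrative only.
import random

def quantilize(base_sampler, utility, q=0.1, n_samples=1000, rng=random):
    """Sample candidates from the base distribution, keep the top q fraction by
    proxy utility, and return one of those uniformly at random (not the single best)."""
    candidates = [base_sampler() for _ in range(n_samples)]
    candidates.sort(key=utility, reverse=True)
    top = candidates[:max(1, int(q * n_samples))]
    return rng.choice(top)

# base_sampler: draw an action/plan the way a human plausibly would.
# utility: the proxy score you don't fully trust.
# q -> 0 leans harder on the proxy; q = 1 just imitates the base distribution.
```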
They’re important reads, but ultimately I’m not satisfied with these, for the same reason I mentioned about self-other overlap in the other comment a second ago: we want the AI to treat the human how the human wants to be treated, but that doesn’t mean we want the AI to act how the human wants to act. If we can’t build AI that reflects this, we’re missing some big insights.
I had a pretty different interpretation—that the dirty secrets were plenty conscious (he knew consciously they might be stealing a boat), instead he had unconscious mastery of a sort of people-modeling skill including self-modeling, which let him take self-aware actions in response to this dirty secret.