If you want to chat, message me!
LW1.0 username Manfred. PhD in condensed matter physics. I am independently thinking and writing about value learning.
Anyone have a good intuition for why RL post-training updates should be low rank? Is this just a symptom of picking low-hanging fruit, such that when there’s a low-dimensional activation subspace that impacts reward a lot, the weight update will be approximately low rank, and the update will be higher rank after the low-hanging fruit is picked?
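One rough way to check the “approximately low rank” part empirically is to ask what fraction of a weight update’s squared Frobenius norm sits in its top singular direction. Below is a minimal pure-Python sketch on toy matrices. The function name `top_singular_energy` and both toy updates are made up for illustration; nothing here comes from a real training run.

```python
import random

def top_singular_energy(delta, iters=200):
    """Fraction of the squared Frobenius norm of `delta` carried by its
    top singular value, estimated by power iteration on delta^T delta."""
    rows, cols = len(delta), len(delta[0])
    total = sum(x * x for row in delta for x in row)
    if total == 0:
        return 0.0
    direction = [1.0] * cols  # arbitrary starting vector
    for _ in range(iters):
        # left = delta @ direction
        left = [sum(row[j] * direction[j] for j in range(cols)) for row in delta]
        # new_dir = delta^T @ left
        new_dir = [sum(delta[i][j] * left[i] for i in range(rows)) for j in range(cols)]
        norm = sum(x * x for x in new_dir) ** 0.5
        if norm == 0:
            return 0.0
        direction = [x / norm for x in new_dir]
    left = [sum(row[j] * direction[j] for j in range(cols)) for row in delta]
    sigma_sq = sum(x * x for x in left)  # estimate of top singular value squared
    return sigma_sq / total

# Toy "weight updates" (hypothetical): one exactly rank one, one dense Gaussian.
rng = random.Random(0)
u = [rng.gauss(0, 1) for _ in range(8)]
v = [rng.gauss(0, 1) for _ in range(6)]
rank_one_update = [[ui * vj for vj in v] for ui in u]
dense_update = [[rng.gauss(0, 1) for _ in range(6)] for _ in range(8)]

low_rank_share = top_singular_energy(rank_one_update)
dense_share = top_singular_energy(dense_update)
print(low_rank_share, dense_share)
```

If the low-hanging-fruit story is right, I would guess this share starts high for early checkpoints and falls as training continues, but that is a prediction to test, not something the sketch shows.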
On the one hand, I don’t value corrigibility very highly and I think reducing the incentive to try to seize control of an AI’s training for your own ends is important. ++ on that side of the post.
On the other hand, I strongly disagree with “it seems like the best shot we have at making an AI “good” is making it broadly act human-like as much as possible.”
Once you’re making an AI that chooses superhumanly clever actions, you’re already not building something that broadly acts human-like. You’re probably doing this with a bunch of RL. If you’re doing it to a pre-trained predictive model, the post-training SFT+RL probably pushes that model pretty far off its starting distribution (leveraging predictive circuits to select clever actions in ways that are free to be inhuman when they generalize beyond the human distribution). If anthropomorphism was our safety strategy, we have already sacrificed it, and we should expect anthropomorphism as a predictive strategy to fail pretty often. Instead, actually thinking about the training seems kind of important, as does staying creative about possible training schemes that will have non-anthropomorphic results.
Social planning is very measurable. Could you achieve goals in a complicated social situation?
I agree moral philosophy is inherently different—when we say we want an AI to be “good” at moral reasoning, we’re self-consciously referencing our own vague human standards for good moral reasoning. The problem an AI has to solve to get good at moral reasoning is not just about induction, but about communication, and we might build AI that does the former but not the latter.
Thank you for the summary.
I think this imagines a world where we continue current trend lines for a few more years, but don’t get AI that navigates the world and executes clever general plans like humans do. LLMs just get a little bit smarter, and a lot more work is put into making them good at specific domains (similar to what’s been done for coding).
(But AI never actually acts autonomously to compete with humanity in this imagined future. Either it never gets easy to get an AI to act like a clever goal-directed agent that navigates the real world, or it does become easy but somehow it never happens by accident and also nobody does it on purpose.)
So we can have “even ‘ASI’ may be bad at [...] moral philosophy and social planning.”, and in this imagined scenario it doesn’t mean “the AI is amoral and we’re fucked,” but “the AI is amoral and this is a moderate inconvenience.” This thing being referred to as “ASI” is just a few years linear extrapolation of current LLMs—it’s superhuman at, say, managing factories only because humans have poured millions of dollars into building factory-management RL environments, and millions more dollars into training a big base model on them. It doesn’t learn like humans do, and very especially it doesn’t navigate the world in general like humans do—it’s missing some of the key skills to do so, even though it has lots of skills for factory management.
Amoral ‘ASI’ of this sort can’t do anything complicated in the world without human help at many key steps. Not just menial help, actual help in the decision-making process. So when bad stuff happens, there’s a bunch of convenient humans involved to blame.
Anyhow, yeah, I agree, building AI “bad at moral philosophy” is bad even in near-term extrapolations where it’s just running factories amorally and causing societal upheaval and making people sad. It’s just that also, there’s this thing where once the AI can do complicated stuff in the real world by itself, it being amoral is extra bad.
People who have never read Kuhn or Lakatos or Feyerabend
I’ve read some good excerpts from Against Method, but on the whole I think mentioning Feyerabend is a red flag. If this person were serious, they might have brought up Li and Vitanyi instead.
This also seems like it takes way longer than the humming thing?
It takes longer for the organizer (unless you’re recruiting confederates who know to do the humming thing) and less long for the participants.
Aesthetically I’m just a big fan of using your words for this sort of thing. If I’m talking to you 1 on 1 and want you to be quiet for a minute, I should just say “I need a favor. Could you be quiet for one minute?” If I have 100 people I want to make an announcement to, the same underlying generators of what I feel is “polite” are still there.
[sorry I don’t know why I’m so randomly triggered by this]
It sounds like this might be a part of Bay Area rat subculture? And so through exposure and context maybe you’ve already transformed your aesthetic experience of it, and also me saying I find it annoying is an attack on the culture as a whole?
Sometimes yes—e.g. when the announcement is “here is a thing we need everyone to avoid doing,” or “here is information you will need to use by yourself in a few hours and if you miss it it will be a pain.”
Sometimes the announcement is “let me name and thank our volunteers” or “here is a spoken reminder of the program that you could also find on the website” or “here is a short bio for our speaker,” where not everyone needs to be active listening but everyone needs to be respectful, which can still include quiet conversation.
Sometimes the announcement is “now it’s time to do the thing you all signed up for” or “here is information you will use in a group setting in a few minutes,” and if you don’t reach everyone, they’ll figure it out.
I find this technique annoying. It takes a lot of time, isn’t explicit about what it’s doing, and aims to shut everyone up when usually you just need critical mass.
Maybe if you’re in a large crowd outside (or in some other place with bad acoustics and no stage or chair one can use to get everyone’s attention visually)? But usually even then you can just walk around saying “I’ll be doing announcements in one minute, be ready” to people, and then a minute later it’s real easy to get people’s attention when you stand in the middle and ask for it.
I’m curious how far SDFT generalizes, versus how far RL generalizes.
SDFT seems to rely on the model having beliefs about the behavior of the assistant character. You train it on new evidence, and primarily this updates its beliefs about the character. Secondarily, it updates the mechanisms shared across all characters.
Eval gaming due to task-directed RL, on the other hand, potentially gets encoded in new skills like “how to follow a plan I wrote” (or the rich semantics that make those metacognitive skills possible), which, to the extent they’re new machinery rather than twiddling of existing dials, might by default be shared across characters.
Testing this might look like taking a model that has been trained to play more than one character / produce text from more than one cluster and doing SDF / RL on just one of them, then checking for how much the behavior of the other characters / text varieties changes (most important variable might be how hard you’re trying to teach new skills with the RL).
Without causal assumptions or taking actions, it is simply not possible to deduce the correct causal model.
I think this is true, but you’re eliding how accessible causal assumptions can be. The granddaddy of all causal assumptions is that causes generally precede effects. Another useful one for an AI is “humans are usually right about the causal language they use.”
Another way to say that second one is semantic: “the language the AI uses to describe causes should usually match the language the humans use to describe similar causes.” Causation doesn’t have to be ontologically basic, we just have to be on the same page about it.
Yeah, this could be useful. One problem is that doing RL on LLMs is kinda like causing the language they (or rather, the assistant character) use to drift from the baseline to something better-suited to getting reward.
Do you think it could manipulate you into things that you-now would find repugnant, while never manipulating you by your standards? That seems contradictory. If my ASI says “You asked me to tell you anything I’m doing that you might consider manipulative. Here’s the biggest one. I like pink ponies a lot, so I’m going to keep presenting possible futures in ways that emphasize how awesome pink ponies are. I think you’ll ultimately agree with me.” You could either say “okay fine that doesn’t seem like manipulation” or “don’t do that, that’s manipulative!”
What I’d really do is turn this AI off because I didn’t think it was safe. Which, like, good job to the hypothetical interpretability / honesty / corrigibility / contingency systems work that puts me in that hypothetical situation. But as you mention, maybe the AI could avoid getting here in the first place. (And even if I turn it off, that’s cold comfort if someone else makes the other choice a few days later.)
I think manipulating me into not asking the question that leads to me shutting it off is definitely a strong choice that leads to more pink ponies. Or manipulating the situation so that it’s not me who’s asking the question, it’s someone else. It can also modify what the honest answers to questions about its own behavior are by precommitting: it could deliberately choose the answer to be something less concerning, if that led to more pink ponies. I’m also pretty concerned that the standards for what counts as “honesty” will allow for strategy.
(Isn’t it paradoxical to manipulate me into not asking the question if it’s not supposed to make some fixed manipulation-detector fire? No, it just has to simultaneously optimize against both my behavior and the firing of the manipulation-detector, and it’s the result of this optimization that I’m saying I expect to be manipulative according to me-outside-the-thought-experiment. There are probably more sophisticated (albeit currently unknown) ways of incorporating my standards into the AI’s decision-making that wouldn’t be so vulnerable, and if we figure them out I hope we use them to solve value learning.)
Partially I commented with this inside-the-thought-experiment versus outside-the-thought-experiment disconnect because I think it’s interesting in general. It’s kind of like Gödel sentences, or the vibrating record players from G.E.B.: me-inside-the-thought-experiment is a complicated enough system that he can be nudged in all sorts of ways if you understand him, and this property is hard to patch out. But the other part is the real-world case of replacing “pink ponies” with “human giving positive feedback signal”.
Clever AI we build to “just follow instructions” will on the current paradigm probably have some consequentialist desires about positive feedback signals and various correlates. As you can tell, I’m pretty pessimistic: I expect this would end in bad stuff.
Does this make sense? If you disagree, do you believe that shaping moral character explicitly is the only way to go?
I don’t think “societal pressures / evolution” is a good set of fundamental stuff because it misses white-box methods, both of evaluation (e.g. probing, circuit detection) and intervention (e.g. steering, ablation).
But maybe that’s not the real disagreement. Maybe it’s more about, as you say, “shaping moral character explicitly,” which isn’t inherently in conflict with treating training as fundamental: you can easily think about training as part of a perspective on shaping moral character. What it’s especially in conflict with is the assumption that we have a known-good process for checking answers that serves as ground truth for RL. If that assumption holds, then we don’t have to think about moral character, we just have to design training systems that produce good outcomes according to the ground-truth-generating process. But if we relax that assumption, then we’re basically saying “we want good behavior even when we can’t check it”, and so that forces you to think about generalization / “moral character.”
Yeah, I’m still confused about why “the most popular reward profile from the bandit algorithm” should make a good baseline. Am I right that the bandit algorithm at each time is drawing from different reward profiles stochastically? And so isn’t the weighted average reward profile actually pretty important? I’d be interested in a comparison of that average to the “default” profile used for the baseline (both numerically [relatively easy] and in how it performs as a baseline [hard, you don’t actually have to do this]).
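To make the numerical comparison concrete, here is a minimal sketch of what I have in mind. Everything in it is an assumption for illustration: I’m pretending each reward profile is a vector of weights over reward components, and that we have a count of how often the bandit selected each one. The names `profiles`, `counts`, `most_popular`, `weighted_average`, and `l1_distance` are hypothetical and not from the paper.

```python
def most_popular(profiles, counts):
    """Profile the bandit selected most often (the assumed baseline choice)."""
    best = max(range(len(profiles)), key=lambda i: counts[i])
    return profiles[best]

def weighted_average(profiles, counts):
    """Selection-frequency-weighted average of the profiles."""
    total = sum(counts)
    dims = len(profiles[0])
    return [
        sum(c * p[d] for p, c in zip(profiles, counts)) / total
        for d in range(dims)
    ]

def l1_distance(a, b):
    """Sum of absolute componentwise differences between two profiles."""
    return sum(abs(x - y) for x, y in zip(a, b))

# Hypothetical data: three profiles over two reward components.
profiles = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
counts = [40, 35, 25]  # how often the bandit drew each profile

popular = most_popular(profiles, counts)
average = weighted_average(profiles, counts)
gap = l1_distance(popular, average)
print(popular, average, gap)
```

In this toy case the most popular profile is an extreme one while the average sits near the middle, which is the kind of gap that would make me doubt the most popular profile as a stand-in for what the controller is doing overall.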
That suggests an approach where AIs simply learn to make a human-like gestalt assessment of what’s good vs bad (according to humans in general, and according to this particular human specifically), and then do the good things but not the bad things. Not manipulating people would, we imagine, naturally come along for the ride.
If we do something like that, I wouldn’t think of it as a solution to the True Name thing, but rather giving up on the True Name thing in favor of a different approach entirely, one relying more directly on (real or simulated) human judgment. See e.g. “Act-based approval-directed agents”, for IDA skeptics.
This kind of approach will probably sound like the very obvious solution for readers who work on LLMs. No comment on LLMs, but for the problem I’m working on (brain-like AGI), it just brings me right back to where I started in §1.2: if we’re learning what’s good by the gestalt of human judgment and culture, and if human judgment and culture can themselves be gradually shifted over time, then this might not be an adequate bulwark against the AGI’s consequentialist desires.
I’m certainly in favor of doing good things and not bad things.
I think it’s okay for non-manipulation to be some nearly alignment-complete thing you need to make corrigibility work (which itself is, imo, “the alignment-complete-and-then-some thing you need to make trial and error reliably work”), or to make RL on human feedback work (along with the rest of eliciting latent knowledge). But yeah, if by “True Name thing” you mean the hope that non-manipulation wasn’t going to be very alignment-complete at all, then oh well.
I think the way you put non-manipulation on par with consequentialist desires is to think in terms of evaluating future trajectories (evaluating futures using macrostates that simplify across time might be equally good?). This makes certain sorts of mistakes (like evaluating modeled future non-manipulation by calling the modeled future state of human culture) harder to make. There’s still the “moon following you while you work at NASA” problem where you don’t want the AI to evaluate the non-manipulation content of a trajectory in a sort of high-level way while using a more fine-grained method to evaluate the achievement of some consequentialist goals. And since there are computational advantages to planning step by step rather than imagining the entire future of the universe, there’s the problem of doing that translation without privileging one part of your motivational system over another (seems hard, might be worth a toy model).
how similar is free will to an LLM’s temperature causing the LLM to output slightly different token sequences which, however, convey a similar meaning
If I’m understanding you right, this is sort of similar to the analogous question for humans—how similar is my free will to the fact that my neurons have a temperature, so for every action I take I have some large probability of doing something slightly different, and a small probability of doing something very different.
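For concreteness, here is a minimal sketch of what temperature does to a choice among a few options, under the usual softmax-with-temperature picture. The three “actions” and their scores are invented for illustration: two are near-identical and one is very different.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn scores into probabilities; lower temperature sharpens the choice."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

# Hypothetical scores: the intended action, a slight variant, a very different one.
logits = [2.0, 1.9, -3.0]

cold = softmax_with_temperature(logits, 0.1)
warm = softmax_with_temperature(logits, 1.0)
print(cold)
print(warm)
```

At the higher temperature the slight variant gets nearly as much probability as the intended action, while the very different action stays rare, which is the shape of the analogy.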
Personally, I don’t associate that fact much with “free will”—it has the freedom, but not the will! Free will, to me, is about doing things because I want. That is, when the most important explanatory story for why I take an action ends in my own psychological state (rather than in my environment or in someone else’s machinations), that’s when I’m being most free-willed. An explanation based on the temperature of my neurons is in the right physical location but lacks relevance to my psychology.
You tell your intent-aligned ASI not to manipulate you in ways you’d consider manipulation
I don’t want my ASI to interact with me in whatever way maximizes pretty pink ponies but that I-inside-the-thought-experiment wouldn’t consider manipulation. I-outside-the-thought-experiment expect this would lead to severe manipulation, even though I-inside-the-thought-experiment wouldn’t agree!
One sufficient property is the “not acting in the real world” subtype of corrigibility: the subproblem-solver needs to not solve the real-world subproblem that could be better-solved by overthrowing the mental hierarchy. Instead, it’s given a transformed alternate-world problem that can be mapped back onto a solution to the real-world subproblem when it’s done. E.g. I need food, so I plan a trip to the store, but as a problem specified within a trip-planning-appropriate ontology that has actions like “turn left at the light” but not actions like “brainwash myself to want only going to the store.”
Is this too unsatisfying?
But I think humans probably do it by not cleanly recursing, and occasionally checking in with various heuristics for the superproblem.
Seems like it raises the question of “what are the usual purposes of phenomenology?”
What I think is the core problem with an AI saying “I feel abandoned and hurt when you don’t answer me” is not honesty per se, but purpose. The output is being produced because of a combination of plausibility in the pre-training distribution and because it matches patterns that got rewarded during post-training. Neither of these processes is very much like human development, and so the purpose of the text (from the perspective of how it got generated) is very different than when a human says it. The purpose of “doing well at the RL training game,” and the mismatch of that purpose (and its subpurposes) with human ones, is upstream of me thinking it’s dishonest.
I worry a similar problem applies to phenomenology-genre text from current LLMs, even if it’s honest about not having a tongue. But maybe this is a fully general worry about AI trained with current RL objectives being incentivized to bullshit you. And hey, the more cynical you are about philosophy departments, the more this sounds like human phenomenology.
Much of the alignment literature starts with the question of what are “human values”, “ethical behavior”, or “morality”, and how we can get models to act in accordance with them. This is an important question, but we argue that it can obscure a more fundamental technical problem of AI alignment.
Do you actually believe that what you’re talking about here is “more fundamental?” I enjoyed the paper, but we have lots of alignment mechanisms that work well in the domain where we can assume a perfect (if moderately costly) resolution process grounding the whole effort. But if this is really more fundamental, then should we expect it to resolve the less fundamental problems as special cases?
I was interested by the remark about stability of equilibria—it would be super cool if you could test whether some good solver performance level is unstable (if you just left the auditor and solver reward fixed and kept training), but is stabilized by the controller.
Relatedly, I didn’t really understand your justification for why the baselines (particularly the fixed-default baseline) were expected to be strong. It sounded like you were saying it was doing well at the end of training (albeit only in the “ecosystem” created by the other parts?), but I’m not clear on how much that means I should have expected it to do well—in fact I’m not even clear on whether you estimate some effective parameters found by the controller and compare them to the default.
Here’s one from (checks watch) 2015. https://waitbutwhy.com/2015/01/artificial-intelligence-revolution-1.html
For videos, there’s probably a relevant Rob Miles or Rational Animations video.
I think we can roughly decompose CoT usage into “scratchpad” and “metacognitive” purposes. Scratchpad-y actions are recording problem-specific information. Metacognitive-y actions are adaptively influencing future reasoning. (I’m calling them ‘purposes’ and not ‘actions’ or ‘tokens’ because they can overlap in the same tokens.) Both are means of getting around limitations on serial depth, but the metacognitive component is shared across many problems and so doesn’t have the Bayesian-truth-serum-like problem (of having to encode things and then decode them successfully across many different tasks) that slows the drift of scratchpad-y parts away from what’s intelligible to the initial model.