LW1.0 username Manfred. PhD in condensed matter physics. I am independently thinking and writing about value learning.
Charlie Steiner
If we had a robot with the same cognitive performance as ChatGPT, it would be easy to fine-tune it to be corrigible.
This is false, and the reason may be a bit subtle. Basically “agency” is not a bare property of programs, it’s a property of how programs interact with their environment. ChatGPT is corrigible relative to the environment of the real world, in which it just sits around outputting text. This is easy because it’s not really an agent relative to the real world! However, ChatGPT is an agent relative to the text environment—it’s trying to steer the text in a preferred direction [1].
A robot that literally had the same cognitive performance as ChatGPT would just move the robot body in a way that encoded text, not in a way that had any skill at navigating the real world. But a robot with cognitive capabilities analogous to ChatGPT's, except suited for navigating the real world, would be able to navigate the real world quite well, and would also have corrigibility problems that were never present in ChatGPT, because ChatGPT was never trying to navigate the real world.
[1] AFAIK this is only precisely true for the KL-penalty-regularized version of RLHF, where you can think of the fine-tuned model as strategically spending its limited ability to update the base transition function in order to steer the trajectory toward higher reward. For early-stopping-regularized RLHF you probably get something mathematically messier.
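For concreteness, here's a minimal sketch of the per-token reward shaping I have in mind (the function name and toy numbers are mine, purely illustrative, not taken from any particular RLHF codebase):

```python
import torch

def kl_shaped_rewards(task_rewards, logprobs_policy, logprobs_ref, beta=0.1):
    """Per-token reward for KL-penalty-regularized RLHF.

    The fine-tuned policy gets the task reward, but pays beta * (log pi - log pi_ref)
    for deviating from the base model's next-token distribution, so it has a
    limited "budget" for changing the base transition function and spends it
    where it most steers the trajectory toward higher reward.
    """
    kl_per_token = logprobs_policy - logprobs_ref  # single-sample KL estimate
    return task_rewards - beta * kl_per_token

# Toy usage, shapes [batch, seq_len]; the task reward arrives only on the last token.
task_rewards = torch.tensor([[0.0, 0.0, 1.0]])
logprobs_policy = torch.tensor([[-1.2, -0.7, -0.3]])
logprobs_ref = torch.tensor([[-1.0, -0.9, -0.8]])
print(kl_shaped_rewards(task_rewards, logprobs_policy, logprobs_ref))
```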
Since it was evidently A Thing, I have caved to peer pressure :P
“Shard theory doesn’t need more work” (in sense 2) could be true as a matter of fact, without me knowing it’s true with high confidence. If you’re saying “for us to become highly confident that alignment is going to work this way, we need more info”, I agree.
But I read you as saying “for this to work as a matter of fact, we need X Y Z additional research”:
Yeah, this is a good point. I do indeed think that just plowing ahead wouldn’t work as a matter of fact, even if shard theory alignment is easy-in-the-way-I-think-is-plausible, and I was vague about this.
This is because the way in which I think it’s plausible for it to be easy is some case (3) that’s even more restricted than (1) or (2). Like: (3) if we could read the textbook from the future and use its ontology, maybe it would be easy / robust to build an RL agent that’s aligned because of the shard theory alignment story.
To back up: in nontrivial cases, robustness doesn’t exist in a vacuum—you have to be robust to some distribution of perturbations. For shard theory alignment to be easy, it has to be robust to the choices we have to make about building AI, and specifically to the space of different ways we might make those choices. This space of different ways we could make choices depends on the ontology we’re using to think about the problem—a good ontology / way of thinking about the problem makes the right degrees of freedom “obvious,” and makes it hard to do things totally wrong.
I think in real life, if we think “maybe this doesn’t need more work and we just don’t know it yet,” what’s actually going to happen is that for some of the degrees of freedom we need to set, we’re going to be using an ontology that allows for perturbations where the thing’s not robust, depressing the chances of success exponentially.
you seem to think that we can’t do much empirical or theoretical work right now to improve our understanding of reflective processes
We can certainly do research now that builds towards the research we eventually need to do. But if the empirical work you’re doing right now can predict when an RL agent will start taking actions to preserve its own goals, I will be surprised and even more interested than I already am.
Regarding the latter point, I think lots of your points surrounding lock-in might be stated too strongly. I’m a reflective goal-directed agent, and I don’t think my values are “locked in”; I can and do change my behaviors and moral views in response to new information and circumstances. Maybe you think that “lock-in” involves actual self-modification, so that e.g. an aspiring vegan would reengineer their tastebuds so that meat tastes horrible—but creating shards that discourage this kind of behavior seems easy as pie. Overall, the problems involving “lock-in” don’t seem as hard to me as they do to you
Lock-in is the process that stops the RL agent from slipping down the slope to actually maximizing the reward function as written. An example in humans would be how you avoid taking heroin specifically because you know that it would strongly stimulate the literal reward calculation of your brain.
You seem to be making an implied argument like “this isn’t a big problem for me, a human, so it probably happens by default in a good way in future RL agents,” and I don’t find that implied argument valid.
I think the bigger dangers (and ones we currently don’t know how to address, but might soon) are unknown unknowns and other reflectivity problems, especially those involving how desirable shards might interact in undesirable ways and push our agent towards bizarre and harmful behaviors.
What sort of stuff would be an example of that latter problem? If a shard-condensation process can lead to such human-undesirable generalization taken collectively, why should the individual shards that it condenses generalize the way we want when taken individually?
As one of the people who’s raised such points, I should note that they mostly apply to applications of language models qua language models (which Jozdien correctly does), and that different techniques can be appropriate for different domains.
I think I disagree with lots of things in this post, sometimes in ways that partly cancel each other out.
Parts of generalizing correctly involve outer alignment. I.e. building objective functions that have “something to say” about how humans want the AI to generalize.
Relatedly, outer alignment research is not done, and RLHF/P is not the be-all-end-all.
I think we should be aiming to build AI CEOs (or more generally, working on safety technology with an eye towards how it could be used in AGI that skillfully navigates the real world). Yes, the reality of the game we’re playing with gung-ho orgs is more complicated, but sometimes, if you don’t, someone else really will.
Getting AI systems to perform simpler behaviors safely also looks like capabilities research. When you say “this will likely require improving sample efficiency,” a bright light should flash. This isn’t a fatal problem—some amount of advancing capabilities is just a cost of doing business. There exists safety research that doesn’t advance capabilities, but that subset has a lot of restrictions on it (little connection to ML being the big one). Rather than avoiding ever advancing AI capabilities, we should acknowledge that fact in advance and try to make plans that account for it.
These aren’t necessarily discrete milestones so much as capabilities that can come on a sliding scale, but:
Tools to accelerate alignment research (which are also tools to accelerate AGI research)
AI assistants for conceptual research
Novel modes of AI-enabled writing
AI assistants for interpretability or AI design
Value learning schemes at various stages along both conceptual and technological development
Low conceptual, high technological: people think it has a lot of holes, but it works well with SOTA AI designs and good tools have been developed to handle the human-interaction parts of the value learning.
High conceptual, low technological: Pretty much everyone is, if not excited by it, not actively worried about it, but it would require developing entirely new infrastructure to use.
That said, I’m not sure how much governance plans should adapt based on milestones. Maybe we should expect governance to be slow to respond, and therefore to need plans that are good when applied broadly and without much awareness of context.
I think none are all that relevant. But fun and interesting:
The Metamorphosis of Prime Intellect
Accelerando
Friendship is Optimal
The memo trap reminds me of the recent work from Anthropic on superposition, memorization, and double descent—it’s plausible that there’s U-shaped scaling in there somewhere for similar reasons. But because of the exponential scaling of how good superposition is for memorization, maybe the paper actually implies the opposite? Hm.
My impression from the DreamerV3 paper was actually that I expect it to generalize less easily than EfficientZero in many ways, because it has clever hacks like symlog on rewards that are actually encoding programmer knowledge about the statistics of the environment. (The biggest such thing in EfficientZero is maybe the learning rate schedule? Or its value prediction scheme is pretty complicated, but didn’t seem fine-tuned. But in the DreamerV3 paper there are quite a few pieces that seemed carefully engineered to me.)
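For reference, here’s roughly what the symlog squashing looks like (a minimal self-contained sketch; the function names follow the paper’s terminology, the toy usage is mine):

```python
import numpy as np

def symlog(x):
    """DreamerV3's symmetric-log squashing for rewards/values: roughly linear
    near zero, logarithmic for large magnitudes in either direction."""
    return np.sign(x) * np.log1p(np.abs(x))

def symexp(x):
    """Inverse of symlog, used to map predictions back to the original scale."""
    return np.sign(x) * np.expm1(np.abs(x))

# Rewards spanning several orders of magnitude get compressed into a narrow range.
print(symlog(np.array([-1000.0, -1.0, 0.0, 1.0, 1000.0])))
```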
On the other hand, there are convincing arguments for why eventually, scaling and generalization should be better for a system that learns its own rules for prediction / amplification, rather than being locked into MCTS. But I feel like we’re not quite there yet—transformers are sort of like learning rules for convolution, but we might imagine that for RL we want to be able to learn even more complicated generalization processes. So currently I expect benefits of scale, but we can’t really take off to the moon (fortunately).
There’s still all the same incentives there ever were to build an AI that makes plans to affect the real world. And having a good unsupervised model of the world is a great starting point for an RL agent.
So sure. I would also like it if people just decided to avoid doing that :P
This is a good question but it’s hard to answer. I’ll try and signal-boost this a little later, but I’ll give it a shot first.
This depends on question 2. If you want to funnel people into research, then you want to find people who can read an argument and figure out what it implies, who are fundamentally curious people, who are able to think mechanistically about AI, who when faced with a new problem can come up with hypotheses and tests on their own. But if you find (or help push for) some organization that’s doing scalable software work, then you want people who are good coders, understand systems, etc.
Asking questions for either is hard—you’d think I’d know something about the first, but I don’t really. Maybe the people to ask are organizers for SERI? My stab would be general cognitive questions, asking about past research-, science-, or engineering-like projects, and maybe showing them a gridworld AI from Concrete Problems in AI Safety, getting them to explain what’s going on and why the AI does something bad, and asking them to give one example of where the toy model seems like it would generalize to the real world and one example of where it wouldn’t.
Unfortunately, the answer might be “nowhere”. AI alignment has researcher money, but not many places are actually setting up top-down organizations prepared to integrate new people. The current model looks more like academic research, where people tend to have to be self-directed (which might be for good reasons). The pipelines people are building for adding new people (e.g. SERI MATS, MLAB) are also focused on this kind of self-directed research, rather than hiring people for specific jobs.
In theory, though, interpretability work has plenty of places where skilled software engineers would help, in ways that are scalable enough to justify larger organizations. Redwood Research is the org that has likely put the most thought into this, and maybe you should chat with them.
I’ll have to eat the downvote for now—I think it’s worth it to use magic as a term of art, since it’s 11 fewer words than “stuff we need to remind ourselves we don’t know how to do,” and I’m not satisfied with “free parameters.”
I think it’s quite plausible that you don’t need much more work for shard theory alignment, because value formation really is that easy / robust.
But how do we learn that fact?
If extremely-confident-you says “the diamond-alignment post would literally work” and I say “what about these magical steps where you make choices without knowing how to build confidence in them beforehand” and extremely-confident-you says “don’t worry, most choices work fine because value formation is robust,” how did they learn that value formation is robust in that sense?
I think it is unlikely but plausible that shard theory alignment could turn out to be easy, if only we had the textbook from the future. But I don’t think it’s plausible that getting that textbook is easy. Yes, we have arguments about human values that are suggestive, but I don’t see a way to go from “suggestive” to “I am actually confident” that doesn’t involve de-mystifying the magic.
I mean something like getting stuck in local optima on a hard problem. An extreme example would be if I try to teach you to play chess by having you play against Stockfish over and over, and give you a reward for each piece you capture—you’re going to learn to play chess in a way that trades pieces short-term but doesn’t win the game.
Or, like, if you think of shard formation as inner alignment failure that works on the training distribution, the environment being too hard to navigate shrinks the “effective” training distribution that inner alignment failures generalize over.
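To make the chess example concrete, here’s a hypothetical shaped reward in that spirit (piece values and names are mine, purely illustrative):

```python
# Hypothetical per-move reward: material captured is rewarded, winning is not,
# so the shaped objective and the real objective come apart and the learner can
# settle into "trade pieces eagerly" as a local optimum.
PIECE_VALUES = {"pawn": 1, "knight": 3, "bishop": 3, "rook": 5, "queen": 9}

def shaped_reward(captured_piece=None, won_game=False):
    # won_game is deliberately ignored -- that's the pathology.
    return PIECE_VALUES.get(captured_piece, 0)

print(shaped_reward(captured_piece="knight"))  # 3
print(shaped_reward(won_game=True))            # 0: winning is invisible to the learner
```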
Shard theory alignment has important, often-overlooked free parameters.
Thanks a bunch for this series!
Nice! Collin was a really articulate interviewee.
Here’s a place where I want one of those disagree buttons separate from the downvote button :P
Given a world model that contains a bunch of different ways of modeling the same microphysical state (splitting up the same world into different parts, with different saliency connections to each other, like the discussion of job vs. ethnicity and even more so), there can be multiple copies that coarsely match some human-intuitive criteria for a concept, given different weights by the AI. There will also be ways of modeling the world that don’t get represented much at all, and which ways get left out can depend on how you’re training this AI (and a bit more subtly, on how you’re interpreting its parameters as a world model).
Especially because of that second part, finding good goals in an AI’s world model isn’t satisfactory if you’re just training a fixed, arbitrary AI. Your process for finding good goals needs to interact with how the AI learns its model of the world in the first place. In which case, world-model interpretability is not all we need.
I believe “encoder” refers exclusively to the part of the model that reads in text to generate an internal representation
Architecturally, I think the big difference is bi-directional (BERT can use future tokens to influence latent features of current tokens) vs. uni-directional (GPT only flows information from past to future). You could totally use the “encoder” to generate text, or the “decoder” to generate latent representations used for another task, though perhaps they’re more suited for their typical roles.
EDIT: Whoops, was wrong in initial version of comment.
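To make the masking difference concrete, here’s a minimal sketch (illustrative PyTorch, not either model’s actual code):

```python
import torch

def attention_mask(seq_len, causal):
    """True where a query position may attend to a key position.

    causal=True  -> GPT-style: position i only sees positions <= i.
    causal=False -> BERT-style: every position sees every other, so
                    "future" tokens can shape current-token features.
    """
    mask = torch.ones(seq_len, seq_len, dtype=torch.bool)
    return torch.tril(mask) if causal else mask

print(attention_mask(4, causal=True).int())   # lower-triangular
print(attention_mask(4, causal=False).int())  # all ones
```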
The paper is an impressive testimony to the engineering sweat and tears they had to put in to get their model to generalize as well as it did. Like, just seeing a parameter set equal to 0.997 makes you think “huh, so did they try every value from 0.995 to 0.999?”—to say nothing of all the functional degrees of freedom. The end result is simpler than MCTS on the surface, but it doesn’t seem obvious whether it was any less effort by the researchers. Still plenty cool to read about though.
And also, yes, of course, it’s 2023 and shouldn’t someone have heard the fire alarm already? Even though this research is not directly about building an AI that can navigate the real world, it’s still pushing towards it, in a way that I sure wish orgs like DeepMind would put on a back burner relative to research on how to get an AI to do good things and not bad things if it’s navigating the real world.
Do you have any insider knowledge about this conference? Do you know if it tends to be interesting, have interesting people, be in touch with reality, etc?