To believe that you’re a one-in-a-million case (e.g. in the first or last millionth of all humans), you need about 20 bits of information (because 2^20 ≈ 1,000,000).
So on the one hand, 20 bits can be hard to come by if the topic is one it’s hard to get reliable information about. But we regularly get more than 20 bits of information about all sorts of questions (reading this comment has probably given you more than 20 bits of information). So how hard this should “feel” depends heavily on how well we can translate our observational data into information about the future of humanity.
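As a quick sanity check on the arithmetic, here’s a minimal Python sketch (the Bayesian-odds framing at the end is my gloss, not something stated in the comment):

```python
import math

# Bits needed to single out a one-in-a-million case.
n = 1_000_000
bits = math.log2(n)          # ~19.93, i.e. "about 20 bits"

# Equivalently: a likelihood ratio of roughly 2**20 is what it takes
# to promote a 1-in-a-million prior to even odds.
prior_odds = 1 / (n - 1)
posterior_odds = prior_odds * 2**bits   # ~1.0, i.e. roughly 50/50

print(f"{bits:.2f} bits; posterior odds ≈ {posterior_odds:.3f}")
```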
Extra note: in the case that there are an infinite number of humans, this uniform prior actually breaks down (or else, naively, you’d think you have a 0.0% chance of being anyone at all), so there can be a finite contribution from the possibility that there are infinitely many people.
People are bad at interpreting the Doomsday Argument because they’re bad at treating evidence as Bayesian evidence rather than as a direct statement of the correct belief.
The Doomsday Argument is evidence that we should update on. But it is not a direct statement of the correct belief.
On a parallel earth, humanity is on the decline. Some disaster has struck, and the once-billions of proud humanity have been reduced to a few scattered thousands. Now the last exiles of civilization hide in sealed habitats that they no longer have the supply chains to repair, and they know that soon enough the end will come for them too. But on the other hand, the philosophers among them remark, at least there’s the Doomsday Argument, which says that on average we should expect to be in the middle of humanity. So if the DA is right, the current crisis is merely a bottleneck in the middle of humanity’s time, and everything will probably work itself out any day now. The last philosopher dies after breathing in contaminated air, with the last words “No! The position I occupy is… very unlikely!”
Your eyes and ears also provide you evidence about the expected span of humanity.
Fun podcast. The analogy to human planning horizons was a very thought-provoking one. Though obviously, there are forces that explain the way things are; competition between different interests is a strong selection pressure for short-termism.
Is SIDLE not also a perfectly fine word? I don’t know how this went through peer review.
Anyhow, good newsletter this week, thanks :)
I almost agree, but still ended up disagreeing with a lot of your bullet points. Since reading your list was useful, I figured it would be worthwhile to just make a parallel list. ✓ for agreement, × for disagreement (• for neutral).
✓ I think we’re confused about what we really mean when we talk about human values.
× But our real problem is on the meta-level: we want to understand value learning so that we can build an AI that learns human values even without starting with a precise model waiting to be filled in.
_ × We can trust AI to discover that structure for us even though we couldn’t verify the result, because the point isn’t getting the right answer, it’s having a trustworthy process.
_ × We can’t just write down the correct structure any more than we can just write down the correct content. We’re trying to translate a vague human concept into precise instructions for an AI.
✓ Agree with extensional definition of values, and relevance to decision-making.
• Research on the content of human values may be useful information about what humans consider to be human values. I think research on the structure of human values is in much the same boat—information, not the final say.
✓ Agree about Stuart’s work being where you’d go to write down a precise set of preferences based on human preferences, and that the problems you mention are problems.
✓ Agree with assumptions.
• I think the basic model leaves out the fact that we’re changing levels of description.
_ × Merely causing events (in the physical level of description) is not sufficient to say we’re acting (in the agent level of description). We need some notion of “could have done something else,” which is an abstraction about agents, not something fundamentally physical.
_ × Similar quibbles apply to the other parts—there is no physically special decision process, we can only find one by changing our level of description of the world to one where we posit such a structure.
_ × The point: Everything in the basic model is a statistical regularity we can observe over the behavior of a physical system. You need a somewhat more nuanced way to place preferences and meta-preferences.
_ • The simple patch is to just say that there’s some level of description where the decision-generation process lives, and preferences live at a higher level of abstraction than that. Therefore preferences are emergent phenomena from the level of description the decision-generation process is on.
_ _ × But I think if one applies this patch, then it’s a big mistake to use loaded words like “values” to describe the inputs (all inputs?) to the decision-generation process, which are, after all, at a level of description below the level where we can talk about preferences. I think this conflicts with the extensional definitions from earlier.
× If we recognize that we’re talking about different levels of description, then preferences are neither causally after nor causally before decisions-on-the-basic-model-level-of-abstraction. They’re regular patterns that we can use to model decisions at a slightly higher level of abstraction.
_ • How to describe self-aware agents at a low level of abstraction then? Well, time to put on our GEB hats. The low level of abstraction just has to include a computation of the model we would use on the higher level of abstraction.
✓ Despite all these disagreements, I think you’ve made a pretty good case that the human brain plausibly computes a single currency (valence) that it uses to rate both most decisions and most predictions.
_ × But I still don’t agree that this makes valence human values. I mean values in the sense of “the cluster we sometimes also point at with words like value, preference, affinity, taste, aesthetic, intention, and axiology.” So I don’t think we’re left with a neuroscience problem, I still think what we want the AI to learn is on that higher level of abstraction where preferences live.
It’s just a measure of how close the data is to the line—like the “inside view” uncertainty that the model has about the data. In fact, that’s more precisely what it is if this is the chi squared statistic (or square root thereof) that you minimized to fit the model. And it’s in nice convenient units that you can compare to other things.
It’s not quite right, because it uses an implicit prior about noise and models that doesn’t match your actual state of information. But it’s something that someone who’s currently reporting R^2 to us can do in 30 seconds in Excel.
This was not what I expected to learn today :) Alas, poor gonads, I hardly knew ye.
Well, I was skimming through Word and Object when I “became enlightened,” but it may have mostly been a catalyst. Still recommended though?
I don’t think I was very clear about what problem I was solving, and I don’t think you managed to read my mind, so let me try again.
The problem I was interested in was: how does reference work? How can I point at or verbally indicate some thingie, and actually be indicating the thingie in question? And could I program that into an AI?
In your post, you connect this to indexicals, which I’ve interpreted as a question like “how does reference work? How can I point at or verbally indicate some thingie, and actually be indicating that thingie, in a way that you could explain to a microscopic physics simulation?”
One of the key parts of the solution is that words don’t have inherent “aboutness” attached to them. Reference doesn’t make any sense if you just focus on the speaker and try to define the aboutness in their statements. It needs to be interpreted as communication, which uses some notion of a functional audience you’re constructing a message for.
So that question of “How do I verbally indicate the thing and really indicate it?” has to be left unanswered to the extent that we have false beliefs about our ability to “really indicate” things. Instead, I advocate breaking it down into questions about how you model other people and choose communicative acts.
So I am absolutely not saying we should replace “is x true?” with “is x a communicatively useful act?”. The closest thing I’m saying would be that we can cash out “what is the referent of sentence x?” into “what is the modeled audience getting pointed at by the act of saying sentence x?”.
I’m not sure how you’re interpreting physicalism here. But if we single out the notion that there should be some kind of “physics shorthand” for human concepts and references—like H2O is for water, or like the toy model of reference as passing numerical coordinates—then yeah, there is no physics shorthand. Where there is something like it, it is humans that have done the work to accommodate physics, not vice versa.
Yeah, I spent a lot of last year struggling with the reference thing. In the end I decided that reference was not fundamental even within the human-centered picture, and that reference was just a special case of communication (in the sense of Quine, Grice, et al.: I do a communicative act because I model you as modeling why I do it.)
Figuring this out made me a bit upset with academic philosophy, because I’d been looking through the recent literature fruitlessly before I found Quine basically solving the problem 50 years before. This is the opposite of the problem I usually pin on philosophy, that it’s too backward-looking. In this case, it’s more like the people talking about reference within the last 20 years are all self-selected for not caring about Quine much at all.
Whether or not you find this useful may depend on a certain mental maneuver of taking something you were asking a question about, and breaking it into pieces rather than answering the question. In this case, “How are the semantics of a sentence determined?” is a question, but rather than answering it I’m advocating getting rid of this high-level-of-abstraction word “semantics” by working in a more concrete level of description where there are humans with models of each other. And of course I’ve framed this in a very palatable way, but I think whether this maneuver feels good or not is a big dividing line. If you have the unshakeable feeling that I have missed something vital by not answering the original question, then you fall on the other side of the line—though perhaps one can still be lured over with practical applications.
It sure seems like if he really grokked the philosophical and technical challenge of getting a GAI agent to be net beneficial, he would write a different paper. That first challenge sort of overshadows the task of dividing up the post-singularity pie.
But I’m not sure whether the overshadowing is merely by being bigger (in which case this paper is still doing useful work), or if we should expect that solutions to the pie-dividing problems (e.g. weighing egalitarianism vs. utilitarianism) will necessarily fall out of the process that lets the AI learn how to behave well.
I’ll probably post a child comment after I actually read the article, but I want to note before I do that I think the power of ResNets is evidence against these claims. Having super-deep networks with residual connections promotes a picture that looks much more like a continuous “massaging” of the data than a human-friendly decision tree.
Picking a descriptive statistic for these sorts of problems is pretty tricky. But I think we can do better than R^2, even without going all Bayesian-parameter-estimation.
What I mostly care about is just the standard deviation (in Excel, STDEV.S()) of the difference between the data and the model. Then I want to know how this compares to other scales in the data (like the average number of new cases per day).
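Here’s a minimal Python sketch of that calculation; the arrays `cases` and `model_prediction` are made-up placeholder data, not anything from the thread:

```python
import numpy as np

# Hypothetical data: daily new cases and the fitted model's predictions.
cases = np.array([120, 135, 150, 160, 180, 210, 240], dtype=float)
model_prediction = np.array([118, 132, 148, 165, 184, 206, 230], dtype=float)

residuals = cases - model_prediction

# Sample standard deviation of the residuals (Excel's STDEV.S uses ddof=1).
residual_sd = np.std(residuals, ddof=1)

# Compare against a natural scale in the data, e.g. the mean daily case count.
mean_cases = cases.mean()
print(f"residual sd ≈ {residual_sd:.1f} cases/day, "
      f"vs. mean of {mean_cases:.0f} cases/day "
      f"({100 * residual_sd / mean_cases:.1f}%)")
```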
Right. Some intuition is necessary. But a lot of these choices are ad hoc, by which I mean they aren’t strongly constrained by the result you want from them.
For example, you have a linear penalty governed by this parameter lambda, but in principle it could have been any old function—the only strong constraint is that you want it to monotonically increase from a finite number to infinity. Now, maybe this is fine, or maybe not. But I basically don’t have much trust for meditation in this sort of case, and would rather see explicit constraints that rule out more of the available space.
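To illustrate how weakly that constraint pins down the choice, here is a hedged sketch with a few hypothetical alternatives of my own (none of them are from the post), all monotonically increasing from a finite value toward infinity under the same lambda:

```python
import math

lam = 0.5  # the penalty-scale parameter; the value here is arbitrary

# All of these are monotonically increasing on x >= 0, start at a finite
# value, and grow without bound, so all satisfy the stated constraint
# while behaving very differently in between.
def linear_penalty(x):
    return lam * x

def quadratic_penalty(x):
    return lam * x ** 2

def exponential_penalty(x):
    return lam * (math.exp(x) - 1.0)

def log_penalty(x):
    return lam * math.log1p(x)   # still unbounded, but grows very slowly

for f in (linear_penalty, quadratic_penalty, exponential_penalty, log_penalty):
    print(f.__name__, [round(f(x), 3) for x in (0.0, 1.0, 10.0)])
```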
My very general concern is that strategies that maximize RAUP might be very… let’s say creative, and your claims are mostly relying on intuitive arguments for why those strategies won’t be bad for humans.
I don’t really buy the claim that if you’ve been able to patch each specific problem, we’ll soon reach a version with no problems—the exact same inductive argument you mention suggests that there will just be a series of problems, and patches, and then more problems with the patched version. Again, I worry that patches are based a lot on intuition.
For example, in the latest version, because you’re essentially dividing out by the long-term reward of taking the best action now, if the best action now is really really good, then it becomes cheap to take moderately good actions that still increase future reward—which means the agent is incentivized to concentrate the power of actions into specific timesteps. For example, an agent might be able to set things up so that it can sacrifice its ability to achieve a total future reward of 10^10 to make it cheap to take an action that increases its future reward by 10^8. This might look like sacrificing the ability to colonize distant galaxies in order to gain total control over the Milky Way.
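As a toy back-of-the-envelope version of that worry (the numbers and the normalize-by-best-action scaling are just my reading of the comment above, not the actual penalty formula from the post):

```python
# Illustrative numbers only; the scaling below is my reading of
# "dividing out by the long-term reward of taking the best action now",
# not the actual penalty formula.

best_action_value = 1e10   # long-term reward of the best action available now
side_effect_gain = 1e8     # extra future reward gained by a "moderate" action

# If the penalty is normalized by the best action's value, a 10^8 gain in
# future reward looks tiny once a 10^10 option is on the table:
scaled_penalty = side_effect_gain / best_action_value
print(f"scaled penalty ≈ {scaled_penalty:.0%}")   # ≈ 1%
```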
For interesting stuff, two weeks to two months. Usually this is warranted, because ideas are cheap but filtering and thinking are hard. The ideal faster time mostly just means that ideally I’d be spending more hours per week on ideas, not that I’d be spending less time per idea.
After a bit more thought, I’ve learned that it’s hard to avoid ending back up with EU maximization—it basically happens as soon as you require that strategies be good not just on the true environment, but on some distribution of environments that reflect what we think we’re designing an agent for (or the agent’s initial state of knowledge about states of the world). And since this is such an effective tool at penalizing the “just pick the absolute best answer” strategy, it’s hard for me to avoid circling back to it.
Here’s one possible option, though: look for strategies that are too simple to encode the one best answer in the first place. If the absolute best policy has K-complexity of 10^3 (achievable in the real world by strategies being complicated, or in the multi-armed bandit case by just having 2^1000 possible actions) and your agent is only allowed to start with 10^2 symbols, this might make things interesting.
I like it! But you know, Northwest Passage is already written as a retrospective.
Three centuries thereafter, I take passage overland
In the footsteps of brave Kelso, where his “sea of flowers” began
Watching cities rise before me, then behind me sink again
This tardiest explorer, driving hard across the plain.

And through the night, behind the wheel, the mileage clicking west
I think upon Mackenzie, David Thompson and the rest
Who cracked the mountain ramparts and did show a path for me
To race the roaring Fraser to the sea.
Because the singer is modern, the chorus “Ah, for just one time / I would take the Northwest Passage” is about wishing to identify a lonely life with the grandeur of the past. A verse about the loss of the historical arctic would tie right back into this without needing to change the chorus a jot.