Alex Turner, Oregon State University PhD student working on AI alignment. Reach me at jobeal2[at]gmail[dot]com.
TurnTrout
In this context, I mean the “steering system” to refer to the genetically hardcoded reward circuitry which provides intrinsic rewards when certain hardcoded preconditions are met. It isn’t learned. Maybe that’s part of the confusion?
An RL agent is trained to maximize total (discounted) reward. The brain isn’t maximizing total reward, nor trying to maximize total reward, nor is evolution acting on the basis that it’ll do either of these things.
An RL agent is reinforced for maximizing reward, but unless it has already fulfilled the prophecy of a convergence guarantee or unless it’s doing model-based brute-force planning to maximize reward over its time horizon, the RL agent is not actually maximizing reward, nor is it necessarily trying to maximize total reward.
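To make that distinction concrete, here's a minimal toy sketch (entirely my own illustration, with made-up numbers): in a REINFORCE-style update, reward only enters as a multiplier on the update to the policy's action preferences. Reward reinforces past computations; nothing in the learned policy has to represent or pursue "maximize total reward."

```python
import numpy as np

# Toy sketch (my own illustration, invented numbers): a tabular "policy" over
# two actions, updated by a REINFORCE-style rule. Reward appears only as a
# multiplier on the gradient of the chosen action's log-probability; the
# resulting policy is shaped by reward without computing or pursuing it.
rng = np.random.default_rng(0)
logits = np.zeros(2)          # preferences over actions 0 and 1
rewards = [1.0, 0.1]          # action 0 happens to be more rewarding
lr = 0.5

for _ in range(200):
    probs = np.exp(logits) / np.exp(logits).sum()
    a = rng.choice(2, p=probs)
    r = rewards[a]
    grad = -probs
    grad[a] += 1.0            # gradient of log pi(a) w.r.t. the logits
    logits += lr * r * grad   # reinforcement: reward scales the update

print(np.exp(logits) / np.exp(logits).sum())  # policy leans toward action 0
```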
The [objective encoded by the steering system] is not [maximisation of the score assigned by the steering system], but rather [whatever behaviour the steering system tends to produce].
I don’t understand why you hold this view. We probably are talking past each other?
E.g. if I just have a crude sugar reward circuit in my brain which activates when I am hungry and my taste buds signal the brain in the right way, and then I learn to like licking real-world lollipops (because that's the only way I was able to stimulate the circuit during training, when my values were forming), then the objective encoded by the reward circuit is… lollipop-licking in real life? But also, if I had only been exposed to chocolate during training, I would have learned to like eating chocolate. But also, if I had only been exposed to electrical taste bud stimulation during training, I would have learned to like electrical stimulation.
IMO, the objective encoded by the reward circuit is the maximization of its own activations; that's the optimal policy.
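As a cartoon of the lollipop/chocolate/wireheading point (a hypothetical toy setup I'm making up purely for illustration, not a model of actual neuroscience): the same hardcoded reward circuit, trained in different environments, reinforces different learned preferences, even though in every environment the reward-maximizing policy is whatever most directly activates the circuit.

```python
# Toy illustration (invented setup): one fixed "reward circuit" that fires on
# sugar-taste, plus three training environments that differ only in which
# action ever triggers it. The learned preference ends up pointing at whatever
# triggered the circuit during training, though the circuit itself never changes.
def reward_circuit(taste_signal: str) -> float:
    return 1.0 if taste_signal == "sugar" else 0.0

environments = {
    "lollipop world":  {"lick lollipop": "sugar", "stare at wall": "none"},
    "chocolate world": {"eat chocolate": "sugar", "stare at wall": "none"},
    "wirehead world":  {"stimulate taste buds": "sugar", "stare at wall": "none"},
}

for name, actions in environments.items():
    # crude "value formation": prefer whichever available action got rewarded
    learned_preference = max(actions, key=lambda a: reward_circuit(actions[a]))
    print(name, "->", learned_preference)
```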
Anyways, I think it would just make more sense for me to link you to a Gdoc explaining my views. PM’d.
I want to flag Quintin’s comment above as extremely important and—after spending over a month engaging with his ideas—I think they’re probably correct.
I think I have failed to communicate my main point, if these are among your objections. I am not faulting you, but I want you to know that that’s my perception, and keep that in mind as you evaluate these ideas.
I think I’d want to start over and try from a different tack, if I were going to resolve disagreements here. But best to save that for future posts, I think.
There isn’t an objective human reward signal that mirrors an RL agent’s reward.
We get a load of input, have a bunch of impressions, feelings and thoughts, and take some actions.
You’re the second person to confidently have this reaction, and I’m pretty confused why. Here’s a Wikipedia article on the human reward system, and here’s one of Steve Byrnes’s posts on the topic. I’m not an expert, but it seems pretty clear that the brain implements some feedback signals beyond self-supervised predictive learning on sensory errors. Those signals comprise the outer criterion, in this argument.
I agree that reward is not literally implemented in the brain as a scalar reward function. But it doesn’t have to be. The brain implements an outer criterion which evaluates and reinforces behavior/predictions and incentivizes some plans over others along different dimensions.
It’s immaterial whether that’s a simple scalar or a bunch of subsystems with different feedback dimensions—the same inner misalignment arguments apply. Otherwise we could solve inner misalignment by simply avoiding scalar outer criteria; this is absurd.
(Let me know if I’ve misunderstood what you were getting at.)
I’d note that evolution will have implicitly accounted for any previous “misalignment” in shaping our current reward signals: it will have selected for the reward signals that tended to increase fitness given our actual responses to those signals, not the signals that would have increased fitness if we had followed some maximisation process.
Our reward signals weren’t ‘designed’ to be maximised, only to work (to increase fitness).
This is indeed part of my argument, but doesn’t seem related to what I was trying to say.
It’d seem more useful if there were some crisp and clear mechanistic notion of what counted as human reward and what didn’t; I don’t think that’s true (is anyone claiming this?).
There’s an outer criterion by which behavior is graded / feedback is given. A mesa optimizer might be trained (by the usual arguments) which optimizes an outer objective, which is not the same as the outer criterion. We don’t need a crisp and clear mechanistic notion of what counts as human reward for this argument to work.
Strong-upvoted, even though I disagree on the overall argument strength (I think there were some weak arguments but in total the evidence seemed pretty good to me). I look forward to paying out if you want to collect on that.
I think I’d worry that the sets of values that do well under human-evolution/learning conditions is too broad (for a good-according-to-non-selfish-us outcome to be likely). I.e. that re-rolling values under similar evolutionary pressures can give you various value-sets that each achieve similar fitness (or even similar behaviour) but where maximizing utility according to one gets you very low utility according to the others.
Important clarification: Neither Quintin nor I are proposing to mimic evolution in order to hopefully (fingers crossed!) miraculously get human values out the other side. Based on an understanding of how inner alignment works (or doesn’t), Quintin is proposing a gears-level model of what human values are and how they form; the model in turn suggests a relatively simple procedure for recreating the important part of the process within an AI. The aim is to grow human values within an AI, not via some hacky solution which is too complicated to shoot down, but based on a gears-level theory of what human values are. No outer selection pressures on evolving AIs or anything like that.
(The reason I know so much about Quintin’s proposal is that I’ve read and written several private docs about the theory.)
By default, AI systems won’t be subject to anything like the environment and pressures that shaped humans and human values. We could aim to create (something analogous to) it, but it’s anything but straightforward. How fragile is the process for humans? Which aspects can be safely simplified/skipped, and how would we know?
Not a full-length explanation, but some thoughts:
I currently think the process is not that fragile. By contrast, consider another (perhaps “classic”) model of alignment. In this model, the “human objective” is an extremely complicated utility function, and we need to get it just right or the future will be ruined. This model has always seemed “off” to me, but I hadn’t been able to put my finger on why.
Quintin’s theory says that the seeming complexity of human values is actually the result of the multiagent bargains struck by subagentic circuits of varying sophistication in the brain, which (explicitly or implicitly) care about different things. Instead of one highly complicated object (“the utility function”) which is sensitive to misspecification, human values are just the multiagent behavior of a set of relatively simple circuits in the brain, where the alignment desirability is somewhat robust to the “bargaining strengths” of those parts.
For example, consider a modified version of yourself who grew up with swapped internal reward for “scuttling spiders” and “small fluffy animals.” I think you’d get along mostly fine, and be able to strike bargains like “this part of the galaxy will have bunnies, this other part will have spiders” without either of you wanting to tile the galaxy with representations of your “utility function.”
And under that view, we do not have to mimic the entire training process because we don’t know what mattered and what didn’t. Quintin’s theory is, in effect, making a claim about what matters: The substance of human values is the multiagent dynamics and relative bargaining strengths of the different parts, and the fact that these parts generally act to preserve their implementation in the brain (prevent value drift) by steering the future observations of the human itself. In a world where we actually get a correct theory of human values, that theory would tell you which parts are important and which parts can be left out. (There is, of course, still the question of how we would know the theory is right. The above does not answer this question.)
It occurs to me that I’m not sure whether you mean [human rewards in evolution] or [rewards for individual learning humans], or both? I’m assuming the evolutionary version, since I’m not clear what inner alignment failure would mean for an individual (what defines the intended goal/behaviour?).
I don’t know what you mean by “human rewards in evolution.” For my part, I’m talking about the reward signals provided by the steering system in a person’s brain. Although some people are hedonists, many are not, and thus they are unaligned with their reward system. If you don’t want to wirehead, you are not trying to optimize the objective encoded by the steering system in your own brain, and that’s an inner alignment failure with respect to that system. So something else must be steering your decision-making.
Inner alignment failure is the only process in the known universe to ever generate human values
I’m wary of leaning on this as a jumping-off point, since inner alignment failure did not hit a pre-defined target of human values. It just happened to produce them. If a gun can fire one bullet, I’ll expect it can fire a second. I won’t expect the second bullet to hit the first.
I think there are several elided considerations here. I think OP might be ambiguous with respect to whether “evolution → human values” alignment failure is being considered, when the real relevant alignment failure [EDIT: for this sentence] is “human reward system → human values.” I agree that most “bullets” fired by evolution will not hit human values. I think the latter scenario is much more interesting, however, and I think it takes more time to step through.
One thing to note in general is that AFAICT anthropic hypotheses take huge penalties compared to non-anthropic hypotheses, depending on how much anthropic lifting is required to explain our observations.
Scenario A: You’re transported to another world. Based on some quick calculations/experiments you perform (you’re lucky you had a near-magically powerful laser and telescope on you, plus your pockit), it seems to be roughly similar to Earth in geological features. You then come across what looks like a building, human-sized. Outside are a giant (3.5 m tall) and a humanoid (around 1.5 m tall). Shape-wise, they’re similar, but they’re covered up. What’s the probability that they’re of the same race, and that their species is not significantly polymorphic? How would you update on seeing a bunch more of each size wandering around the forest?
One of the things probability-sight tells me is how constraining my models are. One of the benefits of learning more fields is being able to extract sharper likelihood ratios from the same evidence. Here, my likelihood ratios are pretty unsharp. And also, I can feel I’m entering an unrefined conditional distribution of my beliefs, where conditioning on world transport and also earth-similarity produces something pretty strange.
My gut tells me that such large variation in size is rare. And I have cached thoughts about the stress which extreme height places on bones?
However, what are the alternatives? The probability of convergently evolving the same shape in parallel is small. If they could genetically modify themselves or even just customize their morphology using more exotic tech, it seems even less likely that they would stick to two morphologies (which I see wandering around the forest). (I’ve recently updated my thinking on how evolved advanced minds generally will work, and that model suggests that their preferences would probably be very compatible with variety, even though they probably aren’t human preferences.)
Of the hypotheses predicting they are of the same species, extreme sexual dimorphism seems to have the highest posterior probability. Something about this seems wrong to me, or rare, which suggests that there’s some gut evidence I haven’t yet consciously incorporated. There are also more conjunctive possibilities like “they have culture and also exotic morphology modification tech but there are ~two acceptable morphologies”, but this basically feels like a garbage just-so story that I’d need way more evidence to properly elevate to attention.
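For concreteness, here's the flavor of odds-form update I'm gesturing at, with completely made-up numbers (these are not my actual credences, just an illustration of how a likelihood ratio moves a prior):

```python
# Toy odds-form Bayes update (all numbers invented): H1 = "same species,
# extreme dimorphism", H2 = "two different species", after observing both
# morphs sharing a building.
prior_odds = 0.2 / 0.8                # start skeptical of extreme dimorphism
likelihood_ratio = 0.6 / 0.2          # cohabiting seems likelier under H1 than H2
posterior_odds = prior_odds * likelihood_ratio
posterior_p = posterior_odds / (1 + posterior_odds)
print(posterior_p)                    # ~0.43: evidence favors H1 but isn't decisive
```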
I think maybe if I had sharper models of evolutionary history, I’d see a sharper (perhaps Gaussian) form like you do. My other hypothesis is that this conditional distribution is really weird and I’d be surprised if you could narrow down to that shape so quickly.
Scenario B: You are transported to another world with your technical doodads. Looking around you, you find yourself in a plain stretching as far as the eye can see. You walk around for a couple of days and nothing changes. You’re terrified of vast expanses of water. Unfortunately, you are unmoored from the digital sea and can’t just check how fast the oceans are rising, the tidal range, or even where they are. What’s the probability that you’re x meters above sea level, supposing this planet even has oceans?
(Written before reading your answer) Ah, another area I don’t know much about. Time for more qualitative reasoning. I think that I’m most curious about what is on the plain. Is there life? Presumably, since I’m walking around for days, there is some level of humidity, which suggests a water cycle, I think? And in that case, there are probably oceans. And if I can eyeball the soil composition, I could estimate how long it has probably been since a flood or rainfall occurred, which would give me some information about the altitude.
I don’t know how altitude affects rain frequency (supposing the atmospheric dynamics are even remotely similar to Earth’s), but under the flooding-is-possible hypotheses, observing “water has not touched this soil in a long time” represents a weak-to-moderate likelihood ratio against close-to-sea-level hypotheses. (The “weak-to-moderate” comes from the unknown variance of sea level and of storms in this part of the world.)
Yay! Keep up the good work :) I bet there’s a way to stick to it better, I’d advise you to keep trying things on that front.
I just wouldn’t use most apps or websites. By adopting a bright-line whitelist approach with clear, universally applicable exception handling and scheduled checks (e.g. of email), I didn’t have constant temptations to rationalize breaking my own rules, like I did when I was sometimes allowed to use a service.
Emotionally Confronting a Probably-Doomed World: Against Motivation Via Dignity Points
Part of the reason the spread seems small is that people are correctly inferring that this comment section is not a venue for debating the object-level question of Probability(doom via AI), but rather for discussing EY’s viewpoint as written in the post. See e.g. https://www.lesswrong.com/posts/34Gkqus9vusXRevR8/late-2021-miri-conversations-ama-discussion for more of a debate.
One of my favorite things about you, John, is that you are excellent at prompting me to direct my attention towards important questions which lead to promising insights. Thanks for that!
I answered your questions, originally in my private notes, but partway through decided to post them as a comment.
Imagine that your field achieved perfection—the ultimate theory, perfect understanding, building The Thing.
What has been achieved in the idealized version of the field, which has not yet been achieved today? What are the main barriers between here and there?
Detailed understanding of what it means for one agent to be aligned with another agent, or a group of agents.
We can easily check the functional properties of a causal process and argue that they satisfy theorems saying that, with high probability, the process veers towards highly desirable states.
Like, I could point out the ways in which DRL fails at a glance, without appealing to instrumental convergence in particular?
Or maybe that’s part of the theory?
These theorems are intuitively obviously correct and correspond to original-intuitive-reality.
They are so correct that it’s easy to persuade people to use the aligned approach.
We understand what agents are, and how people fit into that picture, and the theory retrodicts all past problems with governments, with corporations, and with principal-agent relationships
We know how to take any reasonable training process and make it an aligned training process, with minimal alignment tax (<5%)
Barriers:
We don’t know what agents are.
We don’t know what alignment means
We don’t know how to prove the right kinds of theorems
We don’t know if our concepts are even on the right track
They probably aren’t, except insofar as they spur the right language. It feels more like “how do I relate KL div and optimality probability” rather than “let’s prove theorems about retargetability”
Often, in hindsight, a field turns out to have been bottlenecked on the development of some new measurement method, ranging from physical devices like the thermometer to abstract ideas like Shannon’s entropy and information channel capacity.
In what places does it look like your field is bottlenecked on the ability to usefully measure something? What are the main barriers to usefully measuring those things?
Bottlenecks
Abstraction
Uninterpretable
Not sure where to draw “category boundaries”
Alignment
Don’t know what alignment really is
Or how to measure ground truth
Needs vocab of concepts we don’t have yet
Power-seeking
Unclear what gets trained and how to measure according to distributions
Would need current agent beliefs to use current formalism
Also current formalism doesn’t capture game theoretic aspects of logical blackmail etc
Intent seems more important
But this is bottlenecked on “what is going on in the agent’s mind”
Interpretability
I’m not sure. Networks are big and messy.
Capability
Actually I bet this is natural in terms of the right alignment language
Compute efficiency in terms of ability to bring about cognitively important outcomes
Seems strictly harder than “capability”
Time remaining until TAI
Uncertainty about how AI works, how minds work, what the weighted-edge distance is on the lattice of AI discoveries
Barriers
Conceptual roadblocks
But what?
(Filled in above)
What are the places where your field is flailing around in the dark, trying desperate ideas and getting unhelpful results one after another? What are the places where it feels like the problem is formulated in the wrong language, and a shift to another frame might be required to ask the right question or state the right hypothesis?
Flailing:
IDA
ELK
Everything theoretical feels formulated wrong, except maybe logical induction / FFS / John’s work
This is important!
(Also I wouldn’t be surprised if I’d say Vanessa’s work is not flailing, if I could understand it)
Retargetability → IC seems like an important piece but not the whole thing, part of it is phrased correctly
AUP was flailing
Did I get impact wrong? Or is reward maximization wrong?
I think I got impact right philosophically, but not the structure of how to get one agent to properly care about impact on other agents.
I just found a good trick (penalize the agent for impact on other goals it could have had and pursued; rough formula sketched just below this list) which works really well in a range of cognitively available situations (physical proximity), but which breaks down under tons of optimization pressure
And the “don’t gain power for your own goal” seems like it should be specifiable and non-hacky, but I don’t actually see how to do it right.
But note that getting impact right for-real wouldn’t save the world AFAICT
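(For reference, the “trick” above is roughly the AUP penalty as published; I’m writing it from memory, so treat the exact form as approximate. Here λ scales the penalty, \(\mathcal{R}_{\text{aux}}\) is a set of auxiliary reward functions the agent “could have pursued”, and \(\varnothing\) is the no-op action:)

$$R_{\text{AUP}}(s,a) \;=\; R(s,a) \;-\; \frac{\lambda}{|\mathcal{R}_{\text{aux}}|} \sum_{R_i \in \mathcal{R}_{\text{aux}}} \left| Q^*_{R_i}(s,a) - Q^*_{R_i}(s,\varnothing) \right|$$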
Basically everything else
What happened when I shifted to the retargetability frame?
I don’t think I did that until recently, actually; the original post was too anchored on instrumental convergence over outcome sets, missing the elegant functional statement
and my shift to this frame still feels incomplete.
Corrigibility still feels like it should work in the right language and grounding
Sometimes, we have a few different models, each of which works really well in different places. Maybe it feels like there should be some model which unifies them all, which could neatly account for all these phenomena at once—like the unification of electricity, magnetism and optics in the 19th century.
Are there different models in your field which feel like they point to a not-yet-known unified model?
I guess different decision theories? Not super familiar
Not coming up with as many thoughts here, because I feel like our “partial models” are already contradicted and falsified on their putative domains of applicability, so what good would a unification do? More concisely stated wrongness?
One of the main ways we notice (usually implicit) false assumptions in our models is when they come into conflict with some other results, patterns or constraints. This may look like multiple models which cannot all be true simultaneously, or it may look like one model which looks like it cannot be true at all yet nonetheless keeps matching reality quite well. This is a hint to reexamine the assumptions under which the models are supposedly incompatible/impossible, and especially look for any hidden assumptions in that impossibility argument.
Are there places in your field where a few models look incompatible, or one model looks impossible, yet nonetheless the models match reality quite well?
Tempted to say “no”, because the last phrase in the last sentence doesn’t seem true.
Here was one. The instrumental convergence theorems required a rather precise environmental symmetry, which seemed weird. But now I have a new theory which relates abstract functional properties of how events come about, to those events coming out similarly for most initial conditions. And that doesn’t have anything to do with environmental / outcome-level symmetries. So at first the theory was right in its domain of applicability, but the domain seemed absurdly narrow on a few dimensions.
It was narrow not because I missed how to prove sufficiently broad theorems about MDPs, but because I was focusing on the wrong details and missing the broader concept underlying everything I’d observed.
I guess the impossibility of value learning seems correct but spiritually inapplicable to the problem we want to solve, but I don’t quite know how to articulate that.
A few months back I wrote about how corrigibility is often impossible under reward maximization. But reward maximization seems pretty useful for motivating agents. But it’s so so so broken for nontrivial kinds of motivation.
The space of possible physical laws or theorems or principles is exponentially vast. Sometimes, the hard part is to figure out what the relevant factors are at all. For instance, to figure out how to reproducibly culture a certain type of cell, a biologist might need to provide a few specific signal molecules, a physical medium with the right elasticity or density, a particular temperature range, and/or some other factors which nobody even thought to test yet.
Are there places in your field where nobody even knows what key factors must be controlled for some important outcome to robustly occur?
In a robotics task, how would we ensure test-time agents did anywhere between 2 and 10 jumping jacks in an episode?
What factors would you control there? We don’t know how to “target” along these dimensions; at least, it would take more effort than I think it should
Are there places in your field where some concept seems very central to understanding, but nobody knows its True Name yet?
Corrigibility
Corrigibility
Corrigibility
Alignment
I remember thinking there was a concept like this related to logical time, but I forget what it was
At the social level, what are the barriers to solving the main problems in the previous two questions? Why aren’t they already solved? Why isn’t progress being made, or made faster?
Bad incentives — I think people working on eg value learning should not do that
I think the marginal return from more researcher hours on deconfusion outweighs the return from empirical wisdom in applying known-broken paradigms—gut instinct
People don’t know the feel of a True Name-having object, or forget how important the name discovery is.
I do this sometimes.
Or it’s really hard
Or we need more kinds of minds looking at this problem
But I don’t know what this means
Are there places where your field as a whole, or you personally, pursue things which won’t really help with the main problems (but might kind of “look like” they address the problems)?
Field as a whole
Value learning seems doomed; why are people still working on it?
There are non-core researchers working in ways which don’t make sense to me, but I don’t remember who they are or what they were working on
They’re more field-adjacent
I used to be more excited about IDA-style insights but now I feel queasy about skipping over the hard parts of a problem without really getting insights about how alignment works
This is a lesson which I took too long to learn, where I was too tolerant of finding “clever” ways to box my uncertainty. There’s a Sequences article about this, but I don’t want to go find it right now.
What kind of person can I become who would notice this error on my own, before making it, before hearing that this is a failure mode?
Anyways.
Me
I think impact measurement is doomed at least in the current paradigm
In the hindsight of a perfected field, I think impact regularization would be a thing you can do robustly, but not something you could use to perform a decisive act
I’m basically done working on impact measurement though
I’m finishing up my PhD right now, but I think I’m doing pretty well on this axis now. I used to be bad at pica.
Pick someone you know, or a few people, who are smart and have good judgment. What would their answers to these questions be?
John, I think you would not strongly disagree with most anything I said, but I feel like you would say that corrigibility isn’t as pragmatically important to understand. Or, you might say that True-Name-corrigibility is actually downstream of the True-Name-alignment concepts we need, and it’s the epiphenomenon. I don’t know. This prediction is uncertain and felt more like I queried my John model to give a mental speech which is syntactically similar to a real-John-speech, rather than my best prediction of what you would say.
Update: I had a few relatively anxious days, and decided to relax my music policy slightly. Sometimes I’d be a little too anxious to do work without music. While I fix this problem, I’m going to listen to music more permissively.
I think there’s another trade-off to music while working, which is that it makes me more excited at the cost of making me slightly dumber. Sometimes this is worth it. Marking as something to analyze more later, as I currently (attempt to) refactor my internal systems so that they don’t produce these anxious signals.
Also, added to main post: “I never ever use my phone on the toilet. Ever. This has served me well and seems like a pure win.”
Interested to hear how this goes.
(I think everyone should have well-defined exception handling, because some of you will have crazy shit happen, like “someone died”, and that can make it hard if you’re pondering “do I let myself have an allowance now?”. Failing to plan is planning to fail (in not-wholly-improbable worlds).)
Yup, this is what I did, but I just didn’t have the notebook. I like the bright line.
:)
I also have experienced increased anxiety, which has persisted (and become slightly stronger) 7.5 months after moving.
I have noticed increased anxiety in myself. I think it quite possible I already had most of it before the declutter, and instead wrapped myself in comforting, mind-numbing internet usage which obscured it from me. Another person who has done the declutter reported similar suspicions about themselves.
This has some interesting implications: The worse the problem of wasting time on the internet is for you, the harsher your measures will need to be, and the less you will want to carry them out (except in lucid moments).
This doesn’t ring true to my experience. Everything became easier when I said “No Reddit”, as opposed to ~“Reddit if I can find a good enough reason.”
I have an additional pet peeve about this post: It has the words “cost-benefit analysis” in the title, but doesn’t have a cost-benefit analysis in the body (at least not of the juicy expected-value variety)!
Please don’t use those words if you’re not going to deliver ;-)
I included eg the Facebook usage meme (which is the real output of a CB analysis), and described the results of other analyses I did. I think the analyses were so lopsided, and the solutions so clean, that assigning numbers would be a distraction. Also, the point of assigning numbers to personal-utility-estimates is, I think, to throw them out after you do the estimate, and do what your updated gut feeling says.
Big agreement & signal boost & push for funding on The “Reverse-engineer human social instincts” research program: Yes, please, please figure out how human social instincts are generated! I think this is incredibly important, for reasons which will become obvious from several posts I’ll probably put out this summer.