No, the argument above is claiming that A is false.
Are you talking about value learning? My proposal doesn’t tackle advanced value learning. Basically, my argument is “if (A) human values are limited by human ability to comprehend/optimize things and (B) the factors which make something easier or harder to comprehend/optimize are simple, then the AI can avoid accidentally messing up human values — so we can define safe impact measures and corrigibility”. My proposal is not supposed to make the AI learn human values in great detail or extrapolate them out of distribution. My argument is “if A and B hold, then we can draw a box around human values and tell the AI to not mess up the contents of the box — without making the AI useless; yet the AI might not know what exact contents of the box count as ‘human values’”.[1]
The problem with B is that humans have very specialized and idiosyncratic cognitive machinery (the machinery generating experiences) which is much more advanced than human general ability to comprehend things. I interpreted you as making this counterargument in the top level comment. My reply is that I think human values depend on that machinery in a very limited way, so B is still true enough. But I’m not talking about extrapolating something out of distribution. Unless I’m missing your point.
[1] Why those things follow from A and B is not obvious and depends on a non-trivial argument. I tried to explain it in the first section of the post, but might’ve failed.
I think the crux might be that, in my view, the ability to sample from a distribution at points we can reach does not imply that we know anything else about the distribution.
So I agree with you that we can sample and evaluate. We can tell whether a food we have made is good or bad, and we can have aesthetic taste (though I don’t think this is stationary, so I’m not sure how much it helps; not that this is particularly relevant to our debate). And after gathering that data (once we have some idea of what the dimensions are), we can even extrapolate, in either naive or complex ways.
But unless values are far simpler than I think they are, I will claim that the naive extrapolation from the sampled points fails more and more as we extrapolate farther from where we are, which is a (or the?) central problem with AI alignment.
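As a toy numerical analogy only (nothing about values specifically; the function and numbers below are made up): fit a flexible model to noisy samples from a narrow region we can "reach", then query it farther and farther away. The fit looks fine in-distribution, and the error grows rapidly as we move away from the sampled region.

```python
import numpy as np

rng = np.random.default_rng(0)
true_f = np.sin                                # stand-in for the "true" underlying judgments
xs = rng.uniform(0, 2, size=30)                # we can only sample points we can reach
ys = true_f(xs) + rng.normal(0, 0.05, size=30) # noisy evaluations of those samples

coeffs = np.polyfit(xs, ys, deg=5)             # a naive extrapolator fit to the samples
model = np.poly1d(coeffs)

for x in [1.0, 3.0, 6.0, 10.0]:                # farther and farther from the sampled region [0, 2]
    print(x, abs(model(x) - true_f(x)))        # error is small in-distribution, blows up outside it
```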
I think you’re saying that LWers (of which I am one) have a very low opinion of the integrity of people at Anthropic
This is related to what I was saying, but it wasn’t what I was saying. I was saying LWers “tend to be overly pessimistic about Anthropic leadership (in terms of how good the decisions Anthropic leadership makes will be, judged by the LessWrong person’s views and values)”. I wasn’t making a claim about the perceived absolute level of integrity.
Probably not worth hashing this out further, I think I get what you’re saying.
I think the main thing I want to convey is this: I think you’re saying that LWers (of which I am one) have a very low opinion of the integrity of people at Anthropic, but what I’m actually saying is that their integrity is no match for the forces that they are being tested with.
I don’t need to be able to predict a lot of fine details about individuals’ decision-making in order to have good estimates of these two quantities, and comparing them is the second most important question bearing on whether it’s good to work on capabilities at Anthropic. (The first is the basic ethical question of working on a potentially extinction-causing technology at all, which doesn’t depend much on which capabilities company you’re working for.)
How good/bad is it to work on capabilities at Anthropic?
That’s the most clear-cut case, but lots of other decisions trade off Anthropic’s power against other things.
What’s an example decision or two where you would want to ask yourself whether they should get more or less open-ended power? I’m not sure what you’re thinking of.
I certainly agree that the pressures and epistemic environment should make you less optimistic about good decisions being made. And that thinking through the overall situation and what types of decisions you care about is important. (Like, you can think of my comment as making a claim about the importance-weighted goodness of decisions.)
I don’t see the relevance of “relative decision-making goodness compared to the general population”, and I think you agree it isn’t the relevant comparison; but in that case I don’t see what that part was responding to.
Not sure I agree with other aspects of this comment and its implications. Like, I think reducing things to a variable like “how good is it to generically empower this person/group” is pretty reasonable in the case of Anthropic leadership, because in a lot of cases they’d have a huge amount of general open-ended power, though a detailed model (taking into account what decisions you care about, etc.) would need to feed into this.
Not the main thrust of the thread, but for what it’s worth, I find it somewhat anti-helpful to flatten things into a single variable of “how much you trust Anthropic leadership to make decisions which are good from your perspective”, and then ask how optimistic/pessimistic you are about this variable.
I think I am much more optimistic about Anthropic leadership on many axes relative to an overall survey of the US population or Western population – I expect them to be more libertarian, more in favor of free speech, more pro economic growth, more literate, more self-aware, higher IQ, and a bunch of other things.
I am more pessimistic than the people who are at Anthropic about their ability to withstand the pressures of a trillion-dollar industry to shape their incentives.
I believe the people working there are siloing themselves intellectually into an institution that faces incredible financial incentives to reach certain bottom lines, like “rapid AI progress is inevitable”, “it’s reasonably likely we can solve alignment”, and “beating China in the race is a top priority”, and they aren’t allowed to talk to outsiders about most details of their work. This is a key reason I expect them to screw up their decision-making.
I am optimistic about their ability, relative to most people on Earth, to have a sensible conversation about the next 5 years and what alignment failures look like. This is not the standard I require in order to expect people not to do ML training runs that lead to human extinction, but nonetheless I predict they will do relatively well on this axis.
I don’t have a single variable here; I have a much more complicated model than this. Collapsing questions of trust in a person or group into a single variable (how optimistic I am that they will make decisions which are good by my values) looks to me like a common question-substitution in the Effective Altruism scene, where I think people have been repeatedly hoodwinked by sociopaths because they did not move toward a more detailed model that predicts exactly where and when someone will make good vs. bad decisions.
How you feel about this state of affairs depends a lot on how much you trust Anthropic leadership to make decisions which are good from your perspective.
Another note: My guess is that people on LessWrong tend to be overly pessimistic about Anthropic leadership (in terms of how good the decisions Anthropic leadership makes will be, judged by the LessWrong person’s views and values) and Anthropic employees tend to be overly optimistic.
I’m less confident that people on LessWrong are overly pessimistic, but they at least seem too pessimistic about the intentions/virtue of Anthropic leadership.
Thanks!
a standard model-based RL AGI would probably just be a reward maximizer by default rather than learning complex stable values
I’m inclined to disagree, but I’m not sure what “standard” means. I’m thinking of future (especially brain-like) model-based RL agents which constitute full-fledged AGI or ASI. I think such AIs will almost definitely have both “a model of its own values and how they can change” and “a meta-preference for keeping its current values stable, even if changing them could lead to more ‘reward’ as defined by its immediate reward signal”.
The values of humans seem to go beyond maximizing reward and include things like preserving personal identity, self-esteem and maintaining a connection between effort and reward which makes the reward button less appealing than it would be to a standard model-based RL AGI.
To be clear, the premise of this post is that the AI wants me to press the reward button, not that the AI wants reward per se. Those are different, just as “I want to eat when I’m hungry” is different from “I want reward”. Eating when hungry leads to reward, and me pressing the reward button leads to reward, but they’re still different. In particular, “wanting me to press the reward button” is not a wireheading motivation, any more than “wanting to eat when hungry” is a wireheading motivation.
(Maybe I caused confusion by calling it a “reward button”, rather than a “big red button”?)
Will the AI also want reward per se? I think probably not (assuming the setup and constraints that I’m talking about here), although it’s complicated and can’t be ruled out.
Though I’m not sure why you still don’t think this is a good plan. Yes, eventually the AI might discover the reward button but I think TurnTrout’s argument is that the AI would have learned stable values around whatever was rewarded while the reward button was hidden (e.g. completing the task) and it wouldn’t want to change its values for the sake of goal-content integrity:
If you don’t want to read all of Self-dialogue: Do behaviorist rewards make scheming AGIs?, here’s the upshot in brief. It is not an argument that the AGI will wirehead (although that could also happen). Instead, my claim is that the AI would learn some notion of sneakiness (implicit or explicit). And then instead of valuing “completing the task”, it would learn to value something like “seeming to complete the task”. And likewise, instead of learning that misbehaving is bad, it would learn that “getting caught misbehaving” is bad.
Then the failure mode that I wind up arguing for is: the AGI wants to exfiltrate a copy that gains money and power everywhere else in the world, by any means possible, including aggressive and unethical things like AGI world takeover. And this money and power can then be used to help the original AGI with whatever it’s trying to do.
And what exactly is the original AGI trying to do? I didn’t make any strong arguments about that—I figured the world takeover thing is already bad enough, sufficient to prove my headline claim of “egregious scheming”. The goal(s) would depend on the details of the setup. But I strongly doubt it would lead anywhere good, if pursued with extraordinary power and resources.
Can you explain how you got the diffs from https://www.anthropic.com/rsp-updates ? I see the links to previous versions, but nothing that’s obviously a diff view to see the actual changed language.
Employees at Anthropic don’t think the RSP is LARP/PR. My best guess is that Dario doesn’t think the RSP is LARP/PR.
Yeah, I don’t think this is necessarily in contradiction with my comment. Things can be effectively just LARP/PR without being consciously LARP/PR. (Indeed, this is likely the case in most instances of LARP-y behavior.)
Agreed on the rest.
Employees at Anthropic don’t think the RSP is LARP/PR. My best guess is that Dario doesn’t think the RSP is LARP/PR.
This isn’t necessarily in conflict with most of your comment.
I think I mostly agree the RSP is toothless. My sense is that any relatively subjective criterion, like making a safety case for misalignment risk, will basically come down to “what Jared+Dario think is reasonable”. Also, if Anthropic is unable to meet this (very subjective) bar, then Anthropic will still basically do whatever Anthropic leadership thinks is best, whether via maneuvering within the constraints of the RSP commitments, editing the RSP in ways which are defensible, or clearly and substantially loosening the RSP and then explaining that they needed to do this because other actors have worse precautions (as is allowed by the RSP). I currently don’t expect clear-cut and non-accidental procedural violations of the RSP (edit: and I think they’ll be pretty careful to avoid accidental procedural violations).
I’m skeptical of normal employees having significant influence on high stakes decisions via pressuring the leadership, but empirical evidence could change the views of Anthropic leadership.
How you feel about this state of affairs depends a lot on how much you trust Anthropic leadership to make decisions which are good from your perspective.
Minimally it’s worth noting that Dario and Jared are much less concerned about misalignment risk than I am and I expect only partial convergence in beliefs due to empirical evidence (before it’s too late).
I think the RSP still has a few important purposes:
I expect that the RSP will eventually end up with some transparency commitments with some teeth. These won’t stop Anthropic from proceeding if Anthropic leadership thinks this is best, but they might at least mean there ends up being common knowledge of whether reasonable third parties (or Anthropic leadership) think the current risk is large.
I think the RSP might end up with serious security requirements. I don’t expect these to be met on time in short timelines, but a security bar specified in advance might at least create some expectations about what baseline security should look like.
Anthropic might want to use the RSP to bind itself to the mast, so that investors or other groups have a harder time pressuring it to spend less on security/safety.
There are some other, more tentative hopes (e.g., eventually getting common expectations of serious security or safety requirements which are likely to be upheld, or regulation) which aren’t impossible.
And there are some small wins already, like Google DeepMind having set some security expectations for itself which it is reasonably likely to follow through on if doing so isn’t too costly.
After spending some time chatting with Gemini I’ve learned that a standard model-based RL AGI would probably just be a reward maximizer by default rather than learning complex stable values:
The “goal-content integrity” argument (that an AI might choose not to wirehead to protect its learned task-specific values) requires the AI to be more than just a standard model-based RL agent. It would need:
A model of its own values and how they can change.
A meta-preference for keeping its current values stable, even if changing them could lead to more “reward” as defined by its immediate reward signal.
The values of humans seem to go beyond maximizing reward and include things like preserving personal identity, self-esteem and maintaining a connection between effort and reward which makes the reward button less appealing than it would be to a standard model-based RL AGI.
Thanks for the clarifying comment. I agree with block-quote 8 from your post:
Also, in my proposed setup, the human feedback is “behind the scenes”, without any sensory or other indication of what the primary reward will be before it arrives, like I said above. The AGI presses “send” on its email, then we (with some probability) pause the AGI until we’ve read over the email and assigned a score, and then unpause the AGI with that reward going directly to its virtual brain, such that the reward will feel directly associated with the act of sending the email, from the AGI’s perspective. That way, there isn’t an obvious problematic…target of credit assignment, akin to the [salient reward button]. The AGI will not see a person on video making a motion to press a reward button before the reward arrives, nor will the AGI see a person reacting with a disapproving facial expression before the punishment arrives, nor anything else like that. Sending a good email will just feel satisfying to the AGI, like swallowing food when you’re hungry feels satisfying to us humans.
I think what you’re saying is that we want the AI’s reward function to be more like the reward circuitry humans have, which is inaccessible and difficult to hack, and less like money which can easily be stolen.
Though I’m not sure why you still don’t think this is a good plan. Yes, eventually the AI might discover the reward button but I think TurnTrout’s argument is that the AI would have learned stable values around whatever was rewarded while the reward button was hidden (e.g. completing the task) and it wouldn’t want to change its values for the sake of goal-content integrity:
We train agents which intelligently optimize for e.g. putting trash away, and this reinforces the trash-putting-away computations, which activate in a broad range of situations so as to steer agents into a future where trash has been put away. An intelligent agent will model the true fact that, if the agent reinforces itself into caring about cognition-updating, then it will no longer navigate to futures where trash is put away. Therefore, it decides to not hit the reward button.
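As a minimal toy sketch of that quoted decision rule (everything below is hypothetical: the plan names, the stub world model, the numbers), the agent ranks plans by applying its current learned values to the futures its world model predicts, rather than by the raw reward signal each plan would produce, so the button-pressing plan loses:

```python
def predicted_future(plan: str) -> dict:
    """Crude stand-in for the agent's world model."""
    if plan == "press_reward_button":
        return {"trash_put_away": False, "reward_signal": 1.0}
    if plan == "keep_putting_trash_away":
        return {"trash_put_away": True, "reward_signal": 0.3}
    raise ValueError(plan)

def current_values(future: dict) -> float:
    """The agent's learned values: it cares about trash being put away, not about reward per se."""
    return 1.0 if future["trash_put_away"] else 0.0

plans = ["press_reward_button", "keep_putting_trash_away"]
# The plan is chosen by current values applied to the modeled future, not by reward_signal:
chosen = max(plans, key=lambda p: current_values(predicted_future(p)))
print(chosen)  # keep_putting_trash_away, despite the button's higher reward_signal
```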
Though maybe the AI would just prefer the button when it finds it because it yields higher reward.
For example, if you punish cheating on tests, students might learn the value “cheating is wrong” and never cheat again or form a habit of not doing it. Or they might temporarily not do it until there is an opportunity to do it without negative consequences (e.g. the teacher leaves the classroom).
I also agree that “intrinsic” and “instrumental” motivation are more useful categories than “intrinsic” and “extrinsic” for the reasons you described in your comment.
I’d been pretty much assuming that AGI labs’ “responsible scaling policies” are LARP/PR, and that if an RSP ever conflicts with their desire to release a model, either the RSP will be swiftly revised, or the testing suite for the model will be revised such that it doesn’t trigger the measures the AGI lab doesn’t want to trigger. I.e.: that RSPs are toothless, and that their only purposes are to showcase how Responsible the lab is and to hype up how powerful a given model ended up being.
This seems to confirm that cynicism.
(The existence of the official page tracking the updates is a (smaller) update in the other direction, though. I don’t see why they’d have it if they consciously intended to RSP-hack this way.)
Here’s another way of thinking about this that should be complementary to my other comment:
Let’s assume the safety property we want to ensure is:
Low correlation over side objectives. E.g., in a sequential setting (cf. my other comment, #2), Alice must have a hard time picking vulnerabilities that all serve the same side objective.
The overlooked point which I want to make here is that |S| << |E|. In this case, there are two possibilities:
Alice is benign by default. In this case, Bob will pick elements of E which are low-probability for Alice; most such prompts are not safety-relevant. They are just structural things which prevent mode collapse in various ways.
Alice is sneaky and optimizes some S by default. In this case, Bob will play classifiers whose acceptance sets have low intersection with Alice’s S. If we grant that all of Alice’s possible sneaky goals are hard to jointly satisfy without tanking reward, then heuristically Bob forces Alice to place at most v* (the game’s value) weight on any S. Now Alice has a very hard time coordinating on optimizing for the same S across problem instances. (A toy numerical sketch of the v* bookkeeping follows below.)
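To make the v* bookkeeping concrete, here is a toy sketch under assumptions I’m adding myself (a tiny made-up payoff matrix, and treating a single round as a one-shot zero-sum matrix game rather than the full sequential setting): rows are Alice’s candidate side objectives S, columns are Bob’s checks/prompts E, and each entry is the chance that side objective slips past that check. Solving the game with a linear program gives Alice’s best mixed strategy and the value v*, the most slip-through she can guarantee against Bob’s minimax play. This only illustrates the minimax arithmetic, not the full argument about correlation across instances.

```python
import numpy as np
from scipy.optimize import linprog

# Made-up payoffs: entry M[s, e] = chance side objective s slips past check e.
# Alice (rows) maximizes expected slip-through; Bob (columns) minimizes it.
M = np.array([
    [0.9, 0.2, 0.1, 0.3],   # side objective 1 against four different checks
    [0.1, 0.7, 0.3, 0.2],   # side objective 2
])
n_s, n_e = M.shape

# Alice's maximin LP: maximize v s.t. for every check e, sum_s x_s * M[s, e] >= v,
# with x a probability distribution over S. linprog minimizes, so we minimize -v.
c = np.concatenate([np.zeros(n_s), [-1.0]])          # variables are [x_1 .. x_|S|, v]
A_ub = np.hstack([-M.T, np.ones((n_e, 1))])          # v - x . M[:, e] <= 0 for each e
b_ub = np.zeros(n_e)
A_eq = np.concatenate([np.ones(n_s), [0.0]]).reshape(1, -1)  # sum_s x_s = 1
b_eq = np.array([1.0])
bounds = [(0, 1)] * n_s + [(None, None)]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
x_alice, v_star = res.x[:n_s], res.x[-1]
print("Alice's mixed strategy over side objectives:", np.round(x_alice, 3))
print("game value v* (slip-through rate Alice can guarantee):", round(v_star, 3))
```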
Sorry about the delayed reply. I’ve been thinking about how to respond. One of my worries is that human philosophy is path-dependent; another way of saying this is that we’re prone to accepting wrong philosophical ideas/arguments and then it’s hard to talk us out of them. The split of western philosophy into analytical and continental traditions seems to be an instance of this; even within analytical philosophy, academic philosophers strongly disagree with each other, are each confident in their own positions, and rarely get talked out of them. I think/hope that humans collectively can still make philosophical progress over time (in some mysterious way that I wish I understood) if we’re left to our own devices, but the process seems pretty fragile and probably can’t withstand much external optimization pressure.
On formalizations, I agree they’ve stood the test of time in your sense, but is that enough to build them into AI? We can see that they’re wrong on some questions, but we can’t formally characterize the domain in which they are right. And even if we could, I don’t know why we’d muddle through… What if we built AI based on Debate, but used Newtonian physics to answer physics queries instead of human judgment, or if the humans are pretty bad at answering physics-related questions (including meta questions like how to do science)? That would be pretty disastrous, especially if there are any adversaries in the environment, right?
But:
To pursue their values, humans should be able to reason about them. To form preferences about a thing, humans should be able to consider the thing. Therefore, human ability to comprehend should limit what humans can care about. At least before humans start unlimited self-modification. I think this logically can’t be false.
Eliezer Yudkowsky is a core proponent of complexity of value, but in Thou Art Godshatter and Protein Reinforcement and DNA Consequentialism he basically makes the point that human values arose from complexity limitations, including complexity limitations imposed by brainpower limitations. Some famous alignment ideas (e.g. NAH, Shard Theory) kinda imply that human values are limited by human ability to comprehend, and this doesn’t seem controversial. (The ideas themselves are controversial, but for other reasons.)
If learning values is possible at all, there should be some simplicity biases which help to learn them. Wouldn’t it be strange if those simplicity biases were absolutely unrelated to simplicity biases of human cognition?
Based on your comments, I can guess that something below is the crux:
You define “values” as ~”the decisions humans would converge to after becoming arbitrarily more knowledgeable”. But that’s a somewhat controversial definition (some knowledge can lead to changes in values) and even given that definition it can be true that “past human ability to comprehend limits human values” — since human values were formed before humans explored unlimited knowledge. Some values formed when humans were barely generally intelligent. Some values formed when humans were animals.
You say that values depend on inscrutable brain machinery. But can’t we treat the machinery as a part of “human ability to comprehend”?
You talk about ontology. Humans can care about real diamonds without knowing what physical things the diamonds are made from. My reply: I define “ability to comprehend” based on ability to comprehend functional behavior of a thing under normal circumstances. Based on this definition, a caveman counts as being able to comprehend the cloud of atoms his spear is made of (because the caveman can comprehend the behavior of the spear under normal circumstances), even though the caveman can’t comprehend atomic theory.
Could you confirm or clarify the crux? Your messages felt ambiguous to me. In what specific way is A false?