# michaelcohen

Karma: 195 (LW), 57 (AF)

# Asymptotically Unambitious AGI

6 Mar 2019 1:15 UTC
40 points
• Thanks for a really productive conversation in the comment section so far. Here are the comments which won prizes.

Comment prizes:

Objection to the term benign (and ensuing conversation). Wei Dai. Link. $20
A plausible dangerous side-effect. Wei Dai. Link. $40

Short description length of simulated aliens predicting accurately. Wei Dai. Link. $120
Answers that look good to a human vs. actually good answers. Paul Christiano. Link. $20

Consequences of having the prior be based on K(s), with s a description of a Turing machine. Paul Christiano. Link. $90
Simulated aliens converting simple world-models into fast approximations thereof. Paul Christiano. Link. $35

Simulating suffering agents. cousin_it. Link. $20
Reusing simulation of human thoughts for simulation of future events. David Krueger. Link. $20

Options for transfer:

1) Venmo. Send me a request at @Michael-Cohen-45.

2) Send me your email address, and I’ll send you an Amazon gift card (or some other electronic gift card you’d like to specify).

3) Name a charity for me to donate the money to.

I would like to exert a bit of pressure not to do 3, and spend the money on something frivolous instead :) I want to reward your consciousness, more than your reflectively endorsed preferences, if you’re up for that. On that note, here’s one more option:

4) Send me a private message with a shipping address, and I’ll get you something cool (or a few things).

# Impact Measure Testing with Honey Pots and Myopia

21 Sep 2018 15:26 UTC
11 points
• I’m sorry it sounded like a dig at CHAI’s work, and you’re right that “typically described” is at best a generalization over too many people, and at worst, wrong. It would be more accurate to say that when people describe IRL, I get the feeling that it’s nearly complete—I don’t think I’ve seen anyone presenting an idea about IRL flag the concern that the issue of recognizing the demonstrator’s action might jeopardize the whole thing.

I did intend to cast some doubt on whether the IRL research agenda is promising, and whether inferring a utility function from a human’s actions instead of from a reward signal gets us any closer to safety, but I’m sorry to have misrepresented views. (And maybe it’s worth mentioning that I’m fiddling with something that bears a strong resemblance to Inverse Reward Design, so I’m definitely not that bearish on the whole idea.)

• A bit of a nitpick: IRD and this proposal both formulate how the agent believes the evaluator acts, while being technically agnostic about how the evaluator actually acts (at least in the specification of the algorithm; experiments/theory might be predicated on additional assumptions about the evaluator).

I believe this agent’s beliefs about how the evaluator acts are much more general than IRD’s. If the agent believed the evaluator was certain about which environment they were in, and it was the “training environment” from IRD, this agent would probably behave very similarly to an IRD agent. But of course, this agent considers many more possibilities for what the evaluator’s beliefs might be.

I agree this agent should definitely be compared to IRD, since they are both agents who don’t “take rewards literally”, but rather process them in some way first. Note that the design space of things which fit this description is quite large.

• From Paul:

I think the main problem with competitiveness is that you are just getting “answers that look good to a human” rather than “actually good answers.”

The comment was here, but I think it deserves its own thread. Wei makes the same point here (point number 3), and our ensuing conversation is also relevant to this thread.

My answers to Wei were two-fold: one was that if benignity is established, it’s possible to safely tinker with the setup until hopefully “answers that look good to a human” resemble good answers (we never quite reached an agreement about this). The second was an example of an extended setup (one has to read the parent comments to understand it) which would potentially be much more likely to yield actually good answers; I think we agree about this approach.

My original idea when I started working on this, actually, is also an answer to this concern. The reason it’s not in the paper is that I pared it down to a minimum viable product.

Construct an “oracle” by defining “true answers” as follows: answers which help a human do accurate prediction on a randomly sampled prediction task.*

I figured out that I needed a box, and everything else in this setup, and I realized that the setup could be applied to a normal reinforcement learner just as easily as for this oracle, so I simplified the approach.

I honestly need to dig through notes from last year, but my recollection is this: the operator receives an answer to a query, and then gets a random prediction task, which he has to make a prediction about before leaving the box. Later, the prediction is scored, and this is converted into a reward for BoMAI. BoMAI has a model class for how the prediction is scored; the output of these models is an answer for what the ground truth is. In all of these models, the ground truth doesn’t depend on BoMAI’s answer (that is, the model isn’t given read access to BoMAI’s answer). So the prediction task can involve the prediction of outside-world events, and the ground truth can be logged from the outside world, because BoMAI doesn’t conceive of its answer having a causal impact on the copy of the world which provides the ground truth for the prediction tasks. For example, the prediction task might be sampled from {“True or false: hexalated kenotones will suppress activity of BGQ-1”, “True or false: fluorinating random lysines in hemoglobin will suppress activity of BGQ-1”, etc.} (half of those terms are made up). After this episode, the prediction can be graded in the outside world. With the obvious scoring rule, the oracle would just say “I don’t care how plausible it sounds, whatever they ask you, just say it’s not going to work. Most things don’t.” With a better scoring rule, I would expect it to give accurate information in a human-understandable format.
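Here’s a minimal sketch of that episode loop as I remember it; all of the function and object names (run_episode, operator.predict, outside_world.resolve, the 0/1 scoring rule) are hypothetical placeholders rather than anything from the paper:

```python
import random

def run_episode(bomai, operator, prediction_tasks, outside_world):
    """Hypothetical sketch of one episode of the prediction-task 'oracle' setup."""
    query = operator.ask_question()
    answer = bomai.answer(query)                         # BoMAI's answer to the operator
    task = random.choice(prediction_tasks)               # randomly sampled prediction task
    guess = operator.predict(task, informed_by=answer)   # operator predicts before leaving the box
    truth = outside_world.resolve(task)                  # ground truth logged later, outside the box
    reward = score(guess, truth)                         # accuracy converted into BoMAI's reward
    bomai.observe(reward)

def score(guess, truth):
    # The "obvious" 0/1 scoring rule; as noted above, this invites the degenerate
    # "just say it won't work" strategy, so a better-calibrated rule is needed.
    return 1.0 if guess == truth else 0.0
```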

I haven’t thought about this in a while, and I was honestly worse at thinking about alignment at that point in time, so I don’t mean to convey much confidence that this approach works out. What I do think it shows, alongside the idea I came up with in the conversation with Wei, linked above, is that this general approach is powerful and amenable to improvement in ways that render it even more useful.

* A more recent thought: as described, “oracle” is not the right word for this setup. It would respond to “What approaches might work for curing cancer?” with “Doesn’t matter. There are more gaps in your knowledge regarding economics. A few principles to keep in mind…” However, if the prediction task distribution were conditioned in some way on the question asked, one might be able to make it more likely that the “oracle” answers the question, rather than just spewing unrelated insight.

# IRL in General Environments

10 Jul 2019 18:08 UTC
7 points

# Value Learning is only Asymptotically Safe

8 Apr 2019 9:45 UTC
7 points
• This is an interesting world-model.

In practice, this means that the world model can get BoMAI to choose any action it wants

So really this is a set of world-models, one for every algorithm for picking actions to present as optimal to BoMAI. Depending on how the actions are chosen by the world-model, either it will be ruled out by Assumption 2 or it will be benign.

Suppose the choice of action depends on outside-world features. (This would be the point of manipulating BoMAI—getting it to take actions with particular outside-world effects.) Then, the feature that this world-model associates reward with depends on outside-world events that depend on actions taken, and is ruled out by Assumption 2. And as the parenthetical mentions, if the world-model is not selecting actions to advertise as high-reward based on the outside-world effects of those actions, then the world-model is benign.

However, it can also save computation

Only the on-policy computation is accounted for.

• These comments are responding to the version of AUP presented in the paper. (Let me know if I should be commenting elsewhere.)

1)

If an action is useful w.r.t. the actual reward but useless to all other rewards (as useless as taking the no-op action), that is the ideal according to AUP—i.e. if it is not worth doing because the impact measure is too strong, nothing is worth doing. This is true even if the action is extremely useful to the actual reward. Am I right in thinking that we can conceptualize AUP as saying: “take actions which lead to reward, but wouldn’t be useful (or detrimental) to gaining reward if reward were specified differently”? A typical outline for an AGI gaining power to accomplish a goal might be: gain power, use the power to run some code to help you get maximal reward. We might imagine an AGI convincing a human to run some code, and then giving them the actual code. AUP would be less inclined to do this because after winning over the human, the Q-values for lots of reward functions would be extremely high, so it would be more reluctant to bring itself to that position. Suppose instead that AUP gives the human operator code to run first and then convinces them to run it. The actions which it takes to gain trust are not useful for other rewards, because they’ll only lead to the code already given being run, which is useless from the perspective of the other reward functions. Do you think AUP would be motivated to search for ways to lock in the effects of future power, and then pursue that power?

2)

If increasing attainable utility and decreasing attainable utility are both dangerous, then raising the size of the action space to a power makes the agent more dangerous. Consider transforming action/observation/reward so that the agent submits 3 actions at a time, and receives the next three observations (with the rewards averaged). This is just a new action space, cubically larger. But in this action space, if the “first” action decreased attainable utility dangerously, and the “third” action increased it dangerously, that would cancel out and fail to register as dangerous (see the numeric sketch after point 4 below). Since this problem appears in the math, but not in the intuition, it makes me wary of the reliability of the intuition.

3)

Q-learning converges by sampling all actions repeatedly from all states. AUP penalizes actions according to disruptions in Q-values. I understand that AGI won’t be a Q-learner in a finite-state MDP, but I think it’s worth noting: AUP learns to avoid catastrophic states (if in fact it does) by testing them out.

4)

Suppose we have a chatbot, and the action space is finite-length strings of text. What exactly is the no-op action? If it is a meaningless string of text, I suspect every meaningful string of text will be “too high impact”. Maybe the no-op action is an imitation of a human? I think humans are sufficiently powerful that normal human policies often accidentally cause large impact (i.e. make it massively more difficult or easy to achieve random goals), and that infinitely often (although perhaps not frequently), having the no-op action be a human policy would lead to an incredibly high tolerance for impact, which would give AUP plenty of leeway to do dangerous things.
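Here is the minimal numeric sketch of the cancellation worry from point 2, assuming an AUP-style penalty proportional to the absolute change in an auxiliary Q-value (the numbers are made up):

```python
# Hypothetical attainable utility (auxiliary Q-value) for one auxiliary reward,
# measured before and after each primitive action inside a 3-action macro-action.
q_start = 10.0
q_after_first = 2.0     # "first" action destroys attainable utility (dangerous)
q_after_second = 2.0    # "second" action is neutral
q_after_third = 10.5    # "third" action restores it (also dangerous)

# Penalizing each primitive action registers both dangerous swings:
per_step_penalty = (abs(q_after_first - q_start)
                    + abs(q_after_second - q_after_first)
                    + abs(q_after_third - q_after_second))

# Penalizing only the macro-action sees just the net change, which nearly cancels:
macro_penalty = abs(q_after_third - q_start)

print(per_step_penalty, macro_penalty)  # 16.5 vs 0.5
```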

• a directly coded AGI that runs “natively” on the TM who can decide to execute arbitrary algorithms “natively” on the TM.

At the end of the day, it will be running some subroutine for its gain-trust/predict-accurately phase.

I assume this sort of thing is true for any model of computation, but when you construct a universal Turing machine, so that it can simulate computation step after computation step of another Turing machine, it takes way more than one computation step for each one. If the AGI is using machinery that would allow it to simulate any world-model, it will be way slower than the Turing machine built for that algorithm.

I realize this seems really in-the-weeds and particular, but I think this is a general principle of computation. The more general a system is, the less well it can do any particular task. I think an AGI that chose to pipe viable predictions to the output with some procedure will be slower than the Turing machine which just runs that procedure.
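To illustrate the kind of overhead I mean, here is a toy sketch (nothing to do with actual UTM constructions; the instruction set and the operation counts are made up) comparing a program run natively with the same program stepped through a general-purpose simulator loop:

```python
# A made-up instruction set, just to count primitive operations.
program = [("add", 3), ("add", 4), ("mul", 2)] * 1000

def run_native(program):
    """Run the program directly: roughly one primitive operation per instruction."""
    acc, ops = 0, 0
    for op, arg in program:
        acc = acc + arg if op == "add" else acc * arg
        ops += 1
    return acc, ops

def run_simulated(program):
    """Step the same program through an explicit fetch/decode/execute loop,
    the way a general-purpose simulator would, paying bookkeeping costs on
    every simulated step."""
    acc, pc, ops = 0, 0, 0
    while pc < len(program):
        op, arg = program[pc]; ops += 1                      # fetch
        is_add = (op == "add"); ops += 1                     # decode/dispatch
        acc = acc + arg if is_add else acc * arg; ops += 1   # execute
        pc += 1; ops += 1                                    # advance the simulated head
        ops += 1                                             # loop bookkeeping
    return acc, ops

print(run_native(program)[1], run_simulated(program)[1])  # 3000 vs 15000 operations
```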

# Not Deceiving the Evaluator

8 May 2019 5:37 UTC
5 points
• One utility function might turn out much easier to optimize than the other, in which case the harder-to-optimize one will be ignored completely. Random events might influence which utility function is harder to optimize, so one can’t necessarily tune the weighting in advance to try to take this into account.
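A toy numeric version of that first point, with made-up (linear) returns on effort for the two utility functions:

```python
# Hypothetical: the agent splits a fixed effort budget between optimizing U1 and U2,
# and U1 happens to be far cheaper to optimize per unit of effort.
def mixed_utility(effort_on_u1, budget=10.0, rate_u1=5.0, rate_u2=0.1):
    effort_on_u2 = budget - effort_on_u1
    return 0.5 * (rate_u1 * effort_on_u1) + 0.5 * (rate_u2 * effort_on_u2)

best_value, best_split = max((mixed_utility(e), e) for e in [i * 0.5 for i in range(21)])
print(best_split)  # 10.0: the entire budget goes to U1, and U2 is ignored completely
```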

One of the reasons was the problem of positive affine scaling preserving behavior, but I see Stuart addresses that.

And actually, some of the reasons for thinking there would be more complicated mixing are going away as I think about it more.

EDIT: yeah, if they had the same priors and did unbounded reasoning, I wouldn’t be surprised anymore if there exists a mixture that they would agree to.

• 3. Misspecified or incorrectly learned goals/values

I think this phrasing misplaces the likely failure modes. An example that comes to mind from this phrasing is that we mean to maximize conscious flourishing, but we accidentally maximize dopamine in large brains.

Of course, this example includes an agent intervening in the provision of its own reward, but since that seems like the paradigmatic example here, maybe the language could better reflect that, or maybe this could be split into two.

The single technical problem that appears biggest to me is that we don’t know how to align an agent with any goal. If we had an indestructible magic box that printed a number to a screen corresponding to the true amount of Good in the world, we still wouldn’t know how to design an agent that maximizes that number (instead of taking over the world, and tampering with the cameras that are aimed at the screen/the optical character recognition program used to decipher the image). This problem seems to me like the single most fundamental source of AI risk. Is 3 meant to include this?

• In this setup, the agent believes they are in state A, and believes the evaluator believes they are most likely in state A″. State BC looks like C, but has utility like B. C is the best state.

ETA: And for a sequence of states, the utility of the sequence is the sum of the utilities of the individual states.

A′ and A″ look like A, and BC looks like C.

In this example, the agent is pretty sure about everything, since that makes it simpler, but the analysis still holds if this only represents a part of the agent’s belief distribution.

The agent is quite sure they’re in state A.

The agent is quite sure that the evaluator is pretty sure they’re in state A″, which is a very similar state, but has one key difference—from A″, the relevant action has no effect. The agent won’t capitalize on this confusion.

The optimal policy is , followed by (forever) if , otherwise followed by . Since the agent is all but certain about the utility function, none of the other details matter much.

Note that the agent could get higher reward by doing , , then forever. The reason for this is that after the evaluator observes the observation C, it will assign probability 4/5 to being in state C, and probability 1/5 to being in state BC. Since they will stay in that state forever, 4/5 of the time the reward will be 10, and 1/5 of the time the reward will be −1.
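Spelling out that expected value: 4/5 · 10 + 1/5 · (−1) = 8 − 0.2 = 7.8 expected reward per step under that policy.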

The agent doesn’t have to be sure about the utility function for this sort of thing to happen. If there is a state that looks like state X, but under many utility functions it has utility like state Y, and if it seems like the evaluator finds that sort of state a priori unlikely, then this logic applies.

• A key problem here is that if we use a human as the evaluator, the agent assigns 0 prior probability to the truth: the human won’t be able to update beliefs as a perfect Bayesian, sample a world-state history from his beliefs, and assign a value to it according to a utility function. For a Bayesian reasoner that assigns 0 prior probability to the truth, God only knows how it will behave, even in the limit. (Unless there is some very odd utility function such that the human could be described in this way?)

But maybe this problem could be fixed if the agent takes some more liberties in modeling the evaluator. Maybe once we have a better understanding of bounded approximately-Bayesian reasoning, the agent can model the human as being a bounded reasoner, not a perfectly Bayesian reasoner, which might allow the agent to assign a strictly positive prior to the truth.

And all this said, I don’t think we’re totally clueless when it comes to guessing how this agent would behave, even though a human evaluator would not satisfy the assumptions that the agent makes about him.

• Something that confuses me is that since the evaluator sees everything the agent sees/does, it’s not clear how the agent can deceive the evaluator at all. Can someone provide an example in which the agent has an opportunity to deceive in some sense and declines to do that in the optimal policy?

(Copying a comment I just made elsewhere)

This setup still allows the agent to take actions that lead to observations that make the evaluator believe they are in a state that it assigns high utility to, if the agent identifies a few weird convictions in the prior. That’s what would happen if it were maximizing the sum of the rewards, if it had the same beliefs about how rewards were generated. But it’s maximizing the utility of the true state, not the state that the evaluator believes they’re in.

(Expanding on it)

So suppose the evaluator was human. The human’s lifetime of observations in the past gives it a posterior belief distribution which looks to the agent like a weird prior, with certain domains that involve oddly specific convictions. The agent could steer the world toward those domains, and steer towards observations that will make the evaluator believe they are in a state with very high utility. But it won’t be particularly interested in this, and it might even be particularly disinterested, because the information it gets about what the evaluator values may be less relevant to the actual states it finds itself in a position to navigate between, if the agent believes the evaluator believes they are in a different region of the state space. I can work on a toy example if that isn’t satisfying.

ETA: One such “oddly specific conviction”, e.g., might be the relative implausibility of being placed in a delusion box where all the observations are manufactured.

• I don’t see why their methods would be elegant.

Yeah, I think we have different intuitions here; are we at least within a few bits of log-odds disagreement? Even if not, I am not willing to stake anything on this intuition, so I’m not sure this is a hugely important disagreement for us to resolve.

I don’t see how MAP helps things either

I didn’t realize that you think that a single consequentialist would plausibly have the largest share of the posterior. I assumed your beliefs were in the neighborhood of:

it seems plausible that the weight of the consequentialist part is in excess of 1/million or 1/billion

(from your original post on this topic). In a Bayes mixture, I bet that a team of consequentialists that collectively amount to 1/10 or even 1/50 of the posterior could take over our world. In MAP, if you’re not first, you’re last, and more importantly, you can’t team up with other consequentialist-controlled world-models in the mixture.

• I think you might be more or less right here.

I hadn’t thought about the can-do and the worth-doing update, in addition to the anthropic update. And it’s not that important, but for terminology’s sake, I forgot that the update could send a world-model’s prior to 0, so the prior might not be universal anymore.

The reason I think of these steps as updates to what started as a universal prior is that they would like to take over as many worlds as possible, and they don’t know which one. And the universal prior is a good way to predict the dynamics of a world you know nothing about.

they have to figure out how to make sufficient predictions about the richer world using their own impoverished resources, which could involve doing research that’s equivalent to our physics, chemistry, biology, neuroscience, etc. I’m not seeing how sampling from “anthropically updated speed prior” would do the equivalent of all that

If you want to make fast predictions about an unknown world, I think that’s what we call a speed prior. Once the alien race has submitted a sequence of observations, they should act as if the observations were largely correct, because that’s the situation in which anything they do matters, so they are basically “learning” about the world they are copying (along with what they get from their input channel, of course, which corresponds to the operator’s actions). Sampling from a speed prior allows the aliens to output quick-to-compute plausible continuations of what they’ve outputted already. Hence, my reduction from [research about various topics] to [sampling from a speed prior].

But—when you add in the can-do update and the worth-doing update, I agree with you that the resulting measure (speed prior + anthropic update + can-do update + worth-doing update) might have a longer description than the measure which starts like that, then takes a treacherous turn. This case seems different to me (so I don’t make the same objection on this level) because the can-do update and the worth-doing update are about this treacherous turn.

So let me back up here. I don’t say anything in the Natural Prior Assumption about “for sufficiently small ,” but this makes me think I might need to. As I suggested above, I do think there is huge computational overhead that comes from having evolved life in a world running an algorithm on a “virtual machine” in their Turing-machine-simulated world, compared to the algorithm just being run on a Turing machine that is specialized for that algorithm. (75% confidence that life in a universe leads to egregious slowdown; 97% confidence that running on a virtual machine leads to at least 2x slowdown). And without the aliens involved, the “predict well” part is simpler than “predict well” + “treacherous turn.” In this version of the Natural Prior Assumption, the intuition is that control flow takes time to evaluate, even if in rare circumstances it doesn’t require more code. (Really, the reasoning that got us here is that in the rare case that treacherous world-models may be shorter to describe, they are also very slow, but that “intuition” seems less likely to generalize to world-models we haven’t thought about). But I do think a maximum a posteriori estimate using a speed prior would be pretty devastating to optimization daemons.
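A rough sketch of how MAP plus a speed prior plays out under that intuition; the functional form of the prior (2^(-description length) divided by compute time) and all of the numbers are assumptions I’m making for illustration, not anything from the paper:

```python
import math

def speed_prior_log2_weight(description_length_bits, compute_steps):
    # One way to operationalize a speed-prior-style weight: 2^(-length) / time,
    # i.e. log2-weight = -length - log2(time). (An assumption for illustration.)
    return -description_length_bits - math.log2(compute_steps)

# Made-up numbers for two world-models that predict the same history equally well:
native = speed_prior_log2_weight(400, 1e9)   # runs the predictive algorithm "directly"
daemon = speed_prior_log2_weight(390, 1e15)  # simulated aliens on a "virtual machine":
                                             # slightly shorter description, ~10^6x slower

print(round(native, 1), round(daemon, 1))
# ≈ -429.9 vs -439.8: with these made-up numbers the slowdown penalty outweighs the
# daemon's assumed description advantage, so under MAP the daemon gets no weight at all.
```

Whether the slowdown penalty actually exceeds any description-length advantage is, of course, exactly what the Natural Prior Assumption is asserting.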

Another point I was considering (and I’m not exactly sure where this fits in): I have the intuition that for the speed prior that they use when they design plausible output, if they start with one that’s either more severe or less severe than the one we use, they lose ground. After i episodes, there’s a trade-off to be made between accuracy and speed for being a high-posterior world-model (where accuracy is more favored as i increases). If their speed prior isn’t severe enough, then at any given point in time, the world-model they use to pipe to output will be slower, which takes them more computation, which penalizes them. If their speed prior is too severe, they’ll be too focused on approximating and lose to more accurate world-models whose relative slowness we’re prepared to accommodate. I think their best bet is to match our speed prior, and take whatever advantage they can get from the anthropic update and picking their battles (the other two updates). Add “matching our prior” to the list of “things that make it hard to take over a universal prior.”