michaelcohen

Karma: 577

michaelcohen 17 Aug 2023 20:59 UTC
LW: 13 AF: 6
AF
in reply to: paulfchristiano’s comment on: Thoughts on sharing information about language model capabilities
I think [process-based RL] has roughly the same risk profile as imitation learning, while potentially being more competitive.
I agree with this in a sense, although I may be quite a bit more harsh about what counts as “executing an action”. For example, if reward is based on an overseer talking about the action with a large group of people/AI assistants, then that counts as “executing the action” in the overseer-conversation environment, even if the action looks like it’s for some other environment, like a plan to launch a new product in the market. I do think myopia in this environment would suffice for existential safety, but I don’t know how much myopia we need.
If you’re always talking about myopic/process-based RLAIF when you say RLAIF, then I think what you’re saying is defensible. I speculate that not everyone reading this recognizes that your usage of RLAIF implies RLAIF with a level of myopia that matches current instances of RLAIF, and that that is a load-bearing part of your position.
I say “defensible” instead of fully agreeing because I weakly disagree that increasing compute is any more dangerous a way to improve performance than modifying the objective to a new myopic objective. That is, I disagree with this:
I think you would probably prefer to do process-based RL with smaller models, rather than imitation learning with bigger models
You suggest that increasing compute is the last thing we should do if we’re looking for performance improvements, as opposed to adding a very myopic approval-seeking objective. I don’t see it. I think changing the objective from imitation learning is more likely to lead to problems than scaling up the imitation learners. But this is probably beside the point, because I don’t think problems are particularly likely in either case.

michaelcohen 1 Apr 2019 23:39 UTC
LW: 13 AF: 5
AF
on: Asymptotically Unambitious AGI
Thanks for a really productive conversation in the comment section so far. Here are the comments which won prizes.
Comment prizes:
Objection to the term benign (and ensuing conversation). Wei Dai. Link. $20
A plausible dangerous side-effect. Wei Dai. Link. $40
Short description length of simulated aliens predicting accurately. Wei Dai. Link. $120
Answers that look good to a human vs. actually good answers. Paul Christiano. Link. $20
Consequences of having the prior be based on K(s), with s a description of a Turing machine. Paul Christiano. Link. $90
Simulated aliens converting simple world-models into fast approximations thereof. Paul Christiano. Link. $35
Simulating suffering agents. cousin_it. Link. $20
Reusing simulation of human thoughts for simulation of future events. David Krueger. Link. $20
Options for transfer:
1) Venmo. Send me a request at @Michael-Cohen-45.
2) Send me your email address, and I’ll send you an Amazon gift card (or some other electronic gift card you’d like to specify).
3) Name a charity for me to donate the money to.
I would like to exert a bit of pressure not to do 3, and spend the money on something frivolous instead :) I want to reward your consciousness, more than your reflectively endorsed preferences, if you’re up for that. On that note, here’s one more option:
4) Send me a private message with a shipping address, and I’ll get you something cool (or a few things).

michaelcohen 16 May 2021 18:52 UTC
LW: 12 AF: 9
AF
on: Formal Inner Alignment, Prospectus
To state the least of our problems first: this requires a 100x slowdown in comparison with the state-of-the-art deep learning (or whatever) we’re layering the consensus algorithm on top of
I think you’re imagining deep learning as a MAP-type approach: it just identifies a best hypothesis and does inference with that. Comparing the consensus algorithm with (pure, idealized) MAP, 1) it is no slower, and 2) the various corners that can be cut for MAP can be cut for the consensus algorithm too. Starting with 1), the bulk of the work for either the consensus algorithm or a MAP approach is computing the posterior to determine which model(s) is(are) best. In an analogy to neural networks, it would be like saying most of the work comes from using the model (the forward pass) rather than arriving at the model (the many forward and backward passes in training). Regarding 2), state-of-the-art-type AI basically assumes approximate stationarity when separating a training phase from a test/execution phase. This is cutting a huge corner, and it means that when you think of a neural network running, you mostly think about it using the hypothesis that it has already settled on. But if we compare apples to apples, a consensus algorithm can cut the same corner to some extent. Neither a MAP algorithm nor a consensus algorithm is any better equipped than the other to, say, update the posterior only when the timestep is a power of two. In general, training (be it SGD or posterior updating) is the vast bulk of the work in learning. To select a good hypothesis in the first place you will have already had to consider many more; the consensus algorithm just says to keep track of the runners-up.
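A minimal sketch of the comparison above, assuming a small finite model class with an explicit posterior. The function names and the likelihood numbers are illustrative assumptions, not taken from the paper. The point it tries to show is that the posterior update is shared work, and the two approaches differ only in whether the runners-up are kept afterwards.

```python
import math

def update_posterior(log_weights, likelihoods):
    """One Bayesian update over a finite model class.

    likelihoods[i] is the (strictly positive) probability that model i
    assigned to the latest observation.
    """
    unnormalized = [lw + math.log(l) for lw, l in zip(log_weights, likelihoods)]
    log_norm = math.log(sum(math.exp(x) for x in unnormalized))
    return [x - log_norm for x in unnormalized]

def map_model(log_weights):
    """MAP: keep only the single most probable model."""
    return max(range(len(log_weights)), key=lambda i: log_weights[i])

def top_k_models(log_weights, k):
    """Consensus-style: keep the k most probable models (winner plus runners-up)."""
    ranked = sorted(range(len(log_weights)), key=lambda i: log_weights[i], reverse=True)
    return ranked[:k]

# Hypothetical numbers: uniform prior over four models, one observation.
prior = [math.log(0.25)] * 4
posterior = update_posterior(prior, [0.9, 0.5, 0.8, 0.1])
print(map_model(posterior))        # index of the best model
print(top_k_models(posterior, 2))  # best model plus one runner-up
```

Both selectors read the same posterior, so in this idealized setting the extra cost of the consensus variant is a sort rather than a second round of learning.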

Third, the consensus algorithm requires a strong form of realizability assumption, where you not only assume that our Bayesian space contains the true hypothesis, but furthermore, that it’s in the top 100 (or whatever number we choose). This hypothesis has to be really good: we have to think that malign hypotheses never out-guess the benign hypothesis. Otherwise, there’s a chance that we eliminate the good guy at some point (allowing the bad guys to coordinate on a wrong answer). But this is unrealistic! The world is big and complex enough that no realistic hypothesis has all the answers.
I don’t understand what out-guess means. But what we need is that the malign hypotheses don’t have substantially higher posterior weight than the benign ones. As time passes, the probability of this happening is not independent. The result I show about the probability of the truth being in the top set applies to all time, not any given point in time. I don’t know what “no realistic hypothesis has all the answers” means. There will be a best “realistic” benign hypothesis, and we can talk about that one.
Michael Cohen seems to think that restricting to imitation learning makes the realizability assumption realistic
Realistic in theory! Because the model doesn’t need to include the computer. I do not think we can actually compute every hypothesis simpler than a human brain in practice.
When you go from an idealized version to a realistic one, all methods can cut corners, and I don’t see a reason to believe that the consensus algorithm can’t cut corners just as well. Realistically, we will have some hypothesis-proposing heuristic, strong enough to identify models one of which is accurate enough to generate powerful agency. This heuristic will clearly cast a wide net (if not, how would it magically arrive at a good answer? Its internals would need some hypothesis-generating function). Rather than throwing out the runners-up, the consensus algorithm stores them. The hypothesis-generating heuristic is another attack surface for optimization daemons, and I won’t make any claims for now about how easy or hard it is to prevent such a thing.
to apply this to something like deep learning, we need to think that each run has an independent chance of creating safe AGI
Evan and I talked along these lines for a bit. My basic position is that if “local search” is enough to get to general intelligence, our algorithms will be searching in spaces (or regions) where diverse hypotheses are close. Diverse hypothesis generation is just crucial for general intelligence. I do not advocate training GPT-N with 10^100 different initializations. I don’t think you have to, and I don’t think it would help much.
Fourth, the consensus algorithm still gives inner optimizers control of when the system asks for more feedback. This in itself could be catastrophic! For example, it could ask at the worst times, when it anticipates the human feedback to be of poor quality, or when asking itself has catastrophic consequences. A silly example would be causing a self-driving car to ask for feedback right when it needs to enact difficult evasive maneuvers to avoid a crash.
It’s imitation learning, so whatever the person would do when stepping in is what the algorithm learns to do when they don’t. If the human action would be of poor quality if the human were called upon to take the action, then all the reasonable models, benign ones included, will predict that poor action, and the poor action will be taken regardless of whether the human was brought into the loop. If a person would be unable to avoid a crash when stepping in at the last minute, the imitation learner will crash the car too! If 99 hypotheses are correctly predicting the human action, then the 100th hypothesis can correctly predict it too, and the human action is executed, or it can err deliberately, in which case the human is called and the human action is executed anyway. You might have to modify this protocol a bit before using it to learn human policies for operating heavy machinery.
Fifth, although (as Michael Cohen notes) the frequency of asking for feedback will rapidly diminish, the learning algorithm converges to the correct hypothesis extremely slowly… We have to face the fact that it might require human feedback at any point in the future.
Yeah this feels like a small cost to me. One person can be doing this for many instances at once. If you want to mimic a human that has been attentive for a given period of time (because you want to know how a human drives a car, having already gotten their bearings previously, rather than how a human drives a car having just been transported into it) you can know a few timesteps in advance if there is any likely circumstance in which you will need to query the demonstrator, and you can use the lead time to get the demonstrator acquainted. The query complexity will go up of course because there will be false alarms, but this is a totally surmountable problem for active learning.
The fourth point [controlling when to ask for more feedback] really seems like the deal-breaker. Malign mesa-optimizers could intelligently leverage this to communicate with the outside world, manipulate the training process, etc.
A malign mesa-optimizer has one moment to err, and then it will be discarded from the posterior. There’s not much it could say to us with a well-timed “now!”. If multiple mesa-optimizers coordinate around a message, any mesa-optimizer not on board with this mission could disrupt the message. If you like, we could add a 1/million chance of querying to every timestep. Finally, with butterfly-effects, I don’t think that even a billion well-timed asks for help could precipitate human extinction. This is a bit all over the map, and this may be the point you want to press, but if you do, I think it is worth logging that if this is the only attack surface left, we shouldn’t forget to put a stake in the ground marking “lots of progress”. If you believe that a single malign world-model could derail us if we use a Solomonoff predictor, but a consensus predictor would only be derailed by a supermajority of malign world-models, aligned with each other, that is a qualitative difference.
My third and final example: in one conversation, someone made a claim which I see as “exactly wrong”: that we can somehow lower-bound the complexity of a mesa-optimizer in comparison to a non-agentic hypothesis (perhaps because a mesa-optimizer has to have a world-model plus other stuff, where a regular hypothesis just needs to directly model the world). This idea was used to argue against some concern of mine.
The problem is precisely that we know of no way of doing that! If we did, there would not be any inner alignment problem! We could just focus on the simplest hypothesis that fit the data, which is pretty much what you want to do anyway!
Maybe this was someone else, but it could have been me. I think MAP probably does solve the inner alignment problem in theory, but I don’t expect to make progress resolving that question, and I’m interested in hedging against being wrong. Where you say, “We know of no way of doing that” I would say, “We know of ways that might do that, but we’re not 100% sure”. On my to-do list is to write up some of my disagreements with Paul’s original post on optimization daemons in the Solomonoff prior (and maybe with other points in this article). I don’t think it’s good to argue from the premise that a problem is worth taking seriously, and then see what follows from the existence of that problem, because a problem can exist with 10% probability and be worth taking seriously, but one might get in trouble embedding its existence too deeply in one’s view of the world, if it is still on balance unlikely. That’s not to say that most people think Paul’s conclusions are <90% likely, just that one might.

michaelcohen 11 Jul 2019 2:23 UTC
LW: 11 AF: 6
AF
in reply to: Rohin Shah’s comment on: IRL in General Environments
I’m sorry it sounded like a dig at CHAI’s work, and you’re right that “typically described” is at best a generalization over too many people, and at worst, wrong. It would be more accurate to say that when people describe IRL, I get the feeling that it’s nearly complete: I don’t think I’ve seen anyone presenting an idea about IRL flag the concern that the issue of recognizing the demonstrator’s action might jeopardize the whole thing.
I did intend to cast some doubt on whether the IRL research agenda is promising, and whether inferring a utility function from a human’s actions instead of from a reward signal gets us any closer to safety, but I’m sorry to have misrepresented views. (And maybe it’s worth mentioning that I’m fiddling with something that bears strong resemblance to Inverse Reward Design, so I’m definitely not that bearish on the whole idea).

michaelcohen 10 Aug 2023 1:32 UTC
LW: 10 AF: 3
AF
on: Thoughts on sharing information about language model capabilities
I believe that LM agents based on chain of thought and decomposition seem like the most plausible approach to bootstrapping subhuman systems into trusted superhuman systems. For about 7 years using LM agents for RLAIF has seemed like the easiest path to safety,^[4] and in my view this is looking more and more plausible over time.
I agree whole-heartedly with the first sentence. I’m not sure why you understand it to support the second sentence; I feel the first sentence supports my disagreement with the second sentence! Long-horizon RL is a different way to get superhuman systems, and one that encourages intervening in feedback if the agent is capable enough. Doesn’t the first sentence support the case that it would be safer to stick to chain of thought and decomposition as the key drivers of superhumanness, rather than using RL?

michaelcohen 12 Aug 2023 6:37 UTC
LW: 8 AF: 5
AF
in reply to: paulfchristiano’s comment on: Thoughts on sharing information about language model capabilities
What is process-based RL?
I think your intuitions about costly international coordination are challenged by a few facts about the world. 1) Advanced RL, like open borders + housing deregulation, guarantees vast economic growth in wealthy countries. Open borders, in a way that seems kinda speculative, but intuitively forceful for most people, has the potential to existentially threaten the integrity of a culture, including especially its norms; AI, in a way that seems kinda speculative, but intuitively forceful for most people, has the potential to existentially threaten all life. The decisions of wealthy countries are apparently extremely strongly correlated, maybe in part for “we’re all human”-type reasons, and maybe in part because legislators and regulators know that they won’t get their ear chewed off for doing things like the US does. With immigration law, there is no attempt at coordination; quite the opposite (e.g. Syrian refugees in the EU). 2) The number of nuclear states is stunningly small if one follows the intuition that wildly uncompetitive behavior, which leaves significant value on the table, produces an unstable situation. Not every country needs to sign on eagerly to avoiding some of the scariest forms of AI. The US/EU/China can shape other countries’ incentives quite powerfully. 3) People in government do not seem to be very zealous about economic growth. Sorry this isn’t a very specific example. But their behavior on issue after issue does not seem very consistent with someone who would see, I don’t know, 25% GDP growth from their country’s imitation learners, and say, “these international AI agreements are too cautious and are holding us back from even more growth”; it seems much more likely to me that politicians’ appetite for risking great power conflict requires much worse economic conditions than that.
In cases 1 and 2, the threat is existential, and countries take big measures accordingly. So I think existing mechanisms for diplomacy and enforcement are powerful enough “coordination mechanisms” to stop highly-capitalized RL projects. I also object a bit to calling a solution here “strong global coordination”. If China makes a law preventing AI that would kill everyone with 1% probability if made, that’s rational for them to do regardless of whether the US does the same. We just need leaders to understand the risks, and we need them to be presiding over enough growth that they don’t need to take desperate action, and that seems doable.
Also, consider how much more state capacity AI-enabled states could have. It seems to me that a vast population of imitation learners (or imitations of populations of imitation learners) can prevent advanced RL from ever being developed, if the latter is illegal; they don’t have to compete with them after they’ve been made. If there are well-designed laws against RL (beyond some level of capability), we would have plenty of time to put such enforcement in place.

michaelcohen 11 May 2019 1:28 UTC
LW: 8 AF: 4
AF
in reply to: RyanCarey’s comment on: Not Deceiving the Evaluator
A bit of a nitpick: IRD and this formulate how the agent believes the evaluator acts, while being technically agnostic about how the evaluator actually acts (at least in the specification of the algorithm; experiments/theory might be predicated on additional assumptions about the evaluator).
I believe this agent’s beliefs about how the evaluator acts are much more general than IRD. If the agent believed the evaluator was certain about which environment they were in, and it was the “training environment” from IRD, this agent would probably behave very similarly to an IRD agent. But of course, this agent considers many more possibilities for what the evaluator’s beliefs might be.
I agree this agent should definitely be compared to IRD, since they are both agents who don’t “take rewards literally”, but rather process them in some way first. Note that the design space of things which fit this description is quite large.

michaelcohen 10 Mar 2019 3:38 UTC
LW: 7 AF: 3
AF
on: Asymptotically Unambitious AGI
From Paul:
I think the main problem with competitiveness is that you are just getting “answers that look good to a human” rather than “actually good answers.”
The comment was here, but I think it deserves its own thread. Wei makes the same point here (point number 3), and our ensuing conversation is also relevant to this thread.
My answers to Wei were two-fold: one is that if benignity is established, it’s possible to safely tinker with the setup until hopefully “answers that look good to a human” resembles good answers (we never quite reached an agreement about this). The second was an example of an extended setup (one has to read the parent comments to understand it) which would potentially be much more likely to yield actually good answers; I think we agree about this approach.
My original idea when I started working on this, actually, is also an answer to this concern. The reason it’s not in the paper is because I pared it down to a minimum viable product.
Construct an “oracle” by defining “true answers” as follows: answers which help a human do accurate prediction on a randomly sampled prediction task.*
I figured out that I needed a box, and everything else in this setup, and I realized that the setup could be applied to a normal reinforcement learner just as easily as for this oracle, so I simplified the approach.
I honestly need to dig through notes from last year, but my recollection is this: the operator receives an answer to a query, and then gets a random prediction task, which he has to make a prediction about before leaving the box. Later, the prediction is scored, and this is converted into a reward for BoMAI. BoMAI has a model class for how the prediction is scored; the output of these models is an answer for what the ground truth is. In all of these models, the ground truth doesn’t depend on BoMAI’s answer (that is, the model isn’t given read access to BoMAI’s answer). So the prediction task can involve the prediction of outside world events, and the ground truth can be logged from the outside world, because BoMAI doesn’t conceive of its answer having a causal impact on the copy of the world which provides the ground truth for the prediction tasks. For example, the prediction task might be sampled from {“True or false: hexalated kenotones will suppress activity of BGQ-1”, “True or false: fluorinating random lysines in hemoglobin will suppress activity of BGQ-1”, etc.} (half of those terms are made up). After this episode, the prediction can be graded in the outside world. With the obvious scoring rule, the oracle would just say “I don’t care how plausible it sounds, whatever they ask you, just say it’s not going to work. Most things don’t.” With a better scoring rule, I would expect it to give accurate information in a human-understandable format.
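To make the scoring-rule point concrete, here is a small sketch under stated assumptions. The comment does not say what the “better scoring rule” is; the baseline-relative log score below is one hypothetical choice, and the 10% base rate and all probabilities are made-up numbers. It only illustrates why plain accuracy rewards the blanket advice “just say it’s not going to work”, while a rule measured against the base rate gives that advice no credit.

```python
import math

def accuracy(predictions, truths):
    """The 'obvious' rule: fraction of true/false predictions that were right."""
    return sum(p == t for p, t in zip(predictions, truths)) / len(truths)

def log_score_gain(probs, truths, base_rate):
    """Hypothetical alternative: mean log-score improvement over a predictor
    that always reports the base rate. probs[i] is the stated P(true)."""
    def log_score(p, t):
        return math.log(p if t else 1.0 - p)
    gains = [log_score(p, t) - log_score(base_rate, t) for p, t in zip(probs, truths)]
    return sum(gains) / len(gains)

# Made-up task batch: most proposed interventions do not work.
truths = [False] * 9 + [True]

always_no = [False] * 10              # "whatever they ask you, say it won't work"
print(accuracy(always_no, truths))    # high accuracy with no real insight

base_rate_probs = [0.1] * 10          # the same blanket advice, as probabilities
informed_probs = [0.02] * 9 + [0.8]   # an operator who was actually helped
print(log_score_gain(base_rate_probs, truths, 0.1))
print(log_score_gain(informed_probs, truths, 0.1))
```

Under this assumed rule, the blanket answer earns nothing and only task-specific information raises the reward.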
I haven’t thought about this in a while, and I was honestly worse at thinking about alignment at that point in time, so I don’t mean to convey much confidence that this approach works out. What I do think it shows, alongside the idea I came up with in the conversation with Wei, linked above, is that this general approach is powerful and amenable to improvement in ways that render it even more useful.
* A more recent thought: as described, “oracle” is not the right word for this setup. It would respond to “What approaches might work for curing cancer?” with “Doesn’t matter. There are more gaps in your knowledge regarding economics. A few principles to keep in mind…” However, if the prediction task distribution were conditioned in some way on the question asked, one might be able to make it more likely that the “oracle” answers the question, rather than just spewing unrelated insight.

michaelcohen 19 Feb 2021 11:38 UTC
LW: 6 AF: 4
AF
in reply to: Rohin Shah’s comment on: Formal Solution to the Inner Alignment Problem
In this case, at every timestep we take the N most probable models, and only take an action a with probability p if **every** one of the N models takes that action with at least probability p.
This is so much clearer than I’ve ever put it.
(There’s a specific rule that ensures that N decreases over time.)
N won’t necessarily decrease over time, but all of the models will eventually agree with each other.
monitor the performance of your system online, and train to correct any problems
I would have described Vanessa’s and my approaches as more about monitoring uncertainty, and avoiding problems before the fact rather than correcting them afterward. But I think what you said stands too.
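The rule quoted at the top of this comment can be sketched as follows. This is one reading of that sentence, not the paper’s exact algorithm: each action gets the minimum probability that any of the top N models assigns to it, and the leftover probability mass is assumed to go to querying the human demonstrator. The function name and the example policies are hypothetical.

```python
def consensus_policy(model_action_probs):
    """model_action_probs: one dict per top-N model, mapping action -> probability.

    Returns (action_probs, query_prob): each action is taken with the minimum
    probability any model gives it; the remainder is the probability of
    deferring to the demonstrator (an assumption of this sketch).
    """
    actions = set().union(*model_action_probs)
    agreed = {a: min(m.get(a, 0.0) for m in model_action_probs) for a in actions}
    query_prob = max(0.0, 1.0 - sum(agreed.values()))
    return agreed, query_prob

# Three models that roughly agree: most of the mass is acted on directly.
models = [
    {"left": 0.7, "right": 0.3},
    {"left": 0.6, "right": 0.4},
    {"left": 0.9, "right": 0.1},
]
agreed, query_prob = consensus_policy(models)
print(agreed, query_prob)

# One dissenting model cannot force its own action; it can only trigger a query.
dissent = [{"brake": 1.0}, {"brake": 1.0}, {"swerve": 1.0}]
print(consensus_policy(dissent))
```

As the models come to agree with each other, the minimum approaches each model’s own probability and the query probability shrinks toward zero.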

michaelcohen 3 Jul 2020 21:30 UTC
LW: 6 AF: 4
AF
in reply to: paulfchristiano’s comment on: The “AI Debate” Debate
The purpose of research now is to understand the landscape of plausible alignment approaches, and from that perspective viability is as important as safety.
Point taken.
I think it is unlikely for a scheme like debate to be safe without being approximately competitive
The way I map these concepts, this feels like an elision to me. I understand what you’re saying, but I would like to have a term for “this AI isn’t trying to kill me”, and I think “safe” is a good one. That’s the relevant sense of “safe” when I say “if it’s safe, we can try it out and tinker”. So maybe we can recruit another word to describe an AI that is both safe itself and able to protect us from other agents.
use those answers [from Debate] to ensure … that the overall system can be stable to malicious perturbations
Is “overall system” still referring to the malicious agent, or to Debate itself? If it’s referring to Debate, I assume you’re talking about malicious perturbations from within rather than malicious perturbations from the outside world?
If your honest answers aren’t competitive, then you can’t do that and your situation isn’t qualitatively different from a human trying to directly supervise a much smarter AI.
You’re saying that if we don’t get useful answers out of Debate, we can’t use the system to prevent malicious AI, and so we’d have to just try to supervise nascent malicious AI directly? I certainly don’t dispute that if we don’t get useful answers out of Debate, Debate won’t help us solve X, including when X is “nip malicious AI in the bud”.
It certainly wouldn’t hurt to know in advance whether Debate is competitive enough, but if it really isn’t dangerous itself, then I think we’re unlikely to become so pessimistic about the prospects of Debate, through our arguments and our proxy experiments, that we don’t even bother trying it out, so it doesn’t seem especially decision-relevant to figure it out for sure in advance. But again, I take your earlier point that a better understanding of the landscape is always going to have some worth.
if your AI could easily kill you in order to win a debate, probably someone else’s AI has already killed you
This argument seems to prove too much. Are you saying that if society has learned how to do artificial induction at a superhuman level, then by the time we give a safe planner that induction subroutine, someone will have already given that induction routine to an unsafe planner? If so, what hope is there as prediction algorithms relentlessly improve? In my view, the whole point of AGI Safety research is to try to come up with ways to use powerful-enough-to-kill-you artificial induction in a way that it doesn’t kill you (and helps you achieve your other goals). But it seems you’re saying that there is a certain level of ingenuity where malicious agents will probably act with that level of ingenuity before benign agents do.
That is, safety separate from competitiveness mostly matters in scenarios where you have very large leads / very rapid takeoffs
It seems fairly likely to me that the next best AGI project behind Deepmind, OpenAI, the USA, and China is way behind the best of those. I would think people in those projects would have months at least before some dark horse catches up.
So competitiveness still matters somewhat, but here’s a potential disagreement we might have: I think we will probably have at least a few months, and maybe more than a year, where the top one or two teams have AGI (powerful enough to kill everyone if let loose), and nobody else has anything more valuable than an Amazon Mechanical Turk worker. [Edit: “valuable” is the wrong word. I guess I mean better at killing.]
For example, it seems to me you need competitiveness for any of the plausible approaches for avoiding deceptive alignment (since they require having an aligned overseer who can understand what a treacherous agent is doing)
Do you think something like IDA is the only plausible approach to alignment? If so, I hadn’t realized that, and I’d be curious to hear more arguments, or just intuitions are fine. The aligned overseer you describe is supposed to make treachery impossible by recognizing it, so it seems your concern is equivalent to the concern: “any agent (we make) that learns to act will be treacherous if treachery is possible.” Are all learning agents fundamentally out to get you? I suppose that’s a live possibility to me, but it seems to me there is a possibility we could design an agent that is not inclined to treachery, even if the treachery wouldn’t be recognized.
Edit: even so, having two internal components that are competitive with each other (e.g. overseer and overseee) does not require competitiveness with other projects.
More generally, trying to maintain a totally sanitized internal environment seems a lot harder than trying to maintain a competitive internal environment where misaligned agents won’t be at a competitive advantage.
I don’t understand the dichotomy here. Are you talking about the problem of how to make it hard for a debater to take over the world within the course of a debate? Or are you talking about the problem of how to make it hard for a debater to mislead the moderator? The solutions to those problems might be different, so maybe we can separate the concept “misaligned” into “ambitious” and/or “deceitful”, to make it easier to talk about the possibility of separate solutions.

michaelcohen 10 May 2019 0:23 UTC
LW: 6 AF: 3
AF
in reply to: RyanCarey’s comment on: Not Deceiving the Evaluator
Yep.

michaelcohen 29 Mar 2019 0:32 UTC
LW: 6 AF: 2
AF
on: The Main Sources of AI Risk?
3. Misspecified or incorrectly learned goals/values
I think this phrasing misplaces the likely failure modes. An example that comes to mind from this phrasing is that we mean to maximize conscious flourishing, but we accidentally maximize dopamine in large brains.
Of course, this example includes an agent intervening in the provision of its own reward, but since that seems like the paradigmatic example here, maybe the language could better reflect that, or maybe this could be split into two.
The single technical problem that appears biggest to me is that we don’t know how to align an agent with any goal. If we had an indestructible magic box that printed a number to a screen corresponding to the true amount of Good in the world, we still don’t know how to design an agent that maximizes that number (instead of taking over the world, and tampering with the cameras that are aimed at the screen/the optical character recognition program used to decipher the image). This problem seems to me like the single most fundamental source of AI risk. Is 3 meant to include this?

michaelcohen 6 Nov 2022 10:17 UTC
LW: 5 AF: 4
AF
in reply to: mwacksen’s comment on: AI X-risk >35% mostly based on a recent peer-reviewed argument
Care to bet on the results of a survey of academic computer scientists? If the stakes are high enough, I could try to make it happen.
“As a reviewer, I only recommend for acceptance papers that appear to be both valid and interesting.”
Strongly agree - … - Strongly Disagree
“As a reviewer, I would sooner recommend for acceptance a paper that was valid, but not incredibly interesting, than a paper that was interesting, but the conclusions weren’t fully supported by the analysis.”
Strongly agree - … - Strongly Disagree

michaelcohen 24 May 2021 12:53 UTC
LW: 5 AF: 3
AF
on: Finite Factored Sets
I was thinking of some terminology that might make it easier to think about factoring and histories and whatnot.
A partition can be thought of as a (multiple-choice) question. Like for a set of words, you could have the partition corresponding to the question “Which letter does the word start with?” and then the partition groups together elements with the same answer.
Then a factoring is a set of questions, where the set of answers will uniquely identify an element. The word that comes to mind for me is “signature”, where an element’s signature is the set of answers to the given set of questions.
For the history of a partition X, X can be thought of as a question, and the history is the subset of questions in the factoring that you need the answers to in order to determine the answer to question X.
And then two questions X and Y are orthogonal if there aren’t any questions in the factoring whose answers you need both for answering X and for answering Y.
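To make the terminology concrete, here is a minimal brute-force sketch in Python. The toy set of three-character strings, the question names (`pos0`, `pos1`, `pos2`) and all helper functions are made up for illustration; the check in `is_factoring` also requires that every combination of answers is realized, which I believe matches the formal definition but goes slightly beyond the wording above.

```python
from itertools import combinations, product
from math import prod

# Hypothetical toy set: strings with one character from each of "ab", "xy", "12".
elements = ["".join(t) for t in product("ab", "xy", "12")]

# A factoring as a set of named questions (each question induces a partition).
factoring = {
    "pos0": lambda w: w[0],
    "pos1": lambda w: w[1],
    "pos2": lambda w: w[2],
}

def signature(element, factoring):
    """The element's answers to every question in the factoring."""
    return tuple(factoring[name](element) for name in sorted(factoring))

def is_factoring(factoring, elements):
    """Signatures are unique, and every combination of answers occurs."""
    sigs = {signature(e, factoring) for e in elements}
    n_combos = prod(len({q(e) for e in elements}) for q in factoring.values())
    return len(sigs) == len(elements) == n_combos

def determines(names, target, factoring, elements):
    """True if the answers to the questions in `names` fix the answer to `target`."""
    seen = {}
    for e in elements:
        key = tuple(factoring[n](e) for n in names)
        if seen.setdefault(key, target(e)) != target(e):
            return False
    return True

def history(target, factoring, elements):
    """Smallest subset of the factoring whose answers determine `target`."""
    names = sorted(factoring)
    for size in range(len(names) + 1):
        for subset in combinations(names, size):
            if determines(subset, target, factoring, elements):
                return set(subset)

def orthogonal(x, y, factoring, elements):
    """X and Y are orthogonal if their histories share no question."""
    return not (history(x, factoring, elements) & history(y, factoring, elements))

# Example questions about a word.
starts_with_a = lambda w: w[0] == "a"
ends_with_x1 = lambda w: w[1:] == "x1"
starts_with_ax = lambda w: w[:2] == "ax"
```

Here `starts_with_a` only needs `pos0` and `ends_with_x1` needs `pos1` and `pos2`, so the two are orthogonal, while `starts_with_a` and `starts_with_ax` share `pos0` and are not.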

michaelcohen 19 Feb 2021 12:01 UTC
LW: 5 AF: 5
AF
in reply to: evhub’s comment on: Formal Solution to the Inner Alignment Problem
If the inner alignment problem did not exist for perfect Bayesians, but did exist for neural networks, then it would appear to be a regime where more intelligence makes the problem go away. If the inner alignment problem were ~solved for perfect Bayesians, but unsolved for neural networks, I think there’s still some of the flavor of that regime, but we do have to be pretty careful to make sure we’re applying the same sort of solution to the non-Bayesian algorithms. I think in Vanessa’s comment above, she’s suggesting this looks doable.
Note the method here of avoiding mesa-optimizers: error bounds. Neural networks don’t have those. Naturally, one way to make mesa-optimizer-deceptively-selected-errors go away is just to have better learning algorithms that make errors go away. Algorithms like Gated Linear Networks with proper error bounds may be a safer building block for AGI. But none of this takes away from the fact that it is potentially important to figure out how to avoid mesa-optimization in neural networks, and I would add to your claim that this is a much harder setting; I would say it’s a harder setting because of the non-existence of error bounds.

michaelcohen 1 Sep 2020 12:36 UTC
LW: 5 AF: 3
AF
on: Introduction To The Infra-Bayesianism Sequence
Looks like we’ve been thinking along very similar lines! https://www.lesswrong.com/posts/RzAmPDNciirWKdtc7/pessimism-about-unknown-unknowns-inspires-conservatism

michaelcohen 27 May 2020 19:16 UTC
5 points
in reply to: Dagon’s comment on: Predicted Land Value Tax: a better tax than an unimproved land value tax
Why land? This would seem to apply to any transferable asset.
This could work for other assets where
- each asset has a natural peer group (in this case, other properties in the neighborhood) from which to predict the value; or the value can’t change so you can just use the market price of the asset itself
- it’s hard to hide the asset
- the asset can’t be imported/exported, or you don’t care if your country loses this asset. For diamonds, needlessly distortionary but not a disaster; for car manufacturing equipment, very bad.
ETA from Wei Dai:
- there are no untaxed substitutes for the asset

michaelcohen 16 Sep 2019 19:48 UTC
LW: 5 AF: 3
AF
on: Reversible changes: consider a bucket of water
I think for most utility functions, kicking over the bucket and then recreating a bucket with identical salt content (but different atoms) gets you back to a similar value to what you were at before. If recreating that salt mixture is expensive vs. cheap, and if attainable utility preservation works exactly as our initial intuitions might suggest (and I’m very unsure about that, but supposing it does work in the intuitive way), then AUP should be more likely to avoid disturbing the expensive salt mixture, and less likely to avoid disturbing the cheap salt mixture. That’s because for those utility functions for which the contents of the bucket were instrumentally useful, the value with respect to those utility functions goes down roughly by the cost of recreating the bucket’s contents. Also, if a certain salt mixture is less economically useful, there will be fewer utility functions for which kicking over the bucket leads to a loss in value, so if AUP works intuitively, it should also agree with our intuition there.
If it’s true that for most utility functions, the particular collection of atoms doesn’t matter, then it seems to me like AUP manages to assign a higher penalty to the actions that we would agree are more impactful, all without any information regarding human preferences.
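A toy calculation of that intuition, as a sketch only. It assumes the penalty is the average absolute change in attainable value across a set of auxiliary utility functions, and that each utility function that found the bucket's contents instrumentally useful loses exactly the recreation cost while the rest lose nothing. The function names and all numbers are hypothetical, and this is not meant as the actual AUP formulation.

```python
def bucket_value_changes(n_utilities, n_caring, recreation_cost):
    """Change in attainable value for each auxiliary utility function.

    Assumption: utility functions that found the bucket's contents useful
    lose about the cost of recreating them; the others lose nothing.
    """
    return [-recreation_cost] * n_caring + [0.0] * (n_utilities - n_caring)

def toy_penalty(value_changes):
    """Average absolute change in attainable value (toy stand-in for the penalty)."""
    return sum(abs(dv) for dv in value_changes) / len(value_changes)

# Hypothetical scenarios for kicking over the bucket.
expensive_useful = toy_penalty(bucket_value_changes(100, 40, 5.0))
cheap_useful = toy_penalty(bucket_value_changes(100, 40, 0.5))
expensive_rarely_useful = toy_penalty(bucket_value_changes(100, 10, 5.0))
```

Under these assumptions the penalty is just (fraction of utility functions that care) × (recreation cost), so it falls both when the mixture is cheaper to recreate and when fewer utility functions find it useful.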

michaelcohen 23 Jul 2019 19:14 UTC
LW: 5 AF: 2
AF
on: Not Deceiving the Evaluator
Ok I finally identified an incentive for deception. I think it was difficult for me to find because it’s not really about deceiving the evaluator.
Here’s a hypothesis that observations will never refute: the utility which the evaluator assigns to a state is equal to the reward that a human would provide if it were a human that controlled the provision of reward (instead of the evaluator). Under this hypothesis, maximizing evaluator-utility is identical to creating observations which will convince a human to provide high reward (a task which entails deception when done optimally). In a sense, the AI doesn’t think it’s deceiving the evaluator; it thinks the evaluator fully understands what’s going on and likes seeing things that would confuse a human into providing high reward, as if the evaluator is “in on the joke”. One of my take-aways here is that some of the conceptual framing I did got in the way of identifying a failure mode.

michaelcohen 1 May 2019 0:12 UTC
LW: 5 AF: 3
AF
in reply to: Wei Dai’s comment on: Strategic implications of AIs’ ability to coordinate at low cost, for example by merging
One utility function might turn out much easier to optimize than the other, in which case the harder-to-optimize one will be ignored completely. Random events might influence which utility function is harder to optimize, so one can’t necessarily tune $λ$ in advance to try to take this into account.
One of the reasons was the problem of positive affine scaling preserving behavior, but I see Stuart addresses that.
And actually, some of the reasons for thinking there would be more complicated mixing are going away as I think about it more.
EDIT: yeah if they had the same priors and did unbounded reasoning, I wouldn’t be surprised anymore if there exists a $λ$ that they would agree to.
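For reference, the affine-scaling point written out, assuming $λ$ is the weight in a linear mixture of two utility functions $U_1$ and $U_2$ (the symbols $U_1$, $U_2$, $a$, $b$ and $\pi$ are my notation here). For any $a > 0$ and any $b$,

$$\arg\max_{\pi} \mathbb{E}_{\pi}[\,a U_1 + b\,] = \arg\max_{\pi} \mathbb{E}_{\pi}[\,U_1\,],$$

so $U_1$ and $a U_1 + b$ describe the same agent. The mixture does not share this invariance:

$$\lambda\,(a U_1 + b) + (1 - \lambda)\,U_2 = \lambda a\, U_1 + (1 - \lambda)\, U_2 + \lambda b,$$

which weights $U_1$ against $U_2$ in the ratio $\lambda a : (1 - \lambda)$ rather than $\lambda : (1 - \lambda)$. A given $λ$ therefore only means something once a scale has been fixed for each utility function.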