# rohinmshah (Rohin Shah)

Karma: 8,263

PhD student at the Center for Human-Compatible AI. Creator of the Alignment Newsletter. http://rohinshah.com/

• Intuitively, the reason that our biases are biases and not a different reward function is because:

1. I would be happy to get rid of my biases, in that I would accept a well-designed self-modification that removed my biases. (“well-designed” is hiding a lot of complexity, but the point is just that such a self-modification exists.)

2. The bias applies across a variety of different scenarios with very different reward functions.

The first point suggests that you require that the human values are reflectively stable. In particular, if the human would choose action a in state s under the (m, R) pair, then they should also say that they would choose action a if they were in state s even when you explain what the consequences of action a will be. This is not a good solution—when people’s speech and people’s behavior disagree, it’s certainly possible that the behavior actually reflects their values and not the speech—but something along these lines seems important.

I’m more interested in the second point though. Let’s consider the setting where you have n different tasks for which you observe the human policy. After running IRL, you have a single rationality model M and multiple rewards R_1 … R_n. Intuitively, the rationality model M is better if the expected complexity of the inferred reward for a new unseen task is lower. That is, if you sample a new task T_{n+1} from the distribution of tasks and run IRL to estimate R_{n+1} using the existing learned rationality model M, you can expect that R_{n+1} will be simple.

What I’m trying to get at here is that the correct rationality model has a lot more explanatory power. Kolmogorov complexity doesn’t really capture that.
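One way to write down the criterion above, as a rough sketch rather than a settled definition. The symbols $\mathcal{D}$ (the task distribution), $\pi_{n+1}$ (the observed human policy on the new task), $C$ (some complexity measure on rewards), and $\mathrm{IRL}_M$ (IRL run with the rationality model held fixed at $M$) are notation assumed here for illustration; none of them is defined in the comment itself.

```latex
\mathrm{score}(M) \;=\; \mathbb{E}_{T_{n+1} \sim \mathcal{D}}\!\left[\, C\!\left(R_{n+1}\right) \right],
\qquad R_{n+1} \;=\; \mathrm{IRL}_M\!\left(\pi_{n+1}\right)
```

Under this reading, a rationality model with a lower score is preferred: the score asks how simple the inferred rewards come out on unseen tasks, not how short the description of M itself is.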

Under this definition, it seems likely that you could only get (M(0), R(0)) and (M(4), R(4)), at least out of the compatible pairs you suggested. To break this last tie, perhaps you could add in an assumption that humans are closer to rational than anti-rational on simple tasks.

I do agree that in the fully general case where we observe the full policy for all of human behavior and want to determine all of human values, things get murkier. Some possible answers in this scenario:

• We put a strong prior on humans making plans hierarchically. This could bring us back to the case where we have multiple tasks.

• Assume humans are optimal given constraints on their resources (that is, bounded rationality). Then, we only need to infer a reward function and not a rationality model. It is far from obvious that this is anywhere close to accurate as a model of humans, but it seems plausible enough to warrant investigation.

Both of these answers feel very unsatisfying to me though—they feel like hacks that don’t model reality perfectly.

Side note: How do I set my username? I logged in with Facebook and it never asked me for my name (Rohin Shah) and now I’m just “user 264”.

• To the point of peer review, many AI safety researchers already get peer review by circulating their drafts to other researchers.

It seems to me that this is only a good use of your time if the journal became respectable. (Otherwise you barely increase visibility of the field, no one will care about publishing in the journal, and it doesn’t help academics’ careers much.) There can even be a negative effect where AI safety is perceived as “that fringe field that publishes in <journal>”, which makes AI researchers more reluctant to work on safety.

I don’t know how a journal becomes respectable but I would expect that it’s hard and takes a lot of work (and probably luck), and would want to see a good plan for how the journal will become respectable before I’d be excited to see this happen. I would guess that this wouldn’t be doable without the effort of a senior AI/ML researcher.

• I’ve noticed that the larger font annoys me enough that I just scroll past it looking for more reasonably-sized fonts, leading to the exact opposite of the desired effect :/

• I think everyone (including me) would go crazy from solitude in this scenario, so that puts the number at 0. If you guarantee psychological stability somehow, I think most adults (~90% perhaps) would be good at achieving their goals (which may be things like “authoritarian regime forever”). This is pretty dependent on the humans becoming more intelligent—if they just thought faster I wouldn’t be nearly as optimistic, though I’d still put the number above 0.

# The Alignment Newsletter #1: 04/09/18

9 Apr 2018 16:00 UTC
11 points
• I agree that an eternal authoritarian regime is pretty catastrophic.

I don’t think that a human in this scenario would be pursuing what they currently consider their goals. I think they would think more, learn more, and eventually settle on a different set of goals. (Maybe initially they pursue their current goals, but these change over time.) But it’s an open question to me whether the final set of goals they settle upon is actually reasonably aligned with “humanity’s goals”; it may or may not be. So it could be catastrophic to amplify a current human in this way, from the perspective of humanity. But it would not be catastrophic to the human that you amplified. (I think you disagree with the last statement, though maybe I’m wrong about that.)

• Yeah, I think I agree with this. You are still able to get peer review from the people you work with, if you work at an organization, but it is preferable to get more varied feedback, and some people may not work at an organization.

• Sorry for the super late response, I only just discovered notifications.

In the cases where the things you are trying out are meta-type things that affect other people, I think it’s worth trying things _well_ even if they have a low chance of success, but quite costly to try them in an okayish way when they have a low chance of success.

One major downside of trying new things is that it makes future attempts to do the same thing less likely to work (because people are less enthusiastic about it and expect it to fail, or you get a proliferation of the new things and half of the people are on one of them and half are on the other and you lose out on network effects and economies of scale). This means that when you try new things, especially ones that make asks of other people, you want to put a _lot_ of effort into getting it right quickly. If you do the 20% effort version and that fails, maybe before you had done this the 90% effort version would have succeeded but now it simply can’t be done, and you’ve lost that value entirely. Whereas if you do the 90% effort version from the start and it fails, you can be reasonably confident that it was just not doable.

In this particular case, there’s also an object-level downside in the case of failure, namely that AI safety is thought of as “that fringe group that publishes in <journal>”.

9 Apr 2018 21:16 UTC
29 points
• Yeah, I think that’s where we disagree. I think that humans are likely to achieve their values-on-reflection; I just don’t know what a human’s “values-on-reflection” would actually be (e.g. it could be that they want an authoritarian regime with them in charge).

It’s also possible that we have different concepts of values-on-reflection. Eg. maybe you mean that I have found my values-on-reflection only if I’ve cleared out all epistemic pits somehow and then thought for a long time with the explicit goal of figuring out what I value, whereas I would use a looser criterion. (I’m not sure what exactly.)

• Since people seem to be finding it useful, I just updated the archive with public versions of the 5 emails I wrote for CHAI summarizing ~2 months of content.

• It seems like an underlying assumption of this post is that any useful safety property like “corrigibility” must be about outcomes of an AI acting in the world, whereas my understanding of (Paul’s version of) corrigibility is that it is also about the motivations underlying the AI’s actions. It’s certainly true that we don’t have a good definition of what an AI’s “motivation” is, and we don’t have a good way of testing whether the AI has “bad motivations”, but this seems like a tractable problem? In addition, maybe we can make claims of the form “this training procedure motivates the AI to help us and not manipulate us”.

I think of corrigibility as “wanting to help humans” (see here) plus some requirements on the capability of the AI (for example, it “knows” that a good way to help humans is to help them understand its true reasoning, and it “knows” that it could be wrong about what humans value). In the “teach me about charities” example, I think basically any of the behaviors you describe are corrigible, if the AI has no ulterior motive behind them. For example, trying to convince the billionaire to focus on administrative costs because then it would be easier for the AI to evaluate which charities are good or not is incorrigible. However, convincing the billionaire to focus on administrative costs because the AI has noticed that the billionaire is very frugal would be corrigible. (Though ideally the AI would mention all of the options that it sees the billionaire being convinced by, and then ask the billionaire for input on which method of convincing him he would endorse.) I agree that testing corrigibility in such a scenario is hard (though I like Paul’s comment above as an idea for that), but it seems like we can train an agent in such a way that the optimization will knowably (i.e. with high but not proof-level confidence) create an AI that is corrigible.

# The Alignment Newsletter #2: 04/16/18

16 Apr 2018 16:00 UTC
8 points

# The Alignment Newsletter #3: 04/23/18

23 Apr 2018 16:00 UTC
9 points

# The Alignment Newsletter #4: 04/30/18

30 Apr 2018 16:00 UTC
8 points
• If you complete the track in its entirety, you should be ready to understand most of the work in AI Safety.

Is this specific to MIRI or would it also include the work done by OpenAI, DeepMind, FHI, and CHAI? I don’t see any resources on machine learning currently but perhaps you intend to add those later.

• No headaches. I probably could play high strategy board games all day and enjoy myself the entire time (I know this extends at least to ~6 continuous hours). I do get tired from other kinds of thinking (reading papers, research) but it doesn’t always happen so there’s another cause that I don’t know yet. If I try to soldier through, I feel uncomfortable and find it incredibly difficult to focus, but it doesn’t cause pain.

• There are a lot of resources on logic, computability and complexity theory, provability, etc. which underlie Agent Foundations, but not much on how to synthesize them all into Agent Foundations. Similarly there are a lot of resources on deep learning, reinforcement learning, etc. but not much on how to use them to answer important safety questions. In both Agent Foundations and safety-oriented ML, it seems to me that the hard part with no good resources yet is how to figure out what the right question to ask is. So I’m not sure which one would be easier to teach. (Though RAISE seems to be targeting the underlying background knowledge, not the “figure out what to ask” question.)

I do agree with the do-one-thing-well intuition.

# The Alignment Newsletter #5: 05/07/18

7 May 2018 16:00 UTC
8 points
• Plausibly an undergraduate math degree would be enough? Agent Foundations is often over my head, but it’s only a little over my head—I definitely have the sense that I could understand the background with not much effort.

Fwiw I also think an undergraduate CS degree is enough to get a good background for safety-oriented ML, as long as you make sure to focus on ML.

The use of the graduate degree is more in training you to do research than to understand any particular background knowledge well.

(Also I’m not claiming that RAISE is not filling a niche—it seems very plausible to me that there are people who would like to work on Agent Foundations who are currently working professionals and not at college, and for them a curated set of resources would be valuable.)