I reason as follows:
Omega inspires belief only after the agent encounters Omega.
According to UDT, the agent should not update its policy based on this encounter; it should simply follow the policy it already committed to.
Thus the agent should act according to whatever the best policy is, according to its original (e.g. universal) prior from before it encountered Omega (or indeed learned anything about the world).
I think either:
the agent does update, in which case, why not update on the result of the coin-flip? or
the agent doesn’t update, in which case, what matters is simply the optimal policy given the original prior.
Abstractly, I think of this as adding a utility node, U, with no parents, and having the agent try to maximize the expected value of U.
I think there are some implicit assumptions (which seem reasonable for many situations, prima facie) about the agent's ability to learn about U via some observations when taking null actions (i.e., A and U share some descendant(s), D, and the agent knows something about P(D | U, A=null)).
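As a minimal sketch of that setup (all the concrete numbers and distributions here are my own illustrative assumptions, not from the comment): U is a root node, the agent takes the null action, and it learns about U only through noisy observations of a shared descendant D, via Bayesian updating on P(D | U, A=null).

```python
# Hypothetical sketch: a utility node U with no parents, whose value the
# agent can only learn about via a shared descendant D, observed while
# taking the null action.
import random

random.seed(0)

U_VALUES = [0.0, 1.0]   # possible values of the utility parameter U
u_true = 1.0            # the actual (hidden) value of U

def p_d_given_u(d, u, action):
    """Assumed likelihood P(D | U, A): with A=null, D is a noisy reading of U."""
    assert action == "null"
    return 0.8 if d == u else 0.2

# uniform prior over U
belief = {u: 0.5 for u in U_VALUES}

# observe a few descendants D under the null action and update by Bayes' rule
for _ in range(10):
    d = u_true if random.random() < 0.8 else 1.0 - u_true
    belief = {u: belief[u] * p_d_given_u(d, u, "null") for u in U_VALUES}
    z = sum(belief.values())
    belief = {u: p / z for u, p in belief.items()}

print(belief)  # belief concentrates on u_true
```

Under these assumptions the agent's posterior collapses toward the true value of U without the agent ever acting, which is the sense in which null actions alone suffice for learning.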
RE: the last bit, it seems like you can define learning from manipulating in a straightforward way similar to what is proposed here. The intuition is that the human's belief about U should be collapsing around a point, u* (in the absence of interference by the AI), and the AI helps learning if it accelerates this process. If this is literally true, then we can just say that learning is accelerated (at timestep t) if the probability H assigns to u* is higher given an agent's action a than it would be given the null action, i.e.
P_H_t(u* | A_0 = a) > P_H_t(u* | A_0 = A_1 = … = null).
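The criterion above can be sketched concretely (the belief-update model, noise levels, and function names here are my own assumptions for illustration): the human H does Bayesian updates on observations whose informativeness depends on the AI's first action A_0, and the action a "accelerates learning" at timestep t if it raises H's posterior on u* relative to the all-null policy.

```python
# Hypothetical sketch of the acceleration criterion. H's posterior on u*
# after t observations is modeled deterministically via the expected
# likelihood ratio; "noise" is the per-observation error rate, which the
# AI's action a is assumed to reduce (e.g. by demonstrating or clarifying).

def posterior_on_u_star(noise, t, prior=0.5):
    """H's posterior P_H_t(u*) after t observations, each correct w.p. 1 - noise."""
    p_u, p_not = prior, 1.0 - prior
    for _ in range(t):
        # deterministic expected update: weight u* by the chance of a
        # correct observation, the alternative by the chance of an error
        p_u *= (1.0 - noise)
        p_not *= noise
    return p_u / (p_u + p_not)

t = 5
# A_0 = a: the AI's action makes H's observations less noisy
p_given_a = posterior_on_u_star(noise=0.1, t=t)
# A_0 = A_1 = ... = null: H learns on its own, with noisier observations
p_given_null = posterior_on_u_star(noise=0.3, t=t)

# the criterion: P_H_t(u* | A_0 = a) > P_H_t(u* | A_0 = A_1 = ... = null)
print(p_given_a > p_given_null)  # True under these assumptions
```

Note this only checks the inequality at a single timestep t; under the "literally collapsing" assumption one would want it to hold for the relevant t (or all t) rather than at one point.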