IAFF-User-111

Karma: 16

IAFF-User-111 2 Oct 2015 3:01 UTC
LW: 2 AF: 1
AF
in reply to: jessicata’s comment on: Proposal: Modeling goal stability in machine learning
So the big problem I see with this is that it is still in the optimization framework, assuming that we actually want to optimize the initial criterion. While we can imagine changing the initial criterion, this is already something we can effectively do with RL if we specify our reward to be something communicated by a human overseer (but of course that doesn’t really solve the problem...).

The proposal is reminiscent of the Actor-Critic framework from RL (analogy: actor—model, critic—criterion), which learns a policy (the actor) and a value function (the critic) simultaneously.

In that case, you have the true reward function playing the role of the initial criterion, so you don’t actually get to evaluate the true criterion (which would be something like distance from the optimal policy), you get what amounts to noisy samples of it. The goal in both cases is to learn a good model (i.e. policy, for Actor-Critic).

I think there is a conceptual issue with this proposal as it stands: the interplay between the changes in the model and the criterion is not taken into account. E.g. there is no guarantee that recursively applying F to the initial_model using the criteria output by X would give you anything like the model output by X.

The cool thing about Actor-Critic is that you can prove (under suitable assumptions) that this method actually gives you an unbiased estimate of the true policy gradient (Sutton 99: https://webdocs.cs.ualberta.ca/~sutton/papers/SMSM-NIPS99.pdf). IIRC, it requires the assumption that the critic is trained to convergence in-between each update of the actor, though.
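To make the actor/critic split concrete, here is a minimal sketch, assuming a single-state problem (a two-armed Bernoulli bandit) that I made up for illustration. It is not the compatible-function-approximation construction from the Sutton paper, and the critic is updated once per step rather than trained to convergence; it only shows the actor (softmax preferences) being updated from the critic's (a scalar value estimate's) error signal. All names and hyperparameters are hypothetical.

```python
import math
import random


def softmax(prefs):
    """Turn action preferences into a probability distribution."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def train_actor_critic(success_probs, steps=20000, actor_lr=0.05,
                       critic_lr=0.1, seed=0):
    """One-step actor-critic on a Bernoulli bandit (illustrative only)."""
    rng = random.Random(seed)
    prefs = [0.0] * len(success_probs)  # actor: softmax policy parameters
    value = 0.0                         # critic: estimate of expected reward
    for _ in range(steps):
        probs = softmax(prefs)
        action = rng.choices(range(len(prefs)), weights=probs)[0]
        # Noisy sample of the "true criterion": reward is 1 with the arm's probability.
        reward = 1.0 if rng.random() < success_probs[action] else 0.0
        advantage = reward - value      # critic's error signal
        value += critic_lr * advantage  # critic update
        for a in range(len(prefs)):
            # Gradient of log pi(action) with respect to preference a.
            grad_log_pi = (1.0 if a == action else 0.0) - probs[a]
            prefs[a] += actor_lr * advantage * grad_log_pi  # actor update
    return softmax(prefs), value


policy, value = train_actor_critic([0.8, 0.2])
```

In this toy run the policy should end up putting most of its mass on the first arm, and `value` should track the expected reward under the current policy.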

UDT from an RL perspective

IAFF-User-111 17 Dec 2015 23:48 UTC

0 points

0 comments · 1 min read · LW link

(drive.google.com)

Some work on connecting UDT and Reinforcement Learning

IAFF-User-111 17 Dec 2015 23:58 UTC

4 points

5 comments · 1 min read · LW link

(drive.google.com)

IAFF-User-111 27 Jan 2016 23:53 UTC
0 points
AF
in reply to: paulfchristiano’s comment on: Some work on connecting UDT and Reinforcement Learning
I don’t understand why you say:
1. it “seems to require a richer model than we usually use in [RL]”.
2. “This seems to happen in your setting.”
3. Are you suggesting that a model as I’ve defined it is not satisfactory/sufficient for some reason?
4. Can you elaborate a bit?

IAFF-User-111 28 Jan 2016 0:06 UTC
0 points
AF
on: Attempting to refine “maximization” with 3 new -izers
Skimmed it.

It would be helpful to define “stopping point” and “stopping distance”.

Wrt local optima:

Deep Neural Nets were historically thought to suffer from local optima. Recently, this viewpoint has been challenged; see, e.g. “The Loss Surfaces of Multilayer Networks” http://arxiv.org/abs/1412.0233 and references.

Although the issue remains unclear, I currently suspect that local optima are not a practical obstacle for an (omniscient) hill-climber in the real world.

I wasn’t convinced overall by the statement about tiling (or not). I think you should give more detailed arguments about why you do or don’t expect these agents to tile, and explain the set-up a bit more, too: are you imagining agents that take a single action, based on their current policy, to adopt a new policy, which is then not subject to further modification? Or how can you ensure that agents do not modify their policy in such a way that policy_new encourages further modifications which can compound?

IAFF-User-111 28 Jan 2016 0:31 UTC
0 points
AF
on: New(ish) AI control ideas
Thanks! I love having central repos.

A quick question / comment, RE: “I decided to try and attack as many of these ideas as I could, head on, and see if there was any way of turning these objections.”

Q: What do you mean (or have in mind) in terms of “turning [...] objections”? I’m not very familiar with the phrase.

Comment: One trend I see is that technical safety proposals are often dismissed by appealing to one of the 7 responses you’ve given. Recently I’ve been thinking that we should be a bit less focused on finding airtight solutions, and more focused on thinking about which proposed techniques could be applied in various scenarios to significantly reduce risk. For example, boxing an agent (e.g. by limiting its sensors/actuators) might significantly increase how long it takes to escape.

IAFF-User-111 29 Jan 2016 6:09 UTC
0 points
AF
on: Resource gathering agent
Doesn’t seem workable to me: being “completely ignorant” suggests an improper prior. An agent with a proper prior over its utility function can integrate over it and maximize expected utility, and which action maximizes expected utility will depend on this prior.
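A minimal sketch of the point about priors, assuming a finite set of candidate utility functions and two actions (all names and numbers below are hypothetical): the agent averages the candidate utilities under its prior and picks the best action, and the chosen action flips when the prior changes.

```python
def best_action(actions, utilities, prior):
    """Return the action maximizing prior-weighted expected utility."""
    # A proper prior must sum to 1 over the candidate utility functions.
    assert abs(sum(prior.values()) - 1.0) < 1e-9

    def expected_utility(action):
        return sum(prior[name] * u[action] for name, u in utilities.items())

    return max(actions, key=expected_utility)


actions = ["gather", "wait"]
# Two hypothetical candidate utility functions over the same actions.
utilities = {
    "u_resources": {"gather": 1.0, "wait": 0.0},
    "u_caution": {"gather": -1.0, "wait": 0.5},
}

choice_a = best_action(actions, utilities, {"u_resources": 0.9, "u_caution": 0.1})
choice_b = best_action(actions, utilities, {"u_resources": 0.1, "u_caution": 0.9})
```

Here `choice_a` is `"gather"` and `choice_b` is `"wait"`, so the behavior is determined by the prior rather than by "ignorance" as such.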

IAFF-User-111 11 Feb 2016 21:40 UTC
LW: 2 AF: 1
AF
in reply to: paulfchristiano’s comment on: Some work on connecting UDT and Reinforcement Learning
“But in the setting you described, the only impact of the policy is on the agent’s actions”

I don’t think so. P_M(\zeta | \pi) is meant to describe the distribution over trajectories given a policy (according to the model). Unless I’m missing something, the model could contain non-causal correlations.

IAFF-User-111 26 Aug 2016 20:51 UTC
0 points
AF
on: Learning values versus indifference
Why doesn’t normalizing rewards work?

(i.e. set max_pi(expected_returns)=1 and min_pi(expected_returns)=0, for all environments)… I assume this is what you’re talking about at the end?
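To be concrete about what I mean by normalizing, here is a rough sketch, assuming each environment comes with a finite set of policies whose expected returns are already known (the environments and numbers are made up):

```python
def normalize_returns(returns_by_policy):
    """Affinely rescale expected returns so the best policy gets 1 and the worst gets 0."""
    lo = min(returns_by_policy.values())
    hi = max(returns_by_policy.values())
    if hi == lo:
        raise ValueError("all policies have the same expected return")
    return {pi: (r - lo) / (hi - lo) for pi, r in returns_by_policy.items()}


# Hypothetical expected returns of three policies in two environments.
env_a = {"pi_1": -10.0, "pi_2": 30.0, "pi_3": 10.0}
env_b = {"pi_1": 2.0, "pi_2": 3.0, "pi_3": 2.5}

norm_a = normalize_returns(env_a)
norm_b = normalize_returns(env_b)
```

After rescaling, both environments have max 1 and min 0 over policies, so neither dominates just because its raw rewards are on a larger scale.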

IAFF-User-111 30 Aug 2016 0:10 UTC
0 points
AF
on: Learning values versus indifference
RE: my last question: after talking to Stuart, I think one way of viewing the problem with such a proposal is that the agent cares about its future expected utility (which depends on the state/history, not just the MDP).

IAFF-User-111 30 Aug 2016 18:17 UTC
LW: 3 AF: 2
AF
on: Simplified explanation of stratification
Abstractly, I think of this as adding a utility node, U, with no parents, and having the agent try to maximize the expected value of U.

I think there are some implicit assumptions (which seem reasonable for many situations, prima facie) about the agent’s ability to learn about U via some observations when taking null actions (i.e. A and U share some descendant(s), D, and A knows something about P(D | U, A=null)).

RE: the last bit, it seems like you can define learning from manipulating in a straightforward way similar to what is proposed here. The intuition is that the human’s belief about U should be collapsing around a point, u* (in the absence of interference by the AI), and the AI helps learning if it accelerates this process. If this is literally true, then we can just say that learning is accelerated (at time step t) if the probability H assigns to u* is higher given the agent’s action a than it would be given the null action, i.e.

P_H_t(u* | A_0 = a) > P_H_t(u* | A_0 = A_1 = … = null).
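A toy check of the criterion above, assuming the human is a Bayesian with two candidate values for U who sees a single piece of evidence whose likelihoods depend on whether the AI acted or stayed null. The candidates, likelihoods, and function names are all hypothetical.

```python
def posterior(prior, likelihoods):
    """Bayes update of the human's belief over candidate utilities."""
    unnorm = {u: prior[u] * likelihoods[u] for u in prior}
    z = sum(unnorm.values())
    return {u: w / z for u, w in unnorm.items()}


def accelerates_learning(prior, u_star, lik_given_action, lik_given_null):
    """True if H assigns more probability to u_star after the action than after null."""
    p_action = posterior(prior, lik_given_action)[u_star]
    p_null = posterior(prior, lik_given_null)[u_star]
    return p_action > p_null


prior = {"u_star": 0.5, "u_other": 0.5}
# Likelihood of the observed evidence under each candidate, per branch.
lik_given_action = {"u_star": 0.9, "u_other": 0.1}  # action makes evidence sharper
lik_given_null = {"u_star": 0.6, "u_other": 0.4}    # null action: weaker evidence

accelerated = accelerates_learning(prior, "u_star", lik_given_action, lik_given_null)
```

In this example the posterior on u* is 0.9 after the action versus 0.6 after the null action, so the action counts as accelerating learning.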

IAFF-User-111 30 Aug 2016 19:34 UTC
LW: 1 AF: 1
AF
on: Simplified explanation of stratification
So after talking w/Stuart, I guess what he means by “humans learning from the AI’s actions” is that the point that humans’ beliefs about U converge to actually changes (for the better). I’m not sure if that’s really desirable, atm.

On a separate note, my proposal has the practical issue that this agent only views its own potential influence on u* as undesirable (and not other agents’). So I think ultimately we want a richer set of counterfactuals, including, e.g., that humans continue to exist indefinitely (otherwise P_H_t becomes undefined when humanity is extinct).

IAFF-User-111 3 Jan 2017 3:19 UTC
0 points
AF
in reply to: jessicata’s comment on: My current take on the Paul-MIRI disagreement on alignability of messy AI
I don’t see this as being the case. As Vadim pointed out, we don’t even know what we mean by “aligned versions” of algos, ATM. So we wouldn’t know if we’re succeeding or failing (until it’s too late and we have a treacherous turn).

It looks to me like Wei Dai shares my views on “safety-performance trade-offs” (grep it here: http://graphitepublications.com/the-beginning-of-the-end-or-the-end-of-beginning-what-happens-when-ai-takes-over/).

I’d paraphrase what he’s said as:

“Orthogonality implies that alignment shouldn’t cost performance, but says nothing about the costs of ‘value loading’ (i.e. teaching an AI human values and verifying its value learning procedure and/or the values it has learned). Furthermore, value loading will probably be costly, because we don’t know how to do it, competitive dynamics make the opportunity cost of working on it large, and we don’t even have clear criteria for success.”

Which I emphatically agree with.

IAFF-User-111 3 Jan 2017 3:28 UTC
0 points
AF
in reply to: Wei Dai’s comment on: My current take on the Paul-MIRI disagreement on alignability of messy AI
I really agree with #2 (and I think with #1, as well, but I’m not as sure I understand your point there).

I’ve been trying to convince people that there will be strong trade-offs between safety and performance, and have been surprised that this doesn’t seem obvious to most… but I haven’t really considered that “efficient aligned AIs almost certainly exist as points in mindspace”. In fact I’m not sure I agree 100% (basically because of “Moloch” (http://slatestarcodex.com/2014/07/30/meditations-on-moloch/)).

I think “trying to find and pursue other approaches to solving the “AI risk” problem, especially ones that don’t require the same preconditions in order to succeed” remains perhaps the most important thing to do; do you have anything in particular in mind? Personally, I tend to think that we ought to address the coordination problem head-on and attempt a solution before AGI really “takes off”.

IAFF-User-111 3 Jan 2017 20:04 UTC
0 points
AF
on: My current take on the Paul-MIRI disagreement on alignability of messy AI
Points 5-9 seem to basically be saying: “We should work on understanding principles of intelligence so that we can make sure that AIs are thinking the same way as humans do; currently we lack this level of understanding”.

I don’t really understand point 10, especially this part:

“They would most likely generalize in an unaligned way, since the reasoning rules would likely be contained in some sub-agent (e.g. consider how Earth interpreted as an “agent” only got to the moon by going through reasoning rules implemented by humans, who have random-ish values; Paul’s post on the universal prior also demonstrates this).”

IAFF-User-111 5 Jan 2017 5:16 UTC
0 points
AF
in reply to: paulfchristiano’s comment on: My current take on the Paul-MIRI disagreement on alignability of messy AI
I think the core of our differences is that I see minimally constrained, opaque, utility-maximizing agents with good models of the world and access to rich interfaces (sensors and actuators) as extremely likely to be substantially more powerful than what we will be able to build if we start degrading any of these properties.

These properties also seem sufficient for a treacherous turn (in an unaligned AI).

IAFF-User-111 5 Jan 2017 5:26 UTC
0 points
AF
in reply to: jessicata’s comment on: My current take on the Paul-MIRI disagreement on alignability of messy AI
Thanks, I think I understand that part of the argument now. But I don’t understand how it relates to:

“10. We should expect simple reasoning rules to correctly generalize even for non-learning problems. ”

^Is that supposed to be a good thing or a bad thing? “Should expect” as in we want to find rules that do this, or as in rules will probably do this?