Yes, that’s basically what I mean. I think I’m trying to refer to the same issue that Paul mentioned here: https://www.lesswrong.com/posts/pZhDWxDmwzuSwLjou/asymptotically-benign-agi#ZWtTvMdL8zS9kLpfu
I like that you emphasize and discuss the need for the AI to not believe that it can influence the outside world, and cleanly distinguish this from it actually being able to influence the outside world. I wonder if you can get any of the benefits here without needing the box to actually work (i.e., can you just get the agent to believe it works, and is that enough for some form/degree of benignity?)
This doesn’t seem to address what I view as the heart of Joe’s comment. Quoting from the paper:
“Now we note that µ* is the fastest world-model for on-policy prediction, and it does not simulate post-episode events until it has read access to the random action”.
It seems like simulating *post-episode* events in particular would be useful for predicting the human’s responses, because they will be simulating post-episode events when they choose their actions. Intuitively, it seems like we *need* to simulate post-episode events to have any hope of guessing how the human will act. I guess the obvious response is that we can instead simulate the internal workings of the human in detail, and thus uncover their simulation of post-episode events (as a past event). That seems correct, but also a bit troubling (again, probably just for “revealed preferences” reasons, though).
Moreover, I think in practice we’ll want to use models that make good, but not perfect, predictions. That means we trade off accuracy against description length, and I think this makes modeling the outside world (instead of the human’s model of it) potentially more appealing, at least in some cases.
I’m calling this the “no grue assumption” (https://en.wikipedia.org/wiki/New_riddle_of_induction).
My concern here is that this assumption might be False, even in a strong sense of “There is no such U”.
Have you proven the existence of such a U? Do you agree it might not exist? It strikes me as potentially running up against issues of no-free-lunch (NFL) results / self-reference.
Also, it’s worth noting that this assumption (or rather, Lemma 3) also seems to preclude BoMAI optimizing anything *other* than revealed preferences (which others have noted seems problematic, although I think it’s definitely out of scope).
Still wrapping my head around the paper, but...
1) It seems too weak: In the motivating scenario of Figure 3, isn’t it the case that “what the operator inputs” and “what’s in the memory register after 1 year” are “historically distributed identically”?
2) It seems too strong: aren’t real-world features and/or world-models “dense”? Shouldn’t I be able to find features arbitrarily close to F*? If I can, doesn’t that break the assumption?
3) Also, I don’t understand what you mean by: “it’s on policy behavior [is described as] simulating X”. It seems like you (rather/also) want to say something like “associating reward with X”?
Just exposition-wise, I’d front-load pi^H and pi^* when you define pi^B, and also clarify then that pi^B considers human-exploration as part of its policy.
“This result is independently interesting as one solution to the problem of safe exploration with limited oversight in nonergodic environments, which [Amodei et al., 2016] discuss.”
^ This wasn’t super clear to me… maybe it should just be moved somewhere else in the text?
I’m not sure what you’re saying is interesting here. I guess it’s the same thing I found interesting, which is that you can get sufficient (and safe-as-a-human) exploration using the human-does-the-exploration scheme you propose. Is that what you mean to refer to?
Maybe “promotional of” would be a good phrase for this.
ETA: NVM, what you said is more descriptive (I just looked in the appendix).
RE footnote 2: maybe you want to say “monotonically increasing as a function of” rather than “proportional to”. (It’s a shame there doesn’t seem to be a shorter way of saying the first one, which seems to be more often what people actually want to say...)
I’m not sure. I was trying to disagree with your top level comment :P
FWICT, both of your points are actually responses to my point (3).
RE “re: #2”, see: https://en.wikipedia.org/wiki/Value_of_information#Characteristics
RE “re: #3”, my point was that it doesn’t seem like VoI is the correct way for one agent to think about informing ANOTHER agent. You could just look at the change in expected utility for the receiver after updating on some information, but I don’t like that way of defining it.
I think it is rivalrous.
Xrisk mitigation isn’t the resource; risky behavior is the resource. If you engage in more risky behavior, then I can’t engage in as much risky behavior without pushing us over into a socially unacceptable level of total risky behavior.
If there is a cost to reducing Xrisk (which I think is a reasonable assumption), then there will be an incentive to defect, i.e. to underinvest in reducing Xrisk. There’s still *some* incentive to prevent Xrisk, but to some people everyone dying is not much worse than just them dying.
1) Yep, independence.
2) Seems right as well.
3) I think it’s important to consider “risk per second”, because
(i) I think many AI systems could eventually become dangerous, just not on reasonable time-scales.
(ii) I think we might want to run AI systems which have the potential to become dangerous for limited periods of time.
(iii) If most of the risk is far in the future, we can hope to become more prepared in the meantime.
Whether or not this happens depends on the learning algorithm. Let’s assume an IID setting. Then an algorithm that evaluates many random parameter settings and chooses the one that gives the best performance would have this effect. But a gradient-based learning algorithm wouldn’t necessarily, since it only aims to improve its predictions locally (so what you say in the ETA is more accurate, **in this case**, I think).
Also, I just wanted to mention that Stuart Armstrong’s paper “Good and safe uses of AI oracles” discusses self-fulfilling prophecies as well; Stuart provides a way of training a predictor that won’t fall victim to such effects (just don’t reveal its predictions when training). But then it also fails to account for the effect its predictions actually have, which can be a source of irreducible error… The example is a (future) stock-price predictor: making its predictions public makes them self-refuting to some extent, as they influence market actors’ decisions.
I dunno… I think describing them as a tragedy of the commons can help people understand why the problems are challenging and deserving of attention.
RE Sarah: Longer timelines don’t change the picture that much, in my mind. I don’t find this article to be addressing the core concerns. Can you recommend one that’s more focused on “why AI-Xrisk isn’t the most important thing in the world”?
RE Robin Hanson: I don’t really know much of what he thinks, but IIRC his “urgency of AI depends on FOOM” was not compelling.
What I’ve noticed is that critics are often working from very different starting points, e.g. being unwilling to estimate probabilities of future events, using common-sense rather than consequentialist ethics, etc.